Having the raw analytics data in a decentralized data store is great: it’s an open, permissionless, user-owned community asset available to everyone, much like blockchain data.
But, also like blockchain data, the raw format is difficult to derive insights from or build dashboards on. You need a way to load it into traditional data stores for processing.
To address this, we created an indexer for the data: an automated pipeline built on the Airbyte open-source ELT platform that pushes normalized data directly into an S3 data lake. Our source connector continuously monitors the blockchain and Ceramic for new apps, users, and data to index.
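To give a feel for the normalization step, here is a minimal sketch in Python. The field names and event shape are hypothetical, not the actual Ceramic schema; the point is just how a nested raw event gets flattened into a row suitable for a columnar store.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw event as it might appear in the decentralized store
# (field names are illustrative, not the real schema).
raw_event = json.dumps({
    "app_id": "app-123",
    "user": {"did": "did:key:z6Mk...", "wallet": "0xabc"},
    "event": {"name": "page_view", "ts": 1700000000, "props": {"path": "/home"}},
})

def normalize(raw: str) -> dict:
    """Flatten one nested raw event into a row for the data lake."""
    doc = json.loads(raw)
    return {
        "app_id": doc["app_id"],
        "user_did": doc["user"]["did"],
        "event_name": doc["event"]["name"],
        # Store timestamps as ISO-8601 UTC strings for easy querying.
        "event_time": datetime.fromtimestamp(
            doc["event"]["ts"], tz=timezone.utc
        ).isoformat(),
        # Keep arbitrary properties as a JSON string column.
        "props": json.dumps(doc["event"]["props"]),
    }

row = normalize(raw_event)
```

A real connector would emit batches of rows like this, which the pipeline then writes out as Parquet files.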
For now, data is stored in an S3 data lake in Apache Parquet format and queried via AWS Athena. Apache Spark also reads Parquet from S3, giving us another option as we scale. That said, Ceramic is working on a GraphQL interface; once it has robust indexing and enough performance to support analytics queries, we will pull data directly from Ceramic instead of using S3.
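As a sketch of what querying the lake looks like, here is the kind of SQL we might submit to Athena, built in Python. The table and column names (`analytics.events`, `user_did`, `event_time`) are hypothetical placeholders, not our actual schema.

```python
# Build a daily-active-users query over a hypothetical events table.
# Athena reads the underlying Parquet files directly from S3.
def daily_active_users_query(table: str, day: str) -> str:
    return (
        f"SELECT app_id, COUNT(DISTINCT user_did) AS dau "
        f"FROM {table} "
        f"WHERE date(event_time) = date('{day}') "
        f"GROUP BY app_id"
    )

sql = daily_active_users_query("analytics.events", "2023-01-01")
```

A query like this would be submitted through the Athena console or API; the same table could be read by Spark without any change to the data layout.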