What does a world class analytics stack look like in 2019?
Over the past few years we've seen the technology and tools for building an analytics stack change dramatically. Best practices are changing at an ever increasing pace, but thankfully it's now easier than ever to build a world-class stack without massive investment. We break down what we see as the most effective and scalable approach to building a data stack in 2019.
The majority of my data & analytics experience has been at Google. 7 years there taught me a lot, but my understanding didn’t go much deeper than the warehouse. The data I worked with was prepared by engineering teams with whom I had little interaction, and the data tools I used were in house.
Then, 2 years ago, I left Google to set up the data team at a 25 person startup. When I joined, the analytics was all done using the production MySQL DB, but 18 months later:
I’ve since joined Dataform, and had the opportunity to learn how many other companies have set up their data stacks.
Data warehouse systems, of all the components of a modern data stack, have seen the most significant improvements over the last few years. Highly scalable, managed cloud data warehouses enable you to transform TBs of data with just a few lines of SQL and no infrastructure. And on demand pricing means the technology is affordable for everyone, with only a few minutes of set up time. This has implications throughout the data stack:
Google BigQuery and Snowflake are currently the best offerings, with low storage costs, high processing speeds and on demand pricing. Amazon Redshift is also worth consideration. Tools like Intermix can be used to optimize your Redshift warehouse for performance.
Before the cloud data warehouse, data pipelines followed the ETL process: Extract the data from underlying systems, transform it into a useable format, load into your analytics database. The cloud data warehouse has enabled teams to push the transformation step later, so teams now follow an ELT process.
In fact, I like to break the data pipeline into four stages: collect, load, transform, analyze.
The advantages of moving the transformation step into the warehouse are significant:
At a minimum, there are two types of data that need to be tracked:
Segment and Snowplow are popular choices that make it easy for engineering teams to track important user events.
There are many more sources of data (Google Analytics, Shopify, Adwords) available to companies today. The more sources that can be integrated into the data stack, the better.
As well as event data and transactional data, your company may have access to many more sources of data: Google Analytics, Shopify, Ad spend data, Intercom data etc. Loading each of these sources into your cloud warehouse will enable your team to perform comprehensive analytics.
Stitch and Fivetran are two popular services that make loading hundreds of data types easy. It can take as little as an hour to set up scheduled ingestion jobs, and you should be able to avoid needing to write any code.
To make data usable, it first needs to be transformed into clean, descriptive, reliable and easy to query datasets.
The transformation layer is where the uniqueness and complexity of your business is introduced. Exactly how to turn raw data into datasets that make sense for your business requires skill, and your transformation rules are going to change over time. There is always more data being generated, and so the datasets need to be continuously updated and refreshed. For these reasons and more, the best data teams are now incorporating software engineering best practices into their transformations: testing, error reporting, version control, code reviews, and alerting.
Until recently, Airflow and Luigi were popular choices to manage transformations. However these services require teams to maintain the supporting infrastructure, which turns out to be time consuming, challenging, and ultimately frustrating.
Dataform offers a fully managed development environment and is one of the first products to have been built with this complex transformation layer in mind. Software engineering principles like version control and testing are built directly into the product, ensuring reliability and trust in the data pipelines. If you’d like to learn more, sign up here.
Data is at the heart of decision making in modern business. To enable the business to be truly data driven, the data analysis solutions can’t rely on analysts manually generating reports, because this doesn’t scale. Transforming raw data into meaningful, easy to use and understandable datasets is the first step. The best companies are backing that up with BI solutions that allow everyone to self-serve their data requests. Employees should know where to find the data they need, be able to quickly access it, and then trust the data and insights they find.
Looker is an example of a BI tool built with this contemporary approach to data in mind, and we see it being used by a large share of companies with an advanced approach to data.
The modern data stack is centered on a powerful data warehouse. Data is loaded directly into the warehouse. A robust and reliable transformation layer is used to turn that raw data into dependable and meaningful datasets. And the best BI solutions are now designed to enable everyone in the organisation to use these datasets to pursue insights and find answers to their questions.
If you're in the process of setting up your stack or looking to make improvements to your current setup, I'd love to talk. Reach me at firstname.lastname@example.org.
We publish great new resources every week, get them straight to your inbox.
Turn on BigQuery audit log exports to start analysing your BigQuery usage
Data warehousing technologies are advancing fast. The cloud data warehousing revolution means more and more companies are moving away from an ETL approach and towards an ELT approach for managing analytical data.
A deep dive into some advanced data quality testing use cases with SQL and the open-source Dataform framework.