Over the past few years we've seen the technology and tools for building a data stack change dramatically. Best practices are changing at an ever increasing pace, but thankfully it's now easier than ever to build a world-class stack without massive investment. We break down what we see as the most effective and scalable approach to building a data stack in 2019.
The majority of my data & analytics experience has been at Google. 7 years there taught me a lot, but my understanding didn’t go much deeper than the warehouse. The data I worked with was prepared by engineering teams with whom I had little interaction, and the data tools I used were in house.
Then, 2 years ago, I left Google to set up the data team at a 25 person startup. When I joined, the analytics was all done using the production MySQL DB, but 18 months later:
- The data team had grown to 7
- Our BigQuery data warehouse had more than 200 modelled tables
- The warehouse was updated every hour on a schedule that included automated testing of the output data
- The SQL to build our data models was version controlled
- We had 90% weekly adoption of Looker across the company
I’ve since joined Dataform, and had the opportunity to learn how many other companies have set up their data stacks.
The cloud first data warehouse
Data warehouse systems, of all the components of a modern data stack, have seen the most significant improvements over the last few years. Highly scalable, managed cloud data warehouses enable you to transform TBs of data with just a few lines of SQL and no infrastructure. And on demand pricing means the technology is affordable for everyone, with only a few minutes of set up time. This has implications throughout the data stack:
- Most companies don’t need to worry about the costs of storing data
- Most data preparation and transformation can be run in the warehouse, using SQL
- BI and reporting tools can serve from the warehouse, in real time
Google BigQuery and Snowflake are currently the best offerings, with low storage costs, high processing speeds and on demand pricing. Amazon Redshift is also worth consideration. Tools like Intermix can be used to optimize your Redshift warehouse for performance.
The rise of ELT
Before the cloud data warehouse, data pipelines followed the ETL process: Extract the data from underlying systems, transform it into a useable format, load into your analytics database. The cloud data warehouse has enabled teams to push the transformation step later, so teams now follow an ELT process.
In fact, I like to break the data pipeline into four stages: collect, load, transform, analyze.
The advantages of moving the transformation step into the warehouse are significant:
- Business logic can be removed from the extraction and loading step. This allows teams to use plug and play 3rd party providers to collect and then load their data with minimal resource investment
- Business logic can be defined in SQL. Rather than having engineering teams defining transformations in Java, Python, Scala etc, the analyst that work closely with business teams can own and manage this logic
At a minimum, there are two types of data that need to be tracked:
- Event data: What are your users doing when they’re interacting with your product?
- Transactional data history: To enable advanced analytics, you need to see not only the current state of the system, but also how it’s changed over time. Ideally, you should be creating a new event for every change (insert, edit, delete) to your transactional database. If this isn’t possible, regular snapshots of the DB are a minimum requirement.
Segment and Snowplow are popular choices that make it easy for engineering teams to track important user events.
There are many more sources of data (Google Analytics, Shopify, Adwords) available to companies today. The more sources that can be integrated into the data stack, the better.
As well as event data and transactional data, your company may have access to many more sources of data: Google Analytics, Shopify, Ad spend data, Intercom data etc. Loading each of these sources into your cloud warehouse will enable your team to perform comprehensive analytics.
Stitch and Fivetran are two popular services that make loading hundreds of data types easy. It can take as little as an hour to set up scheduled ingestion jobs, and you should be able to avoid needing to write any code.
To make data usable, it first needs to be transformed into clean, descriptive, reliable and easy to query datasets.
The transformation layer is where the uniqueness and complexity of your business is introduced. Exactly how to turn raw data into datasets that make sense for your business requires skill, and your transformation rules are going to change over time. There is always more data being generated, and so the datasets need to be continuously updated and refreshed. For these reasons and more, the best data teams are now incorporating software engineering best practices into their transformations: testing, error reporting, version control, code reviews, and alerting.
Until recently, Airflow and Luigi were popular choices to manage transformations. However these services require teams to maintain the supporting infrastructure, which turns out to be time consuming, challenging, and ultimately frustrating.
Dataform offers a fully managed development environment and is one of the first products to have been built with this complex transformation layer in mind. Software engineering principles like version control and testing are built directly into the product, ensuring reliability and trust in the data pipelines. If you’d like to learn more, sign up here.
Data is at the heart of decision making in modern business. To enable the business to be truly data driven, the data analysis solutions can’t rely on analysts manually generating reports, because this doesn’t scale. Transforming raw data into meaningful, easy to use and understandable datasets is the first step. The best companies are backing that up with BI solutions that allow everyone to self-serve their data requests. Employees should know where to find the data they need, be able to quickly access it, and then trust the data and insights they find.
Looker is an example of a BI tool built with this contemporary approach to data in mind, and we see it being used by a large share of companies with an advanced approach to data.
The modern data stack is centered on a powerful data warehouse. Data is loaded directly into the warehouse. A robust and reliable transformation layer is used to turn that raw data into dependable and meaningful datasets. And the best BI solutions are now designed to enable everyone in the organisation to use these datasets to pursue insights and find answers to their questions.
If you're in the process of setting up your stack or looking to make improvements to your current setup, I'd love to talk. Reach me at firstname.lastname@example.org.