We recently launched the Dataform open-source SDK - a framework to help data teams manage modern cloud data warehouses such as Google BigQuery, Amazon Redshift and Snowflake. Dataform makes it easy to develop, test, and deploy SQL based data workflows to your warehouse without having to build or maintain custom infrastructure.
In this post we want to provide a bit more information on the what, the why, and the how of the framework. If you’d instead like to jump in and start playing with Dataform, check out our documentation and the project on GitHub.
Why we built Dataform
Our goal for the Dataform framework is to enable data analysts, software engineers and data engineers to build a world-class data stack.
We believe that to manage the exponential growth in data complexity data teams must invest in data modeling - the process of producing core datasets that enable teams quickly answer common business questions. These datasets form a single source of truth for data across the entire organization, allowing questions to be answered quickly and consistently, without having to start from scratch every time.
We certainly aren’t the first to realize that a solution is needed to this problem, Other frameworks and tools exist to support the data modeling process, many of which we have used ourselves. However over the past year as we started to understand our customers needs more - we started to develop our own solution.
What follows is an overview of why we built Dataform, how it works, the design goals of the project, and finally some of our plans for the future.
Why do you need a data modeling framework?
After speaking to many companies over the past year, we identified a number of common patterns in data teams processes:
- Data teams were often stuck in reactive mode. A constant influx of critical business requests stopped them investing in long term improvements to their architecture.
- Data teams didn’t have the headcount to build and maintain their own data modeling infrastructure or weren’t able to hire data engineers to achieve this.
- Even with headcount and a fully staffed team, data teams would spend significant time building and deploying an internal solution for managing data pipelines - often Airflow or custom code to manage dependency graphs, scheduling, notifications, and so on.
We believe that Dataform will enable data-driven companies to take their build a world-class data stack at a significantly reduced cost and in a fraction of the time, and help them avoid having to reinvent the wheel.
What does Dataform do?
At a high level, Dataform’s framework makes it easy to adopt engineering best practices within your data team - version control data definitions, write tests, share reusable code, and deploy and develop to different environments.
Dataform’s API and CLI is built around the concept of actions which could be any of the following:
- Publish new datasets (or add to existing ones)
- Test a dataset for quality issues
- Execute arbitrary SQL statements
Dataform automatically maintains a DAG of actions like most data workflow tools do - to make sure all of your actions run in the correct order. For example, making sure a data quality test runs only after the corresponding dataset has been created.
How does it work?
- You create SQL files in a Dataform project folder, which can be enriched using Dataform’s templating syntax.
- Files and templates get compiled into a graph (DAG) of actions to be executed.
- Actions are executed against your data warehouse in the correct order.
- Dataform returns a log of every single action and statement that was executed, including whether they succeeded or not.
Dataform’s APIs make it easy to perform these common tasks and abstracts away much of the complexity for you. To publish a new dataset: create a new SQL file, write a
SELECT query, and Dataform will manage the rest. Writing a data quality check is similar: write a query to look for bad data and Dataform will run the query, check the results, and notify you of any problems.
When we first started developing Dataform we had a number of key design requirements that we wanted to work towards. We haven’t managed all of these just yet, but where we haven’t, we have a plan!
Data definitions should be version controlled
This is perhaps one of the biggest differences between Dataform and most other data products. Everything you do in Dataform lives in a file on disk and can be version controlled. Software is never built outside of a version control system; we don’t think your data stack should be treated any differently.
Data transformations and outputs should be reproducible
If you somehow delete or lose your Dataform datasets it should be fairly easy to recreate them from the original data sources. This is hard to achieve in practice, but there are a number of things Dataform does to help you maintain reproducibility.
Dataform executes project compilation inside a sandbox, making it much easier to reason about how your code is (or isn’t) working and helps you avoid common mistakes or pitfalls. This also means Dataform projects can’t use external inputs such as network calls to effect the transformation graph, and is key to enabling our next design goal - speed.
Lightweight and fast APIs
No one likes a slow build process. This problem is exacerbated when defining datasets, where the work is often iterative - we repeatedly make small edits to queries, and want to inspect their output immediately.
Dataform is able to compile projects containing 200+ data warehouse operations and tens of thousands of lines of SQL in well under a second.
Data should never leave your warehouse
Dataform is an orchestrator, not a data processor. Although your configuration is stored in Dataform, your actual data will never leave your warehouse. You maintain complete control of your data.
Configuration as code (if you want to)
Dataform’s core framework is a JS/TS API, which means that you can do anything our CLI or web platform does entirely through code. Today, many of our customers use this functionality to manage advanced use cases such as GDPR cleanup.
Extensibility and reusability
Software engineers generally avoid writing the same code or functions more than once - code should be reusable. In the data world this is often not the case. Extending on top of existing SQL queries often starts with a copy-paste. Dataform makes this easy to avoid through the ability to define includes, reusable SQL snippets, and even NPM packages that can be reused across multiple queries or even projects.
Dataform is a young project but is already being used by several businesses to manage their production data stack. Over the coming months we intend to improve the framework in a number of areas.
Defining a consistent set of core tables for your organization is great, but they also need to be discoverable. This includes dataset documentation, better support for access control, and dataset search functionality. We’ll be launching a standalone data catalog product in the coming months - stay tuned!
Currently Dataform consists of a CLI and set of APIs. We plan on bringing much more of our Dataform web platform into the open source project. This will enable organizations to run a Dataform server privately, configure schedules, and view run logs in a graphical web UI.
More data warehouse support
We started with the most commonly used cloud data warehouses: BigQuery, Redshift, and Snowflake. Over the next few months we plan to add support for Azure data warehouse and Athena / Presto based warehouses. If you have a request to support a specific warehouse type please let us know!
We want to help data teams avoid reinventing the wheel, and have a number of packages in development that will be ready to use within a Dataform project to help you with common tasks such as:
- Analyzing A/B experiments
- Processing common data sources such as Segment and Magento
- Managing GDPR compliance and data anonymization tasks
- Providing a number of ready to integrate datasets such as FX rates