Product Update

Cut data warehouse costs with run caching

How to save time and money by using our run caching feature

Ben Birt on September 23, 2020

As we've mentioned before, one of the core design goals of Dataform is to make project compilation hermetic. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).

Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our "run caching" feature.

Don't waste time and money re-computing the same data

Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.

Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.
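To put rough numbers on the hourly example, here is a small sketch. The schedule and the input refresh times are made-up figures chosen purely for illustration, and `count_redundant_runs` is a hypothetical helper, not part of Dataform.

```python
def count_redundant_runs(run_hours, refresh_hours):
    """Count scheduled runs that see no new input since the previous run."""
    redundant = 0
    last_seen = object()  # sentinel: the first run is never counted as redundant
    for hour in run_hours:
        # Most recent input refresh at or before this run (None if there is none yet).
        latest = max((r for r in refresh_hours if r <= hour), default=None)
        if latest == last_seen:
            redundant += 1      # same input as last time, so the output cannot change
        else:
            last_seen = latest  # this run processes new input
    return redundant


# Hypothetical figures: the pipeline runs every hour, the input is refreshed
# four times a day.
run_hours = range(24)
input_refresh_hours = {0, 6, 12, 18}

print(count_redundant_runs(run_hours, input_refresh_hours))  # 20
```

Under these assumptions, 20 of the 24 daily runs recompute data that has not changed.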

Instead, we believe the pipeline should automatically detect when a run would not change the output data and, in that case, skip the affected stage(s), saving those resources.

We've built this feature into Dataform.

Run caching in Dataform

Try out an example project with run caching here!

You can turn run caching on in your project with a few small changes which are described here. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.

For example, consider the following SQLX file, which configures Dataform to publish a table age_count containing the transformed results of a query reading a people relation:

config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age

Dataform only needs to (re-)publish this table if any of the following conditions are true:

- The output table age_count doesn't exist
- The output table age_count has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)
- The query has changed since the last time the age_count table was published
- The input table people has changed since the last time the age_count table was published (or, if people is a view, then if any of the input(s) to people have changed)

Dataform uses these rules to decide whether or not to publish the table. If none of the conditions is met, i.e. re-publishing the table would result in no change to the output table, then this action is skipped.
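The four conditions can be sketched as a single predicate. This is a minimal illustration, not Dataform's actual implementation: `Relation`, `CachedRun`, `underlying_tables`, `should_publish` and the `last_modified` field are hypothetical names, and the sketch assumes the warehouse exposes a last-modified value for each relation and that a record of the previous publish is stored somewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Relation:
    """Hypothetical warehouse metadata for one table or view."""
    last_modified: int                                # e.g. a last-modified timestamp
    is_view: bool = False
    inputs: List[str] = field(default_factory=list)   # relations a view reads from


@dataclass
class CachedRun:
    """Hypothetical record saved after the last successful publish."""
    query: str
    output_last_modified: int
    input_last_modified: Dict[str, int]


def underlying_tables(name: str, warehouse: Dict[str, Relation]) -> Set[str]:
    """Follow views down to the tables they ultimately read from."""
    relation = warehouse.get(name)
    if relation is None or not relation.is_view:
        return {name}
    tables: Set[str] = set()
    for upstream in relation.inputs:
        tables |= underlying_tables(upstream, warehouse)
    return tables


def should_publish(output: str, query: str, inputs: List[str],
                   warehouse: Dict[str, Relation],
                   cached: Optional[CachedRun]) -> bool:
    if cached is None:
        return True  # no record of a previous publish, so nothing is provably unchanged
    current_output = warehouse.get(output)
    if current_output is None:
        return True  # 1. the output table doesn't exist
    if current_output.last_modified != cached.output_last_modified:
        return True  # 2. the output was modified by something else
    if query != cached.query:
        return True  # 3. the query has changed
    for declared in inputs:
        for table in underlying_tables(declared, warehouse):
            current = warehouse.get(table)
            if current is None or current.last_modified != cached.input_last_modified.get(table):
                return True  # 4. an input (or an input of an input view) has changed
    return False  # re-publishing would change nothing, so the action can be skipped


QUERY = "select age, count(1) from people group by age"
warehouse = {
    "people": Relation(last_modified=100),
    "age_count": Relation(last_modified=200),
}
cached = CachedRun(QUERY, output_last_modified=200, input_last_modified={"people": 100})

print(should_publish("age_count", QUERY, ["people"], warehouse, cached))  # False: skip

warehouse["people"].last_modified = 300  # new rows were loaded into people
print(should_publish("age_count", QUERY, ["people"], warehouse, cached))  # True: re-publish
```

Because a skipped action leaves its output table untouched, anything downstream that reads only that table would also see unchanged inputs under this scheme, so it could be skipped in turn.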

Building in intelligence so you don't have to

At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.

This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.

If you'd like to learn more, the Dataform framework documentation is here. Join us on Slack and let us know what you think!
