Product Update

Cut data warehouse costs with run caching

How to save time and money by using our run caching feature


Ben Birt on September 23, 2020

As we've mentioned before, one of the core design goals of Dataform is to make project compilation hermetic. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).

Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our "run caching" feature.

Don't waste time and money re-computing the same data

Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.

Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.

Instead, we believe the pipeline should automatically detect when an execution would not change the output data, and skip the affected stage(s), saving those resources.

We've built this feature into Dataform.

Run caching in Dataform

Try out an example project with run caching here!

You can turn run caching on in your project with a few small changes which are described here. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.
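As a sketch of what those changes looked like (the `useRunCache` flag name is drawn from the legacy Dataform framework's configuration; check the linked documentation for the exact, current option), enabling run caching was a one-line addition to your project's `dataform.json`:

```json
{
  "warehouse": "bigquery",
  "defaultSchema": "dataform",
  "useRunCache": true
}
```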

For example, consider the following SQLX file, which configures Dataform to publish a table age_count containing the transformed results of a query reading a people relation:

config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age

Dataform only needs to (re-)publish this table if any of the following conditions are true:

- The output table age_count doesn't exist
- The output table age_count has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)
- The query has changed since the last time the age_count table was published
- The input table people has changed since the last time the age_count table was published (or, if people is a view, if any of the input(s) to people have changed)
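The decision rules above can be sketched as a small function. This is a hypothetical illustration only, not Dataform's actual implementation; all of the field names on `meta` are assumptions standing in for warehouse metadata and Dataform's record of the previous run:

```javascript
// Hypothetical sketch of the run-caching decision for a single action.
// In practice this information would come from warehouse table metadata
// and Dataform's record of the previous run; none of these names are
// real Dataform APIs.
function shouldRepublish(meta) {
  if (!meta.outputExists) {
    return true; // the output table doesn't exist yet
  }
  if (meta.outputModifiedOutsideDataform) {
    return true; // something other than Dataform changed the output
  }
  if (meta.queryHash !== meta.lastRunQueryHash) {
    return true; // the query itself has changed
  }
  if (meta.inputsChangedSinceLastRun) {
    return true; // upstream data (or upstream views' inputs) changed
  }
  return false; // nothing can have changed: skip this action
}
```

If every check returns false, re-running the query could not change the output, so the action is safely skipped.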

Dataform uses these rules to decide whether or not to publish the table. If none of the conditions holds, i.e. re-publishing the table would result in no change to the output table, then this action is skipped.

Building in intelligence so you don't have to

At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.

This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.

If you'd like to learn more, the Dataform framework documentation is here. Join us on Slack and let us know what you think!
