
How to add a new warehouse adapter to Dataform

I recently wrote a Presto adapter for Dataform during a week long quarterly hackathon. Here's a guide on how I'd do it again.


Elias Kassell on November 4, 2020

Introduction

I recently wrote a Presto adapter for Dataform during a week-long quarterly hackathon. To do this, I had to touch a lot of code paths in our open source repository. I would have really struggled (and still found it difficult, regardless) without in-depth knowledge of the code base. To help other contributors with similar endeavours, or readers just generally interested in how Dataform warehouse adapters work, I figured I’d outline the process and explain how the moving parts fit together.

Dataform is built to be extensible so that it can work with different warehouses. There’s a good deal of complexity involved because of this, so in parts of this blog post I’ll simplify and give a more high-level view.

I’ll also run through specifically how I added the adapter for Presto. If you’re just interested in the code without any context, you can check out this pull request for a Presto adapter. However, the context here should be useful, as I’ll cover some core concepts of Dataform (and hopefully prevent you from making the same mistakes I made)!

Dataform structure

To add a new adapter, you’ll need to first get an understanding of how Dataform’s code base is structured and why it is structured the way it is.

Dataform core

This is available on npm as @dataform/core, and translates Dataform features into SQL specific to a given warehouse.

The most important concept in the core package is the compiled graph, and how it is composed of actions. Actions are steps in the compiled graph, which is produced when a project’s SQLX, configuration and JavaScript files are evaluated into a more structured format. This structure can be visualized for any project as Dataform’s dependency tree. Each node in the dependency tree can be a view, table, incremental table, operation, declaration or assertion.

Actions are uniquely identified by Targets. Actions that output warehouse relations, such as tables and views, use their Target as the relation’s name. Warehouses generally follow the database.schema.table format, so that is the information encoded in a Target. For actions that don’t directly map to a warehouse relation, such as operations, the dataset is referred to by the SQL within each of the operations in the operation file.

Actions and targets in relation to the warehouse
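To make these shapes concrete, here’s a simplified sketch of what a target and a compiled action roughly look like. The real definitions live in Dataform’s protos and carry much more information, so treat the field names below as illustrative:

```typescript
// A simplified, illustrative sketch of the shapes involved; the real
// definitions live in Dataform's protos and carry much more information.
interface Target {
  database?: string; // e.g. a BigQuery project, or a Presto catalog
  schema: string;    // e.g. a dataset or schema
  name: string;      // the relation (table/view) name
}

interface CompiledAction {
  target: Target; // uniquely identifies the action
  type: "view" | "table" | "incremental" | "operations" | "declaration" | "assertion";
  query?: string;               // the SQL this action will run, where applicable
  dependencyTargets?: Target[]; // edges in the dependency tree
}
```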

To create a core adapter for a warehouse, you’ll need to understand the purpose of the different actions, and be able to generate the SQL that produces each Dataform action’s desired effect in the new warehouse. The core adapter will need to implement IAdapter, found in core/adapters/index.ts.
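As a rough illustration of what that involves, here’s a heavily simplified sketch of a core adapter. The method names are my own shorthand rather than the exact IAdapter surface, so check core/adapters/index.ts for the real contract:

```typescript
// core/adapters/presto.ts: a heavily simplified, illustrative sketch.
// The real IAdapter in core/adapters/index.ts defines the exact methods required.
export class PrestoAdapter {
  // Turn a Target into a fully qualified, quoted relation name.
  public resolveTarget(target: { database?: string; schema: string; name: string }): string {
    return [target.database, target.schema, target.name]
      .filter(part => !!part)
      .map(part => `"${part}"`)
      .join(".");
  }

  // Produce the SQL that materializes a view or table from its query.
  public publishSql(
    type: "view" | "table",
    target: { database?: string; schema: string; name: string },
    query: string
  ): string {
    const name = this.resolveTarget(target);
    return type === "view"
      ? `create or replace view ${name} as ${query}`
      : `create table ${name} as ${query}`;
  }
}
```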

Dataform API

This is available on npm as @dataform/api, and facilitates connecting to different warehouses. How the connection is implemented can vary hugely between warehouses, but every implementation must conform to IDbAdapter, found in api/dbadapters/index.ts.

Some warehouse-specific SQL lives here too, but only SQL that has no relevance to generating a compiled graph: for example, the SQL required to preview the result of running a statement (in Presto’s case, using the EXPLAIN prefix), or the SQL to list schemas within a database.
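In practice, a database adapter ends up as a thin wrapper around whichever client library you use to talk to the warehouse. The sketch below is illustrative only; the method names are not the exact IDbAdapter contract, which lives in api/dbadapters/index.ts:

```typescript
// api/dbadapters/presto.ts: an illustrative sketch, not the exact IDbAdapter contract.
export class PrestoDbAdapter {
  constructor(private readonly credentials: { host: string; port: number; user: string }) {}

  // Run a statement against the warehouse via the client library (we used
  // presto-client) and collect its rows. The body is elided in this sketch.
  public async execute(statement: string): Promise<any[][]> {
    throw new Error("not implemented in this sketch");
  }

  // List the schemas within a database (a catalog, in Presto's case).
  public async schemas(database: string): Promise<string[]> {
    const rows = await this.execute(`show schemas from ${database}`);
    return rows.map(row => String(row[0]));
  }
}
```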

Integration tests

Reliable integration tests are essential for providing stability to end users. Each warehouse adapter has an associated integration test, which runs a test project either directly against the warehouse (for example BigQuery and Snowflake) or against a local image running in Docker (for example Postgres and Presto). A lot of the difficulty of setting up a fixture has been abstracted away, but you’ll still need to add connection settings, new secrets, and/or a build rule if testing against a local image.

Dataform CLI

This is available on npm as @dataform/cli. It’s essentially just a wrapper around the API, and can be used to compile and run Dataform projects locally. You won’t need to touch it when adding a new warehouse adapter.

The Dataform web app

The web UI. This is currently closed source, but when adapters are added to the open source repository, we’ll add them to the web UI pretty quickly once the vetting process is complete.

Steps for creating an adapter

With the knowledge of how Dataform is structured, it’s now time to take a closer look at the warehouse you’re going to be integrating. Here are some things to think about that tend to differ between warehouses.

If you’re building an adapter while reading this, you’ll need to familiarize yourself with the development environment. Our contributing guide explains how to set it up, along with some useful Bazel commands.

1. Get to know the warehouse you’re designing for, and add support for it to @dataform/core.

This is going to matter for every subsequent step, so some fundamental understanding of what makes the warehouse different is important. This logic will need to go in a new core/adapters/[new_warehouse].ts file.

Presto is designed to connect to lots of different data sources and provide a general query language across all of them (Presto has its own SQL dialect: https://prestodb.io/docs/current/sql.html). Developed at Facebook from 2012 and subsequently open sourced, it was quickly adopted by Netflix, and it’s now used by many companies dealing with large quantities of data spread across different databases and warehouses. Put simply, it achieves cross-data-source communication by splitting SQL queries into tasks which run across a deployed Presto cluster. The tasks are split up further, retrieving data in small portions from each connection, before eventually bubbling back up to return a response.

How will (Dataform) targets be made?

Presto fits this naturally with catalog.schema.table, where a catalog (a configured connection to an underlying data source) plays the role of the database.

How will incremental tables work?

In Dataform’s other adapters, we tend to use a merge statement to insert new rows into incremental tables as they become available. Presto doesn’t support merge, however; one proposed solution is to form a temporary table using an in-memory connection.
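As a rough illustration of the alternative, here’s a hedged sketch of insert-based incremental SQL for a warehouse without merge support; it’s not the exact SQL the adapter emits:

```typescript
// A hedged sketch of insert-based incremental SQL for a warehouse without MERGE;
// the real adapter logic is more involved than this helper.
function incrementalInsertSql(
  target: string,     // fully qualified table name, e.g. "catalog"."schema"."table"
  query: string,      // the incremental SELECT defined in the SQLX file
  uniqueKey?: string  // optional key used to avoid re-inserting existing rows
): string {
  if (!uniqueKey) {
    // Simplest case: append every new row produced by the incremental query.
    return `insert into ${target} select * from (${query}) as incoming`;
  }
  // Only insert rows whose key isn't already present in the target table.
  return `insert into ${target}
select * from (${query}) as incoming
where incoming.${uniqueKey} not in (select ${uniqueKey} from ${target})`;
}
```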

Can statements be joined to preserve context between Dataform’s pre-operations and post-operations?

In Presto, yes, with a semicolon.

How will assertions be run?

Presto creates a view (similar to other adapters), then uses a row count check to verify its success.

What warehouse-specific options should be available in the config block of SQLX files? These are added in the table.ts file, and are visible in the Dataform documentation.

For now, just a partitionBy option has been added for Presto.
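Warehouse-specific options like this sit alongside the other table configuration types. Below is a simplified sketch of the kind of addition involved; the interface and field names are assumptions for illustration, not the exact ones in table.ts:

```typescript
// An illustrative sketch of adding a warehouse-specific config option in core/table.ts.
// The real config interfaces are richer; this only shows where such an option sits.
export interface IPrestoOptions {
  // Column (or expression) used to partition the created table.
  partitionBy?: string;
}

export interface ITableConfig {
  type?: "table" | "view" | "incremental";
  // ... options shared across warehouses ...
  presto?: IPrestoOptions;
}
```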

2. Decide how you’re going to connect to the warehouse, and add support to @dataform/api.

In most circumstances, there will be a Node package for connecting to the warehouse via REST. This logic will need to be put in a new api/dbadapters/[new_warehouse].ts file. You’ll also need to add a connection profile to protos/profiles.proto for storing connection details.

We used the presto-client package, and tweaked it slightly to fit our use case.

Can queries be run in parallel? This can speed up execution for the end user.

Yes, with Presto we used the promise-pool-executor package.
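The snippet below is a hand-rolled sketch of the same idea (a fixed-size pool of in-flight queries) rather than the promise-pool-executor API itself:

```typescript
// A hand-rolled sketch of concurrency-limited query execution; the real adapter
// uses the promise-pool-executor package rather than this helper.
async function runWithConcurrencyLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  // Start `limit` workers, each pulling the next task until none remain.
  const workers = new Array(Math.min(limit, tasks.length)).fill(null).map(async () => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  });
  await Promise.all(workers);
  return results;
}

// Usage: run up to 4 statements against the warehouse at once.
// await runWithConcurrencyLimit(statements.map(sql => () => dbAdapter.execute(sql)), 4);
```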

How can evaluation (validation of the SQL statements users write) be done? In the web UI, evaluation results are shown in real time in the sidebar during editing.

In Presto, prefix the query with EXPLAIN.
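In code, evaluation can be as simple as wrapping the statement in EXPLAIN and reporting whether the warehouse accepted it. Here’s a hedged sketch; the real dbadapter method returns a more structured result:

```typescript
// A simplified sketch of query evaluation via EXPLAIN; the real dbadapter
// returns a structured result rather than this plain object.
async function evaluate(
  execute: (sql: string) => Promise<unknown>,
  statement: string
): Promise<{ valid: boolean; error?: string }> {
  try {
    // Presto parses and plans the query without executing it.
    await execute(`explain ${statement}`);
    return { valid: true };
  } catch (e) {
    return { valid: false, error: String(e) };
  }
}
```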

3. Register the new adapter as an available adapter in the core and in the API

This is done in api/dbadapters/index.ts and core/adapters/index.ts.
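The exact code differs between the two files, but the idea in both is a small mapping from the warehouse name to the new adapter class. Here’s a hedged sketch of that idea, not the actual registration code:

```typescript
// An illustrative sketch of the idea behind core/adapters/index.ts and
// api/dbadapters/index.ts: both map a warehouse name to an adapter, and adding
// a warehouse means adding one entry.
import { PrestoAdapter } from "./presto";

const ADAPTERS: { [warehouse: string]: new (...args: any[]) => any } = {
  // bigquery: BigQueryAdapter,
  // snowflake: SnowflakeAdapter,
  presto: PrestoAdapter,
};

export function create(warehouse: string, ...args: any[]) {
  const ctor = ADAPTERS[warehouse];
  if (!ctor) {
    throw new Error(`Unsupported warehouse: ${warehouse}`);
  }
  return new ctor(...args);
}
```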

Brilliant! You can now access the adapter in the same way as the other adapters! Now to make sure it works the way you intended.

4. Add integration tests

As described earlier, you’ll need to decide which instance of the warehouse you’re going to use for integration tests. For local images, adding Bazel build rules can be difficult, but the other adapters can offer some inspiration. This code goes in a new integration test specification (for Presto this was tests/integration/presto.spec.ts), and you’ll need to create a new example project for the tests.

Our Presto integration tests use the standard Presto Docker image, with some custom configuration files copied in. The custom configuration files are needed to expose a port other than Presto’s default of 8080 (as we use that port for other things).

The tests should cover similar ground to the other adapters’ integration tests, such as compiling and running a test project, and listing the schemas in the warehouse.
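Here’s a hedged sketch of what such a spec can look like. The helper functions are hypothetical stand-ins for Dataform’s own test fixtures and helpers:

```typescript
// tests/integration/presto.spec.ts: an illustrative sketch only. The helpers
// used here (connectToTestPresto, compileTestProject) are hypothetical stand-ins
// for Dataform's real test fixtures.
import { expect } from "chai";

declare function connectToTestPresto(): Promise<{ schemas(db: string): Promise<string[]> }>;
declare function compileTestProject(path: string): Promise<{ graphErrors?: { compilationErrors?: unknown[] } }>;

describe("presto integration", () => {
  it("lists schemas", async () => {
    const dbAdapter = await connectToTestPresto();
    const schemas = await dbAdapter.schemas("memory");
    expect(schemas).to.be.an("array");
  });

  it("compiles the example project without errors", async () => {
    const graph = await compileTestProject("tests/integration/presto_project");
    expect(graph.graphErrors?.compilationErrors ?? []).to.be.empty;
  });
});
```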

Pat on the back

All done! You deserve it. All that’s left to do now is make the pull request.

As a reward, here’s a picture of a traditional Welsh sheep I saw on the slopes of Mount Snowdon a couple of weeks ago.

Sheep
