I recently wrote a Presto adapter for Dataform during a week-long quarterly hackathon. To do this, I had to touch a lot of code paths in our open source repository. I would have really struggled to do this (and still found it difficult, regardless) without in-depth knowledge of the code base. To help other contributors with similar endeavours, or readers just generally interested in how Dataform warehouse adapters work, I figured I'd outline the process and explain how the moving parts fit together.
Dataform is built to be extensible so that it can work with different warehouses. There's a good deal of complexity involved because of this, so in parts of this blog post I'll simplify and give a more high-level view.
I’ll also run through specifically how I added the adapter for Presto. If you’re just interested in the code without any context, you can check out this pull request for a Presto adapter. However, the context here should be useful, as I’ll cover some core concepts of Dataform (and hopefully prevent you from making the same mistakes I made)!
To add a new adapter, you’ll need to first get an understanding of how Dataform’s code base is structured and why it is structured the way it is.
Dataform core

This is available on npm as @dataform/core, and translates Dataform features into SQL specific to a given warehouse.
The most important concept in the core package is the compiled graph, and how it is composed of actions. Actions are uniquely identified by targets. Actions that output warehouse relations, such as tables and views, will use the target as their name. In general, warehouses follow the format of database.schema.table, so this information is encoded in a target. For actions that don't directly map to a warehouse relation, such as operations, the dataset is referred to by the SQL in each of the operations in the action.
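To make this concrete, here is a minimal TypeScript sketch of a target and how it resolves to a fully qualified name. The interface and function here are illustrative, not Dataform's actual Target proto or adapter API:

```typescript
// Illustrative sketch only: loosely modeled on Dataform's target concept,
// not the real proto definition.
interface Target {
  database?: string; // for Presto, this would be the connection (catalog)
  schema: string;
  name: string;
}

// Render a target as a fully qualified, quoted relation name.
function targetAsName(target: Target): string {
  return [target.database, target.schema, target.name]
    .filter((part): part is string => !!part)
    .map(part => `"${part}"`)
    .join(".");
}

const fullName = targetAsName({ database: "memory", schema: "default", name: "orders" });
// fullName === '"memory"."default"."orders"'
```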
To create a core adapter for a warehouse, you'll need to understand the purpose of the different actions, and be able to create the relevant SQL for the new warehouse to perform the desired effect of each Dataform action. The core adapter will need to implement the IAdapter abstract class, found in the core package.
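As a rough illustration, implementing a core adapter boils down to turning actions into warehouse-specific SQL strings. The interface below is a simplified stand-in for the real IAdapter, with invented method names:

```typescript
// Simplified stand-in for the real IAdapter interface; the actual abstract
// class in the Dataform repository has more methods and different signatures.
interface ISimpleAdapter {
  resolveTarget(target: { schema: string; name: string }): string;
  createOrReplaceView(target: { schema: string; name: string }, query: string): string;
}

class PrestoAdapter implements ISimpleAdapter {
  public resolveTarget(target: { schema: string; name: string }): string {
    return `"${target.schema}"."${target.name}"`;
  }

  // Generate the Presto SQL that gives a view action its desired effect.
  public createOrReplaceView(target: { schema: string; name: string }, query: string): string {
    return `create or replace view ${this.resolveTarget(target)} as ${query}`;
  }
}

const adapter = new PrestoAdapter();
const sql = adapter.createOrReplaceView(
  { schema: "default", name: "top_users" },
  "select * from users limit 10"
);
// sql === 'create or replace view "default"."top_users" as select * from users limit 10'
```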
Dataform API

This is available on npm as @dataform/api, and facilitates connecting to different warehouses. How the connection is implemented can vary hugely between warehouses, but they must all implement the IDbAdapter abstract class, found in the API package.
Some warehouse-specific SQL lives here too, but only SQL that has no relevance to generating a compiled graph: for example, the SQL required to preview the result of running a statement (in Presto's case, using the EXPLAIN prefix), or the SQL to list schemas within a database.
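Here is a simplified sketch of what a db adapter can look like. The interface is invented (the real IDbAdapter has more methods), and the run callback stands in for an actual Presto client:

```typescript
// Invented, simplified shape of a db adapter; not the real IDbAdapter.
// `run` stands in for a real Presto client call returning rows.
interface ISimpleDbAdapter {
  execute(statement: string): Promise<any[][]>;
  evaluate(statement: string): Promise<void>;
  schemas(): Promise<string[]>;
}

class PrestoDbAdapter implements ISimpleDbAdapter {
  constructor(private readonly run: (statement: string) => Promise<any[][]>) {}

  public async execute(statement: string) {
    return this.run(statement);
  }

  // Validate a statement without executing it by asking Presto for its plan.
  public async evaluate(statement: string) {
    await this.run(`explain ${statement}`);
  }

  // List schemas within the connected catalog.
  public async schemas() {
    const rows = await this.run("show schemas");
    return rows.map(row => String(row[0]));
  }
}
```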
Reliable integration tests are essential for providing stability to end users. Each warehouse adapter has an associated integration test, which runs a test project either directly against the warehouse (for example, BigQuery and Snowflake) or against a local image running in Docker (for example, Postgres and Presto). A lot of the difficulty of setting up a fixture has been abstracted away, but you'll still need to add connection settings, new secrets, and/or a build rule if testing against a local image.
Dataform CLI

This is available on npm as @dataform/cli. It's essentially a wrapper around the API, and can be used to compile and run Dataform projects locally. You won't need to touch this when adding a new warehouse adapter.
The Dataform web app
The web UI. Currently this is closed source, but when adapters are added to the open-source repository, we'll add them to the web UI pretty quickly once the vetting process is complete.
Steps for creating an adapter
With the knowledge of how Dataform is structured, it's now time to take a closer look at the warehouse you're going to be integrating. Here are some things to think about that tend to differ between warehouses.
If you’re building an adapter while reading this then you’ll need to familiarize yourself with the development environment. You can find how to set it up in our contributing guide, as well as some useful Bazel commands.
1. Get to know the warehouse you’re designing for, and add support for it to @dataform/core.
This is going to matter for every subsequent step, so some fundamental understanding of what makes the warehouse different is important. This logic will need to be put in a new adapter file in the core package.
Presto is designed to connect to lots of different data sources and provide a general query language across all of them (Presto has its own SQL dialect: https://prestodb.io/docs/current/sql.html). Initially open sourced by Facebook in 2012, it was quickly adopted by Netflix, and it's now used across many companies that deal with large quantities of data spread over different databases and warehouses. Put simply, it achieves cross-data-source communication by splitting SQL queries into tasks which are run across a deployed Presto cluster. The tasks are then split up further, retrieving data in small portions from each connection, before eventually bubbling back up to return a response.
How will (Dataform) targets be made?
Presto fits this naturally, with connection.catalog.table, where a connection maps to a database.
How will incremental tables work?
In Dataform's other adapters, we tend to use a merge statement to insert new rows into incremental tables as they become available. Presto doesn't support the merge statement, however; a proposed solution for this is to form a temporary table using an in-memory connection.
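One way to approximate incremental behaviour without MERGE is a plain INSERT guarded by the user's incremental condition. This sketch is hypothetical (the function name and shapes are invented, and it skips the temporary-table step), but it shows the kind of SQL a core adapter generates:

```typescript
// Hypothetical sketch: build an incremental insert for a warehouse without
// MERGE support by filtering source rows with the incremental `where` clause.
function incrementalInsert(
  target: string,
  columns: string[],
  query: string,
  where: string
): string {
  const cols = columns.join(", ");
  return `insert into ${target} (${cols}) select ${cols} from (${query}) where ${where}`;
}

const statement = incrementalInsert(
  '"default"."events"',
  ["id", "ts"],
  "select id, ts from raw_events",
  "ts > timestamp '2020-01-01'"
);
```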
Can statements be joined to preserve context between Dataform’s pre and post operations?
In Presto, yes: statements can be joined with a semicolon.
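A sketch of how pre-operations, the main statement, and post-operations might be joined into a single script; the function name is invented and the session property is just an example:

```typescript
// Combine pre-operations, the main statement, and post-operations into one
// script so session context set up in pre-operations carries through.
function buildRunScript(preOps: string[], mainStatement: string, postOps: string[]): string {
  return [...preOps, mainStatement, ...postOps]
    .map(statement => statement.trim())
    .filter(statement => statement.length > 0)
    .join(";\n");
}

const script = buildRunScript(
  ["set session query_max_run_time = '30m'"],
  'create table "default"."users_copy" as select * from users',
  []
);
```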
How will assertions be run?
Presto uses views (similar to other adapters), then a row count check to verify success.
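Sketched in TypeScript, with invented names, the assertion flow generates two statements: one to materialize the assertion query as a view, and one to count failing rows (any rows means the assertion failed):

```typescript
// Hypothetical sketch of assertion SQL generation: materialize the assertion
// query as a view, then count the rows it returns.
function assertionStatements(view: string, query: string): string[] {
  return [
    `create or replace view ${view} as ${query}`,
    `select count(*) as failing_rows from ${view}`,
  ];
}

const statements = assertionStatements(
  '"default"."users_have_ids"',
  "select * from users where id is null"
);
```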
What warehouse-specific options should be available in the config block of SQLX files? These are added in the table.ts file, and are visible in the Dataform documentation.
For now, just the partitionBy option has been added for Presto.
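As an illustration, a config block using that option might look like the following. The exact nesting and table names here are assumptions based on how other adapters expose warehouse-specific options, so check the documentation for the real shape:

```sqlx
config {
  type: "table",
  presto: {
    partitionBy: ["ds"]
  }
}
select user_id, event, ds from source_table
```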
2. Decide how you’re going to connect to the warehouse, and add support to @dataform/api.
In most circumstances, there will be a Node package for connecting to the warehouse via REST. This logic will need to be put in a new api/dbadapters/[new_warehouse].ts file. You'll also need to add the connection profile details to protos/profiles.proto.
Can queries be run in parallel? This can speed up execution for the end user.
Yes, with Presto we used the promise-pool-executor package.
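The Presto adapter uses the promise-pool-executor package for this. Rather than reproduce that package's API here, this dependency-free sketch shows the same idea: running tasks with bounded concurrency while preserving result order:

```typescript
// Run async tasks with at most `limit` in flight at once; results keep the
// order of the input tasks. A minimal stand-in for a promise pool library.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const index = next++; // claim the next task synchronously
      results[index] = await tasks[index]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, () => worker());
  await Promise.all(workers);
  return results;
}
```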
How can evaluation (validation of presented SQL statements) be done? In the web UI, the results are shown in real time in the sidebar during editing.
In Presto, prefix EXPLAIN to the query.
3. Register the new adapter as an available adapter in the core and in the API
Brilliant! You can now access the adapter in the same way as the other adapters! Now to make sure it works the way you intended.
4. Add integration tests
As described earlier, you'll need to decide which instance of the warehouse you're going to use for integration tests. For local images, adding Bazel build rules can be difficult, but other adapters can offer some inspiration. This code will go in a new integration test specification (for Presto, this was tests/integration/presto.spec.ts), and you'll need to create a new example project for the tests.
Our Presto integration tests use the standard Presto Docker image, with some custom configuration files copied over to disk. The custom configuration files are needed for exposing a port other than Presto’s default port of 8080 (as we use that port for other things).
The tests should implement similar checks to those of the other adapters, such as compiling a test project and listing schemas in a warehouse.
Pat on the back
All done! You deserve it. All that’s left to do now is make the pull request.
As a reward, here's a picture of a traditional Welsh sheep I saw on the slopes of Mount Snowdon a couple of weeks ago.