Advanced data quality testing with SQL and Dataform

A deep dive into some advanced data quality testing use cases with SQL and the open-source Dataform framework.

Lewis Hemens on March 11, 2020

It goes without saying that you should be testing your data quality. I've written about this myself before, as have many others. You can't make informed decisions if you don't trust your data!

Cloud data warehouses enable us to manipulate data more easily than ever before. This is fantastic for the data analytics team that wants to move quickly, but it can lead to some problems. When we start to create hundreds or even thousands of datasets in our data warehouse, understanding which of them are accurate, up-to-date, and correct becomes challenging.

When we launched the Dataform open-source framework, we shipped it with the concept of data assertions: a way to write SQL queries that quickly detect issues with your production data and help you catch problems early. Over the past year we've seen more and more use cases for data quality tests and learned a lot about how they're used in practice.

This is an advanced guide to data quality testing with Dataform for people who are currently using BigQuery, Redshift, Snowflake or Azure SQL Data Warehouse and SQL to manage their ELT data transformation processes.

In this article we'll cover some of those use cases, some of the new features in Dataform that make testing data quality super easy, and lots of handy SQL snippets that should be useful regardless of whether or not you are using Dataform.

Testing basics

Dataform is an open-source data modeling framework. It allows you to manage all your tables, views, dependencies, and pipelines with SQL, and to version control their definitions. Part of that framework includes data tests, and we have recently introduced a simpler way to get started writing these tests.

Here's a simple example of a SQLX file in a Dataform project that defines a dataset that we'll use as a base for the examples below, containing information and stats about our customers.

definitions/customer_stats.sqlx:

config {
  type: "table"
}

SELECT
  customer_id,
  email,
  attribution_source AS source,
  sessions
FROM
  customers
LEFT JOIN (
  SELECT
    customer_id,
    SUM(sessions) AS sessions
  FROM
    sessions
  GROUP BY
    customer_id)
USING
  (customer_id)

Unique keys

Most cloud data warehouses don't have the concept of primary keys or a built-in way to check that a key is unique.

To add a key uniqueness check to the dataset we created above, we simply add the relevant assertion to the config block of our dataset:

config {
  type: "table",
  assertions: {
    index: ["customer_id"]
  }
}

That's it! When we run our pipeline with Dataform, it generates a query specific to our configured data warehouse type that looks for duplicate rows in the dataset, and the assertion fails if any are found.
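Under the hood, the generated assertion looks roughly like the following query. This is a sketch against our customer_stats example table; the exact SQL Dataform emits depends on your warehouse.

```sql
-- Find keys that appear more than once; the assertion
-- fails if this query returns any rows.
SELECT
  customer_id,
  COUNT(*) AS row_count
FROM
  customer_stats
GROUP BY
  customer_id
HAVING
  COUNT(*) > 1
```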

Nullness checks

Another common data integrity requirement which most warehouses don't support is column non-nullness checking. We can easily add this by extending our config block:

config {
  type: "table",
  assertions: {
    index: ["customer_id"],
    nonNull: ["email"]
  }
}
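The generated non-nullness check is conceptually just as simple. A sketch of the equivalent query against our example table:

```sql
-- Rows with a NULL email fail the assertion.
SELECT
  *
FROM
  customer_stats
WHERE
  email IS NULL
```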

Custom row checks

The above use cases don't really support conditions that are specific to application business logic. What if we want to write something a bit more custom?

We can use the rowConditions feature of the assertions config block to write SQL based rules that should apply to every row in our data.

For example, let's say we track where our customers came from using a string value: either "outbound" or "inbound".

We can write a check for this using a custom row condition:

config {
  type: "table",
  assertions: {
    index: ["customer_id"],
    nonNull: ["email"],
    rowConditions: [
      "source in ('inbound', 'outbound')"
    ]
  }
}

Row conditions capture many practical data quality checks we see people writing that aren't covered by the above shorthands, and adding new conditions is easy.
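Conceptually, each row condition is negated and turned into a query that selects the offending rows. Here's a sketch of the equivalent SQL; note that we catch NULLs explicitly, since in standard SQL a NULL source wouldn't match either branch of the negation:

```sql
-- Any row whose source is not a recognised value fails.
SELECT
  *
FROM
  customer_stats
WHERE
  NOT (source IN ('inbound', 'outbound'))
  OR source IS NULL
```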

Advanced use cases

For these examples, we're going to use custom-built assertions in Dataform. Dataform assertions are SQL queries that look for problems in data: the query should return zero rows if the assertion passes. You can read more about assertions here.

Data freshness

Let's assume you rely on data being loaded into your warehouse by some external system, e.g. a product like Stitch Data that incrementally loads data from your production database into your warehouse.

It would be great to know when that data stops flowing, so that we can fix any issues with the connection, or get a warning before people start asking why the dashboards have stopped updating.

Here's an example assertion for BigQuery that makes sure the most recently loaded data is less than 6 hours old:

config {
  type: "assertion"
}

WITH
  freshness AS (
  SELECT
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(),
      MAX(load_timestamp),
      HOUR) AS delay
  FROM
    production_database_staging )
SELECT
  *
FROM
  freshness
WHERE
  delay > 6

This query will return a single row if the current delay on our table is greater than 6 hours, and cause the assertion to fail.
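The same idea ports to other warehouses with minor syntax changes. For example, here's a sketch using the DATEDIFF function found in Redshift and Snowflake, with the same assumed staging table and column names as above (check your warehouse's documentation for the exact function names it supports):

```sql
WITH
  freshness AS (
  SELECT
    DATEDIFF(hour, MAX(load_timestamp), CURRENT_TIMESTAMP) AS delay
  FROM
    production_database_staging )
SELECT
  *
FROM
  freshness
WHERE
  delay > 6
```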

Data completeness

Sometimes just looking at freshness doesn't cut it. It's common to have an ETL system that loads daily dumps of data into partitioned tables in our warehouse. If the load for a given day fails, that could indicate a problem which a freshness assertion won't necessarily capture.

Instead, we want to check that some data is present for each of the last 30 complete days. Assuming our load table uses date partitioning, we can use BigQuery's very handy GENERATE_DATE_ARRAY function for this:

config {
  type: "assertion"
}

WITH
  required_dates AS (
  SELECT
    *
  FROM
    UNNEST(
      GENERATE_DATE_ARRAY(
        DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY),
        DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
        INTERVAL 1 DAY)) AS date )
SELECT
  rd.date
FROM
  required_dates rd
LEFT JOIN (
  SELECT
    DISTINCT _PARTITIONDATE AS date
  FROM
    data_load_tables) dlt
USING
  (date)
WHERE
  dlt.date IS NULL

This will return the dates that are missing, and zero rows if there are no dates missing.

ML model accuracy (BigQuery)

Here's a cool one: BigQuery allows us to train and use ML models entirely within BigQuery using a SQL-like syntax.

What if our ML model accuracy decreases for some reason? We probably want to know about it, and we may want to actually gate deployment of our model on the accuracy being over a certain baseline level.

config {
  type: "assertion"
}

SELECT
 *
FROM
 ML.EVALUATE (
   MODEL my_ml_model,
   (
     SELECT
       *
     FROM
       source_data
     WHERE
       dataframe = 'evaluation'
   )
 )
WHERE accuracy < .8

If our accuracy drops below 80%, the assertion will fail. By making this assertion a dependency of our model deployment query, we can also make sure we don't push new models that don't meet this bar. Check out our recent blog post on building an end-to-end ML pipeline using BigQuery that covers this in more detail.

Unit tests

Assertions aren't hermetic: they depend on the production data in your warehouse, which is changing all the time. This means your assertions can start failing even if your data pipeline and SQL queries haven't changed at all.

This is very much intended behaviour, but not always what we want. For testing query logic without a dependency on production data we can use unit tests that make it easy to mock out data sources. We won't cover this here, but you can read more in our unit testing SQL queries with Dataform blog post.
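To give a flavour, a unit test in Dataform is itself a SQLX file that mocks each input and declares the expected output. Here's a rough sketch against our example dataset; see the linked post for the exact syntax:

```sql
config {
  type: "test",
  dataset: "customer_stats"
}

input "customers" {
  SELECT 1 AS customer_id, "jane@example.com" AS email, "inbound" AS attribution_source
}

input "sessions" {
  SELECT 1 AS customer_id, 5 AS sessions
}

-- Expected output: if the real query run over the mocked
-- inputs returns anything else, the test fails.
SELECT 1 AS customer_id, "jane@example.com" AS email, "inbound" AS source, 5 AS sessions
```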

Conclusion

SQL is a great fit for data quality testing, and Dataform makes it easier than ever to configure both common and advanced data quality tests with just a few lines of code.

If you have testing use cases that you don't think are covered by the above, we'd love to hear about them!

