Enforcing Data Quality using Assertions

picture of authorLewis Hemenson 9 May 2019

A short overview of how to to use SQL based data assertions to enforce high data quality in your data warehouse.

Ensuring that data consumers can use their data to reliably answer questions is of paramount importance to any data analytics team. Having a mechanism to enforce high data quality across datasets is therefore a key requirement for these teams.

Often, input data sources are missing rows, contain duplicates, or include just plain invalid data. Over time, changes to business definitions or the underlying software which produces input data can cause drift in the meaning of columns - or even the overall structure of tables. Addressing these issues is critical to creating a successful data team and generating valuable, correct insights.

In this article we’ll take a look at some common data quality problems, how to detect them, and - most importantly - how to fix them in a way that persists for all data consumers so everyone can take advantage of our data cleaning efforts.

The SQL snippets in this post apply to Google BigQuery but can be ported easily enough to Redshift, Postgres or Snowflake data warehouses.

What is a data assertion?

A data assertion is a query that looks for errors in a dataset. If the query returns any rows then the assertion fails.

Data assertions are defined this way because it’s much easier to look for problems rather than the absence of them. It also means that assertion queries can themselves be used to quickly inspect the data causing the assertion to fail - making it easy to diagnose and fix the problem.

Checking field values

Let’s take a look at a simple example.

Assume that there is a database.customers table containing information about customers in the database. Some checks that we might want to verify on the table’s contents include:

  • The field email_address is always set
  • The field customer_type is one of “business” or “individual”

The following simple query will return any rows violating these rules:

SELECT customer_id
FROM database.customers
WHERE  email_address IS NULL
OR NOT customer_type IN (“business”, “individual”)

Checking for unique fields

We may also want to run checks across more than one row. For example, we might want to verify that the customer_id field is unique. A query like the following will return any duplicate customer_id values:

    SUM(1) AS count
FROM database.customers
HAVING count > 1

Creating a data assertion

We can combine all of the above into a single query to quickly find any customer_id value violating one of our rules:

SELECT customer_id, “missing_email” AS reason
FROM database.customers
WHERE email_address IS NULL


SELECT customer_id, “invalid_customer_type” AS reason
WHERE not customer_type in (“business”, “individual”)
FROM database.customers


SELECT customer_id, “duplicate_id” AS reason
    SELECT customer_id, SUM(1) AS count
    FROM database.customers
    GROUP BY 1
WHERE count > 1

We now have one query we can run to detect any problems in our table, and we can easily add another unioned SELECT statement if we want to add new conditions in the future.

Creating clean datasets

Now that we’ve detected the issues in our data, we need to clean them up. Ultimately, choosing how to handle data quality issues depends on your business use case.

In this example we will:

  • Remove any rows that are missing the email_address field
  • Set a default customer type if it’s invalid
  • Remove rows with duplicate customer_id fields, retaining one row per customer_id value (we don’t care which one)

Rather than editing the dataset directly, we can create a new clean copy of the dataset, applying our data cleaning transformations with simple SQL statements. This gives us freedom to change or add rules in the future and avoids deleting any data.

The following query creates a copy of our database.customers table in which invalid rows are removed, default customer types are set, and duplicate rows for the same customer_id are removed:

    any_value(email_address) AS email_address,
        CASE customer_type
        WHEN “individual” THEN “individual”
        WHEN “business”   THEN “business”
        ELSE “unknown”
    END) AS customer_type
FROM database.customers
WHERE NOT email_address IS NULL

This query can be used to create either a view or a table in our cloud data warehouse, perhaps called database_clean.customers, which can be consumed in dashboards or by analysts who want to query the data.

We can check that the above query has correctly cleaned the dataset by re-running the original assertion on the new dataset.

Continuous data quality testing

Dataform provides a way to define data assertions and run them as part of a larger data pipeline or as standalone checks.

These can be run at any frequency, and if an assertion fails an email will be sent to notify you of the problem. Dataform also provides a way to easily create new datasets in your warehouse, making managing the process of cleaning and testing your data extremely straightforward.

For more information on how to start writing data assertions, check out the assertions documentation guide for Dataform’s open-source framework, or create an account for free and start using Dataform's fully managed Web platform.