Ensuring that data consumers can use their data to reliably answer questions is of paramount importance to any data analytics team. Having a mechanism to enforce high data quality across datasets is therefore a key requirement for these teams.
Often, input data sources are missing rows, contain duplicates, or include just plain invalid data. Over time, changes to business definitions or the underlying software which produces input data can cause drift in the meaning of columns - or even the overall structure of tables. Addressing these issues is critical to creating a successful data team and generating valuable, correct insights.
In this article we explain the concept of a SQL data assertion, look at some common data quality problems, how to detect them, and - most importantly - how to fix them in a way that persists for all data consumers.
The SQL snippets in this post apply to Google BigQuery but can be ported easily enough to Redshift, Postgres or Snowflake data warehouses.
What is a data assertion?
A data assertion is a query that looks for problems in a dataset. If the query returns any rows then the assertion fails.
Data assertions are defined this way because it’s much easier to look for problems rather than the absence of them. It also means that assertion queries can themselves be used to quickly inspect the data causing the assertion to fail - making it easy to diagnose and fix the problem.
Checking field values
Let’s take a look at a simple example.
Assume that there is a
database.customers table containing information about customers in the database.
Some checks that we might want to verify on the table’s contents include:
- The field
email_addressis always set
- The field
customer_typeis one of
The following simple query will return any rows violating these rules:
SELECT customer_id FROM database.customers WHERE email_address IS NULL OR NOT customer_type IN (“business”, “individual”)
Checking for unique fields
We may also want to run checks across more than one row. For example, we might want to verify that the
customer_id field is unique. A query like the following will return any duplicate
SELECT customer_id, SUM(1) AS count FROM database.customers GROUP BY 1 HAVING count > 1
Combinining multiple assertions into a single query
We can combine all of the above into a single query to quickly find any
customer_id value violating one of our rules using
SELECT customer_id, “missing_email” AS reason FROM database.customers WHERE email_address IS NULL UNION ALL SELECT customer_id, “invalid_customer_type” AS reason WHERE not customer_type in (“business”, “individual”) FROM database.customers UNION ALL SELECT customer_id, “duplicate_id” AS reason FROM ( SELECT customer_id, SUM(1) AS count FROM database.customers GROUP BY 1 ) WHERE count > 1
We now have one query we can run to detect any problems in our table, and we can easily add another unioned
SELECT statement if we want to add new conditions in the future.
Creating clean datasets
Now that we’ve detected the issues in our data, we need to clean them up. Ultimately choosing how to handle data quality issues depends on your business use case.
In this example we will:
- Remove any rows that are missing the
- Set a default customer type if it’s invalid
- Remove rows with duplicate
customer_idfields, retaining one row per
customer_idvalue (we don’t care which one)
Rather than editing the dataset directly, we can create a new clean copy of the dataset - this gives us freedom to change or add rules in the future and avoids deleting any data.
The following SQL query defines a view of our
database.customers table in which invalid rows are removed, default customer types are set, and duplicate rows for the same
customer_id are removed:
SELECT customer_id, ANY_VALUE(email_address) AS email_address, ANY_VALUE( CASE customer_type WHEN “individual” THEN “individual” WHEN “business” THEN “business” ELSE “unknown” END) AS customer_type FROM database.customers WHERE NOT email_address IS NULL GROUP BY 1
This query can be used to create either a view or a table in our cloud data warehouse, perhaps called
database_clean.customers, which can be consumed in dashboards or by analysts who want to query the data.
Now we've fixed the problem, we can check that the above query has correctly fixed the problems by re-running the original assertion on the new dataset.
Continuous data quality testing
Assertions should be run as part of any data pipelines to make sure breaking changes are picked up the moment they happen.
If an assertion returns any rows, future steps in a pipeline should either fail, or a notification delivered to the data owner.
Dataform has built in support for data assertions, and provides a way to run them as part of a larger SQL pipeline.
These can be run at any frequency, and if an assertion fails an email will be sent to notify you of the problem. Dataform also provides a way to easily create new datasets in your warehouse, making managing the process of cleaning and testing your data extremely straightforward.
For more information on how to start writing data assertions with Dataform, check out the assertions documentation guide for Dataform’s open-source framework, or create an account for free and start using Dataform's fully managed Web platform.