How to Remove Duplicates from a Bigquery Table | Dataform
Data Warehouse Guide

How to Remove Duplicates from a Bigquery Table

Learn how to deduplicate data in a Bigquery table

Data Warehouse

One of the most common problems when it comes to maintaining data is managing duplicate records.  Duplication of data happens for many reasons. A few examples include the result of joins, errors with processing, improper collection, or perhaps the source data already contains duplicates before ending up in BigQuery.

This post will describe one method for deduplicating data in Bigquery.

Step 1: Identify whether your dataset contains duplicates

For this example, I’m using this Bigquery public dataset showing information about baseball games. The query below checks whether there are any duplicate rows. As the total row number is higher than the distinct row number we know that this dataset contains duplicates:

select
  (select count(1) from (select distinct * from bigquery-public-data.baseball.games_wide)) as distinct_rows,
  (select count(1) from bigquery-public-data.baseball.games_wide) as total_rows

Here is the result:

Step 2: Create a SELECT statement to identify unique values

The next step is to write a SELECT statement that removes any duplicate rows: the DISTINCT function makes this simple:

select distinct * from bigquery-public-data.baseball.games_wide

The result is a table with the duplicates removed, showing only unique values. We can check that this has worked by looking at whether the new row count of the table matches the distinct_rows count we spoke about above.

You may choose to save this query as a view. One benefit of doing this is that it won’t incur storage costs, however you will be charged every time the dataset is queried: BigQuery pricing depends on the number of bytes processed in the columns selected or filtered.

Step 3: Materialize the result to a new table

A better practice, if you expect the result to be regularly queried, is to copy (or materialize) the results to another table. This will incur storage costs, but the processing costs when querying the dataset will likely be lower than querying a view.

CREATE OR REPLACE TABLE my_dataset.baseball.games_wide_deduped
AS
SELECT
DISTINCT *
FROM bigquery-public-data.baseball.games_wide
group by 1

The above query will first look for the existence of the table baseball.games_wide_deduped within the my_dataset dataset and then replace it with the result if it exists, or creates the table if it doesn’t.

The practice of taking a result from a SQL query and creating a table from it is known as "materializing." Generally speaking, the cost of storing materialized data is much less than the costs of processing vast amounts of data. Materializing results is also a best practice recommended by Google.

To summarize, to deduplicate records with BigQuery, follow these steps:

  1. Identify whether your dataset contains duplicates.
  2. Create a SELECT query that aggregates the desired column using a GROUP BY clause.
  3. Materialize the result to a new table using CREATE OR REPLACE TABLE [tablename] AS [SELECT STATEMENT].

You might be wondering why we aren’t using the DELETE statement to remove duplicates?  When it comes to BigQuery it is not optimized for DELETE performance.  BigQuery is a "write once, read many" technology that uses immutable files to store its data, so it's not possible to remove data from the middle of a table.  Instead, data has to be rewritten entirely to a new table.  The best strategy for managing duplicate records is to use a SELECT query to augment the original structure, then save the results to a new table.

More content from Dataform

What can data teams learn from Google’s State of DevOps report? illustration
Opinion

What can data teams learn from Google’s State of DevOps report?

Google's research shows us that elite engineering teams have a lot in common when it comes to DevOps. We think similar principles apply to data teams or, more precisely, DataOps.
Learn more
Cut data warehouse costs with run caching illustration
Product Update

Cut data warehouse costs with run caching

How to save time and money by using our run caching feature
Learn more
Building the Dataform VS Code extension illustration
Guide

Building the Dataform VS Code extension

How we made our own extension for Visual Studio Code.
Learn more

Learn more about the Dataform data modeling platform

Dataform brings open source tooling, best practices, and software engineering inspired workflows to advanced data teams that are seeking to scale, enabling you to deliver reliable data to your entire organization.