One of the most common problems when it comes to maintaining data is managing duplicate records. Duplication of data happens for many reasons. A few examples include the result of joins, errors with processing, improper collection, or perhaps the source data already contains duplicates before ending up in BigQuery.
This post will describe one method for deduplicating data in Bigquery.
Step 1: Identify whether your dataset contains duplicates
For this example, I’m using this Bigquery public dataset showing information about baseball games. The query below checks whether there are any duplicate rows. As the total row number is higher than the distinct row number we know that this dataset contains duplicates:
select
(select count(1) from (select distinct * from bigquery-public-data.baseball.games_wide)) as distinct_rows,
(select count(1) from bigquery-public-data.baseball.games_wide) as total_rows
Here is the result:
Step 2: Create a SELECT
statement to identify unique values
The next step is to write a SELECT
statement that removes any duplicate rows: the DISTINCT
function makes this simple:
select distinct * from bigquery-public-data.baseball.games_wide
The result is a table with the duplicates removed, showing only unique values. We can check that this has worked by looking at whether the new row count of the table matches the distinct_rows
count we spoke about above.
You may choose to save this query as a view. One benefit of doing this is that it won’t incur storage costs, however you will be charged every time the dataset is queried: BigQuery pricing depends on the number of bytes processed in the columns selected or filtered.
Step 3: Materialize the result to a new table
A better practice, if you expect the result to be regularly queried, is to copy (or materialize) the results to another table. This will incur storage costs, but the processing costs when querying the dataset will likely be lower than querying a view.
CREATE OR REPLACE TABLE my_dataset.baseball.games_wide_deduped
AS
SELECT
DISTINCT *
FROM bigquery-public-data.baseball.games_wide
group by 1
The above query will first look for the existence of the table baseball.games_wide_deduped
within the my_dataset
dataset and then replace it with the result if it exists, or creates the table if it doesn’t.
The practice of taking a result from a SQL query and creating a table from it is known as "materializing." Generally speaking, the cost of storing materialized data is much less than the costs of processing vast amounts of data. Materializing results is also a best practice recommended by Google.
To summarize, to deduplicate records with BigQuery, follow these steps:
- Identify whether your dataset contains duplicates.
- Create a
SELECT
query that aggregates the desired column using aGROUP BY
clause. - Materialize the result to a new table using
CREATE OR REPLACE TABLE [tablename] AS [SELECT STATEMENT]
.
You might be wondering why we aren’t using the DELETE
statement to remove duplicates? When it comes to BigQuery it is not optimized for DELETE
performance. BigQuery is a "write once, read many" technology that uses immutable files to store its data, so it's not possible to remove data from the middle of a table. Instead, data has to be rewritten entirely to a new table. The best strategy for managing duplicate records is to use a SELECT
query to augment the original structure, then save the results to a new table.