A practical guide to DataOps best practices

Lewis Hemenson, 18 June 2019

What are DataOps best practices, why do you need them, and how can you adopt them?

Published by XKCD under a Creative Commons Attribution-NonCommercial 2.5 License

Why DataOps?

It might be time to start investing in DataOps if you answer yes to any of these questions:

  • Does it take ages to add new dimensions, features or metrics to dashboards or tables?
  • Does critical business data depend on manual processes run on a daily or weekly basis?
  • Have you ever tried to reproduce a piece of data analysis only to get a different answer?
  • Do you often find yourself sharing either SQL scripts or CSV data over Slack/email?

All of these problems can be addressed with a good approach to DataOps, utilizing the right processes and tooling.

What is DataOps?

If you’re a data engineer, or if you’ve spent any time developing data pipelines, writing SQL queries, or managing a cloud data warehouse, you’ve probably thought a bit about DataOps already.

The exact definition will vary depending on who you ask, but from a technical perspective we would agree with Wikipedia’s definition:

"... an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics."

Many concepts and best practices come from DevOps, which generally concerns itself with the processes around developing, testing, deploying and monitoring software.

In this guide we’ll discuss the primary principles and objectives of DataOps, their benefits, and then the steps you can take to achieve them.

DataOps objectives

Your data stack should be reproducible

If you lost all those datasets you’ve been working on for the last month, could you easily recreate them? If you had the code to generate them in source control, chances are you probably could.

Reproducing datasets is much trickier when they’re built with a mixture of one-off Python scripts, SQL queries and notebooks.

Your data definitions should be centralized, shared, and discoverable

Anyone should be able to find the code or query that generated a dataset, dashboard or model.

Being able to edit those datasets (or at least, how they are defined) is critical to fostering collaboration as your team grows.

Your production data should be isolated

Your production data is important. It should be treated as such.

Development work should stay well clear of production and ideally production datasets should only change as a result of their source-controlled definitions being updated.

You should be able to change your data definitions quickly

DevOps teams pride themselves on releasing new software every day. There’s no reason you shouldn’t be able to do the same with your data. For data-driven companies, there is a strong business need for data teams to move at least as quickly as new software rolls out.

New product reporting requirements shouldn’t be a major hurdle. It should be possible to change data definitions and make them available to the business within hours or at most days.

If your data pipeline breaks you should know about it

You or your team should be immediately notified when a script breaks, a pipeline stops working, or data quality has been impacted.

It’s difficult to maintain a data-driven culture if your datasets are often broken or out of date.
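As a rough illustration of the idea, the sketch below wraps a pipeline step so that any failure triggers a notification. The names `run_with_alert` and `notify` are our own placeholders, not part of any particular tool; in practice `notify` would post to whatever chat or paging system your team already uses.

```python
import logging
import traceback
from typing import Callable

logger = logging.getLogger("pipeline")


def run_with_alert(
    name: str,
    step: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Run one pipeline step; on any exception, notify and report failure."""
    try:
        step()
    except Exception as exc:
        # Put the step name and the error in the message so the alert
        # is actionable without opening the logs.
        notify(f"Pipeline step '{name}' failed: {exc!r}")
        logger.error("step %s failed\n%s", name, traceback.format_exc())
        return False
    return True


if __name__ == "__main__":
    alerts = []  # stand-in for a real chat or paging integration

    def broken_step():
        raise ValueError("source table is empty")

    succeeded = run_with_alert("load_orders", broken_step, alerts.append)
    print(succeeded, alerts)
```

Returning a boolean rather than re-raising lets the caller decide whether later steps should still run; a stricter setup might stop the whole pipeline instead.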

Manual processes should be automated

Nobody wants to be on the hook for periodically running a script.

Automating manual work reduces mental workload on the teams involved. This frees up time for proactive analytics and longer term investments in your data. It also greatly reduces the chances of human error.

Data access should be controlled

How pressing access control is depends on the size of your company, but for large organizations with hundreds or thousands of employees, where data may be sensitive, it is a strong requirement.

This is made possible through documentation detailing who has access to what and a simple and transparent approval process. Access control also aids with accountability, and is key to meeting regulatory requirements such as GDPR.

How do we get there?

Everyone builds out their data stack in a slightly different way. Where you store the data, how you process it and how you use it all vary between companies.

You may use Apache Spark, Hadoop, Databricks, BigQuery, Redshift, Snowflake, Presto or something else. Regardless, the following rules should guide you no matter what your underlying tech stack might be.

Source control everything

GitHub, GitLab, Bitbucket - it doesn’t matter where, as long as your SQL, Python or R code is stored there. Everyone on your team should be able to clone that repository and run all steps in your pipeline.

Even if you already use source control, make sure everything is actually in there!

It’s likely that you have some business-critical SQL scripts locked up in a dashboarding tool which is not source controlled, or some saved views in your cloud data warehouse upon which your teams depend.

Adopting source control is a critical step. Attempts to follow other DataOps rules will likely fail without it. Source control provides you with a clear history of changes, a place to share your code, and the ability to easily roll back to a previous working state.

Write tests for your data

Are there certain things you expect or require to be true about your data? Codifying these checks or expectations (we call them assertions) gives your team the confidence to move quickly. If someone makes a change to your data processes and it breaks those expectations, you can find out quickly.

These checks can be run directly in production for monitoring, or as part of a deployment or staging environment if you’re feeling ambitious.
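One common way to express an assertion is as a query that selects the rows violating an expectation, so that the check passes when the query returns nothing. The sketch below shows that pattern using Python’s built-in `sqlite3` module as a stand-in for a warehouse; the `orders` table, its columns and the three checks are invented for illustration.

```python
import sqlite3

# Each assertion is a query that selects rows violating an expectation.
# An assertion passes when its query returns no rows.
ASSERTIONS = {
    "order_id is unique":
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
    "amount is non-negative":
        "SELECT order_id FROM orders WHERE amount < 0",
    "customer_id is not null":
        "SELECT order_id FROM orders WHERE customer_id IS NULL",
}


def failed_assertions(conn, assertions):
    """Return the names of assertions whose query returned at least one row."""
    failures = []
    for name, query in assertions.items():
        if conn.execute(query).fetchone() is not None:
            failures.append(name)
    return failures


def demo_connection():
    """Build an in-memory table with deliberately bad rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10, 25.0), (2, 11, 40.0), (2, None, -5.0)],
    )
    return conn


if __name__ == "__main__":
    # The third row duplicates order_id 2, has no customer and a negative
    # amount, so every assertion is reported as failed.
    print(failed_assertions(demo_connection(), ASSERTIONS))
```

Because each check is just a query, the same definitions can run against production for monitoring or against a staging schema before a deployment.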

For more info on how to write data assertions, check out our guide here.

Agile software engineering teams would find it difficult to automate releases on a daily basis if they didn’t have thorough checks in place, which brings us to...

Automate all the things

Once all your code is checked into source control, automating its deployment becomes much easier. Deployment can be as simple as a bash script, but usually some tooling should be used to manage the execution of several steps in a pipeline.
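To give a feel for what such tooling does at its core, here is a minimal sketch that runs pipeline steps in dependency order using `graphlib` from the Python standard library (Python 3.9 or later). The step names and the `run_pipeline` helper are our own illustration; real orchestration tools add scheduling, retries and logging on top of this.

```python
from graphlib import TopologicalSorter


def run_pipeline(steps, dependencies):
    """Run every step once, after all of the steps it depends on.

    steps: maps a step name to a zero-argument callable.
    dependencies: maps a step name to the set of step names it depends on.
    Returns the order in which the steps were run.
    """
    graph = {name: dependencies.get(name, set()) for name in steps}
    # static_order() yields each node after all of its predecessors and
    # raises graphlib.CycleError if the dependencies form a cycle.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        steps[name]()
    return order


if __name__ == "__main__":
    log = []
    steps = {
        "report": lambda: log.append("built report"),
        "clean": lambda: log.append("cleaned data"),
        "extract": lambda: log.append("extracted data"),
    }
    dependencies = {"clean": {"extract"}, "report": {"clean"}}
    print(run_pipeline(steps, dependencies))
```

Declaring dependencies instead of hard-coding an order means a new step only has to state what it needs, which is much of what makes a pipeline easy to change.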

If you manage a business-critical dataset using source control and a CI/CD setup, you might expect a deployment process similar to this:

  1. You develop your change in an isolated development environment and verify that it works
  2. You write a test for that dataset
  3. You commit your changes and create a pull request on your source controlled repository
  4. A team member reviews and approves your pull request
  5. Your CI/CD pipeline picks up your changes and automatically deploys your pipeline in production so the new or fixed dataset is made available to everyone

Use development environments

When starting out, it may be OK for you to directly modify production data, but this doesn’t scale well beyond just a few people. Development environments are the norm in software engineering, where changes can be tested on a local copy of the application before it goes live.

If using a data warehouse, this can be as simple as making sure that everyone works in a separate development schema to avoid conflicts. In other environments, for example where you might be managing files on disk, you could write development datasets into a separate S3 bucket, or under another folder such as dev_$USER.

To enforce this, ideally only your automated process should have the right permissions to push to your production data tables.
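A small sketch of the per-user schema idea follows. The production schema name `analytics`, the environment names and the `target_schema` helper are all assumptions made for illustration; the point is only that the write target is derived from the environment rather than typed by hand.

```python
import getpass
import re

PRODUCTION_SCHEMA = "analytics"  # hypothetical production schema name


def target_schema(environment, user=None):
    """Pick the schema a run should write to, based on the environment."""
    if environment == "production":
        return PRODUCTION_SCHEMA
    if environment != "development":
        raise ValueError(f"unknown environment: {environment!r}")
    user = user or getpass.getuser()
    # Keep only characters that are safe in an unquoted SQL identifier.
    safe_user = re.sub(r"[^a-z0-9_]", "_", user.lower())
    return f"dev_{safe_user}"


if __name__ == "__main__":
    print(target_schema("development", "Ada.Lovelace"))  # dev_ada_lovelace
    print(target_schema("production"))                   # analytics
```

A helper like this only routes writes. Keeping development runs out of production still relies on permissions, so that only the automated deployment process can write to the production schema.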

Avoid data mutation

Relying heavily on INSERT and DELETE operations makes it very hard to reproduce your data from scratch. This is particularly relevant when using a data warehouse where it’s actually possible to mutate datasets.

If you need to change a dataset whose definition you can’t edit directly, create a new transformed copy of it. Or, if the transformation is cheap and you’re using SQL, just create a view of the dataset.
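As a sketch of the view approach, the example below leaves a source table untouched and defines the transformation as a view on top of it, using `sqlite3` as a stand-in for a warehouse. The `raw_orders` table and its contents are made up for illustration.

```python
import sqlite3


def build_demo():
    """Create a source table and a view that transforms it without mutation."""
    conn = sqlite3.connect(":memory:")
    # This load stands in for source data that arrives from elsewhere.
    conn.execute(
        "CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(1, 2500, "paid"), (2, 4000, "refunded"), (3, 1500, "paid")],
    )
    # Instead of deleting refunded rows or updating amounts in place,
    # express the transformation as a view over the untouched source.
    conn.execute(
        """
        CREATE VIEW paid_orders AS
        SELECT order_id, amount_cents / 100.0 AS amount
        FROM raw_orders
        WHERE status = 'paid'
        """
    )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    rows = conn.execute(
        "SELECT order_id, amount FROM paid_orders ORDER BY order_id"
    ).fetchall()
    print(rows)  # [(1, 25.0), (3, 15.0)]
```

Since the view is just a definition, it can live in source control and be recreated from scratch at any time, while the raw table keeps every original row.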

Of course, there are some exceptions to this rule, such as meeting GDPR obligations where you must be able to delete data upon request (our guess is that this is one of the main reasons Google BigQuery actually added support for delete statements in 2016!).

Conclusion

Embracing DataOps practices doesn’t need to be difficult and can have a massive impact on your organization’s ability to be data-driven. Hopefully you’ll find some of this advice useful on your journey to building out a world-class data stack.

If you’d like to learn more about how you can easily adopt DataOps best practices in your team, particularly if you work with cloud data warehouses such as Google BigQuery, Amazon Redshift and Snowflake, come and talk to us at Dataform!

Although a number of these changes require some upfront investment, the payoff can be dramatic, enabling your team to scale and ultimately spend less time dealing with reactive data questions and more time focusing on the interesting data problems.