Typical organizations have hundreds of data sources: web analytics, production databases, online ads, order systems, sales pipeline management tools and more. For each of these areas, there might be several sources of data: for example your company may use Facebook, LinkedIn, and AdWords for running digital ad campaigns. For each of these individual services, there are likely ten or more different tables. The result is that data teams today are working with hundreds, if not thousands, of data sources.
This level of complexity (often handled by teams of only a handful of data analysts) can seem unmanageable at first. Each time a new question is asked, the analyst needs to work out which sources of data are relevant, find a way to combine the ones of interest, and make sure the final result makes sense.
But this is not what modern data teams do. Instead, they create simplified datasets on top of the source data, expressly designed to help them answer typical business questions quickly. These tables are called data models.
Data modeling - the process of creating, updating and maintaining data models - is quickly becoming the most important skill of the modern data team. In this chapter we’ll discuss why it’s important, and cover the best practices to keep in mind while you’re building your organization's data models.
An example: marketing attribution
Most organizations today use digital marketing to attract new customers, running ads through many channels: Facebook, LinkedIn, Adwords, Twitter and more.
A common question posed to the data team is: ‘which of our ad providers brought us the most new customers last week?’ Answering this question is generally described as “marketing attribution”.
Without a data model, each time you get asked this question, you need to follow these steps:
- Find out which ad providers the company uses
- Locate datasets for each
- Translate the metrics provided by each provider into a consistent format
- Join all the ads data together, and run through some QA checks to make sure there are no mistakes
- Join the ads data with customer data, and implement some (often complex) attribution logic
- More QA
- Share with the marketing team
If you do have a data model you already have a dataset called
marketing_attribution that summarises the key metrics for each ad provider in one table
Answering the question is now achieved with a couple of lines of simple SQL:
SELECT ad_provider, SUM(clicks) AS total_clicks FROM marketing_attribution WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
Data modeling is a mindset
Where did marketing_attribution come from? Of course, the answer is that someone, at some point in the past, followed each of the steps outlined in the first scenario. But rather than stopping at the point of sharing the results with the marketing team, they carried out a few more steps:
- Create dataset marketing_attribution that has data not just for last week, but for all time
- Share that dataset with the rest of the data team (and the entire business, through a BI tool, if it makes sense)
- Schedule that dataset to be updated on a regular basis (e.g. daily)
Data modeling is the result of a shift in mindset: the most effective data teams don't rush to answer each individual question as quickly as possible. The most effective teams invest time upfront, creating simple, intuitive, general purpose data models that will save time in the future. They assume that any questions they're answering will likely be asked again, so they make sure that next time, it will only take a few moments.
The transformational benefits of a data model
It turns out that data models do a lot more than saving you wasted effort in the future. Data models help teams in three key areas:
- You can respond quickly to requests. Data models provide you with a simple route to the answer, often with just a few lines of SQL.
- As data quality issues arise, the data model can be updated to correct them. Over time, the accuracy of your team’s output increases.
- As your team collaboratively builds a set of data models, knowledge is shared from one person to another. Each small improvement benefits the entire team.
The data modeling layer and the data warehouse
As we discussed in a previous chapter, we recommend teams follow the ELT paradigm and centralise all of their raw data in a single cloud data warehouse. The cloud data warehouse is also the ideal environment to create and store the data models which power your team’s analytics.
Because all raw data that’s required to build the data models is available in the warehouse, the transformations required to create the data models can be written as SQL statements. No complex programming languages required!
Most organizations will have a set of data models that are much more complex than this. There will likely be hundreds of sources of data and tens, if not hundreds of modeled datasets.
While the technical complexity of building a data model doesn’t go much further than a good understanding of SQL, it helps to keep a few principles in mind as your data models evolve.
Our suggested approach: principle-led data modeling
The principles we suggest take a lot of inspiration from software engineering. We’ve adapted them to the domain of data modeling, and if followed, these principles should help you build a data model that is:
- easy to maintain
- optimized for analytics
These principles are specific enough to be practically useful while creating your data models. But they’re also general enough to not box you in to an outcome that doesn’t suit your business needs. That said, there are almost certainly specific situations where some of these principles need to be broken - use common sense!
There are many ways you could build your data model. You could use a drag and drop UI. You could use Spark, or Python, or Go! But in our opinion, SQL is the best choice.
- It’s designed specifically for the purpose of transforming data
- It’s widely understood
- It’s quick to learn
There are definitely times when SQL doesn’t make sense, but the default choice should be SQL (more on this here).
Don’t repeat yourself
This principle (aka DRY) is borrowed from software engineering. Software engineers know that a key factor in creating a maintainable codebase is keeping it DRY. What does this mean?
A common example would be a country-to-region mapping statement, e.g.
CASE WHEN country_name IN ('united states', 'canada') THEN 'NA' WHEN country_name IN ( 'united kingdom', 'france', 'germany', 'italy', 'poland' ) THEN 'EU' WHEN country_name IN ('australia') THEN country_name ELSE 'Other countries' END AS country_group
This snippet of code is likely useful in many data models. It might be tempting to write it out by hand (or more likely copy/paste it) each time you create a data model that needs it. This is WET (Write Every Time) code: the same piece of SQL is repeated in several different places.
The problem with this approach arises when it is time to update the country-to-region mapping. You’ll need to update the SQL snippet in all of the places it’s used in your data modeling codebase. If it’s missed in one place, your data models will be inconsistent.
Instead, you should write this snippet of code out once, and reference that piece of code in other queries. Then, updating the regional mapping is simply a case of updating that one snippet: all other references update automatically.
Data modeling is a complex business. It sits at the intersection of raw data and business logic:
- Raw data is constantly changing. The underlying systems which generate the data are constantly evolving: new tables become available, field types change, your organization switches from one system to another. Change is constant.
- Business logic is also constantly changing. There is no such thing as a status quo! Every business has competitors, and constant change and evolution is required to stay successful. This inevitably leads to changes in business logic.
The result is that your data model is in a constant state of flux. For this reason, we recommend an agile approach to data modeling. This means:
- Solve problems when they arise - don’t try to predict the future by creating data models that nobody asked for yet.
- Aim to work on small, incremental updates. If you set out on a 3 month project to update your data model, chances are that by the time you’ve finished, the model will no longer be relevant.
- Avoid the temptation to do too much planning when you first start building a data model. It’s really hard to predict what the organization's needs will be in a few months’ time. Instead, focus on delivering value for current needs, and iterate from there.
Humans are a more precious resource than machines
Modern data teams recognise that the most precious resource to them is their people, and more specifically, their people's time. The simpler your data models are, the easier they will be for everyone to understand and contribute to. Don’t overly optimize for cost or performance until you can be sure that you’ll be saving more than your analysts time is worth.
Design for the end user
This is really an extension of the principle above, but we’re specifically thinking of the end user of the data model, i.e. your business stakeholders. In a modern, data driven organization, that’s essentially the whole company!
The primary goal of your data model is to empower your organization to use analytics to make data-driven decisions. This goal should always be in the back of your mind. If you’re choosing between two implementation options, one of which will be more optimal for storage in your data warehouse, the other being simpler for end users, the latter should be your preference.
Define every data model with a SELECT query
The building block of a data model is an individual transformation. A step that takes some input data, possibly from several sources, and transforms it into some output data (while adding value along the way, e.g. incorporating some business logic).
For your data model to be up to date, each of these transformations needs to be executed at some regular time interval: hourly, daily, weekly, etc. You’ll likely use an orchestration tool to help you manage this schedule (Dataform comes with orchestration baked in as scheduling).
While this sounds simple, it turns out that in practice, it isn’t. There are a myriad of reasons that a schedule might fail.
As such, it’s important to design the individual steps within your data transformations to be idempotent. In practice, this means that the correctness of a transformation’s results are not affected by how often the transformation runs.
When designed this way, your data model will be robust to the (inevitable) mishaps that can interrupt a regular update schedule.
Principles are made to be broken!
While the principles above are generally going to serve you well, each has its exceptions. Data modeling is a complex and challenging process, and mastering it takes time. The principles above should give you a headstart, but there is no substitute for practice. Progressing from being a good data modeler to a great modeler is about getting a sense for exceptional circumstances: the times it makes sense to ignore the principles above.
Data modeling best practices
With these principles in mind, you’re ready to start building a data model that will provide a foundation for successful analytics at your organization. Here are some practical examples of patterns you’ll want to avoid, and what to do instead.
Avoid long SQL scripts
Long SQL queries violate several of the principles above!
DRY: A long SQL query is usually doing more than one thing. It’s cleaning some data, re-aggregating at a different granularity, joining with other data, and performing some complex transformations. If the transformation is instead broken into smaller pieces that each do one thing, then other datasets can rely on the smaller chunks, removing the need to repeat the steps in other places.
Precious humans: long scripts take a lot longer to contribute to. It just takes longer to read and understand a long file! If there is an issue that needs fixing in a 300 line query, you’ll likely need to read the whole script before you can be confident what effect your change will make. If instead the query is broken into several smaller ones, the change can be made without having to understand all 300 lines first.
Use clear schema and dataset naming conventions
Data is the lifeblood of the modern organization. It’s used by every team, every day. For this reason, it’s important to make it as easy as possible to find the data you’re looking for.
Cloud data warehouses organize data into datasets (tables and views) and schemas. As such, the way you organize your data into datasets and schemas has an impact on the data’s ease-of-use.
- Use schemas to organize data into functional areas, like marketing, finance, or support.
- Use table and view names to clearly describe the contents. It should be easy to understand what a single row in the table holds just from reading the name (e.g. daily_user_engagement, sessions, website_hits).
Views > Tables > Inserts
When creating an individual transformation within your data model, you’ll need to decide whether to save the results as a view or a table. Views are saved blocks of SQL that you can interact with as if they’re tables. Tables actually write data into storage in the structure defined by your SQL query.
Views should be the default choice. They’re free and quick to create and modify (after all, they’re just a saved snippet of SQL). That said, if the views are being queried by the users of your data model (e.g. through a BI tool), those users will experience latency, because the SQL behind the view needs to be run each time a query is issued..
Create tables if latency for the end user is important. Because you’re now storing a new chunk of data, this will cost money. And you’ll need an orchestration tool to keep your tables up to date regularly (Dataform handles this for you). But these costs are nothing compared to forcing your users to wait every time they load a report.
For very large datasets, you may choose to save costs and insert new data into the data model (rather than rebuilding each time). This should be a last resort! This pattern adds a lot of complexity, and makes future changes to the model significantly more difficult. Use this update method only if you have a cost or performance problem to solve.
Move logic upstream
Writing DRY code in data modeling often amounts to “moving logic upstream”. What does this mean?
Let’s take an example of an incorrectly formatted customer_id for a 2 day period. Your billing system had an issue for two days and was incorrectly logging customer_ids with a “
#” at the start: customer_id
1234567 was being logged as
Billing information is used all over your data model. You could write some code in each of the places that joins billing information to resolve this issue:
However, this isn’t very DRY! Instead, this logic could be written into a new dataset, billing_events_cleaned, and all downstream dependencies could refer to this dataset instead.
Now, you’ve only written the logic in one place. You’ve moved the logic upstream.
Data modeling is the difference between a data team drowning in ad-hoc requests, and a data team that empowers their organization to be data-driven. In this chapter we’ve discussed what a data model is, how data modelling can help your team, and outlined a set of fundamental principles that, if followed, will help you develop a robust and future proof data model for your organization.
Now you understand the fundamentals of data modeling, it’s time to start equipping yourself with the practical skills you need to manage the data model.