As an analyst, your role is to ensure that data informs decision making across your organization. In a modern organization, there are several challenges you’ll be dealing with:
- Decision making happens quickly - so you need to be able to swiftly respond to the constant influx of data questions.
- The volume of data you’re working with is either already large, or will be soon. You need a data storage solution that’s going to scale.
- You’ll be expected to work with hundreds of different data sources, from many different source systems. You need to be able to manage this complexity and be able to quickly find the data you’re looking for.
- You don’t want to spend much time managing data infrastructure. You’re already incredibly busy!
The cloud data warehouse
Fortunately, there is a solution to these requirements. Over the past decade, cloud data warehousing has made significant advances. For modern analytics teams, a cloud data warehouse is the de-facto choice.
- Terabytes of data can be queried in seconds.
- Storage and compute is near-infinitely scalable, from both a performance and cost perspective.
- They natively support semi-structured data as JSON, making it easy to store and work with a variety of different data formats.
- They are managed services, leaving your data team to focus on providing value to your business.
On top of this, cloud data warehouses are SQL based. Working with big data does not require engineering resources: the analyst teams can (and should) have full ownership and control of the data available to them.
Single source of truth
The cloud data warehouse has all the properties you’re looking for, and will act as the foundation for your data team - a place to centralize of your organization's data. This is a crucially important step, for two major reasons:
- Simplicity: you’re probably working with tens, if not hundreds, of different data sources. If they’re all stored in different places, you’ll spend too much of your time working out how to access and join across datasets. Bringing everything into one place keeps things simple and allows you to move quickly.
- Consistency: for analysts to be successful, they need to be trusted. If data is stored in many different places, you’ll find it almost impossible to ensure that a consistent view of data is presented to the business. With all your data in one place, you can easily monitor and ensure consistency. Your organization knows where to look for the truth.
To achieve the goal of a single source of truth, you need to find a solution for moving data from each of your organization's source systems into your data warehouse. For this, we recommend the ELT paradigm.
ELT: a warehouse-centric model
Data warehousing technologies are advancing quickly. The cloud data warehousing revolution means more and more companies are moving away from an ETL approach and towards an ELT approach for managing analytical data.
What is ETL?
ETL (extract, transform, load) has been the traditional approach to analytics for several decades. In the ETL paradigm, data moves from the data source, to staging, and then into the warehouse. All transformations are performed before the data is loaded into the warehouse.
Data teams using this approach have to predict how the data will be used before any analysis is even performed and then create transformations accordingly. This often means that data teams are left waiting as they can’t access any information until the whole process has completed.
Because engineering teams typically own the extract and transform process, analysts have little visibility into the logic that has been used. This makes it hard for analysts to understand exactly what the data represents, often leading to incorrectly drawn conclusions. When there are bugs in the transformation code, analysts can’t help the engineering team fix them.
Cloud data warehouses have finally made it cost-effective to store all of a company’s raw data in a central location, meaning data no longer needs to be transformed before it is loaded into a data warehouse: instead, transformations can be run after the data is loaded, within the data warehouse.
ELT offers a modern alternative to ETL in which analysts load data into the warehouse before transforming it. This supports a more flexible and agile way of working. Since the transformation happens in the warehouse, it can be defined using SQL. This allows analysts to contribute to (or entirely own) the transformation logic.
So why is it beneficial to transform your data after loading it into the data warehouse, rather than before?
- Agility: all the data is stored in the warehouse and readily available to use. You don’t have to think about how to structure the data before you load it into the warehouse. The data modelling to transform the raw data can be set up as and when it is needed.
- Simplicity: Transformations in the data warehouse are generally written in SQL, a language that the entire data team (data engineers, data scientists, data analyst) understands. This enables the entire team to contribute to the transformation logic.
- Self service analytics: If all of your raw data is within the warehouse, you can use BI tools to drill down from aggregated summary statistics to the raw data underlying them.
- Fixing bugs: If you find errors in your transformation pipeline, you can fix the bug and re-run just the transformations to fix your data. With an ETL approach, the entire extract-load-transform process may need to be re-run.
Fundamentally the main advantage of using an ELT approach is that it makes life easier for the data team. All of the data is easily accessible in one place, and mistakes in your transformation logic can be fixed easily because you can always get access to the raw underlying data when needed.
A simple ELT example
Now we know why an ELT approach makes sense, let’s look at a simple example of it in practice. Let’s take the e-commerce shop that we spoke about in previous chapters and imagine we want to start building out a data stack for it. Your business has three data sources:
- Shopify for the web store
- Stripe to process payments
- Salesforce as their CRM
You want to use all that data to build reports, track KPIs, create dashboards and conduct ad hocs analysis to understand the business. One of the fist tasks would be to create a dashboard in a BI tool containing all the information you have about customers. You want everyone in the company to use this dashboard in order to answer questions they have about customers.
You want to use all the data you have (coming from Shopify, Stripe, and Salesforce) to create a unified dashboard.
Extraction and Loading
The first step in building a data stack is to extract raw data from all sources and load it in the data warehouse. You can achieve this with third party tools, or by writing custom scripts.
At this stage the data loaded in your warehouse is raw and unprocessed. Each of those sources will generate dozens of tables in your data warehouse. The data is not really usable for analytics. Answering simple questions like “Which customers order the most products?" would probably take a considerable amount of time and you’d have to write complex queries.
Transform the data
To solve this problem, you want to take these raw data tables and turn them into prepared datasets that can be used for analytics.
For our example of creating a customer dashboard, you will want to join the data from the different sources together, normalize the fields and filter bad data to create a unique customers table. That table will contain all the information you have about your customers and will let you answer questions like “Which customers order the most products?” very quickly with a simple SQL query.
This is the step that Dataform helps you complete. Dataform helps you and your team turn the raw data in your warehouse into a suite of data tables to represent your business.
This customers table will be the table you will use for your dashboard. That means that everyone in the company will see that data and rely on it to make decisions. You want this table to be:
- Tested: running data quality tests on this data means that you can spot errors when they occur and fix them before they appear downstream in your dashboards.
- A data assertion is a query that looks for problems in a dataset. If the query returns any rows then the assertion fails- this makes it easy to diagnose and fix the problem
- Refreshed frequently: this ensures that your dashboards always show the latest information.
- You can set up schedulesin Dataform to run any selection of your datasets at a user-specified frequency. Running your code on a repeating schedule ensures output data is always kept up to date.
- Documented: this is important so everyone knows what the different fields in your table mean. It’s also useful for things like onboarding new team members as your team grows.
- You can document datasets in Dataform in the same file as your transformation logic and data quality tests. It is then displayed in a data catalog.
Use transformed data in your analytics
After your data is transformed, you can use BI and other analytics tools to build dashboards and conduct analysis. You can now create a dashboard which sits on top of your customers table and the whole organization can use to answer their questions about customers. We know the table is regularly updated, tested and well documented, so that everyone will get consistent answers to their questions and it can act as the single source of truth.
This article has given you an understanding of how to set up the foundation for modern analytics: a cloud data warehouse with a full view of data across your organization, loaded from source as-is in an ELT set up.
Getting started has never been easier:
- It’s free to create accounts in BigQuery and Snowflake.
- Stitch has connectors for hundreds of data sources and is free to get started (it’s free forever if you’re processing less than 5 million rows of data or less per month).
- Dataform helps you transform your data and has a free developer plan.
- Google Data Studio is a free BI tool for creating simple dashboards and charts if you have a Bigquery warehouse. Mode is another option which is free to get started.
For a more detailed step by step tutorial on how to set up a data warehouse, extract & load the data with Stitch and model it using Dataform you can read this post.
There are of course other options, and in some cases those other options may make more sense. But if you’re not sure where to start, the recommendations above are likely to work for you.
With the foundations laid, you can move on to creating a data model.