Over the last year at Dataform we have spoken to hundreds of data teams that are migrating from an ETL to ELT approach in their data stack. ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are processes for moving data from one system to another (data sources to a data warehouse). The fundamental difference between these two approaches lies in how the raw data is managed, at which stage it is loaded into the warehouse and how analysis is then performed.
In ETL data moves from the data source, to staging and then into the warehouse. All transformations are performed before the data is loaded into the warehouse. ELT offers a modern alternative to ETL where analysts load data into the warehouse before transforming it, supporting a more flexible and agile way of working.
In this article we’ll discuss the differences between these two approaches and what the key benefits are of making the flip to an ELT approach.
The 3 stages
Regardless of whether it’s ETL or ELT, both processes involve the following steps:
- Extract: Source data is extracted from the original data source in an unstructured format. In traditional ETL processes this data is put into a temporary staging repository.
- Transform: Additional transformation must be done to clean and model data before it can be practically useful for other analytics applications.
- Load: In the ELT model, data is copied then copied directly into the data warehouse without significant modification. In an ETL model, data would be transformed into a suitable format before loading it into the warehouse.
What is ETL?
ETL (extract, transform, load) has been the traditional approach in analytics for several decades. It was originally designed to work with relational databases which have historically dominated the market. ETL requires the transformations to happen before the loading process. Data is extracted from data sources and then deposited into a staging area. Data is then cleaned, enriched, transformed and finally loaded into the data warehouse.
Data teams using this approach have to predict all the uses cases for the data before any analysis is even performed and then create transformations accordingly. This often means that data teams are left waiting as they can’t access any information until the whole process is completed.
Because Engineering teams typically own the extract and transform process, analysts have little visibility into the logic that has been used for the transformation process. This makes it hard for analysts to understand exactly what the data represents, often leading to incorrectly drawn conclusions. And because the analysts have little visibility, they can’t help the engineering team fix any bugs with the transformation code.
What is ELT?
ELT is a modern variation of ETL where data cleaning, enrichment and transformation happen after the loading process. This is enabled by the fact that modern cloud data warehouses are extremely scalable and separate storage from compute resources. So essentially the main difference between ETL and ELT is the order that these steps take place.
With the transformation happening in the warehouse, it’s typically defined using SQL. This allows analysts to contribute to (or indeed entirely own) the transformation logic.
So why is it beneficial to transform your data after loading it into the data warehouse?
- Agility: all the data is stored in the warehouse and readily available to use. You don’t have to think about how to structure the data before you load it into the warehouse. The data modeling to transform the raw data can be set up as and when it is needed.
- Simplicity: Transformations in the data warehouse are generally written in SQL, a language that the entire data team (data engineers, data scientists, data analyst) understands. This allows the entire team to contribute to the transformation logic.
- Self service analytics: If all of your raw data is within the warehouse, you can use BI tools to drill down from aggregated summary statistics to the raw data underlying them .
- Fixing bugs: If you find errors in your transformation pipeline, you can fix the bug and re-run just the transformations to fix your data. With an ETL approach, the entire extract-load-transform process would need to be re-run.
The ETL process was developed in order to cope with limitations of traditional legacy data warehouses such as the high costs of computation and storage. With the rapid growth of cloud based solutions and the decreasing costs of cloud based storage, ETL is becoming slightly out-dated.
There are still some good situations where a traditional ETL approach makes sense, or should be combined with ELT (ETLT).
- Data scrubbing - removing PII or other sensitive data before it hits the warehouse and can be accessed by everyone.
- Extremely large volumes of data - for example we may not want to store binary data for images or user generated content in our warehouse directly as it could be costly or slow.
- Streaming - most data warehouses don’t support streaming transformations. These can reduce both latency and cost particularly when data volumes are large.
The main advantage of using an ELT approach is that you can move all raw data from a multitude of sources into a single, unified repository (a single source of truth) and have unlimited access to all of your data at any time. You can work more flexibly and it makes it easy to store new, unstructured data. Data analysts and engineers save time when working with new information as they no longer have to develop complex ETL processes before the data is loaded into the warehouse.
The Dataform ELT framework
Dataform is a framework fundamentally designed around an ELT approach. Raw data is piped into your warehouse where Dataform sits and conducts transformations as and when you need them. It helps to automate some of the manual tasks involved in building and maintaining a data pipeline, allowing data teams to focus on what they want to be doing; gaining valuable insights for the business.