Dataform enables analysts to manage all data processes in the warehouse, turning raw data into the clean datasets needed for analytics.
SQLX is an open-source extension of SQL. It adds features to SQL that make development faster, more reliable, and more scalable.
config { type: "view" }
select
  country as country,
  device_type as device_type,
  sum(page_views) as page_views,
  sum(sessions) as sessions
from ${ref("sessions")}
group by 1, 2
Define everything as a select statement. Tell Dataform what type of relation you want to create. Dataform manages any create, insert, and drop boilerplate. See documentation.
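As a rough illustration of the boilerplate this removes, the sketch below wraps a bare select in a create statement chosen by the declared type. The function name `wrap_select`, the two templates, and the exact DDL are assumptions made for illustration; the SQL that Dataform actually generates depends on the warehouse and may differ.

```python
def wrap_select(relation_type: str, name: str, select_sql: str) -> str:
    """Wrap a bare select statement in DDL for the requested relation type.

    Hypothetical helper: a toy model of the boilerplate a framework can
    generate, not Dataform's actual compiler output.
    """
    templates = {
        "view": "create or replace view {name} as\n{select}",
        "table": "create or replace table {name} as\n{select}",
    }
    if relation_type not in templates:
        raise ValueError(f"unsupported relation type: {relation_type}")
    # The select text is passed as a value, so braces inside it are left alone.
    return templates[relation_type].format(name=name, select=select_sql.strip())


if __name__ == "__main__":
    print(wrap_select("view", "session_stats", "select country from sessions"))
```

The author only writes the select and the type; switching the relation from a view to a table is then a one-word change.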
select
  *
from ${ref("users")}
left join ${ref("sessions")} using (user_id)
left join ${ref("pageviews")} using (user_id)
Use Dataform's built-in ref function to automatically build a dependency graph between datasets and actions. See documentation.
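To make the idea of a dependency graph concrete, here is a minimal sketch that scans SQLX text for `${ref("...")}` calls and orders datasets so that each one runs after the datasets it references. The regular expression, the function names, and the use of Python's `graphlib` are assumptions for illustration; this is not how Dataform itself is implemented.

```python
import re
from graphlib import TopologicalSorter

# Matches ${ref("name")} and captures the referenced dataset name.
REF_PATTERN = re.compile(r'\$\{ref\("([^"]+)"\)\}')


def find_refs(sqlx: str) -> set[str]:
    """Return the names of all datasets referenced in a SQLX body."""
    return set(REF_PATTERN.findall(sqlx))


def execution_order(models: dict[str, str]) -> list[str]:
    """Order datasets so that every dependency comes before its dependents.

    `models` maps a dataset name to its SQLX body. Referenced names that are
    not keys in `models` still appear in the result, as upstream sources.
    Raises graphlib.CycleError if the references form a cycle.
    """
    graph = {name: find_refs(body) for name, body in models.items()}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    models = {
        "sessions": "select * from raw.sessions",
        "user_stats": 'select * from ${ref("users")} '
                      'left join ${ref("sessions")} using (user_id)',
    }
    print(execution_order(models))
```

Because the graph is derived from the queries themselves, adding a new `ref` to a query is enough to change the run order; there is no separate dependency list to maintain.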
tests {
  assertions: {
    uniqueKeys: [
      "column1",
      ["column1", "column2"],
    ],
    nonNull: ["column1"],
    rowConditions: [
      "column1 > 0",
      "column2 is null or column2 >= column1"
    ]
  }
}
Define data quality checks in the same file as the model you're testing. Automatically receive alerts when those checks fail. See documentation.
docs {
  description: "Stats about app sessions",
  columns: {
    country: "Country of the user",
    device_type: "Either 'mobile', 'desktop' or 'tablet'"
  }
}
Write data documentation that is automatically included in Dataform's data catalog. Share this with your team, or integrate with other data cataloging tools. See documentation.
The core Dataform framework is open source and is bundled with a command-line tool that can be used to initialize, test, and run Dataform projects. The Dataform CLI is useful for local development and can also be integrated into an Airflow pipeline using Airflow’s BashOperator.
Dataform detects changes in your source data and only updates downstream datasets when it needs to, helping you reduce end-to-end data latency and save on BigQuery costs.
With run caching enabled, Dataform updates a dataset only if its definition or its source data has changed, keeping your pipelines fast and reducing BigQuery costs.
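The caching rule can be sketched as a fingerprint comparison: hash the dataset definition together with a version marker for each source, and rerun only when the fingerprint differs from the one stored after the last run. The function names and the use of a last-modified string as the source version are assumptions for illustration; Dataform's actual change detection may work differently.

```python
import hashlib
from typing import Optional


def cache_key(definition: str, source_versions: dict[str, str]) -> str:
    """Fingerprint a dataset from its definition and its sources' versions.

    `source_versions` maps each source name to a hypothetical version marker,
    such as a last-modified timestamp.
    """
    digest = hashlib.sha256()
    digest.update(definition.encode("utf-8"))
    # Sort the sources so the key does not depend on dictionary order.
    for name in sorted(source_versions):
        digest.update(f"\0{name}={source_versions[name]}".encode("utf-8"))
    return digest.hexdigest()


def needs_update(
    definition: str,
    source_versions: dict[str, str],
    previous_key: Optional[str],
) -> bool:
    """Return True when the definition or any source changed since last run."""
    return previous_key != cache_key(definition, source_versions)
```

A dataset that has never run has no stored key, so `needs_update` returns True for it; after a successful run, the new key would be stored for the next comparison.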
Dataform is designed to scale to thousands of data models and can compile your entire project into SQL in seconds, giving you near-instant feedback on your code even as your team and your project's complexity grow.
Designed for modern, engineering-driven data teams on BigQuery.