
Segment, the customer data infrastructure platform, helps companies collect user event data across web and mobile apps. With a few clicks, this data can be exported to a cloud data warehouse like BigQuery, Redshift or Snowflake. The tables Segment creates in the warehouse are a rich source of potential insights for the data team, but without some preprocessing are complex and time consuming to work with.
We've prepared a Dataform package that creates some core data models with the Segment data enabling quick and easy analysis. The Dataform package can be added to a project with a few lines of code, freeing up data teams to focus on analysing the specifics of their business, rather than rewriting boiler plate code that others have already written.
Why you should model Segment data
Once your data warehouse is connected to Segment, there'll be several new tables in your warehouse:
- Event tables: For each event triggered in your product, there'll be an event table. Each row in the event table corresponds to one occurence of the event being triggered by the user. All properties associated with the event are in this table. For many Segment setups, there will be hundreds of event tables.
- Tracks table: This table is a union of all the event tables. But it only contains the properties common to all events
(like
user_id
,timestamp
,event_name
). - Pages tables: Each time a user loads a new page on your website, an entry will be added to this table.
- Identifies table: One row for each time an
identifies()
call is made. Generally this would be each time a user logs in, and any time the user updates their info (for example changes theiraddress
).
These tables provide analysts with detailed data on users and their activity. But answering common questions requires complex SQL queries:
- How long do users typically spend on our site, and how has it trended over time?
- For users visiting our site for the first time, what's typically the last thing they do?
- How many new users signed up in the last week?
By modeling Segment data into two easy-to-use tables, each of the above questions can be answered with (relatively) simple queries.
Segment data models
segment_sessions
Contains a combined view of track and page events from Segment. Each session is a period of sustained activity, with
a new session starting after a 30min+ period of inactivity. Each session contains a repeated field of records which are
either tracks or pages. Common fields are extracted out into the top level and type specific fields are kept within two
structs: records.track
and records.page
.
segment_users
Aggregates all identifies calls to give a table with one row per user_id. Identify calls with only an anonymous_id are mapped to the matching user_id where possible.
Simplified query examples
We can now answer those questions about our users with a few lines of SQL:
How long do users typically spend on our site, and how has it trended over time?
select
date(session_start_timestamp) as date,
avg(stats.duration_millis) as mean_session_duration_millis,
approx_quantiles(stats.duration_millis, 101)[safe_offset(25)] as percentile_25,
approx_quantiles(stats.duration_millis, 101)[safe_offset(50)] as percentile_50,
approx_quantiles(stats.duration_millis, 101)[safe_offset(75)] as percentile_75
from
dataform_data_sources.segment_sessions
group by
date
order by
date desc
For users visiting our site for the first time, what's typically the last thing they do?
select
array(select coalesce(track.event, page.url) from unnest(records) as event order by timestamp desc)[safe_offset(0)] as last_action,
count(1) as session_count
from
dataform_data_sources.segment_sessions
where
session_index=1
group by
last_action
order by
session_count desc
How many new users signed up in the last week?
select
count(distinct user_id) as users
from
dataform_segment.segment_users
where
first_seen_at > timestamp_sub(current_timestamp(), interval 7 day)
Installing the package
To install the package, add dataform-segment
to the list of dependencies in the package.json
file in your Dataform project:
{
"dependencies": {
"@dataform/core": "1.4.7",
"dataform-segment": "git+https://github.com/dataform-co/dataform-segment.git#4da1cf4f6b3d4f136e9367dbd8c746f83709735c"
}
}
Then create a .js
file in your definitions/
folder and add the following code:
const segment = require("dataform-segment");
segment({
segmentSchema: "javascript",
defaultConfig: {
schema: "dataform_segment",
type: "view"
}
});
The segment package is now installed and you can now see the models in your dependency graph.
Customising the output
Whilst the core data model structure should make sense for most teams, there are some ways in which you may want to customise
the data models. This is done by passing inputs to the segment()
function:
sessionTimeoutMillis
: This sets the length of time (in milliseconds) that should pass after the last event seen by a user
before a new session starts. The default value is 30 _ 60 _ 1000 (30 minutes), a common choice for web-based products. However,
depending on your business model and your user's typical interaction pattern, you may want to change this.
customPageFields
: Each time a user opens a new page on your site, Segment creates a record in the pages
table. By
default Segment annotates each row with context (e.g. url
, timestamp
and referrer
). It's also possible to add custom
fields to any page call. You can include them in your data model using this parameter. There is also customTrackFields
(for
the tracks table) and customUserFields
(for the identifies table).
To customise the models, include the parameters you'd like to override in the segment()
call. For example:
const segment = require("dataform-segment");
segment({
segmentSchema: "javascript",
// 1 hour session timeout
sessionTimeoutMillis: 60 * 60 * 1000,
defaultConfig: {
schema: "segment_package_example",
tags: ["segment", "hourly"],
type: "table"
},
customPageFields: ["category"],
customUserFields: ["email", "name", "company_name"]
});
The segment_sessions
and segment_users
data models can now be used across your project: added to schedules, added as dependencies to other models, tested with assertions etc.
Continuing to work with packages
Today we are releasing the first version of the Segment package. We expect this package will be updated and improved going forward. To receive updates when new releases of the package are available, you should watch the dataform-co/dataform-segment GitHub repository.
If you would like to contribute to the Segment package (for example adding support for other warehouses), submit a pull request to the master
branch. If you'd like some help making a contribution, get in touch via email or Slack.
A full list of Dataform packages can be found at docs.dataform.co.