Model Segment data in minutes using the Dataform Segment package | Dataform
Package

Model Segment data in minutes using the Dataform Segment package

The Dataform Segment package helps teams set up core Segment data models with a few lines of code, enabling data teams to spend more time focusing on the specifics of their business

Data modelingData stacks
picture of author

Dan Lee on January 29, 2020

Segment, the customer data infrastructure platform, helps companies collect user event data across web and mobile apps. With a few clicks, this data can be exported to a cloud data warehouse like BigQuery, Redshift or Snowflake. The tables Segment creates in the warehouse are a rich source of potential insights for the data team, but without some preprocessing are complex and time consuming to work with.

We've prepared a Dataform package that creates some core data models with the Segment data enabling quick and easy analysis. The Dataform package can be added to a project with a few lines of code, freeing up data teams to focus on analysing the specifics of their business, rather than rewriting boiler plate code that others have already written.

Why you should model Segment data

Once your data warehouse is connected to Segment, there'll be several new tables in your warehouse:

  • Event tables: For each event triggered in your product, there'll be an event table. Each row in the event table corresponds to one occurence of the event being triggered by the user. All properties associated with the event are in this table. For many Segment setups, there will be hundreds of event tables.
  • Tracks table: This table is a union of all the event tables. But it only contains the properties common to all events (like user_id, timestamp, event_name).
  • Pages tables: Each time a user loads a new page on your website, an entry will be added to this table.
  • Identifies table: One row for each time an identifies() call is made. Generally this would be each time a user logs in, and any time the user updates their info (for example changes their address).

These tables provide analysts with detailed data on users and their activity. But answering common questions requires complex SQL queries:

  • How long do users typically spend on our site, and how has it trended over time?
  • For users visiting our site for the first time, what's typically the last thing they do?
  • How many new users signed up in the last week?

By modelling Segment data into two easy-to-use tables, each of the above questions can be answered with (relatively) simple queries.

Segment data models

segment_sessions

Contains a combined view of track and page events from Segment. Each session is a period of sustained activity, with a new session starting after a 30min+ period of inactivity. Each session contains a repeated field of records which are either tracks or pages. Common fields are extracted out into the top level and type specific fields are kept within two structs: records.track and records.page.

segment_users

Aggregates all identifies calls to give a table with one row per user_id. Identify calls with only an anonymous_id are mapped to the matching user_id where possible.

Simplified query examples

We can now answer those questions about our users with a few lines of SQL:

How long do users typically spend on our site, and how has it trended over time?

select
  date(session_start_timestamp) as date,
  avg(stats.duration_millis) as mean_session_duration_millis,
  approx_quantiles(stats.duration_millis, 101)[safe_offset(25)] as percentile_25,
  approx_quantiles(stats.duration_millis, 101)[safe_offset(50)] as percentile_50,
  approx_quantiles(stats.duration_millis, 101)[safe_offset(75)] as percentile_75
from
  dataform_data_sources.segment_sessions
group by
  date
order by
  date desc

For users visiting our site for the first time, what's typically the last thing they do?

select
  array(select coalesce(track.event, page.url) from unnest(records) as event order by timestamp desc)[safe_offset(0)] as last_action,
  count(1) as session_count
from
  dataform_data_sources.segment_sessions
where
  session_index=1
group by
  last_action
order by
  session_count desc

How many new users signed up in the last week?

select
  count(distinct user_id) as users
from
  dataform_segment.segment_users
where
  first_seen_at > timestamp_sub(current_timestamp(), interval 7 day)

Installing the package

To install the package, add dataform-segment to the list of dependencies in the package.json file in your Dataform project:

{
  "dependencies": {
    "@dataform/core": "1.4.7",
    "dataform-segment": "git+https://github.com/dataform-co/dataform-segment.git#4da1cf4f6b3d4f136e9367dbd8c746f83709735c"
  }
}

Then create a .js file in your definitions/ folder and add the following code:

const segment = require("dataform-segment");

segment({
  segmentSchema: "javascript",
  defaultConfig: {
    schema: "dataform_segment",
    type: "view"
  }
});

The segment package is now installed and you can now see the models in your dependency graph.

Customising the output

Whilst the core data model structure should make sense for most teams, there are some ways in which you may want to customise the data models. This is done by passing inputs to the segment() function:

sessionTimeoutMillis: This sets the length of time (in milliseconds) that should pass after the last event seen by a user before a new session starts. The default value is 30 60 1000 (30 minutes), a common choice for web-based products. However, depending on your business model and your user's typical interaction pattern, you may want to change this.

customPageFields: Each time a user opens a new page on your site, Segment creates a record in the pages table. By default Segment annotates each row with context (e.g. url, timestamp and referrer). It's also possible to add custom fields to any page call. You can include them in your data model using this parameter. There is also customTrackFields (for the tracks table) and customUserFields (for the identifies table).

To customise the models, include the parameters you'd like to override in the segment() call. For example:

const segment = require("dataform-segment");

segment({
  segmentSchema: "javascript",
  // 1 hour session timeout
  sessionTimeoutMillis: 60 * 60 * 1000,
  defaultConfig: {
    schema: "segment_package_example",
    tags: ["segment", "hourly"],
    type: "table"
  },
  customPageFields: ["category"],
  customUserFields: ["email", "name", "company_name"]
});

The segment_sessions and segment_users data models can now be used across your project: added to schedules, added as dependencies to other models, tested with assertions etc.

Continuing to work with packages

Today we are releasing the first version of the Segment package. We expect this package will be updated and improved going forward. To receive updates when new releases of the package are available, you should watch the dataform-co/dataform-segment GitHub repository.

If you would like to contribute to the Segment package (for example adding support for other warehouses), submit a pull request to the master branch. If you'd like some help making a contribution, get in touch via email or Slack.

A full list of Dataform packages can be found at docs.dataform.co.

More content from Dataform

SQL vs R. Which to use for data analysis? illustration
Guest Post

SQL vs R. Which to use for data analysis?

Learn more
Data as a Utility Tool illustration
Guest Post

Data as a Utility Tool

Learn more
CI/CD for ETL/ELT SQL pipelines illustration
Guide

CI/CD for ETL/ELT SQL pipelines

Learn more

Learn more about the Dataform data modeling platform

Dataform brings open source tooling, best practices and software engineering inspired workflows to advanced data teams that are looking to scale, helping you deliver reliable data to the entire organization.