Accelerate your Dataform modeling using Stitch ETL

picture of authorLee Schlesinger- Content Marketing Manager @ Stitchon 01 Oct 2019

A tutorial on how to use Dataform and Stitch together to power your company's analytics

Dataform is a powerful data modeling platform, but before you get to tap into its powerful features and intuitive workflow, you have to get your hands on the data. The challenge for most organizations is that all of that information sits across numerous cloud applications (such as Salesforce and Google Ads) and internal databases.

So how do you begin combining data from such diverse sources to gain insight into your business? You need a data stack with several layers. At the base is a data warehouse that consolidates all of your data in a single location. Most businesses take advantage of cloud data warehouses such as Amazon Redshift, Google BigQuery, Snowflake, or Microsoft Azure SQL Data Warehouse.

You can extract data that you have stored in SaaS applications and databases and load it into the data warehouse using an ETL (extract, transform, load) tool. Stitch is a simple, powerful ETL service for businesses of all sizes, from startups to large enterprises. It can replicate data from more than 100 data sources. You can sign up and begin moving data to a data warehouse in five minutes.

Most businesses get more value out of their data as they integrate more data sources; for our example, we'll look at data from two separate SaaS applications.

To prepare, transform, and model data from diverse sources, you can take advantage of a data modeling tool like Dataform.

Finally, once you have data in your data warehouse you can use business intelligence (BI) software or tools for data analytics or machine learning.

Goal:

I wanted to use some of Stitch’s real data to build a data visualization for this post. Specifically, I was curious to see how many support conversations came from each of our pricing tiers. To find out, I needed to use billing data stored in Salesforce and join it with support conversation data from Intercom to create a visualization of support conversations.

Step 1: Setting up a data warehouse

My first step was to set up BigQuery to hold the data I'd be replicating from Salesforce and Intercom. Following the Stitch documentation, I added to my Google user the permissions that Stitch would need to load data to BigQuery.

Step 2: Using Stitch for ETL

Using Stitch makes extracting data from a source and loading it into a data warehouse easy. I logged in to Stitch and added new integrations for Salesforce and Intercom. From Salesforce I selected the Account table to replicate. Stitch's Intercom integration automatically extracts data from all available tables. From both, I chose the period from which to replicate data.

Once I'd set up my integrations in Stitch, I added BigQuery as my destination and specified a BigQuery project name. Within a few minutes, Stitch had extracted the data I specified and loaded it into BigQuery.

Step 3: Using Dataform for data modelling

Now I needed to create the datasets to use for analysis. For this I turned to Dataform, a tool for managing data transformations in cloud data warehouses. Dataform lets you define each step of your data transformation pipeline, easily create dependencies between each dataset, and run schedules to keep your data up to date. To ensure reliable data for your analytics you can add data tests to your pipeline, and everything is version controlled, enabling large teams to collaborate with ease.

Once in Dataform, I created a new project and was prompted to configure my data warehouse. Once I had selected Google BigQuery as my data warehouse, I then followed Dataform’s instructions on how to create BigQuery credentials.

Voilà – Dataform was connected with my BiqQuery data.

For this transformation I knew I needed to create three tables: one for Intercom data, one for Salesforce data, and one that joined the two together.

Table 1: userco

config { type: "table" }

WITH
 userco AS (
 SELECT
   users.id AS user_id,
   company.id AS company_id,
   company.company_id AS stitch_id
 FROM
   `reporting01-216119.intercomintegration.users` AS users,
   UNNEST(companies.companies) AS company
   WHERE  company.company_id IS NOT  NULL)

   SELECT *
   FROM userco

Table 2: pricing_tiers

config { type: "table" }

 WITH pricing_tiers AS (
 SELECT
   CAST(cid__c AS STRING) AS stitchid,
   case
     when tier__c LIKE '%Starter%' then 'Starter'
     when tier__c LIKE '%Basic%' then 'Basic'
     when tier__c LIKE '%Premier%' then 'Premier'
     when tier__c LIKE '%Custom%' then 'Enterprise'
     else tier__c
     end as tier
 FROM
   `reporting01-216119.salesforceintegration.Account` AS account
 WHERE
   account.isdeleted IS FALSE
   AND tier__c IS NOT NULL)

   SELECT *
   FROM pricing_tiers

Table 3: combined_support_conversations

config { type: "table" }
  WITH combined_support_conversations AS (
 SELECT
 conversations.id AS convo_id,
 count(conversations.id) AS convo_count,
 conversations.user.id AS convo_user_id,
 userco.user_id,
 userco.company_id,
 userco.stitch_id,
 tier
FROM
 `reporting01-216119.intercomintegration.conversations` AS conversations
JOIN
 ${ref("userco")} AS userco
ON
 conversations.user.id = userco.user_id
JOIN
${ref("pricing_tiers")}
ON
 stitchid = stitch_id
WHERE
 conversations.user.type != "lead"
GROUP BY tier, convo_id, convo_user_id, user_id, company_id, stitch_id
ORDER BY convo_count)

SELECT *
FROM combined_support_conversation

This query brought together columns from multiple tables in two different schemas. The SQL code excludes prospects by retrieving only companies for which we have account IDs, and selects current customers by excluding companies that are marked as deleted. It also consolidates clients on annual and monthly billing programs in each of our plans into a single number for each tier.

I could easily preview the results for each table in the Dataform UI before publishing the table to my warehouse.

Dataform allows you to easily reference another dataset using the ref {} function without having to provide the full SQL dataset name. It then creates a dependency tree (which can be found by clicking the Modelling tab) that nicely visualizes the tables and their dependencies. This has important meaning for your pipeline as dependent scripts are only run when their dependencies complete successfully.

Once I had my clean datasets ready to use I could use any business intelligence tool to do the analysis. For this report I used Mode.

Step 4: Building a sample report

The next step was to turn the table data into a chart. I located my combined_support_conversations table in Mode and created a bar chart. I dragged the tier field to the X-Axis box and convo_count to the Y-Axis. I also sorted the bars in descending order by y-axis values. This chart showed the data I wanted to represent. Mode lets you specify custom labels and titles. When you have just the look you like, you can aggregate your charts into dashboards.

Put Dataform and Stitch to work for you

So there you have it – a quick walk through of the process of using an ETL tool like Stitch and a data modeling tool like Dataform to move data from multiple sources into a data warehouse and transform and model it for reporting.