JSON file splitting in Snowflake | Dataform
Data Warehouse Guide


Learn how to split a large JSON file into multiple smaller files within Snowflake.


We've all had those days.

You know, the ones where we unexpectedly end up with a monstrously large JSON file that we need to load into Snowflake.

Rather than making you go back to the source, let's guide you through how to split the beast up into smaller files and tell you why this matters when using Snowflake.

Understanding Snowflake

In Snowflake, costs are directly tied to performance.

If you want more performance, you'll need to spend more. Conversely, if you want to save on costs, you'll need to accept lower performance.

Understanding how these cost factors tie together enables you to try out different combinations of query types and warehouse sizes, helping you to find the right balance of cost and performance.

The Three Costs

At its simplest, Snowflake costs are incurred for using three types of resources - Data Storage, Cloud Services and Virtual Warehouses (Compute).

1. Data Storage

There are three primary scenarios involving data storage in Snowflake:

  • Stages - These are used for data loading, where data from an external source is initially uploaded to Snowflake and stored in a stage. It is then copied into a Snowflake table using bulk data loading.
  • Tables - A useful feature here is that Snowflake automatically compresses all table data.
  • Fail-safes - "Fail-safe provides a (non-configurable) 7-day period during which historical data is recoverable by Snowflake. This period starts immediately after the Time Travel retention period ends."

2. Cloud Services

Within Snowflake, there are a series of background services that require compute to operate and will, therefore, consume some of your credits.

These include tasks such as:

  • Infrastructure and Metadata management
  • Query parsing and optimisation
  • Authentication and access control

3. Virtual Warehouses

When you use virtual warehouses, the costs you incur are determined by the number of Snowflake Credits the warehouses consume. This, in turn, is based on:

  1. The size you select for your warehouse (e.g. Large, X-Large).
  2. The number of clusters you use (applicable if you're using a multi-cluster warehouse).
  3. The number of servers per cluster, which is linked to the chosen warehouse size.
  4. The duration of time that each server, in each cluster, runs for.
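As a rough illustration of how those factors multiply together, here is a back-of-the-envelope estimate. The per-hour rate below (8 credits/hour for a Large warehouse) matches Snowflake's published rates for standard warehouses, but the cluster count and running time are made-up numbers, and this ignores per-second billing granularity:

```shell
# Back-of-the-envelope credit estimate (illustrative numbers only).
# Assumes a Large warehouse at 8 credits per hour, per cluster.
CREDITS_PER_HOUR=8   # Large warehouse
CLUSTERS=2           # clusters running in a multi-cluster warehouse
HOURS=3              # total running time per cluster

TOTAL_CREDITS=$(( CREDITS_PER_HOUR * CLUSTERS * HOURS ))
echo "Estimated credits consumed: $TOTAL_CREDITS"
```

Doubling the warehouse size roughly doubles the per-hour rate, which is why right-sizing matters.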

Your Data & Snowflake

Now that you understand, at a high level, how you _could_ optimise your warehouse, it's time to look at your data.

Here are the key points you need to consider ahead of loading it into Snowflake.

1. File Sizes

When loading data into Snowflake, it's recommended to split large files into multiple smaller files - between 10MB and 100MB in size - for faster loads.
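A quick sanity check after splitting is to flag any files that fall outside that window. This sketch assumes your split files sit in the current directory with a .json extension; both are illustrative:

```shell
# Flag .json files in the current directory that fall outside the
# recommended 10-100 MB window for Snowflake bulk loads.
find . -maxdepth 1 -name '*.json' \( -size -10M -o -size +100M \) -print
```

Any paths it prints are candidates for merging (too small) or further splitting (too large).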

2. The VARIANT Data Type

JSON can be stored inside Snowflake in a few different ways. You'll likely end up using the VARIANT data type more often though.

Why? – The VARIANT data type is described by Snowflake as "a tagged universal type, which can store values of any other type, including OBJECT and ARRAY, up to a maximum size of 16 MB compressed."

In effect, it can hold up to 16 MB of any data type that Snowflake supports. This should make it your default choice when you look to import and operate on semi-structured data within Snowflake.

The 16 MB size limit is a critical point to remember when you think of your monstrous JSON file. You need to be aware of this size limit on your rows when you load your newly split files into Snowflake.
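If your split files hold one JSON record per line (NDJSON), you can check the largest record with nothing more than awk. The file name below is illustrative:

```shell
# Report the largest single record, in bytes, of a newline-delimited
# JSON file. Anything approaching 16 MB will not fit in a VARIANT column.
awk '{ if (length($0) > max) max = length($0) }
     END { printf "largest record: %d bytes\n", max + 0 }' records.json
```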

3. Using STRIP_OUTER_ARRAY

Rather than loading an entire outer array into a single VARIANT row, Snowflake recommends that you enable the STRIP_OUTER_ARRAY file format option for the COPY INTO command. This removes the outer array structure and loads the records into separate table rows.
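STRIP_OUTER_ARRAY happens inside Snowflake during the load, but you can see the equivalent transformation locally with jq: the outer array disappears and each element becomes its own record.

```shell
# Locally mimic what STRIP_OUTER_ARRAY does during a load:
# the outer array is removed, one compact JSON object per line.
echo '[{"id":1},{"id":2}]' | jq -c '.[]'
# {"id":1}
# {"id":2}
```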

Splitting your JSON File

You now know the why, so it's time for the how!

There are a wide variety of techniques you can use to split up JSON files. Here's an overview of some of your options to get you started on the right path.

Using jq

jq, in its own words, is "a lightweight and flexible command-line JSON processor. You can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text."

If you don't have the experience to dive into other solutions such as Node.js or Python, the chances are you've still used the command-line before, making jq an ideal tool for what we want to achieve.

Windows Installation

To access the command line on Windows, you'll need the PowerShell application. You can open it via the Start menu, by searching for it, or by using Run and entering "PowerShell".

From PowerShell, you can then install the Windows package manager Chocolatey, and then install jq with the following command:

chocolatey install jq

OS X Installation 

To install jq on OS X, we'd recommend using the package management system Homebrew.

Homebrew then works via OS X's "Terminal" application, which gives you access to the Bash command line.

Follow the Homebrew installation instructions, and then use the following command to install jq:

brew install jq

The Basics of Invoking jq

Example:

jq -r '.musicObjects[] | [.id, .title, .principalOrFirstMaker, .webImage.url] | @csv' jq_abc.json > jq_abc.csv

Breakdown: 

  1. jq calls the jq program.
  2. -r sets the "Raw Output" option.
  3. The actual filter text is placed between single quotes.
  4. jq_abc.json indicates that jq should read JSON from the file jq_abc.json.
  5. jq_abc.csv tells the command line to write jq's output into a file named jq_abc.csv.
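You can try the same pattern without downloading anything by piping a single object in on stdin. The fields below are made up for illustration:

```shell
# -r plus the @csv formatter turns the selected fields into one CSV row.
echo '{"id":"SK-C-5","title":"The Night Watch"}' | jq -r '[.id, .title] | @csv'
# "SK-C-5","The Night Watch"
```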

Solution 1 - Splitting your file by Objects

Using a single call to jq, and assuming that your original JSON is in the standard format ({"item1": {...}, "item2": {...}, ...}), here's how you could produce sequential files like item1.json, item2.json.

jq -cr 'keys[] as $k | "\($k)\n\(.[$k])"' input.json |
while read -r key; do
  read -r item
  printf "%s\n" "$item" > "/tmp/$key.json"
done

Your output will be a single file per object, which in many cases will result in a well-formatted set of files ideal for loading into Snowflake.

However, this solution doesn't force any specifics around file sizes, so you could end up with files larger than recommended.

Solution 2 - Splitting your file by Chunks

If you find that the previous solution is outputting files that are too large, then here's another potential approach.

To trial this option, we'd suggest using a freely available and, importantly, extensive data set, such as the San Francisco City Lots spatial data layer.

Step A 

After following the initial instructions that come alongside the data to convert it to JSON, we should find out what root-level keys are available:

jq 'keys' ./citylots.json

By running the above command, we can see that 'features' is the key holding all the data.

Step B

Based on this, let's set up our solution below to use 1000 items per chunk, and then see what the resulting file sizes are.

Example:

ChunkSize=1000
ArrayLength=$(jq '.features | length' ./citylots.json)
StartIndex=0
OutputCounter=1

while [ $StartIndex -lt $ArrayLength ]
do
  EndIndex=$(( StartIndex + ChunkSize ))
  if [ $EndIndex -gt $ArrayLength ]; then EndIndex=$ArrayLength; fi
  # jq array slices are end-exclusive, so this emits items StartIndex..EndIndex-1
  jq ".features[$StartIndex:$EndIndex]" ./citylots.json > ./output-$OutputCounter.json
  StartIndex=$EndIndex
  OutputCounter=$(( OutputCounter + 1 ))
done

If you've used the suggested data set, then on completion, you should have a set of files that are all around 1000KB in size.

To work towards the 10-100MB file sizes recommended by Snowflake, simply increase the chunk size in the solution above.
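A hedged way to pick that chunk size is to divide your target file size by the average bytes per record. This sketch assumes the same citylots.json file as above; the 50 MB target is just an example:

```shell
# Estimate how many items per chunk would give roughly 50 MB files.
TARGET_BYTES=$(( 50 * 1024 * 1024 ))
TOTAL_BYTES=$(wc -c < ./citylots.json)
TOTAL_ITEMS=$(jq '.features | length' ./citylots.json)
AVG_BYTES=$(( TOTAL_BYTES / TOTAL_ITEMS ))
echo "use roughly $(( TARGET_BYTES / AVG_BYTES )) items per chunk"
```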

Solution 3 - Streaming

The performance of the two previous approaches depends on the memory you have available.

If you're finding this to be an issue, then you could look at streaming the data, rather than loading it into memory.  It's a more complicated approach but one that results in far better performance.

To do this, you could use either Python or Node.js. In the case of the latter, there are a range of modules for streaming JSON, such as JSONStream and stream-json.

If you're more comfortable with Python, there are also streaming JSON parsers available, such as ijson.
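And if you'd rather stay with jq, note that it has a streaming parser of its own: with --stream it reads the input incrementally instead of building the whole document in memory. A minimal sketch, assuming a file whose top level is one big array (the file name is illustrative):

```shell
# Stream a huge top-level JSON array without loading it all into memory.
# truncate_stream drops the outer array index from each streamed event;
# fromstream reassembles each element into a complete value, one per line.
jq -cn --stream 'fromstream(1 | truncate_stream(inputs))' ./big_array.json
```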

As you can imagine, there's a lot we could cover here, so we won't guide you through the ins and outs of streaming just yet. Keep an eye out for future guides though!

Next Steps

Congratulations on defeating your monstrous JSON file, and taking the first steps to optimising your use of Snowflake by gaining control of costs and performance.

You can find out more about our Snowflake support here, or you can check out one of our example Snowflake projects!

