I regularly use SQL and R to perform data analysis and manage my data transformations and data pipelines. I love both languages, and have used them for a number of years now.
Which do I like the most?
Well that’s a tough question to answer, as they both have their own strengths and weaknesses for different tasks. In this post I’ll go through my perceived strengths and weaknesses of each language, and where each should and perhaps shouldn’t be used.
SQL is the lingua franca for data manipulation and transformation, as well as permanent storage and management. SQL has come a long way since its initial development at IBM in the early 70s, and is now quite capable of performing relatively sophisticated analysis.
The Power of SQL
SQL’s 3 biggest strengths (as I see them) are performance, ease of use and scalability.
Even a small database running on your laptop will be able to perform calculations and transformations on datasets far bigger than what would be possible in desktop spreadsheet applications such as Excel or Google Sheets. For a start, there are no row limits (like the 1,048,576 row limit in Excel or the 500,000 cell limit in Google Sheets). As well as this, databases such as PostgreSQL and MySQL, and modern data warehouses such as BigQuery, Redshift and Snowflake, are highly optimised to perform these types of row-by-row calculations.
If I want to perform analysis on a large dataset, I will almost always try to move my data into something like BigQuery. Its ability to easily crunch through data quickly and cost effectively is fantastic. BigQuery is also very easy to use, and integrates well with other GCP products I might use for data pipelines - this is also true for Redshift in the AWS ecosystem.
This makes a SQL environment ideal for data pipelines. With the advent of ELT over ETL, and great tools like Dataform and dbt, SQL pipelines running in your data warehouse have become incredibly powerful.
Then there’s the SQL language itself. I have a love-hate relationship with SQL, not least because it was the first coding language I learnt and therefore have sort of nostalgic feelings towards it.
SQL is built on Relational Algebra, introduced by Edgar Codd in 1970, and this logic for how to join, merge, retrieve, filter and intersect data has truly stood the test of time. For many analysts, SQL is the basis of how they think about tabular data.
However, SQL dialects are limited. There are things that just are not possible in SQL, or at least are extremely difficult and convoluted. Take the query below:
```sql
SELECT
  user_id,
  event_name,
  event_timestamp,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp ASC) AS rn
FROM users
WHERE rn = 1
```
Here, I’m trying to filter for the first interaction by each user, and I’m using `ROW_NUMBER` to find it. Those reading this who are familiar with window functions like `ROW_NUMBER` will already see that there is a problem: I can’t filter on `rn = 1` in my `WHERE` clause, because the `WHERE` clause is evaluated before window functions are computed, so `rn` doesn’t exist in the underlying `users` table at that point. Therefore, I have to adapt my query to look like this:
```sql
WITH users_rn AS (
  SELECT
    user_id,
    event_name,
    event_timestamp,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp ASC) AS rn
  FROM users
)
SELECT user_id, event_name, event_timestamp
FROM users_rn
WHERE rn = 1
```
This isn’t necessarily that hard, but it slows down development and getting to the answer you want. Having to deal with the mental burden of SQL’s quirks becomes annoying at best, and genuinely problematic at worst.
This is assuming you’re using a database engine that supports window functions like `ROW_NUMBER`. If you’re not, your query becomes much more troublesome: you have to pre-aggregate the minimum timestamp per user in a subquery, then `INNER JOIN` back to the same table on user ID and timestamp.
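The join-back version might look something like this (a sketch against the same `users` table as above; note that if a user has two events sharing the same minimum timestamp, both rows come back):

```sql
-- Find each user's earliest timestamp, then join back to recover the full row
SELECT u.user_id, u.event_name, u.event_timestamp
FROM users u
INNER JOIN (
    SELECT user_id,
           MIN(event_timestamp) AS first_event_timestamp
    FROM users
    GROUP BY user_id
) f
    ON u.user_id = f.user_id
   AND u.event_timestamp = f.first_event_timestamp
```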
And then let’s look at performing simple analysis tasks - like getting a median. Getting a mean in SQL is trivial:
```sql
SELECT AVG(revenue) AS mean_revenue FROM sales
```
However, most database engines don’t have a built-in function for a median (Snowflake being an exception). You have to manually calculate the median row position in a subquery, then use that as a filter to find the correct row in an outer query:
```sql
WITH row_numbers AS (
  SELECT revenue,
         ROW_NUMBER() OVER (ORDER BY revenue) AS rn
  FROM sales
),
number_of_rows AS (
  SELECT MAX(rn) / 2.0 AS median_row
  FROM row_numbers
),
rn_check AS (
  SELECT CASE
           -- a whole number means an even row count: take the lower-middle row
           WHEN MOD(median_row, 1) = 0 THEN median_row
           -- otherwise take the true middle row
           ELSE median_row + 0.5
         END AS median_rn
  FROM number_of_rows
)
SELECT revenue
FROM row_numbers
WHERE rn = (SELECT median_rn FROM rn_check);
```
All that to get a median value?! Sheesh. And again, this is in a database that supports window functions.
This is because SQL exists to manage the storage and processing of tabular data. It was not designed for analytical purposes.
(For the record, solving these kinds of problems within the boundaries of the SQL dialect is the sort of problem I enjoy, but that’s beside the point.)
SQL is still great for running a production data pipeline, and for prepping your data for further analysis. If you can get around its foibles, it’s a wonderful tool for manipulating large quantities of data.
R is a language that is specifically designed for data analysts, data scientists and statisticians. It’s a fully fledged programming language, and as a result, opens up a huge world of possibilities. It is also open source, and there is a huge community of users and developers who create free-to-use packages for common data tasks (and uncommon ones as well).
Key Benefits of R
R makes performing common data analysis tasks such as loading data, transforming, manipulating, aggregating, charting and sharing your analyses very easy, and the workflow is much more seamless than in SQL.
First of all - there is a median function! 😂
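With a data frame mirroring the hypothetical `sales` table from the SQL examples above, the whole median dance collapses to one line:

```r
# Assuming a data frame `sales` with a numeric `revenue` column
median(sales$revenue)
```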
Using the Tidyverse group of packages, you can perform complex data transformations with simple syntax, without the limitations or difficulties you get in SQL. We’ll start by looking back at my first example - finding the first record within each group:

```r
users %>%
  group_by(user_id) %>%
  filter(event_timestamp == min(event_timestamp))
```
In R, I can mix aggregate and non-aggregate calculations right next to each other - something SQL simply won’t allow. As a result, this is so much easier than the crazy manipulations you have to do in SQL to achieve the same result, and it means you can perform analysis almost as fast as you can think of it.

Pivoting data (changing it from a wide dataset to a long dataset or vice-versa) is ludicrously hard in SQL, especially if it’s an unknown number of columns or values - for instance, a dataset with a separate sales column for each year (say, `sales_yr_2019`, `sales_yr_2020` and so on). In R, it is almost trivial to reshape that table with 3 lines of code:
```r
sales_data_by_year %>%
  pivot_longer(cols = starts_with("sales_yr_"),
               names_to = "year",
               values_to = "sales") %>%
  mutate(year = str_remove(year, "sales_yr_"))
```
Doing this in a database would take some carefully crafted CASE statements, and while this is a small toy example, it’s indicative of things analysts see in the wild all the time.
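For illustration, a CASE-based unpivot of the same hypothetical table might look something like this - every year column has to be spelled out by hand, and a new year means editing the query:

```sql
-- Unpivot by cross joining against a hand-written list of years
SELECT years.year,
       CASE years.year
         WHEN '2019' THEN sales_yr_2019
         WHEN '2020' THEN sales_yr_2020
         WHEN '2021' THEN sales_yr_2021
       END AS sales
FROM sales_data_by_year
CROSS JOIN (
  SELECT '2019' AS year
  UNION ALL SELECT '2020'
  UNION ALL SELECT '2021'
) AS years
```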
R’s capabilities for advanced analytics such as predictions, modeling and clustering are industry leading and used by statisticians the world over. And R’s charting and plotting capabilities (through packages like ggplot2) are wonderfully flexible and customisable, and allow you to create any type of plot you wish - not just what’s possible in a GUI like Tableau or Google Data Studio.
And then of course, there’s the fact that R is a full programming language. You can write scripts that include IF statements, FOR loops, you can read in from external APIs, you can write to local files, databases, cloud storage etc, include error handling, you can scrape web pages, and even process data in streams.
I use it almost daily for pulling data from multiple places, joining them, performing analysis, creating charts and sharing my work with colleagues and stakeholders.
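A stripped-down sketch of that kind of daily workflow (the file names and columns here are invented for illustration):

```r
library(dplyr)
library(readr)
library(ggplot2)

# Pull data from multiple places (here, two hypothetical CSV extracts)
sales <- read_csv("sales.csv")
users <- read_csv("users.csv")

# Join, aggregate, and chart in one pipeline
sales %>%
  inner_join(users, by = "user_id") %>%
  group_by(country) %>%
  summarise(median_revenue = median(revenue)) %>%
  ggplot(aes(country, median_revenue)) +
  geom_col()
```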
However, R does have some issues. If you’ve ever tried to “put R into production” you will have likely come across some of these.
The first issue is performance. R is a high-level language with multiple layers between your code and the underlying C interface. It works entirely in memory and is single threaded: throw a dataset of more than a few million rows at it and you start to notice it slowing down, and it can really grind to a halt when dealing with ‘big data’. There are ways to deal with this - you can install R on a Linux server, give it a load of RAM and let it do its work there. However, at this point you’re asking analysts to do things they may not be comfortable with or have much knowledge of. And I wasn’t the first analyst, and won’t be the last, to spin up a large cloud VM for some heavy work in R and forget to spin it down again…
The biggest pain I have when working with R is package management. R relies on packages built by the community, which are regularly updated (sometimes with changes that break things). The real-world manifestations of this are:
- Scripts and functions not working after packages have been updated.
- Team members each having different versions of packages: when you share your code it often doesn’t work on other people’s machines.
- In an ETL environment (such as one hosted on a server), if you make a change or update a package, you need to test it carefully in that server environment to make sure there aren’t unexpected side effects on the next run.
- R in a production environment (such as a pipeline or a user-facing application like a Shiny app) can also suffer from this, if individual team members deploy code that depends on package versions other members either don’t have or have conflicting versions of. This makes it difficult for people to pull down production code and work on it locally.
There are processes to manage these things, such as the packrat and renv packages, as well as utilising Docker to bundle up R code into reproducible images. However, these are not trivial to implement, and don’t fit neatly into an analyst’s normal workflow.
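As a sketch of the renv route (`init`, `snapshot` and `restore` are renv’s core functions; the workflow around them is illustrative):

```r
install.packages("renv")

renv::init()      # set up a project-local library and a lockfile

# ...install and update packages as normal while you work...

renv::snapshot()  # record the exact package versions in renv.lock

# A colleague, or the production server, can then rebuild the
# same library from the committed lockfile:
renv::restore()
```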
To summarise, I love using R and SQL to work with data, and think all analysts should learn at least one programming language to work with data. But there is also the need to know when each should be used:
SQL is great for:
- Data that needs to be transformed and made ready for analysis and presentation.
A cloud data warehouse paired with a great service like Dataform is the way to go
R is great for:
- Doing analysis locally, and wanting to flexibly express your analytical train of thought
R is (in my opinion) the best tool for performing analysis, so would be my weapon of choice.
Can you use a SQL database to perform analysis, or use R to build a reliable data pipeline? Of course you can, just be wary of the pitfalls and the difficulties going into it...