Guide

How we store protobufs in MongoDB

Use a custom codec to cleanly store protobuf documents in MongoDB

Engineering

Ben Birt on January 8, 2020

At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.

We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.

Here's a simple example of a protobuf definition:

message Person {
  string first_name = 1;
  string last_name = 2;
  int64 birth_timestamp_millis = 3;
}

One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the .proto file, and it's (usually, with some caveats) safe to change a field's type. By contrast, renaming a 'field' (column) in a traditional SQL table is usually a lot of work, involving some amount of database migration.

However, making these changes safely requires the data at rest to actually be stored in binary protobuf format, which would make it impossible to query: the database engine doesn't speak protobuf.

One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields: imagine we stored an instance of Person (as defined above) in JSON format, but then renamed birth_timestamp_millis to birthday_timestamp. The previously stored Person would now have an undecodable birthTimestampMillis field, and would be missing a value for birthdayTimestamp.

What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.

Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded/decoded as it is stored/retrieved from the database, using custom, user-defined codecs.

We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.

For example, given the following protobuf definition:

message Example {
  string string_field = 3;
  ExampleEnum enum_field = 10;
  oneof example_oneof {
    int32 int32_field = 78;
    int64 int64_field = 33;
  }
  NestedMessage nested_message = 107;
}

enum ExampleEnum {
  // proto3 requires the first enum value to be zero.
  EXAMPLE_ENUM_UNSPECIFIED = 0;
  VAL_0 = 1;
  VAL_1 = 573;
}

message NestedMessage {
  string nested_string_field = 2;
  int32 nested_int32_field = 1;
}

And the following instance of Example, in canonical JSON format:

{
  "stringField": "foo",
  "enumField": "VAL_0",
  // Note that this is represented as a string because the JavaScript number type is smaller than an int64.
  "int64Field": "123456789",
  "nestedMessage": {
    "nestedStringField": "bar",
    "nestedInt32Field": 12
  }
}

Our MongoDB codec will encode the instance of Example as the following Mongo BSON document:

{
  "3": "foo",
  "10": 1,
  "33": 123456789,
  "107": {
    "2": "bar",
    "1": 12
  }
}

With this encoding, if we change the name of nested_string_field to something_else, or the enum value VAL_0 to BETTER_ENUM_VALUE_NAME, we'll still be able to decode the document, without any loss of data.

This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.

The code is open-sourced here (godoc). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!
