
How we store protobufs in MongoDB

Use a custom codec to cleanly store protobuf documents in MongoDB


Ben Birt on January 8, 2020

At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.

We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.

Here's a simple example of a protobuf definition:

message Person {
  string first_name = 1;
  string last_name = 2;
  int64 birth_timestamp_millis = 3;
}

One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the .proto file, and it's (usually, with some caveats) safe to change a field's type. By contrast, renaming a 'field' (column) in a traditional SQL table is usually a lot of work, involving some amount of database migration.

However, safely making changes to a protobuf definition requires the data at rest to actually be stored in binary protobuf wire format, and that would make the data impossible to query, since the database engine doesn't speak protobuf.

One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields. Imagine we stored an instance of Person (as defined above) in JSON format, but then renamed birth_timestamp_millis to birthday_timestamp: the previously stored Person would now have an undecodable birthTimestampMillis field, and would be missing a value for birthdayTimestamp.

What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.

Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded and decoded as it is stored to and retrieved from the database, using custom, user-defined codecs.

We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.

For example, given the following protobuf definition:

message Example {
  string string_field = 3;
  ExampleEnum enum_field = 10;
  oneof example_oneof {
    int32 int32_field = 78;
    int64 int64_field = 33;
  }
  NestedMessage nested_message = 107;
}

enum ExampleEnum {
  VAL_0 = 0;
  VAL_1 = 573;
}

message NestedMessage {
  string nested_string_field = 2;
  int32 nested_int32_field = 1;
}

And the following instance of Example, in canonical JSON format:

{
  "stringField": "foo",
  "enumField": "VAL_0",
  // Note that this is represented as a string because the JavaScript number type is smaller than an int64.
  "int64Field": "123456789",
  "nestedMessage": {
    "nestedStringField": "bar",
    "nestedInt32Field": 12
  }
}

Our MongoDB codec will encode the instance of Example as the following Mongo BSON document:

{
  "3": "foo",
  "10": 0,
  "33": 123456789,
  "107": {
    "2": "bar",
    "1": 12
  }
}

With this encoding, if we change the name of nested_string_field to something_else, or the enum value VAL_0 to BETTER_ENUM_VALUE_NAME, we'll still be able to decode the document, without any loss of data.

This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.

The code is open-sourced here (godoc). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!
