Today, we are going to explore the benefits of using schemas in your Kafka processing pipeline.
We’ll focus on two specific aspects of it:
- the schema itself; we’ll use Protobuf (short for Protocol Buffers)
- the schema registry; we’ll use the Confluent Schema Registry
Let’s start with Protobuf.
What are Protocol Buffers? 🔗
Protobuf is just a way of serializing structured data.
The high-level idea is that you can
- define a language agnostic schema (in a .proto file)
- generate source code in one or many programming languages (e.g. Java, C++, Go, Python, etc.)
- use this generated source code to serialize / deserialize your data
Protobuf is a binary transfer format, which implies the following pros and cons
- pros: lightweight
- cons: not directly human readable
For instance, the following JSON
{
"first_name": "Jason",
"last_name": "Smith"
}
would translate to the following .proto file
message Person {
string first_name = 1;
string last_name = 2;
}
and would get encoded, conceptually, as
1 2 5 Jason 2 2 5 Smith
where, for each field,
- the first number is the tag (all proto fields have a tag that you define by assigning an int to the field)
- the second one is the wire type, 2 for length-delimited data such as strings
- the third one is the length of the value
In the actual wire format, the tag and the wire type are packed together into a single byte, as the short sketch below shows.
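To make this concrete, here’s a minimal sketch in Python, assuming the .proto file above has been compiled into a module named person_pb2 (we’ll cover compilation in the next section):

import person_pb2

# build a Person and serialize it to the Protobuf binary format
person = person_pb2.Person(first_name="Jason", last_name="Smith")
print(person.SerializeToString())
# b'\n\x05Jason\x12\x05Smith'
# 0x0A (shown as \n) packs tag 1 and wire type 2, followed by the length 5 and "Jason"
# 0x12 packs tag 2 and wire type 2, followed by the length 5 and "Smith"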
Hopefully, that gives you a quick introduction to Protobuf. We’ll now dig a little bit more into the message definition and compilation.
Defining a protobuf message 🔗
In a .proto file you add a message for each data structure you want to serialize, then, for each field in the message, specify
- a type (Protobuf supports many standard data types, e.g. bool, int32, float, double)
- a name
- a tag (equivalent to the ID of this field in the binary encoding)
You can also use messages as field types; for instance, an Address message could be used within the Person message as the type of an address field, as shown below.
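Here is a small sketch of what that could look like (the street and city fields are illustrative, not part of the original example):

message Address {
  string street = 1;
  string city = 2;
}

message Person {
  string first_name = 1;
  string last_name = 2;
  Address address = 3;
}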
For more details on message definitions, the proto3 language guide is going to be your best friend.
Generating the source code 🔗
When you’re done with your proto message definition, there is an intermediate step: compiling your proto message to your target programming language. You do that using the protoc compiler. The gRPC website provides helpful instructions for installing protoc.
Once you’ve installed protoc, you can find more details on how to compile your protocol buffers to Python in Google’s developer guide for Python.
The approach is the same for all languages; the only difference is the language option you provide (you can pass more than one if you want to compile your message(s) to several target languages). For Python, you use the --python_out=$DST_DIR option. The full syntax looks as follows:
protoc -I=$SRC_DIR --python_out=$DST_DIR $SRC_DIR/my-file.proto
This will generate descriptors and metaclasses that you can use as normal classes in Python for getting/setting fields.
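For instance, assuming the generated module is named person_pb2 (the name is derived from the .proto file name), using the generated class looks roughly like this:

import person_pb2

# set fields like normal Python attributes
person = person_pb2.Person()
person.first_name = "Jason"
person.last_name = "Smith"

# serialize to bytes, then parse the bytes back into a message
data = person.SerializeToString()
decoded = person_pb2.Person()
decoded.ParseFromString(data)
print(decoded.first_name)  # Jason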
At this stage, you should have a basic understanding of Protobuf and how to define messages (our Protobuf schemas). Let’s now look at a handy way to store, share, access, and version our schemas: the Confluent Schema Registry.
Schema registry 🔗
First of all, please note that we could use the schema registry with another data serialization mechanism (e.g. Apache Avro) and that we could use Protobuf without the schema registry.
We’ll see, however, that the schema registry is a great way to manage schemas and that it brings the following benefits:
- a centralized place to store schemas and potentially share them
- schema versioning that lets you evolve schemas, producers, and consumers in sync
- validation that schema evolution is acceptable (backward and forward compatible)
Confluent Schema Registry 🔗
In this post, we’ll explore the Confluent Schema Registry more specifically.
When producing with the Confluent Kafka Python library, the ProtobufSerializer will (a short sketch follows this list)
- look up the schema in the schema registry (registering it if it isn’t there yet) and cache it
- validate, via the registry, that the schema is unchanged or that the changes are compatible; otherwise it raises an error
- encode the message using the generated source code and send the schema ID along with the message to Kafka (rather than the full schema)
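Here is a rough sketch of the producer side, assuming a people topic, a local broker, a schema registry at localhost:8081, and the person_pb2 module from earlier (configuration keys such as use.deprecated.format can vary between library versions):

from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.protobuf import ProtobufSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

import person_pb2

schema_registry_client = SchemaRegistryClient({"url": "http://localhost:8081"})
serializer = ProtobufSerializer(
    person_pb2.Person,
    schema_registry_client,
    {"use.deprecated.format": False},
)

producer = Producer({"bootstrap.servers": "localhost:9092"})
person = person_pb2.Person(first_name="Jason", last_name="Smith")

# the serializer looks up / registers the schema and prepends the schema ID to the payload
producer.produce(
    "people",
    value=serializer(person, SerializationContext("people", MessageField.VALUE)),
)
producer.flush()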
When consuming a message, the ProtobufDeserializer will (a matching sketch follows this list)
- check that the message was produced with the ProtobufSerializer and carries a valid schema ID
- decode the payload using the generated source code
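And a matching sketch of the consumer side, under the same assumptions and with the same caveat about configuration keys:

from confluent_kafka import Consumer
from confluent_kafka.schema_registry.protobuf import ProtobufDeserializer
from confluent_kafka.serialization import MessageField, SerializationContext

import person_pb2

deserializer = ProtobufDeserializer(person_pb2.Person, {"use.deprecated.format": False})

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "people-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["people"])

msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    # the deserializer checks the schema ID, then decodes with the generated class
    person = deserializer(msg.value(), SerializationContext("people", MessageField.VALUE))
    print(person.first_name, person.last_name)
consumer.close()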
Since Protobuf messages don’t carry their schema, we could run Protobuf without the schema registry by sharing the generated source code through a library or a service, but we’d then miss the benefits mentioned earlier.
Conclusion 🔗
I hope that gave you a basic understanding of Protobuf and the Schema Registry. Using them in combination should help you build the foundations of a solid approach to data validation, schema definition, maintenance, and versioning.
I’ll see you in a future blog post where we’ll put these concepts into action.
References 🔗
- https://developers.google.com/protocol-buffers/docs/proto3
- https://developers.google.com/protocol-buffers/docs/encoding
- https://developers.google.com/protocol-buffers/docs/pythontutorial
- https://betterprogramming.pub/understanding-protocol-buffers-43c5bced0d47
- https://medium.com/@stephane.maarek/introduction-to-schemas-in-apache-kafka-with-the-confluent-schema-registry-3bf55e401321