How To Create Avro Schema
ElasticSearch On Steroids With Avro Schemas
How to tackle the interface version explosion in a large enterprise setup
What is it about
The following article explains an approach that uses modern data storage and serialization technologies to decouple components and stop the explosion of service interface versions inherent in large enterprise setups with many data consumers.
Being responsible in a large financial institute for a heavily used data provider system, I'm constantly confronted with endless deployment iterations of new service versions and the inability of multiple consumers to move to the new version in a decent timeframe. Over time, this results in a landscape with a multitude of parallel versions running for a single service.
A word of caution: the approach is based on a NoSQL data storage infrastructure, and I'm fully aware that such data storage may not be suited for all kinds of business applications. Nevertheless, there are more than enough use cases that would fit nicely into this type of architecture.
Interaction Journey
Our tutorial covers the following scenario:
We have a Service Component that processes user input provided via a React-based browser app and persists the data in an ElasticSearch-cluster-based persistency store.
- (3): The `ReactUI` sends a localized JSON data payload (user input) to the `ServiceComponent`
- (4): The `ServiceComponent` de-localizes the JSON data payload by replacing localized text with reference code values and prepares an Avro binary message (serialization), which is then sent to the `BackendComponent`
- (5): The `BackendComponent` de-serializes the Avro binary message, transforms it into an Avro JSON message, and then stores it in the `ElasticSearch Cluster`
Avro Schema
A key feature of an Avro message is that it is self-describing via its associated Avro Schema. The Avro Schema is a JSON-based definition of the message structure.
Refer to the below simple Avro Schema.
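A minimal schema with the fields discussed here might look as follows (the record name and namespace are assumptions, not taken from the tutorial):

```json
{
  "type": "record",
  "name": "Person",
  "namespace": "tutorial.avro",
  "fields": [
    {"name": "personid", "type": "long"},
    {"name": "lastname", "type": "string"},
    {"name": "surname", "type": ["null", "string"], "default": null}
  ]
}
```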
The example already outlines some specifics of the Avro Schema definition language.

- `personid` and `lastname` are mandatory attributes of type `long` and `string`
- `surname` is a union attribute, i.e., it can be either `null` or have a value of type `string`. By default, its value is `null`.

Optional values are always expressed as unions, and to be prepared for seamless Schema Evolution (more on that later), you should always define a default value for optional attributes.
Avro Schema Parser and Client Bindings Generator
The Schema is used by the Avro serializer and deserializer to parse an Avro binary message (or an Avro JSON message, depending on your chosen configuration) into Java Data Access Objects (DAOs). These DAOs may be generic or pre-compiled, strongly typed schema Java classes.

I.e., the Maven goal `generate-sources` generates a class `Person` which has three attributes: `personId`, `lastName`, and `firstname`. The person object can then be instantiated out of a binary message of this type by using the Avro parser.
To be self-describing, an Avro Message may be complemented with the Schema itself or a Schema Fingerprint.
Provisioning of the full schema JSON is usually done in file-based approaches, which bundle a large number of Avro messages. In our interaction journey (a request-response style with a single Avro message), such an approach would be too much overhead. For this scenario, the schema fingerprint is the right approach.
Avro Fingerprint and Schema Directory
The Avro Schema Fingerprint is a globally unique identifier for referencing the correct schema in a central Schema Registry.
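Avro's standard 64-bit fingerprint is the CRC-64-AVRO (Rabin) checksum defined in the Avro specification. As a sketch, here is the table-driven routine from the spec, transcribed to Python for brevity (note that real implementations compute the fingerprint over the schema's Parsing Canonical Form; canonicalization is omitted here):

```python
# CRC-64-AVRO (Rabin) fingerprint, table-driven, per the Avro specification.
EMPTY = 0xC15D213AA4D7A795  # the spec's "empty" seed value

def _fp_table():
    # Precompute the 256-entry lookup table from the seed polynomial.
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table

_TABLE = _fp_table()

def rabin_fingerprint(data: bytes) -> int:
    """64-bit fingerprint of the schema bytes passed in."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ byte) & 0xFF]
    return fp
```

The fingerprint is deterministic, so the same schema text always yields the same 64-bit value, while any change to the schema produces a new one.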
The below abstract SchemaRegistry class describes the methods required by such a Registry.
- The method `registerSchema` allows a sender (publisher) to register the Avro schema of the message it sends out to readers (consumers). As a return value, the Avro fingerprint is returned, which uniquely identifies this schema. Any change to the schema itself will result in a new fingerprint.
- The fingerprint is calculated with the method `getSchemaFingerprint`
- The method `getSchema` will return the Avro JSON associated with the passed-in `fingerprint`
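Put together, an in-memory toy version of such a registry could look like this (sketched in Python for brevity; the tutorial's version is a Java class, and the SHA-256-based fingerprint below is only a stand-in for Avro's CRC-64-AVRO):

```python
import hashlib

class InMemorySchemaRegistry:
    """Toy registry implementing the three methods described above."""

    def __init__(self):
        self._schemas = {}  # fingerprint -> schema JSON text

    def get_schema_fingerprint(self, schema_json: str) -> int:
        # Stand-in for CRC-64-AVRO: any stable 64-bit hash of the
        # schema text illustrates the contract.
        digest = hashlib.sha256(schema_json.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def register_schema(self, schema_json: str) -> int:
        # Registering is idempotent: the same schema maps to the same fingerprint.
        fp = self.get_schema_fingerprint(schema_json)
        self._schemas[fp] = schema_json
        return fp

    def get_schema(self, fingerprint: int) -> str:
        return self._schemas[fingerprint]
```

A writer calls `register_schema` once and ships only the fingerprint with each message; readers exchange the fingerprint for the full schema via `get_schema`.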
That's it: with a global Schema Registry established, which readers and writers of Avro messages can use to exchange a schema fingerprint for the associated JSON schema, you are ready to exploit the full power of Avro.
The tutorial provides two implementations of a Schema Registry:
- a File-based implementation for lightweight testing, as well as an
- ElasticSearch (ES)-based implementation. If your company already runs a production ES cluster, you can easily leverage it as your company-wide schema registry by introducing a dedicated ES index.
- There are off-the-shelf Schema Registries available, especially in the Kafka domain, for example, the Schema Registry of the Confluent Kafka product. But we keep it as simple as possible and leverage ElasticSearch, which is used as our target data store.
The below example shows the schema registered in ElasticSearch under its fingerprint 8354639983941950121 in the index `avroschema`. Any writer or reader of messages will use the Avro fingerprint to reference the schema underlying its processing.
Using ElasticSearch, you get out of the box a central repository of schema definitions, which can be easily enhanced with lookup and query capabilities used by business analysts, designers, or developers during system development.
You are wrong if you think there is a lot of coding necessary to implement a Schema Registry.
Below you see the fully functioning `ElasticSearchSchemaRegistry` Java class.
ESPersistency Manager
Pretty straightforward and compact. You could now argue that the `ESPersistencyManager` class hides the complexity (it is used and implemented as part of the tutorial).
Well, not really: the class is an ultra-light layer around Jest, an HTTP Java client for Elasticsearch. While Elasticsearch provides its own native Java client, Jest offers a more fluent API and a more natural interface to work with. If you're interested in Jest, check out Blaedung's tutorial article.
Our `ESPersistencyManager` just shields our tutorial classes from direct Jest exposure, an encapsulation technique that allows replacing the persistency manager in a future version.
Again, very compact to get an Avro Schema persisted. It's worth mentioning that you can provide ElasticSearch with your own primary key (called the `_id` in ES). In our tutorial, we use the globally unique fingerprint as the primary identifier in ES, which makes the lookup straightforward (see screenshot above). Providing your own primary key is a key feature, especially when using ElasticSearch in a data replication scenario, where ES is only the slave of another master system that has already generated primary keys.
As you can imagine, retrieving a JSON object back is even simpler; Jest does the job for us.
Backend Component Startup Sequence
We should now have a pretty good understanding of what must be prepared and configured by the Backend component using our approach.
The diagram below outlines the most important steps:
- `1,2`: we create a connection to our ElasticSearch cluster
- `10,11`: we create (if not existing) two ES indices, `businessmodel` and `avroschema`. The first one is used for our domain data records, the latter for the Avro Schemas
- `20,30,31`: we load and write a customized ES mapping for our `avroschema` index. This needs some explanation: ES needs some schema information about its persisted JSON documents (this is an ES document schema and shouldn't be mixed up with the Avro schema). By default, ES will automatically assign an ES mapping schema by analyzing the provided JSON document. In our case, we have to adjust the ES mapping to our needs and tell ES not to interpret the JSON part provided in the attribute `avroschema` (by switching the type to `object`).
- `40–46`: we now load our Avro Message Schema from the file system and register it in the ES Schema Registry. In return, we get our Avro Schema Fingerprint, which is our universal interface contract identifier for provisioning/consuming our messages. The Schema defines one of our Business Domain Data Objects (DO), the "Business Model Strategy DO," which collects strategy data about companies. Besides simple types, the schema also introduces an Enum for illustration purposes. The `doc` attribute allows providing a textual explanation, which is helpful if people are browsing schemas in a registry.
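For steps 20,30,31, a customized mapping along these lines would stop ES from indexing the schema body. This is only a sketch: the exact mapping in the tutorial may differ, and depending on your ES version the mapping may additionally need a type name; `"enabled": false` is the standard ES way to store but not interpret a JSON subtree.

```json
{
  "mappings": {
    "properties": {
      "avroschema": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
```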
Service Component Startup Sequence
Let's check out what the Service Component has to prepare.
- `10–11`: we create a connection object (`ESPersistencyManager`) to the ES cluster
- `20,21,22`: we initialize an ES-based Schema Registry, passing in our connection object
- `30,31`: we retrieve the Schema based on the fingerprint version upon which the Service component was built.
Message Version Evolvement Over Time
The last step, resolving a schema based on the "pre-configured static" fingerprint, must be understood in detail.
The Schema fingerprint represents the message version number, which either the ServiceComponent or BackendComponent is capable of processing.
In our simple Hello World example, the schema fingerprint is identical for the `BackendComponent` and `ServiceComponent`. But in reality, the `BackendComponent` as well as one or multiple `ServiceComponent`(s) may evolve at different change speeds over time.
This scenario is illustrated in the below diagram.
- Think of a `BackendComponent`, which is released with its `ServiceRWComponent1` in Release 1 with Schema Version (Fingerprint) `100`. Two read-only components, `ServiceROComponent2` and `ServiceROComponent3`, are also consuming data with version `100`.
- Three months later, the `BackendComponent` releases a new version 2, which supports an enhanced data model (some additional optional attributes) with schema version `101`. Its `ServiceRWComponent1` also supports this new version.
- For the two independent data consumers `ServiceROComponent2` and `ServiceROComponent3`, which may evolve at different change speeds, the situation is as follows: `ServiceROComponent2` has no release of its own and stays on version `100`; `ServiceROComponent3` requires the additional data elements of version `101` and therefore also releases a new overall version of itself.
Interface Version Explosion
There we are in real life. If you are the owner of a `BackendComponent` that offers data sets important to a lot of independent data consumers (`ServiceROComponent2` and `ServiceROComponent3`), you will be confronted with the need to support multiple interface versions. This version support requirement may grow dramatically over time and impacts your time to deliver, your flexibility, and the costs of your components.
As a data provider to third-party components (which are not under your control), you may have only limited power to force them to upgrade to newer interface versions.
Service Versioning and its management over the whole lifecycle of your component play a vital (and sometimes cumbersome) role for a lot of data providers.
Running SOAP web services backed by the WSDL interface definition language, or JSON services supported by the OpenAPI Specification 3 (i.e., Swagger), two very popular request-response interaction technologies, will expose you to constant version changes whenever you enhance your published interfaces.
The version change is explicit and must be adopted by your consumers, who may be out of reach for a single coordinated version upgrade that would allow removing the older version from your production component.
And here Avro shines with its Schema Evolution support out of the box.
Avro Schema Evolution
Avro supports schema evolution, which means that you can have producers and consumers of Avro messages with different versions of the schema at the same time. It all continues to work (as long as the schemas are compatible). Schema Evolution is a crucial feature required in large production systems to decouple components and allow them to run system updates with interface changes at different times.
The Avro algorithm implementation requires that the reader of an Avro message has the same version of the schema as the writer of the message. So how does Avro support schema evolution? If the reader is on another schema version, it may pass two different schemas into the Avro parser. The parser then uses its resolution rules to translate data from the writer schema into the reader schema.
Let's check that out on our `BackendComponent.convertAvroBinarToJSON` method. It's one of the tutorial's main methods: it converts an Avro binary message into an Avro JSON message format, which is afterward persisted in ES.
- The method gets passed in an Avro Single Object Encoded binary message, which is defined as follows:
Single Avro objects are encoded as follows: A two-byte marker, C3 01, to show that the message is Avro and uses this single-record format (version 1). The 8-byte little-endian CRC-64-AVRO fingerprint of the object's schema. The Avro object encoded using Avro's binary encoding. (Link)
- `Line 2`: We extract the schema fingerprint out of the message
- `Line 3`: We fetch the associated schema from our schema registry
- `Line 4`: We obtain the message payload
- `Line 7`: We check if the fingerprint of the received message is the same as the one used in the `BackendComponent`
- `Line 8`: If the fingerprint is the same, we create an Avro `GenericDatumReader` (which is used to decode the message) with the `BackendComponent` schema
- `Line 10`: If not, we create an Avro `GenericDatumReader` with the `BackendComponent` schema as well as the retrieved message schema. The reader will now apply the schema evolution logic during the decoding of the binary message.
- `Line 12`: We also create an Avro `GenericDatumWriter` to produce a JSON representation of the message. As already mentioned, Avro supports a compact binary encoding as well as a JSON-based encoding. You can get from one encoding to the other with some simple Avro helper classes.
- `Line 14`: We create a `binaryDecoder` for our binary `payload`, as well as
- `Line 18`: a `jsonEncoder` based on our `schema`
- `Line 19–23`: Finally, we decode our binary message and encode it into a JSON message format, which is ready to be persisted.
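The framing steps at the start of the method (extracting fingerprint and payload) are simple enough to sketch without the Avro library. Here is the single-object layout from the spec quoted above, transcribed to Python for brevity (the Avro-specific decoding of the payload itself is omitted):

```python
import struct

MARKER = b"\xc3\x01"  # two-byte header identifying Avro single-object encoding (v1)

def encode_single_object(fingerprint: int, payload: bytes) -> bytes:
    # marker + 8-byte little-endian CRC-64-AVRO schema fingerprint + binary payload
    return MARKER + struct.pack("<Q", fingerprint) + payload

def decode_single_object(message: bytes):
    # Returns (fingerprint, payload); raises if the marker is missing.
    if message[:2] != MARKER:
        raise ValueError("not an Avro single-object encoded message")
    (fingerprint,) = struct.unpack("<Q", message[2:10])
    return fingerprint, message[10:]
```

The extracted fingerprint is then exchanged for the writer schema at the registry, exactly as the walkthrough describes.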
The method `BackendComponent.persist` is responsible for this task, which is straightforward.
- we pass in the ElasticSearch `index` ("businessmodel"), the ElasticSearch `type` ("strategy") of our object, as well as the unique `id`
In a real-life scenario, the persist method would be exposed by the `BackendComponent` as a JSON, web service, or other remote interface.
Our persisted Avro JSON document is stored as part of the _source attribute.
Just think about what we have now achieved.
- Our `BackendComponent` can now serve JSON document structures that are fully compliant with the Avro message specification. The transformation to the JSON persistence format works fully automatically, based on the binary message received from the `ServiceComponent`.
- The JSON `_source` document structure (the payload) is fully self-describing. We injected the primary key `index_ipid` as well as the schema fingerprint `avro_fingerpint` as part of the payload. That means a consumer of the JSON message document can resolve the schema via ES to process the message, and also knows the unique storage identifier of the document in ES.
- We also injected additional common attributes that may be relevant, i.e., `last_update_timestamp` and `last_update_loginid`, to record when and by whom the document was updated. This would allow us to build up time series and an audit trail by introducing, for example, a Kafka-based event sink (which will be the topic of another article)
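Following this scheme, a persisted `_source` document could look roughly like this (the business payload attribute and all values are invented for illustration; the envelope attribute names are the ones mentioned above):

```json
{
  "index_ipid": "4711",
  "avro_fingerpint": "8354639983941950121",
  "last_update_timestamp": "2019-11-03T10:15:30Z",
  "last_update_loginid": "jdoe",
  "companyName": "ExampleCorp"
}
```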
Schema Evolution revisited
An essential aspect of data management is schema evolution. After the initial schema is defined, applications may need to evolve, as shown in the above diagram. When this happens, the downstream consumers must be able to handle data encoded with both the old and the new schema seamlessly.
We differentiate two compatibility types:
- backward compatibility means a consumer working with a newer schema may read data based on an older schema
- forward compatibility means a consumer working with an older schema may read data based on a newer schema
Whether a schema evolution is backward- as well as forward-compatible depends on whether the change complies with Avro's Schema Resolution constraints.
Here are three essential rules out of the resolution spec:
- Reordering fields is possible without any impact on the reader or writer
- Adding a field to a record is ok, provided that you also give a default. The default value allows a reader with the new schema to read messages from a writer based on an older schema.
- Removing a field from a record is ok, as long as you gave a default definition initially. I.e., a reader on an old schema receives the default value.
This implies that adding optional fields to, or removing them from, your data message keeps the message fully compatible, as long as a default value was or will be defined for the removed or added attribute. So when you start defining your schema, you should consider that from the beginning.
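The effect of these rules can be illustrated with a toy resolver (plain Python, not Avro's actual implementation): reader fields take the writer's value when present, fall back to their default otherwise, and writer-only fields are dropped.

```python
def resolve(writer_datum: dict, reader_schema: dict) -> dict:
    """Toy illustration of Avro's record resolution rules."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_datum:
            # Field known to both schemas: take the writer's value.
            resolved[name] = writer_datum[name]
        elif "default" in field:
            # Field added in the reader's schema: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError("no value and no default for field '%s'" % name)
    # Fields only the writer knows (removed in the reader's schema) are dropped.
    return resolved
```

For example, a reader whose schema gained a defaulted `surname` field can still resolve an old datum that lacks it, while a field the reader no longer declares simply disappears from the result.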
Adding or removing optional data attributes in a data message is a scenario that happens all the time in production systems, which evolve to fulfill new business requirements.
Enhanced Versioning Strategy Explained
Let's see how Avro helps us reduce versioning complexity and component coupling in a simple example, by comparing a classic application setup with our enhanced one using ElasticSearch and Avro.
Classic Application
Assume the following component setup in a productive environment.
- We have an application that covers a particular bounded context (i.e., a CRM client profiling application), consisting of a React maintenance UI `ReactMaintainUI` backed by a presentation service component `ServiceComponent` that offers a JSON localized payload interface to the UI. We assume it's a Swagger-style OpenAPI interface. It connects to a backend component `BackendComponent`, which offers a Webservice interface and persists its data in a relational database `RelationalDB`
- The application offers its API not only to its own UI but also integrates a 3rd-party UI `JSONConsumer1` via its JSON API, as well as a third-party data consumer, which reads data over the Webservice interface of the `BackendComponent`
- When the application is initially released, the following interface and DB schema versions are frozen and made available for consumption by consumers: `JSON_V100`, `WSDL_V100`, as well as `DBSchema_V100`. All these artifacts represent an agreed contract to consuming applications, be it the ones managed within the bounded context (which are normally under your control) or third-party components.
- Assume now that our bounded context application has multiple release cycles in a short period, introducing new features and optional data attributes. This is a typical example in an agile setup, where you may start with a Minimum Viable Product (MVP) and enhance the application in short release cycles.
- In our example, we released a new version `V101`, which requests additional optional data attributes via the `ReactMaintainUI`
- We added a new interface and schema version for our bounded context application: `JSON_V101`, `WSDL_V101`, and `DBSchema_V101`. This results in changes to the red-colored components.
- Our existing 3rd-party consumers, `JSONConsumer1` and `WSDLConsumer1` (the green ones), are fine with the current interfaces (they don't require the new data), have no release planned, and stay on the initial versions `JSON_V100` and `WSDL_V100`.
- There you are. As the bounded context application owner, you now have to start managing your interface versions and potentially provide multiple versions to your data consumers.
- Especially if your data is required by multiple other applications (i.e., CRM data), the need will grow quickly along with your changing application. This may result in multiple versions of the same service, and believe me, forcing third-party consumers to migrate to new versions is a difficult undertaking.
- Migrating to a new interface version means code changes (at least recompiling your client bindings), testing, and releasing the component. In the end, these are costs on the consumer side, which consumers try to avoid whenever possible.
New Generation Application
Now let's check the scenario with our new bounded context application, which uses in its `BackendComponent` an Avro message interface supporting schema evolution, as well as a NoSQL data store.
- We still have an explicit interface contract `JSON_V100`, which is exposed to UIs as a REST-based interface.
- The `BackendComponent` offers a straightforward JSON interface to consumers for passing in an Avro message in the binary-encoded single-object format. A very simplified JSON interface would allow us to pass in a Base64-encoded string, which represents our binary single-object encoded Avro message. In return, the service would give back the unique identifier of the Avro object, as well as the fingerprint of the `BackendComponent` writer. This allows the interface consumer to detect whether it is working on the same version.
- For the sake of the tutorial, that should be sufficient; we will look into the details of such an interface in a subsequent article.
- As you can see in the above diagram, the `BackendComponent` interface, as well as the `EsIndex1` schema, are not explicitly versioned. The versioning is managed transparently as long as our interface enhancements stay within the constraints of the Avro schema evolution rules.
- Adding additional optional attributes would now result only in changes to the `ReactMaintainUI` component and the corresponding `ServiceComponent`, which offers an updated OpenAPI-compliant JSON interface.
- Our `BackendComponent` and persistence schema `Avro_JSON_Store` may stay unaffected (or, minimally, we have to run a re-index on ES in case we need some specific schema settings).
- We could finally decouple our `BackendComponent` from the evolving `ServiceComponent` and reduce the amount of development and testing effort, as well as the overall complexity of our application.
This ultimately results in a faster delivery cycle and lower change costs for introducing new features in your component.
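The very simplified Base64 JSON interface mentioned in the scenario above could look like this on the wire (a Python sketch; the field name `message` and the function names are invented for illustration):

```python
import base64
import json
import struct

def build_request(fingerprint: int, avro_payload: bytes) -> str:
    # Client side: frame the payload as an Avro single-object message and
    # wrap it Base64-encoded inside a JSON body.
    single_object = b"\xc3\x01" + struct.pack("<Q", fingerprint) + avro_payload
    return json.dumps({"message": base64.b64encode(single_object).decode("ascii")})

def parse_request(body: str):
    # Server side: unwrap Base64, then split marker / fingerprint / payload.
    raw = base64.b64decode(json.loads(body)["message"])
    if raw[:2] != b"\xc3\x01":
        raise ValueError("not an Avro single-object encoded message")
    (fingerprint,) = struct.unpack("<Q", raw[2:10])
    return fingerprint, raw[10:]
```

The server would answer with the stored object's unique identifier plus its own writer fingerprint, letting the consumer detect a version mismatch.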
Wrap it Up
So it's time for you to check out the tutorial, which can be found on Github: https://github.com/talfco/tutorial-apache-avro (a Maven-based tutorial).
- Start with the `HelloWorld` class, which establishes and simulates the interaction flow between the `ServiceComponent` and `BackendComponent`. From there, you can drill down into the lightweight supporting classes, which give you a good overview of the overall approach.
- The only prerequisite is the existence of an ElasticSearch test cluster. The easiest way is to create a 14-day test account on https://www.elastic.co/, which gives you all you need (and allows you to play around with some of the nifty features of ElasticSearch)
- The `HelloWorld` program expects the ElasticSearch instance endpoint, as well as a user and password (which you can create in the ElasticSearch Admin GUI)
Have fun!
Source: https://towardsdatascience.com/elasticsearch-on-steroids-with-avro-schemas-3bfc483e3b30