How To Create Avro Schema
ElasticSearch On Steroids With Avro Schemas
How to tackle the interface version explosion in a large enterprise setup
What is it about
The following article explains an approach that uses modern data storage and serialization technologies to decouple components and stop the explosion of service interface versions inherent in large enterprise setups with many data consumers.
Being responsible in a large financial institute for a heavily used data provider system, I'm constantly confronted with endless deployment iterations of new service versions and the inability of multiple consumers to move to the new version in a decent timeframe. Over time, this results in a landscape with a multitude of parallel versions running for a single service.
A word of caution: the approach is based on a NoSQL data storage infrastructure, and I'm fully aware that such data storage may not be suited for all kinds of business applications. Nevertheless, there are more than enough use cases that would fit nicely into this type of architecture.
Interaction Journey
Our tutorial covers the following scenario:
We have a Service Component that processes user input provided via a React-based browser app and persists the data in an ElasticSearch-cluster-based persistency store.
- (3): The `ReactUI` sends a localized JSON data payload (user input) to the `ServiceComponent`
- (4): The `ServiceComponent` de-localizes the JSON data payload by replacing localized text with reference code values and prepares an Avro binary message (serialization), which is then sent to the `BackendComponent`
- (5): The `BackendComponent` de-serializes the Avro binary message, transforms it into an Avro JSON message, and then stores it in the `ElasticSearch Cluster`
Avro Schema
A key feature of an Avro message is that it is self-describing via its associated Avro Schema. The Avro Schema is a JSON-based definition of the message structure.
Refer to the below simple Avro Schema.
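A minimal schema with the fields discussed here might look as follows (the record name and namespace are assumptions, not taken from the tutorial):

```json
{
  "type": "record",
  "name": "Person",
  "namespace": "tutorial.avro",
  "fields": [
    {"name": "personid", "type": "long"},
    {"name": "lastname", "type": "string"},
    {"name": "surname", "type": ["null", "string"], "default": null}
  ]
}
```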
The example already outlines some specifics of the Avro Schema definition language.

- `personid` and `lastname` are mandatory attributes of type `long` and `string`
- `surname` is a union attribute, i.e., it can be either `null` or have a value of type `string`. By default, its value is `null`.

Optional values are always expressed as unions, and to be prepared for seamless Schema Evolution (more on that later), you should always define a default value for optional attributes.
Avro Schema Parser and Client Bindings Generator
The Schema is used by the Avro serializer and deserializer to parse an Avro binary message (or an Avro JSON message, depending on your chosen configuration) into Java Data Access Objects (DAOs). These DAOs may be generic or pre-compiled, strongly typed schema Java classes.

I.e., the Maven goal `generate-sources` generates a class `Person` which has three attributes: `personId`, `lastName`, and `firstname`. The person object can then be instantiated out of a binary message of this type by using the Avro parser.
To be self-describing, an Avro Message may be complemented with the Schema itself or a Schema Fingerprint.
Provisioning of the full schema JSON is usually done in file-based approaches, which bundle a large number of Avro messages. In our interaction journey (a request-response style with a single Avro message), such an approach would be too much overhead. For this scenario, the schema fingerprint is the right approach.
Avro Fingerprint and Schema Directory
The Avro Schema Fingerprint is a globally unique identifier for referencing the correct schema in a central Schema Registry.
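Avro's standard 64-bit fingerprint is the CRC-64-AVRO (Rabin) checksum defined in the Avro specification. As a sketch, here is the table-driven routine from the spec, transcribed to Python for brevity (note that real implementations compute the fingerprint over the schema's Parsing Canonical Form; canonicalization is omitted here):

```python
# CRC-64-AVRO (Rabin) fingerprint, table-driven, per the Avro specification.
EMPTY = 0xC15D213AA4D7A795  # the spec's "empty" seed value

def _fp_table():
    # Precompute the 256-entry lookup table from the seed polynomial.
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table

_TABLE = _fp_table()

def rabin_fingerprint(data: bytes) -> int:
    """64-bit fingerprint of the schema bytes passed in."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ _TABLE[(fp ^ byte) & 0xFF]
    return fp
```

The fingerprint is deterministic, so the same schema text always yields the same 64-bit value, while any change to the schema produces a new one.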
The below abstract SchemaRegistry class describes the methods required by such a Registry.
- The method `registerSchema` allows a sender (publisher) to register the Avro schema of the message it sends out to readers (consumers). As a return value, the Avro fingerprint is returned, which uniquely identifies this schema. Any change to the schema itself will result in a new fingerprint.
- The fingerprint is calculated with the method `getSchemaFingerprint`
- The method `getSchema` will return the Avro JSON associated with the passed-in `fingerprint`
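Put together, an in-memory toy version of such a registry could look like this (sketched in Python for brevity; the tutorial's version is a Java class, and the SHA-256-based fingerprint below is only a stand-in for Avro's CRC-64-AVRO):

```python
import hashlib

class InMemorySchemaRegistry:
    """Toy registry implementing the three methods described above."""

    def __init__(self):
        self._schemas = {}  # fingerprint -> schema JSON text

    def get_schema_fingerprint(self, schema_json: str) -> int:
        # Stand-in for CRC-64-AVRO: any stable 64-bit hash of the
        # schema text illustrates the contract.
        digest = hashlib.sha256(schema_json.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def register_schema(self, schema_json: str) -> int:
        # Registering is idempotent: the same schema maps to the same fingerprint.
        fp = self.get_schema_fingerprint(schema_json)
        self._schemas[fp] = schema_json
        return fp

    def get_schema(self, fingerprint: int) -> str:
        return self._schemas[fingerprint]
```

A writer calls `register_schema` once and ships only the fingerprint with each message; readers exchange the fingerprint for the full schema via `get_schema`.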
That's it: with a global Schema Registry established, which readers and writers of Avro messages can use to exchange a schema fingerprint for the associated JSON schema, you are ready to exploit the full power of Avro.
The tutorial provides two implementations of a Schema Registry:
- a File-based implementation for lightweight testing, as well as an
- ElasticSearch (ES)-based implementation. If your company already runs a production ES cluster, you can easily leverage it as your company-wide schema registry by introducing a dedicated ES index.
- There are off-the-shelf Schema Registries available, especially in the Kafka domain, for example, the Schema Registry of the Confluent Kafka product. But we keep it as simple as possible and leverage ElasticSearch, which is used as our target data store.
The below example shows the schema registered in ElasticSearch under its fingerprint 8354639983941950121 in the index `avroschema`. Any writer or reader of messages will use the Avro fingerprint to reference the schema underlying its processing.
Using ElasticSearch, you get out of the box a central repository of schema definitions, which can be easily enhanced with lookup and query capabilities used by business analysts, designers, or developers during system development.
You are wrong if you think there is a lot of coding necessary to implement a Schema Registry.
Below you see the fully functioning `ElasticSearchSchemaRegistry` Java class.
ESPersistency Manager
Pretty straightforward and compact. You could now argue that the `ESPersistencyManager` class hides the complexity (it is used and implemented as part of the tutorial).
Well, not really: the class is an ultra-light layer around Jest, an HTTP Java client for Elasticsearch. While Elasticsearch provides its own native Java client, Jest offers a more fluent API and a more natural interface to work with. If you're interested in Jest, check out Blaedung's tutorial article.
Our `ESPersistencyManager` just shields our tutorial classes from direct Jest exposure, an encapsulation technique that allows replacing the persistency manager in a future version.
Again, very compact to get an Avro Schema persisted. It's worth mentioning that you can provide ElasticSearch with your own primary key (called the `_id` in ES). In our tutorial, we use the globally unique fingerprint as the primary identifier in ES, which makes the lookup straightforward (see screenshot above). Providing your own primary key is a key feature, especially when using ElasticSearch in a data replication scenario, where ES is only the slave of another master system that has already generated primary keys.
As you can imagine, retrieving a JSON object back is even simpler; Jest does the job for us.
Backend Component Startup Sequence
We should now have a pretty good understanding of what must be prepared and configured by the Backend component using our approach.
The diagram below outlines the most important steps:
- `1,2`: we create a connection to our ElasticSearch cluster
- `10,11`: we create (if not existing) two ES indices, `businessmodel` and `avroschema`. The first one is used for our domain data records, the latter for the Avro Schemas
- `20,30,31`: we load and write a customized ES mapping for our `avroschema` index. This needs some explanation: ES needs some schema information about its persisted JSON documents (this is an ES document schema and shouldn't be mixed up with the Avro schema). By default, ES will automatically assign an ES mapping schema by analyzing the provided JSON document. In our case, we have to adjust the ES mapping to our needs and tell ES not to interpret the JSON part provided in the attribute `avroschema` (by switching the type to `object`).
- `40–46`: we now load our Avro Message Schema from the file system and register it in the ES Schema Registry. In return, we get our Avro Schema Fingerprint, which is our universal interface contract identifier for provisioning/consuming our messages. The Schema defines one of our Business Domain Data Objects (DO), the "Business Model Strategy DO," which collects strategy data about companies. Besides simple types, the schema also introduces an Enum for illustration purposes. The `doc` attribute allows providing a textual explanation, which is helpful if people are browsing schemas in a registry.
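For steps 20,30,31, a customized mapping along these lines would stop ES from indexing the schema body. This is only a sketch: the exact mapping in the tutorial may differ, and depending on your ES version the mapping may additionally need a type name; `"enabled": false` is the standard ES way to store but not interpret a JSON subtree.

```json
{
  "mappings": {
    "properties": {
      "avroschema": {
        "type": "object",
        "enabled": false
      }
    }
  }
}
```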
Service Component Startup Sequence
Let's check out what the Service Component has to prepare.
- `10–11`: we create a connection object (`ESPersistencyManager`) to the ES cluster
- `20,21,22`: we initialize an ES-based Schema Registry, passing in our connection object
- `30,31`: we retrieve the Schema based on the fingerprint version upon which the Service component was built.
Message Version Evolvement Over Time
The last step, resolving a schema based on the "pre-configured static" fingerprint, must be understood in detail.
The Schema fingerprint represents the message version number, which either the ServiceComponent or BackendComponent is capable of processing.
In our simple Hello World example, the schema fingerprint is identical for the `BackendComponent` and `ServiceComponent`. But in reality, the `BackendComponent` as well as one or multiple `ServiceComponent`(s) may evolve at different change speeds over time.
This scenario is illustrated in the below diagram.
- Think of a `BackendComponent`, which is released with its `ServiceRWComponent1` in Release 1 with Schema Version (Fingerprint) `100`. Two read-only components, `ServiceROComponent2` and `ServiceROComponent3`, are also consuming data with version `100`.
- Three months later, the `BackendComponent` releases a new version 2, which supports an enhanced data model (some additional optional attributes) with schema version `101`. Its `ServiceRWComponent1` also supports this new version.
- For the two independent data consumers `ServiceROComponent2` and `ServiceROComponent3`, which may evolve at different change speeds, the situation is as follows: `ServiceROComponent2` has no release of its own and stays on version `100`; `ServiceROComponent3` requires the additional data elements of version `101` and therefore also releases a new overall version of itself.
Interface Version Explosion
There we are in real life. If you are the owner of a `BackendComponent` that offers data sets important to a lot of independent data consumers (`ServiceROComponent2` and `ServiceROComponent3`), you will be confronted with the need to support multiple interface versions. This version support requirement may grow dramatically over time and impacts your time to deliver, your flexibility, and the costs of your components.
As a data provider to third-party components (which are not under your control), you may have only limited power to force them to upgrade to newer interface versions.
Service Versioning and its management over the whole lifecycle of your component play a vital (and sometimes cumbersome) role for a lot of data providers.
Running SOAP web services backed by the WSDL interface definition language, or JSON services supported by the OpenAPI Specification 3 (i.e., Swagger), two very popular request-response interaction technologies, will expose you to constant version changes whenever you enhance your published interfaces.
The version change is explicit and must be adopted by your consumers, who may be out of reach for a single coordinated version upgrade that would allow removing the older version from your production component.
And here Avro shines with its Schema Evolution support out of the box.
Avro Schema Evolution
Avro supports schema evolution, which means that you can have producers and consumers of Avro messages with different versions of the schema at the same time. It all continues to work (as long as the schemas are compatible). Schema Evolution is a crucial feature required in large production systems to decouple components and allow them to run system updates with interface changes at different times.
The Avro algorithm implementation requires that the reader of an Avro message has the same version of the schema as the writer of the message. So how does Avro support schema evolution? If the reader is on another schema version, it may pass two different schemas into the Avro parser. The parser then uses its resolution rules to translate data from the writer schema into the reader schema.
Let's check that out on our `BackendComponent.convertAvroBinarToJSON` method. It's one of the tutorial's main methods: it converts an Avro binary message into an Avro JSON message format, which is afterward persisted in ES.
- The method gets passed in an Avro Single Object Encoded binary message, which is defined as follows:
Single Avro objects are encoded as follows: A two-byte marker, C3 01, to show that the message is Avro and uses this single-record format (version 1). The 8-byte little-endian CRC-64-AVRO fingerprint of the object's schema. The Avro object encoded using Avro's binary encoding. (Link)
- `Line 2`: We extract the schema fingerprint out of the message
- `Line 3`: We fetch the associated schema from our schema registry
- `Line 4`: We obtain the message payload
- `Line 7`: We check if the fingerprint of the received message is the same as the one used in the `BackendComponent`
- `Line 8`: If the fingerprint is the same, we create an Avro `GenericDatumReader` (which is used to decode the message) with the `BackendComponent` schema
- `Line 10`: If not, we create an Avro `GenericDatumReader` with the `BackendComponent` schema as well as the retrieved message schema. The reader will now apply the schema evolution logic during the decoding of the binary message.
- `Line 12`: We also create an Avro `GenericDatumWriter` to produce a JSON representation of the message. As already mentioned, Avro supports a compact binary encoding as well as a JSON-based encoding. You can get from one encoding to the other with some simple Avro helper classes.
- `Line 14`: We create a `binaryDecoder` for our binary `payload`, as well as
- `Line 18`: a `jsonEncoder` based on our `schema`
- `Line 19–23`: Finally, we decode our binary message and encode it into a JSON message format, which is ready to be persisted.
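The framing steps at the start of the method (extracting fingerprint and payload) are simple enough to sketch without the Avro library. Here is the single-object layout from the spec quoted above, transcribed to Python for brevity (the Avro-specific decoding of the payload itself is omitted):

```python
import struct

MARKER = b"\xc3\x01"  # two-byte header identifying Avro single-object encoding (v1)

def encode_single_object(fingerprint: int, payload: bytes) -> bytes:
    # marker + 8-byte little-endian CRC-64-AVRO schema fingerprint + binary payload
    return MARKER + struct.pack("<Q", fingerprint) + payload

def decode_single_object(message: bytes):
    # Returns (fingerprint, payload); raises if the marker is missing.
    if message[:2] != MARKER:
        raise ValueError("not an Avro single-object encoded message")
    (fingerprint,) = struct.unpack("<Q", message[2:10])
    return fingerprint, message[10:]
```

The extracted fingerprint is then exchanged for the writer schema at the registry, exactly as the walkthrough describes.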
The method `BackendComponent.persist` is responsible for this task, which is straightforward.
- we pass in the ElasticSearch `index` ("businessmodel"), the ElasticSearch `type` ("strategy") of our object, as well as the unique `id`
In a real-life scenario, the persist method would be exposed by the `BackendComponent` as a JSON, web service, or other remote interface.
Our persisted Avro JSON document is stored as part of the _source attribute.
Just think about what we have now achieved.
- Our `BackendComponent` can now serve JSON document structures that are fully compliant with the Avro message specification. The transformation to the JSON persistence format works fully automatically, based on the binary message received from the `ServiceComponent`.
- The JSON `_source` document structure (the payload) is fully self-describing. We injected the primary key `index_ipid` as well as the schema fingerprint `avro_fingerpint` as part of the payload. That means a consumer of the JSON message document can resolve the schema via ES to process the message, and also knows the unique storage identifier of the document in ES.
- We also injected additional common attributes that may be relevant, i.e., `last_update_timestamp` and `last_update_loginid`, to record when and by whom the document was updated. This would allow us to build up time series and an audit trail by introducing, for example, a Kafka-based event sink (which will be the topic of another article)
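Following this scheme, a persisted `_source` document could look roughly like this (the business payload attribute and all values are invented for illustration; the envelope attribute names are the ones mentioned above):

```json
{
  "index_ipid": "4711",
  "avro_fingerpint": "8354639983941950121",
  "last_update_timestamp": "2019-11-03T10:15:30Z",
  "last_update_loginid": "jdoe",
  "companyName": "ExampleCorp"
}
```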
Schema Evolution revisited
An essential aspect of data management is schema evolution. After the initial schema is defined, applications may need to evolve, as shown in the above diagram. When this happens, the downstream consumers must be able to handle data encoded with both the old and the new schema seamlessly.
We differentiate two compatibility types:
- backward compatibility means a consumer working with a newer schema may read data based on an older schema
- forward compatibility means a consumer working with an older schema may read data based on a newer schema
Whether a schema evolution is backward- as well as forward-compatible depends on whether the change complies with Avro's Schema Resolution constraints.
Here are three essential rules out of the resolution spec:
- Reordering fields is possible without any impact on the reader or writer
- Adding a field to a record is ok, provided that you also give a default. The default value allows a reader with the new schema to read messages from a writer based on an older schema.
- Removing a field from a record is ok, as long as you gave a default definition initially. I.e., a reader on an old schema receives the default value.
This implies that adding optional fields to, or removing them from, your data message keeps the message fully compatible, as long as a default value was or will be defined for the removed or added attribute. So when you start defining your schema, you should consider that from the beginning.
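The effect of these rules can be illustrated with a toy resolver (plain Python, not Avro's actual implementation): reader fields take the writer's value when present, fall back to their default otherwise, and writer-only fields are dropped.

```python
def resolve(writer_datum: dict, reader_schema: dict) -> dict:
    """Toy illustration of Avro's record resolution rules."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_datum:
            # Field known to both schemas: take the writer's value.
            resolved[name] = writer_datum[name]
        elif "default" in field:
            # Field added in the reader's schema: fall back to its default.
            resolved[name] = field["default"]
        else:
            raise ValueError("no value and no default for field '%s'" % name)
    # Fields only the writer knows (removed in the reader's schema) are dropped.
    return resolved
```

For example, a reader whose schema gained a defaulted `surname` field can still resolve an old datum that lacks it, while a field the reader no longer declares simply disappears from the result.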
Adding or removing optional data attributes in a data message is a scenario that happens all the time in production systems, which evolve to fulfill new business requirements.
Enhanced Versioning Strategy Explained
Let's see how Avro helps us reduce versioning complexity and component coupling in a simple example, by comparing a classic application setup with our enhanced one using ElasticSearch and Avro.
Classic Application
Assume the following component setup in a productive environment.
- We have an application that covers a particular bounded context (i.e., a CRM client profiling application), consisting of a React maintenance UI `ReactMaintainUI` backed by a presentation service component `ServiceComponent` that offers a JSON localized payload interface to the UI. We assume it's a Swagger-style OpenAPI interface. It connects to a backend component `BackendComponent`, which offers a Webservice interface and persists its data in a relational database `RelationalDB`
- The application offers its API not only to its own UI but also integrates a 3rd-party UI `JSONConsumer1` via its JSON API, as well as a third-party data consumer, which reads data over the Webservice interface of the `BackendComponent`
- When the application is initially released, the following interface and DB schema versions are frozen and made available for consumption by consumers: `JSON_V100`, `WSDL_V100`, as well as `DBSchema_V100`. All these artifacts represent an agreed contract to consuming applications, be it the ones managed within the bounded context (which are normally under your control) or third-party components.
- Assume now that our bounded context application has multiple release cycles in a short period, introducing new features and optional data attributes. This is a typical example in an agile setup, where you may start with a Minimum Viable Product (MVP) and enhance the application in short release cycles.
- In our example, we released a new version `V101`, which requests additional optional data attributes via the `ReactMaintainUI`
- We added a new interface and schema version for our bounded context application: `JSON_V101`, `WSDL_V101`, and `DBSchema_V101`. This results in changes to the red-colored components.
- Our existing 3rd-party consumers, `JSONConsumer1` and `WSDLConsumer1` (the green ones), are fine with the current interfaces (they don't require the new data), have no release planned, and stay on the initial versions `JSON_V100` and `WSDL_V100`.
- There you are. As the bounded context application owner, you now have to start managing your interface versions and potentially provide multiple versions to your data consumers.
- Especially if your data is required by multiple other applications (i.e., CRM data), the need will grow quickly along with your changing application. This may result in multiple versions of the same service, and believe me, forcing third-party consumers to migrate to new versions is a difficult undertaking.
- Migrating to a new interface version means code changes (at least recompiling your client bindings), testing, and releasing the component. In the end, these are costs on the consumer side, which consumers try to avoid whenever possible.
New Generation Application
Now let's check the scenario with our new bounded context application, which uses in its `BackendComponent` an Avro message interface supporting schema evolution, as well as a NoSQL data store.
- We still have an explicit interface contract `JSON_V100`, which is exposed to UIs as a REST-based interface.
- The `BackendComponent` offers a straightforward JSON interface to consumers for passing in an Avro message in the binary-encoded single-object format. A very simplified JSON interface would allow us to pass in a Base64-encoded string, which represents our binary single-object encoded Avro message. In return, the service would give back the unique identifier of the Avro object, as well as the fingerprint of the `BackendComponent` writer. This allows the interface consumer to detect whether it is working on the same version.
- For the sake of the tutorial, that should be sufficient; we will look into the details of such an interface in a subsequent article.
- As you can see in the above diagram, the `BackendComponent` interface, as well as the `EsIndex1` schema, are not explicitly versioned. The versioning is managed transparently as long as our interface enhancements stay within the constraints of the Avro schema evolution rules.
- Adding additional optional attributes would now result only in changes to the `ReactMaintainUI` component and the corresponding `ServiceComponent`, which offers an updated OpenAPI-compliant JSON interface.
- Our `BackendComponent` and persistence schema `Avro_JSON_Store` may stay unaffected (or, minimally, we have to run a re-index on ES in case we need some specific schema settings).
- We could finally decouple our `BackendComponent` from the evolving `ServiceComponent` and reduce the amount of development and testing effort, as well as the overall complexity of our application.
This ultimately results in a faster delivery cycle and lower change costs for introducing new features in your component.
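The very simplified Base64 JSON interface mentioned in the scenario above could look like this on the wire (a Python sketch; the field name `message` and the function names are invented for illustration):

```python
import base64
import json
import struct

def build_request(fingerprint: int, avro_payload: bytes) -> str:
    # Client side: frame the payload as an Avro single-object message and
    # wrap it Base64-encoded inside a JSON body.
    single_object = b"\xc3\x01" + struct.pack("<Q", fingerprint) + avro_payload
    return json.dumps({"message": base64.b64encode(single_object).decode("ascii")})

def parse_request(body: str):
    # Server side: unwrap Base64, then split marker / fingerprint / payload.
    raw = base64.b64decode(json.loads(body)["message"])
    if raw[:2] != b"\xc3\x01":
        raise ValueError("not an Avro single-object encoded message")
    (fingerprint,) = struct.unpack("<Q", raw[2:10])
    return fingerprint, raw[10:]
```

The server would answer with the stored object's unique identifier plus its own writer fingerprint, letting the consumer detect a version mismatch.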
Wrap it Up
So it's time for you to check out the tutorial, which can be found on Github: https://github.com/talfco/tutorial-apache-avro (a Maven-based tutorial).
- Start with the `HelloWorld` class, which establishes and simulates the interaction flow between the `ServiceComponent` and `BackendComponent`. From there, you can drill down into the lightweight supporting classes, which give you a good overview of the overall approach.
- The only prerequisite is the existence of an ElasticSearch test cluster. The easiest way is to create a 14-day test account on https://www.elastic.co/, which gives you all you need (and allows you to play around with some of the nifty features of ElasticSearch)
- The `HelloWorld` program expects the ElasticSearch instance endpoint, as well as a user and password (which you can create in the ElasticSearch Admin GUI)
Have fun!
Source: https://towardsdatascience.com/elasticsearch-on-steroids-with-avro-schemas-3bfc483e3b30