Avro Serialization


In this post, we will give a basic introduction to Avro serialization.

What is Avro Serialization?

Avro is one of the most popular data serialization and deserialization frameworks, and it integrates well with almost all Hadoop platforms. The Avro framework was created by Doug Cutting, the creator of Hadoop, and it is now a full-fledged project under the Apache Software Foundation.

Need for Avro Serialization:

Hadoop's native library provides Writables for data serialization (converting object data into a byte stream) and deserialization (converting byte stream data back into object data), and it also provides support for Sequence Files, which store the data in a binary format. These are the only two mechanisms provided by Hadoop for data serialization.
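
For example, a Hadoop Writable such as IntWritable is serialized and deserialized like this (a minimal sketch, assuming hadoop-common is on the classpath):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableExample {
    public static void main(String[] args) throws Exception {
        // Serialization: write the Writable's fields to a byte stream
        IntWritable before = new IntWritable(163);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        before.write(new DataOutputStream(bytes));

        // Deserialization: rebuild the object from the byte stream
        IntWritable after = new IntWritable();
        after.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(after.get()); // prints 163
    }
}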

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API, so they cannot be written or read in any other language.

As a result, files created in Hadoop with the above two mechanisms cannot be read by any other language, which locks the data into the Java ecosystem. To address this drawback, Doug Cutting created Avro, a language-independent data serialization system.

Avro Serialization Features:

  • Avro is a language-neutral data serialization system, and its data can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
  • Avro creates a binary structured format that is both compressible and splittable, so it can be used efficiently as the input to Hadoop MapReduce jobs.
  • Avro provides rich data structures. For example, we can create a record that contains an array, an enumerated type, and a sub-record. These can be created in any language, processed in Hadoop, and the results fed to yet another language.
  • Avro schemas are defined in JSON. This facilitates implementation in languages that already have JSON libraries.
  • In an Avro data file, the schema is stored along with the data in a metadata section, which makes the file self-describing.
  • Avro is also used in RPC (Remote Procedure Call); during the connection handshake, the client and server exchange schemas.
  • Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, as the sketch below illustrates.
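
As a minimal sketch of the last two points (using the Avro Java generic API; the User schema and the users.avro file name are made up for this example), we can parse a JSON schema at runtime and write records without any generated classes:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWithoutCodegen {
    public static void main(String[] args) throws Exception {
        // Avro schemas are defined in JSON
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record generically -- no generated User class is needed
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "siva");
        user.put("age", 30);

        // The data file stores the schema in its metadata section,
        // which makes the file self-describing
        DataFileWriter<GenericRecord> writer =
                new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, new File("users.avro"));
        writer.append(user);
        writer.close();
    }
}

Because the schema travels inside users.avro, any Avro implementation in any language can read this file back.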
Comparison with Other Cross-Language Serialization Frameworks:

There are other serialization frameworks that provide a language-independent serialization mechanism, most notably Protocol Buffers (by Google) and Thrift (by Apache).

  • These frameworks require code to be generated (from the schema) to read or write data files, whereas code generation is optional in Avro.
  • The schema is not stored with the data in Thrift or Protocol Buffers, but it is in Avro. Since the schema is present when data is read, considerably less type information needs to be encoded with the data.
  • Avro has rich schema resolution capabilities. The schema used to read data need not be identical to the schema that was used to write the data. For example, a new, optional field may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded data, it will gracefully ignore the new field and carry on processing as it would have done with old data. A sketch of this is shown below.
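
Below is a minimal sketch of such schema resolution, reusing the hypothetical users.avro file from the previous example: the reader's schema adds an optional email field with a default value, so files written with the old two-field schema still resolve cleanly:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class SchemaResolutionExample {
    public static void main(String[] args) throws Exception {
        // New reader schema: adds an optional "email" field with a default value
        String readerJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"},"
                + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}";
        Schema readerSchema = new Schema.Parser().parse(readerJson);

        // The writer's schema is read from the file's metadata; Avro resolves
        // any differences between the writer's and the reader's schema
        GenericDatumReader<GenericRecord> datumReader =
                new GenericDatumReader<GenericRecord>(readerSchema);
        DataFileReader<GenericRecord> fileReader =
                new DataFileReader<GenericRecord>(new File("users.avro"), datumReader);
        for (GenericRecord user : fileReader) {
            System.out.println(user.get("name") + " " + user.get("email")); // email falls back to null
        }
        fileReader.close();
    }
}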

Avro Installation:

Avro installation is a very simple and straightforward process. All we need to do is download the required binary jar files onto our cluster and add them to the classpath. In this post, we will show the installation of Avro for a Java environment.

Avro mainly requires the below jar files to be present in the classpath. These jar files contain all the classes for the compiler, hadoop, mapred, and mapreduce packages.

avro-mapred-x.y.z-hadoop2.jar
avro-tools-x.y.z.jar

In this section, we will follow the below steps to install Avro on an Ubuntu machine.

  • Download the latest stable versions of the above jar files from the Apache download mirrors. At the time of writing this post, avro-1.7.7 was the latest stable version, but hadoop-2.3.0 shipped with avro-1.7.4.jar in its $HADOOP_HOME/share/hadoop/common/lib directory, and the same 1.7.4 version is used to describe the installation process in this post.
  • Copy this avro-mapred-x.y.z-hadoop2.jar into the hadoop distribution folders, usually into $HADOOP_HOME/share/hadoop/tools/lib and $HADOOP_HOME/share/hadoop/common/lib, which contain the jar files for many other tools.
  • Add the above folders to the classpath in the .bashrc file, for example as shown below.
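
Lines like the following can be appended to the .bashrc file (the $HADOOP_HOME path here is an assumption; adjust it to the actual installation):

export HADOOP_HOME=/usr/lib/hadoop    # assumed install location; adjust as needed
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*:$HADOOP_HOME/share/hadoop/common/lib/*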


Note: Avro has dependencies on Paranamer and Jackson JSON, but the jar files required for these (paranamer-*.jar, jackson-core-asl-*.jar and jackson-mapper-asl-*.jar) are already included in the $HADOOP_HOME/share/hadoop/tools/lib folder of the hadoop 2 distribution. If not, we need to download them and place them in this folder.

Available tools in Avro:
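
Running the avro-tools jar with no arguments prints a usage message listing the tools available in that version:

java -jar avro-tools-1.7.4.jar

The list includes tools such as compile (generate Java classes from a schema), fromjson and tojson (convert between JSON and Avro data files), getschema and getmeta (print a data file's schema and metadata), and fromtext and totext. The exact set of tools varies from version to version.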

