Working with Apache Avro to manage Big Data Files

What is Avro?

Apache Avro is a language-neutral data serialization system and a preferred tool for serializing data in Hadoop. Serialization is the process of translating data structures or object state into binary or textual form so that the data can be transported over a network or stored on persistent storage. Once the data has been transported or retrieved from storage, it needs to be deserialized again.

Avro is not only language-independent but also schema-based. Avro serializes data into a compact binary format that can be deserialized by any application that has the schema.

Avro uses JSON to declare data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby. It serializes quickly and the resulting data is small; Avro data files are also compressible and splittable. The schema is stored along with the Avro data in a file for any further processing. In RPC, the client and the server exchange schemas during the connection handshake.

Download/Install/Configure

Download Avro from the link below:

http://www-us.apache.org/dist/avro/avro-1.7.7/

You can select and download the library for any of the languages provided. In this post, we use Java. Hence download the jar files avro-1.7.7.jar and avro-tools-1.7.7.jar.

Setting Classpath
**********************

To work with Avro in a Linux environment, download the following jar files and place them in a location of your choice.

avro-1.7.7.jar
avro-tools-1.7.7.jar

After copying these files into a folder, add the folder to the classpath in your ~/.bashrc or ~/.bash_profile file.

[hdpclient@en01 ~]$ echo $CLASSPATH

.:/usr/java/default/jre/lib:/usr/java/default/lib:/usr/java/default/lib/tools.jar:/usr/hadoopsw/hadoop-2.7.3/lib/*:.:/usr/hadoopsw/apache-hive-2.1.1-bin/lib/*:.:/usr/hadoopsw/db-derby-10.13.1.1-bin/lib/derby.jar:/usr/hadoopsw/db-derby-10.13.1.1-bin/lib/derbyclient.jar:/usr/hadoopsw/db-derby-10.13.1.1-bin/lib/derbytools.jar:/u01/app/oracle/product/12.2.0.1/db_1/jlib:/u01/app/oracle/product/12.2.0.1/db_1/rdbms/jlib:/usr/hadoopsw/apache-flume-1.7.0-bin/lib/*

#class path for Avro

export CLASSPATH=$CLASSPATH:/usr/hadoopsw/avro/*

Creating Avro Schemas

Avro, being a schema-based serialization utility, accepts schemas as input. Although various schema languages are available, Avro follows its own standard for defining schemas. A schema describes the following details:

type of file (record by default)
location of the record (its namespace)
name of the record
fields in the record with their corresponding data types

Using these schemas, you can store serialized values in binary format using less space. The values are stored without per-field metadata such as field names or type tags, because the schema supplies that information.
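To make the "less space, no metadata" point concrete, here is a minimal sketch of how the values of a record might be laid out in Avro's binary encoding. It is written with only the JDK rather than the Avro library, and it assumes the rules given in the Avro specification: an int is written as a zig-zag variable-length integer, a string is written as its byte length followed by its UTF-8 bytes, and a record is simply its fields written one after another in schema order. The class AvroEncodingSketch, its methods, and the sample record (a string field followed by an int field) are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Avro library: hand-encodes values
// the way the Avro specification describes its binary encoding.
final class AvroEncodingSketch {

    // Writes an int as a zig-zag, variable-length integer (7 payload bits per byte).
    static void writeInt(ByteArrayOutputStream out, int value) {
        int n = (value << 1) ^ (value >> 31); // zig-zag keeps small magnitudes small
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);     // high bit set: more bytes follow
            n >>>= 7;
        }
        out.write(n);
    }

    // Writes a string as its UTF-8 byte length followed by the bytes themselves.
    static void writeString(ByteArrayOutputStream out, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Encodes one record whose schema lists a string field, then an int field.
    // No field names or type tags are written; only the values.
    static byte[] encodeRecord(String name, int age) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeInt(out, age);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeRecord("foo", 27)) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(hex.toString().trim()); // prints "06 66 6f 6f 36"
    }
}
```

The whole record takes five bytes: one for the string length, three for the characters, and one for the int. A reader can only interpret those bytes if it knows the schema, which is why Avro keeps the schema with the data file.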

An Avro schema is written in JavaScript Object Notation (JSON), a lightweight text-based data interchange format. A schema takes one of the following forms:

A JSON string, naming a defined type
A JSON object, defining a new type
A JSON array, representing a union of types

Example

************

The following schema defines a record-type document in the "myns" namespace. The document is named "emp" and contains two fields, "name" and "age".

{

"type" : "record",

"namespace" : "myns",

"name" : "emp",

"fields" : [

{ "name" : "name" , "type" : "string" },

{ "name" : "age" , "type" : "int" }

]

}

The schema contains four attributes, briefly described below:

type − Describes document type, in this case a "record".

namespace − Describes the name of the namespace in which the object resides.

name − Describes the schema name.

fields − This is an attribute array which contains the following −

name − Describes the name of field

type − Describes data type of field

Primitive Data Types of Avro

***********************************

An Avro schema can use primitive data types as well as complex data types. The primitive types are listed below.

null A type having no value.

boolean A binary value.

int 32-bit signed integer.

long 64-bit signed integer.

float single precision (32-bit) IEEE 754 floating-point number.

double double precision (64-bit) IEEE 754 floating-point number.

bytes sequence of 8-bit unsigned bytes.

string Unicode character sequence.
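As a rough guide for Java users, the sketch below lists the Java type that typically carries each Avro primitive type, including boolean, which Avro also defines. The table is an assumption based on how Avro's Java mapping is commonly described (bytes as a ByteBuffer, string as a CharSequence such as String or Avro's own Utf8 class); the class AvroPrimitiveTypesSketch is a hypothetical lookup table and not part of the Avro API.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative lookup table only; it is not part of the Avro API.
final class AvroPrimitiveTypesSketch {

    // Maps each Avro primitive type name to the Java type that usually carries it.
    static Map<String, Class<?>> javaTypes() {
        Map<String, Class<?>> types = new LinkedHashMap<>();
        types.put("null", Void.class);            // stand-in: the Java value is simply null
        types.put("boolean", Boolean.class);
        types.put("int", Integer.class);          // 32-bit signed
        types.put("long", Long.class);            // 64-bit signed
        types.put("float", Float.class);          // 32-bit IEEE 754
        types.put("double", Double.class);        // 64-bit IEEE 754
        types.put("bytes", ByteBuffer.class);
        types.put("string", CharSequence.class);  // e.g. String or Avro's Utf8
        return types;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Class<?>> entry : javaTypes().entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().getSimpleName());
        }
    }
}
```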

Complex Data Types of Avro

**********************************

Along with primitive data types, Avro provides six complex data types namely Records, Enums, Arrays, Maps, Unions, and Fixed.

Enum

*******

An enumeration is a fixed list of named symbols.

{

"type" : "enum",

"name" : "Numbers", "namespace": "data", "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]

}

name − The value of this field holds the name of the enumeration.

namespace − The value of this field contains the string that qualifies the name of the Enumeration.

symbols − The value of this field holds the enum's symbols as an array of names.
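A Java enum is a close analogue of this schema. The sketch below mirrors the Numbers symbols in the same order and exposes each symbol's zero-based position, which, according to the Avro specification, is the value written when an enum is serialized in binary form. The class AvroEnumSketch and its helper methods are hypothetical; a class generated by Avro tools would look different.

```java
// Hypothetical illustration; not the class that Avro tools would generate.
final class AvroEnumSketch {

    // Mirrors the "symbols" array of the Numbers schema, in the same order.
    enum Numbers { ONE, TWO, THREE, FOUR }

    // Zero-based position of a symbol within the schema's symbols array.
    static int positionOf(Numbers symbol) {
        return symbol.ordinal();
    }

    // Looks a symbol up again from its position.
    static Numbers fromPosition(int position) {
        return Numbers.values()[position];
    }

    public static void main(String[] args) {
        System.out.println(positionOf(Numbers.THREE)); // prints 2
        System.out.println(fromPosition(0));           // prints ONE
    }
}
```

Because only the position is stored, reordering the symbols in a schema changes the meaning of data that has already been written.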

Arrays

********

This data type defines an array field with a single attribute, items, which specifies the type of the array's elements.

{ "type" : "array", "items" : "int" }

Maps

*******

The map data type is a collection of key-value pairs.

{"type" : "map", "values" : "int"}

The values attribute holds the data type of the map's values. Avro map keys are implicitly taken as strings.
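In Java terms, the array and map schemas above correspond roughly to a List<Integer> and a Map<String, Integer>. The sketch below is only an analogy built from JDK collections; the class name and the sample contents are made up for illustration. Note that the map schema declares only the value type, because map keys are always strings.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// JDK-only analogy for the array and map schemas; sample data is invented.
final class AvroCollectionsSketch {

    // Analogue of {"type": "array", "items": "int"}.
    static List<Integer> sampleArray() {
        return Arrays.asList(10, 20, 30);
    }

    // Analogue of {"type": "map", "values": "int"}; keys are always strings.
    static Map<String, Integer> sampleMap() {
        Map<String, Integer> map = new LinkedHashMap<>();
        map.put("sal", 800);
        map.put("comm", 0);
        return map;
    }

    public static void main(String[] args) {
        System.out.println(sampleArray());
        System.out.println(sampleMap());
    }
}
```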

Unions

*********

A union data type is used whenever a field can hold more than one data type. Unions are represented as JSON arrays. For example, a field that could be either an int or null is represented by the union ["int", "null"].

{

"type" : "record",

"namespace" : "tutorialspoint",

"name" : "empdetails",

"fields" :

[

{ "name" : "experience", "type": ["int", "null"] }, { "name" : "age", "type": "int" }

]

}
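The following sketch shows how a value of the ["int", "null"] union might look once serialized. It assumes the binary encoding rules in the Avro specification: the zero-based position of the chosen branch is written first, followed by the value encoded with that branch's schema, and a null value contributes no bytes of its own. The class AvroUnionSketch is a hypothetical JDK-only helper, not part of the Avro library.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of the Avro library.
final class AvroUnionSketch {

    // Writes an int as a zig-zag, variable-length integer.
    static void writeInt(ByteArrayOutputStream out, int value) {
        int n = (value << 1) ^ (value >> 31);
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
    }

    // Encodes a value of the union ["int", "null"]:
    // branch 0 is "int", branch 1 is "null".
    static byte[] encodeIntOrNull(Integer value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value != null) {
            writeInt(out, 0);     // position of the "int" branch
            writeInt(out, value); // the int itself
        } else {
            writeInt(out, 1);     // position of the "null" branch; null adds no bytes
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeIntOrNull(5).length);    // prints 2
        System.out.println(encodeIntOrNull(null).length); // prints 1
    }
}
```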

Fixed

********

This data type is used to declare a fixed-sized field that can be used for storing binary data.

{ "type" : "fixed" , "name" : "bdata", "size" : 1048576}

name holds the name of the field, and size holds the size of the field in bytes.

Working Example

With a basic understanding of Avro, its schemas, and its data types, we can move on to a working example. I have a Presto data store with a table "emp". I want to query this table and store the result in an Avro data file, which requires small Java programs to write and read Avro files. The code provided here can easily be modified to suit specific requirements.

Defining a Schema

***********************

Create an Avro schema matching your query, as shown below, and save it as emp.avsc. I want to query only four columns from the "emp" table residing in the Presto data store.

{

"namespace": "myns",

"type": "record",

"name": "emp",

"fields": [

{"name": "empno", "type": "int"},

{"name": "ename", "type": "string"},

{"name": "sal", "type": "int"},

{"name": "comm", "type": "int"}

]

}

Compiling the Schema

***************************

After creating the Avro schema, we need to compile it using Avro tools. Avro tools are located in the avro-tools-1.7.7.jar file, so we need to provide the path to avro-tools-1.7.7.jar when compiling.

java -jar <path/to/avro-tools-1.7.7.jar> compile schema <path/to/schema-file> <destination-folder>

[hdpclient@en01 ~]$ java -jar /usr/hadoopsw/avro/avro-tools-1.7.7.jar compile schema /usr/hadoopsw/avro/schema/emp.avsc /usr/hadoopsw/avro/gen_code

Input files to compile:

/usr/hadoopsw/avro/schema/emp.avsc

After compilation, a package named after the schema's namespace is created in the destination directory. Within this package, a Java source file named after the schema is generated. The generated file contains Java code corresponding to the schema; an application can use this class directly to create data that conforms to the schema.
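To check where the generated file should appear, the following sketch derives the expected path from the destination folder, the namespace, and the schema name. It assumes the usual Java convention that each dot-separated part of a namespace becomes a directory; the class GeneratedSourcePathSketch and its method are hypothetical helpers and not part of Avro tools.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper; it only computes a path and does not run Avro tools.
final class GeneratedSourcePathSketch {

    // Builds <destination>/<namespace as directories>/<name>.java
    static Path expectedSource(String destination, String namespace, String name) {
        Path dir = Paths.get(destination);
        for (String part : namespace.split("\\.")) {
            dir = dir.resolve(part); // one directory per namespace segment
        }
        return dir.resolve(name + ".java");
    }

    public static void main(String[] args) {
        // Values taken from the compile command used in this post.
        System.out.println(expectedSource("/usr/hadoopsw/avro/gen_code", "myns", "emp"));
    }
}
```

For the schema in this post, the result is the emp.java file inside the myns folder under /usr/hadoopsw/avro/gen_code.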

The generated class contains:

A default constructor, and a parameterized constructor which accepts all the variables of the schema.
The setter and getter methods for all variables in the schema.
A getSchema() method which returns the schema.
Builder methods.
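To give a feel for the shape of the generated class, here is a much-simplified, hand-written stand-in for the emp schema. It is not the code that Avro tools produces: the real class extends Avro's base classes and also carries the schema, which is omitted here so that the sketch compiles with the JDK alone. The name EmpSketch and the exact field types are assumptions made for illustration.

```java
// Hand-written stand-in for illustration; the real class is produced by Avro tools.
final class EmpSketch {
    private int empno;
    private CharSequence ename;
    private int sal;
    private int comm;

    // Default constructor.
    EmpSketch() { }

    // Parameterized constructor accepting every field of the schema.
    EmpSketch(int empno, CharSequence ename, int sal, int comm) {
        this.empno = empno;
        this.ename = ename;
        this.sal = sal;
        this.comm = comm;
    }

    // Getters and setters, one pair per field.
    int getEmpno() { return empno; }
    void setEmpno(int value) { this.empno = value; }
    CharSequence getEname() { return ename; }
    void setEname(CharSequence value) { this.ename = value; }
    int getSal() { return sal; }
    void setSal(int value) { this.sal = value; }
    int getComm() { return comm; }
    void setComm(int value) { this.comm = value; }

    // Entry point for the builder style of construction.
    static Builder newBuilder() { return new Builder(); }

    static final class Builder {
        private int empno;
        private CharSequence ename;
        private int sal;
        private int comm;

        Builder setEmpno(int value) { this.empno = value; return this; }
        Builder setEname(CharSequence value) { this.ename = value; return this; }
        Builder setSal(int value) { this.sal = value; return this; }
        Builder setComm(int value) { this.comm = value; return this; }

        EmpSketch build() { return new EmpSketch(empno, ename, sal, comm); }
    }

    public static void main(String[] args) {
        EmpSketch e = EmpSketch.newBuilder()
                .setEmpno(7369).setEname("SMITH").setSal(800).setComm(0).build();
        System.out.println(e.getEmpno() + " " + e.getEname());
    }
}
```

The setter style and the builder style produce equivalent objects; the builder is convenient when a record is created in a single expression.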

Creating and Serializing the Data

****************************************

First, copy the Java file generated by compiling the schema (together with its package folder, e.g. myns) into the current directory (the application directory where you will write the Java programs that create and read Avro files), or import it from where it is located.

Now we can write a new Java file and instantiate the generated class (emp) to create employee records that conform to the schema.

Below are the steps to create the new Java file.

Step 1: Instantiate the generated emp class.

Step 2: Use setter methods to insert data

Step 3: Create an object of DatumWriter interface using the SpecificDatumWriter class. This converts Java objects into in-memory serialized format.

Step 4: Instantiate DataFileWriter for the generated emp class. This class writes a sequence of serialized records conforming to a schema, along with the schema itself, to a file. Its constructor requires the DatumWriter object as a parameter.

Step 5: Open a new file to store the data matching the given schema using the create() method. This method requires the schema and the path of the file where the data is to be stored as parameters.

Step 6: Add all the created records to the file using append() method

The following complete program shows how to serialize data into a file using Apache Avro. It reads data from a database (Presto, in our case) and serializes it.

///////////////////////////////////////////////////////////////////////////////////

import java.sql.*;

import java.io.File;

import java.io.IOException;

import org.apache.avro.file.DataFileWriter;

import org.apache.avro.io.DatumWriter;

import org.apache.avro.specific.SpecificDatumWriter;

import myns.*;

public class CreateAvroFile {

public static void main (String[] args) {

System.out.println ("Simple Avro File Creation Utility");

System.out.println();

String query = "select empno,ename,sal,comm from hive.scott.emp";

String avroFile="emp.avsc";

try{

// JDBC driver name and database URL

String JDBC_DRIVER = "com.teradata.presto.jdbc4.Driver";

String CONNECTION_URL = "jdbc:presto://en01:6060;User=presto;";

//Register JDBC driver

Class.forName(JDBC_DRIVER);

// Open a connection

Connection connection = DriverManager.getConnection(CONNECTION_URL);

System.out.println("Connection Established...");

System.out.println();

//Execute a query

Statement stmt = connection.createStatement();

ResultSet rs = stmt.executeQuery(query);

//Instantiate necessary objects for serialization

emp e=new emp(); //Step 1

//Instantiate DatumWriter class

DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class); //Step 3

DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter); //Step 4

empFileWriter.create(e.getSchema(), new File("/usr/hadoopsw/avro/emp.avro")); //Step 5

//Extract data from result set and use for serialization

//System.out.println("EMPNO,ENAME,SAL,COMM");

while(rs.next()){

//Retrieve by column name

String empno = rs.getString("empno");

String ename = rs.getString("ename");

String sal = rs.getString("sal");

String comm = rs.getString("comm");

//Display values

String rec = empno+","+ename+","+sal+","+comm;

//System.out.println(rec);

//Serializing the Data, see emp.java file generated by avro compilation

//Creating values according the schema - Step 2

e.setEmpno(Integer.parseInt(empno));

e.setEname(ename);

e.setSal(Integer.parseInt(sal));

e.setComm(Integer.parseInt(comm));

System.out.println(e.toString());

empFileWriter.append(e); //Step 6

//break;

}//while ends

//Clean-up environment

rs.close();

stmt.close();

empFileWriter.close();

connection.close();

System.out.println();

System.out.println("Above data successfully serialized in emp.avro");

}catch(Exception ex){System.out.println(ex.toString());}

} //main ends

} //class ends

///////////////////////////////////////////////////////////////////////////////////

Compile and run the utility to test

[hdpclient@en01 avro]$ javac CreateAvroFile.java

[hdpclient@en01 avro]$ java CreateAvroFile

Simple Avro File Creation Utility

Connection Established...

{"empno": 7369, "ename": "SMITH", "sal": 800, "comm": 0}

{"empno": 7499, "ename": "ALLEN", "sal": 1600, "comm": 300}

{"empno": 7521, "ename": "WARD", "sal": 1250, "comm": 500}

{"empno": 7566, "ename": "JONES", "sal": 2975, "comm": 0}

{"empno": 7654, "ename": "MARTIN", "sal": 1250, "comm": 1400}

{"empno": 7698, "ename": "BLAKE", "sal": 2850, "comm": 0}

{"empno": 7782, "ename": "CLARK", "sal": 2450, "comm": 0}

{"empno": 7788, "ename": "SCOTT", "sal": 3000, "comm": 0}

{"empno": 7839, "ename": "KING", "sal": 5000, "comm": 0}

{"empno": 7844, "ename": "TURNER", "sal": 1500, "comm": 0}

{"empno": 7876, "ename": "ADAMS", "sal": 1100, "comm": 0}

{"empno": 7900, "ename": "JAMES", "sal": 950, "comm": 0}

{"empno": 7902, "ename": "FORD", "sal": 3000, "comm": 0}

{"empno": 7934, "ename": "MILLER", "sal": 1300, "comm": 0}

Above data successfully serialized in emp.avro

Deserialization by Generating a Class

**********************************************

One can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library.

For this post, I read the schema by generating a class and deserialize the data using Avro. The procedure is as follows:

Step 1: Create an object of DatumReader interface using SpecificDatumReader class.

Step 2: Instantiate the DataFileReader class. This class reads serialized data from a file. Its constructor requires the path of the file (emp.avro) where the serialized data is stored, and the DatumReader object, as parameters.

Step 3: Print the deserialized data, using the methods of DataFileReader.

The following complete program shows how to deserialize the data in a file using Avro.

//////////////////////////////////////////////////////////////////

import java.io.File;

import java.io.IOException;

import org.apache.avro.file.DataFileReader;

import org.apache.avro.io.DatumReader;

import org.apache.avro.specific.SpecificDatumReader;

import myns.*;

public class ReadAvroFile {

public static void main(String args[]) throws IOException{

//DeSerializing the objects

DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class); //Step 1

//Instantiating DataFileReader

DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new

File("/usr/hadoopsw/avro/emp.avro"), empDatumReader); //Step 2

emp em=null;

while(dataFileReader.hasNext()){ //Step 3

em=dataFileReader.next(em);

System.out.println(em);

}

dataFileReader.close();

} //main ends

} //class ends

//////////////////////////////////////////////////////////////////

Compile and test deserialization

[hdpclient@en01 avro]$ javac ReadAvroFile.java

[hdpclient@en01 avro]$ java ReadAvroFile

{"empno": 7369, "ename": "SMITH", "sal": 800, "comm": 0}

{"empno": 7499, "ename": "ALLEN", "sal": 1600, "comm": 300}

{"empno": 7521, "ename": "WARD", "sal": 1250, "comm": 500}

{"empno": 7566, "ename": "JONES", "sal": 2975, "comm": 0}

{"empno": 7654, "ename": "MARTIN", "sal": 1250, "comm": 1400}

{"empno": 7698, "ename": "BLAKE", "sal": 2850, "comm": 0}

{"empno": 7782, "ename": "CLARK", "sal": 2450, "comm": 0}

{"empno": 7788, "ename": "SCOTT", "sal": 3000, "comm": 0}

{"empno": 7839, "ename": "KING", "sal": 5000, "comm": 0}

{"empno": 7844, "ename": "TURNER", "sal": 1500, "comm": 0}

{"empno": 7876, "ename": "ADAMS", "sal": 1100, "comm": 0}

{"empno": 7900, "ename": "JAMES", "sal": 950, "comm": 0}

{"empno": 7902, "ename": "FORD", "sal": 3000, "comm": 0}

{"empno": 7934, "ename": "MILLER", "sal": 1300, "comm": 0}

Using Avro Tools

Avro provides a set of tools for working with Avro data files and schemas. Below are some examples.

Running the jar without any command-line parameters shows the help:

[hdpclient@en01 avro]$ java -jar avro-tools-1.7.7.jar
Version 1.7.7 of Apache Avro
Copyright 2010 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

C JSON parsing provided by Jansson and
written by Petri Lehtinen. The original software is
available from http://www.digip.org/jansson/.
----------------
Available tools:
cat extracts samples from files
compile Generates Java code for the given schema.
concat Concatenates avro files without re-compressing.
fragtojson Renders a binary-encoded Avro datum as JSON.
fromjson Reads JSON records and writes an Avro data file.
fromtext Imports a text file into an avro data file.
getmeta Prints out the metadata of an Avro data file.
getschema Prints out schema of an Avro data file.
idl Generates a JSON schema from an Avro IDL file
idl2schemata Extract JSON schemata of the types from an Avro IDL file
induce Induce schema/protocol from Java class/interface via reflection.
jsontofrag Renders a JSON-encoded Avro datum as binary.
random Creates a file with randomly generated instances of a schema.
recodec Alters the codec of a data file.
rpcprotocol Output the protocol of a RPC service
rpcreceive Opens an RPC Server and listens for one message.
rpcsend Sends a single RPC message.
tether Run a tethered mapreduce job.
tojson Dumps an Avro data file as JSON, record per line or pretty.
totext Converts an Avro data file to a text file.
totrevni Converts an Avro data file to a Trevni file.
trevni_meta Dumps a Trevni file's metadata as JSON.
trevni_random Create a Trevni file filled with random instances of a schema.
trevni_tojson Dumps a Trevni file as JSON.

1. Dump the file’s header key-value metadata

[hdpclient@en01 avro]$ java -jar avro-tools-1.7.7.jar getmeta emp.avro

avro.schema {"type":"record","name":"emp","namespace":"myns","fields":[{"name":"empno","type":"int"},{"name":"ename","type":"string"},{"name":"sal","type":"int"},{"name":"comm","type":"int"}]}
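The metadata shown above lives in the header of the data file. As a minimal sketch, assuming the container-file layout described in the Avro specification (a data file begins with the four bytes 'O', 'b', 'j' and 1, followed by the metadata map), the following JDK-only program checks whether a file starts with that marker. The class AvroMagicSketch is a hypothetical helper, and the file path is expected as the first command-line argument.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper; it inspects only the first four bytes of a file.
final class AvroMagicSketch {

    // Marker that opens an Avro data file, per the Avro specification.
    static final byte[] MAGIC = {'O', 'b', 'j', 1};

    // Returns true when the given bytes start with the Avro marker.
    static boolean hasAvroMagic(byte[] header) {
        if (header.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (header[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java AvroMagicSketch <file>");
            return;
        }
        byte[] header = new byte[MAGIC.length];
        int filled = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // read() may return fewer bytes than requested, so loop until done
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break; // file shorter than the marker
                }
                filled += n;
            }
        }
        System.out.println(filled == header.length && hasAvroMagic(header));
    }
}
```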

2. Dump the file’s schema

[hdpclient@en01 avro]$ java -jar avro-tools-1.7.7.jar getschema emp.avro

{
"type" : "record",
"name" : "emp",
"namespace" : "myns",
"fields" : [ {
"name" : "empno",
"type" : "int"
}, {
"name" : "ename",
"type" : "string"
}, {
"name" : "sal",
"type" : "int"
}, {
"name" : "comm",
"type" : "int"
} ]
}

3. Dump the content of an Avro data file as JSON

[hdpclient@en01 avro]$ java -jar avro-tools-1.7.7.jar tojson emp.avro | tail
{"empno":7654,"ename":"MARTIN","sal":1250,"comm":1400}
{"empno":7698,"ename":"BLAKE","sal":2850,"comm":0}
{"empno":7782,"ename":"CLARK","sal":2450,"comm":0}
{"empno":7788,"ename":"SCOTT","sal":3000,"comm":0}
{"empno":7839,"ename":"KING","sal":5000,"comm":0}
{"empno":7844,"ename":"TURNER","sal":1500,"comm":0}
{"empno":7876,"ename":"ADAMS","sal":1100,"comm":0}
{"empno":7900,"ename":"JAMES","sal":950,"comm":0}
{"empno":7902,"ename":"FORD","sal":3000,"comm":0}
{"empno":7934,"ename":"MILLER","sal":1300,"comm":0}

4. Merge Avro files

Syntax:
java -jar avro-tools.jar concat /input/part* /output/bigfile.avro

[hdpclient@te1-hdp-rp-en01 avro]$ java -jar avro-tools-1.7.7.jar concat /data/hdfsloc/tmp/avroTestData/000000* /data/hdfsloc/tmp/avroTestData/empBigAvroFile

DBMentors - Inam Bukhari's Blog


Thursday, January 25, 2018
