Introduction
Hadoop shines as a batch processing system, but serving real-time results can be challenging. For truly interactive data discovery, ES-Hadoop (the Elasticsearch-Hadoop connector) lets you index Hadoop data into the Elastic Stack to take full advantage of the speedy Elasticsearch engine and beautiful Kibana visualizations.
With ES-Hadoop, you can easily build dynamic, embedded search applications to serve your Hadoop data, or perform deep, low-latency analytics using full-text and geospatial queries and aggregations.
At its core, elasticsearch-hadoop integrates two distributed systems: Hadoop, a distributed computing platform, and Elasticsearch, a real-time search and analytics engine. From a high-level view, both provide a computational component: Hadoop through MapReduce or more recent libraries like Apache Spark on one hand, and Elasticsearch through its search and aggregations on the other. The goal of elasticsearch-hadoop is to connect these two entities so that they can transparently benefit from each other.
Prerequisite
Working with Elasticsearch
Installation
The elasticsearch-hadoop binaries can be obtained by downloading them from the elastic.co site: download and unzip the Elasticsearch for Apache Hadoop (ES-Hadoop) distribution. The elasticsearch-hadoop binary is suitable for Hadoop 2.x, and the distribution ships minimalistic jars for each integration.
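For example, for the 6.2.3 release used throughout this post (the exact download URL may differ; check the elastic.co downloads page for the current release):

wget https://artifacts.elastic.co/downloads/elasticsearch-hadoop/elasticsearch-hadoop-6.2.3.zip
unzip elasticsearch-hadoop-6.2.3.zip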
Configuration
There are different configuration options available; for the purpose of this post I'll be using the Hive integration, which is what my environment requires.
After extracting the connector, place the jars in the Hive and Hadoop distribution folders.
In HIVE_HOME/lib:
elasticsearch-hadoop-hive-6.2.3-javadoc.jar
elasticsearch-hadoop-hive-6.2.3-sources.jar
In HADOOP_HOME/share/hadoop/hdfs/lib:
elasticsearch-hadoop-6.2.3.jar
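Alternatively, instead of copying jars into the distribution folders, you can register the connector for the current session with Hive's ADD JAR command (adjust the path below to wherever you extracted the distribution):

hive> ADD JAR /path/to/elasticsearch-hadoop-hive-6.2.3.jar;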
Now you need to create an external table in Hive backed by Elasticsearch:
CREATE EXTERNAL TABLE customer (name string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'customer/_doc',
'es.index.auto.create' = 'false');
hive>
> CREATE EXTERNAL TABLE customer (name string)
> STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
> TBLPROPERTIES('es.resource' = 'customer/_doc',
> 'es.index.auto.create' = 'false');
OK
Time taken: 7.027 seconds
hive> select * from customer;
OK
Inaam Bukhary
Time taken: 7.274 seconds, Fetched: 1 row(s)
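Note that because es.index.auto.create is false, the connector expects the customer index to already exist in Elasticsearch. If you need to create it first, a minimal sketch (the document id and name value are illustrative):

curl -XPUT 'localhost:9200/customer/_doc/1?pretty' -H 'Content-Type: application/json' -d'
{
  "name": "Inaam Bukhary"
}
'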
You can find the list of all configuration properties at the link below:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
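For example, if Elasticsearch is not running on localhost:9200, the connector's es.nodes and es.port settings point it at the right cluster (the host name below is illustrative):

CREATE EXTERNAL TABLE customer (name string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'customer/_doc',
'es.nodes' = 'eshost.example.com',
'es.port' = '9200',
'es.index.auto.create' = 'false');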
Reading data from Elasticsearch
The Hive table below is based on the bank index in Elasticsearch.
--Elasticsearch Index sample record.
{
  "_index" : "bank",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "account_number" : 1,
    "balance" : 39225,
    "firstname" : "Amber",
    "lastname" : "Duke",
    "age" : 32,
    "gender" : "M",
    "address" : "880 Holmes Lane",
    "employer" : "Pyrami",
    "email" : "amberduke@pyrami.com",
    "city" : "Brogan",
    "state" : "IL"
  }
}
CREATE EXTERNAL TABLE bank3 (
  account_number string,
  balance string,
  firstname string,
  lastname string,
  age string,
  gender string,
  address string,
  employer string,
  email string,
  city string,
  state string
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'bank/_doc',
'es.index.auto.create' = 'false');
hive> describe bank;
OK
account_number          int         from deserializer
balance                 int         from deserializer
firstname               string      from deserializer
lastname                string      from deserializer
age                     int         from deserializer
gender                  string      from deserializer
address                 string      from deserializer
employer                string      from deserializer
email                   string      from deserializer
Time taken: 0.746 seconds, Fetched: 9 row(s)
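Once the table is defined, ordinary HiveQL runs against the index, and the connector translates it into Elasticsearch requests behind the scenes. A sketch against the bank3 table created above (the filter value is illustrative):

hive> SELECT firstname, lastname, balance FROM bank3 WHERE state = 'IL' LIMIT 5;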
Writing to Elasticsearch
DROP TABLE artists;
CREATE EXTERNAL TABLE artists (
id BIGINT,
name STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'artists/_doc','es.index.auto.create' = 'true');
-- insert data into Elasticsearch from a source table
INSERT OVERWRITE TABLE artists SELECT DISTINCT s.id, s.name FROM employee s;
hive> INSERT OVERWRITE TABLE artists SELECT distinct id, s.name FROM employee s;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdpsysuser_20180409191844_7b9806b6-833b-483d-8d34-9f7407a18a1f
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1523248038304_0014, Tracking URL = http://localhost:8088/proxy/application_1523248038304_0014/
Kill Command = /usr/hadoopsw/hadoop-2.7.3/bin/hadoop job -kill job_1523248038304_0014
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2018-04-09 19:19:42,879 Stage-1 map = 0%, reduce = 0%
2018-04-09 19:20:05,022 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 10.79 sec
2018-04-09 19:20:39,869 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 22.94 sec
2018-04-09 19:20:46,854 Stage-1 map = 100%, reduce = 71%, Cumulative CPU 25.76 sec
2018-04-09 19:20:49,278 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.47 sec
MapReduce Total cumulative CPU time: 26 seconds 470 msec
Ended Job = job_1523248038304_0014
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 26.47 sec HDFS Read: 11197 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 470 msec
OK
Time taken: 138.918 seconds
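Each inserted row becomes a new Elasticsearch document with an auto-generated _id (visible in the search results below). If you would rather have the Hive id column used as the document _id, the connector's es.mapping.id setting does that; a sketch of the same table definition with it added:

CREATE EXTERNAL TABLE artists (
  id BIGINT,
  name STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'artists/_doc',
'es.mapping.id' = 'id',
'es.index.auto.create' = 'true');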
Querying data with Elasticsearch
--List of indices
curl -XGET 'localhost:9200/_cat/indices?v&pretty'
[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/_cat/indices?v&pretty'
health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logstash-2015.05.18 BnCViLEGT6i21FYJnlCvvw   5   1       4631            0     22.5mb         22.5mb
yellow open   shakespeare         44dwGPXmTJ6eOOdRuGQf9w   5   1     111395            0     22.9mb         22.9mb
yellow open   logstash-2015.05.20 5bQkSCxGT2Oka-hNjl1c-A   5   1       4750            0     23.6mb         23.6mb
yellow open   bank                ymIPSIiDS9i_iiQeM6mD1w   5   1       1000            0      499kb          499kb
yellow open   logstash-2015.05.19 gvUQBcvVSweGcHob0t7FXw   5   1       4624            0     24.2mb         24.2mb
yellow open   customer            f-9cBDSlQ-aNmAWhHhj_rA   5   1          1            0      4.5kb          4.5kb
yellow open   artists             E_OTOKO3TTGLWg8D6jRltg   5   1         14            0     19.3kb         19.3kb
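You can also verify the document count for a single index; artists should report the 14 rows inserted above:

curl -XGET 'localhost:9200/artists/_count?pretty'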
Query index information
[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/artists/?pretty&pretty'
{
  "artists" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "properties" : {
          "id" : {
            "type" : "long"
          },
          "name" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1523301641741",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "E_OTOKO3TTGLWg8D6jRltg",
        "version" : {
          "created" : "6020399"
        },
        "provided_name" : "artists"
      }
    }
  }
}
Query index docs
curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "match_all": {} }
}
'
[hdpsysuser@hdpmaster ~]$ curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
> {
> "query": { "match_all": {} }
> }
> '
{
  "took" : 117,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 14,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "artists",
        "_type" : "_doc",
        "_id" : "Xxbaq2IBeMlMXEgTgPjf",
        "_score" : 1.0,
        "_source" : {
          "id" : 7782,
          "name" : "CLARK"
        }
      },
...
...
You can also filter on specific fields; for example, matching a single id:
curl -XGET 'localhost:9200/artists/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "match": { "id": "7369" } }
      ]
    }
  }
}
'