Hadoop Compression
Hive can read data from a variety of sources, such as text files, sequence files, or even custom formats, using Hadoop's InputFormat APIs, and it can write data to various formats using the OutputFormat API. You can leverage Hadoop to store data in compressed form and save significant disk space. Compression can also increase throughput and performance. Compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
Hadoop jobs tend to be I/O bound rather than CPU bound. If so, compression will improve performance. However, if your jobs are CPU bound, then compression will probably lower your performance. The only way to really know is to experiment with your own jobs and data.
Hadoop provides a number of compression schemes, called codecs (a shortened form of compressor/decompressor). Some codecs support splittable compression, in which files larger than the block size are split and the individual splits are processed in parallel by different mappers. Splittable compression is only a factor for text files. For binary files, Hadoop compression codecs compress data within a binary-encoded container, depending on the file type (for example, a SequenceFile, Avro, or Protocol Buffers).
There are many different compression algorithms and tools, and their characteristics and strengths vary. The most common trade-off is between compression ratios (the degree to which a file is compressed) and compress/decompress speeds.
Below are some common codecs supported by the Hadoop framework; a short sketch of selecting them from Hive follows the list.
Gzip: Generates compressed files that have a .gz extension.
Bzip2: Generates a better compression ratio than Gzip, but it's much slower.
Snappy: Modest compression ratios, but fast compression and decompression speeds.
LZO: Similar to Snappy, supports splittable compression, which enables the parallel processing of compressed text file splits by your MapReduce jobs.
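As a rough sketch of how these trade-offs map to Hive settings, you might pick a fast codec for the intermediate map output and a high-ratio codec for the final output. The codec class names below are the standard Hadoop ones; Snappy additionally assumes the native Snappy library is installed on the cluster.
-- fast codec (Snappy) for the intermediate map output
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- high-ratio codec (Bzip2) for the final job output
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;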
Create a Compressed Table
First, let's enable intermediate compression. This won't affect the final output; however, the job counters will show less physical data transferred for the job, since the shuffle/sort data is compressed.
hive (scott)> set hive.exec.compress.intermediate=true;
hive (scott)> CREATE TABLE intermediate_comp_translog ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;
......
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/.hive-staging_hive_2017-05-01_10-22-29_762_6373409087183829364-1/-ext-10002
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/intermediate_comp_translog
MapReduce Jobs Launched:
......
translog.record_id translog.emp_id translog.request_type translog.json_input translog.result_code translog.result_description translog.error_type translog.start_time translog.proccessing_time translog.request_channel translog.custom_1 translog.custom_2 translog.custom_3 translog.custom_4 translog.custom_5 translog.custom_6 translog.year
Time taken: 92.632 seconds
As expected, intermediate compression did not affect the final output, which remains uncompressed, as the listing below shows.
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/intermediate_comp_translog;
Found 39 items
-rwxrwxrwx 3 hdpclient supergroup 334958784 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000000_0
-rwxrwxrwx 3 hdpclient supergroup 249179231 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000001_0
-rwxrwxrwx 3 hdpclient supergroup 248390875 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000002_0
-rwxrwxrwx 3 hdpclient supergroup 248096285 2017-05-01 10:23
.......
-rwxrwxrwx 3 hdpclient supergroup 243161657 2017-05-01 10:24 /user/hive/warehouse/scott.db/intermediate_comp_translog/000036_0
-rwxrwxrwx 3 hdpclient supergroup 243851779 2017-05-01 10:24 /user/hive/warehouse/scott.db/intermediate_comp_translog/000037_0
-rwxrwxrwx 3 hdpclient supergroup 85779553 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000038_0
hive (scott)>
hive (scott)> !hdfs dfs -cat /user/hive/warehouse/scott.db/intermediate_comp_translog/000000_0;
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""
We can also specify which codec to use for the intermediate (map output) compression:
hive (scott)> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive (scott)> set hive.exec.compress.intermediate=true;
Next, we can enable output compression:
hive (scott)> set hive.exec.compress.output=true;
hive (scott)> CREATE TABLE intermediate_comp_translog_gz ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/intermediate_comp_translog_gz;
Found 39 items
-rwxrwxrwx 3 hdpclient supergroup 34790085 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000000_0.deflate
-rwxrwxrwx 3 hdpclient supergroup 25862237 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000001_0.deflate
-rwxrwxrwx 3 hdpclient supergroup 26318840 2017-05-01 12:43 /
....
-rwxrwxrwx 3 hdpclient supergroup 26008312 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000005_0.deflate
Trying to cat the file is not recommended, as you would get binary output. However, Hive can query this data normally.
Now try to query the compressed table:
hive (scott)> select * from intermediate_comp_translog_gz limit 1;
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""
You get the actual data back. This ability to work seamlessly with compressed files is not Hive-specific; Hadoop's TextInputFormat is at work here. TextInputFormat understands file extensions such as .deflate or .gz and decompresses these files on the fly. Hive is unaware whether the underlying files are uncompressed or compressed with any of the supported compression schemes.
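As an illustration, here is a minimal sketch of an external table over a directory of gzipped, tab-delimited text files; the table name, columns, and location are hypothetical. TextInputFormat decompresses the .gz parts on the fly when the table is queried.
-- hypothetical external table over gzipped text files
CREATE EXTERNAL TABLE translog_gz_ext (
  record_id BIGINT,
  emp_id BIGINT,
  request_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/translog_gz';  -- directory containing *.gz text files

SELECT * FROM translog_gz_ext LIMIT 1;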