Hadoop Compression
Hive can read data from a variety of sources, such as text files, sequence files, or even custom formats, using Hadoop's InputFormat APIs, and it can write data to various formats using the OutputFormat API. You can leverage Hadoop to store data in compressed form and save significant disk space. Compression can also increase throughput and performance. Compressing and decompressing data incurs extra CPU overhead; however, the I/O savings from moving fewer bytes into memory can result in a net performance gain.
Hadoop jobs tend to be I/O bound rather than CPU bound. If so, compression will improve performance. However, if your jobs are CPU bound, then compression will probably lower your performance. The only way to really know is to experiment with your own jobs and data.
Hadoop provides a number of compression schemes, called codecs (a shortened form of compressor/decompressor). Some codecs support splittable compression, in which files larger than the block size are split and the individual splits are processed in parallel by different mappers. Splittable compression is only a factor for text files. For binary files, Hadoop compression codecs compress data within a binary-encoded container, depending on the file type (for example, a SequenceFile, Avro, or Protocol Buffers).
There are many different compression algorithms and tools, and their characteristics and strengths vary. The most common trade-off is between compression ratios (the degree to which a file is compressed) and compress/decompress speeds.
Below are some common codecs supported by the Hadoop framework; a short sketch of selecting them from Hive follows the list.
Gzip: Generates compressed files that have a .gz extension.
Bzip2: Generates a better compression ratio than Gzip, but it's much slower.
Snappy: Modest compression ratios, but fast compression and decompression speeds.
LZO: Similar to Snappy, supports splittable compression, which enables the parallel processing of compressed text file splits by your MapReduce jobs.
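As a rough sketch of how these trade-offs map to Hive settings, you might pick a fast codec for the intermediate map output and a high-ratio codec for the final output. The codec class names below are the standard Hadoop ones; Snappy additionally assumes the native Snappy library is installed on the cluster.
-- fast codec (Snappy) for the intermediate map output
set hive.exec.compress.intermediate=true;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- high-ratio codec (Bzip2) for the final job output
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;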
Create a Compressed Table
First, let's enable intermediate compression. This won't affect the final output; however, the job counters will show less physical data transferred for the job, since the shuffle/sort data is compressed.
hive (scott)> set hive.exec.compress.intermediate=true;
hive (scott)> CREATE TABLE intermediate_comp_translog ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;
......
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/.hive-staging_hive_2017-05-01_10-22-29_762_6373409087183829364-1/-ext-10002
Moving data to directory hdfs://nn01:9000/user/hive/warehouse/scott.db/intermediate_comp_translog
MapReduce Jobs Launched:
......
translog.record_id translog.emp_id translog.request_type translog.json_input translog.result_code translog.result_description translog.error_type translog.start_time translog.proccessing_time translog.request_channel translog.custom_1 translog.custom_2 translog.custom_3 translog.custom_4 translog.custom_5 translog.custom_6 translog.year
Time taken: 92.632 seconds
As expected, intermediate compression did not affect the final output, which remains uncompressed, as the listing below shows.
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/intermediate_comp_translog;
Found 39 items
-rwxrwxrwx 3 hdpclient supergroup 334958784 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000000_0
-rwxrwxrwx 3 hdpclient supergroup 249179231 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000001_0
-rwxrwxrwx 3 hdpclient supergroup 248390875 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000002_0
-rwxrwxrwx 3 hdpclient supergroup 248096285 2017-05-01 10:23
.......
-rwxrwxrwx 3 hdpclient supergroup 243161657 2017-05-01 10:24 /user/hive/warehouse/scott.db/intermediate_comp_translog/000036_0
-rwxrwxrwx 3 hdpclient supergroup 243851779 2017-05-01 10:24 /user/hive/warehouse/scott.db/intermediate_comp_translog/000037_0
-rwxrwxrwx 3 hdpclient supergroup 85779553 2017-05-01 10:23 /user/hive/warehouse/scott.db/intermediate_comp_translog/000038_0
hive (scott)>
hive (scott)> !hdfs dfs -cat /user/hive/warehouse/scott.db/intermediate_comp_translog/000000_0;
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""
We can also specify which codec to use for the intermediate (map output) compression:
hive (scott)> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive (scott)> set hive.exec.compress.intermediate=true;
Next, we can enable output compression:
hive (scott)> set hive.exec.compress.output=true;
hive (scott)> CREATE TABLE intermediate_comp_translog_gz ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' AS SELECT * FROM translog;
hive (scott)> !hdfs dfs -ls /user/hive/warehouse/scott.db/intermediate_comp_translog_gz;
Found 39 items
-rwxrwxrwx 3 hdpclient supergroup 34790085 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000000_0.deflate
-rwxrwxrwx 3 hdpclient supergroup 25862237 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000001_0.deflate
-rwxrwxrwx 3 hdpclient supergroup 26318840 2017-05-01 12:43 /
....
-rwxrwxrwx 3 hdpclient supergroup 26008312 2017-05-01 12:43 /user/hive/warehouse/scott.db/intermediate_comp_translog_gz/000005_0.deflate
Trying to cat the file is not recommended, as you would get binary output. However, Hive can query this data normally.
Now try to query the compressed table:
hive (scott)> select * from intermediate_comp_translog_gz limit 1;
273483692,1084721533,"LOAD_EMP_PROFILE",1084721533,0,"?? ??? ??????? ?????","",03-DEC-16 02.47.57.856000000 AM,7,"WEB","(Google Chrome): Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36","","","192.168.155.140 ","",""
You get the actual data back. This ability to work seamlessly with compressed files is not Hive-specific; Hadoop's TextInputFormat is at work here. TextInputFormat understands file extensions such as .deflate or .gz and decompresses these files on the fly. Hive is unaware whether the underlying files are uncompressed or compressed with any of the supported compression schemes.
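As an illustration, here is a minimal sketch of an external table over a directory of gzipped, tab-delimited text files; the table name, columns, and location are hypothetical. TextInputFormat decompresses the .gz parts on the fly when the table is queried.
-- hypothetical external table over gzipped text files
CREATE EXTERNAL TABLE translog_gz_ext (
  record_id BIGINT,
  emp_id BIGINT,
  request_type STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/translog_gz';  -- directory containing *.gz text files

SELECT * FROM translog_gz_ext LIMIT 1;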