How to merge multiple part files (ORC) in Hadoop created by PolyBase?

Problem:

One of my client is using PolyBase to query and offloading SQL Server data to Hadoop. While offloading data in Hive ORC format, multiple part files are created by the PolyBase in HDFS. For better query performance all these part files are needed to be merged.

Solution:

Create an external table in Hive for the same location PolyBase is using for HDFS and use below statement

ALTER TABLE <<TableName>> CONCATENATE;

Simulation

For problem simulation, please perform the below steps.

1- Create an PolyBase external table in SQL Server (SSMS)

CREATE EXTERNAL TABLE bigdata.dbo.testExport (
[name] varchar(60),
[object_id] [int] NOT NULL,
[principal_id] [int] NULL,
[schema_id] [int] NOT NULL,
[parent_object_id] [int] NOT NULL,
[type] [char](2) NULL,
[type_desc] [nvarchar](60) NULL,
[create_date] [datetime] NOT NULL,
[modify_date] [datetime] NOT NULL,
[is_ms_shipped] [bit] NULL,
[is_published] [bit] NULL,
[is_schema_published] [bit] NULL
)
WITH (
LOCATION='/tmp/testExportData',
DATA_SOURCE = ds_hdp2_6,
FILE_FORMAT = myfileformat ,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

2- Create a table in SQL Server (SSMS) from some system table to populate our simulation table created in step 1.

select * into bigdata.dbo.t1 from sys.all_objects

3- Load/export the data in PolyBase external table

INSERT INTO bigdata.dbo.testExport

SELECT * FROM t1

4- Check on the HDFS to see how many part files generated as the result of step 3.

[root@dn04 ~]# hdfs dfs -ls /tmp/testExportData
Found 8 items
-rw-r--r-- 3 pdw_user hdfs 4982 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_0.orc.snappy
-rw-r--r-- 3 pdw_user hdfs 8412 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_1.orc.snappy
-rw-r--r-- 3 pdw_user hdfs 6908 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_2.orc.snappy
-rw-r--r-- 3 pdw_user hdfs 5027 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_3.orc.snappy
-rw-r--r-- 3 pdw_user hdfs 5020 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_4.orc.snappy
-rw-r--r-- 3 pdw_user hdfs 14075 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_5.orc.snappy
-rw-r--r-- 3 pdw_user hdfs 11167 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_6.orc.snappy

-rw-r--r-- 3 pdw_user hdfs 19261 2018-01-22 15:15 /tmp/testExportData/QID630_20180122_151541_7.orc.snappy

...
...

You see there are many part files generated by the export activity.

5- Now in order to merge files, you need to create an external table in Hive and then concatenate the table using alter statement

a) CREATE EXTERNAL TABLE scott.testExport(name STRING,object_id int,principal_id int,schema_id int,parent_object_id int,obj_type STRING,type_desc STRING,create_date STRING,modify_date STRING,is_ms_shipped int,is_published int,is_schema_published int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS ORC
LOCATION '/tmp/testExportData'
;

b) Check data structure of the table from Hive, you need to use the compatible data types in Hive as compared to the PolyBase external table

select * from scott.testExport limit 2
select count(*) from scott.testExport

0: jdbc:hive2://dn04:10000/flume> desc scott.testExport;
+----------------------+------------+----------+--+
| col_name | data_type | comment |
+----------------------+------------+----------+--+
| name | string | |
| object_id | int | |
| principal_id | int | |
| schema_id | int | |
| parent_object_id | int | |
| obj_type | string | |
| type_desc | string | |
| create_date | string | |
| modify_date | string | |
| is_ms_shipped | int | |
| is_published | int | |
| is_schema_published | int | |
+----------------------+------------+----------+--+
12 rows selected (0.19 seconds)

c) Merge the files (ORC) by Hive

0: jdbc:hive2://dn04:10000/flume> ALTER TABLE scott.testExport CONCATENATE;
INFO : Session is already open
INFO : Dag name: hive_20180123120229_9f7da386-29b3-4c32-a860-5bf4d96fc468
INFO : Status: Running (Executing on YARN cluster with App id application_1514186075960_0036)

--------------------------------------------------------------------------------
VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
--------------------------------------------------------------------------------
File Merge ..... SUCCEEDED 1 1 0 0 0 0
--------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 5.11 s
--------------------------------------------------------------------------------
INFO : Loading data to table scott.testexport from hdfs://nn01:8020/tmp/testExportData/.hive-staging_hive_2018-01-23_12-02-29_644_2561974917269981859-5/-ext-10000
INFO : Table scott.testexport stats: [numFiles=1, totalSize=71302]
No rows affected (5.714 seconds)

6- Check again files (ORC) on HDFS, files merged successfully or not

[root@dn04 ~]# hdfs dfs -ls /tmp/testExportData
Found 1 items

-rwxrwxrwx 3 hive hdfs 71302 2018-01-23 12:02 /tmp/testExportData/000000_0

You see only one file now in the HDFS location after the concatenation activity.

7- Query PolyBase external table in SQL Server SSMS and Hive external table for the same HDFS location to confirm data.

-- SQL Server
select * from bigdata.dbo.testExport

select count(*) from bigdata.dbo.testExport

--Hive
select * from scott.testExport limit 2

select count(*) from scott.testExport