Using Preprocessor with External Table [Over HDFS]

Oracle 11g Release 2 introduced the PREPROCESSOR clause to identify a directory object and script used to process the files before they are read by the external table. This feature was backported to 11gR1 (11.1.0.7). The PREPROCESSOR clause is especially useful for reading compressed files, since they are unzipped and piped straight into the external table process without ever having to be unzipped on the file system.

In our environment, text files are located on Hadoop HDFS, need to be loaded into the database, are compressed or not precisely in the format expected by the external tables. In such cases, the text files must be transformed into the appropriate format—for example, uncompressed—in a separate process before they can be read by external tables. Thanks to Oracle Database, rather than uncompress the file in a separate process, you can use the preprocessor feature of Oracle Database 11g Release 2 with external tables to uncompress the file inline.

Before starting the test, we need data for this post first.

Prepare Data

****************

Create comma delimited text file. You can choose any tool like toad or SQL Plus to create the data.

set colsep ',';

SET HEADING OFF;

SET FEEDBACK OFF;

SET LINESIZE 1000;

SPOOL /data/ora_ext_tab_dir/xt_uo_tx_preprocessor.txt;

select * from myobj ;

SPOOL OFF;

If you want better data generation, you could create a procedure like below.

create or replace procedure dump_table_to_csv( p_tname in varchar2, p_dir in varchar2, p_filename in varchar2,p_separator varchar2,p_where_cond varchar2 default '1=1' )

l_output utl_file.file_type;

l_theCursor integer default dbms_sql.open_cursor;

l_columnValue varchar2(4000);

l_status integer;

l_query varchar2(1000) default 'select * from ' || p_tname|| ' where '||p_where_cond;

l_colCnt number := 0;

l_separator varchar2(10);

l_descTbl dbms_sql.desc_tab;

l_fileGenerated varchar2(255);

begin

dbms_output.put_line('QueryGenerated : '||l_query);

l_output := utl_file.fopen( p_dir, p_filename, 'w' );

execute immediate 'alter session set nls_date_format=''dd-mon-yyyy hh24:mi:ss'' ';

dbms_sql.parse( l_theCursor, l_query, dbms_sql.native );

dbms_sql.describe_columns( l_theCursor, l_colCnt, l_descTbl );

for i in 1 .. l_colCnt loop

--utl_file.put( l_output, l_separator || '"' || l_descTbl(i).col_name || '"' );

dbms_sql.define_column( l_theCursor, i, l_columnValue, 4000 );

l_separator := p_separator;

end loop;

--utl_file.new_line( l_output );

l_status := dbms_sql.execute(l_theCursor);

while ( dbms_sql.fetch_rows(l_theCursor) > 0 ) loop

l_separator := '';

for i in 1 .. l_colCnt loop

dbms_sql.column_value( l_theCursor, i, l_columnValue );

utl_file.put( l_output, l_separator || l_columnValue );

l_separator := p_separator;

end loop;

utl_file.new_line( l_output );

end loop;

dbms_sql.close_cursor(l_theCursor);

utl_file.fclose( l_output );

select directory_path||'/'||p_filename into l_fileGenerated from dba_directories where directory_name =p_dir;

dbms_output.put_line('File generated at: '||l_fileGenerated );

execute immediate 'alter session set nls_date_format=''dd-MON-yy'' ';

exception

when others then

execute immediate 'alter session set nls_date_format=''dd-MON-yy'' ';

raise;

end;

After creating the procedure call it to generate your data.

SQL> EXEC dump_table_to_csv( p_tname =>'SYS.MYOBJ', p_dir => 'LOCAL_OS_DIR', p_filename => 'xt_uo_tx_preprocessor.txt',p_separator =>',' )

PL/SQL procedure successfully completed.

Elapsed: 00:52:04.54

MYOBJ is based on user_objects but with 4GB size.

--Another example

EXEC dump_table_to_csv( p_tname =>'SCOTT.EMPLOYEE', p_dir => 'LOCAL_OS_DIR', p_filename => 'employee.txt',p_separator =>'<FT>' ,p_where_cond => 'job=''CLERK''' )

Create necessary Directory Objects

********************************************

You may need to create the relevant directory objects to be used in your external table definition also.

SQL> CREATE OR REPLACE DIRECTORY admin_log_dir AS '/data/log/'

2 /

Directory created.

SQL> CREATE OR REPLACE DIRECTORY nfs_dir AS '/data/hdfsloc/data/oraclenfs/';

Directory created.

SQL> CREATE OR REPLACE DIRECTORY local_os_dir AS '/data/ora_ext_tab_dir';

Directory created.

SQL> CREATE OR REPLACE DIRECTORY exec_dir AS '/data/ora_ext_tab_dir';

Directory created.

exec_dir is required to hold the executable to be used as preprocessor option while creating external table. You need to copy the related executable also in this directory if required. In my example I'm using zcat in preprocessor option of external table. zcat exists in /bin (RHEL) , I copied it to EXEC_DIR. Remember Oracle Directory object should not contain symbolic links.

[hdpclient@te1-hdp-rp-en01 bin]$ cp zcat /data/ora_ext_tab_dir

Zip the data generated

****************************

After the data creation zip this delimited file. Compare the file size. For my test, delimited text file size is about 4GB and its compressed version is just 352MB.

[root@te1-hdp-rp-en01 ora_ext_tab_dir]# gzip -c xt_uo_tx_preprocessor.txt > xt_uo_tx_preprocessor.txt.gz

[root@te1-hdp-rp-en01 ora_ext_tab_dir]# ls -l xt_uo_tx_pre*

-rw-r--r-- 1 oracle oinstall 4324215296 Jan 2 16:53 xt_uo_tx_preprocessor.txt

-rw-r--r-- 1 root root 368855592 Jan 2 16:59 xt_uo_tx_preprocessor.txt.gz

Create External Table WO Preprocessor and test the performance

********************************************************************************

Now create external table on local storage (LOCAL_OS_DIR) using text file (size 4GB) and check performance

drop table xt_uo_tx_preprocessor;

create table xt_uo_tx_preprocessor

(

OBJECT_NAME VARCHAR2(128)

,SUBOBJECT_NAME VARCHAR2(128)

,OBJECT_ID NUMBER

,DATA_OBJECT_ID NUMBER

,OBJECT_TYPE VARCHAR2(23)

,CREATED DATE

,LAST_DDL_TIME DATE

,TIMESTAMP1 VARCHAR2(19)

,STATUS VARCHAR2(7)

,TEMPORARY VARCHAR2(1)

,GENERATED VARCHAR2(1)

,SECONDARY VARCHAR2(1)

,NAMESPACE NUMBER

,EDITION_NAME VARCHAR2(128)

,SHARING VARCHAR2(18)

,EDITIONABLE VARCHAR2(1)

,ORACLE_MAINTAINED VARCHAR2(1)

,APPLICATION VARCHAR2(1)

,DEFAULT_COLLATION VARCHAR2(100)

,DUPLICATED VARCHAR2(1)

,SHARDED VARCHAR2(1)

,CREATED_APPID NUMBER

,CREATED_VSNID NUMBER

,MODIFIED_APPID NUMBER

,MODIFIED_VSNID NUMBER

)

organization external

(

type oracle_loader

default directory LOCAL_OS_DIR

access parameters

(

RECORDS DELIMITED BY NEWLINE

FIELDS TERMINATED BY ','

OPTIONALLY ENCLOSED BY '"'

MISSING FIELD VALUES ARE NULL

( OBJECT_NAME,SUBOBJECT_NAME,OBJECT_ID,DATA_OBJECT_ID,OBJECT_TYPE,CREATED CHAR(20) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS" ,LAST_DDL_TIME CHAR(20) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS" ,TIMESTAMP1,STATUS,TEMPORARY,GENERATED,SECONDARY,NAMESPACE,EDITION_NAME,SHARING,EDITIONABLE,ORACLE_MAINTAINED,APPLICATION,DEFAULT_COLLATION,DUPLICATED,SHARDED,CREATED_APPID,CREATED_VSNID,MODIFIED_APPID,MODIFIED_VSNID

)

location ('xt_uo_tx_preprocessor.txt')

)

REJECT LIMIT 100;

--Check table performance with and without parallelism (LOCAL_OS_DIR)

SQL> select count(*) from xt_uo_tx_preprocessor;

COUNT(*)

----------

26524160

Elapsed: 00:02:21.48

SQL> SELECT /*+ PARALLEL(xt_uo_tx_preprocessor,4) */ COUNT(*) FROM xt_uo_tx_preprocessor;

COUNT(*)

----------

26524160

Elapsed: 00:00:35.76

Create External Table with Preprocessor and test the performance

*********************************************************************************

drop and creat table again and this time with preprocessor option and update the location parameter also with .gz file created already existing on LOCAL_OS_DIR

drop table xt_uo_tx_preprocessor;

create table xt_uo_tx_preprocessor

(

OBJECT_NAME VARCHAR2(128)

,SUBOBJECT_NAME VARCHAR2(128)

,OBJECT_ID NUMBER

,DATA_OBJECT_ID NUMBER

,OBJECT_TYPE VARCHAR2(23)

,CREATED DATE

,LAST_DDL_TIME DATE

,TIMESTAMP1 VARCHAR2(19)

,STATUS VARCHAR2(7)

,TEMPORARY VARCHAR2(1)

,GENERATED VARCHAR2(1)

,SECONDARY VARCHAR2(1)

,NAMESPACE NUMBER

,EDITION_NAME VARCHAR2(128)

,SHARING VARCHAR2(18)

,EDITIONABLE VARCHAR2(1)

,ORACLE_MAINTAINED VARCHAR2(1)

,APPLICATION VARCHAR2(1)

,DEFAULT_COLLATION VARCHAR2(100)

,DUPLICATED VARCHAR2(1)

,SHARDED VARCHAR2(1)

,CREATED_APPID NUMBER

,CREATED_VSNID NUMBER

,MODIFIED_APPID NUMBER

,MODIFIED_VSNID NUMBER

)

organization external

(

type oracle_loader

default directory LOCAL_OS_DIR

access parameters

(

RECORDS DELIMITED BY NEWLINE

preprocessor exec_dir:'zcat'

FIELDS TERMINATED BY ','

OPTIONALLY ENCLOSED BY '"'

MISSING FIELD VALUES ARE NULL

)

location ('xt_uo_tx_preprocessor.txt.gz')

)

REJECT LIMIT 100;

Now run the count query again (zip file at this time in location) and observe the time taken.

SQL> select count(*) from xt_uo_tx_preprocessor;

COUNT(*)

----------

26524160

Elapsed: 00:02:29.85

SQL> SELECT /*+ PARALLEL(xt_uo_tx_preprocessor,4) */ COUNT(*) FROM xt_uo_tx_preprocessor;

COUNT(*)

----------

26524160

Elapsed: 00:02:28.40

Testing with Preprocessor with Hadoop HDFS

********************************************************

Now you have the test results with you while your text and zip data is available on LOCAL_OS_DIR. Simply copy the zip and text file on HDFS (NFS_DIR) and change the location parameter of the external table and run the queries again to see the time taken.

--copy the zip file on HDFS

SQL> host time cp /data/ora_ext_tab_dir/xt_uo_tx_preprocessor.txt.gz /data/hdfsloc/data/oraclenfs

real 0m30.635s

user 0m0.000s

sys 0m0.375s

--alter the location of the table and run the count query again

SQL> alter table xt_uo_tx_preprocessor location (NFS_DIR:'xt_uo_tx_preprocessor.txt.gz')

2 /

Table altered.

--run count query again, result is almost same as on LOCAL_OS_DIR

SQL> SELECT /*+ PARALLEL(xt_uo_tx_preprocessor,4) */ COUNT(*) FROM xt_uo_tx_preprocessor;

COUNT(*)

----------

26524160

Elapsed: 00:02:29.60

Another Use Case - Directory Listing

Compression (zip) is not the only use case you can use with external table. You can use this preprocessing technique to show the output of a program as an external table like cat. I'm providing below example to create an external table for showing a directory listing.

1- Create a bash scripts which we want to call from external table by preprocessor option.

[oracle@te1-hdp-rp-en01 ~]$ vi /data/ora_ext_tab_dir/myprog.sh
[oracle@te1-hdp-rp-en01 ora_ext_tab_dir]$ cat /data/ora_ext_tab_dir/myprog.sh

#!/bin/bash

/bin/ls ls /data/hdfsloc/flume/twitter/ -a -l -X

2- Make the script executable.

[oracle@te1-hdp-rp-en01 ~]$ chmod u+x /data/ora_ext_tab_dir/myprog.sh

3- Create external table.

drop table extlstfile;

create table extlstfile
(
file_priv VARCHAR2(11),
file_links NUMBER,
file_owner VARCHAR2(25),
file_owner_grp VARCHAR2(25),
file_size NUMBER,
file_month VARCHAR2(3),
file_day NUMBER,
file_time VARCHAR2(6),
file_name VARCHAR2(25)
)
organization external
(
type oracle_loader
default directory LOCAL_OS_DIR
access parameters
(
records delimited by newline
preprocessor LOCAL_OS_DIR:'myprog.sh'
skip 1
fields terminated by whitespace ldrtrim
)
location (LOCAL_OS_DIR:'myprog.sh')
)
reject limit unlimited
/

LDRTRIM is used to provide compatibility with SQL*Loader trim features. It is the same as NOTRIM except in the following cases:

If the field is not a delimited field, then spaces will be trimmed from the right.
If the field is a delimited field with OPTIONALLY ENCLOSED BY specified, and the optional enclosures are missing for a particular instance, then spaces will be trimmed from the left.

4- Query external table

SQL> set lines 200
SQL> select * from extlstfile;

FILE_PRIV FILE_LINKS FILE_OWNER FILE_OWNER_GRP FILE_SIZE FIL FILE_DAY FILE_T FILE_NAME
----------- ---------- ------------------------- ------------------------- ---------- --- ---------- ------ -------------------------
drwxrwxrwx 481 hdfs hdfs 15392 Jan 7 2018 .
drwxrwxrwx 6 hdfs hdfs 192 Aug 9 12:50 ..
-rw-r--r-- 1 hdpclient hdfs 15294 Jan 7 07:41 FlumeData.1515300019774
-rw-r--r-- 1 hdpclient hdfs 24359 Jan 7 07:41 FlumeData.1515300049921
-rw-r--r-- 1 hdpclient hdfs 5659 Jan 7 07:42 FlumeData.1515300080442
-rw-r--r-- 1 hdpclient hdfs 26006 Jan 7 07:42 FlumeData.1515300113515
-rw-r--r-- 1 hdpclient hdfs 17226 Jan 7 07:43 FlumeData.1515300149653
-rw-r--r-- 1 hdpclient hdfs 30049 Jan 7 07:43 FlumeData.1515300183800
-rw-r--r-- 1 hdpclient hdfs 21977 Jan 7 07:44 FlumeData.1515300228408
-rw-r--r-- 1 hdpclient hdfs 13785 Jan 7 07:45 FlumeData.1515300263047
-rw-r--r-- 1 hdpclient hdfs 9785 Jan 7 07:45 FlumeData.1515300305630

SQL> select count(*) from extlstfile;

COUNT(*)
----------

469