Purpose:
Offloading Oracle Data to HDFS
Prerequisites:
Hortonworks NFS Gateway is running
Creating External table using csv File on HDFS
1- Place any CSV file in HDFS using the NFS gateway configured earlier in the post Configuring NFS Gateway for HDFS [HDP].
I copied emp.csv to HDFS for the purpose of this post.
2- Verify the contents of the CSV file in HDFS
SQL> host pwd
/data/hdfsloc/data/oraclenfs
SQL> host ls -l
total 1
-rwxr-xr-x 1 oracle oinstall 617 Dec 25 10:43 emp.csv
SQL> host cat emp.csv
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30
7566,JONES,MANAGER,7839,02-APR-81,2975,,20
7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30
7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30
7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10
7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20
7839,KING,PRESIDENT,,17-NOV-81,5000,,10
7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30
7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20
7900,JAMES,CLERK,7698,03-DEC-81,950,,30
7902,FORD,ANALYST,7566,03-DEC-81,3000,,20
7934,MILLER,CLERK,7782,23-JAN-82,1300,,10
3- Create the DB directories to be used when creating the external table. Because random writes are not allowed on HDFS, we create the log-related directory on the local OS filesystem rather than HDFS, while the DB directory for the external table data goes on HDFS.
drop directory admin_log_dir
SQL> CREATE OR REPLACE DIRECTORY admin_log_dir AS '/data/log/'
2 /
Directory created.
SQL> CREATE OR REPLACE DIRECTORY nfs_dir AS '/data/hdfsloc/data/oraclenfs/';
Directory created.
4- Create the external table. By default, the ORACLE_LOADER access driver treats every field as CHAR(255); to process the date field, we must specify its data type and date format.
drop table emp_ext;
CREATE TABLE emp_ext (
EMPNO NUMBER(4),
ENAME VARCHAR2(10),
JOB VARCHAR2(9),
MGR NUMBER(4),
HIREDATE DATE,
SAL NUMBER(7,2),
COMM NUMBER(7,2),
DEPTNO VARCHAR2(3)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY NFS_DIR
ACCESS PARAMETERS
(
RECORDS DELIMITED BY NEWLINE
BADFILE ADMIN_LOG_DIR:'emp_ext%a_%p.bad'
LOGFILE ADMIN_LOG_DIR:'emp_ext%a_%p.log'
DISCARDFILE ADMIN_LOG_DIR:'emp_ext%a_%p.disc'
DNFS_ENABLE
DNFS_READBUFFERS 10
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
MISSING FIELD VALUES ARE NULL
(
empno,
ename,
job,
mgr,
hiredate CHAR(9) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS",
sal,
comm,
deptno
)
)
LOCATION (nfs_dir:'emp*.csv')
)
PARALLEL 1
REJECT LIMIT UNLIMITED;
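As a quick sanity check outside the database, the HIREDATE strings listed in emp.csv can be parsed with an equivalent format model. This is a minimal Python sketch of my own (not part of the original load), using rows copied from the listing above:

```python
from datetime import datetime

def parse_hiredate(row: str) -> datetime:
    """Return the HIREDATE (5th field) of one emp.csv row.
    "%d-%b-%y" mirrors the DD-MON-YY layout of the dates shown above;
    strptime matches month abbreviations case-insensitively, so "DEC" is fine."""
    return datetime.strptime(row.split(",")[4], "%d-%b-%y")

# Two rows copied from the emp.csv listing above.
print(parse_hiredate("7369,SMITH,CLERK,7902,17-DEC-80,800,,20").date())
print(parse_hiredate("7839,KING,PRESIDENT,,17-NOV-81,5000,,10").date())
```

If a value does not match the mask, the driver writes the row to the bad file and logs the reason, which is why the log directories above matter.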
5- Query external table
SQL> select * from emp_ext;
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEP
---------- ---------- --------- ---------- ------------------ ---------- ---------- ---
7369 SMITH CLERK 7902 17-DEC-80 800 20
7499 ALLEN SALESMAN 7698 20-FEB-81 1600 300 30
7521 WARD SALESMAN 7698 22-FEB-81 1250 500 30
7566 JONES MANAGER 7839 02-APR-81 2975 20
7654 MARTIN SALESMAN 7698 28-SEP-81 1250 1400 30
7698 BLAKE MANAGER 7839 01-MAY-81 2850 30
7782 CLARK MANAGER 7839 09-JUN-81 2450 10
7788 SCOTT ANALYST 7566 19-APR-87 3000 20
7839 KING PRESIDENT 17-NOV-81 5000 10
7844 TURNER SALESMAN 7698 08-SEP-81 1500 0 30
7876 ADAMS CLERK 7788 23-MAY-87 1100 20
EMPNO ENAME JOB MGR HIREDATE SAL COMM DEP
---------- ---------- --------- ---------- ------------------ ---------- ---------- ---
7900 JAMES CLERK 7698 03-DEC-81 950 30
7902 FORD ANALYST 7566 03-DEC-81 3000 20
7934 MILLER CLERK 7782 23-JAN-82 1300 10
14 rows selected.
If you encounter any issues while selecting data, check the log file directory specified in the external table definition for insights. I encountered the following error in my example:
error processing column DEPTNO in row 1 for datafile /data/hdfsloc/data/oraclenfs/emp.csv
ORA-01722: invalid number
The above error occurred because the DEPTNO values in the CSV file had a trailing space appended.
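One way around such rows is to pre-clean the file before loading (alternatively, a trim option such as LRTRIM in the FIELDS clause can trim whitespace inside the database). A small Python sketch of the pre-cleaning approach; the dirty sample row is my own illustration:

```python
def clean_csv_line(line: str) -> str:
    """Strip leading/trailing whitespace from every field so numeric
    columns such as DEPTNO no longer fail conversion (cf. ORA-01722)."""
    return ",".join(field.strip() for field in line.rstrip("\n").split(","))

# Illustrative row with a trailing space on the DEPTNO field:
dirty = "7369,SMITH,CLERK,7902,17-DEC-80,800,,20 "
print(clean_csv_line(dirty))  # the trailing space on the last field is gone
```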
Creating External table using Data Pump on HDFS
1- Create the DB directory on the local file system to hold the data pump file.
SQL> CREATE OR REPLACE DIRECTORY local_os_dir AS '/data/ora_ext_tab_dir';
Directory created.
2- Create a data pump file with the statement below:
DROP TABLE export_employee;
CREATE TABLE export_employee
ORGANIZATION EXTERNAL
(
TYPE oracle_datapump
DEFAULT DIRECTORY LOCAL_OS_DIR
LOCATION('employee_ext.dmp')
)
AS SELECT * FROM scott.employee;
Query the newly created table to verify the data:
SQL> SELECT * FROM export_employee;
EMPNO NAME JOB BOSS HIREDATE SALARY COMM DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
7839 KING PRESIDENT 1981-11-17 5000 10
7566 JONES MANAGER 7839 1981-04-02 2975 20
7788 SCOTT ANALYST 7566 1982-12-09 3000 20
7876 ADAMS CLERK 7788 1983-01-12 1100 20
7902 FORD ANALYST 7566 1981-12-03 3000 20
7369 SMITH CLERK 7902 1980-12-17 800 20
7698 BLAKE MANAGER 7839 1981-05-01 2850 30
7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30
7521 WARD SALESMAN 7698 1981-02-22 1250 500 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30
7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30
EMPNO NAME JOB BOSS HIREDATE SALARY COMM DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
7900 JAMES CLERK 7698 1981-12-03 950 30
7782 CLARK MANAGER 7839 1981-06-09 2450 10
7934 MILLER CLERK 7782 1982-01-23 1300 10
14 rows selected.
At this point you are querying the table from the data pump file located on the local file system (LOCAL_OS_DIR).
3- Now copy the data pump file to the HDFS location (NFS_DIR):
SQL> host cp /data/ora_ext_tab_dir/employee_ext.dmp .
SQL> host ls -l
total 14
-rwxr-xr-x 1 oracle oinstall 617 Dec 25 10:43 emp.csv
-rw-r----- 1 oracle oinstall 12288 Dec 25 12:22 employee_ext.dmp
4- Now alter the table to use the new location on NFS (HDFS):
SQL> alter table export_employee LOCATION(NFS_DIR:'employee_ext.dmp');
Table altered.
Query the external table again
SQL> SELECT * FROM export_employee;
EMPNO NAME JOB BOSS HIREDATE SALARY COMM DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
7839 KING PRESIDENT 1981-11-17 5000 10
7566 JONES MANAGER 7839 1981-04-02 2975 20
7788 SCOTT ANALYST 7566 1982-12-09 3000 20
7876 ADAMS CLERK 7788 1983-01-12 1100 20
7902 FORD ANALYST 7566 1981-12-03 3000 20
7369 SMITH CLERK 7902 1980-12-17 800 20
7698 BLAKE MANAGER 7839 1981-05-01 2850 30
7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30
7521 WARD SALESMAN 7698 1981-02-22 1250 500 30
7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30
7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30
EMPNO NAME JOB BOSS HIREDATE SALARY COMM DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
7900 JAMES CLERK 7698 1981-12-03 950 30
7782 CLARK MANAGER 7839 1981-06-09 2450 10
7934 MILLER CLERK 7782 1982-01-23 1300 10
14 rows selected.
Performance Tip
You may experience poor performance when using NFS. See the post below for NFS performance tuning before testing with Oracle external tables, and also use Oracle parallelism to improve query times.
Optimizing NFS Performance [HDP NFS]
I performed the tests below; pay attention to the query execution times.
1- Create an external table using the data pump access driver on the local file system:
CREATE TABLE ext_tab1
ORGANIZATION EXTERNAL
(
TYPE oracle_datapump
DEFAULT DIRECTORY LOCAL_OS_DIR
LOCATION('ext_tab1.dmp')
)
AS SELECT * FROM sys.myobj;
The myobj table was created from the user_objects view and contains about 4 GB of data.
2- Run a count query on the table while the dmp file is on the local file system:
SQL> select count(*) from ext_tab1;
COUNT(*)
----------
26524160
Elapsed: 00:00:03.28
3- Change the storage location to HDFS. I copied the dmp file created in step 1 to HDFS for this test.
SQL> host cp /data/ora_ext_tab_dir/ext_tab1.dmp /data/hdfsloc/data/oraclenfs
Now alter the table to use the new location on NFS (HDFS):
SQL> alter table ext_tab1 LOCATION(NFS_DIR:'ext_tab1.dmp');
Table altered.
4- Run the count statement again. The query took about 5 minutes against HDFS, whereas on local storage it took about 3 seconds.
SQL> select count(*) from ext_tab1;
COUNT(*)
----------
26524160
Elapsed: 00:04:59.99
5- Run the query with a parallel hint and observe the time taken:
SQL> SELECT /*+ PARALLEL(ext_tab1,4) */ COUNT(*) FROM ext_tab1;
COUNT(*)
----------
26524160
Elapsed: 00:01:58.77
-- test again with degree 8; the result improved
SQL> SELECT /*+ PARALLEL(ext_tab1,8) */ COUNT(*) FROM ext_tab1;
COUNT(*)
----------
26524160
Elapsed: 00:00:36.83
-- third run with degree 12; the result matched degree 8
SQL> SELECT /*+ PARALLEL(ext_tab1,12) */ COUNT(*) FROM ext_tab1;
COUNT(*)
----------
26524160
Elapsed: 00:00:36.56
6- Change the parallelism of the table:
SQL> alter table ext_tab1 parallel (degree 8);
Table altered.
-- run the count statement again without a hint
SQL> select count(*) from ext_tab1;
COUNT(*)
----------
26524160
Elapsed: 00:00:36.42
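The timings above (roughly 5 minutes serially, then 119 s, 37 s, and a plateau from degree 8 onward) are the pattern you would expect when the parallel scan scales until it hits the shared NFS throughput ceiling. As a rough illustration only, here is a toy Python model; the 300 s serial time and 36 s I/O floor are assumed round numbers loosely echoing the measurements above, not measured values:

```python
def elapsed_seconds(scan_seconds: float, dop: int, io_floor: float) -> float:
    """Toy model: parallel slaves split the scan work evenly, but the total
    elapsed time can never drop below the time the shared NFS link needs
    to ship the data (the I/O floor)."""
    return max(scan_seconds / dop, io_floor)

# Assumed illustration: 300 s serial scan, ~36 s NFS transfer floor.
for dop in (1, 4, 8, 12):
    print(dop, round(elapsed_seconds(300.0, dop, 36.0), 1))
```

This is why the jump from degree 8 to 12 bought almost nothing: beyond the point where the scan saturates the NFS link, extra parallel slaves only add coordination overhead.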