

Note: All posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers. Please do not try any post on production servers until you are sure.

Monday, December 25, 2017

Creating Oracle External Table (12c) on HDFS using HDP NFS Gateway

Purpose:

Offloading Oracle Data to HDFS

Prerequisites:

The Hortonworks (HDP) NFS Gateway is up and running.



Creating External table using csv File on HDFS

1- Place a CSV file on HDFS through the NFS gateway configured earlier; see the post Configuring NFS Gateway for HDFS [HDP].

I've copied emp.csv to HDFS for the purpose of this post.
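Because the NFS gateway exposes HDFS as an ordinary mount, the copy is a plain cp. A minimal sketch follows; it uses a stand-in directory via NFS_DIR so it runs anywhere, whereas on a real system you would point NFS_DIR at the gateway mount (/data/hdfsloc/data/oraclenfs in this post):

```shell
# Stand-in for the NFS gateway mount point
# (/data/hdfsloc/data/oraclenfs on the system used in this post)
NFS_DIR="${NFS_DIR:-$(mktemp -d)}"

# Create a small sample CSV (first two rows of emp.csv)
cat > /tmp/emp_sample.csv <<'EOF'
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
EOF

# A plain cp through the NFS mount writes the file into HDFS
cp /tmp/emp_sample.csv "$NFS_DIR/"
ls -l "$NFS_DIR/emp_sample.csv"
```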

2- Verify the contents of the CSV file on HDFS

SQL> host pwd
/data/hdfsloc/data/oraclenfs


SQL> host ls -l
total 1
-rwxr-xr-x 1 oracle oinstall 617 Dec 25 10:43 emp.csv


SQL> host cat emp.csv
7369,SMITH,CLERK,7902,17-DEC-80,800,,20
7499,ALLEN,SALESMAN,7698,20-FEB-81,1600,300,30
7521,WARD,SALESMAN,7698,22-FEB-81,1250,500,30
7566,JONES,MANAGER,7839,02-APR-81,2975,,20
7654,MARTIN,SALESMAN,7698,28-SEP-81,1250,1400,30
7698,BLAKE,MANAGER,7839,01-MAY-81,2850,,30
7782,CLARK,MANAGER,7839,09-JUN-81,2450,,10
7788,SCOTT,ANALYST,7566,19-APR-87,3000,,20
7839,KING,PRESIDENT,,17-NOV-81,5000,,10
7844,TURNER,SALESMAN,7698,08-SEP-81,1500,0,30
7876,ADAMS,CLERK,7788,23-MAY-87,1100,,20
7900,JAMES,CLERK,7698,03-DEC-81,950,,30
7902,FORD,ANALYST,7566,03-DEC-81,3000,,20
7934,MILLER,CLERK,7782,23-JAN-82,1300,,10



3- Create the database directories to be used by the external table. Because random writes are not allowed on HDFS, the log-related directory is created on the local OS filesystem; the directory for the external table data points to HDFS.

DROP DIRECTORY admin_log_dir;

SQL> CREATE OR REPLACE DIRECTORY admin_log_dir AS '/data/log/'
  2  /


Directory created.

DROP DIRECTORY nfs_dir;

SQL> CREATE OR REPLACE DIRECTORY nfs_dir AS '/data/hdfsloc/data/oraclenfs/';


Directory created.


4- Create the external table. By default, the ORACLE_LOADER access driver treats every field as CHAR(255); to process the date field we must specify its data type and date format.
drop table emp_ext;

CREATE TABLE emp_ext (
 EMPNO     NUMBER(4),
 ENAME     VARCHAR2(10),
 JOB       VARCHAR2(9),
 MGR       NUMBER(4),
 HIREDATE  DATE,
 SAL       NUMBER(7,2),
 COMM      NUMBER(7,2),
 DEPTNO    VARCHAR2(3)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY NFS_DIR
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    BADFILE ADMIN_LOG_DIR:'emp_ext%a_%p.bad'
    LOGFILE ADMIN_LOG_DIR:'emp_ext%a_%p.log'
    DISCARDFILE ADMIN_LOG_DIR:'emp_ext%a_%p.disc'
    DNFS_ENABLE
    DNFS_READBUFFERS 10
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
    (
      empno,
      ename,
      job,
      mgr,
      hiredate   CHAR(9) DATE_FORMAT DATE MASK "DD-MON-YYYY HH24:MI:SS",
      sal,
      comm,
      deptno  
    )
  )
  LOCATION (nfs_dir:'emp*.csv')
)
PARALLEL 1
REJECT LIMIT UNLIMITED;




5- Query external table

SQL> select * from emp_ext;

     EMPNO ENAME      JOB              MGR HIREDATE                  SAL       COMM DEP
---------- ---------- --------- ---------- ------------------ ---------- ---------- ---
      7369 SMITH      CLERK           7902 17-DEC-80                 800            20
      7499 ALLEN      SALESMAN        7698 20-FEB-81                1600        300 30
      7521 WARD       SALESMAN        7698 22-FEB-81                1250        500 30
      7566 JONES      MANAGER         7839 02-APR-81                2975            20
      7654 MARTIN     SALESMAN        7698 28-SEP-81                1250       1400 30
      7698 BLAKE      MANAGER         7839 01-MAY-81                2850            30
      7782 CLARK      MANAGER         7839 09-JUN-81                2450            10
      7788 SCOTT      ANALYST         7566 19-APR-87                3000            20
      7839 KING       PRESIDENT            17-NOV-81                5000            10
      7844 TURNER     SALESMAN        7698 08-SEP-81                1500          0 30
      7876 ADAMS      CLERK           7788 23-MAY-87                1100            20

     EMPNO ENAME      JOB              MGR HIREDATE                  SAL       COMM DEP
---------- ---------- --------- ---------- ------------------ ---------- ---------- ---
      7900 JAMES      CLERK           7698 03-DEC-81                 950            30
      7902 FORD       ANALYST         7566 03-DEC-81                3000            20
      7934 MILLER     CLERK           7782 23-JAN-82                1300            10

14 rows selected.



If you encounter any issue while selecting data, check the log file directory specified in the external table definition. I encountered the following error in my example.

error processing column DEPTNO in row 1 for datafile /data/hdfsloc/data/oraclenfs/emp.csv
ORA-01722: invalid number

The error above occurred because the DEPTNO values in the data (CSV) file had trailing spaces appended. Declaring DEPTNO as VARCHAR2 (as in the DDL above) avoids the numeric conversion; alternatively, the LRTRIM trim option in the FIELDS clause strips the spaces at load time, or the file itself can be cleaned before loading.
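If you prefer to clean the file itself, a small script can trim every field before the load. This is a sketch, not part of the original post; clean_csv is a hypothetical helper name:

```python
import csv
import io

def clean_csv(text: str) -> str:
    """Strip leading/trailing whitespace from every field so numeric
    columns such as DEPTNO load cleanly (avoids ORA-01722)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([field.strip() for field in row])
    return out.getvalue()

# A row with a trailing space on DEPTNO, as described above
dirty = "7369,SMITH,CLERK,7902,17-DEC-80,800,,20 \n"
print(clean_csv(dirty), end="")  # trailing space removed
```

To clean a whole file, read it, pass the text through clean_csv, and write it back through the NFS mount.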


Creating External table using Data Pump on HDFS

1- Create a database directory on the local file system to hold the data pump file.

SQL> CREATE OR REPLACE DIRECTORY local_os_dir AS '/data/ora_ext_tab_dir';

Directory created.

2- Create a data pump file with the statement below.


DROP TABLE export_employee;

CREATE TABLE export_employee
   ORGANIZATION EXTERNAL
   (
   TYPE oracle_datapump
   DEFAULT DIRECTORY LOCAL_OS_DIR
   LOCATION('employee_ext.dmp')
   )
AS SELECT * FROM scott.employee;


Query the newly created table to verify the data.


SQL> SELECT * FROM export_employee;

     EMPNO NAME       JOB             BOSS HIREDATE         SALARY       COMM     DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
      7839 KING       PRESIDENT            1981-11-17         5000                    10
      7566 JONES      MANAGER         7839 1981-04-02         2975                    20
      7788 SCOTT      ANALYST         7566 1982-12-09         3000                    20
      7876 ADAMS      CLERK           7788 1983-01-12         1100                    20
      7902 FORD       ANALYST         7566 1981-12-03         3000                    20
      7369 SMITH      CLERK           7902 1980-12-17          800                    20
      7698 BLAKE      MANAGER         7839 1981-05-01         2850                    30
      7499 ALLEN      SALESMAN        7698 1981-02-20         1600        300         30
      7521 WARD       SALESMAN        7698 1981-02-22         1250        500         30
      7654 MARTIN     SALESMAN        7698 1981-09-28         1250       1400         30
      7844 TURNER     SALESMAN        7698 1981-09-08         1500          0         30

     EMPNO NAME       JOB             BOSS HIREDATE         SALARY       COMM     DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
      7900 JAMES      CLERK           7698 1981-12-03          950                    30
      7782 CLARK      MANAGER         7839 1981-06-09         2450                    10
      7934 MILLER     CLERK           7782 1982-01-23         1300                    10

14 rows selected.


At this point you are querying the table from the data pump file on the local file system (LOCAL_OS_DIR).


3- Now copy the data pump file to the HDFS location (NFS_DIR); note that the current directory is still /data/hdfsloc/data/oraclenfs, as shown earlier.

SQL> host cp /data/ora_ext_tab_dir/employee_ext.dmp .

SQL> host ls -l
total 14
-rwxr-xr-x 1 oracle oinstall   617 Dec 25 10:43 emp.csv
-rw-r----- 1 oracle oinstall 12288 Dec 25 12:22 employee_ext.dmp

4- Alter the table to point to the new location on NFS (HDFS)

SQL> alter table export_employee LOCATION(NFS_DIR:'employee_ext.dmp');

Table altered.

Query the external table again 
SQL> SELECT * FROM export_employee;


     EMPNO NAME       JOB             BOSS HIREDATE         SALARY       COMM     DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
      7839 KING       PRESIDENT            1981-11-17         5000                    10
      7566 JONES      MANAGER         7839 1981-04-02         2975                    20
      7788 SCOTT      ANALYST         7566 1982-12-09         3000                    20
      7876 ADAMS      CLERK           7788 1983-01-12         1100                    20
      7902 FORD       ANALYST         7566 1981-12-03         3000                    20
      7369 SMITH      CLERK           7902 1980-12-17          800                    20
      7698 BLAKE      MANAGER         7839 1981-05-01         2850                    30
      7499 ALLEN      SALESMAN        7698 1981-02-20         1600        300         30
      7521 WARD       SALESMAN        7698 1981-02-22         1250        500         30
      7654 MARTIN     SALESMAN        7698 1981-09-28         1250       1400         30
      7844 TURNER     SALESMAN        7698 1981-09-08         1500          0         30

     EMPNO NAME       JOB             BOSS HIREDATE         SALARY       COMM     DEPTNO
---------- ---------- --------- ---------- ------------ ---------- ---------- ----------
      7900 JAMES      CLERK           7698 1981-12-03          950                    30
      7782 CLARK      MANAGER         7839 1981-06-09         2450                    10
      7934 MILLER     CLERK           7782 1982-01-23         1300                    10

14 rows selected.



Performance Tip

You may experience poor performance over NFS. See the post below on NFS performance tuning, then test again with Oracle external tables. Oracle parallelism also improves query times.

Optimizing NFS Performance [HDP NFS]

I performed the tests below; pay attention to the elapsed time of each query.

1- Create an external table using the ORACLE_DATAPUMP access driver on the local file system

CREATE TABLE ext_tab1
   ORGANIZATION EXTERNAL
   (
   TYPE oracle_datapump
   DEFAULT DIRECTORY LOCAL_OS_DIR
   LOCATION('ext_tab1.dmp')
   )
AS SELECT * FROM sys.myobj;

myobj was created from the user_objects view and holds about 4 GB of data.

2- Run a count query on the table while the dump file is on the local file system.

SQL> select count(*) from ext_tab1;

  COUNT(*)
----------
  26524160

Elapsed: 00:00:03.28


3- Change the storage location to HDFS. I copied the dump file created in step 1 to HDFS for this test.

SQL> host cp /data/ora_ext_tab_dir/ext_tab1.dmp /data/hdfsloc/data/oraclenfs


Now alter the table to point to the new location on NFS (HDFS)

SQL> alter table ext_tab1 LOCATION(NFS_DIR:'ext_tab1.dmp');

Table altered.

4- Run the count query again. Against HDFS it took about 5 minutes, compared with about 3 seconds on local storage.

SQL> select count(*) from ext_tab1;

  COUNT(*)
----------
  26524160

Elapsed: 00:04:59.99

5- Run the query with a parallel hint and observe the time taken

SQL> SELECT /*+ PARALLEL(ext_tab1,4) */ COUNT(*)  FROM ext_tab1;

  COUNT(*)
----------
  26524160

Elapsed: 00:01:58.77

-- test again with degree 8; the elapsed time improved

SQL> SELECT /*+ PARALLEL(ext_tab1,8) */ COUNT(*)  FROM ext_tab1;

  COUNT(*)
----------
  26524160

Elapsed: 00:00:36.83

-- a third run with degree 12 performed about the same as degree 8

SQL> SELECT /*+ PARALLEL(ext_tab1,12) */ COUNT(*)  FROM ext_tab1;

  COUNT(*)
----------
  26524160

Elapsed: 00:00:36.56

6- Change the table's default degree of parallelism

SQL> alter table ext_tab1 parallel (degree 8);

Table altered.

-- run the count query again, without a hint

SQL> select count(*) from ext_tab1;

  COUNT(*)
----------
  26524160

Elapsed: 00:00:36.42
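A quick back-of-the-envelope on the timings above shows that the speedup plateaus around degree 8, which suggests the bottleneck at that point is NFS gateway throughput rather than CPU. This is just arithmetic on the elapsed times reported above:

```python
# Elapsed times observed in the tests above, in seconds
timings = {
    "serial (HDFS)": 4 * 60 + 59.99,   # 00:04:59.99
    "parallel 4":    1 * 60 + 58.77,   # 00:01:58.77
    "parallel 8":    36.83,            # 00:00:36.83
    "parallel 12":   36.56,            # 00:00:36.56
}

base = timings["serial (HDFS)"]
for name, secs in timings.items():
    print(f"{name:>14}: {secs:7.2f}s  speedup {base / secs:4.1f}x")
```

Degree 4 gives roughly a 2.5x speedup and degree 8 roughly 8x; going to degree 12 buys essentially nothing more.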
