Introduction
Oracle Big Data Connectors facilitate access to data stored in an Apache Hadoop cluster. They can be licensed for use on either Oracle Big Data Appliance or a Hadoop cluster running on commodity hardware. Three connectors are available; for the purpose of this post we will work with Oracle SQL Connector for Hadoop Distributed File System.
Oracle SQL Connector for Hadoop Distributed File System (OSCH) enables an Oracle external table to access data stored in Hadoop Distributed File System (HDFS) files or in an Apache Hive table. The data can remain in HDFS or the Hive table, or it can be loaded into an Oracle database. OSCH is a command-line utility that accepts the generic command-line arguments supported by the org.apache.hadoop.util.Tool interface. It also provides a preprocessor for Oracle external tables.
You install OSCH on the system where Oracle Database runs, as it works as a Hadoop client. If Hive tables are used as the data source, then you must also install and run Oracle SQL Connector for HDFS on a Hadoop client where users access Hive.
For Oracle RAC systems including Oracle Exadata Database Machine, you must install and configure Oracle SQL Connector for HDFS using identical paths on all systems running Oracle instances.
OSCH uses location files (specified in the location clause of the external table) that contain only the Universal Resource Identifiers (URIs) of the data files. A data file contains the data stored in HDFS. The format of a location file name is osch-timestamp-number-n.
You can enable parallel processing with external tables by specifying multiple files in the location clause of the external table. The number of files determines the number of child processes started by the external table during a table read, which is known as the degree of parallelism or DOP.
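As an illustration of the DOP rule above (the table name and location file names here are hypothetical, not taken from OSCH output), an external table whose location clause names three files supports up to three parallel child processes:

```sql
-- Hypothetical external table; three location files allow a DOP of up to 3
CREATE TABLE catalog_ext_demo (
  catalogid   NUMBER,
  catalogname VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY osch_exttab_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
    FIELDS TERMINATED BY ','
  )
  LOCATION ('osch-example-1', 'osch-example-2', 'osch-example-3')
)
PARALLEL 3 REJECT LIMIT UNLIMITED;
```

A table read can then start up to three child processes, one per location file.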
Software Requirements
OSCH requires the following software:
On the Hadoop cluster: Cloudera's Distribution including Apache Hadoop version 4 (CDH4) or version 5 (CDH5), Apache Hadoop 1.0 (formerly 0.20.2), or Apache Hadoop 1.1.1
Java Development Kit (JDK) 1.6_08 or later. Consult the distributor of your Hadoop software (Cloudera or Apache) for the recommended version.
Hive 0.7.1, 0.8.1, 0.9.0, 0.10.0, or 0.12.0 (required for Hive table access, otherwise optional)
On the Oracle Database system and Hadoop client systems:
Oracle Database 12c, Oracle Database 11g release 2 (11.2.0.2 or later), or Oracle Database 10g release 2 (10.2.0.5) for Linux.
The same version of Hadoop as your Hadoop cluster: CDH4, CDH5, Apache Hadoop 1.0, or Apache Hadoop 1.1.1.
The same version of JDK as your Hadoop cluster.
Prerequisites:
Hadoop Cluster is already up and running
Steps to use OSCH
1- Object Preparation at Oracle Side
Create related tablespace and user for OSCH
SQL> CREATE TABLESPACE hdfsdata datafile 'hdfsdata.ora' size 100M;
Tablespace created.
SQL> CREATE USER hdfsuser IDENTIFIED BY hdfsuser DEFAULT TABLESPACE hdfsdata QUOTA UNLIMITED ON hdfsdata;
User created.
2- Directory Preparation (OS Level)
Create the appropriate directory on the Hadoop client/Oracle server for OSCH:
[root@en01 ~]# mkdir -p /usr/hadoopsw/osch
[root@en01 ~]# chmod -R 777 /usr/hadoopsw/osch
3- Download and Extract OSCH
Download the latest OSCH zip file (oraosch-3.6.0.zip) from the URL below and extract it into the /usr/hadoopsw/osch directory.
http://www.oracle.com/technetwork/database/database-technologies/bdc/big-data-connectors/downloads/index.html
[root@en01 ~]# cd /usr/hadoopsw/osch/
[root@en01 osch]# ll
total 6084
-rw-r--r--. 1 root root 6229854 May 28 2017 oraosch-3.6.0.zip
[root@en01 osch]# unzip oraosch-3.6.0.zip
Archive: oraosch-3.6.0.zip
extracting: orahdfs-3.6.0.zip
inflating: README.txt
[root@en01 osch]# ll
total 12176
-r--r--r--. 1 root root 6227413 Aug 30 2016 orahdfs-3.6.0.zip
-rw-r--r--. 1 root root 6229854 May 28 2017 oraosch-3.6.0.zip
-r-xr-xr-x. 1 root root 6034 Aug 30 2016 README.txt
[root@en01 osch]# unzip orahdfs-3.6.0.zip
Archive: orahdfs-3.6.0.zip
creating: orahdfs-3.6.0/
creating: orahdfs-3.6.0/examples/
creating: orahdfs-3.6.0/examples/sql/
inflating: orahdfs-3.6.0/examples/sql/mkhive_unionall_view.sql
creating: orahdfs-3.6.0/jlib/
inflating: orahdfs-3.6.0/jlib/osdt_cert.jar
inflating: orahdfs-3.6.0/jlib/oraloader.jar
inflating: orahdfs-3.6.0/jlib/oraclepki.jar
inflating: orahdfs-3.6.0/jlib/ojdbc7.jar
inflating: orahdfs-3.6.0/jlib/osdt_core.jar
inflating: orahdfs-3.6.0/jlib/ora-hadoop-common.jar
inflating: orahdfs-3.6.0/jlib/orahdfs.jar
inflating: orahdfs-3.6.0/jlib/orai18n.jar
creating: orahdfs-3.6.0/log/
creating: orahdfs-3.6.0/doc/
inflating: orahdfs-3.6.0/doc/README.txt
creating: orahdfs-3.6.0/bin/
inflating: orahdfs-3.6.0/bin/hdfs_stream
inflating: orahdfs-3.6.0/bin/hdfs_stream.cmd
Set the environment variables in .bash_profile. In this setup the Hadoop client, Hive, Pig, and the Oracle server are all on one machine, so all the related environment variables appear in the same .bash_profile.
vi ~/.bash_profile
###################Inside .bash_profile [hdpclient]########################
## JAVA env variables
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/jre/lib:$JAVA_HOME/lib:$JAVA_HOME/lib/tools.jar
## Hadoop Variables
export HADOOP_HOME=/usr/hadoopsw/hadoop-2.7.3
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
#Hive Variables
export HIVE_HOME=/usr/hadoopsw/apache-hive-2.1.1-bin
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:.
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
#Derby Variables
export DERBY_HOME=/usr/hadoopsw/db-derby-10.13.1.1-bin
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbyclient.jar:$DERBY_HOME/lib/derbytools.jar
export DERBY_OPTS=-Dij.protocol=jdbc:derby://hadoopedge1/
# PIG Variables
export PIG_HOME=/usr/hadoopsw/pig-0.16.0
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_HOME/etc/hadoop/
# Oracle and OSCH Settings
TMP=/tmp; export TMP
TMPDIR=$TMP; export TMPDIR
ORACLE_HOSTNAME=en01; export ORACLE_HOSTNAME
ORACLE_UNQNAME=orcl; export ORACLE_UNQNAME
ORACLE_BASE=/u01/app/oracle; export ORACLE_BASE
ORACLE_HOME=$ORACLE_BASE/product/12.2.0.1/db_1; export ORACLE_HOME
ORACLE_SID=orcl; export ORACLE_SID
export NLS_LANG=American_America.AL32UTF8
PATH=/usr/sbin:$PATH; export PATH
PATH=$ORACLE_HOME/bin:$PATH; export PATH
LD_LIBRARY_PATH=$ORACLE_HOME/lib:/lib:/usr/lib; export LD_LIBRARY_PATH
CLASSPATH=$CLASSPATH:$ORACLE_HOME/jlib:$ORACLE_HOME/rdbms/jlib; export CLASSPATH
export OSCH_HOME=/usr/hadoopsw/osch/orahdfs-3.6.0
export HADOOP_CLASSPATH=$HADOOP_HOME/lib/*:$OSCH_HOME/jlib/*:$HIVE_HOME/lib/*:$HIVE_HOME/conf
export PATH=$PATH:$HADOOP_MAPRED_HOME:$OSCH_HOME/bin
One more environment variable setting is needed, in the preprocessor script $OSCH_HOME/bin/hdfs_stream. The hdfs_stream script does not inherit the environment variables set in the bash shell; its PATH variable is set to /bin:/usr/bin. The script needs access to the Hadoop bin directory, and since Hadoop is not installed under /bin or /usr/bin, add the Hadoop bin directory to the PATH variable in the hdfs_stream script.
export PATH=/usr/hadoopsw/hadoop-2.7.3/bin:/usr/bin:/bin
If the Hadoop bin directory is not added to the PATH variable in hdfs_stream, the following error is generated when a SQL query is run on the external table created by OSCH.
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04095: preprocessor command /osch/orahdfs-3.6.0/bin/hdfs_stream encountered error "which: no hadoop in (/usr/bin:/bin)"
5- Creating Directory Objects and Setting Permissions
Before running OSCH, some directory objects must be created, with appropriate permissions set, for the Oracle Database schema (i.e., hdfsuser) in which the external table over the HDFS data is to be created. For one of the directory objects, create an OS directory with the following commands.
[root@en01 ~]# mkdir -p /data/ora_ext_tab_dir
[root@en01 ~]# chmod -R 777 /data/ora_ext_tab_dir
Run the following SQL script in SQL*Plus to create the directory objects OSCH_EXTTAB_DIR and OSCH_BIN_PATH and grant the required permissions to the hdfsuser schema.
GRANT CREATE ANY DIRECTORY to hdfsuser;
CREATE OR REPLACE DIRECTORY OSCH_EXTTAB_DIR AS '/data/ora_ext_tab_dir';
CREATE OR REPLACE DIRECTORY OSCH_BIN_PATH AS '/usr/hadoopsw/osch/orahdfs-3.6.0/bin';
GRANT READ, WRITE ON DIRECTORY OSCH_EXTTAB_DIR TO hdfsuser;
GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW TO hdfsuser;
GRANT EXECUTE ON sys.utl_file TO hdfsuser;
GRANT READ, EXECUTE ON DIRECTORY OSCH_BIN_PATH TO hdfsuser;
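To sanity-check the setup, the directory objects can be verified from the data dictionary (ALL_DIRECTORIES is a standard Oracle view; this query is an optional check, not part of the original script):

```sql
-- Connect as hdfsuser and confirm both directory objects are visible
SELECT directory_name, directory_path
FROM   all_directories
WHERE  directory_name IN ('OSCH_EXTTAB_DIR', 'OSCH_BIN_PATH');
```

Both rows should show the OS paths created above.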
6- Adding Data to HDFS
We will create an external table over a delimited text file in HDFS. Create the following text file, store it in HDFS, and then read it through the Oracle external table.
catalog.txt:
1,Item1
2,Item2
3,Item3
4,Item4
5,Item5
[hdpsysuser@nn01 ~]$ hdfs dfs -mkdir /catalog
[hdpsysuser@nn01 ~]$ hdfs dfs -put /data/mydata/catalog.txt /catalog/catalog.txt
7- Using the ExternalTable Tool
OSCH provides the ExternalTable tool to create an external table; it is run with the following command.
hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable [genericOptions] COMMAND [options]
The genericOptions are the generic command-line options supported by the Hadoop Tool interface. The ExternalTable tool is used with one of the commands listed in the following table.
Command | Description
-createTable | Creates an external table and publishes the HDFS data URIs to the location files of the table
-describe | Describes tables created by OSCH
-drop | Drops tables created by OSCH
-publish | Publishes the HDFS data URIs to the location files of an external table
-listLocations | Prints the location files of an external table
-getDDL | Prints the DDL of an external table
The ExternalTable tool supports the following options.
Option | Description
--noexecute | Prints the execution plan of the -createTable, -publish, and -drop commands
--details | Prints a detailed listing of the location files for the -listLocations command
--output <filename> | Writes the table DDL from -createTable to the specified file
8- Creating the Configuration file for the ExternalTable tool
The OSCH ExternalTable tool is driven by two kinds of configuration properties: general properties and connection properties. They may be specified either on the command line with -D or in an XML configuration file; we shall use an XML configuration file. Create a file catalog_hdfs.xml and specify each property with the following structure.
catalog_hdfs.xml
<?xml version="1.0"?>
<configuration>
<property>
<name></name>
<value></value>
</property>
</configuration>
OSCH supports three input formats: delimited text files, Data Pump files, and Hive tables. For each input format some configuration properties are required and some are optional. For delimited text file input the following configuration properties are required.
Property | Description | Value used
oracle.hadoop.exttab.tableName | Schema-qualified name of the external table; the schema defaults to the user name if omitted. | HDFSUSER.CATALOG_EXT
oracle.hadoop.exttab.defaultDirectory | The Oracle Database directory object for the external table. | OSCH_EXTTAB_DIR (created earlier)
oracle.hadoop.exttab.dataPaths | A comma-separated list of HDFS paths. | hdfs://nn01:9000/catalog
oracle.hadoop.exttab.columnCount or oracle.hadoop.exttab.columnNames | columnCount is the number of columns in the external table; columnNames is a comma-separated list of column names and defaults to C1, C2, ... Cn. | CATALOGID,CATALOGNAME
oracle.hadoop.connection.url | The JDBC connection URL for Oracle Database. | jdbc:oracle:thin:@en01:1521:orcl
oracle.hadoop.connection.user | The Oracle Database user. | HDFSUSER
The data type mapping for the Oracle Database columns may be set for all columns with the oracle.hadoop.exttab.colMap.columnType property, which defaults to VARCHAR2, or individually with the oracle.hadoop.exttab.colMap.column_name.columnType property, as we have done here. The source type is set with the oracle.hadoop.exttab.sourceType property, which may be text, hive, or datapump; the default is text. The number of location files for the external table is set with oracle.hadoop.exttab.locationFileCount and defaults to 4. The specified value is not always used: the lesser of the number of data files and the locationFileCount value is used. For delimited text file input, the field terminator may be specified with the oracle.hadoop.exttab.fieldTerminator property and defaults to ',' (comma). The configuration file catalog_hdfs.xml is listed below; copy it to the /usr/hadoopsw/osch directory.
catalog_hdfs.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>oracle.hadoop.exttab.tableName</name>
<value>HDFSUSER.CATALOG_EXT</value>
</property>
<property>
<name>oracle.hadoop.exttab.locationFileCount</name>
<value>3</value>
</property>
<property>
<name>oracle.hadoop.exttab.dataPaths</name>
<value>hdfs://nn01:9000/catalog</value>
</property>
<property>
<name>oracle.hadoop.exttab.fieldTerminator</name>
<value>\u002C</value>
</property>
<property>
<name>oracle.hadoop.exttab.defaultDirectory</name>
<value>OSCH_EXTTAB_DIR</value>
</property>
<property>
<name>oracle.hadoop.exttab.columnNames</name>
<value>CATALOGID,CATALOGNAME</value>
</property>
<property>
<name>oracle.hadoop.exttab.colMap.CATALOGID.columnType</name>
<value>NUMBER</value>
</property>
<property>
<name>oracle.hadoop.exttab.colMap.CATALOGNAME.columnType</name>
<value>VARCHAR2</value>
</property>
<property>
<name>oracle.hadoop.exttab.sourceType</name>
<value>text</value>
</property>
<property>
<name>oracle.hadoop.connection.url</name>
<value>jdbc:oracle:thin:@en01:1521:orcl</value>
</property>
<property>
<name>oracle.hadoop.connection.user</name>
<value>HDFSUSER</value>
</property>
</configuration>
9- Creating the External Table for the Delimited Text File in HDFS
Next, we shall run the ExternalTable tool to create an external table for the delimited text file catalog.txt in HDFS. The configuration file is specified on the command line with -conf. Optionally, the oracle.hadoop.exttab.printStackTrace property may be specified to print the stack trace. To create the external table and add the data from the HDFS file catalog.txt to it, run the following command from the /usr/hadoopsw/osch directory.
OSCH starts and prompts for the password of the Oracle Database schema specified as the user in the configuration file. Unlike Oracle Loader for Hadoop, the password is entered at a command prompt rather than in the configuration file. Enter the hdfsuser schema password and press Enter.
/usr/hadoopsw/osch
[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -createTable
Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
The create table command succeeded.
User: "HDFSUSER" performed the following actions in schema: HDFSUSER
CREATE TABLE "HDFSUSER"."CATALOG_EXT"
(
"CATALOGID" NUMBER,
"CATALOGNAME" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY "OSCH_EXTTAB_DIR"
ACCESS PARAMETERS
(
RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'2C'
MISSING FIELD VALUES ARE NULL
(
"CATALOGID" CHAR,
"CATALOGNAME" CHAR(4000)
)
)
LOCATION
(
'osch-20170530021546-6194-1'
)
) PARALLEL REJECT LIMIT UNLIMITED;
The following location files were created.
osch-20170530021546-6194-1 contains 1 URI, 52 bytes
52 hdfs://nn01:9000/catalog/catalog.txt
[hdpclient@en01 osch]$
[hdpclient@en01 osch]$ ll /data/ora_ext_tab_dir
total 4
-rw-r--r--. 1 oracle oinstall 596 May 30 14:15 osch-20170530021546-6194-1
10- Verify Data via the Location File and SQL*Plus
[hdpclient@en01 osch]$ hdfs_stream /data/ora_ext_tab_dir/osch-20170530021546-6194-1
1,Item1
2,Item2
3,Item3
4,Item4
5,Item5
Run a DESC statement in SQL*Plus to list the description of the external table created.
SQL> conn hdfsuser/hdfsuser
Connected.
SQL> set lines 100
SQL> desc catalog_ext;
Name Null? Type
----------------------------------------------------- -------- ------------------------------------
CATALOGID NUMBER
CATALOGNAME VARCHAR2(4000)
Run a SQL query on the external table. The HDFS data on which the table's location file is based is listed. If you encounter an error, investigate and fix it first.
SQL> select * from catalog_ext;
select * from catalog_ext
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P000
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04004: error while reading file
/data/ora_ext_tab_dir/osch-20170529035900-5230-1
[hdpclient@en01 osch]$ ll /data/ora_ext_tab_dir/
total 8
-rw-r-----. 1 oracle oinstall 2787 May 29 16:05 CATALOG_EXT_7588.log
-rw-r--r--. 1 oracle oinstall 596 May 29 15:59 osch-20170529035900-5230-1
I checked the log file and found a permission issue. The log file is located in the same folder designated as the external table's directory object.
[hdpclient@en01 osch]$ cat /data/ora_ext_tab_dir/CATALOG_EXT_7588.log
LOG file opened at 05/29/17 16:09:53
KUP-05007: Warning: Intra source concurrency disabled because the preprocessor option is being used.
Field Definitions for table CATALOG_EXT
Record format DELIMITED, delimited by 0A
Data in file has same endianness as the platform
Rows with all null fields are accepted
Fields in Data Source:
CATALOGID CHAR (255)
Terminated by "2C"
Trim whitespace same as SQL Loader
CATALOGNAME CHAR (4000)
Terminated by "2C"
Trim whitespace same as SQL Loader
KUP-04004: error while reading file /data/ora_ext_tab_dir/osch-20170529035900-5230-1
KUP-04017: OS message: Error 0
KUP-04017: OS message: /usr/hadoopsw/osch/orahdfs-3.6.0/bin/hdfs_stream: line 286: /usr/hadoopsw/osch/orahdfs-3.6.0/bin/../log/osch-log_16999-0529171496063393.log: Permission denied
KUP-04118: operation "pipe read", location "skudmir:2"
Grant the permissions at the OS level and try again:
[root@en01 ~]# chmod -R 777 /data
[root@en01 ~]# chmod -R 777 /usr/hadoopsw/osch
SQL> select * from catalog_ext;
CATALOGID
----------
CATALOGNAME
--------------------------------------------------------------------------------
1
Item1
2
Item2
3
Item3
CATALOGID
----------
CATALOGNAME
--------------------------------------------------------------------------------
4
Item4
5
Item5
SQL>
11- Perform some more tests
Remove the file on HDFS and try the above query again
[hdpsysuser@nn01 ~]$ hdfs dfs -rm /catalog/*
17/05/29 16:18:25 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 120 minutes, Emptier interval = 45 minutes.
Moved: 'hdfs://nn01:9000/catalog/catalog.txt' to trash at: hdfs://nn01:9000/user/hdpsysuser/.Trash/Current
SQL> select * from catalog_ext;
select * from catalog_ext
*
ERROR at line 1:
ORA-12801: error signaled in parallel query server P000
ORA-29913: error in executing ODCIEXTTABLEFETCH callout
ORA-29400: data cartridge error
KUP-04004: error while reading file /data/ora_ext_tab_dir/osch-20170530021546-6194-1
Add the file to HDFS again and retry the query; it will succeed.
[hdpsysuser@nn01 ~]$ hdfs dfs -put /data/mydata/catalog.txt /catalog
SQL> select * from catalog_ext;
CATALOGID
----------
CATALOGNAME
----------------------------------------------------------------------------------------------------
1
Item1
2
Item2
Now create a second table in Oracle and join it with the external table.
create table child(id number, cat_detail varchar2(20));
insert into child values (1,'Item-1 Detail 1');
insert into child values (1,'Item-1 Detail 2');
insert into child values (2,'Item-2 Detail 1');
insert into child values (4,'Item-1 Detail 2');
commit;
select catalogname, count(*) cnt from hdfsuser.catalog_ext group by catalogname having count(*) > 0 order by catalogname desc;
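With the CHILD table populated, the external table joins like any ordinary table. A minimal sketch of the join promised above (table and column names as created in this post):

```sql
-- Join the HDFS-backed external table with the local CHILD table
SELECT c.catalogid, c.catalogname, ch.cat_detail
FROM   catalog_ext c
JOIN   child ch ON ch.id = c.catalogid
ORDER  BY c.catalogid;
```

Each fetch from catalog_ext streams the HDFS data through hdfs_stream, so repeated joins reread the files.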
12- Describing Tables created by OSCH
To describe the table(s) created by OSCH, run the -describe command with the ExternalTable tool. The output from the command is listed below:
[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -describe
Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
The described object is "HDFSUSER"."CATALOG_EXT"
CREATE TABLE "HDFSUSER"."CATALOG_EXT"
( "CATALOGID" NUMBER,
"CATALOGNAME" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "OSCH_EXTTAB_DIR"
ACCESS PARAMETERS
( RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'2C'
MISSING FIELD VALUES ARE NULL
(
"CATALOGID" CHAR,
"CATALOGNAME" CHAR(4000)
)
)
LOCATION
( 'osch-20170529035900-5230-1'
)
)
REJECT LIMIT UNLIMITED
PARALLEL
Listing Location Files for external table: [CATALOG_EXT]
osch-20170529035900-5230-1 contains 1 URI, 40 bytes
40 hdfs://nn01:9000/catalog/catalog.txt
13- Listing the Location Files of the External Table
To list the location files for the external table HDFSUSER.CATALOG_EXT, run the -listLocations command with the ExternalTable tool.
[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -listLocations
Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
Listing Location Files for external table: [CATALOG_EXT]
osch-20170529035900-5230-1 contains 1 URI, 40 bytes
40 hdfs://nn01:9000/catalog/catalog.txt
14- Getting the DDL for the External Table
The DDL of the external table may be output with the -getDDL command.
[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -getDDL
Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
CREATE TABLE "HDFSUSER"."CATALOG_EXT"
( "CATALOGID" NUMBER,
"CATALOGNAME" VARCHAR2(4000)
)
ORGANIZATION EXTERNAL
( TYPE ORACLE_LOADER
DEFAULT DIRECTORY "OSCH_EXTTAB_DIR"
ACCESS PARAMETERS
( RECORDS DELIMITED BY 0X'0A'
CHARACTERSET AL32UTF8
PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
FIELDS TERMINATED BY 0X'2C'
MISSING FIELD VALUES ARE NULL
(
"CATALOGID" CHAR,
"CATALOGNAME" CHAR(4000)
)
)
LOCATION
( 'osch-20170529035900-5230-1'
)
)
REJECT LIMIT UNLIMITED
PARALLEL
15- Adding the HDFS Data to the External Table Location Files Separately
The -createTable command adds the HDFS data URIs to the location files of the external table, but it may be necessary to add HDFS data separately: for example, when the data in the HDFS file has been modified and the new data must be added to the location files, or when the external table was created separately and OSCH is used only to publish the data from HDFS. The -publish command adds the HDFS data URIs to the external table's location files; if location files already exist, they are deleted and new ones are created. Add another text file with the same fields; the external table will not read the new file's contents until you publish. The output from the -publish command indicates that the HDFSUSER.CATALOG_EXT external table is altered, new location files are created, and the previous location file is deleted. Run the -publish command as follows.
vi catalog1.txt
6,Item6
7,Item7
[hdpsysuser@nn01 ~]$ hdfs dfs -put /data/mydata/catalog1.txt /catalog
[hdpclient@en01 osch]$ hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -Doracle.hadoop.exttab.printStackTrace=true -conf catalog_hdfs.xml -publish
Oracle SQL Connector for HDFS Release 3.6.0 - Production
Copyright (c) 2011, 2016, Oracle and/or its affiliates. All rights reserved.
[Enter Database Password:]
The publish command succeeded.
ALTER TABLE "HDFSUSER"."CATALOG_EXT"
LOCATION
(
'osch-20170529045207-379-1',
'osch-20170529045207-379-2'
);
The following location files were created.
osch-20170529045207-379-1 contains 1 URI, 52 bytes
52 hdfs://nn01:9000/catalog/catalog.txt
osch-20170529045207-379-2 contains 1 URI, 20 bytes
20 hdfs://nn01:9000/catalog/catalog1.txt
The following location files were deleted.
osch-20170529035900-5230-1 was deleted.
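Once the publish succeeds, a query against the external table should also return the rows from catalog1.txt; a quick check (the exact rows depend on your data):

```sql
-- Rows 6 and 7 come from the newly published catalog1.txt
SELECT * FROM catalog_ext WHERE catalogid IN (6, 7);
```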
16- Find the Location Files/Log Files
The location files and the log files are created in the OS directory of the external table's directory object, which is the /data/ora_ext_tab_dir directory created earlier.
[hdpclient@en01 osch]$ ll /data/ora_ext_tab_dir/
total 28
-rwxrwxrwx. 1 oracle oinstall 11327 May 29 16:53 CATALOG_EXT_7588.log
-rw-r-----. 1 oracle oinstall 2 May 29 16:53 CATALOG_EXT_7592.bad
-rw-r-----. 1 oracle oinstall 705 May 29 16:53 CATALOG_EXT_7592.log
-rw-r--r--. 1 oracle oinstall 595 May 29 16:52 osch-20170529045207-379-1
-rw-r--r--. 1 oracle oinstall 596 May 29 16:52 osch-20170529045207-379-2
17- Enabling Parallel Processing
To enable parallel processing with external tables, you must specify multiple files in the location clause of the external table. The number of files, also known as the degree of parallelism, determines the number of child processes started by the external table during a table read. Ideally, the degree of parallelism is no larger than the number of data files, to avoid idle child processes.
Parallel processing is extremely important when you are working with large volumes of data. When you use external tables, always enable parallel query with this SQL command:
ALTER SESSION ENABLE PARALLEL QUERY;
Before loading the data into an Oracle database from the external files created by Oracle SQL Connector for HDFS, enable parallel DDL:
ALTER SESSION ENABLE PARALLEL DDL;
Hints such as APPEND and PQ_DISTRIBUTE also improve performance when you are inserting data.
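Putting the session settings together, here is a sketch of loading the HDFS data into a regular Oracle table (the target table CATALOG is an assumption, not from the original post):

```sql
ALTER SESSION ENABLE PARALLEL QUERY;
ALTER SESSION ENABLE PARALLEL DDL;

-- Option 1: create and load a regular table in one step (parallel CTAS)
CREATE TABLE catalog PARALLEL AS SELECT * FROM catalog_ext;

-- Option 2: direct-path insert into an existing table
INSERT /*+ APPEND */ INTO catalog SELECT * FROM catalog_ext;
COMMIT;
```

After the load, the data is stored in Oracle segments and no longer depends on HDFS availability.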