Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Thursday, August 10, 2017

Working with Talend for Big Data (TOSBD)


Introduction                                                                               

Talend (Eclipse-based) provides unified development and management tools to integrate and process all of your data with an easy-to-use visual designer. It helps companies become data-driven by making data more accessible, improving its quality and quickly moving it to where it is needed for real-time decision making.
Talend for Big Data is built on top of Talend's data integration solution. It enables users to access, transform, move and synchronize big data by leveraging the Apache Hadoop Big Data Platform, and it makes the Hadoop platform much easier to use.

TOS Big Data Functional Architecture                                     

The following chart illustrates the main architectural functional blocks.



Using TOS, you design and launch Big Data Jobs that leverage a Hadoop cluster (independent of the Talend system) to handle large data sets. Once launched, these Jobs are sent to, deployed on and executed on this Hadoop cluster. The Oozie workflow scheduler system is integrated within the Studio through which you can deploy, schedule, and execute Big Data Jobs on a Hadoop cluster and monitor the execution status and results of these Jobs.


Requirements for TOS Big Data                                               

Memory: 4 GB recommended. Memory usage depends heavily on the size and nature of your Talend projects.
Disk Space: 3-4 GB
OS: MS Windows 64-bit or Ubuntu Linux recommended; Apple OS X is also supported
Software: Oracle Java 8 JRE and a properly installed and configured Hadoop cluster

Access: The Studio machine needs the host names of the Hadoop cluster nodes, mapping entries for the cluster's services in its hosts file, and a configured Hadoop edge node. You can use the steps below to set this up.

a) Download and install JDK 8 from the location below:

http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

b) Download and extract Apache Hadoop, e.g. hadoop-2.7.3.tar.gz, then edit core-site.xml and yarn-site.xml and set all the related configuration. See the post below for details.

Setting up Hadoop Edge/Gateway Node (Hadoop Client)

c) Download and extract the winutils binaries required for the Hadoop edge node from the location below:

https://github.com/steveloughran/winutils

After extracting, go to the "hadoop-2.7.1\bin" folder, copy all the files, and paste them into %HADOOP_HOME%\bin.
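The hosts-file requirement from the Access note can be checked with a few lines of Python. This is only a sketch: the host names and IP addresses below are made up, so substitute the names of your own cluster nodes. On Windows the hosts file normally lives at C:\Windows\System32\drivers\etc\hosts.

```python
def hosts_entries(hosts_text):
    """Map every host name found in hosts-file text to its IP address."""
    mapping = {}
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip
    return mapping


def missing_hosts(hosts_text, required):
    """Return the required host names that have no entry in the hosts text."""
    known = hosts_entries(hosts_text)
    return [h for h in required if h.lower() not in known]


# Hypothetical hosts-file content and node names, for illustration only.
sample_hosts = """
# Hadoop cluster nodes
192.168.1.10  nn01.example.local  nn01
192.168.1.11  dn01.example.local  dn01
"""
print(missing_hosts(sample_hosts, ["nn01.example.local", "dn02.example.local"]))
```

Any name printed by the last line still needs a mapping entry before the Studio can reach that node by name.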

Environment Variables: In Windows System Properties, set the related environment variables.

a) JAVA_HOME must be set, e.g. JAVA_HOME=C:\hadoopEdge\jdk8
C:\Users\ibukhary>echo %JAVA_HOME%

b) HADOOP_HOME=C:\hadoopEdge\hadoop-2.7.3

c) Append C:\hadoopEdge\hadoop-2.7.3\bin;C:\hadoopEdge\jdk8\bin to the system PATH variable:
PATH=%PATH%;C:\hadoopEdge\hadoop-2.7.3\bin;C:\hadoopEdge\jdk8\bin
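If you want to double-check these three settings in one go, a small Python sketch like the one below can help. It is not part of Talend or Hadoop. The function name is my own, and it only inspects a dictionary of variables, so you can pass it os.environ on the edge node or a hand-written dictionary as shown here.

```python
import ntpath  # Windows path rules, so the sketch behaves the same on any OS


def check_edge_env(env):
    """Return a list of problems found in a mapping of environment variables."""
    problems = []
    for name in ("JAVA_HOME", "HADOOP_HOME"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    # PATH entries are separated by ';' on Windows.
    path_entries = {
        ntpath.normcase(ntpath.normpath(p))
        for p in env.get("PATH", "").split(";")
        if p
    }
    for name in ("JAVA_HOME", "HADOOP_HOME"):
        home = env.get(name)
        if home:
            bin_dir = ntpath.normcase(ntpath.normpath(ntpath.join(home, "bin")))
            if bin_dir not in path_entries:
                problems.append(f"{name}\\bin is not on PATH")
    return problems


# Values taken from the steps above.
example_env = {
    "JAVA_HOME": r"C:\hadoopEdge\jdk8",
    "HADOOP_HOME": r"C:\hadoopEdge\hadoop-2.7.3",
    "PATH": r"C:\Windows;C:\hadoopEdge\hadoop-2.7.3\bin;C:\hadoopEdge\jdk8\bin",
}
print(check_edge_env(example_env))  # an empty list means nothing is missing
```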

Test Hadoop Connectivity
After installing the Hadoop software, test the connectivity from the command prompt.
C:\Users\ibukhary>hdfs dfs -ls /
Found 11 items
drwxrwxrwx   - yarn   hadoop          0 2017-08-06 12:29 /app-logs
drwxr-xr-x   - hdfs   hdfs            0 2017-07-26 17:26 /apps
drwxr-xr-x   - yarn   hadoop          0 2017-07-26 17:24 /ats
drwxrwxrwx   - hdfs   hdfs            0 2017-08-09 12:00 /data
drwxrwxrwx   - hdfs   hdfs            0 2017-08-09 12:50 /flume
drwxr-xr-x   - hdfs   hdfs            0 2017-07-26 17:24 /hdp
drwxr-xr-x   - mapred hdfs            0 2017-07-26 17:24 /mapred
drwxrwxrwx   - mapred hadoop          0 2017-07-26 17:24 /mr-history
drwxrwxrwx   - spark  hadoop          0 2017-08-10 08:46 /spark2-history
drwxrwxrwx   - hdfs   hdfs            0 2017-08-10 08:27 /tmp
drwxr-xr-x   - hdfs   hdfs            0 2017-08-06 12:29 /user

Try to put a file in HDFS:

C:\hadoopEdge>hdfs dfs -put syslog.1502020878710 /tmp
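If you later want to script this check instead of reading the listing by eye, the output of hdfs dfs -ls is easy to parse. The sketch below is a hypothetical helper of my own, not a Hadoop API, and it assumes the eight-column layout shown in the listing above.

```python
def parse_ls_line(line):
    """Split one entry line of `hdfs dfs -ls` output into named fields."""
    # Columns: permissions, replication, owner, group, size, date, time, path
    parts = line.split(None, 7)
    if len(parts) != 8:
        raise ValueError(f"unexpected listing line: {line!r}")
    perms, _replication, owner, group, size, date, time, path = parts
    return {
        "is_dir": perms.startswith("d"),
        "permissions": perms,
        "owner": owner,
        "group": group,
        "size": int(size),
        "modified": f"{date} {time}",
        "path": path,
    }


# Two lines copied from the listing above, plus the "Found N items" banner.
sample = """Found 11 items
drwxrwxrwx   - yarn   hadoop          0 2017-08-06 12:29 /app-logs
drwxrwxrwx   - hdfs   hdfs            0 2017-08-10 08:27 /tmp"""

entries = [
    parse_ls_line(line)
    for line in sample.splitlines()
    if not line.startswith("Found ")
]
print([e["path"] for e in entries])
```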



Download Talend from the location below and extract the zip:

https://www.talend.com/download/


Memory Configuration: If you want to tune the memory allocation for your JVM, you only need to edit the TOS_BD-win-x86_64.ini file.
The default values are:
-vmargs -Xms40m -Xmx500m -XX:MaxMetaspaceSize=128m
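As a rough illustration of what the edit amounts to, the Python sketch below rewrites the two heap options in a piece of ini text. The 512m and 2048m values are only examples I picked, not Talend recommendations. Choose values that fit the memory of your own machine.

```python
import re


def set_heap(ini_text, xms, xmx):
    """Return ini_text with the -Xms and -Xmx values replaced."""
    ini_text = re.sub(r"-Xms\d+[kKmMgG]?", f"-Xms{xms}", ini_text)
    ini_text = re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{xmx}", ini_text)
    return ini_text


defaults = "-vmargs -Xms40m -Xmx500m -XX:MaxMetaspaceSize=128m"
print(set_heap(defaults, "512m", "2048m"))
# prints: -vmargs -Xms512m -Xmx2048m -XX:MaxMetaspaceSize=128m
```

The pattern only touches the -Xms and -Xmx options, so other settings such as MaxMetaspaceSize are left as they are.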




Launching the Studio                                                                

After extracting, double-click the TOS_BD-win-x86_64.exe executable file to launch your Talend Studio. To log in to the Talend Studio for the first time, do the following:

In the Talend Studio login window, select Create a new project, specify the project name "Talend Demo" and click Finish to create a new local project.



Now you have successfully logged in to the Talend Studio. Next you need to install additional packages required for the Talend Studio to work properly. Talend recommends that you install additional packages, including third-party libraries and database drivers, as soon as you log in to your Talend Studio so that you can fully benefit from the functionalities of the Studio. After installing all the packages, restart your Talend Studio so that certain additional packages take effect.



Install modules downloaded from external websites
Some modules are not available on the Talend website but can be downloaded directly from external websites. Once downloaded, these modules must be placed in specific folders.

For the studio, the downloaded modules must be placed in the following folder:

<StudioPath>/configuration/.m2


Setting up Hadoop connection 

1. In the Repository tree view of your studio, expand Metadata and then right-click Hadoop cluster.
2. Select Create Hadoop cluster from the contextual menu to open the Hadoop cluster connection wizard.
3. Fill in generic information about this connection, such as Name and Description, and click Next to open the Hadoop configuration import wizard, which helps you import a ready-for-use configuration, if any.
4. Select the Enter manually Hadoop services check box if you want to enter the configuration information for the Hadoop connection manually. I have a Hortonworks cluster set up, so I use the Ambari option instead, which retrieves all the configuration. Click the Check services button to verify that the Studio can connect to the NameNode and the ResourceManager services, and install any missing modules/jars. Press Finish.












The new connection is displayed under the Hadoop cluster folder in the Repository tree view. You can then continue to create the child connections to different Hadoop elements such as HDFS or Hive based on this connection.


Setting up HDFS connection

1. Expand the Hadoop cluster node under Metadata in the Repository tree view, right click the Hadoop connection to be used and select Create HDFS from the contextual menu.


2. In the connection wizard that opens up, fill in the generic properties of the connection you need to create, such as Name, Purpose and Description.


3. Click Next when completed. The second step requires you to fill in the HDFS connection data. The User name property is automatically pre-filled with the value inherited from the Hadoop connection you selected in the previous steps. The Row separator and the Field separator properties are using the default values.



4. Select the Set heading row as column names check box to use the heading row of the HDFS file to define the column names of this file. The Header check box is then automatically selected and the Header field is filled with 1. This means that the first row of the file is not treated as data but is used as the column names of the file.

5. Click Check to verify your connection. A message pops up to indicate whether the connection is successful. 

6. Click Finish to validate these changes.


The new HDFS connection is now available under the Hadoop cluster node in the Repository tree view. You can then use it to define and centralize the schemas of the files stored in the connected HDFS system in order to reuse these schemas in a Talend Job.
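To make the Header setting from step 4 concrete, here is a small Python sketch of the same idea: with Header set to 1, the first row supplies the column names and the rest is data. The sample rows, the DNAME and LOC column names and the comma separator are my own assumptions for illustration. Your file and separator may differ.

```python
import csv
import io


def split_header(text, field_sep=",", header=1):
    """Treat the first `header` rows as heading rows; the last one names the columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter=field_sep))
    columns = rows[header - 1] if header > 0 else []
    return columns, rows[header:]


# Hypothetical content of a small dept.csv file.
sample = "DEPTNO,DNAME,LOC\n10,ACCOUNTING,NEW YORK\n20,RESEARCH,DALLAS\n"

columns, body = split_header(sample)
print(columns)  # column names taken from the first row
print(body)     # remaining rows are the data body
```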

Uploading files to HDFS


Uploading a file to HDFS allows the Big Data Jobs to read and process it. Create a Job that writes data to the HDFS system of the Hortonworks Hadoop cluster for which the connection was set up in the Repository as explained above.


1. In the Repository tree view, right click the Job Designs node, and select Create folder from the contextual menu.

2. In the New Folder wizard, name your Job folder TalendDemo and click Finish to create your folder.

3. Right-click the TalendDemo folder and select Create Job from the contextual menu.

4. In the New Job wizard, give a name to the Job you are going to create and provide other useful information if needed. For example, enter write_to_hdfs in the Name field.
In this step of the wizard, Name is the only mandatory field. The information you provide in the Description field will appear as hover text when you move your mouse pointer over the Job in the Repository tree view.

5. Click Finish to create your Job. An empty Job is opened in the Studio.

6. Expand the Hadoop cluster node under Metadata in the Repository tree view.

7. Expand the Hadoop connection you have created and then the HDFS folder under it. 


8. Drop the HDFS connection from the HDFS folder into the workspace of the Job you are creating. The Components window is displayed to show all the components that can directly reuse this HDFS connection in a Job.

9. Select tHDFSPut and click OK to validate your choice.
The Components window closes and a tHDFSPut component is automatically placed in the workspace of the current Job, labelled with the name of the HDFS connection mentioned in the previous step.


10. Double-click tHDFSPut to open its Component view. The connection to the HDFS system to be used has been automatically configured by using the configuration of the HDFS connection you have set up and stored in the Repository. The related parameters in this tab therefore become read-only. These parameters are: Distribution, Version, NameNode URI, Use Datanode Hostname, Use kerberos authentication and Username.


11. In the Local directory field, enter the path, or browse to the folder in which the files to be copied to HDFS are stored.

12. In the HDFS directory field, enter the path, or browse to the target directory in HDFS to store the files. This directory is created on the fly if it does not exist.

13. From the Overwrite file drop-down list, select always to overwrite the files if they already exist in the target directory in HDFS.

14. In the Files table, add one row by clicking the [+] button in order to define the criteria to select the files to be copied.

15. In the Filemask column, enter an asterisk (*) within the double quotation marks to make tHDFSPut select all the files stored in the folder you specified in the Local directory field.

16. Leave the New name column empty, that is to say, keep the default double quotation marks as is, so as to make the name of the files unchanged after being uploaded.

17. Press F6 to run the Job. The Run view is opened automatically. It shows the progress of this Job.




When the Job is done, the files you uploaded can be found in HDFS in the directory you have specified.
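The Filemask setting from step 15 behaves like a wildcard filter over the file names in the local directory. The Python sketch below imitates that idea with shell-style patterns. It is only an analogy, since I am assuming glob-like matching here and have not reproduced tHDFSPut's own matching rules. The file names are examples.

```python
import fnmatch


def select_files(names, filemask="*"):
    """Return the file names that match a shell-style wildcard mask."""
    return [n for n in names if fnmatch.fnmatchcase(n, filemask)]


local_files = ["dept.csv", "emp.csv", "notes.txt"]  # example folder content
print(select_files(local_files, "*"))      # every file is selected
print(select_files(local_files, "*.csv"))  # only the csv files are selected
```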






Setting up file metadata                                                            

In the Repository, setting up the metadata of a file stored in HDFS allows you to directly reuse its schema in a related Big Data component without having to define each related parameter manually. I have already stored the dept.csv and emp.csv files in the HDFS system, so I can retrieve their schema to set up metadata in the Repository.

1. Expand the Hadoop cluster node under Metadata in the Repository tree view.
2. Expand the Hadoop connection you have created and then the HDFS folder under it.
3. Right-click the HDFS connection in this HDFS folder and from the contextual menu, select Retrieve schema. A Schema wizard is displayed, allowing you to browse to files in HDFS.

4. Expand the file tree to show the csv file from which you need to retrieve the schema, and select it.



5. Click Next to display the retrieved schema in the wizard. The schema of the dept data is displayed in the wizard and the first row of the data is automatically used as the column names.



If the first row of the data you are using is not used this way, you need to review how you set the Header configuration when you were creating the HDFS connection as explained in Setting up HDFS connection.



6. Click Finish to validate these changes. You can now see the file metadata under the HDFS connection you are using in the Repository tree view.





Performing data integration tasks for Big Data                         

Tasks Required:

  • Upload data (dept.csv, emp.csv) stored in a local file system to the HDFS file system of the company's Hadoop cluster.
  • Join the dept data to the emp data to produce a new dataset and store this dataset in the HDFS system too.


Creating the Job

A Talend Job allows you to access and use the Talend components to design technical processes to read, transform or write data.

Right-click the TalendDemo folder and select Create Job from the contextual menu. In the New Job wizard, give a name to the Job you are going to create and provide other useful information if needed. For example, enter aggregate_dept_emp in the Name field. Click Finish to create your Job. An empty Job is opened in the Studio. The component Palette is now available in the Studio. You can start to design the Job by leveraging this Palette and the Metadata node in the Repository.

Dropping and linking components

The Pig components to be used are orchestrated in the Job workspace to compose a Pig process for data transformation.

1. In the Job, enter the name of the component to be used and select this component from the list that appears. In this scenario, the components are two tPigLoad components, a tPigMap component and two tPigStoreResult components. 

• The two tPigLoad components are used to load the dept data and the emp data, respectively, from HDFS into the data flow of the current Job.
• The tPigMap component is used to transform the input data. 

• The tPigStoreResult components write the results into given directories in HDFS.

2. Double-click (not too fast) the label of one of the tPigLoad components to make this label editable and then enter dept to change the label of this tPigLoad. Do the same to label the other tPigLoad component emp.

3. Right-click the tPigLoad component that is labelled dept, then from the contextual menu, select Row > Pig combine and click tPigMap to connect this tPigLoad to the tPigMap component. This is the main link through which the dept data is sent to tPigMap.

4. Do the same to connect the emp tPigLoad component to tPigMap using the Row > Pig combine link. This is the Lookup link through which the emp data is sent to tPigMap as lookup data.

5. Do the same to connect the tPigMap component to tPigStoreResult using the Row > Pig combine link, then in the pop-up wizard, name this link out1 and click OK to validate this change.

6. Repeat these operations to connect the tPigMap component to another tPigStoreResult component using the Row > Pig combine link and name it reject.

Now the whole Job looks as follows in the workspace:


Configuring the input data for Pig

Two tPigLoad components are configured to load data from HDFS into the Job. The source files, dept.csv and emp.csv have been uploaded into HDFS already. The metadata of the dept.csv file has been set up in the HDFS folder under the Hadoop cluster node in the Repository.

1. Expand the Hadoop cluster node under the Metadata node in the Repository and then the connection node and its child node to display the dept schema metadata node you have set up under the HDFS folder.

2. Drop this schema metadata node onto the dept tPigLoad component in the workspace of the Job.

3. Double-click the dept tPigLoad component to open its Component view. 
This tPigLoad has automatically reused the HDFS configuration and the dept metadata from the Repository to define the related parameters in its Basic settings view.




4. From the Load function drop-down list, select PigStorage to use the PigStorage function, a built-in function from Pig, to load the dept data as a structured text file. For more information on Pig, please see my other post, Apache PIG - a Short Tutorial.


5. From the Hadoop connection node in the Repository, drop the related HDFS connection node under the HDFS folder onto the tPigLoad component labelled emp in the workspace of the Job.

This applies the configuration of the HDFS connection you have created in the Repository on the HDFS-related settings in the current tPigLoad component.

6. Double-click the emp tPigLoad component to open its Component view.
This tPigLoad has automatically reused the HDFS configuration from the Repository to define the related parameters in its Basic settings view.



7. Click the [...] button next to Edit schema to open the schema editor.


8. Click the [+] button eight times to add eight rows and, in the Column column, rename them to EMPNO, ENAME, JOB, MGR, HIREDATE, SAL, COMM and DEPTNO respectively.



9. Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

10. From the Load function drop-down list, select PigStorage to use the PigStorage function.

11. In the Input file URI field, enter the path where the emp data is stored, e.g. /data/common/emp.csv.

12. Click the Field separator field to open the Edit parameter using repository dialog box and update the field separator if required.

The tPigLoad components are now configured to load the dept data and the emp data to the Job.


Configuring the data transformation for Pig

The tPigMap component is configured to join the dept data and the emp data.
Once the dept data and the emp data are loaded into the Job, you need to configure the tPigMap component to join them to produce the output you expect.


1. Double-click tPigMap to open its Map Editor view. In the Map Editor, the left side is the input side. Each of its two tables represents one of the input flows: the upper one is the main flow and the lower one is the lookup flow. The right side is the output side. Its two tables represent the output flows that you named out1 and reject when you linked tPigMap to tPigStoreResult.

2. On the input side, drop the DEPTNO column from the main flow table to the Expr.key column of the DEPTNO row in the lookup flow table. This defines the join key between the main flow and the lookup flow. Now drop the columns onto both the out1 and reject tables as shown in the image below.




From the Schema editor view in the lower part of the editor, you can see the schemas on both sides have been automatically completed.

3. On the out1 output flow table, click the button "Enable/disable expression filter" (with green + plus symbol) to display the editing field for the filter expression.

4. Enter row2.COMM is not null in the filter field.



This allows tPigMap to output only the emp records in which the COMM field is not empty. A record with an empty COMM field is filtered out.

5. On the reject output flow table, click the "tPigMap Setting" button (tool symbol). In the Catch Output Reject row, select true to output the records with empty COMM fields in the reject flow.


6. Click Apply, then click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.

The transformation is now configured to join each dept record with its emp records and to write the emp records that do not contain any COMM data into a separate data flow.
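For readers who think better in code than in map editors, the logic configured above can be approximated in plain Python. This is a sketch of the idea only, not what Talend generates. I am assuming an inner join on DEPTNO (the join model is a tPigMap setting not covered in this post), and the sample records and the DNAME column are invented for illustration.

```python
def pig_map(dept_rows, emp_rows):
    """Join dept (main flow) to emp (lookup flow) on DEPTNO and split by COMM."""
    # Index the lookup flow by its join key.
    emp_by_dept = {}
    for emp in emp_rows:
        emp_by_dept.setdefault(emp["DEPTNO"], []).append(emp)

    out1, reject = [], []
    for dept in dept_rows:
        for emp in emp_by_dept.get(dept["DEPTNO"], []):
            joined = {**dept, **emp}
            if emp["COMM"] is not None:   # same test as: row2.COMM is not null
                out1.append(joined)
            else:                         # caught by the reject output
                reject.append(joined)
    return out1, reject


# Made-up sample data; an empty COMM field is represented as None.
dept = [{"DEPTNO": 10, "DNAME": "ACCOUNTING"}, {"DEPTNO": 30, "DNAME": "SALES"}]
emp = [
    {"EMPNO": 7499, "ENAME": "ALLEN", "COMM": 300, "DEPTNO": 30},
    {"EMPNO": 7782, "ENAME": "CLARK", "COMM": None, "DEPTNO": 10},
]

out1, reject = pig_map(dept, emp)
print(out1)    # records with a COMM value
print(reject)  # records whose COMM field is empty
```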


Writing the output

Two tPigStoreResult components are configured to write the expected data and the rejected data to different directories in HDFS.

1. Double-click the tPigStoreResult which receives the out1 link. Its Basic settings view is opened in the lower part of the Studio. In the Result file field, enter the directory you need to write the result in, e.g. /data/talend/out.

2. Select the Remove result directory if exists check box. In the Store function list, select PigStorage to write the records in human-readable UTF-8 format. In the Field separator field, enter ; within double quotation marks.




3. Repeat the same operations to configure the tPigStoreResult that receives the reject link, but set the directory, in the Result file field, to /data/talend/reject.

4. Press F6 to run the Job. The Run view is automatically opened in the lower part of the Studio and shows the execution progress of this Job. Once done, you can check, for example in the web console of your HDFS system, that the output has been written in HDFS.
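When you open the result files, each record should appear as one line with its fields separated by the semicolon chosen in step 2. The Python sketch below shows roughly what such a line looks like. It assumes that an empty (null) field is written as an empty string, and the record and column order are invented for illustration.

```python
def to_delimited_line(record, columns, sep=";"):
    """Render one record as a delimited text line, writing None as an empty field."""
    return sep.join("" if record[c] is None else str(record[c]) for c in columns)


# Hypothetical joined record and column order.
record = {"DEPTNO": 10, "ENAME": "CLARK", "COMM": None}
print(to_delimited_line(record, ["DEPTNO", "ENAME", "COMM"]))
# prints: 10;CLARK;
```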





Congrats! Your first Talend Job on Hadoop has run successfully.
