
Monday, February 20, 2017

Setting up multi node Apache Hadoop Cluster 2.7.3 on RHEL 7.3


A multi node Hadoop cluster is built on a master-slave architecture and spreads big data processing across multiple machines. For this multi node Hadoop cluster I am going to use three machines: one as the MasterNode and the remaining two as SlaveNodes.

Remember that the number of machines in the cluster correlates with the size of the data and the data processing technique: the heavier the dataset (and the heavier the processing technique), the more computers/nodes the Hadoop cluster needs. Let's start now.

Environment

Prepare your environment at the OS level first; mine is listed below.

RHV, Red Hat Enterprise Linux Server 7.3
Master Node:  hostname → hdpmaster    IP → 192.168.44.170    root pwd → hadoop123
Slave1:       hostname → hdpslave1    IP → 192.168.44.171    root pwd → hadoop123
Slave2:       hostname → hdpslave2    IP → 192.168.44.172    root pwd → hadoop123

Software Locations:
Java installation → /usr/java
Hadoop installation → /usr/hadoopsw

Hadoop File System Storage Locations:
DataNode → /opt/volume/hadoop_tmp/hdfs/datanode
NameNode → /opt/volume/hadoop_tmp/hdfs/namenode

Hadoop System related objects:
Hadoop system user group → hadoop_grp
Dedicated Hadoop system user → hdpsysuser


Prerequisites 

1- Installation and Configuration of Single node Hadoop 


Install and configure single node Hadoop, which will be our MasterNode (NameNode) for the multi node cluster. For instructions on how to set up single node Hadoop, see the previous post Setting up single node Apache Hadoop Cluster 2.7.3 on RHEL 7; completing it is a mandatory prerequisite for this post.


2- Basic installation and configuration

 a) Configure the hostname identification for all of your Hadoop nodes (masters and slaves) in /etc/hosts.

[root@hdpmaster ~]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.44.170  hdpmaster
192.168.44.171  hdpslave1
192.168.44.172  hdpslave2
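
The same /etc/hosts entries need to be present on every node in the cluster. A quick way to push the file to the slaves (assuming root SSH access to the slaves; you will be prompted for the root password) is:

[root@hdpmaster ~]# scp /etc/hosts root@hdpslave1:/etc/hosts
[root@hdpmaster ~]# scp /etc/hosts root@hdpslave2:/etc/hosts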

b) Create the same hadoop group (hadoop_grp) and hadoop system user (hdpsysuser) on all the nodes as described in the previous post.

c) Add the hdpsysuser user to the sudoers list as described in the previous post.

d) Add the SSH keys to the slave nodes (hdpslave1, hdpslave2) from hdpmaster (MasterNode)
To manage (start/stop) all nodes of the master-slave architecture, hdpsysuser (the hadoop user of the MasterNode, i.e. hdpmaster) needs to be able to log in to all slave as well as all master nodes, which is made possible by setting up passwordless SSH login. (If you do not set this up, you will have to provide a password every time daemons are started or stopped on the slave nodes from the master node.)
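
The ssh-copy-id commands below assume an SSH key pair already exists for hdpsysuser on hdpmaster (it may already have been generated during the single node setup). If not, it can be created first, for example with an empty passphrase:

[hdpsysuser@hdpmaster ~]$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa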

[hdpsysuser@hdpmaster ~]$ ssh-copy-id hdpsysuser@hdpslave1

The authenticity of host 'hdpslave1 (192.168.44.171)' can't be established.
ECDSA key fingerprint is de:3f:51:37:66:c1:69:7a:ce:b4:5e:34:04:bb:6c:93.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hdpsysuser@hdpslave1's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hdpsysuser@hdpslave1'"
and check to make sure that only the key(s) you wanted were added.

[hdpsysuser@hdpmaster ~]$ ssh-copy-id hdpsysuser@hdpslave2

The authenticity of host 'hdpslave2 (192.168.44.172)' can't be established.
ECDSA key fingerprint is a1:58:46:fd:82:0b:7f:26:c3:ac:eb:5d:7b:c8:ff:20.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hdpsysuser@hdpslave2's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hdpsysuser@hdpslave2'"
and check to make sure that only the key(s) you wanted were added.

[hdpsysuser@hdpmaster ~]$ ssh hdpslave1
Last login: Tue Feb  7 15:04:39 2017 from 10.30.2.119

[hdpsysuser@hdpmaster ~]$ ssh hdpslave2
Last login: Tue Feb  7 15:04:56 2017 from 10.30.2.119

e) Stop/Disable Firewall

[hdpsysuser@hdpmaster ~]$ sudo service firewalld stop

[hdpsysuser@hdpmaster ~]$ sudo systemctl disable firewalld

Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.
Removed symlink /etc/systemd/system/basic.target.wants/firewalld.service.
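
The firewall also has to be stopped and disabled on the slave nodes (or the required Hadoop ports opened there); assuming you want the same firewall-free setup everywhere, run the same commands on each slave:

[hdpsysuser@hdpslave1 ~]$ sudo systemctl stop firewalld
[hdpsysuser@hdpslave1 ~]$ sudo systemctl disable firewalld

[hdpsysuser@hdpslave2 ~]$ sudo systemctl stop firewalld
[hdpsysuser@hdpslave2 ~]$ sudo systemctl disable firewalld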

f) Reboot all nodes

To make the above changes take effect, we need to reboot all of the nodes:
sudo reboot


3- Hadoop configuration steps

Applying common Hadoop configuration on the master node (hdpmaster):

Hadoop has a master-slave architecture for both storage and processing, so we apply the configuration changes that are common to both master and slave nodes in the Hadoop config files before we distribute these Hadoop files to the rest of the machines/nodes. These changes will consequently also be reflected in the single node Hadoop setup we built earlier.



Update core-site.xml



Update this file by changing the hostname from localhost to hdpmaster if you used localhost; in our configuration from the previous post we already have the master node name instead of localhost.
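
For reference, this is a minimal sketch of what the relevant core-site.xml property typically looks like after this change; the port (9000 here) should match whatever was used in your single node setup:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://hdpmaster:9000</value>
        </property>
</configuration>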



Update hdfs-site.xml

Update this file by changing the replication factor (dfs.replication) from 1 to 3.

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
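
The NameNode and DataNode storage paths listed in the Environment section also map to hdfs-site.xml properties. They were presumably already set during the single node post; shown here only as a sketch using this post's paths:

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///opt/volume/hadoop_tmp/hdfs/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///opt/volume/hadoop_tmp/hdfs/datanode</value>
</property>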

Update mapred-site.xml

Update this file by adding the following property:

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
</configuration>

  
Update yarn-site.xml

Update this file so that it contains the following properties:

<configuration>
<!-- Site specific YARN configuration properties -->
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
        </property>
        
</configuration>
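
In many multi node setups one more property is also added here so that the NodeManagers on the slave nodes know where to find the ResourceManager. It is not shown in the configuration above, so treat the following as an optional, assumed addition to verify against your own environment:

        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>hdpmaster</value>
        </property>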

Update masters

Update the list of master nodes of the Hadoop cluster; if the file does not exist, create it.

[hdpsysuser@hdpmaster hadoop]$ vi /usr/hadoopsw/hadoop-2.7.3/etc/hadoop/masters
hdpmaster

Update slaves

[hdpsysuser@hdpmaster hadoop]$ vi slaves
hdpslave1
hdpslave2
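
As a quick sanity check from the same directory, the masters file should now list hdpmaster and the slaves file should list hdpslave1 and hdpslave2:

[hdpsysuser@hdpmaster hadoop]$ cat masters
[hdpsysuser@hdpmaster hadoop]$ cat slaves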


4- Install java on all slave nodes

Install Java (if not already installed) on all slave nodes as described in the previous post.


5- Copying/Sharing/Distributing Hadoop files

rsync is a utility for efficiently transferring and synchronizing files across computer systems. The rsync algorithm is a type of delta encoding and is used to minimize network usage; it is typically used for synchronizing files and directories between two different systems. For example, if the command rsync local-file user@remote-host:remote-file is run, rsync uses SSH to connect as user to remote-host. Once connected, it invokes the remote host's rsync, and the two programs then determine which parts of the file need to be transferred over the connection. We will copy the Hadoop files to the remaining nodes, i.e. the slaves; in our case the hdpslave1 and hdpslave2 nodes.

[hdpsysuser@hdpmaster ~]$ echo $HADOOP_HOME
/usr/hadoopsw/hadoop-2.7.3

[hdpsysuser@hdpmaster ~]$ sudo rsync -avxP /usr/hadoopsw/ hdpsysuser@hdpslave1:/usr/hadoopsw/
hdpsysuser@hdpslave1's password:

sending incremental file list
./
.ICEauthority
            306 100%    0.00kB/s    0:00:00 (xfer#1, to-check=1050/1052)
.Xauthority
            165 100%  161.13kB/s    0:00:00 (xfer#2, to-check=1049/1052)
.bash_history
           2365 100%    2.26MB/s    0:00:00 (xfer#3, to-check=1048/1052)
.bash_profile
            572 100%  558.59kB/s    0:00:00 (xfer#4, to-check=1046/1052)
.esd_auth
             16 100%   15.62kB/s    0:00:00 (xfer#5, to-check=1044/1052)
.viminfo
           6167 100%    5.88MB/s    0:00:00 (xfer#6, to-check=1043/1052)
hadoop-2.7.3.tar.gz
      214092195 100%  100.18MB/s    0:00:02 (xfer#7, to-check=1042/1052)
.cache/
.cache/event-sound-cache.tdb.hdpmaster.x86_64-redhat-linux-gnu
          12288 100%  300.00kB/s    0:00:00 (xfer#8, to-check=1024/1052)
.cache/abrt/
.cache/abrt/applet_dirlist
              0 100%    0.00kB/s    0:00:00 (xfer#9, to-check=1015/1052)
.cache/abrt/lastnotification
             11 100%    0.27kB/s    0:00:00 (xfer#10, to-check=1014/1052)
.cache/evolution/
.cache/evolution/addressbook/
…… <<output truncated>>
hadoop-2.7.3/share/hadoop/yarn/sources/hadoop-yarn-server-tests-2.7.3-sources.jar
          37150 100%  146.88kB/s    0:00:00 (xfer#6610, to-check=4/7597)
hadoop-2.7.3/share/hadoop/yarn/sources/hadoop-yarn-server-tests-2.7.3-test-sources.jar
          52603 100%  207.14kB/s    0:00:00 (xfer#6611, to-check=3/7597)
hadoop-2.7.3/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.7.3-sources.jar
          45449 100%  177.54kB/s    0:00:00 (xfer#6612, to-check=2/7597)
hadoop-2.7.3/share/hadoop/yarn/sources/hadoop-yarn-server-web-proxy-2.7.3-test-sources.jar
          37988 100%  147.80kB/s    0:00:00 (xfer#6613, to-check=1/7597)
hadoop-2.7.3/share/hadoop/yarn/test/
hadoop-2.7.3/share/hadoop/yarn/test/hadoop-yarn-server-tests-2.7.3-tests.jar
          75411 100%  291.08kB/s    0:00:00 (xfer#6614, to-check=0/7597)

sent 570703746 bytes  received 176188 bytes  17041192.06 bytes/sec
total size is 571575853  speedup is 1.00

Do the same for hdpslave2 also

[hdpsysuser@hdpmaster ~]$ sudo rsync -avxP /usr/hadoopsw/ hdpsysuser@hdpslave2:/usr/hadoopsw/

The above command copies the files stored in the /usr/hadoopsw folder to the slave nodes at the same location, /usr/hadoopsw. So you don't need to download or set up the above configuration again on the remaining nodes. You just need Java and rsync installed on all nodes, and the JAVA_HOME path must match the one set in the $HADOOP_HOME/etc/hadoop/hadoop-env.sh file of your Hadoop distribution, which we already configured in the single node Hadoop setup.
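
A quick way to confirm that value on a slave after the copy is to look at the JAVA_HOME line in the copied hadoop-env.sh (the path printed should be your Java installation from the single node post):

[hdpsysuser@hdpslave1 ~]$ grep JAVA_HOME /usr/hadoopsw/hadoop-2.7.3/etc/hadoop/hadoop-env.sh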

6- Applying Master/Slave node specific configuration

Only for master nodes

This configuration is applied on the Hadoop master nodes (since we have only one master node, it applies only to hdpmaster).

Remove the existing Hadoop data folder (i.e. /opt/volume/hadoop_tmp), which was created during the single node Hadoop setup, as the HDFS data will now be stored on the slave nodes only.

[hdpsysuser@hdpmaster ~]$ sudo rm -rf /opt/volume/hadoop_tmp/

Recreate the same base directory (/opt/volume/hadoop_tmp/hdfs) and create the NameNode directory (/opt/volume/hadoop_tmp/hdfs/namenode) inside it:

[hdpsysuser@hdpmaster ~]$ sudo mkdir -p /opt/volume/hadoop_tmp/hdfs/namenode

Make hdpsysuser the owner of that directory:

[hdpsysuser@hdpmaster ~]$ sudo chown hdpsysuser:hadoop_grp -R /opt/volume/hadoop_tmp/

[hdpsysuser@hdpmaster volume]$ ll
total 0
drwxr-xr-x. 3 hdpsysuser hadoop_grp 18 Feb 7 17:43 hadoop_tmp


Only for slave nodes

Since we have two slave nodes, we will apply the following changes on both hdpslave1 and hdpslave2.

Make the DataNode directory:

[hdpsysuser@hdpslave1 ~]$ sudo mkdir -p /opt/volume/hadoop_tmp/hdfs/datanode

[hdpsysuser@hdpslave2 ~]$ sudo mkdir -p /opt/volume/hadoop_tmp/hdfs/datanode

Make hdpsysuser the owner of that directory:
[hdpsysuser@hdpslave1 ~]$ sudo chown hdpsysuser:hadoop_grp -R /opt/volume/hadoop_tmp/

[hdpsysuser@hdpslave2 ~]$ sudo chown hdpsysuser:hadoop_grp -R /opt/volume/hadoop_tmp/

Only for master nodes
Format the NameNode now.

[hdpsysuser@hdpmaster volume]$ hdfs namenode -format


Starting up Hadoop cluster daemons

Start HDFS daemons 

[hdpsysuser@hdpmaster volume]$ start-dfs.sh

Starting namenodes on [hdpmaster]
hdpmaster: starting namenode, logging to /usr/hadoopsw/hadoop-2.7.3/logs/hadoop-hdpsysuser-namenode-hdpmaster.out
hdpslave1: starting datanode, logging to /usr/hadoopsw/hadoop-2.7.3/logs/hadoop-hdpsysuser-datanode-hdpslave1.localdomain.out
hdpslave2: starting datanode, logging to /usr/hadoopsw/hadoop-2.7.3/logs/hadoop-hdpsysuser-datanode-hdpslave2.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/hadoopsw/hadoop-2.7.3/logs/hadoop-hdpsysuser-secondarynamenode-hdpmaster.out

[hdpsysuser@hdpmaster volume]$ jps

4454 NameNode
4777 Jps
4653 SecondaryNameNode


Start MapReduce daemons:

[hdpsysuser@hdpmaster volume]$ start-yarn.sh

starting yarn daemons
starting resourcemanager, logging to /usr/hadoopsw/hadoop-2.7.3/logs/yarn-hdpsysuser-resourcemanager-hdpmaster.out
hdpslave2: starting nodemanager, logging to /usr/hadoopsw/hadoop-2.7.3/logs/yarn-hdpsysuser-nodemanager-hdpslave2.localdomain.out
hdpslave1: starting nodemanager, logging to /usr/hadoopsw/hadoop-2.7.3/logs/yarn-hdpsysuser-nodemanager-hdpslave1.localdomain.out


[hdpsysuser@hdpmaster volume]$ jps

5106 Jps
4837 ResourceManager
4454 NameNode
4653 SecondaryNameNode

For slave nodes

Verify Hadoop daemons on all slave nodes also:

[hdpsysuser@hdpslave1 ~]$ jps

22305 NodeManager
22434 Jps
22003 DataNode

[hdpsysuser@hdpslave2 ~]$ jps

22243 DataNode
22679 Jps
22542 NodeManager


The NodeManager (NM) is YARN's per-node agent and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing container life-cycle management, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services that may be used by different YARN applications.


Track/Verify using the web UI

For ResourceManager – http://hdpmaster:8088


For NameNode – http://hdpmaster:50070
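
Besides the web UIs, the cluster can also be verified from the command line on the master; for example, the standard commands below report the live DataNodes and the registered NodeManagers:

[hdpsysuser@hdpmaster ~]$ hdfs dfsadmin -report
[hdpsysuser@hdpmaster ~]$ yarn node -list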



Congratulations! You have successfully set up an Apache Hadoop multi node cluster on Red Hat Enterprise Linux.