

Thursday, February 23, 2017

Hadoop Administration: Accessing HDFS (File system & Shell Commands)

You can access HDFS in many different ways. HDFS provides a native Java application programming interface (API) and a native C-language wrapper for the Java API. In addition, you can use a web browser to browse HDFS files. I'll be using the CLI only in this post.

FileSystem (FS) shell
A command-line interface similar to common Linux® and UNIX® shells (bash, csh, etc.) that allows interaction with HDFS data.
DFSAdmin
A command set that you can use to administer an HDFS cluster.
fsck
A subcommand of the Hadoop command/application. You can use the fsck command to check for inconsistencies with files, such as missing blocks, but you cannot use it to correct these inconsistencies.
Name nodes and data nodes
These have built-in web servers that let administrators check the current status of a cluster.
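
For example, the NameNode web UI is reachable at http://hdpmaster:50070 in Hadoop 2.x (assuming the default HTTP port), and the same web server exposes the WebHDFS REST API, so you can also browse a directory from the command line with curl (assuming dfs.webhdfs.enabled is true, its default):

curl -s "http://hdpmaster:50070/webhdfs/v1/userdata?op=LISTSTATUS"  ## returns directory listing as JSON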

Shell Commands

Hadoop includes various shell-like commands that directly interact with HDFS and the other file systems that Hadoop supports. The command bin/hdfs dfs -help lists the commands supported by the Hadoop shell, and bin/hdfs dfs -help command-name displays more detailed help for a command. These commands support most of the normal file system operations, like copying files and changing file permissions, and also a few HDFS-specific operations, such as changing the replication factor of files.
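
For example, to list all supported commands and then drill into one of them:

hdfs dfs -help            ## lists every command the FS shell supports
hdfs dfs -help mkdir      ## detailed help for a single command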

FS relates to a generic file system and can point to any file system (local, HDFS, etc.), whereas DFS is specific to HDFS. So a 'hadoop fs' operation can run from/to the local file system or HDFS, while an 'hdfs dfs' operation always relates to HDFS.
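
A quick illustration of the distinction (the paths here are just examples):

hadoop fs -ls file:///tmp     ## generic FS shell pointed at the local file system
hadoop fs -ls /               ## same shell against the default (HDFS) file system
hdfs dfs -ls /                ## HDFS-specific form, always meant for HDFS paths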

The FileSystem (FS) shell is invoked by bin/hadoop fs. All the FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional; if not specified, the default scheme from the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration points to hdfs://namenodehost). Most of the commands in the FS shell behave like the corresponding Unix commands. Error information is sent to stderr and output is sent to stdout.
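
For example, assuming fs.defaultFS is set to hdfs://hdpmaster:9000 in core-site.xml, the following commands are equivalent:

hadoop fs -ls hdfs://hdpmaster:9000/userdata
hadoop fs -ls /userdata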

Please review the HDFS shell commands below to work with the local file system and HDFS.

Read (cat) from local file system
[hdpsysuser@hdpslave1 ~]$ hadoop fs -cat file:///etc/hosts
hdpmaster hdpslave1 hdpslave2

List from local file system
[hdpsysuser@hdpslave1 ~]$ hadoop fs -ls file:///usr/hadoopsw/
Found 26 items
-rw-------   1 hdpsysuser hdpsysuser        306 2017-02-07 13:21 file:///usr/hadoopsw/.ICEauthority
-rw-------   1 hdpsysuser hdpsysuser        397 2017-02-08 14:33 file:///usr/hadoopsw/.Xauthority
-rw-------   1 hdpsysuser hdpsysuser       4029 2017-02-08 14:58 file:///usr/hadoopsw/.bash_history
-rw-r--r--   1 hdpsysuser hdpsysuser         18 2016-07-12 18:17 file:///usr/hadoopsw/.bash_logout
-rw-r--r--   1 hdpsysuser hdpsysuser        572 2017-02-06 18:12 file:///usr/hadoopsw/.bash_profile
-rw-r--r--   1 hdpsysuser hdpsysuser        231 2016-07-12 18:17 file:///usr/hadoopsw/.bashrc
drwxrwxr-x   - hdpsysuser hdpsysuser       4096 2017-02-08 11:55 file:///usr/hadoopsw/.cache

Create folder in HDFS
[hdpsysuser@hdpslave1 ~]$ hadoop fs -mkdir -p hdfs://hdpmaster:9000/userdata/bukhari
[hdpsysuser@hdpslave1 ~]$ hadoop dfs -mkdir -p hdfs://hdpmaster:9000/userdata/zeeshan

DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -mkdir -p hdfs://hdpmaster:9000/userdata/Zeeshan

List from HDFS

[hdpsysuser@hdpslave1 ~]$ hadoop fs -ls hdfs://hdpmaster:9000/

Found 1 items
drwxr-xr-x - hdpsysuser supergroup 0 2017-02-08 16:14 hdfs://hdpmaster:9000/userdata

[hdpsysuser@hdpslave1 ~]$ hadoop fs -ls hdfs://hdpmaster:9000/userdata
Found 2 items
drwxr-xr-x - hdpsysuser supergroup 0 2017-02-08 16:13 hdfs://hdpmaster:9000/userdata/bukhari
drwxr-xr-x - hdpsysuser supergroup 0 2017-02-08 16:14 hdfs://hdpmaster:9000/userdata/Zeeshan

The above examples make it clear how the 'hadoop' and 'hdfs' commands are used; now let's check the other options available.

Put (file from local file system to the destination file system)
First create a file locally, then put it into HDFS.

[hdpsysuser@hdpslave1 ~]$ vi /tmp/mydata.txt
[hdpsysuser@hdpslave1 ~]$ cat /tmp/mydata.txt ##reads from local file system

Name: Inam Ullah Bukhari
Location: Riyadh, Saudi Arabia

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -put /tmp/mydata.txt hdfs://hdpmaster:9000/userdata/bukhari
OR
[hdpsysuser@hdpslave1 ~]$ hdfs dfs -put /tmp/mydata.txt /userdata/bukhari

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -cat hdfs://hdpmaster:9000/userdata/bukhari/mydata.txt
Name: Inam Ullah Bukhari
Location: Riyadh, Saudi Arabia

checksum - Returns the checksum information of a file.

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -checksum hdfs://hdpmaster:9000/userdata/bukhari/mydata.txt

hdfs://hdpmaster:9000/userdata/bukhari/mydata.txt MD5-of-0MD5-of-512CRC32C 0000020000000000000000008a94c02ab16c6a5dec8f0ce9cca5beba

Count the number of directories, files and bytes under the paths (output columns: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME)

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -count hdfs://hdpmaster:9000/userdata/bukhari

1 2 93 hdfs://hdpmaster:9000/userdata/bukhari

Displays free space

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -df hdfs://hdpmaster:9000/userdata/bukhari 

Filesystem Size Used Available Use%
hdfs://hdpmaster:9000 96841113600 41190 88159330304 0%

Displays sizes of files and directories 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -du hdfs://hdpmaster:9000/userdata/bukhari
59 hdfs://hdpmaster:9000/userdata/bukhari/mydata.txt
34 hdfs://hdpmaster:9000/userdata/bukhari/test.txt

Finds all files that match the specified expression 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -find hdfs://hdpmaster:9000/userdata/ -name 'tes*'

Copy files to the local file system 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -get hdfs://hdpmaster:9000/userdata/bukhari/test.txt /tmp/

Displays the Access Control Lists (ACLs) of files and directories

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -getfacl hdfs://hdpmaster:9000/userdata/bukhari/test.txt

# file: hdfs://hdpmaster:9000/userdata/bukhari/test.txt
# owner: hdpsysuser
# group: supergroup
getfacl: The ACL operation has been rejected. Support for ACLs has been disabled by setting dfs.namenode.acls.enabled to false.
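
The error above appears because ACL support is off by default; to use getfacl/setfacl, set dfs.namenode.acls.enabled to true in hdfs-site.xml and restart the NameNode. You can check the value currently in effect with getconf:

hdfs getconf -confKey dfs.namenode.acls.enabled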

Move files

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -mv hdfs://hdpmaster:9000/userdata/bukhari/test.txt hdfs://hdpmaster:9000/userdata/zeeshan/test.txt

Remove files 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -rm hdfs://hdpmaster:9000/userdata/zeeshan/mydata.txt

17/02/08 17:02:13 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.

Deleted hdfs://hdpmaster:9000/userdata/zeeshan/mydata.txt

Delete a directory 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -rmdir hdfs://hdpmaster:9000/userdata/zeeshan

rmdir: `hdfs://hdpmaster:9000/userdata/zeeshan': Directory is not empty

Delete files and try again 

Print statistics about the file/directory at <path> in the specified format
[hdpsysuser@hdpslave1 ~]$ hadoop fs -stat "%F %u:%g %b %y %n" hdfs://hdpmaster:9000/userdata/zeeshan/test.txt

regular file hdpsysuser:supergroup 34 2017-02-08 13:41:08 test.txt

Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner (%u), and modification date (%y, %Y). %y shows the UTC date as “yyyy-MM-dd HH:mm:ss” and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
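
For example, to print the replication factor, block size, and name of the file used above (output not shown, a sketch only):

hadoop fs -stat "%r %o %n" hdfs://hdpmaster:9000/userdata/zeeshan/test.txt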
Displays last kilobyte of the file to stdout 

[hdpsysuser@hdpslave1 ~]$ hadoop fs -tail hdfs://hdpmaster:9000/userdata/zeeshan/test.txt

This is 1st line in test.txt file

The -f option will output appended data as the file grows, as in Unix
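
For example, to follow that file as data is appended (press Ctrl+C to stop):

hdfs dfs -tail -f hdfs://hdpmaster:9000/userdata/zeeshan/test.txt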
Create a file of zero length 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -touchz hdfs://hdpmaster:9000/userdata/bukhari/file1.txt

Download file from HDFS to local file system 

[hdpclient@hadoopedge1 ~]$ hdfs dfs -get /hadoopedge1_data/test.txt /tmp/

File count in an HDFS directory

[hdfs@te1-hdp-rp-en01 ~]$ hdfs dfs -count  /flume/twitter
           1        11172          293202863 /flume/twitter

[hdfs@te1-hdp-rp-en01 ~]$ hadoop fs -count /flume/twitter
           1        11174          293299721 /flume/twitter

[hdfs@te1-hdp-rp-en01 ~]$ hadoop fs -count -q /flume/twitter
        none             inf            none             inf            1        11175          293332991 /flume/twitter

for i in `hdfs dfs -ls -R <DIRECTORY_PATH> | awk '{print $8}'`; do echo $i ; hdfs dfs -cat $i | wc -l; done

[hdfs@te1-hdp-rp-en01 ~]$ for i in `hdfs dfs -ls -R /flume/twitter | awk '{print $8}'`; do echo $i ; hdfs dfs -cat $i | wc -l; done 


It will recursively list the files in <DIRECTORY_PATH> and then print the number of lines in each file.
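
Note that -cat prints raw bytes, so if the files are compressed or stored as sequence files (as Flume sinks often write them), the counts above may be meaningless. Substituting -text, which decompresses supported formats before writing to stdout, gives readable counts:

for i in `hdfs dfs -ls -R /flume/twitter | awk '{print $8}'`; do echo $i ; hdfs dfs -text $i | wc -l; done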

Check the locations of file blocks

hdfs fsck / -files -blocks -locations

Check the locations of file blocks containing rack information

hdfs fsck / -files -blocks -racks
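
fsck also accepts a specific file or directory instead of the root, which is much faster on a large cluster; for example, to inspect only the file uploaded earlier:

hdfs fsck /userdata/bukhari/mydata.txt -files -blocks -locations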

Delete corrupted files

hdfs fsck / -delete

Move corrupted files to /lost+found

hdfs fsck / -move

List all the active TaskTrackers

mapred job -list-active-trackers

List all the running jobs

mapred job -list

List all the submitted jobs since the start of the cluster

mapred job -list all

Check the status of the default queue

mapred queue -list

Check the status of a queue ACL

hadoop queue -showacls

Show all the jobs in the default queue

hadoop queue -info default -showJobs

Check the status of a job

hadoop job -status job_201302152353_0001

Set the job job_201302152353_0003 to high priority

hadoop job -set-priority job_201302152353_0003 HIGH

Empty the trash

hdfs dfs -expunge
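
Note that with trash enabled, -rm only moves a file into the user's .Trash directory, and -expunge removes trash checkpoints older than the configured retention interval. To bypass the trash entirely when deleting, add -skipTrash (a sketch using the empty file created earlier):

hdfs dfs -rm -skipTrash /userdata/bukhari/file1.txt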

Back up block locations of the data on HDFS

hdfs fsck / -files -blocks -locations > dfs.block.locations.fsck.backup

Save the list of all files on the HDFS filesystem

hdfs dfs -ls -R / > dfs.namespace.lsr.backup
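
For a lower-level backup of the namespace, the current fsimage can also be downloaded from the NameNode with dfsadmin (requires HDFS superuser privileges; the destination directory here is just an example):

hdfs dfsadmin -fetchImage /tmp/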

Dump Hadoop config 

hdfs org.apache.hadoop.conf.Configuration

Return the help for an individual command 

[hdpsysuser@hdpslave1 ~]$ hdfs dfs -usage mkdir

Usage: hadoop fs [generic options] -mkdir [-p] <path> ...
