The top Hadoop DFS commands you should master to work with Big Data

Hadoop is an open-source tool used by big data enthusiasts to manage and process very large amounts of data efficiently. And big really does mean big: not the 1 TB of data on your hard drive, but data sets that in most cases run into petabytes and exabytes, or even more. With the help of Hadoop, such huge volumes of data can be managed efficiently for mining, and the results can later be used for targeted advertising, social engineering, marketing, and other similar purposes.

While working with big data, disk space shouldn’t be a constraint, and for that reason commodity hardware is used to store and process data in large amounts without worrying about the initial cost of the setup. Before diving into the Hadoop commands, let’s first go over the basics of how Hadoop works, to help you understand the commands better and make working with Hadoop a piece of cake.

Hadoop works on a master-slave architecture, where the master assigns jobs to the various slaves connected to it. In the case of Hadoop, the master is called the NameNode, while the connected slaves are called DataNodes. The NameNode and the DataNodes communicate with each other over SSH, and as a user of Hadoop you just need to execute the commands; Hadoop will handle the rest. You will not have to worry about the connection between the NameNode and the DataNodes unless you are a Hadoop administrator.

Hadoop has its own file system, the Hadoop Distributed File System (HDFS). Each of the DataNodes and the NameNode has its own local file system, and HDFS sits on top of them, tying the local file systems of the connected nodes together. Apart from the NameNode, there is also a Secondary NameNode, which periodically checkpoints the NameNode’s metadata; despite the name, it is not a hot standby that takes over when the NameNode goes down. That is not important in this context, as we will only be discussing the Hadoop commands. All the commands I will be giving here are file system commands, which you will need if you are working with Hadoop. There are also a number of Hadoop admin commands, which can be helpful if you are a Hadoop administrator. I will write a separate article on those commands; if you want it, let me know in the comments below.

As Hadoop is written in Java and runs on Linux, the Hadoop file system commands are quite similar to their Linux counterparts, and most common Linux file commands have HDFS equivalents. Let’s now move on to the first command on our Hadoop HDFS commands list.

 

List of basic Linux type commands for Hadoop DFS


start-all.sh

Starts the Hadoop daemons: the NameNode and the connected DataNodes.
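On recent Hadoop releases, start-all.sh is marked as deprecated, and the HDFS and YARN daemons are usually started separately instead (both scripts live in Hadoop’s sbin directory):

  • Example: start-dfs.sh (starts the NameNode, the DataNodes, and the Secondary NameNode)

  • Example: start-yarn.sh (starts the YARN ResourceManager and the NodeManagers)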

 

stop-all.sh

Stops the Hadoop daemons: the NameNode and the connected DataNodes.

A number of commands are almost the same in Linux and Hadoop, but with a little twist. Let’s go through them.

 

hdfs dfs -ls

Lists all the files and directories in the HDFS home directory. It is similar to the native ‘ls’ command on Linux, which lists all the files and directories in the present working directory.
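  • Example: hdfs dfs -ls

Lists the contents of your HDFS home directory (typically /user/&lt;username&gt;), showing the permissions, replication factor, owner, group, size, and modification time of each entry.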

 

hdfs dfs -ls <HDFS URL>

Lists all the files and directories at the HDFS location given by the URL.

  • Example: hdfs dfs -ls rough/big/data

Lists all the files and directories within the path ‘rough/big/data’ on HDFS

 

hdfs dfs -put <local file URL> <HDFS URL>

Puts a file from the given local file system path into HDFS at the given location. It is like copying a file and pasting it.

  • Example: hdfs dfs -put abc.txt rough/big/data

Puts the file with the name ‘abc.txt’, from the present working directory to the path ‘rough/big/data’ on HDFS.

 

hdfs dfs -get <HDFS URL> <Local file system URL>

Gets a file from any location within HDFS to the desired location in the local file system. It is similar to copying and pasting, but the source is in HDFS.

  • Example: hdfs dfs -get rough/big/data/file.txt local/client

Gets the file named ‘file.txt’ from the HDFS path ‘rough/big/data’ into the directory ‘local/client’ on the local file system.

 

hdfs dfs -copyFromLocal <local file URL> <URL on HDFS> / hdfs dfs -copyFromLocal -f <local file URL> <URL on HDFS>

Copies a file from the local file system to the given URL on HDFS. With -f, an existing file in the destination directory will be overwritten. It works much like the ‘put‘ command discussed earlier.

  • Example: hdfs dfs -copyFromLocal -f abc.txt rough/big/data

Copies the file named ‘abc.txt’ from the present working directory to the HDFS path ‘rough/big/data’, even if a file with the same name already exists there.

 

hdfs dfs -moveFromLocal <local file URL> <URL on HDFS>

It is similar to the previous command, with the only difference that the source file will no longer be present after the operation. It is like cut and paste in Windows or any other GUI.

  • Example: hdfs dfs -moveFromLocal abc.txt rough/big/data

Moves the file named ‘abc.txt’ from the present working directory to the HDFS path ‘rough/big/data’. The source file ‘abc.txt’ will be deleted after the command executes.

 

hdfs dfs -copyToLocal <HDFS file URL> <local directory>

Copies a file from the given HDFS URL to the given local directory. The local URL should always be a directory in this case.

  • Example: hdfs dfs -copyToLocal rough/big/data/abc.txt training/clients

Copies the file named ‘abc.txt’ from the HDFS path ‘rough/big/data’ to the local directory ‘clients’, within the directory ‘training’.

 

hdfs dfs -moveToLocal <HDFS file URL> <local directory>

Moves a file from the given HDFS URL to the given local directory. Just like the previous command, the local URL should always be a directory. As with cutting and pasting, the source file in HDFS is deleted. Note that in many Hadoop releases this option is not actually implemented, and the command will simply report that.

  • Example: hdfs dfs -moveToLocal rough/big/data/abc.txt training/clients

It will move the file named ‘abc.txt’ from the HDFS path ‘rough/big/data’ to the local directory ‘clients’, within the directory ‘training’. After execution, the file at the HDFS URL will be deleted automatically.

 

hdfs dfs -cp <HDFS source URL> <HDFS destination URL> / hdfs dfs -cp -f <HDFS source URL> <HDFS destination URL>

Copies a file from any HDFS URL to a different destination within HDFS. With -f, an existing file in the destination directory will be overwritten.

  • Example: hdfs dfs -cp rough/big/data/abc.txt rough/big

Copies the file named ‘abc.txt’ from the directory ‘rough/big/data’ on HDFS to the destination directory ‘rough/big’.

 

hdfs dfs -mv <HDFS source URL> <HDFS destination URL>

Moves a file from any HDFS URL to a different destination within HDFS. It works like cut and paste, but is limited to HDFS paths. As the file is moved, the source file is deleted after the operation.

  • Example: hdfs dfs -mv rough/big/data/abc.txt rough/big

Moves the file named ‘abc.txt’ from the directory ‘rough/big/data’ on HDFS to the destination directory ‘rough/big’, deleting ‘abc.txt’ from the source directory.

 

hdfs dfs -cat <URL/filename>

Shows the contents of the file stored at the given URL within HDFS.

  • Example: hdfs dfs -cat rough/big/data/abc.txt

Shows the content of the file ‘abc.txt’, within the directory ‘rough/big/data’, on HDFS.

 

hdfs dfs -chmod <mode> <HDFS URL/filename> / hdfs dfs -chmod -R <mode> <HDFS URL>

Changes the permission mode of the file at the given URL within HDFS. With -R, the permissions of everything under the given URL are changed recursively.

  • Example: hdfs dfs -chmod 777 rough/big/data/abc.txt

Sets the permissions of the file ‘abc.txt’, within the directory ‘rough/big/data’ on HDFS, to read, write, and execute for the owner, the members of the same group, and all other users. The mode is based on the octal number system, where each digit encodes one set of permissions, just as with ‘chmod’ on Linux.
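As a quick reference, each octal digit is the sum of read (4), write (2), and execute (1):

  • 777 = rwxrwxrwx (owner, group, and others can read, write, and execute)

  • 755 = rwxr-xr-x (owner has full access; group and others can read and execute)

  • 644 = rw-r--r-- (owner can read and write; group and others can only read)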

 

hdfs dfs -mkdir <URL/directory name> / hdfs dfs -mkdir -p <URL/directory name>

Makes a directory within HDFS with the given name at the given URL. If only a directory name is given after mkdir, the new directory is created directly in the HDFS home directory. With -p, any missing parent directories are created as well.

  • Example: hdfs dfs -mkdir rough/big/data/Hadoop

It will create a new directory named ‘Hadoop’, within the URL ‘rough/big/data’, on the HDFS.

  • Example: hdfs dfs -mkdir -p learn/big/data/direc

A whole chain of new directories will be created: ‘direc’ inside ‘data’, which is inside ‘big’, which in turn is inside ‘learn’. If some of these directories already exist, only the missing ones are created under the existing parents.

 

hdfs dfs -rm <URL/filename> / hdfs dfs -rm -r <URL>

It is used to remove or delete a file with the given filename from the given HDFS location. With -r, a directory and its contents are deleted recursively.

  • Example: hdfs dfs -rm rough/big/data/del.txt

It will delete the file named ‘del.txt’ from the given HDFS location, i.e. ‘rough/big/data’.
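To delete an entire directory, the -r flag is needed. A typical invocation, reusing the sample path from above, looks like this:

  • Example: hdfs dfs -rm -r rough/big/data

It will delete the directory ‘rough/big/data’ together with everything inside it, so use it with care.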

 

hdfs dfs -touchz <URL/filename>

It is used to create an empty file with the given filename at the given HDFS location. The size of the file will be 0 bytes.

  • Example: hdfs dfs -touchz rough/big/data/empty.txt

It will create a file named ‘empty.txt’ at the HDFS URL ‘rough/big/data’, with a size of 0 bytes. Keep in mind that it is not straightforward to edit a file directly on HDFS. You will have to copy it down to your local system to edit it, or use other Hadoop tools such as MapReduce to write to it; you can’t open it in place with Nano or other console editors. A rough workflow for editing is sketched below.
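A minimal sketch of that edit-locally workflow, reusing the file and paths from the examples above (any local editor will do in place of nano):

  • hdfs dfs -get rough/big/data/empty.txt .

  • nano empty.txt

  • hdfs dfs -copyFromLocal -f empty.txt rough/big/data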

hdfs dfs -test <-e/-z/-d> <URL/filename>

Tests a property of the given path, and reports the result through the command’s exit status.

With -e, 0 is returned if the file at the given URL exists.

With -z, 0 is returned if the file at the given HDFS URL is 0 bytes in size.

With -d, 0 is returned if the given URL points to a directory.
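Since the result comes back through the exit status rather than printed output, you typically check it with the shell’s $? variable right after the command. For example, using one of the sample paths from above:

  • Example: hdfs dfs -test -e rough/big/data/abc.txt

Running ‘echo $?’ immediately afterwards prints 0 if the file exists and a non-zero value otherwise.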

 

hdfs dfs -appendToFile <Local file URL> <HDFS File URL>

It is used to append a local file to an existing file on the HDFS.

  • Example: hdfs dfs -appendToFile abc.txt rough/big/data/def.txt

It will append the contents of the local file ‘abc.txt’ to the file ‘def.txt’ present at the given URL on HDFS.
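If the local source is given as ‘-’, appendToFile reads from standard input instead of a file, which makes it easy to pipe text straight into HDFS. A small example, reusing the same destination path:

  • Example: echo 'one more line' | hdfs dfs -appendToFile - rough/big/data/def.txt

It appends the piped text to the end of ‘def.txt’ on HDFS.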

 

Hadoop FS vs HDFS DFS

So the basic idea is that to execute a Hadoop file system command, you prefix it with ‘hdfs dfs’, which tells the shell you want to work with HDFS. Instead of ‘hdfs dfs’, you can also use ‘hadoop fs’ followed by the same command, and you will get the same results.

For example, ‘hdfs dfs -ls’ and ‘hadoop fs -ls’ will give the same output. In practice, ‘hadoop fs’ is the more general client, since it can also talk to the other file systems Hadoop supports, while ‘hdfs dfs’ works specifically with HDFS; for the everyday HDFS commands above, either prefix is fine, and many people simply prefer ‘hdfs dfs’ because it is quicker to type.
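For instance, both of the following list the same sample path used throughout this article:

  • Example: hdfs dfs -ls rough/big/data

  • Example: hadoop fs -ls rough/big/data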

I have covered the basic commands you will mostly need for everyday tasks with Hadoop. There are a number of other commands, which you will only need occasionally. Remembering the commands above will help you master the basics of Hadoop, and it will be enough for most industrial purposes as well.

I hope this small list of Hadoop file system commands was helpful for you; if you have any suggestions, let me know in the comments below.

 
