How to Set Up Hadoop in Pseudo-distributed Mode
March 31, 2017
The pseudo-distributed mode is a special case of fully distributed mode in which the (single) host is localhost. Each Hadoop daemon runs in its own Java process, so we need to start the daemons and configure a few settings before everything works. Let’s see how this can be achieved.
Related Topics
Hadoop Setup
If you haven’t set up your Hadoop system yet, you can check out my blog post How to setup Hadoop - Standalone mode for more info.
Configuring SSH
In pseudo-distributed mode, we need SSH to log in to localhost without having to enter a password. First, we install SSH.
$ sudo apt-get install ssh
To enable passwordless login to localhost, generate an SSH key pair with an empty passphrase and append the public key to your authorized keys:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
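Depending on your system, sshd may also require the authorized keys file to be restricted to owner-only permissions before it will accept key-based logins; it does no harm to set that explicitly:
$ chmod 0600 ~/.ssh/authorized_keys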
Now you can test it by logging in to localhost with SSH. It should not ask you for a password this time.
$ ssh localhost
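If the login succeeds, you are now in a new shell on localhost; type exit to return to your original session before continuing.
$ exit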
Settings
Now it’s time to configure a few settings. Go to the Hadoop folder you installed earlier. Mine is /home/aknay/HadoopWorkSpace/hadoop-2.8.0. Now we will modify the core-site.xml file.
$ nano etc/hadoop/core-site.xml
At the end of the file, we remove this empty block
<configuration>
</configuration>
and replace it with the following, then save the file. The fs.defaultFS property sets HDFS running on localhost (port 9000) as the default filesystem.
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Similarly, we will modify the hdfs-site.xml config file.
$ nano etc/hadoop/hdfs-site.xml
Again, replace the empty <configuration> block with the following config. A replication factor of 1 is enough here because we only have a single DataNode.
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
Formatting the HDFS file system
Before HDFS can be used for the first time, the file system must be formatted. This is done by running the following command:
$ hdfs namenode -format
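The command should report that the NameNode storage directory was successfully formatted. If you want a quick sanity check, you can list that directory; assuming the default hadoop.tmp.dir (we have not overridden it in core-site.xml), it is created under /tmp:
$ ls /tmp/hadoop-$(whoami)/dfs/name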
HDFS Daemons
To start the HDFS daemons, type
$ start-dfs.sh
Then go to http://localhost:50070 to see the web interface for the NameNode. Make sure you can access the web interface; otherwise, check your configuration files and verify that the HDFS daemons are running.
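A simple way to verify this is the JDK’s jps tool, which lists running Java processes; after start-dfs.sh you should see entries such as NameNode, DataNode and SecondaryNameNode.
$ jps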
Note: You can stop the HDFS daemons by typing
$ stop-dfs.sh
Create HDFS directories
Here, we create the HDFS directories required to execute MapReduce jobs.
General case:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
For my case:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/aknay
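To confirm the directories were created, you can list them:
$ hdfs dfs -ls /user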
Choose Your Path
- MapReduce locally
- MapReduce on YARN on a single node
If you want to run MapReduce locally, you can jump straight to the Example section. Otherwise, we need to set up a few more configuration settings for YARN.
MapReduce on YARN
Here, we will configure YARN for a single node. The mapred-site.xml file hasn’t been created yet, but Hadoop provides a template for it, so we make a copy of the template and then change the configuration.
$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ nano etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
We do the same for yarn-site.xml. This file already exists; we just need to modify the configuration.
$ nano etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
Now let’s start YARN.
$ start-yarn.sh
Now go to http://localhost:8088 to see the web interface for the ResourceManager. Make sure you can access it.
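As before, you can also run jps; with YARN started you should additionally see processes such as ResourceManager and NodeManager.
$ jps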
Note: You can stop YARN by typing
$ stop-yarn.sh
Example
This example is the same whether we run MapReduce locally or on YARN on a single node. If you are running locally, make sure the HDFS daemons are running. If you are running on YARN, make sure both the HDFS and YARN daemons are running. The first command below copies the etc/hadoop configuration files into HDFS as the input directory; the second runs the grep example from the examples jar, which counts matches of the regular expression 'dfs[a-z.]+' and writes the results to the output directory.
$ hdfs dfs -put etc/hadoop input
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar grep input output 'dfs[a-z.]+'
You can check the output by
$ hdfs dfs -cat output/*
Your output might look something like this.
6 dfs.audit.logger
4 dfs.class
3 dfs.logger
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfs.log
1 dfs.file
1 dfs.servers
1 dfsadmin
1 dfsmetrics.log
1 dfs.replication
You can also list the output directory.
$ hdfs dfs -ls output
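If you prefer to inspect the results with local tools, you can also copy the output directory from HDFS to the local filesystem and read it from there:
$ hdfs dfs -get output output
$ cat output/*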
If you are running on YARN, you can also check the job status at http://localhost:8088. The FinalStatus should be shown as SUCCEEDED.
Note: If you want to remove the output folder, or just want to run another round, you can remove it with
$ hdfs dfs -rm -r output
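If you want to try a different example program for the next round, running the examples jar without any arguments should print the list of valid program names (wordcount, pi and so on):
$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.0.jar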