
Friday, October 19, 2012

Hadoop Installation for Beginners

Well folks,
Here I will give you a step-by-step procedure to install and configure Hadoop (version 1.1.0) on Linux (a Debian-based distro) as a single-node cluster. This guide is for beginners, and you need to be logged in to your Linux machine as the root user.

Step 1: First you need to download the Hadoop release from the following URL
http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.1.0/hadoop-1.1.0.tar.gz

Open a terminal

# cd <to directory where you downloaded hadoop>
# mv hadoop-1.1.0.tar.gz /usr/local/
# cd /usr/local/
# tar zxvf hadoop-1.1.0.tar.gz

With the above commands, you have moved the Hadoop tarball to /usr/local and extracted it there.

Step 2: Hadoop is a standalone Java-based application, so it requires Java 1.6 (or later) as a dependency, which you need to install yourself if it is not already present.
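A quick way to check whether a suitable Java is already installed:

# java -version

If the command is missing or it reports a version older than 1.6, install a JDK first. On a Debian-based distro the package below is one option (the package name is an example and may differ on your distro):

# apt-get install openjdk-6-jdk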

Step 3: Next you need to add a dedicated user to run Hadoop

# adduser hadoop

It prompts you for a password and a few other details:

             Adding user `hadoop' ...
             Adding new group `hadoop' (1001) ...
             Adding new user `hadoop' (1001) with group `hadoop' ...
             Creating home directory `/home/hadoop' ...
             Copying files from `/etc/skel' ...
             Enter new UNIX password:
             Retype new UNIX password:
             passwd: password updated successfully
             Changing the user information for hadoop
             Enter the new value, or press ENTER for the default
                   Full Name []:
                   Room Number []:
                   Work Phone []:
                   Home Phone []:
                   Other []:
             Is the information correct? [Y/n] Y


Step 4: Change the configuration files

Before we configure, type the following to identify your Java home:

# which java

If, for example, the output is

                /usr/bin/java

then

your JAVA_HOME is /usr
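(On many distros /usr/bin/java is a symlink managed by update-alternatives. JAVA_HOME=/usr still works, since Hadoop only needs $JAVA_HOME/bin/java to exist, but if you prefer to point JAVA_HOME at the actual JDK you can resolve the link:

# readlink -f "$(which java)"

If this prints, say, /usr/lib/jvm/java-6-openjdk-amd64/jre/bin/java, which is an example path and may differ on your machine, the corresponding JAVA_HOME is /usr/lib/jvm/java-6-openjdk-amd64.)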

Now,

# cd /usr/local/hadoop-1.1.0/
# cd conf/
# vi hadoop-env.sh

Find the following line

                  # export JAVA_HOME=/usr/lib/j2sdk1.5-sun

and replace it with

                  export JAVA_HOME=/usr/


Next paste the following content into the file core-site.xml (fs.default.name tells Hadoop where HDFS lives; here the NameNode listens on localhost, port 8020):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
   <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

Next paste the following content into the file hdfs-site.xml (dfs.replication is set to 1 because a single-node cluster has only one DataNode to hold each block):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

</configuration>

Next paste the following content into the file mapred-site.xml (mapred.job.tracker is the address of the JobTracker, again on localhost):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>


<configuration>
    <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
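Optionally, if xmllint is installed (on Debian-based distros it comes with the libxml2-utils package), you can check all three files for paste errors; no output means the XML is well-formed:

# xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml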

Next check the file /etc/hosts; if the following content does not exist as the first line, add it:

127.0.0.1       localhost <your host name>

Where,
          <your host name> is the hostname of your machine.

You can find the hostname with

# hostname
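As a quick sanity check, the following should print the line you just added; if it prints nothing, the hostname is missing from /etc/hosts:

# grep -w "$(hostname)" /etc/hosts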

Step 5: Give the hadoop user ownership of the installation folder

# cd /usr/local/
# chown -R hadoop:hadoop hadoop-1.1.0

Step 6: Format the HDFS filesystem (NameNode)

# cd /usr/local/hadoop-1.1.0/bin
# su hadoop
# ./hadoop namenode -format

It prints information like

12/10/19 12:00:20 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = java.net.UnknownHostException: vignesh: vignesh
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.1.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.1 -r 1394289; compiled by 'hortonfo' on Thu Oct  4 22:06:49 UTC 2012
************************************************************/
12/10/19 12:00:20 INFO util.GSet: VM type       = 64-bit
12/10/19 12:00:20 INFO util.GSet: 2% max memory = 17.77875 MB
12/10/19 12:00:20 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/10/19 12:00:20 INFO util.GSet: recommended=2097152, actual=2097152
12/10/19 12:00:21 INFO namenode.FSNamesystem: fsOwner=hadoop
12/10/19 12:00:21 INFO namenode.FSNamesystem: supergroup=supergroup
12/10/19 12:00:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/10/19 12:00:21 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
12/10/19 12:00:21 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/10/19 12:00:21 INFO namenode.NameNode: Caching file names occuring more than 10 times 
12/10/19 12:00:21 INFO common.Storage: Image file of size 112 saved in 0 seconds.
12/10/19 12:00:21 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
12/10/19 12:00:21 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/tmp/hadoop-hadoop/dfs/name/current/edits
12/10/19 12:00:21 INFO common.Storage: Storage directory /tmp/hadoop-hadoop/dfs/name has been successfully formatted.
12/10/19 12:00:21 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at java.net.UnknownHostException: vignesh: vignesh
************************************************************/

Note: the "java.net.UnknownHostException: vignesh" in the banner above appears because this machine's /etc/hosts entry (Step 4) was missing when the command was run; with the entry in place, the banner simply shows your hostname. Also, only the NameNode needs to be formatted. There is no "datanode -format" command in Hadoop 1.x; the DataNode initializes its own storage automatically the first time it starts.

Step 7: Set up passwordless SSH for the hadoop user

# ssh-keygen -t rsa -P ""

Press Enter when it prompts

Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa): 

and it generates the key as

Created directory '/home/hadoop/.ssh'.
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
f7:e3:1d:e6:2d:7d:23:2f:64:ea:1c:77:99:26:af:e0 hadoop@vignesh
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|                 |
|                 |
|                 |
|        S .      |
|         . . o  o|
|           o*oo* |
|          oo+B*+o|
|          .E..B++|
+-----------------+

# cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
# chmod 600 /home/hadoop/.ssh/authorized_keys
# ssh hadoop@localhost

type "yes" if it prompts as below

The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is 7e:4a:40:b5:57:06:0d:83:34:58:80:80:c3:e7:18:20.
Are you sure you want to continue connecting (yes/no)? 

After this it logs you in as the hadoop user without asking for a password, which means passwordless SSH is configured successfully.

Now type

# exit

Type exit only once, to close the SSH session you just opened. You are then back in your previous shell, still as the hadoop user.

Step 8: Start Hadoop services

# ./start-all.sh

It starts five daemons:

 NameNode
 SecondaryNameNode
 DataNode
 JobTracker
 TaskTracker

You can check whether the daemons are running with

# jps

You should see something like this (the process IDs will differ). If any of the five daemons is missing, something went wrong.

26207 TaskTracker
26427 Jps
25847 DataNode
25986 SecondaryNameNode
26089 JobTracker
25738 NameNode
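If one of the five daemons is missing, its log file usually tells you why. In Hadoop 1.x the logs live under the installation directory and are named hadoop-<user>-<daemon>-<hostname>.log, so for example:

# tail -n 50 /usr/local/hadoop-1.1.0/logs/hadoop-hadoop-namenode-*.log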

Open

       http://localhost:50030

in a browser for the Hadoop Map/Reduce administration page (optional)

Open

       http://localhost:50070

in a browser to browse the HDFS filesystem (optional)

Step 9: Try out HDFS and run a sample MapReduce job

# ./hadoop dfsadmin -report

This command reports the status of your HDFS cluster (capacity, live DataNodes, and so on)

# ./hadoop fs -mkdir test

This command creates a directory "test" in the HDFS filesystem (relative paths like this live under /user/hadoop)

# vi test_input

 In the text editor type

 hi all hello all

 then save and exit the file (or use the one-liner below)
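If you prefer not to open an editor, the same file can be created in one line:

# echo "hi all hello all" > test_input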

# ./hadoop fs -put test_input test/input

This command copies the file (test_input) that we just created into the HDFS filesystem, as test/input (inside the test folder)

# ./hadoop fs -ls test

This command lists all files in the "test" folder of the HDFS filesystem.

# ./hadoop jar ../hadoop-examples-1.1.0.jar wordcount test/input test/output

This command runs a MapReduce program (word count) on your input and writes its output to "test/output" in the HDFS filesystem.
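You can also print the result directly from HDFS; for the input above, wordcount should report each word with its count (all 2, hello 1, hi 1):

# ./hadoop fs -cat test/output/part-r-00000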

Alternatively, you can view the output in a browser at the following URL

http://localhost:50070

Browse the filesystem -> user -> hadoop -> test -> output -> part-r-00000

Step 10: To stop Hadoop (optional)

# ./stop-all.sh

Here ends our step-by-step guide to getting started with Hadoop (for beginners).

1 comment:

  1. How do I create a jar file for my own MapReduce program?
    For example, I have a Java file named Spatial and I need to create the jar file.

    In your example, hadoop-examples.jar is used for running...
    For my program, how can I create the jar, and how should I execute the program?

    Could you please help me?
