Wednesday, 5 February 2014

Installing Apache Hadoop in Ubuntu Linux

This tutorial covers the installation steps for Apache Hadoop 1.0 and 2.0 on Ubuntu Linux. I will also go through the configuration needed to run it in pseudo-distributed mode. 

Pre-Requisites

1. Java 6 or later (Hadoop is written in Java)
2. Linux OS (Windows is supported only as a development platform. Other flavors of UNIX, including Mac OS, can also be used for development)

Installation

1. Download the tarball from: http://hadoop.apache.org/releases.html
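
For example, downloading and unpacking a release from the command line might look like this (the version number and archive URL below are assumptions; substitute whichever release you picked from the downloads page):

wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar xzf hadoop-1.2.1.tar.gz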

2. Extract the downloaded Hadoop distribution, and export/set the following variables in: ~/.bashrc
export HADOOP_INSTALL=/path/to/your/installation
PATH=$PATH:$HADOOP_INSTALL/bin:$HADOOP_INSTALL/sbin

3.  Set JAVA_HOME in file: $HADOOP_INSTALL/conf/hadoop-env.sh 

Note: In Hadoop 2.x, this file is located under:  $HADOOP_INSTALL/etc/hadoop/
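
For example, on Ubuntu with OpenJDK the relevant line in hadoop-env.sh might look like the one below (the JDK path is an assumption; check where your JDK actually lives, e.g. with readlink -f $(which java)):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64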

4.  Check that Hadoop is properly installed by opening up a new shell and executing: 

hadoop version

Installation modes

Hadoop can be run in one of three installation modes: 

1. Standalone (local) mode: This is used for development only. There are no daemons running; everything runs in a single JVM. 

2. Pseudo-Distributed mode: Hadoop daemons run on the local machine, simulating a cluster in a single box.

3. Distributed mode: Hadoop daemons run on a cluster of machines.

Configuration for each installation mode

Standalone mode

By default, Hadoop is configured to run in standalone mode, so no further action is required. 
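
A quick way to verify standalone mode is to run one of the example jobs that ship with Hadoop entirely inside the local JVM. The jar names and locations below are assumptions that vary by release (Hadoop 1.x keeps the examples jar in the installation root, Hadoop 2.x under share/hadoop/mapreduce):

# Hadoop 1.x
hadoop jar $HADOOP_INSTALL/hadoop-examples-*.jar pi 2 10
# Hadoop 2.x
hadoop jar $HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10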

Pseudo-Distributed mode
1. Component Configuration
Each Hadoop component (core, hdfs, mapreduce) is configured using its own XML file (under $HADOOP_INSTALL/conf for Hadoop 1.x, or $HADOOP_INSTALL/etc/hadoop/ for Hadoop 2).
In earlier versions of Hadoop all configuration was done in a single file, hadoop-site.xml; it is now split across three different files.
For pseudo-distributed mode, configure the following:

File core-site.xml:
<configuration>
       <property>
              <name>fs.default.name</name>
              <value>hdfs://localhost:9000</value>
       </property>
</configuration>
     
File hdfs-site.xml:
<configuration>
       <property>
              <name>dfs.replication</name>
              <value>1</value>
       </property>
       <property>
              <name>dfs.name.dir</name>
              <value>/home/yourname/dfs/name</value>
              <final>true</final>
       </property>
       <property>
              <name>dfs.data.dir</name>
              <value>/home/yourname/dfs/data</value>
              <final>true</final>
       </property>
</configuration>

In a pseudo-distributed installation the entire "cluster" runs on a single machine, so we set the block replication factor to 1; otherwise Hadoop would keep issuing warnings because it cannot replicate blocks to other physical datanodes. 

When you format (create) an HDFS filesystem, it creates its files under the paths specified by the dfs.name.dir and dfs.data.dir properties, and all filesystem data is subsequently stored in these directories. By default these properties point to the /tmp directory, so it is strongly advised to change them, as all data would be lost if the machine is rebooted.

Set these properties as "final" to make sure they don't get overridden by other configuration files or command-line options.
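
If you point these properties at directories under your home directory as in the example above, it's worth creating them up front so they exist with the right ownership (the paths below simply mirror the sample values used in hdfs-site.xml):

mkdir -p /home/yourname/dfs/name /home/yourname/dfs/data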

File mapred-site.xml:
<configuration>
       <property>
              <name>mapred.job.tracker</name>
              <value>localhost:9001</value>
       </property>
       <property>
              <name>mapred.system.dir</name>
              <value>/home/yourname/mapred/system</value>
              <final>true</final>
       </property>
</configuration>

The"mapred.job.tracker" property specifies the address of the jobtracker. It's by default set to "local" which means it will use Hadoop's local job runner to run MapReduce jobs inside a single JVM (development mode)
In Hadoop 2.0 (YARN) the equivalent property is called "mapreduce.framework.name" 

If you're using Hadoop 2 (YARN), set "mapreduce.framework.name" to "yarn" in mapred-site.xml:

       <property>
              <name>mapreduce.framework.name</name>
              <value>yarn</value>
       </property>

and add the following properties to yarn-site.xml:

       <property>
              <name>yarn.nodemanager.aux-services</name>
              <value>mapreduce.shuffle</value>
       </property>
       <property>
              <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
              <value>org.apache.hadoop.mapred.ShuffleHandler</value>
       </property>

2. Configuring SSH 

Hadoop uses SSH to start its daemons, so you must have SSH installed on your localhost and be able to SSH into it without typing a password (password-less login).

Open up a terminal and execute: ssh localhost

If you get a "Connection refused" error, it's because you don't have SSH installed, so install it by running:

sudo apt-get install ssh 

Try ssh localhost again and make sure you don't need to type in a password to connect. If you do, execute the following to enable password-less login: 

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

3. Formatting the HDFS filesystem

You need to format a brand-new HDFS installation before you can use it. Formatting HDFS is easy; just type the following:

hadoop namenode -format

Note: In Hadoop 2.x, the preferred equivalent command is: hdfs namenode -format

4. Start/Stop Hadoop Daemons:

> Hadoop 1.0:

Start daemons:

NameNode: start-dfs.sh
Jobtracker: start-mapred.sh

Note: If you get a "JAVA_HOME is not set" error, make sure you have set JAVA_HOME in file: hadoop-env.sh

Check that you have started both the NameNode and Jobtracker daemons by accessing their respective web interfaces:

NameNode: http://localhost:50070
Jobtracker: http://localhost:50030

Make sure that the "State" on the Jobtracker interface is set to "Running"
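
You can also check which daemons are running from the command line with jps, a small tool that ships with the JDK and lists running Java processes. In a Hadoop 1.x pseudo-distributed setup you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker (in Hadoop 2 you'd see ResourceManager and NodeManager instead of the last two):

jps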

Stop daemons: 

Jobtracker: stop-mapred.sh
NameNode: stop-dfs.sh

> Hadoop 2.0 (MapReduce 2):

Start daemons:

NameNode: start-dfs.sh
YARN: start-yarn.sh

Check that you have started both the NameNode and ResourceManager daemons by accessing their respective web interfaces:

NameNode: http://localhost:50070
ResourceManager: http://localhost:8088

Stop daemons: 

NameNode: stop-dfs.sh
YARN: stop-yarn.sh

Distributed mode

For a fully-distributed cluster configuration, follow the steps at: http://hadoop.apache.org/docs/r1.1.1/cluster_setup.html 

---------------------------------------------------
References
http://shop.oreilly.com/product/0636920021773.do (Hadoop: The Definitive Guide, 3rd Edition)
http://hadoop.apache.org/docs/r1.1.1/single_node_setup.html#PseudoDistributed 

Same-origin policy (SOP)

The Same-Origin Policy (SOP) is a web browser security measure that prevents JavaScript running on one site from accessing other sites (unless they're from the same origin). For example, if you have "random_site.com" open in one browser window and "gmail.com" in another, you don't want a script from "random_site.com" to be able to access your Gmail. Two pages are considered to be from the same origin if their protocol, port (if any) and host are the same.

Cross-origin resource sharing (CORS)

Cross-origin resource sharing (CORS) is a mechanism that allows scripts to bypass the Same-Origin Policy, essentially allowing JavaScript code to make requests to external sites. Such "cross-domain" requests would otherwise be forbidden by web browsers. When a browser issues a cross-origin request it includes an "Origin" header; the server can then inspect this header and respond with an "Access-Control-Allow-Origin" header if that origin is acceptable. The browser will then allow the access to go ahead.
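
You can observe this exchange from the command line with curl by sending an Origin header yourself and printing the response headers with -i (the URL and origin below are just placeholders for whatever endpoint you're testing):

curl -i -H "Origin: http://random_site.com" http://localhost:8080/your_path

If the server accepts that origin, the response will include the Access-Control-Allow-Origin header.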

CORS in Java

In your Java webapp, all you need to do is add the "Access-Control-Allow-Origin" CORS header to the Servlet response:

response.setHeader("Access-Control-Allow-Origin", request.getHeader("Origin"));

As of Tomcat 7, CORS support has been added in the form of a filter. In theory, you can add this Tomcat-specific filter to web.xml and it should take care of adding the CORS headers (although that didn't really work for me): 

<filter>
  <filter-name>CorsFilter</filter-name>
  <filter-class>org.apache.catalina.filters.CorsFilter</filter-class>
</filter>
<filter-mapping>
  <filter-name>CorsFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>

CORS in Scala

If you're using Spray with Scala, for example, adding the CORS header to the response can be achieved using the following (spray-routing) code: 

import spray.http._

// This assumes the route lives inside an HttpService (spray-routing),
// which brings path, get, respondWithHeader and complete into scope.
path("your_path") {
  get {
    // AllOrigins renders the header as "Access-Control-Allow-Origin: *"
    respondWithHeader(HttpHeaders.`Access-Control-Allow-Origin`(AllOrigins)) {
      complete("server response")
    }
  }
}