How To Set Up Hadoop on OS X Lion 10.7 (Repost)

Chances are good that if you are a software engineer just starting out, knowing MapReduce inside and out is as important now as knowing how to configure a LAMP stack was in the last decade. Most developers will therefore want a local instance to learn and experiment with, without having to go down the route of virtualization.

Although there are a lot of competing MapReduce implementations out there, Apache Hadoop is the leader, with most PaaS vendors such as Amazon and Microsoft supporting it.

Setting up Apache Hadoop on Mac OS X follows a similar pattern to the official Hadoop single-node documentation on the Apache site, but there are some bugs and custom configuration for OS X Lion that could trip you up. Here is a quick tutorial, including the gotcha configuration changes you need to make until Apache fixes those bugs, to get you started. If you have any updates or suggestions please drop me a line and I’ll update.

Getting Java

Mac OS X no longer provides Java out of the box, but triggering the install is fairly easy.

Option 1: From UNIX Command Line

Just check your Java version on a command line, which will prompt OS X to ask if you’d like to install Java.

$ java -version

Option 2: Get it from Apple website

You can also download it directly from Apple by visiting here: http://support.apple.com/kb/dl1421

Getting Hadoop

Setting up your environment

Some people like putting Hadoop under ~/Library/Hadoop. That’s fine, but I am used to the /usr/local/ of the *nix world, so I’ll use that for $HADOOP_HOME. You can make changes as appropriate.

Edit your .bash_profile and insert the following:

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$PATH:$HADOOP_HOME/bin

Note that I have set JAVA_HOME to the output of a command that dynamically finds the correct Java in your OS X environment. This can be done both in your .bash_profile and in hadoop-env.sh in your configuration. I recommend this so that any changes Apple (or perhaps Oracle, once Apple gets out of the business of providing Java altogether) makes in various updates do not break your Java configuration.
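As a hedged sketch of the same idea, the lookup can be wrapped so that the profile still loads cleanly on a machine where the /usr/libexec/java_home helper is missing or finds no JDK. The function name resolve_java_home is hypothetical and not part of OS X or Hadoop.

```shell
# Hypothetical helper: print a Java home, preferring the OS X lookup tool.
resolve_java_home() {
  if [ -x /usr/libexec/java_home ]; then
    # Ask OS X for the current Java home; fall back if no JDK is installed.
    /usr/libexec/java_home 2>/dev/null || printf '%s\n' "${JAVA_HOME:-}"
  else
    # Helper missing (not OS X): keep whatever JAVA_HOME was already set.
    printf '%s\n' "${JAVA_HOME:-}"
  fi
}

JAVA_HOME="$(resolve_java_home)"
export JAVA_HOME
echo "JAVA_HOME is '${JAVA_HOME}'"
```

If the echoed value is empty, Java is probably not installed yet and the steps in "Getting Java" need to be repeated.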

Download Hadoop from command line

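$ cd /usr/local/
$ wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u1.tar.gz
$ tar xzvf hadoop-0.20.2-cdh3u1.tar.gz
# Rename the extracted directory to hadoop. Do not create /usr/local/hadoop first, or mv will nest the release inside it.
$ mv hadoop-0.20.2-cdh3u1 hadoop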
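Before moving on, it can help to confirm that the release really sits directly under $HADOOP_HOME. The following is a minimal sketch, assuming the only thing worth checking is an executable bin/hadoop launcher; check_hadoop_layout is a hypothetical name.

```shell
# Hypothetical check: does the given directory contain an executable bin/hadoop?
check_hadoop_layout() {
  home="$1"
  [ -x "$home/bin/hadoop" ]
}

# Default mirrors the HADOOP_HOME used in this post.
HADOOP_HOME="${HADOOP_HOME:-/usr/local/hadoop}"
if check_hadoop_layout "$HADOOP_HOME"; then
  echo "Found bin/hadoop under $HADOOP_HOME"
else
  echo "No bin/hadoop under $HADOOP_HOME; the release may be nested one level deeper"
fi
```

If the second message appears, look for a hadoop-0.20.2-cdh3u1 directory inside $HADOOP_HOME and move its contents up one level.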

Configuring Hadoop for OS X (and fixing some bugs)

Once Hadoop is installed, you’ll want to edit an environment script (hadoop-env.sh) and three XML configuration files. Learning what these files do in general is left up to the reader, but this will get you up to speed quickly.

We will set up the following single node configuration:

sets the default file system as an HDFS instance
sets the path on the local filesystem that the Hadoop daemons will use for persistence to something accessible by you
sets hdfs configuration so that HDFS will only try to store one copy of each file
sets map reduce properties to define the number of map and reduce slots that will be available on your box (you can play with these depending on your system resources)

Configuring: hadoop-env.sh

In your command window, open the environment configuration file. You won’t want to change much here, but a few settings will help ensure Hadoop runs correctly the first time and every time. I recommend making these changes.

$ vi /usr/local/hadoop/conf/hadoop-env.sh

Uncomment #JAVA_HOME and specify the command path to dynamically load your Java location as discussed above:

# The java implementation to use. Required.
export JAVA_HOME=$(/usr/libexec/java_home)

Next, uncomment HADOOP_HEAPSIZE and set it to 2000. This is optional but recommended:

# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000

IMPORTANT: Fix Configuration Files To Get Around Lion Specific Problems

OS X Lion introduced a bug that many people experience when first initializing their name node storage. It typically appears as this error:

“Unable to load realm info from SCDynamicStore”

This error is currently being tracked as Apache bug HADOOP-7489. Readers may want to check whether it has been fixed before applying the fix below.

To fix this issue, simply add the following to your hadoop-env.sh file:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"

To sum up, your hadoop-env.sh should have the following defined:

export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HEAPSIZE=2000
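If you script these edits rather than making them in vi, appending the same export twice is an easy mistake. Here is a small sketch of an idempotent append; ensure_line is a hypothetical helper, and the demo runs against a throwaway temp file rather than your real hadoop-env.sh.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
ensure_line() {
  file="$1"
  line="$2"
  # -x matches the whole line, -F treats the text literally (no regex).
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demo on a temp file so nothing real is modified.
demo="$(mktemp)"
ensure_line "$demo" 'export HADOOP_HEAPSIZE=2000'
ensure_line "$demo" 'export HADOOP_HEAPSIZE=2000'   # second call is a no-op
cat "$demo"
rm -f "$demo"
```

To apply it for real, pass the path of your own hadoop-env.sh as the first argument and each export line from the summary above as the second.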

With that file ready to go, let’s move on to configuring your HDFS and MapReduce XML files.

Configuring: core-site.xml

A change from previous versions of Apache Hadoop is that instead of putting all the configuration for your Hadoop instance into one XML file (hadoop-site.xml), you now have three configuration files you need to edit. This separation of concerns is a good decision, but causes some extra work for us. First up is the core-site.xml file.

As stated, you need to pick a good place for the local storage of your single-node HDFS instance and set up the address of the master HDFS instance. I also chose to dynamically inject the username into the temp directory path in order to keep track of which account is writing to the HDFS store. This is good practice if you plan on running a local service account (or a few) to test different scenarios. It’s not necessary though. Keep in mind that whichever account runs Hadoop (a service account or your own) will need write access to the tmp directory you point to.

Your file should look something like this:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/tmp/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
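Since the directory named in hadoop.tmp.dir must be writable by the account that runs Hadoop, a quick check before formatting HDFS can save a confusing failure later. This is a sketch only: ensure_writable_dir is a hypothetical helper, and `id -un` stands in for the ${user.name} value that Hadoop expands itself.

```shell
# Hypothetical helper: create a directory if needed and confirm it is writable.
ensure_writable_dir() {
  dir="$1"
  mkdir -p "$dir" 2>/dev/null
  [ -d "$dir" ] && [ -w "$dir" ]
}

# Mirrors the hadoop.tmp.dir value used in this post; override TMP_DIR to test another path.
TMP_DIR="${TMP_DIR:-/usr/local/tmp/hadoop/hadoop-$(id -un)}"
if ensure_writable_dir "$TMP_DIR"; then
  echo "OK: $TMP_DIR is writable by $(id -un)"
else
  echo "Cannot write to $TMP_DIR; create it with sudo and chown it to $(id -un)"
fi
```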

Configuring: hdfs-site.xml

Now that we’ve accomplished that, we need to configure HDFS itself, which is what hdfs-site.xml is for. Since we are running a single-node cluster on our Mac, we tell HDFS to store only one copy of each file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Configuring: mapred-site.xml

Next we need some custom configuration for the MapReduce engine itself. We specify the JobTracker location (usually just your HDFS port + 1, but you can use any open port) and set the maximum number of map and reduce tasks that can run at once on this node. You can tune these depending on the size and speed of your system. I specified 2 here.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
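With all three XML files saved, it can be reassuring to read the values back and confirm there are no typos. The helper below is a rough sketch, not a real XML parser: get_prop is a hypothetical name, and it assumes each <value> tag directly follows its <name> tag, as in the files above.

```shell
# Hypothetical helper: print the <value> that directly follows <name>NAME</name>.
get_prop() {
  file="$1"
  name="$2"
  # Join the file into one line so the match works whether or not the XML is indented.
  tr '\n' ' ' < "$file" |
    sed -n "s:.*<name>${name}</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p"
}

# Demo against a throwaway copy of the hdfs-site.xml content.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
get_prop "$sample" dfs.replication   # prints 1
rm -f "$sample"
```

Pointing get_prop at your own core-site.xml and mapred-site.xml with fs.default.name or mapred.job.tracker as the second argument should echo the host and port you configured.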

Set Up HDFS For The First Time

We are almost done here, but one final step is to format the HDFS instance we’ve specified. Since we’ve already squashed the nasty SCDynamicStore bug in your hadoop-env.sh file, this should work without issue. This is also a great way to test if the account you are running hadoop as actually has access to all the required directories.

$ $HADOOP_HOME/bin/hadoop namenode -format

You should see output like the following:

Brandons-MacBook-Air:local bbjwerner$ hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
11/10/23 00:30:26 INFO namenode.NameNode: STARTUP_MSG:
Re-format filesystem in /usr/local/tmp/hadoop/hadoop-bbjwerner/dfs/name ? (Y or N) Y   <-- NOTE: You have to use a capital "Y" here. Dumb script.
11/10/23 00:30:28 INFO util.GSet: VM type       = 64-bit
11/10/23 00:30:28 INFO util.GSet: 2% max memory = 39.83375 MB
11/10/23 00:30:28 INFO util.GSet: capacity      = 2^22 = 4194304 entries
11/10/23 00:30:28 INFO util.GSet: recommended=4194304, actual=4194304
11/10/23 00:30:28 INFO namenode.FSNamesystem: fsOwner=bbjwerner
11/10/23 00:30:28 INFO namenode.FSNamesystem: supergroup=supergroup
11/10/23 00:30:28 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/10/23 00:30:28 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/10/23 00:30:28 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/10/23 00:30:28 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/10/23 00:30:29 INFO common.Storage: Image file of size 115 saved in 0 seconds.
11/10/23 00:30:29 INFO common.Storage: Storage directory /usr/local/tmp/hadoop/hadoop-bbjwerner/dfs/name has been successfully formatted.
11/10/23 00:30:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Brandons-MacBook-Air.local/10.0.1.31

With this complete, your setup of Hadoop is ready! Now all we have to do is run a simple test to make sure it all works!

Start Up Hadoop With The Included Scripts

You used to have to start each part of Hadoop individually (datanode, namenode, jobtracker, tasktracker), but the distribution now includes a script that starts all the services at once.

$ $HADOOP_HOME/bin/start-all.sh

You will see each service start up. If there are no errors, you are ready to move on to testing out your Hadoop instance!
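One way to double-check that the daemons stayed up is to look for them in the output of jps, the process lister that ships with the JDK. The sketch below assumes the daemons appear under the class names listed in the loop; SecondaryNameNode is included on the assumption that start-all.sh launches it in this release, and missing_daemons is a hypothetical helper name.

```shell
# Hypothetical helper: read `jps` output on stdin, print each expected daemon that is absent.
missing_daemons() {
  listing="$(cat)"
  for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # -w matches whole words, so NameNode does not match SecondaryNameNode.
    printf '%s\n' "$listing" | grep -qw "$d" || echo "$d"
  done
}

if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
else
  echo "jps not found on PATH; is a JDK installed?"
fi
```

No output from the jps branch means every expected daemon was found; any names printed are the ones to look up in the log files.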

Run Hadoop with the included Examples JAR files in the Hadoop distribution

To test your single node, run one of the bundled examples from the command line. To see the available examples, run the following command:

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar

You will see a list of different example programs. The easiest is pi. You can run it like this:

$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100