If you are a software engineer who is just starting out, chances are good that knowing MapReduce inside and out is as important now as knowing how to configure a LAMP stack was in the last decade. Most developers will therefore want a local instance to learn and experiment on without having to go down the route of virtualization.
Although there are a lot of competing MapReduce implementations out there, Apache Hadoop is the leader, with most PaaS vendors such as Amazon and Microsoft supporting it.
Setting up Apache Hadoop on Mac OS X follows a similar pattern to the official Hadoop single node documentation from Apache, but there are some bugs and custom configuration for OS X Lion that could trip you up. Here is a quick tutorial (with some gotcha configuration changes you need to make until Apache fixes those bugs) to get you started. If you have any updates or suggestions, please drop me a line and I’ll update the post.
Getting Java
Mac OS X no longer provides Java out of the box, but prompting the install is fairly easy.
Option 1: From UNIX Command Line
Just check your Java version on a command line, which will prompt OS X to ask if you’d like to install Java.
$ java -version
Option 2: Get it from the Apple website
You can also download it directly from Apple by visiting here: http://support.apple.com/kb/dl1421
Getting Hadoop
Setting up your environment
Some people like putting Hadoop under ~/Library/Hadoop. That’s fine, but I am used to the /usr/local/ of the *nix world, so I’ll use that for $HADOOP_HOME. You can make changes as appropriate.
Edit your .bash_profile and insert the following:
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$PATH:$HADOOP_HOME/bin
Note that I have pointed JAVA_HOME at a command that dynamically finds the correct Java in your OS X environment. This can be done both in your .bash_profile and in hadoop-env.sh in your configuration. I recommend this so that changes Apple (or perhaps Oracle, once Apple gets out of the business of providing Java altogether) makes in various updates do not break your Java configuration.
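If the same profile ends up on a machine without Apple's java_home helper, a guarded lookup avoids silently exporting an empty JAVA_HOME. The following is a minimal sketch under that assumption; the function name detect_java_home and its optional helper-path argument are made up for illustration and are not part of OS X or Hadoop.

```shell
# Print a Java home directory, preferring Apple's java_home helper.
# detect_java_home is a hypothetical helper name; the optional first
# argument (path to the helper) exists only to make the function testable.
detect_java_home() {
  helper="${1:-/usr/libexec/java_home}"
  if [ -x "$helper" ]; then
    # Ask the helper for the current default JDK location.
    "$helper"
  else
    # Fall back to whatever JAVA_HOME is already set to (may be empty).
    printf '%s\n' "${JAVA_HOME:-}"
  fi
}

export JAVA_HOME="$(detect_java_home)"
```

On a stock OS X install this should behave the same as the one-line export above; the fallback only matters elsewhere.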
Download Hadoop from command line
$ cd /usr/local/
$ mkdir hadoop
$ wget http://archive.cloudera.com/cdh/3/hadoop-0.20.2-cdh3u1.tar.gz
$ tar xzvf hadoop-0.20.2-cdh3u1.tar.gz
$ mv hadoop-0.20.2-cdh3u1 ./hadoop
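One thing to watch in the steps above: when the hadoop directory already exists, mv places the unpacked directory inside it rather than renaming it, so $HADOOP_HOME/bin would not be where the rest of this post expects. The sketch below is one way to avoid that, assuming the archive contains a single top-level directory and has already been downloaded (as far as I know a stock OS X install ships curl rather than wget, so curl -O with the same URL is the usual substitute). The function name install_tarball is hypothetical.

```shell
# Unpack a tarball into a scratch directory, then move its single
# top-level directory to the destination. install_tarball is a
# hypothetical helper name, not a Hadoop or OS X command.
install_tarball() {
  archive="$1"
  dest="$2"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite existing $dest" >&2
    return 1
  fi
  scratch="$(mktemp -d "${TMPDIR:-/tmp}/unpack.XXXXXX")" || return 1
  tar xzf "$archive" -C "$scratch" || return 1
  # Moving onto a path that does not exist yet renames the directory
  # instead of nesting it inside an existing one.
  mv "$scratch"/* "$dest" && rmdir "$scratch"
}

# Usage, assuming the tarball sits in the current directory:
# install_tarball hadoop-0.20.2-cdh3u1.tar.gz /usr/local/hadoop
```

If /usr/local is not writable by your account, the move will fail with a permission error rather than leave a half-installed tree.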
Configuring Hadoop for OS X (and fixing some bugs)
Once installed, there are four files you’ll want to edit: hadoop-env.sh and three XML configuration files. Learning what these files do in general is left up to the reader, but this will get you up to speed quickly.
We will set up the following single node configuration:
- sets the default file system as an HDFS instance
- sets the path on the local filesystem that the Hadoop daemons will use for persistence to something accessible by you
- sets hdfs configuration so that HDFS will only try to store one copy of each file
- sets map reduce properties to define the number of map and reduce slots that will be available on your box (you can play with these depending on your system resources)
Configuring: hadoop-env.sh
In your command window, load the environment configuration file. You won’t want to change much here, but some things will help you ensure you run right the first time and every time. I recommend making these changes.
$ vi /usr/local/hadoop/conf/hadoop-env.sh
Uncomment #JAVA_HOME and specify the command path to dynamically load your Java location as discussed above:
# The java implementation to use. Required.
export JAVA_HOME=$(/usr/libexec/java_home)
Next, uncomment HADOOP_HEAPSIZE and make it 2000. This is optional but recommended:
# The maximum amount of heap to use, in MB. Default is 1000.
export HADOOP_HEAPSIZE=2000
IMPORTANT: Fix Configuration Files To Get Around Lion Specific Problems
OS X Lion introduced a bug that many people experience when first initializing their name node storage. It typically appears as this error:
“Unable to load realm info from SCDynamicStore”
This error is currently tracked as Apache bug HADOOP-7489. Readers may want to check whether it has been fixed before applying the fix below.
To fix this issue, simply add the following to your hadoop-env.sh file:
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
To sum up, your hadoop-env.sh should have the following defined:
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HEAPSIZE=2000
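A quick way to confirm that all three exports made it into the file, and that none of them is still commented out, is to grep for them. This is only a sketch; check_hadoop_env is a hypothetical helper name, and it looks for the variable names only, not their values.

```shell
# Report which of the three expected exports are missing from a
# hadoop-env.sh file. check_hadoop_env is a hypothetical helper name.
check_hadoop_env() {
  env_file="$1"
  status=0
  for var in JAVA_HOME HADOOP_HEAPSIZE HADOOP_OPTS; do
    # Match only uncommented lines that begin with "export VAR=".
    if ! grep -q "^[[:space:]]*export[[:space:]][[:space:]]*${var}=" "$env_file"; then
      echo "missing: $var"
      status=1
    fi
  done
  return $status
}

# Usage:
# check_hadoop_env "$HADOOP_HOME/conf/hadoop-env.sh" && echo "hadoop-env.sh looks complete"
```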
With that file ready to go, let’s move on to configuring your hdfs and map reduce XML files.
Configuring: core-site.xml
A change from previous versions of Apache Hadoop is that instead of putting all the configuration for your Hadoop instance into one XML file (hadoop-site.xml), you now have three configuration files you need to edit. This separation of concerns is a good decision, but causes some extra work for us. First up is the core-site.xml file.
As stated, you need to pick a good place for the local storage of your single node HDFS instance and set up the location of the master HDFS instance. I also chose to dynamically inject the username into the temp directory in order to keep track of what account is writing to the HDFS store. This is good practice if you plan on running a local service account (or a few) to test different scenarios. It’s not necessary though. Keep in mind that whichever account you run Hadoop as (a service account or your own) will need write access to the tmp directory you point to.
Your file should look something like this:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/tmp/hadoop/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
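Since the account running Hadoop needs write access to that directory, it can save a confusing error later to create it and check it up front. A minimal sketch, assuming ${user.name} resolves to the same name that whoami prints; ensure_writable_dir is a hypothetical helper name.

```shell
# Create a directory if needed and confirm the current user can write
# to it. ensure_writable_dir is a hypothetical helper name.
ensure_writable_dir() {
  dir="$1"
  # Creation may fail (for example, a protected parent); the check below
  # reports that case.
  mkdir -p "$dir" 2>/dev/null || true
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "ok: $dir"
  else
    echo "not writable: $dir" >&2
    return 1
  fi
}

# Usage, mirroring the hadoop.tmp.dir value for the current user:
# ensure_writable_dir "/usr/local/tmp/hadoop/hadoop-$(whoami)"
```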
Configuring: hdfs-site.xml
Now that we’ve accomplished that, we need to set up HDFS itself, which is what hdfs-site.xml is for. Since we are running a single node cluster on our Mac, we will tell HDFS to store only one copy of each file:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Configuring: mapred-site.xml
Next we need to do some custom configuration on the MapReduce engine itself. We specify the job tracker location (usually just your HDFS port + 1, but you can use any open port) and also set the maximum number of map and reduce tasks that can run at once. You can configure these depending on the size / speed of your system. I specified 2 here.
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
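With three XML files in play, it can help to read a value back and confirm it is what you meant to set. The sketch below is a line-oriented awk lookup, not a real XML parser: it assumes each name and value tag sits on its own line, which is how these files are usually laid out. The function name get_property is hypothetical.

```shell
# Print the <value> that follows a given <name> in a Hadoop XML config.
# get_property is a hypothetical helper; it assumes one tag per line.
get_property() {
  config_file="$1"
  awk -v want="$2" '
    # Remember the most recent property name seen.
    /<name>/ { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
    # Print the first value that belongs to the wanted name, then stop.
    /<value>/ && n == want {
      v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
      print v
      exit
    }
  ' "$config_file"
}

# Usage:
# get_property "$HADOOP_HOME/conf/mapred-site.xml" mapred.job.tracker
```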
Setup HDFS For The First Time
We are almost done here, but one final step is to format the HDFS instance we’ve specified. Since we’ve already squashed the nasty SCDynamicStore bug in your hadoop-env.sh file, this should work without issue. This is also a great way to test if the account you are running hadoop as actually has access to all the required directories.
$ $HADOOP_HOME/bin/hadoop namenode -format
You should see output like the following:
Brandons-MacBook-Air:local bbjwerner$ hadoop namenode -format
Warning: $HADOOP_HOME is deprecated.
11/10/23 00:30:26 INFO namenode.NameNode: STARTUP_MSG:
Re-format filesystem in /usr/local/tmp/hadoop/hadoop-bbjwerner/dfs/name ? (Y or N) Y    <-- NOTE: You have to use a capital "Y" here. Dumb script.
11/10/23 00:30:28 INFO util.GSet: VM type = 64-bit
11/10/23 00:30:28 INFO util.GSet: 2% max memory = 39.83375 MB
11/10/23 00:30:28 INFO util.GSet: capacity = 2^22 = 4194304 entries
11/10/23 00:30:28 INFO util.GSet: recommended=4194304, actual=4194304
11/10/23 00:30:28 INFO namenode.FSNamesystem: fsOwner=bbjwerner
11/10/23 00:30:28 INFO namenode.FSNamesystem: supergroup=supergroup
11/10/23 00:30:28 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/10/23 00:30:28 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
11/10/23 00:30:28 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
11/10/23 00:30:28 INFO namenode.NameNode: Caching file names occuring more than 10 times
11/10/23 00:30:29 INFO common.Storage: Image file of size 115 saved in 0 seconds.
11/10/23 00:30:29 INFO common.Storage: Storage directory /usr/local/tmp/hadoop/hadoop-bbjwerner/dfs/name has been successfully formatted.
11/10/23 00:30:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Brandons-MacBook-Air.local/10.0.1.31
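The line to look for in that output is the one ending in "has been successfully formatted". If you would rather check for it than eyeball the log, a small sketch along these lines works on a captured copy of the output; format_succeeded is a hypothetical helper name, and the log path in the usage comment is just an example.

```shell
# Succeed only if the captured output contains the success line that
# the format step logs. format_succeeded is a hypothetical helper name.
format_succeeded() {
  grep -q "has been successfully formatted" "$1"
}

# Usage: the log lines are typically written to stderr, so capture both
# streams while still seeing the (Y or N) prompt on screen.
# hadoop namenode -format 2>&1 | tee /tmp/namenode-format.log
# format_succeeded /tmp/namenode-format.log && echo "namenode formatted"
```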
With this complete, your setup of Hadoop is ready! Now all we have to do is run a simple test to make sure it all works!
Startup Hadoop With The Included Scripts
You used to have to start each part of Hadoop individually (datanode, namenode, jobtracker, tasktracker), but the distribution now includes a script that starts all the services at once.
$ $HADOOP_HOME/bin/start-all.sh
You will see each service start up. If there are no errors, you are ready to move on to testing out your Hadoop instance!
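To double-check that every daemon actually stayed up, the JDK's jps tool lists running Java processes by main class. The sketch below reads jps output and prints any daemon it expected but did not find; the list assumes a Hadoop 0.20-era single node, where start-all.sh also launches a SecondaryNameNode. The function name missing_daemons is hypothetical.

```shell
# Read `jps` output on stdin and print any expected Hadoop daemon that
# is not listed. missing_daemons is a hypothetical helper name.
missing_daemons() {
  jps_output="$(cat)"
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # jps prints "<pid> <main class>"; require whitespace before the name
    # so that SecondaryNameNode does not count as NameNode.
    if ! printf '%s\n' "$jps_output" | grep -q "[[:space:]]${daemon}\$"; then
      echo "$daemon"
    fi
  done
}

# Usage: no output means all five daemons are running.
# jps | missing_daemons
```

If a daemon is missing, its log file under $HADOOP_HOME/logs is the first place to look.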
Run Hadoop with the included Examples JAR files in the Hadoop distribution
To test your single node, run a quick job from the command line. To see the examples that ship with Hadoop, run the following command:
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar
You will see a bunch of different cool tests. The easiest is pi. You can run it in this way:
$ hadoop jar $HADOOP_HOME/hadoop-examples-*.jar pi 10 100
You should see output like the following:
Number of Maps = 10
Samples per Map = 100
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
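As the output shows, the two arguments are the number of maps and the samples per map, so this run uses 10 x 100 = 1,000 samples in total. To get a feel for what the job computes, here is a single-process sketch of the same idea in awk. It swaps in plain pseudo-random sampling for the quasi-Monte Carlo scheme the Hadoop example uses, so the numbers will not match Hadoop's; estimate_pi is a hypothetical helper name.

```shell
# Estimate pi by sampling points in the unit square and counting how
# many fall inside the quarter circle. estimate_pi is a hypothetical
# helper; it uses awk's rand(), not Hadoop's sampling scheme.
estimate_pi() {
  maps="$1"
  samples_per_map="$2"
  awk -v n="$((maps * samples_per_map))" 'BEGIN {
    srand(42)                      # fixed seed so reruns repeat
    hits = 0
    for (i = 0; i < n; i++) {
      x = rand(); y = rand()
      if (x * x + y * y <= 1) hits++
    }
    # Quarter-circle area is pi/4, so scale the hit ratio by 4.
    printf "%.4f\n", 4 * hits / n
  }'
}

estimate_pi 10 100
```

Raising either argument tightens the estimate, which is also why the Hadoop job gets more accurate as you add maps or samples.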
Congratulations!
You now have a single node Hadoop on OS X Lion. Happy Hacking!
Who Am I?
I am Brandon Werner. I love good friends, good coffee, and good ideas shared around a room. I work for Microsoft helping build the next identity platform in the cloud for Azure.
-
Hiroshi@gmail.com 22 Dec 2011 4:36 PM
Thank you for the great instruction! A couple of comments for your attention;
1) $ mv hadoop-0.20.2-cdh3u1 ./hadoop : three steps before in your instruction, you have already created hadoop directory, this command will move hadoop-0.20.2-cdh3u1/ under ./hadoop.
2) During formatting namenode and start-all.sh, I hit many permission errors like: localhost: mkdir: /usr/local/hadoop/bin/../logs: Permission denied
How do you set up your account on Mac (Lion)? I used "sudo" with root password during the installation. The /usr/local area is protected on my Mac.
Thank you again for your help.
-Hiroshi
-
27 Dec 2011 7:14 PM
It may be best just to sudo su and run the entire process as root to ensure anything spawned during the installation also has permission to the directory. Lion has done a lot of confusing things to permissions in the unix directory to "protect" users, so your best bet is to take ownership of the entire /usr/local/ directory recursively and then set 775 on them.
There is no reason in my mind why /usr/local/ shouldn't be under ownership of the user in a single user machine.
-
Will L 30 Dec 2011 9:42 AM
Hello, have you been able to get the Hadoop Eclipse Plugin to work on OS X Lion? For some reason in Eclipse 3.7.1 with Hadoop 0.20.205.0, the Eclipse plugin cannot connect to the DFS and gives an error of "Failed to login". What I don't understand is why Hadoop is starting to deprecate the Hadoop Eclipse Plugin. I tried building Hadoop 1.0.0's Eclipse plugin from the source directory but it doesn't seem to generate any jar files.
Thank you again for your help!
-
Ryan 9 Jan 2012 6:01 PM
Hi Will, it looks like the plugin is missing some jar files and I think the manifest may be off too.