Wednesday, March 9, 2011

Hadoop setup for Mac

I bought Hadoop: The Definitive Guide (paperback; should have gone Kindle for iPad) and read the tutorials and blogs.  Maybe it was just me, but I felt the changing APIs, changing names and undocumented features made it harder than it should be to get a simple Hadoop project up and running.  I'm not a huge Maven fan, but when an Apache project doesn't use it...  (but I digress.)  This was done with version 0.21.0, which is not available on every mirror and is categorized as a minor release.  I found it here:
http://mirror.its.uidaho.edu/pub/apache//hadoop/core/hadoop-0.21.0/
and unzipped it into
/Users/ben/Downloads/hadoop-0.21.0

These are the steps that worked for me:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_INSTALL=/Users/ben/Downloads/hadoop-0.21.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_HOME=$HADOOP_INSTALL/common
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export HADOOP_CLASSPATH=/Users/ben/Projects/hadoopexamples/marketdata/target/classes
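(These can also go in ~/.bash_profile so they survive new terminal sessions, and JAVA_HOME can alternatively be set in conf/hadoop-env.sh.)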

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

# In System Preferences -> Sharing, check Remote Login
# then test with ssh and accept the host key
ssh localhost
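If the key setup worked, ssh localhost logs in without asking for a password; the Hadoop start scripts depend on that.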

# in the HADOOP_INSTALL/conf directory
#core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

#hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

#mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
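If this is a brand new install, the namenode also has to be formatted once before the first start (I believe `hadoop namenode -format`, or `hdfs namenode -format` in 0.21, run from the bin directory); otherwise HDFS won't come up.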



# start-dfs.sh did not work for me, even though start-all.sh claims to be "deprecated" and recommends it
./start-all.sh
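Once everything is up, jps should list the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker daemons, and (assuming default ports) the NameNode and JobTracker web UIs should be at http://localhost:50070 and http://localhost:50030.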

# to run a Java app with Hadoop (requires the HADOOP_CLASSPATH export from above)
hadoop com.mycompany.myproject.MyApp
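For reference, here's a minimal sketch of what a driver class run this way could look like against the new org.apache.hadoop.mapreduce API in 0.21.0.  The package name, class names and the toy count-by-first-CSV-column job are just placeholders (this is not the actual marketdata program):

package com.mycompany.myproject;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyApp {

  // emits (first CSV column, 1) for every input line
  public static class FirstColumnMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      context.write(new Text(fields[0]), ONE);
    }
  }

  // adds up the counts for each key
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "count by first csv column");
    job.setJarByClass(MyApp.class);
    job.setMapperClass(FirstColumnMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/myinputs"));
    FileOutputFormat.setOutputPath(job, new Path("/myoutputs"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}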

I've been running mostly in Eclipse.  I used mvn archetype:generate and then mvn eclipse:eclipse to create my folder structure and default app.  Then I deleted the pom and started using Eclipse to manage the classpath.  I added every jar in the HADOOP_INSTALL dir and the HADOOP_INSTALL/lib dir.  After I wrote my programs, I removed all the jars I didn't recognize, ran the programs, and re-added jars one by one to fix NoClassDefFoundErrors until I had this list:
hadoop-mapred-0.21.0.jar
hadoop-hdfs-0.21.0.jar
hadoop-common-0.21.0.jar
commons-logging-1.1.1.jar
log4j-1.2.15.jar
junit-4.8.1.jar
commons-httpclient-3.1.jar
commons-cli-1.2.jar
jackson-mapper-asl-1.4.2.jar
jackson-core-asl-1.4.2.jar
avro-1.3.2.jar
commons-codec-1.4.jar

Commons HttpClient is in there because I'm RESTfully downloading sample data from Yahoo!, and Log4j is there because I'm using it :-)  I generally run from the command line (except main(), which I run from Eclipse) to inspect, populate and delete the contents of HDFS.  Here are some common commands I use:
hadoop fs -cat /myinputs/csv.txt
hadoop fs -ls /myoutputs
hadoop fs -rmr /myoutputs

I also use Ruby to coordinate HDFS prep: deleting files, invoking the program, processing the output and generating HTML from that output.  Using the ` (backtick) syntax you can shell out to hadoop or cat or anything else to make processing easy: `curl someurl > somefile.txt`, then `hadoop dfs -copyFromLocal URI` into HDFS... or write this in Java :-)
// needs these imports: java.io.InputStream, java.io.BufferedInputStream, java.io.OutputStream,
// java.net.URI, org.apache.commons.httpclient.HttpClient, org.apache.commons.httpclient.methods.GetMethod,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IOUtils (and the enclosing method should declare throws IOException)
InputStream in = null;
OutputStream out = null;
try {
    // fetch the sample data over HTTP
    GetMethod method = new GetMethod("http://myurl.com/gooddata");
    new HttpClient().executeMethod(method);
    in = new BufferedInputStream(method.getResponseBodyAsStream());
    // stream it straight into HDFS
    String destination = "hdfs://localhost/myinputs/file.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(destination), conf);
    out = fs.create(new Path(destination), true);
    // close=true makes copyBytes close both streams when the copy succeeds
    IOUtils.copyBytes(in, out, 4096, true);
} catch (IOException e) {
    // clean up if the copy fails, then rethrow
    IOUtils.closeStream(in);
    IOUtils.closeStream(out);
    throw e;
}
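For plain local files, FileSystem.copyFromLocalFile(src, dst) does basically what the -copyFromLocal shell command does, without the HTTP step.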

CSV is by far the easiest to process.  Actual programs... next time :-)
