I downloaded Hadoop 0.21.0 from
http://mirror.its.uidaho.edu/pub/apache//hadoop/core/hadoop-0.21.0/
and unzipped it into
/Users/ben/Downloads/hadoop-0.21.0
These are the steps that worked for me:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_INSTALL=/Users/ben/Downloads/hadoop-0.21.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_HOME=$HADOOP_INSTALL/common
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export HADOOP_CLASSPATH=/Users/ben/Projects/hadoopexamples/marketdata/target/classes
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#In System Preferences -> Sharing -> Remote Login (checked)
# test this with ssh and accept the key
ssh localhost
# in the HADOOP_INSTALL/conf directory
#core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
#hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
#mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
# start-dfs.sh did not work for me, even though start-all.sh is "deprecated" and its warning recommends start-dfs.sh instead
./start-all.sh
# to run a java app in hadoop (requires the classpath export from above)
hadoop com.mycompany.myproject.MyApp
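For reference, here's a rough sketch of what a driver class like MyApp might look like with the new org.apache.hadoop.mapreduce API. The package and class name match the command above, but the toy word-count mapper/reducer and the /myinputs and /myoutputs paths are just placeholders, not my actual market data code:

package com.mycompany.myproject;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyApp {

    // toy mapper: emits (token, 1) for each whitespace-separated token
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.length() > 0) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // toy reducer: sums the counts for each token
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // picks up core-site.xml etc. from HADOOP_CONF_DIR when launched via the hadoop command
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my app");
        job.setJarByClass(MyApp.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/myinputs"));
        FileOutputFormat.setOutputPath(job, new Path("/myoutputs"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}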
I've been running mostly in Eclipse. I used mvn archetype:generate and then mvn eclipse:eclipse to create my folder structure and a default app. Then I deleted the pom and started using Eclipse to manage the classpath. I added every jar in the HADOOP_INSTALL dir and the HADOOP_INSTALL/lib dir. After I wrote my programs, I removed all the jars I didn't recognize, ran the programs, and re-added jars one by one to fix NoClassDefFoundErrors until I had this list:
hadoop-mapred-0.21.0.jar
hadoop-hdfs-0.21.0.jar
hadoop-common-0.21.0.jar
commons-logging-1.1.1.jar
log4j-1.2.15.jar
junit-4.8.1.jar
commons-httpclient-3.1.jar
commons-cli-1.2.jar
jackson-mapper-asl-1.4.2.jar
jackson-core-asl-1.4.2.jar
avro-1.3.2.jar
commons-codec-1.4.jar
Commons HttpClient is in there because I'm RESTfully downloading sample data from Yahoo!, and Log4j is there because I'm using it :-) I generally run from the command line (except main(), which I run from Eclipse) to inspect, populate, and delete the contents of HDFS. Here are some common commands I use:
hadoop fs -cat /myinputs/csv.txt
hadoop fs -ls /myoutputs
hadoop fs -rmr /myoutputs
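The same operations are also available programmatically through the FileSystem API. Something like this (the class name is made up; the paths are the example paths above):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsInspect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

        // roughly: hadoop fs -cat /myinputs/csv.txt
        InputStream in = fs.open(new Path("/myinputs/csv.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();

        // roughly: hadoop fs -ls /myoutputs
        FileStatus[] statuses = fs.listStatus(new Path("/myoutputs"));
        if (statuses != null) {
            for (FileStatus status : statuses) {
                System.out.println(status.getPath());
            }
        }

        // roughly: hadoop fs -rmr /myoutputs (recursive delete)
        fs.delete(new Path("/myoutputs"), true);
    }
}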
I also use Ruby to coordinate HDFS prep: deleting files, invoking the program, processing the output, and generating HTML from that output. Using the ` (backtick) syntax you can shell out to hadoop or cat or anything else to make processing easy: `curl someurl > somefile.txt`, then `hadoop dfs -copyFromLocal URI` to get it into HDFS... or write this in Java :-)
// Imports: org.apache.commons.httpclient.HttpClient, org.apache.commons.httpclient.methods.GetMethod,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IOUtils, java.io.BufferedInputStream, java.io.OutputStream, java.net.URI
BufferedInputStream in = null;
try {
    GetMethod method = new GetMethod("http://myurl.com/gooddata");
    new HttpClient().executeMethod(method);
    in = new BufferedInputStream(method.getResponseBodyAsStream());
    String destination = "hdfs://localhost/myinputs/file.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(destination), conf);
    OutputStream out = fs.create(new Path(destination), true);
    // the final "true" closes both streams when the copy finishes
    IOUtils.copyBytes(in, out, 4096, true);
} catch (Exception e) {
    // in is declared outside the try so it's still in scope here
    IOUtils.closeStream(in);
    throw e;
}
CSV is by far the easiest to process. Actual programs... next time :-)
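(Just to give a flavor before then: a mapper over CSV data mostly boils down to splitting each line on commas. The symbol/price column layout below is made up, and real input would need handling for headers, quoting, and bad rows.)

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// hypothetical mapper: emits (symbol, price) from lines like "AAPL,301.59,..."
public class CsvMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length >= 2) {
            try {
                context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
            } catch (NumberFormatException e) {
                // skip header lines or malformed rows
            }
        }
    }
}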