Wednesday, March 9, 2011

Hadoop setup for Mac

I bought The Definitive Guide (paperback; should have gone Kindle for iPad) and read the tutorials and blogs.  Maybe it was just me, but the changing APIs, changing names and undocumented features made it too difficult to get a simple Hadoop project up and running.  I'm not a huge Maven fan, but when an Apache project doesn't use it...  (but I digress.)  This was done with version 0.21.0, which is not available on every mirror and is categorized as a minor release.  I found it here:
http://mirror.its.uidaho.edu/pub/apache//hadoop/core/hadoop-0.21.0/
and unzipped it into
/Users/ben/Downloads/hadoop-0.21.0

These are the steps that worked for me:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_INSTALL=/Users/ben/Downloads/hadoop-0.21.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_HOME=$HADOOP_INSTALL/common
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export HADOOP_CLASSPATH=/Users/ben/Projects/hadoopexamples/marketdata/target/classes

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

#In System Preferences -> Sharing -> Remote Login (checked)
# test this with ssh and accept the key
ssh localhost

# in the HADOOP_INSTALL/conf directory
#core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

#hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

#mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>



# start-dfs.sh did not work for me, even though start-all.sh claims to be "deprecated" and recommends it
./start-all.sh

# to run a java app in hadoop (requires the classpath export from above)
hadoop com.mycompany.myproject.MyApp
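For completeness, here's a minimal sketch of what such an app might look like.  The package, class name and paths are hypothetical (MyApp here just lists a directory in HDFS), and it assumes the pseudo-distributed HDFS configured above is running on localhost:

```java
// Hypothetical example of an app launched via `hadoop com.mycompany.myproject.MyApp`.
// The package, class and /myinputs path are made up; it assumes the HDFS
// started above is running on localhost.
package com.mycompany.myproject;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MyApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        // print every file under /myinputs
        for (FileStatus status : fs.listStatus(new Path("/myinputs"))) {
            System.out.println(status.getPath());
        }
    }
}
```

Because the hadoop launcher script puts all the Hadoop jars on the classpath for you, the HADOOP_CLASSPATH export only needs to point at your own compiled classes.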

I've been running mostly in Eclipse.  I used mvn archetype:generate and then mvn eclipse:eclipse to create my folder structure and default app.  Then, I deleted the pom and started using Eclipse to manage the classpath.  I added every jar in the HADOOP_INSTALL dir and the HADOOP_INSTALL/lib dir.  After I wrote my programs, I removed all the jars I didn't recognize, ran the programs and re-added them one-by-one to fix NoClassDefFoundErrors until I had this list:
hadoop-mapred-0.21.0.jar
hadoop-hdfs-0.21.0.jar
hadoop-common-0.21.0.jar
commons-logging-1.1.1.jar
log4j-1.2.15.jar
junit-4.8.1.jar
commons-httpclient-3.1.jar
commons-cli-1.2.jar
jackson-mapper-asl-1.4.2.jar
jackson-core-asl-1.4.2.jar
avro-1.3.2.jar
commons-codec-1.4.jar

Commons HttpClient is in there because I'm RESTfully downloading sample data from Yahoo!  And log4j is there because I'm using it :-)  I generally run from the command line (except main(), which I run from Eclipse) to inspect, populate and delete the contents of HDFS.  Here are some common commands I use:
hadoop fs -cat /myinputs/csv.txt
hadoop fs -ls /myoutputs
hadoop fs -rmr /myoutputs

I also use Ruby to coordinate HDFS prep: deleting files, invoking the program, processing the output and generating HTML from the output data.  Using the ` (backtick) syntax you can shell out to hadoop or cat or anything to make processing easy.  `curl someurl > somefile.txt` then `hadoop dfs -copyFromLocal URI` into HDFS... or write this in Java :-)
// needs commons-httpclient and the hadoop jars on the classpath:
// org.apache.commons.httpclient.HttpClient, org.apache.commons.httpclient.methods.GetMethod,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem,
// org.apache.hadoop.fs.Path, org.apache.hadoop.io.IOUtils, java.io.*, java.net.URI
InputStream in = null;
try {
    // fetch the sample data over HTTP
    GetMethod method = new GetMethod("http://myurl.com/gooddata");
    new HttpClient().executeMethod(method);
    in = new BufferedInputStream(method.getResponseBodyAsStream());
    // and stream it straight into HDFS
    String destination = "hdfs://localhost/myinputs/file.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(destination), conf);
    OutputStream out = fs.create(new Path(destination), true);
    IOUtils.copyBytes(in, out, 4096, true); // true = close both streams when done
} catch (IOException e) {
    IOUtils.closeStream(in); // copyBytes closes on success; clean up on failure
    throw e;
}

CSV is by far the easiest to process.  Actual programs... next time :-)
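As a hint of why: inside a mapper, parsing a CSV record usually boils down to a split.  The symbol/date/close/volume layout below is made-up sample data, and this sketch is plain Java with no Hadoop dependencies:

```java
// Why CSV is easy: value.toString().split(",") does most of the parsing work.
// The symbol/date/close/volume layout here is hypothetical sample data.
public class CsvSplitDemo {

    // one CSV record in, its fields out
    static String[] fields(String line) {
        return line.split(",");
    }

    public static void main(String[] args) {
        String[] f = fields("YHOO,2011-03-09,17.05,1234500");
        System.out.println(f[0] + " closed at " + f[2]); // prints "YHOO closed at 17.05"
    }
}
```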

Tuesday, March 8, 2011

This one goes to eleven

A client and I were recently trying to articulate best practices with respect to standardizing project structures, continuous integration methodology, environmental configurations and deployment techniques.  I wrote this short "manifesto" to describe the ideals without naming specific technologies.  The ideals theoretically will not change even though the tools are guaranteed to evolve over time.  Perhaps techniques for achieving each ideal will be fodder for future blog posts... just in time for blogs to be out of vogue!



  1. All environments should have working instances of all applications, follow a consistent naming scheme and have similar structure. 
  2. All environments should be self contained and depend only on application instances in the same environment. 
  3. All applications should have the same folder structure and techniques for cleaning, compiling, running, building, releasing, branching, tagging and deploying. 
  4. All application source code will live in the same version control system and follow a similar folder structure. 
  5. All applications should expose a representation of their own health and the health of their necessary dependencies. 
  6. All applications should require only a checkout and run to be successfully launched in local properties mode. 
  7. All stateless code should be unit tested thoroughly but not excessively and tests should be run automatically every commit. 
  8. All stateful code should be tested on a schedule according to established best practices. 
  9. All properties files should have a consistent naming scheme, validation mechanism, inheritance/override paradigm and version control methodology. 
  10. All persistent data stores should be refreshable to a known state and version that is enhanced as features are added. 
  11. All deployments should be centrally executed by a single continuous integration server accessible by role.