I downloaded Hadoop 0.21.0 from
http://mirror.its.uidaho.edu/pub/apache//hadoop/core/hadoop-0.21.0/
and unzipped it into
/Users/ben/Downloads/hadoop-0.21.0
These are the steps that worked for me:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_INSTALL=/Users/ben/Downloads/hadoop-0.21.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_HOME=$HADOOP_INSTALL/common
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export HADOOP_CLASSPATH=/Users/ben/Projects/hadoopexamples/marketdata/target/classes
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
#In System Preferences -> Sharing -> Remote Login (checked)
# test this with ssh and accept the key
ssh localhost
# in the HADOOP_INSTALL/conf directory
#core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
#hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
#mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
# start-dfs.sh did not work for me, even though start-all.sh is "deprecated" and its warning recommends start-dfs.sh instead
./start-all.sh
# to run a java app in hadoop (requires the classpath export from above)
hadoop com.mycompany.myproject.MyApp
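For reference, here's a rough sketch of what a driver class like MyApp might look like with the new org.apache.hadoop.mapreduce API. The package and class name match the command above, but the toy word-count mapper/reducer and the /myinputs and /myoutputs paths are just placeholders, not my actual market data code:

package com.mycompany.myproject;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyApp {

    // toy mapper: emits (token, 1) for each whitespace-separated token
    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.length() > 0) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // toy reducer: sums the counts for each token
    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // picks up core-site.xml etc. from HADOOP_CONF_DIR when launched via the hadoop command
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my app");
        job.setJarByClass(MyApp.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/myinputs"));
        FileOutputFormat.setOutputPath(job, new Path("/myoutputs"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}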
I've been running mostly in Eclipse. I used mvn archetype:generate and then mvn eclipse:eclipse to create my folder structure and a default app. Then I deleted the pom and started using Eclipse to manage the classpath. I added every jar in the HADOOP_INSTALL dir and the HADOOP_INSTALL/lib dir. After I wrote my programs, I removed all the jars I didn't recognize, ran the programs, and re-added jars one by one to fix NoClassDefFoundErrors until I had this list:
hadoop-mapred-0.21.0.jar
hadoop-hdfs-0.21.0.jar
hadoop-common-0.21.0.jar
commons-logging-1.1.1.jar
log4j-1.2.15.jar
junit-4.8.1.jar
commons-httpclient-3.1.jar
commons-cli-1.2.jar
jackson-mapper-asl-1.4.2.jar
jackson-core-asl-1.4.2.jar
avro-1.3.2.jar
commons-codec-1.4.jar
Commons HttpClient is in there because I'm RESTfully downloading sample data from Yahoo!, and Log4j is there because I'm using it :-) I generally run from the command line (except main(), which I run from Eclipse) to inspect, populate, and delete the contents of HDFS. Here are some common commands I use:
hadoop fs -cat /myinputs/csv.txt
hadoop fs -ls /myoutputs
hadoop fs -rmr /myoutputs
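The same operations are also available programmatically through the FileSystem API. Something like this (the class name is made up; the paths are the example paths above):

import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsInspect {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);

        // roughly: hadoop fs -cat /myinputs/csv.txt
        InputStream in = fs.open(new Path("/myinputs/csv.txt"));
        IOUtils.copyBytes(in, System.out, 4096, false);
        in.close();

        // roughly: hadoop fs -ls /myoutputs
        FileStatus[] statuses = fs.listStatus(new Path("/myoutputs"));
        if (statuses != null) {
            for (FileStatus status : statuses) {
                System.out.println(status.getPath());
            }
        }

        // roughly: hadoop fs -rmr /myoutputs (recursive delete)
        fs.delete(new Path("/myoutputs"), true);
    }
}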
I also use Ruby to coordinate HDFS prep: deleting files, invoking the program, processing the output, and generating HTML from that output. Using the ` (backtick) syntax you can shell out to hadoop or cat or anything else to make processing easy: `curl someurl > somefile.txt`, then `hadoop dfs -copyFromLocal URI` to get it into HDFS... or write this in Java :-)
// Imports: org.apache.commons.httpclient.HttpClient, org.apache.commons.httpclient.methods.GetMethod,
// org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path,
// org.apache.hadoop.io.IOUtils, java.io.BufferedInputStream, java.io.OutputStream, java.net.URI
BufferedInputStream in = null;
try {
    GetMethod method = new GetMethod("http://myurl.com/gooddata");
    new HttpClient().executeMethod(method);
    in = new BufferedInputStream(method.getResponseBodyAsStream());
    String destination = "hdfs://localhost/myinputs/file.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(destination), conf);
    OutputStream out = fs.create(new Path(destination), true);
    // the final "true" closes both streams when the copy finishes
    IOUtils.copyBytes(in, out, 4096, true);
} catch (Exception e) {
    // in is declared outside the try so it's still in scope here
    IOUtils.closeStream(in);
    throw e;
}
CSV is by far the easiest to process. Actual programs... next time :-)
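(Just to give a flavor before then: a mapper over CSV data mostly boils down to splitting each line on commas. The symbol/price column layout below is made up, and real input would need handling for headers, quoting, and bad rows.)

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// hypothetical mapper: emits (symbol, price) from lines like "AAPL,301.59,..."
public class CsvMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length >= 2) {
            try {
                context.write(new Text(fields[0]), new DoubleWritable(Double.parseDouble(fields[1])));
            } catch (NumberFormatException e) {
                // skip header lines or malformed rows
            }
        }
    }
}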