Wednesday, August 22, 2012

RoR setup on Ubuntu

This was straightforward, but no single site seemed to have all the steps:


$ sudo apt-get update
$ sudo apt-get install ruby1.9.3
$ sudo apt-get install libsqlite3-dev
$ sudo apt-get install nodejs
$ sudo gem install rails
$ mkdir rails-apps
$ cd rails-apps/
$ rails new my_app
$ cd my_app
$ rails server
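
A quick sanity check (optional, and not part of the original steps); the development server listens on port 3000 by default:

$ rails -v
$ gem list rails

Then browse to http://localhost:3000 while "rails server" is running to see the default Rails welcome page.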


QED

Wednesday, August 15, 2012

Hadoop 1.0.3 and Hive on Ubuntu 12.04


Download VMware Workstation

Download the Ubuntu Desktop 12.04 (amd64) ISO

Create the VM with a 400 GB disk, 2 processors and a “bridged” network
Launch a Terminal via Dash Home
$ sudo apt-get update

Wait for Update Manager to auto-launch, then click “Install Updates”
Reboot

Launch Terminal
$ sudo nano /etc/sudoers

Copy from Guest is Ctrl-Shift-C
Paste to Guest is Ctrl-Shift-V
Append to the end of /etc/sudoers (substitute your own username for myusername)
myusername ALL=(ALL) NOPASSWD: ALL

Set up passwordless SSH (the Hadoop start scripts ssh to localhost)
$ ssh-keygen -t rsa -P ""
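
If ssh to localhost still prompts for a password (or the connection is refused), these extra steps may be needed; openssh-server is not installed by default on Ubuntu Desktop:

$ sudo apt-get install openssh-server
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost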

Install Oracle Java
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
$ java -version

Install Hadoop

$ wget -c http://mirror.metrocast.net/apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
$ tar -zxvf hadoop-1.0.3-bin.tar.gz
$ nano .bashrc

Append to .bashrc
export HADOOP_HOME=/home/myusername/hadoop-1.0.3

Close the Terminal and launch a new one to pick up the new environment variable
$ exit

Set JAVA_HOME in hadoop env
$ cd hadoop-1.0.3/conf
$ nano hadoop-env.sh

Add this line below the commented-out JAVA_HOME entry
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Create hdfs target directories
$ mkdir ~/hdfs
$ mkdir ~/hdfs/name
$ mkdir ~/hdfs/data
$ mkdir ~/hdfs/tmp
$ sudo chmod -R 755 ~/hdfs/

Modify the config files as described in: http://cloudfront.blogspot.com/2012/07/how-to-configure-hadoop.html
$ sudo nano ~/hadoop-1.0.3/conf/core-site.xml 
$ sudo nano ~/hadoop-1.0.3/conf/hdfs-site.xml
$ sudo nano ~/hadoop-1.0.3/conf/mapred-site.xml
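
For reference, a minimal pseudo-distributed configuration that matches the ~/hdfs directories created above might look like the following; this is a sketch rather than a copy of the linked post, and the localhost:9000/9001 values are just conventional choices:

cat > ~/hadoop-1.0.3/conf/core-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
  <property><name>hadoop.tmp.dir</name><value>/home/myusername/hdfs/tmp</value></property>
</configuration>
EOF

cat > ~/hadoop-1.0.3/conf/hdfs-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>dfs.name.dir</name><value>/home/myusername/hdfs/name</value></property>
  <property><name>dfs.data.dir</name><value>/home/myusername/hdfs/data</value></property>
  <property><name>dfs.replication</name><value>1</value></property>
</configuration>
EOF

cat > ~/hadoop-1.0.3/conf/mapred-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
</configuration>
EOF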

Format the namenode and start hadoop services
$ ~/hadoop-1.0.3/bin/hadoop namenode -format
$ ~/hadoop-1.0.3/bin/start-all.sh

Confirm the services started (you should see NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker alongside Jps)
$ jps

Hadoop (NameNode) status: http://localhost:50070/

MapReduce (JobTracker) status: http://localhost:50030/

Install Hive

$ wget -c http://apache.claz.org/hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
$ tar -xzvf hive-0.9.0-bin.tar.gz

Add these lines to ~/.bashrc and restart your terminal
export HADOOP_HOME=/home/myusername/hadoop-1.0.3
export HIVE_HOME=/home/myusername/hive-0.9.0-bin
export PATH=$HIVE_HOME/bin:$PATH
export PATH=$HADOOP_HOME/bin:$PATH

Create hive directories within hdfs and set permissions for table create
$ hadoop fs -mkdir       /user/hive/warehouse
$ hadoop fs -mkdir       /tmp
$ hadoop fs -chmod g+w   /user/hive/warehouse
$ hadoop fs -chmod g+w   /tmp

Launch hive and create sample tables
$ hive
hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> CREATE TABLE kjv (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> exit;

Download sample data from Cloudera
$ wget -O shakespeare.tar.gz https://github.com/cloudera/cloudera-training/blob/master/data/shakespeare.tar.gz?raw=true
$ wget -O bible.tar.gz https://github.com/cloudera/cloudera-training/blob/master/data/bible.tar.gz?raw=true
$ tar -zvxf bible.tar.gz
$ tar -zvxf shakespeare.tar.gz

Put the Shakespeare sample data into hdfs
$ hadoop fs -mkdir shakespeare-input
$ hadoop fs -put ~/input/all-shakespeare /user/myusername/shakespeare-input
$ hadoop fs -ls shakespeare-input

Run the “grep” sample against the hdfs directory “shakespeare-input” and place results in “shakespeare_freq”
$ hadoop jar ~/hadoop-1.0.3/hadoop-examples-1.0.3.jar grep shakespeare-input shakespeare_freq '\w+'
$ hadoop fs -ls shakespeare_freq

Put the bible sample data into hdfs
$ hadoop fs -mkdir bible-input
$ hadoop fs -put ~/bible/all-bible /user/myusername/bible-input
$ hadoop fs -ls bible-input

Run the “grep” sample against the hdfs directory “bible-input” and place results in “bible_freq”
$ hadoop jar ~/hadoop-1.0.3/hadoop-examples-1.0.3.jar grep bible-input bible_freq '\w+'
$ hadoop fs -ls bible_freq

Cleanup the logs
$ hadoop fs -rmr bible_freq/_logs
$ hadoop fs -rmr shakespeare_freq/_logs

Open Hive
$ hive
hive> load data inpath "shakespeare_freq" into table shakespeare;
hive> select * from shakespeare limit 10;
hive> select * from shakespeare where freq > 20 sort by freq asc limit 10;
hive> select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;
hive> explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;
hive> load data inpath "bible_freq" into table kjv;
hive> create table merged (word string, shake_f int, kjv_f int);
hive> insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word) where s.freq >= 1 and k.freq >= 1;
hive> select * from merged limit 20;
hive> select word, shake_f, kjv_f, (shake_f + kjv_f) as ss from merged sort by ss desc limit 20;

Now you know; and knowing is half the battle. The output of that last query:
Total MapReduce CPU Time Spent: 6 seconds 140 msec
OK
the   25848 62394 88242
and   19671 38985 58656
of    16700 34654 51354
I     23031 8854  31885
to    18038 13526 31564
in    10797 12445 23242
a     14170 8057  22227
that  8869  12603 21472
And   7800  12846 20646
is    8882  6884  15766
my    11297 4135  15432
you   12702 2720  15422
he    5720  9672  15392
his   6817  8385  15202
not   8409  6591  15000
be    6773  6913  13686
for   6309  7270  13579
with  7284  6057  13341
it    7178  5917  13095
shall 3293  9764  13057
Time taken: 67.711 seconds

Tuesday, May 31, 2011

Three Android room-for-improvements

I believe the biggest gaps in the Android development experience are:

1. Poor support for unit tests -- there should be an emulator-free test mechanism that starts up instantly (or, worst case, in the same startup timeframe as a Spring app context).

2. No native dependency injection -- RoboGuice is maturing, but for the platform to miss this feature is a little shocking.

3. The use of XML for view layouts -- it seems like Java classes with static methods or constants that return instances of the layout metadata would have been easy to implement and would have eliminated the need for R.java. Or a DSL (in Groovy, perhaps) would have been more compact and flexible.

Wednesday, March 9, 2011

Hadoop setup for Mac

I bought The Definitive Guide (paperback; should have gone Kindle for iPad) and read the tutorials and blogs. Maybe it was just me, but I feel like the changing APIs, changing names and undocumented features made it too difficult to get a simple Hadoop project up and running. I'm not a huge Maven fan, but when an Apache project doesn't use it... (but I digress.) This was done with version 0.21.0, which is not available on every mirror and is categorized as a minor release. I found it here:
http://mirror.its.uidaho.edu/pub/apache//hadoop/core/hadoop-0.21.0/
and unzipped it into
/Users/ben/Downloads/hadoop-0.21.0

These are the steps that worked for me:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_INSTALL=/Users/ben/Downloads/hadoop-0.21.0
export PATH=$PATH:$HADOOP_INSTALL/bin
export HADOOP_COMMON_HOME=$HADOOP_INSTALL/common
export HADOOP_CONF_DIR=$HADOOP_INSTALL/conf
export HADOOP_CLASSPATH=/Users/ben/Projects/hadoopexamples/marketdata/target/classes

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

#In System Preferences -> Sharing -> Remote Login (checked)
# test this with ssh and accept the key
ssh localhost

# in the HADOOP_INSTALL/conf directory
#core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>

#hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

#mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
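
One step not shown above that a fresh install will likely need: format the namenode once before the first start. In 0.21.0 the hdfs script (already on the PATH via the exports above) should handle it; a hedged aside, so adjust to your layout:

# one-time only; wipes any existing HDFS metadata
hdfs namenode -format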



# start-dfs.sh does not work, even though start-all.sh claims to be "deprecated" and its warning recommends start-dfs.sh instead
./start-all.sh

# to run a java app in hadoop (requires the classpath export from above)
hadoop com.mycompany.myproject.MyApp

I've been running mostly in Eclipse.  I used mvn archetype:generate and then mvn eclipse:eclipse to create my folder structure and default app.  Then I deleted the pom and started using Eclipse to manage the classpath.  I added every jar in the HADOOP_INSTALL dir and the HADOOP_INSTALL/lib dir.  After I wrote my programs, I removed all the jars I didn't recognize, ran the programs, and re-added them one by one to fix NoClassDefFoundErrors until I had this list:
hadoop-mapred-0.21.0.jar
hadoop-hdfs-0.21.0.jar
hadoop-common-0.21.0.jar
commons-logging-1.1.1.jar
log4j-1.2.15.jar
junit-4.8.1.jar
commons-httpclient-3.1.jar
commons-cli-1.2.jar
jackson-mapper-asl-1.4.2.jar
jackson-core-asl-1.4.2.jar
avro-1.3.2.jar
commons-codec-1.4.jar

Commons HttpClient is in there because I'm RESTfully downloading sample data from Yahoo!  And Log4j is in there because I'm using it :-)  I generally run from the command line (except main(), which I run from Eclipse) to inspect, populate and delete the contents of hdfs.  Here are some common commands I use:
hadoop fs -cat /myinputs/csv.txt
hadoop fs -ls /myoutputs
hadoop fs -rmr /myoutputs

I also use Ruby to coordinate hdfs prep: deleting files, invoking the program, processing the output and generating html from the output data.  Using the ` (backtick) syntax you can shell out to hadoop or cat or anything to make processing easy: `curl someurl > somefile.txt`, then `hadoop dfs -copyFromLocal URI` into HDFS... or write it in Java :-)
// Imports assume commons-httpclient 3.x and the Hadoop jars from the list above
import java.io.BufferedInputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

BufferedInputStream in = null;
try {
    GetMethod method = new GetMethod("http://myurl.com/gooddata");
    new HttpClient().executeMethod(method);
    in = new BufferedInputStream(method.getResponseBodyAsStream());
    String destination = "hdfs://localhost/myinputs/file.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(destination), conf);
    OutputStream out = fs.create(new Path(destination), true);
    // copyBytes(..., 4096, true) closes both streams when the copy finishes
    IOUtils.copyBytes(in, out, 4096, true);
} catch (Exception e) {
    IOUtils.closeStream(in);
    throw e;
}

CSV is by far the easiest to process.  Actual programs... next time :-)

Tuesday, March 8, 2011

This one goes to eleven

A client and I were recently trying to articulate best practices with respect to standardizing project structures, continuous integration methodology, environmental configurations and deployment techniques.  I wrote this short "manifesto" to describe the ideals without naming specific technologies.  The ideals theoretically will not change even though the tools are guaranteed to evolve over time.  Perhaps techniques for achieving each ideal will be fodder for future blog posts... just in time for blogs to be out of vogue!



  1. All environments should have working instances of all applications, follow a consistent naming scheme and have similar structure. 
  2. All environments should be self-contained and depend only on application instances in the same environment. 
  3. All applications should have the same folder structure and the same techniques for cleaning, compiling, running, building, releasing, branching, tagging and deploying. 
  4. All application source code will live in the same version control system and follow a similar folder structure. 
  5. All applications should expose a representation of their own health and the health of their necessary dependencies. 
  6. All applications should require only a checkout and run to be successfully launched in local properties mode. 
  7. All stateless code should be unit tested thoroughly but not excessively, and tests should be run automatically on every commit. 
  8. All stateful code should be tested on a schedule according to established best practices. 
  9. All properties files should have a consistent naming scheme, validation mechanism, inheritance/override paradigm and version control methodology. 
  10. All persistent data stores should be refreshable to a known state and version that is enhanced as features are added. 
  11. All deployments should be centrally executed by a single continuous integration server accessible by role.

Monday, April 5, 2010

S3 vs DB Data Validations

Hi,
Relational databases have had decades of feature enhancements and maturing best practices to handle the thorny data integrity problems that arise in enterprise software construction. Cloud storage like Amazon S3 brings some very cool capabilities but hasn't had time to develop secondary features such as schema validation. The scenarios below occurred on a real client project while storing user preference information.

1. Your program posts a JSON structure to S3, but a required field such as zip code is missing and other apps require it. Databases handle this very simply, but there are no "not null" constraints in S3.
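
Until something better exists, the only option is validating on the client before the PUT. A crude sketch (the file name and field are purely illustrative):

# refuse to upload a preferences document that has no zip_code field
grep -q '"zip_code"' user_prefs.json || { echo "missing zip_code"; exit 1; }
# ...then PUT user_prefs.json to S3 as usual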

2. Your program posts a JSON structure that references a key from another data store, such as user_id, but that User doesn't exist or is inactive. Databases provide foreign key constraints or arc constraints to deal with this scenario.

3. Your application posts a JSON structure and then immediately does an HTTP GET on the new resource.  It's possible the data hasn't replicated across S3 nodes yet (eventual consistency) and you'll actually get a 404 or an older version of the data.  Amazon may be implementing a flag where you can indicate "block-until-consistent" but that is not currently available.
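
Until such a flag exists, the usual workaround is a bounded retry after the PUT. A rough sketch with curl (the URL and timing are illustrative only):

# poll the new resource until replication catches up, giving up after ~10 tries
for i in $(seq 1 10); do
  status=$(curl -s -o /dev/null -w '%{http_code}' https://mybucket.s3.amazonaws.com/prefs/123.json)
  [ "$status" = "200" ] && break
  sleep 1
done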

There are other shortcomings, such as session-based commit and rollback with session-only visibility until the data is committed. I understand that the purpose of RESTful "bucket" storage isn't to re-implement the database, that they're solving different problems, and that it is up to the architect to select the proper persistence mechanism based on specific requirements. However, I predict that standards will emerge allowing XSD-style validity enforcement. If these exist already, great! Please post a comment with links :-)

Code well,
Ben

Inevitable First Post

This blog will capture advanced technology related musings, prototypes and research. The definition of advanced tech will change over time but currently consists of  cloud computing, mobile application development, distributed computing (e.g. MapReduce) and rapid web application development (e.g. Grails).