Wednesday, August 15, 2012

Hadoop 1.0.3 and Hive on Ubuntu 12.04


Download VMware Workstation

Download Ubuntu Desktop 12.04 AMD64

Create the VM image with a 400 GB disk, 2 processors, and a “bridged” network
Launch a Terminal via Dash Home
$ sudo apt-get update

Wait for Update Manager to auto-launch; Click “Install Updates”
Reboot

Launch Terminal
$ sudo nano /etc/sudoers

Copy from Guest is Ctrl-Shift-C
Paste to Guest is Ctrl-Shift-V
Append to the end of /etc/sudoers (myusername is the account created during install)
myusername ALL=(ALL) NOPASSWD: ALL

Set up passwordless SSH
$ ssh-keygen -t rsa -P ""
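
For a single-node setup, the generated public key also needs to be authorized for localhost so the Hadoop start scripts can log in without a password. A minimal sketch, assuming the default key location:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost
$ exit

Answer “yes” at the host-key prompt the first time; after that the login should not ask for a password.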

Install Oracle Java
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
$ java -version

Install Hadoop

$ wget -c http://mirror.metrocast.net/apache/hadoop/common/hadoop-1.0.3/hadoop-1.0.3-bin.tar.gz
$ tar -zxvf hadoop-1.0.3-bin.tar.gz
$ nano .bashrc

Append to .bashrc
export HADOOP_HOME=/home/myusername/hadoop-1.0.3

Close the Terminal and launch a new one to pick up the new environment variable
$ exit
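
Alternatively, re-read the file in place rather than opening a new terminal:

$ source ~/.bashrc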

Set JAVA_HOME in hadoop env
$ cd hadoop-1.0.3/conf
$ nano hadoop-env.sh

Add this line next to the commented-out JAVA_HOME entry
export JAVA_HOME=/usr/lib/jvm/java-7-oracle

Create hdfs target directories
$ mkdir ~/hdfs
$ mkdir ~/hdfs/name
$ mkdir ~/hdfs/data
$ mkdir ~/hdfs/tmp
$ sudo chmod -R 755 ~/hdfs/

Modify the config files as described in: http://cloudfront.blogspot.com/2012/07/how-to-configure-hadoop.html
$ sudo nano ~/hadoop-1.0.3/conf/core-site.xml 
$ sudo nano ~/hadoop-1.0.3/conf/hdfs-site.xml
$ sudo nano ~/hadoop-1.0.3/conf/mapred-site.xml
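
For reference, a minimal pseudo-distributed configuration that matches the directories created above looks roughly like the following sketch (the linked post is the authoritative walkthrough; these are the Hadoop 1.x property names, and the paths assume the home directory used earlier):

core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/myusername/hdfs/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/myusername/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/myusername/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>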

Format the namenode and start hadoop services
$ ~/hadoop-1.0.3/bin/hadoop namenode -format
$ ~/hadoop-1.0.3/bin/start-all.sh

Confirm services are started
$ jps
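
If everything came up, all five Hadoop daemons should appear in the listing along with Jps itself (process IDs omitted here; yours will differ):

NameNode
DataNode
SecondaryNameNode
JobTracker
TaskTracker
Jps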

Hadoop (NameNode) status: http://localhost:50070/

MapReduce (JobTracker) status: http://localhost:50030/

Install Hive

$ wget -c http://apache.claz.org/hive/hive-0.9.0/hive-0.9.0-bin.tar.gz
$ tar -xzvf hive-0.9.0-bin.tar.gz

Add these lines to ~/.bashrc and restart your terminal
export HADOOP_HOME=/home/myusername/hadoop-1.0.3
export HIVE_HOME=/home/myusername/hive-0.9.0-bin
export PATH=$HIVE_HOME/bin:$PATH
export PATH=$HADOOP_HOME/bin:$PATH
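
With the new terminal open, both installs should now resolve from the PATH:

$ hadoop version
$ which hive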

Create hive directories within hdfs and set permissions for table creation
$ hadoop fs -mkdir       /user/hive/warehouse
$ hadoop fs -mkdir       /tmp
$ hadoop fs -chmod g+w   /user/hive/warehouse
$ hadoop fs -chmod g+w   /tmp
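
A quick listing confirms the warehouse directory was created:

$ hadoop fs -ls /user/hive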

Launch hive and create sample tables
$ hive
hive> CREATE TABLE shakespeare (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> CREATE TABLE kjv (freq INT, word STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS TEXTFILE;
hive> exit;
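
To double-check the definitions before exiting, SHOW TABLES and DESCRIBE work at the same prompt:

hive> SHOW TABLES;
hive> DESCRIBE shakespeare;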

Download sample data from Cloudera
$ wget -O shakespeare.tar.gz https://github.com/cloudera/cloudera-training/blob/master/data/shakespeare.tar.gz?raw=true
$ wget -O bible.tar.gz https://github.com/cloudera/cloudera-training/blob/master/data/bible.tar.gz?raw=true
$ tar -zvxf bible.tar.gz
$ tar -zvxf shakespeare.tar.gz
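
If the archives were downloaded and unpacked in the home directory, they produce ~/input and ~/bible, which is where the -put commands below look for the data:

$ ls ~/input ~/bible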

Put the Shakespeare sample data into hdfs
$ hadoop fs -mkdir shakespeare-input
$ hadoop fs -put ~/input/all-shakespeare /user/myusername/shakespeare-input
$ hadoop fs -ls shakespeare-input

Run the “grep” sample against the hdfs directory “shakespeare-input” and place results in “shakespeare_freq”
$ hadoop jar ~/hadoop-1.0.3/hadoop-examples-1.0.3.jar grep shakespeare-input shakespeare_freq '\w+'
$ hadoop fs -ls shakespeare_freq
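
Each line of the grep output is a count, a tab, and the matching word, which is why the Hive tables above declare freq before word. You can peek at the first few lines with something like this (the part file name assumes the default single reducer):

$ hadoop fs -cat shakespeare_freq/part-00000 | head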

Put the bible sample data into hdfs
$ hadoop fs -mkdir bible-input
$ hadoop fs -put ~/bible/all-bible /user/myusername/bible-input
$ hadoop fs -ls bible-input

Run the “grep” sample against the hdfs directory “bible-input” and place results in “bible_freq”
$ hadoop jar ~/hadoop-1.0.3/hadoop-examples-1.0.3.jar grep bible-input bible_freq '\w+'
$ hadoop fs -ls bible_freq

Clean up the job logs so they are not loaded into the Hive tables along with the data
$ hadoop fs -rmr bible_freq/_logs
$ hadoop fs -rmr shakespeare_freq/_logs

Open Hive
$ hive
hive> load data inpath "shakespeare_freq" into table shakespeare;
hive> select * from shakespeare limit 10;
hive> select * from shakespeare where freq > 20 sort by freq asc limit 10;
hive> select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;
hive> explain select freq, count(1) as f2 from shakespeare group by freq sort by f2 desc limit 10;
hive> load data inpath "bible_freq" into table kjv;
hive> create table merged (word string, shake_f int, kjv_f int);
hive> insert overwrite table merged select s.word, s.freq, k.freq from shakespeare s join kjv k on (s.word = k.word) where s.freq >= 1 and k.freq >= 1;
hive> select * from merged limit 20;
hive> select word, shake_f, kjv_f, (shake_f + kjv_f) as ss from merged sort by ss desc limit 20;

Now you know; and knowing is half the battle
Total MapReduce CPU Time Spent: 6 seconds 140 msec
OK
the   25848 62394 88242
and   19671 38985 58656
of    16700 34654 51354
I     23031 8854  31885
to    18038 13526 31564
in    10797 12445 23242
a     14170 8057  22227
that  8869  12603 21472
And   7800  12846 20646
is    8882  6884  15766
my    11297 4135  15432
you   12702 2720  15422
he    5720  9672  15392
his   6817  8385  15202
not   8409  6591  15000
be    6773  6913  13686
for   6309  7270  13579
with  7284  6057  13341
it    7178  5917  13095
shall 3293  9764  13057
Time taken: 67.711 seconds

Comments:

  1. 1024M... My machine only has 4G but it ran fine and I was able to run three VMs at 1024 without noticeable lag.

  2. Will try to replicate your steps, but I plan on doing a couple of things differently. Instead of modifying the sudoers file as you have proposed, I will create a separate group and dedicated Hadoop user, generate a passwordless RSA keypair (as you also do), add it to the .ssh/authorized_keys, and do an SSH login to localhost so that the server is a known host.

  3. Thanks for taking a look! The "sudoers" technique is probably overly permissive, but since it is running on a VM on my local machine, I thought it was OK. Your alterations are likely better for a production-like setup or a multi-node cluster.

  4. I am getting an error while formatting the namenode:

    Re-format filesystem in /mnt/data/hivedata/hdfs/name ? (Y or N) y
    Format aborted in /mnt/data/hivedata/hdfs/name

  5. Thank you for the wonderful tutorial; it all pretty much worked for me through to the end. Thanks.
