Thursday, October 30, 2014

Hadoop - pseudo-distributed mode setup


You can easily turn your standalone Hadoop setup into pseudo-distributed mode with the following changes ( a quick sanity check for these settings is sketched after the list )

  • In HADOOP_HOME/etc/hadoop/core-site.xml, add

        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
            </property>
        </configuration>

  • In HADOOP_HOME/etc/hadoop/hdfs-site.xml, add

        <configuration>
            <property>
                <name>dfs.replication</name>
                <value>1</value>
            </property>
        </configuration>
  • Make sure that you can connect to localhost with ssh.
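If you want to confirm that Hadoop is actually picking these files up, here is a minimal sanity-check sketch ( ConfCheck is a hypothetical name of mine, not part of Hadoop; compile it into a jar and run it with hadoop jar so that HADOOP_HOME/etc/hadoop is on the classpath ):

import org.apache.hadoop.conf.Configuration;

// Hypothetical sanity check, not part of Hadoop: prints the effective
// default filesystem URI. In pseudo-distributed mode this should print
// hdfs://localhost:9000; in standalone mode it falls back to file:///.
public class ConfCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // loads core-site.xml from the classpath
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
    }
}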

Start and test your Hadoop setup


  • First, navigate to HADOOP_HOME
  • Format the Hadoop file system
        bin/hdfs namenode -format
  • Start the NameNode and DataNode
        sbin/start-dfs.sh

  • Now you should be able to browse the Hadoop web interface through http://localhost:50070
        and your Hadoop file system under
        Utilities > Browse the file system

  • Add /user/[username] to the Hadoop file system
        hdfs dfs -mkdir /user
        hdfs dfs -mkdir /user/[username]
        You will be able to see these directories when you browse the file system now, and you can list
        the files with
        hdfs dfs -ls [path] ( e.g.: hdfs dfs -ls / )

  • Copy the input file to the Hadoop file system
        hdfs dfs -put [local file] [destination]
        e.g.: hdfs dfs -put myinput input
        and the file will be copied to /user/[username]/input

  • Run the application with
        hadoop jar [local path to jar file] [main class] [input path in dfs] [output location in dfs]
        e.g.: hadoop jar myapp.jar test.org.AppRunner input output

The result file, part-r-00000, will be saved in the output directory in DFS ( /user/[username]/output ).
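If you prefer to check the result from code instead of the web interface, here is a minimal sketch ( PrintResult is a hypothetical helper of mine, not part of Hadoop ). It assumes the pseudo-distributed setup above and reads the part file back through the HDFS API; run it with hadoop jar so it picks up your configuration:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: opens the reducer output in HDFS and prints it
// line by line. getHomeDirectory() resolves to /user/[username] on HDFS,
// matching the directories created above.
public class PrintResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path(fs.getHomeDirectory(), "output/part-r-00000");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}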

Tuesday, October 28, 2014

Setup Hadoop on Mac

It is really simple to set up Hadoop on a Mac. I tried the latest version available at the moment ( hadoop-2.5.1 ). You can set up Hadoop in standalone mode or pseudo-distributed mode on your local machine. By following the steps below you will be able to set up Hadoop on your machine in standalone mode.

( you need to have Java and SSH installed beforehand to run Hadoop )

  1. Download the version you need to install from the Apache Hadoop releases page ( https://hadoop.apache.org/releases.html )
  2. Extract the downloaded pack
    The extracted directory will be your HADOOP_HOME ( e.g.: /Users/username/hadoopDir )
  3. Add HADOOP_HOME to .bash_profile
    export HADOOP_HOME=/Users/userName/hadoop-2.5.1
    export PATH=$PATH:$HADOOP_HOME/bin

  4. Source .bash_profile to affect the new changes
    source ~/.bash_profile

    Now you should be able to echo HADOOP_HOME in the terminal ( echo $HADOOP_HOME )
  5. Make sure that you can ssh to localhost
    ssh localhost

Now your standalone Hadoop setup is ready to use.
I will share a sample MapReduce program I found that you can use to test your setup.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word; also used as the combiner.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run sample

  • Create a jar from the sample
  • Create a text file whose words you want to count
  • Run
             hadoop jar [path_to_jar] [main_class] [path_to_input] [path_to_output]
             e.g.: hadoop jar wordCount.jar WordCount inputFile output
  • On a successful execution, the output directory will be created at the path you specified, and your result will be in output/part-r-00000
  • When you run the program again, you need to remove the ‘output’ directory first or give some other path for the output to be written; a sketch of doing this cleanup from Java follows below.
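As a sketch of how that cleanup could be done programmatically ( ClearOutput is a hypothetical helper, not part of the sample; it assumes the output path is passed as the first argument ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: deletes a previous output directory so the job can
// be re-run with the same path. Run it standalone with the output path as
// the argument, or call the same logic from main() before waitForCompletion().
public class ClearOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path(args[0]);
        if (fs.exists(output)) {
            fs.delete(output, true); // true = recursive: removes the part files too
        }
    }
}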
You will see that this is really simple. 
You can find the steps for setting up Hadoop in pseudo-distributed mode in the post above.