Thursday, October 30, 2014

Hadoop - pseudo-distributed mode setup


You can easily turn your standalone Hadoop setup into pseudo-distributed mode with the following changes ( a quick sanity check for these settings is sketched after the list )

  • In HADOOP_HOME/etc/hadoop/core-site.xml, add

        <configuration>
            <property>
                <name>fs.defaultFS</name>
                <value>hdfs://localhost:9000</value>
            </property>
        </configuration>

  • In HADOOP_HOME/etc/hadoop/hdfs-site.xml, add

        <configuration>
            <property>
                <name>dfs.replication</name>
                <value>1</value>
            </property>
        </configuration>
  • Make sure that you can connect to localhost with ssh.
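If you want to confirm that Hadoop is actually picking these files up, here is a minimal sanity-check sketch ( ConfCheck is a hypothetical name of mine, not part of Hadoop; compile it into a jar and run it with hadoop jar so that HADOOP_HOME/etc/hadoop is on the classpath ):

import org.apache.hadoop.conf.Configuration;

// Hypothetical sanity check, not part of Hadoop: prints the effective
// default filesystem URI. In pseudo-distributed mode this should print
// hdfs://localhost:9000; in standalone mode it falls back to file:///.
public class ConfCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration(); // loads core-site.xml from the classpath
        System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS", "file:///"));
    }
}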

Start and test your Hadoop setup


  • First, navigate to HADOOP_HOME
  • Format the Hadoop file system
        bin/hdfs namenode -format
  • Start the NameNode and DataNode
        sbin/start-dfs.sh

  • Now you should be able to browse the Hadoop web interface through http://localhost:50070
        and your Hadoop file system under
        Utilities > Browse the file system

  • Add /user/[username] to the Hadoop file system
        hdfs dfs -mkdir /user
        hdfs dfs -mkdir /user/[username]
        You will be able to see these directories when you browse the file system now, and you can list
        the files with
        hdfs dfs -ls [path] ( e.g.: hdfs dfs -ls / )

  • Copy the input file to the Hadoop file system
        hdfs dfs -put [local file] [destination]
        e.g.: hdfs dfs -put myinput input
        and the file will be copied to /user/[username]/input

  • Run the application with
        hadoop jar [local path to jar file] [main class] [input path in dfs] [output location in dfs]
        e.g.: hadoop jar myapp.jar test.org.AppRunner input output

The result file, part-r-00000, will be saved in the output directory in DFS ( /user/[username]/output ).
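If you prefer to check the result from code instead of the web interface, here is a minimal sketch ( PrintResult is a hypothetical helper of mine, not part of Hadoop ). It assumes the pseudo-distributed setup above and reads the part file back through the HDFS API; run it with hadoop jar so it picks up your configuration:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: opens the reducer output in HDFS and prints it
// line by line. getHomeDirectory() resolves to /user/[username] on HDFS,
// matching the directories created above.
public class PrintResult {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path result = new Path(fs.getHomeDirectory(), "output/part-r-00000");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}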

Tuesday, October 28, 2014

Setup Hadoop on Mac

It is really simple to set up Hadoop on a Mac. I tried the latest version available at the moment ( hadoop-2.5.1 ). You can set up Hadoop in standalone mode or pseudo-distributed mode on your local machine. By following the steps below you will be able to set up Hadoop on your machine in standalone mode.

( you need to have Java and SSH installed beforehand to run Hadoop )

  1. Download the version you need to install from the Apache Hadoop releases page ( https://hadoop.apache.org/releases.html )
  2. Extract the downloaded pack
    The extracted directory will be your HADOOP_HOME ( e.g.: /Users/username/hadoopDir )
  3. Add HADOOP_HOME to .bash_profile
    export HADOOP_HOME=/Users/userName/hadoop-2.5.1
    export PATH=$PATH:$HADOOP_HOME/bin

  4. Source .bash_profile to affect the new changes
    source ~/.bash_profile

    Now you should be able to echo HADOOP_HOME in the terminal ( echo $HADOOP_HOME )
  5. Make sure that you can ssh to localhost
    ssh localhost

Now your standalone Hadoop setup is ready to use.
I will share a sample MapReduce program I found that you can use to test your setup.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts for each word; also used as the combiner.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run sample

  • Create a jar from the sample
  • Create a text file whose words you want to count
  • Run
             hadoop jar [path_to_jar] [main_class] [path_to_input] [path_to_output]
             e.g.: hadoop jar wordCount.jar WordCount inputFile output
  • On a successful execution, the output directory will be created at the path you specified, and your result will be in output/part-r-00000
  • When you run the program again, you need to remove the ‘output’ directory first or give some other path for the output to be written; a sketch of doing this cleanup from Java follows below.
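As a sketch of how that cleanup could be done programmatically ( ClearOutput is a hypothetical helper, not part of the sample; it assumes the output path is passed as the first argument ):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: deletes a previous output directory so the job can
// be re-run with the same path. Run it standalone with the output path as
// the argument, or call the same logic from main() before waitForCompletion().
public class ClearOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path(args[0]);
        if (fs.exists(output)) {
            fs.delete(output, true); // true = recursive: removes the part files too
        }
    }
}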
You will see that this is really simple. 
You can find the steps for setting up Hadoop in pseudo-distributed mode in the post above.