Hadoop MapReduce word count using Eclipse

This article describes how to create a Hadoop MapReduce word count example using Eclipse.

  1. Requirements
  • Eclipse Juno 32-bit
  • JDK 6.x
  • Latest Hadoop version (at the time of this writing, 1.0.4 is the latest stable version)
  • Linux Ubuntu 12.04 (although Windows can be used, Linux is a better fit for Hadoop)
  2. Set up paths

Before using Hadoop MapReduce, one must set up some environment variables:

  • JAVA_HOME
  • HADOOP_HOME (optional for this tutorial)

If you don’t know how to set these variables, a quick web search will show you how (on Ubuntu, they are typically exported in ~/.bashrc).
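As a quick sanity check, the small helper below (a hypothetical EnvCheck class, not part of the tutorial’s project) prints the variables as seen from Java:

public class EnvCheck {
    public static void main(String[] args) {
        // Print the environment variables this tutorial relies on;
        // a null value means the variable is not set in this shell.
        System.out.println("JAVA_HOME   = " + System.getenv("JAVA_HOME"));
        System.out.println("HADOOP_HOME = " + System.getenv("HADOOP_HOME"));
    }
}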

3. Import the Hadoop library

The next step is to import the Hadoop library. Note that for simplicity, we won’t use Maven here. After downloading Hadoop, go to HADOOP_HOME and copy the following files into the project’s lib folder:

  • All jar files in HADOOP_HOME (for me, that means all jar files in hadoop-1.0.4).
  • All jar files in the lib folder (for me, all jar files in hadoop-1.0.4/lib).

After that, don’t forget to add them to the project’s build path for later use. When everything is done, you should end up with something like this:

[Screenshot: the Eclipse project with the Hadoop jar files in the lib folder and on the build path]

4. The word count example

We’re going to create a simple word count example: given a text file, the program counts the occurrences of each word in it. The program consists of three classes:

  • WordCountMapper.java: the mapper.
  • WordCountReducer.java: the reducer.
  • WordCount.java: the driver, where the job configuration (input and output paths, output types, job name, and so on) is done.

WordCountMapper.java

The WordCountMapper.java contains the map() function: it receives the input file line by line (in the value variable). For each line, it emits one key-value pair (word, 1) for every word found in that line. For example, given this line:

“This is a line”

the map() function outputs four key-value pairs: (This, 1), (is, 1), (a, 1), (line, 1).

Here is the source code of WordCountMapper.java:

package fr.telecomParistech.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper 
        extends Mapper<LongWritable, Text, Text, IntWritable>{
    private static final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line on whitespace to get the individual words
        String[] words = line.split("\\s+");
        for (String w : words) {
            word.set(w);              // reuse the same Text object
            context.write(word, one); // emit (word, 1)
        }
    }
}
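To see what this produces, here is a small standalone sketch (a hypothetical MapLogicDemo class, not part of the job) that mimics the tokenization and prints the pairs map() would emit for one line:

public class MapLogicDemo {
    public static void main(String[] args) {
        String line = "This is a line";
        // Same tokenization as in map(): split on whitespace,
        // then emit one (word, 1) pair per token.
        for (String w : line.split("\\s+")) {
            System.out.println("(" + w + ", 1)");
        }
    }
}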

WordCountReducer.java

The reduce() function in WordCountReducer is even simpler: all we have to do is loop over the values of the same key and sum them up. The source code is straightforward and self-explanatory:

package fr.telecomParistech.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer 
        extends Reducer<Text, IntWritable, Text, IntWritable>{

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, 
            Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // The framework has already grouped all values for this key;
        // add them up to get the word's total count.
        for (IntWritable value : values) {
            sum += value.get();
        }

        context.write(key, new IntWritable(sum));
    }
}
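Before reduce() runs, the framework sorts and groups the map output by key, so each call receives one word together with all of its 1s. The standalone sketch below (hypothetical, not part of the job) simulates what happens for a single key:

import java.util.Arrays;
import java.util.List;

public class ReduceLogicDemo {
    public static void main(String[] args) {
        // Suppose the mappers emitted (this, 1) three times;
        // the reducer then receives ("this", [1, 1, 1]).
        List<Integer> values = Arrays.asList(1, 1, 1);
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        System.out.println("(this, " + sum + ")"); // prints (this, 3)
    }
}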

WordCount.java

WordCount.java is the driver: it is responsible for configuring and setting up the MapReduce job, and contains the job’s configuration information:

package fr.telecomParistech.mapreduce.wordcount;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {        

        String input = "test.txt";
        String output = "out";

        // Create a new job
        Job job = new Job();

        // Set the jar by class so Hadoop can locate the job in the
        // distributed environment, and give the job a readable name
        job.setJarByClass(WordCount.class);
        job.setJobName("Word Count");

        // Set input and output Path, note that we use the default input format
        // which is TextInputFormat (each record is a line of input)
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // Set Mapper and Reducer class
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Set Output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
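One optional improvement, not in the code above: because summing counts is associative, the same reducer class can also serve as a combiner, so partial sums are computed on the map side and less data is shuffled across the network:

        // Optional: pre-aggregate (word, 1) pairs locally on each mapper
        job.setCombinerClass(WordCountReducer.class);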

Note that here, we don’t explicitly set the input format for the mapper but use the default one. In fact, the default input format of the Hadoop framework is TextInputFormat, in which each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value, a Text, is the content of the line. So, an input file with the following content:

this is line 1
this is line 2
......
this is line n

is divided into the records below. Note that the keys are byte offsets, NOT line numbers (for instance, the key of line 2 is the length in bytes of line 1, including its newline).

0               , this is line 1
offset_of_line_2, this is line 2
......
offset_of_line_n, this is line n

Now, put a paragraph into “test.txt”, run the program (as a plain Java application from Eclipse, for example), and you should find the result in the “out” folder.
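With the default TextOutputFormat, the result appears in a file named part-r-00000 inside the output folder, one tab-separated word/count pair per line, sorted by key. For example, if test.txt contained only the two lines “this is line 1” and “this is line 2”, out/part-r-00000 would read something like:

1	1
2	1
is	2
line	2
this	2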
