In this post we provide a solution to the well-known n-gram calculation problem in MapReduce programming: a MapReduce use case for n-gram statistics.
N-Gram:
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
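For example, the bigrams of the phrase “hadoop is big data” are “hadoop is”, “is big” and “big data”. Before looking at the MapReduce version, here is a minimal plain-Java sketch of this sliding-window extraction (the class and method names are chosen here only for illustration):

import java.util.ArrayList;
import java.util.List;

public class NgramDemo {

    // Return the n-grams of a whitespace-separated text as strings.
    static List<String> ngrams(String text, int n) {
        String[] tokens = text.split("\\s+");
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) {
                    sb.append(" ");
                }
                sb.append(tokens[i + j]);
            }
            grams.add(sb.toString());
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [hadoop is, is big, big data]
        System.out.println(ngrams("hadoop is big data", 2));
    }
}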
If we were to implement n-gram statistics in MapReduce over a text corpus, say the literature collections of digital libraries, how would the key-value pairs behave? Consider the sample input below:
hadoop bigdata is hadoop bigdata hadoop complex is hadoop hadoop bigdata hadoop bigdata mapreduce hive mapreduce hive
With n = 2, the output is:
bigdata hadoop   2
bigdata is   1
bigdata mapreduce   1
complex is   1
hadoop bigdata   4
hadoop complex   1
hadoop hadoop   1
hive mapreduce   1
is hadoop   2
mapreduce hive   1
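To answer the key-value question above: the mapper emits every n-gram as a Text key with an IntWritable count of 1, and the reducer sums the counts for each key, just like the classic word count but with n-word keys. For n = 2 the first few intermediate pairs produced from the sample input are

(hadoop bigdata, 1)
(bigdata is, 1)
(is hadoop, 1)
(hadoop bigdata, 1)
...

and the reducer turns the grouped values for “hadoop bigdata” into (hadoop bigdata, 4).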
Solution:
Below is the MapReduce program that calculates the n-grams. In this program, N is passed dynamically as a command-line argument.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Ngramstatistics {

    public static class Map1 extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();
        // All tokens of the input split are buffered here; the n-grams are
        // generated once, in cleanup(), after the whole split has been read.
        List<String> ls = new ArrayList<String>();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Collect the tokens of every input line into the shared list.
            StringTokenizer dt = new StringTokenizer(value.toString(), " ");
            while (dt.hasMoreTokens()) {
                ls.add(dt.nextToken());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // N is read from the job configuration ("grams"), set from the command line.
            int va = Integer.parseInt(context.getConfiguration().get("grams"));
            StringBuffer str = new StringBuffer("");
            // Slide a window of length N over the token list and emit each
            // n-gram with a count of 1. Note the bound i < ls.size() - va:
            // the last possible n-gram of the list is not emitted, which is
            // what produces the sample output shown above.
            for (int i = 0; i < ls.size() - va; i++) {
                int k = i;
                for (int j = 0; j < va; j++) {
                    if (j > 0) {
                        str = str.append(" ");
                        str = str.append(ls.get(k));
                    } else {
                        str = str.append(ls.get(k));
                    }
                    k++;
                }
                word.set(str.toString());
                str = new StringBuffer("");
                context.write(word, one);
            }
        }
    }

    public static class Reduce1 extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s emitted for each n-gram key.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // Delete the output directory of a previous run, if any.
        FileUtil.fullyDelete(new File(args[1]));
        Configuration conf = new Configuration();
        // N is passed as the third command-line argument.
        conf.set("grams", args[2]);
        Job job = new Job(conf, "ngramstatistics");
        job.setJarByClass(Ngramstatistics.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(Map1.class);
        job.setReducerClass(Reduce1.class);
        job.waitForCompletion(true);
        System.out.println("Done.");
    }
}
Compile the program, build a jar file, and run the jar against the data in HDFS.
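For example, assuming the class is packed into a jar named ngrams.jar (the jar name and the paths below are only illustrative), a bigram run could look like this:

hadoop jar ngrams.jar Ngramstatistics /user/hadoop/ngrams/input /user/hadoop/ngrams/output 2

The first argument is the input path, the second is the output path, and the third is the value of N: 2 for bigrams, 3 for trigrams, and so on.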
Dear author,
Thanks for your post.
I want to know where we can set the value for bigrams?
Also, I get an error at this line of the code:
“ for (int j = 0; j < va; j++) { if (j > 0) { ”
Thanks in advance.