MapReduce Use Case for N-Gram Statistics


In this post we provide a solution to the well-known N-gram calculation problem using MapReduce programming.

N-Gram:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.
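
To make the definition concrete, here is a minimal sketch in plain Java that lists the word-level n-grams of a piece of text. The class and method names (NGramExtractor, ngrams) are hypothetical, and the sketch assumes that the items are words separated by whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of any library.
class NGramExtractor {

    // Returns every contiguous run of n words in the text, in order of appearance.
    // Assumption: words are separated by whitespace.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (n < 1 || trimmed.isEmpty()) {
            return result; // nothing to extract
        }
        String[] words = trimmed.split("\\s+");
        // The last valid start index is words.length - n.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [the quick, quick brown, brown fox]
        System.out.println(ngrams("the quick brown fox", 2));
    }
}
```

A text of w words yields w - n + 1 n-grams when w is at least n, and none otherwise.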
If we implement n-gram statistics in MapReduce over the text corpus of all literature collections in digital libraries, how would the key-value spaces behave?

With n set to 2, the output is

Solution:

Below is the MapReduce program that calculates the N-grams. The value of N is passed at run time as a command-line argument.
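
The map and reduce logic of such a job can be sketched in plain Java, without the Hadoop API. This is a minimal illustration under stated assumptions, not the program from this post: the names NGramCountSketch, map, reduce and run are hypothetical, words are assumed to be separated by whitespace, n-grams are assumed not to span input lines, and N is read from the first command-line argument with a default of 2.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map and reduce phases.
class NGramCountSketch {

    // Map phase: for one input line, emit an (n-gram, 1) pair per n-gram found.
    static List<Map.Entry<String, Integer>> map(String line, int n) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String trimmed = line.trim();
        if (n < 1 || trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder gram = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) {
                    gram.append(' '); // separator between words, not before the first
                }
                gram.append(words[i + j]);
            }
            out.add(new AbstractMap.SimpleEntry<String, Integer>(gram.toString(), 1));
        }
        return out;
    }

    // Shuffle and reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by key
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs.
    static Map<String, Integer> run(List<String> lines, int n) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line, n));
        }
        return reduce(emitted);
    }

    public static void main(String[] args) {
        // Assumption: N is the first argument; default to bigrams.
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 2;
        List<String> lines = Arrays.asList("to be or not to be");
        System.out.println(run(lines, n));
    }
}
```

For the sample line and N = 2, the bigram "to be" is counted twice and the other three bigrams once each. The key space is the set of distinct n-grams, so it grows with both the vocabulary and N, while every value emitted by the map step is the constant 1.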

Compile the program, build a JAR file, and run the JAR against data stored in HDFS.


About Siva

Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.


2 thoughts on “Mapreduce Use Case for N-Gram Statistics”

  • negar

    Dear

    Thanks for your post.

    I want to know where we can set the value for bigrams?

    Also, I get an error on this line of code:

    ” for(int j=0;j<va;j++) { if(j>0) { ”

    Thanks in advance