Mapreduce Use Case for N-Gram Statistics


In this post we will provide solution to famous N-Grams calculator in Mapreduce Programming. Mapreduce Use case for N-Gram Statistics.

N-Gram:

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus.
An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”,
and so on.
If we are to implement n-gram statistics over the text corpus of all literature collections in digital libraries in MR, how would the Key-value spaces behave?

with n value as 2, the output is

Solution:

Below is the Mapreduce program which can be used to calculate the N-Grams. Here in this program, N can be passed dynamically via command line arguments.

Compile this program and build jar file and run the jar on HDFS data.


Profile photo of Siva

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

Leave a comment

Your email address will not be published. Required fields are marked *


Review Comments
default gravatar

I am a plsql developer. Intrested to move into bigdata.

Neetika Singh ITA

.