Map Reduce


Hadoop Performance Tuning

Hadoop Performance Tuning: There are many ways to improve the performance of Hadoop jobs. In this post, we will provide a few MapReduce properties that can be used at various MapReduce phases to improve performance. There is no one-size-fits-all technique for tuning Hadoop jobs; because of Hadoop's architecture, achieving balance among resources is often more effective than addressing a single problem. Depending on the type of job […]
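As a minimal sketch (the property values below are illustrative starting points, not recommendations from the post), a few of these knobs can be set on the job's Configuration before submission:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative values only; the right numbers depend on cluster resources and job profile.
    conf.setInt("mapreduce.task.io.sort.mb", 256);           // map-side sort buffer size (MB)
    conf.setInt("mapreduce.task.io.sort.factor", 50);        // streams merged at once during sorts
    conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate map output
    conf.setFloat("mapreduce.reduce.shuffle.input.buffer.percent", 0.70f); // heap fraction for shuffle

    Job job = Job.getInstance(conf, "tuned job");
    // ... set mapper, reducer, input and output as usual, then job.waitForCompletion(true)
  }
}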


Mapreduce Use Case for N-Gram Statistics

In this post, we will provide a solution to the famous N-Grams calculator in MapReduce programming. N-Gram: In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams are typically collected from a text or speech corpus. An n-gram […]
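A hedged sketch of the map side of a word-level n-gram counter; the property name "ngram.size" and the class name are assumptions for illustration, not taken from the post:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (n-gram, 1) for every contiguous sequence of n words in a line.
// n is read from the hypothetical property "ngram.size" (default 2).
public class NGramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private int n;

  @Override
  protected void setup(Context context) {
    n = context.getConfiguration().getInt("ngram.size", 2);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().toLowerCase().split("\\s+");
    for (int i = 0; i + n <= words.length; i++) {
      StringBuilder gram = new StringBuilder();
      for (int j = 0; j < n; j++) {
        if (j > 0) gram.append(' ');
        gram.append(words[i + j]);
      }
      context.write(new Text(gram.toString()), ONE);
    }
  }
}

Paired with an ordinary sum reducer (like the word count reducer), this yields a count per n-gram.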


Mapreduce Use Case to Calculate PageRank

PageRank is a way of measuring the importance of website pages. PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites. In the general case, the PageRank value for any page u can be expressed as PR(u) = Σ_{v ∈ B_u} PR(v) / L(v), i.e. the PageRank […]
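A hedged sketch of the map side of one PageRank iteration; the tab-separated input layout (page, current rank, comma-separated outlinks) and the class name are assumptions for illustration, not the post's exact code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumes each input line is "page<TAB>currentRank<TAB>outlink1,outlink2,...".
public class PageRankMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t");
    String page = parts[0];
    double rank = Double.parseDouble(parts[1]);
    String[] outlinks = parts.length > 2 ? parts[2].split(",") : new String[0];

    // Pass the link structure through so the reducer can rebuild the graph for the next iteration.
    context.write(new Text(page), new Text("LINKS\t" + (parts.length > 2 ? parts[2] : "")));

    // Distribute this page's rank evenly across its outgoing links: the PR(v) / L(v) term.
    for (String target : outlinks) {
      context.write(new Text(target), new Text(Double.toString(rank / outlinks.length)));
    }
  }
}

The reducer would then sum the contributions received for each page to form PR(u), applying a damping factor if one is used.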


MRUnit Example for WordCount Algorithm

In this post, we will discuss a basic MRUnit example for the WordCount algorithm. Below are the tools used in this example: Eclipse 3.8 and mrunit-1.0.0-hadoop2.jar. Procedure: 1. Download the MRUnit jar from this link and add it to the Java project build path (File –> properties –> java build path –> add external jars) in Eclipse. 2. As we are testing the WordCount algorithm…Below is the code for the same.

3. To […]
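As a rough illustration of the testing step, an MRUnit test class along these lines can exercise the mapper and reducer in isolation; WordCountMapper and WordCountReducer stand in for the classes shown in the post:

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

// Assumes a Mapper<LongWritable, Text, Text, IntWritable> and a
// Reducer<Text, IntWritable, Text, IntWritable> like the ones in the post.
public class WordCountTest {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

  @Before
  public void setUp() {
    mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    reduceDriver = ReduceDriver.newReduceDriver(new WordCountReducer());
  }

  @Test
  public void testMapper() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("cat cat dog"))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("cat"), new IntWritable(1))
             .withOutput(new Text("dog"), new IntWritable(1))
             .runTest();
  }

  @Test
  public void testReducer() throws Exception {
    reduceDriver.withInput(new Text("cat"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("cat"), new IntWritable(2))
                .runTest();
  }
}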


Merging Small Files Into Avro File

This post is a continuation of the previous post on working with the small files issue. In the previous post, we merged a huge number of small files in an HDFS directory into a SequenceFile; in this post, we will merge a huge number of small files on the local file system into an Avro file in an HDFS output directory. We will store the file names as keys and file contents as values in the Avro file. We will […]
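A minimal sketch of the idea, assuming a simple two-field record schema (filename string, contents bytes) and illustrative command-line arguments for the local input directory and the HDFS output file; it is not the post's exact program:

import java.io.File;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Packs every file in a local directory into one Avro container file on HDFS.
public class SmallFilesToAvro {
  private static final String SCHEMA_JSON =
      "{\"type\":\"record\",\"name\":\"SmallFile\",\"fields\":["
      + "{\"name\":\"filename\",\"type\":\"string\"},"
      + "{\"name\":\"contents\",\"type\":\"bytes\"}]}";

  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
    FileSystem fs = FileSystem.get(new Configuration());
    try (OutputStream out = fs.create(new Path(args[1]));               // HDFS output file
         DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, out);
      for (File f : new File(args[0]).listFiles()) {                    // local input directory
        GenericRecord record = new GenericData.Record(schema);
        record.put("filename", f.getName());
        record.put("contents", ByteBuffer.wrap(Files.readAllBytes(f.toPath())));
        writer.append(record);
      }
    }
  }
}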


Merging Small Files into SequenceFile

In this post, we will discuss one of the famous use cases of SequenceFiles: merging a large number of small files into a SequenceFile. This requirement arises mainly because Hadoop and MapReduce cannot efficiently process a large number of small files. Need for Merging Small Files: As Hadoop stores all HDFS file metadata in the namenode's main memory (which is limited) for […]
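A minimal sketch of the merge, assuming the local input directory and the HDFS output path are passed as arguments; file names become Text keys and raw file bytes become BytesWritable values:

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Packs the files of a local directory into a single SequenceFile on HDFS.
public class SmallFilesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path(args[1])),        // HDFS output file
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class));
    try {
      for (File f : new File(args[0]).listFiles()) {         // local input directory
        byte[] data = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(data));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}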


Avro MapReduce 2 API Example

Avro provides support for both the old MapReduce package API (org.apache.hadoop.mapred) and the new MapReduce package API (org.apache.hadoop.mapreduce). Avro data can be used as both input to and output from a MapReduce job, as well as the intermediate format. In this post, we will provide an example run of the Avro MapReduce 2 API. This post can be treated as a continuation of the previous post on the Avro MapReduce API. In this post, we will create […]
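As a hedged sketch of how the new-API Avro classes are wired into a driver (not necessarily the exact job from the post), a text-in / Avro-out word count could be configured like this; WordCountMapper and AvroSumReducer are placeholders for classes like those in the posts:

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Reads plain text, writes Avro key/value output using the org.apache.avro.mapreduce classes.
public class AvroWordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "avro word count");
    job.setJarByClass(AvroWordCountDriver.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setMapperClass(WordCountMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    job.setReducerClass(AvroSumReducer.class);
    job.setOutputFormatClass(AvroKeyValueOutputFormat.class);
    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.STRING));
    AvroJob.setOutputValueSchema(job, Schema.create(Schema.Type.INT));

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}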


Avro MapReduce Word Count Example

In this post, we will discuss the famous word count example through MapReduce and create a sample Avro data file in the Hadoop Distributed File System. Prerequisite: In order to execute the MapReduce word count program given in this post, we need the avro-mapred-1.7.4-hadoop2.jar file to be present in the $HADOOP_HOME/share/hadoop/common/lib directory. This jar contains the classes used for Avro serialization and deserialization through the MapReduce framework. For instructions on installation and integration of […]
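A possible reducer matching the driver sketched above (again an assumption-laden sketch, not the post's exact code): it sums the counts for each word and emits them as an Avro string/int key-value pair:

import java.io.IOException;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums counts per word and writes them as Avro key/value output
// (works with AvroKeyValueOutputFormat and string/int output schemas set in the driver).
public class AvroSumReducer
    extends Reducer<Text, IntWritable, AvroKey<CharSequence>, AvroValue<Integer>> {
  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();
    }
    context.write(new AvroKey<CharSequence>(word.toString()), new AvroValue<Integer>(sum));
  }
}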


MapReduce Multiple Outputs Use case

Use Case Description: In this post, we will discuss the usage of the MapReduce MultipleOutputs output format in MapReduce jobs by taking one real-world use case. We are considering a use case that generates multiple output file names from the reducer, where the file names are based on certain input data parameters, i.e. we need control over the naming of the files. In this scenario, we […]
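A minimal reducer sketch of this pattern, assuming the grouping key (for example a country code) doubles as the output file name; the key/value layout is illustrative, and the third argument to write() becomes the base name of the generated file (e.g. "US-r-00000"):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CountryReducer extends Reducer<Text, Text, Text, Text> {
  private MultipleOutputs<Text, Text> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      // Third argument is the base output file name, derived from the data itself.
      multipleOutputs.write(key, value, key.toString());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    multipleOutputs.close();
  }
}

In the driver, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) is commonly used alongside MultipleOutputs so that empty default part files are not created.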


Mapreduce Program to calculate Missing Count

Use Case Description: This post describes an approach to a use case scenario where an input file contains some columns and their corresponding values as records, but some of these columns may have blanks/nulls instead of actual values, i.e. data is missing for some columns. The developer needs to write a MapReduce program to calculate the missing count and percentage for each input column. Suppose we consider the below text files as […]
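A hedged sketch of one way to implement this, assuming comma-delimited records in which every record carries all columns: the mapper emits a 1/0 missing flag per column index, and the reducer derives both the missing count and the percentage from those flags:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MissingCount {
  public static class MissingMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",", -1);   // -1 keeps trailing empty fields
      for (int i = 0; i < fields.length; i++) {
        int missing = fields[i].trim().isEmpty() ? 1 : 0;  // 1 when the column value is blank
        context.write(new IntWritable(i), new IntWritable(missing));
      }
    }
  }

  public static class MissingReducer extends Reducer<IntWritable, IntWritable, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable column, Iterable<IntWritable> flags, Context context)
        throws IOException, InterruptedException {
      long missing = 0, total = 0;
      for (IntWritable flag : flags) {
        missing += flag.get();  // number of records with this column missing
        total++;                // total number of records seen for this column
      }
      double percent = total == 0 ? 0.0 : 100.0 * missing / total;
      context.write(column, new Text(missing + "\t" + percent + "%"));
    }
  }
}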


Hadoop Output Formats

Hadoop Output Formats: We discussed the input formats supported by Hadoop in the previous post. In this post, we will have an overview of the Hadoop output formats and their usage. Hadoop provides an output format corresponding to each input format. All Hadoop output formats must extend the abstract class org.apache.hadoop.mapreduce.OutputFormat. OutputFormat describes the output specification for a MapReduce job. Based on the output specification, the MapReduce job checks that the output directory doesn't already […]
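As an illustrative driver fragment (not from the post), switching a job from the default TextOutputFormat to a block-compressed SequenceFileOutputFormat looks like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExample {
  public static void configureOutput(Job job) {
    // Write binary key/value pairs instead of plain text lines.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Enable block compression on the output SequenceFile.
    FileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
  }
}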


Hadoop Input Formats

Hadoop Input Formats: In our Mapreduce Job Flow post, we discussed how files are broken into splits as part of job startup and how the data in a split is sent to the mapper implementation. In this post, we will go into a detailed discussion of the input formats supported by Hadoop and MapReduce and how input files are processed in a MapReduce job. Input Splits, Input Formats and Record Reader: […]
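An illustrative driver fragment (not from the post) showing how an alternative input format is selected in place of the default TextInputFormat:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class InputFormatExample {
  public static void configureInput(Job job, Path input) throws Exception {
    FileInputFormat.addInputPath(job, input);

    // Treat each line as a tab-separated key/value pair instead of (byte offset, line).
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, give every mapper a fixed number of input lines:
    // job.setInputFormatClass(NLineInputFormat.class);
    // NLineInputFormat.setNumLinesPerSplit(job, 1000);
  }
}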


Creating Custom Hadoop Writable Data Type

Sometimes, if none of the built-in Hadoop Writable data types matches our requirements, we can create a custom Hadoop data type by implementing the Writable interface or the WritableComparable interface. Common Rules for Creating a Custom Hadoop Writable Data Type: A custom Hadoop Writable data type that needs to be used as a value field in MapReduce programs must implement the Writable interface, org.apache.hadoop.io.Writable. MapReduce key types should have the ability to compare against each […]
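A minimal sketch of such a custom type: a hypothetical (symbol, timestamp) pair that implements WritableComparable, so it can be used as a MapReduce key as well as a value. The field names are illustrative only:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StockKey implements WritableComparable<StockKey> {
  private String symbol;
  private long timestamp;

  public StockKey() {}                        // no-arg constructor required by Hadoop

  public StockKey(String symbol, long timestamp) {
    this.symbol = symbol;
    this.timestamp = timestamp;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(symbol);                     // serialize fields in a fixed order
    out.writeLong(timestamp);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    symbol = in.readUTF();                    // deserialize in exactly the same order
    timestamp = in.readLong();
  }

  @Override
  public int compareTo(StockKey other) {      // defines the sort order of keys
    int cmp = symbol.compareTo(other.symbol);
    return cmp != 0 ? cmp : Long.compare(timestamp, other.timestamp);
  }

  @Override
  public int hashCode() {                     // keeps partitioning consistent with equals()
    return symbol.hashCode() * 31 + (int) (timestamp ^ (timestamp >>> 32));
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof StockKey)) return false;
    StockKey other = (StockKey) o;
    return symbol.equals(other.symbol) && timestamp == other.timestamp;
  }

  @Override
  public String toString() {                  // used by TextOutputFormat when writing results
    return symbol + "\t" + timestamp;
  }
}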


Hadoop Data Types

Hadoop provides Writable interface based data types for serialization and deserialization of data stored in HDFS and used in MapReduce computations. Serialization: Serialization is the process of converting object data into byte stream data for transmission over a network across different nodes in a cluster or for persistent data storage. Deserialization: Deserialization is the reverse process of serialization; it converts byte stream data into object data for reading data from HDFS. Hadoop […]
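A tiny round-trip demonstration of the two processes using a built-in Writable (the value 163 is arbitrary): write() serializes the object to a byte stream and readFields() rebuilds it, which is all the Writable contract requires:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableRoundTrip {
  public static void main(String[] args) throws Exception {
    IntWritable original = new IntWritable(163);

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    original.write(new DataOutputStream(bytes));            // serialization: object -> bytes

    IntWritable restored = new IntWritable();
    restored.readFields(new DataInputStream(
        new ByteArrayInputStream(bytes.toByteArray())));     // deserialization: bytes -> object

    System.out.println(restored.get());                      // prints 163
  }
}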


Combiner in Mapreduce

Combiners in MapReduce: A combiner is a semi-reducer in MapReduce. It is an optional class that can be specified in the MapReduce driver class to process the output of map tasks before submitting it to reducer tasks. Purpose: In the MapReduce framework, the output from the map tasks is usually large, and the data transfer between map and reduce tasks will be high. Since data transfer across the network is expensive, and to […]
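An illustrative driver fragment (the class names are placeholders, not the post's code): when the reduce operation is associative and commutative, such as summing word counts, the reducer class itself is often reused as the combiner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
  public static void configure(Job job) {
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);  // runs on map output locally, before the shuffle
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}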

