Introduction to Hadoop Streaming

In this post, we introduce Hadoop Streaming, using the term frequency and inverse document frequency (TF-IDF) algorithm as an example.

Hadoop Streaming

The MapReduce framework is written in Java and, by default, supports MapReduce programs written in Java. Hadoop also provides an API for writing MapReduce programs in other languages. Hadoop Streaming is a utility that ships with the Hadoop distribution and lets users write MapReduce programs in any programming or scripting language that can read from standard input (stdin) and write to standard output (stdout). Commonly used languages include Python, PHP, Ruby, Perl, and Bash. In this post, we run the examples with Unix Bash shell scripts.

As of the Hadoop 2.0.0 release, Hadoop Streaming supports both binary and text files; earlier releases supported only text processing.

File Processing in Hadoop Streaming

Each mapper task converts its input into lines and feeds them to the stdin of the mapper process. It then collects the line-oriented output from the stdout of that process and converts each line into a tab-separated key-value pair: by default, the text up to the first tab is the key and the rest of the line is the value. These key-value pairs are passed to the reducer tasks. Each reducer task converts its input key-value pairs into lines and feeds them to the stdin of the reducer process, then collects the line-oriented output from its stdout and converts each line into a tab-separated key-value pair.
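The data flow described above can be tried locally, without a cluster. The following is a minimal sketch, not code from this post: the mapper and reducer functions are hypothetical stand-ins for streaming executables, and sort only approximates the shuffle-and-sort step that Hadoop performs between the two stages.

```shell
#!/usr/bin/env bash
# Local simulation of the Hadoop Streaming data flow (no cluster involved).
set -eu

# Hypothetical mapper: reads lines on stdin, writes "word<TAB>1" per word.
mapper() {
  tr -s '[:space:]' '\n' | awk 'NF { printf "%s\t1\n", $0 }'
}

# Hypothetical reducer: reads key-sorted "word<TAB>count" lines on stdin
# and writes one "word<TAB>total" line per distinct key.
reducer() {
  awk -F '\t' '
    NR > 1 && ($1 "") != prev { printf "%s\t%d\n", prev, sum; sum = 0 }
    { prev = $1 ""; sum += $2 }
    END { if (NR > 0) printf "%s\t%d\n", prev, sum }
  '
}

# sort stands in for the shuffle-and-sort phase between map and reduce.
result=$(printf 'hello world\nhello hadoop\n' | mapper | LC_ALL=C sort | reducer)
printf '%s\n' "$result"
```

Because both stages only read stdin and write stdout, the same two commands could in principle be passed to a streaming job as the mapper and reducer; running them in a local pipe first is a cheap way to check the tab-separated format.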

Word Count Example

The basic word count example uses the Unix utilities cat and wc as the mapper and reducer, respectively.
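As a hedged sketch of what such a job submission might look like: the jar location and the input and output directories below are placeholders, not values from this post, and the exact jar name depends on the Hadoop version installed. The script submits the job only when RUN_JOB=1 is set and hadoop is on the PATH; otherwise it runs a local approximation of the same cat and wc pipeline.

```shell
#!/usr/bin/env bash
# Sketch of a cat/wc streaming job; all paths below are placeholders.
set -eu

# Assumed values (not from the original post); override via the environment.
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"
INPUT_DIR="${INPUT_DIR:-/user/example/input}"
OUTPUT_DIR="${OUTPUT_DIR:-/user/example/output}"

# cat passes every input line through unchanged (mapper);
# wc counts the lines, words, and bytes it receives (reducer).
run_streaming_job() {
  hadoop jar "$STREAMING_JAR" \
    -input "$INPUT_DIR" \
    -output "$OUTPUT_DIR" \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
}

# Submit the job only when explicitly requested and hadoop is installed.
if [ "${RUN_JOB:-0}" = 1 ] && command -v hadoop >/dev/null 2>&1; then
  run_streaming_job
fi

# Local approximation of the same pipeline on two sample lines.
local_counts=$(printf 'hello world\nhello hadoop\n' | cat | sort | wc)
echo "local wc output (lines words bytes): $local_counts"
```

On a cluster the reported counts may differ from the local run, because the framework re-serializes the key-value pairs between the two stages and may run more than one reducer.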

Below is the output of the above command from terminal.



About Siva

Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
