Hadoop Real Time Usecases with Solutions


Below are a few Hadoop Real Time usecases with solutions.

Usecase 1 Problem:-

Data Description:

This data set gives information about the markets and the products they offer in different regions, based on the seasons.

You will find the below fields listed in that file.

Problem Statement:

  • Select any particular county and calculate the percentage of different products produced by each Market in that particular county.

Note: There are 24 product fields in total, each holding the value Y or N. For each market, count the products whose value is Y and express that count as a percentage of the 24 products. Then divide the markets into three categories: High, Medium and Low.

High – above 60%

Medium – less than or equal to 60% and greater than 40%

Low – less than or equal to 40%

  • Find the count of the markets that come under the category HIGH.
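The categorisation rule above can be sketched locally before writing the Hadoop job. This is a minimal Python sketch, not the solution used in the steps below. It assumes the percentage means the share of the 24 product fields marked Y, and the function names and the (market name, flags) input shape are hypothetical, since the file's column names are not listed here.

```python
from collections import Counter

TOTAL_PRODUCTS = 24  # number of Y/N product fields stated in the note above


def product_percentage(flags):
    """Return the percentage of the 24 products that are marked Y."""
    produced = sum(1 for flag in flags if flag.strip().upper() == "Y")
    return produced * 100.0 / TOTAL_PRODUCTS


def categorize(percentage):
    """Map a percentage to High, Medium or Low using the thresholds above."""
    if percentage > 60:
        return "High"
    if percentage > 40:
        return "Medium"
    return "Low"


def count_by_category(markets):
    """Count markets per category.

    markets: iterable of (market_name, flags) pairs for one county
    (hypothetical shape; flags is the list of 24 Y/N values).
    """
    return Counter(categorize(product_percentage(flags)) for _, flags in markets)
```

The count of markets in the HIGH category is then `count_by_category(markets)["High"]`, after filtering the rows down to the chosen county.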

Usecase 1 Solution:-

MARKET EVALUATION

Before going ahead, copy your input file (DATA_GOV_US_Farmers_Market_DataSet.csv) into HDFS.

Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

Step 6:

Step 7:

Step 8:

Step 9:

Step 10:

Step 11:

Step 12:

Usecase 2 Problem:-

Analysing the SAT (College Board) 2010 School Level Results

Data Set:

Download the data set from the below link.

https://nycopendata.socrata.com/Education/SAT-College-Board-2010-School-Level-Results/zt9s-n5aj

This data can be exported in many formats – Tabular, CSV, XML, PDF etc. Download the data as a CSV file and perform the following analysis.

Data set Description:

This data set is the SAT (College Board) 2010 School Level Results, which shows how students from different schools performed in the tests. It consists of the below fields.

DBN, School Name, Number of Test Takers, Critical Reading Mean, Mathematics Mean, Writing Mean

Here DBN is the unique field for this data set. The students were given a test, and the analysis below is based on the results of that test.

Problem Statement:

  • Find the total number of test takers.
  • Find the highest mean/average of the Critical Reading section and the school name.
  • Find the highest mean/average of the Mathematics section and the school name.
  • Find the highest mean/average of the Writing section and the school name.

Note: Records with fewer than 5 students can be ignored.
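The four questions above boil down to one sum and three maximums, which can be checked locally before running anything on the cluster. The Python sketch below is only an illustration of that logic, not the MapReduce program from the solution. It assumes the CSV header uses exactly the field names listed earlier, and it skips any row whose numbers cannot be parsed; the sample rows are made up.

```python
import csv
import io

SECTIONS = ("Critical Reading Mean", "Mathematics Mean", "Writing Mean")


def analyse_sat(csv_file):
    """Return (total test takers, {section: (highest mean, school name)})."""
    total_takers = 0
    best = {}
    for row in csv.DictReader(csv_file):
        try:
            takers = int(row["Number of Test Takers"])
            means = {section: float(row[section]) for section in SECTIONS}
        except (TypeError, ValueError):
            continue  # skip rows with missing or non-numeric values
        if takers < 5:
            continue  # the note above allows ignoring these records
        total_takers += takers
        for section, mean in means.items():
            # Keep the first school seen for each highest mean.
            if section not in best or mean > best[section][0]:
                best[section] = (mean, row["School Name"])
    return total_takers, best


# Made-up sample rows, only to show the expected input shape.
sample = io.StringIO(
    "DBN,School Name,Number of Test Takers,"
    "Critical Reading Mean,Mathematics Mean,Writing Mean\n"
    "X1,School A,10,400,450,390\n"
    "X2,School B,20,420,430,410\n"
    "X3,School C,n/a,n/a,n/a,n/a\n"
)
total, best = analyse_sat(sample)
print(total)
for section, (mean, school) in best.items():
    print(section, mean, school)
```

In a MapReduce version the same work would typically be split so that the mapper parses and filters each row and the reducer computes the sum and the maximums.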

Usecase 2 Solution:-

Analyzing the SAT (College Board) 2010 School Level Results

Step1:

Copy the input file into HDFS.

Step2:

Step3:

Run the MapReduce program.

Usecase 3 Problem:-

Analysing Queens Library Branches

Data Set:

Download the data set from the below link.

https://nycopendata.socrata.com/Recreation/Queens-Library-Branches/kh3d-xhq7?

This data can be exported in many formats – Tabular, CSV, XML, PDF etc. Download the data as a CSV file and perform the following analysis.

Data set Description:

This data set is a list of Queens Library branches and the hours during which each library is open on each day of the week. If the library is not open on a given day, that field has the value Closed.

Problem Statement:

  • Find the total number of libraries located in the NY Queens region.
  • Find the total number of libraries that are open on Sunday, and their names.
  • Find the total number of libraries that are no longer operating in the Queens region, and their names.
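The three counts above can be sketched in a few lines of Python to check the logic on a small sample. This is a rough illustration under stated assumptions, not the solution from the steps below. The header names `name`, `Mn`, `Tu`, `We`, `Th`, `Fr`, `Sa` and `Su` are assumed, and a branch is treated as no longer operating when every day of the week is marked Closed, which is one possible reading of the problem.

```python
import csv

# Assumed header names for the seven day-of-week columns.
DAY_COLUMNS = ("Mn", "Tu", "We", "Th", "Fr", "Sa", "Su")


def is_open(value):
    """A day counts as open when the field is non-empty and not 'Closed'."""
    text = (value or "").strip()
    return bool(text) and text.lower() != "closed"


def analyse_branches(csv_file):
    """Return (total branches, names open on Sunday, names closed every day)."""
    rows = list(csv.DictReader(csv_file))
    open_sunday = [row["name"] for row in rows if is_open(row["Su"])]
    # Assumption: a branch closed on all seven days is no longer operating.
    not_operating = [
        row["name"]
        for row in rows
        if not any(is_open(row[day]) for day in DAY_COLUMNS)
    ]
    return len(rows), open_sunday, not_operating
```

Calling `analyse_branches(open("branches.csv", newline=""))` on the downloaded file (the file name is a placeholder) returns the total count and the two name lists, whose lengths give the remaining two counts.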

Usecase 3 Solution:-

Analyzing Queens Library Branches

Step 1:

Step 2:

Step 3:

Step 4:

Step 5:

Total number of libraries located in the NY Queens region

Step 6:

Total number of libraries that are Open on Sunday and their names

Step 7:

Total number of libraries that are no longer operating in the Queens region and their names

Usecase 4 Problem:-

Analysing the MovieLens dataset

Data Set:

Download the data set from the below link.

http://grouplens.org/datasets/movielens/

Data set Description:

This data set contains 10000054 ratings and 95580 tags applied to 10681 movies by 71567 users of the online movie recommender service MovieLens.

Users were selected at random for inclusion. All users selected had rated at least 20 movies. Unlike previous MovieLens data sets, no demographic information is included. Each user is represented by an id, and no other information is provided.

User information is in the file “users.dat” in the below format:

UserID::Gender::Age::Occupation::Zip-code

All ratings are contained in the file “ratings.dat” in the below format:

UserID::MovieID::Rating::Timestamp

Movie information is in the file “movies.dat” in the below format:

MovieID::Title::Genres

For a detailed description, see http://www.grouplens.org/system/files/ml-1m-README.txt

Problem Statement:

  1. Find tags associated with each movie.

Usecase 4 Solution:-

Analyzing the MovieLens dataset

Step1:

The input data set uses “::” as its delimiter, but PigStorage supports only a single-character delimiter, so the delimiter needs to be changed from “::” to “\t”.
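One way to do this conversion is a small script run on the local files before they are copied to HDFS. The Python sketch below is only one option among several (sed or awk would work as well) and is not necessarily the command used in the original step. The function names, file paths and the default encoding are assumptions.

```python
def convert_lines(lines, old="::", new="\t"):
    """Yield each line with the multi-character delimiter replaced."""
    for line in lines:
        yield line.rstrip("\n").replace(old, new)


def convert_file(src_path, dst_path, encoding="utf-8"):
    """Rewrite src_path into dst_path using tab as the field delimiter.

    The encoding is an assumption; adjust it if the file fails to decode.
    """
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for converted in convert_lines(src):
            dst.write(converted + "\n")
```

For example, `convert_file("movies.dat", "movies.tsv")` would produce a tab-separated copy that PigStorage can load with `'\t'` as the delimiter.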

Step2:

Copy the data set from the local path to HDFS so that Pig can load it.

Step3:

Start the Pig shell.

Step4: Load movies

Step5: Load tags

Step6: Join movies and tags by movieid

Step7: Group the movie_tags by title

Step8: Generate all tags for each title

Step9: Store output to folder “titletagsoutput”
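The load, join, group and generate steps above can be mirrored in plain Python to check the expected result on a small sample. This is a sketch of the same logic, not the Pig script itself. It assumes the files were already converted to tab-delimited form, that movie lines are laid out as MovieID, Title, Genres, and that tag lines are laid out as UserID, MovieID, Tag, Timestamp; the function name is hypothetical.

```python
from collections import defaultdict


def tags_by_title(movie_lines, tag_lines, sep="\t"):
    """Join tags to movies on the movie id and group the tags by title."""
    # Load movies: map each movie id to its title.
    titles = {}
    for line in movie_lines:
        movie_id, title, _genres = line.rstrip("\n").split(sep, 2)
        titles[movie_id] = title

    # Load tags, join on movie id, and group by title.
    grouped = defaultdict(list)
    for line in tag_lines:
        _user_id, movie_id, tag, _timestamp = line.rstrip("\n").split(sep, 3)
        if movie_id in titles:  # inner join: drop tags for unknown movies
            grouped[titles[movie_id]].append(tag)
    return dict(grouped)
```

Like an inner join, this leaves out movies that have no tags at all. Writing one line per title with its tags joined by commas would correspond to the final store step.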

 



About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, who has been involved with several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

