Hadoop


Introduction to Hadoop Streaming

In this post, we introduce Hadoop Streaming using the term frequency and inverse document frequency algorithm. Hadoop Streaming: The MapReduce framework is written in Java and by default supports MapReduce programs written in Java, but Hadoop also provides an API for writing MapReduce programs in languages other than Java. Hadoop Streaming is a utility that comes with the Hadoop distribution and allows users to write MapReduce programs in any […]
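As a rough illustration of the idea in this excerpt, here is a minimal sketch of the term-counting step written in Python. It assumes the usual Hadoop Streaming contract: the mapper and reducer exchange tab-separated key/value lines over standard input and output, and the framework sorts the mapper output by key before the reduce phase. The function names `map_words` and `reduce_counts` are hypothetical and are not taken from the original post.

```python
from itertools import groupby


def map_words(lines):
    """Mapper step: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_counts(sorted_records):
    """Reducer step: sum the counts of consecutive records sharing a key.

    Assumes the records are already sorted by key, which is what
    Hadoop Streaming does between the map and reduce phases.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two tiny input lines.
sample = ["Hadoop streaming", "hadoop"]
for record in reduce_counts(sorted(map_words(sample))):
    print(record)
```

In a real job, each function would typically live in its own script that reads `sys.stdin` and prints to standard output, with the scripts passed to the streaming jar through its `-mapper` and `-reducer` options. The raw counts produced here are only the term frequency part; the inverse document frequency part would need further passes.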


Most Popular Hadoop Distributions

Many Hadoop distributions are currently available in the big data market, but the major free, open source distribution is from the Apache Software Foundation. The other Hadoop distribution companies also provide free versions of Hadoop, along with customized distributions suited to the needs of client organizations. Using Apache Hadoop as the core framework, these companies build their own customized Hadoop cluster setup and services […]


Processing Logs in Pig

In the previous post, we discussed the basics of log files and the architecture of log analysis in Hadoop. In this post, we go into more detail on processing logs in Pig. As discussed in the previous post, there are three main types of log files: web server access logs, web server error logs, and application server logs. All these log files will be in […]


Log Analysis in Hadoop

In this post, we discuss the various log file types and log analysis in Hadoop. Log Files: Logs are computer-generated files that capture network and server operations data. They are useful during various stages of software development, mainly for debugging and profiling, and also for managing network operations. Need For Log Files: Log files are commonly used at customer installations for permanent software monitoring and/or fine-tuning. […]


Tableau Integration with Hadoop

In this post, we discuss the basics of Tableau software and Tableau integration with Hadoop. Tableau Overview: What is Tableau? Tableau is a visualization tool, based on breakthrough technology, that provides drag-and-drop features for analyzing large amounts of data easily and quickly. The Tableau dashboard is very interactive and gives dynamic results. Tableau supports strong interactive capabilities and provides a rich set of graphic […]


Cloudera Manager Installation on Amazon EC2

In this post, we discuss Hadoop installation in the cloud. Though there are a number of posts available across the internet on this topic, we are documenting the procedure for Cloudera Manager installation on Amazon EC2 instances, with some of our practical views on installation and tips and hints for avoiding issues. This post also gives a basic introduction to using Amazon AWS cloud services. Creation of […]


Hive Connectivity With Hunk (Splunk)

In this post, we discuss the configuration required for Hive connectivity with Hunk, the Hadoop flavor of Splunk, the well-known visualization tool. Splunk Overview: Splunk captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, dashboards, and visualizations. Splunk released a product called Hunk: Splunk Analytics for Hadoop, which supports accessing, searching, and reporting on external data sets located in Hadoop from […]


Apache Tez – Successor of Mapreduce Framework

Apache Tez Overview: What is Apache Tez? Apache Tez is another execution framework project from the Apache Software Foundation, built on top of Hadoop YARN. It is considered a more flexible and powerful successor to the MapReduce framework. Apache Tez Features: Tez provides a performance gain over MapReduce along with backward compatibility with the MapReduce framework, optimal resource management, plan reconfiguration at run-time, and dynamic physical data flow decisions. Tez is […]


Merging Small Files Into Avro File

This post is a continuation of the previous post on the small files issue. In the previous post, we merged a large number of small files in an HDFS directory into a SequenceFile; in this post, we will merge a large number of small files on the local file system into an Avro file in an HDFS output directory. We will store the file names as keys and the file contents as values in the Avro file. We will […]


Merging Small Files into SequenceFile

In this post, we discuss one of the well-known use cases of SequenceFiles: merging a large number of small files into a SequenceFile. This requirement arises mainly because Hadoop and MapReduce do not process large numbers of small files efficiently. Need For Merging Small Files: As Hadoop stores all HDFS file metadata in the namenode’s main memory (which is limited) for […]


Reading and Writing SequenceFile Example

This post is a continuation of the previous post on Hadoop sequence files. In this post, we discuss examples of reading and writing a SequenceFile using the Apache Hadoop 2 API. Writing Sequence File Example: As discussed in the previous post, we use the static method SequenceFile.createWriter(conf, opts) to create a SequenceFile.Writer instance and the append(key, value) method to insert each record into the sequence file. In the example program below, we are reading contents […]


Hadoop Sequence Files example

In addition to text files, Hadoop also supports binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. In this post, we discuss the basic details and format of Hadoop sequence files, with examples. Hadoop Sequence Files: Advantages: As binary files, these are more compact than text files. They provide optional support for compression at different […]


Flume Data Collection into HDFS with Avro Serialization

In this post, we provide a proof of concept for Flume data collection into HDFS with Avro serialization, using the HDFS sink and the Avro serializer on Sequence Files with Snappy compression. We will also use formatting escape sequences to store the events on the HDFS path. We will create a Flume agent with a Spooling Directory source, a JDBC channel, and an HDFS sink. Now let's create our agent Agent7 in flume.conf […]


Flume Data Collection into HDFS

In this post, we discuss setting up an agent for Flume data collection into HDFS. We will set up an agent with a Sequence Generator source, an HDFS sink, and a memory channel, start that agent, and verify its functionality. Flume Agent – Sequence Generator Source, HDFS Sink and Memory Channel: Add the configuration properties below to the flume.conf file to create Agent4 with a Sequence source, memory […]


Review Comments

I attended Siva’s Spark and Scala training. He has good presentation skills and explains technical concepts clearly to everyone in the group. He has excellent real-time experience and provided enough use cases to understand each concept. The course duration and time management were excellent. I am happy that I found the right person at the right time to learn Spark. Thanks Siva!!!

Dharmeswaran ETL / Hadoop Developer Spark Nov 2016 September 21, 2017
