Monthly Archives: October 2015


Oozie Notes

OOZIE NOTES Workflow scheduler to manage Hadoop and related jobs. First developed in Bangalore by Yahoo. DAG (Directed Acyclic Graph): acyclic means the graph cannot have any loops, and the action members of the graph provide control dependency. Control dependency means a second action cannot run until the first action is completed. Oozie definitions are written in Hadoop Process Definition Language (hPDL) and coded as an XML file (workflow.xml). Workflow contains: Control […]


Zookeeper Commands

This post collects some notes on ZooKeeper commands and scripts. It is mainly useful for Hadoop admins, and all commands are self-explanatory. ZooKeeper is a centralized coordination service for distributed applications. ZooKeeper addresses these issues with distributed applications: Maintaining configuration information (sharing config info across all nodes). Naming service (allows one node to find a specific machine in a cluster of thousands of servers). Distributed synchronization (locks, barriers, queues, etc.). Group services […]


Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces

With the advent of tools like Docker, Linux Containers, and others, it has become super easy to isolate Linux processes into their own little system environments. This makes it possible to run a whole range of applications on a single real Linux machine and ensure no two of them can interfere with each other, without having to resort to using virtual machines. These tools have been a huge boon to […]


Advanced Java Class Tutorial: A Guide to Class Reloading

In Java development, a typical workflow involves restarting the server with every class change, and no one complains about it. That is a fact about Java development. We have worked like that since our first day with Java. But is Java class reloading that difficult to achieve? And could that problem be both challenging and exciting to solve for skilled Java developers? In this Java class tutorial, I will try to address […]


Hadoop Real Time Usecases with Solutions

Below are a few Hadoop real-time use cases with solutions. Use case 1. Problem: Data Description: This data set gives information about the markets and the products available in different regions based on the seasons. You will find the fields below listed in that file.

Problem Statement: Select any particular county and calculate the percentage of different products produced by each Market in that particular county. Note: Here we have total […]


An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples

Machine Learning (ML) is coming into its own, with a growing recognition that ML can play a key role in a wide range of critical applications, such as data mining, natural language processing, image recognition, and expert systems. ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization. The supply of able ML designers has yet to catch up […]


Meet Bond, Microsoft Bond – A New Data Serialization Framework

Microsoft Bond is a new serialization framework for schematized data created by Microsoft. Let’s recap where data serialization is used most: Data persistence in files, streams, NoSQL, and BigData. Data transmission in networks, IPC, etc. Commonly, these applications have to deal with schematized data, where schema means: Structure: hierarchy, relations, order. Semantic: age in number of years since born. Actually, any data has schema even if it is implicitly defined […]


Business Intelligence Platform: Tutorial Using MongoDB Aggregation Pipeline

Using data to answer interesting questions is what researchers are busy doing in today’s data-driven world. Given huge volumes of data, the challenge of processing and analyzing it is a big one, particularly for statisticians or data analysts who do not have the time to invest in learning business intelligence platforms or the technologies provided by the Hadoop ecosystem, Spark, or NoSQL databases that would help them to analyze terabytes of […]


Introduction to Apache Spark with Examples and Use Cases

I first heard of Spark in late 2013 when I became interested in Scala, the language in which Spark is written. Some time later, I did a fun data science project trying to predict survival on the Titanic. This turned out to be a great way to get further introduced to Spark concepts and programming. I highly recommend it for any aspiring Spark developers looking for a place to get […]


Sqoop Interview Cheat Sheet

Install Sqoop: sudo yum install sqoop, or sudo apt-get install sqoop (in sqoop-normal command prompt). Sqoop config file: sqoop-site.xml. Install JDBC drivers: After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s lib/ directory. If you’re using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unzipping the tarball. If you’re using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory […]


Cassandra Interview Cheat Sheet

Cassandra: Cassandra is a distributed database from Apache that is highly scalable and designed to manage very large amounts of structured data. It provides high availability with no single point of failure. NoSQL: The primary objective of a NoSQL database is to have simplicity of design, horizontal scaling, and finer control over availability. These databases are schema-free, support easy replication, have a simple API, are eventually consistent, and can handle huge amounts […]


Hadoop and Hive Interview Cheat Sheet

Hive: SQL-based data warehouse application built on top of Hadoop (select, join, group by, …). It is a platform used to develop SQL-type scripts to do MapReduce operations. PARTITIONING: Partitioning a table changes how Hive structures the data storage. Used for distributing load horizontally, ex: PARTITIONED BY (country STRING, state STRING); A partition is a subset of a table’s data set where one column has the same value for all records in the subset. In Hive, as in most databases […]