Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.

Impala Miscellaneous Functions

Impala Conditions with Example
Impala supports the following conditional functions for testing equality, comparison operators, and nullity:
‘CASE’ Example:
1) If else: select case when 20 > 10 then 20 else 15 end; Output: 20
2) If else if: select case when 9 > 10 then 20 when 1 > 2 then 1.0 else 15 end; Output: 15
‘COALESCE’ Function Example: The COALESCE function in Impala returns the first […]
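The branching logic of the two CASE queries, and the usual SQL meaning of COALESCE (first non-NULL argument), can be sketched in plain Python. This is only an illustration of the semantics, not Impala code: the function names `coalesce` and `case_if_else_if` are made up for this sketch, and Python's `None` stands in for SQL NULL.

```python
from typing import Optional, TypeVar, Union

T = TypeVar("T")


def coalesce(*values: Optional[T]) -> Optional[T]:
    """Return the first argument that is not None, or None if all are None.

    None plays the role of SQL NULL in this sketch.
    """
    for value in values:
        if value is not None:
            return value
    return None


def case_if_else_if() -> Union[int, float]:
    """Mirror: case when 9 > 10 then 20 when 1 > 2 then 1.0 else 15 end."""
    if 9 > 10:      # first WHEN: false
        return 20
    elif 1 > 2:     # second WHEN: false
        return 1.0
    else:           # ELSE branch is taken
        return 15


if __name__ == "__main__":
    print(case_if_else_if())
    print(coalesce(None, None, "fallback"))
```

Note that a value such as 0 or an empty string is not NULL, so `coalesce(0, 1)` returns 0 in this sketch.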

PMD (Programming Mistake Detector)

What is PMD? PMD, aka Programming Mistake Detector, is a Java source code analyzer. It is used to clean up erroneous code in Java projects based on a predefined set of rules. PMD also supports writing custom rules. Issues reported by PMD are not always true errors; they may instead point to inefficient code, i.e. the application could still function properly even if they were not corrected. PMD […]

HBase & Solr Search Integration

HBase & Solr: near-real-time indexing and search. Requirements: A. HBase table B. Solr collection on HDFS C. Lily HBase Indexer D. Morphline configuration file. Once the Solr server is ready, we can configure our collection (in SolrCloud), which will be linked to the HBase table. Add the below properties to the hbase-site.xml file. Add the below properties to /etc/hbase-solr/conf/hbase-indexer-site.xml. This will enable the Lily indexer to reach the HBase cluster for indexing. […]

Resilient Distributed Dataset

What is an RDD? A Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Why RDD in Spark? MapReduce is widely adopted for processing and generating large […]
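To make the ideas of "immutable" and "logical partitions" concrete, here is a toy sketch in plain Python. It does not use Spark and is not how Spark is implemented; `make_partitions` and `map_partitions` are hypothetical helpers written only for this illustration. Tuples stand in for immutable partitions, and a transformation returns a new set of partitions instead of modifying the old one.

```python
from typing import Callable, Sequence, Tuple

Partitions = Tuple[Tuple[int, ...], ...]


def make_partitions(data: Sequence[int], num_partitions: int) -> Partitions:
    """Split data into num_partitions contiguous, immutable chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions get one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(data[start:end]))
        start = end
    return tuple(partitions)


def map_partitions(partitions: Partitions, fn: Callable[[int], int]) -> Partitions:
    """Apply fn to every element, partition by partition.

    Each partition is processed independently, which is what would allow
    different nodes to work on different partitions. The input is left
    untouched; a new collection is returned.
    """
    return tuple(tuple(fn(x) for x in part) for part in partitions)


if __name__ == "__main__":
    base = make_partitions(range(7), 3)
    doubled = map_partitions(base, lambda x: x * 2)
    print(base)
    print(doubled)
```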

Impala Best Practices

Below are Impala performance tuning options:
Pre-execution checklist:
    Data types
    Partitioning
    File format
Data type choices:
    Define integer columns as INT/BIGINT
    Operations on INT/BIGINT are more efficient than on STRING
    Convert “external” data to good “internal” types on load,
    e.g. CAST date strings to TIMESTAMP
    This avoids expensive CASTs in queries later
Partitioning: The fastest I/O is the one […]

Apache Storm Integration With Apache Kafka

Installing Apache Storm. Prerequisites for Storm to work on the machine: a. Download and installation commands for ZeroMQ 2.1.7. Run the following commands in a terminal:

b. Download and installation commands for JZMQ: 

  2. Download latest storm from 

Second, start the Storm cluster by starting the master and worker nodes. Start the master node, i.e. Nimbus. To start Nimbus, go to the ‘bin’ directory of the […]

Kafka Design

While developing Kafka, the main focus was to provide the following:
    An API for producers and consumers to support custom implementation
    Low overhead for network and storage, with message persistence on disk
    High throughput supporting millions of messages for both publishing and subscribing, for example real-time log aggregation or data feeds
    A distributed and highly scalable architecture to handle low-latency delivery
    Auto-balancing multiple consumers in the […]

Kafka Installation

There are a number of ways in which Kafka can be used in any architecture. This section discusses some of the popular use cases for Apache Kafka and the well-known companies that have adopted Kafka. The following are the popular Kafka use cases: Log aggregation: this is the process of collecting physical log files from servers and putting them in a central place (a file server or HDFS) for processing. Using […]

Cassandra production scenarios/issues

Production issue: when we tried to run a SELECT query with 8 lakh (800,000) IDs in an “IN” condition, we faced the issue below. To solve the above exception, we used distributed calls in the Java client, as shown below:
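The general idea of replacing one huge IN query with many smaller calls can be sketched as follows. This is a Python illustration under assumptions, not the Java client code from the post: it assumes "distributed calls" means splitting the ID list into batches and querying them in parallel, and `fetch_batch` is a hypothetical callable standing in for whatever actually queries Cassandra. The batch size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def chunked(ids: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each with at most `size` elements."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_all(
    ids: List[int],
    fetch_batch: Callable[[List[int]], Dict[int, str]],
    batch_size: int = 100,
    max_workers: int = 8,
) -> Dict[int, str]:
    """Query the ids in small batches, in parallel, and merge the results."""
    results: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One call per batch instead of a single query with every id.
        for partial in pool.map(fetch_batch, chunked(ids, batch_size)):
            results.update(partial)
    return results


def fake_fetch_batch(batch: List[int]) -> Dict[int, str]:
    """Stand-in for a real database call (hypothetical, for demonstration)."""
    return {i: f"row-{i}" for i in batch}


if __name__ == "__main__":
    rows = fetch_all(list(range(1000)), fake_fetch_batch)
    print(len(rows))
```

With a real driver, `fake_fetch_batch` would be replaced by a function that runs one small query per batch, so no single coordinator has to handle the full ID list.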

A few production configurations in Cassandra. RetryPolicy: there are three scenarios for which you can control the retry policy. Read timeout: when a coordinator received the request and sent the read to the replica(s), but the replica(s) […]

Cassandra write and read process

Storage engine: Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-Tree. Cassandra avoids reading before writing. Read-before-write, especially in a large distributed system, can produce stalls in read performance and other problems. Cassandra never re-writes or re-reads existing data, and never overwrites rows in place. How is data written? The stages of the write process in Cassandra: Logging data […]

Cassandra Architecture

Cassandra is designed so that there is no single point of failure (SPOF). There is no master-slave architecture in Cassandra. Cassandra addresses the SPOF problem by employing a peer-to-peer distributed system across homogeneous nodes, where data is distributed among all nodes in the cluster. In Cassandra all nodes are the same; there is no master or slave. Each node frequently exchanges state information about itself […]

CAP Theorem

What is the CAP theorem? The CAP theorem states that a distributed database can guarantee at most two of three properties, so before choosing any database we have to decide, based on our requirements, which two of the three we need. Consistency: whenever you read a record (or data), consistency guarantees that you get the same data no matter how many times you read it. Put simply, each server returns the right response to each request, thus the system will be always […]

Oozie Notes

Oozie is a workflow scheduler to manage Hadoop and related jobs.
It was first developed in Bangalore by Yahoo.
A workflow is a DAG (Directed Acyclic Graph). Acyclic means the graph cannot have any loops, and the action members of the graph provide control dependency. Control dependency means a second action cannot run until the first action is completed.
Oozie definitions are written in Hadoop Process Definition Language (hPDL) and coded as an XML file (workflow.xml).
Workflow contains: Control […]
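The notion of control dependency in an acyclic graph can be illustrated with a small Python sketch. This is not Oozie or hPDL: it only uses the standard-library `graphlib` module (Python 3.9+) to show that actions run after the actions they depend on, and that a loop makes the graph invalid. The action names (`sqoop-import`, `pig-clean`, `hive-report`) are invented for the example.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def run_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every action comes after its dependencies.

    `dependencies` maps each action to the set of actions that must
    complete before it can start.
    """
    return list(TopologicalSorter(dependencies).static_order())


def has_cycle(dependencies: Dict[str, Set[str]]) -> bool:
    """Return True if the graph contains a loop and is therefore not a DAG."""
    try:
        run_order(dependencies)
    except CycleError:
        return True
    return False


# Hypothetical three-step workflow: import, then clean, then report.
workflow = {
    "sqoop-import": set(),
    "pig-clean": {"sqoop-import"},
    "hive-report": {"pig-clean"},
}

if __name__ == "__main__":
    print(run_order(workflow))
    print(has_cycle({"a": {"b"}, "b": {"a"}}))
```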

Zookeeper Commands

This post contains some notes on ZooKeeper commands and scripts. It is mainly useful for Hadoop admins, and all commands are self-explanatory.
ZooKeeper is a distributed, centralized coordination service.
ZooKeeper addresses these issues with distributed applications:
    Maintaining configuration information (sharing config info across all nodes)
    Naming service (allows one node to find a specific machine in a cluster of thousands of servers)
    Distributed synchronization (locks, barriers, queues, etc.)
    Group services […]

Separation Anxiety: A Tutorial for Isolating Your System with Linux Namespaces

With the advent of tools like Docker, Linux Containers, and others, it has become super easy to isolate Linux processes into their own little system environments. This makes it possible to run a whole range of applications on a single real Linux machine and ensure no two of them can interfere with each other, without having to resort to using virtual machines. These tools have been a huge boon to […]

