Senior Hadoop developer with 4 years of experience designing and architecting solutions for the Big Data domain, with involvement in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
Impala Conditions with Example
Impala supports the following conditional functions for testing equality, comparison operators, and nullity:
‘Case’ Example:
1) If else
select case when 20 > 10 then 20 else 15 end;
Output: 20
2) If else if
select case when 9 > 10 then 20 when 1 > 2 then 1.0 else 15 end;
Output: 15
‘Coalesce’ Function Example:
The COALESCE function in Impala returns the first […]
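The CASE queries in the excerpt above can be tried without an Impala cluster. The following is a minimal sketch that runs them against an in-memory SQLite database from Python's standard library. SQLite is only a stand-in here: it is an assumption that its CASE and COALESCE behave like Impala's for these simple literals, and result types may differ between the two engines. The helper name `run_scalar` and the COALESCE arguments are hypothetical, chosen for illustration.

```python
import sqlite3


def run_scalar(sql):
    """Run a single-value SELECT against an in-memory SQLite database.

    SQLite is a stand-in for Impala here, not the real engine.
    """
    conn = sqlite3.connect(":memory:")
    try:
        return conn.execute(sql).fetchone()[0]
    finally:
        conn.close()


# 1) If else: the first WHEN is true, so its THEN value is returned.
if_else = run_scalar("select case when 20 > 10 then 20 else 15 end")

# 2) If else if: no WHEN is true, so the ELSE value is returned.
if_else_if = run_scalar(
    "select case when 9 > 10 then 20 when 1 > 2 then 1.0 else 15 end"
)

# COALESCE in standard SQL returns its first non-NULL argument
# (illustrative arguments, not taken from the original post).
first_non_null = run_scalar("select coalesce(NULL, NULL, 3, 4)")

print(if_else, if_else_if, first_non_null)
```

On a real cluster the same statements would be submitted through impala-shell or a client library instead of `sqlite3`.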
PMD (Programming Mistake Detector)
What is PMD? PMD, aka Programming Mistake Detector, is a Java source code analyzer. It is used to clean up erroneous code in Java projects based on a predefined set of rules. PMD also supports writing custom rules. Issues reported by PMD are not always true errors; they may simply be inefficient code, i.e. the application could still function properly even if they were not corrected. PMD […]
HBase & Solr – Near Real Time Indexing and Search
Requirements: A. HBase table B. Solr collection on HDFS C. Lily HBase Indexer D. Morphline configuration file
Once the Solr server is ready, we can configure our collection (in SolrCloud), which will be linked to the HBase table. Add the properties below to the hbase-site.xml file. Add the properties below to /etc/hbase-solr/conf/hbase-indexer-site.xml. This enables the Lily indexer to reach the HBase cluster for indexing. […]
What is an RDD?
A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Why RDD in Spark? MapReduce is widely adopted for processing and generating large […]
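The two properties named in the excerpt above, immutability and logical partitioning, can be illustrated in plain Python. This is a conceptual sketch only, not the Spark API: `partition` and `map_partitions` are hypothetical helpers, and everything runs in one process, whereas Spark would schedule each partition on a cluster node.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def partition(items: Sequence[T], num_partitions: int) -> Tuple[Tuple[T, ...], ...]:
    """Split items into num_partitions immutable, contiguous chunks."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    size, extra = divmod(len(items), num_partitions)
    chunks = []
    start = 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(tuple(items[start:end]))
        start = end
    return tuple(chunks)


def map_partitions(
    parts: Tuple[Tuple[T, ...], ...], fn: Callable[[T], U]
) -> Tuple[Tuple[U, ...], ...]:
    """Apply fn per partition and return NEW partitions.

    The input is never modified, mirroring the fact that a
    transformation produces a new dataset rather than mutating one.
    """
    return tuple(tuple(fn(x) for x in part) for part in parts)


parts = partition(list(range(10)), 3)
squared = map_partitions(parts, lambda x: x * x)
print(parts)
print(squared)
```

Because each partition is processed independently of the others, the per-partition work is what a cluster could run on different nodes.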
Below are Impala performance tuning options:
Pre-execution Checklist
Data types
Partitioning
File Format
Data Type Choices
Define integer columns as INT/BIGINT
Operations on INT/BIGINT are more efficient than on STRING
Convert "external" data to good "internal" types on load, e.g. CAST date strings to TIMESTAMP
This avoids expensive CASTs in queries later
Partitioning
The fastest I/O is the one […]
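The "convert on load" advice above can be pictured outside Impala as well. The sketch below is plain Python, not Impala SQL: it parses string fields into typed values once at load time so that later filters compare native values instead of re-parsing strings on every query. The sample rows, the `load_row` helper, and the cutoff date are all hypothetical.

```python
from datetime import datetime

# Hypothetical "external" data: every field arrives as a string.
raw_rows = [
    ("1", "2016-03-01 10:15:00"),
    ("2", "2016-03-02 08:00:00"),
]


def load_row(row):
    """Convert string fields to typed 'internal' values once, on load."""
    user_id, event_time = row
    return int(user_id), datetime.strptime(event_time, "%Y-%m-%d %H:%M:%S")


typed_rows = [load_row(r) for r in raw_rows]

# Later "queries" compare native values; no per-query parsing is needed.
cutoff = datetime(2016, 3, 2)
recent_ids = [uid for uid, ts in typed_rows if ts >= cutoff]
print(recent_ids)
```

In Impala the equivalent step would be a CAST applied while populating the internal table, so that queries against it never repeat the conversion.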
While developing Kafka, the main focus was to provide the following:
An API for producers and consumers to support custom implementation
Low overheads for network and storage with message persistence on disk
A high throughput supporting millions of messages for both publishing and subscribing (for example, real-time log aggregation or data feeds)
Distributed and highly scalable architecture to handle low-latency delivery
Auto-balancing multiple consumers in the […]
There are a number of ways in which Kafka can be used in an architecture. This section discusses some of the popular use cases for Apache Kafka and the well-known companies that have adopted Kafka. The following are the popular Kafka use cases:
Log aggregation
This is the process of collecting physical log files from servers and putting them in a central place (a file server or HDFS) for processing. Using […]