Creating UDF and UDAF for Impala

 Installing the UDF Development Package

The output will look like the following:
[cloudera@quickstart impala-udf-samples-master]$ cmake .
-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ […]

Postgres Installation On Centos

To install the server locally, open a command line and type:

To start off, we need to set the password of the PostgreSQL user (role) called “postgres”; we will not be able to access the server externally otherwise. As the local “postgres” Linux user, we are allowed to connect and manipulate the server using the psql command. In a terminal, type:

This connects as a role with the same […]

Impala Best Practices

Below are Impala performance tuning options:

Pre-execution Checklist
    Data types
    Partitioning
    File format

Data Type Choices
    Define integer columns as INT/BIGINT
    Operations on INT/BIGINT are more efficient than on STRING
    Convert “external” data to good “internal” types on load
    e.g. CAST date strings to TIMESTAMP
    This avoids expensive CASTs in queries later

Partitioning
    The fastest I/O is the one […]

Apache Storm Integration With Apache Kafka

Installing Apache Storm. The following prerequisites must be installed for Storm to work on the machine. a. Download and installation commands for ZeroMQ 2.1.7. Run the following commands in a terminal:

b. Download and installation commands for JZMQ: 

  2. Download the latest Storm release from http://storm.apache.org/downloads.html

Second, start the Storm cluster by starting the master and worker nodes. Start the master node, i.e. Nimbus. To start Nimbus, go to the ‘bin’ directory of the […]

Oozie Notes

Workflow scheduler to manage Hadoop and related jobs.
First developed in Bangalore by Yahoo.
DAG (Directed Acyclic Graph): acyclic means the graph cannot have any loops, and the action members of the graph provide control dependency.
Control dependency means a second action cannot run until the first action is completed.
Oozie definitions are written in Hadoop Process Definition Language (hPDL) and coded as an XML file (workflow.xml).
Workflow contains: Control […]
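The control-dependency idea in these notes can be illustrated outside Oozie with a small sketch. This is plain Python using the standard library's `graphlib` (Python 3.9+), not hPDL or any Oozie API, and the action names `ingest`, `transform` and `report` are hypothetical. It shows that a loop-free graph gives a valid run order, while a graph with a loop is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions
# that must complete before it can start (its control dependencies).
workflow = {
    "ingest": set(),
    "transform": {"ingest"},
    "report": {"transform"},
}


def run_order(dependencies):
    """Return an order in which actions can run, or raise if there is a loop."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # A loop means the graph is not acyclic, so no valid order exists.
        raise ValueError("workflow is not a DAG") from exc


if __name__ == "__main__":
    print(run_order(workflow))
```

Adding a dependency from `ingest` back to `report` would create a loop, and `run_order` would raise `ValueError` instead of returning an order.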

Zookeeper Commands

This post contains notes on ZooKeeper commands and scripts. It is mainly useful for Hadoop admins, and all commands are self-explanatory.
ZooKeeper is a distributed, centralized coordination service.
ZooKeeper addresses common issues with distributed applications:
    Maintaining configuration information (sharing config info across all nodes)
    Naming service (allows one node to find a specific machine in a cluster of thousands of servers)
    Distributed synchronization (locks, barriers, queues, etc.)
    Group services […]

Hadoop Real Time Usecases with Solutions

Below are a few Hadoop real-time use cases with solutions. Use case 1. Problem: Data description: The data set gives information about the markets and the products available in different regions, based on the seasons. You will find the fields below listed in that file.

Problem Statement: Select any particular county and calculate the percentage of different products produced by each Market in that particular county. Note: Here we have total […]

Sqoop Interview Cheat Sheet

Install Sqoop
    sudo yum install sqoop
    sudo apt-get install sqoop
In Sqoop: normal command prompt
Sqoop config file: sqoop-site.xml
Install JDBC drivers
After you’ve obtained the driver, you need to copy the driver’s JAR file(s) into Sqoop’s lib/ directory. If you’re using the Sqoop tarball, copy the JAR files directly into the lib/ directory after unzipping the tarball. If you’re using packages, you will need to copy the driver files into the /usr/lib/sqoop/lib directory […]

Hadoop and Hive Interview Cheat Sheet

Hive
SQL-based data warehouse application built on top of Hadoop (select, join, group by, ...). It is a platform used to develop SQL-type scripts that run as MapReduce operations.
PARTITIONING
Partitioning a table changes how Hive structures the data storage.
*Used for distributing load horizontally
ex: PARTITIONED BY (country STRING, state STRING);
A partition is a subset of a table’s data set where one column has the same value for all records in the subset. In Hive, as in most databases […]
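To make "changes how Hive structures the data storage" concrete: Hive stores each partition in its own subdirectory named `column=value`, nested in the order of the PARTITIONED BY columns. The sketch below only builds such a path as a string in Python; it does not talk to Hive. The table name `sales` is hypothetical, and `/user/hive/warehouse` is assumed as the warehouse directory.

```python
def partition_path(table_dir, partition_values):
    """Build the directory path Hive would use for one partition.

    partition_values is an ordered list of (column, value) pairs,
    in the same order as the PARTITIONED BY clause.
    """
    parts = [f"{column}={value}" for column, value in partition_values]
    return "/".join([table_dir.rstrip("/"), *parts])


if __name__ == "__main__":
    # Hypothetical table partitioned by (country STRING, state STRING).
    path = partition_path(
        "/user/hive/warehouse/sales",
        [("country", "US"), ("state", "CA")],
    )
    print(path)
```

A query that filters on `country` and `state` only needs to read the matching directories, which is where the load distribution comes from.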

Hadoop Testing Tools

MRUnit – Java framework that helps developers unit test Hadoop MapReduce jobs.
Mockito – Java mocking framework that, like MRUnit, can be used when unit testing Hadoop MapReduce jobs.
PigUnit – Java framework that helps developers unit test Pig scripts.
HiveRunner – An open source unit test framework for Hadoop Hive queries, based on JUnit4.
Beetest – Unit testing framework for Hive queries.
Hive_test – Another open source unit testing framework for Hive […]

100 Hadoop Certification Dump Questions

1. Which of the following describes how a client reads a file from HDFS? ( 1 )
The client queries all DataNodes in parallel. The DataNode that contains the requested data responds directly to the client. The client reads the data directly from the DataNode.
The client contacts the NameNode for the block location(s). The NameNode then queries the DataNodes for block locations. The DataNodes respond […]

Hadoop Performance Tuning

There are many ways to improve the performance of Hadoop jobs. In this post, we will cover a few MapReduce properties that can be set at various MapReduce phases to improve performance. There is no one-size-fits-all technique for tuning Hadoop jobs. Because of the architecture of Hadoop, achieving balance among resources is often more effective than addressing a single problem. Depending on the type of job […]

Hadoop Best Practices

Avoid small files (smaller than one HDFS block, typically 128 MB), since each small file ends up being processed by its own map task.
Maintain an optimal HDFS block size, generally >= 128 MB, to avoid tens of thousands of map tasks when processing large data sets.
Use combiners wherever applicable to reduce the network traffic from mapper nodes to reducer nodes.
Applications processing large data-sets with optimal number of reducers and […]
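A rough way to see why small files and small block sizes inflate the number of map tasks is to count input splits. The sketch below is a simplification in Python, not Hadoop code. It assumes one map task per split, a split size equal to the HDFS block size, and no combining input format. The file sizes are made-up examples.

```python
MB = 1024 * 1024


def estimated_map_tasks(file_sizes_bytes, block_size_bytes=128 * MB):
    """Estimate map tasks as one per block-sized split, at least one per file."""
    total = 0
    for size in file_sizes_bytes:
        # Integer ceiling division: number of blocks needed to hold the file.
        blocks = -(-size // block_size_bytes)
        total += max(1, blocks)
    return total


if __name__ == "__main__":
    one_large_file = [1024 * 1024 * MB]   # a single 1 TiB file
    many_small_files = [1 * MB] * 10_000  # about 10 GB spread over 1 MB files

    print(estimated_map_tasks(one_large_file, 64 * MB))
    print(estimated_map_tasks(one_large_file, 128 * MB))
    print(estimated_map_tasks(many_small_files, 128 * MB))
```

Under these assumptions, doubling the block size halves the task count for the large file, while the small files need more map tasks than the large file even though they hold roughly a hundredth of the data.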

Formula to Calculate HDFS nodes storage

Below is the formula to calculate the HDFS storage size (H) required when building a new Hadoop cluster.
H = C*R*S/(1-i) * 120%
Where:
C = Compression ratio. It depends on the type of compression used (Snappy, LZOP, …) and the size of the data. When no compression is used, C=1.
R = Replication factor. It is usually 3 in a production cluster.
S = Initial size of […]
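The formula above can be checked with a few lines of Python. This is a sketch with made-up inputs. The excerpt is cut off before it defines i, so the code only assumes that i is a fraction between 0 and 1. The unit of S (terabytes here) is also an assumption.

```python
def hdfs_storage_required(c, r, s, i, headroom=1.2):
    """Evaluate H = C*R*S/(1-i) * 120%.

    c: compression ratio (1 when no compression is used)
    r: replication factor
    s: initial data size, in any unit; H comes out in the same unit
    i: the factor i from the formula, assumed to lie in [0, 1)
    headroom: the 120% multiplier from the formula
    """
    if not 0 <= i < 1:
        raise ValueError("i must satisfy 0 <= i < 1")
    return c * r * s / (1 - i) * headroom


if __name__ == "__main__":
    # Hypothetical inputs: no compression, replication 3, 100 TB, i = 0.25.
    h = hdfs_storage_required(c=1.0, r=3, s=100.0, i=0.25)
    print(f"H = {h:.1f} TB")
```

With these inputs the arithmetic is 1 * 3 * 100 / 0.75 = 400, and the 120% multiplier brings it to 480 TB.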

RHadoop Installation on Ubuntu

In this post, we will briefly discuss the steps for installing RHadoop on an Ubuntu 14.04 machine with Hadoop 2.6.0. We will also see the procedure for installing R and RStudio on an Ubuntu machine. All these installations are done on a single-node Hadoop machine. RStudio Installation on Hadoop Machine: Before proceeding with the steps detailed below, the Hadoop machine setup should be completed. Please refer to “install-hadoop-on-single-node-cluster” in this blog for Hadoop installation. Install […]

Review Comments

I am a PL/SQL developer, interested in moving into big data.

Neetika Singh ITA