Hadoop Common

Impala Miscellaneous Functions

Impala Conditions with Example

Impala supports the following conditional functions for testing equality, comparison operators, and nullity:

‘Case’ Example:

1) If else

select case when 20 > 10 then 20 else 15 end;

Output: 20

2) If else if

select case when 9 > 10 then 20 when 1 > 2 then 1.0 else 15 end;

Output: 15

‘Coalesce’ Function Example: The COALESCE function in Impala returns the first […]
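The excerpt above is cut off, so the following is only a rough illustration of the general semantics, written in Python rather than Impala SQL. It assumes the standard SQL behaviour: CASE returns the result of the first branch whose condition is true (else the ELSE value), and COALESCE returns the first argument that is not NULL. The helper names `case_when` and `coalesce` are hypothetical, and `None` stands in for NULL.

```python
def case_when(branches, default=None):
    """Mimic CASE WHEN ... THEN ... ELSE ... END.

    `branches` is a list of (condition, result) pairs checked in order.
    """
    for condition, result in branches:
        if condition:
            return result
    return default


def coalesce(*values):
    """Mimic COALESCE: first value that is not None, else None."""
    for value in values:
        if value is not None:
            return value
    return None


# The two CASE queries from the excerpt, expressed with the helper.
first = case_when([(20 > 10, 20)], default=15)
second = case_when([(9 > 10, 20), (1 > 2, 1.0)], default=15)

# COALESCE skips leading NULLs; note that 0 is a real value, not NULL.
picked = coalesce(None, None, 0, 5)
```

Unlike SQL, Python evaluates every branch result before the helper runs, so this sketch models only which value is chosen, not evaluation order.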

PMD (Programming Mistake Detector)

Table of Contents: PMD (Programming Mistake Detector); What is PMD?; How to install PMD?; How to use PMD?; Finding Cut and Paste Code (CPD); Working POM configuration

What is PMD?

PMD, aka Programming Mistake Detector, is a Java source code analyzer. It is used to find erroneous code in Java projects based on a predefined set of rules. PMD also supports writing custom rules. Issues reported by PMD may not be true […]

Creating UDF and UDAF for Impala

 Installing the UDF Development Package

The output will look like the following:

[cloudera@quickstart impala-udf-samples-master]$ cmake .
-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ […]

Postgres Commands


We can see our new table by typing this:

           List of relations
 Schema |    Name    | Type  |  Owner
--------+------------+-------+----------
 public | playground | table | postgres
(1 row)

INSERT

Message returned if only one row was inserted. oid is the numeric OID of the inserted row. Ex: INSERT oid 1

Message returned if more than one […]

Postgres Installation On CentOS

To install the server locally, use the command line and type:

To start off, we need to set the password of the PostgreSQL user (role) called “postgres”; we will not be able to access the server externally otherwise. As the local “postgres” Linux user, we are allowed to connect and manipulate the server using the psql command. In a terminal, type:

This connects as a role with the same […]

HBase & Solr Search Integration

Table of Contents: HBase & Solr - Near Real time indexing and search; Creating a Lily HBase Indexer configuration; Creating a Morphline Configuration File; Starting & Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service; Verifying the indexing is working; Configuring Lily HBase NRT Indexer Service for Use with Cloudera Search; Using the Lily HBase NRT Indexer Service; Steps to build indexing

HBase & Solr - Near Real time indexing and search

Requirement: A. HBase […]

Resilient Distributed Dataset

Table of Contents: What is an RDD?; Why RDD in Spark?; Data Sharing in MapReduce; Data Sharing in Spark; RDD Abstraction; How to program with RDD; Example 1: Creating an RDD of Strings with textFile() in Python; Example 2: Calling the filter() transformation; Example 3: Calling first() action; Example 4: Persisting an RDD in memory; Lazy Evaluation

What is an RDD?

A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable […]
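The outline above mentions transformations such as filter(), actions such as first(), persisting, and lazy evaluation. As a hedged illustration of how those ideas fit together, here is a toy model in plain Python. It does not use Spark: the class `LazyDataset`, the function `read_lines`, and the sample log lines are all hypothetical stand-ins, and a real RDD is also partitioned and distributed, which this sketch ignores.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, steps=()):
        self._source = source    # zero-argument callable returning an iterable
        self._steps = steps      # predicates to apply, in order
        self._persist = False
        self._cache = None

    def filter(self, predicate):
        # Transformation: returns a new dataset and computes nothing yet.
        return LazyDataset(self._source, self._steps + (predicate,))

    def persist(self):
        # Ask for the result to be kept in memory once it has been computed.
        self._persist = True
        return self

    def _iterate(self):
        if self._cache is not None:
            return iter(self._cache)
        rows = iter(self._source())
        for predicate in self._steps:
            rows = filter(predicate, rows)
        return rows

    def first(self):
        # Action: triggers evaluation and stops at the first matching row.
        return next(self._iterate())

    def collect(self):
        # Action: triggers evaluation of the whole dataset.
        rows = list(self._iterate())
        if self._persist:
            self._cache = rows
        return rows


reads = []  # one entry is appended each time the source is actually read


def read_lines():
    reads.append(1)
    return ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]


lines = LazyDataset(read_lines)
errors = lines.filter(lambda line: line.startswith("ERROR")).persist()
reads_before_action = len(reads)      # nothing has been read so far

first_error = errors.first()          # reads the source
all_errors = errors.collect()         # reads the source again, then caches
all_errors_again = errors.collect()   # served from the cache
```

In this sketch the source is read only when an action runs, and the second collect() does not read it again because the persisted result is reused.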

Impala Best Practices

Below are Impala performance tuning options.

Table of Contents: Pre-execution Checklist; Data Type Choices; Partitioning; Use Parquet Columnar Format for HDFS; Quick Note on Compression; Snappy; Gzip/Zlib; Left-Deep Join Tree; Types of Hash Joins; Broadcast; Shuffle; How to use ANALYZE; Hinting Joins; Determining Join Type From EXPLAIN; Memory Requirements for Joins & Aggregates

Pre-execution Checklist
- Data types
- Partitioning
- File Format

Data Type Choices
- Define integer columns as INT/BIGINT
- Operations on INT/BIGINT are more efficient than on STRING
- Convert […]

Apache Storm Integration With Apache Kafka

Installing Apache Storm

1. Install the prerequisites that Storm needs on the machine.

a. Download and installation commands for ZeroMQ 2.1.7. Run the following commands in a terminal:

b. Download and installation commands for JZMQ: 

2. Download the latest Storm release from http://storm.apache.org/downloads.html

Second, start the Storm cluster by starting the master and worker nodes. To start the master node, i.e. Nimbus, go to the ‘bin’ directory of the […]

Kafka Design

While developing Kafka, the main focus was to provide the following:

- An API for producers and consumers to support custom implementation
- Low overheads for network and storage with message persistence on disk
- A high throughput supporting millions of messages for both publishing and subscribing, for example, real-time log aggregation or data feeds
- Distributed and highly scalable architecture to handle low-latency delivery
- Auto-balancing multiple consumers in the […]

Kafka Installation

There are a number of ways in which Kafka can be used in an architecture. This section discusses some of the popular use cases for Apache Kafka and the well-known companies that have adopted Kafka.

Table of Contents: Log aggregation; Stream processing; Commit logs; Click stream tracking; Messaging; Setting Up a Kafka Cluster; Topic; Broker; Zookeeper; Producers; Consumer; A single node - a single broker cluster; Creating a Kafka topic; Starting a producer to send messages; Starting a […]

Cassandra production scenarios/issues

Table of Contents: Production issue; Few Production configurations in Cassandra; RetryPolicy; DefaultRetryPolicy; DowngradingConsistencyRetryPolicy; Reconnection Policy; ConstantReconnectionPolicy; ExponentialReconnectionPolicy (default); Load Balancing Policy; RoundRobinPolicy; DCAwareRoundRobinPolicy; TokenAwarePolicy (default)

Production issue: When we tried to run a SELECT query with 8 lakh (800,000) IDs in an IN condition, we hit the issue below.

To solve the above exception, we used distributed calls in the Java client, as shown below:
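The original Java snippet is not reproduced in this excerpt, so the author's exact approach is unknown. One common way to avoid a single huge IN clause is to split the ID list into small batches and issue one query per batch. The sketch below shows that idea in Python rather than Java; the names `chunked`, `fetch_in_batches`, `run_query`, and the batch size of 1000 are assumptions for illustration, and `fake_run_query` stands in for a real driver call.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(ids, run_query, batch_size=1000):
    """Run one query per batch of ids and merge the returned rows.

    `run_query` is a hypothetical callable that takes a list of ids and
    returns the matching rows (for example, by executing a prepared
    SELECT ... WHERE id IN ? statement through a driver session).
    """
    rows = []
    for batch in chunked(ids, batch_size):
        rows.extend(run_query(batch))
    return rows


def fake_run_query(batch):
    # Stand-in for a real database call: echoes one row per requested id.
    return [{"id": i} for i in batch]


all_ids = list(range(2500))
batch_sizes = [len(batch) for batch in chunked(all_ids, 1000)]
result = fetch_in_batches(all_ids, fake_run_query, batch_size=1000)
```

In a real client the batches could also be sent asynchronously and in parallel; this sketch runs them one after another to keep the logic easy to follow.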

Few Production configurations in Cassandra

RetryPolicy

There are three scenarios for which you can control the retry policy:

Read timeout: When a […]

Cassandra query language (CQL) and Cassandra Java Client Example

Table of Contents: Cassandra Table structure/Terminology; CQL Commands; Creating a key-space; Use the keyspace (will use that key space); Get list of key spaces; Create table; Get list of tables in a key-space; Insert data into table; Describe table; Create index; Update data in table; Delete data in table; Limitations in CQL; Java client example; Miscellaneous commands in CQL; Get size of table; Flush data into SSTable/disk; Copy table content from table to csv file; Cassandra Pooling options

Cassandra Table structure/Terminology

Before going to learn CQL commands, we just need to […]

Cassandra write and read process

Table of Contents: Storage engine; How data is written?; Compaction; Types of compaction; How is data updated?; How is data deleted?; How is data read?; How do write patterns affect reads?

Storage engine

Cassandra uses a storage structure similar to a Log-Structured Merge Tree, unlike a typical relational database that uses a B-Tree. Cassandra avoids reading before writing. Read-before-write, especially in a large distributed system, can produce stalls in read performance and other problems. Cassandra never re-writes or […]
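To make the log-structured idea more concrete, here is a toy, single-process model in Python. It is not Cassandra's implementation: the class `ToyLSMStore`, the `memtable_limit` parameter, and the tombstone marker are hypothetical, and real Cassandra also has a commit log, timestamps, and compaction, which are omitted here. The sketch only shows that writes never read or modify older data, and that reads consult the newest data first.

```python
TOMBSTONE = object()  # marker written in place of a value when a key is deleted


class ToyLSMStore:
    """Toy log-structured store: in-memory table plus immutable flushed segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}       # recent writes, held in memory
        self.segments = []       # flushed immutable segments, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A write only touches the memtable; older data is never read or edited.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just another write, recording a tombstone.
        self.put(key, TOMBSTONE)

    def flush(self):
        # Freeze the memtable into a sorted, immutable segment.
        if self.memtable:
            ordered = sorted(self.memtable.items(), key=lambda item: item[0])
            self.segments.append(tuple(ordered))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments from new to old.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for segment in reversed(self.segments):
            for stored_key, value in segment:
                if stored_key == key:
                    return None if value is TOMBSTONE else value
        return None


store = ToyLSMStore(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)      # memtable is full, so it is flushed as segment 0
store.put("a", 10)     # an update is a new write; segment 0 is left untouched
store.delete("b")      # tombstone written; memtable flushed as segment 1
```

After these calls the old value of "a" still sits in the first segment, but reads return the newer value from the second one, which is why a separate compaction step is needed to reclaim space.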

Cassandra Architecture

Cassandra is designed so that there is no single point of failure (SPOF). There is no master-slave architecture in Cassandra. Cassandra addresses the problem of SPOF by employing a peer-to-peer distributed system across homogeneous nodes, where data is distributed among all nodes in the cluster. All nodes in Cassandra are the same. Each node frequently exchanges state information about itself […]