Impala Miscellaneous Functions

Impala Conditions with Example

Impala supports the following conditional functions for testing equality, comparison operators, and nullity:

‘Case’ Example:

1)  If else

select case when 20 > 10 then 20 else 15 end;

Output:  20

2) If else if

select case when 9 > 10 then 20 when 1 > 2 then 1.0 else 15 end;

Output:  15

=====================================================================================

‘Coalesce’ Function Example:

The COALESCE function in Impala returns the first non-NULL expression among its arguments.

Simple Example:

select coalesce (NULL, 'b', 'c');

Output:  b

select coalesce ('a', 'Null', 'c');

Output: a

 

USE CASE: We want to find out the best way to contact each person according to the following rules:

  1. If a person has a mobile phone, use the mobile phone number.
  2. If a person does not have a mobile phone but has a home phone, use the home phone number.
  3. If a person has neither a mobile phone nor a home phone but has an office phone, use the office phone number.

Create Table:

> create table coalesce_demo (name string, mobileno int, homeno int, officeno int, city string);

Insert Data:

> insert into coalesce_demo values ('user1', NULL, NULL, 6654756, 'Hyderabad');

> insert into coalesce_demo values ('user2', NULL, 1234567, NULL, 'Chennai');

> insert into coalesce_demo values ('user3', 9874561, NULL, 6654756, 'Vijayawada');

Query:

> select name, coalesce(mobileno, homeno, officeno), city from coalesce_demo;

 

OUTPUT:

+-------+--------------------------------------+------------+
| name  | coalesce(mobileno, homeno, officeno) | city       |
+-------+--------------------------------------+------------+
| user3 | 9874561                              | Vijayawada |
| user2 | 1234567                              | Chennai    |
| user1 | 6654756                              | Hyderabad  |
+-------+--------------------------------------+------------+

================================================================

DECODE Function Example:

Create Table:

>create table decodetable (empid INT, empname STRING, empcountry STRING, empage INT);

Insert Data:

>insert into decodetable values(11,'Azmal','Vijayawada',21);

>insert into decodetable values(12,'Raj','Delhi',35);

>insert into decodetable values(13,'Rahul','Chennai',40);

>insert into decodetable values(14,'Phani','Bangalore',15);

Query:

> select empid, decode(empname, 'Ravi','RV', 'Raj','RJ', 'Rahul','RH', 'Phani','PN', 'Azmal','AZ') ShortCode, empcountry, empage from decodetable;

Output:

+-------+-----------+------------+--------+
| empid | shortcode | empcountry | empage |
+-------+-----------+------------+--------+
| 14    | PN        | Bangalore  | 15     |
| 12    | RJ        | Delhi      | 35     |
| 13    | RH        | Chennai    | 40     |
| 11    | AZ        | Vijayawada | 21     |
+-------+-----------+------------+--------+

‘shortcode’ is the name given to the column with the DECODE statement

==================================================================================

ISNULL ():

If the column value is NULL, ISNULL() replaces it with the given default value.

 

Sample Table:

Query: select * from emp1;

+-------+---------+------------+--------+
| empid | empname | empcountry | empage |
+-------+---------+------------+--------+
| NULL  | Azmal   | NULL       | 15     |
| 12    | Raj     | Delhi      | 35     |
| 11    | Ravi    | hyd        | 32     |
| 13    | Rahul   | chennai    | 40     |
| 14    | Phani   | Bangalore  | 15     |
+-------+---------+------------+--------+

 

ISNull Query:

select isnull (empid, 15) from emp1;

 

Output:

+-------------------+
| isnull(empid, 15) |
+-------------------+
| 12                |
| 15                |
| 13                |
| 11                |
| 14                |
+-------------------+

==================================================================================

NULLIF ():

The syntax for the NULLIF function in Impala is:

NULLIF (expression1, expression2)

*expression1, expression2

The expressions that will be compared. Values must be of the same datatype.

*Note:

expression1 can be an expression that evaluates to NULL, but it cannot be the literal NULL.

 

Simple Examples:

SELECT NULLIF ('Azmal Sheik', 'Azmal Sheik');

Result: NULL

(returns NULL because values are the same)

 

SELECT NULLIF('hadooptutorial.info', 'google.com');

Result: ‘hadooptutorial.info’ (returns first value because values are different)

 

SELECT NULLIF (12, 12);

Result: NULL

(returns NULL because values are the same)

 

SELECT NULLIF (12, 45);

Result: 12

(returns first value because values are different)

==================================================================================

NULLIFZERO:

NULLIFZERO converts a zero value to NULL. It is one of the more useful Impala functions for avoiding divide-by-zero problems.

Sample Table:

Query: select * from NullIfZeroTable;

+-------+-------------+------------+--------+
| empid | empname     | empcountry | empage |
+-------+-------------+------------+--------+
| 13    | Rahul       | Chennai    | 40     |
| 18    | Sheik Azmal | Hyderabad  | 0      |
| 14    | Phani       | Bangalore  | 15     |
| 12    | Raj         | Delhi      | 35     |
| 11    | Ravi        | Vijayawada | 32     |
+-------+-------------+------------+--------+

Fetched 5 row(s) in 0.36s

Query: select nullifzero (empage) from NullIfZeroTable;

Output:

+--------------------+
| nullifzero(empage) |
+--------------------+
| 32                 |
| NULL               |
| 40                 |
| 15                 |
| 35                 |
+--------------------+

Fetched 5 row(s) in 0.37s
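Where NULLIFZERO earns its keep is in arithmetic: wrapping a denominator in nullifzero() turns a would-be divide-by-zero into a NULL result instead. A minimal sketch against the same table (the constant 1000 is only an illustrative numerator; output not shown):

-- the Sheik Azmal row (empage = 0) yields NULL for ratio instead of a divide-by-zero problem
select empname, 1000 / nullifzero(empage) as ratio from NullIfZeroTable;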

 

Related Functions:

ZEROIFNULL

Replace NULL values with 0
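A minimal ZEROIFNULL sketch against the emp1 table used earlier (output not shown):

-- empid is NULL for one row in emp1; zeroifnull() returns 0 for that row and the stored value otherwise
select zeroifnull(empid) from emp1;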

 

 

NVL ():

This function replaces a NULL value with another value. It is similar to the IFNULL function in Impala and the ISNULL function.

Query: select * from emp1;

+-------+---------+------------+--------+
| empid | empname | empcountry | empage |
+-------+---------+------------+--------+
| NULL  | Azmal   | NULL       | 15     |
| 12    | Raj     | Delhi      | 35     |
| 11    | Ravi    | hyd        | 32     |
| 13    | Rahul   | chennai    | 40     |
| 14    | Phani   | Bangalore  | 15     |
+-------+---------+------------+--------+

Fetched 5 row(s) in 0.40s

 

Query: select nvl(empid, 20) from emp1;

+----------------+
| nvl(empid, 20) |
+----------------+
| 20             |
| 12             |
| 11             |
| 13             |
| 14             |
+----------------+

 

Impala String Functions

Char_length ():

Returns the length in characters of the argument string. Alias for the length () function.

 

Query: select char_length('Impala');

Output:

+------------------------+
| char_length('impala')  |
+------------------------+
| 6                      |
+------------------------+

 

 

Concat ():

This function concatenates two or more strings into a single string.

Query:

select concat (empname, empcountry), empage, empcountry from NullIfZeroTable;

Output:

+------------------------------+--------+------------+
| concat(empname, empcountry)  | empage | empcountry |
+------------------------------+--------+------------+
| RaviVijayawada               | 32     | Vijayawada |
| PhaniBangalore               | 15     | Bangalore  |
| Sheik AzmalHyderabad         | 0      | Hyderabad  |
| RahulChennai                 | 40     | Chennai    |
| RajDelhi                     | 35     | Delhi      |
+------------------------------+--------+------------+

 

 

Find_in_set ():

 FIND_IN_SET function returns the position of a string in a comma-delimited string list.

Query:

select FIND_IN_SET ('b', 'a,b,c,d,e,f');

Output:

+----------------------------------+
| find_in_set('b', 'a,b,c,d,e,f')  |
+----------------------------------+
| 2                                |
+----------------------------------+

Fetched 1 row(s) in 0.01s

 

Repeat ():

Returns the argument string repeated a specified number of times.

 

Query:

select repeat ('Azmal ', 5);

 

Output:

+-------------------------------+
| repeat('azmal ', 5)           |
+-------------------------------+
| Azmal Azmal Azmal Azmal Azmal |
+-------------------------------+

Fetched 1 row(s) in 0.01s

 

For reference, here are more Impala string functions:

reverse (string a)

Purpose: Returns the argument string with characters in reversed order.

Return type: string

 

rpad (string str, int len, string pad)

Purpose: Returns a string of a specified length, based on the first argument string. If the specified string is too short, it is padded on the right with a repeating sequence of the characters from the pad string. If the specified string is too long, it is truncated on the right.

Return type: string

 

rtrim (string a)

Purpose: Returns the argument string with any trailing spaces removed from the right side.

Return type: string

 

space (int n)

Purpose: Returns a concatenated string of the specified number of spaces. Shorthand for repeat (' ', n).

Return type: string

     

Strleft (string a, int num_chars)

Purpose: Returns the leftmost num_chars characters of the string. Shorthand for a call to substr () with 2 arguments.

Return type: string

 

Strright (string a, int num_chars)

Purpose: Returns the rightmost num_chars characters of the string. Shorthand for a call to substr () with 2 arguments.

Return type: string      

      

substr (string a, int start [, int len]), substring (string a, int start [, int len])

Purpose: Returns the portion of the string starting at a specified point, optionally with a specified maximum length. The characters in the string are indexed starting at 1.

Return type: string

 

Translate (string input, string from, string to)

Purpose: Returns the input string with a set of characters replaced by another set of characters.

Return type: string

 

Trim (string a)

Purpose: Returns the input string with both leading and trailing spaces removed. The same as passing the string through both ltrim () and rtrim ().

Return type: string

 

Upper (string a), ucase (string a)

Purpose: Returns the argument string converted to all-uppercase.

Return type: string
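As a quick combined illustration of several of these functions, here is a minimal sketch against the emp1 table from the earlier examples (the aliases are illustrative; output omitted):

select upper(empname)             as name_upper,
       substr(empcountry, 1, 3)   as country_prefix,
       rpad(empname, 10, '*')     as padded_name,
       trim(concat(empname, ' ')) as trimmed_name
from emp1;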

=======================================================

Regexp_Extract ():

Returns the part of a string that matches a regular expression; the third argument selects which capture group to return (group 0 is the entire match).

For reference, related pattern-matching operators (LIKE, RLIKE, REGEXP):

select * from table_azmal where string_col like 'test%';

select * from table_azmal where string_col like string_col;

select * from table_azmal where 'test' like string_col;

select * from table_azmal where string_col rlike 'test%';

select * from table_azmal where string_col regexp 'test.*';
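And a minimal regexp_extract() sketch (the literal input string is only an illustration):

-- returns '123'; index 0 selects the entire match
select regexp_extract('abc123xyz', '[0-9]+', 0);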

 

PMD (Programming Mistake Detector)

What is PMD?

  • PMD, aka Programming Mistake Detector, is a Java source code analyzer.
  • It is used to detect and clean up problematic code in Java projects based on a predefined set of rules.
  • PMD supports the ability to write custom rules.
  • Issues reported by PMD are not always true errors; they are often just inefficient code, i.e. the application could still function properly even if they were not corrected.

PMD works by scanning Java code and checks for violations in three major areas:

  • Compliance with coding standards such as:
    • Naming conventions – class, method, parameter and variable names
    • Class and method length
    • Existence and formatting of comments and JavaDocs
  • Coding anti-patterns such as:
    • Empty try/catch/finally/switch blocks
    • Unused local variables, parameters and private methods
    • Empty if/while statements
    • Over-complicated expressions – unnecessary if statements, for loops that could be while loops
    • Classes with high Cyclomatic Complexity measurements
  • Cut and Paste Detector (CPD)– a tool that scans files and looks for suspect code replication. CPD can be parameterized by the minimum size of the code block.

PMD is able to detect flaws or possible flaws in source code, like:

  • Possible bugs —Empty try/catch/finally/switch blocks.
  • Dead code —Unused local variables, parameters and private methods.
  • Empty if/while statements.
  • Over-complicated expressions—Unnecessary if statements, for loops that could be while loops.
  • Sub-optimal code—Wasteful String/StringBuffer usage.
  • Duplicate code—Copied/pasted code can mean copied/pasted bugs, and decreases maintainability.

How to install PMD?

The easiest way to install PMD is by using the remote update site. Users behind firewalls should check proxy settings before going any further. If these settings are mis-configured the updater will not work. PMD also supplies a zip file for manual installation. Download the file and follow the readme.

Demonstrated below is installing PMD via the Eclipse Software Updater.

  1. Launch Eclipse.
  2. Navigate to Help -> Software Updates -> Find and Install…
  3. Select “Search for new features to install” and click Next
  4. Click New Remote Site…
  5. Enter a name (PMD) and the URL http://pmd.sourceforge.net/eclipse
  6. In Sites to include in search, check PMD and click Finish
  7. In the Search Results dialog check PMD for Eclipse 3 3.1.0 and click Next
  8. Accept the terms of the license agreements and click Next
  9. Click Finish.
  10. Wait for Eclipse to download the required jar files, then click Install
  11. Restart the workbench. This will load the PMD plugin.
  12. Navigate to Window -> Show View -> …
  13. Select PMD -> PMD Violations
  14. The PMD Violations view should appear in the bottom pane

How to use PMD?

Before launching Eclipse make sure you have enough memory for PMD. This is particularly important when analyzing large projects. In these situations PMD tends to be memory-hungry. Hence, make sure to start with as much memory as you can afford, for example 512M (eclipse.exe -vmargs -Xmx512M)

  1. Launch Eclipse
  2. If you have previously created a Java Project, skip to Step 6. Otherwise click File -> New…
  3. Select Java Project and click Next
  4. Enter a project name (QA Project) and leave everything else in the default state.
  5. Click Finish. Eclipse will ask if you want to switch to the Java Perspective. Click Yes.
  6. In the Package Explorer right-click on QA Project and select New -> Class
  7. In the following dialog enter the class name as Ying and click Finish
  8. A new class Ying is created in the project’s default package. Paste the sample code into the new class (a minimal stand-in is sketched after this list).
  9. In the Package Explorer right-click on QA Project and select PMD -> Check Code With PMD
  10. Wait for PMD to scan Ying and Yang
  11. If the PMD Violations view is not open, navigate to Window -> Show View -> … and select PMD -> PMD Violations
  12. In the PMD Violations view notice a list of 17 violations. In large projects this list could easily grow up to several thousand. This is one of the reasons PMD allows violations to be filtered by priority. Priority is a configurable attribute of a PMD rule. PMD assigns priorities from 1 to 5 and each priority is represented by a colored square at the top-right corner of the view. These little squares are actually clickable on-off switches used to control the visibility of the violations they represent.
    The results table itself is well laid out and most columns are sortable. It is also possible to get more detail on a violation by simply right-clicking it and selecting Show rule. PMD pops up a dialog with information such as rule name, implementation class, message, description and an example. This feature can be helpful when trying to make sense of a new rule or letting in-house developers know about a particular company rule or coding convention.
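A hypothetical stand-in for the Ying sample class: a few deliberate PMD violations (an unused local variable, an empty if block) plus a method that can be copied verbatim into a second class named Yang to trigger the CPD duplication report shown further down. The class body is illustrative, not the original listing.

import static java.lang.System.out;

// Hypothetical sample class: written only to trigger PMD/CPD findings.
public class Ying {

    public static void main(String[] args) {
        int unusedLocal = 42;            // violation: unused local variable
        if (args.length == 0) {
            // violation: empty if statement
        }
        out.println(new Ying().thisIsCutAndPaste("Good", 1));
    }

    // Copy this method verbatim into a second class named Yang so that
    // CPD reports the duplication shown in cpd-report.txt below.
    public String thisIsCutAndPaste(String pFirst, int pSecond) {
        out.println("New world");
        return "New world";
    }
}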

[Figure: PMD5]

Finally, in the PMD Violations table it is possible to add review annotations to the source where the violation occurred. This can be an effective way to mark target files for further review. Just right-click on any violation in the list and select Mark review. PMD will insert a review annotation to the Java source right above the violation line itself. The review annotation should look like this:

 

// @PMD:REVIEWED:MethodNamingConventions: by Levent Gurses on 3/28/04 5:04 PM

 

Review annotations can be removed anytime by right-clicking QA Project and selecting PMD | Clear violations reviews. Similarly, PMD Violations can be cleaned-up by right-clicking QA Project and selecting PMD | Clear PMD Violations.

Finding Cut and Paste Code (CPD):

Repeated (Cut & Paste) code generally indicates poor planning or team coordination. Therefore, refactoring classes with repeating code should be given a high priority. PMD can help identify these classes by scanning the code in a way similar to PMD violation checks. The number of lines of similarity (the metrics used by PMD to match code patterns) is 25 by default and can be set in PMD’s Preferences page.

  1. In Package Explorer right-click on QA Project and select PMD -> Find Suspect Cut And Paste
  2. PMD creates a folder reports under QA Project and stores the result text file
  3. Select Window -> Show View -> Navigator
  4. In the Navigator view click on QA Project and then on the reports folder
  5. The report file cpd-report.txt should look like this:

=====================================================================
Found a 8 line (25 tokens) duplication in the following files:
Starting at line 6 of C:\temp\QA Project\Ying.java
Starting at line 23 of C:\temp\QA Project\Yang.java
new Yang("Bad").start();
}
public String thisIsCutAndPaste(String pFirst, int pSecond) {
out.println("New world");
return "New world";
}
}

We can generate two types of reports from PMD after integrating PMD dependencies with Maven and building the project.

Below is the Maven pom.xml configuration for the PMD plugin.

Working POM configuration

<build>
     <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-pmd-plugin</artifactId>
        <version>4.0.8</version>
      </plugin>
     </plugins>      
    </build>
    <reporting>
          <plugins>
            <plugin>
              <groupId>org.apache.maven.plugins</groupId>
              <artifactId>maven-pmd-plugin</artifactId>
              <version>4.0.8</version>
              <configuration>
                 <linkXref>true</linkXref>
                     <sourceEncoding>utf-8</sourceEncoding>
                     <minimumTokens>100</minimumTokens>
                 <targetJdk>1.8</targetJdk>
                 <excludes>
                     <exclude>**/*Bean.java</exclude>
                     <exclude>**/generated/*.java</exclude>
                 </excludes>             
                 <excludeRoots>
                     <excludeRoot>target/generated-sources/stubs</excludeRoot>
                 </excludeRoots>
              </configuration>
             </plugin>
          </plugins>
   </reporting>
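With this plugin configuration in place, the reports can be generated from the command line; a minimal sketch using the standard maven-pmd-plugin goals:

# Generate the PMD and CPD reports (written under target/)
mvn pmd:pmd pmd:cpd

# Or build the full project site, which includes both reports
mvn clean install site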

 

Creating UDF and UDAF for Impala

 Installing the UDF Development Package

Get the quickstart VM
 sudo yum install gcc-c++ cmake boost-devel
 sudo yum install impala-udf-devel
 download https://github.com/cloudera/impala-udf-samples/archive/master.zip
 unzip the downloaded archive
 cd into impala-udf-samples-master
 cmake .

The output will be like below code.

[cloudera@quickstart impala-udf-samples-master]$ cmake .
-- The C compiler identification is GNU 4.4.7
-- The CXX compiler identification is GNU 4.4.7
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /home/cloudera/Downloads/impala-udf-samples-master

make

Upload Library Files to HDFS

  • UDF

hadoop fs -put /home/cloudera/Downloads/impala-udf-samples-master/build/libudfsample.so /user/udf_udafs

  • UDAF

hadoop fs -put -f /home/cloudera/Downloads/impala-udf-samples-master/build/libudasample.so /user/udf_udafs/

Create UDF&UDAF Functions

  • UDF

impala-shell

create function countvowels(string) returns int location '/user/udf_udafs/libudfsample.so' SYMBOL='CountVowels';

ex: SELECT countvowels(colname) from table;

  • UDAF

create aggregate function my_count(int) returns bigint location '/user/udf_udafs/libudasample.so' update_fn='CountUpdate';

ex: SELECT my_count(col_name) from table;
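To confirm the functions were registered, and to drop them later, the usual Impala statements can be used; a minimal sketch:

-- list registered scalar and aggregate functions in the current database
show functions;
show aggregate functions;

-- remove them when no longer needed
drop function countvowels(string);
drop aggregate function my_count(int);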

 

Postgres Commands

CREATE

CREATE TABLE playground (
equip_id serial PRIMARY KEY,
type varchar (50) NOT NULL,
color varchar (25) NOT NULL,
location varchar(25) check (location in ('north', 'south', 'west', 'east', 'northeast', 'southeast', 'southwest', 'northwest')),
install_date date
);

We can see our new table by typing this:

postgres=# \d

postgres=# \dt

List of relations

 Schema |    Name    | Type  |  Owner
--------+------------+-------+----------
 public | playground | table | postgres
(1 row)

INSERT

postgres=# INSERT INTO playground (type, color, location, install_date) VALUES ('slide', 'blue', 'south', '2014-04-28');
INSERT 0 1
postgres=# INSERT INTO playground (type, color, location, install_date) VALUES ('swing', 'yellow', 'northwest', '2010-08-16');
INSERT 0 1

 

  • Message returned if only one row was inserted. oid is the numeric OID of the inserted row.

Ex: INSERT oid 1

  • Message returned if more than one rows were inserted. # is the number of rows inserted.

Ex: INSERT 0 #

SYNTAX:

INSERT INTO TABLE_NAME (Column1,column2,…columnN) VALUES (value1,value2,….valueN);

SELECT

postgres=# SELECT * FROM playground;

The output will look like this:

 equip_id | type  | color  | location  | install_date
----------+-------+--------+-----------+--------------
        1 | slide | blue   | south     | 2014-04-28
        2 | swing | yellow | northwest | 2010-08-16
(2 rows)

DELETE

DELETE FROM playground WHERE type = 'slide';

ALTER

ALTER TABLE playground ADD last_maint date;

UPDATE

UPDATE playground SET color = 'red' WHERE type = 'swing';

LIKE CLAUSE

The % wildcard matches any sequence of characters, while _ matches exactly one character.

postgres=# SELECT * FROM playground WHERE type like 'slide%';

postgres=# SELECT * FROM playground WHERE type like '%slide%';

postgres=# SELECT * FROM playground WHERE type like 'slide_';

postgres=# SELECT * FROM playground WHERE type like '_slide_';

LIMIT CLAUSE

postgres=# SELECT * FROM playground LIMIT 4 OFFSET 1;

The above command makes psql return at most 4 rows, starting from the 2nd row.

LIMIT = the maximum number of rows to return.

OFFSET = the number of rows to skip before rows start being returned.

ORDER BY

postgres=# SELECT * FROM playground ORDER BY equip_id ASC;

GROUP BY

postgres=# SELECT location, count(*) FROM playground GROUP BY location;

WITH CLAUSE

WITH provides a way to write subqueries for use in a larger SELECT query. The subqueries, which are often referred to as Common Table Expressions or CTEs, can be thought of as defining temporary tables that exist just for this query. One use of this feature is to break down complicated queries into simpler parts. An example is:

WITH test AS
(SELECT equip_id, type, color, location, install_date
 FROM playground)
SELECT * FROM test;

Basic Syntax:

WITH
   name_example AS (
      SELECT statement)
SELECT columns
FROM name_example
WHERE conditions <=> (
   SELECT column FROM name_example)
[ORDER BY columns];

HAVING CLAUSE

The HAVING clause is like a WHERE condition that is applied to the groups produced by GROUP BY.

postgres=# SELECT location, count(*) FROM playground GROUP BY location HAVING count(*) > 1;

Basic Syntax:

SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY

DISTINCT

postgres=# SELECT DISTINCT type FROM playground;

 

Postgres Installation On Centos

To install the server locally use the command line and type

 sudo apt-get install postgresql postgresql-contrib
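The command above is the Debian/Ubuntu form; on CentOS (as in the title of this post) the rough equivalent, assuming CentOS 7 and the distribution packages, would be:

sudo yum install postgresql-server postgresql-contrib
sudo postgresql-setup initdb
sudo systemctl enable postgresql
sudo systemctl start postgresql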

To start off, we need to set the password of the PostgreSQL user (role) called “postgres”; we will not be able to access the server externally otherwise. As the local “postgres” Linux user, we are allowed to connect and manipulate the server using the psql command.

In a terminal, type:

sudo -u postgres psql postgres

This connects as a role with the same name as the local user, i.e. “postgres”, to the database called “postgres” (the 1st argument to psql).

Set a password for the “postgres” database role using the command:

\password postgres

and give your password when prompted. The password text will be hidden from the console for security purposes.

Type Control+D or \q to exit the PostgreSQL prompt.

hadoop1@ubuntu:~$ sudo -i -u postgres

postgres@ubuntu:~$ psql

psql (9.3.9)

Type "help" for help.

postgres=# \conninfo

You are connected to database "postgres" as user "postgres" via socket in "/var/run/postgresql" at port "5432".
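From here a first role and database can be created; a minimal sketch (the role and database names are made up):

postgres=# CREATE USER demo_user WITH PASSWORD 'demo_password';

postgres=# CREATE DATABASE demo_db OWNER demo_user;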

 

HBase & Solr – Near real-time indexing and search

Requirement:

A. HBase Table

B. Solr collection on HDFS

C. Lily HBase Indexer.

D. Morphline Configuration file

Once the Solr server is ready, we can configure our collection (in SolrCloud), which will be linked to the HBase table.

  • Add the required properties to the hbase-site.xml file.
  • Add the properties below to /etc/hbase-solr/conf/hbase-indexer-site.xml. This will enable the Lily indexer to reach the HBase cluster for indexing. Replace the values with your own: use the same hbase-cluster-zookeeper value as in hbase-site.xml; for a local environment its value is localhost.

<property>
   <name>hbase.zookeeper.quorum</name>
   <value>hbase-cluster-zookeeper</value>
</property> 
<property>
   <name>hbaseindexer.zookeeper.connectstring</name>
   <value>hbase-cluster-zookeeper:2181</value>
</property> 

  • Restart below services

$ sudo service hbase-solr-indexer restart
$ sudo service solr-server restart

  • Create an HBase table with replication
Since the HBase Indexer works by acting as a Replication Sink, we need to make sure that replication is enabled in HBase. You can activate replication using Cloudera Manager by clicking HBase Service -> Configuration -> Backup and ensuring “Enable HBase Replication” and “Enable Indexing” are both checked.
In addition, the column family in the HBase table that needs to be replicated must have replication enabled. This can be done by ensuring that the REPLICATION_SCOPE flag is set while the column family is created, as shown below:

hbase shell> create 'EmployeeTable', {NAME => 'data', REPLICATION_SCOPE => 1}

  • Create Solr cloud collection

$ solrctl instancedir --generate $HOME/hbase-collection1

Once you run the above command, go to $HOME/hbase-collection1/conf, which contains the Solr configuration files; edit the schema.xml file

$ nano $HOME/hbase-collection1/conf/schema.xml

with our own schema. For this use case we have to add the field below, which corresponds to the HBase column family (data).

<field name="data" type="text_general" indexed="true" stored="true" multiValued="true"/>

  • Create a Solrcloud collection with the above schema.xml

$ solrctl instancedir --create hbase-collection1 $HOME/hbase-collection1
$ solrctl collection --create hbase-collection1

Creating a Lily HBase Indexer configuration

$ nano $HOME/morphline-hbase-mapper.xml 
<?xml version="1.0"?>
     <indexer table="EmployeeTable" mapper="com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper" mapping-type="column">
          <param name="morphlineFile" value="/etc/hbase-solr/conf/morphlines.conf"/>
     </indexer>

Creating a Morphline Configuration File

$ nano /etc/hbase-solr/conf/morphlines.conf

morphlines : [{
    id : morphline1
    importCommands : ["org.kitesdk.morphline.**", "com.ngdata.**"]
    commands : [
      { extractHBaseCells { mappings : [ { inputColumn : "data:*"  outputField : "data"  type : string  source : value } ] } }
      { logTrace { format : "output record: {}", args : ["@{}"] } } ] }]

Starting & Registering a Lily HBase Indexer configuration with the Lily HBase Indexer Service

 

  • Start the hbase-indexer:

$ /usr/bin/hbase-indexer server

  • Registering indexer

The Lily HBase Indexer service provides a command-line utility that can be used to add, list, update and delete indexer configurations. The command shown below registers an indexer configuration with the HBase Indexer. This is done by passing the indexer configuration XML file along with the ZooKeeper ensemble information used for HBase and Solr, and the Solr collection name.

$ hbase-indexer add-indexer --name myNRTIndexer \
  --indexer-conf $HOME/morphline-hbase-mapper.xml \
  --connection-param solr.zk=localhost:2181/solr \
  --connection-param solr.collection=hbase-collection1 \
  --zookeeper localhost:2181

  • Verify that the indexer was successfully created as follows:

$ hbase-indexer list-indexers

Verifying the indexing is working

Add rows to the indexed HBase table. For example:

$ hbase shell
hbase(main):001:0> put 'EmployeeTable', 'row1', 'data', 'value'
hbase(main):002:0> put 'EmployeeTable', 'row2', 'data', 'value2'

If the put operations succeed, wait a few seconds, then navigate to Search in the Hue UI and query the data. Note the updated rows in Search.
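The collection can also be queried directly over Solr's HTTP API; a minimal sketch (host, port and collection name assume the local quickstart setup used above):

$ curl "http://localhost:8983/solr/hbase-collection1/select?q=*:*&wt=json&indent=true"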

Configuring Lily HBase NRT Indexer Service for Use with Cloudera Search

Using the Lily HBase NRT Indexer Service

Steps to build indexing

  1. solrctl instancedir --generate $HOME/hbase-collection4
  2. rm -rf /home/babusi02/hbase-collection5/conf/schema.xml
  3. rm -rf /home/babusi02/hbase-collection4/conf/solrconfig.xml
  4. cp /home/babusi02/hbase-collection2/conf/schema.xml /home/babusi02/hbase-collection5/conf/
  5. cp /home/babusi02/hbase-collection2/conf/solrconfig.xml /home/babusi02/hbase-collection4/conf/
  6. nano /home/babusi02/hbase-collection4/conf/schema.xml
  7. nano /home/babusi02/hbase-collection4/conf/solrconfig.xml
  8. solrctl instancedir --create hbase-collection4 $HOME/hbase-collection4
  9. solrctl collection --create hbase-collection4
  10. nano $HOME/morphline-hbase-mapper.xml
  11. nano /etc/hbase-solr/conf/morphlines.conf
  12. hbase-indexer add-indexer --name Indexer6 --indexer-conf $HOME/morphline-hbase-mapper.xml --connection-param solr.zk=dayrhegapd016.enterprisenet.org:2181,dayrhegapd015.enterprisenet.org:2181,dayrhegapd014.enterprisenet.org:2181,dayrhegapd020.enterprisenet.org:2181,dayrhegapd019.enterprisenet.org:2181/solr --connection-param solr.collection=hbase-collection6 --zookeeper dayrhegapd020.enterprisenet.org:2181
  13. hbase-indexer list-indexers

Resilient Distributed Dataset

What is an RDD?

A Resilient Distributed Dataset (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.

Why RDD in Spark?

MapReduce is widely adopted for processing and generating large data sets with a parallel, distributed algorithm on a cluster. It allows users to write parallel computations using a set of high-level operators, without having to worry about work distribution and fault tolerance. However, sharing data between MapReduce jobs is slow due to replication, serialization, and disk I/O, so there was a need for an alternative programming model: the RDD. Below we can see how data is shared in the MapReduce and Spark programming models.

Data Sharing in MapReduce:

[Figure: MR Flow]

Data Sharing in Spark :

[Figure: Spark RDD]

Key idea: resilient distributed datasets (RDDs)

  1. Distributed collections of objects that can be cached in memory across cluster nodes
  2. Manipulated through various parallel operators
  3. Automatically rebuilt on failure

This results in 10-100× faster performance than sharing data through network and disk.

RDD Abstraction:

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs. We call these operations transformations to differentiate them from other operations on RDDs. Examples of transformations include map, filter, and join. RDDs do not need to be materialized at all times. Instead, an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage. This is a powerful property: in essence, a program cannot reference an RDD that it cannot reconstruct after a failure.

Finally, users can control two other aspects of RDDs: persistence and partitioning. Users can indicate which RDDs they will reuse and choose a storage strategy for them (e.g., in-memory storage). They can also ask that an RDD’s elements be partitioned across machines based on a key in each record. This is useful for placement optimizations, such as ensuring that two datasets that will be joined together are hash-partitioned in the same way.

How to program with RDD:

An RDD is simply a distributed collection of elements. In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

Users can create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects (e.g., a list or set) in their driver program. Let’s see loading a text file as an RDD of strings using SparkContext.textFile().

Example 1: Creating an RDD of strings with textFile() in Python:

>>>lines = sc.textFile("README.md")
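The other creation path mentioned above, distributing an in-driver collection, uses sc.parallelize(); a minimal sketch:

>>> nums = sc.parallelize([1, 2, 3, 4, 5])      # a local list becomes a distributed RDD
>>> squares = nums.map(lambda x: x * x)         # transformation (lazy)
>>> squares.collect()                           # action: [1, 4, 9, 16, 25]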

Once created, RDDs offer two types of operations: transformations and actions. Transformations construct a new RDD from a previous one. For example, one common transformation is filtering data that matches a predicate. In our text file example, we can use this to create a new RDD holding just the strings that contain the word Python, as shown in example 2 below.

Example 2: Calling the filter() transformation

>>>pythonLines = lines.filter(lambda line: "Python" in line)

Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS). One example of an action we called earlier is first(), which returns the first element in an RDD and is demonstrated in example 3 below.

Example 3 : Calling first() action

>>> pythonLines.first()
u'## Interactive Python Shell'

Transformations and actions are different because of the way Spark computes RDDs. Although you can define new RDDs any time, Spark computes them only in a lazy fashion—that is, the first time they are used in an action. This approach might seem unusual at first, but makes a lot of sense when you are working with Big Data.

Finally, Spark’s RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist(). After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in your cluster), and reuse them in future actions. Persisting RDDs on disk instead of memory is also possible. The behaviour of not persisting by default may again seem unusual, but it makes a lot of sense for big datasets: if you will not reuse the RDD, there’s no reason to waste storage space when Spark could instead stream through the data once and just compute the result. 

Example 4: Persisting an RDD in memory

>>> pythonLines.persist()
>>> pythonLines.count()
>>> pythonLines.first()
u'## Interactive Python Shell'

To summarize, every Spark program and shell session will work as follows:

  1. Create some input RDDs from external data.
  2. Transform them to define new RDDs using transformations like filter().
  3. Ask Spark to persist() any intermediate RDDs that will need to be reused.
  4. Launch actions such as count() and first() to kick off a parallel computation, which is then optimized and executed by Spark.
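Putting those four steps together, a minimal PySpark sketch (file name and filter word follow the earlier examples) looks like this:

>>> lines = sc.textFile("README.md")                        # 1. input RDD
>>> pythonLines = lines.filter(lambda l: "Python" in l)     # 2. transformation
>>> pythonLines.persist()                                   # 3. cache for reuse
>>> pythonLines.count()                                     # 4. actions
>>> pythonLines.first()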

Lazy Evaluation

As we read earlier, transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action. Lazy evaluation means that when we call a transformation on an RDD (for instance, calling map()), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as consisting of instructions on how to compute the data that we build up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are. So, when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can occur multiple times.

Impala Best Practices

Below are Impala performance tuning options:

Pre-execution Checklist

  •    Data types
  •    Partitioning
  •    File Format

Data Type Choices

  •      Define integer columns as INT/BIGINT
  •      Operations on INT/BIGINT more efficient than STRING
  •      Convert “external” data to good “internal” types on load
  •      e.g. CAST date strings to TIMESTAMPS
  •      This avoids expensive CASTs in queries later

Partitioning

  • The fastest I/O is the one that never takes place.
  • Understand your query filter predicates
  • For time-series data, this is usually the date/timestamp column
  • Use this/these column(s) for a partition key(s)
  • Validate that queries leverage partition pruning using EXPLAIN (see the sketch after this list)
  • You can have too much of a good thing
  • A few thousand partitions per table is probably OK
  • Tens of thousands partitions is probably too much
  • Partitions/Files should be no less than a few hundred MBs
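As an illustration of these points, here is a minimal sketch (table and column names are made up) of a date-partitioned table and an EXPLAIN check for pruning:

-- hypothetical time-series table partitioned by day
create table events_text (event_id bigint, payload string) partitioned by (event_date string);

-- a filter on the partition key lets Impala prune partitions; verify this in the plan
explain select count(*) from events_text where event_date = '2016-06-07';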

Use Parquet Columnar Format for HDFS

  • Well defined open format – http://parquet.io
  • Works in Impala, Pig, Hive & Map/Reduce
  • I/O reduction by only reading necessary columns
  • Columnar layout compresses/encodes better
  • Supports nested data by shredding columns
  • Uses techniques used by Google’s ColumnIO
  • Impala loads use Snappy compression by default
  • Gzip available: set PARQUET_COMPRESSION_CODEC=gzip;
  • Quick word on Snappy vs. Gzip

[Figure: impala-parquet]
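A minimal sketch of writing a table in Parquet from Impala and switching the compression codec (reusing the hypothetical events_text table from the sketch above):

-- copy an existing table into Parquet format
create table events_parquet stored as parquet as select * from events_text;

-- switch this session to gzip instead of the default snappy, then load more data
set PARQUET_COMPRESSION_CODEC=gzip;
insert into events_parquet select * from events_text;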

Quick Note on Compression

Snappy

  • Faster compression/decompression speeds
  • Less CPU cycles
  • Lower compression ratio

Gzip/Zlib

  • Slower compression/decompression speeds
  • More CPU cycles
  • Higher compression ratio
  • It’s all about trade-offs

Left-Deep Join Tree

  • The largest* table should be listed first in the FROM clause
  • Joins are done in the order tables are listed in FROM clause
  • Filter early – most selective joins/tables first
  • v1.2.1 will do JOIN ordering

Types of Hash Joins

Broadcast

  • Default hash join type is BROADCAST (aka replicated)
  • Each node ends up with a copy of the right table(s)*
  • Left side, read locally and streamed through local hash join(s)
  • Best choice for “star join”, single large fact table, multiple small dims

Shuffle

  • Alternate hash join type is SHUFFLE (aka partitioned)
  • Right side hashed and shuffled; each node gets ~1/Nth the data
  • Left side hashed and shuffled, then streamed through join
  • Best choice for “large_table JOIN large_table”
  • Only available if ANALYZE was used to gather table/column stats*

How to use ANALYZE

  • Table Stats (from Hive shell)
  • analyze table T1 [partition(partition_key)] compute statistics;
  • Column Stats (from Hive shell)
  • analyze table T1 [partition(partition_key)] compute statistics for columns c1,c2,…
  • Impala 1.2.1 will have a built-in ANALYZE command

Hinting Joins

select ... from large_fact join [broadcast] small_dim

select ... from large_fact join [shuffle] large_dim

Determining Join Type From EXPLAIN

[Figure: impala-explain]

Memory Requirements for Joins & Aggregates

  • Impala does not “spill” to disk — pipelines are in-memory
  • Operators’ mem usage need to fit within the memory limit
  • This is not the same as “all data needs to fit in memory”
  • Buffered data generally significantly smaller than total accessed data
  • Aggregations’ mem usage proportional to number of groups
  • Applies for each in-flight query (sum of total)
  • Minimum of 128GB of RAM is recommended for impala nodes

 

Apache Storm Integration With Apache Kafka

Installing Apache Storm

The prerequisites for Storm to work on the machine:
a. Download and installation commands for ZeroMQ 2.1.7:
Run the following commands in a terminal

wget http://download.zeromq.org/zeromq-2.1.7.tar.gz

 tar -xzf zeromq-2.1.7.tar.gz

 cd zeromq-2.1.7

 ./configure

 make

 sudo make install

b. Download and installation commands for JZMQ: 

git clone https://github.com/nathanmarz/jzmq.git

cd jzmq

./autogen.sh

./configure

make

sudo make install

Note: if the build fails, run the commands below to install the required libraries:
sudo yum install libuuid*
sudo yum install uuid-*
sudo yum install gcc-*
sudo yum install git
sudo yum install libtool*

 

2. Download latest storm from http://storm.apache.org/downloads.html 

In my case “apache-storm-0.10.0.zip”

$ unzip apache-storm-0.10.0.zip
$ vi conf/storm.yaml

Add the code below. (If the file is not there, then look for storm.default.yaml and save it as storm.yaml.)

########### These MUST be filled in for a storm configuration
storm.zookeeper.servers:
    - "localhost"              # your ip address
storm.zookeeper.port: 2181
nimbus.host: "localhost"       # your ip address
nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
ui.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
supervisor.childopts: "-Djava.net.preferIPv4Stack=true"
worker.childopts: "-Xmx768m -Djava.net.preferIPv4Stack=true"
nimbus.thrift.port: 8627
ui.port: 8772
storm.local.dir: "/home/storm/storm-0.8.1/data"                # your data dir path
java.library.path: "/usr/lib/jvm/jre-1.7.0-openjdk.x86_64/"    # Java home path
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

Second start Storm Cluster by starting master and worker nodes.

Start master node i.e. nimbus.

To start master i.e. nimbus go to the ‘bin’ directory of the Storm installation and execute following command. [separate command line window]

storm nimbus

Start worker node i.e. supervisor.

To start worker i.e. supervisor go to the ‘bin’ directory of the Storm installation and execute following command. [separate command line window]

storm supervisor

Start the UI. [separate command line window]

storm ui


You can view the web ui at http://localhost:8772

Apache Kafka + Apache Storm

1. Kafka Producer

package com.org.kafka;

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class WordsProducer {

    public static void main(String[] args) {
        // Build the configuration required for connecting to Kafka
        Properties props = new Properties();

        // List of Kafka brokers. Complete list of brokers is not
        // required as the producer will auto discover the rest of
        // the brokers. Change this to suit your deployment.
        props.put("metadata.broker.list", "localhost:9092");

        // Serializer used for sending data to kafka. Since we are sending string,
        // we are using StringEncoder.
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        // We want acks from Kafka that messages are properly received.
        props.put("request.required.acks", "1");

        // Create the producer instance
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, String> producer = new Producer<String, String>(config);

        // Now we break each word from the paragraph
        for (String word : METAMORPHOSIS_OPENING_PARA.split("\\s")) {
            // Create message to be sent to "words_topic" topic with the word
            KeyedMessage<String, String> data = new KeyedMessage<String, String>("words_topic", word);
            // Send the message
            producer.send(data);
        }

        System.out.println("Produced data");

        // close the producer
        producer.close();
    }

    // First paragraph from Franz Kafka's Metamorphosis
    private static String METAMORPHOSIS_OPENING_PARA = "One morning, when Gregor Samsa woke from troubled dreams, " + "he found himself transformed in his bed into a horrible " + "vermin. He lay on his armour-like back, and if he lifted " + "his head a little he could see his brown belly, slightly " + "domed and divided by arches into stiff sections.";
}

2. Storm Spout/Topology

package com.org.kafka;

import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;

public class KafkaTopology {

    public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException {
        // zookeeper hosts for the Kafka cluster
        ZkHosts zkHosts = new ZkHosts("localhost:2181");

        // Create the KafkaSpout configuration
        // Second argument is the topic name
        // Third argument is the zookeeper root for Kafka
        // Fourth argument is consumer group id
        SpoutConfig kafkaConfig = new SpoutConfig(zkHosts, "words_topic", "", "id7");

        // Specify that the kafka messages are String
        kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        // We want to consume all the first messages in the topic everytime
        // we run the topology to help in debugging. In production, this
        // property should be false
        kafkaConfig.forceFromStart = true;

        // Now we create the topology
        TopologyBuilder builder = new TopologyBuilder();

        // set the kafka spout class
        builder.setSpout("KafkaSpout", new KafkaSpout(kafkaConfig), 1);

        // configure the bolts
        builder.setBolt("SentenceBolt", new SentenceBolt(), 1).globalGrouping("KafkaSpout");
        builder.setBolt("PrinterBolt", new PrinterBolt(), 1).globalGrouping("SentenceBolt");

        // create an instance of LocalCluster class for executing topology in local mode.
        LocalCluster cluster = new LocalCluster();
        Config conf = new Config();

        // Submit topology for execution
        cluster.submitTopology("KafkaToplogy", conf, builder.createTopology());

        try {
            // Wait for some time before exiting
            System.out.println("Waiting to consume from kafka");
            Thread.sleep(10000);
        } catch (Exception exception) {
            System.out.println("Thread interrupted exception : " + exception);
        }

        // kill the KafkaTopology
        cluster.killTopology("KafkaToplogy");

        // shut down the storm test cluster
        cluster.shutdown();
    }
}

3. Bolts

package com.org.kafka;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class PrinterBolt extends BaseBasicBolt {

    public void execute(Tuple input, BasicOutputCollector collector) {
        // get the sentence from the tuple and print it
        String sentence = input.getString(0);
        System.out.println("Received Sentence:" + sentence);
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // we don't emit anything
    }
}

4. SentenceBolt

package com.org.kafka;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;

import com.google.common.collect.ImmutableList;

public class SentenceBolt extends BaseBasicBolt {

    // list used for aggregating the words
    private List<String> words = new ArrayList<String>();

    public void execute(Tuple input, BasicOutputCollector collector) {
        // Get the word from the tuple
        String word = input.getString(0);

        if (StringUtils.isBlank(word)) {
            // ignore blank lines
            return;
        }

        System.out.println("Received Word:" + word);

        // add word to current list of words
        words.add(word);

        if (word.endsWith(".")) {
            // word ends with '.' which means this is the end of the sentence;
            // publish a sentence tuple
            collector.emit(ImmutableList.of((Object) StringUtils.join(words, ' ')));

            // and reset the words list
            words.clear();
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // here we declare we will be emitting tuples with
        // a single field called "sentence"
        declarer.declare(new Fields("sentence"));
    }
}

Running the Kafka and storm application

1. First create Kafka topic “words_topic”
Start server:

bin/kafka-server-start.sh config/server.properties

Create topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic words_topic

Enter something on Producer console

[Figure: kafka+storm]
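The “Producer console” here is the WordsProducer class built above; alternatively, the console producer that ships with Kafka can be used to type words by hand. A minimal sketch (the jar name is a placeholder for your built project jar):

# option 1: run the WordsProducer class (project jar plus Kafka client jars on the classpath)
java -cp kafka-storm-example.jar com.org.kafka.WordsProducer

# option 2: type words manually with Kafka's console producer
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic words_topic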

Run the storm jar.

storm jar <path-to-topology-jar> <class-with-the-main> <arg1> <arg2> <argN>

[Figure: kafka+storm1]

Kafka Design

While developing Kafka, the main focus was to provide the following:

  •   An API for producers and consumers to support custom implementation
  •   Low overheads for network and storage with message persistence on disk
  •   A high throughput supporting millions of messages for both publishing and subscribing—for example, real-time log aggregation or data feeds
  •   Distributed and highly scalable architecture to handle low-latency delivery
  •   Auto-balancing multiple consumers in the case of failure
  •   Guaranteed fault tolerance in the case of server failures

Kafka design fundamentals

[Figure: kafka_arch]

Replication in Kafka

[Figure: kafka_adv]

Kafka supports the following replication modes:

Synchronous replication

In synchronous replication, a producer first identifies the lead replica from ZooKeeper and publishes the message. As soon as the message is published, it is written to the log of the lead replica and all the followers of the lead start pulling the message; by using a single channel, the order of messages is ensured. Each follower replica sends an acknowledgement to the lead replica once the message is written to its respective logs. Once replications are complete and all expected acknowledgements are received, the lead replica sends an acknowledgement to the producer. On the consumer’s side, all the pulling of messages is done from the lead replica.

Asynchronous replication

The only difference in this mode is that, as soon as a lead replica writes the message to its local log, it sends the acknowledgement to the message client and does not wait for acknowledgements from follower replicas. But, as a downside, this mode does not ensure message delivery in case of a broker failure.
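On the producer side, the acknowledgement behaviour that goes with these modes is driven by the request.required.acks setting of the 0.8.x producer API used in the Storm integration post above; a minimal sketch of the common values:

import java.util.Properties;

// Sketch only: builds the producer Properties and documents the ack settings.
public class AckConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "localhost:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // "0"  : fire and forget, no acknowledgement from the broker
        // "1"  : the lead replica acknowledges once the message is in its own log
        // "-1" : the lead replica waits for the in-sync followers before acknowledging
        props.put("request.required.acks", "1");
        return props;
    }
}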

 
