Pig Functions Cheat Sheet 2


Below is the Pig Functions Cheat Sheet prepared by collecting different types of functions.

Pig Execution Modes

Grunt mode:
  • It is interactive mode of pig.
  • Very useful for testing syntax checking and ad-hoc data exploration
Script mode:
  • Run set of instructions from a file
  • Similar to sql script file
Embedded mode:
  • Executes pig program from a Java program
  • Suitable to create pi scripts on the fly
Local mode:
  • pig –x local
  • In this mode, entire pig job runs as a single jvm process
  • Picks and stores data from local Linux path
Map reduce:
  • pig –x map reduce
  • In this mode, pig job runs as a series of map reduce jobs
  • Input and output paths are assumed as HDFS paths
Tez:
  • pig –x tez
  • In this mode, pig job runs as a series of tez jobs
  • Input and output paths are assumed as HDFS paths
Pig program execution:

Pig is just a wrapper on top of map reduce layer. It parse, optimize and converts to the pig scripts to a series of M/R jobs.

PIG DATA TYPES

Pig data type Implementing class Defination

Bag Org.apache.pig.data.DataBig Collection of tuples
Tuple Org.apache.pig.data.Tuple It is collection of values
Map Java.utilMap<Object, Object (key , value ) pair
Integer Java.lang.Integer Number
Long Java.lang.Long Long
Float Java.lang.Float Float type
Double Java.lang.Double Double
Char array Java.lang.string character
Atom Org.apache.pig.data.Atom Single piece of data
Byte array Byte[] Byte count

COMPLEX DATA TYPES

  1. An atom/field is any single piece of data

e.g. ‘Jack’, 15

  1. Tuple is an ordered set of atoms/fields

e.g. (‘a’, ‘big’, ‘data’, ‘problem’)

  1. Bag is just a collection, it may contain atoms, tuples or even bags

e.g. {‘Vijay’, 10, (‘Jack’, ‘Daniel’), {2, (‘Google’, ‘Yahoo’)}}

PIG OPERATORS

Type command Description

Loading and storing LOADDUMPSTORE Loads data into a relationDumps data to consoleStore data to a given location
Data Grouping and joins GROUPCOGROUPCROSSJOIN Groups based on keyGroups data from multiple relations based on keyCross join of two relationsJoin multiple relations
Sorting LIMITORDER Limiting the resultsSorting by fields(s)
Data sets UNIONSPLIT Combining multiple relationsOpposite of UNION

PIG BUILT IN FUNCTIONS

Type Examples

EVAL Functions AVG, COUNT, COUNT_STAR, SUM, TOKENIZE, MAX, MIN, SIZE etc
Load/store Functions Pigstorage(), Textloader, HbaseStorage, JsonLoader, JsonStorage etc.
Math Functions ABS, COS, SIN, TAN, CEIL, FLOOR, ROUND, RANDOM etc
String Functions TRIM, SUBSTRING, LOWER, UPPER, TRIM, RTRIM etc.
Datetime functions GetDay, GetHour, GetYear, ToUnixTime, Tostring etc

Data Types

Pig is written in Java, so Pig data types correspond to underlying Java datatypes. When using a UDF, the Java types need to be translated into the target language.

TYPES DESCRIPTION EXAMPLE PHYTHON/JYTHON
int Signed 32 bit integer 10 int
long Signed 64 bit integer 10L long
float 32 bit floating point 10.5f float
double 64 bit floating point 10.5 float
chararray Character array (string) inUnicode UTF-8 format Hello world Str or unicode
bytearray Byte array (blob) Bytearray
boolean boolean True/false bool
datetime org.joda.time.DateTime 1970-01-01T00 Datetime
biginteger Java BigInteger 1000000 00000 long
bigdecimal Java BigDecimal 42.4242424242424 float
tuple An ordered set of fields. (17, random) tuple
bag A collection of tuples {(17,random),(42,life)} list
map A set of key value pairs [f00#bar, baz#quux] dict

Diagnostic operators

  • DESCRIBE : Returns the schema of a relation
  • DUMP : Dumps or display result to screen
  • Explain : Displays execution plan
  • Illustrate : Displays a step-by-step execution of a sequence of statement

BASIC OPERATORS

OPERATOR DESCRIPTION EXAMPLE

Arithmetic operators +,-,*,/,%,?: X= FOREACH A GENERATE f1,f2,f1%f2;X=FOREACH A GENERATE f2,(f2==1?1:COUNT(B));
Boolean operators and, or, not X=FILTER A BY (f1==8) OR (NOT (f2+f3>f1));
Cast operators Casting from one datatype to another B=FOREACH A GENERATE (int)$0+1;B=FOREACH A GENERATE $0+1,$1+1.0
Comparison operators ==,!=,>,<,>=,<=,matches X=FILTER A BY (f1==8);X=FILTER A BY (f2==’apache’);X=FILTER A BY “(f1 matches ‘.*apache.*’);
Construction operators used to construct tuple(), bag{} and map[] B=foreach A generate (name, age);B=foreach A generate {(name, age)},{name, age};B=foreach A generate [name, gpa];
Dereference Operators dereference tuples(tuple.id or tuple.(id,…)),bags(bag.id or bag.(id,…))and maps(map# ‘key’) X=FOREACH A AGENERATE f2.t1,f2.t3(dereferencing is used to retrieve two fields from tuple f2)
Disambiguate operator (::) used to identify field names after JOIN,COGROUP,CROSS, or FLATTEN Operators A= load ‘data1’ as (x, y);B= load ‘data2’ as (x, y, z);C= join A by B by x;D= foreach C generate A::y;
Flatten Operator Flatten un-nests tuples as well as bags Consider a relation that has a tuple of the form (a, (b, c)). The expression GENERATE $0, flatten ($1), will cause that tuple to become (a, b, c)
Null operator Is null, is not null X= FILTER A BY f1 is not null;
Sign Operators +-> has no effect, à changes the sign of a positive/negative number A=LOAD ‘data’ as (x, y, z);B=FOREACH A GENERATE –x, y;

RELATIONAL OPERATORS

NAME DESCRIPTION

COGROUP alias BY (col1, col2) COGROUP is the same as GROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
CROSS alias1, alias2 Computers the cross product of two or more relations.
CUBE alias BY CUBE (exp1, exp2, …….) Cube operation computes aggregates for all possible combinations of specified group by dimensions. The number of group by combinations generated by cube for n dimensions will be 2^n.
CUBE alias BY ROLLUP (exp1, exp2, …..) Rollup operations computes multiple levels of aggregates based on hierarchical ordering of specified group by dimensions. Rollup is useful when there is hierarchical ordering on the dimensions. The number of group by combinations generated by rollup for n dimensions will be n+1
DISTINCT alias Remove duplicates tuple in a relation.
FILTER alias BY expression Selecta tuples from a relation based on some condition
… FLATTEN (tuple)… FLATTEN (bag) Un-nests a bag or a tuple
FOREACH … GENERATE Performs transformations on each row in the data
GROUP alias ALLGROUP alias BY expressionGROUP alias BY exp1, exp2, ….) Groups the data in one or multiple relations.
JOIN smaller_alias BY expression, larger_alias BY expression Performs inner, equijoin of two or more relations based on common field values.
Join smaller_alias BY expression[LEFT/RIGHT/OUTER], larger_alias BY expression Performs an outer join of two or more relations based on common field values.
Join big_alias BY expression, small_ BY expression USING ‘replicated’ Effiecient join when one or more relations are small enough to fit in main memory
LIMIT alias n Limits the number of output tuples.
LOAD ‘data’ [USING function] [AS schema] Loads data from the file system.
ORDER alias BY col1 [ASC/DESC] Sorts a relation based on one or more fields
RANK alias BY col [ASC/DESC] Returns each tuple with the rank within a relation
SAMPLE alias size Selects a random sample of data based on the specified sample size.
SAMPLE alias size Partitions a relation into two or more relations
SPLIT alias INTO alias1 IF expression, alias2 IF expression….. Stores or saves results to the file system
STORE alias INTO ‘directory’ [USING function] Computes the union of two or more relations

SYNTAX TIPS

The fields in a pig relation have an explicit order (unlike SQL columns).As a result there are syntax shortcuts that rely on that field order.

Assume:

My_data =LOAD ‘file.tsv’ AS (field:int, field2:chararray, field3:float, field4:int);

Statement Output

FOREACH my_data generate $0, $1; (field1,field3)
FOREACH my_data generate field2..field4; (field2,field3,field4)
FOREACH my_data generate field2; (field2,field3,field4)
FOREACH my_data generate * ; (field1,field2,field3,field4)

MATHEMATICAL FUNCTIONS

NAME RETURN TYPE DESCRIPTION

ABS(int a), ABS(long a),ABS(float a), ABS(double a) Int, long, float, double Returns the absolute value of an expression.
ACOS(doubl a) Double Returns the arc cosine of an expression.
ASIN(double a) Double Returns the arc sine of an expression.
ATAN(double a) Double Returns the arc tangent of an expression.
CBRT(double a) Double Returns the cube root of an expression.
CEIL(double a) Double Returns the value of an expression rounded up to the nearest integer
COS(double a) Double Returns the cosine of an expression
COSH(double a) Double Returns the hyperbolic cosine of an expression
EXP(double a) Double Returns the Eulers number e raised to the power of x
FLOOR(double a) Double Returns the value of an expression rounded down to the nearest integer
LOG(double a) Double Returns the natural logarithm of an expression
LOG10(double a) Double Returns the base 10 logarithm of an expression
RANDOM() Double Returns Pseudo random number greater than or equal to 0.0 and less than 1.0
ROUND(float a), ROUND(double a) Double Returns the value of an expression rounded to ainteger
SIN(double a) Double Returns the sine of an expression
SINH(double a) Double Returns the hyperbolic exp
SQRT(double a) Double Returns the positive square root of an expression
TAN(double a) Double Returns the tangent of an expression
TANH(double a) Returns the hyperbolic tangent of an expression

EVAL FUNCTIONS

This group contains aggregate functions such as COUNT and SUM,along with useful utility methods such as IsEmpty.

NAME RETURN TYPE DESCRIPTION

AVG(col) double Computes the average of thr numeric values in a single column of a bag
CONCAT(string expression1, string expression2)CONCAT(byte[] expression1, byte[] expression2) String, byte[] Concatenates two expressions of identical type
COUNT(DataBag bag) long Computes the number of element in a bag Does not include null values
COUNT STAR(DataBag bag1, DataBag bag2) Long Compute the number of elements in a bag including null values
DIFF(DataBag bag1, DataBag bag2) Databag Compares two bags .Any tuples that are in one bag but notbthe other are returned in a bag
IsEmpty(DataBag bag), IsEmpty(Map map) boolean Checks if a bag or map isempty
Max(col) Int, long, float, double Computes the maximum of the numeric values or chararrays in a single-column bag
MIN(col) Int, long,float,double Computes the minimum of the numeric values or chararrays in a single-column bag
DEFINE pluck pluckTuple(expression1) Tuple Allows the user to specify a string prefix, and then filter for the column in a relation that begin with that prefix
SIZE(expression) Long Computes the number of elements based on any pig data
SUBSTRACT(DataBag bag1, DataBag bag2) DataBag Returns bag composed of bag1 elements not in bag2
SUM Double, long Computes the sum of the numeric values in a asingle-column bag
TOKENIZE(String expression[,‘field delimiter’) Databag Splits a string and output a bag of words. If field delimiter is null or not passed, the following will be used as delimiters:space[], double quote[“], comma[,]parenthesis[0], start[*]

SQL -> PIG

Because many people come to Pig from a relational database background, we’ve included a handy translation from SQL concepts to their Pig equivalents.

SQL Function SQL PIG

SELECT SELECT column_name,column_nameFROM table_name; FOREACH alias GENERATE column_name,column_name;
SELECT* SELECT * FROM table_name; FOREACH alias GENERATE *;
DISTINCT SELECT DISTINCT column_name,column_nameFROM table_name; DISTINCT(FOREACH aliasgenerate column_name, column_name);
WHERE SELECT column_name,column_nameFROM table_nameWHERE column_name operator value; FOREACH (FILTER alias BY column_nameoperator value)GENERATE column_name, column_name;
AND/OR WHERE (column_name operator value1AND column_name operator value2)OR column_name operator value3; FILTER alias BY (column_name operator value1AND column_name operator value2)OR column_name operator value3;
ORDER BY ORDER BY column_name ASC|DESC,column_name ASC|DESC; ORDER alias BY column_name ASC|DESC,column_name ASC|DESC;
TOP/LIMIT SELECT TOP number column_nameFROM table_name ORDER BY column_name ASC|DESC;SELECT column_name FROM table_name FOREACH (GROUP alias BY column_name)GENERATE LIMIT alias number;TOP(number, column_index, alias);
GROUP BY GROUP BY(column_name) FOREACH (GROUP alias BY column_name)GENERATE function(alias.column_name);
LIKE WHERE column_name LIKE pattern; FILTER alias BY REGEX_EXTRACT(column_name,pattern, 1) IS NOT NULL;
IN WHERE column_name IN (value1,value2,…); FILTER alias BY column_name IN(value1, value2,…);
JOIN SELECT column_name(s)FROM table1JOIN table2ON table1.column_name=table2.column_name; FOREACH (JOIN alias1 BY column_name,alias2 BY column_name)GENERATE column_name(s);
LEFT/RIGHT/FULL OUTERJOIN SELECT column_name(s) FROM table1LEFT|RIGHT|FULL OUTER JOIN table2ON table1.column_name=table2.column_name; FOREACH(JOINalias1 BY column_name LEFT|RIGHT|FULL,alias2 BY column_name) GENERATE column_name(s);
UNION ALL SELECT column_name(s) FROM table1UNIONALLSELECTcolumn_name(s) FROM table2; UNION alias1, alias2;
AVG SELECT AVG(column_name) FROM table_name; FOREACH (GROUP Alias ALL) GENERATEAVG(alias.column_name);
COUNT SELECT COUNT(column_name) FROM table_name; FOREACH (GROUP alias ALL) GENERATE COUNT(alias);
COUNT DISTINCT SELECT COUNT(DISTINCT column_name) FROM table_name; FOREACH alias{Unique _column=DISTINT Column_name);};
MAX SELECT MAX(column_name) FROM table_name; FOREACH(GROUP aliasALL) GENERATE MAX(alias.column_name);
MIN SELECT MIN(column_name) FROM table_name; FOREACH (GROUP aliasALL)GENERATE MIN(alias.column_name)
SUM SELECT SUM(column_name) FROM table_name; FOREACH (GROUP aliasALL)GEENRATE SUM(alias.column_name);
HAVING HAVING aggregate_function(column_name)operator value; FILTER alias BYAggregate_function(column_name)operatorValue;
UCASE/UPPER SELECT UCASE(column_name) FROM table_name FOREACH aliasGENERATEUPPER(column_name);
LCASE/LOWER SELECT LCASE(column_name) FROM table_name; FOREACH aliasGENERATELOWER(column_name);
SUBSTRING SELECTSUBSTRING(column_name,start,length) AS some_name FROM table_name; FOREACH aliasGENERATESUBSTRING(column_name,start,

Star+length) as

Some_name;

LEN SELECT LEN(column_name) FROM table_name; FOREACH aliasGENERATE SIZE(column_name)
ROUND SELECT ROUND(column_name,0) FROM table_name; FOREACH aliasGENEARATE ROUND(column_name);

About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.


Leave a comment

Your email address will not be published. Required fields are marked *

2 thoughts on “Pig Functions Cheat Sheet


Review Comments
default image

I am a plsql developer. Intrested to move into bigdata.

Neetika Singh ITA Hadoop in Dec/2016 December 22, 2016

.