Pig Functions Cheat Sheet 2

Below is a Pig functions cheat sheet, compiled by collecting the different types of Pig functions and operators.

Pig Execution Modes

Grunt mode:
  • The interactive mode of Pig
  • Very useful for testing, syntax checking, and ad-hoc data exploration
Script mode:
  • Runs a set of instructions from a file
  • Similar to an SQL script file
Embedded mode:
  • Executes Pig programs from a Java program
  • Suitable for creating Pig scripts on the fly
Local mode:
  • pig -x local
  • In this mode, the entire Pig job runs as a single JVM process
  • Picks up and stores data on a local Linux path
MapReduce mode:
  • pig -x mapreduce (the default)
  • In this mode, the Pig job runs as a series of MapReduce jobs
  • Input and output paths are assumed to be HDFS paths
Tez mode:
  • pig -x tez
  • In this mode, the Pig job runs as a series of Tez jobs
  • Input and output paths are assumed to be HDFS paths
Pig program execution:

Pig is essentially a wrapper on top of the MapReduce layer. It parses, optimizes, and converts Pig scripts into a series of MapReduce jobs.
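To make the execution flow concrete, here is a minimal end-to-end Pig script; the input file students.tsv and its schema are hypothetical:

```pig
-- Minimal sketch: load, transform, store (students.tsv is a hypothetical input)
students = LOAD 'students.tsv' USING PigStorage('\t')
           AS (name:chararray, age:int, gpa:float);
adults   = FILTER students BY age >= 18;
STORE adults INTO 'output/adults' USING PigStorage('\t');
```

In local mode the paths resolve on the Linux filesystem; in MapReduce or Tez mode the same script reads and writes HDFS paths.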


Pig data type    Implementing class                  Definition

Bag              org.apache.pig.data.DataBag         Collection of tuples
Tuple            org.apache.pig.data.Tuple           Ordered collection of values
Map              java.util.Map<Object, Object>       (key, value) pairs
Integer          java.lang.Integer                   32-bit signed number
Long             java.lang.Long                      64-bit signed number
Float            java.lang.Float                     Float type
Double           java.lang.Double                    Double type
Chararray        java.lang.String                    Character string
Atom             org.apache.pig.data.Atom            Single piece of data
Bytearray        byte[]                              Byte array (blob)


  1. An atom/field is any single piece of data

     e.g. 'Jack', 15

  2. A tuple is an ordered set of atoms/fields

     e.g. ('a', 'big', 'data', 'problem')

  3. A bag is just a collection; it may contain atoms, tuples, or even other bags

     e.g. {'Vijay', 10, ('Jack', 'Daniel'), {2, ('Google', 'Yahoo')}}
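These nested types can be declared directly in a LOAD schema and dereferenced afterwards; a sketch, where the input file complex_data, the field names, and the map key 'key1' are all hypothetical:

```pig
-- Sketch: declaring nested types in a LOAD schema
complex = LOAD 'complex_data' AS (
    t: tuple(a:int, b:chararray),         -- tuple: ordered set of fields
    b: bag{ pair: tuple(x:int, y:int) },  -- bag: collection of tuples
    m: map[chararray]                     -- map: (key, value) pairs
);
projected = FOREACH complex GENERATE t.a, m#'key1';
```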


Type                       Command                          Description

Loading and storing        LOAD, DUMP, STORE                Loads data into a relation; dumps data to the console; stores data to a given location
Data grouping and joins    GROUP, COGROUP, CROSS, JOIN      Groups based on a key; groups data from multiple relations based on a key; cross join of two relations; joins multiple relations
Sorting                    LIMIT, ORDER                     Limits the results; sorts by field(s)
Data sets                  UNION, SPLIT                     Combines multiple relations; the opposite of UNION
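A sketch combining several of the operators above in one pipeline; both input files and their schemas are hypothetical:

```pig
-- Sketch: GROUP, JOIN, ORDER, and LIMIT together
orders    = LOAD 'orders'    AS (cust_id:int, amount:double);
customers = LOAD 'customers' AS (cust_id:int, name:chararray);
by_cust   = GROUP orders BY cust_id;               -- group on a key
totals    = FOREACH by_cust GENERATE group AS cust_id,
                                     SUM(orders.amount) AS total;
joined    = JOIN totals BY cust_id, customers BY cust_id;
ordered   = ORDER joined BY total DESC;            -- sort by a field
top10     = LIMIT ordered 10;                      -- limit the results
DUMP top10;
```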


Type                     Examples

Load/store functions     PigStorage(), TextLoader, HBaseStorage, JsonLoader, JsonStorage, etc.
Datetime functions       GetDay, GetHour, GetYear, ToUnixTime, ToString, etc.

Data Types

Pig is written in Java, so Pig data types correspond to underlying Java data types. When writing a UDF in another language, the Java types need to be translated into the target language.

TYPES        DESCRIPTION                                      EXAMPLE                     PYTHON/JYTHON

int          Signed 32-bit integer                            10                          int
long         Signed 64-bit integer                            10L                         long
float        32-bit floating point                            10.5f                       float
double       64-bit floating point                            10.5                        float
chararray    Character array (string) in Unicode UTF-8        Hello world                 str or unicode
bytearray    Byte array (blob)                                                            bytearray
boolean      Boolean                                          true/false                  bool
datetime     org.joda.time.DateTime                           1970-01-01T00               datetime
biginteger   Java BigInteger                                  100000000000                long
bigdecimal   Java BigDecimal                                  42.4242424242424            float
tuple        An ordered set of fields                         (17, random)                tuple
bag          A collection of tuples                           {(17,random),(42,life)}     list
map          A set of key-value pairs                         [foo#bar, baz#quux]         dict
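The mapping in the last column is what a Python (Jython) UDF sees at runtime. A minimal sketch of such a UDF follows; the file name upper.py, the function name, and the fallback decorator are assumptions for illustration. Inside Pig, the outputSchema decorator is supplied by the runtime and the file would be registered with REGISTER upper.py USING jython AS myfuncs;

```python
# Sketch of a Jython UDF (hypothetical upper.py).
try:
    # Available when this file runs inside Pig's Jython runtime
    from pig_util import outputSchema
except ImportError:
    # Fallback so the file also runs standalone outside Pig (assumption for this sketch)
    def outputSchema(schema):
        def wrap(f):
            return f
        return wrap

@outputSchema("word_upper:chararray")
def to_upper(word):
    # Pig chararray arrives as a Python str/unicode (see the mapping table above);
    # a Pig null arrives as Python None and should be passed through.
    return word.upper() if word is not None else None
```

In a Pig script the call would then look like myfuncs.to_upper(name), with the result typed as chararray per the declared output schema.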

Diagnostic operators

  • DESCRIBE   :         Returns the schema of a relation
  • DUMP       :         Dumps or displays results to the screen
  • EXPLAIN    :         Displays the execution plan
  • ILLUSTRATE :         Displays a step-by-step execution of a sequence of statements
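The four diagnostic operators in use on a small pipeline; the input file 'data' and its schema are hypothetical:

```pig
-- Sketch: diagnostics on a hypothetical relation
A = LOAD 'data' AS (f1:int, f2:chararray);
DESCRIBE A;      -- prints the schema: A: {f1: int, f2: chararray}
B = FILTER A BY f1 > 0;
EXPLAIN B;       -- shows the logical, physical, and MapReduce plans
ILLUSTRATE B;    -- walks sample rows through each step of the pipeline
DUMP B;          -- materializes and prints the result
```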


OPERATOR                   DESCRIPTION                            EXAMPLE

Arithmetic operators       +, -, *, /, %, ?: (bincond)            X = FOREACH A GENERATE f1, f2, f1 % f2; X = FOREACH A GENERATE f2, (f2 == 1 ? 1 : COUNT(B));
Boolean operators          and, or, not                           X = FILTER A BY (f1 == 8) OR (NOT (f2 + f3 > f1));
Cast operators             Casting from one data type to another  B = FOREACH A GENERATE (int)$0 + 1; B = FOREACH A GENERATE $0 + 1, $1 + 1.0;
Comparison operators       ==, !=, >, <, >=, <=, matches          X = FILTER A BY (f1 == 8); X = FILTER A BY (f2 == 'apache'); X = FILTER A BY (f1 matches '.*apache.*');
Construction operators     Used to construct tuples (), bags {} and maps []    B = FOREACH A GENERATE (name, age); B = FOREACH A GENERATE {(name, age)}, {name, age}; B = FOREACH A GENERATE [name, gpa];
Dereference operators      Dereference tuples (tuple.id or tuple.(id, …)), bags (bag.id or bag.(id, …)) and maps (map#'key')    X = FOREACH A GENERATE f2.t1, f2.t3; (dereferencing is used here to retrieve two fields from tuple f2)
Disambiguate operator      (::) used to identify field names after JOIN, COGROUP, CROSS, or FLATTEN operators    A = LOAD 'data1' AS (x, y); B = LOAD 'data2' AS (x, y, z); C = JOIN A BY x, B BY x; D = FOREACH C GENERATE A::y;
Flatten operator           FLATTEN un-nests tuples as well as bags    Consider a relation with a tuple of the form (a, (b, c)). The expression GENERATE $0, FLATTEN($1) will cause that tuple to become (a, b, c).
Null operators             is null, is not null                   X = FILTER A BY f1 is not null;
Sign operators             + has no effect; - changes the sign of a positive/negative number    A = LOAD 'data' AS (x, y, z); B = FOREACH A GENERATE -x, y;
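The disambiguate and flatten operators are the two that most often trip people up, so here they are in a runnable sketch; the input files and schemas are hypothetical:

```pig
-- Sketch: disambiguation (::) and FLATTEN
A = LOAD 'data1' AS (x:int, y:int);
B = LOAD 'data2' AS (x:int, y:int, z:int);
C = JOIN A BY x, B BY x;
-- both inputs carry a field named y, so it must be qualified with ::
D = FOREACH C GENERATE A::y, B::z;

-- FLATTEN un-nests: a row (a, (b, c)) becomes (a, b, c)
E = LOAD 'nested' AS (a:int, t:tuple(b:int, c:int));
F = FOREACH E GENERATE a, FLATTEN(t);
```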


                    NAME                                                                                            DESCRIPTION

COGROUP alias BY (col1, col2)    COGROUP is the same as GROUP. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
CROSS alias1, alias2    Computes the cross product of two or more relations.
CUBE alias BY CUBE (exp1, exp2, …)    The cube operation computes aggregates for all possible combinations of the specified group-by dimensions. The number of group-by combinations generated by cube for n dimensions is 2^n.
CUBE alias BY ROLLUP (exp1, exp2, …)    The rollup operation computes multiple levels of aggregates based on the hierarchical ordering of the specified group-by dimensions. Rollup is useful when there is a hierarchical ordering on the dimensions. The number of group-by combinations generated by rollup for n dimensions is n+1.
DISTINCT alias    Removes duplicate tuples in a relation.
FILTER alias BY expression    Selects tuples from a relation based on a condition.
… FLATTEN(tuple) … FLATTEN(bag)    Un-nests a bag or a tuple.
FOREACH … GENERATE    Performs transformations on each row of the data.
GROUP alias ALL; GROUP alias BY expression; GROUP alias BY (exp1, exp2, …)    Groups the data in one or more relations.
JOIN smaller_alias BY expression, larger_alias BY expression    Performs an inner equijoin of two or more relations based on common field values.
JOIN smaller_alias BY expression LEFT|RIGHT|FULL [OUTER], larger_alias BY expression    Performs an outer join of two or more relations based on common field values.
JOIN big_alias BY expression, small_alias BY expression USING 'replicated'    Efficient join when one or more relations are small enough to fit in main memory.
LIMIT alias n    Limits the number of output tuples.
LOAD 'data' [USING function] [AS schema]    Loads data from the file system.
ORDER alias BY col1 [ASC|DESC]    Sorts a relation based on one or more fields.
RANK alias BY col [ASC|DESC]    Returns each tuple with its rank within the relation.
SAMPLE alias size    Selects a random sample of data based on the specified sample size.
SPLIT alias INTO alias1 IF expression, alias2 IF expression, …    Partitions a relation into two or more relations.
STORE alias INTO 'directory' [USING function]    Stores or saves results to the file system.
UNION alias1, alias2    Computes the union of two or more relations.
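SPLIT, UNION, and STORE are easy to confuse; a short sketch using all three together, with a hypothetical log input:

```pig
-- Sketch: SPLIT partitions, UNION recombines, STORE saves
logs   = LOAD 'logs' AS (level:chararray, msg:chararray);
SPLIT logs INTO errors IF level == 'ERROR',
                warns  IF level == 'WARN';
serious = UNION errors, warns;
STORE serious INTO 'output/serious' USING PigStorage('\t');
```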


The fields in a Pig relation have an explicit order (unlike SQL columns). As a result, there are syntax shortcuts that rely on that field order.


my_data = LOAD 'file.tsv' AS (field1:int, field2:chararray, field3:float, field4:int);

Statement                                        Output

FOREACH my_data GENERATE $0, $1;                 (field1, field2)
FOREACH my_data GENERATE field2..field4;         (field2, field3, field4)
FOREACH my_data GENERATE field2..;               (field2, field3, field4)
FOREACH my_data GENERATE *;                      (field1, field2, field3, field4)


NAME                                                    RETURN TYPE                 DESCRIPTION

ABS(int a), ABS(long a), ABS(float a), ABS(double a)    int, long, float, double    Returns the absolute value of an expression.
ACOS(double a)              double    Returns the arc cosine of an expression.
ASIN(double a)              double    Returns the arc sine of an expression.
ATAN(double a)              double    Returns the arc tangent of an expression.
CBRT(double a)              double    Returns the cube root of an expression.
CEIL(double a)              double    Returns the value of an expression rounded up to the nearest integer.
COS(double a)               double    Returns the cosine of an expression.
COSH(double a)              double    Returns the hyperbolic cosine of an expression.
EXP(double a)               double    Returns Euler's number e raised to the power of the expression.
FLOOR(double a)             double    Returns the value of an expression rounded down to the nearest integer.
LOG(double a)               double    Returns the natural logarithm of an expression.
LOG10(double a)             double    Returns the base-10 logarithm of an expression.
RANDOM()                    double    Returns a pseudo-random number greater than or equal to 0.0 and less than 1.0.
ROUND(float a), ROUND(double a)    int, long    Returns the value of an expression rounded to an integer.
SIN(double a)               double    Returns the sine of an expression.
SINH(double a)              double    Returns the hyperbolic sine of an expression.
SQRT(double a)              double    Returns the positive square root of an expression.
TAN(double a)               double    Returns the tangent of an expression.
TANH(double a)              double    Returns the hyperbolic tangent of an expression.
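A sketch applying a few of the math functions above inside a FOREACH; the input file 'numbers' and its schema are hypothetical:

```pig
-- Sketch: math functions in a FOREACH … GENERATE
nums   = LOAD 'numbers' AS (x:double);
shaped = FOREACH nums GENERATE x,
                               ABS(x)       AS magnitude,
                               FLOOR(x)     AS lo,
                               CEIL(x)      AS hi,
                               SQRT(ABS(x)) AS root;
DUMP shaped;
```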


This group contains aggregate functions such as COUNT and SUM, along with useful utility methods such as IsEmpty.

NAME                                                                 RETURN TYPE                 DESCRIPTION

AVG(col)    double    Computes the average of the numeric values in a single-column bag.
CONCAT(String expression1, String expression2), CONCAT(byte[] expression1, byte[] expression2)    String, byte[]    Concatenates two expressions of identical type.
COUNT(DataBag bag)    long    Computes the number of elements in a bag; does not include null values.
COUNT_STAR(DataBag bag)    long    Computes the number of elements in a bag, including null values.
DIFF(DataBag bag1, DataBag bag2)    DataBag    Compares two bags. Any tuples that are in one bag but not the other are returned in a bag.
IsEmpty(DataBag bag), IsEmpty(Map map)    boolean    Checks whether a bag or map is empty.
MAX(col)    int, long, float, double    Computes the maximum of the numeric values or chararrays in a single-column bag.
MIN(col)    int, long, float, double    Computes the minimum of the numeric values or chararrays in a single-column bag.
DEFINE pluck PluckTuple(expression1)    tuple    Allows the user to specify a string prefix and then filter for the columns in a relation that begin with that prefix.
SIZE(expression)    long    Computes the number of elements based on any Pig data type.
SUBTRACT(DataBag bag1, DataBag bag2)    DataBag    Returns a bag composed of the elements of bag1 that are not in bag2.
SUM(col)    double, long    Computes the sum of the numeric values in a single-column bag.
TOKENIZE(String expression [, 'field_delimiter'])    DataBag    Splits a string and outputs a bag of words. If the field delimiter is null or not passed, the following are used as delimiters: space [ ], double quote ["], comma [,], parentheses [()], star [*].
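Several of these functions combine naturally in the classic word-count script; the input file input.txt is hypothetical:

```pig
-- Classic word count using TOKENIZE, FLATTEN, GROUP, and COUNT
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```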


Because many people come to Pig from a relational database background, we’ve included a handy translation from SQL concepts to their Pig equivalents.

SQL Function                SO THAT SQL                                                              PIG

SELECT          SELECT column_name, column_name FROM table_name;    FOREACH alias GENERATE column_name, column_name;
DISTINCT        SELECT DISTINCT column_name, column_name FROM table_name;    DISTINCT (FOREACH alias GENERATE column_name, column_name);
WHERE           SELECT column_name, column_name FROM table_name WHERE column_name operator value;    FOREACH (FILTER alias BY column_name operator value) GENERATE column_name, column_name;
AND/OR          WHERE (column_name operator value1 AND column_name operator value2) OR column_name operator value3;    FILTER alias BY (column_name operator value1 AND column_name operator value2) OR column_name operator value3;
ORDER BY        ORDER BY column_name ASC|DESC, column_name ASC|DESC;    ORDER alias BY column_name ASC|DESC, column_name ASC|DESC;
TOP/LIMIT       SELECT TOP number column_name FROM table_name ORDER BY column_name ASC|DESC;    FOREACH (GROUP alias BY column_name) GENERATE LIMIT alias number; TOP(number, column_index, alias);
GROUP BY        GROUP BY (column_name)    FOREACH (GROUP alias BY column_name) GENERATE function(alias.column_name);
LIKE            WHERE column_name LIKE pattern;    FILTER alias BY REGEX_EXTRACT(column_name, pattern, 1) IS NOT NULL;
IN              WHERE column_name IN (value1, value2, …);    FILTER alias BY column_name IN (value1, value2, …);
JOIN            SELECT column_name(s) FROM table1 JOIN table2 ON table1.column_name = table2.column_name;    FOREACH (JOIN alias1 BY column_name, alias2 BY column_name) GENERATE column_name(s);
LEFT/RIGHT/FULL OUTER JOIN    SELECT column_name(s) FROM table1 LEFT|RIGHT|FULL OUTER JOIN table2 ON table1.column_name = table2.column_name;    FOREACH (JOIN alias1 BY column_name LEFT|RIGHT|FULL, alias2 BY column_name) GENERATE column_name(s);
UNION ALL       SELECT column_name(s) FROM table1 UNION ALL SELECT column_name(s) FROM table2;    UNION alias1, alias2;
AVG             SELECT AVG(column_name) FROM table_name;    FOREACH (GROUP alias ALL) GENERATE AVG(alias.column_name);
COUNT           SELECT COUNT(column_name) FROM table_name;    FOREACH (GROUP alias ALL) GENERATE COUNT(alias);
COUNT DISTINCT  SELECT COUNT(DISTINCT column_name) FROM table_name;    FOREACH (GROUP alias ALL) { unique_column = DISTINCT alias.column_name; GENERATE COUNT(unique_column); };
MAX             SELECT MAX(column_name) FROM table_name;    FOREACH (GROUP alias ALL) GENERATE MAX(alias.column_name);
MIN             SELECT MIN(column_name) FROM table_name;    FOREACH (GROUP alias ALL) GENERATE MIN(alias.column_name);
SUM             SELECT SUM(column_name) FROM table_name;    FOREACH (GROUP alias ALL) GENERATE SUM(alias.column_name);
HAVING          HAVING aggregate_function(column_name) operator value;    FILTER alias BY aggregate_function(column_name) operator value;
UCASE/UPPER     SELECT UCASE(column_name) FROM table_name;    FOREACH alias GENERATE UPPER(column_name);
LCASE/LOWER     SELECT LCASE(column_name) FROM table_name;    FOREACH alias GENERATE LOWER(column_name);
SUBSTRING       SELECT SUBSTRING(column_name, start, length) AS some_name FROM table_name;    FOREACH alias GENERATE SUBSTRING(column_name, start, start + length) AS some_name;
LEN             SELECT LEN(column_name) FROM table_name;    FOREACH alias GENERATE SIZE(column_name);
ROUND           SELECT ROUND(column_name, 0) FROM table_name;    FOREACH alias GENERATE ROUND(column_name);
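A worked translation combining several rows of the table (GROUP BY, AVG, HAVING); the input file 'employees', its schema, and the 50000 threshold are hypothetical:

```pig
-- SQL: SELECT dept, AVG(salary) FROM employees
--      GROUP BY dept HAVING AVG(salary) > 50000;
employees = LOAD 'employees' AS (dept:chararray, salary:double);
grpd      = GROUP employees BY dept;
avgs      = FOREACH grpd GENERATE group AS dept,
                                  AVG(employees.salary) AS avg_salary;
result    = FILTER avgs BY avg_salary > 50000;   -- the HAVING clause
DUMP result;
```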


About Siva

Senior Hadoop developer with 4 years of experience in designing and architecting solutions for the Big Data domain, involved in several complex engagements. Technical strengths include Hadoop, YARN, MapReduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.
