Load Functions In Pig


In this post, we will discuss the basics of load functions in Pig with some sample examples, and we will also touch on custom load functions written as UDFs.

To work with data in Pig, the first thing we need to do is load data from a source, and Pig provides a built-in LOAD operator that loads data from the file system.

Load Operator:

Syntax:
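The LOAD operator takes an input path, an optional load function, and an optional schema. Its general form is:

alias = LOAD 'input_path' [USING load_function] [AS schema];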

Input Path:

Instead of a single file, we can also specify a directory name to load all the files in that directory. We can also use Hadoop globbing (glob patterns) to select only the files that match particular criteria.

Below is a high-level table of glob symbols and their meanings in Hadoop/Pig:

Glob        Meaning
?           Matches any single character.
*           Matches zero or more characters.
[abc]       Matches a single character from the character set (a, b, c).
[a-z]       Matches a single character from the character range (a..z), inclusive. The first character must be lexicographically less than or equal to the second character.
[^abc]      Matches a single character that is not in the character set (a, b, c). The ^ character must occur immediately to the right of the opening bracket.
[^a-z]      Matches a single character that is not from the character range (a..z), inclusive. The ^ character must occur immediately to the right of the opening bracket.
\c          Removes (escapes) any special meaning of character c.
{ab,cd}     Matches a string from the string set {ab, cd}.
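As an example, assuming a hypothetical HDFS directory /data/logs containing monthly files such as log-2016-01.txt, log-2016-02.txt, and so on, a single glob pattern can load all of them at once:

logs = LOAD '/data/logs/log-2016-*.txt';
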
Load Function:

Any built-in load function, or a user-defined load function that extends the LoadFunc class, can be used in the LOAD operator's USING clause to load data.

If we do not specify a USING clause in the LOAD operator, PigStorage is selected as the load function by default. PigStorage() uses the tab delimiter to parse the fields in the input file by default, but if our data uses any other delimiter, we need to pass that delimiter as an argument to PigStorage(), as shown in the examples below.

Suppose input.txt is an input file whose fields are tab-separated, as shown below.
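For illustration, assume a hypothetical input.txt with tab-separated records such as:

001     john    35
002     mary    28

This file can be loaded with either of the following two statements:

A = LOAD 'input.txt';
A = LOAD 'input.txt' USING PigStorage('\t');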

Both of the load statements above are identical in effect. If our input file used a ',' delimiter between fields, we would need to call PigStorage() as shown below.
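For example:

A = LOAD 'input.txt' USING PigStorage(',');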

Schema:

A schema is specified in parentheses after the AS keyword to give names to the fields in the input file and to declare their data types.

In the above examples of the LOAD operator we omitted the AS clause; in this case the fields are not named and all fields default to type bytearray, so we need to access them by positional notation: $0, $1, $2, and so on.
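For example, with the unnamed relation A loaded earlier, the first two fields can be projected by position:

B = FOREACH A GENERATE $0, $1;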

But if we provide a schema as shown in the examples below, then we can access the fields by their names, and Pig ensures that the input data is interpreted according to the declared field data types.

The two example load statements above can be restated with a schema as shown below. These two are equivalent.
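Continuing with the hypothetical input.txt above, one possible schema uses an id, a name, and an age field:

A = LOAD 'input.txt' AS (id:int, name:chararray, age:int);
A = LOAD 'input.txt' USING PigStorage('\t') AS (id:int, name:chararray, age:int);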

Please refer to the post on built-in load functions of Pig to match your data loading criteria, or refer to the post on writing custom loader UDFs in Pig to learn how to write our own load function.

LoadFunc in Pig is tightly coupled with Hadoop's InputFormat class, so the complexity of reading the data and creating a record lies largely in the underlying InputFormat implementation. This enables Pig to easily read/write data in new storage formats as and when a Hadoop InputFormat and OutputFormat become available for them.

For examples of custom load functions, refer to the post on parsing logs in Hadoop.

Compression Support in Load Functions:

Compressed data can also be processed by load functions. Currently, the gzip and bzip compression formats are supported by PigStorage() and TextLoader() for both read (load) and write (store), but BinStorage() does not support compression.

For input files to be recognized as compressed, they need to have a .gz, .bz, or .bz2 extension. gzip does not support splitting, so the number of map tasks required is equal to the number of input files. Since bzip is a block-oriented compression format, bzipped files can be split across multiple maps.
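For example, a gzip-compressed file can be loaded just like an uncompressed one, provided it carries the expected extension (the path below is hypothetical):

A = LOAD '/data/input.txt.gz' USING PigStorage('\t');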

