Built-in Load Store Functions in Pig 2


In this post, we will discuss the following built-in load/store functions in Pig, with examples.

  • PigStorage
  • TextLoader
  • BinStorage
  • JsonLoader, JsonStorage
  • AvroStorage
  • HBaseStorage
  • MongoStorage

PigStorage:

PigStorage() is the default load/store function in Pig. PigStorage expects data to be formatted with field delimiters; the default delimiter is ‘\t’. PigStorage() can be used for both Load and Store, and it loads/stores data as structured text files. All Pig simple and complex data types can be read/written using this function. The input to a load can be a file, a directory or a glob.

Its call syntax:
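A general sketch of the form (the option names follow the Apache Pig documentation; relation names and paths are illustrative):

A = LOAD '/in/data' USING PigStorage();                       -- default '\t' delimiter
A = LOAD '/in/data' USING PigStorage(',');                    -- custom field delimiter
A = LOAD '/in/data' USING PigStorage(',', 'schema tagFile');  -- delimiter plus space-separated options
STORE A INTO '/out/data' USING PigStorage('|', 'schema');     -- the same function works for STORE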

Here, the default delimiter is ‘\t’, but we can provide any other character/symbol as the delimiter, as shown in the examples below. The options argument can take any of the values ‘schema’, ‘noschema’, ‘tagFile’ or ‘tagPath’; multiple options are space separated but enclosed together in single quotes, as shown in the examples below.

When the ‘schema’ option is used, these schema values are not visible in the data itself but are stored in a hidden “.pig_schema” file created in the output directory when storing data. If the tagPath or tagFile option is specified, PigStorage adds a pseudo-column, INPUT_FILE_PATH or INPUT_FILE_NAME respectively, to the beginning of each record.
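As a small sketch of the tagFile behaviour (the pseudo-column name comes from the description above; the path and sample record are illustrative):

A = LOAD '/in/passwd' USING PigStorage(':', 'tagFile');
-- each tuple now begins with the INPUT_FILE_NAME pseudo-column, e.g.
-- (passwd, root, x, 0, 0, root, /root, /bin/bash)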

PigStorage Example:

We will use the /etc/passwd file on the Linux file system, load it into Pig, and extract user names and path names from it. We will copy this file into the /in/ directory in HDFS and run the commands below in the Pig grunt shell to examine PigStorage functionality. Here the options are not strictly necessary; even without the ‘schema tagFile’ options we get the same results in the final output file.
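A sketch of the grunt shell statements for this walkthrough (the relation names, field positions and output path are assumptions, not the exact statements from the original screenshot):

grunt> A = LOAD '/in/passwd' USING PigStorage(':', 'schema tagFile');
grunt> -- with tagFile the first field is INPUT_FILE_NAME, so the user name is $1 and the home directory path is $6
grunt> B = FOREACH A GENERATE $1 AS user, $6 AS path;
grunt> DUMP B;
grunt> STORE B INTO '/out/passwd_users' USING PigStorage(':');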

Below is the sample output of the above commands.

PigStorage example

TextLoader:

TextLoader works with unstructured text files in UTF-8 format. It is mainly useful when we cannot impose any structure (schema) on the records in the input text files; such files can be loaded with this loader, and each line of text is treated as a single field of type bytearray, the default data type in Pig. Relations loaded this way will not have any schema.

TextLoader supports only loading; it cannot be used to store data. TextLoader does not accept any parameters.

Example call to TextLoader:

Let’s load the hadooplogs file from the ‘/in/’ HDFS directory with TextLoader and verify the schema and the output tuples of the load statement.

The input file can be any sample Hadoop log file –> hadooplogs
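A sketch of the statements (the relation names match the DESCRIBE A; and DUMP B; commands referenced below; the path and the LIMIT step are assumptions):

grunt> A = LOAD '/in/hadooplogs' USING TextLoader();
grunt> DESCRIBE A;        -- reports that the schema for A is unknown
grunt> B = LIMIT A 5;     -- keep the DUMP output small
grunt> DUMP B;            -- each tuple holds one whole log line as a single bytearray field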

Below is the output of the above grunt shell Pig Latin statements. As shown in the screen below for the DESCRIBE A; command, relation A does not have any schema, and the final tuples contain only a single field holding the entire line of input text, as shown in the screen below for the DUMP B; output.

TextLoader Example

BinStorage:

BinStorage is used to both load and store binary files. Users rarely use it directly, but Pig uses it internally to load and store the temporary data generated between multiple MapReduce jobs.

When we save text data into binary files with BinStorage, we need to be careful to specify a custom converter that turns bytearray fields back into the correct data types when using/loading the binary files created by the earlier STORE statements.
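A hedged sketch of this caution (the Utf8StorageConverter argument follows the Apache Pig documentation for BinStorage and should be treated as an assumption here, as are the path and field positions):

a = LOAD '/out/data_bin' USING BinStorage('Utf8StorageConverter');
-- the converter tells BinStorage that the stored bytearrays originated as UTF-8 text,
-- so casts like the ones below can succeed when the data is used later
b = FOREACH a GENERATE (chararray)$0, (int)$2;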

Example Scenario:

In the example below, we load /in/passwd with PigStorage, store it into another file with BinStorage(), and then read the same binary file back to process its contents and display the counts of the paths. Since we are not using any casts between data types, this example succeeds without any errors.
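A sketch of that scenario (the relation names, the output path and the grouping field are assumptions consistent with the description above; no explicit casts are used anywhere):

grunt> A = LOAD '/in/passwd' USING PigStorage(':');
grunt> STORE A INTO '/out/passwd_bin' USING BinStorage();
grunt> B = LOAD '/out/passwd_bin' USING BinStorage();
grunt> -- $6 is the login shell path field of /etc/passwd
grunt> C = GROUP B BY $6;
grunt> D = FOREACH C GENERATE group, COUNT(B) AS path_count;
grunt> DUMP D;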