Built-in Load Store Functions in Pig 2


In this post, we will discuss the following built-in load/store functions in Pig, with examples.

  • PigStorage
  • TextLoader
  • BinStorage
  • JsonLoader, JsonStorage
  • AvroStorage
  • HBaseStorage
  • MongoStorage

PigStorage:

PigStorage() is the default load/store function in Pig. It expects data formatted with field delimiters; the default delimiter is '\t'. PigStorage() can be used for both LOAD and STORE, reading and writing structured text files. All Pig simple and complex data types can be read and written with this function, and the input to a LOAD can be a file, a directory, or a glob.

Its call syntax:

Here the default delimiter is '\t', but we can supply any other character or symbol as the delimiter, as shown in the examples below. The second argument is an option string: valid options include 'schema', 'noschema', 'tagFile', and 'tagPath', and multiple options can be space-separated but enclosed in a single pair of quotes.
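To make the call syntax concrete, here is a sketch (the output path is illustrative; the options are the documented PigStorage options):

```
-- Load colon-delimited data; the second argument is an optional,
-- space-separated option string
A = LOAD '/in/passwd' USING PigStorage(':', 'schema tagFile');

-- Store with a custom delimiter; 'schema' writes a hidden .pig_schema
-- file into the output directory
STORE A INTO '/out/result' USING PigStorage('|', 'schema');
```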

When the 'schema' option is used while storing, the schema is saved in a hidden ".pig_schema" file in the output directory and read back automatically on load. If the tagPath or tagFile option is specified, PigStorage adds a pseudo-column, INPUT_FILE_PATH or INPUT_FILE_NAME respectively, to the beginning of each record.

PigStorage Example:

We will use the /etc/passwd file from the Linux file system, load it into Pig, and extract user and path names from it. We copy this file into the /in/ directory in HDFS and run the commands below in the Pig grunt shell to examine PigStorage functionality. The options are not required here; even without the 'schema tagFile' options we get the same results in the final output file.
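A minimal sketch of the grunt-shell session (the output path and the choice of projected fields are assumptions; the LOAD statement matches the one discussed in this post):

```
A = LOAD '/in/passwd' USING PigStorage(':', 'schema tagFile');

-- /etc/passwd fields are user:passwd:uid:gid:gecos:home:shell.
-- With 'tagFile', field $0 becomes the pseudo-column INPUT_FILE_NAME,
-- so the user name is $1 and the home-directory path is $6.
B = FOREACH A GENERATE $1 AS user, $6 AS home;

STORE B INTO '/out/passwd_users' USING PigStorage(',');
```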

Below is the sample output of the above commands.

[Screenshot: PigStorage example output]

TextLoader:

TextLoader works with unstructured text files in UTF-8 format. It is mainly useful when we cannot impose any structure (schema) on the records of the input text files; we can load such files with this loader, and each line of text is treated as a single field of type bytearray, the default data type in Pig. The resulting relations will not have a schema.

TextLoader supports only loading; it cannot be used to store data, and it does not accept any parameters.

Example call to TextLoader:

Let's load Hadoop logs from the '/in/' HDFS directory with TextLoader and verify the schema and the output tuples of the LOAD statement.

The input can be any sample Hadoop log file; here we use a file named hadooplogs.
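The statements can be sketched as follows (the LIMIT used to build B is an assumption to keep the DUMP small):

```
-- Each line of the log file becomes one tuple with a single bytearray field
A = LOAD '/in/hadooplogs' USING TextLoader();

DESCRIBE A;       -- reports that the schema for A is unknown

B = LIMIT A 5;
DUMP B;           -- each tuple holds one entire log line as a single field
```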

Below is the output of the above grunt-shell Pig Latin statements. As the DESCRIBE A; command shows, relation A has no schema, and each final tuple contains a single field holding an entire line of input text, as shown in the DUMP B; output.

[Screenshot: TextLoader example output]

BinStorage:

BinStorage is used to both load and store binary files. Users rarely use it directly, but Pig uses it internally to load and store the temporary data generated between multiple MapReduce jobs.

When we save text data into binary files with BinStorage, we must be careful to specify a custom converter that casts the bytearray fields to the correct data types when loading the binary files created by the earlier STORE statements.

Example Scenario:

In the example below, we load /in/passwd with PigStorage, store it into another file with BinStorage(), and then read the same binary file back to process its contents and display the counts of the paths. Since no casts between data types are used, this example succeeds without any errors.
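A sketch of that scenario (the output path and the field index for the path column are assumptions):

```
-- Load text data and write it out in Pig's binary format
A = LOAD '/in/passwd' USING PigStorage(':');
STORE A INTO '/out/passwd_bin' USING BinStorage();

-- Re-read the binary file and count occurrences of each path
-- ($6 is the shell field of /etc/passwd). GROUP and COUNT do not
-- require casting the bytearray fields, so this runs without errors.
B = LOAD '/out/passwd_bin' USING BinStorage();
C = GROUP B BY $6;
D = FOREACH C GENERATE group, COUNT(B) AS cnt;
DUMP D;
```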

But if we use any operation that requires casting between data types, we will receive error messages like the one shown below.

In this scenario we need to use the Utf8StorageConverter option of BinStorage to avoid these casting errors, as shown below.
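A sketch of the fix (the binary-file path and the chosen casts are illustrative):

```
-- Passing the converter class name makes BinStorage cast bytearray
-- fields through Utf8StorageConverter when they are used as other types
B = LOAD '/out/passwd_bin' USING BinStorage('Utf8StorageConverter');

-- These casts would fail without the converter
C = FOREACH B GENERATE (chararray)$0 AS user, (int)$2 AS uid;
DUMP C;
```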

JsonLoader() and JsonStorage():

These are used to read/write data in JSON format. JsonLoader() is tightly coupled with JsonStorage(): JsonLoader() can only read data written by JsonStorage(), which writes the schema of the data into a .pig_schema file and the header information (field names) into a .pig_header file in the output directory. This can be seen in the screenshot at the end of this section.

If these two files are not present (as happens when reading JSON files created by external sources), JsonLoader() will throw an error message similar to the one below.

Also, when storing a relation without a schema using JsonStorage(), we get the error messages below. In this example, relation A does not have a schema.

But if we provide a schema to the input relation, as shown below, we get no error messages when storing it into a JSON file with JsonStorage().
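A sketch of storing with an explicit schema (the field names and output path are assumptions based on the /etc/passwd layout):

```
-- Declaring a schema on load gives JsonStorage() the field names
-- and types it needs to write .pig_schema and .pig_header
A = LOAD '/in/passwd' USING PigStorage(':')
      AS (user:chararray, passwd:chararray, uid:int, gid:int,
          gecos:chararray, home:chararray, shell:chararray);

STORE A INTO '/out/passwd_json' USING JsonStorage();
```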

Below is a screenshot of the output file:

[Screenshot: JsonStorage output]

Now, we can read this file successfully with JsonLoader() without any error messages as shown below.
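A sketch of the read-back, assuming the JSON was written to /out/passwd_json by JsonStorage():

```
-- JsonLoader() recovers the schema from the .pig_schema file
-- that JsonStorage() left in the output directory
B = LOAD '/out/passwd_json' USING JsonLoader();
DESCRIBE B;
DUMP B;
```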

We can see the output in the screen below.

[Screenshot: JsonLoader example output]

In the next post, we will discuss AvroStorage(), HBaseStorage(), and XMLStorage().

HBaseStorage

Loads and stores data from an HBase table.
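A sketch of HBaseStorage usage (the table name, column family, and columns are illustrative; the class name and the '-loadKey' option are the standard Pig API):

```
-- '-loadKey true' prepends the HBase row key as the first field
raw = LOAD 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
          'info:name info:age', '-loadKey true')
      AS (id:bytearray, name:chararray, age:int);

-- On STORE, the first field becomes the row key and is not listed
-- in the column specification
STORE raw INTO 'hbase://users_copy'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:name info:age');
```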

OrcStorage

Loads data from or stores data to an ORC file.
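A minimal sketch (paths are illustrative; the '-c' compression option follows the OrcStorage documentation):

```
-- Read an existing ORC file; the schema is taken from the file itself
A = LOAD '/in/student.orc' USING OrcStorage();

-- Write ORC output, optionally choosing the compression codec
STORE A INTO '/out/student_orc' USING OrcStorage('-c SNAPPY');
```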

Handling Compression

Pig's default load/store function (PigStorage) can read and write compressed files, so we do not need a separate load/store function to handle them. To work with gzip-compressed files, the input/output files need a .gz extension. Gzipped files cannot be split across multiple mappers, so the number of mappers created equals the number of part files in the input location.

To work with bzip compressed files, the input/output files need to have a .bz or .bz2 extension. Because the compression is block-oriented, bzipped files can be split across multiple mappers.
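The extension-driven behavior can be sketched as follows (paths are illustrative):

```
-- A .gz (or .bz2) input is decompressed automatically on load
A = LOAD '/in/logs.gz' USING PigStorage();

-- Giving the output location a .bz2 suffix makes PigStorage
-- write bzip2-compressed part files
STORE A INTO '/out/logs.bz2' USING PigStorage();
```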

MongoStorage:
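MongoStorage is not built into Pig; it comes from the external mongo-hadoop connector. A minimal sketch, assuming the connector jars are available (jar names and the database/collection are illustrative):

```
-- Register the mongo-hadoop connector jars (names vary by version)
REGISTER mongo-java-driver.jar;
REGISTER mongo-hadoop-core.jar;
REGISTER mongo-hadoop-pig.jar;

-- Store a relation into a MongoDB collection via a mongodb:// URI
STORE A INTO 'mongodb://localhost:27017/mydb.users'
      USING com.mongodb.hadoop.pig.MongoStorage();
```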



About Siva

Senior Hadoop developer with 4 years of experience in designing and architecture solutions for the Big Data domain and has been involved with several complex engagements. Technical strengths include Hadoop, YARN, Mapreduce, Hive, Sqoop, Flume, Pig, HBase, Phoenix, Oozie, Falcon, Kafka, Storm, Spark, MySQL and Java.



2 thoughts on “Built-in Load Store Functions in Pig”

  • senthil kumar

    Hi Siva,
    I have one question ..
A = LOAD '/in/passwd' USING PigStorage(':', 'schema tagFile');

Here, in the above statement, is A a temporary variable or a bag?

    Thanks!

  • astro

    Hi Siva,

Thanks for this tutorial. I want to load an unstructured JSON document in Pig. How can I load this type of JSON document using JsonLoader()? The number of fields is not known in advance, and it varies greatly from one JSON document to another.

    Thanks!


