In this post, we describe how to install Apache Pig on an Ubuntu machine.
Below are the basic requirements for installing Pig on Ubuntu and getting started.
- Java 1.6 or later installed, with the JAVA_HOME environment variable set to the Java installation directory
- Hadoop 1.x or 2.x installed on the cluster. In this post we use Hadoop 2.3.0, with the HADOOP_HOME environment variable pointing to its installation directory.
Pig Installation Procedure:
- Download the latest stable version of Pig from the Apache download mirrors. The gzipped binary tarball, named pig-x.y.z.tar.gz, can be downloaded directly. In this post we use pig-0.13.0.tar.gz.
- Copy the tarball into the preferred installation directory and extract it.
- Set PIG_HOME to the Pig installation directory, and also set the PIG_CONF_DIR and PIG_CLASSPATH environment variables in the .bashrc file.
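The extraction and environment steps above can be sketched in shell roughly as follows. This is only a sketch: the helper names install_pig and pig_env_lines are hypothetical, and the values chosen for PIG_CONF_DIR (the conf directory under PIG_HOME) and PIG_CLASSPATH (the Pig conf directory plus a Hadoop configuration directory passed as an argument) are assumptions that should be adapted to the actual setup.

```shell
# Hypothetical helper: unpack an already downloaded Pig tarball.
install_pig() {
  tarball="$1"   # e.g. pig-0.13.0.tar.gz
  dest="$2"      # preferred installation directory, e.g. "$HOME"
  mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}

# Hypothetical helper: print the export lines intended for ~/.bashrc.
# $1 = Pig installation directory
# $2 = Hadoop configuration directory (assumption: "$HADOOP_HOME/etc/hadoop" on Hadoop 2.x)
pig_env_lines() {
  cat <<EOF
export PIG_HOME=$1
export PIG_CONF_DIR=\$PIG_HOME/conf
export PIG_CLASSPATH=\$PIG_CONF_DIR:$2
export PATH=\$PIG_HOME/bin:\$PATH
EOF
}
```

For example, running install_pig pig-0.13.0.tar.gz "$HOME" and then pig_env_lines "$HOME/pig-0.13.0" "$HADOOP_HOME/etc/hadoop" >> ~/.bashrc, followed by source ~/.bashrc, would make the pig command available in new shells.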
Pig properties can be changed in the PIG_CONF_DIR/pig.properties file.
Note: The Pig engine usually prints a lot of INFO-level messages to the console. To hide these and see only messages at WARN level and above, rename the log4j.properties.template file under PIG_CONF_DIR to log4j.properties and set the logging level in it to WARN.
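One possible way to make that change from the shell is sketched below. It assumes the template contains a logger line of the form log4j.logger.org.apache.pig=info, A, which should be checked against the actual file. The helper name quiet_pig_logging is hypothetical, and it copies the template rather than renaming it so the original stays available.

```shell
# Hypothetical helper: create log4j.properties from the template,
# raising the Pig logger level to WARN.
quiet_pig_logging() {
  conf_dir="$1"   # the PIG_CONF_DIR location
  # Assumption: the template has a line like
  #   log4j.logger.org.apache.pig=info, A
  # Only the level on that line is rewritten; other lines pass through unchanged.
  sed 's/^\(log4j\.logger\.org\.apache\.pig=\)[A-Za-z]*/\1WARN/' \
    "$conf_dir/log4j.properties.template" > "$conf_dir/log4j.properties"
}
```

It would be called as quiet_pig_logging "$PIG_CONF_DIR" once the environment variables are set.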
Verify Pig Installation:
Let's verify the Pig installation on Ubuntu with the $ pig -h command, which displays Pig's help contents.
Pig Example Run:
Once we have configured Pig to connect to a Hadoop cluster, we can launch Pig in one of two modes:
- Local Mode: Local mode is much faster, but only suitable for small amounts of data. Local mode interprets paths on the local file system. Setting the -x option to local enables local mode.
- MapReduce Mode: Enabled by setting the -x option to mapreduce, or by omitting the option entirely, since MapReduce mode is the default.
There are three ways of executing Pig programs, and all of them work in both local and MapReduce mode.
- Script: We can run a script file that contains Pig commands. For example, $ pig script.pig runs the commands in the local file script.pig.
- Grunt: Grunt is an interactive shell for running Pig commands. Grunt is started when no file is specified for Pig to run and the -e option is not used. It is also possible to run Pig scripts from within Grunt using run and exec.
- Embedded: We can run Pig programs from Java using the PigServer class. For programmatic access to Grunt, we can use PigRunner.
In the example below, Pig Latin statements extract all user IDs from the /etc/passwd file in Grunt interactive mode.
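As a rough sketch of such a run, the statements can be saved to a script file and executed in local mode. The file name ids.pig and the relation names passwd and ids are illustrative choices, not the names used in the session described in this post; the same statements could equally be typed one by one at the grunt> prompt.

```shell
# Write a small Pig Latin script; the quoted 'EOF' stops the shell from expanding $0.
cat > ids.pig <<'EOF'
-- Load /etc/passwd, splitting each line on ':'
passwd = LOAD '/etc/passwd' USING PigStorage(':');
-- Keep only the first field of each line (the user ID)
ids = FOREACH passwd GENERATE $0 AS id;
-- DUMP triggers execution and prints the result to the console
DUMP ids;
EOF

# Run the script in local mode if Pig is on the PATH.
if command -v pig >/dev/null 2>&1; then
  pig -x local ids.pig
else
  echo "pig not found on PATH; skipping run" >&2
fi
```

Dropping -x local would run the same script in MapReduce mode, in which case the path is resolved on the cluster file system rather than the local one.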
In the example above, execution of the statements starts only after the last grunt> dump C; command is entered: Pig only builds a plan for the earlier statements and runs it when output is requested. At that point we can observe the map task being submitted in the background.
So, we have successfully loaded the file, split each line on “:”, extracted the fields, and displayed the first field of each line of the /etc/passwd file.