An Introduction to Apache Pig For Absolute Beginners!

Dhanya Thailappan Last Updated : 08 Aug, 2021

This article was published as a part of the Data Science Blogathon

This article focuses on Apache Pig, a high-level platform for processing and analyzing large amounts of data.

OVERVIEW

At the top level, Pig is an abstraction over MapReduce. Pig runs on Hadoop, so it makes use of both the Hadoop Distributed File System (HDFS) and Hadoop’s processing system, MapReduce. Pig is used to analyze data sets as data flows. It includes a high-level language called Pig Latin for expressing these data flows and an engine that executes them.

The input to Pig is a Pig Latin script, which is converted into MapReduce jobs. Pig uses MapReduce to do all of its data processing: it compiles a Pig Latin script into a series of one or more MapReduce jobs and then executes them.

Apache Pig was originally developed at Yahoo to be easy to learn and work with, so it makes Hadoop much more approachable. It was developed because MapReduce programming was getting quite difficult and many users were not comfortable writing low-level MapReduce code in Java. Pig is now an open-source project under Apache.

TABLE OF CONTENTS

1. Features of Pig
2. Pig vs MapReduce
3. Pig Architecture
4. Pig Execution Options
5. Pig Grunt Shell Commands
6. Pig Data Types
7. Pig Operators
8. Pig Latin Script Example

1. FEATURES OF PIG

Let’s look at some of the features of Pig.

It has a rich set of operators such as join, sort, and filter.
Pig Latin is easy to program because it is similar to SQL.
Tasks in Apache Pig are converted into MapReduce jobs automatically, so programmers need to focus only on the semantics of the language and not on MapReduce.
Users can create their own functions (user-defined functions, or UDFs).
Functions written in other programming languages such as Java can be called from Pig Latin scripts.
Apache Pig can handle all kinds of data (structured, semi-structured, and unstructured) and stores the results in HDFS.

2. PIG VS MAPREDUCE

Let’s see the difference between Pig and MapReduce.

Pig has several advantages over MapReduce.

Pig Latin is a data flow language. This means it lets users describe how data from one or more inputs should be read, processed, and then stored to one or more outputs in parallel. MapReduce, on the other hand, is a programming model.

Pig Latin is a high-level language, while MapReduce programs are written in Java and compiled.

Pig's syntax for joining multiple files is intuitive and simple, much like SQL. MapReduce code becomes complex when you need to write join operations.

The learning curve for Apache Pig is gentle. Writing MapReduce code, in contrast, requires expertise in Java and the MapReduce libraries.

A few lines of a Pig script can do the equivalent of many lines of MapReduce code.

Apache Pig scripts are easy to debug and test, while MapReduce programs take a lot of time to code and test. Developing in Pig Latin is therefore less costly than developing in MapReduce.
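To make the contrast more concrete, here is a minimal sketch in plain Python (not Hadoop code) of the map, shuffle, and reduce steps that a hand-written MapReduce job has to spell out just to count records per city. The function names and the sample records are illustrative assumptions; the records mirror the student.txt rows used later in this article.

```python
from collections import defaultdict

# Illustrative (name, age, city) records, assumed for this sketch.
records = [
    ("John", 23, "Hyderabad"),
    ("James", 45, "Hyderabad"),
    ("Sam", 33, "Chennai"),
]


def map_phase(record):
    """Emit one (key, value) pair per record: (city, 1)."""
    _name, _age, city = record
    yield city, 1


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def count_by_city(rows):
    """Run the three phases in sequence on an in-memory list of rows."""
    pairs = [pair for row in rows for pair in map_phase(row)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


city_counts = count_by_city(records)
print(city_counts)
```

In Pig Latin, the same idea is usually expressed with a GROUP ... BY statement followed by a FOREACH ... GENERATE that applies COUNT, and Pig generates the map and reduce steps itself.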

3. PIG ARCHITECTURE

Now let’s see Pig architecture.

Pig sits on top of Hadoop. Pig scripts can be run either from the Grunt shell or through the Pig server. Pig's execution engine parses, optimizes, and compiles the script and finally converts it into MapReduce jobs. It uses HDFS to store intermediate data between MapReduce jobs and then writes its output to HDFS.

4. PIG EXECUTION OPTIONS

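Apache Pig can run in two execution modes. Both of them produce the same results.

4.1. Local mode

In local mode, Pig runs on a single machine and reads and writes files on the local file system, which is convenient for testing with small data sets.

Command on the gateway machine:

pig -x local

4.2. Hadoop mode

In Hadoop (MapReduce) mode, Pig runs the script as MapReduce jobs on a Hadoop cluster and reads and writes data in HDFS.

Command on the gateway machine:

pig -exectype mapreduce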

Within either of the above two modes, Apache Pig can be run in three ways.

Interactive mode / Grunt shell: enter Pig commands manually in the Grunt shell.
Batch mode / script file: place Pig commands in a script file and run the script.
Embedded program / UDF: embed Pig commands in a Java program and run them from there.

5. PIG GRUNT SHELL COMMANDS

The Grunt shell can be used to write Pig Latin statements interactively. It can also invoke file system and shell commands through the fs and sh commands. Let’s see some basic Pig commands.

5.1. fs command

The fs command lets you run HDFS commands from Pig.

5.1.1. To list files and directories in HDFS

grunt> fs -ls;

With no path given, this lists the files and directories in your HDFS home directory.

5.1.2. To create a new directory mydir in HDFS

grunt> fs -mkdir mydir/;

The above command will create a new directory called mydir in HDFS.

5.1.3. To remove a directory

grunt> fs -rmdir mydir;

The above command will remove the directory mydir created earlier. Note that fs -rmdir only removes empty directories.

5.1.4. To copy a file to HDFS

grunt> fs -put sales.txt sales/;

Here, sales.txt is the source file on the local file system, and it will be copied to the destination directory sales in HDFS.

5.1.5. To quit from grunt shell

grunt> quit;

The above command will exit the grunt shell.

5.2. sh command

The sh command lets you run Unix shell commands from Pig.

5.2.1. To display the current date

grunt> sh date;

This command will show the current date and time.

5.2.2. To list local files

grunt> sh ls;

This command will list the files in the current directory of the local file system.

5.2.3. To execute Pig Latin from grunt shell

grunt> run salesreport.pig;

The above command will execute a Pig Latin script file “salesreport.pig” from the grunt shell.

5.2.4. To execute Pig Latin from Unix prompt

$ pig salesreport.pig

The above command will execute the Pig Latin script file “salesreport.pig” from the Unix prompt.

6. PIG DATA TYPES

Pig Latin supports the following data types.

6.1. Data Atom

An atom is a single value, such as a string or a number. Atoms have scalar types such as int, long, float, double, and chararray.

For example, 'john', 9.0

6.2. Tuple

A tuple is an ordered sequence of fields, similar to a record or a row. Each field can be of any data type.

For example, (‘john’, ‘james’) is a tuple.

6.3. Data bag

A bag is a collection of tuples, roughly equivalent to a “table” in SQL. The tuples need not be unique, and each tuple can have an arbitrary number of fields, each of any type.

For example, {('john', 'James'), ('king', 'mark')} is a data bag which is equivalent to the below table in SQL.

john	James
king	mark

6.4. Data map

A map is a collection of key-value pairs. Each key must be a chararray and must be unique. The values can be of any type.

For example, [name#('john', 'james'), age#22] is a data map where name and age are keys and ('john', 'james') and 22 are values.
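As a rough analogy only, the four data types can be pictured with built-in Python types. This is a sketch for intuition, not how Pig represents data internally; the variable names and the helper distinct_count are hypothetical, and a Python list is ordered whereas a Pig bag is not.

```python
# Atom: a single scalar value.
name_atom = "john"
score_atom = 9.0

# Tuple: an ordered sequence of fields, each of any type.
person_tuple = ("john", "james")

# Bag: a collection of tuples. Duplicates are allowed, so a list is used
# here instead of a set.
people_bag = [("john", "james"), ("king", "mark"), ("john", "james")]

# Map: unique string keys, values of any type (here a tuple and an int).
info_map = {"name": ("john", "james"), "age": 22}


def distinct_count(bag):
    """Count the distinct tuples in a bag-like list."""
    return len(set(bag))


print(len(people_bag), distinct_count(people_bag))
```

The duplicate tuple in people_bag is kept, which is the main way a bag differs from a mathematical set.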

7. PIG OPERATORS

Below are the contents of the student.txt file.

John,23,Hyderabad
James,45,Hyderabad
Sam,33,Chennai
,56,Delhi
,43,Mumbai

7.1. LOAD

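The LOAD operator loads data from the file system into a relation.

A = LOAD 'student.txt' USING PigStorage(',') AS (name: chararray, age: int, city: chararray);

The data from the student file is loaded into a relation A with the column names ‘name’, ‘age’, and ‘city’. Because the fields in student.txt are separated by commas, the delimiter is passed to PigStorage, whose default delimiter is the tab character.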
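The effect of loading a delimited file with a schema can be sketched in plain Python. This is only an illustration of the idea, assuming comma-separated fields as in the student.txt listing and treating empty fields as missing values; the load function and its parameters are hypothetical and are not part of Pig.

```python
def load(lines, delimiter, schema):
    """Split each line on the delimiter and cast each field per the schema.

    Empty fields become None, standing in for a null value.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        row = tuple(
            cast(value) if value != "" else None
            for value, (_column, cast) in zip(fields, schema)
        )
        rows.append(row)
    return rows


# The rows of student.txt from this section.
student_lines = [
    "John,23,Hyderabad",
    "James,45,Hyderabad",
    "Sam,33,Chennai",
    ",56,Delhi",
    ",43,Mumbai",
]

# Column names paired with Python casts that stand in for chararray and int.
schema = [("name", str), ("age", int), ("city", str)]

A = load(student_lines, ",", schema)
print(A)
```

Each row becomes a tuple whose fields follow the declared column order, and the two rows with a missing name carry None in the first position.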

7.2. DUMP

The DUMP operator is used to display the contents of a relation. Here, the contents of A will be displayed.

DUMP A;
-- results
(John,23,Hyderabad)
(James,45,Hyderabad)
(Sam,33,Chennai)
(,56,Delhi)
(,43,Mumbai)

7.3. STORE

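The STORE operator saves the results to the file system.

STORE A INTO 'myoutput' USING PigStorage('*');

Here, the data present in A will be stored in the output directory myoutput, with fields separated by ‘*’. Because myoutput is a directory in the file system and not a relation, it cannot be displayed with DUMP. The Grunt cat command can be used to print the stored files instead.

grunt> cat myoutput;
-- results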
John*23*Hyderabad
James*45*Hyderabad
Sam*33*Chennai
*56*Delhi
*43*Mumbai
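What storing with a custom delimiter does to each tuple can be sketched in plain Python. This is an illustration under the assumption that missing values are written as empty fields, which is what the output above suggests; the function to_delimited_lines is hypothetical and not part of Pig.

```python
def to_delimited_lines(rows, delimiter):
    """Turn each tuple into one text line with fields joined by the delimiter."""
    lines = []
    for row in rows:
        # A missing value (None) is written as an empty field.
        fields = ["" if field is None else str(field) for field in row]
        lines.append(delimiter.join(fields))
    return lines


# Two sample tuples, one with a missing name.
A = [("John", 23, "Hyderabad"), (None, 56, "Delhi")]

stored = to_delimited_lines(A, "*")
print("\n".join(stored))
```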

7.4. FILTER

B = FILTER A BY name IS NOT NULL;

The FILTER operator selects the rows of a relation that satisfy a condition. Here, name is a column in A, and the rows whose name is not null are stored in the relation B.

DUMP B;
-- results
(John,23,Hyderabad)
(James,45,Hyderabad)
(Sam,33,Chennai)
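The same selection can be sketched in plain Python, with None standing in for a null name. The helper filter_rows is a hypothetical name used only for this illustration.

```python
def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate returns True."""
    return [row for row in rows if predicate(row)]


# The five student rows, with None where the name is missing.
A = [
    ("John", 23, "Hyderabad"),
    ("James", 45, "Hyderabad"),
    ("Sam", 33, "Chennai"),
    (None, 56, "Delhi"),
    (None, 43, "Mumbai"),
]

# Equivalent in spirit to: FILTER A BY name IS NOT NULL
B = filter_rows(A, lambda row: row[0] is not None)
print(B)
```

The input relation is left unchanged, and the filtered rows form a new relation.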

7.5. FOREACH GENERATE

C = FOREACH A GENERATE name, city;

The FOREACH operator processes each record of a relation and generates a new record from it. Here, the columns name and city are taken from every row of A and stored in C.

DUMP C;
-- results
(John,Hyderabad)
(James,Hyderabad)
(Sam,Chennai)
(,Delhi)
(,Mumbai)
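Selecting columns row by row can be sketched in plain Python as follows. The function foreach_generate and the list-based schema are hypothetical names chosen for this illustration, not Pig APIs.

```python
def foreach_generate(rows, schema, wanted):
    """Keep only the wanted columns of every row, in the order requested."""
    positions = [schema.index(column) for column in wanted]
    return [tuple(row[i] for i in positions) for row in rows]


schema = ["name", "age", "city"]
A = [
    ("John", 23, "Hyderabad"),
    ("Sam", 33, "Chennai"),
    (None, 56, "Delhi"),
]

# Equivalent in spirit to: FOREACH A GENERATE name, city
C = foreach_generate(A, schema, ["name", "city"])
print(C)
```

Every input row produces exactly one output row, so rows with a missing name are still present in the result.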

8. PIG LATIN SCRIPT EXAMPLE

We have a people file that has employee id, name, and hours as fields.

001,Rajiv,21
002,siddarth,12
003,Rajesh,22

First, load this data into a relation called employee. Filter it to keep the rows with fewer than 20 hours and store the result in parttime. Sort parttime by hours in descending order and store it in an output directory called part_time. Finally, display the contents.

The script will be

employee = LOAD 'people' USING PigStorage(',') AS (empid: chararray, name: chararray, hours: int);
parttime = FILTER employee BY hours < 20;
sorted = ORDER parttime BY hours DESC;
STORE sorted INTO 'part_time';
DESCRIBE sorted;
DUMP sorted;
-- results
(002,siddarth,12)
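As a cross-check of the data flow described in this section (load, keep rows with fewer than 20 hours, sort by hours in descending order), here is a minimal sketch in plain Python. It assumes the three sample rows are comma-separated, and the helper load_people is a hypothetical name. With these rows, only the employee with 12 hours satisfies the condition.

```python
people_lines = ["001,Rajiv,21", "002,siddarth,12", "003,Rajesh,22"]


def load_people(lines):
    """Parse 'empid,name,hours' lines into (empid, name, hours) tuples."""
    rows = []
    for line in lines:
        empid, name, hours = line.split(",")
        rows.append((empid, name, int(hours)))
    return rows


employee = load_people(people_lines)

# FILTER employee BY hours < 20
parttime = [row for row in employee if row[2] < 20]

# ORDER parttime BY hours DESC
sorted_rows = sorted(parttime, key=lambda row: row[2], reverse=True)

print(sorted_rows)
```

The employee id is kept as a string so that the leading zeros are preserved.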

ENDNOTES

These are some of the basic concepts of Apache Pig. I hope you enjoyed reading this article. Start practising with the Cloudera environment.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Dhanya Thailappan

Predicting the future is not magic. It's an Artificial Intelligence!! This inspired me so much and that's why I love Data Science and Artificial Intelligence. I am currently working as a Data Engineer. I wish to explore more and share my knowledge with others.
