This article was published as a part of the Data Science Blogathon.
Apache Pig is a high-level platform for analysing massive amounts of data; its scripts are written in a language called Pig Latin. Pig grew out of development work at Yahoo.
In a MapReduce architecture, a program must be expressed as a succession of Map and Reduce stages, a programming model that many data analysts are not familiar with. Pig was built as an abstraction on top of Hadoop to bridge this gap. With Apache Pig, people can spend more time analysing large datasets and less time writing MapReduce applications.
Internally, Pig scripts are translated into MapReduce jobs, which then run over data stored in HDFS. Pig can also run its scripts on other execution engines, such as Apache Tez or Apache Spark.
Apache Pig can process structured, semi-structured, and unstructured data and store the results in the Hadoop Distributed File System (HDFS). Every operation that can be accomplished with Pig can also be accomplished with Java in MapReduce.
Researchers at Yahoo built Pig in 2006, with the primary goal of running MapReduce jobs on huge datasets. It moved to the Apache Software Foundation (ASF) in 2007, which made it an open-source project. Pig’s first version (0.1) was released in 2008.
Programmers who aren’t fluent in Java have traditionally struggled with Hadoop, particularly when writing MapReduce jobs. For these programmers, Apache Pig is a godsend.
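To make the Map and Reduce stages concrete, here is a minimal sketch in plain Python of the classic word-count job. It is only an analogy that runs on one machine: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this illustration and are not part of Hadoop or Pig.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: collect all emitted values under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Chain the three stages, the way a single MapReduce job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    sample = ["pig runs on hadoop", "hadoop runs mapreduce"]
    print(word_count(sample))
```

Even this tiny job needs three explicit stages. Pig lets the analyst describe the data flow and leaves the staging to the framework.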
There are many features of Apache Pig. Let us have a look:
The Apache Pig components that process the Pig Latin language over various levels are as follows:
Parser: The parser receives the user's programme and checks its syntax and types. Its output is a directed acyclic graph (DAG) that represents the Pig Latin statements and logical operators.
Optimizer: The DAG is passed to the logical optimizer, which carries out logical optimisations on the plan.
Compiler: At this stage, the optimised logical plan is compiled into a series of MapReduce jobs.
Execution Engine: The MapReduce jobs are sent to Hadoop for execution in this final stage. On completion, the user receives the desired data.
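The four stages above can be pictured with a deliberately simplified toy in Python. Everything in it is an assumption made for illustration: the one-line "OP argument" script format, the single optimisation rule (run a FILTER before an ORDER), and the rule that GROUP and ORDER end a job. Pig's real parser, optimizer, and compiler are far more elaborate.

```python
KNOWN_OPS = {"LOAD", "FILTER", "ORDER", "GROUP", "STORE"}
SHUFFLE_OPS = {"GROUP", "ORDER"}  # toy assumption: these operators end a job


def parse(script):
    """Parser stand-in: turn 'OP argument' lines into a plan, rejecting unknown operators."""
    plan = []
    for line in script.strip().splitlines():
        op, _, arg = line.strip().partition(" ")
        if op not in KNOWN_OPS:
            raise ValueError(f"unknown operator: {op}")
        plan.append((op, arg))
    return plan


def optimize(plan):
    """Optimizer stand-in: move a FILTER ahead of the ORDER just before it, so fewer rows get sorted."""
    plan = list(plan)
    for i in range(1, len(plan)):
        if plan[i][0] == "FILTER" and plan[i - 1][0] == "ORDER":
            plan[i - 1], plan[i] = plan[i], plan[i - 1]
    return plan


def compile_plan(plan):
    """Compiler stand-in: cut the plan into jobs after every shuffle operator."""
    jobs, current = [], []
    for step in plan:
        current.append(step)
        if step[0] in SHUFFLE_OPS:
            jobs.append(current)
            current = []
    if current:
        jobs.append(current)
    return jobs


def execute(jobs):
    """Execution-engine stand-in: 'run' each job in order and report its operators."""
    return [" -> ".join(op for op, _ in job) for job in jobs]


if __name__ == "__main__":
    script = """
    LOAD visits
    ORDER by_time
    FILTER status_ok
    GROUP by_user
    STORE results
    """
    for job in execute(compile_plan(optimize(parse(script)))):
        print(job)
```

The point of the sketch is the hand-off between stages: each one consumes the previous stage's output, and the user only ever writes the script at the top.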
Pig’s architecture is divided into two parts:
( Image Source: https://www.guru99.com/introduction-to-pig-and-hive.html)
A Pig Latin programme is made up of a series of operations or transformations that are applied to input data to produce output. These operations define a data flow that the Pig execution environment translates into an executable form. The result of this translation is a sequence of MapReduce jobs that are hidden from the programmer. In a sense, Pig lets the programmer concentrate on the data rather than on how the work is executed.
Pig Latin is a relatively constrained language built around familiar data processing keywords such as Join, Group, and Filter.
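For readers who have not met these operations before, the following Python sketch shows roughly what Filter, Group, and Join do to rows of data. It is an in-memory analogy, not Pig Latin; the sample relations `users` and `orders` and the helper names are invented for the example.

```python
from collections import defaultdict

# Hypothetical sample relations: (user_id, name) and (user_id, amount).
users = [(1, "ana"), (2, "bo"), (3, "cy")]
orders = [(1, 30.0), (1, 12.5), (2, 99.0), (4, 5.0)]


def filter_rows(rows, predicate):
    """Filter: keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]


def group_by(rows, key_index):
    """Group: collect rows that share the same value in the key field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return dict(groups)


def inner_join(left, right, left_key, right_key):
    """Join: pair every left row with each right row that has a matching key."""
    index = group_by(right, right_key)
    return [l + r for l in left for r in index.get(l[left_key], [])]


if __name__ == "__main__":
    large_orders = filter_rows(orders, lambda order: order[1] >= 10)
    print(group_by(large_orders, 0))
    print(inner_join(users, large_orders, 0, 0))
```

In Pig Latin each of these steps is a single statement, and the rows would live in HDFS rather than in Python lists.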
Pig has two execution modes in Hadoop:
Local mode: Pig runs in a single JVM and uses the local file system. This mode is suitable only for analysing small datasets.
MapReduce mode: Pig Latin queries are translated into MapReduce jobs and executed on a Hadoop cluster, which may be pseudo-distributed or fully distributed. With a fully distributed cluster, Pig can process huge datasets in this mode.
There are certain differences between Apache Pig and MapReduce. Let us analyse them.
Apache Pig | MapReduce |
Pig Latin is a scripting language. | MapReduce programs are written in a compiled language, typically Java. |
The level of abstraction is high. | The level of abstraction is low. |
It needs fewer lines of code than the equivalent MapReduce program. | It needs more lines of code. |
It requires less development effort. | It requires more development effort. |
Apache Pig has many uses and applications. Let us see what they are.
The Pig Latin data model has four types:
Atom: An atom is a single scalar value, such as an int, long, float, double, chararray (string), or bytearray. Atomic types are also called scalar or primitive types. Each cell value in a field (column) is an atom.
Tuple: A tuple is an ordered collection of fields, written with the ‘()’ symbol. Fields in a tuple can be accessed by their position.
Bag: A bag is a collection of tuples, which may have different structures and may include duplicates. A bag is similar to a table in an RDBMS, but unlike a table it does not require every tuple to have the same number of fields, or fields in the same position to have the same type.
Map: A map is an associative array, that is, a set of key-value pairs.
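One way to build intuition for these four types is to line them up with familiar Python values. This is a loose analogy rather than how Pig stores data: in particular, a Python list keeps its elements in order, whereas a Pig bag makes no such promise. The helper `pig_kind` is hypothetical and exists only for this sketch.

```python
# Atom: a single scalar value.
atom = "alice"

# Tuple: an ordered set of fields, accessed by position.
row = ("alice", 30, "london")

# Bag: a collection of tuples; duplicates are allowed and tuples may differ in length.
bag = [("alice", 30), ("bob",), ("alice", 30)]

# Map: key-value pairs.
profile = {"name": "alice", "age": 30}


def pig_kind(value):
    """Name the data model category that a Python value stands in for in this sketch."""
    if isinstance(value, dict):
        return "map"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, tuple):
        return "tuple"
    return "atom"


if __name__ == "__main__":
    for value in (atom, row, bag, profile):
        print(pig_kind(value), value)
    print(row[0])  # first field of the tuple, accessed by position
```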
Apache Pig can analyse both structured and unstructured data and supports many data types. It also supports user-defined functions (UDFs): users can write their own functions, in languages such as Java, for special-purpose processing and for tasks like reading and writing data.
We have taken a broad look at Apache Pig, so let us summarise what we discussed so far:
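As a rough idea of what a user-defined function looks like, here is a small sketch in Python. The article names Java as a UDF language; Python is used here as an assumption, since Pig can also load Python functions through Jython. Pig's documentation describes an `outputSchema` decorator for such functions. The version defined below is only a stand-in so that the file runs outside Pig, and `email_domain` is a made-up example function.

```python
def outputSchema(schema):
    """Stand-in for Pig's decorator: it only records the schema string on the function."""
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema("domain:chararray")
def email_domain(email):
    """Return the lower-cased domain of an e-mail address, or None if the input is malformed."""
    if email is None or email.count("@") != 1:
        return None
    local, domain = email.split("@")
    if not local or not domain:
        return None
    return domain.lower()


if __name__ == "__main__":
    for value in ("Ana@Example.COM", "not-an-email", None):
        print(value, "->", email_domain(value))
```

Returning `None` for bad input, instead of raising an error, keeps one malformed record from stopping a job that processes millions of rows.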
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.