This article was published as a part of the Data Science Blogathon.
In this article, we introduce the big data ecosystem and the role of Apache Spark in it. We also cover distributed systems, the backbone of big data.
In today's world, data is the fuel. Almost every electronic device collects data that is used for business purposes. Now imagine working with data at that volume. A more powerful computer helps only up to a point.
Processing petabytes of data is impractical for a single computer, which is where big data technologies come into the picture.
Big data is the domain in which we handle very large volumes of data using specialized big data tools and cloud systems.
Big data processing requires parallel computation, since loading petabytes of data into a single computer, even a very high-end one, is not feasible. Spreading the computation across many machines that work in parallel is known as distributed computing.
A single computer in a distributed system is known as a node, and each node uses its own computing resources.
A master node is responsible for dividing the workload among the worker nodes, and if a worker node fails, the master stops sending work to it.
A cluster is a collection of nodes, including the master node, that work in synchronization.
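The division of work described above can be illustrated with a small pure-Python sketch. It is only an analogy, not how Spark or Hadoop is implemented: a thread pool stands in for the worker nodes, and the hypothetical helpers split_workload and run_on_cluster play the role of the master node.

```python
from concurrent.futures import ThreadPoolExecutor


def split_workload(data, num_workers):
    """Deal the items out round-robin so each worker gets a similar share."""
    items = list(data)
    return [items[i::num_workers] for i in range(num_workers)]


def worker_task(chunk):
    """Work done by one 'node': here, simply summing its own chunk."""
    return sum(chunk)


def run_on_cluster(data, num_workers=4):
    """'Master' logic: split the workload, dispatch it, combine partial results."""
    chunks = split_workload(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(worker_task, chunks))
    return sum(partial_sums)


total = run_on_cluster(range(1, 1001), num_workers=4)
print(total)  # 500500, the same as sum(range(1, 1001))
```

Each worker only ever sees its own chunk, and the master only combines the small partial results. That is the property that lets a real cluster handle data that would not fit on one machine.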
Many open-source tools make up the big data ecosystem. Open-source tools are popular in big data because they are free to use and their source code is open to inspection, so users can check for themselves how their data is handled.
Popular open-source big data tools include Apache Spark, Hadoop, MapReduce, Hive, and Impala.
The Hadoop ecosystem consists of various open-source tools that fall under the Apache project. These tools are built for big data workloads, and the components of the ecosystem depend on one another to work.
Spark is a distributed, in-memory data processing tool and a powerful replacement for Apache MapReduce. Spark is faster than MapReduce because of in-memory computation, which makes it well suited to high-volume data processing.
In-memory computation uses the RAM of the individual machines for computation instead of the disk, which is what makes Spark fast.
The Apache Spark core engine consists of three components: the Spark driver, the cluster manager, and the executors.
The Spark driver runs the application code we write and owns the Spark context. It translates that code into tasks and requests resources from the cluster manager, which allocates them across the cluster. The executors run on the worker nodes and carry out the tasks the driver assigns to them.
Spark applications can be written in Python through PySpark, the Spark API for Python. PySpark uses Py4J in the backend to communicate with Spark's Java (JVM) code.
Spark can run locally in any Python environment. We can also build Spark clusters on cloud notebooks. A popular environment for running Spark is Databricks, which also provides some datasets to work on.
Here is a guide on running Spark clusters on Databricks for free.
To run Spark in Python, we need the pyspark and findspark modules.
!pip install pyspark
!pip install findspark
Findspark
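findspark locates the Spark installation and adds PySpark to the Python path at runtime, so that pyspark can be imported like a regular library. It can optionally add a startup file to the current IPython profile so that the Spark environment is prepared automatically.
import findspark
findspark.init()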
Spark Session and Context
Spark session
The Spark session keeps track of our application. It must be created before we load data or do any other work with Spark.
SparkContext
The Spark context is an entry point to a Spark application, and it also includes RDD functions such as parallelize().
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Initialize the Spark context
sc = SparkContext()

# Create the Spark session
spark = (
    SparkSession.builder
    .appName("Application name")
    .config("spark.some.config.option", "somevalue")
    .getOrCreate()
)
getOrCreate
Returns the existing Spark session if there is one; otherwise it creates a new session.
spark
Spark RDDs (Resilient Distributed Datasets) are the fundamental data structure of Spark: immutable, distributed collections of objects. RDDs are processed in memory, and the core engine of Spark is built around them.
RDDs in Spark are created either by parallelizing an existing collection or by referencing a dataset in external storage.
RDDs work in a distributed fashion: the dataset in an RDD is divided into logical partitions, which are computed on the different nodes of the cluster. Spark RDDs are fault tolerant, and the other dataset abstractions in Spark are built on top of RDDs.
RDD accepts the following types of datatypes —
In RDD, data is distributed across multiple nodes, making it work in a distributed manner.
RDD supports lazy evaluation, which means it doesn’t compute anything until the value is required.
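sc.parallelize
It turns a local Python collection (here, a range) into an RDD. The second argument is the number of partitions.
data = range(1, 30)
# Print the first element of the range
print(data[0])
len(data)
# Distribute the data across 5 partitions
xRDD = sc.parallelize(data, 5)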
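To get a feel for what the second argument of sc.parallelize does, the sketch below splits the same range into five partitions in plain Python. The helper split_into_partitions is hypothetical and only mimics the idea of logical partitions. It does not call Spark, and the way Spark actually slices the data may differ in detail.

```python
def split_into_partitions(data, num_partitions):
    """Split data into num_partitions contiguous slices of near-equal size."""
    items = list(data)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


data = range(1, 30)  # the numbers 1 to 29, as in the example above
partitions = split_into_partitions(data, 5)

# Show which elements ended up in which partition.
for index, part in enumerate(partitions):
    print(index, part)
```

In PySpark itself, xRDD.getNumPartitions() reports the number of partitions of an RDD, and xRDD.glom().collect() returns the contents of each partition as a separate list.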
Transformations are the rules that define how the data should be computed. RDDs are evaluated lazily, which means that no calculation is performed until an action is called.
Each transformation is stored as a rule and is only executed at the action stage.
# Reduce each number by 2
sRDD = xRDD.map(lambda x: x - 2)
# Select all numbers less than 20
filteredRDD = sRDD.filter(lambda x: x < 20)
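The idea that transformations are only recorded and then run when an action is called can be mimicked in plain Python. The LazyDataset class below is a hypothetical toy, not Spark's implementation: map and filter just append a rule, and nothing is computed until collect or count is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)  # recorded (kind, function) rules

    # Transformations: return a new dataset with one more recorded rule.
    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    # Actions: only here are the recorded rules actually applied.
    def collect(self):
        items = iter(self._source)
        for kind, func in self._steps:
            items = map(func, items) if kind == "map" else filter(func, items)
        return list(items)

    def count(self):
        return len(self.collect())


calls = []  # records every time the map function really runs


def minus_two(x):
    calls.append(x)
    return x - 2


lazy = LazyDataset(range(1, 30)).map(minus_two).filter(lambda x: x < 20)
print(len(calls))      # 0: nothing has been computed yet
print(lazy.collect())  # the action triggers the computation
print(len(calls))      # 29: the map function ran once per element
```

Because every transformation returns a new object and leaves the original untouched, the toy also reflects the immutability of RDDs.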
Actions trigger the actual computation. After defining the transformations, we call an action whenever we need the values. Because transformations never modify the source RDD, this also helps preserve data integrity.
print(filteredRDD.collect())
filteredRDD.count()
Output:
In this article, we talked about the big data ecosystem and the various types of tools it is made up of.
We talked about the role of distributed systems and how Spark works in big data.
The Spark architecture consists of a driver, a cluster manager, and executors. Spark works in a distributed manner, the same as Hadoop, but unlike Hadoop MapReduce, it uses in-memory computation instead of the disk.
We discussed RDDs, transformations, and actions.
Thanks for reading this article.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.