With the growth of the data industry, there is an increasing need to store and process vast amounts of data. Businesses today can collect more data than ever before, and that data holds valuable insights into the business and its customers if it can be unlocked. Modern data arrives in many formats: video, audio, and text. Data lakes, typically built on cloud object stores that can hold and process exabytes of data, store these formats at low cost and, unlike data warehouses, do not lock businesses into a particular vendor. However, data lakes have limitations: it is difficult to maintain reliable data, query results can be inaccurate, performance degrades as data volume grows, and proper security and governance are hard to enforce. If these limitations are not addressed, as much as 73% of a company's data may go unused. Delta Lake is a solution to these limitations. Let us see what exactly a Delta Lake is.
Source: https://towardsdatascience.com/delta-lake-enables-effective-caching-mechanism-and-query-optimization-in-addition-to-acid-96c216b95134
Delta Lake is an open-source, file-based storage format and management layer. It uses a transaction log, compacted into the Apache Parquet format, which enables ACID transactions (Atomicity, Consistency, Isolation, Durability), time travel, and fast metadata operations (Armbrust et al., 2020). It runs on top of existing data lakes and is compatible with Apache Spark and other processing engines (Lee, Das, & Jaiswal, 2022).
Lee, Das, & Jaiswal (2022) put forth the following features of Delta Lake:
1. ACID guarantees – All data changes written to storage are durable and become visible to readers atomically.
2. Scalable data and metadata handling – Because Delta Lake is built on a data lake, all reads and writes using Spark or other processing engines scale to the petabyte level. For metadata, Delta Lake uses Spark, which can efficiently handle the metadata of billions of files.
3. Audit history and time travel – The Delta Lake transaction log records details about every change made to the data, giving a full audit trail of the changes over time. It also lets developers access and revert to earlier versions of the data.
4. Schema enforcement and schema evolution – Delta Lake prevents the insertion of data with a mismatched schema; additionally, it allows the table schema to be evolved explicitly to accommodate dynamic data.
5. Support for deletes, updates, and merges – Delta Lake supports delete, update, and merge operations to enable complex use cases (see the sketch after this list).
6. Streaming and batch unification – A Delta Lake table can serve both as a batch table and as a streaming source and sink.
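As a brief illustration of feature 5, the following is a minimal sketch of deleting, updating, and merging with the DeltaTable API from the delta-spark Python package; the table path /delta/events, the column names id and status, and the updates_df DataFrame are assumptions for illustration, not taken from the cited sources.
from delta.tables import DeltaTable
from pyspark.sql.functions import col
# Load an existing Delta table by path (the path is an assumption)
events = DeltaTable.forPath(spark, "/delta/events")
# Delete rows matching a predicate
events.delete(col("status") == "obsolete")
# Update a column for rows matching a predicate
events.update(
    condition=col("status") == "pending",
    set={"status": "'processed'"}
)
# updates_df is an assumed DataFrame of new or changed rows with the same schema
updates_df = spark.createDataFrame([(1, "processed")], ["id", "status"])
# Merge (upsert) updates_df into the table on the id column
events.alias("t").merge(
    updates_df.alias("u"),
    "t.id = u.id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()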
Some key terms must be discussed before delving deep into Delta Lake's architecture and working pattern.
Pipeline – A data pipeline is a series of steps that move and transform data from source systems to a destination such as a data lake or data warehouse.
Parquet – Apache Parquet is an open-source, columnar file format designed for efficient storage and retrieval of large datasets.
ETL – Extract, Transform, Load: the process of extracting data from source systems, transforming it into a usable form, and loading it into a target system.
Source: https://www.databricks.com/glossary/extract-transform-load
The following images depict how Delta Lake is advantageous compared to traditional platforms.
(1) Pipeline using a separate storage system.
Source: Armbrust et al., 2020
Figure (1) exhibits an example data pipeline in which individual business intelligence teams maintain separate object storage, a message queue, and two data warehouses.
(2) Delta Lake.
Source: Armbrust et al., 2020
Figure (2) depicts Delta tables replacing these separate storage systems, using Delta's streaming I/O and performance features to run both ETL and BI. Low-cost object storage means fewer data copies, which reduces storage costs and maintenance overheads.
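To make the streaming I/O mentioned above concrete, here is a minimal sketch of reading a Delta table as a streaming source and writing the stream continuously to another Delta table; the paths and the checkpoint location are assumptions for illustration only.
# Read a Delta table as a streaming source (the path is an assumption)
stream_df = spark.readStream.format("delta").load("/delta/events")
# Continuously write the stream to another Delta table,
# tracking progress in a checkpoint location
(stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_checkpoints/events_copy")
    .start("/delta/events_copy"))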
Integrating Delta Lake within the Apache Spark ecosystem is the simplest way to get started. There are three mechanisms: using Delta Lake from a local Spark shell, using GitHub or Maven, and utilizing the Databricks Community Edition (Lee, Das, & Jaiswal, 2022). We shall look into each of these mechanisms below.
In the Apache Spark shells, the Delta Lake package is available via the --packages option. This option is suitable when working with a local version of Spark.
# Using Spark Packages with the PySpark shell
./bin/pyspark --packages io.delta:delta-core_2.12:1.0
# Using Spark Packages with spark-shell
./bin/spark-shell --packages io.delta:delta-core_2.12:1.0
# Using Spark Packages with spark-submit
./bin/spark-submit --packages io.delta:delta-core_2.12:1.0
# Using Spark Packages with spark-sql
./bin/spark-sql --packages io.delta:delta-core_2.12:1.0
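If Delta Lake is instead installed with pip (pip install delta-spark), the SparkSession can be configured directly in Python. The following is a minimal sketch based on the delta-spark package's configure_spark_with_delta_pip helper; the application name is chosen purely for illustration.
import pyspark
from delta import configure_spark_with_delta_pip
# Build a SparkSession with the Delta Lake SQL extension and catalog enabled
builder = (
    pyspark.sql.SparkSession.builder.appName("delta-quickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
# configure_spark_with_delta_pip adds the matching Delta Lake Maven package
spark = configure_spark_with_delta_pip(builder).getOrCreate()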
GitHub is an online hosting service for software development. It offers each project access control, bug tracking, feature requests, task management, and continuous integration. Maven is a build automation tool widely used for Java projects; it can also build and manage projects in C#, Ruby, Scala, and other languages. The project is hosted by the Apache Software Foundation.
Databricks is a web-based open platform that acts as a collaborative environment for interactive and scheduled data analysis. To create an account, visit https://databricks.com/try and follow the instructions to use the Community Edition for free. The Community Edition includes many tutorials and examples. One can write one's own notebooks in Python, R, Scala, or SQL, and also import other notebooks. The following image represents the Databricks page.
Source: Lee, Das, & Jaiswal, 2022
We shall discuss the basic operations for constructing Delta tables. Delta includes complete APIs for three languages often used in the big data ecosystem: Python, Scala, and SQL. It can store data in HDFS and in cloud object stores, including S3 and ADLS Gen2. It is designed to be written primarily by Spark applications and read by many open-source engines such as Spark SQL, Hive, and Presto (or Trino), as well as enterprise products like AWS Athena, Azure Synapse, and BigQuery (Lee, Das, & Jaiswal, 2022). In the basic operations, we shall focus on writing and reading a Delta table with Python. The difference between a Parquet table and a Delta table is the _delta_log folder, which is the Delta transaction log. Being the underlying infrastructure for features like ACID transactions, scalable metadata handling, and time travel, the Delta transaction log is critical to understanding Delta Lake (Lee, Das, & Jaiswal, 2022).
When a Delta table is created, files are written to some storage, such as a file system or a cloud object store. A table is simply the collection of these files stored together in a directory with a particular structure, so creating a Delta table means writing files to a storage location. For example, the following Spark code snippet uses the Spark DataFrame API to take an existing Spark DataFrame and write it in the Apache Parquet storage format to the folder location /data
dataframe.write.format("parquet").save("/data")
A simple change to the above snippet creates a Delta table instead. First, a Spark DataFrame named data is created, and then the write method is used to save the table to storage.
# Create data DataFrame
data = spark.range(0, 5)
# Write the data DataFrame to the /delta location
data.write.format("delta").save("/delta")
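After this write, the /delta directory contains the Parquet data files together with the _delta_log folder mentioned earlier. A minimal sketch of inspecting the log (the path /delta comes from the snippet above; the file name in the comment is typical for the first commit, not guaranteed):
import os
# Commits appear in the transaction log as numbered JSON files,
# e.g. 00000000000000000000.json for the first commit
print(os.listdir("/delta/_delta_log"))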
Partitioning the data is important when there is a humongous amount of data. The following code snippet does exactly this
# Write the Spark DataFrame to a Delta table partitioned by date
data.write.partitionBy("date").format("delta").save("/delta")
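Note that partitionBy("date") assumes the DataFrame actually has a date column, which the data DataFrame created from spark.range above does not. A minimal sketch that adds one first; the current_date value and the /delta_partitioned path are assumptions for illustration.
from pyspark.sql.functions import current_date
# Add a date column so partitionBy("date") has a column to partition on
data_with_date = data.withColumn("date", current_date())
# Write to a separate path to avoid clashing with the unpartitioned table above
data_with_date.write.partitionBy("date").format("delta").save("/delta_partitioned")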
If a Delta table already exists and data needs to be appended to it or overwritten, the mode method is used as shown in the code snippets.
# Append new data to your Delta table
data.write.format("delta").mode("append").save("/delta")
# Overwrite your Delta table
data.write.format("delta").mode("overwrite").save("/delta")
To read the files from the Delta table, the DataFrame API is applied. The code snippet is as follows
# Read the data DataFrame from the /delta location
spark.read.format("delta").load("/delta").show()
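Because every change is recorded in the transaction log, an earlier version of the table can also be read back (time travel). A minimal sketch, assuming version 0 exists, which here is the initial write:
# Read the table as it looked at version 0 of the transaction log
spark.read.format("delta").option("versionAsOf", 0).load("/delta").show()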
In the above, we read Delta tables directly from the file system. But how do we read a metastore-defined Delta table? To do this, the table needs to be defined within the metastore using the saveAsTable method in Python.
# Write the data DataFrame to the metastore as myTable
data.write.format("delta").saveAsTable("myTable")
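Once registered, the table can be read back by name rather than by path; a minimal sketch (the table name myTable comes from the snippet above):
# Read the metastore-defined Delta table by name
spark.table("myTable").show()
# Equivalent SQL query
spark.sql("SELECT * FROM myTable").show()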
It is noteworthy that the saveAsTable method saves the Delta table files into a location managed by the metastore. So, in the basic operations, we have covered two important operational aspects: writing and reading a Delta table.
Delta Lake is an open-source storage format and management layer. It enables a data lake to overcome challenges such as difficulty maintaining reliable data, inaccurate query results, poor performance as data volume grows, and difficulty in properly securing and governing the data. Delta Lake thus brings many advantages for storing and managing data.