Building an ETL Data Pipeline Using Azure Data Factory

Chaitanya | Last Updated: 11 Jul, 2022
14 min read

This article was published as a part of the Data Science Blogathon.

Introduction

ETL is the process that extracts the data from various data sources, transforms the collected data, and loads that data into a common data repository. It helps organizations across the globe in planning marketing strategies and making critical business decisions. Azure Data Factory (ADF) is a cloud-based ETL and data integration service provided by Azure. We can build complex ETL processes and scheduled event-driven workflows using Azure Data Factory.

In this guide, I’ll show you how to build an ETL data pipeline to convert a CSV file into JSON File with Hierarchy and array using Data flow in Azure Data Factory.

ELT Data Factory

Source: https://github.com/mspnp/azure-data-factory-sqldw-elt-pipeline

What is ETL?

ETL is the process that collects data from various sources, then transforms the data into a usable and trusted resource using various operations like filtering, concatenation, sorting, etc., and loads that data into a destination store. ETL, which stands for extract, transform, and load, describes the end-to-end process by which an organization can prepare data for business intelligence processes.

ETL is commonly used for data integration, data warehousing, cloud migration, machine learning, and artificial intelligence. ETL provides a mechanism to move the data from various data sources into a common data repository. It helps companies in planning marketing strategies or making critical business decisions by analyzing their business data.

An ETL pipeline or data pipeline is the set of processes used to move data from various sources into a common data repository such as a data warehouse. Data pipelines are a set of tools and activities that ingest raw data from various sources and move the data into a destination store for analysis and storage.

ETL Process

Source: https://databricks.com/glossary/extract-transform-load

Now, we will learn how the ETL process works.

How does ETL Work?

Below are the three steps of the ETL process:

1. Extract: The first step is to extract the raw data from various data sources and place it in the staging area. Data coming from different sources such as CRM systems, transactional processing systems, APIs, and sensors can be structured, semi-structured, or unstructured. Any transformations required on the data are performed in the staging area so that the performance of the source system is not degraded. Extraction can be further divided into three types:

a. Partial Extraction

b. Partial Extraction (with update notification)

c. Full Extraction

2. Transform: The data extracted in the previous step is raw, and we need to clean, map, and transform it into a format that can be used by different applications. Data inconsistencies and inaccuracies are removed during transformation, and data mapping and data standardization are performed. Data standardization refers to converting all data types to the same format. Various functions and rules are applied to prevent bad or inconsistent data from entering the destination repository.

3. Load: Loading data into the destination repository is the last step of the ETL process. For better performance, the load process should be optimized. Data is delivered and shared securely with the required users and departments so that it can be used for effective decision-making and storytelling. A minimal code sketch of these three steps follows the figure below.

ETL Process Explained

Source: https://www.informatica.com/hk/resources/articles/what-is-etl.html
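To make the three steps concrete, here is a minimal, self-contained Python sketch of an ETL job: it extracts rows from a hypothetical sales.csv file, transforms them (dropping incomplete rows and standardizing the date format), and loads the result into a JSON file. The file and column names are illustrative assumptions, not part of the tutorial's scenario.

```python
import csv
import json
from datetime import datetime

def extract(path):
    # Extract: read raw rows from a CSV source into the "staging area" (a list in memory).
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: drop rows with a missing amount and standardize formats.
    cleaned = []
    for row in rows:
        if not row.get("amount"):
            continue  # remove inconsistent/incomplete records
        cleaned.append({
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
            # Standardize e.g. "11/07/2022" (DD/MM/YYYY) to the ISO format "2022-07-11".
            "order_date": datetime.strptime(row["order_date"], "%d/%m/%Y").date().isoformat(),
        })
    return cleaned

def load(records, path):
    # Load: write the cleaned records to the destination store (a JSON file here).
    with open(path, "w") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("sales.csv")), "sales.json")
```

In a real pipeline each of these functions would talk to a different system (a source database, a staging area, a warehouse), but the extract-transform-load structure stays the same.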

What is ADF?

In today’s world, a huge amount of data is generated every day, and a large portion of it is raw and unorganized. We cannot use raw data directly to provide meaningful insights to analysts, ML engineers, data scientists, or business decision-makers. Therefore, a mechanism is required to convert raw data into actionable business insights.

Azure Data Factory (ADF) is a cloud-based ETL and data integration service provided by Azure. We can build complex ETL processes and scheduled event-driven workflows using Azure Data Factory.

For example, imagine an e-commerce company that collects petabytes of product purchase logs produced by selling its products in the cloud. The company wants to analyze these logs to gain insights into customer preferences, customer retention, the active customer base, and so on.

The logs are stored inside a container in Azure Blob Storage. To analyze them, the company needs reference data such as marketing campaign information, customer location, and purchase details, which are stored in an Azure SQL Database. The company wants to extract and combine data from both sources to gain useful insights. In such scenarios, Azure Data Factory is well suited for the ETL work. Additionally, the company can publish the transformed data to data stores such as Azure Synapse Analytics, so that business intelligence (BI) applications such as Power BI can consume it for storytelling and dashboarding.

ADF has built-in connectors for retrieving data from various data sources such as HDFS, Azure Services, Teradata, Amazon Redshift, Google BigQuery, ServiceNow, Salesforce, etc. ADF makes the process of creating a data pipeline easy by providing data transformations with code-free data flows, CI/CD support, built-in connectors for data ingestion, orchestration, and pipeline monitoring support.

Now, we have got a good understanding of ETL and Azure Data Factory. Let’s move forward to see top-level concepts in ADF.

Azure Data Factory

Source: https://azure.microsoft.com/en-in/services/data-factory/#features

Top-level Concepts in ADF

Below are top-level concepts in Azure Data Factory:

1. Dataset: Datasets identify data within different data stores and are used as inputs and outputs in activities. For example, an Azure SQL Table dataset specifies the SQL table in your Azure SQL Database from which an activity reads data or into which it inserts data. You must create a Linked Service to link your data store to the service before creating a dataset.

2. Linked Services: Linked Services define the connection information the service needs to connect to data sources. For example, an Azure SQL Database linked service specifies a connection string to connect to the Azure SQL Database, while the Azure SQL Table dataset specifies the SQL table that contains the data. Linked Services represent either a data store or a compute resource in ADF.

3. Activity: Activity is the task that is performed on the data. There are three types of activities in ADF: data movement activities, data transformation activities, and control activities. For example, you can use a copy activity to copy data from Azure Blob Storage to Azure SQL.

4. Pipeline: A pipeline is a logical grouping of activities that together perform a unit of work. A data factory may have one or more pipelines. The activities in a pipeline specify the task to be performed on the data, and users can validate, publish, and monitor pipelines. For example, you can create a pipeline that gets triggered when a new blob arrives in an Azure Blob Storage container to copy data from Azure Blob Storage to Azure SQL. (A short Python SDK sketch tying these first four concepts together follows this list.)

Concepts in ADF

Source: https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities?tabs=data-factory

5. Mapping data flows: Mapping Data Flows provides a way to perform data transformation in the data flow designer without writing any code. The mapping data flow is executed as an activity within the ADF pipeline.

6. Integration runtimes: Integration runtime provides the computing environment where the activity either runs on or gets dispatched from.

7. Triggers: Triggers determine when a pipeline execution is kicked off. A pipeline can be triggered manually or by the occurrence of an event, such as the arrival of a new blob in an Azure Blob Storage container.

8. Pipeline runs: An instance of the pipeline execution is known as a Pipeline run.

9. Parameters: Parameters are defined in the pipeline and their values are consumed by activities inside the pipeline.

10. Variables: Variables are used inside pipelines to store values temporarily.

11. Control Flow: Control flow allows for building iterative, sequential, and conditional logic within the pipeline.

Source: https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities?tabs=data-factory 
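To show how these concepts fit together in code, here is a hedged sketch using the azure-identity and azure-mgmt-datafactory Python packages, loosely modeled on Microsoft's Data Factory Python quickstart. It creates a Blob Storage linked service, input and output datasets, and a pipeline with a single Copy activity, then triggers a run. All names and the connection string are placeholders, and the exact model signatures can vary between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureStorageLinkedService, BlobSink, BlobSource,
    CopyActivity, DatasetReference, DatasetResource, LinkedServiceReference,
    LinkedServiceResource, PipelineResource,
)

# Hypothetical names -- replace with your own subscription, resource group, and factory.
sub_id, rg, df = "<subscription-id>", "my-resource-group", "my-data-factory"
adf = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# 1. Linked Service: connection information for the storage account.
adf.linked_services.create_or_update(
    rg, df, "BlobStorageLS",
    LinkedServiceResource(properties=AzureStorageLinkedService(
        connection_string="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))

# 2. Datasets: identify the input and output data inside the linked store.
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="BlobStorageLS")
adf.datasets.create_or_update(rg, df, "InputDS", DatasetResource(
    properties=AzureBlobDataset(linked_service_name=ls_ref,
                                folder_path="employee", file_name="Emp1.csv")))
adf.datasets.create_or_update(rg, df, "OutputDS", DatasetResource(
    properties=AzureBlobDataset(linked_service_name=ls_ref, folder_path="employeejson")))

# 3. Activity: a Copy activity that moves data from the input dataset to the output dataset.
copy = CopyActivity(
    name="CopyEmployees",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputDS")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputDS")],
    source=BlobSource(), sink=BlobSink())

# 4. Pipeline: a logical grouping of activities, followed by a manual pipeline run.
adf.pipelines.create_or_update(rg, df, "CopyPipeline", PipelineResource(activities=[copy]))
run = adf.pipelines.create_run(rg, df, "CopyPipeline", parameters={})
print("Pipeline run id:", run.run_id)
```

The same relationships hold whether you define these objects through the SDK, through JSON, or through the ADF Studio UI used later in this article.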

Benefits of using ADF

1. Code-free data transformation: ADF provides mapping data flow to be executed as an activity in a pipeline. Mapping Data Flows provides a way to perform data transformation in the data flow designer. Thus, data transformation can be easily performed without writing any code.

2. Easy Migration of ETL Workloads to Cloud: ADF can be used to easily migrate workloads from one data source to another.

3. Consumption-based pricing: Data Factory provides a pay-as-you-go pricing model and no upfront cost is required from the customer end.

4. Better scalability and performance: ADF provides built-in parallelism and time-slicing features. Therefore, users can easily migrate a large amount of data to the cloud in a few hours. Thus, the performance of the data pipeline is improved.

5. Security: Users can protect their data store credentials in ADF pipelines either by storing them in Azure Key Vault or by encrypting them with certificates managed by Microsoft (see the sketch below).
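As an illustration of the Key Vault option above, a linked service can reference a secret instead of embedding the credential directly. The Python dict below approximates the JSON shape of such a linked service definition; the linked service and secret names are hypothetical, and the exact schema may vary slightly between service versions.

```python
# Rough shape of an Azure Blob Storage linked service whose connection string
# is pulled from Azure Key Vault at runtime (all names are hypothetical).
blob_linked_service = {
    "name": "BlobStorageLinkedService",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "MyKeyVaultLinkedService",  # a Key Vault linked service
                    "type": "LinkedServiceReference",
                },
                "secretName": "blob-connection-string",
            }
        },
    },
}
```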

What are the activities and types of activities in ADF?

A pipeline is a logical grouping of activities that together perform a unit of work. Activity is the task that is performed on the data. For example, you can use a copy activity to copy data from Azure Blob Storage to Azure SQL.

Below are the three types of activities present inside ADF:

1. Data movement activities: Data movement activities are used to move data from one data source to another. For example, a Copy activity is used to copy the data from a source location to a destination.

2. Data transformation activities: Data transformation activities are used to perform data transformation in ADF. Below are some important data transformation activities:

a. Data Flow: Data flow activity is used to perform data transformation via mapping data flows.

b. Stored Procedure: Stored Procedure activity invokes a SQL Server Stored Procedure in a pipeline.

c. Azure Functions: This activity runs an Azure Function from a data pipeline. Azure Functions is a serverless compute service that lets users run event-triggered code.

d. Databricks Notebook: This activity runs a Databricks notebook in your Azure Databricks workspace from a pipeline.

e. Spark: Spark activity executes a Spark program in the HDInsight cluster.

3. Control activities: Control flow activities are used to build iterative, sequential, or conditional logic in a pipeline. ADF offers various control activities such as Until, ForEach, If Condition, Execute Pipeline, and Lookup. (A sketch of an If Condition activity definition follows the source link below.)

Source: https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities?tabs=data-factory
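To give a feel for how a control activity is expressed, the Python dict below approximates the pipeline JSON for an If Condition activity that runs a Copy activity only when a pipeline parameter has a certain value. All names, the expression, and the copy source/sink types are illustrative assumptions, not taken from this tutorial's pipeline.

```python
# Approximate pipeline JSON for a control activity: run a Copy activity
# only when the 'loadMode' parameter is set to 'full' (illustrative names).
pipeline_definition = {
    "name": "ConditionalLoadPipeline",
    "properties": {
        "parameters": {"loadMode": {"type": "String", "defaultValue": "full"}},
        "activities": [
            {
                "name": "CheckLoadMode",
                "type": "IfCondition",
                "typeProperties": {
                    "expression": {
                        "value": "@equals(pipeline().parameters.loadMode, 'full')",
                        "type": "Expression",
                    },
                    "ifTrueActivities": [
                        {
                            "name": "CopyEmployees",
                            "type": "Copy",
                            "inputs": [{"referenceName": "InputDS", "type": "DatasetReference"}],
                            "outputs": [{"referenceName": "OutputDS", "type": "DatasetReference"}],
                            "typeProperties": {
                                "source": {"type": "DelimitedTextSource"},
                                "sink": {"type": "JsonSink"},
                            },
                        }
                    ],
                },
            }
        ],
    },
}
```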

Building ETL Data Pipeline for a Scenario

Now, you are familiar with various types of ADF activities and top-level ADF concepts. Let’s see how we can build an ETL data pipeline to convert a CSV file into JSON File with Hierarchy and array using Data flow in Azure Data Factory.

We are going to use containers in an Azure Storage Account to store the CSV and JSON files as blobs. To build the ETL data pipeline for this scenario, you will need to create an Azure Storage Account and an Azure Data Factory. Follow the steps below to build the ETL data pipeline:

Create Azure Storage Account

An Azure Storage Account provides highly available and secure storage for a variety of unstructured data such as text, images, binary data, and log files. Create an Azure account and sign in to it. Then, follow the steps below to create an Azure Storage Account:

Step 1: Visit the Azure home page. Click on Create a resource (+).

Step 2: Select Storage Account in Popular Azure Services -> Create.

Step 3: On the Basics page, select your subscription name, create a resource group, provide the storage account name, select the performance, redundancy, and region, and click Next.

Step 4: On the Advanced page, configure the hierarchical namespace, Blob Storage, security, and Azure Files settings as per your requirements and click Next. For frequently accessed data, choose the Hot access tier.

Step 5: On the Networking page, enable public access from all networks, configure network routing and click Next.

Step 6: Click on Review + Create. The storage account's home page is displayed when it is created successfully. Now, select Data storage -> Containers.

Step 7: Click on + Container. Provide employee as the new container name and select Container as the public access level. Click Create.

Step 8: To create a blob, launch Excel, copy the following data, and save it in a file named Emp1.csv on your system:

FirstName,LastName,Department,Salary,Skill1,Skill2,Skill3
Rahul,Patel,Sales,90000,C,HTML,CSS
Chaitanya,Shah,R&D,95000,C#,SQL,Azure
Ashna,Jain,HR,93000,Python,Debugging,
Mansi,Garg,Sales,81000,Java,Hibernate,
Vipul,Gupta,HR,84000,ADF,Synapse,Azure

Step 9: Upload the Emp1.csv CSV file to the employee container inside the storage account.

Step 10: Click on + Container. Provide employeejson as the new container name and select Container as the public access level. Click Create.

Now, we have successfully uploaded the CSV file to blob storage and created the output container. (A scripted alternative to Steps 7-10 is sketched below.)
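If you prefer scripting over the portal, a rough equivalent of Steps 7-10 using the azure-storage-blob Python package might look like the sketch below; the connection string is a placeholder and error handling is kept minimal.

```python
from azure.core.exceptions import ResourceExistsError
from azure.storage.blob import BlobServiceClient

# Placeholder connection string -- copy the real value from the storage account's "Access keys" blade.
conn_str = "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>;EndpointSuffix=core.windows.net"
service = BlobServiceClient.from_connection_string(conn_str)

# Create the input and output containers (Steps 7 and 10).
for name in ("employee", "employeejson"):
    try:
        service.create_container(name)
    except ResourceExistsError:
        pass  # container already exists

# Upload Emp1.csv to the employee container (Step 9).
with open("Emp1.csv", "rb") as data:
    service.get_blob_client(container="employee", blob="Emp1.csv").upload_blob(data, overwrite=True)
```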

Create a Data Factory in Azure

Azure Data Factory (ADF) is a cloud-based ETL and data integration service provided by Azure. We can build complex ETL processes and scheduled event-driven workflows using Azure Data Factory.

Create a data factory using the steps below (a scripted alternative using the Python SDK follows the steps):

Step 1: Visit the Azure home page. Click on Create a resource (+).

Step 2: In the Azure Marketplace, search for Data Factory. Select Create -> Data Factory.

 

Step 3: On the Basics page, select your subscription name, select the existing resource group, provide the data factory name, select the region and data factory version, and click Next.

Step 4: On the Git configuration page, choose whether to configure Git now or later and click Next.

Step 5: On the Networking page, fill in the required options and click Next.

Step 6: Click on Review + Create. Once the data factory is created successfully, the data factory home page is displayed. Open Azure Data Factory Studio in a new tab.
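As with the storage account, the data factory itself can also be created programmatically. The sketch below uses the azure-mgmt-datafactory package to achieve roughly the same result as Steps 1-6; the subscription ID, resource group, factory name, and region are placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Hypothetical names -- use your own subscription, resource group, factory name, and region.
sub_id, rg, df_name = "<subscription-id>", "my-resource-group", "my-data-factory"
adf = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Create (or update) the data factory in an existing resource group.
factory = adf.factories.create_or_update(rg, df_name, Factory(location="eastus"))
print(factory.name, factory.provisioning_state)
```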

Create an ETL data pipeline to convert CSV file into JSON File with Hierarchy and array using Data flow in ADF

We have already created the data factory above. Now, follow the steps below inside Azure Data Factory Studio to create the ETL pipeline:

Step 1: Click New-> Pipeline. Rename the pipeline to ConvertPipeline from the General tab in the Properties section.

Step 2: After this, click Data flows-> New data flow. Inside data flow, click Add Source. Rename the source to CSV.

 

Step 3: In the Source settings tab, set the source type to Dataset, and next to Dataset click + New to create the source dataset.

Search for Azure Blob Storage.

Select Continue, choose DelimitedText as the data format, and click Continue.

Provide a name, select the First row as header checkbox, and click + New to create a new Linked Service.

In the New linked service pane, provide the service name, select the authentication type, your Azure subscription, and the storage account name. Click Create.

After the linked service is created, you are redirected to the Set properties page. Now, select the Emp1.csv path in the File path.

Click OK.

Step 4: Now, click +. Select Derived Column in Schema modifier.

In the Derived column page, select Create New -> Column. 

Provide info as the column name. Inside info, select + -> Add subcolumn. Write fName as the column name and, in the Expression, select FirstName from the Expression values.

Inside info, select + -> Add subcolumn again. Write lName as the column name and, in the Expression, select LastName from the Expression values.

Now, again select Create New -> Column.

Provide skills as the column name and, in the Expression, select array() from the Expression elements, providing Skill1, Skill2, and Skill3 from the Expression values as the parameters to array(). Click Save and finish. (An equivalent plain-Python sketch of this transformation follows.)
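The mapping data flow does all of this without code, but it can help to see the equivalent logic written out. The plain-Python sketch below reads Emp1.csv and builds, for each employee, an info object with fName/lName subfields and a skills array, which is the hierarchical shape the derived columns produce. This is only an illustration of the logic and of the expected output, not how ADF executes the data flow; the output column selection in the next step determines exactly which fields end up in the JSON file.

```python
import csv
import json

records = []
with open("Emp1.csv", newline="") as f:
    for row in csv.DictReader(f):
        records.append({
            "FirstName": row["FirstName"],
            "LastName": row["LastName"],
            "Department": row["Department"],
            "Salary": row["Salary"],
            # Equivalent of the 'info' derived column with fName/lName subcolumns.
            "info": {"fName": row["FirstName"], "lName": row["LastName"]},
            # Equivalent of the 'skills' derived column: array(Skill1, Skill2, Skill3),
            # dropping empty entries here for rows that have fewer than three skills.
            "skills": [s for s in (row["Skill1"], row["Skill2"], row["Skill3"]) if s],
        })

with open("employees.json", "w") as f:
    json.dump(records, f, indent=2)
```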

 

Step 5: Now, click +. Select Sink in Destination.

In the Sink tab, write outputfile as the output stream name.

Select derivedColumn1 as the incoming stream. Set the sink type to Dataset and, in Dataset, click + New to create the sink dataset.

Search for Azure Blob Storage.

Select Continue, choose JSON as the data format, and click Continue.

Write OutputJsonFile as the name, choose the Linked Service created in Step 3 as the Linked service, and select the employeejson container path in the File path.

Click OK.

Provide the file name in the Settings tab and select the input and output columns in the Mapping tab.

Step 6: In the ConvertPipeline pipeline, search for the Data Flow activity in the Activities pane and drag it onto the designer surface. Inside the Data Flow activity, select the data flow created above.

Step 7: Validate the pipeline by clicking Validate All. After validation, publish the pipeline by clicking Publish All.

Step 8: Run the pipeline manually by clicking Trigger now.

Step 9: Verify that ConvertPipeline ran successfully by visiting the Monitor section in Azure Data Factory Studio. (A programmatic way to trigger and monitor the run is sketched after these steps.)

Step 10: Confirm that the uploaded CSV file has been successfully converted into a JSON file with hierarchy and array by viewing the file contents inside the employeejson container.
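As a programmatic alternative to Trigger now and the Monitor section, a pipeline run can also be started and polled with the azure-mgmt-datafactory SDK, as in the hedged sketch below; the subscription, resource group, and factory names are placeholders, and the exact status strings may vary by service version.

```python
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

sub_id, rg, df = "<subscription-id>", "my-resource-group", "my-data-factory"
adf = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Equivalent of "Trigger now": start a run of ConvertPipeline manually.
run = adf.pipelines.create_run(rg, df, "ConvertPipeline", parameters={})

# Equivalent of the Monitor section: poll the run until it finishes.
while True:
    status = adf.pipeline_runs.get(rg, df, run.run_id)
    print("Pipeline run status:", status.status)
    if status.status not in ("Queued", "InProgress"):
        break
    time.sleep(15)
```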

Conclusion

We have seen how to create an ETL data pipeline to convert a CSV file into a JSON file using Azure Data Factory. We saw that ETL pipelines built using Azure Data Factory are cost-effective, secure, and easily scalable, and we learned how to build an ETL pipeline for a real-world scenario. The following were the major takeaways:

1. What ETL is and how organizations can use it for effective decision-making.

2. How different types of data can be extracted using different extraction methods.

3. How Azure Data Factory can be used to ingest data from various sources.

4. How pipelines, activities, linked services, datasets, etc. are used in Azure Data Factory.

5. How data transformations can be performed easily using data flows.

6. How to validate and publish a pipeline.

7. How pipelines and data flows are connected to each other.

8. How control flow can be used, and the various types of activities available in Azure Data Factory.

9. How to execute an ETL data pipeline that converts a CSV file into a JSON file with hierarchy and array.

10. How to create containers in an Azure Storage Account, Linked Services in Azure Data Factory, and more.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 

Full-stack developer with 2.6 years of experience in developing applications using Azure, SQL Server, and Power BI. I love to read and write blogs. I am always willing to learn new technologies, adapt easily to change, and bring good time-management, problem-solving, logical, and communication skills.
