This article was published as a part of the Data Science Blogathon.
Azure Data Factory (ADF) is a cloud-based ETL (Extract, Transform, Load) and data integration service that lets you create data-driven workflows. These workflows orchestrate and automate data movement and data transformation. In this article, I'll show you how to create a blob storage account, a SQL database, and a data factory in Azure, and then build a pipeline that copies data from Blob Storage to SQL Database using the copy activity.
An Azure storage account provides highly available, massively scalable, and secure cloud storage for a variety of data objects such as blobs, files, queues, and tables. A storage account holds containers, which in turn store blobs. Azure Blob Storage is designed for massive amounts of unstructured data such as text, images, binary data, and log files.
To create blob storage, you first need an Azure account. After signing in, follow the steps below:
Step 1: On the Azure home page, click on Create a resource.
Step 2: Select Storage Account -> Create
Step 3: On the Basics page, select the subscription, create or select an existing resource group, provide the storage account name, select the region, performance, and redundancy, and click Next. I selected locally-redundant storage (LRS) to save costs.
Step 4: On the Advanced page, configure the security, blob storage, and Azure Files settings as per your requirements and click Next. I chose the hot access tier since I will access my data frequently.
Step 5: On the Networking page, configure network connectivity and network routing, and click Next.
Step 6: Click on Review + Create. After the storage account is created successfully, its home page is displayed. Now, select Data storage-> Containers.
Step 7: Click on + Container. Enter employee as the container name and select Container as the public access level. Click Create.
Step 8: To create the content for the blob, launch Excel or a text editor, copy the following text, and save it on your machine in a file named Emp.csv.
FirstName,LastName,Department,Salary
Rahul,Patel,Sales,90000
Chaitanya,Shah,R&D,95000
Ashna,Jain,HR,93000
Mansi,Garg,Sales,81000
Vipul,Gupta,HR,84000
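If you prefer to generate the file programmatically instead of through Excel, this short Python sketch writes the same rows to Emp.csv using only the standard library:

```python
import csv

# The same employee rows as above, written as comma-separated values so the
# default DelimitedText settings in Azure Data Factory can parse the file.
rows = [
    ["FirstName", "LastName", "Department", "Salary"],
    ["Rahul", "Patel", "Sales", "90000"],
    ["Chaitanya", "Shah", "R&D", "95000"],
    ["Ashna", "Jain", "HR", "93000"],
    ["Mansi", "Garg", "Sales", "81000"],
    ["Vipul", "Gupta", "HR", "84000"],
]

with open("Emp.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```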
Step 9: Upload the Emp.csv file to the employee container.
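The portal upload in Step 9 can also be scripted. Below is a minimal sketch using the azure-storage-blob package (pip install azure-storage-blob); the connection string is a placeholder that you would copy from the storage account's Access keys blade:

```python
# Sketch: upload Emp.csv to the employee container from Python.
# The connection string below is a placeholder -- substitute your own.

def upload_employee_csv(connection_string: str, path: str = "Emp.csv") -> None:
    # Imported here so the sketch can be read without the package installed.
    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(container="employee", blob="Emp.csv")
    with open(path, "rb") as data:
        blob.upload_blob(data, overwrite=True)

if __name__ == "__main__":
    upload_employee_csv("<your-storage-connection-string>")
```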
Now we have successfully uploaded data to blob storage. Next, we will create an Azure SQL database.
Azure SQL Database is a massively scalable PaaS database engine that provides high availability, scalability, backup, and security. It delivers good performance across different service tiers, compute sizes, and resource types, and offers advanced monitoring and troubleshooting features for real-time performance insights. Azure SQL Database provides three deployment models:
1. Single database: This is the simplest deployment method. A single database is created and managed on a logical SQL server in Azure. Each database is isolated from the others and has its own guaranteed amount of memory, storage, and compute resources.
2. Elastic pool: An elastic pool is a collection of single databases that share a set of resources. This deployment model is cost-efficient because you can create new databases in the pool, or move existing single databases into it, to maximize resource usage.
3. Managed instance: A managed instance is a fully managed database instance with near-complete compatibility with on-premises SQL Server, which makes it easy to migrate on-premises SQL databases.
Follow the below steps to create Azure SQL database:
Step 1: On the Azure home page, click on Create a resource.
Step 2: Select SQL Database -> Create
Step 3: On the Basics page, select the subscription, create or select an existing resource group, provide a database name, create or select an existing server, choose whether to use an elastic pool, configure compute + storage details, select the redundancy, and click Next. I selected locally-redundant backup storage to save costs.
Step 4: On the Networking page, configure network connectivity, connection policy, encrypted connections and click Next.
Step 5: Click on Review + Create. After the Azure SQL database is created successfully, its home page is displayed. Now, select Query editor (preview) and sign in to your SQL server by providing the username and password.
Step 6: Paste the below SQL query in the query editor to create the table Employee.
CREATE TABLE dbo.Employee
(
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50),
    Department varchar(50),
    Salary int
)
GO

CREATE CLUSTERED INDEX IX_emp_ID ON dbo.Employee (ID);
Note: Ensure that the Allow Azure services and resources to access this server option is turned on for your SQL server.
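As an alternative to the Query editor, the same table can be created from Python with the pyodbc package. This is only a sketch, assuming ODBC Driver 18 for SQL Server is installed and the placeholder server, database, and credential values are replaced with your own:

```python
# Placeholder connection details -- substitute your own values.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=tcp:<your-server>.database.windows.net,1433;"
    "DATABASE=<your-database>;UID=<user>;PWD=<password>"
)

# Same DDL as in the Query editor step, minus the GO batch separator,
# which is a client-tool construct rather than T-SQL.
CREATE_EMPLOYEE = """
CREATE TABLE dbo.Employee (
    ID int IDENTITY(1,1) NOT NULL,
    FirstName varchar(50),
    LastName varchar(50),
    Department varchar(50),
    Salary int
);
CREATE CLUSTERED INDEX IX_emp_ID ON dbo.Employee (ID);
"""

def create_employee_table(conn_str: str = CONN_STR) -> None:
    import pyodbc  # deferred so the sketch reads without the driver installed
    with pyodbc.connect(conn_str) as conn:
        conn.execute(CREATE_EMPLOYEE)
        conn.commit()
```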
Now we have successfully created the Employee table in the Azure SQL database. Next, we will create the Azure data factory.
Azure Data Factory (ADF) is a cloud-based ETL (Extract, Transform, Load) and data integration service. As a fully managed, serverless offering, it is both cost-efficient and scalable.
Follow the below steps to create a data factory:
Step 1: On the Azure home page, click on Create a resource.
Step 2: Search for Data Factory in the Marketplace. Select Data Factory -> Create.
Step 3: On the Basics page, select the subscription, create or select an existing resource group, provide the data factory name, select the region and data factory version and click Next.
Step 4: On the Git configuration page, either choose Configure Git later or enter the details of your Git repository, and click Next.
Step 5: On the Networking page, configure the managed virtual network and self-hosted integration runtime options according to your requirements and click Next.
Step 6: Click on Review + Create. After the data factory is created successfully, its home page is displayed. Click Open on the Open Azure Data Factory Studio tile.
Follow the below steps to build a pipeline that copies data from Blob Storage to SQL Database:
Step 1: In Azure Data Factory Studio, click New -> Pipeline. Rename the pipeline from the Properties section.
Step 2: In the Activities toolbox, search for Copy data activity and drag it to the pipeline designer surface. Rename it to CopyFromBlobToSQL.
Step 3: In the Source tab, select +New to create the source dataset. Search for Azure Blob Storage.
Select Continue-> Data Format DelimitedText -> Continue
Enter a name for the dataset, select the First row as header checkbox, and click +New to create a new linked service.
In the new linked service, provide a service name, then select the authentication type, your Azure subscription, and the storage account name. Click Create.
After the linked service is created, you are taken back to the Set properties page. Now, browse to Emp.csv in the File path.
Click OK. In the Source tab, make sure SourceBlobStorage is selected. To preview the data, select the Preview data option.
Step 4: In the Sink tab, select +New to create a sink dataset. Search for Azure SQL Database.
Select Continue. Enter a name for the dataset and click +New to create a new linked service.
In the new linked service, provide a service name, then select your Azure subscription, server name, database name, authentication type, and authentication details. Click Create.
After the linked service is created, you are taken back to the Set properties page. Now, select dbo.Employee as the table name. Click OK.
Step 5: Validate the Pipeline by clicking on Validate All. After validation is successful, click Publish All to publish the pipeline.
Step 6: Run the pipeline manually by clicking Add trigger -> Trigger now.
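Manual triggering can also be scripted. Here is a sketch using the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory, and pipeline names are all placeholders for your own values:

```python
# Sketch: trigger an ADF pipeline run from Python.
def trigger_pipeline(subscription_id: str, resource_group: str,
                     factory_name: str, pipeline_name: str) -> str:
    # Deferred imports so the sketch reads without the SDKs installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)
    run = client.pipelines.create_run(resource_group, factory_name, pipeline_name)
    return run.run_id  # look this id up later in the Monitor section
```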
Step 7: Verify that the pipeline ran successfully by visiting the Monitor section of Azure Data Factory Studio.
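Beyond the Monitor view, you can confirm the copy by counting the rows that landed in dbo.Employee. A sketch with pyodbc; the server, database, and credential arguments are placeholders:

```python
# Sketch: count the rows copied into dbo.Employee.
def copied_row_count(server: str, database: str, user: str, password: str) -> int:
    import pyodbc  # deferred so the sketch reads without the driver installed
    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        f"SERVER=tcp:{server}.database.windows.net,1433;"
        f"DATABASE={database};UID={user};PWD={password}"
    )
    with pyodbc.connect(conn_str) as conn:
        return conn.execute("SELECT COUNT(*) FROM dbo.Employee").fetchone()[0]

# With the five-row Emp.csv above, a successful run should return 5.
```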
In this article, we learned how to build a pipeline that copies data from Azure Blob Storage to Azure SQL Database using Azure Data Factory. Along the way, we created Azure blob storage, an Azure SQL database, and a data factory, uploaded a file to a blob, and created a table in SQL Database. Most importantly, we saw how to copy blob data to SQL using the copy activity. Hopefully, you now have a good understanding of building such a pipeline. Please let me know your queries in the comments section below.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.