Databricks is a unified analytics platform built on top of Apache Spark for large-scale data processing, streaming, and machine learning workloads. It also integrates with major cloud providers such as AWS and Azure to take advantage of the scale and performance of the cloud.
Azure Databricks provides auto-scaling and auto-termination of clusters and automatic job scheduling, along with simple job submission to the cluster.
In this blog, we will discuss the storage options readily available in Azure Databricks, compare them, and look at the different ways to interact with them.
Data in Azure Databricks can broadly be stored in three major storage types.
In this post, we are going to discuss only DBFS (Databricks File System) and Azure Blob Storage.
DBFS can be accessed in three main ways.
Files can be easily uploaded to DBFS using the file upload interface in the Databricks workspace, as shown below.
To upload a file, first click on the “Data” tab on the left (highlighted in red), then select “Upload File” and click “browse” to choose a file from the local file system. By default, files are uploaded to the “/FileStore/tables” folder (highlighted in yellow), but we can also upload to any other (or new) folder by specifying the folder name at upload time.
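Once a file has been uploaded this way, it can be read straight from a notebook with Spark. Below is a minimal sketch; the file name sample_data.csv is only a placeholder for whatever you actually uploaded.

# spark is pre-created in Databricks notebooks; the file name below is a placeholder
df = spark.read.csv("/FileStore/tables/sample_data.csv", header=True, inferSchema=True)
df.show(5)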
Downsides of the file upload interface
The DBFS command-line interface (CLI) is a good alternative that overcomes the downsides of the file upload interface. With it, we can interact with DBFS using UNIX-like commands.
databricks-cli is a Python package that allows users to connect to and interact with DBFS.
Databricks CLI configuration steps
1. Install databricks-cli using:
pip install databricks-cli
2. Configure the CLI using:
databricks configure --token
3. The above command prompts for the Databricks host (the workspace URL) and an access token. Enter them accordingly.
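For reference, the configure command saves these values in a profile file (typically ~/.databrickscfg). A minimal sketch of that file, with placeholder values, looks like this (the exact path and keys can vary with the CLI version):

[DEFAULT]
host = https://<your-workspace-url>
token = <your-personal-access-token>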
a. Listing files in DBFS
In the terminal, type:
dbfs ls
Similarly, to list the contents of a particular directory, specify the directory path (prefixed with dbfs:/) after ls. For example:
dbfs ls dbfs:/FileStore/tables
b. Making a new directory/folder
# mkdirs command
dbfs mkdirs directory_path

# For e.g.
dbfs mkdirs dbfs:/FileStore/tables/temp_dir
c. Copying files/folders from local to DBFS and vice versa
# To copy a file
dbfs cp source_file_path destination_path

# From local to DBFS
dbfs cp /home/user1/Desktop/databricks.jpg dbfs:/FileStore/tables

# From DBFS to local
dbfs cp dbfs:/FileStore/tables/databricks.jpg /home/user1
# To copy a folder
dbfs cp -r source_folder_path destination_folder_path

# From local to DBFS
dbfs cp -r /home/user1/Desktop/dummy_folder dbfs:/FileStore/tables/dummy_folder

# From DBFS to local
dbfs cp -r dbfs:/FileStore/tables/dummy_folder /home/user1/dummy_folder
d. Moving/renaming files in DBFS
# Move command
dbfs mv source_file_path destination_file_path

# Moving a file to a different folder
dbfs mv dbfs:/FileStore/tables/databricks.jpg dbfs:/FileStore/tables/temp_dir/databricks.jpg
# Renaming a file
dbfs mv dbfs:/FileStore/tables/temp_dir/databricks.jpg dbfs:/FileStore/tables/temp_dir/databricks1.jpg
e. Deleting files & folders
# rm command
dbfs rm [-r] file_or_folder_path

# deleting a file
dbfs rm dbfs:/FileStore/tables/temp_dir/databricks1.jpg

# deleting a folder
dbfs rm -r dbfs:/FileStore/tables/dummy_folder
NOTE – Commands source: https://docs.databricks.com/dev-tools/cli/dbfs-cli.html
Programmatically (specifically using Python), DBFS can be accessed and interacted with using dbutils.fs commands.
# listing the contents of a directory
dbutils.fs.ls("/FileStore")

# making a new directory
dbutils.fs.mkdirs("/FileStore/tables/temp_dir2")

# copying a file
dbutils.fs.cp("/FileStore/tables/databricks.jpg", "/FileStore/tables/temp_dir2")

# copying a folder
dbutils.fs.cp("/FileStore/tables/temp_dir", "/FileStore/tables/temp_dir2/temp_dir", recurse=True)

# moving a file
dbutils.fs.mv("/FileStore/tables/temp_dir/databricks.jpg", "/FileStore/tables/temp_dir2/databricks.jpg")

# moving a folder
dbutils.fs.mv("/FileStore/tables/temp_dir", "/FileStore/tables/temp_dir2/temp_dir", recurse=True)

# deleting a file
dbutils.fs.rm("/FileStore/tables/temp_dir2/databricks.jpg")
# deleting a folder
dbutils.fs.rm("/FileStore/tables/temp_dir2/temp_dir", recurse=True)
NOTE – Commands Source: https://docs.databricks.com/_static/notebooks/dbutils.html
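In addition to dbutils.fs, DBFS is also exposed on the driver node as a local FUSE mount at /dbfs (on most standard cluster configurations), so ordinary Python file APIs work as well. A quick sketch, assuming the databricks.jpg file copied earlier is still in /FileStore/tables:

import os

# DBFS paths map to local paths under /dbfs on the driver (assumes the FUSE mount is available)
local_path = "/dbfs/FileStore/tables/databricks.jpg"

if os.path.exists(local_path):
    print(local_path, "->", os.path.getsize(local_path), "bytes")
else:
    print(local_path, "not found")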
Data can also be stored in Azure Blob Storage, which is ideal for storing massive amounts of unstructured data.
Before storing data in Azure Blob Storage, we first need to create a storage account in the Azure portal. Within a storage account, we can have multiple containers. For demo purposes, we have already created a storage account named “dummydatastorage” (the account that appears in the SAS URLs below).
A container can be created either through the portal interface or with AzCopy (a command-line utility).
AzCopy is a command-line utility for transferring data between a local computer and a storage account. To download and install AzCopy, follow the steps here.
To provide secure, delegated access to storage account resources, Azure offers Shared Access Signatures (SAS).
A SAS for a storage account can be easily obtained from its home page, as shown below.
Click on the “Shared access signature” tab on the left (as highlighted), then check all the boxes under “Allowed resource types” (as highlighted).
Now click on the “Generate SAS and connection string” button as shown below.
Copy the SAS URL from the “Blob service SAS URL” section. This URL is required in the AzCopy commands below.
Now, let’s discuss some common operations using AzCopy.
a. Creating a container
azcopy make "<SAS_URL>"

# For e.g.
sudo azcopy make "https://dummydatastorage.blob.core.windows.net/dummycontainer?sv=2020-02-10&ss=bfqt&srt=sco&sp=rwdlacuptfx&se=2021-06-22T21:53:01Z&st=2021-06-22T13:53:01Z&spr=https&sig=7Kv6vyhGN78700hjT%2FTeHx%2BeVfIdzazaSM6LnutuROM%3D"
NOTE – Add the container name to the SAS URL after dummydatastorage.blob.core.windows.net/. Here the container name is “dummycontainer”.
b. Copying data from local to Azure Blob and vice versa
# Copying a file
azcopy copy '<local-file-path>' '<SAS_URL>'

# For e.g.
sudo azcopy copy "/home/user1/Desktop/databricks.jpg" "https://dummydatastorage.blob.core.windows.net/dummycontainer?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-10-29T16:58:42Z&st=2020-10-29T08:58:42Z&spr=https&sig=TJ2Ujv%2FHkm0x5NZkbQkcHhI4SshPKSUWsY%2BP2GkZ6kk%3D"
# Copying a folder
azcopy copy '<local-directory-path>' '<SAS_URL>' [--recursive]

# For e.g.
sudo azcopy copy "/home/user1/Desktop/dummy_folder" "https://dummydatastorage.blob.core.windows.net/dummycontainer?sv=2019-12-12&ss=bfqt&srt=sco&sp=rwdlacupx&se=2020-10-29T16:58:42Z&st=2020-10-29T08:58:42Z&spr=https&sig=TJ2Ujv%2FHkm0x5NZkbQkcHhI4SshPKSUWsY%2BP2GkZ6kk%3D" --recursive
Similarly, data can be copied from Azure Blob to the local machine by swapping the source and destination in the azcopy copy command.
NOTE – Commands Source: https://docs.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-files
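Back in a Databricks notebook, the same SAS token can also be used to read Blob data directly with Spark through the wasbs driver. The following is only a sketch of the commonly documented pattern; the container, account, token, and file path are placeholders, and the exact configuration key can vary with the driver version.

# Placeholders: replace with your own container, storage account, SAS token, and blob path
container = "dummycontainer"
account = "dummydatastorage"
sas_token = "<sas-token>"  # the query-string portion of the SAS URL (everything after the '?')

# Point Spark at the container and authenticate with the SAS token (wasbs driver)
spark.conf.set(
    f"fs.azure.sas.{container}.{account}.blob.core.windows.net",
    sas_token,
)

# Read a CSV that was uploaded to the container (e.g. with AzCopy); the path is a placeholder
df = spark.read.csv(
    f"wasbs://{container}@{account}.blob.core.windows.net/dummy_folder/some_file.csv",
    header=True,
)
df.show(5)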
In this article, we discussed the various storage options available with Azure Databricks and the commands to perform common file- and directory-level operations on them.