Databricks is a cloud-based analytics tool that can be used for analyzing and processing massive amounts of big data. Databricks is available on the Microsoft Azure cloud and uses Apache Spark for computation. It allows users to combine their data, ELT processes, and machine learning in an efficient manner. Databricks runs on a parallel distributed system, which means the workload is automatically split across multiple processors, giving it high scalability. This, in turn, reduces processing time and cost.
This Spark-based environment is very easy to use. It supports the most commonly used programming languages, such as Python, R, and SQL, which interact with Spark through APIs. As a result, data processing and computation become easy tasks.
Its computing power can be extended further by connecting it to an external database like MongoDB. In this way, we can process massive amounts of data in a short span of time.
MongoDB is an open-source document database built on a horizontal scale-out architecture. It was founded in 2007 by Dwight Merriman, Eliot Horowitz, and Kevin Ryan in NYC. Instead of storing data in tables of rows and columns like SQL databases, MongoDB stores each record as a document described in a JSON-like format.
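To illustrate the document model, a hypothetical song record might look like the sketch below (shown as a Python dict, which is how a driver such as PyMongo would return it; all field names here are made up for illustration):

song = {
    "_id": "5f5a1d2e9b1e8a3c4d5e6f70",  # unique document id (hypothetical)
    "title": "Imagine",                 # fields can vary from document to document
    "artist": "John Lennon",
    "year": 1971,
    "tags": ["rock", "classic"]         # arrays and nested objects are allowed
}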
Features of MongoDB
• MongoDB belongs to the NoSQL databases category.
• MongoDB provides a way to store millions of records efficiently.
Features of MongoDB Atlas
• MongoDB Atlas is a specialized, hosted version of MongoDB that provides easy cluster formation and easy deployments.
• Atlas provides strong authentication and encryption features that ensure data protection.
In order to connect Databricks with MongoDB, one can make use of packages available from Maven. Some tutorials are already available for connecting Databricks with MongoDB through the Scala driver, but none of them give a clear picture of connecting MongoDB Atlas and Databricks through the Python API.
Let’s have a look at the prerequisites required for establishing a connection between MongoDB Atlas and Databricks.
Enter the MongoDB Connector for Spark package value into the Coordinates field based on your Databricks Runtime version:
Eg: For Databricks Runtime 7.6 (includes Apache Spark 3.0.1, Scala 2.12), select org.mongodb.spark:mongo-spark-connector_2.12:3.0.1.
Take extra care when searching the packages to find the one that supports your Spark and Scala versions.
Install the library from the Libraries tab and restart the cluster.
We can get the IP address by launching the Web Terminal from the Apps tab in the Databricks cluster.
Type ifconfig -a in the shell to get the IP address.
1. Create an account in MongoDB Atlas by giving a username and password.
2. Create an Atlas free-tier cluster and click on the Connect button.
3. Open MongoDB Compass and connect to the database through the connection string (don’t forget to replace <password> in the string with your password).
4. In MongoDB Compass, create a new database to save your data by clicking on the CREATE DATABASE button.
5. Import your document as a collection by clicking on the Import Data button.
NOTE: To explore and manipulate your MongoDB data easily, install MongoDB Compass by clicking on the “I do not have MongoDB Compass” button. Copy the connection string to connect to the MongoDB Atlas cluster from MongoDB Compass.
The connection string looks like this:
mongodb+srv://<user>:<password>@<cluster-name>-wlcof.azure.mongodb.net/test?retryWrites=true
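If your password contains special characters, it must be percent-encoded before being placed in the string. A minimal sketch in Python (the username and password below are hypothetical):

from urllib.parse import quote_plus

user = "myUser"                      # hypothetical username
password = quote_plus("p@ss/word!")  # percent-encode special characters
connectionString = f"mongodb+srv://{user}:{password}@<cluster-name>-wlcof.azure.mongodb.net/test?retryWrites=true"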
1. Connection with Databricks
Enable Databricks clusters to connect to the Atlas cluster by adding the external IP addresses of the Databricks cluster nodes to the whitelist in Atlas.
For that, open Network Access in MongoDB Atlas and add the Databricks cluster IP address there.
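Once the IP address is whitelisted, you can optionally verify connectivity from a Databricks notebook with PyMongo. This is a sketch assuming pymongo[srv] has been pip-installed on the cluster; replace the placeholders with your own credentials:

# Optional connectivity check with PyMongo
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster-name>-wlcof.azure.mongodb.net/test?retryWrites=true")
print(client.list_database_names())  # lists your Atlas databases if the whitelist works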
2. Configure Databricks Cluster with MongoDB Connection URI
METHOD 1
Add the following keys to the cluster’s Spark config, each followed by your connection string:
spark.mongodb.output.uri <connection-string>
spark.mongodb.input.uri <connection-string>
Then read the data through a Python notebook.
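With the URIs set at the cluster level, a read from the notebook can be as simple as the sketch below (the database and collection names are the ones used later in this article; adjust them to your own):

# Reading from MongoDB; the input URI comes from the cluster's Spark config
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("database", "cloud") \
    .option("collection", "millionsongs") \
    .load()
df.printSchema()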
METHOD 2 (preferred)
Configure the settings directly in the Python notebook through the code below:
from pyspark.sql import SparkSession

database = "cloud"           # your database name
collection = "millionsongs"  # your collection name
connectionString = "mongodb+srv://user:<password>@cluster0.9rvsi.mongodb.net/<database>?retryWrites=true&w=majority"  # copy your connection string here

spark = SparkSession \
    .builder \
    .config('spark.mongodb.input.uri', connectionString) \
    .config('spark.mongodb.output.uri', connectionString) \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
    .getOrCreate()

# Reading from MongoDB
df = spark.read \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", connectionString) \
    .option("database", database) \
    .option("collection", collection) \
    .load()
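Since spark.mongodb.output.uri is also configured, the same session can write results back to Atlas. A minimal sketch follows; the target collection name millionsongs_copy is hypothetical:

# Writing the DataFrame back to MongoDB (hypothetical target collection)
df.write \
    .format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", connectionString) \
    .option("database", database) \
    .option("collection", "millionsongs_copy") \
    .mode("append") \
    .save()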