Databricks simplifies and accelerates data management and data analysis in the rapidly evolving world of big data and machine learning. Developed by the creators of Apache Spark, it offers tools for data storage, processing, and visualization, all integrated with major cloud providers like AWS, Microsoft Azure, and Google Cloud Platform. This article provides a detailed Databricks tutorial for beginners and acts as a practical guide for data enthusiasts and professionals.
This article was published as a part of the Data Science Blogathon.
In simple terms, Databricks is a web-based data warehousing and machine learning platform developed by the creators of Spark. But Databricks is much more than that: it is a one-stop product for all data needs, from data storage to data analysis. It derives insights using Spark SQL, builds predictive models using SparkML, and connects to visualization tools such as Power BI, Tableau, and QlikView. Think of it as the Facebook of big data.
Amazon generates a large amount of data from operational sources like app clicks, Amazon Pay transactions, and Echo voice data. The data engineering team handles ETL to ensure data quality and reliable sourcing. Spark simplifies ETL tasks such as storing POS data in a SQL table, saving valuable time and giving stakeholders a competitive edge.
Databricks offers a solution for loan approval batch pipelines handling 100M applications, enabling faster ETL and easier decision-making. Conventional pipelines take days to evaluate input data, leading to operational delays and unhappy customers. For fintech apps, Databricks provides products and solutions that cut the 24-48 hours typically needed to load weekly operations data, preventing workflow delays and lost opportunities.
Databricks integrates with Amazon Web Services, Microsoft Azure, and Google Cloud Platform, making it easy to connect with these major cloud computing infrastructures. Major firms such as Starbucks run their data platforms on it.
Databricks can be thought of as a big toolbox for data folks. It allows data analysts, data engineers, and data scientists to work together on one platform. Here are some key things Databricks helps with:
Overall, Databricks helps businesses unlock the power of their data by making it easier to store, analyze, and collaborate on data projects.
The Databricks paid version has a 14-day trial period but must be used alongside AWS, Azure, or GCP.
Lakehouse is the term Databricks uses for a storage layer that can accommodate structured or unstructured, streaming or batch data: a single platform for storing all of it. Databricks' lakehouse storage layer is called Delta Lake. Below are a few features of Delta Lake.
Data Lake Grid

|  | Data lake | Data lakehouse | Data warehouse |
| --- | --- | --- | --- |
| Types of data | All types: structured, semi-structured, and unstructured (raw) data | All types: structured, semi-structured, and unstructured (raw) data | Structured data only |
| Cost | $ | $ | $$$ |
| Format | Open format | Open format | Closed, proprietary format |
| Scalability | Scales to hold any amount of data at low cost, regardless of type | Scales to hold any amount of data at low cost, regardless of type | Scaling up becomes exponentially more expensive due to vendor costs |
| Intended users | Limited: data scientists | Unified: data analysts, data scientists, machine learning engineers | Limited: data analysts |
| Ease of use | Difficult: exploring large amounts of raw data is hard without tools to organize and catalog it | Simple: offers the simplicity and structure of a data warehouse with the broader use cases of a data lake | Simple: the structure of a data warehouse lets users quickly access data for reporting and analytics |
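One of Delta Lake's defining features is that every write is recorded in a transaction log, which is what enables ACID transactions and "time travel" reads. Below is a minimal sketch of this, assuming a Databricks cluster where Delta Lake is available; the path /mnt/delta/demo is a placeholder, not something from this tutorial.
%python
# Write a small DataFrame as a Delta table (version 0 of the table).
df = spark.range(0, 5)
df.write.format("delta").mode("overwrite").save("/mnt/delta/demo")
# Overwriting creates a new table version rather than destroying the old data.
spark.range(5, 10).write.format("delta").mode("overwrite").save("/mnt/delta/demo")
# Read the original data back by pinning the version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/delta/demo")
v0.show()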
Data Analyst/Business Analyst: Analysis, RCAs, and visualizations are the bread and butter of analysts, so the focus should be on BI integration and Databricks SQL. Read about the Tableau visualization tool here.
Data Scientist: Data scientists have well-defined roles in larger organizations, but in smaller organizations a data scientist wears various hats and may own all three roles: analyst, data engineer, and BI visualizer. In a well-defined role, data scientists are responsible for sourcing data (a skill grossly neglected in the face of modern ML algorithms), building predictive models, managing model deployment, and monitoring data drift.
Important skills:
Data Engineer: Largely responsible for building ETLs and managing the constant flow of ever-increasing data. Process, clean, and quality check the data before pushing it to operational tables. Model deployment and platform support are other responsibilities entrusted to data engineers.
Databricks must be paired with Azure, AWS, or GCP, and its relatively higher costs lead to low adoption among small and medium startups in India.
Spark is a tool for coordinating tasks/jobs across a cluster of computers. A cluster manager, such as YARN (Yet Another Resource Negotiator), Mesos, or Spark's own standalone cluster manager, manages these clusters of machines. Spark supports Scala, Python, SQL, Java, and R. A Spark application consists of one driver and a set of executors.
The driver node is responsible for three things:
The executors are responsible for two things:
The cluster manager is responsible for the following:
Also Read: Understand The Internal Working of Apache Spark.
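To make the driver/executor split concrete, here is a minimal PySpark sketch (not from the original article): the driver builds the job and collects the final result, while executors run the partition-level tasks in parallel. On Databricks a SparkSession named spark already exists, so the builder line is only needed outside the platform.
from pyspark.sql import SparkSession

# On Databricks, a SparkSession named `spark` is created for you; elsewhere, build one explicitly.
spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

# The driver defines the computation and splits the data into 8 partitions.
rdd = spark.sparkContext.parallelize(range(1, 1001), 8)
# Executors run the map and partial reductions; the final result comes back to the driver.
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)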
The Databricks Community Edition is free to use, and it has two main roles:
The machine learning path adds a model registry and an experiment registry, where experiments can be tracked using MLflow. Databricks also provides notebooks, similar to Jupyter, that can be shared across teams, making collaboration easy.
For the notebooks to work, they have to be deployed on a cluster. Databricks provides 1 Driver: 15.3 GB Memory, 2 Cores, and 1 DBU for free.
Alternatively,
Once the analysis is complete, Databricks notebooks can be published (made publicly available), and the links remain live for 6 months.
Published Databricks notebooks can be imported from either a URL or a physical file. To import via URL, use the Import option in the workspace and paste the notebook link.
Create a new notebook and select SQL as the language. In the notebook, select Upload Data and upload the CSV.
Write the data to events002:
%python
# Read the uploaded CSV from DBFS into a Spark DataFrame, using the header row and inferring the schema.
df1 = spark.read.format("csv").load("dbfs:/FileStore/shared_uploads/[email protected]/test-3.csv", header="true", inferSchema="true")
# Write the DataFrame in Delta format, overwriting any previous contents at the path.
df1.write.format("delta").mode("overwrite").save("/mnt/delta/events002")
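To sanity-check the write, you can read the Delta path back in the same notebook; this is a quick optional check, using display, the built-in Databricks helper for rendering DataFrames.
%python
# Read the Delta table back and render it; this should show the rows from the uploaded CSV.
df_check = spark.read.format("delta").load("/mnt/delta/events002")
display(df_check)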
Create a SQL table using the below code:
DROP TABLE IF EXISTS diamonds;
CREATE TABLE diamonds USING DELTA LOCATION '/mnt/delta/events002/';
Run SQL commands to query data:
SELECT * FROM diamonds LIMIT 10
SELECT manufacturer, COUNT(*) AS freq FROM diamonds GROUP BY 1 ORDER BY 2 DESC
The output DataFrames can be visualized directly in the notebook: select the chart icon below the results table and choose the appropriate chart type. A total of 11 chart types are available.
Databricks' machine learning support is growing daily. MLlib is Spark's machine learning (ML) library, developed for running ML workloads on Spark. Below is a classification example that predicts the quality of Portuguese "Vinho Verde" wine from the wine's physicochemical properties.
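Note that the walkthrough below uses pandas, scikit-learn, and XGBoost rather than MLlib itself. For reference, a minimal MLlib (pyspark.ml) pipeline looks roughly like the sketch below; sdf and the column names are placeholders, not part of the wine example.
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assume `sdf` is a Spark DataFrame with numeric feature columns f1..f3 and a binary `label` column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(sdf)
predictions = model.transform(sdf)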
Use the link to download the data: download both winequality-red.csv and winequality-white.csv to your local machine, then upload the CSVs using the Upload Data command in the toolbar.
import pandas as pd
red_wine = pd.read_csv("/dbfs/FileStore/shared_uploads/[email protected]/winequality_red.csv")
white_wine = pd.read_csv("/dbfs/FileStore/shared_uploads/[email protected]/winequality_white.csv")
white_wine['is_red'] = 0.0
red_wine['is_red'] = 1.0
# Combine the two datasets into a single DataFrame named `data`, which the rest of the code refers to.
data = pd.concat([white_wine, red_wine], axis=0)
Plot a histogram of the Y label:
import seaborn as sns
sns.distplot(data.quality, kde=False)
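Note: sns.distplot is deprecated in recent seaborn releases; if it raises a warning or error, sns.histplot(data.quality) draws the equivalent histogram.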
Before plotting, convert the quality score into a binary label: wines with a quality score of at least 7 are treated as high quality (1) and the rest as 0. This is the label the box plots compare each feature against, and it is also the target the models below predict.
high_quality = (data.quality >= 7).astype(int)
data.quality = high_quality
Box plots to compare each feature with the Y label:
import matplotlib.pyplot as plt

dims = (3, 4)
f, axes = plt.subplots(dims[0], dims[1], figsize=(25, 15))
axis_i, axis_j = 0, 0
for col in data.columns:
    if col == 'is_red' or col == 'quality':
        continue  # Box plots cannot be used on indicator variables
    sns.boxplot(x=high_quality, y=data[col], ax=axes[axis_i, axis_j])
    axis_j += 1
    if axis_j == dims[1]:
        axis_i += 1
        axis_j = 0
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, random_state=123)
X_train = train.drop(["quality"], axis=1)
X_test = test.drop(["quality"], axis=1)
y_train = train.quality
y_test = test.quality
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
import cloudpickle
import time
# The predict method of sklearn's RandomForestClassifier returns a binary classification (0 or 1).
# The following code creates a wrapper function, SklearnModelWrapper, that uses
# the predict_proba method to return the probability that the observation belongs to each class.
class SklearnModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model

    def predict(self, context, model_input):
        return self.model.predict_proba(model_input)[:, 1]

# mlflow.start_run creates a new MLflow run to track the performance of this model.
# Within the context, you call mlflow.log_param to keep track of the parameters used, and
# mlflow.log_metric to record metrics like accuracy.
with mlflow.start_run(run_name='untuned_random_forest'):
    n_estimators = 10
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=np.random.RandomState(123))
    model.fit(X_train, y_train)

    # predict_proba returns [prob_negative, prob_positive], so slice the output with [:, 1]
    predictions_test = model.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, predictions_test)
    mlflow.log_param('n_estimators', n_estimators)
    # Use the area under the ROC curve as a metric.
    mlflow.log_metric('auc', auc_score)

    wrappedModel = SklearnModelWrapper(model)
    # Log the model with a signature that defines the schema of the model's inputs and outputs.
    # When the model is deployed, this signature will be used to validate inputs.
    signature = infer_signature(X_train, wrappedModel.predict(None, X_train))

    # MLflow contains utilities to create a conda environment used to serve models.
    # The necessary dependencies are added to a conda.yaml file which is logged along with the model.
    conda_env = _mlflow_conda_env(
        additional_conda_deps=None,
        additional_pip_deps=["cloudpickle=={}".format(cloudpickle.__version__), "scikit-learn=={}".format(sklearn.__version__)],
        additional_conda_channels=None,
    )
    mlflow.pyfunc.log_model("random_forest_model", python_model=wrappedModel, conda_env=conda_env, signature=signature)
feature_importances = pd.DataFrame(model.feature_importances_, index=X_train.columns.tolist(), columns=['importance'])
feature_importances.sort_values('importance', ascending=False)
Hyperopt is a hyperparameter tuning framework based on Bayesian optimization. Grid search is time-consuming, and random search, while better than grid search, can still miss good configurations; Hyperopt instead uses the results of earlier trials to decide which hyperparameters to try next.
from hyperopt import fmin, tpe, hp, SparkTrials, Trials, STATUS_OK
from hyperopt.pyll import scope
from math import exp
import mlflow.xgboost
import numpy as np
import xgboost as xgb
search_space = {
    'max_depth': scope.int(hp.quniform('max_depth', 4, 100, 1)),
    'learning_rate': hp.loguniform('learning_rate', -3, 0),
    'reg_alpha': hp.loguniform('reg_alpha', -5, -1),
    'reg_lambda': hp.loguniform('reg_lambda', -6, -1),
    'min_child_weight': hp.loguniform('min_child_weight', -1, 3),
    'objective': 'binary:logistic',
    'seed': 123,  # Set a seed for deterministic training
}
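Note that hp.loguniform takes its bounds in log space, so 'learning_rate' above is sampled between exp(-3) ≈ 0.05 and exp(0) = 1; the other loguniform ranges read the same way.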
def train_model(params):
    # With MLflow autologging, hyperparameters and the trained model are automatically logged to MLflow.
    mlflow.xgboost.autolog()
    with mlflow.start_run(nested=True):
        train = xgb.DMatrix(data=X_train, label=y_train)
        test = xgb.DMatrix(data=X_test, label=y_test)
        # Pass in the test set so xgb can track an evaluation metric. XGBoost terminates training when the evaluation metric
        # is no longer improving.
        booster = xgb.train(params=params, dtrain=train, num_boost_round=1000,
                            evals=[(test, "test")], early_stopping_rounds=50)
        predictions_test = booster.predict(test)
        auc_score = roc_auc_score(y_test, predictions_test)
        mlflow.log_metric('auc', auc_score)

        signature = infer_signature(X_train, booster.predict(train))
        mlflow.xgboost.log_model(booster, "model", signature=signature)

        # Set the loss to -1*auc_score so fmin maximizes the auc_score
        return {'status': STATUS_OK, 'loss': -1*auc_score, 'booster': booster.attributes()}
# Greater parallelism will lead to speedups, but a less optimal hyperparameter sweep.
# A reasonable value for parallelism is the square root of max_evals.
spark_trials = SparkTrials(parallelism=10)
# Run fmin within an MLflow run context so that each hyperparameter configuration is logged as a child run of a parent
# run called "xgboost_models".
with mlflow.start_run(run_name='xgboost_models'):
    best_params = fmin(
        fn=train_model,
        space=search_space,
        algo=tpe.suggest,
        max_evals=96,
        trials=spark_trials,
        rstate=np.random.RandomState(123)
    )
Finally, retrieve the best run from MLflow:
best_run = mlflow.search_runs(order_by=['metrics.auc DESC']).iloc[0]
print(f'AUC of Best Run: {best_run["metrics.auc"]}')
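To score with the winning model in the same notebook, it can be loaded back from the run via the generic pyfunc flavor. This is a sketch; the 'model' artifact path simply matches the name used in mlflow.xgboost.log_model above.
import mlflow.pyfunc  # already imported earlier; repeated so the snippet stands alone

# Load the best run's model and generate probabilities for the held-out set.
best_model = mlflow.pyfunc.load_model(f'runs:/{best_run.run_id}/model')
test_predictions = best_model.predict(X_test)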
Practice – Market Basket Analysis on Databricks
Use the online notebook to analyse Instacart grocery data and recommend upselling/cross-selling opportunities using market basket analysis.
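Market basket analysis itself runs natively on Spark: MLlib ships an FP-Growth implementation that mines frequent itemsets and association rules. Below is a minimal sketch; the orders DataFrame and its items column are assumptions about how the Instacart data gets loaded, not part of the linked notebook.
from pyspark.ml.fpm import FPGrowth

# Assume `orders` is a Spark DataFrame with one row per basket and an array column `items`.
fp = FPGrowth(itemsCol="items", minSupport=0.01, minConfidence=0.1)
fp_model = fp.fit(orders)
fp_model.freqItemsets.show(10)       # frequently co-purchased item combinations
fp_model.associationRules.show(10)   # "if X then Y" rules, useful for cross-selling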
Databricks has a feature that allows users to create an interactive dashboard from existing code, images, and outputs.
Databricks provides live instructor-led training and self-paced programs to help individuals better understand the platform. The self-paced course is priced at $2000. It also provides certification based on role fitment. The common career tracks are Business leader, platform admin, SQL analyst, data engineer, and data scientist.
There are four certifications, namely:
Databricks is relatively easy to learn, especially if you have some experience with related technologies. Here's a breakdown:
However, there are some factors to consider that might influence the learning curve:
This Databricks tutorial for beginners only scratches the surface of what Databricks can do. The platform is capable of a lot more than what is explored in this article, and for data enthusiasts it is quite a treasure trove, so keep practicing and keep learning. By the end of this article, you should have a solid beginner-level understanding of Databricks and what it is used for.
A. Databricks is a unified analytics platform that simplifies data management and machine learning by integrating with major cloud providers like AWS, Microsoft Azure, and Google Cloud Platform.
A. Databricks leverages Apache Spark to efficiently handle large-scale data processing, including ETL tasks, data analysis, and machine learning model development.
A. Yes, Databricks is relatively easy to learn, especially if you have experience with SQL or Python. It offers a user-friendly interface and extensive learning resources.
A. Delta Lake is a storage layer that provides reliability and scalability by supporting ACID transactions, versioning, and the ability to handle both batch and streaming data.
A. Yes, Databricks facilitates collaboration through shared notebooks and interactive dashboards, allowing teams to work together seamlessly on data projects.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.