This article was published as a part of the Data Science Blogathon
Data science, machine learning, MLOps, and data engineering are all moving ahead at a rapid pace. The future of data science is being shaped by large firms such as Microsoft, Amazon, Databricks, and Google, which are driving much of the innovation in this field. Given such fast-paced change, it makes sense to get certified with one of these big players and learn their product offering. These platforms provide end-to-end solutions, from scalable data lakes to scalable clusters for both test and production workloads, making life easier for data professionals. From a business perspective, all the infrastructure sits under one roof, on the cloud and on demand, and more and more businesses are inclined, or rather forced, to move to the cloud due to the ongoing pandemic.
In short, businesses gather data from various sources: mobile apps, POS systems, in-house tools, machines, and so on. All of this data is housed under different departments or databases, which is especially true for large legacy firms. One of the major hurdles for data scientists is getting the relevant data under one roof to build models on and use in production. In the case of Azure, all this data moves into a data lake; data manipulation can be done using SQL pools or Spark pools; and data cleaning, preprocessing, model building on test clusters (low cost), model monitoring, model fairness, data drift detection, and deployment on highly scalable clusters (higher cost) all happen in one place. The data scientist can focus on solving problems and let Azure do the heavy lifting.
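For instance, once the data lands in the data lake and is registered as a dataset in the workspace, pulling it into a notebook takes only a couple of lines. A minimal sketch using the azureml-core SDK, assuming a workspace config file and a registered tabular dataset named 'sales-data' (a hypothetical name):
from azureml.core import Workspace, Dataset
ws = Workspace.from_config()                          # reads the config.json downloaded from the Azure portal
dataset = Dataset.get_by_name(ws, name='sales-data')  # hypothetical registered tabular dataset
df = dataset.to_pandas_dataframe()                    # pull the data into pandas for exploration and cleaning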
Another use case is model tracking using MLflow (an open-source project from Databricks). Anybody who has participated in a DS hackathon knows that tracking models, logging metrics, and comparing models is a tedious task if you haven't set up a pipeline. In Azure, all of this is made easy using Experiments: models, metrics, and artifacts can each be logged with a single line of code.
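As an illustration of how lightweight this is, here is a minimal sketch with the azureml-core SDK, assuming an existing workspace object ws and a locally saved model.pkl (both placeholders):
from azureml.core import Experiment
experiment = Experiment(workspace=ws, name='churn-experiment')  # 'churn-experiment' is a placeholder name
run = experiment.start_logging()                     # start an interactive run under the experiment
run.log('accuracy', 0.91)                            # one line to log a metric
run.upload_file('outputs/model.pkl', 'model.pkl')    # log the trained model as an artifact
run.complete()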
Azure DP-100 (Designing and Implementing a Data Science Solution on Azure) is the go-to data science certification from Microsoft for all data enthusiasts. It's a self-paced learning experience, with freedom and flexibility. After completion, one can work on Azure hassle-free and build models, track experiments, build pipelines, and tune hyperparameters the Azure way.
Requirements
DP-100 exam page
The exam costs about Rs. 4,500, and not many firms expect the certification during recruitment; it's good to have, but few recruiters demand it or are even aware of it. So the question arises: is it worth paying for? Is it worth my weekends? The answer is yes. Even if one is a machine learning grandmaster or a Python expert, the inner workings of Azure are specific to Azure, and many methods are Azure-specific ways to drive performance improvements. One cannot simply dump Python code onto the platform and expect optimal performance. Many processes are automated on Azure: for example, the AutoML module builds models with just one line of code, and hyperparameter tuning takes one line of code. No-code ML is another drag-and-drop tool that makes building models child's play. Containers, storage, key vaults, workspaces, and experiments are all Azure-specific tools and classes. Creating compute instances and working with pipelines and MLflow also helps you understand MLOps concepts. It's a definite plus if you are working on Azure and want to explore the nitty-gritty of it. Overall, the rewards exceed the effort.
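To give a flavour of how little code AutoML needs, here is a hedged sketch with the azureml-train-automl SDK; train_ds, 'label', 'aml-cluster', and the experiment name are all hypothetical placeholders:
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig
automl_config = AutoMLConfig(task='classification',
                             training_data=train_ds,           # a registered TabularDataset (placeholder)
                             label_column_name='label',         # placeholder label column
                             primary_metric='AUC_weighted',
                             compute_target='aml-cluster',       # placeholder compute cluster
                             iterations=20)
automl_run = Experiment(ws, 'automl-demo').submit(automl_config)  # Azure trains and compares the candidate models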
Skills Measured:
Clone the repo to practice the Azure labs:
git clone https://github.com/microsoftdocs/ml-basics
A Few Important Azure Methods/Classes:
from azureml.core import Workspace, Model
from azureml.pipeline.steps import PythonScriptStep, ParallelRunConfig, ParallelRunStep

## Get an existing workspace
ws = Workspace.get(name='aml-workspace',
                   subscription_id='1234567-abcde-890-fgh...',
                   resource_group='aml-resources')

## Register a model
model = Model.register(workspace=ws,
                       model_name='classification_model',
                       model_path='model.pkl',  # local path
                       description='A classification model',
                       tags={'data-format': 'CSV'},
                       model_framework=Model.Framework.SCIKITLEARN,
                       model_framework_version='0.20.3')

## Run a .py file as a pipeline step
step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train_model.py',
                         compute_target='aml-cluster')

## Define the parallel run step configuration
parallel_run_config = ParallelRunConfig(source_directory='batch_scripts',
                                        entry_script='batch_scoring_script.py',
                                        mini_batch_size='5',
                                        error_threshold=10,
                                        output_action='append_row',
                                        environment=batch_env,
                                        compute_target=aml_cluster,
                                        node_count=4)

## Create the parallel run step
parallelrun_step = ParallelRunStep(name='batch-score',
                                   parallel_run_config=parallel_run_config,
                                   inputs=[batch_data_set.as_named_input('batch_data')],
                                   output=output_dir,
                                   arguments=[],
                                   allow_reuse=True)
A Few Important Concepts (not an exhaustive list):
Azure DP-100 exam prep session
Azure Machine Learning Workspace:
Azure Databricks create a cluster:
Azure Designer:
Good luck! Your next target should be DP-203 (Data Engineering on Microsoft Azure).