This article was published as a part of the Data Science Blogathon
This article aims to familiarize the reader with MLflow and provide a foundation to start using it.
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Simply put, MLflow helps track hundreds of models, container environments, datasets, model parameters and hyperparameters, and reproduce them when needed. It has major business use cases, and Azure has integrated MLflow into its machine learning platform, which speaks to why MLflow is important and why it is worth the effort to understand and deploy it.
Let's consider scenario one: during hackathons on Analytics Vidhya, it's common to try out various types of data preprocessing and build models using boosted trees, GLMnets, or deep learning methods. Each model has different parameters and resulting metrics. Keeping track of all this by hand is tedious and tends to end up as a pile of files named v001.pkl to v030.pkl. With MLflow Tracking, the parameters and metrics of each run are logged and can be rendered as a pandas dataframe, and the best model can be chosen based on RMSE, RMSLE, etc. Using the run_id, the model can then be fetched and used for prediction.
Scenario two: in a large firm, where models usually mean revenue, a slight drift in the data or an unexplainable output means losses. For example, insurers collect customer/patient data from hospitals and use it to charge each customer the optimal premium based on risk factors, while keeping the bottom line (margin) in mind. Charging low premiums is a business loss, and high premiums lead to customer attrition. Data scientists and engineers are tasked with building and fixing these models. If the training/testing/deployment configs aren't logged and saved, reproducing the erroneous model becomes close to impossible. With MLflow, the config, dataset, script, and environment (conda.yml) are logged, and the run_id/experiment_id is used for deployment in a scoring script, so the model can be reproduced in a matter of minutes.
This article focuses on scenario one, which should help budding data scientists and engineers alike start using MLflow in their data science pipelines. MLflow has four components: Tracking, Models, Model Registry, and Projects. This article covers MLflow Tracking for hackathon use cases.
MLflow sessions from the creators of MLflow on YouTube.
pip install mlflow
Each execution of the model code is called a run, and runs are grouped into experiments. The run_name attribute can be used to identify particular runs, for example xgboost-exp or catboost-exp. The call below instructs MLflow to create a folder with a new run_id, along with its sub-folders. The mlruns folder is discussed in a later section.
with mlflow.start_run(run_name=r_name) as run:
Any number of metrics, parameters, and images can be logged using the code below. A tracking folder is assigned to each run.
mlflow.log_param("l1_ratio", l1_ratio)
mlflow.log_metric("rmse", rmse)
mlflow.sklearn.log_model(lr, "model", registered_model_name="ElasticnetWineModel01")
mlflow.log_artifact(temp_name, "rmse_estimators_plots")
Once the runs have executed, one or more metrics can be used to filter the runs and choose the top model. For example, the filter metrics.rmse < 60 returns the runs whose RMSE is below 60. MAE, MSE, or any other logged metric can be used in the same way for regression.
df = mlflow.search_runs(filter_string="metrics.rmse < 60")
df.head()
The top model, in this case, is chosen using the R2 score: the row with the best R2 is looked up in the dataframe to obtain its run_id. The load_model method then uses that run_id to retrieve the model from the saved mlruns folder.
run_id = df.loc[df['metrics.r2'].idxmax()]['run_id']
## "catboost-reg-model" is the artifact path the model was logged under; it can also be passed in as a parameter
model = mlflow.sklearn.load_model("runs:/" + run_id + "/catboost-reg-model")
model.get_params()
Once the model is retrieved, it can be used for prediction.
columns_to_keep = ["Manufacturer", "Model", "Prod. year", "Category", "Leather interior", "Fuel type",
                   "Cylinders", "Gear box type", "Drive wheels", "Doors", "Wheel", "Airbags"]
cat_features = ["Manufacturer", "Model", "Prod. year", "Category", "Leather interior", "Fuel type",
                "Gear box type", "Drive wheels", "Doors", "Wheel"]
# copy the slice so the dtype conversion below works on its own frame, not on a view of test
features_to_predict_df = test[columns_to_keep].copy()
features_to_predict_df[cat_features] = features_to_predict_df[cat_features].astype(str)
y_pred_log = model.predict(features_to_predict_df)
features_to_predict_df.head()
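The name y_pred_log suggests the predictions are on a log scale. Assuming the model was trained on log(y + 1), as in the training function shown later in this article, the predictions would need to be mapped back to the original scale before use. A minimal sketch of that inverse step, with made-up prediction values:

```python
import math

# Hypothetical model outputs on the log scale, i.e. predictions of log(y + 1).
y_pred_log = [9.2, 10.1, 8.7]

# expm1(x) computes exp(x) - 1, the inverse of log1p(x) = log(x + 1).
y_pred = [math.expm1(value) for value in y_pred_log]

for log_value, value in zip(y_pred_log, y_pred):
    print(f"log scale: {log_value:.2f} -> original scale: {value:.2f}")
```

With numpy arrays, as returned by model.predict, np.expm1(y_pred_log) does the same thing in one call.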
MLflow also provides a UI to track experiments, served at http://localhost:5000/. Comparing the models that meet the minimum threshold and validating each of them is easy in this UI. The UI can also be used to do the following:
# Once the runs are complete, use this command.
!mlflow ui
The problem at hand is a regression problem. Train data and test data can be downloaded from the attached links. The function below starts an MLflow run, splits the data into train and test sets, trains the model, logs the parameters and metrics, and returns the experiment id and run id.
import mlflow
import mlflow.sklearn
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split

def model_run__log_mlfow(self, df, var_dict, other_dict={}):
    '''
    self       : rf regressor model
    df         : dataframe
    var_dict   : model variables dict - var_dict["independant"], var_dict["dependant"]
    other_dict : other dict if needed, set to {} default
    '''
    # .get() keeps the {} default usable; MLflow picks a run name if none is given
    r_name = other_dict.get("run_name")
    with mlflow.start_run(run_name=r_name) as run:
        # get current run and experiment id
        runID = run.info.run_id
        experimentID = run.info.experiment_id

        feature = var_dict["independant"]
        label = var_dict["dependant"]

        ## log-transform the target, log(y + 1); note this overwrites the label column of df
        df[label] = np.log(df[label] + 1)
        X = df[feature]
        y = df[label]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
        self._rfr.fit(X_train, y_train)
        y_pred = self._rfr.predict(X_test)

        ## self.model is a getter for the model
        mlflow.sklearn.log_model(self.model, "catboost-reg-model")
        mlflow.log_params(self.params)

        model_score = self._rfr.score(X_test, y_test)
        mae = metrics.mean_absolute_error(y_test, y_pred)
        mse = metrics.mean_squared_error(y_test, y_pred)
        rmse = np.sqrt(mse)
        r2 = metrics.r2_score(y_test, y_pred)

        # Log metrics
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("mse", mse)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)

        print("-" * 100)
        print("Inside MLflow Run with run_id {} and experiment_id {}".format(runID, experimentID))
        print('Mean Absolute Error :', mae)
        print('Mean Squared Error :', mse)
        print('Root Mean Squared Error:', rmse)
        print('R2 :', r2)
    return (experimentID, runID)
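To make the logged metrics concrete, here is a small dependency-free sketch of the four formulas, assuming the standard definitions that scikit-learn's mean_absolute_error, mean_squared_error, and r2_score implement. The function name regression_metrics and the sample values are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R2 for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n   # mean absolute error
    mse = sum(e * e for e in errors) / n    # mean squared error
    rmse = math.sqrt(mse)                   # root mean squared error

    # R2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 7.5, 9.0]

results = regression_metrics(y_true, y_pred)
print(results)
```

Lower is better for MAE, MSE, and RMSE, while higher is better for R2, which matters when picking the best run from the results dataframe. A dictionary like this one could also be logged in a single call with mlflow.log_metrics instead of four separate log_metric calls.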
Here is a brief look at what goes on under the hood. When MLflow is used with its local file storage, it creates a folder named mlruns, which acts as the repository for the project's runs.
channels:
- defaults
- conda-forge
dependencies:
- python=3.8.5
- pip
- pip:
  - mlflow
  - scikit-learn==0.23.2
  - cloudpickle==1.6.0
name: mlflow-env
Equipped with the basics of MLflow Tracking, it's time to put this useful model lifecycle management tool to work, whether in hackathons or on the job.
Good luck! Here is my Linkedin profile in case you want to connect with me. I’ll be happy to be connected with you.