Prefect and CometML For Bulldozer Sales Price Prediction

Dhrubaraj Roy Last Updated : 31 Mar, 2024


Introduction

If you are a beginner just starting to learn MLOps, you might have a question: What is MLOps?

In simple words, MLOps (Machine Learning Operations) is a set of practices for collaboration and communication between data scientists and operations professionals. Applying these practices improves quality, simplifies management, and automates the deployment of Machine Learning and Deep Learning models in large-scale production environments. It also makes it easier to align models with business needs and regulatory requirements. In this article, we will implement our project using Prefect and CometML.

In this MLOps project, we will build the best possible Machine Learning model using optimal hyperparameters to predict the sales price of a Bulldozer. As you may know, a Bulldozer is a powerful vehicle for shallow digging and ditching.

Learning Objectives

Learn MLOps concepts and end-to-end ML workflow.
Implement MLOps pipeline with Prefect and CometML.
Make reproducible, automated ML workflows.
Evaluate and monitor ML models.
End-to-end MLOps experience.

This article was published as a part of the Data Science Blogathon.

What is Prefect and CometML?
- Prefect
- CometML
The MLOps project: Let’s get started.
- Data Exploration
- Set Up a Virtual Environment
Configure CometML and Prefect
- Set up a Comet account
- Set Up Prefect
5 steps to implement this MLOps project using Prefect and CometML
Frequently Asked Questions

What is Prefect and CometML?

Prefect

Prefect is an open-source Python library that helps you define, schedule, and manage data workflows. It simplifies orchestrating and automating complex workflows, so that tasks such as data extraction, transformation, and model training run in a systematic and repeatable way.

pip install prefect

Another thing I should mention is Prefect Cloud. Prefect Cloud is a cloud-based platform provided by Prefect for managing, orchestrating, and monitoring data workflows in MLOps.

CometML

CometML is a platform for managing and tracking machine learning experiments in MLOps. It provides tools for versioning, collaboration, and visualizing results. It helps streamline the development and monitoring of machine-learning models.

pip install comet_ml

The MLOps project: Let’s get started.

Data Exploration

As we build an end-to-end machine learning model, we will focus more on the ML life cycle than model building.

If you look at the dataset, you will see there are 53 columns. We will use the other 52 columns as input features (X), and since our target variable is SalePrice, that column will be y. In the data exploration part, we conducted all kinds of explorations, from df.info() to plotting missing values using a scatter plot. You will find all the steps in my notebook on the GitHub repository. You can also download the dataset from there. Now, let’s start working on the project.

Set Up a Virtual Environment

What is a Virtual Environment, and why do we need it?

A virtual environment is a self-contained Python workspace for isolating project dependencies.
Over time you install many libraries on your computer for different projects. You might have installed Python 3.11, but another project may need Python 3.9 or different library versions. To avoid such conflicts, you set up a separate virtual environment for each project.

Creating a Virtual Environment

For Windows:

python -m venv myenv
# then activate it
myenv\Scripts\activate

For Linux/macOS:

python3 -m venv myenv
# then activate it
source myenv/bin/activate
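
After activation, you can confirm that Python really runs inside the environment by comparing sys.prefix with sys.base_prefix. The snippet below is only a minimal sketch; the helper name in_virtualenv is my own and not part of any library.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtualenv())
    print("Interpreter prefix:", sys.prefix)
```

If this prints False after activation, the shell is probably still using the system-wide interpreter.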

File Structure

Configure CometML and Prefect

To configure CometML, you need to create a file named .comet.config in your project directory and define its configuration parameters. Here is an example of how you can structure a basic .comet.config file:

[comet]
api_key = your_api_key
workspace = your_workspace
project_name = your_project_name
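
Before running anything, it can help to check that the file has the expected section and keys. The sketch below uses only Python's standard configparser module on an in-memory copy of the example above. It does not call CometML at all, and the helper name missing_comet_keys is hypothetical.

```python
import configparser

# Keys that the example .comet.config above defines.
REQUIRED_KEYS = ("api_key", "workspace", "project_name")

# In-memory copy of the example file, so the sketch runs without touching disk.
SAMPLE_CONFIG = """
[comet]
api_key = your_api_key
workspace = your_workspace
project_name = your_project_name
"""


def missing_comet_keys(config_text: str) -> list:
    """Return the required keys that are absent or empty in the [comet] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section("comet"):
        return list(REQUIRED_KEYS)
    return [
        key
        for key in REQUIRED_KEYS
        if not parser.get("comet", key, fallback="").strip()
    ]


if __name__ == "__main__":
    print("Missing keys:", missing_comet_keys(SAMPLE_CONFIG))
```

To check a real file, read its text with open(".comet.config").read() and pass it to the same function.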

You need to sign up for Comet to get an api_key, workspace, and project_name. Let’s take a look at how to set up a Comet account.

Set up a Comet account

Please create a new account. It’s easy and free.

API key

Once your account is created, click your avatar in the top right corner, then select Account Settings.

To get the API key, click the API Keys tab. Your current API key is displayed there. Click Copy to copy the API key.

You can see your workspace name and project name in the Workspaces Tab.

So now let’s configure Prefect.

Set Up Prefect

Prefect provides a cloud platform and API for managing and monitoring workflows. By signing up, we can use Prefect Cloud, which has a dashboard for tracking workflows, setting notifications, analyzing logs, and more. The interesting part is that we can also deploy our machine learning workflow there.

Step 1: Install Prefect

pip install -U prefect

See the install guide for more details.

Step 2: Connect to Prefect’s API

Prefect’s functionality relies on a backend API that manages the execution of workflows and data pipelines, so we need to connect our Prefect installation to it. This unlocks useful features: a central dashboard for watching workflow runs, notifications when tasks fail, log analysis, task history, and scaling workloads across a cluster. We can build workflows locally without the API, but we can’t make them operational or ready for production. Prefect Cloud handles scheduling and retries and follows the limits set through the API. Using Prefect with its API service therefore gives us a managed platform for complex workflows without needing to host our own coordinators.

Create a new account or sign in at
Use the prefect cloud login CLI command to

Choose Log in with a web browser and click the Authorize button in the open browser window.

Self-hosted Prefect server instance

You can also run this on your local machine. See the tutorial for help. Note that you must host your own server and run your flows on your own infrastructure.

Step 3: Turn your function into a Prefect flow

See the flow.py file, where I added the @flow decorator. This is the fastest way to get started with Prefect. A flow represents a workflow, and you can think of it as a graph of dependent steps, much like a Directed Acyclic Graph (DAG). A task is a fundamental unit of work within that workflow. We will discuss tasks in more detail later in this tutorial.
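
To make the idea of dependent steps concrete, the four steps of this project can be written down as a small dependency graph and sorted with Python's standard graphlib module (Python 3.9+). This is only a conceptual illustration and not how Prefect is configured; the PIPELINE table and the execution_order helper are my own names, while the step names are the task functions used later in this article.

```python
from graphlib import TopologicalSorter

# Each key is a step; its value is the set of steps that must finish first.
PIPELINE = {
    "ingest_df": set(),
    "clean_df": {"ingest_df"},
    "train_model": {"clean_df"},
    # Evaluation needs the trained model and the test split from cleaning.
    "evaluate_model": {"train_model", "clean_df"},
}


def execution_order(graph: dict) -> list:
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(PIPELINE)))
```

Because the graph is acyclic, a valid order always exists; a circular dependency would make graphlib raise a CycleError instead.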

5 steps to implement this MLOps project using Prefect and CometML

Here are the 5 steps to implement the MLOps project using Prefect and CometML:

Step 1 – Ingest data

In this step, we ingest our data from our data folder. Let’s have a look at our ingest_data.py file inside the steps folder

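import logging
from datetime import timedelta

from comet_ml import Experiment
import pandas as pd
from prefect import task
from prefect.tasks import task_input_hash

# Create a CometML experiment (settings are read from .comet.config)
experiment = Experiment()

class IngestData: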
    """Ingests data from a CSV file."""

    def __init__(self, data_path: str):
        self.data_path = data_path

    def get_data(self):
        logging.info(f"Ingest data from {self.data_path}")
        return pd.read_csv(self.data_path)

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))
def ingest_df(data_path: str) -> pd.DataFrame:
    """
    Ingest data from the specified path and return a DataFrame.

    Args:
        data_path (str): The path to the data file.

    Returns:
        pd.DataFrame: A pandas DataFrame containing the ingested data.
    """
    try:
        ingest_obj = IngestData(data_path)
        df = ingest_obj.get_data()
        print(f"Ingesting data from {data_path}")
        experiment.log_metric("data_ingestion_status", 1)
        return df
    except Exception as e:
        logging.error(f"Error while ingesting data: {e}")
        raise e
    finally:
        # Ensure that the experiment is ended to log all data
        experiment.end()

In Prefect, a task is a fundamental unit of work in a workflow. It represents an individual computation unit or an operation that needs to be performed. So, in this case, our first task is to ingest the data.

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))

This Prefect task decorator specifies caching parameters: task_input_hash is used as the cache key function, and the cached result expires after one hour. You can learn more about this in the Prefect documentation.
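
To see what caching by input hash with an expiration means, here is a small standard-library sketch. It is a conceptual illustration only and not Prefect's implementation; input_hash, ExpiringCache, and the load function are hypothetical names invented for this example.

```python
import hashlib
import json
import time
from typing import Any, Callable, Dict, Tuple


def input_hash(func_name: str, kwargs: Dict[str, Any]) -> str:
    """Build a cache key from the function name and its keyword arguments."""
    payload = json.dumps({"fn": func_name, "kwargs": kwargs}, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExpiringCache:
    """Reuse a result while the same inputs were seen less than ttl_seconds ago."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def call(self, func: Callable[..., Any], **kwargs: Any) -> Any:
        key = input_hash(func.__name__, kwargs)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl_seconds:
            return hit[1]  # fresh cache hit: skip the real work
        result = func(**kwargs)
        self._store[key] = (now, result)
        return result


calls = []


def load(data_path: str) -> str:
    """Stand-in for an expensive step such as reading a large CSV file."""
    calls.append(data_path)
    return f"rows from {data_path}"


cache = ExpiringCache(ttl_seconds=3600)  # one hour, as in the decorator above
first = cache.call(load, data_path="data/example.csv")
second = cache.call(load, data_path="data/example.csv")  # served from the cache
print(first == second, "real calls:", len(calls))
```

The second call returns the stored result without running load again, which is the behaviour we want for an unchanged data path within the hour.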

Step 2 – Clean data

In this step, we will clean our data, and the code below will return X_train, X_test, y_train, and y_test for training and testing our ML model. Let’s have a look:

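import logging
from datetime import timedelta
from typing import Annotated, Tuple

from comet_ml import Experiment
import pandas as pd
from prefect import task
from prefect.tasks import task_input_hash

# DataCleaning, DataPreprocessStrategy and DataDivideStrategy are defined
# in the project's model folder (shown below).

# Create a CometML experiment
experiment = Experiment()

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))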
def clean_df(data: pd.DataFrame) -> Tuple[
    Annotated[pd.DataFrame, 'X_train'],
    Annotated[pd.DataFrame, 'X_test'],
    Annotated[pd.Series, 'y_train'],
    Annotated[pd.Series, 'y_test'],
]:
    """
    Data cleaning class which preprocesses the data and divides it into train and test data.

    Args:
        data: pd.DataFrame
    """
    try:
        preprocess_strategy = DataPreprocessStrategy()
        data_cleaning = DataCleaning(data, preprocess_strategy)
        preprocessed_data = data_cleaning.handle_data()

        divide_strategy = DataDivideStrategy()
        data_cleaning = DataCleaning(preprocessed_data, divide_strategy)
        X_train, X_test, y_train, y_test = data_cleaning.handle_data()
        logging.info("Data cleaning complete")
        experiment.log_metric("data_cleaning_status", 1)
        return X_train, X_test, y_train, y_test 
    except Exception as e: 
        logging.error(e)
        raise e
    finally:
        # Ensure that the experiment is ended to log all data
        experiment.end()

At this point, if you have read the above code carefully, you might be wondering where DataPreprocessStrategy() and DataDivideStrategy() are defined. We define these classes inside the model folder; let’s have a look:

class DataPreprocessStrategy(DataStrategy):
    """
    Data preprocessing strategy which preprocesses the data.
    """

    def handle_data(self, data: pd.DataFrame) -> pd.DataFrame:
        """
        Performs transformations on df and returns the transformed df.
        """
        try:
            # Convert 'saledate' column to datetime and derive date features
            data['saledate'] = pd.to_datetime(data['saledate'])
            data["saleYear"] = data.saledate.dt.year
            data["saleMonth"] = data.saledate.dt.month
            data["saleDay"] = data.saledate.dt.day
            data["saleDayOfWeek"] = data.saledate.dt.dayofweek
            data["saleDayOfYear"] = data.saledate.dt.dayofyear

            data.drop("saledate", axis=1, inplace=True)

            for label, content in data.items():
                if pd.api.types.is_numeric_dtype(content):
                    if pd.isnull(content).sum():
                        # Add a binary column which tells us if the data was missing
                        # or not
                        data[label+"is_missing"] = pd.isnull(content)
                        # Fill missing numeric values with median
                        data[label] = content.fillna(content.median())
                else:
                    # Flag missing categorical data and turn categories into numbers
                    data[label+"is_missing"] = pd.isnull(content)
                    # We add +1 to the category code because pandas encodes
                    # missing categories as -1
                    data[label] = pd.Categorical(content).codes+1

            return data
        except Exception as e:
            logging.error("Error in Data handling: {}".format(e))
            raise e

In my GitHub repository, you can find all methods.
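
The classes above follow the strategy pattern: DataCleaning receives the data together with a strategy object and delegates the work to its handle_data method. The sketch below shows that structure with plain Python lists instead of pandas so that it runs on its own. It assumes how DataStrategy and DataCleaning fit together based on how they are used above; DropIncompleteRows, SplitFeaturesAndTarget, and the sample rows are hypothetical and not taken from the repository.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Row = Dict[str, Any]


class DataStrategy(ABC):
    """Common interface: every strategy implements handle_data."""

    @abstractmethod
    def handle_data(self, data: List[Row]) -> Any:
        ...


class DropIncompleteRows(DataStrategy):
    """Hypothetical preprocessing strategy: drop rows with missing values."""

    def handle_data(self, data: List[Row]) -> List[Row]:
        return [row for row in data if all(v is not None for v in row.values())]


class SplitFeaturesAndTarget(DataStrategy):
    """Hypothetical divide strategy: separate the target column from the features."""

    def __init__(self, target: str):
        self.target = target

    def handle_data(self, data: List[Row]):
        X = [{k: v for k, v in row.items() if k != self.target} for row in data]
        y = [row[self.target] for row in data]
        return X, y


class DataCleaning:
    """Holds the data and delegates the work to whichever strategy it was given."""

    def __init__(self, data: List[Row], strategy: DataStrategy):
        self.data = data
        self.strategy = strategy

    def handle_data(self) -> Any:
        return self.strategy.handle_data(self.data)


# Made-up sample rows, for illustration only.
rows = [
    {"YearMade": 2004, "SalePrice": 66000},
    {"YearMade": None, "SalePrice": 57000},
    {"YearMade": 1996, "SalePrice": 10000},
]
cleaned = DataCleaning(rows, DropIncompleteRows()).handle_data()
X, y = DataCleaning(cleaned, SplitFeaturesAndTarget("SalePrice")).handle_data()
print(X, y)
```

The benefit of this design is that clean_df never changes when a new preprocessing or splitting rule is added; only a new strategy class is needed.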

Step 3 – Train model

We will train a random forest regressor (RandomForestRegressor) using the scikit-learn library.

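import logging
import pickle
from datetime import timedelta

from comet_ml import Experiment
import pandas as pd
from prefect import task
from prefect.tasks import task_input_hash
from sklearn.base import RegressorMixin
from sklearn.ensemble import RandomForestRegressor

# ModelNameConfig is the project's own configuration class.

# Create a CometML experiment
experiment = Experiment()

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))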
def train_model(
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
    config: ModelNameConfig = ModelNameConfig(),
) -> RegressorMixin:
    """
    Train a regression model based on the specified configuration.

    Args:
        X_train (pd.DataFrame): Training data features.
        X_test (pd.DataFrame): Testing data features.
        y_train (pd.Series): Training data target.
        y_test (pd.Series): Testing data target.
        config (ModelNameConfig): Model configuration.

    Returns:
        RegressorMixin: Trained regression model.
    """
    try:
        model = None
        if config.model_name == "random_forest_regressor":
            model = RandomForestRegressor(
                n_estimators=40,
                min_samples_leaf=1,
                min_samples_split=14,
                max_features=0.5,
                n_jobs=-1,
                max_samples=None,
                random_state=42,
            )
            trained_model = model.fit(X_train, y_train)
            # Save the trained model to a file
            model_filename = "trained_model.pkl"
            with open(model_filename, 'wb') as model_file:
                pickle.dump(trained_model, model_file)
            print("train model finished")
            experiment.log_metric("model_training_status", 1)
            return trained_model
        else:
            raise ValueError("Model name not supported")
    except Exception as e:
        logging.error(f"Error in train model: {e}")
        raise e
    finally:
        # Ensure that the experiment is ended to log all data
        experiment.end()
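
The training task reads config.model_name from a ModelNameConfig object that is not shown in this article. As a rough sketch of what such a configuration could look like, the snippet below uses a dataclass and a lookup table. Both are assumptions on my part: the real ModelNameConfig in the repository may be defined differently, and SUPPORTED_MODELS and resolve_hyperparameters are hypothetical names. The hyperparameter values are the ones from the code above.

```python
from dataclasses import dataclass
from typing import Any, Dict

# Hyperparameters copied from the training code above.
SUPPORTED_MODELS: Dict[str, Dict[str, Any]] = {
    "random_forest_regressor": {
        "n_estimators": 40,
        "min_samples_leaf": 1,
        "min_samples_split": 14,
        "max_features": 0.5,
        "n_jobs": -1,
        "max_samples": None,
        "random_state": 42,
    },
}


@dataclass(frozen=True)
class ModelNameConfig:
    """Hypothetical stand-in for the project's configuration class."""

    model_name: str = "random_forest_regressor"


def resolve_hyperparameters(config: ModelNameConfig) -> Dict[str, Any]:
    """Return a copy of the hyperparameters for the configured model name."""
    try:
        return dict(SUPPORTED_MODELS[config.model_name])
    except KeyError:
        # Same error message as the training task uses for unknown names.
        raise ValueError("Model name not supported") from None


if __name__ == "__main__":
    print(resolve_hyperparameters(ModelNameConfig()))
```

With a table like this, supporting another model means adding one entry rather than another if branch in train_model.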

Step 4 – Evaluate model

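import logging
from datetime import timedelta
from typing import Annotated, Tuple

from comet_ml import Experiment
import pandas as pd
from prefect import task
from prefect.tasks import task_input_hash
from sklearn.base import RegressorMixin

# MSE, R2Score and RMSE are the project's own metric classes.

# Create a CometML experiment
experiment = Experiment()

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(hours=1))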
def evaluate_model(
    model: RegressorMixin, X_test: pd.DataFrame, y_test: pd.Series
) -> Tuple[Annotated[float, "r2"], 
           Annotated[float, "rmse"],
]:
    """
    Args:
        model: RegressorMixin
        X_test: pd.DataFrame
        y_test: pd.Series
    Returns:
        r2_score: float
        rmse: float
    """
    try:
        prediction = model.predict(X_test)

        # Using the MSE class for mean squared error calculation
        mse_class = MSE()
        mse = mse_class.calculate_score(y_test, prediction)
        experiment.log_metric("MSE", mse)
        # Using the R2Score class for R2 score calculation
        r2_class = R2Score()
        r2 = r2_class.calculate_score(y_test, prediction)
        experiment.log_metric("R2Score", r2)
        # Using the RMSE class for root mean squared error calculation
        rmse_class = RMSE()
        rmse = rmse_class.calculate_score(y_test, prediction)
        experiment.log_metric("RMSE", rmse)
        # Log the evaluation status to CometML
        experiment.log_metric("model_evaluation_status", 1)
        print("Evaluate model finished")

        return r2, rmse
    except Exception as e:
        logging.error(f"Error in evaluation: {e}")
        raise e
    finally:
        # Ensure that the experiment is ended to log all data
        experiment.end()

We have logged the metrics R2 score, MSE, and RMSE, as you can see in the above code. We can visualize these metrics on the CometML dashboard, but they only appear there once the flow has run. We discuss that in the next step.
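
The MSE, RMSE, and R2Score classes used above are defined in the repository and not shown here. The sketch below shows one way such classes could be written in plain Python, so the three formulas are visible. The Evaluation base class and these implementations are my assumptions rather than the author's code; only the class names and the calculate_score(y_true, y_pred) call shape come from the task above.

```python
import math
from abc import ABC, abstractmethod
from typing import Sequence


class Evaluation(ABC):
    """Hypothetical base class; the project's own base class may differ."""

    @abstractmethod
    def calculate_score(self, y_true: Sequence[float], y_pred: Sequence[float]) -> float:
        ...


class MSE(Evaluation):
    """Mean squared error: the average of the squared differences."""

    def calculate_score(self, y_true, y_pred):
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


class RMSE(Evaluation):
    """Root mean squared error: the square root of the MSE."""

    def calculate_score(self, y_true, y_pred):
        return math.sqrt(MSE().calculate_score(y_true, y_pred))


class R2Score(Evaluation):
    """R2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""

    def calculate_score(self, y_true, y_pred):
        mean_true = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean_true) ** 2 for t in y_true)
        return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values, for illustration only.
    y_true = [3.0, 5.0, 7.0]
    y_pred = [2.0, 5.0, 9.0]
    for metric in (MSE(), RMSE(), R2Score()):
        print(type(metric).__name__, metric.calculate_score(y_true, y_pred))
```

For these made-up values the squared errors are 1, 0, and 4, so the MSE is 5/3, the RMSE is its square root, and the R2 score is 1 - 5/8 = 0.375.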

Step 5 – Run the flow (The final step)

We have to run the flow.

We import all the tasks and flows into the flow.py file and run our flow from there.

python3 flow.py

# Import comet_ml at the top of your file
from comet_ml import Experiment
from prefect import flow

from steps.ingest_data import ingest_df
from steps.clean_data import clean_df
from steps.train_model import train_model
from steps.evaluation import evaluate_model

@flow(retries=3, retry_delay_seconds=5, log_prints=True)
def my_flow():
    data_path="/home/dhrubaubuntu/gigs_projects/Bulldozer-price-prediction/data/TrainAndValid.csv"
    df = ingest_df(data_path)
    X_train, X_test, y_train, y_test = clean_df(df)
    model = train_model(X_train, X_test, y_train, y_test)
    r2_score, rmse = evaluate_model(model, X_test, y_test)

# Run the Prefect Flow
if __name__ == "__main__":
    my_flow()

After the flow finishes, you can see all the flow runs in the Prefect dashboard.
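
The flow decorator above is configured with retries=3 and retry_delay_seconds=5, which asks Prefect to run the flow again after a failure, waiting five seconds between attempts. The following standard-library sketch illustrates that retry behaviour in isolation. It is not Prefect's implementation, and run_with_retries and flaky_step are hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retries(
    fn: Callable[[], T],
    retries: int = 3,
    retry_delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn once, then retry up to `retries` more times if it raises."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            sleep(retry_delay_seconds)


if __name__ == "__main__":
    attempts: List[int] = []

    def flaky_step() -> str:
        """Fails twice, then succeeds, to imitate a temporary problem."""
        attempts.append(1)
        if len(attempts) < 3:
            raise RuntimeError("temporary failure")
        return "ok"

    # A no-op sleep keeps the demonstration fast.
    print(run_with_retries(flaky_step, sleep=lambda seconds: None), len(attempts))
```

Note that retries=3 means up to four attempts in total: the first run plus three retries.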

Conclusion

Implementing end-to-end MLOps enables organizations to reliably scale out machine learning solutions in production. This tutorial demonstrated an automated workflow for predicting bulldozer sale prices using tools like Prefect and CometML.

Key highlights from the project include:

Orchestrating an ML pipeline with Prefect, covering data ingestion, preprocessing, model training, and evaluation.
Tracking experiments in CometML to visualize model metrics like RMSE and R2 scores over time for comparison.
Monitoring workflow executions in Prefect Cloud showing task durations.

Overall, this showcase implements data science best practices of automation, reproducibility, and monitoring in a structured workflow critical for real-world ML systems. Extending and operationalizing to production can further leverage Prefect’s scalability in managing large-scale flows across distributed infrastructure.

Key Takeaways

Some key takeaways from this end-to-end MLOps tutorial include:

Implementing MLOps improves collaboration between data scientists and IT through automation and DevOps practices.
Prefect enables the creation of robust data pipelines and workflows to ingest, process, train, and evaluate models.
CometML provides an easy way to track ML experiments with logging and visualization.
Orchestrating the ML lifecycle end-to-end ensures models remain relevant as new data comes in.
Monitoring workflow executions helps identify and troubleshoot failures quickly.
MLOps unlocks faster experimentation by simplifying retraining and deployment of updated models.

Frequently Asked Questions

Q1. What is MLOps?

Ans. MLOps (Machine Learning Operations) is a set of practices that aims to streamline and automate the end-to-end machine learning lifecycle, including model development, deployment, and maintenance, to enhance collaboration and efficiency in data science and operations teams.

Q2. What is Prefect?

Ans. Prefect is an open-source Python library for workflow management. It enables the creation, scheduling, and orchestration of data workflows and tasks commonly used in data science and automation pipelines. It simplifies complex workflows, focusing on flexibility, reliability, and monitoring.

Q3. What is CometML?

Ans. CometML is a platform for machine learning experimentation and collaboration. It provides tools for tracking, comparing, and optimizing machine learning experiments, enabling teams to log and share experiment details, metrics, and visualizations to improve model development and collaboration.

Q4. What is Prefect used for?

Ans. Prefect is used for workflow management in data science and automation. It helps streamline and orchestrate complex data workflows, making it easier to design, schedule, and monitor tasks in one place. Prefect is commonly employed for data processing, machine learning model training, and other data-centric operations, providing a framework for building, running, and managing workflows efficiently.

Q5. What is the difference between MLflow and Comet?

Ans. MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, including experiment tracking, packaging code into reproducible runs, and sharing and deploying models. Comet is a platform for machine learning experimentation and collaboration, focusing on experiment tracking, visualizations, and collaboration features. It provides a centralized hub for teams to analyze and share results. While both support experiment tracking, MLflow offers additional model packaging and deployment features, while Comet emphasizes collaboration and visualization capabilities.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Dhrubaraj Roy

Hey everyone I am Dhrubaraj Roy from Siliguri, India, Passionate and self-motivated data science enthusiast with a strong foundation in data analysis, machine learning, and statistical modeling. Skilled in Python programming with expertise in libraries like NumPy and Pandas, I excel at extracting valuable insights from large datasets and building predictive models. With a keen eye for detail and strong problem-solving abilities, I strive to deliver impactful results. Excellent communication and collaboration skills allow me to thrive in team-oriented environments. Currently seeking opportunities to apply my data science expertise and contribute to growth. Let's connect and explore potential collaborations!
