Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It applies software engineering best practices to build production-ready data science pipelines. This article gives you a glimpse of the Kedro framework through a news classification task.
The main advantages of using Kedro are reproducibility, maintainability, and modularity of data science code, along with a standard project template that makes collaboration and deployment easier.
Learning Objectives
In this article, you will learn how to install Kedro, understand its core concepts (nodes, pipelines, and the Data Catalog), build data processing and data science pipelines for a text classification task, visualize the project with Kedro-Viz, and package the project for deployment.
Kedro can be installed from the PyPI repository using the following commands:
pip install kedro # core package
pip install kedro-viz # a plugin for visualization
It can also be installed using conda with the following command:
conda install -c conda-forge kedro
To confirm whether Kedro is installed, type the following command in the command line; a successful installation displays an ASCII art graphic along with the Kedro version number:
kedro info
In Kedro, a node is a wrapper for a pure Python function that names the inputs and outputs of that function. Nodes are the building block of a pipeline, and the output of one node can be the input of another.
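As a minimal sketch of the idea (the function and dataset names below are purely illustrative and not part of the tutorial project):

# A node wraps a pure Python function and names its inputs and outputs.
import pandas as pd
from kedro.pipeline import node

def clean_text(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Lower-case a hypothetical 'text' column."""
    raw_data = raw_data.copy()
    raw_data["text"] = raw_data["text"].str.lower()
    return raw_data

# 'raw_news' and 'clean_news' are dataset names resolved through the Data Catalog.
clean_text_node = node(func=clean_text, inputs="raw_news", outputs="clean_news", name="clean_text_node")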
A pipeline organizes the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed.
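Continuing the illustrative sketch above, nodes are assembled into a pipeline like this (extract_features is a hypothetical placeholder); Kedro works out that 'clean_news' produced by the first node feeds the second:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:
    # The order of this list does not matter: Kedro resolves it from inputs/outputs.
    return pipeline(
        [
            node(clean_text, inputs="raw_news", outputs="clean_news", name="clean_text_node"),
            node(extract_features, inputs="clean_news", outputs="news_features", name="features_node"),
        ]
    )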
The Kedro Data Catalog is the registry of all data sources that the project can use to manage loading and saving data. It maps the names of node inputs and outputs as keys in a DataCatalog (a Kedro class that can be specialized for different types of data storage).
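As a rough sketch of the concept (in a project, the same mapping normally lives in conf/base/catalog.yml, as shown later in this article; the import path for datasets can vary between Kedro versions, and the file path below is illustrative):

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # kedro_datasets.pandas.CSVDataset in newer versions

catalog = DataCatalog(
    {
        "raw_news": CSVDataSet(filepath="data/01_raw/news.csv"),  # illustrative entry
    }
)
raw_news = catalog.load("raw_news")   # load a dataset by name
catalog.save("raw_news", raw_news)    # save a dataset by name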
The default template that Kedro uses to store datasets, notebooks, configurations, and source code is shown below. This project structure makes the project easier to maintain and collaborate on, and it can be customized to suit your needs.
project-dir # Parent directory of the template
├── .gitignore # Hidden file that prevents staging of unnecessary files to `git`
├── conf # Project configuration files
├── data # Local project data (not committed to version control)
├── docs # Project documentation
├── logs # Project output logs (not committed to version control)
├── notebooks # Project-related Jupyter notebooks
├── pyproject.toml # Identifies the project root and contains configuration information
├── README.md # Project README
├── setup.cfg # Configuration options for `pytest` when doing `kedro test`
└── src # Project source code
Let’s understand how to set up and use Kedro by going through a step-by-step tutorial for creating a simple text classification task 🙂
It is always better to create a virtual environment to avoid package conflicts. Create a new virtual environment and install Kedro using the commands above. To create a new Kedro classification project, enter the following command in the command line and provide a name for the project when prompted:
kedro new
Fill in the name of the project as “kedro-agnews-tf” in the interactive shell. Then, go into the project directory and install the initial project dependencies using the commands:
cd kedro-agnews-tf
pip install tensorflow
pip install scikit-learn
pip install mlxtend
pip freeze > requirements.txt # update requirements file
We can set up logging, credentials, and other sensitive information in the ‘conf’ folder of the project. We do not need any of these for this development project, but they become crucial in production environments.
Now, we set up the data for our development workflow. The ‘data’ folder in the project directory hosts multiple sub-folders to store the project data. This structure is based on the layered data-engineering convention for managing data (for in-depth information, check out this blogpost). We store the AG News Subset data (downloaded from here) in the ‘raw’ sub-folder. The processed data goes into other sub-folders such as ‘intermediate’ and ‘feature’; the trained model goes into the ‘model’ sub-folder; model outputs and metrics go into the ‘model_output’ and ‘reporting’ sub-folders respectively.
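For reference, a newly created Kedro project uses the following layered sub-folders under ‘data’ (only some of them are used in this tutorial):

data
├── 01_raw            # Source data, e.g. the AG News CSV files
├── 02_intermediate   # Cleaned or typed versions of the raw data
├── 03_primary        # Canonical datasets used for feature engineering
├── 04_feature        # Engineered features
├── 05_model_input    # Final tables fed to the model
├── 06_models         # Serialized trained models
├── 07_model_output   # Model predictions
└── 08_reporting      # Metrics, reports, and plots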
Then, we need to register the dataset with the Kedro Data Catalog, i.e., we reference the dataset in the ‘conf/base/catalog.yml’ file, which makes our project reproducible by sharing the data definitions across the complete project pipeline. Add the following to the ‘conf/base/catalog.yml’ file (note: we can also add it to the ‘conf/local/catalog.yml’ file):
# in conf/base/catalog.yml
ag_news_train:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/train.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']

ag_news_test:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/test.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']
To test whether Kedro can load the data, type the following command in the command line:
kedro ipython
Type the following in the IPython session:
# train data
ag_news_train_data = catalog.load("ag_news_train")
ag_news_train_data.head()
# test data
ag_news_test_data = catalog.load("ag_news_test")
ag_news_test_data.head()
After validating the output, close the IPython session using exit(). This confirms that the data has been registered with Kedro successfully. Now, we move on to the pipeline creation stage, where we create the data processing and data science pipelines.
Now, we write Python functions as nodes, assemble them into pipelines, and let Kedro run them in dependency order.
In the terminal, from the project root directory, run the following command to generate a new pipeline for data processing:
kedro pipeline create data_processing
This generates the skeleton for the new pipeline: an ‘__init__.py’, ‘nodes.py’, and ‘pipeline.py’ under ‘src/kedro_agnews_tf/pipelines/data_processing/’, together with a parameters file under ‘conf/base/’ and a test scaffold under ‘src/tests/’.
The steps to be followed are: write the node functions in ‘nodes.py’, assemble them into a pipeline in ‘pipeline.py’, and register any new datasets (for example, the intermediate and primary outputs) in ‘conf/base/catalog.yml’.
To keep this blog succinct, I have not added the full code for each of these files here. You can check out the code that needs to be added for each file in my GitHub repository here; a minimal illustrative sketch is shown below.
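For orientation, here is a simplified sketch of what the data processing ‘nodes.py’ and ‘pipeline.py’ could look like; it is not the exact code from the repository:

# src/kedro_agnews_tf/pipelines/data_processing/nodes.py (illustrative sketch)
import pandas as pd

def preprocess_ag_news(df: pd.DataFrame) -> pd.DataFrame:
    """Combine Title and Description into a single lower-cased text column."""
    df = df.copy()
    df["text"] = (df["Title"] + ". " + df["Description"]).str.lower()
    return df[["ClassIndex", "text"]]

# src/kedro_agnews_tf/pipelines/data_processing/pipeline.py (illustrative sketch)
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import preprocess_ag_news

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(preprocess_ag_news, "ag_news_train", "preprocessed_ag_news_train", name="preprocess_train"),
            node(preprocess_ag_news, "ag_news_test", "preprocessed_ag_news_test", name="preprocess_test"),
        ]
    )

The intermediate output datasets would also be registered in ‘conf/base/catalog.yml’ so that Kedro writes them to the ‘data/02_intermediate’ folder.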
Run the following command to validate if you are able to execute the data processing pipeline without any errors:
kedro run --pipeline=data_processing
The above command generates data in the ‘data/02_intermediate’ and ‘data/03_primary’ folders.
In the terminal, from the project root directory, run the following command to generate a new pipeline for data science:
kedro pipeline create data_science
This command generates the same set of files as for the data processing pipeline, but this time under the data_science pipeline directory.
The steps are the same as before: write the node functions in ‘nodes.py’, assemble them in ‘pipeline.py’, and register the model and reporting outputs in the Data Catalog. You can check out the code that needs to be added for each file in my GitHub repository here; a simplified sketch follows.
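As with the data processing pipeline, here is a simplified sketch of a data science ‘nodes.py’; the actual model architecture and parameters in the repository may differ:

# src/kedro_agnews_tf/pipelines/data_science/nodes.py (illustrative sketch)
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score

def train_model(train_df: pd.DataFrame) -> tf.keras.Model:
    """Train a small text classifier on the preprocessed AG News data."""
    vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=64)
    vectorizer.adapt(train_df["text"].values)
    model = tf.keras.Sequential([
        vectorizer,
        tf.keras.layers.Embedding(20000, 64),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(4, activation="softmax"),  # AG News has 4 classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_df["text"].values, train_df["ClassIndex"].values - 1, epochs=3)  # ClassIndex is 1-4
    return model

def evaluate_model(model: tf.keras.Model, test_df: pd.DataFrame) -> dict:
    """Compute test accuracy for the reporting layer."""
    preds = model.predict(test_df["text"].values).argmax(axis=1)
    return {"accuracy": float(accuracy_score(test_df["ClassIndex"].values - 1, preds))}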
Run the following command to validate if you are able to execute the data science pipeline without any errors:
kedro run --pipeline=data_science
The above command generates the trained model and the results in the ‘data/06_models’ and ‘data/08_reporting’ folders, respectively.
This completes the data science pipeline. If you are interested in building project documentation, you can use Sphinx to generate documentation for your Kedro project.
The data folder contains the different datasets produced along the way: raw data, intermediate data, features, models, and so on. It is highly advisable to track this folder with DVC (Data Version Control), which lets you version large data files and models without committing them to Git.
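For example, a minimal DVC setup for tracking the raw dataset, assuming the project is already a Git repository, could look like this:

dvc init
dvc add data/01_raw/ag_news_csv.tar.gz   # creates a .dvc pointer file and git-ignores the data itself
git add data/01_raw/ag_news_csv.tar.gz.dvc data/01_raw/.gitignore
git commit -m "Track raw AG News data with DVC"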
We can visualize our complete kedro project pipeline using Kedro-Viz, a plugin built by Kedro developers. We have already installed this package during initial installation (pip install kedro-viz). To visualize our kedro project, run the following command in the terminal in the project root directory:
kedro viz
This command opens a browser tab to serve the visualization (http://127.0.0.1:4141/). The below image shows the visualization of our kedro-agnews project:
You can click on each of the nodes and datasets in the visualization to get more details about them. The visualization can also be refreshed dynamically whenever a Python or YAML file in the project changes, by using the --autoreload option:
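kedro viz --autoreload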
To package the project, run the following command in the project root directory:
kedro package
It builds the package into the ‘dist’ folder of your project and creates one .egg file and one .whl file, which are Python packaging formats for binary distribution.
To deploy its pipelines, we can use Kedro plugins such as Kedro-Docker and Kedro-Airflow to target various deployment environments.
To summarize briefly, Kedro has many features that help you go from the development stage to production of your ML workflow. To run the project directly, you can check out my GitHub repository here and run the following commands:
git clone https://github.com/dheerajnbhat/kedro-agnews-tf.git
cd kedro-agnews-tf
tar -xzvf data/01_raw/ag_news_csv.tar.gz --directory data/01_raw/
pip install -r src/requirements.txt
kedro run
# for visualization
kedro viz
The key takeaways from this article are:
- Kedro brings software engineering best practices, such as a standard project template, configuration management, and modular code, to data science projects.
- Nodes, pipelines, and the Data Catalog are the core concepts used to structure a reproducible workflow.
- Kedro-Viz lets you visualize the complete pipeline, and kedro package lets you package the project for deployment.
I hope this will help you get started with Kedro 🙂