Kedro is an open-source Python framework for creating reproducible, maintainable, and modular data science code. It applies software engineering best practices to build production-ready data science pipelines. This article gives you a glimpse of the Kedro framework through a news classification task.
The main advantages of using Kedro are reproducibility, maintainability, and modularity of data science code, along with a standard project template that makes collaboration and deployment easier.
Learning Objectives
In this article, you will learn how to install Kedro, understand its core concepts (nodes, pipelines, and the Data Catalog), build data processing and data science pipelines for a text classification task, visualize the project with Kedro-Viz, and package the project for deployment.
Kedro can be installed from the PyPI repository using the following commands:
pip install kedro # core package
pip install kedro-viz # a plugin for visualization
It can also be installed using conda with the following command:
conda install -c conda-forge kedro
To confirm whether Kedro is installed, type the following command in the command line; a successful installation displays an ASCII art graphic along with the Kedro version number:
kedro info
In Kedro, a node is a wrapper for a pure Python function that names the inputs and outputs of that function. Nodes are the building block of a pipeline, and the output of one node can be the input of another.
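As a minimal sketch of the idea (the function and dataset names below are purely illustrative and not part of the tutorial project):

# A node wraps a pure Python function and names its inputs and outputs.
import pandas as pd
from kedro.pipeline import node

def clean_text(raw_data: pd.DataFrame) -> pd.DataFrame:
    """Lower-case a hypothetical 'text' column."""
    raw_data = raw_data.copy()
    raw_data["text"] = raw_data["text"].str.lower()
    return raw_data

# 'raw_news' and 'clean_news' are dataset names resolved through the Data Catalog.
clean_text_node = node(func=clean_text, inputs="raw_news", outputs="clean_news", name="clean_text_node")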
A pipeline organizes the dependencies and execution order of a collection of nodes and connects inputs and outputs while keeping your code modular. The pipeline determines the node execution order by resolving dependencies and does not necessarily run the nodes in the order in which they are passed.
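Continuing the illustrative sketch above, nodes are assembled into a pipeline like this (extract_features is a hypothetical placeholder); Kedro works out that 'clean_news' produced by the first node feeds the second:

from kedro.pipeline import Pipeline, node, pipeline

def create_pipeline(**kwargs) -> Pipeline:
    # The order of this list does not matter: Kedro resolves it from inputs/outputs.
    return pipeline(
        [
            node(clean_text, inputs="raw_news", outputs="clean_news", name="clean_text_node"),
            node(extract_features, inputs="clean_news", outputs="news_features", name="features_node"),
        ]
    )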
The Kedro Data Catalog is the registry of all data sources that the project can use to manage loading and saving data. It maps the names of node inputs and outputs as keys in a DataCatalog (a Kedro class that can be specialized for different types of data storage).
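As a rough sketch of the concept (in a project, the same mapping normally lives in conf/base/catalog.yml, as shown later in this article; the import path for datasets can vary between Kedro versions, and the file path below is illustrative):

from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # kedro_datasets.pandas.CSVDataset in newer versions

catalog = DataCatalog(
    {
        "raw_news": CSVDataSet(filepath="data/01_raw/news.csv"),  # illustrative entry
    }
)
raw_news = catalog.load("raw_news")   # load a dataset by name
catalog.save("raw_news", raw_news)    # save a dataset by name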
The default template that Kedro uses to store datasets, notebooks, configurations, and source code is shown below. This project structure makes the project easier to maintain and collaborate on, and it can be customized to suit your needs.
project-dir # Parent directory of the template
├── .gitignore # Hidden file that prevents staging of unnecessary files to `git`
├── conf # Project configuration files
├── data # Local project data (not committed to version control)
├── docs # Project documentation
├── logs # Project output logs (not committed to version control)
├── notebooks # Project-related Jupyter notebooks
├── pyproject.toml # Identifies the project root and contains configuration information
├── README.md # Project README
├── setup.cfg # Configuration options for `pytest` when doing `kedro test`
└── src # Project source code
Let’s understand how to set up and use Kedro by going through a step-by-step tutorial for creating a simple text classification task 🙂
It is always better to create a virtual environment to avoid package conflicts. Create a new virtual environment and install Kedro using the commands above. To create a new Kedro classification project, enter the following command in the command line and provide a name for the project when prompted:
kedro new
Fill in the name of the project as “kedro-agnews-tf” in the interactive shell. Then, go into the project directory and install the initial project dependencies using the commands:
cd kedro-agnews-tf
pip install tensorflow
pip install scikit-learn
pip install mlxtend
pip freeze > requirements.txt # update requirements file
We can set up logging, credentials, and other sensitive information in the ‘conf’ folder of the project. We do not need any of these for this development project, but they become crucial in production environments.
Now, we set up the data for our development workflow. The ‘data’ folder in the project directory hosts multiple sub-folders to store the project data. This structure is based on the layered data-engineering convention for managing data (for in-depth information, check out this blogpost). We store the AG News Subset data (downloaded from here) in the ‘raw’ sub-folder. The processed data goes into other sub-folders such as ‘intermediate’ and ‘feature’; the trained model goes into the ‘model’ sub-folder; model outputs and metrics go into the ‘model_output’ and ‘reporting’ sub-folders respectively.
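For reference, a newly created Kedro project uses the following layered sub-folders under ‘data’ (only some of them are used in this tutorial):

data
├── 01_raw            # Source data, e.g. the AG News CSV files
├── 02_intermediate   # Cleaned or typed versions of the raw data
├── 03_primary        # Canonical datasets used for feature engineering
├── 04_feature        # Engineered features
├── 05_model_input    # Final tables fed to the model
├── 06_models         # Serialized trained models
├── 07_model_output   # Model predictions
└── 08_reporting      # Metrics, reports, and plots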
Then, we need to register the dataset with the Kedro Data Catalog, i.e., we reference the dataset in the ‘conf/base/catalog.yml’ file, which makes our project reproducible by sharing the data definitions across the complete project pipeline. Add the following to the ‘conf/base/catalog.yml’ file (note: we can also add it to the ‘conf/local/catalog.yml’ file):
# in conf/base/catalog.yml
ag_news_train:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/train.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']

ag_news_test:
  type: pandas.CSVDataSet
  filepath: data/01_raw/ag_news_csv/test.csv
  load_args:
    names: ['ClassIndex', 'Title', 'Description']
To test whether Kedro can load the data, type the following command in the command line:
kedro ipython
Type the following in the IPython session:
# train data
ag_news_train_data = catalog.load("ag_news_train")
ag_news_train_data.head()
# test data
ag_news_test_data = catalog.load("ag_news_test")
ag_news_test_data.head()
After validating the output, close the IPython session using exit(). This confirms that the data has been registered with Kedro successfully. Now, we move on to the pipeline creation stage, where we create the data processing and data science pipelines.
Now, we write Python functions as nodes, assemble them into pipelines, and let Kedro run them in dependency order.
In the terminal, from the project root directory, run the following command to generate a new pipeline for data processing:
kedro pipeline create data_processing
This generates the skeleton for the new pipeline: an ‘__init__.py’, ‘nodes.py’, and ‘pipeline.py’ under ‘src/kedro_agnews_tf/pipelines/data_processing/’, together with a parameters file under ‘conf/base/’ and a test scaffold under ‘src/tests/’.
The steps to be followed are: write the node functions in ‘nodes.py’, assemble them into a pipeline in ‘pipeline.py’, and register any new datasets (for example, the intermediate and primary outputs) in ‘conf/base/catalog.yml’.
To keep this blog succinct, I have not added the full code for each of these files here. You can check out the code that needs to be added for each file in my GitHub repository here; a minimal illustrative sketch is shown below.
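For orientation, here is a simplified sketch of what the data processing ‘nodes.py’ and ‘pipeline.py’ could look like; it is not the exact code from the repository:

# src/kedro_agnews_tf/pipelines/data_processing/nodes.py (illustrative sketch)
import pandas as pd

def preprocess_ag_news(df: pd.DataFrame) -> pd.DataFrame:
    """Combine Title and Description into a single lower-cased text column."""
    df = df.copy()
    df["text"] = (df["Title"] + ". " + df["Description"]).str.lower()
    return df[["ClassIndex", "text"]]

# src/kedro_agnews_tf/pipelines/data_processing/pipeline.py (illustrative sketch)
from kedro.pipeline import Pipeline, node, pipeline
from .nodes import preprocess_ag_news

def create_pipeline(**kwargs) -> Pipeline:
    return pipeline(
        [
            node(preprocess_ag_news, "ag_news_train", "preprocessed_ag_news_train", name="preprocess_train"),
            node(preprocess_ag_news, "ag_news_test", "preprocessed_ag_news_test", name="preprocess_test"),
        ]
    )

The intermediate output datasets would also be registered in ‘conf/base/catalog.yml’ so that Kedro writes them to the ‘data/02_intermediate’ folder.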
Run the following command to validate if you are able to execute the data processing pipeline without any errors:
kedro run --pipeline=data_processing
The above command generates data in the ‘data/02_intermediate’ and ‘data/03_primary’ folders.
In the terminal, from the project root directory, run the following command to generate a new pipeline for data science:
kedro pipeline create data_science
This command generates the same set of files as for the data processing pipeline, but this time under the data_science pipeline directory.
The steps are the same as before: write the node functions in ‘nodes.py’, assemble them in ‘pipeline.py’, and register the model and reporting outputs in the Data Catalog. You can check out the code that needs to be added for each file in my GitHub repository here; a simplified sketch follows.
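As with the data processing pipeline, here is a simplified sketch of a data science ‘nodes.py’; the actual model architecture and parameters in the repository may differ:

# src/kedro_agnews_tf/pipelines/data_science/nodes.py (illustrative sketch)
import pandas as pd
import tensorflow as tf
from sklearn.metrics import accuracy_score

def train_model(train_df: pd.DataFrame) -> tf.keras.Model:
    """Train a small text classifier on the preprocessed AG News data."""
    vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=64)
    vectorizer.adapt(train_df["text"].values)
    model = tf.keras.Sequential([
        vectorizer,
        tf.keras.layers.Embedding(20000, 64),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(4, activation="softmax"),  # AG News has 4 classes
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    model.fit(train_df["text"].values, train_df["ClassIndex"].values - 1, epochs=3)  # ClassIndex is 1-4
    return model

def evaluate_model(model: tf.keras.Model, test_df: pd.DataFrame) -> dict:
    """Compute test accuracy for the reporting layer."""
    preds = model.predict(test_df["text"].values).argmax(axis=1)
    return {"accuracy": float(accuracy_score(test_df["ClassIndex"].values - 1, preds))}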
Run the following command to validate if you are able to execute the data science pipeline without any errors:
kedro run --pipeline=data_science
The above command generates the trained model and the results in the ‘data/06_models’ and ‘data/08_reporting’ folders, respectively.
This completes the data science pipeline. If you are interested in building project documentation, you can use Sphinx to generate documentation for your Kedro project.
The data folder contains the different datasets produced along the way: raw data, intermediate data, features, models, and so on. It is highly advisable to track this folder with DVC (Data Version Control), which lets you version large data files and models without committing them to Git.
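For example, a minimal DVC setup for tracking the raw dataset, assuming the project is already a Git repository, could look like this:

dvc init
dvc add data/01_raw/ag_news_csv.tar.gz   # creates a .dvc pointer file and git-ignores the data itself
git add data/01_raw/ag_news_csv.tar.gz.dvc data/01_raw/.gitignore
git commit -m "Track raw AG News data with DVC"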
We can visualize our complete kedro project pipeline using Kedro-Viz, a plugin built by Kedro developers. We have already installed this package during initial installation (pip install kedro-viz). To visualize our kedro project, run the following command in the terminal in the project root directory:
kedro viz
This command opens a browser tab to serve the visualization (http://127.0.0.1:4141/). The below image shows the visualization of our kedro-agnews project:
You can click on each of the nodes and datasets in the visualization to get more details about them. The visualization can also be refreshed dynamically whenever a Python or YAML file in the project changes, by using the --autoreload option:
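kedro viz --autoreload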
To package the project, run the following command in the project root directory:
kedro package
It builds the package into the ‘dist’ folder of your project and creates one .egg file and one .whl file, which are Python packaging formats for binary distribution.
To deploy its pipelines, we can use Kedro plugins such as Kedro-Docker and Kedro-Airflow to target various deployment environments.
To summarize briefly, Kedro has many features that help you go from the development stage to production of your ML workflow. To run the project directly, you can check out my GitHub repository here and run the following commands:
git clone https://github.com/dheerajnbhat/kedro-agnews-tf.git
cd kedro-agnews-tf
tar -xzvf data/01_raw/ag_news_csv.tar.gz --directory data/01_raw/
pip install -r src/requirements.txt
kedro run
# for visualization
kedro viz
The key takeaways from this article are:
- Kedro brings software engineering best practices, such as a standard project template, configuration management, and modular code, to data science projects.
- Nodes, pipelines, and the Data Catalog are the core concepts used to structure a reproducible workflow.
- Kedro-Viz lets you visualize the complete pipeline, and kedro package lets you package the project for deployment.
I hope this will help you get started with Kedro 🙂