An Introduction to Machine Learning Libraries for C++

Alakh Sethi Last Updated : 14 May, 2020

8 min read

Introduction

I love working with C++, even after I discovered the Python programming language for machine learning. C++ was the first programming language I ever learned and I’m delighted to use that in the machine learning space!

I wrote about building machine learning models in my previous article and the community loved the idea. I received an overwhelming response and one query stood out for me (from multiple folks) – are there any C++ libraries for machine learning?

It’s a fair question. Languages like Python and R have a plethora of packages and libraries that cater to different machine learning tasks. So does C++ have any such offering?

Yes, it does! I will highlight two such C++ libraries in this article, and we’ll also see both of them in action (with code). If you’re new to C++ for machine learning, I’ll again recommend going through the first article.

Why Should we use Machine Learning Libraries?
Machine Learning Libraries in C++
1. SHARK Library
2. MLPACK Library

Why Should we use Machine Learning Libraries?

This is a question a lot of newcomers will have. What is the importance of libraries in machine learning? Let me try and explain that in this section.

Let’s say that experienced professionals and industry veterans have put in the hard graft and come up with a solution to a problem. Would you prefer using that or would you rather spend hours trying to re-create the same thing from scratch? That’s usually little point in going for the latter method, especially when you’re working or learning under deadlines.

The great thing about our machine learning community is that a lot of solutions already exist in the form of libraries and packages. Someone else, from experts to enthusiasts, has already done the hard work and put the solution together nicely packaged in a library.

These machine learning libraries are efficient and optimized, and they are tested thoroughly for multiple use cases. Relying on these libraries is what powers our learning and makes writing code, whether that’s in C++ or Python, so much easier and intuitive.

Machine Learning Libraries in C++

In this section, we’ll look at the two most popular machine learning libraries in C+:

SHARK Library
MLPACK Library

Let’s look at each one individually and see how the C++ code works.

1) SHARK C++ Library

Shark is a fast modular library and it has overwhelming support for supervised learning algorithms, such as linear regression, neural networks, clustering, k-means, etc. It also includes the functionality of linear algebra and numerical optimization. These are key mathematical functions or areas that are very important when performing machine learning tasks.

We’ll first see how to install Shark and set up an environment. Then we’ll implement linear regression with Shark.

Install Shark and setup environment (I’ll be doing this for Linux)

Shark relies on Boost and cmake. Fortunately, all the dependencies can be installed with the following command:

              sudo apt-get install cmake cmake-curses-gui libatlas-base-dev libboost-all-dev

To install Shark, run the below commands line-by-line in your terminal:

git clone https://github.com/Shark-ML/Shark.git (you can download the zip file and extract as well)
cd Shark
mkdir build
cd build
cmake ..
make

If you haven’t seen this before that’s not a problem. It’s pretty straightforward and there’s plenty of information online if you get into trouble. For Windows and other operating systems, you can do a quick Google search on how to install Shark. Here is the reference site Shark Installation guide.

Compiling programs with Shark

Include the relevant header files. Suppose we want to implement linear regression, then the extra header files included are:
```
#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h>
#include <shark/Algorithms/Trainers/LinearRegression.h>
```

To compile, we need to link with the following libraries:
```
-std=c++11 -lboost_serialization -lshark -lcblas
```

Implementing Linear Regression with Shark

My first article on this series had an introduction to linear regression. I’ll use the same idea in this article, but this time using the Shark C++ library.

Initialization phase
We’ll start by including the libraries and header functions for linear regression:

	#include <bits/stdc++.h> //header file for all basic c++ libraries
	#include <shark/Data/Csv.h> //header file ro importing data in csv format
	#include <shark/ObjectiveFunctions/Loss/SquaredLoss.h> //header file for implementing squared loss function
	#include <shark/Algorithms/Trainers/LinearRegression.h>// header file for implementing linear regression

view raw 2C++1.cpp hosted with ❤ by GitHub

Next comes the dataset. I have created two CSV files. The input.csv file contains the x values and the labels.csv file contains the y values. Below is a snapshot of the data:

You can find both the files here – Machine Learning with C++. First, we’ll make data containers for storing the values from CSV files:

	Data<RealVector> inputs; //container for storing the x values
	Data<RealVector> labels; //conatiner for storing the y values

view raw 2C++2.cpp hosted with ❤ by GitHub

Next, we need to import them. Shark comes with a nice import CSV function, and we specify the data container that we want to initialize, and also the location to path file of the CSV:

	importCSV(inputs, "input.csv"); // storing the values in specific container by specifying the path of csv
	importCSV(labels, "label.csv");

view raw 2C++3.cpp hosted with ❤ by GitHub

We then need to instantiate a regression dataset type. Now, this is just a general object for regression, and what we’ll do in the constructor is we pass in our inputs and also our labels for the data.

Next, we need to train the linear regression model. How do we do that? We need to instantiate a trainer and define a linear model:

	RegressionDataset data(inputs, labels);
	LinearRegression trainer;// trainer for linear regression model
	LinearModel<> model; // linear model

view raw 2C++4.cpp hosted with ❤ by GitHub

Training Phase

Next comes the key step where we actually train the model. Here, the trainer has a member function called train. So this function trains this model and finds the parameters for the model, which is exactly what we want to do.

	//training the model
	trainer.train(model, data);// train function ro training the model.

view raw 2C++5.cpp hosted with ❤ by GitHub

Prediction phase

Finally, let’s output the model parameters:

	// show model parameters
	cout << "intercept: " << model.offset() << endl;
	cout << "matrix: " << model.matrix() << endl;

view raw 2C++6.cpp hosted with ❤ by GitHub

Linear models have a member function called offset that outputs the intercept of the best fit line. Next, we’re outputting a matrix instead of a multiplier. This is because the model can be generalized (not just linear, it could be a polynomial).

We compute the line of best fit by minimizing the least-squares, aka minimizing the squared loss.

So, fortunately, the model allows us to output that information. The Shark library is very helpful in giving an indication as to how well the model(s) fit:

	SquaredLoss<> loss; //initializing square loss object
	Data<RealVector> prediction = model(data.inputs()); //prediction is calculated based on data inputs
	cout << "squared loss: " << loss(data.labels(), prediction) << endl; // In end we ouptut the loss

view raw 2C++7.py hosted with ❤ by GitHub

First, we need to initialize a squared loss object, and then we need to instantiate a data container. The prediction then is computed based on the inputs to the system, and then all we need to do is output the loss, which is computed via passing the labels and also the prediction value.

Finally, we need to compile. In the terminal, type the below command (make sure the directory is properly set):

g++ -o lr linear_regression.cpp -std=c++11 -lboost_serialization -lshark -lcblas

Once compiled, it would have created an lr object. Now simply run the program. The output we get is:

b : [1](-0.749091)

A :[1,1]((2.00731))

Loss: 7.83109

The value of b is a little far from 0 but that is because of the noise in labels. The value of the multiplier is close to 2 which is quite similar to the data. And that’s how you can use the Shark library in C++ to build a linear regression model!

2) MLPACK C++ Library

mlpack is a fast and flexible machine learning library written in C++. It aims to provide fast and extensible implementations of cutting-edge machine learning algorithms. mlpack provides these algorithms as simple command-line programs, Python bindings, Julia bindings, and C++ classes which can then be integrated into larger-scale machine learning solutions.

We’ll first see how to install mlpack and the setup environment. Then we’ll implement the k-means algorithm using mlpack.

Install mlpack and the setup environment (I’ll be doing this for Linux)

mlpack depends on the below libraries that need to be installed on the system and have headers present:

Armadillo >= 8.400.0 (with LAPACK support)
Boost (math_c99, program_options, serialization, unit_test_framework, heap, spirit) >= 1.49
ensmallen >= 2.10.0

In Ubuntu and Debian, you can get all of these dependencies through apt:

sudo apt-get install libboost-math-dev libboost-program-options-dev libboost-test-dev libboost-serialization-dev binutils-dev python-pandas python-numpy cython python-setuptools

Now that all the dependencies are installed in your system, you can directly run the commands below to build and install mlpack:

wget
tar -xvzpf mlpack-3.2.2.tar.gz
mkdir mlpack-3.2.2/build && cd mlpack-3.2.2/build
cmake ../
make -j4 # The -j is the number of cores you want to use for a build
sudo make install

On many Linux systems, mlpack will be installed by default to /usr/local/lib and you may need to set the LD_LIBRARY_PATH environment variable:

export LD_LIBRARY_PATH=/usr/local/lib

The instructions above are the simplest way to get, build, and install mlpack. If your Linux distribution supports binaries, follow this site to install mlpack using a one-line command depending on your distribution.: MLPACK Installation Instructions. The above method works for all distributions.

Compiling programs with mlpack

Include the relevant header files in your program (assuming implementing k-means):

1. #include <mlpack/methods/kmeans/kmeans.hpp>
2. #include <armadillo>

To compile we need to link the following libraries:

1. std=c++11 -larmadillo -lmlpack -lboost_serialization

Implementing K-Means with mlpack

K-means is a centroid-based algorithm, or a distance-based algorithm, where we calculate the distances to assign a point to a cluster. In K-Means, each cluster is associated with a centroid.

The main objective of the K-Means algorithm is to minimize the sum of distances between the points and their respective cluster centroid.

K-means is effectively an iterative process where we want to segment the data into certain clusters. First, we assign some initial centroids, so these can be completely random. Next, for each data point, we find the nearest centroid. We will then assign that data point to that centroid. So each centroid represents a class. And once we assign all the data points to each centroid, we will then compute the mean of those centroids.

For a detailed understanding of the K-means algorithm, read this tutorial: The Most Comprehensive Guide to K-Means Clustering You’ll Ever Need.

Here, we will implement k-means using the mlpack library in C++.

Initialization phase

We’ll start by including the libraries and header functions for k-means:

	#include <bits/stdc++.h>// include basic c++ liraries
	#include <mlpack/methods/kmeans/kmeans.hpp>// mlpack library for k-means
	#include <armadillo>//ml pack is dependent on armadillo

	Using namespace std;

view raw 2C++8.cpp hosted with ❤ by GitHub

Next, we’re going to create some basic variables to set the number of clusters, the dimensionality of the program, the number of samples, and the maximum amount of iterations we want to do. Why? Because K-means is an iterative process.

	int k = 2; //amount of clusters
	int dim = 2;//dimensionality of program
	int samples = 50;
	int max_iter = 10;//maximum number of iterations

view raw 2C++9.cpp hosted with ❤ by GitHub

Next, we are going to create the data. So this is where we’re first going to use the Armadillo library. We’ll create a map class that is effectively a data container:

arma::mat data(dim, samples, arma::fill::zeros);

view raw 2C++10.cpp hosted with ❤ by GitHub

There you go! This mat class, the object data it has, we’ve given it a dimensionality of two, and it knows it’s going to have 50 samples, and it’s initialized all these data values to be 0.

Next, we will assign some random data to this data class and then effectively run K-means on it. I’m going to create 25 points around the position 1 1, and we can do this by effectively saying each data point is 1 1 or at the position X equals 1, y equals 1. Then we’re going to add some random noise for each of the 25 data points. Let’s see this in action:

	// create data
	int i = 0;
	for(; i < samples / 2; ++i)
	{
	data.col(i) = arma::vec({1, 1}) + 0.25*arma::randn<arma::vec>(dim);
	}
	for(; i < samples; ++i)
	{
	data.col(i) = arma::vec({2, 3}) + 0.25*arma::randn<arma::vec>(dim);
	}

view raw 2C++11.cpp hosted with ❤ by GitHub

Here, for i from 0 to 25, the ith column needs to be this arma type vector at the position 1 1, and then we’re going to add a certain amount of random noise of size 2. So, it’s going to be a 2-dimensional random noise vector times 0.25 to this position, and that’s going to be our data column. And then we’re going to do exactly the same for the point x equals 2 and y equals 3.

And our data is now ready! Time to jump to the training phase.

Training Phase

So first, let’s instantiate an arma mat row type to hold the clusters, and then instantiate an arma mat to hold the centroids:

	//cluster the data
	arma::Row<size_t> clusters;
	arma::mat centroids;

view raw 2C++12.cpp hosted with ❤ by GitHub

Now, we need to instantiate the K-means class:

mlpack::kmeans::KMeans<> mlpack_kmeans(max_iter);

view raw 2C++13.cpp hosted with ❤ by GitHub

We’ve instantiated the K-means class, and we’ve specified the maximum amount of iterations that we passed through to the constructor. So now, we can go ahead and do the clustering.

We’ll call the Cluster member function of this K-means class. We need to pass in the data, the number of clusters, and then we also need to pass in the cluster’s object and the centroid’s object.

mlpack_kmeans.Cluster(data, k, clusters, centroids);

view raw 2C++14.cpp hosted with ❤ by GitHub

Now, this cluster function will run K-means on this data with a specified number of clusters, and it will then initialize these two objects – clusters and centroids.

Generating results

We can simply display the results using the centroids.print function. This will give the location of the centroids:

centroids.print("Centroids:");

view raw 2C++15.cpp hosted with ❤ by GitHub

Next, we need to compile. In the terminal, type the below command (again, make sure the directory is properly set):

g++ k_means.cpp -o kmeans_test -O3 -std=c++11 -larmadillo -lmlpack -lboost_serialization && ./kmeans_test

Once compiled, it would have created a kmeans object. Now simply run the program. The output we get is:

Centroids:

0.9497 1.9625

0.9689 3.0652

And that’s it!

End Notes

In this article, we saw two popular machine learning libraries that help us implement machine learning models in c++. I love the extensive support available in the official documentation so do check that out. If you need any help, reach out to me below and I’ll be happy to connect with you!

In the next article, we’ll implement some interesting machine learning models like decision trees and random forest. So stay tuned!

Alakh Sethi

Aspiring Data Scientist with a passion to play and wrangle with data and get insights from it to help the community know the upcoming trends and products for their better future.With an ambition to develop product used by millions which makes their life easier and better.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Theta

Awesome information! A really good read. Now to see what I can extrapolate from practice and experiments.

Show 1 reply

Thank you so much! Happy to hear!

Catan++

Please fix the inconsistent coding style of your snippets (comments, indentation etc).

Lori B Guidos

Hey..your registration page phone completion setup is a drag. I am a U S resisent and I couldn't get it to work with my phone number. Personally, I don't really want to ahare a phone number with anyone on a registration form...

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

An Introduction to Machine Learning Libraries for C++

Introduction

Table of Contents

Why Should we use Machine Learning Libraries?

Machine Learning Libraries in C++

1) SHARK C++ Library

Install Shark and setup environment (I’ll be doing this for Linux)

Compiling programs with Shark

Implementing Linear Regression with Shark

2) MLPACK C++ Library

Install mlpack and the setup environment (I’ll be doing this for Linux)

Compiling programs with mlpack

Implementing K-Means with mlpack

End Notes

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B