Data Science & Machine Learning are being used by organizations to solve a variety of business problems today. In order to create a real business impact, an important consideration is to bridge the gap between the data science pipeline and business decision making pipeline.
The outcome of data science pipeline is uaully predictions, patterns and insights from data (typically without any notion of constraints) but that alone is insufficient for business stakeholders to take decisions. Data science output has to be fed into the business decision making pipeline which involves some sort of optimization involving constraints and decision variables which model key aspects of the business.
For example, if you are running a Super Market chain – your data science pipeline would forecast the expected sales. You would then take those inputs and create an optimised inventory / sales strategy.
In this article, we will show one such example of Linear optimization for selecting which TED videos to watch.
Among optimization techniques, Linear Optimization using the Simplex Method is considered one of the most powerful ones and has been rated as one of the Top 10 algorithms of the 20th century. As data science practitioners, it is important to have hands-on knowledge in implementing Linear Optimization and this blog post is to illustrate its implementation using Python’s PuLP package.
To make things interesting & simpler to understand, we will learn this optimization technique by applying it on a practical, day-to-day problem. Having said that, what we learn is applicable to a variety of business problems as well.
Note: This article assumes you have a basics knowledge of linear programming. You can go through this article if you want to review the topic.
TED is a nonprofit devoted to spreading ideas. TED began in 1984 as a conference where Technology, Entertainment and Design converged, and today covers almost all topics — from science to business to global issues — in more than 100 languages. TED talks are delivered by experts passionate about work in their chosen domains and have a wealth of information.
Now, for the purpose of this blog post, imagine a situation where one is interested to create their watch list of the most popular TED talks given their constraints (time that can be allotted to viewing and the number of talks). We will see how to implement the Python program to help us create the watchlist in the optimal manner.
The code of the article can be found here. Screenshots from my Jupyter notebook are shown below:
PuLP is a free open source software written in Python. It is used to describe optimisation problems as mathematical models. PuLP can then call any of numerous external LP solvers (CBC, GLPK, CPLEX, Gurobi etc) to solve this model and then use python commands to manipulate and display the solution. By default, CoinMP solver is bundled with PuLP.
Dataset having all the TED talks (2550) is downloaded from Kaggle and read into a dataframe. A subset of relevant columns is selected and the resulting dataset has the following details – Index of the talk, Name of the talk, TED Event Name, Talk duration (in minutes), Number of Views (Proxy for Popularity of the talk)
#% matplotlib inline
#from pulp import *
#import numpy as np
import pandas as pd
#import re
#import matplotlib.pyplot as plt
#from IPython.display import Image
# Download the dataset from https://www.kaggle.com/rounakbanik/ted-talks
# Read the dataset into pandas dataframe, convert duration from seconds to minutes
ted = pd.read_csv('ted_main.csv', encoding='ISO-8859-1')
ted['duration'] = ted['duration'] / 60
ted = ted.round({'duration': 1})
# Select subset of columns & rows (if required)
# data = ted.sample(n=1000) # 'n' can be changed as required
data = ted
selected_cols = ['name', 'event', 'duration', 'views']
data.reset_index(inplace=True)
print(data.head())
Start with defining the LP Object. The prob variable is created to contain the problem formulation
Step 3.1: Create the decision variables
Iterate over each row of the data frame to create the decision variables, such that each talk becomes one decision variable. Since each talk can either be selected or not selected as part of the final watch list, the decision variable is binary in nature (1=Selected, 0=Not Selected)
Step 3.2: Define the Objective Function
The objective function is the sum over all rows of the views for each talk. The views serve as a proxy for the popularity of the talk and so in essence we are trying to maximize the views (popularity) by selecting appropriate talks (decision variables)
Step 3.3: Define the Constraints
In the problem, we have 2 constraints:
a) We only have fixed amount of total time that can be allocated to view the talks
b) We don’t want to view more than a certain number of talks to avoid information overload
Step 3.4: The Final Format (for problem formulation)
The final format of the problem formulated is written out into a .lp file. This will list the objective function, the decision variables and the constraints imposed on the problem.
Step 3.5: The Actual Optimization
The actual optimization is a single line of code that calls ‘prob.solve’. Assert statement is inserted to ascertain whether an optimal result was obtained for the problem.
The optimization results which indicates the specific decision variables (talks) that were selected to maximize the outcome has to be converted into a format of a watch list, as shown below:
This article provides an example of utilizing Linear Optimization techniques available in Python to solve the everyday problem of creating video watch list. The concepts learned are also applicable in more complex business situations involving thousands of decision variables and many different constraints.
Every data science practitioner needs to add “Optimization techniques” to their body of knowledge so that they can use advanced analytics to solve real world business problems and this article is intended to help you take the first step in that direction.
Karthikeyan Sankaran is currently a Director at LatentView Analytics which provides solutions at the intersection of Business, Technology & Math to business problems across a wide range of industries. Karthik has close to two decades of experience in the Information Technology industry having worked in multiple roles across the space of Data Management, Business Intelligence & Analytics.
This story was received as part of “Blogathon” contest on Analytics Vidhya. Karthikeyan’s entry was one of the winning entries in the competition.
Very useful. Thank you sir.