Introductory guide to Linear Optimization in Python (with TED videos case study)

Guest Blog Last Updated : 30 Oct, 2024

6 min read

Introduction

Data Science & Machine Learning are being used by organizations to solve a variety of business problems today. In order to create a real business impact, an important consideration is to bridge the gap between the data science pipeline and business decision making pipeline.

The outcome of data science pipeline is uaully predictions, patterns and insights from data (typically without any notion of constraints) but that alone is insufficient for business stakeholders to take decisions. Data science output has to be fed into the business decision making pipeline which involves some sort of optimization involving constraints and decision variables which model key aspects of the business.

For example, if you are running a Super Market chain – your data science pipeline would forecast the expected sales. You would then take those inputs and create an optimised inventory / sales strategy.

In this article, we will show one such example of Linear optimization for selecting which TED videos to watch.

Introduction to Linear Optimization
The Problem – Creating the Watch List for TED videos
Step 1 – Import relevant packages
Step 2 – Create a dataframe for TED talks
Step 3 – Set up the Linear Optimization Problem
Step 4 – Convert the Optimization results into an interpretable format

Introduction to Linear Optimization

Among optimization techniques, Linear Optimization using the Simplex Method is considered one of the most powerful ones and has been rated as one of the Top 10 algorithms of the 20^th century. As data science practitioners, it is important to have hands-on knowledge in implementing Linear Optimization and this blog post is to illustrate its implementation using Python’s PuLP package.

To make things interesting & simpler to understand, we will learn this optimization technique by applying it on a practical, day-to-day problem. Having said that, what we learn is applicable to a variety of business problems as well.

Note: This article assumes you have a basics knowledge of linear programming. You can go through this article if you want to review the topic.

The Problem – Creating the Watch List for TED videos

TED is a nonprofit devoted to spreading ideas. TED began in 1984 as a conference where Technology, Entertainment and Design converged, and today covers almost all topics — from science to business to global issues — in more than 100 languages. TED talks are delivered by experts passionate about work in their chosen domains and have a wealth of information.

Now, for the purpose of this blog post, imagine a situation where one is interested to create their watch list of the most popular TED talks given their constraints (time that can be allotted to viewing and the number of talks). We will see how to implement the Python program to help us create the watchlist in the optimal manner.

The code of the article can be found here. Screenshots from my Jupyter notebook are shown below:

Step 1 – Import relevant packages

PuLP is a free open source software written in Python. It is used to describe optimisation problems as mathematical models. PuLP can then call any of numerous external LP solvers (CBC, GLPK, CPLEX, Gurobi etc) to solve this model and then use python commands to manipulate and display the solution. By default, CoinMP solver is bundled with PuLP.

Step 2 – Create a dataframe for TED talks

Dataset having all the TED talks (2550) is downloaded from Kaggle and read into a dataframe. A subset of relevant columns is selected and the resulting dataset has the following details – Index of the talk, Name of the talk, TED Event Name, Talk duration (in minutes), Number of Views (Proxy for Popularity of the talk)

#% matplotlib inline

#from pulp import *
#import numpy as np
import pandas as pd
#import re
#import matplotlib.pyplot as plt
#from IPython.display import Image

# Download the dataset from https://www.kaggle.com/rounakbanik/ted-talks

# Read the dataset into pandas dataframe, convert duration from seconds to minutes
ted = pd.read_csv('ted_main.csv', encoding='ISO-8859-1')
ted['duration'] = ted['duration'] / 60
ted = ted.round({'duration': 1})

# Select subset of columns & rows (if required)
# data = ted.sample(n=1000) # 'n' can be changed as required
data = ted
selected_cols = ['name', 'event', 'duration', 'views']
data.reset_index(inplace=True)
print(data.head())

Step 3 – Set up the Linear Optimization Problem

Start with defining the LP Object. The prob variable is created to contain the problem formulation

Step 3.1: Create the decision variables

Iterate over each row of the data frame to create the decision variables, such that each talk becomes one decision variable. Since each talk can either be selected or not selected as part of the final watch list, the decision variable is binary in nature (1=Selected, 0=Not Selected)

Step 3.2: Define the Objective Function

The objective function is the sum over all rows of the views for each talk. The views serve as a proxy for the popularity of the talk and so in essence we are trying to maximize the views (popularity) by selecting appropriate talks (decision variables)

Step 3.3: Define the Constraints

In the problem, we have 2 constraints:

a) We only have fixed amount of total time that can be allocated to view the talks

b) We don’t want to view more than a certain number of talks to avoid information overload

Step 3.4: The Final Format (for problem formulation)

The final format of the problem formulated is written out into a .lp file. This will list the objective function, the decision variables and the constraints imposed on the problem.

Step 3.5: The Actual Optimization

The actual optimization is a single line of code that calls ‘prob.solve’. Assert statement is inserted to ascertain whether an optimal result was obtained for the problem.

Step 4 – Convert the Optimization results into an interpretable format

The optimization results which indicates the specific decision variables (talks) that were selected to maximize the outcome has to be converted into a format of a watch list, as shown below:

End Notes

This article provides an example of utilizing Linear Optimization techniques available in Python to solve the everyday problem of creating video watch list. The concepts learned are also applicable in more complex business situations involving thousands of decision variables and many different constraints.

Every data science practitioner needs to add “Optimization techniques” to their body of knowledge so that they can use advanced analytics to solve real world business problems and this article is intended to help you take the first step in that direction.

Karthikeyan Sankaran is currently a Director at LatentView Analytics which provides solutions at the intersection of Business, Technology & Math to business problems across a wide range of industries. Karthik has close to two decades of experience in the Information Technology industry having worked in multiple roles across the space of Data Management, Business Intelligence & Analytics.

This story was received as part of “Blogathon” contest on Analytics Vidhya. Karthikeyan’s entry was one of the winning entries in the competition.

Guest Blog

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Introductory guide to Linear Optimization in Python (with TED videos case study)

Introduction

Table of Contents

Introduction to Linear Optimization

The Problem – Creating the Watch List for TED videos

Step 1 – Import relevant packages

Step 2 – Create a dataframe for TED talks

Step 3 – Set up the Linear Optimization Problem

Step 4 – Convert the Optimization results into an interpretable format

End Notes

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#