Building a Movie Recommendation System with Machine Learning

Soham Last Updated : 17 Feb, 2025

10 min read

Movies hold universal appeal, connecting people of all backgrounds. Despite this unity, our individual movie preferences remain distinct, ranging from specific genres like thrillers, romance, or sci-fi to focusing on favorite actors and directors. While it’s challenging to generalize movies everyone would enjoy, data scientists analyze behavioral patterns to identify groups of similar movie preferences in society. As data scientists, we extract valuable insights from audience behavior and movie attributes to develop the Movie recommendation system with machine learning.

Movie recommendation systems are not just about convenience; they represent a fascinating intersection of data science, machine learning, and user experience design. These systems can make highly personalized recommendations that keep you engaged and satisfied by analyzing vast amounts of data, such as your viewing history, ratings, and even the time you spend watching certain genres. One of the most famous websites for movie recommendations is IMDB. Let’s delve into the “Movie Recommendation System” fundamentals to unlock the magic of personalized movie suggestions using machine learning algorithms.

Learning Outcomes:

Gain a comprehensive understanding of recommendation algorithms, their purpose, and how they function to predict user preferences.
Learn why recommendation engines enhance user experience, increase engagement, and drive revenue in various platforms.
Explore different recommendation engine types, specifically content-based and collaborative filtering, including their methods, advantages, and limitations.
Acquire practical knowledge of implementing a movie recommendation system using Python.
Understand the application of machine learning models and machine learning algorithms to build a recommendation engine.

This article was published as a part of the Data Science Blogathon.

What is a Recommendation System?
Why Recommendation Systems?
Types of Recommendation Systems
- Content-based Filtering
- Collaborative Filtering
Movie Recommendation System Code
Finally, Let’s Recommend Some Movies!
Conclusion

What is a Recommendation System?

A Recommendation System is a filtration program whose prime goal is to predict a user’s “rating” or “preference” toward a domain-specific item or item. In our case, this domain-specific item is a movie. Therefore, the main focus of our recommendation system is to filter and predict only those movies that a user would prefer, given some data about the user.

Why Recommendation Systems?

Recommendation Systems are essential for several reasons:

Recommendation Systems offer personalized suggestions based on user preferences and user ratings, ensuring that they discover content and products that are relevant and interesting to them.
By providing tailored recommendations, users are more likely to engage with the platform, increasing user satisfaction and retention.
E-commerce platforms like Amazon use recommendation engines to promote products, leading to higher sales and revenue as users discover and purchase items they might not have considered.
In today’s digital landscape, recommendation systems help users navigate the overwhelming amount of available content, making it easier to find a particular movie or series they seek or in any language they want, like English or Korean.
Recommendation algorithms expose users to new and diverse content, expanding their horizons and introducing them to items they might have overlooked.
Relying on past behavior and preferences, recommendation systems help users make informed decisions about complex and subjective choices such as movies, music, or books.

Types of Recommendation Systems

There are 2 types of recommendation systems- Content-based filtering and Collaborative Filtering. Lets discuss them now.

Content-based Filtering

Content-based filtering is a recommendation strategy that suggests items similar to those a user has previously liked. It calculates similarity (often using cosine similarity) between the user’s preferences and item attributes, such as lead actors, directors, and genres. For example, if a user enjoys ‘The Prestige,’ the system recommends movies with Christian Bale, the ‘Thriller’ genre, or films by Christopher Nolan.

However, content-based filtering has drawbacks. It limits exposure to different products, preventing users from exploring a variety of items. This can hinder business expansion as users might not try out new types of products.

Also Read: Beginners Guide to Content Based Recommender System

Collaborative Filtering

Collaborative filtering is a recommendation strategy that considers the user’s behavior and compares it with other users in the database. It uses the history of all users to influence the recommendation algorithm. Unlike a content-based recommender system, a collaborative filtering recommender relies on multiple users’ interactions with items to generate suggestions. It doesn’t solely depend on one user’s data for modeling. There are various approaches to implementing collaborative filtering, but the fundamental concept is the collective influence of multiple users on the recommendation outcome.

There are 2 types of collaborative filtering algorithms:

User-based Collaborative filtering

User-based collaborative filtering aims to recommend items to a user (A) based on the preferences of similar users in the database. It involves creating a matrix of items rated/liked/clicked by each user, computing similarity scores between users, and then suggesting items that user A hasn’t encountered yet but similar users have liked.

For example, if user A likes particular movies like Batman Begins, Justice League, and The Avengers and user B like Thor, they share similar interests in the superhero genre. Therefore, it is highly likely that user A would enjoy a popular movie like Thor and user B would like The Avengers, one with the highest movie ratings.

Disadvantages

User-based collaborative filtering has several disadvantages:

Fickle User Preferences: User preferences can change over time, making initial similarity patterns between users outdated. This can result in inaccurate recommendations as users’ tastes evolve and they start watching new movies.
Large Matrices: As the number of users is typically much larger than the number of items, maintaining large matrices becomes challenging and resource-intensive. Regular recomputation is required to keep the data up-to-date.
Vulnerability to Shilling Attacks: Shilling attacks involve creating fake user profiles with biased preference patterns to manipulate the recommendation system. User-based collaborative filtering is susceptible to such attacks, potentially leading to biased and manipulated recommendations.

Item-based Collaborative Filtering

Item-based Collaborative Filtering focuses on finding similar movies instead of similar users to recommend to user ‘A’ based on their past preferences. It identifies pairs of movies rated/liked by the same users, measures their similarity across all users who rated both, and then suggests similar films based on the similarity scores.

For example, when comparing movies ‘A’ and ‘B,’ we analyze the ratings from users who rated both movies. If these ratings show high similarity, it indicates that ‘A’ and ‘B’ are similar movies. Thus, if someone liked ‘A’, they should be recommended ‘B’, and vice versa.

Advantages over User-based Collaborative

Filtering includes stable movie preferences, as popular movies do not change people’s tastes. Maintaining and computing matrices is more accessible as there are usually fewer items than users. Shilling attacks are also more challenging since items cannot be faked, making this approach more robust.

Movie Recommendation System Code

In this implementation, when the user searches for a movie, we recommend the top 10 similar movies using our movie recommendation system. We will be using an item-based collaborative filtering algorithm for our purpose. The training set used in this demonstration is the movie lens-small dataset.

Getting the Data Up and Running

First, we need to import libraries which we’ll be using in our movie recommendation system with machine Learning. Also, we’ll import the dataset by adding the path of the CSV files. Then we will have a look at the movies dataset :

Python Code:

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
# import matplotlib.pyplot as plt
# import seaborn as sns
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")

print(movies.head())

The movie dataset has the following:

movieId: Once the recommendation is made, we get a list of all similar movieIds and the title for each movie from this dataset.
Genres: Which are not required for this filtering approach.

Code

ratings.head()

Output

Ratings dataset has-

userId: Unique for each user.
movieId: We take the movie title from the movie dataset using this feature.
Rating: Each user gives ratings for all the films. Using this, we are going to predict the top 10 similar films.

Here, we can see that userId one has watched movieId 1 and 3 and rated them 4.0 but has not rated movieId two. This interpretation is more challenging to extract from this dataframe. Therefore, to make things easier to understand and work with, we will make a new dataframe where each column represents each unique user, and each row represents each unique movie.

Code

final_dataset = ratings.pivot(index='movieId',columns='userId',values='rating')
final_dataset.head()

Output

Now, it’s much easier to interpret that userId 1 has rated movieId 1& 3 4.0 but has not rated movieId 3,4,5 at all (therefore, they are represented as NaN ), and therefore, their rating data is missing.

Let’s fix this and impute NaN with 0 to make the algorithm understandable and the data more eye-soothing.

Code

final_dataset.fillna(0,inplace=True)
final_dataset.head()

Output

Removing Noise from the Data

In the real world, ratings are very sparse, and data points are mainly collected from Noisetop-rated movies and highly engaged users. We wouldn’t want movies rated by a few users because it’s not credible enough. Similarly, users who have rated only a handful of films should also not be considered.

Considering all that and some trial-and-error experimentation, we will reduce the noise by adding filters to the final dataset.

To qualify for a movie, a minimum of 10 users should hoist a film. The user should have voted for at least 50 movies to qualify users.

Let’s Visualize How These Filters Look Like

Aggregating the number of users who voted and the number of movies who voted.

Code

no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')

Let’s visualize the number of users who voted with our threshold of 10.

f,ax = plt.subplots(1,1,figsize=(16,4))
# ratings['rating'].plot(kind='hist')
plt.scatter(no_user_voted.index,no_user_voted,color='mediumseagreen')
plt.axhline(y=10,color='r')
plt.xlabel('MovieId')
plt.ylabel('No. of users voted')
plt.show()

Output

Making the necessary modifications as per the threshold set.

Code

final_dataset = final_dataset.loc[no_user_voted[no_user_voted > 10].index,:]

Let’s visualize the number of votes by each user with our threshold of 50.

f,ax = plt.subplots(1,1,figsize=(16,4))
plt.scatter(no_movies_voted.index,no_movies_voted,color='mediumseagreen')
plt.axhline(y=50,color='r')
plt.xlabel('UserId')
plt.ylabel('No. of votes by user')
plt.show()

Output

Making the necessary modifications as per the threshold set.

Code

final_dataset=final_dataset.loc[:,no_movies_voted[no_movies_voted > 50].index]
final_dataset

Output

Removing Sparsity

Our final dataset has dimensions of 2121 * 378, and most of the values are sparse. We are using only a small dataset, but for the original large dataset of movie lenses, which has more than 100,000 features, our system may run out of computational resources when fed to the model. We use the csr_matrix function from the scipy library to reduce the sparsity.

I’ll give an example of how it works :

Code

sample = np.array([[0,0,3,0,0],[4,0,0,0,2],[0,0,0,0,1]])
sparsity = 1.0 - ( np.count_nonzero(sample) / float(sample.size) )
print(sparsity)

Output

Code

csr_sample = csr_matrix(sample)
print(csr_sample)

Output

As you can see, there is no sparse value in the csr_sample, and values are assigned as rows and column indexes. For the 0th row and 2nd column, the value is 3.

Applying the csr_matrix method to the dataset :

csr_data = csr_matrix(final_dataset.values)
final_dataset.reset_index(inplace=True)

Making the Movie Recommendation System

We will be using the KNN algorithm to compute similarity with cosine distance metric which is very fast and more preferable than pearson coefficient.

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)

Making the Recommendation Function

The working principle is very simple. We first check if the movie name input is in the database. If it is, we use our recommendation system to find similar movies, sort them based on their similarity distance, and output only the top 10 movies with their distances from the input movie.

def get_movie_recommendation(movie_name):
    n_movies_to_reccomend = 10
    movie_list = movies[movies['title'].str.contains(movie_name)]  
    if len(movie_list):        
        movie_idx= movie_list.iloc[0]['movieId']
        movie_idx = final_dataset[final_dataset['movieId'] == movie_idx].index[0]
        distances , indices = knn.kneighbors(csr_data[movie_idx],n_neighbors=n_movies_to_reccomend+1)    
        rec_movie_indices = sorted(list(zip(indices.squeeze().tolist(),distances.squeeze().tolist())),key=lambda x: x[1])[:0:-1]
        recommend_frame = []
        for val in rec_movie_indices:
            movie_idx = final_dataset.iloc[val[0]]['movieId']
            idx = movies[movies['movieId'] == movie_idx].index
            recommend_frame.append({'Title':movies.iloc[idx]['title'].values[0],'Distance':val[1]})
        df = pd.DataFrame(recommend_frame,index=range(1,n_movies_to_reccomend+1))
        return df
    else:
        return "No movies found. Please check your input"

get_movie_recommendation('Iron Man')

I personally think the results are pretty good. All the movies at the top are superhero or animation movies, which are ideal for kids, as is the input movie “Iron Man”.

Let’s try another one :

get_movie_recommendation('Memento')

All the movies in the top 10 are serious and mindful, just like Memento itself; therefore, I think the result is also good in this case.

Our model works quite well- a movie recommendation system based on user behavior. Hence, we conclude our collaborative filtering here. You can get the complete implementation notebook here.

Also Read: Movies Recommendation System using Python

Conclusion

In conclusion, building a movie recommendation system with machine learning significantly enhances user experience by providing personalized movie suggestions based on individual preferences. Utilizing algorithms such as collaborative filtering and content-based filtering, these systems leverage user behavior data to deliver accurate and relevant recommendations. With the integration of Python and robust datasets like MovieLens, developers can create sophisticated recommendation engines akin to Netflix’s or Amazon Prime’s, using techniques like cosine similarity and matrix factorization. Embracing these technologies enriches user engagement and drives retention and satisfaction, showcasing the transformative power of data science in personalized content delivery.

Key Takeaways:

Content-based filtering recommends items similar to those a user has liked before, while collaborative filtering suggests items based on similarities between users’ behaviors and preferences.
Learn about the challenges such as fickle user preferences, large matrices, and vulnerability to shilling attacks in user-based collaborative filtering and how item-based collaborative filtering can mitigate some of these issues.
Understand the importance of data preparation, including removing noise and handling sparsity, to ensure the recommendation system functions effectively and efficiently.
Gain insights into the practical steps and Python code required to build and test a movie recommendation system, including data importing, matrix manipulation, and model training.
Recognize the real-world impact of recommendation systems, such as those used by Netflix, and how they improve user satisfaction by providing personalized and relevant content suggestions.

Join the Blackbelt Plus program to enhance your machine learning skills and create advanced movie recommendation systems. Enroll today and elevate your data science career!

Frequently Asked Questions

Q1. Which is the best algorithm for a movie recommendation system?

A. The best algorithm for a movie recommendation system often depends on the specific use case, but collaborative filtering, particularly matrix factorization techniques like Singular Value Decomposition (SVD), is highly effective. It excels by leveraging user-item interaction data to uncover latent features, offering personalized recommendations based on similarities in user preferences and past behaviors.

Q2. What is an example of a movie recommendation system?

A. Netflix’s recommendation engine is an example of a movie recommendation system. It utilizes a combination of collaborative filtering, content-based filtering, and deep learning models to analyze viewing habits, ratings, and user interactions, delivering tailored movie and TV show suggestions to enhance user engagement and satisfaction.

Q3. What factors do movie recommendation systems consider for personalizing suggestions?

A. Movie recommendation systems consider various factors for personalized suggestions, including user viewing history, explicit ratings, implicit feedback (such as viewing duration), demographic information, and content attributes (genre, actors, directors). They also use trending content and social network influences to refine recommendations and capture dynamic user preferences.

Soham

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Dai Software

Wonderful Blog. on demand mobile app development

BHAVIK AMARDAS DUDHREJIYA

You have created no_user_voted but it was groupby of movieId and rating no_user_voted = ratings.groupby('movieId')['rating'].agg('count') You have created no_movies_voted but it was groupby of userId and rating no_movies_voted = ratings.groupby('userId')['rating'].agg('count') I think you wrongly substitute name while creating count data based on used id and movie id

ROGERS Mugambi

There is a typo in the title for the advantages for item-based collaborative filtering. You have written 'user-based' instead of 'item-based'.

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Building a Movie Recommendation System with Machine Learning

Table of contents

What is a Recommendation System?

Why Recommendation Systems?

Types of Recommendation Systems

Content-based Filtering

Collaborative Filtering

User-based Collaborative filtering

Item-based Collaborative Filtering

Movie Recommendation System Code

Getting the Data Up and Running

Removing Noise from the Data

Let’s Visualize How These Filters Look Like

Removing Sparsity

Making the Movie Recommendation System

Making the Recommendation Function

Finally, Let’s Recommend Some Movies!

Conclusion

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck