Movies hold universal appeal, connecting people of all backgrounds. Despite this unity, our individual movie preferences remain distinct, ranging from specific genres like thrillers, romance, or sci-fi to focusing on favorite actors and directors. While it’s challenging to generalize movies everyone would enjoy, data scientists analyze behavioral patterns to identify groups of similar movie preferences in society. As data scientists, we extract valuable insights from audience behavior and movie attributes to develop the “Movie Recommendation System.”
Movie recommendation systems are not just about convenience; they represent a fascinating intersection of data science, machine learning, and user experience design. These systems can make highly personalized recommendations that keep you engaged and satisfied by analyzing vast amounts of data, such as your viewing history, ratings, and even the time you spend watching certain genres. One of the most famous websites for movie recommendations is IMDB. Let’s delve into the “Movie Recommendation System” fundamentals to unlock the magic of personalized movie suggestions using machine learning algorithms.
Learning Outcomes:
This article was published as a part of the Data Science Blogathon.
A Recommendation System is a filtration program whose prime goal is to predict a user’s “rating” or “preference” toward a domain-specific item or item. In our case, this domain-specific item is a movie. Therefore, the main focus of our recommendation system is to filter and predict only those movies that a user would prefer, given some data about the user.
Recommendation Systems are essential for several reasons:
There are 2 types of recommendation systems- Content-based filtering and Collaborative Filtering. Lets discuss them now.
Content-based filtering is a recommendation strategy that suggests items similar to those a user has previously liked. It calculates similarity (often using cosine similarity) between the user’s preferences and item attributes, such as lead actors, directors, and genres. For example, if a user enjoys ‘The Prestige,’ the system recommends movies with Christian Bale, the ‘Thriller’ genre, or films by Christopher Nolan.
However, content-based filtering has drawbacks. It limits exposure to different products, preventing users from exploring a variety of items. This can hinder business expansion as users might not try out new types of products.
Also Read: Beginners Guide to Content Based Recommender System
Collaborative filtering is a recommendation strategy that considers the user’s behavior and compares it with other users in the database. It uses the history of all users to influence the recommendation algorithm. Unlike a content-based recommender system, a collaborative filtering recommender relies on multiple users’ interactions with items to generate suggestions. It doesn’t solely depend on one user’s data for modeling. There are various approaches to implementing collaborative filtering, but the fundamental concept is the collective influence of multiple users on the recommendation outcome.
There are 2 types of collaborative filtering algorithms:
User-based collaborative filtering aims to recommend items to a user (A) based on the preferences of similar users in the database. It involves creating a matrix of items rated/liked/clicked by each user, computing similarity scores between users, and then suggesting items that user A hasn’t encountered yet but similar users have liked.
For example, if user A likes particular movies like Batman Begins, Justice League, and The Avengers and user B like Thor, they share similar interests in the superhero genre. Therefore, it is highly likely that user A would enjoy a popular movie like Thor and user B would like The Avengers, one with the highest movie ratings.
Disadvantages
User-based collaborative filtering has several disadvantages:
Item-based Collaborative Filtering focuses on finding similar movies instead of similar users to recommend to user ‘A’ based on their past preferences. It identifies pairs of movies rated/liked by the same users, measures their similarity across all users who rated both, and then suggests similar films based on the similarity scores.
For example, when comparing movies ‘A’ and ‘B,’ we analyze the ratings from users who rated both movies. If these ratings show high similarity, it indicates that ‘A’ and ‘B’ are similar movies. Thus, if someone liked ‘A’, they should be recommended ‘B’, and vice versa.
Advantages over User-based Collaborative
Filtering includes stable movie preferences, as popular movies do not change people’s tastes. Maintaining and computing matrices is more accessible as there are usually fewer items than users. Shilling attacks are also more challenging since items cannot be faked, making this approach more robust.
In this implementation, when the user searches for a movie, we recommend the top 10 similar movies using our movie recommendation system. We will be using an item-based collaborative filtering algorithm for our purpose. The training set used in this demonstration is the movie lens-small dataset.
First, we need to import libraries which we’ll be using in our movie recommendation system. Also, we’ll import the dataset by adding the path of the CSV files. Then we will have a look at the movies dataset :
Python Code:
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors
# import matplotlib.pyplot as plt
# import seaborn as sns
movies = pd.read_csv("movies.csv")
ratings = pd.read_csv("ratings.csv")
print(movies.head())
The movie dataset has the following:
Code
ratings.head()
Output
Ratings dataset has-
Here, we can see that userId one has watched movieId 1 and 3 and rated them 4.0 but has not rated movieId two. This interpretation is more challenging to extract from this dataframe. Therefore, to make things easier to understand and work with, we will make a new dataframe where each column represents each unique user, and each row represents each unique movie.
Code
final_dataset = ratings.pivot(index='movieId',columns='userId',values='rating')
final_dataset.head()
Output
Now, it’s much easier to interpret that userId 1 has rated movieId 1& 3 4.0 but has not rated movieId 3,4,5 at all (therefore, they are represented as NaN ), and therefore, their rating data is missing.
Let’s fix this and impute NaN with 0 to make the algorithm understandable and the data more eye-soothing.
Code
final_dataset.fillna(0,inplace=True)
final_dataset.head()
Output
In the real world, ratings are very sparse, and data points are mainly collected from Noisetop-rated movies and highly engaged users. We wouldn’t want movies rated by a few users because it’s not credible enough. Similarly, users who have rated only a handful of films should also not be considered.
Considering all that and some trial-and-error experimentation, we will reduce the noise by adding filters to the final dataset.
Aggregating the number of users who voted and the number of movies who voted.
Code
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')
Let’s visualize the number of users who voted with our threshold of 10.
f,ax = plt.subplots(1,1,figsize=(16,4))
# ratings['rating'].plot(kind='hist')
plt.scatter(no_user_voted.index,no_user_voted,color='mediumseagreen')
plt.axhline(y=10,color='r')
plt.xlabel('MovieId')
plt.ylabel('No. of users voted')
plt.show()
Output
Making the necessary modifications as per the threshold set.
Code
final_dataset = final_dataset.loc[no_user_voted[no_user_voted > 10].index,:]
Let’s visualize the number of votes by each user with our threshold of 50.
f,ax = plt.subplots(1,1,figsize=(16,4))
plt.scatter(no_movies_voted.index,no_movies_voted,color='mediumseagreen')
plt.axhline(y=50,color='r')
plt.xlabel('UserId')
plt.ylabel('No. of votes by user')
plt.show()
Output
Making the necessary modifications as per the threshold set.
Code
final_dataset=final_dataset.loc[:,no_movies_voted[no_movies_voted > 50].index]
final_dataset
Output
Our final dataset has dimensions of 2121 * 378, and most of the values are sparse. We are using only a small dataset, but for the original large dataset of movie lenses, which has more than 100,000 features, our system may run out of computational resources when fed to the model. We use the csr_matrix function from the scipy library to reduce the sparsity.
I’ll give an example of how it works :
Code
sample = np.array([[0,0,3,0,0],[4,0,0,0,2],[0,0,0,0,1]])
sparsity = 1.0 - ( np.count_nonzero(sample) / float(sample.size) )
print(sparsity)
Output
Code
csr_sample = csr_matrix(sample)
print(csr_sample)
Output
As you can see, there is no sparse value in the csr_sample, and values are assigned as rows and column indexes. For the 0th row and 2nd column, the value is 3.
Applying the csr_matrix method to the dataset :
csr_data = csr_matrix(final_dataset.values)
final_dataset.reset_index(inplace=True)
We will be using the KNN algorithm to compute similarity with cosine distance metric which is very fast and more preferable than pearson coefficient.
knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
knn.fit(csr_data)
The working principle is very simple. We first check if the movie name input is in the database. If it is, we use our recommendation system to find similar movies, sort them based on their similarity distance, and output only the top 10 movies with their distances from the input movie.
def get_movie_recommendation(movie_name):
n_movies_to_reccomend = 10
movie_list = movies[movies['title'].str.contains(movie_name)]
if len(movie_list):
movie_idx= movie_list.iloc[0]['movieId']
movie_idx = final_dataset[final_dataset['movieId'] == movie_idx].index[0]
distances , indices = knn.kneighbors(csr_data[movie_idx],n_neighbors=n_movies_to_reccomend+1)
rec_movie_indices = sorted(list(zip(indices.squeeze().tolist(),distances.squeeze().tolist())),key=lambda x: x[1])[:0:-1]
recommend_frame = []
for val in rec_movie_indices:
movie_idx = final_dataset.iloc[val[0]]['movieId']
idx = movies[movies['movieId'] == movie_idx].index
recommend_frame.append({'Title':movies.iloc[idx]['title'].values[0],'Distance':val[1]})
df = pd.DataFrame(recommend_frame,index=range(1,n_movies_to_reccomend+1))
return df
else:
return "No movies found. Please check your input"
get_movie_recommendation('Iron Man')
I personally think the results are pretty good. All the movies at the top are superhero or animation movies, which are ideal for kids, as is the input movie “Iron Man”.
Let’s try another one :
get_movie_recommendation('Memento')
All the movies in the top 10 are serious and mindful, just like Memento itself; therefore, I think the result is also good in this case.
Our model works quite well- a movie recommendation system based on user behavior. Hence, we conclude our collaborative filtering here. You can get the complete implementation notebook here.
Also Read: Movies Recommendation System using Python
In conclusion, building a movie recommendation system with machine learning significantly enhances user experience by providing personalized movie suggestions based on individual preferences. Utilizing algorithms such as collaborative filtering and content-based filtering, these systems leverage user behavior data to deliver accurate and relevant recommendations. With the integration of Python and robust datasets like MovieLens, developers can create sophisticated recommendation engines akin to Netflix’s or Amazon Prime’s, using techniques like cosine similarity and matrix factorization. Embracing these technologies enriches user engagement and drives retention and satisfaction, showcasing the transformative power of data science in personalized content delivery.
Key Takeaways:
Join the Blackbelt Plus program to enhance machine-learning skills and build cutting-edge movie recommendation systems. Propel your career in data science and make a difference in the world of personalized movie recommendations. Enroll in the program today and step into the future of movie suggestions!
A. The best algorithm for a movie recommendation system often depends on the specific use case, but collaborative filtering, particularly matrix factorization techniques like Singular Value Decomposition (SVD), is highly effective. It excels by leveraging user-item interaction data to uncover latent features, offering personalized recommendations based on similarities in user preferences and past behaviors.
A. Netflix’s recommendation engine is an example of a movie recommendation system. It utilizes a combination of collaborative filtering, content-based filtering, and deep learning models to analyze viewing habits, ratings, and user interactions, delivering tailored movie and TV show suggestions to enhance user engagement and satisfaction.
A. Movie recommendation systems consider various factors for personalized suggestions, including user viewing history, explicit ratings, implicit feedback (such as viewing duration), demographic information, and content attributes (genre, actors, directors). They also use trending content and social network influences to refine recommendations and capture dynamic user preferences.
Wonderful Blog. on demand mobile app development
You have created no_user_voted but it was groupby of movieId and rating no_user_voted = ratings.groupby('movieId')['rating'].agg('count') You have created no_movies_voted but it was groupby of userId and rating no_movies_voted = ratings.groupby('userId')['rating'].agg('count') I think you wrongly substitute name while creating count data based on used id and movie id
There is a typo in the title for the advantages for item-based collaborative filtering. You have written 'user-based' instead of 'item-based'.