In NLP, tf-idf is an important measure, used together with similarity measures such as cosine similarity to find documents that are relevant to a given search query.
Here in this blog, we will break down tf-idf and see how sklearn’s TfidfVectorizer calculates tf-idf values. I had a hard time matching the tf-idf values generated by TfidfVectorizer with the ones I calculated by hand. The reason is that there are many ways in which tf-idf values can be calculated, and we need to be aware of the method that TfidfVectorizer uses. Knowing this will save you a lot of time and effort; I spent a couple of days troubleshooting before I realized the issue.
We will write a simple Python program that uses TfidfVectorizer to calculate tf-idf and manually validate this. Before we get into the coding part, let’s go through a few terms that make up tf-idf.
tf is the number of times a term appears in a particular document, so it is specific to a document. A few of the ways to calculate tf are given below:
tf(t) = No. of times term ‘t’ occurs in a document
OR
tf(t) = (No. of times term ‘t’ occurs in a document) / (Total no. of terms in the document)
OR
tf(t) = (No. of times term ‘t’ occurs in a document) / (Frequency of most common term in a document)
sklearn uses the first one, i.e. the number of times a term ‘t’ appears in a document.
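As a quick illustration, here is a minimal sketch of that raw-count definition (plain whitespace tokenization is assumed purely for illustration; TfidfVectorizer’s own tokenization differs slightly):

doc = "petrol cars are cheaper than diesel cars"
tokens = doc.lower().split()        # naive whitespace tokenization
print(tokens.count("cars"))         # raw count of 'cars' in this document -> 2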
idf is a measure of how common or rare a term is across the entire corpus of documents, so the point to note is that it is common to all the documents. If a word is common and appears in many documents, its (normalized) idf value will approach 0; if it is rare, it will approach 1. A few of the ways to calculate the idf value of a term are given below:
idf(t) = 1 + ln[ n / df(t) ]
OR
idf(t) = ln[ n / df(t) ]
where
ln = natural logarithm (log to the base e)
n = total number of documents available
t = term for which the idf value has to be calculated
df(t) = number of documents in which the term t appears
But as per sklearn’s online documentation, TfidfVectorizer uses the following formulas to calculate the idf of a term.
idf(t) = ln[ (1 + n) / (1 + df(t)) ] + 1 (default, i.e. smooth_idf = True)
and
idf(t) = ln[ n / df(t) ] + 1 (when smooth_idf = False)
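A quick way to sanity-check both formulas is with Python’s math.log (natural logarithm). The numbers below assume a hypothetical term that appears in 1 out of 2 documents, which is exactly the situation we will see later with “cars”:

import math

n = 2        # total number of documents
df_t = 1     # the term appears in only one of them

print(math.log((1 + n) / (1 + df_t)) + 1)   # smooth_idf=True  -> ~1.405465
print(math.log(n / df_t) + 1)               # smooth_idf=False -> ~1.693147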
The tf-idf value of a term in a document is the product of its tf and idf. The higher the value, the more relevant the term is in that document.
Step 1: Import the library
from sklearn.feature_extraction.text import TfidfVectorizer
Step 2: Set up the document corpus
d1="petrol cars are cheaper than diesel cars"
d2="diesel is cheaper than petrol"
doc_corpus=[d1,d2]
print(doc_corpus)
Step 3: Initialize TfidfVectorizer and print the feature names
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names n",vec.get_feature_names_out())
Step 4: Generate a sparse matrix with tf-idf values
print("Sparse Matrix n",matrix.shape,"n",matrix.toarray())
To start with, note the term frequency (tf) of each term: in d1, “cars” appears twice while “cheaper”, “diesel” and “petrol” each appear once; in d2, “cheaper”, “diesel” and “petrol” each appear once and “cars” does not appear at all. (The stop words “are”, “than” and “is” are removed by stop_words='english'.)
Calculate idf values for each term.
As mentioned before, the idf value of a term is common across all documents. Here we will consider the case when smooth_idf = True (the default behaviour). So idf(t) is given by
idf(t) = ln[ (1 + n) / (1 + df(t)) ] + 1
Here n = 2 (the number of documents), and:
idf(“cars”) = ln(3/2) + 1 => 1.405465083
idf(“cheaper”) = ln(3/3) + 1 => 1
idf(“diesel”) = ln(3/3) + 1 => 1
idf(“petrol”) = ln(3/3) + 1 => 1
From the above idf values, we can see that because “cheaper”, “diesel” and “petrol” are common to both documents, they have a lower idf value.
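The same four values can be reproduced in a couple of lines; a minimal sketch, where the df dictionary is simply the document frequencies read off from d1 and d2:

import math

n = 2                                                      # number of documents
df = {"cars": 1, "cheaper": 2, "diesel": 2, "petrol": 2}   # document frequency of each term
idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
print(idf)    # {'cars': 1.405..., 'cheaper': 1.0, 'diesel': 1.0, 'petrol': 1.0}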
Calculate tf-idf of the terms in each document d1 and d2.
For d1:
tf-idf(“cars”) = tf(“cars”) x idf(“cars”) = 2 x 1.405465083 => 2.810930165
tf-idf(“cheaper”) = tf(“cheaper”) x idf(“cheaper”) = 1 x 1 => 1
tf-idf(“diesel”) = tf(“diesel”) x idf(“diesel”) = 1 x 1 => 1
tf-idf(“petrol”) = tf(“petrol”) x idf(“petrol”) = 1 x 1 => 1
For d2:
tf-idf(“cars”) = tf(“cars”) x idf(“cars”) = 0 x 1.405465083 => 0
tf-idf(“cheaper”) = tf(“cheaper”) x idf(“cheaper”) = 1 x 1 => 1
tf-idf(“diesel”) = tf(“diesel”) x idf(“diesel”) = 1 x 1 => 1
tf-idf(“petrol”) = tf(“petrol”) x idf(“petrol”) = 1 x 1 => 1
So we have the tf-idf matrix with shape 2 x 4
[
[2.810930165 1 1 1]
[0 1 1 1]
]
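These products can also be reproduced with NumPy; a minimal sketch, assuming the tf matrix and idf values worked out above (rows are d1 and d2, columns are cars, cheaper, diesel, petrol):

import numpy as np

tf = np.array([[2, 1, 1, 1],
               [0, 1, 1, 1]], dtype=float)          # term frequencies per document
idf = np.array([1.405465083, 1.0, 1.0, 1.0])        # idf values calculated earlier
tfidf = tf * idf                                    # element-wise product, row by row
print(tfidf)
# [[2.81093017 1.         1.         1.        ]
#  [0.         1.         1.         1.        ]]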
We have one final step. To avoid large documents in the corpus dominating smaller ones, we have to normalize each row of the matrix so that it has unit Euclidean (L2) norm.
First document d1
2.810930165 / sqrt( 2.810930165² + 1² + 1² + 1² ) => 0.851354321
1 / sqrt( 2.810930165² + 1² + 1² + 1² ) => 0.302872811
1 / sqrt( 2.810930165² + 1² + 1² + 1² ) => 0.302872811
1 / sqrt( 2.810930165² + 1² + 1² + 1² ) => 0.302872811
Second document d2
0 / sqrt( 0² + 1² + 1² + 1² ) => 0
1 / sqrt( 0² + 1² + 1² + 1² ) => 0.577350269
1 / sqrt( 0² + 1² + 1² + 1² ) => 0.577350269
1 / sqrt( 0² + 1² + 1² + 1² ) => 0.577350269
This gives us the final normalized sparse matrix
[
[0.851354321 0.302872811 0.302872811 0.302872811]
[0 0.577350269 0.577350269 0.577350269]
]
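The normalization step itself is easy to reproduce as well; a minimal sketch using sklearn.preprocessing.normalize, which performs the same row-wise L2 scaling that TfidfVectorizer applies by default (norm='l2'):

import numpy as np
from sklearn.preprocessing import normalize

tfidf = np.array([[2.810930165, 1.0, 1.0, 1.0],     # unnormalized tf-idf matrix from above
                  [0.0,         1.0, 1.0, 1.0]])
print(normalize(tfidf, norm="l2"))                  # each row scaled to unit Euclidean length
# [[0.85135432 0.30287281 0.30287281 0.30287281]
#  [0.         0.57735027 0.57735027 0.57735027]]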
And the above sparse matrix that we just calculated matches the one generated by TfidfVectorizer from sklearn.
Below are a few observations from the tf-idf values of the two documents d1 and d2:
(i) In the first document d1, the term “cars” is the most relevant term as it has the highest tf-idf value (0.851354321)
(ii) In the second document d2, most of the terms have the same tf-idf value and have equal relevance.
The complete Python code to build the sparse matrix using TfidfVectorizer is given below for ready reference.
from sklearn.feature_extraction.text import TfidfVectorizer
doc1="petrol cars are cheaper than diesel cars"
doc2="diesel is cheaper than petrol"
doc_corpus=[doc1,doc2]
print(doc_corpus)
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names n",vec.get_feature_names_out())
print("Sparse Matrix n",matrix.shape,"n",matrix.toarray())
In this blog, we got to know what tf, idf, and tf-idf are, and understood that idf(term) is common across a document corpus while tf-idf(term) is specific to a document. We also used a Python program to generate a tf-idf sparse matrix with sklearn’s TfidfVectorizer and validated the values by hand.
Finding the tf-idf values when “smooth_idf = False” is left as an exercise to the reader. Hope you found this blog useful. Please leave your comments or questions if any.
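As a starting point for that exercise, the only change needed on the sklearn side is the smooth_idf flag in the constructor; the reference matrix to match can then be generated like this:

from sklearn.feature_extraction.text import TfidfVectorizer

doc_corpus = ["petrol cars are cheaper than diesel cars",
              "diesel is cheaper than petrol"]
vec = TfidfVectorizer(stop_words='english', smooth_idf=False)   # idf(t) = ln(n / df(t)) + 1
print(vec.fit_transform(doc_corpus).toarray())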