Understanding the significance of a word in a text is crucial for analyzing and interpreting large volumes of data. This is where the term frequency-inverse document frequency (TF-IDF) technique in Natural Language Processing (NLP) comes into play. By overcoming the limitations of the traditional bag of words approach, TF-IDF enhances text classification and bolsters machine learning models’ ability to comprehend and analyze textual information effectively. This article will show you how to build a TF-IDF model from scratch in Python and how to compute it numerically.
For the hands-on part, we'll use the TfidfVectorizer from scikit-learn to convert text documents into a TF-IDF matrix. Before diving into the calculations and code, it's essential to understand the key terms:
The frequency with which a term occurs in a document is measured by term frequency (TF). A term's weight in a document is directly correlated with how often it appears there. The TF formula is:

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
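As a minimal sketch of this formula in plain Python (the `tf` helper name and the simple punctuation stripping are illustrative choices, not part of any library):

```python
def tf(term, document):
    """Term frequency: occurrences of `term` divided by the total word count."""
    words = document.lower().replace(",", "").replace(".", "").split()
    return words.count(term.lower()) / len(words)

print(tf("sky", "The sky is blue."))  # 1/4 = 0.25
```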
The significance of a term within a corpus is gauged by its document frequency (DF). Unlike TF, which counts the occurrences of a term within a single document, DF counts the number of documents that contain the term at least once. The DF formula is:

DF(t) = number of documents containing the term t
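A matching sketch for document frequency, using the same simple tokenization as above (again, `df` is just an illustrative helper name):

```python
def df(term, documents):
    """Document frequency: number of documents containing `term` at least once."""
    term = term.lower()
    return sum(
        1 for doc in documents
        if term in doc.lower().replace(",", "").replace(".", "").split()
    )
```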
The informativeness of a word is measured by its inverse document frequency (IDF). TF gives every term equal weight, whereas IDF scales up uncommon terms and scales down common ones (like stop words). The IDF formula used in this article is:

IDF(t) = log(N / (DF(t) + 1))

where N is the total number of documents and DF(t) is the number of documents containing the term t. The +1 in the denominator is a smoothing term that avoids division by zero; note that it makes IDF slightly negative for terms that appear in every document, which is why "the" gets a negative score in the tables below.
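Expressed in code, reusing the `df` helper above (`math.log` is the natural logarithm, which matches the numbers in the tables below):

```python
import math

def idf(term, documents):
    """Smoothed inverse document frequency: log(N / (DF(t) + 1))."""
    return math.log(len(documents) / (df(term, documents) + 1))
```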
TF-IDF stands for Term Frequency-Inverse Document Frequency, a statistical measure used to evaluate how important a word is to a document in a collection or corpus. It combines a term's importance within a document (TF) with the term's rarity across the corpus (IDF). The formula is:

TF-IDF(t, d) = TF(t, d) × IDF(t)
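Combining the `tf` and `idf` sketches above into a single scoring function:

```python
def tf_idf(term, document, documents):
    """TF-IDF score of `term` in `document`, relative to the corpus `documents`."""
    return tf(term, document) * idf(term, documents)
```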
Let's break down the numerical calculation of TF-IDF for four short example documents:
Document 1: “The sky is blue.”
| Term | Count | TF |
|------|-------|-----|
| the | 1 | 1/4 |
| sky | 1 | 1/4 |
| is | 1 | 1/4 |
| blue | 1 | 1/4 |
Document 2: “The sun is bright today.”
| Term | Count | TF |
|------|-------|-----|
| the | 1 | 1/5 |
| sun | 1 | 1/5 |
| is | 1 | 1/5 |
| bright | 1 | 1/5 |
| today | 1 | 1/5 |
Document 3: “The sun in the sky is bright.”
| Term | Count | TF |
|------|-------|-----|
| the | 2 | 2/7 |
| sun | 1 | 1/7 |
| in | 1 | 1/7 |
| sky | 1 | 1/7 |
| is | 1 | 1/7 |
| bright | 1 | 1/7 |
Document 4: “We can see the shining sun, the bright sun.”
| Term | Count | TF |
|------|-------|-----|
| we | 1 | 1/9 |
| can | 1 | 1/9 |
| see | 1 | 1/9 |
| the | 2 | 2/9 |
| shining | 1 | 1/9 |
| sun | 2 | 2/9 |
| bright | 1 | 1/9 |
Next, compute each term's document frequency and IDF across the whole corpus. Using N = 4:
| Term | DF | IDF |
|------|----|-----|
| the | 4 | log(4/(4+1)) = log(0.8) ≈ -0.223 |
| sky | 2 | log(4/(2+1)) = log(1.333) ≈ 0.287 |
| is | 3 | log(4/(3+1)) = log(1) = 0 |
| blue | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
| sun | 3 | log(4/(3+1)) = log(1) = 0 |
| bright | 3 | log(4/(3+1)) = log(1) = 0 |
| today | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
| in | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
| we | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
| can | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
| see | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
| shining | 1 | log(4/(1+1)) = log(2) ≈ 0.693 |
Now, let’s calculate the TF-IDF values for each term in each document.
Document 1: “The sky is blue.”
| Term | TF | IDF | TF-IDF |
|------|-----|-------|--------|
| the | 0.25 | -0.223 | 0.25 × -0.223 ≈ -0.056 |
| sky | 0.25 | 0.287 | 0.25 × 0.287 ≈ 0.072 |
| is | 0.25 | 0 | 0.25 × 0 = 0 |
| blue | 0.25 | 0.693 | 0.25 × 0.693 ≈ 0.173 |
Document 2: “The sun is bright today.”
| Term | TF | IDF | TF-IDF |
|------|-----|-------|--------|
| the | 0.2 | -0.223 | 0.2 × -0.223 ≈ -0.045 |
| sun | 0.2 | 0 | 0.2 × 0 = 0 |
| is | 0.2 | 0 | 0.2 × 0 = 0 |
| bright | 0.2 | 0 | 0.2 × 0 = 0 |
| today | 0.2 | 0.693 | 0.2 × 0.693 ≈ 0.139 |
Document 3: “The sun in the sky is bright.”
| Term | TF | IDF | TF-IDF |
|------|-----|-------|--------|
| the | 0.286 | -0.223 | 0.286 × -0.223 ≈ -0.064 |
| sun | 0.143 | 0 | 0.143 × 0 = 0 |
| in | 0.143 | 0.693 | 0.143 × 0.693 ≈ 0.099 |
| sky | 0.143 | 0.287 | 0.143 × 0.287 ≈ 0.041 |
| is | 0.143 | 0 | 0.143 × 0 = 0 |
| bright | 0.143 | 0 | 0.143 × 0 = 0 |
Document 4: “We can see the shining sun, the bright sun.”
| Term | TF | IDF | TF-IDF |
|------|-----|-------|--------|
| we | 0.111 | 0.693 | 0.111 × 0.693 ≈ 0.077 |
| can | 0.111 | 0.693 | 0.111 × 0.693 ≈ 0.077 |
| see | 0.111 | 0.693 | 0.111 × 0.693 ≈ 0.077 |
| the | 0.222 | -0.223 | 0.222 × -0.223 ≈ -0.050 |
| shining | 0.111 | 0.693 | 0.111 × 0.693 ≈ 0.077 |
| sun | 0.222 | 0 | 0.222 × 0 = 0 |
| bright | 0.111 | 0 | 0.111 × 0 = 0 |
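All of the tables above can be reproduced with a short, self-contained script. This is a minimal sketch under the same assumptions as the worked example (lowercasing, simple punctuation stripping, natural log, and the smoothed IDF formula); the `tokenize` helper is an illustrative name:

```python
import math

documents = [
    "The sky is blue.",
    "The sun is bright today.",
    "The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.",
]

def tokenize(text):
    """Lowercase and strip the commas and periods used in these examples."""
    return text.lower().replace(",", "").replace(".", "").split()

N = len(documents)
tokenized = [tokenize(doc) for doc in documents]

for i, words in enumerate(tokenized, start=1):
    print(f"Document {i}: {documents[i - 1]}")
    for term in dict.fromkeys(words):                # unique terms, in order
        tf = words.count(term) / len(words)          # term frequency
        df = sum(term in doc for doc in tokenized)   # document frequency
        idf = math.log(N / (df + 1))                 # smoothed IDF
        print(f"  {term:>8}: TF={tf:.3f}  IDF={idf:+.3f}  TF-IDF={tf * idf:+.3f}")
```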
Now let's apply the TF-IDF calculation using the TfidfVectorizer from scikit-learn on the 20 Newsgroups dataset, which scikit-learn can fetch for you.
Ensure you have scikit-learn installed:
```bash
pip install scikit-learn
```
Import the required libraries:

```python
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
```
Fetch the training split of the 20 Newsgroups dataset and create a vectorizer (here limited to the 1,000 most frequent features, with English stop words removed):

```python
newsgroups = fetch_20newsgroups(subset='train')
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
```
Convert the text documents to a TF-IDF matrix:

```python
tfidf_matrix = vectorizer.fit_transform(newsgroups.data)
```
Convert the matrix to a DataFrame for better readability:

```python
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
df_tfidf.head()
```
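As a quick follow-up, you can inspect which terms received the highest weights in a particular document; this small sketch builds on the df_tfidf DataFrame created above:

```python
# Ten highest-weighted terms in the first document of the corpus
print(df_tfidf.iloc[0].sort_values(ascending=False).head(10))
```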
By using the 20 Newsgroups dataset and TfidfVectorizer, you can convert a large collection of text documents into a TF-IDF matrix. This matrix numerically represents the importance of each term in each document, facilitating various NLP tasks such as text classification, clustering, and more advanced text analysis. The TfidfVectorizer from scikit-learn provides an efficient and straightforward way to achieve this transformation.
Q. Why do we take the log of IDF?
Ans. Taking the log of IDF helps to scale down the effect of extremely common words and prevents the IDF values from exploding, especially in large corpora. It keeps IDF values manageable and reduces the impact of words that appear very frequently across documents.
Q. Can TF-IDF be used for large datasets?
Ans. Yes, TF-IDF can be used for large datasets. However, an efficient implementation and adequate computational resources are required to handle the large matrix computations involved.
Q. What is the main limitation of TF-IDF?
Ans. TF-IDF's main limitation is that it doesn't account for word order or context; it treats each term independently and can therefore miss the nuanced meaning of phrases or the relationships between words.
Q. What is TF-IDF used for?
Ans. TF-IDF is used in various applications, including:
1. Search engines to rank documents based on relevance to a query
2. Text classification to identify the most significant words for categorizing documents
3. Clustering to group similar documents based on key terms
4. Text summarization to extract important sentences from a document