We often don't have enough time to read and understand lengthy documents, research papers, or news articles, yet summarizing a large text while retaining its essential information is crucial in fields such as journalism, research, and business. This is where NLP text summarization comes into play: a technique that automatically generates a condensed version of a given text while preserving its essential meaning. This article explores the two main approaches to NLP text summarization, extractive and abstractive, and examines their applications, strengths, and weaknesses.
Objectives
Broadly, NLP text summarization can be divided into two main categories: extractive summarization and abstractive summarization.
Let's dive a little deeper into each of these categories.
So, what exactly happens in an extractive summarization model? It simply picks out the important sentences or phrases from the original text and joins them to form a summary.
The question is: on what basis are those sentences considered important? A ranking algorithm assigns a score to each sentence based on its relevance to the overall meaning of the document, and the most relevant sentences are then chosen for the summary.
Sentences can be ranked in various ways, for example:
TF-IDF (term frequency-inverse document frequency)
Graph-based methods such as TextRank (a minimal sketch appears right after this list)
Machine learning-based methods such as Support Vector Machines (SVM) and Random Forests.
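As referenced in the list above, here is a rough, hedged sketch of a TextRank-style ranker. It is only an illustration and is not used elsewhere in this article; it assumes networkx and scikit-learn are installed and that NLTK's punkt tokenizer has been downloaded. Sentences are turned into TF-IDF vectors, a similarity graph is built over them, and PageRank scores each sentence.

# Illustrative TextRank-style sketch (assumes networkx, scikit-learn, and NLTK's punkt tokenizer)
import networkx as nx
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text, n):
    # Split the text into sentences and represent each one as a TF-IDF vector
    sentences = sent_tokenize(text)
    tfidf = TfidfVectorizer(stop_words='english').fit_transform(sentences)
    # Build a graph whose nodes are sentences and whose edge weights are cosine similarities
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    # PageRank gives higher scores to sentences that are similar to many other sentences
    scores = nx.pagerank(graph)
    top = sorted(range(len(sentences)), key=scores.get, reverse=True)[:n]
    # Return the top-ranked sentences in their original order
    return ' '.join(sentences[i] for i in sorted(top))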
The main goal of extractive summarization is to preserve the original meaning of the text. This method also works well when the input is already well structured, both visually and logically, such as newspaper articles.
Now, let's turn to the abstractive summarization method. As the name implies, it comes from the word "abstract", meaning an outline or summary that captures the basic idea of a longer text. Unlike extractive summarization, it doesn't simply pick out the important sentences. Instead, it analyzes the input text and generates new phrases or sentences that capture the essence of the original and convey the same meaning more concisely and coherently.
Again, how exactly is the summary generated with this method? In brief, the input text is analyzed by a neural network model trained on large amounts of text data; the model learns the relationships between words and sentences and generates new phrases and sentences that convey the same meaning as the original text in a more understandable manner.
This method uses advanced NLP techniques such as natural language generation (NLG) and deep learning to understand the context and generate the summary. The resulting summaries are usually shorter and more readable than the ones generated by the extractive summarization models, but they can sometimes contain errors or inaccuracies.
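To give a feel for what abstractive summarization looks like in practice, here is a hedged sketch using the Hugging Face transformers library. It is purely illustrative (this article sticks to extractive methods) and assumes the transformers package is installed and can download a default pretrained summarization model.

# Abstractive summarization sketch with Hugging Face transformers (illustrative only)
from transformers import pipeline

long_text = ("Weather is the day-to-day change in the atmosphere. It includes wind, storms, "
             "rain, hail, and snow, and it affects the clothes we wear and the food we eat.")

summarizer = pipeline("summarization")  # loads a default pretrained sequence-to-sequence model
result = summarizer(long_text, max_length=30, min_length=10, do_sample=False)
print(result[0]["summary_text"])        # a newly generated sentence, not a copied one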
Note that in this article, we'll only use the extractive summarization method. Let's focus on extractive summarization and understand it better with an example.
Here, we will implement the extractive summarization models using a Python library called NLTK (Natural Language Toolkit). NLTK provides a wide range of functionalities for natural language processing, including text tokenization, stopword removal, and sentence scoring.
Let’s take a look at the following code that demonstrates how to use NLTK to generate a summary from a given text:
# import the required libraries
import nltk
nltk.download('punkt') # punkt tokenizer for sentence tokenization
nltk.download('stopwords') # list of stop words, such as 'a', 'an', 'the', 'in', etc, which would be dropped
from collections import Counter # Imports the Counter class from the collections module, used for counting the frequency of words in a text.
from nltk.corpus import stopwords # Imports the stop words list from the NLTK corpus
# corpus is a large collection of text or speech data used for statistical analysis
from nltk.tokenize import sent_tokenize, word_tokenize # Imports the sentence tokenizer and word tokenizer from the NLTK tokenizer module.
# Sentence tokenizer is for splitting text into sentences
# word tokenizer is for splitting sentences into words
# This function takes two inputs: the text to summarize and n, the number of sentences the summary should contain
def generate_summary(text, n):
    # Tokenize the text into individual sentences
    sentences = sent_tokenize(text)
    # Build the set of English stop words
    stop_words = set(stopwords.words('english'))
    # Tokenize the full text into words using word_tokenize, drop stop words and
    # non-alphanumeric tokens, and convert everything to lowercase
    words = [word.lower() for word in word_tokenize(text) if word.lower() not in stop_words and word.isalnum()]
    # Compute the frequency of each word
    word_freq = Counter(words)
    # Score each sentence as the sum of the frequency counts of its constituent words.
    # Sentences with 20 or more content words are skipped; this threshold can be adjusted
    # to control how long the sentences allowed into the summary can be.
    sentence_scores = {}
    for sentence in sentences:
        sentence_words = [word.lower() for word in word_tokenize(sentence) if word.lower() not in stop_words and word.isalnum()]
        sentence_score = sum([word_freq[word] for word in sentence_words])
        if len(sentence_words) < 20:
            sentence_scores[sentence] = sentence_score
    # Select the top n sentences with the highest scores and join them into the summary
    summary_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:n]
    summary = ' '.join(summary_sentences)
    return summary
text = '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
summary = generate_summary(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)
print(formatted_summary)
Output
The following output is the summary we get; it contains 5 sentences.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and more.
Wind speed, direction, temperature, and humidity are ways to measure weather.
So, the above code takes a text and a desired number of sentences as input and returns a summary generated using the extractive (frequency-based) approach. The method first tokenizes the text into individual sentences and then tokenizes each sentence into individual words. Stopwords are removed from the words, and then the frequency of each word is computed.
Then, the score for each sentence is computed based on the frequency of its words, and the top n sentences with the highest scores are selected to form the summary. Finally, the summary is generated by joining the selected sentences together.
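To make the scoring concrete, here is a tiny sketch with invented toy sentences (stop-word removal is skipped for brevity) that reproduces the per-sentence scoring used by the function above:

from collections import Counter

toy_sentences = ["cats love fish", "dogs love walks", "cats and dogs love food"]
# Count how often each word appears across the whole text
freq = Counter(word for s in toy_sentences for word in s.split())
# Score each sentence as the sum of its words' frequencies
scores = {s: sum(freq[w] for w in s.split()) for s in toy_sentences}
print(scores)  # the last sentence scores highest because it repeats the most frequent words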
In the next section, we will explore how the extractive summarization models can be further improved using advanced techniques such as TF-IDF.
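Before walking through the implementation, here is a small sketch (with made-up toy sentences) of what TfidfVectorizer produces: each sentence becomes one row of the matrix, each vocabulary word one column, and each cell holds that word's TF-IDF weight in that sentence.

from sklearn.feature_extraction.text import TfidfVectorizer

toy_sentences = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_sentences)
print(vectorizer.get_feature_names_out())  # one column per vocabulary word
print(matrix.toarray().round(2))           # one row of TF-IDF weights per sentence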
# importing the required libraries
# importing TfidfVectorizer class to convert a collection of raw documents to a matrix of TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
# importing cosine_similarity function to compute the cosine similarity between two vectors.
from sklearn.metrics.pairwise import cosine_similarity
# importing nlargest to return the n largest elements from an iterable in descending order.
from heapq import nlargest
def generate_summary(text, n):
    # Tokenize the text into individual sentences (sent_tokenize was imported in the NLTK example above)
    sentences = sent_tokenize(text)
    # Create the TF-IDF matrix; the full document is appended as the last row so that
    # each sentence can be compared against the text as a whole
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences + [text])
    # Compute the cosine similarity between each sentence and the full document
    sentence_scores = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])[0]
    # Select the top n sentences with the highest scores
    summary_sentences = nlargest(n, range(len(sentence_scores)), key=sentence_scores.__getitem__)
    # Join the selected sentences in their original order
    summary_tfidf = ' '.join([sentences[i] for i in sorted(summary_sentences)])
    return summary_tfidf
text = '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
summary = generate_summary(text, 5)
summary_sentences = summary.split('. ')
formatted_summary = '.\n'.join(summary_sentences)
print(formatted_summary)
The following output is the summary we get; it contains 5 sentences.
Energy from the Sun affects the weather, too.
Weather changes can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of the weather.
People try to use these measurements to make weather forecasts for the future.
The above code generates a summary for a given text using the TF-IDF approach. The generate_summary() function takes a text parameter and an n parameter (the number of sentences in the summary). It tokenizes the text into individual sentences, creates a TF-IDF matrix using the TfidfVectorizer class, and computes the cosine similarity between each sentence and the document using the cosine_similarity function.
Next, the function selects the top n sentences with the highest scores using the nlargest function from the heapq library and joins them into a string using the join method.
Okay, before proceeding, let’s quickly understand the cosine similarity. If you already know this, you can skip to the next part.
So, the cosine similarity considers the angle between the vectors of word frequencies for each document rather than just their magnitudes. This means that documents with similar word frequencies and distributions will have a smaller angle between their vectors and, thus, a higher cosine similarity score. Let’s understand this with a simple example.
We have two sentences: "I love cats and dogs." and "I love only cats."
To calculate the cosine similarity between these two sentences, we first need to convert each sentence into a vector representation.
We need to perform the following steps.
1. Break each sentence into individual words (tokenization):
"I love cats and dogs." -> ['I', 'love', 'cats', 'and', 'dogs', '.']
"I love only cats." -> ['I', 'love', 'only', 'cats', '.']
2. Now, create a vocabulary of unique words from both sentences:
[‘I’, ‘love’, ‘cats’, ‘and’, ‘dogs’, ‘.’, ‘only’]
3. Now convert each sentence into a binary vector of size equal to the vocabulary, where 1 represents the presence of the word in the sentence and 0 represents its absence.
“I love cats and dogs.” -> [1, 1, 1, 1, 1, 1, 0]
Explanation:
‘I’ is present, hence 1
‘love’ is present, hence 1
‘cats’ is present, hence 1
‘and’ is present, hence 1
‘dogs’ is present, hence 1
‘.’ is present, hence 1
‘only’ is absent, hence 0
“I love only cats.” -> [1, 1, 1, 0, 0, 1, 1]
Explanation:
‘I’ is present -> 1
‘love’ is present -> 1
‘cats’ is present -> 1
‘and’ is absent -> 0
‘dogs’ is absent -> 0
‘.’ is present -> 1
‘only’ is present -> 1
Each vector has seven elements, corresponding to the seven unique words in the vocabulary; a 1 means the word is present in the sentence and a 0 means it is absent.
Next, we could weight each word with TF-IDF. With only these two short "documents", words that appear in both sentences ('I', 'love', 'cats', '.') would receive an inverse document frequency (IDF) of zero, while words unique to one sentence ('and', 'dogs', 'only') would receive a positive weight. To keep the arithmetic simple, we compute the cosine similarity directly on the binary vectors above.
Finally, we compute the cosine similarity between the two vectors using the formula:
cosine_similarity = (v1 . v2) / (||v1|| * ||v2||)
where v1 and v2 are the vector representations of the sentences, and ‘.’ denotes the dot product of two vectors. ||v1|| and ||v2|| are the Euclidean norms of the two vectors.
Using the vector representations and the formula above, the cosine similarity between the two sentences is:
The dot product of the vectors [1, 1, 1, 1, 1, 1, 0] and [1, 1, 1, 0, 0, 1, 1] is:
1*1 + 1*1 + 1*1 + 1*0 + 1*0 + 1*1 + 0*1 = 4
The magnitude (or Euclidean norm) of the first vector [1, 1, 1, 1, 1, 1, 0] is:
sqrt(1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 1^2 + 0^2) = sqrt(6) ≈ 2.449
Similarly, the magnitude of the second vector [1, 1, 1, 0, 0, 1, 1] is:
sqrt(1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 1^2 + 1^2) = sqrt(5) ≈ 2.236
Therefore, the cosine similarity between the two sentences is:
cosine_similarity = 4 / (2.449 * 2.236) = 4 / 5.477 ≈ 0.73
This indicates that the two sentences are moderately similar but far from identical.
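As a quick sanity check of the arithmetic above, the following sketch computes the same cosine similarity with NumPy on the hand-built binary vectors (note that scikit-learn's default tokenizer would drop single-character tokens such as 'I' and the punctuation, so it would give a different value):

import numpy as np

v1 = np.array([1, 1, 1, 1, 1, 1, 0])  # "I love cats and dogs."
v2 = np.array([1, 1, 1, 0, 0, 1, 1])  # "I love only cats."
cosine = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(cosine, 2))  # -> 0.73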
Let's now check how well our approaches work on a longer text. The following passage is taken from Wikipedia.
Weather is the day-to-day or hour-to-hour change in the atmosphere. It includes wind, lightning, storms, hurricanes, tornadoes (twisters), rain, hail, snow, and lots more. Energy from the Sun also affects the weather. Climate tells us what kinds of weather usually happen in an area at different times of the year. Weather changes can affect our mood and life. We wear different clothes and do different things in different weather conditions. We choose different foods in different seasons.
Weather stations around the world measure different parts of the weather. Wind speed, direction, temperature, and humidity are ways to measure weather. People try to use these measurements to make weather forecasts for the future. These people are scientists who are called meteorologists. They use computers to build large mathematical models to follow weather trends.
How can we check the accuracy of a generated summary for the above text? One way is to use human evaluation as the ground truth. In this approach, we create summaries using each method (frequency-based and TF-IDF) and ask human evaluators to rate the quality of each summary based on criteria such as coherence, readability, and relevance to the original text. We then calculate the average score for each method from the evaluators' ratings, which gives us a quantitative measure of each method's performance.
Another approach is to use ROUGE (Recall-Oriented Understudy for Gisting Evaluation), a commonly used metric for evaluating text summarization models. ROUGE measures the overlap between the generated and reference summaries (i.e., the ground truth).
Let's first go with the human evaluation method. The frequency-based approach produced the following summary:
We wear different clothes and do different things in various weather conditions.
Weather stations around the world measure different parts of the weather.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Wind speed, direction, temperature, and humidity are ways to measure weather.
The TF-IDF approach produced the following summary:
Energy from the Sun affects the weather too.
Weather changes can affect our mood and life.
We wear different clothes and do different things in different weather conditions.
Weather stations around the world measure different parts of the weather.
People try to use these measurements to make weather forecasts for the future.
On average, human evaluators rated the frequency-based approach 4/5 and the TF-IDF approach 3/5.
So, as per human evaluation, the frequency-based approach works better.
Now, let’s see how the machine evaluates.
Now let's see the evaluation using ROUGE. The code below defines a human-written reference summary, and we check how well the automatically generated summaries compare to it.
# in case the rouge package is not installed on your system
!pip install rouge
from rouge import Rouge
# evaluate_rouge takes two arguments, the reference text and the summary text,
# and uses the ROUGE metric to evaluate the quality of the summary compared to the reference.
# It returns the F1 score of the ROUGE-1 metric.
def evaluate_rouge(reference_text, summary_text):
    rouge = Rouge()
    # get_scores expects the generated summary (hypothesis) first, then the reference
    scores = rouge.get_scores(summary_text, reference_text)
    return scores[0]['rouge-1']['f']
# the following is a human generated summary
reference_summary = '''
Weather is a gradual slow change through days and hours in the atmosphere and can vary from wind to snow.
Climate tells a lot about the weather in an area.
The livelihood of people changes according to the change in weather.
Weather stations measure different parts of weather.
People who use measurements to make weather forecasts for the future are called meteorologists, and are scientists.'''
# the sample text from Wikipedia
text = '''
Weather is the day-to-day or hour-to-hour change in the atmosphere.
Weather includes wind, lightning, storms, hurricanes, tornadoes (also known as twisters), rain, hail, snow, and lots more.
Energy from the Sun affects the weather too.
Climate tells us what kinds of weather usually happen in an area at different times of the year.
Changes in weather can affect our mood and life. We wear different clothes and do different things in different weather conditions.
We choose different foods in different seasons.
Weather stations around the world measure different parts of weather.
Ways to measure weather are wind speed, wind direction, temperature and humidity.
People try to use these measurements to make weather forecasts for the future.
These people are scientists that are called meteorologists.
They use computers to build large mathematical models to follow weather trends.'''
# Generate summary using frequency-based/TF-IDF approach
summary = generate_summary(text, 5)
# Evaluate the summary using ROUGE
rouge_score = evaluate_rouge(reference_summary, summary)
print(f"ROUGE score: {rouge_score}")
# For frequency based approach we are getting a score of 0.336
# For TF-IDF approach we are getting a score of 0.465
Here, a reference summary and a text are defined. Then, a summary is generated from the text using the frequency-based or the TF-IDF version of generate_summary(). Next, the generated summary's ROUGE score is evaluated against the reference summary using the evaluate_rouge() function. The ROUGE score measures the overlap between the generated and reference summaries; the higher the score, the more similar the two summaries are.
Now, here, for the frequency-based approach, we get a score of 0.336; using the TF-IDF approach, we get a score of 0.465. So, in this evaluation method, the TF-IDF approach works better.
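If you want to look beyond the ROUGE-1 F1 score, the rouge package also reports ROUGE-2 and ROUGE-L, each with recall, precision, and F1. A small sketch reusing the summary and reference_summary variables defined above:

# Inspect all ROUGE variants for the generated summary
scores = Rouge().get_scores(summary, reference_summary)[0]
for metric, values in scores.items():  # 'rouge-1', 'rouge-2', 'rouge-l'
    print(metric, {k: round(v, 3) for k, v in values.items()})  # keys: 'r', 'p', 'f'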
Aspect | Extractive Summarization | Abstractive Summarization
Language | Uses the same sentences as in the original text | Generates new text that differs from the original but captures its essence
Complexity | Less complex than the abstractive method | More complex; relies on NLG and deep learning
Accuracy | Tends to be more accurate, since it selects direct sentences from the text | Paraphrases the original and can sometimes contain errors or inaccuracies
Domain suitability | Suitable for domain-specific cases with less language variation | More suitable for general texts
This field keeps climbing the technology ladder as R&D teams explore new techniques every day, and advances in machine learning and NLP will gradually improve the quality and accuracy of generated summaries.
It also increasingly relies on deep learning models such as recurrent neural networks and transformers, which lead to a better understanding of a text's content; further advancements in language generation techniques will enable more sophisticated abstractive summarization methods.
These advanced solutions will help us save time, increase productivity, and make information more accessible and easily digestible.
Text summarization is a fast-growing field in natural language processing, and it has the potential to revolutionize the way we consume and process information. In this article, we covered the two main approaches to summarization, implemented extractive summarization with frequency-based and TF-IDF scoring, and evaluated the resulting summaries using human judgment and ROUGE.
Frequently Asked Questions
Q. What is extractive text summarization?
A. Extractive text summarization involves selecting key sentences or phrases directly from the source text to form a concise summary. It identifies important parts based on statistical or linguistic features without generating new sentences or altering the original content.
Q. Which is the best extractive summarizer?
A. Determining the "best" extractive summarizer depends on the application. However, popular tools include BERTSUM and Sumy. BERT-based models like BERTSUM achieve high accuracy by leveraging contextual embeddings.
Q. What is the difference between extractive and abstractive summarization?
A. Extractive summarization selects key sentences directly from the source, while abstractive summarization generates new sentences that paraphrase and condense the original content, providing more coherent and human-like summaries.
Q. What is text summarization in NLP?
A. Text summarization in NLP aims to create shorter versions of texts while retaining essential information. It includes two main types: extractive summarization (selecting key text segments) and abstractive summarization (generating new condensed text).