Automated Text Summarization with Sumy Library

Abid Mohammed Last Updated : 29 Jul, 2024

Introduction

Imagine you’re tasked with reading through mountains of documents and extracting the key points to make sense of it all. It feels overwhelming, right? That’s where Sumy comes in, acting like a digital assistant that condenses long texts into concise, digestible summaries. This article walks through Sumy’s capabilities, from its summarization algorithms to practical implementation tips, so that summarization becomes an efficient, almost effortless process.

Learning Objectives

  • Understand all the benefits of using the Sumy library.
  • Understand how to install this library via PyPI.
  • Learn how to create a tokenizer and a stemmer using the Sumy library.
  • Implement different summarization algorithms like Luhn, Edmundson, and LSA provided by Sumy.

This article was published as a part of the Data Science Blogathon.

What is Sumy Library?

Sumy is a Python library for Natural Language Processing, used mainly for automatic summarization of text. It offers summarizers based on various algorithms, such as Luhn, Edmundson, LSA, LexRank, and KL-summarizers. We will look at Luhn, Edmundson, and LSA in the upcoming sections. Sumy requires minimal code to build a summary, and it can be easily integrated with other Natural Language Processing tasks. This library is suitable for summarizing large documents.

Benefits of Using Sumy

  • Sumy provides many summarization algorithms, allowing users to choose from a wide range of summarizers based on their preferences.
  • This library integrates efficiently with other NLP libraries.
  • The library is easy to install and use, requiring minimal setup.
  • We can summarize lengthy documents using this library.
  • Sumy can be easily customized to fit specific summarization needs.

Installation of Sumy

Now let’s look at how to install this library on our system.

To install it via PyPI, paste the command below into your terminal.

pip install sumy

If you are working in a notebook such as Jupyter Notebook, Kaggle, or Google Colab, then add ‘!’ before the above command.
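
After installing, it can be handy to confirm that the package is visible to the Python interpreter you are actually using. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration and is not part of Sumy.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "sumy" is the import name of the package installed with pip above.
    print("sumy installed:", is_installed("sumy"))
```

If this prints False inside a notebook, the install command probably ran against a different environment than the notebook kernel.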

Building a Tokenizer with Sumy

Tokenization is one of the most important tasks in text preprocessing. In tokenization, we divide a paragraph into sentences and then break those sentences down into individual words. By tokenizing the text, Sumy can better understand its structure and meaning, which improves the accuracy and quality of the summaries generated.

Now, let’s see how to build a tokenizer using the Sumy library. We will first import the Tokenizer class from Sumy and download the ‘punkt’ data from NLTK. Then we will create an instance of Tokenizer for the English language, split a sample text into sentences, and print the tokenized words for each sentence.

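from sumy.nlp.tokenizers import Tokenizer
import nltk

# Sumy's tokenizer relies on NLTK's Punkt sentence data.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('punkt')

tokenizer = Tokenizer("english")

# Adjacent string literals inside parentheses are joined into one string,
# which avoids an unterminated string spread over several lines.
text = (
    "Hello, this is Analytics Vidhya! We offer a wide "
    "range of articles, tutorials, and resources on various topics in AI and Data Science. "
    "Our mission is to provide quality education and knowledge sharing to help you excel "
    "in your career and academic pursuits. Whether you're a beginner looking to learn "
    "the basics of coding or an experienced developer seeking advanced concepts, "
    "Analytics Vidhya has something for everyone."
)

sentences = tokenizer.to_sentences(text)

# Print the word tokens of each sentence.
for sentence in sentences:
    print(tokenizer.to_words(sentence))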

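
To see what tokenization produces without any external data, here is a rough pure-Python sketch based on regular expressions. It is only an approximation for illustration and not how Sumy or NLTK tokenize; the function names simply mirror the `to_sentences` and `to_words` calls used above, and the splitting rules are assumptions chosen for simplicity.

```python
import re


def to_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when the mark is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def to_words(sentence: str) -> list[str]:
    """Keep runs of letters, allowing an inner apostrophe or hyphen."""
    return re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", sentence)


if __name__ == "__main__":
    for sentence in to_sentences("Hello there! How are you? Fine."):
        print(to_words(sentence))
```

A rule this simple breaks on abbreviations such as "e.g." or "Dr.", which is one reason to prefer a trained sentence tokenizer for real text.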

Creating a Stemmer with Sumy

Stemming is the process of reducing a word to its base or root form. This helps in normalizing words so that different forms of a word are treated as the same term. By doing this, summarization algorithms can more effectively recognize and group similar words, thereby improving the summarization quality. The stemmer is particularly useful when we have large texts that have various forms of the same words.

To create a stemmer using the Sumy library, we will first import the `Stemmer` class from Sumy. Then, we will create a `Stemmer` object for the English language. Next, we will pass a word to the stemmer to reduce it to its root form. Finally, we will print the stemmed word.

from sumy.nlp.stemmers import Stemmer

# Create a stemmer for English and reduce a word to its root form.
stemmer = Stemmer("english")
stem = stemmer("Blogging")
print(stem)

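
The effect of stemming on word counts can be illustrated with a deliberately naive suffix stripper. This is a toy sketch and not the stemmer Sumy uses; `SUFFIXES`, `naive_stem`, and `count_stems` are hypothetical names invented for this example.

```python
from collections import Counter

# Hypothetical, deliberately tiny suffix list; real stemmers apply many more rules.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def count_stems(words: list[str]) -> Counter:
    """Count words after stemming so that related forms share one entry."""
    return Counter(naive_stem(word) for word in words)


if __name__ == "__main__":
    print(count_stems(["Walk", "walks", "walked", "walking", "talks"]))
```

Without stemming, the four forms of "walk" would be counted as four different words, which weakens any frequency-based ranking of sentences.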

Overview of Different Summarization Algorithms

Let us now look into the different summarization algorithms.

Luhn Summarizer

The Luhn Summarizer is one of the summarization algorithms provided by the Sumy library. This summarizer is based on frequency analysis: the importance of a sentence is determined by the frequency of significant words within it. The algorithm identifies the words most relevant to the topic of the text by filtering out common stop words, and then ranks sentences by the significant words they contain. The Luhn Summarizer is effective for extracting key sentences from a document. Here’s how to build the Luhn Summarizer:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.luhn import LuhnSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt')

def summarize_paragraph(paragraph, sentences_count=2):
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

    summarizer = LuhnSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentences_count)
    return summary

if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    sentences_count = 2
    summary = summarize_paragraph(paragraph, sentences_count)

    for sentence in summary:
        print(sentence)

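
To make the scoring idea behind Luhn's method more concrete, here is a simplified pure-Python sketch. It is not Sumy's implementation: the stop-word list is a tiny hypothetical one, a word counts as significant when it appears at least twice, and each sentence is treated as a single span from its first to its last significant word.

```python
import re
from collections import Counter

# Hypothetical stop-word list, far smaller than a real one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in", "it", "that", "by"}


def tokenize(sentence: str) -> list[str]:
    """Lowercase the sentence and keep runs of letters as words."""
    return re.findall(r"[a-z]+", sentence.lower())


def significant_words(sentences: list[str], min_count: int = 2) -> set[str]:
    """Non-stop words that occur at least `min_count` times in the whole text."""
    counts = Counter(
        word for s in sentences for word in tokenize(s) if word not in STOP_WORDS
    )
    return {word for word, count in counts.items() if count >= min_count}


def luhn_score(sentence: str, significant: set[str]) -> float:
    """(number of significant words)^2 / length of the span that contains them."""
    words = tokenize(sentence)
    positions = [i for i, word in enumerate(words) if word in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span


def summarize(sentences: list[str], count: int = 1) -> list[str]:
    """Return the `count` best-scoring sentences in their original order."""
    significant = significant_words(sentences)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: luhn_score(sentences[i], significant),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:count])]


if __name__ == "__main__":
    text = [
        "Cats chase mice.",
        "Cats and dogs and mice share the house.",
        "The weather is nice today.",
    ]
    print(summarize(text, count=1))
```

Sentences in which the significant words sit close together score higher than sentences where the same words are spread thinly, which is the core intuition of the method.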

Edmundson Summarizer

The Edmundson Summarizer is another algorithm provided by the Sumy library. Unlike summarizers that rely primarily on statistical and frequency-based methods, it allows a more tailored approach through three word lists: bonus words, stigma words, and null words. Bonus words raise the score of sentences that contain them, stigma words lower it, and null words are ignored, so you can emphasize or de-emphasize particular terms in the summary. Here’s how to build the Edmundson Summarizer:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.edmundson import EdmundsonSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt')

def summarize_paragraph(paragraph, sentences_count=2, bonus_words=None, stigma_words=None, null_words=None):
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

    summarizer = EdmundsonSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    if bonus_words:
        summarizer.bonus_words = bonus_words
    if stigma_words:
        summarizer.stigma_words = stigma_words
    if null_words:
        summarizer.null_words = null_words

    summary = summarizer(parser.document, sentences_count)
    return summary

if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    sentences_count = 2
    bonus_words = ["intelligence", "AI"]
    stigma_words = ["contrast"]
    null_words = ["the", "of", "and", "to", "in"]

    summary = summarize_paragraph(paragraph, sentences_count, bonus_words, stigma_words, null_words)

    for sentence in summary:
        print(sentence)

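
The role of the three word lists can be sketched in a few lines of plain Python. This covers only the cue-word idea (Edmundson's method also considers signals such as the title and the position of a sentence), and it is an illustration under simplifying assumptions rather than Sumy's code: the weights are fixed at +1 and -1, and no stemming is applied.

```python
import re


def cue_score(sentence, bonus_words, stigma_words, null_words):
    """Count +1 per bonus word and -1 per stigma word, skipping null words."""
    score = 0
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in null_words:
            continue  # null words never affect the score
        if word in bonus_words:
            score += 1
        elif word in stigma_words:
            score -= 1
    return score


def summarize(sentences, count, bonus_words, stigma_words, null_words):
    """Return the `count` highest-scoring sentences in their original order."""
    # Lowercase the word lists once so that matching is case-insensitive.
    bonus = {word.lower() for word in bonus_words}
    stigma = {word.lower() for word in stigma_words}
    null = {word.lower() for word in null_words}
    scores = [cue_score(s, bonus, stigma, null) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:count])]


if __name__ == "__main__":
    text = [
        "AI is intelligence demonstrated by machines.",
        "In contrast, animals show natural intelligence.",
        "The weather is nice.",
    ]
    print(summarize(text, 1, ["intelligence", "AI"], ["contrast"], ["the", "is", "in", "by"]))
```

Changing the word lists changes which sentences are selected, which is what makes this family of methods easy to steer toward a domain.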

LSA Summarizer

The LSA (Latent Semantic Analysis) summarizer works by identifying patterns and relationships between the terms and sentences of a text, rather than relying solely on frequency analysis. Because it looks at how words occur together, it can often produce more contextually accurate summaries than purely frequency-based methods. Here’s how to build the LSA Summarizer:

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
import nltk
nltk.download('punkt')

def summarize_paragraph(paragraph, sentences_count=2):
    parser = PlaintextParser.from_string(paragraph, Tokenizer("english"))

    summarizer = LsaSummarizer(Stemmer("english"))
    summarizer.stop_words = get_stop_words("english")

    summary = summarizer(parser.document, sentences_count)
    return summary

if __name__ == "__main__":
    paragraph = """Artificial intelligence (AI) is intelligence demonstrated by machines, in contrast
                   to the natural intelligence displayed by humans and animals. Leading AI textbooks define
                   the field as the study of "intelligent agents": any device that perceives its environment
                   and takes actions that maximize its chance of successfully achieving its goals. Colloquially,
                   the term "artificial intelligence" is often used to describe machines (or computers) that mimic
                   "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving"."""

    sentences_count = 2
    summary = summarize_paragraph(paragraph, sentences_count)

    for sentence in summary:
        print(sentence)

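
The idea of finding a dominant "topic" in a term-sentence matrix can be sketched without any numerical library. The code below is a simplified illustration, not Sumy's implementation: it builds a matrix of raw word counts and uses power iteration to approximate only the leading singular vector, whereas a full LSA summarizer would typically use several singular vectors from a proper SVD. All function names here are made up for the example.

```python
import math
import re


def tokenize(sentence):
    """Lowercase the sentence and keep runs of letters as words."""
    return re.findall(r"[a-z]+", sentence.lower())


def term_sentence_matrix(sentences, stop_words=frozenset()):
    """Rows are terms, columns are sentences, entries are raw counts."""
    tokens = [[w for w in tokenize(s) if w not in stop_words] for s in sentences]
    vocabulary = sorted({w for words in tokens for w in words})
    return [[words.count(term) for words in tokens] for term in vocabulary]


def top_topic_weights(matrix, iterations=100):
    """Power iteration on A^T A, approximating the leading right singular vector of A."""
    n = len(matrix[0])
    gram = [[sum(row[i] * row[j] for row in matrix) for j in range(n)] for i in range(n)]
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(gram[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
    return vector


def summarize(sentences, count=1, stop_words=frozenset()):
    """Return the `count` sentences with the largest weight in the leading topic."""
    matrix = term_sentence_matrix(sentences, stop_words)
    if not matrix:  # no usable terms at all
        return sentences[:count]
    weights = top_topic_weights(matrix)
    ranked = sorted(range(len(sentences)), key=lambda i: abs(weights[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:count])]


if __name__ == "__main__":
    text = [
        "Cats chase mice and cats eat mice.",
        "Dogs chase cats.",
        "The sky is blue.",
    ]
    print(summarize(text, count=1, stop_words={"and", "the", "is"}))
```

Sentences that share vocabulary with the rest of the text receive a large weight in the leading topic, while a sentence about something unrelated receives a weight close to zero.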

Conclusion

Sumy is a capable library for automatic text summarization. We can also use it for tasks like tokenization and stemming. By using different algorithms like Luhn, Edmundson, and LSA, we can generate concise and meaningful summaries based on our specific needs. Although we used a short paragraph in the examples, the same code can be applied to lengthy documents.

Key Takeaways

  • Sumy is a flexible library for summarization, as we can select a summarizer based on our needs.
  • We can also use Sumy to build a tokenizer and stemmer in an easy way.
  • Sumy provides different summarization algorithms, each with its own benefit.
  • We can use the Sumy library to summarize lengthy textual documents.

Frequently Asked Questions

Q1. What is Sumy?

A. Sumy is a Python library for automatic text summarization using various algorithms.

Q2. What algorithms does Sumy support?

A. Sumy supports algorithms like Luhn, Edmundson, LSA, LexRank, and KL-summarizers.

Q3. What is tokenization in Sumy?

A. Tokenization is dividing text into sentences and words, improving summarization accuracy.

Q4. What is stemming in Sumy?

A. Stemming reduces words to their base or root forms for better summarization.


Experienced Software Engineer with a demonstrated history of working in the computer software industry. Skilled in Python, Data Science, Computer Vision, Amazon Web Services, Docker, Podman, Kubernetes, MongoDB, On-Premises Setup, Git, GitLab, and Jenkins (CI/CD).
