In my previous article, I presented a method for extracting keywords from documents using the TF-IDF vectorizer. TF-IDF relies on corpus statistics to weight the extracted keywords, so it cannot be applied to a single text; this is one of its drawbacks.
To illustrate how each of the four methods (RAKE, YAKE, KeyBERT, and TextRank) works, I'll use the abstract of my published scientific article together with the keywords specified by its author, and I will test each method to see which ones return keywords closest to those the author chose. Note that in keyword extraction tasks there are explicit keywords, which appear verbatim in the text, and implicit ones, which the author lists as keywords even though they do not appear in the text itself but relate to the field.
I will introduce you to four methods for extracting keywords/keyphrases from a single text: RAKE, YAKE, KeyBERT, and TextRank. We will briefly overview each method and then apply it to extract keywords from an attached example.
In the example shown in the image, we have the article title and abstract, with the keywords defined by the author in the original article marked in yellow. Note that the keyphrase "machine learning" does not appear explicitly in the abstract. Of course, we could use the full text of the article, but for simplicity we limit ourselves to the abstract.
The title is usually combined with the provided text, as it contains valuable information and reflects the content of the article in a nutshell. Thus, we will concatenate the title and the text simply with a plus sign between the two variables title and text:
title = "VECTORIZATION OF TEXT USING DATA MINING METHODS"
text = "In the text mining tasks, textual representation should be not only efficient but also interpretable, as this enables an understanding of the operational logic underlying the data mining models. Traditional text vectorization methods such as TF-IDF and bag-of-words are effective and characterized by intuitive interpretability, but suffer from the «curse of dimensionality», and they are unable to capture the meanings of words. On the other hand, modern distributed methods effectively capture the hidden semantics, but they are computationally intensive, time-consuming, and uninterpretable. This article proposes a new text vectorization method called Bag of weighted Concepts BoWC that presents a document according to the concepts’ information it contains. The proposed method creates concepts by clustering word vectors (i.e. word embedding) then uses the frequencies of these concept clusters to represent document vectors. To enrich the resulted document representation, a new modified weighting function is proposed for weighting concepts based on statistics extracted from word embedding information. The generated vectors are characterized by interpretability, low dimensionality, high accuracy, and low computational costs when used in data mining tasks. The proposed method has been tested on five different benchmark datasets in two data mining tasks; document clustering and classification, and compared with several baselines, including Bag-of-words, TF-IDF, Averaged GloVe, Bag-of-Concepts, and VLAC. The results indicate that BoWC outperforms most baselines and gives 7% better accuracy on average"
full_text = title +", "+ text
print("The whole text to be usedn",full_text)
Now we will start applying each of the mentioned methods to extract keywords.
YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from a single document to identify the most relevant keywords in the text. It does not require training on a particular set of documents and does not depend on dictionaries, text size, domain, or language. YAKE! defines a set of five features that capture keyword characteristics and heuristically combines them to assign a single score to every keyword.
The lower the score, the more significant the keyword. You can read more about it on the yake Python package page. We install yake first, then import it:
pip install git+https://github.com/LIAAD/yake
import yake
Then we have to build a KeywordExtractor object by calling the yake.KeywordExtractor constructor, which accepts several parameters. The most important are top (the number of keywords to retrieve, which we set to 10) and lan (the language, for which we use the default "en"); a list of stop words can also be passed. Next, we pass the text to the extract_keywords function, which returns a list of (keyword, score) tuples. Keyphrases range in length from 1 to 3 words.
kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
keywords = kw_extractor.extract_keywords(full_text)
for kw, v in keywords:
    print("Keyphrase: ", kw, ": score", v)
We note that three keywords are identical to those provided by the author: text mining, data mining, and text vectorization methods. Interestingly, YAKE! pays attention to capitalization and gives more importance to words that start with a capital letter.
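If you want shorter keyphrases or another language, the constructor exposes these settings too. Here is a small variant using the n (maximum n-gram size) and lan parameters from the yake documentation:

# restrict keyphrases to at most two words
kw_extractor = yake.KeywordExtractor(lan="en", n=2, top=10, stopwords=None)
for kw, v in kw_extractor.extract_keywords(full_text):
    print("Keyphrase: ", kw, ": score", v)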
RAKE is short for Rapid Automatic Keyword Extraction, a method for extracting keywords from individual documents. It can be applied to new domains very easily and is effective in handling many types of documents, especially text that does not follow specific grammatical conventions. RAKE identifies keyphrases in a text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
We'll be using a package called multi_rake. First install it, then import Rake:
pip install multi_rake
from multi_rake import Rake
rake = Rake()
keywords = rake.apply(full_text)
print(keywords[:10])
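The Rake constructor also accepts optional settings. The snippet below uses the min_chars, max_words, and language_code parameters documented in multi_rake (treat the exact names as an assumption if your version differs):

# limit candidates to phrases of at most two words and skip very short tokens
rake_custom = Rake(min_chars=3, max_words=2, language_code="en")
print(rake_custom.apply(full_text)[:10])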
We notice two relevant keywords among the results: text mining and data mining.
TextRank is an unsupervised method for extracting keywords and sentences. It builds a graph in which each node represents a word and edges represent relationships between words, formed by the co-occurrence of words within a moving window of a predetermined size. The algorithm is inspired by PageRank, which Google used to rank websites. It first tokenizes the text and annotates it with part-of-speech (PoS) tags, considering only single words; no n-grams are used at this stage, and multi-word phrases are reconstructed later. An edge is created whenever two lexical units co-occur within a window of N words, yielding an unweighted, undirected graph. TextRank then ranks the words, the most important words are selected, and adjacent keywords are folded into multi-word keyphrases.
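To make the graph construction concrete, here is a minimal, simplified sketch of the idea (my own illustration, not the summa implementation; it uses networkx for PageRank and skips PoS filtering and keyword folding):

import networkx as nx

def textrank_keywords(text, window=4, top_n=10):
    # naive tokenization; a real implementation filters candidates by part of speech
    words = [w.strip('.,;:«»()').lower() for w in text.split()]
    words = [w for w in words if w.isalpha() and len(w) > 3]
    graph = nx.Graph()
    # connect words that co-occur within the sliding window (unweighted, undirected)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if word != other:
                graph.add_edge(word, other)
    # PageRank assigns an importance score to every node
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(textrank_keywords(full_text))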
To generate keywords using TextRank, you must first install the summa package and then import the keywords module.
pip install summa
from summa import keywords
After that, you simply call the keywords function and pass it the text to be processed. We also set scores to True to print the relevance of each resulting keyword.
TR_keywords = keywords.keywords(full_text, scores=True)
print(TR_keywords[0:10])
KeyBERT is a simple, easy-to-use keyword extraction algorithm that leverages SBERT embeddings to generate keywords and keyphrases that are most similar to the document. First, the document is embedded with a sentence-BERT model. Next, embeddings are extracted for candidate N-gram words and phrases. The similarity of each candidate to the document is then measured with cosine similarity; the most similar candidates are those that best describe the entire document and are taken as its keywords.
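Conceptually, the pipeline can be sketched in a few lines (a simplified illustration of the idea, not KeyBERT's actual code; it assumes the sentence-transformers and scikit-learn packages are installed):

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-mpnet-base-v2')
# collect candidate n-gram phrases from the document
vectorizer = CountVectorizer(ngram_range=(1, 3), stop_words='english').fit([full_text])
candidates = vectorizer.get_feature_names_out()
# embed the document and the candidates, then rank candidates by cosine similarity
doc_embedding = model.encode([full_text])
candidate_embeddings = model.encode(list(candidates))
similarities = cosine_similarity(doc_embedding, candidate_embeddings)[0]
print([candidates[i] for i in similarities.argsort()[::-1][:10]])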
To generate keywords using KeyBERT, you must first install the keybert package and then import KeyBERT.
pip install keybert
from keybert import KeyBERT
Then you create an instance of KeyBERT that accepts one parameter: the sentence-BERT model. You can choose any embedding model from the sentence-transformers pretrained models. According to the author, the all-mpnet-base-v2 model works best.
kw_model = KeyBERT(model='all-mpnet-base-v2')
The first time you run this, the model weights will be downloaded.
The extract_keywords function accepts several parameters. The most important are the text, keyphrase_ngram_range (the minimum and maximum length (n, m) of the resulting keyphrases), top_n (the number of keywords to retrieve), and highlight (if True, the text is printed with the keywords highlighted in yellow).
keywords = kw_model.extract_keywords(full_text,
                                     keyphrase_ngram_range=(1, 3),
                                     stop_words='english',
                                     highlight=False,
                                     top_n=10)
keywords_list = list(dict(keywords).keys())
print(keywords_list)
You can change keyphrase_ngram_range to (1, 2), considering that most of the author's keyphrases are one or two words long. This time we will set highlight to True:
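The call is the same as before, with the adjusted parameters:

keywords = kw_model.extract_keywords(full_text,
                                     keyphrase_ngram_range=(1, 2),
                                     stop_words='english',
                                     highlight=True,
                                     top_n=10)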
We have presented four state-of-the-art techniques for extracting keywords or keyphrases, with a code implementation for each. Each of the four methods has its own advantages, and each succeeded in extracting keywords that are either identical to the keywords specified by the author or close to them and related to the field. The main advantage of all these methods is that they require no training on external resources.
This work is related to my scientific activity during my Ph.D. studies. I hope the information provided will be of benefit to all. In the future, we will introduce a new method for automating keyword extraction and compare its performance with the mentioned baselines and many others.
You can check the code in my GitHub repository. I would be grateful for any feedback.
About me: My name is Ali Mahmoud Mansour. I'm from Syria, and currently (in 2022) I am a graduate student (Ph.D. researcher) in computer science, passionate about text mining and data science.