Objectives: In this tutorial, I will introduce you to four methods for extracting keywords/keyphrases from a single text: Rake, Yake, KeyBERT, and TextRank. We will briefly describe each method and then apply it to extract keywords from a worked example.
Prerequisite: Basic understanding of Python.
Keywords: keywords extraction, keyphrases extraction, Python, NLP, TextRank, Rake, BERT.
I would like to point out that in my previous article, I presented a method for extracting keywords from documents using a TF-IDF vectorizer. The TF-IDF method relies on corpus-level statistics to weight the extracted keywords, so it cannot be applied to a single text; this is one of its main drawbacks.
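As a brief illustration (a minimal sketch with scikit-learn's TfidfVectorizer, not code from my previous article), the IDF component is computed across the whole corpus, which is exactly the statistic a single document cannot supply:

from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents: IDF down-weights terms that appear in many of them
corpus = [
    "text mining extracts knowledge from text",
    "data mining models learn from data",
    "word embeddings capture word semantics",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)  # the IDF weights depend on all three documents
print(dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_.round(2))))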
To illustrate how each of the methods (Rake, Yake, KeyBERT, and TextRank) works, I will use the abstract of my published scientific article together with the keywords specified by the author, test each method, and check which ones return keywords closest to those the author chose. Note that in keyword extraction tasks there are explicit keywords, which appear verbatim in the text, and implicit keywords, which the author lists as keywords even though they do not appear verbatim in the text but relate to the field.
In the example shown in the image, we have the article title and abstract, and the reference keywords (defined by the author in the original article) are marked in yellow. Note that the keyphrase "machine learning" is implicit: it does not appear in the abstract. Of course, we could use the full text of the article, but for simplicity we limit ourselves to the abstract.
The title is usually combined with the provided text, since the title contains valuable information and reflects the content of the article in a nutshell. Thus, we will concatenate the title and the text with a simple plus sign between the two variables:
title = "VECTORIZATION OF TEXT USING DATA MINING METHODS" text = "In the text mining tasks, textual representation should be not only efficient but also interpretable, as this enables an understanding of the operational logic underlying the data mining models. Traditional text vectorization methods such as TF-IDF and bag-of-words are effective and characterized by intuitive interpretability, but suffer from the «curse of dimensionality», and they are unable to capture the meanings of words. On the other hand, modern distributed methods effectively capture the hidden semantics, but they are computationally intensive, time-consuming, and uninterpretable. This article proposes a new text vectorization method called Bag of weighted Concepts BoWC that presents a document according to the concepts’ information it contains. The proposed method creates concepts by clustering word vectors (i.e. word embedding) then uses the frequencies of these concept clusters to represent document vectors. To enrich the resulted document representation, a new modified weighting function is proposed for weighting concepts based on statistics extracted from word embedding information. The generated vectors are characterized by interpretability, low dimensionality, high accuracy, and low computational costs when used in data mining tasks. The proposed method has been tested on five different benchmark datasets in two data mining tasks; document clustering and classification, and compared with several baselines, including Bag-of-words, TF-IDF, Averaged GloVe, Bag-of-Concepts, and VLAC. The results indicate that BoWC outperforms most baselines and gives 7% better accuracy on average" full_text = title +", "+ text print("The whole text to be usedn",full_text)
Now we will start applying each of the mentioned methods to extract keywords.
Yake (Yet Another Keyword Extractor) is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from individual documents to identify the most relevant keywords in the text. It does not need to be trained on a particular set of documents, nor does it depend on dictionaries, text size, domain, or language. Yake defines a set of five features capturing keyword characteristics, which are heuristically combined to assign a single score to every keyword; the lower the score, the more significant the keyword. A Python implementation is available as the yake package.
First we install Yake, then we import it:
pip install git+https://github.com/LIAAD/yake

import yake
Then we have to build a KeywordExtractor object. The KeywordExtractor constructor accepts several parameters, the most important of which are: top, the number of keywords to retrieve (here set to 10); lan, the language (here the default "en"); and stopwords, an optional list of stop words. Next, we pass the text to the extract_keywords function, which returns a list of (keyword, score) tuples. The extracted keyphrases range from 1 to 3 words in length.
kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
keywords = kw_extractor.extract_keywords(full_text)
for kw, v in keywords:
    print("Keyphrase: ", kw, ": score", v)
We note that three extracted keyphrases are identical to the keywords provided by the author: text mining, data mining, and text vectorization methods. Interestingly, YAKE! pays attention to capitalization and gives more importance to words that start with a capital letter.
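The constructor exposes a few more knobs. The sketch below assumes the parameter names documented for the yake package (lan for the language, n for the maximum n-gram length, dedupLim for the deduplication threshold); verify them against your installed version:

custom_extractor = yake.KeywordExtractor(lan="en", n=2, dedupLim=0.9, top=10)
for kw, v in custom_extractor.extract_keywords(full_text):
    print("Keyphrase: ", kw, ": score", v)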
Rake is short for Rapid Automatic Keyword Extraction, a method for extracting keywords from individual documents. It adapts easily to new domains and works well across many types of documents, particularly text that follows standard grammatical conventions. Rake identifies key phrases in a text by analyzing the frequency of word occurrence and the co-occurrence of words with other words in the text.
We’ll be using a package called multi_rake. First we install it, then we import Rake:
pip install multi_rake

from multi_rake import Rake

rake = Rake()
keywords = rake.apply(full_text)
print(keywords[:10])
We notice two relevant keyphrases among the results: text mining and data mining.
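multi_rake's constructor also accepts tuning parameters. The sketch below assumes the parameter names documented for the package (max_words for the maximum phrase length, min_freq for the minimum candidate frequency, language_code to skip language detection); check them against your installed version:

# Limit phrases to at most 2 words and require each candidate to appear twice
rake = Rake(max_words=2, min_freq=2, language_code="en")
keywords = rake.apply(full_text)
print(keywords[:10])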
TextRank is an unsupervised method for extracting keywords and sentences. It is based on a graph where each node is a word and edges represent relationships between words, formed from the co-occurrence of words within a moving window of a predetermined size. The algorithm is inspired by PageRank, which Google used to rank websites. TextRank first tokenizes the text and annotates it with part-of-speech (PoS) tags. Only single words are considered as nodes; no n-grams are used at this stage, and multi-word phrases are reconstructed later. An edge is created whenever two lexical units co-occur within a window of N words, producing an unweighted, undirected graph. The ranking algorithm is then run over this graph to score the words, the most important lexical words are selected, and adjacent keywords are folded into multi-word keyphrases.
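To make the graph-and-rank mechanics concrete, here is a minimal toy sketch using the networkx library. This is an illustration of the idea only, not summa's implementation: it uses naive whitespace tokenization and skips the PoS filtering and multi-word folding steps.

import networkx as nx

sample = "text mining tasks need efficient text representation for data mining models"
words = sample.split()  # naive tokenization; real TextRank keeps only certain PoS tags

# Build an undirected graph: an edge links words that co-occur within the window
window = 2
graph = nx.Graph()
for i, w in enumerate(words):
    for j in range(i + 1, min(i + 1 + window, len(words))):
        if w != words[j]:
            graph.add_edge(w, words[j])

# Run PageRank over the word graph and keep the highest-scoring words
scores = nx.pagerank(graph)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(word, round(score, 3))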
To generate keywords using TextRank, you must first install the summa package and then import the keywords module.
pip install summa

from summa import keywords
After that, you simply call the keywords function and pass it the text to be processed. We also set scores=True to print the relevance score of each resulting keyword.
TR_keywords = keywords.keywords(full_text, scores=True)
print(TR_keywords[0:10])
[Image: TextRank results. Source: the author]
KeyBERT is a simple, easy-to-use keyword extraction algorithm that leverages SBERT (Sentence-BERT) embeddings to generate keywords and keyphrases that are most similar to the document. First, a document embedding (a representation) is generated with a Sentence-BERT model. Next, embeddings are extracted for candidate N-gram phrases. The similarity of each candidate to the document is then measured using cosine similarity, and the most similar candidates are selected as the keywords that best describe the entire document.
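Before turning to the keybert package itself, the following minimal sketch reproduces this pipeline directly with sentence-transformers and scikit-learn. It is an illustration of the idea, not KeyBERT's exact implementation (KeyBERT adds candidate filtering and diversification options):

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc = "Text vectorization with data mining methods and word embeddings."

# Candidate keyphrases: n-grams extracted from the document itself
candidates = CountVectorizer(ngram_range=(1, 2), stop_words="english").fit([doc]).get_feature_names_out()

# Embed the document and every candidate with the same SBERT model
model = SentenceTransformer("all-mpnet-base-v2")
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(list(candidates))

# Keywords are the candidates whose embeddings are closest to the document's
similarities = cosine_similarity(candidate_embeddings, doc_embedding).ravel()
print(sorted(zip(candidates, similarities), key=lambda kv: -kv[1])[:5])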
To generate keywords using KeyBERT, you must first install the keybert package and then import KeyBERT.
pip install keybert

from keybert import KeyBERT
Then you create a KeyBERT instance, which accepts one parameter: the Sentence-BERT model. You can choose any embedding model from the sentence-transformers pretrained models. According to the KeyBERT author, the all-mpnet-base-v2 model works best.
kw_model = KeyBERT(model='all-mpnet-base-v2')
The first time you run this, the model will be downloaded automatically.
The extract_keywords function accepts several parameters, the most important of which are: the text; keyphrase_ngram_range=(n, m), the length range of the keyphrases; top_n, the number of keywords to retrieve; and highlight, which, when set to True, prints the text with the keywords highlighted in yellow.
keywords = kw_model.extract_keywords(full_text,
                                     keyphrase_ngram_range=(1, 3),
                                     stop_words='english',
                                     highlight=False,
                                     top_n=10)

keywords_list = list(dict(keywords).keys())
print(keywords_list)
You can change keyphrase_ngram_range to (1, 2), since most of the reference keyphrases are one or two words long. This time we set highlight to True:
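The call is the same as before, with those two arguments changed:

keywords = kw_model.extract_keywords(full_text,
                                     keyphrase_ngram_range=(1, 2),
                                     stop_words='english',
                                     highlight=True,
                                     top_n=10)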
It’s so amazing.
We have presented four state-of-the-art techniques for keyword/keyphrase extraction, with a code implementation for each. Each of the four methods has its own advantages, and each succeeded in extracting keywords that are either identical to the keywords specified by the author or close to them and related to the field. The main advantage of all these methods is that they require no training on external resources.
This work is related to my scientific activity while working on my Ph.D. I hope that the information provided will be of benefit to all. In the future, we will present a new method for automating keyword extraction, and its performance will be compared with the mentioned baselines and many others.
You can check the code on my repository at GitHub. I would be grateful for any feedback.
About me: My name is Ali Mahmoud Mansour. I am from Syria, and currently (in 2022) I am a graduate student (Ph.D. researcher) in the field of computer science, passionate about text mining and data science.