Natural language processing is one of the most widely used skills at the enterprise level because it can deal with non-numeric data. Machines communicate in 0s and 1s, while we humans communicate in our native languages (often English as a second language), so we need techniques that let a machine work through our language and make sense of it. NLP provides a wide range of tools for exactly that, and this is the second article discussing the NLP tools available in PySpark.
In this article, we will move forward and discuss two more NLP tools that are equally important for natural language processing applications: TF-IDF and CountVectorizer.
So let's dive deep into these two NLP tools so that, in the next article, we can build real-world NLP applications using PySpark.
Before implementing the tools mentioned above, we first need to start a Spark session to handle the distributed processing. For that, we import the SparkSession module from PySpark.
from pyspark.sql import SparkSession
spark_nlp2 = SparkSession.builder.appName('nlp_tools_2').getOrCreate()
spark_nlp2
Output:
Inference: A chain of functions is used to create the PySpark session: the builder attribute builds the environment in which PySpark can run, appName() gives the session its name, and getOrCreate() eventually creates (or reuses) the Spark session with the given configuration.
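As a small sketch of the same builder chain (the local[*] master and the extra config() setting below are illustrative choices, not something this article requires), additional options can be attached before getOrCreate() is called:

from pyspark.sql import SparkSession

# Illustrative only: master() picks where Spark runs and config() sets any Spark option;
# getOrCreate() still returns the existing session if one is already running.
spark_nlp2 = (
    SparkSession.builder
    .appName('nlp_tools_2')
    .master('local[*]')                            # assumed local mode for experimentation
    .config('spark.sql.shuffle.partitions', '8')   # example tuning option
    .getOrCreate()
)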
TF-IDF is one of the most widely used feature-extraction tools, and it works on tokenized sentences only, i.e., it does not operate on the raw sentence but only on tokens; hence, we first need to apply a tokenization technique (either the basic Tokenizer or the RegexTokenizer, depending on the business requirements). Once we have the tokens, we can run this algorithm on top of them, and it will return the importance of each token in the document. Note that it is a feature vectorization method, so the output will always be in the form of vectors.
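Since TF-IDF consumes tokens rather than raw sentences, either tokenizer can serve as that first step. A minimal sketch of the two options is given below (the column names and the regex pattern are just example choices):

from pyspark.ml.feature import Tokenizer, RegexTokenizer

# Basic tokenizer: lowercases the sentence and splits on whitespace.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")

# Regex tokenizer: splits on the given pattern instead (here, any non-word character).
regex_tokenizer = RegexTokenizer(inputCol="sentence", outputCol="words", pattern="\\W")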
Now, let's break down the TF-IDF method; it is a two-step process:
Term Frequency (TF): as the name suggests, term frequency counts how often a particular word occurs in a given document of the corpus. There are several ways to compute this step; in the implementation below we use HashingTF, which hashes each token into a fixed number of feature buckets and counts occurrences per bucket.
Inverse Document Frequency (IDF): while term frequency looks at the occurrence of a particular word within a document, IDF has a more subjective and critical job: it weights words according to how common or uncommon they are across the whole collection of documents, so that very common words are pushed down and rarer, more informative words stand out.
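As a rough sketch of the arithmetic behind these two steps, the snippet below computes the TF-IDF weight of a single made-up term by hand; the numbers are purely illustrative, and the +1 smoothing mirrors the formula documented for Spark's IDF estimator:

import math

# Toy statistics for one term (made-up numbers, for illustration only).
total_documents = 3        # N: documents in the corpus
docs_with_term = 1         # df: documents containing the term
count_in_this_doc = 2      # raw count of the term inside one document

tf = count_in_this_doc                                         # term frequency (raw count)
idf = math.log((total_documents + 1) / (docs_with_term + 1))   # smoothed inverse document frequency
tf_idf = tf * idf                                              # final weight of the term in this document
print(tf, idf, tf_idf)

With that intuition in place, the actual PySpark implementation follows.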
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
sentenceData = spark_nlp2.createDataFrame([
(0.0, "Hi I heard about Spark"),
(0.0, "I wish Java could use case classes"),
(1.0, "Logistic regression models are neat")
], ["label", "sentence"])
sentenceData.show(truncate=False)
# Step 1: split every sentence into tokens.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.show(truncate=False)
# Step 2: hash the tokens into a fixed-size vector of raw term frequencies.
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
# Step 3: fit IDF on the raw counts and rescale them into TF-IDF weights.
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(truncate=False)
In this part, we implement TF-IDF, as we are now done with all the prerequisites required to execute it.
In the output, we can see that each sentence is represented as a sparse vector over a total of 20 features: the first list (for example [6,8,13,16]) gives the hash-bucket positions of the tokens that occur in that sentence, and the second list gives the corresponding TF-IDF weights, i.e., how important each token is relative to the rest of the corpus.
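If the sparse notation in .show() is hard to read, one simple way to unpack it (a minimal sketch built on the rescaledData DataFrame from above) is to pull out a single row and inspect the vector's fields directly:

# Each 'features' entry is a SparseVector: (size, [occupied hash buckets], [TF-IDF weights]).
first_row = rescaledData.select("features").first()
vector = first_row["features"]
print(vector.size)     # 20, the numFeatures we set on HashingTF
print(vector.indices)  # the hash-bucket positions that are occupied
print(vector.values)   # the TF-IDF weight of each occupied position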
Whenever we talk about CountVectorizer, CountVectorizerModel comes hand in hand with it. CountVectorizer acts as an estimator in the background that extracts the vocabulary and generates the model, and the fitted model is then used to vectorize the text documents into counts of tokens from the raw corpus. Note that this technique produces discrete token counts, which is why it pairs naturally with discrete probability models.
Enough of the theoretical part; now let's get our hands dirty by implementing it.
from pyspark.ml.feature import CountVectorizer
df = spark_nlp2.createDataFrame([
(0, "a b c".split(" ")),
(1, "a b b c a".split(" "))
], ["id", "words"])
# fit a CountVectorizerModel from the corpus.
# vocabSize=3 caps the vocabulary at the 3 most frequent terms, and minDF=2.0
# keeps only terms that appear in at least 2 documents.
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3, minDF=2.0)
model = cv.fit(df)
result = model.transform(df)
result.show(truncate=False)
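As a quick follow-up (a small sketch that reuses the model fitted above), the trained CountVectorizerModel also exposes the vocabulary it learned, which makes the positions inside each count vector easy to interpret, and it can transform new documents as long as they provide the same words column:

# Index i of every count vector corresponds to model.vocabulary[i].
print(model.vocabulary)

# Transforming a new (hypothetical) document with the same fitted model.
new_df = spark_nlp2.createDataFrame([(2, "a c c".split(" "))], ["id", "words"])
model.transform(new_df).show(truncate=False)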
Here we are in the last section of the article, where we recap everything we did with the TF-IDF algorithm and the CountVectorizerModel. First we gathered the theoretical background of each algorithm, and then we walked through its practical implementation.
Here's the repo link to this article. I hope you liked my article on implementing CountVectorizer and TF-IDF in NLP using PySpark. If you have any opinions or questions, comment below.
Connect with me on LinkedIn for further discussion on MLlib or anything else.
Q. What is TF-IDF used for in NLP?
A. TF-IDF (Term Frequency-Inverse Document Frequency) is used in NLP to assess the importance of words in a document relative to a collection of documents. It helps identify key terms by considering both their frequency and uniqueness.
Q. How is TF-IDF different from a bag of words?
A. In a bag of words, a document is represented by word frequencies alone, ignoring order. TF-IDF, however, considers not just frequency but also the importance of words by weighing them according to their rarity across the documents in a corpus.
Q. Is TF-IDF a feature extraction technique?
A. Yes, TF-IDF is a traditional and widely used approach for feature extraction in NLP. It assigns numerical values to words, capturing their relevance in a document and aiding tasks like text classification, information retrieval, and document clustering.
Q. How are TF and IDF calculated?
A. Term Frequency (TF) is calculated by dividing the number of occurrences of a term in a document by the total number of terms in the document. Inverse Document Frequency (IDF) is calculated as the logarithm of the total number of documents divided by the number of documents containing the term.