First of all, don’t let the title deceive you! Natural Language Processing is a vast field in its own right. A great deal of linguistic computation and analysis can easily be performed with modern NLP tools and applications. From the basic task of parsing text to the complex task of performing sentiment analysis on an input sentence, NLP has many practical applications and ample scope in the research domain.
Python offers us various libraries for performing Natural Language Processing tasks in the most convenient way possible. One of the most prominent and easy-to-use libraries is TextBlob. As the official documentation puts it:
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.
Today we shall explore the fundamental working mechanism and various functionalities of TextBlob. We’ll begin with installation via pip, conda, and GitHub. We will then dive into the basic guide portion, which covers some of the fundamental NLP concepts one must know to get going in Natural Language Processing. One of these is POS tagging, or part-of-speech tagging, a technique that assigns each word its part of speech. We will move on to noun phrases for listing the noun phrases in a text or sentence. Next comes a popular NLP use case, sentiment analysis, where we will look into the polarity and subjectivity parameters of a text in order to figure out its sentiment.
We will then look into tokenization, a method of creating word tokens for different analyses, and the concepts of word inflection and lemmatization, which deal with figuring out a word’s origin or base form. Moving on, we will quickly investigate the similarities of TextBlob with Python’s native strings and their cross-compatibility; the fact that the library is compatible with Python strings certainly makes it easier to implement a wide range of applications. We will then see parsing, where text or sentences are parsed so that deeper analyses can be carried out. Last but not least, we will look into the idea behind the n-gram approach and the ways to implement it easily.
Let’s go ahead and try out some easy and fun NLP functionalities with TextBlob.
To install TextBlob with pip, run:
pip install -U textblob
python -m textblob.download_corpora
This will install the library as well as the necessary NLTK (Natural Language Toolkit) corpora.
To download minimum corpora instead, run:
python -m textblob.download_corpora lite
To install with conda, run:
conda install -c conda-forge textblob
python -m textblob.download_corpora
To install from source, clone the GitHub repository and run the setup script from inside the cloned directory:
git clone https://github.com/sloria/TextBlob.git
cd TextBlob
python setup.py install
Now that we have installed TextBlob, we can create a new Jupyter Notebook file and follow along with the tutorial. Once you have initialized a new notebook, we can begin writing the code.
Let’s start by importing the library.
from textblob import TextBlob
We begin by creating a simple TextBlob. The syntax is fairly straightforward. Feel free to use a custom sentence of your choice.
my_sentence = TextBlob("I am reading a blog post on AnalyticsVidhya. I am loving it!")
Now that our TextBlob is ready, let’s perform some interesting NLP operations.
The technique of assigning one of the parts of speech to a given word is known as POS (part-of-speech) tagging. With TextBlob, we can list the part-of-speech tags through the tags property. In layman’s terms, POS tagging is the task of labeling each word in a phrase with its proper part of speech. Nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their subcategories are all parts of speech. Rule-based POS tagging, stochastic POS tagging, and transformation-based tagging are the most common approaches.
Rule-based POS tagging is one of the oldest tagging methods. To find possible tags for each word, rule-based taggers consult a dictionary or lexicon. If a word has more than one possible tag, hand-written rules choose the correct one. Disambiguation works by analyzing the linguistic properties of the word itself as well as of its preceding and succeeding words.
Stochastic POS tagging is another approach. Stochastic refers to any model that incorporates frequency or probability (statistics). “Stochastic tagger” covers a variety of different solutions to the part-of-speech tagging problem; the simplest stochastic tagger applies word frequencies and tag-sequence probabilities.
Transformation-based tagging is also referred to as Brill tagging. It is an instance of transformation-based learning (TBL), a rule-based system for automatically labeling the POS of a given text. TBL converts one state to another using transformation rules, allowing us to capture linguistic knowledge in a readable fashion.
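To make the rule-based flavor of tagging concrete, here is a minimal toy sketch using NLTK’s RegexpTagger with three hand-written suffix rules. (This is only an illustration of the approach, not how TextBlob tags internally.)
import nltk

# Each hand-written rule maps a word shape to a tag;
# the final catch-all rule acts as a default.
patterns = [
    (r'.*ing$', 'VBG'),  # gerunds, e.g. "reading"
    (r'.*ed$', 'VBD'),   # simple past, e.g. "walked"
    (r'.*', 'NN'),       # default: tag everything else as a noun
]
toy_tagger = nltk.RegexpTagger(patterns)
print(toy_tagger.tag("I am reading a blog post".split()))
# [('I', 'NN'), ('am', 'NN'), ('reading', 'VBG'), ('a', 'NN'), ('blog', 'NN'), ('post', 'NN')]
Real rule-based taggers use far richer rule sets, but the principle is the same: match the word’s shape and context, then assign a tag.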
Let’s try the POS tagging operation with our “my_sentence” object.
my_sentence.tags
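With the corpora downloaded, the output should look something like the following (illustrative; the exact tags can vary slightly with the tagger your TextBlob/NLTK version uses):
[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('blog', 'NN'), ('post', 'NN'), ('on', 'IN'), ('AnalyticsVidhya', 'NNP'), ('I', 'PRP'), ('am', 'VBP'), ('loving', 'VBG'), ('it', 'PRP')]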
Merriam-Webster defines a noun as:
Any member of a class of words that typically can be combined with determiners to serve as the subject of a verb, can be interpreted as singular or plural, can be replaced with a pronoun, and refer to an entity, quality, state, action, or concept.
Let’s say we want to extract the noun phrases from our sentence. This can easily be done using the noun_phrases property.
my_sentence.noun_phrases
Output:
WordList(['blog post', 'analyticsvidhya'])
Sentiment analysis can assist us in determining the mood and feelings of the general public, as well as in obtaining useful information about the context. Sentiment analysis is the process of assessing text and categorizing it according to the opinion it expresses.
TextBlob returns the polarity and subjectivity of a statement. Polarity falls in the range [-1, 1], with -1 indicating a negative sentiment and 1 indicating a positive one; negation words reverse the polarity of a sentence. TextBlob also has semantic labels that aid fine-grained analysis, for example emoticons, exclamation marks, and emojis. Subjectivity falls in the range [0, 1] and measures the amount of personal opinion versus factual information in a text: the higher the subjectivity, the more the text contains personal opinion rather than factual information. TextBlob has one more parameter: intensity. TextBlob uses intensity to calculate subjectivity, and the intensity of a word determines whether it modifies the next word. In English, adverbs often serve as these modifiers.
Given an input sentence, TextBlob’s sentiment property returns a named tuple with polarity and subjectivity scores. The polarity score ranges from -1.0 to 1.0, and subjectivity ranges from 0.0 to 1.0, where 0.0 is a completely objective statement and 1.0 a completely subjective one.
my_sentence.sentiment
Output:
Sentiment(polarity=0.75, subjectivity=0.95)
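To see the modifier effect in action, we can compare a plain sentence with an intensified one. Here is a quick experiment (the exact scores depend on your TextBlob version, so they are not reproduced here):
from textblob import TextBlob

# The adverb "very" acts as an intensity modifier: it should push
# both the polarity and the subjectivity of the second sentence higher.
plain = TextBlob("The movie was good.")
intensified = TextBlob("The movie was very good!")
print(plain.sentiment)
print(intensified.sentiment)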
Up next is tokenization. In any NLP pipeline, tokenization is considered the first step. A tokenizer breaks down unstructured natural language text into chunks of information that can be treated as discrete elements. The token occurrences in a document can be used to build a vector that represents the document, turning an unstructured string (text document) into a numerical data structure suitable for machine learning. Tokens can also directly guide a computer’s operations and responses, or serve as features in a machine learning pipeline that prompt more complex decisions or actions.
Recent deep-learning-powered NLP algorithms interpret tokens within the context in which they appear, even a very extensive context. Because such a system can infer the “meaning” of unusual tokens from their context, this ability mitigates the heteronym problem and makes NLP systems more robust in the face of rare tokens. These relatively new capabilities have changed the way we tokenize text. Non-deep-learning systems commonly tokenize with a pipeline approach: after separating the text into token candidates (by splitting on white space or using more complex heuristics), related tokens are merged and noisy tokens are removed.
Modern tokenization systems, such as the SentencePiece or BPE algorithms, split and merge tokens into more intricate forms and are referred to as subword tokenizers. BPE, for example, is a tokenization technique that supports an open vocabulary by expressing rare words as sequences of more common subword units. One newer approach is to avoid tokenization almost completely and run NLP algorithms at the character level. This requires our models to process characters and understand their meanings while dealing with considerably longer sequences. Character-level NLP lets us sidestep the nuances of tokenization and the errors it can introduce, with sometimes astonishing results.
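As a minimal illustration of the classic pipeline approach described above, here is a deliberately naive tokenizer that splits on white space and then cleans up the candidates. (TextBlob does not tokenize this way internally; this sketch only makes the pipeline idea concrete.)
import string

def naive_tokenize(text):
    # Toy pipeline tokenizer: split on white space, strip punctuation,
    # lowercase, and drop empty ("noisy") candidates.
    candidates = text.split()
    tokens = [word.strip(string.punctuation).lower() for word in candidates]
    return [tok for tok in tokens if tok]

print(naive_tokenize("I am reading a blog post on AnalyticsVidhya. I am loving it!"))
# ['i', 'am', 'reading', 'a', 'blog', 'post', 'on', 'analyticsvidhya', 'i', 'am', 'loving', 'it']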
We can easily break down the sentences into words or sentences. We have words and sentences properties for that.
my_sentence.words
Output
WordList(['I', 'am', 'reading', 'a', 'blog', 'post', 'on', 'AnalyticsVidhya', 'I', 'am', 'loving', 'it'])
For sentences:
my_sentence.sentences
Output:
[Sentence("I am reading a blog post on AnalyticsVidhya."),
Sentence("I am loving it!")]
We can easily singularize and pluralize words with the help of the singularize() and pluralize() methods, respectively.
my_sentence.words[4].pluralize() # the word "blog"
Output:
'blogs'
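The singularize() method works the other way around. For example (importing the Word class here, which we will also use in the next section):
from textblob import Word
Word("blogs").singularize()
Output:
'blog'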
Lemmatization is the process of grouping the inflected forms of a word into a single item, its lemma. The lemmatize() method gives us this functionality. In Natural Language Processing (NLP) and machine learning, lemmatization is one of the most widely used text techniques in the pre-processing phase. Stemming is a closely related NLP concept: in both stemming and lemmatization, we try to reduce a given word to its root. In stemming, the root word is called a stem, and in lemmatization, it is called a lemma.
Lemmatization has the advantage of being more precise. This matters if you’re working on an NLP application such as a chatbot or a virtual assistant, where understanding the meaning of the conversation is critical. However, this precision comes at a price.
Because lemmatization entails looking up a word’s meaning in a source such as a dictionary, it is time-consuming. As a result, most lemmatization methods are slower than stemming techniques. Although lemmatization carries this processing expense, computational resources are rarely the deciding factor in an ML problem.
from textblob import Word
w = Word("radii")
w.lemmatize()
Output:
'radius'
Another example:
w = Word("went")
w.lemmatize("v")
Output:
'go'
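To see the stemming/lemmatization contrast from the previous paragraphs, we can place NLTK’s Porter stemmer next to TextBlob’s lemmatizer. A small comparison (the stemmer here comes straight from NLTK, not from TextBlob’s own API):
from textblob import Word
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# The stemmer chops suffixes mechanically, while the lemmatizer
# looks the word up, so their outputs can differ.
print(stemmer.stem("studies"))      # 'studi' -- a stem, not a real word
print(Word("studies").lemmatize())  # 'study' -- a valid lemma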
TextBlob also offers the functionality of defining a given word. The definitions property does the job.
Word("blog").definitions
Output:
['a shared on-line journal where people can post diary entries about their personal experiences and hobbies', 'read, write, or edit a shared on-line journal']
Synsets
The “synsets” property returns a list of synset objects for a particular word.
word = Word("phone")
word.synsets
Output:
[Synset('telephone.n.01'),
Synset('phone.n.02'),
Synset('earphone.n.01'),
Synset('call.v.03')]
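These synsets are standard WordNet Synset objects (TextBlob wraps NLTK’s WordNet interface), so we can, for instance, print the definition of each sense using the Word class imported earlier:
for syn in Word("phone").synsets:
    print(syn.name(), "->", syn.definition())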
The spell-check operation is performed by the correct() method. It uses the classic approach from Peter Norvig’s essay “How to Write a Spelling Corrector”.
my_sentence = TextBlob("I am not in denger. I am the dyangr.")
my_sentence.correct()
Output:
TextBlob("I am not in danger. I am the danger.")
Similarly, the `spellcheck()` method returns a list of likely corrections along with their confidences, as (word, confidence) tuples.
w = Word('neumonia')
w.spellcheck()
Output:
[('pneumonia', 1.0)]
The “word_counts” operation returns the number of occurrences of a particular word in the text; note that it ignores case.
betty = TextBlob("Betty Botter bought some butter. But she said the Butter’s bitter. If I put it in my batter, it will make my batter bitter. But a bit of better butter will make my batter better.")
betty.word_counts['butter']
Output:
3
To make the count case-sensitive, we can use the `.count(word, case_sensitive=True)` method instead.
betty.words.count('butter', case_sensitive=True)
Output:
2
The term “parsing” comes from the Latin word “pars” (meaning “part”). Parsing is used to extract exact or dictionary meaning from a text; it is also called syntactic analysis or syntax analysis. Syntax analysis checks the text for meaning by comparing it against formal grammar rules. So parsing, syntactic analysis, or syntax analysis can be defined as the process of analyzing strings of symbols in natural language according to formal grammar rules. The parse() method parses the TextBlob, annotating each word with its tags (word/part-of-speech tag/chunk tag/prepositional-noun-phrase tag).
betty.parse()
Output:
'Betty/NNP/B-NP/O Botter/NNP/I-NP/O bought/VBD/B-VP/O some/DT/B-NP/O butter/NN/I-NP/O ././O/O\nBut/CC/O/O she/PRP/B-NP/O said/VBD/B-VP/O the/DT/B-NP/O Butter/NN/I-NP/O ’/NN/I-NP/O s/PRP/I-NP/O bitter/JJ/B-ADJP/O ././O/O\nIf/IN/B-PP/B-PNP I/PRP/B-NP/I-PNP put/VB/B-VP/O it/PRP/B-NP/O in/IN/B-PP/B-PNP my/PRP$/B-NP/I-PNP batter/NN/I-NP/I-PNP ,/,/O/O it/PRP/B-NP/O will/MD/B-VP/O make/VB/I-VP/O my/PRP$/B-NP/O batter/NN/I-NP/O bitter/JJ/B-ADJP/O ././O/O\nBut/CC/O/O a/DT/B-NP/O bit/NN/I-NP/O of/IN/B-PP/B-PNP better/JJR/B-NP/I-PNP butter/NN/I-NP/I-PNP will/MD/B-VP/O make/VB/I-VP/O my/PRP$/B-NP/O batter/NN/I-NP/O better/JJR/B-ADJP/O ././O/O'
TextBlobs are similar to Python strings. They support basic slicing operations just like regular Python strings.
my_sentence = TextBlob("Simple is better than complex.") my_sentence[0:16]
Output:
TextBlob("Simple is better")
Apart from slicing, the upper() and lower() methods can also be used.
my_sentence.upper()
Output:
TextBlob("SIMPLE IS BETTER THAN COMPLEX.")
And just like with a regular string, we can also perform the find() operation.
my_sentence.find("better")
Output:
10
TextBlobs and Python strings can easily be concatenated.
a = TextBlob("Black")
b = TextBlob("Blue")
a + ' and ' + b
Output:
TextBlob("Black and Blue")
TextBlob objects can also be used with string formatting.
"{0} and {1}".format(a,b)
Output:
'Black and Blue'
Consider the following examples:
1) A cat is in the bag.
2) Say my name.
3) Good luck
An n-gram is simply a sequence of n words. In the examples above, the first statement, “A cat is in the bag”, is a 6-gram; likewise, “Say my name” is a 3-gram and “Good luck” is a 2-gram. N-grams are used for a wide range of tasks. When creating a language model, for example, n-grams are used to build not only unigram (1-gram) models but also bigram (2-gram), trigram (3-gram), and higher-order models. Researchers have built web-scale n-gram models for a number of applications, including spelling correction, word breaking, and text summarization. Another application of n-grams is in developing features for supervised machine learning models such as SVMs and Naive Bayes: instead of using only unigrams, the idea is to employ tokens such as bigrams in the feature space.
The n-gram operation (which returns a list of WordList objects, each containing n successive words) can also be easily performed.
bob = TextBlob("How many roads should a man must walk before we can call him a man?")
bob.ngrams(n=3)
Output:
[WordList(['How', 'many', 'roads']), WordList(['many', 'roads', 'should']), WordList(['roads', 'should', 'a']), WordList(['should', 'a', 'man']), WordList(['a', 'man', 'must']), WordList(['man', 'must', 'walk']), WordList(['must', 'walk', 'before']), WordList(['walk', 'before', 'we']), WordList(['before', 'we', 'can']), WordList(['we', 'can', 'call']), WordList(['can', 'call', 'him']), WordList(['call', 'him', 'a']), WordList(['him', 'a', 'man'])]
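As a small sketch of the feature-engineering idea mentioned above, we can turn the ngrams() output into bigram counts, the kind of raw feature vector a model like an SVM or Naive Bayes could consume (the sentence here is just an illustrative example):
from collections import Counter
from textblob import TextBlob

blob = TextBlob("good luck and good fortune bring good luck")
bigrams = [' '.join(gram) for gram in blob.ngrams(n=2)]
print(Counter(bigrams))
# Counter({'good luck': 2, 'luck and': 1, 'and good': 1, 'good fortune': 1, 'fortune bring': 1, 'bring good': 1})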
Alright! We’ve seen many easy and fun applications of TextBlob. The real deal, however, is to apply these functionalities in your own projects and create something meaningful out of the knowledge we’ve gathered. While NLP has a huge spectrum of applications, TextBlob certainly helps us lay the stepping stones.
Hi there! My name is Akash and I’ve been working as a Python developer for over 4 years now. I began my career as a Junior Python Developer at Nepal’s biggest job portal site, Merojob. Later, I was involved in data science and research at Nepal’s first ride-sharing company, Tootle. Currently, I’m actively involved in Data Science as well as Web Development with Django.
You can find my other projects on:
Connect with me on LinkedIn
Email: [email protected] | [email protected]
Website: https://akashadhikari.github.io/
Thanks for reading!
I hope you enjoyed reading the article. If you found it useful, please share it among your friends on social media too. For any queries, suggestions, constructive criticisms, or any other discussion, please ping me here in the comments or you can directly reach me through email.