I was amazed that Roger Bacon gave the above quote in the 13th century, and it still holds, doesn’t it? I am sure that you all will agree with me.
Today, the way of understanding languages has changed a lot from the 13th century. We now refer to it as linguistics and natural language processing. But its importance hasn’t diminished; instead, it has increased tremendously. You know why? Because its applications have rocketed and one of them is the reason why you landed on this article.
Each of these applications involves complex NLP techniques, and to understand them, one must have a good grasp of the basics of NLP. Therefore, getting the fundamentals right is important before moving on to complex topics.
That’s why I have created this article, in which I will cover some basic concepts of NLP: Part-of-Speech (POS) tagging, dependency parsing, and constituency parsing. We will understand these concepts and also implement them in Python. So let’s begin!
In this article, you will learn about POS tagging in NLP, explore online tools for POS tagging, see a POS tagging example, and discover various POS tagging types.
Part-of-Speech (POS) tagging is a natural language processing technique that involves assigning specific grammatical categories or labels (such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to individual words within a sentence. This process provides insights into the syntactic structure of the text, aiding in understanding word relationships, disambiguating word meanings, and facilitating various linguistic and computational analyses of textual data.
In our school days, all of us studied the parts of speech, which include nouns, pronouns, adjectives, verbs, etc. Words belonging to various parts of speech form a sentence. Knowing the part of speech of each word in a sentence is important for understanding it.
That’s the reason for the creation of the concept of POS tagging. I’m sure that by now, you have already guessed what POS tagging is. Still, allow me to explain it to you.
Part-of-Speech (POS) tagging is the process of assigning labels, known as POS tags, to the words in a sentence. Each tag tells us the part of speech of its word.
Broadly there are two types of POS tags:
Universal POS tags: These tags are used in Universal Dependencies (UD) (latest version 2), a project that is developing cross-linguistically consistent treebank annotation for many languages. These tags are based on the type of words, e.g., NOUN (common noun), ADJ (adjective), ADV (adverb).
You can read more about each one of them here.
Detailed POS tags: These tags result from dividing the universal POS tags into finer tags. For example, in English, NNS marks plural common nouns and NN marks singular common nouns, whereas the universal tag NOUN covers both. These tags are language-specific. You can take a look at the complete list here.
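To make the relationship between the two tag types concrete, here is a minimal sketch that maps a few detailed English tags to their universal counterparts. The dictionary is a small hand-picked subset, and the names FINE_TO_UNIVERSAL and to_universal are made up for this illustration. Real taggers use the context of a word as well, so treat the mapping as a simplification.

```python
# Illustrative subset of detailed (Penn Treebank style) tags mapped to
# universal POS tags. The names below are hypothetical, chosen for this sketch.
FINE_TO_UNIVERSAL = {
    "NN": "NOUN",    # singular common noun
    "NNS": "NOUN",   # plural common noun
    "NNP": "PROPN",  # singular proper noun
    "JJ": "ADJ",     # adjective
    "JJR": "ADJ",    # comparative adjective
    "RB": "ADV",     # adverb
    "PRP": "PRON",   # personal pronoun
    "DT": "DET",     # determiner
    "CD": "NUM",     # cardinal number
}

def to_universal(fine_tag):
    """Return the universal tag for a detailed tag, or "X" (other) if unknown."""
    return FINE_TO_UNIVERSAL.get(fine_tag, "X")

for tag in ["NN", "NNS", "JJ", "CD"]:
    print(tag, "=>", to_universal(tag))
```

Note how NN and NNS both collapse into NOUN: the detailed tags carry extra information (here, number) that the universal tags leave out.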
Now you know what POS tags are and what POS tagging is. So let’s write the code in Python for POS tagging sentences. For this purpose, I have used spaCy here, but other libraries like NLTK and Stanza can do the same.
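import spacy

# Load spaCy's small English model
nlp = spacy.load('en_core_web_sm')
text = 'It took me more than two hours to translate a few pages of English.'
# Print each token with its universal (pos_) and detailed (tag_) POS tag
for token in nlp(text):
    print(token.text, '=>', token.pos_, '=>', token.tag_)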
In the above code sample, I have loaded spaCy’s en_core_web_sm model and used it to get the POS tags. You can see that pos_ returns the universal POS tags, and tag_ returns the detailed POS tags for the words in the sentence.
Dependency parsing is the process of analyzing the grammatical structure of a sentence based on the dependencies between the words in a sentence.
In dependency parsing, various tags represent the relationship between two words in a sentence. These tags are the dependency tags. For example, in the phrase ‘rainy weather,’ the word rainy modifies the meaning of the noun weather. Therefore, a dependency exists from weather -> rainy, in which weather acts as the head and rainy acts as the dependent or child. This dependency is represented by the amod tag, which stands for adjectival modifier.
Similarly, there exist many dependencies among the words in a sentence, but note that a dependency involves only two words, in which one acts as the head and the other acts as the child. As of now, there are 37 universal dependency relations used in Universal Dependencies (version 2). You can take a look at all of them here. Apart from these, there also exist many language-specific tags.
Now let’s use spaCy and find the dependencies in a sentence.
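import spacy

# Load spaCy's small English model
nlp = spacy.load('en_core_web_sm')
text = 'It took me more than two hours to translate a few pages of English.'
# Print each token with its dependency tag and its head word
for token in nlp(text):
    print(token.text, '=>', token.dep_, '=>', token.head.text)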
In the above code example, dep_ returns the dependency tag for a word, and head.text returns the respective head word. If you look at the output, the word took has a dependency tag of ROOT. This tag is assigned to the word that acts as the head of other words in a sentence but is not a child of any other word. Generally, it is the main verb of the sentence, like ‘took’ in this case.
Now you know what dependency tags are and what the head, child, and root words are. But doesn’t parsing mean generating a parse tree?
Yes, we’re generating the tree here, but we’re not visualizing it. The tree generated by dependency parsing is known as a dependency tree. There are multiple ways of visualizing it, but for the sake of simplicity, we’ll use displaCy which is used for visualizing the dependency parse.
from spacy import displacy
displacy.render(nlp(text),jupyter=True)
In the above image, the arrows represent the dependency between two words: the word at the arrowhead is the child, and the word at the tail of the arrow is the head. The root word can act as the head of multiple words in a sentence but is not a child of any other word. You can see above that the word ‘took’ has multiple outgoing arrows but none incoming. Therefore, it is the root word. One interesting thing about the root word is that if you start from any word in the sentence and keep moving from each word to its head, you always reach the root word.
Now you know about dependency parsing, so let’s learn about another type of parsing known as constituency parsing.
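The claim that every word leads back to the root can be checked with a few lines of plain Python. The sketch below uses hand-written head pointers for the toy sentence "rainy weather arrived" (illustrative annotations, not parser output), and the names HEADS and path_to_root are made up for this example. It follows spaCy's convention that the root word is its own head, and it assumes every word in the sentence is unique so that words can serve as dictionary keys.

```python
# Hand-written head pointers for the toy sentence "rainy weather arrived".
# These are illustrative annotations, not the output of a parser.
# As in spaCy, the root word is recorded as its own head.
HEADS = {
    "rainy": "weather",    # amod: rainy modifies weather
    "weather": "arrived",  # nsubj: weather is the subject of arrived
    "arrived": "arrived",  # ROOT
}

def path_to_root(word, heads):
    """Follow head pointers from word until the root is reached."""
    path = [word]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path

for word in HEADS:
    print(" -> ".join(path_to_root(word, HEADS)))
```

Every path ends at the same word, which is what makes the dependency structure a tree.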
Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents. These sub-phrases belong to a specific category of grammar, like NP (noun phrase) and VP (verb phrase).
Let’s understand it with the help of an example. Suppose I take the same sentence that I used in the previous examples, i.e., “It took me more than two hours to translate a few pages of English.”, and perform constituency parsing on it. Then, the constituency parse tree for this sentence is given by:
In the above tree, the words of the sentence are written in purple, and the POS tags are written in red. Everything else is written in black and represents the constituents. You can clearly see how the whole sentence is divided into sub-phrases until only the words remain at the terminals. Also, there are different tags for denoting constituents, like NP (noun phrase) and VP (verb phrase).
These are the constituent tags. You can read about different constituent tags here.
Now you know what constituency parsing is, so it’s time to code in Python. spaCy does not provide an official API for constituency parsing. Therefore, we will be using the Berkeley Neural Parser. It is a Python implementation of the parsers based on Constituency Parsing with a Self-Attentive Encoder from ACL 2018.
You can also use StanfordParser with Stanza or NLTK for this purpose, but here I have used the Berkeley Neural Parser. To use it, we first need to install it. You can do that by running the following command.
!pip install benepar
Then you have to download the benepar_en2 model.
%tensorflow_version 1.x
import benepar
benepar.download('benepar_en2')
You might have noticed that I am using TensorFlow 1.x here because, currently, benepar does not support TensorFlow 2.0. Now, it’s time to do constituency parsing.
import spacy
from benepar.spacy_plugin import BeneparComponent

# Load spaCy's English model (the same one used above) and add the benepar model to its pipeline
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(BeneparComponent('benepar_en2'))
text = 'It took me more than two hours to translate a few pages of English.'
# Generate a parse tree for the first sentence of the text
print(list(nlp(text).sents)[0]._.parse_string)
Here, _.parse_string generates the parse tree as a string.
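A parse tree in string form is usually written in bracketed notation, where each pair of parentheses holds a tag followed by its children. As a rough sketch of how such a string can be turned back into a tree and its constituents listed, here is a small pure-Python reader. The example string is hand-written for illustration and is not actual benepar output, and the function names are made up for this sketch.

```python
def parse_brackets(s):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                  # skip "("
        label = tokens[pos]       # constituent or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word at a terminal
                pos += 1
        pos += 1                  # skip ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Collect the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words

def constituents(node):
    """Yield (tag, phrase) for every node that contains other nodes."""
    label, children = node
    if any(isinstance(child, tuple) for child in children):
        yield label, " ".join(leaves(node))
    for child in children:
        if isinstance(child, tuple):
            yield from constituents(child)

# Hand-written example string (an assumption, not real parser output).
example = "(S (NP (PRP It)) (VP (VBD took) (NP (CD two) (NNS hours))) (. .))"
tree = parse_brackets(example)
for label, phrase in constituents(tree):
    print(label, "=>", phrase)
```

Nodes whose only child is a word (such as PRP or VBD) are POS tags, so the sketch skips them and reports only the phrase-level constituents.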
POS tagging has many use cases, and it plays a critical role in various aspects of natural language processing and syntactic analysis.
Here are some reasons why POS tagging is challenging:
Word ambiguity: Many words in a corpus have multiple meanings and parts of speech depending on the context. For instance, “bat” can be a noun (a flying mammal) or a verb (to hit something). A part-of-speech tagger needs to consider the surrounding words to assign the correct tag.
Unknown words and complex grammar: Part-of-speech taggers are trained on large amounts of training data, but they can struggle with words they haven’t encountered before (out-of-vocabulary words) or with languages that have complex grammatical structures.
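To illustrate both problems, here is a deliberately tiny rule-based tagger. It is only a sketch: the lexicon, the two context rules, and the names LEXICON and tag_sentence are all invented for this example, and real taggers learn such preferences from training data instead of hand-written rules.

```python
# Toy lexicon: each word lists its possible universal POS tags, with the
# default tag first. All entries are hand-written for this sketch.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "they": ["PRON"],
    "bat": ["NOUN", "VERB"],  # ambiguous word
    "flew": ["VERB"],
    "away": ["ADV"],
    "first": ["ADV"],
}

def tag_sentence(words):
    """Tag words using the lexicon plus the tag of the previous word."""
    tags = []
    for word in words:
        # Out-of-vocabulary words get "X" here; this sketch does not guess.
        candidates = LEXICON.get(word.lower(), ["X"])
        tag = candidates[0]
        prev = tags[-1] if tags else None
        if len(candidates) > 1:
            # Context rules: a determiner is followed by a noun,
            # a pronoun subject is followed by a verb.
            if prev == "DET" and "NOUN" in candidates:
                tag = "NOUN"
            elif prev == "PRON" and "VERB" in candidates:
                tag = "VERB"
        tags.append(tag)
    return list(zip(words, tags))

print(tag_sentence(["The", "bat", "flew", "away"]))
print(tag_sentence(["They", "bat", "first"]))
print(tag_sentence(["The", "zorp", "flew"]))
```

The same word "bat" receives a different tag depending on its left neighbor, while the made-up word "zorp" falls through to X because it is not in the lexicon.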
Here are some additional factors that make POS tagging tricky:
Now you know what POS tagging, dependency parsing, and constituency parsing are and how they help you understand text data: POS tags tell you the part of speech of each word in a sentence, dependency parsing tells you about the dependencies between the words in a sentence, and constituency parsing tells you about the sub-phrases or constituents of a sentence. You are now ready to move on to more complex parts of NLP. As your next step, you can read the following articles on information extraction.
Hope you like the article! Part-of-speech (POS) tagging is essential for understanding text structure. It labels words with grammatical categories, which enhances machine comprehension, aids disambiguation, and improves algorithm accuracy. Overall, it is a foundational technique for various NLP applications.
POS tagging assigns grammatical categories (tags) to words in a text. It helps machines understand language better and is used in tasks like translation, sentiment analysis, and information extraction.
POS tagging is crucial for NLP as it helps computers understand the grammatical structure and meaning of text. It’s used in tasks like syntactic analysis, semantic analysis, information extraction, machine translation, and text generation.
POS tagging is a process in NLP that assigns a grammatical category (e.g., noun, verb) to each word in a sentence. It uses various features and algorithms to achieve this, and has many applications in NLP tasks.
POS tagging is not language-independent. While there are some universal grammatical concepts, the specifics vary significantly across languages due to morphological differences, syntactic structures, lexical ambiguity, and tag sets. However, researchers are working on approaches like universal tag sets and transfer learning to make POS tagging more language-independent.