How Part-of-Speech Tag, Dependency and Constituency Parsing Aid In Understanding Text Data?

Abhishek Sharma Last Updated : 12 Nov, 2024

Overview

Learn about Part-of-Speech (POS) tagging
Understand dependency parsing and constituency parsing

I was amazed that Roger Bacon gave the above quote in the 13th century and that it still holds today, doesn’t it? I am sure you will agree with me.

Today, the way we study languages has changed a lot since the 13th century. We now refer to it as linguistics and natural language processing (NLP). But its importance hasn’t diminished; instead, it has increased tremendously. Do you know why? Because its applications have rocketed, and one of them is the reason you landed on this article.

Each of these applications involves complex NLP techniques, and to understand them, one must have a good grasp of the basics of NLP. Therefore, before moving on to complex topics, it is important to get the fundamentals right.

That’s why I have written this article, in which I cover some basic concepts of NLP: Part-of-Speech (POS) tagging, dependency parsing, and constituency parsing. We will understand these concepts and also implement them in Python. So let’s begin!

In this article, you will learn what POS tagging is, see a POS tagging example, and discover the different types of POS tags.

What is Part-of-Speech (POS) Tagging?

Part-of-Speech (POS) tagging is a natural language processing technique that involves assigning specific grammatical categories or labels (such as nouns, verbs, adjectives, adverbs, pronouns, etc.) to individual words within a sentence. This process provides insights into the syntactic structure of the text, aiding in understanding word relationships, disambiguating word meanings, and facilitating various linguistic and computational analyses of textual data.

In our school days, all of us studied the parts of speech, which include nouns, pronouns, adjectives, verbs, etc. Words belonging to various parts of speech form a sentence. Knowing the part of speech of each word in a sentence is important for understanding it.

That’s the reason the concept of POS tagging was created. I’m sure that by now, you have already guessed what POS tagging is. Still, allow me to explain it to you.

Part-of-Speech (POS) tagging is the process of assigning labels, known as POS tags, to the words in a sentence. Each tag tells us the part of speech of its word.

Broadly there are two types of POS tags:

Universal POS Tags

These tags are used in Universal Dependencies (UD, latest version 2), a project that is developing cross-linguistically consistent treebank annotation for many languages. The tags are based on the type of word, e.g., NOUN (common noun), ADJ (adjective), ADV (adverb).

List of Universal POS Tags

[Image: list of Universal POS tags]

You can read more about each one of them here.

Detailed POS Tags

These tags result from dividing the universal POS tags into finer-grained tags. For example, in English the universal tag NOUN is split into NN for singular common nouns and NNS for plural common nouns. These tags are language-specific. You can take a look at the complete list here.
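The relationship between the two tag sets can be pictured as a many-to-one lookup. The sketch below is a hand-written and deliberately incomplete mapping from a few Penn Treebank-style detailed tags to universal tags. The names DETAILED_TO_UNIVERSAL and to_universal are made up for this illustration and are not part of any library; real pipelines also use context, so treat this only as a rough picture.

```python
# A small, hand-picked subset of detailed (Penn Treebank-style) tags and the
# universal tag each one rolls up to. Illustrative only, not a complete table.
DETAILED_TO_UNIVERSAL = {
    "NN": "NOUN",     # singular common noun
    "NNS": "NOUN",    # plural common noun
    "NNP": "PROPN",   # singular proper noun
    "NNPS": "PROPN",  # plural proper noun
    "JJ": "ADJ",      # adjective
    "JJR": "ADJ",     # comparative adjective
    "RB": "ADV",      # adverb
    "VB": "VERB",     # verb, base form
    "VBD": "VERB",    # verb, past tense
}


def to_universal(detailed_tag):
    """Return the universal tag for a detailed tag, or 'X' if it is not in our subset."""
    return DETAILED_TO_UNIVERSAL.get(detailed_tag, "X")


for tag in ["NN", "NNS", "VBD", "JJR"]:
    print(tag, "=>", to_universal(tag))
```

Several detailed tags collapse into the same universal tag, which is why the detailed tags carry more information about number, tense, and degree.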

Now you know what POS tags are and what POS tagging is. So let’s write the code in Python for POS tagging sentences. For this purpose, I have used spaCy here, but there are other libraries, like NLTK and Stanza, that can do the same.

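import spacy

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

text = 'It took me more than two hours to translate a few pages of English.'

# Print each token with its universal POS tag (pos_) and its detailed tag (tag_)
for token in nlp(text):
    print(token.text, '=>', token.pos_, '=>', token.tag_)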

[Image: output of the POS tagging code]

In the above code sample, I have loaded spaCy’s en_core_web_sm model and used it to get the POS tags. You can see that pos_ returns the universal POS tags, and tag_ returns the detailed POS tags for the words in the sentence.

Dependency Parsing

Dependency parsing is the process of analyzing the grammatical structure of a sentence based on the dependencies between its words.

In dependency parsing, various tags represent the relationship between two words in a sentence. These tags are the dependency tags. For example, in the phrase ‘rainy weather,’ the word rainy modifies the meaning of the noun weather. Therefore, a dependency exists from weather to rainy (weather -> rainy), in which weather acts as the head and rainy acts as the dependent or child. This dependency is represented by the amod tag, which stands for adjectival modifier.

Similarly, there exist many dependencies among the words in a sentence, but note that a dependency involves only two words, in which one acts as the head and the other acts as the child. As of now, there are 37 universal dependency relations used in Universal Dependencies (version 2). You can take a look at all of them here. Apart from these, there also exist many language-specific tags.
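One way to picture this is to store every dependency as a (child, relation, head) triple. The sketch below uses arcs written by hand for the short sentence "I like rainy weather"; they are not parser output. The Arc type and the children_of helper are hypothetical names for this illustration, and dobj is the label spaCy's English models use for a direct object.

```python
from collections import namedtuple

# One dependency arc: a child word, the relation label, and its head word.
Arc = namedtuple("Arc", ["child", "relation", "head"])

# Hand-annotated arcs for "I like rainy weather" (written by hand, not parser output).
arcs = [
    Arc("I", "nsubj", "like"),
    Arc("like", "ROOT", "like"),      # following spaCy's convention, the root is its own head
    Arc("rainy", "amod", "weather"),
    Arc("weather", "dobj", "like"),
]


def children_of(head, arcs):
    """Return (child, relation) pairs for every word that depends on `head`."""
    return [(a.child, a.relation) for a in arcs if a.head == head and a.child != head]


print(children_of("weather", arcs))
print(children_of("like", arcs))
```

Each arc links exactly two words, yet a single head such as "like" can appear in several arcs.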

Now let’s use spaCy and find the dependencies in a sentence.

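import spacy

# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

text = 'It took me more than two hours to translate a few pages of English.'

# Print each token with its dependency tag (dep_) and the text of its head word
for token in nlp(text):
    print(token.text, '=>', token.dep_, '=>', token.head.text)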

In the above code example, dep_ returns the dependency tag for a word, and head.text returns the respective head word. If you look at the output, the word took has a dependency tag of ROOT. This tag is assigned to the word that acts as the head of other words in the sentence but is not a child of any other word. Generally, it is the main verb of the sentence, like ‘took’ in this case.

Now you know what dependency tags are and what the head, child, and root words are. But doesn’t parsing mean generating a parse tree?

Yes, we’re generating the tree here, but we’re not visualizing it. The tree generated by dependency parsing is known as a dependency tree. There are multiple ways of visualizing it, but for the sake of simplicity, we’ll use displaCy, spaCy’s built-in visualizer for dependency parses.

from spacy import displacy

# Draw the dependency tree inline in a Jupyter notebook
displacy.render(nlp(text), style='dep', jupyter=True)

In the above image, the arrows represent the dependency between two words: the word at the arrowhead is the child, and the word at the tail of the arrow is the head. The root word can act as the head of multiple words in a sentence but is not a child of any other word. You can see above that the word ‘took’ has multiple outgoing arrows but none incoming. Therefore, it is the root word. One interesting thing about the root word is that if you trace the dependencies from child to head, you reach the root word no matter which word you start from.

Now you know about dependency parsing, so let’s learn about another type of parsing known as constituency parsing.
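The claim that every word leads back to the root can be checked with a few lines of plain Python. This is a minimal sketch that assumes a hand-annotated head table for "I like rainy weather" (not parser output); HEADS and path_to_root are hypothetical names. Using words as dictionary keys only works here because no word repeats in the sentence.

```python
# Head of each word in "I like rainy weather", annotated by hand.
# The root word points to itself, as spaCy does for token.head.
HEADS = {
    "I": "like",
    "like": "like",
    "rainy": "weather",
    "weather": "like",
}


def path_to_root(word, heads):
    """Follow head links from `word` until reaching the word that is its own head."""
    path = [word]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


for word in HEADS:
    print(" -> ".join(path_to_root(word, HEADS)))
```

Every path ends at "like", the root of this small tree.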

Constituency Parsing

Constituency parsing is the process of analyzing a sentence by breaking it down into sub-phrases, also known as constituents. These sub-phrases belong to a specific grammatical category, like NP (noun phrase) and VP (verb phrase).

Let’s understand it with the help of an example. Suppose I take the same sentence that I used in the previous examples, i.e., “It took me more than two hours to translate a few pages of English.”, and perform constituency parsing on it. The constituency parse tree for this sentence is then as follows:

In the above tree, the words of the sentence are written in purple, and the POS tags are written in red. Everything else is written in black and represents the constituents. You can clearly see how the whole sentence is divided into sub-phrases until only the words remain at the terminals. Also, there are different tags for denoting constituents, like

VP for verb phrases
NP for noun phrases

These are the constituent tags. You can read about different constituent tags here.

Now you know what constituency parsing is, so it’s time to code in Python. spaCy does not provide an official API for constituency parsing. Therefore, we will be using the Berkeley Neural Parser. It is a Python implementation of the parser described in the paper “Constituency Parsing with a Self-Attentive Encoder” from ACL 2018.

You can also use StanfordParser with Stanza or NLTK for this purpose, but here I have used the Berkeley Neural Parser. To use it, we first need to install it. You can do that by running the following command.

!pip install benepar

Then you have to download the benepar_en2 model.

%tensorflow_version 1.x
import benepar
benepar.download('benepar_en2')

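You might have noticed that I am using TensorFlow 1.x here because, at the time of writing, benepar did not support TensorFlow 2.0. Note that this setup applies to the older TensorFlow-based releases of benepar; releases from 0.2.0 onward are built on PyTorch and ship a benepar_en3 model instead. Now, it’s time to do constituency parsing.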

import spacy
from benepar.spacy_plugin import BeneparComponent

# Load spaCy's English model ('en' is the spaCy 2 shortcut name) and add benepar to its pipeline
nlp = spacy.load('en')
nlp.add_pipe(BeneparComponent('benepar_en2'))

text = 'It took me more than two hours to translate a few pages of English.'

# Generate the parse tree for the first sentence of the text
print(list(nlp(text).sents)[0]._.parse_string)

Here, _.parse_string returns the parse tree as a bracketed string.
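A bracketed parse string is easy to work with once it is turned into a nested structure. The following is a minimal sketch in plain Python; the example string is written by hand in the same bracketed style for the short sentence "I like rainy weather" and is not benepar output. The helpers parse_brackets, leaves, and phrases are hypothetical names, and the reader assumes that words never contain parentheses.

```python
def parse_brackets(s):
    """Turn a bracketed parse string into nested (label, children) tuples; leaves are plain words."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in leaves(child)]


def phrases(node, label):
    """Return the text of every constituent carrying the given label."""
    if isinstance(node, str):
        return []
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1]:
        found.extend(phrases(child, label))
    return found


# Hand-written example string in the same bracketed style (not benepar output).
tree = parse_brackets("(S (NP (PRP I)) (VP (VBP like) (NP (JJ rainy) (NN weather))))")
print(phrases(tree, "NP"))
print(phrases(tree, "VP"))
```

Collecting all NP or VP spans this way is a common first step when constituents are needed for tasks such as information extraction.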

What are the use cases of POS tagging?

Here are some use cases of POS tagging:

Syntactic analysis: By identifying the grammatical category of each word (e.g., noun, verb, adjective), POS tagging helps analyze sentence structure and the relationships between words. Taggers often rely on hidden Markov models and other algorithms that predict the most likely sequence of POS tags for the given text.
Disambiguation: Words like “play” can be a noun or a verb. POS tagging identifies the correct reading from context, using a tagset that defines the possible tags for each word.
Language modeling: POS tags provide valuable information about the relationships between words, which is useful for building statistical models of language. These models can be combined with deep learning techniques to improve their accuracy and their handling of complex linguistic patterns.
Preprocessing for other NLP tasks: POS tagging is often a preliminary step for tasks like named entity recognition and information extraction. Knowing the part of speech of each word, including function words such as prepositions, makes it easier to understand the structure of the text and to work out how the entities in a sentence relate to each other.
Lemmatization and stemming: These techniques reduce words to their base forms (e.g., “running” to “run”). POS tags help pick the correct base form for the word’s function in the sentence, for example whether it is used as a noun or as a verb.
Grammar checking: POS information can be used to flag potential grammatical errors, like using a verb in the wrong tense. Grammar checking software relies on the output of a POS tagger to identify such mistakes.
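To make the idea of predicting the most likely tag sequence with a hidden Markov model more concrete, here is a minimal Viterbi sketch. All probabilities below are invented for illustration and are not estimated from any corpus, and the function and variable names are hypothetical. It shows how the same word, "play", receives a different tag depending on the word before it.

```python
TAGS = ["DET", "PRON", "NOUN", "VERB"]

# Made-up probabilities for a toy model; a real tagger estimates these from data.
START_P = {"DET": 0.4, "PRON": 0.4, "NOUN": 0.1, "VERB": 0.1}
TRANS_P = {
    "DET": {"NOUN": 0.9, "VERB": 0.1},
    "PRON": {"VERB": 0.8, "NOUN": 0.2},
    "NOUN": {"VERB": 0.6, "NOUN": 0.4},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "PRON": 0.2},
}
EMIT_P = {
    "DET": {"the": 1.0},
    "PRON": {"they": 1.0},
    "NOUN": {"play": 0.5, "game": 0.5},
    "VERB": {"play": 0.7, "run": 0.3},
}


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under the toy HMM."""
    # best[i][t] = (probability of the best path ending in tag t at position i, previous tag)
    best = [{t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            column[t] = max(
                (best[i - 1][p][0] * trans_p[p].get(t, 0.0) * emit_p[t].get(words[i], 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


print(viterbi(["the", "play"], TAGS, START_P, TRANS_P, EMIT_P))
print(viterbi(["they", "play"], TAGS, START_P, TRANS_P, EMIT_P))
```

The sketch multiplies raw probabilities, which is fine for two-word inputs; implementations for longer texts usually work with log probabilities to avoid underflow.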

Taken together, these use cases show how POS tagging plays a critical role in many aspects of natural language processing and syntactic analysis.

Why is POS tagging hard?

Here are some reasons why POS tagging is challenging:

Word ambiguity: Many words in a corpus have multiple meanings and parts of speech depending on the context. For instance, “bat” can be a noun (a flying mammal) or a verb (to hit something). A part-of-speech tagger needs to consider the surrounding words to assign the correct tag.

Unknown words and complex grammar: Part-of-speech taggers are trained on large amounts of training data, but they can struggle with words they haven’t encountered before (out-of-vocabulary words) or with languages that have complex grammatical structures.

Here are some additional factors that make POS tagging tricky:

Idioms and slang: Informal language constructs often don’t follow standard grammar rules, making them difficult to tag accurately.
Domain dependence: A part-of-speech tagger trained on a general dataset might not perform well on very specific domains, like legal documents or medical reports.
Perception: Readers can interpret the same text differently, so even human judgments about the correct tag can disagree.
Cardinal numbers: Numbers can be challenging because they can function as nouns, adjectives, or other parts of speech depending on their use in a sentence.
Transformation-based methods: These methods refine an initial tagging using a set of learned rules, which improves accuracy but adds complexity to the tagging process.
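Several of these difficulties show up even in a tiny tagger. The sketch below is a simplified illustration and not a full implementation of any published tagger: it first assigns each word its most frequent tag from a hand-made training set, falls back to NN for unknown words, and then applies one hand-written transformation rule of the kind a transformation-based tagger would learn. The training pairs, function names, and the rule itself are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set of (word, tag) pairs; real taggers train on large corpora.
TRAINING = [
    ("to", "TO"), ("the", "DT"), ("a", "DT"), ("I", "PRP"), ("want", "VBP"),
    ("race", "NN"), ("race", "NN"), ("race", "VB"),
    ("run", "VB"), ("won", "VBD"),
]


def train_baseline(pairs):
    """Map each word to the tag it carries most often in the training pairs."""
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag_baseline(words, lexicon, unknown_tag="NN"):
    """Tag each word with its most frequent tag; unseen words fall back to `unknown_tag`."""
    return [lexicon.get(w, unknown_tag) for w in words]


def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Transformation rule: change from_tag to to_tag when the previous tag is prev_tag."""
    fixed = list(tags)
    for i in range(1, len(fixed)):
        if fixed[i] == from_tag and fixed[i - 1] == prev_tag:
            fixed[i] = to_tag
    return fixed


lexicon = train_baseline(TRAINING)
words = ["I", "want", "to", "race"]
initial = tag_baseline(words, lexicon)
refined = apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO")
print(initial)
print(refined)
```

The baseline tags "race" as a noun because that is its most frequent tag in the training pairs, and the rule corrects it to a verb because it follows "to". An unknown word simply receives the fallback tag, which is one reason out-of-vocabulary words hurt accuracy.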

End Notes

Now you know what POS tagging, dependency parsing, and constituency parsing are and how they help you understand text data: POS tags tell you the part of speech of each word in a sentence, dependency parsing tells you about the dependencies between the words in a sentence, and constituency parsing tells you about the sub-phrases or constituents of a sentence. You are now ready to move on to more complex parts of NLP. As your next step, you can read articles on information extraction.

I hope you liked the article! POS tagging labels words with grammatical categories, which helps machines make sense of text structure. It aids in disambiguation, improves the accuracy of downstream algorithms, and serves as a foundational technique for many NLP applications.

Frequently Asked Questions

Q1. What is POS tagging?

POS tagging assigns grammatical categories (tags) to words in a text. It helps machines understand language better and is used in tasks like translation, sentiment analysis, and information extraction.

Q2. Why is POS tagging important?

POS tagging is crucial for NLP as it helps computers understand the grammatical structure and meaning of text. It’s used in tasks like syntactic analysis, semantic analysis, information extraction, machine translation, and text generation.

Q3. How does POS tagging work?

A POS tagger assigns a grammatical category (e.g., noun, verb) to each word in a sentence by looking at the word itself and its surrounding context. Common approaches include hidden Markov models, transformation-based (rule-refining) methods, and deep learning models.

Q4. Can POS tagging be language-independent?

POS tagging is not language-independent. While there are some universal grammatical concepts, the specifics vary significantly across languages due to morphological differences, syntactic structures, lexical ambiguity, and tag sets. However, researchers are working on approaches like universal tag sets and transfer learning to make POS tagging more language-independent.

Abhishek Sharma

He is a data science aficionado who loves diving into data and generating insights from it. He is always ready to make machines learn through code and to write technical blogs. His areas of interest include Machine Learning and Natural Language Processing, and he is still open to something new and exciting.
