Introduction to StanfordNLP: An Incredible State-of-the-Art NLP Library for 53 Languages (with Python code)

Mohd Sanad Zaki Rizvi Last Updated : 12 May, 2020

Introduction

A question I kept running into while learning Natural Language Processing (NLP): can we build models for non-English languages? For quite a long time, the answer was no. Each language has its own grammatical patterns and linguistic nuances, and there just aren’t many datasets available in other languages.

That’s where Stanford’s latest NLP library steps in – StanfordNLP.

I could barely contain my excitement when I read the news last week. The authors claimed StanfordNLP could support 53 human languages! Yes, I had to double-check that number.

I decided to check it out myself. There’s no official tutorial for the library yet, so I got the chance to experiment and play around with it. And I found that it opens up a world of possibilities. StanfordNLP contains pre-trained models for Asian languages like Hindi, Chinese and Japanese in their original scripts.

The ability to work with multiple languages is something all NLP enthusiasts crave. In this article, we will walk through what StanfordNLP is, why it’s so important, and then fire up Python to see it live in action. We’ll also take up a case study in Hindi to showcase how StanfordNLP works – you don’t want to miss that!

What is StanfordNLP and Why Should You Use it?
Setting up StanfordNLP in Python
Using StanfordNLP to Perform Basic NLP Tasks
Implementing StanfordNLP on the Hindi Language
Using CoreNLP’s API for Text Analytics

What is StanfordNLP and Why Should You Use it?

Here is StanfordNLP’s description by the authors themselves:

StanfordNLP is the combination of the software package used by the Stanford team in the CoNLL 2018 Shared Task on Universal Dependency Parsing, and the group’s official Python interface to the Stanford CoreNLP software.

That’s too much information in one go! Let’s break it down:

CoNLL is an annual conference on Natural Language Learning. Teams representing research institutes from all over the world try to solve an NLP-based task.
One of the tasks last year was “Multilingual Parsing from Raw Text to Universal Dependencies”. In simple terms, it means parsing unstructured text in multiple languages into useful annotations from Universal Dependencies.
Universal Dependencies is a framework that maintains consistency in annotations. These annotations are generated for the text irrespective of the language being parsed.
Stanford’s submission ranked #1 in 2017. They missed out on the first position in 2018 due to a software bug (and ended up in 4th place).

StanfordNLP is a collection of pre-trained state-of-the-art models. These models were used by the researchers in the CoNLL 2017 and 2018 competitions. All the models are built on PyTorch and can be trained and evaluated on your own annotated data. Awesome!

Additionally, StanfordNLP also contains an official wrapper to the popular behemoth NLP library – CoreNLP. This had been somewhat limited to the Java ecosystem until now. You should check out this tutorial to learn more about CoreNLP and how it works in Python.

Below are a few more reasons why you should check out this library:

Native Python implementation requiring minimal effort to set up
Full neural network pipeline for robust text analytics, including:
- Tokenization
- Multi-word token (MWT) expansion
- Lemmatization
- Part-of-speech (POS) and morphological feature tagging
- Dependency Parsing
Pretrained neural models supporting 53 (human) languages featured in 73 treebanks
A stable officially maintained Python interface to CoreNLP

What more could an NLP enthusiast ask for? Now that we have a handle on what this library does, let’s take it for a spin in Python!

Setting up StanfordNLP in Python

There are some peculiar things about the library that had me puzzled initially. For instance, you need Python 3.6.8/3.7.2 or later to use StanfordNLP. To be safe, I set up a separate environment in Anaconda for Python 3.7.1. Here’s how you can do it:

1. Open conda prompt and type this:

conda create -n stanfordnlp python=3.7.1

2. Now activate the environment:

source activate stanfordnlp

3. Install the StanfordNLP library:

pip install stanfordnlp

4. We need to download a language’s specific model to work with it. Launch a python shell and import StanfordNLP:

import stanfordnlp

then download the language model for English (“en”):

stanfordnlp.download('en')

This can take a while depending on your internet connection. These language models are pretty huge (the English one is 1.96GB).

A couple of important notes

StanfordNLP is built on top of PyTorch 1.0.0. It might crash if you have an older version. Here’s how you can check the version installed on your machine:

pip freeze | grep torch

which should give an output like torch==1.0.0
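The same check can also be done from inside Python. The snippet below is a minimal sketch that uses only the standard library (importlib.metadata needs Python 3.8 or later). The helper names installed_version and version_tuple are made up for illustration, and the 1.0.0 threshold is simply the version mentioned above.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '1.0.0' or '1.13.1+cu117' into a comparable tuple like (1, 0, 0)."""
    core = version.split("+")[0]  # drop local build tags such as '+cu117'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


torch_version = installed_version("torch")
if torch_version is None:
    print("torch is not installed")
elif version_tuple(torch_version) < (1, 0, 0):
    print("torch", torch_version, "is older than 1.0.0")
else:
    print("torch", torch_version, "looks fine")
```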

I tried using the library without a GPU on my Lenovo Thinkpad E470 (8GB RAM, Intel Graphics). I got a memory error in Python pretty quickly. Hence, I switched to a GPU-enabled machine and would advise you to do the same. You can try Google Colab, which comes with free GPU support.

That’s all! Let’s dive into some basic NLP processing right away.

Using StanfordNLP to Perform Basic NLP Tasks

StanfordNLP comes with built-in processors to perform five basic NLP tasks:

Tokenization
Multi-Word Token Expansion
Lemmatisation
Parts of Speech Tagging
Dependency Parsing

Let’s start by creating a text pipeline:

nlp = stanfordnlp.Pipeline(processors="tokenize,mwt,pos,lemma,depparse")

doc = nlp("""The prospects for Britain’s orderly withdrawal from the European Union on March 29 have receded further, even as MPs rallied to stop a no-deal scenario. An amendment to the draft bill on the termination of London’s membership of the bloc obliges Prime Minister Theresa May to renegotiate her withdrawal agreement with Brussels. A Tory backbencher’s proposal calls on the government to come up with alternatives to the Irish backstop, a central tenet of the deal Britain agreed with the rest of the EU.""")

The processors argument specifies which tasks to run, as a comma-separated string. All five processors (tokenize, mwt, pos, lemma and depparse) run by default if no argument is passed. Here is a quick overview of the processors and what they can do:

Let’s see each of them in action.

Tokenization

This process happens implicitly once the tokenize processor is run. It is actually pretty quick. You can have a look at the tokens by using print_tokens():

doc.sentences[0].print_tokens()

The token object contains the index of the token in the sentence and a list of word objects (in case of a multi-word token). Each word object contains useful information, like the index of the word, the lemma of the text, the pos (parts of speech) tag and the feat (morphological features) tag.
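To make that structure concrete, here is a small sketch that walks over tokens and their words. It assumes the attribute names index, text, lemma, pos and feats on the word objects and index and words on the token objects. The describe_tokens helper is hypothetical, and the stand-in sentence built with SimpleNamespace only mimics those attributes so the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def describe_tokens(sentence):
    """Return one row per word: (token index, word index, text, lemma, pos, feats)."""
    rows = []
    for token in sentence.tokens:
        # A token holds more than one word only when it is a multi-word token.
        for word in token.words:
            rows.append((token.index, word.index, word.text,
                         word.lemma, word.pos, word.feats))
    return rows


# Stand-in sentence for illustration; a real one would be doc.sentences[0].
demo_word = SimpleNamespace(index="2", text="prospects", lemma="prospect",
                            pos="NNS", feats="Number=Plur")
demo_sentence = SimpleNamespace(
    tokens=[SimpleNamespace(index="2", words=[demo_word])])

for row in describe_tokens(demo_sentence):
    print(row)
```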

Lemmatization

This involves using the “lemma” property of the words generated by the lemma processor. Here’s the code to get the lemma of all the words:
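A minimal sketch of such a helper follows. It assumes each word exposes text and lemma attributes. The name lemma_rows and the stand-in document are illustrative only; extract_lemma wraps the rows in a pandas data frame and imports pandas only when it is called.

```python
from types import SimpleNamespace


def lemma_rows(doc):
    """Collect one {'word', 'lemma'} row per word in the document."""
    rows = []
    for sent in doc.sentences:
        for wrd in sent.words:
            rows.append({"word": wrd.text, "lemma": wrd.lemma})
    return rows


def extract_lemma(doc):
    """Same rows as a pandas data frame (pandas is imported only when called)."""
    import pandas as pd
    return pd.DataFrame(lemma_rows(doc))


# Stand-in document for illustration; replace with the doc returned by nlp(...).
demo_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    SimpleNamespace(text="prospects", lemma="prospect"),
    SimpleNamespace(text="receded", lemma="recede"),
])])

print(lemma_rows(demo_doc))
```

With pandas installed and a real pipeline output, calling extract_lemma(doc) gives the tabular view.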

This returns a pandas data frame for each word and its respective lemma:

Parts of Speech (PoS) Tagging

The PoS tagger is quite fast and works really well across languages. Just like lemmas, PoS tags are also easy to extract:
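The sketch below shows one way to write such an extractor. It assumes each word exposes text and pos attributes and that the English model emits Penn Treebank tags. The dictionary here is only a small subset of those tags (a full mapping would list every tag), and pos_rows and the stand-in document are illustrative names, not part of the library.

```python
from types import SimpleNamespace

# Small subset of Penn Treebank tags and their meanings.
pos_dict = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
}


def pos_rows(doc):
    """Collect word, tag and explanation; unknown tags get 'NA'."""
    rows = []
    for sent in doc.sentences:
        for wrd in sent.words:
            rows.append({"word": wrd.text, "pos": wrd.pos,
                         "exp": pos_dict.get(wrd.pos, "NA")})
    return rows


def extract_pos(doc):
    """Same rows as a pandas data frame (pandas is imported only when called)."""
    import pandas as pd
    return pd.DataFrame(pos_rows(doc))


# Stand-in document for illustration; replace with the doc returned by nlp(...).
demo_doc = SimpleNamespace(sentences=[SimpleNamespace(words=[
    SimpleNamespace(text="The", pos="DT"),
    SimpleNamespace(text="prospects", pos="NNS"),
    SimpleNamespace(text="receded", pos="VBN"),
])])

print(pos_rows(demo_doc))
```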

Notice the big dictionary in the above code? It is just a mapping between PoS tags and their meaning. This helps in getting a better understanding of our document’s syntactic structure.

The output would be a data frame with three columns – word, pos and exp (explanation). The explanation column gives us the most information about the text (and is hence quite useful).

Adding the explanation column makes it much easier to evaluate how accurate our processor is. I like the fact that the tagger is on point for the majority of the words. It even picks up the tense of a word and whether it is in base or plural form.

Dependency Extraction

Dependency extraction is another out-of-the-box feature of StanfordNLP. You can simply call print_dependencies() on a sentence to get the dependency relations for all of its words:

doc.sentences[0].print_dependencies()
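If you would rather collect the relations than print them, something like the following sketch could work. It assumes each word carries a governor attribute (a 1-based index of its head word, with 0 marking the root) and a dependency_relation attribute. The extract_dependencies helper and the stand-in sentence are hypothetical.

```python
from types import SimpleNamespace


def extract_dependencies(sentence):
    """Return (word, head word, relation) triples for one sentence."""
    triples = []
    for word in sentence.words:
        # governor is 1-based; 0 means the word is the root of the sentence.
        if word.governor == 0:
            head = "ROOT"
        else:
            head = sentence.words[word.governor - 1].text
        triples.append((word.text, head, word.dependency_relation))
    return triples


# Stand-in sentence for illustration; a real one would be doc.sentences[0].
demo_sentence = SimpleNamespace(words=[
    SimpleNamespace(text="The", governor=2, dependency_relation="det"),
    SimpleNamespace(text="prospects", governor=4, dependency_relation="nsubj"),
    SimpleNamespace(text="have", governor=4, dependency_relation="aux"),
    SimpleNamespace(text="receded", governor=0, dependency_relation="root"),
])

for triple in extract_dependencies(demo_sentence):
    print(triple)
```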

The library computes all of the above during a single run of the pipeline. This should take no more than a few minutes on a GPU-enabled machine.

We have now figured out a way to perform basic text processing with StanfordNLP. It’s time to take advantage of the fact that we can do the same for 52 other languages!

Implementing StanfordNLP on the Hindi Language

StanfordNLP really stands out in its performance and multilingual text parsing support. Let’s dive deeper into the latter aspect.

Processing text in Hindi (Devanagari Script)

First, we have to download the Hindi language model (comparatively smaller!):

stanfordnlp.download('hi')

Now, build a pipeline for Hindi (the English pipeline created earlier will not work here) and pass it a piece of Hindi text as our document:

nlp_hi = stanfordnlp.Pipeline(processors="tokenize,pos", lang="hi")

hindi_doc = nlp_hi("""केंद्र की मोदी सरकार ने शुक्रवार को अपना अंतरिम बजट पेश किया. कार्यवाहक वित्त मंत्री पीयूष गोयल ने अपने बजट में किसान, मजदूर, करदाता, महिला वर्ग समेत हर किसी के लिए बंपर ऐलान किए. हालांकि, बजट के बाद भी टैक्स को लेकर काफी कन्फ्यूजन बना रहा. केंद्र सरकार के इस अंतरिम बजट क्या खास रहा और किसको क्या मिला, आसान भाषा में यहां समझें""")

This should be enough to generate all the tags. Let’s check the tags for Hindi:

extract_pos(hindi_doc)

The PoS tagger works surprisingly well on the Hindi text too. Look at “अपना” for example. The PoS tagger tags it as a pronoun (like I, he, she), which is accurate.

Using CoreNLP’s API for Text Analytics

CoreNLP is a time-tested, industry-grade NLP toolkit that is known for its performance and accuracy. StanfordNLP has been declared an official Python interface to CoreNLP. That is a HUGE win for this library.

There have been efforts before to create Python wrapper packages for CoreNLP but nothing beats an official implementation from the authors themselves. This means that the library will see regular updates and improvements.

StanfordNLP takes three lines of code to start utilizing CoreNLP’s sophisticated API. Literally, just three lines of code to set it up!

1. Download the CoreNLP package. Open your Linux terminal and type the following command:

wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip

2. Unzip the downloaded package:

unzip stanford-corenlp-full-2018-10-05.zip

3. Start the CoreNLP server from inside the unzipped folder:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Note: CoreNLP requires Java 8 to run. Please make sure you have JDK and JRE 1.8.x installed.

Now, make sure that StanfordNLP knows where CoreNLP is present. For that, you have to export $CORENLP_HOME as the location of your folder. In my case, this folder was in the home directory itself, so my path looked like this:

export CORENLP_HOME=stanford-corenlp-full-2018-10-05/

After the above steps have been taken, you can start up the server and make requests in Python code. Below is a comprehensive example of starting a server, making requests, and accessing data from the returned object.

a. Setting up the CoreNLPClient
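A minimal sketch of the client setup is shown below. It assumes the CoreNLPClient class from stanfordnlp.server, used as a context manager with the annotators, timeout and memory arguments. As far as I can tell, the client launches its own server process from CORENLP_HOME. The annotate wrapper, the sample text and the default values here are illustrative choices, and the coreference models may need more memory than 4G.

```python
# Annotators to request from CoreNLP; trim this list if you need less.
ANNOTATORS = ["tokenize", "ssplit", "pos", "lemma", "ner",
              "parse", "depparse", "coref"]


def annotate(text, annotators=ANNOTATORS, timeout=30000, memory="4G"):
    """Start a CoreNLP server through the client and annotate the text."""
    # Imported here so this module still loads where stanfordnlp is missing.
    from stanfordnlp.server import CoreNLPClient

    # The context manager shuts the server down again on exit.
    with CoreNLPClient(annotators=annotators, timeout=timeout,
                       memory=memory) as client:
        return client.annotate(text)


# Example call (needs CORENLP_HOME to be set and Java installed):
# ann = annotate("Chris wrote a simple sentence. He also gives oranges to people.")
```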

b. Dependency Parsing and POS
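The annotated document that comes back can then be read sentence by sentence. The sketch below assumes the field names of CoreNLP's protobuf output as I understand them: sentence.token with word and pos, and sentence.basicDependencies.edge with source, target and dep, where source and target are 1-based token positions. The helper names and the stand-in sentence are made up so that the snippet runs without a server.

```python
from types import SimpleNamespace


def token_tags(sentence):
    """Return (word, part-of-speech tag) pairs for one annotated sentence."""
    return [(tok.word, tok.pos) for tok in sentence.token]


def dependency_edges(sentence):
    """Return (head word, relation, dependent word) triples."""
    words = [tok.word for tok in sentence.token]
    edges = []
    for edge in sentence.basicDependencies.edge:
        # source and target are assumed to be 1-based token positions.
        edges.append((words[edge.source - 1], edge.dep, words[edge.target - 1]))
    return edges


# Stand-in sentence for illustration; a real one would be ann.sentence[0].
demo_sentence = SimpleNamespace(
    token=[SimpleNamespace(word="Chris", pos="NNP"),
           SimpleNamespace(word="wrote", pos="VBD")],
    basicDependencies=SimpleNamespace(
        edge=[SimpleNamespace(source=2, target=1, dep="nsubj")]),
)

print(token_tags(demo_sentence))
print(dependency_edges(demo_sentence))
```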

c. Named Entity Recognition and Co-Reference Chains
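Entities and co-reference chains can be pulled out of the same annotated document. This sketch assumes the protobuf fields sentence.mentions (with entityMentionText and ner) and ann.corefChain (whose mention entries carry sentenceIndex, beginIndex and endIndex, with the end index exclusive). The helper names and the stand-in document are illustrative only.

```python
from types import SimpleNamespace


def entity_mentions(ann):
    """Return (mention text, entity label) pairs across all sentences."""
    return [(m.entityMentionText, m.ner)
            for sentence in ann.sentence for m in sentence.mentions]


def coref_chains(ann):
    """Return each co-reference chain as a list of mention strings."""
    chains = []
    for chain in ann.corefChain:
        mentions = []
        for m in chain.mention:
            # Slice the tokens of the sentence this mention belongs to.
            tokens = ann.sentence[m.sentenceIndex].token[m.beginIndex:m.endIndex]
            mentions.append(" ".join(tok.word for tok in tokens))
        chains.append(mentions)
    return chains


def _tokens(*words):
    return [SimpleNamespace(word=w) for w in words]


# Stand-in annotation for illustration; a real one comes from client.annotate(text).
demo_ann = SimpleNamespace(
    sentence=[
        SimpleNamespace(token=_tokens("Chris", "wrote", "a", "sentence", "."),
                        mentions=[SimpleNamespace(entityMentionText="Chris",
                                                  ner="PERSON")]),
        SimpleNamespace(token=_tokens("He", "gives", "oranges", "."),
                        mentions=[]),
    ],
    corefChain=[SimpleNamespace(mention=[
        SimpleNamespace(sentenceIndex=0, beginIndex=0, endIndex=1),
        SimpleNamespace(sentenceIndex=1, beginIndex=0, endIndex=1),
    ])],
)

print(entity_mentions(demo_ann))
print(coref_chains(demo_ann))
```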

The above examples barely scratch the surface of what CoreNLP can do, and yet the results are already interesting. We went from basic NLP tasks like part-of-speech tagging to Named Entity Recognition, co-reference chain extraction and finding who wrote what in a sentence, all in just a few lines of Python code.

What I like the most here is the ease of use and increased accessibility this brings when it comes to using CoreNLP in Python.

My Thoughts on using StanfordNLP – Pros and Cons

Exploring a newly launched library was certainly a challenge. There’s barely any documentation on StanfordNLP! Yet, it was quite an enjoyable learning experience.

A few things that excite me regarding the future of StanfordNLP:

Its out-of-the-box support for multiple languages
The fact that it is going to be an official Python interface for CoreNLP. This means it will only improve in functionality and ease of use going forward
It is fairly fast (barring the huge memory footprint)
Straightforward set up in Python

There are, however, a few kinks to iron out. Below are my thoughts on where StanfordNLP could improve:

The size of the language models is too large (English is 1.9 GB, Chinese ~ 1.8 GB)
The library requires a lot of code to churn out features. Compare that to NLTK where you can quickly script a prototype – this might not be possible for StanfordNLP
Visualization features are currently missing. They are useful for functions like dependency parsing. StanfordNLP falls short here when compared with libraries like spaCy

Make sure you check out StanfordNLP’s official documentation.

End Notes

There is still a feature I haven’t tried out yet. StanfordNLP allows you to train models on your own annotated data using embeddings from Word2Vec/FastText. I’d like to explore it in the future and see how effective that functionality is. I will update the article whenever the library matures a bit.

Clearly, StanfordNLP is very much in the beta stage. It will only get better from here so this is a really good time to start using it – get a head start over everyone else.

For now, given that such amazing toolkits (CoreNLP) are coming to the Python ecosystem and research giants like Stanford are making an effort to open source their software, I am optimistic about the future.

Mohd Sanad Zaki Rizvi

A computer science graduate, I have previously worked as a Research Assistant at the University of Southern California (USC-ICT), where I employed NLP and ML to make better virtual STEM mentors. My research interests include using AI and its allied fields of NLP and Computer Vision for tackling real-world problems.

Rakesh Arya

Very nice article. Specially the hindi part explanation. It will open ways to analyse hindi texts. Thanks for sharing!

Hey Rakesh, Thanks for your comment. Indeed, not just Hindi but many local languages from all over the world will be accessible to the NLP community now because of StanfordNLP. Sanad

Jeremy

Do you know for NER if we can plug in the encoding of a custome language model? I am looking for a library that has NER supported already, but allowing me to use a custom language model aka the encoder and then build upon their mode structure. If you know of any documentation on this it would be greatly appreciated

garshasp

Dear Mr. Rizvi, nicely explained this recent state of art method in nlp. It would be very nice of you if you could share your experience with training word embeddings using Stanford nlp.
