How to Build NLP Applications with Hugging Face?

Badrinarayan M Last Updated : 12 Jun, 2024
12 min read

Introduction

Hugging Face (HF) is a pioneering AI platform enabling ML community collaboration on models, datasets, and applications. This article will delve into Hugging Face’s capabilities for building NLP applications, covering key services such as models, datasets, and open-source libraries. Whether you are a beginner or an experienced developer, Hugging Face offers versatile tools to enhance your NLP journey.

Overview

  • Learn about Hugging Face for building NLP applications using models, datasets, and open-source tools.
  • Explore Hugging Face’s core services, which include a wide array of models, comprehensive datasets, and essential open-source NLP libraries.
  • Using Hugging Face’s tools, discover practical NLP applications such as text classification, text summarization, generation, translation, etc.
  • Learn how to leverage popular Hugging Face libraries like Transformers to develop and fine-tune models for various natural language processing tasks.

What is Hugging Face (HF)?

Hugging Face has quite a few offerings as an AI platform. Simply put, it is a platform where the ML community collaborates on models, datasets, and applications.

Let’s get started with Hugging Face. Some of the core Hugging Face services are:

  • Models
  • Datasets
  • Spaces
  • Open Source Libraries and Docs
  • Community

Models in Hugging Face

Hugging Face hosts many open-source models, such as LLMs, diffusion-based text-to-image models, audio models, and much more! A key advantage of using Hugging Face here is its CLI tool, which makes it easy to download and upload large model files.

Model pages can also include lots of valuable tools and information. Some models will have direct links to run inference or host the model on a space.
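Models can also be pulled down programmatically with the huggingface_hub library, the same package that provides the CLI. Below is a minimal sketch, assuming huggingface_hub is installed; the model ID is just an example.

from huggingface_hub import snapshot_download

# Download all files of a model repository into the local cache and return the folder path
local_path = snapshot_download(repo_id="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
print(local_path)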


Datasets in Hugging Face

Similar to Models, Hugging Face also hosts datasets for training and evaluation. This can include text data sets, audio data, image data, and more!


Open Source Libraries and Docs

Hugging Face creates, manages, and documents many open-source libraries that are popular in the ML space, such as:

  • Transformers
  • Diffusers
  • Gradio
  • Accelerate

We’ll explore the Transformers library in this article, but overall, most of these libraries help developers create and run ML applications, such as LLMs or text-to-image models.

Transformers Library

The Transformers library helps you run pretrained transformer models (often LLMs or text models). It is a powerful open-source library for building and fine-tuning transformer models for natural language processing tasks. The Transformers library abstracts away much of the complexity involved in working with transformer models, allowing researchers and developers to focus on high-level tasks and rapid experimentation. Its wide adoption and support have made it a go-to library for many NLP projects and applications.
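As a quick illustration of that abstraction, a single pipeline call is enough to load a pretrained model and run inference end to end. The sketch below uses sentiment analysis as an example; with no model specified, a default model is downloaded automatically.

from transformers import pipeline

# One call builds the tokenizer, model, and post-processing for the chosen task
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers makes NLP experimentation fast."))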

Various Functionalities Available in HF for NLP

In the NLP section of the Hugging Face Hub, the tasks we can perform include text classification, fill-mask, summarization, text generation, question answering, translation, and sentence similarity.

How to Build NLP Applications Using Hugging Face

Some of the interesting NLP applications that we will look into are:

  • Text classification: Classifying text by its nature – positive or negative (sentiment analysis), or spam versus ham.
  • Fill mask: One or more words in a sentence are replaced with a mask token, and the model predicts the masked words.
  • Text Summarization: Give the model the text that needs to be summarized, and it returns a summary.
  • Text Generation: The model generates text based on the input; it tries to complete the text.
  • Question and Answering: Give the model some context, and it will answer questions asked about that context.
  • Translation: We use models to translate text from one language to another.
  • Sentence similarity: The model scores how similar one sentence is to each of several other sentences, comparing the reference sentence against all the others.

Text Classification

We will now learn how to build NLP applications with Hugging Face for text classification. Text classification is one of the most popular techniques in NLP: we classify our data into multiple labels. Some common text classification tasks are sentiment analysis, spam classification, auto-tagging queries, etc. We will do a basic sentiment analysis below. We can use a Hugging Face model in two ways:

  • Using Pipeline
  • Using the model directly

We will try both methods for sentiment analysis to get a simple overview of each. In practice, a pipeline is best suited for most tasks unless some customization is required.

Using pipeline

import torch
import transformers
from transformers import pipeline
pipe = pipeline("text-classification", 
  model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")
output = pipe("You have to do better in NLP")
print(output)
output = pipe("It is very easy to create application out of hugging face")
print(output)

In the above code, we import the necessary libraries, PyTorch and Transformers, the latter of which gives access to a huge range of open-source models. From transformers we import pipeline, which builds an inference pipeline around the model we choose. Here, we use a pretrained DistilBERT model fine-tuned for sentiment classification. The pipeline takes care of tokenization and converting text into model inputs, so we can run inference directly.
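The same pipeline object also accepts a list of inputs, which is convenient when classifying many sentences at once. Here is a small sketch of batch usage, reusing the pipe defined above; the example sentences are made up for illustration.

batch = [
    "Hugging Face makes it easy to try new models.",
    "This tutorial was confusing and hard to follow.",
]

# The pipeline returns one {'label', 'score'} dictionary per input sentence
results = pipe(batch)
for sentence, result in zip(batch, results):
    print(f"{sentence} -> {result['label']} ({result['score']:.4f})")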

Using the model directly

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned DistilBERT classifier
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Tokenize the input and run a forward pass without tracking gradients
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the class with the highest logit and map it back to its label
predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]

Here, we import AutoTokenizer, which we use to load the DistilBERT tokenizer. Then, using AutoModelForSequenceClassification, we load DistilBERT itself. With the tokenizer and model in hand, we classify our sentence: the model outputs logits for the POSITIVE and NEGATIVE classes, and argmax gives us the label the model assigns.
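If you want probabilities rather than raw logits, a softmax over the logits gives them. This is a minimal sketch, assuming the model and logits from the snippet above are still in scope.

import torch.nn.functional as F

# Softmax turns the two logits into probabilities that sum to 1
probabilities = F.softmax(logits, dim=-1)[0]
for class_id, probability in enumerate(probabilities):
    print(model.config.id2label[class_id], round(probability.item(), 4))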

Fill Mask

We will now learn how to build NLP applications with Hugging Face for fill mask. Fill mask is an NLP task where the model tries to find the missing word or words in a sentence. This technique is primarily used in training language models to help them understand the context and relationships between words. We will use distilbert-base-uncased to implement the fill-mask task: we replace one or more words in a sentence with a special token (commonly [MASK]), and the model’s job is to predict the masked words correctly.

from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("Hello I'm a [MASK] model.")
unmasker("The White man worked as a [MASK].")
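The fill-mask pipeline returns a list of candidate fills, each with a score, the predicted token, and the completed sentence. Recent versions of the pipeline also accept a top_k argument to limit how many candidates come back; here is a small sketch with an illustrative sentence.

# Ask for only the three most likely fills for the masked position
predictions = unmasker("Hugging Face is a [MASK] platform.", top_k=3)
for prediction in predictions:
    print(prediction["sequence"], "->", round(prediction["score"], 4))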

Text Summarization

Next up, we will learn how to build NLP applications with Hugging Face for text summarization. Text summarization in NLP aims to capture the gist of a text and express it in fewer words. The model’s objective is to produce a coherent and fluent summary that captures the main points of the original text.

from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

This loads Facebook’s BART (Bidirectional and Auto-Regressive Transformers). This model will do abstractive summarization. Abstractive summarization involves generating new sentences that convey the essential information from the original text, often paraphrasing and rephrasing the content.

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, 
she got married in Westchester County, New York.
A year later, she got married again in Westchester County, 
but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. 
Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. 
In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of 
"offering a false instrument for filing in the first degree,
" referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, 
according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service 
and criminal trespass for allegedly sneaking into the New York subway through an emergency exit,
said Detective Annette Markowski, a police spokeswoman. In total, 
Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island,
 New Jersey or the Bronx. She is believed to still be married to four men,
  and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, 
who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. 
It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s 
Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called 
"red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to 
his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  
Her next court appearance is scheduled for May 18.
"""

Let us provide this text for the model to summarize.

Output = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
print(Output[0]['summary_text'])

max_length and min_length set the upper and lower bounds on the summary’s length. The above code illustrates how we can use BART to summarize our text.
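The same call can be tuned for a shorter summary simply by tightening those bounds; a minimal sketch, reusing the summarizer and ARTICLE defined above:

# A tighter summary of the same article
short_summary = summarizer(ARTICLE, max_length=60, min_length=20, do_sample=False)
print(short_summary[0]["summary_text"])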

Text Generation

Text Generation is an NLP task ranging from generating the next word to generating an entire paragraph or even longer text. It is applied wherever new content relevant to the context is needed.

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

We will be using GPT-2 for text generation. GPT-2 was a generational leap in NLP and is the last open-source GPT model from OpenAI. Models like GPT-4 and GPT-4o are now far better than GPT-2, but since GPT-2 is open source, we will use it.

Output = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
Output

max_length restricts the output of GPT-2 to a maximum of 30 tokens. num_return_sequences is set to 5, so the call returns five sequences generated by GPT-2.
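The text-generation pipeline also forwards common decoding parameters such as temperature and top_p to the underlying generate method. Here is a small sketch with illustrative values, reusing the generator defined above.

# Sampling-based generation with a few common decoding knobs
samples = generator(
    "Hello, I'm a language model,",
    max_length=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    num_return_sequences=2,
)
for sample in samples:
    print(sample["generated_text"])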

Question and Answering

Next, we will learn how to build NLP applications with Hugging Face for question answering. In QnA, we use a model that takes some context and answers questions about that context. One application is building a chatbot: a bot supplied with domain-specific context can answer domain-specific queries.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

In the above code, we use a RoBERTa model fine-tuned on SQuAD 2.0 as a QnA model. Now that we have downloaded and loaded the model, we will provide context and query our model.

QA_input = {
   'question': 'Where did Liana Barrientos get married?',
   'context': """ New York (CNN)When Liana Barrientos was 23 years old, 
   she got married in Westchester County, New York.
A year later, she got married again in Westchester County, 
but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. 
Then, Barrientos declared "I do" five more times, sometimes only within 
two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application
 for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a 
false instrument for filing in the first degree," referring to her 
false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, 
according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of 
service and criminal trespass for allegedly sneaking into the 
New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been 
married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. 
She is believed to still be married to four men, and at one time, 
she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, 
who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. 
It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by 
Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" 
countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native 
Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  
Her next court appearance is scheduled for May 18.
"""
}
res = nlp(QA_input)
res

We can see that the model’s answer to the question “Where did Liana Barrientos get married?” is “Westchester County, New York,” with a confidence score of 0.5. This is not the best, but it is reasonable given that we are not using a state-of-the-art model.
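The same pipeline can be queried repeatedly over the same context. Below is a small sketch reusing nlp and QA_input from above; the extra questions are only illustrations.

# Ask a few more questions against the same context
questions = [
    "How many times has Barrientos been married?",
    "When is her next court appearance?",
]
for question in questions:
    answer = nlp(question=question, context=QA_input["context"])
    print(question, "->", answer["answer"], f"(score: {answer['score']:.2f})")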

Translation

In machine translation, we convert text from one language to another while preserving the context and meaning of the original. Translation in NLP has become so advanced that real-time translators are now built with state-of-the-art models. We will use Google’s open-source t5-base model for our translation. It is not the best model available, but it gets the job done.

from transformers import pipeline
translate = pipeline('translation_en_to_fr')

Here, you can see that we have not specified a model, yet the pipeline still works, because it downloads the default model for translating from English to French. That default is Google’s T5.
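If you prefer to be explicit, the same pipeline can be created with the model named directly; a minimal sketch that behaves like the default setup:

# Explicitly pin the translation model instead of relying on the default
translator_t5 = pipeline("translation_en_to_fr", model="t5-base")
print(translator_t5("Machine translation has improved a lot."))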

result = translate("Hello, my name is Jose. What is your name?")
result
result = translate("How are you?")
result

These translations convey the intended meaning, even if they may not sound exactly like a native French speaker. T5 is not the best translator out there, but it is good at its job.

Sentence Similarity

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


# Load the model and tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

This code downloads and loads our pretrained sentence-transformer (SBERT-style) model and its tokenizer.

# Function to compute sentence embeddings
def compute_embeddings(sentences):
   inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
   with torch.no_grad():
       outputs = model(**inputs)
   embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
   return embeddings

Using the tokenizer and the model, this function computes an embedding for each sentence by mean-pooling the last hidden state.

# Define the sentences
sentence_to_compare = "How are you doing?"
sentences = [
   "I am fine, thank you.",
   "What are you doing today?",
   "How have you been?"
]

We then define the sentence we want to compare against and the candidate sentences. Next, we compute embeddings for them and score their similarity, as sketched below.

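The similarity computation can be written as follows; a minimal sketch, reusing the compute_embeddings function and the sentences defined above and scoring with torch.nn.functional.cosine_similarity:

# Embed the reference sentence and the candidate sentences
reference_embedding = compute_embeddings([sentence_to_compare])
candidate_embeddings = compute_embeddings(sentences)

# Cosine similarity between the reference and every candidate
similarities = F.cosine_similarity(reference_embedding, candidate_embeddings)

for sentence, score in zip(sentences, similarities):
    print(f"{sentence} -> similarity: {score.item():.4f}")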

With the embeddings computed, we use cosine similarity to compare each candidate sentence with the reference sentence and display the scores. In this example, the second sentence has the highest similarity score. This is how we do sentence similarity.

Conclusion

This article explored various NLP applications built with the popular Hugging Face Transformers library. Hugging Face is a very effective and versatile tool, and I am sure it will enhance your NLP journey. I recommend that everyone delve deeper into the internal workings of the models discussed above so that they can be used effectively.

Frequently Asked Questions

Q1. What is Hugging Face, and why is it important in NLP?

A. Hugging Face is an NLP technology company and platform. It provides the open-source Transformers library, which offers pre-trained models for many different NLP tasks. This makes it feasible for developers without deep machine-learning experience to apply state-of-the-art NLP methods.

Q2. How does text classification work in NLP using Hugging Face?

A. Text classification works by assigning text to pre-defined categories. With Hugging Face Transformers, one can use a variety of pre-trained models to classify a given body of text based on its content.

Q3. What is a fill-mask, and how is it used in NLP?

A. Fill-mask is a masked-language-modeling task: a set number of words in a sentence are replaced with placeholder tokens, and the model must predict the missing words. Through Hugging Face, sophisticated models like BERT that capture the context and meaning of sentences are readily available for this task.

Q4. How do Hugging Face transformers help with text summarization? 

A. Text summarization means taking a long text and reducing its size without losing the main points. Hugging Face provides implementations of models like BART and T5 that summarize input text quickly and accurately.

Q5. What do we mean by text generation? How do transformers help in this task? 

A. Text generation means producing new text from a given input, a task where transformers like GPT-2 and GPT-3 thrive. Given a prompt, such models can generate coherent, contextually relevant continuations; for example, GPT-2 or GPT-3 can be prompted with a paragraph opener and will continue it logically.

Data science Trainee at Analytics Vidhya, specializing in ML, DL and Gen AI. Dedicated to sharing insights through articles on these subjects. Eager to learn and contribute to the field's advancements. Passionate about leveraging data to solve complex problems and drive innovation.
