Hugging Face (HF) is a pioneering AI platform enabling ML community collaboration on models, datasets, and applications. This article will delve into Hugging Face’s capabilities for building NLP applications, covering key services such as models, datasets, and open-source libraries. Whether you are a beginner or an experienced developer, Hugging Face offers versatile tools to enhance your NLP journey.
Hugging Face has quite a few offerings as an AI platform. Simply put, it is a platform where the ML community collaborates on models, datasets, and applications.
Let’s get started with Hugging Face. Some of the core Hugging Face services are:
Hugging Face hosts many open-source models, such as LLMs, diffusion-based text-to-image models, audio models, and much more! A key advantage of using Hugging Face for this is its CLI tool, which is designed for downloading and uploading large model files.
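For example, the snippet below is a minimal sketch of downloading a full model repository programmatically with the huggingface_hub library (the same functionality the huggingface-cli download command exposes); the repository ID shown is only an illustration.

from huggingface_hub import snapshot_download

# Download an entire model repository from the Hub into the local cache
# and return the local folder path (the repo_id here is just an example)
local_path = snapshot_download(repo_id="distilbert-base-uncased")
print(local_path)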
Model pages can also include lots of valuable tools and information. Some models will have direct links to run inference or host the model on a Space.
Similar to Models, Hugging Face also hosts datasets for training and evaluation. These include text datasets, audio data, image data, and more!
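As a quick illustration (a minimal sketch using the datasets library; the dataset name is just an example), a hosted dataset can be pulled down in a couple of lines:

from datasets import load_dataset

# Load a text-classification dataset hosted on the Hugging Face Hub
dataset = load_dataset("imdb")
print(dataset["train"][0])  # one review and its label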
Hugging Face creates, manages, and documents many open-source libraries that are popular in the ML space, such as:
We’ll explore the Transformers library in this article, but overall, most of these libraries help developers create and run ML applications, such as LLMs or text-to-image models.
The Transformers library helps you run pretrained transformer models (often LLMs or other text models). It is a powerful open-source library for building and fine-tuning transformer models for natural language processing tasks. It abstracts away much of the complexity involved in working with transformer models, allowing researchers and developers to focus on high-level tasks and rapid experimentation. Its wide adoption and support have made it a go-to library for many NLP projects and applications.
In the NLP section of Hugging Face, the tasks that we can do are:
Some of the interesting NLP applications that we will look into are:
We will now learn how to build NLP applications with Hugging Face for text classification. Text classification is one of the most popular techniques in NLP: we assign each piece of text to one of several predefined labels. Some common text classification tasks are sentiment analysis, spam classification, and auto-tagging queries. We will do a basic sentiment analysis below. We can use a Hugging Face model in two ways:
We will try both methods for sentiment analysis to get a simple overview of each. In most cases, a pipeline is the better fit unless some customization is required.
Using pipeline
import torch
import transformers
from transformers import pipeline

# Create a text-classification pipeline backed by a DistilBERT model
# fine-tuned for sentiment analysis on SST-2
pipe = pipeline("text-classification",
                model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

output = pipe("You have to do better in NLP")
print(output)

output = pipe("It is very easy to create application out of hugging face")
print(output)
In the above code, we import the necessary libraries, PyTorch and Transformers. We then import pipeline from transformers, which builds a ready-to-use pipeline around the model we specify. Here, we use a DistilBERT model fine-tuned for sentiment classification. The pipeline takes care of tokenization and computing the vector embeddings, so we can run inference directly. DistilBERT is a pretrained model.
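The pipeline returns a list with one dictionary per input, each holding a predicted label and a confidence score; the small snippet below shows how to read the result of the last call above (the exact score will vary):

# output looks like [{'label': 'POSITIVE' or 'NEGATIVE', 'score': ...}]
print(output[0]["label"], output[0]["score"])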
Using the model directly
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the fine-tuned DistilBERT classification model
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased-finetuned-sst-2-english")

# Tokenize the input sentence and run it through the model
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the class with the highest logit and map it back to its label
predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])
Here, we import AutoTokenizer, which we use to load the DistilBERT tokenizer. Then, using AutoModelForSequenceClassification, we load DistilBERT itself. Now that we have the tokenizer and model, we use them to classify our sentence. The model outputs a logit for each of the POSITIVE and NEGATIVE classes, and taking the argmax gives us the label the model predicts.
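If you prefer actual probabilities rather than raw logits, you can apply a softmax before taking the argmax. The snippet below is a small sketch that reuses the logits and model from the code above:

# Convert the logits into probabilities over the NEGATIVE and POSITIVE classes
probs = torch.softmax(logits, dim=-1)
predicted_class_id = probs.argmax().item()
print(model.config.id2label[predicted_class_id], probs[0, predicted_class_id].item())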
We will learn how to build NLP applications with Hugging Face for fill-mask. Fill-mask is an NLP task where the model tries to find the missing word or words in a sentence. This technique is primarily used when training language models, helping them understand the context and relationships between words. We will use the distilbert-base-uncased model to implement the fill-mask task. We replace one or more words in a sentence with a special token (commonly [MASK]), and the model’s job is to predict the masked words correctly.
from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("Hello I'm a [MASK] model.")
unmasker("The White man worked as a [MASK].")
Next up, we will learn how to build NLP applications with Hugging Face for text summarization. Text summarization in NLP aims to condense a piece of text while preserving its overall meaning. The model’s objective is to produce a coherent and fluent summary that captures the main points of the original text.
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
This loads Facebook’s BART (Bidirectional and Auto-Regressive Transformers). This model will do abstractive summarization. Abstractive summarization involves generating new sentences that convey the essential information from the original text, often paraphrasing and rephrasing the content.
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old,
she got married in Westchester County, New York.
A year later, she got married again in Westchester County,
but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again.
Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx.
In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of
"offering a false instrument for filing in the first degree,
" referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx,
according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service
and criminal trespass for allegedly sneaking into the New York subway through an emergency exit,
said Detective Annette Markowski, a police spokeswoman. In total,
Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island,
New Jersey or the Bronx. She is believed to still be married to four men,
and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands,
who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved.
It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s
Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called
"red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to
his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.
Her next court appearance is scheduled for May 18.
"""
Let us provide this text for the model to summarize.
output = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
print(output[0]['summary_text'])
max_length and min_length set the upper and lower bounds (in tokens) on the length of the summary. The above code illustrates how we can use BART to summarize our text.
Text generation is an NLP task ranging from generating the next word to generating an entire paragraph or even longer text. It is applied wherever new content relevant to the context is needed.
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
We will be using GPT-2 for text generation. GPT-2 was released by OpenAI as an open-source model and marked a generational leap in NLP at the time. Newer models such as GPT-4 and GPT-4o are far better than GPT-2, but since GPT-2 is open source, we will use it.
output = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
output
max_length restricts the output of GPT-2 to a maximum of 30 tokens. num_return_sequences is set to 5, so the call returns five sequences generated by GPT-2.
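Each returned sequence is a dictionary with a generated_text field, so the five continuations can be printed individually; here is a small sketch using the output from above:

# Print each of the five generated continuations
for i, seq in enumerate(output, start=1):
    print(f"{i}. {seq['generated_text']}")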
Next, we will learn how to build NLP applications with Hugging Face for question answering. In QnA, we use a model that can take a context and answer questions about that context. One common application is a chatbot: a bot created with some context can answer domain-specific queries.
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
In the above code, we use a RoBERTa model (fine-tuned on SQuAD 2.0) as a QnA model. Now that we have downloaded and loaded the model, we will provide context and query it.
QA_input = {
'question': 'Where did Liana Barrientos get married?',
'context': """ New York (CNN)When Liana Barrientos was 23 years old,
she got married in Westchester County, New York.
A year later, she got married again in Westchester County,
but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again.
Then, Barrientos declared "I do" five more times, sometimes only within
two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application
for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a
false instrument for filing in the first degree," referring to her
false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx,
according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of
service and criminal trespass for allegedly sneaking into the
New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been
married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx.
She is believed to still be married to four men, and at one time,
she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands,
who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved.
It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by
Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged"
countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native
Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.
Her next court appearance is scheduled for May 18.
"""
}
res = nlp(QA_input)
res
We can see that the model’s answer to the question “Where did Liana Barrientos get married?” is “Westchester County, New York,” with a confidence score of 0.5. This is not the best, but it is reasonable given that we are not using a state-of-the-art model.
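The question-answering pipeline returns a dictionary containing the extracted answer, a confidence score, and the character positions of the answer span in the context; here is a quick sketch of reading it from the res variable above:

# res looks like {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
print(res["answer"], res["score"])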
Machine translation converts text from one language to another while preserving the context and meaning of the original text. Translation in NLP has become so advanced that real-time translators are now built using state-of-the-art models. We will use Google’s open-source t5-base model for our translation. This model is not the best, but it gets the job done.
from transformers import pipeline
translate = pipeline('translation_en_to_fr')
Here, you can see that I have not specified a model, yet the pipeline still works because it downloads the default model for translating from English to French. That default model is Google’s T5 (t5-base).
result = translate("Hello, my name is Jose. What is your name?")
result
result = translate("How are you?")
result
These translations are fairly literal, so they may not always read like natural spoken French. T5 is not the best translator out there, but it is good at its job.
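Each call to the pipeline returns a list with one dictionary whose translation_text field holds the French output; a small sketch for extracting it:

# The pipeline returns a list of dicts such as [{'translation_text': ...}]
result = translate("How are you?")
print(result[0]["translation_text"])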
Sentence Similarity
Sentence similarity measures how close two pieces of text are in meaning, typically by comparing their embeddings. We will use a sentence-transformers model for this.
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F
# Load the model and tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
This code downloads the pretrained SBERT-style model (all-MiniLM-L6-v2) and its tokenizer, then loads them.
# Function to compute sentence embeddings
def compute_embeddings(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
    return embeddings
Using the tokenizer and the model, this function computes an embedding for each sentence by mean pooling the token embeddings.
# Define the sentences
sentence_to_compare = "How are you doing?"
sentences = [
"I am fine, thank you.",
"What are you doing today?",
"How have you been?"
]
We then define the reference sentence and the candidate sentences we want to compare it with.
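The cell that actually computes the similarities is not shown above; a minimal sketch of that step, using the compute_embeddings helper defined earlier and cosine similarity from torch.nn.functional, could look like this:

# Compute embeddings for the reference sentence and the candidate sentences
reference_embedding = compute_embeddings([sentence_to_compare])
sentence_embeddings = compute_embeddings(sentences)

# Cosine similarity between the reference embedding and each candidate embedding
similarities = F.cosine_similarity(reference_embedding, sentence_embeddings, dim=1)

for sentence, score in zip(sentences, similarities):
    print(f"{score.item():.3f}  {sentence}")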
Now that we have the embeddings, we can use them to compute cosine similarity. We then display the similarity of each candidate sentence with the sentence we intend to compare against. We can infer that sentence two has the highest similarity score. This is how we do sentence similarity.
This article explored various NLP applications built with the popular Hugging Face Transformers library. Hugging Face is a very effective and versatile tool, and I am sure it will enhance your NLP journey. I recommend that everyone delve deeper into the internal workings of the models we discussed above so that they can be used effectively.
A. Hugging Face is an NLP technology company. The organization provides the open-source Transformers library, a powerful collection of pre-trained models for many different NLP tasks. This makes it feasible for developers without deep machine learning experience to apply state-of-the-art NLP methods.
A. Text classification works by assigning text to pre-defined categories. With Hugging Face Transformers, one can use a variety of pre-trained models to classify a given body of text based on its content.
A. Fill-mask is a masked-language-modeling task: a set number of words in a sentence are replaced with placeholder tokens, and the model is required to predict the missing words. Through Hugging Face, sophisticated models like BERT are readily available for this task and are good at capturing the context and meaning of sentences.
A. Text summarization means taking a long text and reducing its size without losing the main points. Hugging Face provides implementations of models like BART and T5 that summarize input text quickly and accurately.
A. Text generation involves producing new text from a given input, a task where transformers like GPT-2 and GPT-3 thrive. Given a prompt, these models can generate coherent, contextually relevant continuations; for example, you can give GPT-2 or GPT-3 the opening of a paragraph and it will continue the text logically.