Transformers were originally designed to translate text from one language into another. BERT has had a major impact on how we study and work with human language: it builds on the encoder part of the original transformer architecture, the part that reads and understands text. BERT embeddings are especially good at capturing sentences with complex meanings, because the model looks at the whole sentence and learns how its words relate to each other. The Hugging Face transformers library is the key tool for creating these sentence representations and working with BERT.
Think of pipelines as a user-friendly wrapper that hides the more complex code in the transformers library. They make it easy to use models for tasks such as language understanding, sentiment analysis, feature extraction, question answering, and more, and they provide a clean way to interact with these powerful models.
Pipelines include a few essential components: a tokenizer (which turns regular text into smaller units for the model to work with), the model itself (which makes predictions based on the input), and some extra preparation steps to ensure the model works well.
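As a rough sketch of how those pieces fit together (the example text is arbitrary, and the checkpoint named below is the usual default for this task rather than something you must use):

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# The one-liner: tokenizer, model, and post-processing are bundled for you
classifier = pipeline("sentiment-analysis")
print(classifier("Pipelines make Transformers easy to use."))

# Roughly what the pipeline does behind the scenes
model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # the task's usual default checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
inputs = tokenizer("Pipelines make Transformers easy to use.", return_tensors="pt")
outputs = model(**inputs)  # raw logits; the pipeline turns these into labels and scores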
Transformer models are usually huge, and training them and deploying them in real applications can be quite complex. Hugging Face transformers aims to simplify this whole process by providing a single way to load, train, and save any Transformer model, no matter how large it is. It is also handy when different tools are used at different stages of a model's life: you can train a model with one framework and then load it somewhere else for real-world inference without much hassle.
This tutorial covers the basics of working with datasets. The Hugging Face Datasets library aims to make it easier to load datasets that come in different formats and from different sources.
Usually, bigger datasets give better results. Hugging Face's Datasets library lets you quickly download and prepare many public datasets: you can fetch and cache a dataset directly by its name from the Dataset Hub. The result is a dictionary-like object containing all the splits of the dataset, which you can access by name.
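For example, a public dataset can be pulled from the Hub by name; the dataset name used below is just an illustration:

from datasets import load_dataset

# Download a public dataset from the Hub by name (cached locally after the first call)
raw_datasets = load_dataset("imdb")
print(raw_datasets)              # a DatasetDict with splits such as 'train' and 'test'
print(raw_datasets["train"][0])  # pick a split by name, then index into it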
A great thing about Hugging Face's Datasets library is how it manages storage on your computer: it is backed by Apache Arrow, which lets it handle even large datasets without using up too much memory.
You can learn more about what's inside a dataset by inspecting its features. If there are columns you don't need, you can easily drop them. You can also rename the label column to 'labels' (the name Hugging Face Transformers models expect) and set the output format to torch, TensorFlow, or numpy.
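A minimal sketch of those steps, reusing the raw_datasets object from the snippet above (the column names are placeholders for whatever your dataset actually contains):

# Inspect the columns and their types
print(raw_datasets["train"].features)

# Drop a column you don't need (the column name here is a placeholder)
datasets = raw_datasets.remove_columns(["unused_column"])

# Rename the label column to 'labels', which Transformers models expect
datasets = datasets.rename_column("label", "labels")

# Choose the output format: "torch", "tensorflow", or "numpy"
datasets.set_format("torch")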
Translation converts text from one language into another. Training a new translation model from scratch requires a large amount of parallel text in two or more languages. In this tutorial, we'll fine-tune a Marian model for English-to-French translation. It has already been pre-trained on a large collection of English and French text, so it has a head start, and after fine-tuning we'll have an even better translation model.
from transformers import pipeline
translator = pipeline("translation_en_to_fr")
translation = translator("What's your name?")
## [{'translation_text': "Quel est ton nom ?"}]
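The pipeline above falls back to a default English-to-French checkpoint. To work with a Marian checkpoint explicitly, you can pass its name from the Hub yourself, for example the Helsinki-NLP English-to-French model:

from transformers import pipeline

# Explicitly pick a Marian English-to-French checkpoint from the Hub
marian_translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(marian_translator("What's your name?"))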
Zero-shot classification is a special way of categorizing text with a model trained on natural language understanding. Most text classifiers work with a fixed list of categories, but a zero-shot classifier accepts the candidate categories at prediction time. That makes it very adaptable, even though it tends to run a bit slower. It can classify text in around 15 different languages without knowing the possible categories beforehand, and you can easily load such a model from the Hub.
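A minimal sketch of zero-shot classification with the pipeline; the text and candidate labels below are invented for illustration:

from transformers import pipeline

# The candidate labels are supplied at prediction time, not during training
zero_shot = pipeline("zero-shot-classification")
result = zero_shot(
    "The new graphics card doubles the performance of the previous generation.",
    candidate_labels=["technology", "sports", "politics"],
)
print(result["labels"][0], result["scores"][0])  # highest-scoring label and its score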
You create a pipeline with the pipeline() function in Hugging Face Transformers. This makes it easy to load a model that has been fine-tuned for sentiment analysis and then use it to analyze sentiment, using a specific model you can find on the Hub.
Step 1: Pick the right model for the task you want to do. In this case, we load the distilled BERT base model fine-tuned for sentiment classification.
chosen_model = "distilbert-base-uncased-finetuned-sst-2-english"
distil_bert = pipeline(task="sentiment-analysis", model=chosen_model)
As a result, the model is prepared to execute the intended task.
# Run the pipeline on the texts you want to score (english_texts is a list of sentences)
distil_bert(english_texts[1:])
This model assesses the sentiment expressed within the supplied texts or sentences.
The question-answering model is like a smart tool. You give it some text, and it can find answers in that text. It’s handy for getting information from different documents. What’s cool about this model is that it can find answers even if it doesn’t have all the background information.
You can easily use question-answering models and the Hugging Face Transformers library with the “question-answering pipeline.”
If you don’t tell it which model to use, the pipeline starts with a default one called “distilbert-base-cased-distilled-squad.” This pipeline takes a question, and some context related to the question and then figures out the answer from that context.
from transformers import pipeline

qa_pipeline = pipeline("question-answering")

query = "What is my place of residence?"
# context_text holds the passage to search, e.g. a sentence mentioning where you live
qa_result = qa_pipeline(question=query, context=context_text)
## {'answer': 'India', 'end': 39, 'score': 0.953, 'start': 31}
Creating word embeddings with BERT begins with the BERT tokenizer, which breaks the input text down into individual tokens. The tokenized input then passes through the BERT model, which produces a sequence of hidden states. These hidden states, computed with the model's learned weights, serve as the word embeddings for each token in the input text.
What’s special about BERT word embeddings is that they understand the context. This means the embedding of a word can change depending on how it’s used in a sentence. Other methods for word embeddings usually create the same embedding for a word, no matter where it appears in a sentence.
BERT, short for “Bidirectional Encoder Representations from Transformers,” is a clever system for training language understanding. It creates a solid foundation that can be used by people working on language-related tasks without any cost. These models have two main uses: you can use them to get more helpful information from your text data, or you can fine-tune them with your data to do specific jobs like sorting things, finding names, or answering questions.
BERT becomes instrumental once you feed it some input, such as a sentence or a document. It is great at pulling out important information from text, like the meanings of words and sentences, and that information is helpful for tasks such as keyword extraction, similarity search, and information retrieval. What's special about BERT is that it understands words not just on their own but in the context they're used in, which makes it better than models like Word2Vec that don't consider the surrounding words. BERT also handles the position of words well, which is important.
Hugging Face Transformers allows you to use BERT in PyTorch, which you can install easily. This library also has tools to work with other advanced language models like OpenAI’s GPT and GPT-2.
!pip install transformers
To get started, you must import PyTorch, the pre-trained BERT model, and a BERT tokenizer.
import torch
from transformers import BertTokenizer, BertModel
Transformers provides different classes for using BERT in many tasks, such as token classification and text classification. But if you want raw word representations, BertModel is the right choice.
# OPTIONAL: enable logging to see what the library is doing behind the scenes
import logging
logging.basicConfig(level=logging.INFO)

import matplotlib.pyplot as plt
%matplotlib inline

# Load the tokenizer for the pre-trained model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
When working with a pre-trained BERT model for understanding human language, it’s crucial to ensure your input data is in the right format. Let’s break it down:
Here's what these special tokens do in simpler terms: [CLS] is placed at the start of every sequence, and its final hidden state is typically used as a summary of the whole input for classification tasks; [SEP] marks the end of a sentence and separates the two sentences when a pair is passed in.
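As a quick illustration, reusing the tokenizer loaded above (the sample sentence is arbitrary):

# The tokenizer adds [CLS] at the start and [SEP] at the end automatically
sample = "Hello, how are you?"
print(tokenizer.tokenize(sample))            # the word-piece tokens, no special tokens yet
ids = tokenizer.encode(sample, add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(ids))  # now wrapped in [CLS] ... [SEP]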
BERT-base has 12 encoder layers, and each layer produces a hidden state for every token you feed in, so each layer's output has the same number of positions as the input has tokens. The representations do differ from layer to layer, though.
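A small sketch of what that looks like in code, reusing the tokenizer from above and asking the model to return every layer's hidden states (the input sentence is arbitrary):

# Ask the model to return the hidden states of every layer
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()

encoded = tokenizer("BERT gives one vector per token per layer.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**encoded)

hidden_states = outputs.hidden_states
print(len(hidden_states))        # 13: the initial embedding layer plus the 12 encoder layers
print(hidden_states[-1].shape)   # (batch_size, sequence_length, 768)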
The 'encode' function in the Hugging Face Transformers library prepares and organizes your data. Before applying it to your text, you should decide on a maximum sentence length, which is used for padding shorter sentences and truncating longer ones.
The tokenizer.encode_plus function streamlines several steps at once: it splits each sentence into tokens, adds the special '[CLS]' and '[SEP]' tokens, maps the tokens to their IDs, pads or truncates every sentence to the same length, and builds the attention mask that separates real tokens from padding.
input_ids = []
attention_masks = []

# For each sentence... (sentences is a list of raw text strings)
for sentence in sentences:
    encoded_dict = tokenizer.encode_plus(
        sentence,
        add_special_tokens=True,    # Add '[CLS]' and '[SEP]'
        max_length=64,              # Pad/truncate to this sentence length
        pad_to_max_length=True,     # Pad shorter sentences, truncate longer ones
        return_attention_mask=True, # Generate the attention mask
        return_tensors='pt',        # Return PyTorch tensors
    )

    # Store the encoded sentence
    input_ids.append(encoded_dict['input_ids'])

    # Store its attention mask (1 = real token, 0 = padding)
    attention_masks.append(encoded_dict['attention_mask'])
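After the loop, the per-sentence tensors are usually stacked into single tensors, as in this short follow-up (torch is the PyTorch module imported earlier):

# Stack the per-sentence (1 x 64) tensors into single (num_sentences x 64) tensors
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
print(input_ids.shape, attention_masks.shape)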
In BERT, we’re looking at pairs of sentences. For each word in the tokenized text, we determine if it belongs to the first sentence (marked with 0s) or the second sentence (marked with 1s).
When working with sentences in this context, you give a value of 0 to every word in the first sentence along with the ‘[SEP]’ token, and you give a value of 1 to all the words in the second sentence.
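A quick illustration of those segment IDs, reusing the tokenizer from above with two arbitrary sentences:

# Passing two sentences at once produces token_type_ids:
# 0 for the first sentence (including [CLS] and its [SEP]), 1 for the second sentence
pair = tokenizer("How old are you?", "I am 24 years old.")
print(pair["token_type_ids"])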
Now, let’s talk about how you can use BERT with your text:
The BERT Model learns complex understandings of the English language, which can help you extract different aspects of text for various tasks.
If you have a set of labeled sentences, you can train an ordinary classifier using the features produced by the BERT model as input; a short sketch of this follows below.
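As a rough sketch of that idea (assuming scikit-learn is installed; the sentences and labels below are invented purely for illustration), you could take the [CLS] vector from BertModel as the feature for each sentence and fit an ordinary classifier on top:

import torch
from transformers import BertTokenizer, BertModel
from sklearn.linear_model import LogisticRegression

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased')
bert_model.eval()

# Toy labeled data, purely for illustration
texts = ["I loved this film.", "Terrible service.", "Great experience!", "Very disappointing."]
labels = [1, 0, 1, 0]

features = []
with torch.no_grad():
    for text in texts:
        enc = bert_tokenizer(text, return_tensors='pt')
        out = bert_model(**enc)
        # Use the final hidden state of the [CLS] token as a sentence-level feature vector
        features.append(out.last_hidden_state[0, 0, :].numpy())

classifier = LogisticRegression().fit(features, labels)
print(classifier.predict(features))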
To obtain the features of a particular text using this model in TensorFlow:
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = TFBertModel.from_pretrained("bert-base-cased")
custom_text = "
You are welcome to utilize any text of your choice."
encoded_input = tokenizer(custom_text, return_tensors='tf')
output_embeddings = model(encoded_input)
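A brief follow-up on what the returned object contains; the attribute names below are the standard ones on a Transformers model output:

# The output exposes per-token embeddings and a pooled sentence-level vector
print(output_embeddings.last_hidden_state.shape)  # (1, number_of_tokens, 768)
print(output_embeddings.pooler_output.shape)      # (1, 768)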
BERT is a powerful language model developed by Google that learns from large amounts of text, and you can make it even more capable by fine-tuning it for specific tasks, such as figuring out what a sentence means. Hugging Face, on the other hand, is a popular open-source library for working with language: it gives you pre-trained BERT models, making it much easier to apply them to specific language tasks.
A. Hugging Face Transformers is a platform that gives people access to advanced, ready-to-use models. You can find these models on the Hugging Face website.
A. A pretrained transformer is a model that has already been trained and validated by individuals or companies. These models can be used as a starting point for similar tasks.
A. Hugging Face has two tiers: one for individuals and another for organizations. The individual tier has a free option with some limits and a Pro version that costs $9 monthly. Organizations get access to Lab and enterprise solutions, which aren't free.
A. Hugging Face provides integrations with around 31 different libraries and frameworks. Most of them are used for deep learning, such as PyTorch, TensorFlow, JAX, ONNX, fastai, Stable-Baselines3, and more.
A. Some of these pretrained models have been trained on multiple natural languages, and some can work with programming languages such as JavaScript, Python, Rust, and Bash/Shell. If you're interested in this area, a Python Natural Language Processing course can teach you how to clean up text data effectively.