Hugging Face has been at the forefront of a lot of recent progress in the NLP space. They have released one groundbreaking NLP library after another in the last few years. Honestly, I have learned and improved my own NLP skills a lot thanks to the work open-sourced by Hugging Face.
And today, they’ve released another big update – a brand new version of their popular Tokenizers library.
So, what is tokenization? Tokenization is a crucial cog in Natural Language Processing (NLP). It’s a fundamental step in both traditional NLP methods like Count Vectorizer and advanced deep learning-based architectures like Transformers.
Tokens are the building blocks of Natural Language.
Hugging Face is a company that builds tools for working with natural language. They create software that helps computers understand and generate human language, and they provide a platform where people can share and use these tools for free.
Tokenization is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters, or subwords. Hence, tokenization can be broadly classified into 3 types – word, character, and subword (n-gram characters) tokenization.
For example, consider the sentence: “Never give up”.
The most common way of forming tokens is based on space. Assuming space as a delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As each token is a word, it becomes an example of Word tokenization.
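As a quick illustration of the first two types in plain Python:
sample = "Never give up"

# word tokenization: split on whitespace
print(sample.split())   # ['Never', 'give', 'up']

# character tokenization: every single character becomes a token
print(list(sample))     # ['N', 'e', 'v', 'e', 'r', ' ', 'g', ...]
Subword tokenization, on the other hand, relies on a learned vocabulary – which is exactly what the WordPiece tokenizer we use later in this article is built on.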
As tokens are the building blocks of Natural Language, the most common way of processing the raw text happens at the token level. The sentences or phrases of a text dataset are first tokenized and then those tokens are converted into integers which are then fed into the deep learning models.
For example, Transformer-based models – the State-of-the-Art (SOTA) Deep Learning architectures in NLP – process the raw text at the token level. Similarly, the most popular deep learning architectures for NLP like RNN, GRU, and LSTM also process the raw text at the token level.
We all know about Hugging Face thanks to their Transformers library that provides a high-level API to state-of-the-art transformer-based models such as BERT, GPT2, ALBERT, RoBERTa, and many more.
The Hugging Face team also happens to maintain another highly efficient and super fast library for text tokenization called Tokenizers. Recently, they released v0.8.0 of the library.
To see the entire list of updates and changes, refer to this link. In this article, I’ll show how you can easily get started with this latest version of the Tokenizers library for NLP tasks.
I’ll be using Google Colab for this demo. However, you are free to use any other platform or IDE of your choice. So, first of all, let’s quickly install the tokenizers library:
!pip install tokenizers
You can check the version of the library by executing the command below:
import tokenizers
tokenizers.__version__
Let’s import some required libraries and the BertWordPieceTokenizer from the tokenizers library:
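# the WordPiece tokenizer used by BERT
from tokenizers import BertWordPieceTokenizer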
There are other tokenization schemes available as well, such as ByteLevelBPETokenizer, CharBPETokenizer, and SentencePieceBPETokenizer. In this article, I will be using BertWordPieceTokenizer only. This is the tokenization scheme used in the BERT model.
Next, we have to download a vocabulary set for our tokenizer:
# Bert Base Uncased Vocabulary
!wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
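With the vocabulary file in place, we can initialize the tokenizer. A minimal setup looks like this (lowercasing is enabled here to match the uncased vocabulary):
# initialize the WordPiece tokenizer from the downloaded vocabulary file
tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)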
Now, let’s tokenize a sample sentence:
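sentence = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect."

# encode the sentence; the result is an Encoding object
encoded_output = tokenizer.encode(sentence)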
The three main components of “encoded_output” are its ids, tokens, and offsets:
print(encoded_output.ids)
Output: [101, 2653, 2003, 1037, 2518, 1997, 5053, 1012, 2021, 11495, 1037, 2047, 2653, 2013, 11969, 2003, 3243, 1037, 4830, 16671, 2075, 9824, 1012, 102]
print(encoded_output.tokens)
Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']
print(encoded_output.offsets)
Output: [(0, 0), (0, 8), (9, 11), (12, 13), (14, 19), (20, 22), (23, 29), (29, 30), (31, 34), (35, 44), (45, 46), (47, 50), (51, 59), (60, 64), (65, 72), (73, 75), (76, 81), (82, 83), (84, 86), (86, 89), (89, 92), (93, 101), (101, 102), (0, 0)]
The tokenizers library also allows us to easily save our tokenizer as a JSON file and load it back for later use. This is helpful when working with large text datasets, as we won’t have to initialize the tokenizer again and again.
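For example, here is a minimal sketch assuming the full-serialization API introduced in v0.8.0, where save writes a single JSON file and Tokenizer.from_file loads it back:
from tokenizers import Tokenizer

# save the initialized tokenizer (vocabulary + settings) to one JSON file
tokenizer.save("bert_wordpiece_tokenizer.json")

# ...later, load it back without needing the original vocab file
loaded_tokenizer = Tokenizer.from_file("bert_wordpiece_tokenizer.json")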
While working with text data, there are often situations where the data is already tokenized. However, it is not tokenized as per the desired tokenization scheme. In such a case, the tokenizers library can come in handy as it can encode pre-tokenized text sequences as well.
So, instead of the input sentence, we will pass the tokenized form of the sentence as input. Here, we have tokenized the sentence based on the space between two consecutive words:
print(sentence.split())
Output: ['Language', 'is', 'a', 'thing', 'of', 'beauty.', 'But', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'daunting', 'prospect.']
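We then pass this list of words to the tokenizer, flagging the input as pre-tokenized (in v0.8.0 this is done through the is_pretokenized argument of encode):
encoded_output = tokenizer.encode(sentence.split(), is_pretokenized=True)
print(encoded_output.tokens)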
Output: ['[CLS]', 'language', 'is', 'a', 'thing', 'of', 'beauty', '.', 'but', 'mastering', 'a', 'new', 'language', 'from', 'scratch', 'is', 'quite', 'a', 'da', '##unt', '##ing', 'prospect', '.', '[SEP]']
It turns out that this output is identical to the output we got when the input was a text string.
As I mentioned above, tokenizers is a fast tokenization library. Let’s test it out on a large text corpus.
I will use the WikiText-103 dataset (181 MB in size). Let’s first download it and then unzip it:
!wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
!unzip wikitext-103-v1.zip
The unzipped data contains three files – wiki.train.tokens, wiki.valid.tokens, and wiki.test.tokens. We will use the wiki.train.tokens file only for benchmarking:
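A simple way to load it and count the sequences (assuming the archive was unzipped into a wikitext-103/ folder):
# read the training split and break it into individual lines (sequences)
with open("wikitext-103/wiki.train.tokens", "r", encoding="utf-8") as f:
    text = f.read().split("\n")

print(len(text))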
Output: 1801350
There are close to 1.8 million text sequences in the train set – quite a huge number. Let’s see how the tokenizers library deals with this much data. We will use “encode_batch” instead of “encode” because now we are going to tokenize more than one sequence:
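A sketch of the benchmark, timed with Python’s time module:
import time

start = time.time()
# tokenize all 1.8 million sequences in a single call
encoded_sequences = tokenizer.encode_batch(text)
end = time.time()

print(round(end - start, 4))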
Output: 218.2345
This is mind-blowing! It took just 218 seconds, or close to 3.5 minutes, to tokenize 1.8 million text sequences. Most other tokenization methods would struggle with a corpus of this size, even on Colab.
Go ahead, try it out and let me know your experience using Hugging Face’s Tokenizers NLP library!
To sum up: tokenization is the process of breaking text down into smaller units called tokens, and it is a foundational step for nearly every NLP task. Hugging Face’s Tokenizers library provides fast, efficient tools for doing it, is easy to get started with, and, as the benchmark above shows, handles even very large corpora comfortably.