In contrast to Computer Vision, where image data augmentation is common, text data augmentation in NLP is uncommon.
Simple manipulations of images, such as rotating them a few degrees or converting them to grayscale, have little effect on their semantics. Because such transformations are semantically invariant, augmentation has become an important tool in Computer Vision research.
I looked through the current literature to see whether there had been any attempts to build augmentation approaches for NLP. Based on my findings, I’ll present an overview of existing approaches for text data augmentation in this article and introduce the Python library ‘NLPAug’.
Data Augmentation describes applying transformations to our original labeled examples to construct new data for the training set.
This simply means we want to generate more examples from our current dataset. Say you have data (X, Y), where X is a sentence and Y is its corresponding label; for example, X is a movie review and Y is the sentiment associated with that review.
As a part of data augmentation, we transform this X and create X’ out of it, while still preserving the label Y.
(X, Y) ——T——> (X’, Y)
Since Y is preserved, the transformation T that we apply has to be semantically invariant: it must not change the meaning of the original sentence. X’ may be syntactically a little different from X, but semantically it should mean the same thing.
Researchers have been trying to come up with different data augmentation techniques to answer this question – “How do you define T efficiently so that X’ is diverse enough, yet semantically coherent, so that the model becomes robust and generalizes well on unseen data?”
Let’s see how this is done with a simple example.
Consider a sentence: This is a good movie.
So, this is one review. Using data augmentation techniques, we can create several new examples from it, for instance ‘This is a great movie.’ or ‘This is a nice film.’, each of which still carries the positive label.
In this section, we will discuss different techniques to augment the text. Let’s start with Synonym Replacement.
Synonym Replacement
One of the most basic text data augmentation techniques is to replace words or phrases with their synonyms. For example, the synonyms found for the word “evaluate” can be: [‘measure’, ‘evaluate’, ‘valuate’, ‘assess’, ‘appraise’, ‘value’, ‘pass_judgment’, ‘judge’]
We know that synonyms are very limited and synonym-based augmentation cannot produce different patterns from the original texts. So, let’s move to our next method using word embeddings!
Replace Words with Similar Word Embeddings
GloVe, Word2Vec, and fastText are examples of pre-trained word embeddings that can be used to find the closest word vectors in the latent space and replace words in the original sentence. Contextual bidirectional embeddings such as ELMo and BERT, which provide a considerably richer vector representation, can be used for more reliable replacements. Bi-LSTM- and Transformer-based models encode longer text sequences and are contextually aware of the surrounding words.
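For example, with gensim you can load a pre-trained embedding and query its nearest neighbours. A minimal sketch, assuming the ‘glove-wiki-gigaword-100’ model available through gensim’s downloader:

import gensim.downloader as api

# Load pre-trained GloVe vectors (downloaded on first use)
vectors = api.load('glove-wiki-gigaword-100')

# The nearest neighbours in embedding space are candidate replacements
print(vectors.most_similar('good', topn=5))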
Lexicon-based Replacement
WordNet is an English lexical database that includes word definitions, hyponyms, and other semantic relations. WordNet can be used to find synonyms for the token/word that has to be replaced in the original sentence. NLP packages such as spaCy and NLTK can be used to locate and substitute synonyms in a sentence.
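A minimal sketch with NLTK’s WordNet interface; this is one way to produce the synonym list for “evaluate” shown earlier:

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')  # one-time download of the WordNet corpus

# Collect the lemma names of every synset of 'evaluate'
synonyms = set()
for syn in wordnet.synsets('evaluate'):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
print(synonyms)  # {'measure', 'evaluate', 'valuate', 'assess', 'appraise', ...}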
Back Translation
The basic idea behind back-translation is to translate a sentence into another language and then back into the original language; the round trip typically changes a few words while preserving the meaning.
It works in 3 steps:
1. Take a sentence in the source language (e.g., English).
2. Translate it into an intermediate language (e.g., French).
3. Translate it back into the source language and use the result as a new training example.
I’d recommend using the HuggingFace transformers library and the Moses tokenizers, from which we can import the MarianMT model and tokenizer. We start by initializing a model that translates English to Romance languages, and similarly a model that translates any of the Romance languages back to English.
Note: When the word romance is capitalized, as in Romance languages, it refers to the group of languages derived from Latin, the language of the ancient Romans.
Then, given the machine translation model, tokenizer, and target Romance language, we can define a helper function to translate a batch of text. Finally, we can perform the back-translation, as in the sketch below.
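A minimal sketch, using the Helsinki-NLP MarianMT checkpoints from the HuggingFace hub, with French as the intermediate language:

from transformers import MarianMTModel, MarianTokenizer

# English -> Romance languages, and Romance languages -> English
en_romance_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
romance_en_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
en_romance_tok = MarianTokenizer.from_pretrained(en_romance_name)
en_romance = MarianMTModel.from_pretrained(en_romance_name)
romance_en_tok = MarianTokenizer.from_pretrained(romance_en_name)
romance_en = MarianMTModel.from_pretrained(romance_en_name)

def translate(texts, model, tokenizer, language=None):
    # The multilingual en->ROMANCE model expects a target-language token, e.g. '>>fr<<'
    if language is not None:
        texts = [f'>>{language}<< {t}' for t in texts]
    encoded = tokenizer(texts, return_tensors='pt', padding=True)
    generated = model.generate(**encoded)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

def back_translate(texts, language='fr'):
    # English -> French -> English
    translated = translate(texts, en_romance, en_romance_tok, language=language)
    return translate(translated, romance_en, romance_en_tok)

print(back_translate(['The quick brown fox jumped over the lazy dog']))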
For example:
Original text: ‘The quick brown fox jumped over the lazy dog’
Augmented text (English to French, and then back-translated): ‘The fast brown fox jumped over the lazy dog’
Generative Models
Pre-trained language models such as BERT, RoBERTa, BART, or the more recent T5 can be used to generate text in a class-label-preserving manner: the model is conditioned on the class label together with its related text sequences to create new examples with some alterations.
Using BERT
BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model. It’s an approach for pre-training language representations that can be used in a variety of NLP applications. It was trained on a large amount of text from Wikipedia and BookCorpus.
Two tasks are used to train this model:
1. Masked Language Modeling (MLM): random tokens in the input are masked, and the model learns to predict them from the surrounding context.
2. Next Sentence Prediction (NSP): the model predicts whether one sentence actually follows another in the original text.
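The MLM head is also handy for augmentation: mask a word and let BERT propose context-aware substitutes. A minimal sketch with the HuggingFace fill-mask pipeline:

from transformers import pipeline

# BERT's masked-language-model head proposes context-aware replacements
unmasker = pipeline('fill-mask', model='bert-base-uncased')
for candidate in unmasker('This is a [MASK] movie.'):
    print(candidate['sequence'], candidate['score'])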
In this section, I’ll walk you through a Python library that does all of these data augmentations, with the ability to fine-tune the level of augmentation required using various arguments.
NLPAug is a Python library for textual augmentation in machine learning experiments. The goal is to improve deep learning model performance by generating additional textual data; it can also generate adversarial examples to help defend against adversarial attacks. Let’s look at how we can use this library to enrich data.
NLPAug provides three different types of augmentation:
1. Character-level augmentation
2. Word-level augmentation
3. Flow (pipeline) augmentation
We’ll look into these three basic elements in the next section.
This NLPAug module is specially designed for Natural Language Processing. You can install it with the following command:
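pip install nlpaug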
The basic components of NLPAug are as follows:
- Augmenter: the basic element, which applies a single augmentation to the input.
- Flow: a pipeline (such as Sequential or Sometimes) that chains multiple augmenters together.
Here is a list of available augmenters mapped with the type of augmentation:

| Augmenter | Type | Keyword | Action | Description |
| --- | --- | --- | --- | --- |
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | Character | OcrAug | substitute | Simulate OCR engine error |
| Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its opposite meaning according to WordNet antonyms |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe, or fastText embeddings to apply augmentation |
| Signal | Audio | CropAug | delete | Delete a segment of the audio |
| Pipeline | Flow | Sequential | | Apply a list of augmentation functions sequentially |
Take a look at more augmenters at: https://pypi.org/project/nlpaug/
The package has many more augmenters at the character, word, and sentence levels.
Now, let’s look at the implementation of a few important augmentations.
Character Level Augmentation
Character-level augmentation means enhancing data at the character level.
Image-to-text pipelines and chatbots are two example applications. We require an optical character recognition (OCR) model to recognize text from an image, but OCR introduces some errors, such as confusing “o” and “0”. And even though most applications come with word correction, chatbot inputs still contain typos. To make a model robust to such noise, you can let it see these alternative spellings during training.
Keyboard
Augmenter that applies typo error simulation to textual input.
import nlpaug.augmenter.char as nac

test_sentence = 'I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace'

# Substitute characters with neighbouring keys on a QWERTY keyboard.
# aug_char_p / aug_word_p control the fraction of characters / words touched.
aug = nac.KeyboardAug(name='Keyboard_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3,
                      aug_word_p=0.3, aug_word_min=1, aug_word_max=10, stopwords=None,
                      tokenizer=None, reverse_tokenizer=None, include_special_char=True,
                      include_numeric=True, include_upper_case=True, lang='en', verbose=0,
                      stopwords_regex=None, model_path=None, min_char=4)
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)
Here, the augmenter replaces characters with others that are close to them on the keyboard. For example, ‘n’ can get replaced with ‘m’ because the two keys are adjacent; the augmenter checks the distance between characters on the keyboard. In the same way, it can replace ‘o’ with ‘(‘.
Optical Character Recognition (OCR)
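This augmenter substitutes characters with the visually similar ones an OCR engine tends to confuse. A minimal sketch, reusing the test sentence from above:

import nlpaug.augmenter.char as nac

# Substitute characters with common OCR confusions, e.g. 'o' -> '0', 'l' -> '1'
aug = nac.OcrAug()
print(aug.augment(test_sentence))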
Output: I Went Shopping T0day, And My Tr0l1ey was filled with Bananas. I also had food at a burgek place
Here, as you can see, ‘o’ got replaced with ‘0’, ‘l’ with ‘1’, ‘r’ with ‘k’, and so on.
Random
Augmenter that applies random character errors to textual input.
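A minimal sketch:

import nlpaug.augmenter.char as nac

# Randomly substitute characters (action can also be 'insert', 'swap', or 'delete')
aug = nac.RandomCharAug(action='substitute')
print(aug.augment(test_sentence))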
Output: I Went ShoF0ing Today, And My Troagey was filled wiVh Bananas. I also had %ood at a curger placD
Word Level Augmentation
Aside from character-level augmentation, word-level augmentation is also crucial. To insert and substitute equivalent words, we can use word2vec, GloVe, fastText, BERT, and WordNet. Word2vecAug, GloVeAug, and FasttextAug use word embeddings to replace the original word with its most similar words.
Synonym
Augmenter that substitutes words with synonyms, preserving the semantic meaning of the textual input.
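A minimal sketch using WordNet as the synonym source:

import nlpaug.augmenter.word as naw

# Substitute random words with WordNet synonyms
aug = naw.SynonymAug(aug_src='wordnet')
print(aug.augment(test_sentence))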
Output: I Went Shopping Today, And My Trolley was occupy with Banana tree. Iodin also had food at a burger position
Antonym
Augmenter that substitutes words with their antonyms, flipping the meaning of the textual input.
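A minimal sketch; the input sentence here is an assumption chosen to match the output below:

import nlpaug.augmenter.word as naw

# Substitute words with WordNet antonyms
aug = naw.AntonymAug()
print(aug.augment('very beautiful'))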
Output: very ugly
Random
Augmenter that applies random word operations (such as swap or delete) to textual input.
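A minimal sketch; deletion matches the output below:

import nlpaug.augmenter.word as naw

# Randomly delete words (action can also be 'swap', 'substitute', or 'crop')
aug = naw.RandomWordAug(action='delete')
print(aug.augment(test_sentence))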
Output: Went Shopping Today, And My Trolley was filled with. also had at a place
Spelling
Augmenter that applies spelling error simulation to textual input.
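A minimal sketch:

import nlpaug.augmenter.word as naw

# Substitute words with common misspellings from a built-in spelling-error dictionary
aug = naw.SpellingAug()
print(aug.augment(test_sentence))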
Output: J Went Shopping Today, And Mt Trolley was fillled with Bananas. Hi also hace food tt a burger place
Split
Augmenter that applies a word-splitting operation to textual input.
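A minimal sketch:

import nlpaug.augmenter.word as naw

# Randomly split one word into two tokens
aug = naw.SplitAug()
print(aug.augment(test_sentence))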
Output: I We nt Sho pping Today, And My T rolley was filled w ith Ban anas. I also had food at a burger pla ce
Flow Augmentation
In this type of augmentation, we can make use of multiple augmenters at once. Pipelines such as Sequential and Sometimes are used to connect augmenters, so a single text can be sent through several of them to yield a wide range of augmented data.
Original: I Went Shopping Today, And My Trolley was filled with Bananas. I also had food at a burger place
Augmented Text:
i went shopping today, and my trolley sack was filled up with bananas. guess i also probably had a food at – a burger place
yesterday i occasionally went shopping today, and today my trolley was filled with fresh bananas. i also only had snack food at a burger place
i generally went shopping today, though and thankfully my trolley was filled with green bananas. i once also had food at a little burger place
though i usually went bus shopping today, and my trolley car was filled with bananas. and i also also had food at a burger place
so i went shopping today, grocery and today my trolley was filled with bananas. i also sometimes had homemade food at a local burger place
Sequential
You can add as many augmenters to this flow as you wish, and Sequential will execute them one by one. For example, you can combine ContextualWordEmbsAug and WordEmbsAug, as in the sketch below.
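A minimal sketch; the word2vec binary path is an assumption, so point it at a model you have downloaded (e.g., the GoogleNews vectors):

import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

aug = naf.Sequential([
    # Insert words predicted by BERT from the surrounding context
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    # Substitute words with their nearest word2vec neighbours
    naw.WordEmbsAug(model_type='word2vec',
                    model_path='GoogleNews-vectors-negative300.bin',
                    action='substitute'),
])
print(aug.augment('what is your recommended book on bayesian statistics ?', n=3))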
Output:
[‘what is your especially recommended book : on bayesian statistics ?’,
Sometimes
If you don’t want to use the same set of augmenters every time, the Sometimes pipeline applies each augmenter with some probability, so a different subset of them can fire on every call.
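A minimal sketch:

import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

# Each augmenter in the list is applied with some probability on every call
aug = naf.Sometimes([
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    naw.RandomWordAug(action='swap'),
])
print(aug.augment('what is your recommended book on bayesian statistics ?', n=3))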
Output:
[‘what is there your recommended second book on bayesian statistics ?’,
In this article, I have explored the Python library ‘NLPAug’ for text data augmentation. This package contains a variety of augmentations to supplement text data and introduce noise that may help your model generalize. I’ve done a beginner-level exploration only! This library definitely has potential: you can use it to augment training data for sentiment prediction, to oversample any class of your choice (neutral, positive, or negative), and much more!
These are a few libraries similar to NLPAug: TextAttack, TextAugment, and EDA (Easy Data Augmentation).
Hope you got a brief idea of text data augmentation and the NLPAug library. Do explore more of this library! Thank you!