NLPAUG – A Python library to Augment Your Text Data

nithilaau · Last Updated: 21 Oct, 2024

Introduction

In contrast to Computer Vision, where image data augmentation is common, text data augmentation in NLP is uncommon.

Simple manipulations on images, such as rotating them a few degrees or converting them to grayscale, have little effect on their semantics. Because such transformations are semantically invariant, augmentation has become an important tool in Computer Vision research.

I looked through the current literature to see whether there had been any attempts to build augmentation approaches for NLP. Based on my findings, I’ll present an overview of existing approaches for text data augmentation in this article and introduce the Python library NLPAug.

 

Table of Contents

  1. Why and What is Data Augmentation?
  2. Different methods of Text Augmentation
  3. Introduction to NLPAug library
  4. Basic elements of NLPAug library
    1. Character
    2. Word
    3. Flow

 

Why and What is Data Augmentation?

Data Augmentation describes applying transformations to our original labeled examples to construct new data for the training set.

This simply means we want to generate more examples from our current dataset. Say you have data (X, Y), where X is a sentence and Y is its corresponding label; for example, X is a movie review and Y is the sentiment associated with that review.

As a part of data augmentation, we transform this X and create X’ out of it, while still preserving the label Y.

(X, Y) ——T——> (X’, Y)

Since Y is preserved, the transformation T we apply has to be semantically invariant: it must not change the meaning of the original sentence. X’ may be syntactically a little different from X, but semantically it should mean the same thing.

People have been researching different data augmentation techniques to answer this question: “How do you define T efficiently so that X’ is diverse enough, yet semantically coherent, so that the model becomes robust and generalizes well on unseen data?”

Let’s see how this is done with a simple example.

Consider a sentence: This is a good movie.

So, this is one review. Using data augmentation techniques, we have created the following five examples:

  1. Movies good
  2. Awesome movie
  3. I like the movie
  4. Enjoyed movie
  5. This is a nice film

 

Different methods of Text Augmentation

In this section, we will discuss different techniques to augment the text. Let’s start with Synonym Replacement.


Synonym Replacement 

One of the basic text data augmentation techniques is to replace words or phrases with their synonyms. For example, a synonym lookup for the word “evaluate” can return: [‘measure’, ‘evaluate’, ‘valuate’, ‘assess’, ‘appraise’, ‘value’, ‘pass_judgment’, ‘judge’]

Synonyms are limited, however, and synonym-based augmentation cannot produce patterns that differ much from the original texts. So, let’s move to our next method using word embeddings!


Replace Words with Similar Word Embeddings

GloVe, Word2Vec, and fastText are examples of pretrained word embeddings that can be used to find the closest word vectors in the latent space and replace words in the original sentence. Contextual bidirectional embeddings such as ELMo and BERT, which have considerably richer vector representations, can also be used for more reliability. They encode longer text sequences with Bi-LSTM and Transformer-based models and are contextually aware of surrounding words.
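As a quick illustration (this is plain gensim, not NLPAug), here is a minimal sketch that looks up nearest neighbours in pretrained GloVe vectors; the ‘glove-wiki-gigaword-100’ name is one of gensim-data’s public downloads, chosen here for illustration:

import gensim.downloader as api

# Download and load pretrained GloVe vectors via gensim-data (~130 MB on first use)
model = api.load('glove-wiki-gigaword-100')

# The nearest vectors are candidate replacements for the original word
print(model.most_similar('evaluate', topn=5))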


Lexical based Replacement

WordNet is an English lexical database that includes word definitions, hyponyms, and other semantic relations. WordNet can be used to identify synonyms for the token/word that has to be replaced in the original sentence. NLP packages such as spaCy and NLTK can be used to locate and substitute synonyms in a sentence.
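For example, here is a minimal NLTK sketch that produces a synonym list like the one shown earlier for “evaluate”:

import nltk
nltk.download('wordnet')  # one-time download of the WordNet corpus
from nltk.corpus import wordnet

# Collect the lemma names from every synset that contains 'evaluate'
synonyms = {lemma.name() for syn in wordnet.synsets('evaluate') for lemma in syn.lemmas()}
print(synonyms)  # e.g. {'measure', 'evaluate', 'valuate', 'assess', 'appraise', ...}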


Back Translation

The basic idea behind back-translation is to translate a sentence into another language and then back into the original language, which typically introduces a few word-level changes.

It works in 3 steps:

  1. Input is a text in some source language (Eg: English)
  2. Translate the given text into an intermediate language (Eg: English to Spanish)
  3. Translate the previously translated text back into the source language (Eg: Spanish to English)

I’d recommend using Hugging Face transformers together with the MarianMT model and tokenizer. We can start by initializing a model that translates English to Romance languages, and similarly a model that translates any of the Romance languages back to English.

Note: When the word romance is capitalized, as in Romance languages, it most likely refers to a group of languages based on Latin, the ancient Romans’ language.

Then, given the machine translation model, tokenizer, and target Romance language, we can define a helper function to translate a batch of text. Finally, we can perform the back-translation.
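Here is a minimal sketch of that pipeline, assuming the public Helsinki-NLP MarianMT checkpoints (the model names and the French target tag are my choices for illustration; transformers and sentencepiece must be installed):

from transformers import MarianMTModel, MarianTokenizer

# English -> Romance languages, and Romance languages -> English
fwd_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
bwd_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
fwd_tok, fwd_model = MarianTokenizer.from_pretrained(fwd_name), MarianMTModel.from_pretrained(fwd_name)
bwd_tok, bwd_model = MarianTokenizer.from_pretrained(bwd_name), MarianMTModel.from_pretrained(bwd_name)

def translate(texts, model, tokenizer, language=None):
    # The multilingual en->ROMANCE model expects a target-language tag such as '>>fr<<'
    if language is not None:
        texts = ['>>{}<< {}'.format(language, t) for t in texts]
    batch = tokenizer(texts, return_tensors='pt', padding=True)
    return tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True)

def back_translate(texts, language='fr'):
    translated = translate(texts, fwd_model, fwd_tok, language=language)
    return translate(translated, bwd_model, bwd_tok)

print(back_translate(['The quick brown fox jumped over the lazy dog']))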

For example:
Original text: ‘The quick brown fox jumped over the lazy dog’
Augmented text (English to French, and then back-translated): ‘The fast brown fox jumped over the lazy dog’


Generative Models

Pretrained language models such as BERT, RoBERTa, BART, and T5 can be used to generate text in a class-label-preserving manner. The model encodes the class label together with its related text sequences to create new examples with some alterations.


Using BERT

BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model. It’s an approach for pre-training language representations that can be used in a variety of NLP applications. It was trained on a large amount of text from Wikipedia and BooksCorpus.

Two tasks are used to train this model:

  1. Masked word prediction – hides keywords in sentences and lets BERT guess what they are; this teaches BERT to understand the relationships between words. (A quick sketch follows this list.)
  2. Next sentence prediction – teaches BERT to recognize longer-term dependencies across sentences.
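As a quick illustration of masked word prediction (a Hugging Face pipeline rather than an NLPAug API), BERT can fill in a masked token in the movie-review example from earlier:

from transformers import pipeline

# BERT guesses the word hidden behind the [MASK] token
unmasker = pipeline('fill-mask', model='bert-base-uncased')
for candidate in unmasker('This is a [MASK] movie.'):
    print(candidate['token_str'], round(candidate['score'], 3))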

 

Introduction to NLPAug

In this section, I’ll introduce a Python library that does all of these data augmentations, with the ability to fine-tune the level of augmentation using various arguments.

NLPAug is a Python library for textual augmentation in machine learning experiments. The goal is to improve deep learning model performance by generating additional textual data. It can also generate adversarial examples to harden models against adversarial attacks. Let’s look at how we can use this library to enrich data.

NLPAug provides three different types of augmentation:

  1. Character level augmentation
  2. Word level augmentation
  3. Flow/sentence level augmentation

We’ll look into these three basic elements in the next section.

This NLPAug module is specially designed for Natural Language Processing. You can install it by using this command:
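pip install nlpaug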

 

Basic elements of NLPAug

The basic components of NLPAug are as follows:

  1. Character level augmentation
  2. Word level augmentation
  3. Flow augmentation

Here is a list of available augmenters mapped to the type of augmentation:

| Augmenter | Type | Keyword | Action | Description |
| --- | --- | --- | --- | --- |
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | Character | OcrAug | substitute | Simulate OCR engine error |
| Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute opposite-meaning word according to WordNet antonym |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe, or fastText embeddings to apply augmentation |
| Signal | Audio | CropAug | delete | Delete a segment of the audio |
| Pipeline | Flow | Sequential | – | Apply a list of augmentation functions sequentially |

 

Take a look at more augmenters at: https://pypi.org/project/nlpaug/

The package has many more augmentations – at the character, word, and sentence levels.

  1. Character Augmenter – OCR, Keyboard, Random
  2. Word Augmenter – Spelling, Word Embeddings, TF-IDF, Contextual Word Embeddings, Synonym, Antonym, Random Word, Split
  3. Sentence Augmenter – Contextual Word Embeddings for Sentence

Now, we will see the implementation of a few important augmentations.

Character Level Augmentation

Character level augmentation means augmenting data at the character level.

Image-to-text (OCR) and chatbots are two scenarios where this is useful. We require an optical character recognition (OCR) model to recognize text from an image, but OCR introduces errors, such as confusing “o” and “0”. And although most applications come with word correction, typos still make it into chatbot input. To get around this, you can let your model see these alternative spellings during training before it predicts online.


Keyboard

Augmenter that applies typo error simulation to textual input.

import nlpaug.augmenter.char as nac

test_sentence = 'I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace'

# KeyboardAug substitutes characters with neighbouring keys on a QWERTY keyboard.
# aug_char_p / aug_word_p control the fraction of characters / words to perturb,
# and min_char=4 skips words shorter than four characters.
aug = nac.KeyboardAug(name='Keyboard_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3, aug_word_p=0.3, 
                      aug_word_min=1, aug_word_max=10, stopwords=None, tokenizer=None, reverse_tokenizer=None, 
                      include_special_char=True, include_numeric=True, include_upper_case=True, lang='en', verbose=0, 
                      stopwords_regex=None, model_path=None, min_char=4)

test_sentence_aug = aug.augment(test_sentence)  # recent nlpaug versions return a list of strings
print(test_sentence)
print(test_sentence_aug)

Here, the augmenter replaces characters with other characters that are nearby on the keyboard: ‘n’ may get replaced with ‘m’ because ‘n’ sits next to ‘m’, since the augmenter checks the keyboard distance between two characters. In the same way, it can replace ‘o’ with ‘(’.


Optical Character Recognition (OCR)

  • Augmenter that applies OCR error simulation to textual input (a minimal sketch follows this list).
  • For example, OCR may incorrectly recognize ‘I’ as ‘1’, or ‘0’ as ‘o’ or ‘O’.
  • A pre-defined OCR mapping is leveraged to replace a character with a possible OCR error.
  • This also helps with the out-of-vocabulary (OOV) problem: OOV words are words that are not in the training set but appear in the test set or real data.
  • The main problem is that the model assigns zero probability to out-of-vocabulary words, resulting in a zero likelihood.
    • This is a common problem, especially when you have trained on a smaller dataset.
    • To overcome this, we can also use models like BERT and GPT (Generative Pre-trained Transformer models).
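The call that produced the output below is not shown in the article; a minimal OcrAug sketch with default parameters, reusing test_sentence from the keyboard example, would be:

import nlpaug.augmenter.char as nac

# Substitute characters according to a pre-defined OCR confusion mapping
aug = nac.OcrAug()
print(aug.augment(test_sentence))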

Output: I Went Shopping T0day, And My Tr0l1ey was filled with Bananas. I also had food at a burgek place

Here, as you can see, ‘o’ got replaced with ‘0’, ‘l’ with ‘1’, ‘r’ with ‘k’, and so on.


Random

Augmenter that applies random character errors to textual input.
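A minimal sketch (the action value is an assumption, since the article’s exact call is not shown):

import nlpaug.augmenter.char as nac

# Randomly substitute characters; action can also be 'insert', 'swap', or 'delete'
aug = nac.RandomCharAug(action='substitute')
print(aug.augment(test_sentence))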

Output: I Went ShoF0ing Today, And My Troagey was filled wiVh Bananas. I also had %ood at a curger placD

 

Word Level Augmentation

Aside from character-level augmentation, word-level augmentation is also crucial. To insert and substitute equivalent words, we use word2vec, GloVe, fastText, BERT, and WordNet. Word2vecAug, GloVeAug, and FasttextAug use word embeddings to replace the original word with its most similar words.

Synonym

Augmenter that substitutes words with synonyms of equivalent semantic meaning (from WordNet).
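A minimal sketch, assuming WordNet as the synonym source:

import nlpaug.augmenter.word as naw

# Substitute words with WordNet synonyms
aug = naw.SynonymAug(aug_src='wordnet')
print(aug.augment(test_sentence))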

Output: I Went Shopping Today, And My Trolley was occupy with Banana tree. Iodin also had food at a burger position


Antonym

Augmenter that substitutes words with their opposite-meaning WordNet antonyms.
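A minimal sketch; the input sentence that produced the output below is not shown in the article, so the one here is a stand-in:

import nlpaug.augmenter.word as naw

# Substitute words with their WordNet antonyms
aug = naw.AntonymAug()
print(aug.augment('The movie was very good'))  # stand-in input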

Output: very ugly


Random

Augmenter that applies random word operations (such as deletion) to textual input.
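A minimal sketch; judging from the deletions in the output below, the action is presumably 'delete' (which is also the default):

import nlpaug.augmenter.word as naw

# Randomly delete words; action can also be 'swap', 'substitute', or 'crop'
aug = naw.RandomWordAug(action='delete')
print(aug.augment(test_sentence))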

Output: Went Shopping Today, And My Trolley was filled with. also had at a place


Spelling

Augmenter that applies spelling error simulation to textual input.
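A minimal sketch using NLPAug’s built-in spelling-mistake dictionary:

import nlpaug.augmenter.word as naw

# Substitute words with common misspellings from a pre-built dictionary
aug = naw.SpellingAug()
print(aug.augment(test_sentence))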

Output: J Went Shopping Today, And Mt Trolley was fillled with Bananas. Hi also hace food tt a burger place


Split

Augmenter that applies a word-splitting operation to textual input.
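A minimal sketch:

import nlpaug.augmenter.word as naw

# Split random words into two parts
aug = naw.SplitAug()
print(aug.augment(test_sentence))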

Output: I We nt Sho pping Today, And My T rolley was filled w ith Ban anas. I also had food at a burger pla ce

 

Flow Augmentation

In this type of augmentation, we can make use of multiple augmenters at once. The Sequential and Sometimes pipelines are used to connect augmenters so that many augmentations can be applied together. A single text can be sent through multiple augmenters to yield a wide range of data.


BERT
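The augmented sentences below insert contextually plausible words into the original; a minimal sketch with NLPAug’s contextual word-embeddings augmenter would be (the model path and action='insert' are assumptions based on the outputs):

import nlpaug.augmenter.word as naw

# Insert words that BERT considers plausible in the surrounding context
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert')
print(aug.augment('I Went Shopping Today, And My Trolley was filled with Bananas. I also had food at a burger place', n=5))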

Original: I Went Shopping Today, And My Trolley was filled with Bananas. I also had food at a burger place

Augmented Text:

i went shopping today, and my trolley sack was filled up with bananas. guess i also probably had a food at – a burger place

yesterday i occasionally went shopping today, and today my trolley was filled with fresh bananas. i also only had snack food at a burger place

i generally went shopping today, though and thankfully my trolley was filled with green bananas. i once also had food at a little burger place

though i usually went bus shopping today, and my trolley car was filled with bananas. and i also also had food at a burger place

so i went shopping today, grocery and today my trolley was filled with bananas. i also sometimes had homemade food at a local burger place


Word2Vec

Sequential

You can add as many augmenters to this flow as you wish, and Sequential will execute them one by one. You can, for example, combine ContextualWordEmbsAug and WordEmbsAug, as in the sketch below.
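Here is a minimal Sequential sketch; the word2vec model path is a placeholder for a local embedding file you would supply yourself:

import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

aug = naf.Sequential([
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    naw.WordEmbsAug(model_type='word2vec', model_path='./GoogleNews-vectors-negative300.bin',  # placeholder path
                    action='substitute'),
])
print(aug.augment('What is your recommended book on Bayesian Statistics?', n=10))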

Output:

[‘what is your especially recommended book : on bayesian statistics ?’,
‘what is your ones recommended book title on bayesian statistics ?’,
‘what exactly is your recommended book based here bayesian statistics ?’,
‘what volume is your recommended reference book on bayesian stats ?’,
‘what is that your recommended book on statistical computation statistics ?’,
‘what else seems your recommended book on experimental bayesian statistics ?’,
‘what is this your recommended book after generalized bayesian statistics ?’,
‘what called your recommended course book on bayesian growth statistics ?’,
‘what is your recommended banners book on the bayesian statistics ?’,
‘what is now your most recommended book on phylogenetic statistics ?’]


Sometimes

If you don’t want to use the same set of augmenters every time, the Sometimes pipeline applies each augmenter only with some probability, so it can pick a different set of augmenters on every call, as in the sketch below.
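A minimal Sometimes sketch (the choice of augmenters is illustrative; each one is applied only with some probability on each call):

import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

aug = naf.Sometimes([
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    naw.SynonymAug(aug_src='wordnet'),
])
print(aug.augment('What is your recommended book on Bayesian Statistics?', n=10))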

Output:

[‘what is there your recommended second book on bayesian statistics ?’,
‘what exactly is your most recommended book on mathematical statistics ?’,
‘” what is your recommended guide book on bayesian statistics ?’,
‘• what is your recommended book on bayesian statistical statistics ?’,
‘What is yourself recommended book on Bayesian Statistics?’,
‘what is your generally recommended reference book monday bayesian statistics ?’,
‘what exactly is this your recommended book on bayesian statistics ?’,
‘what is your recommended classic book on bayesian overview statistics ?’,
‘What is see recommended book on Bayesian Statistics?’,
‘what is your most recommended reference book on recursive statistics ?’]

 

End Notes

In this article, I have explored the Python library NLPAug for text data augmentation. This package contains a variety of augmentations to supplement text data and introduce noise that may help your model generalize. I’ve done a beginner-level exploration only! This library definitely has potential: it can be used to predict the sentiment of a text, oversample any class of your choice (neutral, positive, or negative), and more!

These are a few libraries similar to NLPAug:

  1. TextAttack
  2. TextAugment

Hope you got a brief idea of text data augmentation and the NLPAug library. Do explore more of this library! Thank you!

 
