In contrast to Computer Vision, where image data augmentation is common, text data augmentation in NLP is uncommon.
Simple manipulations of images, such as rotating them a few degrees or converting them to grayscale, have little effect on their semantics. Because such transformations are semantically invariant, augmentation has become an important tool in Computer Vision research.
I looked through the current literature to see whether there had been any attempts to build augmentation approaches for NLP. Based on my findings, I’ll present an overview of existing approaches for text data augmentation in this article and introduce the Python library ‘NLPAug’.
Data Augmentation describes applying transformations to our original labeled examples to construct new data for the training set.
This simply means we want to generate more examples from our current dataset. Say you have data (X, Y), where X is a sentence and Y is its corresponding label; for example, X is a movie review and Y is the sentiment associated with that review.
As a part of data augmentation, we transform this X and create X’ out of it, while still preserving the label Y.
(X, Y) ——T——> (X’, Y)
Since Y is preserved, the transformation T that we apply has to be semantically invariant: it must not change the meaning of the original sentence. X’ may be syntactically a little different from X, but semantically it should mean the same thing.
Researchers have been trying to come up with different data augmentation techniques to answer this question – “How do you define T efficiently so that X’ is diverse enough, yet semantically coherent, so that the model becomes robust and generalizes well on unseen data?”
Let’s see how this is done with a simple example.
Consider a sentence: This is a good movie.
So, this is one review. Using data augmentation techniques, we can create several new examples from it, for instance ‘This is a great movie.’ or ‘This is a nice film.’, each of which still carries the positive label.
In this section, we will discuss different techniques to augment the text. Let’s start with Synonym Replacement.
Synonym Replacement
One of the most basic text data augmentation techniques is to replace words or phrases with their synonyms. For example, the synonyms found for the word “evaluate” can be: [‘measure’, ‘evaluate’, ‘valuate’, ‘assess’, ‘appraise’, ‘value’, ‘pass_judgment’, ‘judge’]
We know that synonyms are very limited and synonym-based augmentation cannot produce different patterns from the original texts. So, let’s move to our next method using word embeddings!
Replace Words with Similar Word Embeddings
GloVe, Word2Vec, and fastText are examples of pre-trained word embeddings that can be used to find the closest word vectors in the latent space and replace words in the original sentence. Contextual bidirectional embeddings such as ELMo and BERT, which provide a considerably richer vector representation, can be used for more reliable replacements. Bi-LSTM- and Transformer-based models encode longer text sequences and are contextually aware of the surrounding words.
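For example, with gensim you can load a pre-trained embedding and query its nearest neighbours. A minimal sketch, assuming the ‘glove-wiki-gigaword-100’ model available through gensim’s downloader:

import gensim.downloader as api

# Load pre-trained GloVe vectors (downloaded on first use)
vectors = api.load('glove-wiki-gigaword-100')

# The nearest neighbours in embedding space are candidate replacements
print(vectors.most_similar('good', topn=5))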
Lexicon-based Replacement
WordNet is an English lexical database that includes word definitions, hyponyms, and other semantic relations. WordNet can be used to find synonyms for the token/word that has to be replaced in the original sentence. NLP packages such as spaCy and NLTK can be used to locate and substitute synonyms in a sentence.
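A minimal sketch with NLTK’s WordNet interface; this is one way to produce the synonym list for “evaluate” shown earlier:

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')  # one-time download of the WordNet corpus

# Collect the lemma names of every synset of 'evaluate'
synonyms = set()
for syn in wordnet.synsets('evaluate'):
    for lemma in syn.lemmas():
        synonyms.add(lemma.name())
print(synonyms)  # {'measure', 'evaluate', 'valuate', 'assess', 'appraise', ...}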
Back Translation
The basic idea behind back-translation is to translate a sentence into another language and then back into the original language; the round trip typically changes a few words while preserving the meaning.
It works in 3 steps:
1. Take a sentence in the source language (e.g., English).
2. Translate it into an intermediate language (e.g., French).
3. Translate it back into the source language and use the result as a new training example.
I’d recommend using the HuggingFace transformers library and the Moses tokenizers, from which we can import the MarianMT model and tokenizer. We start by initializing a model that translates English to Romance languages, and similarly a model that translates any of the Romance languages back to English.
Note: When the word romance is capitalized, as in Romance languages, it refers to the group of languages derived from Latin, the language of the ancient Romans.
Then, given the machine translation model, tokenizer, and target Romance language, we can define a helper function to translate a batch of text. Finally, we can perform the back-translation, as in the sketch below.
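A minimal sketch, using the Helsinki-NLP MarianMT checkpoints from the HuggingFace hub, with French as the intermediate language:

from transformers import MarianMTModel, MarianTokenizer

# English -> Romance languages, and Romance languages -> English
en_romance_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
romance_en_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
en_romance_tok = MarianTokenizer.from_pretrained(en_romance_name)
en_romance = MarianMTModel.from_pretrained(en_romance_name)
romance_en_tok = MarianTokenizer.from_pretrained(romance_en_name)
romance_en = MarianMTModel.from_pretrained(romance_en_name)

def translate(texts, model, tokenizer, language=None):
    # The multilingual en->ROMANCE model expects a target-language token, e.g. '>>fr<<'
    if language is not None:
        texts = [f'>>{language}<< {t}' for t in texts]
    encoded = tokenizer(texts, return_tensors='pt', padding=True)
    generated = model.generate(**encoded)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

def back_translate(texts, language='fr'):
    # English -> French -> English
    translated = translate(texts, en_romance, en_romance_tok, language=language)
    return translate(translated, romance_en, romance_en_tok)

print(back_translate(['The quick brown fox jumped over the lazy dog']))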
For example:
Original text: ‘The quick brown fox jumped over the lazy dog’
Augmented text (English to French, and then back-translated): ‘The fast brown fox jumped over the lazy dog’
Generative Models
Pre-trained language models such as BERT, RoBERTa, BART, or the more recent T5 can be used to generate text in a class-label-preserving manner: the model is conditioned on the class label together with its related text sequences to create new examples with some alterations.
Using BERT
BERT stands for Bidirectional Encoder Representations from Transformers and is a language representation model. It’s an approach for pre-training language representations that can be used in a variety of NLP applications. It was trained on a large amount of text from Wikipedia and BookCorpus.
Two tasks are used to train this model:
1. Masked Language Modeling (MLM): random tokens in the input are masked, and the model learns to predict them from the surrounding context.
2. Next Sentence Prediction (NSP): the model predicts whether one sentence actually follows another in the original text.
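The MLM head is also handy for augmentation: mask a word and let BERT propose context-aware substitutes. A minimal sketch with the HuggingFace fill-mask pipeline:

from transformers import pipeline

# BERT's masked-language-model head proposes context-aware replacements
unmasker = pipeline('fill-mask', model='bert-base-uncased')
for candidate in unmasker('This is a [MASK] movie.'):
    print(candidate['sequence'], candidate['score'])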
In this section, I’ll walk you through a Python library that does all of these data augmentations, with the ability to fine-tune the level of augmentation required using various arguments.
NLPAug is a Python library for textual augmentation in machine learning experiments. The goal is to improve deep learning model performance by generating additional textual data; it can also generate adversarial examples to help defend against adversarial attacks. Let’s look at how we can use this library to enrich data.
NLPAug provides three different types of augmentation:
1. Character-level augmentation
2. Word-level augmentation
3. Flow (pipeline) augmentation
We’ll look into these three basic elements in the next section.
This NLPAug module is specially designed for Natural Language Processing. You can install it with the following command:
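pip install nlpaug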
The basic components of NLPAug are as follows:
- Augmenter: the basic element, which applies a single augmentation to the input.
- Flow: a pipeline (such as Sequential or Sometimes) that chains multiple augmenters together.
Here is a list of available augmenters mapped with the type of augmentation:

| Augmenter | Type | Keyword | Action | Description |
| --- | --- | --- | --- | --- |
| Textual | Character | KeyboardAug | substitute | Simulate keyboard distance error |
| Textual | Character | OcrAug | substitute | Simulate OCR engine error |
| Textual | Character | RandomAug | insert, substitute, swap, delete | Apply augmentation randomly |
| Textual | Word | AntonymAug | substitute | Substitute a word with its opposite meaning according to WordNet antonyms |
| Textual | Word | WordEmbsAug | insert, substitute | Leverage word2vec, GloVe, or fastText embeddings to apply augmentation |
| Signal | Audio | CropAug | delete | Delete a segment of the audio |
| Pipeline | Flow | Sequential | | Apply a list of augmentation functions sequentially |
Take a look at more augmenters at: https://pypi.org/project/nlpaug/
The package has many more augmenters at the character, word, and sentence levels.
Now, let’s look at the implementation of a few important augmentations.
Character Level Augmentation
Character-level augmentation means enhancing data at the character level.
Image-to-text pipelines and chatbots are two example applications. We require an optical character recognition (OCR) model to recognize text from an image, but OCR introduces some errors, such as confusing “o” and “0”. And even though most applications come with word correction, chatbot inputs still contain typos. To make a model robust to such noise, you can let it see these alternative spellings during training.
Keyboard
Augmenter that applies typo error simulation to textual input.
import nlpaug.augmenter.char as nac

test_sentence = 'I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace'

# Substitute characters with neighbouring keys on a QWERTY keyboard.
# aug_char_p / aug_word_p control the fraction of characters / words touched.
aug = nac.KeyboardAug(name='Keyboard_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3,
                      aug_word_p=0.3, aug_word_min=1, aug_word_max=10, stopwords=None,
                      tokenizer=None, reverse_tokenizer=None, include_special_char=True,
                      include_numeric=True, include_upper_case=True, lang='en', verbose=0,
                      stopwords_regex=None, model_path=None, min_char=4)
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)
Here, the augmenter replaces characters with others that are close to them on the keyboard. For example, ‘n’ can get replaced with ‘m’ because the two keys are adjacent; the augmenter checks the distance between characters on the keyboard. In the same way, it can replace ‘o’ with ‘(‘.
Optical Character Recognition (OCR)
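This augmenter substitutes characters with the visually similar ones an OCR engine tends to confuse. A minimal sketch, reusing the test sentence from above:

import nlpaug.augmenter.char as nac

# Substitute characters with common OCR confusions, e.g. 'o' -> '0', 'l' -> '1'
aug = nac.OcrAug()
print(aug.augment(test_sentence))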
Output: I Went Shopping T0day, And My Tr0l1ey was filled with Bananas. I also had food at a burgek place
Here, as you can see, ‘o’ got replaced with ‘0’, ‘l’ with ‘1’, ‘r’ with ‘k’, and so on.
Random
Augmenter that applies random character errors to textual input.
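A minimal sketch:

import nlpaug.augmenter.char as nac

# Randomly substitute characters (action can also be 'insert', 'swap', or 'delete')
aug = nac.RandomCharAug(action='substitute')
print(aug.augment(test_sentence))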
Output: I Went ShoF0ing Today, And My Troagey was filled wiVh Bananas. I also had %ood at a curger placD
Word Level Augmentation
Aside from character-level augmentation, word-level augmentation is also crucial. To insert and substitute equivalent words, we can use word2vec, GloVe, fastText, BERT, and WordNet. Word2vecAug, GloVeAug, and FasttextAug use word embeddings to replace the original word with its most similar words.
Synonym
Augmenter that substitutes words with synonyms, preserving the semantic meaning of the textual input.
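A minimal sketch using WordNet as the synonym source:

import nlpaug.augmenter.word as naw

# Substitute random words with WordNet synonyms
aug = naw.SynonymAug(aug_src='wordnet')
print(aug.augment(test_sentence))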
Output: I Went Shopping Today, And My Trolley was occupy with Banana tree. Iodin also had food at a burger position
Antonym
Augmenter that substitutes words with their antonyms, flipping the meaning of the textual input.
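A minimal sketch; the input sentence here is an assumption chosen to match the output below:

import nlpaug.augmenter.word as naw

# Substitute words with WordNet antonyms
aug = naw.AntonymAug()
print(aug.augment('very beautiful'))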
Output: very ugly
Random
Augmenter that applies random word operations (such as swap or delete) to textual input.
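A minimal sketch; deletion matches the output below:

import nlpaug.augmenter.word as naw

# Randomly delete words (action can also be 'swap', 'substitute', or 'crop')
aug = naw.RandomWordAug(action='delete')
print(aug.augment(test_sentence))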
Output: Went Shopping Today, And My Trolley was filled with. also had at a place
Spelling
Augmenter that applies spelling error simulation to textual input.
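A minimal sketch:

import nlpaug.augmenter.word as naw

# Substitute words with common misspellings from a built-in spelling-error dictionary
aug = naw.SpellingAug()
print(aug.augment(test_sentence))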
Output: J Went Shopping Today, And Mt Trolley was fillled with Bananas. Hi also hace food tt a burger place
Split
Augmenter that applies a word-splitting operation to textual input.
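A minimal sketch:

import nlpaug.augmenter.word as naw

# Randomly split one word into two tokens
aug = naw.SplitAug()
print(aug.augment(test_sentence))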
Output: I We nt Sho pping Today, And My T rolley was filled w ith Ban anas. I also had food at a burger pla ce
Flow Augmentation
In this type of augmentation, we can make use of multiple augmenters at once. Pipelines such as Sequential and Sometimes are used to connect augmenters, so a single text can be sent through several of them to yield a wide range of augmented data.
Original: I Went Shopping Today, And My Trolley was filled with Bananas. I also had food at a burger place
Augmented Text:
i went shopping today, and my trolley sack was filled up with bananas. guess i also probably had a food at – a burger place
yesterday i occasionally went shopping today, and today my trolley was filled with fresh bananas. i also only had snack food at a burger place
i generally went shopping today, though and thankfully my trolley was filled with green bananas. i once also had food at a little burger place
though i usually went bus shopping today, and my trolley car was filled with bananas. and i also also had food at a burger place
so i went shopping today, grocery and today my trolley was filled with bananas. i also sometimes had homemade food at a local burger place
Sequential
You can add as many augmenters to this flow as you wish, and Sequential will execute them one by one. For example, you can combine ContextualWordEmbsAug and WordEmbsAug, as in the sketch below.
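A minimal sketch; the word2vec binary path is an assumption, so point it at a model you have downloaded (e.g., the GoogleNews vectors):

import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

aug = naf.Sequential([
    # Insert words predicted by BERT from the surrounding context
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    # Substitute words with their nearest word2vec neighbours
    naw.WordEmbsAug(model_type='word2vec',
                    model_path='GoogleNews-vectors-negative300.bin',
                    action='substitute'),
])
print(aug.augment('what is your recommended book on bayesian statistics ?', n=3))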
Output:
[‘what is your especially recommended book : on bayesian statistics ?’,
Sometimes
If you don’t want to use the same set of augmenters every time, the Sometimes pipeline applies each augmenter with some probability, so a different subset of them can fire on every call.
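A minimal sketch:

import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

# Each augmenter in the list is applied with some probability on every call
aug = naf.Sometimes([
    naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action='insert'),
    naw.RandomWordAug(action='swap'),
])
print(aug.augment('what is your recommended book on bayesian statistics ?', n=3))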
Output:
[‘what is there your recommended second book on bayesian statistics ?’,
In this article, I have explored the Python library ‘NLPAug’ for text data augmentation. This package contains a variety of augmentations to supplement text data and introduce noise that may help your model generalize. I’ve done a beginner-level exploration only! This library definitely has potential: you can use it to augment training data for sentiment prediction, to oversample any class of your choice (neutral, positive, or negative), and much more!
These are a few libraries similar to NLPAug: TextAttack, TextAugment, and EDA (Easy Data Augmentation).
Hope you got a brief idea of text data augmentation and the NLPAug library. Do explore more of this library! Thank you!