Explaining Text Generation with LSTM

Abhishek Jaiswal Last Updated : 22 Mar, 2022


This article was published as a part of the Data Science Blogathon.

An end-to-end guide on Text generation using LSTM

Text Generation with LSTM — Source: develop-paper

Hey Folks!

In this article, we are going to talk about text generation using LSTM with end-to-end examples. We will also look at concepts related to LSTM as a quick revision.

In text generation, we predict the next character or word of a given sequence. Text data can be seen as a sequence of words or a sequence of individual characters. For sequence prediction, we use deep learning models like RNN/LSTM/GRU.

I have already written a very detailed article on the idea of RNN, in which I discussed why RNN is not practical and explained RNN and GRU with examples. You can refer to that article for background.

Introduction to LSTM
Why does RNN fail?
Understanding LSTM architecture and various Gates
The idea of Text Generation
Implementation of Text Generation using LSTM

Introduction

Text generation can be considered a very important feature of AI-based tools. It is very useful in machines that are supposed to become more interactive with humans. Smart gadgets like smart speakers and home assistants use text generation in some form.

Use cases of Text-Generation

Search engines
Chatbots
Text summarization
Question answering

Why RNN isn’t Practical for Text Generation?

RNN has a big problem with vanishing and exploding gradients. Hence, RNN can’t hold longer sequential information, and in the text-generation task we particularly need a model that can memorize a long sequence of data. For this purpose, LSTM came into the picture.
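To get a feel for why gradients vanish or explode, here is a minimal sketch (not from the original walkthrough) that scales a gradient by the same factor at every time step. The factors 0.5 and 1.5 and the helper name backprop_scale are illustrative assumptions; in a real RNN the factor depends on the recurrent weights and the derivative of the activation function.

```python
def backprop_scale(per_step_factor, num_steps):
    """Return how much a gradient is scaled after flowing back num_steps steps."""
    scale = 1.0
    for _ in range(num_steps):
        scale *= per_step_factor
    return scale

# Assumed per-step factors, chosen only to show the two failure modes.
vanishing = backprop_scale(0.5, 50)   # factor < 1: the gradient shrinks towards 0
exploding = backprop_scale(1.5, 50)   # factor > 1: the gradient blows up
print(f"after 50 steps: {vanishing:.3e} (vanishing), {exploding:.3e} (exploding)")
```

With a shrinking factor, the earliest words of a long sequence contribute almost nothing to the weight update, which is why a plain RNN struggles to remember them.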

LSTM (Long Short Term Memory)

As we know, RNN can’t hold or memorize sequential data for long and begins to forget previous inputs as new input comes in. To fix this problem, LSTM is designed with various gates.

LSTM solves the problem of short-term memory by using different types of gates.

When a new input comes into an RNN, it modifies the existing information without deciding whether the incoming input is important or not, whereas in an LSTM, gates allow only important inputs to modify the existing information.

In an LSTM, gates decide what data is ignored and what is carried forward during training. There are 3 gates in LSTM:

Input Gate
Output Gate
Forget Gate

Forget Gate

This gate is responsible for selecting relevant information and discarding irrelevant information. After the relevant information is selected, it is passed on to the input gate.

First, the current input and the previous hidden state are passed through an activation function, here the sigmoid. The sigmoid activation function returns a value between 0 and 1. A value closer to 0 means the information should be ignored; otherwise it should be passed on to the input gate.
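The gating idea can be sketched in a few lines. This is a simplified illustration, assuming the gate's pre-activation scores are already given; in a real LSTM they are computed from learned weights applied to the current input and the previous hidden state. The names forget_gate, old_cell and scores are made up for this example.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forget_gate(cell_state, gate_scores):
    """Scale each cell-state entry by sigmoid(score): near 0 forgets, near 1 keeps."""
    return [c * sigmoid(s) for c, s in zip(cell_state, gate_scores)]

old_cell = [2.0, -1.0, 0.5]
scores = [6.0, -6.0, 0.0]   # hypothetical pre-activation values
new_cell = forget_gate(old_cell, scores)
print(new_cell)
```

The first entry is kept almost unchanged, the second is almost erased, and the third is halved, because sigmoid(0) is exactly 0.5.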

Input Gate

This gate is responsible for adding information to the model. The tanh activation function creates an array of candidate values ranging from -1 to 1, and a sigmoid function filters this array, deciding which information should be added to the model and which should be discarded.

Output Gate

The output gate is responsible for generating the next hidden state, which is carried over to the next time step along with the cell state. It creates the hidden state by applying the tanh activation function to the cell state, so its values range from -1 to 1, and a sigmoid decides which of those values are exposed.
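Putting the three gates together, one LSTM time step can be sketched in NumPy as below. This is a teaching sketch under stated assumptions, not the Keras implementation: the function name lstm_step, the single stacked weight matrix W, the row order (forget, input, candidate, output), and the random toy weights are all choices made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step.

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,).
    Rows are stacked in the (assumed) order: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:hidden])                 # forget gate: what to keep from c_prev
    i = sigmoid(z[hidden:2 * hidden])        # input gate: how much new info to add
    g = np.tanh(z[2 * hidden:3 * hidden])    # candidate values in (-1, 1)
    o = sigmoid(z[3 * hidden:4 * hidden])    # output gate: what to expose
    c = f * c_prev + i * g                   # new cell state
    h = o * np.tanh(c)                       # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):   # a toy sequence of 5 inputs
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Note that the cell state is updated additively (f * c_prev + i * g), which is what lets information survive over many steps.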

The Idea of Text Generation

Text generation is nothing but a continuous series of next-word predictions. As we already know, text data is a sequence of words, and using these sequences we can predict the next word.
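Before bringing in an LSTM, the loop of repeated next-word prediction can be illustrated with a much simpler stand-in. The sketch below uses word-pair counts instead of a neural network; the toy corpus and the generate helper are made up for this example and are not part of the implementation that follows.

```python
from collections import Counter, defaultdict

toy_corpus = ["the cat sat on the mat", "the cat ate the fish"]  # made-up data

# Count which word follows which.
following = defaultdict(Counter)
for sentence in toy_corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1

def generate(seed, num_words):
    """Repeatedly append the most frequent follower of the last word."""
    words = seed.split()
    for _ in range(num_words):
        candidates = following.get(words[-1])
        if not candidates:          # no known continuation for this word
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

print(generate("the", 3))
```

An LSTM plays the same role as the lookup table here, except that it conditions on the whole preceding sequence rather than only the last word.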

Implementing Text Generation

The steps for text generation are listed below:

Load the necessary libraries
Load the textual data
Perform text-cleaning if needed
Data preparation for training
Define and train the LSTM model
Prediction

Loading necessary libraries

libraries for data handling

import pandas as pd
import numpy as np
import string, os
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action='ignore', category=FutureWarning)

loading deep learning libraries

# set seeds for reproducibility
import tensorflow as tf
from numpy.random import seed
# TensorFlow 2.x API; in TensorFlow 1.x this was tensorflow.set_random_seed(2)
tf.random.set_seed(2)
seed(1)
# keras modules for building the LSTM
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku

Loading the Dataset

We will use the New York Times Comments dataset available on Kaggle. You can download the dataset, or you can create a new Kaggle notebook using this dataset.

The dataset contains various articles and comments. Our objective is to load the headlines of all the articles and merge them into a list.

# Load the headlines into a list
curr_dir = '../input/'
all_headlines = []
for filename in os.listdir(curr_dir):
    if 'Articles' in filename:
        article_df = pd.read_csv(curr_dir + filename)
        all_headlines.extend(list(article_df.headline.values))
        break

all_headlines = [line for line in all_headlines if line!= "Unknown"]
print(all_headlines[:10])

We have a total of 829 headlines and we will use these headlines to generate text.

Dataset Preparation

For dataset preparation, our first task is to clean the text data, which includes removing punctuation, lowercasing words, etc.

Data Cleaning

We define a function that takes a single headline at a time and returns the cleaned headline. We then iterate over all headlines to build a list that forms the cleaned corpus.

def clean_text(txt):
    txt = "".join(t for t in txt if t not in string.punctuation).lower()
    txt = txt.encode("utf8").decode("ascii",'ignore')
    return txt

corpus = [clean_text(x) for x in all_headlines]
print(corpus[:10])
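To see what the cleaning step does to a single headline, here is a small standalone check. The sample headline is made up for illustration; the function body is the same cleaning logic as above, repeated so the snippet runs on its own.

```python
import string

def clean_text(txt):
    # Drop ASCII punctuation and lowercase the text.
    txt = "".join(t for t in txt if t not in string.punctuation).lower()
    # Drop every non-ASCII character (curly quotes, accented letters, ...).
    txt = txt.encode("utf8").decode("ascii", "ignore")
    return txt

# Made-up headline containing a curly apostrophe (\u2019) and an accented e (\u00e9).
sample = "Trump\u2019s Caf\u00e9 Visit: What\u2019s Next?"
cleaned = clean_text(sample)
print(cleaned)  # trumps caf visit whats next
```

Note the side effect: accented letters are removed rather than converted, so "Café" becomes "caf". That is acceptable for this walkthrough, but worth knowing if your corpus contains many non-ASCII words.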

Generating n-gram Sequence for training

In NLP, a language model requires sequential input data, and each input word/token must be numerical. Here we generate n-grams in order to train our model for next-word prediction.

tokenizer = Tokenizer()
def get_sequence_of_tokens(corpus):
    ## tokenization
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    ## convert data to a token sequence 
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
print(inp_sequences[:10])

As you can see, inp_sequences holds the n-gram sequences required for training next-word prediction. We had 829 headlines, and using the n-gram concept we now have 4544 rows.

In inp_sequences, every step adds one more token to the seed sequence used for training.
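The prefix-building loop is easier to follow on a tiny example. The sketch below mirrors the inner loop of get_sequence_of_tokens without Keras; the sentence, the hand-made word ids, and the helper name prefix_sequences are assumptions for illustration only.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, as the inner loop of get_sequence_of_tokens does."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

words = "the quick brown fox".split()      # made-up example sentence
word_index = {w: idx for idx, w in enumerate(words, start=1)}  # hypothetical token ids
tokens = [word_index[w] for w in words]    # [1, 2, 3, 4]

# The last token of each prefix is the word to predict from the tokens before it.
for seq in prefix_sequences(tokens):
    print(seq[:-1], "->", seq[-1])
```

A headline with n tokens therefore contributes n - 1 training rows, which is how a few hundred headlines turn into several thousand sequences.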

Padding the Sequences

The sequences in inp_sequences have variable lengths, which is not suitable for training. Using padding, we make every sequence the same length.

def generate_padded_sequences(input_sequences):
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)

predictors: the tokens that will be used as input for predicting the next word.
label: the next word to be predicted.
max_sequence_len: the sequence length.
pad_sequences: provided by Keras, used to pad an array of tokens to a given length.
In this case, max_sequence_len is 17.
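The effect of pre-padding and of splitting each row into predictors and label can be shown without Keras. This is a plain-Python sketch; pad_pre is a hypothetical helper that only imitates what padding='pre' does for sequences shorter than the target length.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value up to maxlen (the padding='pre' behaviour for short sequences)."""
    return [value] * (maxlen - len(seq)) + list(seq)

sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]   # toy n-gram sequences
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

predictors = [row[:-1] for row in padded]   # everything except the last token
labels = [row[-1] for row in padded]        # the last token is the prediction target

for x, y in zip(predictors, labels):
    print(x, "->", y)
```

Padding on the left keeps the most recent words next to the label, so the last column of every row is always the word to predict.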

Model Creation

So far we have prepared the data for training. In this step, we will create an LSTM model that takes predictors as input X and labels as target y.

A quick reminder on layers in Keras:

Input Layer: This is responsible for taking the input sequence. In our model it is an Embedding layer that maps each token id to a 10-dimensional vector.
LSTM Layer: It computes the output using LSTM units and passes its final hidden state to the next layer. In our case we have added 100 units in the layer, which can be tuned later.
Dropout Layer: This layer is responsible for regularisation, which means it helps prevent over-fitting. This is done by randomly turning off some of the activations coming out of the LSTM layer during training.
Output Layer: This computes the probability of each word in the vocabulary being the next word.

def create_model(max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    # ----------Add Input Embedding Layer
    model.add(Embedding(total_words, 10, input_length=input_len))
    # ----------Add Hidden Layer 1 - LSTM Layer
    model.add(LSTM(100))
    model.add(Dropout(0.1))
    # ----------Add Output Layer
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

model = create_model(max_sequence_len, total_words)
model.summary()
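The parameter counts printed by model.summary() can be checked by hand. The sketch below reproduces the arithmetic for the layer sizes used above (embedding size 10, 100 LSTM units); the vocabulary size of 2000 is a hypothetical value, since the actual total_words depends on the dataset.

```python
def count_params(total_words, embed_dim=10, lstm_units=100):
    """Trainable parameters per layer for Embedding -> LSTM -> Dense(softmax)."""
    embedding = total_words * embed_dim
    # Four blocks (forget, input, candidate, output), each with
    # input weights, recurrent weights and a bias vector.
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    dense = (lstm_units + 1) * total_words   # weights plus one bias per output word
    return embedding, lstm, dense

vocab = 2000   # hypothetical vocabulary size, not taken from the dataset
emb, lstm, dense = count_params(vocab)
print(emb, lstm, dense, emb + lstm + dense)
```

The LSTM layer's size does not depend on the vocabulary, while the embedding and output layers grow linearly with it, so the output layer usually dominates for large vocabularies.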

Training the model

After building the model architecture, we can train the model using our predictors (X_train) and label (y_train). 100 epochs should be enough.

model.fit(predictors, label, epochs=100, verbose=5)

Text Generation (Prediction)

Awesome!!

We have trained our model architecture and now it’s ready to generate text. We need to write a function to predict the next word based on the input words. We also have to tokenize the sequence and pad it with the same sequence_length we provided for training, and then we will append each predicted word as a string.

def generate_text(seed_text, next_words, model, max_sequence_len):
    for _ in range(next_words):
        # tokenize the current text and pad it to the training input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes was removed in newer TensorFlow/Keras versions;
        # the argmax of the predicted probabilities gives the same token id
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
        
        # map the predicted token id back to its word
        output_word = ""
        for word,index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " "+output_word
    return seed_text.title()

seed_text: the initial words that will be passed for text generation.
predict_classes: in older Keras versions this method returned the token id for the predicted word directly; the code above takes np.argmax over the output of model.predict instead, which gives the same id.
predicted: the token id of the predicted word, which is converted back into a word using the dictionary tokenizer.word_index.items().
next_words: the number of next words we want to be predicted.
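Scanning the whole dictionary for every predicted word works, but a reversed dictionary makes the lookup direct. The sketch below uses a made-up stand-in for tokenizer.word_index; id_to_word is a hypothetical helper. Keras's Tokenizer also exposes an index_word dictionary after fitting, which may make building your own unnecessary.

```python
word_index = {"the": 1, "cat": 2, "sat": 3}     # made-up stand-in for tokenizer.word_index

# Build the reverse mapping once: token id -> word.
index_word = {index: word for word, index in word_index.items()}

def id_to_word(token_id):
    """Return the word for token_id, or an empty string for unknown ids (e.g. padding id 0)."""
    return index_word.get(token_id, "")

print(id_to_word(2), repr(id_to_word(0)))
```

Returning an empty string for unknown ids matches the behaviour of the loop above, where output_word stays empty if no index matches.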

Prediction

Calling generate_text will generate text. The generate_text function takes the initial words, the number of words to be predicted, the model, and the sequence length.

print (generate_text("india and pakistan", 3, model, max_sequence_len))
print (generate_text("president trump", 3, model, max_sequence_len))
print (generate_text("united states", 4, model, max_sequence_len))
print (generate_text("donald trump", 2, model, max_sequence_len))
print (generate_text("new york", 3, model, max_sequence_len))
print (generate_text("science and technology", 5, model, max_sequence_len))

Conclusion

In this article, we discussed the LSTM model and its architecture, then the idea of text generation, and finally implemented text generation using the LSTM model.

Our trained model worked well, but you can improve it by:

Adding more data to train on
Fine-tuning the model architecture, i.e. the number of units, layers, etc.
Fine-tuning parameters such as epochs, learning rate, activation function, etc.

Thanks for Reading !!

Feel free to reach me on LinkedIn if you have any suggestions or questions for me.

References

https://iq.opengenus.org/text-generation-lstm/
https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Abhishek Jaiswal

A data enthusiast exploring the leading technologies related to the data
