Understanding Word Embeddings and Building your First RNN Model

Avikumar Talaviya

This article was published as a part of the Data Science Blogathon.

Introduction

Deep learning has been one of the hottest fields of the past decade, with applications across industry and research. However, although it is easy to get started with the topic, many people are confused by the terminology and end up with neural network models that do not match their expectations. In this article, I will explain what recurrent neural networks (RNNs) and word embeddings are, and then walk through a step-by-step guide to building your first RNN model for text classification tasks.


RNNs are one of the most important concepts in machine learning. They’re used across a wide range of problems, including text classification, language detection, translation tasks, author identification, and question answering, to name a few.

Let's dive deep into RNNs with a step-by-step guide to building your first RNN model for text data.

Deep Learning for Text Data

Deep learning for natural-language processing is pattern recognition applied to text, words, and paragraphs, in much the same way that computer vision is pattern recognition applied to pixels. Deep-learning models map the statistical structure of text data, which is sufficient to solve many simple textual tasks. Unlike some other models, however, deep-learning models don't take raw text as input: they only work with numeric tensors.

Three techniques are used to vectorize the text data:

  • Segment text into words and transform each word into a vector
  • Segment text into characters and transform each character into a vector
  • Extract n-grams of words and transform each n-gram into a vector (a short sketch of all three follows below)
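For a concrete picture of these three strategies, here is a minimal sketch in plain Python (the sample sentence is illustrative, not from the original project):

text = "deep learning maps text to tensors"

# 1. word-level tokens
words = text.split()
print(words[:3])  # ['deep', 'learning', 'maps']

# 2. character-level tokens
chars = list(text.replace(" ", ""))
print(chars[:5])  # ['d', 'e', 'e', 'p', 'l']

# 3. word n-grams (here, bigrams)
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
print(bigrams[:2])  # ['deep learning', 'learning maps']

Each resulting token (word, character, or n-gram) is then mapped to a vector, for example with one-hot encoding or an embedding layer.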

If you want to build a text model, the first thing you need to do is convert the text into vectors. There are many ways to do this, depending on which model you use and on your time and resource constraints.

Keras has a built-in method for converting text into vectors (the word-embedding layer), which we will use in this article.

Here is a visual depiction of a deep neural network model for NLP tasks:

[Figure: deep neural network model for NLP tasks]

Understanding Word Embeddings

A word embedding is a learned representation for text where words that have the same meaning have a similar representation.

Courtesy: Machinelearningmastery.com

  • This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging NLP problems.
  • Word embeddings are an alternative to one-hot encoding that also provides dimensionality reduction.

One-hot word vectors — Sparse, High-dimensional and Hard-coded

Word embeddings — Dense, Lower-Dimensional and Learned from the data
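To make the contrast concrete, here is a small sketch comparing the two representations (the vocabulary size and dimensions are illustrative):

import numpy as np
import tensorflow as tf

vocab_size = 10000                    # width of a one-hot vector
token_ids = np.array([[4, 20, 351]])  # a toy sequence of three word indices

# one-hot: sparse, high-dimensional, hard-coded
one_hot = tf.one_hot(token_ids, depth=vocab_size)
print(one_hot.shape)  # (1, 3, 10000)

# embedding: dense, lower-dimensional, learned from the data
embedding_layer = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)
dense_vectors = embedding_layer(token_ids)
print(dense_vectors.shape)  # (1, 3, 8)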

  • The Keras library has an Embedding layer that learns word representations for a given text corpus:

tf.keras.layers.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs)

Key Arguments:

  1. input_dim — the size of the vocabulary (the length of the word index)
  2. output_dim — the dimension of the learned word representation
  3. input_length — the maximum input sequence length of the document (see the usage sketch below)
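Here is a minimal usage sketch showing how these three arguments fit together (the specific values are illustrative, not from the project in this article):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding

model = Sequential()
# vocabulary of 5,000 words, 64-dimensional embeddings,
# padded input sequences of exactly 40 tokens
model.add(Embedding(input_dim=5000, output_dim=64, input_length=40))
model.build(input_shape=(None, 40))
model.summary()  # embedding output shape: (None, 40, 64)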

Here is a visual depiction of a word embedding, also known as a word2vec representation:

[Figure: word2vec representation of word embeddings (source: medium.com/deepleaningdemystified)]

Recurrent Neural Network (RNN): Demystified

  • A major difference between densely connected neural networks and recurrent neural networks is that fully connected networks have no memory of previous inputs in the units of each layer, whereas recurrent neural networks store the state of the previous timestep of the sequence while assigning weights to the current input.
  • In RNNs, we process inputs word by word (or eye saccade by eye saccade) while keeping memories of what came before in each cell. This gives a fluid representation of sequences and allows the neural network to capture the context of the sequence rather than an absolute representation of isolated words.

“A recurrent neural network processes sequences by iterating through the sequence elements and maintaining a state containing information relative to what it has seen so far. In effect, an RNN is a type of neural network that has an internal loop.”

Courtesy: Section 6.2, Understanding recurrent neural networks, Deep Learning with Python by François Chollet
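To make that internal loop concrete, here is a minimal NumPy sketch of a forward pass through a simple RNN, similar to the pseudocode walk-through in Chollet's book (all shapes and values are illustrative):

import numpy as np

timesteps, input_dim, output_dim = 20, 32, 64

inputs = np.random.random((timesteps, input_dim))  # one sequence of input vectors
state_t = np.zeros((output_dim,))                  # initial state: all zeros

# these weights would normally be learned during training
W = np.random.random((output_dim, input_dim))
U = np.random.random((output_dim, output_dim))
b = np.random.random((output_dim,))

outputs = []
for input_t in inputs:  # the internal loop over timesteps
    # the current output depends on the current input AND the previous state
    output_t = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
    outputs.append(output_t)
    state_t = output_t  # the output becomes the state for the next step

final_output = np.stack(outputs, axis=0)  # shape: (timesteps, output_dim)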

Below is a visual depiction of how a recurrent neural network learns the context of words around a target word:

[Figure: how an RNN learns the context of words around a target word (source: machineab.blogspot.com)]

Here is a simple depiction of the RNN architecture, shown both rolled and unrolled:

[Figure: rolled and unrolled RNN architecture (source: ibm.com)]

Building your First RNN Model for Text Classification Tasks

Now we will walk through a step-by-step guide to building your first RNN model for a text classification task: classifying news items by category based on their headlines and descriptions.

So let’s get started:

Step 1: Load the dataset using the pandas 'read_json()' method, as the dataset is in JSON file format.

import pandas as pd

df = pd.read_json('../input/news-category-dataset/News_Category_Dataset_v2.json', lines=True)

Step 2: Pre-process the dataset by combining the 'headline' and 'short_description' columns.
Python Code:

import pandas as pd

df = pd.read_json('News_Category_Dataset_v2.json', lines=True)
# create the final dataframe with headline and short_description combined
# (despite its name, 'length_of_news' holds the combined text itself)
final_df = df.copy()
final_df['length_of_news'] = final_df['headline'] + ' ' + final_df['short_description']
final_df.drop(['headline', 'short_description'], inplace=True, axis=1)
final_df['len_news'] = final_df['length_of_news'].map(len)  # character count of each item
print(final_df.head())


Step 3: Clean the text data so we can move on to tokenization and vectorization of the text inputs before feeding them to the RNN model.

# clean the text data using regex and a data-cleaning function
import re
from nltk.stem import WordNetLemmatizer
from wordcloud import STOPWORDS  # assumed source of the STOPWORDS set; nltk's stop-word list works too

lemmatizer = WordNetLemmatizer()

def datacleaning(text):
    whitespace = re.compile(r"\s+")
    user = re.compile(r"(?i)@[a-z0-9_]+")
    text = whitespace.sub(' ', text)
    text = user.sub('', text)
    text = re.sub(r"\[[^()]*\]", "", text)   # drop bracketed fragments
    text = re.sub(r"\d+", "", text)          # drop digits
    text = re.sub(r'[^\w\s]', '', text)      # drop punctuation
    text = re.sub(r"(?:@\S*|#\S*|http(?=.*://)\S*)", "", text)  # drop mentions, hashtags, URLs
    text = text.lower()

    # removing stop-words
    text = [word for word in text.split() if word not in STOPWORDS]

    # word lemmatization
    sentence = []
    for word in text:
        sentence.append(lemmatizer.lemmatize(word, 'v'))

    return ' '.join(sentence)
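Assuming the NLTK WordNet data has been downloaded, the cleaning function can then be applied to the combined text column. This is a usage sketch; the 'map' call is an assumption about how the article applies the function:

import nltk
nltk.download('wordnet')  # required once for the WordNet lemmatizer

final_df['length_of_news'] = final_df['length_of_news'].map(datacleaning)
print(final_df['length_of_news'].head())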

Step 4: Tokenize and vectorize the text data to create a word index of the sentences, and split the dataset into train and test sets.

# label-encode the target, then tokenize and pad the text using the Keras tokenizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

X = final_df['length_of_news']
encoder = LabelEncoder()
y = encoder.fit_transform(final_df['category'])
print("shape of input data: ", X.shape)
print("shape of target variable: ", y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

maxlen = 130  # max length of each input sequence
tokenizer = Tokenizer(num_words=100000, oov_token='<OOV>')  # '<OOV>' marks out-of-vocabulary words
tokenizer.fit_on_texts(X_train)  # build the word index

# padding X_train text input data
train_seq = tokenizer.texts_to_sequences(X_train)  # converts strings into lists of integers
train_padseq = pad_sequences(train_seq, maxlen=maxlen)  # pads the integer lists into a 2D integer tensor

# padding X_test text input data
test_seq = tokenizer.texts_to_sequences(X_test)
test_padseq = pad_sequences(test_seq, maxlen=maxlen)

word_index = tokenizer.word_index
total_words = len(word_index)
y_train = to_categorical(y_train, num_classes=41)  # the dataset has 41 news categories
y_test = to_categorical(y_test, num_classes=41)
print("Length of word index:", total_words)
print("Length of word index:", total_words)
----------------------------[output]--------------------------------
shape of input data:  (184853,)
shape of target variable:  (184853,)
Length of word index: 174991
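As a quick sanity check (not part of the original steps), you can map a padded sequence back to words using the tokenizer's reverse index:

# map the first training sequence back to words (0 is the padding index)
index_word = tokenizer.index_word
decoded = [index_word.get(i, '') for i in train_padseq[0] if i != 0]
print(decoded)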

Step 5: Now that we have the 'train' and 'test' data prepared, we can build an RNN model using the 'Embedding()' and 'SimpleRNN()' layers of the Keras library.

# baseline model using an embedding layer and SimpleRNN layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Bidirectional, Dropout, Dense

model = Sequential()
model.add(Embedding(total_words, 70, input_length=maxlen))
model.add(Bidirectional(SimpleRNN(64, dropout=0.1, recurrent_dropout=0.20, activation='tanh', return_sequences=True)))
model.add(Bidirectional(SimpleRNN(64, dropout=0.1, recurrent_dropout=0.30, activation='tanh', return_sequences=True)))
model.add(SimpleRNN(32, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(41, activation='softmax'))
model.summary()
----------------------------[output]--------------------------------
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 130, 70)           12249370  
_________________________________________________________________
bidirectional (Bidirectional (None, 130, 128)          17280     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 130, 128)          24704     
_________________________________________________________________
simple_rnn_2 (SimpleRNN)     (None, 32)                5152      
_________________________________________________________________
dropout (Dropout)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 41)                1353      
=================================================================
Total params: 12,297,859
Trainable params: 12,297,859
Non-trainable params: 0
_________________________________________________________________
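As a sanity check, the parameter counts in the summary can be reproduced by hand. The embedding layer holds 174,991 × 70 = 12,249,370 weights (one 70-dimensional vector per word in the index). Each direction of the first bidirectional SimpleRNN has 70 × 64 input weights, 64 × 64 recurrent weights, and 64 biases, i.e. 8,640 parameters, so the layer totals 2 × 8,640 = 17,280. The second bidirectional layer receives 128-dimensional input (64 units per direction), giving 2 × (128 × 64 + 64 × 64 + 64) = 24,704. The final SimpleRNN contributes 128 × 32 + 32 × 32 + 32 = 5,152, and the dense output layer contributes 32 × 41 + 41 = 1,353.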

Step 6: Compile the model with the 'rmsprop' optimizer and 'accuracy' as the validation metric, then fit the model to the padded training sequences and 'y_train' data. You can evaluate the model using the 'model.evaluate()' method on the test data. Congrats! You have just built your first model using word embeddings and RNN layers.

from tensorflow import keras
from tensorflow.keras.callbacks import ModelCheckpoint

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy']
              )
# set up an early-stopping callback and a model-checkpoint callback
earlystopping = keras.callbacks.EarlyStopping(monitor='val_loss',
                                              patience=5,
                                              verbose=1,
                                              mode='min'
                                              )
checkpointer = ModelCheckpoint(filepath='bestvalue', monitor='val_loss', verbose=0, save_best_only=True)
callback_list = [checkpointer, earlystopping]

# fit the model to the data, passing the callbacks so they take effect
history = model.fit(train_padseq, y_train,
                    batch_size=128,
                    epochs=15,
                    validation_split=0.2,
                    callbacks=callback_list
                    )

# evaluate the model
test_loss, test_acc = model.evaluate(test_padseq, y_test, verbose=0)
print("test loss and accuracy:", test_loss, test_acc)

Conclusion

With the advent of deep-learning methods and techniques, NLP tasks have been approached using RNNs and 1-D CNNs (extending classification from character sequences to sequences of words). This article briefly described the basic idea of RNNs (recurrent neural networks) and word embeddings, and their implementation in Python. The code for this model can easily be reused to build other, more complex networks.

Key takeaways:

  1. Word embeddings are representations of word tokens that can be trained along with a model to find the optimal weights for the task at hand.
  2. Recurrent neural networks are widely used for text classification tasks and can be implemented using the Keras library in Python.
  3. Using this step-by-step guide, you can build an RNN model for any text classification problem.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I specialize in data science and machine learning, with hands-on experience working on various end-to-end data science projects. I am the co-lead of the Mumbai local chapter of Omdena. I am also a Kaggle Master and an educator ambassador at Streamlit, working with volunteers around the world.

