A Must-Read NLP Tutorial on Neural Machine Translation – The Technique Powering Google Translate

Prateek joshi Last Updated : 16 Oct, 2024

10 min read

Introduction

“If you talk to a man in a language he understands, that goes to his head. If you talk to him in his own language, that goes to his heart.” – Nelson Mandela

The beauty of language transcends boundaries and cultures. Learning a language other than our mother tongue is a huge advantage. But the path to bilingualism, or multilingualism, can often be a long, never-ending one.

There are so many little nuances that we get lost in the sea of words. Things have, however, become so much easier with online translation services (I’m looking at you Google Translate!).

I have always wanted to learn a language other than English. I tried my hand at learning German (or Deutsch), back in 2014. It was both fun and challenging. I had to eventually quit but I harboured a desire to start again.

Fast-forward to 2019, I am fortunate to be able to build a language translator for any possible pair of languages. What a boon Natural Language Processing has been!

In this article, we will walk through the steps of building a German-to-English language translation model using Keras. We’ll also take a quick look at the history of machine translation systems with the benefit of hindsight.

This article assumes familiarity with RNN, LSTM, and Keras. Below are a couple of articles to read more about them:

Machine Translation – A Brief History
Understanding the Problem Statement
Introduction to Sequence-to-Sequence Prediction
Implementation in Python using Keras

Machine Translation – A Brief History

Most of us were introduced to machine translation when Google came up with the service. But the concept has been around since the middle of last century.

Research work in Machine Translation (MT) started as early as 1950’s, primarily in the United States. These early systems relied on huge bilingual dictionaries, hand-coded rules, and universal principles underlying natural language.

In 1954, IBM held a first ever public demonstration of a machine translation. The system had a pretty small vocabulary of only 250 words and it could translate only 49 hand-picked Russian sentences to English. The number seems minuscule now but the system is widely regarded as an important milestone in the progress of machine translation.

This image has been taken from the research paper describing IBM’s system

Soon, two schools of thought emerged:

Empirical trial-and-error approaches, using statistical methods, and
Theoretical approaches involving fundamental linguistic research

In 1964, the Automatic Language Processing Advisory Committee (ALPAC) was established by the United States government to evaluate the progress in Machine Translation. ALPAC did a little prodding around and published a report in November 1966 on the state of MT. Below are the key highlights from that report:

It raised serious questions on the feasibility of machine translation and termed it hopeless
Funding was discouraged for MT research
It was quite a depressing report for the researchers working in this field
Most of them left the field and started new careers

Not exactly a glowing recommendation!

A long dry period followed this miserable report. Finally, in 1981, a new system called the METEO System was deployed in Canada for translation of weather forecasts issued in French into English. It was quite a successful project which stayed in operation until 2001.

The world’s first web translation tool, Babel Fish, was launched by the AltaVista search engine in 1997.

And then came the breakthrough we are all familiar with now – Google Translate. It has since changed the way we work (and even learn) with different languages.

Understanding the Problem Statement

Let’s circle back to where we left off in the introduction section, i.e., learning German. However, this time around I am going to make my machine do this task. The objective is to convert a German sentence to its English counterpart using a Neural Machine Translation (NMT) system.

We will use German-English sentence pairs data from http://www.manythings.org/anki/. You can download it from here.

Introduction to Sequence-to-Sequence (Seq2Seq) Modeling

Sequence-to-Sequence (seq2seq) models are used for a variety of NLP tasks, such as text summarization, speech recognition, DNA sequence modeling, among others. Our aim is to translate given sentences from one language to another.

Here, both the input and output are sentences. In other words, these sentences are a sequence of words going in and out of a model. This is the basic idea of Sequence-to-Sequence modeling. The figure below tries to explain this method.

Sequence to sequence model block diagram

A typical seq2seq model has 2 major components –

a) an encoder
b) a decoder

Both these parts are essentially two different recurrent neural network (RNN) models combined into one giant network:

I’ve listed a few significant use cases of Sequence-to-Sequence modeling below (apart from Machine Translation, of course):

Speech Recognition
Name Entity/Subject Extraction to identify the main subject from a body of text
Relation Classification to tag relationships between various entities tagged in the above step
Chatbot skills to have conversational ability and engage with customers
Text Summarization to generate a concise summary of a large amount of text
Question Answering systems

Implementation in Python using Keras

It’s time to get our hands dirty! There is no better feeling than learning a topic by seeing the results first-hand. We’ll fire up our favorite Python environment (Jupyter Notebook for me) and get straight down to business.

Import the Required Libraries

import string
import re
from numpy import array, argmax, random, take
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding, RepeatVector
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.sequence import pad_sequences
from keras.models import load_model
from keras import optimizers
import matplotlib.pyplot as plt
%matplotlib inline
pd.set_option('display.max_colwidth', 200)

Read the Data into our IDE

Our data is a text file (.txt) of English-German sentence pairs. First, we will read the file using the function defined below.
Python Code:

import re
import pandas as pd
from numpy import array

# function to read raw text file
def read_text(filename):
        # open the file
        file = open(filename, mode='rt', encoding='utf-8')
        
        # read all text
        text = file.read()
        file.close()
        return text
# Let’s define another function to split the text into English-German pairs separated by ‘\n’. We’ll then split these pairs into English sentences and German sentences respectively.
# split a text into sentences

def to_lines(text):
      sents = text.strip().split('\n')
      sents = [i.split('\t') for i in sents]
      return sents

#We can now use these functions to read the text into an array in our desired format.

data = read_text("deu.txt")
deu_eng = to_lines(data)
#deu_eng = array(deu_eng)

print(data[0:175])

We can now use these functions to read the text into an array in our desired format.

data = read_text("deu.txt")
deu_eng = to_lines(data)
deu_eng = array(deu_eng)

The actual data contains over 150,000 sentence-pairs. However, we will use only the first 50,000 sentence pairs to reduce the training time of the model. You can change this number as per your system’s computation power (or if you’re feeling lucky!).

deu_eng = deu_eng[:50000,:]

Text Pre-Processing

Quite an important step in any project, especially so in NLP. The data we work with is more often than not unstructured so there are certain things we need to take care of before jumping to the model building part.

(a) Text Cleaning

Let’s first take a look at our data. This will help us decide which pre-processing steps to adopt.

deu_eng

array([['Hi.', 'Hallo!'],
     ['Hi.', 'Grüß Gott!'],
     ['Run!', 'Lauf!'],
     ...,
     ['Mary has very long hair.', 'Maria hat sehr langes Haar.'],
     ["Mary is Tom's secretary.", 'Maria ist Toms Sekretärin.'],
     ['Mary is a married woman.', 'Maria ist eine verheiratete Frau.']],
     dtype='<U380')

We will get rid of the punctuation marks and then convert all the text to lower case.

# Remove punctuation
deu_eng[:,0] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,0]]
deu_eng[:,1] = [s.translate(str.maketrans('', '', string.punctuation)) for s in deu_eng[:,1]]

deu_eng

array([['Hi', 'Hallo'],
     ['Hi', 'Grüß Gott'],
     ['Run', 'Lauf'],
     ...,
     ['Mary has very long hair', 'Maria hat sehr langes Haar'],
     ['Mary is Toms secretary', 'Maria ist Toms Sekretärin'],
     ['Mary is a married woman', 'Maria ist eine verheiratete Frau']],
     dtype='<U380')

# convert text to lowercase
for i in range(len(deu_eng)):
    deu_eng[i,0] = deu_eng[i,0].lower()
    deu_eng[i,1] = deu_eng[i,1].lower()

deu_eng

array([['hi', 'hallo'],
     ['hi', 'grüß gott'],
     ['run', 'lauf'],
     ...,
     ['mary has very long hair', 'maria hat sehr langes haar'],
     ['mary is toms secretary', 'maria ist toms sekretärin'],
     ['mary is a married woman', 'maria ist eine verheiratete frau']],
     dtype='<U380')

(b) Text to Sequence Conversion

A Seq2Seq model requires that we convert both the input and the output sentences into integer sequences of fixed length.

But before we do that, let’s visualise the length of the sentences. We will capture the lengths of all the sentences in two separate lists for English and German, respectively.

# empty lists
eng_l = []
deu_l = []

# populate the lists with sentence lengths
for i in deu_eng[:,0]:
      eng_l.append(len(i.split()))

for i in deu_eng[:,1]:
      deu_l.append(len(i.split()))

length_df = pd.DataFrame({'eng':eng_l, 'deu':deu_l})

length_df.hist(bins = 30)
plt.show()

Quite intuitive – the maximum length of the German sentences is 11 and that of the English phrases is 8.

Next, vectorize our text data by using Keras’s Tokenizer() class. It will turn our sentences into sequences of integers. We can then pad those sequences with zeros to make all the sequences of the same length.

Note that we will prepare tokenizers for both the German and English sentences:

# function to build a tokenizer
def tokenization(lines):
      tokenizer = Tokenizer()
      tokenizer.fit_on_texts(lines)
      return tokenizer

# prepare english tokenizer
eng_tokenizer = tokenization(deu_eng[:, 0])
eng_vocab_size = len(eng_tokenizer.word_index) + 1

eng_length = 8
print('English Vocabulary Size: %d' % eng_vocab_size)

English Vocabulary Size: 6453

# prepare Deutch tokenizer
deu_tokenizer = tokenization(deu_eng[:, 1])
deu_vocab_size = len(deu_tokenizer.word_index) + 1

deu_length = 8
print('Deutch Vocabulary Size: %d' % deu_vocab_size)

Deutch Vocabulary Size: 10998

The below code block contains a function to prepare the sequences. It will also perform sequence padding to a maximum sentence length as mentioned above.

# encode and pad sequences
def encode_sequences(tokenizer, length, lines):
         # integer encode sequences
         seq = tokenizer.texts_to_sequences(lines)
         # pad sequences with 0 values
         seq = pad_sequences(seq, maxlen=length, padding='post')
         return seq

Model Building

We will now split the data into train and test set for model training and evaluation, respectively.

from sklearn.model_selection import train_test_split

# split data into train and test set
train, test = train_test_split(deu_eng, test_size=0.2, random_state = 12)

It’s time to encode the sentences. We will encode German sentences as the input sequences and English sentences as the target sequences. This has to be done for both the train and test datasets.

# prepare training data
trainX = encode_sequences(deu_tokenizer, deu_length, train[:, 1])
trainY = encode_sequences(eng_tokenizer, eng_length, train[:, 0])

# prepare validation data
testX = encode_sequences(deu_tokenizer, deu_length, test[:, 1])
testY = encode_sequences(eng_tokenizer, eng_length, test[:, 0])

Now comes the exciting part!

We’ll start off by defining our Seq2Seq model architecture:

For the encoder, we will use an embedding layer and an LSTM layer
For the decoder, we will use another LSTM layer followed by a dense layer

Model Architecture

# build NMT model
def define_model(in_vocab,out_vocab, in_timesteps,out_timesteps,units):
      model = Sequential()
      model.add(Embedding(in_vocab, units, input_length=in_timesteps, mask_zero=True))
      model.add(LSTM(units))
      model.add(RepeatVector(out_timesteps))
      model.add(LSTM(units, return_sequences=True))
      model.add(Dense(out_vocab, activation='softmax'))
      return model

We are using the RMSprop optimizer in this model as it’s usually a good choice when working with recurrent neural networks.

# model compilation
model = define_model(deu_vocab_size, eng_vocab_size, deu_length, eng_length, 512)

rms = optimizers.RMSprop(lr=0.001)
model.compile(optimizer=rms, loss='sparse_categorical_crossentropy')

Please note that we have used ‘sparse_categorical_crossentropy‘ as the loss function. This is because the function allows us to use the target sequence as is, instead of the one-hot encoded format. One-hot encoding the target sequences using such a huge vocabulary might consume our system’s entire memory.

We are all set to start training our model!

We will train it for 30 epochs and with a batch size of 512 with a validation split of 20%. 80% of the data will be used for training the model and the rest for evaluating it. You may change and play around with these hyperparameters.

We will also use the ModelCheckpoint() function to save the model with the lowest validation loss. I personally prefer this method over early stopping.

filename = 'model.h1.24_jan_19'
checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

# train model
history = model.fit(trainX, trainY.reshape(trainY.shape[0], trainY.shape[1], 1),
                    epochs=30, batch_size=512, validation_split = 0.2,callbacks=[checkpoint], 
                    verbose=1)

Let’s compare the training loss and the validation loss.

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train','validation'])
plt.show()

As you can see in the above plot, the validation loss stopped decreasing after 20 epochs.

Finally, we can load the saved model and make predictions on the unseen data – testX.

model = load_model('model.h1.24_jan_19')
preds = model.predict_classes(testX.reshape((testX.shape[0],testX.shape[1])))

These predictions are sequences of integers. We need to convert these integers to their corresponding words. Let’s define a function to do this:

def get_word(n, tokenizer):
      for word, index in tokenizer.word_index.items():
          if index == n:
              return word
      return None

Convert predictions into text (English):

preds_text = []
for i in preds:
       temp = []
       for j in range(len(i)):
            t = get_word(i[j], eng_tokenizer)
            if j > 0:
                if (t == get_word(i[j-1], eng_tokenizer)) or (t == None):
                     temp.append('')
                else:
                     temp.append(t)
            else:
                   if(t == None):
                          temp.append('')
                   else:
                          temp.append(t) 

       preds_text.append(' '.join(temp))

Let’s put the original English sentences in the test dataset and the predicted sentences in a dataframe:

pred_df = pd.DataFrame({'actual' : test[:,0], 'predicted' : preds_text})

We can randomly print some actual vs predicted instances to see how our model performs:

# print 15 rows randomly
pred_df.sample(15)

Our Seq2Seq model does a decent job. But there are several instances where it misses out on understanding the key words. For example, it translates “im tired of boston” to “im am boston”.

These are the challenges you will face on a regular basis in NLP. But these aren’t immovable obstacles. We can mitigate such challenges by using more training data and building a better (or more complex) model.

You can access the full code from this Github repo.

End Notes

Even with a very simple Seq2Seq model, the results are pretty encouraging. We can improve on this performance easily by using a more sophisticated encoder-decoder model on a larger dataset.

Another experiment I can think of is trying out the seq2seq approach on a dataset containing longer sentences. The more you experiment, the more you’ll learn about this vast and complex space.

If you have any feedback on this article or have any doubts/questions, kindly share them in the comments section below.

Prateek joshi

Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Sayam Nandy

HI Prateek, It's a wonderful article.One request,can you show us the implementation in R?

Show 1 reply

Prateek Joshi

Thanks Sayam. I will try to implement it in R as well and share it with you all.

Harish Nagpal

Good one Prateek. Thanks! I am looking for models in life insurance analytics. Since you have experience in BFSI, did you develop any such model like lapsation, claims etc ! Do reply back.

Janmejay Singh

Hi, For some reason,the array function is not working properly.The function should return just an array while it is returning a list of array and shape is also not correct. Can you help me debugging it. By the way...Very good article.

Hi Janmejay, Which part of the code you are referring to?

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

A Must-Read NLP Tutorial on Neural Machine Translation – The Technique Powering Google Translate

Introduction

Table of Contents

Machine Translation – A Brief History

Understanding the Problem Statement

Introduction to Sequence-to-Sequence (Seq2Seq) Modeling

Implementation in Python using Keras

Import the Required Libraries

Read the Data into our IDE

Text Pre-Processing

Model Building

End Notes

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm