Building a Machine Learning Model for Title Generation

Sharvari Last Updated : 24 Sep, 2021

This article was published as a part of the Data Science Blogathon

Introduction

In this article, I will use the YouTube trending videos dataset and the Python programming language to train a text-generating language model with machine learning. The model can then be used to generate titles for your YouTube videos or blog posts.

Title generation is a Natural Language Processing task and a central problem in several machine learning applications, including text synthesis, speech-to-text, and conversational systems.

To build a title generator, or any text generator, the model must be trained to learn the probability of a word occurring, using the words that already appear in the sequence as context.

What is Natural Language Processing

NLP | Model for Title Generation — Image 2

Natural Language Processing (NLP) is often used for text classification tasks such as spam detection and sentiment analysis, as well as for text generation and language translation. Text data can be viewed as a sequence of characters, a sequence of words, or a sequence of sentences. In most problems, text data is treated as a sequence of words. In this article, we will walk through the process using simple sample data. However, the steps discussed here apply to any NLP task. In particular, we will use TensorFlow 2 and Keras for text preprocessing, which includes:

Tokenization
Sequencing
Padding
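To make the three steps concrete, here is a minimal plain-Python sketch. It is only a stand-in for what the Keras utilities do, not their implementation: the two sentences are made up, and the way ties between equally frequent words are ordered is an assumption of this sketch.

```python
# Toy corpus, invented for illustration.
sentences = ["I love my dog", "I love my cat and my dog"]

# Tokenization: give every word an integer id, most frequent words first.
counts = {}
for sentence in sentences:
    for word in sentence.lower().split():
        counts[word] = counts.get(word, 0) + 1
ordered = sorted(counts, key=lambda w: -counts[w])
word_index = {word: i + 1 for i, word in enumerate(ordered)}  # id 0 is kept free for padding

# Sequencing: replace every word by its id.
sequences = [[word_index[w] for w in s.lower().split()] for s in sentences]

# Padding: left-pad with 0 so every sequence has the same length.
max_len = max(len(seq) for seq in sequences)
padded = [[0] * (max_len - len(seq)) + seq for seq in sequences]

print(word_index)
print(sequences)
print(padded)
```

The shorter sentence is filled with zeros on the left, so both rows end up with the same length and can be stacked into one matrix.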

Building the Machine Learning Model for Title Generation

I will start building the title generator by importing the libraries and reading the datasets. The datasets I use for this project can be downloaded from here.

Importing the necessary libraries

We import the libraries before we start working with them. Here, I have used Keras and TensorFlow as the main libraries for the model, as Keras is a highly productive interface for solving such problems with a deep learning approach.

import pandas as pd
import string
import numpy as np
import json
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku
import tensorflow as tf
tf.random.set_seed(2)
from numpy.random import seed
seed(1)

Loading the dataset

#load all the datasets 
df1 = pd.read_csv('USvideos.csv')
df2 = pd.read_csv('CAvideos.csv')
df3 = pd.read_csv('GBvideos.csv')

#load the datasets containing the category names
data1 = json.load(open('US_category_id.json'))
data2 = json.load(open('CA_category_id.json'))
data3 = json.load(open('GB_category_id.json'))

Now we need to process the data so that we can use it to train our machine learning model for the task of title generation. Here are all the data cleaning and processing steps we need to follow:

def category_extractor(data):
    i_d = [data['items'][i]['id'] for i in range(len(data['items']))]
    title = [data['items'][i]['snippet']["title"] for i in range(len(data['items']))]
    i_d = list(map(int, i_d))
    category = zip(i_d, title)
    category = dict(category)
    return category

#create a new category column by mapping the category names to their id
df1['category_title'] = df1['category_id'].map(category_extractor(data1))
df2['category_title'] = df2['category_id'].map(category_extractor(data2))
df3['category_title'] = df3['category_id'].map(category_extractor(data3))

#join the dataframes
df = pd.concat([df1, df2, df3], ignore_index=True)

#drop rows based on duplicate videos
df = df.drop_duplicates('video_id')

#collect only titles of entertainment videos
#feel free to use any category of video that you want
entertainment = df[df['category_title'] == 'Entertainment']['title']
entertainment = entertainment.tolist()

#remove punctuations and convert text to lowercase
def clean_text(text):
    text = ''.join(e for e in text if e not in string.punctuation).lower()
    
    text = text.encode('utf8').decode('ascii', 'ignore')
    return text

corpus = [clean_text(e) for e in entertainment]
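As a quick check of the two helpers above, the sketch below runs equivalent logic on made-up inputs. The JSON fragment only imitates the shape of the category files, and the sample title is invented; the dictionary comprehension is a more compact way to write the same id-to-title mapping.

```python
import string

def category_extractor(data):
    # Map the integer category id to its title.
    return {int(item["id"]): item["snippet"]["title"] for item in data["items"]}

def clean_text(text):
    # Drop ASCII punctuation, lowercase, then drop any non-ASCII characters.
    text = "".join(ch for ch in text if ch not in string.punctuation).lower()
    return text.encode("utf8").decode("ascii", "ignore")

# Hypothetical fragment shaped like the *_category_id.json files.
toy_json = {"items": [
    {"id": "10", "snippet": {"title": "Music"}},
    {"id": "24", "snippet": {"title": "Entertainment"}},
]}
categories = category_extractor(toy_json)

sample = "Café Tour: BEST Moments!! (Official Video)"
print(categories)
print(clean_text(sample))
```

Note that the ASCII step removes accented letters entirely rather than replacing them, so "Café" becomes "caf".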

Generating sequences

Natural language processing tasks require input data in the form of a sequence of tokens. The first step after data cleaning is to generate sequences of n-gram tokens.

An n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be words, letters, phonemes, or base pairs. In this case, the n-grams are sequences of words from the corpus of titles.

The Tokenizer is an API in TensorFlow Keras that splits sentences into tokens. Our text data is defined as a list of strings, one cleaned title per element.

Since deep learning models do not understand raw text, we need to convert the text into a numerical representation. The first step for this is tokenization. The Tokenizer API from TensorFlow Keras splits sentences into words and encodes them as integers. Tokenization is the process of extracting tokens from a corpus:

tokenizer = Tokenizer()

def get_sequence_of_tokens(corpus):
  # get tokens; index 0 is reserved for padding, hence the +1
  tokenizer.fit_on_texts(corpus)
  total_words = len(tokenizer.word_index) + 1

  # convert each title into its n-gram prefix sequences
  input_sequences = []
  for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
      n_gram_sequence = token_list[:i+1]
      input_sequences.append(n_gram_sequence)

  return input_sequences, total_words

inp_sequences, total_words = get_sequence_of_tokens(corpus)
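To see what the inner loop produces, here is a small plain-Python sketch for a single title. The word index is a hypothetical one, written by hand instead of being fitted by the Tokenizer.

```python
# Hypothetical word index; in the article it comes from tokenizer.word_index.
word_index = {"the": 1, "voice": 2, "blind": 3, "audition": 4}

title = "the voice blind audition"
token_list = [word_index[word] for word in title.split()]

# Every prefix of length 2 or more becomes one training sequence.
n_gram_sequences = [token_list[:i + 1] for i in range(1, len(token_list))]

for seq in n_gram_sequences:
    print(seq)
# [1, 2]
# [1, 2, 3]
# [1, 2, 3, 4]
```

A title with k words therefore contributes k - 1 sequences, and a one-word title contributes none.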

Padding the sequences

In any raw text data, there will naturally be sentences of different lengths. However, neural networks require inputs of the same size. This is what padding is for. Whether to use 'pre' or 'post' padding depends on the analysis. In some cases padding at the beginning is appropriate, while in others it is not. For example, if we use a Recurrent Neural Network (RNN) for spam detection, padding at the beginning may be appropriate because RNNs struggle to learn long-distance patterns. Padding at the beginning keeps the actual tokens at the end of the sequence, so the RNN can use them to predict what comes next. In any case, padding should be chosen after careful consideration and with knowledge of the business problem.

Since sequences can vary in length, their lengths must be made equal. When using neural networks, we usually feed an input into the network and wait for the output. In practice, it is more efficient to process data in batches than one example at a time. pad_sequences() is a function in the Keras deep learning library that can be used to pad variable-length sequences.

Batching is done using matrices of shape [batch size x sequence length], where the sequence length corresponds to the longest sequence. Shorter sequences are filled with a padding token (usually 0) to fit the size of the matrix. This process of filling token sequences is called padding. To feed the data into the model for training, I need to create predictors and labels.

I will use each n-gram sequence without its last word as the predictor and the last word of the n-gram as the label:

def generate_padded_sequences(input_sequences):
  # pad every sequence on the left to the length of the longest one
  max_sequence_len = max([len(x) for x in input_sequences])
  input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
  # all tokens but the last are the predictors, the last token is the label
  predictors, label = input_sequences[:,:-1], input_sequences[:, -1]
  label = ku.to_categorical(label, num_classes=total_words)
  return predictors, label, max_sequence_len

predictors, label, max_sequence_len = generate_padded_sequences(inp_sequences)
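The same split can be traced by hand on a few short sequences. This is a plain-Python sketch of the idea only; the sequences and the vocabulary size are made up, and the list comprehensions stand in for pad_sequences and to_categorical.

```python
# Hypothetical n-gram sequences and vocabulary size (largest id + 1).
sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
total_words = 5

# 'pre' padding: fill with 0 on the left up to the longest sequence.
max_sequence_len = max(len(seq) for seq in sequences)
padded = [[0] * (max_sequence_len - len(seq)) + seq for seq in sequences]

# Predictors are all tokens but the last; the label is the last token.
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

# One-hot encode the labels, one column per vocabulary id.
one_hot = [[1 if i == label else 0 for i in range(total_words)] for label in labels]

print(predictors)  # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(labels)      # [2, 3, 4]
print(one_hot[0])  # [0, 0, 1, 0, 0]
```

With 'pre' padding the label is always the real last word of the n-gram, which is why the last column can be sliced off directly.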

LSTM Model for Title Generation

In recurrent neural networks, activation outputs are propagated in both directions, i.e. from inputs to outputs and from outputs back to inputs, unlike feedforward neural networks, where activation outputs are propagated in only one direction. This creates loops in the network architecture that act as a “memory state” for the neurons.

Because of this state, the RNN “remembers” what was learned over time. The memory state has its advantages, but it also has its drawbacks. The vanishing gradient is one of them.

With this problem, when learning through a large number of layers, it becomes very difficult for the network to learn and adjust the parameters of the earlier layers. To solve this problem, a new type of RNN was developed: the LSTM (Long Short-Term Memory).
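The vanishing gradient can be illustrated with a toy calculation. The sketch below is not the article's model: it assumes a one-neuron recurrence h_t = tanh(w * h_(t-1)) with a hand-picked weight, and tracks how strongly the state after t steps still depends on the initial state.

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Return |d h_t / d h_0| for t = 1..steps in h_t = tanh(w * h_(t-1))."""
    h, grad, history = h0, 1.0, []
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d tanh(w * h_prev) / d h_prev = w * (1 - tanh(w * h_prev)^2)
        grad *= w * (1.0 - h * h)
        history.append(abs(grad))
    return history

grads = gradient_through_time(w=0.9, steps=30)
print(f"after  1 step : {grads[0]:.4f}")
print(f"after 30 steps: {grads[-1]:.6f}")
```

Every step multiplies the gradient by a factor below 1, so the influence of early inputs shrinks roughly exponentially with the number of steps.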

LSTM model

The LSTM has an additional state (the cell state) that allows the network to learn what to store, what to forget, and what to read. The LSTM model used here consists of the following layers:

Input layer: takes the sequence of words as input
LSTM layer: computes the output using LSTM units
Dropout layer: a regularization layer to prevent overfitting
Output layer: computes the probability of the next possible word as output
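How the cell state decides what to store, forget, and read can be sketched with a single scalar LSTM step. This is a simplified illustration with hand-picked, hypothetical weights, not the Keras LSTM layer, which works on vectors and learns its weights during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))           # how much old cell state to keep
    store = sigmoid(pre_activation("input"))             # how much new information to store
    candidate = math.tanh(pre_activation("candidate"))   # the new information itself
    read = sigmoid(pre_activation("output"))             # how much of the cell state to read out

    c = forget * c_prev + store * candidate
    h = read * math.tanh(c)
    return h, c

# Hypothetical weights: the same small values for every gate.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, params)
print(h, c)

# With the forget gate open and the input gate closed, the cell state is carried over.
keep = dict(params, forget=(0.0, 0.0, 10.0), input=(0.0, 0.0, -10.0))
_, c_kept = lstm_step(1.0, 0.0, 0.7, keep)
print(c_kept)
```

The last two lines show why the LSTM helps with long sequences: when the gates are set to keep, the cell state passes through a step almost unchanged instead of being squashed.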

I will now use the LSTM model to build the title generator:

def create_model(max_sequence_len, total_words):
  # the last token of every sequence is the label, so the input is one shorter
  input_len = max_sequence_len - 1
  model = Sequential()

  # Add Input Embedding Layer
  model.add(Embedding(total_words, 10, input_length=input_len))

  # Add Hidden Layer 1 - LSTM Layer
  model.add(LSTM(100))
  model.add(Dropout(0.1))

  # Add Output Layer
  model.add(Dense(total_words, activation='softmax'))
  model.compile(loss='categorical_crossentropy', optimizer='adam')

  return model

model = create_model(max_sequence_len, total_words)
model.fit(predictors, label, epochs=20, verbose=5)

Now that our title generator model is built and trained on the data, it is time to predict a title from an input word. The input word is first tokenized, then the sequence is padded before being passed to the trained model, which returns the predicted next word:

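def generate_text(seed_text, next_words, model, max_sequence_len):
  for _ in range(next_words):
    # tokenize and pad the text generated so far
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes was removed in newer TensorFlow releases;
    # taking the argmax of predict gives the same class index
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]

    # look up the word that belongs to the predicted index
    output_word = ""
    for word, index in tokenizer.word_index.items():
      if index == predicted:
        output_word = word
        break
    seed_text += " " + output_word
  return seed_text.title()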
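The generation loop can be tried without a trained network. In the sketch below, toy_predict is a hypothetical stand-in for model.predict that returns hand-written probabilities, and the four-word vocabulary is made up. It also builds a reverse index once, which avoids scanning word_index for every generated word.

```python
# Hypothetical vocabulary; id 0 is the padding token.
word_index = {"happy": 1, "birthday": 2, "to": 3, "you": 4}
index_word = {index: word for word, index in word_index.items()}

def toy_predict(token_ids):
    """Stand-in for model.predict: next-word probabilities given the last token id."""
    table = {
        1: [0.0, 0.0, 0.9, 0.05, 0.05],
        2: [0.0, 0.0, 0.0, 0.8, 0.2],
        3: [0.0, 0.1, 0.0, 0.0, 0.9],
        4: [0.0, 0.7, 0.1, 0.1, 0.1],
    }
    return table[token_ids[-1]]

def generate(seed_text, next_words):
    # The seed must contain at least one word from the toy vocabulary.
    words = seed_text.lower().split()
    for _ in range(next_words):
        token_ids = [word_index[w] for w in words if w in word_index]
        probabilities = toy_predict(token_ids)
        # Greedy decoding: always pick the most probable next word.
        best = max(range(len(probabilities)), key=probabilities.__getitem__)
        words.append(index_word[best])
    return " ".join(words).title()

print(generate("HAPPY", 3))  # Happy Birthday To You
```

Because the most probable word is always chosen, the same seed always produces the same title; sampling from the probabilities instead would give more varied output.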

Now that we have created the title generation function, let’s test our title generator model:

print(generate_text("HAPPY", 5, model, max_sequence_len))

Output:  The Secret Of HAPPY

I hope you enjoyed this article on how to create a title-generating model with machine learning and the Python programming language. Feel free to ask your questions in the comments section below.

Thanks For Reading!

About Me:

Hey, I am Sharvari Raut. I love to write!

Connect with me on:

Twitter: https://twitter.com/aree_yarr_sharu

LinkedIn: https://t.co/g0A8rcvcYo?amp=1

Github: https://github.com/sharur7

References :

Image 1: https://unsplash.com/s/photos/machine-learning?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

Image 2: https://unsplash.com/s/photos/machine-learning?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Sharvari

I am Sharvari Raut. I love to write. I am a final-year student in Computer Science and Engineering at NCER Pune. I have worked as a freelance technical writer for a few startups and companies. With 2 years of experience in technical writing, I have written over 100 technical articles that have been published so far. Writing for Analytics Vidhya is one of my favourite things to do.
