How to Build a Script Generator using Generative AI?

Adil Mohammed Last Updated : 26 May, 2024

11 min read

Introduction

Have you ever wondered about how the characters of your favourite web series lived after the end of the series? If yes, then this blog will assist you in building a script generator that will generate a script for a new episode. Our model will be trained on the scripts of all the episodes and ready to generate the script of the next episode, which has not been produced in the series. For generations now, crafting storylines, compelling dialogue, and the entirety of scripts has been the domain of humans. However, this process is often time-consuming and relies heavily on the collaboration of multiple individuals, especially when developing scripts for long-running series such as ‘Brooklyn Nine-Nine.’ Hence, in this blog, we will build a script generator using Generative AI, which will assist the screenwriters in writing the scripts as soon as possible.

Now, let’s understand the definition of the technology we will use in this blog. Generative AI is a subset of artificial intelligence capable of generating new images, audio, video, text, etc. We are now using Generative AI in almost all fields to optimize the time required to finish a specific task. When talking about textual data, Generative AI can generate human-like texts. It can understand the task’s context and generate text based on that. In the web series script generation context, we can use Generative AI to generate a new episode script with the same writing style and tone as the whole web series.

Learning Objectives

We can understand how AI will be used in content writing for web series, movies, etc.
We will learn the detailed process of building a script generator model, including data scraping, cleaning, model building, etc.
We will learn the ability of generative AI in scrip writing, how efficiently to write and its advantages.
We will learn how important preparing and cleaning data is and how this affects the script generator.

This article was published as a part of the Data Science Blogathon.

The Process of Creating Scripts Using Generative AI
Benefits of Using Generative AI in Scriptwriting
Challenges of Using Generative AI in Scriptwriting
About Dataset
Exploratory Data Analysis
Data Preparation
Model Building
Generating Scripts
Frequently Asked Questions

The Process of Creating Scripts Using Generative AI

First, let’s briefly overview the entire flow for building the script generator:

Gathering Web Series Data

As we all know, we must gather data before building any model. So here, in building the AI Script Generator, we first need to collect all the data regarding the scripts of the web series. This process includes collecting many scripts from particular episodes of a web series. We Scrape these using scrapping websites, through databases, or by seeking permissions from the owner of the scripts. The main aim is to build a huge dataset with a wide range of dialogues, communications between the characters, the development of particular scenes, or the twists present in the series. As we develop the dataset, we must ensure that the data we collect is true, has no copyright issues, and is complete.

Cleaning and Pre-processing the Data

Data Pre-Processing is a crucial step that ensures our data is clean and tidy. This step involves removing unnecessary data, such as stage directions or director’s descriptions. Since we are collecting data through web scraping, we need to check for any missing data. We will also need to normalize the text data by removing punctuation and special characters and converting all the words to lowercase. In this way, we will clean our dataset.

Data Preparation

After thoroughly cleaning the dataset, it’s time to prepare it as per our model needs. First, we start by tokenizing the script into individual words using a Tokenizer. This tokenizer breaks the whole sentence or a scripted dialogue into individual words and then assigns a unique index value, forming a word index. Following that, we create sequences of tokens. So, we create a list of tokens for each dialogue in the script. After tokenization, we pad these sequences with zeros at the beginning so that the input is uniform for our model. Then, the last word of each sequence is used as a label to predict the next word. Finally, the labels are converted to categorical format using one-hot encoding. In this way, the dataset is prepared for model training.

Building Generative Model

Once the data is prepared, we are ready to build our Generative Model. We need a model to handle sequential data for the text generation task. This blog will use a transformer-based model to generate the scripts. In this training phase, our model will learn to predict the next word based on the previous words. After the model is trained, we can assess the quality of the model’s prediction using a loss function, such as cross-entropy loss.

Generating New Script

Once our model is trained, we can generate a new episode script. To do this, we first need to feed the model with an initial sentence named ‘seed.’ The model then predicts the next word based on this seed sentence. The model generates the next word based on the probabilities learned during training. This predicted word is added to a sequence, and then this process is repeated until the desired length of the script is reached.

Benefits of Using Generative AI in Scriptwriting

Here are the benefits of using Generative AI in scriptwriting:

As we discussed earlier, scriptwriting is a time-consuming process, as it is done manually by human writers. However, with the use of Generative AI, we can speed up the process by generating initial drafts.
One of the main benefits of using AI to generate a script is that it can maintain the writing style and tone of the previous scripts in this new script.
Generative AI can generate creative and interesting dialogues during script generation that might not occur to human writers.
It helps scriptwriters to spend their time refining and perfecting the script rather than writing it from scratch.

Challenges of Using Generative AI in Scriptwriting

Here are the challenges:

The first challenge that one might face while building this AI script generator is data collection. We should check for any copywriter issues.
Generative AI can need help understanding the script’s context, which can lead to consistency in the storyline.
Although Generative AI can generate scripts in no time, it can lack the level of creativity and originality of a human writer.
One of the main challenges is that Generative AI requires a lot of computational power, which can be expensive.

Now, let’s dive deep into understanding the code behind this AI script generator in the next few sections.

You can execute all codes by clicking on the ‘Copy & Edit‘ button in this link.

First, Let’s import all the libraries we will use to build the script generator.

import requests
from bs4 import BeautifulSoup
import re
from nltk.tokenize import sent_tokenize
import plotly.express as px
from collections import Counter
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset,
 DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

About Dataset

We will use the scripts of the Brooklyn 99 web series. Since it has many episodes, it will be good for our model. We will use the BeautifulSoup and Requests libraries to scrape these scripts from a web page.

We will do this by using two functions namely, ‘fetch_and_preprocess_scripts’, and ‘preprocess_text’.

The first function takes a URL as a parameter. This URL is the web page from where we will scrape our scripts. We will use the requests library to send an HTTP request to get the HTML content of the page. We will then use the BeautifulSoup library to parse this HTML content. We try to find all anchor tags (<a>) with the class ‘topictitle’ as they contain all the links to individual episode scripts. Then, we construct the full URL of each script by concatenating it to the base URL, and we will store it in a list. This list is then reversed to maintain the order of the episodes. Finally, the function then iterates over each script and extracts the text. The text is then appended to a final script string.

# Function to fetch and preprocess the script content from a given URL
def fetch_and_preprocess_scripts(url):
    base_url = "https://transcripts.foreverdreaming.org"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    anchor_tags = soup.find_all("a", class_="topictitle")
    links = [base_url + tag["href"][1:] for tag in anchor_tags]
    links = links[2:]
    links.reverse()

    final_script = ""
    
    for link in links:
        response = requests.get(link)
        soup = BeautifulSoup(response.content, "html.parser")
        script_div = soup.find("div", class_="content")
        script_text = script_div.get_text(separator="\n") if script_div else ""

        final_script += script_text.strip() + "\n"

    preprocessed_script = preprocess_text(final_script)
    return preprocessed_script

Now, we will call the preprocess_text function, which will clean the script string by removing all HTML tags and square brackets, tokenizing the text into sentences, and converting the sentences to lowercase.

# Function to clean and preprocess the text
def preprocess_text(text):
    cleaned_text = re.sub(r'<[^>]+>', '', text)
    cleaned_text = re.sub(r'\[[^\]]+\]', '', cleaned_text)
    sentences = sent_tokenize(cleaned_text)
    
    preprocessed_text = ' '.join(sentence.lower() for sentence in sentences)
    
    return preprocessed_text


url = "https://transcripts.foreverdreaming.org/viewforum.php?f=429&sid=
       acbdaf84cb954f2929838f627cb124cb&start=78"
newpreprocessed_script = fetch_and_preprocess_scripts(url)
url1 = "https://transcripts.foreverdreaming.org/viewforum.php?f=429"
new_preprocessed_script = fetch_and_preprocess_scripts(url1)
preprocessed_script = newpreprocessed_script+new_preprocessed_script

In this way, we scraped and cleaned the episodes’ scripts. Now, we are ready with our dataset.

Now let’s see the first 500 words of our dataset, which should be the starting words of the series’ pilot episode.

print(preprocessed_script[:500])

Output:

Exploratory Data Analysis

In this section, we will perform Exploratory Data Analysis (EDA) on our script data. We will begin by splitting the preprocessed script into individual tokens (words). Then, we will count the frequency of each token using the Counter() function.

tokens = preprocessed_script.split()
token_counter = Counter(tokens)

Now, we will extract the top 20 common tokens and their counts.

most_common_tokens = token_counter.most_common(20)
token_labels, token_counts = zip(*most_common_tokens)

Finally, we will create a DataFrame from this data to visualize it using a bar chart. On the x-axis, we will keep the words, and on the y-axis, we will keep the count of each word.

data = {'Word': token_labels, 'Frequency': token_counts}
df = pd.DataFrame(data)

fig = px.bar(df, x='Word', y='Frequency', title='Most Common Words')
fig.update_xaxes(tickangle=45)
fig.show()

Output:

Another way to visualize the most used words is to make a word cloud, a more visually appealing chart. We need to call the WordCloud() function and pass a few details about the chart, like width, height, and background_color, along with the whole script.

text = ' '.join(tokens)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Script Words')
plt.show()

Output:

Data Preparation

Now, we will start preparing our dataset to be suitable for training the model. This involves tokenizing the preprocessed script into individual words using the Tokenizer() function. First, we will create the object of Tokenizer, and then we will use the fit_on_tests() function to tokenize. Finally, we can get the total number of unique words used later.

# Tokenizing the text into words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([preprocessed_script])
total_words = len(tokenizer.word_index) + 1

Now, we will create sequences of these tokens for every line in the script. We will do this by iterating over every line of the script, in which we will convert each line to a sequence of tokens and then create a n-gram sequence from these tokens. Finally, we will append these sequences to a list of input sequences.

# Creating sequences of tokens
input_sequences = []
for line in preprocessed_script.split('\\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

The next step in the data preparation phase is to create padding for the input sequences. We do this to ensure that the input sequence is uniform in length. To do this, we will call the ‘pad_sequences’ function from the Keras library, in which we will pass the input_sequences variable and the length of the longest sequence.

# Padding sequences to ensure uniform length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

Finally, we will split each sequence into labels and predictors. The predictors will contain all the words of the sequence except the last word, and the label variable will contain the last word of the sequence. We do this to train the model to predict the next word based on the label variable. The labels are then converted to a categorical format, which is necessary for training the model.

# Creating predictors and labels
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
# Converting labels to categorical format
from tensorflow.keras.utils import to_categorical
label = to_categorical(label, num_classes=total_words)

print("Total words:", total_words)
print("Max sequence length:", max_sequence_len)
print("Number of input sequences:", len(input_sequences))

OUTPUT:

Model Building

Now, for the main section of the whole building process of this AI script generator, we will use the pre-trained GPT-2 model from the transformer library.

First, we will load the tokenizer and model using the from_pretrained() function.

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

The next step is tokenizing and encoding the script data. This is done using the tokenizer.

preprocessed_script_tokens = tokenizer(preprocessed_script, return_tensors="pt", max_length=1024, 
                                       truncation=True)

Now, we will save the tokenized data into a text file.

file_path = "preprocessed_script.txt"
with open(file_path, "w") as f:
    f.write(preprocessed_script)

Now, we will convert the tokenized data into a PyTorch dataset using the TextDataset() class from the transformers library.

dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Now, we will define training arguments using the TrainingArguments class. The main arguments are the number of training epochs and batch size.

training_args = TrainingArguments(
    output_dir="./script_generator",
    overwrite_output_dir=True,
    num_train_epochs=50,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    report_to=[],  # Disabled wandb logging
)

Now, we will create the Trainer object to pass the model, training_args, data_collator, and dataset variable we have created so far.

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

Finally, we will train the model using the train() function.

trainer.train()

This will train the model to generate scripts in the style of the preprocessed script data.

Generating Scripts

Now that the model is trained, we can generate a new episode’s script by loading the trained model and tokenizer.

# Loading the fine-tuned model
model = GPT2LMHeadModel.from_pretrained("/kaggle/working/fine_tuned_script_generator")

# Loading the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Now, we will define a seed sentence, which will serve as a starting point for the new script.

# Generating text
prompt_text = "Detective Jake Peralta enters the precinct and announces:"  # Custom prompt text
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")

Now, we will generate the script using generate() function, in which we will pass, the seed sentence, and the maximum length of the new script. Another parameter we will pass is temperature, which controls the randomness of the predictions.

# Generating text with a maximum length of 500 tokens
output1 = model.generate(input_ids, max_length=500, num_return_sequences=1, temperature=0.7, 
                         do_sample=True)

Finally, we will decode the generated script and format it into a more readable format.

# Decoding the generated text
generated_text = tokenizer.decode(output1[0], skip_special_tokens=True)

delimiters = [". ", "? ", "! ", "| "]
for delimiter in delimiters:
    generated_text = generated_text.replace(delimiter, delimiter + "\\n")

# Printing each dialogue on a new line
print(generated_text.strip())

Output:

As you can see, our model generated a new episode script, which is very accurate and interesting.

Conclusion

In conclusion, Generative AI is a powerful tool for script generation. It can create a new script that matches the tone of a particular web series in which the model is trained. It can reduce the time and effort of human writers. The quality of the generated scripts depends on the quality and quantity of the dataset. It also depends on the choice of the model and its parameters. Despite these challenges, script writers can use a script generator as an initial draft of an episode, and they can refine it to their needs reducing the total time of script writing.

Key Takeaways

Generative AI can be a useful technology for scriptwriters as they can build and use script generators with it.
To build this script generator, we should first gather a web series dataset, prepare the dataset in a suitable format, build the model, and generate a new episode script.
One of the main challenges while building this script generator is that it requires a lot of computational power.
The quality of the newly generated script depends on the quantity and quality of the training data.

The media shown in this article are not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is Generative AI?

A. Generative AI is a technology capable of creating new things, such as images, songs, videos, text, etc.

Q2. What are the benefits of using Generative AI in scriptwriting?

A. It can help the scriptwriters by generating a new episode script as an initial draft. Scriptwriters can use this draft to refine it to their needs and make the final draft, reducing the total time to write the script.

Q3. What challenges are faced when using Generative AI in scriptwriting?

A. The need for computational resources is one of the main challenges in building the script generator.

Q4. Can Generative AI replace human writers?

A. No, Generative AI cannot completely replace scriptwriters yet. But it can help script writers to write a new script in a short time.

Adil Mohammed

👋 Hello! I'm Adil Naib, a passionate data science enthusiast and Kaggle Notebooks Expert currently pursuing a degree in Data Science at Presidency University. Through courses and independent projects, I've acquired expertise in Exploratory Data Analysis, Data Visualization, and Predictive modelling.

🖋 I've written and published informative data science blogs on Analytics Vidhya. As I continue my journey in the data science world, I aim to create more valuable content that provides insights, tips, and solutions to complex data-related challenges.

💼 I'm excited to use my knowledge and talents in the real world and acquire first-hand experience through internships or other possibilities. I eventually want to work in data science and contribute significantly.

In my free time, I enjoy writing Data Science blogs and sharing my knowledge with the community.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

How to Build a Script Generator using Generative AI?

Introduction

Learning Objectives

Table of contents

The Process of Creating Scripts Using Generative AI

Gathering Web Series Data

Cleaning and Pre-processing the Data

Data Preparation

Building Generative Model

Generating New Script

Benefits of Using Generative AI in Scriptwriting

Challenges of Using Generative AI in Scriptwriting

About Dataset

Exploratory Data Analysis

Data Preparation

Model Building

Generating Scripts

Conclusion

Key Takeaways

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt