Have you ever wondered about how the characters of your favourite web series lived after the end of the series? If yes, then this blog will assist you in building a script generator that will generate a script for a new episode. Our model will be trained on the scripts of all the episodes and ready to generate the script of the next episode, which has not been produced in the series. For generations now, crafting storylines, compelling dialogue, and the entirety of scripts has been the domain of humans. However, this process is often time-consuming and relies heavily on the collaboration of multiple individuals, especially when developing scripts for long-running series such as ‘Brooklyn Nine-Nine.’ Hence, in this blog, we will build a script generator using Generative AI, which will assist the screenwriters in writing the scripts as soon as possible.
Now, letโs understand the definition of the technology we will use in this blog. Generative AI is a subset of artificial intelligence capable of generating new images, audio, video, text, etc. We are now using Generative AI in almost all fields to optimize the time required to finish a specific task. When talking about textual data, Generative AI can generate human-like texts. It can understand the task’s context and generate text based on that. In the web series script generation context, we can use Generative AI to generate a new episode script with the same writing style and tone as the whole web series.
This article was published as a part of the Data Science Blogathon.
First, letโs briefly overview the entire flow for building the script generator:
As we all know, we must gather data before building any model. So here, in building the AI Script Generator, we first need to collect all the data regarding the scripts of the web series. This process includes collecting many scripts from particular episodes of a web series. We Scrape these using scrapping websites, through databases, or by seeking permissions from the owner of the scripts. The main aim is to build a huge dataset with a wide range of dialogues, communications between the characters, the development of particular scenes, or the twists present in the series. As we develop the dataset, we must ensure that the data we collect is true, has no copyright issues, and is complete.
Data Pre-Processing is a crucial step that ensures our data is clean and tidy. This step involves removing unnecessary data, such as stage directions or director’s descriptions. Since we are collecting data through web scraping, we need to check for any missing data. We will also need to normalize the text data by removing punctuation and special characters and converting all the words to lowercase. In this way, we will clean our dataset.
After thoroughly cleaning the dataset, itโs time to prepare it as per our model needs. First, we start by tokenizing the script into individual words using a Tokenizer. This tokenizer breaks the whole sentence or a scripted dialogue into individual words and then assigns a unique index value, forming a word index. Following that, we create sequences of tokens. So, we create a list of tokens for each dialogue in the script. After tokenization, we pad these sequences with zeros at the beginning so that the input is uniform for our model. Then, the last word of each sequence is used as a label to predict the next word. Finally, the labels are converted to categorical format using one-hot encoding. In this way, the dataset is prepared for model training.
Once the data is prepared, we are ready to build our Generative Model. We need a model to handle sequential data for the text generation task. This blog will use a transformer-based model to generate the scripts. In this training phase, our model will learn to predict the next word based on the previous words. After the model is trained, we can assess the quality of the modelโs prediction using a loss function, such as cross-entropy loss.
Once our model is trained, we can generate a new episode script. To do this, we first need to feed the model with an initial sentence named โseed.โ The model then predicts the next word based on this seed sentence. The model generates the next word based on the probabilities learned during training. This predicted word is added to a sequence, and then this process is repeated until the desired length of the script is reached.
Here are the benefits of using Generative AI in scriptwriting:
Here are the challenges:
Now, letโs dive deep into understanding the code behind this AI script generator in the next few sections.
You can execute all codes by clicking on the ‘Copy & Edit‘ button in this link.
First, Let’s import all the libraries we will use to build the script generator.
import requests
from bs4 import BeautifulSoup
import re
from nltk.tokenize import sent_tokenize
import plotly.express as px
from collections import Counter
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset,
DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments
We will use the scripts of the Brooklyn 99 web series. Since it has many episodes, it will be good for our model. We will use the BeautifulSoup and Requests libraries to scrape these scripts from a web page.
We will do this by using two functions namely, โfetch_and_preprocess_scriptsโ, and โpreprocess_textโ.
The first function takes a URL as a parameter. This URL is the web page from where we will scrape our scripts. We will use the requests library to send an HTTP request to get the HTML content of the page. We will then use the BeautifulSoup library to parse this HTML content. We try to find all anchor tags (<a>) with the class โtopictitleโ as they contain all the links to individual episode scripts. Then, we construct the full URL of each script by concatenating it to the base URL, and we will store it in a list. This list is then reversed to maintain the order of the episodes. Finally, the function then iterates over each script and extracts the text. The text is then appended to a final script string.
# Function to fetch and preprocess the script content from a given URL
def fetch_and_preprocess_scripts(url):
base_url = "https://transcripts.foreverdreaming.org"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
anchor_tags = soup.find_all("a", class_="topictitle")
links = [base_url + tag["href"][1:] for tag in anchor_tags]
links = links[2:]
links.reverse()
final_script = ""
for link in links:
response = requests.get(link)
soup = BeautifulSoup(response.content, "html.parser")
script_div = soup.find("div", class_="content")
script_text = script_div.get_text(separator="\n") if script_div else ""
final_script += script_text.strip() + "\n"
preprocessed_script = preprocess_text(final_script)
return preprocessed_script
Now, we will call the preprocess_text function, which will clean the script string by removing all HTML tags and square brackets, tokenizing the text into sentences, and converting the sentences to lowercase.
# Function to clean and preprocess the text
def preprocess_text(text):
cleaned_text = re.sub(r'<[^>]+>', '', text)
cleaned_text = re.sub(r'\[[^\]]+\]', '', cleaned_text)
sentences = sent_tokenize(cleaned_text)
preprocessed_text = ' '.join(sentence.lower() for sentence in sentences)
return preprocessed_text
url = "https://transcripts.foreverdreaming.org/viewforum.php?f=429&sid=
acbdaf84cb954f2929838f627cb124cb&start=78"
newpreprocessed_script = fetch_and_preprocess_scripts(url)
url1 = "https://transcripts.foreverdreaming.org/viewforum.php?f=429"
new_preprocessed_script = fetch_and_preprocess_scripts(url1)
preprocessed_script = newpreprocessed_script+new_preprocessed_script
In this way, we scraped and cleaned the episodes’ scripts. Now, we are ready with our dataset.
Now letโs see the first 500 words of our dataset, which should be the starting words of the series’ pilot episode.
print(preprocessed_script[:500])
Output:
In this section, we will perform Exploratory Data Analysis (EDA) on our script data. We will begin by splitting the preprocessed script into individual tokens (words). Then, we will count the frequency of each token using the Counter() function.
tokens = preprocessed_script.split()
token_counter = Counter(tokens)
Now, we will extract the top 20 common tokens and their counts.
most_common_tokens = token_counter.most_common(20)
token_labels, token_counts = zip(*most_common_tokens)
Finally, we will create a DataFrame from this data to visualize it using a bar chart. On the x-axis, we will keep the words, and on the y-axis, we will keep the count of each word.
data = {'Word': token_labels, 'Frequency': token_counts}
df = pd.DataFrame(data)
fig = px.bar(df, x='Word', y='Frequency', title='Most Common Words')
fig.update_xaxes(tickangle=45)
fig.show()
Output:
Another way to visualize the most used words is to make a word cloud, a more visually appealing chart. We need to call the WordCloud() function and pass a few details about the chart, like width, height, and background_color, along with the whole script.
text = ' '.join(tokens)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Script Words')
plt.show()
Output:
Now, we will start preparing our dataset to be suitable for training the model. This involves tokenizing the preprocessed script into individual words using the Tokenizer() function. First, we will create the object of Tokenizer, and then we will use the fit_on_tests() function to tokenize. Finally, we can get the total number of unique words used later.
# Tokenizing the text into words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([preprocessed_script])
total_words = len(tokenizer.word_index) + 1
Now, we will create sequences of these tokens for every line in the script. We will do this by iterating over every line of the script, in which we will convert each line to a sequence of tokens and then create a n-gram sequence from these tokens. Finally, we will append these sequences to a list of input sequences.
# Creating sequences of tokens
input_sequences = []
for line in preprocessed_script.split('\\n'):
token_list = tokenizer.texts_to_sequences([line])[0]
for i in range(1, len(token_list)):
n_gram_sequence = token_list[:i+1]
input_sequences.append(n_gram_sequence)
The next step in the data preparation phase is to create padding for the input sequences. We do this to ensure that the input sequence is uniform in length. To do this, we will call the โpad_sequencesโ function from the Keras library, in which we will pass the input_sequences variable and the length of the longest sequence.
# Padding sequences to ensure uniform length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')
Finally, we will split each sequence into labels and predictors. The predictors will contain all the words of the sequence except the last word, and the label variable will contain the last word of the sequence. We do this to train the model to predict the next word based on the label variable. The labels are then converted to a categorical format, which is necessary for training the model.
# Creating predictors and labels
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
# Converting labels to categorical format
from tensorflow.keras.utils import to_categorical
label = to_categorical(label, num_classes=total_words)
print("Total words:", total_words)
print("Max sequence length:", max_sequence_len)
print("Number of input sequences:", len(input_sequences))
OUTPUT:
Now, for the main section of the whole building process of this AI script generator, we will use the pre-trained GPT-2 model from the transformer library.
First, we will load the tokenizer and model using the from_pretrained() function.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
The next step is tokenizing and encoding the script data. This is done using the tokenizer.
preprocessed_script_tokens = tokenizer(preprocessed_script, return_tensors="pt", max_length=1024,
truncation=True)
Now, we will save the tokenized data into a text file.
file_path = "preprocessed_script.txt"
with open(file_path, "w") as f:
f.write(preprocessed_script)
Now, we will convert the tokenized data into a PyTorch dataset using the TextDataset() class from the transformers library.
dataset = TextDataset(tokenizer=tokenizer, file_path=file_path, block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
Now, we will define training arguments using the TrainingArguments class. The main arguments are the number of training epochs and batch size.
training_args = TrainingArguments(
output_dir="./script_generator",
overwrite_output_dir=True,
num_train_epochs=50,
per_device_train_batch_size=4,
save_steps=10_000,
save_total_limit=2,
prediction_loss_only=True,
report_to=[], # Disabled wandb logging
)
Now, we will create the Trainer object to pass the model, training_args, data_collator, and dataset variable we have created so far.
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=dataset,
)
Finally, we will train the model using the train() function.
trainer.train()
This will train the model to generate scripts in the style of the preprocessed script data.
Now that the model is trained, we can generate a new episodeโs script by loading the trained model and tokenizer.
# Loading the fine-tuned model
model = GPT2LMHeadModel.from_pretrained("/kaggle/working/fine_tuned_script_generator")
# Loading the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Now, we will define a seed sentence, which will serve as a starting point for the new script.
# Generating text
prompt_text = "Detective Jake Peralta enters the precinct and announces:" # Custom prompt text
input_ids = tokenizer.encode(prompt_text, return_tensors="pt")
Now, we will generate the script using generate() function, in which we will pass, the seed sentence, and the maximum length of the new script. Another parameter we will pass is temperature, which controls the randomness of the predictions.
# Generating text with a maximum length of 500 tokens
output1 = model.generate(input_ids, max_length=500, num_return_sequences=1, temperature=0.7,
do_sample=True)
Finally, we will decode the generated script and format it into a more readable format.
# Decoding the generated text
generated_text = tokenizer.decode(output1[0], skip_special_tokens=True)
delimiters = [". ", "? ", "! ", "| "]
for delimiter in delimiters:
generated_text = generated_text.replace(delimiter, delimiter + "\\n")
# Printing each dialogue on a new line
print(generated_text.strip())
Output:
As you can see, our model generated a new episode script, which is very accurate and interesting.
In conclusion, Generative AI is a powerful tool for script generation. It can create a new script that matches the tone of a particular web series in which the model is trained. It can reduce the time and effort of human writers. The quality of the generated scripts depends on the quality and quantity of the dataset. It also depends on the choice of the model and its parameters. Despite these challenges, script writers can use a script generator as an initial draft of an episode, and they can refine it to their needs reducing the total time of script writing.
The media shown in this article are not owned by Analytics Vidhya and is used at the Authorโs discretion.
A. Generative AI is a technology capable of creating new things, such as images, songs, videos, text, etc.
A. It can help the scriptwriters by generating a new episode script as an initial draft. Scriptwriters can use this draft to refine it to their needs and make the final draft, reducing the total time to write the script.
A. The need for computational resources is one of the main challenges in building the script generator.
A. No, Generative AI cannot completely replace scriptwriters yet. But it can help script writers to write a new script in a short time.