Elon Musk AI Text Generator with LSTMs in Tensorflow 2

Guest Blog Last Updated : 30 Oct, 2024

Introduction

Elon Musk has become an internet sensation over the past couple of years, thanks to his views about the future, his funny personality, and his passion for technology. By now everyone knows him, either as that electric car guy or as that guy who builds flamethrowers. He is most active on Twitter, where he shares everything, even memes!

He inspires a lot of young people in the IT industry, and I wanted to do a fun little project, where I would create an AI that would generate text based on his previous Twitter postings. I wanted to encapsulate his style and see what kind of weird results I would get.

Preparation

The data I am using was scraped directly from Elon Musk’s Twitter account, both his posts and his replies. You can download the dataset at this link.

Importing the libraries:

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

Now I’m gonna create the function that gets rid of the links, the @mentions, bracketed text, and words containing numbers, basically all the stuff that’s gonna confuse the model, so that we’re left with clean text.

Python Code:

import pandas as pd
import numpy as np
import re

#import the data
data_path = 'elonmusk.csv'
data = pd.read_csv(data_path)

# Function to clean the text
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links,
    HTML-like tags, newlines, words containing numbers, and @mentions.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    # Drop @mentions; split/join also collapses repeated whitespace
    text = " ".join(filter(lambda x: x[0] != "@", text.split()))
    return text

# Apply the function
data['text'] = data['text'].apply(clean_text)
data = data['text']
print(data.head())
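
To see what the individual patterns do, here is a small self-contained sketch that applies the main ones, one at a time, to a made-up tweet. The sample string is invented for illustration and is not taken from the dataset.

```python
import re

# Hypothetical tweet, invented for this example (not from the dataset)
text = "@NASA Falcon9 launch was great! https://t.co/abc123 [video]".lower()

# (description, pattern) pairs, in the same order the cleaning function applies them
steps = [
    ("square brackets", r'\[.*?\]'),
    ("links", r'https?://\S+|www\.\S+'),
    ("words with digits", r'\w*\d\w*'),
]
for name, pattern in steps:
    text = re.sub(pattern, '', text)
    print(f"after removing {name}: {text!r}")

# Finally drop @mentions; split/join also collapses the leftover whitespace
text = " ".join(w for w in text.split() if not w.startswith("@"))
print(text)
```

Note that the order matters: links are removed before words with digits, otherwise only fragments of the URL would be stripped.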

Let’s define a tokenizer and apply it to the text. That is how we map every word to a numeric representation, which we need because neural networks cannot take strings as input. After that, each line is expanded into n-gram sequences: its first two tokens, its first three, and so on. If you’re new to this, there is a great series on YouTube by Laurence Moroney that I would suggest checking out.

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
total_words = len(tokenizer.word_index) + 1
print(total_words)  # 5952

input_sequences = []
for line in data:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)
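
To make the loop above concrete, here is a minimal sketch in plain Python of how one line turns into several training sequences. The toy `word_index` is hand-made for illustration; in the article the real mapping comes from `tokenizer.word_index`.

```python
# Hypothetical word index for illustration only; Keras builds the real one
# from word frequencies and reserves index 0 for padding.
word_index = {"mars": 1, "is": 2, "the": 3, "next": 4, "step": 5}

line = "mars is the next step"
token_list = [word_index[w] for w in line.split()]

# Every prefix of length >= 2 becomes one training sequence
input_sequences = [token_list[:i + 1] for i in range(1, len(token_list))]

for seq in input_sequences:
    print(seq)
```

A line with n tokens yields n - 1 sequences, and the last token of each sequence is the word the model will later learn to predict.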

Now we have to define max_sequence_length (all sequences need to be padded to the same fixed length, because the network expects inputs of a uniform shape), and we also need to turn input_sequences into a NumPy array.

max_sequence_length = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre'))
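
As a rough illustration of what `padding='pre'` does, the sketch below mimics the behaviour in plain Python. `pad_pre` is a hypothetical helper written for this example, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`, keeping only the last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # if a sequence is too long, drop tokens from the front
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(s) for s in toy_sequences)

for row in pad_pre(toy_sequences, max_len):
    print(row)
```

Padding on the left keeps the last token of every sequence in the final column, which is convenient because that column is about to become the label.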

We’re gonna split each padded sequence: all the elements except the last one are our X, and the last element is our y. Each y is then turned into a one-hot vector of length total_words, which can add up to a lot of data (if total_words is 5952, every y has shape (5952,)).

# create predictors and label
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
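
The sketch below repeats the split on a toy array and then estimates how much memory the one-hot labels take. The estimate assumes float32 labels and the Keras default batch size of 32 (the training log shows 1026 steps per epoch), so treat the figure as approximate.

```python
import numpy as np

# Toy padded sequences (index 0 is padding)
padded = np.array([[0, 0, 1, 2],
                   [0, 1, 2, 3],
                   [1, 2, 3, 4]])
xs, labels = padded[:, :-1], padded[:, -1]

toy_vocab = 6  # 5 toy words + the padding index
ys = np.eye(toy_vocab, dtype="float32")[labels]  # one-hot rows
print(xs, labels, ys, sep="\n")

# Rough size of the real label matrix, under the assumptions stated above
approx_samples = 1026 * 32
approx_bytes = approx_samples * 5952 * 4  # float32 = 4 bytes
print(f"{approx_bytes / 1e9:.2f} GB")
```

If that becomes a problem, Keras also offers the `sparse_categorical_crossentropy` loss, which accepts the integer labels directly. That is an alternative to consider, not what this article uses.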

Model

Below is the configuration of our model.

model = Sequential()
model.add(Embedding(total_words, 80, input_length=max_sequence_length-1))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(50))
model.add(tf.keras.layers.Dropout(0.1))
model.add(Dense(total_words // 20))  # integer division: the number of units must be an int
model.add(Dense(total_words, activation='softmax'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 56, 80)            476160    
_________________________________________________________________
lstm_2 (LSTM)                (None, 56, 100)           72400     
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                30200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 297)               15147     
_________________________________________________________________
dense_3 (Dense)              (None, 5952)              1773696   
=================================================================
Total params: 2,367,603
Trainable params: 2,367,603
Non-trainable params: 0
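
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are made up for this example.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    return input_dim * units + units

total_words = 5952
counts = [
    embedding_params(total_words, 80),             # embedding
    lstm_params(80, 100),                          # first LSTM
    lstm_params(100, 50),                          # second LSTM
    0,                                             # dropout has no weights
    dense_params(50, total_words // 20),           # hidden dense layer
    dense_params(total_words // 20, total_words),  # softmax output layer
]
print(counts, sum(counts))
```

Most of the weights sit in the final softmax layer, because it has to produce a score for every word in the vocabulary.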

I’ve tried a couple of optimizers, and I’ve found Adam to work the best for this example. Let’s compile and run the model:

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])

history = model.fit(xs, ys, epochs=200, verbose=1)
#Output
Epoch 196/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7377 - accuracy: 0.8031
Epoch 197/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7363 - accuracy: 0.8025
Epoch 198/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7236 - accuracy: 0.8073
Epoch 199/200
1026/1026 [==============================] - 19s 18ms/step - loss: 0.7147 - accuracy: 0.8083
Epoch 200/200
1026/1026 [==============================] - 12s 12ms/step - loss: 0.7177 - accuracy: 0.8070

Let’s create a for loop that will generate new text, based on seed_text and the number of words that we define. This part of the code can look a little intimidating, but once you read each line carefully, you’ll see that we’ve already done something similar earlier.

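# seed_text (a string) and next_words (an int) must be defined before this loop runs
for _ in range(next_words):
    # Encode the current text and pad it to the length the model expects
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_length - 1, padding='pre')
    # Index of the most probable next word
    predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
    # Map the index back to its word (empty string if the index is not in the vocabulary)
    output_word = tokenizer.index_word.get(predicted, "")
    seed_text += " " + output_word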

Now is the time to play with our model. Woohoo!

seed_text = "Space is big"
next_words = 20
Space is big conflation of cats a lot of civilization by spacex is making a few months of dragon is intense as we

seed_text = "i think about flowers"
next_words = 30
i think about flowers that on the future it are limited as you could brute force it with tankers to low earth orbit that’s probably faster than liquid temp in year we can have

seed_text = "i want to colonize jupiter"
next_words = 40
i want to colonize jupiter be words just be order to zero immediate future nor can we ourselves accurately predict what issues we will encounter on a short term fine grained level with in the house with it with a human part of the us

Summary

Space is a big conflation of cats!? Who would’ve known! As you can see, the outputs the model gives are silly, and they don’t make a lot of sense. As with all deep learning models, there are a lot of things that could be tweaked in order to generate better results. I’ll leave that to you.
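
As one example of such a tweak (not something used in this article), the generation step could sample the next word from the predicted distribution instead of always taking the most likely one. The sketch below shows temperature sampling with NumPy; `sample_with_temperature` and the probabilities are made up for illustration.

```python
import numpy as np

def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Sample an index from `probs` after sharpening or flattening it by `temperature`."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype="float64")
    logits = np.log(probs + 1e-9) / temperature
    scaled = np.exp(logits - logits.max())  # subtract the max for numerical stability
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Made-up softmax output over a three-word vocabulary
probs = [0.1, 0.6, 0.3]
rng = np.random.default_rng(0)

picks = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print(np.bincount(picks, minlength=3))
```

A temperature close to 0 behaves almost like argmax, while higher values give more varied (and usually weirder) text. In the generation loop, this would take the place of the `np.argmax` call, applied to the probability vector returned by `model.predict`.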
