With recent AI advancements such as LangChain, ChatGPT builder, and the prominence of Hugging Face, creating AI and LLM apps has become more accessible. However, many are unsure how to leverage these tools effectively.
In this article, I’ll guide you in building an AI storyteller application that generates stories from random images. Utilizing open-source LLM models and custom prompts with an industry-standard approach, we’ll explore the step-by-step process.
Before we begin, let’s establish expectations for this informative journey.
Before moving ahead here are a few pre-requires that’s need to be fulfilled:
Take your AI innovations to the next level with GenAI Pinnacle. Fine-tune models like Gemini and unlock endless possibilities in NLP, image generation, and more. Dive in today! Explore Now
So, assuming you have met all the pre-requirements, let’s get started by understanding the project workflow of our AI Storyteller application.
This article was published as a part of the Data Science Blogathon.
Like any software company, let’s start with the development of a project outline.
Here is the table of things we need to do along with the approach & provider
Section Name | Approach | Provider |
Image Upload | Image upload web interface | Python Lib |
Convert image to text | LLM Models (img2text) | Hugging Face |
Generate a story from text | ChatGPT | Open AI |
Convert the story to audio | LLM Model (text2speech) | Hugging Face |
User listens to audio | Audio interface | Python Lib |
Demonstration | Web Interface | Python Lib |
If you are still unclear here is a high-level user-flow image 👇
So having defined the workflow, let’s start by organizing project files.
Go to command prompt in working directory and enter this command one by one:
mkdir ai-project
cd ai-project
code
Once you run the last command it will open the VS code and create a workspace for you. We will be working in this workspace.
Alternatively, you can create the ai-project folder and open it inside vs code. The choice is yours 😅.
Now inside the .env file create 2 constant variables as:
HUGGINGFACEHUB_API_TOKEN = YOUR HUGGINGFACE API KEY
OPENAI_API_KEY = YOUR OPEN AI API KEY
Now let’s fill in the values.
Open AI allows developers to use API keys for interacting with their products, so let’s now grab one for ourselves.
Now let’s fix the other one!
Hugging Face is an AI community that provides open-source models, datasets, tasks, and even computing spaces for a developer’s use case. The only catch is, that you need to use their API to use the models. Here is how to get one (refer to ref.png for reference):
So why do we require this? This is because as a developer, it’s natural to accidentally reveal secret info on our system. If someone else gets hold of this data it can be disastrous, so it’s a standard practice to separate the env files and later access them in another script.
As of now, we are done with the workspace setup, but there is one optional step.
This step is optional, so you may skip it but it’s preferred not to!
Often one needs to isolate their development space to focus on modules and files needed for the project. This is done through creating a virtual environment.
You can use Mini-Conda to create the v-env due to its ease of use. So, open up a command prompt and type the following commands one after the other:
conda create ai-storyteller
conda activate ai-storyteller
1st command creates a new virtual environment, while 2nd activates that. This approach even helps later at the project deployment stage. Now let’s head to the main project development.
As mentioned previously, we will work out each component separately and then merge them all.
In the vs-code or current-working-directory, create a new python file main.py. This will serve as the entry point for the project. Now let’s import all the required libraries:
from dotenv import find_dotenv, load_dotenv
from transformers import pipeline
from langchain import PromptTemplate, LLMChain, OpenAI
import requests
import os
import streamlit as st
Don’t get into library details, we will be learning them, as we use go along.
load_dotenv(find_dotenv())
HUGGINGFACE_API_TOKEN = os.getenv("HUGGINGFACEHUB_API_TOKEN")
Here:
Having loaded all the requirements and dependencies, let’s move to building out the 1st component. Image to text generator.
#img-to-text
def img2text(path):
img_to_text = pipeline(
"image-to-text", model="Salesforce/blip-image-captioning-base")
text = img_to_text(path)[0]['generated_text']
return text
Now let’s dissect the code:
So simple, right?
Next, let’s pass on the text to the story generator.
For text-to-story generation, we are going to use ChatGPT but you are free to use any other model you like.
Additionally, we will use Lang-chain to provide a custom prompt template to model to make it safe for every age to use. This can be achieved as:
def story_generator(scenario):
template = """
You are an expert kids story teller;
You can generate short stories based on a simple narrative
Your story should be more than 50 words.
CONTEXT: {scenario}
STORY:
"""
prompt = PromptTemplate(template=template, input_variables = ["scenario"])
story_llm = LLMChain(llm = OpenAI(
model_name= 'gpt-3.5-turbo', temperature = 1), prompt=prompt, verbose=True)
story = story_llm.predict(scenario=scenario)
return story
Let’s understand the code:
For those who are curious about the Lang-Chain classes used:
To learn more about Lang-chain and its features refer here.
Now we need to convert the generated output to audio. Let’s have a look.
But this time rather than loading the model, we will use hugging-face inference API, to fetch the result. This saves the storage and compute costs. Here is the code:
#text-to-speech (Hugging Face)
def text2speech(msg):
API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"
headers = {"Authorization": f"Bearer {HUGGINGFACE_API_TOKEN}"}
payloads = {
"inputs" : msg
}
response = requests.post(API_URL, headers=headers, json=payloads)
with open('audio.flac','wb') as f:
f.write(response.content)
Here is the explanation of the above code:
Note: The format for model inferencing can vary over the model, so please refer to the end of the section.
Optional
In case you plan to choose a different text-to-audio model, you can get the inference details by visiting the models page clicking on the drop-down arrow beside deploy, and selecting the inference-API option.
Congrats the backend part is now complete, let’s test the working!
Now it’s a good time to test the model. For this, we will pass in the image and call all the model functions. Copy – paste the code below:
scenario = img2text("img.jpeg") #text2image
story = story_generator(scenario) # create a story
text2speech(story) # convert generated text to audio
Here img.jpeg is the image file and is present in the same directory as main.py.
Now go to your terminal and run main.py as:
python main.py
If everything goes well you will see an audio file in the same directory as:
If you don’t find the audio.flac file, please ensure you have added your api keys, have sufficient tokens, and have all the necessary libraries installed including FFmpeg.
Now that we have done creating the backend, which works, it’s time to create the frontend website. Let’s move.
To make our front end we will use streamlit library which provides easy-to-use reusable components for building webpages from Python scripts, having a dedicated cli too, and hosting. Everything needed to host a small project.
To get started, visit Streamlit and create an account – It’s free!
Now go to your terminal and install the streamlit cli using:
pip install streamlit
Once done, you are good to go.
Now copy-paste the following code:
def main():
st.set_page_config(page_title = "AI story Teller", page_icon ="🤖")
st.header("We turn images to story!")
upload_file = st.file_uploader("Choose an image...", type = 'jpg') #uploads image
if upload_file is not None:
print(upload_file)
binary_data = upload_file.getvalue()
# save image
with open (upload_file.name, 'wb') as f:
f.write(binary_data)
st.image(upload_file, caption = "Image Uploaded", use_column_width = True) # display image
scenario = img2text(upload_file.name) #text2image
story = story_generator(scenario) # create a story
text2speech(story) # convert generated text to audio
# display scenario and story
with st.expander("scenario"):
st.write(scenario)
with st.expander("story"):
st.write(story)
# display the audio - people can listen
st.audio("audio.flac")
# the main
if __name__ == "__main__":
main()
Here is what our function does in a nutshell:
Our main function creates a webpage that allows the user to upload the image, pass that to the model, convert the image to the caption, generate a story based on it, and convert that story to audio that the user can listen to. Apart from that one can also view the generated caption and story and the audio file is stored in the local / hosted system.
Now to run your application, head over to the terminal and run:
streamlit run app.py
If everything successful, you will get below response:
Now head over to the Local URL and you can test the app.
Here is a video which showcases how to use the app:
Congrats on building your LLM- application powered by Hugging Face, OpenAI, and Lang chain. Now let’s summarize what you have learned in this article.
That’s all, we have learnt how to build frontend and backend of an AI Storyteller application!
We started by laying down the foundation of the project, then leveraged the power of hugging face to use Open Source LLM Models for the task in hand, combined open AI with lang-chain to give custom context and later wrapped the entire application into an interactive web app using streamlit. We also applied security principles guide along the project.
Dive into the future of AI with GenAI Pinnacle. From training bespoke models to tackling real-world challenges like PII masking, empower your projects with cutting-edge capabilities. Start Exploring.
I hope you enjoyed building this AI storyteller application. Now put that into practice, I can’t wait to see what you all come up with. Thanks for sticking to the end. Here are a few resources to get you started.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.