“Less is more,” as architect Ludwig Mies van der Rohe famously said, and this captures what summarization means. Summarization is a critical tool for reducing voluminous textual content into succinct, relevant morsels, suited to today’s fast-paced information consumption. In text applications, summarization aids information retrieval and supports decision-making. The integration of Generative AI, like OpenAI GPT-3-based models, has revolutionized this process by not only extracting key elements from text but also generating coherent summaries that retain the source’s essence. Interestingly, Generative AI’s capabilities extend beyond text to video summarization. This involves extracting pivotal scenes, dialogues, and concepts from videos, creating abridged representations of the content. You can achieve video summarization in many different ways, including generating a short summary video, performing video content analysis, highlighting key sections of the video, or creating a textual summary of the video using video transcription.
The OpenAI Whisper API leverages automatic speech recognition technology to convert spoken language into written text, improving the accuracy and efficiency of text summarization. The Hugging Face Chat API, in turn, provides access to state-of-the-art open large language models.
In this article we will learn about:
This article was published as a part of the Data Science Blogathon.
Video summarization is the process of extracting meaningful information from a video. Deep learning models track and identify objects and actions in a video and identify its scenes. Some popular techniques for video summarization are:
This technique reduces the video to a limited number of representative still frames, called keyframes or keyshots. A shorter video assembled from such keyshots is called a video skim.
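A minimal keyframe selector can be sketched by scoring how much each frame differs from its predecessor. The function below, its flat-list frame format, and the mean-difference score are illustrative stand-ins for a real extractor, which would work on decoded OpenCV frames and colour histograms:

```python
def select_keyframes(frames, k=3):
    """Pick the k frames that differ most from their predecessor.

    `frames` is a sequence of equally sized grayscale frames, each a
    flat list of pixel intensities. Frame 0 is always kept as a
    reference, so k + 1 indices are returned in temporal order.
    """
    diffs = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        # Mean absolute pixel difference between consecutive frames.
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    # Indices of the k largest jumps, shifted by one to frame indices.
    top = sorted(sorted(range(len(diffs)), key=diffs.__getitem__)[-k:])
    return [0] + [i + 1 for i in top]
```

Frames 3 and 6 below start visually distinct segments, so they are picked alongside the reference frame 0.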
Video shots are non-interrupted, continuous series of frames. Shot boundary recognition detects transitions between shots, like cuts, fades, or dissolves, and chooses frames from each shot to build a summary. The major steps to extract a continuous short video summary from a longer video are:
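The shot boundary recognition described above can be sketched as thresholding the histogram distance between consecutive frames. The 8-bin histogram, the flat-list frame format, and the 0.5 threshold are simplifications for illustration, not a production cut detector:

```python
from collections import Counter

def detect_cuts(frames, threshold=0.5):
    """Flag shot boundaries where the intensity histogram changes sharply.

    `frames` is a sequence of flat lists of pixel intensities;
    `threshold` is the fraction of histogram mass that must move
    between consecutive frames to count as a hard cut. Returns the
    frame indices where a new shot begins.
    """
    def hist(frame):
        # Normalised 8-bin intensity histogram (pixel values 0-255).
        counts = Counter(min(p // 32, 7) for p in frame)
        return [counts.get(b, 0) / len(frame) for b in range(8)]

    cuts = []
    prev = hist(frames[0])
    for i in range(1, len(frames)):
        cur = hist(frames[i])
        # L1 distance between histograms, halved so it lies in [0, 1].
        dist = sum(abs(a - b) for a, b in zip(prev, cur)) / 2
        if dist > threshold:
            cuts.append(i)
        prev = cur
    return cuts
```

A real detector would also smooth the distance signal over several frames to catch gradual transitions such as fades and dissolves, which a single-frame threshold misses.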
Here we try to identify the human actions performed in the video; this is a widely used application of video analytics. We break the video into small subsequences instead of individual frames and estimate the action performed in each segment using classification and pattern recognition techniques such as hidden Markov models (HMMs).
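The windowing step above can be sketched as follows; the window and stride sizes are hypothetical defaults that would be tuned for whichever classifier (an HMM, a 3D CNN, etc.) consumes the subsequences:

```python
def segment_windows(n_frames, window=16, stride=8):
    """Split a video of n_frames into overlapping subsequences.

    Each window of `window` frames, advanced by `stride` frames, would
    be fed to an action classifier. Returns (start, end) index pairs
    with `end` exclusive; a video shorter than one window becomes a
    single segment.
    """
    if n_frames < window:
        return [(0, n_frames)]
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]
```

Overlapping windows trade extra computation for robustness: an action that straddles one window boundary still falls wholly inside a neighbouring window.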
In this article we use a single-modal approach: we take a single aspect of the video, its audio, convert it to text, and then generate a textual summary from that text.
In a multi-modal approach, we combine information from several modalities, such as audio, visuals, and text, giving a holistic understanding of the video content for more accurate summarization.
Before diving into the implementation of our video summarization, we should first know its applications. Below are some examples of video summarization across a variety of fields and domains:
OpenAI’s Whisper is an automatic speech recognition (ASR) model used for transcribing speech audio into text.
It is based on the transformer architecture, which stacks encoder and decoder blocks with an attention mechanism that propagates information between them. Whisper takes the audio recording, divides it into 30-second pieces, and processes each piece individually. For each 30-second piece, the encoder encodes the audio while preserving the position of each spoken word, and the decoder uses this encoded information to determine what was said.
From this information the decoder predicts tokens, which roughly correspond to the words pronounced. It then repeats this process for the following word, using the same information to help identify the next word that makes the most sense.
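Whisper’s pad-or-trim chunking step can be sketched as follows. This is a simplified illustration of the idea, not Whisper’s actual implementation; the 16 kHz sample rate matches what Whisper expects:

```python
def chunk_audio(samples, sr=16000, chunk_s=30):
    """Split raw audio samples into 30-second pieces, as Whisper does.

    `samples` is a flat list of PCM samples at sample rate `sr`; the
    final chunk is zero-padded to the full 30-second length, mirroring
    Whisper's pad-or-trim step.
    """
    size = sr * chunk_s
    chunks = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        chunk = chunk + [0] * (size - len(chunk))  # zero-pad the tail
        chunks.append(chunk)
    return chunks
```

A 45-second recording thus becomes two fixed-size inputs: one full chunk and one half-filled, zero-padded chunk.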
!pip install yt-dlp openai-whisper hugchat
import yt_dlp
import whisper
from hugchat import hugchat
# Function for saving audio from the input YouTube video id
def download(video_id: str) -> str:
    video_url = f'https://www.youtube.com/watch?v={video_id}'
    ydl_opts = {
        'format': 'm4a/bestaudio/best',
        'paths': {'home': 'audio/'},
        'outtmpl': {'default': '%(id)s.%(ext)s'},
        'postprocessors': [{
            'key': 'FFmpegExtractAudio',
            'preferredcodec': 'm4a',
        }]
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([video_url])
    if error_code != 0:
        raise Exception('Failed to download video')
    return f'audio/{video_id}.m4a'
# Call the function with the video id (just the id, without URL
# parameters such as &t=99s)
file_path = download('A_JQK_k4Kyc')
# Load whisper model
whisper_model = whisper.load_model("tiny")
# Transcribe audio function
def transcribe(file_path: str) -> str:
    # `fp16` defaults to `True` (half-precision, intended for GPU);
    # set it to `False` to run cleanly on CPU.
    transcription = whisper_model.transcribe(file_path, fp16=False)
    return transcription['text']
# Call the transcriber function with the saved audio's file path
transcript = transcribe(file_path)
print(transcript)
Note: to use the Hugging Chat API, we need to log in or sign up on the Hugging Face platform. After that, we pass our Hugging Face credentials in place of “username” and “password”.
from hugchat.login import Login
# login
sign = Login("username", "password")
cookies = sign.login()
sign.saveCookiesToDir("/content")
# load cookies from usercookies
cookies = sign.loadCookiesFromDir("/content") # Detects whether the JSON file exists, returns cookies if it does, and raises an Exception if it does not.
# Create a ChatBot
chatbot = hugchat.ChatBot(cookies=cookies.get_dict()) # or cookie_path="usercookies/<email>.json"
print(chatbot.chat("Hi!"))
#Summarise Transcript
print(chatbot.chat('''Summarize the following :-'''+transcript))
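Long transcripts can exceed the chat model’s context window. A common workaround is map-reduce style summarization: split the transcript, summarize each piece, then summarize the partial summaries. The helper below and its character-based limit are illustrative, not part of the hugchat API; the lambda shows how the article’s `chatbot.chat` call could be plugged in:

```python
def summarize_long(transcript, summarize, max_chars=3000):
    """Summarize a transcript longer than the model's context window.

    `summarize` is any callable mapping text to a shorter summary,
    e.g. lambda t: chatbot.chat('Summarize the following :-' + t).
    The transcript is split on sentence boundaries into chunks of at
    most `max_chars` characters (a rough stand-in for a proper token
    count), each chunk is summarized, and the partial summaries are
    summarized once more.
    """
    chunks, current = [], ''
    for sentence in transcript.split('. '):
        if len(current) + len(sentence) > max_chars and current:
            chunks.append(current)
            current = ''
        current += sentence + '. '
    if current:
        chunks.append(current)
    partials = [summarize(c) for c in chunks]
    # A single chunk needs no second reduction pass.
    return partials[0] if len(partials) == 1 else summarize(' '.join(partials))
```

The second pass trades some fidelity for fit: details dropped from a chunk summary cannot be recovered, so `max_chars` should be set as high as the model allows.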
In conclusion, the concept of summarization is a transformative force in information management. It’s a powerful tool that distills voluminous content into concise, meaningful forms, tailored to the fast-paced consumption of today’s world.
Through the integration of Generative AI models like OpenAI’s GPT-3, summarization has transcended its traditional boundaries, evolving into a process that not only extracts but generates coherent and contextually accurate summaries.
The journey into video summarization unveils its relevance across diverse sectors. We have shown how audio extraction, transcription using Whisper, and summarization through Hugging Face Chat can be seamlessly integrated to create textual summaries of videos.
1. Generative AI: Video summarization can be achieved using generative AI technologies such as LLMs and ASR.
2. Applications in Fields: Video summarization is beneficial in many important fields where one has to analyze large amounts of video to mine crucial information.
3. Basic Implementation: In this article we explored a basic code implementation of video summarization based on the audio dimension.
4. Model Architecture: We also learnt about the basic architecture of the OpenAI Whisper model and its process flow.
A. The Whisper API is limited to 50 calls per minute. There is no audio length limit, but only files up to 25 MB can be uploaded. One can reduce an audio file’s size by decreasing its bitrate.
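One way to lower the bitrate is re-encoding with ffmpeg. The sketch below only builds the command line; the `32k` bitrate and file names are arbitrary examples, and the returned list would be executed with `subprocess.run(cmd, check=True)`:

```python
def ffmpeg_shrink_cmd(src, dst, bitrate='32k'):
    """Build an ffmpeg command that re-encodes audio at a lower bitrate.

    Lowering `-b:a` shrinks the output so it fits under the 25 MB
    upload limit; `-vn` drops any video stream from the container.
    """
    return ['ffmpeg', '-i', src, '-b:a', bitrate, '-vn', dst]
```

Speech remains intelligible at fairly low bitrates, so aggressive settings usually cost little transcription accuracy.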
A. The following file formats are supported: m4a, mp3, webm, mp4, mpga, wav, and mpeg.
A. Some of the major alternatives for automatic speech recognition are Twilio Voice, Deepgram, Azure Speech to Text, and Google Cloud Speech-to-Text.
A. Major difficulties include comprehending diverse accents of the same language and the need for specialized training for applications in specialized fields.
A. Advanced research is taking place in speech recognition, such as decoding imagined speech from EEG signals using neural architectures. This allows people with speech disabilities to communicate their thoughts to the outside world with the help of devices. One such interesting paper is linked here.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.