AI voice cloning has taken social media by storm and opened a world of creative possibilities. You have probably seen memes or AI voice-overs of famous personalities and wondered how they are made. Sure, many platforms such as Eleven Labs provide APIs, but can we do it for free with open-source software? The short answer is yes. The open-source ecosystem has TTS models and lip-syncing tools capable of convincing voice synthesis. So, in this article, we will explore open-source tools and models for voice cloning and lip-syncing, and walk through how you can clone a voice and lip-sync a video yourself.
This article was published as a part of the Data Science Blogathon.
As mentioned, our tech stack consists of OpenAI's Whisper, FFmpeg, Coqui-ai's xTTS model, and Wav2Lip. But before delving into the code, let's briefly discuss these tools. Thanks are due to the authors of these projects.
Whisper: Whisper is OpenAI's ASR (Automatic Speech Recognition) model. It is an encoder-decoder transformer trained on about 680k hours of diverse audio data and corresponding transcripts, which makes it very capable at multi-lingual transcription.
The encoder receives the log-mel spectrogram of 30-second audio chunks. Each encoder block uses self-attention to capture different parts of the audio signal. The decoder receives the encoder's hidden states along with learned positional encodings, and uses self-attention and cross-attention to predict the next token. At the end of the process, it outputs a sequence of tokens representing the recognized text. For more on Whisper, refer to the official repository.
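To make this concrete, here is a minimal sketch using Whisper's lower-level Python API (the input file sample.wav is an assumption; the high-level transcribe() call we use later wraps these same steps):
import whisper
model = whisper.load_model("base")
# Whisper consumes 30-second chunks as log-mel spectrograms
audio = whisper.load_audio("sample.wav")   # hypothetical input file
audio = whisper.pad_or_trim(audio)         # pad or trim to 30 seconds
mel = whisper.log_mel_spectrogram(audio).to(model.device)
# Detect the spoken language, then decode tokens into text
_, probs = model.detect_language(mel)
result = whisper.decode(model, mel, whisper.DecodingOptions(fp16=False))
print(max(probs, key=probs.get), result.text)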
Coqui TTS: TTS is an open-source library from Coqui-ai that hosts multiple text-to-speech models: end-to-end models like Bark, Tortoise, and xTTS; spectrogram models like Glow-TTS and FastSpeech; and vocoders like HiFi-GAN and MelGAN. Moreover, it provides a unified API for inferencing, fine-tuning, and training text-to-speech models. In this project, we will use xTTS, an end-to-end multi-lingual voice-cloning model. It supports 16 languages, including English, Japanese, Hindi, Mandarin, etc. For more information about TTS, refer to the official TTS repository.
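As a quick taste of that unified API, you can list every bundled model; model IDs follow a type/language/dataset/model scheme, and the exact output depends on your TTS version:
from TTS.api import TTS
# Prints model IDs such as "tts_models/multilingual/multi-dataset/xtts_v2"
print(TTS().list_models())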
Wav2Lip: Wav2Lip is a Python repository for the paper "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild." It uses a lip-sync discriminator to recognize face and lip movements, which works out great for dubbing voices. For more information, refer to the official repository. We will use this forked repository of Wav2Lip.
Now that we are familiar with the tools and models we will use, let's understand the workflow. It is a simple one:
1. Upload a video and resize it to 720p.
2. Extract the audio with FFmpeg and transcribe it with Whisper.
3. Translate the transcript into the target language.
4. Synthesize the translated text in the original speaker's voice with xTTS.
5. Lip-sync the synthesized audio to the original video with Wav2Lip.
Now, let's delve into the code.
This project requires significant RAM and GPU resources, so it is prudent to use a Colab runtime. The free-tier Colab provides about 12 GB of CPU RAM and a 15 GB T4 GPU, which should be sufficient for this project. So, head over to Colab and connect to a GPU runtime; the GPU significantly accelerates every step of this workflow, especially the lip-syncing model.
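Before proceeding, a quick sanity check that the runtime actually has a GPU attached:
import torch
# Should print True and something like "Tesla T4" on a Colab GPU runtime
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))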
Now, install the TTS and Whisper.
!pip install TTS
!pip install git+https://github.com/openai/whisper.git
Now, we will upload a video and resize it to 720p. Wav2Lip tends to perform better when videos are in 720p format. This can be done using FFmpeg.
#@title Upload Video
from google.colab import files
import os
import subprocess

uploaded = None
resize_to_720p = False

def upload_video():
    global uploaded
    global video_path  # Declare video_path as global to modify it
    uploaded = files.upload()
    for filename in uploaded.keys():
        print(f'Uploaded {filename}')
        if resize_to_720p:
            filename = resize_video(filename)  # Get the name of the resized video
        video_path = filename  # Update video_path with either original or resized filename
        return filename

def resize_video(filename):
    output_filename = f"resized_{filename}"
    cmd = f"ffmpeg -i {filename} -vf 'scale=-1:720' {output_filename}"
    subprocess.run(cmd, shell=True)
    print(f'Resized video saved as {output_filename}')
    return output_filename

# Create a form button that calls upload_video when clicked and a checkbox for resizing
import ipywidgets as widgets
from IPython.display import display

button = widgets.Button(description="Upload Video")
checkbox = widgets.Checkbox(value=False, description='Resize to 720p (better results)')
output = widgets.Output()

def on_button_clicked(b):
    with output:
        global video_path
        global resize_to_720p
        resize_to_720p = checkbox.value
        video_path = upload_video()

button.on_click(on_button_clicked)
display(checkbox, button, output)
This will display a button for uploading a video from your local device and a checkbox for enabling 720p resizing. You can also upload a video manually to the current Colab session and resize it with FFmpeg via a subprocess call, as sketched below.
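For reference, here is a minimal sketch of the manual route, assuming the file is already present in the session (the filename my_video.mp4 is an assumption):
import subprocess

video_path = "my_video.mp4"  # hypothetical file already uploaded to the Colab session

# Optional 720p resize with FFmpeg, mirroring what the widget does
subprocess.run(f"ffmpeg -i '{video_path}' -vf 'scale=-1:720' 'resized_{video_path}'", shell=True)
video_path = f"resized_{video_path}"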
Now that we have our video, the next thing we will do is extract audio using FFmpeg and use Whisper to transcribe.
# @title Audio extraction (24 bit) and whisper conversion
import subprocess

# Ensure video_path variable exists and is not None
if 'video_path' in globals() and video_path is not None:
    ffmpeg_command = f"ffmpeg -i '{video_path}' -acodec pcm_s24le -ar 48000 -q:a 0 -map a -y 'output_audio.wav'"
    subprocess.run(ffmpeg_command, shell=True)
else:
    print("No video uploaded. Please upload a video first.")

import whisper

model = whisper.load_model("base")
result = model.transcribe("output_audio.wav")
whisper_text = result["text"]
whisper_language = result['language']
print("Whisper text:", whisper_text)
This extracts the audio from the video in 24-bit PCM format and transcribes it with the Whisper base model. For better transcription, use the Whisper small or medium models, as shown below.
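Swapping in a larger checkpoint is a one-line change, at the cost of more VRAM and slower inference:
import whisper

# "small" or "medium" generally transcribe more accurately than "base"
model = whisper.load_model("small")
result = model.transcribe("output_audio.wav")
print(result["text"])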
Now, on to the voice-cloning part. As mentioned before, we will use Coqui-ai's xTTS model, one of the best open-source models out there for voice synthesis. Coqui-ai also provides many other TTS models for different purposes; do check them out. For our use case, voice cloning, we will use the xTTS v2 model.
Load the xTTS model. This is a big model with a size of 1.87 GB. So, this will take a while.
# @title Voice synthesis
from TTS.api import TTS
import torch
from IPython.display import Audio, display # Import the Audio and display modules
device = "cuda" if torch.cuda.is_available() else "cpu"
# Initialize TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
XTTS currently supports 16 languages. Here are the ISO codes of languages the xTTS model supports.
print(tts.languages)
['en','es','fr','de','it','pt','pl','tr','ru','nl','cs','ar','zh-cn','hu','ko','ja','hi']
Note: Languages like English and French do not have a character limit, while Hindi has a character limit of 250. A few other languages may have limits as well.
For this project, we will use Hindi; you can experiment with other languages as well.
So, the first thing we need now is to translate the transcribed text into Hindi. This can be done either with a Google Translate package or with an LLM. In my observation, GPT-3.5-Turbo performs much better than Google Translate, so we will use the OpenAI API for the translation (a free alternative is sketched after the OpenAI snippet).
import openai

client = openai.OpenAI(api_key="api_key")

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"Translate the following text to Hindi: {whisper_text}"}
    ]
)
# Extract the translated string from the response message
translated_text = completion.choices[0].message.content
print(translated_text)
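If you don't have an OpenAI API key, here is a rough sketch using a free Google Translate wrapper instead (the deep-translator package is an assumption on my part, installed with pip install deep-translator; translation quality will likely be lower than GPT-3.5-Turbo):
from deep_translator import GoogleTranslator

# Translate the Whisper transcript to Hindi ("hi") via Google Translate
translated_text = GoogleTranslator(source="auto", target="hi").translate(whisper_text)
print(translated_text)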
As we know, Hindi has a character limit, so we need to pre-process the text before passing it to the TTS model by splitting it into chunks of fewer than 250 characters.
# Split on the Hindi full stop (danda) and merge pieces into chunks under 250 characters
text_chunks = translated_text.split(sep="।")
final_chunks = [""]
for chunk in text_chunks:
    if not final_chunks[-1] or len(final_chunks[-1]) + len(chunk) < 250:
        chunk += "।"
        final_chunks[-1] += chunk.strip()
    else:
        final_chunks.append((chunk + "।").strip())
final_chunks
This is a very simple splitter; you can write a different one or use LangChain's recursive text splitter (a sketch follows). Once the text is chunked, we will pass each chunk to the TTS model and merge the resulting audio files using FFmpeg.
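Here is a rough equivalent using LangChain's splitter (this assumes pip install langchain-text-splitters; the separator list and chunk size are my own choices tuned to the 250-character limit):
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split on the danda first, then fall back to sentence and word boundaries
splitter = RecursiveCharacterTextSplitter(
    chunk_size=240,
    chunk_overlap=0,
    separators=["।", ".", " "],
)
final_chunks = splitter.split_text(translated_text)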
def audio_synthesis(text, file_name):
    tts.tts_to_file(
        text,
        speaker_wav='output_audio.wav',
        file_path=file_name,
        language="hi"
    )
    return file_name

file_names = []
for i in range(len(final_chunks)):
    file_name = audio_synthesis(final_chunks[i], f"output_synth_audio_{i}.wav")
    file_names.append(file_name)
As all the files have the same codec, we can easily merge them with FFmpeg's concat demuxer. To do this, create a text file named my_files.txt and list the file paths in it, as shown below (or generate it programmatically, as sketched after the listing).
# this is a comment
file 'output_synth_audio_0.wav'
file 'output_synth_audio_1.wav'
file 'output_synth_audio_2.wav'
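Instead of writing the file by hand, you can generate my_files.txt from the file_names list we built above; a small sketch:
# Write the FFmpeg concat list from the synthesized chunk files
with open("my_files.txt", "w") as f:
    for name in file_names:
        f.write(f"file '{name}'\n")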
Now, run the code below to merge files.
import subprocess
cmd = "ffmpeg -f concat -safe 0 -i my_files.txt -c copy final_output_synth_audio_hi.wav"
subprocess.run(cmd, shell=True)
This will output the final concatenated audio file. You can also play the audio in Colab.
from IPython.display import Audio, display
display(Audio(filename="final_output_synth_audio_hi.wav", autoplay=False))
Now, on to the lip-syncing part. To lip-sync our synthetic audio with the original video, we will use the Wav2Lip repository. To use Wav2Lip, we need to download the model checkpoints. But before that, if you are on a T4 GPU runtime, delete the xTTS and Whisper models from the current Colab session or restart the session to free up memory.
import torch

# Free GPU memory held by the xTTS and Whisper models
try:
    del tts
except NameError:
    print("Voice model already deleted")

try:
    del model
except NameError:
    print("Whisper model already deleted")

torch.cuda.empty_cache()
Now, clone the Wav2lip repository and install the checkpoints.
# @title Dependencies
%cd /content/
!git clone https://github.com/justinjohn0306/Wav2Lip
!cd Wav2Lip && pip install -r requirements_colab.txt
%cd /content/Wav2Lip
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip.pth' -O 'checkpoints/wav2lip.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/wav2lip_gan.pth' -O 'checkpoints/wav2lip_gan.pth'
!wget 'https://github.com/justinjohn0306/Wav2Lip/releases/download/models/mobilenet.pth' -O 'checkpoints/mobilenet.pth'
!pip install batch-face
Wav2Lip ships two model checkpoints for lip-syncing: wav2lip and wav2lip_gan. According to the authors, the GAN model requires less effort in face detection but produces slightly inferior results, while the non-GAN model can produce better results with more manual padding and rescaling of the detection box. Try both and see which one performs better for your video.
Run the inference with the model checkpoint path, video, and audio files.
%cd /content/Wav2Lip
# Detection box padding; adjust these values in case of poor results.
# The bottom padding is usually the biggest issue.
pad_top = 0
pad_bottom = 15
pad_left = 0
pad_right = 0
rescaleFactor = 1
video_path_fix = f"'../{video_path}'"
!python inference.py --checkpoint_path 'checkpoints/wav2lip_gan.pth' \
--face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
--pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor --nosmooth \
--outfile '/content/output_video.mp4'
This will output a lip-synced video. If the result doesn't look good, adjust the padding and rescale parameters and retry; an example retry is sketched below.
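For example, a retry with the non-GAN checkpoint, a larger bottom padding, and smoothing left on (dropping --nosmooth) might look like this; the padding value and output name are only illustrative:
pad_bottom = 30  # illustrative value; tune it for your video

!python inference.py --checkpoint_path 'checkpoints/wav2lip.pth' \
--face $video_path_fix --audio "/content/final_output_synth_audio_hi.wav" \
--pads $pad_top $pad_bottom $pad_left $pad_right --resize_factor $rescaleFactor \
--outfile '/content/output_video_retry.mp4'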
So, here is the repository for the notebook and a few samples.
GitHub Repository: sunilkumardash9/voice-clone-and-lip-sync
Video voice-cloning and lip-syncing technology have many use cases across industries. Here are a few where they can be beneficial.
Entertainment: The entertainment industry will be the most affected of all, and we are already witnessing the change. Voices of celebrities from current and bygone eras can be synthesized and reused. This also poses ethical challenges; synthesized voices should be used responsibly and within the bounds of the law.
Marketing: Personalized ad campaigns with familiar and relatable voices can greatly enhance brand appeal.
Communication: Language has always been a barrier to all sorts of activities, and cross-language communication is still a challenge. Real-time end-to-end translation that preserves one's accent and voice would revolutionize the way we communicate. This might become a reality within a few years.
Content Creation: Content creators will no longer depend on translators to reach a bigger audience. With efficient voice cloning and lip-syncing, cross-language content creation will be easier. The podcast and audiobook narration experience can also be enhanced with voice synthesis.
Voice synthesis is one of the most sought-after use cases of generative AI, and it has the potential to revolutionize the way we communicate. Ever since the rise of civilizations, the language barrier between communities has been a hurdle to forging deeper cultural and commercial relationships. With AI voice synthesis, this gap can be bridged. So, in this article, we explored the open-source way of doing voice cloning and lip-syncing.
I hope you liked the article and now have a clear understanding of how to clone a voice and lip-sync a video using open-source tools.
Q. Is it legal to clone someone's voice?
A. Cloning a voice can be illegal if it infringes on someone's rights. Getting permission from the person before cloning their voice is the right way to go about it.
Q. Which AI tool is used for lip-syncing?
A. The AI tool mentioned for lip-syncing is SyncVoice. It helps match the movement of the lips with the audio.
Q. Which AI makes your lips move?
A. The AI that makes your lips move is also SyncVoice. It uses advanced algorithms to synchronize lip movements with speech.
Q. Is lip-syncing illegal?
A. Lip-syncing itself isn't illegal, but using it to deceive or misrepresent in certain contexts, like performances or presentations, could be considered fraud or breach of contract.
Q. What are the benefits of voice cloning?
A. Voice cloning can be beneficial for a range of use cases, such as content creation, narration in games and movies, ad campaigns, etc.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.