Imagine you’re creating a podcast or crafting a virtual assistant that sounds as natural as a real conversation. That’s where ChatTTS comes in. This cutting-edge text-to-speech tool turns your written words into lifelike audio, capturing nuances and emotions with incredible precision. Picture this: you type out a script, and ChatTTS brings it to life with a voice that feels genuine and expressive. Whether you’re developing engaging content or enhancing user interactions, ChatTTS offers a glimpse into the future of seamless, natural-sounding dialogues. Dive in to see how this tool can transform your projects and make your voice heard in a whole new way.
This article was published as a part of the Data Science Blogathon.
ChatTTS, a voice generation tool, marks a significant leap in AI, enabling seamless spoken conversations. As demand for voice generation grows alongside text generation and LLMs, ChatTTS makes audio dialogue more accessible and complete. Engaging in a dialogue with this tool is a breeze, and extensive data mining and pretraining only amplify its efficiency.
ChatTTS is one of the best open-source text-to-speech models for a wide range of applications. It performs well in both English and Chinese, and with over 100,000 hours of training data, the dialogue it produces in both languages sounds natural.
ChatTTS, with its unique features, stands out from other large language models, which can sound generic and lack expressiveness. Trained on large volumes of English and Chinese speech, this tool greatly advances AI voice generation. Other text-to-audio models, like Bark and VALL-E, offer similar features, but ChatTTS edges them out in some aspects.
For example, when comparing ChatTTS with Bark, there is a notable difference with the long-form input.
Bark's output is usually no longer than 13 seconds because of its GPT-style architecture. Its inference speed can also be slow on older GPUs, default Colab instances, or CPUs, although it does run on enterprise GPUs, PyTorch, and CPUs.
ChatTTS, on the other hand, has good inference speed: it can generate audio corresponding to around seven semantic tokens per second. Its emotion control also gives it an edge over VALL-E.
Let’s delve into some of the unique features that make ChatTTS a valuable tool for AI voice generation:
This model is trained to carry out dialogue tasks expressively. It reproduces natural speech patterns and supports speech synthesis for multiple speakers. This makes it easier for users, especially those with voice synthesis needs.
The ChatTTS team is doing a lot to address the tool's safety and ethical concerns. There is an understandable worry about abuse of this model, and measures such as deliberately degrading the quality of the released audio model and ongoing work on an open-source tool to detect artificial speech are good examples of ethical AI development.
This is another step toward the security and control of the model. The ChatTTS team has shown its commitment to reliability; adding watermarks and integrating the model with large language models are visible signs of addressing the safety and reliability concerns that may arise.
This model has a few more standout qualities. One vital feature is that users can control the output and certain speech variations. The next section explains this better.
The level of controllability this model gives users is what makes it unique. When adding text, you can include tokens, which act as embedded commands controlling oral features such as pauses and laughter.
This token concept works at two levels: sentence-level control and word-level control. Sentence-level control introduces tokens such as laughter [laugh_(0-2)] and pauses, while word-level control inserts breaks around specific words to make the sentence more expressive.
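To make the token syntax concrete, here is a small, hypothetical helper (not part of the ChatTTS API) that finds and strips these control tokens from a prompt; the token names mirror those used in this article:

```python
import re

# Control tokens of the form [name_level] (e.g. [laugh_2], [break_6], [oral_2],
# [speed_5]) or bare markers like [uv_break], as used in ChatTTS prompts.
TOKEN_RE = re.compile(r'\[(?:laugh|break|oral|speed)_[0-9]\]|\[uv_break\]')

def strip_control_tokens(text):
    """Return the plain text and the list of control tokens found in it."""
    tokens = TOKEN_RE.findall(text)
    plain = TOKEN_RE.sub('', text)
    return ' '.join(plain.split()), tokens

plain, tokens = strip_control_tokens(
    'so we found [uv_break] being competitive [laugh_2] was huge [break_6]')
print(plain)   # so we found being competitive was huge
print(tokens)  # ['[uv_break]', '[laugh_2]', '[break_6]']
```

A helper like this is handy for logging or displaying the plain text alongside the token-annotated prompt you send to the model.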
Using some parameters, you can refine the output during audio generation. This is another crucial feature that makes this model more controllable.
This concept is similar to sentence-level control, as users can control specific identities, such as speaker identity, speech variations, and decoding strategies.
Generally, text pre-processing and output fine-tuning are two critical features that give ChatTTS its high level of customization and ability to generate expressive voice conversations.
params_infer_code = {'prompt':'[speed_5]', 'temperature':.3}
params_refine_text = {'prompt':'[oral_2][laugh_0][break_6]'}
ChatTTS has powerful potential, with fine-tuning capabilities and seamless LLM integration. The community is looking to open-source a pretrained base model for further development and to recruit more researchers and developers to improve it.
There have also been talks of releasing a version of this model with multiple emotion controls and LoRA training code. This development could drastically reduce the difficulty of training, since ChatTTS already has LLM integration.
This model also supports a web user interface where you can input text, change parameters, and generate audio interactively. This is possible with the webui.py script.
python webui.py --server_name 0.0.0.0 --server_port 8080 --local_path /path/to/local/models
We’ll highlight this model’s simple steps to run efficiently, from downloading the code to fine-tuning.
!rm -rf /content/ChatTTS
!git clone https://github.com/2noise/ChatTTS.git
!pip install -r /content/ChatTTS/requirements.txt
!pip install nemo_text_processing WeTextProcessing
!ldconfig /usr/lib64-nvidia
This code sets up the environment. Cloning the repository from GitHub fetches the project's latest version, and the remaining lines install the necessary dependencies and ensure that the system libraries are correctly configured for NVIDIA GPUs.
The next step in running inference involves importing the necessary libraries for your script: you'll need torch, ChatTTS, and Audio from IPython.display. You can listen to the audio in an .ipynb notebook, or save it as a '.wav' file using a third-party library such as FFmpeg or SoundFile.
The code should look like the block below:
import torch
torch._dynamo.config.cache_size_limit = 64
torch._dynamo.config.suppress_errors = True
torch.set_float32_matmul_precision('high')
import ChatTTS
from IPython.display import Audio
This step involves initiating the model using the ‘chat’ as an instance in the class. Then, load the ChatTTS pre-trained data.
chat = ChatTTS.Chat()
# Use force_redownload=True if the weights updated.
chat.load_models(force_redownload=True)
# Alternatively, if you downloaded the weights manually, set source='local' and point local_path to your directory.
# chat.load_models(source='local', local_path='YOUR LOCAL PATH')
texts = ["So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with.",]*3 \
+ ["我觉得像我们这些写程序的人,他,我觉得多多少少可能会对开源有一种情怀在吧我觉得开源是一个很好的形式。现在其实最先进的技术掌握在一些公司的手里的话,就他们并不会轻易的开放给所有的人用。"]*3
wavs = chat.infer(texts)
The model performs batch inference when given a list of texts. The 'Audio' function from IPython plays the generated audio.
Audio(wavs[0], rate=24_000, autoplay=True)
Audio(wavs[3], rate=24_000, autoplay=True)
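As noted earlier, the generated audio can also be saved as a '.wav' file. Here is a minimal sketch using only Python's standard wave module; a one-second sine tone stands in for a wavs[0] array from chat.infer, which is produced at 24 kHz:

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # ChatTTS generates audio at 24 kHz

def save_wav(samples, path, rate=SAMPLE_RATE):
    """Write an iterable of float samples in [-1, 1] as 16-bit mono PCM WAV."""
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)   # mono
        f.setsampwidth(2)   # 16-bit PCM
        f.setframerate(rate)
        pcm = b''.join(struct.pack('<h', int(max(-1.0, min(1.0, s)) * 32767))
                       for s in samples)
        f.writeframes(pcm)

# Placeholder waveform standing in for wavs[0] from chat.infer:
tone = [0.5 * math.sin(2 * math.pi * 440 * t / SAMPLE_RATE)
        for t in range(SAMPLE_RATE)]  # one second of a 440 Hz tone
save_wav(tone, 'output.wav')
```

In practice you would flatten the model's output array (e.g. wavs[0] squeezed to one dimension) and pass it in place of the sine tone; libraries like SoundFile do the same job with less boilerplate.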
wav = chat.infer('四川美食可多了,有麻辣火锅、宫保鸡丁、麻婆豆腐、担担面、回锅肉、夫妻肺片等,每样都让人垂涎三尺。', \
params_refine_text=params_refine_text, params_infer_code=params_infer_code)
This shows how parameters for speed, variability, and specific speech characteristics are defined and passed to inference.
Audio(wav[0], rate=24_000, autoplay=True)
This is another great customization feature the model offers. Sampling a random speaker to generate audio with ChatTTS is seamless, made possible by the random speaker embedding.
You can listen to the generated audio using an ipynb file or save it as a .wav file using a third-party library.
rand_spk = chat.sample_random_speaker()
params_infer_code = {'spk_emb' : rand_spk, }
wav = chat.infer('四川美食确实以辣闻名,但也有不辣的选择。比如甜水面、赖汤圆、蛋烘糕、叶儿粑等,这些小吃口味温和,甜而不腻,也很受欢迎。', \
params_refine_text=params_refine_text, params_infer_code=params_infer_code)
Two-stage control allows you to perform text refinement and audio generation separately. This is possible with the 'refine_text_only' and 'skip_refine_text' parameters, as shown in the code block below:
text = "So we found being competitive and collaborative was a huge way of staying motivated towards our goals, so one person to call when you fall off, one person who gets you back on then one person to actually do the activity with."
refined_text = chat.infer(text, refine_text_only=True)
refined_text
wav = chat.infer(refined_text)
Audio(wav[0], rate=24_000, autoplay=True)
The second stage inserts the breaks and pauses into the speech during audio generation.
text = 'so we found being competitive and collaborative [uv_break] was a huge way of staying [uv_break] motivated towards our goals, [uv_break] so [uv_break] one person to call [uv_break] when you fall off, [uv_break] one person who [uv_break] gets you back [uv_break] on then [uv_break] one person [uv_break] to actually do the activity with.'
wav = chat.infer(text, skip_refine_text=True)
Audio(wav[0], rate=24_000, autoplay=True)
The integration of ChatTTS with LLMs means it can refine text and generate audio from users’ questions in these models. Here are a few steps to break down this process.
from ChatTTS.experimental.llm import llm_api
This line imports 'llm_api', which is used to create the API client. We will use DeepSeek as the provider; its API facilitates seamless interactions in text-based applications. To obtain a key, go to the DeepSeek API page, choose the 'Access API' option, sign up for an account, and create a new key.
API_KEY = ''
client = llm_api(api_key=API_KEY,
base_url="https://api.deepseek.com",
model="deepseek-chat")
user_question = '四川有哪些好吃的美食呢?'
text = client.call(user_question, prompt_version = 'deepseek')
print(text)
text = client.call(text, prompt_version = 'deepseek_TN')
print(text)
You can then generate audio from the text the LLM produced. Here is how:
params_infer_code = {'spk_emb' : rand_spk, 'temperature':.3}
wav = chat.infer(text, params_infer_code=params_infer_code)
A voice generation tool that converts text to audio is invaluable today. The wave of AI chatbots, virtual assistants, and automated voices across many industries makes ChatTTS a big deal. Here are some real-life applications of this model.
ChatTTS indicates a massive leap in AI generation, with natural and smooth conversations in both English and Chinese. The best part of this model is its controllability, which allows users to customize and, as a result, brings expressiveness to the speech. As the ChatTTS community continues to develop and refine this model, its potential for advancing text-to-speech technology is bright.
A. Developers can integrate ChatTTS into their applications using APIs and SDKs.
A. With over 100,000 hours of training data, this model can efficiently perform voice generation in English and Chinese.
A. No, ChatTTS is intended for research and academic applications only. It should not be used for commercial or legal purposes. The model’s development includes ethical considerations to ensure safe and responsible use.
A. This model is valuable in various applications. One of its most prominent uses is a conversational tool for large language model assistants. ChatTTS can generate dialogue speech for video introduction, educational training, and other applications that require text-to-speech content.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.