Text-to-speech (TTS) technology has evolved rapidly, enabling natural and expressive voice generation for a wide range of applications. One standout model in this domain is Kokoro TTS, a cutting-edge model known for its efficiency and high-quality speech. Kokoro-82M is a text-to-speech model with 82 million parameters, yet despite this small size it delivers voice quality comparable to considerably larger models.
Text-to-Speech is a voice synthesis technology that converts written text into spoken words. It has evolved rapidly, from robotic and monotonous synthesized voices to expressive, natural, human-like speech. TTS has many applications, such as making digital content accessible to people with visual impairments or learning disabilities.
TTS has evolved from rule-based robotic voices to AI-powered natural speech synthesis.
Despite having only 82 million parameters, Kokoro-82M has become a state-of-the-art TTS model that produces high-quality, natural-sounding audio. It outperforms many larger models, making it a great option for developers looking to balance resource usage and performance.
The StyleTTS2 architecture models speech styles as latent random variables using diffusion models, producing human-sounding speech without requiring reference audio: the system samples an appropriate style for the provided text. It also uses adversarial training with large pre-trained speech language models (SLMs) such as WavLM.
ISTFTNet is a mel-spectrogram vocoder based on the inverse short-time Fourier transform (iSTFT). It is designed to achieve high-quality speech synthesis with reduced computational cost and training time.
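To make the iSTFT step concrete, here is a minimal PyTorch sketch (not code from ISTFTNet or Kokoro itself) that converts a waveform into a complex spectrogram and back with torch.stft and torch.istft. In ISTFTNet, the network predicts a compact spectral representation, and the inexpensive iSTFT produces the final waveform instead of a deep stack of neural upsampling layers; the signal and STFT settings below are illustrative.

import torch

waveform = torch.randn(24000)            # one second of dummy audio at 24 kHz
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)

# Forward STFT: waveform -> complex spectrogram
spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop,
                  window=window, return_complex=True)

# Inverse STFT: complex spectrogram -> waveform (the cheap final synthesis step)
reconstructed = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                            window=window, length=waveform.shape[0])

print(torch.allclose(waveform, reconstructed, atol=1e-4))  # near-perfect reconstruction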
The Kokoro-82M model performs strongly across several benchmarks. It took first place in the TTS Spaces Arena, outperforming larger models such as XTTS v2 (467M parameters) and MetaVoice (1.2B parameters). Even models trained on much larger datasets, such as Fish Speech with roughly a million hours of audio, failed to match Kokoro-82M's results. It achieved peak performance in under 20 epochs on a curated dataset of fewer than 100 hours of audio. This training efficiency, combined with high-quality output, makes Kokoro-82M a top performer in the text-to-speech domain.
It provides some excellent features such as:
Kokoro TTS supports multiple languages and voices, making it a versatile choice for global applications. The version used in this article (v0.19) ships American English and British English voicepacks, which are listed in the voice options later in this article.
One of Kokoro TTS's most notable features is its ability to generate custom voices. By combining several voice embeddings, users can create distinctive, personalized voices that improve user experience and brand identity.
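The exact blending recipe is not spelled out in the official documentation, so the following is a hedged sketch. It assumes the repository has already been cloned (so the voices/ directory exists) and that voicepacks are same-shaped tensors; the 70/30 weighting and the af_custom name are purely illustrative choices.

import torch

# Hypothetical voice blending: take a weighted average of two voicepack tensors.
bella = torch.load('voices/af_bella.pt', weights_only=True)
sarah = torch.load('voices/af_sarah.pt', weights_only=True)

custom_voice = 0.7 * bella + 0.3 * sarah          # blend the two embeddings
torch.save(custom_voice, 'voices/af_custom.pt')   # save as a new voicepack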
Because Kokoro is an open-source project, developers are free to use, modify, and integrate it into their applications. Its active community also drives ongoing improvements.
Unlike many cloud-based TTS solutions, Kokoro TTS can run locally, eliminating the need for external APIs.
With an architecture optimized for real-time performance and minimal resource usage, Kokoro TTS is suitable for deployment on edge devices and low-power systems. This efficiency ensures smooth speech synthesis without requiring high-end hardware.
Some of the voices provided by Kokoro-82M are:
Reference: GitHub
Let's understand how Kokoro-82M works by building a Gradio-powered speech generation application.
Install git-lfs and clone the Kokoro-82M repository from Hugging Face. Then install the required dependencies:
#Install dependencies silently
!git lfs install
!git clone https://huggingface.co/hexgrad/Kokoro-82M
%cd Kokoro-82M
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
!pip install -q phonemizer torch transformers scipy munch gradio
The modules we require are:
#Import necessary modules
from models import build_model
import torch
from kokoro import generate
from IPython.display import display, Audio
import gradio as gr
#Checks for GPU/cuda availability for faster inference
device = 'cuda' if torch.cuda.is_available() else 'cpu'
#Load the model
MODEL = build_model('kokoro-v0_19.pth', device)
Here we create a dictionary of available voices.
VOICE_OPTIONS = {
    'American English': ['af', 'af_bella', 'af_sarah', 'am_adam', 'am_michael'],
    'British English': ['bf_emma', 'bf_isabella', 'bm_george', 'bm_lewis'],
    'Custom': ['af_nicole', 'af_sky']
}
We define a function to load the selected voicepack and convert the input text into speech.
# Generate speech from text using the selected voice
def tts_generate(text, voice):
    try:
        # Load the selected voicepack tensor and move it to the active device
        voicepack = torch.load(f'voices/{voice}.pt', weights_only=True).to(device)
        # The first letter of the voice name ('a' or 'b') selects the language variant
        audio, out_ps = generate(MODEL, text, voicepack, lang=voice[0])
        # Kokoro outputs 24 kHz audio; also return the phoneme string
        return (24000, audio), out_ps
    except Exception as e:
        return None, str(e)
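Before wiring this into a UI, a quick sanity check helps confirm that the voicepack loads and audio is produced. The snippet below is illustrative (the sentence and voice choice are arbitrary) and assumes generation succeeds; on failure the function above returns None and the error message instead.

# Illustrative sanity check: synthesize one sentence and save it as a 24 kHz WAV
from scipy.io.wavfile import write as write_wav

(sample_rate, audio), phonemes = tts_generate("Kokoro is a lightweight TTS model.", "af_bella")
print(phonemes)                                # phoneme sequence used for synthesis
write_wav("sample.wav", sample_rate, audio)    # audio is a NumPy array at 24 kHz
display(Audio(data=audio, rate=sample_rate))   # play it inline in the notebook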
Next, define the app() function, which acts as a wrapper for the Gradio interface.
def app(text, voice_region, voice):
    """Wrapper for the Gradio UI."""
    if not text:
        # Return no audio and a prompt if the text box is empty
        return None, "Please enter some text."
    return tts_generate(text, voice)

with gr.Blocks() as demo:
    gr.Markdown("# Multilingual Kokoro-82M - Speech Generation")
    text_input = gr.Textbox(label="Enter Text")
    voice_region = gr.Dropdown(choices=list(VOICE_OPTIONS.keys()), label="Select Voice Type", value='American English')
    voice_dropdown = gr.Dropdown(choices=VOICE_OPTIONS['American English'], label="Select Voice")

    def update_voices(region):
        # Refresh the voice dropdown when a different region is selected
        return gr.update(choices=VOICE_OPTIONS[region], value=VOICE_OPTIONS[region][0])

    voice_region.change(update_voices, inputs=voice_region, outputs=voice_dropdown)

    output_audio = gr.Audio(label="Generated Audio")
    output_text = gr.Textbox(label="Phoneme Output")
    generate_btn = gr.Button("Generate Speech")
    generate_btn.click(app, inputs=[text_input, voice_region, voice_dropdown], outputs=[output_audio, output_text])

# Launch the web app
demo.launch()
When the user selects a voice region, the available voices update automatically.
The Kokoro-82M model is remarkable, but it has several limitations. Its training data is primarily synthetic and neutral, so it struggles to produce emotional speech such as laughter, anger, or grief, because these emotions were under-represented in the training set. The model's limitations stem from both architectural decisions and training data constraints. It lacks voice cloning capabilities because of its small training dataset of fewer than 100 hours. It relies on espeak-ng for grapheme-to-phoneme (G2P) conversion, which introduces a potential failure point in the text processing pipeline. While the 82-million-parameter count allows efficient deployment, it may not match the capabilities of billion-parameter diffusion transformers or large language models.
Kokoro TTS is a great option for developers and organizations that want to deploy high-quality voice synthesis without incurring API fees. Whether you're creating voice-enabled applications, engaging instructional content, improving video production, or developing assistive technology, Kokoro TTS offers a reliable and affordable alternative to proprietary TTS services. Its minimal footprint, open-source nature, and excellent voice quality make it a game changer in the world of text-to-speech technology. If you're searching for a lightweight, efficient, and customizable TTS model, Kokoro TTS is worth considering!
Kokoro-82M represents a major breakthrough in text-to-speech technology, delivering high-quality, natural-sounding speech despite its small size. Its efficiency, multi-language support, and real-time processing capabilities make it a compelling choice for developers seeking a balance between performance and resource usage. As TTS technology continues to evolve, models like Kokoro-82M pave the way for more accessible, expressive, and privacy-friendly speech synthesis solutions.
Q. What are the main TTS methodologies?
A. The main TTS methodologies are formant synthesis, concatenative synthesis, parametric synthesis, and neural network-based synthesis.
Q. What is speech concatenation?
A. Speech concatenation involves stitching together pre-recorded units of speech, such as phonemes, diphones, or words, to form complete sentences. Waveform generation then smooths the transitions between units to produce natural-sounding speech.
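As a toy illustration of that idea (not taken from any particular TTS system), the snippet below joins two stand-in speech units with a short linear crossfade so the transition is smooth rather than abrupt; real concatenative systems select units from a large database and use far more sophisticated smoothing.

import numpy as np

sr = 24000                              # sample rate
unit_a = np.random.randn(sr // 2)       # stand-ins for two pre-recorded units
unit_b = np.random.randn(sr // 2)

fade_len = sr // 100                    # 10 ms crossfade
fade_out = np.linspace(1.0, 0.0, fade_len)
fade_in = 1.0 - fade_out

# Concatenate the units, blending the overlapping 10 ms region
joined = np.concatenate([
    unit_a[:-fade_len],
    unit_a[-fade_len:] * fade_out + unit_b[:fade_len] * fade_in,
    unit_b[fade_len:],
])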
Q. What is a speech sounds database?
A. A speech sounds database is the foundational dataset for TTS systems. It contains a large collection of recorded speech samples and their corresponding text transcriptions. These databases are essential for training and evaluating TTS models.
Q. How can Kokoro TTS be integrated into other applications?
A. It can be exposed as an API endpoint and integrated into applications like chatbots, audiobooks, or voice assistants.
Q. What format is the generated speech in?
A. The generated speech is 24 kHz WAV audio, which is high quality and suitable for most applications.