In this article, we will build an app for Speaker Verification using UniSpeech-SAT and X-Vectors. We will leverage the Gradio Python package for creating a web interface for the model and deploy our app on Hugging Face Spaces.
Have you ever been in a situation where you needed to verify a speaker's identity? If you answered yes, or even if you're just curious about the topic, this post is for you. In this blog post, I'll show you how to use UniSpeech-SAT and X-Vectors to build a basic Speaker Verification app. To set the record straight, I can't promise that this is the finest application for speaker verification or that it is spoof-proof, but here's my attempt at creating an app that verifies whether two audio clips come from the same individual. If this piques your interest, please read on.
Researchers from the Harbin Institute of Technology and Microsoft proposed Universal Speech representation learning with Speaker Aware pre-Training (UniSpeech-SAT). To improve unsupervised speaker information extraction, two approaches were introduced and integrated into the HuBERT framework: utterance-wise contrastive learning and utterance mixing augmentation. The former improves single-speaker information extraction, which benefits downstream tasks such as speaker verification and identification. The latter is especially useful for multi-speaker tasks, e.g., speaker diarization. The utterance mixing method simulates multi-speaker speech for self-supervised pre-training when only single-speaker pre-training data is available.
We will be using the UniSpeech-SAT-Base-Plus model (microsoft/unispeech-sat-base-plus-sv) from the Hugging Face model hub to create the speaker verification app. The model was pre-trained on 60,000 hours of Libri-Light, 10,000 hours of GigaSpeech, and 24,000 hours of VoxPopuli, with the speech audio sampled at 16 kHz and trained with utterance and speaker contrastive loss. As a result, it is critical to make sure the speech input is also sampled at 16 kHz. Furthermore, the model is fine-tuned on the VoxCeleb1 dataset using an X-Vector head with an Additive Margin Softmax loss.
Furthermore, we will utilize Gradio's Interface class to build a UI for the machine learning model and deploy our app on Hugging Face Spaces.
The app will take two speech samples as input, apply SoX effects to both, extract features from each, and finally compute the cosine similarity between the two embeddings. If the cosine similarity score exceeds the threshold value we set, both audio clips are from the same individual; otherwise, speaker verification/authentication fails.
The following is a step-by-step guide to creating a Speaker Verification app using Gradio and Hugging Face Spaces.
If you don't already have a Hugging Face account, please visit the website and create one. Once you have an account, click on the profile icon at the top-right of the page and then on the 'New Space' button. You'll be directed to a new page where you'll be asked to name the repository you're about to create. Give the Space a name, choose 'Gradio' from the SDK options, and then click the 'Create new space' button. As a result, your app's repository will be created. You can watch the demo video that has been included below.
Now we will create a requirements.txt file listing the Python packages our app needs to run. Those dependencies will be installed in the backend via pip install -r requirements.txt.
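For reference, a minimal requirements.txt for this app could look like the one below; it simply mirrors the packages imported later in the code, with versions left unpinned (pin them if you need reproducible builds):

gradio
torch
torchaudio
transformers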
For this segment, I’ve broken down the code into sections for clarity and to make things easier to understand. We’ll go over the code one by one as we go.
We will start by importing the necessary dependencies. The AutoFeatureExtractor class will help in the extraction of audio features, while the AutoModelForAudioXVector class will load the pre-trained model (“microsoft/unispeech-sat-base-plus-sv”) for audio retrieval via the X-vector head.
# Importing all the necessary libraries
import torch
import gradio as gr
from torchaudio.sox_effects import apply_effects_file  # for applying the SoX effects to the audio input files
from transformers import AutoFeatureExtractor, AutoModelForAudioXVector
It is general practice in PyTorch to set up a variable called "device" that holds the device we are running on (CPU or GPU). By default, tensors are created on the CPU, and the model is also initialized on the CPU. As a result, one has to manually ensure that operations are performed on the GPU when one wants to leverage it for faster computation.
PyTorch provides an easy-to-use API for transferring tensors created on the CPU to the GPU, and new tensors are created on the same device as their parent tensor. The same rationale applies to the model. As a result, both the data and the model must be moved to the available device. Note, however, that without a paid hardware plan on Hugging Face Spaces, only CPU inference is available; in that case, the model and the tensors will simply remain on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
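As a minimal illustration of the point above (the tensors here are throwaway examples, not part of the app), moving data to the selected device and deriving new tensors from it looks like this:

x = torch.randn(2, 3).to(device)   # tensor moved to the selected device
y = x * 2                          # derived tensors are created on the same device
print(y.device)                    # prints "cuda:0" if a GPU is available, otherwise "cpu"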
We will now specify the audio effects that will be used. We favor mono-channel audio because it does not produce a stereophonic wide effect, which is why the channels have been merged into one. Also, as previously stated, because our model was pre-trained on speech audios sampled at 16 kHz, we must ensure that the speech input is sampled at 16 kHz as well.
EFFECTS = [
    ["remix", "-"],  # to merge all the channels
    ["channels", "1"],  # channel --> mono
    ["rate", "16000"],  # resample to 16000 Hz
    ["gain", "-1.0"],  # attenuation of -1 dB
    ["silence", "1", "0.1", "0.1%", "-1", "0.1", "0.1%"],  # remove silence
    # ["pad", "0", "1.5"],  # for adding 1.5 seconds at the end
    ["trim", "0", "10"],  # get the first 10 seconds
]
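As a quick sanity check (sample_audio.wav below is a hypothetical local file, used purely for illustration), the effect chain should return mono audio resampled to 16 kHz, which is what the feature extractor expects:

wav, sample_rate = apply_effects_file("sample_audio.wav", EFFECTS)
print(wav.shape)     # e.g. torch.Size([1, 160000]) for a clip trimmed to 10 seconds of mono audio
print(sample_rate)   # 16000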
model_name = "microsoft/unispeech-sat-base-plus-sv"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioXVector.from_pretrained(model_name).to(device)
# Setting the threshold value
THRESHOLD = 0.85
cosine_similarity = torch.nn.CosineSimilarity(dim=-1)
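To build some intuition for the decision rule before wiring everything together, here is a toy example with made-up 3-dimensional vectors (real speaker embeddings are much higher-dimensional): a cosine similarity close to 1.0 means the two embeddings point in nearly the same direction, which our app interprets as the same speaker.

a = torch.tensor([[0.90, 0.10, 0.20]])
b = torch.tensor([[0.88, 0.12, 0.25]])
print(cosine_similarity(a, b))   # ~tensor([0.9981]) -> above THRESHOLD, so "same speaker"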
Now we will create a function that takes the file paths of two audio recordings as input. At this stage, the SoX effects will be applied to both audio clips, after which features from both will be extracted, and finally the cosine similarity will be computed. I recommend you take a look at this tutorial, which explores Cosine Similarity in detail.
When the cosine similarity score exceeds the threshold value we set, both audios are from the same person; otherwise, speaker verification/authentication fails.
def similarity_fn(path1, path2):
    if not (path1 and path2):
        return "ERROR: Please record audio for *both* speakers!"

    # Applying the SoX effects to both the audio input files
    wav1, _ = apply_effects_file(path1, EFFECTS)
    wav2, _ = apply_effects_file(path2, EFFECTS)

    # Extracting features
    input1 = feature_extractor(wav1.squeeze(0), return_tensors="pt", sampling_rate=16000).input_values.to(device)
    input2 = feature_extractor(wav2.squeeze(0), return_tensors="pt", sampling_rate=16000).input_values.to(device)

    with torch.no_grad():
        emb1 = model(input1).embeddings
        emb2 = model(input2).embeddings

    emb1 = torch.nn.functional.normalize(emb1, dim=-1)
    emb2 = torch.nn.functional.normalize(emb2, dim=-1)
    # Move the score to the CPU before converting it to a NumPy array (needed when running on GPU)
    similarity = cosine_similarity(emb1, emb2).cpu().numpy()[0]

    if similarity >= THRESHOLD:
        return f"Similarity score is {similarity:.0%}. Audio belongs to the same person."
    else:
        return f"Similarity score is {similarity:.0%}. Audio doesn't belong to the same person. Authentication failed!"
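To test the function outside of Gradio, you could call it directly with two local file paths (the file names below are placeholders, and the printed message is just an illustration of the output format):

print(similarity_fn("first_recording.wav", "second_recording.wav"))
# e.g. "Similarity score is 91%. Audio belongs to the same person."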
Next, we will utilize Gradio's Interface class to build a UI for the machine learning model by providing (1) the function, (2) the desired input components, and (3) the desired output components, which will allow us to quickly prototype and test our model. In our case, the function is similarity_fn. Since we want to accomplish speaker verification by evaluating the cosine similarity score of two audio clips, we will provide two audio inputs, either recorded via the microphone or supplied as two different file paths. Since the intended output is a string, we will use gr.outputs.Textbox(label="Output Text") as the output component to display it. Finally, to launch the demo, we call the launch() method. In addition, we will set enable_queue=True to force inference requests to be served in a queue rather than in parallel threads; this is required to avoid timeouts when inference takes longer (> 1 minute).
⚠️ Additionally, if you wish to test audio files stored in the Space, make sure those audio files have been uploaded and that the paths to them are provided in examples (as shown in the code snippet below). It's worth noting that the components can be given either as instantiated objects or as string shortcuts.
To upload audio files, go to “Files and versions” –> “Contribute” –> “Upload Files” in the order stated here.
inputs = [
    gr.inputs.Audio(source="microphone", type="filepath", optional=True, label="Speaker #1"),
    gr.inputs.Audio(source="microphone", type="filepath", optional=True, label="Speaker #2"),
]
outputs = gr.outputs.Textbox(label="Output Text")
description = (
    "This app evaluates whether the given audio speech inputs belong to the same individual "
    "based on the Cosine Similarity score."
)

interface = gr.Interface(
    fn=similarity_fn,
    inputs=inputs,
    outputs=outputs,
    title="Voice Authentication with UniSpeech-SAT + X-Vectors",
    description=description,
    layout="horizontal",
    theme="grass",
    allow_flagging=False,
    live=False,
    examples=[
        ["cate_blanch.mp3", "cate_blanch_2.mp3"],
        ["cate_blanch.mp3", "denzel_washington.mp3"],
    ],
)
interface.launch(enable_queue=True)
If you run into an error, head straight to the “See log” tab, which is located right next to the spot where Runtime Error is shown.
Once the Space is up and running error-free, it should function as follows:
⚠️Note: Because we set the threshold value to 0.85, our app will prompt “Authentication failed” if the cosine similarity score falls below this value.
Link to the Space: https://huggingface.co/spaces/DrishtiSharma/Speaker-Verification-using-UniSpeech-SAT-and-XVectors
There is a pressing need to develop more robust and intelligent authentication solutions. Hence, in this blog, we created a Speaker Verification app that uses a cosine similarity score to determine whether two audio clips are from the same person. Future work may include combining speech/audio-based authentication with textual/visual/touch-based authentication to create a multi-modal authentication system.
To sum it up, the key takeaways from this article are as follows:
- UniSpeech-SAT combines utterance-wise contrastive learning and utterance mixing augmentation within the HuBERT framework to learn speaker-aware speech representations.
- Speaker verification can be performed by extracting speaker embeddings from two audio clips and comparing them with a cosine similarity score against a threshold (0.85 in our app).
- Gradio's Interface class makes it easy to wrap the model in a web UI, and Hugging Face Spaces makes deploying and sharing the app straightforward.
Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy Learning!