Hey all👋
I am sure you must have heard of NVIDIA NeMo recently. It's a great library for creating NLP models in just a few lines of code, and needless to say, the team has done a great job.
So, as with everything else, I wanted to try it out for myself and create something unique. This article covers a few snippets of my journey, along with recreating the code from scratch. Have a good time reading it😀.
Like every proficient Data Scientist, I picked the problem statement of creating an ASR (Automatic Speech Recognition) model.
The goal here was to create a model that works similarly to the actual Google Assistant / YT auto-captioning services, but only in a single language: English.
To achieve this, I planned to use the Mozilla Common Voice Dataset 7.0, a 65 GB corpus of spoken English sentences. Now the question was how to download such a huge file and process it at the same time. This is where Google helps, and a quick search landed me on a script that did the heavy lifting, which I quickly used, and suddenly everything changed👀
If you have read the above dilemma, the problem statement is unambiguous: making the script work. So let's dive into the exact walkthrough of how it was fixed.
The NVIDIA NeMo script we are modifying is originally by SeanNaren and is hosted at this link. So before changing it, let's define what we are supposed to do.
👉 Download, Store & Unzip: We start by downloading the dataset using the mozilla_voice_bundler URL, storing it in the directory specified by data_root, and finally unzipping the tar file.
👉 Processing Data: After extracting, the next part focuses on parsing the data: converting the given mp3 files (listed in the tsv files) to wav ones, then passing them to the sox library to get the duration of each voice sample. This step also captures the path where the new files are stored, along with the text.
👉 Creating Manifest: Finally, with all the info gathered, the last part appends the extracted values to create the manifests passed to the NeMo models.
Having defined the explicit goals, we can now move to the fun part, Coding!
Here are a few plans to keep in mind:
Let's start.
Pretty straightforward here, simple imports! tqdm and logging are optional.
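The original gist is not shown here, but a likely reconstruction of the import section, assuming the standard-library pieces plus the two third-party packages the script relies on (pysox and tqdm, both pip-installable):

```python
# Standard-library imports used across the script.
import argparse         # command-line options
import csv              # reading the .tsv description files
import json             # writing manifest entries
import logging          # optional: status messages
import multiprocessing  # parallel mp3 -> wav conversion
import os
import tarfile          # unpacking the downloaded bundle
import urllib.request   # fetching the dataset archive

# Third-party imports (pip install sox tqdm); commented out here so the
# snippet loads without them — uncomment them in the real script.
# import sox
# from tqdm import tqdm
```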
After the imports, the next step is to define the command-line args:
Two things worth noting here are default = "cv-corpus-7.0-2021-07-21" and default = "hi". For general readers, the above code will greet you with cmd-like options and, if nothing is passed, take the default values. To learn more, use the --help / -h option.
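As a hedged sketch, the argument parser could look like this; the two defaults quoted above come from the article, while the remaining flag names (data_root, num_workers, sample_rate) are assumptions:

```python
import argparse
import multiprocessing


def get_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Download and process the Mozilla Common Voice dataset.")
    parser.add_argument("--data_root", default="./", type=str,
                        help="directory to store the dataset in")
    parser.add_argument("--version", default="cv-corpus-7.0-2021-07-21",
                        type=str, help="Common Voice bundle version")
    parser.add_argument("--language", default="hi", type=str,
                        help="language code of the bundle, e.g. 'en'")
    parser.add_argument("--num_workers", default=multiprocessing.cpu_count(),
                        type=int, help="workers for the mp3 -> wav conversion")
    parser.add_argument("--sample_rate", default=16000, type=int,
                        help="target wav sample rate")
    return parser.parse_args(argv)
```

Running the script with --help prints all options; anything omitted falls back to the defaults above.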
One key thing to change is the URL format that downloads the dataset from the Amazon S3 bucket, which keeps changing from time to time. Currently, the link looks similar to:
https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-7.0-2021-07-21/cv-corpus-7.0-2021-07-21-en.tar.gz
Since the original script can't fetch it any more, we must match the current format by structuring the URL as basic_url/{}/{}-{}.tar.gz, where the first two {} placeholders take the version number and the last {} takes the language code. The format below does just that:
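A minimal sketch of that format string; the bucket URL is the one quoted above, while the helper name bundle_url is mine:

```python
# Current S3 bucket layout: <base>/<version>/<version>-<language>.tar.gz
COMMON_VOICE_URL = ("https://mozilla-common-voice-datasets.s3.dualstack."
                    "us-west-2.amazonaws.com/{}/{}-{}.tar.gz")


def bundle_url(version, language):
    # the first two {} take the version, the last one the language code
    return COMMON_VOICE_URL.format(version, version, language)
```

Calling bundle_url("cv-corpus-7.0-2021-07-21", "en") reproduces the link shown above.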
One aspect of functional programming is to split the helper functions out separately and later use them in the main function. So in this section, we will be editing our helper functions process_files.py and manifest.py to do the heavy lifting.
process_files.py📜
This function takes our tsv file, the data directory data_root, and the number of cores to use. Its main job is to read the tsv file's description of each clip, navigate to the given file path, perform the mp3 -> wav conversion, process the text, and save the result to data_root, the given directory.
For simplicity, let's split it into pieces/lines:
👉 Define wav_dir (creating it if none is present) and audio_clips_path, which points to the clips folder in the same directory as the tsv files.
👉 def process(x): a sub-function of process_files whose job is to return duration, text and output_wav_path, given an input path and sentence.
👉 Finally, process everything using the process function defined above while displaying a progress bar using tqdm, and return the processed data.
Extras:
Here is a quick breakdown of the sub-function process:
👉 It unpacks file_path and sentence, converts the text to lower case, and defines the output path for the wav file.
👉 A sox Transformer class sets the sample rate given by sample_rate, performs the conversion, finds the duration of the audio using sox.file_info.duration(), and finally returns the required values.
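Based on that breakdown, the process sub-function might look like this sketch; the pure path/text part is split out into a helper, and pysox does the conversion (the sox import sits inside the function so the snippet loads even without sox installed; the 16 kHz sample rate is an assumption):

```python
import os


def prepare_entry(file_path, text, wav_dir):
    # pure part: lower-case the transcript and derive the output wav path
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    return text.lower().strip(), os.path.join(wav_dir, file_name + ".wav")


def process(x):
    import sox  # pip install sox (also needs the SoX binary on PATH)

    file_path, text, wav_dir = x  # one row of the tsv data
    text, output_wav_path = prepare_entry(file_path, text, wav_dir)
    if not os.path.exists(output_wav_path):
        tfm = sox.Transformer()
        tfm.rate(samplerate=16000)  # assumed sample_rate
        tfm.build(input_filepath=file_path,
                  output_filepath=output_wav_path)
    duration = sox.file_info.duration(output_wav_path)
    return duration, text, output_wav_path
```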
manifest.py📜
Having our data returned, we can now write it out in JSON format, and that's what the create_manifest function does:
Pretty straightforward: all we do here is pass data (a tuple of file paths), output_name (the name of the output file), and manifest_path (the path to store the manifest/created files).
Note: As an edge case, the path is created if the folder is not present, and the files are stored in that path. (Line 6)
So now that we have all the functionality ready, let's combine it in the main function, the actual backbone of the script, which contains all the functionalities defined at the start:
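A sketch of how main might tie everything together; process_files and create_manifest are the helpers from the previous sections, and the split file names (train/dev/test.tsv) are assumptions based on the Common Voice layout:

```python
import os
import tarfile
import urllib.request

COMMON_VOICE_URL = ("https://mozilla-common-voice-datasets.s3.dualstack."
                    "us-west-2.amazonaws.com/{}/{}-{}.tar.gz")


def bundle_paths(data_root, version, language):
    # pure helper: where the archive and the unpacked folder will live
    tar_path = os.path.join(data_root, "{}-{}.tar.gz".format(version, language))
    unpacked_dir = os.path.join(data_root, "CV_unpacked")
    return tar_path, unpacked_dir


def main(args):
    tar_path, target_unpacked_dir = bundle_paths(
        args.data_root, args.version, args.language)
    os.makedirs(args.data_root, exist_ok=True)

    # 1. Download, store & unzip
    if not os.path.exists(tar_path):
        url = COMMON_VOICE_URL.format(args.version, args.version, args.language)
        urllib.request.urlretrieve(url, tar_path)
    if not os.path.exists(target_unpacked_dir):
        os.makedirs(target_unpacked_dir)
        with tarfile.open(tar_path) as tar:
            tar.extractall(target_unpacked_dir)

    folder_path = os.path.join(target_unpacked_dir, args.version, args.language)

    # 2 & 3. Process each split and write its manifest
    # (process_files and create_manifest are defined in the sections above)
    for tsv_file in ("train.tsv", "dev.tsv", "test.tsv"):
        data = process_files(os.path.join(folder_path, tsv_file),
                             args.data_root, args.num_workers)
        create_manifest(data, tsv_file.replace(".tsv", "_manifest.json"),
                        os.path.join(args.data_root, "manifests"))
```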
I hope it's pretty self-explanatory after reading the Understanding Script section. However, a few things to add:
👉 The dataset is extracted to the CV_unpacked folder (to keep things simple); to extract it to the pwd instead, remove it.
👉 if __name__ == "__main__": calls main.
Ok, so what's the proof that the script actually works? Well, below is a small clip showing the file in action :)
Link to the video: https://youtu.be/SrKhromAdoI
Working proof (sorry for the watermark; run at 2x and max resolution) – Video by Author
Note — The script is used with default settings.
So that ends our coding and evaluation part. If you have followed along, you have learned how to recreate an entire script from scratch, understand its different components, and write modularised, production-ready code.
On the other hand, you may have figured out how to use argparse to turn any function into a command-line tool.
However, it will be much more beneficial if you apply these concepts in real life to cement your learning. I would really love to see that😍.
Hope you liked my article on the NVIDIA NeMo script. Below are some resources for advanced readers.
Github: For downloading and usage, click here.
Contact Links: You can contact me on Twitter, LinkedIn, and GitHub.
Must Read: NVIDIA NeMo ASR.
Finally, If you like the article, support my efforts by sharing and passing on your suggestions. To read more articles like these, kindly visit my author page & make sure to follow and get notified🔔. You are welcome to comment, too⏬.
Thanks😀
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.