Hey all👋
I am sure you must have heard of NVIDIA NeMo recently. It's a great library for creating NLP models in just a few lines of code, and needless to say, the team has done a great job.
So, as with everything else, I wanted to try it out for myself and create something unique. This article covers a few snippets of my journey, along with recreating the code from scratch. Have a good time reading it😀.
Like every proficient Data Scientist, I picked the problem statement of creating an ASR (Automatic Speech Recognition) model.
The goal here was to create a model that works similarly to the actual Google Assistant / YT auto-captioning services, but only in a single language: English.
To achieve this, I planned to use the Mozilla Common Voice Dataset 7.0, a 65 GB corpus of spoken English sentences. Now the question was how to download such a huge file and process it at the same time. This is where Google helps, and a quick search landed me on a script that did the heavy lifting, which I quickly used, and suddenly everything changed👀
If you have read the above dilemma, the problem statement is unambiguous: making the script work. So let's dive into the exact walkthrough of how it was fixed.
The NVIDIA NeMo script we are modifying is originally by SeanNaren and is hosted at this link. So before changing it, let's define what we are supposed to do.
👉 Download, Store & Unzip: We start by downloading the dataset using the mozilla_voice_bundler URL, storing it in the directory specified by data_root, and finally unzipping the tar file.
👉 Processing Data: After extracting, the next part focuses on parsing the data: converting the given mp3 files (listed in the tsv files) to wav ones, then passing them to the sox library to get the duration of each voice sample. This step also captures the path where the new files are stored, along with the text.
👉 Creating Manifest: Finally, with all the info gathered, the last part appends the extracted values to create the manifests passed to the NeMo models.
Having defined the explicit goals, we can now move to the fun part, Coding!
Here are a few plans to keep in mind:
Let's start.
Pretty straightforward here, simple imports! tqdm and logging are optional.
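The original gist is not shown here, but a likely reconstruction of the import section, assuming the standard-library pieces plus the two third-party packages the script relies on (pysox and tqdm, both pip-installable):

```python
# Standard-library imports used across the script.
import argparse         # command-line options
import csv              # reading the .tsv description files
import json             # writing manifest entries
import logging          # optional: status messages
import multiprocessing  # parallel mp3 -> wav conversion
import os
import tarfile          # unpacking the downloaded bundle
import urllib.request   # fetching the dataset archive

# Third-party imports (pip install sox tqdm); commented out here so the
# snippet loads without them — uncomment them in the real script.
# import sox
# from tqdm import tqdm
```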
After the imports, the next step is to define the command-line args:
Two things worth noting here are default = "cv-corpus-7.0-2021-07-21" and default = "hi". For general readers, the above code will greet you with cmd-like options and, if nothing is passed, take the default values. To learn more, use the --help / -h option.
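As a hedged sketch, the argument parser could look like this; the two defaults quoted above come from the article, while the remaining flag names (data_root, num_workers, sample_rate) are assumptions:

```python
import argparse
import multiprocessing


def get_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Download and process the Mozilla Common Voice dataset.")
    parser.add_argument("--data_root", default="./", type=str,
                        help="directory to store the dataset in")
    parser.add_argument("--version", default="cv-corpus-7.0-2021-07-21",
                        type=str, help="Common Voice bundle version")
    parser.add_argument("--language", default="hi", type=str,
                        help="language code of the bundle, e.g. 'en'")
    parser.add_argument("--num_workers", default=multiprocessing.cpu_count(),
                        type=int, help="workers for the mp3 -> wav conversion")
    parser.add_argument("--sample_rate", default=16000, type=int,
                        help="target wav sample rate")
    return parser.parse_args(argv)
```

Running the script with --help prints all options; anything omitted falls back to the defaults above.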
One key thing to change is the URL format that downloads the dataset from the Amazon S3 bucket, which keeps changing from time to time. Currently, the link looks similar to:
https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-7.0-2021-07-21/cv-corpus-7.0-2021-07-21-en.tar.gz
Since the original script can't fetch it any more, we must match the current format by structuring the URL as basic_url/{}/{}-{}.tar.gz, where the first two {} placeholders take the version number and the last {} takes the language code. The format below does just that:
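A minimal sketch of that format string; the bucket URL is the one quoted above, while the helper name bundle_url is mine:

```python
# Current S3 bucket layout: <base>/<version>/<version>-<language>.tar.gz
COMMON_VOICE_URL = ("https://mozilla-common-voice-datasets.s3.dualstack."
                    "us-west-2.amazonaws.com/{}/{}-{}.tar.gz")


def bundle_url(version, language):
    # the first two {} take the version, the last one the language code
    return COMMON_VOICE_URL.format(version, version, language)
```

Calling bundle_url("cv-corpus-7.0-2021-07-21", "en") reproduces the link shown above.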
One aspect of functional programming is to split the helper functions out separately and later use them in the main function. So in this section, we will be editing our helper functions process_files.py and manifest.py to do the heavy lifting.
process_files.py📜
This function takes our tsv file, the data directory data_root, and the number of cores to use. Its main job is to read the tsv file's description of each clip, navigate to the given file path, perform the mp3 -> wav conversion, process the text, and save the result to data_root, the given directory.
For simplicity, let's split it into pieces/lines:
👉 Define wav_dir (creating it if none is present) and audio_clips_path, which points to the clips folder in the same directory as the tsv files.
👉 def process(x): a sub-function of process_files whose job is to return duration, text and output_wav_path, given an input path and sentence.
👉 Finally, process everything using the process function defined above while displaying a progress bar using tqdm, and return the processed data.
Extras:
Here is a quick breakdown of the sub-function process:
👉 It unpacks file_path and sentence, converts the text to lower case, and defines the output path for the wav file.
👉 A sox Transformer class sets the sample rate given by sample_rate, performs the conversion, finds the duration of the audio using sox.file_info.duration(), and finally returns the required values.
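Based on that breakdown, the process sub-function might look like this sketch; the pure path/text part is split out into a helper, and pysox does the conversion (the sox import sits inside the function so the snippet loads even without sox installed; the 16 kHz sample rate is an assumption):

```python
import os


def prepare_entry(file_path, text, wav_dir):
    # pure part: lower-case the transcript and derive the output wav path
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    return text.lower().strip(), os.path.join(wav_dir, file_name + ".wav")


def process(x):
    import sox  # pip install sox (also needs the SoX binary on PATH)

    file_path, text, wav_dir = x  # one row of the tsv data
    text, output_wav_path = prepare_entry(file_path, text, wav_dir)
    if not os.path.exists(output_wav_path):
        tfm = sox.Transformer()
        tfm.rate(samplerate=16000)  # assumed sample_rate
        tfm.build(input_filepath=file_path,
                  output_filepath=output_wav_path)
    duration = sox.file_info.duration(output_wav_path)
    return duration, text, output_wav_path
```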
manifest.py📜
Having our data returned, we can now write it out in JSON format, and that's what the create_manifest function does:
Pretty straightforward: all we do here is pass data (a tuple of file paths), output_name (the name of the output file), and manifest_path (the path to store the manifest/created files).
Note: As an edge case, the path is created if the folder is not present, and the files are stored in that path. (Line 6)
So now that we have all the functionality ready, let's combine it in the main function, the actual backbone of the script, which contains all the functionalities defined at the start:
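A sketch of how main might tie everything together; process_files and create_manifest are the helpers from the previous sections, and the split file names (train/dev/test.tsv) are assumptions based on the Common Voice layout:

```python
import os
import tarfile
import urllib.request

COMMON_VOICE_URL = ("https://mozilla-common-voice-datasets.s3.dualstack."
                    "us-west-2.amazonaws.com/{}/{}-{}.tar.gz")


def bundle_paths(data_root, version, language):
    # pure helper: where the archive and the unpacked folder will live
    tar_path = os.path.join(data_root, "{}-{}.tar.gz".format(version, language))
    unpacked_dir = os.path.join(data_root, "CV_unpacked")
    return tar_path, unpacked_dir


def main(args):
    tar_path, target_unpacked_dir = bundle_paths(
        args.data_root, args.version, args.language)
    os.makedirs(args.data_root, exist_ok=True)

    # 1. Download, store & unzip
    if not os.path.exists(tar_path):
        url = COMMON_VOICE_URL.format(args.version, args.version, args.language)
        urllib.request.urlretrieve(url, tar_path)
    if not os.path.exists(target_unpacked_dir):
        os.makedirs(target_unpacked_dir)
        with tarfile.open(tar_path) as tar:
            tar.extractall(target_unpacked_dir)

    folder_path = os.path.join(target_unpacked_dir, args.version, args.language)

    # 2 & 3. Process each split and write its manifest
    # (process_files and create_manifest are defined in the sections above)
    for tsv_file in ("train.tsv", "dev.tsv", "test.tsv"):
        data = process_files(os.path.join(folder_path, tsv_file),
                             args.data_root, args.num_workers)
        create_manifest(data, tsv_file.replace(".tsv", "_manifest.json"),
                        os.path.join(args.data_root, "manifests"))
```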
I hope it's pretty self-explanatory after reading the Understanding Script section. However, a few things to add:
👉 The dataset is extracted to the CV_unpacked folder (to keep things simple); to extract it to the pwd instead, remove it.
👉 if __name__ == "__main__": calls main.
Ok, so what's the proof that the script actually works? Well, below is a small clip showing the file in action :)
Link to the video: https://youtu.be/SrKhromAdoI
Working proof (sorry for the watermark; run at 2x and max resolution) – Video by Author
Note — The script is used with default settings.
So that ends our coding and evaluation part. If you have followed along, you have learned how to recreate an entire script from scratch, understand its different components, and write modularised, production-ready code.
On the other hand, you may have figured out how to use argparse to turn any function into a command-line tool.
However, it will be much more beneficial if you apply these concepts in real life to cement your learning. I would really love to see that😍.
Hope you liked my article on the NVIDIA NeMo script. Below are some resources for advanced readers.
Github: For downloading and usage, click here.
Contact Links: You can contact me on Twitter, LinkedIn, and GitHub.
Must Read: NVIDIA NeMo ASR.
Finally, If you like the article, support my efforts by sharing and passing on your suggestions. To read more articles like these, kindly visit my author page & make sure to follow and get notified🔔. You are welcome to comment, too⏬.
Thanks😀
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.