Speech to Text Conversion in Python – A Step-by-Step Tutorial

Prashant Last Updated : 22 Dec, 2023


This article was published as a part of the Data Science Blogathon.

Introduction to Speech to Text

When it comes to our interactions with machines, things have become far more sophisticated. We’ve gone from large mechanical buttons to touchscreens. However, hardware isn’t the only thing that’s changing. Throughout the history of computers, text has been the primary method of input. But thanks to developments in natural language processing (NLP), machine learning (ML), and data science, we now have the means to use speech as a medium for interacting with our gadgets.

Virtual assistants, which are all around us, are the most common use of these tools. Google, Siri, Alexa, and a host of other digital assistants have set the bar high for what’s possible when it comes to communicating with the digital world on a personal level.

For the first time in the history of modern technology, the ability to convert spoken words into text is freely available to everyone who wants to experiment with it.

When it comes to creating speech-to-text applications, Python, one of the most widely used programming languages, has plenty of options.

History of Speech to Text

Before diving into Python’s speech-to-text features, it’s interesting to take a look at how far we’ve come in this area. Here is a condensed timeline of events:

Audrey (1952): The first speech recognition system, Audrey, was built by three Bell Labs engineers in 1952. It could recognize only spoken digits.

IBM Shoebox (1962): IBM’s first voice recognition system, the Shoebox, could distinguish 16 spoken words, including the digits 0 through 9. It could perform basic arithmetic and print the results.

Defense Advanced Research Projects Agency (DARPA) (1970s): DARPA funded the Speech Understanding Research program, which led to the creation of Harpy, a system able to recognize 1,011 words.

Hidden Markov Model (HMM) (1980s): The HMM is a statistical model suited to problems that involve sequential information. It was used in the development of new voice recognition techniques.

Google Voice Search (2001): In 2001, Google launched its Voice Search tool, which allowed users to search by speaking. This was the first widely used voice-enabled app.

[Image: Google Voice Search]

Siri (2011): Siri gave users a real-time, convenient way to interact with Apple’s devices.


Alexa (2014) and Google Home (2016): Voice-activated virtual assistants such as Alexa and Google Home, which have sold over 150 million units combined, entered the mainstream in 2014 and 2016, respectively.


Problems faced in Speech to Text

Speech-to-text conversion is a difficult problem that is far from solved. Several technical limitations still keep it from being fully reliable. The following are some of the most commonly encountered difficulties with voice recognition technology:

1. Imprecise interpretation

Speech recognition does not always accurately comprehend spoken words. VUIs (Voice User Interfaces) are not as proficient at comprehending contexts that alter the connection between words and phrases as people are. Thus, machines may have difficulty comprehending the semantics of a statement.

2. Time

At times, speech recognition systems take an excessive amount of time to process speech. This may be because humans have such a wide variety of vocal patterns. Speaking more slowly or more clearly can help, but it reduces the tool’s convenience.

3. Accents

VUIs may have difficulty comprehending dialects that are not standard. Within the same language, people might utter the same words in drastically diverse ways.

4. Background noise and loudness

In a perfect world, these would not be an issue, but that is not the case, and hence VUIs may struggle to operate in noisy surroundings (public spaces, big offices, etc.).

How does Speech recognition work?

A complete description of the method is beyond the scope of this blog. Instead, I’m going to demonstrate how to convert speech to text using Python. This is accomplished using the SpeechRecognition library and the PyAudio library.

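Install the Python packages listed below

SpeechRecognition (pip install SpeechRecognition; imported as speech_recognition): This is the core package that handles the most important part of the conversion process. Alternatives such as apiai, assemblyai, google-cloud-speech, pocketsphinx, watson-developer-cloud, and wit each have their own advantages and disadvantages.

pip install SpeechRecognition

PyAudio (pip install PyAudio): Needed only for microphone input, not for reading audio files.
PortAudio: A native library that PyAudio depends on. It is not a pip package; if pip cannot build PyAudio, install PortAudio first with your system's package manager (for example, brew install portaudio on macOS or apt install portaudio19-dev on Debian/Ubuntu).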
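To confirm the installation before going further, a quick check like the one below may help. It is only a sketch: the helper name missing_packages is made up for this example, and it assumes the import names speech_recognition and pyaudio, which differ from the names used with pip.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in this environment."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None  # None means "not installed"
    ]


# Import names assumed for the packages discussed above.
print(missing_packages(["speech_recognition", "pyaudio"]))
```

An empty list means both modules can be imported; any name that is printed still needs to be installed.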

Convert an audio file into text

Steps

Import the speech recognition library.
Initialize the recognizer class to perform voice recognition. We are using Google’s speech recognition technology.
The library supports the following audio formats: WAV, AIFF, AIFF-C, and FLAC. This example uses a WAV file.
The audio clip comes from the movie ‘Taken’ and says: “I have no idea who you are or what you want, but if you’re looking for ransom, I can tell you I don’t have any money.”
Google’s recognizer expects English by default. It supports a variety of other languages; for further information, please refer to the documentation.
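Before handing a file to the recognizer, it can be useful to confirm that it really is a readable WAV file. The sketch below uses only Python's built-in wave module. The helper names write_silence and describe_wav are hypothetical, chosen for this illustration, and are not part of the SpeechRecognition library.

```python
import wave


def write_silence(path, seconds=1, sample_rate=8000):
    """Write a mono, 16-bit PCM WAV file containing only silence (test data)."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * (sample_rate * seconds))


def describe_wav(path):
    """Return basic properties of a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
        return {
            "channels": wav_file.getnchannels(),
            "sample_width_bytes": wav_file.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }
```

Calling describe_wav('I-dont-know.wav') on the clip used in this article would report its channel count, sample rate, and duration. If the file is not an uncompressed PCM WAV file, wave.open raises wave.Error.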

Code

# Import the library
import speech_recognition as sr

# Initialize the recognizer class (for recognizing the speech)
r = sr.Recognizer()

# Open the audio file as the source and read all of it into audio_text
with sr.AudioFile('I-dont-know.wav') as source:
    audio_text = r.record(source)

# recognize_google() raises RequestError if the API is unreachable and
# UnknownValueError if the speech is unintelligible, hence the exception handling
try:
    # Use Google speech recognition
    print('Converting audio transcripts into text ...')
    text = r.recognize_google(audio_text)
    print(text)
except sr.UnknownValueError:
    print('Sorry, the speech could not be understood.')
except sr.RequestError as e:
    print(f'Sorry, the request failed ({e}). Run again...')
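Since the recognizer is not limited to English, here is a minimal sketch of how a language could be selected. It relies on the language parameter of recognize_google, which takes a language tag such as "en-US" (the library's default). The function name transcribe_file is hypothetical, and returning None on failure is just one possible design.

```python
def transcribe_file(path, language="en-US"):
    """Transcribe an audio file with Google's speech recognition.

    Returns the recognized text, or None if recognition fails.
    """
    # Imported inside the function so that this module still loads
    # on machines where the SpeechRecognition package is missing.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the entire file

    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return None  # the speech could not be understood
    except sr.RequestError:
        return None  # the API could not be reached
```

For example, transcribe_file('I-dont-know.wav', language='en-IN') would request Indian English. The call needs an internet connection, and the quality of the result will depend on the audio.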


Let’s have a look at it in more detail

Speech is nothing more than a sound wave at its most basic level. In acoustic terms, these sound waves, or audio signals, are characterized by amplitude, crest (peak), trough, wavelength, cycle, and frequency.

Because these audio signals are continuous, they contain an infinite number of data points. To convert such a signal into a digital one that a computer can process, it must be sampled: measured at discrete, evenly spaced points in time, densely enough to closely approximate the continuous signal.
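As a small illustration of sampling, the sketch below measures a pure sine tone at evenly spaced instants. The function name sample_sine and the chosen values (a 440 Hz tone, 8000 samples per second, 10 milliseconds) are assumptions made for this example; real speech is far more complex than a single tone.

```python
import math


def sample_sine(frequency_hz, sample_rate_hz, duration_s):
    """Sample a continuous sine wave at evenly spaced points in time."""
    n_samples = round(sample_rate_hz * duration_s)
    return [
        # The n-th sample is taken at time t = n / sample_rate_hz seconds.
        math.sin(2 * math.pi * frequency_hz * n / sample_rate_hz)
        for n in range(n_samples)
    ]


samples = sample_sine(frequency_hz=440, sample_rate_hz=8000, duration_s=0.01)
print(len(samples))  # 80 samples represent 10 ms of the continuous tone
```

The continuous tone has infinitely many values in those 10 milliseconds, but the computer only stores the 80 measured ones.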

Once we’ve established a suitable sampling rate (8000 Hz is a reasonable starting point, since it captures frequencies up to 4000 Hz, where most of the energy in speech lies), we can analyze the audio signals using Python packages such as LibROSA and SciPy. We can then partition the data set into two parts: one for training the model and another for validating its predictions.
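The train/validation partition can be sketched in plain Python as follows. The function name train_validation_split, the 80/20 ratio, and the file names are all assumptions for this illustration; in practice a library routine would usually be used instead.

```python
import random


def train_validation_split(items, validation_fraction=0.2, seed=0):
    """Shuffle the items and split them into (training, validation) lists."""
    shuffled = list(items)                 # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical clip names standing in for a real data set.
clips = [f"clip_{i}.wav" for i in range(10)]
train, validation = train_validation_split(clips)
print(len(train), len(validation))  # 8 2
```

Shuffling before splitting helps ensure that the validation set is not made up only of the last clips recorded.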

At this stage, one may use a Conv1D model architecture: a convolutional neural network that convolves along a single dimension (here, time). After that, we construct the model, define its loss function, and train it, keeping the best-performing model for converting voice to text. Combining deep learning with NLP (Natural Language Processing) lets us convert speech to text in a way that enables wider applicability and acceptance.
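To give a feel for what a one-dimensional convolution does, here is a plain Python sketch of the sliding-window operation at the heart of a Conv1D layer. It is not a neural network: the function name conv1d and the example kernel are assumptions, and a real layer would learn many kernels and add biases and activations. Like the layers in common deep learning frameworks, it slides the kernel without flipping it.

```python
def conv1d(signal, kernel):
    """Slide the kernel along the signal and take a weighted sum at each step.

    Only positions where the kernel fits entirely are computed ("valid" mode).
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# A kernel of [1, 0, -1] responds to how much the signal changes locally.
print(conv1d([1, 2, 3, 4], [1, 0, -1]))  # [-2, -2]
```

Applied to audio samples, each kernel acts as a small filter that reacts to a particular local pattern in the waveform.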

Applications of Speech Recognition

Because speech recognition is mostly a software technology that does not belong to any one company, a growing number of tools for working with it are available. As a result, even developers with limited financial resources have been able to use it to create innovative apps.

The following are some of the sectors in which voice recognition is gaining traction:

Evolution in search engines: Speech recognition will aid in improving search accuracy by bridging the gap between verbal and textual communication.
Impact on the healthcare industry: Voice recognition is becoming more prevalent in the medical sector because it speeds up the production of medical reports. As VUIs improve their ability to comprehend medical language, clinicians will spend less time on administrative tasks.
Service industry: As automation advances, a customer may be unable to reach a human to respond to a query; in this case, speech recognition systems can fill the void. We will witness a quick expansion of this function at airports, public transportation, and other locations.
Service providers: Telecommunications companies may rely even more on speech-to-text technology to determine callers’ requirements and direct them to the proper support.

Conclusion

Speech-to-text conversion is a useful tool that is on its way to becoming commonplace. With Python, one of the most popular programming languages in the world, it’s easy to create applications with this tool. As we make progress in this area, we’re laying the groundwork for a future in which digital information may be accessed not just with a fingertip but also with a spoken command.

The media shown in this article are not owned by Analytics Vidhya and are used at the author’s discretion.



