10 Audio Processing Tasks to get you started with Deep Learning Applications (with Case Studies)

Faizan Shaikh Last Updated : 15 May, 2020

6 min read

Introduction

Imagine a world where machines understand what you want and how you are feeling when you call at a customer care – if you are unhappy about something, you speak to a person quickly. If you are looking for a specific information, you may not need to talk to a person (unless you want to!).

This is going to be the new order of the world – you can already see this happening to a good degree. Check out the highlights of 2017 in the data science industry. You can see the breakthroughs that deep learning was bringing in a field which were difficult to solve before. One such field that deep learning has a potential to help solving is audio/speech processing, especially due to its unstructured nature and vast impact.

So for the curious ones out there, I have compiled a list of tasks that are worth getting your hands dirty when starting out in audio processing. I’m sure there would be a few more breakthroughs in time to come using Deep Learning.

The article is structured to explain each task and its importance. There is also a research paper that goes in the details of that specific task, along with a case study that would help you get started in solving the task.

So let’s get cracking!

1. Audio Classification

Audio classification is a fundamental problem in the field of audio processing. The task is essentially to extract features from the audio, and then identify which class the audio belongs to. Many useful applications pertaining to audio classification can be found in the wild – such as genre classification, instrument recognition and artist identification.

This task is also the most explored topic in audio processing. Plenty of papers were published in this field in the last year. In fact, we have also hosted a practice hackathon for community collaboration for solving this particular task.

Whitepaper – http://ieeexplore.ieee.org/document/5664796/?reload=true

A common approach to solve an audio classification task is to pre-process the audio inputs to extract useful features, and then apply a classification algorithm on it. For example, in the case study below we are given a 5 second excerpt of a sound, and the task is to identify which class does it belong to – whether it is a dog barking or a drilling sound. As mentioned in the article, an approach to deal with this is to extract an audio feature called MFCC and then pass it though a neural network to get the appropriate class.

Case Study – https://www.analyticsvidhya.com/blog/2017/08/audio-voice-processing-deep-learning/

2. Audio Fingerprinting

The aim of audio fingerprinting is to determine the digital “summary” of the audio. This is done to identify the audio from an audio sample. Shazam is an excellent example of an application of audio fingerprinting. It recognises the music on the basis of the first two to five seconds of a song. However, there are still situations where the system fails, especially where there is a high amount of background noise.

Whitepaper – http://www.cs.toronto.edu/~dross/ChandrasekharSharifiRoss_ISMIR2011.pdf

To solve this problem, an approach could be to represent the audio in a different manner, so that it is easily deciphered. Then, we can find out the patterns that differentiate the audio from the background noise. In the case study below, the author converts raw audio to spectrograms and then uses peak finding and fingerprint hashing algorithms to define the fingerprints of that audio file.

Case Study – http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

3. Automatic Music Tagging

Music Tagging is a more complex version of audio classification. Here, we can have multiple classes that each audio may belong to, aka, a multi-label classification problem. A potential application of this task can be to create metadata for the audio so that it can be searched later on. Deep learning has helped solve this task to a certain extent which can be seen in the case study below.

Whitepaper – https://link.springer.com/article/10.1007/s10462-012-9362-y

As seen with most of the tasks, the first step is always to extract features from the audio sample. Then, sort it according to the nuances of the audio (for example, if the audio contains more instrumental noise than the singer’s voice, the tag could be “instrumental”). This can be done either by machine learning or deep learning methods. The case study mentioned below uses deep learning to solve the problem, specifically convolution recurrent neural network along with Mel Frequency Extraction.

Case Study – https://github.com/keunwoochoi/music-auto_tagging-keras

4. Audio Segmentation

Segmentation literally means dividing a particular object into parts (or segments) based on a defined set of characteristics. Segmentation, especially for audio data analysis, is an important pre-processing step. This is because we can segment a noisy and lengthy audio signal into short homogeneous segments (handy short sequences of audio) which are used for further processing. An application of the task is heart sound segmentation, i.e. to identify sounds specific to the heart.

Whitepaper – http://www.mecs-press.org/ijitcs/ijitcs-v6-n11/IJITCS-V6-N11-1.pdf

We can convert this into a supervised learning problem, where each time stamp can be classified on the basis of the segments required. Then we can apply an audio classification approach to solve the problem. In the case study below, the task is to segment the heart sound into two segments (lub and dub), so that we can identify an anomaly in each segment. It can be solved by using audio feature extraction and then deep learning can be applied for classification.

Case Study – https://www.analyticsvidhya.com/blog/2017/11/heart-sound-segmentation-deep-learning/

5. Audio Source Separation

Audio Source Separation consists of isolating one or more source signals from a mixture of signals. One of the most common applications of this is identifying the lyrics from the audio for simultaneous translation (karaoke, for instance). This is a classic example shown in Andrew Ng’s machine learning course where he separates the sound of the speaker from the background music.

Whitepaper – http://ijcert.org/ems/ijcert_papers/V3I1103.pdf

A typical usage scenario involves:

loading an audio file
computing a time-frequency transform to obtain a spectrogram, and
using some of the source separation algorithm (such as non-negative matrix factorization) to obtain a time-frequency mask

The mask is then multiplied with the spectrogram and the result is converted back to the time domain.

Case Study – https://github.com/IoSR-Surrey/untwist

6. Beat Tracking

As the name suggests, the goal here is to track the location of each beat in a collection of audio files. Beat tracking can be utilized to automate time-consuming tasks that must be completed in order to synchronize events with music. It is useful in various applications, such as video editing, audio editing, and human-computer improvisation.

Whitepaper – https://www.audiolabs-erlangen.de/content/05-fau/professor/00-mueller/01-students/2012_GroschePeter_MusicSignalProcessing_PhD-Thesis.pdf

An approach to solve beat tracking can be to be parse the audio file and use an onset detection algorithm to track the beats. Although the techniques used to for onset detection rely heavily on audio feature engineering and machine learning, deep learning can easily be used here to optimize the results.

Case Study – https://github.com/adamstark/BTrack

7. Music Recommendation

Thanks to the internet, we now have millions of songs we can listen to anytime. Ironically, this has made it even harder to discover new music because of the plethora of options out there. Music recommendation systems help deal with this information overload by automatically recommending new music to listeners. Content providers like Spotify and Saavn have developed highly sophisticated music recommendation engines. These models leverage the user’s past listening history among many other features to build customized recommendation lists.

Whitepaper – https://pdfs.semanticscholar.org/7442/c1ebd6c9ceafa8979f683c5b1584d659b728.pdf

We can tackle the challenge of customizing listening preferences by training a regression/deep learning model. This can be used to predict the latent representations of songs that were obtained from a collaborative filtering model. This way, we could predict the representation of a song in the collaborative filtering space, even if no usage data was available.

Case Study – http://benanne.github.io/2014/08/05/spotify-cnns.html

8. Music Retrieval

One of the most difficult tasks in audio processing, Music Retrieval essentially aims to build a search engine based on audio. Although we can do this by solving sub-tasks like audio fingerprinting, this task encompasses much more that that. For example, we also have to solve different smaller tasks for different types of music retrieval (timbre detection would be great for gender identification). Currently, there is no other system that has been developed to match the industry expected standards.

Whitepaper – http://www.nowpublishers.com/article/Details/INR-042

The task of music retrieval is divided into smaller and simpler steps, which include tonal analysis (e.g. melody and harmony) and rhythm or tempo (e.g. beat tracking). Then, on the basis of these individual analysis, information is extracted which is used for retrieval of similar audio samples.

Case Study – https://youtu.be/oGGVvTgHMHw

9. Music Transcription

Music Transcription is another challenging audio processing task. It comprises of annotating audio and creating a kind of “sheet” for generating music from it at a later point of time. The manual effort involved in transcribing music from recordings can be vast. It varies enormously depending on the complexity of the music, how good our listening skills are and how detailed we want our transcription to be.

Whitepaper – http://ieeexplore.ieee.org/abstract/document/7955698

The approach for music transcription is similar to that of speech recognition, where musical notes are transcribed into lyrical excerpts of instruments.

Case Study – https://youtu.be/9boJ-Ai6QFM

10. Onset Detection

Onset detection is the first step in analysing an audio/music sequence. For most of the tasks mentioned above, it is somewhat necessary to perform onset detection, i.e. detecting the start of an audio event. Onset detection was essentially the first task that researchers intended to solve in audio processing.

Whitepaper – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.332.989&rep=rep1&type=pdf

Onset detection is typically done by:

computing a spectral novelty function
finding peaks in the spectral novelty function
backtracking from each peak to a preceding local minimum. Backtracking can be useful for finding segmentation points such that the onset occurs shortly after the beginning of the segment

Case Study – https://musicinformationretrieval.com/onset_detection.html

End Notes

In this article, I have mentioned a few tasks that can be looked at when solving audio processing problems. I hope you find the article insightful in dealing with audio/speech related projects.

Learn, engage , hack and get hired!

Faizan Shaikh

Faizan is a Data Science enthusiast and a Deep learning rookie. A recent Comp. Sc. undergrad, he aims to utilize his skills to push the boundaries of AI research.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

10 Audio Processing Tasks to get you started with Deep Learning Applications (with Case Studies)

Introduction

1. Audio Classification

2. Audio Fingerprinting

3. Automatic Music Tagging

4. Audio Segmentation

5. Audio Source Separation

6. Beat Tracking

7. Music Recommendation

8. Music Retrieval

9. Music Transcription

10. Onset Detection

End Notes

Learn, engage , hack and get hired!

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or