25 Open Datasets for Deep Learning Every Data Scientist Must Work With

Pranav Dar Last Updated : 05 Feb, 2025

10 min read

Deep learning is a subset of machine learning based on neural networks with representation learning. The key to mastering this topic (or most fields in life) is practice. There are a variety of practice problems available in deep learning, ranging from image processing to speech recognition. But where can you get the sample datasets for these practice problems? In this article, we have listed a collection of openly available high-quality datasets for deep learning enthusiasts. We have also added a few practice problems towards the end of this article, for you to use these public datasets.

The Need for Open-Source Datasets
How to Use These Datasets?
25 Datasets for Deep Learning
Natural Language Processing (NLP) Datasets
Audio/Speech Datasets
Analytics Vidhya Practice Problems
Frequently Asked Questions

The Need for Open-Source Datasets

Open source datasets are much needed for data science students, researchers, and working professionals to test out various artificial intelligence (AI) and machine learning (ML) algorithms. Problems such as time series forecasting, computer vision, regression, semantic analysis, data analysis, and more, require large datasets to work on.

Checkout these points the about the need of Datasets:

Working on these datasets will enhance your skills as a data scientist.
The learning experience gained from these datasets will be highly valuable for your career.
The article provides access to papers with state-of-the-art (SOTA) results to help you improve your models.
Many current research papers use proprietary datasets, which are often not available to the public.
This lack of access to datasets can hinder learning and the application of new skills.
The article offers a solution by providing a list of openly available datasets for your use.

How to Use These Datasets?

First things first – these datasets are huge in size! So make sure you have a fast internet connection with no / very high limit on the amount of data you can download.

There are numerous ways how you can use these datasets. You can use them to apply various deep learning techniques. You can use them to hone your skills, understand how to identify and structure each problem, think of unique use cases, and publish your findings for everyone to see!

In this article, we have included 25 versatile datasets you can use for deep learning problems. The datasets are divided into three categories – Image Processing, Natural Language Processing, and Audio/Speech Processing.

Let’s dive into it!

25 Datasets for Deep Learning

1. MNIST

MNIST is one of the most popular deep learning datasets out there. It’s a dataset of handwritten digits and contains a training set of 60,000 examples and a test set of 10,000 examples. It’s a good database for trying learning techniques and deep recognition patterns on real-world data while spending minimum time and effort in data preprocessing.

Size: ~50 MB

Number of Records: 70,000 images in 10 classes

SOTA: Dynamic Routing Between Capsules

2. MS-COCO

COCO is a large-scale and rich for object detection, segmentation and captioning dataset. It has several features:

Object segmentation
Recognition in context
Superpixel stuff segmentation
330K images (>200K labeled)
1.5 million object instances
80 object categories
91 stuff categories
5 captions per image
250,000 people with keypoints

Size: ~25 GB (Compressed)

Number of Records: 330K images, 80 object categories, 5 captions per image, 250,000 people with key points

SOTA: Mask R-CNN

Bored with Datasets? Solve real life project on Deep Learning

3. ImageNet

ImageNet is a dataset of images that are organized according to the WordNet hierarchy. WordNet contains approximately 100,000 phrases and ImageNet has provided around 1000 images on average to illustrate each phrase.

Size: ~150GB

Number of Records: Total number of images: ~1,500,000; each with multiple bounding boxes and respective class labels

SOTA: Aggregated Residual Transformations for Deep Neural Networks

4. Open Images Dataset

Open Images is a dataset of almost 9 million URLs for images. These images have been annotated with image-level labels bounding boxes spanning thousands of classes. The dataset contains a training set of 9,011,219 images, a validation set of 41,260 images, and a test set of 125,436 images.

Size: 500 GB (Compressed)

Number of Records: 9,011,219 images with more than 5k labels

SOTA: Resnet 101 image classification model (trained on V2 data): Model checkpoint, Checkpoint readme, Inference code.

5. VisualQA

VQA is a dataset containing open-ended questions about images. These questions require an understanding of vision and language. Some of the interesting features of this dataset are:

265,016 images (COCO and abstract scenes)
At least 3 questions (5.4 questions on average) per image
10 ground truth answers per question
3 plausible (but likely incorrect) answers per question
Automatic evaluation metric

Size: 25 GB (Compressed)

Number of Records: 265,016 images, at least 3 questions per image, 10 ground truth answers per question

SOTA: Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge

6. The Street View House Numbers (SVHN)

This is a real-world image dataset for developing object detection algorithms. This requires minimum data preprocessing. It is similar to the MNIST dataset mentioned in this list but has more labeled data (over 600,000 labeled images). The data in the SVHN dataset has been collected from house numbers viewed in Google Street View.

Size: 2.5 GB

Number of Records: 6,30,420 images in 10 classes

SOTA: Distributional Smoothing With Virtual Adversarial Training

7. CIFAR-10

This dataset is another one for image classification. It consists of 60,000 images of 10 classes (each class is represented as a row in the above image). In total, there are 50,000 training images and 10,000 test images. The CIFAR-10 dataset is divided into 6 parts – 5 training batches and 1 test batch. Each batch has 10,000 images.

Size: 170 MB

Number of Records: 60,000 images in 10 classes

SOTA: ShakeDrop regularization

8. Fashion-MNIST

Fashion-MNIST consists of 60,000 training images and 10,000 test images. It is an MNIST-like fashion product database. The developers believe MNIST has been overused so they created this as a direct replacement for that dataset. Each image is in greyscale and associated with a label from 10 classes.

Size: 30 MB

Number of Records: 70,000 images in 10 classes

SOTA: Random Erasing Data Augmentation

Natural Language Processing (NLP) Datasets

9. IMDB Reviews

This is a dream dataset for movie lovers. It is meant for binary sentiment classification and has far more data than any previous datasets in this field. Apart from the training and test review examples, there is further unlabeled data for use as well. Raw text and preprocessed bag of word formats have also been included.

Size: 80 MB

Number of Records: 25,000 highly polar movie reviews for training, and 25,000 for testing

SOTA: Learning Structured Text Representations

10. Twenty Newsgroups

This dataset, as the name suggests, contains information about newsgroups. To curate this dataset, 1000 Usenet articles were taken from 20 different newsgroups. The articles have typical features like subject lines, signatures, and quotes.

Size: 20 MB

Number of Records: 20,000 messages taken from 20 newsgroups

SOTA: Very Deep Convolutional Networks for Text Classification

11. Sentiment140

Sentiment140 is a dataset that can be used for sentiment analysis. A popular dataset, it is perfect to start off your NLP journey. Emotions have already been removed from the data. The final dataset has the below 6 features:

polarity of the tweet
id of the tweet
date of the tweet
the query
username of the tweeter
text of the tweet

Size: 80 MB (Compressed)

Number of Records: 1,60,000 tweets

SOTA: Assessing State-of-the-Art Sentiment Models on State-of-the-Art Sentiment Datasets

12. WordNet

As mentioned in the ImageNet dataset above, WordNet is a large database of English synsets. Synsets are groups of synonyms that each describe a different concept. WordNet’s structure makes it a very useful tool for NLP.

Size: 10 MB

Number of Records: 117,000 synsets is linked to other synsets by means of a small number of “conceptual relations.

SOTA: Wordnets: State of the Art and Perspectives

13. Yelp Reviews

This is an open dataset released by Yelp for learning purposes. It consists of millions of user reviews, businesses attributes, and over 200,000 pictures from multiple metropolitan areas. This is a very commonly used dataset for NLP challenges globally.

Size: 2.66 GB JSON, 2.9 GB SQL and 7.5 GB Photos (all compressed)

Number of Records: 5,200,000 reviews, 174,000 business attributes, 200,000 pictures and 11 metropolitan areas

SOTA: Attentive Convolution

14. The Wikipedia Corpus

This dataset is a collection of all the text on Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. What makes this a powerful NLP dataset is that you search by word, phrase or part of a paragraph itself.

Size: 20 MB

Number of Records: 4,400,000 articles containing 1.9 billion words

SOTA: Breaking The Softmax Bottelneck: A High-Rank RNN language Model

15. The Blog Authorship Corpus

This dataset consists of blog posts collected from thousands of bloggers and has been gathered from blogger.com. Each blog is provided as a separate file. Each blog contains a minimum of 200 occurrences of commonly used English words.

Size: 300 MB

Number of Records: 681,288 posts with over 140 million words

SOTA: Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution

16. Machine Translation of Various Languages

This dataset consists of training data for four European languages. The task here is to improve the current translation methods. You can participate in any of the following language pairs:

English-Chinese and Chinese-English
English-Czech and Czech-English
English-Estonian and Estonian-English
English-Finnish and Finnish-English
English-German and German-English
English-Kazakh and Kazakh-English
English-Russian and Russian-English
English-Turkish and Turkish-English

Size: ~15 GB

Number of Records: ~30,000,000 sentences and their translations

SOTA: Attention Is All You Need

Engage with real-life projects on Natural Language Processing here.

Audio/Speech Datasets

17. Free Spoken Digit Dataset

Another entry in this list inspired by the MNIST dataset! This one was created to solve the task of identifying spoken digits in audio samples. It’s an open dataset so the hope is that it will keep growing as people keep contributing more samples. Currently, it contains the following characteristics:

3 speakers
1,500 recordings (50 of each digit per speaker)
English pronunciations

Size: 10 MB

Number of Records: 1,500 audio samples

SOTA: Raw Waveform-based Audio Classification Using Sample-level CNN Architectures

18. Free Music Archive (FMA)

FMA is a dataset for music analysis. The dataset consists of full-length and HQ audio, pre-computed features, and track and user-level metadata. It is an open dataset created to evaluate several tasks in MIR. Below is the list of CSV files the dataset has along with what they include:

tracks.csv: per track metadata such as ID, title, artist, genres, tags, and play counts, for all 106,574 tracks.
genres.csv: all 163 genre IDs with their name and parent (used to infer the genre hierarchy and top-level genres).
features.csv: common features extracted with librosa.
echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.

Size: ~1000 GB

Number of Records: ~100,000 tracks

SOTA: Learning to Recognize Musical Genre from Audio

19. Ballroom

This dataset contains ballroom dancing audio files. A few characteristic excerpts of many dance styles are provided in real audio format. Below are a few characteristics of the dataset:

Total number of instances: 698
Duration: ~30 s
Total duration: ~20940 s

Size: 14GB (Compressed)

Number of Records: ~700 audio samples

SOTA: A Multi-Model Approach To Beat Tracking Considering Heterogeneous Music Styles

20. Million Song Dataset

The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:

To encourage research on algorithms that scale to commercial sizes
To provide a reference dataset for evaluating research
As a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest’s)
To help new researchers get started in the MIR field

The core of the dataset is the feature analysis and metadata for one million songs. The dataset does not include any audio, only the derived features. The sample audio can be fetched from services like 7digital, using code provided by Columbia University.

Size: 280 GB

Number of Records: PS – its a million songs!

SOTA: Preliminary Study on a Recommender System for the Million Songs Dataset Challenge

21. LibriSpeech

This dataset is a large-scale corpus of around 1000 hours of English speech. The data has been sourced from audiobooks from the LibriVox project. It has been segmented and aligned properly. If you’re looking for a starting point, check out already prepared Acoustic models that are trained on this data set at kaldi-asr.org and language models, suitable for evaluation, at http://www.openslr.org/11/.

Size: ~60 GB

Number of Records: 1000 hours of speech

SOTA : Letter-Based Speech Recognition with Gated ConvNets

22. VoxCeleb

VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males comprise of 55%). The celebrities span a diverse range of accents, professions and age. There is no overlap between the development and test sets. It’s an intriguing use case for isolating and identifying which superstar the voice belongs to.

Size: 150 MB

Number of Records: 100,000 utterances by 1,251 celebrities

SOTA: VoxCeleb: a large-scale speaker identification dataset

Analytics Vidhya Practice Problems

For your practice, we also provide real-life problems and datasets to get your hands dirty. In this section, we’ve listed down the deep learning practice problems on our DataHack platform.

23. X Sentiment Analysis

Hate Speech in the form of racism and sexism has become a nuisance on X (formerly, Twitter) and it is important to segregate these sorts of tweets from the rest. In this practice problem, we provide Twitter data that has both normal and hate tweets. Your task as a data scientist is to identify the tweets that are hate tweets and those that are not.

Size: 3 MB

Number of Records: 31,962 tweets

24. Age Detection of Indian Actors

This is a fascinating challenge for any deep learning enthusiast. The dataset contains thousands of images of Indian actors and your task is to identify their age. All the images are manually selected and cropped from the video frames resulting in a high degree of variability interms of scale, pose, expression, illumination, age, resolution, occlusion, and makeup.

Size: 48 MB (Compressed)

Number of Records: 19,906 images in the training set and 6636 in the test set

SOTA: Hands on with Deep Learning – Solution for Age Detection Practice Problem

25. Urban Sound Classification

This dataset consists of more than 8000 sound excerpts of urban sounds from 10 classes. This practice problem is meant to introduce you to audio processing in the usual classification scenario.

Size: Training set – 3 GB (Compressed), Test set – 2 GB (Compressed)

Number of Records: 8732 labeled sound excerpts (<=4s) of urban sounds from 10 classes

Conclusion

Mastering deep learning requires practice, and having access to the right datasets can make a huge difference in your learning journey. With the rise of open-source models, we now have access to a number of training datasets. However, these new datasets may be specific to each of those models, letting us test, experiment, and build on them.

Each dataset comes with specific characteristics and benchmarks that can help you test and improve your models. Whether you’re a student, researcher, or professional, these resources offer valuable opportunities to apply and enhance your skills in real-world scenarios.

You can use these public datasets to apply various deep learning algorithms and improve your skillset. Do try out the practice problems listed in this article, and let us know in the comments if the datasets were helpful in your attempts.

Frequently Asked Questions

Q1. What are datasets in deep learning?

A. Datasets are collections of data that are used to train, validate, and test models. In deep learning, these datasets are essential for developing and evaluating algorithms. Deep learning datasets can contain data in various forms, such as images, text, audio, and video.

Q2. Where can I get ML datasets?

A. This article is a comprehensive resource for open-source datasets. You can find more open datasets for machine learning on Kaggle, GitHub, UCI Machine Learning Repository, Amazon’s Registry of Open Data, and Google’s Datasets Search Engine.

Q3. What is a good dataset size for deep learning?

A. The ideal dataset size for deep learning depends on the complexity of the task and the model architecture being used. Generally, larger datasets tend to yield better performance. A good rule of thumb is to aim for thousands to millions of data points for effective training. However, it’s also important to balance the dataset size with computational resources and model capacity to prevent overfitting.

Pranav Dar

Senior Editor at Analytics Vidhya.Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Computer Vision

Getting Started with Image Data

Introduction to CNN and Implementation

Introduction to CNN and implementation

Introduction to Transfer Learning

CNN Visualization

Overview of Pretrained Models

Inception

ResNets

DenseNets

CSRNet

Introduction to Object Detection

Region Based Convolutional Neural Network

Single Stage Networks

Transformed Based Object Detection Models

Face Detection

Object Tracking

Pose Estimation

Introduction to Image Segmentation

Understanding Deep Learning Architectures for Image Segmentation

Video Classification

Introduction to Image Generation

Experiments with Generative Adversarial Networks

Zero and Few Shot Learning

Model Deployment

25 Open Datasets for Deep Learning Every Data Scientist Must Work With

Table of Contents

The Need for Open-Source Datasets

How to Use These Datasets?

25 Datasets for Deep Learning

1. MNIST

2. MS-COCO

3. ImageNet

4. Open Images Dataset

5. VisualQA

6. The Street View House Numbers (SVHN)

7. CIFAR-10

8. Fashion-MNIST

Natural Language Processing (NLP) Datasets

9. IMDB Reviews

10. Twenty Newsgroups

11. Sentiment140

12. WordNet

13. Yelp Reviews

14. The Wikipedia Corpus

15. The Blog Authorship Corpus

16. Machine Translation of Various Languages

Audio/Speech Datasets

17. Free Spoken Digit Dataset

18. Free Music Archive (FMA)

19. Ballroom

20. Million Song Dataset

21. LibriSpeech

22. VoxCeleb

Analytics Vidhya Practice Problems

23. X Sentiment Analysis

24. Age Detection of Indian Actors

25. Urban Sound Classification

Conclusion

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID