Interesting Kaggle Datasets Every Beginner in Data Science Should Try Out

Prateek Majumder. Last updated: 21 Apr, 2021

This article was published as a part of the Data Science Blogathon.

Introduction

Kaggle has become one of the most important stepping stones for students and professionals venturing into data science.

Kaggle offers many online resources for getting started with data science: thousands of datasets, data science competitions, code submissions on the datasets, community chat, and beginner-friendly courses. Each user also gets a shareable public profile that tracks and shows all of their contributions and achievements.

The profile shows whom the user follows, who follows the user, the user's code and datasets, and other information. There are also various ranking methods. A Kaggle profile is a good place to host shareable online projects that show your talent. Just as your HackerEarth or CodeChef profile shows your competitive coding skills, your Kaggle profile shows your data science skills.

To build a good Kaggle profile, one needs to work with data, build high-quality Python or R notebooks in the form of projects, and tell a story through the data. In Kaggle Notebooks one can add plots, write Markdown, and train models. Best of all, the user doesn't need to install Python or R on their own computer, and almost all major libraries can be imported directly. Kaggle also provides TPUs for free. Tensor Processing Units (TPUs) are hardware accelerators specialized for deep learning tasks. They are supported in TensorFlow 2.1 both through the Keras high-level API and, at a lower level, in models using a custom training loop.

Working with datasets on Kaggle is easy and convenient, so every beginner should try it to build up skill and knowledge.

Here are some datasets every beginner can use to build interesting projects:

1. Netflix Movies and TV Shows

Who doesn't like Netflix? This Kaggle dataset lists the TV shows and movies available on Netflix. It is well suited to an Exploratory Data Analysis project: one can find out what type of content is produced in which country, identify similar content from the descriptions, and explore many other questions.
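As a first pass at the "which country produces what" question, the sketch below counts titles per country and content type. It runs on a few made-up rows rather than the real file, and the column names `type` and `country` are assumptions about the CSV layout, so check the dataset's column list before reusing it.

```python
import csv
import io
from collections import Counter

# Made-up sample rows standing in for the real CSV (assumed columns: title, type, country).
SAMPLE_CSV = """title,type,country
Show A,TV Show,India
Movie B,Movie,"United States, India"
Movie C,Movie,United States
Show D,TV Show,
"""


def count_by_country_and_type(csv_text):
    """Count titles per (country, type); a title with several countries counts once per country."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The country field may hold several comma-separated countries, or be empty.
        countries = [c.strip() for c in row["country"].split(",") if c.strip()]
        for country in countries:
            counts[(country, row["type"])] += 1
    return counts


counts = count_by_country_and_type(SAMPLE_CSV)
for (country, kind), n in sorted(counts.items()):
    print(f"{country:15} {kind:8} {n}")
```

In a Kaggle Notebook the same grouping would more typically be done with pandas; the standard-library version is used here only so the sketch runs anywhere.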

Link to Dataset

2. Students Performance in Exams

This data is based on population demographics. Its features include the type of meal given to the student, test preparation level, parental level of education, and the student's performance in Math, Reading, and Writing. The data can be used for various regression and classification problems, and to find which factors are associated with better exam scores.
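One simple way to look for factors linked to better scores is to compare group averages. The sketch below does this for a handful of made-up records; the field names `test_prep` and `math_score` are illustrative and are not taken from the dataset's actual headers.

```python
from collections import defaultdict
from statistics import mean

# Made-up records; the field names are illustrative, not the dataset's real headers.
records = [
    {"test_prep": "completed", "math_score": 78},
    {"test_prep": "none", "math_score": 61},
    {"test_prep": "completed", "math_score": 84},
    {"test_prep": "none", "math_score": 70},
    {"test_prep": "none", "math_score": 55},
]


def mean_score_by_group(rows, group_key, score_key):
    """Return {group value: mean score} for the given grouping field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[score_key])
    return {group: mean(scores) for group, scores in groups.items()}


averages = mean_score_by_group(records, "test_prep", "math_score")
print(averages)
```

A gap between group means is only a starting point: it shows an association, not that test preparation causes the higher score.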

Link to Dataset

My favorite Notebooks-

Student Performance In Exams Notebook

3. Mobile Price Classification

The Mobile Price Classification dataset has many features with a wide variety of distribution patterns: categorical features, continuous numerical data, and even binary data. This variety gives one practice in handling different kinds of data and the statistics and computations each kind calls for.
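Before modelling, it helps to sort the columns into these kinds. The sketch below is a rough heuristic, not a standard routine: the `max_categories` cut-off is an arbitrary choice made for this example, and the column names and values are made up.

```python
def column_kind(values, max_categories=10):
    """Roughly label a numeric column as 'binary', 'categorical', or 'continuous'.

    max_categories is an arbitrary cut-off chosen for this sketch.
    """
    distinct = set(values)
    if distinct <= {0, 1}:
        return "binary"
    if all(float(v).is_integer() for v in distinct) and len(distinct) <= max_categories:
        return "categorical"
    return "continuous"


# Made-up columns, loosely modelled on phone specifications.
columns = {
    "has_bluetooth": [0, 1, 1, 0, 1],
    "price_range": [0, 1, 2, 3, 2],
    "clock_speed_ghz": [2.2, 0.5, 1.7, 2.5, 1.2],
}

kinds = {name: column_kind(vals) for name, vals in columns.items()}
print(kinds)
```

A heuristic like this will mislabel some columns (for example an integer count with few distinct values), so the result should be checked by eye.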

Link to Dataset

4. Dogs & Cats Images

The classic Dog vs Cat classification dataset. It contains a large number of dog and cat images that can be used to train models and make predictions. This dataset is a must for students getting into Image Processing or Computer Vision. You also get to look at a lot of cute images of cats and dogs.

Link to Dataset

My favorite Notebooks-

Dogs and Cats Image-Classifier Notebook

5. Trip Advisor Hotel Reviews

Hotels are an important part of trips and vacations. Hotel reviews are text data, which can be processed using Natural Language Processing (NLP) methods. The dataset has over 20,000 hotel reviews, each with a star rating from 1 to 5. It can be used to train a classification model that predicts the star rating of a given review, and it is a good stepping stone into text analytics and NLP.
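To make the idea of predicting a rating from review text concrete, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. The class name `NaiveBayesRater` and the training reviews are made up for illustration; a real project would more likely use an established NLP library and far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesRater:
    """Multinomial naive Bayes with add-one smoothing over word counts."""

    def fit(self, reviews, ratings):
        self.word_counts = defaultdict(Counter)  # rating -> word -> count
        self.class_counts = Counter(ratings)     # rating -> number of reviews
        for text, rating in zip(reviews, ratings):
            self.word_counts[rating].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_reviews = sum(self.class_counts.values())
        best_rating, best_score = None, -math.inf
        for rating, n_reviews in self.class_counts.items():
            total_words = sum(self.word_counts[rating].values())
            # Log prior plus the smoothed log likelihood of each known word.
            score = math.log(n_reviews / total_reviews)
            for word in tokenize(text):
                if word in self.vocab:
                    count = self.word_counts[rating][word]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_rating, best_score = rating, score
        return best_rating


# Made-up training reviews with star ratings.
train_reviews = [
    "dirty room and rude staff",
    "terrible stay, dirty bathroom",
    "great location and friendly staff",
    "lovely room, great breakfast",
]
train_ratings = [1, 1, 5, 5]

model = NaiveBayesRater().fit(train_reviews, train_ratings)
print(model.predict("friendly staff and great room"))
print(model.predict("rude and dirty"))
```

Words never seen in training are ignored at prediction time, which keeps the sketch short at the cost of discarding some signal.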

Link to Dataset

My favorite Notebooks-

Hotel Reviews Sentiment prediction Notebook

6. Melbourne Housing Market

The Melbourne Housing Market dataset is an all-time favorite learning resource for beginners in data science. It has many kinds of features: numeric, categorical, and even geographic data (latitude and longitude). It can therefore be used for geospatial analysis and clustering problems, as well as for regression and classification tasks. Numerous code samples and guides are available for this dataset, which makes it well suited to learners.
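A common first step with latitude and longitude is to derive a distance feature. The sketch below computes great-circle distance with the haversine formula; the city-centre coordinates are approximate and the property location is made up, so treat both as assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates of central Melbourne, used here as a reference point.
CBD_LAT, CBD_LON = -37.8136, 144.9631

# A made-up property location a little north-east of the centre.
distance = haversine_km(-37.80, 145.00, CBD_LAT, CBD_LON)
print(f"{distance:.1f} km from the city centre")
```

A distance-to-centre column computed this way can then be used as an extra input for a price regression or as one dimension in a clustering exercise.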

Link to Dataset

7. Churn Modelling

Employee churn rate indicates how frequently a company's employees quit their jobs within a given period. It is an important aspect of HR analytics and corporate strategy. The data contains real-life features such as age, gender, length of time with the company, and others. It can be used to build a classification model and to explore interesting patterns.
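The churn rate itself is simply the number of people who left divided by the total number of people in the period. The sketch below computes it overall and for two tenure bands; the records, the 3-year split, and the use of a 0/1 "exited" flag are all made up for illustration.

```python
def churn_rate(exited_flags):
    """Share of people who left: number of 1 flags divided by the total count."""
    flags = list(exited_flags)
    if not flags:
        raise ValueError("need at least one record")
    return sum(flags) / len(flags)


# Made-up records: (tenure in years, exited flag where 1 means the person left).
records = [(1, 1), (2, 0), (7, 0), (1, 1), (5, 0), (3, 1), (8, 0), (6, 0)]

overall = churn_rate(flag for _, flag in records)
short_tenure = churn_rate(flag for tenure, flag in records if tenure <= 3)
long_tenure = churn_rate(flag for tenure, flag in records if tenure > 3)
print(f"overall {overall:.0%}, tenure <= 3 years {short_tenure:.0%}, longer {long_tenure:.0%}")
```

Comparing the rate across bands like this is a quick way to spot which features might be useful inputs to a classification model.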

Link to Dataset

My favorite Notebooks-

Churn-Classification Notebook

8. Amazon Top 50 Bestselling Books 2009 – 2019

A sales dataset is always interesting to work with and draw insights from. Features include the Amazon user rating, the number of reviews on Amazon, and others. This dataset can be used for EDA projects and for regression analysis, and it makes an interesting case study on what makes a book a bestseller.

Link to Dataset

My favorite Notebooks-

Amazon Top 50 Bestselling Books Notebook

9. Medical Cost Personal Dataset

This dataset is used to forecast insurance costs from various features. Interesting features include BMI, number of children, and whether or not the person is a smoker. It also falls under the Demographics category and can be used to analyze a person's insurance expenditure.
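A basic forecast of this kind is a straight-line fit of charges against one feature. The sketch below implements ordinary least squares for a single predictor and fits it separately for smokers and non-smokers. The (BMI, charges) pairs are made up and do not come from the dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up (BMI, annual charges) pairs, split by smoking status.
smokers = [(22.0, 18000.0), (27.0, 24000.0), (31.0, 36000.0), (35.0, 42000.0)]
non_smokers = [(22.0, 4000.0), (27.0, 6000.0), (31.0, 7000.0), (35.0, 9000.0)]

for label, rows in (("smokers", smokers), ("non-smokers", non_smokers)):
    bmis, charges = zip(*rows)
    intercept, slope = fit_line(bmis, charges)
    print(f"{label}: charges ~ {intercept:.0f} + {slope:.0f} * BMI")
```

Fitting the two groups separately is a design choice for this sketch; a fuller model would include smoking status, age, and the other features as predictors in one regression.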

Link to Dataset

My favorite Notebooks-

Patient Charges || Clustering and Regression Notebook

10. Kepler Exoplanet Search Results

Kepler had verified 1,284 new exoplanets as of May 2016. As of October 2017, there were over 3,000 confirmed exoplanets in total (using all detection methods, including ground-based ones). NASA retired the Kepler space telescope in October 2018 after it ran out of fuel, so it no longer collects new data.

The data has many features, some of which may be difficult to understand. A detailed guide can be found here.

Link to Dataset

There are a lot of notebooks on this dataset. It might be a bit difficult for beginners, but a lot of work can be done with it.

End Notes

There are a lot more datasets and challenges available on Kaggle, plenty for beginners to learn from. One can also use their Kaggle profile as a means to express their skills in Data Science.

The media shown in this article on Kaggle Datasets is not owned by Analytics Vidhya and is used at the Author's discretion.

Prateek Majumder

Prateek is a dynamic professional with a strong foundation in Artificial Intelligence and Data Science, currently pursuing his PGP at Jio Institute. He holds a Bachelor's degree in Electrical Engineering and has hands-on experience as a System Engineer at TCS Digital, where he excelled in API management and data integration. Prateek also has a background in product marketing and analytics from his time with start-ups like AppleX and Milkie Way, Inc., where he was involved in growth campaigns and technical blog management. Recognized for his structured thinking and problem-solving abilities, he has received accolades like the Dr. Sudarshan Chakraborty Award for Best Student Performance. Fluent in multiple languages and passionate about technology, Prateek continues to expand his expertise in the rapidly evolving AI and tech landscape.
