Kaggle has become one of the most important stepping stones for students and professionals venturing into data science.
Kaggle offers many online resources that help one get started with data science. It has thousands of datasets, data science competitions, code submissions on the datasets, community chat, and even beginner-friendly courses. The user also gets a shareable public profile, which tracks and shows all of the user’s contributions and achievements.
The user profile shows whom the user follows, who follows the user, code by the user, any datasets by the user, and other information. There are also various ranking methods. The Kaggle profile is a good way to create shareable online projects that show your talent. Just as your HackerEarth or CodeChef profile shows your competitive coding skills, your Kaggle profile is a way to show your data science skills.
To build a good Kaggle profile, one needs to work on the data, build high-quality Python or R notebooks in the form of projects, and tell a story through the data. In Kaggle Notebooks, one can add data plots, write markdown, and train models. The best thing about Kaggle Notebooks is that the user doesn’t need to install Python or R on their computer to use them, and almost all major libraries can be imported directly. Kaggle also provides TPUs for free. Tensor Processing Units (TPUs) are hardware accelerators specialized in deep learning tasks. They are supported in TensorFlow 2.1 both through the Keras high-level API and, at a lower level, in models using a custom training loop.
Working with datasets on Kaggle is therefore easy and convenient, and beginners should try it to build up skill and knowledge.
Here are some datasets every beginner can use to build interesting projects:
Who doesn’t like Netflix? This dataset on Kaggle lists the TV shows and movies available on Netflix. One can create a good-quality Exploratory Data Analysis (EDA) project using it. For example, one can find out what type of content is produced in which country, identify similar content from the descriptions, and explore many other interesting questions.
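As a rough illustration of the "what type of content is produced in which country" question, here is a minimal sketch using only the Python standard library. The sample rows are made up, and the column names `type` and `country` (with several countries separated by commas) are assumptions about the dataset layout. In a Kaggle Notebook one would more likely load the real CSV with a library such as pandas.

```python
from collections import Counter

# Made-up sample rows that only mimic the assumed layout of the dataset
# (columns "type" and "country"; a title may list several countries).
rows = [
    {"title": "Show A", "type": "TV Show", "country": "India"},
    {"title": "Movie B", "type": "Movie", "country": "India"},
    {"title": "Movie C", "type": "Movie", "country": "United States, India"},
    {"title": "Show D", "type": "TV Show", "country": "Japan"},
    {"title": "Movie E", "type": "Movie", "country": ""},
]


def content_by_country(records):
    """Count titles per (country, type) pair."""
    counts = Counter()
    for record in records:
        for country in record["country"].split(","):
            country = country.strip()
            if country:  # skip rows where no country is listed
                counts[(country, record["type"])] += 1
    return counts


counts = content_by_country(rows)
for (country, kind), n in sorted(counts.items()):
    print(f"{country:15} {kind:8} {n}")
```

Splitting on commas means a co-production is counted once for each country it lists, which is one possible design choice among several.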
My favorite Notebooks-
This data is based on population demographics. The data contains various features like the meal type given to the student, test preparation level, parental level of education, and students’ performance in Math, Reading, and Writing. Using the data, various types of Regression and Classification problems can be solved. It can also be used to find which factors can lead to better exam scores. Overall, it will be interesting to work on.
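A first look at which factors go along with better exam scores can be as simple as comparing group averages. The sketch below assumes hypothetical column names (`test_prep`, `math`, `reading`, `writing`) and uses made-up rows. A gap between groups only suggests a relationship and does not show that one thing causes the other.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows; the column names are assumptions for illustration.
students = [
    {"test_prep": "completed", "math": 70, "reading": 80, "writing": 78},
    {"test_prep": "completed", "math": 90, "reading": 84, "writing": 87},
    {"test_prep": "none", "math": 60, "reading": 66, "writing": 63},
    {"test_prep": "none", "math": 50, "reading": 58, "writing": 57},
]


def average_score_by(records, factor):
    """Average of the three subject scores for each value of `factor`."""
    groups = defaultdict(list)
    for r in records:
        overall = (r["math"] + r["reading"] + r["writing"]) / 3
        groups[r[factor]].append(overall)
    return {value: mean(scores) for value, scores in groups.items()}


averages = average_score_by(students, "test_prep")
for value, score in averages.items():
    print(f"{value:10} {score:.1f}")
```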
My favorite Notebooks-
The Mobile Price Classification dataset has many features that follow a wide variety of distribution patterns. There are categorical features, continuous numerical features, and even binary features. This variety gives one practice in handling different kinds of data and in applying a range of statistical and mathematical techniques.
My favorite Notebooks-
This is the classic Dog vs Cat classification dataset. It contains many dog and cat images that can be used to train models and make predictions. This dataset is a must for students trying to get into image processing or computer vision. Also, you get to look at a lot of cute images of cats and dogs.
My favorite Notebooks-
Hotels are an important part of trips and vacations. Hotel reviews are text data, which can be processed using Natural Language Processing (NLP) methods. The dataset has over 20,000 hotel reviews, each accompanied by a star rating from 1 to 5. It can be used to train a classification model that predicts the star rating of a given test review. It can be a good stepping stone for getting into text analytics and NLP.
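To make the idea of predicting a star rating from review text concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written with the standard library only. The five reviews and their ratings are made up, and `NaiveBayesRater` is a hypothetical name used for illustration. On the real dataset one would more likely use a library such as scikit-learn and evaluate on held-out reviews.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and split."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


class NaiveBayesRater:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, reviews, ratings):
        self.word_counts = defaultdict(Counter)  # rating -> word frequencies
        self.class_counts = Counter(ratings)     # rating -> number of reviews
        self.vocab = set()
        for text, rating in zip(reviews, ratings):
            words = tokenize(text)
            self.word_counts[rating].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are ignored.
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.class_counts.values())
        best_rating, best_score = None, -math.inf
        for rating, n_docs in self.class_counts.items():
            counts = self.word_counts[rating]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_rating, best_score = rating, score
        return best_rating


# Made-up training examples (review text, star rating).
reviews = [
    "Dirty room and rude staff",
    "Terrible stay, dirty bathroom",
    "Average hotel, okay location",
    "Great location and friendly staff",
    "Excellent service, great room",
]
ratings = [1, 1, 3, 5, 5]

model = NaiveBayesRater().fit(reviews, ratings)
print(model.predict("great friendly service"))
print(model.predict("dirty and terrible"))
```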
My favorite Notebooks-
The Melbourne Housing Market dataset is an all-time favorite learning resource for beginners in data science. It has many kinds of features: numeric, categorical, and even geographic data (latitude and longitude). So it can also be used for geospatial analysis and clustering problems. Regression and classification tasks can be performed on this dataset as well. There are also numerous code samples and guides available for it, making it an ideal dataset for learners.
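Clustering on latitude and longitude can be sketched with a plain k-means loop. The coordinates below are made up, and the function takes its starting centroids as an argument so the result is reproducible. Treating degrees as flat Euclidean coordinates is a simplification that is only reasonable over a small area.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster in place
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return labels, centroids


# Made-up (latitude, longitude) pairs, roughly in the Melbourne area.
locations = [
    (-37.80, 144.96), (-37.81, 144.97), (-37.82, 144.95),  # first group
    (-37.95, 145.15), (-37.96, 145.17), (-37.94, 145.16),  # second group
]
labels, centers = kmeans(locations, centroids=[locations[0], locations[3]])
print(labels)
print(centers)
```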
My favorite Notebooks-
Employee churn rate indicates how frequently a company’s employees quit their jobs within a given period. It is an important aspect of HR analytics and corporate strategy. The data contains real-life features such as age, gender, and time spent with the company. It can be used to create a classification model and to explore interesting patterns.
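As a small worked example of the churn rate itself, the sketch below uses one common convention, which is an assumption here and not something the dataset prescribes: the number of employees who left in the period divided by the average headcount over that period. The figures are made up.

```python
def churn_rate(leavers, headcount_start, headcount_end):
    """Employees who left during the period, relative to average headcount."""
    average_headcount = (headcount_start + headcount_end) / 2
    if average_headcount <= 0:
        raise ValueError("average headcount must be positive")
    return leavers / average_headcount


# Made-up numbers: 12 people left while headcount went from 190 to 210.
rate = churn_rate(leavers=12, headcount_start=190, headcount_end=210)
print(f"{rate:.1%}")  # 6.0%
```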
My favorite Notebooks-
A sales dataset is always interesting to work with and gain insights from. Features include the Amazon user rating, the number of reviews on Amazon, and others. This dataset can be used to create EDA projects and to perform regression analysis. It can also be used to create an interesting case study on the success of bestselling books.
My favorite Notebooks-
This dataset is used to forecast insurance costs based on various features. Interesting features include BMI, number of children, and whether the person is a smoker. It also falls under the demographics category and can be used to analyze a person’s insurance expenditure.
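A simple starting point for this kind of forecast is to compare average costs for smokers and non-smokers and to fit a straight line of cost against BMI. The sketch below uses made-up records and hypothetical field names (`bmi`, `smoker`, `charges`). The line fit is ordinary least squares with a single predictor, written out by hand.

```python
from statistics import mean

# Made-up records; the field names are assumptions for illustration.
people = [
    {"bmi": 22.0, "children": 0, "smoker": False, "charges": 3000.0},
    {"bmi": 27.0, "children": 2, "smoker": False, "charges": 4500.0},
    {"bmi": 32.0, "children": 1, "smoker": False, "charges": 6000.0},
    {"bmi": 24.0, "children": 0, "smoker": True, "charges": 17000.0},
    {"bmi": 33.0, "children": 3, "smoker": True, "charges": 39000.0},
]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


smoker_mean = mean(p["charges"] for p in people if p["smoker"])
non_smoker_mean = mean(p["charges"] for p in people if not p["smoker"])

# Fit charges against BMI for non-smokers only.
non_smokers = [p for p in people if not p["smoker"]]
slope, intercept = fit_line(
    [p["bmi"] for p in non_smokers],
    [p["charges"] for p in non_smokers],
)
print(smoker_mean, non_smoker_mean)
print(slope, intercept)
```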
My favorite Notebooks-
Kepler had verified 1284 new exoplanets as of May 2016. As of October 2017, there were over 3000 confirmed exoplanets in total (using all detection methods, including ground-based ones). The Kepler space telescope was retired in 2018, but the data it collected remains available for analysis.
The data has many features, some of which may be difficult to understand. A detailed guide explaining them can be found here.
There are a lot of Notebooks on this dataset. It might be a bit difficult for beginners, but a lot of work can be done with it.
There are a lot more datasets and challenges available on Kaggle, plenty for beginners to learn from. One can also use their Kaggle profile as a means to express their skills in Data Science.
The media shown in this article on Kaggle Datasets are not owned by Analytics Vidhya and are used at the Author’s discretion.