10 Data Science Projects Every Beginner should add to their Portfolio

Shipra Saxena Last Updated : 30 Jul, 2021

Overview

Projects are a practical way to deepen your knowledge of the data science domain.
To boost your resume, here are 10 data science projects you can work on as a beginner.
This list is by no means exhaustive. Feel free to add more data science projects in the comments below.

Introduction

With the rapid increase in the demand for data scientists in recent times, reports have shown that people are enrolling in high numbers for data science programs. Still, the industry is lacking in a skilled workforce in the AI domain. Why?

When hiring data scientists, companies expect candidates to have worked on related projects. Knowing the tools and algorithms, or holding certifications, is not enough. As a data scientist, you need experience with end-to-end data science projects and know-how across every task involved: preparing the problem statement, forming hypotheses, gathering and cleaning the relevant data, building the ML pipeline, and deploying the model.

To simplify your search for relevant projects as a beginner, here is a list of data science projects you can try your hand at. The projects are divided into four parts:

CV projects
NLP projects
Time series projects
Miscellaneous

Computer vision Projects

Computer vision is one of the most popular applications of machine learning, and everyone wants to explore it. This year we saw many interesting CV use cases. Here I am sharing a few you can get your hands dirty with.

If you are looking to master computer vision, check out our course Computer Vision using Deep Learning 2.0

Object detection with YOLOv4

In recent times, we have seen tremendous change in state-of-the-art real-time object detection models. The latest is the release of YOLOv4. You Only Look Once (YOLO) is a family of one-stage object detectors that are fast and accurate. YOLOv4 showed very good results compared to other object detectors.

In experiments, YOLOv4 obtained an AP value of 43.5 percent (65.7 percent AP50) on the MS COCO dataset, and achieved a real-time speed of ∼65 FPS on the Tesla V100, beating the fastest and most accurate detectors in terms of both speed and accuracy.

YOLOv4 is twice as fast as EfficientDet with comparable performance. In addition, compared with YOLOv3, the AP and FPS have increased by 10 percent and 12 percent, respectively.
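
The AP50 figure quoted above counts a detection as correct when its box overlaps the ground-truth box with an intersection over union (IoU) of at least 0.5. Here is a minimal sketch of the IoU calculation in plain Python. The function name and the `(x1, y1, x2, y2)` box layout are assumptions for illustration and are not taken from the YOLOv4 code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes do not touch).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = area of both boxes minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# A predicted box would count toward AP50 only if iou(...) >= 0.5.
predicted = (0, 0, 2, 2)
ground_truth = (1, 1, 3, 3)
print(iou(predicted, ground_truth) >= 0.5)
```

In this toy example the two boxes share one unit of area out of seven, so the prediction would not count as a match at the 0.5 threshold.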

I recommend going through the following links if you want to learn object detection.

Image classification with Microsoft Lobe

Recently, Microsoft launched its machine learning app Lobe, which aims to make developing a machine learning model easier, without writing a single line of code. It makes for an exciting image classification project for beginners.

Lobe automatically selects the right machine learning architecture and starts training without any setup or configuration. Further, users can evaluate the model’s strengths and weaknesses with real-time visual results. Once the training is done the model can be deployed on a website or device.

As a beginner, if you want to develop an image classification model, I recommend trying Lobe.

Color Your pictures with ChromaGan

Image coloring is a very interesting problem. Here you have to fill a grayscale image with plausible colors, and there can be multiple correct solutions. Before the emergence of deep learning techniques, the most effective methods relied on human intervention. Now we have various AI techniques, including generative networks.

ChromaGAN is one such solution. It combines the strength of generative adversarial networks with semantic class distribution learning. As a result, ChromaGAN is able to perceptually colorize a grayscale image from the semantic understanding of the captured scene.

It is an interesting project to strengthen your profile in computer vision. I suggest you go through the ChromaGAN paper.

Natural Language Processing

NLP is one of the hottest fields in the machine learning industry, with applications such as chatbots, topic modelling, and many more. Hence, the AI giants are investing heavily in NLP research.

Don’t forget to check the following links, if you are looking into NLP.

Electra

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a pre-training approach. It aims to match or exceed the downstream performance of models pre-trained with masked language modelling, as BERT is, while using significantly less compute for the pre-training stage.

Setup for ELECTRA pre-training (Source — ELECTRA paper)

The pre-training task in ELECTRA is based on detecting replaced tokens in the input sequence. This setup requires two Transformer models, a generator and a discriminator, in an arrangement that resembles a GAN.
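
To make the replaced-token detection objective concrete, here is a minimal sketch in plain Python. It is not the ELECTRA implementation: the hypothetical `corrupt_tokens` function stands in for the generator by swapping tokens at random (the real generator is a small masked language model), and `discriminator_labels` builds the per-token targets the discriminator would be trained to predict. All names, the toy vocabulary, and the `replace_prob` parameter are assumptions for illustration.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """Stand-in for the generator: swap some tokens for random vocabulary words."""
    rng = random.Random(seed)
    corrupted = []
    for token in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))  # may pick the same word again
        else:
            corrupted.append(token)
    return corrupted


def discriminator_labels(original, corrupted):
    """Targets for the discriminator: 1 = replaced token, 0 = original token."""
    # A position only counts as "replaced" if the token actually changed.
    return [int(o != c) for o, c in zip(original, corrupted)]


sentence = ["the", "chef", "cooked", "the", "meal"]
toy_vocab = ["ate", "meal", "the", "painted"]
noisy = corrupt_tokens(sentence, toy_vocab, replace_prob=0.3)
print(noisy, discriminator_labels(sentence, noisy))
```

Because every position gets a label, the discriminator receives a training signal from the whole sequence rather than only from the masked positions.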

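On the GLUE benchmark, the paper reports a score of 85.0 for the original ELECTRA approach, while the "ELECTRA 15%" variant, which computes the discriminator loss only on the 15% of tokens that were masked out, gets 82.4. (For comparison, BERT scored 82.2.)

Here are the GitHub link and the ELECTRA paper.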

Topic Modeling with Top2Vec

Top2Vec is an algorithm for discovering semantic structure, or topics, in a given set of documents. Under the hood, Top2Vec uses doc2vec to generate the semantic space.

This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors with the distance between them representing semantic similarity.
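
Since topics, documents, and words live in the same vector space, describing a topic comes down to finding the word vectors closest to its topic vector. The sketch below shows that idea with cosine similarity in plain Python. It does not use the Top2Vec library: the function names and the tiny two-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(topic_vector, word_vectors, top_n=3):
    """Return the top_n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine_similarity(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:top_n]


# Hypothetical embeddings: the first axis loosely means "sport", the second "politics".
toy_word_vectors = {
    "goal": [1.0, 0.1],
    "match": [0.9, 0.2],
    "election": [0.1, 1.0],
}
sport_topic = [1.0, 0.0]
print(nearest_words(sport_topic, toy_word_vectors, top_n=2))
```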

The Top2Vec authors have also provided an open-source API to experiment with. Further, you can dig deeper into the model through the paper.

ALBERT: A Lite BERT For Self-supervised Learning Of Language Representations

ALBERT is a language representation model proposed in the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. The model is a modified version of the original BERT model.

Generally, increasing model size in language representation problems improves performance, but it also increases memory consumption and training time. To address this, the authors proposed two methods to reduce the memory consumption and training time of traditional BERT:

Splitting the embedding matrix into two smaller matrices (factorized embedding parameterization).
Using repeating layers split among groups, so that layers share parameters (cross-layer parameter sharing).
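
A quick parameter count shows why these two tricks help. The sketch below is plain arithmetic in Python, not the ALBERT code. The function names are hypothetical, the sizes (a 30,000-token vocabulary, hidden size 768, embedding size 128, 12 layers) are illustrative assumptions, and the per-layer parameter figure is a made-up round number.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameters in the token embedding, with or without factorization."""
    if embedding_size is None:
        # One big matrix: vocab_size x hidden_size.
        return vocab_size * hidden_size
    # Two smaller matrices: vocab_size x embedding_size, then embedding_size x hidden_size.
    return vocab_size * embedding_size + embedding_size * hidden_size


def encoder_params(params_per_layer, num_layers, share_layers=False):
    """Parameters in the encoder stack, with or without cross-layer sharing."""
    # With sharing, every layer reuses the same weights, so they are stored once.
    return params_per_layer if share_layers else params_per_layer * num_layers


print(embedding_params(30000, 768))                      # single embedding matrix
print(embedding_params(30000, 768, embedding_size=128))  # factorized embedding
print(encoder_params(7_000_000, 12))                     # separate weights per layer
print(encoder_params(7_000_000, 12, share_layers=True))  # shared weights
```

With these assumed sizes, the embedding shrinks from 23,040,000 parameters to 3,938,304, and sharing weights across the 12 layers cuts the encoder parameters by a factor of 12.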

According to the researchers, this model achieved state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks for natural language understanding.

For a better understanding, don’t forget to read the ALBERT paper. Here, you can find the model documentation and implementation for ALBERT.

Time-series Analysis

Time-series analysis is a powerful modelling technique that deals with observations recorded at successive time stamps. It is highly useful for companies, for example in forecasting sales or website traffic, predicting stock prices, and more.

If you are interested in digging further, here is your guide to time series analysis.

Rocket

Time series classification is an interesting problem because the features possess an order, or sequence, that we cannot ignore. Examples include classifying a patient's ECG signals or motion sensor data.

Source: https://arxiv.org/pdf/1910.13051.pdf

Most state-of-the-art methods for time series classification have high complexity and significant training time, even on smaller datasets, and they are effectively unusable for large datasets. Rocket (RandOm Convolutional KErnel Transform) can achieve the same level of accuracy in a fraction of the time taken by competing SOTA algorithms, including convolutional neural networks.

To achieve both accuracy and scalability, the Rocket algorithm first uses randomized convolutional kernels to transform the time series features. It then passes these transformed features to a classifier.
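
The transform step can be sketched in a few lines of plain Python. This is a simplified illustration and not the sktime implementation: it leaves out the random dilation and padding that the paper also uses, and all function names are hypothetical. Each random kernel is slid over the series and summarised by two numbers, the maximum output and the proportion of positive values (PPV), and those numbers become the features handed to a classifier.

```python
import random


def random_kernel(rng):
    """Draw one random kernel: a random length, mean-centred weights, and a bias."""
    length = rng.choice([7, 9, 11])
    weights = [rng.gauss(0.0, 1.0) for _ in range(length)]
    mean = sum(weights) / length
    weights = [w - mean for w in weights]
    bias = rng.uniform(-1.0, 1.0)
    return weights, bias


def apply_kernel(series, weights, bias):
    """Slide the kernel over the series and return (max output, PPV)."""
    k = len(weights)
    outputs = [
        sum(w * series[i + j] for j, w in enumerate(weights)) + bias
        for i in range(len(series) - k + 1)
    ]
    ppv = sum(1 for value in outputs if value > 0) / len(outputs)
    return max(outputs), ppv


def rocket_transform(series, num_kernels=100, seed=0):
    """Turn one time series into a flat feature list of length 2 * num_kernels."""
    rng = random.Random(seed)
    features = []
    for _ in range(num_kernels):
        weights, bias = random_kernel(rng)
        features.extend(apply_kernel(series, weights, bias))
    return features


toy_series = [float(i % 5) for i in range(50)]  # must be at least 11 points long
print(len(rocket_transform(toy_series, num_kernels=10)))
```

The kernels are never trained. Only the classifier on top of the features is fitted, which is a large part of why the approach is fast.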

You can find its implementation in sktime, and here is an example notebook. You can also read the paper to understand the approach better.

Prophet

Prophet is an open-source tool by Facebook for forecasting time series data. It decomposes a time series into trend, seasonality, and holidays. In addition, Prophet has intuitive parameters that are easy to tune.

It is fully automatic, accurate, and fast, which makes Prophet easy to use for someone who lacks deep expertise in time series forecasting.

It works best with time series that have strong seasonal effects and several seasons of historical data. Also, Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
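
The decomposition into trend, seasonality, and holidays is additive: the forecast at time t is the sum of the three components. The toy sketch below shows only that structure in plain Python. It does not call the Prophet library, and the component shapes (a straight-line trend, a weekly sine wave, a fixed holiday bump) and all parameter values are assumptions for illustration.

```python
import math


def trend(t, slope=0.5, intercept=10.0):
    """Long-term direction of the series (here a straight line)."""
    return intercept + slope * t


def weekly_seasonality(t, amplitude=2.0):
    """Repeating pattern with a period of 7 time steps."""
    return amplitude * math.sin(2 * math.pi * t / 7)


def holiday_effect(t, holidays, bump=5.0):
    """Extra lift on the time steps marked as holidays."""
    return bump if t in holidays else 0.0


def forecast(t, holidays=frozenset()):
    """Additive model: trend + seasonality + holidays."""
    return trend(t) + weekly_seasonality(t) + holiday_effect(t, holidays)


# Day 3 is marked as a holiday in this toy example.
for day in range(5):
    print(day, round(forecast(day, holidays={3}), 2))
```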

Here is a guide if you want to know more about implementing time series forecasting using Prophet.

Generate Quick and Accurate Time Series Forecasts using Facebook’s Prophet (with Python & R codes)

Gluon TS: Probabilistic Time Series Models in Python

Another library available for time series prediction is GluonTS. It is a library for deep-learning-based time series modeling. It simplifies the development of and experimentation with time-series models for tasks such as forecasting or anomaly detection.

The library provides all the necessary components and tools that scientists need for quickly building new models, for efficiently running and analyzing experiments, and for evaluating model accuracy.

GluonTS is available as open-source software on GitHub under the Apache License, version 2.0.

Miscellaneous

Here are some other projects that you can add to your portfolio to enhance it.

Recommender system with Tensorflow-Recommenders

From suggesting movies to recommending products, recommender systems are an important machine learning application. Recently, TensorFlow launched its package TensorFlow Recommenders. It is an open-source package that makes building, evaluating, and serving recommender systems easy.

Further, the library is built on Keras, which gives it a gentle learning curve while still giving you the flexibility to develop complex models. I suggest you read the official documentation and the tutorials provided by TensorFlow.

Anomaly Detection with PyOD

Anomaly, or outlier, detection is the problem of identifying unusual patterns in the data. It is a process of identifying what is normal and what is not. In other words, anomaly detection defines a boundary around normal data points in order to distinguish them from the outliers.
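
The idea of a boundary around normal points can be illustrated with one of the simplest detectors, a z-score rule, using only the Python standard library. This is a sketch of the general idea and not one of PyOD's algorithms. The function name and the threshold of 2 standard deviations are assumptions for illustration.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Flag values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(find_outliers(readings))  # → [50]
```

Here the boundary is the band within two standard deviations of the mean. A single extreme value inflates both the mean and the standard deviation, which is one reason dedicated toolkits offer more robust detectors.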

PyOD (Python Outlier Detection) is a comprehensive and scalable Python toolkit for outlier detection. It implements more than 30 algorithms. PyOD provides a comprehensive API that supports multiple techniques, and you can take a look at the official documentation of PyOD here.

Here is the detailed tutorial for you.

An Awesome Tutorial to Learn Outlier Detection in Python using PyOD Library

Develop Machine Learning web app with Streamlit

Suppose you have created a tweet sentiment analysis project that works efficiently and with high accuracy. If you want to demonstrate your project, you would traditionally need to develop a dashboard using HTML or JavaScript. That is a tedious task in itself if you do not know either of those languages.

With the launch of Streamlit, developing a dashboard or web application for a machine learning project has become incredibly easy, using Python only. Isn’t it exciting!

Streamlit is an open-source Python library for building efficient, beautiful, and shareable web-based apps in very little time.

To install the library, you can use the command below:

pip install streamlit

Using Streamlit, we can develop anything from very simple to complex machine learning applications with a few lines of code. I personally like this tool because:

It uses Python scripting; no other language is needed.
Less code is required to create efficient applications.
Data caching speeds up the application.

Excited to explore further? Here is the link for you: Streamlit

Endnote

The data science world is advancing at a rapid pace. To stay competitive, you need to be aware of the latest tools and techniques emerging as breakthroughs in the industry.

In this article, I tried to cover a diverse set of projects in the data science domain that you, as a beginner, should definitely know about. Now it’s your turn to get hands-on experience.

Shipra Saxena

Shipra is a data science enthusiast exploring machine learning and deep learning algorithms. She is also interested in big data technologies. She believes learning is a continuous process, so keep moving.
