7 Open Source Data Science Projects you Should Add to your Resume

Pranav Dar Last Updated : 02 Jul, 2020


Overview

Open source data science projects add a lot of value to your resume and help you stand out in an interview
Here are 7 such open source data science projects you should work on this month

Introduction

I’m going to give you a tip I wish someone had given me when I started my data science career. When I was navigating the obstacle-filled journey through the backwaters of data science, I had quite a struggle before I landed my first role. I had all the qualifications (or so I thought) but something seemed to be off.

That gap between what I brought to the table and what the interviewer expected was data science project experience.

Data science projects add a lot of value to your resume, especially if you’re a beginner. Most newcomers will have certifications but adding open source data science projects will give you a significant advantage over the competition. And trust me, there are an astonishing number of open source data science projects for you.

Here, I’ve put together a list of the top open-source data science projects that were created or released in June. This is part of my monthly project series where I bring out the best data science projects open-sourced on GitHub.

If you want to check out the previous projects, I’ve put them together in the form of a free course. They’re structured by the domain (computer vision projects, NLP projects, etc.) so you can focus on the project you want. And if you’re new to GitHub, make sure you’re enrolled in this free introduction to Git and GitHub course.

Open Source Data Science Projects to Enhance your Resume

I have divided the projects into three categories based on their domain:

Machine Learning
Computer Vision
Other open-source data science projects, including an awesome dataset

Let’s look at each category individually.

Open Source Machine Learning Projects

This is where you’ll get the lay of the machine learning land. We’ll cover three useful open source machine learning projects here. You can pick a project based on your interests or try all of them. I have tried to keep them as diverse as possible, so you’ll see one project on machine learning papers and another on building machine learning pipelines.

If you’re looking for guidance or are new to this field, I’ll direct you to a few helpful learning resources:

Machine Learning Papers with Illustrations and Annotations

Reading machine learning research papers is quite a daunting prospect for most professionals, let alone beginners. Data scientists and machine learning researchers tend to write extremely technical papers that even experts have a hard time decoding. This is actually one of the biggest pain points in our field.

So any effort to break down the complexity is always welcome. This helpful project is a collection of data science and machine learning papers “with illustrations, annotations, and brief explanations of technical keywords, terms, and previous studies which makes it easier to read the paper and to get the main idea”.

This project was open sourced on GitHub just last week so it’s being updated regularly. Right now we can see a few papers there already so you can go through them to get an idea of how the annotations have been done. I especially love the YOLOv1 annotation:

Pretty cool! Go ahead and explore this plus the other papers. There’s a lot to learn!

NeoML – A Machine Learning Framework

This is quite an interesting project for anyone who has a bit of data science knowledge.

NeoML is a comprehensive machine learning framework that enables us to build, train, and deploy machine learning models. In short, we can build an end-to-end machine learning pipeline without the hassle of spending big money on out-of-the-box solutions.

Data scientists and data engineers can use it for computer vision and Natural Language Processing (NLP) tasks, such as image preprocessing, classification, document layout analysis, OCR, and data extraction from structured and unstructured documents.

Here are the key features of NeoML, taken from its GitHub repository:

Neural networks with support for over 100 layer types
Traditional machine learning: 20+ algorithms (classification, regression, clustering, etc.)
CPU and GPU support, fast inference
ONNX support
Languages: C++, Java, Objective-C
Cross-platform: the same code can be run on Windows, Linux, macOS, iOS, and Android

Here’s a beginner-friendly article on how to build machine learning pipelines:

Build your First Machine Learning Pipeline using scikit-learn

Google’s Caliban for Machine Learning

Here’s another project that any data scientist would love, especially if you’re inclined towards research. We often struggle to go from a test environment to a full-scale deployment – it’s not an easy step to take (we really should appreciate the role data engineers play).

Google, of course, has a potential solution for us in the form of Caliban. This is a tool that will help you launch and track your numerical experiments in an isolated, reproducible computing environment. Caliban was developed by machine learning researchers and engineers over at Google.

As they put it, Caliban “makes it easy to go from a simple prototype running on a workstation to thousands of experimental jobs running on Cloud”. Here are the key highlights you should be aware of:

Develop your experimental code locally and test it inside an isolated (Docker) environment
Easily sweep over experimental parameters
Submit your experiments as Cloud jobs, where they will run in the same isolated environment
Control and keep track of jobs
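To make "sweeping over experimental parameters" a bit more concrete, here is a minimal Python sketch of the idea: expand a grid of values into one configuration per experiment. This is not Caliban's API. The function `expand_sweep` and the hyperparameter names are my own assumptions, used purely for illustration.

```python
from itertools import product


def expand_sweep(grid):
    """Yield one config dict per combination of the values in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical hyperparameters, chosen only for illustration.
sweep = {"learning_rate": [0.01, 0.1], "batch_size": [32, 64, 128]}
configs = list(expand_sweep(sweep))

for config in configs:
    print(config)  # each config would become one experiment run
```

Two learning rates and three batch sizes give six configurations. A tool like Caliban takes care of launching and tracking one job per configuration so you don't have to script that yourself.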

Open Source Computer Vision Projects

I’m amazed by the progress we are seeing in computer vision (no pun intended!). It seems every month when I sit down to write this article, I come across more and more groundbreaking frameworks and new approaches that enhance the state-of-the-art in this field.

Organizations are scouring the globe for computer vision talent right now so it’s a great time to work on these projects and get into the field. If you haven’t yet started reading about computer vision, here are a few helpful resources:

Genetic Drawing

What if I gave you a target image and asked you to write a computer vision program that created the image from scratch? Yes, that’s the power of computer vision!

This really cool open source project enables us to imitate a drawing process when we’re provided with a target image. Here’s a small demo of what the process looks like:

I can’t wait to get my hands on this and start drawing up all sorts of stuff. You’ll need the below Python libraries to run this:

OpenCV 3.4.1
NumPy 1.16.2
matplotlib 3.0.3

The developer has also given us an example, so you can execute that and watch the magic of computer vision unfold. I’d also suggest going through the OpenCV article below if you haven’t worked with it before:

16 OpenCV Functions to Start your Computer Vision journey (with Python code)
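If you're wondering how a program can "draw" towards a target image, here is a rough sketch of the general idea in plain Python. It is not the project's actual algorithm. I'm assuming a simple mutate-and-keep-if-better loop (hill climbing) where each "stroke" is a small gray rectangle, and all function names are hypothetical.

```python
import random


def error(canvas, target):
    """Sum of squared pixel differences between two grayscale images."""
    return sum((c - t) ** 2
               for crow, trow in zip(canvas, target)
               for c, t in zip(crow, trow))


def random_stroke(height, width, rng):
    """A 'stroke' here is just a small filled rectangle with a gray level."""
    top = rng.randrange(height)
    left = rng.randrange(width)
    h = rng.randint(1, 3)
    w = rng.randint(1, 3)
    return top, left, h, w, rng.random()


def apply_stroke(canvas, stroke):
    """Return a copy of the canvas with the stroke painted on it."""
    top, left, h, w, value = stroke
    painted = [row[:] for row in canvas]
    for r in range(top, min(top + h, len(canvas))):
        for c in range(left, min(left + w, len(canvas[0]))):
            painted[r][c] = value
    return painted


def draw(target, steps=2000, seed=0):
    """Propose random strokes and keep only those that reduce the error."""
    rng = random.Random(seed)
    height, width = len(target), len(target[0])
    canvas = [[0.0] * width for _ in range(height)]
    best = error(canvas, target)
    for _ in range(steps):
        candidate = apply_stroke(canvas, random_stroke(height, width, rng))
        score = error(candidate, target)
        if score < best:  # keep only strokes that bring us closer
            canvas, best = candidate, score
    return canvas, best


# Toy 8x8 target: a bright square on a dark background.
target = [[1.0 if 2 <= r < 6 and 2 <= c < 6 else 0.0 for c in range(8)]
          for r in range(8)]
start_error = error([[0.0] * 8 for _ in range(8)], target)
canvas, final_error = draw(target)
print(start_error, final_error)
```

The error can only go down because a stroke is discarded whenever it makes the canvas look less like the target. The real project works on actual images with OpenCV, so treat this as a mental model rather than a replacement.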

PULSE – Face Depixelizer

This open source project caters to slightly more advanced data scientists. To understand what this project is about, we need to grasp the concept of single-image super-resolution. In simple terms, the aim here is to construct a high-resolution image from a corresponding low-resolution input.

Sounds like a classic computer vision project!

PULSE is a novel solution to this problem statement. Short for Photo Upsampling via Latent Space Exploration, PULSE generates realistic, high-resolution images from low-resolution inputs. It does this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training.

Here’s an example of how PULSE works:

I’d encourage you to first read the research paper before looking at the code. This will give you a better idea of how PULSE works underneath so you can tackle the code with much more clarity.
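To build some intuition for "latent space exploration" before you open the paper, here is a toy Python sketch. It is emphatically not PULSE itself. I'm assuming a made-up two-number "generator" that draws a ramp, a block-average downscaling step, and plain gradient descent with finite differences. The point is only to show the loop: search for a latent whose generated output, once downscaled, matches the low-resolution input.

```python
def generator(z, size=8):
    """Toy stand-in for a generative model: maps a 2-number latent to a ramp."""
    offset, slope = z
    return [offset + slope * i / (size - 1) for i in range(size)]


def downscale(signal, factor=4):
    """Average each block of `factor` samples (the degradation step)."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]


def loss(z, low_res):
    """Squared error between the downscaled candidate and the input."""
    return sum((d - l) ** 2 for d, l in zip(downscale(generator(z)), low_res))


def search_latent(low_res, steps=500, lr=0.2, eps=1e-6):
    """Gradient descent on z using finite-difference gradients."""
    z = [0.0, 0.0]
    for _ in range(steps):
        base = loss(z, low_res)
        grad = []
        for k in range(len(z)):
            bumped = z[:]
            bumped[k] += eps
            grad.append((loss(bumped, low_res) - base) / eps)
        z = [zk - lr * gk for zk, gk in zip(z, grad)]
    return z


# Build a low-resolution input from a hidden latent, then try to recover it.
hidden_z = [0.2, 0.6]
low_res = downscale(generator(hidden_z))
z = search_latent(low_res)
print(z, loss(z, low_res))
```

In this toy the answer is unique because two latent numbers are matched against two low-resolution values. With real images, many different high-resolution pictures downscale to the same input, which is why the choice of generative model matters so much.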

Other Open Source Data Science Projects

Here are a couple of open-source data science projects that didn’t quite fit the above two categories. These are actually two contrasting projects – one caters to beginners in data science while the other deals with the world of reinforcement learning.

Pick whichever one works best for you and start exploring it.

PalmerPenguins – An Awesome Dataset for Exploration and Visualization

I’m sure most of you have worked with the Iris dataset. In fact, it might even have been the very first dataset you used to understand the concept of classification in machine learning. I love how simple the dataset is to understand and explore.

But working with the same dataset over and over can become a bit dull, especially when you’re learning the ins and outs of machine learning.

This is where the PalmerPenguins dataset comes in. Open sourced last month, this dataset positions itself as an alternative to Iris and aims to provide a great dataset for data exploration & visualization, especially for beginners. Here’s a taste of the visualizations you can come up with:

The link I’ve mentioned above contains examples of how to start exploring this data. They’ve even provided details about the different variables but wouldn’t you want to explore that yourself? 🙂

You can get PalmerPenguins on your machine using the below code:

# install.packages("remotes")
remotes::install_github("allisonhorst/palmerpenguins")
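If you prefer Python, a first exploration step could look like the sketch below, which averages one numeric column per species using only the standard library. A few assumptions to flag: I'm assuming the data has been exported to CSV, that the columns are named `species` and `body_mass_g`, and that missing values appear as `NA`. The rows in the sample are made-up placeholders, not real measurements.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only; these are NOT real measurements.
SAMPLE_CSV = """species,body_mass_g
Adelie,3700
Adelie,NA
Gentoo,5000
Gentoo,5200
"""


def mean_by_group(rows, group_col, value_col, missing="NA"):
    """Average `value_col` per `group_col`, skipping missing values."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum, count]
    for row in rows:
        if row[value_col] == missing:
            continue
        totals[row[group_col]][0] += float(row[value_col])
        totals[row[group_col]][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
means = mean_by_group(rows, "species", "body_mass_g")
print(means)
```

To run this on the real data, swap `io.StringIO(SAMPLE_CSV)` for an open file handle pointing at your exported CSV.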

I also recommend checking out the below popular articles on data exploration and visualization:

Slime Volleyball Gym Environment

Ah, here’s an open source project for all you reinforcement learning folks. SlimeVolleyGym is a simple gym environment for testing single and multi-agent reinforcement learning algorithms. This has been created and open-sourced by hardmaru, a legend in the machine learning space.

Here’s how the game works according to him (he created the game himself in JavaScript):

The game is very simple: the agent’s goal is to get the ball to land on the ground of its opponent’s side, causing its opponent to lose a life. Each agent starts off with five lives. The episode ends when either agent loses all five lives, or after 3000 timesteps have passed. An agent receives a reward of +1 when its opponent loses or -1 when it loses a life.

You can install slimevolleygym directly from pip:

pip install slimevolleygym
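Before you train anything, it helps to have the scoring rules clear in your head. Here is a small Python sketch of just those rules as quoted above: five lives each, a 3000-timestep cap, and a reward of +1 or -1 whenever a life is lost. It does not use slimevolleygym at all. The `point_probability` parameter and the coin flip that decides who drops the ball are made-up stand-ins for the real game physics.

```python
import random

MAX_STEPS = 3000   # episode length cap from the description above
START_LIVES = 5    # each agent starts with five lives


def play_episode(point_probability=0.01, seed=0):
    """Simulate scoring only: each step, maybe one side drops the ball.

    `point_probability` is a made-up stand-in for real game physics.
    Returns (total_reward_for_agent, agent_lives, opponent_lives, steps).
    """
    rng = random.Random(seed)
    agent_lives = opponent_lives = START_LIVES
    total_reward = 0
    steps = 0
    while steps < MAX_STEPS and agent_lives > 0 and opponent_lives > 0:
        steps += 1
        if rng.random() < point_probability:
            if rng.random() < 0.5:
                opponent_lives -= 1
                total_reward += 1   # opponent lost a life
            else:
                agent_lives -= 1
                total_reward -= 1   # agent lost a life
    return total_reward, agent_lives, opponent_lives, steps


reward, agent_lives, opponent_lives, steps = play_episode()
print(reward, agent_lives, opponent_lives, steps)
```

One thing this makes obvious is that the total reward for an episode always equals the agent's remaining lives minus the opponent's, so it ranges from -5 to +5.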

Here are a couple of excellent tutorials by our resident reinforcement learning expert Ankit Choudhary:

End Notes

Phew – that’s a lot of projects. My aim, as always, was to keep the projects as diverse as possible so you can pick the ones that fit into your data science journey. If you’re a beginner, I would suggest starting with the PalmerPenguins dataset as most folks aren’t even aware of it right now. A great chance to get a head start.

I would love to hear your thoughts on which open source project you found the most useful. Or let me know if you want me to feature any other data science projects here or in next month’s edition.

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.


Rahul

I would like to enroll in your "Top Data Science Projects for Analysts and Data Scientists" course but I did not find any option to enroll.


Hi Rahul - This was a temporary error - can you please check again and enroll?

Alisher Abdulkhaev

Thank you Pranav for featuring us in your post :) Looking forward for the contributions!

Dan Elbert

Hi Pranav Very interesting article. How would a beginner use these projects to showcase his skills and his own work? Are those projects that accept contributions? Dan
