5 Open Source Machine Learning Projects to Challenge your Inner Data Scientist

Pranav Dar Last Updated : 14 Jun, 2020

Overview

Start 2020 on the right note with these 5 challenging open-source machine learning projects
These machine learning projects cover a diverse range of domains, including Python programming and NLP

Introduction

More people than ever before are looking for a way to transition into data science. Whether you’re a fresh college graduate, a relatively new entrant in the industry, a mid-level professional, or someone who’s just curious about machine learning – everyone wants a piece of the data science pie.

And if you’re from India, you would surely have read about the Government’s investment in the data field (in the 2020 Union Budget). This is a great time to invest in your career!

And one of the best ways to get your data science career off the ground is to invest in yourself. Here’s a simple path to do that:

Find an open-source machine learning project that you are passionate about
Understand the current benchmark solution for that project
If it exists, learn from it. If it doesn’t, carve out a solution using your existing machine learning skillset

I’ve picked out 5 open-source machine learning projects (created in January 2020) to acquaint you with the latest state-of-the-art frameworks and libraries. As always, I tried to diversify the list as much as possible. You’ll see a bit of everything sprinkled in, from Natural Language Processing (NLP) to Python programming ideas.

Head over here if you’re interested in checking out the previous projects we’ve showcased in this monthly series. This is the 3rd year of this series – thanks to our community for the overwhelming response!

Without further ado, here are the 5 open-source machine learning projects

Reformer – The Efficient Transformer in PyTorch

The Transformer architecture changed the Natural Language Processing (NLP) landscape. It has spawned a plethora of NLP models, including BERT, XLNet, and GPT-2.

But there’s an issue I’m sure most of you will relate to – these Transformer-powered models are LARGE. They achieve state-of-the-art results but they’re way too expensive and beyond the scope of most folks who want to learn and implement them.
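To get a feel for why these models are so expensive, here is a rough back-of-the-envelope sketch. It is my own illustration, not code from any of the repositories mentioned here. It assumes standard full self-attention, where every token attends to every other token, and 32-bit floats. It also ignores batch size, the number of heads, and the number of layers, all of which multiply the cost further. The helper name attention_matrix_bytes is hypothetical.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one seq_len x seq_len attention score matrix.

    Assumes full self-attention (every token scored against every other
    token) and 4-byte (32-bit) floats by default.
    """
    return seq_len * seq_len * bytes_per_value


# Memory grows with the square of the sequence length
for seq_len in (512, 4096, 65536):
    gib = attention_matrix_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.4f} GiB for a single attention matrix")
```

Under these assumptions, a single attention matrix for a 65,536-token sequence already needs 16 GiB, which is why long sequences quickly outgrow a single GPU.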

This is where the Reformer model comes in. According to its authors, Reformer performs on par with these Transformer models while using far fewer resources and, in turn, less money.

This GitHub repository I’ve linked above contains the PyTorch implementation of Reformer. The author of the project has provided a simple but effective example along with the entire code to help you build your own model.

I encourage you to read about the inner workings of Reformer in the official research paper here.

You can install Reformer on your machine using the below command:

pip install reformer_pytorch

PandaPy – Your New Favorite Python Library

I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to become mainstream.

If you are working on a machine learning project with mixed data types (int, float, datetime, str, etc.), you should try out PandaPy instead of Pandas. It consumes roughly one-third less memory than Pandas for these data types!

“If you have smaller Pandas dataframes (<50K number of records) in a production environment, then it is worth considering PandaPy.”

Here are three key areas you’ll find interesting (I’ve taken these points verbatim from the PandaPy GitHub repository):

For simple calculations on a small dataset (i.e, plus, mult, log) PandaPy is 25x – 80x faster than Pandas
For table functions (i.e., group, pivot, drop, concat, fillna) on a small data set PandaPy is 5x – 100x times faster than Pandas
For most use cases with small data, PandaPy is faster than Dask, Modin Ray and Pandas

Install PandaPy using pip:

pip3 install pandapy

If you still want to stick with Pandas, then check out the latest major release (v1.0.0) here.

Google Earth Engine – 300+ Jupyter Notebooks to Analyze Geospatial Data

What a brilliant GitHub repository! I’ve had a lot of aspiring data scientists reach out to me on LinkedIn asking about how to get started with geospatial analysis. It’s a very interesting field with petabytes of data available. We just need a structured approach to clean and analyze it.

This amazing repository is a collection of 300+ Jupyter notebooks that contain examples of using Google Earth Engine data.

These notebooks rely on three Python libraries to execute the code:

Earth Engine Python API
Folium
Geehydro

The GitHub repository contains plenty of examples with Python code to get you started. Dig in and have fun!

Here’s an excellent article to get started with Geospatial Data:

Geospatial Data and its Role in Data Science

AVA – Automated Visual Analytics

Here’s another quality data visualization idea for you. The idea of automating the data exploration step has been floating around for a while, but without any substantial frameworks to back it up. Until now.

AVA, short for Automated Visual Analytics, is a framework by Alibaba that aims to make visual analytics AI-driven and automated.

Fast Neptune – Speed up your Machine Learning Projects

Reproducibility is a crucial aspect of any machine learning project these days, whether that’s in research or the industry. We need to track every test we perform, every iteration, and every parameter of our machine learning model, along with the results.

The Fast Neptune library enables us to quickly record all the information we need to launch our machine learning experiments. In other words, Fast Neptune is your answer to the reproducibility question you might have asked while reading the above paragraph.

Here are the features Fast Neptune uses to help us run quick experiments (quoting from the above link):

Metadata about the machine where the code is run, including OS, and OS version
Requirements of the notebook where the experiments are run
Parameters used during the experiment, that is, the names and values of the variables you want to track
Code you used during the run that you want to record
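
To make the four items above concrete, here is a minimal sketch that captures the same kinds of information using only the Python standard library. This is my own illustration of the idea, not Fast Neptune’s API: the function name snapshot_experiment, the record layout, and the example parameters are all hypothetical, and the list of imported modules is only a crude stand-in for real notebook requirements.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def snapshot_experiment(params: dict, code: str) -> dict:
    """Bundle machine metadata, requirements, parameters, and code into one record."""
    # Crude stand-in for notebook requirements: top-level modules imported so far
    imported = sorted(
        {name.split(".")[0] for name in list(sys.modules) if not name.startswith("_")}
    )
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "machine": {
            "os": platform.system(),
            "os_version": platform.release(),
            "python_version": platform.python_version(),
        },
        "requirements": imported,
        "parameters": params,
        "code": code,
    }


# Hypothetical experiment: the parameter names and the code string are made up
record = snapshot_experiment(
    params={"learning_rate": 0.01, "n_estimators": 200},
    code="model.fit(X_train, y_train)",
)
print(json.dumps(record, indent=2))
```

Saving one such record per run, for example as a JSON file next to the model output, is enough to tell later which settings produced which result.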

Pretty neat, right? Install Fast Neptune using just one line of code:

pip install fast-neptune

A couple of noteworthy frameworks to keep an eye on:

I wanted to highlight a couple of other major releases in January 2020 that you should be aware of:

Thinc: This is a lightweight deep learning library from the makers of spaCy. Thinc “offers an elegant, type-checked, functional-programming API for composing models, with support for layers defined in other frameworks such as PyTorch, TensorFlow or MXNet”.
Google’s Incredible Human-Like Generative Chatbot: Google has created Meena, a 2.6 billion parameter end-to-end trained neural conversational model. Meena can conduct conversations that are more sensible and specific than existing state-of-the-art chatbots. Will they open-source the code? That remains to be seen, but this is one to keep your eye on.

End Notes

2020 is off to a fast start in the machine learning space. The state of the art continues to evolve at a rapid pace, and it can become overwhelming for newcomers to keep up.

That’s why I publish these monthly articles where I aim to bring out the most relevant and useful open-source machine learning projects for our community.

Is there any other machine learning project or framework you want to highlight? I would love to hear your thoughts and ideas in the comments section below. Let’s connect and brainstorm together.

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
