Do you sometimes feel that machine learning is too broad and vast to keep up with? I certainly feel that way. Just look at the sheer number of major developments in Natural Language Processing (NLP) in the last year alone.
It can become overwhelming for a data scientist to simply keep track of all that’s happening in machine learning. My aim in running this GitHub series since January 2018 has been to take that pain away for our community.
We trawl through every open source machine learning release each month and pick out the top developments we feel you should absolutely know. This is an ever-evolving field – and data scientists should always be on top of these breakthroughs. Otherwise, we risk being left behind.
This month’s machine learning GitHub collection is quite broad in its scope. I’ve covered one of the biggest NLP releases in recent times (XLNet), a unique approach to reinforcement learning by Google, understanding actions in videos, among other repositories.
Fun times ahead so let’s get rolling!
You can also go through the GitHub repositories and Reddit discussions we’ve covered so far this year.
Of course we are starting with NLP. It is the hottest field in machine learning right now. If you thought 2018 was a big year (and it was), 2019 has taken up the mantle now.
The latest state-of-the-art NLP framework is XLNet. It has taken the NLP (and machine learning) community by storm. XLNet uses Transformer-XL at its core. The developers have released a pretrained model as well to help you get started with XLNet.
XLNet has so far outperformed Google’s BERT on 20 NLP tasks and achieved state-of-the-art performance on 18 of them. Here are a few results on popular NLP benchmarks for reading comprehension:
Model | RACE accuracy | SQuAD1.1 EM | SQuAD2.0 EM |
---|---|---|---|
BERT | 72.0 | 84.1 | 78.98 |
XLNet | 81.75 | 88.95 | 86.12 |
Want more? Here are the error rates for text classification (lower is better):
Model | IMDB | Yelp-2 | Yelp-5 | DBpedia | Amazon-2 | Amazon-5 |
---|---|---|---|---|---|---|
BERT | 4.51 | 1.89 | 29.32 | 0.64 | 2.63 | 34.17 |
XLNet | 3.79 | 1.55 | 27.80 | 0.62 | 2.40 | 32.26 |
XLNet is, to put it mildly, very impressive. You can read the full research paper here.
Wait – were you wondering how you can implement XLNet on your machine? Look no further – this repository will get you started in no time.
If you’re well versed in NLP, this will be pretty simple to understand. But if you’re new to the field, take a few moments to go through the documentation I mentioned above and then try this out.
The developers have also provided the entire code in Google Colab so you can leverage GPU power for free! This is a framework you DON’T want to miss out on.
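XLNet’s core training objective, permutation language modeling, is easy to illustrate with a toy attention mask. Below is a simplified NumPy sketch of the idea (not the actual XLNet implementation): for a sampled factorization order, each token may only attend to the tokens that precede it in that order.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 5

# Sample one factorization order over the sequence positions,
# as in XLNet's permutation language modeling objective.
perm = rng.permutation(seq_len)

# rank[i] = where token i appears in the sampled order.
rank = np.empty(seq_len, dtype=int)
rank[perm] = np.arange(seq_len)

# mask[i, j] = 1 means position i may attend to position j:
# a token only sees tokens that come earlier in the sampled order.
mask = (rank[None, :] < rank[:, None]).astype(int)

print("factorization order:", perm)
print(mask)
```

Averaged over many sampled orders, every token gets to condition on every subset of the other tokens, which is how XLNet captures bidirectional context without BERT’s masked-input corruption.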
I’m a huge football fan so the title of the repository instantly had my attention. Google Research and football – what in the world do these two have to do with each other?
Well, this “repository contains a reinforcement learning environment based on the open-source game Gameplay Football”. This environment was created exclusively for research purposes by the Google Research team. Here are a few scenarios produced within the environment:
Agents are trained to play football in an advanced, physics-based 3D simulator. I’ve seen a few RL environments in the last couple of years but this one takes the cake.
The research paper makes for interesting reading, especially if you’re a football or reinforcement learning enthusiast (or both!). Check it out here.
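Like most RL research environments, Gameplay Football is driven through a standard reset/step loop. Here is a minimal, self-contained sketch of that loop using a hypothetical stub environment (the real gfootball API, observations, and rewards differ):

```python
import random

class StubFootballEnv:
    """Toy stand-in for a Gym-style environment (hypothetical; the real
    Google Research Football API is different). Episodes last 10 steps."""
    ACTIONS = ["idle", "left", "right", "shot"]

    def reset(self):
        self.t = 0
        return {"ball_position": 0.0}  # toy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == "shot" else 0.0  # toy reward signal
        done = self.t >= 10
        return {"ball_position": 0.1 * self.t}, reward, done, {}

env = StubFootballEnv()
obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = random.choice(StubFootballEnv.ACTIONS)  # random policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode return:", total_reward)
```

Any RL algorithm you plug into the football environment ultimately interacts with it through this same observe-act-reward cycle; only the policy choosing the action changes.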
This is a fascinating concept. CRAFT stands for Character Region Awareness for Text Detection. This should be on your to-read list if you’re interested in computer vision. Just check out this GIF:
Can you figure out how the algorithm is working? CRAFT detects the text area by exploring each character region present in the image. And the bounding box of the text? That is obtained by simply finding minimum bounding rectangles on a binary map.
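That final step, turning a binary score map into a box, can be sketched in a few lines of NumPy. Note this computes an axis-aligned rectangle; CRAFT itself fits rotated minimum-area rectangles (e.g. via OpenCV’s minAreaRect), so treat this as a simplified illustration:

```python
import numpy as np

# Toy binary score map: 1 where character regions were detected.
score_map = np.zeros((8, 10), dtype=int)
score_map[2:5, 3:8] = 1  # a detected text region

# Axis-aligned bounding rectangle of the nonzero region.
ys, xs = np.nonzero(score_map)
top, bottom = ys.min(), ys.max()
left, right = xs.min(), xs.max()

print((left, top, right, bottom))  # → (3, 2, 7, 4)
```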
You’ll grasp CRAFT in a jiffy if you’re familiar with the concept of object detection. This repository includes a pretrained model so you don’t have to code this algorithm from scratch!
You can find more details and an in-depth explanation of CRAFT in this paper.
Ever worked with video data before? It’s a really challenging but rewarding experience. Just imagine the sheer amount of things we can do and extract from a video.
How about understanding the action being performed in a particular video frame? That’s what the MMAction repository does. It is an “open source toolbox for action understanding based on PyTorch”. As per the repository, MMAction can perform tasks like action recognition from trimmed videos and temporal action detection in untrimmed videos.
MMAction’s developers have also provided tools to deal with different kinds of video datasets. The repository includes detailed steps to get you up and running.
Here is the getting started guide for MMAction.
One of the most crucial, and yet overlooked, aspects of a data scientist’s skillset – software engineering. It is an intrinsic part of the job. Knowing how to build models is great, but it’s equally important to understand the software side of your project.
If you’ve never heard of version control or experiment management before, rectify that immediately. TRAINS is an experiment manager for deep learning: it “records and manages various deep learning research workloads and does so with practically zero integration costs”.
The best part about TRAINS (and there are many) is that it’s free and open source. You only need to write two lines of code to fully integrate TRAINS into your environment. It currently integrates with PyTorch, TensorFlow, and Keras and also supports Jupyter notebooks.
The developers have set up a demo server here. Go ahead and try out TRAINS using whatever code you want to test.
My pick for this month is definitely XLNet. It has opened up endless opportunities for NLP researchers. There’s only one caveat: it requires serious computational power. Will Google Colab come to the rescue? Let me know if you’ve tried it out yet.
On a relevant note, NLP is THE field to get into right now. Developments are happening at breakneck speed and I can easily predict there’s a lot more coming this year. If you haven’t already, start delving into this as soon as you can.
Are there any other machine learning GitHub repositories I should include in this list? Which one did you like from this month’s collection? Let’s discuss in the comments section below.