The Top GitHub Repositories & Reddit Threads Every Data Scientist should know (June 2018)

Pranav Dar Last Updated : 31 May, 2020

6 min read

Introduction

Half the year has flown by and that brings us to the June edition of our popular series – the top GitHub repositories and Reddit threads from last month. During the course of writing these articles, I have learned so much about machine learning from either open source codes or invaluable discussions among the top data science brains in the world.

What makes GitHub special is not just it’s code hosting and social collaboration features for data scientists. It has lowered the entry barrier into the open source world and has played a MASSIVE role in spreading knowledge and expanding the machine learning community.

We saw some amazing open source code being released in June. One of the most intriguing repositories was ‘NLP Progress’ – created with the aim of keeping everyone updated regarding the latest updates in this field. Facebook also released the code for it’s popular DensePose framework which could be a game changer in the pose estimation field.

When it comes to Reddit, it is so rich in knowledge and perspective from data scientists and ML experts from around the globe. In this article you’ll see discussions on reinforcement learning applications, machine learning setups, a wonderful computer vision example, and much more. I highly recommend participating in this discussions to enhance your skillset.

You can check out the top GitHub repositories and top Reddit discussions (from April onwards) for the last five months below:

GitHub Repositories

Facebook’s DensePose

Human pose estimation has garnered a lot of attention in the deep learning community this year. And Facebook took things to a new level when they open sourced the code to ‘DensePose’, their popular pose estimation framework. This technique identifies more than 5000 nodes in the human body (for context, other approaches operate with 10 or 20 joints). You can get an idea of this node mapping technique in the above image.

DensePose has been created in the Detectron framework and is powered by Caffe2. Apart from the code, this repository also contains notebooks to visualize the DensePose-COCO dataset. Read more details about this release here.

NLP Progress

Natural Language Processing (NLP) is an often difficult field to get into, despite it’s attractions. There is tons of unstructured text lying around which you have to work with, and that is no easy task. This repository has been created especially to track the progress in the NLP field. It’s a very informative list of datasets and current state-of-the-art tasks, like dependency parsing, part-of-speech tagging, reading comprehension, etc.

Make sure you star this and follow the progress if you’re even vaguely interested in NLP. There is still a lot that can (and will) be added to this list, like information extraction, relation extraction, grammatical error correction, etc. If you have anything to contribute to this repository, the creator is open to ideas and suggestions so feel free to do that.

MLflow

Getting your model into production is one of the biggest challenges data scientists face when they enter this field. Designing and building the model is what attracts most people to machine learning but if you can’t get that model into production, it essentially becomes just a piece of useless code.

So Databricks (founded by the Apache Spark creators) decided to build and open source a solution to all ML framework challenges. Called MLflow, it is a platform that manages the entire machine learning lifecycle (from start to production) and has been designed to work with any library. Ever since it’s release, it has gained a huge following (1,355 stars on GitHub) and you can check out our coverage of the library here.

Salesforce’s decaNLP

Another NLP entry in this article! When it comes to NLP tasks like sentiment analysis, or machine translation, the norm has been to build models specific to that task. Have you ever built a sentiment analysis model that can also do semantic parsing and question answering at the same time? That’s what Salesforce researchers intend to do with this repository.

They have published a research paper that outlines a model which can do 10 different NLP tasks at the same time. In this paper, they have thrown a chalenge (which they are calling decaNLP) to the community – can you build such a model and improve on the approach they’ve provided? The model Salesforce have built is being called the “Swiss Army Knife for Natural Language Processing”.

Read more details about this on AV’s post here.

Reinforcement Learning Notebooks

Reinforcement learning is becoming popular by the day and so is the open source community for it. This repository is a collection of reinforcement learning algorithms from Richard Sutton and Andrew Barto’s book and other research papers. These algorithms are presented in the form pf Python notebooks.

The creator of the repository recommends using these notebooks when you read the book as they will significantly enhance your understand of what’s being presented. The notes are detailed and anyone entering this field should definitely refer to this collection.

Reddit Threads

Source: Wikipedia

Playing Card Detection with YOLOv3

Your interest in this thread will be piqued by the above video, which is put together and presented really neatly. It sent the machine learning subreddit into overdrive and received almost 100 comments! The thread has a lot of useful information on how the technique was created (there’s a step-by-step explanation from the developer), how long it took, what kind of other things it can do, etc. You’ll learn a lot about computer vision in this thread.

The creator of this technique and video has also open sourced his code on GitHub. So open your Jupyter notebooks and get cracking!

OpenAI Five

OpenAI Five is a group of 5 neural networks designed and developed to beat human opponents in the popular Dota 2 game. It has been developed by the Elon Musk co-founded OpenAI venture, which explains the immediate popularity it has received since it’s release.

Why I’m recommending this thread is the rich discussion around what else data scientists want to see from this technique, it’s comparison with the popular DeepMind AlphaGo algorithm, and how much computational power it required to pull this off. There is a lot of perspective in this thread that will greatly benefit you.

Additionally, you can also read our article on OpenAI Five here.

What ML Hypothesis are you Curious About but are Hoping Someone Else will Research it?

If this topic didn’t get your attention, the first few comments surely will. This discussion is like a wish list of what data scientists and machine learning practitioners want to see from the community. This thread made my list because of the discussion each idea spawned. Once a person added their idea to the thread, multiple folks replied with their ideas on how to implement it and if similar research was already present.

This is a MUST-READ discussion – for both enthusiasts and practitioners. Take some time out to go through it and you’ll come out with a lot of knowledge (and perhaps even more questions).

Setup that Data Scientists use for Machine Learning

What hardware you use for machine learning plays a critical role in determining how good your model will perform, especially when the amount of data to be trained is huge. Read this thread to find out what other data scientists use for building their ML processes and models. The original poster has listed down a structured list of questions which helped keep the thread neat and understandable. There questions are as below:

Desktop or laptop? What model?
GPU?
Operating System?
Programming language?
Framework?
What kind of work/research do you do?

You can also take part in the discussion or use the comments section below this article to let us know your setup!

Practical Use Cases for Reinforcement Learning

As I mentioned above, reinforcement learning is a popular field these days. But due to the complex nature of the work, most of the research and use cases are limited to games and lab environments. In this thread, people already working in this field give their take on where they see RL penetrating in the near future. Some of the comments are of a more skeptical nature but are worth reading to understand what experts and enthusiasts feel about RL.

End Notes

Phew! So much to read and learn in this past month. This list has something for everybody – NLP, reinforcement learning, open source code you can download and start working on, computer vision, discussions on various machine learning related things, and much more. Use the comments section below to let us know which repository and/or discussion you found the most interesting!

Pranav Dar

Senior Editor at Analytics Vidhya.Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

The Top GitHub Repositories & Reddit Threads Every Data Scientist should know (June 2018)

Introduction

GitHub Repositories

Reddit Threads

End Notes

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)

_gcl_au

SID

SAPISID

__Secure-#

APISID

SSID

HSID

DV

NID

1P_JAR

OTZ

Facebook (2)

_fbp

fr

LinkedIn (6)

bscookie

lidc

bcookie

aam_uuid

UserMatchHistory

li_sugr

Microsoft (2)

MR