The 25 Best Data Science and Machine Learning GitHub Repositories from 2018

Pranav Dar Last Updated : 22 Jun, 2023

Introduction

What’s the best platform for hosting your code, collaborating with team members, and showcasing your coding skills through an online resume? Ask any data scientist, and they’ll point you towards GitHub. It has been a truly revolutionary platform in recent years and has changed the landscape of how we host and even write code.

But that’s not all. It acts as a learning tool as well. How, you ask? I’ll give you a hint – open source!

The world’s leading tech companies open source their projects on GitHub by releasing the code behind their popular algorithms. 2018 saw a huge spike in such releases, with the likes of Google and Facebook leading the way. The best part about these releases is that the researchers behind the code also provide pretrained models, so folks like you and me don’t have to waste time building difficult models from scratch.

Additionally, we regularly see the top trending repositories aimed towards coders and developers – this includes resources like cheatsheets, video links, e-books, research paper links, among other things. No matter which level you are at in your professional career (beginner, established or advanced), you will always find something new to learn on GitHub.

2018 was a transcendent year in a lot of data science sub-fields, as we will shortly see. Natural Language Processing (NLP) was easily the most talked about domain within the community, with the likes of ULMFiT and BERT being open-sourced. In my quest to bring the best to our awesome community, I ran a monthly series throughout the year where I hand-picked the top 5 projects every data scientist should know about. You can check out the entire collection below:

There will be some overlap here with my article covering the biggest breakthroughs in AI and ML in 2018. Do check out that article as well – it is essentially a list of all the major developments I feel everyone in this field needs to know about. As a bonus, there are predictions from experts as well – not something you want to miss. 🙂

And now, get ready to explore new projects in your quest to attain data science stardom in 2019 and scroll down! Simply click on each project title to head over to the code repository on GitHub.

Topics we will cover in this article

Tools and Frameworks
Computer Vision
Generative Adversarial Networks (GANs)
Other Deep Learning Projects
Natural Language Processing (NLP)
Automated Machine Learning (AutoML)
Reinforcement Learning

Tools and Frameworks

Let’s get the ball rolling with a look at the top projects in terms of tools, libraries and frameworks. Since we are speaking about a software repository platform, it feels right to open things with this section.

Technology is advancing rapidly and computational costs are lower than ever, so we’re being treated to one massive release after another. Can we call this the golden age of coding in machine learning? That is an open question, but one thing we can all agree on – it’s a great time to be a programmer in data science. In this section (and the article overall), I have tried to diversify the languages as much as possible, but Python inevitably ruled the roost.

ML.NET

How about all you .NET developers wanting to learn a bit of machine learning to complement your existing skills? Here’s the perfect repository to get that idea started! ML.NET, a Microsoft project, is an open-source machine learning framework that allows you to design and develop models in .NET.

You can even integrate existing ML models into your application, all without requiring explicit knowledge of how ML models are developed. ML.NET is actually used in multiple Microsoft products, like Windows, Bing Search, MS Office, among others.

ML.NET runs on Windows, Linux and macOS.

TensorFlow.js

Machine learning in the browser! A fictional thought a few years back, a stunning reality now. A lot of us in this field are welded to our favorite IDEs, but TensorFlow.js has the potential to change your habits. It has become very popular since its release earlier this year and continues to amaze with its flexibility.

As the repository states, there are primarily three major features of TensorFlow.js:

Develop machine learning and deep learning models in your browser itself
Run pre-existing TensorFlow models within the browser
Retrain or fine-tune these pre-existing models as well

If you’re familiar with Keras, the high-level layers API will seem quite familiar. There are plenty of examples available on the GitHub repository, so check those out to quicken your learning curve.

PyTorch 1.0

What a year it has been for PyTorch. It has won the hearts, and now the projects, of data scientists and ML researchers around the globe. It is easy to grasp, flexible, and already being used in high-profile research projects (as you’ll see later in this article). The latest version (v1.0) already powers many Facebook products and services at scale, including performing 6 billion text translations a day. If you’ve been wondering when to start dabbling with PyTorch, the time is NOW.

If you’re new to this field, ensure you check out Faizan Shaikh’s guide to getting started with PyTorch.

Papers with Code

While not strictly a tool or framework, this repository is a gold mine for all data scientists. Most of us struggle with reading through a paper and then implementing it (at least I do). There are a lot of moving parts that don’t seem to work on our machines.

And that’s where ‘Papers with Code’ comes in. As the name suggests, they have a code implementation of all the major papers that have been released in the last 6 years or so. It is a mind-blowing collection that you will find yourself fawning over. They have even added code from papers presented at NIPS (NeurIPS) 2018. Get yourself over there now!

Computer Vision

Thanks to falling computational costs and a surge of breakthroughs from the top researchers (something tells me those two might be linked), deep learning is accessible to more people than ever before. And within deep learning, computer vision projects are ubiquitous – most of the repositories you’ll see in this section will cover one computer vision technique or another.

It is simply the hottest field in deep learning right now and will continue to be so for the foreseeable future. Whether it’s object detection or pose estimation, there’s a repository for seemingly all computer vision tasks. Never a better time to get acquainted with these developments – a lot of job openings might come your way soon.

Facebook’s Detectron

Detectron made a HUGE splash when it was launched in early 2018. Developed by Facebook’s AI Research team (FAIR), it implements state-of-the-art object detection frameworks. It is (surprise, surprise) written in Python and has helped enable multiple projects, including DensePose (which we will talk about soon).

This repository contains the code and over 70 pretrained models. Too good an opportunity to pass up, wouldn’t you agree?
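If you are new to object detection, it helps to know how predicted boxes are usually scored against ground-truth boxes. The snippet below is a minimal sketch of intersection over union (IoU), a standard overlap measure in this area. It is not taken from Detectron; the function name `iou` and the `(x1, y1, x2, y2)` box format are assumptions made purely for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is a tuple (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
    This box format is an assumption for this sketch.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes that share a 1x1 corner: overlap 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold; the threshold itself depends on the benchmark you are following.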

NVIDIA’s vid2vid Technique

Object detection in images is awesome, but what about doing it in videos? And not just that, can we extend this concept and translate the style of one video to another? Yes, we can! It is a really cool concept and NVIDIA have been generous enough to release the PyTorch implementation for you to play around with.

The repository contains videos of how the technique looks, the full research paper, and of course the code. The Cityscapes dataset, available publicly post registration, is used in NVIDIA’s examples. One of my favorite projects from 2018.

Training a Model on the ImageNet Dataset in 18 Minutes

Training a deep learning model in 18 minutes? While not having access to high-end computational resources? Believe me, it’s already been done. Fast.ai’s Jeremy Howard and his team of students built a model on the popular ImageNet dataset that even outperformed Google’s approach.

I encourage you to at least go through this project to get a sense of how these researchers structured their code. Not everyone has access to multiple GPUs (or even one) so this was quite a win for the minnows.

Comprehensive Collection of Object Detection Papers

Another research paper collection repository! It’s always helpful to know how your subject of choice has evolved over a span of multiple years, and this one-stop shop will help you do just that for object detection. It’s a comprehensive collection of papers from 2014 till date, and even includes code wherever possible.

The above image shows how object detection frameworks have evolved and transformed in the last five years. Quite fascinating, isn’t it? There’s even a 2019 entry included, so you have quite a lot of catching up to do.

Facebook’s DensePose

Let’s turn our attention to the field of pose detection. I came across this concept this year itself and have been fascinated with it ever since. That above image captures the essence of this repository – dense human pose estimation in the wild.

The code to train and evaluate your own DensePose-RCNN model is included here. There are notebooks available as well to visualize the DensePose COCO dataset. Pretty good place to kick off your pose estimation learning.

Everybody Dance Now – Pose Estimation

The above image (taken from a video) really piqued my interest. I covered the release of the research paper back in August and have continued to be in awe of this technique. This technique enables us to transfer the motion between human objects in different videos. The video I mentioned is available within the repository – it will blow your mind!

This repository further contains the PyTorch implementation of this approach. The amount of intricate details this approach is capable of picking up and replicating is incredible.

GANs

I’m sure most of you must have come across a GAN application (even if you perhaps didn’t realize it at the time). GANs, or Generative Adversarial Networks, were introduced by Ian Goodfellow back in 2014 and have caught fire since. They specialize in performing creative tasks, especially artistic ones. Check out this amazing introductory guide by Faizan Shaikh to the world of GANs, along with an implementation in Python.

We saw a plethora of GAN based projects in 2018 and hence I wanted to create a separate section for this.

Deep Painterly Harmonization

Let’s start off with one of my favorites. I want you to take a moment to just admire the above images. Can you tell which one was done by a human and which one by a machine? I certainly couldn’t. Here, the first frame is the input image (original) and the third frame has been generated by this technique.

Amazing, right? The algorithm adds an external object of your choosing to any image and manages to make it look like nothing touched it. Make sure you check out the code and try to implement it on a different set of images yourself. It’s really, really fun.

Image Outpainting

What if I gave you an image and asked you to extend the boundaries by imagining what it would look like when the entire scene was captured? You would understandably turn to some image editing software. But here’s the awesome news – you can achieve it in a few lines of code!

This project is a Keras implementation of Stanford’s Image Outpainting paper (incredibly cool and illustrated paper – this is how most research papers should be!). You can either build a model from scratch or use the one provided by this repository’s author. Deep learning wonders never cease to amaze.

Visualizing and Understanding GANs

If you haven’t got a handle on GANs yet, try out this project. Pioneered by researchers from MIT’s CSAIL division, it helps you visualize and understand GANs. You can explore what your GAN model has learned by inspecting and manipulating its neurons.

I would like to point you towards the official MIT project page, which has plenty of resources to get you familiar with the concept, including a video demo.

GANimation

This algorithm enables you to change the facial expression of any person in an image. It’s as exciting as it is concerning. The images above inside the green border are the originals; the rest have been generated by GANimation.

The link contains a beginner’s guide, data preparation resources, prerequisites, and the Python code. As the author mentioned, do NOT use it for immoral purposes.

NVIDIA’s FastPhotoStyle

This project is quite similar to the Deep Painterly Harmonization one we saw earlier. But it deserved a mention given it came from NVIDIA themselves. As you can see in the image above, the FastPhotoStyle algorithm requires two inputs – a style photo and a content photo. The algorithm then works in one of two ways to generate the output – it either uses photorealistic image stylization code or uses semantic label maps.

Other Deep Learning Projects

The computer vision field has the potential to overshadow other work in deep learning but I wanted to highlight a few projects outside it.

NVIDIA’s WaveGlow

Audio processing is another field where deep learning has started to make its mark. It’s not just limited to generating music; you can do tasks like audio classification, fingerprinting, segmentation, tagging, etc. There is a lot that’s still yet to be explored and who knows, perhaps you could use these projects to pioneer your way to the top.

Here are two intuitive articles to help you get acquainted with this line of work:

And here comes NVIDIA again. WaveGlow is a flow-based network capable of generating really high quality audio. It is essentially a single network for speech synthesis.

This repository includes a PyTorch implementation of WaveGlow along with a pre-trained model which you can download. The researchers have also listed down the steps you can follow if you want to train your own model from scratch.

AstroNet

Want to discover your own planet? That might perhaps be overstating things a bit, but this AstroNet repository will definitely get you close. The Google Brain team discovered two new planets in December 2017 by applying AstroNet. It’s a deep neural network meant for working with astronomical data. It goes to show the far-ranging applications of machine learning and was a truly monumental development.

And now the team behind the technology has open sourced the entire code (hint: the model is based on CNNs!) that powers AstroNet.

VisualDL – Visualizing Deep Learning Models

Who doesn’t love visualizations? But it can get a tad bit intimidating to imagine how a deep learning model works – there are too many moving parts involved. But VisualDL does a great job mitigating those challenges by designing specific deep learning jobs.

VisualDL currently supports the below components for visualizing jobs (you can see examples of each in the repository):

scalar
histogram
image
audio
graph
high dimensional

Natural Language Processing (NLP)

Surprised to see NLP so far down in this list? That’s primarily because I covered almost all the major open source releases in this article. I highly recommend checking out that list to stay on top of your NLP game. The frameworks I have mentioned here include ULMFiT, Google’s BERT, ELMo, and Facebook’s PyText. I will briefly mention BERT and a couple of other repositories here as I found them very helpful.

Google’s BERT

I couldn’t possibly let this section pass by without mentioning BERT. Google AI’s release has smashed records on its way to winning the hearts of NLP enthusiasts and experts alike. Following ULMFiT and ELMo, BERT really blew away the competition with its performance. It obtained state-of-the-art results on 11 NLP tasks.

Apart from the official Google repository I have linked to above, a PyTorch implementation of BERT is worth checking out. Whether it marks a new era or not in NLP we will soon find out.

MatchZoo

It often helps to know how well your model is performing against a certain benchmark. For NLP, and specifically deep text matching models, I have found the MatchZoo toolkit quite reliable. Potential tasks related to MatchZoo include:

Conversation
Question Answer
Textual Entailment
Information Retrieval
Paraphrase Identification

MatchZoo 2.0 is currently under development so expect to see a lot more being added to this already useful toolkit.
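To get a feel for what a text matching task looks like before reaching for deep models, here is a minimal sketch of a non-neural baseline: bag-of-words cosine similarity used to rank candidate texts against a query. This is not MatchZoo code and does not use its API; the helper names `tokenize`, `cosine_similarity` and `rank_candidates` are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Dot product over the words the two texts share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates):
    """Return the candidates sorted from most to least similar to the query."""
    return sorted(candidates,
                  key=lambda c: cosine_similarity(query, c),
                  reverse=True)


query = "how to install pytorch"
candidates = ["best pizza in town", "install pytorch with pip"]
print(rank_candidates(query, candidates)[0])  # prints: install pytorch with pip
```

A baseline like this only rewards exact word overlap, which is precisely the limitation that learned matching models aim to overcome. It is still handy as a sanity check when you compare models on a benchmark.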

NLP Progress

This repository was created by none other than Sebastian Ruder. The aim of this project is to track the latest progress in NLP. This includes both datasets and state-of-the-art models.

Any NLP technique you’ve ever wanted to know more about – there’s a good chance it’ll already be present here. The repository covers both traditional and core NLP tasks such as reading comprehension and part-of-speech tagging. It’s mandatory to star/bookmark this repository if you’re even vaguely interested in this field.

Automated Machine Learning (AutoML)

What a year for AutoML. With industries looking to integrate machine learning into their core mission, the need for data science specialists continues to grow. There is currently a massive gap between the demand and the supply. This gap could potentially be filled by AutoML tools.

These tools are designed for those people who do not have data science expertise. While there are certainly some incredible tools out there, most of them are priced significantly higher than most individuals can afford. So our amazing open source community came to the rescue in 2018, with two high profile releases.

Auto Keras

This made quite a splash upon its release a few months ago. And why wouldn’t it? Deep learning has long been considered a very specialist field, so a library that can automate most tasks came as a welcome sign. Quoting from their official site, “The ultimate goal of AutoML is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background”.

You can install this library from pip:

pip install autokeras

The repository contains a simple example to give you a sense of how the whole thing works. You’re welcome, deep learning enthusiasts. 🙂

Google’s AdaNet

AdaNet is a framework for automatically learning high-quality models without requiring programming expertise. Since it’s a Google invention, the framework is based on TensorFlow. You can build ensemble models using AdaNet, and even extend its use to training a neural network.

The GitHub page contains the code, an example, the API documentation, and other things to get your hands dirty. Trust me, AutoML is the next big thing in our field.

Reinforcement Learning

Since I already covered a few reinforcement learning releases in my 2018 overview article, I will keep this section fairly brief. My hope in including an RL section wherever I can is to foster a discussion among our community and to accelerate research in this field.

First, make sure you check out OpenAI’s Spinning Up repository, an exhaustive educational resource for beginners. Then head over to Google’s Dopamine page. It is a research framework for accelerating research in this still nascent field. Now let’s look at a couple of other resources as well.

DeepMimic

If you follow a few researchers on social media, you must have come across the above images in video form. A stick human running across a terrain, or trying to stand up, or some such sort. That, dear reader, is reinforcement learning in action.

Here is a signature example of it – a framework to train a simulated humanoid to imitate multiple motion skills. You can get the code, examples, and a step-by-step run-through on the above link.

Reinforcement Learning Notebooks

This repository is a collection of reinforcement learning algorithms from Richard Sutton and Andrew Barto’s book and other research papers. These algorithms are presented in the form of Python notebooks.

As the author of this repo mentioned, you will only truly learn if you implement the learning as you go along. It’s a complex topic, and giving up or reading the resources like a storybook will lead you nowhere.
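In that spirit, here is a minimal sketch of one of the simplest algorithms from the early chapters of Sutton and Barto’s book: an epsilon-greedy agent on a multi-armed bandit, with incremental sample-average value estimates. It is not taken from the repository’s notebooks; the function name `run_bandit`, its parameters, and the Gaussian reward model with unit variance are assumptions chosen for illustration.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Gaussian multi-armed bandit.

    true_means: hidden mean reward of each arm (assumed for this sketch).
    Returns the learned value estimates and the pull count per arm.
    """
    rng = random.Random(seed)  # local RNG so runs are reproducible
    n_arms = len(true_means)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            arm = max(range(n_arms), key=lambda a: estimates[a])

        # Noisy reward drawn around the arm's hidden mean.
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental sample-average update: Q <- Q + (R - Q) / N.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts


estimates, counts = run_bandit([0.0, 0.5, 2.0])
print("estimates:", [round(q, 2) for q in estimates])
print("pulls per arm:", counts)
```

Try changing `epsilon` and watching how the pull counts shift. With very little exploration the agent can lock onto a mediocre arm, and with too much it keeps wasting pulls on arms it already knows are worse.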

End Notes

And that brings us to the end of our journey for 2018. What a year! It was a joyful ride putting this article together and I learned a lot of new stuff along the way.

I would love to hear your feedback on this article. Which repository have you used? Which one did you find the most useful? And which one(s) did I miss out on? Use the comments section below and let me know.

Pranav Dar

Senior Editor at Analytics Vidhya. Data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.

Elba Rodriguez

Good morning: Yes I am very interest in this course. My question is what is the cost? Thank you

Hi Elba, Can you please let me know which course you are referring to? You can check out our catalogue of courses here: http://trainings.analyticsvidhya.com

Harish Nagpal

Good one Pranav :) Thanks :)

Thanks, Harish!

Sonali Dasgupta

Thanks for the comprehensive list of resources. This shall be a great help for learners in 2019. My machine learning journey in 2018 involved delving deeper into neural networks and ensemble methods, in which the FastAI library proved to be a great help in implementing state of the art algorithms. Looking forward to a fruitful experience ahead.

Fastai's library was a godsend in a lot of aspects, Sonali, you're absolutely right. I expect a lot more to come from those folks in 2019.
