More people than ever before are looking for a way to transition into data science. Whether you’re a fresh college graduate, a relatively new entrant in the industry, a mid-level professional, or someone who’s just curious about machine learning – everyone wants a piece of the data science pie.
And if you’re from India, you would surely have read about the Government’s investment in the data field (in the 2020 Union Budget). This is a great time to invest in your career!
And one of the best ways to get your data science career off the ground is to invest in yourself. Here’s a simple path to do that:
I’ve picked out 5 open-source machine learning projects (created in January 2020) to acquaint you with the latest state-of-the-art frameworks and libraries. As always, I tried to diversify the list as much as possible. You’ll see a bit of everything sprinkled in, from Natural Language Processing (NLP) to Python programming ideas.
Head over here if you’re interested in checking out the previous projects we’ve showcased in this monthly series. This is the 3rd year of this series – thanks to our community for the overwhelming response!
The Transformer architecture changed the Natural Language Processing (NLP) landscape. It has spawned a plethora of NLP models, such as BERT, XLNet, and GPT-2.
But there’s an issue I’m sure most of you will relate to – these Transformer-powered models are LARGE. They achieve state-of-the-art results but they’re way too expensive and beyond the scope of most folks who want to learn and implement them.
This is where the Reformer model comes in. Reformer performs on par with these Transformer models, but it does so while using far fewer resources and far less money.
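To get a feel for why standard Transformer models get so expensive, here is a rough back-of-the-envelope sketch in Python. It assumes 32-bit floats and a single attention head, and the helper names (attention_matrix_bytes, growth_ratio) are purely illustrative and not part of any library. The first function computes the size of one full attention matrix, which grows with the square of the sequence length L. The Reformer paper describes bringing the attention cost down from O(L^2) to O(L log L), so the second function simply compares those two growth rates while ignoring constant factors. Treat it as intuition, not as a measurement of either model.

```python
import math


def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Memory for one full seq_len x seq_len attention matrix (illustrative helper)."""
    return seq_len * seq_len * bytes_per_value


def growth_ratio(seq_len: int) -> float:
    """How many times larger L^2 is than L * log2(L), ignoring constant factors."""
    return (seq_len ** 2) / (seq_len * math.log2(seq_len))


if __name__ == "__main__":
    for seq_len in (512, 4096, 65536):
        gib = attention_matrix_bytes(seq_len) / 2**30
        print(
            f"L={seq_len:>6}: full attention matrix = {gib:.4f} GiB, "
            f"L^2 / (L log2 L) = {growth_ratio(seq_len):.0f}x"
        )
```

At a sequence length of 64K, the full matrix alone works out to 16 GiB for a single head under these assumptions, which is more than many single GPUs offer.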
This GitHub repository I’ve linked above contains the PyTorch implementation of Reformer. The author of the project has provided a simple but effective example along with the entire code to help you build your own model.
I encourage you to read about the inner workings of Reformer in the official research paper here.
You can install Reformer on your machine using the below command:
pip install reformer_pytorch
The below articles are essential reading if you’re new to the Transformer architecture and the PyTorch framework:
I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to become mainstream.
If you are working on a machine learning project with mixed data types (int, float, datetime, str, etc.), you should try out PandaPy instead of Pandas. It consumes roughly one-third less memory than Pandas for these data types!
“If you have smaller Pandas dataframes (<50K number of records) in a production environment, then it is worth considering PandaPy.”
Here are three key areas you’ll find interesting (I’ve taken these points verbatim from the PandaPy GitHub repository):
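Install PandaPy using pip (the leading ! runs the command from a Jupyter notebook cell; drop it if you are working in a terminal):
!pip3 install pandapy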
If you still want to stick with Pandas, then check out the latest major release (v1.0.0) here.
What a brilliant GitHub repository! I’ve had a lot of aspiring data scientists reach out to me on LinkedIn asking about how to get started with geospatial analysis. It’s a very interesting field with petabytes of data available. We just need a structured approach to clean and analyze it.
This amazing repository is a collection of 300+ Jupyter notebooks that contain examples of using Google Earth Engine data.
Here’s a really cool GIF that demonstrates one of the visualizations you will generate using these notebooks:
These notebooks rely on three Python libraries to execute the code:
The GitHub repository contains plenty of examples with Python code to get you started. Dig in and have fun!
Here’s an excellent article to get started with Geospatial Data:
Here’s another quality data visualization idea for you. The idea of automating the data exploration step has been floating around for a while without any substantial frameworks to back it up. Until now.
AVA, short for Automated Visual Analytics, is a framework by Alibaba that aims to make visual analytics AI-driven and automated.
Here’s a demo showing the power of AVA:
I highly recommend checking out the below resources to enhance and build your data visualization profile:
Reproducibility is a crucial aspect of any machine learning project these days, whether in research or in industry. We need to track every test we perform, every iteration, and every parameter of our machine learning model, along with the results.
The Fast Neptune library enables us to quickly record all the information we need to launch our machine learning experiments. In other words, Fast Neptune is your answer to the reproducibility question you might have asked while reading the above paragraph.
Here are the features Fast Neptune uses to help us run quick experiments (quoting from the above link):
Pretty neat, right? Install Fast Neptune using just one line of code:
pip install fast-neptune
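To make the reproducibility idea concrete, here is a minimal hand-rolled sketch in plain Python. To be clear, this is not Fast Neptune's API: the function record_experiment, the experiments folder, and the parameter and metric names are all hypothetical, chosen only to illustrate what "tracking every parameter along with the results" can look like. It writes one JSON file per run containing the parameters, the results, and some basic environment details.

```python
import json
import platform
import sys
import time
from pathlib import Path


def record_experiment(params: dict, metrics: dict, log_dir: str = "experiments") -> Path:
    """Write one JSON file per run with parameters, results, and environment info.

    Hypothetical helper for illustration only; not part of Fast Neptune.
    """
    run = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "params": params,
        "metrics": metrics,
    }
    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A nanosecond timestamp in the file name makes collisions between runs unlikely.
    out_path = out_dir / f"run_{time.time_ns()}.json"
    out_path.write_text(json.dumps(run, indent=2, sort_keys=True))
    return out_path


if __name__ == "__main__":
    # Placeholder values for demonstration, not real results.
    path = record_experiment(
        params={"learning_rate": 0.01, "epochs": 5, "seed": 42},
        metrics={"val_accuracy": 0.0},
    )
    print(f"Saved run to {path}")
```

A dedicated tool goes well beyond this, but even a simple log like the one above lets you answer "which settings produced this result?" weeks later.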
I wanted to highlight a couple of other major releases in January 2020 that you should be aware of:
2020 is off to a fast start in the machine learning space. The state of the art continues to evolve at a rapid pace, and it can become overwhelming for newcomers to keep up.
That’s why I publish these monthly articles where I aim to bring out the most relevant and useful open-source machine learning projects for our community.
Is there any other machine learning project or framework you want to highlight? I would love to hear your thoughts and ideas in the comments section below. Let’s connect and brainstorm together.