More people than ever before are looking for a way to transition into data science. Whether you’re a fresh college graduate, a relatively new entrant in the industry, a mid-level professional, or someone who’s just curious about machine learning – everyone wants a piece of the data science pie.
And if you’re from India, you would surely have read about the Government’s investment in the data field (in the 2020 Union Budget). This is a great time to invest in your career!
And one of the best ways to get your data science career off the ground is to invest in yourself. Here’s a simple path to do that:
I’ve picked out 5 open-source machine learning projects (created in January 2020) to acquaint you with the latest state-of-the-art frameworks and libraries. As always, I tried to diversify the list as much as possible. You’ll see a bit of everything sprinkled in, from Natural Language Processing (NLP) to Python programming ideas.
Head over here if you’re interested in checking out the previous projects we’ve showcased in this monthly series. This is the 3rd year of this series – thanks to our community for the overwhelming response!
The Transformer architecture changed the Natural Language Processing (NLP) landscape. It has spawned a plethora of NLP models, such as BERT, XLNet, and GPT-2.
But there’s an issue I’m sure most of you will relate to: these Transformer-powered models are LARGE. They achieve state-of-the-art results, but they’re too expensive to train and run, putting them beyond the reach of most folks who want to learn and implement them.
This is where the Reformer model comes in. Reformer performs on par with these Transformer models, but it does so using far fewer resources and at a much lower cost.
The GitHub repository I’ve linked above contains a PyTorch implementation of Reformer. The author of the project has provided a simple but effective example, along with the entire code, to help you build your own model.
I encourage you to read about the inner workings of Reformer in the official research paper here.
You can install Reformer on your machine using the below command:
pip install reformer_pytorch
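If you want a quick feel for the library, here’s a minimal sketch adapted from the examples in the repository’s README. The ReformerLM class and its arguments follow that README, but treat the exact hyperparameters as illustrative rather than recommended settings:

```python
import torch
from reformer_pytorch import ReformerLM

# A small Reformer language model (hyperparameters are illustrative)
model = ReformerLM(
    num_tokens=20000,   # vocabulary size
    dim=512,            # model dimension
    depth=6,            # number of Reformer blocks
    max_seq_len=8192,   # Reformer handles long sequences efficiently
    heads=8,
    causal=True         # autoregressive language modeling
)

# One batch of random token ids at full sequence length
x = torch.randint(0, 20000, (1, 8192))

# Forward pass returns next-token logits of shape (1, 8192, 20000)
logits = model(x)
print(logits.shape)
```

Notice the 8,192-token sequence length – handling sequences that long in one pass is exactly where Reformer’s efficiency tricks pay off.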
The articles below are essential reading if you’re new to the Transformer architecture and the PyTorch framework:
I came across PandaPy last week and have already used it in my current project. It is a fascinating Python library with a lot of potential to become mainstream.
If you are working on a machine learning project with mixed data types (int, float, datetime, str, etc.), you should try out PandaPy instead of Pandas. It consumes roughly one-third less memory than Pandas for these data types!
“If you have smaller Pandas dataframes (<50K number of records) in a production environment, then it is worth considering PandaPy.”
Here are three key areas you’ll find interesting (I’ve taken these points verbatim from the PandaPy GitHub repository):
Install PandaPy using pip:
pip3 install pandapy
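To give you a quick taste, here’s a minimal sketch based on the usage shown in the PandaPy README. The pp.read loader comes from that README (I’m taking its exact behavior on trust), and the file path and column name below are hypothetical placeholders:

```python
import pandapy as pp

# pp.read is the CSV loader shown in the PandaPy README; the file name
# here is a hypothetical placeholder -- point it at a CSV with mixed types
prices = pp.read("stock_prices.csv")

# PandaPy stores the data as a NumPy structured array,
# so columns are accessed by field name, much like a DataFrame
print(prices.dtype.names)     # column names
print(prices["Close"][:5])    # first five values of a (hypothetical) column
```

The structured-array representation is where the memory savings come from: each column is stored as a typed NumPy field rather than a full Pandas block.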
If you still want to stick with Pandas, then check out the latest major release (v1.0.0) here.
What a brilliant GitHub repository! I’ve had a lot of aspiring data scientists reach out to me on LinkedIn asking about how to get started with geospatial analysis. It’s a very interesting field with petabytes of data available. We just need a structured approach to clean and analyze it.
This amazing repository is a collection of 300+ Jupyter notebooks that contain examples of using Google Earth Engine data.
Here’s a really cool GIF that demonstrates one of the visualizations you will generate using these notebooks:
These notebooks rely on three Python libraries to execute the code: earthengine-api (the Earth Engine Python client), folium (interactive maps), and geehydro (Earth Engine-style plotting functions for folium).
The GitHub repository contains plenty of examples with Python code to get you started. Dig in and have fun!
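To show the basic moving parts behind these notebooks, here’s a minimal sketch that renders an Earth Engine layer on a folium map using the standard earthengine-api calls. It assumes you have already run ee.Authenticate() once on your machine, and the tile_fetcher attribute follows recent versions of the earthengine-api:

```python
import ee
import folium

ee.Initialize()  # assumes you have already run ee.Authenticate()

# Load the SRTM digital elevation model and define how to style it
dem = ee.Image("USGS/SRTMGL1_003")
vis = {"min": 0, "max": 4000,
       "palette": ["006633", "E5FFCC", "662A00", "D8D8D8", "F5F5F5"]}

# Ask Earth Engine for a tile URL and add it to a folium map as a layer
map_id = dem.getMapId(vis)
m = folium.Map(location=[28, 84], zoom_start=5)
folium.TileLayer(
    tiles=map_id["tile_fetcher"].url_format,
    attr="Google Earth Engine",
    name="SRTM DEM",
    overlay=True,
).add_to(m)
folium.LayerControl().add_to(m)

m.save("dem_map.html")  # open this file in a browser to explore the map
```

The notebooks in the repository wrap this pattern in convenience functions, but the flow is the same: style an Earth Engine image, fetch its tile URL, and layer it onto an interactive map.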
Here’s an excellent article to get started with Geospatial Data:
Here’s another quality data visualization idea for you. The idea of automating the data exploration step has been floating around for a while, but no substantial framework has come out of it. Until now.
AVA, short for Automated Visual Analytics, is a framework by Alibaba that aims to make visual analytics AI-driven and automated.
Here’s a demo showing the power of AVA:
I highly recommend checking out the below resources to enhance and build your data visualization profile:
Reproducibility is a crucial aspect of any machine learning project these days, whether that’s in research or the industry. We need to track every test we perform, every iteration, and every parameter of our machine learning model, along with the results.
The Fast Neptune library enables us to quickly record all the information we need to launch our machine learning experiments. In other words, Fast Neptune is your answer to the reproducibility question you might have asked while reading the above paragraph.
Here are the features Fast Neptune offers to help us run quick experiments (quoted from the above link):
Pretty neat, right? Install Fast Neptune using just one line of code:
pip install fast-neptune
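I haven’t reproduced Fast Neptune’s own API here. Instead, to make the reproducibility idea concrete, below is a hand-rolled sketch of the kind of record a tool like Fast Neptune automates for you: parameters, metrics, and basic environment details saved alongside every run. All names in it are hypothetical:

```python
import json
import platform
import time
from pathlib import Path

def log_experiment(params, metrics, run_dir="runs"):
    """Hypothetical helper: save everything needed to reproduce a run,
    i.e. parameters, results, and basic environment information."""
    record = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "python_version": platform.python_version(),
        "params": params,
        "metrics": metrics,
    }
    path = Path(run_dir)
    path.mkdir(exist_ok=True)
    out_file = path / f"run_{int(time.time())}.json"
    out_file.write_text(json.dumps(record, indent=2))
    return out_file

# Example usage: record the parameters and result of one training run
log_experiment(
    params={"lr": 0.01, "n_estimators": 200, "max_depth": 6},
    metrics={"val_accuracy": 0.87},
)
```

A library like Fast Neptune takes care of this bookkeeping (plus code and metadata capture) so you don’t have to maintain helpers like this yourself.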
I wanted to highlight a couple of other major releases in January 2020 that you should be aware of:
2020 is off to a fast start in the machine learning space. The state-of-the-art continues to evolve at a rapid pace and it can become overwhelming for newcomers to keep up.
That’s why I publish these monthly articles where I aim to bring out the most relevant and useful open-source machine learning projects for our community.
Is there any other machine learning project or framework you want to highlight? I would love to hear your thoughts and ideas in the comments section below. Let’s connect and brainstorm together.
Very insightful and a crisp gist!
Hey Preeti, I'm glad you found this article useful!
Hello :) Nice list. I would like to see more tools, though they are hard to find. Some hard tasks are:
- Lint and debug tools for data processing, pre-processing, and analysis: automated code analysis tools, ML visualization of code, etc.
- Dimension-transcending tools, allowing multiplexing or demultiplexing of dimensions in ML data: 2D input to 3D output, linear text to sets of 2D arrays, etc.
- Polynomial decomposition and root-finding tools: it is very useful to express a neural network as sets of LFSR-defined (linear feedback shift register) polynomials, as this is what is easy to encode straight into chips. Weights can be very long and defined by the structure of the LFSR. While only the military can order big wafers packed with LFSRs trained on a cluster, amateurs can still use FPGAs, which have enough capacity nowadays to allow fairly complex tasks.
- Associative array extraction tools: associative arrays can be implemented on fairly simple hardware (and clusters), and their design can be predefined to follow specific axioms, which makes them not only debuggable but also semantic. Most datasets and neural networks can be translated into an associative array, given the proper toolset.
- CAM tools for circuit design: most tools are proprietary, but a few simple open-source tools can be found.
Sorry for the somewhat chaotic list, writing while traveling :) Greetings