Projects play a HUGE part in cracking data science interviews. I’ve personally conducted over a hundred interviews in the last year and, quite often, the final round comes down to the quality of the candidate’s data science projects. This is especially relevant for newcomers and freshers in data science.
What kind of projects have you picked up? How did you perform on these projects? Did you beat the benchmark model? Did you experiment with the source code and build something different?
These are critical questions that might make or break your data science interview. I always encourage folks to take up a diverse range of data science projects and learn as much as possible from them.
I will cover 6 such open-source data science projects in this article. I love putting this out at the start of every month (this is the 25th edition!). You’ll see a broad range of projects here, from performing computer vision tasks using MS Excel to drawing up a unique visualization in R.
You can check out the entire archive of open source data science projects here. And here’s the collection I picked out last month.
What’s the last MAJOR development you remember from the computer vision space? I’ve come across articles recently saying we’ve hit the proverbial deep learning wall – and there is no way up from there.
I respectfully disagree with this. There is a LOT more to uncover and unpack in deep learning (and computer vision in particular). If you’re wondering where I’m getting this level of confidence from, wait till you check out the open-source deep learning projects for computer vision below!
There are more jobs in deep learning and computer vision than ever before, and that trend is likely to keep growing in 2020. Time to get on board and polish up your computer vision skills!
You should check out the below resources to get started with deep learning and computer vision:
Real-time object detection has really gathered pace in the last year or so. I love the different applications we can design using real-time object detection, such as tracking a football or a player during a game.
Now here’s a really cool Hollywood-level computer vision project – removing people from complex backgrounds in real-time using deep learning! The developers of this project have used TensorFlow.js to build their model.
Check out this example:
This was done in real-time in a web browser! That’s the beauty of TensorFlow.js. The GitHub repository I’ve linked above contains the code to implement the project on your own machine.
Here are a couple of in-depth computer vision tutorials to get you started with these concepts:
I love this open-source computer vision project! This one is for all the folks who have written off Excel as just a spreadsheet tool. The machine learning team at Amazon has come up with this rather cool project that shows us how to perform basic computer vision tasks in Microsoft Excel.
You can detect faces and find edges and lines using the tutorial provided in the project on GitHub. Here’s a quick look at what you’ll be building in Excel:
You don’t need any background in computer vision to work on this project. You will, however, need to know at least how a weighted average is calculated (and knowledge of Excel is required, of course).
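If the weighted average part sounds rusty, here is a minimal Python sketch of the idea. I'm assuming the Excel tutorial leans on this kind of neighbourhood weighting, but the code below is my own illustration and is not taken from the repository. The `image` grid, the `kernel` weights, and both function names are made up for the example.

```python
def weighted_average(values, weights):
    """Return sum(value * weight) / sum(weight) for paired values and weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


def smooth(image, kernel):
    """Replace each interior cell with the weighted average of its 3x3 neighbourhood.

    Border cells are skipped, so the output is two rows and two columns smaller.
    """
    flat_kernel = [w for row in kernel for w in row]
    out = []
    for r in range(1, len(image) - 1):
        out_row = []
        for c in range(1, len(image[0]) - 1):
            # Collect the 3x3 patch in the same row-major order as the kernel.
            patch = [image[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out_row.append(weighted_average(patch, flat_kernel))
        out.append(out_row)
    return out


# Hypothetical 4x4 grayscale "image": dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hypothetical blur weights: the centre cell counts most, the corners least.
kernel = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]

blurred = smooth(image, kernel)
print(blurred)  # [[2.25, 6.75], [2.25, 6.75]]
```

The hard jump from 0 to 9 becomes a softer step from 2.25 to 6.75. Swap in a different set of weights and the same loop can highlight edges instead of blurring them, which is why a spreadsheet full of cell formulas is enough to get started.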
So whether you’re a newcomer in deep learning and computer vision, or are coming from a software development background, this project is for you! Go ahead and try it out on your own machine and let me know about the crazy applications you build.
Here are a couple of resources to learn MS Excel:
Here are a few non-computer vision and non-deep learning projects I wanted to highlight. These cover a range of data science topics, from data visualization in R to the importance of software engineering in machine learning.
If you’re looking for a comprehensive, end-to-end course on machine learning, look no further!
An R project! It’s a miracle! I’m a heavy R user and I love working with the wonderful ggplot2 library – but there haven’t been a lot of recent updates to report. So I was thrilled when I came across ggbump last month.
ggbump is an R visualization package for, you guessed it, creating bump charts. Here’s an example of what you can draw using ggbump:
Bump charts are typically used to compare two dimensions against each other using one measure value (all you Tableau folks will understand this!). The majority of use cases focus on exploring the changes in the rank of a value over time (like the bump chart above).
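To see what actually goes into a bump chart, here is a small Python sketch that turns raw values into ranks per time period. This is only the data preparation step, it does not use ggbump, and the `scores` data and the `ranks_by_period` helper are invented for illustration.

```python
def ranks_by_period(scores):
    """Convert {period: {name: value}} into ranks per name, where 1 is the highest value.

    Assumes every period contains the same set of names.
    """
    periods = sorted(scores)
    names = sorted(scores[periods[0]])
    ranking = {name: [] for name in names}
    for period in periods:
        # Order the names from highest to lowest value within this period.
        ordered = sorted(names, key=lambda n: scores[period][n], reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranking[name].append(position)
    return periods, ranking


# Made-up yearly scores for three hypothetical entities.
scores = {
    2017: {"A": 50, "B": 40, "C": 30},
    2018: {"A": 45, "B": 55, "C": 35},
    2019: {"A": 40, "B": 60, "C": 65},
}

periods, ranking = ranks_by_period(scores)
for name, ranks in ranking.items():
    print(name, ranks)
# A [1, 2, 3]
# B [2, 1, 2]
# C [3, 3, 1]
```

Each list is one line on the chart: the x-axis is the period, the y-axis is the rank, and the crossings between lines are the "bumps".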
ggbump isn’t on CRAN yet but you can install it directly in R using the below command:
devtools::install_github("davidsjoberg/ggbump")
Here are a few resources to get you started with data visualization in R and Tableau:
I’m a bibliophile so naturally, Goodreads is my go-to platform for anything related to books. I rely on it heavily for recommendations, book reviews, and much more.
So imagine my joy when I came across this awesome project on GitHub! This is an end-to-end Goodreads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
The Goodreads data pipeline consists of the modules below:
I encourage you to go through the below tutorial on building your own machine learning pipeline using sklearn:
This is a fascinating project. Graphs can appear to be daunting at first, but once you get an idea of how they work, you’ll love working with them.
Graph neural networks (GNN) are behind applications like social media network analysis, knowledge trees, recommendation systems, and much more.
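If graphs still feel abstract, here is a rough Python sketch of the core idea behind most GNNs: each node updates its features by combining them with its neighbours' features. This is a simplified illustration and not the tf2_gnn API. A real layer would also apply learned weights and a non-linearity, and the node names, features, and the `message_passing_step` function are all hypothetical.

```python
def message_passing_step(features, edges):
    """Run one round of mean-aggregation message passing on an undirected graph.

    features: {node: [float, ...]} with vectors of equal length.
    edges: list of (u, v) pairs.
    Each node's new vector is the mean of its own vector and its neighbours' vectors.
    """
    neighbours = {node: [] for node in features}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)

    updated = {}
    for node, own in features.items():
        group = [own] + [features[n] for n in neighbours[node]]
        # Average each feature dimension across the node and its neighbours.
        updated[node] = [sum(column) / len(group) for column in zip(*group)]
    return updated


# A tiny made-up social graph: ann - bob - cat.
features = {"ann": [1.0, 0.0], "bob": [0.0, 1.0], "cat": [0.0, 0.0]}
edges = [("ann", "bob"), ("bob", "cat")]

updated = message_passing_step(features, edges)
print(updated["ann"])  # [0.5, 0.5]
print(updated["cat"])  # [0.0, 0.5]
```

After one step, "cat" has picked up information from "bob". Run a second step and information from "ann" would reach "cat" as well, which is how stacking layers lets a GNN see further across the graph.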
The GitHub repository I’ve linked above provides the implementation of various flavors of graph neural networks in TensorFlow 2.0. You have a few training script examples in the repository as well to get you on your way.
You can install the Python library from pip:
pip install tf2_gnn
I’ve provided resources below to help you understand the various concepts behind graph neural networks:
Software engineering is a very underrated part of the machine learning pipeline. Experts don’t discuss it, courses don’t usually cover it, and data science aspirants don’t study it.
And yet, when you sit for a data science interview, you’ll inevitably face a ton of software engineering questions. How do you set up a machine learning pipeline? What is model deployment? And so on.
This wonderful repository offers a curated list of tutorials that cover software engineering best practices for building machine learning applications. Here’s what the repository currently covers:
Trust me, software engineering is a must-have skill on your data science resume. You need to get on board with this and start picking up these skills.
My pick of the above open-source projects:
I’ve already started working on these two on my own and would be happy to share the progress and code with the community! Let me know in the comments section below which project you’ll be picking up this month.