So you’ve decided data science is the field for you. More and more businesses are becoming data driven, the world is becoming increasingly connected, and it looks like every business will need a data science practice. So, the demand for data scientists is huge. Even better, everyone acknowledges the shortfall of talent in the industry.
However, becoming a data scientist does not come easy. It needs a mix of problem solving, structured thinking, coding and various technical skills among others to be truly successful. If you are from a non-technical and non-mathematical background, there’s a good chance a lot of your learning happens through books and video courses. Most of these resources don’t teach you what the industry is looking for in a data scientist.
This is one of the reasons why aspiring data scientists are struggling to bridge the gap between self education and real-world jobs.
In this article, I discuss the top mistakes amateur data scientists make (I have made some of them myself). I have also provided resources wherever applicable with the aim of helping you avoid these pitfalls on your data science journey.
Additionally, if you’re just starting out in data science or struggling to make headway, I would recommend this awesome and comprehensive program: Certified Program on Data Science for Beginners (with Interviews).
As I mentioned in my article on AV’s practice problems – it’s good to get a grasp of the theory behind machine learning techniques. But if you don’t apply them, they are only theoretical concepts. When I started out learning data science, I made the same mistake – I studied books and online courses but didn’t always apply them to solve a problem.
So when I was faced with a challenge or problem where I had the chance to apply all that I had learned, I couldn’t remember half of it! There’s so much to learn – algorithms, derivations, research papers, etc. There’s a high chance you’ll lose your motivation halfway through and give up. I have personally seen this happen to a lot of people who attempt to enter this field.
It’s imperative that your learning process be a healthy balance between theory and practice. As soon as you learn a concept, head over to Google and find a dataset or problem where you can use it. You’ll find that you retain that concept far better than before. You can also use AV’s DataHack platform to take part in practice problems and ongoing competitions.
You will have to accept that you cannot learn everything in one go. Fill in the gaps as you practice and you will learn a whole lot more!
The majority of folks who want to become a data scientist are inspired by videos of robots, or awesome predictive models, and in some cases even the high salaries. Sadly (sorry to disappoint!), there is a long road you need to travel before you get there.
You should get to know how techniques work before you apply them in a problem. Learning this will help you understand how an algorithm works, what you can do to fine tune it, and will also help you build on existing techniques. Mathematics plays an important role here so it’s always helpful to know certain concepts. In a day-to-day corporate data scientist role you may not need to know advanced calculus, but having a high-level overview definitely helps.
In case you have a curious mind, or want to get into a research role, the four key components you need to know before diving into core machine learning are:
Just as a house is built brick-by-brick, a data scientist is also the sum of all the individual parts. There are tons of resources out there which will help you learn these topics. I have mentioned one resource of each topic below which should get you started:
You can also check out Analytics Vidhya’s ‘Introduction to Data Science‘ course which includes a comprehensive module on statistics and probability.
Ah, the pet-peeve of hiring managers and recruiters. Ever since data science became ultra popular, certifications and degrees have cropped up just about everywhere. A glance through my LinkedIn feed shows up at least 5 certification images proudly being displayed. While achieving that certification is no easy feat, relying solely on it is a recipe for disaster.
There are too many of these courses online being pored over and completed by thousands upon thousands of aspiring data scientists. If they ever added a unique value to your data science CV, that is no longer the case. Hiring managers do not care much for these pieces of paper – they place far more emphasis on your knowledge, and how you’ve applied it in real-life practical situations.
This is because dealing with clients, handling deadlines, understanding how a data science project lifecycle works, how to design your model to fit into the existing business framework – these are just some of the things you will need to know to succeed as a data scientist. Just a certification or degree will not qualify you for it.
Don’t get me wrong – certifications are valuable, but only when you apply that knowledge outside the classroom and put it out in the open. Use real-world datasets and whatever analysis you do, make sure you write about it. Create your own blog, post it on LinkedIn, and ask for feedback from the community. This shows that you are willing to learn and are flexible enough to ask for suggestions and work them into your projects.
You should be open to the idea of internships (regardless of your experience level). You will learn a lot about how a data science team works, which will benefit you when you sit for another interview.
If you’re looking for that next project, you’ve come to the right place. We have an awesome list of projects here divided by the degree of difficulty. Get started NOW.
This is one of the biggest misconceptions aspiring data scientists have these days. Competitions and hackathons provide us with datasets that are clean and spotless (okay – I went a little overboard, but you get the hang of it). You download them, and start working on the problem. Even those datasets that have columns with missing values don’t require you to work your brain cells off – figure out an imputation technique and fill in the blanks.
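To make the "figure out an imputation technique and fill in the blanks" step concrete, here is a minimal sketch of median imputation using only the Python standard library. The `impute_median` helper and the `ages` column are hypothetical, made up for illustration, and median filling is just one of several possible techniques.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column of applicant ages with two missing entries.
ages = [25, None, 40, 31, None, 58]
filled_ages = impute_median(ages)
print(filled_ages)
```

On a competition dataset this is often all the cleaning a column needs, which is exactly why it is a poor rehearsal for messy real-world data.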
Unfortunately, real-world projects don’t work like that. There is an end-to-end pipeline which involves working with a bunch of people. You will almost always have to work with messy and unclean data. The old saying about spending 70-80% of your time just collecting and cleaning data is true. It’s the grueling part and you will (most likely) not enjoy it, but it’s something that eventually becomes part of a routine.
Also, and we will cover this in more detail in the next point, the simpler model will win precedence over any complex stacked ensemble model. Accuracy isn’t always the end goal, and this is one of the most contrasting things you’ll learn on the job.
One of the key factors to negate this misunderstanding is, ironically, experience. The more experience you gain (internships help a lot in this case), the better you’ll be able to distinguish between the two. This is where social media comes in handy – reach out to data scientists and ask them about their experience.
Additionally, I suggest going through this Quora thread where data scientists from around the world provide their input on this exact question. Getting a good score on a competition leaderboard is excellent for measuring your learning progress, but interviewers will want to know how you can optimize your algorithm for impact, not for the sake of increasing accuracy. Learn about how a data science project works, what different types of roles a team has (from a data engineer to a data architect), and structure your answer in that sense.
Go through this LinkedIn post which explains the standard methodology for analytical models.
As mentioned above, accuracy isn’t always what the business is after. Sure a model that predicts loan default with 95% accuracy is good, but if you can’t explain how the model got there, which features led it there, and what your thinking was when building the model, your client will reject it.
You will rarely, if ever, find a deep neural network being used in commercial applications. It’s just not possible to explain to the client how a neural network (let alone a deep one) worked with hidden layers, convolutional layers, etc. The first priority is, and will always be, ensuring that we are able to understand what’s going on underneath the model. If you can’t tell whether age, or number of family members, or previous credit history went into rejecting a loan application, how will the business run?
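As a rough illustration of what an explainable model gives you, the sketch below breaks a single loan decision from a logistic regression into per-feature contributions. The coefficients, the intercept and the applicant are all invented for this example and do not come from any real model; the point is only that each feature's share of the decision can be read off directly.

```python
import math

# Hypothetical coefficients of an already-fitted logistic regression that
# predicts the log-odds of rejecting a loan. The values are made up.
INTERCEPT = -1.0
COEFFICIENTS = {"age": -0.02, "family_members": 0.30, "past_defaults": 1.20}

def explain(applicant):
    """Return the rejection probability and each feature's log-odds contribution."""
    contributions = {
        name: COEFFICIENTS[name] * applicant[name] for name in COEFFICIENTS
    }
    log_odds = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return probability, contributions

# Hypothetical applicant.
applicant = {"age": 30, "family_members": 4, "past_defaults": 2}
probability, contributions = explain(applicant)

print(f"rejection probability: {probability:.2f}")
# List features from most to least influential for this applicant.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>15}: {value:+.2f} log-odds")
```

Here previous defaults dominate the decision, which is the kind of statement a client can act on and a stacked ensemble cannot easily provide.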
Another key aspect is whether your model will fit within the organization’s existing framework. Using 10 different types of tools and libraries will fail spectacularly if the production environment cannot support it. You will have to redesign and retrain the model from scratch with a simpler approach.
The best way to prevent yourself from making this mistake is speaking to people working in the industry. There is no better teacher than experience. Pick a domain (finance, HR, marketing, sales, operations, etc.) and reach out to people to understand how their project works.
Apart from that, practice making simpler models and then explaining them to non-technical people. Then add complexity to your model and keep doing this until even you don’t understand what’s going on beneath. This will teach you when to stop, and why simple models are always given preference in real-life applications.
If you have done this before, you will know what I’m talking about. If your resume currently has this problem, rectify it immediately! You may know a plethora of techniques and tools but simply listing them down will turn off potential hiring managers.
Your resume is a profile of what you have accomplished and how you did it – not a list of things to simply jot down. When a recruiter looks at your resume, they want to understand your background and what you have accomplished in a neat, summarized manner. If half the page is filled with bare data science terms like linear regression, XGBoost, LightGBM, without any explanation, your resume might not clear the screening round.
The simplest way to eliminate resume clutter is to use bullet points. Only list the techniques which you have used to accomplish something (could be a project or a competition). Write a line about how you used it – this helps the recruiter understand your thinking.
When you’re applying for fresher or entry-level jobs, your resume needs to reflect the potential impact you can add to the business. You will be applying to roles in different domains so perhaps having a set template will help – just change the story to reflect your interest in that particular industry.
This article by Kunal Jain is an excellent resource for preparing an outstanding CV for data science roles.
Let’s take an example to understand why this is a mistake. Imagine you’ve been given a dataset on house prices and you need to predict the value of future real estate. There are over 200 variables, including number of buildings, rooms, number of tenants, family size, size of the courtyard, whether faucets are available, etc. There’s a good chance you might not be aware of what some variables mean. You can still build a model with a good accuracy, but you have no idea why a certain variable was dropped.
As it turns out, that variable was a crucial element in a real-world scenario. It’s a calamitous mistake.
Having a solid knowledge of tools and libraries is excellent, but it will only take you so far. Combining that knowledge with the business problem posed by the domain is where a true data scientist steps in. You should be aware of at least the basic challenges in the industry you are interested in (or are applying to).
There are plenty of options to explore here:
Data visualization is such a wonderful facet of data science, yet a lot of aspiring data scientists prefer to skim over it and get to the model building stage. This approach might work out in competitions, but is bound to fail in a real job. Understanding the data you’re given is the single most important thing you will do, and your model’s results will reflect that.
By spending time on getting to know the dataset and trying out different charts, you will gain a deeper knowledge of the challenge or problem you’ve been tasked with solving. You’d be surprised to know how much insight you can gain just by doing this! Patterns and trends emerge, stories are told and the best part? Visualizations are the best way to present your findings to the client.
As a data scientist, you need to be inherently curious. It’s one of the great things about data science – the more curious you are, the more questions you’ll ask. This leads to a much better understanding of the data you are given and also helps solve problems you didn’t know existed in the first place!
Practice! Next time you work on a dataset, spend more time on this step. You will be stunned at the amount of insight it will generate for you. Ask questions! Ask your manager, ask domain experts, search for solutions on the internet and if you don’t find any, ask on social media. So many options!
To help you get started, I have mentioned a few resources below which you should refer to:
Structured thinking helps a data scientist in many ways:
There are many more reasons why having a structured thinking mindset helps. As you can imagine, not having a structured thinking mindset is counterproductive. Your work and approach to a problem will be haphazard, you will lose track of your own steps when faced with a complex problem, etc.
When you go for a data science interview, you will inevitably be given a case study, guesstimate and puzzle problem(s). Because of the pressure-filled atmosphere in an interview room and the time constraint, the interviewer looks at how well you structure your thoughts to arrive at a final result. In many cases, this can be a deal breaker or deal sealer for getting the job.
You can acquire a structured thinking mindset through simple training and a disciplined approach. I have listed a few articles below which will help you get started on this crucial aspect:
I’ve seen this one too many times. Because of the dilemma and the unique features each tool offers, people tend to attempt learning all the tools at once. This is a bad idea – you will end up mastering none of them. Tools are a means to perform data science, they are not the end goal.
Pick one tool and stick to it until you have mastery over it. If you’ve already started learning R, then don’t be tempted by Python (yet). Stick with R, learn it end-to-end and only then try to incorporate another tool into your skillset. You will learn more with this approach.
Each tool has a great user community which you can tap into whenever you get stuck. Use our discussion forum to ask questions, search stuff online, and don’t give up. The aim is to learn data science through the tool, not the tool through data science.
If you are still undecided on which tool you should use, check out this wonderful article which lists down each tool’s advantages and shortcomings (it also includes SAS in case you are interested in that).
This one applies to all data scientists, not just freshers. We have a tendency to get distracted easily. We study for a period of time (say, a month), then we give it a break for the next 2 months. Trying to get back into the groove of things after that is a nightmare. Most of the earlier concepts are forgotten, notes are lost and it feels like we just wasted the last few months.
I have personally experienced this as well. Due to various things we have going on, we find excuses and reasons not to get back to studying. But this is eventually our loss – if data science were as easy as opening a textbook and cramming everything, everyone would be a data scientist today. It demands consistent effort and learning, something which people don’t appreciate until it’s too late.
Set goals for yourself. Map out a time table and stick it on your wall. Plan how and what you want to study and set deadlines for yourself. For example, when I wanted to learn about neural networks, I gave myself a couple of weeks and then tested what I’d learned by competing in a hackathon.
You have decided to become a data scientist so you should be ready to put in the hours. If you continually keep finding excuses not to study, this might not be the field for you.
This is a combination of a few things we’ve seen in the above points. Aspiring data scientists tend to shy away from posting their analysis online in fear of being criticized. But if you don’t receive feedback from the community, you will not grow as a data scientist.
Data science is a field where discussions, ideas and brainstorming are of utmost importance. You cannot sit in a silo and work – you need to collaborate and understand other data scientists’ perspectives. Similarly, people don’t take part in competitions because they feel they won’t win. This is the wrong mindset! You participate in these competitions to learn, not to win. Winning is a bonus, learning is the goal.
It’s fairly straightforward – start participating in discussions and competitions! It’s okay to not come in the top 5%. If you learn a new technique out of the whole thing, you have won in your own right.
Communication skills are among the most under-rated and least talked-about skills a data scientist absolutely MUST possess. I am yet to come across a course that places a solid emphasis on this. You can learn all the latest techniques, master multiple tools and make the best graphs, but if you cannot explain your analysis to your client, you will fail as a data scientist.
And not just clients, you will also be working with team members who are not well versed with data science – IT, HR, finance, operations, etc. You can be sure that the interviewer will be monitoring this aspect throughout.
Assume you’ve built a credit risk model using logistic regression. As a thought exercise, take a minute to think how you would explain to a non-technical person how you came to the final conclusion. If you used any technical words, you need to work on this ASAP!
Most data scientists these days are coming from a computer science background so I understand this can be a daunting skill to acquire. But to become a successful data scientist and climb up the ladder, you don’t have a choice but to polish this part of your personality.
One of the things I find most helpful is explaining data science terms to a non-technical person. It helps me gauge how well I have articulated the problem. If you’re working in a small to medium-sized company, find a person in the marketing or sales department and do this exercise with them. It will help you immensely in the long term.
There are plenty of free resources available on the internet to get you started but remember, practice is key when it comes to soft skills. Ensure you start doing this TODAY.
This is most definitely NOT an exhaustive list – there are plenty of other mistakes aspiring data scientists tend to make. But these were the most common ones I have seen and my aim, as stated earlier, is to help others avoid them (as much as possible).
I would love to hear your thoughts on these pointers, and also your personal experience with similar problems. Use the comments section below to let me know!
Great observations Pranav and the way you have penned down the same is just amazing. Keep up the good work.
Thank you pranav... Now this will help me a lot
This article is too helpful for pioneers