Exclusive Interview with Data Scientist – Bishwarup Bhattacharjee (Analytics Vidhya Rank 8)

Kunal Jain Last Updated : 23 Nov, 2016

11 min read

Introduction

Energy and persistence conquer all things!

– Benjamin Franklin

Bishwarup Bhattacharjee, Senior Data Scientist at Decision Minds, is an epitome of persistence and hard work. The road to becoming a data scientist is tedious. It requires sheer perseverance and a lot of hard work. Bishwarup’s journey not only tells us how to make a career in data science, but also how to become one of the best.

Bishwarup completed his Bachelors in Statistics from University of Calcutta. His experience varies from being a data analyst to being an independent analytics consultant and finally joining a startup as a data scientist. He has won several competitions on Analytics Vidhya and is currently ranked 8th on our datahack platform.

He is a huge inspiration for all of us and one of the best minds I have come across in the analytics industry. We wanted to know more about his journey and what kept him going. So, we conducted an exclusive interview with him.

Here are the excerpts from my conversation with him!

KJ: First of all, I would like to sincerely thank you for devoting time to this interview. Kindly tell us about yourself and how you started your career in analytics.

Bishwarup: I am glad to have this opportunity to present myself to such a fascinating group of professional and aspiring data scientists, and I really thank you and the AnalyticsVidhya team at large for that. I come from a statistical background. I always liked the part where statistical methods are used to solve real-life problems, like driving the growth of a business or facilitating various workflows in a large organization. On the other hand, I’ve always had a knack for learning different programming languages and I am still learning. I think that helped me a lot to get going and made my journey as a data scientist very interesting. I started as a Data Analyst in a product-based start-up. However, not long after, I started providing independent consulting services.

KJ: When and why did you decide to start your own consultancy?

Bishwarup: While working as an independent consultant, I made quite a few contacts with offshore clients who wanted to work with me in a confidential manner and without the hassle of going through different freelancing web portals. In the beginning, I would take care of most of their requirements by myself, but going ahead I got involved with a number of long-term projects where time management became an issue. I was working almost 16 hours a day at one point. So, I thought of reaching out to a few like-minded people I knew who could potentially help me in this regard. All of us together thought that it would be better to work as a team rather than as a number of people working on their own. So, I went ahead and registered a business. The business was good in its initial years. But today almost every analytics service is being automated and we too took a pretty bad hit. So very recently, I thought of moving on from what I was doing, and at this moment I am working with Decision Minds Pvt Ltd, a US-based start-up, in the role of a Senior Data Scientist.

KJ: Tell us 3 things life has taught you in your journey from a data analyst to the founder of a company and back to the corporate world again.

Bishwarup:

It’s crucial to spend time with yourself to be very clear about what you want to do next, weigh the different pros and cons, and assess the present and future situations.
Analytics is a very dynamic field of work; a lot has been achieved in the last decade which completely changed the landscape of how different organizations think of leveraging it for their profitability. That is why I think it’s necessary to keep a close eye on the latest technologies and developments and to have an idea of where we are heading in the next 5 years.
Whenever you take a big stride with a lot of risk both for your career and your financial stability, you must have a back-up plan. Data Science is full of opportunities and it will stay so, no matter how it shapes up in the time to come. It’s about finding a suitable place for yourself and preparing in advance.

KJ: Tell us a bit more about the challenges you faced in your journey. How did you overcome them?

Bishwarup: The most critical challenge, I think you would agree, is to make up your mind when thinking of something out of the box. Apart from that, I had a financial constraint at one point, which got better with time. Also, I would like to mention that when I started my own consultancy services.

KJ: You are currently ranked 74th on Kaggle among more than 50,000 people. Amazing feat. Please describe your journey.

Bishwarup: I joined Kaggle almost a couple of years back. At that time, I was exploring potential ways of enhancing my skills in data science, not just by enrolling in some online course but through something that would provide me with hands-on experience of what it takes to deal with large-scale data. I found Kaggle very useful and I am really glad that I kept myself involved there. My first competition on Kaggle was Springleaf Marketing, and the data was too large to fit into the 4GB RAM laptop that I had at that point. I was confused by the advanced discussions going on in the forums and I had little idea how to approach the problem efficiently. However, I went ahead and rented an AWS instance to implement whatever I could learn, and we finally ended up at the 27th position on the private leaderboard among 2226 teams. I was pretty satisfied with my effort, and since then I’ve always believed I can do better in every competition in which I take part. That is what helped me learn a lot of new things, including how to preprocess large data files in a number of different ways, stacked generalization and many more. From my personal experience, I have seen people who think of platforms like Kaggle, AnalyticsVidhya, KDD etc. as just organizers of fun competitions. However, if you ask me, I would rate these platforms even higher than attending a course on Coursera or Udacity. These platforms promote self-learning which, in my opinion, is the best way to master a subject.

KJ: Recently, you’ve won various other data science competitions including AV Hackathons, CrowdAnalytix etc. I must say you’ve got the Midas touch. Is there any structure / formula / framework you follow to build this winning streak?

Bishwarup:

Well, there is no secret formula or shortcut one can take to just pop up at the top. It’s all hard work and perseverance. Most of the people who participate in these types of online contests are working full-time with other companies and yet they consistently perform well in the competitions. So, it’s also about priorities and depends on how you want to spend your free time after office or on the weekend. I would rather understand Owen Zhang’s leave-one-out randomized and shrunk encoding of categorical variables and write my own implementation of that instead of going to a movie.
Apart from that, I think it’s essential to think of the different ways a problem can be modelled. For example, in the BNP Paribas competition where we ranked 2nd, we had a total of more than 300 models with different data preparation and hyper-parameters. As one can understand, that took us more than just a magic touch to crack. Besides being diligent, I think it is also crucial to keep track of the latest developments taking place in the field of data science and understand their applications in real-life situations.
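
To make the encoding idea mentioned above concrete, here is a minimal sketch of one common reading of leave-one-out target encoding with shrinkage and random noise. It is not Owen Zhang's exact implementation, and it is not taken from the interview: the function name `loo_encode`, the parameters `noise` and `prior_weight`, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def loo_encode(categories, targets, noise=0.0, prior_weight=1.0, seed=0):
    """Leave-one-out target encoding for training rows (illustrative sketch).

    Each row is encoded with the mean target of the *other* rows sharing its
    category, shrunk toward the global mean, then multiplied by random noise.
    `prior_weight` must be > 0 so that singleton categories stay defined.
    """
    rng = random.Random(seed)
    global_mean = sum(targets) / len(targets)

    # Per-category sum and count of the target.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1

    encoded = []
    for c, y in zip(categories, targets):
        n_others = counts[c] - 1      # leave the current row out
        sum_others = sums[c] - y
        # Shrinkage: acts like `prior_weight` pseudo-rows at the global mean.
        value = (sum_others + prior_weight * global_mean) / (n_others + prior_weight)
        # Multiplicative noise in [1 - noise, 1 + noise]; noise=0 disables it.
        value *= 1.0 + rng.uniform(-noise, noise)
        encoded.append(value)
    return encoded


# Toy data, made up for this example.
cats = ["a", "a", "a", "b", "b", "c"]
ys = [1, 0, 1, 0, 0, 1]
plain = loo_encode(cats, ys, noise=0.0)
noisy = loo_encode(cats, ys, noise=0.05)
```

Excluding the current row keeps a row's own label from leaking into its feature, the shrinkage keeps rare categories close to the global mean, and the noise makes the encoded training values look a little more like what the model will see on unseen data. Test rows would normally be encoded with the full per-category statistics from the training set, without the leave-one-out step or the noise.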

KJ: How do you decide which Kaggle competitions to participate in?

Bishwarup: Given the time, I would like to participate in all of the competitions, as each of them helps one learn something new. However, there is a resource and time constraint on my side, so before entering a contest I like to think about the amount of time I would probably be able to invest in it. There is no point in taking part in a competition only to copy some forum scripts and make a number of submissions.

There are also a lot of competitions related to computer vision held on Kaggle, but I don’t know much about that subject. I would really like to learn it in the near future and participate in such competitions.

KJ: Which mode do you prefer in a competition? Team or Self?

Bishwarup: I personally prefer competing solo, mainly because after the end of the competition you get to know what you could possibly have done differently to make your model better. But participating in a team is also advantageous for a number of reasons. For one, you get to learn different ideas and concepts from your teammates. At the same time, it offers really good scope for ensembling different approaches and climbing the leaderboard.

KJ: According to you, what is the ideal approach to solving problems in these competitions?

Bishwarup: I don’t think that there is a one size fits all solution here as the problems are quite varied and they come in their own flavours. However, certain things are common across them. For example,

Knowing your data is necessary. By knowing the data I mean a few things – what the target distribution is, how the target relates to different covariates, how much signal you can extract from the data and how much noise there is, the structure of the NULL values and so on. One can start with exploratory data analysis (EDA), which is always very helpful in this regard. Even just looking at the data in a spreadsheet application like MS Excel and doing some basic colour coding, pivoting and conditional formatting proves quite beneficial at times.
From the point of view of performing well in a competition, it is also essential to know the scoring criteria, as different competitions use different evaluation metrics. Optimizing Root Mean Square Error (RMSE) is not the same as minimizing Mean Absolute Error (MAE).
One more point which I would like to highlight is cross-validation. Establishing a robust cross-validation strategy is indispensable if you are to rank high on the private leaderboard. How many folds you want to validate on, whether you want to make the splits stratified by target values or by any other variable, or whether you want a time-based validation, is to be decided based on the structure of the competition. The main idea is to imitate the test data in your validation set as closely as possible.
Hyper-parameter tuning – it can offer a lot of lift if you use it right.
Weak learners make strong learners – one must use ensembling techniques whenever and wherever possible. Merging diverse models offers potential gain almost every time.
Thinking out of the box – what can you possibly do with the data that is not very conventional? Never be afraid of trying things that apparently do not make a lot of sense – at times they turn out to be helpful and set you apart.
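
The point about evaluation metrics can be illustrated with a small sketch. The numbers below are made up and the helper names `rmse` and `mae` are only for this example. It compares two constant predictions on a target with one large outlier: the mean, which is the best constant under RMSE, and the median, which is the best constant under MAE.

```python
import statistics


def rmse(y_true, y_pred):
    """Root mean square error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up target with one large outlier.
y = [1, 2, 3, 4, 100]

mean_pred = [statistics.mean(y)] * len(y)      # constant prediction at the mean
median_pred = [statistics.median(y)] * len(y)  # constant prediction at the median

print("RMSE  mean:", rmse(y, mean_pred), " median:", rmse(y, median_pred))
print("MAE   mean:", mae(y, mean_pred), " median:", mae(y, median_pred))
```

On this data the mean wins on RMSE and the median wins on MAE, so a model tuned for one metric can be clearly suboptimal on the other. That is the reason to check the scoring rule before choosing a loss function.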

KJ: Which techniques / algorithms do you think are the most important to learn to give a tough fight in these competitions? Why?

Bishwarup:

eXtreme Gradient Boosting (xgboost): It’s a fancy implementation of conventional Gradient Boosting which offers higher accuracy, more flexibility in terms of the hyperparameters one can tune and shorter training time. It also offers online learning, which is the way to go in case your data doesn’t fit into memory. It can be applied to both classification and regression tasks.
Artificial Neural Network (ANN): Can efficiently model complex patterns in the data given the right architecture. I encourage using Keras in Python for this – it has both Theano and TensorFlow backends and offers huge flexibility in terms of implementation, parameter tuning, optimization, checking overfitting and training on sparse data. As a side note, it performs especially well in case of multiclass classification.
Elastic Net: It might not offer as much accuracy as the previous two, but it is fantastic in terms of feature selection. Especially when you have a very sparse set of features, variable selection through elastic net can boost the score by quite a lot.
Apart from the above, I think there are a lot of other algorithms which are good to know as they perform really well on certain types of data sets – like Vowpal Wabbit, Factorization Machines, libFFM, Regularized Greedy Forest (RGF) etc.
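
To show why elastic net is useful for variable selection, here is a minimal pure-Python sketch of coordinate descent for the elastic net objective. It is only an illustration under stated assumptions: no intercept, features that are already centred and on comparable scales, and a tiny made-up dataset in which the target depends strongly on feature 0, weakly on feature 2 and not at all on feature 1. The function names are hypothetical, and for real work a library implementation such as scikit-learn's `ElasticNet` would be the usual choice.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, alpha=0.5, l1_ratio=0.5, n_iter=100):
    """Coordinate descent (no intercept) for the objective
    (1 / 2n) * ||y - Xw||^2 + alpha * l1_ratio * ||w||_1
                            + 0.5 * alpha * (1 - l1_ratio) * ||w||^2
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # covariance of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # The L1 part zeroes weak coefficients; the L2 part shrinks the rest.
            w[j] = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
    return w


# Made-up, mutually orthogonal features.
x0 = [1, 1, 1, 1, -1, -1, -1, -1]
x1 = [1, 1, -1, -1, 1, 1, -1, -1]
x2 = [1, -1, 1, -1, 1, -1, 1, -1]
X = [list(row) for row in zip(x0, x1, x2)]
y = [3 * a + 0.1 * c for a, c in zip(x0, x2)]

w = elastic_net(X, y)
print(w)
```

With these settings the coefficient of feature 0 is shrunk but kept, while the coefficients of the irrelevant feature 1 and the weak feature 2 are set to exactly zero, which is the selection effect described above.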

KJ: According to you, how different are these competitions from the real-life challenges which are solved in industry using data science and machine learning?

Bishwarup: There is a fundamental difference indeed. In industry, businesses are looking for estimates which are backed by certain confidence bands – the confidence interval is more crucial than just a point estimate – whereas in online competitions we are crazy about optimizing the evaluation metric even up to the 5th or 6th decimal place. In real-life use cases, people are more interested in a directional view of where the business is heading and in finding the potential drivers of the change, whereas in online contests we rarely bother about what insight we can gather from the data. Besides, the complicated stacked models that we develop to crack online contests are hardly possible to implement in a production environment due to their complexity, long training time and the stochastic nature of optimization. However, it does not mean that such contests offer no career value for people working in companies – I think most companies use algorithms like Random Forest or even xgboost in their analytics wing. One just has to factor in the domain perspective rather than simply applying black-box machine learning to solve a problem in industry.
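
The difference between a point estimate and an interval can be sketched with a percentile bootstrap. This is only one of several ways to build a confidence interval and is not something described in the interview; the function name `bootstrap_ci`, its parameters and the sales figures are all made up for illustration.

```python
import random


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean (sketch)."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample the data with replacement and record the resample mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Made-up weekly sales figures.
weekly_sales = [102, 98, 110, 95, 130, 101, 99, 105, 97, 120]

point_estimate = sum(weekly_sales) / len(weekly_sales)
low, high = bootstrap_ci(weekly_sales)
print(f"mean = {point_estimate:.1f}, 95% interval = ({low:.1f}, {high:.1f})")
```

A business reader can act on the range as well as the single number, which is usually more informative than one more decimal place on the point estimate.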

KJ: As per your experience, which tutorials, online courses and MOOCs are a must for aspiring data scientists? Which one helped you the most personally?

Bishwarup: I haven’t taken any online courses or MOOCs, or followed any online or offline content. I learnt by debugging – whenever I was stuck with something I would go to Stack Overflow or Google forums to search for answers. I think it’s the best way to learn. Of course, the courses on websites like Coursera, Udacity or Udemy provide a lot of value – they will give you a jump start – but if you want to master a subject, better do it by practicing on your own. You will find hundreds of articles describing what deep learning is and where it is used, but not a single one on the web completely describing how to install [theano + gpu + cuda + cudnn] on a Windows machine. In my opinion, there is no use in going through an article if you can’t practice alongside it.

KJ: If you had a chance to go back in time, what are the things you would have done differently?

Bishwarup: I would learn Java to its core and also try to create some kind of routine in my life.

KJ: What must a fresher do to get his/her first break in analytics?

Bishwarup: I will be honest, it’s a very competitive market out there. Going back 7-8 years from now, companies would hire people who could efficiently run a logistic regression, but today the outlook is completely different. Corporate establishments are looking for people who are extremely well equipped with the latest technologies. However, this is not meant to demotivate the folks who have just come out of college or are looking to move into the data science domain. I would say:

If you have an Economics / Maths / Statistics background, go ahead and learn some programming skills in R or Python. These two are in very high demand at this moment and will be so for quite some time, especially Python.
If you are coming from a Computer Science / Computer Application background, go to Coursera or Udacity and enrol yourself in a certification course, e.g. Data Science Toolbox or Data Analyst Essentials. It will help you grasp the concepts of different applied statistical methods, which is important, and you will also get a certificate of achievement at the end of the course which will help you build your portfolio.
For those who are not from the above two backgrounds, there is no reason to believe you cannot get into data science. I personally know people who made it into analytics from completely different fields and are well established at this moment. It’s just that it would be a career jump for you, and before taking any online course you should probably sit quietly and contemplate your move. If you find yourself determined enough, I don’t think you will find it hard at all to get a break. If you are confused about where to start, you can always PM me. I would be glad to help :)

Also, you should follow the AnalyticsVidhya blog posts, which are full of helpful material. The last one I remember reading was a blog post on D3.js, and it was pretty well written.

I would like to thank the AnalyticsVidhya team once more for providing me with this wonderful opportunity and I look forward to having a long-term relationship with you guys. Your blog posts are fantastic and you guys are doing a great job for the data science community at large. My best wishes for you.

KJ: Thank you, Bishwarup, for your invaluable time and thoughts. I am sure a lot of people in the analytics & data science industry will benefit from this. All the best to you too!

You can test your skills and knowledge. Check out Live Competitions and compete with best Data Scientists from all over the world.

Kunal Jain

Kunal Jain is the Founder and CEO of Analytics Vidhya, one of the world's leading communities of AI professionals. With over 17 years of experience in the field, Kunal has been instrumental in shaping the global AI landscape. His expertise spans diverse markets, from developed economies like the UK to emerging ones like India, where he has successfully led and delivered complex data-driven solutions. As a recognized thought leader, Kunal has empowered countless individuals to realize their AI ambitions through his visionary approach to AI education and community building. Before founding Analytics Vidhya, Kunal earned both his undergraduate and postgraduate degrees from IIT Bombay and held key roles at Capital One and Aviva Life Insurance across multiple geographies. His passion lies at the intersection of analytics, AI, and fostering a thriving community of data science professionals.

Anon

Very nice interview. If Bishwarup wouldn't mind answering a question from the comments, I have a few for him. :) Could you contrast your work as an independent consultant running your company with your current role as a senior data scientist? What were/are your responsibilities in each? You mentioned that a lot of analytics was automated during your consultancy days; how is automation affecting (or not) your job at Decision Minds?

BabluTed

Thanks for sharing your experience with us. It's really helpful.

Lokesh

Thank you for providing an honest opinion about everything. One question though - how can we PM you for some guidance please?
