It’s our pleasure to introduce a top data scientist (as per Kaggle rankings), Mr. Steve Donoho, who has generously agreed to do an exclusive interview for Analytics Vidhya. Steve is living a dream most of us only think about! He is the founder and Chief Data Scientist at Donoho Analytics Inc., tops the Kaggle rankings for data scientists, and gets to choose his own areas of interest.
Prior to this, he worked as Head of Research for Mantas and as a Principal for SRA International Inc. On the education front, Steve completed his undergraduate degree at Purdue University, followed by an M.S. and a Ph.D. from the University of Illinois. His interests and work span an interesting mix of problems: insider trading, money laundering, excessive markups, and customer attrition.
On the personal front, Steve likes trekking and playing card and board games with his family (Rummikub, Euchre, Dutch Blitz, Settlers of Catan, etc.).
Kunal: Welcome Steve! Thanks for accepting the offer to share your knowledge with the Analytics Vidhya audience. Kindly tell us briefly about yourself, your career in analytics, and how you chose this career.
Steve: When I was in grad school, I was good at math and science so everyone told me, “You should be an engineer!” So I got a degree in computer engineering, but I found that designing computers was not so interesting to me. I found what I really loved to do was to analyze things and to use computers as a tool to analyze things. So for any young person out there who is good at math and science, I recommend you ask yourself, “Do I love to analyze things?” If so, a career as a data scientist may be the thing for you. In my career, I have mainly worked in financial services because data abounds in the financial services world, and it is a very data-driven industry. I enjoy looking for fraud because it gives me an opportunity to think like a crook without actually being one.
Kunal: So, how and when did you start participating in Kaggle competitions?
Steve: I found out about Kaggle a couple years ago from an article in the Wall Street Journal. The article was about the Heritage Health Prize, and I worked on that contest. But I was quickly drawn into other contests because they all looked so interesting.
Kunal: How frequently do you participate in these competitions, and how do you choose which ones to enter?
Steve: I’d have to say that I do about one each month if there are interesting-looking contests going on. I try to pick contests that will force me to learn something new. For example, 12 months ago I would have had to say that I knew very little about text mining. So I deliberately entered a couple text mining contests. Once I made an entry, the competitive spirit forced me to learn as much as I could about text mining, and other competitors post helpful hints about good techniques to learn about. So it is a great way to sharpen your skills.
Kunal: Team vs. Self?
Steve: I usually enter contests by myself. This is mainly because it can be difficult to coordinate with teammates while juggling a job, contest, etc.
Kunal: Which has been the most interesting / difficult competition you have participated in till date?
Steve: The GE Flight Quest was very interesting. The challenge was to predict when a flight was going to land given all the information about its current position, weather, wind, airport delays, etc. After being in that contest, when I looked up and saw an airplane in the sky, I found myself thinking, “I wonder what that airplane’s Estimated Arrival Time is, and will it be ahead of schedule or behind?” I have also liked the hack-a-thons, which are contests that last only 24 hours – they totally change the way you approach a problem because you don’t have as much time to mull it over.
Kunal: What are the common tools you use for these competitions and your work outside of Kaggle?
Steve: I mostly use the R programming language, but I also use Python’s scikit-learn, especially if it is a text-mining problem. For work outside Kaggle, data is often in a relational database, so a good working knowledge of SQL is a must.
Kunal: Any special pre-processing / data cleansing exercise which you found immensely helpful? How much time do you spend on data-cleansing vs. choosing the right technique / algorithm?
Steve: Well, I start by simply familiarizing myself with the data. I plot histograms and scatter plots of the various variables and see how they are correlated with the dependent variable. I sometimes run an algorithm like GBM or Random Forest on all the variables simply to get a ranking of variable importance. I usually start very simple and work my way toward more complex if necessary. My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X.” These are simply to establish what is possible with very simple models. You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
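To make this concrete, here is a minimal sketch (illustrative only, not Steve’s actual code) of that first pass in Python with pandas and scikit-learn: eyeballing the data, ranking variables with a Random Forest, and scoring a couple of “baseline” guesses. The file name, the target column "y", and the "segment" column are all hypothetical, and the predictors are assumed to be numeric.

```python
# Illustrative sketch only: quick data exploration, a variable-importance
# ranking, and two "baseline" models. Column names and file name are made up.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("train.csv")                     # hypothetical training file
X, y = df.drop(columns=["y"]), df["y"]

# 1. Get familiar with the data: histograms and correlations with the target
X.hist(figsize=(10, 8))
plt.show()
print(df.corr(numeric_only=True)["y"].sort_values(ascending=False))

# 2. Rough ranking of variable importance from a Random Forest
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns)
        .sort_values(ascending=False))

# 3. "Baseline" submissions: guess the overall average, then the average
#    segmented by one variable (here a numeric-coded "segment" column)
overall = y.mean()
by_segment = df.groupby("segment")["y"].transform("mean")
print("guess-the-average RMSE:", mean_squared_error(y, [overall] * len(y)) ** 0.5)
print("segmented-average RMSE:", mean_squared_error(y, by_segment) ** 0.5)
```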
A next step is to ask, “What should I actually be predicting?” This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example – you don’t want to predict the actual time the airplane will land; you want to predict the length of the flight. And maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that by the original estimate.
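As a tiny, purely illustrative example of that derived-variable idea (the column names and numbers below are made up, not from the actual contest data):

```python
# Purely illustrative numbers showing the derived-variable idea from the
# Flight Quest example; the column names and values are invented.
import pandas as pd

flights = pd.DataFrame({
    "estimated_minutes": [120, 95, 210],   # original estimate of flight length
    "actual_minutes":    [131, 90, 224],   # how long the flight actually took
})

# Model the ratio (actual / original estimate) rather than the raw landing time
flights["ratio"] = flights["actual_minutes"] / flights["estimated_minutes"]

# A predicted ratio is then converted back into a flight-length prediction
predicted_ratio = 1.05          # pretend this came from a model
new_estimate = 100              # a new flight's original estimate, in minutes
print(predicted_ratio * new_estimate)   # 105.0 minutes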
I probably spend 50% of my time on data exploration and cleansing depending on the problem.
Kunal: Which algorithms have you used most commonly in your final submissions?
Steve: It really depends on the problem. I like to think of myself as a carpenter with a tool chest full of tools. An experienced carpenter looks at his project and picks out the right tools. Having said that, the algorithms that I get the most use out of are the old favorites: R’s GBM package (Generalized Boosted Regression Models), Random Forests, and Support Vector Machines.
Kunal: What are your views on traditional predictive modeling techniques like regression and decision trees?
Steve: I view them as tools in my tool chest. Sometimes simple regression is just the right tool for a problem, or regression used in an ensemble with a more complex algorithm.
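Purely for illustration, here is a minimal sketch of what such a blend might look like in scikit-learn, using synthetic data and an arbitrary 50/50 weighting (this is not a prescription from the interview):

```python
# Illustrative sketch of blending a simple regression with a more complex
# model; synthetic data and a fixed 50/50 weight, chosen only for the example.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
boosted = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Simple average of the two predictions; the blend weight could itself be
# tuned on a validation set instead of being fixed at 0.5.
blend = 0.5 * linear.predict(X_te) + 0.5 * boosted.predict(X_te)

for name, pred in [("linear", linear.predict(X_te)),
                   ("boosted", boosted.predict(X_te)),
                   ("blend", blend)]:
    print(name, mean_squared_error(y_te, pred) ** 0.5)
```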
Kunal: Which tools and techniques would you recommend an Analytics newbie to learn? Any specific recommendation for learning tools with big data capabilities?
Steve: I don’t know if I have a good answer for this question.
Kunal: I have been working in Analytics Industry for some time now, but am new to Kaggle. What would be your tips for someone like me to excel on this platform?
Steve: My advice would be to make your goal having fun and learning new things. If you set a goal of becoming highly ranked, it will become “work” instead of “fun” and then it will become drudgery. But if you set your goal to have fun and learn, then you will pour all your creative juices into it, and you will probably end up with a good score in the end. Kagglers are very helpful. We love to give hints in the forums and tell how we approached a problem after the contest is over. When I started on Kaggle, I just went back to all the completed contests and read the “Congratulations Winners! Here’s how I approached this problem” forum entry where all the winners gave away their secrets. I picked up a lot of great tips that way – both for what algorithms to learn and techniques I had not thought of. It expanded my tool chest.
Kunal: Finally, any advice you would like to give to the audience of Analytics Vidhya?
Steve: Here are some thoughts based on my experience:
Thanks, Steve, for sharing these nuggets of gold. Really appreciate it!
Image (background) source: theninjamarketingblog
Hello Steve, I am glad that I got an opportunity to ask you questions. I am currently pursuing a Master's in Business Analytics, and I am still not able to find many jobs related to data science. McKinsey predicted a shortage of 1.5 million analysts, yet I don't see much opportunity. What do you think could be the reason, and what extra should we do to improve our profile as data scientists? Thanks.
I think that groups like McKinsey sometimes lump all sorts of jobs together under "data analyst." I've found that many job listings out there for data analysts unfortunately don't involve much predictive modeling – they involve mostly what I call "data handling." But sometimes those jobs can grow into jobs that involve more predictive modeling.
Steve, I am currently working on a project to calculate a utility index for customers of a premium bank savings account. We want to assign every customer an index of how useful (in aggregate) the value propositions of the account will be to that customer. There are many value propositions of a premium bank savings account, including a high number of free NEFT transactions, free airport lounge entry, etc. Business wants us to come up with a single utility index for each customer. Assume I have the data for all the value propositions currently being used by the customer. Because there is no clear objective function, I am struggling with how to assign weights to the different value-proposition usages to come up with a single score. One approach I can think of is using Data Envelopment Analysis / linear programming to come up with weights for each parameter / value-proposition usage. Can you suggest the best method to find these weights in a scientific manner and finally come up with a utility index for each customer? Please let me know in case you need any other specific details. Tavish
Let me make sure I understand the problem. There are multiple value propositions for a premium bank savings account. Are you saying that each value proposition has a different value to different customers – for example, NEFT transactions are very valuable to customer #1 because customer #1 does a lot of NEFT transactions, but not very valuable to customer #2 because customer #2 does few NEFT transactions? When you say, "Assuming I have the data for all the value propositions being currently used by the customer," what information exactly do you have: 1) a binary yes/no of whether the customer uses a value proposition, 2) the amount they use a value proposition (i.e. the number of NEFT transactions), or 3) something else?
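In case it helps to picture the Data Envelopment Analysis / linear programming approach Tavish mentions, here is a rough, purely illustrative sketch with made-up usage counts (a very simple output-only DEA model with one constant input per customer; not a recommendation from the interview):

```python
# Rough sketch of the DEA / linear-programming idea: an output-only model with
# a single constant input, solved once per customer. Usage counts are made up.
import numpy as np
from scipy.optimize import linprog

# rows = customers, columns = value propositions
# (e.g. number of NEFT transactions, lounge visits, ...)
usage = np.array([
    [12.0, 2.0, 1.0],
    [ 3.0, 0.0, 4.0],
    [ 8.0, 5.0, 0.0],
])

scores = []
for o in range(len(usage)):
    # maximize sum_j u_j * usage[o, j]; linprog minimizes, so negate
    c = -usage[o]
    # subject to: for every customer i, sum_j u_j * usage[i, j] <= 1, u_j >= 0
    res = linprog(c, A_ub=usage, b_ub=np.ones(len(usage)), bounds=(0, None))
    scores.append(-res.fun)      # utility index in (0, 1]

print(scores)   # customers on the "efficient frontier" score 1.0
```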
Hi Kunal/Tavish, could you please suggest any good literature on multivariate regression modelling? Thanks in advance. Kind Regards, Nimit Gupta
Hi Nimit, I referred to three books for building my knowledge on multivariate regression. 1. Statistics for Management: to build a foundation on the statistical details of regression models. 2. SAS Enterprise Guide ANOVA, Regression & Logistic Regression [course notes]: to learn SAS routines for building models and reading diagnostic plots. 3. SAS E-Miner Predictive Modelling [course notes]: to learn how E-Miner makes programming regression models easier. These course notes should be available on request from the SAS Institute. If you wish to consult a single book covering the overall content of these three, you might consider "SAS for Linear Models [4th edition]" by Ramon C. Littell, Walter W. Stroup, and Rudolf J. Freund. I hope this helps. Tavish