Interview with data scientist and top Kaggler, Mr. Steve Donoho

Kunal Jain Last Updated : 21 Aug, 2015
6 min read

It’s our pleasure to introduce top data scientist (as per Kaggle rankings), Mr. Steve Donoho, who has generously agreed to do an exclusive interview for Analytics Vidhya. Steve is living a dream most of us only think of! He is founder and Chief Data Scientist at Donoho Analytics Inc., tops the Kaggle rankings for data scientists, and chooses his own areas of interest.

Prior to this, he worked as Head of Research for Mantas and as Principal for SRA International Inc. On the education front, Steve completed his undergraduate degree at Purdue University, followed by an M.S. and Ph.D. from the University of Illinois. His interests and work include an interesting mix of problems in the areas of insider trading, money laundering, excessive markups, and customer attrition.

On a personal front, Steve likes trekking and playing card and board games with his family (Rummikub, Euchre, Dutch Blitz, Settlers of Catan, etc.).


Kunal: Welcome Steve! Thanks for accepting the offer to share your knowledge with our audience of Analytics Vidhya. Kindly tell us briefly about yourself and your career in Analytics and how you chose this career.

Steve: When I was in grad school, I was good at math and science so everyone told me, “You should be an engineer!”  So I got a degree in computer engineering, but I found that designing computers was not so interesting to me. I found what I really loved to do was to analyze things and to use computers as a tool to analyze things. So for any young person out there who is good at math and science, I recommend you ask yourself, “Do I love to analyze things?”  If so, a career as a data scientist may be the thing for you. In my career, I have mainly worked in financial services because data abounds in the financial services world, and it is a very data-driven industry.  I enjoy looking for fraud because it gives me an opportunity to think like a crook without actually being one.

 

Kunal: So, how and when did you start participating in Kaggle competitions?

Steve: I found out about Kaggle a couple years ago from an article in the Wall Street Journal.  The article was about the Heritage Health Prize, and I worked on that contest. But I was quickly drawn into other contests because they all looked so interesting.

 

Kunal: How frequently do you participate in these competitions, and how do you choose which ones to participate in?

Steve: I’d have to say that I do about one each month if there are interesting-looking contests going on. I try to pick contests that will force me to learn something new. For example, 12 months ago I would have had to say that I knew very little about text mining.  So I deliberately entered a couple text mining contests. Once I made an entry, the competitive spirit forced me to learn as much as I could about text mining, and other competitors post helpful hints about good techniques to learn about. So it is a great way to sharpen your skills.

 

Kunal: Team vs. Self?

Steve: I usually enter contests by myself.  This is mainly because it can be difficult to coordinate with teammates while juggling a job, contest, etc.

 

Kunal: Which has been the most interesting / difficult competition you have participated in till date?

Steve: The GE Flight Quest was very interesting. The challenge was to predict when a flight was going to land given all the information about its current position, weather, wind, airport delays, etc.  After being in that contest, when I looked up and saw an airplane in the sky, I found myself thinking, “I wonder what that airplane’s Estimated Arrival Time is, and will it be ahead of schedule or behind?” I have also liked the hack-a-thons, which are contests that last only 24 hours – it totally changes the way you approach a problem because you don’t have as much time to mull it over.

 

Kunal: What are the common tools you use for these competitions and your work outside of Kaggle?

Steve: I mostly use the R programming language, but I also use Python scikit-learn especially if it is a text-mining problem.  For work outside Kaggle, data is often in a relational database so a good working knowledge of SQL is a must.

 

Kunal: Any special pre-processing / data cleansing exercise which you found immensely helpful? How much time do you spend on data-cleansing vs. choosing the right technique / algorithm?

Steve: Well, I start by simply familiarizing myself with the data.  I plot histograms and scatter plots of the various variables and see how they are correlated with the dependent variable.  I sometimes run an algorithm like GBM or Random Forest on all the variables simply to get a ranking of variable importance.  I usually start very simple and work my way toward more complex if necessary.  My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X.”  These are simply to establish what is possible with very simple models.  You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
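The “baseline first” approach Steve describes can be sketched in a few lines of plain Python. The data and segment names below are invented purely for illustration; the point is the workflow: establish a “guess the average” score, then a “guess the average segmented by variable X” score, before reaching for anything complex.

```python
from collections import defaultdict
from math import sqrt

# Toy training data: (segment, target) pairs -- purely illustrative.
train = [("A", 10.0), ("A", 12.0), ("B", 30.0), ("B", 34.0), ("B", 32.0)]

# Baseline 1: "guess the average".
global_mean = sum(y for _, y in train) / len(train)

# Baseline 2: "guess the average segmented by variable X".
sums, counts = defaultdict(float), defaultdict(int)
for seg, y in train:
    sums[seg] += y
    counts[seg] += 1
segment_mean = {seg: sums[seg] / counts[seg] for seg in sums}

def rmse(pairs, predict):
    """Root mean squared error of a prediction function over (x, y) pairs."""
    return sqrt(sum((predict(x) - y) ** 2 for x, y in pairs) / len(pairs))

print(rmse(train, lambda seg: global_mean))        # error of the naive baseline
print(rmse(train, lambda seg: segment_mean[seg]))  # segmenting usually helps
```

Either baseline score becomes the yardstick: a complex model that cannot beat the segmented average is not earning its complexity.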

A next step is to ask, “What should I actually be predicting?”  This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best.  But sometimes you want to create a derived dependent variable.  I’ll use the GE Flight Quest as an example – you don’t want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that by the original estimate.
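Steve’s ratio transform can be made concrete with a toy sketch (the flight durations below are invented; this is the shape of the idea, not GE Flight Quest code): the model learns the *ratio* of actual to estimated duration, and predictions are rescaled back into minutes.

```python
# Illustrative only: model the ratio of actual to estimated flight time,
# then turn a predicted ratio back into a duration estimate.

def derived_target(actual_minutes, estimated_minutes):
    """The quantity the model learns: how far off the original estimate was."""
    return actual_minutes / estimated_minutes

def predicted_duration(predicted_ratio, estimated_minutes):
    """Undo the transform: ratio * original estimate = predicted flight length."""
    return predicted_ratio * estimated_minutes

# A flight estimated at 120 min that actually took 150 min has ratio 1.25 ...
ratio = derived_target(150.0, 120.0)

# ... so a model predicting ratio 1.25 for a new flight estimated at
# 100 minutes implies a 125-minute flight.
print(predicted_duration(ratio, 100.0))  # 125.0
```

The advantage of the derived target is that a ratio near 1.0 is comparable across short and long flights, which the raw landing time is not.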

I probably spend 50% of my time on data exploration and cleansing depending on the problem.

 

Kunal: Which algorithms have you used most commonly in your final submissions?

Steve: It really depends on the problem.  I like to think of myself as a carpenter with a tool chest full of tools.  An experienced carpenter looks at his project and picks out the right tools. Having said that, the algorithms that I get the most use out of are the old favourites: R’s GBM package (Generalized Boosted Regression Models), Random Forests, and Support Vector Machines.

 

Kunal: What are your views on traditional predictive modeling techniques like Regression, Decision tree?

Steve: I view them as tools in my tool chest. Sometimes simple regression is just the right tool for a problem, or regression used in an ensemble with a more complex algorithm.
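A minimal sketch of what “regression used in an ensemble” can mean: fit a one-variable least-squares line in plain Python, then average its predictions with those of a second model (here a deliberately trivial stand-in for something like GBM). All data and model names are invented for illustration; this is not Steve’s code.

```python
# Simple least-squares regression on one feature, blended with a second
# model's predictions -- a toy "ensemble", assuming made-up data.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.1, 5.9, 8.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Closed-form slope and intercept for one-variable least squares.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def regression_model(x):
    return intercept + slope * x

def other_model(x):
    # Stand-in for a "more complex algorithm" (GBM, random forest, ...).
    return 2.0 * x

def ensemble(x):
    # The simplest possible blend: average the two models' predictions.
    return 0.5 * regression_model(x) + 0.5 * other_model(x)

print(ensemble(5.0))
```

In practice the blend weights are often tuned on a validation set rather than fixed at 0.5, but even an equal-weight average of two decent, dissimilar models frequently beats either one alone.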

 

Kunal: Which tools and techniques would you recommend an Analytics newbie to learn? Any specific recommendation for learning tools with big data capabilities?

Steve:  I don’t know if I have a good answer for this question.

 

Kunal: I have been working in the Analytics industry for some time now, but am new to Kaggle. What would be your tips for someone like me to excel on this platform?

Steve: My advice would be to make your goal having fun and learning new things. If you set a goal of becoming highly ranked, it will become “work” instead of “fun” and then it will become drudgery. But if you set your goal to have fun and learn, then you will pour all your creative juices into it, and you will probably end up with a good score in the end. Kagglers are very helpful. We love to give hints in the forums and tell how we approached a problem after the contest is over. When I started on Kaggle, I just went back to all the completed contests and read the “Congratulations Winners! Here’s how I approached this problem” forum entry where all the winners gave away their secrets. I picked up a lot of great tips that way – both for what algorithms to learn and techniques I had not thought of.  It expanded my tool chest.

 

Kunal: Finally, any advice you would want to provide to the audience of Analytics Vidhya?

Steve: Here are some thoughts based on my experience:

  • Knowledge of statistics & machine learning is a necessary foundation.  Without that foundation, a participant will not do very well.  BUT what differentiates the top 10 in a contest from the rest of the pack is their creativity and intuition.
  • I think beginners sometimes just start to “throw” algorithms at a problem without first getting to know the data.  I also think that beginners sometimes go too-complex-too-soon.  There is a view among some people that you are smarter if you create something really complex.  I prefer to start out simpler.  I *try* to follow Albert Einstein’s advice when he said, “Any intelligent fool can make things bigger and more complex. It takes a touch of genius — and a lot of courage — to move in the opposite direction.”
  • The more tools you have in your toolbox, the better prepared you are to solve a problem.  If I only have a hammer in my toolbox, and you have a toolbox full of tools, you are probably going to build a better house than I am.  Having said that, some people have a lot of tools in their toolbox, but they don’t know *when* to use *which* tool.  I think knowing when to use which tool is very important.  Some people get a bunch of tools in their toolbox, but then they just start randomly throwing a bunch of tools at their problem without asking, “Which tool is best suited for this problem?”  The best way to learn this is by experience, and Kaggle provides a great platform for this.

Thanks, Steve, for sharing these nuggets of gold. Really appreciated!

Bonus: In addition to this interview, Steve has agreed to answer a few specific questions from readers. For the benefit of everyone, I would urge you to keep them as specific as possible and avoid asking questions already answered as part of the interview. Please post your questions in the comments below. Steve will answer the questions once he is back from Thanksgiving holidays.

If you like what you just read & want to continue your analytics learning, subscribe to our emails or like our Facebook page.

 

Image (background) source: theninjamarketingblog

Kunal Jain is the Founder and CEO of Analytics Vidhya, one of the world's leading communities of AI professionals. With over 17 years of experience in the field, Kunal has been instrumental in shaping the global AI landscape. His expertise spans diverse markets, from developed economies like the UK to emerging ones like India, where he has successfully led and delivered complex data-driven solutions. As a recognized thought leader, Kunal has empowered countless individuals to realize their AI ambitions through his visionary approach to AI education and community building. Before founding Analytics Vidhya, Kunal earned both his undergraduate and postgraduate degrees from IIT Bombay and held key roles at Capital One and Aviva Life Insurance across multiple geographies. His passion lies at the intersection of analytics, AI, and fostering a thriving community of data science professionals.

Responses From Readers


Kuber

Hello Steve, I am glad that I got an opportunity to ask you questions. I am currently pursuing a Master's in Business Analytics and I am still not able to see jobs related to Data Science. McKinsey predicted that there's a shortage of 1.5 million analysts, yet I don't see much opportunity. What do you think could be the reason, and what extra should we do to improve our profile as data scientists? Thanks.

Tavish

Steve, I am currently working on a project to calculate a utility index of customers for a premium bank savings account. We want to assign every customer an index of how useful (in aggregate) the value propositions of the account will be to that customer. There are many value propositions of a premium bank savings account, including a high number of free NEFT transactions, free airport lounge entry, etc. Business wants us to come up with a single utility index for each customer, and I have the data for all the value propositions currently being used by each customer. Because there is no clear objective function, I am struggling with how to assign weights to different value-proposition usages to come up with a single score. One approach I can think of is using Data Envelopment Analysis / linear programming to come up with weights for each parameter. Can you suggest the best method to find the weights of these parameters in a scientific manner, to finally come up with a utility index for each customer? Please let me know in case you need any other specific details. Tavish

Nimit Gupta

Hi Kunal/Tavish, could you please suggest any good literature on multivariate regression modelling? Thanks in advance. Kind Regards, Nimit Gupta
