It’s our pleasure to introduce a top data scientist (as per Kaggle rankings), Mr. Steve Donoho, who has generously agreed to do an exclusive interview for Analytics Vidhya. Steve is living a dream most of us only think about! He is the founder and Chief Data Scientist at Donoho Analytics Inc., tops the Kaggle rankings for data scientists, and gets to choose his own areas of interest.
Prior to this, he worked as Head of Research for Mantas and as a Principal for SRA International Inc. On the education front, Steve completed his undergraduate degree at Purdue University, followed by an M.S. and a Ph.D. from the University of Illinois. His interests and work span an interesting mix of problems: insider trading, money laundering, excessive markups, and customer attrition.
On the personal front, Steve likes trekking and playing card and board games with his family (Rummikub, Euchre, Dutch Blitz, Settlers of Catan, etc.).
Kunal: Welcome Steve! Thanks for accepting the offer to share your knowledge with the Analytics Vidhya audience. Kindly tell us briefly about yourself, your career in analytics, and how you chose this career.
Steve: When I was in grad school, I was good at math and science so everyone told me, “You should be an engineer!” So I got a degree in computer engineering, but I found that designing computers was not so interesting to me. I found what I really loved to do was to analyze things and to use computers as a tool to analyze things. So for any young person out there who is good at math and science, I recommend you ask yourself, “Do I love to analyze things?” If so, a career as a data scientist may be the thing for you. In my career, I have mainly worked in financial services because data abounds in the financial services world, and it is a very data-driven industry. I enjoy looking for fraud because it gives me an opportunity to think like a crook without actually being one.
Kunal: So, how and when did you start participating in Kaggle competitions?
Steve: I found out about Kaggle a couple years ago from an article in the Wall Street Journal. The article was about the Heritage Health Prize, and I worked on that contest. But I was quickly drawn into other contests because they all looked so interesting.
Kunal: How frequently do you participate in these competitions, and how do you choose which ones to enter?
Steve: I’d have to say that I do about one each month if there are interesting-looking contests going on. I try to pick contests that will force me to learn something new. For example, 12 months ago I would have had to say that I knew very little about text mining. So I deliberately entered a couple text mining contests. Once I made an entry, the competitive spirit forced me to learn as much as I could about text mining, and other competitors post helpful hints about good techniques to learn about. So it is a great way to sharpen your skills.
Kunal: Team vs. Self?
Steve: I usually enter contests by myself. This is mainly because it can be difficult to coordinate with teammates while juggling a job, contest, etc.
Kunal: Which has been the most interesting / difficult competition you have participated in till date?
Steve: The GE Flight Quest was very interesting. The challenge was to predict when a flight was going to land given all the information about its current position, weather, wind, airport delays, etc. After being in that contest, when I looked up and saw an airplane in the sky, I found myself thinking, “I wonder what that airplane’s Estimated Arrival Time is, and will it be ahead of schedule or behind?” I have also liked the hack-a-thons, which are contests that last only 24 hours – they totally change the way you approach a problem because you don’t have as much time to mull it over.
Kunal: What are the common tools you use for these competitions and your work outside of Kaggle?
Steve: I mostly use the R programming language, but I also use Python’s scikit-learn, especially if it is a text-mining problem. For work outside Kaggle, data is often in a relational database, so a good working knowledge of SQL is a must.
Kunal: Any special pre-processing / data cleansing exercise which you found immensely helpful? How much time do you spend on data-cleansing vs. choosing the right technique / algorithm?
Steve: Well, I start by simply familiarizing myself with the data. I plot histograms and scatter plots of the various variables and see how they are correlated with the dependent variable. I sometimes run an algorithm like GBM or Random Forest on all the variables simply to get a ranking of variable importance. I usually start very simple and work my way toward more complex if necessary. My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X.” These are simply to establish what is possible with very simple models. You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
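To make this concrete, here is a minimal sketch (illustrative only, not Steve’s actual code) of that first pass in Python with pandas and scikit-learn: eyeballing the data, ranking variables with a Random Forest, and scoring a couple of “baseline” guesses. The file name, the target column "y", and the "segment" column are all hypothetical, and the predictors are assumed to be numeric.

```python
# Illustrative sketch only: quick data exploration, a variable-importance
# ranking, and two "baseline" models. Column names and file name are made up.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("train.csv")                     # hypothetical training file
X, y = df.drop(columns=["y"]), df["y"]

# 1. Get familiar with the data: histograms and correlations with the target
X.hist(figsize=(10, 8))
plt.show()
print(df.corr(numeric_only=True)["y"].sort_values(ascending=False))

# 2. Rough ranking of variable importance from a Random Forest
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(pd.Series(rf.feature_importances_, index=X.columns)
        .sort_values(ascending=False))

# 3. "Baseline" submissions: guess the overall average, then the average
#    segmented by one variable (here a numeric-coded "segment" column)
overall = y.mean()
by_segment = df.groupby("segment")["y"].transform("mean")
print("guess-the-average RMSE:", mean_squared_error(y, [overall] * len(y)) ** 0.5)
print("segmented-average RMSE:", mean_squared_error(y, by_segment) ** 0.5)
```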
A next step is to ask, “What should I actually be predicting?” This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flight Quest as an example – you don’t want to predict the actual time the airplane will land; you want to predict the length of the flight. And maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that by the original estimate.
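As a tiny, purely illustrative example of that derived-variable idea (the column names and numbers below are made up, not from the actual contest data):

```python
# Purely illustrative numbers showing the derived-variable idea from the
# Flight Quest example; the column names and values are invented.
import pandas as pd

flights = pd.DataFrame({
    "estimated_minutes": [120, 95, 210],   # original estimate of flight length
    "actual_minutes":    [131, 90, 224],   # how long the flight actually took
})

# Model the ratio (actual / original estimate) rather than the raw landing time
flights["ratio"] = flights["actual_minutes"] / flights["estimated_minutes"]

# A predicted ratio is then converted back into a flight-length prediction
predicted_ratio = 1.05          # pretend this came from a model
new_estimate = 100              # a new flight's original estimate, in minutes
print(predicted_ratio * new_estimate)   # 105.0 minutes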
I probably spend 50% of my time on data exploration and cleansing depending on the problem.
Kunal: Which algorithms have you used most commonly in your final submissions?
Steve: It really depends on the problem. I like to think of myself as a carpenter with a tool chest full of tools. An experienced carpenter looks at his project and picks out the right tools. Having said that, the algorithms that I get the most use out of are the old favorites: R’s GBM package (Generalized Boosted Regression Models), Random Forests, and Support Vector Machines.
Kunal: What are your views on traditional predictive modeling techniques like regression and decision trees?
Steve: I view them as tools in my tool chest. Sometimes simple regression is just the right tool for a problem, or regression used in an ensemble with a more complex algorithm.
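Purely for illustration, here is a minimal sketch of what such a blend might look like in scikit-learn, using synthetic data and an arbitrary 50/50 weighting (this is not a prescription from the interview):

```python
# Illustrative sketch of blending a simple regression with a more complex
# model; synthetic data and a fixed 50/50 weight, chosen only for the example.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LinearRegression().fit(X_tr, y_tr)
boosted = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Simple average of the two predictions; the blend weight could itself be
# tuned on a validation set instead of being fixed at 0.5.
blend = 0.5 * linear.predict(X_te) + 0.5 * boosted.predict(X_te)

for name, pred in [("linear", linear.predict(X_te)),
                   ("boosted", boosted.predict(X_te)),
                   ("blend", blend)]:
    print(name, mean_squared_error(y_te, pred) ** 0.5)
```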
Kunal: Which tools and techniques would you recommend an Analytics newbie to learn? Any specific recommendation for learning tools with big data capabilities?
Steve: I don’t know if I have a good answer for this question.
Kunal: I have been working in Analytics Industry for some time now, but am new to Kaggle. What would be your tips for someone like me to excel on this platform?
Steve: My advice would be to make your goal having fun and learning new things. If you set a goal of becoming highly ranked, it will become “work” instead of “fun” and then it will become drudgery. But if you set your goal to have fun and learn, then you will pour all your creative juices into it, and you will probably end up with a good score in the end. Kagglers are very helpful. We love to give hints in the forums and tell how we approached a problem after the contest is over. When I started on Kaggle, I just went back to all the completed contests and read the “Congratulations Winners! Here’s how I approached this problem” forum entry where all the winners gave away their secrets. I picked up a lot of great tips that way – both for what algorithms to learn and techniques I had not thought of. It expanded my tool chest.
Kunal: Finally, any advice you would like to give to the audience of Analytics Vidhya?
Steve: Here are some thoughts based on my experience:
Thanks, Steve, for sharing these nuggets of gold. Really appreciate it!
Image (background) source: theninjamarketingblog
Hello Steve, I am glad that I got an opportunity to ask you questions. I am currently pursuing a Master's in Business Analytics, and I am still not able to find many jobs related to data science. McKinsey predicted a shortage of 1.5 million analysts, yet I don't see much opportunity. What do you think could be the reason, and what extra should we do to improve our profile as data scientists? Thanks.
I think that groups like McKinsey sometimes lump all sorts of jobs together under "data analyst." I've found that many job listings out there for data analysts unfortunately don't involve much predictive modeling – they involve mostly what I call "data handling." But sometimes those jobs can grow into jobs that involve more predictive modeling.
Steve, I am currently working on a project to calculate a utility index for customers of a premium bank savings account. We want to assign every customer an index of how useful (in aggregate) the value propositions of the account will be to that customer. There are many value propositions of a premium bank savings account, including a high number of free NEFT transactions, free airport lounge entry, etc. Business wants us to come up with a single utility index for each customer. Assume I have the data for all the value propositions currently being used by the customer. Because there is no clear objective function, I am struggling with how to assign weights to the different value-proposition usages to come up with a single score. One approach I can think of is using Data Envelopment Analysis / linear programming to come up with weights for each parameter / value-proposition usage. Can you suggest the best method to find these weights in a scientific manner and finally come up with a utility index for each customer? Please let me know in case you need any other specific details. Tavish
Let me make sure I understand the problem. There are multiple value propositions for a premium bank savings account. Are you saying that each value proposition has a different value to different customers – for example, NEFT transactions are very valuable to customer #1 because customer #1 does a lot of NEFT transactions, but not very valuable to customer #2 because customer #2 does few NEFT transactions? When you say, "Assuming I have the data for all the value propositions being currently used by the customer," what information exactly do you have: 1) a binary yes/no of whether the customer uses a value proposition, 2) the amount they use a value proposition (i.e. the number of NEFT transactions), or 3) something else?
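In case it helps to picture the Data Envelopment Analysis / linear programming approach Tavish mentions, here is a rough, purely illustrative sketch with made-up usage counts (a very simple output-only DEA model with one constant input per customer; not a recommendation from the interview):

```python
# Rough sketch of the DEA / linear-programming idea: an output-only model with
# a single constant input, solved once per customer. Usage counts are made up.
import numpy as np
from scipy.optimize import linprog

# rows = customers, columns = value propositions
# (e.g. number of NEFT transactions, lounge visits, ...)
usage = np.array([
    [12.0, 2.0, 1.0],
    [ 3.0, 0.0, 4.0],
    [ 8.0, 5.0, 0.0],
])

scores = []
for o in range(len(usage)):
    # maximize sum_j u_j * usage[o, j]; linprog minimizes, so negate
    c = -usage[o]
    # subject to: for every customer i, sum_j u_j * usage[i, j] <= 1, u_j >= 0
    res = linprog(c, A_ub=usage, b_ub=np.ones(len(usage)), bounds=(0, None))
    scores.append(-res.fun)      # utility index in (0, 1]

print(scores)   # customers on the "efficient frontier" score 1.0
```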
Hi Kunal/Tavish, could you please suggest any good literature on multivariate regression modelling? Thanks in advance. Kind Regards, Nimit Gupta
Hi Nimit, I referred to three books for building my knowledge on multivariate regression. 1. Statistics for Management: to build a foundation on the statistical details of regression models. 2. SAS Enterprise Guide ANOVA, Regression & Logistic Regression [course notes]: to learn SAS routines for building models and reading diagnostic plots. 3. SAS E-Miner Predictive Modelling [course notes]: to learn how E-Miner makes programming regression models easier. These course notes should be available on request from the SAS Institute. If you wish to consult a single book covering the overall content of these three, you might consider "SAS for Linear Models [4th edition]" by Ramon C. Littell, Walter W. Stroup, and Rudolf J. Freund. I hope this helps. Tavish