Feature engineering, data wrangling and visualization are all aspects of data preparation, one of the most important phases in any standard data mining or machine learning workflow. Dipanjan Sarkar, Data Scientist/Author and your Hacker for this session, believes feature engineering is both an art and a science.
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” – Prof. Andrew Ng
Even with the advent of automated machine learning tools (auto-ml) and automated feature engineering libraries, creative and innovative hand-crafted features are, more often than not, the deciding factor in winning competitions or taking a real-world project from proof of concept to production.
In this one-hour hack session, Dipanjan will take a structured, comprehensive, hands-on approach (with Python) as we explore two interesting case studies based on real-world problems!
In Case Study 1, we will look at predicting taxi fare prices around New York City. This is a regression problem at scale, where we will be dealing with millions of data points! We will start by building a simple baseline model and then incrementally improve on it with a wide variety of feature engineering techniques, including the following:
- Outlier Analysis and Removal
- Temporal Features based on Date & Time
- Geographic Distance Features
- Visualizing Frequent Trip Patterns
- Hand-crafting trip trends into features
- Features based on trip locations and duration
- Categorical features based on trending locations
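To make two of the techniques above concrete, here is a minimal sketch of temporal and geographic distance features for a single taxi trip. It uses only the Python standard library; the function names, the sample coordinates and the choice of the haversine formula are illustrative assumptions, not necessarily what the session notebooks use.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def temporal_features(pickup_time):
    """Derive simple date and time features from a pickup timestamp."""
    return {
        "hour": pickup_time.hour,
        "weekday": pickup_time.weekday(),        # Monday is 0, Sunday is 6
        "month": pickup_time.month,
        "is_weekend": pickup_time.weekday() >= 5,
    }


# Hypothetical trip record; the coordinates are made up for illustration.
pickup = datetime(2015, 1, 3, 14, 30)
features = temporal_features(pickup)
features["distance_km"] = haversine_km(40.7614, -73.9776, 40.6413, -73.7781)
print(features)
```

With millions of rows, the same arithmetic would normally be applied column-wise (for example with NumPy or pandas) rather than one trip at a time.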
In Case Study 2, we will explore another interesting dataset of e-commerce product reviews and ratings. Here we tackle a classification problem: predicting recommended product ratings from customer review descriptions, a classic NLP problem! We will explore the following techniques in detail:
- Classic NLP count-based features from text
- Features based on POS Tags
- Features derived from Sentiment Analysis
- Text Pre-processing & Wrangling
- Classic Bag of Words based Features
- Bi-grams and Tri-grams with intelligent feature selection
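As a rough illustration of the count-based and n-gram features listed above, the sketch below computes a few simple counts and a bag of uni-, bi- and tri-grams for one review. It relies only on the standard library; the function names, the regex tokenizer and the sample review are assumptions made for illustration, and feature selection, POS tagging and sentiment analysis are not covered here.

```python
import re
from collections import Counter


def count_features(text):
    """Simple count-based features computed from raw review text."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "char_count": len(text),
        "word_count": len(words),
        "exclamation_count": text.count("!"),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=3):
    """Count every 1-gram up to max_n-gram in lowercased text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical review text, invented for this example.
review = "Love it! Fits well."
print(count_features(review))
print(bag_of_ngrams(review).most_common(5))
```

In practice a vectorizer from a library such as scikit-learn would typically replace the hand-rolled counter, but the counts it produces follow the same idea.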
At every stage, we will build relevant models and test the performance and effectiveness of each feature set we generate incrementally, and finally conclude with the best model and set of relevant features.
Key Takeaways from this Hack Session:
- Understand the need and importance of feature engineering, data wrangling and visualization
- Look at real-world problems where hand-crafted features and human domain expertise might be more useful than randomly building a bunch of features or models using automated techniques or meta-heuristics
- Hands-on demonstration of feature engineering and machine learning modeling on large datasets with millions of data points
- Comprehensive coverage of popular feature engineering techniques for structured data including numeric, categorical, temporal and geospatial features
- Comprehensive coverage of popular feature engineering techniques for unstructured text data including count-based features, POS tags, sentiment analysis, bag of words and so on
- Notebooks will be shared on GitHub for you to take home and use on your own problems in the future!