Effective Feature Engineering: A Structured Approach to Building Better Machine Learning Models

Feature engineering, data wrangling and visualization are all aspects of data preparation – one of the most important phases in any standard data mining or machine learning workflow. Dipanjan Sarkar, Data Scientist/Author and your Hacker for this session, believes feature engineering is both an art and a science.

“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” – Prof. Andrew Ng

Even with the advent of automated machine learning tools (AutoML) and automated feature engineering libraries, creative hand-crafted features are more often than not the deciding factor in winning competitions or in taking a real-world project from proof of concept to production.

In this one-hour hack session, Dipanjan will take a structured, comprehensive, hands-on approach (with Python) to explore two interesting case studies based on real-world problems!

In Case Study 1, we will look at predicting taxi fare prices around New York City. This will be a regression problem at scale where we will be dealing with millions of data points! We will start by building a simple baseline model and then incrementally improve on it with a wide variety of feature engineering techniques, including the following:

Outlier Analysis and Removal
Temporal Features based on Date & Time
Geographic Distance Features
Visualizing Frequent Trip Patterns
Hand-crafting trip trends into features
Features based on trip locations and duration
Categorical features based on trending locations
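As a rough illustration of two of the techniques listed above (temporal features and geographic distance features), here is a minimal sketch using only the Python standard library. The field names (`pickup_datetime`, `pickup_lat`, and so on) and the example trip are hypothetical and are not taken from the session's dataset or notebooks; the haversine formula is one common choice for a distance feature, not necessarily the one used in the session.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def temporal_features(pickup):
    """Derive simple date and time features from a pickup timestamp."""
    return {
        "hour": pickup.hour,
        "day_of_week": pickup.weekday(),           # Monday = 0, Sunday = 6
        "month": pickup.month,
        "is_weekend": int(pickup.weekday() >= 5),  # Saturday or Sunday
    }


def trip_features(trip):
    """Combine temporal and distance features for one trip record (a dict)."""
    feats = temporal_features(trip["pickup_datetime"])
    feats["distance_km"] = haversine_km(
        trip["pickup_lat"], trip["pickup_lon"],
        trip["dropoff_lat"], trip["dropoff_lon"],
    )
    return feats


# Hypothetical trip record, for illustration only.
example_trip = {
    "pickup_datetime": datetime(2015, 1, 3, 23, 15),
    "pickup_lat": 40.7580, "pickup_lon": -73.9855,
    "dropoff_lat": 40.6413, "dropoff_lon": -73.7781,
}
example_features = trip_features(example_trip)
```

With millions of rows, the same calculations would normally be vectorized over whole columns rather than applied one record at a time.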

In Case Study 2, we will explore another interesting dataset of e-commerce product reviews and ratings. Here we tackle a classification problem: predicting whether a product is recommended based on the customer's review text – a classic NLP problem! We will explore the following techniques in detail:

Classic NLP count based features from text
Features based on POS Tags
Features derived from Sentiment Analysis
Text Pre-processing & Wrangling
Classic Bag of Words based Features
Bi-grams and Tri-grams with intelligent feature selection
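To make a few of the techniques listed above concrete, the sketch below computes count-based features and a bag of unigrams, bi-grams and tri-grams using only the Python standard library. The tokenizer, the feature names and the sample reviews are assumptions made for illustration. The session does not say how its feature selection works, so a simple document-frequency threshold (`min_df`) is used here as a stand-in. POS-tag and sentiment features need an NLP library and are not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def count_features(text):
    """Classic count-based features computed from the raw review text."""
    tokens = tokenize(text)
    return {
        "char_count": len(text),
        "word_count": len(tokens),
        "exclamation_count": text.count("!"),
        "avg_word_length": (sum(len(t) for t in tokens) / len(tokens)) if tokens else 0.0,
    }


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_ngrams(text, max_n=3):
    """Bag-of-words counts over unigrams up to max_n-grams."""
    tokens = tokenize(text)
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(tokens, n))
    return bag


def select_by_document_frequency(docs, min_df=2, max_n=3):
    """Keep only n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(bag_of_ngrams(doc, max_n)))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)


# Made-up reviews, for illustration only.
reviews = ["Love this dress, fits great!", "This dress fits great", "Not great"]
vocabulary = select_by_document_frequency(reviews, min_df=2)
```

Dropping n-grams that appear in only one review keeps the vocabulary from exploding once bi-grams and tri-grams are added; a supervised selection criterion would be another reasonable option.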

At every stage we will build relevant models and test the performance and effectiveness of each feature set as we generate it, and finally conclude with the best model and set of relevant features.
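The incremental workflow described above (start from a baseline, add a feature, re-evaluate on held-out data) could look roughly like the following sketch. The numbers are synthetic and chosen so that fare is exactly 3 + 2 × distance; they are not results from the session. The mean-fare baseline and the one-feature least-squares fit are stand-ins for whatever models the session actually uses.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean_baseline(y_train):
    """Baseline model: always predict the mean training target."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean


def fit_simple_linear(x_train, y_train):
    """One-feature ordinary least squares: y = intercept + slope * x."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


# Synthetic data for illustration only (fare = 3 + 2 * distance).
train_distance = [1.0, 2.0, 3.0, 4.0]
train_fare = [5.0, 7.0, 9.0, 11.0]
test_distance = [5.0, 6.0]
test_fare = [13.0, 15.0]

models = {
    "baseline (mean fare)": fit_mean_baseline(train_fare),
    "+ distance feature": fit_simple_linear(train_distance, train_fare),
}

# Score every model on the same held-out trips so the numbers are comparable.
scores = {
    name: rmse(test_fare, [model(x) for x in test_distance])
    for name, model in models.items()
}
```

Scoring each new feature set against the same held-out data is what makes the comparison between stages meaningful.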

Key Takeaways from this Hack Session:

Understand the need and importance of feature engineering, data wrangling and visualization
Look at real-world problems where hand-crafted features and human domain expertise might be more useful than randomly building a bunch of features or models using automated techniques or meta-heuristics
Hands-on demonstration of feature engineering and machine learning modeling on large datasets with millions of data points
Comprehensive coverage of popular feature engineering techniques for structured data including numeric, categorical, temporal and geospatial features
Comprehensive coverage of popular feature engineering techniques for unstructured text data including count based features, POS tags, sentiment analysis, bag of words and so on
Notebooks will be shared on GitHub for you to take home and apply to your own problems in the future!

Hackers

Dipanjan (DJ) Sarkar

Dipanjan (DJ) Sarkar is a Data Scientist, a published author, and a consultant and trainer. He has consulted and worked with several startups as well as Fortune 500 companies like Intel. He primarily works on leveraging data science, advanced analytics, machine learning and deep learning to build large-scale intelligent systems. He holds a Master of Technology degree with specializations in Data Science and Software Engineering. He is also an avid supporter of self-learning and massive open online courses. He plans to venture soon into the world of open-source products to improve the productivity of developers across the world.

Duration of Hack-Session: 1 hour
