Innoplexus Sentiment Analysis Hackathon: Top 3 Out-of-the-Box Winning Approaches

Ankit Choudhary Last Updated : 28 Aug, 2019
7 min read

Overview

  • Hackathons are a wonderful opportunity to gauge your data science knowledge and compete to win lucrative prizes and job opportunities
  • Here are the top 3 approaches from the Innoplexus Sentiment Analysis Hackathon – a superb NLP challenge

 

Introduction

I’m a big fan of hackathons. I’ve learned so much about data science from participating in these hackathons in the past few years. I’ll admit it – I have gained a lot of knowledge through this medium and this, in turn, has accelerated my professional career.

This comes with a caveat – winning a data science hackathon is really hard. Just think about the number of obstacles in your way:

  • A brand new problem statement you haven't worked on before
  • A plethora of top data scientists competing to climb the leaderboard
  • A time crunch! You have to understand the problem statement, put together a framework, clean the data, explore it, and build the model in a matter of hours
  • And then repeat the process!


A single decimal point could be the difference between the top 10 and the top 50. Isn’t this why we love hackathons in the first place? The thrill of seeing our hard work pay off with a rise in the leaderboard rankings is unparalleled.

So, we’re thrilled to bring to you the top 3 winning approaches from the Innoplexus Sentiment Analysis hackathon! You are going to be awestruck by how these three top data scientists thought through their solutions and came up with their own unique framework.

There is a LOT to learn from these approaches. Trust me, take the time to go through the steps and understand where they came from. And then think if you would have done anything differently. And then – go ahead and take part in these hackathons yourself on our DataHack platform!

So let’s begin, shall we?

 

About the Innoplexus Sentiment Analysis Hackathon

It’s always an exciting prospect, hosting hackathons with our partner Innoplexus. Each time they come up with problem statements that are based on Natural Language Processing (NLP), an immensely popular field right now. We have seen huge developments in NLP thanks to transfer learning models such as BERT, XLNet, GPT-2, etc.

And sentiment analysis is one of the most common NLP projects data scientists tend to work on. This Innoplexus hackathon was a 5-day contest with more than 3,200 data scientists from across the globe competing for job opportunities and exciting prizes offered by Innoplexus.

It was a hard-fought contest with a total of 8000+ submissions and a variety of approaches employed by the best in the business to occupy the top spots.

For those of you who could not make it to the top, or could not find the time to work on the problem, we have collated the winners' approaches and solutions so you can appreciate and learn from them. So here goes.

 

Problem Statement for the Innoplexus Sentiment Analysis Hackathon

There are a lot of components that go into building the narrative of a brand. It isn’t just built and controlled by the company that owns the brand. Think about any big brand you are familiar with and you’ll instantly understand what I’m talking about.

For this reason, companies constantly monitor platforms such as blogs, forums, and social media to check the sentiment around their own products as well as competitor products, and to learn how their brand resonates in the market. This analysis feeds into various aspects of their post-launch market research.

This is relevant for a lot of industries, including pharma and their drugs.


But this comes with several challenges. The language used in this kind of content is rarely grammatically correct, people often use sarcasm, some cover several topics with different sentiments in a single post, and others express their sentiment only through comments on the topic.

Broadly speaking, sentiment can be clubbed into 3 major buckets – Positive, Negative and Neutral Sentiments.

In the Innoplexus Sentiment Analysis Hackathon, participants were provided with data containing samples of text. Each text could potentially contain one or more drug mentions, and each row contained a unique combination of a text and a drug mention. Note that the same text could carry different sentiments for different drugs, as illustrated in the sketch below.
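To make the row structure concrete, here is a purely illustrative sketch of how the same text could appear in two rows, paired with different drugs and different sentiment labels. The column names and label values are assumptions for illustration, not the official competition schema.

```python
import pandas as pd

# Hypothetical layout; the column names and label values are assumptions,
# not the official competition schema.
shared_text = "Drug A worked wonders for me, but Drug B gave me terrible side effects."

rows = pd.DataFrame({
    "text": [shared_text, shared_text],     # the same text appears in two rows
    "drug": ["drug a", "drug b"],           # one row per drug mention
    "sentiment": ["positive", "negative"],  # sentiment is specific to each drug
})
print(rows)
```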

Given the text and drug name, the task was to predict the sentiment for texts contained in the test dataset. Given below is an example of text from the dataset:

Example:

Stelara is still fairly new to Crohn’s treatment. This is why you might not get a lot of replies. I’ve done some research, but most of the “time to work” answers are from Psoriasis boards. For Psoriasis, it seems to be about 4-12 weeks to reach a strong therapeutic level. The good news is, Stelara seems to be getting rave reviews from Crohn’s patients. It seems to be the best med to come along since Remicade. I hope you have good success with it. My daughter was diagnosed Feb. 19/07, (13 yrs. old at the time of diagnosis), with Crohn’s of the Terminal Illium. Has used Prednisone and Pentasa. Started Imuran (02/09), had an abdominal abscess (12/08). 2cm of Stricture. Started Remicade in Feb. 2014, along with 100mgs. of Imuran.

The above text is positive for Stelara and negative for Remicade. Now that we have a solid understanding of what the problem at hand was, let’s dive into the winning approaches!

 

Winners of the Innoplexus Sentiment Analysis Hackathon

As I mentioned earlier, winning a hackathon is extremely difficult. I loved going through these top solutions and approaches provided by our winners. First, let’s look at who won and congratulate them:

Here are the final rankings of all the participants on the Leaderboard.

The top 3 winners have shared their detailed approach from the competition. I am sure you are eager to know their secrets so let’s begin.

 

Rank 3: Mohsin Hasan Khan (ML Engineer @HealthifyMe)

Here’s what Mohsin shared with us:

Approach

“My final solution is an ensemble of BERT and XLNet runs.”

  • My first impression of the data suggested there were a lot of wrong labels, at least as per my perception of negative and positive sentiment. So, I felt it would be really difficult to handcraft features and decided it would be best to stick to state-of-the-art NLP models that can learn from noisy data
  • I started with a simple TF-IDF plus logistic regression model, which gave me a cross-validation (CV) score of 0.5. After looking at the text data, I realized many rows had a lot of lines unrelated to the drug
  • Hence, I decided to use only the sentences in which a drug name occurred. TF-IDF plus logistic regression on these drug-only sentences gave me a CV score of 0.54 (a minimal sketch of this baseline follows this list)
  • At this point, I decided to use BERT. Without any finetuning, I only got a CV score of 0.45. But, once I let BERT finetune on training data, it gave a validation score of 0.60 and a leaderboard score of 0.59. Then, I added sentences that occurred before and after the drug sentence – this increased the CV score slightly. Then I used BERT-large and finetuned it which gave me a CV score of 0.65 and a leaderboard score of 0.61. Similarly, I finetuned the XLNet base, which gave a CV score of 0.64 and leaderboard 0.58. My final solution is an ensemble of BERT and XLNet runs
  • Note: I used stratified 5-fold cross-validation since the class distribution was imbalanced
  • Check out Mohsin's code here.
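Here is a minimal sketch of the baseline described above: keep only the sentences that mention the drug, then run TF-IDF plus logistic regression under stratified 5-fold cross-validation. The column names, the naive sentence splitting, and the macro-F1 scoring are assumptions for illustration; this is not Mohsin's actual code, which is linked above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Assumed column names: "text", "drug", "sentiment"
train = pd.read_csv("train.csv")

def keep_drug_sentences(text: str, drug: str) -> str:
    """Keep only the sentences that mention the given drug (case-insensitive)."""
    sentences = text.split(".")  # naive sentence split, for illustration only
    hits = [s for s in sentences if drug.lower() in s.lower()]
    return ". ".join(hits) if hits else text  # fall back to the full text

train["drug_text"] = [
    keep_drug_sentences(t, d) for t, d in zip(train["text"], train["drug"])
]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

# Stratified 5-fold CV because the class distribution is imbalanced
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(baseline, train["drug_text"], train["sentiment"],
                         cv=cv, scoring="f1_macro")
print("Mean CV macro F1:", scores.mean())
```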

 

Rank 2: Harini Vengala  (Statistical Analyst @WalmartLabs India)

Here’s what Harini shared with us:

Approach

“My final model was an ensemble of 3 BERT and 1 AEN.”

  • My first baseline approach used a count vectorizer, which got me a public leaderboard score of 0.42. I removed digits, emojis, URLs, and punctuation, and converted the text to lowercase
  • Next, I tried BERT and tuned it for the given dataset. I removed stop words from the text, as BERT needs a lot more memory when run on the entire passage, and set the sequence length to 150. However, most of the important information was ignored with this approach, and I couldn't cross a 0.50 score on the public leaderboard
  • So, what else could I try? I took only the sentences in which the given drug was present and used BERT again to classify sentiment. This gave me a score of 0.60 on the public leaderboard. I also implemented the Attentional Encoder Network (AEN) for Targeted Sentiment Classification, which resulted in a 0.56 score
  • My final model was an ensemble of 3 BERT models and 1 AEN model. The loss function I used was CrossEntropyLoss with class weights = 1/number of observations in each corresponding class (a minimal sketch of this weighted loss follows this list)
  • My key takeaway – try different things and check what works for the data. Spend some time listing all the things you can try during the hackathon
  • Check out Harini's code here.
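For reference, here is a minimal PyTorch sketch of the class-weighted CrossEntropyLoss mentioned above, with each class weighted by the inverse of its observation count. The class counts below are placeholders, not the actual dataset statistics.

```python
import torch
import torch.nn as nn

# Placeholder counts for the three sentiment classes; not the real dataset statistics
class_counts = torch.tensor([700.0, 3500.0, 900.0])

# Weight each class by 1 / number of observations in that class, as described above
weights = 1.0 / class_counts
criterion = nn.CrossEntropyLoss(weight=weights)

# Dummy usage: a batch of 8 examples with logits over 3 classes
logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
loss = criterion(logits, labels)
print(loss.item())
```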

 

Rank 1: Melwin Babu (Data Scientist @nference)

Here’s what Melwin shared with us:

Approach

“I noticed pretty early that increasing the max sequence length increased the score significantly. This observation more or less dictated my approach. I used a basic XLNet model with hardly any feature engineering.”

  • I lowercased all the sentences and masked the relevant drug in each sentence. Then, I took the first 1380 tokens after SentencePiece tokenization
  • I chose to use as large a max sequence length as the GPU RAM would allow and refrained from using extra features. I tried to add variations to the data but made implementation mistakes and ran out of time
  • In the final model, I averaged the predictions of the XLNet base cased model over 6 seeds (a rough sketch of this masking and seed averaging follows this list). Time didn't permit any other hyperparameter tuning
  • I was really surprised that, in a competition where deep learning was likely to be the best solution, I could stay competitive with just an 8 GB GPU RAM machine
  • Identifying the differences between the train and test distributions can be crucial. Most other things are the same as in other hackathons
  • Check out Melwin's code here.
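Below is a rough sketch of the two ideas highlighted above: lowercasing the text and masking the target drug with a placeholder token before tokenization, and averaging predicted class probabilities over runs with different seeds. The placeholder token, the helper names, and the predict_proba callback are hypothetical; Melwin's actual pipeline (an XLNet base cased model with a 1380-token sequence length) is linked above.

```python
import re
import numpy as np

def mask_drug(text: str, drug: str, placeholder: str = "[drug]") -> str:
    """Lowercase the text and replace mentions of the target drug with a placeholder."""
    pattern = re.compile(re.escape(drug.lower()))
    return pattern.sub(placeholder, text.lower())

def seed_averaged_labels(texts, predict_proba, seeds=(0, 1, 2, 3, 4, 5)):
    """Average class probabilities over several seeds, then take the argmax.

    predict_proba(texts, seed) is a hypothetical callback that runs the model
    with the given random seed and returns an (n_samples, n_classes) array.
    """
    probs = np.mean([predict_proba(texts, seed=s) for s in seeds], axis=0)
    return probs.argmax(axis=1)

# Example: mask the drug before feeding the text to the tokenizer
print(mask_drug("Stelara seems better than Remicade for me.", "Stelara"))
# -> "[drug] seems better than remicade for me."
```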

 

End Notes

It was great fun interacting with these winners and getting to know their approaches during the competition. This was a tightly contested hackathon and, as you have seen, the winning approaches were excellent.

I encourage you to head over to the DataHack platform TODAY and participate in the ongoing and upcoming hackathons. It will be an invaluable learning experience!

If you have any questions, feel free to post them below.

IIT Bombay graduate with a Master's and Bachelor's in Electrical Engineering. I have previously worked as a lead decision scientist for the Indian National Congress, deploying statistical models (segmentation, K-Nearest Neighbours) to help the party leadership and team make data-driven decisions. My interest lies in putting data at the heart of business for data-driven decision making.

Responses From Readers


Tushar Singh

Is it possible to share the solution as well?

Mohit

Thanks for sharing winning approaches, is it possible to share the code as well?

Preeti

Thanks for sharing the approaches…is there any GitHub repo likewise for a different dataset which we can explore
