Introduction
90-95% of the times we find ourselves working with tabular data in machine learning. That number jumps up even more in machine learning hackathons. Can you remember the last time you worked on a challenge that had you thinking – “I haven’t seen this kind of data before!”
Working with graph data is a unique challenge. That’s why we were thrilled to host the ML Hikeathon in partnership with Hike last month. And our community loved it too – more than 5300 data scientists from all over the world participated in the 9-day event!
There was a lot to appreciate and gain by participating in this hackathon, including:
- Handling and working with Big Data
- How to approach a non-tabular data problem
- Working on a different domain – we don’t often come across graph problems
- Understanding how graphs work in an industry setting – Hike is among the leaders in their space
If you didn’t participate in the ML Hikeathon, you missed out on a lot of fun! But don’t worry – head over to our DataHack platform and enrol in upcoming hackathons and practice problems today.
About the ML Hikeathon
The ML Hikeathon was a marathon competition spanning a full 9 days! The hackathon went live on the midnight of March 30th, 2019 and closed on 7th April 2019.
Given the nature of the problem statement, it would definitely have taken data scientists some time to understand the requirements and frame their solution. That’s what makes a 9-day hackathon so unique – you get time to think through and experiment with your approach – but the level of the challenge goes up several notches.
Here, I have to tip my hat to the data scientists who came up with the top solutions. The creativity and knowledge displayed by these data scientists were sublime.
Understanding the ML Hikeathon Problem Statement
Hike is a popular social platform. Predicting links in their network forms the basis for recommending new friends to their users. And in turn, high-quality recommendations help in the creation of new social connections between existing users.
So, the problem statement involved building a model for Link Prediction on Hike’s Social Network. This link prediction model would increase the retention of new users by helping them find friends as they join the platform. In other words, the aim was to develop an algorithm that can predict whether a Hike user will chat up another Hike user who is a part of his/her phone contact list.
The evaluation metric used was Area under the Curve (AUC).
Datasets Provided for the Hikeathon
The data for this competition was a subset of Hike’s social graph and the anonymized features of users. The participants were provided with three files:
- Train.csv:
- node1_id and node2_id: The train set contained anonymized identifiers for users who are in each other’s phone address book
- is_chat signifies their chat relationship
- is_chat is 1 if the first user sends a chat message to the second user, and 0 otherwise
- Test.csv:
- The test set contains an id and a pair of nodes – node1_id, node2_id
- Participants are required to predict is_chat on the test set
- User_features.csv:
- This file contains some anonymized features of all nodes/users
- f13 is a categorical feature, f1-f12 are ordinal features each representing the number of days a user did some specific activity on the app in the last 31 days
Winners & their Approaches
Alright – let’s put our hands together for the winners! These winners used different and unique approaches to rise up the leaderboard. Below are the top 3 winners on the hackathon leaderboard:
- #1 – Kamakal Vectors
- #2 – Tezdhar
- #3 – SVJ and Datageek
Let’s look at each of their winning solutions. We have penned down the approaches in the winners’ own words. There’s a lot to learn from them so take notes!
We strongly recommend going through these approaches as they will help shape your mindset for future hackathons. Look at these approaches through a dataset-agnostic lens. Understand how a winner frames their thinking to approach a hackathon as tricky as this one.
Rank 3: Sourabh Jha and Datageek
Feature engineering was the name of the game for Sourabh and Datageek.
Validation strategy:
- 10-fold Cross-Validation
Feature engineering:
- We started by simply merging the anonymized features with the train data on node_1 and node_2
- For each anonymized feature, we created new features by calculating the difference for each feature between node1 and node2 (fi_x-fi_y)
- After initial model training, we identified the top five features among anonymized features – f3, f6, f9, f12 and f13. For each of these, we calculated the pairwise difference within each node (fi_x-fj_x, fi_y-fj_y)
- Cosine similarity: We calculated the cosine distance between anonymized feature vectors of the two nodes
- We also calculated sum, mean, min and max for each node for the corresponding anonymized feature
Graph-based features (Directed graph):
- We split the data into 10 folds. For each fold, we used the other 9 folds to build a graph. And we used the entire train data to build the graph for the test set
- We built both direct as well as undirected graphs wherever is_chat was 1
- Directed Graph was used to create features like page rank and average neighbors for each node and the difference between the page ranks and average neighbors
- The directed graph was also used to identify a one-sided connection. 90% of the connections were bi-directed so this was an important feature
- Also, the total connections (along with bi-directed connections) to and from each node were also calculated, along with their averages for each node
Graph-based features (Undirected graph):
- We computed connected components to identify cliques in the graph using the undirected graph. Additional features such as node count and average connections in each component were also calculated
- Clustering coefficient for each node was computed
- We also computed the number of common neighbors and Jaccard similarity (for both directed and undirected graph)
Features using all the pairs:
- We were not entirely sure how the data (node1-node2 pairs) were generated, but we hypothesized that each pair of data represent phone contacts of node1
- This allowed us to create more features similar to graph features, like common contacts, number of contacts, bi-directed contacts, etc.
Some more feature ideas they shared were:
- We found that nearly 1M million pairs were self-connections
- Such connections had only 0.1% likelihood of is_chat compared to 3% within the data
- Our hypothesis was that these are the same customers/device/ SIM and hence are very unlikely to chat with each other
Models:
- Step 1: We used LightGBM to train the model
- Step 2: Since the data was huge, we ran most of the iteration on 20% data. This helped us to quickly iterate and test more features
- Step 3: Making the final submission. The final submission was trained on 90% data on a 200GB machine
They used a larger computer machine with 64GB RAM and 16 cores to train due to the large size of the data.
Rank 2: Tezdhar
Validation strategy:
- Stratified K-Fold cross-validation
Feature engineering:
- User activity features: All activity features for both node1 and node2 users were used. All neural networks of all these features were simply divided by 31 to bring them in the 0-1 scale
- Node-Node features (Graph features): Common link prediction indicators like (node counts(both node1 and node2), common neighbors, Jaccard coefficient, Adamic Aldar, resource allocation, preferential attachment, total neighbors, neighbor distance and neighbors common neighbor) were calculated for 4 different graphs. The four graphs were as follows:
- Undirected graph of all node pairs in both train and test (this will capture phonebook clusters)
- Directed graph from node1 –> node2 for all pairs provided in train and test
- Undirected graph of nodes who had chatted (is_chat == 1). The features from these graphs were mapped in cross-validation fashion to avoid information leakage
- Directed graph of nodes who had chatted. Again, these were calculated in cross-validation fashion.
- Additionally, during analysis, it was found that some users had self edges (as in both users were the same nodes). Features were recalculated removing those nodes. For the gradient boosting model, all features were used as is and for Neural Net model, the features were transformed using RankGauss (sklearn’s Quantile transformer with output distribution set to normal)
- Swap nodes; If node1 chatted with node2, there was a very high change of node2 chatting with node1. This feature was generated in cross-validation fashion
- Node2vec: The library was used to generate node2vec vectors which were used in the Neural Net model
Models:
- Gradient Boosting models: Two LightGBM models were trained using slightly different feature sets
- Neural Net: Pyramid-like neural net with 3 layers (after concatenating all features) was trained for 2 folds
- The final model was a weighted average of probabilities from 2 LightGBM and 1 Neural Net model
This provided him a score of 0.938988506 on the private leaderboard. A brilliant result!
Rank 1: Kamakal Vectors
Here’s a brief overview of their approach:
- Our solution heavily depended on Negative Undersampling
- We ended up using around 5% of the negative examples (is_chat==0)
- We didn’t notice any drop in the performance
- This approach helped us on two fronts:
- Rapid Experimentation: We were able to test out many different sets of features because of the reduced training time
- Our final solution was a blend of multiple models trained on a different 5% of the negative samples. This helped us in getting a more robust and better model
Validation strategy:
- Five-fold random stratified CV
Feature engineering:
Feature creation played a very important role in our climb up the leaderboard. We used three major sets of features in our final model:
- User Activity Features
- We were given a user activity matrix in 13 dimensions, which was used to create metrics on the similarity of users based on activity
- Graph Features
- User Social Circle Features
Graph Features:
We created three graphs from the data set:
- Undirected Contact Graph
- Directed Contact Graph
- Chat Graph (capturing the data with is_chat=1)
Some of the metrics used include:
- Jaccard coefficient
- Resource allocation index
- Degrees of nodes (in case of a directed graph, in/out degrees)
User Social Circle Features
- These variables proved to the most crucial for boosting model performance
- The number of mutual nodes was calculated between the node pairs. Higher this number, more the chances they interact
- With how many mutual nodes does each node pair interact? Let’s take an example:
- We have to find the chat probability between A&B. X, Y, Z are the mutual contacts between A&B
- Now if A&B are chatting with all three then there is a higher chance they will chat with each other
- If they are not chatting with anyone, then it’s highly likely that X, Y, Z are customer care numbers
- How many time is each node involved in a chat? For A and B to have a higher chance of chatting, they need to be chatting with other people too
- How chatty is the neighborhood? If this number is high for both the nodes, then it indicates they are part of a more talkative neighborhood and hence higher the chances of chatting.
- Inverse Links, this features captured the inverse relationships present in the data shared across train and test.
Final Model
- The final model is an ensemble of 10 LightGBM classifiers with each model fitted over a five-fold random stratified CV
- Each of the above models was built on a different subset of data with negative undersampling
This provided them with the first rank and an AUC score of 0.940911. Congratulations again to our winners!
Key Learnings or Takeaways
Below are two key takeaways from the ML Hikeathon:
- When the competition started, a few participants found it very difficult to deal with such huge datasets. So this was something new that most of us learned – the whole data wasn’t used for training. Feature engineering was heavily leaned on later once a submission was already done
- The winners brought out some really amazing features by using directed and undirected graphs. As far as the validation strategy was concerned, most of them adopted stratified K-fold validation
End Notes
It was great interacting with these winners and understanding their approach during the competition. We see a heavy emphasis on feature engineering again. It’s no coincidence that creating new features is often the difference between several positions in the leaderboard.
Hopefully, you will be able to evaluate what you missed out. Check out all the upcoming competitions here. Participate in them and improve your knowledge as well as expertise in the field. See you in the next hackathon!
Analytics Vidhya Content team