There is no fixed age for learning and mastering something. The general perception that data scientists need many years to master their skills is just a myth, and to prove that to you, we bring you a Kaggle Grandmaster who defied all limits.
Joining us today in the 14th edition of the Kaggle Grandmaster Series is one of the youngest Kaggle Grandmasters, Peiyuan Liao.
Peiyuan is the youngest Chinese Kaggle Competitions Grandmaster and ranks 28th with 7 gold medals to his name. He is also a Kaggle Discussions Master and an Expert in the Kaggle Notebooks section.
Peiyuan is currently pursuing his Bachelor’s degree in Computer Science at Carnegie Mellon University.
You can go through the previous Kaggle Grandmaster Series Interviews here.
Peiyuan Liao (PL): Both university coursework and Kaggle competitions are time-consuming for me, so right now I only participate in competitions during breaks (Thanksgiving, Christmas, summer, etc.). I do agree that the benefit is two-way: my Kaggle experience during my high school years gave me a better understanding of data science and computer science, as well as certain engineering techniques, and in turn, my coursework and research in machine learning have helped me explore novel methods for Kaggle competitions.
PL: Yes, I’m currently taking the introductory machine learning course at my school, and I’m planning to take more deep learning courses next semester. I do learn on my own, and I tend to read papers on arXiv and OpenReview. For me, one of the best sources for learning is Ian Goodfellow’s Deep Learning book, but I believe it is always better to read the original papers and look at the authors’ reference implementations.
PL: In our research, we study the problem of protecting information when learning with graph-structured data. While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representational learning in many applications, the neighborhood aggregation paradigm exposes additional vulnerabilities to attackers seeking to extract node-level information about sensitive attributes.
To counter this, we propose a minimax game between the desired GNN encoder and the worst-case attacker. The resulting adversarial training creates a strong defense against inference attacks, while only suffering a small loss in task performance. We analyze the effectiveness of our framework against a worst-case adversary and characterize the trade-off between predictive accuracy and adversarial defense.
Experiments across multiple datasets from recommender systems, knowledge graphs, and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks while producing competitive GNN encoders.
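To give a flavor of the minimax idea, here is a minimal sketch in PyTorch. This is our illustration, not the paper’s code: a plain MLP stands in for the GNN encoder, and random tensors stand in for graph data. The attacker is trained to recover a sensitive attribute from the embeddings, while the encoder is trained to perform the main task and defeat the attacker.

```python
# Minimal minimax adversarial-training sketch (simplified stand-ins,
# not the paper's implementation): an MLP "encoder" replaces the GNN,
# and random tensors replace graph-structured data.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, h = 256, 16, 32
x = torch.randn(n, d)                  # node features (placeholder)
y_task = torch.randint(0, 2, (n,))     # main-task labels (placeholder)
y_sens = torch.randint(0, 2, (n,))     # sensitive attribute (placeholder)

encoder = nn.Sequential(nn.Linear(d, h), nn.ReLU())
task_head = nn.Linear(h, 2)            # predicts the task label
attacker = nn.Linear(h, 2)             # tries to infer the sensitive attribute

opt_enc = torch.optim.Adam([*encoder.parameters(), *task_head.parameters()], lr=1e-3)
opt_att = torch.optim.Adam(attacker.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 0.5                              # utility-vs-defense trade-off weight

for step in range(200):
    # 1) Attacker step: maximize its ability to recover the sensitive attribute.
    z = encoder(x).detach()
    att_loss = ce(attacker(z), y_sens)
    opt_att.zero_grad(); att_loss.backward(); opt_att.step()

    # 2) Encoder step: do well on the task while making the attacker fail.
    z = encoder(x)
    enc_loss = ce(task_head(z), y_task) - lam * ce(attacker(z), y_sens)
    opt_enc.zero_grad(); enc_loss.backward(); opt_enc.step()
```

The subtracted attacker loss in the encoder step is what creates the adversarial defense; the coefficient lam controls the trade-off between predictive accuracy and privacy described above.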
I’ve always had a passion for exploring not only the performance but also the safety and responsibility of machine learning algorithms: while it is nice to have a model perform tasks with flying colors, we need to make sure that it is safe to use, that it cannot be exploited maliciously, and that it does not make unethical choices.
PL: At the start of a competition, I will make a priority list (which will be updated throughout the competition) of what to implement and what to explore. Things like making the data pipeline bug-free are usually high on the list, while things like reading papers on new improvement tricks tend to be lower.
If the deadline is near, I will prioritize the remaining items that are higher on the priority list. And in the end, I tend to write lots of comments in my code, so that I can always go back and make sure that I knew what I was doing. This helps a lot in debugging, which tends to be time-consuming in hackathons.
PL: When I was a beginner, I mainly chose topics that I was familiar with: simple image classification, tabular data, etc. It was mainly because I would be familiar with the methodologies involved. But now I focus more on the data involved: I believe that data is one of the most important components of a successful solution. If the data is not clean or is awkwardly represented, developing models around it tends to be a waste of time.
PL: My first instinct would be to do more EDA to figure out the core of the problem: is it that the data is not clean enough, or are there magic features that need to be extracted? I also use many data visualization tools to figure out what’s wrong with my model: is it not trained enough, or are there simply bugs in inference and prediction? As for resources, I tend to look at the source code and documentation of the libraries I’m using, like PyTorch, sklearn, etc. I also go to arXiv and GitHub for the newest papers and their implementations, to find inspiration for novel methods.
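As a concrete illustration of that diagnosis (our own minimal sketch with made-up loss values, not Peiyuan’s code), plotting training versus validation loss is a quick first check: if both curves are still falling, the model is likely under-trained; if training loss is low but the submission score is poor, the bug is more likely in inference or prediction.

```python
# Minimal diagnostic sketch (hypothetical loss history, for illustration):
# falling curves suggest training longer; a low training loss paired with
# a poor leaderboard score points at an inference/prediction bug instead.
import matplotlib.pyplot as plt

history = {
    "train_loss": [0.90, 0.60, 0.45, 0.38, 0.34],
    "val_loss":   [0.95, 0.70, 0.55, 0.50, 0.49],
}

plt.plot(history["train_loss"], label="train")
plt.plot(history["val_loss"], label="validation")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Both curves still falling: likely under-trained")
plt.show()
```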
PL: Below are my usual steps for building a model:

EDA -> universal baseline -> more EDA -> read from arXiv -> delve into metric -> improvement
EDA: I will first do a thorough inspection of the data to see if there are any missing samples, noisy labels, or leakage. Then I will write a Jupyter notebook for visualizations like the label distribution. I will sometimes inspect each sample individually to get a sense of the difficulty of the task.
Universal baseline: I have universal baseline code for several types of data, like a set of hyperparameters for XGBoost or a CNN architecture for image classification. The purpose of this is to establish a fully working submission pipeline, especially for notebook-only competitions (a minimal sketch of such a baseline follows this list).
More EDA: I will then analyze the baseline results, compare them to the leaderboard, and do more data analysis to look for room for improvement.
Read from arXiv: This is when I search for the newest papers from arXiv or top conferences for methods that can be incorporated into my solution. For example, if I’m dealing with an object detection problem, I will look at papers with results above a certain COCO mAP to find tricks in the training method, loss function, data augmentation, or model architecture.
Delve into metric: At this point, I will revisit the metric to see if there is any room for improvement. The ideal case is that the model optimizes the metric directly. If that’s not possible, I will spend time working on better surrogates.
Improvements: This is where I work on improvements to the solution, usually on a case-by-case basis. I tend to try out calibration methods and model ensembling.
My pipeline remains pretty much the same for different kinds of data.
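For readers curious what such a universal baseline for tabular data can look like, here is a hedged sketch (illustrative XGBoost hyperparameters and synthetic data of our own choosing, not Peiyuan’s actual settings). The point is to get a complete train-validate-predict pipeline working before any tuning.

```python
# Minimal "universal baseline" sketch for tabular data (illustrative
# hyperparameters, not Peiyuan's actual settings): establish a working
# train -> validate -> predict pipeline before any tuning.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # placeholder features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # placeholder labels

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="auc",
)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)

print("validation AUC:", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
```

Once a pipeline like this runs end to end and produces a valid submission, every later improvement (features, architectures, ensembles) can be measured against it.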
PL: For the past semester, I was working on a project where I needed to write a compiler for a C-like language. It is really fascinating to see how a human-friendly programming language eventually turns into a machine-friendly language like assembly, and by writing out each component of the compiler, I became more familiar with the features of programming languages I use every day.
PL: Honestly, I’m not sure yet. I am still exploring and I’m open to opportunities. I think I will probably be more certain once I do a few more internships.
PL: The first three are machine learning scientists that I admire:
The remaining two are Kagglers:
Well, age is indeed just a number, and Peiyuan has proved it time and again with his dedication to data science. We hope this youngster gives you the courage to tear down the age barrier you may have built as a stumbling block in your mind.
This is the 14th interview in the Kaggle Grandmaster Series. You can read the previous ones at the following links:
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!