There is no fixed age for learning and mastering something. The general perception that data scientists need many years to master their skills is just a myth, and to prove that to you, we bring you a Kaggle Grandmaster who defied all limits.
Joining us today in the 14th edition of the Kaggle Grandmaster Series is one of the youngest Kaggle Grandmasters, Peiyuan Liao.
Peiyuan is the youngest Chinese Kaggle Competitions Grandmaster and ranks 28th with 7 gold medals to his name. He is also a Kaggle Discussions Master and an Expert in the Kaggle Notebooks section.
Peiyuan is currently pursuing his Bachelor’s degree in Computer Science at Carnegie Mellon University.
You can go through the previous Kaggle Grandmaster Series Interviews here.
Peiyuan Liao (PL): Both university coursework and Kaggle competitions are time-consuming for me, so right now I only participate in competitions during breaks (Thanksgiving, Christmas, summer, etc.). I do agree that the benefit is two-way: my Kaggle experience during my high school years gave me a better understanding of data science and computer science, as well as certain engineering techniques, and in turn, my coursework and research in machine learning have helped me explore novel methods for Kaggle competitions.
PL: Yes, I’m currently taking the introductory machine learning course at my school, and I’m planning to take more deep learning courses next semester. I do learn on my own, and I tend to read papers on arXiv and OpenReview. For me, one of the best sources for learning is Ian Goodfellow’s Deep Learning book, but I believe it is always better to read the original papers and look at the authors’ reference implementations.
PL: In our research, we study the problem of protecting information when learning with graph-structured data. While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representational learning in many applications, the neighborhood aggregation paradigm exposes additional vulnerabilities to attackers seeking to extract node-level information about sensitive attributes.
To counter this, we propose a minimax game between the desired GNN encoder and the worst-case attacker. The resulting adversarial training creates a strong defense against inference attacks, while only suffering a small loss in task performance. We analyze the effectiveness of our framework against a worst-case adversary and characterize the trade-off between predictive accuracy and adversarial defense.
Experiments across multiple datasets from recommender systems, knowledge graphs, and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks while producing competitive GNN encoders.
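To give a flavor of the minimax idea, here is a minimal sketch in PyTorch. This is our illustration, not the paper’s code: a plain MLP stands in for the GNN encoder, and random tensors stand in for graph data. The attacker is trained to recover a sensitive attribute from the embeddings, while the encoder is trained to perform the main task and defeat the attacker.

```python
# Minimal minimax adversarial-training sketch (simplified stand-ins,
# not the paper's implementation): an MLP "encoder" replaces the GNN,
# and random tensors replace graph-structured data.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, h = 256, 16, 32
x = torch.randn(n, d)                  # node features (placeholder)
y_task = torch.randint(0, 2, (n,))     # main-task labels (placeholder)
y_sens = torch.randint(0, 2, (n,))     # sensitive attribute (placeholder)

encoder = nn.Sequential(nn.Linear(d, h), nn.ReLU())
task_head = nn.Linear(h, 2)            # predicts the task label
attacker = nn.Linear(h, 2)             # tries to infer the sensitive attribute

opt_enc = torch.optim.Adam([*encoder.parameters(), *task_head.parameters()], lr=1e-3)
opt_att = torch.optim.Adam(attacker.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 0.5                              # utility-vs-defense trade-off weight

for step in range(200):
    # 1) Attacker step: maximize its ability to recover the sensitive attribute.
    z = encoder(x).detach()
    att_loss = ce(attacker(z), y_sens)
    opt_att.zero_grad(); att_loss.backward(); opt_att.step()

    # 2) Encoder step: do well on the task while making the attacker fail.
    z = encoder(x)
    enc_loss = ce(task_head(z), y_task) - lam * ce(attacker(z), y_sens)
    opt_enc.zero_grad(); enc_loss.backward(); opt_enc.step()
```

The subtracted attacker loss in the encoder step is what creates the adversarial defense; the coefficient lam controls the trade-off between predictive accuracy and privacy described above.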
I’ve always had a passion for exploring not only the performance but also the safety and responsibility of machine learning algorithms: while it is nice to have a model perform tasks with flying colors, we need to make sure that it is safe to use, that it cannot be exploited maliciously, and that it does not make unethical choices.
PL: At the start of a competition, I will make a priority list (which will be updated throughout the competition) of what to implement and what to explore. Things like making the data pipeline bug-free are usually high on the list, while things like reading papers on new improvement tricks tend to be lower.
If the deadline is near, I will prioritize the remaining items that are higher on the priority list. And in the end, I tend to write lots of comments in my code, so that I can always go back and make sure that I knew what I was doing. This helps a lot in debugging, which tends to be time-consuming in hackathons.
PL: When I was a beginner, I mainly chose topics that I was familiar with: simple image classification, tabular data, etc. It was mainly because I would be familiar with the methodologies involved. But now I focus more on the data involved: I believe that data is one of the most important components of a successful solution. If the data is not clean or is awkwardly represented, developing models around it tends to be a waste of time.
PL: My first instinct would be to do more EDA to figure out the core of the problem: is it that the data is not clean enough, or are there magic features that need to be extracted? I also use many data visualization tools to figure out what’s wrong with my model: is it not trained enough, or are there simply bugs in inference and prediction? As for resources, I tend to look at the source code and documentation of the libraries I’m using, like PyTorch, sklearn, etc. I also go to arXiv and GitHub for the newest papers and their implementations, to find inspiration for novel methods.
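As a concrete illustration of that diagnosis (our own minimal sketch with made-up loss values, not Peiyuan’s code), plotting training versus validation loss is a quick first check: if both curves are still falling, the model is likely under-trained; if training loss is low but the submission score is poor, the bug is more likely in inference or prediction.

```python
# Minimal diagnostic sketch (hypothetical loss history, for illustration):
# falling curves suggest training longer; a low training loss paired with
# a poor leaderboard score points at an inference/prediction bug instead.
import matplotlib.pyplot as plt

history = {
    "train_loss": [0.90, 0.60, 0.45, 0.38, 0.34],
    "val_loss":   [0.95, 0.70, 0.55, 0.50, 0.49],
}

plt.plot(history["train_loss"], label="train")
plt.plot(history["val_loss"], label="validation")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Both curves still falling: likely under-trained")
plt.show()
```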
PL: Below are my usual steps for building a model:

EDA -> universal baseline -> more EDA -> read from arXiv -> delve into metric -> improvement
EDA: I will first do a thorough inspection of the data to see if there are any missing samples, noisy labels, or leakage. Then I will write a Jupyter notebook for visualizations like the label distribution. I will sometimes inspect each sample individually to get a sense of the difficulty of the task.
Universal baseline: I have universal baseline code for several types of data, like a set of hyperparameters for XGBoost or a CNN architecture for image classification. The purpose of this is to establish a fully working submission pipeline, especially for notebook-only competitions (a minimal sketch of such a baseline follows this list).
More EDA: I will then analyze the baseline results, compare them to the leaderboard, and do more data analysis to look for room for improvement.
Read from arXiv: This is when I search for the newest papers from arXiv or top conferences for methods that can be incorporated into my solution. For example, if I’m dealing with an object detection problem, I will look at papers with results above a certain COCO mAP to find tricks in the training method, loss function, data augmentation, or model architecture.
Delve into metric: At this point, I will revisit the metric to see if there is any room for improvement. The ideal case is that the model optimizes the metric directly. If that’s not possible, I will spend time working on better surrogates.
Improvements: This is where I work on improvements to the solution, usually on a case-by-case basis. I tend to try out calibration methods and model ensembling.
My pipeline remains pretty much the same for different kinds of data.
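For readers curious what such a universal baseline for tabular data can look like, here is a hedged sketch (illustrative XGBoost hyperparameters and synthetic data of our own choosing, not Peiyuan’s actual settings). The point is to get a complete train-validate-predict pipeline working before any tuning.

```python
# Minimal "universal baseline" sketch for tabular data (illustrative
# hyperparameters, not Peiyuan's actual settings): establish a working
# train -> validate -> predict pipeline before any tuning.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                      # placeholder features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)  # placeholder labels

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="auc",
)
model.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], verbose=False)

print("validation AUC:", roc_auc_score(y_va, model.predict_proba(X_va)[:, 1]))
```

Once a pipeline like this runs end to end and produces a valid submission, every later improvement (features, architectures, ensembles) can be measured against it.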
PL: For the past semester, I was working on a project where I needed to write a compiler for a C-like language. It is really fascinating to see how a human-friendly programming language eventually turns into a machine-friendly language like assembly, and by writing out each component of the compiler, I became more familiar with the features of programming languages I use every day.
PL: Honestly, I’m not sure yet. I am still exploring and I’m open to opportunities. I think I will probably be more certain once I do a few more internships.
PL: The first three are machine learning scientists that I admire:
The remaining two are Kagglers:
Well, age is indeed just a number, and Peiyuan has proved it time and again with his dedication to data science. We hope this youngster gives you the courage to tear down the age barrier you may have built as a stumbling block in your mind.
This is the 14th interview in the Kaggle Grandmaster Series. You can read the previous ones at the following links:
What did you learn from this interview? Are there other data science leaders you would want us to interview for the Kaggle Grandmaster Series? Let me know in the comments section below!