People click on the top items in search results and recommendations more often because they are on top, not because of their relevance. If you order your search results with an ML model, their quality may eventually degrade because of this positive, self-reinforcing feedback loop. How can this problem be solved?
Nowadays, search ranking and recommendation systems rely on large amounts of data to train machine learning models like Learning-to-Rank (LTR) models to rank results for a given query, and implicit user feedback (e.g., click data) has become the dominant source of training data due to its abundance and low cost, particularly for major Internet companies. Nevertheless, one disadvantage of this data collection method is that the data may be highly skewed. One of the most prominent biases is position bias, which occurs when users are inclined to click on higher-ranked results.
In this article, we’re going to discuss why position bias happens, how we measured it with a small crowd-sourced experiment, and two ways to offset it: Inverse Propensity Weighting and the position-aware (PAL) learning approach.
Every time you present a list of things, such as search results or recommendations (or autocomplete suggestions and contact lists), to a human being, they can hardly ever impartially evaluate all the items in it.
The cascade click model assumes that people evaluate the items in the list sequentially, from top to bottom, until they find a relevant one. This means that items at the bottom have a smaller chance of being evaluated at all, and hence organically receive fewer clicks.
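To make this concrete, here is a tiny simulation of a simplified cascade-style browsing process (a hypothetical sketch, not the exact model): even though every item is equally relevant, lower positions collect fewer clicks simply because many users stop scanning before reaching them.

```python
import random

def simulate_cascade_clicks(n_items=10, n_users=100_000, relevance=0.2):
    """Simulate a cascade click model: each user scans the list from top
    to bottom and clicks the first item they find relevant. Every item is
    equally relevant here, so any difference in clicks per position comes
    purely from the scanning order."""
    clicks = [0] * n_items
    for _ in range(n_users):
        for position in range(n_items):
            if random.random() < relevance:
                clicks[position] += 1
                break  # cascade assumption: the user stops after a click
    return clicks

if __name__ == "__main__":
    for pos, count in enumerate(simulate_cascade_clicks(), start=1):
        print(f"position {pos}: {count} clicks")
```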
Top items receive more clicks simply because of their position; this behavior is called position bias. However, position bias is not the only bias affecting item lists, and there are plenty of other dangerous effects to watch out for.

In practice, position bias is the strongest one, and removing it during training may improve your model's reliability.
We conducted a small crowd-sourced study of position bias. Starting from the RankLens dataset, we used the Google Keyword Planner tool to generate a set of queries people use to find each particular movie.

With a set of movies and corresponding real queries, we have a perfect search evaluation dataset: all items are well known to a wide audience, and we know the correct labels in advance.
All major crowd-sourcing platforms, like Amazon Mechanical Turk and Toloka.ai, have out-of-the-box templates for typical search evaluation.
But there’s a nice trick in such templates that prevents you from shooting yourself in the foot with position bias: each item must be examined independently, and even if multiple items are present on screen, their ordering is random!

But does random item order actually prevent people from clicking on the first results?
The raw data for the experiment is available on github.com/metarank/msrd, but the main observation is that people still click more on the first position, even on randomly-ranked items!
But how can you offset the impact of position on the implicit feedback you get from clicks? Each time you measure the click probability of an item, you observe a combination of two independent factors: the item's actual relevance and the bias introduced by the position it was shown at.
In the MSRD dataset mentioned in the previous paragraph, it’s hard to separate the impact of position from BM25 relevance, as you only ever observe them combined.

For example, 18% of clicks happen on position #1. Does this happen only because the most relevant item is presented there? Would the same item on position #20 get the same share of clicks?
The Inverse Propensity Weighting approach suggests that the observed click probability on a position is just a combination of two independent variables:
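A common way to write this decomposition (a sketch of the standard examination-hypothesis formulation, where $k$ denotes the position) is:

$$P(\text{click} \mid \text{item}, k) = P(\text{relevant} \mid \text{item}) \cdot P(\text{examined} \mid k)$$

where $P(\text{examined} \mid k)$ is the propensity of position $k$: the probability that a user even looks at that position.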
And then, if you estimate the click probability on each position (the propensity), you can weight all your relevance labels with it and get an actual unbiased relevance:
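Concretely, each click is re-weighted by the inverse of its position's propensity (a sketch of the standard IPW estimator, assuming $p_k$ is the propensity of position $k$):

$$\hat{r}(\text{item}) = \frac{\text{click}(\text{item}, k)}{p_k}$$

so a click received far down the list, where $p_k$ is small, contributes more to the relevance label than a click on the first position.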
But how can you estimate the propensity in practice? The most common method is to introduce minor shuffling into the rankings so that the same items within the same context (e.g., the same search query) are evaluated at different positions.
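As a rough illustration of how propensity estimation from such partially randomized logs could look, here is a minimal sketch; the data layout and the normalization to position #1 are assumptions made for the example, not the exact procedure from the article.

```python
from collections import defaultdict

def estimate_propensities(impressions):
    """Estimate per-position propensities from ranking logs where the item
    order was (at least partially) randomized.

    `impressions` is an iterable of (position, clicked) pairs. Because the
    ordering was shuffled, the average CTR per position reflects position
    bias rather than relevance. Propensities are normalized so that
    position 1 has propensity 1.0."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for position, was_clicked in impressions:
        shown[position] += 1
        if was_clicked:
            clicked[position] += 1
    ctr = {pos: clicked[pos] / shown[pos] for pos in shown}
    top = ctr.get(1, max(ctr.values()))
    return {pos: rate / top for pos, rate in sorted(ctr.items())}

# Toy usage: position 1 gets clicked twice as often as position 2.
log = [(1, True), (1, False), (2, True), (2, False), (2, False), (2, False)]
print(estimate_propensities(log))  # {1: 1.0, 2: 0.5}
```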
But adding extra shuffling will definitely degrade your business metrics like CTR and conversion rate. Are there any less invasive alternatives that don't involve shuffling?
A position-aware approach to ranking asks your ML model to optimize for both ranking relevance and position impact at the same time.

In other words, you let your ranking ML model learn how position affects relevance during training, but neutralize this feature during prediction: all items are presented to the model as if they were shown at the same constant position.
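Here is a minimal sketch of this position-as-a-feature trick using a LightGBM ranker on toy data; the feature names, the dataset, and the choice of position #1 as the inference-time constant are assumptions made for illustration, and this is not Metarank's actual implementation.

```python
import numpy as np
import lightgbm as lgb

# Toy training data: each row is [bm25_score, position_shown], with
# relevance labels and per-query group sizes, as in a typical LTR setup.
X_train = np.array([
    [2.3, 1], [1.1, 2], [0.4, 3],   # query 1
    [0.9, 1], [2.7, 2], [1.5, 3],   # query 2
    [1.9, 1], [0.3, 2], [2.2, 3],   # query 3
    [2.5, 1], [1.4, 2], [0.8, 3],   # query 4
], dtype=float)
y_train = np.array([1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0])
groups = [3, 3, 3, 3]  # three results per query

# Train with the position each item was actually shown at, so the model
# can learn how much of each click was "paid for" by the position.
ranker = lgb.LGBMRanker(n_estimators=50, min_child_samples=1)
ranker.fit(X_train, y_train, group=groups)

# At prediction time, neutralize the position feature: pretend every
# candidate sits at the same constant position (here, position 1).
X_candidates = np.array([[1.8, 1], [2.1, 1], [0.7, 1]], dtype=float)
print(ranker.predict(X_candidates))
```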
But which constant value should you choose? The authors of the PAL paper ran a couple of numerical experiments on selecting the optimal value; the rule of thumb is not to pick too high a position number, as the click data there is too noisy.
The PAL approach is already part of multiple open-source tools for building recommendation and search systems, such as Metarank.
As the position-aware approach is just a hack around feature engineering, in Metarank it is only a matter of adding yet another ranking feature to the configuration.
On the MSRD dataset mentioned above, such a PAL-inspired ranking feature has quite a high importance value compared to the other ranking features.
The position-aware learning approach is not limited to pure ranking tasks and position de-biasing: you can use the same trick to overcome other types of bias.

An ML model trained with the PAL approach should produce unbiased predictions. Given the simplicity of the PAL approach, it can also be applied in other areas of ML where biased training data is common.
While conducting this research, we made the following main observations: people click more on the first positions even when items are ranked randomly, so implicit feedback always carries position bias; Inverse Propensity Weighting can offset this bias, but estimating propensities usually requires shuffling the rankings, which hurts business metrics; and the position-aware (PAL) approach de-biases the model during training without shuffling, by treating position as a feature that is replaced with a constant at prediction time.