Counters in Recommendations: From Exponential Decay to Position Debiasing
In recommendations and similar ML tasks, the use of counters as features is a popular approach. Examples include the number of times a user has clicked on documents from the same host or a document’s CTR (calculated using two counters — clicks and impressions). Often, composite keys for aggregation are utilized, like counting events of a specific type in a specific region, category, or user segment.
For recommendations, counters are frequently the starting point, although some may begin with matrix factorization. They emerged as a way to feed traditional models with categorical features like user ID, document ID, topic, etc. Even in the neural network era, where embeddings can model categorical features, counters remain essential and highly useful. Counters are cost-effective: they can be calculated over a long period (neural networks might be too expensive for the same), and they’re easily updated in real time (unlike neural networks, which at best can be updated in near real-time, often with a significant delay).
It’s useful to calculate counters over different time periods, such as an hour, day, week, month, etc. If a user has never shown interest in a certain topic but suddenly does in the last day, that’s a valuable signal. For documents, the ratio of counters over various periods can express trends.
I was surprised to find that not all utilize a straightforward modification of counters, known as exponential decay. Instead of counting the number of events, these counters sum the weights, which exponentially diminish with time.
Updating regular counters for a finite period (time window) isn’t exactly simple. Offline processing might simply recompute the value for a new period, but in real-time, the count of old events falling outside the computation window must be subtracted, and new ones added. This requires storing these old events or their aggregates for shorter periods (like every 5 minutes or half-hour). Exponential counters, on the other hand, are easy to update:
Instead of using a finite aggregation window, HalfLife is used — the time it takes for the counter to reduce by half if nothing new happens. o additional storage is needed, except for the last update time. This is called an inductive function, where only the added element and the old function value are needed. Moreover, exponential counters’ expressiveness (i.e., usefulness for the model) is typically no less than regular counters, and sometimes even greater. Now, let’s discuss another improvement to counters.
How to Improve CTR Features
Consider the most standard example — a document’s CTR, calculated as the number of clicks divided by the number of impressions. As we discussed, instead of using regular counts of clicks and impressions, it’s better to use exponentially decaying counters. Additionally, both the numerator and denominator should be smoothed. But let’s focus on how to reduce the influence of various biases.
The CTR is strongly influenced by various biases, with position bias being a prime example. Imagine a document that received 10 impressions at the most prominent spot and garnered 3 clicks. Compare that with another document, displayed 10 times at the bottom or side of the page, perhaps in a smaller size, which gets 2 clicks. Which of the two documents is inherently better? The bias in placement can easily distort a straightforward comparison.
Recognizing that not all impressions are created equal, and some might be “good” while others are “bad,” the field of information retrieval has developed the measure of CoEC, or clicks over expected clicks. Instead of the basic clicks/impressions ratio, CoEC uses clicks/(expected clicks), where expected clicks is the sum of the document-independent click probabilities for all impressions. A document with more clicks than expected is considered good; fewer clicks indicate a bad document. An average document would have a CoEC of 1.
The estimation of this prior probability has various options. If you disregard all variables and simply take the global average CTR, the denominator will be proportional to the number of impressions, making the result equivalent to the standard CTR for documents. But by accounting for the position, the outcome can be markedly improved. This requires calculating the click curve — the a priori click probability at each position.(Details on how to do this without bias will be left for future posts.)
This is the widely used CoEC variant that effectively combats position bias. The position can be more than a mere number; it can mean the location on a page, dimensions, device, etc. The key is to accurately calculate the click curve. But you can go even further, considering other features like user behavior. For instance, some users are active clickers, while others are passive non-clickers, which significantly influences expected clicks. You can take this idea to the extreme, utilizing all document-independent features. However, the practicality of this approach is debatable, as it requires a separate model to estimate click probability.
The technical complexity here is the need to carry the click probability information to the processing system responsible for counter calculations. This data must be sourced from the frontend logs for position-related features and from the backend logs (or transferred from the backend to the front) for other features.
The document’s CTR was just one example. The same principle applies to all other categorical features and different types of actions, including non-binary ones. Considering that these features collectively tend to be among the most informative, this improvement can be quite significant.