The Ultimate Question of Diversity, Exploration, and Value in Recommender Systems

Reformulating the challenges of ‘beyond accuracy’ aspects through the lens of value optimization

Michael Roizner
5 min read · Feb 15, 2024

Although not everyone agrees with this approach, I believe that the main goal of any recommendation system should be the optimization of total value. This value can be measured by various metrics; the main four types I’ve encountered are DAU (Daily Active Users), time spent, transactions (GMV, Gross Merchandise Value), and subscriptions. Moreover, sometimes this value concerns not only the consuming users but also other parties, such as content providers.


As I wrote in a previous post, the core part of any recommendation system is the engagement model E(engagement | item, user, context), which predicts this very value (or some meaningful simplification of it) for a single recommended object. Recommendations can be constructed simply by sorting based on the predictions of this model, ignoring everything else. Let’s call this baseline approach cynical ranking.
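
As a minimal sketch (assuming a hypothetical `predict_engagement(item, user, context)` function standing in for E(engagement | item, user, context)), cynical ranking is just a sort by the model’s score:

```python
# Minimal sketch of cynical ranking: score each candidate independently with
# the engagement model and sort by the prediction, ignoring everything else.
# predict_engagement is a hypothetical stand-in for E(engagement | item, user, context).

def cynical_ranking(candidates, user, context, predict_engagement, k=10):
    ranked = sorted(candidates,
                    key=lambda item: predict_engagement(item, user, context),
                    reverse=True)
    return ranked[:k]
```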

Cynical ranking is not optimal for the stated goal of optimizing total value. You might often hear about various “beyond accuracy” aspects of recommendations such as exploration, diversity, novelty, and serendipity. Let’s translate these concepts from the language of feelings and “product vision” into the language of optimizing total value.

Exploration

Let’s start with exploration. Everyone has heard of the exploration vs. exploitation dilemma: it is often beneficial to sacrifice some value (reward) at the current moment in order to learn something new and act more optimally in the future.
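
As a toy illustration only (production systems typically use more principled schemes such as Thompson sampling or UCB), here is an epsilon-greedy sketch in which `predict_engagement` and `is_under_explored` are hypothetical helpers: most of the time we exploit the model’s top prediction, and with a small probability we show an under-explored candidate instead.

```python
import random

def epsilon_greedy_pick(candidates, user, context,
                        predict_engagement, is_under_explored, epsilon=0.05):
    """Exploit the model most of the time; explore with probability epsilon."""
    under_explored = [c for c in candidates if is_under_explored(c)]
    if under_explored and random.random() < epsilon:
        # Explore: give up some expected value now to learn about this item.
        return random.choice(under_explored)
    # Exploit: pick the item with the highest predicted engagement.
    return max(candidates, key=lambda c: predict_engagement(c, user, context))
```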

It’s quite important to distinguish between user exploration and system exploration because they require different approaches. In the case of user exploration, we sacrifice the user’s value at the moment for their own benefit, to learn more about them. This can be assessed through standard A/B testing. Sometimes, however, it’s necessary to extend the duration of these tests to capture more long-term effects.

In the case of system exploration, we aim to learn more about the entire system. An important specific case is item exploration — learning more about under-explored items (especially those that have recently appeared in the system). It’s also possible to explore different areas in the feature space of any of the models used — model exploration.

Unlike user exploration, measuring the effect of system exploration is significantly harder. We sacrifice one user’s experience for the benefit of others, and those others may end up in a different group of the A/B test.

Nevertheless, some simple and useful item-exploration metrics can still be used: the proportion of items that receive at least X impressions or clicks in their first Y hours (as a dashboard metric), and the share of clicks and impressions that go to such under-explored items, i.e. new items with few impressions (as metrics in A/B tests). These metrics allow comparing levels of exploration, but they don’t answer the question of what level would be optimal.
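
A rough sketch of how these two metrics could be computed from impression logs; the field names (`item_id`, `created_at`, `timestamp`) and the thresholds are placeholders, not a reference implementation:

```python
from datetime import timedelta

def share_of_explored_new_items(items, impressions,
                                min_impressions=20, window_hours=24):
    """Dashboard metric: proportion of monitored items that got at least
    min_impressions impressions within window_hours of appearing in the system."""
    created_at = {it["item_id"]: it["created_at"] for it in items}
    window = timedelta(hours=window_hours)
    counts = {}
    for imp in impressions:
        item_id = imp["item_id"]
        if item_id in created_at and imp["timestamp"] - created_at[item_id] <= window:
            counts[item_id] = counts.get(item_id, 0) + 1
    if not created_at:
        return 0.0
    explored = sum(1 for i in created_at if counts.get(i, 0) >= min_impressions)
    return explored / len(created_at)

def share_of_impressions_on_under_explored(impressions, is_under_explored):
    """A/B metric: share of impressions that land on under-explored items."""
    if not impressions:
        return 0.0
    hits = sum(1 for imp in impressions if is_under_explored(imp["item_id"]))
    return hits / len(impressions)
```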

YouTube has attempted a more principled approach, but it’s fair to say that this area is not yet resolved.

Diversity

In some products, it’s immediately clear that without diversity, recommendations would be terrible (would you enjoy music recommendations featuring only one artist?), while in others it’s not so obvious. When people talk about diversity, they usually mean several different aspects.

There is intra-list diversity — diversity within a single page (request-level). Cynical ranking lacks diversity because each item is evaluated independently, and they all share the same context. In reality, users see recommended items in the context of each other. For example, if two items are very similar to each other, it’s likely that the user will click on only one of them.

Therefore, it’s worthwhile to dynamically enrich this context with items from the beginning of the list during listwise ranking — as I described some time ago. However, this is still not ideal, as it optimizes the value of a single item at each step, not the entire request.
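
Here is a minimal sketch of such greedy, step-by-step list construction with a dynamically enriched context, assuming the engagement model can take the items already placed on the page as an extra argument (`shown`); this is only illustrative, not the exact scheme from that post:

```python
def greedy_listwise_ranking(candidates, user, context, predict_engagement, k=10):
    """Build the list one position at a time; each prediction is conditioned on
    the items already placed above it (the enriched context)."""
    remaining = list(candidates)
    picked = []
    while remaining and len(picked) < k:
        best = max(remaining,
                   key=lambda c: predict_engagement(c, user, context, shown=picked))
        picked.append(best)
        remaining.remove(best)
    return picked
```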

And then, there is diversity within the user’s history — user-level diversity. In this case, it overlaps significantly with user exploration. However, not completely — one can imagine situations with sufficient diversity but insufficient user exploration, and vice versa, as well as other combinations with novelty and serendipity. The essence of these effects is the same — the short-term nature of optimization. If we learn to optimize the long-term value for the user, we will immediately solve all these issues of user-level diversity, user exploration, novelty, serendipity, etc.

By the way, in services where one item is consumed at a time (like TikTok or radio), the boundary between these two types of diversity is blurred. Technically, a request still usually yields a list of items (say, five) that are shown to the user sequentially. By changing this parameter — the size of the list — we toggle between request-level (intra-list) and user-level diversity.

There is also system-level diversity. As one might guess, it aims at ecosystem optimization — that is, the total value of all users. We touched on this when discussing popularity bias.

Reflections on the Bubble

Do these aspects cover all types of diversity? What if diversity isn’t needed to maximize value (even in the long term), or only a little is needed? But what if it intuitively feels like there’s a lack of diversity? This is a fairly common situation — when any increase in diversity only worsens the metrics.

As I’ve mentioned before, I adhere to the approach:

The main goal of any recommendation system should be the optimization of total value.

If intuition contradicts the observations from metrics, then either the metrics need to be adjusted or the intuition itself.

Metrics: It’s possible that we’ve incorrectly defined value. For instance, we optimize for time spent, whereas optimizing for DAU/retention might align more closely with intuition (user happiness?). Or we’re measuring them over an insufficient period of time. Or (which is also very common!) we simply haven’t found a good way to optimize value in accordance with intuition yet.

Intuition: Perhaps there’s no need to increase diversity and pull users out of their bubble? Maybe they don’t actually need it, or even if they do, it won’t necessarily lead to them making more purchases, for example?

A philosophical question. Feel free to share your opinion in the comments 🙂

To summarize, the evolution of recommendation objectives could be as follows:

  1. E(value | item, user, context) — cynical ranking, optimizing the value of each specific item.
  2. E(value | request) — listwise ranking, optimizing the total value of each request.
  3. E(value | user) — optimizing the long-term value for each user.
  4. E(value) — optimizing the total value of the entire system.

At each level, one can choose one’s own definition of value, and it is precisely in this way that each subsequent level can build on the previous one.
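
For instance, under a simplifying additive assumption (an illustration on my part, not a claim that value decomposes exactly this way): E(value) ≈ Σ over users of E(value | user); E(value | user) ≈ Σ over that user’s requests of E(value | request); and each E(value | request) is in turn assembled from E(value | item, user, context), where the context already includes the items placed earlier in the list.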
