Rating Systems: The Math Behind Stars

Michael Roizner
4 min read · Nov 16, 2023


Many services feature ratings of objects: products, movies, applications, organizations on a map. How should these be accurately calculated? The question is not as simple as it seems at first glance.

xkcd: TornadoGuard

I won’t even touch upon important nuances such as dishonest manipulation of ratings, or the fact that reducing the “quality” of an object to a single number is a significant oversimplification. (Fortunately, many services now show different aspects of quality, recall Booking.) And I’m not a fan of numerical rating scales at all. But this post is about the simpler and purely mathematical problems of average ratings.

The most standard problem is that, with simple averaging of all ratings, the top of the rating list will be occupied by objects with only a handful of ratings, all of them maximal. If your task is to compile a rating of the best objects, the easiest fix is to consider only objects that have received at least a certain number of ratings (100, 1000). Simple and effective, although in practice objects with only slightly more ratings than the threshold can still top the list.
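A minimal sketch of this threshold filter in Python; the objects, the MIN_RATINGS value, and the field names are all illustrative:

```python
# Threshold-based ranking: only objects with at least MIN_RATINGS
# ratings are eligible for the top list.

MIN_RATINGS = 100  # the threshold from the text (100 or 1000)

objects = [
    {"name": "A", "ratings": [5, 5]},             # two perfect ratings
    {"name": "B", "ratings": [5, 4, 5, 4] * 50},  # 200 ratings, mean 4.5
]

def simple_average(ratings):
    return sum(ratings) / len(ratings)

top = sorted(
    (o for o in objects if len(o["ratings"]) >= MIN_RATINGS),
    key=lambda o: simple_average(o["ratings"]),
    reverse=True,
)
# Object A (a perfect 5.0 from just two ratings) is excluded; B tops the list.
```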

A screenshot of Google Maps on the author’s phone

If you want to calculate the rating for all objects accurately, the most popular method is rating smoothing. Instead of the sum of ratings divided by their number,

$$\text{rating} = \frac{\sum_i r_i}{n},$$

the smoothed version is

$$\text{rating} = \frac{\sum_i r_i + \text{smoother} \cdot \text{prior}}{n + \text{smoother}}.$$

Here smoother is the smoothing parameter: the number of a priori "virtual" ratings. The prior is the global average rating. If an object has no ratings yet, it starts with exactly this a priori rating; as it accumulates ratings, the smoothed rating approaches the simple average. Sometimes smoothing is added only to the denominator, with the numerator forgotten. This isn't quite correct, although it also works (it is just more biased towards popular objects).
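A short sketch of this smoothing in Python; the SMOOTHER and GLOBAL_PRIOR values are illustrative choices, not recommendations:

```python
# Rating smoothing (Bayesian average): the prior acts as `smoother`
# virtual ratings at the global average.

SMOOTHER = 25        # number of a priori "virtual" ratings (tuning parameter)
GLOBAL_PRIOR = 4.1   # global average rating across all objects (assumed)

def smoothed_rating(ratings, smoother=SMOOTHER, prior=GLOBAL_PRIOR):
    """Smoothed average of a list of numeric ratings."""
    return (sum(ratings) + smoother * prior) / (len(ratings) + smoother)

print(smoothed_rating([]))           # no ratings -> exactly the prior, 4.1
print(smoothed_rating([5, 5]))       # two 5s barely move it: ~4.17
print(smoothed_rating([5] * 1000))   # many ratings -> close to 5: ~4.98
```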

This method has theoretical justification. Those familiar with Bayesian statistics will surely know it. If you assume, for a binary rating scale, that the a priori distribution of the average rating has a beta distribution (for example, simply uniform from 0 to 1, which is a special case), then the posterior distribution will also have a beta distribution. The expected value of this distribution will be exactly the smoothed average of all ratings.
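As a quick sketch of that justification: for a binary (like/dislike) scale, with a Beta(α, β) prior on the unknown probability p of a positive rating and k positive ratings observed out of n,

$$p \sim \mathrm{Beta}(\alpha, \beta) \;\Rightarrow\; p \mid k, n \sim \mathrm{Beta}(\alpha + k,\ \beta + n - k),$$

$$\mathbb{E}[\,p \mid k, n\,] = \frac{k + \alpha}{n + \alpha + \beta},$$

which is exactly the smoothed average with smoother = α + β and prior = α / (α + β); the uniform prior is the special case Beta(1, 1).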

Another serious problem with average ratings is selection bias (or survivorship bias). Ratings are not given by random users but by a self-selected subset. The average rating is therefore the expected rating from a random user conditional on them leaving a rating at all. First, the very act of leaving a rating may be correlated with whatever affected the user, positively or negatively: many users won't bother to rate a predictably average experience. So we would really be more interested in a weaker condition: the expected rating given only that the user purchased the product, watched the movie, visited the establishment, and so on.

But second, even this interpretation has a big problem. The top of a global rating will be full of niche categories: documentaries and art-house films, for example. If a user watched such a film at all, the chance of a high rating is much higher than average, but few people watch them. So what we actually want is the expected rating from a random user with no condition at all: how would a random user rate this object?

How can this be calculated? There is a technique for this, although it is not so popular in practice: Inverse Propensity Scoring, or Inverse Probability Weighting. The essence of it is to estimate the probability of each user leaving a rating and then average the ratings with weights inversely proportional to this probability. If a user had a very low chance of leaving a rating but did so anyway, then this rating is counted with greater weight. Unfortunately, this requires maintaining a model to estimate this probability, and besides, the variance of the calculated rating is quite high.
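A hedged sketch of IPS for a single object; the propensity estimates are stubbed out here (in practice they come from a model predicting the probability that a given user leaves a rating), and the clipping constant is a common variance-reduction trick, not part of the basic method:

```python
# Inverse Propensity Scoring: weight each observed rating by the inverse
# of the estimated probability that this user would leave a rating at all.

observed = [
    # (rating, estimated propensity to leave a rating)
    (5, 0.50),  # an enthusiastic frequent reviewer
    (2, 0.05),  # a rare reviewer; their voice gets up-weighted
    (4, 0.20),
]

def ips_rating(observed, clip=0.01):
    """IPS-weighted average; clipping caps the weight of near-zero propensities."""
    weights = [1.0 / max(p, clip) for _, p in observed]
    return sum(w * r for w, (r, _) in zip(weights, observed)) / sum(weights)

print(ips_rating(observed))  # ~2.59: the reluctant reviewer pulls it down
```

The example also shows the downside mentioned above: a single low-propensity rating dominates the estimate, which is exactly why the variance of the calculated rating is high.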

And here’s a separate question to which I still don’t know the correct answer: what should be the physical meaning of a personal rating (“This object suits you XX%”)? What do you think?
