When Logarithmic Scale in Prediction Models Causes Bias

Michael Roizner
2 min read · Aug 6, 2023


[Figure: log-normal distribution (source: Wikipedia)]

The minimization of the sum of squares on a logarithmic scale inevitably causes some bias.

I’d like to delve deeper into this statement I shared in my previous post.

I’ll begin with an example from my personal experience. Once, I was involved in a project aimed at forecasting a certain value (say, money) that our users would generate over a specific future period. The distribution of this value across users was notably skewed: a significant portion of users generated zero, while for the rest the distribution resembled a log-normal. Interestingly, it proved practical to predict separately the probability that the value exceeds zero and the expected value given that it is positive, and then multiply the two.
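For concreteness, here is a minimal sketch of that two-stage setup. The synthetic data and model choices are my own illustration, not the actual production system: a classifier estimates P(y > 0 | x), a regressor is fit on the positive examples only, and the final prediction is the product of the two.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)

# Synthetic data: most users generate zero; positive values are log-normal.
n = 10_000
X = rng.normal(size=(n, 5))
positive = rng.random(n) < 0.3  # ~30% of users generate anything at all
y = np.where(positive, rng.lognormal(mean=1.0, sigma=1.0, size=n), 0.0)

# Stage 1: probability that the value exceeds zero.
clf = GradientBoostingClassifier().fit(X, y > 0)

# Stage 2: expected value given that it is positive (fit on positives only).
reg = GradientBoostingRegressor().fit(X[y > 0], y[y > 0])

# Final prediction: P(y > 0 | x) * E(y | x, y > 0).
pred = clf.predict_proba(X)[:, 1] * reg.predict(X)
```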

When training the model conventionally (with MSE loss), we imposed the same penalty for predicting 1 instead of the correct 2 as for predicting 99 instead of the correct 100. This equal penalization felt counter-intuitive. There was a strong temptation to log-transform all the targets before training and then exponentiate the prediction at application time. That way, the penalty for predicting twice or half the actual target would be constant, irrespective of the target’s value.
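In code, the temptation looks roughly like this (again a sketch with made-up data; the essence is just the transform wrapped around fit and predict):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Positive values only (zeros handled separately, as above):
# log y is linear in x plus Gaussian noise, so y is log-normal given x.
n = 10_000
X = rng.normal(size=(n, 3))
y = np.exp(1.0 + X[:, 0] + rng.normal(scale=0.8, size=n))

# The "trick": fit on log-targets, exponentiate the predictions at application time.
model = GradientBoostingRegressor().fit(X, np.log(y))
pred = np.exp(model.predict(X))

# Compare the average prediction with the average target:
print(f"mean target: {y.mean():.2f}, mean prediction: {pred.mean():.2f}")
```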

However, this approach is flawed. If we were to log-transform the targets and then sum up the exponentiated predictions across all users, the result would be significantly lower than the actual total. We wisely decided against this approach, having anticipated the effect in advance.

So, why does this phenomenon occur? The explanation is fairly straightforward. During standard regression training (i.e., with MSE loss), the model strives to predict the conditional expectation E(y|x) at each point in the feature space. If you apply the log-transform “trick,” you end up with exp(E(log y|x)) instead. Since the exponential is a convex function, Jensen’s inequality gives exp(E(log y|x)) ≤ E(exp(log y)|x) = E(y|x), with strict inequality whenever y is not deterministic given x, leading to systematic underprediction.
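Here is a quick numeric check (a toy example of my own): for a log-normal y with log y ~ N(μ, σ²), the true mean is E(y) = exp(μ + σ²/2), while the log-then-exponentiate approach recovers only exp(E(log y)) = exp(μ), undershooting by exactly a factor of exp(σ²/2).

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
y = rng.lognormal(mean=mu, sigma=sigma, size=1_000_000)

print(f"E(y):          {y.mean():.3f}")                  # ~ exp(mu + sigma^2/2) = 1.649
print(f"exp(E(log y)): {np.exp(np.log(y).mean()):.3f}")  # ~ exp(mu) = 1.000
print(f"ratio exp(sigma^2/2): {np.exp(sigma**2 / 2):.3f}")
```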

Let’s illustrate this with an example. Suppose that at the same point in the feature space (i.e., for very similar features), you have two examples: one with a target value of 1 and another with a value of 4. If you log-transform first, you get targets of 0 and 2 (assuming a binary logarithm; with the natural logarithm everything just scales by ln 2). The model then predicts the average at this point, i.e., 1 (provided it is flexible enough). But when you revert to the original scale, the predicted value becomes 2, which is the geometric mean of 1 and 4. Clearly, an unbiased prediction should have been the arithmetic mean, (1 + 4) / 2 = 2.5.

I’ve posed a similar question in several interviews over time. Surprisingly, not everyone could give the correct answer.
