Beyond Counters: Linear Models in Recommendations
Linear models are nearly the simplest tools in machine learning, but they shouldn’t be underestimated. They are remarkably adaptable, and their predictive power can be enhanced by utilizing good features. This post explores one of their applications in recommendations and draws an analogy with counters.
Let’s start with the basics: a linear model with two categorical features, user_id and item_id. The target is the usual one for our task, for example whether the displayed document was clicked. The loss function is either MSE (linear regression) or binary log-loss (logistic regression), with the latter usually performing better. And, of course, the model needs regularization.
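Written out, the logistic variant of this setup might look as follows. This is only a sketch: the symbols b, w_u, w_i and λ are notation introduced here, and the choice of an L2 penalty is an assumption, since the post does not say which regularizer is used.

```latex
\hat{p}(u, i) = \sigma\left(b + w_u + w_i\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

\mathcal{L} = -\sum_{(u, i, y)} \Bigl[ y \log \hat{p}(u, i) + (1 - y) \log\bigl(1 - \hat{p}(u, i)\bigr) \Bigr]
              + \frac{\lambda}{2} \Bigl( \sum_u w_u^2 + \sum_i w_i^2 \Bigr)
```

Here y is 1 if the impression of document i to user u was clicked and 0 otherwise, b is the global intercept, and w_u and w_i are the per-user and per-document weights.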
What will this model learn? First, a global bias term (the intercept) corresponding to the average CTR; for logistic regression it lives on the log-odds scale. Second, a weight for each user and each document. If we add other features to the model (such as document categories, user segments, days of the week, etc.), we will obtain corresponding weights for them as well. An analogy with counters becomes evident here: instead of an explicitly calculated CTR, we get weights in the linear model. The properties are similar: the better the object, the higher its CTR, and consequently, the higher its weight in the linear model.
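To make the analogy concrete, here is a minimal sketch of such a model in plain Python, trained with SGD on a tiny synthetic log. Everything in it is an assumption made for illustration: the function names, the hyperparameters, the L2 penalty, and the data, in which the item "good" has a CTR of 0.5 and the item "bad" a CTR of 0.1 for both users.

```python
import math
import random
from collections import defaultdict


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logreg(rows, epochs=200, lr=0.05, l2=0.01, seed=0):
    """SGD for logistic regression over binary (one-hot) features.

    rows: list of (feature_names, click), e.g. (["user_id=u1", "item_id=good"], 1).
    Returns the intercept and a dict with one weight per feature name.
    """
    rng = random.Random(seed)
    rows = list(rows)
    bias = 0.0
    weights = defaultdict(float)
    for _ in range(epochs):
        rng.shuffle(rows)
        for feats, y in rows:
            p = sigmoid(bias + sum(weights[f] for f in feats))
            g = p - y  # gradient of the log-loss with respect to the logit
            bias -= lr * g  # the intercept is left unregularized
            for f in feats:
                weights[f] -= lr * (g + l2 * weights[f])  # L2 shrinks weights toward 0
    return bias, dict(weights)


def make_log():
    """Synthetic impressions: 20 per (user, item) pair, with fixed click counts."""
    rows = []
    for user in ("u1", "u2"):
        for item, shown, clicked in (("good", 20, 10), ("bad", 20, 2)):
            for k in range(shown):
                feats = [f"user_id={user}", f"item_id={item}"]
                rows.append((feats, 1 if k < clicked else 0))
    return rows


bias, weights = train_logreg(make_log())
print("intercept:", round(bias, 2))
for name in sorted(weights):
    print(name, round(weights[name], 2))
```

In this toy log the item with the higher CTR ends up with the higher weight, which is the counter analogy in miniature.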
What advantages do linear models have compared to counters? First and foremost, quality: a linear model usually predicts better than a combination of counters. Second, counters need smoothing when calculating CTR, and the smoothing strength has to be tuned; in a linear model, regularization plays that role to some extent, although the regularization strength still has to be chosen. Third, linear models offer some debiasing almost out of the box. For example, as we discussed in the previous post, a document may achieve a high CTR simply because it was shown to users who click more frequently. A linear model can better account for this: if clicks are explained by users, there’s no need to assign a high weight to the document. And if we add positional features, we can achieve position debiasing as well.
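The debiasing effect can be illustrated with a small sketch. The log below is hypothetical and constructed so that both documents are equally good: the "heavy" user clicks 80% of the time and the "light" user 10% of the time, whatever is shown. The only difference is that doc_a is shown mostly to the heavy user. The fitting routine (full-batch gradient descent with a small L2 penalty) and all names are assumptions made for this example.

```python
import math

# Hypothetical aggregated log: (user, doc, impressions, clicks).
CELLS = [
    ("heavy", "doc_a", 90, 72),
    ("heavy", "doc_b", 10, 8),
    ("light", "doc_a", 10, 1),
    ("light", "doc_b", 90, 9),
]


def raw_ctr(doc):
    """Plain counter: clicks divided by impressions for one document."""
    shown = sum(n for _, d, n, _ in CELLS if d == doc)
    clicked = sum(c for _, d, _, c in CELLS if d == doc)
    return clicked / shown


def fit(cells, steps=3000, lr=0.5, l2=1e-3):
    """Full-batch gradient descent for logistic regression on user and doc ids."""
    total = sum(n for _, _, n, _ in cells)
    bias = 0.0
    w = {}
    for user, doc, _, _ in cells:
        w["user=" + user] = 0.0
        w["doc=" + doc] = 0.0
    for _ in range(steps):
        grad_bias = 0.0
        grad = {name: l2 * value for name, value in w.items()}  # L2 penalty term
        for user, doc, n, clicks in cells:
            z = bias + w["user=" + user] + w["doc=" + doc]
            p = 1.0 / (1.0 + math.exp(-z))
            residual = (n * p - clicks) / total  # averaged log-loss gradient
            grad_bias += residual
            grad["user=" + user] += residual
            grad["doc=" + doc] += residual
        bias -= lr * grad_bias
        for name in w:
            w[name] -= lr * grad[name]
    return bias, w


bias, w = fit(CELLS)
doc_gap = w["doc=doc_a"] - w["doc=doc_b"]
user_gap = w["user=heavy"] - w["user=light"]
print("raw CTR:", raw_ctr("doc_a"), raw_ctr("doc_b"))
print("doc weight gap:", round(doc_gap, 3), "user weight gap:", round(user_gap, 3))
```

The counters make doc_a look several times better than doc_b, while the fitted model attributes most of the difference to the users and leaves the two document weights close to each other.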
But these advantages are not free. Processing counters is much easier to maintain than training (and retraining in real-time) any models, even ones as simple as linear models. Counters for different features are calculated independently, making it much easier to find and fix issues. Moreover, with counters, there’s no need to come up with a target; they can simply be calculated for all sorts of events. For training linear models, a target is necessary, and this isn’t always so trivial. For example, if your system is not yet in production, the concept of a positive impression isn’t defined, but counters can already be calculated for various events.
It’s worth noting that a model trained exactly as described above won’t be personalized. Its predictions are personal, since it uses the user_id feature, but the ranking of documents is not: for any two users, the scores of all documents (before the sigmoid, in the logistic case) differ by the same constant, namely the difference between the two user weights, so the order of documents is identical for everyone. Nevertheless, even such a model is useful as a source of features for a higher-level model. By the way, it’s advisable to pass along not just the model’s prediction but also the weights of the given user, the given document, etc. (in the general case, the model’s components related to different feature subspaces).
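A short sketch shows why the ranking is the same for everyone. The parameter values and the user and document names below are made up for illustration; components() is a hypothetical helper showing one way to expose the separate parts of the score as features for a higher-level model.

```python
# Hypothetical learned parameters of a user_id + item_id model.
BIAS = -2.0
USER_W = {"alice": 1.0, "bob": -0.5}
ITEM_W = {"doc1": 0.75, "doc2": -0.25, "doc3": 0.25}


def score(user, item):
    """Linear score (the logit, if the model is a logistic regression)."""
    return BIAS + USER_W[user] + ITEM_W[item]


def ranking(user):
    """Documents sorted by descending score for the given user."""
    return sorted(ITEM_W, key=lambda item: score(user, item), reverse=True)


def components(user, item):
    """Score plus its separate parts, usable as features of a higher-level model."""
    return {
        "score": score(user, item),
        "user_weight": USER_W[user],
        "item_weight": ITEM_W[item],
    }


# The per-document gap between two users is always USER_W[a] - USER_W[b].
gaps = {item: score("alice", item) - score("bob", item) for item in ITEM_W}
print(ranking("alice"), ranking("bob"))
print(gaps)
```

Because the user weight is added to every document’s score, it shifts all scores together and cannot change their order.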
To make the model truly personalized, cross-features need to be added: in addition to features like user123 and categoryA, a feature like user123_categoryA is added (if the original features had non-unit values, the new feature’s value is their product). This opens up space for endless improvement (and complication) of the model. Unfortunately, this approach doesn’t scale well: the number of possible cross-features grows as the product of the cardinalities of the crossed features.
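Constructing cross-features is mechanical, as the following sketch suggests. The function names, the item42 feature, and all numeric values are hypothetical; the features are represented as a mapping from name to value, with 1.0 standing for a plain categorical feature.

```python
def cross_features(user_feats, item_feats):
    """Cross every user-side feature with every item-side feature.

    Both inputs map feature name -> value. The value of a cross-feature
    is the product of the two original values.
    """
    return {
        f"{u_name}_{i_name}": u_val * i_val
        for u_name, u_val in user_feats.items()
        for i_name, i_val in item_feats.items()
    }


def linear_score(bias, weights, feats):
    """Linear score over sparse features; unseen features contribute nothing."""
    return bias + sum(weights.get(name, 0.0) * value for name, value in feats.items())


user_feats = {"user123": 1.0}
item_feats = {"categoryA": 0.5, "item42": 1.0}
crossed = cross_features(user_feats, item_feats)
all_feats = {**user_feats, **item_feats, **crossed}

# Made-up weights: the cross term lets user123 get a category-specific boost.
weights = {"user123": 0.25, "categoryA": -0.5, "user123_categoryA": 2.0}
print(crossed)
print(linear_score(-1.0, weights, all_feats))
```

The cross term depends on the user and the document at the same time, so it can reorder documents differently for different users. The cost is the size of the feature space: with, say, a million users and a thousand categories, there are up to a billion possible user-category features.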
Whether to use linear models in a modern recommendation system or to move straight to more powerful tools (like factorization machines or neural nets) is an open question. But at the very least, to use more complex models, you will still have to solve all the challenges that come with linear ones.