SLIM: A Fast and Interpretable Baseline for Recommender Algorithms

Michael Roizner
3 min read · Aug 31, 2023

Continuing from our post about linear models, today we’ll delve into a specific case — Sparse Linear Methods (SLIM). Here’s what sets this method apart:

  1. It’s simple.
  2. The model is interpretable, making it easy to debug.
  3. It’s quite efficient; the model trains quickly (though this depends on the problem size).
  4. The quality of recommendations is decent, so it serves as a good baseline for other methods. Furthermore, a 2019 paper reviewed recommendation algorithms published over the preceding three years and found that, when properly compared, most lose to SLIM. Take this paper with a grain of salt, but it's worth noting.

The core of the SLIM method is as follows. We consider two features (or groups of features, also known as namespaces in Vowpal Wabbit terminology). The first is categorical — the ID of the item being evaluated. The second group consists of items with which the user has had positive interactions (excluding the item in question). We then take the cross-product of these two groups to generate features like: [currently evaluating item i, item j appears in the user’s history]. We train a linear model on these cross-features. The target is 1 for positive interactions and 0 otherwise (even if the user hasn’t been shown the item). We use MSE loss and L1+L2 regularization (as in elastic net), adding the constraint that the model’s weights must be non-negative.
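
To make the feature construction concrete, here is a minimal sketch in plain Python. The function name `slim_examples` and the toy item IDs are hypothetical, chosen only for illustration; a real pipeline would stream these rows into whatever linear learner you use.

```python
def slim_examples(history, all_items):
    """Yield (features, target) training rows for one user.

    history: set of item IDs the user interacted with positively.
    all_items: iterable of every item ID in the catalog.
    """
    for i in all_items:
        # Cross-feature (i, j): "scoring item i, item j is in the history".
        # The evaluated item itself is excluded from its own features.
        features = [(i, j) for j in sorted(history) if j != i]
        # Target is 1 for a positive interaction, 0 otherwise,
        # even if the user was never shown the item.
        target = 1 if i in history else 0
        yield features, target


rows = list(slim_examples({"a", "c"}, ["a", "b", "c"]))
```

For the user above, item `b` gets the features `("b", "a")` and `("b", "c")` with target 0, while item `a` gets only `("a", "c")` with target 1.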

In matrix form, it’s even simpler:
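
Following the standard SLIM formulation (the symbol β for the L2 coefficient and the 1/2 factors are assumptions here, since the text names only λ), the objective can be written as:

$$
\min_{W} \; \frac{1}{2}\lVert A - AW \rVert_F^2 + \frac{\beta}{2}\lVert W \rVert_F^2 + \lambda \lVert W \rVert_1
\quad \text{subject to} \quad W \ge 0,\; \operatorname{diag}(W) = 0
$$

The zero-diagonal constraint is the matrix counterpart of excluding the evaluated item from its own features.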

Here, A is the ratings matrix (usually binarized to indicate whether a positive interaction occurred), and W is the square item-by-item matrix of weights we aim to optimize. ||W||² is the squared Frobenius norm, the sum of the squared elements (L2 regularization), and ||W||₁ is the L1 norm, the sum of the absolute values, which promotes sparsity. The larger λ is, the fewer non-zero weights we'll have, which speeds up optimization. Note that each column of W is an independent optimization problem, so the computation can be parallelized.
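
Because the columns are independent, each one can be fitted on its own. The sketch below does this with a small hand-written cyclic coordinate descent in NumPy. It is an illustration under assumptions, not a reference implementation: the function names, the parameters `l1` and `l2` (standing in for λ and the L2 coefficient), the fixed iteration count, and the toy matrix are all made up for this example.

```python
import numpy as np


def fit_slim_column(A, j, l1=0.1, l2=0.1, n_iter=100):
    """Solve one column of W by cyclic coordinate descent.

    Minimizes 0.5*||a_j - A w||^2 + 0.5*l2*||w||^2 + l1*||w||_1
    subject to w >= 0 and w[j] = 0.
    """
    n_items = A.shape[1]
    w = np.zeros(n_items)
    residual = A[:, j].astype(float).copy()  # a_j - A @ w, with w = 0
    col_sq_norms = (A ** 2).sum(axis=0)
    for _ in range(n_iter):
        for k in range(n_items):
            if k == j or col_sq_norms[k] == 0:
                continue  # w[j] stays 0; empty columns carry no signal
            # Correlation of column k with the residual that ignores w[k].
            rho = A[:, k] @ residual + col_sq_norms[k] * w[k]
            # Soft-threshold by l1, clip at zero for non-negativity.
            new_wk = max(0.0, rho - l1) / (col_sq_norms[k] + l2)
            residual -= A[:, k] * (new_wk - w[k])
            w[k] = new_wk
    return w


def fit_slim(A, **kwargs):
    """Stack independently fitted columns into the item-item matrix W."""
    A = np.asarray(A, dtype=float)
    return np.column_stack([fit_slim_column(A, j, **kwargs)
                            for j in range(A.shape[1])])


# Toy data (hypothetical): 4 users x 3 items; items 0 and 1 always co-occur.
A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 1, 1]])
W = fit_slim(A)
```

On this toy matrix the weight from item 0 to item 1 comes out large, while the weight from item 2 to item 1 is driven exactly to zero by the L1 penalty. In practice, an off-the-shelf solver such as scikit-learn's `ElasticNet` with `positive=True` is a common substitute for the hand-written loop, though its `alpha`/`l1_ratio` parameterization scales the penalties differently.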

The resulting model scores each candidate item as the sum of its learned non-negative weights over the items in the user's history. The higher the weight between two items, the more similar they are to each other. This makes the model interpretable: it can explain each recommendation by pointing to the items in the user's history that contributed to it.
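
A minimal sketch of scoring with explanations, assuming the learned weights are stored as a sparse dict of dicts. The function name, the storage layout, and the numbers are hypothetical.

```python
def score_and_explain(history, weights):
    """Score candidates and keep per-history-item contributions.

    weights[j][i] is the learned non-negative weight of history item j
    toward candidate item i (a sparse, dict-of-dicts view of W).
    Returns (item, score, reasons) tuples, best score first.
    """
    seen = set(history)
    contributions = {}
    for j in history:
        for i, w in weights.get(j, {}).items():
            if i not in seen:  # skip items the user already has
                contributions.setdefault(i, []).append((j, w))
    ranked = []
    for i, parts in contributions.items():
        # Strongest contributing history items first.
        parts.sort(key=lambda p: p[1], reverse=True)
        ranked.append((i, sum(w for _, w in parts), parts))
    ranked.sort(key=lambda r: r[1], reverse=True)
    return ranked


# Made-up weights for illustration only.
weights = {
    "phone": {"case": 0.6, "charger": 0.3},
    "laptop": {"charger": 0.4, "mouse": 0.5},
}
ranked = score_and_explain(["phone", "laptop"], weights)
```

Here `charger` ranks first, and its explanation lists `laptop` and then `phone` as the history items responsible for the score.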

The method is customizable. For instance, you can use as features only the positive interactions that preceded the target one, to avoid recommending a phone right after someone buys a case. This makes negative sampling a bit trickier, though. Removing the non-negativity constraint on the weights is also an option, but it may reduce interpretability. Alternatively, you can treat this method as one component within a larger linear model.
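
One possible way to build such past-only training pairs, sketched under the assumption that each positive interaction carries a timestamp. The function name and the event format are hypothetical.

```python
def past_only_examples(events):
    """Turn one user's positives into (history, target) pairs.

    events: list of (timestamp, item_id) positive interactions.
    Each target item only sees the items that come earlier in time order.
    """
    examples = []
    history = []
    for _, item in sorted(events):
        if history:  # the first event has no past to learn from
            examples.append((tuple(history), item))
        history.append(item)
    return examples


events = [(3, "charger"), (1, "phone"), (2, "case")]
examples = past_only_examples(events)
```

With this construction the pair "case in history, phone as target" never appears as a positive example for this user, so the model has no reason to learn a weight from the case to the phone.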

SLIM is highly effective for smaller catalogs, containing perhaps thousands or tens of thousands of items, with the exact range depending on your dataset. It shines in contexts like movie recommendation. But don't count it out for larger tasks; you can always generalize from specific items to their categories or authors.
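
As a rough illustration of that generalization step, the sketch below collapses item-level histories into category-level ones before training, which shrinks W from items × items to categories × categories. The names and the toy mapping are hypothetical.

```python
def collapse_to_categories(user_items, item_category):
    """Map each user's item set to the set of categories it touches.

    Items missing from item_category are dropped.
    """
    return {
        user: {item_category[i] for i in items if i in item_category}
        for user, items in user_items.items()
    }


user_items = {"u1": {"m1", "m2"}, "u2": {"m3"}}
item_category = {"m1": "comedy", "m2": "comedy", "m3": "drama"}
user_categories = collapse_to_categories(user_items, item_category)
```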

The flip side is that this method won’t showcase the ‘magic of AI’; it won’t create a wow-effect.

Magic contradicts interpretability.