Leveraging External Embeddings in Recommender Systems: A Practical Guide

Michael Roizner
Sep 10, 2023

In recommender systems, you frequently encounter situations where you have access to ‘external embeddings.’ These are embeddings trained independently of the task at hand, for example BERT embeddings for text or the output of a pretrained neural network applied to images. Users might also come with external information, perhaps from their activity on another service, represented as an embedding. To keep things simple, we’ll consider only document embeddings here.

There are three proven techniques to make the most out of these external embeddings:

  1. Simple aggregation,
  2. Training dual embeddings,
  3. Using them inside a neural network model.


Simple Aggregation

You take the user’s interaction history and aggregate the corresponding document embeddings. The no-brainer approach is to average them (assuming they’re normalized). Then use the cosine similarity between this aggregate and the embedding of the document you’re ranking as a feature for your ranking model. For a more advanced approach, you can also aggregate the cosines between the ranked document and the historical ones: the mean, percentiles (including the maximum), or the proportion of cosines above a certain threshold. This way, you can capture how often a user has liked or disliked similar documents. And, as with counters, don’t just average: apply exponential decay so that recent interactions weigh more.
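To make this concrete, here is a minimal NumPy sketch of such aggregation features. The function name `aggregation_features`, the half-life parameterization of the decay, and the default threshold are my own assumptions for illustration; the text above doesn’t prescribe them.

```python
import numpy as np

def normalize(x):
    """L2-normalize vectors along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def aggregation_features(history, ages_days, candidate,
                         half_life_days=30.0, threshold=0.5):
    """Features for one (user, candidate document) pair.

    history:   (n, d) external embeddings of documents the user interacted with
    ages_days: (n,)   age of each interaction in days (hypothetical unit)
    candidate: (d,)   external embedding of the document being ranked
    """
    history = normalize(history)
    candidate = normalize(candidate)

    # Exponential decay: an interaction loses half its weight every half_life_days.
    weights = 0.5 ** (np.asarray(ages_days, dtype=float) / half_life_days)

    # Cosine between the decayed average of the history and the candidate.
    # Assumes the weighted sum is not the zero vector.
    profile = weights @ history
    profile_cos = float(profile @ candidate / np.linalg.norm(profile))

    # Aggregates over the individual history-to-candidate cosines.
    cosines = history @ candidate
    above = (cosines > threshold).astype(float)
    return {
        "profile_cos": profile_cos,
        "mean_cos": float(np.average(cosines, weights=weights)),
        "max_cos": float(cosines.max()),
        "p90_cos": float(np.percentile(cosines, 90)),
        "share_above": float(np.average(above, weights=weights)),
    }
```

You would typically compute these separately for each interaction type (likes, dislikes, views and so on) and feed all of them to the ranking model as features.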

Dual Embeddings

The next method is training dual embeddings, often described as a single ‘IALS iteration.’ If you’re new to this: ALS (Alternating Least Squares) is a matrix factorization algorithm that alternates between fixing the user vectors to optimize the document vectors and vice versa. IALS (Implicit ALS) adds a twist by treating all user-document pairs with no observed interaction as low-weight negatives. One IALS iteration optimizes a user’s vector so that its dot product with the fixed external document embeddings predicts the target. Essentially, this is yet another application of linear models: the document embedding is the feature vector, and the user vector holds the coefficients. You can do this independently for each user. The model’s dimensionality equals that of the embeddings. Targets, losses, and weights are chosen the same way as in any linear model scenario.

This method typically yields slightly better results than simple aggregation, but it comes at a higher computational cost. The trade-off is similar to what you’d find between standard linear models (without embeddings) and simple counters.
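As a rough illustration, fitting one user’s vector against fixed document embeddings can be sketched as a weighted ridge regression. This sketch assumes a squared loss and an explicit list of training documents (positives plus whatever negatives you choose to include); a full IALS implementation handles the unobserved pairs more efficiently. The name `fit_user_vector` and the `l2` regularization parameter are hypothetical.

```python
import numpy as np

def fit_user_vector(doc_embeddings, targets, weights=None, l2=1.0):
    """Fit one user's vector u with the document embeddings held fixed.

    Minimizes  sum_i w_i * (t_i - e_i . u)^2 + l2 * ||u||^2
    where e_i are the (fixed) external document embeddings.
    """
    E = np.asarray(doc_embeddings, dtype=float)   # (n, d)
    t = np.asarray(targets, dtype=float)          # (n,)
    w = np.ones(len(t)) if weights is None else np.asarray(weights, dtype=float)

    d = E.shape[1]
    # Normal equations of weighted ridge regression: (E^T W E + l2 * I) u = E^T W t
    A = E.T @ (E * w[:, None]) + l2 * np.eye(d)
    b = E.T @ (w * t)
    return np.linalg.solve(A, b)
```

The dot product of the resulting user vector with a candidate document’s embedding can then serve as a ranking feature, just like the cosines in the first method.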

Neural Models

Lastly, you can employ a neural model that takes the user’s history as input. If you don’t have such a model yet, I strongly recommend setting one up, assuming your system’s foundational issues are sorted. Each item in the user’s history is converted into an embedding, and these are then aggregated (via attention, recurrently, or with simple pooling layers). The items can be represented by document descriptions, simple IDs, or a mix, so you can enrich these representations with the external embeddings. You can either build a new model for this or augment an existing one.
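For intuition only, here is a toy forward pass in NumPy showing one way the external embeddings could be mixed into such a model: each document is represented as a projected ID embedding plus a projected external embedding, and the history is pooled with attention using the candidate as the query. All sizes, parameter names, and the architecture itself are assumptions of this sketch, and the parameters are random rather than trained; a real model would learn them in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, id_dim, ext_dim, hid = 100, 8, 16, 12   # hypothetical sizes

# Hypothetical parameters; in a real model these are learned by backpropagation.
id_table = rng.normal(scale=0.1, size=(n_docs, id_dim))  # ID embeddings
id_proj = rng.normal(scale=0.1, size=(id_dim, hid))      # projects ID embeddings
ext_proj = rng.normal(scale=0.1, size=(ext_dim, hid))    # projects external embeddings

def encode_docs(doc_ids, ext_embeddings):
    """Document representation: projected ID embedding + projected external embedding."""
    return id_table[doc_ids] @ id_proj + ext_embeddings @ ext_proj

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def score(history_ids, history_ext, cand_id, cand_ext):
    """Attention-pool the history with the candidate as query, then take a dot product."""
    hist = encode_docs(history_ids, history_ext)                   # (n, hid)
    cand = encode_docs(np.array([cand_id]), cand_ext[None, :])[0]  # (hid,)
    attn = softmax(hist @ cand / np.sqrt(hid))                     # (n,), sums to 1
    user = attn @ hist                                             # (hid,)
    return float(user @ cand)
```

With uniform attention weights this reduces to averaging the history, which is roughly what simple aggregation does by hand.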

This approach is the most powerful of the three. In a way, the model itself can learn to compute different types of aggregation, as in method 1, or find the optimal vector, as in method 2, or even do better. These models deserve a separate deep dive, which is a topic for future posts.

Which method to use?

In practice, I’d test the utility of external embeddings in this order: 1, 2, 3. Each subsequent method is more complex than the previous (though this depends on how automated and fine-tuned they are in your setup), but potentially offers greater rewards. That said, if the first method doesn’t yield any benefits, it’s unlikely that a more complicated one will.