GPU-Powered Retrieval: A Recent Trend
Not too long ago, I learned about a new trend in our industry. It’s happening in an area where everything seemed to be working well already, and further improvement seemed difficult.
As we’ve discussed before, two-tower networks paired with ANN indices such as HNSW have been the standard approach for generating candidates and searching them quickly.
Well, first Meta, then LinkedIn, and reportedly TikTok as well, have shown that in today’s world, there’s a better way to do this.
Two-tower networks are still used in the first stage. However, there’s no need to store the embeddings in an ANN index anymore. Instead… you just use GPUs!
With low-dimensional embeddings, even in quantized form, a single A100 GPU can hold around 100 million documents (which ought to be enough for anybody… well, almost) and compute the dot product with all of them in a few dozen milliseconds. For good throughput, query embeddings should be batched so that all the scoring reduces to a single matrix multiplication.
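The batched scoring described above can be sketched on CPU with NumPy; on a GPU the same logic is literally one matmul over the whole corpus. All names and sizes here are illustrative, and the int8 quantization scheme (per-document scale) is just one common choice, not necessarily what Meta or LinkedIn use:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B, k = 10_000, 64, 32, 100   # corpus size, embedding dim, query batch, top-k

# Quantized document embeddings: int8 values plus a per-document scale factor.
doc_emb = rng.standard_normal((N, d)).astype(np.float32)
scale = np.abs(doc_emb).max(axis=1, keepdims=True) / 127.0
doc_q = np.round(doc_emb / scale).astype(np.int8)

# A batch of query embeddings from the query tower.
queries = rng.standard_normal((B, d)).astype(np.float32)

# One matrix multiplication scores every query against every document.
scores = queries @ (doc_q.astype(np.float32) * scale).T   # shape (B, N)

# Exact top-k per query: argpartition, then sort only the small k-slice.
part = np.argpartition(-scores, k, axis=1)[:, :k]
order = np.argsort(-np.take_along_axis(scores, part, axis=1), axis=1)
topk = np.take_along_axis(part, order, axis=1)            # shape (B, k)
```

Because every document is scored, the top-k is exact by construction; there is no recall loss to tune away, only the matmul cost to pay.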
What are the advantages of this?
1. Search recall is higher. As much as we like ANN indices, their recall in practice is above 95% but still short of 100%. Here, we compute the dot product with every object in the database, so recall is exactly 100%.
2. Normally, we select one to a few thousand candidates from an ANN index, but here we can return up to 100,000; ANN struggles at that scale. But then, what do we do with those 100,000? Meta suggests ranking them in the next stage with a heavier model, mixture-of-logits (MoL): still two-tower, but with a more complex network in place of the simple dot product at the end, also run on GPUs. The result of this step is then passed to the heavy-ranking model, just as before. (By the way, a recent paper by Meta and Microsoft Research claims that MoL is a good universal approximator and, moreover, can be used efficiently with ANNs in practice.)
3. This approach also allows much faster and more frequent embedding updates for documents. You just need to update them in GPU memory. In an ANN index, this is more complicated, so frequent updates are less common.
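To make the MoL idea in point 2 concrete, here is a hedged sketch: instead of one dot product per query-document pair, each pair gets several component dot products that are combined by a softmax gate. The gating here (a single linear map on the component logits) is a stand-in for illustration, not Meta's actual architecture, and all shapes and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
P, d, N = 4, 32, 1000   # mixture components, embedding dim, candidates to re-rank

q_comp = rng.standard_normal((P, d)).astype(np.float32)     # query-side components
d_comp = rng.standard_normal((N, P, d)).astype(np.float32)  # doc-side components
gate_w = rng.standard_normal((P, P)).astype(np.float32)     # toy gating weights

# Component logits: one dot product per mixture component per candidate.
logits = np.einsum('npd,pd->np', d_comp, q_comp)            # shape (N, P)

# Numerically stable softmax gate over the components.
g = logits @ gate_w
g -= g.max(axis=1, keepdims=True)
gates = np.exp(g)
gates /= gates.sum(axis=1, keepdims=True)

# Final MoL score: gate-weighted sum of the component logits.
scores = (gates * logits).sum(axis=1)                       # shape (N,)
```

Since both towers still produce fixed embeddings, the extra cost over a plain dot product is a few small matmuls per candidate, which is why this stage also fits comfortably on GPUs.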
Looks promising.