GPU-Powered Retrieval: A Recent Trend
Not too long ago, I learned about a new trend in our industry. It’s happening in an area where everything seemed to be working well already, and further improvement seemed difficult.
As we’ve discussed before, two-tower networks paired with ANN indices such as HNSW have been the standard approach for generating candidates and searching them quickly.
Well, first Meta, then LinkedIn, and reportedly TikTok as well, have shown that in today’s world, there’s a better way to do this.
Two-tower networks are still used in the first stage. However, there’s no need to store the embeddings in an ANN index anymore. Instead… you just use GPUs!
With low-dimensional embeddings, even in quantized form, a single A100 GPU can hold around 100 million documents (which ought to be enough for anybody… well, almost) and compute the dot product with all of them in a few dozen milliseconds. For good throughput, query embeddings should be batched so that all the scoring reduces to a single matrix multiplication.
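The batched scoring described above can be sketched on CPU with NumPy; on a GPU the same logic is literally one matmul over the whole corpus. All names and sizes here are illustrative, and the int8 quantization scheme (per-document scale) is just one common choice, not necessarily what Meta or LinkedIn use:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, B, k = 10_000, 64, 32, 100   # corpus size, embedding dim, query batch, top-k

# Quantized document embeddings: int8 values plus a per-document scale factor.
doc_emb = rng.standard_normal((N, d)).astype(np.float32)
scale = np.abs(doc_emb).max(axis=1, keepdims=True) / 127.0
doc_q = np.round(doc_emb / scale).astype(np.int8)

# A batch of query embeddings from the query tower.
queries = rng.standard_normal((B, d)).astype(np.float32)

# One matrix multiplication scores every query against every document.
scores = queries @ (doc_q.astype(np.float32) * scale).T   # shape (B, N)

# Exact top-k per query: argpartition, then sort only the small k-slice.
part = np.argpartition(-scores, k, axis=1)[:, :k]
order = np.argsort(-np.take_along_axis(scores, part, axis=1), axis=1)
topk = np.take_along_axis(part, order, axis=1)            # shape (B, k)
```

Because every document is scored, the top-k is exact by construction; there is no recall loss to tune away, only the matmul cost to pay.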
What are the advantages of this?
1. Search recall is higher. As much as we like ANN indices, their recall in practice is above 95% but still short of 100%. Here, we compute the dot product with every object in the database, so recall is exactly 100%.
2. Normally, we select one to a few thousand candidates from an ANN index, but here we can return up to 100,000; ANN struggles at that scale. But then, what do we do with those 100,000? Meta suggests ranking them in the next stage with a heavier model, mixture-of-logits (MoL): still two-tower, but with a more complex network in place of the simple dot product at the end, also run on GPUs. The result of this step is then passed to the heavy-ranking model, just as before. (By the way, a recent paper by Meta and Microsoft Research claims that MoL is a good universal approximator and, moreover, can be used efficiently with ANNs in practice.)
3. This approach also allows much faster and more frequent embedding updates for documents. You just need to update them in GPU memory. In an ANN index, this is more complicated, so frequent updates are less common.
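To make the MoL idea in point 2 concrete, here is a hedged sketch: instead of one dot product per query-document pair, each pair gets several component dot products that are combined by a softmax gate. The gating here (a single linear map on the component logits) is a stand-in for illustration, not Meta's actual architecture, and all shapes and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
P, d, N = 4, 32, 1000   # mixture components, embedding dim, candidates to re-rank

q_comp = rng.standard_normal((P, d)).astype(np.float32)     # query-side components
d_comp = rng.standard_normal((N, P, d)).astype(np.float32)  # doc-side components
gate_w = rng.standard_normal((P, P)).astype(np.float32)     # toy gating weights

# Component logits: one dot product per mixture component per candidate.
logits = np.einsum('npd,pd->np', d_comp, q_comp)            # shape (N, P)

# Numerically stable softmax gate over the components.
g = logits @ gate_w
g -= g.max(axis=1, keepdims=True)
gates = np.exp(g)
gates /= gates.sum(axis=1, keepdims=True)

# Final MoL score: gate-weighted sum of the component logits.
scores = (gates * logits).sum(axis=1)                       # shape (N,)
```

Since both towers still produce fixed embeddings, the extra cost over a plain dot product is a few small matmuls per candidate, which is why this stage also fits comfortably on GPUs.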
Looks promising.