A successful exam is one where you not only pass but also learn something new.
More than a year ago, I interviewed at Microsoft. During one of the interviews, I was asked a question that I didn’t answer very well. Even after hearing the answer, I didn’t grasp it, possibly due to the language or pronunciation. However, after thinking about it for a couple of days, I understood what was meant, or at least arrived at a reasonable possible answer.
The question was as follows. There are many models, for recommendations and beyond, that compute a score by combining a user (or query) embedding with a document embedding, for example matrix factorization or two-tower neural networks. Let’s consider the latter. To combine the two embeddings, you can use the dot product (also known as the inner product or scalar product) or cosine similarity. So the question was: which is better, dot product or cosine?
Our experience at Yandex was not always consistent — sometimes cosine worked more stably for training, and later on, the dot product seemed to train well and perform slightly better. Cosine’s limited range of values (even if followed by an affine transformation) is a drawback. However, the question was not about the quality of the model’s prediction but about additional applications of the trained document embeddings, like creating a fast index or clustering. I didn’t quite understand what the problem might be. For cosine, it seemed simpler, but dot product also works (at least for a fast index).
Consider a model trained using the dot product. If we multiply the first coordinate of all document embeddings by a large constant (say, 1000), and divide the first coordinate of all user embeddings by the same constant, the user-document scores won’t change; that is, the model’s prediction quality remains the same. However, if we measure similarity between documents (usually via dot product or cosine as well), everything will be determined by just this first coordinate, with all others having a much smaller scale. Thus, clustering such documents would work very poorly. During training, the model won’t actually multiply a coordinate by such a large constant, but over the course of gradient descent, all coordinates (or more precisely, directions in this space) will stretch in some random way. User embeddings adapt to these stretches, but inter-document distances do not.
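To make the rescaling argument concrete, here is a minimal NumPy sketch. The embeddings are random Gaussian vectors made up for illustration (not from any trained model), and the names `users`, `docs`, and `cosine_matrix` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
dim = 8
users = rng.normal(size=(100, dim))  # hypothetical user embeddings
docs = rng.normal(size=(50, dim))    # hypothetical document embeddings

# Multiply the first document coordinate by 1000 and
# divide the first user coordinate by the same constant.
scale = np.ones(dim)
scale[0] = 1000.0
docs_scaled = docs * scale
users_scaled = users / scale

# User-document dot products are the same up to floating-point error.
scores_before = users @ docs.T
scores_after = users_scaled @ docs_scaled.T
scores_unchanged = np.allclose(scores_before, scores_after)


def cosine_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


# Document-document cosines, by contrast, change a lot: after the
# rescale they are driven almost entirely by the first coordinate.
doc_cos_before = cosine_matrix(docs)
doc_cos_after = cosine_matrix(docs_scaled)

print("user-document scores unchanged:", scores_unchanged)
print("median |doc-doc cosine| before:", np.median(np.abs(doc_cos_before)))
print("median |doc-doc cosine| after: ", np.median(np.abs(doc_cos_after)))
```

After the rescale, most document pairs look either nearly identical or nearly opposite, depending only on the sign of their first coordinate, which is why clustering on such embeddings degrades.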
When training the model with cosine, this issue doesn’t arise. Cosine is basically the dot product of the normalized embeddings, so it’s like we normalize all embeddings in the end. Therefore, you can’t just rescale one coordinate without altering the others. That is, you can’t simply transfer all importance in document distances to one direction without ruining the user-document distances. By the way, in matrix factorization, unlike neural networks, this problem doesn’t exist either — even with dot product — due to regularization.
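As a rough check of this claim, the sketch below applies the same kind of coordinate rescaling to a cosine-scored model and measures how much the user-document scores move. The data is random and purely illustrative; `cosine_scores` is a hypothetical helper, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
dim = 8
users = rng.normal(size=(100, dim))  # hypothetical user embeddings
docs = rng.normal(size=(50, dim))    # hypothetical document embeddings


def cosine_scores(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T


# Same trick as with the dot product: stretch one document coordinate
# and shrink the matching user coordinate.
scale = np.ones(dim)
scale[0] = 1000.0

before = cosine_scores(users, docs)
after = cosine_scores(users / scale, docs * scale)

# With cosine scoring the rescale is no longer "free":
# the user-document scores themselves change.
max_change = np.abs(before - after).max()
print("largest change in a user-document cosine score:", max_change)
```

Because normalization couples all coordinates, the rescale that was harmless for the dot product now changes the predictions, so training has no reason to drift in that direction.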
One of my former colleagues suggested an invariant similarity measure between two documents for the dot-product case: take the correlation, over a random user, of the scores that the two documents receive. Equivalently, it is the cosine between the two document vectors, computed with the inner product defined by the covariance matrix of the user embeddings.
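Here is a minimal sketch of that measure, under the assumption that the covariance is estimated from a sample of user embeddings. For documents d1, d2 and user covariance Σ, it computes d1ᵀΣd2 / sqrt(d1ᵀΣd1 · d2ᵀΣd2). The function name `score_correlation` and the random data are hypothetical.

```python
import numpy as np


def score_correlation(docs, users):
    """Correlation between documents' dot-product scores over the given users.

    docs:  (n_docs, dim) document embeddings
    users: (n_users, dim) sample of user embeddings
    """
    cov = np.cov(users, rowvar=False)   # (dim, dim) user covariance
    gram = docs @ cov @ docs.T          # entries d_i^T cov d_j
    norms = np.sqrt(np.diag(gram))
    return gram / np.outer(norms, norms)


rng = np.random.default_rng(2)  # arbitrary seed
dim = 8
users = rng.normal(size=(1000, dim))  # hypothetical user embeddings
docs = rng.normal(size=(50, dim))     # hypothetical document embeddings

# Rescale one coordinate in a way that leaves user-document scores intact.
scale = np.ones(dim)
scale[0] = 1000.0

sim_before = score_correlation(docs, users)
sim_after = score_correlation(docs * scale, users / scale)
is_invariant = np.allclose(sim_before, sim_after, atol=1e-6)

# The same matrix can be computed directly as the correlation
# between the columns of the user-document score matrix.
direct = np.corrcoef(users @ docs.T, rowvar=False)
matches_direct = np.allclose(sim_before, direct, atol=1e-6)

print("invariant under rescaling:", is_invariant)
print("matches direct score correlation:", matches_direct)
```

The measure depends only on the user-document scores, so any transformation of the embeddings that leaves those scores unchanged leaves it unchanged too. Whether it gives good clusters in practice is a separate question that this sketch does not address.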
If anyone has tested any of these in practice, please share in the comments.