Counters, Filters, and Hashing
I’ve already written about counters and how to remove bias from them, but I haven’t touched on some important technical aspects. Specifically: counters take up space in user and item profiles. (An item profile is all the accumulated information about an item; it doesn’t have to be a single physical structure.) And that means we can’t store an infinite number of counters. For example, if a user profile grows to 1MB, it becomes hard to handle in a real-time system.
So typically, we enforce a limit on the number of counters per type. Once we hit that limit, we discard the “least important” ones — either the oldest (not updated in a while) or those with the smallest values (with exponential decay, the value implicitly accounts for time since last update).
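To make the trimming rule concrete, here is a minimal Python sketch of evicting by decayed value (the one-week half-life, the `dict` layout, and the function names are my illustrative assumptions, not a description of any production system):

```python
HALF_LIFE = 7 * 24 * 3600  # half-life in seconds; one week is an arbitrary choice


def decayed_value(value, last_update, now):
    """Exponential decay: the stored value halves every HALF_LIFE seconds.

    Because the value shrinks with time since the last update, sorting by
    decayed value simultaneously penalizes small and stale counters.
    """
    return value * 0.5 ** ((now - last_update) / HALF_LIFE)


def trim(counters, limit, now):
    """Keep only the `limit` counters with the largest decayed values.

    `counters` maps a key to a (value, last_update_timestamp) pair.
    """
    ranked = sorted(
        counters.items(),
        key=lambda kv: decayed_value(kv[1][0], kv[1][1], now),
        reverse=True,
    )
    return dict(ranked[:limit])
```

With this scheme a single sort covers both eviction policies from the text: a counter not updated in three half-lives has lost 87.5% of its value, so "oldest" and "smallest" largely coincide.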
But sometimes that strategy isn’t enough. For instance, when users scroll through a recommendation feed and we want to remember all impressions to avoid repeating them, highly active users can rack up tens of thousands of impressions. In that case, counters aren’t the most efficient tool.
If the goal is just to filter out previously seen items, we can use a well-known probabilistic data structure: the Bloom filter. Its false positives mean it occasionally filters out items the user hasn’t actually seen, but that happens rarely, and usually it’s acceptable. To prevent the filter from growing indefinitely (since you can’t delete elements from it), we can use a filter queue: when the current filter fills up, we create a new one, and when the queue gets too long, we remove the oldest filter.
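The filter-queue idea can be sketched in a few dozen lines of Python. The bit-array size, the number of hash functions, the double-hashing scheme, and the rotation thresholds below are all illustrative assumptions:

```python
import hashlib
from collections import deque


class BloomFilter:
    def __init__(self, m_bits=8192, k=4):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)
        self.count = 0  # number of inserted items, used to decide when to rotate

    def _positions(self, item):
        # Derive k bit positions via double hashing from one SHA-256 digest
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))


class BloomQueue:
    """Queue of Bloom filters: rotate to a fresh filter when the current
    one fills up, and drop the oldest filter when the queue gets too long."""

    def __init__(self, capacity=1000, max_filters=3):
        self.capacity = capacity
        self.max_filters = max_filters
        self.filters = deque([BloomFilter()])

    def add(self, item):
        if self.filters[-1].count >= self.capacity:
            self.filters.append(BloomFilter())
            if len(self.filters) > self.max_filters:
                self.filters.popleft()  # oldest impressions are forgotten
        self.filters[-1].add(item)

    def __contains__(self, item):
        return any(item in f for f in self.filters)
```

Dropping the oldest filter trades exactness for bounded memory: very old impressions may resurface, which is usually fine, since repeating an item the user saw long ago is far less harmful than repeating a recent one.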
At Yandex, we implemented a more space-efficient alternative: the quotient filter.
Compared to counters, filters use less space — but they come with two key limitations:
- They return only binary yes/no answers.
- You can’t iterate over their contents — only check for specific elements. This makes them unsuitable for candidate generation or building complex features.
But can we get around the first limitation? Can we store non-binary values, like counters, approximately — but more compactly?
We can! It’s called the count–min sketch — a generalization of the Bloom filter (or more precisely, the counting Bloom filter) that also uses multiple hash functions. And it works well with exponential decay, too.
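A minimal sketch of a count–min sketch with decay, in Python. The width, depth, and per-tick decay factor are illustrative assumptions, as is the choice to decay every cell eagerly on each time tick:

```python
import hashlib


class DecayingCountMinSketch:
    """Count-min sketch with exponential decay applied once per time tick."""

    def __init__(self, width=1024, depth=4, decay=0.99):
        self.width = width
        self.depth = depth
        self.decay = decay
        # depth rows of width float counters; floats because decay is fractional
        self.table = [[0.0] * width for _ in range(depth)]

    def _columns(self, item):
        # One column index per row, carved out of a single SHA-256 digest
        digest = hashlib.sha256(item.encode()).digest()
        return [
            int.from_bytes(digest[4 * i : 4 * i + 4], "big") % self.width
            for i in range(self.depth)
        ]

    def step(self):
        # Advance time: multiply every cell by the decay factor
        for row in self.table:
            for j in range(len(row)):
                row[j] *= self.decay

    def add(self, item, amount=1.0):
        for row, j in zip(self.table, self._columns(item)):
            row[j] += amount

    def estimate(self, item):
        # The minimum over rows is an upper bound on the true decayed count:
        # hash collisions can only inflate a cell, never deflate it
        return min(row[j] for row, j in zip(self.table, self._columns(item)))
```

Eager decay in `step()` touches the whole table, which costs O(width × depth) per tick; a lazy variant that stores a last-update timestamp and decays on read would avoid that, at the cost of extra bookkeeping.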
Unfortunately, I don’t have hands-on experience using count–min sketch for features, so I can’t say whether it’s more effective in practice than simply trimming regular counters.