Published on Nov 01, 2025 (edited on Nov 01, 2025)
For years, dense retrieval—the use of neural language models (LMs) to produce a single vector embedding representing an input—has been the dominant force in information retrieval (IR). These models have proven immensely capable, generalizing across various retrieval datasets and tackling increasingly complex tasks, including reasoning and instruction-following. The community continues to push these embeddings further, implicitly tasking them with solving virtually *any* retrieval problem that could be defined.
However, new research from Google DeepMind challenges the assumption that better data or bigger models alone can overcome all hurdles. The paper, "On the Theoretical Limitations of Embedding-Based Retrieval," demonstrates that vector embeddings, under the existing single vector paradigm, face fundamental theoretical limitations that manifest even with extremely simple queries in realistic settings.
The theoretical limitation lies in the geometric constraints imposed by representing complex data using fixed-size vectors. When documents and queries are mapped into an embedding space, their relevance is determined by calculating the distance or dot product between their vectors.
The paper connects known results from learning theory to show that the number of unique subsets of documents (specifically, the top-$k$ documents) that can be returned as relevant for *some* query is fundamentally limited by the dimension ($d$) of the embedding.
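This counting argument can be made concrete with a small simulation. The sketch below is an illustration, not the paper's construction: it embeds a handful of documents in a low-dimensional space, sweeps many random query vectors, and records which top-2 subsets ever appear as a retrieval result.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 2, 2                    # 10 documents embedded in 2 dimensions
docs = rng.normal(size=(n, d))

reachable = set()
for _ in range(20_000):
    q = rng.normal(size=d)            # a random query embedding
    scores = docs @ q                 # dot-product relevance scores
    reachable.add(frozenset(np.argsort(scores)[-k:].tolist()))

total = math.comb(n, k)               # 45 possible top-2 subsets
print(f"reachable top-{k} subsets: {len(reachable)} of {total}")
```

Only the query direction matters for dot-product ranking, and in 2 dimensions the ranking can change in only a limited number of ways as that direction rotates, so typically far fewer than all 45 subsets ever appear. The unreachable subsets are exactly the document combinations that no query can retrieve at that dimension.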
To capture a given set of retrieval objectives perfectly—meaning, consistently ranking relevant documents higher than irrelevant ones for all queries—the required embedding dimension is tied directly to the complexity of the relationships in the data, measured mathematically through a concept known as the sign rank of the relevance matrix.
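Concretely (paraphrasing the paper's framing), let $A \in \{0,1\}^{m \times n}$ be the binary query-document relevance (qrel) matrix. The quantity governing the required dimension is the sign rank of its $\pm 1$ version $M = 2A - \mathbf{1}$:

$$\operatorname{rank}_{\pm}(M) \;=\; \min\bigl\{\operatorname{rank}(B) : B \in \mathbb{R}^{m \times n},\ \operatorname{sign}(B_{ij}) = M_{ij} \text{ for all } i, j\bigr\}$$

No matter how the vectors are trained, an embedding of dimension $d$ can only realise relevance patterns whose sign rank is on the order of $d$; patterns of higher sign rank cannot be captured at that dimension.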
In simple terms, for any fixed dimension $d$, there exists a theoretical set of document combinations that the embedding model simply cannot represent. As tasks become more sophisticated—especially those involving instruction-following or logical operators that connect previously unrelated documents—the number of potential combinations of relevant documents increases dramatically, rapidly exceeding the fixed capacity of the embedding dimension.
The researchers empirically confirmed this theoretical limit by testing the absolute best-case scenario: "free embedding" optimization. In this setup, the document and query vectors are treated as free parameters and directly optimized using the correct answers (the test data). Since these vectors are not constrained by having to model natural language, this represents the highest performance any embedding model could possibly achieve.
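A minimal sketch of this idea (not the paper's exact optimizer or loss) treats the query and document matrices as free parameters and fits them to a target relevance matrix by gradient descent on a cross-entropy surrogate:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 8, 8, 2                    # with d = n a perfect arrangement exists
queries = list(itertools.combinations(range(n), k))
m = len(queries)                     # all 28 top-2 combinations as queries

# Binary relevance matrix: R[i, j] = 1 iff document j is relevant to query i.
R = np.zeros((m, n))
for i, pair in enumerate(queries):
    R[i, list(pair)] = 1.0

Q = 0.1 * rng.normal(size=(m, d))    # free query embeddings
D = 0.1 * rng.normal(size=(n, d))    # free document embeddings

def bce_loss(Q, D):
    P = 1.0 / (1.0 + np.exp(-np.clip(Q @ D.T, -30, 30)))
    return -np.mean(R * np.log(P + 1e-9) + (1 - R) * np.log(1 - P + 1e-9))

initial = bce_loss(Q, D)
for _ in range(3000):
    G = 1.0 / (1.0 + np.exp(-np.clip(Q @ D.T, -30, 30))) - R  # dLoss/dScores
    Q, D = Q - 0.5 * (G @ D) / n, D - 0.5 * (G.T @ Q) / m

# A query is "solved" when its top-k documents are exactly its relevant pair.
S = Q @ D.T
solved = sum(set(np.argsort(S[i])[-k:]) == set(p) for i, p in enumerate(queries))
print(f"loss {initial:.3f} -> {bce_loss(Q, D):.3f}; solved {solved}/{m}")
```

With $d = n$, a perfect arrangement exists (for instance, one-hot document vectors), so optimization can succeed; shrinking $d$ below the critical value leaves some top-2 combinations unreachable no matter how long the vectors are optimized.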
They found a "critical-n" point: the maximum number of documents ($n$) that an embedding of dimension ($d$) can successfully model before failing to represent all top-$k$ combinations. Even when extrapolating from these ideal experiments, the results are sobering: modelling all possible combinations of two relevant documents (top-$k=2$) in a web-scale search setting would require infeasibly large embeddings. For instance, covering all top-2 combinations among 500,000 documents would already require a 512-dimensional embedding, and 250 million documents would require 4096 dimensions. And this is still the absolute best-case performance, far exceeding the capability of real, language-constrained models.
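A quick back-of-the-envelope check makes the scale concrete. The figures below are simple combinatorics, not the paper's critical-n fit: the number of distinct top-2 subsets grows quadratically with corpus size.

```python
import math

# 46 matches the small LIMIT variant discussed below; the larger sizes
# match the extrapolated corpus sizes quoted above.
for n_docs in (46, 500_000, 250_000_000):
    combos = math.comb(n_docs, 2)     # number of distinct top-2 subsets
    print(f"{n_docs:>11,} documents -> {combos:,} top-2 combinations")
```

At 250 million documents there are over $3 \times 10^{16}$ possible pairs of relevant documents, which is the space of result sets a fixed-dimensional embedding is implicitly asked to distinguish.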
To show what this limitation means for real-world applications, the researchers created a new dataset called LIMIT. This dataset is designed to be trivially simple in terms of natural language complexity (e.g., documents are short, and queries ask simple questions like "who likes X?"). However, it is fundamentally difficult because it is constructed based on the theoretical limitations, ensuring a high density of relevant document combinations that stress-tests the model’s representational capacity.
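The pairing logic can be sketched as follows. This is a schematic reconstruction, not the released dataset: the real LIMIT uses natural-sounding attributes and a much larger pool of distractor documents, but the essential idea is to assign each query attribute to exactly one pair of people so the qrels densely cover 2-document combinations.

```python
import itertools

n_core = 46                                            # core documents, as in small LIMIT
pairs = list(itertools.combinations(range(n_core), 2)) # 1035 candidate pairs

# Attribute a is liked by exactly the two people in pairs[a].
likes = {person: [] for person in range(n_core)}
for attr, (p1, p2) in enumerate(pairs):
    likes[p1].append(attr)
    likes[p2].append(attr)

docs = [
    f"Person {p} likes " + ", ".join(f"thing {a}" for a in attrs) + "."
    for p, attrs in likes.items()
]
queries = [(f"Who likes thing {a}?", set(pair)) for a, pair in enumerate(pairs)]

print(len(queries), "queries over", len(docs), "core documents")
```

Every query is linguistically trivial, yet answering all of them correctly requires the embedding space to realise every one of the 1035 top-2 subsets, which is precisely the regime the theory identifies as hard.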
The results were striking: even state-of-the-art embedding models struggled severely on LIMIT. For example, in the full LIMIT setting (50k documents), leading models scored less than 20% recall@100. When testing a smaller version of LIMIT (46 documents), models still could not solve the task perfectly.
The empirical analysis confirmed that performance on LIMIT depends strongly on the embedding dimension, with larger dimensions yielding better scores. Crucially, the issue is not poor training data or domain shift: models trained on a standard training split of LIMIT showed little improvement, indicating that the task is intrinsically hard for exactly the reasons the theory predicts.
The findings suggest that the community must be keenly aware of these fundamental limits, especially as evaluation benchmarks push embeddings toward supporting an ever-increasing variety of complex queries.
For large-scale tasks requiring models to capture arbitrary relevance combinations, alternatives to the single-vector paradigm may be necessary. The study showed promising results for alternative architectures: sparse lexical methods such as BM25 (whose effective dimensionality is very high), multi-vector models with late interaction in the style of ColBERT, and cross-encoder rerankers all handled LIMIT substantially better than single-vector embeddings, though each comes with its own cost and scalability trade-offs.
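For intuition on why multi-vector models escape the single-vector bound, here is a minimal sketch of ColBERT-style late interaction ("MaxSim") scoring, in which a query is represented by a matrix of token vectors rather than a single point:

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: each query token vector is matched
    against its best document token vector, and the matches are summed."""
    sims = query_vecs @ doc_vecs.T          # (query_tokens, doc_tokens)
    return float(sims.max(axis=1).sum())

# Toy example: two query tokens, each matching a different document token.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(maxsim_score(q, d))
```

Because relevance is computed from many vectors per item rather than one dot product, the representational capacity is no longer tied to a single dimension $d$, which is consistent with the stronger LIMIT results the paper reports for multi-vector and sparse approaches.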
In summary, while neural embeddings have achieved immense success, their ability to handle the full range of possible retrieval tasks is ultimately constrained by their fixed dimension. To create general-purpose models capable of handling any query and relevance definition, future research must move beyond the constraints of the single-vector approach.