When “Almost” Isn’t Good Enough: Why Top Engineers Still Rely On BM25
- Gandhinath Swaminathan

- Jan 8
- 5 min read
Most entity resolution work isn’t actually about “matching” — it is the high-stakes process of deciding which records are even worth comparing.
From an operational standpoint, a poor decision here leads to false positives (incorrect matches) or false negatives (missed matches), either of which compromises the integrity of your identifiers. In a customer database, 'almost the same' usually means 'not the same' in a business context.
Series: Entity Resolution
This post covers BM25 scoring, the math behind it, and why it still ranks well on entity-shaped text.

When “close” is wrong
Vectors capture intent. They know “automobile” and “car” represent the same concept. However, in entity resolution, intent can be a dangerous signal.
Take product SKUs: A-123-B is fundamentally different from A-123-C. A vector model might see these as nearly identical because they look similar in a semantic space. But in your warehouse, they are two distinct parts. Identifiers don't allow for "close"—they are either "equal" or "different."
Introducing BM25
BM25 isn’t a generative model, nor is it based on neural embeddings. It belongs to the Okapi “Best Match” family of algorithms (the "25" refers to a specific iteration in its development). It is built on the principles of probabilistic information retrieval: it estimates the likelihood that a specific record is the correct answer based on the tokens actually present.
From TF‑IDF to BM25
If you’ve done any text scoring, you’ve seen TF‑IDF in some form. It combines two mechanics to turn raw counts into match importance:
Term frequency (TF): The raw count. Terms that appear more often in a document should count more.
Inverse document frequency (IDF): The weighting. Terms that appear in fewer documents carry more unique signal and should count more heavily than common terms.
Classical TF‑IDF for a term t in document d from a corpus D looks roughly like:

tfidf(t, d, D) = tf(t, d) · log( ∣D∣ / df(t) )

where tf(t, d) is the raw count of t in d and df(t) is the number of documents in D that contain t.
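As a concrete reference, here is that classical formula in plain Java. This is an illustration of the formula above, not any particular library's variant; the class and method names are my own:

```java
import java.util.List;

public class TfIdf {
    // tf(t, d): raw count of term t in document d, where d is a token list
    static long tf(String term, List<String> doc) {
        return doc.stream().filter(term::equals).count();
    }

    // idf(t, D): log(|D| / df(t)); df(t) = number of documents containing t
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }
}
```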
This works well enough for ranking, but it creates three specific failure modes in production search:
Unlimited Scoring on Repetition: The raw count is uncapped; repetition keeps adding points.
Bias Toward Length: Longer documents contain more tokens and accidental overlaps, creating a bias that rewards length over accuracy.
Implementation Inconsistency: “TF‑IDF” varies by implementation, especially around IDF and normalization, so behavior shifts across systems.
BM25 keeps the signal and clamps these failure modes.

The Math
BM25 scores a document D against a query Q = {q1, …, qn} using:

score(D, Q) = Σ(i=1…n) IDF(qi) · f(qi, D) · (k1 + 1) / ( f(qi, D) + k1 · (1 − b + b · ∣D∣ / avgdl) )
Where:
f(qi,D): count of term qi in D
∣D∣: document length in tokens
avgdl: average document length
k1: term frequency saturation control
b: length normalization control
That’s the usable mental model: term statistics with saturation and length normalization. Now the mechanics actually matter.
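Transcribed directly into code, the whole formula is only a few lines. This is a sketch for intuition, not Lucene's optimized scorer; it assumes the IDF values and term frequencies have been computed elsewhere:

```java
public class Bm25Score {
    // idf[i] = IDF(qi); tf[i] = f(qi, D); docLen = |D|; avgDocLen = avgdl
    static double bm25(double[] idf, double[] tf, double docLen,
                       double avgDocLen, double k1, double b) {
        // The length-normalization factor is constant for a given document
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        double score = 0.0;
        for (int i = 0; i < tf.length; i++) {
            // Saturated term frequency, weighted by the term's rarity
            score += idf[i] * (tf[i] * (k1 + 1)) / (tf[i] + norm);
        }
        return score;
    }
}
```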
Saturation (k1)
BM25 adds a limit to how much a single term can boost a score. The numerator scales term frequency by (k1 + 1) but the denominator also grows with f(qi,D).
That creates a curve that:
First occurrence: big jump.
Second occurrence: smaller jump.
Subsequent occurrences: diminishing returns.
This prevents long, repetitive fields from dominating ranking just because they repeat the same token. That constraint shows up everywhere in entity data because descriptions and notes fields can get noisy fast.
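You can see that curve by sweeping the term frequency with everything else held fixed. A toy calculation, assuming k1 = 1.2 and IDF = 1, with b = 0 so only saturation is in play:

```java
public class SaturationDemo {
    public static void main(String[] args) {
        double k1 = 1.2;  // a typical default
        for (int f = 1; f <= 5; f++) {
            // Single-term contribution with no length effects and idf = 1
            double contribution = (f * (k1 + 1)) / (f + k1);
            System.out.printf("f=%d -> %.3f%n", f, contribution);
        }
        // Prints 1.000, 1.375, 1.571, 1.692, 1.774 ... approaching k1 + 1 = 2.2
    }
}
```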
Field length normalization (b)
BM25 also adjusts for the fact that long text is statistically more likely to contain overlaps by chance. The normalization term scales the denominator so matches in very long documents count for less than matches in short ones.
This scales the “effective” term frequency by document length:
Matching “Smith” in a tiny field that only contains “John Smith” is a strong signal.
Matching “Smith” somewhere in a 500-page biography is not.
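A toy comparison makes the asymmetry concrete. Assuming a single occurrence of the term, k1 = 1.2, b = 0.75, and an average field length of 10 tokens:

```java
public class LengthNormDemo {
    // Saturated, length-normalized contribution of one query term
    static double termScore(double tf, double docLen, double avgdl,
                            double k1, double b) {
        return (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * docLen / avgdl));
    }

    public static void main(String[] args) {
        double k1 = 1.2, b = 0.75, avgdl = 10;  // assumed corpus statistics
        // One "Smith" in a two-token field vs. one in ~150,000 tokens
        System.out.printf("short field: %.3f%n", termScore(1, 2, avgdl, k1, b));
        System.out.printf("long doc:    %.5f%n", termScore(1, 150_000, avgdl, k1, b));
        // short field ~1.486, long doc ~0.00016: same match, very different weight
    }
}
```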
IDF: rare tokens matter more
BM25 uses IDF so common tokens contribute little and rare tokens contribute more. In entity resolution, that’s usually correct: “John” is weak, “Chewbacca” is strong, and “PS‑LX350H” carries more identity than “Sony” or “Turntable”.
This is one of the reasons BM25 remains robust for entity resolution: the corpus statistics do the weighting, so you get effective results without handcrafted 'important token' rules.
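For a sense of scale, here is Lucene's smoothed BM25 IDF, log(1 + (N − df + 0.5) / (df + 0.5)), applied to hypothetical document frequencies:

```java
public class IdfDemo {
    // Lucene's BM25 IDF: log(1 + (N - df + 0.5) / (df + 0.5))
    static double idf(long df, long numDocs) {
        return Math.log(1 + (numDocs - df + 0.5) / (df + 0.5));
    }

    public static void main(String[] args) {
        long n = 1_000_000;  // hypothetical corpus size
        System.out.printf("John      (df=50,000): %.2f%n", idf(50_000, n)); // ~3.00
        System.out.printf("Chewbacca (df=3):      %.2f%n", idf(3, n));      // ~12.56
    }
}
```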
See It Work: BM25 scoring in Lucene

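A minimal sketch of wiring BM25 into Lucene, assuming a Lucene 9.x classpath; k1 = 1.2 and b = 0.75 are the library defaults, passed explicitly here:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;

public class Bm25Setup {
    // k1 and b are Lucene's defaults; new BM25Similarity() would do the same
    static final BM25Similarity BM25 = new BM25Similarity(1.2f, 0.75f);

    // Index-time config: the similarity shapes the stored length norms
    static IndexWriterConfig writerConfig() {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        cfg.setSimilarity(BM25);
        return cfg;
    }

    // Query-time config: use the same similarity so scoring stays consistent
    static IndexSearcher searcher(DirectoryReader reader) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(BM25);
        return searcher;
    }
}
```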
Those two floats you pass into BM25Similarity are the same k1 and b that shaped the saturation and length-normalization curves earlier.
Why Leading Systems Still Use BM25
Even with the proliferation of dense retrieval and neural rerankers, BM25 remains the industry default for indexing layers. It is the engine inside Lucene-based systems (Elasticsearch, OpenSearch, Solr), cloud search services, and modern hybrid retrieval stacks that expose sparse + dense retrieval.
The reasons are strictly practical:
Stability: BM25 has decades of IR work behind it. It behaves predictably on noisy text with minimal tuning.
Efficiency: The scoring is simple arithmetic on term counts and precomputed statistics over an inverted index, which is inexpensive compared to GPU-heavy models in both compute and infrastructure.
Explainability: This is its greatest operational asset. You can show a user or an analyst exactly why a match occurred: “This term contributed X% of the score because it was rare in the database but appeared twice in this record.” Lexical scoring, properly implemented, can be audited token by token, giving you auditability and transparency.
Those traits map well to entity resolution, where teams often need to justify why two records were considered close before they accept a merge.

Why This Matters for Entity Resolution
Entity resolution lives in the gap between rigid exact matching and broad embedding similarity. BM25 helps when the identity signal is in the tokens, not in the concept.
Handling “almost” exact matches
Exact matching is operationally brittle. One extra space or a stray hyphen breaks equality joins. Embedding similarity can drift the other direction, especially around identifiers and near-identifiers.
BM25 stays grounded in what’s actually present as tokens, prioritizes rarer tokens, and doesn’t require byte-for-byte equality to surface overlaps.
The rare token signal
In entity resolution, rare tokens carry most of the information:
Customer: “John” is noise; “Chewbacca” is a beacon.
Product: “Turntable” is generic; “PS‑LX350H” is specific.
Address: “Street” is generic; the particular house number plus ZIP is specific.
BM25 naturally gives those rare tokens more influence. Because it’s grounded in document frequency, you don’t need to hand‑curate a list of “important” vs. “unimportant” tokens; the corpus statistics drive the ranking. That’s especially valuable when you scale across many domains and don’t want to maintain stop-word lists and weighting rules for each one.
Tunability to the domain
You can tune k1 and b based on the shape of your data. Short structured fields often benefit from stronger length normalization, while longer descriptive fields may need a lighter touch.
This is not magic; it’s two knobs that give you a controlled way to match how your specific fields behave.
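In Lucene, for example, you can vary the knobs per field with PerFieldSimilarityWrapper. A sketch with hypothetical field names: a "sku" identifier field gets full length normalization, while a "description" field gets a lighter touch:

```java
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;

public class PerFieldBm25 extends PerFieldSimilarityWrapper {
    // Short, structured fields: strong length normalization (b = 1.0)
    private final Similarity idField = new BM25Similarity(1.2f, 1.0f);
    // Long descriptive fields: lighter normalization (b = 0.4)
    private final Similarity textField = new BM25Similarity(1.2f, 0.4f);

    @Override
    public Similarity get(String fieldName) {
        return fieldName.equals("sku") ? idField : textField;
    }
}
```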
The Limitation: Vocabulary Gaps
BM25 is strictly lexical. It matches tokens, not meanings. If one record says “Software Engineer” and another says “Developer,” BM25 may score near zero because there is no literal overlap.
That trade-off is the point: BM25 is intentionally strict. It excels when you need to match specific identifiers, and it is weaker when two descriptions refer to the same thing with different words. Embedding retrieval is strong when vocabulary shifts.
Next Up: Reciprocal Rank Fusion (RRF), and how to merge a BM25-ranked list with an HNSW-ranked list to combine strict identity with semantic meaning without losing the identifier signal.

