Hybrid Search and Reciprocal Rank Fusion: Building the Bridge Between Lexical and Semantic
- Gandhinath Swaminathan

- Jan 14
- 7 min read
The Series So Far
Part 1: We exposed the invisible problem of fragmented identity and showed how scattered customer records destroy churn models and AI agents.
Part 2: We built tools for semantic similarity, using HNSW indexing and pgvector, and learned how dense embeddings capture meaning when words don't match. "Turntable" and "record player" could finally talk to each other through vector space.
Part 3: We focused on lexical precision with BM25 and Lucene, proving that catching specific tokens like "Sony PS-LX350H" is vital for entity resolution.
We now have two distinct lenses through which to view our data. They cover each other's blind spots, but they produce incompatible scores.

Two opposing systems, exact-match precision and semantic flexibility, now have to work together. This is the Hybrid Search Problem.
The question is: how do you unify these outputs so they strengthen the search result rather than canceling each other out?
The tension between these two systems is exactly where most teams make their first critical mistake.
The Linear Combination Trap
When you have two ranking functions, the most intuitive solution is to merge them using a simple weighted sum:
final_score = 0.7 × vector_score + 0.3 × bm25_score

The logic seems sound. You're telling the system: "Vector search is the primary driver, but I want BM25 to have a stake in the final ranking to ensure keyword precision."
Then you deploy it to production. And it immediately breaks.
Why does it fail?
The Scale Disparity: BM25 scores are unbounded. A document with heavy term frequency can easily score 50, 100, or higher. In contrast, vector similarity produces scores in a narrow, bounded range, typically between 0 and 1 (or -1 to 1, depending on your distance metric).
When you add them together, the BM25 score acts like a tidal wave, drowning out the vector signal.
The Normalization Paradox: To fix this, you might try normalizing the scores using min-max scaling, z-scores, or dividing by the maximum score in the result set. But this introduces a new problem: normalized scores are relative, not absolute. They depend entirely on what other documents happened to appear in that specific result set:
The Dilution Effect: If BM25 returns one perfect match and nine mediocre ones, normalization stretches the scale, making those mediocre results look artificially "good."
The Clustering Problem: If vector search returns ten equally decent matches, normalization loses the ability to distinguish between them.
Incompatible Distributions: The underlying math doesn't align. BM25 follows a long-tail curve, where a few results are very relevant and the rest drop off sharply. Vector similarities cluster in specific ranges based on your embedding model.
Because these distributions aren't comparable, a linear weight will never be stable across different queries.
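To make the scale problem concrete, here is a minimal sketch using made-up scores (not real Abt-Buy output) showing how the unnormalized sum lets BM25 swamp the vector signal:

```java
public class LinearCombinationTrap {
    public static void main(String[] args) {
        // Hypothetical scores: BM25 is unbounded, cosine similarity lives in [0, 1].
        double bm25ExactMatch = 48.7;   // rare token "PSLX350H" matched
        double bm25NearMiss   = 43.2;   // different model, same common tokens
        double vecExactMatch  = 0.82;
        double vecNearMiss    = 0.64;

        // A "vector-first" weighting that looks reasonable on paper.
        double wVec = 0.7, wBm25 = 0.3;

        double exact = wVec * vecExactMatch + wBm25 * bm25ExactMatch;  // ~15.18
        double near  = wVec * vecNearMiss  + wBm25 * bm25NearMiss;     // ~13.41

        // Both totals are dominated by the BM25 term: the vector signal contributes
        // under 4% of either score despite its nominal 0.7 weight.
        System.out.printf("exact match: %.2f, near miss: %.2f%n", exact, near);
    }
}
```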
The Deeper Problem is Geometric
Imagine a 2D graph where you plot your search results:
X-Axis: BM25 Score
Y-Axis: Vector Similarity Score
Red Dots: Relevant documents
Blue Dots: Irrelevant documents
The Reality of the Scatter
If these systems were perfectly aligned, the red dots would cluster in the top-right corner. They don't.
Relevant and non-relevant documents are scattered throughout the 2D space. When you use a weighted combination, you are essentially drawing an arbitrary diagonal line through that scatter plot. Because the dots aren't neatly grouped, that line inevitably cuts through the wrong data, causing you to lose precision on both sides.
The Core Assumption
A linear combination relies on two assumptions that simply aren't true in search:
The scoring functions are independently scaled.
The results are linearly separable.
They are neither.
What If You Forgot the Scores?
Reciprocal Rank Fusion takes a completely different approach.
It ignores scores entirely.
Instead, it looks at one question: In what position did this document appear in each ranked list? First place. Second place. Third place. Then it computes a fused score based purely on positions.
The formula is deceptively simple:
RRF_score(d) = Σ_{i=1}^{N} 1 / (k + rank_i(d))
Where:
d is a document
N is the number of ranked lists (typically 2: BM25 and vector)
rank_i(d) is the position of document d in ranked list i
k is the smoothing constant, typically 60
If a document is missing from a list, its contribution from that source is zero.
Insight: Documents that rank well across multiple independent methods are almost certainly relevant. Every ranking function has its own blind spots. When they both agree, you've found the real signal.
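Before the worked example below, here is a minimal, dependency-free sketch of the fusion step; the document IDs and list contents are illustrative, not actual Abt-Buy results:

```java
import java.util.*;

public class ReciprocalRankFusion {

    /** Fuses any number of ranked lists (each ordered best-first) into RRF scores. */
    static Map<String, Double> fuse(int k, List<List<String>> rankedLists) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> list : rankedLists) {
            for (int i = 0; i < list.size(); i++) {
                int rank = i + 1;  // positions are 1-based
                fused.merge(list.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        return fused;  // documents missing from a list simply contribute 0 from that source
    }

    public static void main(String[] args) {
        // Illustrative ranked lists from the two retrievers.
        List<String> bm25   = List.of("sony-pslx350h", "sony-pslx300usb", "denon-dp300f");
        List<String> vector = List.of("sony-pslx350h", "denon-dp300f", "sony-pslx300usb");

        fuse(60, List.of(bm25, vector)).entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .forEach(e -> System.out.printf("%-18s %.4f%n", e.getKey(), e.getValue()));
    }
}
```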
A Real Example To Solidify Your Intuition
Let's use data from the Abt-Buy entity resolution dataset, a classic benchmark for this exact problem.
A user searches for: “Sony PS-LX350H turntable”
You run a BM25 lexical search. Here's what comes back:

BM25 nails the identifier match, but it also catches four other models: "turntable" is a common token, while "PSLX350H" is rare and therefore valuable. The result at rank 4 scores lower because "Denon" doesn't match "Sony."
Now you run a semantic vector search:

Vector search sees these as turntables, so they all score high. But notice the ordering differs from BM25.
Apply the RRF formula with k = 60 to the two result sets:
Sony Turntable - PSLX350H: it sits near the top of both lists, so it collects a contribution from each source.

Olympus PS-BLS-1: it didn't appear in the vector results, so it contributes only from its BM25 rank and receives the lowest fused score.
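To make the arithmetic concrete, assume for illustration that the PS-LX350H listing sat at rank 1 in both lists, while the Olympus appeared only in the BM25 list at rank 5:

Sony Turntable - PSLX350H: 1/(60 + 1) + 1/(60 + 1) = 0.0164 + 0.0164 ≈ 0.0328
Olympus PS-BLS-1: 1/(60 + 5) + 0 ≈ 0.0154

Agreement across both lists compounds; a single-list appearance does not.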
Notice what happened:
Rank Over Power: The top results from BM25 and vector search are now separated by their relative rank positions rather than by an arbitrary trust factor assigned to one method over the other.
Consistent Winners: Both variants of the "PS-LX350H" stay at the top. Even with different naming conventions, they are recognized as the same product because they performed well across both lists.
Precision vs. Bunching: A linear combination would have bunched all four results together. RRF distinguishes between them by looking at how consistently each document ranked across different search methods.
This is entity resolution. You're not picking winners based on which scoring function you trust. You're finding products that multiple independent ranking functions agree on.
Why This Works Without Training
You don’t need labeled data, feature engineering, or neural network training. It relies on just one core assumption: If multiple ranking functions independently place a document high, it’s probably genuinely relevant.
This works because BM25 and vector search fail in entirely different ways:
BM25 fails on vocabulary variation: "Turntable" and "record player" share no tokens, so BM25 treats them as completely unrelated and misses the connection.
Vector search fails on precision: It sees "turntable," "record player," and "music equipment store" as semantically close. It often struggles to distinguish exact identity.
By combining them, you leverage the strengths of one to cover the weaknesses of the other:
High Confidence: When both methods agree (the document ranks high in both), you have found a high-signal match.
Low Confidence: When they disagree (it ranks high in one but low in the other), the document is likely a "near-miss" rather than a true match.
The Smoothing Constant: Why k=60?
The k parameter controls how much rank position affects the final score.
With k = 60:
Rank 1: 1/61 = 0.0164
Rank 2: 1/62 = 0.0161 (down 2%)
Rank 10: 1/70 = 0.0143 (down 13%)
Rank 100: 1/160 = 0.0063 (down 62%)
Top ranks contribute significantly more than later ranks, but the curve is smooth. Moving from rank 1 to rank 2 costs you ~2% of your score. Moving from rank 50 to rank 51 costs ~1%.
Why 60 specifically?
It's empirically derived. Early RRF research tested values ranging from 1 to 100, revealing an optimal equilibrium:
Values below 10: These made the top-ranked item dominate too heavily. A single strong ranking in one list could override everything else, effectively ignoring the consensus.
Values above 100: These flattened the curve too much. The system began treating Rank 1 and Rank 20 almost identically, losing the benefit of precision.
The value 60 balanced these extremes. It emphasizes top ranks enough to matter, without making them absolute.
But k is not magic. Optimal values vary by domain. For entity resolution on product identifiers, you might find k = 40 works better because you want stronger preference for exact identifier matches. For question answering, k = 80 might work better because semantic similarity matters more.
You can also weight the signals differently. Standard RRF treats all ranked lists equally, but in entity resolution, lexical matches often carry more signal. Weighted RRF is:
RRF_weighted(d) = Σ_{i=1}^{N} w_i / (k + rank_i(d)), where w_i is the weight assigned to ranked list i.
A typical setup uses BM25 weight 1.0 and vector weight 0.7—trusting exact matches more while still valuing semantic understanding.
You should tune k to your data by running A/B tests on your labeled validation set.
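Both knobs, the smoothing constant k and the per-list weights, amount to a one-line change in the fusion step. The sketch below is illustrative (the ranks and the 1.0 / 0.7 weights are only examples): it prints how steep the rank curve is for a few k values, then computes one weighted score.

```java
public class RrfTuning {

    /** One list's contribution to a document's fused score under weighted RRF. */
    static double contribution(double weight, int k, int rank) {
        return weight / (k + rank);
    }

    public static void main(String[] args) {
        // How steep or flat the rank curve is for different smoothing constants.
        int[] ks = {10, 60, 100};
        int[] ranks = {1, 2, 10, 100};
        for (int k : ks) {
            System.out.printf("k = %d:%n", k);
            for (int rank : ranks) {
                System.out.printf("  rank %3d -> %.4f%n", rank, contribution(1.0, k, rank));
            }
        }

        // Weighted fusion for a document that (hypothetically) ranked 1st in BM25 and
        // 3rd in vector search, with BM25 weighted 1.0 and vector weighted 0.7.
        double score = contribution(1.0, 60, 1) + contribution(0.7, 60, 3);
        System.out.printf("weighted RRF score = %.4f%n", score);  // ~0.0275
    }
}
```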
Building Hybrid Search: Spring Boot Application with ParadeDB
💡 Need the complete runnable code? — reach out to hello@minimalistinnovation.com for the full Spring Boot project with all migrations and dependencies configured.
This method translates the RRF mathematical formula directly into SQL, allowing us to dynamically weight and fuse the lexical and semantic signals in a single efficient query.

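For orientation, here is a hedged sketch of what such a repository method can look like with Spring's NamedParameterJdbcTemplate. The schema (a products table with name and name_embedding columns), the query-embedding parameter, and the ParadeDB specifics (the @@@ BM25 operator and paradedb.score()) are assumptions you would adapt to your own project and ParadeDB version:

```java
import java.util.List;
import java.util.Map;
import org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate;
import org.springframework.stereotype.Repository;

@Repository
public class HybridSearchRepository {

    // Assumed schema: products(id, name, name_embedding vector(384)) with a ParadeDB BM25
    // index on name and a pgvector index on name_embedding.
    private static final String RRF_SQL = """
        WITH bm25 AS (
            SELECT id, ROW_NUMBER() OVER (ORDER BY paradedb.score(id) DESC) AS rnk
            FROM products
            WHERE name @@@ :query
            ORDER BY paradedb.score(id) DESC
            LIMIT 20
        ),
        vec AS (
            SELECT id, ROW_NUMBER() OVER (ORDER BY name_embedding <=> CAST(:embedding AS vector)) AS rnk
            FROM products
            ORDER BY name_embedding <=> CAST(:embedding AS vector)
            LIMIT 20
        )
        SELECT p.id, p.name,
               COALESCE(1.0 / (:k + bm25.rnk), 0) * :bm25Weight
             + COALESCE(1.0 / (:k + vec.rnk), 0)  * :vectorWeight AS rrf_score
        FROM bm25
        FULL OUTER JOIN vec USING (id)
        JOIN products p USING (id)
        ORDER BY rrf_score DESC
        LIMIT 10
        """;

    private final NamedParameterJdbcTemplate jdbc;

    public HybridSearchRepository(NamedParameterJdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    /** Runs both retrievers and fuses them with weighted RRF in a single round trip. */
    public List<Map<String, Object>> search(String query, String queryEmbedding) {
        return jdbc.queryForList(RRF_SQL, Map.of(
                "query", query,
                "embedding", queryEmbedding,  // e.g. "[0.12, -0.03, ...]" from your embedding model
                "k", 60,
                "bm25Weight", 1.0,
                "vectorWeight", 0.7));
    }
}
```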
Running our Spring Boot implementation against the Abt-Buy benchmark demonstrates how RRF successfully bridges the gap between lexical rigidity and semantic fuzziness, as seen in these top-ranked results:
| User Query | Score | Source | Product Name |
| --- | --- | --- | --- |
| Sony PS-LX350H turntable | 0.0328 | Buy | Sony PS-LX350H Belt-Drive Turntable |
| | 0.0323 | Abt | Sony Turntable - PSLX350H |
| | 0.0315 | Abt | Sony Black USB Stereo Turntable System - PSLX300US... |
| high quality record player | 0.0325 | Abt | Sony Compact Disc Player/Recorder - RCDW500C |
| | 0.0287 | Abt | Canon VIXIA 60GB High Definition Hard Disc Drive B... |
| | 0.0287 | Abt | Canon VIXIA 120GB High Definition Hard Disc Drive ... |
| Sony PS-LX350H Belt-Drive Turntable | 0.0328 | Buy | Sony PS-LX350H Belt-Drive Turntable |
| | 0.0320 | Buy | Sony PSLX300USB USB Record Turntable |
| | 0.0320 | Abt | Sony Turntable - PSLX350H |
| Sony Record Player, Model PS-LX350H, Black | 0.0323 | Abt | Sony Black USB Stereo Turntable System - PSLX300US... |
| | 0.0315 | Abt | Sony Progressive Scan Black DVD Player - DVPNS57PB |
| | 0.0307 | Buy | Sony PS-LX350H Belt-Drive Turntable |
What Comes Next
Hybrid Search is the architectural resolution to the fragmented identity problem. It allows us to build systems that are as precise as a primary key lookup and as flexible as a human conversation.
However, an asymmetry remains.
The Dimensional Divide:
We are effectively forcing two opposing mathematical structures to work together: one is discrete and mostly empty, while the other is compressed and full.
BM25 (Sparse): It views the world as a massive checklist of 100,000+ words. If a document doesn't contain a specific word, that dimension is zero. It is wide, mostly empty, and rigid.
Vectors (Dense): They compress all nuance into a small, fixed list of dimensions where every single slot holds a value. Meaning is smeared across the entire array.
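As a toy illustration of those two shapes (the vocabulary size, term ids, and dimension count below are all made up):

```java
import java.util.Arrays;
import java.util.Map;

public class SparseVsDense {
    public static void main(String[] args) {
        // Sparse (BM25-style): one dimension per vocabulary term, almost all of them zero,
        // so only the handful of terms that actually occur are stored, keyed by term id.
        int vocabularySize = 100_000;
        Map<Integer, Double> sparse = Map.of(
                41_023, 2.3,   // "sony"
                77_410, 4.1,   // "pslx350h"  <- rare, high-weight token
                12_904, 1.1);  // "turntable"

        // Dense (embedding-style): a small fixed-length array where every slot holds a value
        // and no single dimension is interpretable on its own.
        float[] dense = new float[384];
        Arrays.fill(dense, 0.03f);  // stand-in values

        System.out.printf("sparse: %d of %d dimensions non-zero%n", sparse.size(), vocabularySize);
        System.out.printf("dense:  all %d dimensions populated%n", dense.length);
    }
}
```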

What if we didn't have to choose between the two?
What if we could achieve the semantic depth of a dense vector while keeping the exact-match precision of a sparse index?
It sounds like a contradiction, but the answer lies in a technique that fundamentally changes how we represent language.
In Part 5, we will dismantle this dichotomy entirely.


