
The Best of Both Worlds: Learned Sparse Retrieval (SPLADE) For Entity Resolution

  • Writer: Gandhinath Swaminathan

In the previous post, we tried to force a truce between two opposing forces. We took BM25 (precise, rigid, lexical) and Dense Retrieval (fuzzy, semantic, vector-based) and glued them together with Reciprocal Rank Fusion (RRF).


It works. But it feels like a patch. You are running two separate search engines and hoping their results align.


The question is: Can we build a single model that has the precision of a keyword match and the understanding of a dense vector?


SPLADE (SParse Lexical AnD Expansion) is one of the most widely adopted attempts to answer that question.

Abstract bridge connecting a grid of text tokens to a cloud of vector embeddings.
SPLADE bridges the gap between rigid lexical matching and fuzzy dense vector search for entity resolution.

It takes the same transformer machinery used in dense embeddings and bends it back into a sparse, index‑friendly representation. The result?

  1. You still get an inverted index.

  2. You still see per‑term weights.

  3. However, the model learns how to expand and re-weight tokens using context, instead of relying on hand‑written rules or naive term counts.


The rest of this post explores what that actually means in practice. Specifically, when you are trying to decide whether:

  • “Sony Turntable – PSLX350H”

  • “Sony PS-LX350H Belt‑Drive Turntable”

  • “USB Stereo Record Player, PSLX350H, Black”

...are the same thing, or three different products your pricing agent can safely treat as unrelated.


Recap: The Dimensional Divide We're Stuck With Today

From earlier posts in this series:

  • Exact matching is unforgiving. One stray character and your join fails.

  • BM25 gives you lexical scoring with saturation and length normalization; excellent for identifiers and rare tokens.

  • Dense embeddings (HNSW, pgvector) give you meaning when words differ, but they blur sharp identity boundaries.

  • Hybrid search with RRF lets you reconcile BM25 and dense rankings without pretending their scores live on the same axis.

Infographic comparing BM25, dense vectors, and SPLADE as three different retrieval approaches.
BM25, dense vectors, and SPLADE occupy different points in the design space of representation, sparsity, and infrastructure.

Underneath all of that sits a hard split in how text is represented:

| Aspect | BM25 / Lexical Sparse | Dense Vectors |
| --- | --- | --- |
| Dimensionality | Very high (tens of thousands of terms) | Low (hundreds to a few thousand) |
| Sparsity | Almost all zeros | Every dimension has a value |
| Identity behavior | Exact term overlap only | No term notion; operates in embedding space |
| Infrastructure | Inverted index | ANN index (HNSW, IVF, etc.) |
| Explainability | Token-wise | Geometric; difficult to attribute |

Hybrid search stacked them together. It did not resolve that structural mismatch.


SPLADE tackles the representation itself: it keeps sparse vectors and inverted indexes, but lets a transformer decide:

  • which vocabulary terms should fire, and

  • how much each should matter for this specific piece of text.


That is why SPLADE matters in an entity‑resolution stack.


Why SPLADE Is Interesting for Entity Resolution

Entity resolution lives in a narrow corridor:

  • On one wall: Identifiers that behave like primary keys. “PS‑LX350H” either lines up or it doesn't.

  • On the other wall: Natural language noise. “USB stereo record player for vinyl lovers” is full of tokens that mean little for identity.


Traditional techniques force you to choose:

  • BM25: respects identifiers but ignores meaning when vocabulary shifts.

  • Dense vectors: understand “turntable” vs. “record player” but blur “PS‑LX350H” and “PS‑LX300USB” more than you’d like.


SPLADE sits between those extremes in three specific ways that matter for entity resolution.


  1. Expansion: Vocabulary Mismatch Without Blind Guesswork

Business text is full of near‑equivalents:

  • “Turntable” vs. “record player”

  • “PS‑LX350H” vs. “PSLX350H” vs. “PSLX350HBLK”

  • “Software Engineer” vs. “Developer” vs. “SWE”


SPLADE's expansion mechanism lets the model activate related terms as explicit sparse dimensions. For example, for the text “What Is The Capital Of France”,

the SPLADE representation assigns non-zero weights to dimensions corresponding to:

| Rank | Term | Value |
| --- | --- | --- |
| 1 | capital | 3.1276 |
| 2 | france | 2.9040 |
| 3 | french | 2.4578 |
| 4 | europe | 1.7790 |
| 5 | capitol | 1.7707 |
| 6 | city | 1.3520 |
| 7 | geography | 0.8338 |
| 8 | paris | 0.7244 |
| 9 | michel | 0.7105 |
| 10 | switzerland | 0.5428 |

even though “europe”, “paris”, “city”, and “capitol” don't appear in the raw string.
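
To make this concrete, here is a minimal sketch of how such a term/weight list can be produced with the Hugging Face transformers library. The checkpoint name is an assumption (any SPLADE-style model with an MLM head will do), and the weights it produces will differ from the table above.

```python
# Minimal sketch: list the top SPLADE expansion terms for a piece of text.
# The checkpoint below is an assumption; substitute your own SPLADE model.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "naver/splade-cocondenser-ensembledistil"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

def splade_terms(text: str, top_k: int = 10):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits              # (1, seq_len, vocab_size)
    # ReLU gate + log saturation, then max pooling over token positions
    weights = torch.log1p(torch.relu(logits)).amax(dim=1).squeeze(0)
    top = torch.topk(weights, top_k)
    terms = tokenizer.convert_ids_to_tokens(top.indices.tolist())
    return list(zip(terms, [round(v, 4) for v in top.values.tolist()]))

print(splade_terms("What Is The Capital Of France"))
```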


That matters when:

  • Retailer A calls it a “USB stereo record player”,

  • Retailer B calls it a “belt‑drive turntable”, and

  • Your ER pipeline needs to decide whether those SKUs refer to the same physical product.


SPLADE still gives you a postings list keyed by terms—some observed, some expanded—so you can inspect exactly which tokens carried the match.


  2. Reweighting: Teaching the Model What “Identity” Means in Your Domain

In classic BM25, token influence is driven by:

  • how frequent a term is in the corpus (IDF), and

  • how often it appears in the record (TF with saturation).

You can add manual boosts—e.g., weight “PS‑LX350H” more than “Sony”—but you end up in a thicket of hand‑tuned rules.


SPLADE moves that effort into training:

  • During learning, the model sees query–document pairs labeled as relevant or not.

  • The sparsity regularizer applies pressure to turn off unhelpful terms and turn up the few that matter.

  • Over time, domain‑specific signals (SKUs, customer IDs, physician license numbers) pick up more weight than generic words.


The effect for entity resolution:

  • Common tokens like “Record Player”, “Coffee”, “Street” get pushed toward the background.

  • Stable identifiers and rare phrases in your corpus absorb most of the impact.

  • Instead of manually telling the system “Always care more about SKU than description,” you let the model learn that pattern from your own labeled matches and non‑matches.


In practice, the hardest part is not the model; it is curating enough labeled “same entity” / “different entity” pairs—often using weak supervision from existing IDs and human review—to teach SPLADE what identity really means in your domain.
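
As a rough sketch of how that pressure is applied, the SPLADE papers pair a ranking objective with a FLOPS-style sparsity regularizer that penalizes vocabulary terms whose average activation across a batch is high. The snippet below shows the shape of that regularizer and a simplified training loss; the lambda values are illustrative, and SPLADE itself uses a contrastive loss with in-batch negatives and distillation rather than this bare margin loss.

```python
import torch
import torch.nn.functional as F

def flops_regularizer(weights: torch.Tensor) -> torch.Tensor:
    """FLOPS regularizer: sum over vocab terms of the squared mean activation
    across the batch. Frequently firing, rarely useful terms get pushed to zero.
    weights: (batch_size, vocab_size) SPLADE activations."""
    return (weights.mean(dim=0) ** 2).sum()

def training_loss(query_vecs, pos_doc_vecs, neg_doc_vecs,
                  lambda_q: float = 3e-4, lambda_d: float = 1e-4):
    # Dot-product relevance scores for labeled match / non-match pairs
    pos_scores = (query_vecs * pos_doc_vecs).sum(dim=1)
    neg_scores = (query_vecs * neg_doc_vecs).sum(dim=1)
    rank_loss = F.margin_ranking_loss(
        pos_scores, neg_scores, torch.ones_like(pos_scores), margin=1.0)
    sparsity = (lambda_q * flops_regularizer(query_vecs)
                + lambda_d * (flops_regularizer(pos_doc_vecs)
                              + flops_regularizer(neg_doc_vecs)))
    return rank_loss + sparsity
```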


  3. Infrastructure Fit: Learned Semantics Without Throwing Away Your Index

Despite being “neural,” SPLADE still produces sparse vectors over a fixed vocabulary. That means:

  • You can store document vectors in a standard inverted index.

  • You can use existing dynamic pruning techniques and query execution strategies.

  • Score computation is a sum over a small set of overlapping terms, not a giant dense dot‑product.


In practice, that gives you:

  • Better recall than BM25 on nuanced queries, especially where vocabulary mismatch is a problem.

  • Better latency than running a full dense retriever over your entire corpus.

  • A representation you can debug token‑by‑token when a match looks suspicious.


For entity resolution scenarios with millions of products, patients, or accounts, those three traits matter more than another point of academic benchmark score.
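
To see why the inverted index survives intact, here is a tiny, self-contained sketch (plain Python, no search engine, invented weights) of SPLADE document vectors turned into postings and a query scored as a sum over overlapping terms only.

```python
from collections import defaultdict

# Toy SPLADE outputs: term -> weight per record (values invented for illustration)
docs = {
    "sku_1": {"sony": 1.2, "turntable": 2.1, "pslx350h": 3.4, "record": 1.0},
    "sku_2": {"usb": 0.9, "record": 1.8, "player": 1.7, "pslx350h": 3.1},
    "sku_3": {"sony": 1.1, "walkman": 2.5, "cassette": 2.2},
}

# Standard inverted index: term -> postings list of (doc_id, weight)
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, weight in vec.items():
        index[term].append((doc_id, weight))

def score(query_vec):
    # Dot product restricted to the terms the query actually activates
    scores = defaultdict(float)
    for term, q_w in query_vec.items():
        for doc_id, d_w in index.get(term, []):
            scores[doc_id] += q_w * d_w
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(score({"pslx350h": 3.0, "turntable": 1.5, "record": 0.8}))
```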


Where SPLADE Can Go Wrong for Entity Resolution

  1. Tokenizers and Identifiers: Your SKUs Are Not Natural Language

Transformer tokenizers weren’t built for product codes or policy numbers. They were built for English.


That means:

  • “PS-LX350H” may get split into several sub-word pieces (e.g., fragments like “ps”, “lx”, “350”, “h”) rather than kept as a single token; see the sketch after this list.

  • Two adjacent SKUs (e.g., PS-LX350H vs. PS-LX300USB) can share a lot of sub‑tokens.

  • If the model over‑relies on these fragments, you risk blurring sharp boundaries between neighboring products.
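
The quickest sanity check is to run your identifiers through the same tokenizer your SPLADE backbone uses. The sketch below uses bert-base-uncased as a stand-in; the exact pieces you get back depend on the vocabulary.

```python
from transformers import AutoTokenizer

# bert-base-uncased is a stand-in; use whichever backbone your SPLADE model uses
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for sku in ["PS-LX350H", "PSLX350H", "PS-LX300USB"]:
    print(sku, "->", tokenizer.tokenize(sku))
# Neighboring SKUs typically share most of their sub-word pieces,
# which is exactly how sharp identity boundaries get blurred.
```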


  2. Expansion That Overreaches

Expansion is the headline feature—and also the main way to make a mess.


Example failure pattern:

  • A model trained largely on web search learns that “iPhone charger” is strongly associated with “Apple”, “Lightning cable”, and “USB adapter”.

  • In your entity system, third‑party accessories should never merge with first‑party devices or unrelated SKUs.

  • An aggressive expansion can light up “Apple iPhone” terms in accessory descriptions, nudging your matcher toward incorrect links.


For entity resolution, the risk of SPLADE going wrong is higher in domains where:

  • Product families and ecosystems share names heavily.

  • Vendor‑neutral entities share descriptive terms but differ in subtle identifiers.


  3. Latency and Index Size

SPLADE vectors are still sparse, but not as skinny as classic BM25 term vectors:

  • Queries may have tens or hundreds of non‑zero dimensions instead of a handful.

  • Documents may have many more active vocabulary entries due to expansion.


That creates three practical issues:

  • Index size grows; each document has more posting entries.

  • Query latency increases; more postings lists have to be touched.

  • Cache behavior can degrade; more random accesses through a larger index.


Even with dynamic pruning and careful engineering, you should expect SPLADE to land between BM25 and dense retrieval in terms of resource usage. For entity resolution workloads that already feed into agentic systems, that tradeoff can be acceptable—but it still needs to be measured, not assumed.


SPLADE v1, SPLADE v2, and How Fast This Space Is Moving

SPLADE sits in a family of learned sparse retrieval models (alongside approaches like DeepCT, doc2query, and uniCOIL), but has become one of the most widely adopted baselines for modern sparse search.


SPLADE, by itself, is not a single model either; it is already a small family.

  • SPLADE v1 (SIGIR 2021 short paper) introduced the core idea.

  • SPLADE v2 followed a few months later, refining the pooling strategy and improving efficiency.

  • Later work such as “Exploring the Representation Power of SPLADE Models” dug into why these learned sparse vectors work as well as they do, showing that SPLADE captures term interactions and semantics in ways that go beyond simple “weighted bag of words.”

  • Elastic's ELSER (Elastic Learned Sparse Encoder) is a production model built on a SPLADE‑style architecture, exposed through the text_expansion query in Elasticsearch.

  • By the mid‑2020s, major search platforms—including Elastic (ELSER), OpenSearch (neural sparse search), Qdrant, ParadeDB, and others—had shipped sparse retrieval infrastructure and models inspired by these ideas.


This is not a frozen specification by any means. Every year brings a new variant (SPLADE‑Doc, CSPLADE, OpenSearch doc‑v3 models, alternative regularizers). If you adopt learned sparse retrieval for entity resolution, you are stepping onto a moving walkway, not installing a one‑off feature.


How SPLADE Actually Works

Diagram showing text flowing through a transformer, ReLU and log-saturation, max pooling, and ending as a sparse vector
SPLADE turns text into sparse lexical vectors by combining transformers with ReLU gating, log-saturation, and max pooling.

SPLADE does not compress text into a dense vector of 768 floating-point numbers like BERT. Instead, it maps text onto the model's entire WordPiece vocabulary (roughly 30,000 dimensions).

Most of those dimensions are zero. That's why it's Sparse.


But for the non-zero dimensions, it does something clever. It repurposes the Masked Language Modeling (MLM) head of BERT.


In pre-training, BERT hides a word and tries to guess it. SPLADE looks at a visible word and asks: “What other words belong here?”


If it sees “Turntable”, it activates the dimension for “Record Player”. If it sees “France”, it activates “French” and “Paris”. It performs Implicit Expansion. It projects the semantic meaning of the word back into the sparse vocabulary space.


The Mechanism: Max Pooling

The secret sauce isn't just the expansion; it's how the algorithm aggregates it.


In our BM25 post, we discussed the critical role of the Saturation Parameter (k1) to create a saturation curve: the first mention is worth a lot, the second mention is worth less, and by the tenth mention, you hit a ceiling. SPLADE replaces this heuristic TF saturation curve with Max Pooling over token activations. Instead of summing up occurrences and trying to dampen them with a curve, SPLADE simply takes the single strongest activation in the document.


In SPLADE, Max Pooling is a saturation curve that flattens instantly.

  • 1 mention at high intensity = High Score.

  • 100 mentions at high intensity = Same High Score.


It decouples relevance from repetition.
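
A toy illustration of that flattening, with invented activation values:

```python
import torch

# Hypothetical activations for the term "turntable" at each token position
# where it fires: one mention vs. many repeated mentions.
one_mention   = torch.tensor([3.2])
many_mentions = torch.tensor([3.2, 3.1, 3.0, 3.2, 2.9, 3.1])

print(one_mention.max().item())    # 3.2
print(many_mentions.max().item())  # 3.2 -- repetition adds nothing
```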


The Gatekeeper: ReLU

In dense retrieval, there is effectively no such thing as a hard zero. Almost every document has some non-zero similarity to every query, because all vectors live in the same continuous space. A Cat document still has a tiny, non-zero relationship to Dog because they are both animals. This gray noise accumulates, hurting precision.


SPLADE uses ReLU (Rectified Linear Unit) to restore this Clean Zero to the neural world. The Transformer outputs logits—raw predictions that cover the entire number line from −∞ to +∞.


  • Positive Score (e.g. +3.5): The model is confident this concept is present.

  • Negative Score (e.g. -5.0): The model is confident this concept is irrelevant or wrong.


Without ReLU, that -5.0 is just another number. With ReLU, it becomes a hard zero.

This is the gatekeeper. It forces the model to be opinionated. It doesn't just lower the score for irrelevant terms (like a small weight in a dense vector); it deletes them. It creates the sparsity that allows us to use inverted-index style search instead of relying solely on dense vector search, which is typically more expensive at large scale.
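
Putting the pieces together, the weight SPLADE assigns to vocabulary term j for a given text is, in the max-pooled variant used from SPLADE v2 onward:

```latex
w_j = \max_{i \in \text{tokens}} \log\bigl(1 + \mathrm{ReLU}(z_{ij})\bigr)
```

where z_ij is the MLM logit that token position i produces for vocabulary term j. The ReLU supplies the hard zeros, the log tames extreme activations, and the max pooling is the instantly flattening saturation curve described above.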


Implementing the “Magic”: The SPLADE V3 Resolver

💡 Need the complete training & testing code? — drop a note to hello@minimalistinnovation.com


Concepts like "Implicit Expansion" and "ReLU Gating" often sound abstract in research papers, but they translate into surprisingly simple PyTorch operations.


While libraries like Sentence Transformers offer ready-made implementations for these models, the code below is designed to illustrate the concept. By building it from scratch, we can see exactly how the raw transformer output is shaped by ReLU (to enforce sparsity), Log-Saturation (to dampen extreme values), and Max Pooling (to aggregate the sequence).


We also include an explain_match method to do what dense vector models cannot: show us exactly which words—real or expanded—drove the match.

Python code snippet defining the SpladeV3ExplainableResolver class. The code shows the forward pass converting BERT logits into sparse vectors using ReLU and log-saturation, followed by a custom method to calculate and visualize the specific term contributions to the similarity score.
PyTorch implementation of SpladeV3ExplainableResolver, demonstrating the core sparse expansion logic and the explain_match auditing method.
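
Since the snippet above appears only as an image, here is a minimal sketch of what such a class can look like, following the description in the caption. The class and method names mirror the figure, the checkpoint name is an assumption, and the original implementation may differ in detail.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

class SpladeV3ExplainableResolver:
    """Sketch: SPLADE-style encoder plus a token-level match explainer."""

    def __init__(self, model_name: str = "naver/splade-v3"):  # assumed checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForMaskedLM.from_pretrained(model_name)
        self.model.eval()

    @torch.no_grad()
    def encode(self, text: str) -> torch.Tensor:
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        logits = self.model(**inputs).logits                 # (1, seq_len, vocab)
        # ReLU gate -> hard zeros, log1p -> saturation, mask out padding
        weights = torch.log1p(torch.relu(logits)) * inputs["attention_mask"].unsqueeze(-1)
        return weights.amax(dim=1).squeeze(0)                # max pool -> (vocab,)

    def explain_match(self, text_a: str, text_b: str, top_k: int = 10):
        vec_a, vec_b = self.encode(text_a), self.encode(text_b)
        contributions = vec_a * vec_b                        # per-term contribution
        total = contributions.sum().item()
        top = torch.topk(contributions, top_k)
        terms = self.tokenizer.convert_ids_to_tokens(top.indices.tolist())
        return total, [(t, round(v, 4)) for t, v in zip(terms, top.values.tolist())]

resolver = SpladeV3ExplainableResolver()
score, drivers = resolver.explain_match(
    "Sony Turntable - PSLX350H",
    "Sony PS-LX350H Belt-Drive Turntable")
print(score, drivers)
```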

Citation

@misc{lassance2024spladev3, title={SPLADE-v3: New baselines for SPLADE}, author={Carlos Lassance and Hervé Déjean and Thibault Formal and Stéphane Clinchant}, year={2024}, eprint={2403.06789}, archivePrefix={arXiv}, primaryClass={cs.IR}, copyright={Creative Commons Attribution Non Commercial Share Alike 4.0 International} }

Putting SPLADE Vectors Into Postgres with ParadeDB's svector

In the HNSW post, the examples stayed in dense‑vector land.


SPLADE gives you sparse vectors instead—so the natural question is: where do those live in a relational stack?


ParadeDB answered that by extending Postgres with a sparse vector type (svector) and an HNSW index designed for SPLADE‑style embeddings.

An illustration of creating svector in ParadeDB to store SPLADE vectors next to your source text.
Store SPLADE Vectors Next to Your Source Text in ParadeDB

When SPLADE Is the Right Tool (and When It Isn't)

Use SPLADE when:

  • Multiple sources use different vocabularies for the same entity.

  • Identifiers exist but are incomplete or messy.

  • You already operate an inverted-index stack (Elasticsearch, OpenSearch, ParadeDB).

  • You need to explain why a match happened.


Skip SPLADE when:

  • A shared, clean identifier already covers 95%+ of the work.

  • Most signal lives in dense codes that tokenizers mangle (without custom vocab work).

  • Latency or index footprint is absolutely constrained.


Practical: Introduce SPLADE as a candidate generator before relying on it for final decisions. Prove it on recall and cluster quality first.


The Verdict

We started this series looking at the cost of fragmented identity. We moved through the rigid world of exact matching, the fuzzy world of vectors, and the complex world of hybrid search.


SPLADE represents the convergence.


It is semantic, because it uses Transformers to understand meaning. It is lexical, because it maps that meaning back to words. It is interpretable, because you can look at the vector and see exactly why a match happened.


For Entity Resolution, where “almost” is usually “wrong,” this control is essential. You don't have to trust a black-box embedding. You can see the tokens, you can weigh the expansion, and you can resolve identity with precision.


Next Up

SPLADE gives you learned sparse expansions that surface high-quality candidate edges, but pairwise similarity still does not tell you which edges hold once the whole entity graph is in view.


Next, we move to attention-based Graph Neural Networks (GAT, HierGAT, AutoGAT) and use SPLADE signals as node/edge features so the model can score neighbors in context, not in isolation.


The focus is reducing false merges that look “close” in text, while keeping the connections that stay consistent across the broader graph.


Series: Entity Resolution


