Data Harmonization

Why Probabilistic Record Linkage Still Matters

Probabilistic record linkage still matters because identity data is messy and match decisions carry real financial and compliance risk. This article explains the intuition behind Fellegi–Sunter and Bayesian record linkage, shows how they control false merges and splits across noisy customer and product records, and points to modern tools and books that help you put these ideas into practice.

Gandhinath Swaminathan

2 days ago · 5 min read

Heterogeneous knowledge graph diagram showing product entity resolution with typed nodes (mentions, organizations, models, attributes) connected by colored relationship edges (madeby, hasmodel, hasattr). Multiple convergent paths highlighted between two mentions, illustrating multi-hop reasoning for entity matching.

Heterogeneous Knowledge Graphs: Multi-Hop Reasoning Beyond Pairwise Matching

Pairwise matching treats each comparison as a one-off. A persistent knowledge graph turns product mentions, manufacturers, model numbers, attributes, and price bins into typed nodes and relations. Matching becomes neighborhood comparison: multi-hop paths (convergent evidence) can beat any single similarity score.

Gandhinath Swaminathan

2 days ago · 7 min read

Feature illustration of a Sony PS‑LX350H turntable with SPLADE token weights on the left and a token‑to‑token attention graph on the right, showing sparse retrieval turning into an entity-resolution decision.

From Inverted Index to Attention Graph: Turning SPLADE Tokens Into ER Decisions

False entity merges don’t just dirty data. They distort inventory, pricing, and forecasts, then every model and report built on top. Learned sparse retrieval improves recall, but it can still treat records like unordered tokens. This post adds token-to-token attention as a structural check so near-duplicates pass and lookalikes fail, with a trail you can audit.

Gandhinath Swaminathan

3 days ago · 3 min read

The Best of Both Worlds: Learned Sparse Retrieval (SPLADE) For Entity Resolution

Entity resolution breaks when exact matching is too brittle and dense vectors blur identities. This post introduces SPLADE, a learned sparse retrieval model that keeps inverted indexes and token-level explainability while adding transformer-powered expansion and reweighting. We walk through where SPLADE beats BM25 and dense search, where it can fail on SKUs and over-expansion, and how to run it in Postgres/ParadeDB for large-scale product, customer, or patient identity.

Gandhinath Swaminathan

3 days ago · 10 min read

Abstract visualization showing geometric lexical data patterns merging with flowing semantic vector networks, with particles fusing at the convergence point, representing hybrid search combining BM25 and vector similarity.

Hybrid Search and Reciprocal Rank Fusion: Building the Bridge Between Lexical and Semantic

Entity resolution struggles when systems must choose between the rigid precision of BM25 and the fuzzy flexibility of Vector Search. Part 4 reveals why simple linear weighting fails and introduces Reciprocal Rank Fusion (RRF) as the superior alternative. We explore the architectural shift to Hybrid Search, demonstrating how to merge rank positions rather than raw scores using Spring Boot and ParadeDB.

Gandhinath Swaminathan

Jan 14 · 7 min read
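
As a rough illustration of the rank-based merge that the hybrid search teaser above describes, here is a minimal Python sketch of Reciprocal Rank Fusion. It is not the post's Spring Boot and ParadeDB implementation. The function name, the sample SKU identifiers, and the constant k=60 (a commonly used value) are assumptions made for this example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs using rank positions only.

    Each list contributes 1 / (k + rank) for every ID it contains,
    so raw BM25 and vector scores never need to share a scale.
    k=60 is an assumed, commonly used smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical candidate lists from a lexical and a vector retriever.
bm25_ranking = ["sku-1", "sku-2", "sku-3"]
vector_ranking = ["sku-3", "sku-1", "sku-4"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Records that rank well in both lists rise to the top, and a record seen by only one retriever still stays in the fused list as a candidate.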

Warehouse worker scanning package labels with a handheld barcode scanner.

When “Almost” Isn’t Good Enough: Why Top Engineers Still Rely On BM25

BM25 looks old on paper, but it still decides which records are worth comparing when identifiers can’t afford to be “almost” right. This post walks through the TF‑IDF roots of BM25, how k1 and b shape the scoring curve, and why Lucene, Elasticsearch, and OpenSearch still rely on it. You’ll see how term statistics, not embeddings, keep product codes, SKUs, and customer records anchored during entity resolution.

Gandhinath Swaminathan

Jan 8 · 5 min read
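
To make the role of k1 and b from the BM25 teaser above concrete, the sketch below scores a single term with the textbook BM25 formula. It is an illustration under stated assumptions, not code from the post or from Lucene. The function name is hypothetical, the IDF value is passed in rather than computed, and k1=1.2 and b=0.75 are commonly cited defaults.

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document's score.

    k1 controls how quickly repeated occurrences saturate.
    b controls how strongly long documents are penalized
    (b=0 disables length normalization, b=1 applies it fully).
    """
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)

# Assumed toy values: one term, IDF fixed at 1.0.
average_len = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, idf=1.0)
long_doc = bm25_term_score(tf=1, doc_len=200, avg_doc_len=100, idf=1.0)
repeated = bm25_term_score(tf=10, doc_len=100, avg_doc_len=100, idf=1.0)
```

With these inputs the longer document scores lower than the average-length one for the same term frequency, and ten occurrences stay below the ceiling of idf * (k1 + 1), which is the saturation behavior the teaser refers to.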

Diagram showing a single Sony turntable model with three conflicting names and SKU codes as it appears across CRM, inventory management, and pricing systems, illustrating how product fragmentation creates mismatched records.

How One Invisible Data Problem Quietly Destroys Your Churn Models, Your Pricing, and Your AI Agents

Healthcare providers track the same patient under five name variations. Retailers can't tell when the same SKU is under two different codes. CPG companies buy demand data showing one product with three different names across channels. Supply chains have suppliers that are actually the same company. Every week. Same problem. Different domain. Your data doesn't know what it's describing.

Gandhinath Swaminathan

Jan 2 · 6 min read