How Data Structures Build the Bridge from Exact Matching to Semantic Search
- Gandhinath Swaminathan

- Jan 5
- 8 min read
In my last post, we dissected the cost of fragmented identity. We saw how disconnected data silos create a chaotic, fractured view of the world. But merging those silos isn’t just about moving data—it’s about understanding it.
The question becomes: How do you search when you're looking for similarity instead of equality?
That answer lives in data structures. Not any one structure, but a path through them. Each step up that path solves a problem the previous one couldn’t.

The Baseline: Linked Lists and Linear Suffering
Imagine you have a list. Every item points to the next. Want to find something? Start at the beginning. Check the first item. Then the second. Then the third. Keep going until you find what you’re looking for.
This is O(N). You might check every single item.
In vector search terms, this is a flat index: to answer a query, you compute the distance from the query to every stored vector and keep the closest ones.
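A minimal sketch of that flat scan, using toy 2-D points and plain Euclidean distance (both are stand-ins for real embeddings and a real metric):

```python
import math

def flat_search(vectors, query, k=1):
    """Brute-force nearest-neighbor search: score every vector, keep the best k."""
    scored = []
    for idx, vec in enumerate(vectors):      # O(N): touch every single vector
        dist = math.dist(vec, query)         # Euclidean distance to the query
        scored.append((dist, idx))
    scored.sort()                            # closest first
    return scored[:k]

# Toy example: four 2-D vectors, one query.
vectors = [(0.1, 0.9), (0.8, 0.2), (0.4, 0.5), (0.9, 0.9)]
print(flat_search(vectors, query=(0.5, 0.5), k=2))
```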

The First Leap: Skip Lists and Probabilistic Shortcuts
I still remember the first time this really landed for me: sitting in a Microsoft interview more than 20 years ago, getting hit with a deceptively simple question about “beating” linear search in an ordered list. William Pugh’s 1990 innovation, the skip list, does exactly that.
Skip lists say: “We’ll keep some items at a higher level. When we search, we jump along the high level first. When we overshoot, we drop down and get precise.”
Imagine a sorted list. But there's also a second tier above it—containing every fourth element. Above that, a third tier with every sixteenth element. When you search, you hop along tier 3 until you’re close, then drop to tier 2, then to tier 1 for precision.
The level structure is probabilistic, so O(log N) is an expected bound rather than a worst-case guarantee, but in practice you go from O(N) to O(log N).
The magic is layering. You create a hierarchy of “express lanes.” The higher you go, the sparser the data. But those sparse levels let you skip huge chunks of the search space.
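Here’s a toy sketch of that descent. The express lanes are hard-coded samples of the base list rather than built by coin flips, which is all we need to show the hops and drops:

```python
def skip_search(levels, target):
    """Search a sorted list through pre-built 'express lanes'.

    levels[0] is the full sorted list; higher levels are sparser samples of it.
    This toy hard-codes the levels instead of building them probabilistically.
    """
    pos = 0                                          # index into the current lane
    for level in reversed(range(len(levels))):       # start at the sparsest lane
        lane = levels[level]
        while pos + 1 < len(lane) and lane[pos + 1] <= target:
            pos += 1                                 # hop forward until we would overshoot
        if level > 0:
            # Drop down: find where this value sits in the denser lane below.
            pos = levels[level - 1].index(lane[pos])
    return levels[0][pos] == target

data   = list(range(0, 64, 2))           # sorted base level: 0, 2, 4, ..., 62
levels = [data, data[::4], data[::16]]   # every 4th element above it, every 16th above that
print(skip_search(levels, 38))           # True
print(skip_search(levels, 37))           # False
```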

The Pivot: Graphs and the Small World Problem
Here’s a strange phenomenon: In many real networks, every node is close to every other node. Your social graph. Power grids. The internet. High local clustering (you’re tightly connected to your friends), yet short average distances between strangers.
In 1998, Watts and Strogatz published their seminal paper formalizing the Small World Network model, characterized by high clustering yet short average path lengths.
This property turns out to be exactly what vector search needs. Connect each vector to its nearest neighbors. Start from any point and greedily walk toward whichever neighbor is closer to your query. The small world structure means you usually converge on the right region quickly.
This is the Navigable Small World (NSW) algorithm. It beats flat search by orders of magnitude on latency.
But it has a fatal flaw: greedy routing gets trapped. Start at a bad entry point, or wander into a local cluster that isn’t the right one, and the walk stalls in a local minimum. The graph also scales worse than it should: early insertions create long-range connections, later insertions create only local ones, and there is no hierarchy organizing them.
You’ve traded flat search’s simplicity for a graph that’s only sometimes fast.
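A stripped-down sketch of that greedy walk over a hand-built neighbor graph (the graph, the vectors, and the entry point are all illustrative):

```python
import math

def greedy_walk(graph, vectors, query, entry):
    """Greedy routing in a neighbor graph: keep moving to whichever neighbor is
    closer to the query; stop when no neighbor improves on the current node."""
    current = entry
    current_dist = math.dist(vectors[current], query)
    while True:
        best, best_dist = current, current_dist
        for neighbor in graph[current]:
            d = math.dist(vectors[neighbor], query)
            if d < best_dist:
                best, best_dist = neighbor, d
        if best == current:                  # local minimum: no neighbor is closer
            return current
        current, current_dist = best, best_dist

# Toy graph: node -> list of neighbor ids; vectors are 2-D points.
vectors = {0: (0.0, 0.0), 1: (0.5, 0.1), 2: (0.9, 0.2), 3: (0.9, 0.9)}
graph   = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_walk(graph, vectors, query=(1.0, 1.0), entry=0))   # walks 0 -> 1 -> 2 -> 3
```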
The Synthesis: Hierarchical Navigable Small Worlds
In 2016, Malkov and Yashunin published a synthesis: Hierarchical Navigable Small Worlds (HNSW). The core insight was deceptively simple: take the graph idea, but organize it in layers, exactly like a skip list.

The bottom layer (Layer 0) is a dense graph containing every vector. It’s the fine-grained layer where you get precision.
The layers above it are progressively sparser. Layer 1 has maybe 10% of the vectors. Layer 2 has 1%. The top layer might have just a handful.
Each layer is connected vertically—the same vector appears at multiple layers.
Now search works like this:
Start at the top layer, at some fixed entry point.
Walk toward neighbors closer to your query (this is fast because the graph is sparse).
When you can’t get any closer in this layer, drop down to the layer below.
Refine your search there, then keep moving.
Repeat until you reach Layer 0.
Finish with a high-precision search at Layer 0.
Since you navigate coarse-grained express lanes first, then progressively refine, search scales roughly as O(log N): the same complexity class as binary search or a B-tree, but for vectors in high dimensions.
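A toy sketch of that descent, one greedy walk per layer (the layers, entry point, and vectors are made up, and a real HNSW keeps a candidate list sized by ef_search rather than a single best node):

```python
import math

def greedy_walk(graph, vectors, query, entry):
    """Greedy routing within one layer (same idea as the NSW walk above)."""
    current = entry
    while True:
        closer = [n for n in graph[current]
                  if math.dist(vectors[n], query) < math.dist(vectors[current], query)]
        if not closer:
            return current                   # local minimum in this layer
        current = min(closer, key=lambda n: math.dist(vectors[n], query))

def hnsw_search(layers, vectors, query, entry):
    """Descend from the sparsest layer to Layer 0, using the best node found in
    each layer as the entry point for the denser layer below it."""
    current = entry
    for graph in reversed(layers):           # layers[-1] is the sparse top layer
        current = greedy_walk(graph, vectors, query, current)
    return current

vectors = {0: (0.0, 0.0), 1: (0.5, 0.1), 2: (0.9, 0.2), 3: (0.9, 0.9)}
layer0  = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # dense: every vector
layer1  = {0: [2], 2: [0, 3], 3: [2]}              # sparse "express lane" subset
print(hnsw_search([layer0, layer1], vectors, query=(1.0, 1.0), entry=0))  # -> 3
```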
HNSW introduces explicit structural parameters to control the trade-off between recall and performance:
M (edges per node): More edges = better recall, more memory.
ef_construction: How deep to search during insertion. Higher = better graph quality, slower builds.
ef_search: How much exploration during queries. Higher = better recall, slower queries.
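In pgvector, those knobs go straight into the index definition. A minimal sketch from Python via psycopg2, assuming a hypothetical products table with an embedding vector(384) column and a local connection string:

```python
import psycopg2

# Assumptions: a local Postgres with the pgvector extension available and a
# hypothetical "products" table that has an embedding vector(384) column.
conn = psycopg2.connect("dbname=catalog user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE INDEX IF NOT EXISTS products_embedding_hnsw
        ON products
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 24, ef_construction = 200);
    """)
```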


The m = 24 means each vector connects to 24 neighbors (default is 16). The ef_construction = 200 tells PostgreSQL to explore deeply during index build—taking time upfront to create a higher-quality graph.

This is powerful: you can dial in exactly the recall you need. Need 95% recall? Turn up ef_search. Need sub-millisecond queries? Turn it down.
The Other Approaches: Hash Buckets and Clustering
While HNSW became the mainstream choice, two other ideas compete for the same job.
LSH (Locality-Sensitive Hashing)
Hash the vector. Vectors that hash to the same bucket are probably similar. Retrieve that bucket.
It’s memory-efficient. It’s parallelizable. It works for streaming data.
However, in high dimensions it breaks down: truly similar vectors stop landing in the same bucket reliably, so you need dozens of hash tables to get reasonable recall, and that overhead kills the advantage.
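A toy sketch of the random-hyperplane flavor of LSH: one bit per hyperplane, and nearby vectors tend to share bits. Real systems keep many such tables; this keeps one:

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_bucket(vector, hyperplanes):
    """Hash a vector to a bucket: one bit per random hyperplane, set by which
    side of the hyperplane the vector falls on."""
    bits = (hyperplanes @ vector) > 0
    return tuple(bits.tolist())

dim, n_planes = 384, 8
hyperplanes = rng.standard_normal((n_planes, dim))   # one random hyperplane per bit

a = rng.standard_normal(dim)
b = a + 0.01 * rng.standard_normal(dim)              # a near-duplicate of a
c = rng.standard_normal(dim)                         # an unrelated vector

print(lsh_bucket(a, hyperplanes) == lsh_bucket(b, hyperplanes))   # likely True
print(lsh_bucket(a, hyperplanes) == lsh_bucket(c, hyperplanes))   # likely False
```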
FAISS (Facebook AI Similarity Search)
Its inverted-file (IVF) indexes use K-Means clustering to partition the space into Voronoi cells. To search, you find the closest cells and scan only those.
FAISS combines this with Product Quantization, compressing vectors into small byte codes. That lets you fit a billion vectors in memory, which plain HNSW over full-precision vectors cannot.
But FAISS requires a training phase (running K-Means). It’s less dynamic. For datasets that stay within RAM (up to ~50 million vectors), HNSW usually wins on latency.
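A minimal sketch of that recipe with the faiss library, indexing random vectors just to show the moving parts (nlist, m, and nbits here are illustrative, untuned values):

```python
import faiss
import numpy as np

d, n = 384, 100_000
rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype("float32")   # vectors to index
xq = xb[:5] + 0.01                                   # a few near-duplicate queries

nlist, m, nbits = 1024, 48, 8                        # Voronoi cells, PQ sub-vectors, bits each
quantizer = faiss.IndexFlatL2(d)                     # coarse quantizer that defines the cells
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(xb)          # K-Means training phase: the step HNSW doesn't need
index.add(xb)            # compress and assign every vector to its cell
index.nprobe = 16        # how many nearby cells to scan per query

distances, ids = index.search(xq, 5)
print(ids[0])            # approximate nearest neighbors of the first query
```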
Why Algorithm Architecture Matters: Entity Resolution as the Real Test
| Approach | How It Works | Best For | Trade-off |
| --- | --- | --- | --- |
| HNSW | Hierarchical graph layers | <100M vectors, real-time queries | High memory overhead |
| LSH | Hash buckets | Low memory, streaming data | Struggles at high dimensions |
| FAISS | Cluster pruning + compression | Billions of vectors | Requires training, less dynamic |
| Flat Search | Exhaustive scan | Perfect recall needed | O(N) latency—unusable at scale |
Look at that table again. Only HNSW has a graph structure. This isn’t academic—it’s the difference between solving your fragmented identity problem and just finding similar items.
Entity resolution requires matching records across systems. You have a Sony turntable in System A. You need to find the exact match in System B, despite wildly different descriptions. It’s not about finding all similar products. It’s about finding the one.
LSH buckets are stateless. Hash the vector. Get a bucket. Retrieve contents. No traversal possible. If the query hashes poorly, you miss the match. You can’t walk through the neighborhood to refine.
FAISS clusters are defined by centroids. Find the closest centroid. Scan that cluster. But there's no graph connecting clusters. No way to explore adjacent clusters if your match sits just outside the boundary. Centroid-based search moves toward centers, not through neighborhoods.
HNSW is a graph. Start at one point. Walk to neighbors closer to your target. Those neighbors connect to their neighbors. You’re traversing a connected space. If a match sits one hop away, you’ll find it.
This is why HNSW became standard for entity resolution. When you need to match records across fragmented systems, you’re not just finding the closest vector—you’re resolving identity through graph exploration. A graph lets you navigate neighborhoods. Buckets and clusters don’t.
HNSW inspired the entire ecosystem of graph-based vector databases. Qdrant, Weaviate, Milvus—they all use HNSW under the hood because the algorithm works for the specific problem of resolving fragmented identities.
Embeddings: Bringing Semantics into the Distance Function
There’s one piece we haven’t discussed: embeddings.
HNSW doesn’t care what the vectors represent. It just computes distances. The magic comes from what goes into those vectors.
Here’s the Sony turntable problem again. You have three text strings:
Sony Turntable - PSLX350H
Sony PS-LX350H Belt-Drive Turntable
Sony Record Player, Model PS-LX350H, Black
String similarity metrics fail. They’re looking at character overlap. Edit distance. But “turntable” and “record player” have zero overlap, yet they mean the same thing.
Embeddings solve this. Run each string through a transformer model (BERT, SentenceTransformer, etc.). Get back a 384-dimensional vector that encodes meaning. The model has learned—from training on billions of texts—that “turntable” and “record player” point in similar directions in that space.
Now all three Sony strings get mapped to vectors close together. Vector distance now measures semantic similarity, not just string similarity.
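A minimal sketch with the sentence-transformers library. The model name all-MiniLM-L6-v2 is an assumption here; it is one common choice that happens to output 384-dimensional vectors:

```python
from sentence_transformers import SentenceTransformer

# One common 384-dimensional model; any sentence-embedding model works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

strings = [
    "Sony Turntable - PSLX350H",
    "Sony PS-LX350H Belt-Drive Turntable",
    "Sony Record Player, Model PS-LX350H, Black",
    "Stainless steel kitchen sink",                 # an unrelated product for contrast
]
embeddings = model.encode(strings, normalize_embeddings=True)   # shape: (4, 384)

# With normalized vectors, the dot product is cosine similarity.
sims = embeddings @ embeddings.T
print(sims[0][1], sims[0][2])   # the three Sony strings score close to each other
print(sims[0][3])               # the sink scores much lower against the turntable
```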
You pair embeddings (which add semantic meaning) with HNSW (which finds nearest neighbors at speed) and suddenly your database can solve entity resolution. Fragmented identity becomes visible.
This is why FAISS includes papers on embeddings, and why every vector database assumes you’re using embeddings. The structure and the semantics are intertwined.
pgVector: Bringing Vector Search into Your Existing Database
PostgreSQL is a relational database. It’s built for structured data, exact queries, ACID guarantees. For decades, it was the opposite of what AI wanted.
Then someone asked: Why are these separate?
pgvector is the most widely adopted PostgreSQL extension for vector search, adding vector types and HNSW indexing directly into the database. While ParadeDB brings FAISS algorithms to Postgres, pgvector’s HNSW implementation has become the standard—proven in production across thousands of deployments.
Instead of this:
Store your product data in Postgres
Extract it to a Python script
Run embeddings locally or via API
Write vectors to a separate vector database
Write glue code to join results back to your original data
Hope everything stays in sync
You get this:
Store product data in Postgres
Add a vector(384) column for embeddings
Create an HNSW index
Query vectors using the <=> operator (cosine distance)
Join vector results directly to your product data
One database. No sync headaches. ACID transactions on your vector data.
Here’s how that plays out at query time.
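A sketch of that round trip from Python via psycopg2. The products table, its columns, and the connection string are hypothetical; the <=> operator and the hnsw.ef_search setting are standard pgvector:

```python
import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")          # same 384-dim model as above
vec = model.encode("Sony PS-LX350H record player")
vec_literal = "[" + ",".join(str(x) for x in vec) + "]"  # pgvector's text format

conn = psycopg2.connect("dbname=catalog user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("SET hnsw.ef_search = 100;")             # query-time recall knob
    cur.execute(
        """
        SELECT sku, name, embedding <=> %s::vector AS cosine_distance
        FROM products
        ORDER BY embedding <=> %s::vector                 -- served by the HNSW index
        LIMIT 5;
        """,
        (vec_literal, vec_literal),
    )
    for sku, name, distance in cur.fetchall():
        print(sku, name, round(distance, 4))
```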

That’s the full journey. From flat linear search → hierarchical graphs → semantic embeddings → practical database integration.
Why This Matters for Your Data
You don’t have time for separate databases. Your data is already in PostgreSQL or whatever system you're using. You need to find the semantic matches without ripping everything out.
The data structure progression we traced—from linked lists to HNSW—wasn't academic. Each step was solving a real performance problem. Flat search was too slow. Skip lists got partway there. Small World graphs worked but were fragile. HNSW nailed it.
Then embeddings added the semantic layer.
Now pgvector brings it all together in one system.
Your fragmented identity problem—the Sony turntable living under three names—doesn’t get solved by better string matching. It gets solved by understanding that three different text descriptions mean the same thing. HNSW finds them fast because the embedding model put them close together.
That’s the bridge. Structure + semantics + practical implementation.
