
How Data Structures Build the Bridge from Exact Matching to Semantic Search

  • Writer: Gandhinath Swaminathan
  • Jan 5
  • 8 min read

In my last post, we dissected the cost of fragmented identity. We saw how disconnected data silos create a chaotic, fractured view of the world. But merging those silos isn’t just about moving data—it’s about understanding it.


The question becomes: How do you search when you're looking for similarity instead of equality?


That answer lives in data structures. Not any one structure, but a path through them. Each step up that path solves a problem the previous one couldn't.


Infographic timeline showing the evolution from Linked Lists to Skip Lists, Small World Graphs, and HNSW, highlighting the progression in search efficiency and method.
Visualizing the evolution of data structures, from O(N) linear search to sophisticated semantic navigation with HNSW.

The Baseline: Linked Lists and Linear Suffering

Imagine you have a list. Every item points to the next. Want to find something? Start at the beginning. Check the first item. Then the second. Then the third. Keep going until you find what you're looking for.


This is O(N). You might check every single item.


In vector search terms, this is what happens with a flat index: a brute-force scan over every stored vector.

Diagram of a linked list showing a red linear search path from node 1 to 8, with the target node 8 highlighted. Text indicates O(N) complexity.
Illustrating a linear sequential scan in a linked list, demonstrating O(N) time complexity.
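To make the baseline concrete, here's a minimal sketch of a flat scan in Java: compute the distance to every stored vector and keep the closest. The in-memory float[][] store and the cosine-distance helper are illustrative choices, not any particular library.

```java
// Flat (brute-force) nearest-neighbor search: O(N) distance computations per query.
public final class FlatScan {

    // Cosine distance = 1 - cosine similarity; smaller means more similar.
    static double cosineDistance(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scan every stored vector; keep the index of the closest one.
    static int nearest(float[][] store, float[] query) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < store.length; i++) {        // every item is visited: O(N)
            double d = cosineDistance(store[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```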

The First Leap: Skip Lists and Probabilistic Shortcuts

What if you could skip ahead instead of checking every node one by one? William Pugh's 1990 innovation, the skip list, did exactly that.


Skip lists say: “We'll keep some items at a higher level. When we search, we jump along the high level first. When we overshoot, we drop down and get precise.”


I still remember the first time this really landed for me: sitting in a Microsoft interview more than 20 years ago, getting hit with a deceptively simple question about “beating” linear search in an ordered list.


Imagine a sorted list. But there's also a second tier above it—containing every fourth element. Above that, a third tier with every sixteenth element. When you search, you hop along tier 3 until you're close, then drop to tier 2, then to tier 1 for precision.


The structure is probabilistic rather than perfectly balanced, but it's fast: expected search time drops from O(N) to O(log N).


The magic is layering. You create a hierarchy of “express lanes.” The higher you go, the sparser the data. But those sparse levels let you skip huge chunks of the search space.

Multi-layered skip list diagram with a red zigzag search path, starting from the top sparse layer and drilling down to the bottom dense layer to find a target.
A skip list structure using multi-layered "express lanes" for faster, O(log N) search.
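Here's a minimal sketch of that search walk in Java, over a prebuilt skip list. The node layout (right and down pointers) and the top-left head sentinel are assumptions for illustration; insertion and the coin-flip level assignment are left out.

```java
// Skip-list search sketch: hop along sparse upper levels, drop down when you overshoot.
final class SkipNode {
    final int key;
    SkipNode next;   // same level, to the right
    SkipNode down;   // same key, one level below
    SkipNode(int key) { this.key = key; }
}

final class SkipListSearch {
    // Start at the top-left head sentinel and walk right/down toward the target.
    static boolean contains(SkipNode head, int target) {
        SkipNode node = head;
        while (node != null) {
            // Move right along the express lane while we stay at or below the target.
            while (node.next != null && node.next.key <= target) {
                node = node.next;
            }
            if (node.key == target) return true;  // found it on this level
            node = node.down;                     // about to overshoot: drop to a denser level
        }
        return false;
    }
}
```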

The Pivot: Graphs and the Small World Problem

Here's a strange phenomenon: in many real networks, every node is close to every other node. Your social graph. Power grids. The internet. High local clustering (you're tightly connected to your friends), yet short average distances between strangers.


In 1998, Watts and Strogatz published their seminal paper formalizing the Small World Network model, characterized by high clustering yet short average path lengths.


This property turns out to be exactly what vector search needs. Connect each vector to its nearest neighbors. Start from any point and greedily walk toward neighbors closer to your query. The small-world structure means short paths exist, so the walk usually converges on the right region quickly.


This is the Navigable Small World (NSW) algorithm. It beats flat search by orders of magnitude on latency.
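A sketch of that greedy walk, assuming an in-memory adjacency map and embedding map as stand-ins for a real index:

```java
import java.util.List;
import java.util.Map;

// Greedy routing over a small-world graph: always step to the neighbor closest to the query.
final class GreedyWalk {

    static int greedySearch(Map<Integer, List<Integer>> graph,   // node id -> neighbor ids
                            Map<Integer, float[]> vectors,       // node id -> embedding
                            int entryPoint,
                            float[] query) {
        int current = entryPoint;
        double currentDist = dist(vectors.get(current), query);
        while (true) {
            int bestNeighbor = -1;
            double bestDist = currentDist;
            for (int neighbor : graph.getOrDefault(current, List.of())) {
                double d = dist(vectors.get(neighbor), query);
                if (d < bestDist) {            // a neighbor is strictly closer: move toward it
                    bestDist = d;
                    bestNeighbor = neighbor;
                }
            }
            // No neighbor improves: a local minimum. This is exactly where greedy routing can get trapped.
            if (bestNeighbor == -1) return current;
            current = bestNeighbor;
            currentDist = bestDist;
        }
    }

    private static double dist(float[] a, float[] b) {  // cosine distance, as in the flat-scan sketch
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```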


But it has a fatal flaw: greedy routing gets trapped. Start at the wrong entry point, or navigate into a local cluster that's not the right one, and you're stuck. The graph also scales worse than it should: early insertions create long-range connections, later insertions create local clusters, and there's no hierarchy imposing structure.


You've traded flat search's simplicity for a graph that's only sometimes fast.


The Synthesis: Hierarchical Navigable Small Worlds

In 2016, Malkov and Yashunin published a synthesis: Hierarchical Navigable Small Worlds (HNSW). The core insight was deceptively simple—layer a graph structure exactly like a skip list.


Take the graph idea. But organize it in layers, like a skip list.

3D hierarchical graph diagram (HNSW) showing a query vector's search path cascading through layers from sparse top to dense bottom, ending at a target vector.
Hierarchical Navigable Small Worlds (HNSW) graph, enabling efficient semantic search through a layered structure.

The bottom layer (Layer 0) is a dense graph containing every vector. It's the fine-grained layer where you get precision.


The layers above it are progressively sparser. Layer 1 has maybe 10% of the vectors. Layer 2 has 1%. The top layer might have just a handful.


Each layer is connected vertically—the same vector appears at multiple layers.


Now search works like this:

  1. Start at the top layer, at some fixed entry point.

  2. Walk toward neighbors closer to your query (this is fast because the graph is sparse).

  3. When you can't get any closer on this layer, drop down to the layer below.

  4. Refine your search there. Move again.

  5. Repeat, dropping down a layer each time, until you reach Layer 0.

  6. Finish with a high-precision search at Layer 0.


Since you're navigating coarse-grained express lanes first, then progressively refining, search complexity is O(log N): the same as binary search or a B-Tree, but for vectors in high dimensions.
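Here's a conceptual sketch of that descent in Java. It keeps a single current point per layer rather than the candidate beam (ef_search) that real HNSW implementations maintain, and the per-layer adjacency maps are illustrative stand-ins.

```java
import java.util.List;
import java.util.Map;

// HNSW-style layered descent: greedy-walk each sparse layer, then drop down and refine.
final class LayeredSearch {

    // layers.get(l): adjacency list for layer l (layer 0 = dense bottom layer with every vector).
    static int search(List<Map<Integer, List<Integer>>> layers,
                      Map<Integer, float[]> vectors,
                      int entryPoint,
                      float[] query) {
        int current = entryPoint;
        for (int level = layers.size() - 1; level >= 0; level--) {
            // When no neighbor on this layer is closer, the loop below stops and we drop a level.
            current = greedyOnLayer(layers.get(level), vectors, current, query);
        }
        return current;  // nearest neighbor found on Layer 0
    }

    private static int greedyOnLayer(Map<Integer, List<Integer>> graph,
                                     Map<Integer, float[]> vectors,
                                     int start, float[] query) {
        int current = start;
        double currentDist = dist(vectors.get(current), query);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int neighbor : graph.getOrDefault(current, List.of())) {
                double d = dist(vectors.get(neighbor), query);
                if (d < currentDist) { current = neighbor; currentDist = d; improved = true; }
            }
        }
        return current;
    }

    private static double dist(float[] a, float[] b) {  // cosine distance
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```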


HNSW introduces explicit structural parameters to control the trade-off between recall and performance:

  • M (edges per node): More edges = better recall, more memory.

  • ef_construction: How deep to search during insertion. Higher = better graph quality, slower builds.

  • ef_search: How much exploration during queries. Higher = better recall, slower queries.


Code (Java) demonstration of the full ingestion pipeline.
In practice, here’s what the full ingestion pipeline looks like—parsing CSV data, normalizing text, generating embeddings, and building the index.
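A simplified sketch of such a pipeline is shown below. The products table, the connection string, and the EmbeddingClient interface are illustrative assumptions; a real pipeline would use a proper CSV parser and whatever embedding model you have chosen.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Locale;

// Ingestion sketch: read a CSV of products, normalize the text, generate an embedding,
// and store it in a pgvector column via plain JDBC.
public final class IngestProducts {

    interface EmbeddingClient {
        float[] embed(String text);  // hypothetical wrapper around a sentence-embedding model
    }

    public static void ingest(String csvPath, EmbeddingClient embedder) throws Exception {
        String sql = "INSERT INTO products (name, description, embedding) VALUES (?, ?, ?::vector)";
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/catalog");
             PreparedStatement insert = conn.prepareStatement(sql);
             BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {

            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",", 2);          // naive CSV split, for the sketch only
                String name = cols[0];
                String description = normalize(cols.length > 1 ? cols[1] : "");

                float[] vector = embedder.embed(name + " " + description);

                insert.setString(1, name);
                insert.setString(2, description);
                insert.setString(3, toPgVectorLiteral(vector));  // e.g. "[0.12,-0.03,...]"
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }

    static String normalize(String text) {
        // Lowercase, strip stray punctuation, collapse whitespace before embedding.
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^a-z0-9 .-]", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    static String toPgVectorLiteral(float[] v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }
}
```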
Code (SQL) demonstration of the HNSW index creation.
Then the index creation happens.
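A minimal version of that step, run here through JDBC against the same illustrative products table, with the WITH options matching the values discussed just below:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Build the HNSW index on the embedding column using pgvector's cosine-distance opclass.
public final class BuildHnswIndex {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/catalog");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE INDEX products_embedding_hnsw " +
                "ON products USING hnsw (embedding vector_cosine_ops) " +
                "WITH (m = 24, ef_construction = 200)");
        }
    }
}
```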

The m = 24 means each vector connects to 24 neighbors (default is 16). The ef_construction = 200 tells PostgreSQL to explore deeply during index build—taking time upfront to create a higher-quality graph.



This is powerful: you can dial in exactly the recall you need. Need 95% recall? Turn up ef_search. Need sub-millisecond queries? Turn it down.
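In pgvector that knob is the hnsw.ef_search setting, which can be changed per session; a small illustrative helper:

```java
import java.sql.Connection;
import java.sql.Statement;

final class TuneEfSearch {
    // Raise hnsw.ef_search for higher recall, lower it for faster queries (per session).
    static void setEfSearch(Connection conn, int ef) throws Exception {
        try (Statement stmt = conn.createStatement()) {
            stmt.execute("SET hnsw.ef_search = " + ef);  // pgvector's default is 40
        }
    }
}
```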


The Other Approaches: Hash Buckets and Clustering

While HNSW became the mainstream choice, two other ideas compete.


LSH (Locality-Sensitive Hashing)

Hash the vector. Vectors that hash to the same bucket are probably similar. Retrieve that bucket.
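One common construction is random-hyperplane LSH for cosine similarity, sketched below. The single hash table and bit-key bucketing are simplifications of how production LSH is usually deployed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Locality-sensitive hashing with random hyperplanes: each hyperplane contributes one
// bit (which side of the plane the vector falls on), and the bits form the bucket key.
final class RandomHyperplaneLsh {
    private final float[][] hyperplanes;                        // numBits random directions
    private final Map<Integer, List<Integer>> buckets = new HashMap<>();

    RandomHyperplaneLsh(int numBits, int dim, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new float[numBits][dim];
        for (float[] h : hyperplanes)
            for (int d = 0; d < dim; d++) h[d] = (float) rng.nextGaussian();
    }

    int hash(float[] v) {
        int key = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0;
            for (int d = 0; d < v.length; d++) dot += hyperplanes[b][d] * v[d];
            if (dot >= 0) key |= (1 << b);                      // one bit per hyperplane
        }
        return key;
    }

    void add(int id, float[] v) {
        buckets.computeIfAbsent(hash(v), k -> new ArrayList<>()).add(id);
    }

    // Candidates = whatever landed in the query's bucket; similar vectors *probably* collide.
    List<Integer> candidates(float[] query) {
        return buckets.getOrDefault(hash(query), List.of());
    }
}
```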


It's memory-efficient. It's parallelizable. It works for streaming data.


However, in high dimensions it breaks down. Collisions between genuinely similar vectors become rare, so you need dozens of hash tables to get reasonable recall, and the overhead kills the advantage.


FAISS (Facebook AI Similarity Search)

Its inverted-file (IVF) indexes use K-Means clustering to partition the space into Voronoi cells. To search, you find the closest cells and scan only those.
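Conceptually, the search side looks like the sketch below: rank centroids by distance to the query, then scan only the nProbe closest cells. This is the idea behind cell-pruned search, not the FAISS API, and the in-memory maps are illustrative.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Cluster-pruned (IVF-style) search: probe only the cells whose centroids are nearest the query.
final class CellPrunedSearch {

    static float[] search(float[][] centroids,                  // produced by a K-Means training phase
                          Map<Integer, List<float[]>> cells,    // centroid index -> assigned vectors
                          float[] query, int nProbe) {
        // Rank centroids by distance to the query.
        Integer[] order = new Integer[centroids.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> dist(centroids[i], query)));

        // Scan only the nProbe closest cells instead of the whole dataset.
        float[] best = null;
        double bestDist = Double.MAX_VALUE;
        for (int p = 0; p < Math.min(nProbe, order.length); p++) {
            for (float[] v : cells.getOrDefault(order[p], List.of())) {
                double d = dist(v, query);
                if (d < bestDist) { bestDist = d; best = v; }
            }
        }
        return best;
    }

    private static double dist(float[] a, float[] b) {          // squared Euclidean distance
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }
}
```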


It combines this with Product Quantization, compressing vectors into small byte codes. That lets you fit a billion vectors in memory; HNSW can't do that.


But FAISS requires a training phase (running K-Means). It's less dynamic. For datasets that fit in RAM (up to roughly 50 million vectors), HNSW usually wins on latency.


Why Algorithm Architecture Matters: Entity Resolution as the Real Test

Approach    | How It Works                  | Best For                         | Trade-off
HNSW        | Hierarchical graph layers     | <100M vectors, real-time queries | High memory overhead
LSH         | Hash buckets                  | Low memory, streaming data       | Struggles at high dimensions
FAISS       | Cluster pruning + compression | Billions of vectors              | Requires training, less dynamic
Flat Search | Exhaustive scan               | Perfect recall needed            | O(N) latency; unusable at scale

Look at that table again. Only HNSW has a graph structure. This isn't academic; it's the difference between solving your fragmented-identity problem and just finding similar items.


Entity resolution requires matching records across systems. You have a Sony turntable in System A. You need to find the exact match in System B, despite wildly different descriptions. It's not about finding all similar products. It's about finding the one.


LSH buckets are stateless. Hash the vector. Get a bucket. Retrieve contents. No traversal possible. If the query hashes poorly, you miss the match. You can't walk through the neighborhood to refine.


FAISS clusters are defined by centroids. Find the closest centroid. Scan that cluster. But there's no graph connecting clusters. No way to explore adjacent clusters if your match sits just outside the boundary. Centroid-based search moves toward centers, not through neighborhoods.


HNSW is a graph. Start at one point. Walk to neighbors closer to your target. Those neighbors connect to their neighbors. You're traversing a connected space. If a match sits one hop away, you'll find it.


This is why HNSW became standard for entity resolution. When you need to match records across fragmented systems, you're not just finding the closest vector; you're resolving identity through graph exploration. A graph lets you navigate neighborhoods. Buckets and clusters don't.


HNSW inspired the entire ecosystem of graph-based vector databases. Qdrant, Weaviate, Milvus—they all use HNSW under the hood because the algorithm works for the specific problem of resolving fragmented identities.


Embeddings: Bringing Semantics into the Distance Function

There's one piece we haven't discussed: embeddings.


HNSW doesn't care what the vectors represent. It just computes distances. The magic comes from what goes into those vectors.


Here's the Sony turntable problem again. You have three text strings:

  • Sony Turntable - PSLX350H

  • Sony PS-LX350H Belt-Drive Turntable

  • Sony Record Player, Model PS-LX350H, Black


String similarity metrics fail. They're looking at character overlap and edit distance. But “turntable” and “record player” have zero overlap, yet they mean the same thing.


Embeddings solve this. Run each string through a transformer model (BERT, SentenceTransformer, etc.). Get back a 384-dimensional vector that encodes meaning. The model has learned—from training on billions of texts—that “turntable” and “record player” point in similar directions in that space.


Now all three Sony strings get mapped to vectors close together. Vector distance now measures semantic similarity, not just string similarity.
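A small illustration of that claim, assuming a hypothetical embed() wrapper around whatever sentence-embedding model you use:

```java
// Illustration only: embed() is a stand-in for a 384-dimensional sentence-embedding model.
// The point is that distance is computed on meaning-bearing vectors, not raw characters.
final class SemanticDistanceDemo {

    interface EmbeddingClient { float[] embed(String text); }   // hypothetical stand-in

    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static void compare(EmbeddingClient embedder) {
        float[] a = embedder.embed("Sony Turntable - PSLX350H");
        float[] b = embedder.embed("Sony PS-LX350H Belt-Drive Turntable");
        float[] c = embedder.embed("Sony Record Player, Model PS-LX350H, Black");

        // With a good model, all three pairwise similarities should land close together,
        // even though "turntable" and "record player" share no characters.
        System.out.printf("a~b: %.3f  a~c: %.3f  b~c: %.3f%n",
                cosineSimilarity(a, b), cosineSimilarity(a, c), cosineSimilarity(b, c));
    }
}
```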


You pair embeddings (which add semantic meaning) with HNSW (which finds nearest neighbors at speed) and suddenly your database can solve entity resolution. Fragmented identity becomes visible.


This is why the FAISS documentation is full of embedding papers, and why every vector database assumes you're using embeddings. The structure and the semantics are intertwined.


pgVector: Bringing Vector Search into Your Existing Database

PostgreSQL is a relational database. It's built for structured data, exact queries, and ACID guarantees. For decades, it was the opposite of what AI wanted.


Then someone asked: Why are these separate?


pgvector is the most widely adopted PostgreSQL extension for vector search, adding vector types and HNSW indexing directly into the database. While ParadeDB brings FAISS algorithms to Postgres, pgvector's HNSW implementation has become the standard, proven in production across thousands of deployments.


Instead of this:

  • Store your product data in Postgres

  • Extract it to a Python script

  • Run embeddings locally or via API

  • Write vectors to a separate vector database

  • Write glue code to join results back to your original data

  • Hope everything stays in sync


You get this:

  • Store product data in Postgres

  • Add a vector(384) column for embeddings

  • Create an HNSW index

  • Query vectors using the <=> operator (cosine distance)

  • Join vector results directly to your product data


One database. No sync headaches. ACID transactions on your vector data.
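Schema-wise, that's one extension and one column, sketched here against the same illustrative products table used earlier:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Add the embedding column alongside the existing product data.
public final class AddEmbeddingColumn {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/catalog");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE EXTENSION IF NOT EXISTS vector");                 // enable pgvector
            stmt.execute("ALTER TABLE products ADD COLUMN embedding vector(384)"); // 384-dim embeddings
        }
    }
}
```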


Here's how that plays out at query time.

Code (Java) demonstration of the PostgreSQL cosine distance similarity operator.
You embed the search term and use PostgreSQL's <=> operator (cosine distance) to find the nearest neighbors.
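Here's a simplified JDBC version of that query. The EmbeddingClient and products table are the same illustrative stand-ins as before; the ORDER BY embedding <=> ... LIMIT pattern is what lets the HNSW index do the work.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Query-time sketch: embed the search term, then fetch the nearest products by cosine distance.
public final class SemanticQuery {

    interface EmbeddingClient { float[] embed(String text); }   // hypothetical stand-in

    static void topMatches(EmbeddingClient embedder, String searchTerm) throws Exception {
        float[] q = embedder.embed(searchTerm);

        // Serialize the query vector as a pgvector literal, e.g. "[0.12,-0.03,...]".
        StringBuilder literal = new StringBuilder("[");
        for (int i = 0; i < q.length; i++) literal.append(i > 0 ? "," : "").append(q[i]);
        literal.append("]");

        String sql = "SELECT name, embedding <=> ?::vector AS distance " +
                     "FROM products ORDER BY embedding <=> ?::vector LIMIT 5";
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/catalog");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, literal.toString());
            ps.setString(2, literal.toString());
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s  (distance %.4f)%n", rs.getString(1), rs.getDouble(2));
                }
            }
        }
    }
}
```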

That's the full journey: from flat linear search → hierarchical graphs → semantic embeddings → practical database integration.


Why This Matters for Your Data

You don't have time for separate databases. Your data is already in PostgreSQL or whatever system you're using. You need to find the semantic matches without ripping everything out.


The data structure progression we traced—from linked lists to HNSW—wasn't academic. Each step was solving a real performance problem. Flat search was too slow. Skip lists got partway there. Small World graphs worked but were fragile. HNSW nailed it.


Then embeddings added the semantic layer.


Now pgvector brings it all together in one system.


Your fragmented identity problem—the Sony turntable living under three names—doesn't get solved by better string matching. It gets solved by understanding that three different text descriptions mean the same thing. HNSW finds them fast because the embedding model put them close together.


That's the bridge. Structure + semantics + practical implementation.
