
Benchmarking & Datasets for Entity Resolution

  • Writer: Gandhinath Swaminathan
  • Jan 26
  • 8 min read

If you've spent any time building entity resolution systems, you've probably felt the gap between what your prototype achieves in the lab and what it must handle in production. That gap usually comes down to one thing: how you measure success.


The wrong benchmark leads to the wrong conclusions. Worse, it leads to systems that look great in the lab and fail at scale.

Illustration comparing a neat prototype entity resolution model with a complex, messy production data graph.
When prototypes meet production reality

In this final post, we detail the standard datasets, the specific metrics required for this domain (beyond simple F1), and the benchmarking frameworks that define the current State of the Art (SOTA).


The Challenge(s)

In ER research, we categorize challenges into three distinct buckets. Your production system likely faces a combination of all three.


  1. Structured Data

This is the “cleanest” form, where schema alignment is high, but values differ.

  • Example: Amazon-Google dataset.

  • Characteristics: Two tables (e.g., Amazon electronics vs. Google products). The columns align (Name, Price, Manufacturer), but the values vary (“Canon EOS 5D” vs “Canon 5D Mark II”).

  • Primary Difficulty: String similarity and fuzzy matching.


  2. Dirty Data

This represents the reality of most enterprise data lakes. Attributes are misplaced, null, or aggregated into a single field.

  • Example: Dirty-DBLP-ACM.

  • Characteristics: Derived from structured bibliographic data with synthetic corruption applied to simulate real-world data quality issues. Corruption techniques include attribute shuffling (e.g., authors appearing in the Title field), random deletions (missing years), value swapping, and typo injection. These datasets test robustness to schema violations and data entry errors (a minimal corruption sketch appears at the end of this list).

  • Primary Difficulty: Attribute extraction and noisy token handling.


  3. Textual / Unstructured Data

Matching entities based on long-form descriptions rather than specific attributes.

  • Example: Abt-Buy (Product descriptions).

  • Characteristics: High variance in text length. A short title must match a verbose paragraph description.

  • Primary Difficulty: Semantic understanding. Lexical overlap (BM25) often fails here.
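
The corruption techniques behind the Dirty Data bucket (attribute shuffling, random deletions, typo injection) are straightforward to reproduce when you need to synthesize your own noisy test data. The sketch below is a minimal, hedged illustration in plain Python; the field names and probabilities are assumptions, not the recipe used by any specific benchmark.

```python
import random

rng = random.Random(0)  # fixed seed so corruptions are reproducible

def corrupt(record):
    """Apply a few synthetic data-quality problems to a clean record (dict of strings)."""
    dirty = dict(record)

    # Attribute shuffling: occasionally fold the authors into the title field.
    if rng.random() < 0.3 and dirty.get("authors"):
        dirty["title"] = f'{dirty["title"]} {dirty["authors"]}'
        dirty["authors"] = ""

    # Random deletion: occasionally blank out the year.
    if rng.random() < 0.2:
        dirty["year"] = ""

    # Typo injection: swap two adjacent characters in the title.
    title = dirty["title"]
    if len(title) > 3:
        i = rng.randrange(len(title) - 1)
        dirty["title"] = title[:i] + title[i + 1] + title[i] + title[i + 2:]

    return dirty

clean = {"title": "Efficient Entity Resolution", "authors": "A. Smith", "year": "2019"}
print(corrupt(clean))
```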


Standard Benchmark Datasets

To benchmark your model, you need “Gold Standard” data—pairs of records manually labeled as matches or non-matches.


The academic community (specifically the Magellan and DeepMatcher projects from UW-Madison) has standardized the following datasets. If you are building a custom ER model, start by evaluating it against these.

| Dataset | Domain | Type | Size (Labeled Pairs) | Matches |
| --- | --- | --- | --- | --- |
| Abt-Buy | E-Commerce | Textual | ~1,100 | ~1,000 |
| Amazon-Google | E-Commerce | Structured | ~1,300 | ~1,300 |
| DBLP-ACM | Citation | Structured | ~2,200 | ~2,200 |
| DBLP-Scholar | Citation | Structured | ~28,000 | ~5,300 |
| Walmart-Amazon | E-Commerce | Dirty | ~10,000 | ~900 |
| BeerAdvo-RateBeer | Reviews | Textual | ~450 | ~68 |

Labeled pairs refer to the manually annotated match/non-match examples in the benchmark, not the full Cartesian product of all records.

State-of-the-Art Performance (as of 2024-2025)

Your production system should benchmark against these ranges for comparable data types.

  • Deep learning methods like Ditto achieve F1 scores of 96.5% on company datasets and show 15-31% improvement over traditional ML approaches.

  • On dirty datasets (e.g., DBLP-ACM Dirty), SOTA methods achieve ~94% F1.

  • Structured datasets with high attribute quality (e.g., DBLP-ACM Clean) can exceed 99% F1.


The Metrics: Measuring Success

Decision-flow diagram mapping ER use cases to recommended evaluation metrics.
Picking the right metric for your goal

In applied ER, "accuracy" is a deceptive metric. In a dataset with 1 million records and 1,000 true duplicates, a model that blindly predicts "non-match" for every pair achieves 99.9% accuracy while failing completely.


To evaluate an ER system effectively, we must measure performance across five distinct dimensions, each answering a different question about the system's behavior:

  1. Pairwise Decision: Is this specific link correct?

  2. Cluster Coherence: Is the resulting entity profile pure and complete?

  3. Ranking Quality: Did the best match appear at the top of the search results?

  4. Business Cost: What is the economic impact of the errors?

  5. Computational Efficiency: Is the system filtering candidate pairs effectively?


Foundation: Confusion Matrix

ER is defined by extreme class imbalance. The number of non-matching pairs grows quadratically, while matching pairs grow linearly. Because True Negatives (non-matches) dominate the population, standard accuracy scores are meaningless.
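
To make that imbalance concrete, the number of candidate pairs for n records is

$$\binom{n}{2} = \frac{n(n-1)}{2},$$

so the 1-million-record example above yields roughly $5 \times 10^{11}$ candidate pairs against only about a thousand true matches.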


Instead, we focus strictly on the Confusion Matrix, which isolates how the model handles positive identification.


|                  | Predicted Match                   | Predicted Non-Match                    |
| ---------------- | --------------------------------- | -------------------------------------- |
| Actual Match     | True Positive (TP): Correct Link  | False Negative (FN): Missed Link       |
| Actual Non-Match | False Positive (FP): Wrong Link   | True Negative (TN): Correctly Ignored  |

  1. Pairwise Metrics

These metrics evaluate the binary decision: Do Record A and Record B represent the same entity?


They are the standard baseline for benchmarking deep learning models:

Precision: Precision measures the reliability of a positive prediction. High precision is non-negotiable for automated merging.

Recall: Recall measures the system's ability to find all existing matches. High recall is required for investigation and risk use cases.

F1 Score: The harmonic mean of Precision and Recall. It penalizes extreme values, preventing a system from “gaming” the score by optimizing solely for one side.

Precision Metrics Equation.
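
For reference, written in terms of the confusion-matrix counts above:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$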

  2. Cluster-Level Metrics

Pairwise metrics have a blind spot: they look at links in isolation. If a model correctly links A to B and B to C but fails to link A to C, pairwise metrics might still score it highly, even though the resulting graph is unstable or fractured.


We use B-Cubed metrics to evaluate the structural integrity of the final resolved entities.

  • B-Cubed Precision: Calculates, for every record, what percentage of its cluster-mates actually belong to the same true entity. This penalizes merging distinct entities.

  • B-Cubed Recall: Calculates, for every record, what percentage of its true siblings ended up in the predicted cluster. This penalizes splitting a single entity into duplicates.
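
A minimal sketch of these two numbers in plain Python; the record-to-cluster dictionaries and the IDs are hypothetical and only illustrate the expected input shape.

```python
from collections import defaultdict

def b_cubed(pred, gold):
    """B-Cubed precision and recall for two clusterings given as record_id -> cluster_id dicts."""
    def group(assignments):
        groups = defaultdict(set)
        for rec, cid in assignments.items():
            groups[cid].add(rec)
        return groups

    pred_groups, gold_groups = group(pred), group(gold)
    precision_sum = recall_sum = 0.0
    for rec in pred:
        pred_mates = pred_groups[pred[rec]]   # records sharing rec's predicted cluster
        gold_mates = gold_groups[gold[rec]]   # records sharing rec's true entity
        overlap = len(pred_mates & gold_mates)
        precision_sum += overlap / len(pred_mates)   # purity of rec's predicted cluster
        recall_sum += overlap / len(gold_mates)      # completeness w.r.t. rec's true entity
    n = len(pred)
    return precision_sum / n, recall_sum / n

# Toy example: r4 is wrongly merged into c1, and r3 is split off from its true entity.
pred = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c1"}
gold = {"r1": "e1", "r2": "e1", "r3": "e1", "r4": "e2"}
print(b_cubed(pred, gold))  # both values ≈ 0.67
```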


  3. Ranking Metrics

Modern ER systems often function as search engines (e.g., “Find potential matches for this incoming lead”). In this context, a binary “Match/No-Match” is insufficient.


We need to know if the correct match appeared at the top of the candidate list.

  • k (cutoff): The number of results the user sees or the system considers.

  • rel_i (Relevance): The binary label for the result at rank i (1 if it is a match, 0 if it is not).

  • DCG (Discounted Cumulative Gain): Sums the relevance of the results, penalizing matches that appear lower in the list using a logarithmic discount.

  • IDCG (Ideal DCG): The DCG score of the perfect ranking (where all true matches appear at the very top).

  • NDCG (Normalized Discounted Cumulative Gain): DCG divided by IDCG, giving a score between 0 and 1 that evaluates ranking quality. While not originally designed for entity resolution, NDCG has been adapted from information retrieval to measure how well true matches appear at the top of candidate rankings.

Ranking Metrics Equation.

Note: Though an alternate version of the DCG formula exists, the one illustrated here is the most common in ER applications.
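
For reference, the common binary-relevance formulation (assumed to match the figure above) is:

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$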

  4. Cost-Based Metrics

Technical errors have unequal business costs. Fixing a “split” error (re-merging fragments of one profile) is often a simple database update. Fixing a “merge” error (unpicking mixed transactions from two different people) can require manual audit and legal remediation.


Generalized Merge Distance (GMD) treats ER evaluation as an edit-distance problem. It calculates the minimum cost to transform the predicted clusters into the true clusters using merge and split operations.
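
GMD itself is computed with an edit-distance style algorithm over merge and split operations. As a rough stand-in, the hedged sketch below only tallies pairwise merge errors (false-positive pairs) and split errors (false-negative pairs) and weights them with hypothetical asymmetric costs; it illustrates cost-sensitive scoring, not the GMD algorithm.

```python
from itertools import combinations

# Hypothetical costs: unpicking a wrong merge is assumed far more expensive than re-linking a split.
MERGE_ERROR_COST = 100.0   # linked two different entities
SPLIT_ERROR_COST = 1.0     # failed to link two records of the same entity

def within_cluster_pairs(clusters):
    """All within-cluster record pairs for a record_id -> cluster_id dict."""
    by_cluster = {}
    for rec, cid in clusters.items():
        by_cluster.setdefault(cid, []).append(rec)
    return {frozenset(p) for members in by_cluster.values()
            for p in combinations(sorted(members), 2)}

def weighted_error_cost(pred, gold):
    pred_pairs, gold_pairs = within_cluster_pairs(pred), within_cluster_pairs(gold)
    merge_errors = pred_pairs - gold_pairs   # pairs linked that should not be
    split_errors = gold_pairs - pred_pairs   # pairs missed that should be linked
    return len(merge_errors) * MERGE_ERROR_COST + len(split_errors) * SPLIT_ERROR_COST

pred = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c1"}
gold = {"r1": "e1", "r2": "e1", "r3": "e1", "r4": "e2"}
print(weighted_error_cost(pred, gold))  # 2 merge errors * 100 + 2 split errors * 1 = 202.0
```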


  5. Blocking Efficiency

Before fine-grained matching, the Blocking step (indexing) reduces the search space. We must measure this independently to ensure the candidate generation layer is not the bottleneck.

  • Reduction Ratio (RR): The percentage of the total search space eliminated.

  • Pairs Completeness (PC): The recall of the blocking step specifically.
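
A minimal sketch of both numbers, assuming the blocker's candidate pairs and the gold matches are available as sets of record-ID pairs (the function name and input format are illustrative assumptions):

```python
def blocking_quality(candidate_pairs, gold_match_pairs, n_records):
    """Reduction Ratio and Pairs Completeness for a blocking step.

    candidate_pairs / gold_match_pairs: sets of frozenset({id_a, id_b}).
    n_records: total number of records being resolved.
    """
    total_pairs = n_records * (n_records - 1) / 2               # full comparison space
    reduction_ratio = 1.0 - len(candidate_pairs) / total_pairs  # share of the space eliminated
    # Share of true matches that survive blocking: the blocker's recall.
    pairs_completeness = len(candidate_pairs & gold_match_pairs) / len(gold_match_pairs)
    return reduction_ratio, pairs_completeness

# Toy example: 1,000 records, 900 candidate pairs, 50 of 60 true matches retained.
candidates = {frozenset((f"r{i}", f"r{i+1}")) for i in range(900)}
gold = {frozenset((f"r{i}", f"r{i+1}")) for i in range(50)}
gold |= {frozenset((f"r{i}", f"r{i+2}")) for i in range(10)}   # matches the blocker misses
rr, pc = blocking_quality(candidates, gold, n_records=1000)
print(f"RR={rr:.4f}, PC={pc:.3f}")   # RR≈0.9982, PC≈0.833
```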


Metric Selection Matrix

| General Capability | Primary Metric | Secondary Metric | Why? |
| --- | --- | --- | --- |
| High-Precision Matching | Precision | B-Cubed Precision | The priority is to prevent data corruption. Merging distinct entities is the worst-case scenario. |
| Recall-Oriented Discovery | Recall | Pairs Completeness | The priority is to find hidden links. False positives are tolerated to ensure no target is missed. |
| Ranked Retrieval | NDCG | Mean Reciprocal Rank (MRR) | The system acts as a search engine. The user only looks at the top k results. |
| Cost-Sensitive Resolution | GMD (Generalized Merge Distance) | Pairwise Precision | Errors have specific dollar costs. A merge error can cost 100x more than a split error. |
| Structural/Graph Comparison | Variation of Information (VI) | Cluster F1 | Evaluates the stability of the entire graph structure rather than individual links. |
| Graph Entropy Comparison | Variation of Information (VI) | Normalized VI | Measures the information-theoretic distance between clusterings. Useful when you want a metric-space distance that satisfies the triangle inequality for theoretical analysis. |

Mean Reciprocal Rank (MRR) averages the reciprocal of the rank of the first true match. MRR = 1.0 means every query's first result is correct; MRR = 0.5 means the first match appears, on average, at position 2.
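
For reference, the standard MRR definition averages reciprocal ranks over a query set Q:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}$$

where $\mathrm{rank}_q$ is the position of the first true match returned for query q.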


Why Existing Benchmarks Aren't Enough

Here's what makes benchmarks useful and simultaneously insufficient: They're not your data.

  • Domain Shift: A benchmark trained on product SKUs doesn't transfer perfectly to customer names. An algorithm tuned on company records might behave differently on academic citations. The structure, field types, and matching signal vary.

  • Scale Differences: Abt-Buy has 1,076 products. Your production system has 50 million customer records, or a complex product hierarchy and smart product codes. Algorithms that work at small scale sometimes degrade at large scale due to index construction, query time, or memory constraints. Conversely, some approaches designed for massive scale can be overkill for smaller datasets.

  • Real-World Noise: Benchmarks are curated. They have known ground truth. Your production data doesn't. You have records with missing fields, inconsistent formatting, corruption from failed data imports, and entries that were manually entered by people at 11 pm on a Friday.

  • Incomplete Matching: A benchmark tells you which records should match. But it might miss duplicates in the wild. The benchmark is an underestimate of true matches in real systems.

  • Temporal Drift: A benchmark is a snapshot. Your data changes. New variations appear. Your matching rules need to drift too. The benchmark doesn't capture this.

  • Business Constraints: Benchmarks rarely capture the cost of false positives versus false negatives in your specific context. In some domains, a false positive (incorrectly merging two different customers) is catastrophic. In others, false negatives (missing a match) are the bigger problem.


How to Run Your Own Benchmark

If you are evaluating a solution for your company, do not rely solely on public benchmarks. Public data is often cleaner than enterprise reality.

  1. Synthesize “Dirty” Data: Take your clean records and apply noise functions (typos, token deletion, format swapping). Tools like nlpaug or the SPIDER dataset generation scripts can automate this.

  2. The “Gold Subset”: You cannot label 10 million records. Randomly sample 1,000 pairs, but ensure you oversample likely matches (using a simple string blocker). If you only sample randomly, you will get 999 non-matches and 1 match, which leaves far too few positives to estimate precision or recall reliably.

  3. Split Rigorously: Ensure that if Record A is in the Training set, it does not appear in the Test set. Data leakage in ER is common and leads to inflated performance metrics that crash in production.
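
As a minimal sketch of step 3, the split below assigns records (not pairs) to train or test, keeps only pairs whose records fall entirely on one side, and drops pairs that straddle the boundary. The pair format and IDs are hypothetical.

```python
import random

def record_disjoint_split(labeled_pairs, test_fraction=0.2, seed=42):
    """Split labeled pairs so that no record appears in both train and test.

    labeled_pairs: list of (record_id_a, record_id_b, label) tuples.
    Pairs that straddle the two record partitions are discarded rather than leaked.
    """
    rng = random.Random(seed)
    records = sorted({r for a, b, _ in labeled_pairs for r in (a, b)})
    rng.shuffle(records)
    test_records = set(records[: int(len(records) * test_fraction)])

    train, test = [], []
    for a, b, label in labeled_pairs:
        if a in test_records and b in test_records:
            test.append((a, b, label))
        elif a not in test_records and b not in test_records:
            train.append((a, b, label))
        # else: the pair straddles the split; drop it to avoid leakage

    return train, test

pairs = [("a1", "g7", 1), ("a1", "g9", 0), ("a2", "g7", 0), ("a3", "g3", 1)]
train, test = record_disjoint_split(pairs, test_fraction=0.5)
print(len(train), len(test))
```

Dropping straddling pairs shrinks the labeled set; the tradeoff is a test score that better predicts behavior on records the model has never seen.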


Conclusion

Your production system will always face data that is dirtier, larger, and more heterogeneous than any academic dataset, so treat public benchmarks as tools to compare approaches, not as proxies for your true production accuracy.


Always back them with domain-specific, held-out test sets sampled from your own data distribution, and refresh those samples as your entities, schemas, and business rules evolve. The goal is not to “beat a dataset,” but to build an evaluation loop that reliably predicts how your system will behave in the wild.


The benchmarks in this post represent the minimum bar, not the finish line. Start with the dataset family that best matches your use case, then choose metrics that encode your real constraints and error costs, and build ground truth carefully with multiple reviewers so you trust every label.


Rigorously document what you measured, under what conditions, and with what sampling strategy, and make those evaluation pipelines repeatable so every new model, rule change, or vendor claim is judged against the same, production-centered standard.


Series: Entity Resolution
