Institut für Parallele und Verteilte Systeme
Universität Stuttgart
Universitätsstraße 38
D–70569 Stuttgart

Master's Thesis

Benchmarking Pre-Trained Language Models for Schema-Agnostic Entity Resolution

Jan Bothmann

Course of study: Computer Science
Examiner: Prof. Dr. rer. nat. Melanie Herschel
Supervisor: Prof. Dr. rer. nat. Melanie Herschel
Commenced: 15 May 2024
Completed: 15 November 2024

Kurzfassung

Data integration is an important process in which data from different sources are combined to create a unified view of the data. An essential step in this process is entity resolution, which attempts to identify elements that represent the same entity. The complexity of entity resolution tasks can vary greatly, as data exhibit different characteristics and degrees of structuredness that can make the task either harder or easier. This thesis focuses on evaluating entity resolution systems with respect to semi-structured data. For this purpose, several semi-structured entity resolution benchmarks were created that use data from different domains and are employed for the evaluation. To also investigate how different data characteristics or other influencing factors affect the performance of entity resolution systems, we developed the Benchmark Creator. It allows us and other users to create benchmarks in which the data exhibit specific characteristics that can influence the performance of entity resolution systems. The entity resolution systems Ditto and Sudowoodo as well as the GPT4o-mini model were used for the evaluation. It was shown that both Ditto and the GPT4o-mini model are able to perform schema-agnostic entity resolution on semi-structured data effectively.

Abstract

Data integration is a process in which data from different sources are brought together to create a unified picture of the data. A vital aspect of this integration is Entity Resolution (ER), which tries to identify elements that correspond to the same entity across multiple datasets. The complexity of ER tasks can vary significantly, as data exhibits different characteristics and levels of structuredness, which can influence the difficulty of the task. In this thesis, we evaluate how current state-of-the-art Entity Resolution systems perform when dealing with semi-structured data. To do this, several semi-structured ER benchmarks covering data from various domains were created for evaluation. Additionally, to explore how different data characteristics or other influencing factors impact the performance of matching systems, we developed the Benchmark Creator. This tool allows us and other users to generate benchmarks where data exhibits specific characteristics that may influence the complexity of the ER task. We used Ditto, Sudowoodo and the GPT4o-mini model to evaluate performance on the newly created benchmarks. Our evaluation reveals that Ditto and the GPT4o-mini model can effectively perform schema-agnostic ER on semi-structured data.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Objectives and Thesis Contributions
  1.3 Outline
2 Related Work
  2.1 Fundamental Concepts
  2.2 Existing Entity Resolution Benchmarks
  2.3 Benchmark Difficulty Measures
  2.4 Summary of Related Work
3 Benchmark Design
  3.1 Data Requirements Definition
  3.2 Data Source Selection
  3.3 Benchmark Dimensions
  3.4 Benchmark Development Process
4 Implementation
  4.1 Design Decisions
5 Evaluation
  5.1 Metrics for Benchmark Evaluation
  5.2 Linearity and Complexity Evaluation
  5.3 Evaluation of Entity Resolution Matching Systems
  5.4 Discussion
6 Conclusion and Outlook
Bibliography

List of Figures

1.1 Diverse Representations of Real-World Entities in the Video Domain
2.1 ER Workflow
2.2 Self-Supervised Learning [PA23]
2.3 EM architecture using Ditto as a matcher [LLS+20]
2.4 The Sudowoodo EM pipeline [WLW23]
2.5 Fine-Tuning in Sudowoodo [WLW23]
2.6 Prompting Foundation Models [PB23]
2.7 Scenarios for Benchmark Construction [WLF+22]
2.8 Example of hard and easier matching and non-matching offer pairs from WDC Products [PDB23]
2.9 Formula for Attribute Sparsity [CAF+21]
3.1 Benchmark Development Process
3.2 Benchmark Creator

List of Tables

2.1 Previous Benchmark Configurations [WLF+22]
3.1 Statistics of Food Data Sources
3.2 Statistics of Video Data Sources
3.3 Statistics of Book Data Sources
5.1 Linearity Results
5.2 OFF-FDCN: Standard
5.3 TMDB-OMDB: Standard
5.4 OFF-FDCM: Standard
5.5 T-O-I: Standard
5.6 OL-GK: Standard
5.7 OFF-FDCN: Matches-NonMatches
5.8 TMDB-OMDB: Matches-NonMatches
5.9 OFF-FDCM: Matches-NonMatches
5.10 T-O-I: Matches-NonMatches
5.11 OL-GK: Matches-NonMatches
5.12 OFF-FDCN: Highly Identifying Attributes
5.13 TMDB-OMDB: Highly Identifying Attributes
5.14 OFF-FDCM: Highly Identifying Attributes
5.15 T-O-I: Highly Identifying Attributes
5.16 OL-GK: Highly Identifying Attributes
5.17 OFF-FDCN: Number Attributes
5.18 TMDB-OMDB: Number Attributes
5.19 OFF-FDCM: Number Attributes
5.20 T-O-I: Number Attributes
5.21 OL-GK: Number Attributes
5.22 OFF-FDCN: Attribute Sparsity
5.23 TMDB-OMDB: Attribute Sparsity
5.24 OFF-FDCM: Attribute Sparsity
5.25 T-O-I: Attribute Sparsity
5.26 OL-GK: Attribute Sparsity
5.27 OFF-FDCN: Value Length
5.28 TMDB-OMDB: Value Length
5.29 OFF-FDCM: Value Length
5.30 T-O-I: Value Length
5.31 OL-GK: Value Length
5.32 OFF-FDCN: Similar Non-Matches
5.33 TMDB-OMDB: Similar Non-Matches
5.34 OFF-FDCM: Similar Non-Matches
5.35 T-O-I: Similar Non-Matches
5.36 OL-GK: Similar Non-Matches
5.37 T-O-I: Unseen Entities
5.38 T-O-I: Over/Underrepresented Entities
5.39 TMDB-OMDB: Combination 1
5.40 OFF-FDCM: Combination 1
5.41 T-O-I: Combination 1
5.42 OFF-FDCN: Combination 2
5.43 TMDB-OMDB: Combination 2
5.44 OFF-FDCM: Combination 2
5.45 T-O-I: Combination 2
5.46 OL-GK: Combination 2
5.47 OFF-FDCN: Data Manipulation
5.48 TMDB-OMDB: Data Manipulation
5.49 OFF-FDCM: Data Manipulation
5.50 T-O-I: Data Manipulation
5.51 OL-GK: Data Manipulation
5.52 OFF-FDCN: Standard
5.53 TMDB-OMDB: Standard
5.54 OFF-FDCM: Standard
5.55 T-O-I: Standard
5.56 OL-GK: Standard
5.57 OFF-FDCN: Matches-NonMatches
5.58 TMDB-OMDB: Matches-NonMatches
5.59 OFF-FDCM: Matches-NonMatches
5.60 T-O-I: Matches-NonMatches
5.61 OL-GK: Matches-NonMatches
5.62 OFF-FDCN: Highly Identifying Attributes
5.63 TMDB-OMDB: Highly Identifying Attributes
5.64 OFF-FDCM: Highly Identifying Attributes
5.65 T-O-I: Highly Identifying Attributes
5.66 OL-GK: Highly Identifying Attributes
5.67 OFF-FDCN: Number Attributes
5.68 TMDB-OMDB: Number Attributes
5.69 OFF-FDCM: Number Attributes
5.70 T-O-I: Number Attributes
5.71 OL-GK: Number Attributes
5.72 OFF-FDCN: Attribute Sparsity
5.73 TMDB-OMDB: Attribute Sparsity
5.74 OFF-FDCM: Attribute Sparsity
5.75 T-O-I: Attribute Sparsity
5.76 OL-GK: Attribute Sparsity
5.77 OFF-FDCN: Value Length
5.78 TMDB-OMDB: Value Length
5.79 OFF-FDCM: Value Length
5.80 T-O-I: Value Length
5.81 OL-GK: Value Length
5.82 OFF-FDCN: Similar Non-Matches
5.83 TMDB-OMDB: Similar Non-Matches
5.84 OFF-FDCM: Similar Non-Matches
5.85 T-O-I: Similar Non-Matches
5.86 OL-GK: Similar Non-Matches
5.87 T-O-I: Over/Underrepresented Entities
5.88 TMDB-OMDB: Combination 1
5.89 OFF-FDCM: Combination 1
5.90 T-O-I: Combination 1
5.91 OFF-FDCN: Data Manipulation
5.92 TMDB-OMDB: Data Manipulation
5.93 OFF-FDCM: Data Manipulation
5.94 T-O-I: Data Manipulation
5.95 OL-GK: Data Manipulation
5.96 OFF-FDCN: Standard
5.97 TMDB-OMDB: Standard
5.98 OFF-FDCM: Standard
5.99 T-O-I: Standard
5.100 OL-GK: Standard
5.101 OFF-FDCN: Matches-NonMatches
5.102 TMDB-OMDB: Matches-NonMatches
5.103 OFF-FDCM: Matches-NonMatches
5.104 T-O-I: Matches-NonMatches
5.105 OL-GK: Matches-NonMatches
5.106 OFF-FDCN: Highly Identifying Attributes
5.107 T-O: Highly Identifying Attributes
5.108 OFF-FDCM: Highly Identifying Attributes
5.109 T-O-I: Highly Identifying Attributes
5.110 OL-GK: Highly Identifying Attributes
5.111 OFF-FDCN: Number Attributes
5.112 TMDB-OMDB: Number Attributes
5.113 OFF-FDCM: Number Attributes
5.114 T-O-I: Number Attributes
5.115 OL-GK: Number Attributes
5.116 OFF-FDCN: Attribute Sparsity
5.117 TMDB-OMDB: Attribute Sparsity
5.118 OFF-FDCM: Attribute Sparsity
5.119 T-O-I: Attribute Sparsity
5.120 OL-GK: Attribute Sparsity
5.121 OFF-FDCN: Value Length
5.122 TMDB-OMDB: Value Length
5.123 OFF-FDCM: Value Length
5.124 T-O-I: Value Length
5.125 OL-GK: Value Length
5.126 OFF-FDCN: Similar Non-Matches
5.127 TMDB-OMDB: Similar Non-Matches
5.128 OFF-FDCM: Similar Non-Matches
5.129 T-O-I: Similar Non-Matches
5.130 OL-GK: Similar Non-Matches
5.131 T-O-I: Unseen Entities
5.132 T-O-I: Over/Underrepresented Entities
5.133 TMDB-OMDB: Combination 1
5.134 OFF-FDCM: Combination 1
5.135 T-O-I: Combination 1
5.136 T-O-I: Combination 2
5.137 OFF-FDCN: Data Manipulation
5.138 TMDB-OMDB: Data Manipulation
5.139 OFF-FDCM: Data Manipulation
5.140 T-O-I: Data Manipulation
5.141 OL-GK: Data Manipulation

Listings

4.1 Computation of Attribute Importance Scores
4.2 Prompt for Generating Synonyms

Acronyms

BERT   Bidirectional Encoder Representations from Transformers
CSV    Comma-Separated Values
EM     Entity Matching
ER     Entity Resolution
GPT    Generative Pre-Trained Transformers
GPU    Graphical Processing Unit
GTIN   Global Trade Item Number
ISBN   International Standard Book Number
JSON   JavaScript Object Notation
MLM    Masked Language Modeling
NLP    Natural Language Processing
NSP    Next Sentence Prediction
RF     Random Forest
RL     Record Linkage
RNN    Recurrent Neural Network
SOTA   State-Of-The-Art
SVM    Support Vector Machine
TF-IDF Term Frequency-Inverse Document Frequency
VIN    Vehicle Identification Number
XML    Extensible Markup Language
YAML   Yet Another Markup Language

1 Introduction

1.1 Motivation

The process of finding and matching entries that refer to the same real-world entity within a single dataset or between different datasets is known as Entity Resolution (ER), Entity Matching (EM) or Record Linkage (RL). It is essential in many applications where data originates from different sources and must be combined to make the data representation correct and meaningful. Such application domains include biology, public health, insurance, e-commerce, and customer relationship management, which frequently process large amounts of data from various sources that lack standardization and therefore require integration. ER is an essential step in the data integration process to reduce redundancy and inconsistency, hence increasing data quality. It also enhances the user experience, facilitates adherence to data regulations and avoids incorrect conclusions from conflicting or duplicate records, contributing to more accurate and trustworthy data analysis.

Several solutions using different approaches have been proposed to address the ER task. Traditional ER solutions, which assume that records referring to the same entity are more similar than those referring to other entities, often rely on pairwise similarity comparisons [LLG15]. However, this assumption does not hold in every case. For instance, different episodes of a film series may differ only slightly in their data representation but still represent distinct entities.
As a result, researchers developed more complex ER systems, such as rule-based matching systems and hand-crafted heuristics, to address more complex matching scenarios, achieving better results [LLG15]. However, these approaches also work only to a certain extent. If data is presented in a structured and clean manner, they perform well, as the high level of structuredness allows for identifying patterns and rules that can be applied consistently. On the other hand, similarity-based or rule-based ER systems struggle to maintain effectiveness when dealing with less structured data. In such cases, where data may follow a semi-structured schema or lack any schema entirely, the predefined rules become less effective and fail to capture the complex and nuanced patterns present in the data.

With the rise of machine learning (ML), ML models have become the preferred choice for tasks previously handled by rule-based systems. Traditional ML models using non-linear classifiers like the Support Vector Machine (SVM) or Random Forest (RF) have demonstrated the ability to identify complex patterns that earlier ER systems struggled to detect and interpret [BM03] [PB20]. Especially with the rise of deep learning solutions capable of understanding even more complicated patterns in the underlying problem, various systems have emerged that leverage such architectures and show increased performance. This increase in performance has also been observed when utilizing such models for ER. Notably, the recent development of transformer-based large language models has led to another significant boost in ER performance. Language models such as Bidirectional Encoder Representations from Transformers (BERT) [DCLT19] and Generative Pre-Trained Transformers (GPT) [Rad18] have emerged and achieved remarkable success in Natural Language Processing (NLP) tasks. Such systems are pre-trained using vast amounts of text sequences from large corpora on one or multiple self-supervised tasks to capture and understand the semantic relationships between the words of a text. Compared to other architectures, the transformer-based architecture captures context and relationships between tokens or sets of tokens more flexibly and effectively [VSP+17]. Smaller language models often additionally undergo a fine-tuning step to increase performance on the desired downstream task. When the downstream task is ER, the language model is fine-tuned so that it learns to determine whether different records refer to the same entity or not. Using this kind of neural network architecture already achieves promising results for ER tasks [CEP+20] [ZPSK23], and researchers continue to explore even more innovative solutions.

Figure 1.1: Diverse Representations of Real-World Entities in the Video Domain

However, current research focuses mainly on structured data, where many solutions have already achieved solid results. This raises the question of whether recently developed ER systems can still perform well when dealing with less structured data, such as semi-structured or unstructured data. Semi-structured data is often presented in formats like Extensible Markup Language (XML), JavaScript Object Notation (JSON) or Yet Another Markup Language (YAML), where information is organized in nested structures, and attributes are not required to be consistently provided.
On the other hand, unstructured data lacks any specific organization and typically requires specialized data management tools for effective handling. As mentioned, studies have mainly focused on evaluating matching systems on structured data. To compare the performance of different matching systems, researchers began to create so-called benchmarks for the ER task to assess how well matching systems perform under different circumstances. However, most of the existing benchmarks for ER contain only structured data or focus on other subtasks. Hence, it is unknown how matching systems perform when dealing with different levels of structuredness in data. To assess this problem in more detail, developing such benchmarks would be of great interest and would help determine whether such systems, especially the current State-Of-The-Art (SOTA) ER matching systems, can deal with different data structures.

Another crucial aspect to consider when developing benchmarks is the variation in data characteristics. Authors have focused on providing benchmarks with varying levels of difficulty, as different data exhibits different characteristics that can significantly impact the difficulty of the matching task [PB20] [CAF+21]. For instance, datasets may have varying levels of structural and semantic heterogeneity. Structural heterogeneity refers to differences in how data is organized, while semantic heterogeneity involves variations in meaning and interpretation across different datasets. These and other differences can lead to different levels of benchmark difficulty, which so far have primarily been assessed on structured datasets but not on other levels of structuredness. Analyzing these variations is crucial, as they highlight areas where current solutions need to focus on optimizing results and areas that do not require attention because they are already well addressed.

Due to the lack of ER benchmarks for semi-structured data and the necessity to understand the influence of varying levels of structuredness, this work aims to enhance the understanding of SOTA matching systems in this context and seeks to establish a solid foundation for future research.

1.2 Research Objectives and Thesis Contributions

The main objective of this master's thesis is to study the effectiveness of SOTA ER approaches based on pre-trained language models in a schema-agnostic setting on semi-structured data. This objective is broken down into the following specific goals:

1. Development of Semi-Structured Entity Resolution Benchmarks: The primary goal is to design and develop one or multiple ER benchmarks containing semi-structured data. For this, sources providing the necessary data structure that can be used for benchmark creation have to be identified. Additionally, identifying sources across domains that reflect the variety and complexity of real-world entity resolution problems should provide a solid foundation for evaluating ER matching systems.

2. Introduce Varying Levels of Complexity and Heterogeneity: The second goal is to create more refined semi-structured benchmarks by systematically varying underlying data characteristics to identify easier or more challenging scenarios. Such characteristics can include both structural and semantic aspects. This aims to assess the robustness and adaptability of ER matching systems across datasets with varying structures and challenges.
3. Evaluate State-of-the-Art ER Methods Using the Benchmark: The last goal is to evaluate the effectiveness of current SOTA ER systems in the new setting. This also includes the assessment of the generated benchmarks containing varying complexities and characteristics to identify possible difficulties.

The main contributions of this thesis are:

1. Providing Software for Developing ER Benchmarks: We developed a new software component that allows users to create new benchmarks and provides options to restrict specific benchmark dimensions to a desired extent. This enables future researchers to more easily generate benchmarks with specific configurations.

2. Demonstrating the Effectiveness of ER Systems on Semi-Structured Data: This work demonstrates that the SOTA ER systems Ditto and GPT4o-mini can effectively perform ER on semi-structured data in a schema-agnostic manner. This finding suggests that current systems can handle the complexities associated with hierarchically structured data.

3. Identification of Benchmark Characteristics that Pose Difficulties: We present several semi-structured benchmarks that demonstrate diverse characteristics. These benchmarks highlight areas where the employed matching systems struggle to perform ER effectively, indicating room for improvement.

1.3 Outline

The structure of this thesis is organized as follows:

Chapter 2: Related Work: This chapter provides the fundamental concepts related to ER and key terms relevant to this field. It also offers insights into the current SOTA ER systems and the various architectures they employ. Additionally, it reviews existing benchmarks and presents the dimensions on which authors have focused in the past.

Chapter 3: Benchmark Design: This chapter illustrates the steps and decisions taken before implementing the benchmark. It describes the data requirements and the sources selected and provides the architecture used to generate the benchmarks. It also describes the specific dimensions covered by our benchmarks.

Chapter 4: Implementation: The design decisions encountered during the implementation phase are explained in detail, along with the rationale for each selected approach.

Chapter 5: Evaluation: The evaluation of various ER systems on the newly created semi-structured benchmarks is described here. Furthermore, the results of the matching systems are compared and discussed in detail.

Chapter 6: Conclusion and Outlook: Finally, the main conclusions of this work are summarized, and potential topics for future research based on the findings are presented.

2 Related Work

This chapter explains the fundamental concepts of ER, including the general workflow and the specific ER approaches currently used. Furthermore, existing benchmarks are presented alongside what their authors focused on when creating them, as well as the various procedures used to generate them.

2.1 Fundamental Concepts

Entity Resolution, Entity Matching, Record Linkage or De-Duplication aims to identify whether different representations from one or several sources represent the same real-world entity [WLF+22]. An entity can be described as "a single unique object in the real world that is being mastered" [Cor24] and can have various representations. A representation of an entity can be textual, graphical, or of any other form and can thus be presented in numerous formats. Figure 1.1 shows different textual representations of specific movies from the video domain.
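To make this concrete, the following minimal sketch shows two hypothetical semi-structured records of the kind depicted in Figure 1.1; the attribute names and values are invented for illustration and do not come from the actual benchmark sources.

```python
# Two hypothetical semi-structured representations of the same movie from
# different sources; schemas, nesting and attribute coverage differ.
record_a = {
    "title": "The Matrix",
    "released": {"year": 1999, "country": "US"},
    "cast": ["Keanu Reeves", "Laurence Fishburne"],
    "runtime_min": 136,
}
record_b = {
    "name": "Matrix, The (1999)",
    "director": "Wachowski",
    "genres": ["Action", "Sci-Fi"],
    # no runtime or cast provided by this source
}
```

Although both records describe the same entity, neither the attribute names nor the provided values align, which is exactly the situation a schema-agnostic matcher has to cope with.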
The goal in this example would be to understand whether two textual representations of movies refer to the same movie or not. This task is non-trivial, as many factors influence the difficulty of identifying the correct mapping. Data may be represented in many ways and contain data errors such as schema or value errors, thus increasing the difficulty.

The process of identifying the correct mapping between different representations and real-world entities requires several steps, as illustrated in Figure 2.1. Initially, a set of source records from one or multiple data sources is provided. ER is then performed on these records to establish a correct mapping between the representations and the entities they represent. Before performing entity resolution, Blocking or Clustering is typically applied to reduce the search space when comparing records by grouping similar records. Blocking or Clustering is a technique that tries to place entries that are more similar, or more likely represent the same entity, in the same group while putting the other entries into different groups [GFSS21]. Often, blocking employs similarity-based measures to identify such groups. More advanced blocking methods, such as embedding-based blocking [OR24], utilize pre-trained language models like BERT [DCLT19] or SentenceBERT, as demonstrated in DeepBlocker [TLT+21]. Such models transform records into a lower-dimensional feature space, capturing semantic similarities. In this feature space, similarity measures like Cosine Similarity are applied to identify closely related records, enabling more effective matching of similar entities [TLT+21]. This way, not every record has to be compared with every other record, which would be inefficient, as the number of comparisons grows quadratically. Sometimes, authors perform specific data cleaning steps before blocking to address inconsistencies, missing values or other kinds of data errors to reduce the difficulty of the matching task.

Figure 2.1: ER Workflow

After identifying clusters and groups of likely similar entries, the Entity Resolution step is performed within each group. Here, different kinds of ER approaches are available to choose from. One can select traditional ER approaches, which employ similarity measures on attribute values: if the similarity exceeds a certain threshold, the two entries are classified as a match; otherwise, they are not. Often, blocking and entity resolution are performed iteratively to either exploit relationships between entity descriptions, as seen in matching-based approaches, or to leverage the partial results of merging descriptions in merging-based approaches [BGM+09]. A more recent approach involves employing machine learning techniques, which are often better at understanding the underlying patterns in data and thus may solve the task more effectively [BM03]. Today's most commonly used machine learning techniques for solving the ER task are based on deep learning architectures. Due to their complex architecture and rich set of parameters, deep learning models have proved to be very effective when facing inherently complex tasks, including ER [ZPSK23].

2.1.1 Entity Resolution Methods

Various ER methods are available, each with its own approach and strengths, which we briefly revisit in more detail below. Traditional ER methods often perform pairwise similarity comparisons between entries and assume that entries from the same entity are more similar than other records [LLG15]; a minimal sketch of such a threshold-based matcher is shown below.
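A minimal sketch of this traditional approach, assuming a Jaccard token similarity and an illustrative threshold; both choices are made here for illustration and are not taken from [LLG15].

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two serialized records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_match(record_a: str, record_b: str, threshold: float = 0.6) -> bool:
    # Classic threshold-based decision: similar enough -> declared a match.
    return jaccard(record_a, record_b) >= threshold
```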
However, this assumption does not always hold, making this approach fragile in such constellations. To resolve this issue, rule-based systems and manually crafted heuristics were developed, which use predefined rules in combination with similarity metrics to compare records. However, these rules can often only be employed on structured and cleaned data, as they require a certain degree of structuredness and struggle when it is not provided. Additionally, such ER systems often need to be reconfigured for each dataset, as the rules need to be adjusted for each problem individually [ETJ+18]. Classical machine learning models other than neural network architectures, such as RFs or SVMs [BM03] [PB20], have also been used by various researchers. Such ML models are trained on labelled datasets to learn how to distinguish whether pairs refer to the same entity. However, feature engineering is often required before feeding the data to the models, and this process often relies on domain experts to design appropriate comparison features for the model [KQG+19].

ER systems using neural networks, especially language models that utilize deep-learning architectures, are currently the SOTA systems for ER tasks, as they can learn complex patterns in the data due to their large and complex architectures without the need for feature engineering. These systems leverage so-called transformer-based architectures, which are highly effective for problems requiring an understanding of long-distance relationships between tokens or token sets [VSP+17]. This means they can identify relationships between tokens even if they are positioned far apart in the text. Since the current SOTA ER systems leverage models using this architecture, we will briefly explain how they work, as such systems will be used for evaluation later.

Transformer: Transformer-based deep learning architectures use so-called Transformers that rely on the Attention or Self-Attention mechanism to learn the representation of the underlying problem more efficiently and to better capture long-distance relationships within text [VSP+17]. Transformers use a multi-head attention mechanism that allows them "to attend to information from different representations jointly" [VSP+17].

Attention: Attention is a mechanism that allows a model to focus on different parts of an input sequence (keys) when processing a query, assigning weights to each part based on its relevance [VSP+17]. These weights determine how each part contributes to identifying the correct value. Self-attention is a specific attention mechanism where the keys contain the same components as the values. When dealing with text, this allows the model to relate every word of a text to every other word, which helps to capture the long-range relationships crucial in NLP tasks such as ER [VSP+17]. This is especially relevant in the schema-agnostic ER setting, where schematic information is discarded and thus only attribute values are used [PP18] [TSS20]. Therefore, correctly relating the records may be more difficult when schematic information is not provided. When dealing with large texts, transformer-based architectures have an advantage over other types of architectures, such as Recurrent Neural Networks (RNNs), which, for instance, suffer from the vanishing gradient problem. Transformers correlate all queries with all keys, allowing them to capture long-range dependencies more effectively.
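As a minimal sketch of the mechanism described above, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V; the PyTorch snippet below is illustrative only and omits the multi-head projections and masking used in full Transformer implementations.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention as in [VSP+17]: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # relevance of every key to every query
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 per query
    return weights @ V                             # weighted combination of the values

# Self-attention: queries, keys and values all come from the same token embeddings.
tokens = torch.randn(1, 6, 32)                     # (batch, sequence length, embedding dim)
contextualized = scaled_dot_product_attention(tokens, tokens, tokens)
```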
Capturing such long-range dependencies is especially important in the ER task when dealing with records containing much textual information. Another advantage is that, due to the absence of a recurrence mechanism, computations can be parallelized on Graphical Processing Units (GPUs), resulting in faster processing [VSP+17].

Pre-Training

Due to this new architecture, it became possible to train even larger models, leading to the development of large language models. These models are typically pre-trained using self-supervised learning tasks, in this case language-specific tasks, allowing them to capture complex semantic and syntactic relationships. This allows models to learn difficult patterns without the labelled datasets that would otherwise typically be required. To understand the characteristics of a language, large corpora of unlabeled data are used to train the neural network in combination with language-specific pretext tasks. Common pre-training tasks for textual input are:

• Masked Language Modeling (MLM): This pre-training task first masks random tokens in a sequence of tokens and then asks the model to predict the missing part given the surrounding context. This aims to train the model to learn the bidirectional relationships within the text and to consider preceding and following tokens in a sequence when computing a solution.

• Next Sentence Prediction (NSP): Here, the model is given two spans of text, and the question is whether the order of the spans is correct. When considering textual input, the question would be: is the logical order of the sentences correct or not? That way, the model learns to identify the relationships between larger sets of sequences and not to focus solely on specific tokens.

Various other pre-training tasks can be used to pre-train a model. The emergence of large language models led to models like BERT, which were trained on the mentioned pre-training tasks, or even larger models known as foundation models. Foundation models like GPT use many more parameters and are trained on a larger number of pretext tasks than smaller language models like BERT. They can already solve various tasks without the need for fine-tuning.

Figure 2.2: Self-Supervised Learning [PA23]

Fine-Tuning

On the other hand, smaller language models often require a fine-tuning step to perform well on the downstream application. The task that the model should fulfil in the downstream application is learned by directly solving the domain-specific task, for which often only a restricted amount of labelled data exists. This also applies to the ER task. When considering ER as a fine-tuning task, a model backbone pre-trained on linguistic tasks or on differentiating similar from non-similar pairs is commonly utilized [LLS+20]. The fine-tuning task then often involves determining, for a set of pairs of representations, whether each pair refers to the same or different entities. Since fine-tuning requires an additional labelled dataset, one has to consider the additional time and cost potentially needed to develop such datasets. Especially for the ER downstream task, identifying matches and non-matches across various sources for training is not trivial, as small differences in representation can change the label, often requiring the knowledge of domain experts. Therefore, authors try to identify solutions that perform well even when using small training sets.
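As a rough illustration of such a fine-tuning setup (not the implementation of any specific system discussed below), a pre-trained transformer can be wrapped with a binary classification head and trained on serialized record pairs; the model name, the serialization and the label are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical serialized record pair and label (1 = match, 0 = non-match).
left = "title the matrix year 1999 runtime 136"
right = "name matrix, the (1999) director wachowski"
label = 1

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# The two records are encoded as one sentence pair; the classification head on top
# of the pooled representation predicts match vs. non-match.
batch = tokenizer(left, right, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([label]))
outputs.loss.backward()  # an optimizer step over many such pairs completes fine-tuning
```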
2.1.2 State-Of-The-Art Entity Resolution Matching Systems

As previously mentioned, SOTA ER systems leverage language models. In this section, we provide an overview of these systems.

Ditto

One widely recognized ER matching system using a large language model is Ditto [LLS+20]. Ditto leverages pre-trained language models such as BERT, RoBERTa or DistilBERT as the backbone and fine-tunes these on the binary ER task.

Figure 2.3: EM architecture using Ditto as a matcher [LLS+20]

In addition, Ditto offers several configurations to optimize effectiveness when resolving entities [LLS+20]:

• Domain Knowledge Injection: When providing pairs of records as input to Ditto, users can utilize a specific serialization format offered by the tool using [COL] and [VAL] tokens to represent attribute names and their corresponding values. Furthermore, Ditto enables the incorporation of domain knowledge via span types, allowing the model to concentrate on spans of the same type, which can lead to better results.

• Span Normalization: Ditto also allows developers to integrate span normalization, with which several rewriting rules can be specified. These rules rewrite "syntactically different but equivalent spans into the same string" [LLS+20], e.g. when dealing with abbreviated words or synonyms. This should improve the model's capability to recognize equivalent spans despite their being represented differently.

• Summarization: Attribute values can sometimes contain long sequences of tokens, many of which provide little useful information. Ditto employs a TF-IDF summarization technique to reduce the number of tokens, keeping only those with high TF-IDF scores. This allows the model to focus on the most informative parts of the input data.

• Data Augmentation: Ditto offers certain data augmentation options to increase the training set size or make the model more robust to corrupted entries. These include deleting spans of tokens or entire attributes, shuffling tokens within a span or across attributes, or swapping the order of two data entries.

The authors of Ditto noted that performance improved with any of these configurations activated, with the summarization option demonstrating the most significant increase in effectiveness.

Sudowoodo

Another promising matching system utilizing pre-trained language models is Sudowoodo [WLW23]. The authors of Sudowoodo aimed to address challenges related to label requirements and task variety. To achieve this, they employed a contrastive pre-training approach to learn data representations and relationships between inputs in an unsupervised manner. Contrastive pre-training leverages a contrastive loss function to differentiate between learned representation spaces without requiring labelled data. This technique is commonly used in image processing, where various data augmentation methods can modify images without altering their fundamental meaning. The model must determine whether the augmented versions correspond to the original, enabling it to learn a "meaningful representation where similar data items are close in the representation space while pairs of dissimilar data items are far apart" [WLW23]. This effectively allows the model to distinguish between similar and dissimilar inputs based on their representations. According to the authors [WLW23], the goal of the contrastive learning step is that fewer labelled examples are required to fine-tune the model while still achieving reasonable results.
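To make the idea of a contrastive objective concrete, the following sketch shows a simplified in-batch contrastive (NT-Xent-style) loss over two augmented views of the same records; it illustrates the general technique, not Sudowoodo's actual implementation, and it omits intra-view negatives and other details.

```python
import torch
import torch.nn.functional as F

def simple_contrastive_loss(z1, z2, temperature=0.07):
    """z1[i] and z2[i] are embeddings of two augmented views of record i.
    Each view should be closest to its counterpart and far from all other records."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities between views
    targets = torch.arange(z1.size(0))   # the i-th row should match the i-th column
    return F.cross_entropy(logits, targets)
```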
Figure 2.4: The Sudowoodo EM pipeline [WLW23]

Sudowoodo implements the following steps:

1. Contrastive Learning for Pre-training Models: Sudowoodo utilizes existing pre-trained language models, such as BERT, RoBERTa, or DistilBERT, and adjusts their parameters when performing the contrastive learning task. It implements various data manipulation techniques, such as token manipulation or swapping and deleting attributes, to modify data. After applying a transformation, an additional manipulation option called cutoff can be used, which adjusts the feature, span, or token representation in the embedding space by setting the corresponding dimensions to zero, effectively serving as a regularization technique. Furthermore, Sudowoodo employs cluster-based negative sampling, where the Cosine Similarity on TF-IDF vector representations is used to sample more similar records, thus potentially forcing the model to learn more difficult patterns to solve the problem. After that, the pre-trained model can be used for further processing.

2. Pseudo-labelling: When users struggle to obtain sufficient labelled data for fine-tuning, Sudowoodo offers the option to generate labelled data from unlabelled datasets through its built-in pseudo-labelling approach. To generate pairs, Sudowoodo utilizes Cosine Similarity on the embedded representations of records to identify likely matches or non-matches. If the similarity value is above the match threshold, the pair is considered a match; the same is done for identifying non-matches. To identify meaningful thresholds, the authors use a hill-climbing heuristic from the hyperparameter optimization framework Optuna [WLW23] [ASY+19].

3. Fine-tuning: After pre-training a model and identifying a small set of labelled data, the user can fine-tune the model. The authors modified the fine-tuning procedure, noting that typically, records are concatenated first, and then the concatenated versions are used for fine-tuning. However, only individual records (not concatenated ones) are utilized for pre-training a model. To address this difference, the fine-tuning procedure additionally captures the differences between data representations beyond simple concatenation. Moreover, it incorporates the individual record representations alongside the concatenated version, enhancing the model's ability to fine-tune more effectively, as illustrated in Figure 2.5. Once fine-tuned, the model is ready for evaluation.

Figure 2.5: Fine-Tuning in Sudowoodo [WLW23]

GPT

Foundation models like OpenAI's GPT or Google's Gemini models are large pre-trained language models containing hundreds of billions of parameters that are able to solve a wide range of tasks across various domains [TZCM22]. Recently, researchers have also begun to utilize these systems for ER tasks. In the paper "Entity Matching using Large Language Models" [PB23], the authors explore how such systems perform when used for ER. The advantage of such systems is that users do not necessarily need to search for additional labelled data to fine-tune the model, as these models often already perform well without changing any configurations [PB23].

Figure 2.6: Prompting Foundation Models [PB23]

Beyond that, the authors assessed how different prompts affect the performance of several large language models in ER tasks. They tried to understand what information needs to be included in such a prompt to receive good results; a sketch of such a prompt is shown below.
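A minimal sketch of this kind of zero-shot prompting, assuming the openai Python client; the prompt wording and the serialized records are illustrative and not taken from [PB23].

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

record_a = '{"title": "The Matrix", "year": 1999, "runtime": 136}'
record_b = '{"name": "Matrix, The (1999)", "director": "Wachowski"}'

prompt = (
    "Do the following two records refer to the same real-world entity? "
    "Answer with 'Yes' or 'No'.\n"
    f"Record A: {record_a}\n"
    f"Record B: {record_b}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. "Yes"
```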
To this end, different kinds of prompts were created and evaluated. The conclusion was that there is no single perfect prompt, as the best choice depends on the respective dataset and the model in use and must be adjusted for each combination. They also found that prompts only significantly benefited models with low zero-shot performance. It was noted that the GPT-4 model delivered solid results in zero-shot scenarios, often matching or surpassing the performance of smaller, fine-tuned models such as BERT or RoBERTa [PB23]. While the benefit of avoiding the need for fine-tuning is clear, the authors emphasize that the cost of using such large models should also be carefully weighed [PB23].

2.2 Existing Entity Resolution Benchmarks

Deep learning-based solutions currently demonstrate the highest effectiveness in solving ER tasks. The effectiveness of such systems under certain conditions is compared on so-called benchmarks. Benchmarks in the context of ER are standardized datasets used to assess the effectiveness of various approaches [PDB23]. They allow researchers to quantify and compare different solutions under the same conditions. Specific metrics like accuracy, precision, recall or the F1 measure are mainly used to quantify the performance of ER systems on ER benchmarks. ER benchmarks contain pairs of representations that need to be matched. Additionally, for each pair, a label is provided that indicates whether the two records match or do not match. As mentioned in Section 2.1.1, current ER solutions often fine-tune large language models for the ER task. Therefore, authors of recently published benchmarks often provide additional training and validation sets for fine-tuning the model. The test set typically contains records not contained in the development set to ensure the model does not simply memorize data seen during training.

Several authors of ER benchmarks have tried to analyze how current ER solutions perform under different conditions, e.g. when the structuredness or the number of sources used changes, referring to such adjustable properties as benchmark dimensions [PDB23]. In addition to evaluating how semi-structured data influences the performance of SOTA ER solutions, we therefore also aimed to identify specific configurations within semi-structured data that may affect performance. Identifying these configurations can help future researchers focus on ensuring that their ER systems are robust in such difficult scenarios. Therefore, an analysis of various configurations of certain benchmark dimensions, which have previously been used without a specific focus on semi-structured data, is also conducted in this work.

2.2.1 Benchmark Dimensions

This section presents the benchmark dimensions that previous work focused on.

Domain

Numerous benchmarks for ER have been published, and each typically focuses on data from different domains. Peeters, Der and Bizer [PDB23] used data sources from the product domain. The developers of the benchmark Alaska [CAF+21] concentrated their interest on e-commerce data, whereas other researchers focused on historical voter data from North Carolina [PDWW21]. The motivation for developing benchmarks across various domains originates from the inherent complexity of data in different domains and the interest in seeing whether matching systems can perform ER well across these diverse tasks.

Structuredness

Current benchmarks exhibit varying levels of structuredness.
Many benchmarks provide highly structured data that is well organized in a well-defined schema. Examples are the popular DBLP-ACM and Abt-Buy benchmarks provided by the Database Group Leipzig¹ [KTR10] or the benchmarks developed by the students of the Data Science Class CS 784 at UW-Madison² [KDS+16]. However, other benchmarks, like WDC Products [PDB23], offer a mix of structured and less structured data. Alaska's publishers also mention using data sources from various domains containing different levels of structuredness [CAF+21]. Since performance often degraded when using benchmarks that did not provide well-formed schemas, it is of interest to investigate whether this is particularly true for semi-structured benchmarks.

¹ https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution
² https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository/description-of-the-784-data-sets?authuser=0

Size of Fixed Data Splits

As mentioned before, recently developed benchmarks for ER provide, in addition to the test set, training and validation data for fine-tuning. However, several older published benchmarks exist that do not provide these splits, as deep learning solutions were not that popular then. This makes it difficult to compare the effectiveness of matching systems across such benchmarks, as authors might use different training configurations that influence the outcome. If benchmarks provide such development sets, then the size of these sets can also play a decisive role, as it can significantly influence the performance of the matching systems [PDB23]. A low amount of training and validation data gives the models only a small view of the problem. In contrast, large development sets may provide a better understanding of the underlying task. The authors of WDC Products provide small (2,500 training and validation pairs), medium (6,000 training, 3,500 validation) and large (ca. 19,000 training, 4,500 validation) development sets, whereas Primpeli and Bizer [PB20] already assign benchmarks to the category small when they provide development sets containing only 300-400 pairs. Usually, performance decreases when utilizing a smaller training set.

Figure 2.7: Scenarios for Benchmark Construction [WLF+22]

Seen and Unseen Entities

When creating training, validation, and test sets to train a neural network, authors usually avoid including a record in multiple sets because this can lead to overfitting, meaning that the model only memorizes specific records instead of generalizing to unseen data. In ER, it also matters whether other records of the same entity were already seen during training, even if the specific record itself is not reused across sets. Several publications [PDB23] observed that many existing benchmarks have a so-called "Restricted Entity Assumption" [WLF+22], where entities that were already seen during training are also included in the test set. The authors [WLF+22] claim that matching systems trained and evaluated on such data may generalize less and tend to overfit more.
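A minimal sketch of how such an entity-disjoint split can be produced, so that the entity clusters used for testing never appear during training (roughly the "open matching" scenario described next); the pair representation is a simplifying assumption made here.

```python
import random

def entity_disjoint_split(pairs, test_ratio=0.2, seed=42):
    """Each pair is (record_a, record_b, cluster_a, cluster_b, label). Clusters are
    split first, so every test pair involves only entities unseen during training."""
    clusters = sorted({c for p in pairs for c in (p[2], p[3])})
    random.Random(seed).shuffle(clusters)
    held_out = set(clusters[: int(len(clusters) * test_ratio)])
    test = [p for p in pairs if p[2] in held_out and p[3] in held_out]
    train = [p for p in pairs if p[2] not in held_out and p[3] not in held_out]
    return train, test  # pairs straddling both groups are discarded in this sketch
```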
The authors of [WLF+22] created benchmarks for these scenarios, illustrated in Figure 2.7:

• Open Matching: Considers only pairs of records whose entities were not seen during training.

• Cluster-focused Matching: Includes records that were not shown during training, even though other records of the same entities were seen during training.

• Record Linking: Uses pairs where records seen during training are linked with records from the same entity that were not shown during training.

They observed in their evaluation that their model achieved the highest F1-score when performing Record Linking, while Cluster-focused Matching and Open Matching decreased performance, suggesting that benchmark developers should carefully consider this aspect when developing their next benchmark.

Corner Cases

Corner cases, also known as edge cases, generally describe scenarios that are more challenging to solve. For ER, the authors of WDC Products analyzed the performance of ER systems when encountering such corner cases. They argue that corner cases are cases where similarity measures fail to capture the correct relationship between two records [PDB23]. Two types of corner cases are:

• High Similarity, Different Entities: Pairs of representations that appear highly similar according to specific similarity metrics but do not represent the same real-world entity.

• Low Similarity, Same Entity: Pairs of representations that appear highly dissimilar according to similarity metrics but refer to the same real-world entity.

Figure 2.8: Example of hard and easier matching and non-matching offer pairs from WDC Products [PDB23]

Figure 2.8 provides an example of easy and difficult matches and non-matches. The hard match involves two similar entries that nevertheless differ in description and price, making correct identification more challenging. Similarly, the hard non-match example features entries where the price and currency differ, but the brand and title are similar. Providing benchmarks containing such corner cases can help determine whether an ER system can also handle scenarios that require more than just similarity features to effectively perform ER.

Label Distribution

The distribution of matching and non-matching pairs in benchmarks is an important factor, as different distributions can influence the difficulty of the matching task. Some researchers argue that the label distributions in development and test sets are often overly optimistic. Often, similar distributions are used across these sets, a practice considered unrealistic [WLF+22]. In real-world scenarios, the ratio between matching and non-matching pairs is unknown and "diverges significantly" [WLF+22]. Current benchmarks, according to [WLF+22], often use lower ratios, which is not considered a realistic scenario, as non-matches vastly outnumber matches in real-world applications.

Benchmark                Matched:Mismatched  Seen Clusters  Seen Records
abt-buy [MLR+18]         ≈1:6                99%            96%
amazon-google [MLR+18]   ≈1:6                99%            97%
dblp-acm [MLR+18]        ≈1:20               100%           100%
dblp-scholar [MLR+18]    ≈1:15               100%           100%
walmart-amazon [MLR+18]  ≈1:12               100%           99%
wdc cameras [PPB19]      ≈1:3                100%           78%
wdc watches [PPB19]      ≈1:3                100%           81%
wdc computers [PPB19]    ≈1:3                100%           72%
wdc shoes [PPB19]        ≈1:3                100%           62%

Table 2.1: Previous Benchmark Configurations [WLF+22]
for example, after performing a blocking step, developers are often left with groups containing over 100 candidate records. Because of the long-tail phenomenon, it is realistic that only a single instance within a group is a match, making this ratio reflective of practical EM applications [WLF+22].

Single Modality Assumption

Several options and formats exist to represent a record of an entity. Benchmarks often provide only the textual representations, discarding the potentially valuable information provided in other formats. Different modalities like images, audio or videos, which can also provide valuable insights into the problem, are often discarded and, therefore, not considered [WLF+22]. Researchers therefore suggest incorporating different modalities when developing benchmarks to provide current solutions with more information and improve their ability to solve the tasks more effectively [WLF+22].

Amount of Attributes

Textual representations of entities, when provided in structured or semi-structured formats, use attributes to store information in an organized manner. The quantity and length of attributes often vary across different sources: fewer attributes typically result in less granular information, whereas more attributes often lead to a better distribution and splitting of information. Certain authors [CAF+21] [PB20] profiled ER benchmarks and the sources used in their creation, measuring the number of attributes provided by the sources. Since sources containing more information may increase the performance of ER systems, this aspect can also be analyzed in the context of the ER task on semi-structured data.

Highly Identifying Attributes

Similar to the number of attributes, the minimum number of attributes required to solve the task effectively is also an important criterion. Primpeli and Bizer refer to this dimension as "Schema Complexity" and illustrate that matching systems requiring many attributes to perform effectively on the ER task indicate that the benchmarks involved are of higher difficulty [PB20]. Such an analysis can also be applied in our setting, focusing on semi-structured data to assess whether the task quickly becomes more difficult when discarding essential attributes.

Attribute Sparsity

Attribute sparsity indicates the proportion of attributes that are missing information relative to all attributes. In structured datasets, default values such as "NULL", "-", "N/A", and many others are often used as placeholders for missing values. According to the authors, attribute sparsity can be computed as:

AS(S) = 1 − (∑_{r∈S} |schema(r)|) / (|schema(S)| · |S|)

where
• S is a set containing records r,
• |schema(r)| is the number of non-missing attributes in a single record r,
• |schema(S)| is the total number of unique attributes across all records in the set S, including missing and non-missing attributes.

Figure 2.9: Formula for Attribute Sparsity [CAF+21]

Several benchmark publishers have investigated how varying attribute sparsity levels affect the performance of matching systems. They observed that when attribute sparsity is high, i.e., many attribute values are missing, less information is available, often making the task more challenging and thus decreasing the performance of ER matching systems. Typically, a sparsity value above approximately 0.4 to 0.5 is considered high, while a value below 0.2 to 0.3 is often considered low [CAF+21] [PB20].
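To make the definition above concrete, the following minimal Python sketch computes attribute sparsity for a list of records given as dictionaries. The helper name attribute_sparsity and the set of placeholder strings treated as missing values are our own illustrative assumptions and are not taken from [CAF+21].

# Minimal sketch (our own illustration): attribute sparsity of a record set,
# following AS(S) = 1 - (sum_r |schema(r)|) / (|schema(S)| * |S|).
PLACEHOLDERS = ("", "NULL", "-", "N/A")  # assumed markers for missing values

def is_missing(value) -> bool:
    return value is None or value in PLACEHOLDERS

def attribute_sparsity(records: list[dict]) -> float:
    # schema(S): all unique attributes observed across the record set
    schema_s = set()
    for record in records:
        schema_s.update(record.keys())
    if not schema_s or not records:
        return 0.0
    # sum over records of |schema(r)|, the number of non-missing attributes
    filled = sum(
        sum(1 for value in record.values() if not is_missing(value))
        for record in records
    )
    return 1.0 - filled / (len(schema_s) * len(records))

if __name__ == "__main__":
    sample = [
        {"name": "Oat Drink", "brand": "Acme", "energy_kcal": None},
        {"name": "Oat Milk 1L", "brand": "N/A"},
    ]
    print(round(attribute_sparsity(sample), 3))  # 0.5 for this toy sample

For this toy sample, three unique attributes exist across two records, of which three attribute values are filled, yielding AS = 1 − 3 / (3 · 2) = 0.5, i.e. a high sparsity according to the thresholds cited above.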
Attribute Value Length

Attribute value length, or Textuality [PB20], refers to the dimension that analyzes the length of attribute values and examines how entity matching systems handle pairs of records that exhibit varying attribute value lengths. Attributes with many tokens and characters are considered highly textual, while those with fewer tokens are classified as low textual [PB20]. The average value length can be computed by summing the number of tokens or characters over all attribute values and dividing by the number of provided attribute values. Primpeli and Bizer [PB20] claim that long attribute values may present increased challenges in the ER task; such attributes might contain too much information that could be split into further attributes. The effectiveness of deep learning solutions was not explored in detail, but evaluations on highly textual benchmarks, such as Amazon-Google3, report smaller F1 scores [PB22]. This suggests that highly textual entries may increase the task complexity. However, further analysis is needed to fully understand their impact on deep learning solutions.

Under/Over-represented Entities

In ER benchmarks, entities are often represented by varying numbers of instances, creating an imbalance in their distribution [CAF+21]. This imbalance can significantly impact the ER task, especially if the distribution of represented entities changes between training and testing. For instance, during training, the model may learn the representations of a limited subset of entities, which can negatively affect performance when tested on a dataset containing a wider range of entities. The authors of Alaska [CAF+21] note that uneven distributions of cluster sizes can increase the problem's difficulty and thus affect the performance of ER systems.

Combined Dimensions

Most ER benchmark developers focus on observing how the performance of matching systems changes when varying one specific dimension. However, only a few papers have analyzed the impact of combining multiple dimensions, which may reveal further interesting settings. The WDC Products benchmark was created to determine how well matching systems perform when specific benchmark dimensions are combined. The authors [PDB23] varied dimensions such as the development set size alongside the percentage of corner cases and the percentage of unseen entities to assess the performance of matching systems under these different conditions. They created benchmarks for all combinations of development set sizes (Small, Medium, Large), corner cases (20%, 50%, 80%) and unseen entities (0%, 50%, 100%) and found that specific combinations, e.g. a large percentage of corner cases in combination with many unseen entities, decrease the performance across all matching systems tested and thus may be interesting for further analysis. Another study grouped existing benchmarks based on specific combinations of characteristics [PB20]. The authors combined profiling dimensions such as schema complexity, textuality, attribute sparsity, development set size and corner cases to identify characteristics of easier and more difficult ER tasks. They classified benchmarks into various categories, including one labelled Dense Data and Simple Schema, which, for instance, covers tasks with low schema complexity (≤ 4 relevant attributes), high density (> 0.94), and short attribute values (< 7 words) [PB20].
Classifying existing benchmarks into these groups revealed how each group's characteristics influenced the benchmarks' difficulty, often indicating whether a benchmark was relatively easy or challenging. The goal was to assess the performance of ER matching systems in difficult scenarios and whether they can handle these. They observed that benchmarks containing more corner cases and long attribute values are more difficult for ER systems to handle. It is worth noting that they used Random Forests and Support Vector Machines for this kind of evaluation [PB20].

3https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution

Data Manipulation

Data manipulation is a commonly used technique in data augmentation. It is typically used to increase the development set size when labelled data is limited or insufficient. In ER tasks, data manipulation is also employed to evaluate the robustness of matching systems in response to data changes, which often occur in real-world applications. Numerous data manipulation techniques have been utilized in previous work [IRV13] [HPWR20], including the following (a small sketch illustrating a few of them follows after the list):

• Character Variation: Inserting, deleting and updating characters of attribute values are simple data manipulation techniques to add noise to the attribute values and thus potentially increase the difficulty.

• Token Variation: Beyond focusing only on characters, inserting, deleting or updating entire tokens at random positions may also significantly change the meaning of the whole token sequence.

• Encoding Transformation: This technique involves changing the data representation from one format to another. For example, binary representations such as {1, 0} can be converted to {true, false} without losing meaningful information. This transformation may confuse models and thus can also be applied when manipulating data.

• Format Transformation: This includes changing the format of the given data into another structure without losing too much information. For instance, a date format transformation might change "01.02.16" (DD.MM.YY) to "02.01.16" (MM.DD.YY). This kind of transformation is often performed in real-world applications, and hence, assessing the impact of changing the format of records for ER can be interesting.

• Synonyms and Homonyms: Another data manipulation involves replacing words with their synonyms or homonyms. Synonyms are words with the same meaning, while homonyms are words spelt the same but with different meanings. For example, "purchase" and "buy" are synonyms, while "ruler" (a person who governs) and "ruler" (a tool for measuring) are homonyms. Replacing tokens with their synonyms or homonyms may increase the difficulty for EM matching systems, especially if they have not encountered such variations during training.

• Acronyms and Abbreviations: Acronyms and abbreviations are shortened forms of one or a sequence of tokens, such as "NASA" for "National Aeronautics and Space Administration" or "Dr." for "Doctor". ER systems are also expected to handle such cases; hence, this kind of manipulation can also be employed when creating benchmarks.

• Splitting, Fusing, or Swapping Attribute Values: These transformations change the data structure by separating a single attribute value into multiple parts, combining multiple values into one, or switching the order of attributes.
For example, one might split "John Doe" into separate "First Name" and "Last Name" fields or fuse "123 Main St, Apt 4B" into a single address field. Similarly, swapping attribute values, such as reversing "City, Country" to "Country, City", can also impact the performance of EM systems. Applying these transformations when creating benchmarks may help to ensure that a system can handle such variations.

• Multilingualism: Representing records in different languages may notably increase the difficulty for EM systems. Only a few EM systems are trained on multilingual data; thus, it could be interesting to see how this change influences effectiveness. Additionally, one could assess how variations in syntax and grammar across languages are handled when solving ER tasks by manipulating the data accordingly.

• Combinations: Finally, all mentioned data manipulation techniques can also be applied in combination, further increasing the difficulty of benchmarks and allowing for the assessment of whether ER systems can handle more challenging scenarios.

In summary, when authors create benchmarks, one can observe that they often vary specific data characteristics or other factors and assess the performance of ER systems under these variations. Current assessments have not specifically focused on semi-structured data, making it interesting to explore how these configurations influence performance in this context.
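The following minimal Python sketch illustrates how a few of the listed manipulations (character variation, token deletion, and a date format transformation) could be applied to attribute values. The function names and the regular expression for dates are our own illustrative assumptions and are not taken from [IRV13] or [HPWR20].

# Minimal sketch (our own illustration) of three of the manipulations listed above.
import random
import re
import string

def character_variation(value: str, rng: random.Random) -> str:
    """Insert a random character at a random position to add noise."""
    if not value:
        return value
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(string.ascii_lowercase) + value[pos:]

def token_deletion(value: str, rng: random.Random) -> str:
    """Delete one randomly chosen token, possibly changing the meaning."""
    tokens = value.split()
    if len(tokens) <= 1:
        return value
    del tokens[rng.randrange(len(tokens))]
    return " ".join(tokens)

def date_format_transformation(value: str) -> str:
    """Rewrite DD.MM.YY dates as MM.DD.YY (format transformation)."""
    return re.sub(r"\b(\d{2})\.(\d{2})\.(\d{2})\b", r"\2.\1.\3", value)

if __name__ == "__main__":
    rng = random.Random(42)
    record = {"title": "Logitech C920 HD Pro Webcam", "release": "01.02.16"}
    noisy = {
        "title": token_deletion(character_variation(record["title"], rng), rng),
        "release": date_format_transformation(record["release"]),
    }
    print(noisy)

Such perturbation functions can be applied to individual attribute values of a benchmark record, either in isolation or chained, to obtain increasingly difficult variants of the same pair.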
2.2.2 Benchmark Development Approaches

Authors have approached the development of new ER benchmarks in various ways. After defining a specific evaluation goal for a new ER benchmark, the question arises what data to use and how to generate the benchmark. Since many data sources do not permit reuse or do not provide the data in the desired format, it is not always clear what the best approach is. The authors of Alaska use a system called DEXTER [QBD+15], which contains product-specific data crawled from the web. Web crawlers automatically browse and collect information from websites to build large datasets. In the benchmark development process of WDC Products, the authors also focused on data crawled from the web, using the Common Crawl, one of the largest web corpora available [PDB23]. To work with data provided in such corpora, authors typically need to apply several pre-processing steps. These include selecting pages with relevant information, filtering out non-relevant information within each chosen page, and performing additional data cleaning steps to ensure the data is usable and meets the desired requirements. Additionally, domain experts were consulted to verify labels that were likely assigned correctly by other automated systems [QBD+15] or were already present in the data but could be incorrect [PDB23]. Panse et al. [PDWW21], on the other hand, did not manually label the records when dealing with historical voter data from North Carolina: since unique identifiers that distinguish people reliably were provided with much care, there was no need to consult domain experts [PDWW21]. These approaches to benchmark development show that using data from large web corpora has advantages, such as offering diverse representations of sources, which tests whether ER systems can effectively capture complex relationships between data. On the other hand, such corpora often require many pre-processing steps, architectures capable of managing large data volumes, and other resources, for instance, when consulting domain experts to label data. All this information will be utilized and carefully considered when developing our benchmarks.

2.3 Benchmark Difficulty Measures

As outlined in Section 2.1.1, various approaches to Entity Resolution exist, with large language models being among the most prominent methods currently in use. These models come in various sizes, and for some ER tasks, especially when the inherent difficulty is not too high, smaller models can be utilized effectively to achieve satisfactory results. Papadakis et al. [PKCP23] employed a different approach to assess the difficulty of benchmarks without relying on ER systems based on large language models. The analysis begins with measuring the Degree of Linearity [PKCP23] of ER benchmarks.

Degree of Linearity

This measure assesses whether a linear classifier can successfully resolve the entities of a given benchmark. In their evaluation, the authors use the Jaccard and Cosine similarities when classifying matches and non-matches. Jaccard Similarity4 measures the number of tokens shared by two sets A and B, divided by the number of tokens across both sets:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Cosine Similarity5 measures the similarity between two vectors in a multi-dimensional space:

Cosine Similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

Since Cosine Similarity expects numerical vectors as input, the sets A and B are often represented by Term Frequency-Inverse Document Frequency (TF-IDF) vectors. TF-IDF6 measures the relevance of a word in a document relative to a corpus:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

where TF(t, d) is the frequency of term t in document d, and IDF(t, D) measures how much information the term t provides across all documents D.

The authors employ a threshold-based classification using both similarity measures. First, they measure the similarity between each pair and save these values. Then, for each threshold between 0.01 and 0.99 with a step size of 0.01, they evaluate the F1 score, using this threshold as the boundary for classifying matches and non-matches. The highest F1 score obtained across all thresholds is reported as the degree of linearity. A high score indicates that the classification task is relatively simple for linear classifiers, suggesting that large language models are unnecessary, as smaller models could also solve the task effectively.

4https://en.wikipedia.org/wiki/Jaccard_index
5https://www.datastax.com/de/guides/what-is-cosine-similarity
6https://en.wikipedia.org/wiki/Tf-idf
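As a concrete illustration of this threshold sweep, the sketch below computes a TF-IDF/cosine-based degree of linearity for a small set of labelled record pairs using scikit-learn. The serialization of records into plain strings, the variable names, and the toy example are our own simplifying assumptions rather than the exact procedure of [PKCP23].

# Minimal sketch (our own simplification): degree of linearity via a threshold
# sweep over cosine similarities of TF-IDF vectors, as described above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity

def degree_of_linearity(pairs: list[tuple[str, str]], labels: list[int]) -> float:
    """Return the best F1 score achievable by thresholding cosine similarity."""
    corpus = [text for pair in pairs for text in pair]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    # Cosine similarity between the two records of each pair.
    sims = np.array([
        cosine_similarity(tfidf[2 * i], tfidf[2 * i + 1])[0, 0]
        for i in range(len(pairs))
    ])
    best_f1 = 0.0
    # Sweep thresholds 0.01, 0.02, ..., 0.99 and keep the best F1 score.
    for threshold in np.arange(0.01, 1.00, 0.01):
        preds = (sims >= threshold).astype(int)
        best_f1 = max(best_f1, f1_score(labels, preds, zero_division=0))
    return best_f1

if __name__ == "__main__":
    pairs = [
        ("logitech c920 hd pro webcam", "logitech c920 webcam full hd"),
        ("logitech c920 hd pro webcam", "canon eos 2000d dslr camera"),
    ]
    labels = [1, 0]  # match, non-match
    print(round(degree_of_linearity(pairs, labels), 2))

A Jaccard-based variant works analogously by tokenizing both records and computing the intersection-over-union of their token sets instead of the TF-IDF cosine similarity.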
Complexity Measures

Additionally, the authors considered other measures implemented in a Python framework called "Problexity" [KK23]. It uses so-called Complexity Measures [PKCP23] to determine the difficulty of ER benchmarks. These measures evaluate the potential existence of complex patterns in the underlying data, which can indicate a higher difficulty in solving the problem. "Problexity" combines various measures into a single value to indicate whether the task is likely to be complex. The following measures were used and are all mentioned in [PKCP23]:

1. Feature overlapping measures: These measures focus especially on the numeric features and how well they contribute to solving the problem correctly. Examples include the maximum Fisher's discriminant ratio or measures that evaluate the degree of overlap between different classes.

2. Neighborhood measures: The neighborhood measures examine how data points from different classes are clustered within small data regions. Additionally, measures that evaluate relationships among nearest neighbours and those that employ k-nearest neighbours (kNN) classifiers or utilize neural networks are used to produce a value indicating the relative ease or complexity of the classification task.

3. Network measures: These measures analyze relationships in a dataset by focusing on instances of the same class and their similarities, which can be helpful for classification tasks. For that, the instances of a dataset are connected through edges if they originate from the same class and the Gower distance7, a distance measure suitable for mixed data types, is smaller than a predefined threshold [PKCP23]. The density measure then reflects, for instance, the ratio of existing edges to the total number of possible edges in the graph.

4. Dimensionality measures: Such measures assess data sparsity by calculating the mean number of instances per dimension or the proportion of relevant dimensions.

7https://en.wikipedia.org/wiki/Gower%27s_distance

The authors represented each pair in the ER benchmark by a two-dimensional feature vector, namely its Cosine and Jaccard similarities, and used this as input to compute the complexity, which yields a value between 0 and 1. A higher complexity indicates that the specific entity resolution task is likely to be more difficult [PKCP23].

2.4 Summary of Related Work

This chapter provided an overview of the ER workflow, including its phases and an introduction to several SOTA ER systems used in the matching process. We reviewed multiple ER benchmarks and highlighted a gap in prior studies, noting that existing evaluations did not specifically focus on how semi-structured data affects the effectiveness of ER systems. We observed that authors analyzed the impact of specific benchmark configurations (Benchmark Dimensions) on ER systems, but did not focus on semi-structured data specifically. Further, various methods for selecting data sources and combining information to create ER benchmarks were reviewed, along with alternative techniques for evaluating benchmark complexity. These findings are considered when designing the benchmarks, guiding both the selection of data sources and the development of semi-structured benchmarks, presented in the next chapter.

3 Benchmark Design

When developing a new benchmark for ER, researchers typically aim to assess whether existing ER matching systems can deal with a certain kind of problem or aspect. This means that each study targets one or multiple benchmark dimensions and assesses the effectiveness of matching systems on this particular new kind of ER task. Our study focuses on assessing ER systems' performance using benchmarks that provide semi-structured data, unlike most benchmarks, which typically rely on fully structured data or do not concentrate on a specific level of structuredness.
This means the goal is to create new ER benchmarks containing data provided in a nested format to analyze how well matching systems can handle such data. Furthermore, we aim to provide benchmarks that allow us to explore how specific dimension configurations, such as the match to non-match ratio or certain data characteristics, including attribute sparsity and value length, impact the difficulty of ER on semi-structured data. This is of interest as previous studies have observed performance decreases in ER systems when encountering benchmarks with specific characteristics [PDB23] [PB20]. We explicitly want to verify whether these challenges are also present when dealing with semi-structured data or whether they become even more pronounced in this context. The benchmark development process is divided into the following steps:

• Data Requirements Definition: Several criteria and requirements must first be defined for the sources used to create the ER benchmarks. Requirements regarding the number of records, the degree of structuredness, and other essential data characteristics are defined to ensure that the benchmark is meaningful and representative.

• Data Source Selection: This section describes the data sources that satisfy the defined requirements and thus were chosen for benchmark creation. The characteristics of these data sources are presented in detail.

• Benchmark Dimensions: This part outlines the benchmark dimensions selected for evaluation across different configurations and explains the significance of each in detail.

• Benchmark Development Process: This section explains in detail the architecture and the individual software components used to create benchmarks.

3.1 Data Requirements Definition

Here we outline several requirements that the data sources should fulfill and explain them in detail.

3.1.1 Structuredness

Before selecting datasets for creating ER benchmarks, it is essential to determine which sources provide the desired data for the specific task. In our scenario, we wanted to identify data sources that provide data in a semi-structured format. Datasets with many nested elements and different levels of nesting were preferred over those with simpler nesting structures, as one goal was to assess whether ER systems can handle data with more complex structures. The increased level of nesting also contributes to greater data heterogeneity, particularly schematic heterogeneity, which is likely to result in a more challenging scenario for the systems.

3.1.2 Labels

Another important aspect to consider is whether the identified data sources provide unique identifiers that can be used to determine whether two records refer to the same real-world entity. Such identifiers are available in certain domains but are not universally provided. For instance, product identifiers such as the Global Trade Item Number (GTIN), vehicle identifiers like the Vehicle Identification Number (VIN), or book identifiers like the International Standard Book Number (ISBN) are widely used in their respective domains and make identifying matches much easier. However, many sources do not include such identifiers in their records. As a result, ER benchmark developers often face the challenge of determining the correct mapping between records and entities. This process typically involves data cleaning, blocking, and manual labelling by domain experts, which requires significant resources and time [PDWW21].
Due to the limited time, we decided to consider only data sources that provide unique identifiers so that matches and mismatches between sources can be identified easily. In addition, we chose to use only sources whose data requires few pre-processing steps. When datasets from large corpora, such as the Common Crawl, are considered, many pre-processing steps, such as identifying meaningful data or cleaning data, are required before the data can be used for creating ER benchmarks [PDB23]. Given the complexity of these tasks and the lack of the necessary infrastructure to handle such a large amount of data, we chose not to consider such data sources. Therefore, the focus was on identifying data sources that do not require substantial pre-processing and already provide labelled data.

3.1.3 Number of Matches

An important consideration when identifying suitable sources is to ensure that enough entries across different sources relate to the same entities. This is necessary because, when creating benchmarks, it is important to have sufficient matches and non-matches to additionally be able to create training and validation sets. Therefore, having a representative number of entries that relate to the same entity is crucial. Peeters and Bizer [PB23] argue that a benchmark should contain at least 150 matches to ensure reliable results. Since we also aim to assess the impact of the development set size when fine-tuning models, it is important to provide development sets of varying sizes. In the WDC Products benchmark, the largest configuration across all sets (training, validation, and test) contains approximately 10000 matching entries [PDB23]. Similarly, we aimed to identify sources from domains where approximately 10000 matching pairs could be generated. Domains with fewer matches were also considered but were less preferred.

3.1.4 Licensing

When reusing data from external sources, it is important to assess the licensing terms to determine which data can be reused and which cannot. Contributors typically publish licenses that outline how users can work with the published data. Since the goal of this work is to make the benchmarks accessible to other researchers, we require data sources that allow data reuse and redistribution in academic contexts. Specifically, the license should grant permission to host and distribute the data on institutional servers, such as those at the University of Stuttgart. This ensures that the benchmarks remain available to the research community, enabling others to evaluate new entity resolution systems and related tools using our datasets. Data sources that do not provide a suitable license are given much lower priority.

3.1.5 Heterogeneity

A key aspect considered in this thesis is the analysis of how varying levels of heterogeneity between sources can influence the difficulty of ER tasks. Therefore, it is important to identify data sources that exhibit varying levels of heterogeneity among themselves. We consider aspects of structural heterogeneity, which refers to variations in data organization or differences in data models. Since we focus on semi-structured data, these benchmarks naturally exhibit more schematic heterogeneity than structured benchmarks due to the more complex organization of the data. Specifically, we aim to identify data sources that, for example, vary in the number of nested elements, the attributes provided or the average attribute sparsity.
We also considered the possibility of analyzing semantic heterogeneity between sources, such as synonyms, homonyms, or different units of measurement (e.g., euro vs. dollars). However, this requires a specific analysis and substantial effort, as sources often provide large numbers of attributes, and thus this aspect was not a primary focus here. Other differences include variations in vocabulary size or the number of tokens provided in attribute values, and we aimed to identify sources that exhibit these differences as well. Since we assess the effectiveness of ER systems in a schema-agnostic setting, these factors influence the task because only the attribute values are considered, not the attribute names.

3.1.6 Number of Data Sources

The number of data sources can influence the difficulty of matching tasks, making it an important consideration when creating ER benchmarks. Since different sources provide multiple representations of real-world objects, matching systems need to be able to distinguish between matches and non-matches across various sources and demonstrate their effectiveness in handling such diversity. Most existing benchmarks contain information from only two sources, but we aim to assess how an increased number of sources affects ER performance, particularly when using semi-structured data. In addition, it is also interesting to assess how over- or underrepresented entities impact the effectiveness of matching systems when working with semi-structured records. Such experiments require that some entities are represented by multiple sources while others are represented by fewer sources. These overrepresented entities can only be provided in the benchmark datasets if we first locate numerous data sources within the same domain and, second, ensure that these sources contain multiple matches where each data source contributes records to the match. Therefore, the goal was to identify multiple sources with semi-structured data from the same domain.

3.2 Data Source Selection

Data sources from various domains that align well with the specified requirements were sought. In particular, we restricted the search to domains, such as videos, books, food, or vehicles, where we knew that the sources may provide identifiers to distinguish between entities. Other domains, such as restaurants, were excluded as they often relied only on coordinates and lacked unique identifiers. Relying on coordinates alone in these domains proved insufficient, as multiple restaurants could exist within the same building, for example, on different floors. Sources containing vehicle information were also discarded. Initially, vehicles appeared promising due to the VIN identifier, which uniquely distinguishes each vehicle. However, the sources obtained were either too structured (e.g., lacking nested elements despite being provided in JSON format1) or we were unable to find suitable semi-structured datasets with vehicle information at all. As a result, although semi-structured data sources were identified, many were excluded due to challenges in reliably differentiating entities or because they did not meet the requirements for structuredness. In addition to these issues, data providers often did not permit data redistribution. However, we found a domain, the food domain, where data sources met the necessary data structure requirements, allowed redistribution, and included unique identifiers.
Food: In the Food domain, we identified two data sources:

• Open Food Facts2 (OFF): Provides an API to access a database whose content is in JSON format and includes varying levels of nested elements. It contains a rich set of attributes (max. 271 unique attributes) characterizing different foods, a high average sparsity (>0.6) and a low average attribute value length (≈6.5).

• FoodData Central3 (FDC): When identifying this source, we initially found several Comma-Separated Values (CSV) files, each containing different types of information (e.g., main food properties, nutrient details, etc.). Since CSV files provide structured data, they were unsuitable for our evaluation, as we focus on semi-structured data. Because we found it difficult to locate other sources with the necessary structure, we decided to restructure this data ourselves. We used the CSV file containing the main food properties and transformed it into a JSON format to better align with our focus on semi-structured data. The other CSV files provided contained additional information, such as nutrient details. Since each food has many nutrients and each nutrient is described by many attributes, we appended, for example, each nutrient as a nested JSON object to the corresponding food record as additional information (a minimal sketch of this merge is shown after Table 3.1). We did this for all food entries and in that way constructed a semi-structured dataset that could be used for benchmark creation and evaluation. We call this modified source FDC Merged. Later, during the evaluation phase, we discovered that FDC also offers a version of the data in JSON format and hence used it in addition when creating benchmarks. We name it FDC Native. This version provides records with more attributes than the FDC Merged dataset and features more deeply nested elements. Both datasets contain longer attribute values (>11) but less sparse data (<0.27) than OFF. Since FDC Merged and FDC Native represent largely the same entities, we create two benchmarks: one using the FDC Merged and OFF data and another using the FDC Native and OFF sources, to assess whether this change of representation affects the performance of matching systems.

1https://www.nhtsa.gov/nhtsa-datasets-and-apis
2https://openfoodfacts.github.io/openfoodfacts-server/api/ref-v2/
3https://fdc.nal.usda.gov/download-datasets.html

Table 3.1: Statistics of Food Data Sources

Name         License     Size    #Max Attributes   Avg. Sparsity   Avg. Value Length (#Chars)   Voc. Size4
OFF          ODbL v1.0   18500   271               0.617           6.43                         40,719
FDC Merged   CC0 1.0     47578   25                0.265           11.24                        34,325
FDC Native   CC0 1.0     46284   56                0.193           12.97                        40,277

4Measured for 10,000 entries.
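As an illustration of how the FDC Merged source was assembled conceptually, the following sketch joins a main food CSV with a nutrient CSV into nested JSON records. The file names (food.csv, food_nutrient.csv) and column names (fdc_id, nutrient_id, amount) are illustrative assumptions and may differ from the actual FoodData Central downloads.

# Minimal sketch (illustrative file and column names, not the exact FDC schema):
# merge a main food CSV with a nutrient CSV into nested, semi-structured JSON.
import csv
import json
from collections import defaultdict

def build_fdc_merged(food_csv: str, nutrient_csv: str, out_json: str) -> None:
    # Group the nutrient rows by the food they belong to.
    nutrients_by_food = defaultdict(list)
    with open(nutrient_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            nutrients_by_food[row["fdc_id"]].append(
                {"nutrient_id": row["nutrient_id"], "amount": row["amount"]}
            )

    # Attach each food's nutrients as a nested JSON array.
    records = []
    with open(food_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            record = dict(row)
            record["nutrients"] = nutrients_by_food.get(row["fdc_id"], [])
            records.append(record)

    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    build_fdc_merged("food.csv", "food_nutrient.json".replace(".json", ".csv"), "fdc_merged.json")

The same pattern of nesting related rows as JSON arrays inside a parent record was applied to the other auxiliary CSV files as well.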
Video: In the Video domain, three data sources were selected:

• The Open Movie Database5 (OMDB): Provides an API, the OMDB API, which returns data about various movies in JSON format. It includes records with nested elements, though fewer than those provided by sources in the Food domain. However, the average attribute sparsity is relatively high (≈0.33), and attributes like "Director" or "Authors" often contain multiple names, indicating that the data is not perfectly separated. Additionally, it provides a high vocabulary size and long attribute values (26.19 characters on average).

• The Movie Database6 (TMDB): Offers an API for non-commercial use, providing various endpoints to query movie information, and returns data in JSON format. When querying the API for a movie using the imdbId identifier, the response provides general information about the movie in a semi-structured way with few nested elements. Since this general information contains only a few nested elements, we queried additional endpoints of the API and incorporated their data, such as crew and cast information, as nested elements. These endpoints return data about individual members and their properties, representing each member as a JSON object. All these JSON objects, representing the members, are added to a JSON array for the specific movie as an additional attribute. Similarly, genres were stored in a JSON array to accommodate multiple values, resulting in more nested elements in the records. This source has a high vocabulary size, and the average sparsity is very low (≈0.06). It has to be noted that, unfortunately, this source does not allow redistribution of data. Since it was difficult for us to find other suitable sources, we have retained this source and will provide the evaluation results using this data. However, we will not publish the benchmarks in this domain.

• IMDB7: Provides datasets on the IMDb Non-Commercial Datasets website containing information about various movies. Unlike the other sources, IMDB offers its data in CSV format, which classifies it as a structured dataset and is typically not ideal for further evaluation in this context. However, we wanted to include a domain with at least three sources to analyze how ER systems perform under different conditions, for instance, when changing the ratio between unseen and seen entities, when providing over- and underrepresented entities, or whether the presence of entries from multiple data sources complicates the problem overall. Therefore, we converted it into a JSON file and also used it when creating benchmarks. This source contains few attributes and a small vocabulary, suggesting a relatively easy task. However, with approximately 25% sparsity, it can also pose some challenges. Since the focus remained on semi-structured data, but we also wanted to assess how a third source may influence the effectiveness of matching systems, we created two separate benchmarks: one using the OMDB and TMDB sources and another additionally incorporating the IMDB source.

5https://www.omdbapi.com/
6https://developer.themoviedb.org/
7https://developer.imdb.com/non-commercial-datasets/

Table 3.2: Statistics of Video Data Sources

Name   License        Size      #Max Attributes   Avg. Sparsity   Avg. Value Length (#Chars)   Voc. Size8
OMDB   CC BY-NC 4.0   48,080    26                0.332           26.19                        106,589
TMDB   Insufficient   45,929    21                0.063           15.73                        180,605
IMDB   Insufficient   904,326   13                0.254           10.64                        23,728

8Measured for 10,000 entries.

Book: In the Book domain, two data sources were selected:

• Open Library API9 (OL): Provides an API to access the database content. It contains a rich amount of information about books provided in JSON format. This semi-structured data provides many levels of nesting, a high average attribute sparsity (>0.4) and a high vocabulary size.

• GoodReads dataset on Kaggle10 (GK): This dataset, provided on Kaggle, contains information about books in CSV format, categorizing it as structured data. Unfortunately, we could not find other sources that offer data in a semi-structured format while also providing the necessary license for data redistribution.
However, similar to the IMDB dataset mentioned earlier, this dataset features many attributes that are not further subdivided and additionally provides very long attribute values (>90 characters). Evaluating the effectiveness of an ER system on this data can help reveal how data sources with different structural characteristics may impact the task's difficulty. However, the evaluation will mainly remain focused on the Food and Video domains, since the emphasis is on semi-structured benchmarks.

9https://openlibrary.org/developers/api
10https://www.kaggle.com/datasets/mdhamani/goodreads-books-100k

Table 3.3: Statistics of Book Data Sources

Name   License   Size      #Max Attributes   Avg. Sparsity   Avg. Value Length (#Chars)   Voc. Size8
OL     ODbL      65,779    37                0.436           30.59                        193,381
GK     CC0       100,000   13                0.015           90.59                        70,374

In summary, we will establish benchmarks using the following sources, classified based on their characteristics:

• Group 1: Natively semi-structured sources
  – OFF-FDCN: This benchmark uses data from OFF and FDC Native, both of which natively provide a high degree of semi-structuredness.

• Group 2: One semi-structured source, one adjusted source
  – OFF-FDCM: This benchmark uses data from OFF, which provides a high level of semi-structuredness, and FDC Merged, which required modifications to achieve the desired level of semi-structuredness.
  – TMDB-OMDB: This benchmark uses data from TMDB, which provides semi-structured data natively, and OMDB, which required modifications to achieve the desired level of semi-structuredness.

• Group 3: Benchmarks with an increased number of sources
  – TMDB-OMDB-IMDB (T-O-I): This benchmark uses data from TMDB, OMDB, and IMDB and i