Institut für Parallele und Verteilte Systeme
Universität Stuttgart
Universitätsstraße 38
D–70569 Stuttgart

Master's Thesis

Benchmarking Pre-Trained Language Models for Schema-Agnostic Entity Resolution

Jan Bothmann

Course of study: Computer Science
Examiner: Prof. Dr. rer. nat. Melanie Herschel
Supervisor: Prof. Dr. rer. nat. Melanie Herschel
Commenced: 15 May 2024
Completed: 15 November 2024

Kurzfassung

Data integration is an important process in which data from different sources are combined to create a unified view of the data. An essential step in this process is entity resolution, which attempts to identify elements that represent the same entity. The complexity of entity resolution tasks can vary greatly, as data exhibit different characteristics and degrees of structuredness that can make the task either harder or easier. This thesis focuses on evaluating entity resolution systems with respect to semi-structured data. For this purpose, several semi-structured entity resolution benchmarks were created that use data from different domains and are employed for the evaluation. To also investigate how different data characteristics or other influencing factors affect the performance of entity resolution systems, we developed the Benchmark Creator. It allows us and other users to create benchmarks in which the data exhibit specific characteristics that can influence the performance of entity resolution systems. The entity resolution systems Ditto and Sudowoodo as well as the GPT4o-mini model were used for the evaluation. It was shown that both Ditto and the GPT4o-mini model are able to perform schema-agnostic entity resolution on semi-structured data effectively.

Abstract

Data integration is a process in which data from different sources are brought together to create a unified picture of the data. A vital aspect of this integration is Entity Resolution (ER), which tries to identify elements that correspond to the same entity across multiple datasets. The complexity of ER tasks can vary significantly, as data exhibits different characteristics and levels of structuredness, which can influence the difficulty of the task. In this thesis, we evaluate how current state-of-the-art Entity Resolution systems perform when dealing with semi-structured data. To do this, several semi-structured ER benchmarks covering data from various domains were created for evaluation. Additionally, to explore how different data characteristics or other influencing factors impact the performance of matching systems, we developed the Benchmark Creator. This tool allows us and other users to generate benchmarks where data exhibits specific characteristics that may influence the complexity of the ER task. We used Ditto, Sudowoodo and the GPT4o-mini model to evaluate performance on the newly created benchmarks. Our evaluation reveals that Ditto and the GPT4o-mini model can effectively perform schema-agnostic ER on semi-structured data.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Objectives and Thesis Contributions
  1.3 Outline
2 Related Work
  2.1 Fundamental Concepts
  2.2 Existing Entity Resolution Benchmarks
  2.3 Benchmark Difficulty Measures
  2.4 Summary of Related Work
3 Benchmark Design
  3.1 Data Requirements Definition
  3.2 Data Source Selection
  3.3 Benchmark Dimensions
  3.4 Benchmark Development Process
4 Implementation
  4.1 Design Decisions
5 Evaluation
  5.1 Metrics for Benchmark Evaluation
  5.2 Linearity and Complexity Evaluation
  5.3 Evaluation of Entity Resolution Matching Systems
  5.4 Discussion
6 Conclusion and Outlook
Bibliography

List of Figures

1.1 Diverse Representations of Real-World Entities in the Video Domain
2.1 ER Workflow
2.2 Self-Supervised Learning [PA23]
2.3 EM architecture using Ditto as a matcher [LLS+20]
2.4 The Sudowoodo EM pipeline [WLW23]
2.5 Fine-Tuning in Sudowoodo [WLW23]
2.6 Prompting Foundation Models [PB23]
2.7 Scenarios for Benchmark Construction [WLF+22]
2.8 Example of hard and easier matching and non-matching offer pairs from WDC Products [PDB23]
2.9 Formula for Attribute Sparsity [CAF+21]
3.1 Benchmark Development Process
3.2 Benchmark Creator

List of Tables

2.1 Previous Benchmark Configurations [WLF+22]
3.1 Statistics of Food Data Sources
3.2 Statistics of Video Data Sources
3.3 Statistics of Book Data Sources
5.1 Linearity Results
5.2 OFF-FDCN: Standard
5.3 TMDB-OMDB: Standard
5.4 OFF-FDCM: Standard
5.5 T-O-I: Standard
5.6 OL-GK: Standard
5.7 OFF-FDCN: Matches-NonMatches
5.8 TMDB-OMDB: Matches-NonMatches
5.9 OFF-FDCM: Matches-NonMatches
5.10 T-O-I: Matches-NonMatches
5.11 OL-GK: Matches-NonMatches
5.12 OFF-FDCN: Highly Identifying Attributes
5.13 TMDB-OMDB: Highly Identifying Attributes
5.14 OFF-FDCM: Highly Identifying Attributes
5.15 T-O-I: Highly Identifying Attributes
5.16 OL-GK: Highly Identifying Attributes
5.17 OFF-FDCN: Number Attributes
5.18 TMDB-OMDB: Number Attributes
5.19 OFF-FDCM: Number Attributes
5.20 T-O-I: Number Attributes
5.21 OL-GK: Number Attributes
5.22 OFF-FDCN: Attribute Sparsity
5.23 TMDB-OMDB: Attribute Sparsity
5.24 OFF-FDCM: Attribute Sparsity
5.25 T-O-I: Attribute Sparsity
5.26 OL-GK: Attribute Sparsity
5.27 OFF-FDCN: Value Length
5.28 TMDB-OMDB: Value Length
5.29 OFF-FDCM: Value Length
5.30 T-O-I: Value Length
5.31 OL-GK: Value Length
5.32 OFF-FDCN: Similar Non-Matches
5.33 TMDB-OMDB: Similar Non-Matches
5.34 OFF-FDCM: Similar Non-Matches
5.35 T-O-I: Similar Non-Matches
5.36 OL-GK: Similar Non-Matches
5.37 T-O-I: Unseen Entities
5.38 T-O-I: Over/Underrepresented Entities
5.39 TMDB-OMDB: Combination 1
5.40 OFF-FDCM: Combination 1
5.41 T-O-I: Combination 1
5.42 OFF-FDCN: Combination 2
5.43 TMDB-OMDB: Combination 2
5.44 OFF-FDCM: Combination 2
5.45 T-O-I: Combination 2
5.46 OL-GK: Combination 2
5.47 OFF-FDCN: Data Manipulation
5.48 TMDB-OMDB: Data Manipulation
5.49 OFF-FDCM: Data Manipulation
5.50 T-O-I: Data Manipulation
5.51 OL-GK: Data Manipulation
5.52 OFF-FDCN: Standard
5.53 TMDB-OMDB: Standard
5.54 OFF-FDCM: Standard
5.55 T-O-I: Standard
5.56 OL-GK: Standard
5.57 OFF-FDCN: Matches-NonMatches
5.58 TMDB-OMDB: Matches-NonMatches
5.59 OFF-FDCM: Matches-NonMatches
5.60 T-O-I: Matches-NonMatches
5.61 OL-GK: Matches-NonMatches
5.62 OFF-FDCN: Highly Identifying Attributes
5.63 TMDB-OMDB: Highly Identifying Attributes
5.64 OFF-FDCM: Highly Identifying Attributes
5.65 T-O-I: Highly Identifying Attributes
5.66 OL-GK: Highly Identifying Attributes
5.67 OFF-FDCN: Number Attributes
5.68 TMDB-OMDB: Number Attributes
5.69 OFF-FDCM: Number Attributes
5.70 T-O-I: Number Attributes
5.71 OL-GK: Number Attributes
5.72 OFF-FDCN: Attribute Sparsity
5.73 TMDB-OMDB: Attribute Sparsity
5.74 OFF-FDCM: Attribute Sparsity
5.75 T-O-I: Attribute Sparsity
5.76 OL-GK: Attribute Sparsity
5.77 OFF-FDCN: Value Length
5.78 TMDB-OMDB: Value Length
5.79 OFF-FDCM: Value Length
5.80 T-O-I: Value Length
5.81 OL-GK: Value Length
5.82 OFF-FDCN: Similar Non-Matches
5.83 TMDB-OMDB: Similar Non-Matches
5.84 OFF-FDCM: Similar Non-Matches
5.85 T-O-I: Similar Non-Matches
5.86 OL-GK: Similar Non-Matches
5.87 T-O-I: Over/Underrepresented Entities
5.88 TMDB-OMDB: Combination 1
5.89 OFF-FDCM: Combination 1
5.90 T-O-I: Combination 1
5.91 OFF-FDCN: Data Manipulation
5.92 TMDB-OMDB: Data Manipulation
5.93 OFF-FDCM: Data Manipulation
5.94 T-O-I: Data Manipulation
5.95 OL-GK: Data Manipulation
5.96 OFF-FDCN: Standard
5.97 TMDB-OMDB: Standard
5.98 OFF-FDCM: Standard
5.99 T-O-I: Standard
5.100 OL-GK: Standard
5.101 OFF-FDCN: Matches-NonMatches
5.102 TMDB-OMDB: Matches-NonMatches
5.103 OFF-FDCM: Matches-NonMatches
5.104 T-O-I: Matches-NonMatches
5.105 OL-GK: Matches-NonMatches
5.106 OFF-FDCN: Highly Identifying Attributes
5.107 T-O: Highly Identifying Attributes
5.108 OFF-FDCM: Highly Identifying Attributes
5.109 T-O-I: Highly Identifying Attributes
5.110 OL-GK: Highly Identifying Attributes
5.111 OFF-FDCN: Number Attributes
5.112 TMDB-OMDB: Number Attributes
5.113 OFF-FDCM: Number Attributes
5.114 T-O-I: Number Attributes
5.115 OL-GK: Number Attributes
5.116 OFF-FDCN: Attribute Sparsity
5.117 TMDB-OMDB: Attribute Sparsity
5.118 OFF-FDCM: Attribute Sparsity
5.119 T-O-I: Attribute Sparsity
5.120 OL-GK: Attribute Sparsity
5.121 OFF-FDCN: Value Length
5.122 TMDB-OMDB: Value Length
5.123 OFF-FDCM: Value Length
5.124 T-O-I: Value Length
5.125 OL-GK: Value Length
5.126 OFF-FDCN: Similar Non-Matches
5.127 TMDB-OMDB: Similar Non-Matches
5.128 OFF-FDCM: Similar Non-Matches
5.129 T-O-I: Similar Non-Matches
5.130 OL-GK: Similar Non-Matches
5.131 T-O-I: Unseen Entities
5.132 T-O-I: Over/Underrepresented Entities
5.133 TMDB-OMDB: Combination 1
5.134 OFF-FDCM: Combination 1
5.135 T-O-I: Combination 1
5.136 T-O-I: Combination 2
5.137 OFF-FDCN: Data Manipulation
5.138 TMDB-OMDB: Data Manipulation
5.139 OFF-FDCM: Data Manipulation
5.140 T-O-I: Data Manipulation
5.141 OL-GK: Data Manipulation

Listings

4.1 Computation of Attribute Importance Scores
4.2 Prompt for Generating Synonyms

Acronyms

BERT   Bidirectional Encoder Representations from Transformers
CSV    Comma-Separated Values
EM     Entity Matching
ER     Entity Resolution
GPT    Generative Pre-Trained Transformers
GPU    Graphical Processing Unit
GTIN   Global Trade Item Number
ISBN   International Standard Book Number
JSON   JavaScript Object Notation
MLM    Masked Language Modeling
NLP    Natural Language Processing
NSP    Next Sentence Prediction
RF     Random Forest
RL     Record Linkage
RNN    Recurrent Neural Network
SOTA   State-Of-The-Art
SVM    Support Vector Machine
TF-IDF Term Frequency-Inverse Document Frequency
VIN    Vehicle Identification Number
XML    Extensible Markup Language
YAML   Yet Another Markup Language

1 Introduction

1.1 Motivation

The process of finding and matching entries that refer to the same real-world entity within a single dataset or between different datasets is known as Entity Resolution (ER), Entity Matching (EM) or Record Linkage (RL). It is essential in many applications where data originates from different sources and must be combined to make the data representation correct and meaningful. Such application domains include biology, public health, insurance, e-commerce, and customer relationship management, which frequently process large amounts of data from various sources that lack standardization and therefore require integration. ER is an essential step in the data integration process to reduce redundancy and inconsistency, hence increasing data quality. It also enhances the user experience, facilitates adherence to data regulations and avoids incorrect conclusions from conflicting or duplicate records, contributing to more accurate and trustworthy data analysis.

Several solutions using different approaches have been proposed to address the ER task. Traditional ER solutions, which assume that records referring to the same entity are more similar than those referring to other entities, often rely on pairwise similarity comparisons [LLG15]. However, this assumption does not hold in every case. For instance, different episodes of a film series may differ only slightly in their data representation but still represent distinct entities.
As a result, researchers developed more complex ER systems, such as rule-based matching systems and hand-crafted heuristics, to address more complex matching scenarios, achieving better results [LLG15]. However, these approaches also work only to a certain extent. If data is presented in a structured and clean manner, they perform well, as the high level of structuredness allows for identifying patterns and rules that can be applied consistently. On the other hand, similarity-based or rule-based ER systems struggle to maintain effectiveness when dealing with less structured data. In such cases, where data may follow a semi-structured schema or lack any schema entirely, the predefined rules become less effective and fail to capture the complex and nuanced patterns present in the data.

With the rise of machine learning (ML), ML models have become the preferred choice for tasks previously handled by rule-based systems. Traditional ML models using non-linear classifiers like the Support Vector Machine (SVM) or Random Forest (RF) have demonstrated the ability to identify complex patterns that earlier ER systems struggled to detect and interpret [BM03] [PB20]. Especially with the rise of deep learning solutions capable of understanding even more complicated patterns in the underlying problem, various systems have emerged that leverage such architectures and show increased performance. This increase in performance has also been observed when utilizing such models for ER. Notably, the recent development of transformer-based large language models has led to another significant boost in ER performance. Language models such as Bidirectional Encoder Representations from Transformers (BERT) [DCLT19] and Generative Pre-Trained Transformers (GPT) [Rad18] have emerged and achieved remarkable success in Natural Language Processing (NLP) tasks. Such systems are pre-trained using vast amounts of text sequences from large corpora on one or multiple self-supervised tasks to capture and understand the semantic relationships between the words of a text. Compared to other architectures, the transformer-based architecture captures context and relationships between tokens or sets of tokens more flexibly and effectively [VSP+17]. Smaller language models often additionally undergo a fine-tuning step to increase performance on the desired downstream task. When the downstream task is ER, the language model is fine-tuned so that it learns to determine whether different records refer to the same entity or not. Using this kind of neural network architecture already achieves promising results for ER tasks [CEP+20] [ZPSK23], and researchers continue to explore even more innovative solutions.

Figure 1.1: Diverse Representations of Real-World Entities in the Video Domain

However, current research focuses mainly on structured data, where many solutions have already achieved solid results. This raises the question of whether recently developed ER systems can still perform well when dealing with less structured data, such as semi-structured or unstructured data. Semi-structured data is often presented in formats like Extensible Markup Language (XML), JavaScript Object Notation (JSON) or Yet Another Markup Language (YAML), where information is organized in nested structures, and attributes are not required to be consistently provided.
On the other hand, unstructured data lacks any specific organization and typically requires specialized data management tools for effective handling. As mentioned, studies have mainly focused on evaluating matching systems on structured data. To compare the performance of different matching systems, researchers began to create so-called benchmarks for the ER task to assess how well matching systems perform under different circumstances. However, most of the existing benchmarks for ER contain only structured data or focus on other subtasks. Hence, it is unknown how matching systems perform when dealing with different levels of structuredness in data. To assess this problem in more detail, developing such benchmarks would be of great interest and would help determine whether such systems, especially the current State-Of-The-Art (SOTA) ER matching systems, can deal with different data structures.

Another crucial aspect to consider when developing benchmarks is the variation in data characteristics. Authors have focused on providing benchmarks with varying levels of difficulty, as different data exhibits different characteristics that can significantly impact the difficulty of the matching task [PB20] [CAF+21]. For instance, datasets may have varying levels of structural and semantic heterogeneity. Structural heterogeneity refers to differences in how data is organized, while semantic heterogeneity involves variations in meaning and interpretation across different datasets. These and other differences can lead to different levels of benchmark difficulty, which so far have primarily been assessed on structured datasets but not on other levels of structuredness. Analyzing these variations is crucial, as they highlight areas where current solutions need to focus on optimizing results and areas that do not require attention because they are already well addressed.

Due to the lack of ER benchmarks for semi-structured data and the necessity to understand the influence of varying levels of structuredness, this work aims to enhance the understanding of SOTA matching systems in this context and seeks to establish a solid foundation for future research.

1.2 Research Objectives and Thesis Contributions

The main objective of this master's thesis is to study the effectiveness of SOTA ER approaches based on pre-trained language models in a schema-agnostic setting on semi-structured data. This objective is broken down into the following specific goals:

1. Development of Semi-Structured Entity Resolution Benchmarks: The primary goal is to design and develop one or multiple ER benchmarks containing semi-structured data. For this, sources providing the necessary data structure that can be used for benchmark creation have to be identified. Additionally, identifying sources across domains that reflect the variety and complexity of real-world entity resolution problems should provide a solid foundation for evaluating ER matching systems.

2. Introduce Varying Levels of Complexity and Heterogeneity: The second goal is to create more refined semi-structured benchmarks by systematically varying underlying data characteristics to identify easier or more challenging scenarios. Such characteristics can include both structural and semantic aspects. This aims to assess the robustness and adaptability of ER matching systems across datasets with varying structures and challenges.
3. Evaluate State-of-the-Art ER Methods Using the Benchmark: The last goal is to evaluate the effectiveness of current SOTA ER systems in the new setting. This also includes the assessment of the generated benchmarks containing varying complexities and characteristics to identify possible difficulties.

The main contributions of this thesis are:

1. Providing Software for Developing ER Benchmarks: We developed a new software component that allows users to create new benchmarks and provides options to restrict specific benchmark dimensions to a desired extent. This enables future researchers to more easily generate benchmarks with specific configurations.

2. Demonstrating the Effectiveness of ER Systems on Semi-Structured Data: This work demonstrates that the SOTA ER systems Ditto and GPT4o-mini can effectively perform ER on semi-structured data in a schema-agnostic manner. This finding suggests that current systems can handle the complexities associated with hierarchically structured data.

3. Identification of Benchmark Characteristics that Pose Difficulties: We present several semi-structured benchmarks that demonstrate diverse characteristics. These benchmarks highlight areas where the employed matching systems struggle to perform ER effectively, indicating room for improvement.

1.3 Outline

The structure of this thesis is organized as follows:

Chapter 2: Related Work: This chapter provides the fundamental concepts related to ER and key terms relevant to this field. It also offers insights into the current SOTA ER systems and the various architectures they employ. Additionally, it reviews existing benchmarks and presents the dimensions on which authors have focused in the past.

Chapter 3: Benchmark Design: This chapter illustrates the steps and decisions taken before implementing the benchmark. It describes the data requirements and the sources selected and provides the architecture used to generate the benchmarks. It also describes the specific dimensions covered by our benchmarks.

Chapter 4: Implementation: The design decisions encountered during the implementation phase are explained in detail, along with the rationale for each selected approach.

Chapter 5: Evaluation: The evaluation of various ER systems on the newly created semi-structured benchmarks is described here. Furthermore, the results of the matching systems are compared and discussed in detail.

Chapter 6: Conclusion and Outlook: Finally, the main conclusions of this work are summarized, and potential topics for future research based on the findings are presented.

2 Related Work

This chapter explains the fundamental concepts of ER, including the general workflow and the specific ER approaches currently used. Furthermore, existing benchmarks are presented alongside what their authors focused on when creating them, as well as the various procedures used to generate them.

2.1 Fundamental Concepts

Entity Resolution, Entity Matching, Record Linkage or De-Duplication aims to identify whether different representations from one or several sources represent the same real-world entity [WLF+22]. An entity can be described as "a single unique object in the real world that is being mastered" [Cor24] and can have various representations. A representation of an entity can be textual, graphical, or of any other form and can thus be presented in numerous formats. Figure 1.1 shows different textual representations of specific movies from the video domain.
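To make this concrete, the following minimal sketch shows two hypothetical semi-structured records of the kind depicted in Figure 1.1; the attribute names and values are invented for illustration and do not come from the actual benchmark sources.

```python
# Two hypothetical semi-structured representations of the same movie from
# different sources; schemas, nesting and attribute coverage differ.
record_a = {
    "title": "The Matrix",
    "released": {"year": 1999, "country": "US"},
    "cast": ["Keanu Reeves", "Laurence Fishburne"],
    "runtime_min": 136,
}
record_b = {
    "name": "Matrix, The (1999)",
    "director": "Wachowski",
    "genres": ["Action", "Sci-Fi"],
    # no runtime or cast provided by this source
}
```

Although both records describe the same entity, neither the attribute names nor the provided values align, which is exactly the situation a schema-agnostic matcher has to cope with.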
The goal in this example would be to understand whether two textual representations of movies refer to the same movie or not. This task is non-trivial, as many factors influence the difficulty of identifying the correct mapping. Data may be represented in many ways and contain data errors such as schema or value errors, thus increasing the difficulty.

The process of identifying the correct mapping between different representations and real-world entities requires several steps, as illustrated in Figure 2.1. Initially, a set of source records from one or multiple data sources is provided. ER is then performed on these records to establish a correct mapping between the representations and the entities they represent. Before performing entity resolution, Blocking or Clustering is typically applied to reduce the search space when comparing records by grouping similar records. Blocking or Clustering is a technique that tries to place entries that are more similar, or more likely represent the same entity, in the same group while putting the other entries into different groups [GFSS21]. Often, blocking employs similarity-based measures to identify such groups. More advanced blocking methods, such as embedding-based blocking [OR24], utilize pre-trained language models like BERT [DCLT19] or SentenceBERT, as demonstrated in DeepBlocker [TLT+21]. Such models transform records into a lower-dimensional feature space, capturing semantic similarities. In this feature space, similarity measures like Cosine Similarity are applied to identify closely related records, enabling more effective matching of similar entities [TLT+21]. This way, not every record has to be compared with every other record, which would be inefficient, as the number of comparisons grows quadratically. Sometimes, authors perform specific data cleaning steps before blocking to address inconsistencies, missing values or other kinds of data errors to reduce the difficulty of the matching task.

Figure 2.1: ER Workflow

After identifying clusters and groups of likely similar entries, the Entity Resolution step is performed within each group. Here, different kinds of ER approaches are available to choose from. One can select traditional ER approaches, which employ similarity measures on attribute values: if the similarity exceeds a certain threshold, the two entries are classified as a match; otherwise, they are not. Often, blocking and entity resolution are performed iteratively to either exploit relationships between entity descriptions, as seen in matching-based approaches, or to leverage the partial results of merging descriptions in merging-based approaches [BGM+09]. A more recent approach involves employing machine learning techniques, which are often better at understanding the underlying patterns in data and thus may solve the task more effectively [BM03]. Today's most commonly used machine learning techniques for solving the ER task are based on deep learning architectures. Due to their complex architecture and rich set of parameters, deep learning models have proved to be very effective when facing inherently complex tasks, including ER [ZPSK23].

2.1.1 Entity Resolution Methods

Various ER methods are available, each with its own approach and strengths, which we briefly revisit in more detail below. Traditional ER methods often perform pairwise similarity comparisons between entries and assume that entries from the same entity are more similar than other records [LLG15]; a minimal sketch of such a threshold-based matcher is shown below.
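A minimal sketch of this traditional approach, assuming a Jaccard token similarity and an illustrative threshold; both choices are made here for illustration and are not taken from [LLG15].

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two serialized records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def is_match(record_a: str, record_b: str, threshold: float = 0.6) -> bool:
    # Classic threshold-based decision: similar enough -> declared a match.
    return jaccard(record_a, record_b) >= threshold
```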
However, this assumption does not always hold, making this approach fragile in such constellations. To resolve this issue, rule-based systems and manually crafted heuristics were developed, which use predefined rules in combination with similarity metrics to compare records. However, these rules can often only be employed on structured and cleaned data, as they require a certain degree of structuredness and struggle when it is not provided. Additionally, such ER systems often need to be reconfigured for each dataset, as the rules need to be adjusted for each problem individually [ETJ+18]. Classical machine learning models other than neural network architectures, such as RFs or SVMs [BM03] [PB20], have also been used by various researchers. Such ML models are trained on labelled datasets to learn how to distinguish whether pairs refer to the same entity. However, feature engineering is often required before feeding the data to the models, and this process often relies on domain experts to design appropriate comparison features for the model [KQG+19].

ER systems using neural networks, especially language models that utilize deep-learning architectures, are currently the SOTA systems for ER tasks, as they can learn complex patterns in the data due to their large and complex architectures without the need for feature engineering. These systems leverage so-called transformer-based architectures, which are highly effective for problems requiring an understanding of long-distance relationships between tokens or token sets [VSP+17]. This means they can identify relationships between tokens even if they are positioned far apart in the text. Since the current SOTA ER systems leverage models using this architecture, we will briefly explain how they work, as such systems will be used for evaluation later.

Transformer: Transformer-based deep learning architectures use so-called Transformers that rely on the Attention or Self-Attention mechanism to learn the representation of the underlying problem more efficiently and to better capture long-distance relationships within text [VSP+17]. Transformers use a multi-head attention mechanism that allows them "to attend to information from different representations jointly" [VSP+17].

Attention: Attention is a mechanism that allows a model to focus on different parts of an input sequence (keys) when processing a query, assigning weights to each part based on its relevance [VSP+17]. These weights determine how each part contributes to identifying the correct value. Self-attention is a specific attention mechanism where the keys contain the same components as the values. When dealing with text, this allows the model to relate every word of a text to every other word, which helps to capture the long-range relationships crucial in NLP tasks such as ER [VSP+17]. This is especially relevant in the schema-agnostic ER setting, where schematic information is discarded and thus only attribute values are used [PP18] [TSS20]. Therefore, correctly relating the records may be more difficult when schematic information is not provided. When dealing with large texts, transformer-based architectures have an advantage over other types of architectures, such as Recurrent Neural Networks (RNNs), which, for instance, suffer from the vanishing gradient problem. Transformers correlate all queries with all keys, allowing them to capture long-range dependencies more effectively.
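As a minimal sketch of the mechanism described above, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V; the PyTorch snippet below is illustrative only and omits the multi-head projections and masking used in full Transformer implementations.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention as in [VSP+17]: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # relevance of every key to every query
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 per query
    return weights @ V                             # weighted combination of the values

# Self-attention: queries, keys and values all come from the same token embeddings.
tokens = torch.randn(1, 6, 32)                     # (batch, sequence length, embedding dim)
contextualized = scaled_dot_product_attention(tokens, tokens, tokens)
```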
Capturing such long-range dependencies is especially important in the ER task when dealing with records containing much textual information. Another advantage is that, due to the absence of a recurrence mechanism, computations can be parallelized on Graphical Processing Units (GPUs), resulting in faster processing [VSP+17].

Pre-Training

Due to this new architecture, it became possible to train even larger models, leading to the development of large language models. These models are typically pre-trained using self-supervised learning tasks, in this case language-specific tasks, allowing them to capture complex semantic and syntactic relationships. This allows models to learn difficult patterns without the labelled datasets that would otherwise typically be required. To understand the characteristics of a language, large corpora of unlabeled data are used to train the neural network in combination with language-specific pretext tasks. Common pre-training tasks for textual input are:

• Masked Language Modeling (MLM): This pre-training task first masks random tokens in a sequence of tokens and then asks the model to predict the missing part given the surrounding context. This aims to train the model to learn the bidirectional relationships within the text and to consider preceding and following tokens in a sequence when computing a solution.

• Next Sentence Prediction (NSP): Here, the model is given two spans of text, and the question is whether the order of the spans is correct. When considering textual input, the question would be: is the logical order of the sentences correct or not? That way, the model learns to identify the relationships between larger sets of sequences and not to focus solely on specific tokens.

Various other pre-training tasks can be used to pre-train a model. The emergence of large language models led to models like BERT, which were trained on the mentioned pre-training tasks, or even larger models known as foundation models. Foundation models like GPT use many more parameters and are trained on a larger number of pretext tasks than smaller language models like BERT. They can already solve various tasks without the need for fine-tuning.

Figure 2.2: Self-Supervised Learning [PA23]

Fine-Tuning

On the other hand, smaller language models often require a fine-tuning step to perform well on the downstream application. The task that the model should fulfil in the downstream application is learned by directly solving the domain-specific task, for which often only a restricted amount of labelled data exists. This also applies to the ER task. When considering ER as a fine-tuning task, a model backbone pre-trained on linguistic tasks or on differentiating similar from non-similar pairs is commonly utilized [LLS+20]. The fine-tuning task then often involves determining, for a set of pairs of representations, whether each pair refers to the same or different entities. Since fine-tuning requires an additional labelled dataset, one has to consider the additional time and cost potentially needed to develop such datasets. Especially for the ER downstream task, identifying matches and non-matches across various sources for training is not trivial, as small differences in representation can change the label, often requiring the knowledge of domain experts. Therefore, authors try to identify solutions that perform well even when using small training sets.
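As a rough illustration of such a fine-tuning setup (not the implementation of any specific system discussed below), a pre-trained transformer can be wrapped with a binary classification head and trained on serialized record pairs; the model name, the serialization and the label are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical serialized record pair and label (1 = match, 0 = non-match).
left = "title the matrix year 1999 runtime 136"
right = "name matrix, the (1999) director wachowski"
label = 1

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# The two records are encoded as one sentence pair; the classification head on top
# of the pooled representation predicts match vs. non-match.
batch = tokenizer(left, right, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([label]))
outputs.loss.backward()  # an optimizer step over many such pairs completes fine-tuning
```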
2.1.2 State-Of-The-Art Entity Resolution Matching Systems

As previously mentioned, SOTA ER systems leverage language models. In this section, we provide an overview of these systems.

Ditto

One widely recognized ER matching system using a large language model is Ditto [LLS+20]. Ditto leverages pre-trained language models such as BERT, RoBERTa or DistilBERT as the backbone and fine-tunes these on the binary ER task.

Figure 2.3: EM architecture using Ditto as a matcher [LLS+20]

In addition, Ditto offers several configurations to optimize effectiveness when resolving entities [LLS+20]:

• Domain Knowledge Injection: When providing pairs of records as input to Ditto, users can utilize a specific serialization format offered by the tool using [COL] and [VAL] tokens to represent attribute names and their corresponding values. Furthermore, Ditto enables the incorporation of domain knowledge via span types, allowing the model to concentrate on spans of the same type, which can lead to better results.

• Span Normalization: Ditto also allows developers to integrate span normalization, with which several rewriting rules can be specified. These rules rewrite "syntactically different but equivalent spans into the same string" [LLS+20], e.g. when dealing with abbreviated words or synonyms. This should improve the model's capability to recognize equivalent spans despite their being represented differently.

• Summarization: Attribute values can sometimes contain long sequences of tokens, many of which provide little useful information. Ditto employs a TF-IDF summarization technique to reduce the number of tokens, keeping only those with high TF-IDF scores. This allows the model to focus on the most informative parts of the input data.

• Data Augmentation: Ditto offers certain data augmentation options to increase the training set size or make the model more robust to corrupted entries. These include deleting spans of tokens or entire attributes, shuffling tokens within a span or across attributes, or swapping the order of two data entries.

The authors of Ditto noted that performance improved with any of these configurations activated, with the summarization option demonstrating the most significant increase in effectiveness.

Sudowoodo

Another promising matching system utilizing pre-trained language models is Sudowoodo [WLW23]. The authors of Sudowoodo aimed to address challenges related to label requirements and task variety. To achieve this, they employed a contrastive pre-training approach to learn data representations and relationships between inputs in an unsupervised manner. Contrastive pre-training leverages a contrastive loss function to differentiate between learned representation spaces without requiring labelled data. This technique is commonly used in image processing, where various data augmentation methods can modify images without altering their fundamental meaning. The model must determine whether the augmented versions correspond to the original, enabling it to learn a "meaningful representation where similar data items are close in the representation space while pairs of dissimilar data items are far apart" [WLW23]. This effectively allows the model to distinguish between similar and dissimilar inputs based on their representations. According to the authors [WLW23], the goal of the contrastive learning step is that fewer labelled examples are required to fine-tune the model while still achieving reasonable results.
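To make the idea of a contrastive objective concrete, the following sketch shows a simplified in-batch contrastive (NT-Xent-style) loss over two augmented views of the same records; it illustrates the general technique, not Sudowoodo's actual implementation, and it omits intra-view negatives and other details.

```python
import torch
import torch.nn.functional as F

def simple_contrastive_loss(z1, z2, temperature=0.07):
    """z1[i] and z2[i] are embeddings of two augmented views of record i.
    Each view should be closest to its counterpart and far from all other records."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise cosine similarities between views
    targets = torch.arange(z1.size(0))   # the i-th row should match the i-th column
    return F.cross_entropy(logits, targets)
```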
Figure 2.4: The Sudowoodo EM pipeline [WLW23]

Sudowoodo implements the following steps:

1. Contrastive Learning for Pre-training Models: Sudowoodo utilizes existing pre-trained language models, such as BERT, RoBERTa, or DistilBERT, and adjusts their parameters when performing the contrastive learning task. It implements various data manipulation techniques, such as token manipulation or swapping and deleting attributes, to modify data. After applying a transformation, an additional manipulation option called cutoff can be used, which adjusts the feature, span, or token representation in the embedding space by setting the corresponding dimensions to zero, effectively serving as a regularization technique. Furthermore, Sudowoodo employs cluster-based negative sampling, where the Cosine Similarity on TF-IDF vector representations is used to sample more similar records, thus potentially forcing the model to learn more difficult patterns to solve the problem. After that, the pre-trained model can be used for further processing.

2. Pseudo-labelling: When users struggle to obtain sufficient labelled data for fine-tuning, Sudowoodo offers the option to generate labelled data from unlabelled datasets through its built-in pseudo-labelling approach. To generate pairs, Sudowoodo utilizes Cosine Similarity on the embedded representations of records to identify likely matches or non-matches. If the similarity value is above the match threshold, the pair is considered a match; the same is done for identifying non-matches. To identify meaningful thresholds, the authors use a hill-climbing heuristic from the hyperparameter optimization framework Optuna [WLW23] [ASY+19].

3. Fine-tuning: After pre-training a model and identifying a small set of labelled data, the user can fine-tune the model. The authors modified the fine-tuning procedure, noting that typically, records are concatenated first, and then the concatenated versions are used for fine-tuning. However, only individual records (not concatenated ones) are utilized for pre-training a model. To address this difference, the fine-tuning procedure additionally captures the differences between data representations beyond simple concatenation. Moreover, it incorporates the individual record representations alongside the concatenated version, enhancing the model's ability to fine-tune more effectively, as illustrated in Figure 2.5. Once fine-tuned, the model is ready for evaluation.

Figure 2.5: Fine-Tuning in Sudowoodo [WLW23]

GPT

Foundation models like OpenAI's GPT or Google's Gemini models are large pre-trained language models containing hundreds of billions of parameters that are able to solve a wide range of tasks across various domains [TZCM22]. Recently, researchers have also begun to utilize these systems for ER tasks. In the paper "Entity Matching using Large Language Models" [PB23], the authors explore how such systems perform when used for ER. The advantage of such systems is that users do not necessarily need to search for additional labelled data to fine-tune the model, as these models often already perform well without changing any configurations [PB23].

Figure 2.6: Prompting Foundation Models [PB23]

Beyond that, the authors assessed how different prompts affect the performance of several large language models in ER tasks. They tried to understand what information needs to be included in such a prompt to receive good results; a sketch of such a prompt is shown below.
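A minimal sketch of this kind of zero-shot prompting, assuming the openai Python client; the prompt wording and the serialized records are illustrative and not taken from [PB23].

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

record_a = '{"title": "The Matrix", "year": 1999, "runtime": 136}'
record_b = '{"name": "Matrix, The (1999)", "director": "Wachowski"}'

prompt = (
    "Do the following two records refer to the same real-world entity? "
    "Answer with 'Yes' or 'No'.\n"
    f"Record A: {record_a}\n"
    f"Record B: {record_b}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # e.g. "Yes"
```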
To this end, different kinds of prompts were created and evaluated. The conclusion was that there is no single perfect prompt, as the best choice depends on the respective dataset and the model in use and must be adjusted for each combination. They also found that prompts only significantly benefited models with low zero-shot performance. It was noted that the GPT-4 model delivered solid results in zero-shot scenarios, often matching or surpassing the performance of smaller, fine-tuned models such as BERT or RoBERTa [PB23]. While the benefit of avoiding the need for fine-tuning is clear, the authors emphasize that the cost of using such large models should also be carefully weighed [PB23].

2.2 Existing Entity Resolution Benchmarks

Deep learning-based solutions currently demonstrate the highest effectiveness in solving ER tasks. The effectiveness of such systems under certain conditions is compared on so-called benchmarks. Benchmarks in the context of ER are standardized datasets used to assess the effectiveness of various approaches [PDB23]. They allow researchers to quantify and compare different solutions under the same conditions. Specific metrics like accuracy, precision, recall or the F1 measure are mainly used to quantify the performance of ER systems on ER benchmarks. ER benchmarks contain pairs of representations that need to be matched. Additionally, for each pair, a label is provided that indicates whether the two records match or do not match. As mentioned in Section 2.1.1, current ER solutions often fine-tune large language models for the ER task. Therefore, authors of recently published benchmarks often provide additional training and validation sets for fine-tuning the model. The test set typically contains records not contained in the development set to ensure the model does not simply memorize data seen during training.

Several authors of ER benchmarks have tried to analyze how current ER solutions perform under different conditions, e.g. when the structuredness or the number of sources used changes, referring to such adjustable properties as benchmark dimensions [PDB23]. In addition to evaluating how semi-structured data influences the performance of SOTA ER solutions, we therefore also aimed to identify specific configurations within semi-structured data that may affect performance. Identifying these configurations can help future researchers focus on ensuring that their ER systems are robust in such difficult scenarios. Therefore, an analysis of various configurations of certain benchmark dimensions, which have previously been used without a specific focus on semi-structured data, is also conducted in this work.

2.2.1 Benchmark Dimensions

This section presents the benchmark dimensions that previous work focused on.

Domain

Numerous benchmarks for ER have been published, and each typically focuses on data from different domains. Peeters, Der and Bizer [PDB23] used data sources from the product domain. The developers of the benchmark Alaska [CAF+21] concentrated their interest on e-commerce data, whereas other researchers focused on historical voter data from North Carolina [PDWW21]. The motivation for developing benchmarks across various domains originates from the inherent complexity of data in different domains and the interest in seeing whether matching systems can perform ER well across these diverse tasks.

Structuredness

Current benchmarks exhibit varying levels of structuredness.
Many benchmarks provide highly structured data that is well organized in a well-defined schema. Examples are the popular DBLP-ACM and Abt-Buy benchmarks provided by the Database Group Leipzig¹ [KTR10] or the benchmarks developed by the students of the Data Science Class CS 784 at UW-Madison² [KDS+16]. However, other benchmarks, like WDC Products [PDB23], offer a mix of structured and less structured data. Alaska's publishers also mention using data sources from various domains containing different levels of structuredness [CAF+21]. Since performance often degraded when using benchmarks that did not provide well-formed schemas, it is of interest to investigate whether this is particularly true for semi-structured benchmarks.

¹ https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution
² https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository/description-of-the-784-data-sets?authuser=0

Size of Fixed Data Splits

As mentioned before, recently developed benchmarks for ER provide, in addition to the test set, training and validation data for fine-tuning. However, several older published benchmarks exist that do not provide these splits, as deep learning solutions were not that popular then. This makes it difficult to compare the effectiveness of matching systems across such benchmarks, as authors might use different training configurations that influence the outcome. If benchmarks provide such development sets, then the size of these sets can also play a decisive role, as it can significantly influence the performance of the matching systems [PDB23]. A low amount of training and validation data gives the models only a small view of the problem. In contrast, large development sets may provide a better understanding of the underlying task. The authors of WDC Products provide small (2,500 training and validation pairs), medium (6,000 training, 3,500 validation) and large (ca. 19,000 training, 4,500 validation) development sets, whereas Primpeli and Bizer [PB20] already assign benchmarks to the category small when they provide development sets containing only 300-400 pairs. Usually, performance decreases when utilizing a smaller training set.

Figure 2.7: Scenarios for Benchmark Construction [WLF+22]

Seen and Unseen Entities

When creating training, validation, and test sets to train a neural network, authors usually avoid including a record in multiple sets because this can lead to overfitting, meaning that the model only memorizes specific records instead of generalizing to unseen data. In ER, it also matters whether other records of the same entity were already seen during training, even if the specific record itself is not reused across sets. Several publications [PDB23] observed that many existing benchmarks have a so-called "Restricted Entity Assumption" [WLF+22], where entities that were already seen during training are also included in the test set. The authors [WLF+22] claim that matching systems trained and evaluated on such data may generalize less and tend to overfit more.
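A minimal sketch of how such an entity-disjoint split can be produced, so that the entity clusters used for testing never appear during training (roughly the "open matching" scenario described next); the pair representation is a simplifying assumption made here.

```python
import random

def entity_disjoint_split(pairs, test_ratio=0.2, seed=42):
    """Each pair is (record_a, record_b, cluster_a, cluster_b, label). Clusters are
    split first, so every test pair involves only entities unseen during training."""
    clusters = sorted({c for p in pairs for c in (p[2], p[3])})
    random.Random(seed).shuffle(clusters)
    held_out = set(clusters[: int(len(clusters) * test_ratio)])
    test = [p for p in pairs if p[2] in held_out and p[3] in held_out]
    train = [p for p in pairs if p[2] not in held_out and p[3] not in held_out]
    return train, test  # pairs straddling both groups are discarded in this sketch
```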
The authors of [WLF+22] created benchmarks for these scenarios, illustrated in Figure 2.7:

• Open Matching: Considers only pairs of records whose entities were not seen during training.

• Cluster-focused Matching: Includes records that were not shown during training, even though other records of the same entities were seen during training.

• Record Linking: Uses pairs where records seen during training are linked with records from the same entity that were not shown during training.

They observed in their evaluation that their model achieved the highest F1-score when performing Record Linking, while Cluster-focused Matching and Open Matching decreased performance, suggesting that benchmark developers should carefully consider this aspect when developing their next benchmark.

Corner Cases

Corner cases, also known as edge cases, generally describe scenarios that are more challenging to solve. For ER, the authors of WDC Products analyzed the performance of ER systems when encountering such corner cases. They argue that corner cases are cases where similarity measures fail to capture the correct relationship between two records [PDB23]. Two types of corner cases are:

• High Similarity, Different Entities: Pairs of representations that appear highly similar according to specific similarity metrics but do not represent the same real-world entity.

• Low Similarity, Same Entity: Pairs of representations that appear highly dissimilar according to similarity metrics but refer to the same real-world entity.

Figure 2.8: Example of hard and easier matching and non-matching offer pairs from WDC Products [PDB23]

Figure 2.8 provides an example of easy and difficult matches and non-matches. The hard match involves two similar entries that nevertheless differ in description and price, making correct identification more challenging. Similarly, the hard non-match example features entries where the price and currency differ, but the brand and title are similar. Providing benchmarks containing such corner cases can help determine whether an ER system can also handle scenarios that require more than just similarity features to effectively perform ER.

Label Distribution

The distribution of matching and non-matching pairs in benchmarks is an important factor, as different distributions can influence the difficulty of the matching task. Some researchers argue that the label distributions in development and test sets are often overly optimistic. Often, similar distributions are used across these sets, a practice considered unrealistic [WLF+22]. In real-world scenarios, the ratio between matching and non-matching pairs is unknown and "diverges significantly" [WLF+22]. Current benchmarks, according to [WLF+22], often use lower ratios, which is not considered a realistic scenario, as non-matches vastly outnumber matches in real-world applications.

Benchmark                Matched:Mismatched  Seen Clusters  Seen Records
abt-buy [MLR+18]         ≈1:6                99%            96%
amazon-google [MLR+18]   ≈1:6                99%            97%
dblp-acm [MLR+18]        ≈1:20               100%           100%
dblp-scholar [MLR+18]    ≈1:15               100%           100%
walmart-amazon [MLR+18]  ≈1:12               100%           99%
wdc cameras [PPB19]      ≈1:3                100%           78%
wdc watches [PPB19]      ≈1:3                100%           81%
wdc computers [PPB19]    ≈1:3                100%           72%
wdc shoes [PPB19]        ≈1:3                100%           62%

Table 2.1: Previous Benchmark Configurations [WLF+22]
for example, after performing a blocking step, developers are often left with groups containing over 100 candidate records. Because of the long-tail phenomenon, it is realistic that only a single instance within a group is a match, making this ratio reflective of practical EM applications [WLF+22].

Single Modality Assumption

Several options and formats exist to represent a record of an entity. Benchmarks often provide only the textual representations, discarding the potentially valuable information provided in other formats. Different modalities like images, audio or videos, which can also provide valuable insights into the problem, are often discarded and, therefore, not considered [WLF+22]. Researchers therefore suggest incorporating different modalities when developing benchmarks to provide current solutions with more information and improve their ability to solve the tasks more effectively [WLF+22].

Amount of Attributes

Textual representations of entities, when provided in structured or semi-structured formats, use attributes to store information in an organized manner. The quantity and length of attributes often vary across different sources: fewer attributes typically result in less granular information, whereas more attributes often lead to a better distribution and splitting of information. Certain authors [CAF+21] [PB20] profiled ER benchmarks and the sources used in their creation, measuring the number of attributes provided by the sources. Since sources containing more information may increase the performance of ER systems, this aspect can also be analyzed in the context of the ER task on semi-structured data.

Highly Identifying Attributes

Similar to the number of attributes, the minimum number of attributes required to solve the task effectively is also an important criterion. Primpeli and Bizer refer to this dimension as "Schema Complexity" and illustrate that matching systems requiring many attributes to perform effectively on the ER task indicate that the benchmarks involved are of higher difficulty [PB20]. Such an analysis can also be applied in our setting, focusing on semi-structured data to assess whether the task quickly becomes more difficult when discarding essential attributes.

Attribute Sparsity

Attribute sparsity indicates the proportion of attributes that are missing information relative to all attributes. In structured datasets, default values such as "NULL", "-", "N/A", and many others are often used as placeholders for missing values. According to the authors, attribute sparsity can be computed as:

AS(S) = 1 − (∑_{r∈S} |schema(r)|) / (|schema(S)| · |S|)

where
• S is a set containing records r,
• |schema(r)| is the number of non-missing attributes in a single record r,
• |schema(S)| is the total number of unique attributes across all records in the set S, including missing and non-missing attributes.

Figure 2.9: Formula for Attribute Sparsity [CAF+21]

Several benchmark publishers have investigated how varying attribute sparsity levels affect the performance of matching systems. They observed that when attribute sparsity is high, i.e., many attribute values are missing, less information is available, often making the task more challenging and thus decreasing the performance of ER matching systems. Typically, a sparsity value above approximately 0.4 to 0.5 is considered high, while a value below 0.2 to 0.3 is often considered low [CAF+21] [PB20].
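To make the definition above concrete, the following minimal Python sketch computes attribute sparsity for a list of records given as dictionaries. The helper name attribute_sparsity and the set of placeholder strings treated as missing values are our own illustrative assumptions and are not taken from [CAF+21].

# Minimal sketch (our own illustration): attribute sparsity of a record set,
# following AS(S) = 1 - (sum_r |schema(r)|) / (|schema(S)| * |S|).
PLACEHOLDERS = ("", "NULL", "-", "N/A")  # assumed markers for missing values

def is_missing(value) -> bool:
    return value is None or value in PLACEHOLDERS

def attribute_sparsity(records: list[dict]) -> float:
    # schema(S): all unique attributes observed across the record set
    schema_s = set()
    for record in records:
        schema_s.update(record.keys())
    if not schema_s or not records:
        return 0.0
    # sum over records of |schema(r)|, the number of non-missing attributes
    filled = sum(
        sum(1 for value in record.values() if not is_missing(value))
        for record in records
    )
    return 1.0 - filled / (len(schema_s) * len(records))

if __name__ == "__main__":
    sample = [
        {"name": "Oat Drink", "brand": "Acme", "energy_kcal": None},
        {"name": "Oat Milk 1L", "brand": "N/A"},
    ]
    print(round(attribute_sparsity(sample), 3))  # 0.5 for this toy sample

For this toy sample, three unique attributes exist across two records, of which three attribute values are filled, yielding AS = 1 − 3 / (3 · 2) = 0.5, i.e. a high sparsity according to the thresholds cited above.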
Attribute Value Length

Attribute value length, or Textuality [PB20], refers to the dimension that analyzes the length of attribute values and examines how entity matching systems handle pairs of records that exhibit varying attribute value lengths. Attributes with many tokens and characters are considered highly textual, while those with fewer tokens are classified as low textual [PB20]. The average value length can be computed by summing the number of tokens or characters over all attribute values and dividing by the number of provided attribute values. Primpeli and Bizer [PB20] claim that long attribute values may present increased challenges in the ER task; such attributes might contain too much information that could be split into further attributes. The effectiveness of deep learning solutions was not explored in detail, but evaluations on highly textual benchmarks, such as Amazon-Google3, report smaller F1 scores [PB22]. This suggests that highly textual entries may increase the task complexity. However, further analysis is needed to fully understand their impact on deep learning solutions.

Under/Over-represented Entities

In ER benchmarks, entities are often represented by varying numbers of instances, creating an imbalance in their distribution [CAF+21]. This imbalance can significantly impact the ER task, especially if the distribution of represented entities changes between training and testing. For instance, during training, the model may learn the representations of a limited subset of entities, which can negatively affect performance when tested on a dataset containing a wider range of entities. The authors of Alaska [CAF+21] note that uneven distributions of cluster sizes can increase the problem's difficulty and thus affect the performance of ER systems.

Combined Dimensions

Most ER benchmark developers focus on observing how the performance of matching systems changes when varying one specific dimension. However, only a few papers have analyzed the impact of combining multiple dimensions, which may reveal further interesting settings. The WDC Products benchmark was created to determine how well matching systems perform when specific benchmark dimensions are combined. The authors [PDB23] varied dimensions such as the development set size alongside the percentage of corner cases and the percentage of unseen entities to assess the performance of matching systems under these different conditions. They created benchmarks for all combinations of development set sizes (Small, Medium, Large), corner cases (20%, 50%, 80%) and unseen entities (0%, 50%, 100%) and found that specific combinations, e.g. a large percentage of corner cases in combination with many unseen entities, decrease the performance across all matching systems tested and thus may be interesting for further analysis. Another study grouped existing benchmarks based on specific combinations of characteristics [PB20]. The authors combined profiling dimensions such as schema complexity, textuality, attribute sparsity, development set size and corner cases to identify characteristics of easier and more difficult ER tasks. They classified benchmarks into various categories, including one labelled Dense Data and Simple Schema, which, for instance, covers tasks with low schema complexity (≤ 4 relevant attributes), high density (> 0.94), and short attribute values (< 7 words) [PB20].
Classifying existing benchmarks into these groups revealed how each group's characteristics influenced the benchmarks' difficulty, often indicating whether a benchmark was relatively easy or challenging. The goal was to assess the performance of ER matching systems in difficult scenarios and whether they can handle these. They observed that benchmarks containing more corner cases and long attribute values are more difficult for ER systems to handle. It is worth noting that they used Random Forests and Support Vector Machines for this kind of evaluation [PB20].

3https://dbs.uni-leipzig.de/research/projects/benchmark-datasets-for-entity-resolution

Data Manipulation

Data manipulation is a commonly used technique in data augmentation. It is typically used to increase the development set size when labelled data is limited or insufficient. In ER tasks, data manipulation is also employed to evaluate the robustness of matching systems in response to data changes, which often occur in real-world applications. Numerous data manipulation techniques have been utilized in previous work [IRV13] [HPWR20], including the following (a small sketch illustrating a few of them follows after the list):

• Character Variation: Inserting, deleting and updating characters of attribute values are simple data manipulation techniques to add noise to the attribute values and thus potentially increase the difficulty.

• Token Variation: Beyond focusing only on characters, inserting, deleting or updating entire tokens at random positions may also significantly change the meaning of the whole token sequence.

• Encoding Transformation: This technique involves changing the data representation from one format to another. For example, binary representations such as {1, 0} can be converted to {true, false} without losing meaningful information. This transformation may confuse models and thus can also be applied when manipulating data.

• Format Transformation: This includes changing the format of the given data into another structure without losing too much information. For instance, a date format transformation might change "01.02.16" (DD.MM.YY) to "02.01.16" (MM.DD.YY). This kind of transformation is often performed in real-world applications, and hence, assessing the impact of changing the format of records for ER can be interesting.

• Synonyms and Homonyms: Another data manipulation involves replacing words with their synonyms or homonyms. Synonyms are words with the same meaning, while homonyms are words spelt the same but with different meanings. For example, "purchase" and "buy" are synonyms, while "ruler" (a person who governs) and "ruler" (a tool for measuring) are homonyms. Replacing tokens with their synonyms or homonyms may increase the difficulty for EM matching systems, especially if they have not encountered such variations during training.

• Acronyms and Abbreviations: Acronyms and abbreviations are shortened forms of one or a sequence of tokens, such as "NASA" for "National Aeronautics and Space Administration" or "Dr." for "Doctor". ER systems are also expected to handle such cases; hence, this kind of manipulation can also be employed when creating benchmarks.

• Splitting, Fusing, or Swapping Attribute Values: These transformations change the data structure by separating a single attribute value into multiple parts, combining multiple values into one, or switching the order of attributes.
For example, one might split "John Doe" into separate "First Name" and "Last Name" fields or fuse "123 Main St, Apt 4B" into a single address field. Similarly, swapping attribute values, such as reversing "City, Country" to "Country, City", can also impact the performance of EM systems. Applying these transformations when creating benchmarks may help to ensure that a system can handle such variations.

• Multilingualism: Representing records in different languages may notably increase the difficulty for EM systems. Only a few EM systems are trained on multilingual data; thus, it could be interesting to see how this change influences effectiveness. Additionally, one could assess how variations in syntax and grammar across languages are handled when solving ER tasks by manipulating the data accordingly.

• Combinations: Finally, all mentioned data manipulation techniques can also be applied in combination, further increasing the difficulty of benchmarks and allowing for the assessment of whether ER systems can handle more challenging scenarios.

In summary, when authors create benchmarks, one can observe that they often vary specific data characteristics or other factors and assess the performance of ER systems under these variations. Current assessments have not specifically focused on semi-structured data, making it interesting to explore how these configurations influence performance in this context.
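The following minimal Python sketch illustrates how a few of the listed manipulations (character variation, token deletion, and a date format transformation) could be applied to attribute values. The function names and the regular expression for dates are our own illustrative assumptions and are not taken from [IRV13] or [HPWR20].

# Minimal sketch (our own illustration) of three of the manipulations listed above.
import random
import re
import string

def character_variation(value: str, rng: random.Random) -> str:
    """Insert a random character at a random position to add noise."""
    if not value:
        return value
    pos = rng.randrange(len(value) + 1)
    return value[:pos] + rng.choice(string.ascii_lowercase) + value[pos:]

def token_deletion(value: str, rng: random.Random) -> str:
    """Delete one randomly chosen token, possibly changing the meaning."""
    tokens = value.split()
    if len(tokens) <= 1:
        return value
    del tokens[rng.randrange(len(tokens))]
    return " ".join(tokens)

def date_format_transformation(value: str) -> str:
    """Rewrite DD.MM.YY dates as MM.DD.YY (format transformation)."""
    return re.sub(r"\b(\d{2})\.(\d{2})\.(\d{2})\b", r"\2.\1.\3", value)

if __name__ == "__main__":
    rng = random.Random(42)
    record = {"title": "Logitech C920 HD Pro Webcam", "release": "01.02.16"}
    noisy = {
        "title": token_deletion(character_variation(record["title"], rng), rng),
        "release": date_format_transformation(record["release"]),
    }
    print(noisy)

Such perturbation functions can be applied to individual attribute values of a benchmark record, either in isolation or chained, to obtain increasingly difficult variants of the same pair.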
2.2.2 Benchmark Development Approaches

Authors have approached the development of new ER benchmarks in various ways. After defining a specific evaluation goal for a new ER benchmark, the question arises what data to use and how to generate the benchmark. Since many data sources do not permit reuse or do not provide the data in the desired format, it is not always clear what the best approach is. The authors of Alaska use a system called DEXTER [QBD+15], which contains product-specific data crawled from the web. Web crawlers automatically browse and collect information from websites to build large datasets. In the benchmark development process of WDC Products, the authors also focused on data crawled from the web, using the Common Crawl, one of the largest web corpora available [PDB23]. To work with data provided in such corpora, authors typically need to apply several pre-processing steps. These include selecting pages with relevant information, filtering out non-relevant information within each chosen page, and performing additional data cleaning steps to ensure the data is usable and meets the desired requirements. Additionally, domain experts were consulted to verify labels that were likely assigned correctly by other automated systems [QBD+15] or were already present in the data but could be incorrect [PDB23]. Panse et al. [PDWW21], on the other hand, did not manually label the records when dealing with historical voter data from North Carolina: since unique identifiers that distinguish people reliably were provided with much care, there was no need to consult domain experts [PDWW21]. These approaches to benchmark development show that using data from large web corpora has advantages, such as offering diverse representations of sources, which tests whether ER systems can effectively capture complex relationships between data. On the other hand, such corpora often require many pre-processing steps, architectures capable of managing large data volumes, and other resources, for instance, when consulting domain experts to label data. All this information will be utilized and carefully considered when developing our benchmarks.

2.3 Benchmark Difficulty Measures

As outlined in Section 2.1.1, various approaches to Entity Resolution exist, with large language models being among the most prominent methods currently in use. These models come in various sizes, and for some ER tasks, especially when the inherent difficulty is not too high, smaller models can be utilized effectively to achieve satisfactory results. Papadakis et al. [PKCP23] employed a different approach to assess the difficulty of benchmarks without relying on ER systems based on large language models. The analysis begins with measuring the Degree of Linearity [PKCP23] of ER benchmarks.

Degree of Linearity

This measure assesses whether a linear classifier can successfully resolve the entities of a given benchmark. In their evaluation, the authors use the Jaccard and Cosine similarities when classifying matches and non-matches. Jaccard Similarity4 measures the number of tokens shared by two sets A and B, divided by the number of tokens across both sets:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

Cosine Similarity5 measures the similarity between two vectors in a multi-dimensional space:

Cosine Similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

Since Cosine Similarity expects numerical vectors as input, the sets A and B are often represented by Term Frequency-Inverse Document Frequency (TF-IDF) vectors. TF-IDF6 measures the relevance of a word in a document relative to a corpus:

TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)

where TF(t, d) is the frequency of term t in document d, and IDF(t, D) measures how much information the term t provides across all documents D.

The authors employ a threshold-based classification using both similarity measures. First, they measure the similarity between each pair and save these values. Then, for each threshold between 0.01 and 0.99 with a step size of 0.01, they evaluate the F1 score, using this threshold as the boundary for classifying matches and non-matches. The highest F1 score obtained across all thresholds is reported as the degree of linearity. A high score indicates that the classification task is relatively simple for linear classifiers, suggesting that large language models are unnecessary, as smaller models could also solve the task effectively.

4https://en.wikipedia.org/wiki/Jaccard_index
5https://www.datastax.com/de/guides/what-is-cosine-similarity
6https://en.wikipedia.org/wiki/Tf-idf
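As a concrete illustration of this threshold sweep, the sketch below computes a TF-IDF/cosine-based degree of linearity for a small set of labelled record pairs using scikit-learn. The serialization of records into plain strings, the variable names, and the toy example are our own simplifying assumptions rather than the exact procedure of [PKCP23].

# Minimal sketch (our own simplification): degree of linearity via a threshold
# sweep over cosine similarities of TF-IDF vectors, as described above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity

def degree_of_linearity(pairs: list[tuple[str, str]], labels: list[int]) -> float:
    """Return the best F1 score achievable by thresholding cosine similarity."""
    corpus = [text for pair in pairs for text in pair]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    # Cosine similarity between the two records of each pair.
    sims = np.array([
        cosine_similarity(tfidf[2 * i], tfidf[2 * i + 1])[0, 0]
        for i in range(len(pairs))
    ])
    best_f1 = 0.0
    # Sweep thresholds 0.01, 0.02, ..., 0.99 and keep the best F1 score.
    for threshold in np.arange(0.01, 1.00, 0.01):
        preds = (sims >= threshold).astype(int)
        best_f1 = max(best_f1, f1_score(labels, preds, zero_division=0))
    return best_f1

if __name__ == "__main__":
    pairs = [
        ("logitech c920 hd pro webcam", "logitech c920 webcam full hd"),
        ("logitech c920 hd pro webcam", "canon eos 2000d dslr camera"),
    ]
    labels = [1, 0]  # match, non-match
    print(round(degree_of_linearity(pairs, labels), 2))

A Jaccard-based variant works analogously by tokenizing both records and computing the intersection-over-union of their token sets instead of the TF-IDF cosine similarity.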
Complexity Measures

Additionally, the authors considered other measures implemented in a Python framework called "Problexity" [KK23]. It uses so-called Complexity Measures [PKCP23] to determine the difficulty of ER benchmarks. These measures evaluate the potential existence of complex patterns in the underlying data, which can indicate a higher difficulty in solving the problem. "Problexity" combines various measures into a single value to indicate whether the task is likely to be complex. The following measures were used and are all mentioned in [PKCP23]:

1. Feature overlapping measures: These measures focus especially on the numeric features and how well they contribute to solving the problem correctly. Examples include the maximum Fisher's discriminant ratio or measures that evaluate the degree of overlap between different classes.

2. Neighborhood measures: The neighborhood measures examine how data points from different classes are clustered within small data regions. Additionally, measures that evaluate relationships among nearest neighbours and those that employ k-nearest neighbours (kNN) classifiers or utilize neural networks are used to produce a value indicating the relative ease or complexity of the classification task.

3. Network measures: These measures analyze relationships in a dataset by focusing on instances of the same class and their similarities, which can be helpful for classification tasks. For that, the instances of a dataset are connected through edges if they originate from the same class and the Gower distance7, a distance measure suitable for mixed data types, is smaller than a predefined threshold [PKCP23]. The density measure then reflects, for instance, the ratio of existing edges to the total number of possible edges in the graph.

4. Dimensionality measures: Such measures assess data sparsity by calculating the mean number of instances per dimension or the proportion of relevant dimensions.

7https://en.wikipedia.org/wiki/Gower%27s_distance

The authors represented each pair in the ER benchmark by a two-dimensional feature vector, namely its Cosine and Jaccard similarities, and used this as input to compute the complexity, which yields a value between 0 and 1. A higher complexity indicates that the specific entity resolution task is likely to be more difficult [PKCP23].

2.4 Summary of Related Work

This chapter provided an overview of the ER workflow, including its phases and an introduction to several SOTA ER systems used in the matching process. We reviewed multiple ER benchmarks and highlighted a gap in prior studies, noting that existing evaluations did not specifically focus on how semi-structured data affects the effectiveness of ER systems. We observed that authors analyzed the impact of specific benchmark configurations (Benchmark Dimensions) on ER systems, but did not focus on semi-structured data specifically. Further, various methods for selecting data sources and combining information to create ER benchmarks were reviewed, along with alternative techniques for evaluating benchmark complexity. These findings are considered when designing the benchmarks, guiding both the selection of data sources and the development of semi-structured benchmarks, presented in the next chapter.

3 Benchmark Design

When developing a new benchmark for ER, researchers typically aim to assess whether existing ER matching systems can deal with a certain kind of problem or aspect. This means that each study targets one or multiple benchmark dimensions and assesses the effectiveness of matching systems on this particular new kind of ER task. Our study focuses on assessing ER systems' performance using benchmarks that provide semi-structured data, unlike most benchmarks, which typically rely on fully structured data or do not concentrate on a specific level of structuredness.
This means the goal is to create new ER benchmarks containing data provided in a nested format to analyze how well matching systems can handle such data. Furthermore, we aim to provide benchmarks that allow us to explore how specific dimension configurations, such as the match to non-match ratio or certain data characteristics, including attribute sparsity and value length, impact the difficulty of ER on semi-structured data. This is of interest as previous studies have observed performance decreases in ER systems when encountering benchmarks with specific characteristics [PDB23] [PB20]. We explicitly want to verify whether these challenges are also present when dealing with semi-structured data or whether they become even more pronounced in this context. The benchmark development process is divided into the following steps:

• Data Requirements Definition: Several criteria and requirements must first be defined for the sources used to create the ER benchmarks. Requirements regarding the number of records, the degree of structuredness, and other essential data characteristics are defined to ensure that the benchmark is meaningful and representative.

• Data Source Selection: This section describes the data sources that satisfy the defined requirements and thus were chosen for benchmark creation. The characteristics of these data sources are presented in detail.

• Benchmark Dimensions: This part outlines the benchmark dimensions selected for evaluation across different configurations and explains the significance of each in detail.

• Benchmark Development Process: This section explains in detail the architecture and the individual software components used to create benchmarks.

3.1 Data Requirements Definition

Here we outline several requirements that the data sources should fulfill and explain them in detail.

3.1.1 Structuredness

Before selecting datasets for creating ER benchmarks, it is essential to determine which sources provide the desired data for the specific task. In our scenario, we wanted to identify data sources that provide data in a semi-structured format. Datasets with many nested elements and different levels of nesting were preferred over those with simpler nesting structures, as one goal was to assess whether ER systems can handle data with more complex structures. The increased level of nesting also contributes to greater data heterogeneity, particularly schematic heterogeneity, which is likely to result in a more challenging scenario for the systems.

3.1.2 Labels

Another important aspect to consider is whether the identified data sources provide unique identifiers that can be used to determine whether two records refer to the same real-world entity. Such identifiers are available in certain domains but are not universally provided. For instance, product identifiers such as the Global Trade Item Number (GTIN), vehicle identifiers like the Vehicle Identification Number (VIN), or book identifiers like the International Standard Book Number (ISBN) are widely used in their respective domains and make identifying matches much easier. However, many sources do not include such identifiers in their records. As a result, ER benchmark developers often face the challenge of determining the correct mapping between records and entities. This process typically involves data cleaning, blocking, and manual labelling by domain experts, which requires significant resources and time [PDWW21].
Due to the limited time, we decided to consider only data sources that provide unique identifiers so that matches and mismatches between sources can be identified easily. In addition, we chose to use only sources whose data requires few pre-processing steps. When datasets from large corpora, such as the Common Crawl, are considered, many pre-processing steps, such as identifying meaningful data or cleaning data, are required before the data can be used for creating ER benchmarks [PDB23]. Given the complexity of these tasks and the lack of the necessary infrastructure to handle such a large amount of data, we chose not to consider such data sources. Therefore, the focus was on identifying data sources that do not require substantial pre-processing and already provide labelled data.

3.1.3 Number of Matches

An important consideration when identifying suitable sources is to ensure that enough entries across different sources relate to the same entities. This is necessary because, when creating benchmarks, it is important to have sufficient matches and non-matches to additionally be able to create training and validation sets. Therefore, having a representative number of entries that relate to the same entity is crucial. Peeters and Bizer [PB23] argue that a benchmark should contain at least 150 matches to ensure reliable results. Since we also aim to assess the impact of the development set size when fine-tuning models, it is important to provide development sets of varying sizes. In the WDC Products benchmark, the largest configuration across all sets (training, validation, and test) contains approximately 10000 matching entries [PDB23]. Similarly, we aimed to identify sources from domains where approximately 10000 matching pairs could be generated. Domains with fewer matches were also considered but were less preferred.

3.1.4 Licensing

When reusing data from external sources, it is important to assess the licensing terms to determine which data can be reused and which cannot. Contributors typically publish licenses that outline how users can work with the published data. Since the goal of this work is to make the benchmarks accessible to other researchers, we require data sources that allow data reuse and redistribution in academic contexts. Specifically, the license should grant permission to host and distribute the data on institutional servers, such as those at the University of Stuttgart. This ensures that the benchmarks remain available to the research community, enabling others to evaluate new entity resolution systems and related tools using our datasets. Data sources that do not provide a suitable license are given much lower priority.

3.1.5 Heterogeneity

A key aspect considered in this thesis is the analysis of how varying levels of heterogeneity between sources can influence the difficulty of ER tasks. Therefore, it is important to identify data sources that exhibit varying levels of heterogeneity among themselves. We consider aspects of structural heterogeneity, which refers to variations in data organization or differences in data models. Since we focus on semi-structured data, these benchmarks naturally exhibit more schematic heterogeneity than structured benchmarks due to the more complex organization of the data. Specifically, we aim to identify data sources that, for example, vary in the number of nested elements, the attributes provided or the average attribute sparsity.
We also considered the possibility of analyzing semantic heterogeneity between sources, such as synonyms, homonyms, or different units of measurement (e.g., euro vs. dollars). However, this requires a specific analysis and substantial effort, as sources often provide large numbers of attributes, and thus this aspect was not a primary focus here. Other differences include variations in vocabulary size or the number of tokens provided in attribute values, and we aimed to identify sources that exhibit these differences as well. Since we assess the effectiveness of ER systems in a schema-agnostic setting, these factors influence the task because only the attribute values are considered, not the attribute names.

3.1.6 Number of Data Sources

The number of data sources can influence the difficulty of matching tasks, making it an important consideration when creating ER benchmarks. Since different sources provide multiple representations of real-world objects, matching systems need to be able to distinguish between matches and non-matches across various sources and demonstrate their effectiveness in handling such diversity. Most existing benchmarks contain information from only two sources, but we aim to assess how an increased number of sources affects ER performance, particularly when using semi-structured data. In addition, it is also interesting to assess how over- or underrepresented entities impact the effectiveness of matching systems when working with semi-structured records. Such experiments require that some entities are represented by multiple sources while others are represented by fewer sources. These overrepresented entities can only be provided in the benchmark datasets if we first locate numerous data sources within the same domain and, second, ensure that these sources contain multiple matches where each data source contributes records to the match. Therefore, the goal was to identify multiple sources with semi-structured data from the same domain.

3.2 Data Source Selection

Data sources from various domains that align well with the specified requirements were sought. In particular, we restricted the search to domains, such as videos, books, food, or vehicles, where we knew that the sources may provide identifiers to distinguish between entities. Other domains, such as restaurants, were excluded as they often relied only on coordinates and lacked unique identifiers. Relying on coordinates alone in these domains proved insufficient, as multiple restaurants could exist within the same building, for example, on different floors. Sources containing vehicle information were also discarded. Initially, vehicles appeared promising due to the VIN identifier, which uniquely distinguishes each vehicle. However, the sources obtained were either too structured (e.g., lacking nested elements despite being provided in JSON format1) or we were unable to find suitable semi-structured datasets with vehicle information at all. As a result, although semi-structured data sources were identified, many were excluded due to challenges in reliably differentiating entities or because they did not meet the requirements for structuredness. In addition to these issues, data providers often did not permit data redistribution. However, we found a domain, the food domain, where data sources met the necessary data structure requirements, allowed redistribution, and included unique identifiers.
Food: In the Food domain, we identified two data sources:

• Open Food Facts2 (OFF): Provides an API to access a database whose content is in JSON format and includes varying levels of nested elements. It contains a rich set of attributes (max. 271 unique attributes) characterizing different foods, a high average sparsity (>0.6) and a low average attribute value length (≈6.5).

• FoodData Central3 (FDC): When identifying this source, we initially found several Comma-Separated Values (CSV) files, each containing different types of information (e.g., main food properties, nutrient details, etc.). Since CSV files provide structured data, they were unsuitable for our evaluation, as we focus on semi-structured data. Because we found it difficult to locate other sources with the necessary structure, we decided to restructure this data ourselves. We used the CSV file containing the main food properties and transformed it into a JSON format to better align with our focus on semi-structured data. The other CSV files provided contained additional information, such as nutrient details. Since each food has many nutrients and each nutrient is described by many attributes, we appended, for example, each nutrient as a nested JSON object to the corresponding food record as additional information (a minimal sketch of this merge is shown after Table 3.1). We did this for all food entries and in that way constructed a semi-structured dataset that could be used for benchmark creation and evaluation. We call this modified source FDC Merged. Later, during the evaluation phase, we discovered that FDC also offers a version of the data in JSON format and hence used it in addition when creating benchmarks. We name it FDC Native. This version provides records with more attributes than the FDC Merged dataset and features more deeply nested elements. Both datasets contain longer attribute values (>11) but less sparse data (<0.27) than OFF. Since FDC Merged and FDC Native represent largely the same entities, we create two benchmarks: one using the FDC Merged and OFF data and another using the FDC Native and OFF sources, to assess whether this change of representation affects the performance of matching systems.

1https://www.nhtsa.gov/nhtsa-datasets-and-apis
2https://openfoodfacts.github.io/openfoodfacts-server/api/ref-v2/
3https://fdc.nal.usda.gov/download-datasets.html

Table 3.1: Statistics of Food Data Sources

Name         License     Size    #Max Attributes   Avg. Sparsity   Avg. Value Length (#Chars)   Voc. Size4
OFF          ODbL v1.0   18500   271               0.617           6.43                         40,719
FDC Merged   CC0 1.0     47578   25                0.265           11.24                        34,325
FDC Native   CC0 1.0     46284   56                0.193           12.97                        40,277

4Measured for 10,000 entries.
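As an illustration of how the FDC Merged source was assembled conceptually, the following sketch joins a main food CSV with a nutrient CSV into nested JSON records. The file names (food.csv, food_nutrient.csv) and column names (fdc_id, nutrient_id, amount) are illustrative assumptions and may differ from the actual FoodData Central downloads.

# Minimal sketch (illustrative file and column names, not the exact FDC schema):
# merge a main food CSV with a nutrient CSV into nested, semi-structured JSON.
import csv
import json
from collections import defaultdict

def build_fdc_merged(food_csv: str, nutrient_csv: str, out_json: str) -> None:
    # Group the nutrient rows by the food they belong to.
    nutrients_by_food = defaultdict(list)
    with open(nutrient_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            nutrients_by_food[row["fdc_id"]].append(
                {"nutrient_id": row["nutrient_id"], "amount": row["amount"]}
            )

    # Attach each food's nutrients as a nested JSON array.
    records = []
    with open(food_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            record = dict(row)
            record["nutrients"] = nutrients_by_food.get(row["fdc_id"], [])
            records.append(record)

    with open(out_json, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)

if __name__ == "__main__":
    build_fdc_merged("food.csv", "food_nutrient.json".replace(".json", ".csv"), "fdc_merged.json")

The same pattern of nesting related rows as JSON arrays inside a parent record was applied to the other auxiliary CSV files as well.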
Video: In the Video domain, three data sources were selected:

• The Open Movie Database5 (OMDB): Provides an API, the OMDB API, which returns data about various movies in JSON format. It includes records with nested elements, though fewer than those provided by sources in the Food domain. However, the average attribute sparsity is relatively high (≈0.33), and attributes like "Director" or "Authors" often contain multiple names, indicating that the data is not perfectly separated. Additionally, it provides a high vocabulary size and long attribute values (26.19 characters on average).

• The Movie Database6 (TMDB): Offers an API for non-commercial use, providing various endpoints to query movie information, and returns data in JSON format. When querying the API for a movie using the imdbId identifier, the response provides general information about the movie in a semi-structured way with few nested elements. Since this general information contains only a few nested elements, we queried additional endpoints of the API and incorporated their data, such as crew and cast information, as nested elements. These endpoints return data about individual members and their properties, representing each member as a JSON object. All these JSON objects, representing the members, are added to a JSON array for the specific movie as an additional attribute. Similarly, genres were stored in a JSON array to accommodate multiple values, resulting in more nested elements in the records. This source has a high vocabulary size, and the average sparsity is very low (≈0.06). It has to be noted that, unfortunately, this source does not allow redistribution of data. Since it was difficult for us to find other suitable sources, we have retained this source and will provide the evaluation results using this data. However, we will not publish the benchmarks in this domain.

• IMDB7: Provides datasets on the IMDb Non-Commercial Datasets website containing information about various movies. Unlike the other sources, IMDB offers its data in CSV format, which classifies it as a structured dataset and is typically not ideal for further evaluation in this context. However, we wanted to include a domain with at least three sources to analyze how ER systems perform under different conditions, for instance, when changing the ratio between unseen and seen entities, when providing over- and underrepresented entities, or whether the presence of entries from multiple data sources complicates the problem overall. Therefore, we converted it into a JSON file and also used it when creating benchmarks. This source contains few attributes and a small vocabulary, suggesting a relatively easy task. However, with approximately 25% sparsity, it can also pose some challenges. Since the focus remained on semi-structured data, but we also wanted to assess how a third source may influence the effectiveness of matching systems, we created two separate benchmarks: one using the OMDB and TMDB sources and another additionally incorporating the IMDB source.

5https://www.omdbapi.com/
6https://developer.themoviedb.org/
7https://developer.imdb.com/non-commercial-datasets/

Table 3.2: Statistics of Video Data Sources

Name   License        Size      #Max Attributes   Avg. Sparsity   Avg. Value Length (#Chars)   Voc. Size8
OMDB   CC BY-NC 4.0   48,080    26                0.332           26.19                        106,589
TMDB   Insufficient   45,929    21                0.063           15.73                        180,605
IMDB   Insufficient   904,326   13                0.254           10.64                        23,728

8Measured for 10,000 entries.

Book: In the Book domain, two data sources were selected:

• Open Library API9 (OL): Provides an API to access the database content. It contains a rich amount of information about books provided in JSON format. This semi-structured data provides many levels of nesting, a high average attribute sparsity (>0.4) and a high vocabulary size.

• GoodReads dataset on Kaggle10 (GK): This dataset, provided on Kaggle, contains information about books in CSV format, categorizing it as structured data. Unfortunately, we could not find other sources that offer data in a semi-structured format while also providing the necessary license for data redistribution.
However, similar to the IMDB dataset mentioned earlier, this dataset features many attributes that are not further subdivided and additionally provides very long attribute values (>90 characters). Evaluating the effectiveness of an ER system on this data can help reveal how data sources with different structural characteristics may impact the task's difficulty. However, the evaluation will mainly remain focused on the Food and Video domains, since the emphasis is on semi-structured benchmarks.

9https://openlibrary.org/developers/api
10https://www.kaggle.com/datasets/mdhamani/goodreads-books-100k

Table 3.3: Statistics of Book Data Sources

Name   License   Size      #Max Attributes   Avg. Sparsity   Avg. Value Length (#Chars)   Voc. Size8
OL     ODbL      65,779    37                0.436           30.59                        193,381
GK     CC0       100,000   13                0.015           90.59                        70,374

In summary, we will establish benchmarks using the following sources, classified based on their characteristics:

• Group 1: Natively semi-structured sources
  – OFF-FDCN: This benchmark uses data from OFF and FDC Native, both of which natively provide a high degree of semi-structuredness.

• Group 2: One semi-structured source, one adjusted source
  – OFF-FDCM: This benchmark uses data from OFF, which provides a high level of semi-structuredness, and FDC Merged, which required modifications to achieve the desired level of semi-structuredness.
  – TMDB-OMDB: This benchmark uses data from TMDB, which provides semi-structured data natively, and OMDB, which required modifications to achieve the desired level of semi-structuredness.

• Group 3: Benchmarks with an increased number of sources
  – TMDB-OMDB-IMDB (T-O-I): This benchmark uses data from TMDB, OMDB, and IMDB and i