Multi-version Indexing for large Datasets with high-rate continuous Insertions

A dissertation accepted by the Faculty of Computer Science, Electrical Engineering and Information Technology of the University of Stuttgart in partial fulfilment of the requirements for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by Christian Riegger from Hechingen

Main examiner: Prof. Dr.-Ing. Ilia Petrov
Co-examiner: Prof. Dr.-Ing. habil. Bernhard Mitschang
Date of the oral examination: 2 June 2023

Institute for Parallel and Distributed Systems (IPVS) of the University of Stuttgart, 2023

Acknowledgments

Foremost, I would like to express my sincere gratitude to my supervisors Prof. Dr.-Ing. Ilia Petrov and Prof. Dr.-Ing. habil. Bernhard Mitschang. Ilia triggered my research interest in algorithms and data structures during my master's degree. He made the completion of this thesis possible through his support, expertise, encouragement and empathy for several personal misfortunes during that time. Bernhard Mitschang supported this work with input from a different perspective, which contributed to its improvement and comprehensibility. I gratefully acknowledge the cooperation and scientific exchange with my colleagues at the Data Management Lab of Reutlingen University, Tobias Vinçon, Arthur Bernhardt and Christian Knödler, which yielded interesting research, novel approaches and collaborative publications. I would like to thank my family and close friends for bearing with me all this time. This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program 'Services Computing'.

Contents

List of Abbreviations
Zusammenfassung
Abstract
1. Introduction
1.1. Motivation
1.2. Problem Statement
1.3. Contributions
1.4. Scientific Publications
1.5. Structure of this Thesis
2. Technical Background
2.1. Characteristics of modern Hardware Technologies
2.1.1. Highly parallelized and decentralized processing units
2.1.2. Primary and Secondary Storage Characteristics
2.1.3. Testbed Hardware Metrics and Operating System
2.2. Modern Workload Properties
2.2.1. Application of standardized Database Benchmarks
2.3. Fundamental Design Decisions in Multi-Version Database Management Systems
2.3.1. Opportunities and Version Management in Multi-Version Concurrency Control
2.3.2. Estimated Base Table Cost Model
2.4. Additional Access Paths in Multi-Version DBMS
2.4.1. Version-oblivious Index Equality and Range Search
2.5. Improving Index Search Operation in Multi-Version DBMS
3. Related Work
3.1. Storage and auxiliary Index Management Structures
3.1.1. Ubiquitous B+-Tree
3.1.2. Structures with Multi-Version Capabilities
3.1.3. Structures leveraging Modern Secondary Storage Hardware Characteristics
3.1.4. Applied Basic Structure: Partitioned BTree
3.1.5. Prototypical Environments
3.2. Summary and Conclusion
4. Multi-Version Partitioned BTree
4.1. Multi-Version Indexing Design Concepts
4.2. Write-Optimization with Partitioned BTrees
4.2.1. Data Structure and Components
4.2.2. Partition Switch & Sequential Write
4.2.3. Basic Operations
4.2.4. Cost Model
4.2.5. Summary
4.3. Version-Awareness with Multi-Version Partitioned BTree
4.3.1. Applying Multi-Version Record Types
4.3.2. Version-Aware Record Sizes and Compression Techniques
4.3.3. Logical Version Chain in evolving MV-PBT
4.3.4. In-Partition Version Record Ordering Convention
4.3.5. Version-Aware Basic Operations
4.3.6. Index-Only Visibility Check
4.3.7. Summary
4.4. Workload Adaptiveness and Optimizations
4.4.1. Adaptive Data Reorganization and Indexing
4.4.2. Garbage Collection
4.4.3. Application of Auxiliary Filter Structures and Data Skipping
4.4.4. Summary
4.5. Conclusion
5. Arbitrary Approximate Membership Testing
5.1. Problem Statement
5.2. Point Filter Techniques
5.3. Broadened Range Filter Techniques
5.3.1. Related Work
5.4. bloomRF: Unified Point-Range Filter for Arbitrary Approximate Membership Testing
5.4.1. Implicit Dyadic Trace Trees (DTT)
5.4.2. Prefix Hashing and Piecewise-monotone Hash Functions
5.4.3. Error Correction Policies
5.4.4. Heuristical Tuning Advisor
5.4.5. bloomRF's False Positive Rate (FPR) Model
5.4.6. Basic Operations
5.4.7. Multi Attribute and Data Type Support
5.5. Experimental Evaluation of PRF
5.6. Conclusion
6. Application and System Integration
6.1. Indexing in Relational Database Management Systems
6.1.1. Implementation Details of MV-PBT in PostgreSQL with SIAS
6.1.2. Experimental Setup
6.1.3. Evaluation and Selection of Baseline
6.1.4. Write-optimized Indexing in OLTP
6.1.5. Index-Only Visibility Checks dominate in HTAP
6.1.6. Summary
6.2. Storage Manager in Key/Value-Stores
6.2.1. Implementation Details of MV-PBT in WT
6.2.2. Implications on Secondary Indexing MV-PBT Stores
6.2.3. Experimental Setup
6.2.4. Experimental Evaluation
6.2.5. Summary
6.3. Conclusion
7. Summary and Research Opportunities
7.1. Summary of Contributions
7.2. Further Application Scenarios of MV-PBT
7.2.1. Time-Travel Query Processing
7.2.2. Blockchain Storage Structure in Enterprise Systems
7.2.3. Accelerators and Complex Memory Hierarchies
Bibliography
A. Appendix
A.1. Complex Memory Hierarchies
A.2. Technical Background: Flash in SSD
A.3. Example calculation of single B+-Tree Cache Efficiency

List of Abbreviations

Table 1: Common abbreviations used in this thesis.

B+-Tree – Balancing Tree Structure with Records in Leaves [BM70; Com79] (compare Section 3.1.1)
BF – Bloom Filter [Blo70] (compare Section 5.2)
bloomRF – Bloom Range Filter [MRBP23; RBMP20]
BPK – Bits per Key
Bw-Tree – Latch-Free Storage Management Structure based on the B-Tree [LLS13; WPL+18]
CH-Benchmark – HTAP Benchmark based on TPC-C and TPC-H [CFG+11]
CP – Cached Partition
CPU – Central Processing Unit
DBMS – (Relational) (Multi-Version) Database Management System
DI – Dyadic Interval
DTT – Dyadic Trace Tree
ETL – Extraction, Transformation and Load Process
FPR – False Positive Rate
GC – Garbage Collection
HDD – Hard Disk Drive
HOT – Heap-Only Tuple (Standard Base Table Organization in PostgreSQL [Pos21])
HTAP – Hybrid Transactional and Analytical Processing
I/O – Input-/Output-Operations (in the context of reads or writes to secondary storage)
IOPS – Input-/Output-Operations (I/O) per Second
K/V-Store – Key-/Value-Store
LBA – Logical Block Address
LSM – Log-Structured Merge Tree [OCGO96; SR12] (compare Section 3.1.3)
MV-PBT – Multi-Version Partitioned BTree [RP22; RVGP20]
MVCC – Multi-Version Concurrency Control
NO-tpmC – New-Order Transactions per Minute of the TPC-C Workload [TPC10]
NVM – Non-Volatile Memory
OLAP – Online Analytical Processing
OLTP – Online Transaction Processing
Ops/s – Operations per Second
PBA – Physical Block Address
PBF – Prefix Bloom Filter
PBT – Partitioned BTree [RVP17b; RVP19] (based on [Gra03])
PMHF – Piecewise-Monotone Hash Function
Pnr – Partition Number
PRF – Point-Range Filter
RI/O – Read I/O Costs
RA – Read Amplification
RAM – Random Access Memory (Primary Storage / Main Memory)
Rosetta – Robust Space-Time Optimized Range Filter for Key-Value Stores [LCK+20]
RQ – Research Question
SA – Space Amplification
SI – Snapshot Isolation
SIAS – Snapshot Isolation Append Storage [Got16] (compare Section 3.1.5.1)
SSD – Solid State Drive
SuRF – Succinct Range Filter [ZLL+18]
TPC – Transaction Processing Performance Council [TPC22]
TPC-C – Transaction Processing Performance Council Benchmark C [TPC10]
TPC-H – Transaction Processing Performance Council Benchmark H [TPC21]
tp{m,s} – Transactions per Minute, Second
TS – Timestamp
TT – Trace Tree
TX_TS – MVCC Transaction with Timestamp TS
VID – Virtual Identifier
WI/O – Write I/O Costs
WA – Write Amplification
YCSB – Yahoo! Cloud Serving Benchmark [CST+10; RKD21]

Zusammenfassung

The properties of modern database workloads are guided by the requirements of business applications that aim for knowledge gains and a technological lead over competitors. These workloads are characterized by exponential data growth, a high share of continuous insertions and analytical processing. Flash secondary storage is an economical way to deal with large amounts of modifiable data if its characteristics are leveraged efficiently. In this context, the maintenance of physically materialized tuple versions in base tables leads to cost-efficient and scalable transactional throughput due to beneficial access patterns on secondary storage media. These beneficial properties, however, hardly apply to common version-oblivious secondary access paths. Their actual benefit is limited by maintenance as well as additional costs for search and visibility checking.

By means of empirical methods, i.e. literature studies and controlled experiments, specific properties of modern workloads, Flash-based storage technologies and state-of-the-art data management techniques in academia and industry are gathered in order to identify existing problems, research opportunities and challenges in their interaction. Based on the findings derived from this, both novel and matured techniques are considered with the goal of developing a new multi-version storage and index management structure for the properties of modern storage technologies. A prototypical implementation in well-known systems is experimentally evaluated with standardized workloads.

This thesis makes an important contribution to modern data management with Multi-Version Partitioned BTrees (MV-PBT). This comprises an append-based tuple version management for workloads with constantly high data ingestion rates and analytical processing on a shared dataset, which mainly resides on secondary storage. First, the approach improves the selectivity of secondary access paths by introducing internal visibility checks; in mixed workloads this doubles the analytical throughput and increases the transactional throughput by 14%. Second, the tuple update scheme improves index maintenance costs by appending logically chained versions, which yields a 47% higher throughput. Third, due to its write characteristics and cost-efficient search operations, MV-PBT is well suited as a storage management structure, doubling the throughput compared to the widely used LSM-Trees.
Fourth, the logical linkage of version data facilitates independent partitioning, reorganization and maintenance techniques, which enables consistent throughput, increased by multiples, for various workloads. Finally, bloomRF, as a novel, low-cost point and range filter technique, enables consistent performance for search and maintenance operations in MV-PBT.

This thesis elaborates the benefits of high-performance multi-version index management for the properties of modern workloads. Moreover, extensible design concepts based on the B+-Tree combine modern main-memory techniques with enormous amounts of data and form the basis for the integration of decentralized processing and storage technologies. MV-PBT is suitable for a broad range of applications and constitutes a holistic replacement for existing storage and index management structures.

Abstract

Trends in modern database workload properties are guided by business application needs, characterized by exponential growth of data, high-rate continuous insertions and analytical processing, aiming for knowledge gains and a leading edge over competitors. Cheap Flash-based secondary storage devices provide an economical way to deal with massive amounts of modifiable data whenever their characteristics are efficiently leveraged. It turns out that the maintenance of physically materialized tuple version records in base tables not only scales with the number of concurrent transactions but also provides beneficial access patterns to secondary storage devices, whereby asset and operational costs become manageable. However, these beneficial characteristics hardly hold for common version-oblivious additional access paths. Their actual profit is limited by excessive maintenance as well as additional search and visibility check costs.

By means of empirical methods, i.e. literature studies and controlled experiments, the characteristics of modern workloads, Flash-based storage hardware as well as state-of-the-art data management techniques in academia and industry are gathered in order to identify existing problems, research opportunities and challenges in their mutual interaction. Based on the derived findings, novel as well as matured techniques are considered to design a new kind of version-aware and hardware-leveraging storage and index management structure, which is prototypically implemented, integrated into well-known systems and experimentally evaluated by system performance benchmarks.

This thesis makes significant contributions to modern data management with Multi-Version Partitioned BTrees (MV-PBT) – an append-based multi-versioned tuple maintenance structure for high-rate continuous insertion workloads with analytical processing on a common dataset instance, which comprises massive amounts of data on secondary storage devices. First, the approach massively improves the selectivity of additional access paths by the introduced index-only visibility checks, yielding 2× increased analytical and 14% increased transactional throughput in mixed workloads. Second, the strict append-based and out-of-place replacement update scheme facilitates an improved benefit of maintained indexes, with 47% improved throughput. Third, due to its near-optimal write characteristics and cost-efficient searches, the applied approach is highly qualified as a storage management structure, with 2× increased throughput compared to the widely used LSM-Tree.
Fourth, the logical linkage of version records facilitates independent partitioning, reorganization and maintenance techniques as well as robust performance characteristics for various workload properties, with gains of up to orders of magnitude. Last, bloomRF, as a novel low-cost point-range filter technique, enables robust performance characteristics for search and maintenance operations in MV-PBT.

The contributions of this work clearly elaborate the benefits of high-performance version-aware index management for recent developments in modern workload properties. Moreover, extensible design concepts based on the ubiquitous B+-Tree combine modern in-memory techniques with massive amounts of data and form a basis for recent trends in decentralized processing and storage hardware technologies. MV-PBT exhibits a broad range of applicability and facilitates a full substitution of matured storage and index management structures.

Chapter 1

Introduction

This introductory chapter first motivates the scope of action and its significance to decision-makers in a general business context. Within this context, I share my personal experience of being involved, as a trainee, in the launch of a customer relationship management system and a business warehousing solution on top of an existing enterprise resource planning application. Building on that, the consequent challenges of this motivational context are outlined and the treated contributions are specified. A brief list of the author's publications within this research context and the structure of this thesis close the chapter.

1.1. Motivation

Recent trends in business process requirements for information management systems result in new types of workloads on their most performance-critical backbone – the database management systems (DBMS). Traditionally, DBMS workloads can be separated into different categories, such as online transactional processing (OLTP) or online analytical processing (OLAP). Their characteristic access patterns as well as their effects on data management technologies differ massively. The emerging popularity of large-scale, data-intensive, real-time analytical applications combines these characteristics in hybrid transactional and analytical processing (HTAP) workloads on the same dataset instance [HG20; MBL17; ÖTT17; RVGP20; SDA21].

An HTAP scenario brings several benefits to an enterprise in contrast to traditional data warehousing approaches, in which data is extracted from the transaction processing system, transformed into a special query-optimized form and loaded into an analytical processing system (ETL process). First, real-time data in HTAP scenarios improves and simplifies the quality as well as the output of decision-making processes in business operations [HG20; ÖTT17]. 'Greater situation awareness' and 'improved business agility' are required skills to stay competitive [Gar15]. In my personal experience, any delay in the relevance of reported data reduces user experience and increases frustration. Furthermore, analytical outputs influence decisions in operative tasks, like forecasting, pricing or production planning and controlling. Belatedly or incorrectly maintained data in an OLTP system adversely affects business operations between ETL processes or requires enormous manual adjustments and manpower. Either way, significant costs accrue to the company. Second, if a system is able to adopt the new HTAP approach, the information management infrastructure of an organization could be simplified [Gar15; MBL17].
Less complex infrastructures are cheaper to maintain and evolve. Last, global players have (almost) no operational downtime. The extraction process is likely to stress the OLTP system and to significantly shrink transactional throughput for a long period of time¹.

¹ [Sup21] mentions a general job execution time of 2-4 hours.

Alongside HTAP characteristics, upcoming technologies, markets, business cases and cloud services yield data-intensive and high-rate continuous insertion workloads. Internet of Things (IoT) platforms [Bos21; IBM21; Mic21a; Ora21b; SAP21; Ser21] handle the data of billions of IoT devices. The intention is by no means just to store massive amounts of data, but rather to gain knowledge and a distinct technological competitive edge. For instance, an automotive manufacturer could schedule service intervals based on evaluated wear and sensor data correlations gathered by cloud IoT platforms.

Quality-of-service aspects in such cloud approaches are contractually fixed in service level agreements [Koh18]. Violations of these metrics, e.g. in transactional throughput, response times, availability or consistency, are associated with noticeable monetary costs [SOSM12]. A key driver for compliance in data-intensive tasks naturally lies in the DBMS and its appropriated hardware resources. In this context, the predominant approaches are evaluated in [HSB15]. Scaling up means upgrading powerful components in single nodes, whereas scaling out allows the acquisition of more processing nodes. Both approaches, especially in combination, enable the management of a massive amount of data. Accompanying challenges in scale-out methods, like partitioning, communication, data locality and skew handling [CL16; RMU+14; Röd16; XY17; ZBS15], and in scaling up hardware [FZZ+19; KHL18; PDZ+18], are in the scope of recent proposals. Several solutions focus on main-memory optimizations [DFI+13; FML+12; KN11; Pav14], e.g. in the area of key-sorted indexing [KFM+15; LKN13; LLS13; ZCWJ21]. Essential prerequisites are large volatile main-memory capacities and powerful processing units as well as persistent non-volatile secondary storage resources. Increasing main memory (scale-up) or scaling out the problem space with new data nodes for growing dataset sizes is costly in hardware, operation and administration [LHKN18].

Focusing only on main memory leads to natural limits – in a technical as well as in an economical context [Lom18; NF20]. The much cheaper secondary storage technologies play no role in the scaling principles of main-memory-oriented configurations and are limited to backup tasks and archiving. Facebook is a pioneer in the introduction and evaluation of new solid state storage devices (SSD) and upcoming non-volatile memory (NVM) in massive data production setups [EGA+18; MWKM15]. Leveraging the characteristics of new storage hardware in a complex memory hierarchy allows performance comparable to expensive scaled-up main-memory solutions on cheap – typically low main-memory equipped – data nodes. Establishing a technological edge by leveraging hardware characteristics even further could significantly reduce asset, administrative and operative costs.

In the combination of these contexts, current storage managers and the structures of (secondary) access paths encounter problems, which are outlined in the following section.
1.2. Problem Statement

Viewed in isolation, the aspects of application-generated workloads (Section 2.2), the characteristics of modern hardware technologies (Section 2.1) and the fundamentals of DBMS design (Section 2.3) are well studied and partially state-of-the-art. Combining these aspects implies new challenges.

Application-generated high-rate continuous insertion and analytical workloads become read- and write-intensive for the secondary storage hardware of data nodes. Additional access paths can reduce the unnecessary read effort and improve latencies, but they require frequent maintenance. Whereas base table storage management structures like Snapshot Isolation Append Storage (SIAS, compare Section 3.1.5.1) [Got16] obtain beneficial append-based write patterns very well, traditional strictly alphanumerically sorted additional access paths result in a random write pattern, poor latency and high write amplification (WA, the ratio of physically to logically written data). Additional access paths thereby put massive pressure, amplified by WA and unmitigated by buffering, on secondary storage. The bandwidth of secondary storage devices is not used efficiently, which shrinks the overall throughput of the DBMS. Maintaining the additional access path can become more expensive than scanning the base table.

Considering the different characteristics of main-memory primary storage and Flash-based secondary storage techniques, the problem space grows with complex memory hierarchies (compare Sections 2.1 and A.1). Generally, the input-/output-operation (I/O) patterns of traditional additional access paths do not leverage the characteristics of Flash-based secondary storage devices and shrink performance in read- and write-intensive workloads. By this means, a company's investments are not optimally used. Furthermore, the durability of secondary storage devices shrinks due to the high update rates and WA of additional access paths. The endurance of Flash-based storage media depends on physical write-/erase-cycles. The write patterns and WA of strictly key-sorted additional access paths drastically wear out secondary storage devices.

A key characteristic of almost all commercial and academic DBMS is the application of Multi-Version Concurrency Control (MVCC, compare Section 2.3). Multiple tuple versions are maintained, each valid for a different period in time. The major benefit is improved throughput in the DBMS, because concurrent reads and writes do not block each other and proceed in their own calculated snapshots. However, supplemental work is necessary for the detection of the tuple version visible to a transaction. Typically, additional access paths are version-oblivious due to maintenance costs, and the DBMS determines visibility by means of expensive base table look-ups or large cached in-memory structures. In principle, Flash-based secondary storage devices perform well on reads due to internal parallelism and asymmetric read/write behavior (outlined in Section 2.1), but visibility checking inflates the amount of – in principle – unnecessarily read data, especially in an HTAP scenario.

Briefly, the DBMS and its characteristics are the linkage between application-generated workloads and the optimal usage of hardware technologies. Leveraging these characteristics can reduce the running expenses of a business and allow new types of workloads. Current additional access paths as well as virtually every storage manager do not provide adequate characteristics. The goal of this work is to close this gap.
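To make the write amplification figure from above concrete, consider a hedged back-of-the-envelope example; the numbers are illustrative assumptions, not measurements from this thesis. If an update logically changes a 100 B index entry but the strictly key-sorted access path has to rewrite the entire 8 KiB leaf page that holds it, then

$$WA = \frac{S_{physical}}{S_{logical}} \approx \frac{8\,192\ \text{B}}{100\ \text{B}} \approx 82,$$

before any device-level write amplification caused by Flash garbage collection multiplies this value further.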
1.3. Contributions

The central contributions of this thesis are the conception, development and evaluation of a storage and index management structure which meets the demands of modern workloads and hardware technologies. To the best of the author's knowledge, no other storage and index manager fully integrates these aspects in one single consistent structure without substantial cuts in the performance and endurance of data nodes, architectural fuzziness in the DBMS or massive complexity in evolving and adjusting the DBMS backend. The presented approach is the Multi-Version Partitioned BTree (MV-PBT), a unified tree-based multi-version storage and index management structure. Its origin in, and the basic capabilities of, a regular and well-studied B+-Tree [BM70] enable a simple integration into existing architectures and the adoption of recently published B-Tree techniques. A major benefit of tree structures is the natural alphanumeric sort order, which – in contrast to hash-based indexes – enables powerful range querying in HTAP analytical workloads. In addition, due to the lack of pre-filtering for data ranges, a range filtering approach is presented: bloomRF (Bloom range filter) allows data skipping and increases throughput in MV-PBT by introducing, for the first time, piecewise-monotone hash functions and prefix hashing in a Bloom filter.

Research objectives are formulated, on the basis of the contributed approach, in the following research questions (RQ):

RQ1: How could a visibility check on multi-version data be performed in Partitioned BTrees while leveraging modern hardware characteristics?
RQ2: What further applications arise from timestamp-based index-only visibility checking in an MV-PBT?
RQ3: What optimizations of the reading behavior are required for MV-PBT in the areas of data skipping and buffer efficiency?
RQ4: How do online reorganization methods in MV-PBT enable workload adaptivity?
RQ5: What are the performance effects of optimized garbage collection in MV-PBT?

Further, more detailed contributing aspects of this thesis are as follows:

Formulation of findings on how a storage and index management structure needs to interact, based on a characterization of modern workloads and hardware technologies. In order to understand the evaluation and conception of their linking layer – the DBMS storage structures – a brief characterization of modern workloads and hardware technologies is provided. Impacts on data management structures are elaborated. Thereby, the relevance of DBMS design decisions and the underlying core mechanics become visible.

Theoretical and experimental evaluation of version models in multi-version DBMS and K/V-Stores. Primarily, the concepts and impacts of Multi-Version Concurrency Control (MVCC) designs, buffer management and additional access paths are focused on. One key finding in DBMS is that the optimal version organization model differs between base table main storage and additional access path maintenance.

Analysis of state-of-the-art literature as well as theoretical and experimental evaluation of existing storage and indexing approaches. This thesis outlines a brief review of current state-of-the-art storage and index management structures in the areas of multi-version capabilities and leveraging modern storage hardware.

Introduction of the Index-Only Visibility Check and demonstration of its relevance, especially in an HTAP scenario. Visibility checking is an expensive operation whose cost grows linearly with the length of version chains, but it is required in multi-version DBMS.
Modern indexing approaches are required to support this operation.

Enabling a strict concept of out-of-place updates and invalidations. MV-PBT introduces several different record types, whereby robustness in situations of high concurrency and update intensity is enabled.

Conception and evaluation of workload adaptivity, reorganization and garbage collection techniques in MV-PBT. Whereas state-of-the-art storage management structures focus on compaction mechanics for increased look-up performance by reducing read amplification (RA), MV-PBT aims to achieve a near-optimal WA and leverages its natural batch-wise partitioning sequence in order to guarantee robust access and update performance.

1.4. Scientific Publications

The contributions of this thesis and influential extended work have been introduced in various scientific peer-reviewed or preprint publications, denoted in the following list:

[MRBP23] B. Moessner, C. Riegger, A. Bernhardt, I. Petrov. 'bloomRF: On Performing Range-Queries in Bloom-Filters with Piecewise-Monotone Hash Functions and Prefix Hashing'. In: EDBT'23 (accepted) (2023).

[RP22] C. Riegger, I. Petrov. 'Storage Management with Multi-Version Partitioned BTrees'. In: ADBIS'22 (accepted) (2022).

[RBMP20] C. Riegger, A. Bernhardt, B. Moessner, I. Petrov. 'bloomRF: On Performing Range-Queries with Bloom-Filters based on Piecewise-Monotone Hash Functions and Dyadic Trace-Trees'. In: CoRR abs/2012.15596 (2020). arXiv: 2012.15596. URL: https://arxiv.org/abs/2012.15596.

[RVGP20] C. Riegger, T. Vinçon, R. Gottstein, I. Petrov. 'MV-PBT: Multi-Version Index for Large Datasets and HTAP Workloads'. In: Proceedings of the 23rd International Conference on Extending Database Technology (EDBT 2020). Copenhagen, Denmark, 2020.

[VWB+20] T. Vinçon, L. Weber, A. Bernhardt, A. Koch, I. Petrov, C. Knödler, S. Hardock, S. Tamimi, C. Riegger. 'nKV in Action: Accelerating KV-Stores on Native Computational Storage with Near-Data Processing'. In: Proc. VLDB Endow. 13.12 (2020), pp. 2981–2984.

[VHR+19] T. Vinçon, S. Hardock, C. Riegger, A. Koch, I. Petrov. 'nativeNDP: Processing Big Data Analytics on Native Storage Nodes'. In: ADBIS. 2019.

[PKH+19] I. Petrov, A. Koch, S. Hardock, T. Vinçon, C. Riegger. 'Native Storage Techniques for Data Management'. In: 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE, 2019, pp. 2048–2051.

[RVP19] C. Riegger, T. Vinçon, I. Petrov. 'Indexing Large Updatable Datasets in Multi-Version Database Management Systems'. In: Proceedings of the 23rd International Database Applications & Engineering Symposium. IDEAS '19. Athens, Greece: Association for Computing Machinery, 2019.

[RVP18a] C. Riegger, T. Vinçon, I. Petrov. 'Efficient Data and Indexing Structure for Blockchains in Enterprise Systems'. In: Proceedings of the 20th International Conference on Information Integration and Web-Based Applications & Services. iiWAS2018. Yogyakarta, Indonesia: Association for Computing Machinery, 2018, pp. 173–182.

[RVP18b] C. Riegger, T. Vinçon, I. Petrov. 'Efficient Data and Indexing Structure for Blockchains in Enterprise Systems'. In: IBM Technical Report RC25681. 2018.

[PVK+19] I. Petrov, T. Vinçon, A. Koch, J. Oppermann, S. Hardock, C. Riegger. 'Active Storage'. In: Encyclopedia of Big Data Technologies. 2019.

[PKV+19] I. Petrov, A. Koch, T. Vinçon, S. Hardock, C. Riegger. 'Hardware-Assisted Transaction Processing: NVM'. In: Encyclopedia of Big Data Technologies. 2019.
[VHR+18] T. Vinçon, S. Hardock, C. Riegger, J. Oppermann, A. Koch, I. Petrov. 'NoFTL-KV: Tackling Write-Amplification on KV-Stores with Native Storage Management'. In: Proceedings of the 21st International Conference on Extending Database Technology (EDBT 2018). Vienna, Austria: OpenProceedings, 2018, pp. 457–460.

[RVP17a] C. Riegger, T. Vinçon, I. Petrov. 'Multi-Version Indexing and Modern Hardware Technologies: A Survey of Present Indexing Approaches'. In: Proceedings of the 19th International Conference on Information Integration and Web-Based Applications & Services. iiWAS '17. Salzburg, Austria: Association for Computing Machinery, 2017, pp. 266–275.

[RVP17b] C. Riegger, T. Vinçon, I. Petrov. 'Write-Optimized Indexing with Partitioned B-Trees'. In: Proceedings of the 19th International Conference on Information Integration and Web-Based Applications & Services. iiWAS '17. Salzburg, Austria: Association for Computing Machinery, 2017, pp. 296–300.

1.5. Structure of this Thesis

Now that the motivational context, the consequential problem statements and the contributions have been defined, the remainder of this thesis is structured as follows. In Chapter 2, a brief introduction to the technical background is given, providing the required insights and concepts in the areas of hardware technology characteristics, workload characteristics and the fundamentals of DBMS design. Based on the outlined findings, several mostly key-sorted indexing approaches are introduced and evaluated in Chapter 3. The designated foundational approach, Partitioned BTrees (PBT), is elaborated and enhanced to the version-aware Multi-Version Partitioned BTree (MV-PBT), incorporating the determined concepts and capabilities, in Chapter 4. Shortcomings in existing point-range filter techniques – a basic requirement for efficient data skipping in MV-PBT – lead to the concept and development of bloomRF, outlined in Chapter 5. Chapter 6 provides a full integration and experimental evaluation of MV-PBT in an append-based DBMS and a K/V-Store. Finally, a brief summary and an overview of research opportunities are presented in Chapter 7.

Chapter 2

Technical Background

With the aim of understanding the various aspects and challenges of key-sorted access paths, modern complex memory hierarchies and the diverse characteristics of their members are described first. Subsequently, the natural behavior of evolving workload types is outlined. Thereupon, the fundamentals of modern database management system (DBMS) design techniques are provided. With respect to the formulated characterizing statements, the emerging significance of robust, Flash-leveraging, key-sorted multi-version storage and index management structures is elaborated.

2.1. Characteristics of modern Hardware Technologies

Understanding the basic concepts and characteristics of modern hardware technologies is crucial for database access method design. Execution time on hardware can be simplified to processing ($T_{processing}$) and data access ($T_{data\text{-}access}$) costs. Adopting the trends towards resource and memory efficiency in modern K/V-Stores [EGA+18; Inc22; MWKM15], access paths must consider the characteristics of secondary storage devices. Even though the instructions of data-intensive operations are frequently executed, the access latencies to massive amounts of data on high-capacity secondary storage are assumed to become the dominant factor, so that:

$$T_{data\text{-}access} \gg T_{processing} \qquad (2.1)$$

Therefore, the scope of this work is focused on the characteristics of the memory hierarchy.
Due to the interdependence of processing and data provisioning, a brief characterization of recent processing evolutions is given.

2.1.1. Highly parallelized and decentralized processing units

Over the past decades, processing power has benefited from increasing clock speeds, and applications generally profit from this trend. The latest developments in processing units yield increased parallelism and manifold unconventional computation design options. Even though the scope of this thesis is not primarily focused on optimizing processing costs, one central point can be formulated: data access methods need to be aware of decentralized computing models. Multi-core central processing units (CPU) and various possible hardware accelerator units are probably involved in a modern DBMS [BGHS19] and need to operate complementarily on a consistent dataset instance, without shrinking performance due to concurrency issues such as excessive cache invalidation or double buffering. Efficient data provisioning and persistence for decentralized and parallel calculations is the purpose of the memory hierarchies on data node servers.

2.1.2. Primary and Secondary Storage Characteristics

The components of the memory hierarchy are the bottleneck of massively data-intensive tasks. Efficient use of the clock cycles of processing units relies on data provisioning along these volatile primary and persistent secondary storage devices. Latency describes the delay in response time between components. As a general rule, latencies, but also capacities, increase with greater distance to a processing unit. CPU caches retrieve cache-line-sized data from main memory (e.g. RAM). Whenever an already persistent (mostly volatile) working copy is not present on a byte-addressable memory device, it is retrieved by a read I/O operation from a block-addressable storage device. The other way around, newly allocated data or modified working copies in volatile memory are persisted on block-addressable secondary storage devices by write I/O operations.

Traditionally, only two levels of the memory hierarchy are considered in DBMS algorithms – volatile main memory and persistent disk [Gra11]. The latencies of read and write I/O operations form a massive access gap (compare Section A.1). Whereas the wait latency between caches and memory is about ×10², the access gap between caches and an HDD (the traditional, cheap, mechanical Hard Disk Drive) is about ×10⁶. Traditional memory and storage media commonly exhibit symmetric read and write latencies as well as benefits for sequential operations.

Evolving and recently established storage technologies close the access gap (compare Complex Memory Hierarchies in Section A.1), but also broaden the characteristics of the devices in memory hierarchies. Regarding the latencies of solid state drives (SSD) and non-volatile memories (NVM), they naturally fill the gap; however, their special characteristics¹ must be considered:

¹ The characteristics are exemplarily taken from SSDs. A detailed technical background is given in Section A.2.

High level of inherent parallelism. The independent or decomposed execution of I/O operations within the several structural levels of an SSD has been well studied [APW+08; HJF+11; PSS+10; RZA+12; Shi17; SXX+09; WKS15]. Leveraging the structural characteristics by such techniques enables much higher parallelism and I/O performance than has been known from HDDs. Access structures might leverage this high parallelism whilst accessing data – i.e.
SSDs cope well with more purposeful I/O from DBMS and K/V-Stores.

Asymmetric read and write performance. Flash mainly supports the native operations read, write and background erase. Reads perform an order of magnitude better than writes and two orders of magnitude better than erases [CKZ09; Got16] regarding latencies and IOPS (I/O per second). Especially in the case of random write I/O, the latency and costs alternate between cheap sequential writes and expensive erase operations [BJB09]. SSDs contain several on-device caches for many purposes: caching of address translation mapping tables, performing asynchronous read-ahead, concealing the asymmetric behavior by write buffering or enabling background tasks. Generally, operation costs depend on the internal state of an SSD. High-rate continuous I/O operations uncover the asymmetric behavior in steady-state conditions [DISK20; Got16]. As depicted in Figure 2.1b, write throughput quickly drops as the caches saturate, at different utilization levels depending on the block size. In contrast, read behavior remains constant for all block sizes (Figure 2.1a). Access structures leverage the asymmetric I/O characteristics by minimizing write accesses, whereas SSDs cope well with the increased read I/O.

Out-of-place update operation. Unlike HDDs, on which a page at a specific address can be physically overwritten in place, Flash requires a clean page status before writing. Rewriting a page requires an out-of-place write to a different clean page and an update of the address translation mapping table. The costly erase is performed by a background garbage collection (GC) task at a later point in time [CKZ09; DISK20]. Access structures must avoid in-place updates of already persisted data and replace modifications by out-of-place updates.

[Figure 2.1.: Read and write metrics of the Intel DC P3600 enterprise SSD and the Samsung 850/860 Pro consumer SSD applied in the testbed – (a) IOPS random read and (b) IOPS random write over time for different block sizes (in kB); (c) Intel DC P3600 – sequential vs. random.]

Advantages of a sequential write I/O pattern. As an effect of out-of-place page replacement, inherent parallelism and low-memory-footprint hybrid/block-level mapping in on-device caches, sequential write patterns can improve the performance of an SSD [MFL14]. Sequential sectors can be arranged and distributed along independent Flash components. Random writes do not result in sequential sectors, whereby the memory footprint of the mapping tables increases and background maintenance operations, like garbage collection, become more complex and expensive. As a general rule, "random writes should be limited to a focused area" and "sequential writes should be limited to a few [concurrent] partitions" [BJB09]. Figure 2.1c indicates roughly equal, very good performance for sequential and random read I/O for different page sizes and numbers of pending I/O requests (queue depth); however, random write performance is only a fraction of the sequential operation at all measurement points. Generally, access structures must adopt sequential writes for the vast majority of the payload, whereas a few random writes of small meta structure data are feasible. In contrast, read latencies remain roughly unaffected by access patterns.
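The following minimal C++ sketch illustrates the append-oriented write principle derived above: modifications are accumulated in memory and flushed as one large sequential write instead of many small random in-place writes. It is an illustrative sketch only – the record layout, batch size and file name are assumptions – and it is not the partition-switch mechanism of MV-PBT described in Chapter 4.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// A fixed-size record standing in for an index or version record (assumed layout).
struct Record {
    uint64_t key;
    char payload[56];
};

// Buffers records in memory and appends them to a file in one large
// sequential write once the buffer is full - the write pattern SSDs favor.
class AppendWriter {
public:
    AppendWriter(const std::string& path, std::size_t batchSize)
        : out_(path, std::ios::binary | std::ios::app), batchSize_(batchSize) {
        buffer_.reserve(batchSize_);
    }

    void put(const Record& r) {
        buffer_.push_back(r);
        if (buffer_.size() >= batchSize_) flush();
    }

    // One sequential write of the whole batch instead of one random write per record.
    void flush() {
        if (buffer_.empty()) return;
        out_.write(reinterpret_cast<const char*>(buffer_.data()),
                   static_cast<std::streamsize>(buffer_.size() * sizeof(Record)));
        out_.flush();
        buffer_.clear();
    }

    ~AppendWriter() { flush(); }

private:
    std::ofstream out_;
    std::size_t batchSize_;
    std::vector<Record> buffer_;
};

int main() {
    AppendWriter writer("append_demo.bin", 1024);  // hypothetical file and batch size
    for (uint64_t k = 0; k < 10000; ++k) {
        Record r{k, {}};
        std::memcpy(r.payload, "demo", 4);
        writer.put(r);  // modifications are batched, not written in place
    }
    return 0;  // destructor flushes the remaining batch sequentially
}
```

The design choice mirrors the maxim above: the bulk of the payload reaches the device as large sequential appends, while only small metadata would ever be written randomly.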
Background garbage collection (GC). Writing a page requires a clean on-device target-page status. As a result, write-back operations of updated pages are performed at a different storage location (out-of-place) and the former page version becomes invalid. Erase operations are very expensive but are required for space reclamation. A fundamental property of Flash (NAND) is that data can be read and written at page level but has to be erased at the superior block level. First, a victim block is determined. Second, the still valid pages in the block are written to an appropriate different block location. Third, the victim block is erased. As a side effect, the write amplification (WA) increases on device level (WA-D) [MFL14; MWKM15]. Access structures must take GC into account; it is valuable for larger regions, and temporary space amplification by obsolete data is negligible on secondary storage.

Limited durability and wear. The status of the memory cells in pages and blocks is switched by write and erase operations in a write-erase-cycle. Each write-erase-cycle of a block wears out the transistor oxide layer of its memory cells. The approximate maximum number of write-erase-cycles depends on the underlying technique, whereas 3,000 cycles are common for consumer devices and 10,000 for enterprise devices. SSDs track the number of erases for each block. Wear-leveling methods distribute the wear equally over all blocks, and unreliable blocks are consequently discarded [MFL14; MWKM15]. Knowing the maximum number of write-erase-cycles per block ($N_{w/e\text{-}cycles}$), the absolute storage capacity of the SSD in blocks ($S_{capacity}$), the data modified by a workload ($S_{modified}$) and the write amplification (WA of the data structure and on-device), the durability (D) is calculated as follows:

$$D = \frac{N_{w/e\text{-}cycles} \times S_{capacity}}{WA \times S_{modified}} \qquad (2.2)$$

In the denominator, the multiplication of WA and $S_{modified}$ defines the wear in written blocks. Since wear is equally distributed over all blocks, the absolute number of write-erase-cycles of the SSD is calculated by the multiplication of $N_{w/e\text{-}cycles}$ and $S_{capacity}$ in the numerator. Access structures must minimize WA, since it negatively affects Flash's durability; a worked example follows at the end of this subsection.

[Figure 2.2.: IOPS and latency metrics of the applied SSDs in the testbed for different workload characteristics and block sizes – (a) IOPS for random/sequential and read/write distributions, (b) average latency for random/sequential and read/write distributions.]

I/O transfer time dominates. Accessing data on secondary storage requires time for seek and transfer. Since SSDs have no electro-mechanical components, the seek time is very low, whereas, unlike on HDDs, the transfer time becomes the dominant part. Larger transfer/page sizes result in fewer I/O operations per second (IOPS, compare Figure 2.2) for all access patterns. Access structures must consider optimizing the net transferred data by cheap access structures and appropriate page sizes.
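To illustrate Equation (2.2), consider a hedged example with assumed figures, not measurements from the testbed: an SSD with $N_{w/e\text{-}cycles} = 10\,000$ cycles per block, a capacity of $S_{capacity} = 400$ GB, a workload modifying $S_{modified} = 1\,000$ GB per day, and an overall write amplification of $WA = 10$ yields

$$D = \frac{10\,000 \times 400\ \text{GB}}{10 \times 1\,000\ \text{GB/day}} = 400\ \text{days} \approx 1.1\ \text{years}.$$

Halving the write amplification to $WA = 5$ under the same assumptions doubles the expected durability to roughly 2.2 years, which is why access structures with near-optimal WA directly extend device lifetime.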
2.1.3. Testbed Hardware Metrics and Operating System

Performance evaluation benchmarking is performed on a 64-bit Ubuntu 16.04.7 LTS (Xenial) server setup with Linux kernel version 4.4.0-210-generic. Benchmarks and DBMS are compiled with GCC version 9.4.0. The CPU is a 4-core (8-thread hyper-threading) Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50 GHz with internal L1/L2/L3 caches of 256 kB/1 MB/10 MB. Memory is provided by 2×16 GB DDR4 RAM devices (manageable up to 32 GB by boot parameters). The server is equipped with a 400 GB Intel DC P3600 PCIe enterprise SSD as well as a 500 GB and a 1 TB SATA Samsung 850/860 Pro consumer SSD, whereby logging and payload perform without any interference and at different intensity levels. Generally, write caches are disabled for steady state, if possible. Read and write characteristics are depicted in Figures 2.1 and 2.2 and are in accordance with the Flash characteristics outlined in Section 2.1.2.

2.2. Modern Workload Properties

In order to design, implement and evaluate a version-aware storage and index management structure that leverages modern hardware characteristics, several types of workloads are considered. First, Online Transaction Processing (OLTP) represents traditional business-contextual data management use cases, executing small updates and inserts alongside multiple reads [TPC10]. The very opposite in query execution characteristics is a decision-support Online Analytical Processing (OLAP) workload, in which queries with a high degree of complexity and a large volume of examined data are performed [TPC21]. An upcoming type of workload is the mixture of OLTP and OLAP in Hybrid Transactional and Analytical Processing (HTAP) on a huge shared dataset instance [CFG+11; HG20; ÖTT17]. Cloud data serving workloads bring OLTP-like transaction processing together with massive amounts of data and high-rate continuous insertions [CST+10].

Some general assumptions about modern workload properties are as follows:

Exponential growth of data. Ongoing digital transformation and the personalization of services incur rapidly growing dataset sizes [DB20; SG06]. Massive amounts of data have to be managed efficiently.

High-rate continuous insertion / modification. Evolving types of applications and data sources with varying lifespans incur frequent data modification. Furthermore, the availability and quality of services require stable steady-state throughput (elasticity) for high-rate continuous insertion and modification workloads [CST+10].

Application of HTAP workloads. Integrating data that relies on various isolated solutions leads to inflexible and complex infrastructure landscapes. Data-centric infrastructures, in which applications and services are designed and built upon a core of a commonly applied and maintained dataset instance [DB20; HG20; ÖTT17; RVGP20], enable cheap administration and flexible evolvability.

2.2.1. Application of standardized Database Benchmarks

The introduced workload property assumptions include high-rate continuous insertions and modifications with simultaneous analytical querying on massive amounts of data, which are potentially performed on a shared dataset instance. These properties are well represented in HTAP as well as cloud serving benchmarks. It is common practice to apply the best matching standardized database benchmark.
Popular benchmarks are designed by the Transaction Processing Performance Council (TPC)¹.

¹ 'The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable data to the industry.' [TPC22]

The TPC-C OLTP benchmark is very common in database benchmarking. OLTP mainly performs transactions of multiple reads as well as small updates and inserts [TPC10] in a business-contextual database schema with multiple indexes (compare Figure 2.3, TPC-C part). The open-source DBT-2 [WW21] implementation has been applied as the business-contextual OLTP benchmark in this thesis. Beyond the commonness of TPC-C as an OLTP benchmark, however, there is a further significant reason for its application. The authors of [CFG+11] recognized the lack of meaningful HTAP workloads and developed the CH-Benchmark upon TPC-C (the TPC OLTP workload) and TPC-H (the TPC OLAP workload). As depicted in Figure 2.3, the CH-Benchmark operates on a commonly shared dataset instance, yielding special HTAP characteristics. This benchmark is integrated in the OLTP-Bench framework [Pav+21].

As an alternative for high-rate continuous insertion workloads on K/V-Stores, the Yahoo! Cloud Serving Benchmark (YCSB) [CST+10] is applied in [RKD21]. YCSB consists of different key-value-style workloads for cloud-based applications. In the core workload package, several mixtures of key-value put, get and scan operations for different data and request distributions as well as sizes are supported. Many parameters are adjustable for the simulation of arbitrary use cases.

[Figure 2.3.: Schema of the CH-Benchmark [PWM+14], which adopts TPC-C and TPC-H as an HTAP workload representative.]

2.3. Fundamental Design Decisions in Multi-Version Database Management Systems

Multi-Version Database Management Systems (in this thesis simply denoted as DBMS) as well as K/V-Stores are the backbone of data-intensive applications. By this are meant all DBMS which implement the Multi-Version Concurrency Control (MVCC) protocol and apply Snapshot Isolation (SI). The well-known MVCC is one of the most popular transaction management schemes, with many representatives: Oracle [Ora21a], Microsoft SQL Server [Mic21b], HyPer [KN11], SAP HANA [FML+12], Google Cloud Spanner [BBB+17], MongoDB WiredTiger [Mon21], NuoDB [Nuo21], PostgreSQL [Pos21] or MySQL-InnoDB [Ora21c], just to name a few.

In Figure 2.4, the modules of a DBMS are schematically depicted. Architectures of K/V-Stores vary and potentially provide fewer features and modules. Users interact with the DBMS through a user interface, according to which the user request is parsed into an execution plan. Thereupon, the query is executed by accessing data with the best matching access method and perhaps performing further operations. A required set of transaction properties is guaranteed by the concurrency control module, namely MVCC. The data accessed in massively data-intensive operations is likely to be located on secondary storage devices, e.g. on Flash SSDs. Access methods request the byte-addressable data from a RAM-located Buffer Manager. If the data is not present in the Buffer Manager, it is read from block-addressable secondary storage. In order to keep within the maximum memory footprint, buffers have to be evicted from the cache by a replacement policy. Unchanged buffers can be discarded; modified ones have to be written to secondary storage media by the logic of the Storage Manager. Data corruption is avoided by logging critical operations.
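The following C++ sketch illustrates, in a highly simplified form, the buffer manager interaction just described: access methods request pages from a RAM-resident cache, a miss triggers a read I/O, and eviction discards clean pages but writes dirty ones back through the storage manager. The class and function names, the LRU policy and the 8 KiB page size are assumptions for illustration; this is not the implementation of PostgreSQL, WiredTiger or MV-PBT.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

struct Page {
    std::vector<char> data;   // page-sized payload
    bool dirty = false;       // modified since it was read?
};

// Placeholder for the storage manager performing block I/O (assumed interface).
struct StorageManager {
    Page readPage(PageId) { return Page{std::vector<char>(8192), false}; }
    void writePage(PageId, const Page&) { /* block write I/O */ }
};

// Minimal buffer manager with an LRU replacement policy.
class BufferManager {
public:
    BufferManager(StorageManager& sm, std::size_t capacity)
        : storage_(sm), capacity_(capacity) {}

    // Access methods call fetch(); a miss causes a read I/O from secondary storage.
    Page& fetch(PageId id) {
        auto it = frames_.find(id);
        if (it == frames_.end()) {
            if (frames_.size() >= capacity_) evictOne();
            lru_.push_front(id);
            it = frames_.emplace(id, Frame{storage_.readPage(id), lru_.begin()}).first;
        } else {
            lru_.erase(it->second.lruPos);   // move the page to most-recently-used
            lru_.push_front(id);
            it->second.lruPos = lru_.begin();
        }
        return it->second.page;
    }

private:
    struct Frame {
        Page page;
        std::list<PageId>::iterator lruPos;
    };

    // Evict the least-recently-used page; only dirty pages incur a write I/O.
    void evictOne() {
        PageId victim = lru_.back();
        lru_.pop_back();
        Frame& f = frames_.at(victim);
        if (f.page.dirty) storage_.writePage(victim, f.page);
        frames_.erase(victim);
    }

    StorageManager& storage_;
    std::size_t capacity_;
    std::unordered_map<PageId, Frame> frames_;
    std::list<PageId> lru_;
};

int main() {
    StorageManager sm;
    BufferManager bm(sm, 128);   // assumed capacity of 128 frames
    Page& p = bm.fetch(42);      // miss -> read I/O from secondary storage
    p.dirty = true;              // modification -> write-back on eviction
    return 0;
}
```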
The scope of this thesis is the storage and index management represented by the Access Methods and the accompanying Buffer and Storage Manager modules, with a focus on the properties of the interlinked MVCC and on beneficial I/O to secondary storage devices. The Query Evaluation Engine as well as the Logging and Recovery modules are affected as well, though they are not treated in detail. Therefore, a detailed overview of MVCC designs is given.

[Figure 2.4.: Regular DBMS architecture (compare [RG03]), comprising the User Interface (SQL, DDL, key iterator), the Query Evaluation Engine (parser, execution planner, optimizer), the Access Methods (base table, additional access paths), Multi-Version Concurrency Control (transaction, lock and version management), the Buffer Manager, the Storage Manager, Logging and Recovery, and Flash storage accessed by read/write I/O. The scope of this thesis is in dashed lines, whereby critical interfering modules are shaded in darker and the main contributions in brighter gray.]

2.3.1. Opportunities and Version Management in Multi-Version Concurrency Control

Multi-Version Concurrency Control (MVCC) and Snapshot Isolation (SI) are well-known mechanisms in DBMS. However, the fundamentals of MVCC with SI are essential prerequisites for this thesis, therefore some background is provided. The parallel execution of transactions is a major scaling factor of DBMS, improving throughput and response times; however, interleaving concurrent transactions can cause data anomalies that are not tolerable for an application. Several well-known version-oblivious concurrency control mechanisms ensure integrity, but typically reading and writing transactions block each other, i.e. readers have to wait for the completion of a writer and writers wait for lock releases by readers. In an HTAP scenario this means that OLAP queries on update-intensive data never complete and OLTP modifications are not able to proceed. MVCC with SI overcomes these issues.

MVCC creates a new version of a data tuple on each update. As a result, readers do not block writers anymore, because read locks do not oblige writers to wait for their release. In MVCC, it is very common to leverage this versioning behavior for Snapshot Isolation (SI). SI lets transactions operate on a consistent (logical) state (snapshot) of the database, e.g. the latest change to a tuple committed at the begin of the transaction, or respectively the transaction's own modification, is visible. Moreover, SI prevents concurrent transactions from updating the same tuple. Due to the fact that every transaction knows its snapshot, writers do not block readers anymore. MVCC with SI enables a high level of concurrency and isolation – even in HTAP workloads with long-lasting analytical queries and frequent updates. Therefore, it is a good choice for leveraging the parallelism of processing units and Flash storage.

[Figure 2.5.: Logical view on tuple versions in MVCC with creation and invalidation timestamps [RVGP20]. The physical order typically differs.]

From a logical point of view, the maintenance of tuple versions is based on some prerequisites. Each tuple consists of at least one tuple version record. By this means, several tuple version records of one logical tuple exist in a version chain, and each of them is valid for a different period in time.
The validity of a tuple version, strictly speaking its visibility to a specific transaction snapshot, is determined by timestamps for creation ($t_{creation}$) and invalidation ($t_{invalidation}$)¹. A visibility check is responsible for returning the valid tuple version for a transaction snapshot, whereby the version chain is processed and the timestamps are evaluated. In order to process the version chain, an entry point to the chain must be known, i.e. a specific tuple version record, and each tuple version has to know its predecessor or successor. Version chains preferably have the structure of a latch-free singly linked list [WAL+17]. A tuple version record becomes obsolete if it is no longer visible to any active transaction snapshot, and is then removed by garbage collection (GC).

¹ Transaction timestamps constitute a logical sequence of executed transactions. In addition, DBMS possibly maintain further information, e.g. a sequence of command numbers per transaction in PostgreSQL [Pos21], in order to avoid race conditions. In this thesis, transaction timestamps are meant to include command numbers in the logical sequence as well.

Figure 2.5 depicts a logical view on a tuple version chain. Tuple t was changed several times by transactions $TX_{u0}$, $TX_{u1}$, $TX_{u2}$ and $TX_{u3}$, modifying the attribute a to 7, 3, 1 and 9. Each transaction maintains the creation timestamp of its version and the corresponding invalidation timestamp of its predecessor in the logical tuple version chain. Tuple y depicts a tuple version which is not a member of the version chain of tuple t. The depicted creation of new tuple version records is in accordance with the out-of-place update characteristics of Flash secondary storage devices.

Considering that a new version is created each time a tuple is updated in MVCC, this can happen in different ways – outlined in detail in the following sections. Several DBMS implement variations of the version management scheme, whereby the design decisions determine its capabilities and applicability. [WAL+17] evaluated different designs from a main-memory point of view. Building upon these insights, different MVCC designs are evaluated in this thesis with respect to the characteristics of secondary storage devices in a complex memory hierarchy [RVGP20].
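Before turning to the individual design decisions, the following C++ sketch puts the visibility rules above together: for a given snapshot it walks a version chain from its entry point and returns the first record whose creation is visible and whose invalidation is not. The snapshot test is deliberately simplified to a single snapshot timestamp (no command numbers, no lists of in-progress transactions), and the VersionRecord type is repeated from the previous sketch; both are illustrative assumptions rather than the check of any particular DBMS.

```cpp
#include <cstdint>
#include <optional>

using Timestamp = std::uint64_t;

struct Snapshot {
    Timestamp snapshot_ts;   // changes of transactions with timestamp <= snapshot_ts are visible
    bool sees(Timestamp t) const { return t <= snapshot_ts; }
};

struct VersionRecord {                        // minimal repetition for self-containment
    Timestamp t_creation;
    std::optional<Timestamp> t_invalidation;
    const VersionRecord* next = nullptr;      // next record in the singly linked version chain
};

// Walk the version chain from its entry point and return the version visible to the snapshot.
const VersionRecord* visible_version(const VersionRecord* entry, const Snapshot& snap) {
    for (const VersionRecord* v = entry; v != nullptr; v = v->next) {
        const bool created_visible = snap.sees(v->t_creation);
        const bool not_invalidated = !v->t_invalidation.has_value()
                                     || !snap.sees(*v->t_invalidation);
        if (created_visible && not_invalidated)
            return v;        // the valid tuple version for this snapshot
    }
    return nullptr;          // no version of the tuple is visible, e.g. it was inserted later
}
```

At most one record per chain satisfies both conditions under SI, since the invalidation timestamp of a version equals the creation timestamp of its successor.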
2.3.1.1. Version Storage

A logical tuple corresponds to one or more tuple versions, which form a singly linked list of records in a version chain (Figure 2.5). There are two possible physical representations of a tuple version (Figure 2.6): physically materialized or delta-record-based. The former implies that each tuple version record is stored entirely and physically materialized. By this means, each tuple version record contains all information of a logical tuple that is valid for a specific period of time. Whenever a version record is valid for a transaction snapshot, no further records are required to restore the logical tuple. The latter implies that each modification of a logical tuple results in a delta record, which indicates the difference to another tuple version (e.g. applied in the BW-Tree [LLS13; WPL+18]). Delta records are linked and have to be retrieved on demand by the DBMS storage manager for tuple reconstruction. Delta-record-based system designs typically store a single version record (the oldest or the newest – depending on the version ordering, compare Section 2.3.1.2) in the main store. Delta records are located in a separate storage location, e.g. the undo log (applied in InnoDB [Ora21c]) or a temporary version record store (applied as an option in MS SQL Server [Mic21b]). By this means, only the version record in the main store can be directly accessed; other version records have to be reconstructed by processing the main version record and all intermediate delta records until the valid tuple version is fully restored.

Figure 2.6.: Version Storage Alternatives [RVGP20]: physically materialized storage vs. delta-record storage (e.g. in an UNDO log or a version pool / temporary storage).

Both physical representation models can perform modifications in-place or out-of-place. The former creates a copy (or a delta) of the most recent valid tuple version representation at some other location and applies the version chain linkage information, the transaction timestamps and the desired modifications to the newer version in-place. The latter leaves the most recent valid tuple version representation unchanged and creates a tuple version record (or delta record) with the desired modifications at some other location, maintaining the version chain linkage information and transaction timestamps there.

Considering the characteristics of modern Flash storage (outlined in Section 2.1.2), physically materialized version record storage and an out-of-place update scheme are preferable, due to lower RA, WA and tuple reconstruction costs. Delta records tend to consume less space than materialized tuple versions, especially in the case of large tuple sizes and small modifications, but they require additional processing and all predecessors or successors for tuple reconstruction.

2.3.1.2. Version Ordering

The tuple version records of a logical tuple form a tuple version chain. Its organization as a singly linked list enables lock-free modifications. As a result, there is one entry point to the tuple version chain. Every tuple version record is accessible by successively processing the tuple version chain, beginning at the entry point. The internal ordering of this singly linked list can be organized from old-to-new or new-to-old (depicted in Figure 2.7). The former organization enables a static reference to the entry point. A predecessor has to know its successor, wherefore the references in the predecessor must be modified. Adding a new tuple version record requires processing the whole tuple version chain, beginning from the entry point, and appending the record to the end of the list. The latter organization model, namely new-to-old, prepends newly inserted tuple version records to the entry point of the tuple version chain, whereby the new record becomes the new entry point. The new record references the old entry point, which requires no modification.

Figure 2.7.: Version Ordering Alternatives [RVGP20]: new-to-old vs. old-to-new references between newer and older versions.

Treating mixed workload characteristics like HTAP, the two version ordering schemes favor different types of queries. The old-to-new ordering scheme tends to favor long-lasting OLAP queries, precisely because the tuple version records visible to their transaction snapshots are located near the entry point. Conversely, recently inserted tuple version records are accessed much faster in a new-to-old ordering scheme.
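A minimal sketch of the two maintenance paths (reusing a stripped-down version of the hypothetical VersionRecord type from above): prepending under new-to-old ordering touches no existing record, whereas appending under old-to-new ordering walks the chain and modifies the current tail in-place. Function names are illustrative assumptions.

```cpp
struct VersionRecord {                      // payload and timestamps omitted for brevity
    VersionRecord* next = nullptr;          // singly linked neighbor in the version chain
};

// New-to-old ordering: the freshly created version becomes the new (rolling) entry
// point; the previously newest record is referenced but not modified.
VersionRecord* prepend_new_to_old(VersionRecord* entry, VersionRecord* fresh) {
    fresh->next = entry;
    return fresh;
}

// Old-to-new ordering: the entry point stays static, but the whole chain is walked
// and the current tail is updated in place to reference the new version.
VersionRecord* append_old_to_new(VersionRecord* entry, VersionRecord* fresh) {
    if (entry == nullptr) return fresh;
    VersionRecord* tail = entry;
    while (tail->next != nullptr) tail = tail->next;
    tail->next = fresh;                     // in-place modification of the predecessor
    return entry;
}
```

The in-place modification in the old-to-new variant is exactly what causes additional write amplification on Flash, as discussed next.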
Considering the characteristics of modern Flash storage (outlined in Section 2.1.2), the new-to-old ordering scheme has a clear edge due to its version chain maintenance: predecessor version records need not update any linkage information, which contributes to reducing WA. The desirable append-only behavior results from combining out-of-place, physically materialized tuple version records (compare Section 2.3.1.1) with the outlined new-to-old ordering. Other combinations require in-place updates.

2.3.1.3. Version Invalidation Model

A tuple version is said to be invalidated whenever a successor version record exists. Two different invalidation models are considered [Got16] (Figure 2.8). Two-point invalidation is a widespread model, in which the creation timestamp of the successor version is also placed as the invalidation timestamp on the predecessor. By evaluating a single tuple version record, its visibility to a transaction snapshot can be determined.

Figure 2.8.: Version Invalidation Alternatives [RVGP20]: two-point vs. one-point invalidation.

On the other hand, the existence of a successor version itself implicitly invalidates its predecessor. One-point invalidation [GPHB17] leverages this by maintaining only the creation timestamp at each tuple version record and invalidating implicitly. The major benefit of this technique is the avoidance of in-place invalidation – which is effectively a modification – at the predecessor tuple version records. However, evaluating the visibility of a tuple version record for a transaction snapshot may require successor information. One-point invalidation matches the previously outlined new-to-old ordering scheme well, since all successors and their creation timestamps are processed as part of the tuple version record discovery anyway.

With regard to the characteristics of modern Flash storage (outlined in Section 2.1.2), only one-point invalidation is able to avoid in-place updates and therefore enables the beneficial append-only behavior with low WA. The required successor versions are processed anyway in the case of the new-to-old ordering scheme (compare Section 2.3.1.2), wherefore this is not considered a drawback.

2.3.1.4. Garbage Collection

Modifications to a tuple result in the creation of new tuple versions. Tuple version records become obsolete if they are no longer visible to any of the active transaction snapshots. Obsolete tuple version records require garbage collection (GC) for space reclamation and probable performance gains. However, GC reduces concurrency as some form of locking is required, causes performance spikes as it interferes with foreground I/O, and increases WA on secondary storage devices. GC can be performed on the transaction, tuple and index levels [LLS13; LSP+16; WAL+17; WPL+18]. In order to keep GC costs low, operations have to be worth the effort and should not break with the append-only behavior and the principle of minimizing WA (compare Section 2.1.2). Furthermore, independence of the access structures – i.e. base tables, additional access paths and potential helper structures – reduces locking effort and bouncing across structures.
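As a simple illustration of when a record becomes reclaimable, the following sketch computes a global GC horizon (the oldest snapshot timestamp among the active transactions) and tests a version record against it. It assumes the two-point invalidation model for brevity and is a conceptual sketch only; real systems track considerably more state (transaction status, index references, per-level GC granularity).

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

using Timestamp = std::uint64_t;

// Oldest snapshot timestamp among all active transactions; versions invalidated
// before this horizon can no longer be seen by any snapshot.
Timestamp gc_horizon(const std::vector<Timestamp>& active_snapshots) {
    if (active_snapshots.empty())
        return std::numeric_limits<Timestamp>::max();
    return *std::min_element(active_snapshots.begin(), active_snapshots.end());
}

// A version record is obsolete once its invalidation timestamp lies before the horizon.
bool is_obsolete(const std::optional<Timestamp>& t_invalidation, Timestamp horizon) {
    return t_invalidation.has_value() && *t_invalidation < horizon;
}
```

Whether reclaiming such records is worth it then depends on how many obsolete records accumulate per segment and on the write amplification the reorganization itself causes.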
2.3.1.5. Discussion

Different possible design decisions in MVCC have been introduced and theoretically evaluated with respect to the characteristics of modern secondary storage. As depicted in Table 2.1, the MVCC designs can be combined into an optimal set for base table storage on Flash. An append-only, new-to-old ordered storage manager with one-point invalidation for base tables is Snapshot Isolation Append Storage (SIAS) [Got16; GPHB17] (outlined in Section 3.1.5.1). In a write-heavy transactional workload, SIAS exhibits 97% less write I/O and a 30% improved throughput compared to its baseline, MVCC with SI in PostgreSQL [GPHB17]. However, additional access paths become more complex, as outlined in Section 2.4.

Table 2.1.: Optimal MVCC design for the characteristics of Flash storage devices – ideal for storage, but appropriate for additional access paths?
Version Storage: physically materialized storage; out-of-place insertion of new tuple version records.
Version Ordering: new-to-old version chain linkage; rolling entry point of the version chain.
Invalidation Model: one-point invalidation; a successor version indicates invalidation (with its timestamp).
Garbage Collection: space reclamation / performance improvement; large segments of obsolete records.

2.3.2. Estimated Base Table Cost Model

In the previous section a strict append-only heap organization was outlined, which leverages the characteristics of Flash (compare Section 2.1.2). The major benefits of this base table organization are the minimal insert, update and delete costs ($i_{SIAS}$) and the resulting write amplification $WA_{SIAS}$ per page of strict append-only storage:

$WA_{SIAS} = 1$ (2.3)

Since records are invalidated out-of-place, all modifying operations have processing costs of $i_{SIAS} = O(1)$, practically no read I/O (except for potential entry point identification) and optimal write I/O costs ($W_{I/O}$) per operation ($i^{I/O}_{SIAS}$):

$i^{I/O}_{SIAS} = W_{I/O} \times \frac{B_{N+R}}{N+R}$ (2.4)

whereby $N$ comprises the set of logical tuples, $R$ the totality of logical replacements and $B_{N+R}$ the number of pages (buffers) required to store all version records ($N+R$).

Search costs ($s_{SIAS}$) in heaps, however, are much more expensive. Generally, average search costs are stated as $s = O(N/2)$, since the required record might be located anywhere in the dataset. However, version records are invalidated out-of-place, so that in principle the entire dataset must be scanned to identify the required tuple version record, hence $s_{SIAS} = O(N+R)$, and the resulting read I/O costs ($R_{I/O}$) per search as well as scan ($s^{I/O}_{SIAS}$) are defined as follows:

$s^{I/O}_{SIAS} = R_{I/O} \times B_{N+R} \times (1 - p_{Bc})$ , $p_{Bc} \simeq 0$ (2.5)

Due to the assumed data-intensive workloads and the trends towards resource and memory efficiency (compare Sections 2.1, 2.2), the cache probability is assumed to be $p_{Bc} \simeq 0$, whereby each required page of $B_{N+R}$ causes read I/O $R_{I/O}$. The cache probability per iterated tuple version (of the $N+R$ versions) is probably better in sequential processing and depends on the number of version records per page, $\frac{N+R}{B_{N+R}}$.

2.4. Additional Access Paths in Multi-Version DBMS

DBMS objects organize data in tuple records and arrange them for the optimal support of one or a few operations [RG03]. Different organization models, e.g. heap, sorted or hashed, are favorable for different sets of operations, due to their different cost models. A preferable base table organization for Flash storage characteristics might be a strict append-only heap.
Thus, insertion operations in the base table heap are very cheap – tuple records are collected in main-memory pages and evicted only once (as applied in SIAS [Got16]) – however, equality and range search operations become very expensive (compare Section 2.3.2). Equality search costs grow to a scan of the whole multi-versioned dataset, as its size increases with the number of version records and a visibility check becomes necessary. A fully attribute-key-sorted organization model would probably improve the performance of equality and range search operations, but it could also break the desired append-only behavior. Furthermore, any equality or range search on a different attribute of the tuple results in a scan of the whole dataset.

DBMS provide the option to create additional access paths (alias indexes) on the base table primary data storage. The applications of these auxiliary structures are manifold, e.g. support of uniqueness constraint checks, primary keys and foreign keys, or potentially fast look-ups on any indexed set of attributes (secondary index). Index structures apply a search-optimized hashed or sorted organization model; however, modifications tend to become expensive, due to maintenance costs and the resulting unfavorable random write I/O pattern on secondary storage devices. Sorted indexes allow equality and range search with low RA¹ and avoid expensive downstream sort operations – unlike hash indexes, which allow only equality search, wherefore they are not suitable for HTAP scans and are not in the scope of this thesis.

Index records regularly consist of a pair of search key value(s) and one data tuple reference² [RG03]; they are version-oblivious, due to the missing transaction timestamp information (compare Figure 2.9). The compact representation allows cheap search and maintenance; however, there is no guarantee of improved DBMS throughput, as any applied index requires maintenance. Administrators must evaluate the benefit of an index. Based on the query attributes and statistics like data selectivity, the optimizer of the query evaluation engine (compare the DBMS architecture in Figure 2.4) can decide to use the additional access path via the index. Once the index record is located, the location of its referenced data tuple in the base table is also known and can be accessed in a following step.

Figure 2.9.: Version / Index Record Referencing [RVGP20]: physically materialized version records referenced either physically by record id or logically via an indirection layer of virtual tuple identifiers (VIDs).

A very classical way of data tuple referencing in an index record is a physical reference by record id. In the context of MVCC, several tuple version records of one logical data tuple exist. In principle, each tuple version record needs to be indexed, albeit it is sufficient for reliability to locate each matching tuple version record via the version chain.

¹ Compare the linear heap table search costs of Equation 2.5 in Section 2.3.2 with the exemplary logarithmic B+-Tree costs of Equation 3.4 in Section 3.1.1.
² Reference organization variants, e.g. lists of data tuple references, are not considered.
A rolling entry point in the new-to-old ordered version chain (as listed in Table 2.1) requires an updated physical reference for each modification – i.e. whenever successor versions are appended, records are physically moved or garbage collection is performed. Modifications to indexed attribute values become even more complex, as a new index record is inserted and the entry point reference of the predecessor index record has to be fixed. In order to avoid expensive index maintenance operations, some DBMS [DFI+13; Pos21] sacrifice the base table characteristics from Table 2.1 and employ an old-to-new version ordering with in-place updates, in which case the entry point remains stable.

A second way is to implement an indirection layer between index and base table. Each data tuple record is augmented with a unique tuple identifier (Virtual Tuple Identifier – VID), which is stored in the reference pointer of the index record. An index operation resolves the VID by means of a mapping table in order to locate the physical entry point of a data tuple (as depicted in Figure 2.9). An indirection layer is able to reduce index maintenance costs, but it requires additional processing and recovery-intensive main-memory data structures. With respect to the decentralized computing models of modern hardware, it is challenging to maintain an indirection layer. Furthermore, in a key-sorted structure, modifications to an indexed attribute are not handled by the indirection layer and require an index update – resulting in expensive maintenance and random write I/O patterns.

With current indexing techniques, there is no convenient way to simultaneously obtain a beneficial append-only sequential write pattern for index structures and base table data (as listed in Table 2.1) on secondary storage devices.

2.4.1. Version-oblivious Index Equality and Range Search

Indexes in modern multi-version DBMS are version-oblivious in the sense that transaction timestamps are missing (depicted in Figure 2.9). This lack of information has serious consequences and can negate the effectiveness of additional access paths, due to lowered selectivity and read amplification (RA) for visibility checking. With regard to Figure 2.10, a small scenario is given.

Figure 2.10.: Index search in a Multi-Version DBMS and the resulting Read Amplification (RA) on Base Tables [RVGP20]. A: logical view on table R; B: physical storage of the version records on heap pages; C: B+-Tree index idx on attribute a; D: long-running transaction $TX_R$ executing SELECT COUNT(*) FROM R WHERE a <= 10 (costs: index scan + 4 table pages / random I/Os, result: 1); E: version visibility check.

In Figure 2.10-A (built upon the example in Figure 2.5), a logical view on a base table R is given.
Attribute a of tuple t is modified several times by different transactions ($TX_{u0}$ – $TX_{u3}$), resulting in the physically materialized version records t.v0 – t.v3 with the attribute values 7, 3, 1 and 9. An additional tuple y is inserted to clarify the physically materialized storage. Figure 2.10-B depicts an unclustered tuple version storage on random base table pages of a heap structure, since tuple versions are independent record entities. Previously, a key-sorted B+-Tree [BM70] secondary index (outlined in Section 3.1.1) was built upon the attribute a (Figure 2.10-C). Its index records reference the tuple version records in the unclustered heap. Figure 2.10-D depicts a long-lasting analytical HTAP transaction $TX_R$, which started before $TX_{u1}$, $TX_{u2}$ and $TX_{u3}$. The execution planner and optimizer of the DBMS query evaluation engine decide to use the index from Figure 2.10-C. In order to find all tuple version records which are visible to $TX_R$, the index is traversed and the data tuple reference is gathered for each matching index record. The referenced tuple version records are fetched from their unclustered physical storage locations (Figure 2.10-B) and returned to the MVCC visibility check in Figure 2.10-E. The visibility check forces massive read amplification (RA) in order to gather the tuple version record timestamps required for the snapshot calculations.

2.5. Improving Index Search Operation in Multi-Version DBMS

Optimal MVCC designs for base tables (as listed in Table 2.1) efficiently leverage the characteristics of secondary storage devices, especially for the massive data amounts generated by modern workloads and relatively sparse main memory. DBMS apply several additional access paths for various purposes. Indexes require frequent maintenance for correctness; however, traditional approaches neglect the ideal storage characteristics and become ineffective. This behavior is based on the following indications:

Expensive strictly Key-Sorted Organization. While this index record organization enables fast search operations, its maintenance results in random write I/O patterns and high write amplification (WA).

Reference to Tuple Version Records. Rolling entry points to version chains amplify the maintenance costs.

Reduced Selectivity. Downstream visibility checks cascade into high read amplification (RA), as tuple version records are located on multiple base table pages.

Figure 2.11.: Correlation of Version Chain Length and Response Times (response time in ms over the number of successor version records).

Figure 2.12.: Index-Only Visibility Checks (A) avoid the expensive RA of principally unnecessary base table page fetches (B).

Additional access paths should allow a key-sorted structure to leverage the characteristics of modern storage devices, whereby a sequential write pattern with low WA must be enabled.
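To make the contrast of Figure 2.12 concrete, the following C++ sketch compares a version-oblivious index scan, which has to fetch a base table page for every candidate before its visibility can be checked, with an index-only variant whose records carry timestamps. All types and interfaces (IndexRecord, BaseTable, ...) are hypothetical assumptions for illustration and do not describe the concrete design proposed later in this thesis.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using Timestamp = std::uint64_t;
using RecordId  = std::uint64_t;

struct Snapshot {
    Timestamp ts;
    bool sees(Timestamp t) const { return t <= ts; }
};

// Version-oblivious index record: search key and base table reference only (Figure 2.9).
struct IndexRecord { std::int64_t key; RecordId rid; };

// Hypothetical version-aware index record that additionally carries timestamps.
struct VersionedIndexRecord {
    std::int64_t key; RecordId rid;
    Timestamp t_creation;
    std::optional<Timestamp> t_invalidation;
};

// Hypothetical base table: fetching a version record implies a (random) page read.
struct BaseTable {
    struct Version { Timestamp t_creation; std::optional<Timestamp> t_invalidation; };
    Version fetch(RecordId) const { return Version{0, std::nullopt}; } // placeholder for a heap page read
};

// Path B in Figure 2.12: every index candidate forces a base table fetch before the
// visibility check can be performed, which is the source of the read amplification.
std::vector<RecordId> scan_version_oblivious(const std::vector<IndexRecord>& candidates,
                                             const BaseTable& table, const Snapshot& snap) {
    std::vector<RecordId> result;
    for (const auto& c : candidates) {
        BaseTable::Version v = table.fetch(c.rid);              // expensive random I/O
        if (snap.sees(v.t_creation) &&
            (!v.t_invalidation || !snap.sees(*v.t_invalidation)))
            result.push_back(c.rid);
    }
    return result;
}

// Path A in Figure 2.12: the index-only visibility check decides from the index
// records alone; only actually visible versions would ever be fetched afterwards.
std::vector<RecordId> scan_index_only(const std::vector<VersionedIndexRecord>& candidates,
                                      const Snapshot& snap) {
    std::vector<RecordId> result;
    for (const auto& c : candidates)
        if (snap.sees(c.t_creation) &&
            (!c.t_invalidation || !snap.sees(*c.t_invalidation)))
            result.push_back(c.rid);
    return result;
}
```

The first variant pays one random base table read per candidate (the RA of Figure 2.10), whereas the second fetches only versions that are actually visible.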
Once an indexing structure fulfills these requirements, it is powerful enough to maintain the amplified emergence of tuple version records without the need for indirection layers, which break the optimal MVCC design characteristics or are not suitable for decentralized computing models. Since every tuple version record is physically referenced in such a powerful index structure, the only remaining reason for processing version chains in base tables is the visibility check – which causes massive RA due to sequential processing along large tuple records for the relatively small required timestamp information. Figure 2.11 depicts the linear relationship between version chain length and response time when using a B+-Tree index in an HTAP scenario. In contrast, an increased RA in the indexing structure is negligible, as the required information is space-efficiently packed and the asymmetry of storage devices allows fast and parallel read performance. As part of the search operation, a version-aware index structure immediately determines the matching tuple versions which are visible to a transaction snapshot by a highly selective index-only visibility check (as depicted in Figure 2.12).

Chapter 3

Related Work

In Chapter 2, the relevance of high-performance storage and index management structures was exposed. The workload characteristics of modern applications lead to high transactional pressure and rising data volumes in DBMS. MVCC enables high data quality in HTAP query processing. Furthermore, MVCC designs (as listed in Table 2.1) leverage the storage characteristics of Flash-based devices. However, index structures for additional access paths must withstand the amplified maintenance effort, as it is desirable to index any potentially visible tuple version. In order to obtain a minimal read amplification (RA) on large and randomly spread records in base tables, an ideal index structure cheaply determines the tuple version record which is visible to a transaction snapshot – i.e. the RA on base tables is potentially reduced by up to a tuple's version chain length at similar index search effort.

In this chapter, a brief overview of the ubiquitous state-of-the-art B+-Tree indexing structure is given. Furthermore, current approaches are introduced which partially provide the required capabilities. In addition, an overview of the designated DBMS and K/V-Store storage structures Snapshot Isolation Append Storage (SIAS) [Got16; GPHB17] and WiredTiger [Mon21] is given. It is shown that the structure of Partitioned B-Trees exhibits potentially the best properties to fulfill all requirements for indexing massive amounts of tuple version records on modern hardware.

3.1. Storage and auxiliary Index Management Structures

Storage and index management structures allocate information in records, which are located in pages / buffers (compare [LHKN18]). These levels of abstraction unify block-based access to secondary storage media via buffer managers, among other reasons. The levels of abstraction are the same for storage and index management structures; the information contained in the records differs. Whereas storage management structures in this thesis are meant for the straight representation of tuple versions in records, index management structures provide an efficient additional access path to the desired tuple version records in the storage structures. Hence, the major differentiation between both structures lies in their application.
Most key-sorted structures are based on the ubiquitous B+-Tree, which is applied as the state-of-the-art indexing structure in many DBMS. Hence, a brief overview is given in the following.

3.1.1. Ubiquitous B+-Tree

B+-Trees are well known and the most common indexing structure in DBMS. Several structures involve aspects of them or are fully built upon them, including the basic structure applied in this thesis (Partitioned BTrees [Gra03], compare Section 3.1.4), wherefore a brief overview is given. They enable several modification operations on data, i.e. insert, update and delete, as well as efficient data retrieval by equality and range search, due to a preserved native sort order in a balanced tree structure, with costs based on the logarithmic height resulting from high fan-out index nodes [BM70; Gra11].

B+-Trees represent a balanced tree structure, meaning that each traversal operation passes through the same number of nodes, i.e. they ensure robust performance by the maximum height h of the tree. The size of a node can vary; generally it is a multiple of the applied transfer and storage size. The structure consists of different types of nodes. With regard to Figure 3.1, inner nodes (C), including the root (A), contain reference pointers (B) which indicate their child nodes (e.g. C is a child of A and D is a child of C). The actual payload is located in the leaf nodes (D)¹. Nodes possibly contain sibling pointers, whereby each level in the tree is comparable to a doubly linked list; however, this is not a requirement (e.g. the leaves D maintain sibling pointers, the inner nodes C do not).

Figure 3.1.: Schematic representation of a B+-Tree (compare [BM70; Gra11]).

The structure of each node is very similar, as depicted in Figure 3.1 (E). Nodes typically maintain a set of metadata, e.g. size, number of records, reference counters; perhaps sibling pointers (at the left and right end of the node); and some kind of alphanumerically sorted records (F). Nodes contain some blank space (G), which typically amounts to up to half of their capacity. Since records are sorted within a node, binary search enables efficient look-up. In inner nodes, records exhibit separator key characteristics: the child node of a separator key contains only records with keys equal to or greater than the separator itself, but smaller than the next separator key in the node. Leaves contain tuple or index records of the pattern {key, value}², depending on their application. A key consists of one attribute or a set of attributes (e.g. key k = {attr1, attr2, ..., attrn}), which are alphanumerically sorted with successive relevance. Therefore, a B+-Tree is able to efficiently search for a set of search key values of the pattern {attr1, attr2} or {attr1}, though not for {attr2} alone. If its application is data tuple storage, the value contains the set of attributes which are not part of the sort order; otherwise, in the case of indexing, the value references the tuple record id (or another reference type via an indirection layer, as outlined in Section 2.4) of the indexed base table tuple record.

¹ [BM70] considered the basic B-Tree to locate payload records also in inner nodes; however, nowadays it is common practice to locate all of them in leaf nodes [Com79]. B-Tree and B+-Tree became synonymous terms.
² Other record value organization variants, e.g. {key, value-list}, are not considered.
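The following short C++ sketch illustrates the look-up within a single sorted node with a composite key of two attributes; as noted above, a key prefix such as {attr1} can be searched by padding the remaining attributes with their minimum values, whereas {attr2} alone cannot. The record layout and names are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Index record of the pattern {key, value}: a composite key, alphanumerically sorted
// with successive relevance, and a reference to the indexed base table record.
struct IndexRecord {
    std::tuple<std::string, std::int64_t> key;   // e.g. {attr1, attr2}
    std::uint64_t record_id;                     // value: physical reference
};

// Binary search inside a sorted node for the first record whose key is >= search_key.
const IndexRecord* lower_bound_in_node(const std::vector<IndexRecord>& node,
                                       const std::tuple<std::string, std::int64_t>& search_key) {
    auto it = std::lower_bound(node.begin(), node.end(), search_key,
                               [](const IndexRecord& r,
                                  const std::tuple<std::string, std::int64_t>& k) {
                                   return r.key < k;
                               });
    return it == node.end() ? nullptr : &*it;
}
```

The same primitive is used level by level during traversal to select child pointers in inner nodes and to position equality and range searches in leaves.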
In order to provide a cost model, the following B+-Tree base operations are considered:

Equality Search. The B+-Tree is traversed from its only entry point (the root, A) to a leaf node (D). Thereby, the path follows the reference pointers level by level. Child nodes are selected by binary searching the search key in the parent node. An exact match of an index record key in a leaf node indicates a positive result of the equality search and the record value is returned. There can be quite a few index records matching an equality search key.

Range Search. A search interval is defined by a lower and an upper key bound. The lower search key is processed as defined for the equality search operation. Index records are then processed within the sorted set of a leaf node and across several leaves by following the sibling pointers, as long as the index record keys match the search interval. Values are processed and returned in the sort order of the tree.

Sorted Insert. An equality search operation is performed with the key of the insert record. Additional functionality like uniqueness constraints is cheaply examined as part of this equality search. The insert record is placed in the designated sorted leaf node. Thereby, existing index records are relocated to preserve the sort order, for which the blank space (G) is utilized. Overflowing nodes require additional maintenance – i.e. the node is split. Half of the records are moved to the new node, and the new node is propagated to the parent node, which can in turn cause an overflow in the parent node. Modifications have to be protected from side effects, e.g. by applying page locks.

Value Update. Updates to index records which only affect the value are performed in-place. Therefore, an equality search operation is performed and the value of the exactly matching index record is changed. Side effects have to be considered and avoided by locking or indirection layers, since inconsistencies must be prevented. Updates to search key values result in a costly multiphase deletion of the old record and an insertion of the updated index record.

Record Delete. An equality search operation is performed with the key of the record to delete. The exactly matching index record is removed from its location in the leaf node. Existing index records are relocated in order to preserve the sort order. If the blank space (G) exceeds a certain threshold, e.g. half of the node size, nodes underflow and are merged and rebalanced. The correctness of the separator keys in the parent nodes must be guaranteed. Modifications have to be protected from side effects.

With focus on I/O latencies, the costs of the considered base operations are derived from the following model.

Cost Model. Basic operations in B+-Trees are subject to a logarithmic complexity, as the traversal costs depend on the maximum height $h_{BT}$. The height of the tree depends on the average fan-out $F$ of the inner nodes and the number of required leaf nodes $L_N$, which store the $N$ index records. The number of required nodes is affected by the fill factor ($f_i$: fill factor of inner nodes, $f_l$: fill factor of leaves), which is the ratio between used and blank space. Hence, for B+-Trees exceeding the capacity of one leaf node it holds that (compare [Gra11]):

$h_{BT} = \lceil \log_{F \times f_i} \frac{L_N}{f_l} \rceil + 1$ , $F \neq 1$ , $L_N > 1$ , $f_i, f_l \in [0.5; 1]$ (3.1)

A common cost model for equality search in B+-Trees is defined as follows:

$s_{BT} \approx \log_2 N$ (3.2)

This cost model focuses on processing costs, assuming binary search.
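As a purely illustrative instantiation of Equations 3.1 and 3.2, the following snippet evaluates the tree height and the simple search cost for an assumed configuration; the fan-out, fill factors and record counts are arbitrary example values, not measurements from this thesis.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Assumed example parameters (not measured values).
    const double F  = 200.0;    // average fan-out of inner nodes
    const double fi = 0.67;     // fill factor of inner nodes
    const double fl = 0.67;     // fill factor of leaves
    const double N  = 1e8;      // number of index records
    const double leaf_capacity = 100.0;   // records per completely filled leaf node

    // Leaf nodes required at the given leaf fill factor.
    const double LN = std::ceil(N / (leaf_capacity * fl));

    // Equation 3.1: height of the tree (logarithm evaluated via change of base).
    const double hBT = std::ceil(std::log(LN / fl) / std::log(F * fi)) + 1.0;

    // Equation 3.2: simple binary-search processing cost over all records.
    const double sBT = std::log2(N);

    std::printf("leaf nodes = %.0f, height = %.0f, ~%.1f key comparisons per search\n",
                LN, hBT, sBT);
    return 0;
}
```

For these example values the tree has roughly 1.5 million leaves, a height of 4, and an equality search costs about 27 key comparisons – the point being that the height, and thus the I/O cost derived next, grows only logarithmically with N.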
Considering $h_{BT}$ of Equation 3.1, this model is expanded to the binary search costs in the inner nodes and one leaf node (compare [Gra11]):

$s_{BT} = \lceil \log_{F \times f_i} \frac{L_N}{f_l} \rceil \times \log_2 (F \times f_i) + \log_2 \left( \frac{N}{L_N \times f_l} \right)$ (3.3)

With reference to Equation 2.1, the major costs stem from data access on secondary storage devices. Replacing the negligible binary search costs $\log_2\{records\}$ with the I/O costs for reading a node, the cost model becomes a function of the read I/O $R_{I/O}$ and the height $h_{BT}$ of a B+-Tree. Considering the caching probabilities for inner nodes $p_{ic}$ and leaves $p_{lc}$, the read I/O can be reduced:

$s^{I/O}_{BT} \approx R_{I/O} \times \left( \lceil \log_{F \times f_i} \frac{L_N}{f_l} \rceil \times (1 - p_{ic}) + (1 - p_{lc}) \right)$ , $p_{ic} \gg p_{lc}$ (3.4)

It is more likely for inner nodes to be located in the main-memory cache, as buffer managers prefer frequently used nodes and inner nodes are accessed frequently as part of the traversal operation. Blank space (G) effectively reduces the tree fan-out, due to the decreasing fill factor $f_i$, and increases the number of necessary leaf nodes $L_N$ by the average fill factor – and thereby influences the space amplification and search performance, since less data is cached per node and $p_{ic}$ shrinks. However, blank space is beneficial for inserts, since maintenance operations such as node splits are reduced. Any modified node $MP_{mod}$ is asynchronously evicted and written from main memory to the secondary storage device at some point in time. Thereby, an insertion causes at least one write I/O $W_{I/O}$¹; however, due to escalating node splits, up to $2 \times h + 1$ modified nodes are possible, but unlikely.

$i^{I/O}_{BT} \approx s^{I/O}_{BT} + W_{I/O} \times MP_{mod}$ , $MP_{mod} \in \{1, 3, \ldots, 2 \times h + 1\}$ (3.5)

¹ The probability of multiple modifications per node is ignored, since the low main-memory buffer cache ratio and the massive amounts of data make this case very unlikely – except for serial insertions.

Deletions behave comparably to insertions, as underflowing nodes cause merges. In-place value updates do not cause maintenance operations, thus $MP_{mod}$ equals 1. Range search costs are equal to search costs, but require successive reads of leaves along the sibling pointers. Assuming massive amounts of data and high-rate continuous insertion workloads, it is most likely that only one modification (the '-1' in Equation 3.6) is buffered in main memory until a leaf node is evicted. The vast majority of a node's records ($\frac{N}{L_N \times f_l}$) is unchanged, but is rewritten nonetheless. Hence, the average write amplification WA of a modified node is:

$WA \approx \frac{N}{L_N \times f_l} - 1$ (3.6)

Suitability to Flash Characteristics. Basic B+-Trees are not capable of leveraging the characteristics of modern storage hardware. First, in order to keep the dataset sorted, B+-Trees perform updates in-place. Randomly modified nodes are asynchronously evicted from buffer management, whereby changes are persisted by rewriting the whole node; on Flash, pages thereby become randomly invalidated and are rewritten out-of-place. Second, modifications are manifold, as reference and sibling pointers, node splits and merges as well as the payload result in altered nodes. Third, marginal modifications require writing one or several whole nodes – resulting in very high WA. Fourth, average fill factors of approximately 0.67 [RG03] to 0.7 [Gra11] are common, whereby WA is amplified further. Last, even though the successive logarithmic search effort is very cheap, parallelism is not leveraged.

Multi-Version Capabilities. Multi-version capabilities require transaction timestamps for performing visibility checks.
Therefore, a basic B+-Tree would need to maintain creation and invalidation timestamps for each index record – ultimately resulting in amplified in-place updates, increased record sizes and a reduced fan-out. Moreover, current and obsolete data become intermingled, wherefore search performance shrinks. A very common practice is therefore to return a set of candidates, whereupon visibility is checked on the tuple version records in the base tables.

3.1.2. Structures with Multi-Version Capabilities

Several searchable structures have tackled the capability of multi-version storage and index management [MTT00; ST99]. Applications might maintain different tuple versions in order to perform MVCC or to provide constant snapshots for time travel querying. Query predicates and snapshot information constitute different dimensions in the search key attribute values, the transaction timestamps as well as possibly additional application-defined temporal dimensions. Well-known representatives are the Multi-Version B-Tree [BGO+96], the Time-Split B-Tree [LS90], MV-IDX [GGH+14] or the Bi-Temporal Timeline Index [KFM+15], just to name a few. These structures differ in maintenance effort and characteristics, space amplification (SA), searchability of the raised dimensions, linkage of related version records as well as general applicability [RVP17a]. The Bi-Temporal Timeline Index [KFM+15] enables multi-version capabilities very well in a linearly growing, log-based append-only approach of events (activation and invalidation) over transaction time, which are mapped to versions. Checkpoints