Multi-version Indexing for large Datasets with high-rate continuous Insertions

A dissertation accepted by the Faculty of Computer Science, Electrical Engineering and Information Technology of the University of Stuttgart in partial fulfilment of the requirements for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by Christian Riegger from Hechingen

Main examiner: Prof. Dr.-Ing. Ilia Petrov
Co-examiner: Prof. Dr.-Ing. habil. Bernhard Mitschang
Date of the oral examination: 2 June 2023

Institute for Parallel and Distributed Systems (IPVS) of the University of Stuttgart, 2023

Acknowledgments

Foremost, I would like to express my sincere gratitude to my supervisors Prof. Dr.-Ing. Ilia Petrov and Prof. Dr.-Ing. habil. Bernhard Mitschang. Ilia triggered my research interest in algorithms and data structures during my master's degree. He made the completion of this thesis possible through his support, expertise, encouragement and empathy for several personal misfortunes during that time. Bernhard Mitschang supported this work with input from a different perspective, which contributed to its improvement and comprehensibility. I gratefully acknowledge the cooperation and scientific exchange with my colleagues at the Data Management Lab of Reutlingen University, Tobias Vinçon, Arthur Bernhardt and Christian Knödler, which yielded interesting research, novel approaches and collaborative publications. I would like to thank my family and close friends for bearing with me all this time. This research was partially funded by the Ministry of Science of Baden-Württemberg, Germany, for the Doctoral Program 'Services Computing'.

Contents

List of Abbreviations
Zusammenfassung
Abstract
1. Introduction
1.1. Motivation
1.2. Problem Statement
1.3. Contributions
1.4. Scientific Publications
1.5. Structure of this Thesis
2. Technical Background
2.1. Characteristics of modern Hardware Technologies
2.1.1. Highly parallelized and decentralized processing units
2.1.2. Primary and Secondary Storage Characteristics
2.1.3. Testbed Hardware Metrics and Operating System
2.2. Modern Workload Properties
2.2.1. Application of standardized Database Benchmarks
2.3. Fundamental Design Decisions in Multi-Version Database Management Systems
2.3.1. Opportunities and Version Management in Multi-Version Concurrency Control
2.3.2. Estimated Base Table Cost Model
2.4. Additional Access Paths in Multi-Version DBMS
2.4.1. Version-oblivious Index Equality and Range Search
2.5. Improving Index Search Operation in Multi-Version DBMS
3. Related Work
3.1. Storage and auxiliary Index Management Structures
3.1.1. Ubiquitous B+-Tree
3.1.2. Structures with Multi-Version Capabilities
3.1.3. Structures leveraging Modern Secondary Storage Hardware Characteristics
3.1.4. Applied Basic Structure: Partitioned BTree
3.1.5. Prototypical Environments
3.2. Summary and Conclusion
4. Multi-Version Partitioned BTree
4.1. Multi-Version Indexing Design Concepts
4.2. Write-Optimization with Partitioned BTrees
4.2.1. Data Structure and Components
4.2.2. Partition Switch & Sequential Write
4.2.3. Basic Operations
4.2.4. Cost Model
4.2.5. Summary
4.3. Version-Awareness with Multi-Version Partitioned BTree
4.3.1. Applying Multi-Version Record Types
4.3.2. Version-Aware Record Sizes and Compression Techniques
4.3.3. Logical Version Chain in evolving MV-PBT
4.3.4. In-Partition Version Record Ordering Convention
4.3.5. Version-Aware Basic Operations
4.3.6. Index-Only Visibility Check
4.3.7. Summary
4.4. Workload Adaptiveness and Optimizations
4.4.1. Adaptive Data Reorganization and Indexing
4.4.2. Garbage Collection
4.4.3. Application of Auxiliary Filter Structures and Data Skipping
4.4.4. Summary
4.5. Conclusion
5. Arbitrary Approximate Membership Testing
5.1. Problem Statement
5.2. Point Filter Techniques
5.3. Broadened Range Filter Techniques
5.3.1. Related Work
5.4. bloomRF: Unified Point-Range Filter for Arbitrary Approximate Membership Testing
5.4.1. Implicit Dyadic Trace Trees (DTT)
5.4.2. Prefix Hashing and Piecewise-monotone Hash Functions
5.4.3. Error Correction Policies
5.4.4. Heuristical Tuning Advisor
5.4.5. bloomRF's False Positive Rate (FPR) Model
5.4.6. Basic Operations
5.4.7. Multi Attribute and Data Type Support
5.5. Experimental Evaluation of PRF
5.6. Conclusion
6. Application and System Integration
6.1. Indexing in Relational Database Management Systems
6.1.1. Implementation Details of MV-PBT in PostgreSQL with SIAS
6.1.2. Experimental Setup
6.1.3. Evaluation and Selection of Baseline
6.1.4. Write-optimized Indexing in OLTP
6.1.5. Index-Only Visibility Checks dominate in HTAP
6.1.6. Summary
6.2. Storage Manager in Key/Value-Stores
6.2.1. Implementation Details of MV-PBT in WT
6.2.2. Implications on Secondary Indexing MV-PBT Stores
6.2.3. Experimental Setup
6.2.4. Experimental Evaluation
6.2.5. Summary
6.3. Conclusion
7. Summary and Research Opportunities
7.1. Summary of Contributions
7.2. Further Application Scenarios of MV-PBT
7.2.1. Time-Travel Query Processing
7.2.2. Blockchain Storage Structure in Enterprise Systems
7.2.3. Accelerators and Complex Memory Hierarchies
Bibliography
A. Appendix
A.1. Complex Memory Hierarchies
A.2. Technical Background: Flash in SSD
A.3. Example calculation of single B+-Tree Cache Efficiency

List of Abbreviations

Table 1: Common abbreviations used in this thesis.

B+-Tree – Balancing Tree Structure with Records in Leaves [BM70; Com79] (compare Section 3.1.1)
BF – Bloom Filter [Blo70] (compare Section 5.2)
bloomRF – Bloom Range Filter [MRBP23; RBMP20]
BPK – Bits per Key
Bw-Tree – Latch-Free Storage Management Structure based on the B-Tree [LLS13; WPL+18]
CH-Benchmark – HTAP Benchmark based on TPC-C and TPC-H [CFG+11]
CP – Cached Partition
CPU – Central Processing Unit
DBMS – (Relational) (Multi-Version) Database Management System
DI – Dyadic Interval
DTT – Dyadic Trace Tree
ETL – Extraction, Transformation and Load Process
FPR – False Positive Rate
GC – Garbage Collection
HDD – Hard Disk Drive
HOT – Heap-Only Tuple (Standard Base Table Organization in PostgreSQL [Pos21])
HTAP – Hybrid Transactional and Analytical Processing
I/O – Input-/Output-Operations (in the context of reads or writes to secondary storage)
IOPS – Input-/Output-Operations (I/O) per Second
K/V-Store – Key-/Value-Store
LBA – Logical Block Address
LSM – Log-Structured Merge Tree [OCGO96; SR12] (compare Section 3.1.3)
MV-PBT – Multi-Version Partitioned BTree [RP22; RVGP20]
MVCC – Multi-Version Concurrency Control
NO-tpmC – New-Order Transactions per Minute of the TPC-C Workload [TPC10]
NVM – Non-Volatile Memory
OLAP – Online Analytical Processing
OLTP – Online Transaction Processing
Ops/s – Operations per Second
PBA – Physical Block Address
PBF – Prefix Bloom Filter
PBT – Partitioned BTree [RVP17b; RVP19] (based on [Gra03])
PMHF – Piecewise-Monotone Hash Function
Pnr – Partition Number
PRF – Point-Range Filter
RI/O – Read I/O Costs
RA – Read Amplification
RAM – Random Access Memory (Primary Storage / Main Memory)
Rosetta – Robust Space-Time Optimized Range Filter for Key-Value Stores [LCK+20]
RQ – Research Question
SA – Space Amplification
SI – Snapshot Isolation
SIAS – Snapshot Isolation Append Storage [Got16] (compare Section 3.1.5.1)
SSD – Solid State Drive
SuRF – Succinct Range Filter [ZLL+18]
TPC – Transaction Processing Performance Council [TPC22]
TPC-C – Transaction Processing Performance Council Benchmark C [TPC10]
TPC-H – Transaction Processing Performance Council Benchmark H [TPC21]
tp{m,s} – Transactions per Minute, Second
TS – Timestamp
TT – Trace Tree
TX_TS – MVCC Transaction with Timestamp TS
VID – Virtual Identifier
WI/O – Write I/O Costs
WA – Write Amplification
YCSB – Yahoo! Cloud Serving Benchmark [CST+10; RKD21]

Zusammenfassung

The properties of modern database workloads are guided by the requirements of business applications that aim for knowledge gains and a technological lead over competitors. These workloads are characterized by exponential data growth, a high share of continuous insertions and analytical processing. Flash secondary storage is an economical way to deal with large amounts of modifiable data if its characteristics are leveraged efficiently. In this context, the maintenance of physically materialized tuple versions in base tables leads to cost-efficient and scalable transactional throughput due to beneficial access patterns on secondary storage media. These beneficial properties, however, hardly apply to common version-oblivious secondary access paths. Their actual benefit is limited by maintenance as well as additional costs for search and visibility checking.

By means of empirical methods, i.e. literature studies and controlled experiments, specific properties of modern workloads, Flash-based storage technologies and state-of-the-art data management techniques in academia and industry are gathered in order to identify existing problems, research opportunities and challenges in their interaction. Based on the findings derived from this, both novel and matured techniques are considered with the goal of developing a new multi-version storage and index management structure for the properties of modern storage technologies. A prototypical implementation in well-known systems is experimentally evaluated with standardized workloads.

This thesis makes an important contribution to modern data management with Multi-Version Partitioned BTrees (MV-PBT). This comprises an append-based tuple version management for workloads with constantly high data ingestion rates and analytical processing on a shared dataset, which mainly resides on secondary storage. First, the approach improves the selectivity of secondary access paths by introducing internal visibility checks; in mixed workloads this doubles the analytical throughput and increases the transactional throughput by 14%. Second, the tuple update scheme improves index maintenance costs by appending logically chained versions, which yields a 47% higher throughput. Third, due to its write characteristics and cost-efficient search operations, MV-PBT is well suited as a storage management structure, doubling the throughput compared to the widely used LSM-Trees.
Fourth, the logical linkage of version data facilitates independent partitioning, reorganization and maintenance techniques, which enables consistent throughput, increased by multiples, for various workloads. Finally, bloomRF, as a novel, low-cost point and range filter technique, enables consistent performance for search and maintenance operations in MV-PBT.

This thesis elaborates the benefits of high-performance multi-version index management for the properties of modern workloads. Moreover, extensible design concepts based on the B+-Tree combine modern main-memory techniques with enormous amounts of data and form the basis for the integration of decentralized processing and storage technologies. MV-PBT is suitable for a broad range of applications and constitutes a holistic replacement for existing storage and index management structures.

Abstract

Trends in modern database workload properties are guided by business application needs, characterized by exponential growth of data, high-rate continuous insertions and analytical processing, aiming for knowledge gains and a leading edge over competitors. Cheap Flash-based secondary storage devices provide an economical way to deal with massive amounts of modifiable data whenever their characteristics are efficiently leveraged. It turns out that the maintenance of physically materialized tuple version records in base tables not only scales with the number of concurrent transactions but also provides beneficial access patterns to secondary storage devices, whereby asset and operational costs become manageable. However, these beneficial characteristics hardly hold for common version-oblivious additional access paths. Their actual profit is limited by excessive maintenance as well as additional search and visibility check costs.

By means of empirical methods, i.e. literature studies and controlled experiments, the characteristics of modern workloads, Flash-based storage hardware as well as state-of-the-art data management techniques in academia and industry are gathered in order to identify existing problems, research opportunities and challenges in their mutual interaction. Based on the derived findings, novel as well as matured techniques are considered to design a new kind of version-aware and hardware-leveraging storage and index management structure, which is prototypically implemented, integrated into well-known systems and experimentally evaluated by system performance benchmarks.

This thesis makes significant contributions to modern data management with Multi-Version Partitioned BTrees (MV-PBT) – an append-based multi-versioned tuple maintenance structure for high-rate continuous insertion workloads with analytical processing on a common dataset instance, which comprises massive amounts of data on secondary storage devices. First, the approach massively improves the selectivity of additional access paths by the introduced index-only visibility checks, yielding 2× increased analytical and 14% increased transactional throughput in mixed workloads. Second, the strict append-based and out-of-place replacement update scheme facilitates an improved benefit of maintained indexes, with 47% improved throughput. Third, due to its near-optimal write characteristics and cost-efficient searches, the applied approach is highly qualified as a storage management structure, with 2× increased throughput compared to the widely used LSM-Tree.
Fourth, the logical linkage of version records facilitates independent partitioning, reorganization and maintenance techniques as well as robust performance characteristics for various workload properties, with gains of up to orders of magnitude. Last, bloomRF, as a novel low-cost point-range filter technique, enables robust performance characteristics for search and maintenance operations in MV-PBT.

The contributions of this work clearly elaborate the benefits of high-performance version-aware index management for recent developments in modern workload properties. Moreover, extensible design concepts based on the ubiquitous B+-Tree combine modern in-memory techniques with massive amounts of data and form a basis for recent trends in decentralized processing and storage hardware technologies. MV-PBT exhibits a broad range of applicability and facilitates a full substitution of matured storage and index management structures.

Chapter 1

Introduction

This introductory chapter first motivates the scope of action and its significance to decision-makers in a general business context. Within this context, I share my personal experience of being involved, as a trainee, in the launch of a customer relationship management system and a business warehousing solution on top of an existing enterprise resource planning application. Building on that, the consequent challenges of this motivational context are outlined and the treated contributions are specified. A brief list of the author's publications within this research context and the structure of this thesis close the chapter.

1.1. Motivation

Recent trends in business process requirements for information management systems result in new types of workloads on their most performance-critical backbone – the database management systems (DBMS). Traditionally, DBMS workloads can be separated into different categories, such as online transactional processing (OLTP) or online analytical processing (OLAP). Their characteristic access patterns as well as their effects on data management technologies differ massively. The emerging popularity of large-scale, data-intensive, real-time analytical applications combines these characteristics in hybrid transactional and analytical processing (HTAP) workloads on the same dataset instance [HG20; MBL17; ÖTT17; RVGP20; SDA21].

An HTAP scenario brings several benefits to an enterprise in contrast to traditional data warehousing approaches, in which data is extracted from the transaction processing system, transformed into a special query-optimized form and loaded into an analytical processing system (ETL process). First, real-time data in HTAP scenarios improves and simplifies the quality as well as the output of decision-making processes in business operations [HG20; ÖTT17]. 'Greater situation awareness' and 'improved business agility' are required skills to stay competitive [Gar15]. In my personal experience, any delay in the relevance of reported data reduces user experience and increases frustration. Furthermore, analytical outputs influence decisions in operative tasks, like forecasting, pricing or production planning and controlling. Belatedly or incorrectly maintained data in an OLTP system adversely affects business operations between ETL processes or requires enormous manual adjustments and manpower. Either way, significant costs accrue to the company. Second, if a system is able to adopt the new HTAP approach, the information management infrastructure of an organization could be simplified [Gar15; MBL17].
Less complex infrastructures are cheaper to maintain and evolve. Last, global players have (almost) no operational downtime. The extraction process is likely to stress the OLTP system and to significantly shrink transactional throughput for a long period of time¹.

¹ [Sup21] mentions a general job execution time of 2-4 hours.

Alongside HTAP characteristics, upcoming technologies, markets, business cases and cloud services yield data-intensive and high-rate continuous insertion workloads. Internet of Things (IoT) platforms [Bos21; IBM21; Mic21a; Ora21b; SAP21; Ser21] handle the data of billions of IoT devices. The intention is by no means just to store massive amounts of data, but rather to gain knowledge and a distinct technological competitive edge. For instance, an automotive manufacturer could schedule service intervals based on evaluated wear and sensor data correlations gathered by cloud IoT platforms.

Quality-of-service aspects in such cloud approaches are contractually fixed in service level agreements [Koh18]. Violations of these metrics, e.g. in transactional throughput, response times, availability or consistency, are associated with noticeable monetary costs [SOSM12]. A key driver for compliance in data-intensive tasks naturally lies in the DBMS and its appropriated hardware resources. In this context, the predominant approaches are evaluated in [HSB15]. Scaling up means upgrading powerful components in single nodes, whereas scaling out allows the acquisition of more processing nodes. Both approaches, especially in combination, enable the management of a massive amount of data. Accompanying challenges in scale-out methods, like partitioning, communication, data locality and skew handling [CL16; RMU+14; Röd16; XY17; ZBS15], and in scaling up hardware [FZZ+19; KHL18; PDZ+18], are in the scope of recent proposals. Several solutions focus on main-memory optimizations [DFI+13; FML+12; KN11; Pav14], e.g. in the area of key-sorted indexing [KFM+15; LKN13; LLS13; ZCWJ21]. Essential prerequisites are large volatile main-memory capacities and powerful processing units as well as persistent non-volatile secondary storage resources. Increasing main memory (scale-up) or scaling out the problem space with new data nodes for growing dataset sizes is costly in hardware, operation and administration [LHKN18].

Focusing only on main memory leads to natural limits – in a technical as well as in an economical context [Lom18; NF20]. The much cheaper secondary storage technologies play no role in the scaling principles of main-memory-oriented configurations and are limited to backup tasks and archiving. Facebook is a pioneer in the introduction and evaluation of new solid state storage devices (SSD) and upcoming non-volatile memory (NVM) in massive data production setups [EGA+18; MWKM15]. Leveraging the characteristics of new storage hardware in a complex memory hierarchy allows performance comparable to expensive scaled-up main-memory solutions on cheap – typically low main-memory equipped – data nodes. Establishing a technological edge by leveraging hardware characteristics even further could significantly reduce asset, administrative and operative costs.

In the combination of these contexts, current storage managers and the structures of (secondary) access paths encounter problems, which are outlined in the following section.
1.2. Problem Statement

Viewed in isolation, the aspects of application-generated workloads (Section 2.2), the characteristics of modern hardware technologies (Section 2.1) and the fundamentals of DBMS design (Section 2.3) are well studied and partially state-of-the-art. Combining these aspects implies new challenges.

Application-generated high-rate continuous insertion and analytical workloads become read- and write-intensive for the secondary storage hardware of data nodes. Additional access paths can reduce the unnecessary read effort and improve latencies, but they require frequent maintenance. Whereas base table storage management structures like Snapshot Isolation Append Storage (SIAS, compare Section 3.1.5.1) [Got16] obtain beneficial append-based write patterns very well, traditional strictly alphanumerically sorted additional access paths result in a random write pattern, poor latency and high write amplification (WA, the ratio of physically to logically written data). Additional access paths thereby put massive pressure, amplified by WA and unmitigated by buffering, on secondary storage. The bandwidth of secondary storage devices is not used efficiently, which shrinks the overall throughput of the DBMS. Maintaining the additional access path can become more expensive than scanning the base table.

Considering the different characteristics of main-memory primary storage and Flash-based secondary storage techniques, the problem space grows with complex memory hierarchies (compare Sections 2.1 and A.1). Generally, the input-/output-operation (I/O) patterns of traditional additional access paths do not leverage the characteristics of Flash-based secondary storage devices and shrink performance in read- and write-intensive workloads. By this means, a company's investments are not optimally used. Furthermore, the durability of secondary storage devices shrinks due to the high update rates and WA of additional access paths. The endurance of Flash-based storage media depends on physical write-/erase-cycles. The write patterns and WA of strictly key-sorted additional access paths drastically wear out secondary storage devices.

A key characteristic of almost all commercial and academic DBMS is the application of Multi-Version Concurrency Control (MVCC, compare Section 2.3). Multiple tuple versions are maintained, each valid for a different period in time. The major benefit is improved throughput in the DBMS, because concurrent reads and writes do not block each other and proceed in their own calculated snapshots. However, supplemental work is necessary for the detection of the tuple version visible to a transaction. Typically, additional access paths are version-oblivious due to maintenance costs, and the DBMS determines visibility by means of expensive base table look-ups or large cached in-memory structures. In principle, Flash-based secondary storage devices perform well on reads due to internal parallelism and asymmetric read/write behavior (outlined in Section 2.1), but visibility checking inflates the amount of – in principle – unnecessarily read data, especially in an HTAP scenario.

Briefly, the DBMS and its characteristics are the linkage between application-generated workloads and the optimal usage of hardware technologies. Leveraging these characteristics can reduce the running expenses of a business and allow new types of workloads. Current additional access paths as well as virtually every storage manager do not provide adequate characteristics. The goal of this work is to close this gap.
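To make the write amplification figure from above concrete, consider a hedged back-of-the-envelope example; the numbers are illustrative assumptions, not measurements from this thesis. If an update logically changes a 100 B index entry but the strictly key-sorted access path has to rewrite the entire 8 KiB leaf page that holds it, then

$$WA = \frac{S_{physical}}{S_{logical}} \approx \frac{8\,192\ \text{B}}{100\ \text{B}} \approx 82,$$

before any device-level write amplification caused by Flash garbage collection multiplies this value further.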
1.3. Contributions

The central contributions of this thesis are the conception, development and evaluation of a storage and index management structure which meets the demands of modern workloads and hardware technologies. To the best of the author's knowledge, no other storage and index manager fully integrates these aspects in one single consistent structure without substantial cuts in the performance and endurance of data nodes, architectural fuzziness in the DBMS or massive complexity in evolving and adjusting the DBMS backend. The presented approach is the Multi-Version Partitioned BTree (MV-PBT), a unified tree-based multi-version storage and index management structure. Its origin in, and the basic capabilities of, a regular and well-studied B+-Tree [BM70] enable a simple integration into existing architectures and the adoption of recently published B-Tree techniques. A major benefit of tree structures is the natural alphanumeric sort order, which – in contrast to hash-based indexes – enables powerful range querying in HTAP analytical workloads. In addition, due to the lack of pre-filtering for data ranges, a range filtering approach is presented: bloomRF (Bloom range filter) allows data skipping and increases throughput in MV-PBT by introducing, for the first time, piecewise-monotone hash functions and prefix hashing in a Bloom filter.

Research objectives are formulated, on the basis of the contributed approach, in the following research questions (RQ):

RQ1: How could a visibility check on multi-version data be performed in Partitioned BTrees while leveraging modern hardware characteristics?
RQ2: What further applications arise from timestamp-based index-only visibility checking in an MV-PBT?
RQ3: What optimizations of the reading behavior are required for MV-PBT in the areas of data skipping and buffer efficiency?
RQ4: How do online reorganization methods in MV-PBT enable workload adaptivity?
RQ5: What are the performance effects of optimized garbage collection in MV-PBT?

Further, more detailed contributing aspects of this thesis are as follows:

Formulation of findings on how a storage and index management structure needs to interact, based on a characterization of modern workloads and hardware technologies. In order to understand the evaluation and conception of their linking layer – the DBMS storage structures – a brief characterization of modern workloads and hardware technologies is provided. Impacts on data management structures are elaborated. Thereby, the relevance of DBMS design decisions and the underlying core mechanics become visible.

Theoretical and experimental evaluation of version models in multi-version DBMS and K/V-Stores. Primarily, the concepts and impacts of Multi-Version Concurrency Control (MVCC) designs, buffer management and additional access paths are focused on. One key finding in DBMS is that the optimal version organization model differs between base table main storage and additional access path maintenance.

Analysis of state-of-the-art literature as well as theoretical and experimental evaluation of existing storage and indexing approaches. This thesis outlines a brief review of current state-of-the-art storage and index management structures in the areas of multi-version capabilities and leveraging modern storage hardware.

Introduction of the Index-Only Visibility Check and demonstration of its relevance, especially in an HTAP scenario. Visibility checking is an expensive operation whose cost grows linearly with the length of version chains, but it is required in multi-version DBMS.
Modern indexing approaches are required to support this operation.

Enabling a strict concept of out-of-place updates and invalidations. MV-PBT introduces several different record types, whereby robustness in situations of high concurrency and update intensity is enabled.

Conception and evaluation of workload adaptivity, reorganization and garbage collection techniques in MV-PBT. Whereas state-of-the-art storage management structures focus on compaction mechanics for increased look-up performance by reducing read amplification (RA), MV-PBT aims to achieve a near-optimal WA and leverages its natural batch-wise partitioning sequence in order to guarantee robust access and update performance.

1.4. Scientific Publications

The contributions of this thesis and influential extended work have been introduced in various scientific peer-reviewed or preprint publications, denoted in the following list:

[MRBP23] B. Moessner, C. Riegger, A. Bernhardt, I. Petrov. 'bloomRF: On Performing Range-Queries in Bloom-Filters with Piecewise-Monotone Hash Functions and Prefix Hashing'. In: EDBT'23 (accepted) (2023).

[RP22] C. Riegger, I. Petrov. 'Storage Management with Multi-Version Partitioned BTrees'. In: ADBIS'22 (accepted) (2022).

[RBMP20] C. Riegger, A. Bernhardt, B. Moessner, I. Petrov. 'bloomRF: On Performing Range-Queries with Bloom-Filters based on Piecewise-Monotone Hash Functions and Dyadic Trace-Trees'. In: CoRR abs/2012.15596 (2020). arXiv: 2012.15596. URL: https://arxiv.org/abs/2012.15596.

[RVGP20] C. Riegger, T. Vinçon, R. Gottstein, I. Petrov. 'MV-PBT: Multi-Version Index for Large Datasets and HTAP Workloads'. In: Proceedings of the 23rd International Conference on Extending Database Technology (EDBT 2020). Copenhagen, Denmark, 2020.

[VWB+20] T. Vinçon, L. Weber, A. Bernhardt, A. Koch, I. Petrov, C. Knödler, S. Hardock, S. Tamimi, C. Riegger. 'nKV in Action: Accelerating KV-Stores on Native Computational Storage with Near-Data Processing'. In: Proc. VLDB Endow. 13.12 (2020), pp. 2981–2984.

[VHR+19] T. Vinçon, S. Hardock, C. Riegger, A. Koch, I. Petrov. 'nativeNDP: Processing Big Data Analytics on Native Storage Nodes'. In: ADBIS. 2019.

[PKH+19] I. Petrov, A. Koch, S. Hardock, T. Vinçon, C. Riegger. 'Native Storage Techniques for Data Management'. In: 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019. IEEE, 2019, pp. 2048–2051.

[RVP19] C. Riegger, T. Vinçon, I. Petrov. 'Indexing Large Updatable Datasets in Multi-Version Database Management Systems'. In: Proceedings of the 23rd International Database Applications & Engineering Symposium. IDEAS '19. Athens, Greece: Association for Computing Machinery, 2019.

[RVP18a] C. Riegger, T. Vinçon, I. Petrov. 'Efficient Data and Indexing Structure for Blockchains in Enterprise Systems'. In: Proceedings of the 20th International Conference on Information Integration and Web-Based Applications & Services. iiWAS2018. Yogyakarta, Indonesia: Association for Computing Machinery, 2018, pp. 173–182.

[RVP18b] C. Riegger, T. Vinçon, I. Petrov. 'Efficient Data and Indexing Structure for Blockchains in Enterprise Systems'. In: IBM Technical Report RC25681. 2018.

[PVK+19] I. Petrov, T. Vinçon, A. Koch, J. Oppermann, S. Hardock, C. Riegger. 'Active Storage'. In: Encyclopedia of Big Data Technologies. 2019.

[PKV+19] I. Petrov, A. Koch, T. Vinçon, S. Hardock, C. Riegger. 'Hardware-Assisted Transaction Processing: NVM'. In: Encyclopedia of Big Data Technologies. 2019.
[VHR+18] T. Vinçon, S. Hardock, C. Riegger, J. Oppermann, A. Koch, I. Petrov. 'NoFTL-KV: Tackling Write-Amplification on KV-Stores with Native Storage Management'. In: Proceedings of the 21st International Conference on Extending Database Technology (EDBT 2018). Vienna, Austria: OpenProceedings, 2018, pp. 457–460.

[RVP17a] C. Riegger, T. Vinçon, I. Petrov. 'Multi-Version Indexing and Modern Hardware Technologies: A Survey of Present Indexing Approaches'. In: Proceedings of the 19th International Conference on Information Integration and Web-Based Applications & Services. iiWAS '17. Salzburg, Austria: Association for Computing Machinery, 2017, pp. 266–275.

[RVP17b] C. Riegger, T. Vinçon, I. Petrov. 'Write-Optimized Indexing with Partitioned B-Trees'. In: Proceedings of the 19th International Conference on Information Integration and Web-Based Applications & Services. iiWAS '17. Salzburg, Austria: Association for Computing Machinery, 2017, pp. 296–300.

1.5. Structure of this Thesis

Now that the motivational context, the consequential problem statements and the contributions have been defined, the remainder of this thesis is structured as follows. In Chapter 2, a brief introduction to the technical background is given, providing the required insights and concepts in the areas of hardware technology characteristics, workload characteristics and the fundamentals of DBMS design. Based on the outlined findings, several mostly key-sorted indexing approaches are introduced and evaluated in Chapter 3. The designated foundational approach, Partitioned BTrees (PBT), is elaborated and enhanced to the version-aware Multi-Version Partitioned BTree (MV-PBT), incorporating the determined concepts and capabilities, in Chapter 4. Shortcomings in existing point-range filter techniques – a basic requirement for efficient data skipping in MV-PBT – lead to the concept and development of bloomRF, outlined in Chapter 5. Chapter 6 provides a full integration and experimental evaluation of MV-PBT in an append-based DBMS and a K/V-Store. Finally, a brief summary and an overview of research opportunities are presented in Chapter 7.

Chapter 2

Technical Background

With the aim of understanding the various aspects and challenges of key-sorted access paths, modern complex memory hierarchies and the diverse characteristics of their members are described first. Subsequently, the natural behavior of evolving workload types is outlined. Thereupon, the fundamentals of modern database management system (DBMS) design techniques are provided. With respect to the formulated characterizing statements, the emerging significance of robust, Flash-leveraging, key-sorted multi-version storage and index management structures is elaborated.

2.1. Characteristics of modern Hardware Technologies

Understanding the basic concepts and characteristics of modern hardware technologies is crucial for database access method design. Execution time on hardware can be simplified to processing ($T_{processing}$) and data access ($T_{data\text{-}access}$) costs. Adopting the trends towards resource and memory efficiency in modern K/V-Stores [EGA+18; Inc22; MWKM15], access paths must consider the characteristics of secondary storage devices. Even though the instructions of data-intensive operations are frequently executed, the access latencies to massive amounts of data on high-capacity secondary storage are assumed to become the dominant factor, so that:

$$T_{data\text{-}access} \gg T_{processing} \qquad (2.1)$$

Therefore, the scope of this work is focused on the characteristics of the memory hierarchy.
Due to the interdependence of processing and data provisioning, a brief characterization of recent processing evolutions is given.

2.1.1. Highly parallelized and decentralized processing units

Over the past decades, processing power has benefited from increasing clock speeds, and applications generally profit from this trend. The latest developments in processing units yield increased parallelism and manifold unconventional computation design options. Even though the scope of this thesis is not primarily focused on optimizing processing costs, one central point can be formulated: data access methods need to be aware of decentralized computing models. Multi-core central processing units (CPU) and various possible hardware accelerator units are probably involved in a modern DBMS [BGHS19] and need to operate complementarily on a consistent dataset instance, without shrinking performance due to concurrency issues such as excessive cache invalidation or double buffering. Efficient data provisioning and persistence for decentralized and parallel calculations is the purpose of the memory hierarchies on data node servers.

2.1.2. Primary and Secondary Storage Characteristics

The components of the memory hierarchy are the bottleneck of massively data-intensive tasks. Efficient use of the clock cycles of processing units relies on data provisioning along these volatile primary and persistent secondary storage devices. Latency describes the delay in response time between components. As a general rule, latencies, but also capacities, increase with greater distance to a processing unit. CPU caches retrieve cache-line-sized data from main memory (e.g. RAM). Whenever an already persistent (mostly volatile) working copy is not present on a byte-addressable memory device, it is retrieved by a read I/O operation from a block-addressable storage device. The other way around, newly allocated data or modified working copies in volatile memory are persisted on block-addressable secondary storage devices by write I/O operations.

Traditionally, only two levels of the memory hierarchy are considered in DBMS algorithms – volatile main memory and persistent disk [Gra11]. The latencies of read and write I/O operations form a massive access gap (compare Section A.1). Whereas the wait latency between caches and memory is about ×10², the access gap between caches and an HDD (the traditional, cheap, mechanical Hard Disk Drive) is about ×10⁶. Traditional memory and storage media commonly exhibit symmetric read and write latencies as well as benefits for sequential operations.

Evolving and recently established storage technologies close the access gap (compare Complex Memory Hierarchies in Section A.1), but also broaden the characteristics of the devices in memory hierarchies. Regarding the latencies of solid state drives (SSD) and non-volatile memories (NVM), they naturally fill the gap; however, their special characteristics¹ must be considered:

¹ The characteristics are exemplarily taken from SSDs. A detailed technical background is given in Section A.2.

High level of inherent parallelism. The independent or decomposed execution of I/O operations within the several structural levels of an SSD has been well studied [APW+08; HJF+11; PSS+10; RZA+12; Shi17; SXX+09; WKS15]. Leveraging the structural characteristics by such techniques enables much higher parallelism and I/O performance than has been known from HDDs. Access structures might leverage this high parallelism whilst accessing data – i.e.
SSDs cope well with more purposeful I/O from DBMS and K/V-Stores.

Asymmetric read and write performance. Flash mainly supports the native operations read, write and background erase. Reads perform an order of magnitude better than writes and two orders of magnitude better than erases [CKZ09; Got16] regarding latencies and IOPS (I/O per second). Especially in the case of random write I/O, the latency and costs alternate between cheap sequential writes and expensive erase operations [BJB09]. SSDs contain several on-device caches for many purposes: caching of address translation mapping tables, performing asynchronous read-ahead, concealing the asymmetric behavior by write buffering or enabling background tasks. Generally, operation costs depend on the internal state of an SSD. High-rate continuous I/O operations uncover the asymmetric behavior in steady-state conditions [DISK20; Got16]. As depicted in Figure 2.1b, write throughput quickly drops as the caches saturate, at different utilization levels depending on the block size. In contrast, read behavior remains constant for all block sizes (Figure 2.1a). Access structures leverage the asymmetric I/O characteristics by minimizing write accesses, whereas SSDs cope well with the increased read I/O.

Out-of-place update operation. Unlike HDDs, on which a page at a specific address can be physically overwritten in place, Flash requires a clean page status before writing. Rewriting a page requires an out-of-place write to a different clean page and an update of the address translation mapping table. The costly erase is performed by a background garbage collection (GC) task at a later point in time [CKZ09; DISK20]. Access structures must avoid in-place updates of already persisted data and replace modifications by out-of-place updates.

[Figure 2.1.: Read and write metrics of the Intel DC P3600 enterprise SSD and the Samsung 850/860 Pro consumer SSD applied in the testbed – (a) IOPS random read and (b) IOPS random write over time for different block sizes (in kB); (c) Intel DC P3600 – sequential vs. random.]

Advantages of a sequential write I/O pattern. As an effect of out-of-place page replacement, inherent parallelism and low-memory-footprint hybrid/block-level mapping in on-device caches, sequential write patterns can improve the performance of an SSD [MFL14]. Sequential sectors can be arranged and distributed along independent Flash components. Random writes do not result in sequential sectors, whereby the memory footprint of the mapping tables increases and background maintenance operations, like garbage collection, become more complex and expensive. As a general rule, "random writes should be limited to a focused area" and "sequential writes should be limited to a few [concurrent] partitions" [BJB09]. Figure 2.1c indicates roughly equal, very good performance for sequential and random read I/O for different page sizes and numbers of pending I/O requests (queue depth); however, random write performance is only a fraction of the sequential operation at all measurement points. Generally, access structures must adopt sequential writes for the vast majority of the payload, whereas a few random writes of small meta structure data are feasible. In contrast, read latencies remain roughly unaffected by access patterns.
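The following minimal C++ sketch illustrates the append-oriented write principle derived above: modifications are accumulated in memory and flushed as one large sequential write instead of many small random in-place writes. It is an illustrative sketch only – the record layout, batch size and file name are assumptions – and it is not the partition-switch mechanism of MV-PBT described in Chapter 4.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// A fixed-size record standing in for an index or version record (assumed layout).
struct Record {
    uint64_t key;
    char payload[56];
};

// Buffers records in memory and appends them to a file in one large
// sequential write once the buffer is full - the write pattern SSDs favor.
class AppendWriter {
public:
    AppendWriter(const std::string& path, std::size_t batchSize)
        : out_(path, std::ios::binary | std::ios::app), batchSize_(batchSize) {
        buffer_.reserve(batchSize_);
    }

    void put(const Record& r) {
        buffer_.push_back(r);
        if (buffer_.size() >= batchSize_) flush();
    }

    // One sequential write of the whole batch instead of one random write per record.
    void flush() {
        if (buffer_.empty()) return;
        out_.write(reinterpret_cast<const char*>(buffer_.data()),
                   static_cast<std::streamsize>(buffer_.size() * sizeof(Record)));
        out_.flush();
        buffer_.clear();
    }

    ~AppendWriter() { flush(); }

private:
    std::ofstream out_;
    std::size_t batchSize_;
    std::vector<Record> buffer_;
};

int main() {
    AppendWriter writer("append_demo.bin", 1024);  // hypothetical file and batch size
    for (uint64_t k = 0; k < 10000; ++k) {
        Record r{k, {}};
        std::memcpy(r.payload, "demo", 4);
        writer.put(r);  // modifications are batched, not written in place
    }
    return 0;  // destructor flushes the remaining batch sequentially
}
```

The design choice mirrors the maxim above: the bulk of the payload reaches the device as large sequential appends, while only small metadata would ever be written randomly.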
Background garbage collection (GC). Writing a page requires a clean on-device target-page status. As a result, write-back operations of updated pages are performed at a different storage location (out-of-place) and the former page version becomes invalid. Erase operations are very expensive but are required for space reclamation. A fundamental property of Flash (NAND) is that data can be read and written at page level but has to be erased at the superior block level. First, a victim block is determined. Second, the still valid pages in the block are written to an appropriate different block location. Third, the victim block is erased. As a side effect, the write amplification (WA) increases on device level (WA-D) [MFL14; MWKM15]. Access structures must take GC into account; it is valuable for larger regions, and temporary space amplification by obsolete data is negligible on secondary storage.

Limited durability and wear. The status of the memory cells in pages and blocks is switched by write and erase operations in a write-erase-cycle. Each write-erase-cycle of a block wears out the transistor oxide layer of its memory cells. The approximate maximum number of write-erase-cycles depends on the underlying technique, whereas 3,000 cycles are common for consumer devices and 10,000 for enterprise devices. SSDs track the number of erases for each block. Wear-leveling methods distribute the wear equally over all blocks, and unreliable blocks are consequently discarded [MFL14; MWKM15]. Knowing the maximum number of write-erase-cycles per block ($N_{w/e\text{-}cycles}$), the absolute storage capacity of the SSD in blocks ($S_{capacity}$), the data modified by a workload ($S_{modified}$) and the write amplification (WA of the data structure and on-device), the durability (D) is calculated as follows:

$$D = \frac{N_{w/e\text{-}cycles} \times S_{capacity}}{WA \times S_{modified}} \qquad (2.2)$$

In the denominator, the multiplication of WA and $S_{modified}$ defines the wear in written blocks. Since wear is equally distributed over all blocks, the absolute number of write-erase-cycles of the SSD is calculated by the multiplication of $N_{w/e\text{-}cycles}$ and $S_{capacity}$ in the numerator. Access structures must minimize WA, since it negatively affects Flash's durability; a worked example follows at the end of this subsection.

[Figure 2.2.: IOPS and latency metrics of the applied SSDs in the testbed for different workload characteristics and block sizes – (a) IOPS for random/sequential and read/write distributions, (b) average latency for random/sequential and read/write distributions.]

I/O transfer time dominates. Accessing data on secondary storage requires time for seek and transfer. Since SSDs have no electro-mechanical components, the seek time is very low, whereas, unlike on HDDs, the transfer time becomes the dominant part. Larger transfer/page sizes result in fewer I/O operations per second (IOPS, compare Figure 2.2) for all access patterns. Access structures must consider optimizing the net transferred data by cheap access structures and appropriate page sizes.
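To illustrate Equation (2.2), consider a hedged example with assumed figures, not measurements from the testbed: an SSD with $N_{w/e\text{-}cycles} = 10\,000$ cycles per block, a capacity of $S_{capacity} = 400$ GB, a workload modifying $S_{modified} = 1\,000$ GB per day, and an overall write amplification of $WA = 10$ yields

$$D = \frac{10\,000 \times 400\ \text{GB}}{10 \times 1\,000\ \text{GB/day}} = 400\ \text{days} \approx 1.1\ \text{years}.$$

Halving the write amplification to $WA = 5$ under the same assumptions doubles the expected durability to roughly 2.2 years, which is why access structures with near-optimal WA directly extend device lifetime.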
2.1.3. Testbed Hardware Metrics and Operating System

Performance evaluation benchmarking is performed on a 64-bit Ubuntu 16.04.7 LTS (Xenial) server setup with Linux kernel version 4.4.0-210-generic. Benchmarks and DBMS are compiled with GCC version 9.4.0. The CPU is a 4-core (8-thread hyper-threading) Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50 GHz with internal L1/L2/L3 caches of 256 kB/1 MB/10 MB. Memory is provided by 2×16 GB DDR4 RAM devices (manageable up to 32 GB by boot parameters). The server is equipped with a 400 GB Intel DC P3600 PCIe enterprise SSD as well as a 500 GB and a 1 TB SATA Samsung 850/860 Pro consumer SSD, whereby logging and payload perform without any interference and at different intensity levels. Generally, write caches are disabled for steady state, if possible. Read and write characteristics are depicted in Figures 2.1 and 2.2 and are in accordance with the Flash characteristics outlined in Section 2.1.2.

2.2. Modern Workload Properties

In order to design, implement and evaluate a version-aware storage and index management structure that leverages modern hardware characteristics, several types of workloads are considered. First, Online Transaction Processing (OLTP) represents traditional business-contextual data management use cases, executing small updates and inserts alongside multiple reads [TPC10]. The very opposite in query execution characteristics is a decision-support Online Analytical Processing (OLAP) workload, in which queries with a high degree of complexity and a large volume of examined data are performed [TPC21]. An upcoming type of workload is the mixture of OLTP and OLAP in Hybrid Transactional and Analytical Processing (HTAP) on a huge shared dataset instance [CFG+11; HG20; ÖTT17]. Cloud data serving workloads bring OLTP-like transaction processing together with massive amounts of data and high-rate continuous insertions [CST+10].

Some general assumptions about modern workload properties are as follows:

Exponential growth of data. Ongoing digital transformation and the personalization of services incur rapidly growing dataset sizes [DB20; SG06]. Massive amounts of data have to be managed efficiently.

High-rate continuous insertion / modification. Evolving types of applications and data sources with varying lifespans incur frequent data modification. Furthermore, the availability and quality of services require stable steady-state throughput (elasticity) for high-rate continuous insertion and modification workloads [CST+10].

Application of HTAP workloads. Integrating data that relies on various isolated solutions leads to inflexible and complex infrastructure landscapes. Data-centric infrastructures, in which applications and services are designed and built upon a core of a commonly applied and maintained dataset instance [DB20; HG20; ÖTT17; RVGP20], enable cheap administration and flexible evolvability.

2.2.1. Application of standardized Database Benchmarks

The introduced workload property assumptions include high-rate continuous insertions and modifications with simultaneous analytical querying on massive amounts of data, which are potentially performed on a shared dataset instance. These properties are well represented in HTAP as well as cloud serving benchmarks. It is common practice to apply the best matching standardized database benchmark.
Popular benchmarks are designed by the Transaction Processing Performance Council (TPC)¹.

¹ 'The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable data to the industry.' [TPC22]

The TPC-C OLTP benchmark is very common in database benchmarking. OLTP mainly performs transactions of multiple reads as well as small updates and inserts [TPC10] in a business-contextual database schema with multiple indexes (compare Figure 2.3, TPC-C part). The open-source DBT-2 [WW21] implementation has been applied as the business-contextual OLTP benchmark in this thesis. Beyond the commonness of TPC-C as an OLTP benchmark, however, there is a further significant reason for its application. The authors of [CFG+11] recognized the lack of meaningful HTAP workloads and developed the CH-Benchmark upon TPC-C (the TPC OLTP workload) and TPC-H (the TPC OLAP workload). As depicted in Figure 2.3, the CH-Benchmark operates on a commonly shared dataset instance, yielding special HTAP characteristics. This benchmark is integrated in the OLTP-Bench framework [Pav+21].

As an alternative for high-rate continuous insertion workloads on K/V-Stores, the Yahoo! Cloud Serving Benchmark (YCSB) [CST+10] is applied in [RKD21]. YCSB consists of different key-value-style workloads for cloud-based applications. In the core workload package, several mixtures of key-value put, get and scan operations for different data and request distributions as well as sizes are supported. Many parameters are adjustable for the simulation of arbitrary use cases.

[Figure 2.3.: Schema of the CH-Benchmark [PWM+14], which adopts TPC-C and TPC-H as an HTAP workload representative.]

2.3. Fundamental Design Decisions in Multi-Version Database Management Systems

Multi-Version Database Management Systems (in this thesis simply denoted as DBMS) as well as K/V-Stores are the backbone of data-intensive applications. By this are meant all DBMS which implement the Multi-Version Concurrency Control (MVCC) protocol and apply Snapshot Isolation (SI). The well-known MVCC is one of the most popular transaction management schemes, with many representatives: Oracle [Ora21a], Microsoft SQL Server [Mic21b], HyPer [KN11], SAP HANA [FML+12], Google Cloud Spanner [BBB+17], MongoDB WiredTiger [Mon21], NuoDB [Nuo21], PostgreSQL [Pos21] or MySQL-InnoDB [Ora21c], just to name a few.

In Figure 2.4, the modules of a DBMS are schematically depicted. Architectures of K/V-Stores vary and potentially provide fewer features and modules. Users interact with the DBMS through a user interface, according to which the user request is parsed into an execution plan. Thereupon, the query is executed by accessing data with the best matching access method and perhaps performing further operations. A required set of transaction properties is guaranteed by the concurrency control module, namely MVCC. The data accessed in massively data-intensive operations is likely to be located on secondary storage devices, e.g. on Flash SSDs. Access methods request the byte-addressable data from a RAM-located Buffer Manager. If the data is not present in the Buffer Manager, it is read from block-addressable secondary storage. In order to keep within the maximum memory footprint, buffers have to be evicted from the cache by a replacement policy. Unchanged buffers can be discarded; modified ones have to be written to secondary storage media by the logic of the Storage Manager. Data corruption is avoided by logging critical operations.
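The following C++ sketch illustrates, in a highly simplified form, the buffer manager interaction just described: access methods request pages from a RAM-resident cache, a miss triggers a read I/O, and eviction discards clean pages but writes dirty ones back through the storage manager. The class and function names, the LRU policy and the 8 KiB page size are assumptions for illustration; this is not the implementation of PostgreSQL, WiredTiger or MV-PBT.

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

using PageId = uint64_t;

struct Page {
    std::vector<char> data;   // page-sized payload
    bool dirty = false;       // modified since it was read?
};

// Placeholder for the storage manager performing block I/O (assumed interface).
struct StorageManager {
    Page readPage(PageId) { return Page{std::vector<char>(8192), false}; }
    void writePage(PageId, const Page&) { /* block write I/O */ }
};

// Minimal buffer manager with an LRU replacement policy.
class BufferManager {
public:
    BufferManager(StorageManager& sm, std::size_t capacity)
        : storage_(sm), capacity_(capacity) {}

    // Access methods call fetch(); a miss causes a read I/O from secondary storage.
    Page& fetch(PageId id) {
        auto it = frames_.find(id);
        if (it == frames_.end()) {
            if (frames_.size() >= capacity_) evictOne();
            lru_.push_front(id);
            it = frames_.emplace(id, Frame{storage_.readPage(id), lru_.begin()}).first;
        } else {
            lru_.erase(it->second.lruPos);   // move the page to most-recently-used
            lru_.push_front(id);
            it->second.lruPos = lru_.begin();
        }
        return it->second.page;
    }

private:
    struct Frame {
        Page page;
        std::list<PageId>::iterator lruPos;
    };

    // Evict the least-recently-used page; only dirty pages incur a write I/O.
    void evictOne() {
        PageId victim = lru_.back();
        lru_.pop_back();
        Frame& f = frames_.at(victim);
        if (f.page.dirty) storage_.writePage(victim, f.page);
        frames_.erase(victim);
    }

    StorageManager& storage_;
    std::size_t capacity_;
    std::unordered_map<PageId, Frame> frames_;
    std::list<PageId> lru_;
};

int main() {
    StorageManager sm;
    BufferManager bm(sm, 128);   // assumed capacity of 128 frames
    Page& p = bm.fetch(42);      // miss -> read I/O from secondary storage
    p.dirty = true;              // modification -> write-back on eviction
    return 0;
}
```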
The scope of this thesis is the storage and index management represented by the Access Methods and the accompanying Buffer and Storage Manager modules, with a focus on the properties of the interlinked MVCC and on beneficial I/O to secondary storage devices. The Query Evaluation Engine as well as the Logging and Recovery modules are affected as well, though they are not treated in detail. Therefore, a detailed overview of MVCC designs is given.

[Figure 2.4.: Regular DBMS architecture (compare [RG03]), comprising the User Interface (SQL, DDL, key iterator), the Query Evaluation Engine (parser, execution planner, optimizer), the Access Methods (base table, additional access paths), Multi-Version Concurrency Control (transaction, lock and version management), the Buffer Manager, the Storage Manager, Logging and Recovery, and Flash storage accessed by read/write I/O. The scope of this thesis is in dashed lines, whereby critical interfering modules are shaded in darker and the main contributions in brighter gray.]

2.3.1. Opportunities and Version Management in Multi-Version Concurrency Control

Multi-Version Concurrency Control (MVCC) and Snapshot Isolation (SI) are well-known mechanisms in DBMS. However, the fundamentals of MVCC with SI are essential prerequisites for this thesis, therefore some background is provided. The parallel execution of transactions is a major scaling factor of DBMS, improving throughput and response times; however, interleaving concurrent transactions can cause data anomalies that are not tolerable for an application. Several well-known version-oblivious concurrency control mechanisms ensure integrity, but typically reading and writing transactions block each other, i.e. readers have to wait for the completion of a writer and writers wait for lock releases by readers. In an HTAP scenario this means that OLAP queries on update-intensive data never complete and OLTP modifications are not able to proceed. MVCC with SI overcomes these issues.

MVCC creates a new version of a data tuple on each update. As a result, readers do not block writers anymore, because read locks do not oblige writers to wait for their release. In MVCC, it is very common to leverage this versioning behavior for Snapshot Isolation (SI). SI lets transactions operate on a consistent (logical) state (snapshot) of the database, e.g. the latest change to a tuple committed at the begin of the transaction, or respectively the transaction's own modification, is visible. Moreover, SI prevents concurrent transactions from updating the same tuple. Due to the fact that every transaction knows its snapshot, writers do not block readers anymore. MVCC with SI enables a high level of concurrency and isolation – even in HTAP workloads with long-lasting analytical queries and frequent updates. Therefore, it is a good choice for leveraging the parallelism of processing units and Flash storage.

[Figure 2.5.: Logical view on tuple versions in MVCC with creation and invalidation timestamps [RVGP20]. The physical order typically differs.]

From a logical point of view, the maintenance of tuple versions is based on some prerequisites. Each tuple consists of at least one tuple version record. By this means, several tuple version records of one logical tuple exist in a version chain, and each of them is valid for a different period in time.
The validity of a tuple version, strictly speaking its visibility to a specific transaction snapshot, is determined by timestamps for creation ($t_{creation}$) and invalidation ($t_{invalidation}$)¹. A visibility check is responsible for returning the valid tuple version for a transaction snapshot, whereby the version chain is processed and the timestamps are evaluated. In order to process the version chain, an entry point to the chain must be known, i.e. a specific tuple version record, and each tuple version has to know its predecessor or successor. Version chains preferably have the structure of a latch-free singly linked list [WAL+17]. A tuple version record becomes obsolete if it is no longer visible to any active transaction snapshot, and is then removed by garbage collection (GC).

¹ Transaction timestamps constitute a logical sequence of executed transactions. In addition, DBMS possibly maintain further information, e.g. a sequence of command numbers per transaction in PostgreSQL [Pos21], in order to avoid race conditions. In this thesis, transaction timestamps are meant to include command numbers in the logical sequence as well.

Figure 2.5 depicts a logical view on a tuple version chain. Tuple t was changed several times by transactions $TX_{u0}$, $TX_{u1}$, $TX_{u2}$ and $TX_{u3}$, modifying the attribute a to 7, 3, 1 and 9. Each transaction maintains the creation timestamp of its version and the corresponding invalidation timestamp of its predecessor in the logical tuple version chain. Tuple y depicts a tuple version which is not a member of the version chain of tuple t. The depicted creation of new tuple version records is in accordance with the out-of-place update characteristics of Flash secondary storage devices.

Considering that a new version is created each time a tuple is updated in MVCC, this can happen in different ways – outlined in detail in the following sections. Several DBMS implement variations of the version management scheme, whereby the design decisions determine its capabilities and applicability. [WAL+17] evaluated different designs from a main-memory point of view. Building upon these insights, different MVCC designs are evaluated in this thesis with respect to the characteristics of secondary storage devices in a complex memory hierarchy [RVGP20].
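Before turning to the individual design decisions, the following C++ sketch puts the visibility rules above together: for a given snapshot it walks a version chain from its entry point and returns the first record whose creation is visible and whose invalidation is not. The snapshot test is deliberately simplified to a single snapshot timestamp (no command numbers, no lists of in-progress transactions), and the VersionRecord type is repeated from the previous sketch; both are illustrative assumptions rather than the check of any particular DBMS.

```cpp
#include <cstdint>
#include <optional>

using Timestamp = std::uint64_t;

struct Snapshot {
    Timestamp snapshot_ts;   // changes of transactions with timestamp <= snapshot_ts are visible
    bool sees(Timestamp t) const { return t <= snapshot_ts; }
};

struct VersionRecord {                        // minimal repetition for self-containment
    Timestamp t_creation;
    std::optional<Timestamp> t_invalidation;
    const VersionRecord* next = nullptr;      // next record in the singly linked version chain
};

// Walk the version chain from its entry point and return the version visible to the snapshot.
const VersionRecord* visible_version(const VersionRecord* entry, const Snapshot& snap) {
    for (const VersionRecord* v = entry; v != nullptr; v = v->next) {
        const bool created_visible = snap.sees(v->t_creation);
        const bool not_invalidated = !v->t_invalidation.has_value()
                                     || !snap.sees(*v->t_invalidation);
        if (created_visible && not_invalidated)
            return v;        // the valid tuple version for this snapshot
    }
    return nullptr;          // no version of the tuple is visible, e.g. it was inserted later
}
```

At most one record per chain satisfies both conditions under SI, since the invalidation timestamp of a version equals the creation timestamp of its successor.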
2.3.1.1. Version Storage

A logical tuple corresponds to one or more tuple versions, which form a singly linked list of records in a version chain (Figure 2.5). There are two possible physical representations of a tuple version (Figure 2.6): physically materialized or delta-record-based. The former implies that each tuple version record is stored entirely and physically materialized. By this means, each tuple version record contains all information of a logical tuple that is valid for a specific period of time. Whenever a version record is valid for a transaction snapshot, no further records are required to restore the logical tuple. The latter implies that each modification of a logical tuple results in a delta record, which indicates the difference to another tuple version (e.g. applied in the BW-Tree [LLS13; WPL+18]). Delta records are linked and have to be retrieved on demand by the DBMS storage manager for tuple reconstruction. Delta-record-based system designs typically store a single version record (the oldest or the newest – depending on the version ordering, compare Section 2.3.1.2) in the main store. Delta records are located in a separate storage location, e.g. the undo log (applied in InnoDB [Ora21c]) or a temporary version record store (applied as an option in MS SQL Server [Mic21b]). By this means, only the version record in the main store can be directly accessed; other version records have to be reconstructed by processing the main version record and all intermediate delta records until the valid tuple version is fully restored.

Figure 2.6.: Version Storage Alternatives [RVGP20]: physically materialized storage vs. delta-record storage (e.g. in an UNDO log or a version pool / temporary storage).

Both physical representation models can perform modifications in-place or out-of-place. The former creates a copy (or a delta) of the most recent valid tuple version representation at some other location and applies the version chain linkage information, the transaction timestamps and the desired modifications to the newer version in-place. The latter leaves the most recent valid tuple version representation unchanged and creates a tuple version record (or delta record) with the desired modifications at some other location, maintaining the version chain linkage information and transaction timestamps there.

Considering the characteristics of modern Flash storage (outlined in Section 2.1.2), physically materialized version record storage and an out-of-place update scheme are preferable, due to lower RA, WA and tuple reconstruction costs. Delta records tend to consume less space than materialized tuple versions, especially in the case of large tuple sizes and small modifications, but they require additional processing and all predecessors or successors for tuple reconstruction.

2.3.1.2. Version Ordering

The tuple version records of a logical tuple form a tuple version chain. Its organization as a singly linked list enables lock-free modifications. As a result, there is one entry point to the tuple version chain. Every tuple version record is accessible by successively processing the tuple version chain, beginning at the entry point. The internal ordering of this singly linked list can be organized from old-to-new or new-to-old (depicted in Figure 2.7). The former organization enables a static reference to the entry point. A predecessor has to know its successor, wherefore the references in the predecessor must be modified. Adding a new tuple version record requires processing the whole tuple version chain, beginning from the entry point, and appending the record to the end of the list. The latter organization model, namely new-to-old, prepends newly inserted tuple version records to the entry point of the tuple version chain, whereby the new record becomes the new entry point. The new record references the old entry point, which requires no modification.

Figure 2.7.: Version Ordering Alternatives [RVGP20]: new-to-old vs. old-to-new references between newer and older versions.

Treating mixed workload characteristics like HTAP, the two version ordering schemes favor different types of queries. The old-to-new ordering scheme tends to favor long-lasting OLAP queries, precisely because the tuple version records visible to their transaction snapshots are located near the entry point. Conversely, recently inserted tuple version records are accessed much faster in a new-to-old ordering scheme.
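A minimal sketch of the two maintenance paths (reusing a stripped-down version of the hypothetical VersionRecord type from above): prepending under new-to-old ordering touches no existing record, whereas appending under old-to-new ordering walks the chain and modifies the current tail in-place. Function names are illustrative assumptions.

```cpp
struct VersionRecord {                      // payload and timestamps omitted for brevity
    VersionRecord* next = nullptr;          // singly linked neighbor in the version chain
};

// New-to-old ordering: the freshly created version becomes the new (rolling) entry
// point; the previously newest record is referenced but not modified.
VersionRecord* prepend_new_to_old(VersionRecord* entry, VersionRecord* fresh) {
    fresh->next = entry;
    return fresh;
}

// Old-to-new ordering: the entry point stays static, but the whole chain is walked
// and the current tail is updated in place to reference the new version.
VersionRecord* append_old_to_new(VersionRecord* entry, VersionRecord* fresh) {
    if (entry == nullptr) return fresh;
    VersionRecord* tail = entry;
    while (tail->next != nullptr) tail = tail->next;
    tail->next = fresh;                     // in-place modification of the predecessor
    return entry;
}
```

The in-place modification in the old-to-new variant is exactly what causes additional write amplification on Flash, as discussed next.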
Considering the characteristics of modern Flash storage (outlined in Section 2.1.2), the new-to-old ordering scheme has a clear edge due to its version chain maintenance: predecessor version records need not update any linkage information, which contributes to reducing WA. The desirable append-only behavior results from combining out-of-place, physically materialized tuple version records (compare Section 2.3.1.1) with the outlined new-to-old ordering. Other combinations require in-place updates.

2.3.1.3. Version Invalidation Model

A tuple version is said to be invalidated whenever a successor version record exists. Two different invalidation models are considered [Got16] (Figure 2.8). Two-point invalidation is a widespread model, in which the creation timestamp of the successor version is also placed as the invalidation timestamp on the predecessor. By evaluating a single tuple version record, its visibility to a transaction snapshot can be determined.

Figure 2.8.: Version Invalidation Alternatives [RVGP20]: two-point vs. one-point invalidation.

On the other hand, the existence of a successor version itself implicitly invalidates its predecessor. One-point invalidation [GPHB17] leverages this by maintaining only the creation timestamp at each tuple version record and invalidating implicitly. The major benefit of this technique is the avoidance of in-place invalidation – which is effectively a modification – at the predecessor tuple version records. However, evaluating the visibility of a tuple version record for a transaction snapshot may require successor information. One-point invalidation matches the previously outlined new-to-old ordering scheme well, since all successors and their creation timestamps are processed as part of the tuple version record discovery anyway.

With regard to the characteristics of modern Flash storage (outlined in Section 2.1.2), only one-point invalidation is able to avoid in-place updates and therefore enables the beneficial append-only behavior with low WA. The required successor versions are processed anyway in the case of the new-to-old ordering scheme (compare Section 2.3.1.2), wherefore this is not considered a drawback.

2.3.1.4. Garbage Collection

Modifications to a tuple result in the creation of new tuple versions. Tuple version records become obsolete if they are no longer visible to any of the active transaction snapshots. Obsolete tuple version records require garbage collection (GC) for space reclamation and probable performance gains. However, GC reduces concurrency as some form of locking is required, causes performance spikes as it interferes with foreground I/O, and increases WA on secondary storage devices. GC can be performed on the transaction, tuple and index levels [LLS13; LSP+16; WAL+17; WPL+18]. In order to keep GC costs low, operations have to be worth the effort and should not break with the append-only behavior and the principle of minimizing WA (compare Section 2.1.2). Furthermore, independence of the access structures – i.e. base tables, additional access paths and potential helper structures – reduces locking effort and bouncing across structures.
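As a simple illustration of when a record becomes reclaimable, the following sketch computes a global GC horizon (the oldest snapshot timestamp among the active transactions) and tests a version record against it. It assumes the two-point invalidation model for brevity and is a conceptual sketch only; real systems track considerably more state (transaction status, index references, per-level GC granularity).

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

using Timestamp = std::uint64_t;

// Oldest snapshot timestamp among all active transactions; versions invalidated
// before this horizon can no longer be seen by any snapshot.
Timestamp gc_horizon(const std::vector<Timestamp>& active_snapshots) {
    if (active_snapshots.empty())
        return std::numeric_limits<Timestamp>::max();
    return *std::min_element(active_snapshots.begin(), active_snapshots.end());
}

// A version record is obsolete once its invalidation timestamp lies before the horizon.
bool is_obsolete(const std::optional<Timestamp>& t_invalidation, Timestamp horizon) {
    return t_invalidation.has_value() && *t_invalidation < horizon;
}
```

Whether reclaiming such records is worth it then depends on how many obsolete records accumulate per segment and on the write amplification the reorganization itself causes.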
2.3.1.5. Discussion

Different possible design decisions in MVCC have been introduced and theoretically evaluated with respect to the characteristics of modern secondary storage. As depicted in Table 2.1, the MVCC designs can be combined into an optimal set for base table storage on Flash. An append-only, new-to-old ordered storage manager with one-point invalidation for base tables is Snapshot Isolation Append Storage (SIAS) [Got16; GPHB17] (outlined in Section 3.1.5.1). In a write-heavy transactional workload, SIAS exhibits 97% less write I/O and a 30% improved throughput compared to its baseline, MVCC with SI in PostgreSQL [GPHB17]. However, additional access paths become more complex, as outlined in Section 2.4.

Table 2.1.: Optimal MVCC design for the characteristics of Flash storage devices – ideal for storage, but appropriate for additional access paths?
Version Storage: physically materialized storage; out-of-place insertion of new tuple version records.
Version Ordering: new-to-old version chain linkage; rolling entry point of the version chain.
Invalidation Model: one-point invalidation; a successor version indicates invalidation (with its timestamp).
Garbage Collection: space reclamation / performance improvement; large segments of obsolete records.

2.3.2. Estimated Base Table Cost Model

In the previous section a strict append-only heap organization was outlined, which leverages the characteristics of Flash (compare Section 2.1.2). The major benefits of this base table organization are the minimal insert, update and delete costs ($i_{SIAS}$) and the resulting write amplification $WA_{SIAS}$ per page of strict append-only storage:

$WA_{SIAS} = 1$ (2.3)

Since records are invalidated out-of-place, all modifying operations have processing costs of $i_{SIAS} = O(1)$, practically no read I/O (except for potential entry point identification) and optimal write I/O costs ($W_{I/O}$) per operation ($i^{I/O}_{SIAS}$):

$i^{I/O}_{SIAS} = W_{I/O} \times \frac{B_{N+R}}{N+R}$ (2.4)

whereby $N$ comprises the set of logical tuples, $R$ the totality of logical replacements and $B_{N+R}$ the number of pages (buffers) required to store all version records ($N+R$).

Search costs ($s_{SIAS}$) in heaps, however, are much more expensive. Generally, average search costs are stated as $s = O(N/2)$, since the required record might be located anywhere in the dataset. However, version records are invalidated out-of-place, so that in principle the entire dataset must be scanned to identify the required tuple version record, hence $s_{SIAS} = O(N+R)$, and the resulting read I/O costs ($R_{I/O}$) per search as well as scan ($s^{I/O}_{SIAS}$) are defined as follows:

$s^{I/O}_{SIAS} = R_{I/O} \times B_{N+R} \times (1 - p_{Bc})$ , $p_{Bc} \simeq 0$ (2.5)

Due to the assumed data-intensive workloads and the trends towards resource and memory efficiency (compare Sections 2.1, 2.2), the cache probability is assumed to be $p_{Bc} \simeq 0$, whereby each required page of $B_{N+R}$ causes read I/O $R_{I/O}$. The cache probability per iterated tuple version (of the $N+R$ versions) is probably better in sequential processing and depends on the number of version records per page, $\frac{N+R}{B_{N+R}}$.

2.4. Additional Access Paths in Multi-Version DBMS

DBMS objects organize data in tuple records and arrange them for the optimal support of one or a few operations [RG03]. Different organization models, e.g. heap, sorted or hashed, are favorable for different sets of operations, due to their different cost models. A preferable base table organization for Flash storage characteristics might be a strict append-only heap.
Thus, insertion operations in the base table heap are very cheap – tuple records are collected in main-memory pages and evicted only once (as applied in SIAS [Got16]) – however, equality and range search operations become very expensive (compare Section 2.3.2). Equality search costs grow to a scan of the whole multi-versioned dataset, as its size increases with the number of version records and a visibility check becomes necessary. A fully attribute-key-sorted organization model would probably improve the performance of equality and range search operations, but it could also break the desired append-only behavior. Furthermore, any equality or range search on a different attribute of the tuple results in a scan of the whole dataset.

DBMS provide the option to create additional access paths (alias indexes) on the base table primary data storage. The applications of these auxiliary structures are manifold, e.g. support of uniqueness constraint checks, primary keys and foreign keys, or potentially fast look-ups on any indexed set of attributes (secondary index). Index structures apply a search-optimized hashed or sorted organization model; however, modifications tend to become expensive, due to maintenance costs and the resulting unfavorable random write I/O pattern on secondary storage devices. Sorted indexes allow equality and range search with low RA¹ and avoid expensive downstream sort operations – unlike hash indexes, which allow only equality search, wherefore they are not suitable for HTAP scans and are not in the scope of this thesis.

Index records regularly consist of a pair of search key value(s) and one data tuple reference² [RG03]; they are version-oblivious, due to the missing transaction timestamp information (compare Figure 2.9). The compact representation allows cheap search and maintenance; however, there is no guarantee of improved DBMS throughput, as any applied index requires maintenance. Administrators must evaluate the benefit of an index. Based on the query attributes and statistics like data selectivity, the optimizer of the query evaluation engine (compare the DBMS architecture in Figure 2.4) can decide to use the additional access path via the index. Once the index record is located, the location of its referenced data tuple in the base table is also known and can be accessed in a following step.

Figure 2.9.: Version / Index Record Referencing [RVGP20]: physically materialized version records referenced either physically by record id or logically via an indirection layer of virtual tuple identifiers (VIDs).

A very classical way of data tuple referencing in an index record is a physical reference by record id. In the context of MVCC, several tuple version records of one logical data tuple exist. In principle, each tuple version record needs to be indexed, albeit it is sufficient for reliability to locate each matching tuple version record via the version chain.

¹ Compare the linear heap table search costs of Equation 2.5 in Section 2.3.2 with the exemplary logarithmic B+-Tree costs of Equation 3.4 in Section 3.1.1.
² Reference organization variants, e.g. lists of data tuple references, are not considered.
A rolling entry point in the new-to-old ordered version chain (as listed in Table 2.1) requires an updated physical reference for each modification – i.e. whenever successor versions are appended, records are physically moved or garbage collection is performed. Modifications to indexed attribute values become even more complex, as a new index record is inserted and the entry point reference of the predecessor index record has to be fixed. In order to avoid expensive index maintenance operations, some DBMS [DFI+13; Pos21] sacrifice the base table characteristics from Table 2.1 and employ an old-to-new version ordering with in-place updates, in which case the entry point remains stable.

A second way is to implement an indirection layer between index and base table. Each data tuple record is augmented with a unique tuple identifier (Virtual Tuple Identifier – VID), which is stored in the reference pointer of the index record. An index operation resolves the VID by means of a mapping table in order to locate the physical entry point of a data tuple (as depicted in Figure 2.9). An indirection layer is able to reduce index maintenance costs, but it requires additional processing and recovery-intensive main-memory data structures. With respect to the decentralized computing models of modern hardware, it is challenging to maintain an indirection layer. Furthermore, in a key-sorted structure, modifications to an indexed attribute are not handled by the indirection layer and require an index update – resulting in expensive maintenance and random write I/O patterns.

With current indexing techniques, there is no convenient way to simultaneously obtain a beneficial append-only sequential write pattern for index structures and base table data (as listed in Table 2.1) on secondary storage devices.

2.4.1. Version-oblivious Index Equality and Range Search

Indexes in modern multi-version DBMS are version-oblivious in the sense that transaction timestamps are missing (depicted in Figure 2.9). This lack of information has serious consequences and can negate the effectiveness of additional access paths, due to lowered selectivity and read amplification (RA) for visibility checking. With regard to Figure 2.10, a small scenario is given.

Figure 2.10.: Index search in a Multi-Version DBMS and the resulting Read Amplification (RA) on Base Tables [RVGP20]. A: logical view on table R; B: physical storage of the version records on heap pages; C: B+-Tree index idx on attribute a; D: long-running transaction $TX_R$ executing SELECT COUNT(*) FROM R WHERE a <= 10 (costs: index scan + 4 table pages / random I/Os, result: 1); E: version visibility check.

In Figure 2.10-A (built upon the example in Figure 2.5), a logical view on a base table R is given.
Attribute a of tuple t is modified several times by different transactions ($TX_{u0}$ – $TX_{u3}$), resulting in the physically materialized version records t.v0 – t.v3 with the attribute values 7, 3, 1 and 9. An additional tuple y is inserted to clarify the physically materialized storage. Figure 2.10-B depicts an unclustered tuple version storage on random base table pages of a heap structure, since tuple versions are independent record entities. Previously, a key-sorted B+-Tree [BM70] secondary index (outlined in Section 3.1.1) was built upon the attribute a (Figure 2.10-C). Its index records reference the tuple version records in the unclustered heap. Figure 2.10-D depicts a long-lasting analytical HTAP transaction $TX_R$, which started before $TX_{u1}$, $TX_{u2}$ and $TX_{u3}$. The execution planner and optimizer of the DBMS query evaluation engine decide to use the index from Figure 2.10-C. In order to find all tuple version records which are visible to $TX_R$, the index is traversed and the data tuple reference is gathered for each matching index record. The referenced tuple version records are fetched from their unclustered physical storage locations (Figure 2.10-B) and returned to the MVCC visibility check in Figure 2.10-E. The visibility check forces massive read amplification (RA) in order to gather the tuple version record timestamps required for the snapshot calculations.

2.5. Improving Index Search Operation in Multi-Version DBMS

Optimal MVCC designs for base tables (as listed in Table 2.1) efficiently leverage the characteristics of secondary storage devices, especially for the massive data amounts generated by modern workloads and relatively sparse main memory. DBMS apply several additional access paths for various purposes. Indexes require frequent maintenance for correctness; however, traditional approaches neglect the ideal storage characteristics and become ineffective. This behavior is based on the following indications:

Expensive strictly Key-Sorted Organization. While this index record organization enables fast search operations, its maintenance results in random write I/O patterns and high write amplification (WA).

Reference to Tuple Version Records. Rolling entry points to version chains amplify the maintenance costs.

Reduced Selectivity. Downstream visibility checks cascade into high read amplification (RA), as tuple version records are located on multiple base table pages.

Figure 2.11.: Correlation of Version Chain Length and Response Times (response time in ms over the number of successor version records).

Figure 2.12.: Index-Only Visibility Checks (A) avoid the expensive RA of principally unnecessary base table page fetches (B).

Additional access paths should allow a key-sorted structure to leverage the characteristics of modern storage devices, whereby a sequential write pattern with low WA must be enabled.
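To make the contrast of Figure 2.12 concrete, the following C++ sketch compares a version-oblivious index scan, which has to fetch a base table page for every candidate before its visibility can be checked, with an index-only variant whose records carry timestamps. All types and interfaces (IndexRecord, BaseTable, ...) are hypothetical assumptions for illustration and do not describe the concrete design proposed later in this thesis.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

using Timestamp = std::uint64_t;
using RecordId  = std::uint64_t;

struct Snapshot {
    Timestamp ts;
    bool sees(Timestamp t) const { return t <= ts; }
};

// Version-oblivious index record: search key and base table reference only (Figure 2.9).
struct IndexRecord { std::int64_t key; RecordId rid; };

// Hypothetical version-aware index record that additionally carries timestamps.
struct VersionedIndexRecord {
    std::int64_t key; RecordId rid;
    Timestamp t_creation;
    std::optional<Timestamp> t_invalidation;
};

// Hypothetical base table: fetching a version record implies a (random) page read.
struct BaseTable {
    struct Version { Timestamp t_creation; std::optional<Timestamp> t_invalidation; };
    Version fetch(RecordId) const { return Version{0, std::nullopt}; } // placeholder for a heap page read
};

// Path B in Figure 2.12: every index candidate forces a base table fetch before the
// visibility check can be performed, which is the source of the read amplification.
std::vector<RecordId> scan_version_oblivious(const std::vector<IndexRecord>& candidates,
                                             const BaseTable& table, const Snapshot& snap) {
    std::vector<RecordId> result;
    for (const auto& c : candidates) {
        BaseTable::Version v = table.fetch(c.rid);              // expensive random I/O
        if (snap.sees(v.t_creation) &&
            (!v.t_invalidation || !snap.sees(*v.t_invalidation)))
            result.push_back(c.rid);
    }
    return result;
}

// Path A in Figure 2.12: the index-only visibility check decides from the index
// records alone; only actually visible versions would ever be fetched afterwards.
std::vector<RecordId> scan_index_only(const std::vector<VersionedIndexRecord>& candidates,
                                      const Snapshot& snap) {
    std::vector<RecordId> result;
    for (const auto& c : candidates)
        if (snap.sees(c.t_creation) &&
            (!c.t_invalidation || !snap.sees(*c.t_invalidation)))
            result.push_back(c.rid);
    return result;
}
```

The first variant pays one random base table read per candidate (the RA of Figure 2.10), whereas the second fetches only versions that are actually visible.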
Once an indexing structure fulfills these requirements, it is powerful enough to maintain the amplified emergence of tuple version records without the need for indirection layers, which break the optimal MVCC design characteristics or are not suitable for decentralized computing models. Since every tuple version record is physically referenced in such a powerful index structure, the only remaining reason for processing version chains in base tables is the visibility check – which causes massive RA due to sequential processing along large tuple records for the relatively small required timestamp information. Figure 2.11 depicts the linear relationship between version chain length and response time when using a B+-Tree index in an HTAP scenario. In contrast, an increased RA in the indexing structure is negligible, as the required information is space-efficiently packed and the asymmetry of storage devices allows fast and parallel read performance. As part of the search operation, a version-aware index structure immediately determines the matching tuple versions which are visible to a transaction snapshot by a highly selective index-only visibility check (as depicted in Figure 2.12).

Chapter 3

Related Work

In Chapter 2, the relevance of high-performance storage and index management structures was exposed. The workload characteristics of modern applications lead to high transactional pressure and rising data volumes in DBMS. MVCC enables high data quality in HTAP query processing. Furthermore, MVCC designs (as listed in Table 2.1) leverage the storage characteristics of Flash-based devices. However, index structures for additional access paths must withstand the amplified maintenance effort, as it is desirable to index any potentially visible tuple version. In order to obtain a minimal read amplification (RA) on large and randomly spread records in base tables, an ideal index structure cheaply determines the tuple version record which is visible to a transaction snapshot – i.e. the RA on base tables is potentially reduced by up to a tuple's version chain length at similar index search effort.

In this chapter, a brief overview of the ubiquitous state-of-the-art B+-Tree indexing structure is given. Furthermore, current approaches are introduced which partially provide the required capabilities. In addition, an overview of the designated DBMS and K/V-Store storage structures Snapshot Isolation Append Storage (SIAS) [Got16; GPHB17] and WiredTiger [Mon21] is given. It is shown that the structure of Partitioned B-Trees exhibits potentially the best properties to fulfill all requirements for indexing massive amounts of tuple version records on modern hardware.

3.1. Storage and auxiliary Index Management Structures

Storage and index management structures allocate information in records, which are located in pages / buffers (compare [LHKN18]). These levels of abstraction unify block-based access to secondary storage media via buffer managers, among other reasons. The levels of abstraction are the same for storage and index management structures; the information contained in the records differs. Whereas storage management structures in this thesis are meant for the straight representation of tuple versions in records, index management structures provide an efficient additional access path to the desired tuple version records in the storage structures. Hence, the major differentiation between both structures lies in their application.
Most key-sorted structures are based on the ubiquitous B+-Tree, which is applied as the state-of-the-art indexing structure in many DBMS. Hence, a brief overview is given in the following.

3.1.1. Ubiquitous B+-Tree

B+-Trees are well known and the most common indexing structure in DBMS. Several structures involve aspects of them or are fully built upon them, including the basic structure applied in this thesis (Partitioned BTrees [Gra03], compare Section 3.1.4), wherefore a brief overview is given. They enable several modification operations on data, i.e. insert, update and delete, as well as efficient data retrieval by equality and range search, due to a preserved native sort order in a balanced tree structure, with costs based on the logarithmic height resulting from high fan-out index nodes [BM70; Gra11].

B+-Trees represent a balanced tree structure, meaning that each traversal operation passes through the same number of nodes, i.e. they ensure robust performance by the maximum height h of the tree. The size of a node can vary; generally it is a multiple of the applied transfer and storage size. The structure consists of different types of nodes. With regard to Figure 3.1, inner nodes (C), including the root (A), contain reference pointers (B) which indicate their child nodes (e.g. C is a child of A and D is a child of C). The actual payload is located in the leaf nodes (D)¹. Nodes possibly contain sibling pointers, whereby each level in the tree is comparable to a doubly linked list; however, this is not a requirement (e.g. the leaves D maintain sibling pointers, the inner nodes C do not).

Figure 3.1.: Schematic representation of a B+-Tree (compare [BM70; Gra11]).

The structure of each node is very similar, as depicted in Figure 3.1 (E). Nodes typically maintain a set of metadata, e.g. size, number of records, reference counters; perhaps sibling pointers (at the left and right end of the node); and some kind of alphanumerically sorted records (F). Nodes contain some blank space (G), which typically amounts to up to half of their capacity. Since records are sorted within a node, binary search enables efficient look-up. In inner nodes, records exhibit separator key characteristics: the child node of a separator key contains only records with keys equal to or greater than the separator itself, but smaller than the next separator key in the node. Leaves contain tuple or index records of the pattern {key, value}², depending on their application. A key consists of one attribute or a set of attributes (e.g. key k = {attr1, attr2, ..., attrn}), which are alphanumerically sorted with successive relevance. Therefore, a B+-Tree is able to efficiently search for a set of search key values of the pattern {attr1, attr2} or {attr1}, though not for {attr2} alone. If its application is data tuple storage, the value contains the set of attributes which are not part of the sort order; otherwise, in the case of indexing, the value references the tuple record id (or another reference type via an indirection layer, as outlined in Section 2.4) of the indexed base table tuple record.

¹ [BM70] considered the basic B-Tree to locate payload records also in inner nodes; however, nowadays it is common practice to locate all of them in leaf nodes [Com79]. B-Tree and B+-Tree became synonymous terms.
² Other record value organization variants, e.g. {key, value-list}, are not considered.
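The following short C++ sketch illustrates the look-up within a single sorted node with a composite key of two attributes; as noted above, a key prefix such as {attr1} can be searched by padding the remaining attributes with their minimum values, whereas {attr2} alone cannot. The record layout and names are illustrative assumptions only.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Index record of the pattern {key, value}: a composite key, alphanumerically sorted
// with successive relevance, and a reference to the indexed base table record.
struct IndexRecord {
    std::tuple<std::string, std::int64_t> key;   // e.g. {attr1, attr2}
    std::uint64_t record_id;                     // value: physical reference
};

// Binary search inside a sorted node for the first record whose key is >= search_key.
const IndexRecord* lower_bound_in_node(const std::vector<IndexRecord>& node,
                                       const std::tuple<std::string, std::int64_t>& search_key) {
    auto it = std::lower_bound(node.begin(), node.end(), search_key,
                               [](const IndexRecord& r,
                                  const std::tuple<std::string, std::int64_t>& k) {
                                   return r.key < k;
                               });
    return it == node.end() ? nullptr : &*it;
}
```

The same primitive is used level by level during traversal to select child pointers in inner nodes and to position equality and range searches in leaves.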
In order to provide a cost model, the following B+-Tree base operations are considered:

Equality Search. The B+-Tree is traversed from its only entry point (the root, A) to a leaf node (D). Thereby, the path follows the reference pointers level by level. Child nodes are selected by binary searching the search key in the parent node. An exact match of an index record key in a leaf node indicates a positive result of the equality search and the record value is returned. There can be quite a few index records matching an equality search key.

Range Search. A search interval is defined by a lower and an upper key bound. The lower search key is processed as defined for the equality search operation. Index records are then processed within the sorted set of a leaf node and across several leaves by following the sibling pointers, as long as the index record keys match the search interval. Values are processed and returned in the sort order of the tree.

Sorted Insert. An equality search operation is performed with the key of the insert record. Additional functionality like uniqueness constraints is cheaply examined as part of this equality search. The insert record is placed in the designated sorted leaf node. Thereby, existing index records are relocated to preserve the sort order, for which the blank space (G) is utilized. Overflowing nodes require additional maintenance – i.e. the node is split. Half of the records are moved to the new node, and the new node is propagated to the parent node, which can in turn cause an overflow in the parent node. Modifications have to be protected from side effects, e.g. by applying page locks.

Value Update. Updates to index records which only affect the value are performed in-place. Therefore, an equality search operation is performed and the value of the exactly matching index record is changed. Side effects have to be considered and avoided by locking or indirection layers, since inconsistencies must be prevented. Updates to search key values result in a costly multiphase deletion of the old record and an insertion of the updated index record.

Record Delete. An equality search operation is performed with the key of the record to delete. The exactly matching index record is removed from its location in the leaf node. Existing index records are relocated in order to preserve the sort order. If the blank space (G) exceeds a certain threshold, e.g. half of the node size, nodes underflow and are merged and rebalanced. The correctness of the separator keys in the parent nodes must be guaranteed. Modifications have to be protected from side effects.

With focus on I/O latencies, the costs of the considered base operations are derived from the following model.

Cost Model. Basic operations in B+-Trees are subject to a logarithmic complexity, as the traversal costs depend on the maximum height $h_{BT}$. The height of the tree depends on the average fan-out $F$ of the inner nodes and the number of required leaf nodes $L_N$, which store the $N$ index records. The number of required nodes is affected by the fill factor ($f_i$: fill factor of inner nodes, $f_l$: fill factor of leaves), which is the ratio between used and blank space. Hence, for B+-Trees exceeding the capacity of one leaf node it holds that (compare [Gra11]):

$h_{BT} = \lceil \log_{F \times f_i} \frac{L_N}{f_l} \rceil + 1$ , $F \neq 1$ , $L_N > 1$ , $f_i, f_l \in [0.5; 1]$ (3.1)

A common cost model for equality search in B+-Trees is defined as follows:

$s_{BT} \approx \log_2 N$ (3.2)

This cost model focuses on processing costs, assuming binary search.
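As a purely illustrative instantiation of Equations 3.1 and 3.2, the following snippet evaluates the tree height and the simple search cost for an assumed configuration; the fan-out, fill factors and record counts are arbitrary example values, not measurements from this thesis.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Assumed example parameters (not measured values).
    const double F  = 200.0;    // average fan-out of inner nodes
    const double fi = 0.67;     // fill factor of inner nodes
    const double fl = 0.67;     // fill factor of leaves
    const double N  = 1e8;      // number of index records
    const double leaf_capacity = 100.0;   // records per completely filled leaf node

    // Leaf nodes required at the given leaf fill factor.
    const double LN = std::ceil(N / (leaf_capacity * fl));

    // Equation 3.1: height of the tree (logarithm evaluated via change of base).
    const double hBT = std::ceil(std::log(LN / fl) / std::log(F * fi)) + 1.0;

    // Equation 3.2: simple binary-search processing cost over all records.
    const double sBT = std::log2(N);

    std::printf("leaf nodes = %.0f, height = %.0f, ~%.1f key comparisons per search\n",
                LN, hBT, sBT);
    return 0;
}
```

For these example values the tree has roughly 1.5 million leaves, a height of 4, and an equality search costs about 27 key comparisons – the point being that the height, and thus the I/O cost derived next, grows only logarithmically with N.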
Considering $h_{BT}$ of Equation 3.1, this model is expanded to the binary search costs in the inner nodes and one leaf node (compare [Gra11]):

$s_{BT} = \lceil \log_{F \times f_i} \frac{L_N}{f_l} \rceil \times \log_2 (F \times f_i) + \log_2 \left( \frac{N}{L_N \times f_l} \right)$ (3.3)

With reference to Equation 2.1, the major costs stem from data access on secondary storage devices. Replacing the negligible binary search costs $\log_2\{records\}$ with the I/O costs for reading a node, the cost model becomes a function of the read I/O $R_{I/O}$ and the height $h_{BT}$ of a B+-Tree. Considering the caching probabilities for inner nodes $p_{ic}$ and leaves $p_{lc}$, the read I/O can be reduced:

$s^{I/O}_{BT} \approx R_{I/O} \times \left( \lceil \log_{F \times f_i} \frac{L_N}{f_l} \rceil \times (1 - p_{ic}) + (1 - p_{lc}) \right)$ , $p_{ic} \gg p_{lc}$ (3.4)

It is more likely for inner nodes to be located in the main-memory cache, as buffer managers prefer frequently used nodes and inner nodes are accessed frequently as part of the traversal operation. Blank space (G) effectively reduces the tree fan-out, due to the decreasing fill factor $f_i$, and increases the number of necessary leaf nodes $L_N$ by the average fill factor – and thereby influences the space amplification and search performance, since less data is cached per node and $p_{ic}$ shrinks. However, blank space is beneficial for inserts, since maintenance operations such as node splits are reduced. Any modified node $MP_{mod}$ is asynchronously evicted and written from main memory to the secondary storage device at some point in time. Thereby, an insertion causes at least one write I/O $W_{I/O}$¹; however, due to escalating node splits, up to $2 \times h + 1$ modified nodes are possible, but unlikely.

$i^{I/O}_{BT} \approx s^{I/O}_{BT} + W_{I/O} \times MP_{mod}$ , $MP_{mod} \in \{1, 3, \ldots, 2 \times h + 1\}$ (3.5)

¹ The probability of multiple modifications per node is ignored, since the low main-memory buffer cache ratio and the massive amounts of data make this case very unlikely – except for serial insertions.

Deletions behave comparably to insertions, as underflowing nodes cause merges. In-place value updates do not cause maintenance operations, thus $MP_{mod}$ equals 1. Range search costs are equal to search costs, but require successive reads of leaves along the sibling pointers. Assuming massive amounts of data and high-rate continuous insertion workloads, it is most likely that only one modification (the '-1' in Equation 3.6) is buffered in main memory until a leaf node is evicted. The vast majority of a node's records ($\frac{N}{L_N \times f_l}$) is unchanged, but is rewritten nonetheless. Hence, the average write amplification WA of a modified node is:

$WA \approx \frac{N}{L_N \times f_l} - 1$ (3.6)

Suitability to Flash Characteristics. Basic B+-Trees are not capable of leveraging the characteristics of modern storage hardware. First, in order to keep the dataset sorted, B+-Trees perform updates in-place. Randomly modified nodes are asynchronously evicted from buffer management, whereby changes are persisted by rewriting the whole node; on Flash, pages thereby become randomly invalidated and are rewritten out-of-place. Second, modifications are manifold, as reference and sibling pointers, node splits and merges as well as the payload result in altered nodes. Third, marginal modifications require writing one or several whole nodes – resulting in very high WA. Fourth, average fill factors of approximately 0.67 [RG03] to 0.7 [Gra11] are common, whereby WA is amplified further. Last, even though the successive logarithmic search effort is very cheap, parallelism is not leveraged.

Multi-Version Capabilities. Multi-version capabilities require transaction timestamps for performing visibility checks.
Therefore, a basic B+-Tree would need to maintain creation and invalidation timestamps for each index record – ultimately resulting in amplified in-place updates, increased record sizes and a reduced fan-out. Moreover, current and obsolete data become intermingled, wherefore search performance shrinks. A very common practice is therefore to return a set of candidates, whereupon visibility is checked on the tuple version records in the base tables.

3.1.2. Structures with Multi-Version Capabilities

Several searchable structures have tackled the capability of multi-version storage and index management [MTT00; ST99]. Applications might maintain different tuple versions in order to perform MVCC or to provide constant snapshots for time travel querying. Query predicates and snapshot information constitute different dimensions in the search key attribute values, the transaction timestamps as well as possibly additional application-defined temporal dimensions. Well-known representatives are the Multi-Version B-Tree [BGO+96], the Time-Split B-Tree [LS90], MV-IDX [GGH+14] or the Bi-Temporal Timeline Index [KFM+15], just to name a few. These structures differ in maintenance effort and characteristics, space amplification (SA), searchability of the raised dimensions, linkage of related version records as well as general applicability [RVP17a]. The Bi-Temporal Timeline Index [KFM+15] enables multi-version capabilities very well in a linearly growing, log-based append-only approach of events (activation and invalidation) over transaction time, which are mapped to versions. Checkpoints