Visual Analysis of Spatio-temporal
Patterns in Digital Humanities
Data with Reproducible and
Confidence-aware Workflows

A dissertation accepted by the Faculty of Computer Science, Electrical
Engineering and Information Technology of the University of Stuttgart

in fulfillment of the requirements for the degree of
Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by

Max Franke
from Chemnitz

Main examiner: Prof. Dr. Thomas Ertl
Co-examiner: Prof. Dr. Wolfgang Aigner
Date of the oral examination: June 12, 2024

Institute for Visualization and Interactive Systems
of the University of Stuttgart

2024


Contents

List of Figures vi

List of Tables ix

List of Abbreviations and Acronyms xi

Acknowledgments xvii

Abstract xix

Zusammenfassung (German Abstract) xxiii

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Contributions and Remaining Structure . . . . . . . . . . . . 5

2 Foundations 9
2.1 Interactive Information Visualization and Visual Analysis . . 9

Basic Principles of Information Visualization . . . . . . . . . 9
Visualization Domains . . . . . . . . . . . . . . . . . . . . . . 12
Common Information Visualization Concepts . . . . . . . . 14

2.2 Data Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Dimensionality Reduction . . . . . . . . . . . . . . . . . . . . 17
Geographical Projections . . . . . . . . . . . . . . . . . . . . 18

2.3 Statistical Evaluation . . . . . . . . . . . . . . . . . . . . . . . 21
2.4 Visualization in the Digital Humanities . . . . . . . . . . . . . 23
2.5 Technologies Used . . . . . . . . . . . . . . . . . . . . . . . . . 26

3 The “Dhimmis & Muslims” Project 29
3.1 Project Description . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Collaboration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.3 Results: The Damast System . . . . . . . . . . . . . . . . . . . 34
Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Interactive Visualization . . . . . . . . . . . . . . . . . . . . 42
Persisting Analysis Results in Textual Reports . . . . . . . . 50
Supported Workflows . . . . . . . . . . . . . . . . . . . . . . 52
Example Analysis . . . . . . . . . . . . . . . . . . . . . . . . 54
Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4 Reproducibility and Confidence in the Digital Humanities 57
4.1 Reproducibility in the Digital Humanities . . . . . . . . . . . . 58

Reproducibility Typology and Pipeline . . . . . . . . . . . . 60
Reasons for Reproducibility . . . . . . . . . . . . . . . . . . 62
Strategies to Enable Reproducible Visualizations . . . . . . 64
Reproducibility Beyond the DH . . . . . . . . . . . . . . . . 68

4.2 Confidence in the Digital Humanities . . . . . . . . . . . . . . 68
Confidence as a Primary Data Attribute . . . . . . . . . . . 70
Confidence and Text Data . . . . . . . . . . . . . . . . . . . 74
Incomplete and Missing Data . . . . . . . . . . . . . . . . . 75

5 Analysis of Spatial Data 79
5.1 Spatial Aggregation . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Non-contiguous Maps, Transitions, and Projections . . . . . . 85

Non-contiguous Maps for Non-uniform Spatial Distributions 86
Animated Transitions Between Spatial Viewpoints . . . . . 95
Spatial Projections for Animated Transitions . . . . . . . . . 103

5.3 One-dimensional Projections . . . . . . . . . . . . . . . . . . . 113
Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Comparison of Projections . . . . . . . . . . . . . . . . . . . 116

5.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

6 Analysis of Temporal Data 123
6.1 Visualizing Periodicity in Event Data . . . . . . . . . . . . . . 124

Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Pre-calculation and Guidance . . . . . . . . . . . . . . . . . 128
Visual Representation . . . . . . . . . . . . . . . . . . . . . . 130
Visually Mapping the Phase . . . . . . . . . . . . . . . . . . 132
Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

6.2 Visual Representations of Time . . . . . . . . . . . . . . . . . 137
Visualization of Time Series Data . . . . . . . . . . . . . . . 137
Binned Representations . . . . . . . . . . . . . . . . . . . . . 149
Qualitative and Quantitative Visualization . . . . . . . . . . 151

Color . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

7 Analysis of Spatio-temporal Data 155
7.1 Spatio-temporal Patterns . . . . . . . . . . . . . . . . . . . . . 156

Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
Hotspots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . 160
Trendsetters . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
Spatial Expansion . . . . . . . . . . . . . . . . . . . . . . . . 161
Synchronous Movement . . . . . . . . . . . . . . . . . . . . 161
Coexistence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

7.2 Integrated Visualization of Space and Time . . . . . . . . . . 162
LilyPads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Mapping Periodicity Phase to Color or Shape in a Map . . 165
Mapping Space-Time to 2D . . . . . . . . . . . . . . . . . . . 167

7.3 Separated Visualization of Space and Time . . . . . . . . . . 179
7.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

8 Conclusion 185
8.1 Summary of Chapters . . . . . . . . . . . . . . . . . . . . . . . 185
8.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

Research question 1 . . . . . . . . . . . . . . . . . . . . . . . 187
Research question 2 . . . . . . . . . . . . . . . . . . . . . . . 188
Research question 3 . . . . . . . . . . . . . . . . . . . . . . . 188
Research question 4 . . . . . . . . . . . . . . . . . . . . . . . 189

8.3 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . 189

References 193
Own Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193

Peer-reviewed Publications . . . . . . . . . . . . . . . . . . . 193
Other Publications . . . . . . . . . . . . . . . . . . . . . . . . 194

Literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196


List of Figures

2.1 Visualized pixel density in print and on a computer monitor . . 10
2.2 Information visualization pipeline (Card et al., 1999) . . . . . . 11
2.3 Visual analytics model (Keim et al., 2008) . . . . . . . . . . . . . 14
2.4 Simplified sensemaking model (Pirolli & Card, 2005) . . . . . . . 16
2.5 Example Mercator projection and WebMercator tiles . . . . . . . 19
2.6 Examples for azimuthal and two-point equidistant projection . 20
2.7 Examples for confidence intervals and their pairwise differences 22
2.8 VarifocalReader (Koch et al., 2014) . . . . . . . . . . . . . . . . . 24
2.9 VAiRoma (Cho et al., 2016) . . . . . . . . . . . . . . . . . . . . . . 25

3.1 Screenshot of Damast v0.2.0 . . . . . . . . . . . . . . . . . . . . . 33
3.2 Abstracted data schema for the “Dhimmis & Muslims” project . 36
3.3 GeoDB-Editor: tabular data entry in Damast . . . . . . . . . . . 39
3.4 Annotator: text-based data entry in Damast . . . . . . . . . . . . 40
3.5 Visualization component of the Damast system (v1.3.0) . . . . 42
3.6 Visual map glyph design in the Damast system . . . . . . . . . . 43
3.7 Tooltips for details on demand in the Damast visualization . . . 44
3.8 The religion view of Damast . . . . . . . . . . . . . . . . . . . . . 45
3.9 The map view and location list of Damast . . . . . . . . . . . . . 46
3.10 The qualitative and quantitative timelines of Damast . . . . . . 47
3.11 The source and tag view of Damast . . . . . . . . . . . . . . . . . 49
3.12 The Damast system’s visualization in confidence mode . . . . . 50
3.13 Extract of a Damast report . . . . . . . . . . . . . . . . . . . . . . 51
3.14 Workflows supported by the Damast system . . . . . . . . . . . 52
3.15 Damast timeline for example analysis . . . . . . . . . . . . . . . 54

4.1 Reproducibility pipeline . . . . . . . . . . . . . . . . . . . . . . . . 61
4.2 The historical data processing pipeline . . . . . . . . . . . . . . . 70
4.3 Example of confidence filters in Damast . . . . . . . . . . . . . . 73
4.4 Explicitly showing filtered-out data in Damast . . . . . . . . . . 76

5.1 Inhomogeneous distribution of cities in Europe . . . . . . . . . 81

5.2 Unaggregated and aggregated map in Damast . . . . . . . . . . 82
5.3 Details-on-demand for aggregated location data in Damast . . 83
5.4 Examples for inset, offset, and proxy maps . . . . . . . . . . . . 86
5.5 The interface of LilyPads . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Word cloud in LilyPads . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.7 Examples for map insets in LilyPads . . . . . . . . . . . . . . . . 90
5.8 Examples for tooltips with details on demand in LilyPads . . . . 92
5.9 Selections of multiple items in LilyPads . . . . . . . . . . . . . . 94
5.10 Parts of a geo-located graph with inhomogeneous distribution 96
5.11 Placement of proxy maps for off-screen positions . . . . . . . . 97
5.12 Screenshot of our ego-perspective visualization approach . . . 98
5.13 Animated ego-perspective transition in a geo-located network 99
5.14 Selection of most relevant proxy maps during animated transition 102
5.15 Comparison of geodetic lines in different map projections . . . 104
5.16 Example projections for TPEQD for different distances . . . . . 105
5.17 Direction task of the projection comparison study . . . . . . . . 107
5.18 Example stimulus task for the projection comparison user study 108
5.19 Evaluation results for the projection comparison user study . . 110
5.20 Pairwise differences for the projection comparison user study . 111
5.21 Schematic workflow of hierarchical projection of space to 1D . 115
5.22 Visual support to compare projections from geo-space to 1D . 117
5.23 Different spatial distributions before 1D projection . . . . . . . . 119

6.1 Components of the periodicity visualization approach . . . . . . 127
6.2 Schematic explanation of aggregated periodicity visualization 129
6.3 Example patterns for periodicity in different representations . . 130
6.4 Example visual mappings of phase in a scatter plot . . . . . . . 133
6.5 Mapping of year and day of year to space in a scatter plot . . . 134
6.6 Time series visualization comparison user study example stimuli 140
6.7 Time series study accuracy CIs . . . . . . . . . . . . . . . . . . . 143
6.8 Time series study completion time CIs . . . . . . . . . . . . . . . 144
6.9 Time series study pairwise accuracy differences . . . . . . . . . 145
6.10 Time series study pairwise completion time differences . . . . . 146
6.11 Color mappings in LilyPads . . . . . . . . . . . . . . . . . . . . . . 153

7.1 Spatio-temporal pattern examples in LilyPads (1) . . . . . . . . 163
7.2 Spatio-temporal pattern examples in LilyPads (2) . . . . . . . . 164
7.3 Spatio-temporal pattern examples for periodicity . . . . . . . . 166
7.4 Screenshot of approach where geo-space is mapped to 1D . . 169
7.5 Synthetic expansion pattern with 1D spatial projection . . . . . 171
7.6 Synthetic trendsetter pattern with 1D spatial projection . . . . 172
7.7 Spatio-temporal pattern examples for COVID-19 data . . . . . . 174

7.8 Spatio-temporal pattern examples in wildfire data (1) . . . . . . 176
7.9 Spatio-temporal pattern examples in wildfire data (2) . . . . . . 177
7.10 Spatio-temporal pattern examples in wildfire data (3) . . . . . . 178
7.11 Spatio-temporal pattern examples in Damast (1) . . . . . . . . . 180
7.12 Spatio-temporal pattern examples in Damast (2) . . . . . . . . . 181
7.13 Spatio-temporal pattern examples in Damast (3) . . . . . . . . . 182

Note on figure labeling. Figures are numbered within chapters (e.g.,
figure 2.3 in chapter 2). Some figures consist of two or more subfigures.
Subfigures are captioned separately and are numbered within the figure with
lowercase letters (e.g., figure 5.8b). In addition, points or regions of interest
within figures may be marked with numbered labels for later reference. These
labels use Arabic numerals and are referenced as follows: figure 3.5 label 3,
figure 7.4 label 8.


List of Tables

4.1 Reproducibility strategy aspect coverage . . . . . . . . . . . . . 65

5.1 Quality measures for 1D projections of example distributions . 120

6.1 Phase histograms and quality measures for example datasets 131

7.1 Examples for spatio-temporal patterns . . . . . . . . . . . . . . . 158


List of Abbreviations and
Acronyms

azeqd Azimuthal equidistant projection.
An azimuthal map projection which is defined by one geographical lo-
cation. Distances from this node to any point are represented without
distortion. This is a special case of the two-point equidistant projection
(tpeqd) where both nodes are identical, and is described in more detail
in section 2.2.2.

tpeqd Two-point equidistant projection.
A map projection which is defined by two geographical locations. Be-
tween these two nodes, the projection is not distorted, and distances be-
tween the nodes and any point are represented undistorted. Described
in more detail in section 2.2.2.

BCa bootstrapping Bias-corrected and accelerated bootstrapping.
A bootstrapping method that uses random sampling with replacement
of a known sample to estimate confidence intervals (Efron, 1987). See
section 2.3 for more details on statistical evaluation.
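The resampling idea behind this entry can be sketched in a few lines. The following is a minimal, illustrative sketch only, not the evaluation code used in this thesis: the function name bca_interval, the example sample, the number of resamples, and the fixed seed are all assumptions made for this illustration. It follows the textbook BCa construction, with a bias correction taken from the bootstrap distribution and an acceleration estimated from leave-one-out (jackknife) values.

```python
import random
from statistics import NormalDist, mean
from typing import Callable, List, Sequence, Tuple


def bca_interval(
    sample: Sequence[float],
    statistic: Callable[[Sequence[float]], float] = mean,
    confidence: float = 0.95,
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Estimate a BCa bootstrap confidence interval for `statistic`."""
    data: List[float] = list(sample)
    n = len(data)
    rng = random.Random(seed)  # fixed seed (assumption) for repeatable results
    normal = NormalDist()
    observed = statistic(data)

    # Bootstrap distribution: resample with replacement, recompute the statistic.
    boot = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_resamples))

    # Bias correction z0: share of bootstrap estimates below the observed value.
    below = sum(1 for value in boot if value < observed)
    share = min(max(below / n_resamples, 1e-6), 1.0 - 1e-6)  # keep inv_cdf finite
    z0 = normal.inv_cdf(share)

    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jackknife = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    center = mean(jackknife)
    numerator = sum((center - value) ** 3 for value in jackknife)
    denominator = 6.0 * sum((center - value) ** 2 for value in jackknife) ** 1.5
    a = numerator / denominator if denominator else 0.0

    def adjusted_index(quantile: float) -> int:
        # Shift the nominal quantile by bias correction and acceleration.
        z = normal.inv_cdf(quantile)
        adjusted = normal.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        return min(n_resamples - 1, max(0, int(adjusted * n_resamples)))

    alpha = (1.0 - confidence) / 2.0
    return boot[adjusted_index(alpha)], boot[adjusted_index(1.0 - alpha)]


# Hypothetical measurements, e.g., task completion times in seconds.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 15.0, 21.0]
low, high = bca_interval(sample)
```

For skewed samples such as the one above, the BCa interval is typically not symmetric around the sample mean, which is the main reason to prefer it over a plain percentile bootstrap.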

t-SNE t-distributed stochastic neighbor embedding.
A non-linear dimensionality reduction (DR) technique introduced by
van der Maaten and Hinton (2008) that works well for high-dimensional
data with clear clusters.

1D One-dimensional.
A space with only one coordinate. In visualization research, time is a
typical one-dimensional data space.

2D Two-dimensional.
A space spanned by two coordinates. In visualization research, many
spatial (and non-spatial, see section 2.2.1) datasets are simplified to a
2D space, because conventional displays are two-dimensional.

3D Three-dimensional.
A space spanned by three coordinates. In visualization research, many
datasets based in the real world (e.g., simulation data or computer to-
mography scans) are three-dimensional in their spatial attributes (see
section 2.1.2).

AHC Agglomerative hierarchical clustering.
A bottom-up clustering approach described in detail by Sibson (1973). In
each step, the two closest clusters are merged into one based on a given
distance metric and linkage criterion, until only one cluster remains.
The resulting binary tree can be cut at an arbitrary height to produce a
set of clusters at least a given distance apart.
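The merge loop described in this entry can be illustrated with a deliberately naive sketch. Everything here is an assumption made for illustration: the function names, the Euclidean distance metric, the single-linkage criterion, and the example points. Stopping the loop once the closest pair of clusters is farther apart than cut_distance yields the same clusters as cutting the full single-linkage tree at that height. Efficient algorithms avoid the repeated pairwise search used here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def single_linkage(a: List[int], b: List[int], points: Sequence[Point]) -> float:
    """Distance between two clusters: the closest pair of members."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerative_clusters(
    points: Sequence[Point], cut_distance: float
) -> List[List[int]]:
    """Merge the two closest clusters until all are farther apart than the cut."""
    clusters: List[List[int]] = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the closest pair of clusters (naive search over all pairs).
        best_distance, best_x, best_y = math.inf, -1, -1
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                distance = single_linkage(clusters[x], clusters[y], points)
                if distance < best_distance:
                    best_distance, best_x, best_y = distance, x, y
        if best_distance > cut_distance:
            break  # equivalent to cutting the tree at height cut_distance
        clusters[best_x] = clusters[best_x] + clusters[best_y]
        del clusters[best_y]
    return [sorted(cluster) for cluster in clusters]


# Hypothetical example: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
groups = agglomerative_clusters(points, cut_distance=2.0)
```

With a cut distance of 2, the two pairs are merged and the isolated point stays alone. A sufficiently large cut distance merges everything into one cluster.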

API Application programming interface.
A defined interface between two parts of the same program, or between
different programs.

cDMD Constrained dynamic mode decomposition.
An extension of dynamic mode decomposition (DMD) proposed by
Krake et al. (2022), in which additional constraints regarding the
expected period length can be specified.

CI Confidence interval.
An interval estimated from a sample that is expected to contain the true
value of a parameter (e.g., the population mean) with a given confidence
level, often 95%; see section 2.3.

COVID-19 Coronavirus Disease 2019.
A disease caused by the SARS-CoV-2 virus which led to a pandemic
starting in March 2020 (WHO, 2020).

CVD Color vision deficiency.
Also “color blindness”. CVD describes a reduced capability to perceive
colors and differences between colors. The most common form of CVD
affects the ability to distinguish colors along the red-green axis.

DH Digital humanities.
A hybrid research field combining humanities research (e.g., history)
with digital methods and support. This opens up more quantitative, au-
tomated analyses on larger datasets to humanities researchers.

DMD Dynamic mode decomposition.
An approach to extract dynamic temporal changes from sampled data
presented by Schmid (2010).

DOI Digital object identifier.
A unique identifier for the persistent identification of digital objects.
Examples are digital documents such as papers, and datasets in long-term
data repositories (LTRs). DOIs are a type of uniform resource identifier
(URI).

DoI Degree of interest.
A measure for the relevancy of different, related data items. Different
approaches can be taken to obtain such a measure; Furnas (1986)
introduced DoI measures to determine which other leaves and subtrees
of a hierarchy to show depending on a selected leaf node.

DOM Document object model.
The programmatic data model that underlies documents written in, for
example, extensible markup language, scalable vector graphics (SVG),
or hypertext markup language (HTML). Each node of the document tree
is represented by an object, and objects and hierarchies can be
manipulated programmatically.

DR Dimensionality reduction.
A family of techniques that project an n-dimensional dataset into an
m-dimensional space, m < n; see section 2.2.1.

DTW Dynamic time warping.
An algorithm that correlates and aligns two time series and measures
their similarity (Sakoe & Chiba, 1978).
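The alignment idea can be illustrated with the classic dynamic-programming formulation. This is a minimal sketch under assumptions: the function name dtw_distance is hypothetical, the local cost is the absolute difference between two samples, and no warping-window constraint is applied.

```python
import math
from typing import Sequence


def dtw_distance(s: Sequence[float], t: Sequence[float]) -> float:
    """Accumulated cost of the cheapest monotonic alignment of s and t."""
    n, m = len(s), len(t)
    # cost[i][j]: minimal accumulated cost of aligning s[:i] with t[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s[i - 1] - t[j - 1])  # local cost (assumption)
            cost[i][j] = local + min(
                cost[i - 1][j],      # s advances, t[j-1] is matched again
                cost[i][j - 1],      # t advances, s[i-1] is matched again
                cost[i - 1][j - 1],  # both series advance
            )
    return cost[n][m]


# The second series repeats one sample; warping absorbs the repetition.
same_shape = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

A smaller accumulated cost indicates more similar series. A cost of zero means that one series can be warped onto the other without any residual difference.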

FRP Fire radiative power.
The power emanated by a fire (e.g., a wildfire) in thermal radiation.
This is a common data point in satellite-based wildfire monitoring (e.g.,
Giglio et al., 2020), and is usually measured in megawatts.

GPU Graphics processing unit.
A dedicated piece of hardware in a computer that is optimized to do
low-level graphical computations quickly with high parallelism.

HTML Hypertext markup language.
A markup language which is used for web pages.

HTTP Hypertext transfer protocol.
A protocol for transferring data and files over a network. HTTP is
predominantly used (with and without an additional encryption layer) for
the communication between a web client (e.g., a browser) and a server.

LOD Level of detail.
In visualization, the level of detail in which information is shown de-
pends mostly on the available space: In a more restricted space, only
a rough overview of data can be shown. When a subset of the data is
selected (e.g., by zooming in), more details can be shown. See also se-
mantic zooming in section 2.1.1.

LTR Long-term data repository.
A repository for the long-term storage of digital data. Ideally, “long
term” means “indefinitely” here. LTRs are essential for long-term
reproducibility of scientific results. Examples for LTRs used alongside
the results presented in this thesis are Zenodo1 and the Data Repository
of the University of Stuttgart, Germany2 (DaRUS).

MCV Multiple coordinated views.
This concept from interactive visualization research describes a collec-
tion of two or more views, which each visualize different aspects of the
data. These views affect each other; that is, they are coordinated (see
section 2.1.3).

MDS Multidimensional scaling.
A linear DR method (see section 2.2.1).

NER Named-entity recognition.
A set of techniques for the identification and classification of named
entities in unstructured natural language text. Examples for named en-
tities are person or corporation names, places, and points in time.

NLP Natural language processing.
A sub-field of computer science and set of techniques concerned with
the computational parsing, understanding, and manipulation of natural
language (e.g., English).

NOAA National Oceanic and Atmospheric Administration.
An agency of the government of the United States of America that is
responsible for weather forecasts and the monitoring of the conditions
of oceans and atmosphere, among other things.

OCR Optical character recognition.
Automated methods that generate machine-readable text from optical
images (i.e., scans) of hand-written or printed documents.

1https://zenodo.org/
2https://darus.uni-stuttgart.de/

p.p. Percentage point.
A measure of the absolute difference between two values given in percentage
units. For example, for two values a = 20% and b = 10%, b is 50% of a,
but the difference from a to b is a reduction of 10 p.p., given in relation
to the whole that the percentage values refer to.
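The distinction between a change in percent and a change in percentage points can be checked with a short calculation. The function names below are hypothetical and chosen only for this illustration; the values are the ones used in the entry.

```python
def percentage_point_difference(a_percent: float, b_percent: float) -> float:
    """Absolute difference from a to b, in percentage points (p.p.)."""
    return b_percent - a_percent


def relative_change_percent(a_percent: float, b_percent: float) -> float:
    """Relative change from a to b, in percent of a."""
    return (b_percent - a_percent) / a_percent * 100.0


a, b = 20.0, 10.0  # example values, both given in percent
pp_change = percentage_point_difference(a, b)   # change in percentage points
rel_change = relative_change_percent(a, b)      # change in percent of a
share_of_a = b / a * 100.0                      # b expressed in percent of a
```

Going from 20% to 10% is thus a reduction of 10 p.p., but a relative reduction of 50%, which is why the two units must not be mixed when reporting differences between percentages.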

PCA Principal component analysis.
A linear DR technique that projects an n-dimensional dataset into the
m-dimensional subspace (m < n) spanned by the eigenvectors of the
dataset’s covariance matrix that belong to the m largest eigenvalues;
see section 2.2.1.

REST Representational state transfer.
An application programming interface (API) paradigm where paths de-
scribe resources via URIs. Actions on resources (e.g., hypertext transfer
protocol (HTTP) verbs like GET or DELETE) specify what should happen
to the referenced resource.

SOMs Self-organizing maps.
A non-linear DR technique introduced by Kohonen (1981) that works
well for small datasets with high dimensionality.

STL Seasonal trend decomposition based on Loess.
An approach to decompose a time series into a seasonal, a trend, and a
remainder (noise) component, proposed by R. B. Cleveland et al. (1990).

SVG Scalable vector graphics.
A markup language format for graphics that is based on the extensible
markup language (XML). Graphical primitives such as paths, rectan-
gles, or circles are declared as document object model (DOM) nodes.
SVG images are scalable, as opposed to raster image formats. They are
therefore useful on the web, where sizing of DOM elements is depen-
dent on the client’s viewport.

UI User interface.
The interface between a human user and a computer. Within this thesis,
the term refers to a graphical user interface that consists of
visualizations as well as control elements (e.g., buttons) that users can
interact with.

UMAP Uniform manifold approximation and projection for DR.
A non-linear, general-purpose DR technique introduced by McInnes
and Healy (2018) and McInnes et al. (2020).

URI Uniform Resource Identifier.
A unique string that identifies a resource. In digital humanities (DH)
data collections, entities are usually referenced by their URI. For
example, the city Aleppo has the path /place/18 in the Syriaca.org data
collection (Vanderbilt University et al., 2014), and the URI
https://syriaca.org/place/18.
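The relation between an entity path and its URI can be illustrated with Python's standard library. This is only a sketch: the helper entity_uri and the constant BASE are hypothetical names introduced here, and only the host and path from the example above are used.

```python
from urllib.parse import urljoin, urlsplit

BASE = "https://syriaca.org"  # host of the collection, taken from the example


def entity_uri(path: str) -> str:
    """Combine the collection's base address with an entity path."""
    return urljoin(BASE, path)


aleppo = entity_uri("/place/18")

# Splitting the URI again recovers its components.
parts = urlsplit(aleppo)
scheme, host, path = parts.scheme, parts.netloc, parts.path
```

Because the path alone is only unique within one collection, storing the full URI keeps references unambiguous when data from several collections is combined.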


Acknowledgments

This thesis and the research it presents would not have been possible without the
help of several people, to whom I am very grateful. First and foremost, I thank
my advisor Thomas Ertl for giving me the opportunity to do exciting research
for my doctoral degree. He always had critical questions and insights that helped
to refine and improve my work. I am also very grateful to Steffen Koch for his
continuous support and guidance. Our regular discussions were always interesting
and productive. Steffen’s input was essential to explore many fun ideas, and
to develop good ones further. I would like to thank both Thomas Ertl and Steffen
Koch for providing me with funding to visit various conferences in exciting places
around the world. Finally, I am grateful to Wolfgang Aigner for taking the time
to review my dissertation, and for traveling to Stuttgart to attend my defense.

I very much appreciate the support and assistance of my co-authors, without
whom this work would not have been possible: Ralph Barczok, Tanja Blascheck,
Markus John, Jana Keck, Moritz Knabben, Steffen Koch, Kuno Kurzhals, Julian
Lang, Henry Martin, Guido Reina, and Dorothea Weltecke. Many engaging
discussions with other colleagues and collaborators also helped shape and refine my
research. I want to thank especially Frank Heyen, Florian Jäckel, Daniel Klötzl,
Tim Krake, Michael Sedlmair, and Daniel Weiskopf. To the various colleagues
with whom I shared both on- and off-topic discussions during lunch and coffee
breaks (but in particular to Markus John and Franziska Huth, who endured
sharing an office with me), I want to extend my gratitude. I would also like to
acknowledge the contributions of several students who, over the years, helped
to explore research directions I did not have the capacity to pursue at the time. I
acknowledge, in particular, the exceptional work of Leon Gutknecht, Benjamin
Hahn, Alexandra Hirsch, Julian Lang, Alexander Riedlinger, Ingo Schwendinger,
Markus Stengel, Ba-Anh Vu, and Joel Waimer.

To my friends and family, I am eternally grateful for their continuous love
and support. I would not have come this far without the fascination for science
and critical thinking, computers, and cartography they infected me with. I also
want to thank my sensei, Sabine, for helping me find balance, both mentally and
physically. Finally, I appreciate all my favorite people and dogs who encouraged
me to take a break from time to time, explore the outdoors, and relax.

Legal acknowledgments. Most prototypes presented in this work used
map data, map tiles, or both from OpenStreetMap. The figures showing these
prototypes are captioned “Map tiles © OpenStreetMap contributors.”
OpenStreetMap™ is open data, licensed under the Open Data Commons Open Database
License (ODbL) by the OpenStreetMap Foundation.3 Other prototypes, where the
OpenStreetMap map tiles with modern-day political borders and infrastructure
were not appropriate, use map tiles with only a shaded relief. These were
created by me (Franke, 2024) based on NASA SRTM (National4 Aeronautics and
Space Administration Shuttle Radar Topography Mission) topographic height
data (NASA JPL, 2013), and are published under the Creative Commons
Attribution 4.0 license. In addition, the maps in these prototypes use vector map
material by Natural Earth5 for oceans, lakes, and rivers. The NASA SRTM data,
as well as the Natural Earth data, are in the public domain. The respective figures
are captioned “Map tiles © 2024 Max Franke.” Wikipedia and Wikidata were also
used as data sources for various figures and datasets. Wikidata’s data is published
under the Creative Commons CC0 License. Wikipedia’s text data is published
under the Creative Commons Attribution-ShareAlike 3.0 Unported License. GNU
Parallel (Tange, 2022) was used for parallelization and job control in various data
preprocessing steps described in this work.

3https://www.openstreetmap.org/copyright
4United States of America
5https://www.naturalearthdata.com/


Abstract

Recent decades have seen a consistent increase in the availability of digitized data,
as well as an increase in dataset volumes. The digital humanities (DH) emerged as
a consequence of this development and brought digital storage, automated analysis
methods, and digital presentation to humanities research fields such as literary
research and history. In DH data in particular, interpretation and human judgment
are essential to contextualize the data, which is produced and relayed by
humans, not sensors or simulations. Hence, the provenance and trustworthiness
of the data are essential information for an objective analysis. This is especially
true for historical data, where pieces of information can contradict each other,
sources may exaggerate or lie, and the data is generally inhomogeneous and
incomplete. The analysis of such data adds another level of human interpretation
and judgment, which need to be recorded to become part of later sense-making
and reproducibility efforts.

The human capability to make sense of data, and to recognize patterns and
structure in it, declines with increasing complexity and size of the dataset. By
transforming the data and representing it by visual structures, the powerful human
visual apparatus can be harnessed to alleviate these shortcomings. Even so,
the manageable complexity and dataset size are limited. Hence, the correct choice
of data transformations, visual mappings, and aggregation of details is essential to
support human sense-making. Besides the general visualization design, good support
of domain experts’ workflows is essential to encourage adoption of digital tools in
their day-to-day work. Further, finding good solutions for the digital and visual
representation of uncertain information and provenance data, as well as the recording
thereof, improves general data quality as well as trust in the data and findings.

Previous work within and outside of visualization research has already studied
the analysis of incomplete or uncertain data. In the DH, however, data often
has a unique combination of small dataset size and high data complexity.
This fact, in combination with the presence of contradictory statements in the
data, still requires novel approaches for faithful representation to support
objective analyses. Especially with this data nature in mind, analyses in the DH are
seldom linear. The use of incremental loops of data foraging, knowledge mining,
and sense-making has been explored in detail both inside and outside of
visualization research. Research in the humanities usually moves more carefully and on
a longer timescale. The established methods and workflows applied here need to
work long-term, and the introduction of new methods from the fast-paced
developments of the digital world still poses open challenges. In particular, provenance
and reproducibility both of data and of analyses need to be guaranteed on a longer
timescale to promote open science.

The high complexity of DH data also makes it more difficult to find interesting
insights about patterns and relationships in the data. Such patterns can translate
to visual patterns that can easily be recognized. Recent works have explored the
nature of DH data and tasks, but the nature of the data patterns is very specific to
the concrete domain and research questions. The complexity of the data poses a
challenge to the designs of visual representations that reveal these patterns
visually. Suitable visual transformations are particularly challenging for geographical
and geo-temporal data, where an inherent placement given by the data precludes
many layout techniques. Projection of the geographical space into less complex
spaces may offer ways to enhance the data layout and to reveal patterns. Much
research on map projections and dimensionality reduction already exists. However,
both the elicitation of patterns from the data and the communication of spatial
relations, direction, and distance to analysts in an intuitive manner in projected
representations are still under-explored.

This thesis presents strategies for the visual analysis of complex, incomplete,
and heterogeneous data often present in the DH. Here, the entire data lifecycle,
from data collection over visual analysis to the publication of results, is consid-
ered. In addition, the support of the backwards path through that lifecycle is con-
sidered as part of visualization approaches to provide data and analysis prove-
nance, foster trust in the results, and promote reproducible and open science.
To further aid more objective analyses, this work explores the use of qualitative
confidence measures to record the quality and trustworthiness as a coequal part
of the collected DH data, which can subsequently be used as part of the anal-
ysis. This thesis also explores the visualization and visual analysis of data with
spatio-temporal components, which are often found in the DH. Here, this work
proposes novel visualization techniques, as well as novel combinations of existing
techniques, to elicit hidden patterns of interest from the data to support domain
research questions. The use of separated and integrated representations of space
and time is examined. For the integrated representations, data transformations and
projections are used that reduce the complexity and the restrictions that
geographical data impose on the layout of the visualization. To this end, non-contiguous
maps and the use of map insets to solve challenges with the level of detail in
heterogeneous geographical distributions are employed. In addition, different ge-
ographical projections are compared regarding their suitability to communicate
the spatial relationships between data items. Further, this thesis explores projec-
tions of geographical space to various one-dimensional, discrete orderings to elicit
relationships between space and other attributes in a less complex layout, as well
as the reduction of the temporal component of event data to a space of period
length and phase to elicit hidden patterns of periodically recurring events.

The methods presented in this thesis were developed largely to support DH
researchers in their domains’ research questions, with the characteristics and vol-
umes of data typical in that field in mind. Still, these methods can be—and in
some cases have been—extended and adapted for different research fields. The
principles applied in the approaches presented within this thesis are, in general,
extensible and domain agnostic, given certain preconditions in the data.

Zusammenfassung

The size and availability of digital datasets have grown rapidly over the last
decades owing to the falling cost of storage and the rising computing power of
computers. This technical progress also led to the emergence of the digital
humanities (DH) and spurred the digitization of humanities content. The DH combine
research in the humanities with digital storage and analysis methods, which lets
humanities scholars analyze much larger collections of data at once. Recording the
provenance and trustworthiness of data can add value for unbiased analyses in all
disciplines, but especially in the DH: Here, the data were created by humans and
contain information about human works or societal aspects, and they are often
analyzed, interpreted, and assessed by humans several times over the course of their
creation. In historical data, which can be very inhomogeneous and incomplete to
begin with, individual data points are also occasionally contradictory, or reality
was, for various reasons, represented in a distorted or prejudiced way, up to
deliberately false statements in sources. Here, recording the provenance and
trustworthiness of the data, and their interpretation during data entry, is
particularly important to obtain an objective overall picture. The visualization and
analysis of the data are a further work step in which human interpretation comes
into play. To be able to retrace and reproduce the conclusions drawn from such
analyses, it is therefore also important for this process to record and store core
aspects such as data filters and visual parameters.

Without a suitable representation, it is difficult to recognize and understand
patterns and relationships in larger or more complex datasets. The human visual
system, however, has evolved over millions of years to detect regularities and
outliers in static and moving images subconsciously. Automated visualization
translates data into visual structures in which characteristics of the data are
visually recognizable. This way, considerably larger and more complex datasets can
still be analyzed and understood. A suitable choice of data transformations, of the
mapping to visual primitives, and of the aggregation of details is important,
however, because scalability is limited. To ease the adoption of digital methods
into the domain experts' daily work, it is also important to understand and
complement their requirements and existing workflows rather than trying to replace
them. In the DH in particular, a suitable digital and visual representation of the
trustworthiness of the data and of their provenance, from their creation to their
interpretation and digital entry by the domain experts, is essential. A workflow in
which the provenance and trustworthiness of the data are recorded consistently
increases the quality of the data in the long term, and thereby also the trust in
the data and in the insights gained from them.

Research both within and outside of visualization has investigated the analysis of
incomplete or uncertain data. Existing solutions are, however, difficult to apply to
DH data because the datasets, while comparatively small, are very complex and,
especially for historical data, in part contradictory and inhomogeneous. For
analyses that are as unbiased as possible, novel approaches are needed here to
adequately convey these shortcomings of the data. Owing to the aforementioned
properties of DH data, analyses are rarely linear here. The use of successive cycles
of data foraging, knowledge enrichment, and sensemaking has been researched in
detail both within and outside of visualization. Research in the humanities often
has a long time horizon with established workflows that need to function in the long
term. New working methods from the rapidly advancing digital world still lead to
challenges here, especially if existing workflows are to be complemented rather than
replaced. For research that remains open and accessible in the long term, the
provenance, traceability, and reproducibility of data and analyses are essential.

The visual representation of the data does make it easier for humans to recognize
patterns in the data, but this is in turn made harder by the high complexity of DH
data. The characteristics of the data and the problems to be solved in the DH have
already been researched thoroughly. However, there are hardly any general solutions
because the DH are a very broad research field, and the data and research questions
depend strongly on the respective field of research. Visual transformations that
reveal patterns in the data are particularly difficult to realize for geographical
and geo-temporal data because the intrinsic geographical positioning of the data
severely constrains the layout of the visualization. By projecting the geographical
space into a less complex space, the layout of the visualized data can be improved,
which makes patterns more visible. Map projection and data projection are
well-researched fields. Regarding the visibility of patterns and the intuitively
understandable representation of spatial relations, direction, and distance,
however, there is still potential for further research.

This thesis presents strategies for the visual analysis of complex, incomplete, and
heterogeneous data as they often occur in the DH. The entire lifecycle of the data,
from data entry over their visual analysis to the publication of the analysis
results, is considered. In addition, the backwards path through the provenance of
the data is regarded as part of the visualization approach. This allows the origin
and development of data and analyses to be retraced and reproduced, to strengthen
trust in the analysis results and to promote reproducible and open research. To
support more objective analyses, this thesis also investigates qualitative
confidence values to take the quality and trustworthiness of the collected DH data
into account as coequal data attributes. Such attributes can thus be used as part of
the analysis.

This thesis also investigates the visualization and visual analysis of data with
spatio-temporal references, which often play an important role in DH data. In this
context, this thesis proposes novel visualization approaches, as well as
combinations of existing approaches, to make hidden, interesting patterns and
relationships in the data visible and to support domain experts in answering
research questions. Separated and integrated representations of space and time are
examined. For the integrated representations, data transformations and projections
are employed that reduce the complexity and the layout restrictions in the
visualization of geographical data. To this end, fragmented maps and embedded map
insets with a higher level of detail are used to solve problems with the
representation of details in data with a heterogeneous geographical distribution. In
addition, different geographical projections are compared with regard to how well
they can communicate the spatial relationships between data points. Further, this
thesis presents projections of geographical data into various one-dimensional
orderings to make relationships between the spatial and other data attributes more
visible in a simpler layout. To find hidden periodic recurrence patterns of events,
the reduction of the temporal component of event data to a space of period length
and phase is investigated as well.

The methods presented in this thesis were developed mainly for application in the DH
and for research questions of the DH. They are therefore designed for the dataset
sizes and characteristics typical in the DH. They can, however, also be extended and
adapted to research questions from other fields, which has in part already been
done. The principles used in the approaches presented in this thesis are
transferable to further application areas, provided that their requirements
regarding objectives as well as data volume and complexity are comparable.

1 Introduction

This chapter outlines the advantages of interactive visualization to assist human
analysts in understanding data that is either large in volume, complex, or both.
Two digital humanities (DH) project collaborations that were part of the work
leading up to this thesis are presented. Based on these two projects, typical do-
main problems in the DH that interactive visualization can help solve are pres-
ented. The chapter further contains a brief summary of the state of the art in
visualization research, as it pertains to those domain problems. From this, four
research questions are formulated. Finally, an outline of the structure of the
remainder of the thesis is given.

1.1 Motivation
The human capabilities to understand data quickly reach their limits for larger
datasets, or more complex datasets with many attributes. To support humans
in understanding the data, visualization can be used. In visualization, data is
mapped to visual structures (Bertin, 1983; Card et al., 1999). The human visual
system, which has evolved to recognize visual patterns effortlessly (Wertheimer,
1923; Healey & Enns, 2011), can then recognize interesting phenomena much more
easily, even in larger datasets, given the right visual mappings. The scalability of
visualization to aid human cognition can be increased further by introducing
interaction on digital media: At first, an aggregated overview of the data is
visualized. Analysts can then narrow down the data by specifying filters interactively, and query
details for individual data items (Shneiderman, 1996). The correct choice of visual
mappings, and the—possibly computer-aided (Keim et al., 2008)—preprocessing
of the data can further assist the human analysts (see figure 2.3). This makes in-
teractive visualization a useful technique for understanding relationships and pat-
terns in large datasets.

Large parts of this thesis were created within the scope of projects that were
collaborations with DH researchers. A shorter collaboration occurred within the
“Oceanic Exchanges” project (Cordell et al., 2022). Here, the research focus was on
the spread of information and misinformation in the 19th and early 20th century,
before and during the construction of the first trans-oceanic telegraph cables. This
spread of information was observed through the lens of historical newspaper arti-
cles, focusing on a handful of historical events such as the eruption of the Kraka-
toa volcano in 1883, the outbreak of the Spanish-American war in 1898, or the
political propaganda tour of the Hungarian revolutionary Lajos Kossuth through
the United States of America in 1851 and 1852. The second and larger collabora-
tion occurred within the “Dhimmis & Muslims” project (Weltecke & Koch, 2018).
The research focus in this project was on the peaceful coexistence of different non-
Muslim religious groups under Muslim rule in cities of the medieval Middle East.
The project is described in more detail in chapter 3.

Both projects worked with historical data, which is fairly typical for a wide
range of DH research. Information, and consequently data, in the DH is usually
made up of discrete statements or information packets. These come from a variety
of sources of different quality, and the data entry process involves interpretation
and judgment of accuracy and meaning by domain experts. In the “Oceanic Ex-
changes” project, these characteristics manifested themselves through different
newspapers in different languages, texts digitized by optical character recognition
(OCR) with different qualities, and fragmented or incomplete newspaper articles.
The incremental, manual data entry throughout the collaboration naturally led
to an incomplete view on the situation at the time in the “Dhimmis & Muslims”
project. This was exacerbated by actual missing source material. Here, the con-
tradiction of different sources and their trustworthiness was also a large issue;
for example, authors of one religious group would exaggerate their own group’s
presence in an area and downplay that of other groups, if it was mentioned at all.
The interpretation of historical sources, for instance regarding the attribution of
a statement to a specific city, was also challenging due to the large number of lan-
guages the sources were written in. As a consequence, data in these projects, and
in DH in general, is heterogeneous, incomplete, and at times contradictory.
Analyzing and communicating such data in a manner that does not skew the
reader’s impression of the represented information, hence, becomes a challenge.

Data entry from multiple sources was a core component of the “Dhimmis &
Muslims” project collaboration. However, on a larger scale, data entry is always
part of DH projects, as the data originally is not digital. Digitization also requires
the design and application of a data model, which can be challenging especially
for the heterogeneous data found in DH research. Hence, both the interpretation
of non-digital data during data entry, and the simplification of the data to adhere
to the data model, might distort the data. Such distortions will cascade to the vi-
sual analysis of the data. The steps analysts take here to reach interesting insights
might be affected, also, by their context and domain knowledge. As a final step,
findings are then communicated to peers, who often only have condensed and ag-
gregated result data or images available to them, not the entire data the findings
were based on. Hence, the multi-step process of data entry, visual analysis, and
publication of findings that are deemed interesting is affected in each step by hu-
man judgment and interpretation, possibly by multiple people. These judgments
and the steps taken that lead to the recorded data, the interactive path through
the visualization, and the published outcome (i.e., findings supported by a scien-
tific paper, a screenshot, or a video) need to be communicated for better trust in
the results. At the same time, domain experts might want to inspect the data a
visualization is based on, and explore and interact with the visualization a pub-
lished finding was based on. More succinctly, the one-way workflow should be
augmented to a two-way workflow to support reproducible scientific results
and to ensure provenance of the information and analyses.

When analyzing data, researchers might look at correlations or behavior of interest
that affect the entire dataset. Such relations are mostly easy to find, but can be
interesting nonetheless. However, in many cases only small subsets of a dataset are
of interest for a particular phenomenon. This is especially true with heterogeneous
datasets. Such subsets of data often form meaningful patterns. The nature of these
patterns depends on the data, the domain, and the concrete analysis tasks. In the
DH, the patterns often appear in the spatial (often, geospatial), temporal, or
spatio-temporal arrangement of the data. Visualization can assist in finding
patterns, since the human visual system has evolved to quickly and effortlessly find
and recognize meaningful patterns (Wertheimer, 1923; Healey & Enns, 2011).

Mapping data attributes to position is an integral technique in information
visualization (W. S. Cleveland & McGill, 1985). A linear mapping of an attribute to a
coordinate component is often used, and a pattern in the data would then in many
cases become obvious in the visual representation. However, a non-linear
projection of one or multiple attributes into the 2D display space can in some cases
greatly improve the clarity of the aforementioned patterns, or reveal them
to begin with. These projections can take many forms, one example being map
projections that represent the three-dimensional, quasi-spherical earth on a two-
dimensional map. Different map projections exist, which emphasize different
properties of the spatial layout. Going beyond geographical space, other data attri-
butes can also be projected; for instance, time is often represented on a linear scale
in visualization, but can also be represented non-linearly to highlight different
phenomena in the data. Depending on the task and the data, different represen-
tations and different projections of the data might be better suited to emphasize
or reveal patterns of interest, both within the context of the DH and beyond.
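
The difference between a linear mapping and a non-linear projection can be made
concrete with a small sketch. The following Python example is illustrative only and
is not taken from the thesis: it assumes a spherical earth of unit radius and
compares the equirectangular mapping, which uses latitude linearly as the vertical
coordinate, with the Mercator projection, which stretches latitudes towards the
poles. The function names are chosen here for illustration.

```python
import math


def equirectangular(lon_deg, lat_deg):
    """Linear mapping: longitude and latitude (in radians) are used as x and y."""
    return math.radians(lon_deg), math.radians(lat_deg)


def mercator(lon_deg, lat_deg):
    """Non-linear mapping: y grows faster than latitude towards the poles."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)
    return lam, math.log(math.tan(math.pi / 4 + phi / 2))


if __name__ == "__main__":
    # Compare the vertical coordinate of both mappings for a few latitudes.
    for lat in (0, 30, 60):
        _, y_linear = equirectangular(0, lat)
        _, y_mercator = mercator(0, lat)
        print(f"lat {lat:2d}: linear y = {y_linear:.3f}, Mercator y = {y_mercator:.3f}")
```

Both mappings place the same points on the same meridians, but they emphasize
different properties of the spatial layout, which is the kind of trade-off the
paragraph above refers to.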

1.2 Research Questions
Previous work, both within and outside of visualization research, has explored the
analysis of uncertain and incomplete data in detail (Eaton et al., 2003; Bisantz et
al., 2005; Correa et al., 2009; Brodlie et al., 2012). While many previously proposed
techniques can be applied to DH research, the small—but complex—datasets
in combination with contradictory data still require additional considerations,
which have not yet been thoroughly explored (Windhager et al., 2019a). The data
of heterogeneous quality found in the DH also requires more qualitative assess-
ment and interpretation by experts, which are not covered by quantitative meth-
ods of dealing with uncertainty found in other fields.

This nature of the data, in combination with the constant involvement of experts
for interpretation, means that analyses in the DH rarely follow a linear path from
data collection to results. Incremental data foraging, knowledge mining, and
sensemaking (Pirolli & Card, 2005) have been discussed previously, also in the
context of uncertainty (Sacha et al., 2015). The unique combination of requirements
found in the DH still requires further research and a case-by-case assessment of the
requirements.

Works on the nature of spatio-temporal patterns of interest (e.g., Viboud et
al., 2006; G. Andrienko et al., 2011, 2013; Li et al., 2018) have already proposed
suitable visualization techniques for many situations. Techniques for the trans-
formation and projection of data to elicit hidden patterns have previously been
explored, for example, in the context of dimensionality reduction (DR) (Bunte et
al., 2012; Ayesha et al., 2020), geographical projections (Snyder, 1997; Gosling &
Symeonakis, 2020), and even within the DH (e.g., Jänicke et al., 2012). Some work
within the DH (e.g., Šavrič et al., 2015; Robinson & The Committee on Map Pro-
jections, 2017) also considers the better communication of spatial relationships.
Here as well, the specific nature of interesting patterns, and suitable transforma-
tions and projections, depends on the data and domain research questions. The
design space is not yet fully explored, especially regarding DH data. As a conse-
quence, the best choice of projection or data transformation for specific research
questions is also under-explored.

This thesis aims to address the following research questions to support the
requirements of DH scholars (see section 1.1) and to further the state of the art of
visualization research:

RQ1 How can data that is heterogeneous, incomplete, or contradic-
tory be communicated visually in an adequate manner to analysts
without exaggerating its significance, or conveying a false sense of
data density and completeness?

RQ2 How can the one-way workflow of data acquisition, visualization and
publication be augmented to a two-way workflow to support re-
producibility and provenance?

RQ3 What common types of patterns that indicate interesting phenomena
typically appear in the spatio-temporal aspects of DH data, and
how can visualization assist domain experts in finding and analyzing
them?

RQ4 What types of projections can be used on the spatio-temporal attri-
butes of such data to reveal interesting patterns (RQ3), and how can
we determine whether the projections are helpful or not?

1.3 Contributions and Remaining Structure
This thesis is structured thematically, rather than by publication. Therefore, indi-
vidual research questions (see section 1.2) are addressed in multiple chapters.

Contributions of papers. The contributions presented in this thesis are
based mainly on my own, previously published work. Because of the thematic
structure, some papers are split up into multiple chapters. In the following, these
seven previous publications are described briefly.

In our workshop paper (Franke et al., 2019) presented at the 4th Workshop
on Visualization for the Digital Humanities (VIS4DH) in Vancouver, Canada, we
argued for confidence to be included as a first-class data attribute in DH data and
analyses. We chose the term “confidence” to highlight the qualitative nature of
uncertain data in the DH, as well as the human factor present in data collection
and interpretation. By treating the confidence of collected data as part of the data,
it can be included into analyses as a filter criterion, and can be visualized. This
provides analysts with a more objective picture of heterogeneous, incomplete,
and contradictory data (RQ1).
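
To illustrate what treating confidence as a first-class attribute can mean in
practice, the following Python sketch stores a qualitative confidence level next to
each record and uses it as a filter criterion. It is a minimal sketch under stated
assumptions: the level names, the record fields, and the function
`filter_by_confidence` are hypothetical and are not the data model described in the
paper.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Confidence(IntEnum):
    """Hypothetical ordinal confidence levels (assumption, not from the source)."""
    FALSE = 0
    UNCERTAIN = 1
    CONTESTED = 2
    PROBABLE = 3
    CERTAIN = 4


@dataclass
class Evidence:
    """One discrete statement; confidence is stored as part of the data."""
    place: str
    group: str
    confidence: Optional[Confidence]  # None means no confidence was recorded


def filter_by_confidence(items: List[Evidence], minimum: Confidence,
                         keep_unrecorded: bool = False) -> List[Evidence]:
    """Keep items at or above a confidence level; optionally keep unrated items."""
    result = []
    for item in items:
        if item.confidence is None:
            if keep_unrecorded:
                result.append(item)
        elif item.confidence >= minimum:
            result.append(item)
    return result


# Placeholder records for illustration only.
records = [
    Evidence("Place A", "Group 1", Confidence.CERTAIN),
    Evidence("Place B", "Group 1", Confidence.UNCERTAIN),
    Evidence("Place C", "Group 2", None),
]

if __name__ == "__main__":
    for item in filter_by_confidence(records, Confidence.PROBABLE):
        print(item.place, item.group, item.confidence.name)
```

Keeping unrated records as an explicit option, rather than silently dropping them,
reflects the concern that missing confidence information should not be mistaken for
low or high confidence.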

In our paper (Franke et al., 2020) presented at the 11th International Confer-
ence on Information Visualization Theory and Applications (IVAPP) in Valletta,
Malta, we presented LilyPads, an interactive visualization approach for the
visual analysis of spatio-temporal dissemination in historical newspaper
articles. LilyPads explores the use of broken-up inset maps to visualize a hetero-
geneous geographical distribution with a higher level of detail (LOD). The inset
maps are placed relative to a chosen geographical perspective to uphold the spa-
tial relationships between places for analysts (RQ4). The empty space between
the map insets can be used to visualize other data attributes such as temporal data
(RQ3) and textual contents of the newspaper articles.

In our paper (Franke et al., 2021a) presented virtually at EuroVis and pub-
lished in Computer Graphics Forum (CGF), we explored projections from geo-
graphical space to a one-dimensional (1D) ordering. These projections (RQ4)
simplify the layout of data items, and their relation with other data attributes
(e.g., time-dependent data) can be visualized statically to reveal spatio-temporal
patterns of interest (RQ3). Additional quality measures, guidance, and interac-
tions help understand where neighborhoods in the 1D ordering correspond to
neighborhoods in the geographical space.
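
As a rough illustration of what a projection from geographical space to a 1D
ordering can look like, the following Python sketch sorts places by their index on a
Z-order (Morton) curve. This particular curve is an assumption chosen for brevity;
it is only one of many possible orderings and is not claimed to be one of the
projections studied in the paper. The function names and the placeholder coordinates
are hypothetical.

```python
def morton_index(lon, lat, bits=16):
    """Interleave the bits of quantized longitude and latitude into one integer."""
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)  # quantized longitude
    y = int((lat + 90.0) / 180.0 * scale)   # quantized latitude
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def order_places(places):
    """Sort (name, lon, lat) tuples along the curve to obtain a 1D ordering."""
    return sorted(places, key=lambda place: morton_index(place[1], place[2]))


# Placeholder places for illustration only.
places = [
    ("NE", 10.0, 10.0),
    ("SW", -10.0, -10.0),
    ("NW", -10.0, 10.0),
    ("SE", 10.0, -10.0),
]

if __name__ == "__main__":
    for name, lon, lat in order_places(places):
        print(name, lon, lat)
```

Places that are close on the curve are often, but not always, close in geographical
space, which is why quality measures and interactive guidance are useful when
reading such an ordering.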

In our paper (Franke et al., 2022a) presented at the 13th IVAPP held virtually
in Zurich, Switzerland, we performed a comparative study of three visualizations
for multiple time series: line charts, aligned area charts, and stream graphs. We
compared these regarding three low-level tasks that are commonly part of visual
analyses (RQ3): identifying the time series with the largest value at one given
point in time, identifying the time series with the largest integral under the curve
between two given points in time, and deciding at which of two given points in
time the sum of all time series’ values is larger. We found trends and weak evi-
dence for the suitability of different visualization techniques for the three tasks.

In our paper (Franke & Koch, 2023a) presented at the 14th IVAPP in Lisbon,
Portugal, we presented Damast, a visual analysis approach for studying the
peaceful coexistence of different religious groups under Muslim rule in the
medieval Middle East. Damast visualizes the spatial and temporal aspects of the data
(RQ3), as well as other aspects such as religious groups and literary data sources.
Incomplete data and confidence are visualized explicitly and can be used as part
of a visual query (RQ1). Damast supports data entry, visual data analysis, and
the publication of results. In addition, the interactive visualization is reproducible
from published results, and analysts can navigate to the source data the visualiza-
tion is based upon to allow for iterative knowledge generation cycles (RQ2).

In our Visualization Notes paper (Franke et al., 2023a) presented at the 16th
IEEE Pacific Visualization Symposium (PacificVis) in Seoul, Korea, we argued for
reproducible workflows (RQ2) in visualization research. We discussed different
aspects that affect how well visualization results can be reproduced, such as the
required time frame and data granularity, the reason for reproducibility, the type
of data, and the different input parameters that need to be considered. We high-
lighted the importance of reproducibility and suggested a set of strategies that can
be applied and combined to support reproducible workflows.

In our short paper (Franke & Koch, 2023b) presented at the 2023 IEEE VIS
conference in Melbourne, Australia, we presented an approach for the interac-
tive exploration of hidden periodicity patterns (RQ3) in event data. Our approach
calculates a cheap-to-compute phase histogram of the data for thousands of pos-
sible period lengths. Analysts are guided towards period lengths of interest with
two quality measures, as well as interactive suggestions and details on demand.
Our approach serves as an interactive, overview-first exploration tool for the data,
where analysts can later apply advanced, but computationally more expensive
methods to individual period lengths of interest.
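
The idea of a phase histogram can be sketched in a few lines of Python. The
following example is a simplified illustration under stated assumptions, not the
implementation from the paper: event times are plain numbers, the number of bins is
arbitrary, and `peakedness` is a hypothetical stand-in for the two quality measures,
which this section does not specify.

```python
def phase_histogram(timestamps, period, bins=12):
    """Count events per phase bin for one candidate period length."""
    counts = [0] * bins
    for t in timestamps:
        phase = (t % period) / period  # position within the period, in [0, 1)
        index = min(int(phase * bins), bins - 1)
        counts[index] += 1
    return counts


def peakedness(counts):
    """Hypothetical quality measure: share of events in the fullest bin."""
    total = sum(counts)
    return max(counts) / total if total else 0.0


def scan_periods(timestamps, periods, bins=12):
    """Score every candidate period length; best-scoring candidates come first."""
    scored = [(peakedness(phase_histogram(timestamps, p, bins)), p) for p in periods]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    events = [7 * k for k in range(20)]  # synthetic events that recur every 7 units
    for score, period in scan_periods(events, [5, 6, 7, 8]):
        print(f"period {period}: score {score:.2f}")
```

Each histogram needs only one pass over the events, which is what makes it feasible
to evaluate many candidate period lengths before applying more expensive methods to
the promising ones.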

Remaining structure. Important foundations for this work, as well as
previous related work, are discussed and organized in chapter 2. Chapters 3 to 7
contain the main content of this thesis, grouped by topic as summarized in the
following. Chapter 8 provides a conclusion of the thesis, as well as an outlook.

Chapter 3 describes the “Dhimmis & Muslims” project, which was a collab-
oration with historians of religion. The historians’ main research interest within
the project concerned the peaceful coexistence of different religious groups in the
medieval Middle East. Here, the changes in co-location of religious groups over
space and time (RQ3) were of interest. In addition, challenges and possible
solutions for heterogeneous, incomplete, and contradictory data in the DH (RQ1) are
discussed. Finally, the role of reproducible workflows, as well as data and analy-
sis provenance (RQ2), are discussed in the context of the project. One outcome of
the project was the visualization approach Damast. The IVAPP paper describing
Damast (Franke & Koch, 2023a) provides the basis of this chapter. The parts of
the VIS4DH workshop paper (Franke et al., 2019) which describe the data model
and the inclusion of confidence measures in Damast are also included. Finally,
Damast was used as a case study in our PacificVis paper (Franke et al., 2023a) on
reproducibility, and this part of the paper is included in the chapter as well.

Chapter 4 highlights the importance of reproducibility and confidence in the
DH. Reproducibility (RQ2) is an essential step towards open science. In human-
ities research and the DH, data provenance and the mental reproducibility of
thought processes are required over a long time frame. Large parts of our PacificVis
paper (Franke et al., 2023a) make up the basis of the reproducibility part of this
chapter. In addition, parts from the Damast paper (Franke & Koch, 2023a) per-
taining to reproducibility are included here. We proposed the term “confidence”
to delineate the qualitative judgments of uncertainty and heterogeneity in DH
data (RQ1) from the use of the term “uncertainty” in more quantitative contexts.
Our VIS4DH paper (Franke et al., 2019) serves as the basis for the discussion of
confidence in this chapter. Our thoughts on confidence and uncertain DH data
in the context of two DH collaborations from the Damast and LilyPads papers
(Franke et al., 2020; Franke & Koch, 2023a) are recapitulated here.

Chapter 5 describes the visual analysis of data with spatial context. Here, the
use of spatial aggregation and its challenge regarding heterogeneous data with
low confidence (RQ1) is discussed. In addition, the use of projections to one-
and two-dimensional space to elicit hidden patterns from the data (RQ3, RQ4)
is proposed. Spatial data were of relevance in the LilyPads and Damast papers
(Franke et al., 2020; Franke & Koch, 2023a), as well as our EuroVis paper (Franke
et al., 2021a). The parts of these papers concerning spatial projection, aggregation,
and the visualization of geospatial data are recapitulated in this chapter. The part
of our VIS4DH paper (Franke et al., 2019) concerning aggregation of spatial data
is also included.

Chapter 6 discusses the visual analysis of time-dependent data, which can
be used as part of the discovery and analysis of spatio-temporal patterns (RQ3).
The chapter recapitulates our VIS short paper (Franke & Koch, 2023b) that pro-
poses a data transformation (RQ4) and visualization approach to elicit hidden
periodicities in event data. The chapter further recapitulates our IVAPP paper
(Franke et al., 2022a) comparing different visualization techniques for time series
data regarding their suitability for different tasks. Finally, the merits of different
visual representations of time in different visualization approaches are summa-
rized. The parts of the LilyPads and Damast papers (Franke et al., 2020; Franke
& Koch, 2023a), as well as our EuroVis paper (Franke et al., 2021a), regarding
visualization of temporal data are included here.

Chapter 7 addresses the analysis of data with a spatio-temporal component
(RQ3). The chapter presents patterns that can be of interest both within and out-
side the DH. Both the use of visualization techniques that combine space and
time in one view and the separated visualization in multiple views are considered.
The work discussed here is a combination or extension of parts of work covered in
chapters 5 and 6. The chapter summarizes the previously presented techniques
(Franke et al., 2020, 2021a; Franke & Koch, 2023a, 2023b) regarding spatio-temporal
visualization and differentiates between separated and integrated representations
of space and time. Different types of spatio-temporal patterns are listed,
and examples for such patterns and their analysis from the presented approaches
are shown.

2 Foundations

This chapter introduces concepts and techniques that are essential prerequisites
for the rest of the thesis. In section 2.1, the foundations of visualization and in-
formation visualization are explained. Section 2.2 introduces two types of data
projection that are used in this thesis: dimensionality reduction and geographi-
cal projection. The principles of quantitative statistical evaluation using interval
estimation, which is used in the user evaluations presented in this thesis, are ex-
plained in section 2.3. Section 2.4 gives an overview of the challenges and the state
of the art of visualization in the digital humanities.

2.1 Interactive Information Visualization and
Visual Analysis
Information visualization is a large and broad sub-field of the even larger field of
visualization. Certain principles, concepts, and techniques have emerged in typi-
cal use within this field. A subset of these, which are also used in the approaches
presented in this thesis, are described in the following.

2.1.1 Basic Principles of Information Visualization
Information visualization harnesses the parallel processing capabilities of the
human visual cortex (Ware, 2004). Over millions of years, this part of the human
brain evolved to recognize certain patterns pre-attentively (Healey & Enns, 2011);
that is, without requiring any conscious or time-intensive thinking. The Gestalt
laws (Wertheimer, 1923) are based on pre-attentive perception and describe patterns
that the human visual system will recognize without much thought. The law of
proximity describes how the human visual system will
group individual entities into clusters if they are close together. The law of similar-
ity describes how visual items of similar characteristics (e.g., color, or shape) are

[Figure 2.1 labels. Print formats: DIN A0, DIN A2, and DIN A4 at 600 DPI; DIN A0
and DIN A2 at 300 DPI. Displays: VISUS Powerwall (10800×4096 px), 4K monitor
(3840×2160 px), Full HD monitor (1920×1080 px).]

Figure 2.1: A visual comparison of the possible pixel density between print
media and computer monitors. At a density of 600 dots per inch, a DIN A0 pa-
per printout (1189mm×841mm) can show 269.1 times more pixel data than a
full HD (1920 px×1080 px) computer monitor, and 67.3 times more pixel data
than a 4K (3840 px×2160 px) computer monitor.

grouped together regardless of position. The law of connectedness describes how
lines between visual items instantly imply a relation between the items. The laws
of continuity and closure describe how the human visual system can complete
shapes where parts are missing, and group individual subsets of items based on
their probable relatedness, even if parts of the groups are not visible. Wertheimer
describes additional Gestalt laws, but the aforementioned ones are especially rel-
evant to information visualization. Patterns in the data, such as tight grouping in
two or more aspects, linear relationships, or similar groups represented by color
or shape, are recognized pre-attentively. This makes information visualization
a powerful technique for understanding relationships in larger datasets, where
looking at the data itself would not yield any appreciable insights.

For visualization, different visual variables (Bertin, 1983) (or visual channels
(Chen & Floridi, 2012; Munzner, 2014)) can be utilized to encode different aspects
of data. To a degree, different visual variables can be combined to encode multiple
data attributes in the same visual marks (Munzner, 2014). Examples for visual
variables are the position, size, shape, and texture of visual marks, as well as their
color (hue, value, and saturation). W. S. Cleveland and McGill (1985) explored the
clarity of encoding for different visual variables for quantitative data, and found
position and length to be the most suitable encodings. For ordinal or nominal data
(Stevens, 1946), position is still the best choice, but other visual variables can be
more suitable than they are for quantitative data (Mackinlay, 1986).

Figure 2.2: The information visualization pipeline (Card et al., 1999) describes the
formal stages: Raw data is structured. The structured data is then mapped to visual
structures. Views of the visual structures are created via geometrical transformations,
such as scaling and translation (zoom and pan). Users can interact with all stages of
the pipeline.
[Diagram: raw data, structured data, visual structures, and views, connected by data
transformations, visual mappings, and view transformations; the user sees the views
and interacts with all transformations.]

The volume and density of data often make a visual representation challenging.
Especially when considering the limitations even of modern computer monitors
(see figure 2.1), it is hard to represent both local structures and the dataset
in its entirety accurately. Shneiderman’s mantra of “overview first, zoom and filter,
then details-on-demand” (Shneiderman, 1996) has been a governing principle
for interactive, computer-aided visualization: To circumvent the limited screen
space, the initial overview shows only an aggregated representation of the data.
Users can then zoom and filter into subsets of the data, which are then shown in
more detail.
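
The ratios quoted in the caption of figure 2.1 can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the function names are chosen here for illustration, and the sheet size and resolutions are the ones stated in the caption.

```python
MM_PER_INCH = 25.4


def print_dots(width_mm: float, height_mm: float, dpi: int) -> float:
    """Number of printable dots on a sheet of the given size and density."""
    return (width_mm / MM_PER_INCH * dpi) * (height_mm / MM_PER_INCH * dpi)


def screen_pixels(width_px: int, height_px: int) -> int:
    """Number of pixels of a display with the given resolution."""
    return width_px * height_px


# DIN A0 sheet (1189 mm x 841 mm) printed at 600 dots per inch.
a0_at_600_dpi = print_dots(1189, 841, 600)

# Ratios between the printout and the two monitor resolutions.
ratio_full_hd = a0_at_600_dpi / screen_pixels(1920, 1080)
ratio_4k = a0_at_600_dpi / screen_pixels(3840, 2160)

print(f"A0 @ 600 DPI vs. full HD: {ratio_full_hd:.1f}x")
print(f"A0 @ 600 DPI vs. 4K:      {ratio_4k:.1f}x")
```

The sketch treats one printed dot as equivalent to one monitor pixel, which is the same simplification the figure makes.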

The information visualization pipeline (Card et al., 1999), shown in figure 2.2,
describes how interactions in different parts of the pipeline can affect the results.
For computer-aided visualization, the creation of uniformly structured data is es-
sential for an effective, rule-based visualization (see section 2.1.2). Based on these
rules, the structured data is subsequently mapped to visual structures or visual
marks (Munzner, 2014). By transforming the view, for example through zooming
and panning, a subset of the data domain and the visual marks can be shown.
Users can interact with all stages of the pipeline; for instance, they can affect how
the raw data is transformed to structured data (e.g., by changing the aggregation
level), change the visual mapping of different data attributes, or zoom and pan
to subsets of the data. Shneiderman’s mantra applies to all stages of interaction,
owing to the concept of semantic zooming (Bederson et al., 1996): Zooming happens
in the view transformation stage, filtering in the data transformation stage. At the
same time, the reduction in data shown, and the smaller data domain shown in the
available screen space, allow for more details and affect the visual mapping stage.
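
The three stages of the pipeline can be sketched as three composable functions. This is an illustrative toy example, not code from any of the presented approaches: the event years, the `Mark` class, and all function names are assumptions made here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Mark:
    label: str
    x: float       # position in data coordinates
    height: float  # bar height in data coordinates


def data_transformation(raw_years, bin_width):
    """Raw data -> structured data: count events per time bin."""
    return Counter((year // bin_width) * bin_width for year in raw_years)


def visual_mapping(structured):
    """Structured data -> visual structures: one bar mark per bin."""
    return [Mark(str(start), float(start), float(count))
            for start, count in sorted(structured.items())]


def view_transformation(marks, x_min, x_max, width_px):
    """Visual structures -> view: keep visible marks and scale them to pixels."""
    scale = width_px / (x_max - x_min)
    return [(m.label, (m.x - x_min) * scale, m.height)
            for m in marks if x_min <= m.x < x_max]


raw_years = [1101, 1105, 1112, 1130, 1131, 1139, 1177]  # hypothetical events

# Decade bins, zoomed to the years 1100 to 1140 on a 400 px wide view.
view = view_transformation(
    visual_mapping(data_transformation(raw_years, bin_width=10)),
    x_min=1100, x_max=1140, width_px=400)

# Interaction in the first stage: a coarser aggregation level.
coarse = data_transformation(raw_years, bin_width=50)
```

Each user interaction corresponds to changing a parameter of one stage: `bin_width` for the data transformation, the mark construction for the visual mapping, and the visible range for the view transformation.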

2.1.2 Visualization Domains
Computer-aided visualization¹ is a large discipline with different requirements,
types of data, and techniques. Within this discipline, different sub-disciplines
formed because of these individual differences. Information visualization is the
visualization of abstract data (Rhyne et al., 2003; Weiskopf et al., 2006), which
does not contain an inherent spatial layout. As a consequence, the spatial layout
of the visualization’s output (i.e., the image space) can be chosen freely. Different
information visualization techniques exist for different data types. Examples for
these are bar charts for singular quantitative values assigned to discrete entities;
line charts for quantitative values dependent on another variable (such as time);
visualizations of tree structures such as icicle plots (Kruskal & Landwehr, 1983),
indented trees (Burch et al., 2010), or tree maps (Shneiderman, 1992; Bruls et al.,
2000); or node-link diagrams to visualize general graph structures. Information
visualization typically uses the two-dimensional space for its representations to
fully utilize the 2D paper, and later computer monitor, representation domain.

In contrast, scientific visualization has its roots in rendering research and computer
graphics (Rhyne et al., 2003; Weiskopf et al., 2006). Hence, its techniques are
often closer to the hardware. Unlike the data in information visualization, the data
visualized here has an inherent spatial layout. Typical examples
for data visualized in scientific visualizations are 3D medical scans, flow or
density fields obtained from measurements or simulations, or 3D surface mod-
els. Regardless of an often-present temporal component, this data typically has a
three-dimensional spatial layout that is anchored in the real world. As a conse-
quence, scientific visualizations usually use 3D rendering techniques.

The separation between scientific visualization and information visualization
is largely due to historical reasons and their different origins. Both disciplines aim
to solve similar challenges, albeit in different data and application domains. As
such, the naming of the two disciplines is somewhat unfortunate. In fact, Munzner
stated: “scientific visualization is not uninformative, and information visualization
isn’t unscientific” (Rhyne et al., 2003, p. 612). Recently, efforts in the visualization
community (e.g., Rhyne et al., 2003; Weiskopf et al., 2006) have been made
to bridge the gap, and to reduce the bias stemming from the original naming of the
disciplines. Furthermore, visualization approaches around real-world-positioned
data have been utilizing information visualization techniques, and vice versa (e.g.,
Sedlmair et al., 2009).

¹ The term “visualization” has vastly different meanings across disciplines. In the humanities, for
instance, the term can refer to a mental sensemaking process. Within this thesis, the term
“visualization”, when used without additional qualifiers, always refers to the computer-aided,
rule-based transformation of data to a visual representation. The usage of the term implies interactivity:
Users can interact with the automated process in all stages (see section 2.1.1) to affect its output.

Geographical visualization is another sub-discipline that lies between visual-
ization and cartography (MacEachren et al., 1998). Often, abstract data is visual-
ized in a geographical context, and so, information visualization techniques are
utilized. At the same time, the geographical aspect of the data inherently affects
the spatial layout of the visualization. As with scientific visualization, the expres-
sive visual variable of position (see section 2.1.1) cannot be used freely. Hence, geo-
graphical visualization shares properties of both scientific and information visual-
ization. Depending on the application domain, geographical visualization might
lean into techniques from either sub-discipline more; for example, geographical
visualizations showing air current vector fields, or altitude data, might lean more
towards the scientific visualization’s 3D rendering techniques. On the other hand,
visualizations such as choropleth maps (Brewer et al., 1997; Slocum et al., 2014)
or glyph-based visualizations (Borgo et al., 2013) in a map are closer to the tech-
niques used in information visualization.

In the digital humanities (DH), data is overwhelmingly abstract. While real-
world, 3D data exists, for example, in the form of archaeological ground scans
(Eggert et al., 2013) or 3D models of historical artifacts (Martínez Carillo et al.,
2008), most aspects of the data collected here are abstract. Examples are text passages
and the entities and relationships described within, or subjective recollections
of historical actualities by individual sources. At the same time, DH data
often has a geographical component. As a consequence, the visualizations pres-
ented in this thesis either use information visualization techniques, or geograph-
ical visualization techniques that are closer to the information visualization side
of the spectrum.

Visual analytics describes a methodology that includes the use of information
visualization. The model proposed by Keim et al. (2008) and shown in figure 2.3
describes an incremental process, where knowledge is gained from data. This is
done both by the interactive, visual exploration of the data, and by the use of
computational models and automated data mining techniques. Insights gained
from the visualization can be used to improve the automated models, and vice
versa. The resulting knowledge can be used to further improve the data input into
this process. The “models” described by Keim et al. do not necessarily have to be
extensive artificial intelligence or large language models. In the work presented
in this thesis, various clustering techniques, automated analyses, and regression
were used. This still qualifies the respective approaches as using visual analytics.
In the context of this thesis, especially for chapter 4, the work by Sacha et al. (2015)
is relevant as well. Their work extends the visual analytics model of Keim et al.
to include uncertainty in the analysis process.

[Figure 2.3 diagram: data leads to knowledge via visualization (visual data
exploration, top) and via models (data mining, bottom); a feedback loop leads from
knowledge back to data.]

Figure 2.3: The visual analytics model, adapted from Keim et al. (2008).
Knowledge is obtained from data both via interactive information visualization
(top), and by automated data-mining methods (bottom). Insights gathered from
either process can be used to improve the other. The knowledge
gained through both processes can be used to improve and enrich the input
data.

2.1.3 Common Information Visualization Concepts
Various information visualization techniques exist that are commonly used in vi-
sualization approaches. Within the approaches described in this thesis, the tech-
niques described in the following are used often.

MCV, faceted filtering, and brushing and linking. Multiple coordinated
views (MCV) (Wang Baldonado et al., 2000; Roberts, 2007) is a concept
where a visualization consists of more than one view. Each view shows a different,
partial perspective on the dataset. This can mean that each view shows a different
visual representation of the data, but typically, the visualizations in the different
views also show different attributes of the data; for example, one view might show
the geographical distribution of the data in a map, while another shows its temporal
distribution in a density plot over time.

The views are coordinated, which means that interacting with one of the views
will affect the other views as well. This coordination is realized, for one thing, in
filtering: Filtering in one view, for example by selecting a valid data range, affects
the data shown in other views as well. Often, multi-faceted filtering (Weaver,
2004; Hearst, 2006) is used here: Filter selections within the same data attribute
are treated as a union, while selections between different attributes are treated
by intersection. This allows for very expressive, but still intuitive, filtering
possibilities. In the example above, a user might have selected two disjoint areas in
the geographical view, as well as a time range in the temporal view. For each singular
data item, the decision whether it will be shown depends on whether its
geographical aspect lies within either of the areas. However, its temporal aspect
must also lie within the selected span. More formally, the set of data items
$d \in D$, for a set of data aspects $A_i$, $I := \{1, \dots, |A|\} \ni i$,

\[ D = \bigtimes_{i \in I} A_i \]

and a set of $n_i$ distinct filters on singular data aspects,
$J_i := \{1, \dots, n_i\} \ni j_i$,

\[ f_{A_i, j_i} : A_i \to \{\mathrm{true}, \mathrm{false}\} \]

is filtered to $D_{\mathrm{filtered}} \subseteq D$ such that:

\[ D_{\mathrm{filtered}} := \left\{ d \in D \;\middle|\; \bigcap_{i \in I} \bigcup_{j_i \in J_i} f_{A_i, j_i}(d) \right\} \]


Within MCV visualizations, brushing and linking (Becker & Cleveland, 1987;
Ward, 1994) is often used. This technique allows users to better understand the
relationships between data attributes, and to anticipate the effects of singular re-
strictions on the faceted filtering. The technique consists of one interaction, the
brushing, and a visual response, the linking. Brushing is the action of selecting a
data range, or a set of visual marks, in one of the views. As a consequence, linking
is implemented by highlighting visual marks representing the same data items—
for example, by reducing the saturation and lightness of other visual marks—in
all views. Different visualization approaches implement brushing and linking dif-
ferently. Brushing can, for example, be realized by simply hovering the mouse
cursor over a data item, or by creating a range selection. Linking can be shown,
for example, by highlighting the corresponding visual marks, or by desaturating
the non-corresponding ones.

The approaches presented in this thesis all utilize MCV to a certain degree.
Faceted filtering is often implemented due to its intuitive use, and its expressive-
ness despite the simplicity of the concept. Brushing and linking are also used in
most approaches to support the task of relating different data aspects.
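
The faceted filtering semantics described above (union within one data aspect, intersection across aspects) can be sketched in a few lines. The data items, attribute names, and helper functions below are hypothetical and only serve as illustration. One deliberate assumption: an aspect without any filter does not restrict the result, which is how interactive systems usually behave.

```python
def faceted_filter(items, filters):
    """Keep items that pass at least one filter of every filtered attribute.

    `filters` maps an attribute name to a list of predicates on that
    attribute's value. Attributes without predicates are not restricted.
    """
    return [
        item for item in items
        if all(
            any(predicate(item[attribute]) for predicate in predicates)
            for attribute, predicates in filters.items() if predicates
        )
    ]


def in_box(west, south, east, north):
    """Predicate on a (longitude, latitude) pair."""
    return lambda pos: west <= pos[0] <= east and south <= pos[1] <= north


def in_range(low, high):
    """Predicate on a scalar value such as a year."""
    return lambda value: low <= value <= high


# Hypothetical data items with a geographical and a temporal aspect.
items = [
    {"name": "a", "position": (9.2, 48.8), "year": 1150},
    {"name": "b", "position": (35.2, 31.8), "year": 1210},
    {"name": "c", "position": (44.4, 33.3), "year": 1180},
    {"name": "d", "position": (9.0, 48.5), "year": 1300},
]

# Two disjoint areas (union) and one time range, combined by intersection.
filters = {
    "position": [in_box(8, 47, 11, 50), in_box(34, 30, 37, 33)],
    "year": [in_range(1100, 1250)],
}

selected = [item["name"] for item in faceted_filter(items, filters)]
```

Item `c` is dropped because it lies in neither area, and item `d` because it lies outside the time range, although it is inside the first area.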

Overview+detail, focus+context. When data is visualized within a
large data domain, often, local structures can get lost in the larger picture. On
the other hand, when visualizing solely one local structure, the overall distribu-
tion of the data might get lost. Cockburn et al. (2009) describe overview+detail
approaches, where the overall extent of the data domain of a view is visualized
in a smaller representation, while the main view shows only a small part of the

[Figure 2.4 diagram: world, structured data, and knowledge, connected by the
foraging loop and the sensemaking loop; the reality/policy loop spans the whole
process.]

Figure 2.4: A simplified version of the sensemaking model as presented by
Pirolli and Card (2005). Data is collected and structured from the world at large
in the foraging loop. The collected data is analyzed and knowledge is generated in
the sensemaking loop. These loops can be traversed multiple times; for example,
when collecting additional data based on interim insights from existing data. The
overarching reality or policy loop covers the entire process.

domain in greater detail. They also describe focus+context approaches. Here, the
area of interest (e.g., the local structure) is visualized at a higher level of detail.
Surrounding it, the context in the attribute space is visualized, often in a distorted
manner to accommodate the larger local area of interest. Generalized fish-eye
lenses (Furnas, 1986) are an example of this technique.
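
One way to make the focus+context idea concrete is the degree-of-interest formulation commonly associated with generalized fisheye views: the interest in an item is its a priori importance minus its distance to the current focus. The sketch below applies this to a small tree. The hierarchy, the choice of negative depth as a priori importance, and the threshold are assumptions made for illustration.

```python
def depth(parent, node):
    """Number of edges between `node` and the root."""
    steps = 0
    while parent[node] is not None:
        node = parent[node]
        steps += 1
    return steps


def tree_distance(parent, a, b):
    """Number of edges on the path between `a` and `b`."""
    # Record the distance from `a` to each of its ancestors (including itself).
    hops_from_a = {}
    node, hops = a, 0
    while node is not None:
        hops_from_a[node] = hops
        node, hops = parent[node], hops + 1
    # Walk up from `b` until the first common ancestor is reached.
    node, hops = b, 0
    while node not in hops_from_a:
        node, hops = parent[node], hops + 1
    return hops + hops_from_a[node]


def degree_of_interest(parent, node, focus):
    """A priori importance (here: negative depth) minus distance to the focus."""
    return -depth(parent, node) - tree_distance(parent, node, focus)


# Hypothetical hierarchy: child -> parent, the root has no parent.
parent = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}
doi = {node: degree_of_interest(parent, node, focus="a1") for node in parent}

# Items below the threshold are elided or shown in less detail.
visible = sorted(node for node, value in doi.items() if value >= -4)
```

With the focus on `a1`, the focus and all of its ancestors receive the same, highest interest, while the distant leaf `b1` falls below the threshold.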

The sensemaking process. Pirolli and Card (2005) discuss the sensemaking
process, which can be found in many visualization and visual analytics approaches.
Figure 2.4 shows a simplified version of the sensemaking model, omitting
the fine-grained stages and loops also discussed by Pirolli and Card (2005).
External data is collected, structured, and enriched with additional searches in
the foraging loop. Collected data is analyzed, hypotheses are tested, and resulting
knowledge is presented in the sensemaking loop. The overall process of data for-
aging and sensemaking is spanned by the reality or policy loop. The sensemaking
loop can be supported by visual analysis. Visualization approaches that consider
enriching or narrowing down the base dataset in addition to the visualization of
data also support the foraging loop. The sensemaking process, applied to visual-
ization, is related to the visual analytics model presented by Keim et al. (2008):
The visual data exploration part of the model (see figure 2.3) mostly supports the
sensemaking loop. The feedback loop from knowledge to data entry, as well as
the data mining part of the model, support the foraging loop.

2.2 Data Projection
When the data that should be analyzed is high-dimensional, visualization can be
challenging. For interactive information visualization, a two-dimensional (2D)
representation is usually easier to understand. While additional attributes can
be encoded in other visual variables, in some cases it can be valuable to reduce
the dimensionality of the data before visualizing it. Section 2.2.1 presents general
methods that transform high-dimensional data to a dimensionality that can suit-
ably be encoded in a visualization. Section 2.2.2 discusses geographical projection,
where positions on the quasi-spherical surface of Earth are projected to a flat 2D
space.

2.2.1 Dimensionality Reduction
Dimensionality reduction (DR) is a family of techniques that projects data with n
attributes into a subspace with m dimensions, m < n (van der Maaten et al., 2009;
Ayesha et al., 2020). Often, m is set to two (2), such that the low-dimensional
projection can be visualized on a computer monitor. The overall goal of DR is
to preserve structures and patterns in the high-dimensional space in the
low-dimensional space and to not introduce false positives (Venna & Kaski, 2001).

DR can be divided into linear and nonlinear methods. In linear methods, each
of the low-dimensional attributes is defined as a linear combination of all
high-dimensional attributes. Scatter plot matrices are a straightforward visualization
approach that shows an n × n matrix of scatter plots, each of which shows an
orthographic view on the data in the subspace spanned by two attributes. Principal
component analysis (PCA) determines the m eigenvectors of the data’s covariance
matrix with the largest eigenvalues and produces an orthographic view of the data
within the subspace spanned by those eigenvectors.
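
As an illustration of the linear case, the following sketch computes a PCA projection using only the Python standard library. It is a teaching sketch under stated assumptions, not a production implementation: it finds eigenvectors of the covariance matrix by power iteration with deflation, which is a simple stand-in for the eigendecomposition routines of numerical libraries, and the sample data is made up.

```python
import math
import random


def covariance_matrix(rows):
    """Center the data and return (centered rows, sample covariance matrix)."""
    n, dims = len(rows), len(rows[0])
    means = [sum(row[k] for row in rows) / n for k in range(dims)]
    centered = [[row[k] - means[k] for k in range(dims)] for row in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dims)]
           for a in range(dims)]
    return centered, cov


def dominant_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: eigenvector with the largest eigenvalue."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iterations):
        w = [sum(row[k] * v[k] for k in range(len(v))) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * matrix[a][b] * v[b]
                     for a in range(len(v)) for b in range(len(v)))
    return eigenvalue, v


def pca(rows, m):
    """Project the rows onto the m principal components."""
    centered, cov = covariance_matrix(rows)
    components = []
    for _ in range(m):
        eigenvalue, v = dominant_eigenvector(cov)
        components.append(v)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(len(v))]
               for a in range(len(v))]
    return [[sum(c[k] * comp[k] for k in range(len(comp))) for comp in components]
            for c in centered]


# Made-up 3D data: strong variation along (1, 2, 0), slight variation along z.
rows = [(0, 0, 0.1), (1, 2, -0.1), (2, 4, 0.1), (3, 6, -0.1), (4, 8, 0.1)]
projected = pca(rows, m=2)
```

The first projected coordinate captures the position along the dominant direction, the second the small residual variation along the third axis.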

Nonlinear methods are more flexible and can usually preserve complex
high-dimensional structures better, but are more expensive to compute. Various
methods exist, with their individual strengths regarding the quality of projection and
the suitable dataset characteristics. Multidimensional scaling (MDS) (Torgerson,
1952) attempts to preserve pairwise distances between data points in the
high-dimensional space. t-distributed stochastic neighbor embedding (t-SNE) (van der
Maaten & Hinton, 2008) matches probability distributions over pairwise neighborhoods
between the high-dimensional and the low-dimensional space, and hence
works well for high-dimensional data with clear clusters. Self-organizing maps
(SOMs) (Kohonen, 1981) project data by unsupervised machine learning, and
work well for small datasets with high dimensionality.

The approaches presented in this thesis do not make use of many DR techniques.
One approach, which projects spatial data to a one-dimensional (1D) ordering
based on different methods (Franke et al., 2021a), and which is presented
in more detail in sections 5.3 and 7.2.3, uses uniform manifold approximation and
projection (UMAP) (McInnes & Healy, 2018; McInnes et al., 2020) as one projection
method, but otherwise does not use DR techniques. However, the approach
uses quality measures that were developed for DR techniques. Namely, the M1
and M2 metrics proposed by Venna and Kaski (2001) for nonlinear projections,
metric stress proposed for classical MDS (Torgerson, 1952; Goodhill & Sejnowski,
1996), and non-metric stress proposed for non-metric MDS (Kruskal, 1964; Goodhill
& Sejnowski, 1996) are used. In addition, the use of MCV visualizations (see
section 2.1.3) could be seen as a special case of DR: Within each view, only a subset
of data attributes is visualized; that is, the data is projected to an attribute subspace.
2.2.2 Geographical Projections
Mapping Earth’s surface on a 2D medium, such as paper or a computer screen, is
challenging. Earth is roughly ellipsoidal, and can be approximated very well by
a reference ellipsoid or, more coarsely, by a sphere (Moritz, 2011). The spherical
topology and curvature of Earth make it impossible to map its surface to a flat
plane without introducing discontinuities, or distortions to angle, distance, or
area. Various map projections exist that either preserve some of these desired
properties at the cost of others, or offer a compromise solution where none of the
properties is preserved exactly (Snyder, 1997; Slocum et al., 2014).

Positions on Earth are specified by longitude and latitude, as in the World Geodetic
System. Lines of constant longitude run north-south between the poles. Lines of
constant latitude run at a constant distance from either pole. A Great Circle is a
circle (circle-like unless Earth is approximated as a perfect sphere) of maximal
circumference around the earth. For any two points on Earth’s surface, a Great
Circle exists through them. The shorter of the two Great Circle arcs is the shortest
connection between the two points, the geodetic line between them. In contrast, a
loxodrome is a line with constant bearing; that is, a line which at every point has
the same angle between the direction of the line and the direction to the North Pole.
Loxodromes are either latitude lines, parts of longitude lines, or spirals that end up
at either the North or South Pole at some point.
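
The difference between the geodetic line and a loxodrome can be illustrated by comparing their lengths. The sketch below assumes a perfectly spherical Earth with a mean radius of 6371 km; the function names are chosen here for illustration. It uses the haversine formula for the Great Circle distance and the standard rhumb line formula for the loxodrome.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Length of the shorter Great Circle arc (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def loxodrome_km(lat1, lon1, lat2, lon2):
    """Length of the line of constant bearing between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    if abs(dlam) > math.pi:  # take the shorter way around the globe
        dlam -= math.copysign(2 * math.pi, dlam)
    # Difference of the Mercator-projected latitudes.
    dpsi = math.log(math.tan(math.pi / 4 + phi2 / 2)
                    / math.tan(math.pi / 4 + phi1 / 2))
    # Along a parallel, dpsi is zero and the stretch factor is cos(phi).
    q = dphi / dpsi if abs(dpsi) > 1e-12 else math.cos(phi1)
    return EARTH_RADIUS_KM * math.hypot(dphi, q * dlam)


# Along the equator, both lines coincide.
equator_gc = great_circle_km(0, 0, 0, 90)
equator_lox = loxodrome_km(0, 0, 0, 90)

# Along the 60th parallel, the loxodrome is noticeably longer.
parallel_gc = great_circle_km(60, 0, 60, 90)
parallel_lox = loxodrome_km(60, 0, 60, 90)
```

The equator is both a Great Circle and a line of constant latitude, so the two lengths agree there. On the 60th parallel, the Great Circle arc bends towards the pole and is shorter than the path of constant bearing.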

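[Figure 2.5 panels: (a) Root tile, labeled 0/0/0; (b) Hierarchical subdivision, with
tiles labeled z/x/y: 1/0/0, 1/1/0, 1/0/1, 1/1/1 (label 1); 2/2/0, 2/3/0, 2/2/1, 2/3/1
(label 2); 3/4/2, 3/5/2, 3/4/3, 3/5/3 (label 3); 4/8/4, 4/9/4, 4/8/5, 4/9/5 (label 4).]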

Figure 2.5: An example for Mercator projection. Areas towards the poles are
exaggerated. WebMercator “slippy map” tiles are an efficient way to store
pre-rendered map material as image tiles of different levels of detail. A square area of
the Mercator-projected Earth is recursively subdivided into four smaller squares.
Here, the root tile (a) of level 0, with indices 0/0, is subdivided (b) into four tiles
of level 1 (label 1). The tile containing Stuttgart, Germany is subsequently
subdivided into four tiles of level 2 (label 2). This happens two more times (labels 3 and
4). This process demonstrates the amount of map material that has to be loaded
when zooming into a WebMercator map: Only a few tiles need to be loaded for
each level of detail.

Mercator and WebMercator. Mercator projection (see figure 2.5a) was
invented in the 16th century by Gerardus Mercator. It is a cylindrical projection,
in which longitude is mapped linearly, and latitude φ is mapped to ln tan(π/4 + φ/2).
As a consequence, the mapped vertical positions tend towards positive and negative
infinity as the latitudes go towards ±90°. This leads to a drastic increase in area for
features closer to the poles, but also results in loxodromes being represented as
straight lines in the projection. The latter point made Mercator projection
indispensable for early celestial navigation and led to its widespread use.
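
For reference, the spherical form of Mercator projection can be written as follows. The symbols are assumptions introduced here for illustration: $R$ is the radius of the sphere, $\lambda$ and $\varphi$ are longitude and latitude in radians, and $\lambda_0$ is the central meridian.

```latex
\begin{align}
x &= R\,(\lambda - \lambda_0), \\
y &= R \ln \tan\!\left(\frac{\pi}{4} + \frac{\varphi}{2}\right), \\
\frac{\mathrm{d}y}{\mathrm{d}\varphi} &= \frac{R}{\cos\varphi}.
\end{align}
```

The last line shows that the vertical stretch at latitude $\varphi$ equals the horizontal stretch $1/\cos\varphi$ that any cylindrical projection with linear longitude applies to the parallel at that latitude. This is why angles, and therefore lines of constant bearing, are preserved, and why $y$ diverges as $\varphi$ approaches $\pm 90°$.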

Figure 2.6: Azimuthal equidistant projection (azeqd) (a) is defined via an
arbitrary projection point. Distances and angles from this point to others are
represented without distortion. Here, Stuttgart, Germany is chosen as the projection
point. Two-point equidistant projection (tpeqd) (b) is defined via two projection
points. Distances are represented without distortion from either point. The
straight line between the points is represented without distortion as well, and
follows the Great Circle arc between the points. Here, Stuttgart, Germany and
Melbourne, Australia are chosen as the projection points.
[Panels: (a) azeqd; (b) tpeqd]

In modern times, technology such as GPS has eliminated the need for Mercator
projection in navigation. However, Mercator projection is still in widespread
use, in particular in the form of WebMercator “slippy map” tiles. These maps are
defined for a latitude range between approximately 85° northern and southern
latitude, which results in a perfectly square area in Mercator projection. This square
is subdivided into four square areas, which happens recursively across many levels
(see figure 2.5b). For each cell of each level, a rendered map tile of a constant
size (usually 256×256 px) is stored in a simple directory structure (usually
/z/x/y.png, where z is the level, and x and y are the column and row indices of
the cell in the level). This data structure allows for efficient and interactive
display of maps on a client without requiring any projection and rendering
of map data at runtime.
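
The tile addressing described above can be sketched as follows, using the common convention in which x counts columns from the west and y counts rows from the north. The function names and the approximate coordinates of Stuttgart are assumptions made here for illustration.

```python
import math

# Latitude at which the Mercator-projected world becomes a square.
MAX_LATITUDE = math.degrees(math.atan(math.sinh(math.pi)))  # about 85.0511


def tile_indices(lat_deg, lon_deg, zoom):
    """Column (x) and row (y) of the tile containing a position."""
    n = 2 ** zoom  # number of tiles per row and per column at this level
    lat = math.radians(max(-MAX_LATITUDE, min(MAX_LATITUDE, lat_deg)))
    x = (lon_deg + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(lat)) / math.pi) / 2.0 * n
    return min(n - 1, int(x)), min(n - 1, int(y))


def tile_path(lat_deg, lon_deg, zoom):
    """Path of the pre-rendered tile in a /z/x/y.png directory structure."""
    x, y = tile_indices(lat_deg, lon_deg, zoom)
    return f"/{zoom}/{x}/{y}.png"


def children(zoom, x, y):
    """The four tiles of the next level that subdivide the given tile."""
    return [(zoom + 1, 2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]


STUTTGART = (48.78, 9.18)  # approximate latitude and longitude
paths = [tile_path(*STUTTGART, zoom) for zoom in range(5)]
```

For Stuttgart, the resulting tile sequence matches the subdivision labeled in figure 2.5: at each level, only the one tile containing the position, and its three siblings if they are visible, have to be loaded.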

Azimuthal equidistant projection (azeqd) (see figure 2.6a) is a projection that
is defined by one point p on Earth, which can be chosen arbitrarily. This point is
mapped to the center of the projection space. Every other point on Earth is projected
to the angle and distance of the geodetic line between p and that point. As a
consequence, all distances are represented undistorted relative to p, and all geodetic
lines through p are represented as straight lines.
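
A minimal sketch of azeqd on a unit sphere is given below, using the standard spherical formulas. The function name is chosen here for illustration, the result is in radians of arc (multiply by Earth's radius for a distance), and the coordinates of Stuttgart and Melbourne are approximate.

```python
import math


def azimuthal_equidistant(lat_deg, lon_deg, center_lat_deg, center_lon_deg):
    """Project a position to (x, y) on a unit sphere, centered on point p."""
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    phi0, lam0 = math.radians(center_lat_deg), math.radians(center_lon_deg)
    # Angular distance c between the center and the projected point.
    cos_c = (math.sin(phi0) * math.sin(phi)
             + math.cos(phi0) * math.cos(phi) * math.cos(lam - lam0))
    c = math.acos(max(-1.0, min(1.0, cos_c)))
    if c < 1e-12:
        return 0.0, 0.0  # the center itself
    if math.pi - c < 1e-12:
        raise ValueError("the antipode of the center has no unique image")
    k = c / math.sin(c)
    x = k * math.cos(phi) * math.sin(lam - lam0)
    y = k * (math.cos(phi0) * math.sin(phi)
             - math.sin(phi0) * math.cos(phi) * math.cos(lam - lam0))
    return x, y


STUTTGART = (48.78, 9.18)     # approximate latitude and longitude
MELBOURNE = (-37.81, 144.96)  # approximate latitude and longitude

# A point 10 degrees due north of the center lands straight above it.
north_x, north_y = azimuthal_equidistant(58.78, 9.18, *STUTTGART)

# The distance of any projected point from the origin is its angular distance.
mel_x, mel_y = azimuthal_equidistant(*MELBOURNE, *STUTTGART)
mel_distance = math.hypot(mel_x, mel_y)
```

The antipode of the center is excluded because it would be mapped to the entire outer boundary circle of the projection.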

Two-point equidistant projection (tpeqd) (see figure 2.6b) is a projection that
is defined by two arbitrary points $p_A$ and $p_B$. Distances from either of the two
points are represented without distortion in the projection, and the geodetic line
between $p_A$ and $p_B$ is represented as a straight line and, consequently, is not
distorted at all in the projection. azeqd can be seen as a special case of tpeqd
where $p = p_A = p_B$.

2.3 Statistical Evaluation
To compare two or more techniques, or variants of techniques, quantitative
evaluations can be done (Munzner, 2009; Forsell, 2010). For these, a number of study
participants are faced with a set of stimuli, and are asked to solve tasks. Within
these stimuli, a set of parameters that are to be studied are varied. These are the
independent variables. The goal is to keep everything about the stimuli that is
not affected by the independent variables constant, or to remove biases by fully
randomizing these aspects. The measured results of the tasks are the dependent
variables. Each valid combination of values for the independent variables is called
a condition. To improve the statistical validity of results, often, each condition is
repeated a set number of times per participant, with the aspects not governed by
the independent variables randomized differently each time. In the end, statistical
evaluation per condition is performed to gauge whether the independent variables
affect the dependent variables. A study can be within-subject, which means that
all participants see all conditions, or between-subject, where participants are
assigned to a group, and groups see disjoint sets of conditions. In a mixed design,
some independent variables are varied within each group, while others are kept
constant within groups and varied between them. A mixed design is a compromise
for more complex studies: In a within-subject design, the combination of multiple
independent variables and repetitions can quickly lead to overly long evaluation
sessions per participant. A between-subject design, on the other hand, requires a
large number of participants, which can be challenging to organize.

As an example, researchers want to compare two visualizations v ∈ V, where
V = {A, B}. In addition, they want to compare both visualizations on three
complexity levels c ∈ C, where C = {easy, intermediate, hard}. Further, they want
to ensure that each participant sees ten repetitions of each condition. In a
within-subject design, there would be |V × C| = 2 · 3 = 6 conditions, and each
participant would see 60 stimuli. Another option would be to do a mixed design,
where each participant would only see one of the visualization types. In this case,
each participant would see 30 stimuli, but twice as many participants would need
to be recruited to obtain the same number of measurements per condition. In
reality, the number of participants and repetitions often depends on the time
required per stimulus, the maximum time each participant should spend on the
study (ideally under 30 min), the funds available to pay participants, and the
statistical power that the study should achieve. The number of participants required
for the last point can be estimated, for example, using power analysis (Kang, 2021).
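
The numbers in this example can be derived mechanically from the independent variables. The following sketch enumerates the conditions and counts the stimuli per participant for both designs; the variable names are chosen here for illustration.

```python
from itertools import product

visualizations = ["A", "B"]
complexities = ["easy", "intermediate", "hard"]
repetitions = 10

# Within-subject design: every participant sees every condition.
conditions = list(product(visualizations, complexities))
within_stimuli = len(conditions) * repetitions

# Mixed design: the visualization is varied between groups, the complexity
# within groups, so each participant only sees one visualization type.
mixed_stimuli = len(complexities) * repetitions
groups = len(visualizations)

# Factor by which the number of participants grows in the mixed design if
# each condition should receive as many measurements as before.
participant_factor = within_stimuli // mixed_stimuli
```

In a real study, the order of the stimuli would additionally be randomized or counterbalanced per participant.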

Historically, the statistical significance of evaluation results has been
determined by the p-value. Recently, several scientific communities have brought
forward arguments against using p-values, as their outcome can be influenced via
a targeted selection of results (Cumming, 2013a; Cockburn et al., 2020). Instead,
the recommendation in the literature (Cumming, 2013b; Dragicevic, 2016; Besançon
& Dragicevic, 2017, 2019; Cockburn et al., 2020) is to use interval estimation.


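[Figure 2.7, three panels: (a) Sample group values and (b) Sample mean 95% CIs,
each showing groups A, B, and C on an axis from 0 to 3; (c) Bonferroni-corrected
pairwise mean differences A - B, A - C, and B - C on an axis from -0.3 to 0.4,
with label 1 marking the A - C interval.]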

Figure 2.7: Three random sample groups A, B, and C are measured (a). The
interval estimation of their sample means is calculated (b). The 95% confidence
intervals (CIs) are plotted as black lines, and the sample mean of each group is
plotted as a black circle. The CIs of the mean differences between the groups are
calculated and plotted as well (c). Here, Bonferroni correction with a correction
factor of 3 is applied to account for multiple comparisons. The corrected CIs are
plotted with red whiskers. The pairwise differences show evidence that values for
group B are smaller than those of group C (the CI is fully below 0), and that
values for group B are smaller than those of group A (the CI is fully above 0).
For the difference between groups A and C, only a trend for A being larger can be
found, because the corrected CI intersects 0 (label 1).

The statistical evaluations in this thesis follow the guidelines suggested by
Besançon and Dragicevic (2017).2 Their method constructs 95% confidence intervals
(CIs), meaning we can be 95% confident that the population mean lies within the
interval, via bias-corrected and accelerated (BCa) bootstrapping with 10 000
iterations. Equivalent p-values can be calculated using the method by Krzywinski
and Altman (2013). For pairwise comparison, the CIs of the mean differences can
be used. The interpretation of the difference, then, is as follows (Cumming,
2013b, 2013a; Dragicevic, 2016; Besançon & Dragicevic, 2017, 2019; Dragicevic et
al., 2019; Cockburn et al., 2020): CIs of mean differences show evidence if they
do not overlap with 0, and the strength of the evidence increases for tighter
intervals and intervals farther away from 0. A small overlap with 0 can still
indicate a trend towards one of the methods. For multiple comparisons using the
same data, the CIs of the mean differences have to be adjusted using Bonferroni
correction (Higgins, 2004). Figure 2.7 shows an example for three measurement
distributions, their CIs, and the pairwise differences, where a Bonferroni
correction factor of 3 has been applied. Here, the pairwise differences show
evidence for the mean of group B being smaller than that of groups A and C. The
evidence for the mean of group B being smaller than that of group A is stronger
in this case, as the CI of the mean difference is farther away from 0. The CI of
the mean difference between groups A and C shows only a trend for group A being
larger (see label 1 in Figure 2.7), as the corrected CI intersects 0.

2 An English translation of the work of Besançon and Dragicevic (2017) can be
found in Appendix A of the first author's thesis (Besançon, 2017).
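
The procedure described above can be sketched in plain Python. This is an illustrative sketch and not the evaluation code used in this thesis: it assumes paired measurements (each participant contributes one value per group), it assumes that the Bonferroni correction is applied by raising the confidence level to 1 - 0.05/3 for three comparisons, and the function names and data values are invented for the example.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def bonferroni_level(confidence, n_comparisons):
    """Confidence level per comparison after Bonferroni correction."""
    return 1.0 - (1.0 - confidence) / n_comparisons


def bca_ci(values, confidence=0.95, n_resamples=10_000, seed=0):
    """BCa bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    estimate = fmean(values)
    # Bootstrap distribution: means of resamples drawn with replacement.
    boot = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))

    # Bias correction from the share of bootstrap means below the estimate
    # (clamped so that the inverse normal CDF stays finite).
    below = sum(b < estimate for b in boot)
    share = min(max(below / n_resamples, 1 / n_resamples), 1 - 1 / n_resamples)
    z0 = STD_NORMAL.inv_cdf(share)

    # Acceleration from the skewness of the leave-one-out (jackknife) means.
    total = sum(values)
    jack = [(total - v) / (n - 1) for v in values]
    jack_mean = fmean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    accel = num / den if den > 0 else 0.0

    def adjusted_quantile(q):
        z = STD_NORMAL.inv_cdf(q)
        return STD_NORMAL.cdf(z0 + (z0 + z) / (1.0 - accel * (z0 + z)))

    def pick(q):
        return boot[min(n_resamples - 1, max(0, int(q * n_resamples)))]

    alpha = 1.0 - confidence
    return pick(adjusted_quantile(alpha / 2)), pick(adjusted_quantile(1 - alpha / 2))


def interpret(low, high):
    """Reading of a CI of a mean difference (first group minus second)."""
    if low > 0:
        return "evidence: first group larger"
    if high < 0:
        return "evidence: first group smaller"
    return "CI includes 0: at most a trend"


# Invented paired measurements for two groups (illustration only).
group_a = [1.9, 2.1, 2.4, 1.8, 2.2, 2.0, 2.3, 1.7, 2.5, 2.1, 1.9, 2.2]
group_b = [1.6, 1.9, 2.0, 1.7, 1.8, 1.9, 2.0, 1.5, 2.1, 1.8, 1.7, 1.9]
diffs = [a - b for a, b in zip(group_a, group_b)]

plain = bca_ci(diffs)
corrected = bca_ci(diffs, confidence=bonferroni_level(0.95, 3))
print(plain, corrected, interpret(*corrected))
```

With a fixed seed, the corrected interval is at least as wide as the uncorrected one, which mirrors the red whiskers extending beyond the black lines in Figure 2.7 (c). For independent groups, the resampling step would have to draw from each group separately.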

2.4 Visualization in the Digital Humanities
DH is an emerging research field that introduces digital methods to humanities
research. As a field, it covers a broad area ranging from literary research
through art and architecture to archaeology and history. Therefore, the goals,
data, and methods
are also numerous. Visualization can support the various research fields in un-
derstanding their complex data and in collaboration (Bradley et al., 2018). The
individual research communities in DH are small, but engaged. Although the
combination of fast-paced computer science-backed visualization research and
moderately-paced humanities resea