Visual Analysis of Spatio-temporal Patterns in Digital Humanities Data with Reproducible and Confidence-aware Workflows

Dissertation approved by the Faculty of Computer Science, Electrical Engineering and Information Technology of the University of Stuttgart for the conferral of the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by Max Franke from Chemnitz

Main referee: Prof. Dr. Thomas Ertl
Co-referee: Prof. Dr. Wolfgang Aigner
Date of the oral examination: June 12, 2024

Institut für Visualisierung und Interaktive Systeme, University of Stuttgart
2024

Contents

List of Figures
List of Tables
List of Abbreviations and Acronyms
Acknowledgments
Abstract
Zusammenfassung (German Abstract)
1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Contributions and Remaining Structure
2 Foundations
  2.1 Interactive Information Visualization and Visual Analysis
    Basic Principles of Information Visualization
    Visualization Domains
    Common Information Visualization Concepts
  2.2 Data Projection
    Dimensionality Reduction
    Geographical Projections
  2.3 Statistical Evaluation
  2.4 Visualization in the Digital Humanities
  2.5 Technologies Used
3 The “Dhimmis & Muslims” Project
  3.1 Project Description
  3.2 Collaboration
  3.3 Results: The Damast System
    Data
    Interactive Visualization
    Persisting Analysis Results in Textual Reports
    Supported Workflows
    Example Analysis
    Conclusions
4 Reproducibility and Confidence in the Digital Humanities
  4.1 Reproducibility in the Digital Humanities
    Reproducibility Typology and Pipeline
    Reasons for Reproducibility
    Strategies to Enable Reproducible Visualizations
    Reproducibility Beyond the DH
  4.2 Confidence in the Digital Humanities
    Confidence as a Primary Data Attribute
    Confidence and Text Data
    Incomplete and Missing Data
5 Analysis of Spatial Data
  5.1 Spatial Aggregation
  5.2 Non-contiguous Maps, Transitions, and Projections
    Non-contiguous Maps for Non-uniform Spatial Distributions
    Animated Transitions Between Spatial Viewpoints
    Spatial Projections for Animated Transitions
  5.3 One-dimensional Projections
    Approach
    Comparison of Projections
  5.4 Conclusions
6 Analysis of Temporal Data
  6.1 Visualizing Periodicity in Event Data
    Method
    Pre-calculation and Guidance
    Visual Representation
    Visually Mapping the Phase
    Case Study
    Discussion
  6.2 Visual Representations of Time
    Visualization of Time Series Data
    Binned Representations
    Qualitative and Quantitative Visualization
    Color
  6.3 Conclusions
7 Analysis of Spatio-temporal Data
  7.1 Spatio-temporal Patterns
    Outliers
    Hotspots
    Synchronization
    Trendsetters
    Spatial Expansion
    Synchronous Movement
    Coexistence
  7.2 Integrated Visualization of Space and Time
    LilyPads
    Mapping Periodicity Phase to Color or Shape in a Map
    Mapping Space-Time to 2D
  7.3 Separated Visualization of Space and Time
  7.4 Conclusions
8 Conclusion
  8.1 Summary of Chapters
  8.2 Discussion
    Research question 1
    Research question 2
    Research question 3
    Research question 4
  8.3 Future Directions
References
  Own Publications
    Peer-reviewed Publications
    Other Publications
  Literature

List of Figures

2.1 Visualized pixel density in print and on a computer monitor
2.2 Information visualization pipeline (Card et al., 1999)
2.3 Visual analytics model (Keim et al., 2008)
2.4 Simplified sensemaking model (Pirolli & Card, 2005)
2.5 Example Mercator projection and WebMercator tiles
2.6 Examples for azimuthal and two-point equidistant projection
2.7 Examples for confidence intervals and their pairwise differences
2.8 VarifocalReader (Koch et al., 2014)
2.9 VAiRoma (Cho et al., 2016)
3.1 Screenshot of Damast v0.2.0
3.2 Abstracted data schema for the “Dhimmis & Muslims” project
3.3 GeoDB-Editor: tabular data entry in Damast
3.4 Annotator: text-based data entry in Damast
3.5 Visualization component of the Damast system (v1.3.0)
3.6 Visual map glyph design in the Damast system
3.7 Tooltips for details on demand in the Damast visualization
3.8 The religion view of Damast
3.9 The map view and location list of Damast
3.10 The qualitative and quantitative timelines of Damast
3.11 The source and tag view of Damast
3.12 The Damast system’s visualization in confidence mode
3.13 Extract of a Damast report
3.14 Workflows supported by the Damast system
3.15 Damast timeline for example analysis
4.1 Reproducibility pipeline
4.2 The historical data processing pipeline
4.3 Example of confidence filters in Damast
4.4 Explicitly showing filtered-out data in Damast
5.1 Inhomogeneous distribution of cities in Europe
5.2 Unaggregated and aggregated map in Damast
5.3 Details-on-demand for aggregated location data in Damast
5.4 Examples for inset, offset, and proxy maps
5.5 The interface of LilyPads
5.6 Word cloud in LilyPads
5.7 Examples for map insets in LilyPads
5.8 Examples for tooltips with details on demand in LilyPads
5.9 Selections of multiple items in LilyPads
5.10 Parts of a geo-located graph with inhomogeneous distribution
5.11 Placement of proxy maps for off-screen positions
5.12 Screenshot of our ego-perspective visualization approach
5.13 Animated ego-perspective transition in a geo-located network
5.14 Selection of most relevant proxy maps during animated transition
5.15 Comparison of geodetic lines in different map projections
5.16 Example projections for TPEQD for different distances
5.17 Direction task of the projection comparison study
5.18 Example stimulus task for the projection comparison user study
5.19 Evaluation results for the projection comparison user study
5.20 Pairwise differences for the projection comparison user study
5.21 Schematic workflow of hierarchical projection of space to 1D
5.22 Visual support to compare projections from geo-space to 1D
5.23 Different spatial distributions before 1D projection
6.1 Components of the periodicity visualization approach
6.2 Schematic explanation of aggregated periodicity visualization
6.3 Example patterns for periodicity in different representations
6.4 Example visual mappings of phase in a scatter plot
6.5 Mapping of year and day of year to space in a scatter plot
6.6 Time series visualization comparison user study example stimuli
6.7 Time series study accuracy CIs
6.8 Time series study completion time CIs
6.9 Time series study pairwise accuracy differences
6.10 Time series study pairwise completion time differences
6.11 Color mappings in LilyPads
7.1 Spatio-temporal pattern examples in LilyPads (1)
7.2 Spatio-temporal pattern examples in LilyPads (2)
7.3 Spatio-temporal pattern examples for periodicity
7.4 Screenshot of approach where geo-space is mapped to 1D
7.5 Synthetic expansion pattern with 1D spatial projection
7.6 Synthetic trendsetter pattern with 1D spatial projection
7.7 Spatio-temporal pattern examples for COVID-19 data
7.8 Spatio-temporal pattern examples in wildfire data (1)
7.9 Spatio-temporal pattern examples in wildfire data (2)
7.10 Spatio-temporal pattern examples in wildfire data (3)
7.11 Spatio-temporal pattern examples in Damast (1)
7.12 Spatio-temporal pattern examples in Damast (2)
7.13 Spatio-temporal pattern examples in Damast (3)

Note on figure labeling. Figures are numbered within chapters (e.g., figure 2.3 in chapter 2). Some figures consist of two or more subfigures. Subfigures are captioned separately, and are numbered within the figure with lowercase letters (e.g., figure 5.8b). In addition, points or regions of interest within figures may be labeled 1 for later reference. These are numbered with Arabic numerals, and referenced as follows: figure 3.5 label 3, figure 7.4 label 8.

List of Tables

4.1 Reproducibility strategy aspect coverage
5.1 Quality measures for 1D projections of example distributions
6.1 Phase histograms and quality measures for example datasets
7.1 Examples for spatio-temporal patterns

List of Abbreviations and Acronyms

azeqd Azimuthal equidistant projection. An azimuthal map projection which is defined by one geographical location. Distances from this node to any point are represented without distortion. This is a special case of the two-point equidistant projection (tpeqd) where both nodes are identical, and is described in more detail in section 2.2.2.

tpeqd Two-point equidistant projection. A map projection which is defined by two geographical locations. Between these two nodes, the projection is not distorted, and distances between the nodes and any point are represented undistorted. Described in more detail in section 2.2.2.

BCa bootstrapping Bias-corrected and accelerated bootstrapping. A bootstrapping method that uses random sampling with replacement of a known sample to estimate confidence intervals (Efron, 1987). See section 2.3 for more details on statistical evaluation.

t-SNE t-distributed stochastic neighbor embedding.
A non-linear dimensionality reduction (DR) technique introduced by van der Maaten and Hinton (2008) that works well for high-dimensional data with clear clusters.

1D One-dimensional. A space with only one coordinate. In visualization research, time is a typical one-dimensional data space.

2D Two-dimensional. A space spanned by two coordinates. In visualization research, many spatial (and non-spatial, see section 2.2.1) datasets are simplified to a 2D space, because conventional displays are two-dimensional.

3D Three-dimensional. A space spanned by three coordinates. In visualization research, many datasets based in the real world (e.g., simulation data or computer tomography scans) are three-dimensional in their spatial attributes (see section 2.1.2).

AHC Agglomerative hierarchical clustering. A bottom-up clustering approach described in detail by Sibson (1973). In each step, the two closest clusters are merged into one based on a given distance metric and linkage criterion, until only one cluster remains. The resulting binary tree can be cut at an arbitrary height to produce a set of clusters at least a given distance apart.

API Application programming interface. A defined interface between two parts of the same program, or between different programs.

cDMD Constrained dynamic mode decomposition. An extension of dynamic mode decomposition (DMD) proposed by Krake et al. (2022), in which additional constraints regarding the expected period length can be specified.

CI Confidence interval. A value interval within which a sample mean lies with a certain probability, often 95%, see section 2.3.

COVID-19 Coronavirus Disease 2019. A disease caused by the SARS-CoV-2 virus, which led to a pandemic starting in March 2020 (WHO, 2020).

CVD Color vision deficiency. Also “color blindness”. CVD describes a reduced capability to perceive colors and differences between colors.
The most common form of CVD affects the ability to distinguish colors along the red-green axis.

DH Digital humanities. A hybrid research field combining humanities research (e.g., history) with digital methods and support. This opens up more quantitative, automated analyses on larger datasets to humanities researchers.

DMD Dynamic mode decomposition. An approach to extract dynamic temporal changes from sampled data presented by Schmid (2010).

DOI Digital object identifier. A unique identifier for the persistent identification of digital objects. Examples are digital documents such as papers, and datasets in long-term data repositories (LTRs). DOIs are a type of uniform resource identifier (URI).

DoI Degree of interest. A measure for the relevancy of different, related data items. Different approaches can be taken to obtain such a measure; Furnas (1986) introduced DoI measures to determine which other leaves and subtrees of a hierarchy to show depending on a selected leaf node.

DOM Document object model. The programmatic data model that underlies documents written in, for example, extensible markup language, scalable vector graphics (SVG), or hypertext markup language (HTML). Each node of the document tree is represented by an object, and objects and hierarchies can be manipulated programmatically.

DR Dimensionality reduction. A family of techniques that project an n-dimensional dataset into an m-dimensional space, m < n; see section 2.2.1.

DTW Dynamic time warping. An algorithm that correlates and aligns two time series and measures their similarity (Sakoe & Chiba, 1978).

FRP Fire radiative power. The power emanated by a fire (e.g., a wildfire) in thermal radiation. This is a common data point in satellite-based wildfire monitoring (e.g., Giglio et al., 2020), and is usually measured in megawatts.

GPU Graphics processing unit.
A dedicated piece of hardware in a computer that is optimized to do low-level graphical computations quickly with high parallelism.

HTML Hypertext markup language. A markup language which is used for web pages.

HTTP Hypertext transfer protocol. A protocol for transferring data and files over a network. HTTP is predominantly used (with and without an additional encryption layer) for the communication between a web client (e.g., a browser) and a server.

LOD Level of detail. In visualization, the level of detail in which information is shown depends mostly on the available space: In a more restricted space, only a rough overview of data can be shown. When a subset of the data is selected (e.g., by zooming in), more details can be shown. See also semantic zooming in section 2.1.1.

LTR Long-term data repository. A repository for the long-term storage of digital data. Here, “long term” can mean that the goal is indefinite storage. LTRs are essential for long-term reproducibility of scientific results. Examples for LTRs used alongside the results presented in this thesis are Zenodo¹ and the Data Repository of the University of Stuttgart, Germany² (DaRUS).

MCV Multiple coordinated views. This concept from interactive visualization research describes a collection of two or more views, which each visualize different aspects of the data. These views affect each other; that is, they are coordinated (see section 2.1.3).

MDS Multidimensional scaling. A linear DR method (see section 2.2.1).

NER Named-entity recognition. A set of techniques for the identification and classification of named entities in unstructured natural language text. Examples for named entities are person or corporation names, places, and points in time.

NLP Natural language processing. A sub-field of computer science and set of techniques concerned with the computational parsing, understanding, and manipulation of natural language (e.g., English).
NOAA National Oceanic and Atmospheric Administration. An agency of the government of the United States of America that is responsible for weather forecasts and the monitoring of the conditions of oceans and atmosphere, among other things.

OCR Optical character recognition. Automated methods that generate machine-readable text from optical images (i.e., scans) of hand-written or printed documents.

¹ https://zenodo.org/
² https://darus.uni-stuttgart.de/

p.p. Percentage point. A measure of the difference between two values given in percentage units. For example, for two values a = 20% and b = 10%, b is 50% of a, but the difference between them is a reduction of 10 p.p., expressed in relation to the value that both percentages refer to.

PCA Principal component analysis. A linear DR technique that projects an n-dimensional dataset into the m-dimensional subspace (m < n) spanned by the eigenvectors of the dataset’s covariance matrix with the m largest eigenvalues; see section 2.2.1.

REST Representational state transfer. An application programming interface (API) paradigm where paths describe resources via URIs. Actions on resources (e.g., hypertext transfer protocol (HTTP) verbs like GET or DELETE) specify what should happen to the referenced resource.

SOMs Self-organizing maps. A non-linear DR technique introduced by Kohonen (1981) that works well for small datasets with high dimensionality.

STL Seasonal trend decomposition based on Loess. An approach to decompose a time series into a seasonal, a trend, and a remainder (noise) component, proposed by R. B. Cleveland et al. (1990).

SVG Scalable vector graphics. A markup language format for graphics that is based on the extensible markup language (XML). Graphical primitives such as paths, rectangles, or circles are declared as document object model (DOM) nodes. SVG images are scalable, as opposed to raster image formats.
They are therefore useful on the web, where sizing of DOM elements is dependent on the client’s viewport.

UI User interface. The interface between a human user and a computer. Within this thesis, the term refers to a graphical user interface that consists of visualizations as well as control elements (e.g., buttons) that users can interact with.

UMAP Uniform manifold approximation and projection for DR. A non-linear, general-purpose DR technique introduced by McInnes and Healy (2018) and McInnes et al. (2020).

URI Uniform Resource Identifier. A unique string that identifies a resource. In digital humanities (DH) data collections, entities are usually referenced by their URI. For example, the city Aleppo has the path /place/18 in the Syriaca.org data collection (Vanderbilt University et al., 2014), and the URI https://syriaca.org/place/18.

Acknowledgments

This thesis and the research it presents would not have been possible without the help of several people, to whom I am very grateful. First and foremost, I thank my advisor Thomas Ertl for giving me the opportunity to do exciting research for my doctoral degree. He always had critical questions and insights that helped to refine and improve my work. I am also very grateful to Steffen Koch for his continuous support and guidance. Our regular discussions were always interesting and productive. Steffen’s input was essential to explore many fun ideas, and to develop good ones further. I would like to thank both Thomas Ertl and Steffen Koch for providing me with funding to visit various conferences in exciting places around the world. Finally, I am grateful to Wolfgang Aigner for taking the time to review my dissertation, and for traveling to Stuttgart to attend my defense.

I very much appreciate the support and assistance of my co-authors, without whom this work would not have been possible: Ralph Barczok, Tanja Blascheck, Markus John, Jana Keck, Moritz Knabben, Steffen Koch, Kuno Kurzhals, Julian Lang, Henry Martin, Guido Reina, and Dorothea Weltecke. Many engaging discussions with other colleagues and collaborators also helped shape and refine my research. I especially want to thank Frank Heyen, Florian Jäckel, Daniel Klötzl, Tim Krake, Michael Sedlmair, and Daniel Weiskopf. To the various colleagues with whom I shared both on- and off-topic discussions during lunch and coffee breaks, and in particular to Markus John and Franziska Huth, who endured sharing an office with me, I want to extend my gratitude.

I would also like to acknowledge the contributions of several students who, over the years, helped to explore research directions I did not have the capacity to pursue at the time. I acknowledge, in particular, the exceptional work of Leon Gutknecht, Benjamin Hahn, Alexandra Hirsch, Julian Lang, Alexander Riedlinger, Ingo Schwendinger, Markus Stengel, Ba-Anh Vu, and Joel Waimer.

To my friends and family, I am eternally grateful for their continuous love and support. I would not have come this far without the fascination for science and critical thinking, computers, and cartography they infected me with. I also want to thank my sensei, Sabine, for helping me find balance, both mentally and physically. Finally, I appreciate all my favorite people and dogs who encouraged me to take a break from time to time, explore the outdoors, and relax.

Legal acknowledgments. Most prototypes presented in this work used map data, map tiles, or both from OpenStreetMap.
The figures showing these prototypes are captioned “Map tiles © OpenStreetMap contributors.” OpenStreetMap™ is open data, licensed under the Open Data Commons Open Database License (ODbL) by the OpenStreetMap Foundation.³ Other prototypes, where the OpenStreetMap map tiles with modern-day political borders and infrastructure were not appropriate, use map tiles with only a shaded relief. These were created by me (Franke, 2024) based on NASA SRTM (National⁴ Aeronautics and Space Administration Shuttle Radar Topography Mission) topographic height data (NASA JPL, 2013), and are published under the Creative Commons Attribution 4.0 license. In addition, the maps in these prototypes use vector map material by Natural Earth⁵ for oceans, lakes, and rivers. The NASA SRTM data, as well as the Natural Earth data, are in the public domain. The respective figures are captioned “Map tiles © 2024 Max Franke.” Wikipedia and Wikidata were also used as data sources for various figures and datasets. Wikidata’s data is published under the Creative Commons CC0 License. Wikipedia’s text data is published under the Creative Commons Attribution-ShareAlike 3.0 Unported License. GNU Parallel (Tange, 2022) was used for parallelization and job control in various data preprocessing steps described in this work.

³ https://www.openstreetmap.org/copyright
⁴ United States of America
⁵ https://www.naturalearthdata.com/

Abstract

Recent decades have seen a consistent increase in the availability of digitized data, as well as an increase in dataset volumes. The digital humanities (DH) emerged as a consequence of this development and brought digital storage, automated analysis methods, and digital presentation to humanities research fields such as literary research and history.
In DH data in particular, interpretation and human judgment are essential to contextualize the data, which is produced and relayed by humans, not sensors or simulations. Hence, the provenance and trustworthiness of the data are essential information for an objective analysis. This is especially true for historical data, where pieces of information can contradict each other, sources may exaggerate or lie, and the data is generally inhomogeneous and incomplete. The analysis of such data adds another level of human interpretation and judgment, which need to be recorded to become part of later sense-making and reproducibility efforts.

The human capability to make sense of data, and to recognize patterns and structure in it, declines with increasing complexity and size of the dataset. By transforming the data and representing it by visual structures, the powerful human visual apparatus can be harnessed to alleviate these shortcomings. Even so, the complexity and dataset size are limited. Hence, the correct choice of data transformations, visual mappings, and aggregation of details is essential to support human sense-making. Besides the general visualization design, good support of domain experts’ workflows is essential to encourage adoption of digital tools in their day-to-day work. Further, finding good solutions for the digital and visual representation of uncertain information and provenance data, as well as the recording thereof, improves general data quality as well as trust in the data and findings.

Previous work within and outside of visualization research has already studied the analysis of incomplete or uncertain data. In the DH, however, data often has a unique combination of small dataset size and high data complexity. This fact, in combination with the presence of contradictory statements in the data, still requires novel approaches for faithful representation to support objective analyses.
Especially given the nature of this data, analyses in the DH are seldom linear. The use of incremental loops of data foraging, knowledge mining, and sense-making has been explored in detail both inside and outside of visualization research. Research in the humanities usually moves more carefully and on a longer timescale. The established methods and workflows applied here need to work long-term, and the introduction of new methods from the fast-paced developments of the digital world still poses open challenges. In particular, provenance and reproducibility both of data and of analyses need to be guaranteed on a longer timescale to promote open science.

The high complexity of DH data also makes it more difficult to find interesting insights about patterns and relationships in the data. Such patterns can translate to visual patterns that can easily be recognized. Recent works have explored the nature of DH data and tasks, but the nature of the data patterns is very specific to the concrete domain and research questions. The complexity of the data poses a challenge to the designs of visual representations that reveal these patterns visually. Suitable visual transformations are particularly challenging for geographical and geo-temporal data, where an inherent placement given by the data precludes many layout techniques. Projection of the geographical space into less complex spaces may offer ways to enhance the data layout and to reveal patterns. Much research on map projections and dimensionality reduction already exists. However, both the elicitation of patterns from the data and the communication of spatial relations, direction, and distance to analysts in an intuitive manner in projected representations are still under-explored.

This thesis presents strategies for the visual analysis of complex, incomplete, and heterogeneous data often present in the DH.
Here, the entire data lifecycle, from data collection through visual analysis to the publication of results, is considered. In addition, the support of the backwards path through that lifecycle is considered as part of visualization approaches to provide data and analysis provenance, foster trust in the results, and promote reproducible and open science. To further aid more objective analyses, this work explores the use of qualitative confidence measures to record the quality and trustworthiness as a coequal part of the collected DH data, which can subsequently be used as part of the analysis. This thesis also explores the visualization and visual analysis of data with spatio-temporal components, which are often found in the DH. Here, this work proposes novel visualization techniques, as well as novel combinations of existing techniques, to elicit hidden patterns of interest from the data to support domain research questions. The use of separated and integrated representations of space and time is examined. For the integrated representations, this thesis examines the use of data transformations and projections that reduce the complexity of geographical data and the restrictions it imposes on the layout of the visualization. To this end, non-contiguous maps and the use of map insets to solve challenges with the level of detail in heterogeneous geographical distributions are employed. In addition, different geographical projections are compared regarding their suitability to communicate the spatial relationships between data items. Further, this thesis explores projections of geographical space to various one-dimensional, discrete orderings to elicit relationships between space and other attributes in a less complex layout, as well as the reduction of the temporal component of event data to a space of period length and phase to elicit hidden patterns of periodically recurring events.

The methods presented in this thesis were developed largely to support DH researchers in their domains’ research questions, with the characteristics and volumes of data typical in that field in mind. Still, these methods can be, and in some cases have been, extended and adapted for different research fields. The principles applied in the approaches presented within this thesis are, in general, extensible and domain agnostic, given certain preconditions in the data.

Zusammenfassung (German Abstract)

The size and availability of digital datasets have grown rapidly over the last decades, owing to the falling cost of storage and the rising computing power of computers. This technical progress also led to the emergence of the digital humanities (DH) and spurred the digitization of humanities content. The DH combine research in the humanities with digital storage and analysis methods, which enables humanities scholars to analyze much larger data collections at once. Recording the provenance and trustworthiness of data can add value for unbiased analyses in all fields, but especially in the DH: here, the data was created by humans and contains information about human work or societal aspects, and in the course of its creation it is often analyzed, interpreted, and assessed by humans several times. In historical data, which can already be very heterogeneous and incomplete, individual data points are moreover occasionally contradictory, or reality was depicted in a distorted or biased way for various reasons, up to and including deliberately false statements in sources. Here, recording the provenance and trustworthiness of the data, and its interpretation during data entry, is particularly important for obtaining an objective overall picture.
The visualization and analysis of the data are a further step in which human interpretation comes into play. To be able to retrace and reproduce the conclusions drawn from such analyses, it is therefore also important for this process to record and store core aspects such as data filters and visual parameters.

Without a suitable representation, it is difficult to recognize and understand patterns and relationships in larger or more complex datasets. The human visual system, however, has evolved over millions of years to detect regularities and outliers in static and moving images subconsciously. Automated visualization translates data into visual structures in which characteristics of the data are visually recognizable. This makes it possible to analyze and understand considerably larger and more complex datasets. A suitable choice of data transformations, of the mapping to visual primitives, and of the aggregation of details is nevertheless important, because scalability is limited. To make it easier for domain experts to adopt digital methods in their daily work, it is also important to understand their requirements and existing workflows and to complement these rather than trying to replace them. Particularly in the DH, a suitable digital and visual representation of the trustworthiness of the data and of its provenance, from its creation to its interpretation and digital entry by the domain experts, is essential. A workflow in which the provenance and trustworthiness of the data are recorded consistently increases the quality of the data stock in the long term, and thereby also the trust in the data and in the insights gained from it.

Research both inside and outside of visualization has investigated the analysis of incomplete or uncertain data.
Existing solutions, however, are difficult to apply to DH data, because the datasets are comparatively small, yet very complex and, especially in the case of historical data, partly contradictory and heterogeneous. For analyses that are as unbiased as possible, novel approaches are needed here to reflect these shortcomings of the data adequately. Because of the aforementioned properties of DH data, analyses here are seldom linear. The use of successive cycles of data foraging, knowledge enrichment, and insight generation has been researched in detail both inside and outside of visualization. Research in the humanities often has a long time horizon with established workflows that need to work in the long term. New working methods from the rapidly advancing digital world still lead to challenges here, particularly if existing workflows are to be complemented rather than replaced. For research that is open and accessible in the long term, the provenance, traceability, and reproducibility of data and analyses are essential.

Although the visual representation of data makes it easier for humans to recognize patterns in the data, the high complexity of DH data makes this harder again. The characteristics of the data and the problems to be solved in the DH have already been researched thoroughly. However, there are hardly any general solutions, because the DH are a very broad field of research and the data and research questions depend heavily on the respective field. Visual transformations that reveal patterns in the data are particularly hard to realize for geographical and geo-temporal data, because the intrinsic geographical positioning of the data strongly restricts the layout of the visualization.
By projecting the geographical space into a less complex space, the layout of the visualized data can be improved, which makes patterns more visible. Map projection and data projection are well-researched fields. Regarding the visibility of patterns and the intuitively understandable representation of spatial relations, direction, and distance, however, there is still potential for further research.

This thesis presents strategies for the visual analysis of complex, incomplete, and heterogeneous data as it often occurs in the DH. The entire lifecycle of the data is considered, from data entry through visual analysis to the publication of the analysis results. In addition, the reverse direction through the provenance of the data is regarded as part of the visualization approach. This allows the provenance and evolution of data and analyses to be retraced and reproduced, in order to strengthen trust in the analysis results and to promote reproducible and open research. To support more objective analyses, this thesis also investigates qualitative confidence values in order to treat the quality and trustworthiness of the collected DH data as coequal data attributes. Such attributes can thus be used in the analysis as well.

This thesis also investigates the visualization and visual analysis of data with spatio-temporal references, which frequently play an important role in DH data. In this context, this thesis proposes novel visualization approaches and combinations of existing approaches to make hidden, interesting patterns and relationships in the data visible and to support domain experts in answering research questions. Separated and integrated representations of space and time are examined.
For the integrated representations, data transformations and projections are employed that reduce the complexity and the layout restrictions in the visualization of geographical data. To this end, fragmented maps and embedded map insets with a higher level of detail are used to solve problems with the representation of details in data with a heterogeneous geographical distribution. In addition, different geographical projections are compared with regard to how well they can communicate the spatial relationships between data points. Furthermore, this thesis presents projections of geographical data to various one-dimensional orderings to make relationships between the spatial attribute and other data attributes more visible in a simpler layout. To find hidden patterns of periodically recurring events, the reduction of the temporal component of event data to a space of period length and phase is also examined.

The methods presented in this thesis were developed mainly for application in the DH and for DH research questions. They are therefore designed for the dataset sizes and characteristics typical of the DH. They can, however, also be extended and adapted to research questions from other fields, which has in part already been done. The principles used in the approaches presented in this thesis are transferable to further application areas, provided that their requirements regarding objectives as well as data volume and complexity are comparable.

Chapter 1
Introduction

This chapter outlines the advantages of interactive visualization to assist human analysts in understanding data that is either large in volume, complex, or both. Two digital humanities (DH) project collaborations that were part of the work leading up to this thesis are presented.
Based on these two projects, typical domain problems in the DH that interactive visualization can help solve are presented. The chapter further contains a brief summary of the state of the art in visualization research, as it pertains to those domain problems. From this, four research questions are formulated. Finally, an outline of the structure of the remainder of the thesis is given.

1.1 Motivation

Human capabilities to understand data quickly reach their limits for larger datasets, or for more complex datasets with many attributes. To support humans in understanding the data, visualization can be used. In visualization, data is mapped to visual structures (Bertin, 1983; Card et al., 1999). The human visual system, which has evolved to recognize visual patterns effortlessly (Wertheimer, 1923; Healey & Enns, 2011), can then recognize interesting phenomena much more easily even in larger datasets, given the right visual mappings. The scalability of visualization to aid human cognition can be increased further by introducing interaction on digital media: At first, an aggregated overview of the data is visualized. Analysts can then narrow down the data by specifying filters interactively, and query details for individual data items (Shneiderman, 1996). The correct choice of visual mappings and the preprocessing of the data, which is possibly computer-aided (Keim et al., 2008), can further assist the human analysts (see figure 2.3). This makes interactive visualization a useful technique for understanding relationships and patterns in large datasets.

Large parts of this thesis were created within the scope of projects that were collaborations with DH researchers. A shorter collaboration occurred within the “Oceanic Exchanges” project (Cordell et al., 2022). Here, the research focus was on the spread of information and misinformation in the 19th and early 20th century, before and during the construction of the first trans-oceanic telegraph cables.
This spread of information was observed through the lens of historical newspaper articles, focusing on a handful of historical events such as the eruption of the Krakatoa volcano in 1883, the outbreak of the Spanish-American war in 1898, or the political propaganda tour of the Hungarian revolutionary Lajos Kossuth through the United States of America in 1851 and 1852. The second and larger collaboration occurred within the “Dhimmis & Muslims” project (Weltecke & Koch, 2018). The research focus in this project was on the peaceful coexistence of different non-Muslim religious groups under Muslim rule in cities of the medieval Middle East. The project is described in more detail in chapter 3.

Both projects worked with historical data, which is fairly typical for a wide range of DH research. Information, and consequently data, in the DH is usually made up of discrete statements or information packets. These come from a variety of sources of different quality, and the data entry process involves interpretation and judgment of accuracy and meaning by domain experts. In the “Oceanic Exchanges” project, these characteristics manifested themselves through different newspapers in different languages, texts digitized by optical character recognition (OCR) with different qualities, and fragmented or incomplete newspaper articles. The incremental, manual data entry throughout the collaboration naturally led to an incomplete view on the situation at the time in the “Dhimmis & Muslims” project. This was exacerbated by source material that was actually missing. Here, the contradiction of different sources and their trustworthiness was also a large issue; for example, authors of one religious group would exaggerate their own group’s presence in an area and downplay that of other groups, if it was mentioned at all.
The interpretation of historical sources, for instance regarding the attribution of a statement to a specific city, was also challenging due to the large number of languages the sources were written in. As a consequence, data in these projects, and in DH in general, is heterogeneous, incomplete, and at times contradictory. Analyzing and communicating such data in a manner that does not skew the reader’s impression of the represented information, hence, becomes a challenge.

Data entry from multiple sources was a core component of the “Dhimmis & Muslims” project collaboration. However, on a larger scale, data entry is always part of DH projects, as the data originally is not digital. Digitization also requires the design and application of a data model, which can be challenging especially for the heterogeneous data found in DH research. Hence, both the interpretation of non-digital data during data entry, and the simplification of the data to adhere to the data model, might distort the data. Such distortions will cascade to the visual analysis of the data. The steps analysts take here to reach interesting insights might also be affected by their context and domain knowledge. As a final step, findings are then communicated to peers, who often only have condensed and aggregated result data or images available to them, not the entire data the findings were based on. Hence, the multi-step process of data entry, visual analysis, and publication of findings that are deemed interesting is affected in each step by human judgment and interpretation, possibly by multiple people. These judgments and the steps taken that lead to the recorded data, the interactive path through the visualization, and the published outcome (i.e., findings supported by a scientific paper, a screenshot, or a video) need to be communicated for better trust in the results.
At the same time, domain experts might want to inspect the data a visualization is based on, and explore and interact with the visualization a published finding was based on. More succinctly, the one-way workflow should be augmented to a two-way workflow to support reproducible scientific results and to ensure provenance of the information and analyses.

When analyzing data, researchers might look at correlations or behavior of interest that affect the entire dataset. Such relations are mostly easy to find, but can be interesting nonetheless. However, in many cases only small subsets of a dataset are of interest for a particular phenomenon. This is especially true with heterogeneous datasets. Such subsets of data often form meaningful patterns. The nature of these patterns depends on the data, the domain, and the concrete analysis tasks. In the DH, the patterns often appear in the spatial (often, geospatial), temporal, or spatio-temporal arrangement of the data. Visualization can assist in finding patterns, since the human visual system has evolved to quickly and effortlessly find and recognize meaningful patterns (Wertheimer, 1923; Healey & Enns, 2011).

Mapping data attributes to position is an integral technique in information visualization (W. S. Cleveland & McGill, 1985). A linear mapping of an attribute to a coordinate component is often used, and a pattern in the data would then in many cases become obvious in the visual representation. However, a non-linear projection of one or multiple attributes into the 2D display space can in some cases greatly improve the clarity of the aforementioned patterns, or reveal them in the first place. These projections can take many forms, one example being map projections that represent the three-dimensional, quasi-spherical earth on a two-dimensional map. Different map projections exist, which emphasize different properties of the spatial layout.
Going beyond geographical space, other data attributes can also be projected; for instance, time is often represented on a linear scale in visualization, but can also be represented non-linearly to highlight different phenomena in the data. Depending on the task and the data, different representations and different projections of the data might be better suited to emphasize or reveal patterns of interest, both within the context of the DH and beyond.

1.2 Research Questions

Previous work, both within and outside of visualization research, has explored the analysis of uncertain and incomplete data in detail (Eaton et al., 2003; Bisantz et al., 2005; Correa et al., 2009; Brodlie et al., 2012). While many previously proposed techniques can be applied to DH research, the small (but complex) datasets in combination with contradictory data still require additional considerations, which have not yet been thoroughly explored (Windhager et al., 2019a). The data of heterogeneous quality found in the DH also requires more qualitative assessment and interpretation by experts, which are not covered by quantitative methods of dealing with uncertainty found in other fields.

This nature of the data, in combination with the constant involvement of experts for interpretation, means that analyses in the DH rarely follow a linear path from data collection to results. Incremental data foraging, knowledge mining, and sensemaking (Pirolli & Card, 2005) have been discussed previously, also in the context of uncertainty (Sacha et al., 2015). The unique combination of requirements found in the DH still requires further research and a case-by-case assessment of the requirements.

Works on the nature of spatio-temporal patterns of interest (e.g., Viboud et al., 2006; G. Andrienko et al., 2011, 2013; Li et al., 2018) have already proposed suitable visualization techniques for many situations.
Techniques for the transformation and projection of data to elicit hidden patterns have previously been explored, for example, in the context of dimensionality reduction (DR) (Bunte et al., 2012; Ayesha et al., 2020), geographical projections (Snyder, 1997; Gosling & Symeonakis, 2020), and even within the DH (e.g., Jänicke et al., 2012). Some work within the DH (e.g., Šavrič et al., 2015; Robinson & The Committee on Map Projections, 2017) also considers the better communication of spatial relationships. Here as well, the specific nature of interesting patterns, and suitable transformations and projections, depends on the data and domain research questions. The design space is not yet fully explored, especially regarding DH data. As a consequence, the best choice of projection or data transformation for specific research questions is also under-explored.

This thesis aims to address the following research questions to support the requirements of DH scholars (see section 1.1) and to further the state of the art of visualization research:

RQ1 How can data that is heterogeneous, incomplete, or contradictory be communicated visually in an adequate manner to analysts without exaggerating its significance, or conveying a false sense of data density and completeness?

RQ2 How can the one-way workflow of data acquisition, visualization and publication be augmented to a two-way workflow to support reproducibility and provenance?

RQ3 What common types of patterns that indicate interesting phenomena typically appear in the spatio-temporal aspects of DH data, and how can visualization assist domain experts in finding and analyzing them?

RQ4 What types of projections can be used on the spatio-temporal attributes of such data to reveal interesting patterns (RQ3), and how can we determine whether the projections are helpful or not?

1.3 Contributions and Remaining Structure

This thesis is structured thematically, rather than by publication. Therefore, individual research questions (see section 1.2) are addressed in multiple chapters.

Contributions of papers. The contributions presented in this thesis are based mainly on my own, previously published work. Because of the thematic structure, some papers are split up into multiple chapters. In the following, these seven previous publications are described briefly.

In our workshop paper (Franke et al., 2019) presented at the 4th Workshop on Visualization for the Digital Humanities (VIS4DH) in Vancouver, Canada, we argued for confidence to be included as a first-class data attribute in DH data and analyses. We chose the term “confidence” to highlight the qualitative nature of uncertain data in the DH, as well as the human factor present in data collection and interpretation. By treating the confidence of collected data as part of the data, it can be included into analyses as a filter criterion, and can be visualized. This provides analysts with a more objective picture on heterogeneous, incomplete, and contradictory data (RQ1).

In our paper (Franke et al., 2020) presented at the 11th International Conference on Information Visualization Theory and Applications (IVAPP) in Valletta, Malta, we presented LilyPads, an interactive visualization approach for the interactive visual analysis of spatio-temporal dissemination in historical newspaper articles. LilyPads explores the use of broken-up inset maps to visualize a heterogeneous geographical distribution with a higher level of detail (LOD). The inset maps are placed relative to a chosen geographical perspective to uphold the spatial relationships between places for analysts (RQ4). The empty space between the map insets can be used to visualize other data attributes such as temporal data (RQ3) and textual contents of the newspaper articles.
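
LilyPads places its inset maps relative to a chosen geographical perspective. As a rough illustration of what such a placement needs, the following Python sketch derives direction and distance from a perspective point with the standard great-circle bearing and haversine formulas, and maps the direction onto a circle in screen space. It is not the LilyPads implementation: the function names, the circular layout, the spherical Earth radius of 6371 km, and the example coordinates are all assumptions made for this sketch.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2,
    in degrees clockwise from north (0 <= bearing < 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def inset_anchor(perspective, place, center=(0.0, 0.0), radius=1.0):
    """Hypothetical layout rule: put the inset for `place` on a circle
    around `center`, in the compass direction seen from `perspective`.
    Screen y grows downward, so north maps to negative y."""
    bearing = math.radians(initial_bearing_deg(*perspective, *place))
    return (center[0] + radius * math.sin(bearing),
            center[1] - radius * math.cos(bearing))


# Hypothetical (lat, lon) coordinates, chosen only for illustration.
perspective = (0.0, 0.0)
places = {"east": (0.0, 10.0), "north": (10.0, 0.0), "west": (0.0, -10.0)}
anchors = {name: inset_anchor(perspective, p) for name, p in places.items()}
```

A layout built this way preserves direction from the perspective point but not distance; the distance from `haversine_km` could, for example, be encoded separately or used to order insets that share a similar bearing.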

In our paper (Franke et al., 2021a) presented virtually at EuroVis and published in Computer Graphics Forum (CGF), we explored projections from geographical space to a one-dimensional (1D) ordering. These projections (RQ4) simplify the layout of data items, and their relation with other data attributes (e.g., time-dependent data) can be visualized statically to reveal spatio-temporal patterns of interest (RQ3). Additional quality measures, guidance, and interactions help understand where neighborhoods in the 1D ordering correspond to neighborhoods in the geographical space.

In our paper (Franke et al., 2022a) presented at the 13th IVAPP held virtually in Zurich, Switzerland, we performed a comparative study of three visualizations for multiple time series: line charts, aligned area charts, and stream graphs. We compared these regarding three low-level tasks that are commonly part of visual analyses (RQ3): identifying the time series with the largest value at one given point in time, identifying the time series with the largest integral under the curve between two given points in time, and deciding at which of two given points in time the sum of all time series’ values is larger. We found trends and weak evidence for the suitability of different visualization techniques for the three tasks.

In our paper (Franke & Koch, 2023a) presented at the 14th IVAPP in Lisbon, Portugal, we presented Damast, a visual analysis approach for the analysis of the peaceful coexistence of different religious groups under Muslim rule in the medieval Middle East. Damast visualizes the spatial and temporal aspects of the data (RQ3), as well as other aspects such as religious groups and literary data sources. Incomplete data and confidence are visualized explicitly and can be used as part of a visual query (RQ1). Damast supports data entry, visual data analysis, and the publication of results.
In addition, the interactive visualization is reproducible from published results, and analysts can navigate to the source data the visualization is based upon to allow for iterative knowledge generation cycles (RQ2).

In our Visualization Notes paper (Franke et al., 2023a) presented at the 16th IEEE Pacific Visualization Symposium (PacificVis) in Seoul, Korea, we argued for reproducible workflows (RQ2) in visualization research. We discussed different aspects that affect how well visualization results can be reproduced, such as the required time frame and data granularity, the reason for reproducibility, the type of data, and the different input parameters that need to be considered. We highlighted the importance of reproducibility and suggested a set of strategies that can be applied and combined to support reproducible workflows.

In our short paper (Franke & Koch, 2023b) presented at the 2023 IEEE VIS conference in Melbourne, Australia, we presented an approach for the interactive exploration of hidden periodicity patterns (RQ3) in event data. Our approach calculates a cheap-to-compute phase histogram of the data for thousands of possible period lengths. Analysts are guided towards period lengths of interest with two quality measures, as well as interactive suggestions and details on demand. Our approach serves as an interactive, overview-first exploration tool for the data, where analysts can later apply advanced, but computationally more expensive methods to individual period lengths of interest.

Remaining structure. Important foundations for this work, as well as previous related work, are discussed and organized in chapter 2. Chapters 3 to 7 contain the main content of this thesis, grouped by topic as summarized in the following. Chapter 8 provides a conclusion of the thesis, as well as an outlook.

Chapter 3 describes the “Dhimmis & Muslims” project, which was a collaboration with historians of religion.
The historians’ main research interest within the project concerned the peaceful coexistence of different religious groups in the medieval Middle East. Here, the changes in co-location of religious groups over space and time (RQ3) were of interest. In addition, challenges and possible solutions of heterogeneous, incomplete, and contradictory data in the DH (RQ1) are discussed. Finally, the role of reproducible workflows, as well as data and analysis provenance (RQ2), are discussed in the context of the project. One outcome of the project was the visualization approach Damast. The IVAPP paper describing Damast (Franke & Koch, 2023a) provides the basis of this chapter. The parts of the VIS4DH workshop paper (Franke et al., 2019) which describe the data model and the inclusion of confidence measures in Damast are also included. Finally, Damast was used as a case study in our PacificVis paper (Franke et al., 2023a) on reproducibility, and this part of the paper is included in the chapter as well.

Chapter 4 highlights the importance of reproducibility and confidence in the DH. Reproducibility (RQ2) is an essential step towards open science. In humanities research and the DH, data provenance and the mental reproducibility of thought processes are required for a long time frame. Large parts of our PacificVis paper (Franke et al., 2023a) make up the basis of the reproducibility part of this chapter. In addition, parts from the Damast paper (Franke & Koch, 2023a) pertaining to reproducibility are included here. We proposed the term “confidence” to delineate the qualitative judgments of uncertainty and heterogeneity in DH data (RQ1) from the use of the term “uncertainty” in more quantitative contexts. Our VIS4DH paper (Franke et al., 2019) serves as the basis for the discussion of confidence in this chapter.
Our thoughts on confidence and uncertain DH data in the context of two DH collaborations from the Damast and LilyPads papers (Franke et al., 2020; Franke & Koch, 2023a) are recapitulated here.

Chapter 5 describes the visual analysis of data with spatial context. Here, the use of spatial aggregation and its challenge regarding heterogeneous data with low confidence (RQ1) is discussed. In addition, the use of projections to one- and two-dimensional space to elicit hidden patterns from the data (RQ3, RQ4) is proposed. Spatial data were of relevance in the LilyPads and Damast papers (Franke et al., 2020; Franke & Koch, 2023a), as well as our EuroVis paper (Franke et al., 2021a). The parts of these papers concerning spatial projection, aggregation, and the visualization of geospatial data are recapitulated in this chapter. The part of our VIS4DH paper (Franke et al., 2019) concerning aggregation of spatial data is also included.

Chapter 6 discusses the visual analysis of time-dependent data, which can be used as part of the discovery and analysis of spatio-temporal patterns (RQ3). The chapter recapitulates our VIS short paper (Franke & Koch, 2023b) that proposes a data transformation (RQ4) and visualization approach to elicit hidden periodicities in event data. The chapter further recapitulates our IVAPP paper (Franke et al., 2022a) comparing different visualization techniques for time series data regarding their suitability for different tasks. Finally, the merits of different visual representations of time in different visualization approaches are summarized. The parts of the LilyPads and Damast papers (Franke et al., 2020; Franke & Koch, 2023a), as well as our EuroVis paper (Franke et al., 2021a), regarding visualization of temporal data are included here.

Chapter 7 addresses the analysis of data with a spatio-temporal component (RQ3). The chapter presents patterns that can be of interest both within and outside the DH.
Both the use of visualization techniques that combine space and time in one view and the separated visualization in multiple views are considered. The work discussed here is a combination or extension of parts of work covered in chapters 5 and 6. The chapter summarizes the previously presented techniques (Franke et al., 2020, 2021a; Franke & Koch, 2023a, 2023b) regarding spatio-temporal visualization and differentiates between separated and integrated representations of space and time. Different types of spatio-temporal patterns are listed, and examples for such patterns and their analysis from the presented approaches are shown.

Chapter 2
Foundations

This chapter introduces concepts and techniques that are essential prerequisites for the rest of the thesis. In section 2.1, the foundations of visualization and information visualization are explained. Section 2.2 introduces two types of data projection that are used in this thesis: dimensionality reduction and geographical projection. The principles of quantitative statistical evaluation using interval estimation, which is used in the user evaluations presented in this thesis, are explained in section 2.3. Section 2.4 gives an overview of the challenges and the state of the art of visualization in the digital humanities.

2.1 Interactive Information Visualization and Visual Analysis

Information visualization is a large and broad sub-field of the even larger field of visualization. Certain principles, concepts, and techniques have emerged in typical use within this field. A subset of these, which are also used in the approaches presented in this thesis, are described in the following.

2.1.1 Basic Principles of Information Visualization

Information visualization harnesses the parallel processing capabilities of the visual cortex of the human mind (Ware, 2004).
Over millions of years, this part of the human brain evolved to recognize certain patterns pre-attentively (Healey & Enns, 2011); that is, without requiring any conscious or time-intensive thinking. The Gestalt laws (Wertheimer, 1923) are based on pre-attentive perception and describe patterns that the human visual system will recognize without much thought. The law of proximity describes how the human visual system will group individual entities into clusters if they are close together. The law of similarity describes how visual items of similar characteristics (e.g., color or shape) are grouped together regardless of position. The law of connectedness describes how lines between visual items instantly imply a relation between the items. The laws of continuity and closure describe how the human visual system can complete shapes where parts are missing, and group individual subsets of items based on their probable relatedness, even if parts of the groups are not visible. Wertheimer describes additional Gestalt laws, but the aforementioned ones are especially relevant to information visualization.

Figure 2.1: A visual comparison of the possible pixel density between print media and computer monitors. At a density of 600 dots per inch, a DIN A0 paper printout (1189 mm × 841 mm) can show 269.1 times more pixel data than a full HD (1920 px × 1080 px) computer monitor, and 67.3 times more pixel data than a 4K (3840 px × 2160 px) computer monitor. (Areas compared in the figure: DIN A0 at 600 DPI and 300 DPI, DIN A2 at 600 DPI and 300 DPI, DIN A4 at 600 DPI, the VISUS Powerwall (10800 × 4096 px), a 4K monitor (3840 × 2160 px), and a full HD monitor (1920 × 1080 px).)

Patterns in the data, such as tight grouping in two or more aspects, linear relationships, or similar groups represented by color or shape, are recognized pre-attentively.
This makes information visualization a powerful technique for understanding relationships in larger datasets, where looking at the data itself would not yield any appreciable insights.

For visualization, different visual variables (Bertin, 1983) (or visual channels (Chen & Floridi, 2012; Munzner, 2014)) can be utilized to encode different aspects of data. To a degree, different visual variables can be combined to encode multiple data attributes in the same visual marks (Munzner, 2014). Examples of visual variables are the position, size, shape, and texture of visual marks, as well as their color (hue, value, and saturation). W. S. Cleveland and McGill (1985) explored the clarity of encoding for different visual variables for quantitative data, and found position and length to be the most suitable encodings. For ordinal or nominal data (Stevens, 1946), position is still the best choice, but other visual variables can be more suitable (Mackinlay, 1986).

The volume and density of data often make a visual representation challenging. Especially when considering the limitations even of modern computer monitors (see figure 2.1), it is hard to represent both local structures and the dataset in its entirety accurately. Shneiderman’s mantra of “overview first, zoom and filter, then details-on-demand” (Shneiderman, 1996) has been a governing principle for interactive, computer-aided visualization: To circumvent the limited screen real estate, the initial overview shows only an aggregated representation of the data. Users can then zoom and filter into subsets of the data, which are then shown in more detail.

The information visualization pipeline (Card et al., 1999), shown in figure 2.2, describes how interactions in different parts of the pipeline can affect the results. For computer-aided visualization, the creation of uniformly structured data is essential for an effective, rule-based visualization (see section 2.1.2). Based on these rules, the structured data is subsequently mapped to visual structures or visual marks (Munzner, 2014). By transforming the view, for example through zooming and panning, a subset of the data domain and the visual marks can be shown. Users can interact with all stages of the pipeline; for instance, they can affect how the raw data is transformed to structured data (e.g., by changing the aggregation level), change the visual mapping of different data attributes, or zoom and pan to subsets of the data. Shneiderman’s mantra applies to all stages of interaction, owing to the concept of semantic zooming (Bederson et al., 1996): Zooming happens in the rendering stage, filtering in the filtering stage. At the same time, the reduction in data shown, and the smaller data domain shown in the screen real estate, allows for more details and affects the visual mapping stage.

Figure 2.2: The information visualization pipeline (Card et al., 1999) describes the formal stages: Raw data is structured. The structured data is then mapped to visual structures. Views of the visual structures are created via geometrical transformations, such as scaling and translation (zoom and pan). Users can interact with all stages of the pipeline. (Diagram: raw data, structured data, visual structures, and views are connected in sequence by data transformations, visual mappings, and view transformations; the user sees the views and interacts with each transformation.)

2.1.2 Visualization Domains

Computer-aided visualization¹ is a large discipline with different requirements, types of data, and techniques. Within this discipline, different sub-disciplines formed because of these individual differences.
Information visualization is the visualization of abstract data (Rhyne et al., 2003; Weiskopf et al., 2006), which does not contain an inherent spatial layout. As a consequence, the spatial layout of the visualization’s output (i.e., the image space) can be chosen freely. Different information visualization techniques exist for different data types. Examples are bar charts for singular quantitative values assigned to discrete entities; line charts for quantitative values dependent on another variable (such as time); visualizations of tree structures such as icicle plots (Kruskal & Landwehr, 1983), indented trees (Burch et al., 2010), or tree maps (Shneiderman, 1992; Bruls et al., 2000); or node-link diagrams to visualize general graph structures. Information visualization typically uses two-dimensional space for its representations to fully utilize the 2D representation domain of paper and, later, of the computer monitor.

In contrast, scientific visualization has its roots in rendering research and computer graphics (Rhyne et al., 2003; Weiskopf et al., 2006). Hence, the techniques discussed here are often closer to the hardware. In contrast to information visualization, the data visualized also has an inherent spatial layout. Typical examples for data visualized in scientific visualizations are 3D medical scans, flow or density fields obtained from measurements or simulations, or 3D surface models. Regardless of an often-present temporal component, this data typically has a three-dimensional spatial layout that is anchored in the real world. As a consequence, scientific visualizations usually use 3D rendering techniques.

The separation between scientific visualization and information visualization is largely due to historical reasons and their different origins. Both disciplines aim to solve similar challenges, albeit in different data and application domains. As such, the naming of the two disciplines is somewhat unfortunate.
In fact, Munzner stated: “scientific visualization is not uninformative, and information visualization isn’t unscientific” (Rhyne et al., 2003, p. 612). Recently, efforts in the visualization community (e.g., Rhyne et al., 2003; Weiskopf et al., 2006) have been made to bridge the gap, and to reduce the bias stemming from the original naming of the disciplines. Furthermore, visualization approaches around real-world-positioned data have been utilizing information visualization techniques, and vice versa (e.g., Sedlmair et al., 2009).

¹ The term “visualization” has vastly different meanings across disciplines. In the humanities, for instance, the term can refer to a mental sensemaking process. Within this thesis, the term “visualization”, when used without additional qualifiers, always refers to the computer-aided, rule-based transformation of data to a visual representation. The usage of the term implies interactivity: Users can interact with the automated process in all stages (see section 2.1.1) to affect its output.

Geographical visualization is another sub-discipline that lies between visualization and cartography (MacEachren et al., 1998). Often, abstract data is visualized in a geographical context, and so, information visualization techniques are utilized. At the same time, the geographical aspect of the data inherently affects the spatial layout of the visualization. As with scientific visualization, the expressive visual variable of position (see section 2.1.1) cannot be used freely. Hence, geographical visualization shares properties of both scientific and information visualization. Depending on the application domain, geographical visualization might lean into techniques from either sub-discipline more; for example, geographical visualizations showing air current vector fields, or altitude data, might lean more towards the scientific visualization’s 3D rendering techniques.
On the other hand, visualizations such as choropleth maps (Brewer et al., 1997; Slocum et al., 2014) or glyph-based visualizations (Borgo et al., 2013) in a map are closer to the techniques used in information visualization.

In the digital humanities (DH), data is overwhelmingly abstract. While real-world, 3D data exists, for example, in the form of archaeological ground scans (Eggert et al., 2013) or 3D models of historical artifacts (Martínez Carillo et al., 2008), most aspects of the data collected here are abstract. Examples are text passages and the entities and relationships described within, or subjective recollections of historical actualities by individual sources. At the same time, DH data often has a geographical component. As a consequence, the visualizations presented in this thesis either use information visualization techniques, or geographical visualization techniques that are closer to the information visualization side of the spectrum.

Visual analytics describes a methodology that includes the use of information visualization. The model proposed by Keim et al. (2008) and shown in figure 2.3 describes an incremental process, where knowledge is gained from data. This is done both by the interactive, visual exploration of the data, and by the use of computational models and automated data mining techniques. Insights gained from the visualization can be used to improve the automated models, and vice versa. The resulting knowledge can be used to further improve the data input into this process. The “models” described by Keim et al. do not necessarily have to be extensive artificial intelligence or large language models. In the work presented in this thesis, various clustering techniques, automated analyses, and regression were used. This still qualifies the respective approaches as using visual analytics. In the context of this thesis, especially for chapter 4, the work by Sacha et al. (2015) is relevant as well.
Their work extends the visual analytics model of Keim et al. to include uncertainty in the analysis process.

Figure 2.3: The visual analytics model, adapted from Keim et al. (2008). Knowledge is obtained from data both via interactive information visualization (top), and by automated data-mining methods (bottom). Insights gathered from both processes can be used to improve the other process, respectively. The knowledge gained through both processes can be used to improve and enrich the input data. (Diagram elements: data, visualization, models, and knowledge, connected by visual data exploration, data mining, and a feedback loop.)

2.1.3 Common Information Visualization Concepts

Various information visualization techniques exist that are commonly used in visualization approaches. Within the approaches described in this thesis, the techniques described in the following are used often.

MCV, faceted filtering, and brushing and linking. Multiple coordinated views (MCV) (Wang Baldonado et al., 2000; Roberts, 2007) is a concept where a visualization consists of more than one view. Each view shows a different, partial perspective on the dataset. This can mean that each view shows a different visual representation of the data, but typically, the visualizations in the different views also show different attributes of the data; for example, one view might show the geographical distribution of the data in a map, while another shows its temporal distribution in a density plot over time.

The views are coordinated, which means that interacting with one of the views will affect the other views as well. This coordination is realized, for instance, in filtering: Filtering in one view, for example by selecting a valid data range, affects the data shown in other views as well. Often, multi-faceted filtering (Weaver, 2004; Hearst, 2006) is used here: Filter selections within the same data attribute are combined by union, while selections between different attributes are combined by intersection.
This allows for very expressive, but still intuitive, filtering possibilities. In the example above, a user might have selected two disjoint areas in the geographical view, as well as a time range in the temporal view. For each singular data item, the decision whether it will be shown depends on whether its geographical aspect lies within either of the areas. However, its temporal aspect must also lie within the selected span. More formally, let $A_i$ be the data aspects with index set $I := \{1, \dots, |A|\} \ni i$, let $D = \mathop{\times}_{i \in I} A_i$ be the set of data items $d \in D$, and let there be a set of $n_i$ distinct filters on each singular data aspect, indexed by $J_i := \{1, \dots, n_i\} \ni j_i$:

\[ f_{A_i, j_i} \colon A_i \to \{\mathrm{true}, \mathrm{false}\} \]

Then, $D$ is filtered to $D_{\mathrm{filtered}} \subseteq D$ such that:

\[ D_{\mathrm{filtered}} := \left\{ d \in D \;\middle|\; \bigcap_{i \in I} \bigcup_{j_i \in J_i} f_{A_i, j_i}(d) \right\} \]

Within MCV visualizations, brushing and linking (Becker & Cleveland, 1987; Ward, 1994) is often used. This technique allows users to better understand the relationships between data attributes, and to anticipate the effects of singular restrictions on the faceted filtering. The technique consists of one interaction, the brushing, and a visual response, the linking. Brushing is the action of selecting a data range, or a set of visual marks, in one of the views. As a consequence, linking is implemented by highlighting visual marks representing the same data items in all views (for example, by reducing the saturation and lightness of other visual marks). Different visualization approaches implement brushing and linking differently. Brushing can, for example, be realized by simply hovering the mouse cursor over a data item, or by creating a range selection. Linking can be shown, for example, by highlighting the corresponding visual marks, or by desaturating the non-corresponding ones.

The approaches presented in this thesis all utilize MCV to a certain degree.
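To make the faceted filtering rule defined earlier in this section concrete, the following minimal Python sketch applies it to a handful of data items. It is an illustration under stated assumptions, not code from any of the presented systems: the function and helper names (`faceted_filter`, `in_box`, `in_range`), the attribute names, and the example items are all hypothetical. One further assumption is that an attribute without any active filter does not restrict the result.

```python
from typing import Any, Callable, Dict, List

# Assumption: a data item is a mapping from attribute name to attribute value.
Item = Dict[str, Any]
Predicate = Callable[[Any], bool]


def faceted_filter(items: List[Item], filters: Dict[str, List[Predicate]]) -> List[Item]:
    """Keep the items that match at least one filter of every filtered attribute.

    Filters on the same attribute are combined by union (any),
    filters on different attributes by intersection (all).
    """
    def keep(item: Item) -> bool:
        return all(
            any(predicate(item[attribute]) for predicate in predicates)
            for attribute, predicates in filters.items()
            if predicates  # assumption: no filters on an attribute means no restriction
        )

    return [item for item in items if keep(item)]


def in_box(west: float, south: float, east: float, north: float) -> Predicate:
    """Hypothetical helper: filter on a (longitude, latitude) position."""
    return lambda pos: west <= pos[0] <= east and south <= pos[1] <= north


def in_range(low: float, high: float) -> Predicate:
    """Hypothetical helper: filter on a scalar value such as a year."""
    return lambda value: low <= value <= high


# Made-up example data: two disjoint areas in the map view, one time range.
ITEMS = [
    {"name": "a", "position": (9.2, 48.8), "year": 850},
    {"name": "b", "position": (35.2, 31.8), "year": 900},
    {"name": "c", "position": (20.0, 40.0), "year": 870},   # outside both areas
    {"name": "d", "position": (9.5, 47.0), "year": 1200},   # outside the time range
]
FILTERS = {
    "position": [in_box(5, 45, 15, 55), in_box(30, 25, 40, 35)],
    "year": [in_range(800, 1000)],
}

SHOWN = [item["name"] for item in faceted_filter(ITEMS, FILTERS)]
print(SHOWN)
```

In this sketch, items a and b remain visible because each lies in one of the two selected areas and within the selected time span, whereas c fails the geographical facet and d fails the temporal facet.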
Faceted filtering is often implemented due to its intuitive use, and its expressiveness despite the simplicity of the concept. Brushing and linking are also used in most approaches to support the task of relating different data aspects.

Overview+detail, focus+context. When data is visualized within a large data domain, local structures can often get lost in the larger picture. On the other hand, when visualizing solely one local structure, the overall distribution of the data might get lost. Cockburn et al. (2009) describe overview+detail approaches, where the overall extent of the data domain of a view is visualized in a smaller representation, while the main view shows only a small part of the domain in larger detail. They also describe focus+context approaches. Here, the area of interest (e.g., the local structure) is visualized in a higher level of detail. Surrounding it, the context in the attribute space is visualized, often in a distorted manner to accommodate the larger local area of interest. Generalized fish-eye lenses (Furnas, 1986) are an example of this technique.

Figure 2.4: A simplified version of the sensemaking model as presented by Pirolli and Card (2005). Data is collected and structured from the world at large in the foraging loop. The collected data is analyzed and knowledge is generated in the sensemaking loop. These loops can be traversed multiple times; for example, when collecting additional data based on interim insights from existing data. The overarching reality or policy loop covers the entire process. (Diagram elements: world, structured data, and knowledge, connected by the foraging loop, the sensemaking loop, and the reality/policy loop.)

The sensemaking process. Pirolli and Card (2005) discuss the sensemaking process, which can be found in many visualization and visual analytics approaches. Figure 2.4 shows a simplified version of the sensemaking model, omitting the fine-grained stages and loops also discussed by Pirolli and Card (2005).
External data is collected, structured, and enriched with additional searches in the foraging loop. Collected data is analyzed, hypotheses are tested, and resulting knowledge is presented in the sensemaking loop. The overall process of data foraging and sensemaking is spanned by the reality or policy loop. The sensemaking loop can be supported by visual analysis. Visualization approaches that consider enriching or narrowing down the base dataset in addition to the visualization of data also support the foraging loop. The sensemaking process, applied to visualization, is related to the visual analytics model presented by Keim et al. (2008): The visual data exploration part of the model (see figure 2.3) mostly supports the sensemaking loop. The feedback loop from knowledge to data entry, as well as the data mining part of the model, support the foraging loop.

2.2 Data Projection

When the data that should be analyzed is high-dimensional, visualization can be challenging. For interactive information visualization, a two-dimensional (2D) representation is usually easier to understand. While additional attributes can be encoded in other visual variables, in some cases it can be valuable to reduce the dimensionality of the data before visualizing it. Section 2.2.1 presents general methods that transform high-dimensional data to a dimensionality that can suitably be encoded in a visualization. Section 2.2.2 discusses geographical projection, where positions on the quasi-spherical surface of Earth are projected to a flat 2D space.

2.2.1 Dimensionality Reduction

Dimensionality reduction (DR) is a family of techniques that projects data with n attributes into a subspace with m dimensions, m < n (van der Maaten et al., 2009; Ayesha et al., 2020). Often, m is set to 2, such that the low-dimensional projection can be visualized on a computer monitor.
The overall goal of DR is to preserve structures and patterns from the high-dimensional space in the low-dimensional space without introducing false positives (Venna & Kaski, 2001).

DR can be divided into linear and nonlinear methods. In linear methods, each of the low-dimensional attributes is defined as a linear combination of all high-dimensional attributes. Scatter plot matrices are a straightforward visualization approach that shows an n × n matrix of scatter plots, each of which shows an orthographic view on the data in the subspace spanned by two attributes. Principal component analysis (PCA) determines the m eigenvectors of the data's covariance matrix that have the largest eigenvalues and produces an orthographic view of the data within the subspace spanned by those eigenvectors.

Nonlinear methods are more flexible and can usually preserve complex high-dimensional structures better, but are more expensive to compute. Various methods exist, with their individual strengths regarding the quality of projection and the suitable dataset characteristics. Multidimensional scaling (MDS) (Torgerson, 1952) attempts to preserve pairwise distances between data points in the high-dimensional space. t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten & Hinton, 2008) preserves the distances between distributions, and hence works well for high-dimensional data with clear clusters. Self-organizing maps (SOMs) (Kohonen, 1981) project data by unsupervised machine learning, and work well for small datasets with high dimensionality.

The approaches presented in this thesis do not make use of many DR techniques. One approach, which projects spatial data to a one-dimensional (1D) ordering based on different methods (Franke et al., 2021a), and which is presented in more detail in sections 5.3 and 7.2.3, uses uniform manifold approximation and projection (UMAP) (McInnes & Healy, 2018; McInnes et al., 2020) as one projection method, but otherwise does not use DR techniques.
However, the approach uses quality measures that were developed for DR techniques. Namely, the M1 and M2 metrics proposed by Venna and Kaski (2001) for nonlinear projections, metric stress proposed for classical MDS (Torgerson, 1952; Goodhill & Sejnowski, 1996), and non-metric stress proposed for non-metric MDS (Kruskal, 1964; Goodhill & Sejnowski, 1996) are used. That said, the use of MCV visualizations (see section 2.1.3) could be seen as a special case of DR: Within each view, only a subset of data attributes is visualized; that is, the data is projected to an attribute subspace.

2.2.2 Geographical Projections

Mapping Earth’s surface on a 2D medium, such as paper or a computer screen, is challenging. Earth is roughly ellipsoidal, and can be approximated very well by a spherical geoid (Moritz, 2011). The spherical topology and curvature of Earth make it impossible to map its surface to a flat plane without introducing discontinuities, or distortions to angle, distance, or area. Various map projections exist that either preserve some of these desired properties at the cost of others, or offer a compromise solution where none of the properties is fulfilled (Snyder, 1997; Slocum et al., 2014).

Earth is geographically subdivided by the World Geodetic System. Lines of constant longitude run north-south between the poles. Lines of constant latitude run at a constant distance from either pole. A Great Circle is a circle (circle-like unless Earth is approximated as a perfect sphere) of maximal circumference around Earth. For any two points on Earth’s surface, a Great Circle exists through them. The shorter of the two Great Circle arcs is the shortest connection between the two points, the geodetic line between them. In contrast, a loxodrome is a line with constant bearing; that is, a line which at every point has the same angle between the direction of the line and the direction to the North Pole.
Loxodromes are either latitude lines, parts of longitude lines, or spirals that end up at either the North or South Pole at some point.

Figure 2.5: An example for Mercator projection. Areas towards the poles are exaggerated. WebMercator “slippy map” tiles are an efficient way to store pre-rendered map material as image tiles of different levels of detail. A square area of the Mercator-projected Earth is recursively subdivided into four smaller squares. Here, the root tile (a) of level 0, with indices 0/0, is subdivided (b) to four tiles of level 1 (label 1). The tile containing Stuttgart, Germany is subsequently subdivided into four tiles of level 2 (label 2). This happens two more times (labels 3 and 4). This process demonstrates the amount of map material that has to be loaded when zooming into a WebMercator map: Only a few tiles need to be loaded for each level of detail. (Panels: (a) Root tile, labeled 0/0/0; (b) Hierarchical subdivision, with tiles labeled level/x/y from 1/0/0, 1/1/0, 1/0/1, and 1/1/1 down to 4/8/4, 4/9/4, 4/8/5, and 4/9/5.)

Mercator and WebMercator. Mercator projection (see figure 2.5a) was invented in the 16th century by Gerardus Mercator. It is a cylindrical projection, in which longitude is mapped linearly, and a latitude φ is mapped nonlinearly to the vertical position y = ln(tan(π/4 + φ/2)). As a consequence, the mapped vertical positions tend towards positive and negative infinity as the latitudes go towards ±90°. This leads to a drastic increase in area for features closer to the poles, but also results in loxodromes being represented as straight lines in the projection. The latter point made Mercator projection indispensable for early celestial navigation and led to its widespread use.

In modern times, technology such as GPS has eliminated the need for Mercator projection in navigation. However, Mercator projection is still in widespread use, in particular in the form of WebMercator “slippy tile” maps.
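The recursive subdivision shown in figure 2.5 can be reproduced with a few lines of Python. The sketch below uses the standard spherical Mercator formula and the commonly used slippy-map tile indexing, in which x counts tiles from the western edge and y counts tiles from the northern edge. The function names and the approximate coordinates used for Stuttgart are assumptions made for illustration only.

```python
import math

# Latitude at which the spherical Mercator y-coordinate reaches pi, so that the
# projected area becomes a square (approximately 85.05 degrees).
MAX_LAT = math.degrees(2.0 * math.atan(math.exp(math.pi)) - math.pi / 2.0)


def mercator_y(lat_deg: float) -> float:
    """Spherical Mercator vertical coordinate (unit sphere) of a latitude in degrees."""
    phi = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4.0 + phi / 2.0))


def tile_index(lon_deg: float, lat_deg: float, zoom: int) -> tuple:
    """Return (zoom, x, y) of the tile that contains the given position."""
    n = 2 ** zoom  # number of tiles per axis at this level
    lat_deg = max(-MAX_LAT, min(MAX_LAT, lat_deg))  # clamp to the square area
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - mercator_y(lat_deg) / math.pi) / 2.0 * n)
    # Clamp indices so that positions on the outer edges stay inside the grid.
    return zoom, max(0, min(x, n - 1)), max(0, min(y, n - 1))


# Assumption: approximate (longitude, latitude) of Stuttgart, Germany.
STUTTGART = (9.18, 48.78)
TILES = [tile_index(*STUTTGART, zoom) for zoom in range(5)]

for zoom, x, y in TILES:
    print(f"{zoom}/{x}/{y}")
```

Under these assumptions, the sketch yields the tiles 0/0/0, 1/1/0, 2/2/1, 3/4/2, and 4/8/5, which agrees with the chain of subdivided tiles labeled in figure 2.5b. It also illustrates why zooming into a WebMercator map is inexpensive: the tile index at each level follows directly from the position, without any search.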
These maps are defined for a latitude range between approximately 85° northern and southern latitude, which results in a perfectly square area in Mercator projection. This square is subdivided into four square areas, which happens recursively across many levels (see figure 2.5b). For each cell of each level, a rendered map tile of a constant size (usually 256×256 px) is stored in a simple directory structure (usually /z/x/y.png, where z is the level, and x and y are the horizontal and vertical indices of the cell in the level). This data structure allows for efficient and interactive display of maps on a client without requiring any projection and rendering of map data at runtime.

Figure 2.6: Azimuthal equidistant projection (azeqd) (a) is defined via an arbitrary projection point. Distances and angles from this point to others are represented without distortion. Here, Stuttgart, Germany is chosen as the projection point. Two-point equidistant projection (tpeqd) (b) is defined via two projection points. Distances are represented without distortion from either point. The straight line between the points is represented without distortion as well, and follows the Great Circle arc between the points. Here, Stuttgart, Germany and Melbourne, Australia are chosen as the projection points.

Azimuthal equidistant projection (azeqd) (see figure 2.6a) is a projection that is defined by one point p on Earth, which can be chosen arbitrarily. This point is mapped to the center of the projection space. Every other point on Earth is placed at the angle and distance given by the geodetic line between p and that point. As a consequence, all distances are represented undistorted relative to p, and all geodetic lines through p are represented as straight lines.

Two-point equidistant projection (tpeqd) (see figure 2.6b) is a projection that is defined by two arbitrary points p_A and p_B.
Distances from either of the two points are represented without distortion in the projection, and the geodetic line between p_A and p_B is represented as a straight line in the projection and, consequently, is not distorted at all. azeqd can be seen as a special case of tpeqd where p = p_A = p_B.

2.3 Statistical Evaluation

To compare two or more techniques, or variants of techniques, quantitative evaluations can be done (Munzner, 2009; Forsell, 2010). For these, a number of study participants are faced with a set of stimuli, and are asked to solve tasks. Within these stimuli, a set of parameters that are to be studied are varied. These are the independent variables. The goal is to keep everything about the stimuli that is not affected by the independent variables constant, or to remove biases by fully randomizing these aspects. The measured results of the tasks are the dependent variables. Each valid combination of values for the independent variables is called a condition. To improve the statistical validity of results, often, each condition is repeated a set number of times per participant, with different randomized non-independent variable aspects. In the end, statistical evaluation per condition is performed to gauge whether the independent variables affect the dependent variables.

A study can be within-subject, which means that all participants see the same types of conditions, or between-subject, where participants are assigned a group, and groups only see disjoint sets of conditions. In a mixed design, some independent variables are varied for all groups, while others are kept constant within groups. A mixed design is a compromise for more complex studies: In a within-subject design, the combination of multiple independent variables and repetitions can quickly lead to overly long evaluation sessions per participant.
A between-subject design, on the other hand, leads to a large number of participants, which can be challenging to organize.

As an example, researchers want to compare two visualizations v ∈ V, where V = {A, B}. In addition, they want to compare both visualizations on three complexity levels c ∈ C, where C = {easy, intermediate, hard}. Further, they want to ensure that each participant sees ten repetitions of the same condition. In a within-subject design, there would be |V × C| = 2 · 3 = 6 conditions, and each participant would see 60 stimuli. Another option would be to do a mixed design, where each participant would only see one of the visualization types. In this case, each participant would see 30 stimuli, but twice as many participants would need to be recruited. In reality, the number of participants and repetitions often depends on the time required per stimulus, the maximum time each participant should spend on the study (ideally under 30 min), the funds available to pay participants, and the required statistical significance of the study results. The last point can be estimated, for example, using power analysis (Kang, 2021).

Historically, statistical significance of evaluation results has been determined by the p-value. Recently, different science communities have brought forward arguments against using p-values, as their outcome can be influenced via a targeted selection of results (Cumming, 2013a; Cockburn et al., 2020). Instead, the recommendation in the literature (Cumming, 2013b; Dragicevic, 2016; Besançon & Dragicevic, 2017, 2019; Cockburn et al., 2020) is to use interval estimation.

(Figure 2.7 panels: (a) Sample group values; (b) Sample mean 95% CIs; (c) Bonferroni-corrected pairwise mean differences A − B, A − C, and B − C.)
Figure 2.7: Three random sample groups A, B, and C are measured (a). The interval estimation of their sample means is calculated (b).
The 95% confidence intervals (CIs) are plotted as black lines, the mean of all samples is plotted as a black circle. The CIs of the mean differences between the groups are calculated and plotted as well (c). Here, Bonferroni correction with a correction factor of 3 is applied to account for multiple comparisons. The corrected CIs are plotted with red whiskers. The pairwise differences show evidence that values for group B are smaller than those of group C (the CI is fully below 0), and that values for group B are smaller than those of group A (the CI is fully above 0). For the difference between group A and C, only a trend for A being larger can be found, because the corrected CI intersects 0 (label 1).

The statistical evaluations in this thesis follow the guidelines suggested by Besançon and Dragicevic (2017).² Their method constructs 95% confidence intervals (CIs), meaning we can be 95% confident that the true (population) mean lies within the interval, via bias-corrected and accelerated (BCa) bootstrapping with 10 000 iterations. Equivalent p-values can be calculated using the method by Krzywinski and Altman (2013). For pairwise comparison, the CI of the difference between the means can be used. The interpretation of the difference, then, is as follows (Cumming, 2013b, 2013a; Dragicevic, 2016; Besançon & Dragicevic, 2017, 2019; Dragicevic et al., 2019; Cockburn et al., 2020): CIs of mean differences show evidence if they do not overlap with 0, and the strength of the evidence increases for tighter intervals and intervals farther away from 0. A small overlap with 0 can still indicate a trend towards one of the methods.

² An English translation of the work of Besançon and Dragicevic (2017) can be found in Appendix A of the first author’s thesis (Besançon, 2017).

For multiple comparisons using the same data, the CIs of the mean differences have to be adjusted using Bonferroni corrections (Higgins, 2004).
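The following Python sketch illustrates interval estimation of pairwise mean differences with a Bonferroni-corrected confidence level. It is a simplified stand-in under stated assumptions and not the evaluation code used in this thesis: it uses the plain percentile bootstrap instead of BCa bootstrapping, relies only on the standard library, and all function names and sample values are made up for illustration. For actual BCa intervals, an existing implementation such as `scipy.stats.bootstrap` with `method="BCa"` could be used instead.

```python
import random
import statistics


def percentile_interval(estimates, alpha):
    """Return the central (1 - alpha) interval of a list of bootstrap estimates."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    cut = int(alpha / 2.0 * last)  # number of estimates cut off on each side
    return ordered[cut], ordered[last - cut]


def mean_difference_ci(a, b, comparisons=1, confidence=0.95,
                       iterations=10_000, seed=0):
    """Percentile-bootstrap CI of mean(a) - mean(b) for two independent groups.

    The error level is divided by the number of comparisons (Bonferroni),
    which widens the interval.
    """
    rng = random.Random(seed)  # fixed seed for reproducible resampling
    differences = [
        statistics.fmean(rng.choices(a, k=len(a)))
        - statistics.fmean(rng.choices(b, k=len(b)))
        for _ in range(iterations)
    ]
    alpha = (1.0 - confidence) / comparisons
    return percentile_interval(differences, alpha)


# Made-up measurements for three groups.
GROUP_A = [2.0, 2.2, 1.9, 2.4, 2.1, 2.3, 2.0, 2.2]
GROUP_B = [1.0, 1.2, 0.9, 1.1, 1.3, 1.0, 1.1, 0.8]
GROUP_C = [1.9, 2.1, 2.0, 2.3, 1.8, 2.2, 2.1, 2.0]

PAIRS = {"A - B": (GROUP_A, GROUP_B),
         "A - C": (GROUP_A, GROUP_C),
         "B - C": (GROUP_B, GROUP_C)}

# Three pairwise comparisons on the same data: correction factor of 3.
RESULTS = {label: mean_difference_ci(a, b, comparisons=len(PAIRS))
           for label, (a, b) in PAIRS.items()}

for label, (low, high) in RESULTS.items():
    verdict = "evidence" if low > 0 or high < 0 else "interval overlaps 0"
    print(f"{label}: [{low:.3f}, {high:.3f}] ({verdict})")
```

With these made-up values, the interval for A − B lies entirely above 0, because every value in group A exceeds every value in group B, whereas the interval for A − C overlaps 0 and would at most indicate a trend.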
Figure 2.7 shows an example for three measurement distributions, their CIs, and the pairwise differences, where a Bonferroni correction factor of 3 has been applied. Here, the pairwise differences show evidence for the mean of group B being smaller than that of groups A and C. The evidence for the mean of group B being smaller than that of group A is stronger in this case, as the CI of mean differences is farther away from 0. The CI of mean differences between groups A and C shows only a trend for group A being larger (see figure 2.7, label 1), as the corrected CI intersects 0.

2.4 Visualization in the Digital Humanities

The DH is an emerging research field that introduces digital methods to humanities research. As a field, it covers a broad area, ranging from literary research through art and architecture to archaeology and history. Therefore, the goals, data, and methods are also numerous. Visualization can support the various research fields in understanding their complex data and in collaboration (Bradley et al., 2018). The individual research communities in DH are small, but engaged. Although the combination of fast-paced computer science-backed visualization research and moderately-paced humanities resea