Adaptation of Point- and Line-Based Visualization

Nils Rodrigues

Dissertation

A treatise approved by the Faculty of Computer Science, Electrical Engineering and Information Technology of the University of Stuttgart for the attainment of the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Submitted by Nils Rodrigues from Stuttgart

Main referee: Prof. Dr. Daniel Weiskopf
Co-referee: Prof. Dr. Lars Linsen
Date of the oral examination: December 13, 2023

Visualization Research Center of the University of Stuttgart (Visualisierungsinstitut der Universität Stuttgart)
2024

Acknowledgments

First, I want to thank Daniel Weiskopf for giving me the opportunity to do research into visualization, advising me, and providing guidance, feedback, and enough liberty to explore various ideas. Daniel, I am also grateful for you taking an active role and for all the work you put in, even during the late hours, that helped to get the research published! Thank you also for your patience and motivation.

Thank you, Lars Linsen, for taking the time to review this dissertation and participating in its defense.

Thank you, Tobias Schreck, for welcoming me to your institute for the research stay, for the continued collaboration, and for suggestions on what to do in Graz.

I thank all coauthors I met over the years and with whom I had the joy of working on various projects. In alphabetical order: Albrecht Schmidt, Alexander Schultz, Andreas Bulling, Andreas Henicke, Andrés Bruhn, Anja Haug, Antoine Lhuillier, Arman Mielke, Benedikt V. Ehinger, Bruno Burger, Christian Baumann, Christoph Schulz, Cristina Morariu, Daniel A. Keim, Daniel Baumgartner, Francesco Chiossi, Frederik L. Dennig, Guido Reina, Harald Reiterer, Jakob Karolus, Jia Jun Yan, Johannes Zagermann, Katrin Angerbauer, Kazi Riaz Ullah, Krishna Damarla, Kuno Kurzhals, Lewis L. Chuang, Lin Shao, Lorenzo Di Silvestro, Marc O.
Ernst, Marco Amann, Martin Raubal, Maurice Koch, Michael Becher, Michael Burch, Michael Sedlmair, Michael Stoll, Nelson de Jesus Silvério da Silva, Nelusa Pathmanathan, Peter Schäfer, Priscilla Balestrucci, Radu Jianu, René Cutura, Rudolf Netzel, Sabine Storandt, Seyda Zerife Öney, Sven Mayer, Sören Döring, Tanja Blascheck, Thomas Ertl, Tiare M. Feuchtner, Tim Krake, Tobias Schreck, and Vincent Brandt.

From the long list of coauthors, I want to single out Christoph Schulz for the pleasant and fruitful discussions and for always working hard in our collaborations.

I am also very grateful to Anabela, Christoph, Jessica, Kuno, Moataz, and Tycho (in alphabetical order) for reading this thesis and for their valuable feedback.

Very heartfelt thanks to my family and friends who endured me and stood by my side. I would not be where I am now without your strong support! Thank you for everything! I am especially thankful to Jessica for helping me stay focused, supporting me over the years as well as during the write-up, and taking burdens off my back when she herself was swamped.

My work at the Visualization Research Center of the University of Stuttgart was mostly funded by the German Research Foundation (DFG) within project B01 of SFB-TRR 161.

Contents

Acknowledgments
List of Figures
List of Tables
List of Algorithms
List of Abbreviations and Acronyms
Abstract
Zusammenfassung

1 Introduction
  1.1 Goals and Research Questions
  1.2 Outline and Research Contributions
2 Data Types and Their Visualizations
  2.1 Low-Dimensional Data
  2.2 Multivariate Data
    Point-Based Visualization
    Line-Based Visualization
  2.3 Spatial Data

I Dot Plots
3 Nonlinear Dot Plots
  3.1 Motivation
  3.2 Technique
    Original One-Way Sweep
    Two-Way Sweep Algorithm
    Combining Sweeps
    Dot Diameter
    Envelope
    Anti-Aliasing
    Variants, Extensions, and Hybrid Visualizations
  3.3 Examples
    Distribution of Air Temperature
    Citation Statistics
    Renewable Energy
  3.4 Expert Review
  3.5 Discussion
4 Relaxed Dot Plots
  4.1 Motivation
  4.2 Goals and Overview
  4.3 Plotting Space, Shape, and Dot Size
    Kernel Frequency Estimation
    Nonlinear Frequency Scaling
    Individual Dot Diameter
  4.4 Dot Placement
    Alternating Vertical Order
    Centroidal Voronoi Tessellation (CVT)
    Placement Correction
    Swaps for Tunneling
  4.5 Termination Criterion
  4.6 Approaches Without KFE and Scaled Envelope
  4.7 Examples
  4.8 Evaluation
    Correctness
    Crowdsourced User Study
    Blue Noise Property and Moiré Effect
  4.9 Discussion
5 Excursion: Mathematical Transition from Columns to Frequencies
  5.1 Recap, Definitions, and Observations
  5.2 Goals
  5.3 Generic Observations
  5.4 Linear Scaling
  5.5 Root Scaling
  5.6 Logarithmic Scaling
  5.7 Generic Dot Diameter
  5.8 Conclusion

II Scatter Plots
6 Comparative Evaluation of Animated Scatter Plot Transitions
  6.1 Motivation
  6.2 Related Work
    Animated Visualization in Other Contexts
    Animations for Scatter Plot Visualizations
    User Studies on Animated Visualizations
  6.3 Design Space
    Dimensions and Views in a Scatter Plot Matrix
    Transitions
    Non-Evaluated Alternatives
  6.4 Hypotheses
  6.5 Methods
    Tasks
    Procedure
    Training
    Animation Speed
    Data Set Size
    Power Analysis
    Participant Recruitment
  6.6 Results
    Exclusion Criterion
    Point Task Performance
    Cluster Task Performance
    Animation Direction
    Subjective Rating
  6.7 Discussion
  6.8 Limitations
  6.9 Verdict
7 Gaze-Based Recommendations for the Exploration of Scatter Plot Matrices
  7.1 Motivation
  7.2 Concept
    Gaze as Indicator for Data of Interest
    From Gaze to Clusters
    Data Metrics
    (Dis)similarity for Recommendations
  7.3 Software System
  7.4 Use Case
  7.5 Discussion
8 Excursion: Eye Tracking for Immersive and Situated Analytics
  8.1 Hypothetical Use Case in Virtual Reality: FiberClay
    The Original Unchanged System
    Benefits and Challenges of Eye Tracking Support
  8.2 Accuracy of Gaze Depth for Augmented Reality
    Motivation
    Experiment
    Results
    Discussion

III Parallel Coordinates Plots
9 Cluster-Flow Parallel Coordinates Plots
  9.1 Motivation
  9.2 Model and Overview
  9.3 Layout
    Axis Duplication
    Dimension Ordering
    Cluster Ordering
  9.4 Line Mapping and Rendering
    Curve Geometry
    Density Rendering
    Uncertainty
  9.5 Visual Patterns
  9.6 Examples
    Escherichia Coli
    NetPerf
    Energy Production
  9.7 Discussion
10 Conclusion
  10.1 Answers to Research Questions
  10.2 Future Directions

Bibliography

List of Figures

1.1 The visualization pipeline
I-1 A dot plot, according to Wilkinson
I-2 Elements of the visualization pipeline that are affected by the proposed dot plot techniques
I-3 A “dot plot,” according to Cleveland
3.1 Illustration of Wilkinson’s sweep layout algorithm
3.2 Dot overlap from varying value density
3.3 Low-resolution plots of normal distributions
3.4 Comparison of visualizations for more than 9,000 air temperature values
3.5 Number of citations referencing publications
3.6 Percentage of electricity produced from renewable energy sources
4.1 Illustration of various dot plot layouts
4.2 Overview of the relaxed dot plot method
4.3 Kernels for frequency estimation
4.4 Comparison between unbounded and bounded kernel frequency estimation
4.5 Leaning columns of dots from relaxation
4.6 A narrow bandwidth can lead to split Voronoi cells
4.7 Analysis of positional error with varying settings for relaxation
4.8 Mean dot movement between relaxation iterations
4.9 Different dot-based visualizations of temperature data
4.10 Relaxed dot plot of the percentage of renewables in electricity production
4.11 Comparison of displayed variance in dot plot layouts
4.12 Examples of study tasks for each hypothesis
4.13 The independent variables of each experiment
4.14 Percentage of correct responses
4.15 Dot plots with overlap marked in red
5.1 Plots of $h_{f,l}$
5.2 Plot of the functions for envelope height and dot size from frequency
5.3 Plot of the piecewise functions for logarithmic dot plots ($\hat{h}_{f,l}$, $\hat{d}_{f,l}$)
II-1 Single scatter plot and matrix of the iris data set
II-2 Visualization pipeline with proposed additions for scatter plots
6.1 Generic concept for rotation animation between scatter plots
6.2 1D and 2D transitions in a SPLOM
6.3 Examples of spline-based animation paths
6.4 Concept of staged rotation
6.5 Concept of perspective rotation
6.6 User interface for the point task
6.7 User interface for the cluster task
6.8 Results of the pilot study with different animation speeds
6.9 Results of the pilot study with different point counts
6.10 Distribution of errors in indicated dot position
6.11 Distribution of correctly identified cluster interactions
6.12 Distribution of point task performance with 2D transitions
6.13 Subjective feedback from study participants
7.1 The three steps in our concept
7.2 Main window of our visual analytics tool
7.3 Initial view of our tool after loading a data set
7.4 GUI with options for subset generation
7.5 Comparative view with table and SPLOM
7.6 Controls for the eye-tracking heatmap and threshold for data subsets
7.7 Analysis of surprising data in network performance data
8.1 Scene setup and real photograph
8.2 Augmented scene from the experiment
8.3 Error in measured gaze depth
III-1 Parallel coordinates plot of fictitious fruit data
III-2 Elements of the visualization pipeline affected in Part III
III-3 Comparison of rendering techniques for PCPs
9.1 Feature overview of CF-PCPs
9.2 Construction of CF-PCPs
9.3 Illustration of two clusters using different coordinate systems
9.4 Concept of reading directions
9.5 Tree representation of our model for dimension ordering
9.6 Simplified CF-PCP with crossing bundles
9.7 Comparison of line geometry in CF-PCPs
9.8 Composite line geometry in CF-PCPs
9.9 Varying opacity for data points with multiple soft labels
9.10 Visualizing uncertainty in the clustering results
9.11 Inter-cluster patterns
9.12 Intra-cluster patterns
9.13 Classification of E. coli bacteria
9.14 Different clustering and PCP techniques applied to the NetPerf data set
9.15 CF-PCP of electric power production

List of Tables

5.1 Conventions for function and variable names
6.1 Comparison of point task performance
6.2 Comparison of cluster task performance
6.3 Comparison of horizontal vs. vertical 1D transitions
6.4 Comparison of subjective ease of the point task
6.5 Comparison of reported preference for frequent use
6.6 Pairwise comparison of participant feedback on animation speed

List of Algorithms

3.1 Wilkinson’s original sweep algorithm
3.2 Adaptation for symmetric nonlinear dot plots
4.1 Layout
4.2 TunnelSwaps

List of Abbreviations and Acronyms

2AFC     two-alternative forced-choice
ACM      Association for Computing Machinery
AR       augmented reality
BNP      blue noise plot
BSP      bee swarm plot
CF-PCP   cluster-flow parallel coordinates plot
CHCCS    Canadian Human-Computer Communications Society
CPU      central processing unit
DBSCAN   density-based spatial clustering of applications with noise
DDA      differential domain analysis
DR       dimensionality reduction
EBL      edge-bundling layout
FA       Fourier analysis
GPU      graphics processing unit
GUI      graphical user interface
HMD      head-mounted display
IEEE     Institute of Electrical and Electronics Engineers
IPCs     illustrative parallel coordinates
ISE      Fraunhofer Institute for Solar Energy Systems
KDE      kernel density estimation
KFE      kernel frequency estimation
MDM      mean dot movement
MOD      mean dot overlap in display diameters
MSE      mean squared error
NDP      nonlinear dot plot
PCA      principal component analysis
PCP      parallel coordinates plot
RDP      relaxed dot plot
RGB      red, green, and blue color model
SP       scatter plot
SPLOM    scatter plot matrix
SSE      sum of squared errors
TU Graz  Graz University of Technology
UMAP     uniform manifold approximation and projection
VAC      vergence-accommodation conflict
VISUS    Visualization Research Center of the University of Stuttgart
VR       virtual reality

Abstract

Visualization plays an important role in the lives of various heterogeneous parts of society: from a voter looking for the latest results of an election, to statisticians examining a distribution, to analysts trying to make sense of multidimensional data sets. This thesis adapts existing point- and line-based visualization methods to improve knowledge gain. The included contributions address three research questions: How to scale unit visualization for 1D data? How to improve navigation between 2D visualizations of multivariate data? How to combine the advantages of multiple 2D views in a single static visualization for multivariate data?

The first part of the thesis focuses on unit visualization of 1D data with dot plots. Compared to the previous state of the art, the developed visualizations fit a wider range of data and expand the number of potential users by requiring less prior knowledge for interpretation. They adapt the definition of dot plots to scale nonlinearly with sample count, accurately show value frequencies in high-dynamic-range data, reduce positional error in displayed data points, and enhance the perception of subtle nuances in the data while avoiding moiré effects. We provide evidence for claimed improvements through evaluation with computational metrics and a crowdsourced user study.

The second part of the dissertation focuses on visualizing multivariate data with scatter plots and scatter plot matrices. First, we evaluate six animated transitions between plots of different 2D subspaces with respect to task performance for tracking individual points and interactions between clusters. The results of a quantitative study with 170 participants show that orthographic rotation animation performs best and should be adopted more widely. Next, we develop a novel concept for recommending views in scatter plot matrices. It provides user- and task-specific suggestions by focusing on the data of interest to the viewer. Together, animation and recommendation adapt scatter plots to improve the user’s ability to analyze more complex data effectively.

In the third part, we develop a new visualization technique that extends parallel coordinate plots to provide a static alternative to scatter plots with animated transitions. The approach does not require interaction to display data flow between 2D subspace clusters. A custom density-based rendering technique enables the visibility of individual lines and structures within highly overdrawn regions. Our technique can communicate fuzzy clustering results through binning and color mapping.

Finally, we discuss the presented contributions with respect to the original main questions and show possible directions for future research.

Zusammenfassung (German summary)

Visualization plays an important role in the lives of various parts of society: from a voter looking for the most recent election results, to statisticians examining a distribution, to analysts who want to understand multidimensional data sets. The contributions of this thesis adapt existing point- and line-based visualization methods to improve knowledge acquisition and address three research questions: How can the visualization of 1D data with individual graphical elements be scaled? How can navigation between 2D visualizations of multivariate data be improved? How can the advantages of multiple 2D views be combined in a single static visualization for multivariate data?

The first part of this thesis focuses on the visualization of 1D data with individual graphical elements in dot plots. Compared to the previous state of the art, the developed visualizations are suitable for a wider range of data and expand the circle of potential users because they require less prior knowledge. They adapt the definition of dot plots to scale nonlinearly with the number of points, to depict value distributions with high dynamic range precisely, to reduce positional errors of the displayed data, and to improve the perception of subtle nuances while avoiding moiré effects. We substantiate the improvements with computationally determined metrics and a crowdsourcing-based user study.

The second part of the dissertation deals with the visualization of multivariate data with scatter plots and scatter plot matrices. First, we examine six animations between plots of different 2D subspaces with regard to the trackability of individual points and the interaction between clusters. The results of a quantitative study with 170 participants show that orthographic rotation animation performs best and should be applied more widely. Next, we develop a novel concept for recommending further views in a scatter plot matrix. It offers user- and task-specific suggestions by placing the focus on the data of interest to the viewer. Together, animation and recommendation adapt scatter plots to improve the user’s ability to analyze more complex data effectively.

In the third part, we develop a new visualization technique that extends parallel coordinates and provides a static alternative to scatter plots with animated transitions. The approach requires no interaction to show the data flow between 2D subspace clusters. Adapted, density-based rendering enables the visibility of individual lines as well as structures within heavily overdrawn regions. The results of fuzzy clustering are represented through binning and color coding.

Finally, we discuss the contributions with regard to the original questions and show possible directions for future research.

1 Introduction

Visualization is ubiquitous. Young pupils use it in school when they first draw charts of mathematical functions to gain intuitive insight into the characteristics of equations. Viewers are confronted with visualization as part of the news when bar charts show a recent development. Infographics help to understand complex systems and how individual components are interconnected.
Data analysts and visualization experts compose figures to gain insight into large data and use them to communicate information with other people.

All these kinds of visualizations, simple or complex, can be created using an underlying pipeline. Haber and McNabb created a formal definition for such a pipeline in the context of scientific visualization [HM90]. We adapted Card et al.’s derived pipeline [CMS99] and will use it throughout this document to put proposed changes into context (see Figure 1.1). The first stage of the pipeline takes a potentially large input data set and extracts a subset of data that is relevant for visualization. The second stage maps the data to a description of individual visual primitives. The rendering stage creates a visible representation from the description without referencing the original data. Visualization serves human beings. Hence, the final stage is the perception by a viewer. In the case of interactive visualization, the viewer can adapt parts of the pipeline, creating a feedback loop for visual analytics [Kei+08].

Some visualizations are dynamic or interactive, giving their users, e.g., data analysts, the ability to adjust parameters to gain new insight from images. Other visualizations are created by a designer and remain static. They can then serve the same person, this time in the role of an analyst, or be distributed among others, who then use them for their own analysis or as confirmation of communicated results. The latter is often the case in print media, e.g., newspapers, where readers are purely “consumers” and cannot change the presented graphics.

Figure 1.1: The visualization pipeline, adapted from [CMS99]. Humans can use existing knowledge to adjust parameters and interact at various steps to adapt the output and gain new insights. (Diagram labels: processed data, transformed data, displayable representation, view, user; filtering, mapping, rendering, perception, abstraction; manual interaction and feedback.)

The scientific field of visualization research has matured, and publications include increasing numbers of visual representations [Rei+20; Che+21]. Researchers have creative freedom for the design of new visualization techniques. The range of used visual elements spans from text and simple dots, over shapes composed of lines and polygons, to continuous fields of color and further. The complexity of employed and developed software for data visualization has increased over time [Rei+20]. On the one hand, and if used correctly, this growing complexity of tools along the visualization pipeline might reduce the workload for human users of the resulting images by preprocessing data, using more elaborate algorithms to find information of interest, optimizing perceptual variables, etc. On the other hand, it might also lead to the creation of visualizations that encode more and more data, contain ever more visual primitives, and become too complex or overwhelming for humans to analyze and extract information.

1.1 Goals and Research Questions

Contrary to the underlying trend of growing possibilities, this thesis focuses on known visualization techniques that only use dots and lines. They are typically used for relatively small data sets with a limited number of dimensions. The overarching goal of this thesis is not the creation of radical new visualizations but the adaptation and extension of proven methods to improve the knowledge gain. The presented plots and techniques target data analysts with prior experience but also the general public. To achieve our overarching goal, we identify three central research questions:

RQ1 – How to scale unit visualization for 1D data? Visualizing 1D data with a graphical element for each data point helps to make it relatively simple to understand. It is, therefore, well suited to target large audiences, including the general public.
The most straightforward visualization techniques for 1D data might be limited to low sample counts in the orders of tens or hundreds. We develop these plots further and make them suitable for high-dynamic-range data sets with several thousands of data points. As screen space is limited, we do not expect the resulting techniques to show each and every sample but to adapt to the underlying data and retain the advantages of unit visualization where possible. In the context of recent work on scalability in visualization [Ric+23], the result should be more scalable in the sense of shape because it requires nonlinear functions to improve the visibility of outliers in high-dynamic-range data. Additionally, the answer to this question should increase the suitable problem size for a given threshold to allow for larger sample counts in the same screen space.

RQ2 – How to improve navigation between 2D visualizations of multivariate data? Most visualization is 2D and presented on digital screens. It seems logical to select only two attributes from a multivariate data set and to only display this subspace. It results in comprehensible visualizations that allow for multiple types of insight. To gain more knowledge from the underlying data set, it is then necessary to select and analyze multiple such subspaces. This question aims to help the analysts with their analysis tasks by facilitating the discovery of and switching between fitting 2D visualizations from multivariate data.

RQ3 – How to combine the advantages of multiple 2D views in a single static visualization for multivariate data? Visualization is not always interactive, e.g., when printed on paper. This question targets the creation of a plotting technique that can provide insights from multiple 2D visualizations in a single static image.
To this end, we extend the range of included approaches beyond simple dots and points to also encompass polylines, giving us more freedom to design a new visualization technique.

1.2 Outline and Research Contributions

This thesis contains chapters that are based on published or submitted papers and reused with consent from the coauthors. I am the main author in most cases. For publications where I only contributed to distinct parts, I will limit reuse to the sections that I wrote. Daniel Weiskopf is a coauthor of all papers, contributing his experience and valuable feedback, helping to polish the content for publication, and editing the manuscripts. He supervised me in all projects behind the papers.

After the introduction, Chapter 2 provides a coarse background on data types and how they can be visualized. On this basis, it helps motivate the main research questions and explains the selection of techniques that follow. The main content of this document is structured around these main research questions, which themselves are arranged according to the dimensionality of numerical data and main drawing primitives for visualization. Each of the following major parts of this dissertation creates a link back to the research questions and provides an overview of how the presented content ties into the general visualization pipeline.

Part I (Dot Plots) answers RQ1 and focuses on 1D data and its frequency. At the beginning, it gives a short overview of the concepts behind multiple visualizations that are called “dot plots” and clarifies which of these techniques we use as a basis for further development.

Chapter 3 – Nonlinear Dot Plots This chapter introduces nonlinear scaling to dot plots in order to allow the visualization of larger data sets with highly dynamic data frequencies. It is based on the results of the first project I started to work on at the Visualization Research Center of the University of Stuttgart (VISUS) [RW18].
Daniel Weiskopf had the initial idea of nonlinear scaling for dot plots. I implemented the adapted algorithm and extended it into a two-way-sweep layout with two scaling functions. I developed a software suite for testing the layouts and introduced a vertical blur in dense areas.

Chapter 4 – Relaxed Dot Plots In this chapter, we improve the display of data frequency and increase positional correctness in dot plots while, at the same time, providing better aesthetics with little to no moiré effect. It is based on a project that led to the publication of dot plots without straight columns [Rod+23b; Rod+22b].

Christoph Schulz and I both identified parts of the Linde-Buzo-Gray stippling algorithm [DSZ17] that might be applicable to dot plots and introduced new constraints. As a result, we arrived at Lloyd relaxation as a technique that was useful for redistributing circles inside an envelope shape. I performed the transfer from dot-count-based nonlinear scaling from Chapter 3 to a frequency-based approach in order to compute the envelope shape from kernel frequency estimation of the underlying visualization data. Tim Krake proved mathematical properties and helped to create consistent equations.

Having a container shape and an algorithm for the distribution of circles, I designed the initial adapted dot layout that serves as a starting point for relaxation and avoids visual artifacts. I also developed an approach with linear complexity for tunneling swaps between dots that increases positional correctness and decreases runtime. Two student assistants, Sören Döring and Daniel Baumgartner, implemented all algorithms. Christoph Schulz and I analyzed the code and provided feedback for improvements. I designed and conducted a crowdsourced study to evaluate the relaxed visualization against regular nonlinear dot plots. Sören Döring implemented a generator for study stimuli and set up a template for the online survey.
Chapter 5 – Excursion: Mathematical Transition from Columns to Frequencies The equations for the mathematical transformations from column-based to frequency-based scaling are contained in this chapter. The content is based on a document that is part of the supplemental material to relaxed dot plots [Rod+22b]. Tim Krake provided a first draft for a clean write-up of the math behind the transformations, which I then edited and extended with other material for the completed document. I present this work as an additional excursion to be able to better explain the ideas behind the transformations and to not interrupt the flow of text in Chapter 4 with too many equations.

Part II (Scatter Plots) answers RQ2 and focuses on exploring and navigating pairs of two dimensions from multivariate data. It starts with an introduction to the application of scatter plots (as individuals and in matrices of small multiples) for the visualization of multidimensional data.

Chapter 6 – Comparative Evaluation of Animated Scatter Plot Transitions This chapter compares multiple animation techniques for switching between scatter plots. It evaluates them with respect to traceability of individual points and interactions between data clusters. We published the work behind this chapter in IEEE Transactions on Visualization and Computer Graphics [Rod+24].

This project started with the desire to visualize flow between clusters in different scatter plots of the same underlying data set. Under my supervision, Vincent Brandt created a software library for animated transitions between scatter plots and made a preliminary small-scale comparison of a wide range of animations that sparked interest in a more thorough evaluation. To this end, I designed and conducted a crowdsourced study with 170 participants and two pilot studies. Frederik Dennig gave feedback on the design of the study and provided a first partial draft for the write-up of related work.
Daniel Keim supervised Frederik Dennig and provided feedback on the ideas of the paper. I performed the statistical analysis of the study results and wrote the remainder of the publication.

Chapter 7 – Gaze-Based Recommendations for the Exploration of Scatter Plot Matrices This chapter follows logically from the previous one, and both together provide an answer to RQ2: since we now know what animation between scatter plots works best, we want to recommend specific dimension pairs that suit the analyst’s task. The main content of this chapter is based on a joint publication with researchers from Austria [Rod+22c].

The project started with a research stay at Graz University of Technology (TU Graz). Tobias Schreck, Lin Shao, and I jointly supervised the student Jia Jun Yan, who implemented a proof of concept for a visual analytics application. I proposed possible ways of using gaze to implicitly select data and helped by suggesting techniques to solve specific problems. I took the lead role in writing the paper. Lin Shao and Tobias Schreck gave feedback for the paper and helped with the related work based on their prior knowledge of multiple recommender systems.

Chapter 8 – Excursion: Eye Tracking for Immersive and Situated Analytics We based the implicit data selection from the previous chapter on gaze data. Within this excursion, we extend the display dimension and discuss the use of eye tracking in the context of immersive and situated analytics. First, the chapter presents hypothetical benefits and challenges of adding gaze support to an existing virtual reality application (FiberClay [Hur+19]). The work is based on my contributed sections about the analytics tool that appear within a publication with many authors from institutions in multiple countries [Sil+19].

The second part of the chapter contains an empirical evaluation of measured gaze depth on real and virtual objects in augmented reality.
It is based on a publication from 2020 [Öne+20]. Seyda Öney implemented all necessary software as part of her bachelor thesis. Michael Becher advised her on topics related to HoloLens, while I focused on eye tracking and the comparative study. Seyda Öney conducted the study and analyzed the results. She summarized them in a first draft that I used as inspiration to write the publication. Michael Becher helped to edit the paper and was supervised by Guido Reina and Thomas Ertl. Michael Sedlmair was the examiner of the student work.

Part III (Parallel Coordinates Plots) answers RQ3 and presents a visualization technique that aims at combining advantages from animated scatter plot transitions and static parallel coordinates plots (PCPs). It starts with a short introduction to rendering techniques that tackle the problem of overdraw in PCPs.

Chapter 9 – Cluster-Flow Parallel Coordinates Plots This chapter is based on published work [Rod+20] and describes a visualization technique that we developed to answer RQ3. During conversations with Christoph Schulz about fuzzy clusters, I came up with the initial idea of replicating the axes of parallel coordinates to show cluster assignments. We involved Antoine Lhuillier in the discussions due to his experience in visualizations with bundled lines. All three of us regularly discussed ideas and came up with the final design. Thus, direct attribution is difficult. I implemented the web-based visualization, metrics, and an algorithm for ordering axes (horizontal) and clusters (vertical). Christoph Schulz contributed fuzzy clustering and his knowledge in the field of uncertainty. Antoine Lhuillier performed initial analyses of the NetPerf data set. While I was involved in writing and editing all sections of the paper, the other authors contributed text segments and provided proofreading over the entire document.
Chapter 10 concludes this thesis with a discussion of the results with respect to the research questions and an outline of future research directions.

Material from the publication [RW18] is copyrighted by IEEE and used with permission under the agreement for reuse in a thesis. The material from the publication [Rod+23b] is public domain and licensed to IEEE under the Creative Commons Attribution 4.0 (CC BY 4.0). Material from the publications [Rod+22c; Sil+19; Öne+20] is copyrighted by ACM and used with their permission under the agreement for reuse in a dissertation. The copyright for material from the publication [Rod+20] is retained by its authors, while CHCCS is granted non-exclusive rights of publication.

I had the joy of collaborating with many good, intelligent, and motivated researchers from VISUS and other fine institutions. To honor their contributions to the work presented in this dissertation, I use the editorial “we” for most of the content and only revert to “I” where appropriate. I do not explicitly cite extracts from coauthored publications to improve readability.

During my time at VISUS, I also coauthored other publications that do not fit the general theme of this thesis and from which I do not use excerpts. A collaboration with the Fraunhofer Institute for Solar Energy Systems (ISE) led to a paper that presented a map interface for data from power plants that was integrated with existing charts on a website (https://www.energy-charts.info, accessed 2024-01-27) [Rod+17b]. Collaboration with IBM resulted in a paper about the application of PCPs for data analysis at the same company [Sch+17].

Eye tracking was a topic in multiple publications. We created visualizations and tools for gaze analysis on static stimuli [Rod+18; Sch+22; Sch+23] and videos [Kur+20]. We used eye tracking as input for interactive applications in augmented reality [Pat+20] and showed how it could be used with adaptive user interfaces [Chi+22].
The human visual system and perception have also played a role in multiple publications. For example, we modeled the strength of simultaneous orientation contrast and compensated it through counter-rotation [Net+19]. We developed a simulator for various physiological disorders of the visual system [Sch+19] and analyzed the effects of color vision deficiency on the accessibility of figures that were published over the years at the IEEE VIS conference [Ang+22].

Based on prior work from my diploma thesis and together with my former supervisor Michael Burch, I published a visual analytics approach for the exploration of texts from multiple corpora that won a best paper award [Rod+17a]. I also generated figures for a previous publication about the visualization technique that I implemented during the work on my diploma thesis [Bur+14].

2 Data Types and Their Visualizations

All visualizations have one thing in common: they represent the underlying data that forms the starting point of the pipeline in Figure 1.1. However, not all visualization techniques are well-suited for all kinds of data. This chapter discusses different types of data and provides a background for corresponding visualization approaches. It shows how the detailed techniques proposed in the following parts of the thesis relate to the overarching goal and research questions.

2.1 Low-Dimensional Data

Simple 1D data and its visualizations are essential for humans, even in a world where big data and data retention are increasingly relevant to artificial intelligence [Jai+20] and further evolving quantum computing [Tow22; UK 21]. For example, large portions of populations are affected by the results of elections for public offices. The number of samples, i.e., votes, in such a case is typically high while the dimensionality is low because each vote is a single choice from a small range of values, i.e., parties.
Visualizations of this low-dimensional data are sufficient to gain information and educate the public about the outcome of the election: the total number of votes, the ratio of abstentions, the ratio between parties, etc. Most people do not know the details of the volatile energy market, how much electricity is generated, where it comes from, or where it is currently needed. In the context of climate change, however, they might be interested in knowing simple 1D data, for example, the percentage of renewable energy, the amount of carbon dioxide emissions, or the mix of primary energy sources of their consumed electricity. Single-dimensional data even arises when friends and family enjoy their free time and play games: the heights of the bar in limbo, the number of goals in football, or the score of each round in a card game.

Dot plots visualize 1D data. They only use a simple number line and place dots along it. This works well for small data sets and is very accessible to a broad range of viewers. Previously, this kind of visualization did not scale well for larger high-dynamic-range data, i.e., data containing outliers and regions with high sample frequency alike. Addressing RQ1, we will investigate methods of applying nonlinear scaling to dot plots. Although they require more experience to interpret correctly, nonlinear scales already work well with bar charts. For the nonlinear plots proposed in Part I, we will apply the scaling to the dots that represent individual data points. Such unit visualization provides an accessible layout where the data frequency is low because viewers are able to recognize and count the individual dots.

On the one hand, the simplicity of dot plots provides visualization for a broad range of audiences. On the other hand, it is only suited to display a single dimension. We can enrich the visualization by encoding a second attribute as color or icons within the dots.
Such an approach works best for few and quantized values.

When the data contains two numerical dimensions, it is intuitive to also use two display axes. Many viewers are used to this from school, where they plotted equations with x- and y-variables in Cartesian coordinate systems. But there is also a need to analyze data where there is no previously known equation that governs the relationship between the two dimensions. In this case, we can use the x-axis and y-axis of the display and draw unconnected individual data points. The resulting scatter plot allows viewers to gain deeper insight into possible connections between the data dimensions: It allows them to find correlations, to identify trends and clusters.

2.2 Multivariate Data

Plots and graphs up to 2D are helpful in cases where the data is of the same low dimensionality. Nevertheless, data sets can include more than two attributes for the same record. For example, a table with champions in motorsports might include the year of a race, name of the driver, name of the team, various measurements of the vehicle, length of the track, total race time, etc. A different data set might include additional attributes forming a network graph that shows how different racing teams interact.

With the abundance and variety of recorded information, it is inevitable for some analysts to work with multivariate data. For example, when trying to optimize engine power output, it is necessary to analyze multiple variable settings, i.e., dimensions, and find their effect on the final output but also interactions between multiple settings. Increasing the amount of injected fuel requires a different setting for the air intake valve and might influence the combustion temperature. What visualizations can analysts use when the number of dimensions increases beyond two? There is a plethora of techniques that fit specific kinds of data.
To keep the length of this thesis within reasonable limits, we will only discuss generic visualizations for numerical data on rational, ordinal, or nominal scales.

2.2.1 Point-Based Visualization

Scatter plots are a suitable approach for 2D data, and many people have used them, but their perpendicular axis layout does not scale with a higher number of dimensions. 3D scatter plots on 2D paper or computer screens are prone to overdraw, and the introduced virtual depth is not easily perceivable for the user. It is, however, possible to create a grid with scatter plots of all dimension pairs: the scatter plot matrix (SPLOM) [Eme+12]. As the number $D$ of data dimensions increases, each individual cell shrinks to fit the increasing number of $D^2$ cells in the matrix. Therefore, this approach is often combined with a regular scatter plot. The matrix provides only coarse thumbnails and allows the user to select a dimension pair for the larger plot.

This technique allows analysts to explore the data in detail. But how can they trace individual points between the different plots? Previous work has suggested animating the dots when switching between data axes in the main scatter plot [EDF08]. In accordance with RQ2, this thesis presents a selection of animation techniques and compares them with respect to user performance for following individual data points and for identifying cluster interactions (see Chapter 6).

With animation, it is easier to maintain the mental model while switching between different views. But how should analysts know what cells of the SPLOM to select? Which cells are the most relevant and have the least redundancy? Previous work has resulted in recommender systems that use a manual selection of known relevant cells to suggest new dimension pairs [Beh+14]. In line with RQ2, this dissertation presents a technique that uses the analyst’s gaze to identify data of interest.
The proof of concept in Chapter 7 combines this implicit data selection with multiple metrics to provide information that can help the viewer select the next interesting cell in the SPLOM.

The concept of using gaze to select data is reasonably well suited for 2D visualization. Beyond the original scope of RQ2, we also explored the use of 3D gaze with stereo eye-tracking hardware in Chapter 8.

2.2.2 Line-Based Visualization

A SPLOM scales with the square of the number of dimensions ($D^2$), requires much space, and the individual cells become increasingly smaller to fit on the screen. Additionally, it requires interaction to switch between different detailed views. What happens when the number of data dimensions increases too far, or there is no way of interacting with the plotting software? For this reason, we formulated RQ3 and investigated ways to combine advantages from multiple 2D views in a single static visualization.

One could use dimensionality reduction techniques to arrive at a 2D scatter plot from an arbitrary number of data attributes. However, the resulting display dimensions are often not intuitive to grasp as the connection to the underlying data dimensions is obfuscated.

A possible answer to RQ3 is not based on visualization with points but rather with lines. A PCP shows all data dimensions with parallel (most often vertical) axes. Each data point becomes a polyline, and the crossings with the axes represent the data value in each dimension. The results of an expert review suggest that PCPs are helpful for analysts who want to explore previously unknown multivariate data [Sch+17]. In Chapter 9, we propose a technique that extends the original PCPs and allows reading the data values of individual samples but also shows soft clusters in 2D subspaces (advantage of scatter plots).
Our proposed visualization shows how individual data points move between clusters and how the clusters of adjacent dimension pairs interact with each other (advantage of animation between scatter plots). The technique is, however, targeted at experienced analysts.

As with all PCPs in general, the axis sequence on the screen influences heavily what patterns of the data are visible. There is previous work on quantifying the patterns and optimizing the sequence [DK10]. As our work in the context of RQ3 focuses on transferring the advantages from multiple 2D visualizations, we also developed new metrics to optimize the dimension order for cluster interactions.

2.3 Spatial Data

Spatial data can be encoded with two and three dimensions. It is often well-suited for display with individual dots marking a position on a map and can also be useful to draw lines that show trajectories. We created a point-based map visualization of data on power plants for the general public [Rod+17b]. We also worked on the visualization of eye-tracking scanpaths. They are well suited for display on a stimulus by drawing lines for saccades and, optionally, dots for fixations. We also worked with eye-tracking trajectories from multiple study participants on the same stimulus and found ways of reducing the visual clutter of scanpaths. We used one approach based on quad-trees [Rod+18] and one with group diagrams [Sch+22; Sch+23].

While work on spatial data is important, it is often targeted toward the visualization of data sets from specific contexts and does not fit the broader and more generic theme of this dissertation. Hence, we will not include these applications in further discussions.

Part I Dot Plots

Figure I-1: A dot plot, according to Wilkinson [Wil99]. Each dot represents a value along the number line at the bottom. Dots are stacked into straight columns if they cannot be placed next to each other without overlap.
Visualization is not limited to complex interactive systems for the particular analysis of large data sets. It can be simple, show only a single data dimension with few samples, and yet be helpful. Dot plots are a case of such relatively simple visualization. Even school children without prior knowledge can use them to extract characteristics of the data distribution [I G15]. The basic algorithm behind dot plots does not require computing hardware and can be used for hand-drawn visualizations.

Dot plots are composed of a single number line and a circular dot for each data point. When neighboring dots would overlap, the layout algorithm stacks them into straight columns (see Figure I-1). As a result, while the input data only has a single attribute, dot plots use 2D output to convey information: One axis (typically horizontal) displays the value of the underlying items, and a perpendicular axis shows data frequency.

The attentive reader might now note how the goals of histograms and dot plots overlap. There are, however, decisive differences. First, dot plots are a kind of unit visualization that maps each data point to an individual graphical element. This allows for a clearly readable representation with countable items and tends to be more intuitive to understand [Bak04]. The second major difference concerns the bins in regular histograms. All points along the visible data axis belong to a bin. There are no unassigned ranges. Additionally, the bins themselves are uniform. Contrary to this, the dot plot positions the circles where there are data items. This can create empty spaces between areas of higher data frequency and outliers that are positioned accurately with regard to their underlying value.

Unfortunately, dot plots also have disadvantages with regard to other visualization techniques. Unit visualization is better for an intuitive understanding, but it does not scale well with increasing sample count and high-dynamic-range data.
Stacking dots into straight columns moves the display points away from their underlying value. Finally, showing many dots in close proximity tends to create moiré effects.

Figure I-2: The visualization pipeline (adapted from [CMS99]) with additions from the proposed dot plot techniques. Affected elements are highlighted in red. High-dynamic-range data is supported through dot size scaling. Directed blur and blue noise reduce unwanted perceptual effects.

Following the theme of RQ1, we adapted and extended various parts of the visualization pipeline of dot plots to tackle the disadvantages and make this intuitive and accessible visualization suitable for a wider range of data (see Figure I-2). In Chapter 3, we change the mapping from data points to differently sized dots and include a data-agnostic low-pass filter to reduce moiré patterns. In Chapter 4, we adapt the location of dots to decrease positional error and ensure the blue noise property, which is better suited for visual perception in the later stage of the pipeline. With these latest changes, remaining visual patterns are not an unwanted side effect but actually convey information about the underlying data.

Definition of Dot Plots

Before continuing with details about the proposed techniques, we have to define what we mean by “dot plots.” There is ambiguity because, over time, there have been multiple visualizations that share the same name. Already in 1884, Jevons is supposed to have included a dot plot to show the distribution of coin weights [Wil99]. More than a century later, Cleveland also employs the name “dot plot” for a visualization that functions like a bar chart [Cle85].
The difference is that each bar is replaced by a single dot at the far end to avoid meaninglessly covering an entire area. He also proposed ordering the data items to enhance the perception of the distribution of values, as in the example in Figure I-3. The same author also composed the plots in matrices of small multiples to create “multiway dot plots” [Cle93]. In 1996, Sasieni and Royston [SR96] presented dot plots that use the regular binning of histograms but replace the bars with columns of same-sized dots. Depending on the bin width, the individual dots are hard to recognize.

Figure I-3: A “dot plot,” according to Cleveland [Cle85]. The dots in each row encode a value. Sorted rows facilitate the perception of distribution. The underlying data is the same as in Figure I-1.

So far, the mentioned techniques have a common theme: they are based on and related to bar charts but use circles instead of rectangles. For our work, we will focus on the definition and algorithm introduced by Wilkinson [Wil99]. His technique yields Figure I-1 when run on the same source data as Figure I-3. We distilled four rules that apply to regular linear dot plots:

DP1 There must be a one-to-one mapping between each data value and each rendered dot.
DP2 All dots are of the same size.
DP3 Lone dots are placed at the exact position of their data value along the displayed scale.
DP4 Colliding dots are stacked into straight columns.

These rules result in properties that distinguish dot plots from histograms and other bar-chart-based visualizations.

More recently, Dang et al. [DWA10] generalized the concept of stacking and produced symmetrical dot plots. They further extended the previously existing technique to create 3D visualizations of 2D data. Their latest advances, however, go beyond the scope of this thesis.
The 3D visualization is harder to read and more complex to create in hand-drawn plots, shifting the focus toward a different target audience.

In this part of the dissertation and the included chapters, we often talk about differences, advantages, and disadvantages. Note that we need to stress differences to distinguish between the techniques but do not intend to discredit or look down upon any of the mentioned plot types. They all fit their own specific use cases, often in statistical contexts. Even in other applications, such as landscape visualizations, dot-based visualization can be more memorable and outperform alternative approaches [TSD09].

3 Nonlinear Dot Plots

This chapter is based on previous work and will use extracts thereof without explicit quotation:

N. Rodrigues and D. Weiskopf. “Nonlinear dot plots.” In: IEEE Transactions on Visualization and Computer Graphics 24.1 (2018), pp. 616–625. DOI: 10.1109/tvcg.2017.2744018.

3.1 Motivation

So far, we have established that regular dot plots provide a simple and intuitive visualization for the distribution of data items. They show clusters, gaps, and outliers and are most useful for smaller sample counts. However, they do not scale well for high dynamic frequency ranges in the distribution of underlying data items. The reasons lie within the rules and assumptions behind their layout algorithm. Rule DP1 is responsible for the intuitive unit visualization. Rule DP2, however, has multiple negative implications that we will now discuss.

Imagine increasing the number of items for visualization by replicating each point in the source data set. As a result, the sample count rises, but the range of values remains unchanged, just as in the real-world scenario of performing repeated measures to get a robust representation of a distribution. Assuming a constant circle diameter, the columns in the plot will grow linearly in height as the data set size increases.
But the width is only affected by the range that the data points cover on the x-axis. The plot will need to be rescaled to fit the computer display or sheet of paper, and the aspect ratio will shift toward a narrow visualization. To address this issue, we need to reduce the diameter of the used circles. This will reduce overlap and produce a more fitting aspect ratio.

Figure 3.1: Illustration of Wilkinson’s sweep layout algorithm [Wil99]. Adding dots for X2, X3, and X4 would create overlap (marked red). Therefore, dots for these values are stacked. X1 and X5 can be plotted without any issues.

With a shrinking diameter, the overall shape of dense areas in the distribution remains visible, but outliers can become too small to perceive. Therefore, the selection of a suitable dot size is key to creating a usable visualization. Wilkinson observed an analogy between circle size in dot plots and the bin width in histograms. Similarly to data-dependent suggestions for the latter, he presented a dot size that depends on the number of samples in the source data and assumes a normal distribution with no to moderate skewness. His approach yields a steady aspect ratio for normal distribution, but outliers remain too small. The newer method from generalized stacking does not solve this issue either [DWA10]. However, depending on specific parameter settings and the variability in data density, it introduces more overlap or larger empty spaces.

Directly addressing RQ1, our goal within this chapter is to expand on the intuitive original dot plots and adapt them for higher dynamic ranges of data density while preserving the visibility of outliers. A useful aspect ratio requires small dots in dense areas of the plot, while outliers must remain large enough for perception.
Therefore, our approach shrinks the circle diameters as the number of dots in a stack increases, yielding nonlinear dot plots. While the concept of nonlinear scaling has already been used with other visualizations [LA94], it is new for dot plots.

3.2 Technique

3.2.1 Original One-Way Sweep

We use a simplified version of Wilkinson’s original algorithm for linear dot plots [Wil99] as a starting point for our extension to nonlinear dot plots. For a self-contained description, we briefly summarize the original algorithm without smoothing and lateral offsets in this section.

Algorithm 3.1: Wilkinson’s original sweep algorithm
Input: Data set $X$. Dot diameter $d$.
1  Sort $X$ ascending
2  while $X \neq \emptyset$ do
3      $X_i \leftarrow$ lowest non-placed value from $X$
4      $c \leftarrow 1$
5      while $|X_i - X_{i+c}| \le d$ do
6          $c \leftarrow c + 1$
7      Place $c$ dots with diameter $d$ in a column above $(X_i - X_{i+c})/2$
8      Mark $X_i$ to $X_{i+c}$ as placed

Algorithm 3.1 performs an upward sweep through the data to place the dots from left to right. It assumes that the data samples are given as data points with corresponding data value $X_i$, with index $i = 1, 2, \ldots, n$ for $n$ samples in the entire data set. It depends on the distance $d$, which is the (constant) diameter of the dots for the creation of columns. The key point is that the algorithm keeps incrementing the number of dots, $c$, for the current column as long as further dots still overlap with the column (lines 5 and 6).

To accommodate large, high dynamic range data sets and preserve the advantages of dot plots, the diameter has to be varied in the same plot: the higher the dot column, the smaller the rendered symbols. This leads to a nonlinear modification in column height and width, which is not very intuitive to interpret. In a histogram with nonlinear scaling, we would only have to measure the height and know the transformation function in order to calculate the represented value.
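As a concrete reference point for the extension that follows, the one-way sweep of Algorithm 3.1 can be sketched in Python. This is a minimal sketch and not the implementation used for the figures: the function name `linear_sweep` is hypothetical, and the sketch assumes that each column is centered on the midpoint between the first and last value it covers, which is one reading of line 7 of the algorithm.

```python
from typing import List, Tuple


def linear_sweep(values: List[float], d: float) -> List[Tuple[float, int]]:
    """One-way (upward) sweep with a constant dot diameter d.

    Returns one (column position, dot count) pair per column.
    """
    xs = sorted(values)
    columns: List[Tuple[float, int]] = []
    i = 0
    while i < len(xs):
        c = 1
        # Keep adding dots while the next value is still within one
        # diameter of the value that started the column.
        while i + c < len(xs) and xs[i + c] - xs[i] <= d:
            c += 1
        # Assumption: center the column on the midpoint of the covered values.
        position = (xs[i] + xs[i + c - 1]) / 2
        columns.append((position, c))
        i += c  # all c values are now placed
    return columns
```

For example, `linear_sweep([1.0, 1.25, 1.5, 3.0], d=0.5)` returns `[(1.25, 3), (3.0, 1)]`: the first three values are stacked in one column, and the isolated value gets a column of its own.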
With nonlinear dot plots, we also have to factor in varying column widths, making the computation of displayed value densities more difficult (see Section 3.2.5). However, the total size of the plot can be reduced while retaining large dots for outliers. The decrease in height also provides enough flexibility to bring the output closer to the optimal aspect ratio.

We first describe an extended two-way sweep algorithm that allows us to work with varying dot sizes (Section 3.2.2). Then, we discuss the merging of intermediate results of the two-way sweep (Section 3.2.3), models for the data-driven adaptation of the dot diameter (Section 3.2.4), resulting envelopes (Section 3.2.5), and aliasing from rendering dots (Section 3.2.6). The section closes with extensions and variants of the visualization, including color coding and overlays with other diagram components (Section 3.2.7).

Algorithm 3.2: Adaptation for symmetric nonlinear dot plots.
Input: Data set $X$. Scaling function $d_c(c)$ for stacked dots.
1   Sort $X$ ascending
2   while $X \neq \emptyset$ do
3       $X_i \leftarrow$ lowest / highest non-placed value from $X$
4       $c \leftarrow 1$
5       while $|X_i - X_{i \pm c}| \le d_c(c)$ do
6           $c \leftarrow c + 1$
7       Place $c$ dots with diameter $d_1$ in a column above $(X_i - X_{i \pm c})/2$
8       Mark $X_i$ to $X_{i \pm c}$ as placed
9   Do one upward and one downward pass
10  Use average of upward and downward pass

3.2.2 Two-Way Sweep Algorithm

We present a summary of our new two-way sweep in Algorithm 3.2. The modifications with respect to the original Algorithm 3.1 are highlighted. To adapt the dot size, we have to replace the constant diameter from Wilkinson’s original algorithm (line 5) with a data-dependent variant:

$$d_c(c) = d_1 \cdot g_c(c), \tag{3.1}$$

where the index $c$ denotes a function that performs the mapping of a dot count within a column. The variable $c$ (not indexed!)
represents the current number of data points in a column, and $d_1$ is a given start diameter that facilitates an overall scaling of all dots and remains constant during the entire layout pass. The function $g_c(c)$ should be weakly monotonically decreasing, i.e., dots should become smaller for columns with more data points. Also, we should have $g_c(1) = 1$ so that we obtain the diameter $d_1$ if just a single dot is placed. Section 3.2.4 discusses concrete examples of $g_c(c)$.

With our approach, the original linear dot plots are included as a special case when $g_c(c) = 1$ for all $c$. Therefore, all improvements from the new algorithm that are not related to scaling carry over to the linear variant. In general, scaling compresses the height range but also increases the cognitive load for reading exact values [Hla+13; Tuf83; CM85; BDJ14; Arb+17]. Please note that, as per Equation (3.1), we keep diameters within each stack constant. As a result, our technique provides a partially linear representation that preserves rule DP1 locally to aid comprehension.

As indicated in Algorithm 3.2, we can use the data-dependent dot diameter (Equation (3.1)) within Wilkinson’s original single-sweep algorithm. However, while this single-sweep approach works very well for a constant dot size, it exhibits some issues when the data value density decreases along the sweep direction. The underlying reason is that shrinking diameters affect column width and, in consequence, change which dots are to be stacked together.

To illustrate this problem, we will assume that the data consists of a dense, sorted group of values $X$. The first and last values are within the initial dot diameter, which would lead to rendering a single column for linear plots: $X_n - X_1 \lessapprox d_1$. During the upward sweep from the smallest to the largest value $X_i$, the dot count $c_1$ in the first stack increases, and the column becomes narrower: $d_c(c_1) \ll d_1$.
Only a few values from $X$ remain and create their own column with a small number of $c_2$ dots with a size close to the initial diameter: $d_c(c_2) \lessapprox d_1$. Since all values are less than $d_1$ apart, there will be overlap between the two columns’ dots.

The inverted case, however, presents no problems: When the density increases along the sweep direction, the columns become narrower. Therefore, a second column will not be as wide as the first one, and there will not be any overlap between them.

This observation leads to a solution to the problem: we use two sweeps, one in each direction, and combine their results by averaging. Algorithm 3.2 indicates the two sweeps as upward and downward passes. Please note that $c$ remains a positive number of dots in the current column, regardless of the sweep direction. Section 3.2.3 describes in more detail how the two sweeps are combined.

The original sweep algorithm only needs one pass over the source data. Since none of our alterations are of higher complexity than $O(n)$, our proposed algorithm still has the same linear runtime complexity (with a factor of roughly 2 for the number of computations).

3.2.3 Combining Sweeps

Dang et al.’s greedy version of a dot plot algorithm [DWA10] was designed to achieve symmetrical diagrams. While it is preferable to create symmetrical visualizations when using symmetrical data, the algorithm either introduces severe overlaps or gaps between dot stacks (depending on the chosen factor for $h$ in step 1 of their algorithm). As already indicated in Section 3.2.2, a single sweep direction creates overlap problems for nonlinear dot plots, too. In addition, it also causes asymmetry under certain conditions.

Figure 3.2: Two dot plots of the same data, laid out with (a) a single left-to-right pass and (b) two opposed passes combined. The value density varies: rising from 1–2, falling from 3–4, and constant between 5–6. Vertical lines on the bottom axis show the individual data value positions.
Merging two passes results in a plot with less overlap and more symmetry (b).

To address these problems, we decrease the sweep direction’s influence on the final layout by combining both directions, as described in Algorithm 3.2. The open question is: how are the results of two opposing sweeps adequately combined?

As we will explain later in this section, the rules we imposed on the dot diameters (Section 3.2.2) guarantee that both sweeps return the same number of stacks. We can, therefore, define a one-to-one mapping of columns from one layout pass to the other. Columns resulting from a single sweep are defined by three values: a position in the data dimension as well as the number and diameter of the dots. For the merging of stacks from opposite sweep directions, their positions can be readily averaged by using their arithmetic mean. However, it is not straightforward to merge the actual dots if the two columns contain a different number of them: the arithmetic mean of the number of included dots might contain “half” points, but we can only draw entire dots. We choose to only draw entire dots by rounding the number of dots in the current column and then carrying the remainder to the following column. In total, this will result in the same number of dots as from a single pass, and the remainder is only distributed within a neighborhood of columns with less than $d_1$ distance. If the neighboring column that received the remainder were farther away than $d_1$, it would belong to a different cluster of columns. Data points that are more than $d_1$ apart cannot overlap and, thus, do not require stacking; consequently, they cannot interfere with each other in the sweep algorithm. In a nutshell, dots that are farther than $d_1$ apart can never appear in the same column! The arithmetic mean can be applied directly to the columns’ positions and, with the slight modification above, to their height (number of dots).
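The merging step described above can be sketched in Python as follows. This is a hedged illustration rather than the original implementation: the name `merge_passes` and the tuple layout are hypothetical, the sketch assumes that the averaged dot count is rounded down with the half-dot remainder carried to the next column to the right, and it assumes that the downward pass lists its columns in sweep order (right to left). The diameter of each merged column is obtained by applying Equation (3.1) to the merged dot count, with the scaling function passed in as `g`.

```python
from typing import Callable, List, Tuple

Column = Tuple[float, int]  # (position, dot count)


def merge_passes(
    upward: List[Column],
    downward: List[Column],
    d1: float,
    g: Callable[[int], float],
) -> List[Tuple[float, int, float]]:
    """Average two opposed sweeps into (position, dot count, diameter) columns."""
    if len(upward) != len(downward):
        raise ValueError("both passes must yield the same number of columns")
    merged: List[Tuple[float, int, float]] = []
    carry_halves = 0  # remainder carried between columns, counted in half dots
    # The first upward column is paired with the last downward column.
    for (pos_a, c_a), (pos_b, c_b) in zip(upward, reversed(downward)):
        halves = c_a + c_b + carry_halves  # twice the mean count, plus carry
        count = halves // 2                # assumption: round down ...
        carry_halves = halves % 2          # ... and carry the half dot onward
        position = (pos_a + pos_b) / 2     # arithmetic mean of the positions
        diameter = d1 * g(count)           # Equation (3.1) on the merged count
        merged.append((position, count, diameter))
    return merged
```

Working in half dots keeps the arithmetic in integers, so the merged columns always sum to the same total number of dots as either single pass.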
However, the arithmetic mean is not suitable for the diameter of rendered dots. This is an implication of the nonlinear dependency between $c$ and $d_c(c)$ in Equation (3.1). Instead, we calculate the correct diameter from the actual number of dots in each column by applying Equation (3.1) to $c$ after the number of dots is averaged.

There remains one prerequisite for the two-pass method to work: both passes must return the same number of columns. This is important because the first column of the upward pass has to be averaged with the last column of the downward pass. If they do not match, the data values of different neighborhoods will be merged.

Now we analyze the scenarios encountered during the layout process to check whether they meet the mentioned prerequisite. Our first observation is that data values that are farther apart than the start diameter $d_1$ divide the data samples into clusters that can be treated separately. Therefore, we can restrict the discussion to a single cluster of data samples. Clusters that produce a single column in one direction will also produce a single column in the opposite direction, i.e., this is a trivial case. Let us assume the upward pass A resulted in two columns and look at the following two scenarios.

Scenario 1: If the downward pass B returned a single column, it would mean that the total number of values in the cluster was so small that $d_c(c)$ was still larger than the distance between the first and last data value. This would, in turn, also mean that pass A would have returned a single column containing the complete cluster, which is in direct contradiction to our premise.

Scenario 2: Can pass B return three columns? If it started at the same data point where pass A finished, then its first column would cover at least the same number of data values as the last column of A.
That would leave the same data that pass A used for its first column, which in turn would also only result in a single column of pass B. Therefore, three columns could only be generated if pass B encountered additional data values outside the cluster. However, we only look at the data inside the cluster because different clusters are so far apart that they do not interfere with each other.

According to this logic (similar to mathematical induction), the two opposed passes cannot result in a different dot stack count, which makes our averaging method applicable to arbitrary source data.

As shown in Figure 3.2b, the two-way layout leads to columns that are better centered on contained data and have less severe overlaps than the single-pass algorithm (see Figure 3.2a). Our algorithm is therefore not only well suited for nonlinear dot plots but also improves the layout of traditional, linear plots.

Figure 3.2b also shows that a constant sample density does not necessarily lead to a constant dot stack height, which makes the visualization inconsistent with the underlying data. This is due to rendering individual dots, which leads to quantized column heights. In order to get a constant height among all columns of the third cluster, we have to fit the initial dot diameter and the nonlinear transformation to the specific samples. While these adapted parameters will fit the targeted cluster, they might be wrong for the clusters at 1 and 3. Therefore, it is not trivial to find a globally optimal solution.

It is not even clear how to completely define the optimality of a solution. We might define an optimal solution as one in which the distance of rendered dots from their input values (layout error) is minimal. If we then selected a sufficiently small dot diameter, we would render each dot at the exact data value position and have an optimal solution according to that metric. However, the resulting plot would not be readable because the dots would be too small to perceive.
Therefore, an objective function for an optimal solution would have to consider the layout error but also the dot sizes, display resolution, viewing distance, and aspect ratio. We leave the definition of such an objective function as an open question for future research.

3.2.4 Dot Diameter

The above two-way sweep algorithm makes heavy use of the dynamic adaptation of dot size according to Equation (3.1). We will now discuss useful choices for the adaptation model, as formalized by $g_c(c)$. For this, we assume that we have a rather dense packing of dots so that the dot plot resembles the corresponding histogram.

Figure 3.3: Low-resolution plots of normal distributions: (a) 1,000 dots without blur, (b) 10,000 dots without blur, (c) 1,000 dots with blur, and (d) 10,000 dots with blur. Column-oriented anti-aliasing dampens moiré patterns in visualizations (c) and (d). When used with large dots, this blurring introduces unwanted optical effects (c).

Although $g_c(c)$ controls the nonlinearity of the dot plot, it is not identical to the nonlinear mapping known from histograms or other function plots. It is important to note that $g_c(c)$ does not map the original height of a stack to the nonlinear modification. For example, we cannot simply use $g_c(c) = \log(c)$ to obtain the analog of a log-scale dot plot. In fact, the height of a column with $c$ dots is:

$$h_c(c) = c \cdot d_c(c) = c \cdot d_1 \cdot g_c(c). \tag{3.2}$$

As noted earlier, we require that $g_c(1) = 1$ and that $g_c(c)$ decreases with increasing $c$. In addition, we want to guarantee weak monotonicity of the plot: a column with more data points should never be smaller than a column with fewer points, i.e.,

$$h_c(c_1) \le h_c(c_2) \quad \text{if } c_1 < c_2. \tag{3.3}$$

The extreme case of constant height is obtained for:

$$g_c(c) = \frac{1}{c}, \tag{3.4}$$

i.e., this choice leads to $h_c(c_1) = h_c(c_2) = d_1$. As shown later in Figure 3.4c, there are some applications for this extreme model. The corresponding
visualization resembles a combination of jittered strip charts and stripe charts [Cha+83] (see Figure 3.4c).

Typical nonlinear mappings target strong monotonicity. To this end, the dot size needs to shrink less quickly than in Equation (3.4). A corresponding mapping is achieved by:

$$g_{c,r}(c) = \frac{1}{c^{s}}, \tag{3.5}$$

with an additional parameter $s$ that controls the shrink rate. The useful range for the shrink rate $s$ is between 0 and 1. Selecting a shrink rate of $s = 0$ will create traditional linear dot plots; $s = 1$ yields the earlier model of Equation (3.4). For in-between shrink rates, the column height is proportional to $c^{1-s}$, with the fractional exponent $1 - s$ corresponding to a root. We mark functions related to this root scaling with index $r$. For images within this chapter, we select a default shrink rate of $s = 0.4$, which leads to a column height proportional to $c^{0.6}$. This choice leads to good results for the upcoming example plots but could certainly be replaced with other values. We added a table with varying $d_1$ and $s$ as supplemental material to the original publication [RW18] to illustrate the effect of these parameters on the final layout.

The root mapping in Equation (3.5) might be good for many data sets, but it will not fit most processes in nature that exhibit exponential growth, where the growth rate is proportional to the current function value. Examples of such processes can be found in nuclear decay, the Weber-Fechner law [Fec60], and population growth. The existence of exponential processes is also reflected in the way we represent floating point numbers: as mantissa and exponent.

To get a log-scale mapping, we naively try to approximate

$$h_c(c) = \log(c). \tag{3.6}$$

Based on Equation (3.2), we would calculate

$$g_c(c) = \frac{\log_b(c)}{d_1 \cdot c} \tag{3.7}$$

for any desired base $b$, but this would violate the requirements for single dot columns, as $\log(1) = 0$. To get $g_c(1) = 1$, we could change the numerator in Equation (3.7) to $\log_b(c) + d_1$. This, in turn, would lead to other problems.
For instance, with $b = 2$ and $d_1 = 0.5$, the diameters in columns with two dots would be larger than a single dot ($g_c(2) = 1.5$). Instead, for logarithmic dot plots (index $l$), we propose

$$g_{c,l}(c) = \frac{\log_b(c + b - 1)}{c}, \tag{3.8}$$

as it fits all requirements when used with $b \ge \frac{\sqrt{5} + 1}{2}$ (the golden ratio). The supplemental material of the underlying publication [RW18] includes a table with varying $d_1$ and $b$ to illustrate the effect of these parameters on the layout.

The final issue is the choice of $d_1$. This parameter should be chosen according to the number of data samples, their distribution, the size of the plot, and, very importantly, the targeted aspect ratio. We iteratively optimize for $d_1$ with nested intervals: first, we compute a plot with the current value of $d_1$; then increase it in the next iteration if the current aspect ratio is wider than desired and vice versa until we reach a ratio that is sufficiently close to the target.

3.2.5 Envelope

The above discussion holds for the mapping of height, i.e., a single dimension. Now, the dot plots intrinsically link column height and width because both dimensions are determined by the dot diameter. If the diameter is scaled by a factor $a$, the covered area (i.e., the area of the column) is scaled by a factor $a^2$. Therefore, a square root computation needs to be included in all mappings if the adjustment is meant for areas, not just height.

Based on these observations, one can consider the limit case of a very large number of dots and the envelope of the nonlinear dot plot. For the case of root dot plots according to Equation (3.5), the height of the envelope scales with $\alpha^{(1-s)/(1+s)}$ if the number of dots is multiplied by $\alpha$. Thus, the exponent $s$ from the diameter scaling matches an exponent of $(1 - s)/(1 + s)$ in the corresponding nonlinear histogram.

3.2.6 Anti-Aliasing

In general, dot plots can come with high demands regarding rendering quality because they consist of clearly defined dots with sharp boundaries.
Such boundaries, in combination with a rather regular placement of dots, can lead to aliasing and moiré artifacts [Gla69; SW82]; see Figure 3.3. Small differences in dot sizes are the main cause of moiré effects. These accumulate along the height of neighboring columns and create virtual tilted lines.

Typical anti-aliasing approaches from computer graphics work with supersampling on the image plane, followed by low-pass filtering and downsampling. Our solution adopts the same strategy but exploits the special characteristics of dot plots. In the rendering stage of the visualization pipeline, low-pass filtering blurs the image, i.e., the individual dots would eventually disappear, and only a solid colored area would remain. Unlike generic anti-aliasing techniques, we limit blurring to the vertical direction because individual columns should still be distinguishable. Furthermore, a few dots at the top and bottom are left unchanged, as they play a key role in estimating the size of individual dots and in comparing column heights. We also decided not to blur those columns that are not surrounded by others because they do not add to the moiré effect. Finally, we only start blurring after the dot count inside a column exceeds a certain lower threshold (in our examples: 12) because otherwise, the rendering does not create an area that is big enough for the effect to be perceivable.

Figures 3.3c and 3.3d show examples of anti-aliasing. Dots in a blurred line are not countable anymore. However, as the height of columns increases and the dot diameter decreases, it becomes more and more difficult to make out individual dots for counting anyway. We use our anti-aliasing method with moderation in order to balance advantages and side effects.
The vertical lines in Figure 3.3c are an example of overly aggressive blurring for the low dot density that trades the moiré pattern for even worse optical effects, especially when rotating the image. Therefore, we recommend anti-aliasing only for plots with very small dots, as in Figure 3.3d.

3.2.7 Variants, Extensions, and Hybrid Visualizations

Just like conventional dot plots, our nonlinear generalization can be widely applied to depict any kind of data distribution. Similarly, it can be combined with other visual mappings to include further information or emphasize certain aspects of the data.

One example is additional color mapping. In general, color plays an important role in visualization because it can show additional data attributes on top of the positional variables of the diagram. We argue that color mapping is especially useful in the context of dot plots because each single data sample generates exactly one dot, i.e., we can have a direct mapping between sample and color. To make use of perceptual grouping by color (hue), similar colors should be spatially grouped in the dot plot. We cannot change the layout between the columns in the dot plot because it is driven by the distribution of data values. However, we may modify the order of dots within a column. Therefore, the dots in each column should be ordered vertically according to the additional data attribute mapped to color. A typical example is a chronological data distribution, i.e., a data set with samples that not only carry some data value but also a timestamp. Such time-series distributions are best ordered chronologically in each dot column. Another possible application of color is the comparative visualization of several data sets integrated into one dot plot: the color indicates the data sources. The figures in Section 3.3 use such colored dot plots.

We mostly render our dot plots with the stacks aligned to the x-axis.
It is possible, however, to center the columns vertically and create diagrams with a horizontal symmetry axis (like Wilkinson’s symmetrical dot plots [Wil99]). Further variations can align dot stacks to the y-axis. This is especially useful when dealing with nominal data as it improves the layout of labels and creates an output similar to Cleveland’s multiway dot plots [Cle93].

In addition to layout and rendering variations, we can combine different visualization methods. For instance, Figure 3.6b shows a symmetrical dot plot overlaid by a box plot, whereas Figure 3.2 adds strip charts to the x-axis. Such hybrid diagrams can help users classify and identify data in meaningful ways, providing more insights. Tukey’s suspended rootograms plot data in relation to a known density distribution [Tuk72]. The same technique could be applied to the vertical positioning of the dot stacks to show deviations but would require special care to adjust for areal distortions (see Section 3.2.5).

3.3 Examples

We demonstrate nonlinear dot plots for typical examples of real-world data. These contain frequencies of varying ranges. We include comparisons to histograms and linear dot plots to examine different characteristics.

3.3.1 Distribution of Air Temperature

Air temperature has a direct impact on our daily life, but it is also related to issues of global climate change. Therefore, we pick this application as our first example. In 2017, we downloaded a data set from the “Climate Data Centre” of the German meteorological service (Deutscher Wetterdienst). It contains monthly averages of maximum daily air temperatures by weather station for the years 1961 to 1990. The stations are labeled using the identifiers supplied by the World Meteorological Organization (WMO). This data set provides a total of 9,685 data samples from 875 weather stations worldwide; temperatures are in degrees Celsius. The data is still available online, albeit in a different format and with more samples [Deu].
Visualizing this data with the plots in Figure 3.4, one can immediately see that the distribution is unimodal and has its maximum between 31 and 32 degrees. While mean temperatures down to around −30 degrees are still quite common, only single instances of temperatures at or below −31 degrees can be observed. Temperatures between 16 and 29 degrees are almost evenly distributed, forming a plateau in the linear dot plot. This plateau is also visible in the logarithmic histogram and the root dot plot, albeit less noticeable.

Figure 3.4: More than 9,000 air temperature values (in degrees Celsius), shown as (a) root dot plot, (b) linear dot plot, (c) strip dot plot (root plot with $s = 1$), and (d) logarithmic histogram. The data set shows the monthly average of daily maxima measured by weather stations all around the world. Individual dots in (a) and (b) are colored according to the month (J F M A M J J A S O N D). We use black dots in plot (c) to maximize the perceivable range of brightness.

The nonlinear plots outperform their linear counterparts when we focus on less dense areas of the plots. Figures 3.4a and 3.4d both show minimum and maximum values (outliers) clearly, but only the dot plot allows counting them. The histogram would need a fine-grained vertical axis with ticks at intervals of 1, which would lead to overplotting. The extreme, strip chart-like visualization in Figure 3.4c only shows the outliers clearly but turns the dense regions into light-gray areas. Cross-referencing the temperature data with the WMO database yields additional geographic information. Picking data points in nonlinear dot plots for cross-referencing is very simple because of the one-to-one relationship between dots and data points and since the low-frequency points are rendered as quite large dots.

While the two nonlinear visualizations look similar at first glance, the dot plot provides more detail in dense areas.
The histogram cannot provide such a view, as all its bins are of equal width. Figures 3.4a and 3.4b also exhibit some gaps in the stacks (for instance, near =4 °C), which contrast with the tightly packed neighboring areas. This is due to the characteristics of the data source: the temperatures are rounded to a single decimal place, and there are many data points with the same value. The value density in this area is too low to create high and narrow columns, but at the same time, it is too high to place a wider column at each decimal of a degree. Therefore, the layout algorithm cannot create a tightly packed field of uniform columns, which leads to visible gaps.

To encode more information, we colored the individual dots in Figures 3.4a and 3.4b according to the four seasons. We compensated for the phase shift between the Earth’s northern and southern hemispheres by adding six months to data from weather stations below the equator. Using this colorization with the nonlinear dot plot, it is immediately noticeable that higher temperatures tend to occur in summer, while the lower ones are predominantly measured in winter. That is no surprise; however, the tendency is not so obvious when looking at the linear dot plot. This effect can be explained by the unimodal sample distribution that has a lower value frequency at the outer edges and the bias toward outliers in Figure 3.4.

It would be possible to color the histograms analogously to the dot plots by subdividing the bars. However, there is no appropriate color coding for the logarithmic histogram because there is a conflict between the linear (proportional) splitting of individual bars versus the overall logarithmic scaling. There is no such ambiguity with the colored dot plots, as the individual stacks are always linear, giving the users an impression of the distribution of sample categories within the stacks.
Figure 3.5: Number of citations for papers by Ben Shneiderman, William S. Cleveland, Leland Wilkinson, and William E. Lorensen. Each dot represents one of 1,463 publications from these authors. Panels: (a) $\log_2$ dot plot using $d_1 = 795$ (Shneiderman only), (b) root dot plot using $s = 0.6$ and $d_1 = 526$, (c) $\log_2$ dot plot using $d_1 = 795$, (d) linear dot plot using $d_1 \approx 8.4$, (e) logarithmic histogram, and (f) linear histogram.

3.3.2 Citation Statistics

Our next example shows bibliometric data in the form of citation statistics. The h index [Hir05] is a popular indicator of publication impact by an author, heavily aggregating data about all papers into a single number. In contrast, visualizing the complete citation data presents a challenge, as the number of citations may vary extremely per author and paper: there tend to be few publications with a big impact (i.e., many references) and many papers that hardly anyone notices (i.e., few or no references). Therefore, we obtain a high concentration in the frequency plot near zero and some outliers with many citations, forming a long tail. Since these outliers are the most relevant publications, the visualization should represent them accordingly. However, even the bulk of low-citation papers is interesting because it indicates publication productivity.

For illustration purposes, we use citation data of four well-known researchers (Shneiderman, Cleveland, Wilkinson, and Lorensen) obtained from Google Scholar¹ through the Publish or Perish software [Har16] on June 22, 2017 (without any data cleansing). Figure 3.5 shows the results. In Figure 3.5a, we plotted 1,045 publications by Ben Shneiderman. His most cited work is “Designing the user interface: strategies for effective human-computer interaction” (14,309). There are six additional publications that are clearly distinguishable, but most of the other papers seem to have one thousand or fewer citations.
To compare his citation data to that of the other three authors, we add their data and use color-mapped dots. By the dominant dark blue color in the nonlinear dot plots in (b) or (c), we can immediately see that Shneiderman has the largest number of publications. Dots below about 1,000 citations become too small to distinguish individually, but their nonlinearly scaled column heights can still be perceived; therefore, we can obtain the approximate frequencies and compare them between authors. The far fewer papers with high citation counts are large and clearly visible. From these dot plots, we can see that all four authors have publications with 9,000 or more references. Bill Lorensen even has two papers with very high citation counts: “Marching cubes: A high resolution 3D surface construction algorithm” (13,495) and “Object-oriented modeling and design” (11,147). Leland Wilkinson’s most cited work is “SYSTAT for Windows: statistics, graphics, data, getting started, version 5” (9,914). The red dot at 9,017 represents William Cleveland’s “Robust locally weighted regression and smoothing scatter plots.”

In contrast, linear plots (Figures 3.5d and 3.5f) are not well suited for such high-dynamic-range data, as they cannot show any useful information about the important long tail. The logarithmic histogram in (e) renders the highly referenced publications as bars. This allows for a relatively accurate estimation of the citation counts but does not provide any method of showing and comparing the authors.

3.3.3 Renewable Energy

Our third example combines dot plots with box plots to address the way in which electricity is produced. The European Commission provides the general public with access to statistical data through its website².

¹ https://scholar.google.com, accessed 2024-01-27.
² https://ec.europa.eu/eurostat/web/main/home, accessed 2024-01-27.

It includes
information on the amount and type of energy produced in each country of the European Union (EU), as well as Norway and Iceland. The data set “tsdcc330”³ compares the generated renewable electricity with the total consumed amount on a yearly basis from 2004 to 2014.

Figure 3.6: Percentage of electricity produced from renewable energy sources versus total consumed electricity of 30 European countries from 2004 to 2014, shown as (a) root dot plot with annotations for clusters and country names, (b) symmetric root dot plot with box plot overlay, and (c) linear histogram. There is a value for each country and year. Dot plots use a color scale to represent the value’s year (’04 through ’14).

Visualizing this data set with dot plots and histograms, three groups of values stand out that can be interpreted by combining the use of plots with the tabular representation of the source data. As Figure 3.6 and, more specifically, cluster 1 in Figure 3.6a show, there is a relatively high data density near 0 %. Temporal information from the dot plots shows that it must have been from countries that have only recent