Distributed Computing and Transparency Rendering for Large Displays

Von der Fakultät Informatik, Elektrotechnik und Informationstechnik der Universität Stuttgart zur Erlangung der Würde eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigte Abhandlung

Vorgelegt von Daniel M. Kauker aus Lichtenstein

Hauptberichter: Prof. Dr. Thomas Ertl
Mitberichter: Prof. Dr. Anders Ynnerman
Tag der mündlichen Prüfung: 29.06.2015

Visualisierungsinstitut der Universität Stuttgart
2015

Contents

Abstract
German Abstract – Zusammenfassung
1 Introduction
  1.1 Motivation
  1.2 Structure and Contribution
2 Fundamentals
  2.1 Personal Computer Hardware
    2.1.1 Graphics Processing Units
    2.1.2 General Purpose Accelerators
  2.2 Rendering and Visualization
    2.2.1 Rendering
    2.2.2 Visualization
    2.2.3 MegaMol
  2.3 Distributed and Remote Rendering
  2.4 Large Displays
  2.5 Mobile Devices
3 Computation Hardware Abstraction
  3.1 Motivation
  3.2 Previous Work
  3.3 System Overview
    3.3.1 Concepts
    3.3.2 Using DIANA
    3.3.3 Support for Remote Devices and Computations
  3.4 Large-Scale Finite-Time Lyapunov Exponent Field Visualization
  3.5 Summary
    3.5.1 Results
    3.5.2 Outlook and Conclusion
4 Transparency Rendering
  4.1 Motivation
  4.2 Previous Work
    4.2.1 Painter's Algorithm
    4.2.2 Depth Peeling
    4.2.3 k-buffer
    4.2.4 Per-Pixel Linked Lists
  4.3 Per-Pixel Linked List Rendering Algorithm
  4.4 Distributed Transparency Rendering
    4.4.1 Merging Transparent Renderings from Multiple Render Nodes
    4.4.2 Using Per-Pixel Linked Lists for Distributed Transparency Rendering
    4.4.3 Two-sided Fragment Transformation
    4.4.4 Discussion
  4.5 Advanced Rendering Techniques using Per-Pixel Linked Lists
    4.5.1 Splatting for Depth of Field Effects
    4.5.2 Constructive Solid Geometry
  4.6 Applications
    4.6.1 Rendering Molecular Surfaces using Order-Independent Transparency
    4.6.2 Visual Multi-Variant Analysis
    4.6.3 Distributed Mesh Rendering
  4.7 Summary
    4.7.1 Outlook
    4.7.2 Conclusion
5 Generalized Rendering using Per-Voxel Linked Lists
  5.1 Related Work
    5.1.1 Voxelization
    5.1.2 Sparse Volume Rendering
  5.2 From Pixel to Voxel: Per-Voxel Linked Lists
    5.2.1 Voxelization Algorithm
    5.2.2 Final Image Rendering
  5.3 Implementation Details
    5.3.1 Data Structures
    5.3.2 Optimizations
  5.4 Voxel Ray Tracing
    5.4.1 Sparse Volume Rendering
    5.4.2 Global Effects Rendering
  5.5 Results & Discussion
    5.5.1 Voxelization
    5.5.2 Sparse Volumes
    5.5.3 Global Effects Rendering
    5.5.4 Limitations
  5.6 Summary and Outlook
6 Using Mobile Devices for Interaction with Visualizations
  6.1 Motivation
  6.2 Related Work
  6.3 Technical Foundation
    6.3.1 Data Flow
    6.3.2 Device Localization
    6.3.3 High-Scale Visualization on Mobile Devices
    6.3.4 Limitations
  6.4 Mobile Device Interaction Concepts for Visualization
    6.4.1 Visual Analysis Application for Dynamic Protein Cavities and Binding Sites
    6.4.2 Mobile Device Interaction
  6.5 Summary and Outlook
7 Conclusion
  7.1 Big Picture
    7.1.1 System
    7.1.2 Operating Procedure
    7.1.3 Implementation
  7.2 Future Work
  7.3 Summary
Bibliography
List of Figures

Chapter 2
2.1 OpenGL pipeline
2.2 Visualizations of different data set types
2.3 Graphical explanation of the rendering methods
2.4 Visualization pipeline

Chapter 3
3.1 DIANA System Overview
3.2 Visualization of a Finite-Time Lyapunov Exponent field computed by distributed DIANA instances
3.3 Finite-Time Lyapunov Exponents displaying application
3.4 Box-plot of DIANA vs. CUDA
3.5 DIANA speedup

Chapter 4
4.1 Per-Pixel Linked Lists example images
4.2 Per-Pixel Linked Lists pipeline
4.3 Per-Pixel Linked Lists principle
4.4 Example applications using Per-Pixel Linked Lists order-independent transparency rendering
4.5 Rungholt city model
4.6 Per-Pixel Linked Lists pipeline for multiple renderers
4.7 MPI Informatics Building model rendered on eight nodes
4.8 Minimal sorting problem example
4.9 Example applications using Per-Pixel Linked Lists order-independent transparency rendering
4.10 Two-sided transformation
4.11 Example cases of how fragments are taken over into the update buffer
4.12 Renderings with incomplete depth layers
4.13 Evaluation of two-sided fragment transformation
4.14 Image quality for two-sided transformation
4.15 Depth-of-field rendering
4.16 Constructive Solid Geometry operation example
4.17 Constructive Solid Geometry schematics
4.18 Example applications using Per-Pixel Linked Lists order-independent transparency rendering
4.19 Example molecular surface renderings using Per-Pixel Linked Lists
4.20 Schematics of the Solvent Accessible Surface and Solvent Exclusive Surface representation
4.21 Schematics of Solvent Exclusive Surface representation
4.22 Schematic of Solvent Exclusive Surface representation, Reduced Surface, and inner remains
4.23 Per-Pixel Linked Lists evaluation renderings
4.24 Molecules used for evaluation
4.25 Three different variants and analysis view of 2VEO molecule
4.26 Multi-variant analysis visualization versus standard blending
4.27 2JQW protein rendered by three nodes with different merge settings
4.28 Distributedly rendered meshes

Chapter 5
5.1 Per-Voxel Linked Lists example images
5.2 Per-Voxel Linked Lists algorithm
5.3 Different voxelization parameters
5.4 Schematics of the ray tracing process
5.5 Per-Voxel Linked Lists data structure
5.6 Example scenes for voxel-based ray tracing
5.7 Color blending for split rays
5.8 Quantitative visualization of numbers of ray steps per pixel
5.9 Scenes used in the evaluation
5.10 Performance of sphere rotated around y-axis and x-axis
5.11 Per-Voxel Linked Lists voxelization evaluation
5.12 Data sets used in the evaluation of sparse volume rendering

Chapter 6
6.1 X-ray scenario with mobile device and VISUS power wall
6.2 Mobile device user interface
6.3 Device localization example
6.4 Mobile device connection setup
6.5 Visual analysis application for protein cavities and binding sites

Chapter 7
7.1 Big picture illustration

List of Tables

Chapter 3
3.1 Detailed performance comparison of DIANA vs. CUDA
3.2 Summed performance comparison of DIANA vs. CUDA
3.3 DIANA cluster evaluation

Chapter 4
4.1 Per-Pixel Linked Lists vs. other Order-Independent Transparency rendering algorithms performance comparison
4.2 Performance of different compression algorithms
4.3 Performance of two-sided rendering approach
4.4 Per-Pixel Linked Lists statistics
4.5 Performance of Per-Pixel Linked Lists for different molecules and surface representations
4.6 Constructive Solid Geometry evaluation
4.7 Performance evaluation of the multi-variant analysis application
4.8 Breakdown of average time for distributed rendering steps
4.9 Breakdown of average time for different numbers of render nodes

Chapter 5
5.1 Performance evaluation
5.2 Brick size evaluation
5.3 Sparse volume rendering evaluation

List of Abbreviations and Acronyms

API Application Programming Interface
APU Accelerated Processing Unit
CPU Central Processing Unit
CSG Constructive Solid Geometry
DDP Dual Depth Peeling
DP Depth Peeling
FTLE Finite-Time Lyapunov Exponent
GPA General Purpose Accelerator
GPGPU General Purpose Computation on Graphics Processing Unit
GPU Graphics Processing Unit
MIMD Multiple Instruction Multiple Data
OIT Order-Independent Transparency
PPLL Per-Pixel Linked List
PVLL Per-Voxel Linked List
RS Reduced Surface
SAS Solvent Accessible Surface
SDK Software Development Kit
SES Solvent Exclusive Surface
SIMD Single Instruction Multiple Data
VdW Van-der-Waals Surface

Units

fps Frames per Second
PiB Pebibyte, 2^50 Bytes
TiB Tebibyte, 2^40 Bytes
GiB Gibibyte, 2^30 Bytes
MiB Mebibyte, 2^20 Bytes

Abstract

Today's computational problems are getting bigger, and the performance required to solve them increases steadily.
Furthermore, the results are getting more detailed, so that methods for rendering, visualization, and interaction need to adapt. While the computational power of a single chip also increases steadily, most of the performance gain today comes from parallelizing the algorithms. Although Graphics Processing Units are built and specialized for the task of graphics rendering, their programmability also makes them suitable for general purpose computations. Thus, a typical workstation computer offers at least two processing units, the Central Processing Unit and the Graphics Processing Unit. Using multiple processing units for a task is commonly referred to as "distributed computing". One of the biggest challenges when using such heterogeneous and distributed systems is the variety of software and the many ways to use it for an optimal result. The first part of the thesis therefore focuses on an abstraction layer to simplify software development on heterogeneous computing systems. The presented framework aims to encapsulate the vendor-specific details and the hardware architecture, giving the programmer a task-oriented interface which is easy to use, to extend, and to maintain.

Once the results have been computed in a distributed environment, their interactive visualization becomes another challenge, especially when semi-transparent parts are involved, as the rendering order has to be taken into account. Additionally, the distributed rendering nodes do not know the details of their surroundings, like the existence or complexity of objects in front of them. Typically, large-scale computations are distributed in object space so that one node works exclusively on one part of the scene. As it is too costly to collect all computation results on a single node for rendering, those nodes also have to do the rendering work to achieve interactive frame rates. The resulting parts of the visualization are then sent to specialized display nodes. These display nodes are responsible for compositing the final image, e.g., combining data from multiple sources, and for showing it on the display devices. In this context, rendering transparency effects with objects that might intersect each other within a distributed environment is challenging. This thesis will present an approach for rendering object-space decomposed scenes with semi-transparent parts using "Per-Pixel Linked Lists".

Presenting these visualizations on large display walls or on a remote (mobile) device raises the final challenge discussed in this thesis. As the scenes can be either complex or very detailed and thus large in terms of memory, a single system is not always capable of handling all data for a scene. Typically, display walls that can handle such amounts of data consist of multiple displays or projectors, driven by a number of display nodes, and often have a separate node where an operator controls which part of the scene is displayed. I will describe interaction methods where the user can directly control the visualization on a large display wall using mobile devices, without an operator. The last part of the thesis presents interaction concepts using mobile devices for large displays, allowing the users to control the visualization with a smartphone or tablet. Depending on the data and visualization method, the mobile device can either visualize the data directly or in a reduced form, or use streaming mechanisms so that the user has the same visual impression as a user in front of the display wall.
With the mobile application, the user can directly influence any parameter of the visualization and can thus actively steer an interactive presentation. In this thesis, I will present approaches for employing heterogeneous computing environments, from a single PC to networked clusters, how to use order- independent transparency rendering for local and distributed visualization, as well as interaction methods for large display walls and remote visualization devices. The approaches for heterogeneous computing environments make the development easier, especially in terms of support of different hardware platforms. The presented distributed rendering approach enables accurate transparency renderings with far less memory transfer than existing algorithms. For the interaction methods, the usage of ubiquitous mobile devices brings the described approaches to all types of display devices without the need for special hardware. Additionally, a concept for an integrated system containing the contributions of the thesis is proposed. It uses the abstraction layer as a middle ware for the computation and visualization operations in the distributed rendering environments. The user controls the application using the methods for mobile device interactions. xii German Abstract — Zusammenfassung — Berechnungs- und Simulationsprobleme werden immer komplizierter, Algo- rithmen zur Lösung immer komplexer und es ist immer mehr Performance nötig, um die Probleme zeitnah zu lösen. Zugleich werden die Resultate die- ser Berechnungen detaillierter, so dass auch die Methoden zum Rendern, zur Visualisierung und zur Interaktion angepasst werden müssen. Zwar steigt die Berechnungsgeschwindigkeit der einzelnen Chips stetig, der größte Performan- cegewinn ist heutzutage allerdings durch Parallelisierung von Algorithmen zu erreichen. Obwohl Grafikkarten auf grafisches Rendern ausgelegt sind, lassen sie sich aufgrund ihrer Programmierbarkeit gut für bestimmte Arten von ge- nerellen Berechnungsproblemen nutzen. Daher bieten heutige Workstations mindestens zwei Berechnungseinheiten, den Hauptprozessor und die Grafik- karte, und bilden damit ein „verteiltes“, heterogenes System. Eine der größten Herausforderungen bezüglich solcher Systeme ist die Vielfalt an Software und die unzähligen Wege sie einzusetzen, um das optimale Ergeb- nis zu erzielen. Der erste Teil dieser Dissertation beschreibt eine Abstraktions- schicht, die die Entwicklung von Software für heterogene Systeme erleichtern soll. Das Framework kapselt die herstellerspezifischen Programmierschnittstel- len, -details und die Hardware-Architektur und stellt dem Programmierer ein aufgabenorientiertes Interface bereit, das einfach zu verwenden, zu erweitern, und zu warten ist. Wenn die Berechnungen des verteilten Systems abgeschlossen sind, ist die inter- aktive Visualisierung der Daten eine weitere Herausforderung – insbesondere, wenn semi-transparente Objekte enthalten sind. Typischerweise werden große, verteilte Berechnungen im Objektraum unterteilt, so dass jeder der teilnehmen- den Knoten an einem Teil der gesamten Szene arbeitet. Es ist zu aufwändig, die Ergebnisse von allen Knoten einzusammeln, um sie durch einen einzelnen Renderknoten darstellen zu lassen. Daher müssen schon die Berechnungskno- ten die Visualisierungsarbeit übernehmen, um (semi-)interaktive Darstellungen zu erhalten. Die Visualisierungsresultate werden dann zu spezialisierten An- zeigeknoten geschickt. Diese setzen die Szene zusammen und zeigen sie an. 
Der zweite Teil dieser Dissertation beschreibt Verfahren, um verteilt gerenderte Objekte, die teilweise transparent sind und sich auch überschneiden können, ohne visuelle Artefakte zu rendern. Dazu werden sogenannte „Per-Pixel Linked Lists“ eingesetzt. Der dritte Teil behandelt, wie die so gewonnene Visualisierung auf einem großen xiii German Abstract – Zusammenfassung Display oder einem anderen (mobilen) Gerät präsentiert werden kann. Die Sze- nen können sehr groß und komplex sein, benötigen also viel Speicherplatz, den ein einzelnes Visualisierungssystem nicht immer zur Verfügung stellen kann. Große Displaywände, die mit solchen Datenmengen umgehen können, bestehen aus mehreren Monitoren oder Projektoren, die von mehreren Anzeige- knoten betrieben werden. Zudem gibt es meist spezielle Operatorrechner, mit denen die Visualisierung auf diesen Displaywänden gesteuert wird. Es werden Interaktionskonzepte vorgestellt, wie der Benutzer die Visualisierung auf einem großen Display nur unter Zuhilfenahme von mobilen Geräten kontrollieren kann und somit nicht den Umweg über einen Operatorrechner gehen muss. Dazu müssen die Visualisierungen auf dem mobilen Gerät auch angezeigt werden. Je nachdem, welche Daten gerendert bzw. welche Darstellung genutzt wird, können die Daten auf dem mobilen Gerät entweder direkt, in reduzierter Form oder mittels Streaming-Technologien mit der gleichen grafischen Qualität wie auf dem großen Display angezeigt werden. In dieser Dissertation werden Methoden vorgestellt, wie heterogene Compu- terumgebungen, sei es ein einzelner PC oder ein ganzer Cluster, mit wenig Aufwand programmiert werden können; wie tiefenunabhängiges Transparenz- rendering mit weniger Speichertransfers als bisher zur lokalen und verteilten Vi- sualisierung genutzt werden kann; und wie mittels mobiler Geräte mit Displays interagiert werden kann, ohne dass spezielle Hardware nötig ist. Die Methoden zur Programmierung heterogener Systeme vereinfachen die Entwicklung und Wartung insbesondere im Hinblick auf die Unterstützung verschiedener Hard- wareplattformen. Die vorgestellten Algorithmen zum Transparenzrendering liefern akkurate Darstellungen und benötigen weniger Grafikspeicher als bishe- rige Ansätze. Die Konzepte zur Interaktion mittels mobiler Geräte beschreiben, wie diese allgegenwärtigen Alltagsgegenstände mit jeder Art von Display be- nutzt werden können, ohne dass hierfür besondere Hardware benötigt wird. Als Abschluss werden diese einzelnen Themen zu einem Gesamtkonzept vereint. Die Abstraktionsschicht bildet das Framework der Programme zur Berechnung und Visualisierung. Die so gewonnenen Daten werden mittels des verteilten Renderingansatzes visualisiert und die mobilen Geräte zur Benutzerinteraktion eingesetzt. xiv C h a p t e r 1 Introduction With the performance of today’s computer systems and ongoing trend of increas- ing performance, programmers and users are keen to exploit this performance gains. In the context of simulation and computation, these improvements are used to achieve higher frame rates, more detailed calculations, or increased in- put and output data sizes. Visualization systems can make use of the increased performance by increasing the display resolution, by rendering more realistic effects, or by being able to process the output of the computation faster and more detailed. 
The central research questions of this thesis focus on computing and processing the ever-increasing amount of data, visualizing the results without losing details, and interacting with the visualization in an efficient and user-friendly fashion. To be precise, this thesis will present methods to abstract the hardware and Application Programming Interfaces (APIs) for computational programs (see Chapter 3), distributed and remote transparency rendering for presentation (see Chapter 4) together with an extension for generalized rendering of structured and unstructured data (see Chapter 5), and new approaches for interacting with large display walls using mobile devices (see Chapter 6).

1.1 Motivation

The user wants to process more data, ideally in less time, and expects the new hardware to deliver adequate performance. The presented abstraction layer targets those expectations and aims to provide maintainable and extensible functions for applications to make use of the performance deliverable by the hardware, now and for future releases. For the detailed output, rendering and visualization techniques are needed which can provide interactive frame rates and allow the user to fully inspect the computation results. A concept which can provide this is transparency rendering. For displaying large amounts of data, sometimes complex display solutions like display walls are used. To make interaction easy and comfortable for the user, the thesis presents conceptual methods for using mobile devices to interact with the visualization, on large display walls as well as in remote rendering scenarios.

With the increasing computational power, users expect the ability to handle bigger data sets, faster computation times, and more accurate results. To satisfy these expectations, not only must hardware development continue, but software development must also advance. This holds for the computational part as well as for the visualization and interaction.

The computer systems and clusters used for computational tasks get bigger and more complex. One of the largest systems is the Tianhe-2 [Dongarra, 2013], which has about 3 million cores, about 1.34 PiB of main memory, and a performance of about 33.86 petaflop/s. Developing software for these systems is time-intensive and costly. When the underlying system architecture or APIs change, a certain effort is necessary to update the software, and usually the process of optimizing the code starts again. The abstraction layer middleware, DIANA, presented in this thesis (cf. Chapter 3) aims to provide a layer between the hardware and the actual application. The abstraction layer acts as the connecting element for algorithmic building blocks, allowing easy development of applications by composing readily available plugins or modules. Its most important design goal is maximum performance. Thus, it should not waste any time on communication, data handling, or other administrative tasks. On the other hand, the software needs to be maintainable and easy to understand. Thus, maintainability is the second design goal for the abstraction layer. The third design goal for the abstraction layer is extensibility. This enables programmers to create modules which can perform a specific task on a given hardware device, thus allowing the software to be adapted for future uses.

Instead of programming directly against those APIs, DIANA can be used as a foundation for computational programs.
It sits between the (hardware vendor) APIs and the application, serving as an interface to and an abstraction of the computation hardware and the data. Modules implement the API of a given hardware or product and are used by the actual application to do its work. The implementation of the modules is hidden from the application, which can simply make use of the provided functionality. It is even possible to use remote devices for distributed computing when the module implements network code. Thus, any application using the abstraction layer can make use of the available hardware, either locally or on a remote computer, as long as the employed module supports the given task.

Once the computation is finished, the data can be stored for later use, or be prepared for presentation and direct visualization in an in-situ scenario [Knoll et al., 2013]. When a distributed system is used for computation, the resulting data can be too large – either due to the data set resolution or the domain size – to be handled for visualization on a single node. Therefore, the distributed architecture also needs to be taken into account for the visualization system.

It might be important for the understanding of the data that all information is visualized at once. Transparency is one option to give the user a (proverbial) insight into the data. On the one hand, this creates visual clutter, i.e., more data that might confuse the user; on the other hand, it can provide additional information, like showing multiple layers simultaneously. The application has to take care of the visual clutter, e.g., by providing functionality to remove layers or alter the color schemes.

The rendering process can either take place on the computation device(s) or on dedicated nodes. For large data sets that have been computed on a distributed system, a distributed visualization might also be beneficial because it avoids the transfer of the raw data. Instead, only the visualization result has to be transferred, which in some cases can be much smaller and thus faster. Other reasons for a distributed visualization are a large resolution or a display system which consists of multiple display devices.

Rendering transparency in a distributed system is a challenging task. The main part of this thesis (cf. Chapter 4) focuses on rendering semi-transparent objects which overlap each other. The reason is that with the standard procedure, where color and depth buffers are used for remote visualization, overlapping semi-transparent objects cannot be composited correctly, as fragments from different nodes cannot be interleaved once they have been blended. In a distributed computation system, the data set can be decomposed in object space. Note that no clipping is used; instead, complete parts of the scene are distributed. An example for this is a simulation of a car with objects like body, frame, tires, and suspension. In this case, the outer body completely encapsulates the frame and suspension. Other examples for this scenario are distributed mesh renderings, especially if the geometry is too big to store in the graphics memory of a single node, or comparative visualizations where multiple instances of an object are rendered with slight variations, e.g., at different time steps of a simulation.
When the rendering of the data contains very fine structures, a standard workstation setup with one or two monitors might not have enough resolution to show the whole visualization and simultaneously deliver the level of detail needed to understand the data. Standard paradigms like panning and zooming [Cockburn et al., 2009] might not always be viable, e.g., when the whole scene or visualization is necessary, or when additional data needs to be displayed together with the actual data set. Visual analytics applications [Kehrer and Hauser, 2013] often display one or more visualizations of the data and additionally present tables or graphs that provide extra information to the viewer. Especially when multiple users want to inspect a data set at the same time, display systems that can deliver a high resolution and provide enough space for multiple persons are necessary.

While driving these visualization and display systems is an interesting research topic by itself, parts of this thesis focus on user interaction with large displays (cf. Chapter 6). Usually, the user interaction for visualization is done directly via conventional input devices. The device is either connected to the rendering node directly, or an application controls the input signals and distributes them among the participating rendering nodes. For large displays, it is cumbersome to walk around with standard input devices like a mouse or keyboard. For the research in this thesis, mobile devices like smartphones or tablets are used instead. They act as input devices, and additionally the running application can display context information, either for interaction or as ancillary information on the data.

The proposed method aims at distributed visualization systems, e.g., for presentations or remote visualization scenarios, where the user can control the visualization with a mobile device. When giving a talk, the presenter wants to focus on the text and the audience, not on controlling the presented visualization. Using standard devices like smartphones or tablets, the user can intuitively interact with the visualization. Additionally, in collaboration scenarios, multiple users can use their devices and all work together on the data set. Here, priorities and permissions have to be worked out, but this topic is beyond the focus of this thesis.

1.2 Structure and Contribution

This thesis is structured as follows: Chapter 2 presents the current evolution of the hardware and the state of the art of the concepts, algorithms, and software in the areas of General Purpose Computation on Graphics Processing Unit (GPGPU), transparency rendering, visualization, and user interaction.

Chapter 3 details the concept of a computation hardware abstraction layer and demonstrates an example application. With every new hardware generation or API update, programs have to be adapted to the changed application programming interfaces of the computation devices. Changing programming models might even make it necessary to completely rebuild the application. The presented abstraction layer [Panagiotidis et al., 2011, 2012] aims to simplify and generalize the development for computational hardware by encapsulating the device- and API-relevant code. Pre-compiled plugins provide access to new devices for the applications, which do not even have to be recompiled.
In general, the abstraction layer provides higher maintainability, flexibility, and extensibility than existing frameworks or applications, without a significant loss in performance.

Chapter 4 presents extensions to the Per-Pixel Linked Lists (PPLLs) algorithm for rendering Order-Independent Transparency (OIT) in distributed environments. With the data sizes increasing further, it is harder to keep track of the details which might play an important role in the visual analysis [Krone et al., 2014b]. Options are to use rendering techniques like transparency rendering for blending multiple layers, or to use larger displays up to wall-sized gigapixel powerwalls. Therefore, algorithms which can handle this amount of data have to be used. The PPLL rendering algorithm is one of them: it allows a generic visualization of scenes with semi-transparent objects [Kauker et al., 2013b] and can be used in distributed computation and visualization environments [Kauker et al., 2013a]. The concepts and implementation details for the extensions are described together with performance evaluations and an outlook on further optimizations of this approach.

Chapter 5 extends the concept of PPLLs into a generalized rendering approach using Per-Voxel Linked Lists (PVLLs). PVLLs convert scene geometry, volumes, and implicit surfaces into a voxel structure. The approach renders the scene from the three main axes, thereby capturing all fragments within it. When a standard volume data set is rendered with the PVLL algorithm, only the visible voxels are stored, removing completely transparent sections and thus saving memory for sparse volumes compared to traditional volume rendering. Having all fragments available additionally enables rendering global effects like reflection, refraction, and shadows.

Chapter 6 describes interaction concepts for mobile devices as input devices and second screens in visualization systems and for large displays. As said before, with increasing data sizes it is harder to keep a general view of the visualization on the one hand, and to focus and keep the contextual information visible for the user on the other, especially when the data is projected on wall-sized display devices. Here, mobile devices can help by displaying detailed or contextual information or additional visualizations, or by serving as an intuitive input device, especially in a more advanced visualization environment than the typical workstation setup. Single-user and collaborative interaction concepts are presented in this chapter, together with an outlook on how this topic might evolve in the future.

Chapter 7 summarizes the contributions of the thesis on a higher level. Furthermore, it gives an outlook on a system combining the presented techniques.

Non-referenced co-authored Papers

Memory Saving Discrete Fourier Transform on GPUs
Authors: Daniel Kauker, Harald Sanftmann, Steffen Frey, and Thomas Ertl
Conference: Proceedings of the IEEE International Conference on Computer and Information Technology (pages 1152–1157), [2010]

This paper shows an alternative method to compute the two-dimensional Discrete Fourier Transform. While current GPU Fourier transform libraries need a large buffer for storing intermediate results, the presented method can compute the same output with far less memory. This works by exploiting the separability of the Fourier transform. Using this scheme, it is possible to transform multiple rows and columns independently.
The presented approach can compete with the timings of the two-dimensional transform provided by the NVIDIA CUFFT library but at the same time consumes far less memory, thus enabling the transformation of much bigger data sets on the GPU.

Evaluation of Visualizations for Interface Analysis of SPH
Authors: Michael Krone, Markus Huber, Katrin Scharnowski, Manuel Hirschler, Daniel Kauker, Guido Reina, Ulrich Nieken, Daniel Weiskopf, and Thomas Ertl
Conference: EuroVis 2014 Short Papers (pages 109–113), [2014a]

This paper presents a GPU-accelerated visualization application that employs methods from computer graphics and visualization to analyze SPH simulations from the field of material science. To this end, the iso-surface that separates the stable phases in a fluid mixture is extracted via the same kernel function that was used by the simulation. The application enables the analysis of the separation process using interactive 3D renderings of the data and an additional line chart that shows the computed surface area over time. This also allows for validating the correctness of the simulation method, since the surface area can be compared to the power law that describes the change in area over time. Furthermore, the iso-surface that is based on the simulation kernel is compared with an established method to extract smooth high-quality SPH surfaces. The comparison focuses on demonstrating the applicability for data analysis in the context of material science, which is based on the resulting surface area and how well the two phases are separated with respect to the original particles. The evaluation was carried out together with experts in material science.

Chapter 2
Fundamentals

This chapter describes the state of the art at the time of writing. It defines the hardware situation, common terms, and standard approaches for visualization as a baseline for the following chapters.

2.1 Personal Computer Hardware

In the last five to ten years, the trend even for low-end consumer products has been towards multi-core processors. Here, the standard von Neumann architecture [Godfrey and Hendry, 1993] is still used, but parallelism is introduced. While Central Processing Units (CPUs) are very flexible in their execution model, allowing different threads to execute different operations following the Multiple Instruction Multiple Data (MIMD) pattern on a relatively low number of physical cores, Graphics Processing Units (GPUs) use a stricter model but feature more than a thousand parallel cores, with each thread executing the same operation on different data in a lockstep manner, the so-called Single Instruction Multiple Data (SIMD) pattern. GPUs can easily outperform CPUs on tasks which fit this model, e.g., matrix multiplication with larger dimension sizes.

2.1.1 Graphics Processing Units

GPUs are hardware components dedicated to rendering and graphical operations. A GPU in the context of this thesis consists of a chip housing a high number of parallel Shaders (2000 and more) and a dedicated graphics memory subsystem.

Figure 2.1 — Simplified OpenGL pipeline with the programmable stages (green boxes). The arrows denote the data flow between the CPU and application side, the GPU memory, and the stages within the pipeline.
The so-called Unified Shader Model (OpenGL) or Shader Model 4.0 (DirectX) features programmable shaders that can be used for the different stages of the rendering pipeline. While earlier hardware had dedicated sections with diverging capabilities for the different parts of the rendering pipeline, today's graphics chips use one common shader architecture for all parts. This flexible allocation model was introduced with the NVIDIA GeForce 8 series or ATI Radeon HD 2000 series GPUs in 2006.

The rendering pipeline consists of different stages (cf. Figure 2.1), most of which are at least partly programmable by the application. The application delivers the input data in the form of so-called primitives, typically points, lines, triangles, or quads, to the Vertex Shader. Here, each vertex of the input data is transformed according to the model, view, and projection matrices. The Tessellation Shader is used for subdivision of the input model. In the Geometry Shader, each incoming primitive can be processed by a user-defined program. A typical use case for the geometry shader is the generation of proxy geometry, e.g., for ray casting of implicit surfaces like spheres. Here, point vertices are uploaded to the GPU and the geometry shader generates a quad from the input vertex and a given radius (see the sketch below). This saves bandwidth and memory, but the computation on the GPU costs some performance. For the following stages, the input is the same as if four vertices had been uploaded. After clipping and rasterization – the process of converting the geometry to fragments – the Fragment Shader is called for each generated fragment. In this programmable shader, the fragment can be colored, textured, and lit, or removed, i.e., not blended into the final output. The final image is then directly sent to the screen or can be stored in GPU memory for later processing.

Typically, the APIs used for programming GPUs are OpenGL or DirectX; at the time of writing, the current versions are OpenGL 4.4 and DirectX 11. In the context of this thesis, the OpenGL notation will be used.
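As an illustration of this proxy-geometry technique, the following GLSL sketch shows a geometry shader that expands a single point into a screen-aligned quad whose size is derived from a per-vertex radius. It is a minimal example under the assumption that positions arrive in view space; the names (radius, projMatrix, texCoord) are placeholders and not taken from an actual implementation in this thesis.

    #version 330 core
    layout(points) in;                             // one input point per glyph
    layout(triangle_strip, max_vertices = 4) out;  // expanded into a quad

    uniform mat4 projMatrix;  // projection matrix, assumed to be set by the host application
    in  float radius[];       // per-vertex glyph radius from the vertex shader
    out vec2  texCoord;       // passed on for the later ray casting in the fragment shader

    void main() {
        vec4 center = gl_in[0].gl_Position;        // assumed to be in view space
        const vec2 corners[4] = vec2[](vec2(-1.0, -1.0), vec2( 1.0, -1.0),
                                       vec2(-1.0,  1.0), vec2( 1.0,  1.0));
        for (int i = 0; i < 4; ++i) {
            vec4 offset = vec4(corners[i] * radius[0], 0.0, 0.0);
            gl_Position = projMatrix * (center + offset);  // project the billboard corner
            texCoord    = corners[i];                      // used to reconstruct the sphere
            EmitVertex();
        }
        EndPrimitive();
    }

In the fragment shader, the interpolated texCoord can then be used to compute the exact ray-sphere intersection per pixel, which yields the pixel-accurate glyphs discussed in Section 2.2.1.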
With hundreds or thousands of parallel shaders, the GPU can outperform a standard CPU on highly parallel tasks like matrix multiplication. The drawbacks of the GPU are the comparatively small memory of 12 GiB, whereas large CPU systems can access 128 GiB and more, and the comparatively small bandwidth to the host CPU. The single CPU is usually very good at a wide range of tasks whereas the GPU needs to strictly follow the SIMD-pattern to reach its peak performance. Typically, a certain number of GPU threads are grouped into so-called warps. The performance of the GPU can only be maximized when all threads in a warp follow the exactly same instruction in a lockstep manner [NVIDIA Corp., 2014]. When branching occurs (like in if-cases or for-loops of different length), some threads might need to wait for other threads working on other instructions. Inactive threads, e.g. the threads in the other if-case, do the same operations but the writing operation is masked out, preventing them from writing variables. Thus, the whole execution takes as long as the union of all instructions executed by any thread combined. Accelerated Processing Units (APUs) combine the different processors from CPUs and GPUs on one chip. This reduces the performance compared to discrete, high-end graphic processors but allows both sub-units to use the same but slow memory, effectively avoiding costly memory transfer operations. APUs are often found in small or embedded systems, e.g. mobile devices. CUDA can only be used with GPU devices by NVIDIA, whereas OpenCL can 12 Chapter 2 • Fundamentals perform on any GPU, CPUs, APUs, and other specialized hardware. Currently, CUDA has a stronger user base on GPUs and is better supported than OpenCL due to the commercial interest of NVIDIA. The OpenGL compute shaders are closely related to OpenCL but are used from a graphics context from within OpenGL applications which makes the development and integration into graph- ical applications much easier. 2.2 Rendering and Visualization Rendering is the process of converting raw data into a final high-quality visual representation, usually mimicking the real world to some extent. Visualization aims to present the raw input data in a way that can more easily be interpreted by the viewer than the data itself. 2.2.1 Rendering In the area of graphics rendering, the raw data typically consists of primitives like points or triangles forming a mesh, volumetric data, or other representations. The result of the graphics rendering process is an image which is to be shown on a display device. Basically, it is a projection from the (higher dimensional) data/attribute space into the (two dimensional) screen space. On the screen, each pixel of the final image only consists of a color value. Blending operations and a depth buffer are used to overcome the limitation that each screen pixel can only hold one color value. Blending can take place front-to-back or back-to-front and is subject to the imple- mentation. It describes how the existing and new color values are combined and put into the framebuffer. Thus, in the context of this thesis, it is only referred to as blending unless otherwise noted. The distinction has no impact on the performance or capabilities of the presented methods. Usually, blending is not commutative, making the order of blending operations important for the composition of the final image. The depth buffer is used to store the image-space depth of fragments, typically the top-most one. 
2.2 Rendering and Visualization

Rendering is the process of converting raw data into a final high-quality visual representation, usually mimicking the real world to some extent. Visualization aims to present the raw input data in a way that can be interpreted more easily by the viewer than the data itself.

2.2.1 Rendering

In the area of graphics rendering, the raw data typically consists of primitives like points or triangles forming a mesh, volumetric data, or other representations. The result of the graphics rendering process is an image which is to be shown on a display device. Basically, it is a projection from the (higher-dimensional) data/attribute space into the (two-dimensional) screen space. On the screen, each pixel of the final image only consists of a color value. Blending operations and a depth buffer are used to overcome the limitation that each screen pixel can only hold one color value.

Blending can take place front-to-back or back-to-front and is subject to the implementation. It describes how the existing and new color values are combined and written into the framebuffer. Thus, in the context of this thesis, it is only referred to as blending unless otherwise noted; the distinction has no impact on the performance or capabilities of the presented methods. Usually, blending is not commutative, making the order of blending operations important for the composition of the final image. The depth buffer is used to store the image-space depth of fragments, typically of the top-most one. This enables the composition of unordered scenes with fully opaque objects. When a new fragment lies in front of the currently top-most one, the current color in the frame buffer and the depth value in the depth buffer are replaced; otherwise, the new fragment is discarded and the top-most one stays.

For scenes with transparency enabled, the opaque geometry is usually rendered in a first pass with the depth buffer enabled. In a second pass, the semi-transparent objects are rendered ordered by depth using the respective blending mode. They are blended only if they lie in front of the opaque fragments already drawn. After this, it is no longer possible to insert fragments behind the semi-transparent fragments, as they are already blended and the color information can no longer be separated.

Figure 2.2 — The Stanford Bunny mesh (a), a volume from geostatistic calculations (b), and the ball-and-stick representation of a protein rendered using ray casting (c).

Figure 2.3 — Schematics of mesh rendering using rasterization, volume rendering, and implicit surfaces.

Mesh Rendering

Mesh rendering (see Figures 2.3a and 2.2a) is one of the rendering methods used in this thesis to visualize data sets. A mesh is usually made of quads or triangles which are loaded onto or generated on the GPU. On the GPU, the vertices defining a primitive are passed through the vertex shader, followed by the tessellation shader and the geometry shader. While the tessellation stage is used to subdivide the input, the geometry shader can apply modifications to the input data topology or spawn further primitives which are also passed on. Then, the primitives are rasterized, that is, one fragment is generated for every pixel the primitive covers in the viewport, and the fragments are then fed into the fragment shader. Here, the color and depth of the fragment can be modified. Finally, fragment tests, e.g., alpha or depth tests, are applied before the fragment is passed on to the frame buffer, where it is ready to be displayed.

Volume Rendering

Volume data is another data representation and is widely used in scientific and medical visualizations. A scalar volume is (typically) some kind of three-dimensional array or grid where each cell, a Voxel, holds one or multiple data values. Other topologies, like unstructured grids or analytical/procedural formats, exist but are out of the scope of this thesis. Volume data can be obtained from medical imaging devices like CT scanners, from geological measurements, from mathematical computations, or from simulations, to name only a few sources.

When rendering the volume [Stegmaier et al., 2005a], rays are traced from the camera through each pixel of the screen and through the volume in small steps (see Figures 2.3b and 2.2b). The actual traversal strategy depends on the algorithm used and its implementation. For each step, the data value for the current position is calculated, e.g., by interpolation from its neighbors. Next, a Transfer Function is applied, converting the value into a color, which is then blended onto the screen. The transfer function maps the current data value into the color space for rendering. It is typically implemented using a lookup table which can be interpolated to find the exact output value for a given input. Another method for coloring a volume is, for example, Gradient Mapping, where the local gradient of the volume is used instead of the data value. The gradient can also be used as a normal for lighting computations. The resulting color value is then blended onto the frame buffer or collected along the ray for later blending or processing. Alternatively, the voxel value or local gradient of the volume can be used to generate an iso-surface [Krone et al., 2012]. The iso-surface can then be rendered as a mesh.
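The following GLSL fragment shader sketch illustrates such a ray marching loop with a transfer function lookup and front-to-back compositing. It is a minimal, hypothetical example: the ray setup and the uniform names (volumeTex, transferFunc, stepSize) are assumptions for illustration, not the volume renderer used later in this thesis.

    #version 330 core
    uniform sampler3D volumeTex;     // scalar volume data
    uniform sampler1D transferFunc;  // transfer function: data value -> RGBA
    uniform float     stepSize;      // sampling distance along the ray

    in  vec3 rayStart;               // entry point into the volume, in texture coordinates
    in  vec3 rayDir;                 // normalized ray direction
    out vec4 fragColor;

    void main() {
        vec4 dst = vec4(0.0);                         // accumulated color and opacity
        vec3 pos = rayStart;
        for (int i = 0; i < 1024; ++i) {
            float value = texture(volumeTex, pos).r;  // interpolated data value
            vec4  src   = texture(transferFunc, value);

            // Front-to-back compositing ("over" operator):
            dst.rgb += (1.0 - dst.a) * src.a * src.rgb;
            dst.a   += (1.0 - dst.a) * src.a;

            pos += rayDir * stepSize;
            if (dst.a > 0.99) break;                  // early ray termination
            if (any(lessThan(pos, vec3(0.0))) ||
                any(greaterThan(pos, vec3(1.0)))) break;  // ray left the volume
        }
        fragColor = dst;
    }

Early ray termination, as in the sketch, is a common optimization: once the accumulated opacity is close to one, further samples can no longer contribute visibly to the pixel.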
Raycasting of Implicit Surfaces

Implicit surfaces are a mathematical representation of an object, e.g., a plane or a sphere, using very few parameters [Gumhold, 2003; Reina, 2008]. During rendering (see Figures 2.3c and 2.2c), the intersection of the viewing ray from the camera with the object is calculated. The result is a pixel-accurate visualization of the given object. In contrast, meshed objects need a very fine resolution to deliver the same visual quality, which costs graphics memory and performance. The downside of implicit surfaces is that a mathematical expression has to be known for the object. Such an expression exists and is relatively easy to compute for simple objects like a sphere, a cylinder, or a torus, but also for higher-order data [Üffinger et al., 2010; Sadlo et al., 2011]. Triangles from a mesh could also be rendered as implicit surfaces, but their sheer number (models might use hundreds of millions of triangles) would make this approach inefficient.
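As an example of such a mathematical expression, the following GLSL helper computes the ray-sphere intersection needed when ray casting sphere glyphs, e.g., on the billboards generated by the geometry shader sketch in Section 2.1.1. It is an illustrative function with hypothetical parameter names, not code taken from the renderers discussed later.

    // Returns the distance t along the ray to the first hit with the sphere,
    // or a negative value if the ray misses. The ray direction is assumed to be normalized.
    float intersectSphere(vec3 rayOrigin, vec3 rayDir, vec3 center, float radius) {
        vec3  oc = rayOrigin - center;
        float b  = dot(oc, rayDir);                // half of the linear coefficient
        float c  = dot(oc, oc) - radius * radius;
        float discriminant = b * b - c;
        if (discriminant < 0.0) {
            return -1.0;                           // the ray misses the sphere
        }
        return -b - sqrt(discriminant);            // distance to the nearest intersection
    }

The surface normal at the hit point follows directly as normalize(hitPoint - center) and can be used for lighting, so neither normals nor tessellated geometry have to be stored.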
2.2.2 Visualization

Visualization is the process of converting raw data into meaningful visual representations. The raw data is often given in the form of lists, tables, or large and unordered chunks of numbers. The process of visualization uses colors, shapes, graph relations, geometric figures, highlighting, contextual additions, and other methods to convert the data into an image. The goal of a visualization is that a viewer can easily interpret the image and understand the underlying data. A very popular visualization is a weather map, where simulation outputs are turned into a colored map and are shown together with additional icons and temperature information. Rendering is one part of the visualization process, e.g., generating the underlying image, to which highlighting and color transformations are then applied.

Figure 2.4 — Abstract visualization pipeline in the context of the chapters and the contribution of this thesis.

The Visualization Pipeline (cf. Figure 2.4) defines the typical workflow of a (scientific) visualization application. First, the data has to be generated, for example by calculations, simulations, or measurements. In the weather map example, the data is the result of a simulation. The raw data is then filtered, if necessary. For a coarse-scale weather map of an entire country, fewer data samples are necessary than for a large, detailed view of a specific region. This filtered data is then transformed into some kind of ordered data structure, e.g., into a regular two-dimensional grid or array. The result is mapped to graphical primitives. There exists a plethora of methods to map the data structures to graphical primitives. For the weather map example, the two-dimensional array can be used as-is for the rendering step. Regular grid data can be visualized using volume rendering, iso-surfaces can be extracted as a geometry mesh, and particles can be rendered as spheres, to name only a few. The graphical primitives are then rendered by the graphics system and displayed for the viewers to inspect and analyze. In the case of the weather map, the array values are used as input for a transfer function which maps temperature to color, and the resulting color is displayed or printed. The user can typically influence all of these steps, e.g., by selecting ranges of data, the mapping function, and the rendering method with all of their respective parameters. When the data generation or acquisition in the form of a computation or simulation is also interactively controlled by the user, this is often referred to as interactive steering [Mulder et al., 1999; Wright et al., 2010].

In the context of this thesis, the presented abstraction layer (see Chapter 3) can be useful in the data generation, filtering, and mapping steps. A method suitable for mapping and rendering the data, especially in a distributed environment, is presented in Chapter 4. Finally, methods for user interaction with a focus on large displays and mobile devices are discussed in Chapter 6.
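For the weather map example above, the mapping step can be as simple as a texture lookup. The following GLSL fragment shader sketch maps a filtered two-dimensional temperature field to colors via a one-dimensional transfer function texture; the names and the normalization range are illustrative assumptions, not part of an application described in this thesis.

    #version 330 core
    uniform sampler2D temperatureField;  // filtered simulation output on a regular grid
    uniform sampler1D colorMap;          // transfer function: temperature -> color
    uniform vec2      tempRange;         // (min, max) temperature used for normalization

    in  vec2 texCoord;
    out vec4 fragColor;

    void main() {
        float t = texture(temperatureField, texCoord).r;
        float normalized = clamp((t - tempRange.x) / (tempRange.y - tempRange.x), 0.0, 1.0);
        fragColor = texture(colorMap, normalized);   // the actual mapping step
    }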
The continuous communication between the renderer and the display application demands an infrastructure which is capable of handling the required bandwidth for the data transport and which has a low latency, in order to provide a reasonably interactive user experience. The main drawback of distributed or remote rendering is the network aspect: the network, typically also shared by other applications, has to deliver enough bandwidth and guarantee a low latency for an interactive and fluent user experience. Even if the network is exclusively reserved for the visualization system, it is normally not possible to scale the network architecture as easily as upgrading the rendering nodes, i.e. adding additional nodes.

Meligy [2008] gives a state-of-the-art report on general parallel and distributed visualization. For distributed rendering, various systems and frameworks are available, for example DRONE [Repplinger et al., 2009] or Equalizer [Eilemann et al., 2009]. IceT [Moreland, 2011] is a scalable parallel rendering framework for applications to render remotely to a large display. There are also commercial and industry-grade frameworks that support remote visualization, e.g. RemoteViz for the Open Inventor framework by the FEI Visualization Sciences Group (http://www.vsg3d.com/open-inventor/remoteviz, accessed 29.06.2015) and the Visualization Platform offered by Dragon HPC (http://dragon-hpc.com/remote-visualisation.php, accessed 29.06.2015). Those systems are far more sophisticated in terms of distributing the workload or managing the nodes than the approach presented in this thesis. However, their transparency handling is still framebuffer-based, making the handling of distributed, intersecting objects unfeasible.

In a remote rendering scenario, rendering servers are run and controlled by a client application. A server is a software application which runs on a node in the cluster. It interacts with other servers or the client by network communication. The client is also an application, but it is controlled by the user, i.e. it has a (graphical) user interface and allows the user to tweak parameter settings. The rendering result can either be displayed on the rendering servers directly, be transferred to the client application and displayed there, or be transferred to dedicated display nodes.

Remote and distributed rendering are closely related. The only difference is whether the image is transferred as a whole (remote rendering) or the display nodes have to perform post-rendering operations like compositing (distributed rendering). Hybrid systems can, for example, combine both approaches by using nodes for compositing the output of the render server(s) and forwarding the result to the displaying devices. Distributed rendering can be used with a powerwall visualization system where one or more devices are used to display the whole image space. As it might not be feasible for the user to interact with the large display directly, a client application can be used to steer the visualization on the large display. In a remote rendering scenario, a device does not need the computational power for rendering the scene itself. Here, the user interacts with the device, which sends the controlling information to the render server. The server then sends the visualization information back, either as a composited image or just as (partial) raw data, depending on the application.
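To give a rough feeling for the bandwidth involved in transferring full frames, the following small C++ sketch computes the uncompressed per-frame size and data rate for a full-HD RGBA image at an interactive update rate; the numbers are illustrative only and ignore compression, which practical remote rendering systems typically employ.

#include <cstdio>

int main() {
    const double width  = 1920.0;        // pixels
    const double height = 1080.0;        // pixels
    const double bytesPerPixel = 4.0;    // RGBA, 8 bits per channel
    const double fps = 30.0;             // assumed target update rate

    const double frameBytes   = width * height * bytesPerPixel;          // ~7.9 MiB
    const double mibPerSecond = frameBytes * fps / (1024.0 * 1024.0);    // ~237 MiB/s

    std::printf("per frame: %.1f MiB, at %.0f fps: %.0f MiB/s\n",
                frameBytes / (1024.0 * 1024.0), fps, mibPerSecond);
    return 0;
}

Even a single uncompressed full-HD stream at 30 fps already exceeds what Gigabit Ethernet can deliver, which is one reason why compression, transferring only intermediate data, or compositing close to the display nodes are attractive in such scenarios.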
Use cases for this scenario might be a smartphone or tablet, used in a meeting or presentation, showing the visualization without the need of having a powerful laptop, workstation, or even a rendering cluster at hand.

There are two different decomposition variants for distributed rendering: image-space (or screen-space) decomposition and object-space decomposition. The first variant partitions the screen, and each participating render node works on one or more of these sections. As these sections do not overlap and only one node works on a given portion of the screen, each node needs to know the complete scene. Therefore, the complete scene needs to be distributed among the participating render nodes. This might not always be feasible, e.g. if the scene is too large. For example, in a big in-situ simulation, the output sizes would be far too large to transmit for interactive rendering. This corresponds to the sort-first approach for rendering, as the partitioning of the data set takes place before the rendering. The latter variant divides the scene in object space. Here, each node renders its designated partition of the scene. This can lead to overlapping screen-space sections where more than one node contributes to the same pixel on the screen. The compositing algorithm needs to be aware of this and has to be able to blend multiple render results onto the same pixel in the correct order. Depending on where the sorting is done, either on the render nodes directly after the rendering process and before sending, or on the display nodes, this is referred to as sort-middle or sort-last, respectively. As an example, Section 4.6 details two different applications which make use of distributed rendering using Per-Pixel Linked Lists (PPLLs).

2.4 Large Displays

In the context of this thesis, large displays are a setup of one or more display devices, e.g. monitors or projectors, which are driven by one or multiple display nodes. Typically, a single monitor or projector of today has a resolution of at least 1920 × 1080. The specific display wall setup used for the research in this thesis is the VISUS powerwall [Müller et al., 2013]. It is a stereo back-projection wall and consists of five projectors per eye. The projectors are positioned in portrait mode next to each other. Each projector has a resolution of 4096 × 2400 and four input ports. The total resolution is 10800 × 4096 with an overlap of 300 pixels between the projectors. The wall is driven by one display node per projector with two graphics cards each, and each card is connected to two projector inputs. So far, no special interaction techniques are available to the presenter. When using the powerwall, a separate operator node in the back of the room is used to steer the visualization. The main reason is that the operator needs to see the powerwall and has to control the presented material while the speaker talks facing the audience. Chapter 6 will detail approaches to give the speaker (or the audience) a direct influence on the visualization.

2.5 Mobile Devices

Today's mobile devices have a touch-enabled 4-inch to 10-inch display, typically featuring a resolution of 1280 × 720, 1920 × 1080, or 2560 × 1440, which corresponds to about 200 to 400 ppi. The devices typically have no hardware buttons except for power, volume control, or the camera shutter, if available. High-end devices feature a dual- or quad-core processor and two or more gigabytes of main memory.
They usually connect via WiFi to the local wireless LAN to communicate with other services. Possible interactions on the touch screen are panning, zooming, and rotating using one or two fingers. Additionally, the touch screen can display standard 20 Chapter 2 • Fundamentals interaction elements like buttons, sliders, etc., enabling the user to also control parameters provided by of the application. Although the used devices feature a 3D-enabled graphics chip which can run OpenGL ES, the graphical capabilities of the devices are limited and the graphics chips are very battery consuming. OpenGL ES, an OpenGL variant for embedded systems, provides a reduced API for graphics programming. This allows the chip vendors to develop more efficient hardware, as it only needs to support a reduced function set. As mobile devices do not have the room for large chips, nor the power to drive them, and, especially for the flat devices, cannot dissipate the heat, the mobile CPUs performance cannot be compared to today’s workstation computers. These limitations and the fact that the data sets used can be in the size of Gigabytes, which is too much for transferring and storing them on the device, are the reasons to either use remote rendering or reduced data sets for the visualization on the devices itself. C h a p t e r 3 Computation Hardware Abstraction As stated in the introduction, today’s computer systems continuously offer more performance and the users exploit the performance by increasing the size and complexity of the problems. This chapter describes DIANA, a Distributed In A Node Abstraction layer for computational hardware. The goal of DIANA is to provide easy access to the ever-changing environment of the hardware in a node and in a homogeneous or heterogeneous cluster system, thus simplifying the development process. It is designed to be integrated into an application once and then supply independent plugins which deliver access to the hardware. Thus, the application does not need to be recompiled when a new API is deployed by the vendor or new hardware has to be used. Only the new plugin needs to be implemented, reducing the maintenance effort and development costs and time. The plugins are a kind of building blocks for the programmer. They include access to a hardware component or deliver a compute function which can actually execute code on the hardware. The programmer uses DIANA to select the functions he needs and then execute them on the available devices, always having the possibility to exchange the hardware when the appropriate plugins are available. The presented system also allows remote computations by using the same interface, effectively elevating standard single-node applications into cluster- ready applications. DIANA is therefore a scalable abstraction architecture. The performance cost of the abstraction is kept at a minimum, making it in no way inferior to the actual implementation without the middleware. As an example 22 Chapter 3 • Computation Hardware Abstraction case, a distributed visualization of a Finite-Time Lyapunov Exponent (FTLE) field using DIANA is presented. Parts of this chapter are based on the publications “DIANA: A Device Ab- straction Framework for Parallel Computations” [Panagiotidis et al., 2011] and “Distributed Computation and Large-Scale Visualization in Heterogeneous Compute Environments” [Panagiotidis et al., 2012]. 3.1 Motivation GPUs were always designed with a strong focus on performance due to the need for faster and more realistic graphics. 
Formerly, configurable fixed function pipelines were used on GPUs, but for some time now programmable compute units, so called shaders, are prevalent. Initially, shaders were used for graphics exclusively, for example for the transformation and lighting stage, texturing, or post processing. Due to hundreds of parallel processing cores on GPUs and the resulting high peak floating point operations per second, shaders were quickly adopted for general purpose computations. Since then, the concept of GPGPU has become a driving force in the development of new graphics hardware and a major aspect in high performance computing (HPC). Moreover, there are other specialized hardware platforms suitable for parallel computation, for example Intel’s Xeon Phi. Programming data-parallel GPUs is different to task-parallel CPUs in several ways. First, data residing in the main memory needs to be transferred to and from the GPU memory in order to access it, while the CPU has direct access. This non-uniform memory access within one node is similar to distributed memory processing with interconnected nodes. Second, GPUs were designed for graphics, that means for different memory access patterns, like texture lookups, where data locality is given. In contrast, CPUs normally access memory in a very random manner with a smaller performance penalty than GPUs. Finally, GPU cores are combined in so-called multi-processors that execute the same code, making branching costly. Nonetheless, using GPUs for some operations has proven to deliver speed-ups by several orders of magnitude and should not be easily disregarded. However, the kernels for these operations do need to be written and optimized by hand and for each hardware architecture again. Therefore, it is obvious to implement a system that takes care of the environment and only to exchange the parts that are in contact with the actual API or hardware. In general, hardware vendors supply their own, distinct API for their GPUs or compute units. So, in order to access the various parallel computation hardware, 3.2 • Previous Work 23 developers have to use different interfaces and libraries. This is complex and time-consuming for a variety of reasons, like detecting available hardware, using diverse interfaces, incompatibilities between hardware, and different tool-chains. Even when developers choose to specialize their applications by supporting only one API, they face problems like differing hardware capabilities (e.g., double- precision support), and changing APIs. Additionally, it is sensible to support different devices, because of changes in hardware (e.g., better performance, higher energy efficiency, lower costs) and suitability for computations [Göddeke et al., 2007; Brodtkorb et al., 2010]. Furthermore, software that is already in long-term production is not easily adapted to new hardware and the GPGPU paradigm. For example, tested and productive code needs to be changed to use GPU APIs, code (kernels) for GPUs needs to be written, and interfaces are needed to decide whether to call CPU or GPU code. Nowadays, clusters are typically homogeneous regarding hardware as well as software. While this simplifies development, deployment, and debugging of parallel applications, it also dictates APIs and platforms (e.g., GPUs vs. CPUs, NVIDIA vs. AMD, CUDA vs. OpenCL). Furthermore, extending the infrastructure with heterogeneous hardware often necessitates widespread changes in software. 
On the one hand, using homogeneous infrastructures is the predominant approach in distributed memory processing and parallel computing, as seen on the various cluster systems, e.g. the Tianhe-2. On the other hand, this is hard to maintain in today’s rapidly changing field of high performance computing because of steadily increasing performance and memory, and decreasing power consumption of CPUs and GPUs, making hardware upgrades and thus software adjustments mandatory. 3.2 Previous Work An overview of node-level parallelization is given by Brodtkorb et al. [2010]. They conclude that applications cannot rely on a single architecture alone but need to be developed with general heterogeneous systems in mind. This means that applications need to support different compute units locally, as well as remotely, to maximize scalability. CAPS HMPP [Dolbeau, 2007] is a commercial solution that provides compiler directives for calling functions on accelerator devices. It allows for automatic generation of CUDA and OpenCL code through their compiler backends as well as user-supplied code. While HMPP and DIANA are conceptually similar, DIANA is more intrusive because it requires explicitly allocating memory, transferring data, and executing commands on a device. On the one hand, this increases the development effort. On the other hand, this allows for better control and 24 Chapter 3 • Computation Hardware Abstraction fine-tuning, for example by including more detailed and relevant information (e.g., data locality, transfer rates, estimated duration) for the scheduling. CUDASA [Strengert et al., 2008] provides extensions to CUDA that handle communication and scheduling in multi-GPU and GPU-cluster systems using MPI. By using new keywords and an additional compilation step, computations are scheduled automatically for execution on remote nodes. These poll a head node for work when they are idle which may lead to a bottleneck, for example when many nodes poll simultaneously. DIANA, in contrast, handles communi- cation to local and remote devices using various APIs automatically, without imposing a specific scheduling. This allows applications to decide how to distribute and access data, especially in a decentralized way as demonstrated in the example application (see Section 3.4). Furthermore, arbitrary compute units are supported through plugins without changes to existing tool chains. Domonkos and Jakab [2009] describe a programming model for GPGPU-based parallel computing. They support clusters using MPI, and CUDA and OpenMP on the local node for the actual computation. While DIANA also supports these, it additionally features a flexible plugin model, allowing for easy extensibility of hardware APIs and computation kernels. rCUDA [Duato et al., 2010] is similar to CUDASA by providing a client-server system for CUDA that communicates using sockets. Currently, rCUDA 3.1 exposes the CUDA 4 API, except for graphics-related methods, through a wrapper library that is loaded on clients as drop-in replacement for the real CUDA library. It provides support for the runtime API and the CUBLAS library but not for other libraries like CUFFT. Though one can use multiple rCUDA servers, it is unclear how an application uses a specific GPU on a server. Data transfers are handled implicitly but kernel launches need to be configured by manually building the call stack. 
DIANA goes beyond this approach not only by hiding a specific API but by providing access to arbitrary APIs (e.g., CUDA and OpenCL kernels, CUBLAS) for use through well-defined interfaces. Furthermore, all remote devices are exposed to the clients, providing full control of their use. Barak et al. [2010] present two approaches to distributed heterogeneous com- puting. First, VCL provides an implementation of the OpenCL specification that exposes OpenCL devices of remote nodes. Second, MGP provides an API and source directives for easy partitioning and distribution of tasks on different backends, for example VCL. DIANA also allows the transparent use of devices but is not restricted to a specific API, providing access to any compute hardware with the respective plugin. PaTraCo [Frey and Ertl, 2010] supports automatic scheduling and distribution of 3.2 • Previous Work 25 problems to multiple nodes with different compute units. Computations need to be declared in passes and pass-blocks. Every node constructs a task-graph with computations as vertices, and transfer and execution duration as weights on the edges. The nodes then determine the critical path with regard to minimizing transfers in order to minimize the total duration for processing the whole graph. The edge weights can be set initially or guessed by PaTraCo and are updated every iteration, thus dynamically adapting to interconnects and workloads. However, the partitioning of computations into passes and pass-blocks can impose rather large changes to existing applications. Furthermore, PaTraCo only allows for scheduling of a fixed number of different compute architectures that need to be implemented at compile time. DIANA improves upon this by providing access to arbitrary devices through plugins loaded at runtime and a query mechanism for selection of available devices. Applications can use any dynamically loaded device in the same manner, thus requiring no source changes to support additional compute units. DIANA is also less intrusive since computations do not need to be expressed as passes and pass-blocks, and the scheduling is left for the application to decide. Rossbach et al. [2011] present PTask, an abstraction layer for operating systems to manage GPUs as compute devices. Their integration with the operating system allows PTask a high performance and little overhead, as the objects can directly be managed by the kernel. They demonstrate PTask using a gestural interface and an encrypted file system as an example. In contrast to DIANA, PTask is designed as an integral operating system component. DIANA is designed on top of existing operating system and hardware driver APIs to be integrated on the application level. Bourgoin et al. [2014] present SPOC (Stream Processing with OCaml), an OCaml based GPGPU programming framework. It allows low-level and high- level programming of the computation devices either by directly using CUDA or OpenCL, and by the integration of a OCaml language extension for GPGPU. While these systems are both similar to DIANA, their programming model is closer to the actual parallel programming style. DIANA hides the details from the programmer, offering only interfaces that have to be used instead. The remote execution of DIANA resembles modern RPC systems like Smart- GridRPC [Brady et al., 2010] in the sense of automatic serialization and transport of data, and invocation of procedures on remote nodes. 
While SmartGridRPC uses agents for service discovery, DIANA provides a local database – populated on startup by querying remote nodes – for access to remote compute units and operations. Contrary to the automatic load balancing of computations and reduction of communication in SmartGridRPC, DIANA requires developers to manually trigger computations and transfers. This allows for fine-grained scheduling optimization based on the knowledge developers have about their applications and problem domain. Internally, DIANA minimizes communication and the amount of transferred data where possible, for example by sending buffer contents only when necessary, or by using special API calls on remote nodes to copy data between local compute units.

Figure 3.1 — Overview of the DIANA architecture. Plugins (orange) provide commands and devices to DIANA (blue). An application (green) initiates the data transfers and selects the commands and devices needed for computation, and DIANA provides the interfaces for execution on the hardware (gray).

3.3 System Overview

In principle, the core component of DIANA is a database, and it provides abstract interfaces for the hardware and for the functions to be executed (see Figure 3.1). The following section describes the internals and how an application can make use of them.

3.3.1 Concepts

As the goal was to design a common high-level layer for multiple computation devices, three main components needed to be abstracted: the computational Devices themselves, an interface for the abstract Memory Objects, and a description of the work to be done, the Commands. Basically, DIANA is an SQL database holding the plugins, plus a set of management and interface functions to populate the database, retrieve its content, and execute the kernel functions on the computation hardware.

Database Storage

The main part of DIANA is an SQL database backend storing all information about devices, commands, and the internal management. During startup, the database is populated with the information provided by the plugins. Each plugin delivers the data it retrieved from the specific API (e.g. CUDA for the available devices) about the devices, or the information about the commands, to DIANA, which then puts them into the database. Upon execution of a command, the database is queried for the issued command, e.g. the matrix multiplication. The query may contain specific device capabilities necessary for execution, like the support of double-precision floating-point operations. Additionally, a matching device is queried from the database.

Devices

The device interface holds all information necessary to run a command on a device and all information about the capabilities of the device. Its implementation uses the standard interfaces provided by the operating system for the CPU device and the manufacturer-specific APIs for all other hardware devices. The main ability of a device is to run a command. Even the device-specific commands for memory management like allocation, copy, put, get, set, and free are implemented as commands executed by the device.

Commands

A command mainly consists of a run method which is called when the command is executed. It handles the API-specific execution of a function, e.g.
calling the CUDA functions to run a given kernel. The command also handles the parameters necessary for the execution, e.g. the memory objects to compute on or the like. Implementing a new command means to provide an abstract interface in form of a C++ class, a kernel function, and defining the capabilities of the command. The interface is what the programmer for the application is working with and the implementation is one instance of this interface. This ensures that multiple commands, e.g. one for OpenCL and one for CUDA, can be used with that one interface defining them. To run a command on a device, the parameters for the command need to be set and then the command and the device are passed to DIANA which handles the 28 Chapter 3 • Computation Hardware Abstraction execution of the selected command on the given device. The application can implement an asynchronous command management, querying the command for completion (or failure), so it knows the current state of the command and whether it can continue with the computations. Memory Objects Another important component is the abstraction of the memory objects on each device. As each API can have its own memory management, the buffers also need to be abstracted. From the application point of view, the memory objects are managed using the device-specific memory commands. 3.3.2 Using DIANA Upon start, DIANA loads all available plugins and queries the found devices for their capabilities like double-precision computation, available memory and the like. Usually, the first step for an application is to allocate memory on a computation device and initialize the memory. Therefore, a device is chosen and the database delivers a compatible command for the memory operations. Then, the application selects a computation command from the database and requests a device which can execute this command. Here, data locality, compu- tation capabilities or performance may play a role. These selectors may either be automatically derived by DIANA, e.g. when a given command needs a special capability, or by the user, e.g. selecting a device with a given feature. The application then configures the command, sets all parameters and buffers, if required. Finally, DIANA executes the command on the selected device. Multithreading is used to keep the impact on the actual application as low as possible. Callbacks or fences can be used to signal the completion of a computation request or memory operation. 3.3.3 Support for Remote Devices and Computations DIANA supports abstraction of compute units on a node-level. For further scalability, DIANA automatically provides access to compute units on remote hosts. A small application (the DIANA server) runs on each remote host, listening for incoming compute requests from applications. Applications use DIANA exactly as before; remote devices are automatically discovered by the network plugin and are used in the same way as local devices. For the programmer, the remote devices are accessed like local devices with all the memory transfers handled by DIANA. 3.3 • System Overview 29 Application Side The network plugin sends a discovery message to known servers when it is loaded and receives a list of all devices and commands that are available on each server. When registering remote devices and commands in the application-side DIANA database, the network plugin appends a unique identifier for each server to the server-side identifiers, thus making them unique application-wide. 
Internally, remote devices and commands are proxies that implement the same interface as their local counterparts. The proxies serialize and forward all calls to the target server, handling all communication automatically. Thereby, only scalar values (e.g., integers, floats) and globally unique identifiers for devices, commands, and buffers are transmitted. Note that sending and receiving of buffer contents are explicit memory operations, triggered by the application. Since the applications’ network plugin generates all required identifiers for further communication, there is no need to wait for an answer from the server, except when querying the state of a command. The transportation layer ensures the end-to-end communication and the integrity of the transmitted data. The network plugin does not link nor require any information for the remotely used target hardware or APIs. If the plugin is not loaded, there will be no automatic communication to hosts running the DIANA server. In any case, other device plugins (e.g., CUDA, OpenCL) behave exactly as before. As a result, applications can use devices without explicit distinction and knowledge of their physical location. An application uses the same calls for both local and remote devices for memory allocation and release, data transfer, command execution, and result retrieval. Furthermore, the constraints for using remote devices also apply to local de- vices, since those might operate asynchronously. Therefore, it should never be assumed that drivers or APIs ensure barriers or execution order. Instead, applications need to query if commands are finished or encountered an error. When accessing data, an application needs to wait for the transfer to finish. There is no implicit or automatic scheduling of any operations in DIANA since partitioning of the problem or data management is highly application dependent. Only the network communication is handled automatically by the network plugin. This leaves all responsibility for balancing computational workloads among available devices to the application which can employ any desired scheduling strategy, e.g. work-stealing. Ultimately, when a network command is finished, the application releases it and the network plugin tells the server to free all corresponding remote resources. 30 Chapter 3 • Computation Hardware Abstraction DIANA Server On startup, the server loads the DIANA plugins for access to local devices and available commands thereon. The server then waits for requests from the application-side network plugins to discover devices and commands, execute operations, or query command states. The discovery simply returns the local database of the devices and commands. When a command request is submitted, the server deserializes the parameters, constructs the corresponding local com- mand, and queues it for execution. All submitted commands are non-blocking on the client side to maximize parallel operation. Applications only block when querying the command state. Memory operations behave in the same way as computations: execution fol- lowed by query of the state for completion or error. The only exception is the transfer of data from a remote device, where the data is returned together with the notification of command completion. Subsequently, the network plugin on the application side copies the result into a local buffer. Commands and other resources are held until the application issues a release. 
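To summarize the calling pattern from the application's point of view (select a device, issue memory commands, execute a compute command, poll its state, release resources), the following strongly simplified C++ sketch mimics these concepts with invented names and a trivial synchronous CPU backend. It only illustrates the workflow described in Sections 3.3.2 and 3.3.3 and is not the actual DIANA API; the same calls would be used whether the device is local or a remote proxy.

#include <cstdio>
#include <vector>

// Invented, strongly simplified stand-ins for the concepts described above
// (device, memory object, command with a run method and a state to poll).
enum class State { Pending, Finished, Failed };

struct Buffer { std::vector<float> data; };          // abstract memory object

struct Command {
    virtual ~Command() = default;
    virtual void run() = 0;                          // API-specific execution
    State state = State::Pending;
};

// Example compute command: scale a vector in place (stand-in for a kernel).
struct ScaleCommand : Command {
    Buffer* buf = nullptr;
    float factor = 1.0f;
    void run() override {
        for (float& v : buf->data) v *= factor;
        state = State::Finished;
    }
};

// A trivial synchronous CPU device; a GPU or remote device would enqueue the
// command asynchronously, which is why the state query below is mandatory.
struct Device {
    void execute(Command& c) { c.run(); }
};

int main() {
    Device device;                       // 1. select a device (capability query in DIANA)

    Buffer buf;                          // 2. allocate and fill a buffer ("put")
    buf.data.assign(8, 1.0f);

    ScaleCommand scale;                  // 3. configure the command parameters
    scale.buf = &buf;
    scale.factor = 2.0f;
    device.execute(scale);               // 4. let the device execute the command

    if (scale.state == State::Finished)  // 5. poll for completion before reading ("get")
        std::printf("result[0] = %f\n", buf.data[0]);

    return 0;                            // 6. release commands and buffers
}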
DIANA servers can also communicate with each other, without routing traffic through the node that executes the application. In such a case, the usual device- to-device transfer is realized by a copy between remote nodes. As each device has a unique id, the servers know where to send the data and if the destination is within their own domain or on another server. However, the application is in control and the DIANA servers only receive the copy command with the source and destination device identifiers. Use Cases and Scenarios With DIANA it is possible to supply the same operation (e.g., FTLE computation) multiple times for the same and for different compute units. Applications can choose from the available operations in order to optimize performance or efficiency. For example, applications can operate on small data sets locally when equipped with less performant devices, or offload larger data sets to more powerful remote nodes. Using this feature also enables seamless switching between operation imple- mentations to evaluate different platforms and increase portability. For instance, developers can start using CUBLAS on NVIDIA hardware and then later choose to use clAmdBlas on OpenCL platforms without any changes to their applica- tion. Furthermore, one can provide specifically tuned variants of operations that are more performant for given problem sizes than standard versions in libraries. The user has to specify selectors or conditions for the command to be used for the current input data. 3.4 • Large-Scale Finite-Time Lyapunov Exponent Field Visualization 31 3.4 Large-Scale Finite-Time Lyapunov Exponent Field Visualization The presented system can be utilized for computation and visualization. The following section explains an example of using DIANA for visualization of a FTLE field. As an example of the capabilities of DIANA, it is extended for computation and visualization of the FTLE [Sadlo and Peikert, 2007] in a distributed heteroge- neous environment. It demonstrates how DIANA can be used with multiple different nodes and compute units to generate a large-scale visualization of the computationally expensive algorithm. N Figure 3.2 — Visualization of a FTLE field computed by distributed DIANA instances and presented on the VISUS power wall. The system consists of two parts: a DIANA plugin and a display application. The display application triggers the computation, fetches the results from DIANA, and composes the resulting image which is shown on the attached display (see Figure 3.2). Since each pixel of the FTLE visualization requires the computation of a forward and a reverse trajectory in the vector field, it lends itself well for parallelization. Each trajectory is computed independently and is defined by its start point, start time, and duration of advection. The viewport of the display 32 Chapter 3 • Computation Hardware Abstraction is partitioned into as many parts as there are compute nodes assigned to the application. Thus, the application builds the final image from parts of equal width and height computed by statically assigned compute nodes. As each pixel is exactly as expensive as every other pixel, no balancing issues occur in this scenario and, thus, no scheduling is necessary here. The VISUS powerwall (see Section 2.4) together with a cluster of 64 nodes is used for visualization (see Figure 3.2). Each cluster node contains two NVIDIA GTX 480 GPUs, and the result is displayed on five 4096×2400 projectors resulting in a 40 mega-pixel display on a powerwall. 
Simultaneously, the same frame is computed at lower resolution on commodity workstation hardware (NVIDIA GTX 465/480 and AMD RADEON HD 5870) for use on a smaller auxiliary display, in this case the operator node of the powerwall. All displays are synchronized with each other, resulting in a consistent view on the powerwall and the operator node.

DIANA Plugins for FTLE Computation

The plugin is the second component of the system. It computes two FTLE fields, one resulting from forward and one from reverse particle advection. The retrieval of time-dependent vector field data is treated as a black box, that is, DIANA assumes no details regarding its internals or runtime characteristics. Consequently, one can switch between simulations, provided they deliver correctly formatted data, without changing anything in DIANA or the application. For the tests, pre-calculated, time-dependent CFD data on a 101×101 grid was used as input. Commands on devices consist of two parts in DIANA: the abstract interface and the device-specific implementation. The abstract interface for computing the FTLE requires several device buffers as well as parameters describing the compute domain. The computeFTLE command for this example case provides an implementation of this interface for CUDA and for OpenCL, each residing in a separate DIANA plugin (for development reasons only). Note that these mostly call the respective API for execution on local hardware. The actual transfer of data and the remote invocation is handled by the network plugin (see Section 3.3.3). The implementation requests the necessary vector field data from the simulation black box and passes it to the kernels for both forward and reverse integration. The results are then stored in separate buffers which are sent back to the display nodes.

Figure 3.3 — Each display application requests a forward and a backward frame and merges them into the final image. The display nodes stay in sync to deliver a consistent image.

Display Application

The FTLE evaluation domain is mapped to a display that may consist of multiple distributed display nodes, each having its own local viewport. Thus, the display application needs a list of other nodes to synchronize the display with, the number n of display nodes it shares the global viewport with, and the index i of its local viewport relative to the others. In this scenario, the synchronization list contains the five powerwall nodes and the operator node. For the display nodes of the cluster, n is set to 5 and i is set to 1..5, so every display node shows one fifth of the final image (see Figure 3.2). The auxiliary display uses n = i = 1, so the whole FTLE domain is shown in one window. The list of synchronized nodes needs to be the same for all application nodes in order to provide a synchronized view. Additionally, the display application receives a static assignment of DIANA servers to compute the FTLE field. As described in Section 3.3.3, the application-side network plugin discovers remote devices and commands on startup. Then the application queries the local database for implementations of the computeFTLE interface and subsequently for devices that can execute it. After starting the computation and synchronization threads, it sets up an OpenGL context and enters the rendering loop.
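The static partitioning of such a display tile among the assigned devices can be expressed in a few lines. The following helper is a hypothetical illustration assuming horizontal stripes of 2400 × (4096/n + 2) pixels, as in the benchmarks of Section 3.5.1, where the two extra rows of padding allow the gradient needed for the FTLE to be evaluated at the stripe borders without seams; it is not code from the actual display application.

#include <cstdio>

struct Stripe { int x, y, width, height; };

// Hypothetical helper: the image stripe GPU k out of n is responsible for when
// a display tile of tileWidth x tileHeight pixels is split into horizontal
// stripes, each padded by two extra rows (assumption based on Section 3.5.1).
Stripe stripeForGpu(int k, int n, int tileWidth, int tileHeight) {
    const int rows = tileHeight / n;
    return Stripe{ 0, k * rows, tileWidth, rows + 2 };
}

int main() {
    // Example: a powerwall tile of 2400 x 4096 pixels split among 4 GPUs
    // yields stripes of 2400 x 1026 pixels, matching Table 3.3.
    for (int k = 0; k < 4; ++k) {
        const Stripe s = stripeForGpu(k, 4, 2400, 4096);
        std::printf("GPU %d: y = %4d, %d x %d pixels\n", k, s.y, s.width, s.height);
    }
    return 0;
}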
The computation thread requests result frame f from every assigned remote device using f as time step parameter for the FTLE command (see Figure 3.3). Thereby, every device computes a part of the image. The resulting buffers 34 Chapter 3 • Computation Hardware Abstraction containing the forward and reverse FTLE are retrieved after each device of this display node finished its computation in parallel. Those parts are then merged to two coherent buffers and cached for display. Finally, the application broadcasts to all display nodes that it is ready to display frame f and continues by requesting frame f + 1. The synchronization thread waits for messages about completed frames from other display nodes. Upon receiving one, it increments the node counter for that frame. When all nodes have signaled the availability of the frame, it notifies the render loop to use that frame and all previous frame parts are discarded. Note that the chronological order is maintained as the frames are requested in ascending order and frames are discarded only when a newer frame is ready on all display nodes. Thus, every frame is computed sequentially, transferred to the display node, and displayed in the correct order. The render loop simply maps the scalar field output of the FTLE computation to a texture that is rendered on a full screen quad by using the forward FTLE as the red channel and reverse FTLE as green channel; the blue channel is set to 0.0 and alpha to 1.0 (see Figure 3.2). Scalability and Universality The cost of the FTLE computation depends on the output resolution and integration length. The amount of data a display node receives is constant and only depends on its total display resolution. Thus, the time needed to generate a single frame decreases if more nodes are added, since every node needs to compute smaller images. Note that the prototype is not restricted to nodes of the cluster to create an image on the powerwall. One can easily use nodes outside the cluster as long as DIANA is installed, a suitable network connection is available, and they support CUDA or OpenCL when the plugins for those APIs are provided. However, as the views are synchronized, the slowest node sets the pace as each node only can display the new frame when it is available to all other nodes. The setup described here can be applied to accomplish any distributed computa- tion and visualization scenario. For development, one can use local commodity hardware in workstations. Even when displaying the result on a single monitor, one can simply use multiple windows to verify if the tiles fit together. The whole system can then be easily deployed on multiple machines or a cluster. 3.5 • Summary 35 3.5 Summary All benefits of the presented abstraction layer are achieved by trading-in perfor- mance. Every layer between the calling software and the executing hardware does cost performance. However, what matters when using the software is not only the performance itself but also the maintainability and flexibility. The system has to be well-balanced with respect to these goals. 3.5.1 Results DIANA provides an interface to different device APIs, Software Development Kits (SDKs), and libraries. Since there are some layers of indirection (e.g., database lookups, virtual function calls) calling an API via DIANA should always be slower than calling that APIs directly. Therefore, a major goal was to minimize the time spent in DIANA while still having high extensibility and flexibility. 
In order to measure this overhead, we constructed the following test: initialize CUDA, set up the device, allocate memory on the device for a vector, transfer it from main to device memory, modify it on the device, transfer it back to main memory, free the memory on the device, and shut down the device and CUDA. What is measured is the time it takes the CPU to call CUDA and DIANA, respectively, since the intention was not to measure the performance of the computation or the hardware. This test is repeated 10,000 times with a vector containing 10,000,000 floats, resulting in 40 MiB to transfer. The test system was an Intel i7 920 with 12 GiB main memory and an NVIDIA GTX 465 with 1 GiB device memory.

In the initialization (Init) phase, only a context for CUDA is created, while DIANA loads a plugin and initializes it, thereby registering one CUDA device (which creates the context) and one command. The setup phase does not apply to CUDA, since the created context is already active. In contrast, DIANA needs to query the database for the CUDA device; its context will be activated inside the command that modifies the vector. Memory operations (allocation, put, get, free) directly map to the respective CUDA functions and to the buffer methods of DIANA. Computation is split into two distinct parts, lookup and execution. In the lookup phase, an application determines which kernel to execute depending on the available hardware. While a real application supporting multiple devices cannot skip this, no specific checks of hardware capabilities are assumed for the test. Therefore, a CUDA kernel file is loaded directly and the function pointer for the kernel is retrieved. The execution phase creates the call stack and launches the kernel. These two phases are different in DIANA. In the lookup phase, the database is queried for the command and it is instantiated. In the execution phase, the application sets the parameters of the command and executes it directly. The command then executes all steps of CUDA's lookup and execution phase as described above. The termination phase releases all CUDA resources, while DIANA unloads the plugin (thereby releasing devices and commands) and shuts down the database.

Table 3.1 shows the median times (rounded to the first decimal) of all phases for CUDA and DIANA, and their difference. Figure 3.4 shows the distribution of values for ∑* of Table 3.2 (total time without the phases Init, Setup, and Term); here, ∑* equals the orange lines inside the boxes.

Table 3.1 — Median time in milliseconds the CPU needed to call CUDA and DIANA. Note that the execution time is not measured on the GPU.

        Init  Setup  Alloc   Put  Lookup  Exec   Get  Free  Term
DIANA   96.8    0.1    0.2  70.8     0.4  26.4  71.5   5.5  75.3
CUDA    89.1    0.0    0.1  70.7    25.8   0.0  71.5   5.5  71.7
∆        7.7    0.1    0.1   0.1   -25.4  26.4   0.0   0.0   3.6

Table 3.2 — ∑* is the sum without Init, Setup, and Terminate from Table 3.1. Note that the execution time is not measured on the GPU.

           ∑      ∑*
DIANA  345.0   174.8
CUDA   331.8   173.6
∆       13.2     1.2

Figure 3.4 — From top to bottom: maximum, third quartile, median, first quartile, and minimum values of ∑*.

The difference for the computation can be attributed to two factors.
First, the lookup used for CUDA is very trivial, and even though DIANA's command lookup is fast due to the in-memory database, it still takes some time. Second, the command parameters are stored in a hash map that is filled by the caller. This map is then read inside the command and used to build the call stack for CUDA. Therefore, DIANA is overall a little slower in both phases than CUDA. Overall, the small overhead of using DIANA is acceptable since it provides many benefits.

As a benchmark, the average computation duration and the total time the display application needs for collecting all parts of an FTLE image on the display nodes of the powerwall was measured. Thereby, the number of cluster nodes a display node uses for computing its part of the whole display is increased for ten successive frames. When using n GPUs, each GPU needs to compute an image 2400 pixels wide and (4096/n) + 2 pixels high. The output resolutions tested are the same in both cases and correspond to the resolution of the partial images of the powerwall display nodes depending on the number of compute nodes used (up to twelve). Note that using an additional cluster node adds two GPUs to the system. Two additional pixel rows are needed to avoid artifacts between the partial images because the FTLE involves a gradient computation.

Table 3.3 and Figure 3.5 show the average computation times depending on the number of GPUs (respectively cluster nodes), the total time the display application requires from starting the commands on all nodes to receiving the acknowledgment that they have finished, and the speedup when increasing the number of used GPUs.

Table 3.3 — Average duration of the FTLE computation in milliseconds on the cluster nodes (GTX 480 using CUDA) for different image heights depending on the number of GPUs used. Additionally, it is shown how long the display application takes from issuing all commands to receiving the notification that all have finished, and the resulting speedup when using more GPUs. The image is always 2400 pixels wide.

Cluster Nodes  GPUs  Height (pixel)  Time GPU (ms)  Time App (ms)  Speedup (%)
            1     1            4098        27940.1        28660.9            -
            1     2            2050        13993.9        15145.6         47.2
            2     4            1026         7021.0         8129.3         46.3
            3     6             684         4677.6         5765.7         29.1
            4     8             514         3533.1         4659.6         19.2
            5    10             411         2826.1         3949.0         15.3
            6    12             343         2361.7         3527.0         10.7
            7    14             294         2030.7         3178.8          9.9
            8    16             258         1791.5         2944.7          7.4
            9    18             229         1595.3         2753.5          6.5
           10    20             206         1438.0         2580.4          6.3
           11    22             188         1302.5         2490.2          3.4
           12    24             172         1188.7         2391.9          3.9
           13    26             159         1101.0         2307.9          3.5
           14    28             148         1030.0         2237.0          3.0

Figure 3.5 — Performance and speedup (percentage over the bars) when using multiple GPUs for the computation.

As can be seen, the speedup decays very fast, dropping below 20% when using four instead of three nodes (8 instead of 6 GPUs), below 10% when using more than six nodes (12 GPUs), and to less than 4% when using more than ten nodes (20 GPUs). This correlates with the change in the output resolution when adding more GPUs. For example, each GPU needs to compute 170 lines fewer when using four instead of three nodes (514 lines instead of 684), but only 49 lines fewer when using seven instead of six nodes (294 lines instead of 343). For the latter, this still results in a difference of about 331 ms for the computation but suffices only for about 135 ms in the display application.
Therefore, even though the computation is getting faster with more nodes and GPUs, the overall system performance does not increase accordingly due to latency and overhead in communication. The difference between the computation time and the time spent in the display application varies between 1 and 2 seconds across the different resolutions and can be attributed to overhead in DIANA and network communication.

3.5.2 Outlook and Conclusion

As the evaluation has shown, DIANA meets the goals without losing significant amounts of performance while still providing flexibility and maintainability. The system also scales within the tested cluster systems. What cannot be solved by the presented system is the fact that each hardware architecture needs optimized kernel code to achieve its designed maximum performance. This topic is subject to ongoing research [Magni et al., 2013; Grauer-Gray et al., 2012]. Integrating automatic kernel optimization into DIANA would further reduce the amount of work, and thus the costs, when upgrading to newer accelerator hardware.

DIANA provides common interfaces and automatisms to simplify the development for different hardware and its usage. With DIANA, the continuous development and support for upcoming accelerator hardware is reduced to the implementation of a device plugin and a command plugin, completely removing any update work on the main application.

With OpenCL and CUDA, and with high-performance cards from NVIDIA and AMD as well as Intel's Xeon Phi currently on the market, there are three competitors for high-performance computation. Having the option to switch to another hardware platform with minimal implementation cost is always an advantage over specialized solutions which cannot be upgraded easily. This is especially the case when the performance drawback is as small as in the case of DIANA presented here. The recent trend in hardware development is moving towards more specialized chips which are combined within one die, be it APUs in standard workstations or fast performance chips with an added low-power companion core in the mobile processor sector. This leads to a broader spectrum of processors available for a computer configuration and requires yet another configuration or implementation of the respective API. Using an abstraction layer, all these details can be hidden and the performance of the underlying hardware can be exploited with low effort.

Chapter 4
Transparency Rendering

Figure 4.1 — PPLLs OIT renderings: a molecule surface (the 1VIS protein) combined with an inner structure representation (a), the UNC power plant mesh (b), and a comparative multi-variant visualization of three different states of a hydrogel simulation (c).

Transparency is an important visual effect, but it requires additional effort when rendering, typically sorting the geometry or performing multiple render passes. It might also require additional memory, which may be filled only sparsely and in a fragmented way, depending on the rendering technique. To achieve high frame rates, approximations and heuristics can be used, but then the scene is no longer rendered exactly. The methods presented in this chapter aim to render the complete scene accurately, without any heuristics or simplifications.
Parts of the techniques and algorithms presented in this chapter have previously been published in "Rendering Molecular Surfaces using Order-Independent Transparency" [Kauker et al., 2013b] and "Evaluation of per-pixel linked lists for distributed rendering and comparative analysis" [Kauker et al., 2013a].

4.1 Motivation

Using transparency in rendering and visualization can make the image more realistic, e.g. when rendering glass that can be looked through. It can also be used to add additional information to a visualization, e.g. to make multiple layers visible at once in volume data or to visualize internal structures behind a semi-transparent surface mesh. Transparency allows the viewer to look behind layers in an image and thus to gain a better understanding of the scene or data. The best perception of transparent objects can be achieved in interactive scenes where the user can move the camera around a given object. Here, the different depth layers are not only shown as a color blended with the other layers, but the movement also helps the viewer to get a better understanding of the scene. However, rendering more content can increase the visual clutter in an image, which might make it difficult to spot important details or even overwhelm the viewer. Care has to be taken to design the application in a way that allows the parameters to be tuned so that the viewer can still see the details, e.g. by altering the transparency values or color schemes, or by tweaking the visible layers.

In the context of this thesis, the main focus of the transparency rendering is on multiple objects which overlap each other, e.g. objects simulated by a distributed system. Example use cases are the rendering of molecular surfaces (cf. Section 4.6.1), comparative visualizations (cf. Section 4.6.2), e.g. of multiple instances of a given data set, each representing a slight variation, and the rendering of unaltered object-space decomposed models (cf. Section 4.6.3), e.g. meshes which were segmented into logical units. The algorithms used here are built upon the PPLLs concept introduced by Yang et al. [2010]. Figure 4.1 shows some example images rendered with PPLLs. There exists a plethora of other algorithms for transparency rendering, but the PPLLs approach is the most suitable and flexible one, as explained in the following Section 4.2.

4.2 Previous Work

For rendering scenes with semi-transparent objects, there exists a plethora of algorithms. Basically, they can be divided into so-called sort-first and sort-middle/sort-last methods. For sort-first methods, the primitives with semi-transparent sections are first sorted by depth, and then the rasterizer generates the fragments, which are blended in order. An example algorithm in this category is the Painter's algorithm [Foley et al., 1990]. The drawback of these methods is that the geometry has to be sorted before rendering. This step typically has to be repeated for each frame, as the viewpoint might change or the transparent objects might move around in an animated scene. In contrast, sort-last algorithms render the primitives in arbitrary order and sort the fragments before compositing. This category is also referred to as Order-Independent Transparency (OIT) rendering, as the depth ordering is not important here and is handled implicitly by the algorithm.
Algorithms like Depth Peeling (DP) [Bavoil and Myers, 2008], and especially the PPLLs algorithm which is discussed in this chapter, are representatives of this category. The following sections detail different standard transparency rendering algorithms. For a more complete overview of the topic, Maule et al. [2011] give a survey of current raster-based transparency techniques.

4.2.1 Painter's Algorithm

Rendering semi-transparent scenes is challenging, as it requires the fragments to be sorted before blending them into the final image. One way of doing this is the Painter's algorithm [Foley et al., 1990], which orders polygons by depth before drawing. This implicitly guarantees a correct rendering of the scene (as long as there are no cycles in the scene). The Painter's algorithm usually separates the opaque and the non-opaque primitives. The opaque primitives can be rendered with the depth buffer enabled, thus guaranteeing an artifact-free opaque image. Only the semi-transparent primitives have to be sorted by depth and are then blended into the scene. Using the depth buffer ensures that the semi-transparent fragments are only visible in front of the opaque scene. Thus, the Painter's algorithm produces one final image containing all fragments blended into it. This makes it impossible to insert fragments between two already blended fragments, which is important for distributed rendering, especially when the objects interleave each other. Recent transparency rendering algorithms based on the Painter's algorithm were published, for example, by Zhang and Pajarola [2007] and Chen et al. [2012]. Zhang et al. perform single-pass deferred blending using point-based rendering, also achieving transparency effects. Chen et al. use viewpoint-dependent presorted meshes and render using the Painter's algorithm.

4.2.2 Depth Peeling

The Depth Peeling (DP) algorithm by Everitt [2001] renders the scene multiple times, and in each pass the front-most layer not yet extracted is "peeled" off the object. Therefore, it requires n render passes to fully capture an object with n depth layers. A faster version is Dual Depth Peeling (DDP) [Bavoil and Myers, 2008], which can peel the front and the back layer at the same time. Fragments that have been captured in earlier render passes are rejected. Regarding remote or distributed rendering, this algorithm would require the transfer of 2 × n full-sized textures containing the color and depth values for each layer. This is not feasible, as the size of the network transfer is crucial in this scenario. There are several optimizations of the basic concept, for example peeling multiple layers per pass [Bavoil and Myers, 2008; Liu et al., 2009], offering z-fighting awareness [Vasilakis and Fudos, 2011], or using only constant memory [Bavoil and Enderton, 2011], but they still consume too much traffic for distributed transparency rendering.

4.2.3 k-buffer

The k-buffer [Bavoil et al., 2007] concept is a generalization of the z-buffer. Instead of only storing the depth value for a given pixel location, k different values can be stored for each pixel. This allows for storing multiple fragments, thus enabling various effects like OIT, depth of field, and volume rendering. Recently, Zhang [2014] presented a memory-hazard-aware variant of the k-buffer avoiding read-write conflicts. The stencil routed A-buffer by Myers and Bavoil [2007] uses multisample textures to store multiple fragments for one pixel location in one texel.
Stencil testing on sample level allows the shader to write only one sub-element of the texel at a time.

However, the drawback of the k-buffer is that it allocates the full memory for k depth layers. The depth complexity of most scenes is distributed very inhomogeneously, wasting memory for elements which might never be used. In the context of distributed transparency rendering, this drastically increases the amount of transferred data although much of it might carry no information. Additionally, as the k-variable is fixed for a given render pass, it has to be adapted to scenes with higher (or lower) depth complexity.

4.2.4 Per-Pixel Linked Lists

An algorithm which overcomes the drawbacks of the transparency algorithms explained beforehand is the Per-Pixel Linked Lists (PPLLs) approach. This algorithm is the basis for the transparency rendering algorithms presented in this thesis. Yang et al. [2010] presented this rendering algorithm as an alternative to texture-based OIT rendering methods to overcome several limitations of the aforementioned OIT transparency approaches and implemented it using DirectX. Knowles et al. [2012] call this concept layered fragment buffer and discuss different memory layouts. The idea is based on the A-buffer, introduced by Carpenter [1984] as part of the REYES CPU renderer. It captures all fragments and their attributes of a scene instead of blending them onto the final image consecutively. Wittenbrink [2001] presented the R-buffer, a hardware architecture design implementing the A-buffer technique.

Technically, PPLLs became feasible when GPUs were able to read and write to arbitrary global memory positions and not only (limited) texture space. In 2010, DirectX 11.0 and the OpenGL 4 extensions NV_shader_buffer_store [Brown, 2012] and NV_shader_buffer_load [Brown et al., 2010] made those operations available to GPU shader programs.

PPLLs are created by appending each generated fragment to a list local to the corresponding output position so that each pixel has its own linked list of fragments. The resulting buffer is tightly packed with per-pixel list elements, but each list is scattered and not stored contiguously in memory. After rendering the fragments and storing them in the list, they might not be correctly ordered by depth. Thus, they have to be sorted prior to displaying, either forward or backward in terms of depth ordering. Subsequently, the fragments in the list can be blended by simply traversing the list and applying the appropriate blending operation [Porter and Duff, 1984].

Salvi et al. [2011] use heuristics to reduce the number of captured fragments by merging fragments which are very close to each other or do not contribute much to the scene, e.g. due to a high transparency value. However, this still eliminates the possibility of inserting further data between the merged fragments on the display node in the merging step. Therefore, this optimization does not apply to this work. Pixel Shader Ordering provided by Intel is an extension to DirectX for the Haswell integrated graphics system. It makes it possible to order memory accesses by the GPU based on the pixel location, which can also be used for OIT effects [Fife et al., 2013].

Figure 4.2 — The stages of the PPLLs algorithm pipeline (clear, render, sort, display).
The PPLLs approach does not require an explicit presorting of the geometry, is able to insert fragments between already rendered depth layers, and has a tightly packed memory structure. The algorithm provides artifact-free composition of interleaved objects when rendered with transparency effects, and the memory can be accessed en bloc for transfer. Thus, it is best suited for rendering object-space decomposed scenes with semi-transparent models in a distributed environment. As can be seen in the evaluation, the PPLLs approach is slower than the other rendering techniques for single-node rendering. However, the frame rates are still highly interactive, and having all fragments available allows distributed rendering with a reasonable amount of network traffic.

4.3 Per-Pixel Linked List Rendering Algorithm

The PPLLs algorithm collects and stores all fragments for later processing. This is achieved in a single render pass. The final image is created by sorting – either all fragments in the scene or locally for each per-pixel list – and then compositing each pixel's fragments according to the sorting order. After the data has been sorted, it is also possible to render the scene with depth-of-field effects (see Section 4.5.1) or to perform operations like Constructive Solid Geometry (CSG) on the fragments (see Section 4.5.2).

The whole PPLLs algorithm consists of four rendering stages: clear, render, sort, and display (see Figure 4.2). It utilizes a global atomic counter to avoid race conditions on parallel writing operations and two buffers. The header buffer has the size of the final image and stores the entry indices for each list, pointing to the second buffer. The data buffer stores the elements of the PPLLs or the per-pixel arrays in an unordered fashion.

Figure 4.3 — Two triangles and the data structure of the PPLLs header and buffer. Only full fragments are shown for simplification.

Figure 4.4 — The buffer setup for PPLLs (a, per-pixel linked lists with 12-byte buffer items) and the array variant (b, per-pixel arrays with 8-byte buffer items).

First, the clear shader resets all data structures, i.e. the global atomic counter and the header buffer. It is not necessary to reset the data buffer as it is newly built from the ground up in each frame and the atomic counter indicates its length. Then, the render shader is used to store the fragments as mentioned above. This results in unordered lists for each pixel, which need to be sorted according to fragment depth for blending. Sorting can be done either in a separate step or directly before blending. The approach described here uses a combination of both: for long arrays or lists, the data is sorted in global memory using an optional sort shader. Short arrays and lists are cached in a local array in the display shader and sorted directly prior to blending. Since everything is re-generated every frame, the cost of sorting the fragments can be assigned to any of the stages. The display shader also composes the depth-sorted data front to back and sends the result to the OpenGL framebuffer. Figure 4.5 shows the Rungholt city model rendered with the PPLLs method.
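To make the render stage more concrete, the following GLSL sketch shows how a fragment could be appended to its pixel's list. All names (heads, nodes, fragCount, shadedColor), the exact element layout, and the choice of an r32ui image for the header buffer are illustrative assumptions, not the implementation used in this thesis.

    #version 430 core
    // Render stage (fragment shader): append each fragment to its pixel's list.
    layout(binding = 0, r32ui) coherent uniform uimage2D heads;   // header buffer, 0 = empty list
    layout(binding = 0) uniform atomic_uint fragCount;            // global atomic counter

    struct Node {
        uint  next;    // index of the next list element, 0 terminates the list
        uint  color;   // RGBA packed into 4 x 8 bit
        float depth;
        uint  pad;     // keeps the element at 16 bytes
    };
    layout(std430, binding = 1) buffer NodeBuffer { Node nodes[]; };

    in vec4 shadedColor;   // assumed input from the lighting computation

    void store(vec4 color, float depth) {
        uint idx = atomicCounterIncrement(fragCount) + 1u;  // slot 0 is reserved as end marker
        if (idx >= uint(nodes.length()))
            return;                                         // data buffer full: drop the fragment
        uint prev = imageAtomicExchange(heads, ivec2(gl_FragCoord.xy), idx);
        nodes[idx].next  = prev;                            // prepend to this pixel's list
        nodes[idx].color = packUnorm4x8(color);
        nodes[idx].depth = depth;
    }

    void main() {
        store(shadedColor, gl_FragCoord.z);
    }

Note that this pass writes no color output; the framebuffer is only touched in the display stage.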
Data is collected and processed using OpenGL graphics rendering and compute shaders, and the GL_SHADER_STORAGE_BUFFERs introduced in OpenGL 4.3. For each pixel of the viewport, the header buffer contains the entry index to the data buffer which stores the fragment attributes (cf. Figure 4.3 and Figure 4.4a). An element of the data buffer contains the color and depth per fragment and may store additional attributes (e.g. fragment normal or texture coordinate). Per-pixel lists are created during rendering by appending the attributes of a fragment to the list in the fragment shader. A global atomic counter which contains the index of the next free element in the data buffer is incremented in a thread-safe manner (atomicCounterIncrement).

Figure 4.5 — Semi-transparent (a) and opaque (b) Rungholt city model rendered with PPLLs.

Existing fragment shaders and rendering methods can easily use PPLLs by replacing assignments like gl_FragColor = color; and gl_FragDepth = depth; with store(color, depth), which appends the data to the fragment list of the current pixel. This method can be used for rasterized geometry or implicit surfaces. For volume rendering, it is possible to insert every sample along the sampling ray into the PPLL buffers. However, this might lead to very heavy memory usage and thus should be avoided. Lindholm et al. [2013] suggest using a certain number of iso-surfaces extracted from the volume instead. The iso-surface renderings could then again be stored using the simple shader code replacements mentioned above.

When a scene additionally contains opaque fragments, such as the sticks in Figure 4.1a, they can be rendered into a separate framebuffer object first. Transparent fragments that lie behind opaque ones can be discarded during rendering, shortening the resulting lists.

As an alternative method for storing the fragments for each pixel, per-pixel arrays could be used, similar to a k-buffer but with a varying k for each pixel. Here, the number of fragments per array is counted in a first render pass, so the entry index for each pixel location can be calculated using a prefix sum (cf. Figure 4.4b). In a second render pass, each fragment is stored in the respective array within the data buffer. The arrays are created in the same way as the PPLLs by using an atomic per-pixel counter that contains the index of the next free element in the per-pixel array. Thus, the algorithm only differs in the storing method; the remaining parts work like in the PPLLs algorithm.

Figure 4.6 — Setup of multiple renderers creating PPLLs buffers for one display node. As all stages produce the same type of buffers, the merge step can easily be integrated into the existing pipeline (cf. Figure 4.2). The clear and render stages are pooled into the Generate stage for simplification.

Conceptual Extension for Multiple Renderers

PPLLs have the ability to store all generated fragments of a scene for a given view. As the sorting of the fragments takes place right before blending in the last stage, it is possible to combine multiple buffers, even from different nodes, without the risk of rendering artifacts. The so-called merge step collects all buffers and is placed right before the display step.
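A matching sketch of the display stage — which also closes the extended pipeline after the merge step — gathers the list of each pixel into a local array, sorts it by depth, and blends it front to back. Buffer names and the constant MAX_LOCAL follow the assumptions of the previous sketch.

    #version 430 core
    // Display stage (full-screen pass): gather, sort, and blend one pixel's list.
    layout(binding = 0, r32ui) readonly uniform uimage2D heads;
    struct Node { uint next; uint color; float depth; uint pad; };
    layout(std430, binding = 1) readonly buffer NodeBuffer { Node nodes[]; };

    out vec4 outColor;
    const int MAX_LOCAL = 64;   // assumed upper bound for the local sorting array

    void main() {
        Node local[MAX_LOCAL];
        int count = 0;
        // Walk the linked list of the current pixel into a local array.
        uint idx = imageLoad(heads, ivec2(gl_FragCoord.xy)).r;
        while (idx != 0u && count < MAX_LOCAL) {
            local[count++] = nodes[idx];
            idx = nodes[idx].next;
        }
        // Insertion sort by depth, nearest fragment first.
        for (int i = 1; i < count; ++i) {
            Node key = local[i];
            int j = i - 1;
            while (j >= 0 && local[j].depth > key.depth) {
                local[j + 1] = local[j];
                --j;
            }
            local[j + 1] = key;
        }
        // Front-to-back "over" compositing [Porter and Duff, 1984].
        vec4 dst = vec4(0.0);
        for (int i = 0; i < count; ++i) {
            vec4 src = unpackUnorm4x8(local[i].color);
            dst.rgb += (1.0 - dst.a) * src.a * src.rgb;
            dst.a   += (1.0 - dst.a) * src.a;
        }
        outColor = dst;
    }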
This method allows the PPLLs approach to be used for single node rendering as well as for parallel or distributed rendering. Figure 4.6 depicts the setup for three renderers working for one display node. As all stages use the same interface, the buffers, they can easily be reconfigured, allowing to merge multiple outputs for one (or multiple) display nodes. This also allows for more complex setups, e.g. a pyramid-like scheme with mixed hierarchical render-sort-merge-steps. For example, in a system with multiple render nodes, before sending the PPLLs buffers over the network, the merge-steps could be used to filter out fragments which are guaranteed to be invisible in the final image. This could be achieved by sharing the depth-buffers of the opaque renderings with all render nodes, effectively reducing the amount transferred and thus increasing the overall performance. The next section details and evaluates the implementation of the distributed rendering using PPLLs using the described approach. 4.4 Distributed Transparency Rendering Using the flexibility of the PPLLs, it is possible to render object-decomposed scenes which include semi-transparent models in a distributed environment without clipping. For distributed rendering, an arbitrary number of render nodes is connected to a number of display nodes via some sort of network. 50 Chapter 4 • Transparency Rendering ◭ Figure 4.7 — View on the MPI Informatics Building model, rendered on eight ren- der nodes. Each holds an equal-sized part of the model. The color represents the de- composition. Here, a 1 : n correspondence between display and render nodes is assumed for simplicity. For a m : n relation, the display nodes have to be synchronized. This can be achieved via a simple protocol where each display node broadcasts the maximum available frame number. Using the broadcast, each display node also knows the state of all the other nodes and the newest frame available to all nodes is displayed simultaneously. Render nodes do the actual rendering work of the scene. Their output is then transferred to the display nodes which are connected to the output device(s). They compose the received data into a final image. Note that the methods described here are also applicable when using multiple graphics cards in a single node. With the PPLLs algorithm, it is possible to render a scene that has been decom- posed in object space. In the context of this thesis this means that the geometry of the object is decomposed without clipping or bricking. More precisely, the data is divided into geometry subsets which may intersect and overlap each other. That is, multiple render nodes may contribute to the same output pixel on the final image. The challenges for distributed rendering, especially when all fragments of a scene are captured, are the vast amount of data sent and the compositing step on the display node as there will be more than one fragment per pixel location in transparent scenes, thus increasing the overall fragment count. An algorithm to overcome this will be presented later (see Section 4.4.3). 4.4 • Distributed Transparency Rendering 51 4.4.1 Merging Transparent Renderings from Multiple Render Nodes The merge operation can take place either in image-space, i.e. pixel-wise on the rendered images. Or it can take place in object space where the actual 3D objects and their dimensions or positions are taken into account. 
Merging the buffers from the render nodes in image-space means putting the data at the respective position of the output image. When a render node gen- erates a simple pre-composed image, no other node may render on the same pixels as no other fragment can be inserted any more behind the generated frag- ment or between the flattened fragments. Thus, for object space decomposition where the objects may overlap in image-space, the Painter’s algorithm is not able to generate artifact-free images without further effort. To use the algorithm, the geometry of the object has to be sorted before distributing it among the render nodes. In the following, only rendering of object space decomposed scenes without bricking or clipping is discussed. This implicitly requires a sort-last approach. Here, the objects can intersect and overlap each other as seen in Figure 4.7: this particular decomposition is based on the structure of the MPI Informatics Building Model [Havran et al., 2009] which is already provided in sections. For distributing the model, these sections have been combined into meshes of equal size in terms of memory usage. A complementary concept is image-space decomposed rendering. Here, every render node is assigned an area in the output image. This eliminates the need for a merging step, making the approach trivial. However, it does not allow for distribution of the models in the scene. Depending on the camera, one render node might have to render a large section of the scene and thus it would need the complete data set available. Especially for scenes that do not fit into local GPU memory like the MPI Informatics Building Model (see Figure 4.7) the image-space decomposition is therefore not feasible. Furthermore, for the multi-variant analysis of proteins (see Section 4.6.2), each variant is simulated and rendered on a different node. For image-space de- composition, each render node would need to know all of the data sets in its view-space, resulting in a significant communication overhead that can be avoided in the object space decomposition approach. Therefore, object space decomposition is the desired approach for the applications using distributed transparency rendering (see Sections 4.6.2 and 4.6.3). As described in Section 4.2, texture-based OIT implementations are not suitable for distributed rendering. Depending on the details of the algorithm, the 52 Chapter 4 • Transparency Rendering available graphics memory, and the capabilities of the graphics card, the result might be limited to a certain number of depth layers. This might have negative influence on the visual quality of the final image or produce incorrect results, especially if all fragments need to be available for complex operations like in the CSG algorithms. Whereas DP can handle scenes with an arbitrary number of depth layers for local rendering, the k-buffer uses a pre-defined constant and implementations like the bucket depth peeling [Liu et al., 2009] have a number of pre-defined buckets for each output pixel which are limited by hardware constraints. These constants directly impact the maximum number of depth layers that can be captured. Even if all fragments are captured, some of the output buckets or memory might not be used, when the scene does not have uniform depth complexity. This may waste valuable graphics memory and network bandwidth. 
An example for this is detailed in Table 4.1 and Figure 4.9: even for the simple dragon scene it is obvious that there is a high variation in the depth complexity for the whole image. When rendering in a distributed environment, all of those textures need to be transferred over the network to the display node. Once the display node has collected all textures, they have to be merged for blending. As for the render nodes, the limiting factor is the number of attachable textures per render pass. When all textures can be attached at once, the sorting can be realized as a single merge sort step at the pixel level because there is potentially no globally valid sort order for the textures. Having multiple textures, merging the received data is far more complex: The merging algorithm needs to determine a global order for each pixel to compose the image correctly. For an arbitrary scene, the problem is that the order of the textures might be different for each output pixel, making this step very complex. This is the case if two or more polygons that are rendered by different nodes intersect each other in object space. Figure 4.8 demonstrates this. Due to the limited number of texture slots on the GPU, this operation has to be performed by the CPU which is very costly. When the merging and sorting is complete, the collected data can be blended and displayed. 4.4.2 Using Per-Pixel Linked Lists for Distributed Transparency Rendering The PPLLs approach is an ideal candidate for using it with distributed render- ing. As explained above, with this approach it is possible to insert fragments between existing ones because all fragments are collected and kept until they are displayed. 4.4 • Distributed Transparency Rendering 53 ◮ Figure 4.8 — A mini- mal example demonstrating the sorting problem for two nodes. Above of the dotted line, node 1 is in front, be- low of the dotted line, node 2 provides the front layer. Eye Node 1 Node 2 Screen (a) UNC power plant 1 (b) UNC power plant 2 (c) Stanford Dragon N Figure 4.9 — PPLLs examples of the UNC power plant rendered with fully semi-transparent colors (a), with opaque stripes (b), and the Stanford Dragon (c). For PPLLs approaches, opaque and semi-transparent fragments can be treated differently. As recommended by Yang et al. [2010], opaque fragments are rendered first by discarding all semi-transparent fragments in the first geometry rendering pass. The output consists of two textures containing the color (texc) and depth values (texz) of the opaque fragments. Alternatively, the opaque fragments can simply be stored in the list, too. When rendering semi-transparent fragments, the depth texture generated in the first render pass can be used to discard all fragments behind the nearest opaque fragment. The discarded fragments are not visible in the final image either way. Discarding them immediately results in less fragments stored, thus leading to faster sorting and blending, and saves memory bandwidth during the transfer. After the PPLLs have been generated on the render nodes, they are sent to the display node together with the color and depth texture of the opaque fragments. 54 Chapter 4 • Transparency Rendering When all data is collected, the textures are composited by taking over the minimal depth value of all opaque fragments and the respective color to texz and texc of the display node. In the case of unordered lists, they are simply concatenated, forming a longer list which is sorted thereafter. 
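If the incoming lists are already sorted, the concatenate-and-resort step can be replaced by the merge described in the next paragraph. A minimal GLSL sketch of such a per-pixel merge helper, assuming both lists have already been gathered into local arrays and that texzDepth holds the nearest opaque depth taken from texz:

    // Merge two depth-sorted fragment arrays of one pixel (e.g. the local list and a
    // list received from a render node). Fragments behind the nearest opaque fragment
    // are dropped, since they can never be visible.
    const int MAX_LOCAL = 64;
    struct Node { uint next; uint color; float depth; uint pad; };

    int mergeSorted(in Node a[MAX_LOCAL], int na,
                    in Node b[MAX_LOCAL], int nb,
                    float texzDepth,
                    out Node merged[2 * MAX_LOCAL]) {
        int i = 0, j = 0, n = 0;
        while (i < na || j < nb) {
            Node f;
            if (j >= nb || (i < na && a[i].depth <= b[j].depth)) {
                f = a[i++];       // take the smaller head element
            } else {
                f = b[j++];
            }
            if (f.depth > texzDepth)
                break;            // both lists are ascending: the rest is hidden as well
            merged[n++] = f;
        }
        return n;                 // length of the merged, still sorted list
    }

The same helper can be applied repeatedly when buffers from several render nodes arrive, merging one received buffer after the other into the accumulated result.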
For already sorted lists, the buffers are combined in a merge step: given two lists in ascending depth order, the smaller head element is appended to the new list and the pointer of the respective list is advanced. This is repeated until the end of both lists is reached. Thus, the newly generated list is also sorted and the merge step can be executed again with the next buffer, if available. texz can again be used to discard fragments which are behind the nearest opaque fragment. Having the per-pixel lists sorted by depth, they can be blended and displayed on the screen. This method is best suited for object-space decomposition as the step can be executed either on the render node or on the display node. The complexity is that of a simple list operation for each output pixel.

When using PPLLs for distributed rendering, the following data has to be transferred for each frame from the render nodes to the display node:

• Color and depth texture, if the scene contains opaque fragments.
• Header buffer with the entry for each per-pixel list.
• Data buffer containing the actual list elements.

Each list element contains a next-pointer and its payload. In this case, the payload consists of two 32-bit values: the depth, and the color of the fragment or its normal. Transferring the normal to the display node makes it possible to recolor the fragment using per-pixel lighting. This requires the display node to have access to the scene lighting data, but this can easily be shared in a setup step before the actual rendering takes place.

In a shader, the color is usually represented as four 32-bit floating point values. To reduce the size, each color channel is stored as an 8-bit integer in this approach, as this is what most output devices can display. Here, the quantization can be justified by the data compression ratio of 75%. On the display node, the color is used for deferred shading and for blending the final image. When storing normals instead of colors, the normal vector is converted to spherical coordinates, thus only two values need to be stored. Each component can be stored as a 16-bit floating point value [Gross and Pfister, 2007]. Again, the precision reduction is justified by the compression ratio of 66%. When receiving a normal from the render nodes, the display node uses it to compute the per-pixel lighting in the blending stage.

Figure 4.10 — Instead of sending the complete rendering, only updates are transmitted, which are used in combination with the last frame to compute the new image on the display node.

4.4.3 Two-sided Fragment Transformation

In the approach described above, all fragments generated on the render nodes are sent entirely to the display node (see Figure 4.6). It is obvious, and confirmed by the measurements in Section 4.4.4, that such large frame updates have a negative influence on the overall performance of the system. When a frame is rendered, both the render node and the display node hold buffers containing the fragments of the current view. To reduce the amount of memory transferred, differential updates can be used. For a new view, the render node creates a new buffer which holds the complete scene for this render node. Additionally, the render node uses the buffer from the last frame and transforms the fragments into the new view.
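This re-projection of the stored fragments is formalized in Equations 4.1 and 4.2 below; a minimal GLSL sketch with placeholder uniform names for the matrices of the previous and the current frame (in a real implementation the inverse matrices would be precomputed on the host):

    // Transform a fragment position stored in last frame's camera space into the
    // camera space of the current view (cf. Equations 4.1 and 4.2).
    uniform mat4 prevView, prevModel;   // matrices of the previous frame
    uniform mat4 curView,  curModel;    // matrices of the current frame

    vec4 reprojectFragment(vec3 storedCameraPos) {
        // Equation 4.2: back into object space using the previous frame's matrices.
        vec4 objectPos = inverse(prevModel) * inverse(prevView) * vec4(storedCameraPos, 1.0);
        // Equation 4.1: into the camera space of the current view.
        return curView * curModel * objectPos;
    }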
Then, both buffers are compared and only fragments which are new or which no longer exist are sent to the display node. The display node then uses its buffer from the previous frame and the newly received differential buffer to update its main buffer and displays it. Therefore, the render node always knows the state of the display node buffers. Periodic updates will help to avoid rounding-error effects in the display node buffer. When there is no movement or camera transition, no update is requested from the render nodes. These idle moments can also be used to request a full update so the user experience is not disrupted. Figure 4.10 shows the workflow of the two-sided fragment transformation algorithm.

When using multiple models with different transformation matrices, e.g. in a scene with multiple animated models, the same scheme can be used. However, instead of merging the results into one buffer, each model has its own buffer. This is necessary because the transformation algorithm needs the correct matrix to transform the fragments of a given model back and forth. If fragments belonging to a different matrix were stored in the same buffer, wrong results would occur. When all render nodes have sent their updates and the buffer of each model has been updated, all of these buffers are merged into a final buffer, and the final buffer is displayed.

Implementation Details

The standard transformation of a vertex $v$ from object space to camera space works as follows:

$$v_{\text{camera}} = M_{\text{view}} \cdot M_{\text{model}} \cdot v \qquad (4.1)$$

In this approach, each PPLLs entry stores the complete xyz-coordinate of the fragment in camera space. This is necessary to minimize the rounding errors of the matrix multiplications and to keep the transformation chain as correct as possible in comparison to the newly rasterized fragments. As the fragments are stored in camera space, the depth ordering still applies here. Additionally, the entries contain the normal vector in object space, encoded in a memory-saving 2×16-bit angular notation. This allows comparing the fragments also by their normals and therefore enables an independent deferred shading on the display node, similar to the approach described beforehand (cf. Section 4.4.2).

When the existing fragments from the last frame are transformed, the inverse view matrix $M_{\text{view}}^{-1}$ and the inverse model matrix $M_{\text{model}}^{-1}$ of the last frame are used to bring the stored fragment position $v_{\text{camera}}$ back into object space:

$$v = M_{\text{model}}^{-1} \cdot M_{\text{view}}^{-1} \cdot v_{\text{camera}} \qquad (4.2)$$

Then, Equation 4.1 is applied to $v$ again, using the model and view transformations of the current frame, thereby putting the fragment into the camera space of the current view.

Case-by-Case Analysis

For comparing the actual rendering output buffer with the transformation buffer of the previous frame, the render node advances both lists for a given pixel. Fragments are considered equal if the depth value and normal vector are within a given epsilon value. There are four cases (cf. Figure 4.11a):

1. If both fragments are the same, the display node has the correct fragment and is able to transform it to the correct location. Thus, the fragment is not transferred.

2. If a fragment is only available in the newly rendered buffer, it is a new fragment which cannot be put there by a transformation on the display node. It has to be taken over into the update buffer to be sent to the display node.
3. If a fragment is only available in the transformed buffer from the last frame but not in the newly rendered buffer, it has to be discarded. A special flag is used so the display node can distinguish between new fragments and fragments marked for deletion.

4. If two fragments exist at the same position but with different content, the newly generated fragment wins and is sent to the display node.

Upon receiving, the display node traverses the buffer from the last frame and the newly received buffer. Again, there are four cases (cf. Figure 4.11b):

1. If a fragment is available in the transformed buffer but not in the update buffer, it is left out to save space (case 1 from above).

2. If a fragment is not in the previous buffer but only in the newly received one, the fragment is new and has to be taken over (case 2 from above).

3. If a fragment is in the last frame and in the new buffer and has the deletion flag set, it is to be removed (case 3 from above).

4. If two fragments exist at the same position but with different content, the newly generated fragment wins and replaces the corresponding fragment on the display node (case 4 from above).

Figure 4.11 — All four example cases of how fragments are compared and taken over into the update buffer on the render node (a), and how the fragments are used on the display node (b).

For this method, it is necessary to store the precise xyz-coordinate of the fragment to correctly transform it, making the fragment slightly larger in comparison to the standard approach. This is due to the fact that when the fragments are transformed, they do not fall exactly onto the pixels of the screen but may have a slight offset. For a second transform, this offset has to be taken care of to minimize the error, which could otherwise accumulate rapidly and falsify the final image.

4.4.4 Discussion

Table 4.1 gives an overview of different OIT algorithms and their performance for two different meshes in two different configurations. All benchmarks were run on Intel Xeon E5620 nodes with an NVIDIA Geforce GTX480 using a viewport of 1024×1024 pixels. The scenes corresponding to the table columns are shown in Figure 4.9. As said above, the significant disadvantage of the texture-based methods is the memory consumption, which is especially problematic when transferring the data over the network. Additionally, the rendering performance itself is partially very low for those methods. The calculations for the memory usage are based on the fact that, for compositing, the display node needs every layer and that it is not feasible to combine them on the render node.

Note that the stencil routed k-buffer provides a limited amount of depth layers, resulting in the relatively small but constant memory usage. The bucket DP approach is also faster than the PPLLs, but its limited depth layer capacity uses more memory than the PPLLs for the complete scene. Additionally, the PPLLs can provide reasonable performance while keeping the memory overhead at a minimum. Note that for opaque fragments, as seen in the UNC 2 case, the amount of data which needs to be transferred decreases further for the PPLLs. The reason is that opaque fragments are stored separately and non-opaque fragments behind the nearest opaque fragment are skipped during rendering.
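This skipping of hidden transparent fragments amounts to a single depth comparison in the render stage. A sketch, assuming texz is the opaque depth texture of the first pass and store() is the append helper from the earlier sketch:

    // Early discard of transparent fragments behind the nearest opaque fragment.
    uniform sampler2D texz;                  // opaque depth from the first render pass

    void store(vec4 color, float depth);     // append helper as sketched in Section 4.3

    void storeIfVisible(vec4 color, float depth) {
        float opaqueDepth = texelFetch(texz, ivec2(gl_FragCoord.xy), 0).r;
        if (depth > opaqueDepth)
            discard;                         // hidden behind the opaque scene, do not store
        store(color, depth);
    }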
The other approaches do not benefit from separating the opaque fragments as those approaches are not modified for the evaluation. Additionally, the texture-based approaches still need a full-sized texture for each transparency layer which is empty for the opaque or unpopulated sections, thus wasting valuable graphics memory and network bandwidth.

Table 4.1 — Frame rate and memory (in MB) needed for distributed rendering for different cases: UNC power plant with full transparency (see Figure 4.9a) and a large number of opaque fragments (see Figure 4.9b), as well as a fully transparent Stanford Dragon (see Figure 4.9c). 1) Liu et al. [2009], 2) Bavoil and Myers [2008], 3) Bavoil et al. [2007], 4) Yang et al. [2010]

Test case                UNC 1            UNC 2            Dragon
Polygon count            12.7M            12.7M            871k
Opaque fragments         0                1.9M             0
Transparent fragments    14.7M            4.7M             1.1M
Maximum depth layers     234              234              12

                         FPS    Memory    FPS    Memory    FPS     Memory
Raw/Unsorted             44.1   -         43.9   -         434.3   -
Bucket DP 1)             8.6    256.0     8.6    256.0     72.6    96.0
Adaptive Bucket DP 1)    4.8    256.0     4.9    256.0     32.9    96.0
DDP 2), all layers       0.3    1,872.0   0.3    1,872.0   30.3    96.0
DDP 2), 32 layers        2.3    256.0     2.3    256.0     30.3    96.0
k-Buffer 3)              13.9   32.0      13.9   32.0      60.7    32.0
PPLLs 4)                 4.3    180.0     8.2    94.9      56.5    17.3

For most of the algorithms, the amount of data that has to be transferred from the render nodes to the display node is enormous. Other algorithms can only provide a limited number of depth layers which reduces the visual quality. This might become an issue when the scene is very complex. Figure 4.12 shows the house model with different number of depth layers. While the alpha value is saturated at 12 layers for this example, the overall layer count is 32. Having all layers available might not be important for the final image, but it makes a difference for advanced effects, e.g. the CSG operations. The evaluation has shown that PPLLs can provide the full visual quality at a comparatively moderate memory usage.

Figure 4.12 — Renderings with different depth layer counts of (a) 3, (b) 8 and (c) all depth layers. For the given view, the alpha value is saturated at 12 layers, the full depth count is 32. (d) shows the difference of 3 and 8 layers (teal), and of 8 and all layers (orange).

Compressing the data before sending them over the network can further improve the performance. Table 4.2 shows the performance in FPS when using compression algorithms compared to no compression, averaged for a full rotation of the UNC power plant (see Figure 4.9a) and the Stanford dragon model (see Figure 4.9c). Using the large UNC model, the GPU-based GFC [O'Neil and Burtscher, 2011] compression algorithm does not improve the performance, whereas it speeds up the dragon by about 45%. LZO [Oberhumer, 2011] was chosen as an alternative CPU-based compression algorithm for its high speed and lossless compression [O'Neil and Burtscher, 2011]. In this case, the computation overhead of the compression and decompression weighs far more than what is gained through the faster data transfer.

Table 4.2 — Performance in FPS for two different compression algorithms, LZO [Oberhumer, 2011] and GFC [O'Neil and Burtscher, 2011], compared to the plain network rendering for the power plant building (four render nodes) and the dragon model (one render node).

Model     None   LZO   GFC
Dragon    2.8    2.4   4.1
UNC       0.4    0.3   0.4
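Independent of such general-purpose compression, the payload itself is already quantized before transfer as described in Section 4.4.2: 8-bit color channels and two 16-bit spherical angles for the normal. A sketch using GLSL's built-in packing functions; the helper names are illustrative, not the implementation used here:

    // Quantize the payload before transfer: 16 bytes of color -> 4 bytes,
    // 12 bytes of normal -> 4 bytes (two half-precision angles).
    uint packColor(vec4 color) {
        return packUnorm4x8(color);                    // 4 x 8-bit channels
    }

    uint packNormal(vec3 n) {
        n = normalize(n);
        float theta = acos(clamp(n.z, -1.0, 1.0));     // inclination
        float phi   = atan(n.y, n.x);                  // azimuth
        return packHalf2x16(vec2(theta, phi));
    }

    vec3 unpackNormal(uint enc) {
        vec2 a = unpackHalf2x16(enc);
        return vec3(sin(a.x) * cos(a.y), sin(a.x) * sin(a.y), cos(a.x));
    }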
To evaluate PPLLs in a distributed rendering environment, two applications are presented later. PPLLs are small in size, flexible, and easy to handle, while still providing reasonable performance. A visual analysis of multiple remotely rendered variants of a protein data set (see Section 4.6.2) and a simple mesh renderer (see Section 4.6.3) are used as proof of concept.

Two-sided Fragment Transformation Evaluation

Figure 4.13 shows the evaluation of the approach using the two-sided fragment transformation. For this, different viewing operations were used: rotation (cf. Figure 4.13a), translation (cf. Figure 4.13b), zoom (cf. Figure 4.13c and Figure 4.13d), and a combination of all of them together (cf. Figure 4.13e). Using this approach, the fragments to be transferred could be reduced to about 40% to 60% of the total fragment count.

Figure 4.13 — Evaluation for different object and camera movements using the two-sided fragment transformation (panels: (a) Y-axis rotation, (b) X-axis translation, (c) zoom in, (d) zoom out, (e) complex movement; vertical axis: number of fragments). Without refreshing (purple), the fragment count is at about 40%–60% of the total fragment count per frame (orange). The intermediate approach with updates every 10 frames (teal) converges towards the non-refreshing approach in between the full updates. Additionally, the percentage of bad pixels compared with the original image is shown (black).

Table 4.3 — Frame rate and fragment count in thousands for the evaluated scenarios. The values in brackets denote the fragment percentage compared to the full refresh for every frame ("per Frame" columns).

Refresh              per Frame         per 10 Frames          no Refresh
Operation            FPS    Frag.      FPS    Frag.           FPS    Frag.
Full Y Rotation      5.51   65.4k      6.48   23.3k (35%)     5.88   28.4k (50%)
X Axis Movement      4.90   76.6k      5.98   26.7k (36%)     4.63   38.3k (43%)
Zoom In              7.36   43.4k      7.04   24.6k (57%)     7.22   23.9k (55%)
Zoom Out             5.30   72.9k      5.46   43.9k (60%)     5.34   45.1k (62%)
Complex Movement     4.81   79.3k      4.97   40.6k (51%)     4.67   43.6k (55%)
  (large ε)          4.76   79.3k      4.15   70.7k (89%)     3.87   75.0k (95%)

Table 4.3 displays the average frames per second and the average fragment count for the evaluated scenarios. The best average performance is delivered when a complete refresh is sent every ten frames. For that one frame, there might be higher network traffic, but for the first subsequent frames, the traffic is significantly lower. This also saves computation time for the transformation process for the full update and the subsequent frames. When no refresh is performed, the erroneous fragments have to be detected by the transform shader and removed in a delta update. Another effect can be seen when the epsilon used for the identification is chosen too loosely (last row of Table 4.3).
Here, the update mechanism sends far too many fragments. For the first difference after a full update, many fragments wrongly survive. Those wrong fragments have to be removed in the subsequent frame, causing more network traffic, and in addition, the correct fragments have to be transferred, too.

As seen in Figure 4.13, the image quality is relatively stable for standard view transformations. However, the error accumulates when performing multiple view changes. In the example shown, only the difference frames were used and the image is compared to the full update for each frame. Figure 4.14a shows multiple steps of the complex movement (cf. Figure 4.13e) as an overlay. Some faulty pixels can be seen (cf. Figure 4.14b) which result from rounding errors during the transformation and are responsible for the imperfect image quality. Additionally, when a transformed fragment is used instead of the newly rasterized one, the interpolated data, in this case the normal vector, might not fit exactly. However, this might not be visible to the viewer, especially if the perfect picture is not shown, but it still causes a bad pixel. Using full updates can remove the errors, effectively starting from a perfect frame again.

Figure 4.14 — Collage of multiple steps of the Armadillo model (a) performing the complex movement (dark to light color, cf. Figure 4.13e). Close view of the bad pixels (red) of a frame during the animation; the original image is converted to gray scale for better contrast (b).

4.5 Advanced Rendering Techniques using Per-Pixel Linked Lists

Using PPLLs and having all fragments available enables the realization of advanced rendering effects like depth of field or CSG. While the former is a nice effect and can be used for attention guiding, the latter is useful for combining models and creating complex objects without artifacts. Both techniques are explained in the following.

4.5.1 Splatting for Depth of Field Effects

Instead of rendering a full-screen quad, the PPLLs data can be displayed directly using splatting [Westover, 1990]. For this purpose, each element stores the full 3D position and color, allowing a fast sorting of all fragments. Because of the flexible extensibility of the PPLLs data structure, those attributes can simply be added to the elements of the data buffer. After rendering the scene into the PPLLs buffers, the data is mapped to CUDA and sorted using the Thrust library. This destroys the structure of the list buffer, effectively converting the data into an array of vertices ordered by depth. The complete buffer is mapped as a vertex buffer and rendered as points. This can be extended, for example to create a depth-of-field effect (see Figure 4.15), using a geometry shader that creates quads whose size depends on the distance to the focal point.

Figure 4.15 — Depth-of-field example with a Stanford dragon and two Buddha figurines. The focus plane is on the left Buddha and on the tail of the dragon.

4.5.2 Constructive Solid Geometry

Using the PPLLs approach, all fragments in the scene are available for compositing. This makes the implementation of CSG operations possible.

Figure 4.16 — CSG operation example of a cross made of three joined (∪) cylinders which is subtracted (\) from the intersection (∩) of a cube box and a sphere.
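As detailed in the following paragraphs, the CSG operations reduce to counting how many objects the ray through a pixel is currently inside while traversing its sorted fragment list; which counter transitions emit a fragment distinguishes union, intersection, and subtraction. A GLSL sketch for the union case, assuming the per-pixel array is already sorted by absolute depth and that the sign of the stored depth encodes the fragment orientation (all names are illustrative):

    // CSG union over one pixel's depth-sorted fragments. A positive depth marks a
    // fragment facing the viewer, a negative depth one facing away.
    const int MAX_LOCAL = 64;
    struct Node { uint next; uint color; float depth; uint pad; };

    vec4 composeUnion(in Node sorted[MAX_LOCAL], int count) {
        vec4 dst = vec4(0.0);
        int inside = 0;                                    // objects the ray is currently in
        for (int i = 0; i < count; ++i) {
            int before = inside;
            inside += (sorted[i].depth > 0.0) ? 1 : -1;    // enter or leave an object
            // Union: only the outermost boundary (0 -> 1 or 1 -> 0) is composited.
            if ((before == 0 && inside == 1) || (before == 1 && inside == 0)) {
                vec4 src = unpackUnorm4x8(sorted[i].color);
                dst.rgb += (1.0 - dst.a) * src.a * src.rgb;
                dst.a   += (1.0 - dst.a) * src.a;
            }
        }
        return dst;
    }

For the intersection, only the transitions 1 -> 2 and 2 -> 1 would emit a fragment; the conditions for the subtraction follow the description below.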
Figure 4.17 — Schematics of the different CSG operations ((a) no operation, (b) union, (c) intersection, (d) A without B, (e) B without A) and the counter values for the traversing ray. The shaded figure is the result of the CSG operation.

For example, the union operation (cf. Figure 4.17b) can be implemented as follows: when composing the sorted fragments, the layers are counted and a fragment is rendered only if the counter is incremented from 0 to 1 (entering the object) or decremented from 1 to 0 (leaving the object). This represents the outer border and no internal structure or overlapping geometry is composed into the final image. The counter increases for fragments facing the viewer and decreases for fragments facing away. The orientation of a fragment is determined in the render stage and stored as the algebraic sign of the depth value. Therefore, all depth-dependent checks are performed on the absolute depth value for consistency. The molecular surface rendering (cf. Section 4.6.1) uses CSG to correctly discard fragments which are inside the surface but should not be displayed.

The two other basic CSG operations – intersection (cf. Figure 4.17c) and subtraction (cf. Figures 4.17d-e) – can be implemented analogously (see Figure 4.16). For the intersection, the fragments are rendered only when the counter is incremented from 1 to 2 or decremented from 2 to 1. In case of the subtract operation, the fragments are rendered when the ray enters or leaves the first object, and when it enters the second object. For the latter case, the normal vector has to be inverted for rendering as the inside of the second object is the outside of the resulting object. More complex CSG operations, e.g. XOR, are logical combinations of these three basic operations. For rendering more complex objects with more than two parts, the CSG operations are segmented into binary operations. Only the final result is displayed after the CSG operations are completed.

4.6 Applications

The following sections will detail three possible applications for the PPLLs algorithm. First, rendering molecular surfaces is explained (cf. Figure 4.18a and Section 4.6.1). Here, the OIT technique is used for an artifact-free rendering of semi-transparent molecular surfaces consisting of ray-cast implicit spheres and torus objects. The transparency of the surface enables e.g. a combined visualization of the surface and other representations like the ball-and-stick visualization, thereby delivering more information and context for the users and domain experts.

Figure 4.18 — PPLLs examples: molecular surface rendering (a, 1A3I protein), the multi-variant visualization of different states of a protein derived from a simulation (b, 2JQW protein), and a distributed rendering of the UNC power plant (c). Here, the colors indicate the data distribution with one color per render node.

Section 4.6.2 extends the molecular surface rendering and uses the distributed PPLLs approach detailed above (cf. Section 4.4) to deliver a multi-variant analysis for molecular surfaces. Here, each render node creates the same view of a slightly different variant of the same object and the display node merges the results, augmented with information about equal and non-equal parts of the variants. The third application (cf. Figure 4.18c and Section 4.6.3) is a distributed mesh renderer, e.g.
for models which are too large for a single graphics card. Another purpose is to use it for in-situ simulations with object decompositions which intersect each other. The display node receives the buffers from the render node and displays the final image. The use-case is that a simulation node computes the output and also directly creates the PPLLs buffers for visualization. For the presented application, a static mesh is used that is loaded at startup. 4.6.1 Rendering Molecular Surfaces using Order-Independent Transparency Many fields of research like chemistry, physics, and biomedicine work with molecular data. This data can originate from purely computational sources (e.g. 4.6 • Applications 67 ◮ Figure 4.19 — (a) Trans- parent Quicksurf surface with internal cartoon repre- sentation and DNA (yellow parts) and (b) SAS repre- sentation with internal ball- and-stick visualization. (a) 1I6H Protein (b) 1AX8 Protein molecular dynamics simulations, normal mode analysis, or fitting calculations) or from measurements (e.g. x-ray crystallography or scanning tunneling mi- croscopy). High-quality visualizations are an important tool for the analysis of these data sets. Many different molecular models have been developed to that end. Among these models, molecular surface representations are widely used. They are beneficial for many applications since they show the boundary of a molecule with respect to a certain property, in this case the accessibility for other molecules. Naturally, transparent surfaces do not fully obscure other parts of the scene. That is, users can see through objects and spot features or changes on the backside immediately without changing the perspective. Furthermore, users are able to see inside molecules. This makes transparent molecular surfaces especially suited for complex analysis tasks where the user wants to see the molecular surface but also another, sparse representation of the molecule like a stick model (cf. Figure 4.19). Multiple nested representations of the molecule can be used to provide more information to the user. Molecular surfaces are defined as a set of implicit surface patches (see Sec- tion 4.6.1 below). Thus, they can be rendered using GPU-based ray cast- ing [Gumhold, 2003; Reina and Ertl, 2005]. This results in higher visual quality and faster rendering while additionally saving the computation time and mem- ory for the triangulation. However, for transparent rendering, it poses an even greater problem than triangulated surfaces. The implicit surface patches might not only overlap and generate multiple fragments which have to be sorted according to their depth, but there are also parts of these surface patches which have to be removed to obtain a correct final image. The following sections detail how this method can be used to render transparent molecular surfaces in real-time and compare the PPLLs approach to a freely available existing DP OIT approach from the NVIDIA SDK. 68 Chapter 4 • Transparency Rendering VdW Surface (Atoms) Solvent Accessible Surface Solvent Excluded Surface Probe ◭ Figure 4.20 — Schematics of the SAS (purple) and the SES (orange), both defined by a probe (depicted green in a sample position) rolling over the molecule atoms (teal). Molecular Surface Definitions There are several definitions for molecular surface models, which are based on physical properties of the atoms. The Van-der-Waals Surface (VdW) shows the extent of a molecule. 
All atoms are represented by spheres with the corresponding VdW radius of their element, which is derived from the distance between non-bonded atoms. Figure 4.20 shows a 2D schematic of the VdW surface (colored teal). A fundamental drawback of the VdW surface is that it has no correlation with the surrounding medium (e.g. solvent molecules or ligands).

The Solvent Accessible Surface (SAS) [Richards, 1977] includes this information. It is defined by the center of a spherical probe (colored green) which rolls over the VdW surface of the molecule. Thus, the SAS represents the surface that is directly accessible for solvent molecules of the size and form of the probe. Figure 4.20 shows a 2D schematic of the SAS (colored purple). The SAS closes small gaps and holes in the VdW surface, which cannot be entered by solvent molecules.

The Solvent Excluded Surface (SES) (colored orange) is the topological boundary of the union of all possible probe spheres that do not intersect any atom of the molecule [Richards, 1977; Greer and Bush, 1978]. The SES can also be defined by a spherical probe rolling over the VdW surface. Here, the surface of the probe traces out the SES (in contrast to the SAS, where the probe center is used). Figure 4.21 shows a schematic of the SES. The SES consists of three types of geometrical primitives: convex spherical patches, concave spherical triangles, and saddle-shaped toroidal patches. The spherical patches are the remainders of the VdW spheres. They occur when the probe is rolling over the surface of an atom and touches no other atom. The toroidal patches are formed when the probe is in contact with two atoms and rotates around the axis connecting the atom centers (see Figure 4.21). The inward-looking, saddle-shaped patch of the resulting torus is part of the SES. While rolling, the probe traces out a small circle arc on each of the two VdW spheres. These small circle arcs are part of the boundaries of the aforementioned spherical patches. Spherical triangles occur if the probe is simultaneously in contact with three VdW spheres. Here, the probe is in a fixed position, meaning that it cannot roll without losing contact to at least one of the atoms. Please note that the probe can in theory be in contact with more than three atoms; however, these cases can be divided into multiple atom triplets for simplicity.

Figure 4.21 — 3D schematic of the SES depicting the spheres (blue), the spherical triangle (yellow), and the toroidal patches (green) defined by the probe (orange).

Similar to the SAS, the SES closes small gaps and holes, but it does not inflate the molecule. The extent of the VdW surface is retained. Therefore, the SES is the most versatile molecular surface and suitable for many applications and analysis tasks like docking or solvation. Metaballs [Blinn, 1982] is another molecular surface definition modeling the electron density surface. For some images in this chapter, Metaballs are depicted using the QuickSurf [Krone et al., 2012] method and the representation was exported from the freely available Visual Molecular Dynamics tool [Humphrey et al., 1996].

Molecular Surface Visualization

Molecular models can be rendered using ray tracing by sending primary rays from the view point into the scene. For every hit with an object, secondary rays are generated to capture reflection and refraction. Ray tracing generates very accurate images, but is computationally demanding.
With the performance of current CPUs and GPUs, real-time ray tracing of complex scenes has become possible [Wald, 2004]. This, however, requires elaborate acceleration structures. BALLView [Moll et al., 2006], a molecule viewer, offers such real-time ray tracing [Marsalek et al., 2010]. Although ray tracing – including secondary rays – achieves interactive frame rates, point-based GPU ray casting [Reina and Ertl, 2005] has a considerably higher performance. The performance of ray tracing further decreases when using transparency. In the presented approach, effects which might require secondary rays are not used, thus GPU ray casting is sufficient. The computation and rendering of the VdW surface and SAS is quite trivial, since they only consist of spheres. If the surface is drawn opaque, the interior parts of overlapping spheres do not have to be removed since they are not visible anyway. However, for transparent renderings, the interior parts have to be removed. If the spheres are composed of triangles, the intersections for each triangle have to be computed. Laug and Borouchaki [2002] presented methods for high-quality meshing of molecular surfaces. Today, implicit surfaces like spheres are commonly rendered using high-quality GPU-based ray casting. Clipping the interior parts would require neighborhood information. Since this would result in a plethora of additional intersection tests in some cases, available tools often do not clip the interior parts, which results in a cluttered representation when using transparency. The computation of the SES is more complex and costly. Connolly [1983] presented the analytical equations to compute the SES. Based on his work, several methods to accelerate the computation of the SES have been introduced. Edelsbrunner and Mücke [1994] presented the α-shape. Sanner et al. [1996] developed the Reduced Surface (RS) algorithm. Krone et al. [2009] later used this RS algorithm for dynamic simulation data and they were the first to render the SES using fast GPU-based ray casting of the individual patches instead of triangulations. Varshney et al. [1994] described a parallelizable algorithm based on Voronoi diagrams. Totrov and Abagyan [1995] proposed a contour-buildup algorithm for the SAS, from which the SES can be derived. This algorithm was recently shown to be parallelizable on multi-core CPUs [Lindow et al., 2010] and GPUs [Krone et al., 2011]. Parulek and Viola [2012] presented an implicit representation for the SES using a ray casting approach. With the exception 4.6 • Applications 71 of Parulek and Viola [2012], all aforementioned methods compute the patches of the SES. The technique best suited for the application may depend on different factors, e.g. hardware capabilities. Transparent Molecular Surface Rendering The VdW and the SAS representations consist only of spheres. As these spheres intersect each other, all parts of a sphere which are inside another sphere have to be removed since they would be visible when using transparency. The PPLLs method allows for removing these interior parts using the CSG union operation explained in Section 4.5.2. That is, all sphere fragments are stored in the data buffer during rendering. During the compositing stage, interior fragments are discarded to clear the internals. As explained in Section 4.6.1, the SES consists of three types of patches: concave spherical triangles, torus segments, and convex spherical patches. 
The spherical triangles can be ray casted straightforwardly, since they have no interior parts which have to be clipped. Krone et al. [2009] showed that the outer part of the torus, shown as wireframe in Figure 4.21, can be removed using a sphere- intersection test. The interior parts of the torus (shown as dotted lines in Figure 4.22) can be removed by intersection tests with two planes. Simply speaking, a pie slice is cut out of the torus ring. The spherical patches are the parts of the VdW spheres that are accessible by the rolling probe. The parts that are interior to the small circles traced out by the rolling probe while forming a torus have to be removed. These parts can be cut away using a clip plane through the small circle. However, there might be an inner remains which also has to be removed (shown as red patch in Figure 4.22). Using the RS for the SES computation, these problematic remains lie inside the RS. The RS is a closed triangle surface. The RS triangle edges connect atom centers that share a torus patch. The RS triangle faces span atoms which share a spherical triangle. Please refer to [Sanner et al., 1996; Krone et al., 2009] for more details. PPLLs with CSG operations are used to remove the spherical remains. Please note that this application could also use α-shapes [Edelsbrunner and Mücke, 1994] or Power Cells [Varshney et al., 1994] to compute the SES and remove the inner remains of the spheres since they are essentially dual to the RS. For the VdW and SAS, CSG is performed on visible objects only. However, it is also possible to render invisible objects (alpha equals zero) which will not contribute to the final image, but can be considered for the CSG operations. Hence, the triangles of the RS are rendered with alpha set to zero, as under certain conditions for concave structures, the RS might peek out of the atoms. Consequently, the fragments of the atom spheres inside the RS will be removed 72 Chapter 4 • Transparency Rendering VdW Surface (Atoms) Solvent Excluded Surface Probe Inner remains Reduced Surface Toroidal patches and hull segments ◭Figure 4.22 — Schematic representation of the SES and Reduced Surface of a molecule depicting the in- ner remains of a VdW sphere which have to be removed. in the compositing stage without changing the algorithm, resulting in an artifact- free rendering of the protein surface. However, not all parts of the sphere which are located within the inner torus ring are also inside the RS. These fragments have to be removed as well. For that purpose, the torus is closed. That is, two planes are drawn to delimit the torus patch and the interior parts of the torus ring (green dotted lines in Figure 4.22) as invisible objects as well. Therefore, the CSG operation additionally will remove all fragments inside the torus object. The final rendering will only show the outer parts of the VdW spheres, the spherical triangles, and the outer torus patches. Thus, the transparent SES is displayed correctly with the PPLLs rendering algorithm. Alternate SES Rendering In addition to the transparent SES rendering explained above, there is another way as follows. As the problem is to determine whether a fragment of a VdW sphere may be drawn or not, one has to determine if this fragment is inside the surface or not. This can be achieved by rendering the SAS and projecting these fragments back to the original VdW spheres. A fragment from an atom is only visible if the corresponding SAS sphere fragment is visible. 
Although this alternate approach can be implemented straightforwardly, the problem here is the massive overhead of SAS rendering due to the much bigger spheres, which produce many more fragments than the SES approach. As the evaluation shows, rendering the SES directly is faster than rendering the SAS and, thus, the alternative SES generation was not investigated further.

Evaluation

As mentioned before, Parulek and Viola [2012] defined an implicit function for the SES. They render the SES using sphere tracing of this function.

Figure 4.23 — PPLLs renderings used for evaluation (cf. Table 4.4): (a) Stanford Dragon, (b) 1GKI protein, (c) UNC power plant.

Their performance evaluation for opaque surfaces shows similar frame rates on an NVIDIA Geforce GTX480 as this approach achieves on a GTX680 for transparent renderings. Since the sphere tracing is only executed until the first hit is found, more steps would be necessary for transparency rendering. This would drastically lower the performance of their method for large numbers of surface layers. Additionally, their visualization approach might show visible artifacts if the number of neighboring atoms is set too low [Parulek and Viola, 2012]. While this is only a minor distraction for opaque surfaces, the artifacts might become prominent in transparent renderings.

The performance of the PPLLs method is compared to DDP and, for reference, to pure (unsorted) OpenGL blending. DDP was chosen since it guarantees full visual quality, whereas other accelerated DP methods, e.g. bucket depth peeling, may introduce visual artifacts. Here, the freely available implementation of DDP from the NVIDIA SDK is used. For benchmarking, an Intel i7 920 CPU with an NVIDIA Geforce GTX580 (Fermi) and an NVIDIA Geforce GTX680 (Kepler), driver version 310.90, was used. The tests utilize freely available meshes and protein data sets from the Protein Data Bank [Berman et al., 2000].

Table 4.4 shows the performance in frames per second of the PPLLs and per-pixel arrays compared to the DDP method for polygon meshes: a single Stanford Dragon (cf. Figure 4.23a), a protein surface mesh generated by VMD [Humphrey et al., 1996] (cf. Figure 4.23b), and the UNC power plant (cf. Figure 4.23c). As DDP can only peel two depth layers per rendering pass, its frame rate is very low compared to this implementation and to the OpenGL reference due to the number of render passes.

Table 4.4 — Comparison of the PPLLs algorithm with other approaches for different polygon models using an NVIDIA GTX680 on a 1024×1024 viewport. The numbers for the rendering methods are frames per second.

                       Stanford Dragon   1GKI Protein   UNC power plant
    Vertices           435 k             1,046 k        10,975 k
    Faces              871 k             348 k          12,748 k
    Fragments          1,157 k           2,227 k        14,765 k
    Depth Layers       10                16             186
    Screen Coverage    48.4%             70.2%          76.1%
    OpenGL             599.5             693.0          110.4
    DDP                89.8              62.9           0.8
    PPLLs              135.2             97.0           3.9
    Per-Pixel Arrays   102.4             82.8           4.2

Figure 4.24 — Molecule surfaces with internal ball-and-stick representation as used for the PPLLs rendering evaluation detailed in Table 4.5: (a) SES of 1YV8, (b) VdW of 4ADJ.

The difference between the list and the array approach is due to the different data structures and the additional operations necessary. Per-pixel arrays require two render passes and the calculation of a prefix sum. The benefit is a more structured memory layout, as each pixel's fragments are stored consecutively. In the tests, this proved to be beneficial for large models like the UNC power plant.
By contrast, the list approach has a scattered memory layout. Table 4.5 details the performance for the ray casted molecules (see Figure 4.24). The surface representation directly influences the fragment count, especially due to inner fragments and hull fragments needed for CSG. The SES generates about twice as many fragments as the VdW representation. The fragment count of the SAS is three to four times the count of VdW. This is also the approximate loss in the frame rate for the surface representations. A higher number of fragments slows down the performance when sorting and blending. The same effect as seen in Table 4.4 occurs here, too. Per-pixel arrays outperform PPLLs only for very high numbers of fragments. For small and medium fragment counts, the PPLLs are preferable performance-wise. Another interesting fact is the performance difference between the GTX680 and 4.6 • Applications 75 Surface VdW SES SAS Molecule 1YV8 4ADJ 1YV8 4ADJ 1YV8 4ADJ Atoms 641 9720 641 9720 641 9720 Fragments 2.30M 9.34M 4.64M 18.00M 8.87M 32.16M Screen Coverage 18.0% 42.6% 18.0% 42.8% 23.8% 47.1% Depth Layers 44 92 117 188 128 234 GTX580 Array 69.5 14.7 21.3 3.8 14.0 2.5 Lists 85.4 16.6 31.0 6.2 14.7 1.8 GTX680 Array 58.2 12.9 16.0 2.1 8.0 1.4 Lists 76.6 13.9 22.1 4.2 10.3 1.3 NTable 4.5 — Performance evaluation for different molecules and surface models rendered with the PPLLs framework. The numbers for the rendering methods are frames per second. GTX580 which might be due to changed caching or memory access strategies within the hardware design. On average for the selected benchmarks, the GTX680 based on the newer Kepler architecture is about 40% slower than the older Fermi architecture of the GTX580. This is not a PPLLs-related effect but, according to the experience gained throughout the tests, holds for other applications as well. 4.6.2 Visual Multi-Variant Analysis This section details a multi-variant analysis prototype tool for molecular surface visualizations and multi-variant data analysis to investigate structural changes in molecular data. Its core component is the distributed rendering using PPLLs rendering. Each render node has the same data set but renders a different user- defined time step, effectively returning different states to the display node. The application is a visual analysis application [Kehrer and Hauser, 2013], delivering a tight integration of the visual mapping, visual interaction concepts, and computational analysis. It offers the possibility for the user to visually compare different variants of a molecular surface visualization and gives rudimentary tools to the user to explore the comparative visualization. In many fields, a visual analysis of different variants of data is of interest for researchers. This can, for example, provide a succinct representation of interesting parts of simulated data that change over time. Therefore, multi- variant data analysis is proposed to investigate structural changes in molecular data. The data originates from molecular dynamics simulations and from the publicly accessible Protein Data Bank [Berman et al., 2000]. Either the data sets 76 Chapter 4 • Transparency Rendering (a) Variant 1 (b) Variant 2 (c) Variant 3 (d) Analysis View N Figure 4.25 — (a) – (c) Three different variants taken from the Lipase-2 simu- lation (see Table 4.6) of the 2VEO molecule rendered in the VdW representation. (d) The analysis view. 
The main body all three variants have in common is colored yellow, the moving arm is colored blue, red, and green, according to the variant. The presented method also highlights small changes clearly like the side chain (the little ‘tail’) in the lower left, which is very prominent in the green variant only. can contain different conformations of a molecule or the structural changes might arise due to the simulated boundary conditions. The benefit of using a PPLLs approach is that every generated fragment is available for displaying on the display node. This means the transparency of the fragments can be modified after the scene has been generated on the render nodes and still deliver a correct visual impression on the display node. For visualizing molecule data sets, there exist a plethora of methods (cf. Sec- tion 4.6.1). This application uses the VdW representation but the described meth- ods can be implemented with any other surface representation as well. To render the proteins, GPU raycasting is used. In detail, the rendering using PPLLs works as follows: In the fragment shader, each sphere is raycasted and generates two fragments, one per intersection with the viewing ray. For both fragments, the depth as it is stored in the z-buffer, and the normal converted to θ and φ values from the spherical coordinate representation, is stored. When the fragments are generated, they are ordered by depth for the CSG (cf. Section 4.5.2) operation. During the CSG step, the union operation is performed to discard fragments that lie inside the molecule surface. After the operation, only the surface fragments are left and are used for blending when displaying the molecule. This ensures minimal network usage and the display node only receives the fragments it actually needs for the visualization. Additionally, this implicitly sorts the resulting list on each node, reducing the work load for the merge step on the display node. Table 4.6 details the data 4.6 • Applications 77 Atom Number of Fragments CSG Count Rendered Transferred Ratio 2JQW 260 1,222.0k 344.7k 28.2% 1YV8 641 2,055.4k 553.0k 26.9% Lipase-1 4,622 5,479.3k 1,546.9k 28.2% Hydrogel 4,822 1,465.3k 431.6k 29.5% Lipase-2 6,626 7,147.7k 2,053.3k 28.7% Pore 17,432 6,073.0k 1,751.5k 28.8% NTable 4.6 — Number of fragments rendered, transferred, and the filter quote when using CSG on the render node to filter invisible/unused fragments. ◮ Figure 4.26 — VdW rep- resentation of a hydrogel rendered in three different variants with PPLLs on three nodes, displayed using stan- dard blending (a) and the multi-variant analysis appli- cation (b). (a) Standard Blending (b) Proposed Visualization sets used for the benchmarks, and the number of rendered and transferred fragments. The filtering of the CSG operation reduces the data to be transferred down to about 28%. As an example, Figure 4.25 depicts the view of the Lipase-2 simulation on three render nodes (4.25a to 4.25c), each showing a different variant. The merged final view on the display node can be seen in Figure 4.25d. On the display node, each variant is assigned a separate color and a low alpha value is used where the variants differ from each other. Segments which are similar beyond a user-definable object space threshold in the variants are colorized in a special color and are displayed opaquely to reduce the visual clutter. For molecular dynamics, these parameters can be adjusted in Ångstrom to be meaningful to the domain experts. 
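The per-sphere fragment generation described above can be sketched as a GLSL fragment stage. The uniforms, the projection helper, and the exact packing are simplifying assumptions for illustration and do not claim to reproduce the actual shader code.

    // Sketch of the render-node fragment stage: each raycasted sphere appends its
    // entry and exit intersection to the per-pixel linked list. No color is written
    // here, as the fragments are only collected for the later CSG and compositing.
    layout(binding = 0) uniform atomic_uint fragmentCounter;
    layout(r32ui, binding = 0) uniform coherent uimage2D headPointers;
    layout(std430, binding = 1) buffer ListBuffer { uvec4 nodes[]; }; // depth, normal, flags, next

    uniform mat4  projMatrix;
    uniform vec3  sphereCenter;   // simplified: per-sphere data would come from the vertex stage
    uniform float radius;
    in vec3 viewPos;              // fragment position on the proxy geometry, view space

    uint packNormal(vec3 n)       // theta/phi instead of three components
    {
        return packHalf2x16(vec2(acos(clamp(n.z, -1.0, 1.0)), atan(n.y, n.x)));
    }

    float depthOf(vec3 posView)   // NDC depth; window-space remapping omitted
    {
        vec4 clip = projMatrix * vec4(posView, 1.0);
        return clip.z / clip.w;
    }

    void insertFragment(float depth, vec3 normal, bool entry)
    {
        uint idx  = atomicCounterIncrement(fragmentCounter);
        uint prev = imageAtomicExchange(headPointers, ivec2(gl_FragCoord.xy), idx);
        nodes[idx] = uvec4(floatBitsToUint(depth), packNormal(normal), entry ? 1u : 0u, prev);
    }

    void main()
    {
        vec3 dir   = normalize(viewPos);                    // camera sits at the view-space origin
        float b    = dot(dir, sphereCenter);
        float disc = b * b - dot(sphereCenter, sphereCenter) + radius * radius;
        if (disc < 0.0) discard;                            // viewing ray misses the sphere
        float tFront = b - sqrt(disc), tBack = b + sqrt(disc);
        insertFragment(depthOf(tFront * dir), normalize(tFront * dir - sphereCenter), true);
        insertFragment(depthOf(tBack  * dir), normalize(tBack  * dir - sphereCenter), false);
    }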
Implementation Details

For each data buffer sent by the render nodes, the merging compares the depth and normal for each pixel to the data in the display node buffer. If normal and depth match within a certain user-defined threshold, a per-fragment counter is incremented to determine the number of variants which comply with each other. As matching fragments are similar with respect to the thresholds, only one is used in the result buffer and the others are discarded. This reduces the work load for later merging operations and blending. When one merge step is completed, the input from the render node is taken over into the display node buffer and the buffer of the next render node is merged.

As the different versions of the data might have a small offset in image space, a user-definable neighbor search is included. Here, a fragment is not only compared to the fragments at the exact same position, but a small area in image space is also searched for a fitting fragment to compensate for position jittering of the different variants. Note that this is only executed for the first layer, that is, only the front-most fragments of the respective per-pixel lists are compared, as the shared surface is rendered opaquely. Thus, it does not impact the overall performance much. Comparing more layers and rendering the shared surface transparently would cost considerable performance and result in much more visual clutter. Figure 4.27 shows how the user-defined tolerance settings impact the visualization.

Figure 4.27 — 2JQW protein rendered by three nodes with different merge settings. Each variant is rendered in a different color and the common parts are rendered in orange. (a): low-tolerance visualization (neighbor area of 5×5 and epsilon of 0.03); (b): high tolerance (neighbor area of 15×15 and epsilon of 0.08).

After this step, the buffer can be displayed. Each linked list is iterated as described before. Fragments with a counter value higher than a certain user-defined threshold are rendered opaquely. This effectively renders the common parts of the different variants opaquely and the differences semi-transparently with a lower alpha value.

Table 4.7 — Performance evaluation of the multi-variant analysis application using four render clients with GigaBit Ethernet (a) and IP over InfiniBand (b). The upper half of each sub-table shows the standard approach, the lower half uses LZO compression. Render node, network wait, display node, and overall values are times in milliseconds; Mem. is the transferred data in MB.

(a) GigaBit Ethernet
                Render Node   Mem.    Network Wait   Display Node   Overall   FPS
    2JQW        32            45.0    539            64             635       1.6
    Hydrogel    60            50.3    674            78             812       1.2
    Lipase-2    287           149.3   1,537          171            1,995     0.5
    Pore        289           130.9   1,576          136            2,001     0.5
    with LZO compression:
    2JQW        65            20.0    187            115            367       2.7
    Hydrogel    94            25.5    337            126            557       1.8
    Lipase-2    399           124.4   1,760          395            2,554     0.4
    Pore        388           105.9   1,378          270            2,036     0.5

(b) InfiniBand
                Render Node   Mem.    Network Wait   Display Node   Overall   FPS
    2JQW        31            45.0    162            71             233       4.3
    Hydrogel    58            50.3    161            85             246       4.1
    Lipase-2    334           149.3   700            210            910       1.1
    Pore        295           130.9   494            141            635       1.6
    with LZO compression:
    2JQW        53            20.0    158            108            266       3.8
    Hydrogel    100           25.5    252            134            386       2.6
    Lipase-2    435           124.4   693            404            1,097     0.9
    Pore        365           105.9   564            314            878       1.1

Discussion

In the resulting images, the common parts of the molecules are clearly highlighted by the algorithm in contrast to the simple combined rendering (cf. Figure 4.26).
The differing parts like the arm of the Lipase-2 molecule in Fig- ure 4.25 are still visible. Using different color for each version, they can clearly be distinguished. To evaluate this approach, different data sets are used on four render clients, all 80 Chapter 4 • Transparency Rendering connected via a fully switched GigaBit Ethernet and InfiniBand. The display node and render clients were equipped with Intel Xeon X5660 CPUs and NVIDIA Quadro 6000, rendering a view port of 1024×786 pixels. Table 4.7 shows the performance evaluation: Ethernet variant (cf. Table 4.7a) and InfiniBand variant (cf. Table 4.7b) with no compression (upper half) and with LZO compression (lower half). GFC was not used for this application, since the compression rate is very low. Although compression using LZO has a negative impact on the performance as seen in Table 4.8, it did compress the data significantly. Slow network architectures benefit from the LZO compression for small and mid-range fragment counts whereas InfiniBand is fast enough to transfer the plain uncompressed data for every case. For large fragment counts, the faster data transfer does not compensate for the effort of compressing and decompressing the amount of data. Although LZO was able to compress the data to about 44% to 83% of the original size, the resulting buffers still range from about 45 to 150 MB. For the overall performance, the compression gains about factor 1.5 for the small and medium data sets. The large data sets do not benefit from compression as it needs more time than the resulting smaller transfer can save. As the times show, the network waiting time is significant for the overall performance. The Overall timings and FPS are measured on the display node. They include sending the request to the render nodes until the final image is blended and displayed on the screen. Of course, this method also works on a single node. On the one hand, multiple instances of the same application can be run on one machine. Here, the GPU memory is the limit as each variant needs the same amount of memory for its buffers. On the other hand, the rendering tasks can be integrated into the display node. The required GPU memory for the PPLLs is about the same as one render node and the display nodes consume, regardless of how many variants are processed. However, each variant has to be rendered consecutively, removing the parallelism from the distributed approach. When a rendering is finished, it has to be merged into the display buffers to free the render buffers for the next variant. The multi-variant analysis demonstrates a prototypical application for OIT in distributed rendering environments. The performance when requesting the data from the display node is about 1 to 4 FPS, depending on the data size and compression scheme. As the main performance loss is due to the network architecture, faster network architecture could drastically improve the performance. However, as the user inspects a static image displaying multiple variants at once, the performance is still reasonable. 4.6 • Applications 81 As a result, when using PPLLs for distributed OIT rendering, the compression with LZO for standard GigaBit Ethernet networks is feasible, especially when a low but interactive frame rate caused by the network transfers is acceptable. For faster network architectures, e.g. Infiniband, the overall performance benefits more from the network than it gains by the compression step. 
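Before turning to distributed mesh rendering, the fragment matching used in the merge step of this section can be summarized in a short GLSL-style sketch. The thresholds, the counter handling, and the helper functions are illustrative assumptions rather than the actual implementation.

    // Sketch of the per-fragment matching test performed while a render-node buffer
    // is merged into the display-node buffer (cf. the implementation details above).
    uniform float epsilon;        // e.g. 0.03 for the low-tolerance setting in Figure 4.27
    uniform int   searchRadius;   // e.g. 2 for a 5x5 neighbor area

    const uint INVALID = 0xFFFFFFFFu;

    uint firstFragmentAt(ivec2 p);          // assumed helper: front-most fragment of the list at p
    vec2 loadDepthNormal(uint idx);         // assumed helper: depth and packed normal of a fragment
    void incrementVariantCounter(uint idx); // assumed helper: counts agreeing variants per fragment

    bool matches(vec2 a, vec2 b)            // simplified: one epsilon for depth and normal
    {
        return abs(a.x - b.x) < epsilon && abs(a.y - b.y) < epsilon;
    }

    // Returns true if the incoming front-most fragment agrees with a stored one within
    // the search window; in that case the counter is incremented and the duplicate dropped.
    bool mergeFrontFragment(ivec2 pixel, vec2 incoming)
    {
        for (int dy = -searchRadius; dy <= searchRadius; ++dy)
            for (int dx = -searchRadius; dx <= searchRadius; ++dx) {
                uint idx = firstFragmentAt(pixel + ivec2(dx, dy));
                if (idx != INVALID && matches(incoming, loadDepthNormal(idx))) {
                    incrementVariantCounter(idx);
                    return true;
                }
            }
        return false;                        // kept as a differing, semi-transparent fragment
    }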
4.6.3 Distributed Mesh Rendering

A use case for distributed rendering of semi-transparent data is a simulation which runs in a distributed environment. As a proof of concept, large static meshes like the power plant model and the MPI Informatics Building Model [Havran et al., 2009] are used. For the full UNC power plant model, the index buffer and vertex buffer (position, normal, and color) are about 564 MB in size, resulting in about 12.7M triangles. For the MPI model, the data sums up to about 8.53 GB and about 73.2M triangles, which is due to a different memory structure than the power plant uses. For a simulation or a dynamic scene it is important that every update generated by PPLLs is significantly smaller (180 MB, as seen in Table 4.1 for the power plant) than the raw data. The amount of transferred data mostly depends on the fragment count, which is strongly related to the viewport size, camera settings, or zoom level, and only indirectly depends on the model size and depth complexity. Instead of gathering the raw simulation data, the result of the simulation can be rendered locally on the respective node. Using PPLLs ensures that the final result is the correct and artifact-free visualization of the simulation output.

Figure 4.28a shows the UNC power plant created by four render nodes and Figure 4.28b shows the MPI model created by eight render nodes. Each node renders a sub-model, decomposed in object space with overlapping sections. The fragments are transferred to the display node, which composites the final image and shows it for user interaction.

Figure 4.28 — The UNC power plant (a) and the MPI Informatics Building (b) rendered using multiple render nodes. The different colors indicate the parts rendered by each render node.

For displaying, a simple application is used which connects to the render nodes and sends the camera updates. Each render node uses this information to render its part of the scene into its PPLLs buffers. The buffers are transferred back to the display application. Here, all received buffers are merged into one buffer. Each PPLL is then sorted by depth, the fragments are composited, and finally blended onto the screen.

Discussion

Four render nodes and one display node are used to evaluate how compression impacts the performance. The machines used an Intel Xeon E5620 processor, were connected via GigaBit Ethernet, and rendered a viewport of 1024×1024 pixels on an NVIDIA Geforce GTX480 graphics card. The 12.7M triangles of the power plant are split into four different sub-models of equal input size. This emulates a simulation where each node has about the same work load to simulate and to render. Table 4.8 shows the timings for the render nodes, the transfer, and the display node. Note that the communication between host and GPU is included here, as well as the time spent compressing the data.

Table 4.8 — Elapsed average times in milliseconds on the render node, for the network transfer, and for compositing on the display node. The scene is a completely opaque view (FBO) for reference and a fully semi-transparent view with the PPLLs variant for the UNC power plant model. Additionally, the GFC and LZO compression algorithms were tested on the PPLLs data. Transferred memory is shown in MB.

            Render   Network   Display   Mem.    FPS
    FBO     76.8     243.2     329.3     32.0    3.1
    PPLL    344.3    2,071.0   375.1     241.3   0.4
    GFC     270.5    2,042.6   424.7     238.2   0.4
    LZO     962.2    1,868.9   483.8     138.9   0.3
The frame rate is low for the transparent case due to bandwidth limitations, that is, it is barely possible to render, transfer, and display one frame in approximately 2.5 seconds. Using the GFC compression hardly improves the network performance, gaining no overall speed-up. For the LZO compression, a significant improvement in the network transfer times is achieved but the computational overhead negates this gain completely. 4.7 • Summary 83 Nodes Render1 Network1 Display1 Mem.2 FPS 2 120.9 708.8 98.6 213.5 1.3 4 59.6 758.5 114.3 221.5 1.2 6 40.7 644.0 116.7 229.5 1.4 8 31.1 582.8 110.0 237.5 1.4 NTable 4.9 — Evaluation of the MPI model using a different number of render nodes. Average time1 in milliseconds on a render node, for the network transfer and displaying operation. The amount of memory2 transferred is shown in MB. Table 4.9 shows the evaluation of the MPI model rendering for four setups with different number of render nodes. The benchmarking computers were equipped with Intel Xeon X5650 CPUs and NVIDIA Quadro6000 GPUs, and connected via Infiniband. The view port is 1024×1024 pixels. As the timings of the render nodes show, the computational effort for rendering decreases linearly with the number of render nodes. But as the total amount of fragment data is the same for all cases – the complete model has to be transferred and merged – the performance gain is very low. Comparing the distributed rendering approach from Table 4.8 to the local approach from Table 4.1 and Table 4.4, the enormous data transfer drastically reduces the performance. Although the LZO compression significantly reduces the network load, the computational overhead is a problem. Due to the large data that needs to be transferred for the complete scene, distributed OIT is very costly. Comparison against local rendering is, however, beside the point in context of a distributed simulation, since one cannot afford the gather operation for every completed simulation step (see above). For a large model like the MPI building which does not fit in the memory of one graphics card, the evaluation has shown that the performance does not benefit from a larger number of rendering nodes. Here, the network performance is the main bottleneck of the system. 4.7 Summary This chapter has demonstrated advanced PPLLs techniques for applications. Implementing CSG made it possible to adapt existing molecular surface visu- alizations to semi-transparent surfaces. The general approach for storing the fragments and communicating them between the steps made it easy to extend the approach to be used in distributed rendering environments. 84 Chapter 4 • Transparency Rendering The rendering approach delivers a reasonable performance for the presented applications. For the single node renderings, the bottle neck is clearly the graphics card. Although there was a noticeable negative performance difference between the NVIDIA GeForce GTX 580 and the newer NVIDIA Geforce GTX 680, it can be assumed that the frame rates will increase again with new and upcoming hardware architectures. However, the bottleneck for distributed rendering is the GPU-to-CPU commu- nication and network. The benchmarks have shown that even single-threaded CPU compression can make a serious improvement on the overall performance for slow networks. 
An alternative approach has been presented as well, reducing the amount of memory transferred for each frame by exploiting the fact that the display node can transform the fragments, effectively mirroring the operations on the render node. The evaluation has shown that the computational effort for these operations is higher, but the network load could be reduced without a negative influence on the image quality.

4.7.1 Outlook

For future work, the PPLLs approach can be combined with post-processing effects like ambient occlusion or cel shading with transparency, which might help to understand complex visualizations. As all fragments are available, including additional information like the normal vector, a mixed realization of screen-space ambient occlusion and object-space ambient occlusion could be implemented. The PPLLs buffer does not hold all fragments in the scene, but all fragments from a given view. Thus, the approach could include more information than standard screen-space ambient occlusion, which only has the first layer of fragments available.

Furthermore, improving the strategy for updating the current frame on the display node, for example by sub-sampling or interpolation, could be worth investigating. Additional and upcoming real-time compression algorithms should be evaluated for use with per-pixel linked lists to reduce the size of the buffers, thereby further optimizing the GPU-to-host communication and network transfers.

A third topic is to extend the PPLLs into the time domain. As the approach is able to deliver interactive frame rates, using animated scenes is trivial. More interesting, however, is to use it with large data sets, e.g. from volume visualization. This requires a far more integrated approach, combining the volume data with the PPLLs. From a given flow field or gradient in the time domain, the direction of a given sample could be extracted. To prepare the next frame on the display node, the PPLL element representing that sample could be moved to the new position – similar to the transformation in the memory-saving two-sided distributed rendering approach (cf. Section 4.4.3). By annotating the volume voxels with a direction and velocity, the voxels could be stored into the correct neighboring PPLL. Thus, the PPLLs algorithm would provide interpolation for data sets with multiple time steps. This would allow for smooth updates and a more interactive experience. In the background, the render nodes could produce the correct buffers and send them to the display node. Once the correct buffers are available, they can be combined with the predicted buffer currently shown. Therefore, the frame is completed and can be displayed to the user while the next frame is already in preparation.

The distributed rendering approach can be applied to volume visualization as well. Very large volumes could be distributed among the participating render servers and the output could be sent to the display node(s), just as described for the normal rendering. To extend the two-sided approach, an image analysis algorithm could calculate the quality of the image on the display node and on the render node and decide whether to send a differential update or a full update. This would keep the traffic as low as possible while still maintaining a high image quality even for complex movements, animations, or a long series of view transformations.

4.7.2 Conclusion

PPLLs are a general and flexible rendering algorithm. Their main strength is artifact-free transparency rendering.
It can be used with all sorts of fragment generation, be it rasterized vertex data, implicit surfaces, or volumetric data. The algorithm delivers interactive performance and can easily be integrated into existing code. It also enables complex operations like CSG without large effort. Due to its flexibility and the buffer interface, the PPLLs approach can be used for local rendering, distributed rendering, or remote rendering. Three different prototypical applications have demonstrated the effectiveness of the algorithm.

As a result, PPLLs enable OIT effects at interactive frame rates, thus delivering a visual quality well suited for large data visualization, even on distributed rendering systems.

Chapter 5
Generalized Rendering using Per-Voxel Linked Lists

Figure 5.1 — PVLLs renderings of a geometry model (a, a polygon brick house), a ray-traced scene containing reflection, refraction, and shadows (b), and a sparse volume visualization of a kingfisher CT scan (c).

The previous chapter focused on transparency rendering of distributed, decomposed objects and mesh objects too large for the graphics card memory. This chapter presents research on rendering and storing sparse volume data, resulting in the Per-Voxel Linked Lists (PVLLs) approach – a spatial data structure based on linked voxels for volume data sets which might not fit onto current graphics cards at once. It extends the concept of PPLLs by Yang et al. [2010] for use in combination with voxelization, voxel-based rendering, and the visualization of sparse volume data.

Nowadays, data sets obtained from measurements, modeling, simulations, or other sources grow larger and larger in size. Data sources include, among others, image acquisition modalities, like CT or laser scanners, and large-scale models, e.g., in the field of geostatistics, both of which are of particular interest in this context. Regardless of their origin, these large data sets have to be processed and visualized, pushing the limits of available hardware. In most cases, however, the original raw data contains much information which is of no interest in the subsequent processing or visualization steps. This uninteresting data can be filtered out in a pre-processing step, for example by applying a transfer function to volumetric data sets or by applying simple threshold filters. Then again, the data might already be sparse.

Voxelization is a very generic approach of sampling arbitrary three-dimensional objects, like meshes, unstructured data, or implicit object representations, into a discrete volumetric representation based on a regular grid. A voxel cell of this grid acts like a bin into which all samples within that location are put. Mesh objects can be voxelized by intersecting the voxel position with the triangles, and implicit representations like spheres are voxelized by evaluating samples at the voxel position [Falk et al., 2013] (see the sketch below). In addition, continuous volumetric data can be sampled as well. Figure 5.1 shows three exemplary scenarios where the presented method can be utilized. From left to right, these include the voxelization of mesh-based objects, global effects rendering with shadows, reflection, and refraction, and sparse volume representations for data from CT scans.
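As a minimal illustration of the sampling-based voxelization mentioned above, the following GLSL snippet tests whether a voxel of a regular grid is covered by an implicit sphere; the grid parameters and names are illustrative.

    // A voxel is filled if its center lies inside the implicit sphere.
    bool voxelInsideSphere(ivec3 voxel, vec3 gridOrigin, float voxelSize,
                           vec3 center, float radius)
    {
        vec3 p = gridOrigin + (vec3(voxel) + 0.5) * voxelSize;  // voxel center in object space
        return distance(p, center) <= radius;
    }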
When using the linked lists with the standard A-buffer approach [Carpenter, 1984], the data structures are view-dependent and, thus, have to be updated each frame. With the presented approach this is no longer necessary, as any arbitrary view can be reconstructed. Therefore, the buffers do not need to be updated for each frame of a static scene and, as all fragments are available, global rendering effects like refraction and reflection can be realized easily. In this approach, the fragments of a scene or data set are stored in a single buffer in a volume-like fashion, while avoiding the storage of empty space. This makes the approach suitable for sparse volume representation and for volume rendering, as empty parts are omitted and do not waste graphics memory.

The presented method is able to render voxelized scenes – including global illumination effects – at interactive frame rates. For sparse volume data, the proposed algorithm is able to reduce the required memory footprint in comparison to standard volume rendering methods, allowing the inspection of high-resolution sparse volumes even on low-end graphics devices.

5.1 Related Work

The presented approach extends the PPLLs approach so that it does not only contain the information of a particular view but comprises the entire scene.

5.1.1 Voxelization

The method of voxelization has long been used for voxel-based graphics systems [Kaufman and Shimony, 1987] and to speed up ray tracing [Yagel et al., 1992]. Karabassi et al. [1999] utilize the depth buffer of the GPU to voxelize certain non-convex objects and their surfaces. In contrast to the PVLLs approach, features like cavities might be missed, since only six depth images are used, which are generated along the coordinate axes. Fang and Liao [2000] presented a voxelization approach for CSG models where the models are evaluated for multiple slices along the view direction. Recent work by Eisemann and Décoret [2008] computes the voxelization of solid polygonal objects using the GPU in a single rendering pass. GigaVoxels [Crassin, 2011] is a voxel-based framework for real-time rendering of large and highly detailed volumetric scenes. These works are specialized to voxelize volumes or geometry objects, while the approach presented here can voxelize and visualize both data types. Kämpe et al. [2013] evaluated directed acyclic graphs instead of octrees to store the voxels in hierarchical levels, but only for polygonal models.

5.1.2 Sparse Volume Rendering

While there exist various methods for volume rendering, this chapter refers only to volume ray casting without loss of generality. Volume ray casting was first presented in 1988 [Drebin et al., 1988; Levoy, 1988; Sabella, 1988] and bears a high resemblance to ray tracing without secondary rays. In recent GPU-based approaches, the volumetric data is stored in a 3D texture and the volume rendering is performed within a single pass [Stegmaier et al., 2005b]. For large, sparse data sets the memory requirements might exceed the amount of available graphics memory and more advanced storage schemes have to be used. Museth [2013] presents VDB, a framework for sparse, time-varying volumetric data that is discretized on a grid. This kind of data is used for animations and special effects in movies. Teitzel et al. [1999] presented an approach to render sparse grid data by interpolating the sparse grid and in fact turning it into a volume. Kähler et al.
[2003] use adaptive hierarchical methods to render sparse volumes, effectively partitioning the whole volume into smaller volumes which are then used for volume ray casting, similar to the two-stage storage of the PVLLs approach but using volumes instead of voxel lists. Gobbetti et al. [2008] 90 Chapter 5 • Generalized Rendering using Per-Voxel Linked Lists present an alternative single-pass approach for rendering out-of-core scalar volumes. For additional state of the art, refer to Balsa Rodríguez et al. [2014] and Beyer et al. [2014]. Other methods for storing sparse data are the compressed row storage (CRS) or compressed sparse row (CSR) patterns [Koza et al., 2014]. Here, sparse matrices are compressed by removing the entries which contain zeros. The PVLLs algorithm adopts these notions of sparse matrices and extends it to 3D by relying on the linked lists. Instead of storing all volume data, the volume data is pre-classified with a given transfer function and only non-zero voxels are stored. This allows for a smaller memory footprint of the volume, in particular when representing sparse volumes. Nießner et al. [2010] use Layered Depth Images (LDIs introduced by Shade et al. [1998]), a concept similar to the PVLLs approach. Instead of using three orthographic projections and linked lists of voxels, they use n orthographic LDIs to store the scene of geometric data. Frey et al. [2013] extended the LDI approach to Volumetric Depth Images generating proxy data which displays volume data sets. Bürger et al. [2010] propose Orthogonal Frame Buffers as extension to LDIs and perform surface operations like recoloring or particle flow on rendered meshes. Reichl et al. [2012] use a rasterization and ray tracing hybrid technology to render meshes. In contrast and as extensions to these approaches, PVLLs can render meshes, volumes, and implicit surfaces and internally stores the voxels in linked lists. It displays arbitrary rasterized data and uses ray tracing for rendering and visualizing global lighting effects. 5.2 From Pixel to Voxel: Per-Voxel Linked Lists In the standard rendering approach with a depth buffer, only the front-most fragments are visible. For rendering semitransparent scenes, the fragments have to be drawn either in a specific order to allow for appropriate compositing or by approximating the blending function [Salvi et al., 2011]. Rendering such scenes in a single rendering pass is possible when e.g. utilizing the PPLLs approach. After all fragments have been gathered in the PPLLs, the lists are sorted according to the fragment depths. The final image is obtained by blending the sorted fragments based on their opacity. Since the A-buffer is created for a particular camera setting, its content has to be updated as soon as the view point is changed. In addition, the A-buffer also reflects the projection mode of the camera, i.e. perspective or orthographic projection. Thus, the contents of the A-buffer describe only the parts of the scene that are inside the view frustum. In contrast to the classical A-buffer approach, orthographic projections are used along the three coordinate axes to capture the scene in an entirely view- 5.2 • From Pixel to Voxel: Per-Voxel Linked Lists 91 independent manner. The concept of the PPLLs is extended to PVLLs, allowing to combine the results of the three individual views in a single storage buffer – the main PVLLs buffer. 
This buffer makes it possible to store the voxelized scene in a memory-efficient way and to reconstruct it in the subsequent rendering step. Since the buffer is not view-dependent, it is created once in the beginning and has to be updated only when the scene geometry changes.

From a conceptual point of view, the bounding volume is defined with a given volume resolution to comprise the scene, similar to an axis-aligned bounding box. One axis of the bounding volume is chosen as the predominant axis for the traversal of the voxel linked lists – without loss of generality, let it be the z-axis. Each fragment inserted into the bounding volume is transformed into this common frame by a simple coordinate swizzle. Since for each of the three views along the main axes xy is always the screen plane and z the depth component, this transformation ensures the correct voxel position in the volume. The bounding volume itself is subdivided into multiple bricks. Each brick contains a number of voxels depending on the brick resolution. For reasons of simplicity, let the brick resolution be 1×1×1 unless otherwise noted. The voxels represent spatially extended pixels which can either be filled or empty. By applying the linked-list approach, only filled voxels are stored.

In the following, the PVLLs algorithm is described. The underlying data structure for the individual bricks, implementation details, and proposed optimizations are presented in Section 5.3. Section 5.4 focuses on the application of the PVLLs in the context of sparse volume ray casting and the utilization for global ray tracing effects.

5.2.1 Voxelization Algorithm

The PVLLs algorithm consists of three stages: the setup stage, the voxelization, and the rendering stage. Figure 5.2 depicts the algorithm and shows the connections between the three stages.

Figure 5.2 — The PVLLs algorithm. Geometry and implicit surfaces are voxelized for each of the three coordinate axes, whereas the per-voxel buffer is set up in one pass for sparse volumes. The final rendering stage is identical.

During the setup, the necessary buffers are created and initialized. For both geometric and volumetric data a single buffer, i.e. one A-buffer based on PVLLs, is sufficient.

The voxelization of geometric objects, i.e. mesh-based models, takes place in three steps. First, an orthographic projection of the entire scene is rendered for each of the coordinate axes. During the rendering of one view, all fragments are transformed into a common coordinate system to match the predominant axis of the buffer – the growth direction of the buffer. Each fragment is then inserted into the respective per-voxel list. If a voxel entry does not exist for that position, the voxel is created and added to the A-buffer, and a pointer to the fragment data is added. In case the voxel entry already exists, the fragment data is appended to the list of fragments of that voxel. Upon insertion, the attributes of the fragment, like position and normal, are transformed to object space to maintain consistency between the three coordinate views. A sketch of this insertion step is given below.
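The following GLSL sketch illustrates the insertion step; the swizzle convention, the buffer layout, and the merging of multiple fragments per voxel are simplifying assumptions.

    // Sketch of inserting a rasterized fragment into the per-voxel linked lists. The
    // swizzle maps a fragment from the current orthographic view (X, Y, or Z) into the
    // common frame whose depth axis is the predominant z-axis; the exact component
    // order depends on the camera setup and is illustrative here.
    uniform int viewAxis;                        // 0 = X view, 1 = Y view, 2 = Z view
    layout(binding = 0) uniform atomic_uint voxelCounter;
    layout(r32ui, binding = 0) uniform coherent uimage2D globalHeader;   // one list head per (x, y) column
    layout(std430, binding = 1) buffer VoxelBuffer { uvec4 voxels[]; };  // fragment index, depth, next, pad

    ivec3 toPredominantFrame(ivec3 v)
    {
        if (viewAxis == 0) return v.zyx;         // rendered along x: depth becomes x
        if (viewAxis == 1) return v.xzy;         // rendered along y: depth becomes y
        return v;                                 // the z view already matches the buffer layout
    }

    void insertVoxel(ivec3 gridPos, uint fragmentIndex)
    {
        ivec3 v   = toPredominantFrame(gridPos);
        uint idx  = atomicCounterIncrement(voxelCounter);
        uint prev = imageAtomicExchange(globalHeader, v.xy, idx);       // prepend to the column's list
        voxels[idx] = uvec4(fragmentIndex, uint(v.z), prev, 0u);        // sorted by depth in a later pass
    }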
By using three orthogonal cameras/views, the algorithm is able to rasterize the entire scene. This is due to the fact that an object which is not visible in one view, e.g. a plane parallel to the viewing direction yielding no fragments for that particular direction, will be visible in one of the other views. Thus, the scene is fully rasterized and the final voxel volume will include each fragment. After rendering the scene into the main buffer, the buffer contents have to be sorted. Similar to the PPLLs approach, each list is then sorted according to depth independently.

In the last step of geometry voxelization, the brick states are updated. A brick is marked as occupied if it contains at least one filled voxel. In case the brick resolution is equal to 1×1×1, i.e. a brick consists only of a single voxel, this step is skipped. The brick occupancy is used during the rendering stage to support empty-space skipping. Please note also that the buffer holds only the individual non-empty voxels; information on empty space is encoded indirectly. Figure 5.3 shows the Stanford Lucy model in different voxelization domains, clearly showing the voxel structure for the smaller domain sizes.

Figure 5.3 — The Stanford Lucy rendered with different voxelization parameters: (a) 64³, (b) 128³, (c) 256³, (d) 512³, (e) 1024³, (f) 2048³. While the voxelization is clearly visible for the low values, the 2048³ volume exceeds the framebuffer resolution of 1024×1024.

Besides rasterized polygons, the voxelization can also process implicit object representations that are typically used for ray casting, e.g. to visualize particle data sets. In this prototype, the ray casting of spheres is used as an example, but the approach can be extended to other shapes as well. The spheres are raycasted and, for each generated fragment, the sphere data (center and radius) is embedded into the voxel data. This allows the sphere to be raycasted again when the traversal ray enters the sphere voxel, resulting in a pixel-accurate image. Falk et al. [2013] presented this approach in a similar way before.

Volume data, in contrast to meshes, can be transformed into the buffer without explicit voxelization. The volume is processed along the predominant axis in slices that are one voxel thick to keep the memory requirements down. The slices are uploaded into GPU memory and evaluated one after another. On the GPU, a pre-classification of the volume slice is performed by applying the transfer function for alpha thresholding to discard transparent voxels. The remaining voxels are appended to the respective voxel lists. When the user changes the transfer function, the pre-classification is recomputed within 3–5 seconds, not hindering the interactive experience too much.
Since the volume is sliced along the same axis as used for the PVLLs buffer, the contents of the linked lists are automatically sorted according to depth by definition. Again, only non-empty regions are preserved and stored in the buffer, thus enabling support for volumes which would otherwise not fit into GPU memory. In addition to the original data value, the local gradient is also stored to be used for illumination computations. In the last step, the brick states are updated.

5.2.2 Final Image Rendering

The final rendering is obtained by voxel ray tracing within the bounding volume. First, initial rays are created for each pixel on the screen with a starting point on the bounding volume and their direction determined by the camera position. The camera can be positioned arbitrarily and can even be situated inside the bounding volume, since the entire scene was captured in the buffer.

Rays are traversed by projecting the direction vector onto the boundary of the nearest next voxel. Similarly, empty-space skipping is performed by projecting the direction vector onto the next non-empty brick. Within a brick, generic ray-voxel traversal with a slight modification is employed. Finding the next voxel along the principal direction of the PVLLs buffer is straightforward, i.e. it is the next list element. To find a voxel in a neighboring list, first the respective list has to be identified and then it is traversed from its beginning until the algorithm arrives at the correct depth (a sketch of this lookup is given below). In case there is no stored information available, the voxel is empty and the traversal is continued. During ray traversal, the voxels of non-empty bricks are inspected and a color value is computed for each contributing fragment using the stored normal vector and standard per-pixel shading algorithms. For volume rendering, the normal is derived from the local gradient of the actual volume, which is also stored.

The rays traverse the voxel volume cell by cell, following the exact direction vector. For each step, the current position is forwarded to the next cell boundary (if the brick contains voxels) or to the next brick (if the current brick is empty). The color value is then blended with the respective screen pixel utilizing front-to-back blending. Secondary rays can be generated and traversed as well to achieve shadowing, reflection, or refraction. Early-ray termination is carried out once a ray leaves the bounding volume or the accumulated opacity exceeds a user-specified threshold.

Figure 5.4 shows a scene containing the outline of a semi-transparent bunny sitting on top of a reflective surface with exemplary primary rays (orange) and secondary rays (teal). A lookup in the brick volume determines whether the local area is completely empty (white cells) or whether there are some populated voxels (blue cells).

Figure 5.4 — Schematic two-dimensional view of the bounding volume and the ray tracing process of a translucent bunny on a mirroring surface. Exemplary rays (orange) are traced from the camera through the voxelized geometry (black outline), including secondary shadow rays (teal) and a reflection on the ground surface (light orange). Dashed lines denote fast grid traversal for skipping empty bricks.
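The per-voxel list lookup described above can be sketched as follows; it reuses the illustrative buffer layout from the voxelization sketch and assumes that the lists have already been sorted by depth.

    // Returns the index of the voxel stored at (column, depth), or INVALID if that
    // cell is empty. The list belonging to the (x, y) column is fetched via its
    // header and scanned from the front; the sorted order allows early termination.
    const uint INVALID = 0xFFFFFFFFu;

    uint lookupVoxel(ivec2 column, uint depth)
    {
        uint idx = imageLoad(globalHeader, column).r;
        while (idx != INVALID) {
            uint d = voxels[idx].y;            // depth component stored with the voxel
            if (d == depth) return idx;        // requested cell is occupied
            if (d > depth)  return INVALID;    // passed the requested depth: cell is empty
            idx = voxels[idx].z;               // follow the "next" pointer
        }
        return INVALID;
    }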
If a brick is empty, the ray can skip all voxels within that brick and fast-forward to the next brick (indicated by dashed lines in Figure 5.4). Otherwise, each voxel within the brick has to be traversed and its contribution is computed.

5.3 Implementation Details

The OpenGL rendering pipeline and GLSL shaders are used for the voxelization and the subsequent ray tracing. The buffer is created in a similar way as the A-buffer with the linked-list implementation, except for the transformation of fragments to the predominant axis before inserting them into the respective lists and the fact that the scene is rendered three times to capture all features. The data layout of the lists is adjusted to benefit from cache coherency during ray tracing. The general layout of the data structures is detailed below, followed by some considerations regarding optimizations.

5.3.1 Data Structures

The proposed data structure is quite similar to those used in PPLLs A-buffer implementations and is depicted in Figure 5.5. Blue arrows denote the pointers and gray connectors show the layout of the data and how the different buffers are connected via indices.

Figure 5.5 — Data structure and auxiliary buffers of two bricks, each holding 4×4 voxels and having a brick header pointing either to the first non-empty voxel in that row or to nothing. Colored voxels contain only the data needed for the ray traversal and are linked similar to the PPLLs approach.

The global header, a 2D buffer, covers the front face of the bounding volume, which is perpendicular to the predominant axis. In the original PPLLs concept, this corresponds to the A-buffer covering the viewport. The indices stored in the global header point to the list of voxels for that particular position. The voxel lists are stored in the voxel buffer as linked lists, and each list element consists of a fragment index, the depth of the fragment in terms of the predominant axis, and a pointer to the next list element. The fragment index in turn points to an element in the fragment buffer, which holds a material ID and the normal to be used for deferred shading. During the rendering phase, multiple fragments might be created for a specific voxel. These fragments are merged by averaging their values during the sorting step. The material ID is used for a lookup in the material buffer that holds all the different materials. It stores the color, reflection, and refraction coefficients. The global header is mainly used during the voxelization stage.

To support empty-space skipping, the bounding volume of the scene is subdivided into multiple bricks of size n (cf. Section 5.2). Each brick has its unique brick header representing n² entry indices. In principle, brick headers are identical to the global header, but they possess additional information on whether the entire brick is empty. As can easily be seen, the combined brick headers of the first layer represent exactly the same information as the global header. Thus, the global header is not necessary for rendering.

For sparse volumes, a bounding volume resolution identical to the original volume is used for simplicity. With this restriction, a voxel of a brick can only contain a single data value and, thus, the data is stored directly in the voxel buffer without the need for an additional fragment buffer. Additionally, the normal derived from the volume density is stored in 16-bit polar coordinate form. The actual color contribution of each voxel is calculated during rendering by applying the transfer function to the data value, as it is done in regular volume rendering.
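The buffer layout described in this subsection can be summarized as GLSL shader storage blocks. Field packing, binding points, and the fixed brick size are illustrative assumptions, not the exact memory layout of the implementation.

    struct VoxelNode   { uint fragment; uint depth; uint next; uint pad; };
    struct FragmentRec { uint materialId; uint packedNormal; };            // theta/phi, 16 bit each
    struct Material    { vec4 color; float reflection; float refraction; vec2 pad; };

    layout(std430, binding = 1) buffer VoxelBuffer    { VoxelNode   voxels[];    };
    layout(std430, binding = 2) buffer FragmentBuffer { FragmentRec fragments[]; };
    layout(std430, binding = 3) buffer MaterialBuffer { Material    materials[]; };

    // One header per brick: n*n entry indices into the voxel buffer plus an occupancy
    // flag used for empty-space skipping (n = 8 assumed here for illustration).
    struct BrickHeader { uint occupied; uint firstVoxel[64]; };
    layout(std430, binding = 4) buffer BrickBuffer    { BrickHeader bricks[];    };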
5.3.2 Optimizations

There exist a couple of major starting points for optimizing the approach outlined above. The following optimizations proved to be beneficial in improving the performance; some of the ideas and their identified problems are discussed in more detail below.

Data Layout

The PVLLs data layout separates the voxel cells and the actual content, so the stored fragment information like color or normal is only accessed when it is needed and not every time the linked list is traversed. Additionally, shared information like the material is only stored once, and each fragment has a link to its material from which the information can be retrieved in the appropriate stage.

Empty-Space Skipping

The introduction of the bricks drastically improved the rendering performance by enabling empty-space skipping. An evaluation showed that the optimal brick size n varies with the bounding volume resolution N, but a ratio of N/n = 64 gave the best results in the tests. If a region is empty, it can be traversed in one single step, skipping the look-up of 4³, 8³, or 16³ voxels at a domain size of 256, 512, or 1024, respectively (see Figure 5.10).

GLSL Shader Optimizations

Although GPU architectures are becoming more and more generalized nowadays, they are still sensitive to branching, which cannot be entirely avoided for ray traversals. Rewriting while-loops into for-loops with constant upper bounds, reducing the branching to a minimum, and converting boolean expressions into formulas helped to improve the overall performance. However, care has to be taken when using upper bounds for loops, as computations might stop prematurely, potentially resulting in artifacts or unfinished images.

5.4 Voxel Ray Tracing

After the buffer has been created by either voxelizing geometric objects or embedding sparse volume data, the final image is obtained by means of ray tracing. Two exemplary renderings are depicted in Figure 5.6. The house (Figure 5.6a) is built from geometry blocks, voxelized, and rendered with refraction and reflection on the windows; the other data set (Figure 5.6b) is a sparse volume from the field of geostatistics. Although the traversal and evaluation of the buffer, i.e. the rendering, is basically the same for both sparse volume data and voxelized geometry, the following sections point out some data-dependent peculiarities.

5.4.1 Sparse Volume Rendering

One potential application for the proposed PVLLs is the rendering of sparse volume data. Depending on the volume data and the associated transfer function, typically only a low percentage of the voxels is of interest (cf. Table 5.3). All other density values can thus be omitted safely; they would consume precious memory although their content is never visualized. With the PVLLs, it is possible to store the remaining, interesting data in a memory-efficient and sparse data structure.

Figure 5.6 — Example scenes for voxel-based ray tracing. (a) Detailed view of a house built from blocks showing refractions in the window. (b) Volume rendering of the Kronecker 2 data set from geostatistics.

The rendering of the sparse volume data itself fits well with the rendering algorithm proposed in Section 5.2.2. During voxel ray tracing, the density values stored for the occupied voxels are mapped to colors by applying a transfer function, identical to regular volume ray casting [Stegmaier et al., 2005b].
The main difference is, however, that changes of the transfer function are somewhat limited in this approach with respect to interactive rendering. Since the approach relies on a pre-segmentation of the volume to generate the sparse volume, the transfer function can be adjusted interactively for all densities occurring in the sparse volume. But voxels which were initially discarded during voxelization cannot be made visible without a re-computation of the sparse volume. This allows the user to alter the color and opacity of existing voxels only. However, redoing the pre-classification only takes 3–5 seconds and thus does not hinder the interactive experience.

5.4.2 Global Effects Rendering

With the availability of an in-memory representation of the whole scene, additional global effects are feasible. Therefore, support for secondary rays allowing for shadows, refraction, and reflection was added. The original algorithm described in Section 5.2.2 was extended by performing the ray traversal in a loop (cf. Figure 5.2, dashed line) and adding an additional ray buffer for dynamically adding and storing rays generated during the rendering phase. Besides the ray origin and direction, the ray buffer holds the opacity accumulated so far and the pixel position of the initial camera ray. The explicitly stored pixel position allows primary and secondary rays to be processed independently of the final pixel position while still being able to contribute to the correct pixel. To account for correct blending of the ray results, the accumulated opacity is propagated with each split as well.

Figure 5.7 — Correct color blending for split rays. The ray is split in voxel A and directly traversed to voxel B, while the ray to voxel C is stored for the next iteration. Voxel A is a refractive material with transparency α = 0.5 and also reflects light with coefficient β = 0.1; the incoming ray carries α_0 = 0 in this example.

Figure 5.7 depicts an opacity propagation scenario where voxel A is refractive as well as reflective. The refraction ray hits voxel B and the reflection ray hits C. The standard color blending equation for front-to-back compositing,

    C'_i = C'_{i-1} + (1 - α'_{i-1}) α_mat C_mat,

is not applicable when using the iterative ray traversal, as the rays can split on reflective and refractive voxels. Adding the color of the delayed ray to the final image with said equation does not yield correct results. However, if the contribution of the rays is split and the alpha value propagated with them, one can rewrite the equation, yielding the correct color for the example as follows:

    C = (1 - α_0) α_A C_A + C_B (1 - α_{AB}) + C_C (1 - α_{AC}),

with

    α_{AB} = 1 - (1 - β_A)(1 - α'_A),
    α_{AC} = 1 - β_A (1 - α'_A),
    α'_A  = α_0 + (1 - α_0) α_A = 0.5,

and β_A denoting the reflection coefficient of A. This principle can be applied at each ray split by simply propagating the computed opacity at that point, thereby achieving proper compositing of the final image. The ray traversal step of the algorithm is adjusted for the iterative ray storage to take care of potential ray splits that depend on the material; a sketch of this split-and-propagate step is given below.
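The following GLSL sketch illustrates the split-and-propagate step; the ray record, the helper functions, and the choice to continue the refraction branch directly are illustrative assumptions consistent with the equations above.

    // At a voxel that is both reflective and refractive, the hit contribution is
    // blended, the reflection ray is appended to the ray buffer for a later pass,
    // and the refraction ray is continued directly with the propagated opacity.
    struct Ray { vec3 origin; vec3 dir; float alpha; ivec2 pixel; };

    void addToPixel(ivec2 pixel, vec4 contribution);   // assumed helper: accumulate into the image
    void pushRay(Ray r);                                // assumed helper: atomic append to the ray buffer

    void splitRay(inout Ray ray, vec3 hitPos, vec3 normal,
                  float alphaA, float betaA, vec4 colorA, float ior)
    {
        // contribution of voxel A itself, weighted by what the ray still transmits
        addToPixel(ray.pixel, (1.0 - ray.alpha) * alphaA * colorA);

        float alphaPrime = ray.alpha + (1.0 - ray.alpha) * alphaA;    // alpha'_A

        Ray reflected;                                                 // reflection branch toward C
        reflected.origin = hitPos;
        reflected.dir    = reflect(ray.dir, normal);
        reflected.alpha  = 1.0 - betaA * (1.0 - alphaPrime);           // alpha_AC
        reflected.pixel  = ray.pixel;
        pushRay(reflected);

        ray.origin = hitPos;                                           // refraction branch toward B
        ray.dir    = refract(ray.dir, normal, ior);
        ray.alpha  = 1.0 - (1.0 - betaA) * (1.0 - alphaPrime);         // alpha_AB
    }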
Figure 5.8 — Linear white-to-black visualization of the number of ray steps per pixel. Darker pixels correspond to more ray steps. Atop a plane, a refractive Stanford bunny model sits together with a reflective sphere, both casting shadows onto the plane. At the side there are three little boxes, also visible in the reflection. The scene is lit by a single point light source.

Storing the rays causes memory writes using atomic counter operations, and after each pass the CPU starts the next ray batch with the respective number of shader invocations derived from the atomic counter holding the number of active rays. If the material at the intersection point has either a reflective component or a refractive component, but not both, the current ray direction is adjusted accordingly and no split occurs. In case of reflection and refraction, one ray is continued directly and the other is stored in the ray buffer for the next iteration. Which ray is traversed directly and which one is stored for the next iteration is subject to the implementation.

The ability to split rays also enables shadow generation. Whenever a ray hits a fragment, a new shadow ray is created which targets a light source (cf. Figure 5.4). When the shadow ray hits another fragment, this casts a shadow with respect to the translucency of the hit object. Otherwise, if the light source is reached without hindrance, the respective fragment is lit. Again, the ray is traced until the alpha value is saturated, it leaves the bounding proxy volume, or the shadow ray hits a light source.

To keep the shader executions from diverging too much, a ray is only iterated for a fixed number of steps maxsteps and kept active in the next shader pass if not terminated. This prevents stalling of already finished shaders by long and complex ray traversals in the same ray batch. After all rays have either finished or reached their iteration count in the current pass, only active and newly created rays are traced in the subsequent pass.
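The multi-pass scheme can be summarized by the following C++ sketch of the host-side driver; RayBatchGpu is a hypothetical stand-in for the actual OpenGL buffers, atomic counter, and shader dispatch used by the implementation.

```cpp
#include <cstdint>

// Hypothetical wrapper around the GPU resources of the iterative traversal:
// an input and an output ray buffer plus an atomic counter for appended rays.
struct RayBatchGpu {
    // Runs the traversal shader for 'rayCount' rays, each limited to 'maxSteps'
    // iterations. Rays that are still active afterwards, as well as rays spawned
    // by splits or shadow tests, are appended to the output buffer through the
    // atomic counter; the counter value, i.e. the size of the next batch, is returned.
    uint32_t traceBatch(uint32_t rayCount, uint32_t maxSteps);

    // The output buffer of this pass becomes the input buffer of the next one.
    void swapRayBuffers();
};

// Host-side driver: start with one camera ray per pixel and keep launching
// passes until no active or newly created rays remain.
inline void renderWithGlobalEffects(RayBatchGpu& gpu, uint32_t primaryRays, uint32_t maxSteps) {
    uint32_t active = primaryRays;
    while (active > 0) {
        active = gpu.traceBatch(active, maxSteps);
        gpu.swapRayBuffers();
    }
}
```

In the measured scenes, only a few such passes are needed, since after the first pass only split rays and rays that exceeded maxsteps remain active.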
Figure 5.8 shows a visualization of the number of ray step iterations for a refractive Stanford bunny and the respective scene. Here, the structure of the bricks is clearly visible as they surround the geometry of the bunny and the sphere. In this example, the bounding volume resolution is 1024 and the brick resolution is 16. A total of 1.03M rays are generated in the scene for a viewport of 1000 × 1000. After the first pass, which generates rays for each pixel in the viewport, the number of active rays drops rapidly, since only reflected and refracted rays or rays exceeding the maximum number of iterations are continued. When enabling shadow rays with one light source, the total ray count is about 1.88M.

5.5 Results & Discussion
The performance evaluation was carried out on an Intel Xeon E5-2637 machine with 128 GiB of RAM and an NVIDIA Quadro K6000 graphics card; the viewport size used was 1000 × 1000 for all measurements. The frame rates were determined by averaging the frame rates of a 360° rotation around the y-axis followed by a 360° rotation around the x-axis. With these two rotations, the entire scene is covered. Since the PVLLs are generated along the z-axis, a rotation around this axis has only minor effects on the performance due to rather similar bounding volume traversals for this direction.

Figure 5.9 depicts the four scenes used in the evaluation. Both the dragon model and the Buddha statue feature a non-reflective, non-refractive material. The window panes in the house scene are refractive, and the Bunny scene contains a refractive Stanford bunny and a reflective implicit sphere. The performance was measured for a plain rendering of the voxelized dragon and the Buddha statue and with global effects enabled for all scenes (denoted by ray tracing). The results are shown in Table 5.1. Please note that the measurements in Table 5.1 include only the rendering; the one-time costs of the voxelization are omitted. Information on the voxelization can be found in Section 5.5.1. For all scenes, rendering is possible at highly interactive frame rates.

To investigate the influence the view direction has on the voxel traversal, a scene containing a single sphere was set up. Again, the frame rate is recorded while the scene is rotated around the y-axis and the x-axis. The results for different bounding volume resolutions as well as varying brick sizes are shown in Figure 5.10. Since the sphere has the same appearance for every view direction, one can conclude that the variations are due to the voxel lookup. The evaluation shows that a brick size of 4 works best for a bounding volume of 256, while for volume resolutions of 512 and 1024, brick sizes of 8 and 16, respectively, yield the best performance. All lines share the same features with slight variations. At the y-rotations of 90° and 270°, for example, the spikes indicate the highest performance for that particular view of the scene.

Figure 5.9 — Scenes used in the evaluation: (a) Dragon, (b) Buddha, (c) House, (d) Bunny. Results of the performance measurements are shown in Table 5.1.

Scene (mode)           Resolution [brick]   # Voxels (10³)   Occupied bricks   Memory [MiB]   FPS
Dragon (plain)         256 [8]                  81.6              5.0%              9.4        307.8
                       512 [8]                 326.2              2.6%             70.0        134.5
                       1024 [16]              1303.2              2.6%            277.0         87.3
Buddha (plain)         256 [8]                  67.1              3.7%              9.2        310.0
                       512 [8]                 268.5              2.0%             69.1        147.9
                       1024 [16]              1073.6              2.0%            273.0         97.4
Dragon (ray tracing)   256 [8]                  81.6              5.0%              9.4         47.6
                       512 [8]                 326.2              2.6%             70.0         39.6
                       1024 [16]              1303.2              2.6%            277.0         33.0
Buddha (ray tracing)   256 [8]                  67.1              3.7%              9.2         48.1
                       512 [8]                 268.5              2.0%             69.1         40.3
                       1024 [16]              1073.6              2.0%            273.0         34.1
House (ray tracing)    256 [8]                 560.0             18.8%             16.7         45.2
                       512 [8]                2319.3             12.2%            100.0         38.8
                       1024 [16]              9444.9             12.2%            401.0         31.1
Bunny (ray tracing)    256 [8]                 139.2              6.9%             10.2         37.3
                       512 [8]                 557.3              3.6%             73.5         26.4
                       1024 [16]              2233.8              3.6%            291.0         20.8

Table 5.1 — Results of the performance measurements for the test scenes. Columns denote the resolution of the bounding volume with brick size in brackets, number of non-empty voxels, percentage of non-empty bricks, memory footprint, and frame rate.

Figure 5.10 — Ray-casted sphere rotated around the y-axis and x-axis for 360° at different bounding volume resolutions and brick sizes (denoted in brackets). (Line chart of the frame rate over the rotation angle for resolutions 256, 512, and 1024 with brick sizes 4, 8, and 16.)

Figure 5.11 — Evaluation of different voxelization and brick dimensions for the Stanford Lucy model (cf. Figure 5.3). (Bar chart of the frame rate for volume resolutions 64³ to 2048³ and brick sizes 4 to 64.)

Figure 5.11 shows the evaluation of different brick sizes and voxelization dimensions, measured on an NVIDIA GTX 680. The evaluated scenes are depicted in Figure 5.3. The best brick size (middle bars) depends on the scene, especially on the number and arrangement of the empty sections.

5.5.1 Voxelization
As mentioned before, the voxelization of static scenes has to be done only once in the beginning. If, however, dynamic scenes are considered, the PVLLs have to be recomputed for every frame.
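For dynamic scenes, such a per-frame rebuild can be pictured as the following driver loop. The pass names are placeholders; the three orthographic voxelization passes refer to the construction described in Section 5.2 and are not an API prescribed here.

```cpp
// Hypothetical per-frame driver for dynamic scenes: the PVLL buffers are
// cleared and the scene is re-voxelized before every rendered frame.
struct PvllPipeline {
    void clearBuffers();              // reset head pointers and the fragment counter
    void voxelizeAxis(int axis);      // orthographic rasterization along x, y, or z
    void rebuildBrickOccupancy();     // mark non-empty bricks for empty-space skipping
    void rayTrace();                  // traversal passes as described in Section 5.4
};

inline void renderDynamicFrame(PvllPipeline& p) {
    p.clearBuffers();
    for (int axis = 0; axis < 3; ++axis)   // three orthographic projections (cf. Section 5.2)
        p.voxelizeAxis(axis);
    p.rebuildBrickOccupancy();             // voxelization dominates the frame time (cf. Table 5.2)
    p.rayTrace();
}
```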
Table 5.2 shows the combined results of voxelization and rendering for the four test scenes where the voxelization is performed every frame. This demonstrates that the approach is capable of voxelizing and rendering a scene at interactive frame rates, except for high volume resolutions with global effects rendering enabled.

While the approach performs well with the Dragon model and the Buddha statue, even for large sizes, the performance for the House and Bunny scenes drops dramatically. This is due to the high depth complexity of the scenes, particularly in the z-direction, and the related costs for voxelization. In all cases, similar frame rates were obtained for brick sizes of 8 and 16. Although this seems contradictory to the findings in Figure 5.10 at first, the explanation is rather straightforward. The rendering process has only a minor impact on the overall cost of one frame, and the benefits of using the optimal brick size are canceled out by the additional costs of the voxelization.

              Resolution 256           Resolution 512           Resolution 1024
Brick size    4      8      16         4      8      16         4      8      16
Dragon      132.1  157.1  159.1       62.1   76.5   81.1       21.4   28.4   31.0
Buddha       84.6   94.8   94.1       44.0   52.2   53.6       17.7   22.4   23.8
House        29.4   29.6   29.5        6.1    6.2    6.3        1.4    1.4    1.4
Bunny        20.5   21.0   20.7        4.8    4.9    4.9        1.2    1.2    1.2

Table 5.2 — Voxelization in combination with rendering for different volume resolutions and brick sizes. Measurements are given in frames per second.

Figure 5.12 — Data sets used in the evaluation of sparse volume rendering together with an illustration of their respective transfer functions (opacity over data value). (a) Kronecker data set from geostatistics. (b)-(d) CT scans of animals: malachite kingfisher, old field mouse, and veiled chameleon.

5.5.2 Sparse Volumes
To illustrate the applicability of the approach in combination with volume rendering, three large-scale CT data sets of animals and two large data sets from the field of geostatistics are used. In Figure 5.12, the volume rendering results are shown for these data sets. The data sets are not inherently sparse by themselves, but they share the common property that they contain large regions which are not of particular interest. For the CT scans of the animals (Figure 5.12b-d), this applies to the air surrounding the specimen. Other volume densities representing, e.g., tissue and bone structures are conserved and separated by applying a simple transfer function. The resolution for all three animal data sets is 1024³ at 8 bit per data voxel.

The geostatistics data comprises two volume data sets which represent scattered measuring points that are interpolated to form a volume data set through a geostatistical method called Kriging [Kitanidis, 1997]. They are stored in a data-sparse format through low-rank Kronecker representations [Nowak and Litvinenko, 2013]. A domain size of 1024³ and double precision was used in the simulation for computing the Kronecker volumes. The data is converted to 32-bit floating point values before voxelization and uploading into GPU memory to match the data structures. Despite the conversion, no significant loss in the dynamic range could be detected when compared with the original results.
Although the entire domain of the data set actually contains data, domain experts are only interested in the data outside the 95% confidence interval of the standard deviation N(µ = 0, σ² = 1). This turns the volume data sets into sparsely occupied volumes. Figure 5.12a shows the data outside the confidence interval, thereby illustrating the sparsity of the data itself. The depiction in Figure 5.6b (center left), in contrast, shows the rendering of one of the Kronecker data sets with low opacities, generating the fluffy surface.

The generation of the PVLLs is carried out on the GPU for all volume data sets as described in Section 5.2.2. The computation is non-interactive but takes only 2.6 seconds for a 1024³ volume data set. Table 5.3 shows the results of the volume benchmarks. The interesting parts in the Kronecker data sets still occupy large parts of the volume (34% and 50%), resulting in a comparatively large memory footprint. The volume occupancy for the CT scans is in the range of 0.5% to 1.8%, excluding the air surrounding the specimen (60 to 80%) and some of the internal tissue, but keeping the main features visible.

The approach delivers interactive frame rates for most of the tested cases. Naturally, higher numbers of voxels result in lower average frame rates. This also directly impacts the memory footprint, which nevertheless remains low compared to the original data set.

Volume        Resolution [brick]   Occupied voxels   Occupied bricks   Memory [MiB]   Voxels per byte   FPS
Kronecker 1   512 [8]              34.7%             39.7%              738.0         16.6              11.6
              1024 [8]             34.6%             37.1%             5874.3         16.6               1.9
Kronecker 2   512 [8]              49.6%             57.3%             1054.5         16.6               9.6
              1024 [8]             49.6%             53.4%             8408.9         16.6               1.4
Kingfisher    1024 [8]              0.5%              2.3%              107.6         19.6              35.4
              1024 [16]             0.5%              3.5%               97.9         17.8              21.0
Mouse         1024 [8]              1.7%              6.2%              322.3         18.2              15.9
              1024 [16]             1.7%              7.5%              302.8         17.1               8.4
Chameleon     1024 [8]              1.8%              5.2%              338.1         17.8              24.2
              1024 [16]             1.8%              7.2%              322.8         17.0              13.3

Table 5.3 — Results for the volume data sets regarding memory usage and rendering performance. The columns denote the resolution of the bounding volume with brick size in brackets, percentage of non-empty voxels, percentage of non-empty bricks, total amount of memory, number of voxels stored per byte, and the frame rate.

5.5.3 Global Effects Rendering
In Table 5.1, the rendering performance is shown for scenes with global effects enabled. Since the inclusion of refraction and reflection requires the utilization of the additional ray buffer, the frame rate for the Buddha and the dragon scenes drops to a mere 15% and 35% of the plain rendering frame rate for bounding volume resolutions of 256 and 1024, respectively. At higher volume resolutions, the actual costs of the ray traversal outweigh the impact of the ray buffer. With shadow rays enabled, the performance drops to about a third due to the increase in the number of rays as well as the cost of storing the shadow rays in the additional ray buffer. Different upper limits of ray steps maxsteps per rendering iteration were also tested, without a noticeable difference between 256, 512, and 1024 steps.

5.5.4 Limitations
In its current state, the implementation has several shortcomings, which are partly due to the inherent nature of voxels and partly due to unoptimized code. For the voxelization, the biggest performance issue is the slow speed for updates in every frame for some scenes (cf. Table 5.2). This, however, could be overcome by inserting the fragments immediately into distinct bricks instead of the global buffer.
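A possible realization of this idea is sketched below: fragments are appended to small per-brick lists at insertion time, so that a later sort only has to order the fragments of one brick at a time. Types and names are illustrative; the current implementation keeps a single global buffer.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One voxelized fragment; the payload reference points to shared data such as
// color, normal, or material, mirroring the PVLL data layout described above.
struct VoxelFragment {
    uint32_t voxelIndex;   // linear voxel index inside the owning brick
    float    depth;        // sort key along the voxelization axis
    uint32_t payloadRef;   // link into the separate content buffer
};

struct BrickedFragmentStore {
    explicit BrickedFragmentStore(int brickCount) : perBrick(brickCount) {}

    // Insertion goes directly into the owning brick instead of a global buffer.
    void insert(int brickIndex, const VoxelFragment& f) {
        perBrick[brickIndex].push_back(f);
    }

    // Only the fragments within each brick have to be sorted, and the bricks
    // can be processed independently (and therefore in parallel).
    void sortBricks() {
        for (auto& list : perBrick)
            std::sort(list.begin(), list.end(),
                      [](const VoxelFragment& a, const VoxelFragment& b) {
                          return a.depth < b.depth;
                      });
    }

    std::vector<std::vector<VoxelFragment>> perBrick;
};
```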
With such a per-brick scheme, only the fragments within each brick have to be sorted, which should improve the overall performance.

Another limitation is that reflection, refraction, and shadow generation for geometries rely on the voxel representation, since rays only hit cube-shaped voxels. Thus, the PVLLs approach cannot keep up with other ray tracing programs, e.g. NVIDIA OptiX, neither in terms of computation time nor in terms of image quality.

5.6 Summary and Outlook
For future research, the ray traversal can be optimized further by employing adaptive space subdivision methods like BSP trees or kd-trees instead of uniform bricks. This might also lead to a reduced memory footprint, thus making the approach more attractive for even larger volumes. Additionally, the implementation of level-of-detail mechanisms could improve the performance for large proxy geometries. Here, the idea is that voxels in the back of the cube which are not traversed by the rays could be omitted, potentially speeding up the traversal step. By achieving a higher performance in the rendering phase, the number of secondary rays can be increased, enabling effects like caustics, sub-surface scattering, or diffuse mirroring. The memory footprint could be reduced further by using a tighter bounding box scheme which is aligned and fitted to the sparse data, thereby eliminating the empty space within the bricks.

Section 4.4 details how to use PPLLs for distributed and remote rendering. Since the PVLLs use a similar data structure, it might be possible to extend them to make use of multiple distributed rendering engines as well. Similar to the multi-variant rendering of molecular surfaces (cf. Section 4.6.2), distributed PVLLs could be used for a comparison of different volumes.

The PVLLs approach presented in this chapter can be used for rendering sparse volume data as well as for voxelization and rendering of geometry, including reflection and refraction, at interactive frame rates. PVLLs are the only approach that can handle arbitrary representations using voxelization. The approach adapts the original A-buffer concept of PPLLs and extends it to store the entire scene in three dimensions. The performance results show that PVLLs reduce the memory footprint of sparse volume data while still being interactive.

Chapter 6
Using Mobile Devices for Interaction with Visualizations

Figure 6.1 — X-ray scenario with the VdW representation displayed on the VISUS power wall and the inner ball-and-stick representation rendered on the mobile device.

With the larger data sets, the visualizations also get more complex and more detailed. The last component of the visualization pipeline is the user, viewing and interacting with the presented data. There are established interaction methods, for example mouse and keyboard for workstations or tracking systems for large display walls. This chapter will describe mobile devices as an alternative to these methods. Mobile devices feature a touch-enabled display which can be used as an input device as well as an output device. The input is similar to a touch pad, and in combination with the display it can realize different user controls: buttons, sliders, menu lists, and so on. This allows the user to perform simple tasks like navigation as well as complex tasks like direct selection or manipulation of data.
Moreover, the controls allow the user to operate the system and modify all parameters in a user-friendly manner. As an output device, it can be used as an additional screen, displaying contextual information or other visualizations. The built-in sensors allow new interactions, for example an x-ray-style looking glass for complex data sets (see Figure 6.1).

Parts of this chapter are based on the paper "Remote Rendering and User Interaction on Mobile Devices for Scientific Visualization" [Krone et al., 2015].

6.1 Motivation
Interacting with a visualization system can be challenging, especially if the system uses more than the two planar axes of the screen. When using a standard mouse, the screen area is mapped to the surface the mouse is moving on. The third dimension is then controlled, e.g., by the mouse wheel. Alternatively, a combination of mouse and keyboard can be used, mapping the depth axis onto two buttons. A multi-touch enabled smartphone or tablet, however, offers a wide range of interaction possibilities by using gestures. Using this type of device allows the user to perform similar actions on the large visualization in the same way he is used to from mobile devices.

Additionally, the touch pad of the mobile device is also a display, which enables the user to view further information or a completely different visualization. Feedback can be given by sound or vibration, so the user is not necessarily distracted from the visualization he is looking at.

For large displays like the VISUS power wall, there is also the possibility of physical navigation by the user. Standing at a certain distance allows the user to gain an overview of the scene, and he can then walk directly to the interesting part. Studies by Ball et al. [2007] suggest that users tend to navigate physically, e.g. by walking or head movements, even if they have the ability to directly interact with the visualization, e.g. by zooming. Still, the mobile display can be useful for the viewer, e.g. for parameter control. For some types of power wall setups, the diffuser must not be touched, so it cannot be used as a touch screen for interaction. Using mobile devices with these types of power walls retrofits this capability, allowing direct user input.

Additionally, a mobile device can either be used for the direct interaction or to keep an overview of the scene even when the user stands directly in front of the display. It could display the total scene, any specific part, or additional information, so that the user does not have to wander around to inspect different parts, e.g. for a visual comparison.

Mobile devices are usually powered by their own battery and communicate via wireless technology, allowing the user to move freely in front of the display device. Additionally, the device can easily be passed to other users, and it does not require a flat surface like a mouse to be operated on. The graphical capabilities of the devices allow them to be used as a second screen, also displaying advanced visualizations like on the large display. These visualizations can either be streamed or rendered on the device itself, depending on the hardware and the complexity of the visualization. This implicitly enables remote visualization on the device and the option to take the visualization home by storing all necessary data on the device for later use.
6.2 Related Work
Interaction with visualization systems is a wide topic and there exists a plethora of methods [Badillo et al., 2006; Khan, 2011]. Ni et al. [2006] survey interaction techniques and devices for large displays, with a focus on tracking and gesturing. Tracking devices use camera-based systems that can detect special markers or track the user(s). For gesture input, the users may use devices fitted with accelerometers, effectively recording the movement and using this information for the input. Cockburn et al. [2009] give a survey of "overview+detail, zooming, and focus+context interfaces", aiming to review and categorize the state of the art of general interface concepts that also apply to large displays.

Hardy and Rukzio [2008] and McCallum and Irani [2009] use mobile devices for cursor positioning on large displays. Their approach uses the touch system for limited mouse input only, while the approach presented here is more flexible and can also be used as a second screen. Cheng and Zhu [2014] demonstrate a tablet-based system. The user can select a viewport, and this part of the screen is then displayed on the tablet, which serves as an output device only.

For remote visualization on smartphones, standard desktop applications like VNC, Microsoft Remote Desktop, or TeamViewer also have versions for mobile devices. Eissele et al. [2009] present a system for mobile devices which can display additional information and blend it into the camera view, creating a type of augmented reality vision. The system is capable of streaming or local rendering of context-aware visualizations. Lamberti and Sanna [2007] demonstrate a streaming-based visualization framework for mobile devices, using the MPEG codec for streaming complex 3D models in motion. Wessels et al. [2011] describe how to use WebSockets for remote visualization on smartphone hardware. Hoffmann and Kohlhammer [2009] detail a portable remote visualization framework that can easily be integrated into existing visualization applications but focuses on portability and visualization instead of the interactive focus of the system presented here.

Commercial and industry-grade frameworks also support remote visualization on mobile devices. The FEI Visualization Sciences Group developed RemoteViz (http://www.vsg3d.com/open-inventor/remoteviz, accessed 29.06.2015) for the Open Inventor framework. Dragon HPC offers the Remote Visualization Platform (http://dragon-hpc.com/remote-visualisation.php, accessed 29.06.2015) for their services, which also supports mobile device clients.

In contrast to the previous works and available systems described above, the system and concepts described here aim to include both the interaction and the visualization on the mobile device. Additionally, the collaborative part is an important aspect, broadening the scope of the presented system. Of course, the system presented here builds upon and extends the concepts explained in the previous work.

6.3 Technical Foundation
In the context of this thesis, the system consists of a large visualization wall driven by a number of rendering nodes. Additionally, an operator node is used to steer the visualization on the power wall. For the mobile devices, a number of wirelessly connected mobile phones or tablets are assumed. Any machine or mobile device connected to and synchronized with the system can receive the visualization or be used for interaction.
6.3.1 Data Flow
Although the visualization might have a plethora of parameters to modify, many of them might not be of interest for a presentation and are fixed beforehand when preparing the presentation or in a collaboration scenario. Therefore, the few remaining parameters can be put into special input categories displayed alongside the visualization, so the user can access them without searching through sub-menus. This provides fast and comfortable access for the user. As an example picture of the prototype, Figure 6.2 shows the user interface of the mobile device with a ball-and-stick molecule representation and buttons to access parameters.

Figure 6.2 — Mobile device user interface with the model in the center and buttons for quick settings on the right.

To minimize the risk of invalid input values, the available parameters can be restricted so that they cannot be changed by the remote application. This also applies to the data set parameters. The application only presents files to the user which are compatible with the current visualization, avoiding possibly unpleasant situations during the presentation.

Establishing Connection
When the mobile device application is started, there are different ways to connect to the visualization system. For security reasons, a password can be required to prevent unauthorized access.

Automatic Invitation: A fixed IP address or a network broadcast containing this information can be used for the device to connect. This allows a fully automatic configuration of the mobile device without input from the user.

Semi-Automatic Invitation: Another option is to use machine-readable information like QR codes (http://qrcode.com/, accessed 29.06.2015) to provide the connection details. In the presentation scenario, the power wall can display the QR code and the user uses the application to scan it (cf. Figure 6.4a). In the collaboration scenario, the QR code can be shared via file transfer or email and then read using the mobile device. Alternatively, the code can be printed, too. The code provides all information for the application to connect to the visualization system. Of course, any other machine-readable system can be used as well.

Manual: Additionally, there is always the option for the user to manually enter the connection details. This can also be used as a fallback if the other methods fail to work.

If the network setup allows it, the automatic invitation is the most comfortable variant for the user. This, however, is a public announcement, so every compatible client can connect and access the visualization. When access needs to be restricted or broadcasting the invitation is not possible, the semi-automatic invitation should be used.

Display to Device
For an interactive viewing of a visualization, the application typically offers a wide range of interaction possibilities for the user. These include changing the data set, the filtering settings, the mapping, and the rendering configuration including the viewport and camera settings. In short: the user should be able to interact with all four stages of the visualization pipeline (cf. Chapter 2.4). In this context, the application delivers its current state to the mobile device and the user has the ability to change this state.
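As an illustration of this state exchange, the following sketch shows how a parameter update coming from a mobile client could be represented and validated against the set of parameters that were exposed for remote editing. The message layout and class names are assumptions for illustration, not the prototype's actual protocol.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical parameter update sent between a mobile client and the host.
struct ParameterUpdate {
    std::string name;      // e.g. "render::opacity"
    std::string value;     // serialized value, parsed by the receiving module
    uint64_t    revision;  // monotonically increasing, to drop stale updates
};

// Applies remote updates only for parameters that were exposed beforehand,
// mirroring the restriction of editable parameters described above.
class ParameterGate {
public:
    explicit ParameterGate(std::set<std::string> editable) : editable_(std::move(editable)) {}

    bool apply(const ParameterUpdate& u, std::map<std::string, std::string>& state) const {
        if (editable_.count(u.name) == 0)
            return false;              // parameter not exposed: silently ignored
        state[u.name] = u.value;
        return true;
    }

private:
    std::set<std::string> editable_;
};
```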
For the interaction itself, the prototype application makes use of two configurable parameter bars on the left and right sides of the screen. The user can put parameters there for quick access, so the user interface is not overloaded with unnecessary controls.

Device to Display
When the user modifies the application state, the mobile device communicates the changed parameters to the host that is in control of the large visualization. For a local rendering on the mobile device, however, the user is able to change its parameters without altering the large visualization.

6.3.2 Device Localization
To have an intuitive way of aligning the view of the mobile device with the user's perspective, two different methods are proposed. Alternative device localization methods, for example optical or radio-based tracking systems [Hightower and Borriello, 2001; Gupta et al., 2013], are out of the scope of this thesis. Figure 6.3 shows an example of two users and their mobile devices viewing the same data set. The view transformation derived from the sensors or camera ensures a consistent view for the users while wandering around.

Figure 6.3 — Two devices showing different view angles of the same data set. The view transformation measured by the internal motion sensors or derived from the camera is applied to the internal view matrix.

Motion Sensor Localization
A simple way of aligning the view of the device with the big display is using the internal sensors of the mobile device, in this case the acceleration sensors and the gyroscope. For the initial view, the user holds the device parallel to the display, and the subsequent motion is tracked by the sensors (cf. Figure 6.3). The motion is accumulated and the view on the mobile device display is updated accordingly. To counter sensor drifting issues, the user can again hold the device as for the initial view, recalibrating the sensors. Every parameter change from the external system, like camera changes on the large display, is still applied to the local visualization, giving the user the same perspective on the data set when looking through the mobile device as when looking at the display directly. For fine-tuning, the user is still able to move, rotate, or zoom the camera by hand.

Photo-based Localization
Another possibility for device localization is to use the camera many devices have attached. When holding the device, the camera takes a picture of the current view on the large display and the image is transferred to a node which can access the large visualization, e.g. the operator node (cf. Figure 6.4b). The image taken is then compared to the actual view and a homography matrix is calculated, e.g. by using the functions provided by the OpenCV framework [Bradski, 2000]. From this matrix, a camera transition can be calculated [Klein, 2006] and the transition matrix is sent back to the mobile device. Now the view of the user is aligned with the view on the device, e.g. allowing x-ray-like visualizations (cf. Figure 6.1). When the user moves the device, the view can be updated using the motion sensor approach explained above. Here again, manual input can be used for fine-tuning. To keep drifting errors caused by noisy sensor information at a minimum, the photo-based approach can be applied periodically.

Figure 6.4 — QR-code based connection setup (a) and schematics of the camera-based localization approach for mobile devices in front of a large display setup (b).
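The OpenCV part of this pipeline could look roughly as follows. The feature detector, the intrinsic matrix K, and the selection among the returned motion hypotheses are application-specific assumptions, and the function names are standard OpenCV calls rather than the prototype's actual code.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimates candidate device motions from the image currently shown on the
// large display and the photo taken by the mobile device (cf. Figure 6.4b).
bool estimateDeviceMotion(const cv::Mat& wallImage, const cv::Mat& devicePhoto,
                          const cv::Mat& K,                      // camera intrinsics
                          std::vector<cv::Mat>& rotations,
                          std::vector<cv::Mat>& translations)
{
    // Match features between the displayed frame and the photo.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);
    std::vector<cv::KeyPoint> kpWall, kpPhoto;
    cv::Mat descWall, descPhoto;
    orb->detectAndCompute(wallImage, cv::noArray(), kpWall, descWall);
    orb->detectAndCompute(devicePhoto, cv::noArray(), kpPhoto, descPhoto);

    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(descWall, descPhoto, matches);
    if (matches.size() < 4)
        return false;                                            // homography needs >= 4 points

    std::vector<cv::Point2f> src, dst;
    for (const cv::DMatch& m : matches) {
        src.push_back(kpWall[m.queryIdx].pt);
        dst.push_back(kpPhoto[m.trainIdx].pt);
    }

    // Robust homography between the display content and the photo.
    cv::Mat H = cv::findHomography(src, dst, cv::RANSAC);
    if (H.empty())
        return false;

    // Decompose into possible rotation/translation pairs; picking the physically
    // plausible solution (e.g. via the plane normal) is left to the caller.
    std::vector<cv::Mat> normals;
    cv::decomposeHomographyMat(H, K, rotations, translations, normals);
    return !rotations.empty();
}
```

In the setting described above, such a computation would run on a node with access to the rendered frame, e.g. the operator node, and only the selected transformation would be sent back to the device.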
6.3.3 High-Scale Visualization on Mobile Devices
This section briefly describes different possibilities for displaying the content of the large display wall on a mobile device. Basically, the difference is whether the rendering takes place on the local device or on a remote system which streams only the final images.

Local Visualization
As the graphics capabilities of today's mobile devices are very advanced, the devices themselves can be used for rendering the data. The graphics chips support the operations necessary to render particles or volumes on the mobile device. The limitations here are the power consumption and the performance of the mobile graphics chips. However, the PPLLs (cf. Chapter 4.3) and PVLLs (cf. Chapter 5) algorithms cannot be ported directly to current mobile devices running OpenGL ES, as important atomic functions and buffer storage methods are not supported. The standard approaches for mesh or volume rendering are supported and can be used for visualization. For the local visualization, the data is delivered by the remote system. It can either be streamed, e.g. for in-situ simulations, or downloaded at once, e.g. for data that is already available. Alternatively, files stored on the device can be used for the visualization.

Remote Visualization
The other possibility is to create the images for the mobile devices on a remote system and only send the final images. Without a dedicated viewing application, this reduces the complexity on the mobile side, but it also removes the option for the user of the mobile device to change the visualization settings locally. The visualization is created on the rendering system, though it may accept parameter inputs from the mobile devices. However, these parameters are then applied to all viewers of the current visualization.

Video streaming has become a very active field in recent years. Video on demand, TV streaming, and video websites have pushed the development of new codecs which can adapt to the current network situation. The network is the main bottleneck for remote visualization using video streaming. For streaming the visualization, the frame buffer is read back to host memory and the encoding is then done on the CPU before the result is sent to the mobile device. Using multicast, the network load can be reduced, as messages are not duplicated for multiple receivers on the same path. Using standard video codecs, it is possible to use any remote video playback application without the need for a dedicated application on the mobile system. Thus, this approach is most suitable for a broad user spectrum with heterogeneous mobile devices.

When streaming data or video, the available bandwidth per device has to be taken into account. Video codecs like H.264 [Wiegand et al., 2003] can adapt to changes in bandwidth and still provide a live stream, even if detail and quality are reduced. When streaming data, however, there is usually no such mechanism, and a reduced bandwidth also bogs down the frame rate. Another option would be to stream still images for low-end devices. However, this method suffers from a high delay which is caused by the image encoding on the CPU of the rendering node.
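A minimal sketch of this readback-and-encode path is given below; glReadPixels is the standard OpenGL call, while FrameEncoder is a placeholder for an H.264 (or still-image) encoder and the network transport, which are not prescribed by the text above.

```cpp
#include <GL/gl.h>
#include <cstdint>
#include <vector>

// Placeholder for the CPU-side encoder and sender (e.g. wrapping an H.264
// encoder and a multicast socket); not part of the actual implementation.
struct FrameEncoder {
    void encodeAndSend(const std::vector<uint8_t>& rgba, int width, int height);
};

// Reads the finished frame back from the default framebuffer into host memory
// and hands it to the encoder, as described for the streaming approach above.
inline void streamCurrentFrame(FrameEncoder& encoder, int width, int height)
{
    std::vector<uint8_t> pixels(static_cast<std::size_t>(width) * height * 4);
    glReadBuffer(GL_BACK);
    glPixelStorei(GL_PACK_ALIGNMENT, 1);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, pixels.data());
    encoder.encodeAndSend(pixels, width, height);   // encoding happens on the CPU
}
```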
Additionally, as this is not a common practice, a specialized application is necessary for the mobile devices. 6.3.4 Limitations The described system has some implicit limitations. First, there has to be a reliable network connection between the device and the host system to provide a smooth user experience. For multiple devices which stream the video 118 Chapter 6 • Using Mobile Devices for Interaction with Visualizations content, this can become a bottle neck, depending on the network infrastructure. When only parameter changes are communicated, less network load has to be transferred, allowing more machines or mobile devices to participate. For transmitting the data set, a fast connection is favorable, but a short startup delay is easier to get over with than a stuttering visualization. Besides the bandwidth required for such a system, the latency also plays a role. The user input has to be transferred to the controlling systems fast enough to provide an interactive or near-interactive experience. Second, for the local visualization, device hardware must be capable of render- ing the data. While every device should be able to play video or display images, an interactive rendering needs sufficient graphics capabilities. When rendering local data, the user needs to have access to this data which might not always be in the interest of the owner of the data. Using remote visualization, e.g. via streamed video, the user can interact with the visualization but cannot access the data directly. The third aspect is that the user has to hold the device, thus only one hand can be used for interaction on the touch display. This is sufficient for the basic touch or button inputs but more complex interaction methods or additional controllers cannot be used easily. 6.4 Mobile Device Interaction Concepts for Visualization In the following, an example visual analysis application is presented together with concepts of how to use mobile devices with that application. The applica- tion was first presented in Krone et al. [2014b]. 6.4.1 Visual Analysis Application for Dynamic Protein Cavities and Binding Sites The simulation of proteins can provide important insights into molecular mech- anisms on an atomistic level. Visualization is a vital method to analyze the simulation data since it illustrates the location and shape of the molecules, or it delivers an abstract representation which focuses on a certain property. Animations of molecular simulations show the dynamics and conformational changes. However, more advanced analysis methods and the visualization of their results can improve understanding significantly and allow researchers to detect certain phenomena that are not obvious from the raw data. 6.4 • Mobile Device Interaction Concepts for Visualization 119 N Figure 6.5 — Visual analysis application with the views of the protein, sequence diagram, cavity diameter and area, and relational graph showing the evolution of the cavities over time. Figure 6.5 shows a visual analysis application that uses a set of automatic and semi-automatic feature extraction and analysis methods, visualizes the results in different ways, and supports user controls to parameterize the analysis and filter the results. It adheres to the “visual analytics mantra” by Keim et al. [2008]: “analyze first – show the important – zoom, filter and analyse further – details on demand”. A similar definition for visual analytics was given by Thomas and Kielman [2009]. For spatial data, this is generally referred to as visual analysis. 
According to Kehrer and Hauser [2013], a tight integration of the visual mapping, visual interaction concepts, and computational analysis are crucial for a visual analysis application. The goal of the visual analysis process is to gain knowledge or a deeper understanding by visual examination of a data set. In this case, the input is dynamic protein data, mainly obtained by molecular dynamics simulations. Topology files give the number of atoms, their type and spatial relationship, and additional meta information about the atoms or the protein. A trajectory file provides the path of all atoms over the simulated time. Alternatively, the current atom positions can be received from a running simulation. From this information, the application extracts a visual representation of the protein, e.g. a molecular surface. The application is an extension to the method of [Krone et al., 2013] that uses the molecular surface mesh to extract cavities such as pockets or channels as follows: The cavities are classified and centerlines are extracted for all cavities. This allows the user to investigate the inner structure of the protein, e.g. the channel diameter. The cavities provide evidence of possible binding sites of the protein – connections for other proteins to dock on. Chemical information about binding sites can be added and mapped to the visualization. All this extracted information can be explored using brushing and linking in their respective views. For example, when a user selects a binding site 120 Chapter 6 • Using Mobile Devices for Interaction with Visualizations in a diagram, it is also highlighted on the mesh. The user can now determine whether the binding site is located in a pocket. With this system, the domain expert is provided with context to the selected information and features. 6.4.2 Mobile Device Interaction Visual analysis applications give the user a rich environment where he can view, browse, filter and interact with the data. Brushing and linking [Doleisch, 2004] is one method to interact across different views and windows. A mobile device can be used to control the visualizations, e.g. as a so called second screen. A second screen extends the visual area, allowing the developer to add content for the user. For the user, the information is accessible without the need to switch views or reconfiguring the application. Especially the touch ability of a mobile device, or on a desktop touch monitor, together with a visualization can improve the user experience. The concepts described here can be used for a desktop scenario using standard workstations as well as for a setting which uses large displays such as power walls. Basically, there are two different types for interaction using mobile devices: Firstly, using the device for input only and, secondly, using the visualization and input capabilities of the device together. For the input-only method, it serves as a touch-pad-like device with contextual controls. In this case, it is used in combination with a computer system which is controlled by the device. The second method adds the opportunity to use the mobile device as a stand-alone system. Both the mobile device and the main visualization system are still connected but also are only loosely coupled, e.g. by using the same data but not necessarily the same visualization or the same parameters. In the context of the visual analysis application (cf. Section 6.4.1), the mobile device could display the graphs while the main display shows the visualiza- tion. 
Or different visualizations of the protein can be shown on both displays, allowing the user to see different aspects of the data set. Having the device and application connected, input concepts like brushing and linking can also be applied across displays, allowing the user to interact on the mobile device and the workstation. In the following, two different usage scenarios are described: a single user work- ing with the mobile device and a collaboration scenario where multiple users have access to the same visualization system with their mobile devices. 6.4 • Mobile Device Interaction Concepts for Visualization 121 Single User In a presentation in front of a power wall, the mobile device can be used for simple tasks like forwarding the slides or displaying text for the speaker. Basically, it works as the control for the presentation slides. As this is a standard feature for presentation applications, the mobile device can also be used for controlling the non-static parts of the presentations, e.g. by switching the applications or by embedding the presentation controls in the application. Touch controls allow the presenter to steer and modify the presented visual- izations. Having the visualization from the large display also available on the mobile device, the presenter knows what the audience is seeing.Visual updates like parameter changes happen on both devices, thus the presenter has no need to turn away from the audience. The mobile device can be used to display location and view dependent data. For instance, when in front of a powerwall, the mobile device displays the same scene but in a zoomed-in view. When the device and the visualization on the larger display are synchronized (cf. Section 6.3.2), the device can be used as a interactive magnifying glass, displaying other color maps, additional information, or a completely different visualization. For example, when having the protein surface displayed on the power wall, the mobile device can be held up and show the interior ball-and-stick representation, or highlight internal cavities, effectively creating an x-ray metaphor. In this context, the data can either be streamed to the device and rendered there, computed by a dedicated rendering node or be taken from the visualization directly, scaled according to the size of the mobile display. The specific option is implementation dependent and strongly relies on the hardware environment. Besides the local usage of the device, it can also be used remotely. As long as the connection details are known, the system can connect and receive the data or stream the video. Once a connection is established, the data transferred to the device can be stored there and the visualization can be used even when the connection is terminated. This could be used for conferences, presenting a topic and then sharing the data and visualization with the audience. Or it can be used for public events such as open house. Interested viewers might want to take the visualization home after they have seen the demonstration. Collaboration Basically, the techniques in for the single user can also be applied when multiple users are participating. Of course, the mobile device can also be passed around when multiple users are in front of a display. For collaboration, this makes the 122 Chapter 6 • Using Mobile Devices for Interaction with Visualizations interaction easier than with a large keyboard or a mouse which always needs a flat surface. 
When every user has his own device, all of them can be connected to the visualization system. Here, each device can be used coupled and synchronized, or decoupled from the larger display. The options from the single user case above apply here, too. Each user can work with his own device and visualization, e.g. walking in front of the display and using it as a magnifier glass. Having multiple users connected to the visualization system, the settings can also be shared among them. Points of interest can be inspected together, and the parameters and views can be shared, thus enabling an in-depth collaboration experience. Another scenario is remote collaboration. The connection details like the QR code can be made available remotely, e.g. by sending it per email or on a website, and the remote user can connect to the visualization system. For a fully interactive experience, the network has to support the required bandwidth and also provide a low latency. By using open standard technologies, client applications for all kinds of comput- ers or devices can be implemented. While workstation computers are present in most offices, mobile devices are also powerful enough to be used as remote clients. All the user needs is such a device and Internet connection to participate in a collaboration scenario. The ubiquity and simplicity of mobile devices is a big convenience factor here. 6.5 Summary and Outlook This chapter discussed concepts of how to use standard mobile devices for interaction with visualization systems. Although the mobile systems are not as powerful as the graphic hardware in a workstation, they can still deliver a good visual impression of the data. Remote rendering allows delivering the same visual effects and quality as provided for a workstation display, assuming that the network connection can provide the required bandwidth and low latency. The touch display of the mobile devices allows users to control the visualization and other sensors like camera and gyroscope allow for an accurate positioning in front of a display. This allows for displaying the same view on both the mobile device and the larger display, allowing for a magnifying-glass like view. Mobile devices are a relatively new class of devices. The system described in this chapter uses a client-server approach to deliver the visualizations. Another approach is to use participating devices in a cluster-like setting, each one 6.5 • Summary and Outlook 123 rendering a specific part of the scene and sharing it with all other devices. Using a broadcast-style network architecture (like a wireless local area network), the data could be sent to all other devices at relatively low cost, using compression mechanisms and detail reduction when necessary. With respect to the pace of the development in mobile device hardware and software, the mentioned limitations might be obsolete in the near future. This means that what runs today on a workstation might be computable on a mobile device in the near future, but due to new rendering techniques or more detailed data the proposed remote visualization methods will still be relevant. However, as stated in the motivation of this thesis, the data sizes will most likely continue to rise as well as the computation capabilities of the computers, clusters and mobile devices. So there will still be a need for the presented methods and new developments in this field. The Google Glass project4 and the Occulus Rift project5 show possible devel- opments in the area of mobile devices. 
Whereas the Google Glass projects information into the field of vision of the user, the Occulus Rift project provides immersive virtual reality experiences by using one separate display region per eye, creating a stereo effect. Both devices connect to a host, a workstation computer or mobile device, which delivers the data. While the mobile devices are not yet powerful enough to drive the Occulus Rift glasses, they might be in the future. This would enable a much deeper immersive experience for the user. For the Google Glass project, they could be used to display additional informa- tion. E.g. a region selected by other users, a different viewing mode, or text displaying statistics with respect to what is in the field of view of the user. The camera in the Google Glass can be used to perform a more steady tracking of where the user is looking at in the picture. However, as for now, the Google Glass project offers only limited input capabilities, e.g. by speech. As said above in the Google Glass example, users could start to interact with each other or the visualization displayed on each other’s devices. Therefore, they need not necessarily be at the same location but can work together remotely. Besides the interaction possibilities mentioned above, the mobile device could show what the other users are seeing, their annotations or highlighted regions, or the like. This immediately suggests the transfer of settings between users. In the current setup, all users modify the same settings of the visualization shown at the display wall. For the remote cooperation scenario where each user has his own instance of the visualization or uses the mobile device (with the reduced 4 https://www.google.com/glass (accessed 29.06.2015) 5 http://www.oculusvr.com/ (accessed 29.06.2015) 124 Chapter 6 • Using Mobile Devices for Interaction with Visualizations variant of the data) for visualization, the transfer of the visualization settings could be implemented. There is also ongoing research in the area of hands-free interaction. This could be combined with the glasses mentioned above, giving the user an interactive and immersive feeling of the visualization. Last but not least, holographic projections might advance the stereoscopic projection of today. Here, the users could walk around the projection and view it from different angles simultaneously, without the need of an additional mobile device. Interactions could be based on reaching into the projection, giving the user the feeling of a real hands-on interaction. C h a p t e r 7 Conclusion While a detailed outlook and conclusion of the topics was given in the respec- tive chapters (see Sections 3.5.2, 4.7.1, and 6.5), this chapter will describe an integrated visualization system using the techniques and concepts introduced in chapters 3–6. Additionally, an outlook for future work and a summary of the thesis is given. 7.1 Big Picture An integrated system using the parts presented in this theses could be completed as follows: an operator node for controlling the system, a visualization cluster based on render nodes equipped with a powerful graphics cards, a large display wall, driven by a subset of the cluster nodes, and user devices like mobile devices or workstations. The system is also connected to the Internet, making it available for remote visualization (see Figure 7.1). In this scenario, the data to be visualized comes from a black box. 
So it can either be a replay of pre-computed simulation or computation files or it is an in-situ visualization of a live computation or simulation. 7.1.1 System The software back end of this system is powered by DIANA. Small daemon applications run on the cluster nodes, receiving commands or parameter up- dates, and returning the computation or visualization result. The power wall 126 Chapter 7 • Conclusion Mobile Devices & Workstations Operator Node Power wall DIANA Interface PPLL Buffers Control & User Input Visualization Load Balancer Cluster & Simulation Black Box Displays N Figure 7.1 — Example scenario of how DIANA (cf. Chapter 3), PPLLs (cf. Chapter 4), and the mobile device interaction (cf. Chapter 6) can be used together in a visualization system. The operator node steers the visualization on the powerwall system. The simulation or computation is depicted as a black box system (upper right) which all cluster nodes can access. Mobile devices or workstations can connect to the operator node to participate in the presentation, or receive data or visualization streams. nodes use DIANA to connect to the visualization nodes and to provide a control interface for the operator. The operator node also has a connection to the visualization nodes for its own visualization (alternatively, the local system can be used for rendering) and controls the power wall nodes. It also features a DIANA interface so external clients like the mobile devices can connect to the system and retrieve data or stream visualizations. Different applications can connect to the daemons, thus some sort of load distri- bution has to be established. For example, when the large display powerwall is switched on, the driving nodes connect to the load balancing machine and in re- turn get the connection details for a DIANA instance which has enough resources to handle it. The node-assignment does not have to be fixed and can vary over time. In case of the power wall nodes, one can imagine that a render node works exclusively for its respective client to deliver maximum performance. However, this is only an illustrative example. The realization highly depends on the implementation, hardware situation, and the visualization itself. A main application controlling the whole system runs on the operator node. Its purpose is to initiate the startup of the applications on the cluster nodes and powerwall nodes. It controls the visualization on the large display and can also be used for all clients like the mobile devices to connect to this machine. This allows the operator to restrict the access to the visualization or override parameter choices of the users. External applications running on mobile devices or workstations also connect 7.1 • Big Picture 127 to the operator node. It distributes further internal information like how to connect to the load balancer, how to request access to a rendering nodes, or how to retrieve the data. As it is connected to the Internet, collaborations or presentations with partners world-wide can be supported. The interaction methods presented in Chapter 6 also apply in this scenario. 7.1.2 Operating Procedure On startup, the daemon loads the available plugins and the operator node controls the further actions like loading the data set or setting the visualization parameters. An application, either the power wall display node, the workstation, or a mobile device, simply requests a frame from the render nodes and displays it. For the rendering, the distributed PPLLs algorithm can be used. 
It is flexible, easy to use, provides distributed OIT rendering, and is capable of rendering all sorts of data like geometry meshes, volumes, or implicit surfaces. Upon request, the visualization daemon sends a PPLLs buffer to the client which can then do further operations, e.g. merging with other buffers or post-process it. Finally, the visualization is displayed on the devices. However, any other remote rendering compatible algorithm can be used in this scenario, too. Some sort of synchronization protocol (cf. Section 4.4) could be used to ensure the exact same time step for each display. While it is necessary for a multi-display device like the power wall to have a consistent and artifact- free image, it might not be of much use for a multi-client system. The reason is that the slowest client would dictate the frame rate if all displays would be synchronized. Therefore, for multi-client systems the synchronization needs could be relaxed. 7.1.3 Implementation For the implementation of the DIANA plugins, the existing PPLLs applications can be used and wrapped in DIANA code. On the mobile side, DIANA can be added to the existing applications. If the mobile devices are powerful enough and support the necessary operations, the PPLLs algorithm can also be ported directly. Otherwise, the remote visualization and remote rendering methods mentioned in Chapter 6 can be used. However, as the size of the buffers is comparatively large here, a streaming video visualization or local rendering might be the better choice here, especially if the network bandwidth is limited. The load balancing process can be implemented straight forward, using a fixed resource allocation for static assignments like the render nodes and a round- robin procedure for dynamic clients, e.g. the mobile devices or additional 128 Chapter 7 • Conclusion workstations. If this simple system is not efficient enough, more sophisti- cated load balancing strategies [Willebeek-LeMair and Reeves, 1993] could be employed, too. 7.2 Future Work The presented prototypes of this work were mainly used for visualization of biomolecules or FTLE visualization (cf. Section 3.4). Obviously, the system can also be used for other types of visualization. For volumetric rendering, the PPLLs can be used in the same way the PVLLs are combining geometry and volumes (cf. 5.2.2), and in combination with the distributed rendering, standard volume bricking methods [Parker et al., 1999] can be applied. For standard volume rendering, there might not be an advantage but when combining the volumes with mesh data, e.g. the volume representing the air flow around a meshed object, both rendered with transparency effects. Therefore, the PPLLs and PVLLs are an ideal basis for a generic rendering framework. They both can handle arbitrary input data and, in the case of the PVLLs, can even be used for real-time ray tracing effects. The current implementation of the display side of the PPLLs approach requires modern graphics memory functions provided by OpenGL 4 and its extensions. These functions are not available on all hardware platforms, especially on mobile devices. Extending the algorithm to use texture memory storage could make this available for mobile devices in a remote rendering setting, too. For this setting, render nodes transfer the data in chunks, prepared to fit into the texture memory of the devices. The display algorithm on the mobile device then interprets the data to render the correct visualization. 
Some sort of synchronization protocol (cf. Section 4.4) could be used to ensure the exact same time step on each display. While it is necessary for a multi-display device like the power wall to have a consistent and artifact-free image, this might not be of much use for a multi-client system: if all displays were synchronized, the slowest client would dictate the frame rate. Therefore, the synchronization requirements can be relaxed for multi-client systems.

7.1.3 Implementation

For the implementation of the DIANA plugins, the existing PPLLs applications can be reused and wrapped in DIANA code. On the mobile side, DIANA can be added to the existing applications. If the mobile devices are powerful enough and support the necessary operations, the PPLLs algorithm can also be ported directly. Otherwise, the remote visualization and remote rendering methods mentioned in Chapter 6 can be used. However, as the size of the buffers is comparatively large in this case, a streaming video visualization or local rendering might be the better choice, especially if the network bandwidth is limited.

The load balancing process can be implemented in a straightforward way, using a fixed resource allocation for static assignments such as the render nodes and a round-robin procedure for dynamic clients, e.g. the mobile devices or additional workstations. If this simple scheme is not efficient enough, more sophisticated load balancing strategies [Willebeek-LeMair and Reeves, 1993] could be employed as well.

7.2 Future Work

The presented prototypes of this work were mainly used for the visualization of biomolecules or FTLE fields (cf. Section 3.4). Obviously, the system can also be used for other types of visualization. For volumetric rendering, the PPLLs can be used in the same way the PVLLs combine geometry and volumes (cf. Section 5.2.2), and in combination with the distributed rendering, standard volume bricking methods [Parker et al., 1999] can be applied. For standard volume rendering alone, this might not offer an advantage, but it does when volumes are combined with mesh data, e.g. a volume representing the air flow around a meshed object, with both rendered using transparency effects. Therefore, the PPLLs and PVLLs are an ideal basis for a generic rendering framework: both can handle arbitrary input data and, in the case of the PVLLs, can even be used for real-time ray tracing effects.

The current implementation of the display side of the PPLLs approach requires modern graphics memory functions provided by OpenGL 4 and its extensions. These functions are not available on all hardware platforms, especially not on mobile devices. Extending the algorithm to use texture memory storage could make it available for mobile devices in a remote rendering setting as well. In this setting, the render nodes transfer the data in chunks prepared to fit into the texture memory of the devices, and the display algorithm on the mobile device then interprets the data to render the correct visualization. Should future graphics APIs support the required functions, the algorithms can of course be used directly.

The current PPLLs system supports static and dynamic data sets. However, loading a new time step requires a complete rebuild of the data structures. The differential information between two time steps, or any other mechanism that provides a preview from the current time step to the next, could be used to update the current visualization without rebuilding it from scratch. For example, particle data with velocity information could enable the approach used for the two-sided transformation of PPLLs (cf. Section 4.4.3): the display node could update the particle representation and transfer it into another PPLL if necessary, or perform an interpolation or extrapolation instead of requiring a complete data update. This requires a tight coupling between the data-delivering process, be it pre-computed data or an in-situ simulation or computation, and the visualization; the visualization needs to know about the additional information and how to interpret the data. On the one hand, this requires more computation time on the display nodes. On the other hand, the overall amount of data transferred, and thus the time spent waiting on the network, could be reduced. Adaptive algorithms could be used to find the sweet spot depending on the current hardware and network architecture and the visualization load.
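A minimal sketch of such an update is given below, assuming the last fully transmitted time step shipped per-particle velocities. The Particle layout and the host-side loops are hypothetical and only illustrate the idea; a real implementation would run on the GPU and update or rebuild the PPLL/PVLL from the resulting positions.

    // Update particle positions between full time steps without a complete data transfer.
    #include <cstddef>
    #include <vector>

    // One particle of the last fully transmitted time step (hypothetical layout).
    struct Particle {
        float pos[3];
        float vel[3];   // velocity shipped together with the positions
    };

    // Extrapolate all particles by dt seconds on the display node. Called once per
    // displayed frame until the next full time step arrives and replaces this state.
    void extrapolate(std::vector<Particle>& particles, float dt) {
        for (Particle& p : particles)
            for (int i = 0; i < 3; ++i)
                p.pos[i] += p.vel[i] * dt;
    }

    // Linear interpolation between two known time steps (t in [0, 1]) as an
    // alternative when the next time step has already arrived.
    void interpolate(const std::vector<Particle>& a, const std::vector<Particle>& b,
                     float t, std::vector<Particle>& out) {
        out.resize(a.size());
        for (std::size_t i = 0; i < a.size(); ++i)
            for (int c = 0; c < 3; ++c) {
                out[i].pos[c] = (1.f - t) * a[i].pos[c] + t * b[i].pos[c];
                out[i].vel[c] = b[i].vel[c];
            }
    }

Since the extrapolated state is replaced as soon as the next full time step arrives, the accumulated drift is limited to one transfer interval.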
The interaction concepts presented in Chapter 6 focus on the visualization part. The mobile device interaction in particular aims at an intuitive and seamless visualization experience for the user. The concepts could be extended to also include computational steering mechanisms where applicable. Standard parameter settings can already be prepared for the simulation and computation scenarios; more complex operations, such as connecting modules or altering the data, are not included in the current setting and are subject to future work. However, the goal should not be to replace the operator node or to include all functionality in the mobile device. Instead, the functionality needed for the target application has to be made available, e.g. for presentation or collaboration scenarios.

7.3 Summary

The main research questions of this thesis were how to compute and process the ever-increasing amounts of data, how to visualize the results without losing detail, and how to interact with the visualization efficiently and in a user-friendly way. In this context, specific solutions were proposed for the computation and processing of the data, for distributed visualization of scenes containing transparency, and for novel user interaction concepts with a focus on mobile devices.

DIANA, a computation hardware abstraction layer, was presented (cf. Chapter 3). It was developed with the goals of being a middleware with low performance impact that is easy to integrate into existing applications and can easily be extended to future APIs. The core of DIANA is a database which provides interfaces to the devices and commands. The interfaces reside in run-time loadable plugins, allowing the user to easily replace them with other versions, e.g. when the underlying hardware changes or new features have to be added. The software itself is linked only against the DIANA core library and thus requires no modification or recompilation when plugins are exchanged. Evaluations with standard GPU performance computations have shown very low overhead, and DIANA has been used successfully for a distributed live computation of an FTLE field on the VISUS powerwall.

When distributed systems are used for simulations or calculations on decomposed objects, the final visualization is challenging, especially when semi-transparent sections are present. For rendering these scenes, the previously published PPLLs approach has been extended to distributed rendering (cf. Chapter 4). It collects all fragments of a scene during rendering, and the final composition happens after the rendering process, when the scene is to be displayed. The PPLLs mechanism can also be used for advanced rendering methods like CSG or depth of field effects. The method was evaluated using protein surface rendering and proved to be comparable to current OIT rendering algorithms. Additionally, a distributed mesh renderer and a comparative multi-variant visualization application make use of the PPLLs approach. They demonstrate the capability for distributed remote rendering and the artifact-free handling of semi-transparent, overlapping objects.

An extension of the PPLLs approach is the PVLLs method (cf. Chapter 5). It uses the PPLLs storage mechanism to store all fragments of a given scene. The PVLLs form a generic rendering framework which can visualize meshed data, volumes, and implicit surfaces using voxelization. The voxels are generated by rendering the scene from three orthographic projections, effectively generating a sparse volume whose voxels contain the fragment data. Therefore, it can also be used for volume visualization of sparse data when the full volume data set, including all surrounding cells, does not fit onto the GPU. Having all fragments available also enables global lighting effects like reflection, refraction, and shadows, which is achieved by ray tracing with ray splitting when traversing the voxels within the volume. These capabilities turn the system into a generic rendering framework.

Using a large system with multiple display nodes is challenging with respect to user interaction. The distributed architecture requires special applications to control the environment. To give the user an easy-to-use and intuitive interface, mobile device interaction concepts were proposed (cf. Chapter 6). The mobile device joins the system as an additional device with display and input capabilities. The images for the mobile device can be rendered locally or by a remote system, e.g. the render cluster also driving the display wall. This makes it possible to use different visualizations or to add annotations with additional information. Inputs on the mobile device can be forwarded to the visualization system, enabling direct input by the user. The system can be used, for example, in research, collaboration, or presentation scenarios.

An integrated system was proposed (cf. Section 7.1) which features DIANA as a common interface. The visualization is driven by the PPLLs method, allowing for distributed semi-transparent rendering. Interaction can take place, for example, via the operator node or via the mobile devices. Although the individual components arose from different motivations and were originally based on different assumptions, a system can be created in which these components are integrated as sub-systems. This system is flexible, extensible, and offers its services to a broad spectrum of devices, be they mobile devices, standard computers or workstations, or large display walls.

Bibliography
B. Badillo, D. A. Bowman, W. McConnel, T. Ni, and M. G. da Silva. Literature survey on interaction techniques for large displays. Technical Report TR-06-21, Virginia Polytechnic Institute and State University, 2006. 111
R. Ball, C. North, and D. Bowman. Move to improve: promoting physical navigation to increase user performance with large displays. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 191–200, 2007. 110
M. Balsa Rodríguez, E. Gobbetti, J. Iglesias Guitián, M. Makhinya, F. Marton, R. Pajarola, and S. Suter. State-of-the-art in compressed GPU-based direct volume rendering. Computer Graphics Forum, 33(6):77–100, 2014. 90
A. Barak, T. Ben-Nun, E. Levy, and A. Shiloh. A package for OpenCL based heterogeneous computing on clusters with many GPU devices. In Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010 IEEE International Conference on, pages 1–7, 2010. 24
L. Bavoil and E. Enderton. Constant-Memory Order-Independent Transparency Techniques. NVIDIA Corp., 2011. 44
L. Bavoil and K. Myers. Order Independent Transparency with Dual Depth Peeling. NVIDIA Corp., 2008. 43, 44, 59
L. Bavoil, S. P. Callahan, A. Lefohn, J. L. D. Comba, and C. T. Silva. Multi-fragment effects on the GPU using the k-buffer. In Symposium on Interactive 3D graphics and games, pages 97–104, 2007. 44, 59
H. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I. Shindyalov, and P. Bourne. The Protein Data Bank. Nucl. Acids Res., 28:235–242, 2000, http://www.pdb.org. 73, 75
J. Beyer, M. Hadwiger, and H. Pfister. A Survey of GPU-Based Large-Scale Volume Visualization. In EuroVis - STARs, pages 105–123, 2014. 90
J. Blinn. A Generalization of Algebraic Surface Drawing. ACM Trans. Graph., 1:235–256, 1982. 69
M. Bourgoin, E. Chailloux, and J.-L. Lamotte. Efficient abstractions for GPGPU programming. International Journal of Parallel Programming, 42(4):583–600, 2014. 25
G. Bradski. The OpenCV library. Dr. Dobb's Journal of Software Tools, 2000. 115
T. Brady, J. Dongarra, M. Guidolin, A. Lastovetsky, and K. Seymour. SmartGridRPC: the new RPC model for high performance grid computing. Concurrency and Computation: Practice and Experience, 22(18):2467–2487, Dec. 2010. 25
K. Bürger, J. Krüger, and R. Westermann. Sample-based surface coloring. IEEE Transactions on Visualization and Computer Graphics, 16(5):763–776, 2010. 90
K. Brodlie, D. A. Duce, J. R. Gallop, J. P. R. B. Walton, and J. Wood. Distributed and collaborative visualization. Computer Graphics Forum, 23(2):223–251, 2004. 17
K. Brodlie, J. Brooke, M. Chen, D. Chisnall, A. J. Fewings, C. Hughes, N. W. John, M. W. Jones, M. Riding, and N. Roard. Visual supercomputing: Technologies, applications and challenges. Computer Graphics Forum, 24(2):217–245, 2005. 17
A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, and O. O. Storaasli. State-of-the-art in heterogeneous computing. Sci. Program., 18:1–33, Jan. 2010. 23
P. Brown. OpenGL Extension: NV_shader_buffer_store. NVIDIA Corp., 2012. 45
P. Brown, C. Dodd, M. Kilgard, and E. Werness. OpenGL Extension: NV_shader_buffer_load. NVIDIA Corp., 2010. 45
L. Carpenter. The A-buffer, an antialiased hidden surface method. SIGGRAPH Comput. Graph., 18(3):103–108, 1984. 45, 88
G. Chen, P. V. Sander, D. Nehab, L. Yang, and L. Hu. Depth-presorted triangle lists. ACM Trans. Graph., 31(6):160:1–160:9, Nov. 2012. 43
K. Cheng and D. Zhu. Tablet interaction techniques for viewport navigation on large displays. In CHI '14 Extended Abstracts on Human Factors in Computing Systems, pages 2029–2034. ACM, 2014. 111
A. Cockburn, A. Karlson, and B. B. Bederson. A review of overview+detail, zooming, and focus+context interfaces. ACM Comput. Surv., 41(1):2:1–2:31, Jan. 2009. 4, 111
M. L. Connolly. Analytical Molecular Surface Calculation. J. Appl. Cryst., 16:548–558, 1983. 70
C. Crassin. GigaVoxels: A Voxel-Based Rendering Pipeline For Efficient Exploration Of Large And Detailed Scenes. PhD thesis, Universite de Grenoble, 2011. 89
R. Dolbeau. HMPP™: A hybrid multi-core parallel programming environment. In First Workshop on General Purpose Processing on Graphics Processing Units, pages 1–5, 2007. 23
H. Doleisch. Visual Analysis of Complex Simulation Data using Multiple Heterogenous Views. PhD thesis, Institute of Computer Graphics and Algorithms, Vienna University of Technology, 2004. 120
B. Domonkos and G. Jakab. A programming model for GPU-based parallel computing with scalability and abstraction. In Proceedings of the 25th Spring Conference on Computer Graphics (SCCG), pages 115–122, Apr. 2009. 24
J. Dongarra. Trip Report to Changsha and the Tianhe-2 Supercomputer. Oak Ridge National Laboratory, 2013. 2
R. A. Drebin, L. Carpenter, and P. Hanrahan. Volume rendering. Computer Graphics (Proceedings of SIGGRAPH 1988), 22(4):65–74, 1988. 89
J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Orti. rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In 2010 International Conference on High Performance Computing and Simulation (HPCS), pages 224–231. IEEE, July 2010. 24
H. Edelsbrunner and E. P. Mücke. Three-dimensional alpha shapes. ACM Trans. Graph., 13(1):43–72, 1994. 70, 71
S. Eilemann, M. Makhinya, and R. Pajarola. Equalizer: A scalable parallel rendering framework. IEEE Transactions on Visualization and Computer Graphics, 15(3):436–452, June 2009. 17
E. Eisemann and X. Décoret. Single-pass GPU solid voxelization and applications. In Graphics Interface (GI 2008), volume 322, pages 73–80, 2008. 89
M. Eissele, D. Weiskopf, and T. Ertl. Interactive context-aware visualization for mobile devices. In SG '09: Proceedings of Smart Graphics, pages 167–178, 2009. 111
C. Everitt. Interactive Order-Independent Transparency. NVIDIA Corp., 2001. 44
M. Falk, M. Krone, and T. Ertl. Atomistic visualization of mesoscopic whole-cell simulations using ray-casted instancing. Computer Graphics Forum, 32(8):195–206, 2013. 88, 92
S. Fang and D. Liao. Fast CSG voxelization by frame buffer pixel mapping. In IEEE Symposium on Volume Visualization, pages 43–48, 2000. 89
M. Üffinger, S. Frey, and T. Ertl. Interactive high-quality visualization of higher-order finite elements. Computer Graphics Forum, 29:337–346, 2010. 15
M. Fife, F. Strugar, and L. Davies. Whitepaper: Adaptive Volumetric Shadow Maps. Technical report, Intel, 2013. 45
J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes. Computer graphics: principles and practice (2nd ed.). Addison-Wesley Longman, 1990. 43
S. Frey and T. Ertl. PaTraCo: a framework enabling the transparent and efficient programming of heterogeneous compute networks. In Eurographics Symposium on Parallel Graphics and Visualization (EGPGV10), pages 131–140, 2010. 24
S. Frey, F. Sadlo, and T. Ertl. Explorable volumetric depth images from raycasting. In Conference on Graphics, Patterns and Images (SIBGRAPI), pages 123–130, Aug. 2013. 90
E. Gobbetti, F. Marton, and J. A. I. Guitián. A single-pass GPU ray casting framework for interactive out-of-core rendering of massive volumetric datasets. Visual Computer, 24(7-9):797–806, 2008. 89
D. Göddeke, R. Strzodka, and S. Turek. Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations. International Journal of Parallel, Emergent and Distributed Systems, 22(4):221–256, 2007. 23
M. D. Godfrey and D. F. Hendry. The computer as von Neumann planned it. IEEE Ann. Hist. Comput., 15(1):11–21, 1993. 9
S. Grauer-Gray, L. Xu, R. Searles, S. Ayalasomayajula, and J. Cavazos. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar), 2012, pages 1–10, 2012. 39
J. Greer and B. L. Bush. Macromolecular shape and surface maps by solvent exclusion. In Proceedings of the National Academy of Science, pages 303–307, 1978. 68
M. Gross and H. Pfister. Point-Based Graphics. Morgan Kaufmann Publishers Inc., 2007. 54
S. Grottel, M. Krone, C. Müller, G. Reina, and T. Ertl. MegaMol - a prototyping framework for particle-based visualization. IEEE Transactions on Visualization and Computer Graphics, 21(2):201–214, Feb. 2015. 16
S. Gumhold. Splatting Illuminated Ellipsoids with Depth Correction. In Proceedings of VMV, pages 245–252, 2003. 14, 67
P. Gupta, N. da Vitoria Lobo, and J. J. LaViola. Markerless tracking and gesture recognition using polar correlation of camera optical flow. Machine Vision and Applications, 24(3):651–666, 2013. 114
R. Hardy and E. Rukzio. Touch & interact: Touch-based interaction of mobile phones with displays. In Proceedings of the 10th International Conference on Human Computer Interaction with Mobile Devices and Services, MobileHCI '08, pages 245–254. ACM, 2008. 111
V. Havran, J. Zajac, J. Drahokoupil, and H.-P. Seidel. MPI Informatics Building model as data for your research. Research Report MPI-I-2009-4-004, MPI Informatik, Dec. 2009. 51, 81
J. Hightower and G. Borriello. Location systems for ubiquitous computing. Computer, 34(8):57–66, Aug. 2001. 114
M. Hoffmann and J. Kohlhammer. A generic framework for using interactive visualization on mobile devices. In Intelligent Interactive Assistance and Mobile Multimedia Computing, volume 53 of Communications in Computer and Information Science, pages 131–142. 2009. 112
W. Humphrey, A. Dalke, and K. Schulten. VMD – Visual Molecular Dynamics. Journal of Molecular Graphics, 14:33–38, 1996. 69, 73
R. Kähler, M. Simon, and H. Hege. Interactive volume rendering of large sparse data sets using adaptive mesh refinement hierarchies. IEEE Transactions on Visualization and Computer Graphics, 9(3):341–351, 2003. 89
A. Karabassi, G. Papaioannou, and T. Theoharis. A fast depth-buffer-based voxelization algorithm. Journal of Graphics Tools, 4(4):5–10, 1999. 89
A. Kaufman and E. Shimony. 3D scan-conversion algorithms for voxel-based graphics. In Workshop on Interactive 3D Graphics, pages 45–75, 1987. 89
D. Kauker, H. Sanftmann, S. Frey, and T. Ertl. Memory saving discrete fourier transform on GPUs. In CIT, pages 1152–1157. IEEE Computer Society, 2010. 7
D. Kauker, M. Krone, A. Panagiotidis, G. Reina, and T. Ertl. Evaluation of per-pixel linked lists for distributed rendering and comparative analysis. Computing and Visualization in Science, pages 111–121, 2013a. 5, 42
D. Kauker, M. Krone, A. Panagiotidis, G. Reina, and T. Ertl. Rendering molecular surfaces using order-independent transparency. In E. Association, editor, Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), volume 13, pages 33–40, 2013b. 5, 42
J. Kehrer and H. Hauser. Visualization and visual analysis of multifaceted scientific data: A survey. IEEE Transactions on Visualization and Computer Graphics, 19(3):495–513, 2013. 4, 75, 119
D. A. Keim, F. Mansmann, J. Schneidewind, J. Thomas, and H. Ziegler. Visual data mining, chapter Visual Analytics: Scope and Challenges, pages 76–90. Springer-Verlag, 2008. 119
T. K. Khan. A survey of interaction techniques and devices for large high resolution displays. In Visualization of Large and Unstructured Data Sets - Applications in Geospatial Planning, Modeling and Engineering (IRTG 1131 Workshop), volume 19 of OpenAccess Series in Informatics (OASIcs), pages 27–35, 2011. 111
P. K. Kitanidis. Introduction to Geostatistics: Applications in Hydrogeology. Stanford-Cambridge Program. Cambridge University Press, 1997. 105
G. Klein. Visual Tracking for Augmented Reality. PhD thesis, University of Cambridge, 2006. 115
V. Kämpe, E. Sintorn, and U. Assarsson. High resolution sparse voxel DAGs. ACM Transactions on Graphics (Proc. SIGGRAPH), 32(3):101:1–101:13, July 2013, SIGGRAPH 2013. 89
A. Knoll, I. Wald, P. A. Navrátil, M. E. Papka, and K. P. Gaither. Ray tracing and volume rendering large molecular data on multi-core and many-core architectures. In Proceedings of the 8th International Workshop on Ultrascale Visualization, UltraVis '13, pages 5:1–5:8, 2013. 3
P. Knowles, G. Leach, and F. Zambetta. Efficient layered fragment buffer techniques. In P. Cozzi and C. Riccio, editors, OpenGL Insights, pages 279–292. CRC Press, 2012. 45
Z. Koza, M. Matyka, S. Szkoda, and L. Miroslaw. Compressed multirow storage format for sparse matrices on graphics processing units. SIAM Journal on Scientific Computing, 36(2):C219–C239, 2014. 90
M. Krone, K. Bidmon, and T. Ertl. Interactive visualization of molecular surface dynamics. IEEE Trans. Vis. Comp. Graph., 15(6):1391–1398, 2009. 70, 71
M. Krone, S. Grottel, and T. Ertl. Parallel Contour-Buildup Algorithm for the Molecular Surface. In IEEE Symposium on Biological Data Visualization (biovis'11), pages 17–22, 2011. 70
M. Krone, J. E. Stone, T. Ertl, and K. Schulten. Fast Visualization of Gaussian Density Surfaces for Molecular Dynamics and Particle System Trajectories. In EuroVis Short Papers, volume 1, pages 67–71, 2012. 14, 69
M. Krone, G. Reina, C. Schulz, T. Kulschewski, J. Pleiss, and T. Ertl. Interactive Extraction and Tracking of Biomolecular Surface Features. Computer Graphics Forum, 32(3):331–340, 2013. 119
M. Krone, M. Huber, K. Scharnowski, M. Hirschler, D. Kauker, G. Reina, U. Nieken, D. Weiskopf, and T. Ertl. Evaluation of visualizations for interface analysis of SPH. In EuroVis 2014 Short Papers, volume 3, pages 109–113, 2014a. 7
M. Krone, D. Kauker, G. Reina, and T. Ertl. Visual analysis of dynamic protein cavities and binding sites. In IEEE PacificVis - Visualization Notes, volume 1, pages 301–305, 2014b. 5, 118
M. Krone, C. Müller, T. Ertl, D. Kauker, A. C. Silva, D. Salsa, M. Gräber, and M. Kallert. Remote rendering and user interaction on mobile devices for scientific visualization. In International Symposium on Visual Information Communication and Interaction (VINCI), volume 8. ACM, 2015. 110
F. Lamberti and A. Sanna. A streaming-based solution for remote visualization of 3d graphics on mobile devices. IEEE Transactions on Visualization and Computer Graphics, 13(2):247–260, Mar. 2007. 112
P. Laug and H. Borouchaki. Molecular Surface Modeling and Meshing. Engineering with Computers, 18(3):199–210, 2002. 70
M. Levoy. Display of surfaces from volume data. IEEE Computer Graphics and Applications, 8(3):29–37, 1988. 89
S. Lindholm, D. Jönsson, H. Knutsson, and A. Ynnerman. Towards data centric sampling for volume rendering. In SIGRAD 2013, pages 55–60, 2013. 48
N. Lindow, D. Baum, S. Prohaska, and H.-C. Hege. Accelerated Visualization of Dynamic Molecular Surfaces. Computer Graphics Forum, 29:943–952, 2010. 70
F. Liu, M.-C. Huang, X.-H. Liu, and E.-H. Wu. Efficient depth peeling via bucket sort. In Proceedings of the Conference on High Performance Graphics 2009, pages 51–57. ACM, 2009. 44, 52, 59
A. Magni, D. Grewe, and N. Johnson. Input-aware auto-tuning for directive-based GPU programming. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pages 66–75, New York, NY, USA, 2013. ACM. 39
M. Makhinya. Performance Challenges in Distributed Rendering Systems. PhD thesis, Institut für Informatik, Universität Zürich, 2012. 17
L. Marsalek, A. Dehof, I. Georgiev, H.-P. Lenhof, P. Slusallek, and A. Hildebrandt. Real-Time Ray Tracing of Complex Molecular Scenes. In 14th International Conference on Information Visualization (IV), pages 239–245, 2010. 70
C. Maughan and M. Wloka. Vertex shader introduction. Technical report, NVIDIA Corporation, 2001. 11
M. Maule, J. L. D. Comba, R. P. Torchelsen, and R. Bastos. Technical section: A survey of raster-based transparency techniques. Comput. Graph., 35(6):1023–1034, Dec. 2011. 43
D. C. McCallum and P. Irani. ARC-pad: Absolute+relative cursor positioning for large displays with a mobile touchscreen. In Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, UIST '09, pages 153–156, New York, NY, USA, 2009. ACM. 111
A. Meligy. Parallel and distributed visualization: The state of the art. In Computer Graphics, Imaging and Visualisation, 2008. CGIV '08. Fifth International Conference on, pages 329–336, 2008. 17
C. Müller, G. Reina, and T. Ertl. The VVand: A two-tier system design for high-resolution stereo rendering. In CHI POWERWALL 2013 Workshop, 2013. 19
A. Moll, A. Hildebrandt, H.-P. Lenhof, and O. Kohlbacher. Ballview: A tool for research and education in molecular modeling. Bioinformatics, 22(3):365–366, 2006. 70
K. Moreland. IceT Users' Guide and Reference Version 2.1, volume Tech Report SAND 2011-5011. Sandia National Laboratories, Aug. 2011. 17
J. D. Mulder, J. J. v. Wijk, and R. v. Liere. A survey of computational steering environments. Future Generation Computer Systems, 15(1):119–129, 1999. 16
K. Museth. VDB: High-resolution sparse volumes with dynamic topology. ACM Transactions on Graphics, 32(3):27:1–27:22, 2013. 89
K. Myers and L. Bavoil. Stencil routed a-buffer. ACM SIGGRAPH 2007 sketches, 2007. 44
T. Ni, G. S. Schmidt, O. G. Staadt, M. A. Livingston, R. Ball, and R. May. A survey of large high-resolution display technologies, techniques, and applications. In Proceedings of the IEEE Conference on Virtual Reality, VR '06, pages 223–236, Washington, DC, USA, 2006. IEEE Computer Society. 111
M. Nießner, H. Schäfer, and M. Stamminger. Fast indirect illumination using layered depth images. Visual Computer, 26(6-8):679–686, 2010. 90
W. Nowak and A. Litvinenko. Kriging and spatial design accelerated by orders of magnitude: Combining low-rank covariance approximations with FFT-techniques. Mathematical Geosciences, 45(4):411–435, 2013. 105
NVIDIA Corp. NVIDIA CUDA C Programming Guide, 2014. 11
M. Oberhumer. LZO documentation, 2011. [Online]. Available: http://www.oberhumer.com/opensource/lzo/. 60
M. A. O'Neil and M. Burtscher. Floating-point data compression at 75 Gb/s on a GPU. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 7:1–7:7. ACM, 2011, ACM ID: 1964189. 59, 60
A. Panagiotidis, D. Kauker, S. Frey, and T. Ertl. DIANA: a device abstraction framework for parallel computations. In P. Iványi and B. Topping, editors, Second International Conference on Parallel, Distributed, Grid and Cloud Computing for Engineering. Civil-Comp Press, 2011. 5, 22
A. Panagiotidis, D. Kauker, F. Sadlo, and T. Ertl. Distributed computation and large-scale visualization in heterogeneous compute environments. In 2012 11th International Symposium on Parallel and Distributed Computing (ISPDC), pages 87–94, June 2012. 5, 22
S. Parker, M. Parker, Y. Livnat, P.-P. Sloan, C. Hansen, and P. Shirley. Interactive ray tracing for volume visualization. Visualization and Computer Graphics, IEEE Transactions on, 5(3):238–250, 1999. 128
J. Parulek and I. Viola. Implicit Representation of Molecular Surfaces. In IEEE Pacific Visualization Symposium, pages 217–224, 2012. 70, 71, 72, 73
T. Porter and T. Duff. Compositing digital images. In Proceedings of the 11th annual conference on Computer graphics and interactive techniques, SIGGRAPH '84, pages 253–259. ACM, 1984. 45
F. Reichl, M. G. Chajdas, K. Bürger, and R. Westermann. Hybrid sample-based surface rendering. In Vision, Modelling and Visualization (VMV), pages 47–54, 2012. 90
G. Reina. Visualization of Uncorrelated Point Data. PhD thesis, Visualization Research Center, University of Stuttgart, 2008. 14
G. Reina and T. Ertl. Hardware-accelerated glyphs for mono- and dipoles in molecular dynamics visualization. In EuroVis05: IEEE VGTC Symposium on Visualization, pages 177–182, 2005. 67, 70
M. Repplinger, A. Löffler, D. Rubinstein, and P. Slusallek. DRONE: a flexible framework for distributed rendering and display. In Proceedings of the 7th International Conference on Visual Computing 2009 (ISVC09), pages 975–986. Springer, 2009. 17
F. M. Richards. Areas, Volumes, Packing, and Protein Structure. Annual Review of Biophysics and Bioengineering, 6(1):151–176, 1977. 68
C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: Operating system abstractions to manage GPUs as compute devices. pages 233–248. Symposium on Operating Systems Principles (SOSP), 2011. 25
P. Sabella. A rendering algorithm for visualizing 3D scalar fields. Computer Graphics (Proceedings of SIGGRAPH 1988), 22(4):51–58, 1988. 89
F. Sadlo and R. Peikert. Efficient visualization of Lagrangian coherent structures by filtered AMR ridge extraction. IEEE Transactions on Visualization and Computer Graphics, 13(6):1456–1463, 2007. 31
F. Sadlo, M. Üffinger, C. Pagot, D. Osmari, J. L. D. Comba, T. Ertl, C.-D. Munz, and D. Weiskopf. Visualization of cell-based higher-order fields. Computing in Science & Engineering, 13(3):84–91, 2011. 15
M. Salvi, J. Montgomery, and A. Lefohn. Adaptive transparency. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG '11, pages 119–126. ACM, 2011. 45, 90
M. F. Sanner, A. J. Olson, and J.-C. Spehner. Reduced Surface: An efficient way to compute molecular surfaces. Biopolymers, 38(3):305–320, 1996. 70, 71
J. Shade, S. Gortler, L. He, and R. Szeliski. Layered depth images. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '98, pages 231–242. ACM, 1998. 90
S. Stegmaier, M. Strengert, T. Klein, and T. Ertl. A simple and flexible volume rendering framework for graphics-hardware-based raycasting. In Proceedings of the Fourth Eurographics / IEEE VGTC Conference on Volume Graphics, VG'05, pages 187–195, 2005a. 14
S. Stegmaier, M. Strengert, T. Klein, and T. Ertl. A simple and flexible volume rendering framework for graphics-hardware-based raycasting. In Eurographics/IEEE VGTC Workshop on Volume Graphics, pages 187–195, 2005b. 89, 98
M. Strengert, C. Müller, C. Dachsbacher, and T. Ertl. CUDASA: compute unified device and systems architecture. In Eurographics Symposium on Parallel Graphics and Visualization (EGPGV08), pages 49–56, 2008. 24
C. Teitzel, M. Hopf, R. Grosso, and T. Ertl. Volume visualization on sparse grids. Computing and Visualization in Science, 2:47–59, 1999. 89
J. Thomas and J. Kielman. Challenges for Visual Analytics. Information Visualization, 8(4):309–314, 2009. 119
M. Totrov and R. Abagyan. The contour-buildup algorithm to calculate the analytical molecular surface. Journal of Structural Biology, 116:138–143, 1995. 70
A. Varshney, F. P. Brooks, and W. V. Wright. Linearly Scalable Computation of Smooth Molecular Surfaces. IEEE Computer Graphics and Applications, 14(5):19–25, 1994. 70, 71
A. A. Vasilakis and I. Fudos. Z-fighting aware depth peeling. In ACM SIGGRAPH 2011 Posters, SIGGRAPH '11, pages 77:1–77:1. ACM, 2011. 44
I. Wald. Realtime Ray Tracing and Interactive Global Illumination. PhD thesis, Computer Graphics Group, Saarland University, 2004. 70
A. Wessels, M. Purvis, J. Jackson, and S. Rahman. Remote data visualization through WebSockets. In Information Technology: New Generations (ITNG), 2011 Eighth International Conference on, pages 1050–1051, Apr. 2011. 112
L. Westover. Footprint evaluation for volume rendering. SIGGRAPH '90, pages 367–376. ACM, 1990. 63
T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC video coding standard. Circuits and Systems for Video Technology, IEEE Transactions on, 13(7):560–576, 2003. 117
M. Willebeek-LeMair and A. Reeves. Strategies for dynamic load balancing on highly parallel computers. Parallel and Distributed Systems, IEEE Transactions on, 4(9):979–993, Sept. 1993. 128
C. M. Wittenbrink. R-buffer: a pointerless a-buffer hardware architecture. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on Graphics hardware, pages 73–80, 2001, ACM ID: 383529. 45
H. Wright, R. H. Crompton, S. Kharche, and P. Wenisch. Steering and visualization: Enabling technologies for computational science. Future Generation Computer Systems, 26(3):506–513, 2010. 16
R. Yagel, D. Cohen, and A. Kaufman. Discrete ray tracing. Computer Graphics and Applications, IEEE, 12(5):19–28, 1992. 89
J. C. Yang, J. Hensley, H. Grün, and N. Thibieroz. Real-time concurrent linked list construction on the GPU. Computer Graphics Forum, 29(4):1297–1304, 2010. 42, 45, 53, 59, 87
N. Zhang. Memory-hazard-aware k-buffer algorithm for order-independent transparency rendering. IEEE Transactions on Visualization and Computer Graphics, 20(2):238–248, 2014. 44
Y. Zhang and R. Pajarola. Deferred blending: Image composition for single-pass point rendering. Computers & Graphics, 31:175–189, 2007. 43