Institute of Parallel and Distributed Systems University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart Bachelorarbeit Comparison of different n-body algorithms on various hardware platforms using SYCL Tim Thüring Course of Study: Informatik Examiner: Prof. Dr. Dirk Pflüger Supervisor: Marcel Breyer, M.Sc. Commenced: October 5, 2022 Completed: April 5, 2023 Abstract The n-body problem has various applications in different fields of science such as astrophysics, where it describes the problem of calculating the movements of n different bodies which all interact with each other over time. There exist different algorithms that solve the n-body problem, for example, the naive approach and the Barnes-Hut algorithm. Since applications of the n-body problem can deal with large systems of bodies it is crucial to understand the runtime behavior of theses algorithms. In this thesis the runtime behavior of the naive algorithm and the Barnes-Hut algorithm will be compared on different CPUs as well as on GPUs from different vendors. Both algorithms will be implemented using SYCL which is an abstraction layer that enables parallel programming for various types of devices. With SYCL there is no need to use different languages for parallel programming on CPUs and GPUs and the whole code can be written using standard C++. The results show that one can achieve better performance with both algorithms on GPUs than on CPUs. A projection for the runtime of a simulation of approximately one earth year with a system of over 1.2 million bodies and a Δ𝑡 of one hour shows that an NVIDIA A100 GPU could finish this simulation in under one day with the naive approach. A dual socket AMD EPYC 7543 would take almost twelve days for the same simulation. This shows that the naive algorithm maps extremely well to GPUs. CPUs can keep up better with the Barnes-Hut algorithm. Here the projection of the runtime showed that the NVIDIA A100 would finish the simulation of an earth year in just below an hour whereas the dual socket AMD EPYC 7543 could finish the simulation in 2.59 hours. 3 Kurzfassung Das n-Körper-Problem hat viele Anwendungen in unterschiedlichen Forschungsgebieten, wie zum Beispiel Astrophysik, in denen es das Problem der Berechnung von Bewegungen unterschiedlicher Körper beschreibt, die alle miteinander interagieren. Es existieren verschiedene Algorithmen, die das n-Körperproblem lösen, zum Beispiel der naive Ansatz oder der Barnes-Hut Algorithmus. Da sich Anwendungen des n-Körper-Problems mit großen Systemen von Körpern beschäftigen können, ist es wichtig das Laufzeitverhalten dieser Algorithmen zu verstehen. In dieser Arbeit wird das Laufzeitverhalten des naiven Algorithmus und des Barnes-Hut Algorithmus auf unterschiedlichen CPUs und GPUs von verschiedenen Herstellern verglichen. Beide Algorithmen werden unter der Verwendung von SYCL implementiert, was eine Abstraktionsschicht ist die es ermöglicht verschiedenste Hardware parallel zu programmieren. Mit SYCL besteht nicht die Notwendigkeit verschiedene Sprachen für das parallele Programmieren auf CPUs und GPUs zu verwenden und der gesamte Programmcode kann in Standard C++ geschrieben werden. Die Ergebnisse zeigen, dass man bei beiden Algorithmen bessere Performanz auf GPUs erreichen kann als auf CPUs. Eine Hochrechnung der Laufzeit für eine Simulation von ungefähr einem Erd-Jahr mit einem System von 1,2 Millionen Körpern und einem Δ𝑡 von einer Stunde zeigt, dass eine NVIDIA A100 GPU mit dem naiven Ansatz die Simulation in unter einem Tag beenden kann. 
Ein Dual-Socket AMD EPYC 7543 würde fast zwölf Tage für diese Simulation benötigen. Das zeigt, dass sich der naive Algorithmus enorm gut auf GPUs anwenden lässt. CPUs können beim Barnes-Hut Algorithmus besser mithalten. Hier zeigt die Hochrechnung, dass die NVIDIA A100 die Simulation eines Jahres in knapp unter einer Stunde schaffen würde, während der Dual-Socket AMD EPYC 7543 die Simulation in 2,59 Stunden schaffen könnte.

Contents
1 Introduction
2 Related work
3 Foundations
 3.1 The n-body problem
 3.2 Algorithms for the N-Body Problem
 3.3 SYCL for parallel programming on various hardware
4 Implementation
 4.1 Implementation of the naive algorithm
 4.2 Implementation of the Barnes-Hut algorithm
5 Analysis of the runtime behavior
 5.1 Experiment setups
 5.2 Impact of the parameters on the runtime behavior of the algorithms
 5.3 Evaluation of the performance optimizations
 5.4 Comparison of the runtime behavior on different GPUs and CPUs
 5.5 Summary of the results
6 Conclusion
 6.1 Future work
Bibliography

List of Figures
3.1 Example of a subdivision of the space into a quadtree
3.2 Example for a quadtree data structure
3.3 Example of how nodes of a quadtree represent virtual bodies
4.1 Grouping of the acceleration computations into work-groups
4.2 Building the top part of the tree in order to determine the root nodes for all subtrees
4.3 Creating all subtrees independently from each other
4.4 Two methods of assigning nodes to work-items
5.1 Comparison of different block sizes k for the naive algorithm on an NVIDIA Quadro GP100 GPU
5.2 Comparison of different block sizes k for the naive algorithm on a dual socket AMD EPYC 7543 CPU
5.3 Comparison of different values for the maximum depth for the top of the octree
5.4 Comparison of different amounts of work-items for the first phase of the octree creation
5.5 Comparison of different amounts of work-items for building the subtrees during the octree creation
5.6 Comparison of different work-group sizes for the acceleration kernel of the Barnes-Hut algorithm
5.7 Effect of the θ-value of the Barnes-Hut algorithm on the runtime and accuracy
5.8 Analysis of the optimizations for the naive algorithm on an NVIDIA Quadro GP100
5.9 Impact of the optimizations for the naive algorithm on an NVIDIA GeForce RTX 3090
5.10 Differences between DPC++ and Open SYCL with the different performance optimizations of the naive algorithm on an NVIDIA Quadro GP100
5.11 Impact of the optimizations of the naive algorithm on an AMD GPU with Open SYCL and DPC++
5.12 Impact of the performance optimizations for the naive algorithm on a dual socket AMD EPYC 7543
5.13 Impact of using subtrees during the octree creation of the Barnes-Hut algorithm on CPUs and GPUs
5.14 Impact of the sorting step of the Barnes-Hut algorithm on CPUs and GPUs
5.15 Comparison of the naive algorithm and the Barnes-Hut algorithm on CPUs and GPUs
5.16 Comparison of the naive algorithm and the Barnes-Hut algorithm on consumer and data center GPUs
5.17 Comparison of the naive algorithm and the Barnes-Hut algorithm on AMD and NVIDIA GPUs
5.18 Comparison of two different SYCL implementations on GPUs from AMD and NVIDIA
5.19 Comparison of the naive algorithm and the Barnes-Hut algorithm on different CPUs

Acronyms
AU astronomical unit
CPU central processing unit
FPGA field programmable gate array
GPU graphics processing unit
HT hyper-threading
TFLOPS tera floating point operations per second

1 Introduction

The n-body problem occurs in different fields of science and describes the calculation of the movements of n different bodies in a system where all of these bodies interact with each other. One of the main applications of this problem can be found in astrophysics, where the goal is to calculate the trajectories of different celestial objects like planets or asteroids which all interact with each other through gravitational forces. There are different approaches that solve the n-body problem. The naive approach considers all pairwise interactions between all bodies in the system. This leads to the most precise result; however, the runtime of the algorithm grows quadratically with the number of bodies in the system, which makes the approach poorly suited for the simulation of very large systems. Therefore, other algorithms have been developed that apply approximations to the simulation in order to speed up the calculation. One such algorithm is the Barnes-Hut algorithm [BH86], which leverages an octree data structure to apply approximations and improve the performance of n-body simulations. Since n-body simulations can be performed with large systems of bodies, it is important to understand the runtime behavior of these algorithms. In this thesis the runtime behavior of two algorithms for the n-body problem, the naive algorithm and the Barnes-Hut algorithm, will be compared on different central processing units (CPUs) as well as on graphics processing units (GPUs) from NVIDIA and AMD. Parallel programming on such different hardware platforms usually involves the usage of different languages such as CUDA, HIP or OpenMP. Most of the works that implemented n-body algorithms for GPUs use CUDA. In this thesis both algorithms will be implemented using SYCL, which is an abstraction layer that enables parallel programming on a variety of devices.
The benefit of using SYCL is that one does not need to use different languages for parallel programming on a specific device. Instead, with SYCL one can write a whole application in standard C++ that is then capable of addressing a variety of devices. The results of the comparison conducted during this thesis show that one can achieve better performance on GPUs than on CPUs for both algorithms. The differences between those two types of devices is especially significant for the naive algorithm where GPUs have a clear advantage over CPUs. A projection for the runtime of a simulation of approximately one earth year with a system of over 1.2 million bodies and a Δ𝑡 of one hour shows that when using the naive approach an NVIDIA A100 GPU could finish the simulation in under one day. A dual socket AMD EPYC 7543 would need almost twelve days in the same situation. However, CPUs can keep up better with GPUs when using the Barnes-Hut algorithm. Nevertheless, also with this algorithm GPUs are faster in the end. Here the projection of the runtime showed that with the Barnes-Hut algorithm the NVIDIA A100 would finish the simulation of an earth year in just below an hour whereas the dual socket AMD EPYC 7543 could finish the same simulation in 2.59 hours. 11 1 Introduction This thesis is structured as follows: In the next chapter important work related to this thesis will be presented. Chapter 3 will introduce all important fundamentals about n-body problems and SYCL. In chapter 4 the implementations of the naive algorithm and the Barnes-Hut algorithm conducted during this thesis will be described. The results of the comparison of both algorithms on different GPUs and CPUs will be presented in chapter 5. Finally, chapter 6 will conclude this thesis and discuss possible directions of future work. 12 2 Related work This section will present scientific work related to the topic of this thesis. There exist a lot of implementations of n-body algorithms on CPUs as well as on GPUs. An implementation of the naive algorithm on GPUs was performed by Nyland et al. [NHP07]. The authors parallelized the naive approach on NVIDIA GPUs with the usage of CUDA. Arora et al. [ASV09] implemented the naive approach for several CPUs and GPUs and compared the performance on these different platforms. Their GPU implementation makes use of CUDA and uses the approach by Nyland et al. However, they only considered the naive algorithm. Furthermore, the hardware used for the comparison (NVIDIA Tesla C1060 and NVIDIA Tesla C870 GPUs as well as, for example, Intel Nehalem and AMD Barcelona CPUs) is not up to date anymore. Capuzzo-Dolcetta et al. [CS13] implemented the naive algorithm with OpenCL an compared their implementation on GPUs from different manufactures. A CUDA implementation of the Barnes-Hut algorithm on NVIDIA GPUs was conducted by Burtscher et al. [BP11]. The octree creation phase of the Barnes-Hut algorithm was implemented using a parallel top-down approach that makes use of synchronization. The authors also compared their implementation of the Barnes-Hut algorithm with a CUDA implementation of the naive algorithm. A comparison of the performance of the Barnes-Hut algorithm on CPUs and GPUs was also performed by the authors. However, the CPU implementation of the Barnes-Hut algorithm used for this comparison was not parallelized. There also exist other approaches for the octree creation like the one proposed by Warren et al. 
[WS93] which makes use of a hashed octree and leverages space filling curves for decomposition of all bodies in the space. It is also possible to construct the tree bottom-up. This technique has been, for example, applied by Alpay [Alp19b] to an algorithm that is similar to the Barnes-Hut algorithm with the usage of OpenCL. For the tree the SpatialCL library by Alpay [Alp19a] is used that also includes a demonstration of the Barnes-Hut algorithm. Beyond the Barnes-Hut algorithm that has a theoretical complexity of O(𝑛 𝑙𝑜𝑔(𝑛)) there also exist other approaches that try to reduce the theoretical complexity of the naive algorithm. One example is the fast multipole method by Greengard et al. [Gre87] which reduces the theoretical complexity to O(𝑛). Yokota et al. [YB11] implemented both the fast multipole method and the Barnes-Hut algorithm with CUDA for execution on GPUs. They also compared both algorithms with the naive approach on GPUs and CPUs. Overall, not many comparisons of the Barnes-Hut algorithm and the naive algorithm have been conducted on different modern hardware. Most comparisons so far have used different implementations of the algorithms with different programming languages for the various devices. The implementations of the naive algorithm and the Barnes-Hut algorithm presented in this work will only use SYCL code. This allows for using mostly the same code for the execution on CPUs and GPUs. Different frameworks than SYCL exist and also have been applied to n-body simulations like the Cabana library [SRJ+22] that uses Kokkos [TLA+22] for achieving performance portability. 13 2 Related work However to my knowledge no comparisons of different algorithms with the usage of this library have been conducted. Furthermore, I am not aware of any similar work that makes use of SYCL for comparison between the naive algorithm and the Barnes-Hut algorithm on different hardware. 14 3 Foundations In this chapter the basic concepts and tools which are relevant for this thesis will be introduced. Starting with physical background about n-body problems and the ideas behind two different algorithms for n-body simulations. Last, SYCL will be introduced as part of the technical background of this thesis. 3.1 The n-body problem The n-body problem describes the calculation of pairwise interactions between n different bodies to obtain their movements over time. It has applications in various fields of science such as molecular dynamics [EH86], plasma physics [GSK+10], fluid dynamics [SW94] or astrophysics [BH86]. The variants of the n-body problem in those different fields mainly differ in the kind of force which underlies the interactions in the system. This work considers the application of the n-body problem for astrophysics, where the main application is the simulation of celestial objects, e.g., the planets, moons and asteroids of our solar system. This variant of the n-body problem is also called gravitational n-body problem since the different bodies of the system interact with each other through gravitational forces. It is described in various literature, e.g., in the work by Meyer et al. [Ken17] on which the content of the following paragraphs is based. In the gravitational case the interactions between the bodies arise through gravity. The individual bodies in the system are modeled as point masses where every body 𝑖 is described by its mass 𝑚𝑖 and a position ®𝑟𝑖 . 
With this, the calculation of the gravitational force acting on body i can be derived from Newton's second law of motion and Newton's law of gravity. Together this gives the following equation for the total force F_i acting on body i generated through the gravity of all other bodies j:

\vec{F}_i = m_i \cdot \vec{a}_i = G \cdot m_i \cdot \sum_{j \neq i} \frac{m_j \cdot (\vec{r}_j - \vec{r}_i)}{\lVert \vec{r}_j - \vec{r}_i \rVert^3} . \quad (3.1)

In equation 3.1, G \approx 6.674 \cdot 10^{-11} \,\frac{m^3}{kg \cdot s^2} denotes the gravitational constant and a_i describes the acceleration of body i. Often the acceleration value a_i is sufficient, which simplifies equation 3.1 to equation 3.2:

\vec{a}_i = G \cdot \sum_{j \neq i} \frac{m_j \cdot (\vec{r}_j - \vec{r}_i)}{\lVert \vec{r}_j - \vec{r}_i \rVert^3} . \quad (3.2)

In order to obtain the evolution of the system over time, i.e., the movements of the bodies in the system, one uses numerical integration over time. One way of getting the new positions r_{i+1} and velocities v_{i+1} of each body in the next time step i+1 is the so-called leapfrog integration method. The method is described in various literature. This paragraph will be based on the work by Hairer et al. [HLW03]. The leapfrog integration method is sometimes also named Verlet integration and can be written and interpreted in a variety of ways. A possible interpretation of this scheme is that it is a composition of two symplectic Euler schemes. One variant of the leapfrog integration scheme can be expressed as shown below:

\vec{v}_{i+\frac{1}{2}} = \vec{v}_i + \vec{a}_i \cdot \frac{\Delta t}{2} \quad (3.3)
\vec{r}_{i+1} = \vec{r}_i + \vec{v}_{i+\frac{1}{2}} \cdot \Delta t \quad (3.4)
\vec{v}_{i+1} = \vec{v}_{i+\frac{1}{2}} + \vec{a}_{i+1} \cdot \frac{\Delta t}{2} . \quad (3.5)

After the step described by equation 3.4 the new acceleration values a_{i+1} have to be determined. In the case of the gravitational n-body problem equation 3.2 could be used. However, as described, e.g., by Nyland et al. [NHP07], for n-body simulations in practice this would not be applicable. Since the individual bodies are modeled as point masses and collisions are not taken into account, the forces between two bodies that get close to each other grow towards infinity. This can be circumvented by adding a softening factor ε² and rewriting equation 3.2 to equation 3.6 for the acceleration computation:

\vec{a}_i \approx G \cdot \sum_{j \neq i} \frac{m_j \cdot (\vec{r}_j - \vec{r}_i)}{\left( \lVert \vec{r}_j - \vec{r}_i \rVert^2 + \epsilon^2 \right)^{3/2}} . \quad (3.6)

Together, the acceleration computation and time integration form the basic building blocks of n-body simulations.

3.2 Algorithms for the N-Body Problem

The part dominating the runtime of n-body simulations is the acceleration computation, which has to be done in every time step. Beyond the naive approach for the acceleration calculation, several other approaches have been developed to speed up the acceleration computations. These approaches include methods such as the Barnes-Hut algorithm [BH86], the fast multipole method [Gre87], or particle-mesh methods (see, e.g., Bertschinger et al. [BG91]). This work is concerned with two different algorithms for n-body simulations: the naive algorithm and the Barnes-Hut algorithm. In this section, these two algorithms for the acceleration computation of the n-body problem will be introduced.

3.2.1 The Naive Approach

The naive approach for the acceleration computation is described in various literature, e.g., by Nyland et al. [NHP07]. In order to obtain the movements of all bodies over time, one has to calculate the new acceleration vector of each body in every time step. This means that for a system of n bodies, n acceleration values have to be determined. Using the naive approach these values can be calculated by applying equation 3.6 for all n bodies i.
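To make the interplay of the softened acceleration (equation 3.6) and the leapfrog scheme (equations 3.3 to 3.5) concrete, the following minimal sequential sketch shows one simulation step in plain C++. The Body layout, the constants and the function names are chosen for this illustration only and do not correspond to the thesis code.

#include <cmath>
#include <vector>

// Minimal point-mass representation used only for this illustration.
struct Body {
    double m;      // mass
    double r[3];   // position
    double v[3];   // velocity
    double a[3];   // acceleration
};

constexpr double G    = 6.674e-11;  // gravitational constant
constexpr double EPS2 = 1e-9;       // softening factor epsilon^2 (example value)

// Equation 3.6: softened acceleration of every body induced by all other bodies.
void compute_accelerations(std::vector<Body> &bodies) {
    for (Body &bi : bodies) {
        bi.a[0] = bi.a[1] = bi.a[2] = 0.0;
        for (const Body &bj : bodies) {
            const double dx = bj.r[0] - bi.r[0];
            const double dy = bj.r[1] - bi.r[1];
            const double dz = bj.r[2] - bi.r[2];
            const double dist2 = dx * dx + dy * dy + dz * dz + EPS2;
            const double inv_dist3 = 1.0 / (dist2 * std::sqrt(dist2)); // (...)^(3/2)
            // For j == i the numerator is zero, so the term vanishes on its own.
            bi.a[0] += G * bj.m * dx * inv_dist3;
            bi.a[1] += G * bj.m * dy * inv_dist3;
            bi.a[2] += G * bj.m * dz * inv_dist3;
        }
    }
}

// Equations 3.3-3.5: one leapfrog step of size dt.
void leapfrog_step(std::vector<Body> &bodies, double dt) {
    for (Body &b : bodies)                 // first half kick (3.3)
        for (int d = 0; d < 3; ++d) b.v[d] += b.a[d] * dt / 2.0;
    for (Body &b : bodies)                 // drift (3.4)
        for (int d = 0; d < 3; ++d) b.r[d] += b.v[d] * dt;
    compute_accelerations(bodies);         // new accelerations a_{i+1}
    for (Body &b : bodies)                 // second half kick (3.5)
        for (int d = 0; d < 3; ++d) b.v[d] += b.a[d] * dt / 2.0;
}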
Since each of the n acceleration computations requires summing over the accelerations induced by the gravity of all the other n − 1 bodies j, naively computing the accelerations results in O(n²) complexity. For large systems, this quadratic complexity can be problematic since the runtime of the naive algorithm can get quite long. Thus, other algorithms have been developed to speed up the calculation of the acceleration values. However, such algorithms often apply approximations and thus do not reach the accuracy of the naive approach. One such algorithm, the Barnes-Hut algorithm, will now be introduced in the next section.

3.2.2 The Barnes-Hut Algorithm

The Barnes-Hut algorithm was proposed by Josh Barnes and Piet Hut [BH86]. It is a tree-based method that reduces the complexity of the acceleration computation to O(n log(n)), which makes it much more applicable for large systems compared to the naive approach. The Barnes-Hut algorithm can compute the n different acceleration values a_i more efficiently, since it applies approximations to each of the n calculations. In contrast to the naive algorithm, which considers all interaction pairs for the calculation of one acceleration a_i, the Barnes-Hut algorithm reduces the number of interaction calculations per acceleration value. The general idea of the approximation is that for the calculation of the acceleration a_i of body i, the acceleration is only directly computed for the bodies j that are close to body i. Groups of bodies that are far away are grouped together and are estimated as one big virtual body. When the bodies are far away, the loss of precision of this approximation compared to individually computing all accelerations is not large. In order to determine when to apply this approximation, all bodies are stored in a tree data structure according to their position. In the three-dimensional case, such a data structure would be an octree, in two dimensions a quadtree. The implementation of the algorithm performed during this thesis always considers the three-dimensional case. For easier visualization, all following graphics will be limited to the two-dimensional case with quadtrees. However, all concepts described here can be easily extended to three dimensions.

Figure 3.1: Example of a subdivision of a space containing four bodies numbered from 1 to 4 into a quadtree. In the situation shown on the left there is only one cell that contains all four bodies. This cell then gets recursively subdivided into smaller cells until there is only one body in each cell.

Figure 3.1 shows how a space containing four bodies, numbered from 1 to 4, gets recursively subdivided into smaller cells until there is only one body in each cell. In the corresponding quadtree data structure, each cell c_i corresponds to a node of the tree. The quadtree corresponding to figure 3.1 is visualized in figure 3.2.

Figure 3.2: Example for a quadtree data structure for the quadtree visualized in figure 3.1. Each node of the tree corresponds to one cell. Below the nodes that contain a body, the respective body is shown.

All nodes have to store the sum of the masses of all bodies which are contained in any of the nodes further down the tree and the center of mass of all those bodies. The center of mass of several bodies k can be calculated using equation 3.7.
With that, each non-leaf node of the tree also represents a virtual body located in the center of mass of all the bodies in the cell which is represented by this node, and with a mass of the sum of all masses of those bodies. ®𝐶 = ∑ 𝑘 𝑚𝑘®𝑟𝑘∑ 𝑘 𝑚𝑘 (3.7) For example, cell 𝑐2 in figure 3.2 got subdivided into more cells since it contains a total of 3 bodies: body 2, 3 and 4. Thus, as shown in figure 3.3, the node corresponding to cell 𝑐2 represents the virtual body 𝑣234 located at the center of mass of the bodies 2, 3 and 4 and with a mass corresponding to the sum of the individual masses of the three bodies. During the acceleration computation, for each body 𝑖, the tree is traversed top to bottom. For each node, one has to decide if an approximation should be used, i.e., if only one acceleration induced by the virtual body represented by the current node should be used. In such a case, all nodes and bodies in the subtree of this node will not be considered anymore for the acceleration computation. This can drastically reduce the number of acceleration computations. In the example above this would correspond to the following situation: assume that the goal is the computation of acceleration ®𝑎1. When the algorithm arrives at the node representing cell 𝑐2 it has to make the decision if the virtual body 𝑣234 should be used or if an approximation is not feasible here and one should continue to step down deeper into the tree, computing more precise accelerations with bodies 2, 3 and 4 individually. The decision, if the virtual body 𝑣 represented through the current node under consideration in the tree traversal should be used as an approximation for all other bodies contained in the cell 𝑐 represented by this node is based on the following criterion: 𝑒𝑑𝑔𝑒𝐿𝑒𝑛𝑔𝑡ℎ(𝑐) | |®𝑟𝑣 − ®𝑟𝑖 | | < 𝜃 . (3.8) 18 3.3 SYCL for parallel programming on various hardware 2 3 4 (a) Cell 𝑐2 2 3 4 𝑣234 (b) Virtual body 𝑣234 repre- sented by cell 𝑐2 𝑐0 𝑐1 𝑐2 𝑐3 𝑐4 1𝑣234 (c) Quadtree with the virtual body 𝑣234 instead of the sub- tree containing the individual bodies 2, 3 and 4 Figure 3.3: Example of how nodes of a quadtree represent virtual bodies. Figure 3.3a shows cell 𝑐2, divided into four quadrants and containing the bodies with ID 2, 3 and 4. Figure 3.3b shows cell 𝑐2 with the virtual body 𝑣234 located in the center of mass of all bodies contained in cell 𝑐2. Figure 3.3c shows the quadtree data structure where the virtual body 𝑣234 represents the whole subtree containing the bodies 2, 3 and 4. Here, ®𝑟𝑖 is the position vector of the body 𝑖 whose acceleration ®𝑎𝑖 should be computed, whereas ®𝑟𝑣 corresponds to the position of the virtual body, which is the center of mass of all bodies in the cell. 𝜃 is a constant that defines the accuracy of the algorithm. For higher values of 𝜃, approximations with virtual bodies will be applied more often, which leads to better performance but lower accuracy. Lower 𝜃-values give higher accuracy, but it comes at the cost of more computations and thus longer run-times. For 𝜃 = 0 no approximations are applied and the algorithm would correspond to the naive approach again. The tree creation step of the Barnes-Hut algorithm requires O(𝑛 𝑙𝑜𝑔(𝑛)) calculation steps. Further- more, the time complexity of the computation for the calculation of one acceleration ®𝑎𝑖 is O(𝑙𝑜𝑔(𝑛)). Since in total 𝑛 acceleration values have to be computed per time step, the total complexity of the acceleration computation step is O(𝑛 𝑙𝑜𝑔(𝑛)). 
With that, the total complexity of the Barnes-Hut algorithm reaches O(𝑛 𝑙𝑜𝑔(𝑛)), which is a significant improvement over the naive approach. 3.3 SYCL for parallel programming on various hardware SYCL 2020 [Khr] is a standard that acts as an abstraction layer which enables parallel programming for heterogeneous systems and is developed by the Khronos Group1. SYCL code can be written using only standard ISO C++. The advantage of an application written with SYCL code is that it can be executed in parallel on a variety of devices, such as CPUs and GPUs from different manufacturers or even field programmable gate arrays (FPGAs). In this thesis, SYCL will be used to compare the runtime behavior of different n-body algorithms on different GPUs and CPUs. For creating a SYCL application and running it on different devices one needs an implementation of SYCL. There are a variety of SYCL implementations, each of which include support for different backends, that are then used for executing code in parallel on 1https://www.khronos.org (visited on 04/03/2023) 19 https://www.khronos.org 3 Foundations the selected devices. For example, when executing SYCL code on an NVIDIA GPU, the CUDA backend could be used in the background, whereas code executing on a CPU could be parallelized using OpenMP as a backend. This work uses two different SYCL implementations: Open SYCL2 [AH20] and DPC++3. Open SYCL offers support for NVIDIA and AMD GPUs with the CUDA and the ROCm backend respectively. Furthermore Intel GPUs are supported with Level Zero. Open SYCL also offers broad CPU support through the OpenMP backend. DPC++ supports Intel CPUs, GPUs and FPGAs with the SPIR-V backend and the OpenCL backend. Intel GPUs are also supported with Level Zero. DPC++ also offers support for NVIDIA GPUs through CUDA and AMD GPUs through HIP. However, the AMD HIP backend is currently in experimental state. There also exist other SYCL implementations like, for example, Codeplay ComputeCpp4. However, these implementations will not be considered in this work. The SYCL application written in this work utilizes the CUDA backends of Open SYCL and DPC++ for execution on NVIDIA GPUs and the ROCm backend of Open SYCL for AMD GPU support. For parallel execution on CPUs the OpenMP backend of Open SYCL is used. In the next sections a short introduction into programming with SYCL will be given which explains the high level concepts of how to parallelize code with SYCL and execute it on different devices. 3.3.1 Programming with SYCL The content of this section follows the explanations of the book Data Parallel C++ by Reinders et al. [RAB+21]. On the highest level, a SYCL application can be divided into host and device code. Device code is most of the time the computational intensive part of the program which should get accelerated by parallel execution on a device. Such a device could be a GPU but also a CPU. Host code on the other hand is all the other code of the application, besides the device code. This code gets executed on the CPU of the system. Listing 3.1 shows a simple example of SYCL code. The code initializes a vector with some data on the CPU. After that each entry of the vector is squared. This operation will happen in parallel on a device of the system. The following paragraph will describe the example step by step. In line 1, a queue is created. A queue is associated with one device and used to submit code which should be executed on the device. In this case no device is specified, so the SYCL implementation will select a device. 
However, here it would be possible to specify that, for example, this queue should be associated with a GPU.

2 https://github.com/OpenSYCL/OpenSYCL (visited on 04/03/2023)
3 https://github.com/intel/llvm (visited on 04/03/2023)
4 https://developer.codeplay.com/products/computecpp/ce/home/ (visited on 04/03/2023)

Listing 3.1 Example of how code gets submitted to a device for parallel execution when using SYCL
1 sycl::queue queue;
2
3 std::vector<int> storage = {2, 9, 10, 7, 4, 3, 1, 5};
4 sycl::buffer<int> data_buffer(storage.data(), storage.size());
5
6 int range = storage.size();
7
8 queue.submit([&](sycl::handler &h) {
9
10     sycl::accessor data_accessor(data_buffer, h);
11
12     h.parallel_for(sycl::range<1>(range), [=](auto &i) {
13         data_accessor[i] = data_accessor[i] * data_accessor[i];
14     });
15 }).wait();

The code in line 4 creates a buffer for variables of type int. A buffer is an abstraction layer for data in SYCL. This data can be accessed by the device using an accessor. In the example an accessor is created in line 10. In line 8 a kernel containing device code is submitted to the device with which the previously created queue is associated. The only device code which gets executed in parallel is line 13. Here the previously created accessor is used to modify the data stored in the buffer. By default, device code submitted to a queue gets executed asynchronously to the host code. The code in line 10, however, gets executed directly and not asynchronously. In this example the call to wait() in line 15 causes the host to wait until the device executing the kernel has finished the computation. One instance of the kernel which executes line 13 in parallel to all other instances is called a work-item. The amount of parallelism, or the number of work-items which are used, is controlled in line 12. In this example it corresponds to the size of the buffer. Each of the work-items of the kernel is identified by an ID; in this example the ID is stored in the variable i.

In the previous example, besides the amount of work-items no further specifications about the parallel execution are made. However, SYCL provides a more powerful option to express parallelism: nd-range kernels. For this work, the most important property of nd-range kernels is the ability to group different work-items together into work-groups. For example, one can divide a total of 1024 work-items into 16 work-groups with 64 work-items each. The advantage of this approach is that there exist several work-group-specific functions that can be used to optimize the application. For example, each work-group has its own local memory shared between all work-items of that work-group. Depending on the device which is used for running the kernel, this type of memory can be much faster than the global memory that can be accessed by all work-items. Furthermore, there are several built-in synchronization options between work-items that belong to one single work-group, like barriers or memory fences.

4 Implementation

In this work, two algorithms for the n-body problem, the naive approach and the Barnes-Hut algorithm, have been implemented and parallelized for execution on different hardware using SYCL. All computationally demanding parts of the algorithms have been parallelized with SYCL for execution on CPUs and GPUs, including the leapfrog integration.
In this chapter, the approaches of how the algorithms are implemented for parallel execution will be described. Furthermore, it will be explained how the algorithms are optimized for better performance. 4.1 Implementation of the naive algorithm The implementation of the naive algorithm with SYCL conducted in this thesis is based on the CUDA implementation of this algorithm by Nyland et al. [NHP07]. As described in section 3.2.1, calculating all the 𝑛 accelerations of a system with 𝑛 bodies requires O(𝑛2) acceleration computations. Each of those O(𝑛2) accelerations can be computed independently of each other which makes this approach embarrassingly parallel. After that all accelerations corresponding to one body can be summed up. This means that in theory one could have a maximum of O(𝑛2) calculations executing in parallel. The disadvantage of this method is that it also requires quadratic amount of memory, since all acceleration values have to be stored until the final 𝑛 accelerations can be calculated. Thus, Nyland et al. chose a different approach where only the 𝑛 acceleration computations for each body happen independent of each other in parallel and the 𝑛− 1 acceleration values which emerge through the interaction of one body with all other bodies are calculated sequentially. This approach still results in more than enough parallelism to fully utilize modern GPU and CPU hardware. The implementation of Nyland et al. was performed using CUDA and thus only targets NVIDIA GPUs. The authors describe in detail how they optimized their implementation for execution on GPUs. Since the high degree of parallelism available in the naive algorithm is especially well suited for GPUs, the implementation of the naive algorithm done in this work is based on this GPU implementation by Nyland et al. Their optimization approaches for GPUs programmed with CUDA are also realized in SYCL code. The implementation of the algorithm with SYCL enables execution on both GPUs and CPUs. The impact of the optimization on the runtime on CPUs will be analyzed more closely in chapter 5.3.1. 4.1.1 Optimization for GPUs One optimization approach of the naive algorithm described by Nyland et al. aims to speed up the acceleration computation by using the shared memory of GPUs. In SYCL this corresponds to using local memory which is accessible to all work-items of one work-group. In order to achieve 23 4 Implementation work-group of 𝑘 work-items 𝑘 × 𝑘 block ®𝑎𝑛−1,𝑛−1 . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ... ... ... ... ... ... ... ... . . . ®𝑎1,0 ®𝑎2,0 ®𝑎3,0 ®𝑎4,0 ®𝑎5,0 ®𝑎𝑛−1,0 . . . ®𝑎0,1 ®𝑎2,1 ®𝑎3,1 ®𝑎4,1 ®𝑎5,1 ®𝑎0,2 ®𝑎1,2 ®𝑎3,2 ®𝑎4,2 ®𝑎5,2 ®𝑎0,3 ®𝑎1,3 ®𝑎2,3 ®𝑎4,3 ®𝑎5,3 ®𝑎0,4 ®𝑎1,4 ®𝑎2,4 ®𝑎3,4 ®𝑎5,4 ®𝑎0,5 ®𝑎1,5 ®𝑎2,5 ®𝑎3,5 ®𝑎4,5 ®𝑎0,𝑛−1 ®𝑎1,𝑛−1 ®𝑎2,𝑛−1 ®𝑎3,𝑛−1 ®𝑎4,𝑛−1 ®𝑎5,𝑛−1 Figure 4.1: Grouping of the acceleration computations into work-groups of size k=3. The compu- tation of the acceleration values in each work-group happens in blocks of size 𝑘 × 𝑘 (highlighted in blue) in order to make use of the work-groups local memory. Modified from Nyland et al. [NHP07] Figure 31-4. this optimization with SYCL, the first step is to group the 𝑛 acceleration value computations into work-groups of a fixed size 𝑘 . Figure 4.1 shows an example of how the individual calculations for all acceleration values ®𝑎𝑖, 𝑗 are grouped into work-groups for 𝑘 = 3. Here 𝑖 denotes the index of the body whose acceleration ®𝑎𝑖 is calculated by the 𝑖-th work-item. 
®𝑎𝑖, 𝑗 corresponds to an individual accelerations induced by one of the other 𝑛 − 1 bodies 𝑗 . The computations of the acceleration values in each work-group happen in blocks. For a work-group consisting of k work-items, the 𝑘 · (𝑛 − 1) acceleration values are calculated in blocks of size 𝑘 × 𝑘 . This is necessary for the usage of the local memory. Before each block, each work-item loads the values of one of the 𝑘 bodies into the local memory. After that all work items that belong to this work-group are synchronized through a barrier. Now, all values needed for the acceleration computation (mass and position) of the 𝑘 bodies of the current block are accessible to all work-items of the work-group in local memory. Each work-item has to only load the values of one body from the slower global memory into the faster local memory. After the synchronization, each work-item has fast access to the position and mass values of all 𝑘 bodies of the block and can now compute the 𝑘 accelerations induced by these bodies to the body associated with the respective work-item. For example, in the first block of the second work-group in figure 4.1, the work-item with ID 3 would load all values of body 0 into the local memory, making it also accessible with fast access for work-items 4 and 5. Work-item 4 and 5 will do the same for the bodies with ID 1 and 2 respectively. After each work-item has finished the calculation of the acceleration values, the 𝑘 work-items synchronize before the next position and mass values for the next 𝑘 bodies are loaded into the local memory. 24 4.2 Implementation of the Barnes-Hut algorithm In the variant described above, the size 𝑘 of a work-group would have to divide the total amount of bodies, since in SYCL all work-groups must have the same size. Since the number of bodies can be arbitrary, it would limit the possible values for 𝑘 . However, the size 𝑘 of the blocks is not irrelevant. For example for GPUs it is best to choose 𝑘 as a multiple of a power of 2. In order to allow arbitrary values for 𝑘 , a padding is added to the global range of all work-groups so that it is divisible by 𝑘 . With this change there exist a few work-items in the last work-group that are not associated with any acceleration computation. It only requires small changes to handle this case and ensure that only work-items associated with an acceleration value ®𝑎𝑖 perform the calculations. However, the work-items that do not perform any calculations of acceleration values are still needed to load values into the local memory. 4.2 Implementation of the Barnes-Hut algorithm The Barnes-Hut algorithm [BH86] consists of two major parts: the octree creation and the computation of the acceleration values using the octree. The biggest challenge when implementing this algorithm is the parallelization of the octree creation, especially on GPU hardware. An implementation of the Barnes-Hut algorithm on GPUs with a classical octree was conducted by Burtscher et al. [BP11] using CUDA. The SYCL implementation of the Barnes-Hut algorithm on CPUs and GPUs performed in this thesis will be based on the implementation of Burtscher et al. In their CUDA implementation of the Barnes-Hut algorithm, Burtscher et al. divided the Barnes-Hut algorithm further into several steps. First, the bounding box of all bodies used for the simulation is calculated. After that the octree data structure gets build up, followed by the computation of the center of mass for each node. To minimize thread-divergence, Burtscher et al. 
introduced a new step before the acceleration computation: an in-order sorting of all bodies in the octree. These concepts are realized in this thesis with the usage of SYCL. The following sections will describe in more detail how the Barnes-Hut algorithm was implemented using SYCL for execution on CPUs and GPUs. 4.2.1 Parallel octree creation Burtscher et al. use a parallel top-down approach for creating the octree data structure. At the beginning, the octree consists of only one node, the root node, which represents a cell corresponding to the bounding box of all bodies. In the SYCL implementation of this work, every work-item gets assigned a subset of all bodies that it has to insert into the tree. One work-item can insert bodies in parallel to other work-items. When a work-item inserts a body into the octree, it traverses the tree top to bottom starting with the root node. Stepping further down into the tree does not require any synchronization, since no modifications are made. If a leaf node is reached, the insertion point for the current body is found in the tree. Since modifications of the node will happen in the next step, it is necessary to lock the node and prevent other work-items from working on it while the current work-item tries to insert the body. If the leaf node is empty, i.e., does not already contain a different body, the body can be inserted and the work-item can continue with the next body assigned to it. If the leaf node already contains a body, the node has to be split, i.e., it has to become an inner node and eight new child nodes have to be created. 25 4 Implementation When making such changes, like splitting a node and creating eight new nodes, or inserting a body into a node, one needs to be aware of the following. After releasing the lock that locks the node, it is not guaranteed that other work-items will see the changes made to the node. Thus, a memory fence is needed to ensure that the view of all work-items on the global memory is consistent after modifying the tree. Memory fences are supported by SYCL, however, different SYCL implementations vary a lot in their implementation and feature support of this function. DPC++ already supports the atomic_fence function introduced with SYCL 2020 whereas currently this function is not yet supported by Open SYCL. Open SYCL only supports the mem_fence function of the older SYCL 1.2.1 specification [Khr20]. In this specification however, there exist only memory consistency guarantees for work-items in one work-group when using a mem_fence. But no consistency guarantees are made for work-items in different work-groups (see [Khr20] chapter 3.5.2.2.). The implementation of atomic_fence in DPC++ did not guarantee consistency across different work-items that belong to different work-groups during testing with the CUDA backend. However, these issues would require more research which is beyond the scope of this thesis. Furthermore in the current revision 6 of the SYCL 2020 standard, the SYCL memory model is not completely formalized yet (see [Khr22] chapter 3.8.3.2.). Due to this current state and the goal that the implementation of the Barnes-Hut algorithm should be compatible with several backends and SYCL implementations, it can not be assumed that there are any consistency guarantees across work-groups when using memory fences with SYCL. Thus, this limits the approach described above to one work-group in SYCL. 
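Within that single work-group, the locking scheme described above can be sketched as follows. This is a schematic fragment with invented names, not the thesis code; it uses the SYCL 2020 atomic_ref and atomic_fence functions mentioned above, so on Open SYCL the fence would currently have to be replaced by the older mem_fence, and splitting a full leaf into eight children is omitted for brevity.

#include <sycl/sycl.hpp>

// Lock values for a tree node (illustrative).
constexpr int UNLOCKED = 0;
constexpr int LOCKED   = 1;

// Schematic device-side routine: try to insert `body_idx` into leaf `leaf_idx`.
// `locks` and `node_body` point into global memory; all work-items that may touch
// this leaf belong to the same work-group, so work_group scope is sufficient here.
inline bool try_insert(int *locks, int *node_body, int leaf_idx, int body_idx) {
    sycl::atomic_ref<int, sycl::memory_order::acq_rel,
                     sycl::memory_scope::work_group,
                     sycl::access::address_space::global_space>
        lock(locks[leaf_idx]);

    int expected = UNLOCKED;
    // Try to acquire the per-node lock; give up (and retry later) if it is already taken.
    if (!lock.compare_exchange_strong(expected, LOCKED))
        return false;

    if (node_body[leaf_idx] == -1) {
        // Leaf is empty: store the body. (Splitting a non-empty leaf is omitted here.)
        node_body[leaf_idx] = body_idx;
    }

    // Make the modification visible to the other work-items of this work-group
    // before the lock is released.
    sycl::atomic_fence(sycl::memory_order::release, sycl::memory_scope::work_group);
    lock.store(UNLOCKED);
    return true;
}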
Due to device specific constraints this also limits the amount of parallelism that is available since work-groups are not allowed to be arbitrarily large. For example for most GPUs tested during this work the limit is 1024 work-items per work-group. Limiting the whole octree creation to one work-group would thus result into a non-optimal usage of the resource of a device, especially for highly parallel GPU hardware. Thus a new approach had to be taken which makes a few changes to the algorithm in order to achieve more parallelism with SYCL. This approach will be explained in the following section. Leveraging subtrees for more parallelism The general idea of the approach that will be described in this section is that it builds the octree with the same method as previously, but in two steps. First, the top of the tree gets build up to a certain level that can be specified using a parameter. Figure 4.2 visualizes this step with an example for a maximum level of two. Here, ten bodies with IDs from 1 to 10 are inserted into the upper part of the tree. In this example the tree corresponds to a quadtree for easier visualization. However the concept can be easily extended to an octree. The body with ID 4 got already inserted into a node on level one, thus no subtree originates here. The other bodies have not reached their final position on a level smaller or equal than two. This means that the node on level two into which they would have been inserted becomes the root node of their subtree. For example, in figure 4.2 the bodies with ID 1, 2 and 3 will belong to the same subtree. Figure 4.3 shows how the subtrees which were determined in the first step are build in a second step. The benefit is that each of these subtrees can now be build independently of the other subtrees in separate work-groups. More importantly, if changes are made to one of the subtrees, it is only important that all work-items of the work-group associated with this subtree see these changes. 26 4.2 Implementation of the Barnes-Hut algorithm Work-items belonging to other work-groups will never access nodes of this subtree, thus it is not important that their view on this part of the memory is consistent. This means that with this approach memory fences which only have effects on one work-group are fully sufficient. {1,2,3} 4 {5,6,7} {8,9,10} Figure 4.2: Building the tree to a maximum level of two to determine the subtrees for each body. All leaf nodes in this example that contain a set of more than one body (represented through their ID) will become subtree root nodes. Body 4 got already inserted at level one. Thus body 4 has reached its final position and will not be part of any subtree 1 3 2 4 5 7 6 10 8 9 Figure 4.3: Creating all subtrees independently from each other after all nodes that belong to a subtree have been determined With one work-group per subtree, one has much more work-groups and parallelism available in the second step. However, one is still limited to one work-group for the creation of the top of the octree and has additional overhead for distributing the bodies to the work-groups associated with the subtree that will contain the body. The performance differences of this approach with subtrees and the approach which is limited to one work-group will be further analyzed in chapter 5.3.2. 27 4 Implementation 4.2.2 Calculating the center of mass After the octree data structure is build, one can now calculate the center of mass and the sum of all masses for each node. Burtscher et al. 
extracted this part of the calculation into a separate step. It is theoretically possible to calculate the center of mass on the fly during the octree creation. However this requires extensive use of atomic operations and during the implementation process it turned out that this drastically reduces performance. Thus, in the SYCL implementation of the Barnes-Hut algorithm conducted during this thesis the computation of the center of mass is also performed in a separate step. The general idea of Burtscher et al. is to traverse the octree bottom-up in parallel and compute the center of mass and sum of masses for each node. For the bottom-up traversal the authors make use of ordering guarantees for the order in which the nodes are stored. In the SYCL implementation of the algorithm in this work all nodes of the tree are stored in one buffer. The next insertion index is determined atomically during the octree creation. With this method it is guaranteed that child nodes will have a higher index as their parents. The data structure does not store pointers to the parent node of a child node explicitly. But because of this ordering guarantee, the octree can be traversed bottom up by starting with the last node in the buffer that stores all nodes and iterating through it until the root node at index 0 is reached. In the algorithm each work-item gets assigned a certain subset of the nodes and starts with the node that has the highest index. If the node is a leaf node the center of mass can be calculated directly. If it is an inner node of the octree, it has to be checked if all child nodes of this node are already processed. If this is the case, the center of mass and sum of masses can be calculated by combining the respective values of the child nodes, if not, the work-item continues with the next node and revisits the node in a subsequent iteration over all nodes assigned to the work-item. This happens as long as not all nodes are processed yet. Burtscher et al. use the sum of masses value as a flag to indicate if a node has already been processed. This enables the algorithm to work without atomic operations. However, since it is not guaranteed that if a work-item sees the center of mass value, it will also see the center of gravity value of the node, a memory fence is needed again. For the same reasons as with the octree creation, the usage of memory fences limits this kernel to one work-group for the SYCL implementation. Using an approach with subtrees again is much harder for this kernel, since with the current SYCL implementation all nodes are stored in one buffer, independent of what subtree they belong to. This makes it hard to filter out nodes belonging to one specific subtree without overhead. Separating the storage space, so that all nodes of one subtree are stored right next to each other is also not trivial, since estimating how many nodes will be created for one subtree and thus how much storage is needed is not simple. Even a subtree containing only two bodies can become really deep if the bodies are close to each other. This scenario is not unrealistic in astrophysics since such a case could correspond to a typical planet-moon system. Thus, in the SYCL implementation of this thesis most of the computations for the center of mass are limited to one work-group. The assignment of nodes to work-items is a central part of this kernel and important for its runtime behavior. 
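The per-work-item logic just described can be sketched as follows. The sketch is purely illustrative (invented names and node layout, written as ordinary host code) and shows the sweep of a single work-item as if it owned all nodes; in the actual kernel the nodes are distributed over the work-items of one work-group, which is why unprocessed nodes have to be revisited and why the memory fence discussed above is needed. Empty leaf nodes are not considered.

#include <array>
#include <vector>

// Illustrative node storage: children hold the indices of the child nodes (or -1);
// sum_mass == 0.0 doubles as the "not processed yet" flag, as described above.
struct TreeNode {
    int    children[8];
    int    body;       // body stored in a leaf node, -1 otherwise
    double com[3];     // center of mass, valid once sum_mass != 0
    double sum_mass;   // 0.0 means "not processed yet"
};

// Schematic version of the work a single work-item performs: it repeatedly sweeps over
// the node indices assigned to it (highest index first) until all of them are processed.
void process_assigned_nodes(std::vector<TreeNode> &nodes,
                            const std::vector<double> &body_mass,
                            const std::vector<std::array<double, 3>> &body_pos,
                            std::vector<int> assigned /* highest index first */) {
    std::vector<int> remaining = std::move(assigned);
    while (!remaining.empty()) {
        std::vector<int> next_round;   // cache of nodes that could not be processed yet
        for (int idx : remaining) {
            TreeNode &node = nodes[idx];
            double m = 0.0, com[3] = {0.0, 0.0, 0.0};
            bool ready = true;
            if (node.body >= 0) {      // leaf node: take mass and position of its body
                m = body_mass[node.body];
                for (int d = 0; d < 3; ++d) com[d] = body_pos[node.body][d];
            } else {                   // inner node: combine the already processed children
                for (int c : node.children) {
                    if (c < 0) continue;
                    if (nodes[c].sum_mass == 0.0) { ready = false; break; }
                    m += nodes[c].sum_mass;
                    for (int d = 0; d < 3; ++d) com[d] += nodes[c].sum_mass * nodes[c].com[d];
                }
                for (int d = 0; d < 3; ++d) com[d] /= (m > 0.0 ? m : 1.0);
            }
            if (ready) {
                for (int d = 0; d < 3; ++d) node.com[d] = com[d];
                node.sum_mass = m;               // written last: acts as the "processed" flag
            } else {
                next_round.push_back(idx);       // revisit in a later sweep
            }
        }
        remaining = std::move(next_round);
    }
}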
Since for nodes with higher indices it is more likely that they can be processed directly, these nodes should be processed first by the work-items. Figure 4.4 shows an example of two methods for assigning nodes two work-items. In this example 14 nodes get assigned to two work-items. Method 1 is based on the method described by Burtscher et al. where work-item 𝑖 starts from the back with the node that has the highest index and processes every 𝑖-th node of the array. 28 4.2 Implementation of the Barnes-Hut algorithm work-item 1 work-item 2 Method 1: Method 2: Figure 4.4: Example for the two methods of assigning nodes to two work items. Each cell in the arrays represents a node, the color denotes the work-item to which a node gets assigned. The arrows indicate the processing direction of each work-item. Method 2 performs worse since a lot of the computations of work-item 2 will depend on computations that work-item 1 has to do. Based on [BP11] p. 82, figure 6.10. Method 2 shows a different mapping between work-items and nodes. Here all nodes in the array are split into blocks and each block gets assigned to one work-item. Each work-item will again start processing with node that has the highest index. However, since in the SYCL implementation of this work child nodes always have higher indices as their parent node and the processing of the parent node depends on the processing of the child node, it is likely that many nodes assigned to work-item 2 in the example depend on nodes assigned to work-item 1, forcing work-item 2 to wait more often. Because of this it is better to choose the first method for assigning nodes to work-items. However, method 1 does not work with the Open SYCL OpenMP backend on CPUs, since the algorithm does not terminate. On GPUs this methods works reliable with various backends. Since the performance gain of the first method over the second method on GPUs is not negligible, two separate kernels for GPUs and CPUs were implemented with SYCL. It is not fully clear why the method does not work with the Open SYCL OpenMP backend on CPUs and would require further research. One reason could be that method one of this algorithm heavily relies on each work-item making progress eventually. However, this property is not guaranteed by the SYCL standard and therefore depends on the implementation (see [Khr22] chapter 3.8.3.4.). Optimizations for the center of mass computation As described by Burtscher et al. the center of mass and sum of masses values of leaf nodes can be directly calculated since the computation of those values does not depend on any other node. This property is even more important in the SYCL implementation of this algorithm performed during this thesis. Since there are no dependencies for the computation of these values, a memory fence is not needed. Thus, in the implementation with SYCL, the calculations for all leaf nodes are extracted into a separate kernel. In this kernel a maximum amount of parallelism can be used since one is not 29 4 Implementation limited to one work-group anymore. Furthermore, it reduces the amount of work for the second kernel which is limited to only one work-group. Additionally, since the center of mass is already calculated for each leaf node, their parent nodes can directly be processed in the second kernel. Since each work item often has to iterate over all nodes assigned to it several times it is better to cache nodes that still have to be processed. This optimization was also performed by Burtscher et al. 
In the implementation done during this thesis, all nodes that could not be processed during the first phase are stored in a separate buffer. For all subsequent iterations only theses nodes are considered. Like in the implementation of this kernel by Burtscher et al., the number of bodies that are contained in the cell represented through a node gets stored for each node. This is important for the sorting of the bodies which will be described in the next section. 4.2.3 Sorting the bodies with the tree This part is not needed for the correctness of the algorithm. However, it is important for improving the performance of the acceleration computation on GPUs. Burtscher et al. use this step to minimize thread divergence on GPUs. The order of the bodies after the sorting step will correspond to the in-order tree-traversal of the octree that has been created. In the SYCL implementation of this work, for each body the tree is traversed top to bottom until the leaf node of this body is reached. During each step, when a work-item steps down one level in the tree it has to determine how many nodes have to be stored before this node. This can be done using the value of how many bodies belong to a cell represented by a node which was calculated during the center of mass computation. The algorithm increments a counter with this value for all nodes whose bodies will be stored before the bodies of the current node, each time it steps down in the tree. The insertion index of the current body in the buffer of the bodies sorted according to an in-order tree-traversal is, thereby, determined when the leaf node containing the body is reached. 4.2.4 Computing the acceleration for each body For each acceleration value ®𝑎𝑖 the octree has to be traversed top to bottom. However, it is not necessary to traverse the tree up to the leaf level every time. In the SYCL implementation of the Barnes-Hut algorithm this kernel has no constraints for the maximum amount of parallelism. Thus, one work-item computes exactly one acceleration value ®𝑎𝑖 . In order to have more control over the mapping from work-groups to the corresponding primitives of the device specific backend, the work-group size can be explicitly specified via a parameter. To allow arbitrary work-group sizes a padding is added in a similar way as described in section 4.1.1. Each work-item traverses the tree top to bottom with the usage of a stack. For each node two cases have to be considered. The first case is that the center of mass value of the node can be used as an approximation for the computation of ®𝑎𝑖. In this case, all nodes belonging to the subtree of this node do not have to be considered anymore. In the other case, if the approximation can not be made, all child nodes of the node have to be pushed on a stack, indicating that they still have to be processed later. 30 4.2 Implementation of the Barnes-Hut algorithm In order to minimize thread divergence it is desired that threads in one warp always execute code belonging to the same branch, i.e., consider the same of the two cases described above. This means that the decision whether an approximation can be used or not should be the same for the threads in one warp. It is likely that this is the case if the acceleration values the threads are computing belong to bodies that are close to each other spatially. This is why Burtscher et al. 
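A schematic version of this stack-based traversal for a single work-item is shown below. It assumes a flattened tree in which the eight children of an inner node are stored contiguously; the names, the data layout and the stack bound are chosen for this illustration and do not correspond to the thesis code.

#include <cmath>

constexpr int MAX_STACK_DEPTH = 128;   // illustrative bound, must cover the deepest tree
constexpr double G = 6.674e-11, EPS2 = 1e-9, THETA = 0.6;

// Flattened tree as it could be stored on the device: plain arrays indexed by node id.
// first_child[n] < 0 marks a leaf; otherwise the eight children are stored contiguously.
struct FlatTree {
    const int    *first_child;
    const double *com_x, *com_y, *com_z;   // centers of mass (equation 3.7)
    const double *sum_mass;                // summed masses
    const double *edge_length;             // cell edge lengths
};

// Stack-based traversal for one body, as executed by a single work-item.
void accelerate_one_body(const FlatTree &t, double rx, double ry, double rz,
                         double &ax, double &ay, double &az) {
    int stack[MAX_STACK_DEPTH];
    int top = 0;
    stack[top++] = 0;                      // start at the root node

    while (top > 0) {
        const int n = stack[--top];
        const double dx = t.com_x[n] - rx, dy = t.com_y[n] - ry, dz = t.com_z[n] - rz;
        const double dist2 = dx * dx + dy * dy + dz * dz;
        const bool is_leaf = t.first_child[n] < 0;

        // edge/dist < theta is rewritten as edge^2 < theta^2 * dist^2 to avoid a sqrt.
        if (is_leaf || t.edge_length[n] * t.edge_length[n] < THETA * THETA * dist2) {
            // Criterion 3.8 holds (or the node is a leaf): use the node's virtual body.
            const double d2 = dist2 + EPS2;
            const double inv_d3 = 1.0 / (d2 * std::sqrt(d2));
            ax += G * t.sum_mass[n] * dx * inv_d3;
            ay += G * t.sum_mass[n] * dy * inv_d3;
            az += G * t.sum_mass[n] * dz * inv_d3;
        } else {
            // Otherwise all child nodes still have to be processed later.
            for (int c = 0; c < 8; ++c)
                stack[top++] = t.first_child[n] + c;
        }
    }
}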
introduced the in-order sorting step of all bodies, since bodies which are close to each other in the in-order sorting of the octree are also likely to be close to each other spatially. With SYCL, however, there is no direct control over the mapping of specific calculations to warps, so one has to rely on the SYCL implementation for this part. In the SYCL implementation of the Barnes-Hut algorithm there is a one-to-one mapping between the global ID of a work-item and the index of the buffer holding the sorted bodies. More specifically, the work-item with ID 0 will compute the acceleration of the first body in the buffer of sorted bodies, and so on. As a result, work-items with consecutive IDs compute acceleration values for bodies that also appear in consecutive order after the in-order sorting of all bodies. The effect of this performance optimization will be analyzed more closely in section 5.3.3.

5 Analysis of the runtime behavior

In this chapter the runtime behavior of the two algorithms implemented with SYCL will be analyzed. First, the two algorithms will be considered separately. For each algorithm there will be experiments analyzing the impact of their different parameters. After that, the performance optimizations described in chapter 4 will be evaluated. Last, the overall runtime behavior of the two algorithms will be compared on different CPUs as well as on GPUs from different vendors.

5.1 Experiment setups

Unless stated otherwise, Open SYCL will be used for the experiments since it is compatible with all the hardware used in the experiments through the CUDA, ROCm and OpenMP backends. CUDA version 11.4.3 is used on all systems with NVIDIA GPUs and ROCm 5.3.3 is used for the AMD GPU. The DPC++ version used for all experiments is sycl-nightly/20221102. For Open SYCL, commit 4a04f1c is used. Open SYCL was built against the LLVM compiler of the DPC++ installation.

For most runs, a simulation interval of 10 earth days was chosen. With a Δ𝑡 of one hour, this results in 241 acceleration computations including the computation of the initial acceleration values. For some runs it was necessary to restrict the number of acceleration computations to 25 or 13 due to the very long runtime. All individual timings of each run have been averaged. Generally, it was observed that there are a lot more fluctuations in the runtimes on CPUs than in the runtimes on GPUs.

The runtimes of the leapfrog integration steps are not included in the time measurements. One reason for this is that the integration is the same for the naive algorithm and the Barnes-Hut algorithm and thus does not contribute anything to a comparison of both algorithms. Furthermore, its runtime is negligible compared to the runtime of the other kernels of the simulation. For example, even for the largest dataset the integration steps take only about half a millisecond on an NVIDIA RTX 3090.

The datasets used for the simulation contain real-world data of objects in our solar system. The data originates from a database of the NASA Jet Propulsion Laboratory1. For the experiments six differently sized datasets were provided, containing 178, 999, 2678, 19054, 138723 and 1216869 bodies, respectively. Table 5.1 shows the technical specifications of all five test systems that will be used during the experiments. Hardware that is explicitly used for the experiments is highlighted in bold.
1 https://ssd.jpl.nasa.gov/tools/sbdb_query.html (visited on 04/03/2023)

System 1: GPU NVIDIA A100 (40GB HBM2, 9.7 TFLOPS FP64); CPU 2x AMD EPYC 7742 (128 cores, 3.4 GHz)
System 2: GPU NVIDIA Quadro GP100 (16GB HBM2, ~5 TFLOPS FP64); CPU 2x Intel Xeon Silver 4116 (24 cores, 3.0 GHz)
System 3: GPU NVIDIA GeForce RTX 3090 (24GB GDDR6X, 0.556 TFLOPS FP64); CPU AMD Ryzen Threadripper 3960X (24 cores, 4.5 GHz)
System 4: GPU AMD Radeon PRO VII (16GB HBM2, 6.5 TFLOPS FP64); CPU Intel i7-6700K (8 cores, 4.2 GHz)
System 5: no GPU; CPU 2x AMD EPYC 7543 (64 cores, 3.7 GHz)

Table 5.1: Overview of all systems used for the experiments. Hardware that is explicitly used for the experiments is highlighted in bold. All the other hardware of the systems is also shown for completeness. The CPU clock speed corresponds to the maximum single-core speed as specified by the manufacturer. The information for the hardware specifications has been retrieved from [NVI17], [NVI20], [NVI21], [AMD20b], [AMD20a], [Intb], [AMD], [Inta], [AMD22]. If no FP64 performance was specified, the number of tera floating point operations per second (TFLOPS) was calculated using the FP32 performance and the single- to double-precision ratio of the respective GPU.

For experiments on the CPU, hyper-threading (HT) was enabled and the omp.accelerated target of Open SYCL was used. All calculations were performed using FP64 double precision. If not stated otherwise, all runtimes correspond to the time of one time step without leapfrog integration. If the runtime corresponds to a specific part of the algorithm, the runtime of this part in one time step is given.

5.2 Impact of the parameters on the runtime behavior of the algorithms

In this section the impact of the different parameters of both algorithms on their runtime behavior will be analyzed. First, the block size of the naive algorithm will be considered. After that, the different parameters of the Barnes-Hut algorithm will be studied, starting with the parameters that influence the runtime behavior of the octree creation: the maximum level of the top of the octree and the amount of parallelism available for the first and second phase of the octree creation. Last, the impact of the 𝜃-value, which is used to control the accuracy of the Barnes-Hut algorithm, on the runtime of the acceleration kernel will be analyzed.

5.2.1 Impact of the block size on the naive algorithm

The goal of the experiments presented in this section is to analyze the impact of the block size of the naive algorithm on its runtime behavior. The first experiment analyzes the impact of the block size of the naive algorithm on GPUs. In the experiment, the runtime of the naive algorithm with different block sizes was measured on an NVIDIA Quadro GP100 using all six datasets.

Figure 5.1: Comparison of different block sizes 𝑘 for the naive algorithm on an NVIDIA Quadro GP100 GPU. (a) Runtime comparison between different block sizes 𝑘 and different numbers of bodies. (b) Comparison of different block sizes for 19054 bodies. Figure 5.1a shows the runtime behavior for four different block sizes on all datasets; a block size of 128 performs well for most datasets. Figure 5.1b shows the runtimes of several block sizes for the dataset containing 19054 bodies; the blue bars denote block sizes that are a power of two whereas the red bars correspond to block sizes that are divisors of 19054 and can be used as block sizes.
It can be clearly seen that most block sizes that are a power of two perform better than the divisors of 19054.

Figure 5.1a shows the runtimes of the naive algorithm for all datasets and for different block sizes 𝑘. Each curve corresponds to one block size. The number of bodies used for the simulation is shown on the x-axis and the y-axis corresponds to the runtime in milliseconds. Both the x-axis and the y-axis are logarithmically scaled. One can observe that small values for 𝑘 like 16 do not perform well for large numbers of bodies. With 𝑘 = 16 the naive algorithm takes about 38.8s for the largest dataset, whereas with 𝑘 = 128 it only takes about 20.6s. This could be explained by the fact that for small block sizes the overall number of blocks is much higher than for large block sizes. Thus, one has more overhead for the blocking compared to larger block sizes, which perform much better for large body counts. However, in figure 5.1a it can also be recognized that large values like 𝑘 = 1024 do not perform well for small numbers of bodies. The latter cannot be observed for the smallest dataset of 178 bodies since there cannot be more than 178 work-items performing calculations, thus larger 𝑘 do not make a difference.

Figure 5.1b shows a bar plot with the runtimes of the naive algorithm for the dataset with 19054 bodies. Each bar corresponds to one block size 𝑘. In addition to the values for 𝑘 which are highlighted in blue and are a power of two, four other block sizes are also considered: 𝑘 = 1, 𝑘 = 2, 𝑘 = 7 and 𝑘 = 14. These values correspond to all divisors of 19054 that are smaller than or equal to the maximum work-group size of 1024 and are highlighted in red. They are interesting because they would be the only valid work-group sizes, and thus block sizes, one could use without a padding that allows for arbitrary 𝑘. Since the runtimes for the divisors are a lot longer than the runtimes for most of the 𝑘 which are a power of two, one can conclude that the work-group size should be chosen as a power of two for optimal performance. Further testing conducted during this thesis showed that a multiple of a power of two is also sufficient. One can see that 𝑘 = 128 performs well for most datasets. Thus, this value will be used for all future experiments with the naive algorithm on GPUs unless stated otherwise.

In order to study how the block size influences the runtime behavior on CPUs, the previous experiment was repeated on CPUs. The system used for this experiment contains a dual socket AMD EPYC 7543 with a total of 64 cores and HT enabled.

Figure 5.2: Comparison of different block sizes 𝑘 for the naive algorithm on a dual socket AMD EPYC 7543 with 64 CPU cores. (a) Runtime comparison of different block sizes 𝑘 and different numbers of bodies. (b) Comparison of different block sizes for 19054 bodies. Figure 5.2a shows the runtime behavior for four different block sizes on all datasets; the x-axis corresponds to the number of bodies contained in the dataset. Figure 5.2b shows the runtimes of several block sizes for the dataset containing 19054 bodies; the blue bars denote block sizes that are a power of two whereas the red bars correspond to block sizes that are divisors of 19054 and can be used as block sizes. The fact that the block size is a power of two does not play an important role on CPUs; the overall size of the value is more important.

Figure 5.2 shows the results of this experiment in the same scheme as used for the previous experiment.
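Before turning to the CPU results in detail, the following sketch illustrates how the block size 𝑘 enters the naive kernel. It is only a rough illustration under assumed names (pos_x, mass, acc_x and so on are placeholders, not the thesis code), with the gravitational constant omitted and an arbitrary softening term: the global range is padded up to a multiple of 𝑘 so that any block size is a valid work-group size, and each work-group caches one block of 𝑘 bodies in local memory, which corresponds to the stage-two optimization evaluated in section 5.3.1.

#include <cstddef>
#include <sycl/sycl.hpp>

// Blocked naive kernel sketch: block size k equals the work-group size.
void launch_blocked_naive(sycl::queue &q, std::size_t n, std::size_t k,
                          const double *pos_x, const double *pos_y,
                          const double *pos_z, const double *mass,
                          double *acc_x, double *acc_y, double *acc_z) {
  const std::size_t padded = ((n + k - 1) / k) * k;  // padding for arbitrary k
  q.submit([&](sycl::handler &cgh) {
    sycl::local_accessor<double, 1> tile_x{sycl::range<1>(k), cgh};
    sycl::local_accessor<double, 1> tile_y{sycl::range<1>(k), cgh};
    sycl::local_accessor<double, 1> tile_z{sycl::range<1>(k), cgh};
    sycl::local_accessor<double, 1> tile_m{sycl::range<1>(k), cgh};
    cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>(padded), sycl::range<1>(k)},
                     [=](sycl::nd_item<1> it) {
      const std::size_t i = it.get_global_id(0);
      const std::size_t lid = it.get_local_id(0);
      double ax = 0.0, ay = 0.0, az = 0.0;
      const double xi = (i < n) ? pos_x[i] : 0.0;
      const double yi = (i < n) ? pos_y[i] : 0.0;
      const double zi = (i < n) ? pos_z[i] : 0.0;
      for (std::size_t base = 0; base < padded; base += k) {
        const std::size_t j = base + lid;        // body this work-item loads
        tile_x[lid] = (j < n) ? pos_x[j] : 0.0;  // padded entries get zero mass
        tile_y[lid] = (j < n) ? pos_y[j] : 0.0;
        tile_z[lid] = (j < n) ? pos_z[j] : 0.0;
        tile_m[lid] = (j < n) ? mass[j] : 0.0;
        sycl::group_barrier(it.get_group());
        for (std::size_t l = 0; l < k; ++l) {    // accumulate over the cached block
          const double dx = tile_x[l] - xi, dy = tile_y[l] - yi, dz = tile_z[l] - zi;
          const double r2 = dx * dx + dy * dy + dz * dz + 1e-11;  // softening assumed
          const double inv_r = 1.0 / sycl::sqrt(r2);
          const double s = tile_m[l] * inv_r * inv_r * inv_r;     // gravitational constant omitted
          ax += s * dx; ay += s * dy; az += s * dz;
        }
        sycl::group_barrier(it.get_group());
      }
      if (i < n) { acc_x[i] = ax; acc_y[i] = ay; acc_z[i] = az; }
    });
  });
}

Padded work-items still take part in the group barriers and simply load zero mass, so no extra branches are needed inside the inner loop.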
In figure 5.2a it can be observed that smaller values for 𝑘 perform much better for smaller datasets. For large numbers of bodies this cannot be observed anymore and large 𝑘 even perform a little bit better. For example, for the largest dataset 𝑘 = 16 would need about 125.6s whereas 𝑘 = 128 needs only 117.4s.

Figure 5.2b shows the runtime for different block sizes with the dataset containing 19054 bodies in more detail. As in the previous experiment, in addition to the values for 𝑘 that are a power of two, all divisors of 19054 smaller than 1024 are shown again. Again, most of the divisors of 19054 do not perform well. However, this is most likely due to their small size, which results in a large number of blocks and thus increases the overhead. Unlike on GPUs, 𝑘 = 14 performs quite well on CPUs. From that one can conclude that, unlike on GPUs, it is not that important that 𝑘 is a power of two and the overall size of 𝑘 plays a bigger role.

The observation that small block sizes perform better than big block sizes for small numbers of bodies could be explained by an increased probability of cache hits when small blocks are used. However, this advantage vanishes for large numbers of bodies since the small block size creates additional overhead. A value of 𝑘 = 128 performs well for most datasets. Thus, this value will be used for all future experiments with the naive algorithm on CPUs unless stated otherwise.

5.2.2 Impact of the Barnes-Hut tree-building parameters

The goal of the experiments presented in this section is to study how the different parameters that adjust the octree creation of the Barnes-Hut algorithm influence the runtime of the octree creation on CPUs and GPUs. The first parameter that will be considered is the maximum level to which the top of the tree is built in the first step. The subtrees that emerge in the first step are then built in parallel in the second step. The second and the third parameter that will be analyzed in this section determine the amount of parallelism used for the first and second step of the tree creation, respectively. Generally, it has to be noted that the runtime behavior of the octree creation highly depends on the dataset used for the simulation since the positioning of the bodies in space determines the structure of the tree. Thus, the parameters might behave differently on different datasets.

The depth of the top of the octree

The experiment presented in this section aims to analyze how to choose the maximum tree depth for the top of the octree that is built in the first part of the octree creation. In the experiment the runtime of the whole octree creation was measured using different values for the maximum tree depth for the top of the octree. Generally, a higher value for this maximum depth will result in more work for the first phase, where less parallelism is available, and generate many small subtrees. A smaller maximum tree depth for the top of the octree will result in less work for the first phase and generates fewer, but larger subtrees.

Figure 5.3 shows the results of the experiment on an NVIDIA Quadro GP100 GPU and a dual socket AMD EPYC 7543 CPU. The x-axes correspond to the number of bodies used for the simulation and the y-axes show the time for the octree creation in milliseconds. Both axes are logarithmically scaled.
Figure 5.3: Comparison of how different values for the maximum depth of the top of the octree influence the time for the octree creation. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. Figure 5.3a shows the impact of the depth of the top of the octree on an NVIDIA Quadro GP100 GPU whereas figure 5.3b shows the results of the same experiment repeated on a dual socket AMD EPYC 7543. It can be observed that on the CPU the best results for larger datasets can be obtained with a depth of nine. On the GPU, seven yields the best results across most datasets.

In Figure 5.3a, which shows the results for the NVIDIA Quadro GP100, it can be observed that a depth of seven yields the best results across most datasets. Very low values like one or three perform worst. For example, the octree creation on the largest dataset requires only about 194.4ms with a depth of seven, whereas with a depth of three the computation requires about 295.4ms. This is likely because with low values for the depth of the top of the octree the number of subtrees that emerge is too small and the subtrees themselves can become very deep. Thus, the advantage of the increased parallelism is very limited. On the other hand, large values like nine also result in suboptimal performance. Part of the reason for this behavior could be that the first part of the octree creation requires more work the deeper the top of the tree gets. Thus, there is more work for which one is limited to one work-group and thus cannot use the full amount of parallelism available on the GPU. Since with higher values one has more subtrees, and thus more parallelism, in the second phase, one can compensate for the performance loss of the first phase. However, this is only possible if the GPU can also fully use this increased amount of parallelism. If there is more theoretical parallelism than can be used, there might not be a performance gain anymore. Overall, a depth of seven seems to balance the amount of work in the first phase and the available parallelism in the second phase best. Thus, this value will be chosen for all future experiments on GPUs if not stated otherwise.

Figure 5.3b shows the results of the experiment on a dual socket AMD EPYC 7543 CPU. One can see that, similarly to GPUs, small values for the depth of the top of the tree perform worst, especially for large datasets. However, nine performs best for larger datasets on CPUs. With a runtime of approximately 607.9ms it is a lot faster than a depth value of seven, which would require 1337.2ms for the octree creation on the largest dataset. This was not the case on GPUs. One reason for this behavior could be that CPUs favor the many small subtrees that emerge when using a larger tree depth for the top of the octree. These smaller subtrees can then be built very fast without the need for a lot of synchronization, since they contain fewer bodies than larger subtrees. One could also argue that the reason for this behavior is that the approach with subtrees is not suited for CPUs and larger tree depths for the top of the tree shift more work to the first phase, which is more similar to the original approach without subtrees. However, as will be shown in section 5.3.2, the approach with subtrees improves the performance over the approach without subtrees on CPUs.
Thus, it is more likely that higher values for the top of the tree perform better due to the increased number of subtrees and not only because more work is shifted to the first phase of the tree creation. Since a depth of nine for the top of the tree performed best for the larger datasets, this value will be chosen for CPUs unless stated otherwise.

Choosing the amount of parallelism for the two phases of the octree creation

The experiments presented in this section will analyze how the amount of parallelism available for the first and the second phase of the octree creation influences the runtime behavior of the overall octree creation on CPUs and GPUs. Generally, using more work-items results in more parallelism but also requires more synchronization between those work-items, which introduces additional overhead. This is why it is important to analyze the impact of these values. In order to do this, the runtime of the first and the second phase of the octree creation was measured on an NVIDIA Quadro GP100 GPU and on a dual socket AMD EPYC 7543 CPU using different amounts of parallelism.

Figure 5.4: Comparison of different amounts of work-items for the first phase of the octree creation on an NVIDIA Quadro GP100 GPU and on a dual socket AMD EPYC 7543 CPU. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. On the GPU, small work-item counts like 128 perform worst and higher values perform better. For larger datasets there is not such a large difference on CPUs between small and large work-item counts.

Figure 5.4 shows the runtimes for the creation of the top of the tree on CPUs and GPUs for different work-item counts. The x-axes correspond to the number of bodies used for the simulation and the y-axes correspond to the runtime for the creation of the top of the tree in milliseconds. All axes have been logarithmically scaled.

Figure 5.4a shows the impact of changing the work-item count for the first phase of the octree creation on an NVIDIA Quadro GP100 GPU. It can be seen that smaller values like 128 result in the worst performance, with a runtime of about 85ms for the largest dataset. However, the largest possible value of 1024 does not result in the best performance across all datasets either. The results indicate that choosing higher values results in better performance for this part of the algorithm. Nevertheless, the highest possible value is not always the best choice, since it is about 3.8ms slower than using 640 work-items for the largest dataset. However, smaller values like 128 clearly perform worst. A work-item count of 640 offers good performance across all datasets and even performs best for the largest dataset. Thus, this value will be chosen on GPUs for the final experiments unless otherwise stated.

Figure 5.4b presents the results when varying the number of work-items for the first part of the octree creation on a dual socket AMD EPYC 7543 CPU. In contrast to the version of this experiment on GPUs, the variation of the runtime between the different numbers of work-items is much smaller for larger datasets. Part of the reason for this could be that if the work-item count exceeds the thread count of the CPU (which is 128 in this case) there is no additional performance to gain. However, the overhead of using many more work-items seems to be very limited since these values do not perform much worse for larger datasets.
A work-item count of 128 performs best by a small margin for the largest dataset, where it is approximately two to three milliseconds faster than using 1024 or 16 work-items. Since it also performs well for smaller datasets, this value will be chosen for the final experiments on CPUs.

Figure 5.5: Comparison of different amounts of work-items per work-group when building the subtrees during the octree creation. Each work-group is responsible for building one subtree. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. Figure 5.5a shows the runtimes for different amounts of work-items per work-group on an NVIDIA Quadro GP100 GPU. Figure 5.5b shows these values for the same experiment on a dual socket AMD EPYC 7543 CPU. On the GPU, larger work-group sizes perform better than smaller ones. On the CPU there are almost no differences between the runtimes for large and small work-group sizes when large datasets are used.

Figure 5.5 shows the runtimes for the second phase of the octree creation on an NVIDIA Quadro GP100 and a dual socket AMD EPYC 7543 for different work-item counts. The work-item count corresponds to the number of work-items in one work-group. Each work-group is responsible for building one of the subtrees. The x-axes correspond to the number of bodies used for the simulation and the y-axes correspond to the runtime for the creation of the subtrees in milliseconds. All axes have been logarithmically scaled.

In figure 5.5a the runtimes measured on an NVIDIA Quadro GP100 are shown. One can observe that for smaller datasets, smaller values perform a little bit better, whereas 1024 performs best for the largest dataset. However, this value only works reliably on all GPUs considered during this work when using Open SYCL. When DPC++ is used, the program is unable to launch this kernel with such a high number of work-items. This could be because the number of registers needed gets too large. Since a work-item count of 640 also performs well across most datasets, being only 2.2ms slower than 1024 for the largest dataset, and runs stably on all GPUs with both Open SYCL and DPC++, this value will be chosen for the final experiments.

Interestingly, the total time needed for the creation of the subtrees with the dataset containing 999 bodies is lower than the time for the dataset with just 178 bodies. This is because these two datasets are very different. The dataset with just 178 bodies contains planets like Jupiter or Saturn which have a lot of moons. This results in a very deep octree. The dataset containing 999 bodies only contains asteroids which are sparsely distributed. Thus, the depth of the resulting octree is much smaller. Since the depth for the top of the tree is fixed across all datasets, the subtrees that get created for the second dataset are much smaller compared to the first one.

Figure 5.5b presents the runtimes for the creation of the subtrees on a dual socket AMD EPYC 7543 CPU. It can be observed that, similar to the results from the first phase of the octree creation, the choice of the amount of parallelism does not make a huge difference, especially for larger datasets. Since 16 performs well for small and large datasets, this value will be chosen for the final experiments on CPUs.
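To make the launch configuration of this second phase more concrete, the following sketch shows how one work-group per subtree can be expressed in SYCL with a configurable work-group size. The names (num_subtrees, subtree_roots) and the empty kernel body are placeholders, not the thesis code.

#include <cstddef>
#include <sycl/sycl.hpp>

// One work-group per subtree: the group index selects the subtree root,
// the local ID identifies a work-item within the cooperating group.
void launch_subtree_build(sycl::queue &q, std::size_t num_subtrees,
                          std::size_t wg_size, const int *subtree_roots
                          /* , further tree data ... */) {
  const sycl::nd_range<1> cfg{sycl::range<1>(num_subtrees * wg_size),
                              sycl::range<1>(wg_size)};
  q.parallel_for(cfg, [=](sycl::nd_item<1> it) {
    const std::size_t subtree = it.get_group(0);  // which subtree this group builds
    const std::size_t lid = it.get_local_id(0);   // work-item index inside the group
    const int root = subtree_roots[subtree];
    // ... insert the bodies belonging to `root` into its subtree; the wg_size
    //     work-items of the group cooperate and synchronize via group barriers ...
    (void)root; (void)lid;  // placeholders, unused in this sketch
  });
}

The global range is simply num_subtrees times the work-group size, so the group index directly selects the subtree a group is responsible for, and the work-group size parameter studied above maps one-to-one to wg_size.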
5.2.3 Effect of the parameters for the acceleration computation of the Barnes-Hut algorithm

The goal of the experiments presented in this section is to analyze the impact of the parameters that influence the runtime of the acceleration kernel in the Barnes-Hut algorithm. First, the work-group size of the acceleration kernel will be considered. After that, the impact of the 𝜃-value on the accuracy and runtime of the Barnes-Hut algorithm will be analyzed.

Impact of the work-group size on the acceleration kernel

Since no optimizations making use of local memory are applied to the acceleration kernel of the Barnes-Hut algorithm and it also does not require any form of synchronization, it is not necessary to explicitly group the work-items into work-groups. However, in practice the specification of the work-group size gives some control over how work-items are mapped to, e.g., thread blocks on NVIDIA GPUs. The experiment presented in this section will analyze the impact of the work-group size for the acceleration kernel on CPUs and GPUs.

Figure 5.6 shows the runtimes of the Barnes-Hut acceleration kernel for different work-group sizes on an NVIDIA Quadro GP100 GPU and a dual socket AMD EPYC 7543 CPU. The x-axes correspond to the number of bodies used for the simulation and the y-axes correspond to the runtime of the acceleration kernel. All axes have been logarithmically scaled.

Figure 5.6: Comparison of different work-group sizes for the acceleration kernel of the Barnes-Hut algorithm. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. Figure 5.6a shows the runtimes for different work-group sizes on an NVIDIA Quadro GP100 GPU. Figure 5.6b shows these values for the same experiment on a dual socket AMD EPYC 7543 CPU. For the GPU a value of 16 performs best across most datasets. For the CPU a work-group size of 8 performs best across most datasets, especially for small ones.

In figure 5.6a the runtimes for different work-group sizes on an NVIDIA Quadro GP100 are presented. One can see that a work-group size of 16 results in the best performance on this GPU. With a runtime of about 837.2ms for the largest dataset it performs much better than a work-group size of, e.g., 1024, which results in a runtime of about 944.8ms for this kernel. For the final experiments 16 will be chosen as the work-group size for the acceleration kernel on the NVIDIA Quadro GP100. However, further testing revealed that this parameter has to be chosen differently on other GPUs in order to achieve good performance. Since the performance differences are not insignificant, the value was chosen individually for all GPUs that will be considered in this thesis. The decision which value is used is based on which of the work-group sizes performed best across most of the datasets in the experiments conducted during this thesis. For the NVIDIA RTX 3090 a work-group size of 32 will be chosen. The experiments with the NVIDIA A100 will use a work-group size of 1024 and for the AMD Radeon PRO VII a size of 64 is used for each work-group.

Figure 5.6b shows the effect of different work-group sizes on a dual socket AMD EPYC 7543 CPU. It can be observed that especially for small numbers of bodies there are huge differences between different work-group sizes. For example, with the dataset consisting of 2678 bodies the acceleration kernel takes about 1.5ms with a small work-group size, whereas with a work-group size of 1024 the runtime increases to about 13.4ms.
Generally, it can be observed that smaller work-group sizes perform better than bigger work-group sizes. One reason for this could be that the specification of the work-group size might influence how work-items are mapped to threads in the background and smaller work-groups result in situations that improve the cache performance. However, further explaining this observation would require more research into how the work-items of work-groups are mapped to threads by Open SYCL. Since a work-group size of eight performs best across most datasets, this value will be chosen for the final experiments on CPUs.

Impact of the 𝜃-value on runtime and accuracy

The 𝜃-parameter is a central part of the Barnes-Hut algorithm. It is used to control its accuracy and also affects the performance of the algorithm. In order to analyze the impact of this parameter on the runtime of the Barnes-Hut algorithm, the runtime of the acceleration kernel was measured using different values for 𝜃. Furthermore, in order to determine the impact of this value on the accuracy of the algorithm, one earth year was simulated with the dataset containing 2678 bodies and a Δ𝑡 of one hour. This was done for several 𝜃-values and the resulting final positions of all bodies were compared to those of the same simulation conducted with the naive algorithm.

Figure 5.7: Effect of the 𝜃-value of the Barnes-Hut algorithm on the runtime and accuracy. (a) Runtimes for different 𝜃-values on an NVIDIA Quadro GP100. (b) Average Euclidean distance between the final states of the Barnes-Hut algorithm using different 𝜃-values and the final state of the naive algorithm after one earth year. Figure 5.7a shows the runtimes of the acceleration kernel on an NVIDIA Quadro GP100 for different 𝜃-values. Figure 5.7b shows how the 𝜃-value impacts the accuracy of the algorithm by comparing the last state of the simulation performed by the Barnes-Hut algorithm with the last state of the simulation that used the naive algorithm. One can see that bigger 𝜃-values result in a lower runtime. However, when a high value for 𝜃 is chosen, the accuracy of the Barnes-Hut algorithm is decreased.

Figure 5.7a shows the results of the experiment regarding the impact of the 𝜃-value on the runtime of the Barnes-Hut algorithm. The experiment was conducted on an NVIDIA Quadro GP100 GPU. The x-axis corresponds to the size of the dataset used for the simulation. The y-axis shows the runtime of the acceleration kernel in milliseconds. Both axes have been logarithmically scaled. Each curve corresponds to one run of the simulation using a different 𝜃-value. One can see that the 𝜃-value has a big impact on the runtime of the acceleration kernel. With 𝜃 = 0.2 the runtime of the kernel in one time step is about 20.4s. When using 𝜃 = 0.4 the runtime decreases substantially to 848.8ms. With 𝜃 = 1 the runtime of the acceleration kernel further decreases to about 245.23ms. However, as will be shown in the next paragraph, these lower runtimes come at the cost of accuracy.

Figure 5.7b shows the results of the experiment that analyzes the impact of the 𝜃-value on the accuracy of the simulation. The x-axis corresponds to the 𝜃-value used for the simulation. The y-axis denotes the average Euclidean distance between the positions of the same bodies in the final states of the Barnes-Hut algorithm and the naive algorithm after simulating one earth year. The unit for the distance is the astronomical unit (AU).
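As an illustration of this accuracy metric, the following host-side sketch computes the average Euclidean distance between the final positions of the two runs. The flat coordinate vectors are an assumed layout, not the thesis data structures; the result is in AU if the positions are stored in AU.

#include <cmath>
#include <cstddef>
#include <vector>

// Average Euclidean distance between body i of the Barnes-Hut final state
// (bh_*) and body i of the naive final state (nv_*), averaged over all bodies.
double average_distance_au(const std::vector<double> &bh_x, const std::vector<double> &bh_y,
                           const std::vector<double> &bh_z, const std::vector<double> &nv_x,
                           const std::vector<double> &nv_y, const std::vector<double> &nv_z) {
  double sum = 0.0;
  const std::size_t n = bh_x.size();
  for (std::size_t i = 0; i < n; ++i) {
    const double dx = bh_x[i] - nv_x[i];
    const double dy = bh_y[i] - nv_y[i];
    const double dz = bh_z[i] - nv_z[i];
    sum += std::sqrt(dx * dx + dy * dy + dz * dz);  // deviation of body i between both runs
  }
  return sum / static_cast<double>(n);
}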
It can be observed that with higher 𝜃-values the precision of the simulation is reduced significantly. The overall average deviation might not seem that high, in part because for many bodies in the dataset no deviation can be observed at the accuracy used for the visualization of the simulation. Nevertheless, with higher 𝜃-values some planet-moon systems can become unstable if the simulation spans a large number of time steps. For this thesis the overall runtime behavior of the Barnes-Hut algorithm is most important. In order to strike a good balance between accuracy and runtime of the algorithm, a 𝜃-value of 0.6 will be chosen for all future experiments.

5.3 Evaluation of the performance optimizations

The goal of the experiments described in this section is to evaluate the three performance optimizations described in chapter 4. First, the usage of local memory in the naive algorithm will be analyzed. After that, the performance impact of using subtrees during the octree creation of the Barnes-Hut algorithm will be studied. Last, the impact of the sorting step will be examined.

5.3.1 Evaluation of the optimizations for the naive algorithm

The experiments presented in this section aim to analyze the impact of the performance optimizations applied to the naive algorithm. This analysis will deal with the following topics:

1. Impact of the optimizations on GPUs
2. Different effects on consumer and data center GPUs
3. Differences between SYCL implementations
4. Impact on the runtime on CPUs

In order to gain a better understanding of the impact of the individual optimization steps, three optimization stages for the naive algorithm will be considered. Stage zero corresponds to the implementation of the naive algorithm without any optimizations. Stage one introduces the grouping of acceleration computations into work-groups, including a padding to allow arbitrary work-group sizes. Stage two corresponds to the final implementation, including the usage of local memory.

Impact of the optimizations on GPUs

In order to analyze the impact of the optimizations on GPUs, the naive algorithm was run on all datasets with all three optimization stages using an NVIDIA Quadro GP100 GPU. For stage one and stage two, a work-group size of 128 has been chosen. Figure 5.8 shows the results of these runs. The x-axis corresponds to the number of bodies. The y-axis shows the runtime in milliseconds and is logarithmically scaled. Each set of three bars corresponds to the runs of the three different stages on one dataset. It can be seen that stage two is always faster than stage one and stage zero. For the largest dataset the optimizations from stage two improve the runtime from about 22.2 seconds to 20.6 seconds per acceleration computation. Since n-body simulations usually consist of a lot of time steps, this improvement adds up over time and can significantly improve the overall runtime of the simulation.

Figure 5.8: Analysis of the optimizations for the naive algorithm on an NVIDIA Quadro GP100. Stage 0 is the non-optimized version of the naive algorithm and stage 2 is the final, most optimized version of the naive algorithm. It can be observed that stage 2 improves the runtime of the naive algorithm most of the time.

Furthermore, it can be seen that stage one does not always offer an improvement over stage zero and sometimes even results in worse performance.
This indicates that the performance improvement indeed comes from the usage of local memory and not just from the division of work-items into work-groups. The fact that stage one sometimes results in slightly worse performance can be explained by the fact that the work-group size was chosen as 128 for all datasets. This value produces good results for all datasets; however, there might be specific work-group sizes that perform even better on some datasets.

Different effects on consumer and data center GPUs

To study this topic, the previous experiment was repeated using a different GPU, an NVIDIA GeForce RTX 3090. As opposed to the NVIDIA Quadro GP100 used previously, this is not a data center GPU but a consumer GPU. The main difference between those two categories of graphics cards is that data center GPUs typically have a lot more double-precision performance than consumer GPUs. This is especially relevant for the implementation of the naive algorithm conducted in this work, since all calculations are performed using double precision.

Figure 5.9 shows the results of this experiment in the same manner as previously.

Figure 5.9: Impact of the optimizations for the naive algorithm on an NVIDIA GeForce RTX 3090. Stage 0 is the non-optimized version of the naive algorithm and stage 2 is the final, most optimized version of the naive algorithm. It can be seen that stage 2 does not improve the runtime of the naive algorithm on consumer GPUs.

It can be observed that, as opposed to the run of the experiment on the Quadro GP100, no performance differences between the three stages can be seen on the GeForce RTX 3090. This can be explained by the much worse double-precision performance of this consumer GPU compared to the data center GPU used in the first experiment. Because of this, the NVIDIA GeForce RTX 3090 is likely heavily bottlenecked by its compute performance. Thus, the memory optimization of stage 2 does not come into effect. The much higher overall compute performance of the Quadro GP100 can also be clearly seen in the runtime differences between the two cards. The Quadro GP100 needs about 20.6 seconds for one acceleration computation on the largest dataset whereas the RTX 3090 takes about 2.4 minutes. This makes consumer GPUs impractical for the naive algorithm on larger datasets.

Differences between SYCL implementations

In order to analyze differences between the two SYCL implementations that are supported, the experiment was repeated using DPC++. Furthermore, in order not to limit the observations to NVIDIA GPUs, the experiment was repeated again on an AMD Radeon PRO VII GPU with both Open SYCL and DPC++.

Figure 5.10a shows the results of the first experiment on a Quadro GP100 using DPC++. One can see that the optimization also works with DPC++, since stage two improves the runtime on most datasets, with one exception being the smallest dataset. An explanation for this exception could again be the choice of the work-group size as 128 for all datasets. For the smallest dataset with just 178 bodies this is likely not the best value; however, it performs better for larger datasets and the performance loss for 178 bodies is not that large. As opposed to the results of the analogous experiment with Open SYCL, one can observe that stage one often improves performance over stage zero. This phenomenon could not really be observed with Open SYCL.
Figure 5.10: Differences between DPC++ and Open SYCL for the different optimization stages of the naive algorithm on an NVIDIA Quadro GP100. Stage 0 is the non-optimized version of the naive algorithm and stage 2 is the final, most optimized version of the naive algorithm. (a) Impact of the performance optimizations on an NVIDIA Quadro GP100 using DPC++. (b) Impact of the performance optimizations with DPC++ and Open SYCL for 19054 bodies. Figure 5.10a shows the runtimes of the naive algorithm with the different optimizations on an NVIDIA Quadro GP100. Figure 5.10b shows a comparison between DPC++ and Open SYCL with the different stages of the naive algorithm. It can be seen that, in contrast to Open SYCL, stage 1 also offers performance improvements in some cases.

Figure 5.10b shows a comparison between the runtimes of the three different stages with DPC++ and Open SYCL. The dataset used for the experiment contains 19054 bodies. The y-axis denotes the runtime in milliseconds. The left bar of each group shows the runtime of Open SYCL and the right bar shows the runtime of DPC++. It can be observed that Open SYCL is much faster than DPC++ at stage zero. With stage one, the performance of DPC++ improves a lot, from about 8.04ms for one time step to only 5.57ms. This is faster than Open SYCL, which needs about 6.91ms for one time step with stage one. Stage one, however, has almost no effect when using Open SYCL, since O