Institute of Parallel and Distributed Systems University of Stuttgart Universitätsstraße 38 D–70569 Stuttgart Bachelorarbeit Comparison of different n-body algorithms on various hardware platforms using SYCL Tim Thüring Course of Study: Informatik Examiner: Prof. Dr. Dirk Pflüger Supervisor: Marcel Breyer, M.Sc. Commenced: October 5, 2022 Completed: April 5, 2023 Abstract The n-body problem has various applications in different fields of science such as astrophysics, where it describes the problem of calculating the movements of n different bodies which all interact with each other over time. There exist different algorithms that solve the n-body problem, for example, the naive approach and the Barnes-Hut algorithm. Since applications of the n-body problem can deal with large systems of bodies it is crucial to understand the runtime behavior of theses algorithms. In this thesis the runtime behavior of the naive algorithm and the Barnes-Hut algorithm will be compared on different CPUs as well as on GPUs from different vendors. Both algorithms will be implemented using SYCL which is an abstraction layer that enables parallel programming for various types of devices. With SYCL there is no need to use different languages for parallel programming on CPUs and GPUs and the whole code can be written using standard C++. The results show that one can achieve better performance with both algorithms on GPUs than on CPUs. A projection for the runtime of a simulation of approximately one earth year with a system of over 1.2 million bodies and a Δ𝑡 of one hour shows that an NVIDIA A100 GPU could finish this simulation in under one day with the naive approach. A dual socket AMD EPYC 7543 would take almost twelve days for the same simulation. This shows that the naive algorithm maps extremely well to GPUs. CPUs can keep up better with the Barnes-Hut algorithm. Here the projection of the runtime showed that the NVIDIA A100 would finish the simulation of an earth year in just below an hour whereas the dual socket AMD EPYC 7543 could finish the simulation in 2.59 hours. 3 Kurzfassung Das n-Körper-Problem hat viele Anwendungen in unterschiedlichen Forschungsgebieten, wie zum Beispiel Astrophysik, in denen es das Problem der Berechnung von Bewegungen unterschiedlicher Körper beschreibt, die alle miteinander interagieren. Es existieren verschiedene Algorithmen, die das n-Körperproblem lösen, zum Beispiel der naive Ansatz oder der Barnes-Hut Algorithmus. Da sich Anwendungen des n-Körper-Problems mit großen Systemen von Körpern beschäftigen können, ist es wichtig das Laufzeitverhalten dieser Algorithmen zu verstehen. In dieser Arbeit wird das Laufzeitverhalten des naiven Algorithmus und des Barnes-Hut Algorithmus auf unterschiedlichen CPUs und GPUs von verschiedenen Herstellern verglichen. Beide Algorithmen werden unter der Verwendung von SYCL implementiert, was eine Abstraktionsschicht ist die es ermöglicht verschiedenste Hardware parallel zu programmieren. Mit SYCL besteht nicht die Notwendigkeit verschiedene Sprachen für das parallele Programmieren auf CPUs und GPUs zu verwenden und der gesamte Programmcode kann in Standard C++ geschrieben werden. Die Ergebnisse zeigen, dass man bei beiden Algorithmen bessere Performanz auf GPUs erreichen kann als auf CPUs. Eine Hochrechnung der Laufzeit für eine Simulation von ungefähr einem Erd-Jahr mit einem System von 1,2 Millionen Körpern und einem Δ𝑡 von einer Stunde zeigt, dass eine NVIDIA A100 GPU mit dem naiven Ansatz die Simulation in unter einem Tag beenden kann. 
Ein Dual-Socket AMD EPYC 7543 würde fast zwölf Tage für diese Simulation benötigen. Das zeigt, dass sich der naive Algorithmus enorm gut auf GPUs anwenden lässt. CPUs können beim Barnes-Hut Algorithmus besser mithalten. Hier zeigt die Hochrechnung, dass die NVIDIA A100 die Simulation eines Jahres in knapp unter einer Stunde schaffen würde, während der Dual-Socket AMD EPYC 7543 die Simulation in 2,59 Stunden schaffen könnte.

Contents
1 Introduction
2 Related work
3 Foundations
 3.1 The n-body problem
 3.2 Algorithms for the N-Body Problem
 3.3 SYCL for parallel programming on various hardware
4 Implementation
 4.1 Implementation of the naive algorithm
 4.2 Implementation of the Barnes-Hut algorithm
5 Analysis of the runtime behavior
 5.1 Experiment setups
 5.2 Impact of the parameters on the runtime behavior of the algorithms
 5.3 Evaluation of the performance optimizations
 5.4 Comparison of the runtime behavior on different GPUs and CPUs
 5.5 Summary of the results
6 Conclusion
 6.1 Future work
Bibliography

List of Figures
3.1 Example of a subdivision of the space into a quadtree
3.2 Example for a quadtree data structure
3.3 Example of how nodes of a quadtree represent virtual bodies
4.1 Grouping of the acceleration computations into work-groups
4.2 Building the top part of the tree in order to determine the root nodes for all subtrees
4.3 Creating all subtrees independently from each other
4.4 Two methods of assigning nodes to work-items
5.1 Comparison of different block sizes k for the naive algorithm on an NVIDIA Quadro GP100 GPU
5.2 Comparison of different block sizes k for the naive algorithm on a dual socket AMD EPYC 7543 CPU
5.3 Comparison of different values for the maximum depth for the top of the octree
5.4 Comparison of different amounts of work-items for the first phase of the octree creation
5.5 Comparison of different amounts of work-items for building the subtrees during the octree creation
5.6 Comparison of different work-group sizes for the acceleration kernel of the Barnes-Hut algorithm
5.7 Effect of the θ-value of the Barnes-Hut algorithm on the runtime and accuracy
5.8 Analysis of the optimizations for the naive algorithm on an NVIDIA Quadro GP100
5.9 Impact of the optimizations for the naive algorithm on an NVIDIA GeForce RTX 3090
5.10 Differences between DPC++ and Open SYCL with the different performance optimizations of the naive algorithm on an NVIDIA Quadro GP100
5.11 Impact of the optimizations of the naive algorithm on an AMD GPU with Open SYCL and DPC++
5.12 Impact of the performance optimizations for the naive algorithm on a dual socket AMD EPYC 7543
5.13 Impact of using subtrees during the octree creation of the Barnes-Hut algorithm on CPUs and GPUs
5.14 Impact of the sorting step of the Barnes-Hut algorithm on CPUs and GPUs
5.15 Comparison of the naive algorithm and the Barnes-Hut algorithm on CPUs and GPUs
5.16 Comparison of the naive algorithm and the Barnes-Hut algorithm on consumer and data center GPUs
5.17 Comparison of the naive algorithm and the Barnes-Hut algorithm on AMD and NVIDIA GPUs
5.18 Comparison of two different SYCL implementations on GPUs from AMD and NVIDIA
5.19 Comparison of the naive algorithm and the Barnes-Hut algorithm on different CPUs

Acronyms
AU astronomical unit
CPU central processing unit
FPGA field programmable gate array
GPU graphics processing unit
HT hyper-threading
TFLOPS tera floating point operations per second

1 Introduction

The n-body problem occurs in different fields of science and describes the calculation of the movements of n different bodies in a system where all of these bodies interact with each other. One of the main applications of this problem can be found in astrophysics, where the goal is to calculate the trajectories of different celestial objects like planets or asteroids which all interact with each other through gravitational forces. There are different approaches that solve the n-body problem. The naive approach considers all pairwise interactions between all bodies in the system. This leads to the most precise result; however, the runtime of the algorithm grows quadratically with the number of bodies in the system, which makes the approach poorly suited for the simulation of very large systems. Therefore, other algorithms have been developed that apply approximations to the simulation in order to speed up the calculation. One such algorithm is the Barnes-Hut algorithm [BH86], which leverages an octree data structure to apply approximations and improve the performance of n-body simulations. Since n-body simulations can be performed with large systems of bodies, it is important to understand the runtime behavior of these algorithms. In this thesis the runtime behavior of two algorithms for the n-body problem, the naive algorithm and the Barnes-Hut algorithm, will be compared on different central processing units (CPUs) as well as on graphics processing units (GPUs) from NVIDIA and AMD. Parallel programming on such different hardware platforms usually involves the usage of different languages such as CUDA, HIP or OpenMP. Most of the works that implemented n-body algorithms for GPUs use CUDA. In this thesis both algorithms will be implemented using SYCL, which is an abstraction layer that enables parallel programming on a variety of devices.
The benefit of using SYCL is that one does not need to use different languages for parallel programming on a specific device. Instead, with SYCL one can write a whole application in standard C++ that is then capable of addressing a variety of devices. The results of the comparison conducted during this thesis show that one can achieve better performance on GPUs than on CPUs for both algorithms. The differences between those two types of devices is especially significant for the naive algorithm where GPUs have a clear advantage over CPUs. A projection for the runtime of a simulation of approximately one earth year with a system of over 1.2 million bodies and a Δ𝑡 of one hour shows that when using the naive approach an NVIDIA A100 GPU could finish the simulation in under one day. A dual socket AMD EPYC 7543 would need almost twelve days in the same situation. However, CPUs can keep up better with GPUs when using the Barnes-Hut algorithm. Nevertheless, also with this algorithm GPUs are faster in the end. Here the projection of the runtime showed that with the Barnes-Hut algorithm the NVIDIA A100 would finish the simulation of an earth year in just below an hour whereas the dual socket AMD EPYC 7543 could finish the same simulation in 2.59 hours. 11 1 Introduction This thesis is structured as follows: In the next chapter important work related to this thesis will be presented. Chapter 3 will introduce all important fundamentals about n-body problems and SYCL. In chapter 4 the implementations of the naive algorithm and the Barnes-Hut algorithm conducted during this thesis will be described. The results of the comparison of both algorithms on different GPUs and CPUs will be presented in chapter 5. Finally, chapter 6 will conclude this thesis and discuss possible directions of future work. 12 2 Related work This section will present scientific work related to the topic of this thesis. There exist a lot of implementations of n-body algorithms on CPUs as well as on GPUs. An implementation of the naive algorithm on GPUs was performed by Nyland et al. [NHP07]. The authors parallelized the naive approach on NVIDIA GPUs with the usage of CUDA. Arora et al. [ASV09] implemented the naive approach for several CPUs and GPUs and compared the performance on these different platforms. Their GPU implementation makes use of CUDA and uses the approach by Nyland et al. However, they only considered the naive algorithm. Furthermore, the hardware used for the comparison (NVIDIA Tesla C1060 and NVIDIA Tesla C870 GPUs as well as, for example, Intel Nehalem and AMD Barcelona CPUs) is not up to date anymore. Capuzzo-Dolcetta et al. [CS13] implemented the naive algorithm with OpenCL an compared their implementation on GPUs from different manufactures. A CUDA implementation of the Barnes-Hut algorithm on NVIDIA GPUs was conducted by Burtscher et al. [BP11]. The octree creation phase of the Barnes-Hut algorithm was implemented using a parallel top-down approach that makes use of synchronization. The authors also compared their implementation of the Barnes-Hut algorithm with a CUDA implementation of the naive algorithm. A comparison of the performance of the Barnes-Hut algorithm on CPUs and GPUs was also performed by the authors. However, the CPU implementation of the Barnes-Hut algorithm used for this comparison was not parallelized. There also exist other approaches for the octree creation like the one proposed by Warren et al. 
[WS93] which makes use of a hashed octree and leverages space filling curves for decomposition of all bodies in the space. It is also possible to construct the tree bottom-up. This technique has been, for example, applied by Alpay [Alp19b] to an algorithm that is similar to the Barnes-Hut algorithm with the usage of OpenCL. For the tree the SpatialCL library by Alpay [Alp19a] is used that also includes a demonstration of the Barnes-Hut algorithm. Beyond the Barnes-Hut algorithm that has a theoretical complexity of O(𝑛 𝑙𝑜𝑔(𝑛)) there also exist other approaches that try to reduce the theoretical complexity of the naive algorithm. One example is the fast multipole method by Greengard et al. [Gre87] which reduces the theoretical complexity to O(𝑛). Yokota et al. [YB11] implemented both the fast multipole method and the Barnes-Hut algorithm with CUDA for execution on GPUs. They also compared both algorithms with the naive approach on GPUs and CPUs. Overall, not many comparisons of the Barnes-Hut algorithm and the naive algorithm have been conducted on different modern hardware. Most comparisons so far have used different implementations of the algorithms with different programming languages for the various devices. The implementations of the naive algorithm and the Barnes-Hut algorithm presented in this work will only use SYCL code. This allows for using mostly the same code for the execution on CPUs and GPUs. Different frameworks than SYCL exist and also have been applied to n-body simulations like the Cabana library [SRJ+22] that uses Kokkos [TLA+22] for achieving performance portability. 13 2 Related work However to my knowledge no comparisons of different algorithms with the usage of this library have been conducted. Furthermore, I am not aware of any similar work that makes use of SYCL for comparison between the naive algorithm and the Barnes-Hut algorithm on different hardware. 14 3 Foundations In this chapter the basic concepts and tools which are relevant for this thesis will be introduced. Starting with physical background about n-body problems and the ideas behind two different algorithms for n-body simulations. Last, SYCL will be introduced as part of the technical background of this thesis. 3.1 The n-body problem The n-body problem describes the calculation of pairwise interactions between n different bodies to obtain their movements over time. It has applications in various fields of science such as molecular dynamics [EH86], plasma physics [GSK+10], fluid dynamics [SW94] or astrophysics [BH86]. The variants of the n-body problem in those different fields mainly differ in the kind of force which underlies the interactions in the system. This work considers the application of the n-body problem for astrophysics, where the main application is the simulation of celestial objects, e.g., the planets, moons and asteroids of our solar system. This variant of the n-body problem is also called gravitational n-body problem since the different bodies of the system interact with each other through gravitational forces. It is described in various literature, e.g., in the work by Meyer et al. [Ken17] on which the content of the following paragraphs is based. In the gravitational case the interactions between the bodies arise through gravity. The individual bodies in the system are modeled as point masses where every body 𝑖 is described by its mass 𝑚𝑖 and a position ®𝑟𝑖 . 
With this, the calculation of the gravitational force acting on body i can be derived from Newton's second law of motion and Newton's law of gravity. Together this gives the following equation for the total force F_i acting on body i generated through the gravity of all other bodies j:

\vec{F}_i = m_i \cdot \vec{a}_i = G \cdot m_i \cdot \sum_{j \neq i} \frac{m_j \cdot (\vec{r}_j - \vec{r}_i)}{\lVert \vec{r}_j - \vec{r}_i \rVert^3} . \quad (3.1)

In equation 3.1, G \approx 6.674 \cdot 10^{-11} \,\frac{m^3}{kg \cdot s^2} denotes the gravitational constant and a_i describes the acceleration of body i. Often the acceleration value a_i is sufficient, which simplifies equation 3.1 to equation 3.2:

\vec{a}_i = G \cdot \sum_{j \neq i} \frac{m_j \cdot (\vec{r}_j - \vec{r}_i)}{\lVert \vec{r}_j - \vec{r}_i \rVert^3} . \quad (3.2)

In order to obtain the evolution of the system over time, i.e., the movements of the bodies in the system, one uses numerical integration over time. One way of getting the new positions r_{i+1} and velocities v_{i+1} of each body in the next time step i+1 is the so-called leapfrog integration method. The method is described in various literature. This paragraph will be based on the work by Hairer et al. [HLW03]. The leapfrog integration method is sometimes also named Verlet integration and can be written and interpreted in a variety of ways. A possible interpretation of this scheme is that it is a composition of two symplectic Euler schemes. One variant of the leapfrog integration scheme can be expressed as shown below:

\vec{v}_{i+\frac{1}{2}} = \vec{v}_i + \vec{a}_i \cdot \frac{\Delta t}{2} \quad (3.3)
\vec{r}_{i+1} = \vec{r}_i + \vec{v}_{i+\frac{1}{2}} \cdot \Delta t \quad (3.4)
\vec{v}_{i+1} = \vec{v}_{i+\frac{1}{2}} + \vec{a}_{i+1} \cdot \frac{\Delta t}{2} . \quad (3.5)

After the step described by equation 3.4 the new acceleration values a_{i+1} have to be determined. In the case of the gravitational n-body problem equation 3.2 could be used. However, as described, e.g., by Nyland et al. [NHP07], for n-body simulations in practice this would not be applicable. Since the individual bodies are modeled as point masses and collisions are not taken into account, the forces between two bodies that get close to each other grow towards infinity. This can be circumvented by adding a softening factor ε² and rewriting equation 3.2 to equation 3.6 for the acceleration computation:

\vec{a}_i \approx G \cdot \sum_{j \neq i} \frac{m_j \cdot (\vec{r}_j - \vec{r}_i)}{\left( \lVert \vec{r}_j - \vec{r}_i \rVert^2 + \epsilon^2 \right)^{3/2}} . \quad (3.6)

Together, the acceleration computation and time integration form the basic building blocks of n-body simulations.

3.2 Algorithms for the N-Body Problem

The part dominating the runtime of n-body simulations is the acceleration computation, which has to be done in every time step. Beyond the naive approach for the acceleration calculation, several other approaches have been developed to speed up the acceleration computations. These approaches include methods such as the Barnes-Hut algorithm [BH86], the fast multipole method [Gre87], or particle-mesh methods (see, e.g., Bertschinger et al. [BG91]). This work is concerned with two different algorithms for n-body simulations: the naive algorithm and the Barnes-Hut algorithm. In this section, these two algorithms for the acceleration computation of the n-body problem will be introduced.

3.2.1 The Naive Approach

The naive approach for the acceleration computation is described in various literature, e.g., by Nyland et al. [NHP07]. In order to obtain the movements of all bodies over time, one has to calculate the new acceleration vector of each body in every time step. This means that for a system of n bodies, n acceleration values have to be determined. Using the naive approach these values can be calculated by applying equation 3.6 for all n bodies i.
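To make the interplay of the softened acceleration (equation 3.6) and the leapfrog scheme (equations 3.3 to 3.5) concrete, the following minimal sequential sketch shows one simulation step in plain C++. The Body layout, the constants and the function names are chosen for this illustration only and do not correspond to the thesis code.

#include <cmath>
#include <vector>

// Minimal point-mass representation used only for this illustration.
struct Body {
    double m;      // mass
    double r[3];   // position
    double v[3];   // velocity
    double a[3];   // acceleration
};

constexpr double G    = 6.674e-11;  // gravitational constant
constexpr double EPS2 = 1e-9;       // softening factor epsilon^2 (example value)

// Equation 3.6: softened acceleration of every body induced by all other bodies.
void compute_accelerations(std::vector<Body> &bodies) {
    for (Body &bi : bodies) {
        bi.a[0] = bi.a[1] = bi.a[2] = 0.0;
        for (const Body &bj : bodies) {
            const double dx = bj.r[0] - bi.r[0];
            const double dy = bj.r[1] - bi.r[1];
            const double dz = bj.r[2] - bi.r[2];
            const double dist2 = dx * dx + dy * dy + dz * dz + EPS2;
            const double inv_dist3 = 1.0 / (dist2 * std::sqrt(dist2)); // (...)^(3/2)
            // For j == i the numerator is zero, so the term vanishes on its own.
            bi.a[0] += G * bj.m * dx * inv_dist3;
            bi.a[1] += G * bj.m * dy * inv_dist3;
            bi.a[2] += G * bj.m * dz * inv_dist3;
        }
    }
}

// Equations 3.3-3.5: one leapfrog step of size dt.
void leapfrog_step(std::vector<Body> &bodies, double dt) {
    for (Body &b : bodies)                 // first half kick (3.3)
        for (int d = 0; d < 3; ++d) b.v[d] += b.a[d] * dt / 2.0;
    for (Body &b : bodies)                 // drift (3.4)
        for (int d = 0; d < 3; ++d) b.r[d] += b.v[d] * dt;
    compute_accelerations(bodies);         // new accelerations a_{i+1}
    for (Body &b : bodies)                 // second half kick (3.5)
        for (int d = 0; d < 3; ++d) b.v[d] += b.a[d] * dt / 2.0;
}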
Since each of the n acceleration computations requires summing over the accelerations induced by the gravity of all the other n − 1 bodies j, naively computing the accelerations results in O(n²) complexity. For large systems, this quadratic complexity can be problematic since the runtime of the naive algorithm can get quite long. Thus, other algorithms have been developed to speed up the calculation of the acceleration values. However, such algorithms often apply approximations and thus do not reach the accuracy of the naive approach. One such algorithm, the Barnes-Hut algorithm, will now be introduced in the next section.

3.2.2 The Barnes-Hut Algorithm

The Barnes-Hut algorithm was proposed by Josh Barnes and Piet Hut [BH86]. It is a tree-based method that reduces the complexity of the acceleration computation to O(n log(n)), which makes it much more applicable for large systems compared to the naive approach. The Barnes-Hut algorithm can compute the n different acceleration values a_i more efficiently, since it applies approximations to each of the n calculations. In contrast to the naive algorithm, which considers all interaction pairs for the calculation of one acceleration a_i, the Barnes-Hut algorithm reduces the number of interaction calculations per acceleration value. The general idea of the approximation is that for the calculation of the acceleration a_i of body i, the acceleration is only directly computed for the bodies j that are close to body i. Groups of bodies that are far away are grouped together and are estimated as one big virtual body. When the bodies are far away, the loss of precision of this approximation compared to individually computing all accelerations is not large. In order to determine when to apply this approximation, all bodies are stored in a tree data structure according to their position. In the three-dimensional case, such a data structure would be an octree, in two dimensions a quadtree. The implementation of the algorithm performed during this thesis always considers the three-dimensional case. For easier visualization, all following graphics will be limited to the two-dimensional case with quadtrees. However, all concepts described here can be easily extended to three dimensions.

Figure 3.1: Example of a subdivision of a space containing four bodies numbered from 1 to 4 into a quadtree. In the situation shown on the left there is only one cell that contains all four bodies. This cell then gets recursively subdivided into smaller cells until there is only one body in each cell.

Figure 3.1 shows how a space containing four bodies, numbered from 1 to 4, gets recursively subdivided into smaller cells until there is only one body in each cell. In the corresponding quadtree data structure, each cell c_i corresponds to a node of the tree. The quadtree corresponding to figure 3.1 is visualized in figure 3.2.

Figure 3.2: Example for a quadtree data structure for the quadtree visualized in figure 3.1. Each node of the tree corresponds to one cell. Below the nodes that contain a body, the respective body is shown.

All nodes have to store the sum of the masses of all bodies which are contained in any of the nodes further down the tree and the center of mass of all those bodies. The center of mass of several bodies k can be calculated using equation 3.7.
With that, each non-leaf node of the tree also represents a virtual body located in the center of mass of all the bodies in the cell which is represented by this node, and with a mass of the sum of all masses of those bodies. ®𝐶 = ∑ 𝑘 𝑚𝑘®𝑟𝑘∑ 𝑘 𝑚𝑘 (3.7) For example, cell 𝑐2 in figure 3.2 got subdivided into more cells since it contains a total of 3 bodies: body 2, 3 and 4. Thus, as shown in figure 3.3, the node corresponding to cell 𝑐2 represents the virtual body 𝑣234 located at the center of mass of the bodies 2, 3 and 4 and with a mass corresponding to the sum of the individual masses of the three bodies. During the acceleration computation, for each body 𝑖, the tree is traversed top to bottom. For each node, one has to decide if an approximation should be used, i.e., if only one acceleration induced by the virtual body represented by the current node should be used. In such a case, all nodes and bodies in the subtree of this node will not be considered anymore for the acceleration computation. This can drastically reduce the number of acceleration computations. In the example above this would correspond to the following situation: assume that the goal is the computation of acceleration ®𝑎1. When the algorithm arrives at the node representing cell 𝑐2 it has to make the decision if the virtual body 𝑣234 should be used or if an approximation is not feasible here and one should continue to step down deeper into the tree, computing more precise accelerations with bodies 2, 3 and 4 individually. The decision, if the virtual body 𝑣 represented through the current node under consideration in the tree traversal should be used as an approximation for all other bodies contained in the cell 𝑐 represented by this node is based on the following criterion: 𝑒𝑑𝑔𝑒𝐿𝑒𝑛𝑔𝑡ℎ(𝑐) | |®𝑟𝑣 − ®𝑟𝑖 | | < 𝜃 . (3.8) 18 3.3 SYCL for parallel programming on various hardware 2 3 4 (a) Cell 𝑐2 2 3 4 𝑣234 (b) Virtual body 𝑣234 repre- sented by cell 𝑐2 𝑐0 𝑐1 𝑐2 𝑐3 𝑐4 1𝑣234 (c) Quadtree with the virtual body 𝑣234 instead of the sub- tree containing the individual bodies 2, 3 and 4 Figure 3.3: Example of how nodes of a quadtree represent virtual bodies. Figure 3.3a shows cell 𝑐2, divided into four quadrants and containing the bodies with ID 2, 3 and 4. Figure 3.3b shows cell 𝑐2 with the virtual body 𝑣234 located in the center of mass of all bodies contained in cell 𝑐2. Figure 3.3c shows the quadtree data structure where the virtual body 𝑣234 represents the whole subtree containing the bodies 2, 3 and 4. Here, ®𝑟𝑖 is the position vector of the body 𝑖 whose acceleration ®𝑎𝑖 should be computed, whereas ®𝑟𝑣 corresponds to the position of the virtual body, which is the center of mass of all bodies in the cell. 𝜃 is a constant that defines the accuracy of the algorithm. For higher values of 𝜃, approximations with virtual bodies will be applied more often, which leads to better performance but lower accuracy. Lower 𝜃-values give higher accuracy, but it comes at the cost of more computations and thus longer run-times. For 𝜃 = 0 no approximations are applied and the algorithm would correspond to the naive approach again. The tree creation step of the Barnes-Hut algorithm requires O(𝑛 𝑙𝑜𝑔(𝑛)) calculation steps. Further- more, the time complexity of the computation for the calculation of one acceleration ®𝑎𝑖 is O(𝑙𝑜𝑔(𝑛)). Since in total 𝑛 acceleration values have to be computed per time step, the total complexity of the acceleration computation step is O(𝑛 𝑙𝑜𝑔(𝑛)). 
With that, the total complexity of the Barnes-Hut algorithm reaches O(𝑛 𝑙𝑜𝑔(𝑛)), which is a significant improvement over the naive approach. 3.3 SYCL for parallel programming on various hardware SYCL 2020 [Khr] is a standard that acts as an abstraction layer which enables parallel programming for heterogeneous systems and is developed by the Khronos Group1. SYCL code can be written using only standard ISO C++. The advantage of an application written with SYCL code is that it can be executed in parallel on a variety of devices, such as CPUs and GPUs from different manufacturers or even field programmable gate arrays (FPGAs). In this thesis, SYCL will be used to compare the runtime behavior of different n-body algorithms on different GPUs and CPUs. For creating a SYCL application and running it on different devices one needs an implementation of SYCL. There are a variety of SYCL implementations, each of which include support for different backends, that are then used for executing code in parallel on 1https://www.khronos.org (visited on 04/03/2023) 19 https://www.khronos.org 3 Foundations the selected devices. For example, when executing SYCL code on an NVIDIA GPU, the CUDA backend could be used in the background, whereas code executing on a CPU could be parallelized using OpenMP as a backend. This work uses two different SYCL implementations: Open SYCL2 [AH20] and DPC++3. Open SYCL offers support for NVIDIA and AMD GPUs with the CUDA and the ROCm backend respectively. Furthermore Intel GPUs are supported with Level Zero. Open SYCL also offers broad CPU support through the OpenMP backend. DPC++ supports Intel CPUs, GPUs and FPGAs with the SPIR-V backend and the OpenCL backend. Intel GPUs are also supported with Level Zero. DPC++ also offers support for NVIDIA GPUs through CUDA and AMD GPUs through HIP. However, the AMD HIP backend is currently in experimental state. There also exist other SYCL implementations like, for example, Codeplay ComputeCpp4. However, these implementations will not be considered in this work. The SYCL application written in this work utilizes the CUDA backends of Open SYCL and DPC++ for execution on NVIDIA GPUs and the ROCm backend of Open SYCL for AMD GPU support. For parallel execution on CPUs the OpenMP backend of Open SYCL is used. In the next sections a short introduction into programming with SYCL will be given which explains the high level concepts of how to parallelize code with SYCL and execute it on different devices. 3.3.1 Programming with SYCL The content of this section follows the explanations of the book Data Parallel C++ by Reinders et al. [RAB+21]. On the highest level, a SYCL application can be divided into host and device code. Device code is most of the time the computational intensive part of the program which should get accelerated by parallel execution on a device. Such a device could be a GPU but also a CPU. Host code on the other hand is all the other code of the application, besides the device code. This code gets executed on the CPU of the system. Listing 3.1 shows a simple example of SYCL code. The code initializes a vector with some data on the CPU. After that each entry of the vector is squared. This operation will happen in parallel on a device of the system. The following paragraph will describe the example step by step. In line 1, a queue is created. A queue is associated with one device and used to submit code which should be executed on the device. In this case no device is specified, so the SYCL implementation will select a device. 
However, here it would be possible to specify that, for example, this queue should be associated with a GPU.

2 https://github.com/OpenSYCL/OpenSYCL (visited on 04/03/2023)
3 https://github.com/intel/llvm (visited on 04/03/2023)
4 https://developer.codeplay.com/products/computecpp/ce/home/ (visited on 04/03/2023)

Listing 3.1 Example of how code gets submitted to a device for parallel execution when using SYCL
1 sycl::queue queue;
2
3 std::vector<int> storage = {2, 9, 10, 7, 4, 3, 1, 5};
4 sycl::buffer<int> data_buffer(storage.data(), storage.size());
5
6 int range = storage.size();
7
8 queue.submit([&](sycl::handler &h) {
9
10     sycl::accessor data_accessor(data_buffer, h);
11
12     h.parallel_for(sycl::range<1>(range), [=](auto &i) {
13         data_accessor[i] = data_accessor[i] * data_accessor[i];
14     });
15 }).wait();

The code in line 4 creates a buffer for variables of type int. A buffer is an abstraction layer for data in SYCL. This data can be accessed by the device using an accessor. In the example an accessor is created in line 10. In line 8 a kernel containing device code is submitted to the device with which the previously created queue is associated. The only device code which gets executed in parallel is line 13. Here the previously created accessor is used to modify the data stored in the buffer. By default, device code submitted to a queue gets executed asynchronously to the host code. The code in line 10, however, gets executed directly and not asynchronously. In this example the call to wait() in line 15 causes the host to wait until the device executing the kernel has finished the computation. One instance of the kernel which executes line 13 in parallel to all other instances is called a work-item. The amount of parallelism, or the number of work-items which are used, is controlled in line 12. In this example it corresponds to the size of the buffer. Each of the work-items of the kernel is identified by an ID; in this example the ID is stored in the variable i.

In the previous example, besides the amount of work-items no further specifications about the parallel execution are made. However, SYCL provides a more powerful option to express parallelism: nd-range kernels. For this work, the most important property of nd-range kernels is the ability to group different work-items together into work-groups. For example, one can divide a total of 1024 work-items into 16 work-groups with 64 work-items each. The advantage of this approach is that there exist several work-group-specific functions that can be used to optimize the application. For example, each work-group has its own local memory shared between all work-items of that work-group. Depending on the device which is used for running the kernel, this type of memory can be much faster than the global memory that can be accessed by all work-items. Furthermore, there are several built-in synchronization options between work-items that belong to one single work-group, like barriers or memory fences.

4 Implementation

In this work, two algorithms for the n-body problem, the naive approach and the Barnes-Hut algorithm, have been implemented and parallelized for execution on different hardware using SYCL. All computationally demanding parts of the algorithms have been parallelized with SYCL for execution on CPUs and GPUs, including the leapfrog integration.
In this chapter, the approaches of how the algorithms are implemented for parallel execution will be described. Furthermore, it will be explained how the algorithms are optimized for better performance. 4.1 Implementation of the naive algorithm The implementation of the naive algorithm with SYCL conducted in this thesis is based on the CUDA implementation of this algorithm by Nyland et al. [NHP07]. As described in section 3.2.1, calculating all the 𝑛 accelerations of a system with 𝑛 bodies requires O(𝑛2) acceleration computations. Each of those O(𝑛2) accelerations can be computed independently of each other which makes this approach embarrassingly parallel. After that all accelerations corresponding to one body can be summed up. This means that in theory one could have a maximum of O(𝑛2) calculations executing in parallel. The disadvantage of this method is that it also requires quadratic amount of memory, since all acceleration values have to be stored until the final 𝑛 accelerations can be calculated. Thus, Nyland et al. chose a different approach where only the 𝑛 acceleration computations for each body happen independent of each other in parallel and the 𝑛− 1 acceleration values which emerge through the interaction of one body with all other bodies are calculated sequentially. This approach still results in more than enough parallelism to fully utilize modern GPU and CPU hardware. The implementation of Nyland et al. was performed using CUDA and thus only targets NVIDIA GPUs. The authors describe in detail how they optimized their implementation for execution on GPUs. Since the high degree of parallelism available in the naive algorithm is especially well suited for GPUs, the implementation of the naive algorithm done in this work is based on this GPU implementation by Nyland et al. Their optimization approaches for GPUs programmed with CUDA are also realized in SYCL code. The implementation of the algorithm with SYCL enables execution on both GPUs and CPUs. The impact of the optimization on the runtime on CPUs will be analyzed more closely in chapter 5.3.1. 4.1.1 Optimization for GPUs One optimization approach of the naive algorithm described by Nyland et al. aims to speed up the acceleration computation by using the shared memory of GPUs. In SYCL this corresponds to using local memory which is accessible to all work-items of one work-group. In order to achieve 23 4 Implementation work-group of 𝑘 work-items 𝑘 × 𝑘 block ®𝑎𝑛−1,𝑛−1 . . . . . . . . . . . . . . . . . . . . . . . . . . . ... ... ... ... ... ... ... ... ... . . . ®𝑎1,0 ®𝑎2,0 ®𝑎3,0 ®𝑎4,0 ®𝑎5,0 ®𝑎𝑛−1,0 . . . ®𝑎0,1 ®𝑎2,1 ®𝑎3,1 ®𝑎4,1 ®𝑎5,1 ®𝑎0,2 ®𝑎1,2 ®𝑎3,2 ®𝑎4,2 ®𝑎5,2 ®𝑎0,3 ®𝑎1,3 ®𝑎2,3 ®𝑎4,3 ®𝑎5,3 ®𝑎0,4 ®𝑎1,4 ®𝑎2,4 ®𝑎3,4 ®𝑎5,4 ®𝑎0,5 ®𝑎1,5 ®𝑎2,5 ®𝑎3,5 ®𝑎4,5 ®𝑎0,𝑛−1 ®𝑎1,𝑛−1 ®𝑎2,𝑛−1 ®𝑎3,𝑛−1 ®𝑎4,𝑛−1 ®𝑎5,𝑛−1 Figure 4.1: Grouping of the acceleration computations into work-groups of size k=3. The compu- tation of the acceleration values in each work-group happens in blocks of size 𝑘 × 𝑘 (highlighted in blue) in order to make use of the work-groups local memory. Modified from Nyland et al. [NHP07] Figure 31-4. this optimization with SYCL, the first step is to group the 𝑛 acceleration value computations into work-groups of a fixed size 𝑘 . Figure 4.1 shows an example of how the individual calculations for all acceleration values ®𝑎𝑖, 𝑗 are grouped into work-groups for 𝑘 = 3. Here 𝑖 denotes the index of the body whose acceleration ®𝑎𝑖 is calculated by the 𝑖-th work-item. 
®𝑎𝑖, 𝑗 corresponds to an individual accelerations induced by one of the other 𝑛 − 1 bodies 𝑗 . The computations of the acceleration values in each work-group happen in blocks. For a work-group consisting of k work-items, the 𝑘 · (𝑛 − 1) acceleration values are calculated in blocks of size 𝑘 × 𝑘 . This is necessary for the usage of the local memory. Before each block, each work-item loads the values of one of the 𝑘 bodies into the local memory. After that all work items that belong to this work-group are synchronized through a barrier. Now, all values needed for the acceleration computation (mass and position) of the 𝑘 bodies of the current block are accessible to all work-items of the work-group in local memory. Each work-item has to only load the values of one body from the slower global memory into the faster local memory. After the synchronization, each work-item has fast access to the position and mass values of all 𝑘 bodies of the block and can now compute the 𝑘 accelerations induced by these bodies to the body associated with the respective work-item. For example, in the first block of the second work-group in figure 4.1, the work-item with ID 3 would load all values of body 0 into the local memory, making it also accessible with fast access for work-items 4 and 5. Work-item 4 and 5 will do the same for the bodies with ID 1 and 2 respectively. After each work-item has finished the calculation of the acceleration values, the 𝑘 work-items synchronize before the next position and mass values for the next 𝑘 bodies are loaded into the local memory. 24 4.2 Implementation of the Barnes-Hut algorithm In the variant described above, the size 𝑘 of a work-group would have to divide the total amount of bodies, since in SYCL all work-groups must have the same size. Since the number of bodies can be arbitrary, it would limit the possible values for 𝑘 . However, the size 𝑘 of the blocks is not irrelevant. For example for GPUs it is best to choose 𝑘 as a multiple of a power of 2. In order to allow arbitrary values for 𝑘 , a padding is added to the global range of all work-groups so that it is divisible by 𝑘 . With this change there exist a few work-items in the last work-group that are not associated with any acceleration computation. It only requires small changes to handle this case and ensure that only work-items associated with an acceleration value ®𝑎𝑖 perform the calculations. However, the work-items that do not perform any calculations of acceleration values are still needed to load values into the local memory. 4.2 Implementation of the Barnes-Hut algorithm The Barnes-Hut algorithm [BH86] consists of two major parts: the octree creation and the computation of the acceleration values using the octree. The biggest challenge when implementing this algorithm is the parallelization of the octree creation, especially on GPU hardware. An implementation of the Barnes-Hut algorithm on GPUs with a classical octree was conducted by Burtscher et al. [BP11] using CUDA. The SYCL implementation of the Barnes-Hut algorithm on CPUs and GPUs performed in this thesis will be based on the implementation of Burtscher et al. In their CUDA implementation of the Barnes-Hut algorithm, Burtscher et al. divided the Barnes-Hut algorithm further into several steps. First, the bounding box of all bodies used for the simulation is calculated. After that the octree data structure gets build up, followed by the computation of the center of mass for each node. To minimize thread-divergence, Burtscher et al. 
introduced a new step before the acceleration computation: an in-order sorting of all bodies in the octree. These concepts are realized in this thesis with the usage of SYCL. The following sections will describe in more detail how the Barnes-Hut algorithm was implemented using SYCL for execution on CPUs and GPUs. 4.2.1 Parallel octree creation Burtscher et al. use a parallel top-down approach for creating the octree data structure. At the beginning, the octree consists of only one node, the root node, which represents a cell corresponding to the bounding box of all bodies. In the SYCL implementation of this work, every work-item gets assigned a subset of all bodies that it has to insert into the tree. One work-item can insert bodies in parallel to other work-items. When a work-item inserts a body into the octree, it traverses the tree top to bottom starting with the root node. Stepping further down into the tree does not require any synchronization, since no modifications are made. If a leaf node is reached, the insertion point for the current body is found in the tree. Since modifications of the node will happen in the next step, it is necessary to lock the node and prevent other work-items from working on it while the current work-item tries to insert the body. If the leaf node is empty, i.e., does not already contain a different body, the body can be inserted and the work-item can continue with the next body assigned to it. If the leaf node already contains a body, the node has to be split, i.e., it has to become an inner node and eight new child nodes have to be created. 25 4 Implementation When making such changes, like splitting a node and creating eight new nodes, or inserting a body into a node, one needs to be aware of the following. After releasing the lock that locks the node, it is not guaranteed that other work-items will see the changes made to the node. Thus, a memory fence is needed to ensure that the view of all work-items on the global memory is consistent after modifying the tree. Memory fences are supported by SYCL, however, different SYCL implementations vary a lot in their implementation and feature support of this function. DPC++ already supports the atomic_fence function introduced with SYCL 2020 whereas currently this function is not yet supported by Open SYCL. Open SYCL only supports the mem_fence function of the older SYCL 1.2.1 specification [Khr20]. In this specification however, there exist only memory consistency guarantees for work-items in one work-group when using a mem_fence. But no consistency guarantees are made for work-items in different work-groups (see [Khr20] chapter 3.5.2.2.). The implementation of atomic_fence in DPC++ did not guarantee consistency across different work-items that belong to different work-groups during testing with the CUDA backend. However, these issues would require more research which is beyond the scope of this thesis. Furthermore in the current revision 6 of the SYCL 2020 standard, the SYCL memory model is not completely formalized yet (see [Khr22] chapter 3.8.3.2.). Due to this current state and the goal that the implementation of the Barnes-Hut algorithm should be compatible with several backends and SYCL implementations, it can not be assumed that there are any consistency guarantees across work-groups when using memory fences with SYCL. Thus, this limits the approach described above to one work-group in SYCL. 
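Within that single work-group, the locking scheme described above can be sketched as follows. This is a schematic fragment with invented names, not the thesis code; it uses the SYCL 2020 atomic_ref and atomic_fence functions mentioned above, so on Open SYCL the fence would currently have to be replaced by the older mem_fence, and splitting a full leaf into eight children is omitted for brevity.

#include <sycl/sycl.hpp>

// Lock values for a tree node (illustrative).
constexpr int UNLOCKED = 0;
constexpr int LOCKED   = 1;

// Schematic device-side routine: try to insert `body_idx` into leaf `leaf_idx`.
// `locks` and `node_body` point into global memory; all work-items that may touch
// this leaf belong to the same work-group, so work_group scope is sufficient here.
inline bool try_insert(int *locks, int *node_body, int leaf_idx, int body_idx) {
    sycl::atomic_ref<int, sycl::memory_order::acq_rel,
                     sycl::memory_scope::work_group,
                     sycl::access::address_space::global_space>
        lock(locks[leaf_idx]);

    int expected = UNLOCKED;
    // Try to acquire the per-node lock; give up (and retry later) if it is already taken.
    if (!lock.compare_exchange_strong(expected, LOCKED))
        return false;

    if (node_body[leaf_idx] == -1) {
        // Leaf is empty: store the body. (Splitting a non-empty leaf is omitted here.)
        node_body[leaf_idx] = body_idx;
    }

    // Make the modification visible to the other work-items of this work-group
    // before the lock is released.
    sycl::atomic_fence(sycl::memory_order::release, sycl::memory_scope::work_group);
    lock.store(UNLOCKED);
    return true;
}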
Due to device specific constraints this also limits the amount of parallelism that is available since work-groups are not allowed to be arbitrarily large. For example for most GPUs tested during this work the limit is 1024 work-items per work-group. Limiting the whole octree creation to one work-group would thus result into a non-optimal usage of the resource of a device, especially for highly parallel GPU hardware. Thus a new approach had to be taken which makes a few changes to the algorithm in order to achieve more parallelism with SYCL. This approach will be explained in the following section. Leveraging subtrees for more parallelism The general idea of the approach that will be described in this section is that it builds the octree with the same method as previously, but in two steps. First, the top of the tree gets build up to a certain level that can be specified using a parameter. Figure 4.2 visualizes this step with an example for a maximum level of two. Here, ten bodies with IDs from 1 to 10 are inserted into the upper part of the tree. In this example the tree corresponds to a quadtree for easier visualization. However the concept can be easily extended to an octree. The body with ID 4 got already inserted into a node on level one, thus no subtree originates here. The other bodies have not reached their final position on a level smaller or equal than two. This means that the node on level two into which they would have been inserted becomes the root node of their subtree. For example, in figure 4.2 the bodies with ID 1, 2 and 3 will belong to the same subtree. Figure 4.3 shows how the subtrees which were determined in the first step are build in a second step. The benefit is that each of these subtrees can now be build independently of the other subtrees in separate work-groups. More importantly, if changes are made to one of the subtrees, it is only important that all work-items of the work-group associated with this subtree see these changes. 26 4.2 Implementation of the Barnes-Hut algorithm Work-items belonging to other work-groups will never access nodes of this subtree, thus it is not important that their view on this part of the memory is consistent. This means that with this approach memory fences which only have effects on one work-group are fully sufficient. {1,2,3} 4 {5,6,7} {8,9,10} Figure 4.2: Building the tree to a maximum level of two to determine the subtrees for each body. All leaf nodes in this example that contain a set of more than one body (represented through their ID) will become subtree root nodes. Body 4 got already inserted at level one. Thus body 4 has reached its final position and will not be part of any subtree 1 3 2 4 5 7 6 10 8 9 Figure 4.3: Creating all subtrees independently from each other after all nodes that belong to a subtree have been determined With one work-group per subtree, one has much more work-groups and parallelism available in the second step. However, one is still limited to one work-group for the creation of the top of the octree and has additional overhead for distributing the bodies to the work-groups associated with the subtree that will contain the body. The performance differences of this approach with subtrees and the approach which is limited to one work-group will be further analyzed in chapter 5.3.2. 27 4 Implementation 4.2.2 Calculating the center of mass After the octree data structure is build, one can now calculate the center of mass and the sum of all masses for each node. Burtscher et al. 
extracted this part of the calculation into a separate step. It is theoretically possible to calculate the center of mass on the fly during the octree creation. However this requires extensive use of atomic operations and during the implementation process it turned out that this drastically reduces performance. Thus, in the SYCL implementation of the Barnes-Hut algorithm conducted during this thesis the computation of the center of mass is also performed in a separate step. The general idea of Burtscher et al. is to traverse the octree bottom-up in parallel and compute the center of mass and sum of masses for each node. For the bottom-up traversal the authors make use of ordering guarantees for the order in which the nodes are stored. In the SYCL implementation of the algorithm in this work all nodes of the tree are stored in one buffer. The next insertion index is determined atomically during the octree creation. With this method it is guaranteed that child nodes will have a higher index as their parents. The data structure does not store pointers to the parent node of a child node explicitly. But because of this ordering guarantee, the octree can be traversed bottom up by starting with the last node in the buffer that stores all nodes and iterating through it until the root node at index 0 is reached. In the algorithm each work-item gets assigned a certain subset of the nodes and starts with the node that has the highest index. If the node is a leaf node the center of mass can be calculated directly. If it is an inner node of the octree, it has to be checked if all child nodes of this node are already processed. If this is the case, the center of mass and sum of masses can be calculated by combining the respective values of the child nodes, if not, the work-item continues with the next node and revisits the node in a subsequent iteration over all nodes assigned to the work-item. This happens as long as not all nodes are processed yet. Burtscher et al. use the sum of masses value as a flag to indicate if a node has already been processed. This enables the algorithm to work without atomic operations. However, since it is not guaranteed that if a work-item sees the center of mass value, it will also see the center of gravity value of the node, a memory fence is needed again. For the same reasons as with the octree creation, the usage of memory fences limits this kernel to one work-group for the SYCL implementation. Using an approach with subtrees again is much harder for this kernel, since with the current SYCL implementation all nodes are stored in one buffer, independent of what subtree they belong to. This makes it hard to filter out nodes belonging to one specific subtree without overhead. Separating the storage space, so that all nodes of one subtree are stored right next to each other is also not trivial, since estimating how many nodes will be created for one subtree and thus how much storage is needed is not simple. Even a subtree containing only two bodies can become really deep if the bodies are close to each other. This scenario is not unrealistic in astrophysics since such a case could correspond to a typical planet-moon system. Thus, in the SYCL implementation of this thesis most of the computations for the center of mass are limited to one work-group. The assignment of nodes to work-items is a central part of this kernel and important for its runtime behavior. 
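The per-work-item logic just described can be sketched as follows. The sketch is purely illustrative (invented names and node layout, written as ordinary host code) and shows the sweep of a single work-item as if it owned all nodes; in the actual kernel the nodes are distributed over the work-items of one work-group, which is why unprocessed nodes have to be revisited and why the memory fence discussed above is needed. Empty leaf nodes are not considered.

#include <array>
#include <vector>

// Illustrative node storage: children hold the indices of the child nodes (or -1);
// sum_mass == 0.0 doubles as the "not processed yet" flag, as described above.
struct TreeNode {
    int    children[8];
    int    body;       // body stored in a leaf node, -1 otherwise
    double com[3];     // center of mass, valid once sum_mass != 0
    double sum_mass;   // 0.0 means "not processed yet"
};

// Schematic version of the work a single work-item performs: it repeatedly sweeps over
// the node indices assigned to it (highest index first) until all of them are processed.
void process_assigned_nodes(std::vector<TreeNode> &nodes,
                            const std::vector<double> &body_mass,
                            const std::vector<std::array<double, 3>> &body_pos,
                            std::vector<int> assigned /* highest index first */) {
    std::vector<int> remaining = std::move(assigned);
    while (!remaining.empty()) {
        std::vector<int> next_round;   // cache of nodes that could not be processed yet
        for (int idx : remaining) {
            TreeNode &node = nodes[idx];
            double m = 0.0, com[3] = {0.0, 0.0, 0.0};
            bool ready = true;
            if (node.body >= 0) {      // leaf node: take mass and position of its body
                m = body_mass[node.body];
                for (int d = 0; d < 3; ++d) com[d] = body_pos[node.body][d];
            } else {                   // inner node: combine the already processed children
                for (int c : node.children) {
                    if (c < 0) continue;
                    if (nodes[c].sum_mass == 0.0) { ready = false; break; }
                    m += nodes[c].sum_mass;
                    for (int d = 0; d < 3; ++d) com[d] += nodes[c].sum_mass * nodes[c].com[d];
                }
                for (int d = 0; d < 3; ++d) com[d] /= (m > 0.0 ? m : 1.0);
            }
            if (ready) {
                for (int d = 0; d < 3; ++d) node.com[d] = com[d];
                node.sum_mass = m;               // written last: acts as the "processed" flag
            } else {
                next_round.push_back(idx);       // revisit in a later sweep
            }
        }
        remaining = std::move(next_round);
    }
}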
Since for nodes with higher indices it is more likely that they can be processed directly, these nodes should be processed first by the work-items. Figure 4.4 shows an example of two methods for assigning nodes two work-items. In this example 14 nodes get assigned to two work-items. Method 1 is based on the method described by Burtscher et al. where work-item 𝑖 starts from the back with the node that has the highest index and processes every 𝑖-th node of the array. 28 4.2 Implementation of the Barnes-Hut algorithm work-item 1 work-item 2 Method 1: Method 2: Figure 4.4: Example for the two methods of assigning nodes to two work items. Each cell in the arrays represents a node, the color denotes the work-item to which a node gets assigned. The arrows indicate the processing direction of each work-item. Method 2 performs worse since a lot of the computations of work-item 2 will depend on computations that work-item 1 has to do. Based on [BP11] p. 82, figure 6.10. Method 2 shows a different mapping between work-items and nodes. Here all nodes in the array are split into blocks and each block gets assigned to one work-item. Each work-item will again start processing with node that has the highest index. However, since in the SYCL implementation of this work child nodes always have higher indices as their parent node and the processing of the parent node depends on the processing of the child node, it is likely that many nodes assigned to work-item 2 in the example depend on nodes assigned to work-item 1, forcing work-item 2 to wait more often. Because of this it is better to choose the first method for assigning nodes to work-items. However, method 1 does not work with the Open SYCL OpenMP backend on CPUs, since the algorithm does not terminate. On GPUs this methods works reliable with various backends. Since the performance gain of the first method over the second method on GPUs is not negligible, two separate kernels for GPUs and CPUs were implemented with SYCL. It is not fully clear why the method does not work with the Open SYCL OpenMP backend on CPUs and would require further research. One reason could be that method one of this algorithm heavily relies on each work-item making progress eventually. However, this property is not guaranteed by the SYCL standard and therefore depends on the implementation (see [Khr22] chapter 3.8.3.4.). Optimizations for the center of mass computation As described by Burtscher et al. the center of mass and sum of masses values of leaf nodes can be directly calculated since the computation of those values does not depend on any other node. This property is even more important in the SYCL implementation of this algorithm performed during this thesis. Since there are no dependencies for the computation of these values, a memory fence is not needed. Thus, in the implementation with SYCL, the calculations for all leaf nodes are extracted into a separate kernel. In this kernel a maximum amount of parallelism can be used since one is not 29 4 Implementation limited to one work-group anymore. Furthermore, it reduces the amount of work for the second kernel which is limited to only one work-group. Additionally, since the center of mass is already calculated for each leaf node, their parent nodes can directly be processed in the second kernel. Since each work item often has to iterate over all nodes assigned to it several times it is better to cache nodes that still have to be processed. This optimization was also performed by Burtscher et al. 
In the implementation done during this thesis, all nodes that could not be processed during the first phase are stored in a separate buffer. For all subsequent iterations only theses nodes are considered. Like in the implementation of this kernel by Burtscher et al., the number of bodies that are contained in the cell represented through a node gets stored for each node. This is important for the sorting of the bodies which will be described in the next section. 4.2.3 Sorting the bodies with the tree This part is not needed for the correctness of the algorithm. However, it is important for improving the performance of the acceleration computation on GPUs. Burtscher et al. use this step to minimize thread divergence on GPUs. The order of the bodies after the sorting step will correspond to the in-order tree-traversal of the octree that has been created. In the SYCL implementation of this work, for each body the tree is traversed top to bottom until the leaf node of this body is reached. During each step, when a work-item steps down one level in the tree it has to determine how many nodes have to be stored before this node. This can be done using the value of how many bodies belong to a cell represented by a node which was calculated during the center of mass computation. The algorithm increments a counter with this value for all nodes whose bodies will be stored before the bodies of the current node, each time it steps down in the tree. The insertion index of the current body in the buffer of the bodies sorted according to an in-order tree-traversal is, thereby, determined when the leaf node containing the body is reached. 4.2.4 Computing the acceleration for each body For each acceleration value ®𝑎𝑖 the octree has to be traversed top to bottom. However, it is not necessary to traverse the tree up to the leaf level every time. In the SYCL implementation of the Barnes-Hut algorithm this kernel has no constraints for the maximum amount of parallelism. Thus, one work-item computes exactly one acceleration value ®𝑎𝑖 . In order to have more control over the mapping from work-groups to the corresponding primitives of the device specific backend, the work-group size can be explicitly specified via a parameter. To allow arbitrary work-group sizes a padding is added in a similar way as described in section 4.1.1. Each work-item traverses the tree top to bottom with the usage of a stack. For each node two cases have to be considered. The first case is that the center of mass value of the node can be used as an approximation for the computation of ®𝑎𝑖. In this case, all nodes belonging to the subtree of this node do not have to be considered anymore. In the other case, if the approximation can not be made, all child nodes of the node have to be pushed on a stack, indicating that they still have to be processed later. 30 4.2 Implementation of the Barnes-Hut algorithm In order to minimize thread divergence it is desired that threads in one warp always execute code belonging to the same branch, i.e., consider the same of the two cases described above. This means that the decision whether an approximation can be used or not should be the same for the threads in one warp. It is likely that this is the case if the acceleration values the threads are computing belong to bodies that are close to each other spatially. This is why Burtscher et al. 
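A schematic version of this stack-based traversal for a single work-item is shown below. It assumes a flattened tree in which the eight children of an inner node are stored contiguously; the names, the data layout and the stack bound are chosen for this illustration and do not correspond to the thesis code.

#include <cmath>

constexpr int MAX_STACK_DEPTH = 128;   // illustrative bound, must cover the deepest tree
constexpr double G = 6.674e-11, EPS2 = 1e-9, THETA = 0.6;

// Flattened tree as it could be stored on the device: plain arrays indexed by node id.
// first_child[n] < 0 marks a leaf; otherwise the eight children are stored contiguously.
struct FlatTree {
    const int    *first_child;
    const double *com_x, *com_y, *com_z;   // centers of mass (equation 3.7)
    const double *sum_mass;                // summed masses
    const double *edge_length;             // cell edge lengths
};

// Stack-based traversal for one body, as executed by a single work-item.
void accelerate_one_body(const FlatTree &t, double rx, double ry, double rz,
                         double &ax, double &ay, double &az) {
    int stack[MAX_STACK_DEPTH];
    int top = 0;
    stack[top++] = 0;                      // start at the root node

    while (top > 0) {
        const int n = stack[--top];
        const double dx = t.com_x[n] - rx, dy = t.com_y[n] - ry, dz = t.com_z[n] - rz;
        const double dist2 = dx * dx + dy * dy + dz * dz;
        const bool is_leaf = t.first_child[n] < 0;

        // edge/dist < theta is rewritten as edge^2 < theta^2 * dist^2 to avoid a sqrt.
        if (is_leaf || t.edge_length[n] * t.edge_length[n] < THETA * THETA * dist2) {
            // Criterion 3.8 holds (or the node is a leaf): use the node's virtual body.
            const double d2 = dist2 + EPS2;
            const double inv_d3 = 1.0 / (d2 * std::sqrt(d2));
            ax += G * t.sum_mass[n] * dx * inv_d3;
            ay += G * t.sum_mass[n] * dy * inv_d3;
            az += G * t.sum_mass[n] * dz * inv_d3;
        } else {
            // Otherwise all child nodes still have to be processed later.
            for (int c = 0; c < 8; ++c)
                stack[top++] = t.first_child[n] + c;
        }
    }
}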
introduced the in-order sorting step of all bodies, since bodies which are close to each other in the in-order sorting of the octree are also likely to be close to each other spatially. With SYCL, however, there is no direct control over the mapping of specific calculations to warps, so one has to rely on the SYCL implementation for this part. In the SYCL implementation of the Barnes-Hut algorithm there is a one-to-one mapping between the global ID of a work-item and the index of the buffer holding the sorted bodies. More specifically, the work-item with ID 0 will compute the acceleration of the first body in the buffer of sorted bodies, and so on. As a result, work-items with consecutive IDs compute acceleration values for bodies that also appear in consecutive order after the in-order sorting of all bodies. The effect of this performance optimization will be analyzed more closely in section 5.3.3.

5 Analysis of the runtime behavior

In this chapter the runtime behavior of the two algorithms implemented with SYCL will be analyzed. First, the two algorithms will be considered separately. For each algorithm there will be experiments analyzing the impact of their different parameters. After that, the performance optimizations described in chapter 4 will be evaluated. Last, the overall runtime behavior of the two algorithms will be compared on different CPUs as well as on GPUs from different vendors.

5.1 Experiment setups

Unless stated otherwise, Open SYCL will be used for the experiments since it is compatible with all the hardware used in the experiments through the CUDA, ROCm and OpenMP backends. CUDA version 11.4.3 is used on all systems with NVIDIA GPUs and ROCm 5.3.3 is used for the AMD GPU. The DPC++ version used for all experiments is sycl-nightly/20221102. For Open SYCL, commit 4a04f1c is used. Open SYCL was built against the LLVM compiler of the DPC++ installation.

For most runs, a simulation interval of 10 earth days was chosen. With a Δ𝑡 of one hour, this results in 241 acceleration computations including the computation of the initial acceleration values. For some runs it was necessary to restrict the number of acceleration computations to 25 or 13 due to the very long runtime. All individual timings of each run have been averaged. Generally, it was observed that there are a lot more fluctuations in the runtimes on CPUs than in the runtimes on GPUs.

The runtimes of the leapfrog integration steps are not included in the time measurements. One reason for this is that the integration is the same for the naive algorithm and the Barnes-Hut algorithm and thus does not contribute anything to a comparison of both algorithms. Furthermore, its runtime is negligible compared to the runtime of the other kernels of the simulation. For example, even for the largest dataset the integration steps take only about half a millisecond on an NVIDIA RTX 3090.

The datasets used for the simulation contain real-world data of objects in our solar system. The data originates from a database of the NASA Jet Propulsion Laboratory1. For the experiments six differently sized datasets were provided, containing 178, 999, 2678, 19054, 138723 and 1216869 bodies, respectively. Table 5.1 shows the technical specifications of all five test systems that will be used during the experiments. Hardware that is explicitly used for the experiments is highlighted in bold.
1 https://ssd.jpl.nasa.gov/tools/sbdb_query.html (visited on 04/03/2023)

System 1: GPU NVIDIA A100 (40GB HBM2, 9.7 TFLOPS FP64); CPU 2x AMD EPYC 7742 (128 cores, 3.4 GHz)
System 2: GPU NVIDIA Quadro GP100 (16GB HBM2, ~5 TFLOPS FP64); CPU 2x Intel Xeon Silver 4116 (24 cores, 3.0 GHz)
System 3: GPU NVIDIA GeForce RTX 3090 (24GB GDDR6X, 0.556 TFLOPS FP64); CPU AMD Ryzen Threadripper 3960X (24 cores, 4.5 GHz)
System 4: GPU AMD Radeon PRO VII (16GB HBM2, 6.5 TFLOPS FP64); CPU Intel i7-6700K (8 cores, 4.2 GHz)
System 5: no GPU; CPU 2x AMD EPYC 7543 (64 cores, 3.7 GHz)

Table 5.1: Overview of all systems used for the experiments. Hardware that is explicitly used for the experiments is highlighted in bold. All the other hardware of the systems is also shown for completeness. The CPU clock speed corresponds to the maximum single-core speed as specified by the manufacturer. The information for the hardware specifications has been retrieved from [NVI17], [NVI20], [NVI21], [AMD20b], [AMD20a], [Intb], [AMD], [Inta], [AMD22]. If no FP64 performance was specified, the number of tera floating point operations per second (TFLOPS) was calculated using the FP32 performance and the single- to double-precision ratio of the respective GPU.

For experiments on the CPU, hyper-threading (HT) was enabled and the omp.accelerated target of Open SYCL was used. All calculations were performed using FP64 double precision. If not stated otherwise, all runtimes correspond to the time of one time step without leapfrog integration. If the runtime corresponds to a specific part of the algorithm, the runtime of this part in one time step is given.

5.2 Impact of the parameters on the runtime behavior of the algorithms

In this section the impact of the different parameters of both algorithms on their runtime behavior will be analyzed. First, the block size of the naive algorithm will be considered. After that, the different parameters of the Barnes-Hut algorithm will be studied, starting with the parameters that influence the runtime behavior of the octree creation: the maximum level of the top of the octree and the amount of parallelism available for the first and second phase of the octree creation. Last, the impact of the 𝜃-value, which is used to control the accuracy of the Barnes-Hut algorithm, on the runtime of the acceleration kernel will be analyzed.

5.2.1 Impact of the block size on the naive algorithm

The goal of the experiments presented in this section is to analyze the impact of the block size of the naive algorithm on its runtime behavior. The first experiment analyzes the impact of the block size of the naive algorithm on GPUs. In the experiment, the runtime of the naive algorithm with different block sizes was measured on an NVIDIA Quadro GP100 using all six datasets.

Figure 5.1: Comparison of different block sizes 𝑘 for the naive algorithm on an NVIDIA Quadro GP100 GPU. (a) Runtime comparison between different block sizes 𝑘 and different numbers of bodies. (b) Comparison of different block sizes for 19054 bodies. Figure 5.1a shows the runtime behavior for four different block sizes on all datasets; a block size of 128 performs well for most datasets. Figure 5.1b shows the runtimes of several block sizes for the dataset containing 19054 bodies; the blue bars denote block sizes that are a power of two whereas the red bars correspond to block sizes that are divisors of 19054 and can be used as block sizes.
It can be clearly seen that most block sizes that are a power of two perform better than the divisors of 19054.

Figure 5.1a shows the runtimes of the naive algorithm for all datasets and for different block sizes 𝑘. Each curve corresponds to one block size. The number of bodies used for the simulation is shown on the x-axis and the y-axis corresponds to the runtime in milliseconds. Both the x-axis and the y-axis are logarithmically scaled. One can observe that small values for 𝑘 like 16 do not perform well for large numbers of bodies. With 𝑘 = 16 the naive algorithm takes about 38.8s for the largest dataset, whereas with 𝑘 = 128 it only takes about 20.6s. This could be explained by the fact that for small block sizes the overall number of blocks is much higher than for large block sizes. Thus, one has more overhead for the blocking compared to larger block sizes, which perform much better for large body counts. However, in figure 5.1a it can also be recognized that large values like 𝑘 = 1024 do not perform well for small numbers of bodies. The latter cannot be observed for the smallest dataset of 178 bodies since there cannot be more than 178 work-items performing calculations, thus larger 𝑘 do not make a difference.

Figure 5.1b shows a bar plot with the runtimes of the naive algorithm for the dataset with 19054 bodies. Each bar corresponds to one block size 𝑘. In addition to the values for 𝑘 which are highlighted in blue and are a power of two, four other block sizes are also considered: 𝑘 = 1, 𝑘 = 2, 𝑘 = 7 and 𝑘 = 14. These values correspond to all divisors of 19054 that are smaller than or equal to the maximum work-group size of 1024 and are highlighted in red. They are interesting because they would be the only valid work-group sizes, and thus block sizes, one could use without a padding that allows for arbitrary 𝑘. Since the runtimes for the divisors are a lot longer than the runtimes for most of the 𝑘 which are a power of two, one can conclude that the work-group size should be chosen as a power of two for optimal performance. Further testing conducted during this thesis showed that a multiple of a power of two is also sufficient. One can see that 𝑘 = 128 performs well for most datasets. Thus, this value will be used for all future experiments with the naive algorithm on GPUs unless stated otherwise.

In order to study how the block size influences the runtime behavior on CPUs, the previous experiment was repeated on CPUs. The system used for this experiment contains a dual socket AMD EPYC 7543 with a total of 64 cores and HT enabled.

Figure 5.2: Comparison of different block sizes 𝑘 for the naive algorithm on a dual socket AMD EPYC 7543 with 64 CPU cores. (a) Runtime comparison of different block sizes 𝑘 and different numbers of bodies. (b) Comparison of different block sizes for 19054 bodies. Figure 5.2a shows the runtime behavior for four different block sizes on all datasets; the x-axis corresponds to the number of bodies contained in the dataset. Figure 5.2b shows the runtimes of several block sizes for the dataset containing 19054 bodies; the blue bars denote block sizes that are a power of two whereas the red bars correspond to block sizes that are divisors of 19054 and can be used as block sizes. The fact that the block size is a power of two does not play an important role on CPUs; the overall size of the value is more important.

Figure 5.2 shows the results of this experiment in the same scheme as used for the previous experiment.
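Before turning to the CPU results in detail, the following sketch illustrates how the block size 𝑘 enters the naive kernel. It is only a rough illustration under assumed names (pos_x, mass, acc_x and so on are placeholders, not the thesis code), with the gravitational constant omitted and an arbitrary softening term: the global range is padded up to a multiple of 𝑘 so that any block size is a valid work-group size, and each work-group caches one block of 𝑘 bodies in local memory, which corresponds to the stage-two optimization evaluated in section 5.3.1.

#include <cstddef>
#include <sycl/sycl.hpp>

// Blocked naive kernel sketch: block size k equals the work-group size.
void launch_blocked_naive(sycl::queue &q, std::size_t n, std::size_t k,
                          const double *pos_x, const double *pos_y,
                          const double *pos_z, const double *mass,
                          double *acc_x, double *acc_y, double *acc_z) {
  const std::size_t padded = ((n + k - 1) / k) * k;  // padding for arbitrary k
  q.submit([&](sycl::handler &cgh) {
    sycl::local_accessor<double, 1> tile_x{sycl::range<1>(k), cgh};
    sycl::local_accessor<double, 1> tile_y{sycl::range<1>(k), cgh};
    sycl::local_accessor<double, 1> tile_z{sycl::range<1>(k), cgh};
    sycl::local_accessor<double, 1> tile_m{sycl::range<1>(k), cgh};
    cgh.parallel_for(sycl::nd_range<1>{sycl::range<1>(padded), sycl::range<1>(k)},
                     [=](sycl::nd_item<1> it) {
      const std::size_t i = it.get_global_id(0);
      const std::size_t lid = it.get_local_id(0);
      double ax = 0.0, ay = 0.0, az = 0.0;
      const double xi = (i < n) ? pos_x[i] : 0.0;
      const double yi = (i < n) ? pos_y[i] : 0.0;
      const double zi = (i < n) ? pos_z[i] : 0.0;
      for (std::size_t base = 0; base < padded; base += k) {
        const std::size_t j = base + lid;        // body this work-item loads
        tile_x[lid] = (j < n) ? pos_x[j] : 0.0;  // padded entries get zero mass
        tile_y[lid] = (j < n) ? pos_y[j] : 0.0;
        tile_z[lid] = (j < n) ? pos_z[j] : 0.0;
        tile_m[lid] = (j < n) ? mass[j] : 0.0;
        sycl::group_barrier(it.get_group());
        for (std::size_t l = 0; l < k; ++l) {    // accumulate over the cached block
          const double dx = tile_x[l] - xi, dy = tile_y[l] - yi, dz = tile_z[l] - zi;
          const double r2 = dx * dx + dy * dy + dz * dz + 1e-11;  // softening assumed
          const double inv_r = 1.0 / sycl::sqrt(r2);
          const double s = tile_m[l] * inv_r * inv_r * inv_r;     // gravitational constant omitted
          ax += s * dx; ay += s * dy; az += s * dz;
        }
        sycl::group_barrier(it.get_group());
      }
      if (i < n) { acc_x[i] = ax; acc_y[i] = ay; acc_z[i] = az; }
    });
  });
}

Padded work-items still take part in the group barriers and simply load zero mass, so no extra branches are needed inside the inner loop.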
In figure 5.2a it can be observed that smaller values for 𝑘 perform much better for smaller datasets. For large numbers of bodies this cannot be observed anymore and large 𝑘 even perform a little bit better. For example, for the largest dataset 𝑘 = 16 would need about 125.6s whereas 𝑘 = 128 needs only 117.4s.

Figure 5.2b shows the runtime for different block sizes with the dataset containing 19054 bodies in more detail. As in the previous experiment, in addition to the values for 𝑘 that are a power of two, all divisors of 19054 smaller than 1024 are shown again. Again, most of the divisors of 19054 do not perform well. However, this is most likely due to their small size, which results in a large number of blocks and thus increases the overhead. Unlike on GPUs, 𝑘 = 14 performs quite well on CPUs. From that one can conclude that, unlike on GPUs, it is not that important that 𝑘 is a power of two and the overall size of 𝑘 plays a bigger role.

The observation that small block sizes perform better than big block sizes for small numbers of bodies could be explained by an increased probability of cache hits when small blocks are used. However, this advantage vanishes for large numbers of bodies since the small block size creates additional overhead. A value of 𝑘 = 128 performs well for most datasets. Thus, this value will be used for all future experiments with the naive algorithm on CPUs unless stated otherwise.

5.2.2 Impact of the Barnes-Hut tree-building parameters

The goal of the experiments presented in this section is to study how the different parameters that adjust the octree creation of the Barnes-Hut algorithm influence the runtime of the octree creation on CPUs and GPUs. The first parameter that will be considered is the maximum level to which the top of the tree is built in the first step. The subtrees that emerge in the first step are then built in parallel in the second step. The second and the third parameter that will be analyzed in this section determine the amount of parallelism used for the first and second step of the tree creation, respectively. Generally, it has to be noted that the runtime behavior of the octree creation highly depends on the dataset used for the simulation since the positioning of the bodies in space determines the structure of the tree. Thus, the parameters might behave differently on different datasets.

The depth of the top of the octree

The experiment presented in this section aims to analyze how to choose the maximum tree depth for the top of the octree that is built in the first part of the octree creation. In the experiment the runtime of the whole octree creation was measured using different values for the maximum tree depth for the top of the octree. Generally, a higher value for this maximum depth will result in more work for the first phase, where less parallelism is available, and generate many small subtrees. A smaller maximum tree depth for the top of the octree will result in less work for the first phase and generates fewer, but larger subtrees.

Figure 5.3 shows the results of the experiment on an NVIDIA Quadro GP100 GPU and a dual socket AMD EPYC 7543 CPU. The x-axes correspond to the number of bodies used for the simulation and the y-axes show the time for the octree creation in milliseconds. Both axes are logarithmically scaled.
Figure 5.3: Comparison of how different values for the maximum depth of the top of the octree influence the time for the octree creation. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. Figure 5.3a shows the impact of the depth of the top of the octree on an NVIDIA Quadro GP100 GPU whereas figure 5.3b shows the results of the same experiment repeated on a dual socket AMD EPYC 7543. It can be observed that on the CPU the best results for larger datasets can be obtained with a depth of nine. On the GPU, seven yields the best results across most datasets.

In Figure 5.3a, which shows the results for the NVIDIA Quadro GP100, it can be observed that a depth of seven yields the best results across most datasets. Very low values like one or three perform worst. For example, the octree creation on the largest dataset requires only about 194.4ms with a depth of seven, whereas with a depth of three the computation requires about 295.4ms. This is likely because with low values for the depth of the top of the octree the number of subtrees that emerge is too small and the subtrees themselves can become very deep. Thus, the advantage of the increased parallelism is very limited. On the other hand, large values like nine also result in suboptimal performance. Part of the reason for this behavior could be that the first part of the octree creation requires more work the deeper the top of the tree gets. Thus, there is more work for which one is limited to one work-group and thus cannot use the full amount of parallelism available on the GPU. Since with higher values one has more subtrees, and thus more parallelism, in the second phase, one can compensate for the performance loss of the first phase. However, this is only possible if the GPU can also fully use this increased amount of parallelism. If there is more theoretical parallelism than can be used, there might not be a performance gain anymore. Overall, a depth of seven seems to balance the amount of work in the first phase and the available parallelism in the second phase best. Thus, this value will be chosen for all future experiments on GPUs if not stated otherwise.

Figure 5.3b shows the results of the experiment on a dual socket AMD EPYC 7543 CPU. One can see that, similarly to GPUs, small values for the depth of the top of the tree perform worst, especially for large datasets. However, nine performs best for larger datasets on CPUs. With a runtime of approximately 607.9ms it is a lot faster than a depth value of seven, which would require 1337.2ms for the octree creation on the largest dataset. This was not the case on GPUs. One reason for this behavior could be that CPUs favor the many small subtrees that emerge when using a larger tree depth for the top of the octree. These smaller subtrees can then be built very fast without the need for a lot of synchronization, since they contain fewer bodies than larger subtrees. One could also argue that the reason for this behavior is that the approach with subtrees is not suited for CPUs and larger tree depths for the top of the tree shift more work to the first phase, which is more similar to the original approach without subtrees. However, as will be shown in section 5.3.2, the approach with subtrees improves the performance over the approach without subtrees on CPUs.
Thus, it is more likely that higher values for the top of the tree perform better due to the increased number of subtrees and not only because more work is shifted to the first phase of the tree creation. Since a depth of nine for the top of the tree performed best for the larger datasets, this value will be chosen for CPUs unless stated otherwise.

Choosing the amount of parallelism for the two phases of the octree creation

The experiments presented in this section will analyze how the amount of parallelism available for the first and the second phase of the octree creation influences the runtime behavior of the overall octree creation on CPUs and GPUs. Generally, using more work-items results in more parallelism but also requires more synchronization between those work-items, which introduces additional overhead. This is why it is important to analyze the impact of these values. In order to do this, the runtime of the first and the second phase of the octree creation was measured on an NVIDIA Quadro GP100 GPU and on a dual socket AMD EPYC 7543 CPU using different amounts of parallelism.

Figure 5.4: Comparison of different amounts of work-items for the first phase of the octree creation on an NVIDIA Quadro GP100 GPU and on a dual socket AMD EPYC 7543 CPU. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. On the GPU, small work-item counts like 128 perform worst and higher values perform better. For larger datasets there is not such a large difference on CPUs between small and large work-item counts.

Figure 5.4 shows the runtimes for the creation of the top of the tree on CPUs and GPUs for different work-item counts. The x-axes correspond to the number of bodies used for the simulation and the y-axes correspond to the runtime for the creation of the top of the tree in milliseconds. All axes have been logarithmically scaled.

Figure 5.4a shows the impact of changing the work-item count for the first phase of the octree creation on an NVIDIA Quadro GP100 GPU. It can be seen that smaller values like 128 result in the worst performance, with a runtime of about 85ms for the largest dataset. However, the largest possible value of 1024 does not result in the best performance across all datasets either. The results indicate that choosing higher values results in better performance for this part of the algorithm. Nevertheless, the highest possible value is not always the best choice, since it is about 3.8ms slower than using 640 work-items for the largest dataset. However, smaller values like 128 clearly perform worst. A work-item count of 640 offers good performance across all datasets and even performs best for the largest dataset. Thus, this value will be chosen on GPUs for the final experiments unless otherwise stated.

Figure 5.4b presents the results when varying the number of work-items for the first part of the octree creation on a dual socket AMD EPYC 7543 CPU. In contrast to the version of this experiment on GPUs, the variation of the runtime between the different numbers of work-items is much smaller for larger datasets. Part of the reason for this could be that if the work-item count exceeds the thread count of the CPU (which is 128 in this case) there is no additional performance to gain. However, the overhead of using many more work-items seems to be very limited since these values do not perform much worse for larger datasets.
A work-item count of 128 performs best by a small margin for the largest dataset, where it is approximately two to three milliseconds faster than using 1024 or 16 work-items. Since it also performs well for smaller datasets, this value will be chosen for the final experiments on CPUs.

Figure 5.5: Comparison of different amounts of work-items per work-group when building the subtrees during the octree creation. Each work-group is responsible for building one subtree. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. Figure 5.5a shows the runtimes for different amounts of work-items per work-group on an NVIDIA Quadro GP100 GPU. Figure 5.5b shows these values for the same experiment on a dual socket AMD EPYC 7543 CPU. On the GPU, larger work-group sizes perform better than smaller ones. On the CPU there are almost no differences between the runtimes for large and small work-group sizes when large datasets are used.

Figure 5.5 shows the runtimes for the second phase of the octree creation on an NVIDIA Quadro GP100 and a dual socket AMD EPYC 7543 for different work-item counts. The work-item count corresponds to the number of work-items in one work-group. Each work-group is responsible for building one of the subtrees. The x-axes correspond to the number of bodies used for the simulation and the y-axes correspond to the runtime for the creation of the subtrees in milliseconds. All axes have been logarithmically scaled.

In figure 5.5a the runtimes measured on an NVIDIA Quadro GP100 are shown. One can observe that for smaller datasets, smaller values perform a little bit better, whereas 1024 performs best for the largest dataset. However, this value only works reliably on all GPUs considered during this work when using Open SYCL. When DPC++ is used, the program is unable to launch this kernel with such a high number of work-items. This could be because the number of registers needed gets too large. Since a work-item count of 640 also performs well across most datasets, being only 2.2ms slower than 1024 for the largest dataset, and runs stably on all GPUs with both Open SYCL and DPC++, this value will be chosen for the final experiments.

Interestingly, the total time needed for the creation of the subtrees with the dataset containing 999 bodies is lower than the time for the dataset with just 178 bodies. This is because these two datasets are very different. The dataset with just 178 bodies contains planets like Jupiter or Saturn which have a lot of moons. This results in a very deep octree. The dataset containing 999 bodies only contains asteroids which are sparsely distributed. Thus, the depth of the resulting octree is much smaller. Since the depth for the top of the tree is fixed across all datasets, the subtrees that get created for the second dataset are much smaller compared to the first one.

Figure 5.5b presents the runtimes for the creation of the subtrees on a dual socket AMD EPYC 7543 CPU. It can be observed that, similar to the results from the first phase of the octree creation, the choice of the amount of parallelism does not make a huge difference, especially for larger datasets. Since 16 performs well for small and large datasets, this value will be chosen for the final experiments on CPUs.
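To make the launch configuration of this second phase more concrete, the following sketch shows how one work-group per subtree can be expressed in SYCL with a configurable work-group size. The names (num_subtrees, subtree_roots) and the empty kernel body are placeholders, not the thesis code.

#include <cstddef>
#include <sycl/sycl.hpp>

// One work-group per subtree: the group index selects the subtree root,
// the local ID identifies a work-item within the cooperating group.
void launch_subtree_build(sycl::queue &q, std::size_t num_subtrees,
                          std::size_t wg_size, const int *subtree_roots
                          /* , further tree data ... */) {
  const sycl::nd_range<1> cfg{sycl::range<1>(num_subtrees * wg_size),
                              sycl::range<1>(wg_size)};
  q.parallel_for(cfg, [=](sycl::nd_item<1> it) {
    const std::size_t subtree = it.get_group(0);  // which subtree this group builds
    const std::size_t lid = it.get_local_id(0);   // work-item index inside the group
    const int root = subtree_roots[subtree];
    // ... insert the bodies belonging to `root` into its subtree; the wg_size
    //     work-items of the group cooperate and synchronize via group barriers ...
    (void)root; (void)lid;  // placeholders, unused in this sketch
  });
}

The global range is simply num_subtrees times the work-group size, so the group index directly selects the subtree a group is responsible for, and the work-group size parameter studied above maps one-to-one to wg_size.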
5.2.3 Effect of the parameters for the acceleration computation of the Barnes-Hut algorithm

The goal of the experiments presented in this section is to analyze the impact of the parameters that influence the runtime of the acceleration kernel in the Barnes-Hut algorithm. First, the work-group size of the acceleration kernel will be considered. After that, the impact of the 𝜃-value on the accuracy and runtime of the Barnes-Hut algorithm will be analyzed.

Impact of the work-group size on the acceleration kernel

Since no optimizations making use of local memory are applied to the acceleration kernel of the Barnes-Hut algorithm and it also does not require any form of synchronization, it is not necessary to explicitly group the work-items into work-groups. However, in practice the specification of the work-group size gives some control over how work-items are mapped to, e.g., thread blocks on NVIDIA GPUs. The experiment presented in this section will analyze the impact of the work-group size for the acceleration kernel on CPUs and GPUs.

Figure 5.6 shows the runtimes of the Barnes-Hut acceleration kernel for different work-group sizes on an NVIDIA Quadro GP100 GPU and a dual socket AMD EPYC 7543 CPU. The x-axes correspond to the number of bodies used for the simulation and the y-axes correspond to the runtime of the acceleration kernel. All axes have been logarithmically scaled.

Figure 5.6: Comparison of different work-group sizes for the acceleration kernel of the Barnes-Hut algorithm. (a) NVIDIA Quadro GP100, (b) dual socket AMD EPYC 7543. Figure 5.6a shows the runtimes for different work-group sizes on an NVIDIA Quadro GP100 GPU. Figure 5.6b shows these values for the same experiment on a dual socket AMD EPYC 7543 CPU. For the GPU a value of 16 performs best across most datasets. For the CPU a work-group size of 8 performs best across most datasets, especially for small ones.

In figure 5.6a the runtimes for different work-group sizes on an NVIDIA Quadro GP100 are presented. One can see that a work-group size of 16 results in the best performance on this GPU. With a runtime of about 837.2ms for the largest dataset it performs much better than a work-group size of, e.g., 1024, which results in a runtime of about 944.8ms for this kernel. For the final experiments 16 will be chosen as the work-group size for the acceleration kernel on the NVIDIA Quadro GP100. However, further testing revealed that this parameter has to be chosen differently on other GPUs in order to achieve good performance. Since the performance differences are not insignificant, the value was chosen individually for all GPUs that will be considered in this thesis. The decision which value is used is based on which of the work-group sizes performed best across most of the datasets in the experiments conducted during this thesis. For the NVIDIA RTX 3090 a work-group size of 32 will be chosen. The experiments with the NVIDIA A100 will use a work-group size of 1024 and for the AMD Radeon PRO VII a size of 64 is used for each work-group.

Figure 5.6b shows the effect of different work-group sizes on a dual socket AMD EPYC 7543 CPU. It can be observed that especially for small numbers of bodies there are huge differences between different work-group sizes. For example, with the dataset consisting of 2678 bodies the acceleration kernel takes about 1.5ms with a small work-group size, whereas with a work-group size of 1024 the runtime increases to about 13.4ms.
Generally, it can be observed that smaller work-group sizes perform better than bigger work-group sizes. One reason for this could be that the specification of the work-group size might influence how work-items are mapped to threads in the background and smaller work-groups result in situations that improve the cache performance. However, further explaining this observation would require more research into how the work-items of work-groups are mapped to threads by Open SYCL. Since a work-group size of eight performs best across most datasets, this value will be chosen for the final experiments on CPUs.

Impact of the 𝜃-value on runtime and accuracy

The 𝜃-parameter is a central part of the Barnes-Hut algorithm. It is used to control its accuracy and also affects the performance of the algorithm. In order to analyze the impact of this parameter on the runtime of the Barnes-Hut algorithm, the runtime of the acceleration kernel was measured using different values for 𝜃. Furthermore, in order to determine the impact of this value on the accuracy of the algorithm, one earth year was simulated with the dataset containing 2678 bodies and a Δ𝑡 of one hour. This was done for several 𝜃-values and the resulting final positions of all bodies were compared to those of the same simulation conducted with the naive algorithm.

Figure 5.7: Effect of the 𝜃-value of the Barnes-Hut algorithm on the runtime and accuracy. (a) Runtimes for different 𝜃-values on an NVIDIA Quadro GP100. (b) Average Euclidean distance between the final states of the Barnes-Hut algorithm using different 𝜃-values and the final state of the naive algorithm after one earth year. Figure 5.7a shows the runtimes of the acceleration kernel on an NVIDIA Quadro GP100 for different 𝜃-values. Figure 5.7b shows how the 𝜃-value impacts the accuracy of the algorithm by comparing the last state of the simulation performed by the Barnes-Hut algorithm with the last state of the simulation that used the naive algorithm. One can see that bigger 𝜃-values result in a lower runtime. However, when a high value for 𝜃 is chosen, the accuracy of the Barnes-Hut algorithm is decreased.

Figure 5.7a shows the results of the experiment regarding the impact of the 𝜃-value on the runtime of the Barnes-Hut algorithm. The experiment was conducted on an NVIDIA Quadro GP100 GPU. The x-axis corresponds to the size of the dataset used for the simulation. The y-axis shows the runtime of the acceleration kernel in milliseconds. Both axes have been logarithmically scaled. Each curve corresponds to one run of the simulation using a different 𝜃-value. One can see that the 𝜃-value has a big impact on the runtime of the acceleration kernel. With 𝜃 = 0.2 the runtime of the kernel in one time step is about 20.4s. When using 𝜃 = 0.4 the runtime decreases substantially to 848.8ms. With 𝜃 = 1 the runtime of the acceleration kernel further decreases to about 245.23ms. However, as will be shown in the next paragraph, these lower runtimes come at the cost of accuracy.

Figure 5.7b shows the results of the experiment that analyzes the impact of the 𝜃-value on the accuracy of the simulation. The x-axis corresponds to the 𝜃-value used for the simulation. The y-axis denotes the average Euclidean distance between the positions of the same bodies in the final states of the Barnes-Hut algorithm and the naive algorithm after simulating one earth year. The unit for the distance is the astronomical unit (AU).
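As an illustration of this accuracy metric, the following host-side sketch computes the average Euclidean distance between the final positions of the two runs. The flat coordinate vectors are an assumed layout, not the thesis data structures; the result is in AU if the positions are stored in AU.

#include <cmath>
#include <cstddef>
#include <vector>

// Average Euclidean distance between body i of the Barnes-Hut final state
// (bh_*) and body i of the naive final state (nv_*), averaged over all bodies.
double average_distance_au(const std::vector<double> &bh_x, const std::vector<double> &bh_y,
                           const std::vector<double> &bh_z, const std::vector<double> &nv_x,
                           const std::vector<double> &nv_y, const std::vector<double> &nv_z) {
  double sum = 0.0;
  const std::size_t n = bh_x.size();
  for (std::size_t i = 0; i < n; ++i) {
    const double dx = bh_x[i] - nv_x[i];
    const double dy = bh_y[i] - nv_y[i];
    const double dz = bh_z[i] - nv_z[i];
    sum += std::sqrt(dx * dx + dy * dy + dz * dz);  // deviation of body i between both runs
  }
  return sum / static_cast<double>(n);
}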
It can be observed that with higher 𝜃-values the precision of the simulation is reduced significantly. The overall average deviation might not seem that high, in part because for many bodies in the dataset no deviation can be observed at the accuracy used for the visualization of the simulation. Nevertheless, with higher 𝜃-values some planet-moon systems can become unstable if the simulation spans a large number of time steps. For this thesis the overall runtime behavior of the Barnes-Hut algorithm is most important. In order to strike a good balance between accuracy and runtime of the algorithm, a 𝜃-value of 0.6 will be chosen for all future experiments.

5.3 Evaluation of the performance optimizations

The goal of the experiments described in this section is to evaluate the three performance optimizations described in chapter 4. First, the usage of local memory in the naive algorithm will be analyzed. After that, the performance impact of using subtrees during the octree creation of the Barnes-Hut algorithm will be studied. Last, the impact of the sorting step will be examined.

5.3.1 Evaluation of the optimizations for the naive algorithm

The experiments presented in this section aim to analyze the impact of the performance optimizations applied to the naive algorithm. This analysis will deal with the following topics:

1. Impact of the optimizations on GPUs
2. Different effects on consumer and data center GPUs
3. Differences between SYCL implementations
4. Impact on the runtime on CPUs

In order to gain a better understanding of the impact of the individual optimization steps, three optimization stages for the naive algorithm will be considered. Stage zero corresponds to the implementation of the naive algorithm without any optimizations. Stage one introduces the grouping of acceleration computations into work-groups, including a padding to allow arbitrary work-group sizes. Stage two corresponds to the final implementation, including the usage of local memory.

Impact of the optimizations on GPUs

In order to analyze the impact of the optimizations on GPUs, the naive algorithm was run on all datasets with all three optimization stages using an NVIDIA Quadro GP100 GPU. For stage one and stage two, a work-group size of 128 has been chosen. Figure 5.8 shows the results of these runs. The x-axis corresponds to the number of bodies. The y-axis shows the runtime in milliseconds and is logarithmically scaled. Each set of three bars corresponds to the runs of the three different stages on one dataset. It can be seen that stage two is always faster than stage one and stage zero. For the largest dataset the optimizations from stage two improve the runtime from about 22.2 seconds to 20.6 seconds per acceleration computation. Since n-body simulations usually consist of a lot of time steps, this improvement adds up over time and can significantly improve the overall runtime of the simulation.

Figure 5.8: Analysis of the optimizations for the naive algorithm on an NVIDIA Quadro GP100. Stage 0 is the non-optimized version of the naive algorithm and stage 2 is the final, most optimized version of the naive algorithm. It can be observed that stage 2 improves the runtime of the naive algorithm most of the time.

Furthermore, it can be seen that stage one does not always offer an improvement over stage zero and sometimes even results in worse performance.
This indicates that the performance improvement indeed comes from the usage of local memory and not just from the division of work-items into work-groups. The fact that stage one sometimes results in slightly worse performance can be explained by the fact that the work-group size was chosen as 128 for all datasets. This value produces good results for all datasets; however, there might be specific work-group sizes that perform even better on some datasets.

Different effects on consumer and data center GPUs

To study this topic, the previous experiment was repeated using a different GPU, an NVIDIA GeForce RTX 3090. As opposed to the NVIDIA Quadro GP100 used previously, this is not a data center GPU but a consumer GPU. The main difference between those two categories of graphics cards is that data center GPUs typically have a lot more double-precision performance than consumer GPUs. This is especially relevant for the implementation of the naive algorithm conducted in this work, since all calculations are performed using double precision.

Figure 5.9 shows the results of this experiment in the same manner as previously.

Figure 5.9: Impact of the optimizations for the naive algorithm on an NVIDIA GeForce RTX 3090. Stage 0 is the non-optimized version of the naive algorithm and stage 2 is the final, most optimized version of the naive algorithm. It can be seen that stage 2 does not improve the runtime of the naive algorithm on consumer GPUs.

It can be observed that, as opposed to the run of the experiment on the Quadro GP100, no performance differences between the three stages can be seen on the GeForce RTX 3090. This can be explained by the much worse double-precision performance of this consumer GPU compared to the data center GPU used in the first experiment. Because of this, the NVIDIA GeForce RTX 3090 is likely heavily bottlenecked by its compute performance. Thus, the memory optimization of stage 2 does not come into effect. The much higher overall compute performance of the Quadro GP100 can also be clearly seen in the runtime differences between the two cards. The Quadro GP100 needs about 20.6 seconds for one acceleration computation on the largest dataset whereas the RTX 3090 takes about 2.4 minutes. This makes consumer GPUs impractical for the naive algorithm on larger datasets.

Differences between SYCL implementations

In order to analyze differences between the two SYCL implementations that are supported, the experiment was repeated using DPC++. Furthermore, in order not to limit the observations to NVIDIA GPUs, the experiment was repeated again on an AMD Radeon PRO VII GPU with both Open SYCL and DPC++.

Figure 5.10a shows the results of the first experiment on a Quadro GP100 using DPC++. One can see that the optimization also works with DPC++, since stage two improves the runtime on most datasets, with one exception being the smallest dataset. An explanation for this exception could again be the choice of the work-group size as 128 for all datasets. For the smallest dataset with just 178 bodies this is likely not the best value; however, it performs better for larger datasets and the performance loss for 178 bodies is not that large. As opposed to the results of the analogous experiment with Open SYCL, one can observe that stage one often improves performance over stage zero. This phenomenon could not really be observed with Open SYCL.
Figure 5.10: Differences between DPC++ and Open SYCL for the different optimization stages of the naive algorithm on an NVIDIA Quadro GP100. Stage 0 is the non-optimized version of the naive algorithm and stage 2 is the final, most optimized version of the naive algorithm. (a) Impact of the performance optimizations on an NVIDIA Quadro GP100 using DPC++. (b) Impact of the performance optimizations with DPC++ and Open SYCL for 19054 bodies. Figure 5.10a shows the runtimes of the naive algorithm with the different optimizations on an NVIDIA Quadro GP100. Figure 5.10b shows a comparison between DPC++ and Open SYCL with the different stages of the naive algorithm. It can be seen that, in contrast to Open SYCL, stage 1 also offers performance improvements in some cases.

Figure 5.10b shows a comparison between the runtimes of the three different stages with DPC++ and Open SYCL. The dataset used for the experiment contains 19054 bodies. The y-axis denotes the runtime in milliseconds. The left bar of each group shows the runtime of Open SYCL and the right bar shows the runtime of DPC++. It can be observed that Open SYCL is much faster than DPC++ at stage zero. With stage one, the performance of DPC++ improves a lot, from about 8.04ms for one time step to only 5.57ms. This is faster than Open SYCL, which needs about 6.91ms for one time step with stage one. Stage one, however, has almost no effect when using Open SYCL, since O