Distributed fast Fourier transform for heterogeneous GPU systems

Egger, Simon

Files

DistributedFFT-1.pdf (1.8 MB)

Date

2021

Authors

Egger, Simon

Abstract

The Fast Fourier Transform (FFT) is a numerical method that converts input data to a representation in the frequency domain. A wide range of applications requires three-dimensional FFTs, which makes Graphics Processing Units (GPUs) on distributed systems particularly appealing. The most common approach for distributed computation is to partition the global input data, using either slab decomposition or pencil decomposition. For large numbers of processes, slab decomposition is well known to offer only limited scalability and is generally outperformed by pencil decomposition. As a result, the relative performance of the two methods on fewer GPUs is often a blind spot: we found that slab decomposition generally dominates for larger input sizes when utilizing fewer GPUs, which is consistent with simple theoretical models. An exception to this rule arises when the processor grid of pencil decomposition is specifically aligned to fully utilize the available NVLink interconnections. In addition to the default implementations of slab decomposition and pencil decomposition, we propose Realigned as a possible optimization for both decomposition methods that takes advantage of cuFFT's advanced data layout. Most notably, Realigned reduces the additional memory requirements of pencil decomposition and computes the 1D-FFTs in y-direction more efficiently. Since both decomposition methods require global redistribution of the intermediate results, we further compare the performance of different Peer2Peer and All2All communication techniques. In particular, we introduce Peer2Peer-Streams, which avoids the need for additional synchronization and allows the communication and packing phases to overlap completely. Our performance benchmarks show that this approach generally performs best for large input sizes on test systems with a limited number of GPUs when using MPI without CUDA-awareness. Furthermore, we utilize custom MPI datatypes and adopt MPI_Type for GPUs, which dramatically reduces the additional memory requirements and avoids the need for a packing and unpacking phase altogether. By identifying a redistributed partition as a batch of slices, where each slice consists of the maximum number of contiguous, complex-valued words, we found that MPI_Type is often worth considering when neither the sent nor the received partitions are composed of one-dimensional slices.
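To make the slab decomposition named in the abstract concrete, the following is a minimal single-process sketch, not the thesis implementation. It assumes the textbook scheme: each rank transforms its x-slab in y and z, the intermediate results are redistributed globally, and the remaining 1D transforms run along x. The names `dft`, `fft_yz`, and `slab_fft3d` are hypothetical; ranks are simulated with Python lists, a naive DFT stands in for cuFFT, and no MPI is involved.

```python
import cmath


def dft(v):
    """Naive O(n^2) 1D DFT, standing in for a library FFT call."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def fft_yz(slab):
    """Transform every x-plane of a slab (indexed slab[x][y][z]) in z and y."""
    out = []
    for plane in slab:
        rows = [dft(row) for row in plane]             # 1D transforms along z
        cols = [dft(list(col)) for col in zip(*rows)]  # along y, cols[z][y]
        out.append([list(r) for r in zip(*cols)])      # back to [y][z]
    return out


def slab_fft3d(data, n_ranks):
    """Simulate slab decomposition; data is indexed data[x][y][z]."""
    nx, ny = len(data), len(data[0])
    if nx % n_ranks or ny % n_ranks:
        raise ValueError("this sketch needs nx and ny divisible by n_ranks")
    bx, by = nx // n_ranks, ny // n_ranks

    # Step 1: each simulated rank owns bx consecutive x-planes.
    slabs = [fft_yz(data[r * bx:(r + 1) * bx]) for r in range(n_ranks)]

    # Step 2: global redistribution. Rank q collects the y-range
    # [q*by, (q+1)*by) from every rank, so it holds all x for those y.
    result = [[None] * ny for _ in range(nx)]
    for q in range(n_ranks):
        for y in range(q * by, (q + 1) * by):
            received = [slabs[r][x][y]
                        for r in range(n_ranks) for x in range(bx)]
            # Step 3: 1D transforms along x on the redistributed data.
            cols = [dft(list(col)) for col in zip(*received)]  # cols[z][x]
            for x in range(nx):
                result[x][y] = [cols[z][x] for z in range(len(cols))]
    return result
```

For a small cube such as 4×4×4 with `n_ranks=2`, the result should agree with a direct evaluation of the 3D DFT sum up to rounding. Pencil decomposition would split two axes instead of one and therefore needs two redistribution steps.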

The Fast Fourier Transform (FFT) is a numerical method for computing the spectral information of data. Many application areas require the computation of three-dimensional FFTs, which makes the use of graphics processing units (GPUs) on distributed systems particularly interesting. The typical approach to distributed computation is to partition the global input, which is categorized into slab decomposition and pencil decomposition. It is known that slab decomposition is considerably slower than pencil decomposition for many processes and additionally offers only limited scalability. For this reason, a comparison of the two methods is often neglected, even when only a few processes are involved. Our results show, in agreement with simple theoretical models, that slab decomposition generally performs better than pencil decomposition when large inputs and few GPUs are considered. This rule does not hold when suitable NVLink connections exist and pencil decomposition can exploit them appropriately. Besides the original implementations of slab decomposition and pencil decomposition, we also present "Realigned" as a possible optimization of both methods. In particular, Realigned reduces the additional memory requirements of pencil decomposition and computes the 1D-FFTs in y-direction more efficiently. Since both slab decomposition and pencil decomposition require global communication between the processes, we also compare various optimizations of the typical communication methods "Peer2Peer" and "All2All". To this end, we present in particular "Peer2Peer-Streams", which avoids additional synchronization and allows further overlap of computation and communication. Our results show that Peer2Peer-Streams yields a considerable improvement for large inputs when a limited number of GPUs and MPI without CUDA-awareness are used. Furthermore, we use custom MPI datatypes and adopt "MPI_Type" for GPUs. This allows a drastic reduction of the additional memory requirements, since non-contiguous partitions can be sent and received directly. By classifying these partitions as 1D, 2D, or 3D partitions, our results show that MPI_Type is an option worth mentioning when neither the sent nor the received partitions are 1D partitions.
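The idea of viewing a partition as a batch of maximal contiguous slices can be illustrated with a small helper. This is a sketch under assumptions that the abstract does not spell out: the local array is stored in row-major order with the last axis varying fastest, and the partition is an axis-aligned sub-block. The function name `contiguous_slices` is hypothetical. In MPI, such a layout is typically described with a derived datatype such as `MPI_Type_vector` or `MPI_Type_create_subarray`; whether the thesis uses exactly these calls is not stated here.

```python
def contiguous_slices(dims, extents):
    """Return (slice_len, n_slices) for a sub-block of a row-major array.

    dims    -- shape of the full local array, slowest axis first
    extents -- shape of the sub-block to transfer, same axis order
    """
    if len(dims) != len(extents):
        raise ValueError("dims and extents must have the same rank")
    if any(e < 1 or e > d for d, e in zip(dims, extents)):
        raise ValueError("each extent must lie between 1 and the matching dim")

    slice_len = 1
    axis = len(dims) - 1
    # Walk outwards from the fastest-varying axis. Every axis visited
    # extends the contiguous run; the run cannot grow past the first axis
    # that the sub-block covers only partially.
    while axis >= 0:
        slice_len *= extents[axis]
        partially_covered = extents[axis] < dims[axis]
        axis -= 1
        if partially_covered:
            break

    # All remaining (slower) axes enumerate separate slices.
    n_slices = 1
    for e in extents[:axis + 1]:
        n_slices *= e
    return slice_len, n_slices


if __name__ == "__main__":
    # Sub-block covering half of y in a 4x4x4 array: 4 slices of 8 words.
    print(contiguous_slices((4, 4, 4), (4, 2, 4)))  # (8, 4)
    # Sub-block covering half of z: 16 slices of 2 words each.
    print(contiguous_slices((4, 4, 4), (4, 4, 2)))  # (2, 16)
```

The fewer and longer the slices, the closer the transfer is to a single contiguous copy. Many short slices are the case in which a derived datatype and an explicit packing phase are most likely to differ in cost.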

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-119356
http://elib.uni-stuttgart.de/handle/11682/11935
http://dx.doi.org/10.18419/opus-11918

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
