Please use this identifier to cite or link to this item: http://dx.doi.org/10.18419/opus-11037
Authors: Pfander, David
Title: Auto-tuning and performance portability on heterogeneous hardware
Issue Date: 2019
metadata.ubs.publikation.typ: Dissertation
metadata.ubs.publikation.seiten: 253
URI: http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-110545
http://elib.uni-stuttgart.de/handle/11682/11054
http://dx.doi.org/10.18419/opus-11037
Abstract: In high-performance computing, excellent node-level performance is required for the efficient use of supercomputers. However, manual optimization is a tedious process that commonly needs to be repeated for every hardware platform targeted. Auto-tuning has been developed as an approach to partially automate the optimization process, generally by tuning parameterized compute kernels. Through auto-tuning, both high performance on a single hardware platform and performance portability can be achieved. In this work, we present the auto-tuning framework AutoTuneTMP that leverages two features: just-in-time (JIT) compilation and C ++ template metaprogramming. JIT compilation enables a tight integration of auto-tuning into an application and reduces the time required for auto-tuning. Template metaprogramming enables a design focused on ease-of-integration, maintainability and extensibility. It further forms the basis for optimization templates that support the development of auto-tunable compute kernels. Additionally, our framework provides search strategies for performing parameter tuning. To demonstrate the applicability and usefulness of our framework, we use Auto-TuneTMP to auto-tune three algorithms that require excellent performance. The first algorithm is matrix multiplication. Whereas many implementations are manually optimized or even use assembly, we rely on template metaprogramming for a high-level approach. Across four hardware platforms, we achieved up to 91% of the peak performance and are, therefore, competitive with vendor libraries such as Intel’s MKL. Sparse grid regression is well-suited for machine learning in moderate-dimensional big data scenarios. However, large datasets necessitate high-performance algorithms. We introduce two auto-tuned high-performance algorithms for this application: the unified streaming algorithm and the subspace algorithm. These algorithms serve as the second and third example for auto-tuning with AutoTuneTMP. The unified streaming algorithm is written in OpenCL and targets a wide range of architectures including GPUs. The subspace algorithm was developed for processor platforms only, but has a lower time complexity. Due to these auto-tuned algorithms and a new approach for the spatial adaptivity of sparse grids, speedups of up to 13x were measured compared to the state-of-the-art surplus-refined masked streaming approach. Furthermore, we demonstrate that both new algorithms are performance-portable. Apart from auto-tuning, we investigate the performance portability of a new distributed variant of the sparse grid clustering method. As a density-based clustering method, it relies on a sparse grid density estimation to compute density functions for large datasets of moderate dimensionality in linear complexity in the size of the dataset. The algorithm consists of four major components: two density compute kernels and two compute kernels for creating and pruning a k-nearest-neighbor graph. All components were written using OpenCL and MPI. On the node-level, we reached on average 79% of the achievable peak performance of one processor and four GPUs. In distributed experiments on two supercomputers, Hazel Hen and Piz Daint, we measured up to 352 TFLOPS using 128 nodes. As a similar fraction of peak performance was achieved on both supercomputers and because of the high node-level efficiency, we can demonstrate the performance portability of the algorithm. Our work shows that performance portability is a realistic goal for scientific applications on modern hardware. By using JIT compilation and template metaprogramming, the tightly-integrated auto-tuning approach presented reduces the effort required for optimization without compromising on performance.
Appears in Collections:05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Files in This Item:
File Description SizeFormat 
dissertation_print.pdf5,34 MBAdobe PDFView/Open


Items in OPUS are protected by copyright, with all rights reserved, unless otherwise indicated.