Browsing by Author "Braun, Claus"

Now showing 1 - 1 of 1

Open Access
Algorithm-based fault tolerance for matrix operations on graphics processing units : analysis and extension to autonomous operation
(2015) Braun, Claus; Wunderlich, Hans-Joachim (Prof. Dr. rer. nat. habil.)
Scientific computing and computer-based simulation technology evolved to indispensable tools that enable solutions for major challenges in science and engineering. Applications in these domains are often dominated by compute-intensive mathematical tasks like linear algebra matrix operations. The provision of correct and trustworthy computational results is an essential prerequisite since these applications can have direct impact on scientific, economic or political processes and decisions. Graphics processing units (GPUs) are highly parallel many-core processor architectures that deliver tremendous floating-point compute performance at very low cost. This makes them particularly interesting for the substantial acceleration of complex applications in science and engineering. However, like most nano-scaled CMOS devices, GPUs are facing a growing number of threats that jeopardize their reliability. This makes the integration of fault tolerance measures mandatory. Algorithm-Based Fault Tolerance (ABFT) allows the protection of essential mathematical operations, which are intensively used in scientific computing. It provides a high error coverage combined with a low computational overhead. However, the integration of ABFT into linear algebra matrix operations on GPUs is a non-trivial task, which requires a thorough balance between fault tolerance, architectural constraints and performance. Moreover, ABFT for operations carried out in floating-point arithmetic has to cope with a reduced error detection and localization efficacy due to inevitable rounding errors. This work provides an in-depth analysis of Algorithm-Based Fault Tolerance for matrix operations on graphics processing units with respect to different types and combinations of weighted checksum codes, partitioned encoding schemes and architecture-related execution parameters. Moreover, a novel approach called A-ABFT is introduced for the efficient online determination of rounding error bounds, which improves the error detection and localization capabilities of ABFT significantly. Extensive experimental evaluations of the error detection capabilities, the quality of the determined rounding error bounds, as well as the achievable performance confirm that the proposed A-ABFT method performs better than previous approaches. In addition, two case studies (QR decomposition and Linear Programming) emphasize the efficacy of A-ABFT and its applicability to practical problems.