Efficient fault tolerance for selected scientific computing algorithms on heterogeneous and approximate computer architectures

Schöll, Alexander

Efficient fault tolerance for selected scientific computing algorithms on heterogeneous and approximate computer architectures

Files

diss-main.pdf (5.61 MB)

Date

2018

Authors

Schöll, Alexander

Abstract

Scientific computing and simulation technology play an essential role to solve central challenges in science and engineering. The high computational power of heterogeneous computer architectures allows to accelerate applications in these domains, which are often dominated by compute-intensive mathematical tasks. Scientific, economic and political decision processes increasingly rely on such applications and therefore induce a strong demand to compute correct and trustworthy results. However, the continued semiconductor technology scaling increasingly imposes serious threats to the reliability and efficiency of upcoming devices. Different reliability threats can cause crashes or erroneous results without indication. Software-based fault tolerance techniques can protect algorithmic tasks by adding appropriate operations to detect and correct errors at runtime. Major challenges are induced by the runtime overhead of such operations and by rounding errors in floating-point arithmetic that can cause false positives. The end of Dennard scaling induces central challenges to further increase the compute efficiency between semiconductor technology generations. Approximate computing exploits the inherent error resilience of different applications to achieve efficiency gains with respect to, for instance, power, energy, and execution times. However, scientific applications often induce strict accuracy requirements which require careful utilization of approximation techniques. This thesis provides fault tolerance and approximate computing methods that enable the reliable and efficient execution of linear algebra operations and Conjugate Gradient solvers using heterogeneous and approximate computer architectures. The presented fault tolerance techniques detect and correct errors at runtime with low runtime overhead and high error coverage. At the same time, these fault tolerance techniques are exploited to enable the execution of the Conjugate Gradient solvers on approximate hardware by monitoring the underlying error resilience while adjusting the approximation error accordingly. Besides, parameter evaluation and estimation methods are presented that determine the computational efficiency of application executions on approximate hardware. An extensive experimental evaluation shows the efficiency and efficacy of the presented methods with respect to the runtime overhead to detect and correct errors, the error coverage as well as the achieved energy reduction in executing the Conjugate Gradient solvers on approximate hardware.

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-99681
http://elib.uni-stuttgart.de/handle/11682/9968
http://dx.doi.org/10.18419/opus-9951

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik

Full item page

Efficient fault tolerance for selected scientific computing algorithms on heterogeneous and approximate computer architectures

Files

Date

Authors

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Description

Keywords

Citation

URI

Collections

Endorsement

Review

Supplemented By

Referenced By