05 Fakultät Informatik, Elektrotechnik und Informationstechnik
Permanent URI for this collection: https://elib.uni-stuttgart.de/handle/11682/6
Search Results
Item (Open Access): Software-basierter Selbsttest eingebetteter Speicher [Software-Based Self-Test of Embedded Memories] (2015). Ebinger, Felix
Processors are often tested using software-based self-tests (SBST) because this test method offers several advantages. First, the test is non-destructive and is performed in the processor's functional operating mode. No change to the hardware design is required, and overtesting is not possible. The method is flexible and can be used both in manufacturing test and in the field. Memories, in contrast, are usually tested using built-in self-test (BIST), because the overhead of the additional test hardware is small and these tests can be implemented for memories without performance penalties. This thesis investigates the software-based implementation of memory tests so that the advantages of software-based self-tests can also be exploited for memory testing. This is challenging because software cannot generate every sequence of operations with freely selectable timing. For dynamic faults in particular, this can reduce test coverage. To address this, a framework is presented that automatically converts March test descriptions into test programs for the miniMIPS processor. The focus is on the runtime of the test program and the test coverage achieved. Test coverage is determined experimentally by simulation and fault injection. The results show that fault coverage for the static and dynamic fault models investigated is not impaired by the presented software implementation.

Item (Open Access): Software-basierter Selbsttest von Peripherie-Komponenten [Software-Based Self-Test of Peripheral Components] (2015). Bäßler, Jochen
Software-based self-test (SBST) methods are mostly used for testing microprocessors but can also be applied to peripheral components. The advantage of SBST over hardware-based approaches lies in dispensing with special test hardware and high-speed test equipment, and in the fact that tests run in the natural operating environment (in-system) and at normal operating frequency (at-speed). In many systems, peripheral components occupy a considerable share of the chip area, are sometimes used for safety-critical tasks, and therefore must be tested thoroughly. To apply structural SBST methods successfully to this type of component, measures must be taken to increase their low observability and controllability; otherwise the fault coverage achieved is too low. This thesis investigates two different approaches to improving the structural fault coverage of SBST methods on communication peripheral components. The first approach aims at improved controllability of the component in use; to this end, a loopback-based mechanism is implemented. To additionally achieve better observability, the second approach makes the state of selected internal signals visible to the system. An exemplary application of the presented method to the I2C component of a RISC processor demonstrates the effectiveness of the measures for improving structural fault coverage.

Item (Open Access): Adaptierung an Zeitverhalten-Variationen in rekonfigurierbaren Hardwarestrukturen [Adaptation to Timing Variations in Reconfigurable Hardware Structures] (2015). Brandhofer, Sebastian
The timing behavior of components in reconfigurable hardware structures can vary due to aging effects and random defects. If a system cannot be adapted to these deviations from nominal timing, delay faults arise during operation that can lead to incorrect results or system failures. Particularly in safety-critical applications of reconfigurable hardware structures, this can endanger people. This thesis presents an algorithm for adapting to timing variations in reconfigurable hardware structures that takes component aging and random defects into account and avoids delay faults by using the reconfigurable hardware structures in a way matched to their timing behavior. The algorithm is evaluated with various delay distributions with respect to adaptability, memory requirements, and runtime.

Item (Open Access): Algorithm-based fault tolerance for matrix operations on graphics processing units : analysis and extension to autonomous operation (2015). Braun, Claus; Wunderlich, Hans-Joachim (Prof. Dr. rer. nat. habil.)
Scientific computing and computer-based simulation technology have evolved into indispensable tools that enable solutions for major challenges in science and engineering. Applications in these domains are often dominated by compute-intensive mathematical tasks like linear algebra matrix operations. The provision of correct and trustworthy computational results is an essential prerequisite since these applications can have direct impact on scientific, economic or political processes and decisions. Graphics processing units (GPUs) are highly parallel many-core processor architectures that deliver tremendous floating-point compute performance at very low cost. This makes them particularly interesting for the substantial acceleration of complex applications in science and engineering. However, like most nano-scaled CMOS devices, GPUs are facing a growing number of threats that jeopardize their reliability. This makes the integration of fault tolerance measures mandatory. Algorithm-Based Fault Tolerance (ABFT) allows the protection of essential mathematical operations, which are intensively used in scientific computing.
It provides high error coverage combined with low computational overhead. However, the integration of ABFT into linear algebra matrix operations on GPUs is a non-trivial task, which requires a careful balance between fault tolerance, architectural constraints and performance. Moreover, ABFT for operations carried out in floating-point arithmetic has to cope with reduced error detection and localization efficacy due to inevitable rounding errors. This work provides an in-depth analysis of Algorithm-Based Fault Tolerance for matrix operations on graphics processing units with respect to different types and combinations of weighted checksum codes, partitioned encoding schemes and architecture-related execution parameters. Moreover, a novel approach called A-ABFT is introduced for the efficient online determination of rounding error bounds, which improves the error detection and localization capabilities of ABFT significantly. Extensive experimental evaluations of the error detection capabilities, the quality of the determined rounding error bounds, as well as the achievable performance confirm that the proposed A-ABFT method performs better than previous approaches. In addition, two case studies (QR decomposition and Linear Programming) emphasize the efficacy of A-ABFT and its applicability to practical problems.

Item (Open Access): Self-diagnosis in Network-on-Chips (2015). Dalirsani, Atefe; Wunderlich, Hans-Joachim (Dr. rer. nat. habil.)
Network-on-Chips (NoCs) constitute a message-passing infrastructure and can fulfil the communication requirements of today’s System-on-Chips (SoCs), which integrate numerous semiconductor Intellectual Property (IP) blocks into a single die. As the NoC is responsible for data transport among IPs, its reliability is crucial to the reliability of the entire system. In deep nanoscale technologies, transient and permanent failures of transistors and wires are caused by a variety of effects.
Such failures may occur in the NoC as well, disrupting its normal operation. An NoC comprises a large number of switches that form a structure spanning the chip. Inherent redundancy of the NoC provides multiple paths for communication among IPs. Graceful degradation is the property of tolerating a component’s failure in a system at the cost of limited functionality or performance. In NoCs, when a switch in the path is faulty, alternative paths can be used to connect IPs, keeping the SoC functional. To this end, a fault detection mechanism is needed to identify the faulty switch, and fault-tolerant routing should bypass it. As each NoC switch consists of a number of ports and multiple routing paths, graceful degradation can be applied at an even finer granularity. The fault may destroy some routing paths inside the switch, leaving the rest non-faulty. Thus, instead of disabling the faulty switch completely, its fault-free parts can be used for message passing. In this way, the chance of disconnecting the IP cores is reduced and the probability of having disjoint networks decreases. This study pursues efficient self-test and diagnosis approaches for both manufacturing and in-field testing, aiming at graceful degradation of defective NoCs. The approaches here identify the location of defective components in the network rather than providing only a go/no-go test response. Conventionally, structural test approaches like scan design have been employed for testing NoC products. Structural testing targets faults of a predefined structural fault model such as stuck-at faults. In contrast, functional testing targets certain functionalities of a system, for example the instructions of a microprocessor. In NoCs, functional tests target NoC characteristics such as routing functions and undistorted data transport. Functional tests benefit most from the regular NoC structure. They reduce test costs and prevent overtesting.
However, unlike structural tests, functional tests do not explicitly target structural faults, and the quality of the test approach cannot be measured. We bridge this gap by proposing a self-test approach that combines the advantages of structural and functional test methodologies and hence is suitable for both manufacturing and in-field testing. Here, the software running on the IP cores attached to the NoC is responsible for testing. As in functional tests, the test patterns deal only with the functional inputs and outputs of switches. For pattern generation, a model is introduced that brings the information about structural faults to the level of functional outputs of the switch. Thanks to this unique feature of the model, a high structural fault coverage is achieved, as the results reveal. To make NoCs more robust against various defect mechanisms during their lifetime, concurrent error detection is necessary. To this end, this dissertation contributes an area-efficient synthesis technique for NoC switches that detects any error resulting from a single combinational or transition fault in the switch and its links during normal operation. This technique incorporates data encoding and standard concurrent error detection using multiple parity trees. Results reveal that the proposed approach imposes less area overhead than traditional techniques for concurrent error detection. To enable fine-grained graceful degradation, intact functions of defective switches must be identified. Thanks to fault-tolerant techniques, fault-free parts of switches can still be employed in the NoC. However, reasoning about the fault-free functions with respect to the exact cause of a malfunction is missing in the literature. This dissertation contributes a novel fine-grained switch diagnosis technique based on structural logic diagnosis.
After determining the location and the nature of the defect in the faulty switch, all routing paths are checked and the soundness of the intact switch functions is proved. Experimental results show improvements in both performance and reliability of degraded NoCs by incorporating the fine-grained diagnosis of NoC switches.

Item (Open Access): Fault tolerance infrastructure and its reuse for offline testing : synergies of a unified architecture to cope with soft errors and hard faults (2015). Imhof, Michael E.; Wunderlich, Hans-Joachim (Prof. Dr. rer. nat. habil.)
The evolution of digital circuits from a few application areas to omnipresence in everyday life has been enabled by the ability to dramatically increase integration density through scaling. However, the continuation of scaling gets more difficult with every generation and poses severe challenges to reliability. Throughout the manufacturing process, the appearance of defects cannot be avoided, and the problem worsens with scaling. Hence, the reliability at time zero, denoted by the manufacturing yield, is not ideal, and some defective chips will produce wrong output signals. For this reason, the presence of such hard faults needs to be shown prior to delivery during test, where automatic test equipment (ATE) is used to apply a test set that covers a predefined set of modeled defects. As some potential defect locations are hard to test using the chip's operational interface, additional dedicated test infrastructure is included on chip to provide test access. Throughout the operational lifetime, reliability is threatened by soft errors that originate from interactions of radiation with semiconductor devices and potentially manifest as sequential state corruptions. With soft error rates rising further as a result of scaling, high reliability is maintained by including fault tolerance infrastructure able to detect, localize and ideally correct soft errors.
Thus, the orthogonal combination of two independent infrastructures increases the area overhead, although test support and fault tolerance are never required concurrently. This work proposes a unified architecture that employs a common infrastructure to provide fault tolerance during operation and test access during test. Similarities between both fields are successfully exploited and traced back to the combination of an efficient sequential state checksum with an effective state update by bit-flipping. Experiments on public and industrial circuits evaluate the unified architecture in both fields and show improved area efficiency as well as successful correction during fault-tolerant operation. During test, the results substantiate advantages with respect to test time, test volume, peak and average test power as well as test energy.