Multi-Level Simulation of Nano-Electronic Digital Circuits on GPUs

A dissertation accepted by the Faculty of Computer Science, Electrical Engineering and Information Technology of the Universität Stuttgart in partial fulfillment of the requirements for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Presented by Eric Schneider from Hildburghausen, Germany

Main referee: Prof. Dr. rer. nat. habil. Hans-Joachim Wunderlich
Co-referee: Prof. Dr. phil. nat. Rolf Drechsler
Date of the oral examination: June 21, 2019

Institut für Technische Informatik der Universität Stuttgart, 2019

Contents

Acknowledgments
Abstract
Zusammenfassung
List of Abbreviations

1 Introduction

I Background

2 Fundamental Background
  2.1 CMOS Digital Circuits
    2.1.1 CMOS Switching and Time Behavior
    2.1.2 Circuit Performance and Reliability Issues
  2.2 Circuit Modeling
    2.2.1 Electrical Level
    2.2.2 Switch Level
    2.2.3 Logic Level
    2.2.4 Register-Transfer Level
    2.2.5 System-/Behavioral Level
  2.3 Delay Faults
  2.4 Circuit and Fault Simulation
    2.4.1 Event-driven Simulation
    2.4.2 Fault Simulation
  2.5 Multi-Level Simulation
  2.6 Summary

3 Graphics Processing Units (GPUs)
  3.1 GPU Architecture
    3.1.1 Streaming Multiprocessors
    3.1.2 Thread Scheduling
    3.1.3 Memory Hierarchy
  3.2 Programming Paradigm
    3.2.1 Thread Organization
    3.2.2 Memory Access
  3.3 Summary

4 Parallel Circuit and Fault Simulation on GPUs
  4.1 Overview
  4.2 Parallel Circuit Simulation
  4.3 Parallel Gate Level Simulation
    4.3.1 Logic Simulation
    4.3.2 Timing-aware Simulation
  4.4 Parallel Fault Simulation
  4.5 Summary

II High-Throughput Time and Fault Simulation

5 Parallel Logic Level Time Simulation on GPUs
  5.1 Overview
  5.2 Circuit Modeling
    5.2.1 Gates
    5.2.2 Delay
    5.2.3 Variations
  5.3 Modeling Signals in Time
    5.3.1 Events and Time
    5.3.2 Delay Processing
    5.3.3 Waveforms
  5.4 Algorithm
    5.4.1 Serial Algorithm
    5.4.2 Parallelization
  5.5 Implementation
    5.5.1 Waveform Allocation
    5.5.2 Kernel Organization
    5.5.3 Simulation Flow
    5.5.4 Test Stimuli and Response Conversion
  5.6 Summary

6 Parallel Switch Level Time Simulation on GPUs
  6.1 Overview
  6.2 Switch Level Circuit Model
    6.2.1 Channel-Connected Components
    6.2.2 RRC-Cell Model
    6.2.3 Switch Level Waveform Representation
    6.2.4 Modeling Accuracy
  6.3 Switch Level Simulation Algorithm
    6.3.1 RRC-Cell Simulation
  6.4 Implementation
  6.5 Summary

7 Waveform-Accurate Fault Simulation on GPUs
  7.1 Overview
  7.2 Fault Modeling
    7.2.1 Small Gate Delay Fault Model
    7.2.2 Switch-Level Fault Models
    7.2.3 Fault Injection
  7.3 Syndrome Computation and Fault Detection
    7.3.1 Signal Interpretation
    7.3.2 Discrete Syndrome Computation
    7.3.3 Fault Detection
  7.4 Fault-Parallel Simulation
    7.4.1 Fault Collapsing
    7.4.2 Fault Groups
    7.4.3 Grouping Algorithm
    7.4.4 Parallel Fault Evaluation
  7.5 Summary

8 Multi-Level Simulation on GPUs
  8.1 Overview
  8.2 Multi-Level Circuit Modeling
    8.2.1 Region of Interest
    8.2.2 Changing the Abstraction
    8.2.3 Mixed-abstraction Temporal Behavior
    8.2.4 Simulation Flow
  8.3 Transparent Multi-Level Time Simulation
    8.3.1 Waveform Transformation
  8.4 Multi-Level Data Structures and Implementation
    8.4.1 Multi-Level Node Description
    8.4.2 Multi-Level Netlist Graph
    8.4.3 Multi-Level Waveforms
    8.4.4 Parallelization
  8.5 Summary

III Applications and Evaluation

9 Experimental Setup
  9.1 Benchmarks
  9.2 Host System
  9.3 Simulation Environment
  9.4 Data Representation
  9.5 Performance Metrics

10 High-Throughput Time Simulation Results
  10.1 Logic Level Timing Simulation Performance
    10.1.1 Runtime Performance
    10.1.2 Simulation Throughput
  10.2 Switch Level Timing Simulation Performance
    10.2.1 Runtime Performance
    10.2.2 Throughput Performance
  10.3 Switch Level Behavior
  10.4 Summary

11 Application: Fault Simulation
  11.1 Fault Grouping Results
  11.2 Small Delay Fault Simulation
    11.2.1 Variation Impact
  11.3 Switch Level Fault Simulation
    11.3.1 Resistive Open-Transistor Faults
    11.3.2 Capacitive Faults
  11.4 Summary

12 Application: Power Estimation
  12.1 Circuit Switching Activity Evaluation
    12.1.1 Weighted Switching Activity for Power Estimation
    12.1.2 Distribution of Switching Activity
    12.1.3 Correlation of Clock Skew Estimates
  12.2 Summary

13 Application: Multi-Level Simulation
  13.1 Multi-Level Simulation Runtime
  13.2 Multi-Level Trade-Off
    13.2.1 Speedup
    13.2.2 Runtime Savings
    13.2.3 Simulation Efficiency
  13.3 Summary

Conclusion
Bibliography
A Benchmark Circuits
B Result Tables
List of Symbols
Index
Publications of the Author

Acknowledgments

I would like to thank Prof. Hans-Joachim Wunderlich for giving me the opportunity to work at the Institut für Technische Informatik (ITI) in Stuttgart and to experience life in academia under his professional supervision. I am deeply grateful for his invaluable patience and faith in me, as well as the countless and very often insightful discussions we had. Furthermore, I want to thank Prof. Rolf Drechsler from the University of Bremen for taking over the role of my second adviser and also for the nice chats. In addition, I am very grateful to Stefan Holst, who was my mentor during my undergraduate studies and who helped me get started at the ITI as a colleague and friend.
I very much enjoyed the work with the other colleagues and friends as well, which altogether made life at the institute interesting and enjoyable in many ways. Therefore, I am happy to thank Michael A. Kochte and Claus Braun as well as Laura Rodrígez Gómez, Dominik Ull, Chang Liu, Rafał Baranowski, Alexander Schöll, Michael Imhof, Abdullah Mumtaz, Marcus Wagner, Sebastian Brandhofer, Ahmed Atteya, Natalia Lylina, Zahra Paria Najafi Haghi and Chih-Hao Wang. I also want to thank Lars Bauer from the Karlsruhe Institute of Technology, Matthias Sauer, formerly of the University of Freiburg, Prof. Sybille Hellebrand and Matthias Kampmann from the University of Paderborn, as well as Prof. Xiaoqing Wen and Kohei Miyase from the Kyushu Institute of Technology and Yuta Yamato, formerly of the Nara Institute of Science and Technology, for all the insightful discussions and collaborations we could establish together.

Thank you, Mirjam Breitling, Lothar Hellmeier and Helmut Häfner. Without your administrative and technical assistance this work would have been impossible for me.

Last but not least, I am also very grateful to my parents for their constant support and caring.

Stuttgart, June 2019
Eric Schneider

Abstract

Simulation of circuits and faults is an essential part of design and test validation tasks for contemporary nano-electronic digital integrated CMOS circuits. Shrinking technology processes with smaller feature sizes and strict performance and reliability requirements demand not only detailed validation of the functional properties of a design, but also accurate validation of non-functional aspects, including the timing behavior. However, due to the rising complexity of circuit behavior and the steady growth of designs with respect to transistor count, timing-accurate simulation of current designs requires immense computational effort, which can only be handled by proper abstraction and a high degree of parallelization.
This work presents a simulation model for scalable and accurate timing simulation of digital circuits on data-parallel graphics processing unit (GPU) accelerators. By providing compact modeling and data structures, as well as by exploiting multiple dimensions of parallelism, the simulation model enables not only fast and timing-accurate simulation at logic level, but also massively parallel simulation with switch level accuracy. The model facilitates extensions for fast and efficient fault simulation of small delay faults at logic level, as well as first-order parametric and parasitic faults at switch level. The parallelization on GPUs enables detailed and scalable simulation that is applicable even to multi-million gate designs. This way, comprehensive analyses of realistic timing-related faults in the presence of process and parameter variations are enabled for the first time. Additional simulation efficiency is achieved by merging the presented methods into a unified simulation model that combines the unique advantages of the different levels of abstraction in a mixed-abstraction multi-level simulation flow to reach even higher speedups.

Experimental results show that the implemented parallel approach achieves unprecedented simulation throughput as well as high speedup compared to conventional timing simulators. The underlying model scales to multi-million gate designs and gives detailed insights into the timing behavior of digital CMOS circuits, thereby enabling large-scale applications to aid even highly complex design and test validation tasks.

Zusammenfassung

Simulation is an essential part of the design and test validation of today's nano-electronic digital circuits. The manufacturing processes of technologies with small structure sizes in the range of only a few nanometers are subject to strict requirements regarding the reliability and performance of the produced circuits.
Therefore, a validation of the functional aspects is necessary already during design, and beyond that also an accurate validation of the non-functional aspects, including simulation of the timing behavior. Due to the rising complexity of the switching behavior and the steadily growing number of transistors in the circuits, accurate timing simulation requires enormous computational effort, which can only be handled with suitable model abstraction and a high degree of parallelization.

In this work, a simulation model for accurate timing simulation of digital circuits on massively data-parallel graphics accelerators (so-called "GPUs") is presented. The use of compact modeling and suitable data structures, as well as the simultaneous exploitation of multiple dimensions of parallelism, enables not only fast and timing-accurate simulation at logic level, but for the first time also scalable, massively parallel simulation with increased accuracy at switch level.

Extensions of the model provide fast and efficient fault simulation of smallest delay faults at logic level, as well as parametric and parasitic faults in first-order electrical parameters at switch level. The parallelization on the GPU architectures furthermore enables, for the first time, detailed simulations and analyses of realistic delay faults under process and parameter variations for circuits with millions of gates.

By unifying the implemented methods in a simulation concept that spans multiple levels with mixed abstractions, the advantages of the different abstraction levels are combined, whereby the efficiency of the timing simulation is increased and additional speedup can be achieved.
Results of the conducted experiments show that the implemented parallel simulation approach achieves high computational throughput with a high degree of speedup over conventional timing simulation. The underlying simulation model scales to circuits with millions of gates and provides detailed insight into the timing behavior of digital CMOS circuits, enabling comprehensive applications that support even highly complex tasks in the design and test of nano-electronic circuits.

List of Figures

1.1 Contributions of this Work
2.1 Propagation Delay and Rise/Fall Times
2.2 First-order Transient Step Response
2.3 Circuit Abstraction Levels
2.4 Switch Level Representation of a CMOS Inverter
2.5 Overview of Delay Models for Gates
2.6 Small (Gate-)Delay Fault Example
3.1 Block Diagram of the full GP100 Architecture
3.2 Block Diagram of a GP100 Streaming Multiprocessor with CUDA Core
3.3 GP100 Memory Hierarchy and Thread Access
3.4 Serial Execution Flow of a heterogeneous Program
3.5 NVIDIA CUDA Thread Organization and Memory Accesses
5.1 Logic Level Time Simulation Flow
5.2 Graph Visualization of Circuit c17
5.3 Delay Processing
5.4 Logic Level Waveform Representation
5.5 Logic Level Waveform Processing Example
5.6 Structural and Data-Parallelism
5.7 Thread Organization for Structural Parallelism
5.8 Two-dimensional Thread Organization
5.9 Multi-dimensional Thread Organization with Instance Parallelism
5.10 Kernel Data Flow
5.11 Waveform Memory Organization
5.12 Memory Reduction
5.13 Overall Simulation Flow
5.14 Parallel Pattern-to-Waveform Conversion
5.15 Parallel Waveform-to-Pattern Conversion
6.1 Pattern-dependent Delays due to Multiple-Input Switching
6.2 Extraction of Channel-Connected Components
6.3 Resistor-Resistor-Capacitor (RRC-) Cell Model
6.4 Curve-Fitting of Transient Responses
6.5 Switch Level Event Representation and Visualization
6.6 Switch Level Waveforms of a two-input NOR-Cell
6.7 Time-continuous Simulation State of the NOR-Cell Example
7.1 Fault Simulation Flow
7.2 Mapping of Small Delay Faults at Outputs to Input-descriptions
7.3 Greedy Fault Grouping Heuristic
7.4 Fault Grouping Example
7.5 Parallel Syndrome Calculation
8.1 State Transitions in ternary Logic Waveforms
8.2 Ternary Logic Waveform Example
8.3 Multi-Level Simulation Flow
8.4 Multi-Level Waveform Transformation: Logic-to-Switch Level
8.5 Multi-Level Waveform Transformation: Switch-to-Logic Level
8.6 Generic Multi-Level Node Description
8.7 Node Expansion and Collapsing
8.8 Generic Waveform Structure
8.9 Transparent Parallelization of the Multi-Level Simulation
10.1 Logic Level Simulation Performance Trend
10.2 Logic Level Simulation GPU-Device Exploration
10.3 Switch Level Simulation Performance Trend
10.4 Switch Level Simulation GPU-Device Exploration
10.5 Multiple Input Switching in a NAND3_X1 Cell
10.6 Pattern-dependent Delay Test Circuit
10.7 Waveforms of Electrical, Logic and Switch Level Simulation in Comparison
11.1 Resistive Open-Transistor Fault Simulation Results
11.2 Resistive Open-Transistor Fault Behavior
11.3 Capacitive Fault Simulation Results
11.4 Capacitive Fault Behavior
12.1 Switching Activity Characterization at Logic Level
12.2 Weighted Switching Activity Comparison
12.3 Distribution of Switching Activity in Time-Slices
12.4 Clock-Skew in the Clock Tree
12.5 Correlation of Skew Estimates for Circuit b14
12.6 Correlation of Skew Estimates for Circuit b18
13.1 Multi-Level Simulation Performance
13.2 Multi-Level Simulation Speedup
13.3 Runtime Savings of Mixed-Abstraction Simulation
13.4 Distribution of Fault Group Sizes

List of Tables

6.1 Switch Level Event Table Example
6.2 Switch Level Device Table Example
9.1 Specifications of the used GPU Devices
9.2 Number Symbols and represented Order of Magnitude
10.1 Logic Level Simulation Performance
10.2 Switch Level Simulation Performance
11.1 Fault Collapsing and Fault Grouping Statistics
11.2 Fault Detection of Transition and Small Delay Faults
11.3 Random Process Variation Impact on Small Delay Fault Detection
A.1 Benchmark Circuits and Test Set Statistics
B.1 Characterization of Switching Activity at Logic Level
B.2 Weighted Switching Activity of timed and untimed Simulation
B.3 Resistive Open-Transistor Fault Simulation Results
B.4 Capacitive Fault Simulation Results
B.5 Multi-Level Simulation Runtime of different Mixed-Abstraction Scenarios
List of Abbreviations

ATPG Automatic test pattern generation
CAT Cell-aware test
CCC Channel-connected component
CLF Conditional line flip
CMOS Complementary metal-oxide semiconductor
CUDA Compute unified device architecture
DSPF Detailed standard parasitic format
EDA Electronic design automation
ELF Early-life failures
EM Electro-migration
FAST Faster-than-at-speed test
FinFET Fin-type field-effect transistor
FLOP Floating-point operation
GND Ground voltage potential
GPU Graphics processing unit
HCI Hot-carrier injection
IFA Inductive fault analysis
MEPS Million (node) evaluations per second
MIS Multiple-input switching
MOSFET Metal-oxide-semiconductor field-effect transistor
NBTI Negative-bias temperature instability
PPSFP Parallel-pattern single fault propagation (fault simulation)
ROI Region of interest
RTL Register-transfer level
SDF Standard delay format
SM Streaming multi-processor
SPICE Simulation program with integrated circuit emphasis
STA Static timing analysis
STF Slow-to-fall
STR Slow-to-rise
TDDB Time-dependent dielectric breakdown
VDD Supply voltage potential

Chapter 1

Introduction

In modern semiconductor development cycles of nanometer CMOS integrated circuits, the simulation of circuits is one of the most important and complex tasks for design and test validation. Circuit simulation is used in design validation to analyze developed designs with respect to validation targets and to indicate compliance with their specification or customer requirements. In test validation, on the other hand, fault simulation is utilized to evaluate the defect coverage of test sets and new test strategies, as well as to assess the quality of the tested products. For this, electronic design automation (EDA) tools for design and test validation have to rely on repeated, compute-intensive circuit simulations with extensive runtimes, which can pose a bottleneck, especially for designs with billions of transistors [ITR15].
Simulation itself is best described as "performing experiments on a model" of the real (physical) world [Cel91, CK06]. The models for simulation typically provide a set of properties and rules that represents and explains parts of the complex real-world behavior in an abstract manner. Once an algorithm has been derived from the model, a simulation program (simulator) can be efficiently utilized to perform and observe experiments on the model repeatedly, for any given input parameters. The output of the simulation then provides data that can help to make predictions regarding the real-world behavior. This way, knowledge can be obtained to validate real-world hypotheses without the need for costly (and often destructive) experiments in the real world.

For example, circuit simulation provides abstract models of the real physical silicon chips, which allow experiments to be run to predict and observe the behavior of prototype designs and to compare it to the specification. The designer can then draw conclusions from the results to assess the functionality provided by the prototype design, its correctness, as well as other non-functional aspects (i.e., timing, reliability, power) [WH11, Wun91]. Fault simulation is an extension of circuit simulation in which the behavior of arbitrary defects in a real physical chip is abstracted to faults. Each fault describes the altered chip behavior with respect to a given defect mechanism, which is reflected as abstract deterministic misbehavior in the underlying circuit simulation model. The abstraction of a simulation model itself is crucial to avoid dealing with the overall complexity of the real physical world and to make an evaluation feasible at all. While a high accuracy of a simulation model is generally desirable, it can make simulations infeasible due to a high time and space complexity of the evaluations.
By constraining the model to certain aspects and by using simplified assumptions, the cost of the simulations can be lowered and the evaluation can be focused on particular problems of interest. On the other hand, too much abstraction reduces the accuracy of the models and thus the viability of the simulation and its results.

Challenges

With the continuing advancements in semiconductor manufacturing processes and the shrinking technologies of digital integrated CMOS circuits, traditional circuit and fault simulation models have become insufficient. Nowadays, thorough design and test validation as well as diagnosis require more and more accurate simulations and have thus become complex and runtime-intensive tasks, since several aspects have led to an increase in the overall modeling and simulation effort:

Complexity of Circuit Size

Over the past decades, semiconductor manufacturing technologies and processes have made continuous advancements that still allow for an ongoing growth in complexity and higher integration of modern integrated circuits [ITR15]. Although the end of Moore's Law [Moo65, Moo75] has already been predicted [RS11], newer process technologies, such as the fin-type field-effect transistor (FinFET), enable a scaling of structure sizes down to only a few nanometers and production with higher transistor density [Kin05, Hu11]. For example, in 2017 NVIDIA announced a new generation of parallel GPU accelerators: the NVIDIA Volta architecture. The GV100 GPU chip of this family is manufactured in a 12 nm FinFET technology and comprises 21.1 billion transistors on a die of 815 mm² [NVI17d]. To validate such massive designs, scalable parallel simulation algorithms are required that are able to cope with the increasing design complexity and its millions or billions of transistors.
Complexity of Circuit Behavior

While technology scaling causes circuits to grow in size and functionality, the behavior of standard cells and circuit structures is becoming more and more complex as well [ITR15]. With the increasing demand for high-performance and low-power embedded designs, chips today are running at their operational limit [SBJ+03, DWB+10, KET16, GKET16]. Furthermore, circuit structures are becoming increasingly prone to process [ZHHO04, QS10] and parameter variations (e.g., voltage and temperature [BPSB00, BKN+03, TKD+07]), which can impact the reliability of the circuit and the effectiveness of tests if neglected [PBH+11, HBH+10, CIJ+12]. Therefore, accurate simulation models are required that deliver sufficiently high precision and detail to capture both functional and non-functional aspects of the circuit with as little abstraction as possible. The circuit models must support an accurate representation of the temporal behavior to reflect hazards and glitches, signal slopes, and parametric and parasitic dependencies. Also, pattern-dependent delays and multiple-input switching effects should be considered, as they can significantly alter the circuit delay [MRD92, SDC94, CS96]. However, such timing-accurate evaluations involve the vast processing of many real numbers through complex arithmetic operations, which is several times more expensive in terms of computational complexity compared to untimed logic simulation [SJN94] and also much harder to parallelize [BMC96].

Complexity of Defect Modeling

The complexity of a simulation quickly rises further as soon as the behavior of defects in a circuit needs to be investigated in fault simulation. While classical fault models and simulation approaches have been extensively studied and optimized in the past [Wun10, Wun91, WLRI87], modern semiconductor manufacturing processes have to deal with several new fine-grained types of defects [LX12, RAH+14, HRG+14, ITR15].
Small delay defects (SDD) [TPC11, RMdGV02], marginal devices [RAH+14, KCK+10, KKK+10] and wear-out-related effects due to circuit aging [LGS09, APZM07, BM09, GSR+14] impact the timing behavior of the circuit and require timing- and glitch-accurate evaluation [Kon00, YS04, HS14], especially when process variations are considered on top [PBH+11, HBH+10, CIJ+12]. While certain validation tasks can utilize formal verification methods [DG05], the modeling and evaluation of low-level faults and parameters in explicit simulation is much more intuitive and straightforward [Lam05], especially when a validated formal model does not exist. Moreover, the faulty behavior of many realistic defects often cannot be represented at higher abstraction levels, as it strongly depends on the physical layout as well as the low-level topology and technology of the standard cells [SMF85, FS88]. For this reason, cell-aware test (CAT) approaches have recently been developed [HRG+14] that extensively rely on expensive low-level analog fault simulations with user-defined fault models (UDFM). Thus, it is important to simulate faults with little abstraction as accurately and efficiently as possible [CCCW16].

Parallelism as a Remedy

Assume that a circuit of N nodes with a set F of faults needs to be simulated for a pattern set T under different operating parameters P. In a naïve flow, a total of N×|F|×|T|×|P| individual simulation problems have to be solved, each of which corresponds to complex arithmetic evaluations of the timing behavior at a node. Thus, with growing design size, the number of simulation problems rises quickly, since the number of nodes, faults and test patterns also increases. Although circuit and fault simulation are inherently parallelizable [IT88, BBC94, BMC96], parallelization has risen to a whole new level with the introduction of general purpose computing on graphics processing unit (GPU) accelerators [OHL+08].
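To get a feeling for the scale of such a naïve flow, the following Python sketch computes the total number N×|F|×|T|×|P| of individual simulation problems. The design sizes used here are made-up, purely illustrative values, not figures from this thesis:

```python
def naive_simulation_problems(n_nodes, n_faults, n_patterns, n_parameters):
    """Total number of individual simulation problems in the naive flow:
    every node is evaluated for every fault, pattern and parameter set."""
    return n_nodes * n_faults * n_patterns * n_parameters

# Assumed illustrative sizes: 1M nodes, 100k faults, 10k patterns, 10 corners.
total = naive_simulation_problems(10**6, 10**5, 10**4, 10)
print(total)  # 10**16 individual evaluations
```

Even at a (generous) rate of a billion evaluations per second, such a workload would run for months, which is exactly why the massive parallelism discussed next is needed.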
In contrast to conventional processors (e.g., the IBM Power9 architecture with 24 cores [STKS17]), GPU architectures provide thousands of compute cores on a single device for running millions of threads concurrently [NVI17d]. With this amount of compute cores, GPU-based systems attain unprecedented arithmetic computing throughput in the order of petaFLOPS¹ on a single compute node, which has become well established in high-performance computing for speeding up many applications [NVI18c]. Many approaches that parallelize logic level simulation on GPUs for acceleration have been published [GK08, Den10, KSWZ10, CKG13, WLHW14, LTL18]; they isolate and partition the simulation problems for independent parallel execution by individual threads, but only very few are able to consider actual timing. Analog simulation with GPU support [GCKS09, CRWY15, CUS19] and full GPU-accelerated SPICE implementations [HZF13, HF16, vSAH18] have been proposed as well. However, existing parallel approaches for either logic level or analog simulation are mainly optimized for speeding up single simulation instances only. Since the number of simulation problems continues to rise, it is of utmost importance that the approaches are not only efficiently parallelizable, but also scalable by providing high simulation throughput to make the evaluation of large designs for more complex algorithms feasible.

Timing Modeling at Logic Level

For the calculation of the full signal switching histories in a circuit, traditional unit delay models [Wun91] have become insufficient. Instead, timing-accurate simulation models are required that are able to process real-valued gate delays for individual gate pins and signal switches [IEE01a].
Since the execution of bare arithmetic floating-point instructions has much higher latency compared to the bit-wise logic operations of traditional untimed logic simulation, timing-accurate simulation of a million-gate circuit for a test pattern set can take hours or even days to complete [SHK+15]. On GPUs, long-latency processes and arithmetic operations can be hidden by the massive computing throughput provided by the parallel computing cores [OHL+08]. For this, massive parallelization needs to be exploited to effectively speed up extensive logic level simulations with timing-accurate and variation-aware delay models for large designs and test sets.

¹ 1 petaFLOPS = 10^15 floating point operations per second

Switch Level Modeling and Algorithms

Timing-accurate and lower-level models are desirable for accurate simulation of digital integrated CMOS circuits. While simulation at logic level lacks accuracy, SPICE simulation lacks scalability even when accelerated by parallelization. As a trade-off, switch level simulation was introduced [Bry87, Hay87, BBH+88, MV91], which drastically reduces the complexity of the evaluation compared to full analog simulation. While not being as accurate, switch level modeling reflects many properties of CMOS circuits and standard cells that cannot be modeled at logic level. This has recently drawn attention in cell-aware test (CAT) generation to reduce the analog fault simulation overhead [CCCW16, CWC17]. Yet, simulation at switch level is still an order of magnitude slower compared to logic simulation [SJN94] and therefore poses a strong bottleneck in validation tasks, which makes it a prime example for acceleration on GPUs. However, to the best of the author's knowledge, no algorithms to accelerate switch level simulation on GPUs exist.

Defect Modeling and Fault Simulation

Many defect types exhibit a varying faulty behavior based on the size or magnitude of the defect parameter.
The behavior of such a parametric defect can range from a hard functional failure of the circuit to a weak impact on the non-functional timing behavior that only affects the circuit performance [RMdGV02, FGRC17]. With the continuously shrinking circuit structures, the probability of such defects as well as their severity with respect to circuit reliability rises [ITR15], while at the same time the faults are becoming harder to detect [LX12, TTC+15]. Thus, thorough test validation requires a fine-grained modeling of the defect parameters to capture deviations in both functional and non-functional aspects of the circuit (e.g., small delays [YS04, LKW17, KKL+18]). To reflect the behavior of more realistic defects found in CMOS cells, faults must be modeled at the lower levels without introducing too much abstraction [FS88, CCCW16, CWC17]. This vastly increases the computational overhead, as the defect mechanisms not only get more complex, but also many simulations of different individual faults are necessary (e.g., for fault coverage estimation). Thus, the need for parallelization of fault simulation is unavoidable [IT88, Wun91]. With their large amount of parallel arithmetic compute cores, GPUs provide high arithmetic computing throughput that is promising for simulating even large designs for many faults and test patterns with full timing accuracy.

Multi-Level Simulation

Modeling and simulation at higher abstraction is fast, but the approaches are limited in accuracy such that often timing and defects cannot be reflected properly [RAH+14]. On the other hand, lower level approaches can provide sufficient accuracy, but exhibit significant runtime complexity and resource requirements, such that these approaches cannot be used solely for a system-wide application (i.e., system-level test validation [Che18, JRW14]).
However, advantages from both higher- and lower-level abstractions can be combined by using multi-level techniques. A multi-level simulation provides a combined simulation across multiple different levels of abstraction [GMS88, MC95, RK08, KZB+10, HBPW14]. Accurate and compute-intensive evaluations are typically restricted to a particular region of the circuit [HBPW14, KZB+10] or to single cells [CCCW16], whereas the remainder is processed using faster higher-level simulations. Intermediate simulation data is usually exchanged at abstraction boundaries by data aggregation and mapping to the respective higher- or lower-level inputs [JAC+13, HBPW14, CCCW16]. Thus, compared to full simulations at the lowest abstraction, multi-level simulations can provide a trade-off in terms of speed and accuracy and make certain simulation scenarios feasible in the first place [JAC+13, HKBG+09, CCCW16].

Applications

High-throughput timing-accurate circuit and fault simulation is helpful to accelerate applications in a variety of areas that heavily rely on extensive simulations.

Process variability and timing variation: With the increasing sensitivity of circuits towards process and parameter variation, the need for statistical evaluations for performance and reliability assessment rises. Under variation, the circuit timing is severely altered [BCSS08, GK09], which tampers with the effectiveness of tests [PBH+11]. By including parameter dependencies in the delay processing, such as supply voltage and ambient temperature [DJA+08, SAKF08], or process variations [ABZ03, ZHHO04], statistical timing analyses [WW13] and statistical fault coverage analysis [SHM+05, CIJ+12] can be realized by large-scale Monte-Carlo evaluations in highly parallel timing simulation.

Non-functional properties: The timing-accurate switching histories from the simulation can be utilized to analyze non-functional properties in a design.
Both thermal design power [HGV+06] and IR-drop from switching activity hotspots [SBJ+03, MSB+15, DED17a] impact the reliability of designs and tests [WYM+05, ZWH+18] and thus need to be evaluated. Moreover, vulnerabilities with respect to single-event upsets [BKK+93] and aging effects, such as bias temperature instability (BTI) [GSR+14] or hot-carrier injection (HCI) [LGS09, ZKS+15], can be investigated using timing-accurate signal histories.

Power estimation: The static and dynamic power consumption of a design can be estimated from the circuit signals and switching activity [GNW10]. Power consumption tampers with the functionality of the design not only during functional operation, but also during the application of tests [WYM+05, AWH+15]. While previous approaches relied on untimed logic simulation for complexity reasons, the power of GPUs enables timing-accurate power estimation for many patterns in parallel. This way, more accurate power-aware test schemes and test pattern selection can be developed [HSW+16, HSK+17, ZWH+18].

Scan test: Timing-accurate simulation is necessary to validate delay tests for small delay defects and marginalities, i.e., faster-than-at-speed test (FAST) [HIK+14, KKL+18]. However, in scan test each test pattern undergoes a long session of consecutive shift cycles before the test can actually be applied and evaluated [GNW10]. During the shift phase, IR-drop-induced shift errors from delayed signals [YWK+11] as well as skewed clocks [Sho86] can cause errors during the shift-in and shift-out phases of the test pattern application. Again, with the high-throughput computing capabilities of GPUs, parallelization can be utilized to accurately evaluate full test sets with millions of shift cycles concurrently.
Goal of this Thesis and Organization

The document at hand presents highly parallel simulation models and simulation algorithms for fast and scalable timing simulation of nano-electronic digital integrated CMOS circuits on graphics processing units (GPUs). The main contributions of this thesis are summarized in Fig. 1.1. Compact functional as well as timing-accurate temporal modeling are combined with highly parallelized simulation algorithms designed to run on GPU architectures to exploit their massive computing throughput. These building blocks provide the foundation for developing high-throughput simulation algorithms for timing-accurate logic level and switch level simulation with parallel execution on the GPUs.

[Figure 1.1: Contributions of this work — compact functional modeling, timing-accurate temporal abstraction, efficient simulation algorithms, and high-throughput parallelization on massively parallel graphics processing units (GPUs), combined into efficient high-throughput parallel timing and fault simulation of nano-electronic digital circuits: logic level timing simulation, switch level timing simulation, small delay fault simulation, transistor level fault simulation, and multi-level timing and fault simulation [HSW12, HIW15, SHWW14, SHK+15, SKH+17, SW16, SW19a, SW19b, SKW18].]

The presented simulation models and algorithms are further extended for fault simulation of small delay faults at logic level as well as low-level parametric and parasitic faults at switch level. These allow to fully exploit the massive parallelism found in circuit and fault simulation on the GPU architectures. Finally, on top of these, the extended simulation models and algorithms are ultimately combined to form an efficient multi-level fault simulation flow that gains additional simulation throughput and efficiency for the lower level fault simulation through the use of mixed abstractions during simulation.
By simultaneously exploiting different dimensions of simulation parallelism, the presented parallel simulation techniques achieve unprecedented simulation throughput on GPUs, thereby enabling scalable timing and fault simulation of multi-million gate designs, which effectively reduces simulation time from several hours or even days to minutes only.

The structure of this document is organized into three parts: The first part (Chapters 2–4) introduces background on circuit and fault modeling as well as simulation. Furthermore, it briefly summarizes the key aspects of the data-parallel many-core computing architecture of graphics processing units (GPUs) and their programming paradigm, along with an overview of state-of-the-art GPU-accelerated simulation techniques.

The second part of this document covers the high-throughput timing and fault simulation, which is the main contribution of the thesis, and is composed of the following chapters:

Chapter 5 ("Parallel Logic Level Time Simulation on GPUs") presents the first timing-accurate and variation-aware logic level time simulation on GPUs and outlines the underlying algorithms, kernels and general data structures for multi-dimensional parallelization with high simulation throughput.

Chapter 6 ("Parallel Switch Level Time Simulation on GPUs") introduces the first parallel switch level time simulation for GPUs, which utilizes first-order electrical parameters found in CMOS circuits for a more accurate modeling of the functional and timing behavior of CMOS standard cells.

Chapter 7 ("Waveform-Accurate Fault Simulation on GPUs") describes the extension of the logic and switch level simulation algorithms for fault modeling, syndrome computation and parallelization for the first detailed and comprehensive parallel simulation of timing-related faults.
Chapter 8 ("Multi-Level Simulation on GPUs") addresses the combination of the presented parallel logic and switch level algorithms in the first GPU-accelerated multi-level simulator, which effectively enhances the performance and efficiency of the lower-level timing and fault simulation.

The third and last part (Chapters 9–13) discusses the evaluation of the presented simulation methodologies as well as example applications in five chapters. Finally, the last chapter provides a summary of all contributions of this thesis and outlines directions for future work and possible extensions.

Part I: Background

Chapter 2: Fundamental Background

This chapter summarizes the necessary background and fundamentals of circuit and fault modeling as well as simulation of digital integrated CMOS circuits as found in basic literature [Wun91, BA04, WWX06, WH11]. First, background on standard CMOS technology is provided, followed by circuit and delay fault modeling on different levels of abstraction. Finally, simulation algorithms and multi-level simulation methods are briefly discussed.

2.1 CMOS Digital Circuits

Today's semiconductor manufacturing processes for digital integrated circuits mainly rely on complementary metal-oxide semiconductor (CMOS) technology. CMOS uses two types of metal-oxide-semiconductor field-effect transistor (MOSFET) devices, namely NMOS and PMOS transistors, which can basically be considered as voltage-controlled switches. NMOS and PMOS transistors are utilized to form standard cells in standard cell libraries [Nan10, Syn11], which are the foundation of contemporary semiconductor design. For more details on the structure and mechanics of transistors as well as the manufacturing processes and design of standard cells, the reader is referred to [WH11]. In modern semiconductor process technology nodes, conventional planar MOSFETs are being replaced by the recent fin-type field-effect transistor (FinFET) technology [HLK+00, Kin05, BC13].
FinFET devices allow for scalable manufacturing of feature sizes beyond 20nm, since they have a more compact layout with a faster switching behavior compared to traditional transistors [YMO15]. This enables the production of more complex designs with higher transistor density that can further be run at higher clock frequencies to achieve even higher device performance. Although the physical structures of planar MOSFETs and FinFETs are different, their basic working principles and design flows are similar.

2.1.1 CMOS Switching and Time Behavior

In today's nanometer regime, it is often of utmost importance that designs meet a certain performance requirement given by a minimum system clock frequency. The highest frequency a design can reliably run at is usually determined by the length of the critical path. The critical path accumulates all the propagation delays of standard cells and interconnects on the longest sensitizable path from a primary or pseudo-primary input to a primary or pseudo-primary output in the design.

In general, the delays of a standard cell are distinguished into propagation delays and rise/fall times as shown in Fig. 2.1, which depicts the typical input and output waveform of an inverter cell in a small circuit (15nm FinFET technology [Nan14, BD15] powered with a supply voltage of 0.8V). The maximum signal amplitude on the left axis has been normalized. As shown, the signal switching processes do not occur instantly, but rather as a function of voltage over time, as a result of parametric and parasitic electrical elements in the circuit. The change of the voltage of a signal is typically modeled by differential equations that describe the electrical charging and discharging processes of capacitances in the circuit [WH11].
[Figure 2.1: Rising/falling propagation delay (dr, df) and rise/fall times or slopes (dlh, dhl) of an INV_X1 CMOS inverter cell [Nan14, BD15], shown as normalized input and output signal waveforms over time with 20%, 50% and 80% thresholds. (Adapted from [WH11].)]

The propagation delay is defined for rising (low-to-high) and falling (high-to-low) output transition polarities (dr and df) and describes the time it takes for a signal transition at a cell input to propagate to the cell output. It is measured from the time point where the input signal reaches the 50% mark of the maximum signal amplitude (i.e., 0.5 · VDD as halfway between VDD and GND potential) to the point where the output crosses 50% of the maximum signal amplitude in response [WH11]. Moreover, as shown in Fig. 2.1, signal transitions are not instantaneous, but follow a certain continuous transition ramp or slope, given by the rise time dlh for rising and the fall time dhl for falling transitions. The rise time (and fall time) of a signal transition is defined as the time it takes to transition from 20% to 80% of the maximum signal amplitude (and vice versa), though these thresholds can vary throughout the literature (e.g., 10%–90%) [WH11].

With the shrinking manufacturing process technologies and increasing complexity, it is important to accurately estimate the possible timing of contemporary and future designs. Thus, suitable models are required to reflect and consider CMOS-related timing effects as early and as accurately as possible during the design phase. One approach to estimate the delay of a circuit is the RC delay model [WH11], which approximates the non-linear characteristics of transistors considering average resistances and capacitances of the circuit nodes.
In the RC delay model, the circuit netlist is transformed into an electrically equivalent RC circuit, where all transistors and interconnects are replaced by simple resistors and the interconnect and gate capacitances in the fanouts are replaced by (in the simplest case) lumped capacitances. A switching transistor causes a change in the state of the RC-model elements by changing the corresponding resistance. This triggers a transient at the output that is typically computed using the first-order or second-order step response [Elm48, WH11]. Hence, in case of the rising transient of Fig. 2.1, the step response vrise beginning at some time t0 can be modeled for times t > t0 as

    vrise(t) := (GND − VDD) · e^(−(t−t0)/τ) + VDD,    (2.1)

where τ is the time constant computed as τ := R · C, with R being the effective resistance of the driving PMOS transistor and C being the load capacitance. The propagation delay dr is then derived from the output signal vrise at the time it crosses 0.5 · VDD, which is computed as dr := τ · log 2 [WH11]. Similarly, the falling transient vfall beginning at some time t1 is calculated for t > t1 as

    vfall(t) := (VDD − GND) · e^(−(t−t1)/τ) + GND,    (2.2)

in which case the resistance R for the time constant τ := R · C corresponds to the effective resistance of the conducting NMOS transistor, resulting in an estimated falling propagation delay of df := τ · log 2. Both of the modeled rising and falling transients expressed by vrise and vfall are illustrated in Fig. 2.2. The axes have been normalized by τ for the time and by VDD for the voltage, respectively. For the sake of simplicity, it was assumed that the internal resistances of NMOS and PMOS were equal.

[Figure 2.2: First-order step response for rising and falling transients modeling the continuous charging/discharging process of an output load capacitance, with normalized output voltage over time in units of τ and the 50% crossing at τ log 2. (Adapted from [WH11].)]
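The first-order step responses of Eqs. (2.1) and (2.2) can be sketched directly in code. The following Python snippet uses VDD = 0.8 V as in the inverter example above; the time constant τ is an assumed illustrative value. It also confirms numerically that the output crosses 0.5 · VDD after τ · log 2:

```python
import math

VDD, GND = 0.8, 0.0  # supply and ground potentials in volts (as in the example)

def v_rise(t, t0, tau):
    """Rising first-order step response, Eq. (2.1): charges from GND to VDD."""
    return (GND - VDD) * math.exp(-(t - t0) / tau) + VDD

def v_fall(t, t1, tau):
    """Falling first-order step response, Eq. (2.2): discharges from VDD to GND."""
    return (VDD - GND) * math.exp(-(t - t1) / tau) + GND

tau = 10.0               # assumed R*C time constant in picoseconds (illustrative)
d_r = tau * math.log(2)  # propagation delay: time of the 50% crossing
assert abs(v_rise(d_r, 0.0, tau) - 0.5 * VDD) < 1e-12
assert abs(v_fall(d_r, 0.0, tau) - 0.5 * VDD) < 1e-12
```

As a side note, the 20%–80% rise time of such an exponential ramp evaluates to τ · log(0.8/0.2) = τ · log 4, which connects the slope definition of Fig. 2.1 to the same time constant.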
2.1.2 Circuit Performance and Reliability Issues

With the shrinking feature sizes of newer technologies, semiconductor manufacturing processes get more complex and increasingly prone to process variations and defects [SSB05, PBH+11]. Since structures such as the fins of FinFET transistors have a size of only a few nanometers, spot defects caused by particles or by statistical processes in the different manufacturing steps, such as dopant fluctuations, are getting more and more frequent and influential. These problems can severely alter the physical layout of the circuit structures and lead to improper connections in the design. Moreover, transistor-related parameters, such as the threshold voltage and electron mobility, can shift due to improper manufacturing. The resulting defects, such as opens and shorts in either fins or gates [LX12], as well as variation artifacts [HTZ15], primarily impact the timing of the circuit [TTC+15, FGRC17] and ultimately affect the device performance and reliability by introducing delay faults.

Certain parameter shifts in circuit structures exhibit small delays that sometimes indicate circuit aging or the presence of a marginal device [KCK+10]. For example, when a design is put under stress (i.e., high switching activity or mechanical stress) after deployment, structures in the circuit can fail due to device degradation and transistor aging [BM09]. These wear-out mechanisms appear gradually as a result of continuing static and dynamic stress in the circuit, i.e., by certain stimuli or circuit activity, as well as environmental conditions, such as temperature. Typical degradation mechanisms are Negative-Bias Temperature Instability (NBTI) [LGS09, GBRC09], Hot-Carrier Injection (HCI) [EFB81, GSR+14], Time-Dependent Dielectric Breakdown (TDDB) [DGB+98] and electromigration (EM) [Cle01].
The effects of the degradation range from changes in the electrical properties of the transistor devices and interconnects [LQB08] that result in delay faults [BM09] up to hard failures, such as voids or bridges due to metal atoms transported by EM. Imminent wear-out failures can be predicted using specialized hardware (e.g., aging monitors [APZM07, LSK+18]) once the circuit has sufficiently aged or degraded. Wear-out can also be delayed or mitigated by reducing the stress patterns in the circuit to prolong the system lifetime [ZBK+13].

Failures in the early-life phase of a design are related to infant mortality among circuits, where marginal devices have passed the conventional manufacturing test but break soon after initial deployment. These failures are called early-life failures (ELFs). Marginal devices are usually tested using so-called burn-in tests [VH99] that strongly exercise the circuit under extreme stress conditions to expose the ELFs. The indicators of ELFs can be located deep inside the circuitry and can be barely noticeable as small delay deviations that are much smaller than the clock period and not testable under at-speed conditions [KCK+10]. This led to the recent development of faster-than-at-speed test (FAST) [HIK+14, KKS+15] to explicitly test for these hidden delay faults (HDF) in order to identify marginal devices. Therefore, to assess and validate the timing of today's designs as well as to thoroughly test for timing-related defects, designers and testers have to rely on timing-accurate models of the circuits and simulation algorithms for test and diagnosis [TPC11].

2.2 Circuit Modeling

Today's multi-billion transistor circuit designs are typically not manufactured from scratch using transistors, but rather starting from high-level specifications that describe the behavior in an abstract manner [Wun91]. As shown in Fig. 2.3, the circuit design undergoes different representations at various levels following a V-shaped model.
During the top-down design phase, abstract high-level descriptions are refined step by step by synthesis tools that systematically add more and more implementation details by mapping to lower level structures [ABF90, BA04, WWX06]. For example, at the highest level of abstraction, only the desired function is specified and the actual design itself is treated as a black box, since no physical implementation details are present. At this level, modeling and simulation are fast and simple, but the accuracy is the lowest. When moving down towards the lower abstraction levels, the required gates to realize the specified function or even the layout of complete transistor netlists with physical and electrical properties are determined. This drastically increases the modeling complexity and, of course, the simulation effort, but ultimately leads to more accurate models and evaluations.

[Figure 2.3: Overview of the abstraction levels of a circuit in design and validation — from system/behavioral level through register-transfer, logic and switch level down to electrical level; top-down design refines, bottom-up validation abstracts; simulation speed is highest and detail/accuracy lowest at the top, and vice versa at the bottom.]

2.2.1 Electrical Level

The electrical level provides the highest accuracy among the commonly used simulation models, as it reflects the physical properties and behavior based on the layout. The circuit is modeled using netlists of electrical components and devices, such as transistor models, resistors, capacitors, inductors and voltage sources. Corresponding simulators usually consider geometric information of the physical layout to reflect spatial dependencies of power-grid structures and capacitances between neighboring interconnection lines. Internal node voltages and currents are typically calculated through compute-intensive nodal analyses and differential equations using numerical integration methods.
This way, highly waveform-accurate transient analyses can be provided by electrical simulation, which also allows predicting the signal integrity as well as the power consumption of a circuit [WH11]. Modeling and simulation at the electrical level is typically performed using SPICE [NP73], which has become established as the industry golden standard. Common standard cell libraries [Nan10, Syn11] already provide SPICE models of the implemented standard cells, which utilize transistor device model cards that can include hundreds of parameters [DPA+17, Nan17]. However, the high detail and accuracy of electrical level modeling comes at the cost of vast runtime complexity and memory requirements.

2.2.2 Switch Level

Switch level modeling abstracts the electrical level behavior of the circuit by simplification and linearization of the transistor switching behavior [Bry84, Bry87, Hay87, Kao92]. The circuit is represented as an interconnected network of transistors and first-order electrical components, i.e., resistors and capacitors, as shown in Fig. 2.4. Transistors are treated as simple three-terminal devices that form discrete ideal switches, where the input at the gate terminal controls the connection between the drain and source terminals of the device. The resistors in the model reflect the static functional behavior, and the capacitors (sometimes referred to as wells) model the temporal behavior [Hay87]. Most switch level models distinguish between signal strengths and consider bi-directional signal flows as well as the modeling of charges inside the networks. At switch level, signal values are typically represented as pairs 〈s, v〉 of signal strength s ∈ N and signal value v. The signal values are typically expressed in ternary (three-valued) logic over the set E3 = {0, 1, X} [Hay86, BAS97] to distinguish between low (0), high (1) and undefined (X) signals.
The signal strength reflects the driving strength of the signal source, which is used to resolve multiple drivers at a node or to consider bidirectional signal flows as well as stored charges in the network.

[Figure 2.4: Inverter cell with CMOS implementation and switch level representation, where signal values are encoded as pairs 〈si, vi〉 of strength si and logic value vi [Hay87].]

Signals of strength s = 0 are the strongest signals and typically correspond to VDD or GND sources, whereas higher values of s indicate weaker signals. Stronger signal values always override weaker ones. Signals of the same strength, but with different values, result in an undefined value.

For the purpose of efficient switch level simulation of digital designs, a common practice is to partition the transistor netlists of circuits and CMOS cells into smaller sub-networks composed of ideal PMOS and NMOS transistor meshes, each forming a channel-connected component (CCC) [Bry87, DOR87, HVC97, BBH+88, BAS97, CCCW16]. The transistors within a CCC are interconnected via their drain and source terminals and connect the power supply (VDD) and ground (GND) sources, allowing for a bidirectional signal flow within the network. The outputs of a CCC are linked to an output domain that can lead to gate terminals of transistors in succeeding CCCs. However, it is assumed that no current can flow over a gate from one CCC into another.

Switch level evaluation algorithms are mainly based on nodal analyses and dependencies that are formulated using linear equations. These equations are solved using relaxation techniques with sparse matrix-vector products (SMVP) to first find a steady-state solution for the node voltages and node charges inside the network after each input switch [Bry87].
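The strength-based resolution rule for 〈s, v〉 pairs can be sketched compactly. The following Python fragment is an illustrative sketch of that rule only (not the evaluation algorithm of any particular switch level simulator): it resolves a set of drivers at a common node, letting the strongest driver win and yielding the undefined value X on a conflict:

```python
def resolve(drivers):
    """Resolve a list of <strength, value> pairs driving one node.
    Strength 0 is the strongest (VDD/GND sources); larger s means weaker.
    The strongest driver wins; equally strong drivers with different
    ternary values ('0', '1', 'X') produce the undefined value 'X'."""
    s_min = min(s for s, _ in drivers)                 # strongest strength present
    values = {v for s, v in drivers if s == s_min}     # values at that strength
    return (s_min, values.pop() if len(values) == 1 else 'X')

assert resolve([(0, '1'), (1, '0')]) == (0, '1')  # strong driver overrides weak
assert resolve([(1, '0'), (1, '1')]) == (1, 'X')  # equal strength, conflict -> X
assert resolve([(3, '1')]) == (3, '1')            # single (weak) driver wins
```

The same ordering rule is what lets a strength-0 VDD/GND source override a weakly driven or merely charged node in the network.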
In combination with timing analysis, node voltages are more accurately represented as a function of time, usually modeled by piece-wise approximations [CGK75, Ter83, Kao92]. In general, switch level models are less accurate than electrical level models, but they are able to reflect many important properties of CMOS technology which are not reflected in classical models based on two-valued Boolean or ternary pseudo-Boolean logic. Due to the higher speed of switch level simulation, it has replaced analog SPICE simulation in recent cell-aware test generation [CCCW16].

2.2.3 Logic Level

The logic level is the most dominant abstraction used in design and test validation of nano-electronic digital circuits [WWX06, BA04]. At logic level the circuit behavior is described as a netlist of interconnected Boolean gates. The netlist of a circuit is typically modeled as a directed graph G := (V, E). Each vertex v ∈ V corresponds to a node in the netlist, which corresponds to a Boolean gate, flip-flop, or circuit input or output, and basically represents a functional abstraction of a physical standard cell that determines the logical behavior. The edges (v, v′) ∈ E ⊆ V × V represent ideal interconnections from a source node v ∈ V to a sink node v′ ∈ V in the netlist and allow for signal flows. Incoming edges at a node refer to input pins of the corresponding gate, while outgoing edges indicate the information transport from the node output pin to its successors. Each edge e ∈ E is associated with a signal value that reflects the logical interpretation of the voltage level of the corresponding signal. Signal values are typically defined over two-valued Boolean logic B2 = {0, 1} [Hay86]. The logic values in B2 correspond to discrete signal states, where the logic symbol '0' corresponds to a low voltage (i.e., ground voltage potential GND) and the symbol '1' corresponds to a high voltage (i.e., power supply voltage potential VDD).
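The netlist-as-graph view can be made concrete with a minimal sketch; the dictionary-based netlist encoding and all identifiers are hypothetical conveniences, not a format from the thesis:

```python
# Minimal sketch of a netlist as a directed graph: each internal node maps to
# (node function, list of fanin nodes); primary inputs carry no function.

netlist = {
    'a': None, 'b': None,                      # primary inputs
    'n1': (lambda x, y: x & y, ['a', 'b']),    # AND gate
    'o':  (lambda x: 1 - x,    ['n1']),        # inverter driving the output
}

def simulate(netlist, pattern, order):
    values = dict(pattern)                     # assign input values first
    for n in order:                            # topological (levelized) order
        func, fanin = netlist[n]
        values[n] = func(*(values[p] for p in fanin))
    return values

v = simulate(netlist, {'a': 1, 'b': 1}, ['n1', 'o'])
assert v['o'] == 0                             # AND of 1,1 is 1; inverted to 0
```

Evaluating nodes in topological order guarantees that all fanin values are available before a node function is applied, which is exactly the levelization argument made below for compiled-code simulation.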
Multi-valued logic, such as ternary, four-valued, or even higher-order logic can be utilized to reflect additional states (e.g., unknown due to uncertainties) and dynamic behavior (e.g., transitions) of signals [Hay86]. The functional behavior of a node with k incoming edges is modeled by a Boolean function φ : B2^k → B2 that maps the values of the edges to an output value according to the function. At logic level, various delay models can be considered to describe the temporal behavior of a gate output signal with respect to input changes. Apart from zero-delay modeling, which reflects the untimed simulation case, where input changes at time t can cause a simultaneous change at the output, timing-aware delay models revolve around unit- and multiple-delay representations to reflect propagation delays for gates (sometimes referred to as transport delays) [ABF90]. Fig. 2.5 provides an overview of the timing-aware delay models on the example of a small inverter gate [Wun91]. In the unit delay model it is assumed that all gates have a constant propagation delay d := 1 that corresponds to one time unit. For input transitions at time t ∈ N, this can cause output transitions to occur at time t′ := (t + d) = (t + 1) in response. However, different gate types typically have different delays. Moreover, gates usually have different delays for rising and falling output transitions as well. Multiple-delay models reflect different transport delays dr for rising and df for falling output transitions, which are applied accordingly. Some delay models are able to consider propagation delays that are bounded by a min-max interval [dmin, dmax] ⊂ R [Wun91, BGA07].
In interval-based delay modeling, the actual delay of a gate is not known exactly, but it is estimated to be in the interval d ∈ [dmin, dmax].

Figure 2.5: Overview of common (timing-aware) delay models in logic level simulation. (Adapted from [Wun91])

This way, process variations and uncertainties that impact gate delays can be reflected. However, during simulation with such a model the uncertainty of different gates can quickly accumulate and lead to pessimistic simulation results [ABF90]. When the value at a gate output switches, the load capacitance associated with the standard cell implementation at the electrical level is (dis-)charged, which requires some time until the transition is finished. In case a pulse at a gate input is too short, the output does not have sufficient time to charge to the targeted potential, which can prevent the propagation of the pulse, such that the output value is sustained. A so-called inertial delay can describe the minimum pulse width ∆d ∈ R required to perform a full (dis-)charge of the output. For example, in Fig. 2.5, signal A has two consecutive transitions at times t1 := 1 and t2 := 3, which would lead to output transitions at times t′1 := 2 and t′2 := 4 under unit delay. In the last case, the required minimum pulse width is ∆d > 2, therefore the pulse is filtered out since the pulse width given by (t′2 − t′1) = 2 does not meet the requirement. Individual delays for each input pin of a gate can be considered in combination with rising and falling distinction. So-called pin-to-pin delay models keep a delay value di for each input pin i of a gate.
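The multiple-delay model and inertial pulse filtering described above can be sketched for a single inverter; the helper name, the event representation as (time, value) pairs, and the filtering strategy are simplifying assumptions made for illustration:

```python
def propagate(transitions, dr, df, inertial=0.0):
    """Apply rise/fall transport delays to the input transitions (time, value)
    of an inverter, then filter output pulses shorter than the inertial delay.
    Illustrative sketch of the delay models described in the text."""
    out = []
    for t, v in transitions:
        out_v = 1 - v                          # inverter: input rise -> output fall
        d = dr if out_v == 1 else df           # rising edges use dr, falling use df
        out.append((t + d, out_v))
    # inertial filtering: cancel edge pairs closer together than the threshold
    filtered = []
    for t, v in out:
        if filtered and t - filtered[-1][0] < inertial:
            filtered.pop()                     # pulse too short: drop both edges
        else:
            filtered.append((t, v))
    return filtered

# Rising input at t=1 falls at the output after df=1; falling input at t=3
# rises again after dr=2, as in the multiple-delay model above:
assert propagate([(1, 1), (3, 0)], dr=2, df=1) == [(2, 0), (5, 1)]
```

With a large inertial threshold the same input pulse is filtered out entirely, mirroring the pulse-filtering case in Fig. 2.5.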
An input transition at pin i at time t then propagates to the output at time t′ := (t + di) with the corresponding delay associated to the pin. The logic level timing information is usually provided as standard delay format (SDF) files [IEE01a], which are obtained during the synthesis of the design. Each SDF file typically contains absolute and incremental delay specifications describing specific delays for individual ports (PORT), the propagation delay from an input to an output pin (IOPATH) as well as wire delays (INTERCONNECT) between gate instances. In addition to the propagation delays, inertial delays can be specified (via PULSEPATH) that describe the minimum allowed pulse-widths at outputs for the application of pulse filtering [IEE01a, IEE01b].

2.2.4 Register-Transfer Level

At the register-transfer level (RTL) the circuit module descriptions are mapped to a coarser structural description composed of small interconnected components [Wun91]. The components can be registers or memory to store data and operands, and combinational networks to perform operations. Interconnections and buses define the data-flow between the individual components. The data exchange between memory elements is controlled in a synchronized fashion by clock signals. RTL-descriptions are typically expressed as data-flow in hardware-description languages, such as Verilog [IEE01b] or VHDL [IEE09]. While timing can be expressed in cycle granularity, fine-grained propagation delays are not considered at this level, since no implementation details are available.

2.2.5 System-/Behavioral Level

At system level the circuits are modeled with the highest abstraction. Abstract functional descriptions of circuit modules are typically expressed as algorithms and functions in higher level programming languages, such as SystemC [IEE12], which are compiled and run as separate processes.
Since no details or requirements with respect to the actual implementation of the design are implied, a reasoning about accurate timing is not possible.

2.3 Delay Faults

In test validation and diagnosis, fault models are used which abstract the effect of classes of physical defects in a chip for representation and processing on a higher abstraction level [Wun91]. As opposed to classical static fault models, such as stuck-at and bridging faults, a circuit cannot be tested for delay faults by using a single test pattern only [WLRI87]. Instead, delay tests with a minimum of two patterns composed of an initialization and propagation test vector are required. These patterns are applied in consecutive clock cycles to launch signal transitions at (pseudo-)primary inputs. The output responses are then captured after the clock period has passed, which might reveal output signals that are not yet stabilized. A generalized way to model arbitrary faults in logic simulation is the so-called conditional line flip (CLF) calculus [Wun10]. Each CLF has an activation condition cond, which is an arbitrary user-defined Boolean formula to determine the activation of the fault at a signal v. Once a CLF is active (condition cond is true), it flips (inverts) the value of v to v′ := v ⊕ [cond]. As a simple example: A stuck-at-0 fault is represented in the CLF calculus as v ⊕ [v] and a stuck-at-1 fault is represented as v ⊕ [¬v].

Transition Faults

The transition fault model [WLRI87] assumes a lumped defect at a gate input or output pin that causes a delay at the associated signal line by an amount of time larger than the test clock period (usually infinite), thereby omitting a transition. Transition faults can be distinguished as slow-to-rise (STR) or slow-to-fall (STF) faults that affect rising and falling transition polarities, respectively.
They are usually considered as conditional stuck-at faults, which can be efficiently implemented and simulated using parallel-pattern stuck-at fault simulation by activating the fault upon signal changes of subsequent patterns [Wun10, WLRI87]. Therefore, regarding the modeling and simulation, transition faults are independent of the actual circuit timing and gate delays. For example, let vi ∈ B2 be the value at a signal line after application of the initialization vector, and let vi+1 ∈ B2 be the value after the subsequent propagation vector. The fault activation at a signal line transitioning from vi to vi+1 upon the clock cycle is determined by active ⇔ (vi ⊕ vi+1) · ρ, where the term (vi ⊕ vi+1) indicates a change in the signal value and ρ determines the activation of a STR (ρ := vi+1) or STF (ρ := ¬vi+1) fault, respectively. Note that transition faults always affect all sensitized paths in the output cone of the fault site regardless of the actual timing. Therefore, the fault model fails to accurately represent delay faults of smaller quantities, e.g., small resistive open defects.

Path-Delay Faults

The path-delay fault model [Smi85] was proposed to model distributed delay effects in the circuit and assumes the accumulation of smaller lumped delays during signal propagation. In this model, a path of nodes {i, n1, n2, ..., nj, o} ⊆ V in the netlist graph G := (V, E) from a (pseudo-)primary input i ∈ I to a (pseudo-)primary output o ∈ O of the netlist is assumed to suffer from a path-delay fault if the accumulated delay along the associated path exceeds the clock period. The fault is activated iff the input signal transition is propagated along all nodes of the path to the output. As a result, the signal at the single affected output of the fault is expected to be erroneous. Like transition faults, path-delay faults can be distinguished as STR and STF.
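The CLF flip rule and the transition fault activation condition above translate directly into a few lines of code; the function names are illustrative, but the logic follows the formulas in the text (with ρ negated for STF):

```python
# Sketch of the CLF calculus and transition fault activation described above
# (helper names are illustrative, not taken from the thesis).

def clf(v, cond):
    """Conditional line flip: invert signal v iff the condition holds."""
    return v ^ (1 if cond else 0)

def transition_fault_active(vi, vi1, slow_to_rise):
    """active <=> (vi XOR vi+1) AND rho, with rho = vi+1 for STR
    and rho = NOT vi+1 for STF."""
    rho = vi1 if slow_to_rise else 1 - vi1
    return bool((vi ^ vi1) & rho)

# stuck-at-0 as the CLF v XOR [v]: the result is always 0
assert all(clf(v, v == 1) == 0 for v in (0, 1))
# An STR fault is activated by a rising 0->1 transition only:
assert transition_fault_active(0, 1, slow_to_rise=True)
assert not transition_fault_active(1, 0, slow_to_rise=True)
```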
Similarly, the segment delay fault model was proposed [HPA96, MAJP00] to model distributed delays along sub-paths and hence combines transition fault and path-delay fault behavior. A segment delay fault is modeled on a path segment of l connected gates {n1, n2, ..., nl} ⊆ V. If a signal transition enters the segment at node n1 and propagates through all of its gates, the fault is activated at the segment output node nl from where it behaves as a lumped transition fault. The simulation of path-delay faults does not require timing information and can thus be performed in untimed logic simulation or by sensitization analysis [PR08, Wun10]. ATPG-tools for path-delay faults typically constrain the set of relevant paths by topological analyses with timing information to identify the longest sensitizable paths through each node [ED12]. By doing so, even small lumped delays at gates can be tested for, whose fault size is larger than the slack of the tested path. However, not all faults can be tested robustly [Lin87, Kon00, ED11] and longest path delays can shift due to process variations [CIJ+12]. Thus, certain faults with smaller delays might be missed, requiring more accurate modeling and simulation of the faults.

Small (Gate-)Delay Faults

A small (gate) delay fault is considered as a manifestation at an input or output pin of a node that slows down the signal propagation through the respective pin by a small finite amount of time [IT88, TPC11]. These faults are mainly caused by weak resistive opens in the design [RMdGV02] and, in contrast to transition faults or path-delay faults, the delay impact of a small gate delay fault is much smaller than the clock period [NG00]. Fig. 2.6 illustrates the behavior of a small delay fault at a gate output in a circuit. After application of a delay test vector, transitions at the circuit inputs are launched which propagate through the input cone of the fault.
Eventually, the fault site is reached and the output transition at the fault site is delayed by the fault size. In the left example (a), the delayed output response is propagated through the output cone of the circuit over sensitized paths to three outputs. However, in the output response vector only two of these outputs (2 and 3) show an error, as the signals are captured before the last transition happens. Despite the additional delay due to the fault, the latest signal transition at the first output still arrives in time and a good response value is captured. If the small delay fault was simulated using a simplified transition fault model, all of the outputs of sensitized propagation paths would show an erroneous response. Hence, transition faults typically overestimate the fault coverage of small delay faults with smaller sizes and they are also too optimistic regarding the fault propagation and detection at outputs [IRW90]. In the right example (b) of Fig. 2.6, a non-robust propagation of a small delay fault in presence of reconvergence is illustrated. The reconvergent fault propagation causes a delayed hazard at the output and the fault can only be detected within a small detection window spanned by the hazard. This behavior is not captured by simple transition fault simulation, as the corresponding STF-transition fault (corresponding to a stuck-at-1 [WLRI87]) would lead to a constant-1 output signal with no detection window during simulation. Also, the CLF calculus [Wun10] is not suitable for expressing small delay faults in a practical way, due to the complexity of the temporal modeling. In general, for the fault activation a CLF assumes a single point in the circuit where it is conditionally activated and from which the faulty value is distributed throughout the sensitized output paths.
Hence, similar to the transition delay faults, all sets of sensitized paths are affected, which does not hold for general small delay faults. One way to fully cover a small delay fault behavior in the CLF calculus would be by introducing multiple-CLF faults in the fan-out of the actual small delay fault site (e.g., CLFs injected at reachable circuit outputs). Then, for each injected CLF all possible combinations of related activation paths and propagation paths need to be encoded in the conditions (not including false paths) to fully capture the timing violations that can occur for the small delay fault. However, this introduces a tremendous modeling complexity. In [PB07], so-called segment-network faults were introduced which consider multiple segment delay faults [HPA96, MAJP00] as a (sub-)tree in the netlist starting from a common root node.

Figure 2.6: Examples of a small gate delay fault in a circuit: a) robust fault-propagation to outputs over sensitized paths and b) non-robust propagation with output hazard.

If a signal is propagated from the root node along a segment, a fault is activated at the corresponding leaf node where the segment ends. While the activation results only from a single root node, the fault modeling provides more control in the propagation of the faults. No hazards are considered in the modeling, which can invalidate the detection. Therefore, in order to simulate smallest delay faults for test validation, explicit timing-accurate and glitch-aware simulation is required [CHE+08, BGA07].
Impact of Process Variations on the Fault Detection

Process variations cause inaccuracies during manufacturing, which can severely affect the functional and timing behavior of a circuit [Nas01, PBH+11]. The source of variation is typically distinguished as either of random or systematic nature. The background of random variation usually has a quantum mechanical origin. Both layout and material properties of the circuits are influenced by statistical processes and numerous uncertainties during chip manufacturing, which affect the electrical properties of the underlying structures. Random variation can influence the switching behavior of gates and interconnects in a circuit (intra-die), for example, by line-edge roughness (LER) artifacts, and can also increase the threshold voltage of transistors due to random dopant fluctuation (RDF). At logic level, this type of variation is typically modeled by assuming independent random variables as delays for each gate in a circuit [SSB05, BCSS08]. On the other hand, systematic variation takes into account the spatial and parametric dependencies of different process corners between dies (inter-die), wafers (wafer-to-wafer) or lots (lot-to-lot). This type of variation can affect gates with high spatial correlation or gates of the same type in a similar manner simultaneously [ABZ03, ZHHO04]. The sources of systematic variation relate to irregular material properties as well as limitations of the fabrication processes themselves (e.g., lithography, etching, polishing) that introduce a correlation between neighboring structures or their corresponding nodes in the design [SSB05, ABZ03]. These process variations can have severe impact on the gate delays as well as the circuit timing and ultimately affect the detection of delay faults [CIJ+12, SPI+14].
Therefore, simulation algorithms must be timing-accurate and variation-aware to allow for modeling of the impact of process variations on the delay for statistical timing analyses, variation-aware fault grading and pattern generation [CIJ+12, ZKAH13, SPI+14].

2.4 Circuit and Fault Simulation

A simple way to simulate a circuit is by compiled-code simulation [WWX06, BA04]. In compiled-code simulation the combinational netlist is translated into executable instructions for each node with all signal states being kept as variables. The circuit nodes are then evaluated by executing the assigned operations over their input variables. For a successful evaluation of each node, all of its input variables need to be determined first. Hence, the processing of the nodes usually follows a topological order to allow for a hassle-free evaluation, which is obtained by a so-called levelization pre-process that partitions the nodes into levels ordered by increasing topological distance (depth of the nodes) with respect to the circuit inputs [Wun91]. A basic simulation flow for the simulation of a test pattern tp ∈ B2^|I| is shown in Algorithm 2.1. During the process, first the input nodes I ⊂ V are assigned their corresponding values of the input pattern, followed by the ordered computation of the internal node signals until the outputs are reached. Due to the levelization, all input-dependencies of the nodes have been resolved. This type of simulation is typically referred to as oblivious simulation, since always every node of the circuit is evaluated upon the application of a new test pattern [BBC94].

Algorithm 2.1: Simulation flow of a plain oblivious logic simulation.
Input: netlist G = (V, E), test pattern tp with tp[i] value of input i ∈ I
Output: values vn for all n ∈ V
1 foreach node n ∈ V in topological order do
2   if n ∈ I then
3     Assign input value vn := tp[n].
4   else
5     Fetch values v1, v2, ..., vk of fanin(n) (direct predecessors of n).
6     Compute vn := φn(v1, v2, ..., vk).
7   end
8 end
9 return Stored values vn of all nodes n ∈ V.

2.4.1 Event-driven Simulation

Usually, when applying consecutive test patterns to a circuit, not all of the primary or pseudo-primary inputs of the circuit change and hence certain signals sustain their state. With the oblivious simulation scheme, this causes a lot of unnecessary node evaluations, since these nodes do not require recomputation, yet all nodes are always evaluated regardless of switching activity from signal changes [BA04, WWX06]. To provide a more efficient evaluation in cases of little switching activity, event-driven simulation approaches have been proposed [Wun91]. In event-driven simulation the evaluations are constrained to nodes with active switching events at their inputs. Thus, the evaluation only follows the path of events during simulation and thereby avoids (unnecessary) evaluation of nodes with constant signals. Traditional time simulators typically follow a synchronous event-driven time-wheel approach [Ulr69], which has proven well-suited for simulation at logic level. A different simulation approach for asynchronous event-driven simulation can be realized using the Chandy-Misra-Bryant (CMB) algorithm [CM79, Bry77]. As opposed to a global synchronous time schedule, the CMB algorithm assigns a time stamp to each event and utilizes message passing to distribute events from node outputs to input FIFOs of successor nodes. The evaluation of events at different nodes can be realized by individual processes concurrently, which can benefit from parallelization to provide speedup for simulating single circuit instances [BMC96]. In event-driven approaches, the handling of events can get quite complex, which quickly increases the runtime overhead when considering more detailed delay models [Wun91].
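The event-driven scheme described above can be sketched with a time-ordered event queue; the data structures, the unit-delay assumption, and the bookkeeping of evaluated nodes are illustrative simplifications, not the thesis' implementation:

```python
# Minimal event-driven sketch: only nodes whose fanin changed are
# re-evaluated, following the path of events through the circuit.

import heapq

def event_driven(gates, fanout, values, events):
    """gates: node -> (function, fanin list); events: list of (time, node, value)."""
    evaluated = []                              # record evaluations for illustration
    heapq.heapify(events)
    while events:
        t, node, v = heapq.heappop(events)
        if values.get(node) == v:
            continue                            # value unchanged: no event propagates
        values[node] = v
        for succ in fanout.get(node, []):
            func, fanin = gates[succ]
            evaluated.append(succ)
            new_v = func(*(values[p] for p in fanin))
            heapq.heappush(events, (t + 1, succ, new_v))  # unit delay assumed
    return values, evaluated

gates = {'g': (lambda a, b: a & b, ['x', 'y'])}
fanout = {'x': ['g'], 'y': ['g']}
values = {'x': 0, 'y': 1, 'g': 0}
values, evaluated = event_driven(gates, fanout, values, [(0, 'x', 1)])
assert values['g'] == 1 and evaluated == ['g']  # only the affected gate was evaluated
```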
For example, in inertial delay modeling, many events scheduled during processing might need to be reverted when processing later events in time. Also, the algorithms usually only speed up the simulation of single circuit instances by exploiting parallelism from independent gates. They cannot benefit from pattern parallelism [BBC94] through simultaneously evaluating multiple patterns in a data-parallel fashion, as they rely on sparse occurrences of events. Since gate-level parallelism can diminish at deeper levels, this strongly limits the simulation throughput and effectiveness of these approaches. Moreover, for the implementation on GPUs these algorithms demand highly complex control and dynamic memory management to process all the event lists in the circuit. However, frequent memory operations of the scheduling are expensive and will limit the effectiveness of an acceleration on GPUs [OHL+08].

2.4.2 Fault Simulation

In fault simulation, a circuit is simulated under the behavior of a given defect to determine whether the circuit behavior is altered by the fault [Wun91]. A naïve approach to simulate faults is through serial processing of the provided fault lists. In serial fault simulation, first a simulation of the circuit is performed to obtain the fault-free good-value simulation results for a test pattern. The resulting output responses vo of all circuit (pseudo-)primary outputs o ∈ O are considered as golden reference, or expected values of the fault-free circuit, which are stored for comparison. The good-value simulation is usually followed by repeated simulations of the pattern for various copies of faulty circuits in which different faults have been injected. The injection of a fault is performed by modifying the circuit description according to the abstracted fault behavior.
After simulation of a faulty circuit copy, the output responses of the faulty circuit v′o are compared against the golden reference vo to compute an output syndrome syno by

syno := vo ⊕ v′o, (2.3)

where ⊕ corresponds to the bitwise XOR-operation of the response bits of the outputs o ∈ O in the good and faulty response. The output syndrome contains all the differences in the faulty and the fault-free output response and thus indicates the fault detection of the applied test pattern. A fault f is considered as detected by the pattern iff syno = 1 for some output o ∈ O. If ∀o ∈ O : syno = 0, then f is undetected. Fault simulation is a challenging and compute-intensive task (especially when done exhaustively for all possible faults of a circuit) due to the additional dimension of complexity from the set of different faults. Often a process called fault dropping is applied, in which the simulation of a test pattern set for a fault is stopped as soon as the fault has been detected by a certain number n of test patterns (n-detect). The fault is then immediately removed from the fault list and the simulation is continued for the next. While this reduces the overhead of fault simulation, fault dropping cannot be applied for debug and diagnosis. Especially in diagnosis, often so-called fault dictionaries must be computed, which rely on exhaustive simulation of all faults for all patterns to obtain as much syndrome information as possible. Since the number of faults usually grows with the design size and technology, the increase of the fault simulation performance has been a subject of research in test ever since [WEF+85, WLRI87, AKM+88]. Different ways of improving the performance and speeding up fault simulation exist that exploit parallelism from pattern-parallel simulation and fault-parallel simulation and other optimizations, such as concurrent or deductive fault simulation.
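The serial fault simulation loop with output syndromes and fault dropping can be sketched on a toy circuit; the fault encoding as a (node, stuck value) pair and all helper names are hypothetical:

```python
# Sketch of serial fault simulation with output syndromes (illustrative only).

def fault_simulate(eval_fn, patterns, faults):
    """For each pattern: run a fault-free golden simulation, then one faulty
    simulation per remaining fault; compare via syn_o = v_o XOR v'_o."""
    detected = set()
    for tp in patterns:
        golden = eval_fn(tp)                               # fault-free reference
        for f in list(faults):
            faulty = eval_fn(tp, fault=f)
            syndrome = [g ^ v for g, v in zip(golden, faulty)]
            if any(syndrome):                              # fault observed at an output
                detected.add(f)
                faults.remove(f)                           # fault dropping (n = 1)
    return detected

# Toy circuit: single AND gate, output o = a AND b, with a stuck-at fault on 'a'.
def eval_and(tp, fault=None):
    a, b = tp
    if fault is not None and fault[0] == 'a':
        a = fault[1]                                       # inject stuck value
    return [a & b]

detected = fault_simulate(eval_and, patterns=[(1, 1)], faults=[('a', 0)])
assert detected == {('a', 0)}          # stuck-at-0 on 'a' flips the output of (1,1)
```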
For more information the reader is referred to the general literature on semiconductor design and test [Wun91, WWX06].

2.5 Multi-Level Simulation

In design and test validation tasks it is often necessary to evaluate parts of the design with higher accuracy. Although high accuracy is generally desirable for any task, continually simulating a circuit at the lowest level at all times can be troublesome, due to the increasing order of magnitude of runtime complexity on lower levels [SN90]. In multi-level simulation a design is simulated on more than one level of abstraction at the same time. This is achieved by utilizing mixed abstractions throughout the design during simulation and focusing the accurate and compute-intensive simulations on smaller parts of the design, which allows to reduce the overall runtime. Usually, the design is partitioned such that common parts are evaluated with fast high-level simulation, while critical components are processed at the lower abstraction levels with higher accuracy [RK08, KZB+10, SKR10, JAC+13, HBPW14]. Simulators with mixed abstractions usually utilize hierarchical circuit descriptions with the design being represented individually at each desired level of abstraction [SSB89, RGA87]. Those representations are then selected interchangeably for the components during simulation based on the desired accuracy, e.g., for the injection of low-level faults in mixed-level fault simulation [GMS88, MC93, MC95]. By doing so, designers working on individual modules of a circuit can focus the accuracy of the validation on their respective designs without the need of processing the whole circuit at the lowest abstraction, which eventually results in a much faster and more efficient simulation.
2.6 Summary

With increasing technology scaling and shrinking feature sizes, modern circuits become increasingly complex in terms of design size and also prone to newer defect types that often manifest in the timing behavior of the circuit (e.g., small delays). Hence, timing-accurate simulation has become a necessity for design and test validation. However, timing-accurate simulation itself is tremendously complex in terms of runtime, and classical approaches to tackle the afore-mentioned problems either lack accuracy (due to abstraction) or scalability (due to simulation effort). In order to make accurate and large-scale design and test validation feasible, the simulation algorithms have to be parallelized in a way that increases the throughput performance for coping with an increasing number of gates, test patterns and faults in the designs. This thesis investigates the parallelization of timing simulation on massively parallel graphics processing unit (GPU) architectures and provides timing-accurate algorithms for high-throughput simulation as well as multi-level approaches to further enhance the simulation efficiency.

Chapter 3 Graphics Processing Units (GPUs) – Architecture and Programming Paradigm

This chapter introduces the basic concepts of contemporary data-parallel graphics processing unit (GPU) accelerators. Both the general GPU architecture as well as the underlying parallel programming paradigm will be outlined. For better comprehensibility, NVIDIA's Compute Unified Device Architecture (CUDA) [NVI18b] programming model and the NVIDIA Pascal GP100 GPU architecture [NVI17c] are used as an example. The GP100 GPU is manufactured in a 16 nm FinFET technology from TSMC featuring 15.3 billion transistors on a die of 610 mm² size.

3.1 GPU Architecture

In recent years, general purpose programming on graphics processing units (GPUs) has established itself well in the context of high-performance computing (HPC).
Being used as co-processors, GPUs allow to accelerate computations of many compute-intensive scientific applications by exploiting massive parallelism [NVI18c, ND10, OHL+08]. While conventional CPUs typically provide fewer (usually 8 to 24 [STKS17]), but more general and latency-optimized processing cores, contemporary GPU accelerators consist of thousands of simpler cores that provide high computational throughput due to massively parallel execution of thousands to millions of threads. The execution of threads on the cores is performed in a single-instruction-multiple-data (SIMD) fashion, in which the threads each perform the same instruction at a time, but on different data in parallel [HP12]. Fig. 3.1 depicts a block diagram of the NVIDIA GP100 GPU of the recent Pascal architecture [NVI17c]. The GPU provides 60 streaming multiprocessors (SM) which are partitioned into six Graphics Processing Clusters (GPCs) that are further sub-divided into five Texture Processing Clusters (TPCs) each. Each SM consists of 64 single-precision (SP) and 32 double-precision (DP) compute cores, summing up to a total of 3,840 single-precision or 1,920 double-precision cores. The GPC and TPC [NVI10] provide all key units for graphics processing (e.g., for vertex, geometry and texture processing), however, the actual rendering capabilities will not be used within the context of this work.
Figure 3.1: Block diagram of the full NVIDIA GP100 Pascal GPU Architecture. (Adapted from [NVI17c])

The SMs have access to a 4,096 KB Level-2 (L2) cache which is connected via eight 512-bit memory controllers to a High Bandwidth Memory 2 (HBM2) DRAM global device memory provi