Multi-Level Simulation of Nano-Electronic Digital Circuits on GPUs

A dissertation accepted by the Faculty of Computer Science, Electrical Engineering and Information Technology of the Universität Stuttgart in partial fulfillment of the requirements for the degree of Doctor of Natural Sciences (Dr. rer. nat.)

Presented by Eric Schneider from Hildburghausen, Germany

Main referee: Prof. Dr. rer. nat. habil. Hans-Joachim Wunderlich
Co-referee: Prof. Dr. phil. nat. Rolf Drechsler
Date of the oral examination: June 21, 2019

Institut für Technische Informatik der Universität Stuttgart, 2019

Contents

Acknowledgments
Abstract
Zusammenfassung
List of Abbreviations

1 Introduction

I Background

2 Fundamental Background
  2.1 CMOS Digital Circuits
    2.1.1 CMOS Switching and Time Behavior
    2.1.2 Circuit Performance and Reliability Issues
  2.2 Circuit Modeling
    2.2.1 Electrical Level
    2.2.2 Switch Level
    2.2.3 Logic Level
    2.2.4 Register-Transfer Level
    2.2.5 System-/Behavioral Level
  2.3 Delay Faults
  2.4 Circuit and Fault Simulation
    2.4.1 Event-driven Simulation
    2.4.2 Fault Simulation
  2.5 Multi-Level Simulation
  2.6 Summary

3 Graphics Processing Units (GPUs)
  3.1 GPU Architecture
    3.1.1 Streaming Multiprocessors
    3.1.2 Thread Scheduling
    3.1.3 Memory Hierarchy
  3.2 Programming Paradigm
    3.2.1 Thread Organization
    3.2.2 Memory Access
  3.3 Summary

4 Parallel Circuit and Fault Simulation on GPUs
  4.1 Overview
  4.2 Parallel Circuit Simulation
  4.3 Parallel Gate Level Simulation
    4.3.1 Logic Simulation
    4.3.2 Timing-aware Simulation
  4.4 Parallel Fault Simulation
  4.5 Summary

II High-Throughput Time and Fault Simulation

5 Parallel Logic Level Time Simulation on GPUs
  5.1 Overview
  5.2 Circuit Modeling
    5.2.1 Gates
    5.2.2 Delay
    5.2.3 Variations
  5.3 Modeling Signals in Time
    5.3.1 Events and Time
    5.3.2 Delay Processing
    5.3.3 Waveforms
  5.4 Algorithm
    5.4.1 Serial Algorithm
    5.4.2 Parallelization
  5.5 Implementation
    5.5.1 Waveform Allocation
    5.5.2 Kernel Organization
    5.5.3 Simulation Flow
    5.5.4 Test Stimuli and Response Conversion
  5.6 Summary

6 Parallel Switch Level Time Simulation on GPUs
  6.1 Overview
  6.2 Switch Level Circuit Model
    6.2.1 Channel-Connected Components
    6.2.2 RRC-Cell Model
    6.2.3 Switch Level Waveform Representation
    6.2.4 Modeling Accuracy
  6.3 Switch Level Simulation Algorithm
    6.3.1 RRC-Cell Simulation
  6.4 Implementation
  6.5 Summary

7 Waveform-Accurate Fault Simulation on GPUs
  7.1 Overview
  7.2 Fault Modeling
    7.2.1 Small Gate Delay Fault Model
    7.2.2 Switch-Level Fault Models
    7.2.3 Fault Injection
  7.3 Syndrome Computation and Fault Detection
    7.3.1 Signal Interpretation
    7.3.2 Discrete Syndrome Computation
    7.3.3 Fault Detection
  7.4 Fault-Parallel Simulation
    7.4.1 Fault Collapsing
    7.4.2 Fault Groups
    7.4.3 Grouping Algorithm
    7.4.4 Parallel Fault Evaluation
  7.5 Summary

8 Multi-Level Simulation on GPUs
  8.1 Overview
  8.2 Multi-Level Circuit Modeling
    8.2.1 Region of Interest
    8.2.2 Changing the Abstraction
    8.2.3 Mixed-abstraction Temporal Behavior
    8.2.4 Simulation Flow
  8.3 Transparent Multi-Level Time Simulation
    8.3.1 Waveform Transformation
  8.4 Multi-Level Data Structures and Implementation
    8.4.1 Multi-Level Node Description
    8.4.2 Multi-Level Netlist Graph
    8.4.3 Multi-Level Waveforms
    8.4.4 Parallelization
  8.5 Summary

III Applications and Evaluation

9 Experimental Setup
  9.1 Benchmarks
  9.2 Host System
  9.3 Simulation Environment
  9.4 Data Representation
  9.5 Performance Metrics

10 High-Throughput Time Simulation Results
  10.1 Logic Level Timing Simulation Performance
    10.1.1 Runtime Performance
    10.1.2 Simulation Throughput
  10.2 Switch Level Timing Simulation Performance
    10.2.1 Runtime Performance
    10.2.2 Throughput Performance
  10.3 Switch Level Behavior
  10.4 Summary

11 Application: Fault Simulation
  11.1 Fault Grouping Results
  11.2 Small Delay Fault Simulation
    11.2.1 Variation Impact
  11.3 Switch Level Fault Simulation
    11.3.1 Resistive Open-Transistor Faults
    11.3.2 Capacitive Faults
  11.4 Summary

12 Application: Power Estimation
  12.1 Circuit Switching Activity Evaluation
    12.1.1 Weighted Switching Activity for Power Estimation
    12.1.2 Distribution of Switching Activity
    12.1.3 Correlation of Clock Skew Estimates
  12.2 Summary

13 Application: Multi-Level Simulation
  13.1 Multi-Level Simulation Runtime
  13.2 Multi-Level Trade-Off
    13.2.1 Speedup
    13.2.2 Runtime Savings
    13.2.3 Simulation Efficiency
  13.3 Summary

Conclusion
Bibliography
A Benchmark Circuits
B Result Tables
List of Symbols
Index
Publications of the Author

Acknowledgments

I would like to thank Prof. Hans-Joachim Wunderlich for giving me the opportunity to work at the Institut für Technische Informatik (ITI) in Stuttgart and to experience life in academia under his professional supervision. I am deeply grateful for his invaluable patience and faith in me, as well as the countless and very often insightful discussions we had. Furthermore, I want to thank Prof. Rolf Drechsler from the University of Bremen for taking over the role of my second adviser and also for the nice chats. In addition, I am very grateful to Stefan Holst, who was my mentor during my undergraduate studies and who helped me get started at the ITI as a colleague and friend.
I very much enjoyed the work with the other colleagues and friends as well, which altogether made life at the institute interesting and enjoyable in many ways. Therefore, I am happy to thank Michael A. Kochte and Claus Braun as well as Laura Rodrígez Gómez, Dominik Ull, Chang Liu, Rafał Baranowski, Alexander Schöll, Michael Imhof, Abdullah Mumtaz, Marcus Wagner, Sebastian Brandhofer, Ahmed Atteya, Natalia Lylina, Zahra Paria Najafi Haghi and Chih-Hao Wang. I also want to thank Lars Bauer from the Karlsruhe Institute of Technology, Matthias Sauer, formerly of the University of Freiburg, Prof. Sybille Hellebrand and Matthias Kampmann from the University of Paderborn, as well as Prof. Xiaoqing Wen and Kohei Miyase from the Kyushu Institute of Technology and Yuta Yamato, formerly of the Nara Institute of Science and Technology, for all the insightful discussions and collaborations we could establish together.

Thank you, Mirjam Breitling, Lothar Hellmeier and Helmut Häfner. Without your administrative and technical assistance this work would have been impossible for me.

Last but not least, I am also very grateful to my parents for their constant support and caring.

Stuttgart, June 2019
Eric Schneider

Abstract

Simulation of circuits and faults is an essential part of design and test validation tasks for contemporary nano-electronic digital integrated CMOS circuits. Shrinking technology processes with smaller feature sizes and strict performance and reliability requirements demand not only detailed validation of the functional properties of a design, but also accurate validation of non-functional aspects, including the timing behavior. However, due to the rising complexity of circuit behavior and the steady growth of designs with respect to transistor count, timing-accurate simulation of current designs requires immense computational effort, which can only be handled by proper abstraction and a high degree of parallelization.
This work presents a simulation model for scalable and accurate timing simulation of digital circuits on data-parallel graphics processing unit (GPU) accelerators. By providing compact modeling and data structures, as well as by exploiting multiple dimensions of parallelism, the simulation model enables not only fast and timing-accurate simulation at logic level, but also massively parallel simulation with switch level accuracy. The model facilitates extensions for fast and efficient fault simulation of small delay faults at logic level, as well as first-order parametric and parasitic faults at switch level. The parallelization on GPUs enables detailed and scalable simulation that is applicable even to multi-million gate designs. This way, comprehensive analyses of realistic timing-related faults in the presence of process and parameter variations are enabled for the first time. Additional simulation efficiency is achieved by merging the presented methods into a unified simulation model that combines the unique advantages of the different levels of abstraction in a mixed-abstraction multi-level simulation flow to reach even higher speedups.

Experimental results show that the implemented parallel approach achieves unprecedented simulation throughput as well as high speedup compared to conventional timing simulators. The underlying model scales to multi-million gate designs and gives detailed insights into the timing behavior of digital CMOS circuits, thereby enabling large-scale applications to aid even highly complex design and test validation tasks.

Zusammenfassung

Simulation is an essential part of the design and test validation of today's nano-electronic digital circuits. The manufacturing processes of technologies with small structure sizes in the range of only a few nanometers are subject to strict requirements regarding the reliability and performance of the produced circuits.
Therefore, a validation of the functional aspects is necessary already during design, and beyond that also an accurate validation of the non-functional aspects, including simulation of the timing behavior. Due to the rising complexity of the switching behavior and the steadily growing number of transistors in the circuits, accurate timing simulation requires enormous computational effort, which can only be handled with suitable model abstraction and a high degree of parallelization.

In this work, a simulation model for accurate timing simulation of digital circuits on massively data-parallel graphics accelerators (so-called "GPUs") is presented. The use of compact modeling and suitable data structures, as well as the simultaneous exploitation of multiple dimensions of parallelism, enables not only fast and timing-accurate simulation at logic level, but for the first time also scalable, massively parallel simulation with increased accuracy at switch level.

Extensions of the model provide fast and efficient fault simulation of smallest delay faults at logic level, as well as parametric and parasitic faults in first-order electrical parameters at switch level. The parallelization on the GPU architectures furthermore enables, for the first time, detailed simulations and analyses of realistic delay faults under process and parameter variations for circuits with millions of gates.

By unifying the implemented methods in a simulation concept that spans multiple levels with mixed abstractions, the advantages of the different abstraction levels are combined, whereby the efficiency of the timing simulation is increased and additional speedup can be achieved.
Results of the conducted experiments show that the implemented parallel simulation approach achieves high computational throughput with a high degree of speedup over conventional timing simulation. The underlying simulation model scales to circuits with millions of gates and provides detailed insight into the timing behavior of digital CMOS circuits, enabling comprehensive applications that support even highly complex tasks in the design and test of nano-electronic circuits.

List of Figures

1.1 Contributions of this Work
2.1 Propagation Delay and Rise/Fall Times
2.2 First-order Transient Step Response
2.3 Circuit Abstraction Levels
2.4 Switch Level Representation of a CMOS Inverter
2.5 Overview of Delay Models for Gates
2.6 Small (Gate-)Delay Fault Example
3.1 Block Diagram of the full GP100 Architecture
3.2 Block Diagram of a GP100 Streaming Multiprocessor with CUDA Core
3.3 GP100 Memory Hierarchy and Thread Access
3.4 Serial Execution Flow of a heterogeneous Program
3.5 NVIDIA CUDA Thread Organization and Memory Accesses
5.1 Logic Level Time Simulation Flow
5.2 Graph Visualization of Circuit c17
5.3 Delay Processing
5.4 Logic Level Waveform Representation
5.5 Logic Level Waveform Processing Example
5.6 Structural and Data-Parallelism
5.7 Thread Organization for Structural Parallelism
5.8 Two-dimensional Thread Organization
5.9 Multi-dimensional Thread Organization with Instance Parallelism
5.10 Kernel Data Flow
5.11 Waveform Memory Organization
5.12 Memory Reduction
5.13 Overall Simulation Flow
5.14 Parallel Pattern-to-Waveform Conversion
5.15 Parallel Waveform-to-Pattern Conversion
6.1 Pattern-dependent Delays due to Multiple-Input Switching
6.2 Extraction of Channel-Connected Components
6.3 Resistor-Resistor-Capacitor (RRC-) Cell Model
6.4 Curve-Fitting of Transient Responses
6.5 Switch Level Event Representation and Visualization
6.6 Switch Level Waveforms of a two-input NOR-Cell
6.7 Time-continuous Simulation State of the NOR-Cell Example
7.1 Fault Simulation Flow
7.2 Mapping of Small Delay Faults at Outputs to Input-descriptions
7.3 Greedy Fault Grouping Heuristic
7.4 Fault Grouping Example
7.5 Parallel Syndrome Calculation
8.1 State Transitions in ternary Logic Waveforms
8.2 Ternary Logic Waveform Example
8.3 Multi-Level Simulation Flow
8.4 Multi-Level Waveform Transformation: Logic-to-Switch Level
8.5 Multi-Level Waveform Transformation: Switch-to-Logic Level
8.6 Generic Multi-Level Node Description
8.7 Node Expansion and Collapsing
8.8 Generic Waveform Structure
8.9 Transparent Parallelization of the Multi-Level Simulation
10.1 Logic Level Simulation Performance Trend
10.2 Logic Level Simulation GPU-Device Exploration
10.3 Switch Level Simulation Performance Trend
10.4 Switch Level Simulation GPU-Device Exploration
10.5 Multiple Input Switching in a NAND3_X1 Cell
10.6 Pattern-dependent Delay Test Circuit
10.7 Waveforms of Electrical, Logic and Switch Level Simulation in Comparison
11.1 Resistive Open-Transistor Fault Simulation Results
11.2 Resistive Open-Transistor Fault Behavior
11.3 Capacitive Fault Simulation Results
11.4 Capacitive Fault Behavior
12.1 Switching Activity Characterization at Logic Level
12.2 Weighted Switching Activity Comparison
12.3 Distribution of Switching Activity in Time-Slices
12.4 Clock-Skew in the Clock Tree
12.5 Correlation of Skew Estimates for Circuit b14
12.6 Correlation of Skew Estimates for Circuit b18
13.1 Multi-Level Simulation Performance
13.2 Multi-Level Simulation Speedup
13.3 Runtime Savings of Mixed-Abstraction Simulation
13.4 Distribution of Fault Group Sizes

List of Tables

6.1 Switch Level Event Table Example
6.2 Switch Level Device Table Example
9.1 Specifications of the used GPU Devices
9.2 Number Symbols and represented Order of Magnitude
10.1 Logic Level Simulation Performance
10.2 Switch Level Simulation Performance
11.1 Fault Collapsing and Fault Grouping Statistics
11.2 Fault Detection of Transition and Small Delay Faults
11.3 Random Process Variation Impact on Small Delay Fault Detection
A.1 Benchmark Circuits and Test Set Statistics
B.1 Characterization of Switching Activity at Logic Level
B.2 Weighted Switching Activity of timed and untimed Simulation
B.3 Resistive Open-Transistor Fault Simulation Results
B.4 Capacitive Fault Simulation Results
B.5 Multi-Level Simulation Runtime of different Mixed-Abstraction Scenarios
List of Abbreviations

ATPG Automatic test pattern generation
CAT Cell-aware test
CCC Channel-connected component
CLF Conditional line flip
CMOS Complementary metal-oxide semiconductor
CUDA Compute unified device architecture
DSPF Detailed standard parasitic format
EDA Electronic design automation
ELF Early-life failures
EM Electro-migration
FAST Faster-than-at-speed test
FinFET Fin-type field-effect transistor
FLOP Floating-point operation
GND Ground voltage potential
GPU Graphics processing unit
HCI Hot-carrier injection
IFA Inductive fault analysis
MEPS Million (node) evaluations per second
MIS Multiple-input switching
MOSFET Metal-oxide-semiconductor field-effect transistor
NBTI Negative-bias temperature instability
PPSFP Parallel-pattern single fault propagation (fault simulation)
ROI Region of interest
RTL Register-transfer level
SDF Standard delay format
SM Streaming multi-processor
SPICE Simulation program with integrated circuit emphasis
STA Static timing analysis
STF Slow-to-fall
STR Slow-to-rise
TDDB Time-dependent dielectric breakdown
VDD Supply voltage potential

Chapter 1

Introduction

In modern semiconductor development cycles of nanometer CMOS integrated circuits, the simulation of circuits is one of the most important and complex tasks for design and test validation. Circuit simulation is used in design validation to analyze developed designs with respect to validation targets and to indicate compliance with their specification or customer requirements. In test validation, on the other hand, fault simulation is utilized to evaluate the defect coverage of test sets and new test strategies, as well as to assess the quality of the tested products. For this, electronic design automation (EDA) tools for design and test validation have to rely on repeated, compute-intensive circuit simulations with extensive runtimes, which can pose a bottleneck, especially for designs with billions of transistors [ITR15].
Simulation itself is best described as "performing experiments on a model" of the real (physical) world [Cel91, CK06]. The models for simulation typically provide a set of properties and rules that represents and explains parts of the complex real-world behavior in an abstract manner. Once an algorithm has been derived from the model, a simulation program (simulator) can be efficiently utilized to perform and observe experiments on the model repeatedly, for any given input parameters. The output of the simulation then provides data that can help to make predictions regarding the real-world behavior. This way, knowledge can be obtained to validate real-world hypotheses without the need for costly (and often destructive) experiments in the real world.

For example, circuit simulation provides abstract models of the real physical silicon chips, which allow experiments to be run to predict and observe the behavior of prototype designs and to compare it to the specification. The designer can then draw conclusions from the results to assess the functionality provided by the prototype design, its correctness, as well as other non-functional aspects (i.e., timing, reliability, power) [WH11, Wun91]. Fault simulation is an extension of circuit simulation in which the behavior of arbitrary defects in a real physical chip is abstracted to faults. Each fault describes the altered chip behavior with respect to a given defect mechanism, which is reflected as abstract deterministic misbehavior in the underlying circuit simulation model. The abstraction of a simulation model itself is crucial to avoid dealing with the overall complexity of the real physical world and to make an evaluation feasible at all. While a high accuracy of a simulation model is generally desirable, it can make simulations infeasible due to a high time and space complexity of the evaluations.
By constraining the model to certain aspects and by using simplified assumptions, the cost of the simulations can be lowered and the evaluation can be focused on particular problems of interest. On the other hand, too much abstraction reduces the accuracy of the models and thus the viability of the simulation and its results.

Challenges

With the continuing advancements in semiconductor manufacturing processes and the shrinking technologies of digital integrated CMOS circuits, traditional circuit and fault simulation models have become insufficient. Nowadays, thorough design and test validation as well as diagnosis require more and more accurate simulations and have thus become complex and runtime-intensive tasks, since several aspects have led to an increase in the overall modeling and simulation effort:

Complexity of Circuit Size

Over the past decades, semiconductor manufacturing technologies and processes have made continuous advancements that still allow for an ongoing growth in complexity and higher integration of modern integrated circuits [ITR15]. Although the end of Moore's Law [Moo65, Moo75] has already been predicted [RS11], newer process technologies, such as the fin-type field-effect transistor (FinFET), enable a scaling of structure sizes down to only a few nanometers and production with higher transistor density [Kin05, Hu11]. For example, in 2017 NVIDIA announced a new generation of parallel GPU accelerators: the NVIDIA Volta architecture. The GV100 GPU chip of this family is manufactured in a 12 nm FinFET technology and comprises 21.1 billion transistors on a die of 815 mm² [NVI17d]. To validate such massive designs, scalable parallel simulation algorithms are required that are able to cope with the increasing design complexity and its millions or billions of transistors.
Complexity of Circuit Behavior

While technology scaling causes circuits to grow in size and functionality, the behavior of standard cells and circuit structures is becoming more and more complex as well [ITR15]. With the increasing demand for high-performance and low-power embedded designs, chips today are running at their operational limit [SBJ+03, DWB+10, KET16, GKET16]. Furthermore, circuit structures are becoming increasingly prone to process [ZHHO04, QS10] and parameter variations (e.g., voltage and temperature [BPSB00, BKN+03, TKD+07]), which can impact the reliability of the circuit and the effectiveness of tests if neglected [PBH+11, HBH+10, CIJ+12]. Therefore, accurate simulation models are required that deliver sufficiently high precision and detail to capture both functional and non-functional aspects of the circuit with as little abstraction as possible. The circuit models must support an accurate representation of the temporal behavior to reflect hazards and glitches, signal slopes, and parametric and parasitic dependencies. Also, pattern-dependent delays and multiple-input switching effects should be considered, as they can significantly alter the circuit delay [MRD92, SDC94, CS96]. However, such timing-accurate evaluations involve the vast processing of many real numbers through complex arithmetic operations, which is several times more expensive in terms of computational complexity compared to untimed logic simulation [SJN94] and also much harder to parallelize [BMC96].

Complexity of Defect Modeling

The complexity of a simulation quickly rises further as soon as the behavior of defects in a circuit needs to be investigated in fault simulation. While classical fault models and simulation approaches have been extensively studied and optimized in the past [Wun10, Wun91, WLRI87], modern semiconductor manufacturing processes have to deal with several new fine-grained types of defects [LX12, RAH+14, HRG+14, ITR15].
Small delay defects (SDD) [TPC11, RMdGV02], marginal devices [RAH+14, KCK+10, KKK+10] and wear-out-related effects due to circuit aging [LGS09, APZM07, BM09, GSR+14] impact the timing behavior of the circuit and require timing- and glitch-accurate evaluation [Kon00, YS04, HS14], especially when process variations are considered on top [PBH+11, HBH+10, CIJ+12]. While certain validation tasks can utilize formal verification methods [DG05], the modeling and evaluation of low-level faults and parameters in explicit simulation is much more intuitive and straightforward [Lam05], especially when a validated formal model does not exist. Moreover, the faulty behavior of many realistic defects often cannot be represented at higher abstraction levels, as it strongly depends on the physical layout as well as the low-level topology and technology of the standard cells [SMF85, FS88]. For this reason, cell-aware test (CAT) approaches have recently been developed [HRG+14] that extensively rely on expensive low-level analog fault simulations with user-defined fault models (UDFM). Thus, it is important to simulate faults with little abstraction as accurately and efficiently as possible [CCCW16].

Parallelism as a Remedy

Assume that a circuit of N nodes with a set F of faults needs to be simulated for a pattern set T under different operating parameters P. In a naïve flow, a total of N×|F|×|T|×|P| individual simulation problems have to be solved, each of which corresponds to complex arithmetic evaluations of the timing behavior at a node. Thus, with growing design size, the number of simulation problems rises quickly, since the number of nodes, faults and test patterns also increases. Although circuit and fault simulation are inherently parallelizable [IT88, BBC94, BMC96], parallelization has risen to a whole new level with the introduction of general purpose computing on graphics processing unit (GPU) accelerators [OHL+08].
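To get a feeling for the scale of such a naïve flow, the following Python sketch computes the total number N×|F|×|T|×|P| of individual simulation problems. The design sizes used here are made-up, purely illustrative values, not figures from this thesis:

```python
def naive_simulation_problems(n_nodes, n_faults, n_patterns, n_parameters):
    """Total number of individual simulation problems in the naive flow:
    every node is evaluated for every fault, pattern and parameter set."""
    return n_nodes * n_faults * n_patterns * n_parameters

# Assumed illustrative sizes: 1M nodes, 100k faults, 10k patterns, 10 corners.
total = naive_simulation_problems(10**6, 10**5, 10**4, 10)
print(total)  # 10**16 individual evaluations
```

Even at a (generous) rate of a billion evaluations per second, such a workload would run for months, which is exactly why the massive parallelism discussed next is needed.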
In contrast to conventional processors (e.g., the IBM Power9 architecture with 24 cores [STKS17]), GPU architectures provide thousands of compute cores on a single device for running millions of threads concurrently [NVI17d]. With this amount of compute cores, GPU-based systems attain unprecedented arithmetic computing throughput in the order of petaFLOPS¹ on a single compute node, which has become well established in high-performance computing for speeding up many applications [NVI18c]. Many approaches that parallelize logic level simulation on GPUs for acceleration have been published [GK08, Den10, KSWZ10, CKG13, WLHW14, LTL18]; they isolate and partition the simulation problems for independent parallel execution by individual threads, but only very few are able to consider actual timing. Analog simulation with GPU support [GCKS09, CRWY15, CUS19] and full GPU-accelerated SPICE implementations [HZF13, HF16, vSAH18] have been proposed as well. However, existing parallel approaches for either logic level or analog simulation are mainly optimized for speeding up single simulation instances only. Since the number of simulation problems continues to rise, it is of utmost importance that the approaches are not only efficiently parallelizable, but also scalable by providing high simulation throughput to make the evaluation of large designs for more complex algorithms feasible.

Timing Modeling at Logic Level

For the calculation of the full signal switching histories in a circuit, traditional unit delay models [Wun91] have become insufficient. Instead, timing-accurate simulation models are required that are able to process real-valued gate delays for individual gate pins and signal switches [IEE01a].
Since the execution of bare arithmetic floating-point instructions has much higher latency compared to the bit-wise logic operations of traditional untimed logic simulation, timing-accurate simulation of a million-gate circuit for a test pattern set can take hours or even days to complete [SHK+15]. On GPUs, long-latency processes and arithmetic operations can be hidden by the massive computing throughput provided by the parallel computing cores [OHL+08]. For this, massive parallelization needs to be exploited to effectively speed up extensive logic level simulations with timing-accurate and variation-aware delay models for large designs and test sets.

¹ 1 petaFLOPS = 10^15 floating point operations per second

Switch Level Modeling and Algorithms

Timing-accurate and lower-level models are desirable for accurate simulation of digital integrated CMOS circuits. While simulation at logic level lacks accuracy, SPICE simulation lacks scalability even when accelerated by parallelization. As a trade-off, switch level simulation was introduced [Bry87, Hay87, BBH+88, MV91], which drastically reduces the complexity of the evaluation compared to full analog simulation. While not being as accurate, switch level modeling reflects many properties of CMOS circuits and standard cells that cannot be modeled at logic level. This has recently drawn attention in cell-aware test (CAT) generation to reduce the analog fault simulation overhead [CCCW16, CWC17]. Yet, simulation at switch level is still an order of magnitude slower compared to logic simulation [SJN94] and therefore poses a strong bottleneck in validation tasks, which makes it a prime example for acceleration on GPUs. However, to the best of the author's knowledge, no algorithms to accelerate switch level simulation on GPUs exist.

Defect Modeling and Fault Simulation

Many defect types exhibit a varying faulty behavior based on the size or magnitude of the defect parameter.
The behavior of such a parametric defect can range from a hard functional failure of the circuit to a weak impact on the non-functional timing behavior that only affects the circuit performance [RMdGV02, FGRC17]. With the continuously shrinking circuit structures, the probability of such defects as well as their severity with respect to circuit reliability rises [ITR15], while at the same time the faults are becoming harder to detect [LX12, TTC+15]. Thus, thorough test validation requires a fine-grained modeling of the defect parameters to capture deviations in both functional and non-functional aspects of the circuit (e.g., small delays [YS04, LKW17, KKL+18]). To reflect the behavior of more realistic defects found in CMOS cells, faults must be modeled at the lower levels without introducing too much abstraction [FS88, CCCW16, CWC17]. This vastly increases the computational overhead, as the defect mechanisms not only get more complex, but also many simulations of different individual faults are necessary (e.g., for fault coverage estimation). Thus, the need for parallelization of fault simulation is unavoidable [IT88, Wun91]. With their large amount of parallel arithmetic compute cores, GPUs provide high arithmetic computing throughput that is promising for simulating even large designs for many faults and test patterns with full timing accuracy.

Multi-Level Simulation

Modeling and simulation at higher abstraction is fast, but the approaches are limited in accuracy such that often timing and defects cannot be reflected properly [RAH+14]. On the other hand, lower level approaches can provide sufficient accuracy, but exhibit significant runtime complexity and resource requirements, such that these approaches cannot be used solely for a system-wide application (i.e., system-level test validation [Che18, JRW14]).
However, advantages from both higher- and lower-level abstractions can be combined by using multi-level techniques. A multi-level simulation provides a combined simulation across multiple different levels of abstraction [GMS88, MC95, RK08, KZB+10, HBPW14]. Accurate and compute-intensive evaluations are typically restricted to a particular region of the circuit [HBPW14, KZB+10] or to single cells [CCCW16], whereas the remainder is processed using faster higher-level simulations. Intermediate simulation data is usually exchanged at abstraction boundaries by data aggregation and mapping to the respective higher- or lower-level inputs [JAC+13, HBPW14, CCCW16]. Thus, compared to full simulations at the lowest abstraction, multi-level simulations can provide a trade-off in terms of speed and accuracy and make certain simulation scenarios feasible in the first place [JAC+13, HKBG+09, CCCW16].

Applications

High-throughput timing-accurate circuit and fault simulation is helpful to accelerate applications in a variety of areas that heavily rely on extensive simulations.

Process variability and timing variation: With the increasing sensitivity of circuits towards process and parameter variation, the need for statistical evaluations for performance and reliability assessment rises. Under variation, the circuit timing is severely altered [BCSS08, GK09], which tampers with the effectiveness of tests [PBH+11]. By including parameter dependencies in the delay processing, such as supply voltage and ambient temperature [DJA+08, SAKF08], or process variations [ABZ03, ZHHO04], statistical timing analyses [WW13] and statistical fault coverage analysis [SHM+05, CIJ+12] can be realized by large-scale Monte-Carlo evaluations in highly parallel timing simulation.

Non-functional properties: The timing-accurate switching histories from the simulation can be utilized to analyze non-functional properties in a design.
Both thermal design power [HGV+06] and IR-drop from switching activity hotspots [SBJ+03, MSB+15, DED17a] impact the reliability of designs and tests [WYM+05, ZWH+18] and thus need to be evaluated. Moreover, vulnerabilities with respect to single-event upsets [BKK+93] and aging effects, such as bias temperature instability (BTI) [GSR+14] or hot-carrier injection (HCI) [LGS09, ZKS+15], can be investigated using timing-accurate signal histories.

Power estimation: The static and dynamic power consumption of a design can be estimated from the circuit signals and switching activity [GNW10]. Power consumption tampers with the functionality of the design not only during functional operation, but also during the application of tests [WYM+05, AWH+15]. While previous approaches relied on untimed logic simulation for complexity reasons, the power of GPUs enables timing-accurate power estimation for many patterns in parallel. This way, more accurate power-aware test schemes and test pattern selection can be developed [HSW+16, HSK+17, ZWH+18].

Scan test: Timing-accurate simulation is necessary to validate delay tests for small delay defects and marginalities, i.e., faster-than-at-speed test (FAST) [HIK+14, KKL+18]. However, in scan test each test pattern undergoes a long session of consecutive shift cycles before the test can actually be applied and evaluated [GNW10]. During the shift phase, IR-drop-induced shift errors from delayed signals [YWK+11] as well as skewed clocks [Sho86] can cause errors during the shift-in and shift-out phases of the test pattern application. Again, with the high-throughput computing capabilities of GPUs, parallelization can be utilized to accurately evaluate full test sets with millions of shift cycles concurrently.
Goal of this Thesis and Organization

The document at hand presents highly parallel simulation models and simulation algorithms for fast and scalable timing simulation of nano-electronic digital integrated CMOS circuits on graphics processing units (GPUs). The main contributions of this thesis are summarized in Fig. 1.1. Compact functional as well as timing-accurate temporal modeling are combined with highly parallelized simulation algorithms designed to run on GPU architectures to exploit their massive computing throughput. These building blocks provide the foundation for developing high-throughput simulation algorithms for timing-accurate logic level and switch level simulation with parallel execution on the GPUs.

[Figure 1.1: Contributions of this work — compact functional modeling, timing-accurate temporal abstraction, efficient simulation algorithms, and high-throughput parallelization on massively parallel graphics processing units (GPUs), combined into efficient high-throughput parallel timing and fault simulation of nano-electronic digital circuits: logic level timing simulation, switch level timing simulation, small delay fault simulation, transistor level fault simulation, and multi-level timing and fault simulation [HSW12, HIW15, SHWW14, SHK+15, SKH+17, SW16, SW19a, SW19b, SKW18].]

The presented simulation models and algorithms are further extended for fault simulation of small delay faults at logic level as well as low-level parametric and parasitic faults at switch level. These allow to fully exploit the massive parallelism found in circuit and fault simulation on the GPU architectures. Finally, on top of these, the extended simulation models and algorithms are ultimately combined to form an efficient multi-level fault simulation flow that gains additional simulation throughput and efficiency for the lower level fault simulation through the use of mixed abstractions during simulation.
By simultaneously exploiting different dimensions of simulation parallelism, the presented parallel simulation techniques achieve unprecedented simulation throughput on GPUs, thereby enabling scalable timing and fault simulation of multi-million gate designs, which effectively reduces simulation time from several hours or even days to minutes only.

The structure of this document is organized into three parts: The first part (Chapters 2–4) introduces background on circuit and fault modeling as well as simulation. Furthermore, it briefly summarizes the key aspects of the data-parallel many-core computing architecture of graphics processing units (GPUs) and their programming paradigm, along with an overview of state-of-the-art GPU-accelerated simulation techniques.

The second part of this document covers the high-throughput timing and fault simulation, which is the main contribution of the thesis, and is composed of the following chapters:

Chapter 5 ("Parallel Logic Level Time Simulation on GPUs") presents the first timing-accurate and variation-aware logic level time simulation on GPUs and outlines the underlying algorithms, kernels and general data structures for multi-dimensional parallelization with high simulation throughput.

Chapter 6 ("Parallel Switch Level Time Simulation on GPUs") introduces the first parallel switch level time simulation for GPUs, which utilizes first-order electrical parameters found in CMOS circuits for a more accurate modeling of the functional and timing behavior of CMOS standard cells.

Chapter 7 ("Waveform-Accurate Fault Simulation on GPUs") describes the extension of the logic and switch level simulation algorithms for fault modeling, syndrome computation and parallelization for the first detailed and comprehensive parallel simulation of timing-related faults.
Chapter 8 ("Multi-Level Simulation on GPUs") addresses the combination of the presented parallel logic and switch level algorithms in the first GPU-accelerated multi-level simulator, which effectively enhances the performance and efficiency of the lower-level timing and fault simulation.

The third and last part (Chapters 9–13) discusses the evaluation of the presented simulation methodologies as well as example applications in five chapters. Finally, the last chapter provides a summary of all contributions of this thesis and outlines directions for future work and possible extensions.

Part I: Background

Chapter 2: Fundamental Background

This chapter summarizes the necessary background and fundamentals of circuit and fault modeling as well as simulation of digital integrated CMOS circuits as found in basic literature [Wun91, BA04, WWX06, WH11]. First, background on standard CMOS technology is provided, followed by circuit and delay fault modeling on different levels of abstraction. Finally, simulation algorithms and multi-level simulation methods are briefly discussed.

2.1 CMOS Digital Circuits

Today's semiconductor manufacturing processes for digital integrated circuits mainly rely on complementary metal-oxide semiconductor (CMOS) technology. CMOS uses two types of metal-oxide-semiconductor field-effect transistor (MOSFET) devices, namely NMOS and PMOS transistors, which can basically be considered as voltage-controlled switches. NMOS and PMOS transistors are utilized to form standard cells in standard cell libraries [Nan10, Syn11], which are the foundation of contemporary semiconductor design. For more details on the structure and mechanics of transistors as well as the manufacturing processes and design of standard cells, the reader is referred to [WH11]. In modern semiconductor process technology nodes, conventional planar MOSFETs are being replaced by the recent fin-type field-effect transistor (FinFET) technology [HLK+00, Kin05, BC13].
FinFET devices allow for scalable manufacturing of feature sizes beyond 20nm, since they have a more compact layout with a faster switching behavior compared to traditional transistors [YMO15]. This enables the production of more complex designs with higher transistor density that can further be run at higher clock frequencies to achieve even higher device performance. Although the physical structures of planar MOSFETs and FinFETs are different, their basic working principles and design flows are similar.

2.1.1 CMOS Switching and Time Behavior

In today's nanometer regime, it is often of utmost importance that designs meet a certain performance requirement given by a minimum system clock frequency. The highest frequency a design can reliably run at is usually determined by the length of the critical path. The critical path accumulates all the propagation delays of standard cells and interconnects on the longest sensitizable path from a primary or pseudo-primary input to a primary or pseudo-primary output in the design.

In general, the delays of a standard cell are distinguished into propagation delays and rise/fall times as shown in Fig. 2.1, which depicts the typical input and output waveform of an inverter cell in a small circuit (15nm FinFET technology [Nan14, BD15] powered with a supply voltage of 0.8V). The maximum signal amplitude on the left axis has been normalized. As shown, the signal switching processes do not occur instantly, but rather as a function of voltage over time, as a result of parametric and parasitic electrical elements in the circuit. The change of the voltage of a signal is typically modeled by differential equations that describe the electrical charging and discharging processes of capacitances in the circuit [WH11].
[Figure 2.1: Rising/falling propagation delay (dr, df) and rise/fall times or slopes (dlh, dhl) of an INV_X1 CMOS inverter cell [Nan14, BD15], shown as normalized input and output signal waveforms over time with 20%, 50% and 80% thresholds. (Adapted from [WH11].)]

The propagation delay is defined for rising (low-to-high) and falling (high-to-low) output transition polarities (dr and df) and describes the time it takes for a signal transition at a cell input to propagate to the cell output. It is measured from the time point where the input signal reaches the 50% mark of the maximum signal amplitude (i.e., 0.5 · VDD as halfway between VDD and GND potential) to the point where the output crosses 50% of the maximum signal amplitude in response [WH11]. Moreover, as shown in Fig. 2.1, signal transitions are not instantaneous, but follow a certain continuous transition ramp or slope, given by the rise time dlh for rising and the fall time dhl for falling transitions. The rise time (and fall time) of a signal transition is defined as the time it takes to transition from 20% to 80% of the maximum signal amplitude (and vice versa), though these thresholds can vary throughout the literature (e.g., 10%–90%) [WH11].

With the shrinking manufacturing process technologies and increasing complexity, it is important to accurately estimate the possible timing of contemporary and future designs. Thus, suitable models are required to reflect and consider CMOS-related timing effects as early and as accurately as possible during the design phase. One approach to estimate the delay of a circuit is the RC delay model [WH11], which approximates the non-linear characteristics of transistors considering average resistances and capacitances of the circuit nodes.
In the RC delay model, the circuit netlist is transformed into an electrically equivalent RC circuit, where all transistors and interconnects are replaced by simple resistors and the interconnect and gate capacitances in the fanouts are replaced by (in the simplest case) lumped capacitances. A switching transistor causes a change in the state of the RC-model elements by changing the corresponding resistance. This triggers a transient at the output that is typically computed using the first-order or second-order step response [Elm48, WH11]. Hence, in case of the rising transient of Fig. 2.1, the step response vrise beginning at some time t0 can be modeled for times t > t0 as

    vrise(t) := (GND − VDD) · e^(−(t−t0)/τ) + VDD,    (2.1)

where τ is the time constant computed as τ := R · C, with R being the effective resistance of the driving PMOS transistor and C being the load capacitance. The propagation delay dr is then derived from the output signal vrise at the time it crosses 0.5 · VDD, which is computed as dr := τ · log 2 [WH11]. Similarly, the falling transient vfall beginning at some time t1 is calculated for t > t1 as

    vfall(t) := (VDD − GND) · e^(−(t−t1)/τ) + GND,    (2.2)

in which case the resistance R for the time constant τ := R · C corresponds to the effective resistance of the conducting NMOS transistor, resulting in an estimated falling propagation delay of df := τ · log 2. Both of the modeled rising and falling transients expressed by vrise and vfall are illustrated in Fig. 2.2. The axes have been normalized by τ for the time and by VDD for the voltage, respectively. For the sake of simplicity, it was assumed that the internal resistances of NMOS and PMOS were equal.

[Figure 2.2: First-order step response for rising and falling transients modeling the continuous charging/discharging process of an output load capacitance, with normalized output voltage over time in units of τ and the 50% crossing at τ log 2. (Adapted from [WH11].)]
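The first-order step responses of Eqs. (2.1) and (2.2) can be sketched directly in code. The following Python snippet uses VDD = 0.8 V as in the inverter example above; the time constant τ is an assumed illustrative value. It also confirms numerically that the output crosses 0.5 · VDD after τ · log 2:

```python
import math

VDD, GND = 0.8, 0.0  # supply and ground potentials in volts (as in the example)

def v_rise(t, t0, tau):
    """Rising first-order step response, Eq. (2.1): charges from GND to VDD."""
    return (GND - VDD) * math.exp(-(t - t0) / tau) + VDD

def v_fall(t, t1, tau):
    """Falling first-order step response, Eq. (2.2): discharges from VDD to GND."""
    return (VDD - GND) * math.exp(-(t - t1) / tau) + GND

tau = 10.0               # assumed R*C time constant in picoseconds (illustrative)
d_r = tau * math.log(2)  # propagation delay: time of the 50% crossing
assert abs(v_rise(d_r, 0.0, tau) - 0.5 * VDD) < 1e-12
assert abs(v_fall(d_r, 0.0, tau) - 0.5 * VDD) < 1e-12
```

As a side note, the 20%–80% rise time of such an exponential ramp evaluates to τ · log(0.8/0.2) = τ · log 4, which connects the slope definition of Fig. 2.1 to the same time constant.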
2.1.2 Circuit Performance and Reliability Issues

With the shrinking feature sizes of newer technologies, semiconductor manufacturing processes get more complex and increasingly prone to process variations and defects [SSB05, PBH+11]. Since structures such as the fins of FinFET transistors have a size of only a few nanometers, spot defects caused by particles or by statistical processes in the different manufacturing steps, such as dopant fluctuations, are getting more and more frequent and influential. These problems can severely alter the physical layout of the circuit structures and lead to improper connections in the design. Moreover, transistor-related parameters, such as the threshold voltage and electron mobility, can shift due to improper manufacturing. The resulting defects, such as opens and shorts in either fins or gates [LX12], as well as variation artifacts [HTZ15], primarily impact the timing of the circuit [TTC+15, FGRC17] and ultimately affect the device performance and reliability by introducing delay faults.

Certain parameter shifts in circuit structures exhibit small delays that sometimes indicate circuit aging or the presence of a marginal device [KCK+10]. For example, when a design is put under stress (i.e., high switching activity or mechanical stress) after deployment, structures in the circuit can fail due to device degradation and transistor aging [BM09]. These wear-out mechanisms appear gradually as a result of continuing static and dynamic stress in the circuit, i.e., by certain stimuli or circuit activity, as well as environmental conditions, such as temperature. Typical degradation mechanisms are Negative-Bias Temperature Instability (NBTI) [LGS09, GBRC09], Hot-Carrier Injection (HCI) [EFB81, GSR+14], Time-Dependent Dielectric Breakdown (TDDB) [DGB+98] and electromigration (EM) [Cle01].
The effects of the degradation range from changes in the electrical properties of the transistor devices and interconnects [LQB08] that result in delay faults [BM09] up to hard failures, such as voids or bridges due to metal atoms transported by EM. Imminent wear-out failures can be predicted using specialized hardware (e.g., aging monitors [APZM07, LSK+18]) once the circuit has sufficiently aged or degraded. Wear-out can also be delayed or mitigated by reducing the stress patterns in the circuit to prolong the system lifetime [ZBK+13].

Failures in the early-life phase of a design are related to infant mortality among circuits, where marginal devices have passed the conventional manufacturing test but break soon after initial deployment. These failures are called early-life failures (ELFs). Marginal devices are usually tested using so-called burn-in tests [VH99] that strongly exercise the circuit under extreme stress conditions to expose the ELFs. The indicators of ELFs can be located deep inside the circuitry and can be barely noticeable as small delay deviations that are much smaller than the clock period and not testable under at-speed conditions [KCK+10]. This led to the recent development of faster-than-at-speed test (FAST) [HIK+14, KKS+15] to explicitly test for these hidden delay faults (HDF) in order to identify marginal devices. Therefore, to assess and validate the timing of today's designs as well as to thoroughly test for timing-related defects, designers and testers have to rely on timing-accurate models of the circuits and simulation algorithms for test and diagnosis [TPC11].

2.2 Circuit Modeling

Today's multi-billion transistor circuit designs are typically not manufactured from scratch using transistors, but rather starting from high-level specifications that describe the behavior in an abstract manner [Wun91]. As shown in Fig. 2.3, the circuit design undergoes different representations at various levels following a V-shaped model.
During the top-down design phase, abstract high-level descriptions are refined step by step by synthesis tools that systematically add more and more implementation details by mapping to lower level structures [ABF90, BA04, WWX06]. For example, at the highest level of abstraction, only the desired function is specified and the actual design itself is treated as a black box, since no physical implementation details are present. At this level, modeling and simulation are fast and simple, but the accuracy is the lowest. When moving down towards the lower abstraction levels, the required gates to realize the specified function or even the layout of complete transistor netlists with physical and electrical properties are determined. This drastically increases the modeling complexity and, of course, the simulation effort, but ultimately leads to more accurate models and evaluations.

[Figure 2.3: Overview of the abstraction levels of a circuit in design and validation — from system/behavioral level through register-transfer, logic and switch level down to electrical level; top-down design refines, bottom-up validation abstracts; simulation speed is highest and detail/accuracy lowest at the top, and vice versa at the bottom.]

2.2.1 Electrical Level

The electrical level provides the highest accuracy among the commonly used simulation models, as it reflects the physical properties and behavior based on the layout. The circuit is modeled using netlists of electrical components and devices, such as transistor models, resistors, capacitors, inductors and voltage sources. Corresponding simulators usually consider geometric information of the physical layout to reflect spatial dependencies of power-grid structures and capacitances between neighboring interconnection lines. Internal node voltages and currents are typically calculated through compute-intensive nodal analyses and differential equations using numerical integration methods.
This way, highly waveform-accurate transient analyses can be provided by electrical simulation, which also allows predicting the signal integrity as well as the power consumption of a circuit [WH11]. Modeling and simulation at the electrical level is typically performed using SPICE [NP73], which has become established as the industry golden standard. Common standard cell libraries [Nan10, Syn11] already provide SPICE models of the implemented standard cells, which utilize transistor device model cards that can include hundreds of parameters [DPA+17, Nan17]. However, the high detail and accuracy of electrical level modeling comes at the cost of vast runtime complexity and memory requirements.

2.2.2 Switch Level

Switch level modeling abstracts the electrical level behavior of the circuit by simplification and linearization of the transistor switching behavior [Bry84, Bry87, Hay87, Kao92]. The circuit is represented as an interconnected network of transistors and first-order electrical components, i.e., resistors and capacitors, as shown in Fig. 2.4. Transistors are treated as simple three-terminal devices that form discrete ideal switches, where the input at the gate terminal controls the connection between the drain and source terminals of the device. The resistors in the model reflect the static functional behavior, and the capacitors (sometimes referred to as wells) model the temporal behavior [Hay87]. Most switch level models distinguish between signal strengths and consider bi-directional signal flows as well as the modeling of charges inside the networks. At switch level, signal values are typically represented as pairs 〈s, v〉 of signal strength s ∈ N and signal value v. The signal values are typically expressed in ternary (three-valued) logic over the set E3 = {0, 1, X} [Hay86, BAS97] to distinguish between low (0), high (1) and undefined (X) signals.
The signal strength reflects the driving strength of the signal source, which is used to resolve multiple drivers at a node or to consider bidirectional signal flows as well as stored charges in the network.

[Figure 2.4: Inverter cell with CMOS implementation and switch level representation, where signal values are encoded as pairs 〈si, vi〉 of strength si and logic value vi [Hay87].]

Signals of strength s = 0 are the strongest signals and typically correspond to VDD or GND sources, whereas higher values of s indicate weaker signals. Stronger signal values always override weaker ones. Signals of the same strength, but with different values, result in an undefined value.

For the purpose of efficient switch level simulation of digital designs, a common practice is to partition the transistor netlists of circuits and CMOS cells into smaller sub-networks composed of ideal PMOS and NMOS transistor meshes, each forming a channel-connected component (CCC) [Bry87, DOR87, HVC97, BBH+88, BAS97, CCCW16]. The transistors within a CCC are interconnected via their drain and source terminals and connect the power supply (VDD) and ground (GND) sources, allowing for a bidirectional signal flow within the network. The outputs of a CCC are linked to an output domain that can lead to gate terminals of transistors in succeeding CCCs. However, it is assumed that no current can flow over a gate from one CCC into another.

Switch level evaluation algorithms are mainly based on nodal analyses and dependencies that are formulated using linear equations. These equations are solved using relaxation techniques with sparse matrix-vector products (SMVP) to first find a steady-state solution for the node voltages and node charges inside the network after each input switch [Bry87].
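The strength-based resolution rule for 〈s, v〉 pairs can be sketched compactly. The following Python fragment is an illustrative sketch of that rule only (not the evaluation algorithm of any particular switch level simulator): it resolves a set of drivers at a common node, letting the strongest driver win and yielding the undefined value X on a conflict:

```python
def resolve(drivers):
    """Resolve a list of <strength, value> pairs driving one node.
    Strength 0 is the strongest (VDD/GND sources); larger s means weaker.
    The strongest driver wins; equally strong drivers with different
    ternary values ('0', '1', 'X') produce the undefined value 'X'."""
    s_min = min(s for s, _ in drivers)                 # strongest strength present
    values = {v for s, v in drivers if s == s_min}     # values at that strength
    return (s_min, values.pop() if len(values) == 1 else 'X')

assert resolve([(0, '1'), (1, '0')]) == (0, '1')  # strong driver overrides weak
assert resolve([(1, '0'), (1, '1')]) == (1, 'X')  # equal strength, conflict -> X
assert resolve([(3, '1')]) == (3, '1')            # single (weak) driver wins
```

The same ordering rule is what lets a strength-0 VDD/GND source override a weakly driven or merely charged node in the network.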
In combination with timing analysis, node voltages are more accurately represented as a function of time, usually modeled by piece-wise approximations [CGK75, Ter83, Kao92]. In general, switch level models are less accurate than electrical level models, but they are able to reflect many important properties of CMOS technology which are not reflected in classical models based on two-valued Boolean or ternary pseudo-Boolean logic. Due to the higher speed of switch level simulation, it has replaced analog SPICE simulation in recent cell-aware test generation [CCCW16].

2.2.3 Logic Level

The logic level is the most dominant abstraction used in design and test validation of nano-electronic digital circuits [WWX06, BA04]. At logic level the circuit behavior is described as a netlist of interconnected Boolean gates. The netlist of a circuit is typically modeled as a directed graph G := (V, E). Each vertex v ∈ V corresponds to a node in the netlist, which corresponds to a Boolean gate, flip-flop, or circuit input or output, and basically represents a functional abstraction of a physical standard cell that determines the logical behavior. The edges (v, v′) ∈ E ⊆ V × V represent ideal interconnections from a source node v ∈ V to a sink node v′ ∈ V in the netlist and allow for signal flows. Incoming edges at a node refer to input pins of the corresponding gate, while outgoing edges indicate the information transport from the node output pin to its successors. Each edge e ∈ E is associated with a signal value that reflects the logical interpretation of the voltage level of the corresponding signal. Signal values are typically defined over two-valued Boolean logic B2 = {0, 1} [Hay86]. The logic values in B2 correspond to discrete signal states, where the logic symbol '0' corresponds to a low voltage (i.e., ground voltage potential GND) and the symbol '1' corresponds to a high voltage (i.e., power supply voltage potential VDD).
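The netlist-as-graph view can be made concrete with a minimal sketch; the dictionary-based netlist encoding and all identifiers are hypothetical conveniences, not a format from the thesis:

```python
# Minimal sketch of a netlist as a directed graph: each internal node maps to
# (node function, list of fanin nodes); primary inputs carry no function.

netlist = {
    'a': None, 'b': None,                      # primary inputs
    'n1': (lambda x, y: x & y, ['a', 'b']),    # AND gate
    'o':  (lambda x: 1 - x,    ['n1']),        # inverter driving the output
}

def simulate(netlist, pattern, order):
    values = dict(pattern)                     # assign input values first
    for n in order:                            # topological (levelized) order
        func, fanin = netlist[n]
        values[n] = func(*(values[p] for p in fanin))
    return values

v = simulate(netlist, {'a': 1, 'b': 1}, ['n1', 'o'])
assert v['o'] == 0                             # AND of 1,1 is 1; inverted to 0
```

Evaluating nodes in topological order guarantees that all fanin values are available before a node function is applied, which is exactly the levelization argument made below for compiled-code simulation.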
Multi-valued logic, such as ternary, four-valued, or even higher-order logic can be utilized to reflect additional states (e.g., unknown due to uncertainties) and dynamic behavior (e.g., transitions) of signals [Hay86]. The functional behavior of a node with k incoming edges is modeled by a Boolean function φ : B2^k → B2 that maps the values of the edges to an output value according to the function. At logic level, various delay models can be considered to describe the temporal behavior of a gate output signal with respect to input changes. Apart from zero-delay modeling, which reflects the untimed simulation case, where input changes at time t can cause a simultaneous change at the output, timing-aware delay models revolve around unit- and multiple-delay representations to reflect propagation delays for gates (sometimes referred to as transport delays) [ABF90]. Fig. 2.5 provides an overview of the timing-aware delay models on the example of a small inverter gate [Wun91]. In the unit delay model it is assumed that all gates have a constant propagation delay d := 1 that corresponds to one time unit. For input transitions at time t ∈ N, this can cause output transitions to occur at time t′ := (t + d) = (t + 1) in response. However, different gate types typically have different delays. Moreover, gates usually have different delays for rising and falling output transitions as well. Multiple-delay models reflect different transport delays dr for rising and df for falling output transitions, which are applied accordingly. Some delay models are able to consider propagation delays that are bounded by a min-max interval [dmin, dmax] ⊂ R [Wun91, BGA07].
In interval-based delay modeling, the actual delay of a gate is not known exactly, but it is estimated to be in the interval d ∈ [dmin, dmax].

Figure 2.5: Overview of common (timing-aware) delay models in logic level simulation. (Adapted from [Wun91])

This way, process variations and uncertainties that impact gate delays can be reflected. However, during simulation with such a model the uncertainty of different gates can quickly accumulate and lead to pessimistic simulation results [ABF90]. When the value at a gate output switches, the load capacitance associated with the standard cell implementation at the electrical level is (dis-)charged, which requires some time until the transition is finished. In case a pulse at a gate input is too short, the output does not have sufficient time to charge to the targeted potential, which can prevent the propagation of the pulse, such that the output value is sustained. A so-called inertial delay can describe the minimum pulse width ∆d ∈ R required to perform a full (dis-)charge of the output. For example, in Fig. 2.5, signal A has two consecutive transitions at times t1 := 1 and t2 := 3, which would lead to output transitions at times t′1 := 2 and t′2 := 4 under unit delay. In the last case, the required minimum pulse width is ∆d > 2, therefore the pulse is filtered out since the pulse width given by (t′2 − t′1) = 2 does not meet the requirement. Individual delays for each input pin of a gate can be considered in combination with rising and falling distinction. So-called pin-to-pin delay models keep a delay value di for each input pin i of a gate.
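The multiple-delay model and inertial pulse filtering described above can be sketched for a single inverter; the helper name, the event representation as (time, value) pairs, and the filtering strategy are simplifying assumptions made for illustration:

```python
def propagate(transitions, dr, df, inertial=0.0):
    """Apply rise/fall transport delays to the input transitions (time, value)
    of an inverter, then filter output pulses shorter than the inertial delay.
    Illustrative sketch of the delay models described in the text."""
    out = []
    for t, v in transitions:
        out_v = 1 - v                          # inverter: input rise -> output fall
        d = dr if out_v == 1 else df           # rising edges use dr, falling use df
        out.append((t + d, out_v))
    # inertial filtering: cancel edge pairs closer together than the threshold
    filtered = []
    for t, v in out:
        if filtered and t - filtered[-1][0] < inertial:
            filtered.pop()                     # pulse too short: drop both edges
        else:
            filtered.append((t, v))
    return filtered

# Rising input at t=1 falls at the output after df=1; falling input at t=3
# rises again after dr=2, as in the multiple-delay model above:
assert propagate([(1, 1), (3, 0)], dr=2, df=1) == [(2, 0), (5, 1)]
```

With a large inertial threshold the same input pulse is filtered out entirely, mirroring the pulse-filtering case in Fig. 2.5.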
An input transition at pin i at time t then propagates to the output at time t′ := (t + di) with the corresponding delay associated to the pin. The logic level timing information is usually provided as standard delay format (SDF) files [IEE01a], which are obtained during the synthesis of the design. Each SDF file typically contains absolute and incremental delay specifications describing specific delays for individual ports (PORT), the propagation delay from an input to an output pin (IOPATH) as well as wire delays (INTERCONNECT) between gate instances. In addition to the propagation delays, inertial delays can be specified (via PULSEPATH) that describe the minimum allowed pulse-widths at outputs for the application of pulse filtering [IEE01a, IEE01b].

2.2.4 Register-Transfer Level

At the register-transfer level (RTL) the circuit module descriptions are mapped to a coarser structural description composed of small interconnected components [Wun91]. The components can be registers or memory to store data and operands, and combinational networks to perform operations. Interconnections and buses define the data-flow between the individual components. The data exchange between memory elements is controlled in a synchronized fashion by clock signals. RTL-descriptions are typically expressed as data-flow in hardware-description languages, such as Verilog [IEE01b] or VHDL [IEE09]. While timing can be expressed in cycle granularity, fine-grained propagation delays are not considered at this level, since no implementation details are available.

2.2.5 System-/Behavioral Level

At system level the circuits are modeled with the highest abstraction. Abstract functional descriptions of circuit modules are typically expressed as algorithms and functions in higher level programming languages, such as SystemC [IEE12], which are compiled and run as separate processes.
Since no details or requirements with respect to the actual implementation of the design are implied, a reasoning about accurate timing is not possible.

2.3 Delay Faults

In test validation and diagnosis, fault models are used which abstract the effect of classes of physical defects in a chip for representation and processing on a higher abstraction level [Wun91]. As opposed to classical static fault models, such as stuck-at and bridging faults, a circuit cannot be tested for delay faults by using a single test pattern only [WLRI87]. Instead, delay tests with a minimum of two patterns composed of an initialization and propagation test vector are required. These patterns are applied in consecutive clock cycles to launch signal transitions at (pseudo-)primary inputs. The output responses are then captured after the clock period has passed, which might reveal output signals that are not yet stabilized. A generalized way to model arbitrary faults in logic simulation is the so-called conditional line flip (CLF) calculus [Wun10]. Each CLF has an activation condition cond, which is an arbitrary user-defined Boolean formula to determine the activation of the fault at a signal v. Once a CLF is active (condition cond is true), it flips (inverts) the value of v to v′ := v ⊕ [cond]. As a simple example: A stuck-at-0 fault is represented in the CLF calculus as v ⊕ [v] and a stuck-at-1 fault is represented as v ⊕ [¬v].

Transition Faults

The transition fault model [WLRI87] assumes a lumped defect at a gate input or output pin that causes a delay at the associated signal line by an amount of time larger than the test clock period (usually infinite), thereby omitting a transition. Transition faults can be distinguished as slow-to-rise (STR) or slow-to-fall (STF) faults that affect rising and falling transition polarities, respectively.
They are usually considered as conditional stuck-at faults, which can be efficiently implemented and simulated using parallel-pattern stuck-at fault simulation by activating the fault upon signal changes of subsequent patterns [Wun10, WLRI87]. Therefore, regarding the modeling and simulation, transition faults are independent of the actual circuit timing and gate delays. For example, let vi ∈ B2 be the value at a signal line after application of the initialization vector, and let vi+1 ∈ B2 be the value after the subsequent propagation vector. The fault activation at a signal line transitioning from vi to vi+1 upon the clock cycle is determined by active ⇔ (vi ⊕ vi+1) · ρ, where the term (vi ⊕ vi+1) indicates a change in the signal value and ρ determines the activation of a STR (ρ := vi+1) or STF (ρ := ¬vi+1) fault, respectively. Note that transition faults always affect all sensitized paths in the output cone of the fault site regardless of the actual timing. Therefore, the fault model fails to accurately represent delay faults of smaller quantities, e.g., small resistive open defects.

Path-Delay Faults

The path-delay fault model [Smi85] was proposed to model distributed delay effects in the circuit and assumes the accumulation of smaller lumped delays during signal propagation. In this model, a path of nodes {i, n1, n2, ..., nj, o} ⊆ V in the netlist graph G := (V, E) from a (pseudo-)primary input i ∈ I to a (pseudo-)primary output o ∈ O of the netlist is assumed to suffer from a path-delay fault if the accumulated delay along the associated path exceeds the clock period. The fault is activated iff the input signal transition is propagated along all nodes of the path to the output. As a result, the signal at the single affected output of the fault is expected to be erroneous. Like transition faults, path-delay faults can be distinguished as STR and STF.
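The CLF flip rule and the transition fault activation condition above translate directly into a few lines of code; the function names are illustrative, but the logic follows the formulas in the text (with ρ negated for STF):

```python
# Sketch of the CLF calculus and transition fault activation described above
# (helper names are illustrative, not taken from the thesis).

def clf(v, cond):
    """Conditional line flip: invert signal v iff the condition holds."""
    return v ^ (1 if cond else 0)

def transition_fault_active(vi, vi1, slow_to_rise):
    """active <=> (vi XOR vi+1) AND rho, with rho = vi+1 for STR
    and rho = NOT vi+1 for STF."""
    rho = vi1 if slow_to_rise else 1 - vi1
    return bool((vi ^ vi1) & rho)

# stuck-at-0 as the CLF v XOR [v]: the result is always 0
assert all(clf(v, v == 1) == 0 for v in (0, 1))
# An STR fault is activated by a rising 0->1 transition only:
assert transition_fault_active(0, 1, slow_to_rise=True)
assert not transition_fault_active(1, 0, slow_to_rise=True)
```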
Similarly, the segment delay fault model was proposed [HPA96, MAJP00] to model distributed delays along sub-paths and hence combines transition fault and path-delay fault behavior. A segment delay fault is modeled on a path segment of l connected gates {n1, n2, ..., nl} ⊆ V. If a signal transition enters the segment at node n1 and propagates through all of its gates, the fault is activated at the segment output node nl from where it behaves as a lumped transition fault. The simulation of path-delay faults does not require timing information and can thus be performed in untimed logic simulation or by sensitization analysis [PR08, Wun10]. ATPG-tools for path-delay faults typically constrain the set of relevant paths by topological analyses with timing information to identify the longest sensitizable paths through each node [ED12]. By doing so, even small lumped delays at gates can be tested for, whose fault size is larger than the slack of the tested path. However, not all faults can be tested robustly [Lin87, Kon00, ED11] and longest path delays can shift due to process variations [CIJ+12]. Thus, certain faults with smaller delays might be missed, requiring more accurate modeling and simulation of the faults.

Small (Gate-)Delay Faults

A small (gate) delay fault is considered as a manifestation at an input or output pin of a node that slows down the signal propagation through the respective pin by a small finite amount of time [IT88, TPC11]. These faults are mainly caused by weak resistive opens in the design [RMdGV02] and, in contrast to transition faults or path-delay faults, the delay impact of a small gate delay fault is much smaller than the clock period [NG00]. Fig. 2.6 illustrates the behavior of a small delay fault at a gate output in a circuit. After application of a delay test vector, transitions at the circuit inputs are launched which propagate through the input cone of the fault.
Eventually, the fault site is reached and the output transition at the fault site is delayed by the fault size. In the left example (a), the delayed output response is propagated through the output cone of the circuit over sensitized paths to three outputs. However, in the output response vector only two of these outputs (2 and 3) show an error, as the signals are captured before the last transition happens. Despite the additional delay due to the fault, the latest signal transition at the first output still arrives in time and a good response value is captured. If the small delay fault was simulated using a simplified transition fault model, all of the outputs of sensitized propagation paths would show an erroneous response. Hence, transition faults typically overestimate the fault coverage of small delay faults with smaller sizes and they are also too optimistic regarding the fault propagation and detection at outputs [IRW90]. In the right example (b) of Fig. 2.6, a non-robust propagation of a small delay fault in presence of reconvergence is illustrated. The reconvergent fault propagation causes a delayed hazard at the output and the fault can only be detected within a small detection window spanned by the hazard. This behavior is not captured by simple transition fault simulation, as the corresponding STF-transition fault (corresponding to a stuck-at-1 [WLRI87]) would lead to a constant-1 output signal with no detection window during simulation. Also, the CLF calculus [Wun10] is not suitable for expressing small delay faults in a practical way, due to the complexity of the temporal modeling. In general, for the fault activation a CLF assumes a single point in the circuit where it is conditionally activated and from which the faulty value is distributed throughout the sensitized output paths.
Hence, similar to the transition delay faults, all sets of sensitized paths are affected, which does not hold for general small delay faults. One way to fully cover a small delay fault behavior in the CLF calculus would be by introducing multiple-CLF faults in the fan-out of the actual small delay fault site (e.g., CLFs injected at reachable circuit outputs). Then, for each injected CLF all possible combinations of related activation paths and propagation paths need to be encoded in the conditions (not including false paths) to fully capture the timing violations that can occur for the small delay fault. However, this introduces a tremendous modeling complexity. In [PB07], so-called segment-network faults were introduced which consider multiple segment delay faults [HPA96, MAJP00] as a (sub-)tree in the netlist starting from a common root node.

Figure 2.6: Examples of a small gate delay fault in a circuit: a) robust fault-propagation to outputs over sensitized paths and b) non-robust propagation with output hazard.

If a signal is propagated from the root node along a segment, a fault is activated at the corresponding leaf node where the segment ends. While the activation results only from a single root node, the fault modeling provides more control in the propagation of the faults. No hazards are considered in the modeling, which can invalidate the detection. Therefore, in order to simulate smallest delay faults for test validation, explicit timing-accurate and glitch-aware simulation is required [CHE+08, BGA07].
Impact of Process Variations on the Fault Detection

Process variations cause inaccuracies during manufacturing, which can severely affect the functional and timing behavior of a circuit [Nas01, PBH+11]. The source of variation is typically distinguished as either of random or systematic nature. The background of random variation usually has a quantum mechanical origin. Both layout and material properties of the circuits are influenced by statistical processes and numerous uncertainties during chip manufacturing, which affect the electrical properties of the underlying structures. Random variation can influence the switching behavior of gates and interconnects in a circuit (intra-die), for example, by line-edge roughness (LER) artifacts, and can also increase the threshold voltage of transistors due to random dopant fluctuation (RDF). At logic level, this type of variation is typically modeled by assuming independent random variables as delays for each gate in a circuit [SSB05, BCSS08]. On the other hand, systematic variation takes into account the spatial and parametric dependencies of different process corners between dies (inter-die), wafers (wafer-to-wafer) or lots (lot-to-lot). This type of variation can affect gates with high spatial correlation or gates of the same type in a similar manner simultaneously [ABZ03, ZHHO04]. The sources of systematic variation relate to irregular material properties as well as limitations of the fabrication processes themselves (e.g., lithography, etching, polishing) that introduce a correlation between neighboring structures or their corresponding nodes in the design [SSB05, ABZ03]. These process variations can have severe impact on the gate delays as well as the circuit timing and ultimately affect the detection of delay faults [CIJ+12, SPI+14].
Therefore, simulation algorithms must be timing-accurate and variation-aware to allow for modeling of the impact of process variations on the delay for statistical timing analyses, variation-aware fault grading and pattern generation [CIJ+12, ZKAH13, SPI+14].

2.4 Circuit and Fault Simulation

A simple way to simulate a circuit is by compiled-code simulation [WWX06, BA04]. In compiled-code simulation the combinational netlist is translated into executable instructions for each node with all signal states being kept as variables. The circuit nodes are then evaluated by executing the assigned operations over their input variables. For a successful evaluation of each node, all of its input variables need to be determined first. Hence, the processing of the nodes usually follows a topological order to allow for a hassle-free evaluation, which is obtained by a so-called levelization pre-process that partitions the nodes into levels ordered by increasing topological distance (depth of the nodes) with respect to the circuit inputs [Wun91]. A basic simulation flow for the simulation of a test pattern tp ∈ B2^|I| is shown in Algorithm 2.1. During the process, first the input nodes I ⊂ V are assigned their corresponding values of the input pattern, followed by the ordered computation of the internal node signals until the outputs are reached. Due to the levelization, all input-dependencies of the nodes have been resolved. This type of simulation is typically referred to as oblivious simulation, since always every node of the circuit is evaluated upon the application of a new test pattern [BBC94].

Algorithm 2.1: Simulation flow of a plain oblivious logic simulation.
Input: netlist G = (V, E), test pattern tp with tp[i] value of input i ∈ I
Output: values vn for all n ∈ V
1 foreach node n ∈ V in topological order do
2   if n ∈ I then
3     Assign input value vn := tp[n].
4   else
5     Fetch values v1, v2, ..., vk of fanin(n) (direct predecessors of n).
6     Compute vn := φn(v1, v2, ..., vk).
7   end
8 end
9 return Stored values vn of all nodes n ∈ V.

2.4.1 Event-driven Simulation

Usually, when applying consecutive test patterns to a circuit, not all of the primary or pseudo-primary inputs of the circuit change and hence certain signals sustain their state. With the oblivious simulation scheme, this causes a lot of unnecessary node evaluations, since these nodes do not require recomputation, yet all nodes are always evaluated regardless of switching activity from signal changes [BA04, WWX06]. To provide a more efficient evaluation in cases of little switching activity, event-driven simulation approaches have been proposed [Wun91]. In event-driven simulation the evaluations are constrained to nodes with active switching events at their inputs. Thus, the evaluation only follows the path of events during simulation and thereby avoids (unnecessary) evaluation of nodes with constant signals. Traditional time simulators typically follow a synchronous event-driven time-wheel approach [Ulr69], which has proven well-suited for simulation at logic level. A different simulation approach for asynchronous event-driven simulation can be realized using the Chandy-Misra-Bryant (CMB) algorithm [CM79, Bry77]. As opposed to a global synchronous time schedule, the CMB algorithm assigns a time stamp to each event and utilizes message passing to distribute events from node outputs to input FIFOs of successor nodes. The evaluation of events at different nodes can be realized by individual processes concurrently, which can benefit from parallelization to provide speedup for simulating single circuit instances [BMC96]. In event-driven approaches, the handling of events can get quite complex, which quickly increases the runtime overhead when considering more detailed delay models [Wun91].
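The event-driven scheme described above can be sketched with a time-ordered event queue; the data structures, the unit-delay assumption, and the bookkeeping of evaluated nodes are illustrative simplifications, not the thesis' implementation:

```python
# Minimal event-driven sketch: only nodes whose fanin changed are
# re-evaluated, following the path of events through the circuit.

import heapq

def event_driven(gates, fanout, values, events):
    """gates: node -> (function, fanin list); events: list of (time, node, value)."""
    evaluated = []                              # record evaluations for illustration
    heapq.heapify(events)
    while events:
        t, node, v = heapq.heappop(events)
        if values.get(node) == v:
            continue                            # value unchanged: no event propagates
        values[node] = v
        for succ in fanout.get(node, []):
            func, fanin = gates[succ]
            evaluated.append(succ)
            new_v = func(*(values[p] for p in fanin))
            heapq.heappush(events, (t + 1, succ, new_v))  # unit delay assumed
    return values, evaluated

gates = {'g': (lambda a, b: a & b, ['x', 'y'])}
fanout = {'x': ['g'], 'y': ['g']}
values = {'x': 0, 'y': 1, 'g': 0}
values, evaluated = event_driven(gates, fanout, values, [(0, 'x', 1)])
assert values['g'] == 1 and evaluated == ['g']  # only the affected gate was evaluated
```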
For example, in inertial delay modeling, many events scheduled during processing might need to be reverted when processing later events in time. Also, the algorithms usually only speed up the simulation of single circuit instances by exploiting parallelism from independent gates. They cannot benefit from pattern parallelism [BBC94] through simultaneously evaluating multiple patterns in a data-parallel fashion, as they rely on sparse occurrences of events. Since gate-level parallelism can diminish at deeper levels, this strongly limits the simulation throughput and effectiveness of these approaches. Moreover, for the implementation on GPUs these algorithms demand highly complex control and dynamic memory management to process all the event lists in the circuit. However, frequent memory operations of the scheduling are expensive and will limit the effectiveness of an acceleration on GPUs [OHL+08].

2.4.2 Fault Simulation

In fault simulation, a circuit is simulated under the behavior of a given defect to determine whether the circuit behavior is altered by the fault [Wun91]. A naïve approach to simulate faults is through serial processing of the provided fault lists. In serial fault simulation, first a simulation of the circuit is performed to obtain the fault-free good-value simulation results for a test pattern. The resulting output responses vo of all circuit (pseudo-)primary outputs o ∈ O are considered as golden reference, or expected values of the fault-free circuit, which are stored for comparison. The good-value simulation is usually followed by repeated simulations of the pattern for various copies of faulty circuits in which different faults have been injected. The injection of a fault is performed by modifying the circuit description according to the abstracted fault behavior.
After simulation of a faulty circuit copy, the output responses of the faulty circuit v′o are compared against the golden reference vo to compute an output syndrome syno by

syno := vo ⊕ v′o, (2.3)

where ⊕ corresponds to the bitwise XOR-operation of the response bits of the outputs o ∈ O in the good and faulty response. The output syndrome contains all the differences in the faulty and the fault-free output response and thus indicates the fault detection of the applied test pattern. A fault f is considered as detected by the pattern iff syno = 1 for some output o ∈ O. If ∀o ∈ O : syno = 0, then f is undetected. Fault simulation is a challenging and compute-intensive task (especially when done exhaustively for all possible faults of a circuit) due to the additional dimension of complexity from the set of different faults. Often a process called fault dropping is applied, in which the simulation of a test pattern set for a fault is stopped as soon as the fault has been detected by a certain number n of test patterns (n-detect). The fault is then immediately removed from the fault list and the simulation is continued for the next. While this reduces the overhead of fault simulation, fault dropping cannot be applied for debug and diagnosis. Especially in diagnosis, often so-called fault dictionaries must be computed, which rely on exhaustive simulation of all faults for all patterns to obtain as much syndrome information as possible. Since the number of faults usually grows with the design size and technology, the increase of the fault simulation performance has been a subject of research in test ever since [WEF+85, WLRI87, AKM+88]. Different ways of improving the performance and speeding up fault simulation exist that exploit parallelism from pattern-parallel simulation and fault-parallel simulation and other optimizations, such as concurrent or deductive fault simulation.
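The serial fault simulation loop with output syndromes and fault dropping can be sketched on a toy circuit; the fault encoding as a (node, stuck value) pair and all helper names are hypothetical:

```python
# Sketch of serial fault simulation with output syndromes (illustrative only).

def fault_simulate(eval_fn, patterns, faults):
    """For each pattern: run a fault-free golden simulation, then one faulty
    simulation per remaining fault; compare via syn_o = v_o XOR v'_o."""
    detected = set()
    for tp in patterns:
        golden = eval_fn(tp)                               # fault-free reference
        for f in list(faults):
            faulty = eval_fn(tp, fault=f)
            syndrome = [g ^ v for g, v in zip(golden, faulty)]
            if any(syndrome):                              # fault observed at an output
                detected.add(f)
                faults.remove(f)                           # fault dropping (n = 1)
    return detected

# Toy circuit: single AND gate, output o = a AND b, with a stuck-at fault on 'a'.
def eval_and(tp, fault=None):
    a, b = tp
    if fault is not None and fault[0] == 'a':
        a = fault[1]                                       # inject stuck value
    return [a & b]

detected = fault_simulate(eval_and, patterns=[(1, 1)], faults=[('a', 0)])
assert detected == {('a', 0)}          # stuck-at-0 on 'a' flips the output of (1,1)
```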
For more information the reader is referred to the general literature on semiconductor design and test [Wun91, WWX06].

2.5 Multi-Level Simulation

In design and test validation tasks it is often necessary to evaluate parts of the design with higher accuracy. Although high accuracy is generally desirable for any task, continually simulating a circuit at the lowest level at all times can be troublesome, due to the increasing order of magnitude of runtime complexity on lower levels [SN90]. In multi-level simulation a design is simulated on more than one level of abstraction at the same time. This is achieved by utilizing mixed abstractions throughout the design during simulation and focusing the accurate and compute-intensive simulations on smaller parts of the design, which allows to reduce the overall runtime. Usually, the design is partitioned such that common parts are evaluated with fast high-level simulation, while critical components are processed at the lower abstraction levels with higher accuracy [RK08, KZB+10, SKR10, JAC+13, HBPW14]. Simulators with mixed abstractions usually utilize hierarchical circuit descriptions with the design being represented individually at each desired level of abstraction [SSB89, RGA87]. Those representations are then selected interchangeably for the components during simulation based on the desired accuracy, e.g., for the injection of low-level faults in mixed-level fault simulation [GMS88, MC93, MC95]. By doing so, designers working on individual modules of a circuit can focus the accuracy of the validation on their respective designs without the need of processing the whole circuit at the lowest abstraction, which eventually results in a much faster and more efficient simulation.
2.6 Summary

With increasing technology scaling and shrinking feature sizes, modern circuits become increasingly complex in terms of design size and also prone to newer defect types that often manifest in the timing behavior of the circuit (e.g., small delays). Hence, timing-accurate simulation has become a necessity for design and test validation. However, timing-accurate simulation itself is tremendously complex in terms of runtime, and classical approaches to tackle the afore-mentioned problems either lack accuracy (due to abstraction) or scalability (due to simulation effort). In order to make accurate and large-scale design and test validation feasible, the simulation algorithms have to be parallelized in a way that increases the throughput performance for coping with an increasing number of gates, test patterns and faults in the designs. This thesis investigates the parallelization of timing simulation on massively parallel graphics processing unit (GPU) architectures and provides timing-accurate algorithms for high-throughput simulation as well as multi-level approaches to further enhance the simulation efficiency.

Chapter 3 Graphics Processing Units (GPUs) – Architecture and Programming Paradigm

This chapter introduces the basic concepts of contemporary data-parallel graphics processing unit (GPU) accelerators. Both the general GPU architecture as well as the underlying parallel programming paradigm will be outlined. For better comprehensibility, NVIDIA's Compute Unified Device Architecture (CUDA) [NVI18b] programming model and the NVIDIA Pascal GP100 GPU architecture [NVI17c] are used as an example. The GP100 GPU is manufactured in a 16 nm FinFET technology from TSMC featuring 15.3 billion transistors on a die of 610 mm² size.

3.1 GPU Architecture

In recent years, general purpose programming on graphics processing units (GPUs) has established itself well in the context of high-performance computing (HPC).
Being used as co-processors, GPUs allow to accelerate computations of many compute-intensive scientific applications by exploiting massive parallelism [NVI18c, ND10, OHL+08]. While conventional CPUs typically provide fewer (usually 8 to 24 [STKS17]), but more general and latency-optimized processing cores, contemporary GPU accelerators consist of thousands of simpler cores that provide high computational throughput due to massively parallel execution of thousands to millions of threads. The execution of threads on the cores is performed in a single-instruction-multiple-data (SIMD) fashion, in which the threads each perform the same instruction at a time, but on different data in parallel [HP12]. Fig. 3.1 depicts a block diagram of the NVIDIA GP100 GPU of the recent Pascal architecture [NVI17c]. The GPU provides 60 streaming multiprocessors (SM) which are partitioned into six Graphics Processing Clusters (GPCs) that are further sub-divided into five Texture Processing Clusters (TPCs) each. Each SM consists of 64 single-precision (SP) and 32 double-precision (DP) compute cores, summing up to a total of 3,840 single-precision or 1,920 double-precision cores. The GPC and TPC [NVI10] provide all key units for graphics processing (e.g., for vertex, geometry and texture processing), however, the actual rendering capabilities will not be used within the context of this work.
Figure 3.1: Block diagram of the full NVIDIA GP100 Pascal GPU Architecture. (Adapted from [NVI17c])

The SMs have access to a 4,096 KB Level-2 (L2) cache which is connected via eight 512-bit memory controllers to a High Bandwidth Memory 2 (HBM2) DRAM global device memory provi