Browsing by Author "Kesim, Dominik"

Open Access
Assessing resilience of software systems by the application of chaos engineering : a case study
(2019) Kesim, Dominik
Modern distributed systems increasingly adopt the microservice architectural style for cloud-based designs. However, the prevalence of microservice architectures and container orchestration technologies such as Kubernetes increases the complexity of assessing the resilience of such systems. Resilience benchmarking is a means to assess the resilience of software systems by leveraging random fault injections on the application and network level. Current approaches to resilience benchmarking, however, have become inefficient. Chaos engineering is a new, still evolving discipline that forces a change in perspective on how systems are developed with respect to their resilience. The key idea is to apply empirical experimentation in order to learn how a system behaves under turbulent conditions by intentionally injecting failures. Fixing the errors found by this approach and repeating the same experiments allows the system to build up an immunity against failures before they occur in production. In the scope of an industrial case study, this work provides means to identify risks and hazards by applying three hazard analysis methods known from engineering safety-critical systems to the domain of chaos engineering, namely i) Fault Tree Analysis as a top-down approach to identify root causes, ii) Failure Mode and Effects Analysis as a component-based inspection of different failure modes, and iii) Computational Hazard and Operations as a means to analyze the system’s communication paths. A selected number of the identified hazards are then implemented as chaos engineering experiments in Chaostoolkit in order to be injected on the application-platform level, i.e., Kubernetes. In total, four experiments have been derived from the findings of the hazard analysis, of which three have been executed and analyzed by applying non-parametric statistical tests to the observations.
This work provides a generic approach to assessing the resilience of a distributed system in the context of chaos engineering, illustrated by an industrial case study.