OPUS - Online Publications of University Stuttgart
    Open Access
    Assessing resilience of software systems by the application of chaos engineering : a case study
    (2019) Kesim, Dominik
    Modern cloud-based distributed systems increasingly adopt the microservice architectural style. However, the prevalence of microservice architectures and of container orchestration technologies such as Kubernetes increases the complexity of assessing the resilience of such systems. Resilience benchmarking is one means of assessing the resilience of software systems by injecting random faults at the application and network levels. Yet current approaches to resilience benchmarking have become inefficient.
    Chaos engineering is a new and still evolving discipline that forces a change in perspective on how systems are developed with respect to their resilience. The key idea is to use empirical experimentation, intentionally injecting failures to learn how a system behaves under turbulent conditions. Fixing the errors found this way and repeating the same experiments allows the system to build up immunity against failures before they occur in production.
    In the scope of an industrial case study, this work provides means to identify risks and hazards by applying three hazard analysis methods known from the engineering of safety-critical systems to the domain of chaos engineering, namely i) Fault Tree Analysis, a top-down approach to identify root causes, ii) Failure Mode and Effects Analysis, a component-based inspection of different failure modes, and iii) Computational Hazard and Operations, a means to analyze the system’s communication paths. A selected subset of the identified hazards is then implemented as chaos engineering experiments in Chaos Toolkit and injected at the application platform level, i.e., Kubernetes. In total, four experiments were derived from the findings of the hazard analysis, of which three were executed and analyzed by applying non-parametric statistical tests to the observations.
This work provides a generic approach to assessing the resilience of a distributed system in the context of chaos engineering, illustrated by an industrial case study.
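The abstract states that the experiment observations were analyzed with non-parametric statistical tests but does not name them. Purely as an illustration, the sketch below assumes a two-sided Mann-Whitney U test that compares a metric (for example, response times) observed during a steady-state baseline run against the same metric observed while a fault is injected. The function names `rank_with_ties` and `mann_whitney_u` and the sample values are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter
from typing import Sequence


def rank_with_ties(values: Sequence[float]) -> list[float]:
    """Assign 1-based ranks; tied values share their average (mid) rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mid_rank
        i = j + 1
    return ranks


def mann_whitney_u(baseline: Sequence[float],
                   experiment: Sequence[float]) -> tuple[float, float]:
    """Hypothetical helper: two-sided Mann-Whitney U test (normal approximation).

    Returns (U, p_value), where U is the statistic of the baseline sample.
    """
    n1, n2 = len(baseline), len(experiment)
    combined = list(baseline) + list(experiment)
    ranks = rank_with_ties(combined)
    r1 = sum(ranks[:n1])                 # rank sum of the baseline sample
    u1 = r1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    n = n1 + n2
    # Tie-corrected variance of U: sum of (t^3 - t) over groups of tied values.
    tie_term = sum(t**3 - t for t in Counter(combined).values())
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0
    # Continuity correction of 0.5 towards the mean, clamped at zero.
    z = max((abs(u1 - mean_u) - 0.5) / math.sqrt(var_u), 0.0)
    p = math.erfc(z / math.sqrt(2))      # equals 2 * (1 - Phi(z))
    return u1, p


if __name__ == "__main__":
    # Hypothetical response times in ms: steady state vs. fault injected.
    steady_state = [120, 118, 125, 130, 122, 119, 121, 124]
    with_fault = [180, 175, 190, 160, 185, 170, 178, 182]
    u, p = mann_whitney_u(steady_state, with_fault)
    print(f"U = {u}, p = {p:.4f}")
```

With these made-up samples every observation under fault injection exceeds every baseline observation, so U is 0 and p falls below 0.01, i.e. the steady-state hypothesis of such an experiment would be rejected. The normal approximation is a simplification; for very small samples an exact test (for instance `scipy.stats.mannwhitneyu` with `method="exact"`) would be the safer choice.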