Assessing resilience of software systems by the application of chaos engineering : a case study

Kesim, Dominik

Files

bachelor_thesis__1_ (3).pdf (5.59 MB)

Date

2019

Authors

Kesim, Dominik

Abstract

Modern distributed systems increasingly adopt the microservice architectural style for cloud-based designs. However, the prevalence of microservice architectures and container orchestration technologies such as Kubernetes increases the complexity of assessing the resilience of such systems. Resilience benchmarking is a means of assessing the resilience of software systems by leveraging random fault injections at the application and network level. Current approaches to resilience benchmarking, however, have become inefficient. Chaos engineering is a new and still evolving discipline that forces a change in perspective on how systems are developed with respect to their resilience. The key idea is to apply empirical experimentation to learn how a system behaves under turbulent conditions by intentionally injecting failures. Fixing the errors found by this approach and repeating the same experiments allows the system to build up an immunity against failures before they occur in production. In the scope of an industrial case study, this work provides means to identify risks and hazards by applying three hazard analysis methods known from the engineering of safety-critical systems to the domain of chaos engineering: i) Fault Tree Analysis as a top-down approach to identify root causes, ii) Failure Mode and Effects Analysis as a component-based inspection of different failure modes, and iii) Computational Hazard and Operations as a means to analyze the system’s communication paths. A selected number of the identified hazards are then implemented as chaos engineering experiments in Chaostoolkit in order to be injected at the application platform level, i.e., Kubernetes. In total, four experiments were derived from the findings of the hazard analysis, of which three were executed and analyzed by applying non-parametric statistical tests to the observations. This work provides a generic approach to assessing the resilience of a distributed system in the context of chaos engineering, illustrated by an industrial case study.

Modern distributed applications increasingly adopt the microservice architectural style to design cloud-based applications. However, the widespread use of microservice architectures and container orchestration technology such as Kubernetes increases the complexity of assessing and improving the resilience of distributed applications. Resilience benchmarking is a method for assessing the resilience of applications by means of random fault injections at the application and network level. However, current resilience benchmarking approaches are inefficient. Chaos engineering is a new, evolving discipline that forces a rethinking of how applications are developed with respect to their resilience. The idea is to learn more about a system’s behavior by running empirical experiments with targeted and controlled fault injections. Finding and fixing problems with this approach and repeating the experiments allows the system to build up an immunity against the faults found. In the scope of an industrial case study, this work provides techniques for identifying risks and hazards in a chaos engineering context. Three hazard analysis techniques are presented, namely i) Fault Tree Analysis as a top-down approach to identify root causes, ii) Failure Mode and Effects Analysis for the component-based analysis of different failure modes, and iii) Computational Hazard and Operations as a means to analyze the system’s communication paths. The identified hazards are then defined as chaos engineering experiments using Chaostoolkit and executed at the platform level (Kubernetes). In total, four experiments were derived from the hazard analysis and designed, of which three were executed and analyzed with statistical tests. This work provides a generic approach to improving the resilience of distributed applications in the context of chaos engineering, illustrated by an industrial case study.
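
The abstract mentions analyzing experiment observations with non-parametric statistical tests but does not name a specific test. As a minimal sketch, assuming a Mann-Whitney U test is one suitable choice, the following compares observations taken in the steady state with observations taken during a fault injection. The function name `mann_whitney_u` and the sample values are hypothetical illustrations, not data or code from the thesis.

```python
import math


def mann_whitney_u(baseline, injected):
    """Return (U, two-sided p-value) for two independent samples.

    Ties receive average ranks. The p-value uses a normal approximation
    without tie or continuity correction, so it is only a rough guide
    for small samples.
    """
    n1, n2 = len(baseline), len(injected)
    # Tag each value with its group: 0 = baseline, 1 = injected.
    combined = sorted([(v, 0) for v in baseline] + [(v, 1) for v in injected])

    rank_sum_baseline = 0.0
    i = 0
    while i < len(combined):
        # Find the run of equal values starting at position i.
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            if combined[k][1] == 0:
                rank_sum_baseline += avg_rank
        i = j

    u1 = rank_sum_baseline - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)

    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Made-up response times in milliseconds, for illustration only.
steady_state = [102, 98, 110, 105, 99, 101, 104, 97]
during_fault = [140, 155, 132, 160, 149, 151, 138, 145]

u_stat, p_value = mann_whitney_u(steady_state, during_fault)
print(f"U = {u_stat}, p = {p_value:.4f}")
```

A small p-value would suggest that the injected fault shifted the observed metric away from its steady-state distribution. In practice, an established statistics library would usually be preferable to a hand-written test.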

URI

http://nbn-resolving.de/urn:nbn:de:bsz:93-opus-ds-109188
http://elib.uni-stuttgart.de/handle/11682/10918
http://dx.doi.org/10.18419/opus-10901

Collections

05 Fakultät Informatik, Elektrotechnik und Informationstechnik
