Institute of Software Technology
Reliable Software Systems
University of Stuttgart
Universitätsstraße 38
D–70569 Stuttgart

Bachelor's Thesis

Automated Performance Regression Detection in Microservice Architectures

Nils Wenzler

Course of Study: Softwaretechnik
Examiner: Dr.-Ing. André van Hoorn
Supervisor: Dr.-Ing. André van Hoorn, Teerat Pitakrat, Cor-Paul Bezemer
Commenced: March 23, 2017
Completed: September 23, 2017
CR-Classification: C.2.4, D.2.8

Abstract

The emergence of microservice architectures has led to a need for software performance techniques that cater to the needs of these architectures. Microservice architectures are built out of small, independent processes that communicate through language-independent interfaces such as REST. Microservice environments challenge common software performance engineering techniques through their highly distributed nature, their rapidly changing systems, and their extensive use of virtualization and containers. Detecting performance changes of a system during development, so-called performance regression detection, is a valuable addition to the development and maintenance of software systems. How performance regression detection can be performed in microservice architectures is the non-trivial question that this thesis aims to answer.
To reach this goal, a first step extensively examines the software performance metrics provided by the microservice orchestration technology Kubernetes and the containerization technology Docker. The results show that some major performance metrics cannot be considered stable across redeployments of the microservice system. They suggest that the performance metrics of a microservice instance can be strongly impaired by other microservices running on the same node.
A second step empirically evaluates the performance of existing performance regression detection techniques in the context of a microservice environment.
After a thorough comparison of the different approaches, selected approaches were implemented and their performance was evaluated empirically in the test setup. The results show that some approaches are not applicable or perform poorly in the microservice setup. Although none of the approaches performed well enough for practical application, two of them showed promising results, which could help to enable performance regression detection in microservice architectures.

Kurzfassung (German abstract, in English translation)

The emergence of microservice architectures has led to a need for new software performance techniques that are adapted to the characteristics of these architectures. Microservice architectures are built out of small, independent processes that communicate via language-independent interfaces such as REST. Because of their massively distributed nature, their rapidly and frequently changing systems, and their use of virtualization and containerization, microservice environments pose a challenge for software performance engineers. Detecting performance changes of a system, so-called performance regression detection, is a valuable addition to the development and maintenance of software systems. The non-trivial question of how performance regression detection can be carried out in microservice environments is therefore the focus of this thesis.
To find answers to this question, the performance metrics provided by the microservice orchestration technology Kubernetes and the containerization technology Docker are examined first. The results of this thesis show that some microservice performance metrics cannot be considered stable with respect to redeployments of the same system on the same cluster. They suggest that the performance metrics of a single microservice instance can be massively influenced by other microservice instances on the same node.
In a second step, the performance of existing performance regression detection approaches is examined empirically in a microservice setting. After a thorough comparison of the different approaches, a selection of them is reimplemented and their performance is evaluated in a test microservice system with artificial regressions. The results show that some approaches are not applicable or achieve very poor results. Although none of the examined approaches would be good enough for practical application, two of them show promising results that could guide the development of performance regression detection in microservice architectures.

Contents

1. Introduction
1.1. Motivation
1.2. Thesis structure
2. Foundations
2.1. Performance testing
2.2. Performance regressions
2.3. Performance regression detection
2.4. Evaluation of performance regression detection approaches
2.5. Anomaly detection
2.6. Microservice architectures
2.7. Performance metrics
3. Related work
3.1. Microservice performance research
3.2. Existing performance regression detection approaches
4. Comparison and implementation of approaches
4.1. Selection criteria
4.2. Comparison of approaches
4.3. Selection of approaches
4.4. Implementation of approaches
5. Evaluation
5.1. Evaluation goals
5.2. Evaluation methodology
5.3. Evaluation setup
5.4. Metrics
5.5. Description of results
5.6. Discussion of results
6. Threats to validity
6.1. External validity
6.2. Internal validity
7. Conclusion
7.1. Summary
7.2. Discussion
7.3. Future work
8. Acknowledgements
Bibliography
A. Metric measurements

List of Figures

2.1. The different levels of load testing. Adaptation of Mike Cohn's test pyramid [CG09].
2.2. The process of performance regression detection
3.1. Visualization of the process of regression models on clustered performance counters regression detection
3.2. Visualization of the process of performance signature-based regression detection
3.3.
Visualization of the proposed filtering technique of Nguyen et al.
3.4. Visualization of the process of statistical process control-based regression detection
3.5. Visualization of the process of mining performance regression testing repositories regression detection
3.6. Examples for association rules
5.1. Steady state detection visualization
5.2. Overall view on the test environment
5.3. Cpu/usage_rate distribution relative to median value
5.4. Cpu/usage_rate median in different deployments
5.5. Relative deviations of median CPU measurements during runs compared to requests per minute of the load driver
5.6. Memory usage and working set behavior during different deployments
5.7. Relative deviations of median memory measurements during runs compared to requests per minute of the load driver
5.8. Network rx and tx rate behavior during different deployments
5.9. Relative deviations of median network measurements during runs compared to requests per minute of the load driver
6.1. Requests per minute median in different deployments
6.2. Node CPU usage throughout series of load tests
A.1. Cpu/usage_rate development in different deployments
A.2. Memory/usage development in different deployments
A.3. Memory/page_faults_rate development in different deployments
A.4. Memory/working_set development in different deployments
A.5. Network/tx_rate development in different deployments
A.6. Network/rx_rate development in different deployments
A.7. Requests per minute of load driver development in different deployments
A.8. Cpu/usage_rate distribution relative to median value
A.9. Memory/usage distribution relative to median value
A.10. Memory/page_faults_rate distribution relative to median value
A.11. Network/tx_rate distribution relative to median value
A.12. Network/rx_rate distribution relative to median value

List of Tables

3.1. Overview Student-T-Test
3.2. Overview regression models on clustered performance counters
3.3. Overview signature-based performance regression detection
3.4. Overview transactional profiles
3.5. Overview statistical process control techniques using machine learning
3.6. Overview performance regression unit tests
3.7. Overview differential flame graphs
3.8. Overview mining performance regression testing repositories
4.1. Tabular comparison between approaches
5.1. Specification of the testing nodes
5.2. The different kinds of injected regressions
5.3. Available CPU metrics in Heapster
5.4. Available memory metrics in Heapster
5.5. Available filesystem metrics in Heapster
5.6. Available network metrics in Heapster
5.7. Collected response metrics in Locust
5.8. Normal distribution findings
5.9. Performance evaluation of the four performance regression detection approaches
5.10. Performance of Student t-test regression detection
5.11. Performance of statistical process control regression detection
5.12. Performance of signature-based performance regression detection
5.13. Mining performance regression testing repositories performance
A.1. Median of cpu/usage_rate per test run (original unit: millicores)
A.2. Variance of cpu/usage_rate per test run (original unit: millicores)
A.3. Median of memory/usage per test run (unit: mebibytes)
A.4. Variance of memory/usage per test run (original unit: bytes)

List of Algorithms

4.1. Student t-test regression detection pseudo code
4.2. Statistical process control techniques regression detection pseudo code
4.3. Signature-based performance regression detection pseudo code
4.4. Mining performance regression testing repositories pseudo code

Chapter 1 Introduction

1.1. Motivation

Microservice architectures promise to reduce complexity, to allow independent scaling, to make it easy to remove and deploy independent parts of the system, to support the use of different frameworks and languages, to increase the overall system elasticity, and finally to improve the resilience of the system [Ama+15]. Many companies already use this architectural style, which is inspired by service-oriented architectures. These architectures pose many new challenges to software performance engineers, and a need has emerged for software performance applications that cater to the specific needs of microservice environments [Hei+17]. Performance regression detection is one of those software performance applications.
Performance regression detection detects significant changes in the performance of a software system during development and helps to avoid performance decreases. In essence, it answers the question of whether the overall system's performance has changed because of the most recent changes to the code base. Tackling software performance during development has some inherent advantages compared to dealing with performance issues at the end of a project. On the one hand, targeting performance issues at the end of software development can be very challenging because the sources of the bad performance may be hidden anywhere in the system. In contrast, if developers are notified directly when a performance decrease, a so-called performance regression, is introduced during development, finding and resolving the issue should be less challenging. On the other hand, today's software systems, and microservice systems in particular, are subject to continuous change driven by techniques such as continuous delivery. It is therefore reasonable to monitor the performance of a software system on the basis of the individual changes made during development.
This setting leads to the central question of this thesis: How, and how well, can performance regression detection be realized in the context of microservice environments?
Several challenges have to be overcome to enable performance regression detection in microservice systems: dealing with the distributed nature of the system's independent microservices, understanding the performance behavior of microservice systems, researching the available performance metrics of microservice environments and their properties, finding software performance techniques for testing elastic systems, and deciding how to incorporate the different metrics of the different microservice instances into the performance regression detection.
Because of the distributed nature of microservice systems, virtualization and containerization are commonly used technologies. These add challenges of their own to software performance engineering tasks. Basic foundations of software performance engineering, such as the performance metrics, behave differently in containerized applications. For example, this thesis shows that the performance metrics of containers are impaired by other microservices running on the same physical node.
Although this thesis cannot answer all of the questions concerning performance regression detection in microservice architectures, it offers in-depth research on and first results for some of those challenging questions. In this work, a reference microservice system is used for an empirical study of the possibilities of performance regression detection in microservice environments. The available performance metrics of microservices are examined in extensive experiments and their properties are investigated. The stability of the measurement results is studied both during test runs and across different deployments of the same system.
This work shows that most of the measurements were not normally distributed and that the performance metrics did not show any unexpected deviations during single test runs, but that massive deviations of up to 16% in the performance measurement results were observed when comparing different deployments of the same microservice system on the same cluster. The reason for these deviations was found in scheduling decisions of the microservice orchestration technology used, Kubernetes.
To further investigate the possibilities of performance regression detection in microservice architectures, existing performance regression detection approaches were researched in depth. After a thorough comparison of the different approaches, selected approaches were implemented and their performance was evaluated empirically in the test setup. The results show that some approaches are not applicable or perform poorly in the microservice setup. None of the existing approaches performed at a level that would be usable in a production environment. Two of the implemented approaches show promising results and may lead to solutions for performance regression detection in the context of microservices. The best approach detected 67% of the injected regressions with an accuracy of 73%. Finally, this work outlines possible future research directions for enabling performance regression detection in microservice environments.

1.2. Thesis structure

The remainder of this thesis is structured as follows:
Chapter 2 – Foundations explains the basic concepts needed to understand this work. It offers a short introduction to the theory of load testing, regression detection, and microservice architectures.
Chapter 3 – Related work presents a detailed description of the existing research on the different performance regression detection approaches.
Chapter 4 – Comparison and implementation of approaches explains which existing approaches were chosen for evaluation in a microservice environment. It presents the criteria by which the selection was made and compares the different approaches with respect to these criteria.
Chapter 5 – Evaluation presents the setup and methodology of the evaluation. Furthermore, it describes and then discusses the findings of the experiments on the behavior of microservice metrics and on the performance of the performance regression detection approaches in a microservice environment.
Chapter 6 – Threats to validity lists possible threats to the validity of the findings presented in this work. For each threat, a short assessment and steps to minimize the risk are presented.
Chapter 7 – Conclusion gives a short walk through the thesis, its evaluation, and its findings, followed by an outlook on possible future research.
Chapter 8 – Acknowledgements lists the people without whom this work would not have been possible or who gave valuable input during the creation of this thesis.
The original measurements of this work were published at [Wen17b]. The original code base of this work was published at [Wen17a]. This code base only contains the exact implementation of the prototype. It offers neither further documentation nor a performance regression detection tool that could be used in a production environment.

Chapter 2 Foundations

This chapter explains the basic concepts needed to understand the following work. It offers a short introduction to the theory of load testing, regression detection, and microservice architectures.

2.1. Performance testing

Performance testing is a type of software testing that strives to determine the responsiveness, throughput, reliability, and/or scalability of a software system under a given load [Mei+07].
A load is a set of operations, most commonly produced artificially, that is performed on the system under test (SUT) to observe its behavior under work. Performance testing helps to identify bottlenecks in a system, supports performance tuning, and makes it possible to evaluate the compliance of a system with service level agreements (SLAs). SLAs specify requirements for the product built in a software development project. Examples of such requirements are guaranteed uptime, response time limits for a specified number of users, or a guaranteed throughput of documents per hour.
In general, there are four different types of performance testing: load tests, stress tests, performance tests, and capacity tests [Mei+07]. These types mainly differ in the goal they strive to reach. Performance tests aim to determine the speed, scalability, or stability of the SUT. Scalability describes the system's ability to adapt to increasing workloads by using additional resources (see Section 2.7.1). Load tests strive to evaluate the SUT's behavior under normal and peak conditions. Stress tests evaluate the behavior of the SUT under loads that exceed those of normal or peak conditions. Capacity tests target the question of how many users or transactions per time slot the SUT can support while still meeting its performance requirements.

Figure 2.1.: The different levels of load testing (unit test, component test, integration test, system test). Adaptation of Mike Cohn's test pyramid [CG09].

Independent of the type of a performance test, there are different levels at which load testing can be performed. Figure 2.1 shows how those levels build on each other. While unit tests focus on testing small code passages and classes, component tests focus on testing the behavior of whole components. Integration tests focus on testing the interaction between single components, while system tests strive to test the overall system's performance.
The main focus of this thesis is on testing at the system and component level.

2.2. Performance regressions

Functional regression tests are a commonly used method for ensuring that a newer version of a software system still provides the functionality of the older version [LL13]. Analogously, a performance regression describes a significant change in performance compared to an older version of the software system. Although functional regression tests are well established, performance regression tests are not as commonly performed in practice. Examples of typical performance regressions are [Ngu+12] [Sha+15]:
Increasing memory usage: Adding a (large) field to a very frequently used object leads to a noticeable increase in overall memory usage.
Increasing CPU usage: Additional calculations or algorithms with a bad run-time increase the CPU load of the SUT.
Increasing I/O access times: Since storage device accesses are more time-consuming than memory accesses, introducing more such (blocking) I/O accesses decreases the performance of the SUT.

Figure 2.2.: The process of performance regression detection (a commit is passed to performance regression detection, which reports either "regression detected" or "no regression").

Increasing network usage: Network I/O is more time-consuming than memory access and can have an overall negative impact on inter-microservice communication. Introducing more network calls decreases the performance of the SUT.
Unnecessary system prints: System prints, which are sometimes used for debugging, may slow down parts of the system, since the output depends on slow I/O operations.
Wrong configurations: A wrong configuration of the system components may slow down the system significantly.
Examples of such configuration errors are the number of available threads in a thread pool, the number of parallel connections to a database, or the resource limits for single processes.

2.3. Performance regression detection

Performance regression detection deals with the detection of performance regressions. Although approaches for performance regression tests exist, they are uncommon compared to functional regression tests. Figure 2.2 shows the general process of performance regression detection. After a new commit, i.e., a change to the software system, is submitted, performance regression detection evaluates the new system's performance and checks whether a performance regression was introduced.
Performance regression detection compares the performance measurements of a new version $v_i$ of a software system to the performance measurements of a subset of the previously observed versions $v_0, \dots, v_{i-2}, v_{i-1}$ of the software system. Between different versions, the code base, the functionality of the software, and even the load of the evaluation may change.

2.4. Evaluation of performance regression detection approaches

The following metrics are introduced for the later evaluation of the performance regression detection approaches. A single evaluation of a performance regression detection approach can have one of four general results. The approach can report a performance regression when the system indeed has a performance regression. This is called a true positive (TP). The approach may report that there is no performance regression when the system's performance indeed has not changed. This is called a true negative (TN). For a perfect approach that makes no mistakes, these two results would be sufficient, but there are two cases in which the approach can err. If it reports no regression although a regression is in fact present, this kind of error is called a false negative (FN).
Finally, if the approach reports a regression although there is no performance anomaly, the result is a false positive (FP) [DG06].
From those four kinds of results, three metrics are commonly calculated: the precision, the recall, and the F-measure. The precision describes how many of the reported regressions are indeed regressions. It matches the common understanding of the word precision.
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
The recall describes how many of the real regressions were reported as regressions.
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
Additionally, there is the F-measure, which combines both metrics.
\[ F = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
For all three metrics, the results range from 0 to 1, with 1 being the optimal case.

2.5. Anomaly detection

Performance regression detection is a special use case of anomaly detection. Anomaly detection is the problem of finding patterns in data that are not considered normal; such patterns are called anomalies. In the context of this work, performance regressions can be considered anomalies concerning software performance. In a survey, Chandola, Banerjee, and Kumar [CBK09] give an overview of the different kinds of anomaly detection and offer basic algorithms for the different types. Unless stated otherwise, the information in this section is based on [CBK09].
Challenges in anomaly detection include defining what normal behavior is, dealing with a definition of normal behavior that changes over time, and dealing with noise in the input data. Noisy data is data that is corrupted or distorted.
Different techniques deal with different kinds of data. Some techniques only work for data in the form of attributes (mining performance regression testing repositories serves as an example), while others use ordinal or continuous data (Student t-test based performance regression detection serves as an example).
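Since performance regression detection is treated here as a special case of anomaly detection, the evaluation metrics of Section 2.4 apply to every detector discussed in this chapter. The following is a minimal sketch of how precision, recall, and F-measure could be computed from the four result counts. The class name `DetectionCounts`, the function names, the choice to return 0.0 for an undefined ratio, and the example counts are illustrative assumptions and are not taken from the thesis prototype.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectionCounts:
    """Hypothetical container for the four outcome counts of an evaluation."""
    tp: int  # regression reported, regression present
    fp: int  # regression reported, no regression present
    tn: int  # no regression reported, no regression present
    fn: int  # no regression reported, regression present


def precision(c: DetectionCounts) -> float:
    # Share of reported regressions that are real regressions.
    reported = c.tp + c.fp
    return c.tp / reported if reported else 0.0  # assumption: 0.0 if undefined


def recall(c: DetectionCounts) -> float:
    # Share of real regressions that were reported.
    real = c.tp + c.fn
    return c.tp / real if real else 0.0  # assumption: 0.0 if undefined


def f_measure(c: DetectionCounts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Made-up counts, purely for illustration.
    counts = DetectionCounts(tp=6, fp=2, tn=10, fn=3)
    print(f"precision={precision(counts):.2f}")
    print(f"recall={recall(counts):.2f}")
    print(f"F-measure={f_measure(counts):.2f}")
```

Note that true negatives do not enter any of the three metrics, so an approach that rarely reports anything can still reach a high precision while having a low recall.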
The following section explains which general types of anomalies exist, how anomaly detection techniques are trained to recognize normal and uncommon behavior, and finally which general techniques are used to detect anomalies. In general, there are three categories of anomalies:
Point anomalies are anomalies where a single measurement can be considered an anomaly with respect to the remaining data set.
Contextual anomalies are anomalies where the single measurement itself cannot be considered an anomaly, since its value may be common. Contextual anomalies deviate with respect to their context. While a temperature of 30 degrees Celsius can be considered normal in the context of summer, a temperature of 30 degrees Celsius in winter would be considered a contextual anomaly.
Collective anomalies are anomalies concerning related data samples, which can only be considered uncommon because they occur together as a collection. A possible example would be tracking actions in the field of intrusion detection and seeing a suspicious series of events in the tracking logs.
The performance regression detection approaches described in this thesis vary in the kind of anomalies they detect. Although the classification is to some extent subject to interpretation, the approach described in Section 3.2.8 could be seen as focusing on contextual anomalies, while the approach of Section 3.2.5 could be seen as focusing on point anomalies. No approach presented in this work deals with collective anomalies.
Anomaly detection approaches need to build a model of normal behavior. The process of building such a model is called training. According to Chandola, Banerjee, and Kumar [CBK09], three different kinds of training exist: supervised, semi-supervised, and unsupervised training. Supervised training offers training data with labels for normal as well as for anomalous data.
Semi-supervised training offers training data that only shows normal behavior. Finally, unsupervised techniques do not use labeled data at all. They most commonly assume frequent patterns to be normal. The performance regression detection approaches in this thesis are set up in an environment where they may use old performance measurements, which are considered to show normal behavior. Therefore, performance regression detection, as understood in this work, can be categorized as semi-supervised training.
Another classification is given by the general approach that the different techniques use:
• classification based
• nearest neighbor based
• clustering based
• statistical techniques
• information theory
• spectral theory
The existing approaches in the field of software performance regression detection are hard to categorize. Nonetheless, some approaches can be considered to belong to the category of statistical techniques.

2.6. Microservice architectures

Many modern software businesses focus on the use of microservice architectures. Being a fine-grained Service Oriented Architecture (SOA), they promise to cater to the needs of cloud computing and continuous delivery (CD) [BHJ16]. A microservice architecture is built out of single independent processes, so-called microservices.
The main reasons for the emergence of microservice architectures are their promise to reduce complexity, to allow independent scaling, to make it easy to remove and deploy independent parts of the system, to support the use of different frameworks and languages, to increase the overall system elasticity, and finally to improve the resilience of the system [Ama+15]. Scalability is a prerequisite for elasticity and describes the degree to which a system is able to sustain increasing workloads by making use of additional resources [LEB15]. Elasticity adds the aspects of how fast, how often, and at what granularity the system adapts.
Resilience describes “the ability of a system to sustain external and internal disruptions without discontinuity of performing the system’s function or if the function is disconnected, to fully recover the function rapidly” [HBRM16].

Container virtualization is commonly used in microservice architectures. It allows applications such as microservices to run side by side on one single system in isolated containers. Containers offer virtualization on the level of the operating system and are therefore more lightweight than full server virtualization [Ama+15]. They promise to be easy to set up, since all dependencies of a piece of software are defined and bundled into the container.

Since microservice architectures are mostly used together with CD approaches, microservice environments change rapidly and it is hard or even impossible to find a steady state for performance regression detection [Hei+17]. Therefore, approaches for performance regression detection in microservice architectures should take this into account. The overall lack of software performance engineering approaches for microservice systems [Hei+17] is one of the main motivations for this thesis.

The remainder of this section gives a short introduction to the terminology of Kubernetes [Kubd], the container orchestration tool used in this thesis.

Kubernetes is an open-source system which offers deployment and management functionalities for a microservice system. In this thesis, Kubernetes is used for deploying the microservices. This short subsection explains the terminology and most basic concepts of Kubernetes, so that later references to this terminology are understandable.

A single microservice is in most cases built out of a containerized application, which allows independent deployment and execution. Since in some cases a microservice might be built out of several containers, the base unit of Kubernetes is a pod.
A pod represents one single instance of a microservice in the system. Since pods may be redeployed, or several instances of one pod may run simultaneously to meet high demand for this microservice, there is a need for one central point to ask for instances of one kind of microservice. In Kubernetes, this concept is called a service. Commonly, requests are not issued to a pod but to a service, which forwards the request to one of the available microservice instances associated with that service.

To control the number of instances of a single microservice, so-called replica sets allow scaling it horizontally. Horizontal scaling describes the process of providing more or fewer instances of a microservice to adapt to a changing load. In contrast, vertical scaling describes the process of making more or fewer resources available to one single existing instance of a microservice.

In terms of hardware, there are so-called nodes. A node corresponds to one single system in the cluster. Such nodes may be independent hardware systems or a set of virtual machines.

When this work mentions a redeployment, it means deleting all pods, services and replica sets of the system under test and reinstantiating them afterwards.

2.7. Performance metrics

According to the IEEE standard 610.12, a metric is a quantitative measure of the degree to which a system, component or process possesses a given attribute. Metrics are used to evaluate an attribute of a system and to help compare attributes between different systems. Performance metrics are metrics which measure aspects that help when evaluating the performance of a system.

Typically, there are four main types of system performance metrics: CPU utilization, memory utilization, network I/O and disk I/O [Ngu+12]. Additional metrics are collected on the application level. A performance counter is a concrete dataset of a given performance metric.
A typical load test may collect thousands of performance counters [Sha+15]. Therefore, tooling support is needed when software performance analysts assess the performance of a system.

The performance attributes scalability, elasticity and resilience are of special importance for evaluating microservices. Scalability is a performance attribute which describes the degree to which a system is able to sustain increasing workloads by making use of additional resources [LEB15]. Elasticity adds the possibility of decreasing available resources and is based on how much time the system needs to perform such an adaptation [LEB15]. Resilience describes “the ability of a system to sustain external and internal disruptions without discontinuity of performing the system’s function or if the function is disconnected, to fully recover the function rapidly” [HBRM16].

Since those performance attributes cannot be measured directly, software performance engineers use special metrics and combinations of metrics to evaluate scalability, elasticity and resilience.

2.7.1. Measuring scalability

Scalability is a performance attribute which describes the degree to which a system is able to sustain increasing workloads by making use of additional resources [LEB15]. In contrast to elasticity, it does not include the possibility of reducing available resources. Tsai, Huang, and Shao [THS11] propose to observe performance change and performance variability for evaluating scalability in a cloud computing environment. This thesis assumes that microservice environments and cloud computing environments are comparable in their scalability requirements. Tsai et al. build their scalability metric upon the following definitions.

• The waiting time $T_w$ is defined as the sum of the queuing time $T_q$ and the execution time $T_e$.

  $T_w = T_q + T_e$

• CR is the sum over all resources of the product of the allocation of a resource $R_i$ and the time $T_i$ the resource is used.
  $CR = \sum_i R_i \cdot T_i$

• The performance/resource ratio PRR is the product of the inverse waiting time $T_w$ and the inverse CR value.

  $PRR = \frac{1}{T_w} \cdot \frac{1}{CR}$

• Finally, the performance change PC is defined as

  $PC = \frac{PRR(t) \cdot W(t)}{PRR(t') \cdot W(t')}$

  with $PRR(t)$ as the performance/resource ratio of the system under a certain workload $W(t)$, and $PRR(t')$ and $W(t')$ being the performance/resource ratio and the workload at a different time $t'$. In a scalable system, PC should have a value close to 1. A value of 1 means that the resource need per unit of work does not change with the workload.

• The performance variance PV is the variance, that is the mean squared deviation from the mean, of the performance change over multiple test runs with the same constant workload.

  $PV = E\left[\left(PC_i - \frac{1}{n}\sum_{i=1}^{n} PC_i\right)^2\right]$

  A scalable system should have a performance variance close to zero. A value of zero means that the performance change is constant.

2.7.2. Measuring elasticity

Scalability is a prerequisite for elasticity [HKR13]. While scalability describes the ability of a software system or component to adapt to increasing workloads by making use of additional resources, elasticity adds the aspects of how fast, how often and at what granularity the system or component adapts. Herbst, Kounev, and Reussner [HKR13] propose a metric for elasticity which uses the concept of underprovisioned and overprovisioned states. Underprovisioned states are states in which the system or component does not have enough resources available. Overprovisioned states are states in which the system has more resources available than needed. To recognize such states, Herbst et al. propose to build a matching function which indicates when the system should scale down or up. Afterwards, they use metrics such as the average time the system needs to leave an underprovisioned state and the average amount of underprovisioned resources in the system to evaluate the overall elasticity.
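The two metrics named in the preceding paragraph can be illustrated with a small Python sketch. It is not the definition used by Herbst et al. It assumes, for illustration only, that demanded and supplied resource amounts are available as equally spaced samples (the names demand, supply and interval_s are hypothetical), and it treats every run of consecutive samples with supply below demand as one underprovisioned episode.

```python
from statistics import fmean


def underprovisioning_metrics(demand, supply, interval_s=1.0):
    """Return (average duration of an underprovisioned episode in seconds,
    average resource shortfall over all underprovisioned samples).

    demand, supply: equally long sequences of resource amounts,
    sampled every interval_s seconds (assumed input format).
    """
    episodes = []    # lengths, in samples, of consecutive underprovisioned runs
    shortfalls = []  # demand - supply for every underprovisioned sample
    run = 0
    for needed, available in zip(demand, supply):
        if available < needed:
            run += 1
            shortfalls.append(needed - available)
        elif run:
            episodes.append(run)
            run = 0
    if run:  # series ended while still underprovisioned
        episodes.append(run)
    if not episodes:
        return 0.0, 0.0
    return fmean(episodes) * interval_s, fmean(shortfalls)


# Hypothetical series: number of instances needed vs. instances running.
avg_time, avg_shortfall = underprovisioning_metrics(
    demand=[2, 4, 4, 2, 5, 5, 5, 2],
    supply=[2, 2, 3, 3, 3, 4, 5, 5],
    interval_s=10.0,
)
```

The same loop with the comparison reversed would give the corresponding values for overprovisioned states.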
Another approach for measuring elasticity is given in the work of Islam et al. [Isl+12]. It is the most commonly referenced approach [LEB15]. They also consider under- and overprovisioned states, but use a cost penalty for evaluating the elasticity. In general, Islam et al. integrate over the cost resulting from the difference between the resources which were needed and the resources which were available. They propose different ways to evaluate the cost of under- and overprovisioning. The approach for underprovisioning uses quality of service metrics to calculate a violation metric regarding the SLAs. For overprovisioning, they use the actual cost of the additional resources.

2.7.3. Measuring resilience

While all three software attributes, scalability, elasticity and resilience, are challenging to measure, resilience seems to be the hardest one. The most common definition of resilience [Pas+15] is the one given by Hollnagel, who defines it as “the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions” [Hol13]. Pasquini et al. [Pas+15] mention that, although this definition is good for understanding the general concept of resilience, parts of the resilience engineering community argue that it is not well suited for measuring resilience.

Strigini [Str12] states that resilience and dependability are broad concepts covering several attributes, and that therefore several possible metrics are available. One possible metric which targets dependability is the overall system availability in a realistic setup. Such a metric depends strongly on the SUT and its load; it is therefore not well suited for comparing different systems. Furthermore, when artificial load is used as a reference, the results depend highly on what kind of disturbances of the system are present in the load. In addition, there may be a human aspect. If the system under consideration includes human beings, which it often does, factors like alertness or fatigue will influence the system resilience [Str12]. Even aspects like the graphical user interface may be considered, since it may cause more frequent failures, so that the system is less often able to sustain required operations.

Other metrics, which target tolerable disturbances, include the ability to return to the original state with k faulty components, with communication errors of up to t bits, or with m erroneous inputs.

Chapter 3 Related work

This chapter describes related research on automated performance regression detection, not limited to microservice architectures. Surprisingly, at the time of writing, the existing research on the performance of microservice systems seems to be limited. The computer science bibliography dblp [Dbl] returns only 8 results for the query “microservice performance”. The related work on microservice architectures is therefore a short section. Nonetheless, this chapter gives a short overview of the existing research on microservice performance attributes. Afterwards, it offers a detailed description of existing performance regression detection approaches.

3.1. Microservice performance research

Most similar to this work is the master thesis “Performance anomaly detection in microservice architectures under continuous change” by Düllmann [Dül17]. Although the title suggests that the research focus was nearly identical, the two works differ considerably. Düllmann focused on observing the performance impacts of changes in a running microservice environment, while this work focuses on redeploying a test system for regression detection. In his work, he changed performance attributes of a single microservice during runtime.
In contrast, the approach of this work focuses on observing the whole system and redeploying the whole system for every single change. Furthermore, Düllmann used architectural knowledge of the microservice system and focused mainly on how to prevent false positives triggered by high loads, which are observable during the deployment of a microservice. This work does not rely on architectural knowledge, although some of the approaches internally build a model of the relationship between the different metrics. Additionally, Düllmann used an artificial microservice system which consisted of three microservice instances. This thesis evaluates an existing microservice platform which consists of more than 20 microservice instances and has, to some extent, a realistic use case. Finally, this work includes some general research on the behavior of microservice metrics.

Gribaudo, Iacono, and Manini [GIM17] performed research on simulation-based estimation of performance attributes of microservice architectures. Their work mainly focused on infrastructure parameters of a microservice system which may influence performance. A simple example of such a parameter is the number of available nodes in a cluster.

A general survey presented different problems and possible research directions in the field of microservices. Its authors argue that special solutions for performance regression detection have to be investigated. This thesis, however, evaluates common techniques and investigates their possibilities and challenges. It shows their performance and highlights existing challenges.

The work of de Camargo et al. [Cam+16] focused on developing a testing framework for microservice systems. They tried to tackle test automation and the problem of keeping test specifications consistent with rapidly changing microservices.

3.2. Existing performance regression detection approaches

There are several approaches to performance regression detection available. Since one of the main goals of this thesis is to evaluate and investigate their usability for microservice architectures, the following sections describe in detail how the different approaches work.

The following approaches were found in a search inspired by the principles of a systematic literature review. The main goal of the search was to find a broad collection of performance regression detection approaches. To find those approaches, the two common research search engines dblp and Google Scholar were used. The initial search queries were “performance regression detection” and “performance regression test”. From the results, relevant papers which describe approaches for performance regression detection were selected. The related work sections of the selected papers were then used to find further relevant research.

3.2.1. Student t-test based performance regression detection

Table 3.1.: Overview Student t-test

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
|---|---|---|---|---|---|---|
| Several runs till reaching confidence threshold | Yes | No* | 1 step | No | Very common | No* |

Statistical tests such as the Student t-test are a commonly used approach for comparing measurements of two different versions of the same software system [Sha+15]. In a study, Shang et al. [Sha+15] propose a new approach to performance regression detection. They compare their results, which will be presented in Section 3.2.2, to classical statistical hypothesis tests such as the Student t-test [Sha+15]. They conclude that such tests do not perform well and lead to a high number of false positives.
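As a rough illustration of the kind of classical test meant here, the following Python sketch computes the two-sample Student t statistic with pooled variance for one performance counter measured on two versions. It is not taken from Shang et al. The function names, the sample values and the critical value are assumptions made for this example; the critical value has to be looked up for the chosen significance level and degrees of freedom.

```python
import math
import statistics


def student_t_statistic(baseline, candidate):
    """Two-sample Student t statistic with pooled variance.

    Assumes both samples come from distributions with equal variance.
    Returns the t statistic and the degrees of freedom.
    """
    n1, n2 = len(baseline), len(candidate)
    mean1, mean2 = statistics.fmean(baseline), statistics.fmean(candidate)
    var1, var2 = statistics.variance(baseline), statistics.variance(candidate)
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    t = (mean2 - mean1) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def differs(baseline, candidate, critical_value):
    """Flag a change if |t| exceeds the caller-supplied critical value."""
    t, _ = student_t_statistic(baseline, candidate)
    return abs(t) > critical_value


# Hypothetical response times (ms) of one counter in two versions.
old_version = [100, 102, 98, 101, 99]
new_version = [110, 112, 108, 111, 109]
t_value, dof = student_t_statistic(old_version, new_version)
# 2.306 is the two-sided 5% critical value of the t distribution
# for 8 degrees of freedom.
changed = differs(old_version, new_version, critical_value=2.306)
```

The sketch only tests whether the two means differ. It says nothing about the normality and equal-variance assumptions of the test, which measured performance counters may well violate.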
Although their research shows that the quality of results depends highly on the chosen SUT, they conclude that the overall performance is not very promising.

3.2.2. Regression models on clustered performance counters

In the above-mentioned work, Shang et al. [Sha+15] propose an approach which uses regression models on clustered performance counters. Regression models are models which allow predictions about dependent variables, given the values of some other explanatory variables. Such models are built from sets of data points. There are different types of regression models. For example, linear regression models try to model the relationship in the data set by use of a linear function. Clustering is a method which allows grouping similar datasets.

Since software systems can have thousands of performance counters, it is important to select the metrics which shall be considered. Especially when working with a model-based approach, it would be infeasible to build a model for every counter. Shang et al. therefore eliminate performance counters which can be considered redundant. For this redundancy analysis, they use R² to eliminate, step by step, performance counters above a predefined threshold. In a second step, Shang et al. cluster the performance counters to reduce the number of models to be built. They use an n-dimensional representation for the performance counters. Each dimension represents a time step (slice) of the load test and holds the corresponding value of the performance counter. Using the Pearson distance metric, a hierarchical clustering based on average cluster distance is performed. Using the Calinski-Harabasz stopping rule, the hierarchy is split into separate clusters. From every cluster, the performance counter which deviates the most between the two software versions is selected. Deviation is measured by use of a Kolmogorov-Smirnov test.
For this counter, a linear regression model with the remaining counters as independent variables is built.

Finally, for the performance regression detection, the model which was built from the performance counters of an earlier version is used to predict the expected value of the selected performance counter of every cluster. This prediction is compared to the real values and the average prediction error is calculated. If the maximum of all average prediction errors is higher than a given threshold, a performance regression alert is issued. Figure 3.1 visualizes the general process of the approach.

[Figure 3.1.: Visualization of the process of regression models on clustered performance counters regression detection. Steps shown for the old and new version: remove redundant counters, cluster performance counters, for each cluster build regression model, predict expected values, accum. errors, threshold to trigger alert.]

The results of the approach are promising. The proposed approach leads to better results than pure statistical Student t-tests or manual selection of variables for regression models. Since there is no direct need for two separate test runs, this detection approach takes less time than the classical approach of two separate runs to make the loads comparable. This approach could be used for microservice performance metrics such as scalability, elasticity, and resilience as well.

Table 3.2.: Overview regression models on clustered performance counters

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
|---|---|---|---|---|---|---|
| One run | Yes | - | 6 steps | No | - | No* |

3.2.3. Signature-based performance regression detection

Malik, Hemmati, and Hassan [MHH13][Mal10] propose an approach which is similar to the one of Shang et al. [Sha+15]. They also focus on the reduction of relevant datasets. Malik et al.
call the resulting small sets of performance counters with their values signatures. Their performance regression detection focuses on comparing such signatures. Instead of clustering, they make use of principal component analysis (PCA). Principal component analysis is a statistical procedure which projects an n-dimensional dataset onto a q-dimensional dataset with q < n. The results of the PCA are principal components (PCs). PCA keeps the information loss of this projection low. Malik et al. suggest that PCA scales better than clustering algorithms. Afterwards, they select a subset of PCs which represents 90% cumulative variability. This means that the selected PCs explain at least 90% of the variability of the collected measurements. Next, they perform a “PC decomposition” with the goal of extracting the most relevant performance counters from the PCs. The exact technique of “PC decomposition” is not described in detail and there is no common technique with that name. For a possible reimplementation in later parts of this work, “PC decomposition” is understood as using the coefficients of the linear combinations that relate the individual PCs to the metrics. To extract the most relevant performance counters from this decomposition, they calculate weights for each performance counter and principal component and select the most relevant ones by use of a tunable threshold.

The resulting weights of the performance counters of a baseline test are compared to the weights of the same performance counters in the newer version of the system. A deviation in weights suggests that the distribution of the performance counter values has changed between the two versions. If the deviation crosses a certain threshold, a performance regression alert is issued.
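The steps described above can be sketched in Python with NumPy. This is only one possible reading, not the implementation of Malik et al.: since the “PC decomposition” is not specified, the weighting used here (absolute loadings of a counter on the retained PCs, each PC weighted by its share of the retained variance) is an assumption, as are the function names, the synthetic data and the alert threshold.

```python
import numpy as np


def performance_signature(samples, variance_target=0.90):
    """Return one non-negative weight per performance counter.

    samples: array of shape (n_samples, n_counters), n_counters >= 2.
    """
    x = np.asarray(samples, dtype=float)
    x = x - x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0          # constant counters stay at zero
    x = x / std                  # standardize so units do not dominate

    # PCA via eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(x, rowvar=False))
    order = np.argsort(eigvals)[::-1]        # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the smallest number of PCs reaching the variance target.
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(explained, variance_target)) + 1, len(eigvals))

    # Assumed weighting: absolute loadings on the kept PCs, each PC
    # weighted by its share of the retained variance.
    share = eigvals[:k] / eigvals[:k].sum()
    return np.abs(eigvecs[:, :k]) @ share


def signature_deviation(old_signature, new_signature):
    """Largest absolute change in weight of any counter."""
    return float(np.max(np.abs(old_signature - new_signature)))


# Synthetic example: the second counter stops following the load.
rng = np.random.default_rng(0)
load = np.linspace(0.0, 1.0, 50)
old = np.column_stack([load, 2 * load, rng.normal(size=50)])
new = np.column_stack([load, rng.normal(size=50), rng.normal(size=50)])
deviation = signature_deviation(performance_signature(old),
                                performance_signature(new))
alert = deviation > 0.1   # hypothetical, tunable threshold
```

Taking absolute loadings avoids the sign ambiguity of eigenvectors, which would otherwise make weights of two runs incomparable.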
Figure 3.2 visualizes the general process of the approach.

[Figure 3.2.: Visualization of the process of performance signature-based regression detection. Steps shown for both the old and the new version: principal component analysis, select 90% cumulative variance, for each PC calculate weights of metrics, bundle signature. The resulting signatures/weights are then compared.]

Malik et al. propose another approach, which is supervised. This means that manual inspection and labeling of data is needed. Although the labeling is only needed for the performance counters of the baseline tests, this approach is not considered in this thesis. There are two reasons for this: 1) Malik et al. conclude that it is rarely possible for performance engineers to do such a labeling and 2) the alternative PCA approach performs only slightly worse.

Since the signature-based approach has no specific requirements regarding the performance measurements used, it is expected to be easily adaptable to microservice performance metrics. Nonetheless, it is unclear whether the microservice performance metrics would be excluded in the performance counter reduction phase. A big advantage of this approach is that it focuses on large-scale software systems and is therefore expected to scale well to big software systems. Additionally, Malik et al. suggest that the PCA approach performs better than clustering approaches. A direct comparison to the approach of Shang et al. [Sha+15] may not be possible, since Malik et al. use k-means clustering as a reference, as opposed to the hierarchical clustering of Shang et al.

In terms of disadvantages, it is worth mentioning that the approach offers no direct solution for changes in load and therefore would need the common double set of load tests to ensure comparability. Furthermore, Malik et al. found that their approach needs a base of at least 40 samples to work. Although they suggest that higher sampling rates easily solve this problem, it is an additional requirement which other approaches do not have.

Table 3.3.: Overview signature-based performance regression detection

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
|---|---|---|---|---|---|---|
| 40+ samples | Yes | No* | 3 steps | No | - | No* |

3.2.4. Performance regression detection with transaction profiles

Ghaith et al. [Gha+13] proposed a new approach for performance regression detection and did further research on it, which was published later [Gha+16]. Their approach is based on so-called transaction profiles. Transactions are interactions of a user with the software system to invoke various application functions (e.g., login, browsing catalogs, etc.). In contrast to the common load metric of total response time for such transactions, the approach of Ghaith et al. focuses on calculating the load-independent transaction profile for a transaction. A transaction profile describes which resource utilization a single transaction request produces (e.g., CPU usage on node 1, memory usage on node 2, etc.). The value associated with the whole transaction profile is the total time needed for one single transaction of this profile. Since a transaction profile thus describes the load of a single request, transaction profiles are considered to be load-independent.

In common setups, two separate load tests are often performed for performance regression detection, in order to be able to compare measurements made under different loads. One test is performed under the same load as the previous one, to be able to compare the results of both tests in performance regression detection.
The other test is performed under the load which fits the current production setup. It is used for capacity analysis or future regression detection. Ghaith et al. report that such a load-independent metric makes it possible to eliminate the first of those two test runs, since the load-independent metrics are directly comparable to their earlier values. This leads to a reduction of load testing duration of up to 50% [Gha+16].

For calculating the transaction profile values, three input parameters are needed. The first two, response times and resource utilization, are common metrics for load testing setups and have good tooling support. The third one is a Queueing Network Model (QNM) of the SUT. Building and validating a QNM must be performed manually for complex systems. After initially building a QNM, it can be reused and updated for later runs. Furthermore, QNMs of typical deployment topologies exist in research and can support building the QNM. Finally, the QNM has to be what Ghaith et al. call reverse-solved. Although there is no analytical solution for this process, search-based approaches for approximating the result are applicable. To shorten the duration of such search-based approaches, the transaction profiles of earlier runs may be used as starting points.

Case studies of Ghaith et al. show that transaction profiles can be considered a more load-stable metric than total response time, although the transaction profile values are still affected by different load levels. Especially loads which lead to high levels of software contention affect the transaction profile values. For performance regression detection based on transaction profiles, such loads must be avoided. Additional inaccuracies in the method result from ignoring network delays and from using the BCMP approximation for QNMs.
BCMP networks are an often-used class of queuing networks which does not exactly fit realistic setups, because it uses concepts such as infinite queues for servers or the assumption of equal service time for every customer.

For the performance regression detection, Ghaith et al. use thresholds calculated from test runs on software versions with so-called performance-safe changes. Those changes are considered to have no performance impact. They use the 95th percentile of the deviations for each transaction type.

Table 3.4.: Overview transactional profiles

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
|---|---|---|---|---|---|---|
| One run | Yes | Yes | 4 steps | No | - | No |

3.2.5. Statistical process control techniques using machine learning

Table 3.5.: Overview statistical process control techniques using machine learning

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
|---|---|---|---|---|---|---|
| One run | Yes | Yes | 3 steps | Yes | Use of common technique | No* |

Nguyen et al. [Ngu+12] propose an approach to performance regression detection which is backed by control charts. Control charts are commonly used in statistical quality control of manufacturing processes. Control charts show whether a process deviates from an earlier baseline of the same process. To use control charts, two requirements have to be fulfilled. First, the input parameters of the process should be largely constant, and second, the process output should follow a normal distribution. When trying to use control charts in software performance regression detection, those two requirements are not fulfilled. The input parameters of the process, which correspond to the load of the SUT, are not constant.
This is the case because load tests use randomizers for load generation and, even more importantly, because load should be adaptable to real production loads, which change over time. Nguyen et al. propose to use machine learning techniques to learn the parameters of a linear scaling factor which makes performance metrics of different loads comparable. Their assumption is that in a well-structured system, performance metrics stand in a linear relationship to load. Their evaluation suggests that this procedure is very accurate.

The second requirement, a normal distribution of output parameters, does not hold without adaptation either. Their research shows that most of the distributions are bi-modal. They explain this with bookkeeping tasks which the system performs in idle states. Because of those bookkeeping tasks, the distribution does not drop to zero on the left-hand side. Nguyen et al. propose to search for the local minimum between the real load and the bookkeeping load in the distribution and to eliminate all measurements to the left of it. This filtering leads to a much better fit to normal distributions in their measurements.

Figure 3.3 illustrates the filtering approach. The red line shows the original distribution. Some small spikes on the left-hand side in the lower area of the distribution are cut away by removing the first maximum. All measurements with values lower than 70 are removed. The remaining distribution (blue dotted line) fits a normal distribution better than the original red one.

The approach of Nguyen et al. stores all results of performance regression tests in one repository as baseline tests. The performance regression detection therefore does not have to rely on a single reference version. Figure 3.4 visualizes the general process of the approach.

This approach has some requirements concerning the used metrics.
They have to follow a normal distribution, or at least have to be filterable to reach a normal distribution, and they have to be in a linear relationship to load. Because the performance metric values are adapted to fit different loads, a double execution of load tests to make their results comparable under different loads is not necessary. Nguyen et al. see their biggest advantage compared to other approaches in the simplicity and intuitiveness of the approach.

[Figure 3.3.: Visualization of the proposed filtering technique of Nguyen et al. Plot title: density plots for CPU usage rate, original and adjusted (N = 586, Bandwidth = 1.502). Y-axis: Density. Curves: original density curve, adjusted density curve.]

[Figure 3.4.: Visualization of the process of statistical process control-based regression detection. Steps shown: filter measurements, scale measurements, create control charts, violation ratio old, violation ratio new, compare violation ratios.]

3.2.6. Performance regression unit testing

The approach of Horký et al. [Hor+13] focuses on how performance tests can be integrated into the software development life cycle from the very beginning of development. They argue that at early stages of development, load testing cannot depend on the availability of large components, and therefore propose an approach which they call performance unit testing. Unit testing is a technique which is very common in functional testing. There, unit tests allow testing even small code segments based on example data derived from the specification.

Performance unit testing is not as easy as functional unit testing. Most of the time, there is no specification of performance requirements on a low level, which makes it impossible to verify that they are fulfilled.
Furthermore, performance is hardware dependent, which makes a naive comparison of execution times meaningless. Horký et al. therefore propose to use so-called baseline functions as a reference. For example, a baseline function for a sorting algorithm could be the Java Array sort implementation. In terms of performance regression, such a baseline function could be an earlier version of the method which was considered to be of sufficient performance. With Stochastic Performance Logic (SPL), a first-order logic for performance comparisons, Horký et al. introduced a way of specifying performance relations between methods. The evaluation of such an SPL formula is done by statistical hypothesis testing. The formulas can be annotated directly in Java code. The tooling support is surprisingly good if only Java is considered. A command line interface for test execution, automated HTML report generation, an Eclipse plugin as well as Git and SVN integration for automated pulls of older project versions as baseline are available [Dev].

The performance regression detection in this approach is done with Welch's t-test [Wel47], although the required assumption of two sets of independent observations of random variables with the same distribution does not hold. Horký et al. conclude that the test is nonetheless usable, but that test repetitions in the order of tens of thousands are needed. Furthermore, because of the non-deterministic behavior of load tests, they propose a 5% threshold for regression detection to avoid false positives. In their case study, test durations of 27 minutes for a code coverage of 18% are needed. To reach better test durations, they propose caching of earlier results to prevent duplication of performance tests.

In contrast to all other approaches in this work, this approach focuses on the unit level. Most commonly, regression detection is performed on a component, integration or system level.
To reach a system level evaluation with performance unit tests, every single component would need a performance requirement specification and performance unit tests. This is hard to imagine in a microservice environment, where the spectrum of used languages and tools is very broad.

Table 3.6.: Overview performance regression unit tests

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
| Thousands of function runs | No | Yes | Impl. available | No | Inspired by common technique | ? |

3.2.7. Differential flame graphs

Bezemer, Pouwelse, and Gregg [BPG15] propose a visualization approach called differential flame graphs for supporting performance regression detection. Flame graphs visualize how much time a program spends in a certain stack trace during test execution. The visualization shows the aggregated time on the x-axis and the given stack traces on the y-axis. A software performance engineer can use such a flame graph to develop a better understanding of how much time given functions need for execution and how much time was spent on which layer of a function call.

Out of the flame graphs of two distinct versions of the system, Bezemer et al. propose to build a differential flame graph. The differential flame graph uses color to show whether and by how much a function's performance metric increased or decreased. Furthermore, a differential flame graph can be built upon the differences of the two flame graphs, so that the resulting differential flame graph only visualizes stack traces of function calls which changed their performance behavior between the two versions of the software.

This approach is usable for visualizing stack-based metrics, such as stack traces of execution. In an earlier paper Bezemer et al.
[Bez+14] did research on how I/O usage could be collected as a stack-based metric. Although this approach is meant to support a manual performance regression detection process, an automated implementation is imaginable. A big issue concerning this approach is the availability of stack-based metrics. Especially in a microservice environment, the majority of metrics is not based on stack traces. In their paper, Bezemer et al. mention that differential flame graphs may also be used in graphical user interface evaluation. A stack-trace equivalent would be the navigation of the user through the user interface. For microservices, an interesting approach could be to use the traces of a request through the microservice architecture. A possible differential flame graph would show how a request uses different microservices and how much time the single interactions need.

Table 3.7.: Overview differential flame graphs

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
| One run | No | No* | Impl. available | No | Use of common technique | No |

3.2.8. Mining performance regression testing repositories

Foo et al. [Foo+10] propose to use data from earlier performance tests to extract association rules, which show relations between different performance counters. Figure 3.6 gives examples of possible rules. On the left-hand side of each rule are zero to n observations, which lead to 1 to m resulting observations.

Foo et al. [Foo+10] suggest to detect performance regressions by testing the association rules of earlier runs against the data of a newer run. If the confidence in a rule deviates by more than a given threshold, a performance regression alert is issued.

In a first step of their approach, the data of the performance metrics is converted into ordinal data (low, medium, high). Foo et al.
afterwards extract the association rules by applying data mining concepts to this ordinal data set. They find frequent item sets and association rules by use of the apriori algorithm, which returns support and confidence values for possible association rules. These confidence levels are compared between different versions of a software system to detect performance regressions.

Figure 3.5.: Visualization of the process of mining performance regression testing repositories regression detection (steps shown for the old and the new run: metric discretization, apriori algorithm, resulting rule set, compare confidences)

Figure 3.5 visualizes the general process of the approach.

The results of the approach seem to be comparable to others. The flexibility concerning the use of different metrics seems to be promising for usage in a microservice environment. The need for a big repository of existing performance load test data may be a disadvantage of the approach.

{carts network/rx_rate high} → {carts cpu/usage_rate high}
{carts cpu/usage_rate high} → {carts network/tx_rate high, carts-db network/rx_rate high}
{} → {carts memory/usage medium}

Figure 3.6.: Examples for association rules

Table 3.8.: Overview mining performance regression testing repositories

| Speed of detection | Support of metrics | Adapt. to changing loads | Complexity | Load testing repository | Commonness | Change in distribution |
| One run | Yes | No* | 3 steps | Yes | - | No |

Chapter 4

Comparison and implementation of approaches

After the short overview of the different approaches to performance regression detection, this thesis first compares the presented approaches. In a second step, a subset of the approaches is chosen to be implemented and evaluated in a microservice environment. Criteria by which the approaches are compared are introduced.
A further section describes how the selected approaches were implemented.

4.1. Selection criteria

This thesis strives to research performance regression detection approaches which are especially promising for microservice architectures. Therefore, the selection focuses on criteria which are relevant to the microservice environment. The main aspects which are expected to be relevant for performance regression detection in a microservice environment are:

Speed of detection: Since microservice architectures are rapidly changing environments, one of the biggest issues in integrating performance tests into the delivery pipeline is the overall detection time. Different approaches vary in the number of test runs needed (e.g., two runs with different loads) or the number of data points needed for a reliable detection.

Support of different metrics/Adaptability to microservice metrics: Since it should be possible to enrich the selected approaches with microservice performance metrics like scalability, elasticity and resilience, it is important that an approach generally supports the use of such metrics.

Adaptability to changing load: Not only is the overall system rapidly changing in microservice architectures, the load of the system under test must also be adapted over time to fit the load of the productive system. Approaches which offer some possibility of adapting to changing loads are better suited than those which do not.

Complexity: The complexity of the selected approaches is relevant for two reasons. First, a complicated approach is hard to understand, and its results may be complex to evaluate and communicate. Second, since the extent of this thesis is limited, approaches which offer an existing implementation or are easier to implement are more likely to be chosen than more complex ones.
Since there is no direct way of measuring the complexity of an approach, the comparison is done by estimating the number of different steps which have to be carried out to perform the regression detection. This estimation is not precise and is to some extent influenced by personal experience and subjective opinion.

Need for a load testing result repository: Some approaches need a repository of old load testing results. Since this is an additional effort for software performance engineers, such a requirement is considered to be a disadvantage.

Commonness: Performance regression detection approaches which are currently more common than others are considered to be of more interest in research. Not only are findings concerning such approaches of common interest, they may also offer an opportunity to help establish such an approach in microservice environments.

4.2. Comparison of approaches

Table 4.1 shows a tabular comparison between the approaches for performance regression detection.

The first row, speed of detection, gives an overview of the effort, in terms of performance testing, that is needed before the regression detection can start.

The second row, support of metrics, expresses how easily the given approaches can be adapted to new or different metrics. Unit testing is considered to be not adaptable, since the tests are performed on a function level, where interesting aspects such as inter-process communication and scalability are not directly visible. Since differential flame graphs are only usable for data sets which are stack-based, this approach is only usable for few metrics.

The third row, adaptability to changing loads, expresses whether the algorithm of the approach includes mechanisms to deal with changing loads when comparing the current system with an older version.
The value “No*” indicates that, although the approach does not tackle the issue, it is possible to repeat the new test under the old load setup. Since this doubles the load testing time, it is considered to be a disadvantage.

The fourth row, complexity, tries to grasp the complexity of the given approach. Since such a comparison is not possible in general, it is done by estimating the steps needed to execute the algorithm. As an example: for signature-based performance regression detection (Section 3.2.3), one must perform a principal component analysis, afterwards a principal component decomposition and finally a comparison of the retrieved performance signatures. Overall, those are counted as 3 separate steps. As stated above, this kind of comparison is only chosen for lack of a better way to compare complexity.

The fifth row, load testing repository, expresses whether the given approach needs a repository of old load tests. The approach of statistical process control (Section 3.2.5) needs a repository for machine learning of the α and β values of the linear models. Mining repositories (Section 3.2.8) needs a repository to build the association rules.

The sixth row, commonness, describes how common the approaches are compared to usual techniques of software performance engineering. The Student t-test (Section 3.2.1) is a very common approach [Sha+15]. The statistical process control approach (Section 3.2.5) uses control charts, which are commonly used in process control. The unit testing approach (Section 3.2.6) is inspired by common functional unit tests. The differential flame graph approach (Section 3.2.7) uses flame graphs, which are a common visualization according to Bezemer et al. [BPG15].

The last row, change in distribution, shows whether the approach is sensitive to changes in the distribution of the collected data sets.
“No*” means that the algorithm itself does not detect distribution changes, but that a distribution change can be detected when suitable metrics, like the variance of the total response time, are used.

Table 4.1.: Tabular comparison between approaches

| Approach | Student t-test | Clustered counters | Signature-based | Transactional profiles | Statistical process control | Unit testing | Differential flame graphs | Mining repositories |
| Description | 3.2.1 | 3.2.2 | 3.2.3 | 3.2.4 | 3.2.5 | 3.2.6 | 3.2.7 | 3.2.8 |
| Speed of detection | Several runs till confidence threshold | One run | 40+ samples | One run | One run | Thousands of function runs | One run | One run |
| Support of metrics | Yes | Yes | Yes | Yes | Yes | No | No | Yes |
| Adaptability to changing loads | No* | - | No* | Yes | Yes | Yes | No* | No* |
| Complexity | 1 step | 6 steps | 3 steps | 4 steps | 3 steps | Impl. available | Impl. available | 3 steps |
| Load testing repository | No | No | No | No | Yes | No | No | Yes |
| Commonness | Very common | - | - | - | Use of common technique | Inspired by common technique | Use of common technique | - |

4.3. Selection of approaches

Table 4.1 gives a tabular overview of the aspects of the different approaches. In this thesis, the approaches of performance unit testing and differential flame graphs will not be researched further.

Performance unit testing does not fit the targeted testing scenario because it tests on a unit level. Although regression testing can be performed on unit level, the more common use case is to perform it on a system or component level. Furthermore, the use of different tools and languages is very common in microservice environments. A proper implementation of performance unit tests would therefore be nearly impossible, since they depend highly on the chosen languages. The existing research concerning scalability, elasticity, and resilience offers no possibilities of collecting those metrics on a unit level.
Differential flame graphs will not be considered further in this thesis because their use case, especially in terms of metrics, is too narrow. Furthermore, it is unclear how an automated regression detection approach would be realized with those graphs. Additionally, a future inclusion of measurements with respect to scalability, elasticity, and resilience is not possible with this approach. Differential flame graphs could nonetheless be a promising visualization for traces throughout the microservice system.

Although it offers a performance boost by trying to perform load-independent tests, the technique of transaction profiles is not looked into further. The main reason is that the implementation of the approach seems quite time consuming. Additionally, the results of the research work suggested that the results were load independent only to a limited degree. Furthermore, the need for a queuing model of the specific system under test is an additional burden that is non-trivial to realize for a microservice architecture. Networking delays were not considered in the original approach. Since network traffic is one key aspect of microservice architectures, this might be more important than first expected. Lastly, it is unclear how well and how fast a search-based reverse solving of a queuing network could be performed.

The Student t-test, the signature-based approach and the statistical process control approach will be subject to a practical evaluation.

The Student t-test is chosen because it is very common and simple to implement. It is therefore a good baseline to evaluate other approaches against. This technique is often used, although it is expected that not all performance metrics are normally distributed and that it therefore should either not be applied or be handled with care.

The approaches of “Regression models on clustered performance counters” and “Signature-based performance regression detection” are considered to be similar.
Both approaches focus on minimizing the number of relevant counters to compare. Malik [Mal10] directly compares his PCA-based algorithm to a cluster-based algorithm and concludes that the PCA-based algorithm outperformed the clustering approach. Shang et al. [Sha+15] do not stop after clustering but go further by building regression models on top of the clusters. Nevertheless, given Malik's result and the fact that the approach of [Mal10] promises to be easier to implement, “Signature-based performance regression detection” is chosen, while “Regression models on clustered performance counters” is not implemented.

The approach of Nguyen et al. [Ngu+12] is interesting for two reasons. First, they propose a technique to turn most of the non-normally distributed measurements into normally distributed measurements. This could be interesting for the Student t-test as well. Second, by choosing control charts as the basic technique, they use a very common approach known from software performance monitoring.

Last but not least, the approach of Foo et al. [Foo+10] is chosen because of its simplicity in implementation and because, by sticking to very basic ordinal versions of the data set, it is very special in its way of approaching the problem of performance regression detection. Because of its nature, it is expected to perform differently and it may be able to detect different kinds of regressions.

4.4. Implementation of approaches

For all selected approaches, the basic setup shown in Figure 5.2 was chosen. A local mirror of the InfluxDB, a Java implementation and an R server were used. The main work of the implementation was implementing the approaches in Java. Since all approaches use similar functionalities, first a general framework for performance regression analysis was written.
The key features of it are connecting to the InfluxDB by use of the REST API, extracting single test runs out of the raw measurements, normalizing the recorded measurements as described in Section 5.2.2, offering a connection to the R server, and providing evaluation functionality to later evaluate the set of runs with and without regressions. As described later, the R server offers a set of different statistical and numerical techniques which are well documented and tested. It helps to avoid implementation errors by building upon well-tested components.

The following sections give a short description of the implemented approaches as well as an algorithmic pseudo code representation. All algorithms share the same basic interface of DetectRegressions(repo, test). repo is the representation of the load testing repository containing all past load tests, and test contains the performance measurements of the new test run on which the regression detection should be performed. Both data structures can be thought of as databases whose fields link to the corresponding data sets. The reason for this analogy is that in the real implementation repo and test are indeed kept in the InfluxDB. For example, repo.pods returns a set of representations of all the pods which were observed in the load testing repository, and pod.metrics returns a set of all metrics which were collected for a certain pod.

4.4.1. Student t-test based performance regression detection

The Student t-test was the easiest to implement. Since there was no reference paper describing a performance regression detection approach with the Student t-test, the approach was implemented in a way which would be expected in a performance engineering use case. Algorithm 4.1 shows the algorithm used for performance regression detection with the Student t-test.
The algorithm basically describes that for every single pod and metric a Student t-test is performed (line 7) and that a regression is reported if the p value is lower than or equal to 0.05 (line 9). For the Student t-test and for calculating means, the R implementations were used.

Algorithm 4.1 Student t-test regression detection pseudo code
 1: procedure DETECTREGRESSIONS(repo, test)
 2:   for all pod ∈ repo.pods do
 3:     for all metric ∈ pod.metrics do
 4:       allOldData ← repo.LASTMEASUREMENTSOF(pod, metric)
 5:       allNewData ← test.ALLMEASUREMENTSOF(pod, metric)
 6:       if allOldData.variance ≠ 0 then
 7:         pValue ← TTEST(allOldData.mean, allNewData)
 8:         if pValue ≤ 0.05 then
 9:           REPORTREGRESSION(pod, metric)
10:         end if
11:       end if
12:     end for
13:   end for
14: end procedure

4.4.2. Statistical process control techniques using machine learning

The machine learning aspect of this approach was neglected in the evaluation, since it is not directly associated with the detection approach. To reduce the complexity of the evaluation, no different loads were used in the observation. Therefore, the machine learning part of the approach should not be relevant.

Algorithm 4.2 shows the used approach in pseudo code. For every pod and metric, the algorithm performs an evaluation of the control chart violation ratios. If the violation ratio increased, a performance regression is reported. The lines 6 to 14 may be confusing. They implement the filtering approach to get more normally distributed data. The filtering is only applied if the p value of the Shapiro-Wilk test increases by doing so. The R implementation was used for performing the Shapiro-Wilk test and for calculating means and standard deviations.

4.4.3. Signature-based performance regression detection

The signature-based performance regression detection approach extracts performance signatures by use of principal component analysis (PCA). Algorithm 4.3 shows the pseudo code implementation of the algorithm.
Basically, the performance signatures of an old and a new test are generated (line 2 to 5) and afterwards compared (line 6 to 15). The signature extraction applies PCA (line 20) to a matrix of the performance measurements. In this matrix, the rows represent different measurements and the columns represent different metrics. To apply the PCA, zero variance measurements have to be removed first. This step is left out in the pseudo code to make the code easier to understand. The final signature is built from the first x principal components which together account for 90% of the variance of the original data set. For every principal component, the weights of the metrics are afterwards extracted from the corresponding fields of the eigenvector. The eigenvector represents the linear combination of the metrics that results in the respective principal component.

4.4.4. Mining performance regression testing repositories

The approach of mining performance regression testing repositories uses the apriori algorithm to extract association rules from the measurements. Algorithm 4.4 shows a pseudo code implementation of the approach. After extracting the association rules (line 2 to 3), the confidence change is calculated. If it is higher than a certain threshold (0.02 in this example), a performance regression is reported. For performing the apriori algorithm, the measurements first have to be transformed into purely ordinal data (line 13 to 27). Afterwards, the resulting data set is passed to the apriori algorithm. The apriori algorithm extracts all rules which reach a minimum support and confidence level. These two thresholds are adjustable but not shown in this pseudo code representation.
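The two numeric steps of the mining approach, the discretization of measurements and the comparison of rule confidences, can be sketched in a few lines of Python. This is a minimal illustration and not the Java and R implementation of this thesis. The function names are hypothetical, the rule mining with the apriori algorithm is left out, and it is assumed that values within one standard deviation of the mean are labelled medium. The confidence change follows the formula of Algorithm 4.4, which is one minus the cosine similarity of the vectors (conf, 1 − conf) of the old and the new rule.

```python
import math
from statistics import mean, stdev


def discretize(values):
    """Label each value 'low', 'medium' or 'high' relative to mean +/- one
    sample standard deviation (assumed thresholds, see lead-in)."""
    center = mean(values)
    spread = stdev(values)  # sample standard deviation, needs >= 2 values
    labels = []
    for value in values:
        if value > center + spread:
            labels.append("high")
        elif value > center - spread:
            labels.append("medium")
        else:
            labels.append("low")
    return labels


def confidence_change(old_conf, new_conf):
    """Return 1 - cosine similarity of (conf, 1 - conf) for two rules."""
    dot = old_conf * new_conf + (1 - old_conf) * (1 - new_conf)
    norm_old = math.hypot(old_conf, 1 - old_conf)
    norm_new = math.hypot(new_conf, 1 - new_conf)
    return 1 - dot / (norm_old * norm_new)


def is_regression(old_conf, new_conf, threshold=0.02):
    """Report a regression if the confidence change exceeds the threshold."""
    return confidence_change(old_conf, new_conf) > threshold
```

With these assumptions, a rule whose confidence stays at 0.9 yields a change of approximately zero, while a drop from 0.9 to 0.5 yields a change of about 0.22 and would be reported under the 0.02 threshold.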
Algorithm 4.2 Statistical process control techniques regression detection pseudo code
 1: procedure DETECTREGRESSIONS(repo, test)
 2:   for all pod ∈ repo.pods do
 3:     for all metric ∈ pod.metrics do
 4:       allOldData ← repo.ALLMEASUREMENTSOF(pod, metric)
 5:       allNewData ← test.ALLMEASUREMENTSOF(pod, metric)
 6:       pValueOrig ← SHAPIROWILKTEST(allOldData).pValue
 7:       if pValueOrig ≤ 0.05 then
 8:         allOldDataFiltered, cutValue ← REMOVEFIRSTPEAK(allOldData)
 9:         pValueFiltered ← SHAPIROWILKTEST(allOldDataFiltered).pValue
10:         if pValueFiltered > pValueOrig then
11:           allOldData ← allOldDataFiltered
12:           allNewData ← {x ∈ allNewData : x > cutValue}
13:         end if
14:       end if
15:       bVio ← GETVIOLATION(allOldData, allOldData)
16:       tVio ← GETVIOLATION(allNewData, allOldData)
17:       if tVio > bVio then
18:         REPORTREGRESSION(pod, metric)
19:       end if
20:     end for
21:   end for
22: end procedure
23:
24: procedure GETVIOLATION(testData, refDat)
25:   total ← 0
26:   violation ← 0
27:   for all val ∈ testData do
28:     total ← total + 1
29:     if val ∉ [refDat.mean − 3 × refDat.sd, refDat.mean + 3 × refDat.sd] then
30:       violation ← violation + 1
31:     end if
32:   end for
      return violation / total
33: end procedure
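The violation ratio check of Algorithm 4.2 can be illustrated with a short Python sketch. It is not the implementation used in this thesis: the function names are hypothetical, the Shapiro-Wilk based filtering of lines 6 to 14 is omitted because it needs a statistics library, and the sample standard deviation is assumed for the control limits. Both input lists are assumed to be non-empty, and the reference data needs at least two values.

```python
from statistics import mean, stdev


def violation_ratio(test_data, reference_data):
    """Share of test values outside mean +/- 3 standard deviations of the
    reference data (the control limits of the control chart)."""
    center = mean(reference_data)
    spread = stdev(reference_data)  # sample standard deviation (assumption)
    lower = center - 3 * spread
    upper = center + 3 * spread
    violations = sum(1 for value in test_data if not lower <= value <= upper)
    return violations / len(test_data)


def has_regression(old_data, new_data):
    """Report a regression if the new run violates the control limits of the
    old run more often than the old run violates its own limits."""
    baseline_ratio = violation_ratio(old_data, old_data)
    test_ratio = violation_ratio(new_data, old_data)
    return test_ratio > baseline_ratio
```

For example, with old measurements [100, 102, 98, 101, 99, 100, 103, 97] the control limits are 94 and 106, so a new run of [99, 120, 130, 101] has a violation ratio of 0.5 and is reported, while [99, 101, 100, 104] is not.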
Algorithm 4.3 Signature-based performance regression detection pseudo code
 1: procedure DETECTREGRESSIONS(repo, test)
 2:   lastOldRun ← repo.last
 3:   lastNewRun ← test.last
 4:   oldSignature ← EXTRACTSIGNATURE(lastOldRun)
 5:   newSignature ← EXTRACTSIGNATURE(lastNewRun)
 6:   for all x ∈ newSignature.range do
 7:     for all metric ∈ newSignature.pcs[x].metrics do
 8:       if x ∈ oldSignature.range ∧ oldSignature.pcs[x].contains(metric) then
 9:         if abs(newSignature.pcs[x][metric] − oldSignature.pcs[x][metric]) > 0.02 then
              REPORTREGRESSION('signature changed significantly', metric)
10:         end if
11:       else if newSignature.pcs[x][metric] > 0.02 then
12:         REPORTREGRESSION('signature has new significant metric', metric)
13:       end if
14:     end for
15:   end for
16: end procedure
17:
18: procedure EXTRACTSIGNATURE(run)
19:   measurementMatrix ← GETMEASUREMENTMATRIX(run)
20:   pcaRes ← PCA(measurementMatrix)
21:   cumulativeVariance ← 0
22:   result ← {}
23:   for all pc ∈ pcaRes.pcs do
24:     if cumulativeVariance < 0.9 then
25:       cumulativeVariance ← cumulativeVariance + pc.var
26:       result.add(pc)
27:       for all metric ∈ run.metrics do
28:         result.lastPc[metric] ← result.lastPc.eigen[metric.numb]
29:       end for
30:     else
31:       break
32:     end if
33:   end for
      return result
34: end procedure
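The comparison step of Algorithm 4.3 (lines 6 to 15) can be sketched in Python as follows. The sketch assumes that the signatures have already been extracted and are available as a list with one dictionary per principal component, mapping metric names to eigenvector weights. This data layout and the function name are assumptions made for illustration, and the PCA itself is not shown. As in the pseudo code, the weight of a metric that is new in the signature is compared to the threshold without taking its absolute value.

```python
def compare_signatures(old_signature, new_signature, threshold=0.02):
    """Compare two performance signatures component by component.

    Each signature is a list of dicts (one per principal component) that maps
    a metric name to its weight. Returns a list of (message, component index,
    metric) tuples, one per reported regression.
    """
    findings = []
    for index, new_component in enumerate(new_signature):
        # The old signature may contain fewer principal components.
        old_component = old_signature[index] if index < len(old_signature) else None
        for metric, weight in new_component.items():
            if old_component is not None and metric in old_component:
                if abs(weight - old_component[metric]) > threshold:
                    findings.append(("signature changed significantly", index, metric))
            elif weight > threshold:
                findings.append(("signature has new significant metric", index, metric))
    return findings
```

A call such as compare_signatures([{"cpu": 0.70, "mem": 0.10}], [{"cpu": 0.71, "mem": 0.30}]) would report only the memory metric, since the CPU weight changed by less than the threshold of 0.02.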
Algorithm 4.4 Mining performance regression testing repositories pseudo code
 1: procedure DETECTREGRESSIONS(repo, test)
 2:   allOldRules ← EXTRACTRULES(repo)
 3:   allNewRules ← EXTRACTRULES(test)
 4:   for all x ∈ allOldRules, y ∈ allNewRules : x.type = y.type do
 5:     confidenceChange ← 1 − (x.conf · y.conf + (1 − x.conf) · (1 − y.conf)) / (√(x.conf² + (1 − x.conf)²) · √(y.conf² + (1 − y.conf)²))
 6:     if confidenceChange > 0.02 then
 7:       REPORTREGRESSION(x, y)
 8:     end if
 9:   end for
10: end procedure
11:
12: procedure EXTRACTRULES(data)
13:   cardinalDataMap ← {}
14:   for all pod ∈ data.pods do
15:     for all metric ∈ pod.metrics do
16:       measurements ← data.ALLMEASUREMENTSOF(pod, metric)
17:       for all val ∈ measurements do
18:         if val.value > measurements.mean + measurements.sd then
19:           cardinalDataMap.put(val.time, metric + 'high')
20:         else if val.value > measurements.mean − measurements.sd then
21:           cardinalDataMap.put(val.time, metric + 'medium')
22:         else
23:           cardinalDataMap.put(val.time, metric + 'low')
24:         end if
25:       end for
26:     end for
27:   end for
      return APRIORI(cardinalDataMap).rules
28: end procedure

Chapter 5

Evaluation

The following chapter introduces the main research questions of this thesis and presents the chosen methodology for answering them. By means of an empirical study, it presents an evaluation of the formulated research questions. It gives an in-depth explanation of the used setup, the system under test and the architecture of the performance regression detection prototype.

5.1. Evaluation goals

This section gives a short explanation of the research questions of this thesis. Since the main goal was researching the possibilities and challenges of automated performance regression detection in a microservice environment, two different focuses were set. The first one targets possible differences between common monolithic systems and the microservice environment and its metrics.
Since most metrics of microservice architectures are collected on virtualized systems, different virtualization containers on one node may influence each other. The second focus is set on the concrete performance regression detection approaches and their performance in a microservice environment. The goal of this second focus is to evaluate how well current performance regression detection approaches perform in the new environment of microservices.

5.1.1. Evaluation of microservice performance metrics behavior

RQ1.1 Which metrics are available and commonly collected?
RQ1.2 How stable are metrics during a run?
RQ1.3 Can metrics be considered or adapted to be of normal distribution?
RQ1.4 How stable are metrics between system redeployments?

Since load testing measurements and their metrics are the foundation for performance regression detection, research on the overall behavior of typical metrics is performed. The main goals of the evaluation concern the stability of those metrics. This stability will be evaluated during a single deployment as well as between several redeployments. If a metric has a high variance during a test run, long testing durations are needed to gather statistically significant results. If a metric varies highly between redeployments, the values of such a metric during a single test run and deployment are not representative, and several deployments and test runs may be needed to get significant results.

5.1.2. Evaluation of performance regression detection approaches

RQ2 How do the implemented approaches perform in a microservice environment?

The evaluation shall show advantages and disadvantages of existing approaches and shall evaluate their performance in a microservice setup. A perfect microservice performance regression detection approach would be expected to be efficient concerning the needed load testing durations as well as the needed evaluation time. Furthermore, the results of the approach should have a high precision and recall.
It should be simple to set up and use. Furthermore, the results of the regression detection approach should help the performance engineer to understand what kind of regression was found and where. The selected and implemented performance regression detection approaches will be evaluated in these categories.

5.2. Evaluation methodology

To answer the question of which metrics are commonly collected, the documentation of Kubernetes' default monitoring tool Heapster was used as a reference. The remaining questions were answered by performing an empirical study on a reference microservice system.

To evaluate the research questions concerning the behavior of performance metrics in the context of microservice environments (Section 5.1.1), a series of 19 load tests, each 4 hours long, was performed in this test setup (Section 5.3). The SUT and the load specification did not change between the different load tests and redeployments, since the focus of the research on metrics behavior is set on the stability during a run, the stability between runs and the distribution of the metrics' measurements.

A second set of runs was performed on the SUT to answer the remaining research question of how well the single performance regression detection approaches perform in a microservice environment. To answer this question, a set of 44 load tests with a duration of 2 hours each was performed. Out of the 44 test runs, 20 were performed without regressions and contained the unchanged system. Five of the 20 unchanged runs were used to simulate a performance test repository, which some approaches need. The remaining 15 runs without regressions were used to evaluate false positives of the performance regression detection approaches. In addition to the 20 unchanged runs, 24 runs were performed with injected regressions. For this purpose, 6 different kinds of regressions were injected (Section 5.3.6), and for each type of regression 4 load testing runs were executed.
The load specification was not changed between the individual runs. The final evaluation of the performance regression detection approaches was performed by testing every approach with each of the 15 regression-free and 24 regression-including versions of the SUT. For this final evaluation, all metrics described in Section 5.4 were used, except for network and filesystem metrics. These metrics were excluded because they triggered a lot of false positive regression alerts.

5.2.1. Steady state detection

Figure 5.1.: Steady state detection visualization (example data: CPU usage rate in millicores over time in minutes, with the selected steady state data marked)

When a SUT is deployed and put under load, it needs time to reach a so-called steady or stationary state. The steady state describes a state of the SUT in which the metrics can be considered stable. It is expected that in such a steady state, initial loads which may result from startup, caching and warming up of the overall system have reached a level at which they change only negligibly. The difference between measurements influenced by early warm-up and deployment tasks and steady state measurements is visible in Figure 5.1. The figure shows the CPU usage rate (Table 5.3) of one test run. The red line marks the point after which the system can be considered to be in steady state.

An often-used naive approach for ignoring initial anomalies in metrics is to ignore a fixed time span at the start of each test. Steady state detection can speed up the regression detection process, since it allows samples to be removed more precisely. Therefore, a relevant amount of load testing data is collected earlier. In this thesis, the naive approach of ignoring a fixed time span at the beginning of load tests is used.
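The naive approach can be sketched in a few lines of Python. The sketch also contains a simple trend-based variant for comparison. It is only an illustration: the function names, the 40-minute default (the startup time excluded in Section 5.5.1), the slope threshold, and the minimum sample count are assumptions and are not taken from the implementation of this thesis.

```python
from statistics import mean


def drop_warmup(samples, warmup_minutes=40):
    """Naive approach: discard all samples taken before a fixed offset.

    samples is a list of (minute, value) pairs.
    """
    return [(t, v) for t, v in samples if t >= warmup_minutes]


def slope(samples):
    """Least-squares slope of the values over time."""
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean, v_mean = mean(ts), mean(vs)
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        return 0.0
    return sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, vs)) / denom


def detect_steady_state(samples, max_slope=0.05, min_samples=10):
    """Trend-based variant: drop leading samples until the remaining
    data shows (almost) no linear trend. Returns the start index of
    the steady state, or None if no such index exists.

    max_slope and min_samples are hypothetical tuning parameters.
    """
    for start in range(0, len(samples) - min_samples + 1):
        if abs(slope(samples[start:])) <= max_slope:
            return start
    return None
```

A fixed offset is trivial to apply but may discard usable data; the trend-based variant keeps more samples at the cost of an additional threshold that has to be chosen.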
The naive approach was chosen since steady state detection is not relevant for comparing the performance of the selected approaches. For the sake of completeness, it is nonetheless worth mentioning. One possible way to implement steady state detection is to perform a trend analysis on the measurements while iteratively removing measurements from the beginning. When a point is reached where no trend is observed, the remaining measurements can be considered to be in steady state. A more detailed description of possible ways to perform steady state detection is given by Shumway and Stoffer [SS06].

5.2.2. Metric normalization

Since microservice architectures are most commonly deployed in distributed setups, some challenges arise concerning the simultaneous collection and analysis of their metrics. According to Foo et al. [Foo+10], there are mainly three issues to tackle:

Clock Skew Since the cluster is built out of different independent machines, the clocks of the different machines may not be identical. There may be offsets between them, which could lead to misinterpretations when comparing measurements performed with different clocks. Furthermore, different metrics may be observed at different rates, so that at a certain point in time one metric may be available while others are not.

Extended Test There may be several measurements collected directly after the test, when the system may not be under load anymore.

Delay The measurements may start collecting data at different offsets. This may be caused by different startup times or different resource utilizations on the different nodes.

These challenges were tackled in the testing environment with the following solutions:

Clock Skew In the test environment, measurements were only taken approximately every minute.
The time offsets of the different clocks should therefore be negligible, since the observed time intervals are much larger than common clock offsets. Furthermore, for the later evaluation of the individual metrics and measurements in the performance regression approaches, the measurements are first linearly interpolated at strict one-minute intervals, to obtain the best approximation of the real values of the different metrics at certain times.

Extended Test Similarly to ignoring data collected at the beginning of the test because the system is not yet in steady state, a fixed offset at the end of the tests is ignored.

Delay Delays in the collection of the first measurements on the different nodes are not relevant, since the time to reach a steady state is longer than the startup time of the measurement collection. Since all measurements before reaching a steady state are ignored, this delay can be ignored.

5.3. Evaluation setup

The following section presents an overview of the systems which were used for the evaluation. It describes the architecture of the performance regression detection setup, the different software tools which were used, and the cluster on which the evaluation was performed.

5.3.1. Overview

Figure 5.2 shows the overall testing and evaluation environment used in this thesis. The system under test is visualized as blue boxes on the left. In this thesis Sock Shop (Section 5.3.2), an artificial sock web shop, is used as the system under test. To observe the system in a busy state, several Locust (Section 5.3.4) instances simulate users putting the system under load. The Locust instances are visualized at the bottom of the figure. Heapster (Section 5.3.3) observes the resource consumption of the system and the individual microservices. Heapster and the Locust load drivers store their collected data in a central InfluxDB (Section 5.3.3), which is visualized in the center of the figure.
Figure 5.2.: Overall view on the test environment

Table 5.1.: Specification of the testing nodes
Image: Fedora-Cloud-Base-25-1.3.x86_64
CPU cores: 4
CPU MHz: 2300
RAM: 8 GB
Disk space: 80 GB

Which metrics are observed is presented in detail in Section 5.4. The performance regression detection approach on the right of the figure communicates with the InfluxDB and requests the load test metrics. For the statistical evaluation of the data sets, the implemented approaches use an R server (Section 5.3.5), which is shown in the bottom right corner. The arrows in the figure visualize data flows between the individual components and show how the components are connected. Kubernetes (Section 5.3.3) is used for container management and deployment. It runs on an OpenStack cluster [Ope]. In the cluster, Kubernetes uses three virtual machines whose specifications are shown in Table 5.1.

5.3.2. Sock Shop - a microservices demo application

In their work, Aderaldo et al. [Ade+17] investigate requirements for microservice research benchmarks. They examine available candidates and evaluate whether they fulfill those requirements. Although Aderaldo et al. conclude that none of the available open-source candidates is mature enough to be used as a community-wide benchmark, they argue that the candidates can already be valuable for empirical research. Sock Shop [Soc], one of the proposed candidates, is a microservice demo application developed by Weaveworks, a company which focuses on cloud solutions. It consists of 19 microservices which are implemented in Java, Go, and Node.js. Although the system is built as an artificial benchmarking platform, it resembles the back-end of a web shop which sells socks. Sock Shop was chosen as an evaluation platform for several reasons.

Documentation Sock Shop is quite well documented. Deployment instructions for the most common platforms are available.
An architecture description exists and basic documentation for the individual microservices is available.

Microservice architecture Sock Shop was designed with common architectural patterns in mind. Aderaldo et al. mention service discovery, database per service, and messaging as examples.

Load tests Sock Shop already offers extensive Locust [Loc] load testing scripts, which simulate users registering, logging in, browsing the catalogue, ordering socks, and even simulating credit cards in checkout.

Autoscaling Sock Shop offers horizontal scaling scripts for Kubernetes [Kubb]. Therefore, one of the key aspects of the microservice architecture, high elasticity and automated adaption to load, is thoroughly represented.

Monitoring Zipkin is integrated in some of the microservices and collects traces throughout the system. A trace shows how a request propagates through the microservice system.

5.3.3. Kubernetes, Heapster, and InfluxDB

Kubernetes [Kubb] is an open-source system which offers functionality for deployment, orchestration, and scaling of microservice containers.

Heapster [Heab] is an open-source software which enables container cluster monitoring and performance analysis of the individual containers. In the test setup of this thesis, Heapster runs on the Kubernetes cluster and collects basic performance metrics such as CPU usage, memory usage, or network load on a container level.

The data collected by the load testing tool Locust [Loc] and the data of Heapster are stored in an InfluxDB. InfluxDB [Inf] is an open-source time series database. A time series database is a database which specializes in storing and querying data which has an ordered time dimension. Since the obtained metric measurements are essentially tuples of time and value, it is sensible to use a time series database like InfluxDB in this setup.

5.3.4. Locust - a Python load testing tool

Locust [Loc] is an open-source load testing tool.
It is used to put the SUT under a constant load. The load testing script of this thesis is inspired by the load testing scripts which Sock Shop offers. To make sure that variance in recorded performance metrics is not a result of high variance in the load which was put on the SUT, it is important that the load testing tool is able to put the SUT under a steady load with low variance. The behavior of the load during measurements is evaluated in detail in Section 6.2.1.

5.3.5. R

The R project for statistical computing [R] is a software environment for statistical calculations and plotting. It offers a TCP server implementation called Rserve. In this thesis R is used for all kinds of calculations, such as mean, median, standard deviation, the apriori algorithm, and principal component analysis. Since the libraries are well documented and tested, the use of R helps to avoid implementation errors. Even simple calculations such as mean or variance carry a risk of being implemented incorrectly when applied to metrics such as available filesystem space in bytes over 400 measurements.

5.3.6. Regression injection

Concerning regression injection, this thesis strives to insert common and realistic performance regressions into the microservice system. The regressions were inserted into the carts microservice. The carts microservice was chosen because it is written in Java and is easy to adapt. It is connected to a MongoDB database. The carts microservice was also chosen for the general research on metric behavior in Section 5.5.2. The injected regressions were inspired by performance antipatterns, as researched for example by Smith and Williams [SW03], as well as by other performance regression detection research papers [MHH13] [Ngu+12]. The implementations of the antipatterns "The Ramp" and "One-Lane Bridge" were inspired by pseudo-code from a paper of Keck et al. [Kec+16]. Table 5.2 shows a tabular overview of the different kinds of regressions which were inserted.
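To illustrate the principle behind two of these antipatterns, the following Python sketch shows a one-lane bridge (a semaphore that serializes an otherwise concurrent code path) and a ramp (a delay that grows with the number of handled requests). It is not the code that was injected into the Java carts microservice; the class names, the delay step, and the counters are hypothetical and exist only for illustration.

```python
import threading
import time


class OneLaneBridge:
    """Serializes a code path that could otherwise run concurrently."""

    def __init__(self, lanes=1):
        self._gate = threading.Semaphore(lanes)  # artificial bottleneck
        self._lock = threading.Lock()            # protects the counters
        self.active = 0
        self.max_active = 0

    def handle(self, work):
        with self._gate:
            with self._lock:
                self.active += 1
                self.max_active = max(self.max_active, self.active)
            try:
                return work()
            finally:
                with self._lock:
                    self.active -= 1


class Ramp:
    """Adds a delay that grows with every handled request."""

    def __init__(self, step_seconds=0.001):
        self.step_seconds = step_seconds  # hypothetical growth per call
        self.calls = 0
        self._lock = threading.Lock()

    def delay_for_next_call(self):
        with self._lock:
            self.calls += 1
            return self.calls * self.step_seconds

    def handle(self, work):
        time.sleep(self.delay_for_next_call())
        return work()
```

With a single lane, concurrent callers queue up behind the semaphore, which shows up as increased response times under load, while the ramp degrades response times gradually over the duration of a test run.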
5.4. Metrics

This section answers research question RQ1.1 "Which metrics are available and commonly collected?". It describes which metrics are commonly available and explains what the different metrics measure. It gives reasons why some metrics are ignored in the performance regression detection setup.

Table 5.2.: The different kinds of injected regressions

| Type of regression | Description |
| --- | --- |
| System print [Ngu+12] | Adding unnecessary logs to the system standard output. Comparable to setting a wrong logging level in production. |
| DB connection [Ngu+12] | A wrong configuration in the database client limits the number of concurrent open connections to 1, as opposed to the default of 100. |
| One-lane bridge [Kec+16] | Occurs when a concurrency bottleneck exists in the program. To inject this regression, semaphores were added at critical points in the program. |
| Ramp [Kec+16] | Occurs when processing time increases over time. This regression was artificially injected, inspired by [Kec+16]. |
| Unnecessary processing [SW03] | A calculation does some heavy processing which would not be necessary, for example evaluating additional data after a searched result has already been found. |
| Increased memory usage [Ngu+12] | The system's memory usage increases significantly. This may be due to a so-called memory leak, which prevents increasing amounts of memory from being released. |

5.4.1. CPU

CPU metrics are collected by use of Heapster [Heab], the default container cluster monitoring tool of Kubernetes [Kubb]. In Kubernetes, an important unit concerning CPU usage is millicores (m). Millicores describe a fractional usage of one single core, vCore or hyperthread, depending on the base system. A container which requests 100 m is guaranteed half as much CPU as one asking for 200 m [Kubd]. The available CPU metrics on pod level are listed in Table 5.3 [Heaa]. The metric cpu/usage_rate can be considered to be the microservice equivalent of common CPU usage.
The metric cpu/usage is not considered in this work, since its cumulative nature would require special handling. Furthermore, its data represents an integral over cpu/usage_rate.

Table 5.3.: Available CPU metrics in Heapster
limit: The limit of available millicores for the pod.
request: The guaranteed amount of available resources measured in millicores.
usage: The cumulative CPU usage on all cores.
usage_rate: The CPU usage of all cores at a certain point in time measured in millicores.

Table 5.4.: Available memory metrics in Heapster
limit: The limit of available memory for the pod measured in bytes.
major_page_faults: The cumulative number of major page faults of the pod. A major page fault occurs when memory has to be loaded from the disk.
major_page_faults_rate: The number of major page faults which occurred in the pod in one certain second. A major page fault occurs when memory has to be loaded from the disk.
page_faults: The cumulative number of page faults of the pod. A page fault occurs when memory is already available but has to be mapped by the operating system.
page_faults_rate: The number of page faults which occurred in the pod in one certain second. A page fault occurs when memory is already available but has to be mapped by the operating system.
request: The guaranteed amount of memory resources to be available measured in bytes.
usage: The total memory usage of the system measured in bytes.
working_set: The working set of the pod measured in bytes. The working set describes all referenced memory of the pod.

5.4.2. Memory

Memory metrics are collected by use of Heapster [Heab], the default container cluster monitoring tool of Kubernetes [Kubb]. If feasible, the unit of memory metrics is bytes. The available memory metrics on pod level are listed in Table 5.4 [Heaa].

Table 5.5.: Available filesystem metrics in Heapster
usage: The total number of bytes used on the filesystem.
limit: The total size of the filesystem.
available: The number of remaining bytes in the filesystem.

Table 5.6.: Available network metrics in Heapster
rx: The total number of incoming network bytes.
rx_errors: The total number of errors concerning incoming traffic.
rx_errors_rate: The number of errors concerning incoming traffic per second.
rx_rate: The number of incoming network bytes per second.
tx: The total number of outgoing network bytes.
tx_errors: The total number of errors concerning outgoing traffic.
tx_errors_rate: The number of errors concerning outgoing traffic per second.
tx_rate: The number of outgoing network bytes per second.

The memory metrics major_page_faults and page_faults are not considered in this work, since their cumulative nature would require special handling. Furthermore, their data is represented in the corresponding rate metrics.

5.4.3. Filesystem

Filesystem metrics are collected by use of Heapster [Heab], the default container cluster monitoring tool of Kubernetes [Kubb]. If feasible, the unit of filesystem metrics is bytes. The available filesystem metrics on pod level are listed in Table 5.5 [Heaa].

5.4.4. Network

Network metrics are collected by use of Heapster [Heab], the default container cluster monitoring tool of Kubernetes [Kubb]. If feasible, the unit of network metrics is bytes. The available network metrics on pod level are listed in Table 5.6 [Heaa].

Table 5.7.: Collected response metrics in Locust
status_code: The HTTP status code of the response.
reason: The Reason-Phrase of the response (OK/Accepted/Not found/...).
url: The full request URL of the request.
path_url: The relative URL of the request.
method: The method of the request (GET/POST/DELETE).
elapsed: The elapsed time of the request in seconds.

5.4.5. Response time

Response time metrics are collected by use of the load driver Locust [Loc]. To implement this logging, code that establishes a connection to the InfluxDB and writes to it was added to the Python load testing script.
The collected metrics are shown in Table 5.7.

5.5. Description of results

The research question of which metrics are commonly collected in a microservice setup was already answered in Section 5.4. Section 5.5.1 describes work on research questions RQ1.2 and RQ1.3. Section 5.5.2 describes work on research question RQ1.4.

5.5.1. General metrics behavior

Deviation of measurements during runs

To answer research question RQ1.2 "How stable are metrics during a run?", the distributions of the individual relevant measurements were examined. The corresponding plots can be found in Appendix A. Figure 5.3 is shown as an example of those plots. The plot shows the distribution of the measurements divided by their median value. This is done to make relative deviations easier to recognize. The measurements exclude a 40-minute startup time to avoid showing influences of data which were not in steady state. The red lines show the 25% and 75% quantiles. None of the observed metrics deviated strongly from its median value. Only memory/page_faults_rate had some more significant outliers. No deviations were found that would cause major issues for performance regression detection.

Figure 5.3.: Cpu/usage_rate distribution relative to median value (density plot, N = 200, bandwidth = 0.01094)

Normal distributed metrics

Nguyen et al. [Ngu+12] mention approaches to filter performance measurements so that the resulting data is normally distributed. They do this because their approach is based on control charts, which need normally distributed data as input. In their work, they saw that around 88% of their runs have a bi-modal distribution. 66% of the runs are not normally distributed according to the Shapiro-Wilk test (α = 0.05). With their filtering approach, they could raise the number of runs which passed the Shapiro-Wilk test to 91%.
The main idea behind their filtering approach was that they expected the bi-modal distribution to result from phases in which the system idles and performs bookkeeping tasks. Therefore, they propose removing the data which is represented in the first peak of the distribution of the measurements.

To evaluate whether their findings can be reproduced in the microservice environment of this thesis, the measurements of the individual runs of the 19 redeployments were evaluated. The findings are depicted in Table 5.8. The first row shows the number of metric runs. This term describes one set of measurements for a single metric and run. A majority of those metric runs belong to metrics which are not relevant because they show no variance. Examples are metrics such as cpu/request, cpu/limit, and memory/limit, which are static during runtime, or metrics such as filesystem/available for microservices which do not interact with the filesystem.

Table 5.8.: Normal distribution findings
number of metric runs: 548
runs with variance ≠ 0: 110
runs not normal before filtering: 110
runs not normal after filtering: 97

All 110 metric runs with non-zero variance show a p-value ≤ 0.05 when performing a Shapiro-Wilk test and are therefore not considered to be normally distributed. After applying the filtering technique, 97 of those 110 still do not pass the Shapiro-Wilk test. Those findings are mostly expected. The startup and shutdown phases of the load tests had already been removed from the data set, and the system was put under static load. The idle phases which the filtering tries to eliminate should therefore already have been removed. Nonetheless, this shows that the majority of collected metrics cannot be considered to be normally distributed. Two of the described approaches, the Student t-test and the control chart based approach, are therefore not applicable, at least on a theoretical level.

5.5.2.
Metrics behavior with redeployments

Behavior of CPU metrics

Figure 5.4 shows the behavior of the cpu/usage_rate metric across different redeployments of the same system configuration in the carts microservice. The median is built upon the test data with the first 40 minutes removed (5.2.1, A.1). 40 minutes into the test runs, no trends are observable in the cpu/usage_rate any more. Each point shows the median value during one single test run. The red line shows the mean of the median values of the runs for orientation purposes. The range between the maximum and minimum median value of those 19 test runs is 123 millicores. Compared to the mean median value of 113 millicores, this range is highly relevant. The variance of the data set is 703 and therefore comparably high. The plot suggests that the recorded data has patterns. The median values strictly alternate between higher and lower values. Furthermore, three clusters are visible throughout the test runs. Section 6.2.2 offers a possible explanation for the alternating behavior of the measurements, although the reason for the different clusters remains unclear.

Figure 5.4.: Cpu/usage_rate median in different deployments (median cpu/usage_rate in millicores per test run)

Figure 5.5 shows the relative deviations of the median CPU measurements throughout the runs compared to the relative deviations of the median requests per minute of the load driver. It is clearly visible that the relative deviations of the cpu/usage_rate are significantly higher than the relative deviations of the requests per minute of the load driver.

Behavior of memory metrics

Figure 5.6 shows plots of the memory usage of the 19 test runs. Every line shows the observed behavior of one single test run.
This depiction is used because the memory/usage does not reach a clear steady state. As is clearly visible, the memory/usage plots all show a trend. Therefore, a visualization of the median values cannot give a clear picture of the behavior. The figure shows that memory usage varies enormously between different deployments. Figure 5.7 shows the distributions of the absolute relative deviation of the median values of the runs. For orientation purposes, the first violin chart shows the distribution of the median values of the requests per minute in the different deployments. The figure shows significant differences between the distribution of the requests per minute and those of the memory usage and the memory working set. Although the median requests per minute only deviate by a maximum of around 3%, the median measurements of memory usage and working set deviate by a maximum of around 14% during the recorded test runs. This is a finding which could be a topic for future research.

Figure 5.5.: Relative deviations of median CPU measurements during runs compared to requests per minute of the load driver

Behavior of filesystem metrics

None of the observed microservices produced significant measurements concerning filesystem usage. The load testing scripts only worked with a fixed number of articles and used only a fixed number of user accounts. This setting was chosen to reduce the effect of random action selection. Therefore, no conclusions concerning filesystem metrics can be drawn at this point.

Behavior of network metrics

Figure 5.8 shows plots of the network rx and tx rate of the 19 test runs. Every line shows the observed behavior of one single test run.
As opposed to the memory usage and working set, no clear relationship between the different runs is visible. Even with different kinds of smoothing, the plots do not suggest that the noisy behavior is similar between the different runs. Nonetheless, one can see that the measurements for network rx and tx rate do not show trends in the different runs. Figure 5.9 shows the distributions of the absolute relative deviation of the median values of the runs. For orientation purposes, the first violin chart shows the distribution of the median values of the requests per minute in the different deployments. The figure shows that there is no significant difference between the relative deviations of the network metrics and those of the requests per minute of the load driver. This suggests that the deviations of network metrics may have been caused only by the variance of the load during the different runs.

Figure 5.6.: Memory usage and working set behavior during different deployments (two panels: memory/usage and memory/working_set in mebibytes over measurements)

Figure 5.7.: Relative deviations of median memory measurements during runs compared to requests per minute of the load driver

5.5.3. Approaches

The following tables show, for each approach, how many of the 39 test cases in total led to which result. R1 to R6 represent the different types of regressions. They stand for:

R1: mongo database misconfiguration
R2: unnecessary processing
R3: an injection of the one-lane bridge antipattern
R4: unnecessary memory usage
R5: unnecessary system prints
R6: an injection of the ramp antipattern

Figure 5.8.: Network rx and tx rate behavior during different deployments (two panels: network/rx_rate and network/tx_rate in kebibytes over measurements)

Figure 5.9.: Relative deviations of median network measurements during runs compared to requests per minute of the load driver

Since the load testing script focused on putting load on the carts microservice, some parts of the system were under nearly no load. Since some approaches struggle with such measurements, measurements with a median cpu/usage_rate below 10 millicores were removed. Furthermore, the network tx_rate and rx_rate triggered a lot of false positives. These metrics were therefore ignored by the approaches. The zipkin microservices were ignored as well because their load was too low.

If an approach issued one or more performance regression alerts, a regression was considered to be found. If no regression alert was triggered, the approach reported no regression in the system.

Concerning the size of the regressions, R2, R4 and R6 triggered significant changes to some metrics' measurements and were expected to be easy to spot. R1, R3 and R5 did
not have too significant an influence on the system's behavior. Those three were expected to be useful for a more fine-grained evaluation of the sensitivity of the approaches. Table 5.9 shows the resulting cumulative true positives, true negatives, false positives, false negatives, precision, recall, and F-measure values of the approaches.

Table 5.9.: Performance evaluation of the four performance regression detection approaches

| Approach | t-test-based | process-control-based | signature-based | mining-repositories-based |
| --- | --- | --- | --- | --- |
| TP | 24 | 16 | 16 | 7 |
| TN | 0 | 9 | 8 | 10 |
| FP | 15 | 6 | 7 | 5 |
| FN | 0 | 8 | 8 | 17 |
| Precision | 63% | 73% | 70% | 58% |
| Recall | 100% | 67% | 67% | 29% |
| F-measure | 77% | 70% | 68% | 39% |
| Time needed | 3 min | 3 min | 2 min | 26 min |

Table 5.10.: Performance of Student t-test regression detection

| | No change | R1 | R2 | R3 | R4 | R5 | R6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Regression detected | 15 | 4 | 4 | 4 | 4 | 4 | 4 |
| No regression detected | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Student t-test based performance regression detection

Table 5.10 shows the detection results of the Student t-test-based regression detection. The evaluation metrics of the approach are depicted in Table 5.9. The Student t-test did not perform well in this evaluation. With an α-value of 0.05, the Student t-test issued a performance regression alert for every test run. The number of alerts per test set varied between 46 and 84. The approach is therefore far from reaching the zero alerts needed to report no regression. The approach has the overall highest F-measure. This is caused by the 100% recall, which the approach reached by reporting every test run as a performance regression.

Statistical process control techniques using machine learning
Table 5.11.: Performance of statistical process control regression detection

| | No change | R1 | R2 | R3 | R4 | R5 | R6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Regression detected | 6 | 2 | 4 | 2 | 2 | 2 | 4 |
| No regression detected | 9 | 2 | 0 | 2 | 2 | 2 | 0 |

Table 5.12.: Performance of signature-based performance regression detection

| | No change | R1 | R2 | R3 | R4 | R5 | R6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Regression detected | 7 | 0 | 1 | 4 | 3 | 4 | 4 |
| No regression detected | 8 | 4 | 3 | 0 | 1 | 0 | 0 |

The results of the evaluation runs of the statistical process control-based regression detection approach are depicted in Table 5.11. It detected 16 of the 24 runs with regressions, although six false positive performance regression alerts were issued as well. Every type of regression was detected at least once. As expected, the more significant regressions R2 and R6 were detected in each of the 4 runs. Although R4, increased memory usage, was considered easy to spot as well, only 2 of the 4 runs were categorized correctly. Overall, the approach performed better than the Student t-test. The evaluation metrics for the statistical process control-based approach are depicted in Table 5.9. With 73%, the approach has the overall highest observed precision. It reached a recall of 67%.

Signature-based performance regression detection

The detection results of the signature-based approach are depicted in Table 5.12. It was one of the fastest regression detection approaches. It built between 10 and 13 principal components for signature generation. The evaluation metrics for the signature-based approach are depicted in Table 5.9. The results show that this approach performs quite similarly to the statistical process control-based approach. Its performance metrics are only slightly lower. These differences are not significant.
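For reference, the precision, recall, and F-measure values follow from the counts of true positives, false positives, and false negatives. The following short Python sketch applies the standard definitions to the counts of the process-control-based approach (TP = 16, FP = 6, FN = 8); the function name is only illustrative.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, F-measure) from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts of the process-control-based approach
precision, recall, f_measure = detection_scores(tp=16, fp=6, fn=8)
print(f"{precision:.0%} {recall:.0%} {f_measure:.0%}")  # prints "73% 67% 70%"
```

Because the F-measure is the harmonic mean of precision and recall, an approach that raises an alert for every run reaches a recall of 100% and can therefore obtain a high F-measure despite many false positives.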
One of the main reasons why the signature-based approach did not perform better is probably the high and changing number of principal components that had to be considered to reach a cumulative variance of 90%. The number of principal components used varied between 10 and 13. The order of the principal components is relevant to this approach, but it changed between runs as well. As a result, only a small number of metrics per principal component matched between runs, which limited the regression detection considerably.

Mining performance regression testing repositories

The detection results of the mining performance regression testing repositories approach are shown in Table 5.13. Its evaluation metrics are depicted in Table 5.9.

Table 5.13.: Mining performance regression testing repositories performance

| | No change | R1 | R2 | R3 | R4 | R5 | R6 |
|---|---|---|---|---|---|---|---|
| Regression detected | 5 | 0 | 4 | 1 | 0 | 2 | 0 |
| No regression detected | 10 | 4 | 0 | 3 | 4 | 2 | 4 |

The approach of mining performance regression testing repositories did not perform well. Its precision of 58%, recall of 29%, and F-measure of 39% were the worst of all evaluated approaches. Three out of six types of regressions were not detected at all (R1, R4, and R6). Two of those three, R4 and R6, were in the category of high significance and should have been easy to spot. Although the approach had the highest number of true negatives in this evaluation, it seems that it simply had a bias towards reporting no regression.

Additionally, this approach was significantly slower. The reason is the Apriori algorithm it uses, which does not scale well with an increasing number of input elements. For an isolated evaluation of a single microservice's metrics, the time the approach needed was not significantly longer than that of the other approaches. In such a setup, all approaches needed around 2 minutes.
In the full evaluation, the approach needed a lot of RAM (4 GB) and time (26 min). The other approaches needed 3 minutes or less and 2 GB of RAM each.

5.6. Discussion of results

5.6.1. General metrics behavior

The findings suggest that there is no unexpected behavior of performance metrics during a run. The measurements did not deviate significantly. Therefore, the existing deviations should not pose a challenge to the performance regression detection techniques.

5.6.2. Metrics behavior with redeployments

Results for CPU metrics

The observed behavior of the cpu/usage_rate concerning redeployments has major implications for performance regression detection in microservice setups that is based on cpu/usage_rate. The difference in metric results between redeployments is high enough to trigger performance regression detection approaches, although the system was not changed at all. Furthermore, a CPU performance regression may be hidden if the cpu/usage_rate settles at a lower level in a particular test deployment.

A possible but very time-expensive solution to this problem would be to perform several test runs across several redeployments. This solution would, however, conflict with the very frequent changes that can be observed in microservice systems.

Further research should be done on why such behavior can be observed. What influences the metrics? Is a differing distribution of the microservices across the nodes of the cluster a reason for the high variance? Does the busy cluster lead to such high noise in the recorded median values? Which alternative CPU metrics could be collected to reach more stable measurements between redeployments?

Results for memory metrics

The observed behavior of the memory/usage has major implications for performing performance regression detection. The measurements of the different runs settle at different levels. This leads to similar issues as with the cpu/usage_rate.
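The size of the deviations between redeployments can be illustrated with the per-run medians of cpu/usage_rate listed in Table A.1. The following Python sketch is only an illustration: the indicator `relative_spread` (range of the per-run medians divided by their overall median) is an assumption made here for demonstration and is not a measure used in the evaluation.

```python
from statistics import median

# Per-run medians of cpu/usage_rate in millicores, copied from Table A.1.
run_medians = {
    "carts": [133.5, 110, 115, 107, 114, 108, 132, 121, 134, 123,
              132, 120, 130, 112, 114, 107, 111, 106, 110],
    "catalogue": [34, 33, 34, 34, 35, 34, 35, 34, 34, 33,
                  34, 33, 34, 33, 34, 34, 34, 33, 35],
}


def relative_spread(values):
    """Range of the per-run medians divided by their overall median."""
    return (max(values) - min(values)) / median(values)


for service, values in run_medians.items():
    print(f"{service}: {relative_spread(values):.1%}")
```

For these two services, the medians of carts differ by roughly a quarter of their overall median across redeployments, whereas those of catalogue stay within a few percent. A detection approach that compares a new run against a single reference run would therefore see very different baselines for carts, depending on which redeployment is chosen.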
Results for network metrics

The observed behavior of the network metrics does not have major implications for performing performance regression detection. The measurements of the different runs vary approximately as much as the load produced by the load driver.

5.6.3. Approaches

Student t-test based performance regression detection

These results are a good example of why the F-measure cannot be trusted blindly for evaluation. In the set of test cases, the numbers of tests with and without a regression are approximately equal. In reality, one would expect that the large majority of changes to a software system do not introduce performance regressions. In such a more realistic setup, the precision of the approach would drop significantly. In the current setup, the precision of the approach is already the second worst. It would be even worse with a more realistic ratio of tests without and with regressions.

The approach probably reports so many performance regressions because it requires normally distributed data. As this work showed, the performance measurements of the microservice environment were not normally distributed. Furthermore, the deviations of metrics resulting from redeployments probably had a negative influence as well.

The Student t-test-based approach to performance regression detection in microservice systems is considered not applicable at all. A performance regression detection approach with such a high number of false positives would be of no value in a realistic setup.

Statistical process control techniques using machine learning

Although the precision of 73% is comparable to the findings of the original work [Ngu+12], the recall of 67% is quite low for a practical application. These findings are somewhat surprising, since it was observed that the input data was not normally distributed.
Normally distributed data sets are a requirement for control charts, on which the approach is based.

During development, it was observed that the sensitivity of the approach highly depends on the size of the load testing repository. Small repositories lead to a high number of false positives. It is suspected that one of the main reasons why this control chart-based approach outperformed the Student t-test so significantly is that it was able to make use of the whole load testing repository, while the Student t-test only used one test out of the repository as reference.

Since this approach is easy to understand and had some promising results, it may be a possible solution for performance regression detection in microservice environments. Nonetheless, the observed performance is considered to be too low for practical applications.

Signature-based performance regression detection

The approach was very fast and may therefore be promising for future research. Nonetheless, a solution for the failing mapping of old and new principal components would have to be found. One major disadvantage of the approach is that the weights of metrics in a principal component are a concept that is not as easily understood as, for example, violation ratios. Therefore, it is sometimes not easy to understand why a performance regression was reported.

Mining performance regression testing repositories

It is suspected that the poor performance concerning true positives is also caused by the large data set. The more different measurements have to be considered when extracting the association rules, the more rules may be lost compared to a smaller subset of such measurements. The thresholds for selecting rules are support, meaning the relative proportion of the data set in which the rule occurs, and confidence, describing the probability that the conclusion of the rule holds true. Support may be sensitive to increasing data set sizes.
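The two thresholds can be made concrete with a small example. The following Python sketch computes support and confidence for a single association rule according to the definitions above. The observations and the rule are hypothetical and were chosen only for illustration; they are not taken from the measurements of this work.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, premise, conclusion):
    """Estimated probability that `conclusion` holds when `premise` holds."""
    rule_items = set(premise) | set(conclusion)
    return support(transactions, rule_items) / support(transactions, premise)


# Hypothetical discretized observations: one set of metric levels per time slice.
observations = [
    {"cpu=high", "mem=high", "resp=high"},
    {"cpu=high", "mem=high", "resp=high"},
    {"cpu=high", "mem=low", "resp=medium"},
    {"cpu=low", "mem=low", "resp=low"},
]

premise = {"cpu=high", "mem=high"}
conclusion = {"resp=high"}

print(support(observations, premise | conclusion))   # 0.5
print(confidence(observations, premise, conclusion)) # 1.0
```

If two further observations that contain none of the rule's items were appended, the support of the rule would drop from 0.5 to about 0.33 while its confidence would stay at 1.0. With a fixed support threshold, a rule can therefore be discarded merely because the data set grew.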
For smaller test evaluations, the observed performance of the approach was better. Nonetheless, because of its inability to scale, it does not make sense to apply the approach to a setting like microservice environments.

Chapter 6 Threats to validity

This chapter gives an overview of the different threats to validity of this work. For every threat, an explanation is provided of what was done to reduce its impact.

6.1. External validity

External validity describes how far the findings of a work can be generalized. The remainder of this section will focus on such possible threats [Woh+12].

6.1.1. Sock shop environment

The Sock Shop system [Soc] is, as Aderaldo et al. [Ade+17] mention in their work, not a perfect microservice system. They conclude that no existing system is currently suitable as a community-wide benchmark. Some of the conclusions of this thesis may be influenced by the chosen system under test and describe special properties of the Sock Shop system. Distributions and measurements may be unique to this system. Since Sock Shop was implemented as a microservice demonstration platform, the associated risk is considered to be reasonably low.

6.1.2. Kubernetes and Docker

This work focuses on the container orchestration tool Kubernetes and its default monitoring tool Heapster. Kubernetes is built upon the virtualization of Docker. Some findings of this work depend on the implementation of Kubernetes (Section 6.2.2). Furthermore, the metrics which are available and collected depend on the virtualization and the monitoring tool used as well. The findings of this work cannot easily be generalized to other orchestration, monitoring, or virtualization tools.

6.1.3. Regression injection

The injected performance regressions of this thesis are of artificial nature.
When evaluating a single approach, this may introduce bias, because artificial regressions can be constructed for which a selected approach works especially well. Since this thesis focuses on an empirical comparison between approaches, the experimenters have no preferred approach that is meant to work especially well. Furthermore, the resulting values are not used to benchmark the approaches independently. The resulting information is only used as an ordinal measure, comparing how well the single approaches behave in direct comparison. These points mitigate the influence of the artificial nature of the performance regressions. In terms of generalization, it is unclear how far the findings of this work can be used for detecting real regressions in a system.

6.1.4. Approach reimplementations

The selected approaches had to be reimplemented, because no reference implementations were available. The implementation depends heavily on the referenced research papers. Inaccuracies in the description of approaches, or misinterpretation, may have led to an implementation which does not behave as the original authors intended. Differences in observations may be based on this fact.

The quality of the description of the approaches was considered in the selection of approaches. The approaches which were implemented seemed to be clear in their description in most parts. Some minor parts which were not described in detail were either not relevant or implemented based on an educated guess.

6.2. Internal validity

Internal validity focuses on the question of to which extent the conclusions of a work are valid. It describes whether the performed actions really caused the observed effect [Woh+12].

6.2.1. Artificial load

When performing load tests, one frequent issue concerning the validity of results is whether the system was put under a steady load. Otherwise, the reason for observed deviations could lie in variances of the input load.
Sometimes, the test drivers, which simulate users putting load on the system, are not able to put the system under the requested load. Variance in the input load leads to variance in the measured metrics. This is of special relevance to the observations in Section 5.5.2.

Figure 6.1.: Requests per minute median in different deployments (plot of the median requests per minute for different redeployments; x-axis: test run, y-axis: requests per minute)

Figure 6.1 shows the median requests per minute of the Locust load driver. In all runs, the requests per minute reached a nearly constant level after a short warm-up time. A comparison between median values therefore seems sensible. The standard deviation of the data set is 48, which is very low in comparison to the mean of 3653 (roughly 1.3% of the mean). Therefore, the deviations of the input load can be considered to be low. Correlations with the patterns observed in Figure 5.4 are not obvious. Although not perfectly constant, the median values lie within a small range of 172 requests per minute. Therefore, the load is considered to be of sufficient stability between redeployments.

6.2.2. Load on cluster

The cluster itself should not be put under too high a load. If the cluster is not able to cater to the resource requests of the SUT, measurements may be influenced by this fact.

Figure 6.2 shows the CPU usage of the three nodes of the Kubernetes cluster throughout the load testing concerning the metric research. The x-axis shows the sequential measurements, while the y-axis shows the relative CPU usage of the whole system.

Several things can be observed. First of all, the different redeployments can be seen. At the start of each redeployment, all three nodes show a high CPU peak. Between the redeployments, the nodes all reach a CPU usage level which is not critically high.
The Kubernetes master shows the highest stationary CPU usage with a usage rate of approximately 60% during a load test.

The second plot shows an excerpt of the nodes' CPU usage rates. A clearly alternating behavior is visible. When the third node (green) has comparably high CPU usage, the first node has a low usage rate. The same holds in reverse for high CPU usage of the first node and low CPU usage of the third node. This probably depends on the distribution of the single microservices on the different nodes. This behavior may give at least to some degree an explanation for the behavior observed in Figure 5.4.

The scheduler of Kubernetes, which decides on where to deploy the single microservices, uses several rules like spreading pods of the same replication controller over several nodes and keeping resource utilization on the different nodes balanced [Kubc]. Beginning with version 1.4 of Kubernetes, controls exist that allow manually influencing how pods are distributed [Kuba]. It may be necessary to restrict this flexibility of Kubernetes to reach more consistent results.

6.2.3. Busy cluster

The load tests of this thesis were performed on an OpenStack cluster of the University of Stuttgart. This cluster is being used for other purposes as well. The test environment therefore was based on a so-called busy cluster. Although the system was under no other load on the level of virtualization, measurements may be influenced by that fact. This may be considered an advantage as well, since realistic environments can be considered busy as well.

Figure 6.2.: Node CPU usage throughout series of load tests (two plots of cpu/node_utilization on different nodes, the second one being an excerpt; x-axis: measurements, y-axis: CPU usage of the node; nodes: 10.0.14.15, 10.0.14.16 (Kubernetes master), 10.0.14.17)

Chapter 7 Conclusion

7.1. Summary

After giving an introduction to the field of performance regression detection and to the field of microservices, this work has given an in-depth overview of research work on approaches to performance regression detection. The different approaches were compared to each other and a subset of them was implemented for a subsequent evaluation.

Basic research on the behavior of software performance metrics between redeployments of a microservice system has been performed. The measurements showed that some key metrics such as CPU usage rate and memory usage show significant deviations resulting from redeployments.

The evaluation of the different performance regression detection approaches in the microservice environment showed that some existing approaches do not perform well considering recall and precision of the regression detection. Although some approaches, such as performance signature-based regression detection and control chart-based regression detection, performed better than others, the results render no approach fit for practical use. In the evaluation, the most promising approach was able to reach a recall of 67% and a precision of 73%.

7.2. Discussion

The most interesting finding of this thesis is the so far unknown performance metric deviations resulting from redeployments of a microservice system. Although redeployments of the whole system are not commonly performed in a microservice environment, these findings show that microservice performance measurements have additional parameters influencing them, compared to measurements in a monolithic application.
This work concludes that the deviations resulting from decisions of the Kubernetes scheduler, which decides how to distribute the different microservices across the nodes of the cluster, strongly influence the performance of performance regression detection. If this conclusion is true, the performance measurements of one microservice may be influenced by other microservices which have no connection to it on a software level, but are merely deployed to the same node.

Microservice environments offer some opportunities to software performance engineers, such as the isolated observation of performance metrics in a single microservice. But there are some open and even unknown challenges, like these deviations of measurements that are probably caused by the Kubernetes scheduler.

A more general understanding of microservice performance and the behavior of microservice performance metrics would help in building efficient performance regression detection approaches for microservices. This work showed that performance regression detection needs special solutions in the field of microservice environments. In the following section, some possible future work is presented.

7.3. Future work

Future work could focus on how and to what extent the performance measurements of single microservice instances can be influenced by microservices with high load on the same node.

A possible solution for avoiding the high deviations in measurements between redeployments would be to perform the regression detection by only redeploying the changed microservices. It is unclear whether and to what extent past developments in the microservice environment would influence such measurements.

Another approach would be to focus on the performance evaluation of single dedicated microservice instances. Challenges in such research would be how to avoid having to build extensive stubs for such an isolated evaluation, or how to minimize the influence of other microservices when evaluating in a real microservice environment.
In this work, possible approaches to measuring scalability, elasticity, and resilience were presented. Future work could evaluate how well those measurements work in a microservice environment.

Chapter 8 Acknowledgements

This work would not have been possible without André van Hoorn and Teerat Pitakrat of the Reliable Software Systems Research Group at the University of Stuttgart. In particular, without the technical support of Teerat Pitakrat, setting up the microservice environment would have been a lot more work. Additionally, I would like to thank Cor-Paul Bezemer, who gave valuable feedback and inspired the research on microservice performance metrics behavior. Last but not least, I want to express my thanks to Alberto Avritzer, who helped polish this work with his feedback.

Bibliography

[Ade+17] C. M. Aderaldo, N. C. Mendonça, C. Pahl, P. Jamshidi. “Benchmark requirements for microservices architecture research.” In: Proceedings of the 1st International Workshop on Establishing the Community-Wide Infrastructure for Architecture-Based Software Engineering. IEEE Press. 2017, pp. 8–13.
[Ama+15] M. Amaral, J. Polo, D. Carrera, I. Mohomed, M. Unuvar, M. Steinder. “Performance Evaluation of Microservices Architectures Using Containers.” In: Proceedings of the 2015 IEEE 14th International Symposium on Network Computing and Applications. 2015, pp. 27–34.
[Bez+14] C. Bezemer, E. Milon, A. Zaidman, J. Pouwelse. “Detecting and analyzing I/O performance regressions.” In: Journal of Software: Evolution and Process 26.12 (2014), pp. 1193–1212.
[BHJ16] A. Balalaie, A. Heydarnoori, P. Jamshidi. “Microservices Architecture Enables DevOps: Migration to a Cloud-Native Architecture.” In: IEEE Software 33.3 (2016), pp. 42–52. ISSN: 0740-7459.
[BPG15] C. P. Bezemer, J. Pouwelse, B. Gregg. “Understanding software performance regressions using differential flame graphs.” In: Proceedings of the 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER). 2015, pp. 535–539.
[Cam+16] A. de Camargo, I. Salvadori, R. d. S. Mello, F. Siqueira. “An Architecture to Automate Performance Tests on Microservices.” In: Proceedings of the 18th International Conference on Information Integration and Web-based Applications and Services. iiWAS ’16. New York, NY, USA: ACM, 2016, pp. 422–429. ISBN: 978-1-4503-4807-2.
[CBK09] V. Chandola, A. Banerjee, V. Kumar. “Anomaly detection: A survey.” In: ACM computing surveys (CSUR) 41.3 (2009), p. 15.
[CG09] L. Crispin, J. Gregory. Agile testing: A practical guide for testers and agile teams. Pearson Education, 2009, pp. 276–279.
[Dbl] dblp: computer science bibliography. http://dblp.uni-trier.de/
[Dev] Performance Awareness in Software Development - Research @ D3S - Department of Distributed and Dependable Systems. http://d3s.mff.cuni.cz/research/development_awareness/
[DG06] J. Davis, M. Goadrich. “The Relationship Between Precision-Recall and ROC Curves.” In: Proceedings of the 23rd International Conference on Machine Learning. ICML ’06. ACM, 2006, pp. 233–240. ISBN: 1-59593-383-2.
[Dül17] T. F. Düllmann. “Performance anomaly detection in microservice architectures under continuous change.” MA thesis. 2017.
[Foo+10] K. C. Foo, Z. M. Jiang, B. Adams, A. E. Hassan, Y. Zou, P. Flora. “Mining Performance Regression Testing Repositories for Automated Performance Analysis.” In: Proceedings of the 10th International Conference on Quality Software, QSIC 2010, Zhangjiajie, China, 14-15 July 2010. 2010, pp. 32–41.
[Gha+13] S. Ghaith, M. Wang, P. Perry, J. Murphy. “Automatic, load-independent detection of performance regressions by transaction profiles.” In: Proceedings of the 2013 International Workshop on Joining AcadeMiA and Industry Contributions to testing Automation. ACM. 2013, pp. 59–64.
[Gha+16] S. Ghaith, M. Wang, P. Perry, Z. M. Jiang, P. O’Sullivan, J. Murphy. “Anomaly detection in performance regression testing by transaction profile estimation.” In: Software Testing, Verification and Reliability 26.1 (2016), pp. 4–39.
[GIM17] M. Gribaudo, M. Iacono, D. Manini. “Performance evaluation of massively distributed microservices based applications.” In: Proceedings of the 31st European Conference on Modelling and Simulation, ECMS 2017. European Council for Modelling and Simulation. 2017, pp. 598–604.
[HBRM16] S. Hosseini, K. Barker, J. E. Ramirez-Marquez. “A review of definitions and measures of system resilience.” In: Reliability Engineering & System Safety 145 (2016), pp. 47–61.
[Heaa] heapster/storage-schema.md at master kubernetes/heapster. https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
[Heab] kubernetes/heapster: Compute Resource Usage Analysis and Monitoring of Container Clusters. https://github.com/kubernetes/heapster
[Hei+17] R. Heinrich, A. van Hoorn, H. Knoche, F. Li, L. E. Lwakatare, C. Pahl, S. Schulte, J. Wettinger. “Performance Engineering for Microservices: Research Challenges and Directions.” In: Proceedings of the 8th ACM/SPEC on International Conference on Performance Engineering Companion. ICPE ’17 Companion. New York, NY, USA: ACM, 2017, pp. 223–226. ISBN: 978-1-4503-4899-7.
[HKR13] N. R. Herbst, S. Kounev, R. H. Reussner. “Elasticity in Cloud Computing: What It Is, and What It Is Not.” In: Proceedings of the ICAC. 2013, pp. 23–27.
[Hol13] E. Hollnagel. Resilience engineering in practice: A guidebook. Ashgate Publishing, Ltd., 2013.
[Hor+13] V. Horký, F. Haas, J. Kotrč, M. Lacina, P. Tůma. “Performance Regression Unit Testing: A Case Study.” In: Computer Performance Engineering: 10th European Workshop, EPEW 2013, Venice, Italy, September 16-17, 2013. Proceedings. Ed. by M. S. Balsamo, W. J. Knottenbelt, A. Marin. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 149–163. ISBN: 978-3-642-40725-3.
[Inf] influxdata/influxdb: Scalable datastore for metrics, events, and real-time analytics. https://github.com/influxdata/influxdb
[Isl+12] S. Islam, K. Lee, A. Fekete, A. Liu. “How a consumer can measure elasticity for cloud platforms.” In: Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering. ACM. 2012, pp. 85–96.
[Kec+16] P. Keck, A. van Hoorn, D. Okanović, T. Pitakrat, T. F. Düllmann. “Antipattern-based problem injection for assessing performance and reliability evaluation techniques.” In: Proceedings of the Software Reliability Engineering Workshops (ISSREW), 2016 IEEE International Symposium on. IEEE. 2016, pp. 64–70.
[Kuba] Assigning Pods to Nodes - Kubernetes.
[Kubb] Kubernetes - Production-Grade Container Orchestration. http://kubernetes.io
[Kubc] Kubernetes: Advanced Scheduling in Kubernetes.
[Kubd] Managing Compute Resources for Containers | Kubernetes. https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
[LEB15] S. Lehrig, H. Eikerling, S. Becker. “Scalability, elasticity, and efficiency in cloud computing: A systematic literature review of definitions and metrics.” In: Proceedings of the 2015 11th International ACM SIGSOFT Conference on Quality of Software Architectures (QoSA). 2015, pp. 83–92.
[LL13] J. Ludewig, H. Lichter. Software Engineering: Grundlagen, Menschen, Prozesse, Techniken. dpunkt.verlag, 2013.
[Loc] Locust - A modern load testing framework. http://locust.io/
[Mal10] H. Malik. “A methodology to support load test analysis.” In: Proceedings of the 2010 ACM/IEEE 32nd International Conference on Software Engineering. Vol. 2. 2010, pp. 421–424.
[Mei+07] J. Meier, C. Farre, P. Bansode, S. Barber, D. Rea. Performance testing guidance for web applications: patterns & practices. Microsoft Press, 2007.
[MHH13] H. Malik, H. Hemmati, A. E. Hassan. “Automatic detection of performance deviations in the load testing of Large Scale Systems.” In: Proceedings of the 2013 35th International Conference on Software Engineering (ICSE). 2013, pp. 1012–1021.
[Ngu+12] T. H. Nguyen, B. Adams, Z. M. Jiang, A. E. Hassan, M. Nasser, P. Flora. “Automated detection of performance regressions using statistical process control techniques.” In: Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering. ACM. 2012, pp. 299–310.
[Ope] Home » OpenStack Open Source Cloud Computing Software.
[Pas+15] A. Pasquini, M. Ragosta, I. A. Herrera, A. Vennesland. “Towards a Measure of Resilience.” In: Proceedings of the 5th International Conference on Application and Theory of Automation in Command and Control Systems. ATACCS ’15. New York, NY, USA: ACM, 2015, pp. 121–128. ISBN: 978-1-4503-3562-1.
[R] R: The R Project for Statistical Computing. https://www.r-project.org/
[Sha+15] W. Shang, A. E. Hassan, M. N. Nasser, P. Flora. “Automated Detection of Performance Regressions Using Regression Models on Clustered Performance Counters.” In: Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering, Austin, TX, USA, January 31 - February 4, 2015. 2015, pp. 15–26.
[Soc] Microservices Demo: Sock Shop. https://microservices-demo.github.io/
[SS06] R. H. Shumway, D. S. Stoffer. Time series analysis and its applications: with R examples. Springer Science & Business Media, 2006, pp. 23–29.
[Str12] L. Strigini. “Fault Tolerance and Resilience: Meanings, Measures and Assessment.” In: Resilience Assessment and Evaluation of Computing Systems. Ed. by K. Wolter, A. Avritzer, M. Vieira, A. van Moorsel. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 3–24. ISBN: 978-3-642-29032-9.
[SW03] C. U. Smith, L. G. Williams. “More new software performance antipatterns: Even more ways to shoot yourself in the foot.” In: Proceedings of the Computer Measurement Group Conference. 2003, pp. 717–725.
[THS11] W. T. Tsai, Y. Huang, Q. Shao. “Testing the scalability of SaaS applications.” In: Proceedings of the 2011 IEEE International Conference on Service-Oriented Computing and Applications (SOCA). 2011, pp. 1–4.
[Wel47] B. L. Welch. “The generalization of ‘Student’s’ problem when several different population variances are involved.” In: Biometrika 34.1/2 (1947), pp. 28–35.
[Wen17a] N. Wenzler. Automated Performance Regression Detection in Microservice Architectures - code base. Sept. 2017. DOI: 10.5281/zenodo.890936. URL: https://doi.org/10.5281/zenodo.890936
[Wen17b] N. Wenzler. Automated Performance Regression Detection in Microservice Architectures - Raw Measurements. Sept. 2017. DOI: 10.5281/zenodo.888699. URL: https://doi.org/10.5281/zenodo.888699
[Woh+12] C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, A. Wesslén. Experimentation in software engineering. Springer Science & Business Media, 2012, pp. 102–110.

All links were last followed on September 14, 2017.
Appendix A

Metric measurements

[Plot: cpu usage rate behaviour in different deployments; x-axis: testing time (minutes); y-axis: cpu/usage_rate (millicores)]
Figure A.1.: Cpu/usage_rate development in different deployments

[Plot: memory usage behaviour in different deployments; x-axis: measurements; y-axis: memory/usage (mebibytes)]
Figure A.2.: Memory/usage development in different deployments

[Plot: memory page faults rate behaviour in different deployments; x-axis: measurements; y-axis as labelled in the plot: memory/usage (kibibytes)]
Figure A.3.: Memory/page_faults_rate development in different deployments

[Plot: memory working set behaviour in different deployments; x-axis: measurements; y-axis: memory/working_set (mebibytes)]
Figure A.4.: Memory/working_set development in different deployments

[Plot: network tx rate behaviour in different deployments; x-axis: measurements; y-axis: network/tx_rate (kibibytes)]
Figure A.5.: Network/tx_rate development in different deployments

[Plot: network rx rate behaviour in different deployments; x-axis: measurements; y-axis: network/rx_rate (kibibytes)]
Figure A.6.: Network/rx_rate development in different deployments

[Plot: requests per minute behavior in different redeployments; x-axis: testing time (minutes); y-axis: requests per minute]
Figure A.7.: Requests per minute of load driver development in different deployments

Table A.1.: Median of cpu/usage_rate per test run (original unit: millicores)

Service / Run 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 median value
carts 133,5 110 115 107 114 108 132 121 134 123 132 120 130 112 114 107 111 106 110 114
carts-db 44 37 38 36,5 38 37 44 41 45 40,5 44 40 43 37,5 38 36 38 36 38 38
catalogue 34 33 34 34 35 34 35 34 34 33 34 33 34 33 34 34 34 33 35 34
catalogue-db 26 25 26 26 27 26 26 26 26 26 26 25 26 25 26 26 26 25 27 26
front-end 300 295 298,5 301 309 300 304 296 305 307 302 295 306 302 299,5 301 302 302 311,5 302
load-test 375 365 374 377 379 368 371 372 372 365 369 365 378 363 370 369 372 364 375 371
orders 2,5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
orders-db 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
payment 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
queue-master 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
rabbitmq 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
session-db 3 5 5 5 5 5 6 5 6 5 6 5 6 5 5 5 5 5 5 5
shipping 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
user 4 4 5 4 4 4 5 5 5 5 5 5 5 4 5 4 5 4 4 5
user-db 3 3 3 3 3 3 4 3 4 3 4 3 4 3 3 3 3 3 3 3
zipkin 10 7 10 7,5 9 7 8 7 8 8 8 8 8 8 7 8 9 8 9 8
zipkin-cron 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
zipkin-mysql 55 35 52 35,5 52 34 57 34 59 34 55 34 57 34 53 35 51 35 52 51
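The summary statistics reported per test run in Table A.1 (a median per run, plus a median over all runs in the last column) can be illustrated with a minimal Python sketch. The function name `summarize_runs` and the input layout are hypothetical and not taken from the thesis code base; the use of the sample variance is likewise an assumption, since the tables do not state which variance estimator was used. The `carts` values are the per-run medians from Table A.1, with the decimal comma written as a decimal point.

```python
from statistics import median, variance


def summarize_runs(runs):
    """Summarize raw samples per test run.

    `runs` maps a run number to the list of raw metric samples of that run
    (hypothetical layout). Returns the per-run medians, the per-run sample
    variances (assumed estimator), and the median over the per-run medians.
    """
    per_run_median = {run: median(samples) for run, samples in runs.items()}
    per_run_variance = {run: variance(samples) for run, samples in runs.items()}
    overall_median = median(per_run_median.values())
    return per_run_median, per_run_variance, overall_median


# Per-run medians of cpu/usage_rate for the carts service (Table A.1, millicores).
carts_medians = [133.5, 110, 115, 107, 114, 108, 132, 121, 134, 123,
                 132, 120, 130, 112, 114, 107, 111, 106, 110]

# Median over the 19 runs, corresponding to the "median value" column.
carts_overall = median(carts_medians)

# Per-run medians expressed relative to the overall median.
carts_relative = [m / carts_overall for m in carts_medians]
```

For the `carts` row this gives an overall median of 114, which agrees with the last column of Table A.1. Dividing each value by the overall median is one plausible way to obtain values "relative to median value"; whether the thesis normalizes per-run medians or individual measurements is not stated here.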
Table A.2.: Variance of cpu/usage_rate per test run (original unit: millicores)

Service / Run 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 median variance
carts 37 50 37 22 25 48 48 36 47 49 41 41 30 50 29 31 26 65 31 37
carts-db 7 5 4 4 5 7 8 6 7 8 7 7 5 7 5 5 5 10 5 6
catalogue 2 1 2 2 2 3 3 3 2 2 2 3 2 3 2 3 2 7 4 2
catalogue-db 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 2 1 4 3 2
front-end 110 96 108 140 90 152 132 136 115 125 109 216 91 151 137 183 115 470 176 132
load-test 124 138 188 182 149 225 193 213 175 178 187 367 129 225 205 276 214 800 215 193
orders 22 22 11 11 11 16 13 13 11 11 11 20 12 13 8 11 10 12 10 11
orders-db 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
payment 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
queue-master 25 6 5 5 5 4 3 5 4 5 5 4 5 6 3 5 6 7 5 5
rabbitmq 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
session-db 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
shipping 4 9 13 10 18 12 20 8 15 18 12 8 9 7 11 16 10 11 16 11
user 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
user-db 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
zipkin 299 433 718 339 834 429 476 323 344 320 413 416 476 466 566 467 278 694 600 433
zipkin-cron 2 1557 1533 1206 1343 1971 2232 2100 2335 2892 2170 1953 1412 1322 1203 2189 1845 1436 2295 1845
zipkin-mysql 27 362 899 438 632 501 1063 379 1344 198 1219 489 1034 418 1015 298 1725 372 1943 501

Table A.3.: Median of memory/usage per test run (unit: mebibytes)

Service / Run 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 median value
carts 766,1 687,4 667,0 673,8 668,9 763,5 715,1 676,9 657,8 758,7 673,2 722,5 690,0 774,2 668,1 646,6 625,5 732,9 784,3 687,4
carts-db 162,0 156,4 156,3 162,1 162,9 160,9 159,2 159,4 161,7 160,5 161,1 160,2 164,3 159,2 162,7 159,2 161,7 157,7 160,4 160,5
catalogue 14,5 14,0 18,9 16,5 16,3 16,7 16,7 16,8 17,3 15,9 16,6 16,9 16,2 15,6 15,8 16,9 16,0 18,7 17,6 16,6
catalogue-db 203,5 203,1 239,1 440,9 333,2 289,1 258,5 355,3 286,5 265,9 251,0 379,1 279,9 249,7 257,1 289,0 291,9 265,6 261,9 265,9
front-end 111,5 112,0 129,2 129,2 128,2 126,7 123,9 128,7 127,0 125,1 119,0 128,2 127,5 121,0 124,0 126,3 127,3 127,1 125,0 126,7
load-test 39,1 38,9 42,3 60,6 53,2 51,3 49,7 56,8 51,8 49,0 45,9 56,9 52,7 48,7 50,9 51,3 52,1 53,0 51,2 51,3
orders 317,7 300,9 340,2 340,7 333,1 326,2 313,4 330,6 324,5 321,0 315,9 340,6 323,7 313,7 321,2 321,5 318,1 320,9 321,9 321,5
orders-db 37,2 48,7 42,1 69,3 63,8 62,9 59,2 64,5 63,3 63,2 56,8 65,0 63,9 59,2 62,6 63,0 63,6 63,5 63,3 63,2
payment 15,2 14,3 16,0 14,4 16,8 13,7 16,8 13,0 16,9 14,2 16,9 13,4 15,9 12,6 16,5 12,1 15,5 15,4 16,8 15,4
queue-master 301,9 257,3 295,0 225,8 291,3 256,1 267,6 259,9 298,9 257,0 288,1 256,6 270,4 246,5 290,3 245,5 283,4 245,7 294,1 267,6
rabbitmq 101,0 100,4 103,4 100,4 95,1 100,0 97,4 100,8 98,5 97,5 98,1 101,1 99,5 99,9 98,8 97,7 100,3 100,6 100,4 100,0
session-db 4,6 4,7 4,6 4,6 4,6 4,7 4,6 4,7 4,6 4,7 4,6 4,6 4,7 4,6 4,7 4,6 4,7 4,6 4,7 4,6
shipping 326,7 284,3 326,9 331,5 317,9 316,7 302,1 328,7 315,2 309,1 297,4 325,5 309,4 303,6 308,3 310,7 316,0 316,3 308,3 315,2
user 14,8 16,4 17,9 18,1 18,9 17,8 18,8 16,5 16,3 18,1 17,7 18,1 16,4 17,8 18,3 18,7 19,0 17,7 17,8 17,8
user-db 43,6 60,2 61,6 61,4 59,8 61,4 59,9 60,0 59,8 60,2 60,1 60,5 61,1 59,9 59,0 59,8 60,0 59,5 59,7 60,0
zipkin 1686,6 1494,4 1168,7 1540,8 1528,8 1502,3 1471,1 1504,3 1470,5 1502,1 1449,8 1480,2 1499,5 1466,7 1611,8 1421,1 1465,6 1393,5 1460,4 1480,2
zipkin-cron 40,9 293,4 211,2 208,7 181,7 217,1 213,5 283,3 233,8 322,5 203,7 280,1 233,4 277,2 200,3 258,9 151,0 195,6 200,1 213,5
zipkin-mysql 711,7 546,0 1009,3 477,4 1018,8 552,4 930,5 538,2 932,2 445,1 857,6 465,1 885,4 460,9 1083,6 466,2 1003,9 462,7 961,7 711,7

Table A.4.: Variance of memory/usage per test run (original unit: bytes)

Service / Run 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 median variance
carts 7,E+13 9,E+14 9,E+14 5,E+14 2,E+14 2,E+14 1,E+14 3,E+14 1,E+15 2,E+14 6,E+14 8,E+13 1,E+14 1,E+14 3,E+14 4,E+14 8,E+14 1,E+14 5,E+14 2,62E+14
carts-db 1,E+15 1,E+15 1,E+15 2,E+15 1,E+15 2,E+15 1,E+15 2,E+15 1,E+15 2,E+15 1,E+15 1,E+15 1,E+15 1,E+15 1,E+15 1,E+15 2,E+15 1,E+15 1,E+15 1,45E+15
catalogue 9,E+10 9,E+11 4,E+10 1,E+10 1,E+10 1,E+10 1,E+10 3,E+10 1,E+10 3,E+10 3,E+11 2,E+10 1,E+11 5,E+10 8,E+09 4,E+10 6,E+09 1,E+12 2,E+10 2,72E+10
catalogue-db 9,E+10 2,E+12 2,E+10 2,E+14 1,E+15 9,E+14 3,E+14 2,E+15 8,E+14 5,E+14 5,E+14 2,E+15 8,E+14 3,E+14 3,E+14 6,E+12 1,E+15 3,E+14 5,E+14 4,68E+14
front-end 2,E+12 1,E+13 3,E+12 2,E+12 2,E+12 6,E+12 8,E+12 2,E+12 7,E+12 8,E+12 2,E+13 2,E+12 2,E+12 8,E+12 4,E+12 3,E+12 4,E+12 5,E+12 5,E+12 4,43E+12
load-test 2,E+13 1,E+13 2,E+13 1,E+13 6,E+12 6,E+12 9,E+12 9,E+12 6,E+12 1,E+13 6,E+12 6,E+12 8,E+12 7,E+12 1,E+13 2,E+13 6,E+12 2,E+13 1,E+13 8,68E+12
orders 1,E+14 5,E+11 2,E+12 4,E+12 2,E+13 3,E+13 2,E+13 1,E+13 4,E+13 1,E+13 4,E+13 1,E+13 2,E+13 1,E+13 5,E+12 3,E+12 3,E+13 2,E+12 7,E+12 1,19E+13
orders-db 3,E+12 1,E+14 3,E+11 2,E+12 2,E+11 6,E+11 9,E+12 2,E+12 2,E+12 3,E+12 4,E+13 3,E+12 5,E+11 2,E+13 5,E+11 4,E+11 2,E+11 2,E+10 2,E+12 1,71E+12
payment 4,E+10 2,E+12 3,E+11 2,E+12 2,E+10 2,E+12 2,E+10 2,E+12 2,E+10 2,E+12 1,E+10 2,E+12 2,E+10 2,E+12 3,E+10 2,E+12 2,E+10 3,E+12 1,E+10 2,62E+11
queue-master 2,E+12 1,E+14 7,E+11 2,E+14 9,E+11 2,E+14 1,E+12 2,E+14 6,E+11 2,E+14 2,E+12 2,E+14 4,E+11 2,E+14 4,E+12 2,E+14 1,E+13 2,E+14 5,E+12 1,36E+13
rabbitmq 3,E+09 3,E+09 2,E+11 2,E+10 7,E+12 2,E+10 8,E+12 3,E+10 3,E+12 3,E+10 5,E+12 2,E+10 6,E+12 3,E+09 2,E+12 2,E+09 7,E+11 2,E+09 1,E+12 2,73E+10
session-db 5,E+08 2,E+09 1,E+08 1,E+08 1,E+09 1,E+07 2,E+09 5,E+07 6,E+08 3,E+08 1,E+09 2,E+08 9,E+08 5,E+08 6,E+08 2,E+08 5,E+08 5,E+08 4,E+08 4,72E+08
shipping 6,E+11 2,E+11 1,E+12 6,E+11 2,E+13 3,E+13 2,E+13 1,E+13 4,E+13 2,E+13 4,E+13 1,E+13 2,E+13 2,E+13 8,E+12 5,E+11 3,E+13 1,E+12 1,E+13 1,23E+13
user 1,E+12 4,E+10 3,E+10 5,E+10 2,E+10 5,E+10 2,E+10 4,E+10 3,E+10 4,E+10 1,E+11 2,E+11 2,E+10 4,E+10 3,E+10 5,E+10 2,E+10 3,E+10 2,E+10 4,02E+10
user-db 2,E+12 4,E+11 4,E+11 3,E+11 2,E+12 3,E+11 3,E+12 9,E+11 1,E+12 1,E+12 2,E+12 5,E+11 1,E+12 2,E+11 6,E+11 6,E+11 4,E+11 6,E+11 3,E+11 5,83E+11
zipkin 1,E+17 6,E+16 7,E+16 5,E+16 5,E+16 4,E+16 6,E+16 6,E+16 7,E+16 4,E+16 4,E+16 4,E+16 4,E+16 4,E+16 8,E+16 3,E+16 9,E+16 4,E+16 4,E+16 4,83E+16
zipkin-cron 7,E+13 2,E+15 2,E+13 1,E+16 8,E+15 7,E+15 7,E+15 8,E+15 5,E+15 1,E+16 2,E+15 2,E+15 5,E+15 3,E+15 9,E+15 1,E+16 5,E+15 7,E+15 4,E+15 5,21E+15
zipkin-mysql 1,E+16 1,E+16 2,E+17 2,E+15 1,E+17 5,E+15 1,E+17 4,E+15 1,E+17 2,E+15 1,E+17 1,E+15 1,E+17 2,E+15 1,E+17 1,E+15 1,E+17 1,E+15 2,E+17 1,41E+16

[Density plot: cpu/usage_rate relative to median value; N = 200, bandwidth = 0.01094]
Figure A.8.: Cpu/usage_rate distribution relative to median value

[Density plot: memory/usage relative to median value; N = 247, bandwidth = 0.001301]
Figure A.9.: Memory/usage distribution relative to median value

[Density plot: memory/faults_rate relative to median value; N = 200, bandwidth = 0.01975]
Figure A.10.: Memory/page_faults_rate distribution relative to median value

[Density plot: network/tx_rate relative to median value; N = 100, bandwidth = 0.01268]
Figure A.11.: Network/tx_rate distribution relative to median value

[Density plot: network/rx_rate relative to median value; N = 100, bandwidth = 0.01328]
Figure A.12.: Network/rx_rate distribution relative to median value

Declaration

I hereby declare that the work presented in this thesis is entirely my own and that I did not use any other sources and references than the listed ones. I have marked all direct or indirect statements from other sources contained therein as quotations. Neither this work nor significant parts of it were part of another examination procedure.
I have not published this work in whole or in part before. The electronic copy is consistent with all submitted copies. place, date, signature