Empirical Software Engineering (2021) 26: 104 https://doi.org/10.1007/s10664-021-09999-9 Industry practices and challenges for the evolvability assurance of microservices An interview study and systematic grey literature review Justus Bogner1 · Jonas Fritzsch1 · StefanWagner1 ·Alfred Zimmermann2 Accepted: 8 June 2021 © The Author(s) 2021 Abstract Context Microservices as a lightweight and decentralized architectural style with fine- grained services promise several beneficial characteristics for sustainable long-term soft- ware evolution. Success stories from early adopters like Netflix, Amazon, or Spotify have demonstrated that it is possible to achieve a high degree of flexibility and evolvability with these systems. However, the described advantageous characteristics offer no concrete guid- ance and little is known about evolvability assurance processes for microservices in industry as well as challenges in this area. Insights into the current state of practice are a very important prerequisite for relevant research in this field. Objective We therefore wanted to explore how practitioners structure the evolvability assur- ance processes for microservices, what tools, metrics, and patterns they use, and what challenges they perceive for the evolvability of their systems. Method We first conducted 17 semi-structured interviews and discussed 14 different microservice-based systems and their assurance processes with software professionals from 10 companies. Afterwards, we performed a systematic grey literature review (GLR) and used the created interview coding system to analyze 295 practitioner online resources. Results The combined analysis revealed the importance of finding a sensible balance between decentralization and standardization. Guidelines like architectural principles were seen as valuable to ensure a base consistency for evolvability and specialized test automa- tion was a prevalent theme. Source code quality was the primary target for the usage of tools and metrics for our interview participants, while testing tools and productivity metrics were the focus of our GLR resources. In both studies, practitioners did not mention architectural or service-oriented tools and metrics, even though the most crucial challenges like Service Cutting or Microservices Integration were of an architectural nature. Communicated by: Arpad Beszedes and Miryung Kim This article belongs to the Topical Collection: Software Maintenance and Evolution (ICSME) � Justus Bogner justus.bogner@iste.uni-stuttgart.de Extended author information available on the last page of the article. / Published online: 22 July 2021 http://crossmark.crossref.org/dialog/?doi=10.1007/s10664-021-09999-9&domain=pdf http://orcid.org/0000-0001-5788-0991 http://orcid.org/0000-0002-5256-8429 http://orcid.org/0000-0003-3352-7207 mailto: justus.bogner@iste.uni-stuttgart.de Empir Software Eng (2021) 26: 104 Conclusions Practitioners relied on guidelines, standardization, or patterns like Event- Driven Messaging to partially address some reported evolvability challenges. However, specialized techniques, tools, and metrics are needed to support industry with the contin- uous evaluation of service granularity and dependencies. Future microservices research in the areas of maintenance, evolution, and technical debt should take our findings and the reported industry sentiments into account. Keywords Microservices · Evolvability · Assurance · Industry · Interviews · Grey literature review 1 Introduction Fast moving markets and the age of digitalization require that software can be quickly adapted or extended with new features. If change implementations frequently happen under time pressure, the sustainable evolution of a long-living software system can be signifi- cantly hindered by the intentional or unintentional accrual of technical debt (Lehman 1980; Avgeriou et al. 2016). The quality attribute associated with software evolution is referred to as evolvability (Rowe et al. 1998): the degree of effectiveness and efficiency with which a system can be adapted or extended. Software evolution can therefore be seen as a sub- set of maintenance since the latter also includes changes where the requirements are stable, like the fixing of bugs. For Rajlich (2018), evolvability is therefore more demanding and requires maintainability, i.e. an evolvable system is always maintainable but not vice versa. In this sense, it is impossible to evolve unmaintainable software, yet maintaining software that can no longer be evolved may still be possible. Evolvability is especially important for software with frequently changing requirements, e.g. internet-based systems. To provide sufficient confidence that such a system can be sustainably evolved, software professionals apply a set of numerous activities that we refer to as evolvability assurance. These activities are usually either of an analytical or constructive nature (Wagner 2013). The goal of analytical activities is to identify evolvability-related issues in the system, i.e. to evaluate or quantify evolvability. This includes manual techniques like code review or scenario-based analysis but also tool-supported static or dynamic analysis with e.g. met- rics. The goal of constructive activities, on the other hand, is the remediation of identified issues or systematic evolvability construction for some part of the system, i.e. to improve evolvability. The primary constructive activity is code-level or architectural refactoring. However, adhering to evolvability-related principles and guidelines or using evolvability- related design patterns during software evolution is also a constructive – and proactive – form of evolvability assurance. Lastly, some practices like conscious technical debt man- agement cover both areas: the identification and documentation of technical debt items is analytical, while the removal of prioritized items is constructive. For larger systems, all these activities often form a communicated assurance process and are an important part of the development workflow with integration into the continuous integration and delivery (CI/CD) pipeline. Microservices constitute an important architectural style that prioritizes evolvabil- ity (Newman 2015). A key idea here is that fine-grained and loosely coupled services that are independently deployable should be easy to change and to replace. Consequently, one of the postulated microservices characteristics is evolutionary design (Fowler 2019). While these properties provide a beneficial theoretical basis for evolvable systems, they offer no concrete and universally applicable solutions. As with each architectural style, the imple- mentation of a concrete microservice-based system can be of arbitrary quality. Especially 104 Page 2 of 39 Empir Software Eng (2021) 26: 104 the “service cutting” activity has been described as challenging and several approaches have been proposed by academia to support it (Fritzsch et al. 2019b). Apart from this, very lit- tle scientific research has covered the areas of maintenance, evolution, or technical debt for microservices. Examples include the tracking and management of microservices depen- dencies (Esparrachiari et al. 2018), antipatterns for microservices (Taibi et al. 2020), and the applicability of service-based maintainability metrics for microservices (Bogner et al. 2017). In addition to this sparse scientific state of the art, there are also very few empirical studies on the industry state of practice. Little is known about what evolvability assurance processes and techniques companies use for microservices or if these are different com- pared to other architectural styles. In the general area of service-based systems, Schermann et al. (2016) even describe a mismatch between what academia assumes and what industry actually does. An analysis of industry practices in this regard could identify common chal- lenges, showcase successful processes, and highlight gaps and deficiencies. This would also provide insights into how industry perceives academic approaches specifically designed for service orientation, e.g. service-oriented maintainability metrics. Results of such a study could help to design new and more suited evolvability assurance processes or techniques. We therefore conducted an analysis of the industry state of practice based on two qual- itative studies. First, we interviewed 17 Germany-based software professionals from 10 different companies (see Section 4). They described 14 different systems with various microservices characteristics and their concrete evolvability assurance process including tool, metric, and pattern usage. We also talked with them about the evolution qualities of microservices, how microservices influence the assurance process, and their perceived challenges for evolvability. Using these identified concepts, we subsequently conducted a systematic grey litera- ture review (GLR) to confirm and expand our interview results based on a large number of practitioner blog posts, conference presentations, white papers, and Q&A forum posts (see Section 5). In total, we analyzed 295 relevant internet resources written by software professionals using our existing interview coding system, which we extended with newly identified labels. In this paper, we present the results of these two qualitative studies and discuss their combined implications (see Section 6). The interview results related to the evolvability assurance of microservices have already been discussed on their own in (Bogner et al. 2019a). Since the extensive interviews also covered additional topics, there are also pub- lications on used technologies, the adherence to microservice characteristics, and overall software quality (Bogner et al. 2019b) as well as on intentions, strategies, and challenges for migrating to microservices (Fritzsch et al. 2019a). This article therefore extends our pre- vious work (Bogner et al. 2019a) with a grey literature review and the joint interpretation of interview and GLR results. 2 RelatedWork Several empirical studies that report challenges for microservices adoption have been pub- lished so far. Baškarada et al. (2018) conducted 19 in-depth interviews with experienced architects. They discussed opportunities and challenges associated with the adoption and implementation of microservices. Four types of corporate systems with different levels of suitability for microservices were identified. Especially large corporate systems of record Page 3 of 39 104 Empir Software Eng (2021) 26: 104 like enterprise resource planning (ERP) systems were not seen as appropriate targets. In this context, organizational challenges such as DevOps methodologies would be less serious for information and communication technology enterprises than for traditional organizations, where IT was perceived as a “necessary evil” (Baškarada et al. 2018). Ghofrani and Lübke (2018) conducted a similar empirical survey among 25 practitioners that were mainly developers and architects. Their objective was to find perceived challenges in designing, developing, and maintaining microservice-based systems. The results reveal a lack of notations, methods, and frameworks for architecting microservices. Several par- ticipants named the distributed architecture as responsible for the challenging development and debugging of the system. Participants generally prioritized optimizations in security, response time, and performance over aspects like resilience, reliability, and fault tolerance. In their interviews with 10 microservices experts from industry, Haselböck et al. (2018) focused on design areas of microservices and associated challenges. The study identified 20 design areas and their importance as rated by the participants. Design principles and com- mon challenges from earlier mapping studies could be confirmed by the authors. Similar to our qualitative study, interviewees’ rationales are discussed as well. Microservices design is a fundamental aspect of their evolvability later on. As such, the study can be seen as a valuable contribution to the topic at hand. Some studies also focus on evolvability-related areas like technical debt and antipatterns. Carrasco et al. (2018) conducted a literature review to gather migration and architecture smells. The authors derived best practices, success stories, and pitfalls. By digesting 58 different sources from academia and grey literature, they presented nine common bad smells with proposed solutions. Similarly, Taibi et al. (2020) synthesized a taxonomy of 20 microservices antipatterns via an extensive mixed-method study over several years, combining an industrial sur- vey, a literature review, and interviews. Their most recent replication relied on interviews with 27 experienced developers, who shared bad microservices practices that they encoun- tered and their applied solutions. The taxonomy contains both organizational and technical antipatterns. To sum up the frequently studied area of microservices antipatterns, Neri et al. (2019) conducted a multivocal review on the refactoring of antipatterns which violate microser- vices principles. In addition to scientific publications, they also included grey literature from practitioners, such as blog posts, industrial white papers, or books. As results, they present 16 refactorings for seven architectural microservices smells. In a previous study (Bogner et al. 2018), we surveyed 60 software professionals via an online questionnaire to assess maintainability assurance practices in industry as well as notable differences with both service- and microservice-based systems. We asked ques- tions related to the used processes, tools, and metrics to learn about treatments specific to such systems. Very few participants reported the usage of techniques to address existing issues related to architecture-level evolvability. Since 67% of participants neglected service- oriented particularities in the assurance, the study revealed a weak spot in industry practice that may impair the lifespan of service-based systems. A pure grey literature review in the area of microservices was conducted by Soldani et al. (2018). To distill commonly perceived technical and operational advantages (“gains”) and drawbacks (“pains”) of microservices, they selected and analyzed 51 practitioner resources obtained via search engines like Google, Bing, or Duck Duck Go. The analyzed white papers, blog posts, and videos were published between 2014 and 2017 and were mapped to the general areas of microservices design, development, and operations. 104 Page 4 of 39 Empir Software Eng (2021) 26: 104 Bandeira et al. (2019) investigated how microservices were discussed on StackOver- flow. They argue that Q&A websites can serve as representative samples of the community. With StackOverflow being the biggest platform for software development, they refer to sev- eral other studies that already leveraged this source of accumulated knowledge. Unlike our approach, they did not use keyword search, but only retrieved questions for which the author used the microservices tag. In total, they applied mining techniques and topic modelling to 1,043 discussions and extracted technical and conceptual subjects. While there is a slight overlap with codes we used in our GLR, evolvability is not a thematic priority in their very general classification scheme. Lastly, Lenarduzzi and Taibi (2018) investigated technical debt interest by means of a long-term case study where they monitored the migration of a monolithic legacy system to microservices. The study aimed to characterize technical debt and its growth comparatively in both architectural styles. As a preliminary result, they found that the total amount of technical debt grew much faster in the microservice-based system. In summary, the majority of microservices industry studies focuses on general challenges or antipatterns and not on a broader set of applied evolvability assurance activities. Our survey (Bogner et al. 2018) is one of the few to do so, but we did not focus exclusively on microservices and could not report much on developers’ rationales due to the limitations of the quantitative questionnaire. To address this gap, we therefore conducted two qualitative microservices studies to analyze industry evolvability assurance processes, techniques, and challenges. 3 Research Design We generally followed the five-step case study process as described by Runeson and Höst (2009) to structure our research, which includes the consecutive steps of study design, preparation for data collection, evidence collection, data analysis, and reporting. As a first step, we defined a research objective and related research questions. The primary goal of this study can be formulated in the following way: Analyze the applied evolvability assurance for the purpose of knowledge generation with respect to common practices and challenges from the viewpoint of software professionals in the context of microservices in industry We formulated three research questions to set more fine-grained directions for our study and to limit its scope: RQ1:What processes do software professionals follow for the evolvability assurance of microservices and for what reasons? This RQ covers the process aspects of evolvability assurance, i.e. what concrete activities are used and how are they structured and combined. RQ2: What tools, metrics, and patterns do software professionals use for assuring the evolvability of microservices and with what rationales? This RQ is concerned with three concrete concepts which are frequently used within evolvability assurance activities, namely tools and metrics to quantify evolvability or to identify evolvability-related issues as well as design and architecture patterns for systematic evolvability construction. Page 5 of 39 104 Empir Software Eng (2021) 26: 104 RQ3: How do software professionals perceive the quality of their microservices and assurance processes and what parts do they see as challenging? This RQ analyzes software professionals’ perceptions with respect to the quality charac- teristics of their systems (e.g. what evolvability characteristics are they satisfied with) and their assurance activities (e.g. do they think that their activities are effective and efficient). Special emphasis is put on perceived challenges for the evolvability of microservices. Since quantitative survey research with questionnaires would not be in-depth enough to cover practitioners’ rationales – an issue we also experienced during the analysis of our sur- vey data (Bogner et al. 2018) – we selected a predominantly qualitative research approach. Qualitative methods analyze relationships between concepts and directly deal with iden- tified complexity (Seaman 2008). Results from such methods are therefore very rich and informative and may provide insights into the thought process behind the analyzed infor- mation. As concrete methods for surveying the state of practice, we chose semi-structured interviews (Seaman 2008; Hove and Anda 2005) and a systematic grey literature review (GLR) (Garousi et al. 2019), i.e. we collected and analyzed a large number of practitioner online resources related to our research objective. The detailed research designs for each method are described in the following two subsections. 3.1 Practitioner Interviews Semi-structured interviews left us with a basic agenda, but also allowed us to dynami- cally adapt our questions based on responses. For our interview participants, we defined the following requirements: – Significant professional experience (minimum of five years) and solid knowledge of service orientation – Technical role (e.g. developer or architect) that at least sometimes writes code – Recent participation in the development of a system with microservices characteristics We recruited participants via personal industry contacts of the research group and by attending industry meet-up groups on microservices, where we approached companies from different domains and of different size. To ensure a base degree of heterogeneity within our population, we only allowed a maximum of three participants per company and if two participants worked on the same system, they needed to have different roles. 3.1.1 Preparation for Data Collection Before conducting the interviews, we created several documents. We prepared an interview preamble (Runeson and Höst 2009) that explained the interview process and relevant top- ics. To make participants familiar with the study, they received this document beforehand. The preamble also outlined ethical considerations like confidentiality, requested consent for audio recordings, and guaranteed that recordings and transcripts would not be published. As a second document, we created an interview guide (Seaman 2008) that contained the most important questions grouped in thematic blocks. This guide helped us to scope and orga- nize the semi-structured interviews and was used as a loose structure during the sessions. We did not share it with interviewees beforehand. Furthermore, we created a slide set with additional interview artifacts for certain topics or questions, such as an exemplary list of assurance tools. These slides were also not shared with participants before the interviews. For the analysis, we created a preliminary set of coding labels and a case characterization matrix (Seaman 2008) containing the most important case attributes. 104 Page 6 of 39 Empir Software Eng (2021) 26: 104 3.1.2 Evidence Collection In total, we conducted 17 individual interviews (no group interviews). Six of these were performed face to face and 11 via remote communication software with screen sharing. All interviews except for a single English one were conducted in German. Each partici- pant agreed to a recording of the interviews, which took between 45 and 75 minutes. We loosely followed the structure of the interview guide and adapted based on how a partic- ipant reacted. As an initial ice breaker, participants were asked to describe their role and the system they worked on. Later on, the topic shifted to the evolvability of the system and potential symptoms of technical debt. The next thematic block was the concrete evolvability assurance process for the system. We presented a custom maturity model with four levels that ranged from implicit and basic to explicit and systematic (see Fig. 1). We created this simplistic model inspired by existing frameworks like CMMI (Software Engineering Institute 2010) or the maintenance maturity model from April et al. (2005). Since its only purpose was to act as an initial conversation opener about evolvability assurance, our model has not been evaluated in any way. To get participants to reflect on their own assurance, we explained the different levels and exam- ples of associated practices. Participants were then asked to place themselves on the level that corresponded the most to their current assurance activities and to give some rationales for their choice. From there, we discussed the details of their processes and concrete tech- niques, tools, and metrics. Even though we discussed specific level placements in some cases, it was neither the intention to reach perfectly consistent maturity levels across all cases nor to rigorously assess the maturity of the different companies. The model just served as an icebreaker to talk about evolvability assurance. Additionally, the level placements give an indication for how elaborate and systematic the interviewees perceived their own practices. Lastly, we asked questions about challenges and participants’ satisfaction with the cur- rent process. The satisfaction and reflection questions relied on a five point scale from -2 (very negative) to +2 (very positive) with 0 being the neutral center. After the interviews, we manually transcribed each audio recording to create a textual document. We then sent these documents to participants for review and final approval. During this review, intervie- wees were able to delete sensitive paragraphs or change statements of unclear or unintended meaning. The approved transcripts were then used for detailed qualitative content analysis. Fig. 1 Evolvability Assurance Maturity Levels Page 7 of 39 104 Empir Software Eng (2021) 26: 104 3.1.3 Data Analysis As a first step for analyzing the interview data, we performed the coding of each tran- script. Using the created preliminary set of codes, we assigned labels to relevant paragraphs. During this process, several new labels were created and already finished transcripts were revisited. Labels were also renamed, split, or merged as we acquired a more holistic under- standing of the cases. These coding activities followed the constant comparison method that is based on grounded theory (Seaman 2008). After the coding of all transcripts, we analyzed the details and code relationships of each individual transcript. This activity resulted in a textual description for every case1. In the second step, we applied cross-case analysis (Seaman 2008) to identify important generalizations and summaries between the cases. We used the coding system and the cre- ated case characterization matrix as well as tabulation (Runeson and Höst 2009). For each research question, important findings were extracted from the transcript and documented. During this process, we also refined the case characterization matrix. General trends and deviations were documented and later aggregated into results and take-aways. To increase transparency and reproducibility, we published all interview documents and artifacts (except for the full transcripts) as well as the results of the analysis on both GitHub2 and Zenodo3. 3.2 Grey Literature Review (GLR) Since software engineering is a very practitioner-oriented field and microservices still have limited scientific publication coverage in a lot of areas, grey literature may hold valuable insights that academic literature simply cannot provide yet (Garousi et al. 2016). Therefore, multivocal and grey literature reviews in software engineering experienced a rise in popu- larity over the last years, even though the usage of such methods is still at an early stage and clear guidelines are only starting to emerge (Neto et al. 2019). In our research design, we used the identified concepts and results from our interview as the basis for planning our GLR, i.e. the goal of the GLR was to confirm (or reevaluate) the interview results on the foundation of a large sample size of documents. 3.2.1 Preparation for Data Collection We developed a detailed review protocol based on many of the guidelines proposed in (Garousi et al. 2019) to support us during the process (see also Fig. 2). With respect to data sources (see also Fig. 3), we decided to include the search engines Google and Bing, as they are the two most popular ones4 and have been used in most GLRs in the field of software engineering. Additionally, we included the Q&A platform StackOverflow and the three spe- cialized StackExchange communities Software Engineering, Software Quality Assurance & Testing, and DevOps. StackExchange communities are popular with practitioners and especially valuable to identify frequently experienced issues and challenges. Based on our experiences with the interviews, we defined a set of seven search strings for Google and Bing (see Fig. 4). Each included the term microservices combined with 1https://github.com/xJREB/research-microservices-evolvability-interviews/tree/master/case-descriptions 2https://github.com/xJREB/research-microservices-evolvability-interviews 3https://doi.org/10.5281/zenodo.2586916 4https://www.reliablesoft.net/top-10-search-engines-in-the-world 104 Page 8 of 39 https://github.com/xJREB/research-microservices-evolvability-interviews/tree/master/case-descriptions https://github.com/xJREB/research-microservices-evolvability-interviews https://doi.org/10.5281/zenodo.2586916 https://www.reliablesoft.net/top-10-search-engines-in-the-world Empir Software Eng (2021) 26: 104 Fig. 2 General GLR Process various relevant concepts like quality attributes (e.g. evolvability) or means of assur- ance (e.g. metrics and tools). Additionally, we excluded domains that only produced unwanted academic results like researchgate.net or scholar.google.com5. For StackOverflow and the StackExchange communities, we constructed a detailed query for the offered SQL interfaces6. This SQL query7 relied on the same search terms, but only included posts created in 2014 or later that had a score of at least +2. To limit the long run- time and large number of results, the required score was raised to +5 for the StackOverflow query. 3.2.2 Evidence Collection We manually entered all seven search strings into both Google and Bing and extracted the first 100 URLs per search string, i.e. a total of 1,400 URLs (7 ∗ 100 ∗ 2). We used an anonymous browsing session and set the search location to the United States. As Google did not reliably respect domain exclusions of its own domains and Bing did not offer a feature for this at all, we identified any excluded domains via regular expression search and filled up the list with additional search hits to guarantee 100 results per search string. Concerning the four Q&A platforms, we simply extracted all posts returned by the respective queries. We then merged the results (1,730 resources) and eliminated duplicates (485 resources, 28%), which left us with a total of 1,245 URLs that we needed to manually assess for inclusion. This assessment was based on the following criteria. First, we only included textual prac- titioner online resources in English. Most frequently, these were resources like blog posts, news or wiki articles, Q&A posts, tutorials (in written form), company white papers, or presentation slides. Resources were excluded if they were not in English or if they were sci- entific papers, books, videos, job offerings, or announcements of conferences, seminars, and trainings. On top of these basic criteria, filtering out resources not relevant for our research questions was most important. To be included, a resource needed to contain a description or reflection of some form of evolvability assurance or some form of systematic evolvability construction. This could be the usage of tools, metrics, or patterns, but also descriptions of guidelines, best practices, or lessons learned as well as experienced challenges. A last pos- sibility for exclusion was a perceived suboptimal quality of the resource, e.g. if the author’s or company’s experience or authority was questionable. It was important that the author had 5https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/protocol.md# google-and-bing-search 6https://data.stackexchange.com/stackoverflow/query/new 7https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/protocol.md# stackexchange-sql-query Page 9 of 39 104 https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/protocol.md#google-and-bing-search https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/protocol.md#google-and-bing-search https://data.stackexchange.com/stackoverflow/query/new https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/protocol.md#stackexchange-sql-query https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/protocol.md#stackexchange-sql-query Empir Software Eng (2021) 26: 104 Fig. 3 Used Search Engines and Q&A Platforms applied or experienced the described assurance techniques in the real world and that they were not simply a hypothetical suggestion the author thought about. For Q&A posts, we did not follow this as sternly, since we also wanted to record the experienced challenges. To be able to split the manual work of filtering over 1,200 resources without significantly reducing consistency, the first two authors performed several rounds of inter-rater reliability calibration by both filtering the same set of resources and comparing the results. We did this for four consecutive rounds, each with 100 mixed resources from all sources. After each round, differences were discussed and resolved, which gradually led to more consensus. For the last 100 resources, the percentage agreement was 87% and Cohen’s Kappa (Cohen 1960) was 0.602, which Landis and Koch (1977) categorize as right at the end of “moderate” and the beginning of “substantial” agreement. At this point, we split the remaining 845 resources between us and filtered independently. If one rater was unsure about the inclusion of a resource, he assigned a review of his preliminary decision to the other rater. Differences of opinion were discussed until a consensus was reached. Using this process, we finally ended up with 295 included resources that needed to be analyzed in detail (see also Fig. 5). 3.2.3 Data Analysis To analyze the selected GLR resources, we relied on roughly the same coding process as with the interviews. We used the existing coding system as the basis and iteratively extended it with new labels we discovered during the process. A difference to the interview tran- scripts was that we did not code concrete text passages, but assigned labels to the complete resource. For longer resources, we also documented the occurrence of the label to support data extraction later on, e.g. a page number or associated heading. Similar to the filtering stage, we also wanted to split the work between two coders and therefore performed an ini- tial calibration round where both coders analyzed the first 20 resources and discussed any differences. Due to the very elaborate coding system (over 130 unique labels in six cate- gories), it did not make sense to calculate agreement measures like Cohen’s Kappa since Fig. 4 Used Search Terms for Google and Bing 104 Page 10 of 39 Empir Software Eng (2021) 26: 104 Fig. 5 GLR Stages with # of Resources at Each Stage (SO: StackOverflow, SE: Software Engineering, DevOps: DevOps, SQA: Software Quality Assurance & Testing) high values are unlikely to be reached, even if the majority of labels will be the same per resource. After the calibration discussion, however, we felt that we had been sufficiently consistent to nonetheless split up the remaining resources. Additionally, if a coder was unsure about one of his resources, he could assign it to the other coder for double-checking. After coding was completed, we synthesized answers to our research questions by analyzing frequently occurring labels and their related text passages. Similar to the interview analysis, we looked for general tendencies and documented interesting quotes. Later on, these extrac- tions were aggregated into results and take-aways. Likewise, all GLR artifacts and results are published on both GitHub8 and Zenodo9. 4 Interview Results Our interviewees were from 10 different companies (C1–C10) of different sizes and domains (see Table 1). Half of these were software & IT services companies that mostly developed systems for external customers. The companies from other domains always had an internal system owner. Every participant was located in Germany, even though some companies had sites in several European countries or even globally. From our 17 partici- pants (P1–P17), 11 stated architect as their role while four were developers. The remaining two roles were data engineer and DevOps engineer. All participants possessed a minimum of five years of professional experience, with a median of 12 and a mean of 14.7 years. Altogether, we discussed 14 systems (S1–S14) and their evolvability assurance processes, where in three cases, two participants talked about the same system (S5, S9, S11). For the sake of brevity, this publication only contains the aggregated interview results. We provide a detailed description of every case in our online repositories2,3. The descrip- tions include general information about the system, the details of the evolvability assurance 8https://github.com/xJREB/research-microservices-evolvability-glr 9https://doi.org/10.5281/zenodo.3731259 Page 11 of 39 104 https://github.com/xJREB/research-microservices-evolvability-glr https://doi.org/10.5281/zenodo.3731259 Empir Software Eng (2021) 26: 104 Table 1 Interview Demographics: Companies and Participants (CID: Company ID, SID: System ID, PID: Participant ID, Exp.: Professional Experience in Years process, and lastly reflections and challenges per system. Table 2 provides some basic sys- tem information and the self-assessed assurance maturity levels while Table 3 lists the usage of tools, metrics, and patterns to analyze and improve evolvability. By analyzing and com- paring the individual cases, we identified several trends or common relationships. These generalizations, summaries, or notable deviations from common assumptions are presented in the following subsections that correspond to our three research questions. 4.1 Assurance Processes (RQ1) The intention of RQ1 was to find out what general activities participants employed to assure the evolvability of microservices, how systematically they organized these activities, and what participants’ rationales for these decisions were. Since one microservices characteris- tic is the decentralization of control and management, we also wanted to analyze how much central governance was applied. 4.1.1 Decentralization vs. Governance In general, every analyzed assurance process had some degree of explicit and conscious addressing of evolvability, even though the sophistication and extensiveness of the applied techniques varied greatly. When looking at the larger systems, there were two different approaches for assuring evolvability: very decentralized with very autonomous teams (e.g. S9, S10, S12, S14) vs. centralized governance for macroarchitecture, technologies, and assurance combined with a varying degree of team autonomy for microarchitecture (e.g. S2, S3, S4, S7, S13). The latter kind was usually applied for systems that were built for external customers and that exhibited some project characteristics. In the decentralized variant, the internal system was managed in a continuous product development mode, which created quality awareness by making people responsible and simultaneously empowering them. This variant is more in line with the microservices and 104 Page 12 of 39 Empir Software Eng (2021) 26: 104 Table 2 Interview Systems: General Characteristics ID System Purpose Age in Years # of Services Assurance Maturity Level (0-3) S1 Derivatives management system (banking) 1.5 9 2 S2 Freeway toll management system 2 10 3 S3 Automotive problem man- agement system 1 10 1.5 S4 Public transport sales sys- tem 2 ∼100 1.5 S5 Business analytics & data integration system 1.5 6 P5: 1 P6: 1.5 S6 Automotive configuration management system 1 60 3 S7 Retail online shop 2.5 ∼250 2.5 S8 IT service monitoring plat- form 3 9 2 S9 Hotel search engine 2 ∼10 P10: 1.5 P11: 3 S10 Hotel management suite 1.5 20 2 S11 Public transport manage- ment suite (S11a: human resource management part) 2.5 10 products P13: 1.5 P14: 2.5 S12 Retail online shop 2 ∼45 2.5 S13 Automotive end-user ser- vices mgmt. system 2 7 2 S14 Retail online shop 6 ∼175 2 DevOps principle “you build it, you run it”. Techniques in this variant also were by no means basic or implicit. Even though teams were allowed to choose their own assurance activities, they usually created a more or less structured processes that did not depend on external gov- ernance. This was hard to replicate for IT service providers that often did not operate the systems themselves (S2, S3, S4, S7, S13) and had to coordinate with external customers or even other contractors (S4, S7). Therefore, they relied more on central governance. Archi- tect P4 described it as follows: “In our case, the main challenge is to convince 300 people to move in the same direction. For that, we created a very large number of guidelines and rules for service creation.” Such guidelines, principles, or standardizations were nonetheless seen as important parts of the process in both variants. These coding labels were among the most frequent ones. Nearly all participants reported their usage in various areas such as architectural principles, rules for service communication, skeleton projects, style guides, cross-cutting concerns like logging or authentication, candidate technologies, or Docker images. The degree of enforcement varied between companies and was usually higher within the cen- tralized variant. In the decentralized variant, pragmatism was often more important than Page 13 of 39 104 Empir Software Eng (2021) 26: 104 Table 3 Interview System Assurance Processes: Tools, Metrics, and Patterns ID Tools Metrics Patterns S1 IDE linting – Event-Driven Messaging, Service Registry S2 SonarQube (FindBugs, Checkstyle, PMD) Test coverage, cyclomatic complexity, clone cover- age, # of defects per ser- vice, # of failed tests, # of code smells, # of endan- gered requirements Event-Driven Messaging S3 SonarQube (FindBugs, Checkstyle, PMD) Test coverage, clone cov- erage, defect resolution time Event-Driven Messaging, Strangler S4 SonarQube (FindBugs), VersionEye # of code smells, test coverage, # of outdated dependencies Event-Driven Messaging, Backends for Fron- tends, Consumer-Driven Contracts, Tolerant Reader S5 SonarQube (FindBugs, Checkstyle, PMD) Test coverage Event-Driven Messaging S6 SonarQube (FindBugs, Checkstyle) Test coverage, # of failed tests, # of code smells, cyclomatic complexity, clone coverage, LOC Event-Driven Messaging S7 SonarQube (FindBugs, PMD, Checkstyle), Cober- tura, IDE linting, custom static analyzer for coding conventions Test coverage, cyclomatic complexity, clone cover- age, # of rule violations, velocity API Gateway S8 SonarQube (FindBugs), IDE linting Cognitive complexity, # of code smells, test coverage Event-Driven Messaging S9 P10: Checkstyle, IDE linting P11: SonarQube (FindBugs, Checkstyle) P10: Defect resolution time, burndown P11: # of code smells, # of rule violations Event-Driven Messaging, Request-Reaction S10 SonarQube, IDE linting Test coverage, # of endan- gered usage scenarios Self-Contained Systems, Backends for Frontends S11 P13: SonarQube (Find- Bugs, Checkstyle) P14: FindBugs, PMD, Cober- tura, custom tool for archi- tectural conformance P13: # of code smells P14: # of architectural viola- tions, # of rule violations – S12 SonarQube (FindBugs, Checkstyle, PMD), IDE linting, Codecov, Cobertura Test coverage, cyclomatic complexity, # of code smells, # of rule violations Event-Driven Messag- ing, Consumer-Driven Contracts S13 SonarQube (FindBugs, Checkstyle, PMD) Test coverage, # of code smells, # of rule violations Event-Driven Messaging S14 SonarQube (FindBugs, Checkstyle, PMD), Structure101, Codecov, Cobertura, Codacy Test coverage, CI/CD pipeline duration, LOC Event-Driven Messaging, Self-Contained Systems, Event Sourcing 104 Page 14 of 39 Empir Software Eng (2021) 26: 104 the strict adherence to rules and several participants (P5, P8, P10, P12, P15, P17) reported simplicity as a key principle (“KISS” ⇒ “keep it simple, stupid”). 4.1.2 Automated andManual Activities To make assurance activities more efficient and objective, all participants saw automation and tool support as useful, albeit with varying enthusiasm. Several participants reported the integration of quality analysis tools into the CI/CD pipeline (P2, P3, P8, P9, P11, P14, P15, P17). This was often combined with quality gates, i.e. automated source code checks that could prevent merging or deployment. For architect P17, the pipeline’s execution time was very important: only tools that were absolutely necessary should therefore be integrated. Additionally, several participants advocated for a sensible usage of quality gates. Data engi- neer P11 was frustrated with how difficult it would be to get a passing merge request due to the strict rules. Similarly, developer P10 mentioned that strict quality gates could hinder the deployment of important production bug fixes. Architect P7’s team did not use any qual- ity gates because of continuous experimentation and prototyping. Lastly, lead architect P2’s team circumvented some of these issues by applying quality gates only for releases and not for merge or pull requests. Nearly all participants agreed that test automation was an important part for the assur- ance process of microservices. While unit tests were very common, several participants also reported automated end-to-end tests for the integration of microservices and stressed their importance (P2, P3, P7, P12, P16). Some teams also had more elaborate strategies that linked tests to requirements (S2) or usage scenarios (S10). Participants with only unit tests (P5, P6, P9) or barely any tests (P10) also saw the importance to bring their test automation to a higher level. Despite the reported importance of automation and tool support, several participants also highlighted the usefulness of manual assurance activities. Code reviews were seen as an important practice to increase code quality and to share knowledge within the team (P1, P3, P4, P5, P7, P8, P10, P11). Pair programming was used for the same reasons by two participants and the downside of additional man-hours was willingly accepted (P8, P15). Lastly, refactoring was highly valued and some participants also explicitly mentioned the use of the “boy scout” rule during feature implementations (P11, P15), i.e. the principle to always leave changed code cleaner than before. Activities like these would efficiently increase code quality over time. 4.1.3 Documentation Practices Even though some participants were proponents of concise documentation or architec- tural decision records within the source code repository (P10, P11), several systems relied on more elaborate architecture and service documentation in a system like Confluence or SharePoint (S1, S4, S5, S6, S8). Common types of documentation were system archi- tecture, service dependencies and contracts between teams, service functionality and API descriptions, reference architectures and service blueprints, design rationales, or architec- tural principles and guidelines. For IT service providers, parts of this documentation was also used to communicate with the customer. Lastly, only P7 and P14 reported the conscious tracking of identified technical debt items for later debt management. Architect P7’s teams held an explicit meeting every two weeks, where the most important technical debt items were discussed and their prioritization was decided. Page 15 of 39 104 Empir Software Eng (2021) 26: 104 4.2 Tools, Metrics, and Patterns (RQ2) Our second research question targeted the application of and rationale for tools, metrics, and design patterns. Automation and tool support is an often cited microservices character- istic and seen as necessary to manage a large number of small components. We were also interested if participants used tools and metrics specifically designed for service orientation. Lastly, we wanted to explore the usage of design patterns for evolvability construction, since there is a large body of patterns for service-oriented architecture (SOA) and more recently also for microservices. 4.2.1 Tools Related to Evolvability Assurance While the usage of over a dozen different tools for evolvability assurance was reported, 14 of 17 participants named SonarQube as a central tool that was usually integrated into the CI/CD pipeline. Since P1 and P10 planned to introduce it soon, architect P14 remained the only participant that would not use SonarQube in the foreseeable future. Reported reasons for its popularity were the OpenSource license, the easy installation, plugin availability, and configurability. In Java-focused systems, SonarQube was often extended with tools like FindBugs, Checkstyle, and PMD. Additionally, specialized tools for test coverage like Cobertura (P8, P14, P15, P17), Codecov (P15, P17), or Codacy (P17) were used. For a basic degree of local and immediate quality assurance, IDE linting via e.g. TSLint, ESLint, and PHPLint was reported by some participants (P1, P8, P10, P12, P15). 4.2.2 Evolvability-Related Metrics With respect to metrics, 10 of the 17 participants reported the usage of test coverage, even though some perceived this metrics as less important than others and were very aware of possible quality differences with automated tests. Architect P12 termed it as follows: “Even I could fake the coverage for two classes you give me in like five minutes. You can write a test that brings coverage to about 60%, but actually it covers like 2%.” Some participants also focused on additional metrics for testing and functional correctness like # of failed tests over time (P2, P7), # of defects per service (P2), or # of endangered requirements (P2) or usage scenarios (P12). Most SonarQube users also payed attention to standard findings like code smells, code duplication, and cognitive or cyclomatic complexity. Participants with rule-based tools like FindBugs, Checkstyle, or other linters used the number of rule violations as a simple metric that had to be zero. Overall, most applied metrics were focused on source code quality, even though their effectiveness for the whole system was seen as controversial by a few participants (P8, P17). Architect P17 described it as follows: “Most of these metrics relate to a single project, which is very useful when I have a monolith with a million LOC. However, if I have a service with 1000 LOC which code base is separated from all other 150 microservices, most of these metrics lose their importance.” With respect to productivity metrics, some interviewees reported the usage of defect resolution time (P3, P10), velocity (P8), sprint burndown (P10), or deployment duration (P17). These were important for them to control and manage software evolution. While architecture-related topics like microservices dependencies were very prevalent during our interviews, participants generally did not apply architecture-level tools and metrics. Architect P14’s team used a custom tool for architectural conformance checking in the monolithic code base for a sub product of S11 and architect P17 reported the intermittent 104 Page 16 of 39 Empir Software Eng (2021) 26: 104 usage of Structure101 for a larger subsystem that also consisted of one code base. Apart from that, tools or metrics were exclusively focused on code quality with a local view for a single service. No automatic or semi-automatic efforts were mentioned to evaluate the architecture of a microservice-based system. Likewise, no participant reported the usage of a tool or metric specifically designed for service orientation. When we explicitly asked about service-oriented metrics like the coupling of a service, the cohesion of a service interface, or the number of operations in a service interface, several participants indicated that these sounded interesting and useful (P1, P5, P6, P7, P8, P15). Some interviewees also noted that the underlying princi- ples of these metrics were important guidelines in the architecture and design phase (P7, P8, P10, P15, P17). They tried to manually respect these principles during e.g. service cutting, even though they currently had no concrete measurements in place to validate them. Another common theme in this area was the healthy and non-patronizing usage of tools and metrics, which should be respected when developing microservices in decentral- ized and autonomous teams. As already mentioned, several participants voiced reservations against test coverage (P10, P12, P15). Architect P14 also warned that a strict metric focus would pose the danger that people optimized for measurements instead of fixing the underlying problems. Moreover, lead architect P15 perceived it as difficult to interpret measurements of a single service without a point of reference or a system-wide average. Architect P17 advocated for a sparse usage of tools, because too many metrics could not be analyzed by developers and their collection could slow down the deployment pipeline. Only tools that would support the analysis of current problems should be kept. Lastly, architects P8 and P14 highlighted the agile principle of “individuals and interactions over processes and tools”: the usage of tools and metrics should support developers in their daily work and not be a frustrating and alienating experience for them. 4.2.3 Service-Based Patterns for Evolvability We also analyzed the usage of service-oriented design patterns as conscious means to increase evolvability. In general, we did not find a widespread usage of them. Most com- mon was Event-Driven Messaging that was partially applied in 11 of the 14 cases. While several participants stated that the pattern was used to decouple services, another inten- tion was to implement reliable asynchronous and long-running communication. The pattern was sometimes paired with Request-Reaction (P10, P16). Apart from messaging, most par- ticipants applied activity patterns like Service Refactoring, Service Decomposition, and Service Normalization. In line with the philosophy of evolutionary design, microservices were frequently split and merged. Other patterns were used sporadically. P12 and P17 applied the Self-Contained Sys- tem paradigm to achieve vertical isolation between subsystems. In a migration context, P3 and P16 reported the usage of the Strangler pattern to extend an existing monolith with new microservices until its final replacement. To place an intermediary between service consumers and producers, P4 and P12 implemented the Backends for Frontends pattern that would also prevent too many concurrent long-running HTTP requests. Similarly, P8 chose the API Gateway pattern which also brought benefits for security. The patterns Consumer-Driven Contracts (P4, P15) and Tolerant Reader (P4) were applied to make ser- vice interface evolution more robust and to prepare consumers for future changes. Lastly, developer P1 was the only participant to explicitly report the usage of the Service Registry Page 17 of 39 104 Empir Software Eng (2021) 26: 104 pattern for dynamic service discovery, even though some participants may have used similar functionality via Kubernetes. 4.3 Evolvability Reflections and Challenges (RQ3) With RQ3, we wanted to analyze participants’ perception of the general evolution qualities of their microservice-based systems as well as their satisfaction with their current assurance processes (see Fig. 6). We also tried to summarize what participants experienced as the most important challenges for the evolvability of their microservices (see Fig. 7). 4.3.1 Perceived System Evolvability In general, our interviewees perceived the evolvability of their microservices as positive (mean: +0.88, median: +1), especially in cases with a migration context where a monolith had been rewritten. Only two participants chose a negative rating (-1). Architect P4 saw the high degree of technological heterogeneity and the very different service granularity as threatening for the large project, especially once S4 would be handed over to the smaller maintenance team. Data engineer P11 described the chosen service cuts as inefficient and politically motivated and worried about significant issues with the consistency of the data model as well as the inaccessibility of code due to distributed repositories. As for more specific quality attributes, the analyzability of individual services would be much improved (P1, P8, P10, P16, P17), even though grasping and understanding the whole system would be difficult (P7, P8, P11, P17). When compared to most monoliths, the modularity of microservices would make it very convenient to change or add functionality (P1, P3, P6, P9, P12, P17) and would also allow to efficiently scale-out the development with multiple teams (P7, P16). Even though reuse is usually a theme more common in SOA, several participants reported a positive reusability of their microservices (P1, P3, P7, P8, P10, P17). To reduce coupling, some participants avoided the sharing of non-open source libraries between services via duplication (P7, P10, P15, P17). Others tried to consciously increase reuse via shared libraries (P3, P9) or by slightly generalizing service interfaces (P1). Lastly, participants reported that individual services would be easy to test (P3, P7, P10, P13, P15) and to replace (P2, P15, P17). 12% 12% 18% 0% 76% 76% 59% 59% 12% 12% 24% 41% Evolvability Rating Assurance Effectiveness Assurance Impact on Productivity Wanted Change in Assurance Efforts 100 50 0 50 100 Percentage very negative (−2) negative (−1) neutral (0) positive (+1) very positive (+2) Fig. 6 Aggregated Evolvability Assurance Reflections of 17 Participants 104 Page 18 of 39 Empir Software Eng (2021) 26: 104 Fig. 7 Evolvability Challenges With At Least 2 Mentions from 17 Participants 4.3.2 Evolvability Challenges Since most systems were fairly young or even still in the process of being migrated, indi- vidual services were usually of a good quality. Basic symptoms of technical debt or bad code quality were rarely seen as an issue, especially since a single service would be easy to replace. However, problems related to architecture and the data model were reported as serious threats for long-term evolvability (P3, P7, P11, P13, P15). This was sometimes exacerbated because coordination between autonomous teams would be difficult (P4, P10, P11, P15). Moreover, finding the appropriate service granularity was a prevalent theme and service cutting was by far named as the most challenging activity that was also associated with frequent refactoring (P2, P3, P4, P6, P7, P9, P11, P12, P15). Harmful inter-service dependencies sometimes led to ripple effects on changes (P3, P5, P9, P11, P15), which made adding or changing functionality slower and more error-prone. Breaking API changes caused similar effects for service consumers (P1) or automated tests (P2). Participants did not use any tools to support service decomposition or metrics to evaluate the quality of the chosen cuts, e.g. via coupling or cohesion. Lead architect P2 described it as follows: “In my opinion, there are no useful tools to split up a monolith. It’s always a very difficult manual activity. You can use something like Domain-Driven Design, but that’s just a methodology which doesn’t give you a concrete solution.” Participants were divided when it came to technological heterogeneity. In very decen- tralized environments, it was generally perceived as overall beneficial (P10, P15, P17), as it would allow choosing the best solution for problems at hand, broaden developers’ experience and skills, and make a company a more attractive employer. Other participants perceived it as potentially dangerous and wished for a more sensible handling of technology hypes (P3, P4, P12, P16). Similarly, the mix of legacy and modern service technology would Page 19 of 39 104 Empir Software Eng (2021) 26: 104 sometimes pose additional problems (P2, P11, P13, P14), like in the case of S9 where addi- tional tooling was necessary to integrate legacy PHP components. Most participants also noted that significant efforts had to be spent on mastering new microservices and DevOps technologies and it would be problematic to find skilled developers (P1, P3, P5, P6, P9, P10, P12, P13, P16). Overall, participants were very aware of the human factors of evolvability and sometimes even saw them as more challenging as technical ones. Knowledge exchange between teams was therefore a high priority for some interviewees (P10, P13, P15). Concerning participants’ reflection of their assurance processes (see also Fig. 6), most saw the effectiveness of their assurance activities (mean: +0.76, median: +1) as well as overall impact on productivity (mean: +0.59, median: +1) as positive. Only three intervie- wees (P9, P11, P12) reported that activities would hinder development efficiency (-1) and would sometimes slow down feature development. Moreover, participants generally wanted to invest more effort for the assurance (mean: +0.76, median: +1) and try out new techniques or metrics. No one reported the wish to reduce efforts. 4.3.3 Influence of Microservices on the Assurance Process However, the influence of microservices on the assurance process was seen as contro- versial. While testing a single service would be easy, integration testing would be more complex because of an additional layer (P2, P3, P13). This would be especially critical if microservices were developed in independence for a long time and integrated at a later stage. Furthermore, root cause analysis of issues would be more complex in such a highly distributed system (P3, P11). A very commonly named concern was that keeping a system- centric quality view and assessing the macroarchitecture would be much more difficult (P4, P6, P7, P8, P10, P11, P15, P17), which architect P8 described as follows: “I’d say we are pretty good when it comes to assuring the evolvability of single services. However, we have a lot of catching up to do for everything that crosses product or service boundaries.” Dis- tributed code repositories and autonomous teams would make the access to code as well as static analysis more complicated. It would also be hard to compare metrics between services and relate them to system-wide averages (P8, P11, P17). Nonetheless, participants also named positive factors. Small services would not only be easy to replace, people would also be much more motivated to fix a small number of issues for a project (P15, P17), which lead architect P15 described as follows: “In a monolith with 100.000 FindBugs warnings, you are completely demotivated to even fix a single one of those. In a microservice with 100 warnings, you just get to work and remove them.” If adopted correctly, microservices would also bring a cultural change with respect to quality awareness and responsibility (P10, P15, P17). Architect P17 high- lighted the importance of continuous product development in this regard: “If you work in a project mode, evolvability assurance usually annoys you, because you have short-term goals and want to finish the project. In a product mode, the team knows that they sabotage their system’s evolvability in the long run, if they take too many short-cuts.” Lastly, lead architect P15 noted that while they were relatively satisfied with their current evolvability assurance activities, they did not really invest much efforts into researching and design- ing a fitting evolvability assurance strategy for the future. Finding out which approaches, tools, and metrics worked best for their microservices could be a vital advantage in the long-term. 104 Page 20 of 39 Empir Software Eng (2021) 26: 104 5 GLR Results From our 295 included resources, 96 were from Bing, 78 from Google, 30 were discovered by both Bing and Google, 2 by both Google and StackExchange, and 89 were exclusively from the StackExchange communities (SO: 32, SE: 51, SQA: 4, DevOps: 2). Therefore, roughly one third of our resources were Q&A posts. From the six categories of our cod- ing system10, the most frequently used one was Challenges: 218 of 295 resources (74%) included at least one label from this category. Second and third most popular categories were Process (66%) and Patterns (48%). 78 resources were assigned at least one label from Influence on Assurance (26%), while 16% of resources mentioned Tools and only 6% Met- rics. Similar as with the interviews, we also present the GLR results in three subsections that correspond to our research questions. 5.1 Assurance Processes (RQ1) Our first RQ was concerned with the general processes for evolvability assurance and the applied activities. 5.1.1 Test Automation The most frequently mentioned process activity was test automation (94 resources, 32%), which was described as an essential prerequisite for sustainable microservice develop- ment and evolution. In general, unit tests were still seen as necessary but not sufficient for microservices. Instead, the majority of resources called for an extension of the classical test pyramid: “The test pyramid was conceived during the era of the monolith and makes a lot of sense when we think about testing such applications. For testing distributed systems, I find this approach to be not just antiquated but also insufficient.”11 Practitioners therefore advocated for an extensive usage of integration, contract, and end-to-end tests, or as André Schaffer from Spotify put it: “A more fitting way of structuring our tests for Microservices would be the Testing Honeycomb. That means we should focus on Integration Tests, have a few Implementation Detail Tests and even fewer Integrated Tests (ideally none).”12 These integration or end-2-end tests were often enabled by partly mocking or spawning involved components. As an alternative to this, some resources mentioned QA in Production (8), i.e. running tests within the production environment to have the most realistic conditions. Lastly, a few resources also described the usage of chaos and load testing or applying practices like test-driven development (TDD) or behavior-driven development (BDD). 5.1.2 Decentralization vs. Governance Similar as with the interviews, we also found the two antagonistic forces of decentraliza- tion & empowerment and governance & standardization, which both had an influence on the assurance process. In general, there was more tendency towards decentralization (37 resources with positive mentions) than towards standardization (28) since this was seen as important to guarantee independent and fast service evolution. Vinay Sahni, the founder of 10https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/coding-labels.md 11https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16 12https://labs.spotify.com/2018/01/11/testing-of-microservices Page 21 of 39 104 https://github.com/xJREB/research-microservices-evolvability-glr/blob/master/coding-labels.md https://medium.com/@copyconstruct/testing-microservices-the-sane-way-9bb31d158c16 https://labs.spotify.com/2018/01/11/testing-of-microservices Empir Software Eng (2021) 26: 104 Enchant, advocated to maximize the autonomy of teams so that they would have to coor- dinate less and would be more productive.13 WSO2 described a similar philosophy in their microservices white paper: “With respect to development lifecycle management, microser- vices are built as fully independent and decoupled services with a variety of technologies and platforms.”14 Lastly, Armağan Amcalar, head of software engineering at unu GmbH, also remarked in an interview with InfoQ’s Ben Linders that this decentralization was not only beneficial for service evolution but also desired by developers: “Teams want autonomy and ownership. They don’t want to be bound by the decisions of others, and they don’t want to feel obliged to justify themselves and their decisions towards others.”15 On the other hand, many resources recommended basic standardization and governance to avoid chaos, especially for infrastructure, cross-cutting concerns, or service communica- tion, but not for domain-related functionality. Among these were also several proponents of decentralization like Vinay Sahni (“Provide flexibility without compromising consistency: Give teams the freedom to do what’s right for their services, but have a set of standardized building blocks to keep things sane in the long run.”13) and WSO2 (“Run-time gover- nance aspects, such as SLAs, throttling, monitoring, common security requirements, and service discovery, are not implemented at each microservice level. Rather, they are real- ized at a dedicated component, often at the API-gateway level.”14). This sensible balance between decentralization and autonomy on the one hand and governance and standardiza- tion on the other hand was also described as “Goldilocks Governance” by Neal Ford, a director at ThoughtWorks16. In general, a lot of practitioners were very aware of this trade- off: “Choose wisely what you leave out of your macro-architecture. For every choice you allow the individual development teams to make, you must be willing to live with differing decisions, implementations, and operational behaviors.”17 5.1.3 Principles and Guidelines A very important part of the mentioned standardization were principles or guidelines (66 resources, 22%), which were mostly defined for service design and inter-service commu- nication. The most popular principle (20) was to limit or completely avoid the sharing of databases or database tables between microservices. Advocated reasons were to encapsu- late information and to avoid coupling: clear data ownership should guarantee independence between services. A similar “don’t share!” principle was related to domain-specific libraries (18). Having the same shared library within several microservices would lead to upgrade and deployment dependencies. Most authors saw redundancy as preferable to coupling in such cases (“lesser evil”). A softer version of this principle we encountered was to allow sharing libraries between services managed by the same team. In nearly all cases, the shared usage of open source libraries for non-domain related functionality was not seen as harm- ful. Another popular guideline was to reduce or avoid direct synchronous (RESTful) calls between services (14), since this would lead to harmful coupling. Some authors also wrote that true service independence would only be possible with Event-Driven Messaging. Other mentioned principles were favoring pragmatism and simplicity (12) over complex solutions, 13https://www.vinaysahni.com/best-practices-for-building-a-microservice-architecture 14https://wso2.com/whitepapers/microservices-in-practice-key-architectural-concepts-of-an-msa 15https://www.infoq.com/news/2018/11/human-side-microservices 16http://nealford.com/downloads/Evolutionary Architecture Keynote by Neal Ford.pdf 17https://www.freecodecamp.org/news/microservices-from-idea-to-starting-line-ae5317a6ff02 104 Page 22 of 39 https://www.vinaysahni.com/best-practices-for-building-a-microservice-architecture https://wso2.com/whitepapers/microservices-in-practice-key-architectural-concepts-of-an-msa https://www.infoq.com/news/2018/11/human-side-microservices http://nealford.com/downloads/Evolutionary_Architecture_Keynote_by_Neal_Ford.pdf https://www.freecodecamp.org/news/microservices-from-idea-to-starting-line-ae5317a6ff02 Empir Software Eng (2021) 26: 104 avoiding breaking API changes (8), using (semantic) versioning (4) in the spirit of evolvable providers and adaptable consumers, the SOLID18 design principles (8), and the 12-Factor App Guidelines (4). 5.1.4 Other Activities Lastly, the following process-related activities and techniques were mentioned by a few resources: the deliberate refactoring of services (merging and splitting) to improve evolv- ability (16); architecture documentation (15), which was seen as especially important for service interfaces; code reviews (6) to improve quality and share knowledge; and fea- ture toggles (5) to enable a more flexible service evolution within continuous integration practices. 5.2 Tools, Metrics, and Patterns (RQ2) Our second RQ targeted the usage of concrete tools, metrics, and patterns related to evolvability assurance. 5.2.1 Tools Related to Evolvability Assurance We identified 46 resources (16%) that mentioned the usage of tools for the evolvabil- ity assurance process (see Fig. 8). The overwhelming majority were related to automated testing (38 of 47 of unique reported tools). Among these, tools for integration and con- tract testing were especially popular. Most frequently mentioned tools here were Pact (15), SoapUI (5), Spring Cloud Contract (5), and Postman (4). In the important area of consumer- driven contract testing, the company Testsigma advocated the usage of two tools: “There are several tools for contract testing, but the most reliable and effective ones are Spring Cloud Contract and Pact.”19 For the mocking and stubbing of services, WireMock (7) was the most popular tool, even though a lot of other tools were mentioned as well, e.g. Moun- tebank (2), Restito (2), Hoverfly (1), or REST-driver (1). Another area of interest was UI and end-to-end testing. Here, Selenium (8) was the most trusted tool: “Selenium is consid- ered an industry standard in automating web applications for testing purposes. Thanks to Selenoid, Selenium Hub successor, multiple Selenium servers can be run with many differ- ent browser versions encapsulated in Docker containers.”20 An alternative framework was Cypress (2), which was e.g. chosen by Zalando for its fast performance and avoidance of non-deterministic tests.21 Less represented were tools to capture and replay HTTP traffic for realistic and repeatable tests like VCR (4), GoReplay (2), or Betamax (1); resilience and chaos testing like Netflix’s Simian Army (5) or the Java tool Byteman (1); or load testing with e.g. JMeter (1). Besides this variety of testing tools, there was just a small number of tools for other assurance-related activities. Only three resources reported the usage of SonarQube with integration into a CI/CD pipeline with quality gates. Other used static analysis tools were Kiuwan (1), X-Ray (1), NDepend (1), JDepend (1), Code City (1), and Source Monitor (1). 18http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod 19https://testsigma.com/blog/testing-microservices-challenges-and-strategies-testsigma 20https://codilime.com/quality-assurance-trends-for-2020 21https://jobs.zalando.com/en/tech/blog/end-to-end-microservices Page 23 of 39 104 http://butunclebob.com/ArticleS.UncleBob.PrinciplesOfOod https://testsigma.com/blog/testing-microservices-challenges-and-strategies-testsigma https://codilime.com/quality-assurance-trends-for-2020 https://jobs.zalando.com/en/tech/blog/end-to-end-microservices Empir Software Eng (2021) 26: 104 15 8 7 5 5 5 4 3 3 3 3 0 2 4 6 8 10 12 14 16 Pact Selenium WireMock Simian Army SoapUI Spring Cloud Contract Postman Cucumber SonarQube Testcontainers VCR # of resources To o l Fig. 8 Evolvability Tools With At Least 3 Mentions in 295 GLR Resources Overall, we found very little explicit reports of such tools in our resources (only 9 mentions in 295 resources). 5.2.2 Evolvability-Related Metrics With respect to evolvability-related metrics (see Fig. 9), we identified 27 unique metrics in only 19 resources (6%). This makes metrics the least represented label category. Even though 38 of 47 identified tools were related to automated testing, only seven resources reported the usage of test coverage as a metric. While this still makes it the most mentioned metric, some authors like Chris Richardson saw deficits with it, even though he still rec- ommended its usage: “While test coverage is not the best metric, it should also be enforced by the deployment pipeline.” 22 Other metrics related to correctness and reliability were deployment success rate (2), # of defects per service (1), or # of failed tests (1). Similar to the small number of tools for static analysis, we also identified just a few metrics for architecture and source code quality (12 metrics with 13 mentions). Apart from lines of code (2), each of these metrics were only reported once, e.g. # of classes, clone cov- erage, cognitive complexity, or cyclomatic complexity. Among these 12 metrics, even less were related to architecture or service orientation. Rare examples were # of dependencies, component entanglement, or static coupling. Lastly, the most frequently used metrics were related to productivity, often in the context of the CI/CD pipeline (19 mentions of 8 metrics). Popular examples were cycle time (5), # of deploys to production (4), deployment duration (3), or mean time to repair (MTTR) (3). For Chris Richardson, such metrics were a very important instrument to track and improve the practices for “rapid, frequent and reliable software delivery”.23 Proponents of such metrics advocated that teams should have some measure of productivity or stability in place to 22https://chrisrichardson.net/post/antipatterns/2019/04/09/antipattern-flying-before-walking.html 23https://www.slideshare.net/chris.e.richardson/melbourne-jan-2019-microservices-adoption-antipatterns -obstacles-to-decomposing-for-testability-and-deployability 104 Page 24 of 39 https://chrisrichardson.net/post/antipatterns/2019/04/09/antipattern-flying-before-walking.html https://www.slideshare.net/chris.e.richardson/melbourne-jan-2019-microservices-adoption-antipatterns-obstacles-to-decomposing-for-testability-and-deployability https://www.slideshare.net/chris.e.richardson/melbourne-jan-2019-microservices-adoption-antipatterns-obstacles-to-decomposing-for-testability-and-deployability Empir Software Eng (2021) 26: 104 7 5 4 3 3 2 2 0 1 2 3 4 5 6 7 Test Coverage Cycle Time # of Deploys to Production Deployment Duration MTTR Deployment Success Rate Lines of Code # of resources M et ri c Fig. 9 Evolvability Metrics With At Least 2 Mentions in 295 GLR Resources identify changes in their service evolution speed, preferably with drill-downs to identify the parts of the life cycle that took longer than usual. 5.2.3 Service-Based Patterns for Evolvability We also analyzed the usage or recommendation of service-based patterns to improve evolvability (see Fig. 10). Nearly half of our resources reported at least one such pattern (140 of 295, 48%). Out of the 15 unique patterns, the most frequently mentioned one was Event-Driven Messaging (73 resources). Practitioners used it to decouple services and to allow for a more flexible service evolution. Moreover, they sometimes combined it with the related patterns Command Query Responsibility Segregation (CQRS) (22) or Event Sourc- ing (21) for an even greater effect. Most resources described this kind of communication architecture as more evolvable as RESTful HTTP: “The request-response pattern creates point-to-point connections that couple both sender to receiver and receiver to sender, mak- ing it hard to change one component without impacting others. Due to this, many architects use middleware as a backbone for microservice communication to create decoupled, scal- able, and highly available systems.”24 In an Nginx blog post, Chris Richardson explained that messaging would decouple client and service via a shared channel so that clients could be completely unaware of the final message receivers.25 A lot of resources also mentioned the usage of patterns that acted as a shielding inter- mediary like API Gateway (43), Backends for Frontends (BFF) (12), or Service Façade (8). This would also reduce the efforts of service interface changes, since only the intermediary needed to be changed instead of all clients. IBM Cloud Learn Hub described the API Gate- way as follows in their microservices guide: “While it’s true that clients and services can communicate with one another directly, API gateways are often a useful intermediary layer, especially as the number of services in an application grows over time. An API gateway acts as a reverse proxy for clients by routing requests, fanning out requests across multiple 24https://www.confluent.io/blog/microservices-apache-kafka-domain-driven-design 25https://www.nginx.com/blog/building-microservices-inter-process-communication Page 25 of 39 104 https://www.confluent.io/blog/microservices-apache-kafka-domain-driven-design https://www.nginx.com/blog/building-microservices-inter-process-communication Empir Software Eng (2021) 26: 104 73 43 31 23 22 21 12 9 8 5 0 10 20 30 40 50 60 70 80 Event-Driven Messaging API Gateway Consumer-Driven Contracts Service Registry CQRS Event Sourcing Backends for Frontends Strangler Service Facade Service Mesh # of resources P at te rn Fig. 10 Top 10 Evolvability Patterns Mentioned in 295 GLR Resources services, and providing additional security and authentication.” 26 Similarly, Microsoft Azure’s Mike Wasson highlighted the advantages of the BFF pattern, namely that backend microservices did not need to respect specialized needs of different clients, which simplified and shielded services by shifting client-specific requirements to a BFF.27 To facilitate dynamic communication via service discovery, the Service Registry pattern (23) was recommended: “In an environment where service instances come and go, hard coding IP addresses isn’t going to work. You will need a discovery mechanism that ser- vices can use to find each other.” 13 Another means to cope with a dynamic microservices environment were patterns to manage API evolution like Consumer-Driven Contracts (31): “These tests should be part of the regular deployment pipeline. Their failure would allow the consumers to become aware that a change on the producer side has occurred, and that changes are required to achieve consistency again.”28 An alternative on the client-side was the Tolerant Reader pattern (5). Chris Richardson recommended to prepare both client and services for changes: “It makes sense to design clients and services so that they observe the robustness principle. Clients that use an older API should continue to work with the new version of the service. The service provides default values for the missing request attributes and the clients ignore any extra response attributes.”25 Lastly, a few resources described the usage of patterns for sharing common infrastructure-related needs like Service Mesh (5) or Sidecar (4). WSO2 justified this with the argument that a lot of cross-cutting concerns were shared between services and could be offloaded to infrastructure components to keep the services themselves analyzable and focused on business logic.14 5.3 Evolvability Reflections and Challenges (RQ3) Our last RQ for the GLR primarily targeted evolvability challenges for microservices as well as their influence on the assurance process. 26https://www.ibm.com/cloud/learn/microservices 27https://azure.microsoft.com/en-us/blog/design-patterns-for-microservices 28https://phoenixnap.com/blog/microservices-continuous-testing 104 Page 26 of 39 https://www.ibm.com/cloud/learn/microservices https://azure.microsoft.com/en-us/blog/design-patterns-for-microservices https://phoenixnap.com/blog/microservices-continuous-testing Empir Software Eng (2021) 26: 104 5.3.1 Evolvability Challenges Since 74% of resources (219) mentioned at least one of the 21 identified challenges (see Fig. 11), this was the most frequent label category. The challenge we encountered most often was Microservices Integration (86 resources, 29%) and its specialization Aggregating Data from Several Services (39 resources, 13%): “The biggest challenge is aggregation of all the individual products or services and their integration with one another. As Sam Newman points out, ’Getting integration right is the single most important aspect of the technology associated with microservices in my opinion. Do it well, and your microservices retain their autonomy, allowing you to change and release them independent of the whole. Get it wrong, and disaster awaits.’”29 Even though many experienced companies like Spotify pointed out the importance and difficulty of this problem (“The biggest complexity in a Microservice is not within the service itself, but in how it interacts with others, and that deserves spe- cial attention.” 12), they provided very little concrete guidance on how to approach it. Many questions on StackOverflow and the StackExchange communities circled around these top- ics and showed practitioners’ uncertainty, e.g. “Microservices : aggregate data : is there some good patterns?”30 or “Communication between two microservices”31. Most of these integration issues were arguably related to other prevalent challenges like Service Cutting (69 resources, 23%), i.e. finding the right service granularity and encap- sulating related functionality in the same microservice, and its consequences Inter-Service Dependencies / Ripples (40 resources, 14%) and Breaking API Changes (30 resources, 10%). Many practitioners ended up with far too many services with lots of dependencies: “In our attempt to decompose our services we have made them very small (e.g. responsi- ble for handling a few data attributes). This easily creates the challenge of the individual services to need to talk to each other to accomplish their own task. It is as if they’re jeal- ous of each other’s data and functionality.”32 Others reported that too large services would also be problematic and common: “On their initial foray into microservices, many people are concerned that they’ll overpartition their functionality and end up with too many tiny microservices. In my experience, overpartitioning is rarely the issue; it’s more common to stuff too much into each service.”33 Additionally, many developers were often unsure if they should add new functionality to an existing service or create a new one. With respect to harmful coupling between services, Mark Richards highlighted the importance to control dependencies: services should be isolated as much as possible, without a lot of direct com- munication with other services.34 One frequently encountered advice was to simply merge two highly coupled services: “If two services are constantly calling back to one another, then that’s a strong indication of coupling and a signal that they might be better off combined into one service.”35 Another common challenge was the increased Architectural / Technical Complexity (67 resources, 23%), which would make it more difficult to understand, extend, and operate the 29https://www.cigniti.com/blog/testing-microservices-architecture-strategy 30https://stackoverflow.com/questions/57788014 31https://stackoverflow.com/questions/36701111 32https://www.tigerteam.dk/2014/micro-services-its-not-only-the-size-that-matters-its-also-how-you-use -them-part-1 33https://techbeacon.com/app-dev-testing/5-fundamentals-successful-microservice-design 34https://www.oreilly.com/content/microservices-antipatterns-and-pitfalls 35https://opensource.com/article/18/4/guide-design-microservices Page 27 of 39 104 https://www.cigniti.com/blog/testing-microservices-architecture-strategy https://stackoverflow.com/questions/57788014 https://stackoverflow.com/questions/36701111 https://www.tigerteam.dk/2014/micro-services-its-not-only-the-size-that-matters-its-also-how-you-use-them-part-1 https://www.tigerteam.dk/2014/micro-services-its-not-only-the-size-that-matters-its-also-how-you-use-them-part-1 https://techbeacon.com/app-dev-testing/5-fundamentals-successful-microservice-design https://www.oreilly.com/content/microservices-antipatterns-and-pitfalls https://opensource.com/article/18/4/guide-design-microservices Empir Software Eng (2021) 26: 104 86 69 67 40 39 30 15 14 14 13 13 0 10 20 30 40 50 60 70 80 90 Microservices Integration Service Cutting Architectural / Technical Complexity Inter-Service Dependencies / Ripples Aggregating Data from Several Services Breaking API Changes Distributed Code Repositories Technological Heterogeneity Code Duplication Coordination Between Decentralized Teams Mastering Technologies # of resources C h al le n g e Fig. 11 Evolvability Challenges With At Least 10 Mentions in 295 GLR Resources system: “Things can get a lot harder for developers. In the case where a developer wants to work on a journey, or feature which might span many services, that developer has to run them all on their machine, or connect to them. This is often more complex than simply running a single program.”36 Many resources described this as a gradually increasing prob- lem, especially if the necessary automation and infrastructure was not yet present. Thorben Janssen wrote in a blog post on Stackify that implementing a single microservice would be straightforward, but – due to the technical complexity of distributed systems – this would quickly change when developing several inter-connected services.37 In some cases, this complexity became so unmanageable that teams even moved back to more coarse-grained or monolithic structures, as for example described by the company Segment: “It seemed as if we were falling from the microservices tree, hitting every branch on the way down. Instead of enabling us to move faster, the small team found themselves mired in exploding com- plexity. Essential benefits of this architecture became burdens.”38 As specialized challenges of this complexity, practitioners also reported Distributed Code Repositories (15), Code Duplication (13), and having No System-Centric View (8). While one repository per service would have the benefit of isolation, it also would come with a management overhead: “As soon as you start reaching 30+ microservices, managing source repositories/versioning and setting CI/CD hooks is going to become an increasingly important but also a very tedious process, and there is no way around it. Managing repositories will add complexity and costs for you.”39 Lastly, keeping an overview of all microservices and their interactions was also described as near impossible: “Your management tools no longer work as well, as they may not have ways of visualizing complex microservices views. [...] Microservices architectures without some amount of structure are difficult to rationalize and reason with, as there is no obvious way to categorize and visualize the purpose of each microservice.”40 36https://dwmkerr.com/the-death-of-microservice-madness-in-2018 37https://stackify.com/communication-microservices-avoid-common-problems 38https://segment.com/blog/goodbye-microservices 39https://blog.usejournal.com/microservices-have-you-thought-this-through-44fc2d829fe3 40https://www.slideshare.net/AbhishekSood10/the-top-6-microservices-patterns-83814065 104 Page 28 of 39 https://dwmkerr.com/the-death-of-microservice-madness-in-2018 https://stackify.com/communication-microservices-avoid-common-problems https://segment.com/blog/goodbye-microservices https://blog.usejournal.com/microservices-have-you-thought-this-through-44fc2d829fe3 https://www.slideshare.net/AbhishekSood10/the-top-6-microservices-patterns-83814065 Empir Software Eng (2021) 26: 104 In addition to these many technical and architectural challenges, we also found a few organizational or human-related challenges. In the context of governance, some practition- ers warned against a too high degree of Technological Heterogeneity (14), as this would also come with the burden of Mastering Technologies (13): “I wouldn’t recommend mixing too many programming languages because hiring people gets more difficult. Also, the context switches for your programmers would slow down development.”41 An explosion in pro- gramming languages could easily increase maintenance and evolution efforts. According to Susan Fowler’s “Production-Ready Microservices” guide, it would therefore be more sen- sible to decide on a small set of languages, frameworks, and libraries to avoid the burden of supporting a multitude of technologies.42 Finally, practitioners reported the Coordination Between Decentralized Teams (13) as another challenge in this area: “One obvious source of complexity is coordination and consensus between teams. In order to control this com- plexity, this necessitates activities like a deeper understanding of the common libraries and dependencies used within the microservices.”43 5.3.2 Influence of Microservices on the Assurance Process As a last focus of analysis, we wanted to determine how practitioners described the influ- ence of microservices on the assurance process. The 78 resources (26%) in this area predominantly reported a negative influence (81 of 101 mentions). The most prominently used label was Increased Testing Complexity (60 resources, 20%). Many practitioners pointed out the need for new approaches in this area: “Microservices demand a new approach to QA. In contrast to monolithic applications, where every part of an application can be tested at the same time, microservices make QA much more complicated because each microservice may be developed and delivered according to its own schedule.”44 Even in comparatively small microservice-based systems, testing was described as much more difficult, especially if each service would have a specialized technology stack and its own code base, dependencies, feature branches, and database technology.45 Additionally, sev- eral questions on StackExchange like “How to test a cluster of microservices?”46 or “What is the role of QA in testing an application having MicroServices architecture?”47 circled around this topic. Other described negative influences were related to the Difficult Root Cause Analysis (11), Difficult Macroarchitecture Assessment (6), or Difficult Quality Anal- ysis (4) in general. Michael Kutz from REWE digital reported the following experiences: “While the Microservice architectural style has a lot of benefits, it makes certain QA prac- tices impractical: there is no big release candidate that can be tested before put to production, no single log file to look into for root cause analysis and no single team to assign found bugs to. Instead there are deployments happening during test runs, as many log files as there 41https://codingsans.com/blog/microservice-architecture-best-practices 42https://www.oreilly.com/library/view/production-ready-microservices/9781491965962/ch01.html 43https://www.capitalone.com/tech/software-engineering/analyzing-polyglot-microservices 44https://blog.runscope.com/posts/qa-in-a-microservices-world 45https://www.freecodecamp.org/news/these-are-the-most-effective-microservice-testing-strategies-according -to-the-experts-6fb584f2edde 46https://devops.stackexchange.com/questions/731 47https://sqa.stackexchange.com/questions/33998 Page 29 of 39 104 https://codingsans.com/blog/microservice-architecture-best-practices https://www.oreilly.com/library/view/production-ready-microservices/9781491965962/ch01.html https://www.capitalone.com/tech/software-engineering/analyzing-polyglot-microservices https://blog.runscope.com/posts/qa-in-a-microservices-world https://www.freecodecamp.org/news/these-are-the-most-effective-microservice-testing-strategies-according-to-the-experts-6fb584f2edde https://www.freecodecamp.org/news/these-are-the-most-effective-microservice-testing-strategies-according-to-the-experts-6fb584f2edde https://devops.stackexchange.com/questions/731 https://sqa.stackexchange.com/questions/33998 Empir Software Eng (2021) 26: 104 are microservices and many teams to mess with the product.”48 A similar story is told by Mark Balbes and Zachary Becker from WWT Asynchrony Labs: “Microservices pose a new challenge for QA. While maintaining the quality of a given microservice is simplified, the complexity of a monolithic application doesn’t disappear. It is simply transferred from the software to the interconnectedness of the systems running the many microservices. So QA has to consider not just the quality of the individual microservice but of the entire web of microservices and systems that support them.”49 On the other hand, some practitioners also described three different positive influences (20 mentions). Developers would have more motivation to improve a small service (8) and new services would be of good quality (7). Moreover, the organizational and cultural changes would lead to more quality awareness (5) in general: “There are smaller decentral- ized teams of 5-7 people that have their own set of KPIs and success metrics to achieve. This allows them to take ownership of ’their’ product and it gives them better clarity on the progress. It also gives them the freedom to innovate, which boosts their morale. [...] As you can see, using DevOps and microservices architecture together will not only boost the pro- ductivity of the team, but it will also enable them to develop a more innovative and better quality product at a faster pace.”50 Sachin Dhamane from Majesco reported similar positive cultural changes: “Organizations need to cultivate a culture of sharing responsibility, own- ership and complete accountability in microservices teams. [...] They take full ownership for their services, often beyond the scope of their roles or titles, by thinking about the end customer’s needs and how they can contribute to solving those needs.”51 Lastly, REWE’s Michael Kutz also highlighted the positive influence of the “you build it, you run it” DevOps principle. Empowering teams while simultaneously demanding responsibility would lead to an increase in service quality: “Teams are made responsible for their services during office hours, but also after that, even at 3:14 am on a Saturday! Since most people working for us like sleeping, we expect this to be a good motivation to improve the services’ reliability a lot. You simply pay more attention to technical debt, when you pay it in sleep!”52 6 Discussion When interpreting the combined results of the interviews and the GLR, we found a lot of commonalities, but also some differences. In the following paragraphs, we discuss these findings per RQ-related topic. Finally, Table 4 provides a summarized overview to compare the results of both studies. With respect to perceived challenges for the evolvability of microservices, Service Cut- ting was reported as one of the most important issues in both studies. Many practitioners were struggling with finding an adequate service decomposition. As a result, related chal- lenges like Microservices Integration, Inter-Service Dependencies / Ripples, Aggregating Data from Several Services, or Breaking API Changes were common as well, even though they were much more prevalent in the GLR resources. The GLR results were therefore more focused on architectural and technical challenges, while the interviews also had a 48https://agiletestingdays.com/2019/session/team-driven-microservice-quality-assurance 49https://adtmag.com/articles/2018/04/11/agile-and-tranforming-qa.aspx 50https://www.thinksys.com/microservices/microservices-comes-together-brilliantly-with-devops 51https://www.majesco.com/innovating-insurance-microservices-part-3 52https://slides.com/mkutz/team-driven-microservicequality-assurance 104 Page 30 of 39 https://agiletestingdays.com/2019/session/team-driven-microservice-quality-assurance https://adtmag.com/articles/2018/04/11/agile-and-tranforming-qa.aspx https://www.thinksys.com/microservices/microservices-comes-together-brilliantly-with-devops https://www.majesco.com/innovating-insurance-microservices-part-3 https://slides.com/mkutz/team-driven-microservicequality-assurance Empir Software Eng (2021) 26: 104 Table 4 Comparison of Interview Results (N=17) and GLR Results (N=295); the percentages refer to the fraction of total participants/resources that mentioned the label decent amount of organizational and human-related challenges like Mastering Technologies or Technological Heterogeneity. In general, the prevalence of most interview challenges was confirmed by the GLR results. We identified only