https://doi.org/10.1007/s10664-021-10083-5

A fine-grained data set and analysis of tangling in bug fixing commits

Steffen Herbold1 · Alexander Trautsch2 · Benjamin Ledel1 · Alireza Aghamohammadi3 · Taher A. Ghaleb4 · Kuljit Kaur Chahal5 · Tim Bossenmaier6 · Bhaveet Nagaria7 · Philip Makedonski2 · Matin Nili Ahmadabadi8 · Kristof Szabados9 · Helge Spieker10 · Matej Madeja11 · Nathaniel Hoy7 · Valentina Lenarduzzi12 · Shangwen Wang13 · Gema Rodríguez-Pérez14 · Ricardo Colomo-Palacios15 · Roberto Verdecchia16 · Paramvir Singh17 · Yihao Qin13 · Debasish Chakroborti18 · Willard Davis19 · Vijay Walunj20 · Hongjun Wu13 · Diego Marcilio21 · Omar Alam22 · Abdullah Aldaeej23 · Idan Amit24 · Burak Turhan25,26 · Simon Eismann27 · Anna-Katharina Wickert28 · Ivano Malavolta16 · Matúš Sulír11 · Fatemeh Fard14 · Austin Z. Henley29 · Stratos Kourtzanidis30 · Eray Tuzun31 · Christoph Treude32 · Simin Maleki Shamasbi33 · Ivan Pashchenko34 · Marvin Wyrich35 · James Davis36 · Alexander Serebrenik37 · Ella Albrecht2 · Ethem Utku Aktas38 · Daniel Strüber39 · Johannes Erbel2

Accepted: 11 October 2021 / Published online: 2 July 2022 / © The Author(s) 2022

Communicated by: Neil Ernst. This article belongs to the Topical Collection: Registered Reports. Corresponding author: Steffen Herbold, steffen.herbold@tu-clausthal.de. Extended author information available on the last page of the article.

Abstract
Context Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Methods We use a crowd sourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files, this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.
Conclusion Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.

Keywords Tangled changes · Tangled commits · Bug fix · Manual validation · Research turk · Registered report

1 Introduction

Detailed and accurate information about bug fixes is important for many different domains of software engineering research, e.g., program repair (Gazzola et al. 2019), bug localization (Mills et al. 2018), and defect prediction (Hosseini et al. 2019). Such research suffers from mislabeled data, e.g., because commits are mistakenly identified as bug fixes (Herzig et al.
2013) or because not all changes within a commit are bug fixes (Herzig and Zeller 2013). A common approach to obtain information about bug fixes is to mine software repos- itories for bug fixing commits and assume that all changes in the bug fixing commit are part of the bug fix, e.g., with the SZZ algorithm (Śliwerski et al. 2005). Unfortunately, prior research showed that the reality is more complex. The term tangled commit1 was established by Herzig and Zeller (2013) to characterize the problem that commits may address multiple issues, together with the concept of untangling, i.e., the subsequent separation of these con- cerns. Multiple prior studies established through manual validation that tangled commits naturally occur in code bases. For example, Herzig and Zeller (2013), Nguyen et al. (2013), Kirinuki et al. (2014), Kochhar et al. (2014), Kirinuki et al. (2016), and Wang et al. (2019), and Mills et al. (2020). Moreover, Nguyen et al. (2013), Kochhar et al. (2014), and Herzig et al. (2016), and Mills et al. (2020) have independently shown that tangling can have a negative effect on experiments, e.g., due to noise in the training data that reduces model performance as well as due to noise in the test data which significantly affects performance estimates. However, we identified four limitations regarding the knowledge on tangling within the current literature.The first and most common limitation of prior work is that it either only considered a sample of commits, or that the authors failed to determine whether the commits were tangled or not for a large proportion of the data.2 Second, the literature considers this problem from different perspectives: most literature considers tangling at the commit level, whereas others break this down to the file-level within commits. Some prior studies focus on all commits, while others only consider bug fixing commits. Due to these differences, in combination with limited sample sizes, the prior work is unclear regarding the prevalence of tangling. For example, Kochhar et al. (2014) and Mills et al. (2020) found a high prevalence 1Herzig and Zeller (2013) actually used the term tangled change. However, we follow Rodrı́guez-Pérez et al. (2020) and use the term commit to reference observable groups of changes within version control systems (see Section 3.1). 2Failing to label a proportion of the data also results in a sample, but this sample is highly biased towards changes that are simple to untangle. 125 Page 2 of 49 Empir Software Eng (2022) 27: 125 of tangling within file changes, but their estimates for the prevalence of tangling varied substantially with 28% and 50% of file changes affected, respectively. Furthermore, how the lower boundary of 15% tangled commits that Herzig and Zeller (2013) estimated relates to the tangling of file changes is also unclear. Third, there is little work on what kind of changes are tangled. Kirinuki et al. (2014) studied the type of code changes that are often tangled with other changes and found that these are mostly logging, checking of pre-conditions, and refactorings. Nguyen et al. (2013) and Mills et al. (2020) provide estimations on the types of tangled commits on a larger scale, but their results contradict each other. For example, Mills et al. (2020) estimate six times more refactorings in tangled commits than Nguyen et al. (2013). Fourth, these studies are still relatively coarse-grained and report results at the commit and file level. 
Despite the well established research on tangled commits and their impact on research results, their prevalence and content is not yet well understood, neither for commits in gen- eral, nor with respect to bug fixing commits. Due to this lack of understanding, we cannot estimate how severe the threat to the validity of experiments due to the tangling is, and how many tools developed based on tangled commits may be negatively affected. Our lack of knowledge about tangled commits notwithstanding, researchers need data about bug fixing commits. Currently, researchers rely on three different approaches to mitigate this problem: (i) seeded bugs; (ii) no or heuristic untangling; and (iii) manual untan- gling. First, we have data with seeded bugs, e.g., the SIR data (Do et al. 2005), the Siemens data (Hutchins et al. 1994), and through mutation testing (Jia and Harman 2011). While there is no risk of noise in such data, it is questionable whether the data is representative for real bugs. Moreover, applications like bug localization or defect prediction cannot be evaluated based on seeded bugs. The second approach is to assume that commits are either not tangled or that heuristics are able to filter tangling. Examples of such data are ManyBugs (Le Goues et al. 2015), Bugs.jar (Saha et al. 2018), as well as all defect prediction data sets (Herbold et al. 2019). The advantage of such data sets is that they are relatively easy to collect. The drawback is that the impact of noise due to tangled commits is unclear, even though modern vari- ants of the SZZ algorithm can automatically filter some tangled changes like comments, whitespaces (Kim et al. 2006) or even some refactorings (Neto et al. 2018). Third, there are also some data sets that were manually untangled. While the creation of such data is very time consuming, such data sets are the gold standard. They represent real-world defects and do not contain noise due to tangled commits. To the best of our knowledge, there are only very few such data sets, i.e., Defects4J (Just et al. 2014) with Java bugs, BugsJS (Gyimesi et al. 2019) with JavaScript bugs, and the above mentioned data by Mills et al. (2020).3 Due to the effort required to manually validate changes, the gold standard data sets contain only samples of bugs from each studied project, but no data covers all reported bugs within a project, which limits the potential use cases.4 This article fills this gap in our current knowledge about tangling bug fixing commits. We provide a new large-scale data set that contains validated untangled bug fixes for the 3The cleaning performed by Mills et al. (2020) is not full untangling of the bug fixes, as all pure additions were also flagged as tangled. While this is certainly helpful for bug localization, as a file that was added as part of the bug fix cannot be localized based on an issue description, this also means that it is unclear which additions are part of the bug fix and which additions are tangled. 4We note that none of the gold standard data sets actually has the goal to contain data for all bugs of a project. This is likely also not possible due to other requirements on these data sets, e.g., the presence of failing test cases. Thus, the effort is not the only involved factor why these data sets are only samples. Page 3 of 49 125Empir Software Eng (2022) 27: 125 complete development history of 23 Java projects and partial data for five further projects. In comparison to prior work, we label all data on a line-level granularity. 
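To make this line-level granularity concrete, the following minimal sketch shows what a labeled record could look like. The Python class and field names are illustrative assumptions for this article, not the actual LLTC4J schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class LineLabel(Enum):
    """Change types that participants could assign to a changed line (Section 3.6.1)."""
    BUG_FIX = "contributes to the bug fix"
    WHITESPACE = "only changes to whitespaces"
    DOCUMENTATION = "documentation change"
    REFACTORING = "refactoring"
    TEST = "change to tests"
    UNRELATED_IMPROVEMENT = "unrelated improvement"


@dataclass
class LabeledLine:
    """One changed line of a bug fixing commit, labeled by four participants."""
    project: str
    commit_hash: str
    file_path: str
    line_number: int
    labels: Tuple[LineLabel, LineLabel, LineLabel, LineLabel]  # four independent labels
    consensus: Optional[LineLabel] = None  # None if fewer than three participants agree
```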
We have (i) labels for each changed line in a bug fixing commit; (ii) accurate data about which lines contribute to the semantic change of the bug fix; and (iii) the kind of change contained in the other lines, e.g., whether it is a change to tests, a refactoring, or a documentation change. This allows us not only to untangle the bug fix from all other changes, but also gives us valuable insights into the content of bug fixing commits in general and the prevalence of tangling within such commits. Through our work, we also gain a better understanding of the limitations of manual validation for the untangling of commits. Multiple prior studies indicate that there is a high degree of uncertainty when determining whether a change is tangled or not (Herzig and Zeller 2013; Kirinuki et al. 2014, 2016). Therefore, we present each commit to four different participants who labeled the data independently from each other. We use the agreement among the participants to gain insights into the uncertainty involved in the labeling, while also finding lines where no consensus was achieved, i.e., that are hard for researchers to classify. Due to the massive amount of manual effort required for this study, we employed the research turk approach (Herbold 2020) to recruit participants for the labeling. The research turk is a means to motivate a large number of researchers to contribute to a common goal, by clearly specifying the complete study together with an open invitation and clear crite- ria for participation beforehand (Herbold et al. 2020). In comparison to surveys or other studies where participants are recruited, participants in the research turk actively contribute to the research project, in our case by labeling data and suggesting improvements of the manuscript. As a result, 45 of the participants we recruited became co-authors of this arti- cle. Since this is, to the best of our knowledge, a new way to conduct research projects about software engineering, we also study the effectiveness of recruiting and motivating participants. Overall, the contributions of this article are the following. – The Line-Labelled Tangled Commits for Java (LLTC4J) corpus of manually validated bug fixing commits covering 2,328 bugs from 28 projects that were fixed in 3,498 commits that modified 289,904 lines. Each changed line is annotated with the type of change, i.e., whether the change modifies the source code to correct the problem causing the bug, a whitespace or documentation change, a refactoring, a change to a test, a different type of improvement unrelated to the bug fix, or whether the participants could not determine a consensus label. – Empirical insights into the different kinds of changes within bug fixing commits which indicate that, in practice, most changes in bug fixing commits are not about the actual bug fix, but rather related changes to non-production artifacts such as tests or documentation. – We introduce the concept of problematic tangling for different uses of bug data to understand the noise caused by tangled commits for different research applications. – We find that researchers tend to unintentionally mislabel lines in about 7.9% of the cases. Moreover, we found that identifying refactorings and other unrelated changes seems to be challenging, which is shown through 14.3% of lines without agreement in production code files, most of which are due to a disagreement whether a change is part of the bug fix or unrelated. 
– This is the first use of the research turk method and we showed that this is an effective research method for large-scale studies that could not be accomplished otherwise. 125 Page 4 of 49 Empir Software Eng (2022) 27: 125 The remainder of this article is structured as follows. We discuss the related work in Section 2. We proceed with the description of the research protocol in Section 3. We present and discuss the results for our research questions in Section 4 and Section 5. We report the threats to the validity of our work in Section 6. Finally, we conclude in Section 7. 2 RelatedWork We focus the discussion of the related work on the manual untangling of commits. Other aspects, such as automated untangling algorithms (e.g., Kreutzer et al. 2016; Pârtachi et al. 2020), the separation of concerns into multiple commits (e.g., Arima et al. 2018; Yamashita et al. 2020), the tangling of features with each other (Strüder et al. 2020), the identification of bug fixing or inducing commits (e.g., Rodrı́guez-Pérez et al. 2020), or the characterization of commits in general (e.g., Hindle et al. 2008), are out of scope. 2.1 Magnitude of Tangling The tangling of commits is an important issue, especially for researchers working with software repository data, that was first characterized by Kawrykow and Robillard (2011). They analyzed how often non-essential commits occur in commit histories, i.e., commits that do not modify the logic of the source code and found that up to 15.5% of all commits are non-essential. Due to the focus on non-essential commits, the work by Kawrykow and Robillard (2011) only provides a limited picture on tangled commits in general, as tangling also occurs if logical commits for multiple concerns are mixed within a single commit. The term tangled change (commit) was first used in the context of repository mining by Herzig and Zeller (2013) (extended in Herzig et al. (2016)5). The term tangling itself was already coined earlier in the context of the separation of concerns (e.g., Kiczales et al., 1997). Herzig and Zeller (2013) studied the commits of five Java projects over a period of at least 50 months and tried to manually classify which of the commits were tangled, i.e., addressed multiple concerns. Unfortunately, they could not determine if the commits are tangled for 2,498 of the 3,047 bug fixing commits in their study. Of the bug fixing commits they could classify, they found that 298 were tangled and 251 were not tangled. Because they could not label large amounts of the commits, they estimate that at least 15% of the bug fixing commits are tangled. Nguyen et al. (2013) also studied tangling in bug fixing commits, but used the term mixed-purpose fixing commits. They studied 1296 bug fixing commits from eight projects and identified 297 tangled commits, i.e., 22% of commits that are tangled. A strength of the study is that the authors did not only label the tangled commits, but also identified which file changes in the commit were tangled. Moreover, Nguyen et al. (2013) also studied which types of changes are tangled and found that 4.9% were due to unrelated improvements, 1.1% due to refactorings, 1.8% for formatting issues, and 3.5% for documentation changes unrelated to the bug fix.6 We note that it is unclear how many of the commits are affected by multiple types of tangling and, moreover, that the types of tangling were analyzed for only about half of the commits. 
Overall, this is among the most comprehensive studies on 5In the following, we cite the original paper, as the extension did not provide further evidence regarding the tangling. 6These percentages are not reported in the paper. We calculated them from their Table 3. We combined enhancement and annotations into unrelated improvements, to be in line with our work. Page 5 of 49 125Empir Software Eng (2022) 27: 125 this topic in the literature. However, there are two important issues that are not covered by Nguyen et al. (2013). First, they only explored if commits and files are tangled, but not how much of a commit or file is tangled. Second, they do not differentiate between problematic tangling and benign tangling (see Section 4.2.2), and, most importantly, between tangling in production files and tangling in other files. Thus, we cannot estimate how strongly tangling affects different research purposes. Kochhar et al. (2014) investigated the tangling of a random sample of the bug fixing commits for 100 bugs. Since they were interested in the implications of tangling on bug localization, they focused on finding out how many of the changed files actually contributed to the correction of the bug, and how many file changes are tangled, i.e., did not contribute to the bug fix. They found that 358 out of 498 file changes were corrective, the remaining 140 changes, i.e., 28% of all file changes in bug fixing commits were tangled. Kirinuki et al. (2014) and Kirinuki et al. (2016) are a pair of studies with a similar approach: they used an automated heuristic to identify commits that may be tangled and then manually validated if the identified commits are really tangled. Kirinuki et al. (2014) identified 63 out of 2,000 commits as possibly tangled and found through manual validation that 27 of these commits were tangled and 23 were not tangled. For the remaining 13 com- mits, they could not decide if the commits are tangled. Kirinuki et al. (2016) identified 39 out of 1,000 commits as potentially tangled and found through manual validation that 21 of these commits were tangled, while 7 were not tangled. For the remaining 11 commits, they could not decide. A notable aspect of the work by Kirinuki et al. (2014) is that they also ana- lyzed what kind of changes were tangled. They found that tangled changes were mostly due to logging, condition checking, or refactorings. We further note that these studies should not be seen as an indication of low prevalence of tangled commits, just because only 48 out of 3,000 commits were found to be tangled, since only 102 commits were manually validated. These commits are not from a representative random sample, but rather from a sample that was selected such that the commits should be tangled according to a heuristic. Since we have no knowledge about how good the heuristic is at identifying tangled commits, it is unclear how many of the 2,898 commits that were not manually validated are also tangled. Tao and Kim (2015) investigated the impact of tangled commits on code review. They manually investigated 453 commits and found that 78 of these commits were tangled in order to determine ground truth data for the evaluation of their code review tool. From the description within the study, it is unclear if the sample of commits is randomly selected, if these are all commits in the four target projects within the study time frame that changed multiple lines, or if a different sampling strategy was employed. 
Therefore, we cannot con- clude how the prevalence of tangling in about 17% of the commits generalizes beyond the sample. Wang et al. (2019) also manually untangled commits. However, they had the goal to cre- ate a data set for the evaluation of an automated algorithm for the untangling of commits. They achieved this by identifying 50 commits that they were sure were tangled commits, e.g., because they referenced multiple issues in the commit message. They hired eight grad- uate students to perform the untangling in pairs. Thus, they employed an approach for the untangling that is also based on the crowd sourcing of work, but within a much more closely controlled setting and with only two participants per commit instead of four partic- ipants. However, because (Wang et al. 2019) only studied tangled commits, no conclusions regarding the prevalence of the tangling can be drawn based on their results. Mills et al. (2020) extends Mills et al. (2018) and contains the most comprehensive anal- ysis of tangled commits to date. They manually untangled the changes for 803 bugs from 125 Page 6 of 49 Empir Software Eng (2022) 27: 125 fifteen different Java projects. They found that only 1,154 of the 2,311 changes of files within bug fixing commits contributed to bug fixes, i.e., a prevalence of 50% of tangling. Moreover, they provide insights into what kinds of changes are tangled: 31% of tangled changes are due to tests, 9% due to refactorings, and 8% due to documentation. They also report that 47% of the tangled changes are due to adding source code. Unfortunately, Mills et al. (2020) flagged all additions of code as tangled changes. While removing added lines makes sense for the use case of bug localization, this also means that additions that are actu- ally the correction of a bug would be wrongly considered to be tangled changes. Therefore, the actual estimation of prevalence of tangled changes is between 32% and 50%, depending on the number of pure additions that are actually tangled. We note that the findings by Mills et al. (2020) are not in line with prior work, i.e., the ratio of tangled file changes is larger than the estimation by Kochhar et al. (2014) and the percentages for types of changes are different from the estimations by Nguyen et al. (2013). Dias et al. (2015) used a different approach to get insights into the tangling of commits: they integrated an interface for the untangling of commits directly within an IDE and asked the developers to untangle commits when committing their local changes to the version con- trol system. They approached the problem of tangled commits by grouping related changes into clusters. Unfortunately, the focus of the study is on the developer acceptance, as well as on the changes that developers made to the automatically presented clusters. It is unclear if the clusters are fully untangled, and also if all related changes are within the same cluster. Consequently, we cannot reliably estimate the prevalence or contents of tangled commits based on their work. 2.2 Data Sets Finally, there are data sets of bugs where the data was untangled manually, but where the focus was only on getting the untangled bug fixing commits, not on the analysis of the tangling. Just et al. (2014) did this for the Defects4J data and Gyimesi et al. (2019) for the BugsJS data. While these data sets do not contain any noise, they have several limitations that we overcome in our work. First, we do not restrict the bug fixes but allow all bugs from a project. 
Defects4J and BugsJS both only allow bugs that are fixed in a single commit and also require that a test that fails without the bug fix was added as part of the bug fix. While bugs that fulfill these criteria were manually untangled for Defects4J, BugJS has additional limitations. BugJS requires that bug fixes touch at most three files, modify at most 50 lines of code, and do not contain any refactorings or other improvements unrelated to the bug fix. Thus, BugsJS is rather a sample of bugs that are fixed in untangled commits than a sample of bugs that was manually untangled. While these data sets are gold standard untangled data sets, they are not suitable to study tangling. Moreover, since both data sets only contain samples, they are not suitable for research that requires all bugs of a specific release or within a certain time frame of a project. Therefore, our data is suitable for more kinds of research than Defects4J and BugsJS, and because our focus is not solely on the creation of the data set, but also on the understanding of tangling within bug fixing commits, we also provide a major contribution to the knowledge about tangled commits. 2.3 Research Gap Overall, there is a large body of work on tangling, but studies are limited in sampling strategies, inability to label all data, or because the focus was on different aspects. Thus, Page 7 of 49 125Empir Software Eng (2022) 27: 125 we cannot form conclusive estimates, neither regarding the types of tangled changes, nor regarding the general prevalence of tangling within bug fixing commits. 3 Research Protocol Within this section, we discuss the research protocol, i.e., the research strategy we pre- registered (Herbold et al. 2020) to study our research questions and hypotheses. The description and section structure of our research protocol is closely aligned with the pre-registration, but contains small deviations, e.g., regarding the sampling strategy. All deviations are described as part of this section and summarized in Section 3.8. 3.1 Terminology We use the term “the principal investigators” to refer to the authors of the registered report, “the manuscript” to refer to this article that resulted from the study, and “the participants” to refer to researchers and practitioners who collaborated on this project, many of whom are now co-authors of this article. Moreover, we use the term “commit” to refer to observable changes within the version control system, which is in line with the terminology proposed by Rodrı́guez-Pérez et al. (2020). We use the term “change” to refer to the actual modifications within commits, i.e., what happens in each line that is modified, added, or deleted as part of a commit. 3.2 Research Questions and Hypotheses Our research is driven by two research questions for which we derived three hypotheses. The first research question is the following. RQ1: What percentage of changed lines in bug fixing commits contributes to the bug fix and can we identify these changes? We want to get a detailed understanding of both the prevalence of tangling, as well as what kind of changes are tangled within bug fixing commits. When we speak of contributing to the bug fix, we mean a direct contribution to the correction of the logical error. We derived two hypotheses related to this research question. H1 Fewer than 40% of changed lines in bug fixing commits contribute to the bug fix. H2 A label is a consensus label when at least three participants agree on it. 
Participants fail to achieve a consensus label on at least 10.5% of lines.7

We derived hypothesis H1 from the work by Mills et al. (2018), who found that 496 out of 1,344 changes to files in bug fixing commits contributed to bug fixes.8 We derived our expectation from Mills et al. (2018) instead of Herzig et al. (2013) because of the large degree of uncertainty caused by the unlabeled data in the work by Herzig et al. (2013). We derived H2 based on the assumption that there is a 10% chance that participants misclassify a line, even if they have the required knowledge for correct classification. We are not aware of any evidence regarding the probability of random mistakes in similar tasks and, therefore, used our intuition to estimate this number. Assuming a binomial distribution B(k|p, n) with the probability of random mislabels p = 0.1 and n = 4 participants that label each commit, we do not get a consensus if k ≥ 2 participants randomly mislabel the line, i.e.,

\sum_{k=2}^{4} B(k \mid 0.1, 4) = \sum_{k=2}^{4} \binom{4}{k} 0.1^{k} \cdot 0.9^{4-k} = 0.0523    (1)

Thus, if we observe 10.5% of lines without consensus, this would be twice as many as expected under the assumption of 10% random errors, indicating that the lack of consensus is not only due to random errors. We augment the analysis of this hypothesis with a survey among participants, asking them how often they were unsure about the labels.

7 In the pre-registered protocol, we used a threshold of at least 16%. However, this was a mistake in the calculation of the binomial distribution, where we used the wrong value n = 5 instead of n = 4. This was the result of our initial plan to use five participants per commit, which we later revised to four participants without updating the calculation and the hypothesis.
8 The journal extension by Mills et al. (2020) was not published when we formulated our hypotheses as part of the pre-registration of our research protocol in January 2020.

The second research question regards our crowd working approach to organize the manual labor required for our study.

RQ2: Can gamification motivate researchers to contribute to collaborative research projects?

H3 The leaderboard motivates researchers to label more than the minimally required 200 commits.

We derived H3 from prior evidence that gamification (Werbach and Hunter 2012) is an efficient method for the motivation of computer scientists, e.g., as demonstrated on Stack Overflow (Grant and Betts 2013). Participants can view a nightly updated leaderboard, both to check their progress, as well as where they would currently be ranked in the author list. We believe that this has a positive effect on the amount of participation, which we observe through larger numbers of commits labeled than minimally required for co-authorship. We augment the analysis of this hypothesis with a survey among participants, asking them if they were motivated by the leaderboard and the prospect of being listed earlier in the author list.

3.3 Materials

This study covers bug fixing commits that we re-use from prior work (see Section 3.5.1). We use SmartSHARK to process all data (Trautsch et al. 2018). We extended SmartSHARK with the capability to annotate the changes within commits. This allowed us to manually validate which lines in a bug fixing commit are contributing to the semantic changes for fixing the bugs (Trautsch et al. 2020).

3.4 Variables

We now state the variables we use as the foundation for the construct of the analysis we con-
The measurement of these variables is described in Section 3.6.1 and their analysis is described in Section 3.6.2. For bug fixing commits, we measure the following variables as percentages of lines with respect to the total number of changed lines in the commit. – Percentage contributing to bug fixes. – Percentage of whitespace changes. – Percentage of documentation changes. – Percentage of refactorings. Page 9 of 49 125Empir Software Eng (2022) 27: 125 – Percentage of changes to tests. – Percentage of unrelated improvements. – Percentage where no consensus was achieved (see Section 3.6.1). We provide additional results regarding the lines without consensus to understand what the reasons for not achieving consensus are. These results are an extension of our registered protocol. However, we think that details regarding potential limitations of our capabilities as researchers are important for the discussion of research question RQ1. Concretely, we consider the following cases: – In the registration, we planned to consider whitespace and documentation changes within a single variable. Now, we use separate variables for both. Our reason for this extension is to enable a better understanding of how many changes are purely cos- metic without affecting functionality (whitespace) and distinguish this from changes that modify the documentation. We note that we also used the term “comment” instead of “documentation” within the registration. Since all comments (e.g., in code) are a form of documentation, but not all documentation (e.g., sample code) is a comment, we believe that this terminology is clearer. – Lines where all labels are either test, documentation, or whitespace. We use this case, because our labeling tool allows labeling of all lines in a file with the same label with a single click and our tutorial describes how to do this for a test file. This leads to differ- ences between how participants approached the labeling of test files: some participants always use the button to label the whole test file as test, other participants used a more fine-grained approach and also labeled whitespace and documentation changes within test files. – Lines that were not labeled as bug fix by any participant. For these lines, we have consensus that this is not part of the bug fix, i.e., for our key concern. Disagreements may be possible, e.g., if some participants identified a refactoring, while others marked this as unrelated improvement. A second deviation from our pre-registered protocol is that we present the results for the consensus for two different views on the data: – all changes, i.e., as specified in registration; and – only changes in Java source code files that contain production code, i.e., Java files excluding changes to tests or examples. Within the pre-registered protocol, we do not distinguish between all changes and changes to production code. We now realize that both views are important. The view of all changes is required to understand what is actually part of bug fixing commits. The view on Java production files is important, because changes to non-production code can be easily determined automatically as not contributing to the bug fix, e.g., using regular expression matching based on the file ending “.java” and the file path to exclude folders that con- tain tests and example code. The view on Java production files enables us to estimate the amount of noise that cannot be automatically removed. 
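As an illustration of this kind of automatic pre-filtering, the following sketch classifies changed file paths as production code or not, based on the ".java" ending and typical test and documentation folder names. The patterns are simplified, hypothetical examples, not the manually validated regular expressions from Table 1 below.

```python
import re

# Simplified, illustrative patterns; the study itself uses the manually
# validated regular expressions listed in Table 1, which are more comprehensive.
NON_PRODUCTION_DIRS = re.compile(
    r"(^|/)(test|tests|src-test|testdata|doc|docs|example|examples|sample|samples|demo)(/|$)"
)

def is_production_code(path: str) -> bool:
    """Return True if the changed file is likely Java production code."""
    return path.endswith(".java") and not NON_PRODUCTION_DIRS.search(path)

# Usage example with hypothetical paths
print(is_production_code("src/main/java/org/apache/ivy/Ivy.java"))      # True
print(is_production_code("src/test/java/org/apache/ivy/IvyTest.java"))  # False
print(is_production_code("src/site/xdoc/index.xml"))                    # False (not a .java file)
```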
Within this study, we used the reg- ular expressions shown in Table 1 to identify non-production code. We note that we have some projects which also provide web applications (e.g., JSP Wiki), that also have produc- tion code in other languages than Java, e.g., JavaScript. Thus, we slightly underestimate the amount of production code, because these lines are only counted in the overall view, and not in the production code view. 125 Page 10 of 49 Empir Software Eng (2022) 27: 125 Table 1 Regular expressions for excluding non-production code files File type Regular expression Test (ˆ |)(test|tests|test long running|testing|legacy-tests! |testdata|test-framework|derbyTesting|unitTests| javastubs!test-lib|srcit|src-lib-test|src-test|tests -src|test-cactus!|test-data|test-deprecated|src unitTests| test-tools|!gateway-test-release-utils|gateway-test-ldap |nifi-mock)! Documentation (ˆ |) (doc|docs|example|examples|sample|samples|demo |tutorial!helloworld|userguide|showcase|SafeDemo)! Other (ˆ |)( site|auxiliary-builds|gen-java|external! |nifi-external)! These regular expressions are valid for a superset of our data and were manually validated as part of prior work (Trautsch et al. 2020) Additionally, we measure variables related to the crowd working. – Number of commits labeled per participant. – Percentage of correctly labeled lines per participant, i.e., lines where the label of the participant agrees with the consensus. We collected this data using nightly snapshots of the number of commits that each participant labeled, i.e., a time series per participant. We also conducted an anonymous survey among participants who labeled at least 200 commits with a single question to gain insights into the difficulty of labeling tangled changes in commits for the evaluation of RQ1. – Q1: Please estimate the percentage of lines in which you were unsure about the label you assigned. – A1: One of the following categories: 0%–10%, 11%–20%, 21%–30%, 31%–40%, 41%–50%, 51–60%, 61%–70%, 71%–80%, 81%–90%, 91%–100%. Finally, we asked all participants who labeled at least 250 commits a second question to gain insights into the effect of the gamification: – Q2: Would you have labeled more than 200 commits, if the authors would have been ordered randomly instead of by the number of commits labeled? – A2: One of the following categories: Yes, No, Unsure. 3.5 Subjects This study has two kinds of subjects: bugs for which the lines contributing to the fix are manually validated and the participants in the labeling who were recruited using the research turk approach. 3.5.1 Bugs We use manually validated bug fixes. For this, we harness previously manually validated issue types similar to Herzig et al. (2013) and validated trace links between commits and Page 11 of 49 125Empir Software Eng (2022) 27: 125 issues for 39 Apache projects (Herbold et al. 2019). The focus of this data set is the Java programming language. The usage of manually validated data allows us to work from a foundation of ground truth and avoids noise in our results caused by the inclusion of com- mits that are not fixing bugs. Overall, there are 10,878 validated bug fixing commits for 6,533 fixed bugs in the data set. Prior studies that manually validated commits limited the scope to bugs that were fixed in a single commit, and commits in which the bug was the only referenced issue (e.g., Just et al. 2014; Gyimesi et al. 2019; Mills et al. 2020). 
We broaden this scope in our study and also allow issues that were addressed by multiple commits, as long as the commits only fixed a single bug. This is the case for 2,283 issues which were addressed in 6,393 com- mits. Overall, we include 6,279 bugs fixed in 10,389 commits in this study. The remaining 254 bugs are excluded, because the validation would have to cover the additional aspect of differentiating between multiple bug fixes. Otherwise, it would be unclear to which bug(s) the change in a line would contribute, making the labels ambiguous. Herbold et al. (2019) used a purposive sampling strategy for the determination of projects. They selected only projects from the Apache Software Foundation, which is known for the high quality of the developed software, the many contributors both from the open source community and from the industry, as well as the high standards of their development processes, especially with respect to issue tracking (Bissyandı́ et al. 2013). Moreover, the 39 projects cover different kinds of applications, including build systems (ant-ivy), web appli- cations (e.g., jspwiki), database frameworks (e.g., calcite), big data processing tools (e.g., kylin), and general purpose libraries (commons). Thus, our sample of bug fixes should be representative for a large proportion of Java software. Additionally, Herbold et al. (2019) defined criteria on project size and activity to exclude very small or inactive projects. Thus, while the sample is not randomized, this purposive sampling should ensure that our sample is representative for mature Java software with well-defined development processes in line with the discussion of representativeness by Baltes and Ralph (2020). 3.5.2 Participants In order to only allow participants that have a sufficient amount of programming experi- ence, each participant must fulfill one of the following criteria: 1) an undergraduate degree majoring in computer science or a closely related subject; or 2) at least one year of program- ming experience in Java, demonstrated either by industrial programming experience using Java or through contributions to Java open source projects. Participants were regularly recruited, e.g., by advertising during virtual conferences, within social media, or by asking participants to invite colleagues. Interested researchers and practitioners signed up for this study via an email to the first author, who then checked if the participants are eligible. Upon registration, participants received guidance on how to proceed with the labeling of commits (see Appendix A). Participants became co-authors of this manuscript if 1. they manually labeled at least 200 commits; 2. their labels agree with the consensus (Section 3.6.1) for at least 70% of the labeled lines; 3. they contributed to the manuscript by helping to review and improve the draft, including the understanding that they take full responsibility for all reported results and that they can be held accountable with respect to the correctness and integrity of the work; and 4. they were not involved in the review or decision of acceptance of the registered report. 125 Page 12 of 49 Empir Software Eng (2022) 27: 125 The first criterion guarantees that each co-author provided a significant contribution to the analysis of the bug fixing commits. The second criterion ensures that participants carefully labeled the data, while still allowing for disagreements. Only participants who fulfill the first two criteria received the manuscript for review. 
The third criterion ensures that all co-authors agree with the reported results and the related responsibility and ethical accountability. The fourth criterion avoids conflicts of interest.9 3.6 Execution Plan The execution of this research project was divided into two phases: the data collection phase and the analysis phase. 3.6.1 Data Collection Phase The primary goal of this study is to gain insights into which changed lines contribute to bug fixing commits and which additional activities are tangled with the correction of the bug. Participants were shown the textual differences of the source code for each bug with all related bug fixing commits. The participants then assigned one of the following labels to all changed lines: – contributes to the bug fix; – only changes to whitespaces; – documentation change; – refactoring; – change to tests; and – unrelated improvement not required for the bug fix. Figure 1 shows a screenshot of the web application that we used for labeling. The web application ensured that all lines were labeled, i.e., participants could not submit incom- plete labels for a commit. The web application further supported the participants with the following convenience functions: – Buttons to mark all changes in a file with the same label. This could, e.g., be used to mark all lines in a test file with a single click as test change. – Heuristic pre-labeling of lines as documentation changes using regular expressions. – Heuristic pre-labeling of lines with only whitespace changes. – Heuristic pre-labeling of lines as refactoring by automatically marking changed lines as refactorings, in case they were detected as refactorings by the RefactoringMiner 1.0 (Tsantalis et al. 2018). All participants were instructed not to trust the pre-labels and check if these labels are correct. Moreover, we did not require differentiation between whitespaces and documen- tation/test changes in files that were not production code files, e.g., within test code or documentation files such as the project change log. Each commit was shown to four participants. Consensus is achieved if at least three participants agree on the same label. If this is not the case, no consensus for the line is achieved, i.e., the participants could not clearly identify which type of change a line is. 9Because the review of the registered report was blinded, the fulfillment of this criterion is checked by the editors of the Empirical Software Engineering journal. Page 13 of 49 125Empir Software Eng (2022) 27: 125 Fig. 1 Screenshot of VisualSHARK (Trautsch et al. 2020) that was used for the labeling of the data The data collection phase started on May 16th, 2020 and ended on October 14th, 2020, registration already finished on September 30th, 2020.10 The participants started by watch- ing a tutorial video11 and then labeling the same five commits that are shown in the tutorial themselves to get to know the system and to avoid mislabels due to usability problems. Participants could always check their progress, as well as the general progress for all projects and the number of commits labeled by other participants in the leaderboard. How- ever, due to the computational effort, the leaderboard did not provide a live view of the data, but was only updated once every day. All names in the leaderboard were anonymized, except the name of the currently logged in participant and the names of the principal inves- tigators. 
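The consensus rule, and the random-mislabel reasoning behind H2 (Section 3.2), can be summarized in a short sketch. This is a simplified illustration and not the tooling used in the study.

```python
from collections import Counter
from math import comb

def consensus(labels):
    """Return the consensus label if at least three of the four participants agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else None

print(consensus(["bugfix", "bugfix", "bugfix", "refactoring"]))  # 'bugfix'
print(consensus(["bugfix", "bugfix", "test", "test"]))           # None (2-2 split, no consensus)

# Probability of not reaching consensus if each of the four participants
# independently mislabels a line with probability p = 0.1 (the assumption
# behind H2), i.e., at least two random mislabels, cf. Equation (1):
p = 0.1
no_consensus = sum(comb(4, k) * p**k * (1 - p)**(4 - k) for k in range(2, 5))
print(round(no_consensus, 4))  # 0.0523
```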
The leaderboard is also part of a gamification element of our study, as participants can see how they rank in relation to others and may try to improve their ranks by labeling more commits. The participants are allowed to decide for which project they want to perform the label- ing. The bugs are then randomly selected from all bugs of that project, for which we do not yet have labels by four participants. We choose this approach over randomly sampling from all projects to allow participants to gain experience in a project, which may improve both the labeling speed and the quality of the results. Participants must finish labeling each bug they are shown before the next bug can be drawn. We decided for this for two reasons. First, skipping bugs could reduce the validity of the results for our second research ques- tion, i.e., how good we actually are at labeling bug fixes at this level of granularity, because the sample could be skewed towards simpler bug fixes. Second, this could lead to cherry picking, i.e., participants could skip bugs until they find particularly easy bugs. This would be unfair for the other participants.The drawback of this approach is that participants are 10This timeframe is a deviation from the registration protocol that was necessary due to the Covid-19 pandemic. 11https://www.youtube.com/watch?v=VWvDlq4lQC0 125 Page 14 of 49 Empir Software Eng (2022) 27: 125 https://www.youtube.com/watch?v=VWvDlq4lQC0 forced to label bugs, even in case they are unsure. However, we believe that this drawback is countered by our consensus labeling that requires agreement of three participants: even if all participants are unsure about a commit, if three come to the same result, it is unlikely that they are wrong. 3.6.2 Analysis Phase The analysis took place after the data collection phase was finished on October 14th, 2020. In this phase, the principal investigators conducted the analysis as described in Section 3.7. The principal investigators also informed all participants of their consen- sus ratio. All participants who met the first two criteria for co-authorship received a draft of the manuscript for review. Due to the number of participants, this was con- ducted in multiple batches. In between, the principal investigators improved the manuscript based on the comments of the participants. All participants who reviewed the draft were added to the list of authors. Those who failed to provide a review were added to the acknowledgements, unless they specifically requested not to be added. Finally, all co- authors received a copy of the final draft one week in advance of the submission for their consideration. 3.6.3 Data Correction Phase Due to a bug in the data collection software that we only found after the data collection was finished, we required an additional phase for the correction of data. The bug led to possibly corrupt data in case line numbers in the added and deleted code overlapped, e.g., single line modifications. We computed all lines that were possibly affected by this bug and found that 28,827 lines possibly contained corrupt labels, i.e., about 10% of our data. We asked all co-authors to correct their data through manual inspection by relabeling all lines that could have been corrupted. 
For this, we used the same tooling as for the initial label- ing, with the difference that all labels that were not affected by the bugs were already set, only the lines affected by the bug needed to be relabeled.12 The data correction phase took place from November 25th, 2020 until January 17th, 2021. We deleted all data that was not corrected by January 18th which resulted in 679 less issues for which the labeling has fin- ished. We invited all co-authors to re-label these issues between January 18th and January 21th. Through this, the data for 636 was finished again, but we still lost the data for 43 issues as the result of this bug. The changes to the results were very minor and the analysis of the results was not affected. However, we note that the analysis of consensus ratios and participation (Section 5.1) was done prior to the correction phase. The bug affected only a small fraction of lines that should only have a negligible impact on the individual consen- sus ratios, as data by all participants was affected equally. We validated this intuition by comparing the consensus ratios of participants who corrected their data before and after the correction and found that there were no relevant changes. Since some participants could not participate in the data correction and we had to remove some labelled commits, the number of labeled commits per author could now be below 200. This altered the results due to out- liers, caused by single very large commits. This represents an unavoidable trade-off arising from the need to fix the potentially corrupt data. 12Details can be found in the tutorial video: https://www.youtube.com/watch?v=Kf6wVoo32Mc Page 15 of 49 125Empir Software Eng (2022) 27: 125 https://www.youtube.com/watch?v=Kf6wVoo32Mc 3.7 Analysis Plan The analysis of data consists of four aspects: the contribution to bug fixes, the capability to label bug fixing commits, the effectiveness of gamification, and the confidence level of our statistical analysis. 3.7.1 Contributions to Bug Fixes We used the Shapiro-Wilk test (Shapiro and Wilk 1965) to determine if the nine variables related to defects are normally distributed. Since the data is not normal, we report the median, median absolute deviation (MAD), and an estimation for the confidence interval of the median based on the approach proposed by Campbell and Gardner (1988). We reject H1 if the upper bound of the confidence interval of the median lines contributing to the bug fix in all code is greater than 40%. Within the discussion, we provide further insights about the expectations on lines within bug fixing commits based on all results, especially also within changes to production code files. 3.7.2 Capability to Label Bug Fixing Commits We use the confidence interval for the number of lines without consensus for this evaluation. We reject H2 if the lower bound of the confidence interval of the median number of lines without consensus is less than 10.5%. Additionally, we report Fleiss’ κ (Fleiss 1971) to estimate the reliability of the consensus, which is defined as κ = P̄ − P̄e 1 − P̄e (2) where P̄ is the mean agreement of the participants per line and P̄e is the sum of the squared proportions of the label assignments. We use the table from Landis and Koch (1977) for the interpretation of κ (see Table 2). 
Table 2 Interpretation of Fleiss' κ according to Landis and Koch (1977)

κ              Interpretation
< 0            Poor agreement
0.01 – 0.20    Slight agreement
0.21 – 0.40    Fair agreement
0.41 – 0.60    Moderate agreement
0.61 – 0.80    Substantial agreement
0.81 – 1.00    Almost perfect agreement

Additionally, we estimate the probability of random mistakes to better understand if the lines without consensus can be explained by random mislabels, i.e., mislabels that are the consequence of unintentional mistakes by participants. If we assume that all lines with consensus are labeled correctly, we can use the minority votes in those lines to estimate the probability of random mislabels. Specifically, we have a minority vote if three participants agreed on one label and one participant selected a different label. We assume that random mislabels follow a binomial distribution (see Section 3.2) to estimate the probability that a single participant randomly mislabels a bug fixing line. Following Brown et al. (2001), we use the approach from Agresti and Coull (1998) to estimate the probability of a mislabel p, because we have a large sample size. Therefore, we estimate

p = \frac{n_1 + \frac{1}{2} z_\alpha^2}{n + z_\alpha^2}    (3)

as the probability of a mislabel of a participant, with n1 the number of minority votes in lines with consensus, n the total number of individual labels in lines with consensus, and zα the 1 − \frac{1}{2}\alpha quantile of the standard normal distribution, with α the significance level. We get the confidence interval for p as

p \pm z_\alpha \sqrt{\frac{p \cdot (1 - p)}{n + z_\alpha^2}}.    (4)

We estimate the overall probabilities of errors, as well as the probabilities of errors in production files for the different label types, to get insights into the distribution of errors. Moreover, we can use the estimated probabilities of random errors to determine how many lines without consensus are expected. If ntotal is the number of lines, we can expect that there are

n_{none} = n_{total} \cdot B(k = 2 \mid p, n = 4) = n_{total} \cdot \binom{4}{2} p^2 \cdot (1 - p)^2 = 6 \cdot n_{total} \cdot p^2 \cdot (1 - p)^2    (5)

lines without consensus under the assumption that they are due to random mistakes. If we observe more lines without consensus, this is a strong indicator that this is not a random effect, but due to actual disagreements between participants. We note that the calculation of the probability of mislabels and the number of expected non-consensus lines was not described in the pre-registered protocol. However, since the approach to model random mistakes as a binomial distribution was already used to derive H2 as part of the registration, we believe that this is rather the reporting of an additional detail based on an already established concept from the registration and not a substantial deviation from our protocol.

In addition to the data about the line labels, we use the results of the survey among participants regarding their perceived certainty rates to give further insights into the limitations of the participants to conduct the task of manually labeling lines within commits. We report the histogram of the answers given by the participants and discuss how the perceived difficulty relates to the actual consensus that was achieved.

3.7.3 Effectiveness of Gamification

We evaluate the effectiveness of the gamification element of the crowd working by considering the number of commits per participant. Concretely, we use a histogram of the total number of commits per participant.
The histogram tells us whether there are participants who have more than 200 commits, including how many commits were actually labeled. Moreover, we create line plots for the number of commits over days, with one line per par- ticipant that has more than 200 commits. This line plot is a visualization of the evolution of the prospective ranking of the author list. If the gamification is efficient, we should observe two behavioral patterns: 1) that participants stopped labeling right after they gained a cer- tain rank; and 2) that participants restarted labeling after a break to increase their rank. The first observation would be in line with the results from Anderson et al. (2013) regarding user Page 17 of 49 125Empir Software Eng (2022) 27: 125 behavior after achieving badges on Stack Overflow. We cannot cite direct prior evidence for our second conjecture, other than that we believe that participants who were interested in gaining a certain rank, would also check if they still occupy the rank and then act on this. We combine the indications from the line plot with the answer to the survey question Q2. If we can triangulate from the line plot and the survey that the gamification was effective for at least 10% of the overall participants, we believe that having this gamification element is worthwhile and we fail to reject H3. If less than 10% were motivated by the gamification element, this means that future studies could not necessarily expect a positive benefit, due to the small percentage of participants that were motivated. We additionally quantify the effect of the gamification by estimating the additional effort that participants invested measured in the number of commits labeled greater than 200 for the subset of participants where the gamification seems to have made a difference. 3.7.4 Confidence Level We compute all confidence intervals such that we have a family-wise confidence level of 95%. We need to adjust the confidence level for the calculation of the confidence inter- vals, due to the number of intervals we determine. Specifically, we determine 2 · 9 = 18 confidence intervals for the ratios of line labels within commits (see Section 3.4) and 27 confidence intervals for our estimation of the probability of random mistakes. Due to the large amount of data we have available, we decided for an extremely conservative approach for the adjustment of the confidence level (see Section 3.7.2). We use Bonferroni correc- tion (Dunnett 1955) for all 18 + 27 = 45 confidence intervals at once, even though we could possibly consider these as separate families. Consequently, we use a confidence level of 1 − 0.05 45 = 0.998̄ for all confidence interval calculations.13 3.8 Summary of Deviations from Pre-Registration We deviated from the pre-registered research protocol in several points, mostly through the expansion on details. – The time frame of the labeling shifted to May 16th–October 14th. Additionally, we had to correct a part of the data due to a bug between November 25th and January 21st. – We updated H2 with a threshold of 10.5% of lines, due to a wrong calculation in the registration (see footnote 6). – We consider the subset of mislabels on changes to production code files, as well as mislabels with respect to all changes. – We have additional details because we distinguish between whitespace and documen- tation lines. – We have additional analysis for lines without consensus to differentiate between different reasons for no consensus. 
– We extend the analysis with an estimation of the probability of random mislabels, instead of only checking the percentage of lines without consensus.

13 The pre-registration only contained a correction for six confidence intervals. This number increased because we provide a more detailed view on labels without consensus and differentiate between all changes and changes to production code files, and because the calculation of probabilities for mistakes was not mentioned in the registration.

4 Experiments for RQ1

We now present our results and discuss their implications for RQ1 on the tangling of changes within commits.

4.1 Results for RQ1

In this section, we first present the data demographics of the study, e.g., the number of participants, and the amount of data that was labeled. We then present the results of the labeling.

Table 3  Statistics about amounts of labeled data per project, i.e., data for which we have labels by four participants and can compute consensus

  Project                 Timeframe                  #Bugs         #Commits
  Ant-ivy                 2005-06-16 – 2018-02-13    404 / 404     547 / 547
  archiva                 2005-11-23 – 2018-07-25    3 / 278       4 / 509
  commons-bcel            2001-10-29 – 2019-03-12    33 / 33       52 / 52
  commons-beanutils       2001-03-27 – 2018-11-15    47 / 47       60 / 60
  commons-codec           2003-04-25 – 2018-11-15    27 / 27       58 / 58
  commons-collections     2001-04-14 – 2018-11-15    48 / 48       93 / 93
  commons-compress        2003-11-23 – 2018-11-15    119 / 119     205 / 205
  commons-configuration   2003-12-23 – 2018-11-15    140 / 140     253 / 253
  commons-dbcp            2001-04-14 – 2019-03-12    57 / 57       89 / 89
  commons-digester        2001-05-03 – 2018-11-16    17 / 17       26 / 26
  commons-io              2002-01-25 – 2018-11-16    71 / 72       115 / 125
  commons-jcs             2002-04-07 – 2018-11-16    58 / 58       73 / 73
  commons-lang            2002-07-19 – 2018-10-10    147 / 147     225 / 225
  commons-math            2003-05-12 – 2018-02-15    234 / 234     391 / 391
  commons-net             2002-04-03 – 2018-11-14    127 / 127     176 / 176
  commons-scxml           2005-08-17 – 2018-11-16    46 / 46       67 / 67
  commons-validator       2002-01-06 – 2018-11-19    57 / 57       75 / 75
  commons-vfs             2002-07-16 – 2018-11-19    94 / 94       118 / 118
  deltaspike              2011-12-22 – 2018-08-02    6 / 146       8 / 219
  eagle                   2015-10-16 – 2019-01-29    2 / 111       2 / 121
  giraph                  2010-10-29 – 2018-11-21    140 / 140     146 / 146
  gora                    2010-10-08 – 2019-04-10    56 / 56       98 / 98
  jspwiki                 2001-07-06 – 2019-01-11    1 / 144       1 / 205
  opennlp                 2008-09-28 – 2018-06-18    106 / 106     151 / 151
  parquet-mr              2012-08-31 – 2018-07-12    83 / 83       119 / 119
  santuario-java          2001-09-28 – 2019-04-11    49 / 49       95 / 95
  systemml                2012-01-11 – 2018-08-20    6 / 279       6 / 314
  wss4j                   2004-02-13 – 2018-07-13    150 / 150     245 / 245
  Total                                              2328 / 3269   3498 / 4855

The columns #Bugs and #Commits list the completed data and the total data available in the time frame. Five projects are incomplete, because we did not have enough participants to label all data.

Table 4  Statistics of assigned line labels over all data

  Label                    All Changes       Production Code   Other Code
  Bug fix                  72774 (25.1%)     71343 (49.2%)     361 (0.3%)
  Test                     114765 (39.6%)    8 (0.0%)          102126 (91.3%)
  Documentation            40456 (14.0%)     31472 (21.7%)     749 (0.7%)
  Refactoring              5297 (1.8%)       5294 (3.7%)       3 (0.0%)
  Unrelated Improvement    1361 (0.5%)       824 (0.6%)        11 (0.0%)
  Whitespace               11909 (4.1%)      10771 (7.4%)      781 (0.7%)
  -----------------------------------------------------------------------
  Test/Doc/Whitespace      4454 (1.5%)       0 (0.0%)          4454 (4.0%)
  No Bug fix               13052 (4.5%)      4429 (3.1%)       1754 (1.6%)
  No Consensus             25836 (8.9%)      20722 (14.3%)     1587 (1.4%)
  Total                    289904            144863            111826

Production code refers to all Java files that we did not determine to be part of the test suite or the examples.
Other code refers to all other Java files. The labels above the line are those for which at least three participants selected the same label. The labels below the line do not have consensus, but are the different categories for lines without consensus that we established in Section 3.4. The eight lines labeled as test in production code are due to two test files within Apache Commons Math that were misplaced in the production code folder.

All labeled data of the LLTC4J corpus and the analysis scripts we used to calculate our results can be found online in our replication package.14

14 https://github.com/sherbold/replication-kit-2020-line-validation We will move the replication kit to a long-term archive on Zenodo in case of acceptance of this manuscript.

4.1.1 Data Demographics

Of 79 participants registered for this study, 15 participants dropped out without performing any labeling. The remaining 64 participants labeled data. The participants individually labeled 17,656 commits. This resulted in 1,389 commits labeled by one participant, 683 commits labeled by two participants, 303 commits labeled by three participants, and 3,498 commits labeled by four participants; additionally, five commits that were part of the tutorial were labeled by all participants. Table 3 summarizes the completed data for each project. Thus, we have validated all bugs for 23 projects and have incomplete data about bugs for five projects. We have a value of Fleiss' κ = 0.67, which indicates substantial agreement among the participants.

4.1.2 Content of Bug Fixing Commits

Table 4 summarizes the overall results of the commit labeling. Overall, 289,904 lines were changed as part of the bug fixing commits. Only 25.1% of the changes were part of bug fixes. The majority of changed lines were modifications of tests with 39.6%. Documentation accounts for another 14.0% of the changes. For 8.9% of the lines there was no consensus among the participants and at least one participant labeled the line as bug fix. For an additional 1.5% of lines, the participants marked the line as either a documentation, whitespace, or test change, but did not achieve consensus. We believe this is the result of different labeling strategies for test files (see Section 3.4). For 4.5% of the lines no participant selected bug fix, but at least one participant selected refactoring or unrelated improvement. When we investigated these lines, we found that the majority of cases were due to different labeling strategies: some participants labeled updates to test data as unrelated improvements, others labeled them as tests. How this affected production code files is discussed separately in Section 4.1.3.

256,689 of the changed lines were Java code, with 144,863 lines in production code files and 111,826 lines in other code files. The other code is almost exclusively test code. Within the production code files, 49.2% of the changed lines contributed to bug fixes and 21.7% were documentation. Refactorings and unrelated improvements only represent 4.3% of the lines. In 14.3% of the lines, the participants did not achieve consensus, with at least one participant labeling the line as bug fix.
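To make the categorization described above concrete, the following minimal sketch maps the four labels of a line to a consensus label or to one of the categories for lines without consensus. The label names and the helper function are ours for illustration only; they are not code from our labeling tool or replication package.

```python
# Sketch of the line categorization described in Section 3.4, assuming exactly
# four labels per line; label names are simplified for illustration.
from collections import Counter

BUG = "bugfix"
IMPROVEMENTS = {"refactoring", "unrelated_improvement"}
TEST_DOC_WS = {"test", "documentation", "whitespace"}

def categorize(labels):
    """Map the four participant labels of a line to a consensus label or to one
    of the categories for lines without consensus."""
    assert len(labels) == 4
    label, votes = Counter(labels).most_common(1)[0]
    if votes >= 3:
        return label                  # consensus: at least three identical labels
    if BUG in labels:
        return "no_consensus"         # disagreement involving at least one bug fix vote
    if any(l in IMPROVEMENTS for l in labels):
        return "no_bugfix"            # no bug fix vote, but refactoring/unrelated improvement
    return "test_doc_whitespace"      # only test, documentation, or whitespace votes

print(categorize(["bugfix", "bugfix", "bugfix", "refactoring"]))            # -> bugfix
print(categorize(["bugfix", "refactoring", "refactoring", "whitespace"]))   # -> no_consensus
```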
Figure 2 summarizes the results of labeling per commit, i.e., the percentages of each label in bug fixing commits. We found that a median of 25.0% of all changed lines contribute to a bug fix. When we restrict this to production code files, this ratio increases to 75.0%. We note that while the median for all changes is roughly similar to the overall percentage of lines that are bug fixing, this is not the case for production code files. The median of 75.0% per commit is much higher than the 49.2% of all production lines that are bug fixing. The histograms provide evidence regarding the reason for this effect. With respect to all changed lines, Fig. 2b shows that there are many commits with a relatively small percentage of bug fixing lines close to zero, i.e., we observe a peak on the left side of the histogram. When we focus the analysis on the production code files, Fig. 2c shows that there are instead many commits with a large percentage of bug fixing lines (close to 100%), i.e., we observe a peak on the right side of the histogram, but still a long tail of lower percentages. This tail of lower percentages influences the overall ratio of lines more strongly than the median per commit, because the ratio is not robust against outliers.

The most common changes tangled with bug fixes are test changes and documentation changes, with a median of 13.0% and 10.2% of lines per commit, respectively. When we restrict the analysis to production code files, all medians other than bug fix drop to zero. For test changes, this is expected because they are, by definition, not in production code files. For other changes, this is due to the extremeness of the data, which makes the statistical analysis of most label percentages within commits difficult. What we observe is that while many commits are not pure bug fixes, the type of changes differs between commits, which leads to commits having percentages of exactly zero for all labels other than bug fix. This leads to extremely skewed data, as the median, MAD, and CI often become exactly zero such that the non-zero values are – from a statistical point of view – outliers. The last column of the table in Fig. 2a shows for how many commits the values are greater than zero. We can also use the boxplots in Fig. 2b and c to gain insights into the ratios of lines, which we analyze through the outliers. The boxplots reveal that documentation changes are common in production code files; the upper quartile is at around 30% of changed lines in a commit. Unrelated improvements are tangled with 2.9% of the commits and usually make up less than about 50% of the changes; refactorings are tangled with 7.8% of the commits and usually make up less than about 60% of the changes. Whitespace changes are more common, i.e., 46.9% of the commits are tangled with some formatting. However, the ratio is usually below 40% and there are no commits that contain only whitespace changes, i.e., pure reformatting of code. The distribution of lines without consensus shows that while we have full consensus for 71.3% of the commits, the consensus ratios for the remaining commits are distributed over the complete range up to no consensus at all.

Fig. 2  Data about changes within commits

As a side note, we also found that the pre-labeling of lines with the RefactoringMiner was not always correct. Sometimes logical changes were marked as refactoring, e.g., because side effects were ignored when variables were extracted or code was reordered. Overall, 21.6% of the 23,682 lines marked by RefactoringMiner have a consensus label of bug fix.
However, the focus of our study is not the evaluation of RefactoringMiner, and we further note that we used RefactoringMiner 1.0 (Tsantalis et al. 2018) and not the recently released version 2.0 (Tsantalis et al. 2020), which may have resolved some of these issues.

Fig. 3  Number of expected lines without consensus versus the actual lines without consensus. Overall (cleaned) counts lines with the labels Test/Doc/Whitespace and No Bug fix as consensus

4.1.3 Analysis of Disagreements

Based on the minority votes in lines with consensus, we estimate the probability of random mistakes in all changes as 7.5% ± 0.0; when we restrict this to production code files, we estimate the probability as 9.0% ± 0.0. We note that the confidence intervals are extremely small, due to the very large number of lines in our data. Figure 3 shows the expected number of lines without consensus given these probabilities versus the observed lines without consensus. For all changes, we additionally report a cleaned version of the observed data. With the cleaned version, we take the test/doc/whitespace and no bug fix labels into account, i.e., lines where there is consensus that the line is not part of the bug fix. The results indicate that there are more lines without consensus than could be expected under the assumption that all mislabels in our data are the result of random mislabels.

Table 5  A more detailed resolution of the number of expected mislabels per label type

                    Bug fix   Doc.   Refactoring   Unrelated   Whitespace   Test
  Bug fix           -         1      508           705         23           2
  Documentation     189       -      11            47          17           2
  Refactoring       446       2      -             35          3            0
  Unrelated         110       0      2             -           1            0
  Whitespace        103       1      49            34          -            1
  Total Expected:   847       3      570           821         42           5
  Total Observed:   12837     3987   6700          8251        3951         252

The rows represent the correct labels; the columns give the expected number of lines with two mislabels of that type. The sum of the observed values is not equal to the sum of the observed lines without consensus, because it is possible that a line without consensus has two labels of two types each and we cannot know which one is the mislabel. Hence, we must count these lines twice here. The size of the confidence intervals is less than one line in each case, which is why we report only the upper bound of the confidence intervals instead of the confidence intervals themselves. There is no row for test, because there are no correct test labels in production code files.

Fig. 4  Answers of the participants regarding their labeling confidence. Participants were not informed about the distribution of consensus ratios of participants or their individual consensus ratio before answering the survey to avoid bias. 35 out of 48 participants with at least 200 commits answered this question

Table 5 provides additional details regarding the expectation of the mislabels per type in production code files. The data in this table is modeled using a more fine-grained view on random mistakes. This view models the probabilities for all labels separately, both with respect to the expected mislabels as well as the mislabel that occurs. The estimated probabilities are reported in Table 7 in the Appendix. Using these probabilities, we calculated the expected number of two random mistakes for each label type, based on the distribution of consensus lines.
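The sketch below shows one way such an expectation can be computed. It is only an illustration of the kind of calculation behind Table 5: the per-label probabilities and line counts are hypothetical placeholders, and the exact model we used follows the estimates reported in Table 7 in the Appendix and the scripts in the replication package.

```python
# Illustrative sketch of a per-label expectation of lines with two identical mislabels
# (2-2 ties without consensus). All numbers below are hypothetical placeholders.
from math import comb

# Hypothetical: number of consensus lines per correct label in production code files.
lines_per_label = {"bugfix": 70000, "documentation": 31000, "refactoring": 5000,
                   "unrelated": 800, "whitespace": 10000}

# Hypothetical: probability that a single participant mislabels a line of the
# first label as the second label.
p_mislabel = {("bugfix", "refactoring"): 0.002, ("bugfix", "unrelated"): 0.002,
              ("refactoring", "bugfix"): 0.02, ("documentation", "bugfix"): 0.001}

def expected_two_mislabels(correct, wrong):
    """Expected number of lines with correct label `correct` where exactly two of
    four participants independently pick `wrong`, i.e., a 2-2 tie without consensus."""
    n = lines_per_label[correct]
    p = p_mislabel.get((correct, wrong), 0.0)
    # total mislabel probability for lines of this correct label
    p_any = sum(q for (c, _), q in p_mislabel.items() if c == correct)
    return n * comb(4, 2) * p**2 * (1 - p_any)**2

print(round(expected_two_mislabels("bugfix", "refactoring"), 1))
```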
Table 5 confirms that we do not only observe more lines without consensus, but also more pairs of mislabels than could be expected given a random distribution of mistakes for all labels. However, the mislabels are more often of type bug fix, unrelated improvement, or refactoring than of the other label types. Table 8 in the Appendix shows a full resolution of the individual labels we observed in lines without consensus and confirms that disagreements between participants on whether a line is part of a bug fix, an unrelated improvement, or a refactoring are the main driver of lines without consensus in production code files.

In addition to the view on disagreements through the labels, we also asked the participants that labeled more than 200 commits how confident they were in the correctness of their labels. Figure 4 shows the results of this survey. 55% of the participants estimate that they were not sure in 11%–20% of lines; another 20% estimate 21%–30% of lines. Only one participant estimates a higher uncertainty of 51%–60%. This indicates that participants have a good intuition about the difficulty of the labeling. If we consider that we have about 8% random mistakes and 12% lines without consensus, this indicates that about 20% of the lines should have been problematic for participants. We note that there are also three participants who selected between 71% and 100% of lines. While we cannot know for sure, we believe that these three participants misread the question, which is in line with the up to 10% of participants found by Podsakoff et al. (2003).

4.2 Discussion of RQ1

We now discuss our results for RQ1 with respect to our hypotheses to gain insights into the research question, put our results in the context of related work, and identify consequences for researchers who analyze bugs.

4.2.1 Prevalence of Tangled Commits

Tangling is common in bug fixing commits. We estimate that the average number of bug fixing lines per commit is between 22% and 38%. The 22% of lines are due to the lower bound of the confidence interval for the median bug fixing lines per commit. The 38% of lines assume that all 8.9% of the lines without consensus would be bug fixing and that the median would be at the upper bound of the confidence interval. However, we observe that a large part of the tangling is not within production code files, but rather due to changes in documentation and test files. Only half of the changes within bug fixing commits are to production code files. We estimate that the average number of bug fixing lines in production code files per commit is between 60% and 93%.

We fail to reject the hypothesis H1 that less than 40% of changes in bug fixing commits contribute to the bug fix and estimate that the true value is between 22% and 38% of lines. However, this is only true if all changes are considered. If only changes to production code files are considered, the average number of bug fixing lines is between 69% and 93%.

4.2.2 Impact of Tangling

The impact of tangling on software engineering research depends on the type of tangling as well as the research application. Moreover, tangling is not, by definition, bad. For example, adding new tests as part of a bug fix is actually a best practice. In the following, we use the term benign tangling to refer to tangled commits that do not negatively affect research and problematic tangling to refer to tangling that results in noise within the data without manual intervention.
Specifically, we discuss problematic tangling for the three research topics we already mentioned in the motivation, i.e., program repair, bug localization, and defect prediction.

If bug fixes are used for program repair, only changes to production code are relevant, i.e., all other tangling is benign, as it can easily be ignored using regular expressions, as we did in this article. In production code files, the tangling of whitespace and documentation changes is also benign. Refactorings and unrelated improvements are problematic, as they are not required for the bug fix and needlessly complicate the problem of program repair. If bug fixes are used as training data for a program repair approach (e.g., Li et al. 2020), this may lead to models that are too complex, as they would mix the repair with additional changes. If bug fixes are used to evaluate the correctness of automatically generated fixes (e.g., Martinez et al. 2016), this comparison may be more difficult or noisy, due to the tangling. Unrelated improvements are especially problematic if they add new features. This sort of general program synthesis is usually out of scope for program repair and rather considered a different problem, e.g., as neural machine translation (Tufano et al. 2019), and, therefore, introduces noise in dedicated program repair analysis tasks as described above.

For bug localization, tangling outside of production code files is also irrelevant, as these changes are ignored for the bug localization anyway. Within production code files, all tangling is irrelevant as long as a single line in a file is bug fixing, assuming that the bug localization is at the file level.

For defect prediction, we assume that a state-of-the-art SZZ variant that accounts for whitespace, documentation, and automatically detectable refactorings is used (Neto et al. 2018) to determine the bug fixing and bug inducing commits. Additionally, we assume that a heuristic is used to exclude non-production changes, which means that changes to non-production code files are not problematic for determining bug labels. For the bug fixes, we have the same situation as with bug localization: the labeling of files would not be affected by tangling, as long as a single line is bug fixing. However, all unrelated improvements and non-automatically detectable refactorings may lead to the labeling of additional commits as inducing commits and are, therefore, potentially problematic. Finally, defect prediction features can also be affected by tangling. For example, the popular metrics by Kamei et al. (2013) for just-in-time defect prediction compute change metrics for commits as a whole, i.e., not based on changes to production code only, but rather based on all changes. Consequently, all tangling is problematic with respect to these metrics. According to the histogram for overall tangling in Fig. 2, this affects most commits.

Table 6 summarizes the presence of problematic tangling within our data. We report ranges, due to the uncertainty caused by lines without consensus. The lower bound assumes that all lines without consensus are bug fixing and that all refactorings could be automatically detected, whereas the upper bound assumes that the lines without consensus are unrelated improvements and that refactorings could not be automatically detected. We observe that all use cases are affected differently, because different types of tangled changes cause problems for each of them.
Program repair has the highest lower bound, because any refactoring or unrelated improvement is problematic. Bug localization is less affected than defect prediction, because only the bug fixing commits are relevant and the bug inducing commits do not matter. If bug localization data sets also adopted automated refactoring detection techniques, the numbers would be the same as for the defect prediction bug fix labels. However, we want to note that the results from this article indicate that the automated detection of refactorings may be unreliable and could lead to the wrong filtering of lines that actually contribute to the bug fix. Overall, we observe that the noise varies between applications and also between our best case and worst case assumptions. In the best case, we observe as little as 2% problematically tangled bugs (file changes for bug fixes in defect prediction), whereas in the worst case this increases to 47% (total number of bugs with noise for defect prediction). Since prior work already established that problematic tangling may lead to wrong performance estimations (Kochhar et al. 2014; Herzig et al. 2016; Mills et al. 2020) as well as the degradation of performance of machine learning models (Nguyen et al. 2013), this is a severe threat to the validity of past studies.

Table 6  The ratio of problematic tangling within our data with respect to bugs and production file changes

  Research topic                 Bugs       File changes
  Program repair                 12%–35%    9%–32%
  Bug localization               9%–23%     7%–21%
  Defect prediction (bug fix)    3%–23%     2%–21%
  Defect prediction (inducing)   5%–24%     3%–18%
  Defect prediction (total)      8%–47%     5%–39%

The values for the inducing commits in defect prediction are only the commits that are affected in addition to the bug fixing commits.

Tangled commits are often problematic, i.e., they lead to noise within data sets that cannot be cleaned using heuristics. The amount of noise varies between research topics and is between an optimistic 2% and a pessimistic value of 47%. As researchers, we should be skeptical and assume that unvalidated data is likely very noisy and a severe threat to the validity of experiments, until proven otherwise.

4.2.3 Reliability of Labels

Participants have a probability of random mistakes of about 7.5% overall and 9.0% in production code files. Due to these numbers, we find that our approach of using multiple labels per commit was effective. Using the binomial distribution to determine the likelihood of at least three random mislabels, we expect that 111 lines with a consensus label are wrong, i.e., an expected error rate of 0.09% in consensus labels. This error rate is lower than, e.g., requiring two people who both have to agree, in which case the random error rate would grow to about 0.37%.

We fail to achieve consensus on 8.9% of all lines and 14.3% of lines in production code files. Our data indicates that most lines without consensus are not due to random mistakes, but rather due to disagreements between participants on whether a line contributes to the bug fix or not. This indicates that the lines without consensus are indeed hard to label. Our participant survey supports our empirical analysis of the labeled data.
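The sketch below illustrates the structure of this comparison. It uses a single mislabel probability and treats any mislabel as identical, which gives an upper bound; our actual estimates additionally distinguish the label types (Table 7), so the resulting rates reported above are lower than what this simplified sketch prints.

```python
# Simplified illustration: probability that an apparent consensus is made up of
# random mislabels, treating all mislabels as identical (an upper-bound view).
from math import comb

def p_wrong_consensus(p, raters, required):
    """Probability that at least `required` of `raters` independent labels are mislabels."""
    return sum(comb(raters, k) * p**k * (1 - p)**(raters - k)
               for k in range(required, raters + 1))

p = 0.09  # estimated probability of a random mislabel in production code files
print(p_wrong_consensus(p, raters=4, required=3))  # 3-of-4 consensus, as in our study
print(p_wrong_consensus(p, raters=2, required=2))  # 2-of-2 agreement, for comparison
```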
A possible problem for the reliability of the work could also be participant-dependent "default behavior" in cases where participants were uncertain. For example, participants could have labeled lines as bug fixing when they were not sure, which would reduce the reliability of our work and possibly also bias the results towards more bug fixing lines. However, we have no data regarding default behavior and believe that this should be studied in an independent and controlled setting.

We reject H2 that participants fail to achieve consensus on at least 10.5% of lines and find that this is not true when we consider all changes. However, we observe that this is the case for the labeling of production code files, with 14.3% of lines without consensus. Our data indicates that these lines are hard for researchers to label, with active disagreement instead of random mistakes. Nevertheless, the results with consensus are reliable and should be close to the ground truth.

4.2.4 Comparison to Prior Work

Due to the differences in approaches, we cannot directly compare our results with those of prior studies that quantified tangling. However, the fine-grained labeling of our data allows us to restrict our results to mimic settings from prior work.

From the description in their paper, Herzig and Zeller (2013) flagged only those commits as tangled that contained source code changes for multiple issues or clean-up of the code. Thus, we assume that they did not consider documentation changes or whitespace changes as tangling. This definition is similar to our definition of problematic tangling for program repair. The 12%–35% of affected bugs that we estimate includes the lower bound of 15% of tangled commits. Thus, our work seems to replicate the findings from Herzig and Zeller (2013), even though our methodologies are completely different. Whether the same holds true for the 22% of tangled commits reported by Nguyen et al. (2013) is unclear, because they do not specify how they handle non-production code. Assuming they ignore non-production code, our results would replicate those by Nguyen et al. (2013) as well.

Kochhar et al. (2014) and Mills et al. (2020) both consider tangling for bug localization, but have conflicting results. Both studies defined tangling similarly to our definition of problematic tangling for bug localization, except that they did not restrict the analysis to production code files, but rather to all Java files. When we use the same criteria, we have between 39% and 48% problematic tangling in Java file changes. Similar to what we discussed in Section 4.2.2, the range is the result of assuming lines without consensus either as bug fixes or as unrelated improvements. Thus, our results replicate the work by Mills et al. (2020), who found between 32% and 50% tangled file changes. We could not confirm the results by Kochhar et al. (2014), who found that 28% of file changes are problematic for bug localization.

Regarding the types of changes, we can only compare our work with the data from Nguyen et al. (2013) and Mills et al. (2020). The percentages are hard to compare, due to the different levels of abstraction we consider. However, we can compare the trends in our data with those of prior work. The results regarding tangled test changes are similar to Mills et al. (2020). For the other types of changes, the ratios we observe are more similar to the results reported by Mills et al. (2020) than to those of Nguyen et al. (2013). We observe only a few unrelated improvements, while these are as common as tangled documentation changes in the work by Nguyen et al. (2013). Similarly, Mills et al.
(2020) observed a similar amount of refactorings as documentation changes. In our data, documentation changes are much more common than refactorings, both in all changes as well as in changes to production code files. A possible explanation for this is the difference in methodology: refactoring and unrelated improvement labels are quite common in the lines without consensus. Thus, detecting such changes is common in the difficult lines. This could mean that we underestimate the number of tangled refactorings and unrelated improvements, because many of them are hidden in the lines without consensus. However, Nguyen et al. (2013) and Mills et al. (2020) may also overestimate the number of unrelated improvements and refactorings, because they had too few labelers to have the active disagreements we observed. Regarding the other contradictions between the trends observed by Nguyen et al. (2013) and Mills et al. (2020), our results indicate that the findings by Mills et al. (2020) generalize to our data: we also find more refactorings than unrelated improvements.

Our results confirm prior studies by Herzig and Zeller (2013), Nguyen et al. (2013), and Mills et al. (2020) regarding the prevalence of tangling. We find that the differences in estimations are due to the study designs, because the authors considered different types of tangling, which we identify as different types of problematic tangling. Our results for tangled change types are similar to the results by Mills et al. (2020), but also indicate a disagreement regarding the prevalence of refactorings and unrelated improvements, likely due to the uncertainty caused by lines that are difficult to label.

4.2.5 Anecdotal Evidence on Corner Cases

Due to the scope of our analysis, we also identified some interesting corner cases that should be considered in future work on bug fixes. Sometimes, only tests were added as part of a bug fixing commit, i.e., no production code was changed. Usually, this indicates that the reported issue was indeed not a bug and that the developers added the tests to demonstrate that the case works as expected. In this case, we have a false positive bug fixing commit, due to a wrong issue label. However, it also sometimes happened that the label was correct, but the bug fixing commit still only added tests. An example of this is the issue IO-466,15 where the bug was present, but was already fixed as a side effect of the correction of IO-423. This means we have an issue that is indeed a bug and a correct link from the version control system to the issue tracking system, but we still have a false positive for the bug fixing commit. This shows that even if the issue types and issue links are manually validated, there may be false positives in the bug fixing commits if there is no check that production code is modified.

Unfortunately, it is not as simple as that. We also found some cases where the linked commit only contained the modification of the change log, but no change to production code files. An example of this is NET-270.16 This is an instance of missing links: the actual bug fix is in the parent of the linked commit, which did not reference NET-270. Again, we have a correct issue type and a correct issue link, but no bug fix in the commit, because no production code file was changed. However, a directly related fix exists and can be identified through manual analysis. The question is how we should deal with such cases in future work.
We do not have clear answers. Identifying that these commits do not contribute to the bug fix is trivial, because they do not change production code files. However, always discarding such commits as not bug fixing may remove valuable information, depending on the use case. For example, if we study test changes as part of bug fixing, the test case is still important. If a manual correction is applied, or possibly even a reliable heuristic is identified, to find fixes in parent commits, the second case is still important. From our perspective, these examples show that even the best heuristics will always produce noise and that even with manual validation, there are not always clear decisions.

An even bigger problem is that there are cases where it is hard to decide which modifications are part of the bug fix, even if one understands the complete logic of the correction. A common case of this is the "new block problem" that is depicted in Fig. 5. The problem is obvious: the unchecked use of a.foo() can cause a NullPointerException; the fix is the addition of a null check. Depending on how the bug is fixed, a.foo() is either part of the diff or not. This leads to the question: is moving a.foo() to the new block part of the bug fix or is this "just a whitespace change"? The associated question is whether the call to a.foo() is the bug or whether the lack of the null check is the bug. What are the implications that the line with a.foo() may once be labeled as part of the bug fix (Fix 1), and once not (Fix 2)? We checked in the data, and most participants labeled all lines in both fixes as contributing to the bug fix in the cases that we found. There are important implications for heuristics here: 1) pure whitespace changes can still be part of the bug fix, if the whitespace change indicates that the statement moved to a different block; 2) bugs can sometimes be fixed as a modification (Fix 1), but also as a pure addition (Fix 2), which leads to different data.

Fig. 5  Example for a bug fix, where a new condition is added. In Fix 1, the line a.foo() is modified by adding whitespaces and is part of the textual difference. In Fix 2, a.foo() is not part of the diff

15 https://issues.apache.org/jira/browse/IO-466
16 https://issues.apache.org/jira/browse/NET-270

The first implication is potentially severe for heuristics that ignore whitespace changes. In this case, textual differences may not be a suitable representation for reasoning about the changes, and other approaches, such as differences in abstract syntax trees (Yang 1991), are better suited. The second is contrary to the approach by Mills et al. (2020) for untangling for bug localization, i.e., removing all pure additions. In this case, the addition could also be a modification, which means that the bug could be localized and that ignoring this addition would be a false negative. Similarly, the SZZ approach to identify the bug inducing commits cannot find inducing commits for pure additions. Hence, SZZ could blame the modification of the line a.foo() from Fix 1 to find the bug inducing commit, which would be impossible for Fix 2.

4.2.6 Implications for Researchers

While we have discussed many interesting insights above, two aspects stand out and are, to our mind, vital for future work that deals with bugs.

– Heuristics are effective!
Most tangling can be automatically identified by identifying non-production code files (e.g., tests) and changes to whitespace and documentation within production code files.17 Any analysis that should target production code but does not carefully remove tests cannot be trusted, because test changes are more common than bug fixing changes.

– Heuristics are imperfect! Depending on the use case and the uncertainty in our data, up to 47% of the data could still be affected by tangling, regardless of the heuristics used to clean the data. We suggest that researchers carefully assess which types of tangling are problematic for their work and use our data to assess how much problematic tangling they should expect. Depending on this estimation, they can either estimate the threat to the validity of their work or plan other means to minimize the impact of tangling on their work.

17 For example, the pycoSHARK contains methods for checking whether code is Java production code for our projects, as well as all standard paths for tests and documentation. The VisualSHARK and the inducingSHARK are both able to find whitespace and comment-only changes. All tools can be found on GitHub: https://github.com/smartshark

4.2.7 Summary for RQ1

In summary, we have the following result for RQ1.

We estimate that only between 22% and 38% of changed lines within bug fixing commits contribute to the functional correction of the code. However, much of this additional effort seems to be focused on changes to non-production code, like tests or documentation files, which is to be expected in bug fixing commits. For production code, we estimate that between 69% and 93% of the changes contribute to the bug fix. We further found that researchers are able to reliably label most data, but that multiple raters should be used, as there is otherwise a high likelihood of noise within the data.

5 Experiments for RQ2

We now present our results and discuss the implications for RQ2 on the effect of gamification for our study.

5.1 Results for RQ2

Fig. 6  Number of commits labeled per participant and number of commit labels over time

Figure 6 shows how many commits were labeled by each participant and how labeling progressed over time. We observe that most participants labeled around 200 commits and that the labeling started slowly and accelerated at the end of August.18 Moreover, all participants with more than 200 commits achieved at least 70% consensus and most participants were clearly above this lower boundary. Five participants that each labeled only a few lines before dropping out are below 70%. We manually checked and found that the low ratio was driven by single mistakes and that each of these participants participated in the labeling of fewer than 243 lines with consensus. This is in line with our substantial agreement measured with Fleiss' κ.

18 Details about when we recruited participants are provided in Appendix C.

Figure 7 shows how many lines were labeled by each participant over time. The plot on the left confirms that most participants started labeling data relatively late in the study. The plot in the middle is restricted to the participants that labeled substantially more lines than required, which resulted in their being mentioned earlier in the list of authors.

Fig. 7  Lines labeled per participant over time. Each line is one participant.
The plot on the left shows the data for all participants. The plot in the middle shows only participants who labeled more than 250 commits, excluding Steffen Herbold, Alexander Trautsch, and Benjamin Ledel, because their position in the author list is not affected by the number of labeled commits. The plot on the right shows only the top 5 participants

We observe that many participants are relatively densely located in the area around 250 commits. These participants did not stop at 200 commits, but continued a bit longer before stopping. Seven participants stopped when they finished 200 bugs, i.e., they mixed up the columns for bugs finished and commits finished in the leaderboard, which may explain some of the data. We could see no indication that these participants stopped because they achieved a certain rank. The plot on the right shows the top five participants that labeled the most data. We observe that the first and third ranked participants both joined early and consistently labeled data over time. We note that the activity of the first ranked participant decreased until the third ranked participant got close, and then accelerated again. The second ranked participant had two active phases of labeling, both with over 500 labeled commits. From the data, we believe that achieving the second rank overall may have been the motivation for labeling here. The most obvious example of the potential impact of the gamification on activity is the fourth and fifth ranked participants. The fourth ranked participant stopped labeling until they were overtaken by the fifth ranked participant; then both changed places a couple of times before the fourth ranked participant prevailed and stayed in fourth rank. Overall, we believe that the line plot indicates that these five participants were motivated by the gamification to label substantially more. Overall, these five participants labeled 5,467 commits, i.e., produced 5.4 times more data than was minimally required of them. Figure 8 shows the results of our question if participants with at least 250 commits would have labeled more than 200 commits, if