Software Lab Institute of Software Engineering University of Stuttgart Universitätsstraße 38, 70569 Stuttgart Master Thesis Tests4J Benchmark: Execution-based Evaluation of Context-Aware Language Models for Test Case Generation Valentin Knappich Course of study: Computer Science Examiner: Prof. Dr. Michael Pradel Supervisor: Prof. Dr. Michael Pradel, Matteo Paltenghi Started: October 17, 2022 Completed: April 17, 2023 Abstract Testing is a critical part of the software engineering process. The cost associated with writing and maintaining test suites has motivated the research community to develop approaches to automati- cally generate test cases for a given software. Evosuite is one of the most established tools for Java and has been shown to achieve high coverage. However, the test cases lack readability, motivating the application of language models (LMs) of code to the task. Evaluating such neural test case generators based on their execution requires substantial e↵orts to set up evaluation projects and to obtain metrics like coverage from them. In consequence, most prior work on Java test case generation has either evaluated models on a small number of selected methods under test (MUTs) or used Defects4J as an evaluation benchmark. However, small benchmarks su↵er from high vari- ance, and many projects in Defects4J have been used to pre-train LMs of code. To fill that gap, we introduce Tests4J, a novel benchmark for neural and non-neural test case generators. Tests4J contains 12k test cases from 60 Java projects, out of which 41 are used for training while 19 are used for evaluation. For all projects, it includes the complete repository, enabling execution-based evaluation and open-ended experimentation with project-specific context information. In a single command, Tests4J allows researchers to obtain execution-based metrics like coverage and intrinsic metrics like loss, BLEU and crystalBLEU. Using Tests4J, we train and evaluate several test case generation models based on PolyCoder with 400M parameters. We compare Evosuite to our best neural model and find that the individual test cases achieve similar coverage. However, Evosuite generates 3 times as many test cases, covering about 3 times as many lines in total. We furthermore find that Evosuite fails to generate any test cases for 4 out of 11 projects in the test set. This presents a fundamental advantage of LMs: they do not need to integrate with the project and thus don’t su↵er from dependency conflicts. Next, we evaluate prefix tuning as a training method and find that there is a significant gap to full fine- tuning. We further investigate the importance of project-specific context information and create simplified representations of the focal class and the test class. We find that adding this context information increases the achieved coverage by more than 4x, and that the focal class and test class context are highly complementary. Motivated by this finding, as well as the hard token limit and quadratic complexity of transformers, we propose Context-Aware Prompt Tuning (CAPT). In CAPT, context information is first compressed into embeddings, and then injected into the LM as soft tokens, similar to prefix tuning. We find that the method does not yield significant improvements over the baseline, but present directions for future research. Lastly, we find that loss is not an ideal indicator of coverage and that there is a high variance in coverage between projects, and thus advocate for large-scale execution-based evaluations. 
iii Zusammenfassung Softwaretests spielen eine zentrale Rolle in der Softwareentwicklung. Neben etablierten Tools wie Evosuite wurden in den letzten Jahren vermehrt Sprachmodelle für die automatische Generierung von Testfällen eingesetzt. Die ausführungsbasierte Evaluierung solcher Modelle stellt einen erhe- blichen Aufwand dar. Infolgedessen haben vorherige Arbeiten Modelle entweder auf einer kleinen Auswahl zu testender Methoden oder auf Defects4J evaluiert. Wir halten beide Ansätze für sub- optimal, da Evaluierungen auf kleinen Datensätzen eine hohe Varianz aufweisen und die Projekte in Defects4J zu einem großen Anteil in weit verbreiteten Datensätzen für das Vor-Trainieren von Sprachmodellen enthalten sind. Um diese Defizite zu beheben, führen wir in dieser Arbeit Tests4J ein. Es handelt sich dabei um einen neuen Benchmark zur Evaluierung von neuronalen und nicht- neuronalen Methoden für die Generierung von Tests. Tests4J beinhaltet 12k Testfälle aus 60 Java Projekten, wobei 41 für das Training und 19 für die Evaluierung eingesetzt werden. Tests4J enthält eine komplette Kopie aller Projekte, sodass eine ausführungsbasierte Evaluierung und umfassende Experimente mit Kontextinformationen ermöglicht werden. In einem einzigen Befehl können mit Tests4J sowohl ausführungsbasierte Metriken wie die Abdeckung als auch intrinsische Metriken wie Loss, BLEU und crystalBLEU berechnet werden. Mithilfe von Tests4J trainieren und evaluieren wir verschiedene Modelle auf Basis von Poly- Coder mit 400M Parametern. Wir vergleichen Evosuite mit unserem besten neuronalen Modell und stellen fest, dass einzelne Testfälle ähnlich e↵ektiv in der Abdeckung sind. Jedoch gener- iert Evosuite dreimal so viele Testfälle und erreicht die dreifache Abdeckung insgesamt. Zudem evaluieren wir Prefix Tuning als Trainingsmethode und stellen einen signifikanten Unterschied zum Trainieren des ganzen Modells in Bezug auf die Abdeckung und die intrinsischen Metriken fest. Wir untersuchen zudem die Bedeutung von projektspezifischen Kontextinformationen und erstellen vereinfachte Darstellungen der zu testenden Klassen und Testklassen. Dabei kommen wir zu dem Ergebnis, dass Kontextinformationen essenziell sind und zu mehr als der vierfachen Abdeckung führen. Motiviert durch diese Erkenntnis sowie durch die limitierte Anzahl an Tokens und die quadratische Komplexität von Transformern, schlagen wir Context-Aware Prompt Tuning vor. Dabei wird der Kontext zunächst von einem kleineren Modell in numerische Repräsentationen komprimiert, welche dann als virtuelle Tokens vom Sprachmodell verarbeitet werden. Wir stellen fest, dass die Methode keine signifikanten Verbesserungen gegenüber der Baseline erzielt, zeigen aber Richtungen für zukünftige Forschung auf. Abschließend stellen wir fest, dass Loss kein ide- aler Indikator für den Abdeckungsgrad ist, sowie dass es eine große Varianz im Abdeckungsgrad zwischen Projekten gibt und plädieren daher für umfassende, ausführungsbasierte Evaluierungen. v Contents 1 Introduction 1 2 Background 5 2.1 Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Soft Prompt Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Test Case Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Tests4J Benchmark 9 3.1 Filtering Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 Dataset Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
14 3.3 Final Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.4 Coverage Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4.1 Post-processing and Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4.2 Repository Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.4.3 Sca↵olding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4.4 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4.5 Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4.6 Report Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4.7 Pipeline Validation and Manual Corrections . . . . . . . . . . . . . . . . . . . 18 3.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.6 Ground Truth Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4 Context-Aware Prompt Tuning 23 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.1 Context Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.2 Prompt Generator Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 26 5 Experiments 31 5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.1.2 Training Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.1.3 Generation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 vii viii Contents 5.1.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.3.1 RQ1: Training Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3.2 RQ2: Context Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.3.3 RQ3: E↵ectiveness of CAPT . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.3.4 RQ4: Comparison with Evosuite . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3.5 RQ5: Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 44 6 Discussion and Future Work 47 6.1 Tests4J Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.2 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 7 Related Work 49 7.1 Neural Test Case Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 7.2 Soft Prompt Tuning in Code Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 7.3 Dynamic Prompt Tuning and Multi-modal Language Models . . . . . . . . . . . . . 52 8 Conclusion 55 A Appendix 57 A.1 Maven Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 A.1.1 Maven Compiler Plugin Configuration . . . . . . . . . . . . . . . . . . . . . . 57 A.1.2 Maven Surefire Plugin Configuration . . . . . . . . . . . . . . . . . . . . . . . 57 A.1.3 Maven JaCoCo Plugin Configuration . . . . . . . . . . . . . . . . . . . . . . . 
57 A.1.4 Maven Compile Log Regex . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 A.1.5 Maven Execution Log Regex . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 A.2 List of Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 A.3 CAPT Result Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 A.3.1 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 A.3.2 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 A.3.3 Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Bibliography 66 Acronyms AST Abstract Syntax Tree. 17 CAPT Context-Aware Prompt Tuning. 3, 24 LM Languge Model. 1, 3, 5–7, 9, 11, 15, 24, 49 MLP Multilayer Perceptron. 7 MUT method under test. 1, 8–10, 12, 14, 17, 18, 20, 23 NLP Natural Languge Processing. 1, 6 PEFT parameter-e�cient fine-tuning. 7 PG Prompt Generator. 24 SGD Stochastic Gradient Descent. 1, 7 ix 1 Introduction Testing is a critical part of the software engineering process. Unit tests are at the bottom of the testing pyramid [13] and test single functions or methods called methods under test (MUTs). In modern codebases, test code constitutes a significant amount of code, often spanning more lines than the main code. The cost associated with writing, updating and maintaining this test code motivated the research community to develop tools for automatic test case generation. These tools automatically generate test cases for MUTs based on their signature, body, and sometimes additional context information from the class or project. Evosuite [19] and Randoop [36] are the most popular tools for Java. Evosuite uses evolutionary algorithms to generate test suites, whereas Randoop uses a random search directed by feedback from executing the tests. Typically, test suites generated by these tools achieve excellent coverage scores, sometimes even outperforming human developers in that regard [20]. However, there are some limitations that hinder the adoption of these tools in practice. One major concern is the quality of the generated test cases. Analyses revealed multiple test smells that appear significantly more often in generated tests than in manually written ones [22, 38]. Generated test cases have been shown to be less readable and understandable [23, 14], arguably increasing the amount of work necessary for maintaining test suites. In a survey, most industrial practitioners said they would not keep a generated test case without modification, citing readability as one of the main reasons [4]. Meanwhile, the Natural Languge Processing (NLP) research community developed powerful Language Models (LMs) able to perform both discriminative and generative tasks on natural lan- guage. These models, usually based on the Transformer architecture [48], are pre-trained on self- supervised objectives and later adapted to downstream tasks such as text summarization. For such adaptation, there are two main approaches: fine-tuning and in-context learning. In fine-tuning, the LM weights are further trained on a downstream task objective, whereas in-context learning [32] achieves adaptation merely by providing a task description or input-output examples in the input, also called prompt. Soft prompt tuning is a family of methods that aim to find a middle ground between these two directions by introducing additional parameters in the form of soft to- kens. 
Thereby, soft tokens refer to continuous embeddings that are treated as tokens but trained directly with Stochastic Gradient Descent (SGD). Accordingly, normal textual prompts are called hard prompts. The success of LMs in the field of NLP has led to the adoption in the domain of source code as well. The result are LMs of code that can reason about and generate source code [18, 35, 50, 12, 52, 5]. They have been applied to many tasks, code completion, code synthesis and bug detection being among the most popular ones [40]. Yet, only relatively few works have attempted to generate 1 2 1. Introduction 1 public T retain() throws RefCountResourceException { 2 synchronized (this) { 3 if (refCount.getAndIncrement() == 0) { 4 try { 5 log.debug("First Reference. Create Resource."); 6 resource.set(resourceSupplier.get()); 7 } 8 catch (Throwable throwable) { 9 throw new RefCountResourceException(throwable); 10 } 11 } 12 13 return resource.get(); 14 } 15 } 16 17 @Test 18 public void retainShouldNotCreateResourceOnSecondCall() throws Throwable { 19 AtomicInteger refCount = new AtomicInteger(1); 20 RefCountResource resource = new RefCountResource(resourceSupplier, resourceCleanup, refCount, new AtomicReference(new Object()));,! 21 resource.retain(); 22 23 verify(resourceSupplier, times(0)).get(); 24 assertEquals(2, refCount.get()); 25 } Figure 1.1: Example MUT and test case that requires substantial context information. test cases with LMs [6, 47, 2, 43]. In these works, there is no clear consensus on the evaluation methodology. Some works evaluate models using intrinsic metrics like the loss [47, 2]. Such metrics, loss in particular, are easy to compute, enabling simple evaluations on a large scale. However, we argue that ultimately, a test case should cover the MUT and correctly assert its behavior, which can be achieved in many ways. Since intrinsic metrics only compare a generated test case to a ground truth, there is the risk of a disconnect between these metrics and the final performance. Execution-based metrics o↵er a more reliable evaluation because they are not restricted to a single ground truth and directly measure coverage as one key quality attribute of a test case. However, performing coverage analyses is associated with substantial e↵ort to set up. Prior work has therefore either focussed on coverage evaluation on a small number of classes and projects [6], or used existing benchmarks [47, 2] like Defects4J [27]. We argue that small-scale evaluation setups su↵er from high variance in coverage between classes and even between projects. Defects4J, on the other hand, contains a large portion of repositories that have been used to pre-train LMs, potentially leading to data leakage and distorted results. To fill that gap, we introduce Tests4J, a large-scale benchmark for neural and non-neural test case generators that provides execution-based and intrinsic metrics, while avoiding data leakage. To apply LMs to the task of test case generation, it is cast to a completion task: the language model is given a prompt and asked to complete it. In the most basic setup, this prompt only contains the MUT. We argue that this setup is ill-posed because the MUT is only part of the information required to write correct test cases. Figure 1.1 shows an example where the test case 1. Introduction 3 requires substantial context information beyond the MUT. 
For instance, it requires information on how to instantiate objects of the focal class RefCountResource, as well as information on the initialized mock objects resourceSupplier and resourceCleanup from the test class. Illustrated by this example and empirically shown by prior work [47], providing such context information is therefore essential for generating test cases. However, transformer LMs are limited in processing context by hard token limits of 1024 or 2048 and the quadratic compute and memory complexity. We attempt to alleviate these issues and introduce Context-Aware Prompt Tuning (CAPT), which first compresses context information into embeddings and injects them into the LM as virtual tokens, similar to soft prompt tuning. We summarize our main contributions to be the following: 1. We introduce Tests4J, containing 12k pairs of MUTs and test cases from 60 Java projects. Out of these, 19 projects are set up to be executable in a fully automatic way, enabling both intrinsic and execution-based evaluation in a single command. The remaining projects are used for training. All projects are stored as complete clones, allowing extensive experimen- tation with context information. 2. We integrate Evosuite into Tests4J, enabling the comparison of Evosuite and our neural approaches (RQ4). We find that individual test cases generated by the two approaches are similarly e↵ective in covering the MUTs. However, Evosuite generates about 3 times as many executable test cases, covering about 3 times as many lines in total. 3. Based on Tests4J, we experiment with prefix tuning (RQ1) and the importance of context information (RQ2). We find that prefix tuning lacks significantly behind full fine-tuning and that good context representations from both focal and test class are essential. 4. We propose and evaluate CAPT, finding that it is unable to e↵ectively retain performance while reducing the sequence length. We present hypotheses for the reasons and leave further experimentation to future work. The remainder of this thesis is structured as follows. In chapter 2, we provide background knowledge on test case generation, LMs and Prompt Tuning. Chapter 3 describes the construction of the benchmark, whereas Context-Aware Prompt Tuning is described in chapter 4. The experi- mental setup, research questions and results are presented in chapter 5. In chapter 6, we discuss the results and derive directions for future work. Chapter 7 describes related work and chapter 8 summarizes the experiments and findings of this work. 2 Background This chapter contains background information on the most relevant topics in this thesis: Language Models, Prompt Tuning and Test Case Generation. 2.1 Language Models Language Models (LMs) are machine learning models trained to predict the probability of a given text. They are usually trained in a self-supervised manner using a denoising objective like masked language modeling or causal language modeling. Generative LMs, which this work focuses on, often use the former. At every time step, the model predicts the most likely next token. To generate text, the prediction is applied autoregressively, i.e., the model generates text token by token from left to right. Generally, the inputs and outputs of LMs are sequences of tokens. We denote such sequences by xi:j , which refers to all tokens with indices inclusively between i and j: xi:j = [xi, ..., xj ]. 
In a sequence-to-sequence task, such as test case generation, the model is trained to infer a target sequence y0:m from a prompt x0:n. Figure 2.1 illustrates this setup and introduces the model visualization used throughout this thesis. More formally, we define the model to predict a conditional probability distribution p(yi+1|x0:n, y0:i). We can then formulate the cross entropy loss as CE = � mX i=0 log(p(yi+1|x0:n, y0:i)) Since the introduction of the transformer architecture [48], it has been the predominant archi- tecture for LMs. It is based on self-attention, where every token is represented by an embedding. At every layer, every token embedding is used to create query, key, and value embeddings via linear projections. For every token, the dot-product similarity of its query embedding with all other key embeddings determines how much each of the other tokens should influence this token embedding, i.e., how much attention it should pay to each other token. The new token embedding is then the weighted sum of each value embedding, weighted by attention score. More formally, the au- thors formulate the attention mechanism in a compact way by combining the query, key, and value embeddings for every token into the matrices Q, K, and V: Attention(Q,K, V ) = softmax( QKT p dk )V Instead of performing the attention mechanism with single, large query, key, and value embeddings per token, the authors propose multi-head self-attention. There, self-attention as described above 5 6 2. Background is performed for multiple smaller query, key, and value embeddings that are created by separate projections. MultiHead(Q,K, V ) = Concat(head1, ..., headh)W O where headi = Attention(QWQ i ,KWK i , V W V i ) The self-attention mechanism has one major drawback compared to earlier approaches based on recurrence or convolutions. It has a quadratic complexity in terms of both compute and memory with respect to the sequence length. That is, using longer prompts with more context information comes at the cost of much higher resource requirements. While the notation of O(n2) is generally about scaling behavior in the limit of n, quadratic behavior can already be clearly observed within the range of [0, 2048] (see e.g. Figure 4.1), motivating our work on CAPT in chapter 4. The original transformer proposed by Vaswani et al. [48] was an encoder-decoder architecture, i.e., an encoder first processed the prompt and a decoder generated new output while flexibly at- tending to both prompt and previous outputs. Subsequently, many architectures were derived that use only an encoder (e.g. BERT [15]), only a decoder (e.g. GPT-2 [41]) or both (e.g. T5 [42], BART [30]). Motivated by the success of pre-trained transformer LMs in NLP, they were also adopted for code-related tasks, resulting in LMs of code. In that context, code is, like natural language, represented as a sequence of tokens. Some of the most notable pre-trained models are Codex [12], PLBART [1], CodeGen [35] and PolyCoder [52]. In this work, we focus on the PolyCoder family of models based on the GPT-2 architecture. The authors publicly release three model variants with 160M, 400M and 2.7B parameters. Such models have been applied to several tasks. For instance, code completion poses the task of continuing an incomplete snippet of code, code generation is the task of generating code from a natural language description and code summarization is the task of generating a natural language description for code [40]. 
LMs of code are usually pre-trained on a corpus of code that is as general and representative as possible. In that sense, such corpora usually contain code in a number of programming languages and from many domains. To specialize models to a specific downstream task, such as Java test case generation, there are two main paradigms of adaption. First, there is fine-tuning, where all or some of the model parameters are adjusted using task-specific losses. Second, there is in-context learning, where the model is merely conditioned on a task via prompt, while the parameters remain unchanged. To perform in-context learning, the prompt might for instance contain a natural language task description or input-output pairs. If there are multiple of these pairs in the prompt, the method is called few-shot prompting. Depend- ing on the task, fine-tuning often leads to equally accurate but much smaller models. On the other hand, fine-tuning requires su�ciently large datasets and results in the deployment of many models if multiple tasks should be supported. 2. Background 7 Figure 2.1: Transformer Language Model. x0:n represents the prompt and y0:m the tar- get. Red boxes represent trainable parame- ters, blue boxes represent frozen parameters, and black boxes represent tokens. Figure 2.2: Deployment benefits of soft prompt tuning. Figure from [29]. 2.2 Soft Prompt Tuning To reduce the complexity of storing and deploying fine-tuned models for multiple tasks, and to reduce computational requirements for fine-tuning, parameter-e�cient fine-tuning (PEFT) methods have been proposed. In PEFT, only a small number of parameters are tuned, while the others remain unchanged. The tuned parameters are either newly introduced and randomly initialized or selected among the existing ones [11]. Soft prompt tuning is a family of PEFT methods. Inspired by the success of discrete prompts in models like GPT-3 [10], they add soft tokens to frozen LMs. These soft tokens are continuous embeddings and can therefore be directly optimized using SGD. Whereas only a small portion of the total parameters are fine-tuned, these methods have been shown to achieve comparable performance to full fine-tuning in many tasks [31, 29, 33]. At the same time, they open up interesting opportunities in the model deployment. Practitioners can deploy a single model endpoint of the main model that supports multiple tasks by selecting the respective soft prompts. As depicted in Figure 2.2, one can even mix di↵erent tasks in the same batch. Additionally, the memory consumption during fine-tuning is slightly reduced because the optimizer does not need to maintain states for the parameters of the main model (gradients of the main model are still required to allow backpropagation through the model to the soft tokens). The two most prominent methods in this family are Prefix Tuning [31] and Prompt Tuning [29]. They mostly di↵er in the shape of the soft prompts and how they are injected into the LM, as depicted in Figure 2.3. Prompt Tuning trains soft tokens of the size of the hidden dimensionality and treats them as input embeddings. The hidden states of subsequent layers are determined by the attention mechanism, just as with regular tokens. In contrast, the soft tokens learned by Prefix Tuning span all layers of the LM. Specifically, they contain key and value embeddings for every layer that are directly injected into the attention mechanism. The authors of Prefix Tuning argue that this increases the expressiveness of the method. 
To stabilize the training process, the prefix embeddings are reparameterized: every token is represented by an embedding of the hidden dimensionality, and projected up to the target size by a Multilayer Perceptron (MLP) that is shared among tokens. 8 2. Background Figure 2.3: Comparison of Prompt Tuning and Prefix Tuning. Prefix Tuning is shown at inference time, i.e., the reparameterization with the MLP is not depicted. 2.3 Test Case Generation Unit test cases are functions that test the functionality of software. Contrary to integration tests and end-to-end tests, unit tests test a small piece of code, usually functions or methods. They generally follow four phases: setup, execution, validation, tear-down [9]. In object-oriented languages such as Java, test cases are often organized in test classes, where some of the setup code is shared among test cases. Besides general software quality factors, the quality of test cases are usually quantified using metrics like coverage and mutation scores. Coverage generally measures what portion of the code under test was executed by a test case or test suite. In particular, it measures the fraction of lines or branches that are executed, yielding the line coverage and branch coverage metrics. Whereas coverage is an important criterion for the quality of test cases, it does not consider the quality of the assertions. In contrast, mutation scores mutate the code under test, check if it still passes the test case and can thereby quantify how likely it is that a regression is detected by the test. Test case generation refers to the task of automatically generating test cases for a given MUT. In the last decades, several approaches for test case generation have been proposed. For Java, Evosuite [19] and Randoop [37] have been the predominantly used tools. Randoop’s approach uses random testing, where the test cases are generated by randomly sampling sequences of method calls. Furthermore, the random generation is guided by feedback from the execution in order to get sequences that are executable and not redundant with respect to the program state. In contrast, Evosuite uses a genetic algorithm that first generates a set of seed tests and generates new tests by applying crossover and mutation operators to the population. The fitness of tests is determined using coverage feedback, i.e., Evosuite directly optimizes for code coverage. Consequently, Evosuite typically achieves very high coverage scores, sometimes even outperforming human developers in that regard [20]. 3 Tests4J Benchmark We introduce Tests4J, a benchmark for training and evaluating neural test case generation models. In essence, the benchmark consists of a dataset and an evaluation pipeline. The dataset contains mappings from MUTs to test cases, whereas the evaluation pipeline automatically obtains coverage metrics for generated test cases. We restrict our work to the Java programming language. It is one of the most used languages in prior work regarding test case generation and also provides strong non-neural baselines such as Evosuite [19]. To enable the evaluation pipeline in obtaining execution-based metrics, we include complete repository snapshots in the dataset, rather than just pairs of MUTs and test cases. This also makes the dataset extensible regarding the additional context information a model could use. Prior work [47, 2] has used a subset of Defects4J [27] to evaluate test case generators with coverage metrics. 
However, many of the repositories have been used to pre-train LMs of code. For instance, the popular pre-training dataset The Pile [21] contains parts of 16 out of 17 repos- itories in Defects4J. Similarly, PolyCoder’s [52] pre-training corpus contains parts of 12 out of 17 repositories. Evaluating on these repositories is therefore problematic because they might have memorized the test case rather than inferring them. In this benchmark, we remove all reposito- ries used in these two datasets from the candidates. Furthermore, the dataset creation process is mostly automated, enabling future research to increase the scale and avoid data leakage from other pre-training corpuses as well. Furthermore, Defects4J does not contain ground truth test cases. That is, it does not enable evaluation with intrinsic metrics like crystalBLEU [17]. To evaluate a novel test case generator through execution, researchers can insert generated test cases into the respective repositories. Another popular benchmark for test case generators is JUGE [16]. It is used annually for the JUnit testing tool competition1. At its core, JUGE is a standardized infrastructure to measure the e↵ectiveness and e�ciency of test case generators. It has been designed for and applied to a variety of methods, inlcuding search-based approaches (e.g. Evosuite [19]) and random-based approaches (e.g. Randoop [36]). However, it has not been used to evaluate neural approaches. We argue that its design is fundamentally not ideal for neural test case generators for multiple reasons. First, much like Defects4J, the majority of the repositories that have been used as benchmark in the last years can be found in the pre-training corpus of many LMs. Second, one key pillar of JUGE is a standardized execution environment that is not su�cient to run LMs. Lastly, JUGE requires developers to write Java interfaces for their test case generators while most deep learning research is done in Python. In contrast, the only interface between models and Tests4J is JSON with a 1https://junitcontest.github.io/ 9 https://junitcontest.github.io/ 10 3. Tests4J Benchmark Benchmark Training Data Intrinsic Evaluation Execution-based Evaluation Accounts for Data Leakage Interface between Generator and Benchmark Methods2Test [46] - - - Defects4J [27] - - - - JUGE [16] - - - Java Tests4J (ours) JSON Table 3.1: Comparison of Test Case Generation Benchmarks and Datasets simple schema, allowing quick adoption. Methods2Test [46] is a dataset with pairs of MUTs and test cases intended for training and intrinsic evaluation. We improve upon Methods2Test by including complete real-world projects, enabling execution-based evaluation and open-ended context. In contrast, Methods2Test only contains pairs of MUTs and test cases, and does not provide commit hashes or scraping dates. Therefore, it does not allow execution-based evaluation and provides fixed, limited contextual information. However, we build upon Methods2Test in two main ways. First, we use their list of repositories as a starting point, which assures that all our repositories are using licenses that allow redistribution and that they are not forks. Second, we re-use their heuristic used to map test cases to focal methods based on matching file paths, method names, and method calls. Table 3.1 summarizes the comparison with the aforementioned existing benchmarks and datasets. The remainder of this chapter describes both the creation of the dataset and the process of obtain- ing coverage metrics for generated test cases. 
We start with a set of candidate repositories and filter it according to our requirements, as illustrated in section 3.1. Next, we describe the dataset split in section 3.2 and present the final dataset in section 3.3. Afterward, we elaborate on the process of inserting generated test cases into the repository and getting coverage metrics in section 3.4. Finally, we introduce our evaluation metrics in section 3.5 and present the coverage scores for the ground truth test cases in section 3.6. 3.1 Filtering Repositories The overall goal of the filters is to ensure that all repositories meet the requirements described above. We start with a list of candidates, which contains the 9410 repositories from Methods2Test as well as 9 additional repositories that were used by Bareiß et al. [6] to evaluate their approach to test case generation. The filters are executed in two main steps. First, the repositories are filtered based on their metadata, which we query from the GitHub API2 (steps 1-5). Afterward, the repositories are cloned and filtered, e.g., based on the number of mappable test cases and whether they are executable (steps 6-9). Table 3.2 summarizes the filtering process. We apply the 9 filters described below, after which 63 out of the 9419 candidate repositories remain. In the following, all quantitative results and plots refer to the state of the dataset at the respective stages in the process, not to the final dataset. Section 3.3 presents further information about the final dataset. 2https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28 https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28 3. Tests4J Benchmark 11 Filter Name Before After Relative Reduction 1 G it H u b A P I Data Leakage 9419 6428 -31.8% 2 Existence 6428 6277 -2.3% 3 Repository Size 6277 6163 -1.8% 4 Maven Usage 6163 3272 -46.9% 5 Last Commit Date 3272 1198 -63.4% 6 C lo n ed Number of Tokens 1198 1193 -0.4% 7 Number of Test Cases 1193 305 -74.4% 8 Compilation 305 152 -50.2% 9 Execution 152 63 -58.6% Table 3.2: Number of Repositories before and after the respective filtering steps 1. Data Leakage. To avoid data leakage, we first remove all candidate repositories that appear in the pretraining corpus of two LMs that were considered for this work: Polycoder [52] and GPT-Neo [7]. This eliminates almost a third of the repositories. 2. Existence. We ensure that the repository still exists by calling the GitHub API, eliminating about 2.3% of the candidates. 3. Repository Size. We find that there is a small number of very large repositories that take a long time to clone and compile, slowing down the process of scraping as well as coverage evaluation. Thus, with a threshold of 300MB, we reduce the total size of candidate repositories by over 60% while reducing the number of candidates by only 1.8%. Figure 3.1 shows a histogram of the repository sizes. Note that the size returned by GitHub’s API corresponds to the size that is downloaded during cloning, i.e., a compressed version of the repository, including its history. The history is not as relevant for our purposes, but we see this size as a proxy for both disk usage and compile time. Figure 3.1: Histogram of the Repository Sizes. The threshold of 300MB is shown as dashed line. 23 outliers with a mean size of 1.46GB are not depicted. 4. Maven Usage. Compiling and executing repositories automatically requires all dependencies to be resolvable by a package manager. To simplify subsequent steps, we restrict our dataset 12 3. 
Tests4J Benchmark to repositories that use Maven3 as package manager. To determine if a repository uses Maven, we check if there is a pom.xml file at the repository root, filtering out 46.9% of the repositories. 5. Last Commit Date. As a proxy for the quality of the test cases, we filter out stale repos- itories. Specifically, we consider a repository as stale if there were no commits on the main branch for the past two years. The threshold of two years was chosen intuitively because the distribution in Figure 3.2 did not reveal any distinct values that seemed particularly benefi- cial (note the long tail in the histogram and the almost linear decay in the cumulative sum). If mere maximization of the dataset size is desired in future work, increasing the threshold could a viable option. Figure 3.2: Histogram of last commit dates on the left. Number of remaining repositories for di↵erent threshold values on the right. Both include the chosen threshold of two years as a dashed vertical line. 6. Number of Tokens. After these first five filtering steps based on metadata, we clone the 1198 remaining repositories and use the mapping heuristic from Methods2Test [46] on them to get the pairs of MUT and test case. Since the dataset is meant for training and evaluation of language models, we restrict the combined length of MUT and test case to be at most 1850 tokens. This leaves 198 tokens for continuous prompts to the hard limit of 2048. For tokenization, we use the tokenizer of Polycoder. Note that in this step, we filter out individual test cases rather than complete repositories, unless all test cases exceed the token limit (0.4% of repositories). Figure 3.3 shows the distribution of the number of tokens, 96.5% of samples are below the threshold of 1850. 7. Number of Test Cases. We filter out any repository that has less than 80 mapped test cases in order to keep the number of repositories and the associated cost for compiling and executing them manageable. On the other hand, we want a diverse set of test cases that is not dominated by few repositories. Therefore, if more than 1000 test cases are mapped, we sample 1000 random ones. We find that there are many repositories with very few test cases, e.g., 8.6% contain only a single mappable test case and 74.4% contain less than 80 mappable test cases. Figure 3.4 displays the histogram of the number of test cases, as well as the cumulative sum. 3https://maven.apache.org/ https://maven.apache.org/ 3. Tests4J Benchmark 13 Figure 3.3: Histogram of the number of tokens of all mapped test cases. The threshold of 1850 is depicted as a dashed line. For readability purposes, 1,512 outliers with a mean of 4,816 tokens and a max of 36,862 tokens are not depicted in this figure. Figure 3.4: Left: Histogram of number of test cases per repository. The dashed line shows the lower threshold of 80, and the dot-dashed line represents the upper limit of 1000, above which we sample. Right: Plot of remaining total test cases for di↵erent lower thresholds, with the upper limit of 1000 already considered. 8. Compilation. We require all repositories to be compilable with maven, i.e. all dependencies must be specified in the pom.xml files. We first detect the Java version from the root pom.xml. We support Java 8, 11, 17 and 19, among which we select the closest most recent one. We run the compilation with mvn clean compile and keep the repository if the command succeeds. We find that roughly half of the repositories compile. 
We save the logs but don’t further investigate reasons for compilation errors. Finding patterns in errors and adjusting the environment accordingly might yield more compilable repositories in future work. 9. Execution. Similarly, we require all repositories, or more specifically their test suites, to be executable in our environment. We select the Java version as described above, run mvn clean test and keep a repository if the command succeeds, eliminating 58.6% of the repositories. Similar to the compilation, we save the logs and leave further investigation to future work. 14 3. Tests4J Benchmark 3.2 Dataset Split After filtering the repository candidates, we split them into train, validation and test splits. These splits are created based on whole repositories, i.e., samples from the same repository will all be in the same split. This ensures that the benchmark measures generalization and avoids data leakage through duplicate or near duplicate code from the same repository [3]. We aim for a split where 70% of the test cases are in train, 10% in validation and 20% in test. Since the repositories contain di↵erent numbers of test cases, creating the split is not as trivial as assigning 70% of the repositories to train and so on. We therefore use the following procedure to create the split randomly, while also getting as close as possible to the target number of test cases. First, we calculate the target number of test cases per split. Next, we iterate over the repositories and randomly assign every repository to one of the splits that can include this repository without exceeding their target number of test cases. Afterward, there are a number of repositories left over. These leftover repositories are sequentially assigned to the split that is still farthest away from their target. Using this procedure, we obtain the distribution depicted in Table 3.3. Number of Repositories Number of Test Cases Train 41 (65.08%) 8733 (69.93%) Validation 9 (14.29%) 1285 (10.29%) Test 13 (20.63%) 2471 (19.79%) Total 63 (100.0%) 12489 (100.0%) Table 3.3: Number of Repositories and Test Cases per Dataset Split. Note that these numbers include the validation repository and 2 test repositories that were later removed due to incompat- ibility with the coverage tool stack. 3.3 Final Dataset The final dataset consists of 63 repositories with ⇠12k mapped test cases corresponding to ⇠6k MUTs. Out of the 63 repositories, we remove 3 due to incompatibility with the coverage tool stack, as described in subsection 3.4.7. Table 3.4 displays more detailed statistics. We can observe that most MUTs are tested by only few test cases (1.87 on average). Most MUTs are rather short, for instance, 25% are shorter than 39 tokens, indicating low complexity. To confirm this, we further investigated the MUT complexity and found that 18.4% are getter methods, 2.7% are setter methods and the average cyclomatic complexity score [34] is 2.29. The dataset contains repositories using both Java 8 (43/63 or 68.25%) and Java 11 (20/63 or 31.75%). Moreover, the repositories use multiple test frameworks: JUnit4 (32/63 or 50.79%), JUnit5 (24/63 or 38.10%) and TestNG (7/63 or 11.11%). 3. 
Tests4J Benchmark 15 Avg Max Min 25% 50% 75% Total Number of Test Cases per Repository 198.24 1000 80 102.50 146.00 244.00 12,489 Number of MUTs per Repository 106.03 672 7 57.00 78.00 129.50 6,680 Number of Test Cases per MUT 1.87 86 1 1.00 1.00 2.00 12,489 Number of Tokens per MUT 153.73 1681 5 39.00 80.00 182.25 1,026,888 Number of Tokens per Test Case 194.05 1668 13 79.00 136.00 235.00 2,423,516 Table 3.4: Aggregate Dataset Statistics. Token counts correspond to the PolyCoder tokenizer. 25%, 50% and 75% correspond to the respective percentiles. 3.4 Coverage Pipeline To get coverage metrics for test cases generated by a model in a convenient way, we create a coverage pipeline that filters for executable test cases, inserts them into the repository and executes them. The main challenge is that model-generated test cases can be arbitrary, i.e., we do not have any guarantee of parsability, compilability or executability. Therefore, the pipeline sequentially filters out faulty test cases in multiple steps, such that only executable test cases remain in the end. Figure 3.5 gives an overview of the pipeline. In the following, we describe these steps and their respective challenges in more detail. Figure 3.5: Overview of the Coverage Pipeline 3.4.1 Post-processing and Parsing The first step is to apply a lightweight post-processing and filter for code that is parsable. The main case that the post-processing covers is incompleteness: the LM might not have completed the test case at the token limit or be stuck in an infinite loop. We first truncate everything after the closing outer parenthesis and attempt to parse the result with javalang4 using the parse member declaration method. We assert that the parsed tree represents a method declaration, rather than a class declaration for example. If parsing fails, we truncate back to the last complete statement, indicated by the last semicolon. Afterward, we deduplicate statements at the end of the method to fix infinite loop cases and close all open parentheses. Finally, we attempt to parse again and discard the failing samples. Since the goal of this work is to evaluate the proposed modelling technique, we keep the 4https://github.com/c2nes/javalang https://github.com/c2nes/javalang 16 3. Tests4J Benchmark post-processing rather simple, but more sophisticated repair techniques could be incorporated in future work. After repairing and parsing test cases, we modify its annotations. First, we remove potential @Ignore and @Disabled annotations to ensure that the test case will be executed later on. Second, we configure an execution timeout of 10 seconds per test case. To that end, we detect the test frame- work (JUnit4, JUnit5 and TestNG) based on the import statements in the test class and modify the annotations to include the respective timeout settings. In particular, we add @Test(timeout=10000) for JUnit4, @Test(timeOut=10000) for TestNG and @Timeout(10)@Test for JUnit 5. 3.4.2 Repository Configuration Before attempting to compile the parsable test cases, we prepare the repositories to create a stan- dardized environment. For that purpose, we first copy the repository to a temporary directory and perform all modifications there. That way, the original repository state remains intact for future changes of the coverage pipeline. This might be omitted in the future to avoid the slight overhead of creating the copy and modifying the configuration on every execution of the pipeline. 
In Maven, arbitrary plugin executions can be hooked into the test phase triggered by calling mvn test. The purposes of such executions are manifold. Some plugins are required to build the project, while others are mainly supporting developers with analyses or deployments. In the con- text of our benchmark, we want to limit the execution to those plugins necessary for compilation to accelerate the process and avoid failures due to errors unrelated to our use case. Through man- ual investigation, we find maven-compiler-plugin, jaxb2-maven-plugin, build-helper-maven-plugin, antlr4-maven-plugin, maven-jar-plugin, protoc-jar-maven-plugin, maven-bundle-plugin, maven-shade-plugin and maven-install-plugin to be the essential plugins in our selection of repos- itories. They either configure the compilation and packaging process or generate code. We keep these plugins and remove all others from all pom.xml configuration files. To create a uniform environment that allows automatic coverage evaluation for di↵erent reposi- tories, and to enable coverage analyses, we programmatically modify the project configuration. The first challenge is to find the pom.xml file that is a common ancestor to all sub-configurations, such that the modifications take e↵ect for all submodules. While very common, the root configuration does not have to be the pom.xml file at the root directory of the project. Therefore, we traverse the tree created by the parent relations between configurations until we arrive at the root. This does also not necessarily yield a single configuration file, i.e., the relations can sometimes constitute multiple unconnected trees. In our selection of repositories, this only happens when artifact or re- source directories also contain pom.xml files, which do not need the common configuration options. Consequently, we select the pom.xml file with the most children as a heuristic, which works for all our repositories. After selecting the root configuration that influences all sub-configurations, we make the following modifications. 1. When filtering for compilable test cases among a large number of generated test cases, it is essential to receive as many compilation errors as possible, such that the faulty test cases can be removed. To achieve that, we configure the maven-compiler-plugin not to stop the 3. Tests4J Benchmark 17 compilation process when errors occur and increase the maximum number of displayed errors (see appendix A.1.1). Note that this plugin is also among the essential plugins we identified. Therefore, we merge the original settings with these new ones. 2. Next, we configure the maven-surefire-plugin, which controls the test execution. Similar to the compilation step, we instruct the plugin to keep running even when errors occur, to get as much feedback as possible in one run (see appendix A.1.2). 3. Lastly, we add the configuration for the jacoco-maven-plugin (see appendix A.1.3). It is used to measure the code coverage. We chose JaCoCo5 over alternatives because it is the most mature tool that supports most Java versions. 3.4.3 Sca↵olding We view the test case generation task on the method level, yet methods alone cannot be compiled, but need a test class. A complete tool would generate a sca↵olding, including all imports and a test class. While not impossible, generating this automatically is not trivial for many reasons, e.g., it is not always clear from which package a name should be imported. 
Since this work focuses on the method level, we instead use the test classes of the ground truth test cases as sca↵olding. In all cases except one, there is only a single test class for every MUT in the dataset. 3.4.4 Compilation The goal of this step in the pipeline is to identify test cases that are not compilable. To that end, the compiler was configured to continue compiling on errors and to output all errors. Ideally, the compiler would identify all compilation errors in one run. In practice, this is not realistic due to masking e↵ects between errors, i.e., some errors only occur when others are resolved. Therefore, we iteratively compile the project and remove faulty test cases. Specifically, we compile using JAVA HOME=/path/to/java/version mvn clean test-compile, where the appropriate java version is detected from the root configuration. Next, we parse the logs produced by the compiler using regular expressions. Unfortunately, the errors are not always formatted consistently. The three regular expressions in appendix A.1.4 cover all variants that we experienced during the experiments and extract the file name, line number of the error in that file and the error message. We then parse the test classes into their Abstract Syntax Trees (ASTs), find all test cases that span across at least one of the error lines, and remove these test cases from their test class. We iterate until no further errors are detected by the regular expressions, such that all remaining test cases are compilable. We save both compilation logs and error messages per test case for post-hoc analyses. 3.4.5 Execution Similar to the compilation step, one goal of executing the test cases is finding those that run successfully. We again construct regular expressions to parse the logs, extract the files and names of the failing test cases, and remove them from their test classes. Unlike during compilation, 5https://github.com/jacoco/jacoco https://github.com/jacoco/jacoco 18 3. Tests4J Benchmark removing failing test cases would not be strictly necessary to remove failing test cases. However, parsing the logs yields interesting information about how many and which test cases were ultimately runnable. At the same time, by removing the failing test cases, we make sure that only passing tests contribute to the final coverage metrics. We save all error information, e.g., enabling subsequent analyses of assertion errors. 3.4.6 Report Parsing The execution of mvn test produces the JaCoCo reports as artifacts. We first parse the XML reports and get the coverage information for all methods. We then retrieve the coverage for all MUTs by matching their signature to the JVM signatures in the JaCoCo report. To that end, we leverage the org.jacoco.report.JavaNames class to convert the JVM signatures to a more easily readable and parsable format. Ultimately, we extract both line and branch coverage per MUT and aggregate. 3.4.7 Pipeline Validation and Manual Corrections To validate the correctness of the pipeline, we run it in two settings, where we know the expected outcome. First, we implement a dry run where the pipeline is executed without inserting any test cases into the repositories. Therefore, in that setting, we eliminate all steps that are concerned with processing the test cases as sources of error. We ensure that all dependencies are available and make sure that the project is compilable without non-essential plugins, and therefore validate that our list of essential plugins is su�cient. 
Furthermore, we check that no test cases are executed, as all test cases should have been removed from the repository during the scraping process. Second, we further leverage the ground truth test cases for validation purposes. In other words, we re-insert the mapped test cases into their test classes and attempt to measure their coverage. To that end, we first transform the mapped test cases from the dataset format into the prediction format, such that the whole pipeline is executed as if the test cases were model-generated. Ideally, we expect all test cases to be parsable, compilable, and runnable. We find that 99% of test cases are executable, validating the pre-processing, compilation and execution procedure. We further find that 98% of those also test the correct MUT, validating the mapping heuristic. We analyze the few test cases that are not parsable, compilable or executable and find three patterns of edge cases. First, some test cases depend on the side e↵ects of other test cases, like setting attributes in shared objects. These test cases might fail in our setup because not all test cases of original test classes are mapped during scraping and because the execution order can di↵er. Second, some test cases exceed the 10 seconds timeout. Lastly, some test cases fail with various error types after their @Disabled or @Ignore annotations were removed. This is likely because these test cases are outdated with respect to their MUT. We conclude that the pipeline works correctly and that the few edge cases are not due to a bug in the pipeline. A high degree of automation was one of the main goals throughout the benchmark creation, to manage and support the large scale of this study. At the same time, it makes the dataset extensible for future work. In that sense, the scraping process is fully automated, and a larger dataset can be 3. Tests4J Benchmark 19 created by simply adding more candidate repositories. However, when adding a new dataset to the pipeline, one should manually validate that the repository works in the pipeline environment. For instance, one should ensure that there are no dependency conflicts between the repository and the pipeline tooling. We believe that the two validation approaches discussed above are also excellent test cases to do that. For the 22 repositories in the validation and test set, we perform this manual validation. In the following, we list all manual changes we made to these repositories, that were identified in the process: 1. intuit/CloudRaider uses powermock6 in their test classes to create mock objects. Unfortu- nately, powermock uses on-the-fly code instrumentation and is therefore incompatible7 with JaCoCo’s on-the-fly instrumentation. For that reason, we exclude this repository from any further experiments. We further exclude Flipkart/foxtrot because it depends on running database containers and SonarSource-orchestrator because its build process is incompatible with JaCoCo. 2. The configuration of bazaarvoice/ostrich has a parent configuration outside of the repository itself, that overrides the argLine property of surefire. This override breaks JaCoCo, so we remove the dependency to that parent configuration. 3. In eclipse/winery, we find that the frontends submodule has additional dependencies to NodeJS and takes a very long time to compile. At the same time, it does not contain any mapped test case and no other submodule depends on it. 
We therefore remove this submodule from the build by removing the corresponding entry in the root configuration.

4. JUnit5 did not support timeouts until version 5.5. Therefore, we upgrade any older version to 5.5.1. This is the case for Domo42/saga-lib and flipkart-incubator/Lyrics. We further upgrade JUnit4 from 4.11 to 4.12 in Domo42/saga-lib in order to make it compatible with the junit-vintage-engine (which was used in the repository all along) and therefore to ensure that all test cases are executed.

5. During scraping, we removed all test cases from the repositories, such that only the inserted test cases would contribute to the coverage. The script found and removed test cases annotated with @Test and @ParameterizedTest, but not those that use the JUnit3 style of declaring a class as a test class by extending junit.framework.TestCase. This occurred only a single time and was corrected manually, but could be automated in the script in a future iteration.

3.5 Evaluation Metrics

For evaluation, Tests4J first computes BLEU and crystalBLEU scores as intrinsic metrics. These metrics compare generated test cases to correct, human-written test cases for the same MUT. In particular, they compute scores based on n-gram overlaps. CrystalBLEU ignores trivial n-grams in that process and thus measures code similarity more accurately. Both can give an indication of the quality of a test case, but they do not capture its behavior during execution. Therefore, the pipeline computes several further metrics, including four coverage metrics: MiLC, MaLC, MiBC and MaBC. They correspond to the micro (Mi) and macro (Ma) averages of the line coverage (LC) and branch coverage (BC). The micro average is the proportion of the total lines or branches of the MUTs that were covered by the tests, whereas the macro average is computed per MUT and then averaged. Therefore, MiLC and MiBC account for the fact that MUTs vary in complexity and weigh methods with many lines or branches higher than those with few. Consequently, we use them as the main coverage metrics. We additionally report MaLC and MaBC because, in comparison with their micro average counterparts, they indicate whether the test suite covers mostly small or large MUTs. Besides coverage, we further report five count metrics that quantify how many test cases remain at certain stages of the pipeline:

1. Unique: Number of test cases after de-duplication.
2. Parsable: Number of test cases that comply with the Java grammar, absolute and relative to Unique.
3. Compilable: Number of test cases that were compiled in their scaffolding without error, absolute and relative to Unique.
4. Executable: Number of test cases that were executed without errors, absolute and relative to Unique.
5. Correct: Number of test cases that were executed without errors and called the correct MUT, absolute and relative to Unique. This metric is adopted from Tufano et al. [47] and is directly comparable, albeit computed on different repositories.

An example of all metrics can be found in Table 3.5, which displays the results of the ground truth test cases.
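To make the difference between the micro and macro averages concrete, the following minimal sketch computes MiLC and MaLC from per-MUT line counts; the numbers are purely illustrative.

def micro_macro_line_coverage(per_mut):
    """per_mut: list of (covered_lines, total_lines) tuples, one entry per MUT."""
    covered = sum(c for c, _ in per_mut)
    total = sum(t for _, t in per_mut)
    milc = covered / total                                   # micro: pooled over all lines
    malc = sum(c / t for c, t in per_mut) / len(per_mut)     # macro: mean of per-MUT coverage
    return milc, malc

# A small, fully covered MUT and a large, barely covered MUT:
# the micro average is dominated by the large MUT, the macro average is not.
print(micro_macro_line_coverage([(2, 2), (5, 50)]))  # (0.1346..., 0.55)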
3.6 Ground Truth Results

Table 3.5 displays the results of inserting the mapped test cases (called "ground truth test cases") back into their respective test classes and evaluating their coverage with the pipeline described above. Besides being used for the pipeline validation as discussed in subsection 3.4.7, the results yield some further insights regarding the dataset. For instance, the ground truth test cases achieve very high coverage, even though there are only 1.87 test cases per MUT on average, with an average length of 194 tokens. We conclude that the MUTs are usually testable with relatively few and short test cases, i.e., there usually is a good and simple solution. We also observe that for both line and branch coverage, the micro average is higher than the macro average. Since the micro average is equivalent to a macro average weighted by total lines, this indicates that the ground truth tests were more successful in covering complex MUTs than simple ones. This is opposed to test cases generated by deep learning-based approaches, where typically the macro average is higher. Further investigations are necessary to find the reasons for this opposing trend. The BLEU scores are obviously very high because we compare the ground truth test cases to themselves, but they are not 1 because of the preprocessing.

Ground Truth — BLEU: 0.9455, crystalBLEU: 0.9369

Repository | MiLC | MaLC | MiBC | MaBC | Unique | Parsable | Compilable | Executable | Correct
criteo-garmadon | 83.54% | 86.65% | 76.28% | 85.59% | 109 | 109 (100.00%) | 107 (98.17%) | 106 (97.25%) | 102 (93.58%)
bazaarvoice-ostrich | 11.50% | 11.27% | 10.32% | 10.92% | 158 | 157 (99.37%) | 157 (99.37%) | 157 (99.37%) | 153 (96.84%)
adorsys-XS2A-Sandbox | 93.98% | 96.03% | 85.51% | 92.20% | 180 | 180 (100.00%) | 180 (100.00%) | 178 (98.89%) | 178 (98.89%)
flipkart-incubator-Lyrics | 97.30% | 99.02% | 81.75% | 93.89% | 102 | 102 (100.00%) | 102 (100.00%) | 102 (100.00%) | 102 (100.00%)
seedstack-seed | 84.43% | 89.00% | 74.61% | 84.30% | 97 | 97 (100.00%) | 97 (100.00%) | 97 (100.00%) | 86 (88.66%)
intuit-QuickBooks-V3-Java-SDK | 70.19% | 75.29% | 59.55% | 75.53% | 238 | 238 (100.00%) | 238 (100.00%) | 237 (99.58%) | 237 (99.58%)
gooddata-gooddata-java | 77.44% | 87.48% | 83.58% | 88.34% | 377 | 375 (99.47%) | 375 (99.47%) | 375 (99.47%) | 374 (99.20%)
Flipkart-foxtrot | - | - | - | - | - | - | - | - | -
shroffk-phoebus | 74.98% | 87.02% | 67.39% | 81.06% | 285 | 285 (100.00%) | 279 (97.89%) | 277 (97.19%) | 274 (96.14%)
messaginghub-pooled-jms | 87.86% | 93.42% | 93.87% | 97.72% | 239 | 239 (100.00%) | 239 (100.00%) | 239 (100.00%) | 239 (100.00%)
nhl-dflib | 84.64% | 77.94% | 73.71% | 72.18% | 134 | 134 (100.00%) | 134 (100.00%) | 134 (100.00%) | 129 (96.27%)
monarch-initiative-phenol | 86.68% | 86.32% | 76.92% | 82.05% | 120 | 120 (100.00%) | 120 (100.00%) | 120 (100.00%) | 116 (96.67%)
SonarSource-orchestrator | - | - | - | - | - | - | - | - | -
Total | 78.32% | 68.42% | 71.96% | 66.44% | 2039 | 2036 (99.85%) | 2028 (99.46%) | 2022 (99.17%) | 1990 (97.60%)

Table 3.5: Evaluation results of re-inserting the ground truth test cases back into their test classes.

4 Context-Aware Prompt Tuning

In this chapter, we propose a novel method we call Context-Aware Prompt Tuning. It is fundamentally motivated by the need for more contextual information, as well as the limited context length and quadratic complexity of transformer language models. We further discuss this motivation in section 4.1 and present Context-Aware Prompt Tuning in section 4.2.

4.1 Motivation

One of the fundamental premises of this work is that contextual information is important for test case generation.
In the most basic setup of neural test case generation, the model translates from a MUT to a test case. However, we argue that this task setup is ill-posed because the input is missing important information, without which not even a human expert developer could write correct test cases. For instance, writing tests requires knowledge about the available classes, methods, fields and libraries from the current project. Whereas it is likely that models are able to use the APIs of popular libraries due to their usage in the pre-training corpus, it is unlikely, if not impossible, that the model can infer the APIs of the local project. Fine-tuning the model on a specific project [8] and saving this local knowledge in the model parameters is possible, but not feasible for many evolving projects. Instead, contextual information can be passed to the model in the prompt. The model is then not trained on a specific project, but simply trained to leverage the contextual information given in the prompt. Tufano et al. [47] investigated the effect of additional context for the test case generation task and were able to reduce the loss by almost 10%. Similarly, context from the current project has been shown to significantly improve performance on other code-related tasks as well [44, 53].

Specific to our benchmark is that contextual information is required not only from the focal class and potentially other classes in the main code, but also from the test class. The evaluation pipeline directly inserts generated test cases into a fixed scaffolding. This scaffolding already contains the imports and the test class declaration. The test class often also contains declared fields, setup code and helper methods. Generating correct test cases that fit into this test class requires knowledge of the test class; otherwise, the available names and signatures would have to be inferred or hallucinated by the model.

To summarize, in this task setup, a model has to process at least the MUT, the focal class as well as the test class, and infer a correct test case from that information. With the token limit of 2048 in PolyCoder, this is not possible: the complete information would require more than 7000 tokens on average. To mitigate the issue, prior work [47] has suggested omitting method bodies and only including their names and signatures, with the intuition that this already provides the most important information on how to use them. Whereas this approach reduces the total number of tokens to about 1900 on average, more than 16% of the samples would still exceed the hard limit of 2048 tokens. Furthermore, maxing out the token limit might not be desirable because of the quadratic memory and compute complexity of transformers with respect to the sequence length. We illustrate this in a toy example and measure the memory consumption of training the smallest PolyCoder model with just 160M parameters in 16-bit precision with a batch size of 4. Figure 4.1 presents the results of that experiment. Even in this minimalistic setting, a 32GB GPU is maxed out when approaching the maximum sequence length.

Figure 4.1: GPU memory consumption during training of PolyCoder-160M in FP16 with a batch size of 4 and varying sequence lengths.
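Such a measurement can be reproduced with a short script; a rough sketch is shown below. The checkpoint name and the chosen sequence lengths are assumptions (any causal LM checkpoint works), and the peak memory reported by PyTorch's allocator will differ slightly from nvidia-smi readings.

import torch
from transformers import AutoModelForCausalLM

# Assumed Hugging Face checkpoint name for the smallest PolyCoder model.
model = AutoModelForCausalLM.from_pretrained(
    "NinedayWang/PolyCoder-160M", torch_dtype=torch.float16
).cuda()
model.train()

for seq_len in [256, 512, 1024, 1536, 2048]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    input_ids = torch.randint(0, model.config.vocab_size, (4, seq_len), device="cuda")
    # Forward and backward pass to account for activations and gradients.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"sequence length {seq_len}: peak memory {peak_gib:.1f} GiB")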
4.2 Method

Motivated by the issues discussed above, we propose Context-Aware Prompt Tuning (CAPT), a novel method that allows the model to ingest more context without quadratic scaling of the memory requirements. The idea is fundamentally based on soft prompt tuning, where the model learns continuous embeddings representing soft tokens that steer a frozen pre-trained LM towards performing a specific task. In CAPT, we aim to learn soft tokens that contain particular context information and thus steer the model towards generating test cases that better fit that context. To that end, we use small neural networks that first summarize and compress the context information into one or multiple soft tokens. We call these networks Prompt Generators (PGs). They take a piece of context as input and output an embedding that is then injected into the LM. Unlike in other methods [44], the output of the Prompt Generator (PG) is continuous, which allows joint end-to-end training; that is, we backpropagate the gradients from the language modelling loss through the language model to the PGs. Figure 4.2 presents an overview of the approach.

Figure 4.2: Context-Aware Prompt Tuning with test class and focal class context encoded as soft tokens with an all-layer injection strategy. The left-most prefix is a vanilla (instance-invariant) task prefix.

4.2.1 Context Types

The proposed method of encoding context with a PG into an embedding allows for flexible context modalities. Much like in [24], the context does not need to be of a textual nature, but could be images or videos. For test case generation, code is arguably one of the most important modalities. However, others could be beneficial as well. For instance, class hierarchies and call graphs could be encoded with graph neural networks to provide valuable information about the overall structure of the project, and natural language descriptions from documentation and Javadoc could provide low-level semantic information about classes and methods. In this work, we focus on code as the modality and investigate the possible compression rather than multi-modality.

As explained in section 4.1, the focal class and the test class are the two most important sources of information. Following previous work [47], we omit method bodies and try to include only the most relevant information:

1. Imports are relevant for the test class context because they introduce names that are available in the test case to generate. They are not relevant for the focal class because the model does not need to generate a method in its namespace.

2. The class name is relevant for multiple reasons. In the focal class, it provides semantic clues about the class functionality and gives the model partial information on how to instantiate an object of that class. In the test class, it mainly contributes to a coherent context, as providing syntactically incorrect code might confuse the model.

3. The method signatures are equally important for focal classes and test classes. They give semantic clues about the class and inform the model which methods are available and how to use them. To maintain syntactic correctness, we do not simply remove the method bodies but replace them with an empty block '{}'. In focal classes, we only include public methods. In test classes, we do not remove the body of the setup method marked by the '@Before' annotation. We argue that knowing the state of the objects instantiated in these setup methods is very important, especially for generating correct assertions, and that the setup code can provide this information to a high degree.

4. Lastly, the fields also provide both semantic clues and information on how to use the focal class. In the test class, they inform the model about the declared names and their types. As with methods, we only include public fields for the focal class context.

To optimally leverage knowledge from pretraining, we format this context in a way that should look as natural as possible to the model. We provide further details on the formatting and truncation of context in the experimental setup in subsection 5.1.1.
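As an illustration, the condensed focal class representation could be constructed along the lines of the following sketch. The parser used for this step is not specified here; the sketch assumes the third-party javalang library and a single top-level class per file, and drops details such as generic type arguments and array dimensions.

import javalang  # assumption: any Java parser works; javalang is used only for illustration

def condense_focal_class(source: str) -> str:
    """Build the simplified focal class context: class name, constructor and public
    method signatures with empty bodies, and public fields."""
    tree = javalang.parse.parse(source)
    cls = tree.types[0]  # assumes a single top-level class
    lines = [f"public class {cls.name} {{"]
    for ctor in cls.constructors:
        params = ", ".join(f"{p.type.name} {p.name}" for p in ctor.parameters)
        lines.append(f"    public {cls.name}({params}) {{}}")
    for method in cls.methods:
        if "public" not in method.modifiers:
            continue
        ret = method.return_type.name if method.return_type else "void"
        params = ", ".join(f"{p.type.name} {p.name}" for p in method.parameters)
        lines.append(f"    public {ret} {method.name}({params}) {{}}")
    for field in cls.fields:
        if "public" not in field.modifiers:
            continue
        for declarator in field.declarators:
            lines.append(f"    public {field.type.name} {declarator.name};")
    lines.append("}")
    return "\n".join(lines)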
4.2.2 Prompt Generator Architectures

The prompt generators (PGs) are the fundamental building blocks of CAPT. They compress contextual information in order to steer a LM. In this section, we first sketch the design space for PGs and afterward derive 6 concrete architectures from that space. Every prompt generator takes a context as input and outputs one or multiple soft tokens. In the following, we list the design decisions that we considered when designing prompt generators.

• A very fundamental decision is how to represent the context. Various representations like graph-based representations are possible, but for the sake of simplicity, we focus on token-based representations in this work.

• Next, the tokens need to be embedded. Here, we focus on lightweight approaches based on pre-trained models. Using the main LM itself to create context embeddings would likely yield more powerful representations, but it would also cause significant compute and memory overhead to perform multiple forward passes and thus defeat the purpose of CAPT. Instead, we consider two approaches: using the input embeddings of PolyCoder and using CodeBERT.

• After obtaining an embedding for every token, these embeddings need to be projected into the target space. This target space is either the input space of the LM (1024-dimensional) or the space of prefix tokens across all layers (1024 × 24 layers × 2 = 49,152-dimensional). This depends on the way the soft tokens are injected into the LM: either only at the input layer, as in Prompt Tuning [29], or at every layer, as in Prefix Tuning [31] and P-Tuning v2 [33]. We find that both approaches work similarly well. Since, at the time of implementation, passing input embeddings to a LM was not supported during generation (support was only recently added in transformers v4.27, see https://github.com/huggingface/transformers/issues/6535), we focus on injecting soft tokens at every layer; a sketch of this injection mechanism follows after this list.

The concrete implementation of how to project token embeddings into the target space also requires a few further architectural decisions. In particular, we perform two main steps: aggregation and projection.

• During aggregation, multiple token embeddings are combined into a fixed-size representation. That is, the aggregation step performs the actual compression of the technique. Overall, we consider four aggregation approaches: CLS token, sum, LSTM and MLP. Whereas CLS token, sum and LSTM can aggregate variable-length sequences into a fixed-size representation, an MLP requires a fixed-length input.

• During projection, the fixed-size representation from the aggregation step is projected into the target space. We consider a simple linear layer, as well as a 2-layer MLP with a bottleneck. In the latter, the first layer projects the aggregated embedding down to an even smaller embedding, followed by a non-linear activation function, before the second layer projects it up into the target space. Bottlenecks are also used in regular prefix tuning, and we can confirm in preliminary experiments that bottlenecks improve performance slightly. We therefore use bottlenecks in all setups.

• Lastly, an important consideration is how to scale the number of soft tokens that the prompt generator outputs. We see the concept of compression at the center of CAPT. Thus, we believe being able to flexibly change the rate of compression is crucial. The amount of information that can be stored in an embedding is limited, forming an information bottleneck. We implement two main ways of scaling. First, we use the same aggregated context representation for all tokens, but use separate projections. The major drawback is that all soft tokens are based on the same information, and thus scaling the number of soft tokens will likely not increase the performance much. Motivated by this reasoning, we also propose to use chunked aggregation, where the token embeddings are chunked before aggregation. Then each aggregated chunk embedding is projected into the target space. Thereby, every soft token is based on the information from different parts of the context.
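The all-layer injection can be realized on top of the key-value cache interface of the LM. The sketch below illustrates the mechanics for a GPT-NeoX-style model with 24 layers, 16 heads and a head dimension of 64 (matching the 49,152-dimensional prefix space); the tuple-based past_key_values format is an assumption about the transformers version and is not taken from this work.

import torch

def forward_with_prefix(model, input_ids, prefix, n_layers=24, n_heads=16, head_dim=64):
    """prefix: (batch, k, n_layers * 2 * n_heads * head_dim), produced by a prompt generator."""
    batch, k, _ = prefix.shape
    # Reshape into one (key, value) pair per layer with shape (batch, n_heads, k, head_dim).
    prefix = prefix.view(batch, k, n_layers, 2, n_heads, head_dim)
    prefix = prefix.permute(2, 3, 0, 4, 1, 5)  # (layers, 2, batch, heads, k, head_dim)
    past_key_values = tuple(
        (layer[0].contiguous(), layer[1].contiguous()) for layer in prefix
    )
    # The attention mask must also cover the k virtual prefix tokens.
    attention_mask = torch.ones(batch, k + input_ids.shape[1], device=input_ids.device)
    return model(input_ids=input_ids,
                 attention_mask=attention_mask,
                 past_key_values=past_key_values)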
Based on these general concepts and design considerations, we now present 6 prompt generator architectures. In the architecture visualizations, blue indicates frozen modules, red indicates trainable modules, green boxes represent chunks and the right-most boxes with black borders represent the soft tokens that will be injected into the LM. We use trapezoids to illustrate linear projections that alter the dimensionality. All architectures use tanh as the non-linear activation function in the bottleneck. We indicate models that chunk the context with the suffix C.

1. In BERT, we pass the context through CodeBERT [18] and use the embedding of the CLS token as a holistic representation of the context. This representation is then fed through a projection MLP with a bottleneck to obtain a soft token. To create k soft tokens, the CLS representation is projected with k different MLPs. With a bottleneck dimension of 512, the number of trainable parameters amounts to 110M in CodeBERT and k × (768 × 512 + 512 × 49,152) in the projection modules, totaling ≈ 135M parameters for 1 soft token and ≈ 365M parameters for 10 soft tokens. Therefore, the number of parameters scales linearly with k, making it only a viable option if k is small. Furthermore, all information has to be contained in the CLS token, such that increasing k would likely not result in significantly better results.

2. Contrarily, in BERT C, we divide the sequence of context tokens into fixed-size chunks and concatenate the embeddings of all tokens in each chunk. The chunk size is determined based on the number of soft tokens k as cs = floor(512/k). We project the chunk embeddings into the target space with a bottleneck of 1024. Since each of the k soft tokens is based on different parts of the context, we share the projection module among them. Consequently, the number of trainable parameters is independent of k and amounts to 110M for CodeBERT and cs × 1024 + 1024 × 49,152 ≈ cs × 1024 + 50M. We expect this architecture to have better scaling behavior than BERT because increasing k will make the chunks smaller, thus decreasing the compression rate and alleviating the information bottleneck.

3. In LSTM and all further architectures, we use the input embedding layer of PolyCoder to embed context tokens.
This approach is much more lightweight than CodeBERT and has the potential advantage that it uses the same tokenizer and embedding space as the LM. In this variant without chunking, we process the complete context with a single-layer LSTM and use the last token embedding as a holistic representation. The projection layer is equivalent to the one described for BERT. In that sense, the setup also shares its drawbacks of limited scaling possibilities and a small information bottleneck. The number of trainable parameters is 8 × 1024 × 1024 ≈ 8M for the LSTM and k × (1024 × 512 + 512 × 49,152) ≈ k × 25M for the projection module.

4. In LSTM C, we use the same setup as in LSTM but chunk the context tokens into k chunks, process each chunk with an LSTM separately and use the embedding of the last token of each chunk as the chunk embedding. As with BERT C, the projection module is shared among the k soft tokens and increasing k will alleviate the information bottleneck. The number of trainable parameters is independent of k and totals about 33M.

5. In Sum, we also use PolyCoder's input embeddings but use the sum as a non-parametric approach to aggregation. In this non-chunked variant, the sum of all token embeddings is used as a holistic representation and projected into the target space with the same projection module as in LSTM and BERT. The number of trainable parameters is k × (1024 × 512 + 512 × 49,152) ≈ k × 25M.

6. Lastly, we propose the chunked variant of Sum, Sum C, where the embeddings of all tokens in a chunk are summed to form the chunk embedding. The projection module is the same as in LSTM C and BERT C. The number of trainable parameters is ≈ 25M. A code sketch of this variant is shown below.
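To make the design concrete, the following is a minimal PyTorch sketch of the chunked sum variant (Sum C), using the dimensions stated above (1024-dimensional embeddings, 24 layers, a bottleneck of 512, which yields roughly 25M trainable parameters). It illustrates the idea rather than reproducing the exact implementation.

import torch
import torch.nn as nn

class SumChunkPromptGenerator(nn.Module):
    """Sum C: chunk the context, sum the token embeddings per chunk, and project each
    chunk embedding into the all-layer prefix space with a shared bottleneck MLP."""

    def __init__(self, input_embedding, k=10, d_model=1024, n_layers=24, bottleneck=512):
        super().__init__()
        self.input_embedding = input_embedding   # input embedding of the LM, assumed frozen
        self.k = k                               # number of soft tokens
        target_dim = n_layers * 2 * d_model      # one key and one value vector per layer
        self.projection = nn.Sequential(
            nn.Linear(d_model, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, target_dim),
        )

    def forward(self, context_ids):              # context_ids: (batch, seq_len), seq_len >= k
        emb = self.input_embedding(context_ids)                          # (batch, seq_len, d_model)
        chunks = emb.chunk(self.k, dim=1)                                # k chunks along the sequence
        chunk_emb = torch.stack([c.sum(dim=1) for c in chunks], dim=1)   # (batch, k, d_model)
        return self.projection(chunk_emb)                                # (batch, k, target_dim)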
5 Experiments

In this chapter, we describe the experiments that we performed to investigate five research questions (RQs). We first describe the experimental setup in section 5.1. Afterward, we define the RQs and introduce the experiments we designed to answer each RQ in section 5.2. Finally, we present the results in section 5.3.

5.1 Experimental Setup

In the following subsections, we provide details about pre-processing, training, generation, as well as evaluation metrics.

5.1.1 Pre-processing

During pre-processing, we tokenize the different parts of the input and assemble them. In all experiments, we use a maximum sequence length of 1850, which leaves room for 198 potential soft tokens. We always pad to the longest sequence in each batch and truncate to the maximum sequence length. As elaborated on in subsection 4.2.1, we identified four essential components of focal and test classes: imports, class name, method signatures and fields. During pre-processing, we want to make sure that the condensed context representations look as much as possible like real code. To that end, we replace method bodies with empty blocks '{}' instead of just removing them. We place MUTs inside their focal class and test cases in their test class, and account for indentation. We furthermore argue that truncating the context naively would likely result in incomplete classes and statements, negatively conditioning the LM. We therefore implement a truncation procedure where, if the context is too long, we successively remove parts of the context according to a prioritization without sacrificing syntactic correctness. In particular, we go through the steps listed below, check if the context is short enough, and only continue to the next step if it is not.

We define this prioritization for the scenario of fitting both test class and focal class context into a fixed token budget, which is the case when using both contexts as hard prompt. In other scenarios, we skip the steps that work on the respective other context. When using both contexts, we generally prioritize test class context over focal class context, based on the insights of RQ2.

1. We first remove the fields of the focal class and thus assign them the lowest priority, based on the intuition that fields in the focal class are mostly relevant for the inner workings of the class and that usage is defined via interfaces and methods.

2. Next, we remove the method signatures of the focal class. However, we keep constructor signatures, given their special importance in test cases.

3. When the focal class representation is reduced to its minimum (class name and constructor signatures) and the context still exceeds its token limit, we remove the method signatures of the test class. We argue that, compared to imports and fields, using helper methods is useful but not necessary for test cases.

4. Afterward, we remove the imports of the test class, with the intuition that imports and fields convey similar information in the test class (which names are available) and that imported names are easier to infer than field names.

5. Lastly, we remove the fields of the test class, leaving minimal representations for both focal and test class context. If they still exceed the token limit (which occurs in less than 1% of the cases), we truncate naively.

An example of the formatting and a visualization of this prioritization can be found in Figure 5.1.

5.1.2 Training Procedure

For all neural models, we use the PolyCoder model [52] with 400M parameters. The PolyCoder family of models is publicly available, has good performance and is, unlike CodeGen [35], transparent with respect to the pre-training corpus. The variant with 400M parameters allows a more direct comparison with AthenaTest [47] and enables faster experimentation compared to the 2.7B variant. In prefix tuning, we use 10 soft tokens, as that performs best in preliminary experiments. For optimization, we use the regular cross-entropy loss and the AdamW optimizer from PyTorch. We set the learning rate to 5e-6 for prefix tuning and to 1e-6 for full fine-tuning. We train for 3 epochs with a batch size of 8, evaluate on the validation set 4 times per epoch and save a checkpoint each time. The checkpoint corresponding to the best validation loss is used for testing.

5.1.3 Generation Procedure

We sample from the trained models with top-k = 50 and top-p = 0.9. Furthermore, we increase the temperature from 0.05 to 0.4 in increments of 0.05. We generate 4 samples per temperature, i.e., a total of 32 samples per MUT. In preliminary experiments, we found this setting to provide a good tradeoff between diversity and quality. These experiments also showed that increasing the number of samples can significantly improve the coverage; e.g., increasing it from 32 to 96 increased the coverage by almost 50%. We use the moderate number of 32 samples to save computational resources and to stay somewhat comparable to AthenaTest, which uses 30 samples. The post-processing is performed as described in subsection 3.4.1.
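This sampling setup corresponds roughly to the following sketch based on the Hugging Face generate API; the maximum number of new tokens is an assumption, since the generation length limit is not stated here.

import torch

def sample_test_cases(model, tokenizer, prompt, device="cuda"):
    """Draw 4 samples at each temperature from 0.05 to 0.40 (32 samples per MUT)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = inputs["input_ids"].shape[1]
    samples = []
    for step in range(1, 9):
        temperature = round(0.05 * step, 2)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                do_sample=True,
                top_k=50,
                top_p=0.9,
                temperature=temperature,
                num_return_sequences=4,
                max_new_tokens=256,  # assumed generation budget
                pad_token_id=tokenizer.eos_token_id,
            )
        completions = output[:, prompt_len:]
        samples.extend(tokenizer.batch_decode(completions, skip_special_tokens=True))
    return samples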
// Focal Class:
public class SwingListbox extends AbstractSwingContainer
    implements XulListbox, ListSelectionListener {
  // Methods
  public SwingListbox(
      Element self,
      XulComponent parent,
      XulDomContainer container,
      String tagName
  ) {}
  public Object getManagedObject() {}
  ...
  public void setCommand(final String command) {}
  // Fields
  public int counter = 0;
  // Method under Test
  public String getSeltype() {
    return selType;
  }
}
// Test Class:
import static org.junit.Assert.assertEquals;
...
import org.pentaho.ui.xul.swing.SwingXulLoader;

public class SwingListboxTest {
  // Fields
  Document doc = null;
  XulDomContainer container;
  XulListbox list;
  // Helper Methods
  @Before public void setUp() throws Exception {
    // Do not run on headless environment
    Assume.assumeTrue(!GraphicsEnvironment.isHeadless());
    container = new SwingXulLoader().loadXul("documents/listbox.xul");
    doc = container.getDocumentRoot();
    list = (XulListbox) doc.getElementById("listbox");
  }
  private static String toString(int[] is) {}
  // Unit Test Case
  @Test public void testSeltype() throws Exception {
    assertEquals("single", list.getSeltype());
  }
}

Prioritization of the components: Class Name and Constructor: Priority 6; Focal Class Methods: Priority 2; Focal Class Fields: Priority 1; Test Class Fields: Priority 5; Test Class Imports: Priority 4; Test Class Methods: Priority 3; MUT: Priority 6; Class Name: Priority 6; Test Case: Priority 6.

Figure 5.1: Example of the formatting of focal class and test class context. The annotations indicate the prioritization of components when not all context fits into the token limit; Priority 6 means highest priority, Priority 1 means lowest priority.

5.1.4 Evaluation metrics

As described in section 3.5, we calculate the four coverage metrics MiLC, MaLC, MiBC and MaBC from the coverage reports generated by Tests4J. Among these, we use MiLC, the ratio of total covered lines to total lines, as our main evaluation metric. Furthermore, we consider the Correct metric proposed by Tufano et al. [47], measuring the fraction of test cases that are executable and call the correct MUT. We furthermore report the test loss, BLEU and crystalBLEU scores as intrinsic metrics.

5.2 Research Questions

RQ1 Training Methods: How effective is prefix tuning for test case generation?

Prefix Tuning has been applied to a plethora of tasks in NLP and code intelligence, but there is no clear consensus on whether it is worse than, equal to or better than full fine-tuning in terms of the final performance. To the best of our knowledge, no papers have applied prefix tuning as the main training method to the task of test case generation. Zlotchevski et al. [54] use prefix tuning for adapting to specific repositories, but the model was first fully fine-tuned on test case generation. We close that gap by comparing the performance of prefix tuning, full fine-tuning and zero-shot prompting on Tests4J.

RQ2 Context Information: How does context information affect the performance when provided in the hard prompt?

Tufano et al. [47] found that providing AthenaTest with focal class context reduces its loss by about 10%. However, they do not evaluate the performance impact with further, more important metrics such as coverage or BLEU scores. In contrast, we evaluate the effect of context using Tests4J, presenting both intrinsic and execution-based metrics. We always provide as much context as possible using the method described in subsection 5.1.1 and additionally consider the test class context.

RQ3 Effectiveness of CAPT: Can CAPT effectively compress context information while retaining performance?
Motivated by the importance of context information and the quadratic complexity of transformers, we proposed Context-Aware Prompt Tuning (CAPT) in chapter 4. In this RQ, we evaluate the effectiveness of CAPT. We first investigate the limits of how much information can be stored in a single soft token to get foundational insights about the mechanism. Afterward, we evaluate CAPT on Tests4J and compare the performance between not providing context, providing it via CAPT and providing it in the hard prompt.

RQ4 Comparison with Evosuite: How do neural models compare to the state-of-the-art non-neural test case generator on Tests4J?

Evosuite [19] is one of the most established tools for non-neural test case generation and typically achieves very high coverage scores. In a study [20], Evosuite was even found to outperform human developers in terms of coverage. While it excels at generating tests with high coverage, other quality properties like readability and maintainability lag behind human-written tests [22, 38, 23, 14]. In contrast, neural models are specifically trained to generate tests that look similar to their training distribution, i.e., to human-written code. Consequently, they generate test cases that are much more readable for humans. For instance, 61% of human developers favored AthenaTest's test cases over Evosuite's in terms of readability, 10% favored Evosuite's and 29% found them equally readable [47]. In this work, we do not further evaluate readability, but focus on the performance metrics computed by Tests4J. The authors of AthenaTest found that it achieved comparable coverage to Evosuite on a single focal class. We argue that, due to the high variance in coverage between repositories, an evaluation at such a small scale is not sufficient. We close that gap by evaluating neural models and Evosuite on Tests4J, which includes 484 focal classes from 11 repositories in the test set. To the best of our knowledge, we are the first to compare Evosuite to neural approaches at scale.

RQ5 Evaluation Methodology: How important is large-scale coverage evaluation?

Tests4J computes both intrinsic and execution-based metrics for test case generators. Among these, loss is by far the easiest to compute because it is available at training time and does not require execution. BLEU and crystalBLEU are available after sampling from the trained LM. Finally, the coverage and count metrics from Tests4J require both sampling from the LM and execution, which makes them the most time-consuming. To guide future research on the methodology of training and evaluating neural test case generators, we investigate the importance of large-scale coverage evaluation. In particular, we aim to find out whether the cheaper-to-compute metrics mentioned above are good indicators of coverage and how important the scale is for a reliable evaluation.

5.3 Results

In this section, we present the results for the research questions introduced in section 5.2. Table A.21 summarizes the results of all evaluated setups. The detailed evaluation tables are presented alongside the respective RQs. We name every run according to its training algorithm, context information and the way the context is represented.
More specifically, run names start with the training algorithm (FT for fine-tuning, PT for prefix tuning and ZS for zero-shot), followed by the context in the hard prompt as Hard-All, Hard-TCl, Hard-FCl or Hard-None, and the context in the soft prompt as [PG]-All, [PG]-TCl or [PG]-FCl, where [PG] is one of the prompt generator architectures presented in chapter 4.

Name | Training Algorithm | Test Class (TCl) Context | Focal Class (FCl) Context | MiLC | Correct | Loss | crystalBLEU | BLEU
Evosuite | - | - | - | 37.78% | - | - | - | -
FT-Hard-All | Fine-tuning | Hard | Hard | 13.55% | 2.81% | 0.6743 | 0.0473 | 0.0808
ZS-Hard-All | Zero-Shot | Hard | Hard | 3.07% | 0.29% | 1.0330 | 0.0237 | 0.0509
PT-Hard-All | Prefix Tuning | Hard | Hard | 9.76% | 1.16% | 0.6825 | 0.0355 | 0.0677
PT-Hard-TCl | Prefix Tuning | Hard | - | 5.48% | 0.82% | 0.6887 | 0.0378 | 0.0691
PT-Hard-FCl | Prefix Tuning | - | Hard | 3.28% | 0.73% | 0.8091 | 0.0323 | 0.0647
PT-Hard-None | Prefix Tuning | - | - | 1.79% | 0.12% | 0.8766 | 0.0283 | 0.0550
PT-Hard-None-BERT-TCl-1 | Prefix Tuning | BERT (l = 1) | - | 1.79% | 0.19% | 0.8944 | 0.0257 | 0.0557
PT-Hard-None-BERT-TCl-10 | Prefix Tuning | BERT (l = 10) | - | 1.82% | 0.19% | 0.8937 | 0.0285 | 0.0586
PT-Hard-None-BERT C-TCl-30 | Prefix Tuning | BERT C (l = 30) | - | 2.03% | 0.23% | 0.9004 | 0.0304 | 0.0611
PT-Hard-None-BERT C-TCl-90 | Prefix Tuning | BERT C (l = 90) | - | 1.92% | 0.28% | 0.9051 | 0.0328 | 0.0654
PT-Hard-None-BERT C-TCl-150 | Prefix Tuning | BERT C (l = 150) | - | 1.89% | 0.23% | 1.1792 | 0.0306 | 0.0627
PT-Hard-None-BERT C-TCl-188 | Prefix Tuning | BERT C (l = 188) | - | 1.79% | 0.38% | 1.1810 | 0.0312 | 0.0632
PT-Hard-None-LSTM-TCl-1 | Prefix Tuning | LSTM (l = 1) | - | 1.49% | 0.18% | 0.8957 | 0.0281 | 0.0587
PT-Hard-None-LSTM-TCl-10 | Prefix Tuning | LSTM (l = 10) | - | 1.81% | 0.22% | 0.8908 | 0.0305 | 0.0618
PT-Hard-None-LSTM C-TCl-30 | Prefix Tuning | LSTM C (l = 30) | - | 1.55% | 0.24% | 0.8964 | 0.0322 | 0.0644
PT-Hard-None-LSTM C-TCl-90 | Prefix Tuning | LSTM C (l = 90) | - | 2.03% | 0.27% | 0.8941 | 0.0295 | 0.0608
PT-Hard-None-LSTM C-TCl-150 | Prefix Tuning | LSTM C (l = 150) | - | 1.77% | 0.22% | 0.8951 | 0.0289 | 0.0596
PT-Hard-None-LSTM C-TCl-188 | Prefix Tuning | LSTM C (l = 188) | - | 1.68% | 0.19% | 1.1903 | 0.0292 | 0.0601
PT-Hard-None-Sum-TCl-1 | Prefix Tuning | Sum (l = 1) | - | 1.85% | 0.22% | 0.8939 | 0.0272 | 0.0532
PT-Hard-None-Sum-TCl-10 | Prefix Tuning | Sum (l = 10) | - | 1.95% | 0.32% | 0.8978 | 0.0291 | 0.0560
PT-Hard-None-Sum C-TCl-30 | Prefix Tuning | Sum C (l = 30) | - | 2.08% | 0.23% | 0.8939 | 0.0292 | 0.0604
PT-Hard-None-Sum C-TCl-90 | Prefix Tuning | Sum C (l = 90) | - | 1.76% | 0.26% | 0.8922 | 0.0296 | 0.0608
PT-Hard-None-Sum C-TCl-150 | Prefix Tuning | Sum C (l = 150) | - | 1.58% | 0.32% | 0.8929 | 0.0292 | 0.0597
PT-Hard-None-Sum C-TCl-188 | Prefix Tuning | Sum C (l = 188) | - | 1.98% | 0.24% | 0.8937 | 0.0296 | 0.0611

Table 5.1: Summary of the results.

5.3.1 RQ1: Training Methods

To compare the three training paradigms, we use the setup with all context information in the hard prompt. We find that full fine-tuning (Table 5.2) significantly outperforms prefix tuning (Table 5.3). For instance, the MiLC is almost 40% higher (13.55% vs. 9.76%) and the number of correct test cases is more than twice as high (625 vs. 274). Moreover, both the coverage and the fraction of correct test cases are higher on every project except one. The loss of full fine-tuning is better as well, but the difference is rather small (≈ 1.2%) compared to the other metrics. The performance of PolyCoder in a zero-shot setting (Table 5.4) is unsurprisingly much worse than both fine-tuning and prefix tuning. For three repositories, it failed to generate any correct test cases. The fraction of parsable test cases, however, is about the same as with full fine-tuning and prefix tuning.
We argue that this is because the ability to generate syntactically correct code is mainly acquired during pre-training. We furthermore find that the fraction of compilable test cases is very similar between all training setups, and that the fraction of executable test cases is much higher in the zero-shot setting. We hypothesize that this is because, without training, the model generates a large fraction of meaningless test cases that, for instance, only declare a number of primitive variables without ever invoking any project functionality. Thereby, these test cases avoid the challenge of correctly using local APIs. We find support for this hypothesis by comparing the fraction of executable test cases that invoke the MUT (Correct / Executable) and the fraction of test cases that use at least one assert statement. Both values are significantly higher for prefix tuning than for zero-shot prompting, and significantly higher for full fine-tuning than for prefix tuning. We conclude that generating test cases that invoke the MUT and assert the program state is a skill acquired during fine-tuning, and that full fine-tuning is more effective than prefix tuning in that regard. In all models, the rate of compilable tests is rather low (12.24% to 13.07%) compared to the

FT-Hard-All — loss: 0.6743, BLEU: 0.0808, crystalBLEU: 0.0473

Repository | MiLC | MaLC | MiBC | MaBC | Unique | Parsable | Compilable | Executable | Correct
criteo-garmadon | 6.91% | 6.43% | 5.58% | 5.79% | 1231 | 1050 (85.30%) | 157 (12.75%) | 56 (4.55%) | 11 (0.89%)