Software Lab Institute of Software Engineering University of Stuttgart Universitätsstraße 38, 70569 Stuttgart Master Thesis Tests4J Benchmark: Execution-based Evaluation of Context-Aware Language Models for Test Case Generation Valentin Knappich Course of study: Computer Science Examiner: Prof. Dr. Michael Pradel Supervisor: Prof. Dr. Michael Pradel, Matteo Paltenghi Started: October 17, 2022 Completed: April 17, 2023 Abstract Testing is a critical part of the software engineering process. The cost associated with writing and maintaining test suites has motivated the research community to develop approaches to automati- cally generate test cases for a given software. Evosuite is one of the most established tools for Java and has been shown to achieve high coverage. However, the test cases lack readability, motivating the application of language models (LMs) of code to the task. Evaluating such neural test case generators based on their execution requires substantial e↵orts to set up evaluation projects and to obtain metrics like coverage from them. In consequence, most prior work on Java test case generation has either evaluated models on a small number of selected methods under test (MUTs) or used Defects4J as an evaluation benchmark. However, small benchmarks su↵er from high vari- ance, and many projects in Defects4J have been used to pre-train LMs of code. To fill that gap, we introduce Tests4J, a novel benchmark for neural and non-neural test case generators. Tests4J contains 12k test cases from 60 Java projects, out of which 41 are used for training while 19 are used for evaluation. For all projects, it includes the complete repository, enabling execution-based evaluation and open-ended experimentation with project-specific context information. In a single command, Tests4J allows researchers to obtain execution-based metrics like coverage and intrinsic metrics like loss, BLEU and crystalBLEU. Using Tests4J, we train and evaluate several test case generation models based on PolyCoder with 400M parameters. We compare Evosuite to our best neural model and find that the individual test cases achieve similar coverage. However, Evosuite generates 3 times as many test cases, covering about 3 times as many lines in total. We furthermore find that Evosuite fails to generate any test cases for 4 out of 11 projects in the test set. This presents a fundamental advantage of LMs: they do not need to integrate with the project and thus don’t su↵er from dependency conflicts. Next, we evaluate prefix tuning as a training method and find that there is a significant gap to full fine- tuning. We further investigate the importance of project-specific context information and create simplified representations of the focal class and the test class. We find that adding this context information increases the achieved coverage by more than 4x, and that the focal class and test class context are highly complementary. Motivated by this finding, as well as the hard token limit and quadratic complexity of transformers, we propose Context-Aware Prompt Tuning (CAPT). In CAPT, context information is first compressed into embeddings, and then injected into the LM as soft tokens, similar to prefix tuning. We find that the method does not yield significant improvements over the baseline, but present directions for future research. Lastly, we find that loss is not an ideal indicator of coverage and that there is a high variance in coverage between projects, and thus advocate for large-scale execution-based evaluations. 
iii Zusammenfassung Softwaretests spielen eine zentrale Rolle in der Softwareentwicklung. Neben etablierten Tools wie Evosuite wurden in den letzten Jahren vermehrt Sprachmodelle für die automatische Generierung von Testfällen eingesetzt. Die ausführungsbasierte Evaluierung solcher Modelle stellt einen erhe- blichen Aufwand dar. Infolgedessen haben vorherige Arbeiten Modelle entweder auf einer kleinen Auswahl zu testender Methoden oder auf Defects4J evaluiert. Wir halten beide Ansätze für sub- optimal, da Evaluierungen auf kleinen Datensätzen eine hohe Varianz aufweisen und die Projekte in Defects4J zu einem großen Anteil in weit verbreiteten Datensätzen für das Vor-Trainieren von Sprachmodellen enthalten sind. Um diese Defizite zu beheben, führen wir in dieser Arbeit Tests4J ein. Es handelt sich dabei um einen neuen Benchmark zur Evaluierung von neuronalen und nicht- neuronalen Methoden für die Generierung von Tests. Tests4J beinhaltet 12k Testfälle aus 60 Java Projekten, wobei 41 für das Training und 19 für die Evaluierung eingesetzt werden. Tests4J enthält eine komplette Kopie aller Projekte, sodass eine ausführungsbasierte Evaluierung und umfassende Experimente mit Kontextinformationen ermöglicht werden. In einem einzigen Befehl können mit Tests4J sowohl ausführungsbasierte Metriken wie die Abdeckung als auch intrinsische Metriken wie Loss, BLEU und crystalBLEU berechnet werden. Mithilfe von Tests4J trainieren und evaluieren wir verschiedene Modelle auf Basis von Poly- Coder mit 400M Parametern. Wir vergleichen Evosuite mit unserem besten neuronalen Modell und stellen fest, dass einzelne Testfälle ähnlich e↵ektiv in der Abdeckung sind. Jedoch gener- iert Evosuite dreimal so viele Testfälle und erreicht die dreifache Abdeckung insgesamt. Zudem evaluieren wir Prefix Tuning als Trainingsmethode und stellen einen signifikanten Unterschied zum Trainieren des ganzen Modells in Bezug auf die Abdeckung und die intrinsischen Metriken fest. Wir untersuchen zudem die Bedeutung von projektspezifischen Kontextinformationen und erstellen vereinfachte Darstellungen der zu testenden Klassen und Testklassen. Dabei kommen wir zu dem Ergebnis, dass Kontextinformationen essenziell sind und zu mehr als der vierfachen Abdeckung führen. Motiviert durch diese Erkenntnis sowie durch die limitierte Anzahl an Tokens und die quadratische Komplexität von Transformern, schlagen wir Context-Aware Prompt Tuning vor. Dabei wird der Kontext zunächst von einem kleineren Modell in numerische Repräsentationen komprimiert, welche dann als virtuelle Tokens vom Sprachmodell verarbeitet werden. Wir stellen fest, dass die Methode keine signifikanten Verbesserungen gegenüber der Baseline erzielt, zeigen aber Richtungen für zukünftige Forschung auf. Abschließend stellen wir fest, dass Loss kein ide- aler Indikator für den Abdeckungsgrad ist, sowie dass es eine große Varianz im Abdeckungsgrad zwischen Projekten gibt und plädieren daher für umfassende, ausführungsbasierte Evaluierungen. v Contents 1 Introduction 1 2 Background 5 2.1 Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Soft Prompt Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3 Test Case Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 3 Tests4J Benchmark 9 3.1 Filtering Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 3.2 Dataset Split . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
14 3.3 Final Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 3.4 Coverage Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4.1 Post-processing and Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.4.2 Repository Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.4.3 Sca↵olding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4.4 Compilation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4.5 Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4.6 Report Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.4.7 Pipeline Validation and Manual Corrections . . . . . . . . . . . . . . . . . . . 18 3.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 3.6 Ground Truth Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 4 Context-Aware Prompt Tuning 23 4.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.1 Context Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.2 Prompt Generator Architectures . . . . . . . . . . . . . . . . . . . . . . . . . 26 5 Experiments 31 5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.1.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 5.1.2 Training Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.1.3 Generation Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 vii viii Contents 5.1.4 Evaluation metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5.3.1 RQ1: Training Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.3.2 RQ2: Context Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.3.3 RQ3: E↵ectiveness of CAPT . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.3.4 RQ4: Comparison with Evosuite . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3.5 RQ5: Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 44 6 Discussion and Future Work 47 6.1 Tests4J Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.2 Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 7 Related Work 49 7.1 Neural Test Case Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 7.2 Soft Prompt Tuning in Code Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 7.3 Dynamic Prompt Tuning and Multi-modal Language Models . . . . . . . . . . . . . 52 8 Conclusion 55 A Appendix 57 A.1 Maven Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 A.1.1 Maven Compiler Plugin Configuration . . . . . . . . . . . . . . . . . . . . . . 57 A.1.2 Maven Surefire Plugin Configuration . . . . . . . . . . . . . . . . . . . . . . . 57 A.1.3 Maven JaCoCo Plugin Configuration . . . . . . . . . . . . . . . . . . . . . . . 
57 A.1.4 Maven Compile Log Regex . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 A.1.5 Maven Execution Log Regex . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 A.2 List of Repositories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 A.3 CAPT Result Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 A.3.1 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 A.3.2 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 A.3.3 Sum . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 Bibliography 66 Acronyms AST Abstract Syntax Tree. 17 CAPT Context-Aware Prompt Tuning. 3, 24 LM Languge Model. 1, 3, 5–7, 9, 11, 15, 24, 49 MLP Multilayer Perceptron. 7 MUT method under test. 1, 8–10, 12, 14, 17, 18, 20, 23 NLP Natural Languge Processing. 1, 6 PEFT parameter-e�cient fine-tuning. 7 PG Prompt Generator. 24 SGD Stochastic Gradient Descent. 1, 7 ix 1 Introduction Testing is a critical part of the software engineering process. Unit tests are at the bottom of the testing pyramid [13] and test single functions or methods called methods under test (MUTs). In modern codebases, test code constitutes a significant amount of code, often spanning more lines than the main code. The cost associated with writing, updating and maintaining this test code motivated the research community to develop tools for automatic test case generation. These tools automatically generate test cases for MUTs based on their signature, body, and sometimes additional context information from the class or project. Evosuite [19] and Randoop [36] are the most popular tools for Java. Evosuite uses evolutionary algorithms to generate test suites, whereas Randoop uses a random search directed by feedback from executing the tests. Typically, test suites generated by these tools achieve excellent coverage scores, sometimes even outperforming human developers in that regard [20]. However, there are some limitations that hinder the adoption of these tools in practice. One major concern is the quality of the generated test cases. Analyses revealed multiple test smells that appear significantly more often in generated tests than in manually written ones [22, 38]. Generated test cases have been shown to be less readable and understandable [23, 14], arguably increasing the amount of work necessary for maintaining test suites. In a survey, most industrial practitioners said they would not keep a generated test case without modification, citing readability as one of the main reasons [4]. Meanwhile, the Natural Languge Processing (NLP) research community developed powerful Language Models (LMs) able to perform both discriminative and generative tasks on natural lan- guage. These models, usually based on the Transformer architecture [48], are pre-trained on self- supervised objectives and later adapted to downstream tasks such as text summarization. For such adaptation, there are two main approaches: fine-tuning and in-context learning. In fine-tuning, the LM weights are further trained on a downstream task objective, whereas in-context learning [32] achieves adaptation merely by providing a task description or input-output examples in the input, also called prompt. Soft prompt tuning is a family of methods that aim to find a middle ground between these two directions by introducing additional parameters in the form of soft to- kens. 
Thereby, soft tokens refer to continuous embeddings that are treated as tokens but trained directly with Stochastic Gradient Descent (SGD). Accordingly, normal textual prompts are called hard prompts. The success of LMs in the field of NLP has led to the adoption in the domain of source code as well. The result are LMs of code that can reason about and generate source code [18, 35, 50, 12, 52, 5]. They have been applied to many tasks, code completion, code synthesis and bug detection being among the most popular ones [40]. Yet, only relatively few works have attempted to generate 1 2 1. Introduction 1 public T retain() throws RefCountResourceException { 2 synchronized (this) { 3 if (refCount.getAndIncrement() == 0) { 4 try { 5 log.debug("First Reference. Create Resource."); 6 resource.set(resourceSupplier.get()); 7 } 8 catch (Throwable throwable) { 9 throw new RefCountResourceException(throwable); 10 } 11 } 12 13 return resource.get(); 14 } 15 } 16 17 @Test 18 public void retainShouldNotCreateResourceOnSecondCall() throws Throwable { 19 AtomicInteger refCount = new AtomicInteger(1); 20 RefCountResource resource = new RefCountResource(resourceSupplier, resourceCleanup, refCount, new AtomicReference(new Object()));,! 21 resource.retain(); 22 23 verify(resourceSupplier, times(0)).get(); 24 assertEquals(2, refCount.get()); 25 } Figure 1.1: Example MUT and test case that requires substantial context information. test cases with LMs [6, 47, 2, 43]. In these works, there is no clear consensus on the evaluation methodology. Some works evaluate models using intrinsic metrics like the loss [47, 2]. Such metrics, loss in particular, are easy to compute, enabling simple evaluations on a large scale. However, we argue that ultimately, a test case should cover the MUT and correctly assert its behavior, which can be achieved in many ways. Since intrinsic metrics only compare a generated test case to a ground truth, there is the risk of a disconnect between these metrics and the final performance. Execution-based metrics o↵er a more reliable evaluation because they are not restricted to a single ground truth and directly measure coverage as one key quality attribute of a test case. However, performing coverage analyses is associated with substantial e↵ort to set up. Prior work has therefore either focussed on coverage evaluation on a small number of classes and projects [6], or used existing benchmarks [47, 2] like Defects4J [27]. We argue that small-scale evaluation setups su↵er from high variance in coverage between classes and even between projects. Defects4J, on the other hand, contains a large portion of repositories that have been used to pre-train LMs, potentially leading to data leakage and distorted results. To fill that gap, we introduce Tests4J, a large-scale benchmark for neural and non-neural test case generators that provides execution-based and intrinsic metrics, while avoiding data leakage. To apply LMs to the task of test case generation, it is cast to a completion task: the language model is given a prompt and asked to complete it. In the most basic setup, this prompt only contains the MUT. We argue that this setup is ill-posed because the MUT is only part of the information required to write correct test cases. Figure 1.1 shows an example where the test case 1. Introduction 3 requires substantial context information beyond the MUT. 
For instance, it requires information on how to instantiate objects of the focal class RefCountResource, as well as information on the initialized mock objects resourceSupplier and resourceCleanup from the test class. Illustrated by this example and empirically shown by prior work [47], providing such context information is therefore essential for generating test cases. However, transformer LMs are limited in processing context by hard token limits of 1024 or 2048 and the quadratic compute and memory complexity. We attempt to alleviate these issues and introduce Context-Aware Prompt Tuning (CAPT), which first compresses context information into embeddings and injects them into the LM as virtual tokens, similar to soft prompt tuning. We summarize our main contributions to be the following: 1. We introduce Tests4J, containing 12k pairs of MUTs and test cases from 60 Java projects. Out of these, 19 projects are set up to be executable in a fully automatic way, enabling both intrinsic and execution-based evaluation in a single command. The remaining projects are used for training. All projects are stored as complete clones, allowing extensive experimen- tation with context information. 2. We integrate Evosuite into Tests4J, enabling the comparison of Evosuite and our neural approaches (RQ4). We find that individual test cases generated by the two approaches are similarly e↵ective in covering the MUTs. However, Evosuite generates about 3 times as many executable test cases, covering about 3 times as many lines in total. 3. Based on Tests4J, we experiment with prefix tuning (RQ1) and the importance of context information (RQ2). We find that prefix tuning lacks significantly behind full fine-tuning and that good context representations from both focal and test class are essential. 4. We propose and evaluate CAPT, finding that it is unable to e↵ectively retain performance while reducing the sequence length. We present hypotheses for the reasons and leave further experimentation to future work. The remainder of this thesis is structured as follows. In chapter 2, we provide background knowledge on test case generation, LMs and Prompt Tuning. Chapter 3 describes the construction of the benchmark, whereas Context-Aware Prompt Tuning is described in chapter 4. The experi- mental setup, research questions and results are presented in chapter 5. In chapter 6, we discuss the results and derive directions for future work. Chapter 7 describes related work and chapter 8 summarizes the experiments and findings of this work. 2 Background This chapter contains background information on the most relevant topics in this thesis: Language Models, Prompt Tuning and Test Case Generation. 2.1 Language Models Language Models (LMs) are machine learning models trained to predict the probability of a given text. They are usually trained in a self-supervised manner using a denoising objective like masked language modeling or causal language modeling. Generative LMs, which this work focuses on, often use the former. At every time step, the model predicts the most likely next token. To generate text, the prediction is applied autoregressively, i.e., the model generates text token by token from left to right. Generally, the inputs and outputs of LMs are sequences of tokens. We denote such sequences by xi:j , which refers to all tokens with indices inclusively between i and j: xi:j = [xi, ..., xj ]. 
In a sequence-to-sequence task, such as test case generation, the model is trained to infer a target sequence y0:m from a prompt x0:n. Figure 2.1 illustrates this setup and introduces the model visualization used throughout this thesis. More formally, we define the model to predict a conditional probability distribution p(yi+1|x0:n, y0:i). We can then formulate the cross entropy loss as CE = � mX i=0 log(p(yi+1|x0:n, y0:i)) Since the introduction of the transformer architecture [48], it has been the predominant archi- tecture for LMs. It is based on self-attention, where every token is represented by an embedding. At every layer, every token embedding is used to create query, key, and value embeddings via linear projections. For every token, the dot-product similarity of its query embedding with all other key embeddings determines how much each of the other tokens should influence this token embedding, i.e., how much attention it should pay to each other token. The new token embedding is then the weighted sum of each value embedding, weighted by attention score. More formally, the au- thors formulate the attention mechanism in a compact way by combining the query, key, and value embeddings for every token into the matrices Q, K, and V: Attention(Q,K, V ) = softmax( QKT p dk )V Instead of performing the attention mechanism with single, large query, key, and value embeddings per token, the authors propose multi-head self-attention. There, self-attention as described above 5 6 2. Background is performed for multiple smaller query, key, and value embeddings that are created by separate projections. MultiHead(Q,K, V ) = Concat(head1, ..., headh)W O where headi = Attention(QWQ i ,KWK i , V W V i ) The self-attention mechanism has one major drawback compared to earlier approaches based on recurrence or convolutions. It has a quadratic complexity in terms of both compute and memory with respect to the sequence length. That is, using longer prompts with more context information comes at the cost of much higher resource requirements. While the notation of O(n2) is generally about scaling behavior in the limit of n, quadratic behavior can already be clearly observed within the range of [0, 2048] (see e.g. Figure 4.1), motivating our work on CAPT in chapter 4. The original transformer proposed by Vaswani et al. [48] was an encoder-decoder architecture, i.e., an encoder first processed the prompt and a decoder generated new output while flexibly at- tending to both prompt and previous outputs. Subsequently, many architectures were derived that use only an encoder (e.g. BERT [15]), only a decoder (e.g. GPT-2 [41]) or both (e.g. T5 [42], BART [30]). Motivated by the success of pre-trained transformer LMs in NLP, they were also adopted for code-related tasks, resulting in LMs of code. In that context, code is, like natural language, represented as a sequence of tokens. Some of the most notable pre-trained models are Codex [12], PLBART [1], CodeGen [35] and PolyCoder [52]. In this work, we focus on the PolyCoder family of models based on the GPT-2 architecture. The authors publicly release three model variants with 160M, 400M and 2.7B parameters. Such models have been applied to several tasks. For instance, code completion poses the task of continuing an incomplete snippet of code, code generation is the task of generating code from a natural language description and code summarization is the task of generating a natural language description for code [40]. 
LMs of code are usually pre-trained on a corpus of code that is as general and representative as possible. In that sense, such corpora usually contain code in a number of programming languages and from many domains. To specialize models to a specific downstream task, such as Java test case generation, there are two main paradigms of adaption. First, there is fine-tuning, where all or some of the model parameters are adjusted using task-specific losses. Second, there is in-context learning, where the model is merely conditioned on a task via prompt, while the parameters remain unchanged. To perform in-context learning, the prompt might for instance contain a natural language task description or input-output pairs. If there are multiple of these pairs in the prompt, the method is called few-shot prompting. Depend- ing on the task, fine-tuning often leads to equally accurate but much smaller models. On the other hand, fine-tuning requires su�ciently large datasets and results in the deployment of many models if multiple tasks should be supported. 2. Background 7 Figure 2.1: Transformer Language Model. x0:n represents the prompt and y0:m the tar- get. Red boxes represent trainable parame- ters, blue boxes represent frozen parameters, and black boxes represent tokens. Figure 2.2: Deployment benefits of soft prompt tuning. Figure from [29]. 2.2 Soft Prompt Tuning To reduce the complexity of storing and deploying fine-tuned models for multiple tasks, and to reduce computational requirements for fine-tuning, parameter-e�cient fine-tuning (PEFT) methods have been proposed. In PEFT, only a small number of parameters are tuned, while the others remain unchanged. The tuned parameters are either newly introduced and randomly initialized or selected among the existing ones [11]. Soft prompt tuning is a family of PEFT methods. Inspired by the success of discrete prompts in models like GPT-3 [10], they add soft tokens to frozen LMs. These soft tokens are continuous embeddings and can therefore be directly optimized using SGD. Whereas only a small portion of the total parameters are fine-tuned, these methods have been shown to achieve comparable performance to full fine-tuning in many tasks [31, 29, 33]. At the same time, they open up interesting opportunities in the model deployment. Practitioners can deploy a single model endpoint of the main model that supports multiple tasks by selecting the respective soft prompts. As depicted in Figure 2.2, one can even mix di↵erent tasks in the same batch. Additionally, the memory consumption during fine-tuning is slightly reduced because the optimizer does not need to maintain states for the parameters of the main model (gradients of the main model are still required to allow backpropagation through the model to the soft tokens). The two most prominent methods in this family are Prefix Tuning [31] and Prompt Tuning [29]. They mostly di↵er in the shape of the soft prompts and how they are injected into the LM, as depicted in Figure 2.3. Prompt Tuning trains soft tokens of the size of the hidden dimensionality and treats them as input embeddings. The hidden states of subsequent layers are determined by the attention mechanism, just as with regular tokens. In contrast, the soft tokens learned by Prefix Tuning span all layers of the LM. Specifically, they contain key and value embeddings for every layer that are directly injected into the attention mechanism. The authors of Prefix Tuning argue that this increases the expressiveness of the method. 
To stabilize the training process, the prefix embeddings are reparameterized: every token is represented by an embedding of the hidden dimensionality, and projected up to the target size by a Multilayer Perceptron (MLP) that is shared among tokens. 8 2. Background Figure 2.3: Comparison of Prompt Tuning and Prefix Tuning. Prefix Tuning is shown at inference time, i.e., the reparameterization with the MLP is not depicted. 2.3 Test Case Generation Unit test cases are functions that test the functionality of software. Contrary to integration tests and end-to-end tests, unit tests test a small piece of code, usually functions or methods. They generally follow four phases: setup, execution, validation, tear-down [9]. In object-oriented languages such as Java, test cases are often organized in test classes, where some of the setup code is shared among test cases. Besides general software quality factors, the quality of test cases are usually quantified using metrics like coverage and mutation scores. Coverage generally measures what portion of the code under test was executed by a test case or test suite. In particular, it measures the fraction of lines or branches that are executed, yielding the line coverage and branch coverage metrics. Whereas coverage is an important criterion for the quality of test cases, it does not consider the quality of the assertions. In contrast, mutation scores mutate the code under test, check if it still passes the test case and can thereby quantify how likely it is that a regression is detected by the test. Test case generation refers to the task of automatically generating test cases for a given MUT. In the last decades, several approaches for test case generation have been proposed. For Java, Evosuite [19] and Randoop [37] have been the predominantly used tools. Randoop’s approach uses random testing, where the test cases are generated by randomly sampling sequences of method calls. Furthermore, the random generation is guided by feedback from the execution in order to get sequences that are executable and not redundant with respect to the program state. In contrast, Evosuite uses a genetic algorithm that first generates a set of seed tests and generates new tests by applying crossover and mutation operators to the population. The fitness of tests is determined using coverage feedback, i.e., Evosuite directly optimizes for code coverage. Consequently, Evosuite typically achieves very high coverage scores, sometimes even outperforming human developers in that regard [20]. 3 Tests4J Benchmark We introduce Tests4J, a benchmark for training and evaluating neural test case generation models. In essence, the benchmark consists of a dataset and an evaluation pipeline. The dataset contains mappings from MUTs to test cases, whereas the evaluation pipeline automatically obtains coverage metrics for generated test cases. We restrict our work to the Java programming language. It is one of the most used languages in prior work regarding test case generation and also provides strong non-neural baselines such as Evosuite [19]. To enable the evaluation pipeline in obtaining execution-based metrics, we include complete repository snapshots in the dataset, rather than just pairs of MUTs and test cases. This also makes the dataset extensible regarding the additional context information a model could use. Prior work [47, 2] has used a subset of Defects4J [27] to evaluate test case generators with coverage metrics. 
However, many of the repositories have been used to pre-train LMs of code. For instance, the popular pre-training dataset The Pile [21] contains parts of 16 out of 17 repos- itories in Defects4J. Similarly, PolyCoder’s [52] pre-training corpus contains parts of 12 out of 17 repositories. Evaluating on these repositories is therefore problematic because they might have memorized the test case rather than inferring them. In this benchmark, we remove all reposito- ries used in these two datasets from the candidates. Furthermore, the dataset creation process is mostly automated, enabling future research to increase the scale and avoid data leakage from other pre-training corpuses as well. Furthermore, Defects4J does not contain ground truth test cases. That is, it does not enable evaluation with intrinsic metrics like crystalBLEU [17]. To evaluate a novel test case generator through execution, researchers can insert generated test cases into the respective repositories. Another popular benchmark for test case generators is JUGE [16]. It is used annually for the JUnit testing tool competition1. At its core, JUGE is a standardized infrastructure to measure the e↵ectiveness and e�ciency of test case generators. It has been designed for and applied to a variety of methods, inlcuding search-based approaches (e.g. Evosuite [19]) and random-based approaches (e.g. Randoop [36]). However, it has not been used to evaluate neural approaches. We argue that its design is fundamentally not ideal for neural test case generators for multiple reasons. First, much like Defects4J, the majority of the repositories that have been used as benchmark in the last years can be found in the pre-training corpus of many LMs. Second, one key pillar of JUGE is a standardized execution environment that is not su�cient to run LMs. Lastly, JUGE requires developers to write Java interfaces for their test case generators while most deep learning research is done in Python. In contrast, the only interface between models and Tests4J is JSON with a 1https://junitcontest.github.io/ 9 https://junitcontest.github.io/ 10 3. Tests4J Benchmark Benchmark Training Data Intrinsic Evaluation Execution-based Evaluation Accounts for Data Leakage Interface between Generator and Benchmark Methods2Test [46] - - - Defects4J [27] - - - - JUGE [16] - - - Java Tests4J (ours) JSON Table 3.1: Comparison of Test Case Generation Benchmarks and Datasets simple schema, allowing quick adoption. Methods2Test [46] is a dataset with pairs of MUTs and test cases intended for training and intrinsic evaluation. We improve upon Methods2Test by including complete real-world projects, enabling execution-based evaluation and open-ended context. In contrast, Methods2Test only contains pairs of MUTs and test cases, and does not provide commit hashes or scraping dates. Therefore, it does not allow execution-based evaluation and provides fixed, limited contextual information. However, we build upon Methods2Test in two main ways. First, we use their list of repositories as a starting point, which assures that all our repositories are using licenses that allow redistribution and that they are not forks. Second, we re-use their heuristic used to map test cases to focal methods based on matching file paths, method names, and method calls. Table 3.1 summarizes the comparison with the aforementioned existing benchmarks and datasets. The remainder of this chapter describes both the creation of the dataset and the process of obtain- ing coverage metrics for generated test cases. 
We start with a set of candidate repositories and filter it according to our requirements, as illustrated in section 3.1. Next, we describe the dataset split in section 3.2 and present the final dataset in section 3.3. Afterward, we elaborate on the process of inserting generated test cases into the repository and getting coverage metrics in section 3.4. Finally, we introduce our evaluation metrics in section 3.5 and present the coverage scores for the ground truth test cases in section 3.6. 3.1 Filtering Repositories The overall goal of the filters is to ensure that all repositories meet the requirements described above. We start with a list of candidates, which contains the 9410 repositories from Methods2Test as well as 9 additional repositories that were used by Bareiß et al. [6] to evaluate their approach to test case generation. The filters are executed in two main steps. First, the repositories are filtered based on their metadata, which we query from the GitHub API2 (steps 1-5). Afterward, the repositories are cloned and filtered, e.g., based on the number of mappable test cases and whether they are executable (steps 6-9). Table 3.2 summarizes the filtering process. We apply the 9 filters described below, after which 63 out of the 9419 candidate repositories remain. In the following, all quantitative results and plots refer to the state of the dataset at the respective stages in the process, not to the final dataset. Section 3.3 presents further information about the final dataset. 2https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28 https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28 3. Tests4J Benchmark 11 Filter Name Before After Relative Reduction 1 G it H u b A P I Data Leakage 9419 6428 -31.8% 2 Existence 6428 6277 -2.3% 3 Repository Size 6277 6163 -1.8% 4 Maven Usage 6163 3272 -46.9% 5 Last Commit Date 3272 1198 -63.4% 6 C lo n ed Number of Tokens 1198 1193 -0.4% 7 Number of Test Cases 1193 305 -74.4% 8 Compilation 305 152 -50.2% 9 Execution 152 63 -58.6% Table 3.2: Number of Repositories before and after the respective filtering steps 1. Data Leakage. To avoid data leakage, we first remove all candidate repositories that appear in the pretraining corpus of two LMs that were considered for this work: Polycoder [52] and GPT-Neo [7]. This eliminates almost a third of the repositories. 2. Existence. We ensure that the repository still exists by calling the GitHub API, eliminating about 2.3% of the candidates. 3. Repository Size. We find that there is a small number of very large repositories that take a long time to clone and compile, slowing down the process of scraping as well as coverage evaluation. Thus, with a threshold of 300MB, we reduce the total size of candidate repositories by over 60% while reducing the number of candidates by only 1.8%. Figure 3.1 shows a histogram of the repository sizes. Note that the size returned by GitHub’s API corresponds to the size that is downloaded during cloning, i.e., a compressed version of the repository, including its history. The history is not as relevant for our purposes, but we see this size as a proxy for both disk usage and compile time. Figure 3.1: Histogram of the Repository Sizes. The threshold of 300MB is shown as dashed line. 23 outliers with a mean size of 1.46GB are not depicted. 4. Maven Usage. Compiling and executing repositories automatically requires all dependencies to be resolvable by a package manager. To simplify subsequent steps, we restrict our dataset 12 3. 
Tests4J Benchmark to repositories that use Maven3 as package manager. To determine if a repository uses Maven, we check if there is a pom.xml file at the repository root, filtering out 46.9% of the repositories. 5. Last Commit Date. As a proxy for the quality of the test cases, we filter out stale repos- itories. Specifically, we consider a repository as stale if there were no commits on the main branch for the past two years. The threshold of two years was chosen intuitively because the distribution in Figure 3.2 did not reveal any distinct values that seemed particularly benefi- cial (note the long tail in the histogram and the almost linear decay in the cumulative sum). If mere maximization of the dataset size is desired in future work, increasing the threshold could a viable option. Figure 3.2: Histogram of last commit dates on the left. Number of remaining repositories for di↵erent threshold values on the right. Both include the chosen threshold of two years as a dashed vertical line. 6. Number of Tokens. After these first five filtering steps based on metadata, we clone the 1198 remaining repositories and use the mapping heuristic from Methods2Test [46] on them to get the pairs of MUT and test case. Since the dataset is meant for training and evaluation of language models, we restrict the combined length of MUT and test case to be at most 1850 tokens. This leaves 198 tokens for continuous prompts to the hard limit of 2048. For tokenization, we use the tokenizer of Polycoder. Note that in this step, we filter out individual test cases rather than complete repositories, unless all test cases exceed the token limit (0.4% of repositories). Figure 3.3 shows the distribution of the number of tokens, 96.5% of samples are below the threshold of 1850. 7. Number of Test Cases. We filter out any repository that has less than 80 mapped test cases in order to keep the number of repositories and the associated cost for compiling and executing them manageable. On the other hand, we want a diverse set of test cases that is not dominated by few repositories. Therefore, if more than 1000 test cases are mapped, we sample 1000 random ones. We find that there are many repositories with very few test cases, e.g., 8.6% contain only a single mappable test case and 74.4% contain less than 80 mappable test cases. Figure 3.4 displays the histogram of the number of test cases, as well as the cumulative sum. 3https://maven.apache.org/ https://maven.apache.org/ 3. Tests4J Benchmark 13 Figure 3.3: Histogram of the number of tokens of all mapped test cases. The threshold of 1850 is depicted as a dashed line. For readability purposes, 1,512 outliers with a mean of 4,816 tokens and a max of 36,862 tokens are not depicted in this figure. Figure 3.4: Left: Histogram of number of test cases per repository. The dashed line shows the lower threshold of 80, and the dot-dashed line represents the upper limit of 1000, above which we sample. Right: Plot of remaining total test cases for di↵erent lower thresholds, with the upper limit of 1000 already considered. 8. Compilation. We require all repositories to be compilable with maven, i.e. all dependencies must be specified in the pom.xml files. We first detect the Java version from the root pom.xml. We support Java 8, 11, 17 and 19, among which we select the closest most recent one. We run the compilation with mvn clean compile and keep the repository if the command succeeds. We find that roughly half of the repositories compile. 
We save the logs but don’t further investigate reasons for compilation errors. Finding patterns in errors and adjusting the environment accordingly might yield more compilable repositories in future work. 9. Execution. Similarly, we require all repositories, or more specifically their test suites, to be executable in our environment. We select the Java version as described above, run mvn clean test and keep a repository if the command succeeds, eliminating 58.6% of the repositories. Similar to the compilation, we save the logs and leave further investigation to future work. 14 3. Tests4J Benchmark 3.2 Dataset Split After filtering the repository candidates, we split them into train, validation and test splits. These splits are created based on whole repositories, i.e., samples from the same repository will all be in the same split. This ensures that the benchmark measures generalization and avoids data leakage through duplicate or near duplicate code from the same repository [3]. We aim for a split where 70% of the test cases are in train, 10% in validation and 20% in test. Since the repositories contain di↵erent numbers of test cases, creating the split is not as trivial as assigning 70% of the repositories to train and so on. We therefore use the following procedure to create the split randomly, while also getting as close as possible to the target number of test cases. First, we calculate the target number of test cases per split. Next, we iterate over the repositories and randomly assign every repository to one of the splits that can include this repository without exceeding their target number of test cases. Afterward, there are a number of repositories left over. These leftover repositories are sequentially assigned to the split that is still farthest away from their target. Using this procedure, we obtain the distribution depicted in Table 3.3. Number of Repositories Number of Test Cases Train 41 (65.08%) 8733 (69.93%) Validation 9 (14.29%) 1285 (10.29%) Test 13 (20.63%) 2471 (19.79%) Total 63 (100.0%) 12489 (100.0%) Table 3.3: Number of Repositories and Test Cases per Dataset Split. Note that these numbers include the validation repository and 2 test repositories that were later removed due to incompat- ibility with the coverage tool stack. 3.3 Final Dataset The final dataset consists of 63 repositories with ⇠12k mapped test cases corresponding to ⇠6k MUTs. Out of the 63 repositories, we remove 3 due to incompatibility with the coverage tool stack, as described in subsection 3.4.7. Table 3.4 displays more detailed statistics. We can observe that most MUTs are tested by only few test cases (1.87 on average). Most MUTs are rather short, for instance, 25% are shorter than 39 tokens, indicating low complexity. To confirm this, we further investigated the MUT complexity and found that 18.4% are getter methods, 2.7% are setter methods and the average cyclomatic complexity score [34] is 2.29. The dataset contains repositories using both Java 8 (43/63 or 68.25%) and Java 11 (20/63 or 31.75%). Moreover, the repositories use multiple test frameworks: JUnit4 (32/63 or 50.79%), JUnit5 (24/63 or 38.10%) and TestNG (7/63 or 11.11%). 3. 
Tests4J Benchmark 15 Avg Max Min 25% 50% 75% Total Number of Test Cases per Repository 198.24 1000 80 102.50 146.00 244.00 12,489 Number of MUTs per Repository 106.03 672 7 57.00 78.00 129.50 6,680 Number of Test Cases per MUT 1.87 86 1 1.00 1.00 2.00 12,489 Number of Tokens per MUT 153.73 1681 5 39.00 80.00 182.25 1,026,888 Number of Tokens per Test Case 194.05 1668 13 79.00 136.00 235.00 2,423,516 Table 3.4: Aggregate Dataset Statistics. Token counts correspond to the PolyCoder tokenizer. 25%, 50% and 75% correspond to the respective percentiles. 3.4 Coverage Pipeline To get coverage metrics for test cases generated by a model in a convenient way, we create a coverage pipeline that filters for executable test cases, inserts them into the repository and executes them. The main challenge is that model-generated test cases can be arbitrary, i.e., we do not have any guarantee of parsability, compilability or executability. Therefore, the pipeline sequentially filters out faulty test cases in multiple steps, such that only executable test cases remain in the end. Figure 3.5 gives an overview of the pipeline. In the following, we describe these steps and their respective challenges in more detail. Figure 3.5: Overview of the Coverage Pipeline 3.4.1 Post-processing and Parsing The first step is to apply a lightweight post-processing and filter for code that is parsable. The main case that the post-processing covers is incompleteness: the LM might not have completed the test case at the token limit or be stuck in an infinite loop. We first truncate everything after the closing outer parenthesis and attempt to parse the result with javalang4 using the parse member declaration method. We assert that the parsed tree represents a method declaration, rather than a class declaration for example. If parsing fails, we truncate back to the last complete statement, indicated by the last semicolon. Afterward, we deduplicate statements at the end of the method to fix infinite loop cases and close all open parentheses. Finally, we attempt to parse again and discard the failing samples. Since the goal of this work is to evaluate the proposed modelling technique, we keep the 4https://github.com/c2nes/javalang https://github.com/c2nes/javalang 16 3. Tests4J Benchmark post-processing rather simple, but more sophisticated repair techniques could be incorporated in future work. After repairing and parsing test cases, we modify its annotations. First, we remove potential @Ignore and @Disabled annotations to ensure that the test case will be executed later on. Second, we configure an execution timeout of 10 seconds per test case. To that end, we detect the test frame- work (JUnit4, JUnit5 and TestNG) based on the import statements in the test class and modify the annotations to include the respective timeout settings. In particular, we add @Test(timeout=10000) for JUnit4, @Test(timeOut=10000) for TestNG and @Timeout(10)@Test for JUnit 5. 3.4.2 Repository Configuration Before attempting to compile the parsable test cases, we prepare the repositories to create a stan- dardized environment. For that purpose, we first copy the repository to a temporary directory and perform all modifications there. That way, the original repository state remains intact for future changes of the coverage pipeline. This might be omitted in the future to avoid the slight overhead of creating the copy and modifying the configuration on every execution of the pipeline. 
In Maven, arbitrary plugin executions can be hooked into the test phase triggered by calling mvn test. The purposes of such executions are manifold. Some plugins are required to build the project, while others are mainly supporting developers with analyses or deployments. In the con- text of our benchmark, we want to limit the execution to those plugins necessary for compilation to accelerate the process and avoid failures due to errors unrelated to our use case. Through man- ual investigation, we find maven-compiler-plugin, jaxb2-maven-plugin, build-helper-maven-plugin, antlr4-maven-plugin, maven-jar-plugin, protoc-jar-maven-plugin, maven-bundle-plugin, maven-shade-plugin and maven-install-plugin to be the essential plugins in our selection of repos- itories. They either configure the compilation and packaging process or generate code. We keep these plugins and remove all others from all pom.xml configuration files. To create a uniform environment that allows automatic coverage evaluation for di↵erent reposi- tories, and to enable coverage analyses, we programmatically modify the project configuration. The first challenge is to find the pom.xml file that is a common ancestor to all sub-configurations, such that the modifications take e↵ect for all submodules. While very common, the root configuration does not have to be the pom.xml file at the root directory of the project. Therefore, we traverse the tree created by the parent relations between configurations until we arrive at the root. This does also not necessarily yield a single configuration file, i.e., the relations can sometimes constitute multiple unconnected trees. In our selection of repositories, this only happens when artifact or re- source directories also contain pom.xml files, which do not need the common configuration options. Consequently, we select the pom.xml file with the most children as a heuristic, which works for all our repositories. After selecting the root configuration that influences all sub-configurations, we make the following modifications. 1. When filtering for compilable test cases among a large number of generated test cases, it is essential to receive as many compilation errors as possible, such that the faulty test cases can be removed. To achieve that, we configure the maven-compiler-plugin not to stop the 3. Tests4J Benchmark 17 compilation process when errors occur and increase the maximum number of displayed errors (see appendix A.1.1). Note that this plugin is also among the essential plugins we identified. Therefore, we merge the original settings with these new ones. 2. Next, we configure the maven-surefire-plugin, which controls the test execution. Similar to the compilation step, we instruct the plugin to keep running even when errors occur, to get as much feedback as possible in one run (see appendix A.1.2). 3. Lastly, we add the configuration for the jacoco-maven-plugin (see appendix A.1.3). It is used to measure the code coverage. We chose JaCoCo5 over alternatives because it is the most mature tool that supports most Java versions. 3.4.3 Sca↵olding We view the test case generation task on the method level, yet methods alone cannot be compiled, but need a test class. A complete tool would generate a sca↵olding, including all imports and a test class. While not impossible, generating this automatically is not trivial for many reasons, e.g., it is not always clear from which package a name should be imported. 
Since this work focuses on the method level, we instead use the test classes of the ground truth test cases as sca↵olding. In all cases except one, there is only a single test class for every MUT in the dataset. 3.4.4 Compilation The goal of this step in the pipeline is to identify test cases that are not compilable. To that end, the compiler was configured to continue compiling on errors and to output all errors. Ideally, the compiler would identify all compilation errors in one run. In practice, this is not realistic due to masking e↵ects between errors, i.e., some errors only occur when others are resolved. Therefore, we iteratively compile the project and remove faulty test cases. Specifically, we compile using JAVA HOME=/path/to/java/version mvn clean test-compile, where the appropriate java version is detected from the root configuration. Next, we parse the logs produced by the compiler using regular expressions. Unfortunately, the errors are not always formatted consistently. The three regular expressions in appendix A.1.4 cover all variants that we experienced during the experiments and extract the file name, line number of the error in that file and the error message. We then parse the test classes into their Abstract Syntax Trees (ASTs), find all test cases that span across at least one of the error lines, and remove these test cases from their test class. We iterate until no further errors are detected by the regular expressions, such that all remaining test cases are compilable. We save both compilation logs and error messages per test case for post-hoc analyses. 3.4.5 Execution Similar to the compilation step, one goal of executing the test cases is finding those that run successfully. We again construct regular expressions to parse the logs, extract the files and names of the failing test cases, and remove them from their test classes. Unlike during compilation, 5https://github.com/jacoco/jacoco https://github.com/jacoco/jacoco 18 3. Tests4J Benchmark removing failing test cases would not be strictly necessary to remove failing test cases. However, parsing the logs yields interesting information about how many and which test cases were ultimately runnable. At the same time, by removing the failing test cases, we make sure that only passing tests contribute to the final coverage metrics. We save all error information, e.g., enabling subsequent analyses of assertion errors. 3.4.6 Report Parsing The execution of mvn test produces the JaCoCo reports as artifacts. We first parse the XML reports and get the coverage information for all methods. We then retrieve the coverage for all MUTs by matching their signature to the JVM signatures in the JaCoCo report. To that end, we leverage the org.jacoco.report.JavaNames class to convert the JVM signatures to a more easily readable and parsable format. Ultimately, we extract both line and branch coverage per MUT and aggregate. 3.4.7 Pipeline Validation and Manual Corrections To validate the correctness of the pipeline, we run it in two settings, where we know the expected outcome. First, we implement a dry run where the pipeline is executed without inserting any test cases into the repositories. Therefore, in that setting, we eliminate all steps that are concerned with processing the test cases as sources of error. We ensure that all dependencies are available and make sure that the project is compilable without non-essential plugins, and therefore validate that our list of essential plugins is su�cient. 
Furthermore, we check that no test cases are executed, as all test cases should have been removed from the repository during the scraping process. Second, we further leverage the ground truth test cases for validation purposes. In other words, we re-insert the mapped test cases into their test classes and attempt to measure their coverage. To that end, we first transform the mapped test cases from the dataset format into the prediction format, such that the whole pipeline is executed as if the test cases were model-generated. Ideally, we expect all test cases to be parsable, compilable, and runnable. We find that 99% of test cases are executable, validating the pre-processing, compilation and execution procedure. We further find that 98% of those also test the correct MUT, validating the mapping heuristic. We analyze the few test cases that are not parsable, compilable or executable and find three patterns of edge cases. First, some test cases depend on the side e↵ects of other test cases, like setting attributes in shared objects. These test cases might fail in our setup because not all test cases of original test classes are mapped during scraping and because the execution order can di↵er. Second, some test cases exceed the 10 seconds timeout. Lastly, some test cases fail with various error types after their @Disabled or @Ignore annotations were removed. This is likely because these test cases are outdated with respect to their MUT. We conclude that the pipeline works correctly and that the few edge cases are not due to a bug in the pipeline. A high degree of automation was one of the main goals throughout the benchmark creation, to manage and support the large scale of this study. At the same time, it makes the dataset extensible for future work. In that sense, the scraping process is fully automated, and a larger dataset can be 3. Tests4J Benchmark 19 created by simply adding more candidate repositories. However, when adding a new dataset to the pipeline, one should manually validate that the repository works in the pipeline environment. For instance, one should ensure that there are no dependency conflicts between the repository and the pipeline tooling. We believe that the two validation approaches discussed above are also excellent test cases to do that. For the 22 repositories in the validation and test set, we perform this manual validation. In the following, we list all manual changes we made to these repositories, that were identified in the process: 1. intuit/CloudRaider uses powermock6 in their test classes to create mock objects. Unfortu- nately, powermock uses on-the-fly code instrumentation and is therefore incompatible7 with JaCoCo’s on-the-fly instrumentation. For that reason, we exclude this repository from any further experiments. We further exclude Flipkart/foxtrot because it depends on running database containers and SonarSource-orchestrator because its build process is incompatible with JaCoCo. 2. The configuration of bazaarvoice/ostrich has a parent configuration outside of the repository itself, that overrides the argLine property of surefire. This override breaks JaCoCo, so we remove the dependency to that parent configuration. 3. In eclipse/winery, we find that the frontends submodule has additional dependencies to NodeJS and takes a very long time to compile. At the same time, it does not contain any mapped test case and no other submodule depends on it. 
We therefore remove this submodule from the build by removing the corresponding entry in the root configuration.

4. JUnit5 did not support timeouts until version 5.5. Therefore, we upgrade any older version to 5.5.1. This is the case for Domo42/saga-lib and flipkart-incubator/Lyrics. We further upgrade JUnit4 from 4.11 to 4.12 in Domo42/saga-lib in order to make it compatible with the junit-vintage-engine (which was used in the repository all along) and therefore to ensure that all test cases are executed.

5. During scraping, we removed all test cases from the repositories, such that only the inserted test cases would contribute to the coverage. The script found and removed test cases annotated with @Test and @ParameterizedTest, but not those that use the JUnit3 style of declaring a class as a test class by extending junit.framework.TestCase. This occurred only a single time and was corrected manually, but could be automated in the script in a future iteration.

3.5 Evaluation Metrics

For evaluation, Tests4J first computes BLEU and crystalBLEU scores as intrinsic metrics. These metrics compare generated test cases to correct, human-written test cases for the same MUT. In particular, they compute scores based on n-gram overlaps. CrystalBLEU ignores trivial n-grams in that process and thus measures code similarity more accurately. Both can give an indication of the quality of a test case, but they do not capture its behavior during execution. Therefore, the pipeline computes several further metrics, including four coverage metrics: MiLC, MaLC, MiBC and MaBC. They correspond to the micro (Mi) and macro (Ma) averages of the line coverage (LC) and branch coverage (BC). The micro average is the proportion of the total lines or branches of the MUTs that were covered by the tests, whereas the macro average is computed per MUT and then averaged. Therefore, MiLC and MiBC account for the fact that MUTs vary in complexity and weigh methods with many lines or branches higher than those with few. Consequently, we use them as the main coverage metrics. We additionally report MaLC and MaBC because, in comparison with their micro average counterparts, they indicate whether the test suite covers mostly small or large MUTs. Besides coverage, we further report five count metrics that quantify how many test cases remain at certain stages of the pipeline:

1. Unique: Number of test cases after de-duplication.
2. Parsable: Number of test cases that comply with the Java grammar, absolute and relative to Unique.
3. Compilable: Number of test cases that were compiled in their scaffolding without error, absolute and relative to Unique.
4. Executable: Number of test cases that were executed without errors, absolute and relative to Unique.
5. Correct: Number of test cases that were executed without errors and called the correct MUT, absolute and relative to Unique. This metric is adopted from Tufano et al. [47] and is directly comparable, albeit computed on different repositories.

An example of all metrics can be found in Table 3.5, which displays the results of the ground truth test cases.
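To make the difference between the micro and macro averages concrete, the following minimal sketch computes MiLC and MaLC from per-MUT line counts; the numbers are purely illustrative.

def micro_macro_line_coverage(per_mut):
    """per_mut: list of (covered_lines, total_lines) tuples, one entry per MUT."""
    covered = sum(c for c, _ in per_mut)
    total = sum(t for _, t in per_mut)
    milc = covered / total                                   # micro: pooled over all lines
    malc = sum(c / t for c, t in per_mut) / len(per_mut)     # macro: mean of per-MUT coverage
    return milc, malc

# A small, fully covered MUT and a large, barely covered MUT:
# the micro average is dominated by the large MUT, the macro average is not.
print(micro_macro_line_coverage([(2, 2), (5, 50)]))  # (0.1346..., 0.55)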
3.6 Ground Truth Results

Table 3.5 displays the results of inserting the mapped test cases (called "ground truth test cases") back into their respective test classes and evaluating their coverage with the pipeline described above. Besides being used for the pipeline validation as discussed in subsection 3.4.7, the results yield some further insights regarding the dataset. For instance, the ground truth test cases achieve very high coverage, even though there are only 1.87 test cases per MUT on average, with an average length of 194 tokens. We conclude that the MUTs are usually testable with relatively few and short test cases, i.e., there usually is a good and simple solution. We also observe that for both line and branch coverage, the micro average is higher than the macro average. Since the micro average is equivalent to a macro average weighted by total lines, this indicates that the ground truth tests were more successful in covering complex MUTs than simple ones. This is opposed to test cases generated by deep learning-based approaches, where typically the macro average is higher. Further investigations are necessary to find the reasons for this opposing trend. The BLEU scores are obviously very high because we compare the ground truth test cases to themselves, but they are not 1 because of the preprocessing.

Ground Truth — BLEU: 0.9455, crystalBLEU: 0.9369

Repository | MiLC | MaLC | MiBC | MaBC | Unique | Parsable | Compilable | Executable | Correct
criteo-garmadon | 83.54% | 86.65% | 76.28% | 85.59% | 109 | 109 (100.00%) | 107 (98.17%) | 106 (97.25%) | 102 (93.58%)
bazaarvoice-ostrich | 11.50% | 11.27% | 10.32% | 10.92% | 158 | 157 (99.37%) | 157 (99.37%) | 157 (99.37%) | 153 (96.84%)
adorsys-XS2A-Sandbox | 93.98% | 96.03% | 85.51% | 92.20% | 180 | 180 (100.00%) | 180 (100.00%) | 178 (98.89%) | 178 (98.89%)
flipkart-incubator-Lyrics | 97.30% | 99.02% | 81.75% | 93.89% | 102 | 102 (100.00%) | 102 (100.00%) | 102 (100.00%) | 102 (100.00%)
seedstack-seed | 84.43% | 89.00% | 74.61% | 84.30% | 97 | 97 (100.00%) | 97 (100.00%) | 97 (100.00%) | 86 (88.66%)
intuit-QuickBooks-V3-Java-SDK | 70.19% | 75.29% | 59.55% | 75.53% | 238 | 238 (100.00%) | 238 (100.00%) | 237 (99.58%) | 237 (99.58%)
gooddata-gooddata-java | 77.44% | 87.48% | 83.58% | 88.34% | 377 | 375 (99.47%) | 375 (99.47%) | 375 (99.47%) | 374 (99.20%)
Flipkart-foxtrot | - | - | - | - | - | - | - | - | -
shroffk-phoebus | 74.98% | 87.02% | 67.39% | 81.06% | 285 | 285 (100.00%) | 279 (97.89%) | 277 (97.19%) | 274 (96.14%)
messaginghub-pooled-jms | 87.86% | 93.42% | 93.87% | 97.72% | 239 | 239 (100.00%) | 239 (100.00%) | 239 (100.00%) | 239 (100.00%)
nhl-dflib | 84.64% | 77.94% | 73.71% | 72.18% | 134 | 134 (100.00%) | 134 (100.00%) | 134 (100.00%) | 129 (96.27%)
monarch-initiative-phenol | 86.68% | 86.32% | 76.92% | 82.05% | 120 | 120 (100.00%) | 120 (100.00%) | 120 (100.00%) | 116 (96.67%)
SonarSource-orchestrator | - | - | - | - | - | - | - | - | -
Total | 78.32% | 68.42% | 71.96% | 66.44% | 2039 | 2036 (99.85%) | 2028 (99.46%) | 2022 (99.17%) | 1990 (97.60%)

Table 3.5: Evaluation results of re-inserting the ground truth test cases back into their test classes.

4 Context-Aware Prompt Tuning

In this chapter, we propose a novel method we call Context-Aware Prompt Tuning. It is fundamentally motivated by the need for more contextual information, as well as the limited context length and quadratic complexity of transformer language models. We further discuss this motivation in section 4.1 and present Context-Aware Prompt Tuning in section 4.2.

4.1 Motivation

One of the fundamental premises of this work is that contextual information is important for test case generation.
In the most basic setup of neural test case generation, the model translates from a MUT to a test case. However, we argue that this task setup is ill-posed because the input is missing important information, without which not even a human expert developer could write correct test cases. For instance, writing tests requires knowledge about the available classes, methods, fields and libraries from the current project. Whereas it is likely that models are able to use the APIs of popular libraries due to their usage in the pre-training corpus, it is unlikely, if not impossible, that the model can infer the APIs of the local project. Fine-tuning the model on a specific project [8] and saving this local knowledge in the model parameters is possible, but not feasible for many evolving projects. Instead, contextual information can be passed to the model in the prompt. The model is then not trained on a specific project, but simply trained to leverage the contextual information given in the prompt. Tufano et al. [47] investigated the effect of additional context for the test case generation task and were able to reduce the loss by almost 10%. Similarly, context from the current project has been shown to significantly improve performance on other code-related tasks as well [44, 53].

Specific to our benchmark is that contextual information is required not only from the focal class and potentially other classes in the main code, but also from the test class. The evaluation pipeline directly inserts generated test cases into a fixed scaffolding. This scaffolding already contains the imports and the test class declaration. The test class often also contains declared fields, setup code and helper methods. Generating correct test cases that fit into this test class requires knowledge of the test class; otherwise, the available names and signatures would have to be inferred or hallucinated by the model.

To summarize, in this task setup, a model has to process at least the MUT, the focal class as well as the test class, and infer a correct test case from that information. With the token limit of 2048 in PolyCoder, this is not possible: the complete information would require more than 7000 tokens on average. To mitigate the issue, prior work [47] has suggested omitting method bodies and only including their names and signatures, with the intuition that this already provides the most important information on how to use them. Whereas this approach reduces the total number of tokens to about 1900 on average, more than 16% of the samples would still exceed the hard limit of 2048 tokens. Furthermore, maxing out the token limit might not be desirable because of the quadratic memory and compute complexity of transformers with respect to the sequence length. We illustrate this in a toy example and measure the memory consumption of training the smallest PolyCoder model with just 160M parameters in 16-bit precision with a batch size of 4. Figure 4.1 presents the results of that experiment. Even in this minimalistic setting, a 32GB GPU is maxed out when approaching the maximum sequence length.

Figure 4.1: GPU memory consumption during training of PolyCoder-160M in FP16 with a batch size of 4 and varying sequence lengths.
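Such a measurement can be reproduced with a short script; a rough sketch is shown below. The checkpoint name and the chosen sequence lengths are assumptions (any causal LM checkpoint works), and the peak memory reported by PyTorch's allocator will differ slightly from nvidia-smi readings.

import torch
from transformers import AutoModelForCausalLM

# Assumed Hugging Face checkpoint name for the smallest PolyCoder model.
model = AutoModelForCausalLM.from_pretrained(
    "NinedayWang/PolyCoder-160M", torch_dtype=torch.float16
).cuda()
model.train()

for seq_len in [256, 512, 1024, 1536, 2048]:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    input_ids = torch.randint(0, model.config.vocab_size, (4, seq_len), device="cuda")
    # Forward and backward pass to account for activations and gradients.
    loss = model(input_ids=input_ids, labels=input_ids).loss
    loss.backward()
    model.zero_grad(set_to_none=True)
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"sequence length {seq_len}: peak memory {peak_gib:.1f} GiB")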
4.2 Method

Motivated by the issues discussed above, we propose Context-Aware Prompt Tuning (CAPT), a novel method that allows the model to ingest more context without quadratic scaling of the memory requirements. The idea is fundamentally based on soft prompt tuning, where the model learns continuous embeddings representing soft tokens that steer a frozen pre-trained LM towards performing a specific task. In CAPT, we aim to learn soft tokens that contain particular context information and thus steer the model towards generating test cases that better fit that context. To that end, we use small neural networks that first summarize and compress the context information into one or multiple soft tokens. We call these networks Prompt Generators (PGs). They take a piece of context as input and output an embedding that is then injected into the LM. Unlike in other methods [44], the output of the Prompt Generator (PG) is continuous, which allows joint end-to-end training; that is, we backpropagate the gradients from the language modelling loss through the language model to the PGs. Figure 4.2 presents an overview of the approach.

Figure 4.2: Context-Aware Prompt Tuning with test class and focal class context encoded as soft tokens with an all-layer injection strategy. The left-most prefix is a vanilla (instance-invariant) task prefix.

4.2.1 Context Types

The proposed method of encoding context with a PG into an embedding allows for flexible context modalities. Much like in [24], the context does not need to be of a textual nature, but could be images or videos. For test case generation, code is arguably one of the most important modalities. However, others could be beneficial as well. For instance, class hierarchies and call graphs could be encoded with graph neural networks to provide valuable information about the overall structure of the project, and natural language descriptions from documentation and Javadoc could provide low-level semantic information about classes and methods. In this work, we focus on code as the modality and investigate the possible compression rather than multi-modality.

As explained in section 4.1, the focal class and the test class are the two most important sources of information. Following previous work [47], we omit method bodies and try to include only the most relevant information:

1. Imports are relevant for the test class context because they introduce names that are available in the test case to generate. They are not relevant for the focal class because the model does not need to generate a method in its namespace.

2. The class name is relevant for multiple reasons. In the focal class, it provides semantic clues about the class functionality and gives the model partial information on how to instantiate an object of that class. In the test class, it mainly contributes to a coherent context, as providing syntactically incorrect code might confuse the model.

3. The method signatures are equally important for focal classes and test classes. They give semantic clues about the class and inform the model which methods are available and how to use them. To maintain syntactic correctness, we do not simply remove the method bodies but replace them with an empty block '{}'. In focal classes, we only include public methods. In test classes, we do not remove the body of the setup method marked by the '@Before' annotation. We argue that knowing the state of the objects instantiated in these setup methods is very important, especially for generating correct assertions, and that the setup code can provide this information to a high degree.

4. Lastly, the fields also provide both semantic clues and information on how to use the focal class. In the test class, they inform the model about the declared names and their types. As with methods, we only include public fields for the focal class context.

To optimally leverage knowledge from pretraining, we format this context in a way that should look as natural as possible to the model. We provide further details on the formatting and truncation of context in the experimental setup in subsection 5.1.1.
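As an illustration, the condensed focal class representation could be constructed along the lines of the following sketch. The parser used for this step is not specified here; the sketch assumes the third-party javalang library and a single top-level class per file, and drops details such as generic type arguments and array dimensions.

import javalang  # assumption: any Java parser works; javalang is used only for illustration

def condense_focal_class(source: str) -> str:
    """Build the simplified focal class context: class name, constructor and public
    method signatures with empty bodies, and public fields."""
    tree = javalang.parse.parse(source)
    cls = tree.types[0]  # assumes a single top-level class
    lines = [f"public class {cls.name} {{"]
    for ctor in cls.constructors:
        params = ", ".join(f"{p.type.name} {p.name}" for p in ctor.parameters)
        lines.append(f"    public {cls.name}({params}) {{}}")
    for method in cls.methods:
        if "public" not in method.modifiers:
            continue
        ret = method.return_type.name if method.return_type else "void"
        params = ", ".join(f"{p.type.name} {p.name}" for p in method.parameters)
        lines.append(f"    public {ret} {method.name}({params}) {{}}")
    for field in cls.fields:
        if "public" not in field.modifiers:
            continue
        for declarator in field.declarators:
            lines.append(f"    public {field.type.name} {declarator.name};")
    lines.append("}")
    return "\n".join(lines)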
4.2.2 Prompt Generator Architectures

The prompt generators (PGs) are the fundamental building blocks of CAPT. They compress contextual information in order to steer a LM. In this section, we first sketch the design space for PGs and afterward derive 6 concrete architectures from that space. Every prompt generator takes a context as input and outputs one or multiple soft tokens. In the following, we list the design decisions that we considered when designing prompt generators.

• A very fundamental decision is how to represent the context. Various representations like graph-based representations are possible, but for the sake of simplicity, we focus on token-based representations in this work.

• Next, the tokens need to be embedded. Here, we focus on lightweight approaches based on pre-trained models. Using the main LM itself to create context embeddings would likely yield more powerful representations, but it would also cause significant compute and memory overhead to perform multiple forward passes and thus defeat the purpose of CAPT. Instead, we consider two approaches: using the input embeddings of PolyCoder and using CodeBERT.

• After obtaining an embedding for every token, these embeddings need to be projected into the target space. This target space is either the input space of the LM (1024-dimensional) or the space of prefix tokens across all layers (1024 × 24 layers × 2 = 49,152-dimensional). This depends on the way the soft tokens are injected into the LM: either only at the input layer, as in Prompt Tuning [29], or at every layer, as in Prefix Tuning [31] and P-Tuning v2 [33]. We find that both approaches work similarly well. Since, at the time of implementation, passing input embeddings to a LM was not supported during generation (support was only recently added in transformers v4.27, see https://github.com/huggingface/transformers/issues/6535), we focus on injecting soft tokens at every layer; a sketch of this injection mechanism follows after this list.

The concrete implementation of how to project token embeddings into the target space also requires a few further architectural decisions. In particular, we perform two main steps: aggregation and projection.

• During aggregation, multiple token embeddings are combined into a fixed-size representation. That is, the aggregation step performs the actual compression of the technique. Overall, we consider four aggregation approaches: CLS token, sum, LSTM and MLP. Whereas CLS token, sum and LSTM can aggregate variable-length sequences into a fixed-size representation, an MLP requires a fixed-length input.

• During projection, the fixed-size representation from the aggregation step is projected into the target space. We consider a simple linear layer, as well as a 2-layer MLP with a bottleneck. In the latter, the first layer projects the aggregated embedding down to an even smaller embedding, followed by a non-linear activation function, before the second layer projects it up into the target space. Bottlenecks are also used in regular prefix tuning, and we can confirm in preliminary experiments that bottlenecks improve performance slightly. We therefore use bottlenecks in all setups.

• Lastly, an important consideration is how to scale the number of soft tokens that the prompt generator outputs. We see the concept of compression at the center of CAPT. Thus, we believe being able to flexibly change the rate of compression is crucial. The amount of information that can be stored in an embedding is limited, forming an information bottleneck. We implement two main ways of scaling. First, we use the same aggregated context representation for all tokens, but use separate projections. The major drawback is that all soft tokens are based on the same information, and thus scaling the number of soft tokens will likely not increase the performance much. Motivated by this reasoning, we also propose to use chunked aggregation, where the token embeddings are chunked before aggregation. Then each aggregated chunk embedding is projected into the target space. Thereby, every soft token is based on the information from different parts of the context.
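The all-layer injection can be realized on top of the key-value cache interface of the LM. The sketch below illustrates the mechanics for a GPT-NeoX-style model with 24 layers, 16 heads and a head dimension of 64 (matching the 49,152-dimensional prefix space); the tuple-based past_key_values format is an assumption about the transformers version and is not taken from this work.

import torch

def forward_with_prefix(model, input_ids, prefix, n_layers=24, n_heads=16, head_dim=64):
    """prefix: (batch, k, n_layers * 2 * n_heads * head_dim), produced by a prompt generator."""
    batch, k, _ = prefix.shape
    # Reshape into one (key, value) pair per layer with shape (batch, n_heads, k, head_dim).
    prefix = prefix.view(batch, k, n_layers, 2, n_heads, head_dim)
    prefix = prefix.permute(2, 3, 0, 4, 1, 5)  # (layers, 2, batch, heads, k, head_dim)
    past_key_values = tuple(
        (layer[0].contiguous(), layer[1].contiguous()) for layer in prefix
    )
    # The attention mask must also cover the k virtual prefix tokens.
    attention_mask = torch.ones(batch, k + input_ids.shape[1], device=input_ids.device)
    return model(input_ids=input_ids,
                 attention_mask=attention_mask,
                 past_key_values=past_key_values)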
Based on these general concepts and design considerations, we now present 6 prompt generator architectures. In the architecture visualizations, blue indicates frozen modules, red indicates trainable modules, green boxes represent chunks and the right-most boxes with black borders represent the soft tokens that will be injected into the LM. We use trapezoids to illustrate linear projections that alter the dimensionality. All architectures use tanh as the non-linear activation function in the bottleneck. We indicate models that chunk the context with the suffix C.

1. In BERT, we pass the context through CodeBERT [18] and use the embedding of the CLS token as a holistic representation of the context. This representation is then fed through a projection MLP with a bottleneck to obtain a soft token. To create k soft tokens, the CLS representation is projected with k different MLPs. With a bottleneck dimension of 512, the number of trainable parameters amounts to 110M in CodeBERT and k × (768 × 512 + 512 × 49,152) in the projection modules, totaling ≈ 135M parameters for 1 soft token and ≈ 365M parameters for 10 soft tokens. Therefore, the number of parameters scales linearly with k, making it only a viable option if k is small. Furthermore, all information has to be contained in the CLS token, such that increasing k would likely not result in significantly better results.

2. Contrarily, in BERT C, we divide the sequence of context tokens into fixed-size chunks and concatenate the embeddings of all tokens in each chunk. The chunk size is determined based on the number of soft tokens k as cs = floor(512/k). We project the chunk embeddings into the target space with a bottleneck of 1024. Since each of the k soft tokens is based on different parts of the context, we share the projection module among them. Consequently, the number of trainable parameters is independent of k and amounts to 110M for CodeBERT and cs × 1024 + 1024 × 49,152 ≈ cs × 1024 + 50M. We expect this architecture to have better scaling behavior than BERT because increasing k will make the chunks smaller, thus decreasing the compression rate and alleviating the information bottleneck.

3. In LSTM and all further architectures, we use the input embedding layer of PolyCoder to embed context tokens.
This approach is much more lightweight than CodeBERT and has the potential advantage that it uses the same tokenizer and embedding space as the LM. In this variant without chunking, we process the complete context with a single-layer LSTM and use the last token embedding as a holistic representation. The projection layer is equivalent to the one described for BERT. In that sense, the setup also shares its drawbacks of limited scaling possibilities and a small information bottleneck. The number of trainable parameters is 8 × 1024 × 1024 ≈ 8M for the LSTM and k × (1024 × 512 + 512 × 49,152) ≈ k × 25M for the projection module.

4. In LSTM C, we use the same setup as in LSTM but chunk the context tokens into k chunks, process each chunk with an LSTM separately and use the embedding of the last token of each chunk as the chunk embedding. As with BERT C, the projection module is shared among the k soft tokens and increasing k will alleviate the information bottleneck. The number of trainable parameters is independent of k and totals about 33M.

5. In Sum, we also use PolyCoder's input embeddings but use the sum as a non-parametric approach to aggregation. In this non-chunked variant, the sum of all token embeddings is used as a holistic representation and projected into the target space with the same projection module as in LSTM and BERT. The number of trainable parameters is k × (1024 × 512 + 512 × 49,152) ≈ k × 25M.

6. Lastly, we propose the chunked variant of Sum, Sum C, where the embeddings of all tokens in a chunk are summed to form the chunk embedding. The projection module is the same as in LSTM C and BERT C. The number of trainable parameters is ≈ 25M. A code sketch of this variant is shown below.
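To make the design concrete, the following is a minimal PyTorch sketch of the chunked sum variant (Sum C), using the dimensions stated above (1024-dimensional embeddings, 24 layers, a bottleneck of 512, which yields roughly 25M trainable parameters). It illustrates the idea rather than reproducing the exact implementation.

import torch
import torch.nn as nn

class SumChunkPromptGenerator(nn.Module):
    """Sum C: chunk the context, sum the token embeddings per chunk, and project each
    chunk embedding into the all-layer prefix space with a shared bottleneck MLP."""

    def __init__(self, input_embedding, k=10, d_model=1024, n_layers=24, bottleneck=512):
        super().__init__()
        self.input_embedding = input_embedding   # input embedding of the LM, assumed frozen
        self.k = k                               # number of soft tokens
        target_dim = n_layers * 2 * d_model      # one key and one value vector per layer
        self.projection = nn.Sequential(
            nn.Linear(d_model, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, target_dim),
        )

    def forward(self, context_ids):              # context_ids: (batch, seq_len), seq_len >= k
        emb = self.input_embedding(context_ids)                          # (batch, seq_len, d_model)
        chunks = emb.chunk(self.k, dim=1)                                # k chunks along the sequence
        chunk_emb = torch.stack([c.sum(dim=1) for c in chunks], dim=1)   # (batch, k, d_model)
        return self.projection(chunk_emb)                                # (batch, k, target_dim)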
5 Experiments

In this chapter, we describe the experiments that we performed to investigate five research questions (RQs). We first describe the experimental setup in section 5.1. Afterward, we define the RQs and introduce the experiments we designed to answer each RQ in section 5.2. Finally, we present the results in section 5.3.

5.1 Experimental Setup

In the following subsections, we provide details about pre-processing, training, generation, as well as evaluation metrics.

5.1.1 Pre-processing

During pre-processing, we tokenize the different parts of the input and assemble them. In all experiments, we use a maximum sequence length of 1850, which leaves room for 198 potential soft tokens. We always pad to the longest sequence in each batch and truncate to the maximum sequence length. As elaborated on in subsection 4.2.1, we identified four essential components of focal and test classes: imports, class name, method signatures and fields. During pre-processing, we want to make sure that the condensed context representations look as much as possible like real code. To that end, we replace method bodies with empty blocks '{}' instead of just removing them. We place MUTs inside their focal class and test cases in their test class, and account for indentation. We furthermore argue that truncating the context naively would likely result in incomplete classes and statements, negatively conditioning the LM. We therefore implement a truncation procedure where, if the context is too long, we successively remove parts of the context according to a prioritization without sacrificing syntactic correctness. In particular, we go through the steps listed below, check if the context is short enough, and only continue to the next step if it is not.

We define this prioritization for the scenario of fitting both test class and focal class context into a fixed token budget, which is the case when using both contexts as hard prompt. In other scenarios, we skip the steps that work on the respective other context. When using both contexts, we generally prioritize test class context over focal class context, based on the insights of RQ2.

1. We first remove the fields of the focal class and thus assign them the lowest priority, based on the intuition that fields in the focal class are mostly relevant for the inner workings of the class and that usage is defined via interfaces and methods.

2. Next, we remove the method signatures of the focal class. However, we keep constructor signatures, given their special importance in test cases.

3. When the focal class representation is reduced to its minimum (class name and constructor signatures) and the context still exceeds its token limit, we remove the method signatures of the test class. We argue that, compared to imports and fields, using helper methods is useful but not necessary for test cases.

4. Afterward, we remove the imports of the test class, with the intuition that imports and fields convey similar information in the test class (which names are available) and that imported names are easier to infer than field names.

5. Lastly, we remove the fields of the test class, leaving minimal representations for both focal and test class context. If they still exceed the token limit (which occurs in less than 1% of the cases), we truncate naively.

An example of the formatting and a visualization of this prioritization can be found in Figure 5.1.

5.1.2 Training Procedure

For all neural models, we use the PolyCoder model [52] with 400M parameters. The PolyCoder family of models is publicly available, has good performance and is, unlike CodeGen [35], transparent with respect to the pre-training corpus. The variant with 400M parameters allows a more direct comparison with AthenaTest [47] and enables faster experimentation compared to the 2.7B variant. In prefix tuning, we use 10 soft tokens, as that performs best in preliminary experiments. For optimization, we use the regular cross-entropy loss and the AdamW optimizer from PyTorch. We set the learning rate to 5e-6 for prefix tuning and to 1e-6 for full fine-tuning. We train for 3 epochs with a batch size of 8, evaluate on the validation set 4 times per epoch and save a checkpoint each time. The checkpoint corresponding to the best validation loss is used for testing.

5.1.3 Generation Procedure

We sample from the trained models with top-k = 50 and top-p = 0.9. Furthermore, we increase the temperature from 0.05 to 0.4 in increments of 0.05. We generate 4 samples per temperature, i.e., a total of 32 samples per MUT. In preliminary experiments, we found this setting to provide a good tradeoff between diversity and quality. These experiments also showed that increasing the number of samples can significantly improve the coverage; e.g., increasing it from 32 to 96 increased the coverage by almost 50%. We use the moderate number of 32 samples to save computational resources and to stay somewhat comparable to AthenaTest, which uses 30 samples. The post-processing is performed as described in subsection 3.4.1.
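This sampling setup corresponds roughly to the following sketch based on the Hugging Face generate API; the maximum number of new tokens is an assumption, since the generation length limit is not stated here.

import torch

def sample_test_cases(model, tokenizer, prompt, device="cuda"):
    """Draw 4 samples at each temperature from 0.05 to 0.40 (32 samples per MUT)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    prompt_len = inputs["input_ids"].shape[1]
    samples = []
    for step in range(1, 9):
        temperature = round(0.05 * step, 2)
        with torch.no_grad():
            output = model.generate(
                **inputs,
                do_sample=True,
                top_k=50,
                top_p=0.9,
                temperature=temperature,
                num_return_sequences=4,
                max_new_tokens=256,  # assumed generation budget
                pad_token_id=tokenizer.eos_token_id,
            )
        completions = output[:, prompt_len:]
        samples.extend(tokenizer.batch_decode(completions, skip_special_tokens=True))
    return samples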
// Focal Class:
public class SwingListbox extends AbstractSwingContainer
    implements XulListbox, ListSelectionListener {
  // Methods
  public SwingListbox(
      Element self,
      XulComponent parent,
      XulDomContainer container,
      String tagName
  ) {}
  public Object getManagedObject() {}
  ...
  public void setCommand(final String command) {}
  // Fields
  public int counter = 0;
  // Method under Test
  public String getSeltype() {
    return selType;
  }
}
// Test Class:
import static org.junit.Assert.assertEquals;
...
import org.pentaho.ui.xul.swing.SwingXulLoader;

public class SwingListboxTest {
  // Fields
  Document doc = null;
  XulDomContainer container;
  XulListbox list;
  // Helper Methods
  @Before public void setUp() throws Exception {
    // Do not run on headless environment
    Assume.assumeTrue(!GraphicsEnvironment.isHeadless());
    container = new SwingXulLoader().loadXul("documents/listbox.xul");
    doc = container.getDocumentRoot();
    list = (XulListbox) doc.getElementById("listbox");
  }
  private static String toString(int[] is) {}
  // Unit Test Case
  @Test public void testSeltype() throws Exception {
    assertEquals("single", list.getSeltype());
  }
}

Prioritization of the components: Class Name and Constructor: Priority 6; Focal Class Methods: Priority 2; Focal Class Fields: Priority 1; Test Class Fields: Priority 5; Test Class Imports: Priority 4; Test Class Methods: Priority 3; MUT: Priority 6; Class Name: Priority 6; Test Case: Priority 6.

Figure 5.1: Example of the formatting of focal class and test class context. The annotations indicate the prioritization of components when not all context fits into the token limit; Priority 6 means highest priority, Priority 1 means lowest priority.

5.1.4 Evaluation metrics

As described in section 3.5, we calculate the four coverage metrics MiLC, MaLC, MiBC and MaBC from the coverage reports generated by Tests4J. Among these, we use MiLC, the ratio of total covered lines to total lines, as our main evaluation metric. Furthermore, we consider the Correct metric proposed by Tufano et al. [47], measuring the fraction of test cases that are executable and call the correct MUT. We furthermore report the test loss, BLEU and crystalBLEU scores as intrinsic metrics.

5.2 Research Questions

RQ1 Training Methods: How effective is prefix tuning for test case generation?

Prefix Tuning has been applied to a plethora of tasks in NLP and code intelligence, but there is no clear consensus on whether it is worse than, equal to or better than full fine-tuning in terms of the final performance. To the best of our knowledge, no papers have applied prefix tuning as the main training method to the task of test case generation. Zlotchevski et al. [54] use prefix tuning for adapting to specific repositories, but the model was first fully fine-tuned on test case generation. We close that gap by comparing the performance of prefix tuning, full fine-tuning and zero-shot prompting on Tests4J.

RQ2 Context Information: How does context information affect the performance when provided in the hard prompt?

Tufano et al. [47] found that providing AthenaTest with focal class context reduces its loss by about 10%. However, they do not evaluate the performance impact with further, more important metrics such as coverage or BLEU scores. In contrast, we evaluate the effect of context using Tests4J, presenting both intrinsic and execution-based metrics. We always provide as much context as possible using the method described in subsection 5.1.1 and additionally consider the test class context.

RQ3 Effectiveness of CAPT: Can CAPT effectively compress context information while retaining performance?
Motivated by the importance of context information and the quadratic complexity of transformers, we proposed Context-Aware Prompt Tuning (CAPT) in chapter 4. In this RQ, we evaluate the effectiveness of CAPT. We first investigate the limits of how much information can be stored in a single soft token to get foundational insights about the mechanism. Afterward, we evaluate CAPT on Tests4J and compare the performance between not providing context, providing it via CAPT and providing it in the hard prompt.

RQ4 Comparison with Evosuite: How do neural models compare to the state-of-the-art non-neural test case generator on Tests4J?

Evosuite [19] is one of the most established tools for non-neural test case generation and typically achieves very high coverage scores. In a study [20], Evosuite was even found to outperform human developers in terms of coverage. While it excels at generating tests with high coverage, other quality properties like readability and maintainability lag behind human-written tests [22, 38, 23, 14]. In contrast, neural models are specifically trained to generate tests that look similar to their training distribution, i.e., to human-written code. Consequently, they generate test cases that are much more readable for humans. For instance, 61% of human developers favored AthenaTest's test cases over Evosuite's in terms of readability, 10% favored Evosuite's and 29% found them equally readable [47]. In this work, we do not further evaluate readability, but focus on the performance metrics computed by Tests4J. The authors of AthenaTest found that it achieved comparable coverage to Evosuite on a single focal class. We argue that, due to the high variance in coverage between repositories, an evaluation at such a small scale is not sufficient. We close that gap by evaluating neural models and Evosuite on Tests4J, which includes 484 focal classes from 11 repositories in the test set. To the best of our knowledge, we are the first to compare Evosuite to neural approaches at scale.

RQ5 Evaluation Methodology: How important is large-scale coverage evaluation?

Tests4J computes both intrinsic and execution-based metrics for test case generators. Among these, loss is by far the easiest to compute because it is available at training time and does not require execution. BLEU and crystalBLEU are available after sampling from the trained LM. Finally, the coverage and count metrics from Tests4J require both sampling from the LM and execution, which makes them the most time-consuming. To guide future research on the methodology of training and evaluating neural test case generators, we investigate the importance of large-scale coverage evaluation. In particular, we aim to find out whether the cheaper-to-compute metrics mentioned above are good indicators of coverage and how important the scale is for a reliable evaluation.

5.3 Results

In this section, we present the results for the research questions introduced in section 5.2. Table A.21 summarizes the results of all evaluated setups. The detailed evaluation tables are presented alongside the respective RQs. We name every run according to its training algorithm, context information and the way the context is represented.
More specifically, run names start with the training algorithm (FT for fine-tuning, PT for prefix tuning and ZS for zero-shot), followed by the context in the hard prompt as Hard-All, Hard-TCl, Hard-FCl or Hard-None, and the context in the soft prompt as [PG]-All, [PG]-TCl or [PG]-FCl, where [PG] is one of the prompt generator architectures presented in chapter 4.

Name | Training Algorithm | Test Class (TCl) Context | Focal Class (FCl) Context | MiLC | Correct | Loss | crystalBLEU | BLEU
Evosuite | - | - | - | 37.78% | - | - | - | -
FT-Hard-All | Fine-tuning | Hard | Hard | 13.55% | 2.81% | 0.6743 | 0.0473 | 0.0808
ZS-Hard-All | Zero-Shot | Hard | Hard | 3.07% | 0.29% | 1.0330 | 0.0237 | 0.0509
PT-Hard-All | Prefix Tuning | Hard | Hard | 9.76% | 1.16% | 0.6825 | 0.0355 | 0.0677
PT-Hard-TCl | Prefix Tuning | Hard | - | 5.48% | 0.82% | 0.6887 | 0.0378 | 0.0691
PT-Hard-FCl | Prefix Tuning | - | Hard | 3.28% | 0.73% | 0.8091 | 0.0323 | 0.0647
PT-Hard-None | Prefix Tuning | - | - | 1.79% | 0.12% | 0.8766 | 0.0283 | 0.0550
PT-Hard-None-BERT-TCl-1 | Prefix Tuning | BERT (l = 1) | - | 1.79% | 0.19% | 0.8944 | 0.0257 | 0.0557
PT-Hard-None-BERT-TCl-10 | Prefix Tuning | BERT (l = 10) | - | 1.82% | 0.19% | 0.8937 | 0.0285 | 0.0586
PT-Hard-None-BERT C-TCl-30 | Prefix Tuning | BERT C (l = 30) | - | 2.03% | 0.23% | 0.9004 | 0.0304 | 0.0611
PT-Hard-None-BERT C-TCl-90 | Prefix Tuning | BERT C (l = 90) | - | 1.92% | 0.28% | 0.9051 | 0.0328 | 0.0654
PT-Hard-None-BERT C-TCl-150 | Prefix Tuning | BERT C (l = 150) | - | 1.89% | 0.23% | 1.1792 | 0.0306 | 0.0627
PT-Hard-None-BERT C-TCl-188 | Prefix Tuning | BERT C (l = 188) | - | 1.79% | 0.38% | 1.1810 | 0.0312 | 0.0632
PT-Hard-None-LSTM-TCl-1 | Prefix Tuning | LSTM (l = 1) | - | 1.49% | 0.18% | 0.8957 | 0.0281 | 0.0587
PT-Hard-None-LSTM-TCl-10 | Prefix Tuning | LSTM (l = 10) | - | 1.81% | 0.22% | 0.8908 | 0.0305 | 0.0618
PT-Hard-None-LSTM C-TCl-30 | Prefix Tuning | LSTM C (l = 30) | - | 1.55% | 0.24% | 0.8964 | 0.0322 | 0.0644
PT-Hard-None-LSTM C-TCl-90 | Prefix Tuning | LSTM C (l = 90) | - | 2.03% | 0.27% | 0.8941 | 0.0295 | 0.0608
PT-Hard-None-LSTM C-TCl-150 | Prefix Tuning | LSTM C (l = 150) | - | 1.77% | 0.22% | 0.8951 | 0.0289 | 0.0596
PT-Hard-None-LSTM C-TCl-188 | Prefix Tuning | LSTM C (l = 188) | - | 1.68% | 0.19% | 1.1903 | 0.0292 | 0.0601
PT-Hard-None-Sum-TCl-1 | Prefix Tuning | Sum (l = 1) | - | 1.85% | 0.22% | 0.8939 | 0.0272 | 0.0532
PT-Hard-None-Sum-TCl-10 | Prefix Tuning | Sum (l = 10) | - | 1.95% | 0.32% | 0.8978 | 0.0291 | 0.0560
PT-Hard-None-Sum C-TCl-30 | Prefix Tuning | Sum C (l = 30) | - | 2.08% | 0.23% | 0.8939 | 0.0292 | 0.0604
PT-Hard-None-Sum C-TCl-90 | Prefix Tuning | Sum C (l = 90) | - | 1.76% | 0.26% | 0.8922 | 0.0296 | 0.0608
PT-Hard-None-Sum C-TCl-150 | Prefix Tuning | Sum C (l = 150) | - | 1.58% | 0.32% | 0.8929 | 0.0292 | 0.0597
PT-Hard-None-Sum C-TCl-188 | Prefix Tuning | Sum C (l = 188) | - | 1.98% | 0.24% | 0.8937 | 0.0296 | 0.0611

Table 5.1: Summary of the results.

5.3.1 RQ1: Training Methods

To compare the three training paradigms, we use the setup with all context information in the hard prompt. We find that full fine-tuning (Table 5.2) significantly outperforms prefix tuning (Table 5.3). For instance, the MiLC is almost 40% higher (13.55% vs. 9.76%) and the number of correct test cases is more than twice as high (625 vs. 274). Moreover, both the coverage and the fraction of correct test cases are higher on every project except one. The loss of full fine-tuning is better as well, but the difference is rather small (≈ 1.2%) compared to the other metrics. The performance of PolyCoder in a zero-shot setting (Table 5.4) is unsurprisingly much worse than both fine-tuning and prefix tuning. For three repositories, it failed to generate any correct test cases. The fraction of parsable test cases, however, is about the same as with full fine-tuning and prefix tuning.
We argue that this is because the ability to generate syntactically correct code is mainly acquired during pre-training. We furthermore find that the fraction of compilable test cases is very similar between all training setups, and that the fraction of executable test cases is much higher in the zero-shot setting. We hypothesize that this is because, without training, the model generates a large fraction of meaningless test cases that, for instance, only declare a number of primitive variables without ever invoking any project functionality. Thereby, these test cases avoid the challenge of correctly using local APIs. We find support for this hypothesis by comparing the fraction of executable test cases that invoke the MUT (Correct / Executable) and the fraction of test cases that use at least one assert statement. Both values are significantly higher for prefix tuning than for zero-shot prompting, and significantly higher for full fine-tuning than for prefix tuning. We conclude that generating test cases that invoke the MUT and assert the program state is a skill acquired during fine-tuning, and that full fine-tuning is more effective than prefix tuning in that regard. In all models, the rate of compilable tests is rather low (12.24% to 13.07%) compared to the

FT-Hard-All — loss: 0.6743, BLEU: 0.0808, crystalBLEU: 0.0473

Repository | MiLC | MaLC | MiBC | MaBC | Unique | Parsable | Compilable | Executable | Correct
criteo-garmadon | 6.91% | 6.43% | 5.58% | 5.79% | 1231 | 1050 (85.30%) | 157 (12.75%) | 56 (4.55%) | 11 (0.89%)