Motivation. Software tests are a necessity in software development to ensure functionality, reliability, and usability [10]; however, they are costly and time-consuming [6]. Although tool support for software testing has advanced, there remains considerable potential for enhancement. Many software tests are still devised manually, with the creation of unit tests being particularly laborious. Automating test case generation is therefore a promising way to streamline this aspect of software testing [6].
Large Language Models (LLMs) have exhibited capabilities in code generation [11, 13--15], test case generation [17], and various other domains [11]. Performance gains in transformer-based LLMs are mainly achieved by scaling up model size together with training data size [7, 8]. However, this approach incurs high computational costs that only corporations with significant financial resources can afford. This highlights the need for transformer-based LLMs that perform well on a specific downstream task while remaining cost-efficient. Addressing this, we focused on supervised fine-tuning (SFT) of the more resource-efficient transformer-based LLMs LLaMA 2 13B, Code Llama 13B, and Mistral 7B for the specific downstream task of generating test cases for mobile applications.
Research questions. This work investigated: (1) Does SFT enhance the capabilities of a transformer-based LLM in the specific downstream task of generating test cases for mobile applications while remaining cost-efficient and runnable on standard consumer hardware? (2) Does the fine-tuned model outperform other state-of-the-art models in the task of test generation for mobile applications?
Approach. Our approach is a modification of the ATHENATEST approach [16]; in contrast to it, we apply supervised fine-tuning (SFT) to both pre-trained and already fine-tuned transformer-based LLMs for the task of generating test cases for mobile applications in Dart.
The approach involves three steps, as illustrated in Figure 1. Firstly, a labeled dataset of corresponding input-output pairs (X, Y) was obtained to model the conditional probability P(Y|X; θ) [9, 12]. Dart code and corresponding test files were extracted from open-source GitHub repositories using Google BigQuery. These files were then matched using regular expressions, pairing each code file with its corresponding test file based on matching base filenames. The dataset underwent quality filtering and deduplication, resulting in 16,252 input-output pairs, which were then divided into a training (90%) and a validation (10%) set. The training set comprises a total of 88.5M tokens, measured with the LLaMA tokenizer.
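To illustrate the filename-based matching, the following is a minimal sketch assuming the common Dart convention that a test for a source file named <name>.dart is stored as <name>_test.dart; the helper name and example paths are hypothetical, not the exact pipeline code:

```python
import re

# Hypothetical matching step: pair each Dart source file with a test file that
# shares the same base filename, following the `<name>_test.dart` convention.
def pair_code_and_tests(paths):
    sources, tests = {}, {}
    for path in paths:
        name = path.rsplit("/", 1)[-1]
        m = re.fullmatch(r"(?P<base>.+)_test\.dart", name)
        if m:
            tests[m.group("base")] = path
        elif name.endswith(".dart"):
            sources[name[: -len(".dart")]] = path
    # Keep only base filenames that occur both as a code file and as a test file.
    return [(sources[b], tests[b]) for b in sources.keys() & tests.keys()]

pairs = pair_code_and_tests([
    "lib/src/calculator.dart",
    "test/calculator_test.dart",
    "lib/main.dart",  # no matching test file, filtered out
])
print(pairs)  # [('lib/src/calculator.dart', 'test/calculator_test.dart')]
```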
Secondly, for SFT on the downstream task of test generation, models were selected based on their code generation capabilities, as indicated by their pass@1 scores on the HumanEval [2] and MBPP [1] benchmarks, their parameter sizes, and the extent to which they had been trained on Dart data. Preference was given to open-source models with code generation abilities that can run on cost-efficient consumer hardware.
Thirdly, in the SFT process, the test generation task was framed as a translation task, in line with ATHENATEST [16]. This was achieved by employing the following structured prompt format for SFT [9]:
"{prefix_prompt} ### Code: {code} ### Test: {test}"
No prefix prompt was used during SFT in this work.
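To make the format concrete, the following is an illustrative sketch of how a single training example could be assembled; the function name and the Dart snippet are hypothetical, and leaving the prefix prompt empty reproduces the SFT setting described above:

```python
# Illustrative assembly of one SFT training example in the structured prompt
# format; an empty prefix_prompt matches the setting used in this work.
def build_sft_example(code: str, test: str, prefix_prompt: str = "") -> str:
    return f"{prefix_prompt} ### Code: {code} ### Test: {test}"

example = build_sft_example(
    code="int add(int a, int b) => a + b;",
    test="test('adds two integers', () { expect(add(2, 3), 5); });",
)
print(example)
```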
Fine-tuning. Fine-tuning was conducted on a single-GPU system using Flash Attention 2 [3] and the QLoRA method [4] to reduce memory usage and the number of trainable parameters. The fine-tuning runs took up to 32 hours, resulting in total emissions of 13.099 kgCO2eq [5].
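A condensed sketch of such a QLoRA setup with the Hugging Face transformers and peft libraries is shown below; the model identifier and the LoRA hyperparameters are placeholders rather than the exact configuration used for TestGen-Dart:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the frozen base model in 4-bit NF4 quantization and enable
# Flash Attention 2; only the small LoRA adapter weights are trained.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-13b-hf",        # placeholder base model
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # placeholder hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the reduced trainable parameter count
```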
Experimental Results. The TestGen-Dart models were evaluated on their unit testing capabilities in Dart against the base models LLaMA 2 13B, Code Llama 13B, and Mistral 7B. The models were loaded in both float16 and 4-bit quantized configurations, and the evaluation covered nine different Dart files encompassing 42 test cases. The results were obtained in a zero-shot setting using the structured prompt format described in the approach section, with a prefix prompt instructing the models to generate unit tests: "Generate unit tests in Dart for the following class. The unit test should be structured with the 'test' function, an appropriate description, and an assertion 'expect' within the function to validate the test case." The generated unit tests were classified into three categories: syntax errors (SE), syntactic correctness (SC), and functional correctness (FC). In the 4-bit quantized configuration, TestGen-Dart_v0.2 improved the generation of syntactically correct unit tests by 15.38% and of functionally correct unit tests by 16.67% compared to its base model, Code Llama 13B; it also performed better in the float16 configuration. This shows that SFT increases the capability of transformer-based LLMs in a specific downstream task, here the generation of test cases for mobile applications, addressing the first research question posed in this work. Additionally, TestGen-Dart_v0.2 outperformed the other state-of-the-art models of interest, LLaMA 2 13B and Mistral 7B, in this task, addressing the second research question.
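For illustration, a minimal sketch of how the zero-shot evaluation prompt can be constructed from this prefix and a Dart class under test; the Counter class is a hypothetical example, not one of the nine evaluation files:

```python
# Hypothetical evaluation helper: wraps a Dart class in the structured prompt,
# using the instruction as prefix and leaving the test part empty for the model.
PREFIX_PROMPT = (
    "Generate unit tests in Dart for the following class. The unit test should be "
    "structured with the 'test' function, an appropriate description, and an "
    "assertion 'expect' within the function to validate the test case."
)

def build_eval_prompt(dart_code: str) -> str:
    return f"{PREFIX_PROMPT} ### Code: {dart_code} ### Test:"

dart_class = "class Counter { int value = 0; void increment() => value++; }"
print(build_eval_prompt(dart_class))
```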
Conclusion. This work demonstrates that SFT enhances the capability of transformer-based LLMs in generating test cases for mobile applications in Dart. Furthermore, the 13B parameter size of TestGen-Dart enables it to run locally on standard consumer hardware, potentially making it a cost-efficient and privacy-friendly testing assistant for software developers, as no connection to an external server is required to run the model.
Outlook. Future work, currently in progress, may extend this approach to other programming languages and refine TestGen-Dart's performance with higher-quality fine-tuning data, either synthetic or human-annotated. Additionally, the evaluation method may be strengthened by using TestGen-Dart to generate test cases for dummy applications and measuring the resulting code coverage.