Improving the Quality of GitHub Copilot Generated Unit Tests
Unit tests are essential for software quality control, but remain costly and tedious to write.
AI coding assistants like GitHub Copilot can generate unit tests automatically, yet the practical effectiveness of such tests is poorly understood.
Prior work has mostly focused on AI models in isolation, leaving open how actual tools like Copilot perform in realistic developer workflows.
This paper presents a systematic study of Copilot-generated unit tests for Java across five real-world projects.
We evaluate three factors for improving test quality (example prompt wordings follow the list):
(i) prompts to prevent compile errors,
(ii) prompts emphasizing testing best practices and fault detection, and
(iii) the choice of the underlying AI model.
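To illustrate the first two factors (paraphrased for this abstract, not the study's exact wording): a compile-focused prompt might add an instruction such as "Make sure the test class compiles; use only classes and methods that exist in the given code," while a goal-setting prompt might add "Write tests that maximize fault detection, including boundary values and error-handling paths."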
Our evaluation combines compilation rate, line coverage, and mutation coverage, capturing syntactic correctness, structural coverage, and fault-detection effectiveness.
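For reference, mutation coverage (also called mutation score) measures the fraction of small, artificially seeded faults (mutants) that at least one test detects; in its common tool-reported form:

    mutation coverage = killed mutants / total generated mutants

A suite with high line coverage but weak assertions can still score low on mutation coverage, which is why the two metrics offer complementary perspectives.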
A custom tool automates test generation, prompt variation, and evaluation without human intervention.
We find that prompts aimed at preventing compile errors are not beneficial in practice.
Goal-setting prompts that instruct Copilot to maximize fault detection lead to a sharp decline in compilation success, limiting the number of usable test classes.
However, for those classes that do compile, these prompts consistently yield higher mutation coverage than both the standard prompt and developer-written baselines, highlighting their potential for strong fault-detection effectiveness once the compilation problems are addressed.
The choice of the underlying LLM is the most influential factor: GPT-4.1 achieves the highest compilation rate, while Anthropic's Claude Sonnet variants (e.g., Sonnet 3.7, Sonnet 4) deliver higher mutation coverage and greater fault-detection capability, albeit at the cost of more frequent compile errors.
This creates a practical trade-off: developers seeking stronger fault sensitivity may prefer the Sonnet models, provided they can tolerate higher error rates, while those prioritizing fewer compile errors may favor GPT-4.1.
Finally, we show how simple and pragmatic post-processing can substantially increase the practical usability of generated tests.
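As a minimal sketch of such post-processing (our illustration under an assumed file layout and classpath; the paper's pipeline may differ, e.g., by repairing rather than discarding tests), one can compile each generated test class with the JDK's built-in compiler API and keep only the classes that compile:

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class FilterCompilableTests {
        public static void main(String[] args) throws Exception {
            // Returns null on a JRE without the compiler; assumes a full JDK.
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            Path generatedDir = Paths.get("generated-tests"); // hypothetical output directory
            List<Path> kept = new ArrayList<>();
            try (DirectoryStream<Path> files =
                     Files.newDirectoryStream(generatedDir, "*.java")) {
                for (Path file : files) {
                    // compiler.run(...) returns 0 on success; classpath entries are placeholders
                    int exitCode = compiler.run(null, null, null,
                            "-classpath", "target/classes:libs/junit.jar",
                            file.toString());
                    if (exitCode == 0) {
                        kept.add(file);     // compilable: keep for coverage/mutation analysis
                    } else {
                        Files.delete(file); // not compilable: drop from the suite
                    }
                }
            }
            System.out.println("Usable test classes: " + kept.size());
        }
    }

Even this crude filter turns a batch of partially broken generated tests into a suite that can actually be executed and measured, which is the precondition for any coverage or mutation analysis.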
This study provides actionable guidance for practitioners using GitHub Copilot, translating empirical results into concrete decisions about prompts, models, and repair strategies.
Talk language: English
Level: Scientific
Target group:
Company:
TUM (university; I am not affiliated with a company for this paper)
Max Schallermayer