September 6, 2024
Large Language Models (LLMs), such as OpenAI’s ChatGPT and GitHub Copilot, have revolutionized the software development landscape by automating code generation. These AI-powered tools promise to enhance productivity, streamline rapid prototyping, and facilitate education. However, generated code is only useful if its reliability and quality match the standards of manually written code. To evaluate how well current LLMs generate correct, high-quality code, a team of researchers comprising Robin Beer, Alexander Feix, Tim Guttzeit, Tamara Muras, Vincent Müller, Maurice Rauscher, Florian Schäffler and Welf Lowe conducted a series of controlled experiments comparing ChatGPT and GitHub Copilot. The results were published in their co-authored research paper, "Analysis of Code and Test-Code Generated by LLMs". This blog post explores the findings of these experiments, the methodologies used, and the implications for the future of AI-driven software development.
Introduction: The Rise of AI-Powered Code Generation
Artificial Intelligence (AI) is increasingly integrated into modern software development, with tools designed for bug detection, program analysis, test automation, and code generation. Among these innovations, LLMs have gained significant attention, promising to automate many aspects of coding. The goal is to not only increase developer productivity but also potentially redefine programming itself, allowing developers to describe the desired outcomes instead of writing detailed code.
However, this shift raises important questions:
- Can LLMs generate code that is both correct and clean?
- How do these tools perform when tasked with generating test code, which is essential for verifying the correctness of software?
- Do different LLMs perform differently across various programming languages, and are they improving over time?
Methodology: Designing the Experiment
To investigate these questions, we conducted controlled experiments focusing on ChatGPT and GitHub Copilot. The study assessed their ability to generate correct and high-quality code in two popular programming languages: Python and Java. We chose these languages due to their prevalence on platforms like GitHub, with Java representing a compiled, statically typed language, and Python serving as an interpreted, dynamically typed language.
Algorithm Selection: We selected twelve well-known algorithms for this study, ensuring diversity across algorithm types. The chosen algorithms included classic algorithms like Bellman-Ford, Binary Search, and Merge Sort, along with others such as Egyptian Fractions and Dijkstra’s algorithm. The selection aimed to cover different types of problems, including optimization, search, sorting, and encoding algorithms.
Formulating Prompts: The performance of LLMs heavily depends on the prompts provided. For each algorithm, we designed specific prompts tailored to the requirements of Python and Java. These prompts guided the models to generate either the algorithm implementation or the corresponding unit tests.
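As an illustration of what "one prompt per algorithm, language, and task" can look like, here is a minimal sketch. The template wording, the `TEMPLATES` dictionary, and the `build_prompts` helper are hypothetical; the study's actual prompts are not quoted in this post.

```python
from itertools import product

# Hypothetical templates: these are NOT the prompts used in the study.
TEMPLATES = {
    "implementation": "Write a {language} implementation of the {algorithm} algorithm.",
    "tests": "Write unit tests in {language} for an implementation of the {algorithm} algorithm.",
}

# Three of the twelve algorithms named in the post, used as examples.
ALGORITHMS = ["Binary Search", "Merge Sort", "Bellman-Ford"]
LANGUAGES = ["Java", "Python"]


def build_prompts(algorithms, languages, templates):
    """Return one prompt per (algorithm, language, task) combination."""
    prompts = {}
    # Iterating over a dict yields its keys, so `task` is a template name.
    for algorithm, language, task in product(algorithms, languages, templates):
        prompts[(algorithm, language, task)] = templates[task].format(
            language=language, algorithm=algorithm
        )
    return prompts


prompts = build_prompts(ALGORITHMS, LANGUAGES, TEMPLATES)
print(prompts[("Merge Sort", "Java", "tests")])
```

Keeping the templates in one place makes it easier to hold the wording constant across tools, so that differences in output are more likely to come from the model than from the prompt.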
Generating Code: Code generation was conducted using ChatGPT's web interface, where independent chat windows were used to ensure unbiased outputs. GitHub Copilot, being an IDE-integrated tool, required fresh project setups for each generation to avoid context contamination. We generated 50 samples of each algorithm for source code evaluation and 10 samples for test code evaluation, totaling 1,920 code samples across the two languages and tools.
Results: Analyzing the Generated Code
The evaluation was divided into two categories: algorithm implementation and test case generation. We assessed the generated algorithm code for correctness and quality, and the generated unit tests for correctness and coverage.
Algorithm Code Generation: ChatGPT outperformed GitHub Copilot in terms of correctness. ChatGPT achieved a correctness rate of 89.33% for Java and 79.17% for Python, compared to GitHub Copilot’s 75.50% for Java and 62.50% for Python. In terms of code quality, GitHub Copilot demonstrated a slight edge, achieving 98.13% for Java and 90.15% for Python, while ChatGPT scored 98.09% and 88.20%, respectively. These results indicate that while ChatGPT excels at producing correct code, Copilot is slightly better at generating code that meets high-quality standards.
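To make the correctness percentages concrete, the sketch below shows one plausible way to compute such a rate. It assumes that a sample counts as correct only if it passes every hand-written reference case; the post does not spell out the study's exact criterion. The two `sample_*` functions are stand-ins for generated code, not real model output.

```python
def reference_cases():
    """Hand-written (arguments, expected result) pairs for binary search."""
    data = [1, 3, 5, 7, 9]
    return [
        ((data, 7), 3),
        ((data, 1), 0),
        ((data, 9), 4),
        ((data, 4), -1),
        (([], 4), -1),
    ]


def is_correct(candidate, cases):
    """Assumed criterion: pass every case and never raise."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            return False
    return True


def correctness_rate(candidates, cases):
    """Percentage of samples that pass all reference cases."""
    passed = sum(is_correct(c, cases) for c in candidates)
    return 100.0 * passed / len(candidates)


def sample_ok(items, target):
    """Stand-in for a correct generated sample."""
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


def sample_buggy(items, target):
    """Stand-in for a faulty sample: the loop stops one step too early."""
    lo, hi = 0, len(items) - 1
    while lo < hi:  # bug: never inspects the last remaining element
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1


rate = correctness_rate([sample_ok, sample_buggy], reference_cases())
print(f"{rate:.2f}% of samples are correct")
```

Note that the buggy sample passes several of the cases and is only caught by the search for the last element, which is why the strength of the reference cases matters as much as the generated code.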
Test Case Generation: Both LLMs struggled more with test case generation than with algorithm implementation. ChatGPT generated correct test cases 37.50% of the time for Java and 28.61% for Python. GitHub Copilot performed better, by roughly 10 to 12 percentage points, with 49.72% correctness for Java and 39.17% for Python. Coverage was moderate for Java and considerably higher for Python: ChatGPT averaged 58.79% coverage for Java and 85.94% for Python, and GitHub Copilot scored 57.18% and 82.03%, respectively.
Language Comparison: Across both models, the generated Java code was consistently more correct and of higher quality than the generated Python code. However, the Python test code achieved better coverage, nearing complete coverage in some cases. The observed differences were statistically significant, confirming that language-specific factors influence LLM performance.
Model Comparison: Statistical analysis revealed significant differences between ChatGPT and GitHub Copilot. ChatGPT was superior in generating correct code, while GitHub Copilot often produced slightly higher quality code. These findings suggest that ChatGPT might be the preferred choice for generating reliable code, whereas Copilot’s integration within IDEs offers convenience and efficiency for developers.
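The post does not say which statistical test the study used. As one common way to check whether two correctness rates differ, here is a sketch of a two-proportion z-test applied to the reported Java figures. The sample size of 600 per tool is an assumption (12 algorithms times 50 samples, as described in the methodology), and the success counts are reconstructed from the reported percentages rather than taken from the paper.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both proportions are equal."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of equal rates.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumption: 12 algorithms x 50 samples = 600 Java samples per tool.
N = 12 * 50
chatgpt_correct = round(0.8933 * N)  # reconstructed from the reported 89.33%
copilot_correct = round(0.7550 * N)  # reconstructed from the reported 75.50%

z, p = two_proportion_z_test(chatgpt_correct, N, copilot_correct, N)
print(f"z = {z:.2f}, p = {p:.2e}")
```

Under these assumptions the z statistic is about 6.3, far beyond the usual 1.96 threshold for a 5% significance level, which is consistent with the post's statement that the difference is significant.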
Discussion: Key Insights and Implications
The study’s results highlight that while LLMs like ChatGPT and GitHub Copilot are promising tools for code generation, there is still room for improvement, particularly in generating test cases, where correctness stayed below 50% for both tools. Both tools showed a clear ability to generate code that can serve as a starting point, but the gap between generated and human-written code, especially for complex algorithms and thorough test cases, suggests that human oversight remains essential.
Correctness vs. Quality: The results underscore a trade-off between correctness and quality. ChatGPT, while producing highly correct code, occasionally fell short of Copilot in terms of quality. This suggests that while developers can rely on these tools for initial code drafts, manual refinement is often necessary to meet high-quality standards.
Impact of Language: The discrepancies between Java and Python highlight the impact of language-specific characteristics on LLM performance. Java’s structured and typed nature might offer clearer guidance to LLMs, resulting in better performance, while Python’s flexibility could pose challenges.
Evolution Over Time: Comparing the current results with past studies indicates that LLMs are improving, with notable advancements in code correctness for ChatGPT and quality for Copilot. This trend suggests that continuous improvements in LLM training and architecture could further enhance their utility in software development.
Conclusion: The Future of AI in Software Development
Our study concludes that LLMs can effectively generate algorithm implementations and test code for Java and Python, though with varying success rates. ChatGPT outperforms Copilot in generating correct algorithm code, while Copilot remains a competitive choice due to its seamless IDE integration. Importantly, neither tool is perfect, with generated test cases still requiring significant manual intervention.
The results suggest that, while LLMs are valuable assistants, they are not yet ready to fully replace human developers. As these models evolve, their role in software development will likely expand, allowing developers to focus more on complex problem-solving and less on routine coding tasks.
Future research should explore the capabilities of LLMs in more complex scenarios, such as debugging and refactoring existing code, to better understand their potential and limitations. Additionally, further analysis of the non-deterministic nature of these models will provide deeper insights into their reliability. Ultimately, the integration of LLMs in development workflows represents a promising step forward, but one that still requires careful navigation and human oversight.