DevBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) across various stages of the software development lifecycle. It covers critical steps such as software design, environment setup, implementation, acceptance testing, and unit testing. By integrating these interconnected tasks under a single framework, DevBench offers a holistic perspective on the potential of LLMs for automated software development.

Here are some key details about DevBench:

Purpose: DevBench aims to assess LLMs' capabilities in automating software development tasks.

Dataset: The dataset comprises 22 curated repositories across 4 programming languages (Python, C/C++, Java, JavaScript), covering diverse domains including machine learning, databases, web services, and command-line utilities.

Evaluation suite: For the implementation task, DevBench provides extensive acceptance and unit test cases. For the software design task, DevBench uses LLM-as-a-Judge evaluation.

Baseline agent system: DevBench includes a baseline agent system based on ChatDev, a popular multi-agent software development system.
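
To make the structure above concrete, here is a minimal, hypothetical sketch in Python of how one might represent a DevBench repository entry and loop over the lifecycle stages. All names (Stage, RepoTask, evaluate) are illustrative assumptions, not the actual DevBench schema or API; real scoring runs the provided test suites for implementation and LLM-as-a-Judge for software design.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Lifecycle stages covered by DevBench, per the description above."""
    SOFTWARE_DESIGN = "software_design"
    ENVIRONMENT_SETUP = "environment_setup"
    IMPLEMENTATION = "implementation"
    ACCEPTANCE_TESTING = "acceptance_testing"
    UNIT_TESTING = "unit_testing"


@dataclass
class RepoTask:
    """Hypothetical record for one benchmark repository (field names are assumed)."""
    name: str
    language: str                                   # Python, C/C++, Java, or JavaScript
    domain: str                                     # e.g. machine learning, databases
    acceptance_tests: list = field(default_factory=list)
    unit_tests: list = field(default_factory=list)


def evaluate(repo: RepoTask, outputs: dict) -> dict:
    """Toy evaluation loop: mark a stage as attempted if the model produced output.

    This stub only illustrates the per-stage flow; it does not execute tests or
    perform LLM-as-a-Judge comparison.
    """
    return {stage: bool(outputs.get(stage)) for stage in Stage}


if __name__ == "__main__":
    repo = RepoTask(name="example-cli", language="Python",
                    domain="command-line utilities")
    results = evaluate(repo, {Stage.IMPLEMENTATION: "def main(): ..."})
    for stage, done in results.items():
        print(f"{stage.value}: {'attempted' if done else 'missing'}")
```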
