OpenAI Launches SWE-bench Verified for More Accurate AI Model Evaluation in Software Engineering

TapTechNews, August 15 news: OpenAI published a press release on August 13 announcing the launch of SWE-bench Verified, a code generation evaluation benchmark that addresses the limitations of the original SWE-bench and evaluates the performance of AI models on software engineering tasks more accurately.

SWE-bench

TapTechNews note: SWE-bench is a benchmark dataset used to evaluate the ability of LLMs to solve real-world software issues from GitHub.

It collects 2,294 Issue-Pull Request pairs from 12 popular Python repositories. During the test, the LLM is given a code repository and an issue description, and must generate a patch that solves the problem described in the issue.
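For readers who want to look at the data directly, the snippet below is a minimal sketch of how a sample can be inspected with the Hugging Face datasets library; the dataset ID princeton-nlp/SWE-bench and the field names are taken from the publicly released dataset and are not part of OpenAI's announcement.

```python
# Minimal sketch: inspecting a SWE-bench sample with the Hugging Face
# "datasets" library. The dataset ID and field names follow the public
# princeton-nlp/SWE-bench release and may change over time.
from datasets import load_dataset

swe_bench = load_dataset("princeton-nlp/SWE-bench", split="test")
sample = swe_bench[0]

print(sample["repo"])               # source repository, e.g. a popular Python project
print(sample["problem_statement"])  # the GitHub issue text given to the LLM
print(sample["patch"])              # the reference pull-request patch
print(sample["FAIL_TO_PASS"])       # tests that must go from failing to passing
print(sample["PASS_TO_PASS"])       # tests that must keep passing
```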

This benchmark uses two types of tests:

The FAIL_TO_PASS tests check whether the problem has been solved.

The PASS_TO_PASS tests ensure that the code changes do not break existing functionality (a criterion sketched below).
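Putting the two test types together, a generated patch counts as resolving an instance only when both conditions hold. The sketch below expresses that criterion, with run_test as a hypothetical helper that executes a single test against the patched repository.

```python
# Illustrative only: the "resolved" criterion described above. `run_test` is a
# hypothetical function that runs one unit test against the patched repository
# and returns True if it passes.
from typing import Callable, Iterable

def is_resolved(fail_to_pass: Iterable[str],
                pass_to_pass: Iterable[str],
                run_test: Callable[[str], bool]) -> bool:
    # Every FAIL_TO_PASS test must now pass (the issue is actually fixed) ...
    issue_fixed = all(run_test(t) for t in fail_to_pass)
    # ... and every PASS_TO_PASS test must still pass (nothing else broke).
    nothing_broken = all(run_test(t) for t in pass_to_pass)
    return issue_fixed and nothing_broken
```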

SWE-bench's problems

OpenAI pointed out three main problems with SWE-bench:

Unit tests are too strict: the unit tests used to evaluate the correctness of solutions are often overly specific, and sometimes even irrelevant to the problem, which can lead to correct solutions being rejected.

Problem descriptions are unclear: in many samples the problem description is not specific enough, leaving ambiguity about what the problem is and how it should be solved.

The development environment is hard to set up: it is sometimes difficult to reliably set up the SWE-bench development environment for the agent, which can inadvertently cause unit tests to fail regardless of the quality of the solution.

SWE-bench Verified

The main improvement of SWE-bench Verified is a new evaluation toolkit that uses containerized Docker environments.

This improvement aims to make the evaluation process more consistent and reliable, and to reduce the likelihood of problems caused by development-environment setup.
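As an illustration of the idea (a hedged sketch, not OpenAI's actual harness), evaluating a test inside a Docker container could look roughly like this, assuming a hypothetical image name and a pytest-based test suite:

```python
# Illustrative sketch only: running one repository test inside a Docker
# container so each evaluation gets an isolated, reproducible environment.
# The image name and paths are hypothetical.
import subprocess

def run_test_in_container(repo_dir: str, test_id: str,
                          image: str = "swe-eval:py3.11") -> bool:
    """Mount the patched repository into a container and run a single test."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{repo_dir}:/workspace",   # mount the patched checkout
            "-w", "/workspace",
            image,
            "python", "-m", "pytest", test_id, "-x", "-q",
        ],
        capture_output=True,
    )
    return result.returncode == 0  # pytest exits 0 only if the test passed
```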

For example, on SWE-bench Verified GPT-4o solved 33.2% of the samples, with the best-performing open-source agent framework, Agentless, doubling its score from 16% on the original SWE-bench.

This improvement in performance indicates that SWE-bench Verified better captures the true ability of AI models on software engineering tasks.
