SWE-Bench: A Deep Dive

The $X billion benchmark

JB

This is part 1 of a multi-part deep dive into the state of LLM/AI-based coding assistants.

Why should you care? I’ll give two answers:

If you don’t use LLM coding assistants: consider that approximately X% of US GDP is currently being allocated to capital expenditures for LLM hardware [ref], and the most prominent professional use case of LLMs is coding assistants [ref]. Claude Code is perhaps the fastest product ever to reach $1B ARR [ref].

If you do use LLM coding assistants: understanding how frontier models are evaluated on coding tasks will help you interpret the benchmark numbers labs report and decide which models to use in which situations.

I will start this series with a brief deep dive into the most influential benchmark for coding assistants: SWE-Bench.

What is and isn’t SWE-Bench?

SWE-Bench (SoftWare Engineering Benchmark) is a benchmark for evaluating the accuracy of LLMs on complex software engineering tasks in real-world codebases.

To know what SWE-Bench is, it helps to know what SWE-Bench isn’t. Before the coding assistant hype cycle started, many LLM benchmarks focused on short, self-contained tasks. For example, one task from OpenAI’s HumanEval [ref] benchmark prompted the LLM to implement this function:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each other
    than the given threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """

The LLM is given one or more independent attempts to generate a solution; unit tests are run against each attempt, and if the tests pass, the problem counts as solved.
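For concreteness, here is the kind of solution an LLM might generate for that prompt, followed by checks in the spirit of the benchmark’s hidden unit tests. This is my own sketch, not the official HumanEval reference solution or test code:

from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # One plausible LLM-generated solution: compare every pair of numbers
    # and report whether any pair is closer together than the threshold.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False


# Illustrative checks in the style of the benchmark's hidden unit tests.
assert not has_close_elements([1.0, 2.0, 3.0], 0.5)
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)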

Benchmarks consisting of clearly scoped problems like this can broadly be categorized as “Self-Contained Coding Benchmarks”: the model does not need any knowledge of an existing codebase to solve them.

SWE-Bench originated in 2023 as the first large-scale benchmark to go beyond self-contained problems. What SWE-Bench tests is the ability of an LLM to resolve real-world GitHub issues in real-world codebases.

A Real SWE-Bench Problem

Let’s look at an example SWE-Bench problem:

Voting estimator will fail at fit if weights are passed and an estimator is None #13777

Closed · glemaitre opened this issue on May 3, 2019

Because we don’t check for an estimator to be None in sample_weight support, fit is failing.

X, y = load_iris(return_X_y=True)
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier())]
)
voter.fit(X, y, sample_weight=np.ones(y.shape))
voter.set_params(lr=None)
voter.fit(X, y, sample_weight=np.ones(y.shape))
AttributeError: 'NoneType' object has no attribute 'fit'

A SWE-Bench problem consists of the following primary parts:

  • Issue text: The title and body of a real GitHub issue.
  • Codebase: The entire codebase for the repository.
  • Fail-To-Pass Tests: A set of tests related to the issue that fail in the original codebase, and must be passed for the issue to be resolved. These tests check that code changes resolved the issue.
  • Pass-To-Pass Tests: A set of tests unrelated to the issue that pass in the original codebase, and must be passed after the issue is resolved. These tests check that code changes did not break functionality unrelated to the issue.
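Putting those parts together, you can picture a single SWE-Bench instance as a record roughly like the one below. The field names are only an approximation of the official dataset schema, and the test names are made up for illustration; note that the codebase itself is not stored in the record but is recovered by cloning the repository at the base commit.

# Illustrative sketch of one SWE-Bench instance; field and test names are approximate.
instance = {
    "instance_id": "scikit-learn__scikit-learn-<pr-number>",  # hypothetical ID format
    "repo": "scikit-learn/scikit-learn",
    "base_commit": "<SHA of the commit the issue was filed against>",
    # The issue title and body:
    "problem_statement": "Voting estimator will fail at fit if weights "
                         "are passed and an estimator is None ...",
    # Tests that fail before the fix and must pass afterwards (names made up):
    "FAIL_TO_PASS": ["sklearn/ensemble/tests/test_voting.py::test_none_estimator_with_weights"],
    # Tests that already pass and must keep passing (names made up):
    "PASS_TO_PASS": ["sklearn/ensemble/tests/test_voting.py::test_estimator_init"],
}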

Typically, SWE-Bench evaluations are run in Docker containers. The entire repo at the base commit is cloned, and the coding assistant is prompted with the issue text and given full access to the repo at that commit. The coding assistant runs and generates a potential solution, which is verified against both the Fail-To-Pass and Pass-To-Pass tests. If all tests pass, the problem is considered solved. Importantly, the coding assistant should NOT have access to the Fail-To-Pass tests; otherwise it could hack together a solution that targets the tests rather than the underlying issue.

[Figure: inside a Docker container, the issue text and the codebase at the base commit are handed to the coding assistant, which generates a patch; the patched codebase is then run against the Fail-To-Pass and Pass-To-Pass tests, and the problem is solved only if all of them pass.]
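In code, scoring a single problem looks roughly like the sketch below. This is a simplified stand-in for the official evaluation harness (which builds per-repository Docker images with pinned dependencies and handles much more); generate_patch is a placeholder for whatever coding assistant is being evaluated, and the instance fields match the sketch above.

import subprocess
import tempfile


def evaluate_instance(instance: dict, generate_patch) -> bool:
    # Simplified sketch, not the official SWE-Bench harness.
    workdir = tempfile.mkdtemp()

    # 1. Check out the repository exactly as it was at the base commit.
    subprocess.run(["git", "clone", f"https://github.com/{instance['repo']}.git", workdir], check=True)
    subprocess.run(["git", "checkout", instance["base_commit"]], cwd=workdir, check=True)

    # 2. The assistant sees only the issue text and the codebase,
    #    never the Fail-To-Pass tests.
    patch = generate_patch(instance["problem_statement"], workdir)

    # 3. Apply the generated patch.
    subprocess.run(["git", "apply", "-"], cwd=workdir, input=patch.encode(), check=True)

    # 4. Run both test sets; the problem counts as solved only if everything passes.
    def tests_pass(test_ids):
        return subprocess.run(["python", "-m", "pytest", *test_ids], cwd=workdir).returncode == 0

    return tests_pass(instance["FAIL_TO_PASS"]) and tests_pass(instance["PASS_TO_PASS"])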

The SWE-Bench scores that frontier labs report are simply the percentage of SWE-Bench problems solved by the LLM.

The History

SWE-Bench started as a single benchmark in late 2023, but the name now refers to a family of benchmarks. Much of the confusion about coding assistant capabilities comes from mixing up these SWE-Bench variants. The chronology below traces the history and evolution of SWE-Bench.