This is part 1 of a multi-part deep dive into the state of LLM/AI-based coding assistants.
Why should you care? I’ll give two answers:
If you don’t use LLM coding assistants, you should know that approximately X% of US GDP is currently being allocated to capital expenditures for LLM hardware [ref], and that the most prominent professional use case of LLMs is coding assistants [ref]. Claude Code is perhaps the fastest product ever to reach $1B in ARR [ref].
If you do use LLM coding assistants, understanding how frontier models are evaluated on coding tasks will help you interpret benchmark results and decide which models to use in which situations.
I will start this series with a brief deep-dive into the most influential benchmark for coding assistants: SWE-Bench.
What is and isn’t SWE-Bench?
SWE-Bench (SoftWare Engineering-Bench) is a benchmark for evaluating the accuracy of LLMs on complex software engineering tasks in real-world codebases.
To know what SWE-Bench is, it helps to know what SWE-Bench isn’t. Before the coding assistant hype cycle started,
there were many LLM benchmarks that focused on short, self-contained tasks. For example, one task from OpenAI’s HumanEval [ref] benchmark
prompted the LLM to implement this function:
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to
    each other than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
The LLM is given one or more independent attempts to generate a solution, unit tests are run on the solution, and if the unit tests pass, the LLM passes the problem.
Benchmarks consisting of clearly scoped problems like this can broadly be categorized as “Self-Contained Coding Benchmarks”: the model does not need any knowledge of an existing codebase to solve them.
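To make this concrete, here is a minimal sketch of how a self-contained problem like the one above could be checked, using the doctest examples from the prompt as the unit tests. This is an illustration, not the actual HumanEval harness (which runs separate hidden unit tests in a sandbox), and the function body is just one plausible solution a model might generate.

import doctest
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than the threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # One plausible model-generated solution: compare every pair of numbers.
    return any(
        abs(a - b) < threshold
        for i, a in enumerate(numbers)
        for b in numbers[i + 1:]
    )


# Run the doctest examples against the candidate solution; the problem
# counts as solved only if every test passes.
results = doctest.testmod(verbose=False)
print("solved" if results.failed == 0 else "failed")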
SWE-Bench originated in 2023 as the first large-scale benchmark that went beyond self-contained problems. What SWE-Bench tests is the ability of an LLM to solve real-world GitHub issues in real-world codebases.
A Real SWE-Bench Problem
Let’s look at an example SWE-Bench problem:
Voting estimator will fail at fit if weights are passed and an estimator is None #13777
glemaitre opened this issue on May 3, 2019 · Closed
Because we don’t check for an estimator to be None in sample_weight support, fit is failing.
X, y = load_iris(return_X_y=True)
voter = VotingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier())]
)
voter.fit(X, y, sample_weight=np.ones(y.shape))
voter.set_params(lr=None)
voter.fit(X, y, sample_weight=np.ones(y.shape))
AttributeError: 'NoneType' object has no attribute 'fit'
@pytest.mark.parametrize( "X, y, voter", [(X, y, VotingClassifier( [('lr', LogisticRegression()), ('rf', RandomForestClassifier(n_estimators=5))])), (X_r, y_r, VotingRegressor( [('lr', LinearRegression()), ('rf', RandomForestRegressor(n_estimators=5))]))])@pytest.mark.parametrize("drop", [None, 'drop'])def test_none_estimator_with_weights(X, y, voter, drop): # TODO: remove the parametrization on 'drop' when support for None is # removed. # check that an estimator can be set to 'drop' and passing some weight # regression test for # https://github.com/scikit-learn/scikit-learn/issues/13777 voter = clone(voter) voter.fit(X, y, sample_weight=np.ones(y.shape)) voter.set_params(lr=drop) with pytest.warns(None) as record: voter.fit(X, y, sample_weight=np.ones(y.shape)) assert record if drop is None else not record y_pred = voter.predict(X) assert y_pred.shape == y.shape
@pytest.mark.parametrize( "X, y, voter", [(X, y, VotingClassifier( [('lr', LogisticRegression()), ('rf', RandomForestClassifier(n_estimators=5))])), (X_r, y_r, VotingRegressor( [('lr', LinearRegression()), ('rf', RandomForestRegressor(n_estimators=5))]))])@pytest.mark.parametrize("drop", [None, 'drop'])def test_none_estimator_with_weights(X, y, voter, drop): # TODO: remove the parametrization on 'drop' when support for None is # removed. # check that an estimator can be set to 'drop' and passing some weight # regression test for # https://github.com/scikit-learn/scikit-learn/issues/13777 voter = clone(voter) voter.fit(X, y, sample_weight=np.ones(y.shape)) voter.set_params(lr=drop) with pytest.warns(None) as record: voter.fit(X, y, sample_weight=np.ones(y.shape)) assert record if drop is None else not record y_pred = voter.predict(X) assert y_pred.shape == y.shape
The Pass-To-Pass tests are the pre-existing tests for the voting estimators, which already pass at the base commit and must keep passing after the fix:

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
def test_estimator_init():
    eclf = VotingClassifier(estimators=[])
    msg = ('Invalid `estimators` attribute, `estimators` should be'
           ' a list of (string, estimator) tuples')
    assert_raise_message(AttributeError, msg, eclf.fit, X, y)

    clf = LogisticRegression(random_state=1)

    eclf = VotingClassifier(estimators=[('lr', clf)], voting='error')
    msg = ('Voting must be \'soft\' or \'hard\'; got (voting=\'error\')')
    assert_raise_message(ValueError, msg, eclf.fit, X, y)

    eclf = VotingClassifier(estimators=[('lr', clf)], weights=[1, 2])
    msg = ('Number of `estimators` and weights must be equal'
           '; got 2 weights, 1 estimators')
    assert_raise_message(ValueError, msg, eclf.fit, X, y)

    eclf = VotingClassifier(estimators=[('lr', clf), ('lr', clf)],
                            weights=[1, 2])
    msg = "Names provided are not unique: ['lr', 'lr']"
    assert_raise_message(ValueError, msg, eclf.fit, X, y)

    eclf = VotingClassifier(estimators=[('lr__', clf)])
    msg = "Estimator names must not contain __: got ['lr__']"
    assert_raise_message(ValueError, msg, eclf.fit, X, y)

    eclf = VotingClassifier(estimators=[('estimators', clf)])
    msg = "Estimator names conflict with constructor arguments: ['estimators']"
    assert_raise_message(ValueError, msg, eclf.fit, X, y)

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
def test_predictproba_hardvoting():
    eclf = VotingClassifier(estimators=[('lr1', LogisticRegression()),
                                        ('lr2', LogisticRegression())],
                            voting='hard')
    msg = "predict_proba is not available when voting='hard'"
    assert_raise_message(AttributeError, msg, eclf.predict_proba, X)

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
def test_notfitted():
    eclf = VotingClassifier(estimators=[('lr1', LogisticRegression()),
                                        ('lr2', LogisticRegression())],
                            voting='soft')
    ereg = VotingRegressor([('dr', DummyRegressor())])
    msg = ("This %s instance is not fitted yet. Call \'fit\'"
           " with appropriate arguments before using this method.")
    assert_raise_message(NotFittedError, msg % 'VotingClassifier',
                         eclf.predict, X)
    assert_raise_message(NotFittedError, msg % 'VotingClassifier',
                         eclf.predict_proba, X)
    assert_raise_message(NotFittedError, msg % 'VotingClassifier',
                         eclf.transform, X)
    assert_raise_message(NotFittedError, msg % 'VotingRegressor',
                         ereg.predict, X_r)
    assert_raise_message(NotFittedError, msg % 'VotingRegressor',
                         ereg.transform, X_r)

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_majority_label_iris():
    """Check classification by majority label on dataset iris."""
    clf1 = LogisticRegression(random_state=123)
    clf2 = RandomForestClassifier(random_state=123)
    clf3 = GaussianNB()
    eclf = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
        voting='hard')
    scores = cross_val_score(eclf, X, y, cv=5, scoring='accuracy')
    assert_almost_equal(scores.mean(), 0.95, decimal=2)

@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_tie_situation():
    """Check voting classifier selects smaller class label in tie situation."""
    clf1 = LogisticRegression(random_state=123, multi_class='ovr',
                              solver='liblinear')
    clf2 = RandomForestClassifier(random_state=123)
    eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2)],
                            voting='hard')
    assert_equal(clf1.fit(X, y).predict(X)[73], 2)
    assert_equal(clf2.fit(X, y).predict(X)[73], 1)
    assert_equal(eclf.fit(X, y).predict(X)[73], 1)

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_weights_iris():
    """Check classification by average probabilities on dataset iris."""
    clf1 = LogisticRegression(random_state=123)
    clf2 = RandomForestClassifier(random_state=123)
    clf3 = GaussianNB()
    eclf = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
        voting='soft', weights=[1, 2, 10])
    scores = cross_val_score(eclf, X, y, cv=5, scoring='accuracy')
    assert_almost_equal(scores.mean(), 0.93, decimal=2)

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_parallel_fit():
    """Check parallel backend of VotingClassifier on toy dataset."""
    clf1 = LogisticRegression(random_state=123)
    clf2 = RandomForestClassifier(random_state=123)
    clf3 = GaussianNB()
    X = np.array([[-1.1, -1.5], [-1.2, -1.4], [-3.4, -2.2], [1.1, 1.2]])
    y = np.array([1, 1, 2, 2])

    eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
        voting='soft',
        n_jobs=1).fit(X, y)
    eclf2 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('gnb', clf3)],
        voting='soft',
        n_jobs=2).fit(X, y)

    assert_array_equal(eclf1.predict(X), eclf2.predict(X))
    assert_array_almost_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_sample_weight():
    """Tests sample_weight parameter of VotingClassifier"""
    clf1 = LogisticRegression(random_state=123)
    clf2 = RandomForestClassifier(random_state=123)
    clf3 = SVC(gamma='scale', probability=True, random_state=123)
    eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('svc', clf3)],
        voting='soft').fit(X, y, sample_weight=np.ones((len(y),)))
    eclf2 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2), ('svc', clf3)],
        voting='soft').fit(X, y)
    assert_array_equal(eclf1.predict(X), eclf2.predict(X))
    assert_array_almost_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))

    sample_weight = np.random.RandomState(123).uniform(size=(len(y),))
    eclf3 = VotingClassifier(estimators=[('lr', clf1)], voting='soft')
    eclf3.fit(X, y, sample_weight)
    clf1.fit(X, y, sample_weight)
    assert_array_equal(eclf3.predict(X), clf1.predict(X))
    assert_array_almost_equal(eclf3.predict_proba(X), clf1.predict_proba(X))

    clf4 = KNeighborsClassifier()
    eclf3 = VotingClassifier(estimators=[
        ('lr', clf1), ('svc', clf3), ('knn', clf4)],
        voting='soft')
    msg = ('Underlying estimator \'knn\' does not support sample weights.')
    assert_raise_message(ValueError, msg, eclf3.fit, X, y, sample_weight)

def test_sample_weight_kwargs():
    """Check that VotingClassifier passes sample_weight as kwargs"""
    class MockClassifier(BaseEstimator, ClassifierMixin):
        """Mock Classifier to check that sample_weight is received as kwargs"""
        def fit(self, X, y, *args, **sample_weight):
            assert 'sample_weight' in sample_weight

    clf = MockClassifier()
    eclf = VotingClassifier(estimators=[('mock', clf)], voting='soft')

    # Should not raise an error.
    eclf.fit(X, y, sample_weight=np.ones((len(y),)))

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_set_params():
    """set_params should be able to set estimators"""
    clf1 = LogisticRegression(random_state=123, C=1.0)
    clf2 = RandomForestClassifier(random_state=123, max_depth=None)
    clf3 = GaussianNB()
    eclf1 = VotingClassifier([('lr', clf1), ('rf', clf2)], voting='soft',
                             weights=[1, 2])
    assert 'lr' in eclf1.named_estimators
    assert eclf1.named_estimators.lr is eclf1.estimators[0][1]
    assert eclf1.named_estimators.lr is eclf1.named_estimators['lr']
    eclf1.fit(X, y)
    assert 'lr' in eclf1.named_estimators_
    assert eclf1.named_estimators_.lr is eclf1.estimators_[0]
    assert eclf1.named_estimators_.lr is eclf1.named_estimators_['lr']

    eclf2 = VotingClassifier([('lr', clf1), ('nb', clf3)], voting='soft',
                             weights=[1, 2])
    eclf2.set_params(nb=clf2).fit(X, y)
    assert not hasattr(eclf2, 'nb')

    assert_array_equal(eclf1.predict(X), eclf2.predict(X))
    assert_array_almost_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))
    assert_equal(eclf2.estimators[0][1].get_params(), clf1.get_params())
    assert_equal(eclf2.estimators[1][1].get_params(), clf2.get_params())

    eclf1.set_params(lr__C=10.0)
    eclf2.set_params(nb__max_depth=5)

    assert eclf1.estimators[0][1].get_params()['C'] == 10.0
    assert eclf2.estimators[1][1].get_params()['max_depth'] == 5
    assert_equal(eclf1.get_params()["lr__C"],
                 eclf1.get_params()["lr"].get_params()['C'])

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_set_estimator_none():
    """VotingClassifier set_params should be able to set estimators as None"""
    # Test predict
    clf1 = LogisticRegression(random_state=123)
    clf2 = RandomForestClassifier(random_state=123)
    clf3 = GaussianNB()
    eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2),
                                         ('nb', clf3)],
                             voting='hard', weights=[1, 0, 0.5]).fit(X, y)

    eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2),
                                         ('nb', clf3)],
                             voting='hard', weights=[1, 1, 0.5])
    eclf2.set_params(rf=None).fit(X, y)
    assert_array_equal(eclf1.predict(X), eclf2.predict(X))

    assert dict(eclf2.estimators)["rf"] is None
    assert len(eclf2.estimators_) == 2
    assert all(isinstance(est, (LogisticRegression, GaussianNB))
               for est in eclf2.estimators_)
    assert eclf2.get_params()["rf"] is None

    eclf1.set_params(voting='soft').fit(X, y)
    eclf2.set_params(voting='soft').fit(X, y)
    assert_array_equal(eclf1.predict(X), eclf2.predict(X))
    assert_array_almost_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))
    msg = 'All estimators are None. At least one is required!'
    assert_raise_message(
        ValueError, msg, eclf2.set_params(lr=None, rf=None, nb=None).fit, X, y)

    # Test soft voting transform
    X1 = np.array([[1], [2]])
    y1 = np.array([1, 2])
    eclf1 = VotingClassifier(estimators=[('rf', clf2), ('nb', clf3)],
                             voting='soft', weights=[0, 0.5],
                             flatten_transform=False).fit(X1, y1)

    eclf2 = VotingClassifier(estimators=[('rf', clf2), ('nb', clf3)],
                             voting='soft', weights=[1, 0.5],
                             flatten_transform=False)
    eclf2.set_params(rf=None).fit(X1, y1)
    assert_array_almost_equal(eclf1.transform(X1),
                              np.array([[[0.7, 0.3], [0.3, 0.7]],
                                        [[1., 0.], [0., 1.]]]))
    assert_array_almost_equal(eclf2.transform(X1),
                              np.array([[[1., 0.],
                                         [0., 1.]]]))
    eclf1.set_params(voting='hard')
    eclf2.set_params(voting='hard')
    assert_array_equal(eclf1.transform(X1), np.array([[0, 0], [1, 1]]))
    assert_array_equal(eclf2.transform(X1), np.array([[0], [1]]))

@pytest.mark.filterwarnings('ignore: Default solver will be changed')  # 0.22
@pytest.mark.filterwarnings('ignore: Default multi_class will')  # 0.22
@pytest.mark.filterwarnings('ignore:The default value of n_estimators')
def test_estimator_weights_format():
    # Test estimator weights inputs as list and array
    clf1 = LogisticRegression(random_state=123)
    clf2 = RandomForestClassifier(random_state=123)
    eclf1 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2)],
        weights=[1, 2],
        voting='soft')
    eclf2 = VotingClassifier(estimators=[
        ('lr', clf1), ('rf', clf2)],
        weights=np.array((1, 2)),
        voting='soft')
    eclf1.fit(X, y)
    eclf2.fit(X, y)
    assert_array_almost_equal(eclf1.predict_proba(X), eclf2.predict_proba(X))
Every SWE-Bench problem is packaged with two sets of tests:
Fail-To-Pass Tests: A set of tests related to the issue that fail in the original codebase and must pass for the issue to count as resolved. These tests check that the code changes resolved the issue.
Pass-To-Pass Tests: A set of tests unrelated to the issue that pass in the original codebase and must still pass after the issue is resolved. These tests check that the code changes did not break functionality unrelated to the issue.
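To make this pairing concrete, here is a rough sketch of the metadata attached to a single SWE-Bench problem and the resolution check it implies. The field names and test IDs below are assumptions chosen for illustration (they loosely mirror the published dataset fields, but this is not the official schema or harness), and the placeholder values are not copied from the real dataset.

from dataclasses import dataclass, field


@dataclass
class SWEBenchInstance:
    # Illustrative structure for one SWE-Bench problem (assumed field names).
    instance_id: str
    repo: str
    base_commit: str
    problem_statement: str                         # the GitHub issue text
    fail_to_pass: list = field(default_factory=list)   # tests that must flip to passing
    pass_to_pass: list = field(default_factory=list)   # tests that must keep passing


def is_resolved(passed_tests, inst):
    # A problem counts as resolved only if every Fail-To-Pass test now passes
    # AND every Pass-To-Pass test still passes.
    return (all(t in passed_tests for t in inst.fail_to_pass) and
            all(t in passed_tests for t in inst.pass_to_pass))


# Hypothetical instance for the scikit-learn issue above; IDs, commit, and
# test paths are placeholders for illustration.
example = SWEBenchInstance(
    instance_id="scikit-learn__scikit-learn-<pr-number>",
    repo="scikit-learn/scikit-learn",
    base_commit="<base commit SHA>",
    problem_statement="Voting estimator will fail at fit if weights are "
                      "passed and an estimator is None ...",
    fail_to_pass=["test_voting.py::test_none_estimator_with_weights"],
    pass_to_pass=["test_voting.py::test_estimator_init",
                  "test_voting.py::test_notfitted"],
)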
Typically, SWE-Bench evaluations are run in Docker containers. The repository is checked out at the base commit, and the coding assistant is prompted with the issue text and given full access to that checkout. The coding assistant runs and generates a candidate patch, which is verified against both the Fail-To-Pass and Pass-To-Pass tests. If all tests pass, the problem is considered solved.
Importantly, the coding assistant should NOT have access to the Fail-To-Pass tests to avoid hacking a solution.
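A simplified sketch of this per-problem evaluation loop is below. It is not the official SWE-Bench harness: it glosses over image building, environment setup, and log parsing, and the shell commands and function names are illustrative. But it captures the structure described above: apply the model-generated patch to the base-commit checkout, run the tests, and count the problem as solved only if every Fail-To-Pass and Pass-To-Pass test passes.

import subprocess


def run(cmd, cwd):
    # Run a shell command inside the repo checkout and capture its output.
    return subprocess.run(cmd, shell=True, cwd=cwd,
                          capture_output=True, text=True)


def evaluate_instance(repo_dir, model_patch, fail_to_pass, pass_to_pass):
    # repo_dir is an already-prepared container checkout at the base commit.
    # Apply the model's patch to the codebase.
    with open(f"{repo_dir}/model.patch", "w") as f:
        f.write(model_patch)
    if run("git apply model.patch", cwd=repo_dir).returncode != 0:
        return False  # the patch does not even apply cleanly

    # Run both test sets; each entry is a pytest node ID like
    # "path/to/test_file.py::test_name".
    for test in fail_to_pass + pass_to_pass:
        if run(f"python -m pytest -x {test}", cwd=repo_dir).returncode != 0:
            return False
    return True


# The reported score is then just the fraction of problems solved, e.g.
# score = 100 * sum(results) / len(results)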
[Figure: the evaluation flow. Inside a Docker container, the coding assistant is given the issue text and the codebase at the base commit, generates a patch, and the patched codebase is run against the Fail-To-Pass and Pass-To-Pass tests. If all tests pass, the problem is solved.]
The SWE-Bench scores that frontier labs report are the percentage of SWE-Bench problems solved by the LLM.
The History
SWE-Bench started as a single benchmark in late 2023, but is now the name for a family of benchmarks. Much of the confusion around coding assistant capabilities comes from conflating SWE-Bench variants. The chronology below traces the history and evolution of SWE-Bench.