The State of AI Coding Assistants
What does the data say?
Approximately __% of LLM tokens are now used for tasks related to software engineering. That is a staggering __ tokens per day, or $__ in dollar terms [ref].
I’ve been writing code in some capacity for over a decade, and in the ML industry specifically for a majority of that time. I now find myself relying on LLMs for a large chunk of my work life. I’m not really sure whether that is good or bad, but one thing I do know is that I feel more productive. Over the past year, I’ve switched from ChatGPT copy-pasting, to Cursor, to Claude Code, to Codex, and back to Claude Code. I am generally open to new technology, but even I started out extremely skeptical of LLMs for real work. I’ve come around a little bit. I know a lot of folks in our industry are still skeptical, and another subset (who seem to congregate in HN comment sections) are straight-up hostile.
Why Benchmarks?
A lot of attitudes toward LLMs in the software engineering space are based on vibes. Vibes are like __, everybody has them.
In the LLM world, benchmarks try to solve this problem. The goal of an LLM benchmark (or a performance metric for any ML model, really) is to evaluate the performance of the algorithm on inputs outside of its training dataset. As I’ve used LLMs more and more in my day job, I’ve made a habit of researching benchmarks. I do think there are a lot of problems with existing benchmarks, and some are better than others. But considered in aggregate, they paint a clear picture of where we’ve come from, where we are, and where we are going.
In this article I want to focus on where we are. I’m writing this mainly to collect my own thoughts, but I hope it will be broadly useful for engineers who are interested in, or skeptical of, these tools.
Coding Assistants
First, let’s define a few terms. Large Language Models (LLMs) are machine learning models with a very large number of parameters that take a sequence of tokens as input and output the next token. Tokens can take many forms, but typically represent a piece of text or an image. “AI” has become a colloquial/marketing term for LLMs. Broadly speaking, there are currently no frontier “AI systems” that are not based on LLMs, so if you hear “AI”, you can plug in “LLM” and be right approximately 100% of the time.
“Agentic AI” is a newer term. Typically it refers to “AI systems” (read: “LLMs”) that have access to tools they can call in a loop to take action in an environment. The most popular instantiation of agentic AI by far has been coding assistants. “Coding assistants” are LLMs that have access to tools that read and modify code. In many cases they also have tools to run terminal commands, search the web for documentation, create architecture plans, and more. But, generally, the formula is just “Coding Assistant” = “LLM” + “Code Tools”. The “Code Tools” part of this equation is usually called the “scaffold”, so I will use that term here: “Coding Assistant” = “LLM” + “Scaffold”.
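To make that loop concrete, here is a minimal sketch of how a scaffold drives an LLM with tools. It assumes an OpenAI-style chat-completions API; the two toy tools and the dispatch table are my own simplifications, not how any particular product is implemented.

```python
import json
import subprocess
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # any chat-completions-compatible endpoint works here

# Two toy tools; real scaffolds expose many more (edit, search, plan, ...).
def read_file(path: str) -> str:
    return Path(path).read_text()

def run_shell(cmd: str) -> str:
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

TOOL_FNS = {"read_file": read_file, "run_shell": run_shell}
TOOLS = [
    {"type": "function", "function": {
        "name": "read_file", "description": "Read a file from disk.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]}}},
    {"type": "function", "function": {
        "name": "run_shell", "description": "Run a shell command.",
        "parameters": {"type": "object",
                       "properties": {"cmd": {"type": "string"}},
                       "required": ["cmd"]}}},
]

def run_agent(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        msg = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=TOOLS,
        ).choices[0].message
        if not msg.tool_calls:          # no tool requests: the model is done
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:     # run each requested tool, feed the result back
            args = json.loads(call.function.arguments)
            result = TOOL_FNS[call.function.name](**args)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": result})
    return "step limit reached"
```

Everything interesting about a real scaffold (its system prompts, tool set, sandboxing, and context management) lives in how this loop is fleshed out.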
It’s important to emphasize that the LLM and the scaffold are independent. Typically the frontier lab products (e.g. Codex, Claude Code) build their own scaffolds tailored to their LLM’s strengths (or fine-tune the LLM to the scaffold), but that is not a necessity. The table below gives the model, the scaffold, and the product name for some popular coding assistants.
| Product | LLM | Scaffold |
|---|---|---|
| Claude Code | Claude Sonnet 4 / Opus 4 | Claude Code |
| Codex | gpt-5.2 / gpt-5.1-codex | Codex |
| Cursor | GPT, Sonnet, Opus, Gemini, etc. | Cursor |
| Cline | GPT, Sonnet, Opus, Gemini, etc. | Cline |
As you can see, the product often shares a name with the scaffold; that is because the scaffold often is the product. The same LLM running on two different scaffolds can have materially different performance. Some people even hack lab-specific scaffolds to use another lab’s LLM. For example, [ref] hacked Claude Code to use gpt-5.1-codex.
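Mechanically, swaps like that are possible because a scaffold ultimately just sends requests to a model API in a loop like the one sketched above; point it at a different endpoint (or translate between API formats) and the rest of the scaffold is unchanged. Here is a minimal sketch of the idea using an OpenAI-compatible client; the endpoint and model name are placeholders, not the actual setup used in that hack.

```python
from openai import OpenAI

# The scaffold's prompts, tools, and loop stay the same; only the endpoint
# and model name change. Both values below are placeholders.
client = OpenAI(base_url="https://other-lab.example.com/v1", api_key="...")
reply = client.chat.completions.create(
    model="other-labs-model",
    messages=[{"role": "user", "content": "Fix the failing test in utils.py"}],
)
print(reply.choices[0].message.content)
```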
Later I plan to write a more detailed blog post on the technical internals of coding assistant scaffolds, but for now I just want to emphasize that any benchmarking of coding assistants needs to consider both the LLM and the scaffold.
Types of Benchmarks
Now for the fun part: let’s get into the benchmarks. I will focus specifically on benchmarks for coding assistant / agentic coding workflows. Note that before the coding assistant hype cycle started, there were many LLM benchmarks that focused on short, self-contained tasks. For example, one task from OpenAI’s HumanEval [ref] benchmark prompted the LLM to implement this function:
```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """
    Check if in given list of numbers, are any two numbers closer to each other
    than the given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```
The LLM is given one or more independent attempts to generate a solution, unit tests are run on the solution, and if the tests pass, the LLM passes the task. These types of benchmarks test an LLM’s ability to solve a self-contained, well-defined coding task.
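A rough sketch of how such a harness can score a single attempt: stitch the prompt and the model’s completion together, run the task’s tests, and count any failure or crash as a miss. The function below is illustrative, not the actual HumanEval evaluation code (which, among other things, sandboxes execution and applies timeouts).

```python
def attempt_passes(prompt: str, completion: str, test_code: str) -> bool:
    """Return True if the model's completion passes the task's unit tests."""
    program = prompt + completion + "\n" + test_code
    namespace: dict = {}
    try:
        exec(program, namespace)  # defines the function, then runs the asserts
        return True
    except Exception:
        return False              # assertion errors and crashes both count as failures
```

When multiple independent attempts are allowed, results are usually reported as pass@k: the probability that at least one of k sampled solutions passes.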
Most SWE work does not consist of self-contained, well-defined tasks. It involves taking a vague, poorly specified feature request or bug report, analyzing a large codebase to understand how best to implement the feature or fix, and then executing the code changes. This is the scope at which coding assistants claim to operate, and the scope we want to understand. For this reason, I will exclude benchmarks built around short, self-contained tasks and instead focus on those that target complex tasks in an existing, large codebase.
The most popular benchmark that focuses on complex tasks in existing large codebases is SWE-bench [ref]. SWE-bench was created around 2023 by scraping real GitHub repos for merged pull requests associated with a (now-closed) issue. Each PR has an associated set of tests that initially fail. The coding assistant is given the issue text and the entire codebase and asked to solve the issue. After the coding assistant makes its changes, the same set of tests that initially failed must now pass. This is called a “Fail-to-Pass” test and is the primary signal used by SWE-bench. A real example problem from SWE-bench can be found here [ref].
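To make the “Fail-to-Pass” mechanics concrete, here is a rough sketch of the per-task check, assuming the base commit, the assistant’s patch, and the relevant test IDs are already known. This is illustrative, not the official SWE-bench harness, which among other things runs each task in an isolated environment and also checks that previously passing tests still pass.

```python
import subprocess

def run_tests(repo_dir: str, tests: list[str]) -> bool:
    """Run the given test IDs and report whether they all pass."""
    result = subprocess.run(["python", "-m", "pytest", *tests], cwd=repo_dir)
    return result.returncode == 0

def fail_to_pass(repo_dir: str, base_commit: str, model_patch: str,
                 target_tests: list[str]) -> bool:
    # Start from the state of the repo when the issue was filed.
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)

    # Sanity check: the target tests should fail before any change is made.
    assert not run_tests(repo_dir, target_tests)

    # Apply the coding assistant's proposed patch.
    subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                   input=model_patch, text=True, check=True)

    # The task counts as solved only if those same tests now pass.
    return run_tests(repo_dir, target_tests)
```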
Benchmarks
As of the time of writing (December 2025), the table below is the most comprehensive summary of existing coding assistant benchmarks on complex tasks in large codebases.