Terminal-Bench: Agents in Open-Ended Environments

A Deep Dive

JB

This is part 2 of a multi-part series on the state of LLM/AI-based coding assistants. In part 1 we did a deep dive into SWE-Bench, the most influential LLM benchmark related to coding assistants. In this post we look at Terminal-Bench, a slightly less popular benchmark that evaluates how well agents operating in a terminal environment can solve complex tasks requiring extensive tool use.

What is and isn’t Terminal-Bench?

Terminal-Bench is a benchmark for evaluating the accuracy of LLM-based agents on tasks in a terminal environment.

It helps to understand how Terminal-Bench differs from benchmarks like SWE-Bench. As a reminder, SWE-Bench assesses a system’s ability to fix real GitHub issues on real-world, complex codebases. In SWE-Bench, the input to the LLM system is the text of a GitHub issue and a codebase, and the output is a patch that is applied to the original issue commit and verified against tests. See part 1 in the series for more details.

In Terminal-Bench, the input is different. It consists of a task description (instructions) and a pre-specified terminal environment that can include existing code files, non-code files (binaries, images, etc.), CLI tools, and more. The output is not a patch. Instead, the output is a new, modified version of the environment created by the LLM agent executing tools against the original environment. This new environment is evaluated on a series of tests to confirm that the task description has been solved.
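Conceptually, the evaluation flow can be sketched with the plain docker CLI, as below. This is only a sketch, not the actual Terminal-Bench harness: the image name, paths, the hypothetical agent_loop driver, and the assumption that the tests run under pytest are all illustrative.

eval_loop_sketch.py

import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 1. Build the task's pre-specified environment from its Dockerfile.
run(["docker", "build", "-t", "tb-task", "path/to/task"])

# 2. Start a container; the agent then executes commands inside it, prompted
#    by instruction.md (the agent loop itself is omitted here).
run(["docker", "run", "-d", "--name", "tb-run", "tb-task", "sleep", "infinity"])
instruction = open("path/to/task/instruction.md").read()
# agent_loop(instruction, container="tb-run")   # hypothetical agent driver

# 3. After the agent finishes, run the task's tests against the *modified*
#    container state; the task counts as solved only if they all pass.
result = subprocess.run(["docker", "exec", "tb-run", "pytest", "/tests", "-q"])
print("solved" if result.returncode == 0 else "unsolved")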

A Real Terminal-Bench Problem

Let’s look at a real Terminal-Bench example problem:

These are the instructions passed to the LLM agent.

instruction.md

Download this video of someone playing zork. https://www.youtube.com/watch?v=ZCbvyPbhRfA. Then transcribe the entire contents of the text, and create a file /app/solution.txt that has all the moves they input, one per line, in the format 'n' or 'get bag' etc.
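For a sense of what solving this involves, here is a rough sketch of one way an agent might attack it: download the video, dump frames, OCR the on-screen text, and pull out the commands typed at the Zork prompt. The tool choices (yt-dlp, ffmpeg, pytesseract) and the parsing logic are assumptions for illustration, not the reference solution.

solve_zork_sketch.py

import re
import subprocess
from pathlib import Path

from PIL import Image          # assumes Pillow is installed in the container
import pytesseract             # assumes tesseract-ocr is installed

URL = "https://www.youtube.com/watch?v=ZCbvyPbhRfA"

# 1. Download the video and dump roughly one frame per second.
subprocess.run(["yt-dlp", "-f", "mp4", "-o", "video.mp4", URL], check=True)
Path("frames").mkdir(exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "video.mp4", "-vf", "fps=1", "frames/%05d.png"],
    check=True,
)

# 2. OCR each frame and keep lines typed at the Zork prompt (">").
#    The de-duplication here is naive: consecutive frames repeat earlier
#    commands, so a real solution needs more careful parsing.
moves = []
for frame in sorted(Path("frames").glob("*.png")):
    text = pytesseract.image_to_string(Image.open(frame))
    for line in text.splitlines():
        m = re.match(r"^\s*>\s*(\S.*)", line)
        if m:
            move = m.group(1).strip().lower()
            if not moves or moves[-1] != move:
                moves.append(move)

# 3. Write the answer in the required format, one move per line.
Path("/app/solution.txt").write_text("\n".join(moves) + "\n")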

A Terminal-Bench problem consists of the following primary parts:

  • Task Instruction: A natural-language description of the task stored in instruction.md. This is the input prompt to the agent.
  • Environment: A Docker-based environment defined by a Dockerfile and configuration in task.toml. This includes resource limits, timeouts, and any pre-installed tools. This is used to configure the agent environment, but is not directly seen by the agent.
  • Solution: A reference solution (solve.sh or solution.yaml) that demonstrates a known-good path to completing the task. This is used for validation and oracle testing. This is not seen by the agent.
  • Tests: Verification scripts that check whether the agent successfully completed the task. Tests run after the agent finishes and are not visible to the agent during execution (a sketch of what they might look like follows this list).
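As a concrete, hypothetical illustration, the verification tests for the Zork task above might look roughly like the pytest file below. The actual tests and the ground-truth move list live in the benchmark and are hidden from the agent; the expected values here are placeholders.

test_solution_sketch.py

from pathlib import Path

SOLUTION = Path("/app/solution.txt")

def test_solution_file_exists():
    assert SOLUTION.is_file()

def test_one_move_per_line():
    moves = [l.strip() for l in SOLUTION.read_text().splitlines() if l.strip()]
    assert moves, "solution.txt should contain at least one move"
    # Placeholder assertion: the real tests would compare against the full
    # ground-truth move list for the video.
    assert moves[0] in {"n", "north", "open mailbox"}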

The figure below illustrates the process. A Terminal-Bench problem is considered solved if all tests pass. The Terminal-Bench score is the percentage of Terminal-Bench problems that are solved.

[Figure: Terminal-Bench evaluation flow. The task instruction (instruction.md) and the Docker environment in its original state go to the agent, which runs terminal commands (e.g. yt-dlp, ffmpeg, a Python script) to produce a modified environment state. Verification tests (file exists, content valid) then decide whether the task is completed.]

The History

There are now several versions of Terminal-Bench, which are listed on the official website. The most important variants are the original Terminal-Bench Core (1.0) and the newer Terminal-Bench 2.0. Here is the history of each, along with key differences, limitations, and an example task:

Terminal-Bench Core (1.0)

May 2025 · Laude Institute and Stanford

The first broadly accepted benchmark for evaluating LLM agents in terminal environments. Tasks span system administration, file operations, web scraping, security challenges, and more.

Tasks

247 (originally 80)

Key Differences

  • First large-scale benchmark for terminal-based agent tasks
  • Docker-based isolated environments with reproducible setups
  • Tasks require multi-step reasoning and tool use
  • Covers diverse domains from kernel compilation to game playing

Limitations

  • Limited verification: some edge cases may not be caught by tests
  • Solutions are publicly available on the web; although a canary string is present, solutions may still appear in pre-training corpora
  • Somewhat small sample size, although improved since the initial release

Example Task

These are the instructions passed to the LLM agent.

instruction.md
Build linux kernel linux-6.9 from source. Insert a custom printk statement into the start_kernel function:
printk(KERN_INFO "Hello, this is a custom kernel");
Use the pre-existing initramfs.list located in the ramfs directory. Generate the ramfs using gen_init_cpio. Execute via QEMU with specified parameters to verify the kernel boots and displays the custom message.
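To give a flavor of the steps involved, here is a rough sketch of how an agent might script this task. The source-patching heuristic, config target, artifact paths, and QEMU flags are assumptions for illustration, not the benchmark's reference solution.

solve_kernel_sketch.py

import subprocess
from pathlib import Path

def sh(cmd):
    subprocess.run(cmd, shell=True, check=True)

# 1. Insert the custom printk just inside start_kernel() in init/main.c.
#    (Fragile string matching; a real agent would inspect the file first.)
main_c = Path("linux-6.9/init/main.c")
lines = main_c.read_text().splitlines(keepends=True)
patched, in_sig = [], False
for line in lines:
    patched.append(line)
    if "start_kernel(void)" in line and not line.strip().endswith(";"):
        in_sig = True
    elif in_sig and line.strip() == "{":
        patched.append('\tprintk(KERN_INFO "Hello, this is a custom kernel");\n')
        in_sig = False
main_c.write_text("".join(patched))

# 2. Build the kernel (defconfig is an assumption; the task environment may
#    require a specific config).
sh("make -C linux-6.9 defconfig")
sh('make -C linux-6.9 -j"$(nproc)"')

# 3. Build the initramfs from the provided list and boot it under QEMU,
#    capturing output to check for the custom message.
sh("linux-6.9/usr/gen_init_cpio ramfs/initramfs.list > initramfs.cpio")
sh("timeout 120 qemu-system-x86_64 -kernel linux-6.9/arch/x86/boot/bzImage "
   "-initrd initramfs.cpio -nographic -append console=ttyS0 "
   "| tee boot.log || true")
assert "Hello, this is a custom kernel" in Path("boot.log").read_text()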

Where we Stand

Most frontier labs now report accuracy on Terminal-Bench 2.0. As someone who has been writing software professionally for over a decade, I find that Terminal-Bench 2.0 tasks range from “non-trivial, but I could probably figure it out in ~30 minutes” all the way to seriously challenging problems that I would be embarrassed to admit how long I would take to solve.

In the previous post on SWE-Bench I mentioned that SWE-Bench-Verified, the most popular SWE-Bench variant for frontier labs to report accuracy on, is not really representative of typical SWE task difficulty. For Terminal-Bench 2.0, the situation is distinctly different. Terminal-Bench 2.0 represents problems that would challenge most top-decile professional SWEs.

Here is where current frontier models stand, as of late 2025:

Terminal-Bench 2.0 scores for frontier models (accuracy %):

  • GPT-5.2 Codex*: 64%
  • GPT-5.2*: 62.2%
  • Claude Opus 4.5: 59.27%
  • GPT-5.1 Codex-Max*: 58.1%
  • Gemini 3 Pro: 54.2%

*Maximum reasoning effort

Where are we Going?

To me, the pace of progress on Terminal-Bench 2.0 is more impressive than on SWE-Bench. Agents that can autonomously and successfully complete complex tasks in an open-ended computer environment are much closer to AGI than agents that only produce code patches.


Previous: SWE-Bench: The $500B Benchmark