Last updated: Jan 9, 2026

Claude Code Opus 4.5 Performance Tracker

The goal of this tracker is to detect statistically significant degradations in the performance of Claude Code with Opus 4.5 on software engineering (SWE) tasks.

• Updated daily: Daily benchmarks on a curated subset of SWE-Bench-Pro
• Detect degradation: Statistical tests flag significant drops in pass rate (p < 0.05)
• What you see is what you get: We benchmark directly in the Claude Code CLI with the SOTA model (currently Opus 4.5), with no custom harnesses.

Summary

Status

Degradation Status

Shows if any time period has a statistically significant performance drop (p < 0.05).

Nominal

Daily Pass Rate

Percentage of benchmark tasks passed in the most recent day's evaluations.

56%

50 evaluations

7-day Pass Rate

Aggregate pass rate over the last 7 days. Provides a more stable measure than daily results.

58%

350 evaluations

30-day Pass Rate

Aggregate pass rate over the last 30 days. Best measure of overall sustained performance.

58%

1500 evaluations

Daily Trend

Pass rate over time

[Chart: daily pass rate; the shaded region shows the 95% confidence interval]

Weekly Trend

Aggregated 7-day pass rate

[Chart: 7-day pass rate; the shaded region shows the 95% confidence interval]

Change Overview

Performance delta by period

Chart legend: Improvement, Regression. Each delta is plotted on an axis from -10% to +10%.

• 1D (Yesterday): -2.0%, not statistically significant
• 7D (Last Week): +0.0%, not statistically significant
• 30D (Last Month): +0.0%, not statistically significant

Methodology

This tracker is run by an independent third party with no affiliation to frontier model providers.

Context: In September 2025, Anthropic published a postmortem on Claude degradations. We want to offer a resource to detect such degradations in the future.

We run a daily evaluation of Claude Code CLI on a curated, contamination-resistant subset of SWE-Bench-Pro. We always use the latest available Claude Code release and the SOTA model (currently Opus 4.5). Benchmarks run directly in Claude Code without custom harnesses, so results reflect what actual users can expect. This allows us to detect degradation related to both model changes and harness changes.

Each daily evaluation runs on N=50 test instances, so daily variability is expected. Weekly and monthly results are aggregated for more reliable estimates.
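To make the effect of sample size concrete, here is a minimal sketch that computes a 95% confidence interval at the three sample sizes used by the tracker. It assumes a Wilson score interval, which is one common choice for binomial proportions; the tracker does not state which interval it uses. The function name `wilson_interval` and the pass counts (picked to match the displayed, rounded pass rates) are illustrative assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% when z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = passes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Sample sizes from the tracker: 50 per day, 350 per 7 days, 1500 per 30 days.
# Pass counts are assumptions chosen to match the displayed rates (56%, 58%, 58%).
for label, passes, n in [("daily", 28, 50), ("7-day", 203, 350), ("30-day", 870, 1500)]:
    low, high = wilson_interval(passes, n)
    print(f"{label:>6}: {passes / n:.1%}  95% CI [{low:.1%}, {high:.1%}]  width {high - low:.1%}")
```

Under these assumptions, the daily interval spans roughly 13 percentage points on either side of the estimate, while the 30-day interval spans roughly 2.5. This is why a swing of a few points from one day to the next is not evidence of degradation on its own.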

We model each test outcome as a Bernoulli random variable (pass or fail) and compute 95% confidence intervals around the daily, weekly, and monthly pass rates. A statistically significant difference (p < 0.05) at any of these time horizons is reported.
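One way such a comparison could be carried out is a one-sided, pooled two-proportion z-test, sketched below. This is an assumption for illustration: the tracker does not specify which test it runs. The function name `degradation_p_value`, the choice of baseline window, and the pass counts in the usage lines are all hypothetical.

```python
import math


def degradation_p_value(base_passes: int, base_n: int,
                        recent_passes: int, recent_n: int) -> float:
    """One-sided p-value for H1: recent pass rate < baseline pass rate.

    Uses a pooled two-proportion z-test with a normal approximation.
    It treats the two samples as independent, which would not hold
    if the baseline window also contained the recent evaluations.
    """
    p_base = base_passes / base_n
    p_recent = recent_passes / recent_n
    pooled = (base_passes + recent_passes) / (base_n + recent_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / recent_n))
    if se == 0:
        return 1.0  # every run passed or every run failed: no evidence of a drop
    z = (p_recent - p_base) / se
    # Standard normal CDF evaluated at z, via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Hypothetical counts: a 58% baseline over 1500 runs versus one day of 50 runs.
small_dip = degradation_p_value(870, 1500, 28, 50)   # 56% on the day
large_drop = degradation_p_value(870, 1500, 18, 50)  # 36% on the day
print(f"small dip:  p = {small_dip:.3f}, significant: {small_dip < 0.05}")
print(f"large drop: p = {large_drop:.3f}, significant: {large_drop < 0.05}")
```

With these hypothetical counts, a two-point dip on a single day stays well above the 0.05 threshold, whereas a drop to 36% falls far below it. Testing several time horizons every day also raises the chance of a false alarm, so a real deployment might add a multiple-comparison correction.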