Last updated: Jan 19, 2026

Codex gpt-5.2-high Performance Tracker

The goal of this tracker is to detect statistically significant degradations in the performance of Codex with gpt-5.2-high on SWE tasks.

  • Updated daily: Daily benchmarks on a curated subset of SWE-Bench-Pro
  • Detect degradation: Statistical testing for degradation detection
  • What you see is what you get: We benchmark in Codex CLI with gpt-5.2-high directly, no custom harnesses.

Summary

Status: Nominal
Daily Pass Rate: 59% (49 evaluations)
7-day Pass Rate: 56% (246 evaluations)
30-day Pass Rate: 55% (395 evaluations)

Daily Trend

Pass rate over time

[Chart: daily pass rate with a dashed line at the 58% baseline and a ±13.3% significance threshold]

Weekly Trend

Aggregated 7-day pass rate

[Chart: no data available yet]

Change Overview

Performance delta by period

1D (Yesterday): +1.2%, not statistically significant (with 49 trials, a ±13.3% change is needed for p < 0.05)
7D (Last Week): -1.9%, not statistically significant (with 246 trials, a ±5.5% change is needed for p < 0.05)
30D (Last Month): -3.1%, not statistically significant (with 395 trials, a ±4.3% change is needed for p < 0.05)
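
The thresholds in the notes above shrink roughly with the square root of the trial count. The sketch below reproduces that scaling with a plain normal approximation around the 58% baseline; it is illustrative only, since the tracker's exact thresholds may come from a different test and can differ by a point or so.

```python
import math

BASELINE = 0.58  # dashboard baseline pass rate
Z_95 = 1.96      # two-sided critical value for p < 0.05

def min_detectable_change(n_trials: int, p0: float = BASELINE) -> float:
    """Approximate pass-rate change needed for significance at p < 0.05,
    using the normal approximation to the binomial."""
    return Z_95 * math.sqrt(p0 * (1 - p0) / n_trials)

for n in (49, 246, 395):
    print(f"n={n}: ±{min_detectable_change(n):.1%} change needed")
# Larger aggregation windows need a smaller change to reach significance,
# which is why the 7-day and 30-day thresholds are tighter than the daily one.
```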

Methodology

The goal of this tracker is to detect statistically significant degradations in the performance of Codex with gpt-5.2-high on SWE tasks. We are an independent third party with no affiliation to any frontier model provider.

Context: As AI coding assistants become increasingly relied upon for software development, monitoring their performance over time is critical. This tracker helps detect regressions that may affect developer productivity.

We run a daily evaluation of Codex CLI on a curated, contamination-resistant subset of SWE-Bench-Pro, always using the latest available Codex release and the gpt-5.2-high model. Benchmarks run directly in Codex CLI without custom harnesses, so results reflect what actual users can expect, and we can detect degradations caused by model changes as well as harness changes.

Each daily evaluation runs on N=50 test instances, so daily variability is expected. Weekly and monthly results are aggregated for more reliable estimates.
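
To illustrate how much of that variability is pure sampling noise, here is a small simulation (synthetic data, not tracker results): with a fixed true pass rate of 58% over 50 tasks, daily scores still move by several points.

```python
import random
import statistics

random.seed(0)
TRUE_RATE, N_TASKS, N_DAYS = 0.58, 50, 365

# Simulate a year of daily runs with no underlying change in the model
# or harness: each task passes independently with probability TRUE_RATE.
daily_rates = [
    sum(random.random() < TRUE_RATE for _ in range(N_TASKS)) / N_TASKS
    for _ in range(N_DAYS)
]

print(f"min {min(daily_rates):.0%}  max {max(daily_rates):.0%}  "
      f"sd {statistics.pstdev(daily_rates):.1%}")
# Daily pass rates span roughly 40-75% from sampling noise alone,
# which is why single-day dips are not treated as regressions.
```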

We model each test outcome as a Bernoulli random variable and compute 95% confidence intervals around the daily, weekly, and monthly pass rates. Statistically significant differences from the baseline at any of these horizons are reported.
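
As a minimal sketch, assuming a normal-approximation (Wald) interval and independent tasks (the production check may use a different test, e.g. an exact binomial test), the computation looks roughly like this:

```python
import math

BASELINE = 0.58
Z_95 = 1.96  # two-sided 95% confidence

def pass_rate_ci(passed: int, n: int) -> tuple[float, float]:
    """95% confidence interval for a pass rate, treating each task
    as an independent Bernoulli trial (normal approximation)."""
    p = passed / n
    half_width = Z_95 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

def significant_change(passed: int, n: int, baseline: float = BASELINE) -> bool:
    """Report a statistically significant difference when the baseline
    falls outside the window's confidence interval."""
    low, high = pass_rate_ci(passed, n)
    return not (low <= baseline <= high)

# Example: the 30-day window above (55% of 395 evaluations).
passed_30d = round(0.55 * 395)
print(pass_rate_ci(passed_30d, 395))        # roughly (0.50, 0.60)
print(significant_change(passed_30d, 395))  # False: the 58% baseline is inside
```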