Codex gpt-5.2-high Performance Tracker
The goal of this tracker is to detect statistically significant degradations in the performance of Codex with gpt-5.2-high on SWE tasks.
- Updated daily: Benchmarks run every day on a curated subset of SWE-Bench-Pro
- Detect degradation: Statistical testing flags significant performance drops
- What you see is what you get: We benchmark directly in Codex CLI with gpt-5.2-high, with no custom harnesses.
Summary
- Daily Trend: pass rate over time. A dashed line marks the 58% baseline with its ±13.3% significance threshold.
- Weekly Trend: 7-day aggregated pass rate.
- Change Overview: performance delta by period.
Methodology
This tracker aims to detect statistically significant degradations in the performance of Codex with gpt-5.2-high on SWE tasks. We are an independent third party with no affiliation to any frontier model provider.
Context: As AI coding assistants are increasingly relied upon for software development, monitoring their performance over time is critical. This tracker helps detect regressions that may affect developer productivity.
We run a daily evaluation of Codex CLI on a curated, contamination-resistant subset of SWE-Bench-Pro. We always use the latest available Codex release and the gpt-5.2-high model. Benchmarks run directly in Codex without custom harnesses, so results reflect what actual users can expect. This lets us detect degradation caused by both model changes and harness changes.
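To make the setup concrete, here is a minimal sketch of what a daily run could look like, assuming a hypothetical swe_bench_pro_subset.json task list and a non-interactive `codex exec --model ...` invocation. The tracker's actual harness, prompts, and grading are not published here, so treat every file name, field, and flag below as illustrative.

```python
#!/usr/bin/env python3
"""Illustrative daily-run loop; not the tracker's actual harness."""
import json
import subprocess
from datetime import date
from pathlib import Path

MODEL = "gpt-5.2-high"  # model name as shown on this page (assumed CLI spelling)

# Hypothetical file describing the curated, contamination-resistant subset (N=50).
tasks = json.loads(Path("swe_bench_pro_subset.json").read_text())

results = []
for task in tasks:
    # Assumed non-interactive Codex CLI invocation; the real flags and prompting may differ.
    proc = subprocess.run(
        ["codex", "exec", "--model", MODEL, task["prompt"]],
        cwd=task["repo_path"],  # hypothetical local checkout of the task's repository
        capture_output=True,
        text=True,
    )
    # In the real benchmark, pass/fail comes from running the instance's test suite
    # against the edited repository; the exit code recorded here is only a stand-in.
    results.append({"task_id": task["id"], "exit_code": proc.returncode})

Path("results").mkdir(exist_ok=True)
Path(f"results/{date.today()}.json").write_text(json.dumps(results, indent=2))
```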
Each daily evaluation runs on N=50 test instances, so daily variability is expected. Weekly and monthly results are aggregated for more reliable estimates.
We model each test instance as a Bernoulli random variable and compute 95% confidence intervals around the daily, weekly, and monthly pass rates. Statistically significant differences at any of these time horizons are reported.
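As a concrete sketch of that calculation: the snippet below computes a Wilson score interval for a Bernoulli pass rate and applies a simple "interval entirely below baseline" rule. Both the choice of the Wilson interval and that decision rule are assumptions for illustration; at a 58% pass rate on N=50 the half-width comes out to roughly ±13 percentage points, in line with the threshold drawn on the daily chart, and pooling a full week of runs narrows it to about ±5 points.

```python
from math import sqrt

Z = 1.96  # two-sided 95% normal quantile


def wilson_ci(passes: int, n: int, z: float = Z) -> tuple[float, float]:
    """95% Wilson score interval for a Bernoulli pass rate (passes successes out of n runs)."""
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def degraded(passes: int, n: int, baseline: float) -> bool:
    """Assumed decision rule: flag a degradation when the whole interval sits below the baseline."""
    return wilson_ci(passes, n)[1] < baseline


# Daily numbers from the dashboard: 29/50 passes (58%) against the 58% baseline.
print(wilson_ci(29, 50))        # ~ (0.44, 0.71): about +/-13 points around the pass rate
print(degraded(29, 50, 0.58))   # False: the interval straddles the baseline
print(degraded(20, 50, 0.58))   # True: at 40% the upper bound (~0.54) sits below the baseline

# Pooling a week of runs (7 x 50 = 350 instances) at the same rate narrows the interval.
print(wilson_ci(7 * 29, 7 * 50))  # ~ (0.53, 0.63): about +/-5 points
```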