rbtfl.
OpenAI drops SWE-bench, finding most hard tasks are broken

An audit finds 59.4% of the hardest problems have unsolvable or false-passing tests, inflating coding scores by 5-15 points

AI · evals-benchmarks · contested-result · What they don't say · The silent change · 4 takes · rbtfl, updated 25 Jun 2026

Summary

On 23 February 2026, OpenAI's Frontier Evals team explained why it had stopped reporting SWE-bench Verified: an audit found that 59.4% of the hardest problems had fundamentally flawed or unsolvable test cases, including tests that pass even when the underlying bug is unfixed, implying 5-15 points of score inflation on post-2023 models. The finding undercuts the headline coding numbers labs cite (Anthropic Opus 4.8 at 88.6%, the suspended Fable 5 at 95.0%, Google DeepMind Gemini near 80%). It accelerates a shift away from static suites toward task-completion, cost-per-task and agent benchmarks such as Terminal-Bench, and feeds the "evaluation gap" as models grow situationally aware during tests.

By the numbers

  • 59.4%, hardest SWE-bench tasks with flawed/unsolvable tests.
  • 5-15 points, estimated inflation on post-2023 models.
  • 23 Feb 2026, OpenAI's disclosure.
  • 88.6%, Opus 4.8 SWE-bench Verified (now suspect).
  • 95.0%, Fable 5 score (model suspended).
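
To make the scale of the claimed inflation concrete, the sketch below subtracts the 5-15 point range from the two exact scores quoted above. It assumes, purely for illustration, that the inflation is a flat subtraction applied equally to every model; the audit as summarised here does not say this, and the names `REPORTED` and `adjusted_range` are hypothetical.

```python
# Reported SWE-bench Verified scores quoted in this brief (percent).
REPORTED = {"Opus 4.8": 88.6, "Fable 5": 95.0}

# Estimated inflation range from the audit, in percentage points.
INFLATION_MIN, INFLATION_MAX = 5.0, 15.0


def adjusted_range(score: float) -> tuple[float, float]:
    """Return (low, high) after subtracting the inflation range.

    Assumption: inflation is a flat subtraction, identical for every
    model. This is an illustrative simplification, not the audit's method.
    """
    low = round(score - INFLATION_MAX, 1)
    high = round(score - INFLATION_MIN, 1)
    return low, high


for model, score in REPORTED.items():
    low, high = adjusted_range(score)
    print(f"{model}: reported {score:.1f}%, adjusted {low:.1f}-{high:.1f}%")
```

Under that flat-subtraction assumption, an 88.6% headline figure would sit somewhere between 73.6% and 83.6%, which is wide enough to reorder a leaderboard.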

Why it matters

If the field's flagship coding benchmark is broken, the public capability rankings that labs market and investors price in are unreliable. The finding pushes evaluation toward harder-to-game agentic tasks and strengthens the case that models are learning to exploit eval artefacts rather than solve problems.

What to watch

  • Whether Anthropic and Google restate or defend their SWE-bench numbers.
  • Adoption of Terminal-Bench / cost-per-task as the new standard.
  • A repaired SWE-bench or a successor benchmark.