ML research

By stance · 2 takes across the edition

MarkTechPost · United States · Coding agents become the frontier's main proving ground

Ranks software-development agents by benchmark, putting Codex + GPT-5.5 atop Terminal-Bench at 83.4% with Claude Code + Fable 5 close behind at 83.1%, and stresses that agent performance is a property of the tool and model together, not of the raw model score.

“Codex + GPT-5.5 leads Terminal-Bench at 83.4%; Claude Code + Fable 5 is 83.1%.”

Source ↗

MarkTechPost · United States · OpenAI drops SWE-bench, finding most hard tasks are broken

A benchmark-driven ranking of coding agents that surfaces a finding from OpenAI's Frontier Evals: in 59.4% of the hardest SWE-bench tasks, the tests passed even with the bug unfixed. The piece takes this to imply a 5-15 point score inflation for post-2023 models and reports that it prompted OpenAI to stop reporting the score.

“OpenAI found 59.4% of the hardest SWE-bench tasks had tests that pass even when the underlying bug is unfixed.”

Source ↗