developer leaderboard
By stance · 1 take across the edition
Morph LLM (coding leaderboard) · United States · OpenAI drops SWE-bench, finding most hard tasks are broken
Maintains a SWE-bench Pro / cost-per-task leaderboard; documents the limits of static coding benchmarks and the shift toward task- and cost-based scoring as raw SWE-bench loses credibility.
“Claude Opus 4.8 scores 88.6% SWE-bench Verified and is the practical pick; benchmark inflation makes raw scores hard to trust.”