developer leaderboard

By stance · 1 take across the edition

Morph LLM (coding leaderboard) · United States · OpenAI drops SWE-bench, finding most hard tasks are broken

Maintains a SWE-bench Pro / cost-per-task leaderboard; documents the limits of static coding benchmarks and the shift toward task- and cost-based scoring as raw SWE-bench scores lose credibility.

“Claude Opus 4.8 scores 88.6% SWE-bench Verified and is the practical pick; benchmark inflation makes raw scores hard to trust.”
