OpenAI has declared SWE-bench Verified, a benchmark for assessing AI coding ability, unreliable due to severe contamination: its scores were inflated and had previously been used to claim dominance. An audit found that 59.4% of tasks were flawed, with models recalling solutions from their training data rather than solving the problems independently.

In response, OpenAI introduced SWE-bench Pro, a more stringent benchmark built on diverse codebases with reduced exposure to training data. Scores fell dramatically, from roughly 70% on the old benchmark to around 23% on the new one.

The shift complicates the competitive landscape, particularly as emerging models such as DeepSeek close in on established players. OpenAI's move to privately authored evaluations, together with its acknowledgment of SWE-bench Verified's shortcomings, underscores how hard it remains to measure AI coding performance accurately, a task essential to the credibility of claimed AI advances.
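The contamination claim rests on models reproducing known fixes without genuinely working the problem. As a rough illustration only, a memorization probe along these lines might withhold the repository entirely and check whether a model's patch still matches the known gold patch. The Python sketch below is hypothetical and is not OpenAI's audit methodology: `query_model`, the task format, and the 0.8 similarity threshold are all illustrative assumptions.

```python
import difflib

def looks_memorized(model_patch: str, gold_patch: str,
                    threshold: float = 0.8) -> bool:
    """Flag a suspiciously close match between the model's output and
    the known fix. High similarity despite having no repository access
    suggests the solution was seen in training, not derived.
    The 0.8 threshold is an arbitrary illustrative choice."""
    ratio = difflib.SequenceMatcher(None, model_patch, gold_patch).ratio()
    return ratio >= threshold

def contamination_rate(tasks, query_model) -> float:
    """tasks: iterable of (issue_text, gold_patch) pairs.
    query_model: hypothetical callable returning the model's proposed
    patch given only the issue text, with no repo context."""
    tasks = list(tasks)
    flagged = sum(
        looks_memorized(query_model(issue_text), gold_patch)
        for issue_text, gold_patch in tasks
    )
    return flagged / len(tasks) if tasks else 0.0  # fraction flagged
```

A high flagged fraction under this kind of probe would point to memorization rather than genuine problem-solving, which is the failure mode the audit describes.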
