OpenAI has declared SWE-bench Verified, a benchmark for assessing AI coding ability, unreliable due to severe contamination: its scores were inflated and had previously been used to claim dominance. An audit found that 59.4% of tasks were flawed, with models recalling solutions from their training data rather than solving the problems independently.

In response, OpenAI introduced SWE-bench Pro, a more stringent benchmark built on diverse codebases with reduced exposure to training data. Scores fell dramatically, from roughly 70% on the old benchmark to around 23% on the new one.

The shift complicates the competitive landscape, particularly as emerging models such as DeepSeek close in on established players. OpenAI's move to privately authored evaluations, together with its acknowledgment of SWE-bench Verified's shortcomings, underscores how hard it remains to measure AI coding performance accurately, a task essential to the credibility of claimed AI advances.
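The contamination claim rests on models reproducing known fixes without genuinely working the problem. As a rough illustration only, a memorization probe along these lines might withhold the repository entirely and check whether a model's patch still matches the known gold patch. The Python sketch below is hypothetical and is not OpenAI's audit methodology: `query_model`, the task format, and the 0.8 similarity threshold are all illustrative assumptions.

```python
import difflib

def looks_memorized(model_patch: str, gold_patch: str,
                    threshold: float = 0.8) -> bool:
    """Flag a suspiciously close match between the model's output and
    the known fix. High similarity despite having no repository access
    suggests the solution was seen in training, not derived.
    The 0.8 threshold is an arbitrary illustrative choice."""
    ratio = difflib.SequenceMatcher(None, model_patch, gold_patch).ratio()
    return ratio >= threshold

def contamination_rate(tasks, query_model) -> float:
    """tasks: iterable of (issue_text, gold_patch) pairs.
    query_model: hypothetical callable returning the model's proposed
    patch given only the issue text, with no repo context."""
    tasks = list(tasks)
    flagged = sum(
        looks_memorized(query_model(issue_text), gold_patch)
        for issue_text, gold_patch in tasks
    )
    return flagged / len(tasks) if tasks else 0.0  # fraction flagged
```

A high flagged fraction under this kind of probe would point to memorization rather than genuine problem-solving, which is the failure mode the audit describes.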
