AI Math Miscalculations: What the Latest Study Reveals
In the realm of artificial intelligence, even the most advanced large language models (LLMs) struggle with basic math. A recent benchmark study called ORCA (Omni Research on Calculation in AI) evaluated five leading models: ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2. The result? None scored above 63% accuracy.
Key Findings:
- Accuracy Scores:
  - Gemini 2.5 Flash: 63%
  - Grok 4: 62.8%
  - DeepSeek V3.2: 52%
  - ChatGPT-5: 49.4%
  - Claude Sonnet 4.5: 45.2%
- Common Errors:
  - 35% involved rounding inaccuracies
  - 33% were calculation mistakes
Researchers emphasize the need for benchmarks that test true computational reasoning, noting that past evaluations may not reflect real-world capabilities.
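To make the distinction between the two error types concrete, here is a minimal sketch of how a grader might separate a rounding slip from an outright calculation mistake when checking a model's numeric answer. The function name, tolerance values, and examples are illustrative assumptions, not ORCA's published scoring criteria.

```python
import math

def classify_numeric_answer(model_answer: float, ground_truth: float,
                            strict_tol: float = 1e-9,
                            rounding_tol: float = 5e-3) -> str:
    """Classify a model's numeric answer against the ground truth.

    Returns one of:
      - "correct": matches within a strict relative tolerance
      - "rounding_error": close to the truth, consistent with premature rounding
      - "calculation_error": too far off to be explained by rounding
    Thresholds are hypothetical, chosen only for illustration.
    """
    if math.isclose(model_answer, ground_truth, rel_tol=strict_tol):
        return "correct"
    if math.isclose(model_answer, ground_truth, rel_tol=rounding_tol):
        return "rounding_error"
    return "calculation_error"

# Example: a model that rounds 7/3 too early reports 2.33 instead of 2.3333...
print(classify_numeric_answer(2.33, 7 / 3))          # rounding_error
print(classify_numeric_answer(2.3333333333, 7 / 3))  # correct
print(classify_numeric_answer(3.0, 7 / 3))           # calculation_error
```

The idea is that a tight tolerance accepts genuinely correct answers, while a looser band flags values close enough to suggest premature rounding rather than a wrong computation.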
🚀 Engage with us! Share your thoughts on AI’s math reliability and future improvements.
