New Apple Study Questions the True Reasoning Abilities of AI Models

In June, Apple researchers published a study assessing how well simulated reasoning (SR) models, including OpenAI’s o1 and o3, solve novel problems that require systematic thinking. Their findings echoed a prior study that evaluated such models on proofs from the United States of America Mathematical Olympiad (USAMO), in which most models scored under 5 percent on novel mathematical proofs and only one reached 25 percent. The Apple study, led by Parshin Shojaee, examined how “large reasoning models” (LRMs) simulate logical reasoning, typically through a “chain-of-thought” method, by testing the models on four classic puzzles of varying complexity. The authors argue that existing evaluations focus on accuracy on familiar tasks without probing whether a model’s reasoning process is genuine. Performance declined sharply on problems requiring extended reasoning, revealing substantial challenges in novel reasoning tasks and underscoring the need to reassess how we evaluate AI reasoning.
