The development of KMMLU-Redux and KMMLU-Pro marks a significant advance in assessing AI models' command of Korean professional knowledge. The benchmarks measure AI performance against Korea's national professional qualification exams, including law, accounting, and medicine. OpenAI's "o1" model achieved the highest average accuracy at 79.55%, demonstrating strong reasoning capabilities, while Anthropic's "Claude 3.7 Sonnet" passed 12 of the 14 exams, showing the most consistent performance across subjects. The new benchmarks emerged from concerns about the reliability and contamination of the original KMMLU dataset, in which 7.66% of questions contained errors. KMMLU-Redux comprises 2,587 refined questions, while KMMLU-Pro includes 2,822 questions drawn from official exam sources. Despite strong overall results, some models struggled on the legal and tax exams because of the specialized nature of domestic Korean law. Future versions may add multimodal and subjective question types for a more comprehensive evaluation.
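For a concrete sense of how results like these are tabulated, the sketch below shows one plausible way to turn per-question predictions into per-exam accuracy, a pass count across the 14 exams, and an overall score. The record schema, field names, and the 60% passing threshold are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: scoring a model on a KMMLU-Pro-style benchmark.
# The data layout, field names, and PASS_THRESHOLD are assumptions made
# for illustration; the paper's actual scoring criteria may differ.
from collections import defaultdict

PASS_THRESHOLD = 0.6  # assumed per-exam passing cutoff

def score(records):
    """records: iterable of dicts with 'exam', 'gold', and 'prediction' keys (assumed schema)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["exam"]] += 1
        correct[r["exam"]] += int(r["prediction"] == r["gold"])

    per_exam = {exam: correct[exam] / total[exam] for exam in total}
    passed = sum(acc >= PASS_THRESHOLD for acc in per_exam.values())
    overall = sum(correct.values()) / sum(total.values())
    return per_exam, passed, overall

# Example usage with toy data:
per_exam, passed, overall = score([
    {"exam": "tax_accountant", "gold": "B", "prediction": "B"},
    {"exam": "tax_accountant", "gold": "C", "prediction": "A"},
    {"exam": "lawyer", "gold": "D", "prediction": "D"},
])
print(f"passed {passed} exam(s), overall accuracy {overall:.2%}")
```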
![A table showing whether AI models passed each of the 14 national professional qualification exams in the new KMMLU-Pro benchmark, which evaluates Korean professional knowledge. OpenAI's "o1" model recorded the highest average accuracy at 79.55%, while Anthropic's "Claude 3.7 Sonnet" passed 12 exams, showing the most even performance. [Source: arXiv, screenshot from the paper]](https://site.server489.com/wp-content/uploads/2025/07/news-p.v1.20250717.ae790d364b1042349bf3544242647587_P1-696x443.png)