A recent multilingual benchmark study finds that Polish leads in long-context large language model (LLM) accuracy, achieving 88% accuracy at context lengths up to 64,000 tokens. It outperforms English, which ranks sixth, and Chinese, which lands among the bottom four. The study, presented in a COLM 2025 paper using the OneRuler benchmark, indicates that language structure significantly affects accuracy at longer context lengths. Languages written in Latin script, such as Polish, French, and Spanish, outperform those using logographic and abugida writing systems, with the performance gap widening from 11% at 8,000 tokens to 34% at 128,000 tokens. The study suggests that tokenization efficiency and structural characteristics matter more than dataset volume when handling long documents. Consequently, English dominance in LLM benchmarks may not hold at longer sequence lengths, underscoring the need to evaluate LLM performance across diverse languages.
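To make the tokenization-efficiency point concrete, here is a minimal sketch (not from the study; the tokenizer choice and example sentences are illustrative assumptions) showing how the same sentence can tokenize to different lengths across languages, a difference that compounds over long documents.

```python
# Illustrative sketch: compare token counts for roughly equivalent sentences.
# The tokenizer and sentences below are assumptions for demonstration only.
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Polish": "Szybki brązowy lis przeskakuje nad leniwym psem.",
    "Chinese": "敏捷的棕色狐狸跳过懒惰的狗。",
}

for language, text in samples.items():
    tokens = enc.encode(text)
    # Characters per token is a rough proxy for tokenization efficiency.
    print(f"{language:8s} {len(tokens):3d} tokens "
          f"({len(text) / len(tokens):.2f} chars/token)")
```

A language that needs more tokens to express the same content fills a fixed context window with less information, which is one plausible reason efficiency differences would grow more consequential at 64,000 or 128,000 tokens.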