Tuesday, September 30, 2025

Hugging Face Unveils mmBERT: A Multilingual Encoder Supporting Over 1,800 Languages

Hugging Face has introduced mmBERT, a multilingual encoder trained on over 3 trillion tokens spanning 1,833 languages. Built on the ModernBERT architecture, mmBERT outperforms the long-standing XLM-R baseline on multilingual tasks. Training follows a progressive schedule: it begins with 60 high-resource languages, expands to 110, and finally covers all 1,833, while the masking ratio is annealed from 30% down to 5%. This staged approach improves representation even for low-resource languages such as Faroese and Tigrinya, which showed significant gains during the final training phase, when they were first included. With 110M non-embedding parameters, mmBERT rivals much larger models and supports sequences of up to 8,192 tokens while remaining memory-efficient. A TIES model-merging step combines multiple model variants, improving performance without sacrificing efficiency. Evaluations show that mmBERT consistently surpasses previous multilingual encoders, setting new benchmarks in retrieval, classification, and cross-lingual tasks, and making it a notable tool for multilingual language processing.
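The progressive schedule described above can be sketched in a few lines. The following is an illustrative sketch, not the released training code: the phase boundaries at 60 and 110 languages and the 30% and 5% masking ratios come from the article, while the intermediate 15% ratio and the `mask_tokens` helper are assumptions made for the example.

```python
import random

# Illustrative three-phase schedule: each entry pairs the number of
# languages included with the MLM mask ratio used in that phase.
# The 30% and 5% endpoints are as reported; 15% is an assumed midpoint.
SCHEDULE = [
    (60, 0.30),    # phase 1: 60 high-resource languages, 30% masking
    (110, 0.15),   # phase 2: expanded to 110 languages (assumed ratio)
    (1833, 0.05),  # phase 3: all 1,833 languages, 5% masking
]

def mask_tokens(token_ids, mask_ratio, mask_id=0, seed=None):
    """Randomly replace ~mask_ratio of tokens with mask_id (MLM-style).

    Returns (masked_ids, labels); label -100 marks positions the loss
    should ignore, while masked positions keep the original token as
    the prediction target.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_ratio:
            masked[i] = mask_id   # hide the token from the model
            labels[i] = tok       # model must predict the original
    return masked, labels
```

Lowering the ratio in later phases means the model sees more intact context per example, which is one plausible reason newly added low-resource languages benefit most in the final phase.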
