Tuesday, September 30, 2025

Hugging Face Unveils mmBERT: A Multilingual Encoder Supporting Over 1,800 Languages

Hugging Face has introduced mmBERT, a multilingual encoder trained on over 3 trillion tokens spanning 1,833 languages. Built on the ModernBERT architecture, mmBERT outperforms the long-standing XLM-R baseline on multilingual tasks. Training follows a progressive schedule: it begins with 60 high-resource languages, expands to 110, and finally covers all 1,833, while the masking ratio is annealed from 30% down to 5%. This staged approach improves representation even for low-resource languages such as Faroese and Tigrinya, which showed significant gains during the final training phase, when they were first included. With 110M non-embedding parameters, mmBERT rivals much larger models and supports sequences of up to 8,192 tokens while remaining memory-efficient. A TIES model-merging step combines multiple model variants, improving performance without sacrificing efficiency. Evaluations show that mmBERT consistently surpasses previous multilingual encoders, setting new benchmarks in retrieval, classification, and cross-lingual tasks, and making it a notable tool for multilingual language processing.
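The progressive schedule described above can be sketched in a few lines. The following is an illustrative sketch, not the released training code: the phase boundaries at 60 and 110 languages and the 30% and 5% masking ratios come from the article, while the intermediate 15% ratio and the `mask_tokens` helper are assumptions made for the example.

```python
import random

# Illustrative three-phase schedule: each entry pairs the number of
# languages included with the MLM mask ratio used in that phase.
# The 30% and 5% endpoints are as reported; 15% is an assumed midpoint.
SCHEDULE = [
    (60, 0.30),    # phase 1: 60 high-resource languages, 30% masking
    (110, 0.15),   # phase 2: expanded to 110 languages (assumed ratio)
    (1833, 0.05),  # phase 3: all 1,833 languages, 5% masking
]

def mask_tokens(token_ids, mask_ratio, mask_id=0, seed=None):
    """Randomly replace ~mask_ratio of tokens with mask_id (MLM-style).

    Returns (masked_ids, labels); label -100 marks positions the loss
    should ignore, while masked positions keep the original token as
    the prediction target.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_ratio:
            masked[i] = mask_id   # hide the token from the model
            labels[i] = tok       # model must predict the original
    return masked, labels
```

Lowering the ratio in later phases means the model sees more intact context per example, which is one plausible reason newly added low-resource languages benefit most in the final phase.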
