As global AI adoption accelerates, developers face challenges in getting large language model (LLM) performance to meet the demands of real-world applications, especially voice-based AI. Sarvam AI, a Bengaluru-based generative AI startup, addresses these challenges by building sovereign, multilingual models tailored for India, trained using NVIDIA technologies including the NeMo framework.

A collaboration with NVIDIA delivered a 4x speedup for Sarvam's 30B-parameter model through combined hardware and software optimizations. Key among these were a mixture-of-experts (MoE) architecture and advanced kernel strategies that cut latency enough to meet service-level agreements (SLAs) on token response times. The gains carry over to NVIDIA's Blackwell architecture, where the stack achieves a 2.8x throughput improvement.

By integrating model design, kernel engineering, and scheduling, Sarvam AI and NVIDIA created an efficient inference stack that serves as a blueprint for scalable, sovereign AI applications. For developers, NVIDIA's Nemotron offers resources for building localized AI solutions.
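The article does not detail Sarvam's MoE implementation, but the core idea behind the latency win can be illustrated with a minimal sketch: a router sends each token to only its top-k experts, so the compute per token scales with k rather than with the total expert (and parameter) count. All names, shapes, and the NumPy implementation below are illustrative assumptions, not Sarvam's or NVIDIA's code.

```python
import numpy as np

def moe_forward(x, gate_w, expert_ws, top_k=2):
    """Hypothetical mixture-of-experts layer: route each token to its
    top_k experts and blend their outputs by softmax router weights.
    Only top_k of the experts run per token, which is what lets a
    large total parameter count serve at low latency."""
    logits = x @ gate_w                               # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the chosen experts
    sel = np.take_along_axis(logits, top, axis=-1)    # softmax over selected experts only
    weights = np.exp(sel - sel.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                       # per-token dispatch; a fused kernel
        for k in range(top_k):                        # would batch this on the GPU
            e = top[t, k]
            out[t] += weights[t, k] * (x[t] @ expert_ws[e])
    return out

rng = np.random.default_rng(0)
d, n_experts, tokens = 8, 4, 3
x = rng.standard_normal((tokens, d))
gate_w = rng.standard_normal((d, n_experts))          # router projection (assumed shape)
expert_ws = rng.standard_normal((n_experts, d, d))    # one weight matrix per expert
y = moe_forward(x, gate_w, expert_ws)
print(y.shape)  # → (3, 8)
```

With 4 experts and top_k=2, each token touches half the expert parameters; production MoE models push this ratio much further, and the kernel-level work mentioned in the article (fused dispatch and scheduling) is what makes the sparse routing fast in practice.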
