Tuesday, August 12, 2025

Apple Researchers Accelerate Token Prediction in LLMs by Up to 5x

Apple’s latest research introduces a technique for accelerating large language model (LLM) responses while maintaining output quality. Traditionally, LLMs generate text one token at a time, which can be slow due to the autoregressive nature of the process. In the study, titled “Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential,” Apple’s team shows that LLMs, although trained to predict one token at a time, already encode useful information about upcoming tokens. They developed a multi-token prediction (MTP) framework that generates multiple tokens simultaneously by appending special “mask” tokens to the prompt; the model fills in these masks in a single pass, and the drafted tokens are then verified to preserve quality. This approach achieved speedups of 2–3x on general tasks and up to 5x in more predictable domains such as coding and math, without degrading generation quality. The method relies on gated LoRA adaptation, which adds the multi-token capability while leaving the base model’s original next-token outputs unchanged. For in-depth insights, access the full paper on arXiv.
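To make the draft-then-verify idea concrete, here is a minimal toy sketch in plain Python. It is an illustrative assumption, not Apple's implementation: the "model" is a stub that completes a fixed phrase, `MASK`, `mtp_model`, and `generate` are invented names, and real MTP uses learned mask-token predictions inside a transformer with gated LoRA.

```python
# Toy sketch of mask-token multi-token prediction with verification.
# Everything here is illustrative: a stub model over a fixed phrase
# stands in for a real LLM and its MTP head.

MASK = "<mask>"

def base_model(tokens):
    """Stand-in autoregressive model: predicts only the next token.
    Here it simply continues a fixed phrase, one token at a time."""
    phrase = ["the", "quick", "brown", "fox", "jumps", "<eos>"]
    n = len([t for t in tokens if t != MASK])
    return phrase[n] if n < len(phrase) else "<eos>"

def mtp_model(tokens, k):
    """Stand-in MTP pass: given the prompt plus k mask tokens, draft
    k future tokens at once (emulated by greedy rollout of the stub)."""
    ctx = [t for t in tokens if t != MASK]
    draft = []
    for _ in range(k):
        nxt = base_model(ctx)
        draft.append(nxt)
        ctx.append(nxt)
    return draft

def generate(prompt, k=3, max_len=10):
    """Append k masks, draft k tokens in one pass, then verify each
    drafted token against the base model, accepting the longest
    agreeing prefix -- so quality matches one-token-at-a-time decoding."""
    out = list(prompt)
    while len(out) < max_len and (not out or out[-1] != "<eos>"):
        draft = mtp_model(out + [MASK] * k, k)
        for tok in draft:
            if tok != base_model(out):  # verification step: reject on mismatch
                break
            out.append(tok)
            if tok == "<eos>" or len(out) >= max_len:
                break
    return out

print(generate([]))  # -> ['the', 'quick', 'brown', 'fox', 'jumps', '<eos>']
```

Because the stub's drafts always agree with the stub's own next-token choice, every draft is accepted here; in the real setting the MTP head can diverge from the base model, and the verification step is what guarantees the speedup never costs output quality.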
