
Bridging the Divide: Enhancing Text and Speech Comprehension in LLMs


Large Language Models (LLMs) can adapt their text capabilities to process speech inputs; however, they often perform worse on language understanding tasks than their text-based counterparts, a phenomenon known as the text-speech understanding gap. This gap arises from two primary factors: loss of text skills during adaptation and cross-modal misalignment between speech and text. Current strategies to close the gap rely either on costly large-scale speech synthesis or on proprietary datasets that hinder reproducibility. To provide a more data-efficient solution, we propose SALAD—Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation. SALAD combines cross-modal distillation with targeted synthetic data to improve alignment while minimizing forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance across diverse benchmarks in knowledge, language understanding, and reasoning, while using significantly less speech data, drawn entirely from public corpora.
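The abstract does not spell out SALAD's exact training objective, but cross-modal distillation between a text model and a speech-adapted model is commonly implemented as a KL divergence between their next-token distributions: the frozen text LLM reads the transcript (teacher) and the speech-adapted LLM reads the matching audio (student). The sketch below illustrates that general idea in pure Python; all names and the temperature value are illustrative, not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over next-token distributions.

    teacher_logits: produced by the frozen text LLM from the transcript.
    student_logits: produced by the speech-adapted LLM from the audio.
    Minimizing this pulls the speech pathway's predictions toward the
    text pathway's, which is the alignment idea behind cross-modal
    distillation (a generic sketch, not SALAD's exact objective).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy vocabulary of 4 tokens: identical logits give zero loss,
# diverging logits give a positive loss.
loss_same = distillation_loss([2.0, 1.0, 0.5, -1.0], [2.0, 1.0, 0.5, -1.0])
loss_diff = distillation_loss([2.0, 1.0, 0.5, -1.0], [0.0, 2.0, 1.0, 0.0])
```

In practice the loss is computed per time step over the full vocabulary and averaged over the sequence, often alongside a standard cross-entropy term so that the adapted model retains its original text skills while aligning the speech modality.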
