
Apple Develops LLM for Enhanced Long-Form Video Comprehension

Apple trained an LLM to efficiently understand long-form video

Apple researchers have unveiled SlowFast-LLaVA-1.5 (SF-LLaVA-1.5), an upgraded version of their SlowFast-LLaVA model that excels at long-form video analysis. The model pairs video perception with pre-trained large language models (LLMs) and processes video efficiently by focusing on the most informative frames while preserving the LLM's language abilities. Traditional methods feed every frame to the model, quickly overwhelming the LLM's context window. Apple's approach instead uses a dual-stream framework: a slow stream examines a small number of frames in high spatial detail, while a fast stream scans many more frames coarsely to capture motion.

As a result, SF-LLaVA-1.5 outperforms larger models across video tasks and remains strong on image tasks such as OCR and reasoning, overcoming limitations of earlier models. Its main constraint is a maximum input of 128 frames, which can occasionally miss crucial frames in very long videos; at the same time, the fact that it was trained on public datasets underscores its versatility. Now available on GitHub and Hugging Face, the model sets state-of-the-art benchmarks in long-form video and image comprehension.
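To make the dual-stream idea concrete, here is a minimal sketch in NumPy. The function names, tensor shapes, frame counts, and pooling factor are hypothetical illustrations rather than Apple's actual implementation: the point is only that the slow stream keeps a few frames at full spatial resolution while the fast stream keeps many frames with aggressively pooled features, so the combined token count stays within an LLM's context window.

```python
# Hypothetical sketch of a SlowFast-style token budget; not SF-LLaVA-1.5's real code.
import numpy as np

def uniform_sample(frames: np.ndarray, n: int) -> np.ndarray:
    """Pick n frames evenly spaced across the clip."""
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    return frames[idx]

def spatial_pool(tokens: np.ndarray, factor: int) -> np.ndarray:
    """Average-pool a (frames, H, W, dim) token grid by `factor` per side."""
    f, h, w, d = tokens.shape
    tokens = tokens[:, : h - h % factor, : w - w % factor, :]
    tokens = tokens.reshape(f, h // factor, factor, w // factor, factor, d)
    return tokens.mean(axis=(2, 4))

def slowfast_tokens(vision_tokens: np.ndarray,
                    slow_frames: int = 8,
                    fast_frames: int = 64,
                    fast_pool: int = 4) -> np.ndarray:
    """Combine a detailed slow stream with a heavily pooled fast stream.

    vision_tokens: (num_frames, H, W, dim) per-frame features from a
    frozen vision encoder (shape assumed for this sketch).
    """
    # Slow stream: few frames, full spatial detail.
    slow = uniform_sample(vision_tokens, slow_frames)
    # Fast stream: many frames, coarse spatial detail (captures motion cheaply).
    fast = spatial_pool(uniform_sample(vision_tokens, fast_frames), fast_pool)
    flat = lambda t: t.reshape(-1, t.shape[-1])  # flatten to a token sequence
    # The concatenated sequence is what would be handed to the LLM.
    return np.concatenate([flat(slow), flat(fast)], axis=0)

# Example: 128 sampled frames of 24x24 patch features with dim 512.
tokens = slowfast_tokens(np.random.rand(128, 24, 24, 512))
print(tokens.shape)  # ~6.9k tokens instead of 128 * 24 * 24 = ~73.7k
```

With these illustrative numbers, the slow stream contributes 8 × 576 detailed tokens and the fast stream 64 × 36 pooled tokens, roughly a tenth of what feeding all 128 frames at full resolution would cost.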
