Multimodal Large Language Models (MLLMs) struggle to use external vision tools effectively, often failing to translate the tools' pixel-level outputs into actionable insights. The root cause is a mismatch between dense visual data and the language-centric design of LLMs, which limits what the models can actually perceive. The researchers argue that the key challenge lies not in better tools or larger models, but in how the tool outputs are represented.

Their solution, described in a recent arXiv paper, introduces Perception Programs (P2), which reformulate raw visual outputs into structured, language-centric summaries the MLLM can reason over directly. On the BLINK benchmark, P2 improves performance across six perception tasks by an average of 22%, setting new state-of-the-art results. Paired with GPT-5 Mini, it substantially boosts accuracy, and it also works with smaller models such as InternVL3.5-4B without additional training, outperforming traditional reinforcement-learning approaches.
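The central idea, converting a dense tool output into text an LLM can reason over, can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's actual Perception Program implementation: the function name `summarize_depth`, the region format, and the summary wording are all hypothetical, standing in for whatever structured summarization P2 actually performs. It distills a depth map from a hypothetical vision tool into a few comparative, language-centric statements suitable for an MLLM prompt.

```python
# A minimal sketch of the "structured, language-centric summary" idea:
# instead of handing an MLLM a raw depth map, distill it into short
# comparative statements the language model can reason over.
# All names here (summarize_depth, the region format, the prompt text)
# are illustrative assumptions, not the paper's actual API.

import numpy as np

def summarize_depth(depth: np.ndarray, regions: dict) -> str:
    """Turn a dense depth map into a compact textual summary.

    depth   : H x W array of per-pixel depth estimates from a vision tool.
    regions : named image regions (e.g., from a detector), given here as
              (row_slice, col_slice) pairs for simplicity.
    """
    lines = []
    means = {}
    for name, (rows, cols) in regions.items():
        means[name] = float(depth[rows, cols].mean())
        lines.append(f"- {name}: mean depth {means[name]:.2f} m")
    # Add a relational statement: the kind of language-centric fact
    # an LLM can use directly, unlike a raw pixel grid.
    nearest = min(means, key=means.get)
    farthest = max(means, key=means.get)
    lines.append(f"- '{nearest}' is closer to the camera than '{farthest}'.")
    return "Depth summary:\n" + "\n".join(lines)

# Toy example: a fake 100x100 depth map with two detected objects.
depth_map = np.full((100, 100), 10.0)
depth_map[10:40, 10:40] = 2.0   # a nearby object
depth_map[60:90, 60:90] = 8.0   # a farther object

regions = {"person": (slice(10, 40), slice(10, 40)),
           "car": (slice(60, 90), slice(60, 90))}

print(summarize_depth(depth_map, regions))
# The resulting text can be appended to the MLLM's prompt in place of
# the raw depth map.
```

The design point is the final relational statement: "person is closer than car" is a fact a language model can consume directly, whereas the dense pixel array it was derived from is not.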