Large language models (LLMs) excel at in-context learning (ICL), which lets them adapt to varied tasks efficiently. Our research explores the application of ICL in computer vision, specifically repurposing off-the-shelf Stable Diffusion models for visual in-context learning (V-ICL). We introduce a novel in-place attention re-computation within Stable Diffusion’s self-attention layers that enhances the interaction between the query and example prompts without requiring additional fine-tuning. This methodology effectively adapts the model to six distinct tasks, including foreground segmentation and edge detection. Notably, our approach improves the mean intersection over union (mIoU) for foreground segmentation on the Pascal-5i dataset, surpassing recent methods such as Visual Prompting and IMProv by 8.9% and 3.2%, respectively. We further demonstrate that ensembling multiple prompts yields additional gains, showcasing the robustness of repurposed Stable Diffusion models across visual tasks and extending in-context learning, a paradigm popularized in natural language processing, to computer vision.
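
The abstract only sketches the mechanism, so the snippet below is a minimal, hypothetical illustration of the idea rather than the authors' implementation: inside a self-attention layer, attention is re-computed in place over the concatenated example-and-query token sequence, so query tokens can draw on the example prompt using the layer's existing projections, with no new parameters or fine-tuning. The function signature, the names `to_q`/`to_k`/`to_v`, and the token layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def in_place_attention_recompute(hidden_states: torch.Tensor,
                                 n_example_tokens: int,
                                 to_q: nn.Linear,
                                 to_k: nn.Linear,
                                 to_v: nn.Linear) -> torch.Tensor:
    """Re-run self-attention over the concatenated [example | query] tokens.

    hidden_states: (batch, seq_len, dim) activations entering a self-attention
        layer, where the first n_example_tokens positions hold the example
        prompt and the remaining positions hold the query.
    to_q, to_k, to_v: the layer's existing linear projections (reused, not new).
    """
    q = to_q(hidden_states)
    k = to_k(hidden_states)
    v = to_v(hidden_states)

    scale = q.shape[-1] ** -0.5
    # Scaled dot-product attention over the whole sequence lets the query
    # tokens attend to the example-prompt tokens.
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    out = attn @ v

    # Overwrite only the query positions in place; the example tokens keep
    # their original activations and remain a fixed reference.
    result = hidden_states.clone()
    result[:, n_example_tokens:] = out[:, n_example_tokens:]
    return result


if __name__ == "__main__":
    # Toy shapes standing in for one self-attention layer's activations.
    batch, n_example, n_query, dim = 1, 64, 64, 320
    hidden = torch.randn(batch, n_example + n_query, dim)
    proj_q, proj_k, proj_v = (nn.Linear(dim, dim, bias=False) for _ in range(3))
    updated = in_place_attention_recompute(hidden, n_example, proj_q, proj_k, proj_v)
    print(updated.shape)  # torch.Size([1, 128, 320])
```

This single-head sketch omits the multi-head reshaping and the choice of which U-Net layers to hook, which would matter in a real Stable Diffusion pipeline; it is meant only to show how a re-computation can reuse a layer's existing weights.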