In a paper accepted at the Learning from Time Series for Health workshop at NeurIPS 2025, we explore how large language models (LLMs) can perform late fusion for activity classification from audio and motion time series data. We use a curated subset of the Ego4D dataset covering diverse activities, including household tasks and sports. Our evaluations show that the LLMs achieve 12-class zero- and one-shot classification F1-scores well above chance, without any task-specific training. This zero-shot approach enables efficient LLM-based fusion of the outputs of modality-specific models, making multimodal temporal applications feasible even when aligned training data is sparse. Furthermore, LLM-based fusion lets developers avoid the additional memory and computation that application-specific multimodal models typically require. This research underscores the potential of LLMs for multimodal data integration and activity recognition, with significant implications for health and other downstream applications.
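As a rough illustration of what such late fusion can look like in practice, the sketch below serializes per-class probabilities from two unimodal models into a text prompt that an LLM could answer zero-shot. The activity list, the example scores, and the `query_llm` stub are illustrative assumptions for this sketch, not the paper's actual models, prompts, or pipeline.

```python
# Minimal sketch of LLM-based late fusion for zero-shot activity
# classification. The unimodal scores, class list, and query_llm stub are
# assumptions for illustration; they do not reproduce the paper's setup.

ACTIVITIES = ["cooking", "cleaning", "playing basketball", "cycling"]  # assumed subset


def format_fusion_prompt(audio_scores: dict[str, float],
                         motion_scores: dict[str, float]) -> str:
    """Serialize per-modality class probabilities into a zero-shot prompt."""
    lines = [
        "Two sensor models observed the same activity.",
        "Audio model class probabilities:",
    ]
    lines += [f"  {name}: {p:.2f}" for name, p in audio_scores.items()]
    lines.append("Motion (IMU) model class probabilities:")
    lines += [f"  {name}: {p:.2f}" for name, p in motion_scores.items()]
    lines.append(
        f"Combining both modalities, which one activity from {ACTIVITIES} "
        "is most likely? Answer with the activity name only."
    )
    return "\n".join(lines)


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; wire this to your provider's chat API."""
    raise NotImplementedError("replace with an actual LLM endpoint")


if __name__ == "__main__":
    # Example: the modalities disagree, and the LLM is asked to reconcile them.
    audio = {"cooking": 0.55, "cleaning": 0.30, "cycling": 0.15}
    motion = {"cooking": 0.20, "cleaning": 0.70, "cycling": 0.10}
    print(format_fusion_prompt(audio, motion))
```

Because the fusion step consumes only the lightweight text outputs of the unimodal models, no multimodal model needs to be trained or held in memory, which is the efficiency argument the paper makes.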