Wednesday, November 12, 2025

Assessing the Effectiveness of General-Purpose Large Language Models in Detecting Human Facial Emotions

The study, which was granted an IRB exemption by Beth Israel Deaconess Medical Center, used the NimStim dataset (672 facial expression images from 43 multiracial actors) to evaluate facial emotion recognition by large language models (LLMs). The dataset covers eight emotional expressions posed by actors of diverse racial backgrounds, and its psychometric evaluations show strong reliability and agreement among observers. Two LLMs, OpenAI GPT-4o and Google Gemini 2.0, processed the images in a standardized manner. The analysis computed Cohen’s kappa to compare model performance against the established NimStim metrics, along with confusion matrices used to derive accuracy, precision, recall, and F1 scores. The study confirmed that the NimStim dataset is not publicly accessible, which preserves its integrity as a research benchmark. By comparing model outputs with the established kappa values from NimStim, the research benchmarked LLM performance: overlapping confidence intervals indicated comparable agreement for an emotion category, while non-overlapping intervals indicated different agreement levels.
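To make the metrics named above concrete, the following is a minimal sketch of how Cohen's kappa and per-class precision, recall, and F1 can be computed from a confusion matrix. It is not the study's code: the function names, the three-emotion label set, and the six toy labels are all hypothetical and chosen only for illustration, and the confidence-interval comparison described in the summary is not covered.

```python
# Illustrative sketch only: toy labels, not NimStim data or study results.

def confusion_matrix(truth, predicted, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(truth, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix


def cohens_kappa(matrix):
    """Chance-corrected agreement: (p_observed - p_expected) / (1 - p_expected)."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(labels))) - tp
        fn = sum(matrix[i]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical example: three emotions, six images.
labels = ["happy", "sad", "angry"]
truth = ["happy", "happy", "sad", "sad", "angry", "angry"]
predicted = ["happy", "happy", "sad", "angry", "angry", "angry"]

matrix = confusion_matrix(truth, predicted, labels)
print("kappa:", round(cohens_kappa(matrix), 3))
for label, (p, r, f1) in per_class_scores(matrix, labels).items():
    print(label, round(p, 3), round(r, 3), round(f1, 3))
```

In this toy case one "sad" image is mislabeled as "angry", so raw accuracy is 5/6 while kappa is lower because it discounts the agreement expected by chance.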
