Thursday, July 3, 2025

Evaluating the Output Quality of Large Language Models in Maternal Health


A survey of 47 gynecology and obstetrics specialists evaluated the quality of AI-generated responses on maternal health. The respondents had a median age of 50, and 85% were female; on average they had 19 years of clinical experience and handled approximately 110 assisted pregnancies per month. Most were from the US, Brazil, and Pakistan. GPT-3.5 and GPT-4 scored higher on both non-technical and technical assessments than Meditron-70b and a custom GPT-3.5 model. Inter-rater reliability was excellent, particularly for English and Portuguese scores. Although clarity and content quality were rated highly, critiques centered on incomplete information and outdated terminology. Readability analyses showed that the AI-generated responses required a college reading level to understand. Gender bias was also noted in how the models referred to healthcare professionals. Overall, the findings point to the need to improve AI responses in healthcare, taking into account varying demographics and language nuances.
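
The summary does not say which readability metric the study used. As an illustration only, the sketch below assumes the widely used Flesch-Kincaid grade level, which estimates the US school grade needed to read a text from its average sentence length and average syllables per word. The function names and the vowel-group syllable heuristic are hypothetical choices made for this example, not the study's method.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels, minimum 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


if __name__ == "__main__":
    plain = "The cat sat."
    clinical = "Preeclampsia necessitates immediate obstetric evaluation."
    # Longer words and sentences push the estimated grade level up.
    print(round(flesch_kincaid_grade(plain), 2))
    print(round(flesch_kincaid_grade(clinical), 2))
```

A score of roughly 13 or above corresponds to college-level text. The syllable heuristic is crude, so scores from this sketch are approximate and best used for comparing texts rather than as exact grade levels.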
