This study develops an evaluation strategy to compare the performance of Large Language Models (LLMs) with Exomiser, a specialized tool for genetic differential diagnosis. LLMs produce free-text responses, while Exomiser outputs ranked lists encoded with OMIM and Orphanet codes. Focusing on phenotypic findings, we normalized diagnoses by treating clinically identical diseases as equivalent for ranking purposes. We evaluated LLMs on 5,213 computational case reports formatted as phenopackets, covering a range of genetic syndromes annotated with structured Human Phenotype Ontology (HPO) terms. We automated the generation of diagnostic prompts with our software, phenopacket2prompt, available on GitHub. Exomiser generated diagnoses in phenotype-only mode. Our evaluation strategy used Mondo Disease Ontology terms to score LLM diagnoses against gold standards from curated publications, improving comparability. Performance was analyzed by clustering cases according to organ specificity and the number of observed HPO terms, characterizing LLM capabilities in differential diagnosis within human genetics.
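The ranking evaluation described above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration, not the authors' actual pipeline: it assumes diagnoses have already been mapped to Mondo identifiers, and that a lookup table supplies the equivalence classes of clinically identical diseases. The function name, identifiers, and data are all invented for illustration.

```python
def top_k_accuracy(ranked_lists, gold_ids, equivalents, k=1):
    """Fraction of cases whose gold diagnosis (or a clinically
    equivalent disease) appears in the top k of the ranked differential.

    ranked_lists: list of ranked Mondo IDs, one list per case
    gold_ids:     gold-standard Mondo ID for each case
    equivalents:  maps a gold ID to its set of equivalent Mondo IDs
    """
    hits = 0
    for ranked, gold in zip(ranked_lists, gold_ids):
        accepted = equivalents.get(gold, {gold})  # equivalence class
        if any(d in accepted for d in ranked[:k]):
            hits += 1
    return hits / len(ranked_lists) if ranked_lists else 0.0


# Hypothetical example with two cases and made-up Mondo-style IDs:
# the first model answer is an equivalent disease (hit at rank 1),
# the second lists the gold diagnosis only at rank 2.
equiv = {"MONDO:0000001": {"MONDO:0000001", "MONDO:0000002"}}
ranked = [["MONDO:0000002", "MONDO:0000009"],
          ["MONDO:0000008", "MONDO:0000001"]]
gold = ["MONDO:0000001", "MONDO:0000001"]
print(top_k_accuracy(ranked, gold, equiv, k=1))  # → 0.5
print(top_k_accuracy(ranked, gold, equiv, k=2))  # → 1.0
```

Treating equivalent diseases as a single class before ranking is what makes free-text LLM output and Exomiser's coded output comparable under one metric.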
Comprehensive Benchmarking Reveals That Large Language Models Fall Short of Traditional Tools in Diagnosing Rare Diseases