
Comprehensive Audit of the LoCoMo Benchmark: dial481/locomo-audit on GitHub


🔍 Independent Audit of LoCoMo Benchmark: Key Findings Unveiled!

Dive into our comprehensive audit of the LoCoMo (Long Conversational Memory) benchmark and the EverMemOS evaluation framework. We highlight critical issues affecting the reliability and cost of AI memory evaluations.

Key Findings:

  • Ground Truth Errors: 6.4% of questions (99 of 1,540) have incorrect golden answers, which caps the accuracy any system can achieve.
  • Token Cost Misrepresentation: claims suggest an average of 2,298 tokens per question, but actual measurements with GPT-4.1-mini show 6,669 (2.9x higher).
  • Judge Leniency: 62.81% of intentionally vague answers were wrongly accepted by LLM judges.
  • Reproducibility Failures: a third-party evaluation reproduced a score of only 38.38%, versus the claimed 92.32%.
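The headline ratios above are easy to sanity-check yourself. A minimal Python sketch, using only the figures quoted in this post (the `pct` helper is illustrative, not part of the audit's codebase):

```python
# Sanity-check the audit's headline numbers from the figures quoted above.

def pct(part: float, whole: float) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)

# Ground-truth errors: 99 of 1,540 questions flagged with wrong golden answers.
error_rate = pct(99, 1540)            # 6.4

# Token cost: measured vs. claimed average tokens per question.
token_ratio = round(6669 / 2298, 1)   # 2.9

# Reproducibility gap: claimed score minus independently reproduced score.
repro_gap = round(92.32 - 38.38, 2)   # 53.94 percentage points

print(f"ground-truth error rate: {error_rate}%")
print(f"token cost inflation:    {token_ratio}x")
print(f"reproducibility gap:     {repro_gap} points")
```

Running this confirms the 6.4% error rate and the 2.9x token-cost multiple match the raw counts reported in the audit.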

Discover how these insights can reshape your understanding of AI evaluation standards.

👉 Explore the audit and share your thoughts! Your feedback matters!


