🔍 Independent Audit of LoCoMo Benchmark: Key Findings Unveiled!
Dive into our comprehensive audit of the LoCoMo long-term conversational memory benchmark and the EverMemOS evaluation framework. The audit surfaces issues that undermine both the benchmark's reliability and the performance claims built on top of it.
Key Findings:
- Ground Truth Errors: 6.4% of questions (99 of 1,540) have incorrect golden answers, which caps the score any system can legitimately achieve.
- Token Cost Misrepresentation: The reported average of 2,298 tokens per question understates the real cost; our measurements with GPT-4.1-mini show 6,669 tokens, 2.9x higher (see the token-count sketch after this list).
- Judge Leniency: LLM judges wrongly accepted 62.81% of the deliberately vague answers we submitted (see the judge-probe sketch below).
- Reproducibility Failures: Independent third-party evaluations reproduce a score of only 38.38%, versus the claimed 92.32%.
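
For readers who want to check the token math themselves, here is a minimal sketch of per-question prompt-token counting. The record layout and field names are hypothetical, and we assume GPT-4.1-mini shares the o200k_base encoding with GPT-4o; the audit's actual harness may differ.

```python
# Minimal sketch of the per-question token accounting behind the
# 2,298-vs-6,669 discrepancy. Field names here are hypothetical.
import tiktoken

# Assumption: GPT-4.1-mini uses the o200k_base encoding (same as GPT-4o).
enc = tiktoken.get_encoding("o200k_base")

def avg_prompt_tokens(records: list[dict]) -> float:
    """Average tokens actually sent per question: retrieved
    memory/context plus the question itself, not the question alone."""
    totals = [
        len(enc.encode(r["context"] + "\n" + r["question"]))
        for r in records
    ]
    return sum(totals) / len(totals)

# A reported average can undercount badly if it omits the retrieved context.
records = [{"context": "retrieved memory ...", "question": "Where did Alice move?"}]
print(f"avg prompt tokens: {avg_prompt_tokens(records):.0f}")
```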
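
The judge-leniency probe reduces to one measurement: feed the judge answers that are deliberately too vague to count as correct, and record how many it accepts. The `judge` callable below is a stand-in for whatever LLM-as-judge prompt the harness uses; this is an illustrative sketch, not the audit's actual code.

```python
from typing import Callable

def leniency_rate(judge: Callable[[str, str, str], bool],
                  probes: list[dict]) -> float:
    """Fraction of deliberately vague answers the judge accepts.
    judge(question, golden, candidate) returns True if judged correct."""
    accepted = sum(judge(p["question"], p["golden"], p["vague"]) for p in probes)
    return accepted / len(probes)

# Example probe: a non-answer that a strict judge should reject.
probes = [{
    "question": "Where did Alice move in May?",
    "golden": "Seattle",
    "vague": "Somewhere on the West Coast, I believe.",
}]
# Dummy always-accept judge -> rate 1.0, i.e., maximal leniency.
print(leniency_rate(lambda q, g, c: True, probes))
```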
These findings matter for anyone relying on LoCoMo scores to compare long-term memory systems.
👉 Explore the audit and share your thoughts! Your feedback matters!
