Andrej Karpathy, an AI researcher and founder of Eureka Labs, recently introduced the “LLM-Council”, an experiment in which multiple large language models (LLMs) answer a user's query and then anonymously rank each other's answers. Remarkably, OpenAI’s GPT-5.1 frequently emerged as the highest-rated model, contradicting earlier benchmarks that favored Google’s Gemini 3.0. Karpathy noted that LLMs often acknowledge superior responses from their peers, which makes the cross-evaluations revealing. The experiment follows a three-step process: the user’s query is sent to every model on the council, each model produces an answer and then anonymously ranks the others’ responses, and a “chairman model” consolidates the answers and rankings into a single final response.

While acknowledging the rankings are subjective, Karpathy said they did not fully match his own assessment: he finds GPT-5.1 verbose compared to the more concise Gemini 3.0. Notably, Vasuman M of Varick AI Agents reported similar outcomes in his own experiments, with GPT-5.1 consistently identified as the top performer, and other models even revising their answers after seeing GPT-5.1’s output.
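The three-step council flow described above can be sketched in a few lines of Python. This is a hedged illustration, not code from Karpathy's actual LLM-Council project: the model stubs, the `run_council` function, and the seeded shuffle that stands in for each model's ranking judgment are all hypothetical placeholders for real LLM API calls.

```python
import random

# Hypothetical stand-ins for real LLM API calls (the actual project
# dispatches to providers such as OpenAI and Google).
MODELS = {
    "model_a": lambda q: f"model_a answer to: {q}",
    "model_b": lambda q: f"model_b answer to: {q}",
    "model_c": lambda q: f"model_c answer to: {q}",
}

def run_council(query: str, chairman: str = "model_a", seed: int = 0) -> str:
    # Step 1: send the user query to every council member.
    answers = {name: fn(query) for name, fn in MODELS.items()}

    # Step 2: anonymize answers so a ranker cannot favor a known peer,
    # then have each member rank the others' responses. A real member
    # would be prompted to judge quality; a seeded shuffle stands in.
    labels = {f"Response {i + 1}": name for i, name in enumerate(answers)}
    rng = random.Random(seed)
    scores = {name: 0 for name in answers}
    for ranker in MODELS:
        order = [lbl for lbl, name in labels.items() if name != ranker]
        rng.shuffle(order)  # placeholder for the model's actual judgment
        for rank, label in enumerate(order):
            scores[labels[label]] += len(order) - rank  # higher rank, more points

    # Step 3: the chairman consolidates the rankings into one final answer.
    best = max(scores, key=scores.get)
    return f"[{chairman} summarizing] top-ranked {best}: {answers[best]}"

print(run_council("Why is the sky blue?"))
```

In the real system each member both answers and judges, and self-ranking is excluded by the anonymized labels; the chairman's synthesis step is where a model like GPT-5.1 can end up shaping the final response even when another model authored parts of it.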