🤝 Stop trusting LLM benchmarks
I've lost faith in AI benchmarks after the Llama 4 scandal, where Meta's "experimental" version outranked their actual release by gaming the system. I now believe we must create our own benchmarks for our specific needs.