🤝 Stop trusting LLM benchmarks
I've lost faith in AI benchmarks after the Llama 4 scandal, where Meta's "experimental" version outranked their actual release by gaming the system. I now believe we must create our own benchmarks for our specific needs.