Dear curious mind,
This issue explains why the era of trusted benchmarks is over. It's time we face this reality and adapt our approach to AI evaluation. Let's dive in.
In this issue:
💡 Shared Insight
Behind the Numbers: The Llama 4 Arena Dilemma
📰 AI Update
How DeepCoder Matches o3-mini with Just 14B Parameters
Cogito: When Models Teach Themselves to Be Smarter
💡 Shared Insight
Behind the Numbers: The Llama 4 Arena Dilemma
The AI world has been shaken by a controversy surrounding Meta's Llama 4 release. Last week, I shared that Meta had released the first two models from their Llama 4 series. Since then, a story has unfolded that raises serious questions about how we evaluate LLMs.
Meta's initial announcement proudly claimed that an experimental chat version of their Llama 4 Maverick model achieved an ELO score of 1417 on LMArena - placing it at rank 2 on the leaderboard.
Llama 4 Maverick, a 17 billion active parameter model with 128 experts, is the best multimodal model in its class, beating GPT-4o and Gemini 2.0 Flash across a broad range of widely reported benchmarks, while achieving comparable results to the new DeepSeek v3 on reasoning and coding—at less than half the active parameters. Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena. [source: Meta blog]
This was a significant selling point, suggesting the new model outperformed nearly all competitors. However, once the actual release version was added to the leaderboard, the results told a different story. The publicly released Llama 4 Maverick scored an ELO of only 1273, which corresponds to rank 32 - a dramatic drop from what was initially reported.
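To put that gap in perspective: LMArena's ratings are derived from pairwise human votes, and on the classic Elo scale a 144-point difference translates into a lopsided expected head-to-head result. Here is a quick back-of-the-envelope check using the standard Elo expected-score formula (an illustration, not LMArena's actual ranking code):

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Classic Elo expected score: probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Experimental chat version (1417) vs. the released Llama 4 Maverick (1273):
print(f"{expected_win_rate(1417, 1273):.0%}")  # ~70% expected win rate
```

In other words, the benchmarked variant would be expected to win roughly 70% of direct comparisons against the model you can actually download.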
The team behind LMArena themselves acknowledged this discrepancy on 𝕏, noting there was something odd about the Maverick model's ranking. What emerged was revealing: the experimental version Meta used for benchmarking had been specifically optimized for human judgment - producing longer outputs with more emojis, factors that tend to be favored by the human raters who pick their preferred reply out of two answers shown side-by-side, which is how the benchmark results are created.

Meta's optimization for the benchmark rather than for real-world performance raises questions about the reliability of our evaluation methods. If the same model can rank 2nd or 32nd simply by tweaking output formatting, what does that tell us about the benchmark's validity? For me, trust in this ranking is destroyed.
The community reaction has been harsh and critical:
In my testing it performed worse than Gemma 3 27B in every way, including multimodal. Genuinely astonished how bad it is. [source]
Don't forget, they used the many innovations DeepSeek opened sourced and yet failed miserably! I promise, I just knew it. They went for the size again to remain relevant.
We, the community who can run models locally on a consumer HW who made llama a success, And now, they just went for the size. That was predictable and I knew it.
DeepSeek did us a favor by showing to everyone that the real talent is in the optimization and efficiency. You can have all the compute and data in the world, but if you can't optimize, you won't be relevant. [source]
Remember when Deepseek came out and rumors swirled about how Llama 4 was so disappointing in comparison that they weren't sure to release it or not? Maybe they should've just waited this generation and released Llama 5... [source]

Interestingly, just days before the release, Joelle Pineau, VP AI Research at Meta who led the FAIR team responsible for the Llama models, announced her departure from the company after nearly 8 years. While we can't know if these events are directly connected, the timing is certainly notable.
This controversy exposes a fundamental truth about AI evaluation: even our most trusted benchmarks can be gamed. Traditional benchmarks with static questions are likely already incorporated into training data, and as this incident shows, even comparative human evaluations can be manipulated through formatting tricks rather than actual capability improvements.
So where does this leave us? I believe the only reliable path forward is creating and running your own benchmarks tailored to your specific use cases. I've started exploring this direction, and I am looking forward to sharing more on how you can build evaluation frameworks that actually measure what is relevant for your usage.
For those interested in this field, I recommend following Wolfram Ravenwolf on 𝕏 or LinkedIn. He has been working on and sharing LLM evaluations for some time and has announced that even more is coming soon due to a change in his professional career.
In an era where model releases come with increasingly impressive claims, we must become more sophisticated in how we evaluate them. After all, if we can't trust the benchmarks, we must trust our own judgment. Doing this manually is possible but time-consuming. Only by creating our own benchmarks can we truly judge whether a model is well suited to our specific requirements. What works exceptionally well for one use case might perform poorly for another, making personalized evaluation essential in the rapidly evolving AI landscape.
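To make that concrete, here is a minimal sketch of what such a personal benchmark could look like: a handful of your own prompts with simple pass/fail checks, run against any OpenAI-compatible endpoint (a local Ollama server in this example). The model tag and test cases are placeholders for illustration, not a finished framework.

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API locally; any compatible endpoint works.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# Each case: a prompt plus a simple check that encodes what *you* care about.
CASES = [
    {"prompt": "Reply with exactly the word OK.", "check": lambda r: r.strip() == "OK"},
    {"prompt": "What is 17 * 23? Answer with the number only.", "check": lambda r: "391" in r},
]

def run_benchmark(model: str) -> float:
    """Return the fraction of cases the model passes."""
    passed = 0
    for case in CASES:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        ).choices[0].message.content
        passed += case["check"](reply)
    return passed / len(CASES)

print(f"pass rate: {run_benchmark('your-model-tag'):.0%}")  # placeholder model tag
```

The point is not the code itself but that the checks encode what you actually care about - something no public leaderboard can do for you.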
📰 AI Update
How DeepCoder Matches o3-mini with Just 14B Parameters [Together blog]

This preview of DeepCoder sounds outstanding, offering impressive coding performance that can be run locally. Achieving 60.6% Pass@1 on LiveCodeBench with just 14B parameters and matching the performance of OpenAI's o3-mini is truly remarkable. However, a comparison to Claude 3.7 Sonnet, currently the best coding model, is missing from their analysis and would have been interesting to see. The fully open-source approach with shared datasets, code, and training logs makes this release especially exciting, as it allows others to build on top of it.
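As a quick refresher on the metric: Pass@1 is the fraction of problems a model solves with a single generated attempt. When several samples per problem are drawn, results are commonly computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). The sketch below shows that reference formula, not DeepCoder's actual evaluation code:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c are correct, solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 samples per problem, 10 of them correct -> estimated pass@1
print(round(pass_at_k(16, 10, 1), 3))  # 0.625
```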
Cogito: When Models Teach Themselves to Be Smarter [Cogito blog]


Cogito's models use Iterated Distillation and Amplification (IDA), where the model repeatedly improves itself by finding better solutions with extra compute and then integrating those improvements into its parameters. The released models (3B to 70B) significantly outperform their strong Llama and Qwen starting points across standard benchmarks. On closer inspection, the performance increase from 32B to 70B is small, which shows how strong the Qwen 32B base model is. You can run the models yourself via Ollama.
Disclaimer: This newsletter is written with the aid of AI. I use AI as an assistant to generate and optimize the text. However, the amount of AI used varies depending on the topic and the content. I always curate and edit the text myself to ensure quality and accuracy. The opinions and views expressed in this newsletter are my own and do not necessarily reflect those of the sources or the AI models.