Benchmarks Are Thermometers, Not Report Cards
LLM benchmarks are useful when you treat them like instruments, not trophies. Here is how to read MMLU, Arena, SWE-bench, HELM, and your own evals without turning the leaderboard into a religion.
Here is the little trap hiding inside every LLM benchmark: the number looks like a final grade.
92.1. 1287 Elo. 67% resolved. Very official. Very sortable. Very easy to screenshot and turn into a tiny religion.
But a benchmark score is not a report card. It is closer to a thermometer: useful, sometimes precise, and still deeply dependent on where you put it. A thermometer in the oven tells you one thing. A thermometer under your armpit tells you another. Neither tells you if dinner tastes good.
That is the benchmark problem in one kitchen accident.
What a benchmark actually is
A benchmark is a controlled little world with a scoreboard attached.
You give the model tasks, define a scoring rule, freeze the setup, and compare results. That is valuable because vibes are expensive. Without benchmarks, every model discussion turns into "this one feels smarter" and "that one wrote a better poem for my cousin's resume."
The catch is that the moment you freeze a task, you also shrink reality.
Real work is messy. Prompts are weird. Users ask follow-up questions. Tools fail. Context is stale. The model has to decide when to ask, when to act, when to stop, and when not to confidently eat the spreadsheet.
Benchmarks do not remove that mess. They take a tiny core sample and ask us not to confuse it for the whole mountain.
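The "tasks, scoring rule, frozen setup, compare" loop from this section fits in a few lines. What follows is a minimal sketch, not any real benchmark's harness: `Task`, `exact_match`, `run_benchmark`, the three toy questions, and `toy_model` are all made up for illustration, and a real harness would call an actual model instead of a lookup table.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    """One frozen benchmark item: a prompt and the answer we will accept."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    """Scoring rule: case-insensitive exact match after trimming whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_benchmark(
    model: Callable[[str], str],
    tasks: list[Task],
    score: Callable[[str, str], bool] = exact_match,
) -> float:
    """Return the fraction of tasks the model gets right under the scoring rule."""
    if not tasks:
        raise ValueError("benchmark needs at least one task")
    correct = sum(score(model(task.prompt), task.expected) for task in tasks)
    return correct / len(tasks)


# Hypothetical frozen task set; a real benchmark would have far more items.
TASKS = [
    Task("What is 2 + 2?", "4"),
    Task("Capital of France?", "Paris"),
    Task("Opposite of hot?", "cold"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call: a lookup table that knows two answers."""
    answers = {"What is 2 + 2?": "4", "Capital of France?": "paris"}
    return answers.get(prompt, "I don't know")


if __name__ == "__main__":
    print(f"accuracy: {run_benchmark(toy_model, TASKS):.2f}")
```

Notice how much is decided before the model says a word: which three questions count, and that "paris" is close enough to "Paris". Swap `exact_match` for a stricter rule and the same model gets a different grade.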
The benchmark map
Here is the field guide version.
| Benchmark type | What it is good at | What it misses |
|---|---|---|
| Knowledge exams like MMLU | Broad academic and professional question answering | Real workflows, tool use, freshness, long context |
| Harder exam variants like MMLU-Pro | More reasoning pressure, fewer trivially easy or noisy multiple-choice questions | Still mostly a test-shaped world |
| Capability zoos like BIG-bench | Weird edges, rare skills, surprising failures | Product usefulness and user preference |
| Holistic suites like HELM | Multi-metric comparison: accuracy, robustness, fairness, efficiency | Still incomplete by design |
| Human preference arenas like Chatbot Arena | Which answers people prefer in open-ended chats | Correctness; votes are swayed by taste, crowd bias, prompt mix, and popularity effects |
| Work benchmarks like SWE-bench | Whether an agent can fix real issues in real repos | Harness quality, test coverage, task selection |
| Private evals | Your actual use case, your data, your workflow | Harder to compare publicly |
No single row is "the real benchmark." They are different instruments. Asking which one is best is like asking whether a microscope beats a speedometer. Depends whether you are looking at bacteria or driving into a wall.
MMLU: the old school exam
MMLU became famous because it did something simple and useful: test broad multitask knowledge across 57 subjects, from law and history to computer science and math.
That made sense for its moment. Early LLMs were still proving they could hold a lot of world knowledge in one head. MMLU gave the field a shared ruler.
But school exams have school-exam problems. They reward recognizing the right answer in a fixed format. They can saturate. They can contain noisy questions. They can leak into training data. And they do not tell you whether the model can run a meeting, debug a repo, follow a style guide, or notice that the user asked the wrong question.
MMLU-Pro is the natural sequel: harder questions, more answer choices, more reasoning pressure, less trivial noise. The useful lesson is not "MMLU is bad." The lesson is: when models start acing the worksheet, make a better worksheet.
Arena: the popularity contest that is still useful
Chatbot Arena is the opposite vibe.
Instead of asking "did the model pick option C?", it asks humans to compare two anonymous answers and vote. That is valuable because a lot of LLM use is open-ended. There is no answer key for "explain this better" or "which draft would I rather send?"
But human preference is not truth. It is preference.
People reward confidence, polish, length, friendliness, speed, and sometimes just the answer that sounds less annoying after lunch. Arena is a useful taste test. It is not a lab result.
The friend-at-coffee translation: Arena tells you which model wins more bar bets. It does not tell you which model you should trust with payroll.
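To see how a pile of "I liked this one better" votes becomes a number like 1287, here is a minimal sketch of the textbook Elo update applied to pairwise votes. It is an illustration only: the actual Arena leaderboard may fit its ratings with a different statistical method, and the model names, the starting rating of 1000, and the K-factor of 32 are assumptions picked for the example. Ties are ignored to keep it short.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def rate(votes, start: float = 1000.0, k: float = 32.0) -> dict:
    """Replay (model_a, model_b, winner) votes in order and return final ratings."""
    ratings = defaultdict(lambda: start)
    for a, b, winner in votes:
        if winner not in (a, b):
            raise ValueError(f"winner {winner!r} is not one of {a!r}, {b!r}")
        expected_a = expected_score(ratings[a], ratings[b])
        actual_a = 1.0 if winner == a else 0.0
        # An upset (low expected score, actual win) moves ratings the most.
        delta = k * (actual_a - expected_a)
        ratings[a] += delta
        ratings[b] -= delta
    return dict(ratings)


# Hypothetical votes: (first model, second model, which one the human preferred).
VOTES = [
    ("model-a", "model-b", "model-a"),
    ("model-b", "model-c", "model-c"),
    ("model-a", "model-c", "model-a"),
    ("model-a", "model-b", "model-b"),
]

if __name__ == "__main__":
    for name, rating in sorted(rate(VOTES).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Nothing in the update knows whether the preferred answer was correct. The rating only records who won the vote, which is exactly why it is a taste test.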
SWE-bench: now we are touching the machinery
SWE-bench is more interesting for agents because it moves from "answer a question" to "change a real codebase." It uses real GitHub issues and asks the model to produce patches that pass tests.
Now we are closer to work, which is where the pretty charts start sweating.
It also means the benchmark is now testing a whole system, not just a model. The prompt matters. The editing tools matter. The repo navigation matters. The test suite matters. The retry policy matters. A model inside a bad harness can look dumber than it is. A model inside a great harness can look like it learned carpentry overnight.
So when you see a SWE-bench number, do not read it as "model IQ." Read it as:
model + prompt + agent loop + tools + test harness + time budget
That is not a flaw. That is the point. Agents are systems.
HELM: the adult in the room
HELM's useful contribution is philosophical: stop pretending accuracy is the only axis just because it is the easiest one to chart.
For language models, we also care about robustness, fairness, toxicity, calibration, efficiency, and transparency. A model that is 2% better on one exam but costs 4x as much, fails more often under prompt variation, and says little about training data may not be "better" in the way your product needs.
HELM also says the quiet part out loud: evaluation is always incomplete. That humility is healthy. A benchmark that admits its missing pieces is usually more trustworthy than one wearing a cape and pretending to save the city.
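Before weighing "2% better" against "costs 4x as much", it is worth checking whether a 2-point gap is even outside the noise. Here is a rough sketch using the normal approximation for the difference between two accuracies. It assumes both models were scored on test sets of the same size and treats the two scores as independent, which is a simplification (a paired test on the same questions would be tighter). The function name and the example numbers are made up for illustration.

```python
import math


def accuracy_gap_interval(acc_a: float, acc_b: float, n: int, z: float = 1.96):
    """Approximate 95% interval for (acc_a - acc_b) when each model answered n items.

    Uses the binomial variance p * (1 - p) / n for each accuracy and adds the
    two variances, i.e. it treats the two scores as independent samples.
    """
    for acc in (acc_a, acc_b):
        if not 0.0 <= acc <= 1.0:
            raise ValueError("accuracies must be fractions between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive number of test items")
    var_a = acc_a * (1 - acc_a) / n
    var_b = acc_b * (1 - acc_b) / n
    margin = z * math.sqrt(var_a + var_b)
    gap = acc_a - acc_b
    return gap - margin, gap + margin


if __name__ == "__main__":
    # Hypothetical scores: 90% vs 88% on a small and a large test set.
    for n in (500, 14_000):
        low, high = accuracy_gap_interval(0.90, 0.88, n)
        verdict = "could be noise" if low <= 0.0 <= high else "likely a real gap"
        print(f"n={n}: gap interval [{low:+.3f}, {high:+.3f}] -> {verdict}")
```

The same 2-point lead can be a coin flip on a few hundred questions and a real signal on many thousands. Whether a real gap is worth 4x the cost is still your call, not the chart's.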
The three benchmark sins
Most leaderboard discourse goes wrong in three boring ways.
| Sin | What it sounds like | Better question |
|---|---|---|
| Score worship | "Model A is better because it is +1.8 on X." | Is that difference meaningful for my task? |
| Benchmark shopping | "Look, my favorite model wins this chart." | Which benchmark matches the failure I care about? |
| Setup blindness | "The model scored 67%." | With what prompt, tools, budget, judge, and data split? |
The third one is the sneakiest. A benchmark is not just the dataset. It is the full recipe. If someone changes the prompt, number of attempts, tool access, judge model, temperature, context window, or time budget, they may still call it the same benchmark while measuring something meaningfully different.
That is how leaderboard soup happens.
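One cheap defense is to write the full recipe down and give it an identity, so two runs only share a name if they share a setup. The sketch below is one way to do that, not a standard: `EvalRecipe`, its field names, and the example values are all hypothetical, chosen to mirror the knobs listed above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class EvalRecipe:
    """Everything that shapes a score besides the model itself (illustrative fields)."""
    dataset: str
    prompt_template: str
    attempts: int
    tools: tuple[str, ...]
    judge: str
    temperature: float
    context_window: int
    time_budget_s: int

    def fingerprint(self) -> str:
        """Short stable hash of the recipe; any changed field changes the hash."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical baseline setup for a coding-agent run.
baseline = EvalRecipe(
    dataset="my-repo-issues-v1",
    prompt_template="Fix the failing test described below.\n{issue}",
    attempts=1,
    tools=("read_file", "edit_file", "run_tests"),
    judge="unit-tests",
    temperature=0.0,
    context_window=32_000,
    time_budget_s=600,
)

# Same dataset, same model, but five attempts instead of one.
best_of_five = replace(baseline, attempts=5)

if __name__ == "__main__":
    print("baseline:    ", baseline.fingerprint())
    print("best of five:", best_of_five.fingerprint())
    print("same recipe? ", baseline.fingerprint() == best_of_five.fingerprint())
```

If a score is reported next to its fingerprint, "67% on the same benchmark" with a different hash is a visible mismatch instead of a quiet one.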
How to read a benchmark without falling for the chart
When a model company shows you a chart, do the boring useful thing and ask six questions:
- What is the unit of work? Multiple-choice question, chat answer, code patch, browser task, agent workflow?
- Who judges it? Exact answer, unit tests, humans, another model, rubric?
- What was the harness? Single prompt, chain-of-thought, tools, retries, agent loop?
- Is it fresh? Could the model have seen the tasks during training?
- Is the score saturated? If everyone is above 90%, the benchmark is now a tie-breaker, not a map.
- Does it match your use case? If you need legal drafting, a coding benchmark is trivia. If you need repo repair, MMLU is trivia.
That last one is the money sentence: the best benchmark is the one shaped like your failure mode.
If your model keeps forgetting formatting rules, build an eval for formatting rules. If it makes bad SQL joins, build an eval for SQL joins. If it gives beautiful wrong answers to support tickets, benchmark that exact support workflow.
Public benchmarks are weather reports. Private evals are checking the window before leaving home because your street floods first.
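A private eval for the "keeps forgetting formatting rules" case can be this small. The sketch below assumes a made-up rule set (replies must be a JSON object with `summary` and `priority`, and `priority` must be low, medium, or high) and a handful of made-up saved outputs. Your rules and your logged outputs would replace both.

```python
import json

# Hypothetical house rules for a support-ticket summarizer.
REQUIRED_KEYS = {"summary", "priority"}
ALLOWED_PRIORITIES = ("low", "medium", "high")


def check_format(output: str) -> list[str]:
    """Return a list of rule violations; an empty list means the output passes."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "priority" in data and data["priority"] not in ALLOWED_PRIORITIES:
        problems.append(f"bad priority: {data['priority']!r}")
    return problems


def pass_rate(outputs: list[str]) -> float:
    """Fraction of outputs with zero violations."""
    if not outputs:
        raise ValueError("need at least one output to score")
    return sum(not check_format(o) for o in outputs) / len(outputs)


# Hypothetical saved model outputs, the kind you would pull from real logs.
SAVED_OUTPUTS = [
    '{"summary": "Login fails on mobile", "priority": "high"}',
    '{"summary": "Slow dashboard", "priority": "urgent"}',
    'Sure! Here is the JSON: {"summary": "Typo", "priority": "low"}',
    '{"summary": "Typo on pricing page"}',
]

if __name__ == "__main__":
    for output in SAVED_OUTPUTS:
        print(check_format(output) or "ok")
    print(f"pass rate: {pass_rate(SAVED_OUTPUTS):.2f}")
```

It will never make a leaderboard, but it answers the one question the public charts cannot: does this model break on your street?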
My take
Benchmarks are good. Leaderboard worship is bad.
A good benchmark gives the field a shared argument. A bad reading of a benchmark turns that argument into a horse race. The chart is not the truth. The chart is an instrument panel. You still need to know what vehicle you are driving, what road you are on, and whether the fuel gauge is lying.
So use MMLU to ask about broad knowledge. Use MMLU-Pro when the old exam got too easy. Use Arena to sample human taste. Use SWE-bench to inspect coding-agent systems. Use HELM to remember that accuracy is not the whole animal. Use OpenAI Evals, your own scripts, or a boring spreadsheet of real examples to test what actually hurts in your product.
The Shakesbee rule: never ask "which model is best?"
Ask: best at what, under which setup, with which failure cost?
That question is less fun for marketing slides. It is much better for not buying a thermometer and calling it dinner.
Sources
- Measuring Massive Multitask Language Understanding — the original MMLU paper, covering 57 tasks across academic and professional subjects
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark — explains the harder MMLU-Pro setup, including ten answer choices and stronger reasoning pressure
- BIG-bench — Google's collaborative benchmark collection for probing broad and unusual language-model capabilities
- Holistic Evaluation of Language Models — Stanford CRFM's framing for multi-metric, transparent, incomplete-by-design evaluation
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference — the paper behind crowdsourced pairwise preference evaluation
- SWE-bench: Can Language Models Resolve Real-World GitHub Issues? — Princeton's explanation of evaluating models on real GitHub issue-to-patch tasks
- OpenAI Evals — open-source framework and registry for building and running LLM evals, including private workflow-specific evals