SkillsBench vs our skillrank: a postmortem

Tomorrow I want to study skillsbench.ai/leaderboard closely and compare it against the eval system we built. The leaderboard format we chose is on the right track, but SkillsBench has a lot more going on:

  • With/without skill comparison — same agent, same task, with vs. without the skill loaded.
  • Wall-clock time + token cost per run.
  • Tasks are concrete, not categories — each one has an instruction, a verifier, results, and a trajectory.

Mistake 1: didn't search "who already did this"

This was the biggest mistake. Thirty minutes of searching for "AI skill evaluation benchmark" or "skill leaderboard for agents" on day one would have surfaced SkillsBench — it has a 2026 paper, a public GitHub, and a working leaderboard. Instead we built a worse version from scratch.

Lesson: Building before searching = waste. Search for competitors first, then decide whether to fork, contribute, or build your own.

Mistake 2: LLM-as-judge instead of real execution

Our "real execution" wasn't real execution. It was handing SKILL.md to an LLM and asking it to write a response. SkillsBench actually runs an agent in Docker, compiles a Maven project, processes a .pcap file, and so on.

  • We measured: "how good is qwen-turbo's reply after reading this doc?"
  • SkillsBench measured: "can Claude Code, equipped with this skill, finish the task?"

Lesson: Real execution means real input files, real shell commands, and real output verification. Feeding SKILL.md to an LLM and grading the essay is not execution.
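
For concreteness, here is a minimal sketch of what that could look like, assuming a local Docker image called skill-eval:latest and an agent CLI inside it (both are placeholder names, not SkillsBench's actual setup): real input files go into a container, the agent runs real shell commands, and verification is itself a command.

  import subprocess

  def run_task(task_dir: str, instruction: str) -> bool:
      # Start a throwaway container with the task's real input files mounted.
      container = subprocess.run(
          ["docker", "run", "-d", "-v", f"{task_dir}:/workspace",
           "skill-eval:latest", "sleep", "infinity"],
          capture_output=True, text=True, check=True,
      ).stdout.strip()
      try:
          # The agent works on real files with a real shell ("agent" is a placeholder CLI).
          subprocess.run(["docker", "exec", container, "agent", instruction], check=True)
          # Verification is another real command, not an LLM's opinion of a transcript.
          check = subprocess.run(
              ["docker", "exec", container, "test", "-f", "/workspace/answer.json"])
          return check.returncode == 0
      finally:
          subprocess.run(["docker", "rm", "-f", container], capture_output=True)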

They don't use LLM-as-judge — they use a deterministic verifier

SkillsBench uses a deterministic verifier — a script that checks whether the agent's output meets the spec. For example:

  • Task: upgrade this Spring Boot 2 project to Spring Boot 3 and make tests pass.
  • Verifier: mvn test exits 0 AND no javax. imports remain in the changed files.
  • Result: PASS or FAIL. No grey zone.

Our approach asked qwen-turbo to read two passages and pick the better one — a subjective judgment that produces different rankings each run. A deterministic verifier gives you the same answer 100 runs in a row, which is why SkillsBench can publish 95% confidence intervals while our shadcn entry was flipping between #1 and #5.
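
A verifier for that Spring Boot example could be as small as the sketch below. This is a guess at the shape, not SkillsBench's actual script: shell out to Maven, then scan the sources for leftover javax. imports (the real task scopes this to changed files; checking everything keeps the sketch simple).

  import subprocess
  from pathlib import Path

  def verify(repo: Path) -> bool:
      """PASS only if the tests pass AND no javax. imports remain. No judge, no grey zone."""
      if subprocess.run(["mvn", "test", "-q"], cwd=repo).returncode != 0:
          return False
      for java_file in repo.rglob("*.java"):
          for line in java_file.read_text(errors="ignore").splitlines():
              if line.strip().startswith("import javax."):
                  return False
      return True

  # "/workspace/project" is a placeholder path for the repo the agent just modified.
  print("PASS" if verify(Path("/workspace/project")) else "FAIL")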

Mistake 3: pairwise comparisons instead of pass/fail

Bradley-Terry pairwise rankings produce unstable results when competitors are close in strength; we watched shadcn and stitch-react flip back and forth all night. SkillsBench uses absolute pass/fail: a task either passes the verifier or it doesn't.

Lesson: If pass/fail is definable, don't use pairwise. Pairwise is a fallback for tasks where pass/fail is impossible (e.g. "which essay is better"), not a default.
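
Pass/fail also makes aggregation trivial: count passes and attach an interval. A sketch using a plain normal-approximation interval (my choice of formula; SkillsBench may compute its intervals differently):

  import math

  def pass_rate_ci(passes: int, runs: int, z: float = 1.96):
      """Pass rate with an approximate 95% confidence interval (normal approximation)."""
      p = passes / runs
      half_width = z * math.sqrt(p * (1 - p) / runs)
      return p, max(0.0, p - half_width), min(1.0, p + half_width)

  # e.g. 5 repeats x 20 tasks = 100 verifier verdicts, each an unambiguous PASS or FAIL
  rate, low, high = pass_rate_ci(passes=49, runs=100)
  print(f"pass rate {rate:.0%}, 95% CI [{low:.0%}, {high:.0%}]")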

Mistake 4: no with/without baseline

SkillsBench's killer insight: same agent, same task — pass rate is 48.7% with the skill, 31.3% without. That directly answers "is this skill useful?"

A pure ranking can't answer that question. A ranking only says "A beats B" — it can't say "A is better than not using a skill at all."

Lesson: The core eval question is "does the skill help?" — not "which is best?" Without a baseline, there is no answer.
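
Once a verifier exists, the baseline is cheap to measure: run every task twice, once with the skill loaded and once without, and report the lift. A sketch with a hypothetical run_task(task, with_skill) callable standing in for whatever harness actually executes the task:

  def skill_lift(tasks, run_task):
      """run_task(task, with_skill: bool) -> bool is a placeholder for the real harness."""
      n = len(tasks)
      with_skill = sum(run_task(t, with_skill=True) for t in tasks) / n
      without_skill = sum(run_task(t, with_skill=False) for t in tasks) / n
      return {
          "with_skill": with_skill,            # e.g. 0.487
          "without_skill": without_skill,      # e.g. 0.313
          "lift": with_skill - without_skill,  # answers "does the skill help?"
      }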

Mistake 5: scenarios were too vague to verify

Our scenarios looked like exam essay prompts ("build me a multi-step form component"). SkillsBench's tasks look like physics problems ("here's an .stl file, compute the mass, write the answer to /root/answer.json").

The first kind requires subjective LLM judgment. The second kind can be auto-verified by a script.

Lesson: A good eval task = explicit input + explicit expected output + a verifier that auto-judges.
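
The .stl example above translates into that shape almost mechanically. The layout below is a guess at a reasonable format (field names, density, and expected value are made up for illustration), not SkillsBench's schema:

  import json
  from pathlib import Path

  # Explicit input + explicit expected output + a scripted verifier.
  TASK = {
      "instruction": 'Compute the mass of /root/part.stl in grams (density 1.04 g/cm^3) '
                     'and write {"mass_g": <float>} to /root/answer.json.',
      "input_files": ["part.stl"],
      "expected_mass_g": 182.4,   # precomputed when the task is authored
      "tolerance_g": 0.5,
  }

  def verify(workspace: Path, task: dict) -> bool:
      answer_file = workspace / "answer.json"
      if not answer_file.exists():
          return False
      try:
          answer = json.loads(answer_file.read_text())
          return abs(float(answer["mass_g"]) - task["expected_mass_g"]) <= task["tolerance_g"]
      except (json.JSONDecodeError, KeyError, TypeError, ValueError):
          return False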

Mistake 6: classification kept flipping

  • Started with ship_code (too broad)
  • Expanded to 10 intents (too many)
  • Cut to 3 (too few)
  • Treated sub-skills as skills (wrong model)
  • Reverted to repo-level (correct)
  • Expanded back to 10

Each pivot meant rewriting the import script, re-running evals, and redeploying. Huge time sink.

Lesson: Define "what is a skill?" and "how do we verify a skill is good?" before writing code, not while the eval is already running.

Mistake 7: too much time on infrastructure

Fly.io deploys, OpenRouter free-tier rate limits, SQLite migrations, judge-prompt tuning — all of it is infrastructure noise, unrelated to whether the eval itself is good. SkillsBench probably runs Docker locally with a Python verifier and that's it. No Fly.io, no dashboard, no Bradley-Terry.

Lesson: The core of an eval is (task, execution, verification). Dashboards and leaderboards are presentation, not the core. Get the eval right first, then ship the UI.
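
Taken literally, that core fits in a loop small enough to run on a laptop. A minimal sketch, with execute and verify as placeholders for the Docker runner and per-task verifier sketched earlier:

  from pathlib import Path

  def evaluate(tasks_dir: Path, execute, verify) -> float:
      """The whole eval: task, execution, verification. Everything else is presentation."""
      verdicts = []
      for task_dir in sorted(p for p in tasks_dir.iterdir() if p.is_dir()):
          instruction = (task_dir / "instruction.txt").read_text()
          workspace = execute(task_dir, instruction)    # run the agent for real
          verdicts.append(verify(workspace, task_dir))  # deterministic PASS/FAIL
      return sum(verdicts) / len(verdicts)

  # No Fly.io, no dashboard, no Bradley-Terry: a pass rate printed to stdout.
  # print(f"{evaluate(Path('tasks'), execute, verify):.1%}")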