What this unit solves

Other people’s benchmarks test other people’s tasks. Answering “is this tool, model, or configuration actually better for me?” requires a set of your own reproducible test cases and a habit of verifying every time rather than only when you happen to remember. This unit gives you concrete methods for designing a personal benchmark, choosing between two verification modes, running the git worktree comparison method, and understanding why a 100% pass rate may be a sign of trouble.

Learning objectives

  • Design a set of 3 to 5 reproducible tasks that reflect your real work.
  • Distinguish checkpoint-based from continuous verification and choose the right mode based on task characteristics.
  • Use the git worktree comparison method to measure the difference between “with a setting” and “without a setting” or “tool A” vs “tool B.”
  • Explain the saturation check: why a 100% pass rate means the tests are too easy.
  • Turn “programmatically verifiable claims get execution verification, factual claims get search verification, and claims that fit neither get marked unverified” into a repeatable habit.

1. Why build your own benchmark

Public benchmarks rarely correlate strongly with your specific tasks. HumanEval tests LeetCode-style standalone functions; SWE-bench tests GitHub issue fixes. These overlap with what you do, but they are not the same thing. A model’s high public score does not imply good performance on your codebase, your domain, or your constraints. Three reasons to build your own:
  1. True representativeness: tasks you have actually done recently will fit your workflow better than any public set.
  2. Real constraints: the conditions you actually work under (private models, no network, a specific framework, a team coding style) are not covered by public benchmarks.
  3. Reproducibility: your own tests can be re-run, adjusted, and extended with edge cases; public benchmarks are frozen.
Public benchmarks are for filtering, not deciding

Public benchmarks are appropriate for the first round of filtering (eliminating obviously unsuitable options). Once a candidate list is established, your own tests make the final call. Treating “the top public-score model” as “best for me” is a misuse: it is the same error as treating GitHub star counts as a quality signal, just in a different column. See 03-2 An evaluation framework beyond the GitHub star trap for more on signal weighting.

2. Designing a reproducible task set

2.1 Criteria for task selection

Pick 3 to 5 tasks, each satisfying all of the following:
  • Representative: you have done it at least once in the past six months and will do it again.
  • Repeatable: the same input can be run multiple times without depending on the current time, network state, or random API calls.
  • Clear success criteria: the output can be judged in binary terms (pass / fail) or measured (time, tokens, diff size, error count).
  • Under one hour: the full task fits within a single work session without spanning days.
Personal benchmark examples

Say you are a backend engineer whose work includes refactoring a module, writing a migration, reviewing a PR, and debugging a race condition. Five reasonable benchmark tasks:
  1. Convert orders/service.py from class-based to function-based with identical behavior, verified by running the existing unit tests to completion.
  2. Write a migration (including rollback) for a newly added payment_method column and update the ORM model.
  3. Produce a structured review of a 200-line PR (categorized as must-fix / nit / question), at least one item per category.
  4. Reproduce a concurrency bug and write a reproducer test.
  5. Rewrite a legacy shell script in Python, preserving all error code behavior.
These five tasks take two to four hours combined. They can be run individually or together. Run them every time a tool upgrade happens.
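The selection criteria and the example set above can be written down as data, which makes the set easy to re-run and review. The following is a minimal sketch, not a prescribed format: the `BenchmarkTask` fields, the `success_check` strings, and the minute estimates are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkTask:
    name: str
    success_check: str        # hypothetical: command or rubric that decides pass/fail
    est_minutes: int          # hypothetical estimate for one run
    needs_network: bool = False


def validate_task_set(tasks: list[BenchmarkTask]) -> list[str]:
    """Return a list of problems; an empty list means the set meets the criteria."""
    problems = []
    if not 3 <= len(tasks) <= 5:
        problems.append(f"expected 3 to 5 tasks, got {len(tasks)}")
    for task in tasks:
        if task.est_minutes > 60:
            problems.append(f"{task.name}: longer than one hour")
        if task.needs_network:
            problems.append(f"{task.name}: depends on network state")
        if not task.success_check.strip():
            problems.append(f"{task.name}: no clear success criterion")
    return problems


# The five example tasks; checks and durations are illustrative assumptions.
TASKS = [
    BenchmarkTask("refactor orders/service.py", "existing unit tests pass", 45),
    BenchmarkTask("payment_method migration", "migrate and rollback both succeed", 30),
    BenchmarkTask("structured PR review", "one item per category present", 30),
    BenchmarkTask("concurrency bug reproducer", "reproducer test fails before the fix", 50),
    BenchmarkTask("shell script to Python", "exit codes match the legacy script", 40),
]

print(validate_task_set(TASKS))
```

Keeping the set in a file under version control also gives you a record of when the tasks were last updated.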

2.2 Quantitative metrics

All four metric types must be tracked; omitting any one of them is a gap:
| Metric type | Example | Why it matters |
| --- | --- | --- |
| Time | Combined human + AI elapsed time | Speed gains are the easiest to overstate; without measuring time you cannot tell whether the time saved covers the subscription |
| Quality | Review acceptance rate, unit test pass rate | A fast tool that produces unusable output is not useful |
| Cost | Token consumption, API spend | When two tools have equal quality, cost is the deciding factor |
| Reproducibility | Rate of substantive differences when re-running the same task | An unstable tool’s future debugging time can consume the time it saves |
Reproducibility is especially easy to overlook: an agent that produces different results on the same task twice may be responding to temperature, context drift, or random tool call ordering. If differences appear more than half the time, the tool is unsuitable for reproducibility-sensitive work.
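One way to turn the “more than half the time” rule into a number is to fingerprint each run’s outcome and count how often a re-run differs from the first run. This is a sketch under assumptions: what counts as a “substantive difference” is up to you, and the outcome labels below are hypothetical stand-ins for something like a hash of the normalized diff.

```python
def substantive_difference_rate(outcomes: list[str]) -> float:
    """Fraction of re-runs whose outcome differs from the first run."""
    if len(outcomes) < 2:
        raise ValueError("need at least two runs to measure reproducibility")
    reference, reruns = outcomes[0], outcomes[1:]
    return sum(outcome != reference for outcome in reruns) / len(reruns)


def suitable_for_reproducible_work(outcomes: list[str], max_rate: float = 0.5) -> bool:
    """Apply the rule of thumb: differences in more than half the re-runs is a fail."""
    return substantive_difference_rate(outcomes) <= max_rate


# Hypothetical fingerprints for four runs of the same task.
stable_runs = ["diff-a", "diff-a", "diff-b", "diff-a"]
unstable_runs = ["diff-a", "diff-b", "diff-c"]

print(suitable_for_reproducible_work(stable_runs))    # True
print(suitable_for_reproducible_work(unstable_runs))  # False
```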

2.3 What not to do

  • Do not use demo tasks. A demo is “let me try this,” not your real work, so its results do not generalize.
  • Do not use trivially simple tasks. “Convert this string to uppercase” teaches nothing about differences no matter how fast it runs.
  • Do not use tasks that require network access or third-party APIs, unless that is the core of your work; network variance will pollute the signal.
  • Do not define too many tasks. Five is enough. A set of ten or more tends not to get run, and a benchmark that is not run is no benchmark.

3. Two verification modes

Choose a mode based on the task. Using both is reasonable, but know which one you are in at any given moment.

3.1 Checkpoint-based verification

Characteristics: verify “is this correct so far?” at defined milestones. Best for:
  • Tasks with distinct phase boundaries (finishing a function, completing a round of refactoring, opening a PR).
  • One-shot workflows (writing docs, generating a scaffold, debugging a single bug).
  • Scenarios where failure cost is manageable, the work can be interrupted, and rollback is possible.
Implementation notes:
  • Write the checkpoints down before starting: “this task stops for review after X is done.” Committing to writing forces clarity and prevents the feeling of flow from deceiving you.
  • Run one verification pass at each checkpoint (unit test, lint, type check, visual feedback).
  • Define “done” per checkpoint: a phase is done when its checkpoint passes, not when the whole task appears finished.
Checkpoint example

Writing a new feature:
  • First checkpoint: API interface defined (method, params, return type). Run a mock test to confirm the schema is correct.
  • Second checkpoint: core logic written. Run unit tests to confirm the happy path.
  • Third checkpoint: error cases handled. Run the full test suite + lint + type check.
Stop at each point, look, then continue. If anything is wrong at any point, stop and fix it; do not push forward just to keep the flow going.

3.2 Continuous verification

Characteristics: on every change or timed trigger, run tests + lint + verification; stop on failure. Best for:
  • Long-running agents (background automation, recurring tasks, a coding agent in CI).
  • Scenarios with frequent configuration changes (upgrading a framework, adjusting hooks, changing rules).
  • Components that “affect every session,” such as skills, hooks, and subagents.
Implementation notes:
  • Run one baseline pass before making changes (see Section 4, the worktree comparison method).
  • After changes, run the same test set and diff the results.
  • Set a fail threshold: if the failure rate exceeds a certain proportion, or a particular error type appears N times consecutively, stop.
  • The hook mechanism in 04-6 Hooks is the concrete engineering interface for continuous verification.
What triggers continuous verification

The trigger does not have to be a timer. File changes, specific commands, or scheduled runs all work as triggers. The key is that the chain of “change, verify, diff, notify” closes completely: do not leave it half-wired or let verification results go unread.
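The fail threshold from the implementation notes can be sketched as a small decision function. The text leaves “a certain proportion” and “N times” open, so the defaults below (30% and 3) are assumptions to tune, not recommended values.

```python
from typing import Optional


def should_stop(
    results: list[Optional[str]],
    max_failure_rate: float = 0.3,        # assumed threshold
    max_consecutive_same_error: int = 3,  # assumed value of N
) -> bool:
    """results holds one entry per verification run, oldest first:
    None for a pass, or a short error-type label for a failure."""
    if not results:
        return False
    failure_rate = sum(r is not None for r in results) / len(results)
    if failure_rate > max_failure_rate:
        return True
    streak, previous = 0, None
    for result in results:
        if result is None:
            streak = 0
        elif result == previous:
            streak += 1
        else:
            streak = 1
        previous = result
        if streak >= max_consecutive_same_error:
            return True
    return False


print(should_stop([None] * 9 + ["timeout"]))                          # False
print(should_stop([None] * 9 + ["lint", "lint", "lint"]))             # True
```

The second call stops on the repeated error type even though the overall failure rate (3 of 12) is still under the rate threshold.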

3.3 Mode comparison

| Dimension | Checkpoint-based | Continuous |
| --- | --- | --- |
| Applicable tasks | One-shot, distinct phases | Long-running, frequently changing |
| Trigger | Human-defined milestone | Change event or timer |
| Failure response | Stop and review | Auto-stop + notify |
| Attention burden | High (someone must be present) | Low (intervene when notified) |
| Risk | Missing intermediate errors | Noise (too-frequent verification disrupts development) |
In practice, most developers use checkpoint as the primary mode and continuous as a supplement: core flows run on checkpoint (someone is present), CI and long-running configurations run continuous (background). There is no need to force every workflow into continuous in pursuit of “automation.”

4. The worktree comparison method

Run the same task in two branches — “with the setting” vs “without the setting,” or “tool A” vs “tool B” — and compare results. git worktree lets you do this without switching branches or polluting your working directory.

4.1 Concept

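Each worktree is an independent checkout of the same repository: it shares the repository’s object store and refs with the main checkout but has its own working directory, index, and HEAD. Multiple worktrees can therefore run different experiments in parallel. Git does not allow the same branch to be checked out in two worktrees at once, so give each worktree its own branch or a detached HEAD.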

4.2 Typical procedure

Step 1. Create two worktrees from the main repo

    cd ~/work/my-project

    # Configuration under evaluation; the feat/skill-on branch must already exist
    git worktree add ../my-project-with-skill feat/skill-on
    # Baseline; --detach checks out main's commit without claiming the branch,
    # which avoids the "already checked out" error when main is checked out here
    git worktree add --detach ../my-project-without-skill main

    git worktree list

feat/skill-on is the configuration being evaluated (with the skill or setting enabled); main is the baseline.

Step 2. Run the same benchmark tasks in each worktree

Run all 3 to 5 baseline tasks in each directory separately, recording the same four metric types for each task: time, quality, cost, and reproducibility. Do not draw conclusions from a single task.

Step 3. Diff the results and derive the net effect

    # Pass rate difference (fill in the counts you recorded)
    echo "with-skill: 4/5, without-skill: 3/5"

    # Token cost difference (from hook logs or API billing)
    diff ../my-project-with-skill/task-tokens.log \
         ../my-project-without-skill/task-tokens.log

The difference is the net effect of this configuration. Did pass rate improve but cost rise 35%? Or did quality and pass rate stay equal while speed improved 18%? Let the numbers answer.

Step 4. Decide whether to keep or remove the worktree

After running, you can keep the worktree as a “best baseline” snapshot or remove it:

    # Add --force if the worktree contains uncommitted or untracked files
    git worktree remove ../my-project-without-skill
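Recording time and pass/fail per worktree is easier to keep consistent when one small helper runs each task. The sketch below assumes each task has a command whose exit code signals pass or fail; the function name and record fields are hypothetical, and the demo uses a stand-in command so it runs anywhere.

```python
import subprocess
import sys
import time
from pathlib import Path


def run_in_worktree(worktree: Path, command: list[str]) -> dict:
    """Run one benchmark command inside a worktree; record pass/fail and elapsed time."""
    start = time.perf_counter()
    proc = subprocess.run(command, cwd=worktree, capture_output=True, text=True)
    return {
        "worktree": worktree.name,
        "passed": proc.returncode == 0,
        "seconds": time.perf_counter() - start,
    }


# Stand-in command; replace with a real task check such as
# ["pytest", "tests/orders"] (hypothetical path) and a real worktree directory.
demo = run_in_worktree(Path.cwd(), [sys.executable, "-c", "raise SystemExit(0)"])
print(demo["passed"])
```

Calling the same helper with `Path("../my-project-with-skill")` and `Path("../my-project-without-skill")` yields two comparable lists of records.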

4.3 What to measure

| Metric | How to measure | How to interpret |
| --- | --- | --- |
| Pass rate | Tasks passed / total tasks | Difference (with minus without) > 0 means improvement; < 0 means regression |
| Average time | Arithmetic mean of task durations | Compare magnitudes; a paired t-test is possible, but with 3 to 5 tasks it has little power, so avoid over-interpreting small differences |
| Token cost | Input + output tokens per task | A cost commonly ignored when switching tools; public benchmarks often do not report it either |
| Diff size | Lines of code changed | Too large suggests the tool did unnecessary things; too small suggests it did not do enough |
| Reproducibility | Variance across N re-runs of the same task | High variance = instability; future maintenance cost can consume the time saved |
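The interpretation column reduces to a few subtractions once both worktrees have produced per-task records. The following sketch assumes each record carries `passed`, `seconds`, and `tokens` keys (hypothetical names); the numbers are illustrative, not measurements.

```python
from statistics import mean


def net_effect(with_runs: list[dict], without_runs: list[dict]) -> dict:
    """Compare per-task records; positive values mean 'with' is higher than 'without'."""

    def pass_rate(runs: list[dict]) -> float:
        return sum(r["passed"] for r in runs) / len(runs)

    def relative_change(key: str) -> float:
        base = mean(r[key] for r in without_runs)
        return (mean(r[key] for r in with_runs) - base) / base

    return {
        "pass_rate_delta": pass_rate(with_runs) - pass_rate(without_runs),
        "time_change": relative_change("seconds"),
        "token_change": relative_change("tokens"),
    }


# Illustrative data: 3/5 passes without the setting, 4/5 with it.
without_runs = [{"passed": p, "seconds": 600, "tokens": 1000}
                for p in (True, True, True, False, False)]
with_runs = [{"passed": p, "seconds": 540, "tokens": 1350}
             for p in (True, True, True, True, False)]

effect = net_effect(with_runs, without_runs)
print({key: round(value, 2) for key, value in effect.items()})
```

With these made-up inputs the setting raises the pass rate by 20 percentage points, cuts time by 10%, and raises token use by 35%; whether that trade is worth it remains your call.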
Comparison experiment example

You want to evaluate whether a particular CLI tool is better than Claude Code for your daily Python refactoring.
  1. git worktree add ../eval-cli feat/cli-pilot (with the CLI installed)
  2. git worktree add --detach ../eval-claude main (using Claude Code; --detach is needed when main is already checked out in the main repo)
  3. Run five benchmark tasks in each worktree, recording pass rate, time, and tokens.
  4. Results: the CLI tool averaged 18% faster, but used 35% more tokens; pass rates were equal; CLI reproducibility was slightly worse.
  5. Conclusion: the speed improvement does not cover the token cost increase, and reproducibility is worse, so continue with Claude Code.
This experiment takes about three hours and is both faster and more reliable than two weeks of gut-feel evaluation.

4.4 Limitations of the worktree method

  • Not suitable for evaluations requiring persistent state (databases, caches, external services) because worktrees will compete for the same resources.
  • Not suitable for cross-repo evaluation; worktrees are a single-repo view. Evaluating cross-repo tools requires a more complex setup.
  • Cannot eliminate selection bias: your baseline tasks are still a subjective selection. The “best tool” derived from them may only be best for those five tasks.

5. Saturation check

100% pass rate means the tests are too easy

If your personal benchmark passes 100% every run, the test set has no discriminating power. It cannot distinguish “the best tool” from “an average tool,” because both score full marks.
How to perform a saturation check:
  • After running a round, deliberately use a “known weak” tool or baseline (an outdated model, the default zero-configuration setup, a base agent with no skills enabled) to run the same tasks.
  • If the baseline also passes 100%, your tests are too easy.
  • Add difficulty, add edge cases, increase scale — until at least one tool or configuration fails.
  • Target distribution: baseline passes roughly 30% to 60%; “reasonable choice” passes roughly 70% to 90%; “best choice” passes 90% or above. A spread is required for the benchmark to carry information.
Saturation check example

Your five benchmark tasks currently pass 100% for both tool A and Claude Code. Run the same tasks with base Claude (no skills, no rules, no CLAUDE.md): still 100%. The tests are too easy.

Fix: replace task 1 with “refactor a legacy function with 6 levels of nesting while preserving all side effects.” Re-run: base Claude 60%, configured Claude 90%, tool A 85%. Now there is discriminating power.

Conversely, if after increasing difficulty all tools drop to 30%, the tasks are too hard and no tool succeeds; you need to return to a middle difficulty level.
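The saturation check can be expressed as a short classification over recorded pass rates. This is a sketch, assuming pass rates are stored as fractions per configuration; the `floor` of 0.3 is taken from the example above and is a judgment call, not a fixed rule.

```python
def saturation_verdict(pass_rates: dict[str, float], baseline: str, floor: float = 0.3) -> str:
    """Classify a benchmark round by whether it can tell configurations apart."""
    rates = list(pass_rates.values())
    if pass_rates[baseline] >= 1.0:
        return "too easy"        # even the known-weak baseline scores full marks
    if max(rates) <= floor:
        return "too hard"        # nothing gets meaningfully above the floor
    if max(rates) == min(rates):
        return "no spread"       # identical scores carry no information
    return "discriminating"


# Pass rates from the example above.
before = {"base Claude": 1.0, "configured Claude": 1.0, "tool A": 1.0}
after = {"base Claude": 0.6, "configured Claude": 0.9, "tool A": 0.85}
overshoot = {"base Claude": 0.3, "configured Claude": 0.3, "tool A": 0.3}

print(saturation_verdict(before, "base Claude"))     # too easy
print(saturation_verdict(after, "base Claude"))      # discriminating
print(saturation_verdict(overshoot, "base Claude"))  # too hard
```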

6. Turning verification into a habit

Reproducible verification requires a standing operating procedure. Three rules to give yourself:

6.1 Programmatically verifiable claims get execution verification

If a claim can be written as an executable check (a numeric value, a regex, a transformation, an API call), execute it — do not rely on memory or impression.
  • “This refactor did not change behavior”: run the existing test suite.
  • “This code contains no os.system calls”: run rg "os\.system" path/.
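  • “This API returns status 200”: use curl -s -o /dev/null -w "%{http_code}" <url> or an SDK call. Note that curl -I sends a HEAD request, which some APIs answer differently from a GET.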
A concrete form of automation

Place scripts/verify.sh at the repo root and chain together the common verification greps, tests, and lint passes. Run it on every substantive change; enforce it in CI. Add it to CLAUDE.md or AGENTS.md so the agent knows this SOP exists.
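The same chaining idea can be sketched in Python for teams that prefer it over a shell script. This is an illustration, not the contents of scripts/verify.sh: the check names and the tools in `EXAMPLE_CHECKS` are assumptions about your repo, and the demo uses stand-in commands so it runs without those tools installed.

```python
import subprocess
import sys


def run_checks(checks: list[tuple[str, list[str]]]) -> list[str]:
    """Run every check and return the names of those that failed."""
    failed = []
    for name, command in checks:
        try:
            ok = subprocess.run(command, capture_output=True).returncode == 0
        except OSError:  # for example, the tool is not installed
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Hypothetical check list; adjust tools and paths to your repo.
EXAMPLE_CHECKS = [
    ("unit tests", ["pytest", "-q"]),
    ("lint", ["ruff", "check", "."]),
]

# Stand-in commands that exit 0 and 1 respectively.
demo_failed = run_checks([
    ("always passes", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("always fails", [sys.executable, "-c", "raise SystemExit(1)"]),
])
print(demo_failed)  # ['always fails']
```

In CI, a wrapper would typically exit non-zero when the returned list is not empty. A check such as “no os.system calls” needs inverted logic, because rg exits 0 when it finds a match.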

6.2 Factual claims get search verification

If a claim is factual in nature (a version, API behavior, whether a reference exists, package compatibility), search an official source.
  • Latest package version: pypi.org/project/<pkg>/, npm registry.
  • API behavior changes: official changelog, release notes, migration guide.
  • Academic claims: the original paper, IEEEXplore, arXiv.
  • Cannot find it: mark as “unverified” — do not guess.

6.3 Mark “unverified” only when neither method works

When a claim can neither be verified programmatically nor found through search (for example, “this tool’s long-term maintenance commitment” or “the future roadmap of an emerging framework”), explicitly mark it “unverified, inference only.” Inferences must include the reasoning behind them, not just a conclusion.
Maintaining the verification SOP

Write these three rules into a “Verification principles” section of CLAUDE.md or AGENTS.md. Every time you or an agent wants to write “I think / it should be / probably,” go back and read that section. After three months, these three rules become a reflex.
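The three rules apply in a fixed order of precedence, which a few lines of code make explicit. The sketch below is only an illustration of that ordering; the enum and function names are hypothetical.

```python
from enum import Enum


class Verification(Enum):
    EXECUTE = "run an executable check"
    SEARCH = "search an official source"
    UNVERIFIED = "unverified, inference only"


def choose_verification(can_execute: bool, official_source_found: bool) -> Verification:
    """Prefer execution, fall back to search, and only then mark as unverified."""
    if can_execute:
        return Verification.EXECUTE
    if official_source_found:
        return Verification.SEARCH
    return Verification.UNVERIFIED


# "This refactor did not change behavior": the test suite can decide it.
print(choose_verification(can_execute=True, official_source_found=False).value)
# "The future roadmap of an emerging framework": neither method applies.
print(choose_verification(can_execute=False, official_source_found=False).value)
```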

Common pitfalls

  • Judging by feel: “it seems better after switching.” Feelings are contaminated by energy levels, mood, and the novelty of a first encounter. Quantitative metrics are the only reliable basis.
  • Believing that a test suite that always passes 100% means quality is high. A saturated benchmark has no discriminating power; running it is equivalent to not running it.
  • Comparing quality without comparing cost. When two tools have equal quality, token cost, time, and subscription cost are the deciding factors. Public benchmarks often omit cost, so relying on them alone hides it.
  • Too few benchmark tasks. Conclusions from a single task are not credible; three to five is the minimum credible threshold.
  • Never updating benchmark tasks. Tasks must evolve with your work; otherwise you are testing last year’s version of yourself.
  • Treating a worktree comparison as an A/B test. A/B testing has statistical significance requirements and sample size requirements. A worktree comparison is a quick small-sample comparison, not a statistically rigorous experiment, so do not overstate its conclusions.
  • Not recording verification results. If results are not recorded, the next tool change requires starting over. Keep the commands, outputs, and conclusions from every verification run.

Self-check

  • Can you produce a fixed set of tasks you would re-run whenever switching tools? When were those tasks last updated?
  • Have you deliberately verified whether your baseline tasks “saturate at 100%”?
  • For a tool you used last month, what metrics led you to keep it or switch away? Can you write them down?
  • In your current workflow, which flows suit checkpoint-based verification and which suit continuous? Can you tell them apart clearly?
  • Does the “verification principles” section in your CLAUDE.md or AGENTS.md contain concretely executable commands, or only abstract descriptions?
  • Think back to the last time you chose to trust a claim without verifying it. In hindsight, was that claim correct?

Sources and further reading

Factual claims are grounded in official documentation; fast-changing items are annotated as of 2026-05.