What this unit solves
Other people’s benchmarks test other people’s tasks. Answering “is this tool, model, or configuration actually better for me?” requires a set of your own reproducible test cases and a habit of verifying systematically rather than only when you happen to remember. This unit gives you concrete methods for designing a personal benchmark, choosing between two verification modes, running the git worktree comparison method, and understanding why a 100% pass rate may be a sign of trouble.
Learning objectives
- Design a set of 3 to 5 reproducible tasks that reflect your real work.
- Distinguish checkpoint-based from continuous verification and choose the right mode based on task characteristics.
- Use the git worktree comparison method to measure the difference between “with a setting” and “without a setting” or “tool A” vs “tool B.”
- Explain the saturation check: why 100% pass rate means the tests are too easy.
- Turn “programmatically verifiable claims get execution verification, factual claims get search verification, and neither gets marked unverified” into a repeatable habit.
1. Why build your own benchmark
Public benchmarks rarely correlate strongly with your specific tasks. HumanEval tests LeetCode-style standalone functions; SWE-bench tests GitHub issue fixes. These overlap with what you do, but they are not the same thing. A model’s high public score does not imply good performance on your codebase, your domain, or your constraints. Three reasons to build your own:
- True representativeness: tasks you have actually done recently will fit your workflow better than any public set.
- Real constraints: your real constraints (private models, no network, a specific framework, a team coding style) are not covered by public benchmarks.
- Reproducibility: your own tests can be re-run, adjusted, and extended with edge cases; public benchmarks are frozen.
2. Designing a reproducible task set
2.1 Criteria for task selection
Pick 3 to 5 tasks, each satisfying all of the following:
- Representative: you have done it at least once in the past six months and will do it again.
- Repeatable: the same input can be run multiple times without depending on the current time, network state, or random API calls.
- Clear success criteria: the output can be judged in binary terms (pass / fail) or measured (time, tokens, diff size, error count).
- Under one hour: the full task fits within a single work session without spanning days.
Personal benchmark examples
Say you are a backend engineer whose work includes refactoring a module, writing a migration, reviewing a PR, and debugging a race condition. Five reasonable benchmark tasks:
- Convert `orders/service.py` from class-based to function-based with identical behavior, verified by running the existing unit tests to completion.
- Write a migration (including rollback) for a newly added `payment_method` column and update the ORM model.
- Produce a structured review of a 200-line PR (categorized as must-fix / nit / question), at least one item per category.
- Reproduce a concurrency bug and write a reproducer test.
- Rewrite a legacy shell script in Python, preserving all error code behavior.
2.2 Quantitative metrics
All four metric types must be tracked; omitting any one of them is a gap:

| Metric type | Example | Why it matters |
|---|---|---|
| Time | combined human + AI elapsed time | Speed gains are the easiest to overstate; without measured time you cannot tell whether the time saved covers the subscription cost |
| Quality | review acceptance rate, unit test pass rate | A fast tool that produces unusable output is not useful |
| Cost | token consumption, API spend | When two tools have equal quality, cost is the deciding factor |
| Reproducibility | rate of substantive differences when re-running the same task | An unstable tool’s future debugging time will consume all the time it saves |
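One way to keep the four metric types together is to write one small record per run and append it to a log file. The sketch below is only an illustration, not a prescribed format: the `RunRecord` field names, the `results.jsonl` file name used in the usage note, and the example task identifier are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RunRecord:
    """One benchmark run. All field names are illustrative assumptions."""
    task_id: str      # e.g. "refactor-orders-service" (hypothetical id)
    config: str       # which tool or setting produced this run
    passed: bool      # quality: the binary success criterion
    elapsed_s: float  # time: combined human + AI elapsed seconds
    tokens: int       # cost: input + output tokens
    run_index: int    # reproducibility: which re-run of the task this is


def append_record(path: Path, record: RunRecord) -> None:
    """Append one record as a JSON line so earlier runs are never overwritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")


def load_records(path: Path) -> list[RunRecord]:
    """Read every recorded run back for later comparison."""
    with path.open(encoding="utf-8") as fh:
        return [RunRecord(**json.loads(line)) for line in fh if line.strip()]
```

Calling `append_record(Path("results.jsonl"), RunRecord(...))` after each run keeps a history that later comparisons can read back with `load_records`. An append-only file is chosen here so that re-runs accumulate instead of replacing each other, which is what the reproducibility metric needs.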
2.3 What not to do
- Do not use demo tasks. A demo is “let me try this,” not your real work. Evaluation results cannot be generalized.
- Do not use trivially simple tasks. “Convert this string to uppercase” teaches nothing about differences no matter how fast it runs.
- Do not use tasks that require network access or third-party APIs, unless that is the core of your work; network variance will pollute the signal.
- Do not define too many tasks. Five is enough; a set of ten or more tends not to get run at all, which leaves you with no benchmark.
3. Two verification modes
Choose a mode based on the task. Using both is reasonable, but know which one you are in at any given moment.
3.1 Checkpoint-based verification
Characteristics: verify “is this correct so far?” at defined milestones. Best for:
- Tasks with distinct phase boundaries (finishing a function, completing a round of refactoring, opening a PR).
- One-shot workflows (writing docs, generating a scaffold, debugging a single bug).
- Scenarios where failure cost is manageable, the work can be interrupted, and rollback is possible.
In practice:
- Write the checkpoints down before starting: “this task stops for review after X is done.” Committing to writing forces clarity and prevents the feeling of flow from deceiving you.
- Run one verification pass at each checkpoint (unit test, lint, type check, visual feedback).
- The task definition is “checkpoint passed equals done,” not “the whole thing finished equals done.”
Checkpoint example
Writing a new feature. First checkpoint: API interface defined (method, params, return type); run a mock test to confirm the schema is correct. Second checkpoint: core logic written; run unit tests to confirm the happy path. Third checkpoint: error cases handled; run the full test suite, lint, and type check. Stop at each point, look, then continue. If anything is wrong at any point, stop and fix it; do not push forward just to keep the flow going.
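The "stop at each point" rule can be made mechanical. The following is a minimal sketch, assuming each checkpoint is a named list of commands and that a non-zero exit code counts as failure; the `Checkpoint` class and `run_checkpoints` function are hypothetical names, and the commands you attach to each checkpoint depend on your own toolchain.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """A named milestone plus the verification commands to run when it is reached."""
    name: str
    commands: list[list[str]]  # each command is an argv list


def run_checkpoints(checkpoints, runner=subprocess.run):
    """Run checkpoints in order.

    Returns the name of the first checkpoint with a failing command,
    or None if every command of every checkpoint exited with code 0.
    Later checkpoints are not run once one has failed.
    """
    for checkpoint in checkpoints:
        for argv in checkpoint.commands:
            result = runner(argv)
            if result.returncode != 0:
                return checkpoint.name
    return None
```

The `runner` parameter exists only so the function can be exercised without real tools; by default it calls `subprocess.run`. Returning the failing checkpoint's name, rather than raising, leaves the decision about what to fix to the person at the keyboard.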
3.2 Continuous verification
Characteristics: on every change or timed trigger, run tests + lint + verification; stop on failure. Best for:
- Long-running agents (background automation, recurring tasks, a coding agent in CI).
- Scenarios with frequent configuration changes (upgrading a framework, adjusting hooks, changing rules).
- Components that “affect every session,” such as skills, hooks, and subagents.
In practice:
- Run one baseline pass before making changes (see Section 4, the worktree comparison method).
- After changes, run the same test set and diff the results.
- Set a fail threshold: if the failure rate exceeds a certain proportion, or a particular error type appears N times consecutively, stop.
- The hook mechanism in 04-6 Hooks is the concrete engineering interface for continuous verification.
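The fail threshold can be written as a small predicate over the recent verification history. This is a sketch under stated assumptions: the history is a list with `None` for a passing run and an error-type string for a failing one, and the default limits (30% failure rate, 3 consecutive identical errors) are placeholder values, not recommendations from this unit.

```python
def should_stop(results, max_failure_rate=0.3, max_consecutive=3):
    """Decide whether continuous verification should halt.

    results: verification outcomes, oldest first. None means the run passed;
             a string names the error type of a failed run.
    Stops when the overall failure rate exceeds max_failure_rate, or when the
    most recent max_consecutive runs all failed with the same error type.
    """
    if not results:
        return False

    failures = [r for r in results if r is not None]
    if len(failures) / len(results) > max_failure_rate:
        return True

    last = results[-1]
    if last is None:
        return False

    # Count how many of the most recent runs share the latest error type.
    streak = 0
    for outcome in reversed(results):
        if outcome != last:
            break
        streak += 1
    return streak >= max_consecutive
```

For example, seven passes followed by three `"type"` errors stays at the 30% limit for failure rate but still stops on the consecutive-error rule.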
3.3 Mode comparison
| Dimension | Checkpoint-based | Continuous |
|---|---|---|
| Applicable tasks | One-shot, distinct phases | Long-running, frequently changing |
| Trigger | Human-defined milestone | Change event or timer |
| Failure response | Stop and review | Auto-stop + notify |
| Attention burden | High (someone must be present) | Low (intervene when notified) |
| Risk | Missing intermediate errors | Noise (too-frequent verification disrupts development) |
4. The worktree comparison method
Run the same task in two branches (“with the setting” vs “without the setting,” or “tool A” vs “tool B”) and compare results. `git worktree` lets you do this without switching branches or polluting your working directory.
4.1 Concept
Each worktree is an independent checkout of the same repository: all worktrees share one `.git` repository, but each has its own working directory, so several experiments can run in parallel.
4.2 Typical procedure
Create two worktrees from the main repo
`feat/skill-on` is the configuration being evaluated (with the skill or setting enabled); `main` is the baseline.
Run the same benchmark tasks in each worktree
Run all 3 to 5 benchmark tasks in each directory separately, recording the same four metric types for each task: time, quality, cost, and reproducibility. Do not draw conclusions from a single task.
Diff the results and derive the net effect
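The first and last steps of this procedure could be scripted roughly as follows. This is a sketch, not the unit's own script: the directory names `../eval-skill-on` and `../eval-baseline` and all function names are hypothetical, while `feat/skill-on` and `main` come from the procedure above. The baseline is created with `--detach` on the assumption that `main` is already checked out in the primary working tree, in which case git will not check the same branch out a second time.

```python
import subprocess


def worktree_add_cmd(path, commitish, detach=False):
    """Build the argv for `git worktree add`; options go before the path."""
    cmd = ["git", "worktree", "add"]
    if detach:
        # Check out the commit without attaching to the branch, for when the
        # branch is already checked out in another worktree.
        cmd.append("--detach")
    return cmd + [path, commitish]


def create_eval_worktrees(repo_dir, run=subprocess.run):
    """Create one worktree per configuration; returns the commands that were run."""
    cmds = [
        worktree_add_cmd("../eval-skill-on", "feat/skill-on"),
        worktree_add_cmd("../eval-baseline", "main", detach=True),
    ]
    for cmd in cmds:
        run(cmd, cwd=repo_dir, check=True)  # check=True raises if git fails
    return cmds


def net_pass_rate(candidate_passed, baseline_passed):
    """Pass-rate difference (candidate minus baseline) over the same task list."""
    if len(candidate_passed) != len(baseline_passed) or not baseline_passed:
        raise ValueError("both runs must cover the same non-empty task list")
    return (sum(candidate_passed) - sum(baseline_passed)) / len(baseline_passed)
```

With five tasks where the candidate passes four and the baseline passes two, `net_pass_rate` returns `0.4`, a net gain of two tasks out of five.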
4.3 What to measure
| Metric | How to measure | How to interpret |
|---|---|---|
| Pass rate | tasks passed / total tasks | Difference > 0 means improvement; difference < 0 means regression |
| Average time | arithmetic mean of task durations | Use a paired t-test or just compare magnitudes; avoid over-interpreting small samples |
| Token cost | input + output tokens per task | A cost commonly ignored when switching tools; public benchmarks do not report it either |
| Diff size | lines of code changed | Too large means the tool did unnecessary things; too small means it did not do enough |
| Reproducibility | variance across N re-runs of the same task | High variance = instability; future maintenance cost will consume all the time saved |
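For the time and reproducibility rows, per-task paired differences are usually more informative than two separate averages, because each task serves as its own control. The helpers below are a minimal sketch using only the standard library; the function names and the dictionary layout (task id mapped to seconds) are assumptions made for illustration.

```python
from statistics import mean, variance


def paired_differences(candidate, baseline):
    """Per-task difference (candidate minus baseline) for tasks present in both runs."""
    shared = sorted(set(candidate) & set(baseline))
    return {task: candidate[task] - baseline[task] for task in shared}


def mean_difference(diffs):
    """Arithmetic mean of the paired differences; negative means the candidate is faster."""
    return mean(list(diffs.values()))


def rerun_variance(values):
    """Sample variance across re-runs of one task; 0.0 when there is only one run."""
    return variance(values) if len(values) >= 2 else 0.0
```

With only 3 to 5 tasks these numbers describe what happened in your runs; they do not support claims of statistical significance.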
Comparison experiment example
You want to evaluate whether a particular CLI tool is better than Claude Code for your daily Python refactoring.
- `git worktree add ../eval-cli feat/cli-pilot` (with the CLI installed)
- `git worktree add ../eval-claude main` (using Claude Code); if `main` is already checked out in your primary working tree, git refuses to check it out a second time, so add `--detach` or use a separate branch.
- Run five benchmark tasks in each worktree, recording pass rate, time, and tokens.
- Results: the CLI tool averaged 18% faster, but used 35% more tokens; pass rates were equal; CLI reproducibility was slightly worse.
- Conclusion: the speed improvement does not cover the token cost increase, and reproducibility is worse, so continue with Claude Code.
4.4 Limitations of the worktree method
- Not suitable for evaluations requiring persistent state (databases, caches, external services) because worktrees will compete for the same resources.
- Not suitable for cross-repo evaluation; worktrees are a single-repo view. Evaluating cross-repo tools requires a more complex setup.
- Cannot eliminate selection bias: your benchmark tasks are still a subjective selection. The “best tool” derived from them may only be best for those five tasks.
5. Saturation check
How to perform a saturation check:
- After running a round, deliberately use a “known weak” tool or baseline (an outdated model, the default zero-configuration setup, a base agent with no skills enabled) to run the same tasks.
- If the baseline also passes 100%, your tests are too easy.
- Add difficulty, add edge cases, increase scale — until at least one tool or configuration fails.
- Target distribution: baseline passes roughly 30% to 60%; “reasonable choice” passes roughly 70% to 90%; “best choice” passes 90% or above. A spread is required for the benchmark to carry information.
Saturation check example
Your five benchmark tasks currently pass 100% for both tool A and Claude Code. Run the same tasks with base Claude (no skills, no rules, no CLAUDE.md): still 100%. The tests are too easy.
Fix: replace task 1 with “refactor a legacy function with 6 levels of nesting while preserving all side effects.” Re-run: base Claude 60%, configured Claude 90%, tool A 85%. Now there is discriminating power.
Conversely, if after increasing difficulty all tools drop to 30%, the tasks are too hard and no tool succeeds; you need to return to a middle difficulty level.
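The check itself reduces to a few comparisons on pass rates. The sketch below assumes pass rates are stored as fractions between 0 and 1 per configuration; the function name, the labels it returns, and the reading of "all tools at or below 30%" as the too-hard cutoff are illustrative choices based on the example above, not fixed rules.

```python
def classify_benchmark(pass_rates, baseline_key, too_hard_ceiling=0.3):
    """Classify a benchmark from per-configuration pass rates (fractions 0..1).

    pass_rates: dict mapping configuration name to pass rate.
    baseline_key: name of the deliberately weak configuration.
    """
    if pass_rates[baseline_key] >= 1.0:
        # Even the weak baseline passes everything: no discriminating power.
        return "saturated"
    if max(pass_rates.values()) <= too_hard_ceiling:
        # No configuration gets meaningfully off the floor.
        return "too hard"
    return "discriminating"
```

Applied to the numbers in the example, `{"base": 0.6, "configured": 0.9, "tool_a": 0.85}` is classified as `"discriminating"`. Note that a single pass over five pass/fail tasks can only yield rates in steps of 20%, so finer-grained figures such as 85% or 90% require averaging over repeated runs or using more tasks.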
6. Turning verification into a habit
Reproducible verification requires a standard operating procedure (SOP). Three rules to give yourself:
6.1 Programmatically verifiable claims get execution verification
If a claim can be written as an executable check (a numeric value, a regex, a transformation, an API call), execute it; do not rely on memory or impression.
- “This refactor did not change behavior”: run the existing test suite.
- “This code contains no `os.system` calls”: run `rg "os\.system" path/`.
- “This API returns status 200”: use `curl -I` or an SDK call.
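The same kind of check can be expressed in Python when it needs to live inside a test suite instead of a shell command. This is a rough equivalent of the `rg` example above, assuming only `.py` files matter; the function name is hypothetical. Like the `rg` command, it is a plain text match, so occurrences inside comments or strings are reported too.

```python
import re
from pathlib import Path

OS_SYSTEM = re.compile(r"os\.system")


def find_os_system_calls(root):
    """Return (file path, line number) for every line mentioning os.system under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if OS_SYSTEM.search(line):
                hits.append((str(path), lineno))
    return hits
```

A test can then assert `find_os_system_calls("path/") == []`, which turns the claim into something that fails loudly when it stops being true.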
A concrete form of automation
Place `scripts/verify.sh` at the repo root and chain together the common verification greps, tests, and lint passes. Run it on every substantive change; enforce it in CI. Add it to `CLAUDE.md` or `AGENTS.md` so the agent knows this SOP exists.
6.2 Factual claims get search verification
If a claim is factual in nature (a version, API behavior, whether a reference exists, package compatibility), search an official source.
- Latest package version: `pypi.org/project/<pkg>/`, npm registry.
- API behavior changes: official changelog, release notes, migration guide.
- Academic claims: the original paper, IEEEXplore, arXiv.
- Cannot find it: mark as “unverified”; do not guess.
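For the package-version case, the lookup can be scripted against PyPI's JSON endpoint, which serves project metadata at `https://pypi.org/pypi/<pkg>/json` with the latest release under `info.version`. The sketch below is one possible shape, with hypothetical function names; parsing is separated from fetching so the parsing step can be checked without network access.

```python
import json
from urllib.request import urlopen


def pypi_json_url(package):
    """URL of PyPI's JSON metadata for a project."""
    return f"https://pypi.org/pypi/{package}/json"


def latest_version_from_payload(payload):
    """Extract the latest released version from a decoded PyPI JSON payload."""
    return payload["info"]["version"]


def fetch_latest_version(package, timeout=10.0):
    """Query PyPI over the network and return the latest version string."""
    with urlopen(pypi_json_url(package), timeout=timeout) as response:
        return latest_version_from_payload(json.load(response))
```

If the request fails or the project does not exist, `urlopen` raises an exception; under the rule above, that outcome means the claim stays "unverified" instead of being filled in from memory.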
6.3 Mark “unverified” only when neither method works
When a claim can neither be verified programmatically nor found through search (for example, “this tool’s long-term maintenance commitment” or “the future roadmap of an emerging framework”), explicitly mark it “unverified, inference only.” Inferences must include the reasoning behind them, not just a conclusion.
Common pitfalls
- Judging by feel: “it seems better after switching.” Feelings are contaminated by energy levels, mood, and the novelty of a first encounter. Quantitative metrics are the only reliable basis.
- Believing that a test suite that always passes 100% means quality is high. A saturated benchmark has no discriminating power; running it is equivalent to not running it.
- Comparing quality without comparing cost. When two tools have equal quality, token cost, time, and subscription cost are the deciding factors. Public benchmarks do not report cost, so relying on them alone leads straight into this pitfall.
- Too few benchmark tasks. Conclusions from a single task are not credible; three to five is the minimum credible threshold.
- Never updating benchmark tasks. Tasks must evolve with your work; otherwise you are testing last year’s version of yourself.
- Treating a worktree comparison as an A/B test. A/B testing has statistical significance requirements and sample size requirements. A worktree comparison is a quick qualitative comparison, not a rigorous quantitative experiment — do not overstate conclusions.
- Not recording verification results. If results are not recorded, the next tool change requires starting over. Keep the commands, outputs, and conclusions from every verification run.
Self-check
- Can you produce a fixed set of tasks you would re-run whenever switching tools? When were those tasks last updated?
- Have you deliberately verified whether your benchmark tasks “saturate at 100%”?
- For a tool you used last month, what metrics led you to keep it or switch away? Can you write them down?
- In your current workflow, which flows suit checkpoint-based verification and which suit continuous? Can you tell them apart clearly?
- Does the “verification principles” section in your `CLAUDE.md` or `AGENTS.md` contain concretely executable commands, or only abstract descriptions?
- Think back to the last time you chose to trust a claim without verifying it. In hindsight, was that claim correct?
Sources and further reading
Factual claims are grounded in official documentation; fast-changing items are annotated as of 2026-05.
- [1] git SCM, “git-worktree Documentation.” https://git-scm.com/docs/git-worktree (as of 2026-06)
- Related: 03-1 Evaluating whether a tool or skill fits you on the three dimensions of tool fit; 03-2 An evaluation framework beyond the GitHub star trap on evaluation signal weighting; 03-3 Security, privacy, and supply chain risk for a complete discussion of security boundaries; 04-6 Hooks for automated verification mechanics (hook triggers, PostToolUse).