Editor's Note
bug-fix
|
Install
npx skills add https://github.com/nexus-substrate/nexus-agents --skill bug-fixSKILL.md
Bug Fix Skill
<!-- CANONICAL SOURCES: - docs/development/CONTRIBUTION_GUIDE.md - skills/test-driven-development (Prove-It Pattern for the failing-test step) - skills/references/testing-patterns.md (pyramid, AAA, naming) -->Full workflow: CONTRIBUTION_GUIDE.md
Stop-the-Line Protocol
When a test, build, or lint fails mid-flow — before making any more changes:
- STOP — do not edit, do not retry, do not "just try one more thing"
- PRESERVE — capture the raw error, stack trace, last passing commit, and repro command. Paste into the PR description if relevant
- DIAGNOSE — run the Triage Sequence below; identify the causal layer, not the first surface that surfaced the error
- FIX — change the root cause only. No speculative refactors, no drive-by cleanups
- GUARD — add a regression test that would have caught the original failure
- RESUME — re-run the full check suite; only then continue the original task
Rationale: failures compound. A bug unresolved at step N makes every change N+1…N+k incorrect. We observed this in #1871 (consensus_vote silent hang) — the symptoms looked like adapter timeouts, the cause was missing overall-deadline race.
Triage Sequence
Run this order before "Implement fix." Skipping steps produces symptom-level patches that regress.
- Reproduce reliably. If non-reproducible: log timestamps, add artificial delays to widen race windows, diff CI vs local env. Document the conditions.
- Localize to a layer: UI / service / data / build / external dep. Resist "it must be X" — verify with a log, not a hunch.
- Reduce to the minimal failing case. Smaller repros make the cause obvious; large repros hide it.
- Fix the root cause. If your fix is upstream of where the error surfaced, you're probably right. If it's downstream (dedupe in the view, catch in the caller), you're probably patching the symptom.
- Guard with a regression test that fails without the fix.
- Verify end-to-end:
pnpm lint && pnpm typecheck && pnpm testplus any integration/smoke test for the affected area.
Anti-Rationalization — Debugging
Debugging tempts four recurring shortcuts. Each is a trap.
| Excuse | Counter |
|---|---|
| "I know what the bug is" | Unverified guesses resolve only ~70% of the time. Reproduce first — the cost of confirming is minutes; the cost of a wrong fix is hours plus a regression. |
| "The failing test must be wrong" | Sometimes true, but verify against the spec before changing the test. A test asserting correct-but-inconvenient behavior gets "fixed" to assert the bug. |
| "It works on my machine" | Environment is a variable. Diff Node version, env vars, lockfile drift, CI container, clock/timezone, file system case sensitivity. |
| "This is flaky, ignore it" | Flaky = the test is exposing a real race/ordering/timing bug. Fix the intermittency or understand why it happens. .skip with a TODO is a ticking debt. |
Workflow
-
Create/verify GitHub issue with reproduction steps
gh issue list --state open gh issue create --title "bug: [description]" --label "bug" -
Write failing test that demonstrates the bug
pnpm test -- --watch -
Implement fix — minimal changes, don't refactor surrounding code
-
Verify test passes
pnpm lint && pnpm typecheck && pnpm test -
Check for similar bugs elsewhere
# Search for similar patterns in codebase -
Create PR
git commit -m "fix(scope): description Closes #<issue> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>" gh pr create --title "fix(scope): description" --base main
Quality Checklist
- Reproduced reliably (or non-reproducibility documented)
- Localized to the causal layer (not the surfacing layer)
- Failing test written before fix
- Fix is at the root cause, not the symptom
- Fix is minimal (no unrelated refactors)
- Similar patterns checked elsewhere
- Regression test would have caught the original bug
-
pnpm lint && pnpm typecheck && pnpm testpasses - Issue referenced in commit
Red flags (in addition to the Stop-the-Line protocol)
- Bug fix without a regression test (the test stays as the future guard)
- Symptom-level patch (caller-side dedupe, view-side filter) when the root cause is upstream
- Fix that "just retries" without addressing why the failure happened
- Multiple unrelated changes in the same PR (drive-by refactors)
- Issue closed before the regression test fails on
mainwithout the fix