A complete guide to building large-scale web projects with code agents (Claude Code and similar). Read top to bottom; each section builds on the last.
Introduction
What this is
A practical handbook for engineering teams using code agents seriously: not as autocomplete, but as the primary interface for writing and reviewing code. It assumes you have an agent and a non-trivial codebase. It covers what to build around the agent so the agent produces consistently good work.
Who it’s for
- Tech leads setting up agent workflows for a team
- Senior engineers on agent-driven projects who want fewer surprises
- Anyone running a long-running refactor or migration with an agent
The thesis
Most agent-driven productivity comes from infrastructure, not prompting. Better prompts produce slightly better output. Better infrastructure (clear specs, layered conventions, hooks, ADRs, validation scripts) produces dramatically better output, consistently, across every session.
A team that invests one week in infrastructure typically produces 2-3x more shippable code per agent-hour over the next quarter than a team with the same agent and no infrastructure. The infrastructure compounds. The prompting doesn’t.
Part 1: The Mental Model
The development loop
Every agent-driven feature follows this loop:
idea → spec → plan → implement → verify → review → ship
Each arrow is a place errors compound. The mistake most teams make is collapsing spec → plan → implementation into a single prompt: the agent designs while it codes, and you can’t tell where it went wrong. Keep these as separate artifacts in separate turns.
For trivial work (rename, bug fix, docs), skip to implementation. For anything touching > 3 files or > 1 module, run the full loop.
The three meta-rules
These compound across every section that follows:
1. Optimize for legibility, not cleverness. Every choice (naming, structure, abstraction level) should privilege “easy for the next reader to understand” over “elegant by some other metric.” Agents are readers too, and clear code makes them better.
2. Make the right thing easy and the wrong thing hard. If a convention requires constant vigilance, it will fail. Encode it in linters, hooks, types, or directory structure. The agent should fall into the right pattern by default.
3. Treat your agent setup as a first-class part of the codebase. CLAUDE.md, specs/, .claude/, docs/adr/ are not “documentation”; they are infrastructure. They get reviewed, refactored, and improved like any other code. Bad infrastructure produces bad output, even with great agents.
The fix-upstream principle
If something feels broken, the fix is almost always upstream of where the symptom appears:
- Bad code? Look at the prompt.
- Bad prompt? Look at the plan.
- Bad plan? Look at the spec.
- Bad spec? Look at the conventions and CLAUDE.md.
- Wrong conventions? Look at the project’s actual goals.
Part 2: Setting Up the System
CLAUDE.md hierarchy
Claude Code auto-loads CLAUDE.md from the working directory and every parent up to the repo root. Use this to put context where it’s needed, not all at the root.
Root CLAUDE.md: the always-loaded file
Keep it under ~150 lines. It should answer “what is this and what are the global rules.”
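A minimal root file might look like this (project name and commands are illustrative; the rules echo conventions from later sections):

```markdown
# Acme Board (illustrative)
Kanban app: Next.js App Router, TypeScript, Postgres.

## Global rules
- One spec -> one branch -> one PR. Specs live in specs/.
- Never commit directly to main.
- No new @ts-ignore, as any, or .skip.
- All DB access goes through lib/db (see lib/db/CLAUDE.md).

## Commands
- npm run dev / npm test / npm run typecheck

## Where things are decided
- Architecture decisions: docs/adr/
- Conventions: specs/conventions/
```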
What does not belong in root CLAUDE.md:
- File listings or directory trees (the agent can list)
- Code style minutiae (the linter enforces this)
- Git history (use git log)
- Module-specific rules (push them down to subdirectory CLAUDE.md)
Subdirectory CLAUDE.md: loaded only when working there
A line like “all queries in this directory must go through withTransaction()” in lib/db/CLAUDE.md prevents an entire class of agent mistakes, without bloating the root file that every other agent invocation reads.
The specs/ system
A spec is your one chance to disambiguate before the agent makes irreversible-feeling choices. Without a spec, the agent will confidently produce 800 lines of code that solve the wrong problem.
Directory structure
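A layout consistent with the reasoning that follows (slugs and dates are examples):

```text
specs/
  README.md
  _template/                      # canonical spec shape
  draft/                          # unapproved ideas
  active/
    2026-05-08-duplicate-card/
      spec.md
      plan.md
      notes.md
  done/                           # shipped specs
  archive/                        # rejected specs, reasoning preserved
  conventions/                    # testing.md, performance.md, ...
```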
Why this shape:
- Lifecycle folders are a cheap state machine. The agent can tell from the path whether a spec is authoritative.
- Folder-per-active-spec keeps plans, notes, and assets scoped.
- Date-prefixed slugs sort chronologically and read well in git log.
- _template/ ensures specs don’t drift in shape.
- conventions/ is referenced by specs, not copied into them.
- archive/ keeps rejections: when the same idea comes up again, the reasoning is one grep away.
The spec template
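A sketch of _template/spec.md; the section names are taken from the Definition of Ready in Appendix C, the frontmatter fields from its checklist:

```markdown
---
status: draft          # draft | active | done | archived
owner:
target_date:
related: []            # ADRs, other specs
---

# <Feature name>

## Problem
<!-- stated in business terms -->

## Goals
<!-- measurable -->

## Non-Goals
<!-- explicitly out of scope -->

## Approach
<!-- the modules/components that will change -->

## Acceptance Criteria
<!-- testable; see the spec quality bar -->

## Open Questions

## Rollout
<!-- flags, kill switch -->
```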
Spec quality bar
A good spec lets you predict what code the agent will produce before it produces it. That’s the bar.
Rules of thumb:
- Short. A two-page spec gets read; a ten-page spec gets skimmed.
- Concrete acceptance criteria. “User can archive a card” is bad. “Clicking archive sets archived_at, removes the card from the list within 200ms, and shows an undo toast for 5s” is good.
- Honest non-goals. If you’re tempted to write “we may also redesign X,” either commit or rule it out. Maybe-scope is poison.
- Linked, not embedded, conventions. Reference specs/conventions/testing.md; don’t copy it.
Plans vs specs
The spec answers “what and why.” The plan answers “how, in what order.” A plan is generated after the spec is approved.
Plan template:
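One plausible plan shape (the field names are assumptions; the point is ordered steps, each with files and a verification):

```markdown
# Plan: <feature>
Spec: ./spec.md

## Steps
1. <change> — files: <paths> — verify: <command or manual check>
2. ...

## Out of scope for this plan
<!-- anything deferred to a later PR -->
```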
The plan is what you hand the agent for implementation. It’s specific enough that you can sanity-check the diff against it.
Architecture Decision Records (ADRs)
Specs say what to build. ADRs say why we build that way.
ADR template:
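The classic lightweight ADR shape works here; this sketch (date and wording illustrative) includes the status line that superseding relies on:

```markdown
# ADR-0002: Zustand for client state

- Status: accepted      <!-- proposed | accepted | superseded by ADR-NNNN -->
- Date: 2026-01-15      <!-- illustrative -->

## Context
<!-- the forces at play; why a decision is needed now -->

## Decision
<!-- the choice, stated actively -->

## Consequences
<!-- what gets easier, what gets harder, what is locked out -->
```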
When to write an ADR:
- Any choice that took > 1 hour of discussion
- Any choice that locks out alternatives
- Any choice that contradicts an industry default
- Any choice future agents will second-guess
When the agent proposes “let’s use Recoil for this state,” you don’t argue from memory; you point to ADR-0002 (“Zustand for client state”) and say “doesn’t fit our convention; if you think we should change, write a superseding ADR first.”
.claude/ configuration
Permissions (.claude/settings.json)
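One plausible shape (the permission patterns are examples; check the settings schema for your Claude Code version):

```json
{
  "permissions": {
    "allow": [
      "Read",
      "Bash(git status)",
      "Bash(git diff:*)",
      "Bash(git log:*)",
      "Bash(npm test:*)",
      "Bash(npm run typecheck)"
    ],
    "deny": [
      "Bash(git push --force:*)"
    ]
  }
}
```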
Read-only and idempotent commands shouldn’t prompt. The fewer-permission-prompts skill scans your transcripts and suggests these; run it monthly.
Hooks
Hooks run on harness events and are the only way to enforce behavior automatically: not memory, not preferences.
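A sketch of what hook entries can look like in .claude/settings.json (the commands are examples; schema details vary by version):

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx prettier --write \"$CLAUDE_FILE_PATHS\"" }
        ]
      }
    ],
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          { "type": "command", "command": "bash scripts/check-not-main.sh" }
        ]
      }
    ]
  }
}
```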
Use the update-config skill to wire these; it knows the schema details.
Slash commands (.claude/commands/)
Slash commands are workflow accelerators. Keep them in .claude/commands/.
spec.md:
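A sketch of what .claude/commands/spec.md might contain ($ARGUMENTS is replaced with the text typed after the command):

```markdown
Draft a spec for: $ARGUMENTS

1. Read specs/_template/spec.md and 2-3 recent specs in specs/done/ for shape.
2. Ask me clarifying questions about scope, non-goals, and acceptance criteria.
3. Write the draft to specs/draft/<slug>.md. Do not write any implementation code.
```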
promote-spec.md: moves an approved draft into specs/active/ under a date-prefixed folder.
ship-spec.md: walks the Definition of Done checklist and moves the spec to specs/done/.
Part 3: The Daily Workflow
A worked example, end to end. Imagine you’re adding a “duplicate card” feature.
Step 1: From idea to spec
15-minute draft in specs/draft/duplicate-card.md:
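The draft might look like this (details are invented for illustration):

```markdown
## Problem
Users rebuild similar cards by hand; support tickets keep asking for a
"duplicate" action.

## Goals
- Duplicate a card, including description and labels, in one click

## Non-Goals
- Duplicating attachments or comment history
- Bulk duplication

## Acceptance Criteria
- "Duplicate" in the card menu creates a copy titled "Copy of <title>"
  directly below the original within 300ms
- The copy is never archived, regardless of the original's state
```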
Promote: move to specs/active/2026-05-08-duplicate-card/spec.md.
Step 2: From spec to plan
specs/active/2026-05-08-duplicate-card/plan.md:
Step 3: Implementation
The prompt should reference the artifacts, not re-explain them: “Implement specs/active/2026-05-08-duplicate-card/plan.md, steps 1 through 3. The spec in the same folder is authoritative. Report files changed, test results, and any deviations from the plan.”
Step 4: Verification
The agent’s end-of-turn summary is intent, not result. Verify.
Cheap verifications (do every time):
- Read the full diff, not just the summary
- Run the type-check, linter, and test suite
- Check git diff --stat: only the planned files should have changed
Worth-it verifications for non-trivial changes:
- UI changes: start the dev server, click through the feature. A passing type-check does not mean the feature works; there is no automated substitute for this.
- API changes: hit the endpoint with curl. Confirm the response shape.
- Data changes: run the migration on a copy of prod data, not just dev fixtures.
Step 5: Review and ship
Review checklist:
- Diff matches the plan; no out-of-scope files
- Every acceptance criterion has a test
- No new @ts-ignore, as any, or .skip
- The agent’s summary matches what the diff actually does
Use /review for fast self-review before pushing. Reserve /ultrareview for high-stakes PRs (auth, data, shared infra).
After merge:
- Update notes.md with anything unexpected during implementation
- Move the spec to done/
- Update memory if anything was non-obvious
- Update the refactor ledger if applicable
The notes.md artifact
The most undervalued artifact in this whole system. Where future-you (and future agents) learn from this implementation. Three months later when someone duplicates this pattern, they read this and skip the same mistake.
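An entry might read like this (contents invented for illustration):

```markdown
# Notes: duplicate-card

- Label copy required inserting join rows, not a simple clone.
- Gotcha: card ordering is sparse; "directly below" means averaging
  neighbor positions, not position + 1.
- Deliberately skipped: attachments (a non-goal), but the mutation is
  shaped to allow adding them later.
```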
Part 4: Working Effectively with Agents
Prompt engineering basics
The implementation prompt shape
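One plausible shape; the section names are the point, not the exact wording:

```text
Task: <one sentence, referencing the plan>
Context: <paths to spec, plan, and relevant CLAUDE.md files>
Constraints: <scope boundaries, conventions to follow>
Verify: <commands to run before finishing>
Report: files changed, test output, deviations from the plan, open questions
```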
The “Report” section matters more than people think. Without it, the agent gives you marketing copy (“successfully implemented the feature”). With it, you get information you can act on.
The investigation prompt shape
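A sketch of the same idea for investigation (section names assumed):

```text
Question: <the specific thing to find out>
Why I'm asking: <the decision this answer will inform>
Where to look: <starting paths, if known>
Constraints: read-only; do not modify any files
Report: answer, evidence (file and line), confidence, what you didn't check
```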
“Why I’m asking” is the single most valuable line. It turns a narrow lookup into an answer that addresses the actual question.
Anti-patterns in prompts
- “Implement the feature.” Too vague. Agent invents scope.
- “Make it production ready.” Means nothing measurable.
- “Be careful.” Agents can’t be more careful on demand. Be specific about what to verify.
- “Use best practices.” Whose? Specify the rule, not the vibe.
- “Make it match the existing style.” Better: “match the patterns in lib/cards/queries.ts.”
Context management
The 1M context window is a budget, not free. Once you hit ~60% utilization, agent quality drops noticeably: it forgets earlier instructions, repeats searches, gets confused about file state.
Detecting bloat
- Agent re-reads files it already read this session
- Agent asks for information you already gave it
- Responses get more generic and hedged
- You find yourself repeating constraints
Strategies
- Push context to files. A 500-line CLAUDE.md costs ~2K tokens once. Repeating those 500 lines across 20 prompts costs 40K tokens.
- Fork for research. Tool output stays in the fork, not your main thread.
- /compact strategically. Before starting a new task in the same session.
- /clear between tasks. If task A is done and task B is unrelated, just clear.
- Avoid pasting large files. Reference them by path.
- Avoid screenshot-heavy sessions for code work. Images are expensive.
Communication patterns
Calibrated uncertainty
Agents will confidently propose answers. You should respond with calibrated confidence too:
- “I think X but I’m not sure โ verify before acting.”
- “This is wrong because Y, but I might be wrong about Y.”
- “I want X, but if you see a reason it’s bad, push back.”
- “Don’t change anything yet; explain what you’d do.”
The “explain it back” pattern
Before letting the agent implement something non-trivial:
“Before you start, summarize what you understand the task to be. List the files you expect to change and any acceptance criteria you think apply. Don’t write any code.”
If the summary matches your intent, proceed. If it doesn’t, the spec/prompt was unclear; fix it before any code is written. You catch 80% of misunderstandings here, at zero cost.
Productive disagreement
When the agent pushes back, take it seriously. Sometimes it’s right. Useful prompts:
- “You said X. I think Y. Here’s why: [reason]. What am I missing, or do you want to change your view?”
- “Walk me through your reasoning step by step.”
- “What evidence would change your mind?”
When the agent says “you’re right, I’ll do Y,” be skeptical. Sometimes that’s the right answer; sometimes it’s the agent capitulating because you sound certain. Push: “Are you actually convinced, or just deferring?”
When to interrupt
Stop the agent immediately if:
- It’s about to run a destructive command you didn’t authorize
- It’s modifying files outside the planned scope
- It’s three iterations into a fix-test-fail loop with no progress
- Its summary diverges from what the diff actually shows
Don’t wait politely for it to finish a wrong direction.
Multi-agent coordination
Collision modes
- Same branch, different agents → merge conflicts
- Same database, different agents → fixture poisoning
- Same context, different agents → rate limits
- Same spec, different agents → diverging implementations
Coordination protocol
One spec, one branch, one agent at a time. Worktrees enforce this: give each active spec its own branch and checkout (for example, git worktree add ../app-duplicate-card -b feature/duplicate-card), so two agents can never edit the same working tree.
Database isolation per agent: give each worktree its own database (for example, its own DATABASE_URL in a local env file), so one agent’s fixtures can’t poison another’s runs.
Parallel forks for research, not implementation. Forks are great for “audit X, audit Y, audit Z” run in parallel. They are bad for “implement X, implement Y, implement Z” because the diffs need to interleave.
When parallel makes sense
- Independent read-only research: audit X, audit Y, audit Z in separate forks
- Reviews of unrelated PRs
- Never parallel implementation against one spec: those diffs must interleave
When in doubt: serial is safe; parallel is for I/O-bound research.
Part 5: Code Structure That Helps Agents
This is the most underrated lever. Agents perform dramatically better in well-structured codebases.
Co-located tests
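Co-location in practice (file names illustrative):

```text
lib/cards/
  queries.ts
  queries.test.ts      <- lives beside the code it covers
  mutations.ts
  mutations.test.ts
```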
Agents find tests reliably when they live next to code. A separate tests/ mirror requires the agent to track two trees, and it often misses tests when refactoring.
Explicit module boundaries
Each top-level directory should have an index.ts that re-exports its public surface. This signals to the agent (and humans) what’s API and what’s internal.
Shallow trees over deep ones
components/cards/CardDetail.tsx is easier for an agent to navigate than src/features/cards/components/detail/CardDetail/CardDetail.tsx. Each extra level is friction.
Predictable file shapes
If every API route looks like this:
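As a sketch of such a shape (names, types, and framework details are assumptions, not your actual handlers), every route reads as authorize → parse → act → respond:

```typescript
// Illustrative route shape: authorize -> parse -> act -> respond.
type Session = { userId: string };
type Result = { status: number; body: unknown };

function authorize(session: Session | null): Session {
  if (!session) throw Object.assign(new Error("unauthorized"), { status: 401 });
  return session;
}

function parseDuplicateInput(input: unknown): { cardId: string } {
  const cardId = (input as { cardId?: unknown })?.cardId;
  if (typeof cardId !== "string") throw Object.assign(new Error("bad input"), { status: 400 });
  return { cardId };
}

export function handleDuplicate(session: Session | null, input: unknown): Result {
  const user = authorize(session);                 // 1. auth first, always
  const { cardId } = parseDuplicateInput(input);   // 2. validate the payload shape
  const copy = { id: `${cardId}-copy`, ownerId: user.userId }; // 3. act (stubbed here)
  return { status: 200, body: copy };              // 4. respond with a uniform shape
}
```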
…the agent generates new ones correctly the first time. Inconsistent route shapes mean the agent has to guess.
Type-first interfaces
Define types before implementation. The agent can write code against types it can see; it can’t write code against types you mean.
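A sketch (field names are assumptions; Card and CardInput are the names this guide uses in Part 7):

```typescript
// Define the types first; the implementation comes in a later step.
export interface Card {
  id: string;
  title: string;
  archivedAt: string | null; // ISO timestamp once archived
}

// Derived input types can't drift from the source of truth.
export type CardInput = Omit<Card, "id" | "archivedAt">;

// With the types visible, the agent implements against them:
export function duplicateCard(card: Card): CardInput {
  return { title: `Copy of ${card.title}` };
}
```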
Avoid magic
Decorators, dynamic imports, runtime metaprogramming, monkey-patching: all of these cost agent reasoning quality. Boring, explicit code is agent-friendly code. (It’s also human-friendly. The two correlate.)
Index files for large modules
For modules with many files, add an INDEX.md:
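A sketch (file roles are illustrative):

```markdown
# lib/cards — INDEX

- types.ts      — Card, CardInput, CardEvent
- queries.ts    — read paths (start here)
- mutations.ts  — create / duplicate / archive
- index.ts      — public surface; import from here, not from internals
```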
A 30-second investment that pays off every session.
Part 6: Conventions Worth Standardizing
Performance and accessibility budgets
specs/conventions/performance.md:
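Budgets only work as numbers. A sketch (thresholds are examples to adapt; the LCP figure is the standard Core Web Vitals “good” bound):

```markdown
- LCP under 2.5s on a mid-tier phone, per route
- Route-level JS bundle under 200KB gzipped
- No interaction handler blocking the main thread for over 100ms
```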
specs/conventions/accessibility.md:
- All interactive elements reachable by keyboard
- Meaningful alt text on images; labels on form controls
- WCAG AA color contrast
A pre-commit hook running axe-core on changed routes catches most of these.
Security and authentication
- Every route handler re-checks the session server-side; never trust client state
- Authorization is checked against the resource owner, not just “logged in”
- No secrets or tokens in client bundles, URLs, or logs
Security checklist for auth-touching PRs
Run /security-review on auth-touching PRs.
Database and migrations
The highest-blast-radius area. Hard rules in lib/db/CLAUDE.md:
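One plausible set of hard rules, assuming the expand-contract migration pattern (which is what makes the five-PR rename below safe):

```markdown
- All queries go through withTransaction(); no raw connections in app code.
- Migrations never rename or drop in place. Rename a column in five PRs:
  1. add new column  2. dual-write  3. backfill  4. read from new  5. drop old.
- Every migration ships with a tested down migration.
- Destructive migrations run against a copy of prod data before review.
```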
Five PRs for “rename a column” is not overkill. It’s the difference between “deploy on a Wednesday” and “production outage at 3am.”
State management
- Client state goes through Zustand (see ADR-0002)
- URL is state too: selected tab, filters, and pagination belong in the URL
The “URL is state too” rule is the most undervalued. Agents will reach for useState to track which tab is selected, breaking deep links and the back button.
Forms
Observability
Part 7: The Refactor Playbook
A “refactor” project is mostly about not breaking things while changing them. Agents can be unusually good or unusually bad depending on setup.
Strangler fig, not big bang
Keep old code working while new code is added next to it. Agents handle “add new code” much better than “edit old code in place.” The Next.js Pages → App Router migration naturally fits this.
Make the new path opt-in via flag
Not just a feature flag for users โ a routing flag for you. New page handles ?v=2, old handles default. Compare side-by-side. Once new is ready, swap defaults. Agents thrive when they can build the new without breaking the old.
Tests as the contract
Before refactoring a module, write characterization tests against its current behavior. Then the agent can refactor freely: if the tests pass, behavior is preserved. Without tests, you’re trusting the agent to preserve invariants it can’t see.
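A characterization test is deliberately unopinionated: it pins down what the code does today, not what it should do. A minimal sketch (slugify is a stand-in for whatever legacy function you’re about to refactor):

```typescript
// The legacy function under test (stand-in for real legacy code).
function slugify(title: string): string {
  return title.toLowerCase().trim().replace(/\s+/g, "-");
}

// Assertions record CURRENT behavior; a refactor must keep them green.
if (slugify("  Duplicate Card ") !== "duplicate-card") throw new Error("behavior changed");
if (slugify("A  B") !== "a-b") throw new Error("behavior changed");
```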
Type-driven extraction
Extract types first, in their own PR. Once Card, CardInput, CardEvent are well-typed, refactoring code that uses them is much safer because the type system catches drift.
One pattern at a time
“Migrate all data fetching to Server Components” → fine.
“Migrate data fetching AND restructure components AND swap the styling system” → chaos.
Agents need a single transformation rule per pass.
Codemod for mechanical, agent for judgment
If the refactor is mechanical (rename, move, change import), use jscodeshift or ts-morph: deterministic and reviewable as a single diff. Reserve the agent for the parts that need judgment.
Boundary first for legacy code
Before touching legacy code, draw a boundary around it. app/legacy/** is its own region with its own CLAUDE.md:
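A sketch of app/legacy/CLAUDE.md (the rules are illustrative):

```markdown
# Legacy region — read before editing

- Bug fixes only. No refactors, renames, or "cleanups" unless the spec says so.
- Match the local style here, even where it contradicts root conventions.
- New functionality goes in the new modules, never in this directory.
- Anything deleted here gets recorded in docs/refactor-ledger.md.
```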
Refactor ledger
Keep docs/refactor-ledger.md as a running record:
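A sketch (rows are invented examples):

```markdown
| Date       | Area      | Old pattern     | New pattern       | Status      |
|------------|-----------|-----------------|-------------------|-------------|
| 2026-03-02 | lib/cards | client fetching | Server Components | done        |
| 2026-04-20 | lib/db    | ad-hoc queries  | withTransaction() | in progress |
```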
The agent reads this and knows what’s done, what patterns are established, what to follow.
Part 8: Anti-Patterns
The recurring failure modes:
- The mega-prompt. “Build the whole feature” with no spec. You get 1500 lines of plausible-looking code that solves a different problem than you wanted. Fix: spec → plan → implement.
- Trusting the summary. Agent says “all tests pass.” Tests pass because the agent skipped the failing one. Fix: verify the diff, not the summary.
- Letting context bloat. 200K tokens of tool output later, the agent has forgotten the original goal. Fix: fork research, use /clear aggressively, push context to CLAUDE.md.
- No non-goals. Every spec gets implemented plus “while I’m here” cleanups. Fix: explicit non-goals, enforced at review.
- Permission prompt fatigue. You start clicking “yes” without reading. Fix: allowlist read-only commands, reserve prompts for actions that matter.
- Memory as scratchpad. Saving “currently working on X”: that’s task state, and it goes stale instantly. Fix: memory is for durable facts.
- One giant CLAUDE.md. 800 lines at the root, loaded on every turn, ignored by humans. Fix: layer CLAUDE.md by directory.
- Specs as wishlist. Specs that say “we should also consider…” without committing. Fix: cut everything that isn’t acceptance criteria or non-goals.
- Skipping the human gate. Agent produces spec, plan, and code in one chain. No one ever approved the spec. Fix: human approval between artifacts.
- Agent-generated commit messages with no review. Often wrong about why. Fix: edit the message, or write it yourself.
Failure recovery
Sometimes the agent produces broken code or makes changes you didn’t want.
Triage first
Before reverting:
- Is the diff partially correct? Salvage the good parts.
- Is the bug obvious? Just fix it forward.
- Is the approach wrong? Revert and re-prompt.
Most teams over-revert. A 200-line diff with one bad function doesn’t need to start over.
When to revert
- Agent fundamentally misunderstood the spec
- Diff touches files outside scope and you can’t easily separate them
- Multiple changes interact in ways that are hard to debug
How to revert without losing learning
Bookmark the failed attempt on a branch (for example, git branch attempt-1), write down what went wrong in notes.md, then reset the working branch to a clean state.
When re-prompting, include what went wrong: “Previous attempt did X, which violated constraint Y. Don’t repeat that.”
When the agent gets stuck in a loop
Agent makes a change, runs tests, sees a failure, makes another change, runs tests, sees a different failure…
Stop the loop. Take over manually for 5 minutes. The agent is missing context that would unlock progress. Once you understand the missing context, write it down (in CLAUDE.md or notes.md) so future agent sessions don’t re-discover it.
Part 9: Maintaining the System
Daily rhythm
Morning (planning)
- Read git log from yesterday
- Skim specs/active/
- Pick the next thing
- If no spec exists, write the draft (15-30 min)
Midday (deep work)
- One spec → one branch → one focused agent session
- Verify diff before committing
- Commit, push, open PR
- While CI runs: start next spec or review someone else’s PR
Late afternoon (cleanup)
- Update notes.md
- Move shipped specs to done/
- Update memory if anything was non-obvious
Friday hygiene
- Read your week’s specs and PRs as a batch
- Look for patterns: what failed? what didn’t?
- Update CLAUDE.md or ADRs based on what you learned
- Archive anything stale
Quarterly pruning
- Memory: delete entries that haven’t been useful
- CLAUDE.md files: anything obsolete?
- ADRs: any superseded but not marked?
- Specs in archive/: still informative? If not, delete
- Slash commands: any unused?
- Permissions: still right? Run fewer-permission-prompts again
Annual audit
- Read your last 12 months of done/ specs as a batch. What patterns emerge?
- Read your last 12 months of postmortems. What recurring categories?
- Read your ADRs. Which need superseding?
- Read your CLAUDE.md files. Are they still the rules you actually follow?
Continuous evolution
The system is not finished. It’s a living thing. Every spec teaches the system. Every postmortem teaches the system. Every “I had to repeat that to the agent again” teaches the system.
Teams that make this work treat the agent infrastructure as a product they’re building alongside their actual product. The ones that don’t end up with a stale CLAUDE.md from project init and a confused agent six months later.
Appendix A: Templates
Spec template
See Part 2 / Spec Template above.
Plan template
See Part 2 / Plans vs specs above.
Bug spec template
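A sketch of a bug spec shape (sections are assumptions, parallel to the feature template):

```markdown
## Symptom
<!-- what the user sees, with repro steps -->

## Expected
<!-- correct behavior, stated testably -->

## Root cause
<!-- filled in after diagnosis, before the fix PR -->

## Fix approach
<!-- smallest change that addresses the root cause -->

## Regression test
<!-- the test that would have caught this -->
```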
ADR template
See Part 2 / Architecture Decision Records above.
Postmortem template
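A sketch (the “recurring categories” question in Part 9’s annual audit is easiest to answer if every postmortem tags a category):

```markdown
# Postmortem: <incident>

- Date / duration / impact:
- Category: <e.g. migration, auth, flag rollout>

## Timeline

## Root cause

## What we change
<!-- a spec, ADR, hook, or convention update; not "be more careful" -->
```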
Appendix B: Scripts
scripts/validate-spec.js
scripts/check-test-discipline.sh
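A sketch of what this script might check; the escape-hatch patterns come from the Definition of Done (no new @ts-ignore, as any, or .skip):

```shell
#!/usr/bin/env bash
# scripts/check-test-discipline.sh (sketch): fail if the staged diff
# introduces skipped tests or type escapes.
set -uo pipefail

# Inspect only lines the staged diff adds.
staged=$(git diff --cached --unified=0 2>/dev/null | grep '^+' || true)

if echo "$staged" | grep -Eq '\.skip\(|@ts-ignore|as any'; then
  echo "Staged diff adds .skip / @ts-ignore / as any. Remove before committing." >&2
  exit 1
fi
echo "test discipline ok"
```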
scripts/check-not-main.sh
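A sketch of what this script might do (intent assumed from its name): refuse to proceed while the checkout is on main, so agents always work on feature branches:

```shell
#!/usr/bin/env bash
# scripts/check-not-main.sh (sketch): block work on the main branch.
set -uo pipefail

branch=$(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo "detached")
if [ "$branch" = "main" ] || [ "$branch" = "master" ]; then
  echo "On $branch. Create a feature branch before letting the agent edit." >&2
  exit 1
fi
echo "on branch: $branch"
```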
Appendix C: Definition of Ready / Done
Definition of Ready
A spec is “ready” for implementation when ALL of the following are true:
- Frontmatter complete (status, owner, target date, related links)
- Problem stated in business terms (not “the code is bad”)
- Goals are measurable
- Non-Goals explicitly list what’s NOT in scope
- Approach names the modules/components that will change
- Acceptance Criteria are testable
- Open Questions are answered or explicitly deferred
- Rollout plan exists, including kill switch
- Plan generated and reviewed
- Spec linked from any related ADR or convention
Specs that don’t meet this bar stay in draft/.
Definition of Done
A spec is “done” when ALL of the following are true:
- All acceptance criteria pass in production
- Tests cover each acceptance criterion
- Type-check, lint, and tests pass on main
- No new @ts-ignore, as any, or .skip introduced
- Telemetry/logging in place to detect regressions
- Feature flag (if any) is at intended rollout state
- Spec moved to specs/done/
- notes.md captures any non-obvious decisions
- Memory updated if anything was non-obvious or surprising
- Refactor ledger updated (if applicable)
- Old code marked for deletion (with date) if this replaced something
Appendix D: Onboarding Checklist
Day 1 reading list
- README.md
- CLAUDE.md
- specs/README.md
- specs/conventions/* (skim)
- docs/adr/README.md + 2-3 load-bearing ADRs
- 3 specs in specs/done/ chosen by your buddy
Day 1 doing
- Pair on one trivial fix end-to-end with a buddy
- Open your first PR (anything, even fixing a typo)
First-week milestones
- Ship one bug fix
- Ship one small feature with a spec
- Review one PR using /review
- Save one memory entry about yourself
What new people get wrong
- Skipping specs for “small” things. Then “small” turns out to be a 6-file change. Default to writing a spec.
- Trusting the agent’s summary. Cure: pair on the first 5 agent PRs.
- Letting context bloat. Cure: teach /clear and /compact early.
- Memory hoarding. Cure: show what NOT to save.
Appendix E: Where to Start
If you want to improve your agent-driven workflow this week, do these, in order of leverage:
- Write a sharp CLAUDE.md at the root and one in each major module (1 day)
- Build the specs/ system with templates and one filled-in example (1 day)
- Add .claude/settings.json permissions + 3-5 hooks (half day)
- Write 3-5 ADRs for decisions you’ve already made implicitly (half day)
- Add validate-spec.js and check-test-discipline.sh scripts (1 hour)
- Write 3-5 reusable prompts in prompts/ (1 hour)
Six items, one week, finished. Everything else is gardening on top of this.
Closing thought
Agents are amplifiers. They amplify what you give them: clear specs become great code, vague specs become confident garbage. The infrastructure described in this tutorial is not overhead; it’s the leverage point.
Build the system, then let the system do the work.