The Map, Not the Territory: How Formal Architecture Declarations Change AI Coding Agent Behavior
I can't read code. Here's how I built 20 microservices by giving AI agents a map instead of letting them wander. Controlled experiments show that architecture context cuts navigation steps by 33-44%, but the format doesn't matter.
The Problem I Was Trying to Solve
I can't read code.
I don't mean this metaphorically. I'm a video editor who taught himself to build software by describing what he wanted to AI coding agents. Claude Code writes my Rust. I review the diffs, approve the PRs, and ship. It works — until the project grows past a certain size.
Around 15,000 lines, something breaks. Not the code — the agent's ability to navigate it. Claude Code starts grepping for functions it already found three turns ago. It reads the same file twice. It guesses at module boundaries instead of knowing them. I watch it burn 10 tool calls to locate a function I could point to in 5 seconds — if I could read Rust.
I manage 20 projects. Some are 22K lines. One is 470K. If the agent spends half its time lost, I spend half my time waiting.
I needed a map.
What intent.lisp Is
I started writing architecture descriptions in S-expressions — Lisp's parenthesized notation. Not because I think Lisp is superior (I'll get to that), but because brackets enforce structure. You can't be vague in S-expressions the way you can in Markdown.
```lisp
(intent jarvis
  (design-constraints
    (three-pillars (memory control tools))
    (communication (must-use EventBus))
    (db-access (only memory/storage)))
  (pillar memory
    (purpose "Data capture, storage, analysis")
    (component storage
      (role "SOLE DB gateway")
      (invariants "No raw SQL outside this module")
      (symbols
        (function save_message
          (sig "async fn save_message(&self, ...)"))))))
```
A component is inside its pillar. A symbol is inside its component. This isn't a formatting choice — it's a syntactic requirement. Two people (or two LLMs) describing the same project produce structurally convergent files. Try that with Markdown.
I write these for all my projects. A 111,000-line Rust daemon compresses to 1,587 lines of intent.lisp. A 470K-line monorepo compresses to 7,877 lines across 32 files. The weighted-average compression ratio across 646K lines of production code is 34:1.
I don't write them by hand. I describe what I want in natural language. Claude Code translates it into S-expressions. Another Claude Code session reads it later.
The Experiment That Surprised Me
I wanted to prove that S-expressions are better than Markdown for AI agents. So I ran controlled experiments: 24 code localization tasks, four conditions (blind, S-expression, JSON, Markdown), Claude Sonnet 4.6, temperature=0.
The result I expected: S-expression context dramatically reduces navigation steps. Confirmed — 33-44% fewer steps with any architecture context (Wilcoxon p = 0.009, Cohen's d = 0.92).
The result I didn't expect: the format doesn't matter.
In the comprehension experiment (20 questions × 4 formats), the LLM achieves an identical 95% accuracy across all four formats: S-expression, JSON, YAML, and Markdown. No significant difference. The agent doesn't care whether you write your architecture in brackets or bullet points or curly braces. It reads them all the same.
I also tested the writer side: when an LLM generates architecture descriptions, does format matter? All formats produce zero filler at temperature=0. Sonnet 4.6 is equally concise whether you ask for Lisp or Markdown.
This demolished my thesis. If the format is irrelevant, why use S-expressions at all?
What Actually Matters (and What Doesn't)
Three things matter. None of them are what I originally thought.
1. Formalization matters. Format doesn't.
Having an architecture file reduces navigation steps by 33-44%. Not having one means the agent wanders. The content is the signal; the brackets are noise.
But here's the deeper finding: in my field study (7,012 Claude Code sessions), sessions where the agent read intent.lisp performed identically to sessions where it didn't (1.65 vs 1.65 explore/edit ratio). The improvement was systemic, not per-session.
What changed wasn't the file being consumed. What changed was that I wrote it. The act of formalizing my architecture — declaring which module owns the database, which events flow where, what constraints exist — forced me to think clearly about my own system. That clarity propagated into better prompts, cleaner code boundaries, more explicit CLAUDE.md references. I call this the developer self-clarification effect.
But I needed to verify: is the file useless as an artifact, or does it also help directly?
2. The artifact has direct value — even auto-generated.
I ran a second experiment on a project where I never refactored code to match any descriptor. A tool scanned the codebase and auto-generated a 170-line architecture summary — zero human editing, zero code restructuring.
Result: 100% accuracy with the auto-generated descriptor versus 80% blind (p = 0.002, d = 1.04).
The file itself helps. You don't need to write it to benefit. A machine can generate it and another machine can consume it. The developer self-clarification is a bonus, not the whole story.
Surprising twist: the auto-generated 170-line version outperformed my hand-curated 698-line version on accuracy (100% vs 87%). Less is more — at least for navigation. Longer descriptors consume token budget that the agent needs for tool calls.
3. Failure modes are where formats actually diverge.
If the LLM reads all formats equally, why choose one over another? Because LLMs also write descriptors, and they make mistakes. When they do, the failure mode matters:
- JSON fails atomically. One missing brace = the entire file is unparseable. Zero content recovery.
- YAML fails silently. An indentation error re-parents a component to a different module without any warning (see the sketch after this list). 50% of injected errors silently change meaning.
- Markdown detects nothing. There is no parser. 100% of errors are invisible.
- S-expression detects all structural errors (missing brackets). The error localization is imprecise, since the parser reports the failure at EOF rather than at the error site, but the content before the error remains intact.
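The YAML failure is easy to reproduce. Below is a minimal sketch, assuming the serde_yaml crate and a hypothetical two-pillar descriptor (not one of my real files): a single indentation slip moves storage from a sibling of memory to a child of it, and both documents parse without complaint.

```rust
// Minimal sketch of YAML's silent re-parenting, assuming the serde_yaml
// crate and a hypothetical descriptor. Illustrative only.
use serde_yaml::Value;

fn main() {
    // Intended: `storage` is a sibling pillar of `memory`.
    let intended = "
pillars:
  memory:
    role: data
  storage:
    role: db
";
    // One indentation slip: `storage` is now a child of `memory`.
    let corrupted = "
pillars:
  memory:
    role: data
    storage:
      role: db
";
    let a: Value = serde_yaml::from_str(intended).unwrap();
    let b: Value = serde_yaml::from_str(corrupted).unwrap();
    // Both parse cleanly, yet the architecture they describe differs.
    assert_ne!(a, b);
    assert!(b["pillars"]["memory"]["storage"].is_mapping());
}
```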
No single format wins across all dimensions. JSON has lower silent corruption (21%) but catastrophic atomic failure. S-expression avoids atomic failure but has higher silent corruption (50%) on content deletions. Both are strictly superior to YAML and Markdown for any pipeline that needs automated validation.
This is why I use S-expressions: not because the LLM reads them better, but because they fail better.
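Here is what "failing better" looks like at its simplest: a bracket-balance scan. This is an illustrative sketch, not the parser used in the experiments, but it shows both halves of the finding: the truncation is always caught, and the report lands at EOF rather than at the line where the bracket went missing.

```rust
// Minimal sketch of the structural check S-expressions make possible.
// Ignores escaped quotes inside strings; fine for a sketch.
fn check_balance(src: &str) -> Result<(), String> {
    let mut depth: i64 = 0;
    let mut in_string = false;
    for (i, c) in src.char_indices() {
        match c {
            '"' => in_string = !in_string, // strings may contain parens
            '(' if !in_string => depth += 1,
            ')' if !in_string => {
                depth -= 1;
                if depth < 0 {
                    return Err(format!("unexpected ')' at byte {i}"));
                }
            }
            _ => {}
        }
    }
    if depth > 0 {
        // Like the parsers in the error-injection experiment, the failure
        // is reported at EOF, not where the bracket was lost.
        return Err(format!("{depth} unclosed '(' at end of input"));
    }
    Ok(())
}

fn main() {
    // A truncated descriptor: one closing ')' is missing.
    let truncated = "(intent jarvis (pillar memory (component storage))";
    assert!(check_balance(truncated).is_err()); // detected, unlike Markdown
    assert!(check_balance("(intent jarvis (pillar memory))").is_ok());
}
```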
The Numbers
All experiments used Claude Sonnet 4.6 at temperature=0 via OpenRouter. Total: 317 API calls, ~$27.
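Each call is one chat completion against OpenRouter's OpenAI-compatible endpoint. Here is a minimal sketch of a single trial, assuming the reqwest (blocking + json features) and serde_json crates; the prompt wiring is illustrative, and the model slug mirrors this write-up, so check it against OpenRouter's current model list.

```rust
// Minimal sketch of one experiment trial. Not the actual harness.
use serde_json::json;

fn run_trial(
    api_key: &str,
    task: &str,
    architecture: Option<&str>, // None = the "blind" condition
) -> Result<String, Box<dyn std::error::Error>> {
    // Prepend the architecture descriptor when the condition is not blind.
    let system = match architecture {
        Some(map) => format!("Project architecture:\n{map}"),
        None => String::from("You have no architecture context."),
    };
    let body = json!({
        "model": "anthropic/claude-sonnet-4.6", // slug per this write-up; verify on OpenRouter
        "temperature": 0,
        "messages": [
            { "role": "system", "content": system },
            { "role": "user", "content": task },
        ],
    });
    let resp: serde_json::Value = reqwest::blocking::Client::new()
        .post("https://openrouter.ai/api/v1/chat/completions")
        .bearer_auth(api_key)
        .json(&body)
        .send()?
        .error_for_status()?
        .json()?;
    Ok(resp["choices"][0]["message"]["content"]
        .as_str()
        .unwrap_or_default()
        .to_owned())
}
```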
| Experiment | Finding | Key Number |
|---|---|---|
| A: Code localization (24 tasks × 4 formats) | Context helps, format irrelevant | d = 0.92, p = 0.009 |
| B: Comprehension (20 questions × 4 formats) | All formats identical | 95% across the board |
| C: Generation (96 runs) | All parseable formats >91% valid | S-expr most compact parseable |
| D: Error injection (96 injections) | Each format fails differently | YAML 50% silent, JSON atomic |
| E: Artifact vs process (15 tasks × 3 conditions) | Auto-gen descriptor works | 100% vs 80% blind, d = 1.04 |
| Field study (7,012 sessions) | Variance reduction | IQR -52% |
Honest Limitations
The effect size is model-sensitive. An earlier run with Sonnet 4 yielded d = 1.70; Sonnet 4.6 yielded d = 0.92. The stronger model navigates better blind (5.2 vs 9.9 steps), compressing the marginal benefit of context. On a 22K-line project, a strong model may not need a map. On a 470K-line monorepo, it absolutely does — but I haven't run that experiment yet.
The sample is small: 24 tasks, single model, single developer's projects (all Rust). I can't claim generalizability. What I can claim is direction: having formal architecture context is better than not having it. This held across two model versions and five experiments.
Format comparisons had only 3-6 effective paired differences — not enough to prove formats are truly equivalent, only that I couldn't detect a difference. The comprehension result, where every format landed on the same 95%, is more convincing, but it's still only 20 questions.
Why This Matters
The entire context engineering community is debating what format to put in CLAUDE.md, AGENTS.md, Cursor Rules. YAML frontmatter? Markdown headings? JSON Schema?
The answer, at least for architecture context, appears to be: it doesn't matter. What matters is that you formalize your architecture at all. Write it in whatever format your toolchain supports. If you need automated validation downstream, pick a parseable format. If you don't, Markdown is fine.
The tool I built — forge survey — generates an intent.lisp from any codebase in about 60 seconds. It scans the AST with tree-sitter, assembles a prompt, and asks Claude to produce an S-expression architecture description. The developer reviews and refines. The file goes into version control alongside code.
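As a sketch of that scanning step, assuming the tree-sitter and tree-sitter-rust crates (the grammar-loading call varies across crate versions), here is the kind of symbol harvest the prompt gets built from. It is an illustration, not the actual forge code:

```rust
use tree_sitter::{Node, Parser};

// Collect top-level function names from Rust source; a survey tool
// would feed these, plus module structure, into the LLM prompt.
fn collect_functions(source: &str) -> Vec<String> {
    let mut parser = Parser::new();
    // Grammar loading for tree-sitter-rust 0.23+; older versions
    // expose tree_sitter_rust::language() instead.
    parser
        .set_language(&tree_sitter_rust::LANGUAGE.into())
        .expect("load Rust grammar");
    let tree = parser.parse(source, None).expect("parse source");
    let mut names = Vec::new();
    walk(tree.root_node(), source.as_bytes(), &mut names);
    names
}

fn walk(node: Node, src: &[u8], out: &mut Vec<String>) {
    // `function_item` is the tree-sitter-rust node kind for `fn` items.
    if node.kind() == "function_item" {
        if let Some(name) = node.child_by_field_name("name") {
            out.push(name.utf8_text(src).unwrap_or("?").to_string());
        }
    }
    let mut cursor = node.walk();
    for child in node.children(&mut cursor) {
        walk(child, src, out);
    }
}

fn main() {
    let names = collect_functions("async fn save_message() {} fn other() {}");
    assert_eq!(names, vec!["save_message", "other"]);
}
```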
For me, this is the difference between managing 20 projects and drowning in 20 projects. The agent gets a map. I get predictability.
Links
- Paper: arXiv:2604.13108
- Code and data: DOI 10.5281/zenodo.19500105
- Forge toolkit: github.com/RuoqiJin/forge
- Neural Codegen (prior work): DOI 10.5281/zenodo.19372158