Flywheel System
Self-Evolving Microservice Cascade: A Four-Stroke Engine for Intent-Driven Autonomous Code Repair
The Problem
In a microservice universe, changing a single upstream type — say, User.id from i32 to Uuid — shatters every downstream service that consumes it. Today, developers trace the blast radius by hand, open each project, fix compilation errors one by one, and pray they didn't miss a transitive dependency.
AI coding assistants can fix individual files, but they lack cross-service awareness. They don't know the dependency graph, can't compute which services are affected, and have no mechanism to coordinate repairs in topological order. The result: engineers still spend most of their time doing integration plumbing, not building features.
A survey of 57 sources across six research directions (self-adaptive systems, LLM-based program repair, multi-agent SE, declarative architecture, software evolution, and self-improving AI) reveals a critical gap: no existing system treats a single declarative blueprint as, simultaneously, a synthesis constraint, a compiler-validated invariant, and a transformation source.
Intellectual Genesis: Three Thought Experiments
The system originated from three thought experiments about using Rust's type system as a physical fitness function for AI-driven code evolution:
- The Compilation Sandbox: Can an LLM improve code without a teacher model? Let it generate variants at high temperature, use cargo check as the sole fitness judge. The codebase state becomes the model's “weights”; each successful merge is a weight update. The compiler is the only oracle.
- Homoiconic Bootstrapping: Since the system's own logic is expressed in S-expressions (code as data), the AI can evolve at the meta-level by modifying Lisp declarations rather than Rust code. When it discovers recurring patterns, it synthesizes new high-level macros: vocabulary bootstrapping that expands the system's cognitive capacity.
- The Topology Guardian: Inspired by self-distillation's principle of “suppressing distractor tails where precision matters.” Core infrastructure is locked to deterministic mode (Temperature = 0). Peripheral tools allow high-temperature exploration. Topology-aware mutation rates prevent catastrophic self-modification.
These crystallized into three design principles: compiler as oracle, declarative intent as evolution substrate, and topology-aware safety boundaries.
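The "compiler as oracle" principle can be sketched in a few lines of Rust. This is a minimal illustration, not the system's actual code: the names `cargo_check_passes` and `first_accepted` are hypothetical, and the oracle is passed in as a closure so the selection logic can be exercised without a real toolchain.

```rust
use std::path::Path;
use std::process::Command;

/// Runs `cargo check` in `crate_dir` and reports whether it succeeded.
/// Returns false if cargo cannot be launched at all.
pub fn cargo_check_passes(crate_dir: &Path) -> bool {
    Command::new("cargo")
        .arg("check")
        .arg("--quiet")
        .current_dir(crate_dir)
        .status()
        .map(|status| status.success())
        .unwrap_or(false)
}

/// Returns the first candidate variant that the oracle accepts, if any.
/// In the setting described above the oracle would wrap `cargo_check_passes`
/// after writing the candidate into a sandboxed checkout (assumption).
pub fn first_accepted<'a, F>(candidates: &'a [String], oracle: F) -> Option<&'a String>
where
    F: Fn(&str) -> bool,
{
    candidates.iter().find(|candidate| oracle(candidate.as_str()))
}
```

The point of the injected closure is that the selection step never trusts the model's own judgment: a variant survives only if an external, deterministic check says so.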
The Insight: Four-Stroke Engine
The system implements a dual-loop control architecture: an inner loop for deterministic delivery (intent change → generate → repair → validate) and an outer loop for evolutionary self-improvement (repair logs → distillation → template fission → reduced cost). The inner loop is convergent and idempotent. The outer loop is stochastic but bounded by meta-utility evaluation.
- Sense — Detect intent changes, parse the cross-service dependency graph, compute blast radius
- Plan — Topologically sort affected services, generate a repair queue with dependency-aware scheduling
- Execute — Deterministic code stamping (Forge) + sandboxed AI repair (Mechanic) + compiler-verified guardrails
- Learn — Distill repair patterns, propose template evolution, evaluate with meta-utility scoring
The flywheel's essential dynamic: custom.rs shrinks, generated.rs expands. Each evolution cycle transfers hand-written patterns into deterministic templates, monotonically reducing the surface area that requires AI repair.
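The Sense and Plan strokes can be illustrated with a small graph routine. The sketch below assumes the dependency graph is available as a map from each service to its direct upstreams; the type alias `DepGraph` and the function `repair_queue` are hypothetical names, and `compute_blast_radius` is borrowed from the data-flow diagram without any claim about its real signature.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Maps each service to the services it depends on (its direct upstreams).
pub type DepGraph = BTreeMap<String, Vec<String>>;

/// Sense: every service that transitively depends on `changed`, excluding `changed` itself.
pub fn compute_blast_radius(deps: &DepGraph, changed: &str) -> BTreeSet<String> {
    let mut affected = BTreeSet::new();
    let mut queue = VecDeque::from([changed.to_string()]);
    while let Some(current) = queue.pop_front() {
        for (service, upstreams) in deps {
            if upstreams.contains(&current) && affected.insert(service.clone()) {
                queue.push_back(service.clone());
            }
        }
    }
    affected.remove(changed); // only relevant if the graph contains a cycle
    affected
}

/// Plan: order affected services so each is repaired after its affected upstreams
/// (Kahn's algorithm). Returns None if the affected subgraph contains a cycle.
pub fn repair_queue(deps: &DepGraph, affected: &BTreeSet<String>) -> Option<Vec<String>> {
    // For each affected service, count its distinct upstreams that are also affected.
    let mut pending: BTreeMap<&str, usize> = affected
        .iter()
        .map(|service| {
            let blockers: BTreeSet<&String> = deps
                .get(service)
                .into_iter()
                .flatten()
                .filter(|up| affected.contains(*up))
                .collect();
            (service.as_str(), blockers.len())
        })
        .collect();
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, count)| **count == 0)
        .map(|(service, _)| *service)
        .collect();
    let mut order = Vec::with_capacity(affected.len());
    while let Some(done) = ready.pop_front() {
        order.push(done.to_string());
        for service in affected {
            let waits_on_done = deps
                .get(service)
                .map_or(false, |ups| ups.iter().any(|up| up.as_str() == done));
            if waits_on_done {
                if let Some(count) = pending.get_mut(service.as_str()) {
                    *count -= 1;
                    if *count == 0 {
                        ready.push_back(service.as_str());
                    }
                }
            }
        }
    }
    (order.len() == affected.len()).then_some(order)
}
```

Sorted containers (`BTreeMap`, `BTreeSet`) are used so that the resulting queue is the same on every run, in keeping with the inner loop's requirement of deterministic delivery.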
Theoretical Foundations
The architecture synthesizes ideas from five academic papers and one key inspiration into a single closed-loop system:
- Rainbow (Garlan et al., CMU 2004): Architecture-based self-adaptation. Direct 1:1 mapping: Probes = cargo check / git diff, Effectors = Forge / Mechanic, Model Manager = intent.lisp topology, Adaptation Engine = Cascade Controller. Key adoption: separate decision-making from execution.
- SRepair (Gao et al., ISSTA 2024): Dual-LLM program repair. First system to achieve multi-function repair (32 bugs, $0.029/bug). Shaped Mechanic's Researcher + Coder separation. Extended with a Global Researcher phase for cross-service cascade repairs.
- STOP (Zelikman et al., COLM 2024): Recursive self-improvement. Core mechanism behind the evolution loop. Key safety argument: Rust's type checker makes reward hacking theoretically impossible, because a template that passes cargo check genuinely produces structurally valid code.
- SWE-agent (Yang et al., NeurIPS 2024): Agent-computer interfaces. Four ACI mechanisms adopted: stateful file viewer (100-line window), guarded edit (auto-rollback on failure, 40%+ recovery improvement), history context collapsing (30+ round coherence), bounded search (50-result cutoff).
- Towards CIA (Cerny et al., 2025): Dual-level IR graph: micro-level (Endpoint → Service → Repository → Entity) + macro-level (cross-service via remote calls). JSON-driven conflict rules pre-filter blast radius before invoking LLMs.
The initial spark came from Embarrassingly Simple Self-Distillation Improves Code Generation (Zhang et al., arXiv:2604.01193). The system generalizes self-distillation from model weights to infrastructure: repair logs become the signal for evolving deterministic code generation templates. Cost data from the literature confirms viability: $0.03–$0.42 per bug via structured compiler feedback.
Architecture: Four Components
Forge — The Shipyard
Deterministic code stamping engine. Reads intent.lisp, parses S-expressions into typed IR, generates Rust code through template lookup. Same input always produces the same output. 8 stamping patterns: CRUD gateway, event listener / cron worker, MCP tool, state machine, bootstrap (DI), pure utility, RPC gateway, domain engine. The codebase is split via the Generation Gap pattern: generated.rs (machine-owned, overwritten on every stamp) vs custom.rs (human-owned, never touched by Forge). Y-Pipeline enables the same Lisp to output Rust / TypeScript / Python.
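To make the "same input, same output" property concrete, here is a minimal sketch of a stamping function. The `EntityIr` type, the `stamp_entity` function, and the header comment it emits are all hypothetical; the real IR and templates are not shown in this document.

```rust
/// Hypothetical typed IR for one entity parsed from intent.lisp.
pub struct EntityIr {
    pub name: String,
    /// (field name, Rust type) pairs in declaration order.
    pub fields: Vec<(String, String)>,
}

/// Renders the machine-owned generated.rs content for one entity.
/// Pure function: no clock, randomness, or environment access,
/// so equal input always yields byte-identical output.
pub fn stamp_entity(ir: &EntityIr) -> String {
    // Illustrative header only; the actual marker Forge writes is an assumption.
    let mut out = String::from("// Generated file. Hand-written code belongs in custom.rs.\n");
    out.push_str("#[derive(Debug, Clone, PartialEq)]\n");
    out.push_str(&format!("pub struct {} {{\n", ir.name));
    for (field, ty) in &ir.fields {
        out.push_str(&format!("    pub {}: {},\n", field, ty));
    }
    out.push_str("}\n");
    out
}
```

Because the function only ever produces the generated.rs side, overwriting that file on every stamp cannot destroy human-owned code, which is the essence of the Generation Gap split.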
Mechanic — The Repair Squad
Five-engine pipeline: Observer (scan compile errors) → Dispatcher (create git worktree sandbox) → Claude Code (dual-model repair) → Assessor (validate) → Cherry-pick. When Forge stamps new interfaces that break custom.rs, Mechanic deploys two LLMs: a Researcher (read-only, outputs natural language strategy) and a Coder (executes edits with cargo check guardrails after every change). Five governance modes control what the AI is permitted to do: experimental, strict-codegen, cartography, self-distillation, polisher.
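The Coder's "check after every change" guardrail can be sketched as a guarded edit. This is an in-memory illustration under stated assumptions: `guarded_edit` is a hypothetical name, and the `check` closure stands in for running cargo check inside the worktree sandbox.

```rust
/// Applies `edit` to `source`, committing the result only if `check` accepts it.
/// On rejection the original text is left untouched, which has the same effect
/// as an automatic rollback.
pub fn guarded_edit<E, C>(source: &mut String, edit: E, check: C) -> bool
where
    E: FnOnce(&str) -> String,
    C: FnOnce(&str) -> bool,
{
    let candidate = edit(source.as_str());
    if check(&candidate) {
        *source = candidate;
        true
    } else {
        false
    }
}
```

Building the candidate separately and swapping it in only on success means a failed edit can never leave the file in a half-modified state.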
MissionD — The General Staff
Multi-agent orchestration daemon. Slot-based management of 1 foreground + N background Claude Code processes. Houses the Cascade Controller, Intent Sentinel, knowledge base (SQLite FTS5 + embedding hybrid search, 380+ architecture memories), 67 MCP tools across 4 domains, and DAG-based task board. Dispatches work to Forge and Mechanic, tracks progress, records repair history.
Jarvis — The Observatory
Data capture and drift detection with three-pillar architecture (Memory / Control / Tools). RealityMirror extracts AST snapshots into S-expressions via Tree-sitter. DeltaDetector (909 lines, 27 tests) compares intent vs reality with four typed drift categories: ImplementationGap (critical), ArchitecturalDrift, StructuralGap, LocationMismatch. TopologyGuardian triggers audits on codebase mutations. All cross-pillar communication via EventBus.
Governance Modes: Stratified Evolution
A system cannot apply the same evolution strategy to a brand-new prototype and a battle-tested production service. Governance modes are Lisp-level declarations that control AI permissions per component:
| Mode | AI Permissions |
|---|---|
| newborn | No AI access; human-only editing during active development |
| cartography | Read-only analysis; may propose Lisp abstractions, may NOT modify Rust |
| survival-patching | Fix compilation errors and panics; may NOT alter signatures |
| strict-codegen | Modify only custom.rs within existing trait boundaries |
| self-distillation | High-temperature variant generation with cargo bench evaluation |
Real-world observation revealed three service maturity tiers:
- Tier 1, The Frontier (e.g., MissionD): Intent files marked DRAFT with [GAP] annotations. AI operates in cartography, reading source and reverse-generating Lisp macros to expand vocabulary.
- Tier 2, The Blueprint (e.g., Auth, ASR): Complete state machines and schemas. Bugs in custom.rs only. AI operates in strict-codegen or survival-patching.
- Tier 3, The Behemoth (e.g., Router, 21K lines): Concurrent streaming, dynamic routing, microsecond billing. AI operates in self-distillation, generating lock-free and zero-copy variants.
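One way to read the governance table is as a permission check that runs before any AI action. The sketch below is an interpretation, not the system's implementation: the `Action` variants and the exact mapping in `permits` are assumptions derived from the table's wording, and only the five modes listed in the table are modelled.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GovernanceMode {
    Newborn,
    Cartography,
    SurvivalPatching,
    StrictCodegen,
    SelfDistillation,
}

/// Hypothetical action vocabulary; the real system may slice permissions differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    ReadSource,
    ProposeLisp,
    FixBuildErrors,
    EditCustomRs,
    AlterSignatures,
    GenerateVariants,
}

impl GovernanceMode {
    /// Parses the mode name as it would appear in a Lisp-level declaration.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "newborn" => Some(Self::Newborn),
            "cartography" => Some(Self::Cartography),
            "survival-patching" => Some(Self::SurvivalPatching),
            "strict-codegen" => Some(Self::StrictCodegen),
            "self-distillation" => Some(Self::SelfDistillation),
            _ => None,
        }
    }

    /// Assumed mapping from the table: anything not explicitly granted is denied.
    pub fn permits(self, action: Action) -> bool {
        use Action::*;
        match self {
            Self::Newborn => false,
            Self::Cartography => matches!(action, ReadSource | ProposeLisp),
            Self::SurvivalPatching => matches!(action, ReadSource | FixBuildErrors),
            Self::StrictCodegen => matches!(action, ReadSource | EditCustomRs),
            Self::SelfDistillation => {
                matches!(action, ReadSource | EditCustomRs | GenerateVariants)
            }
        }
    }
}
```

A deny-by-default match keeps the check conservative: adding a new action variant grants it to no mode until someone decides otherwise.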
Defense Protocols
AI-driven code repair is dangerous without physical guardrails — mechanisms whose correctness does not depend on AI behavior:
| Mechanism | Protection |
|---|---|
| Hard Halt | Single node exceeds N repair cycles → immediate abort, preserve scene |
| Git 2PC | All repairs in shadow worktrees; failures discard cleanly, main branch stays compilable |
| Epoch Preemption | New intent overwrites old → kill stale pipeline via monotonic epoch ID + cancel token |
| Layered Validation | cargo check → clippy -D warnings → cargo test three-gate gauntlet |
| Path Whitelist | Operations restricted to declared UNIVERSE_ROOT |
| Human Sovereignty | AI evolution proposals stop at Draft PR — never auto-push to main |
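Two of the mechanisms in the table, Epoch Preemption and Hard Halt, lend themselves to a compact sketch. The types `EpochClock`, `EpochToken`, and `Outcome` and the function `repair_with_fuse` are hypothetical names chosen for illustration; the `attempt` closure stands in for one Mechanic repair cycle.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Shared monotonic counter; every new intent bumps it.
#[derive(Clone, Default)]
pub struct EpochClock(Arc<AtomicU64>);

impl EpochClock {
    /// Starts a new epoch and returns a token bound to it.
    pub fn begin(&self) -> EpochToken {
        let epoch = self.0.fetch_add(1, Ordering::SeqCst) + 1;
        EpochToken { epoch, clock: self.clone() }
    }
}

/// Cancel token: stale as soon as a newer epoch has begun.
pub struct EpochToken {
    epoch: u64,
    clock: EpochClock,
}

impl EpochToken {
    pub fn is_cancelled(&self) -> bool {
        self.clock.0.load(Ordering::SeqCst) != self.epoch
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Repaired { cycles: u32 },
    Preempted,
    HardHalt,
}

/// Runs repair attempts until one succeeds, the epoch goes stale,
/// or `max_cycles` attempts have been used up (the fuse).
pub fn repair_with_fuse<F>(token: &EpochToken, max_cycles: u32, mut attempt: F) -> Outcome
where
    F: FnMut(u32) -> bool,
{
    for cycle in 1..=max_cycles {
        if token.is_cancelled() {
            return Outcome::Preempted;
        }
        if attempt(cycle) {
            return Outcome::Repaired { cycles: cycle };
        }
    }
    Outcome::HardHalt
}
```

Neither check depends on what the model outputs: the loop bound and the atomic counter are enforced by ordinary control flow, which is what makes them "physical" in the sense used above.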
Self-Evolution: The STOP Mechanism
The outermost loop answers: how do future bugs become fewer?
- Accumulate: Every successful repair's Git diff and corresponding intent change are persisted as JSONL repair logs.
- Distill: Strategy distillation clusters high-frequency repair patterns. When the same custom.rs fix appears 5+ times with consistent structure, it's flagged as a template candidate.
- Evolve: An Opus-class model proposes a new Forge template (mold fission). The proposal includes modified generator code + affected service list + evidence chain.
- Evaluate: Meta-utility scoring: replay historical repair scenarios with the new template. If utility_score ≥ 0.6 (i.e., 60%+ of past Mechanic repairs become unnecessary), the system auto-generates a PR.
- Merge: Human reviews and approves. The template is absorbed into Forge's code generation core. Future stamps produce correct code without needing Mechanic at all.
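The Distill and Evaluate steps reduce to two small computations, sketched below. The function names and the idea of a string "pattern key" per repair are assumptions made for illustration; the thresholds (5 occurrences, 0.6 utility) are the ones stated in the list above.

```rust
use std::collections::BTreeMap;

pub const TEMPLATE_CANDIDATE_MIN_COUNT: usize = 5;
pub const UTILITY_THRESHOLD: f64 = 0.6;

/// Distill: returns the pattern keys that occur at least
/// TEMPLATE_CANDIDATE_MIN_COUNT times, in sorted order.
pub fn template_candidates(pattern_keys: &[&str]) -> Vec<String> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for key in pattern_keys {
        *counts.entry(*key).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|(_, count)| *count >= TEMPLATE_CANDIDATE_MIN_COUNT)
        .map(|(key, _)| key.to_string())
        .collect()
}

/// Evaluate: fraction of replayed historical repairs that the new template
/// makes unnecessary. Each entry is true if Mechanic was still needed.
pub fn utility_score(repair_still_needed: &[bool]) -> f64 {
    if repair_still_needed.is_empty() {
        return 0.0;
    }
    let avoided = repair_still_needed.iter().filter(|needed| !**needed).count();
    avoided as f64 / repair_still_needed.len() as f64
}

/// A proposal moves on to the PR stage only at or above the threshold.
pub fn qualifies_for_pr(score: f64) -> bool {
    score >= UTILITY_THRESHOLD
}
```

Returning 0.0 for an empty replay set is a deliberate conservative choice in this sketch: a template with no historical evidence should not qualify.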
Experimental Validation
Development proceeded through 7 epochs with strict acceptance criteria. Each epoch was verified by reproducible shell scripts and an independent cross-model audit (Gemini reviewed Claude's implementation across 8 rounds).
| Epoch | Scope | Tests | Status |
|---|---|---|---|
| 1 | Sense & blast radius (Universe Graph) | 7/7 | Passed |
| 2 | Cascade & hard halt (dry-run + fuse) | 7/7 | Passed |
| 3 | Anti-collapse (Git 2PC + epoch preemption) | 6/6 | Passed |
| 4 | Real-project integration (4-project universe) | 10/10 | Passed |
| 5–7 | Distillation + evolution + backwards compat | 29/29 | Passed |
Total: 134 unit tests, 59/59 epoch tests, all green. New modules: 3,795 lines across 5 files, test density 1 per 74 lines.
Scenario A: Upstream Type Change Cascade
Injected UserProfile.id: i32 → String in the upstream service. Downstream service compilation broke with 2 type mismatch errors. Claude Code automatically made 3 precise edits — zero hallucination, zero collateral changes. cargo check and cargo test both green after repair.
Scenario B: Business Logic Completion
Added a VIP interception rule to router.intent.lisp. The system generated the structural code and Mechanic filled in the business logic in custom.rs within the sandbox guardrails.
Audit Results
| Dimension | Score |
|---|---|
| Architecture completeness (4-stroke closure) | 100% |
| Defense protocol coverage | 80% |
| Code quality | 9/10 |
| Test coverage | 7.5/10 |
| Production readiness | 70% |
Independent cross-model audit conducted by Gemini reviewing Claude's implementation. 8 rounds of review with iterative fixes.
The New Development Paradigm
The Flywheel System redefines the developer's workflow:
- Declare intent: Write or modify intent.lisp with Claude Code
- Forge stamps skeleton: Deterministic generation of schemas, RPC interfaces, trait contracts
- Write happy path: Implement only the core business logic, leave edge cases rough
- Hand off: Change governance mode from newborn to survival-patching, commit
- Come back to clean code: By the next morning, compilable, lint-clean, edge-case-handled
Humans handle “what” and “why” (business intent, architectural decisions). AI handles “how” (structural generation, error handling, edge cases). The governance mode lifecycle for a typical component: newborn → survival-patching → strict-codegen → self-distillation.
Known Gaps
- Evolution loop not fully closed: forge evolve outputs proposals but doesn't auto-create PRs yet. Human must manually apply.
- E2E black-box veto missing: Validation stops at cargo test. No independent external API test serves as a final veto gate.
- Single-machine scope: Cascade controller operates locally. Cross-host distributed dispatch is not implemented.
- Controlled trial, not graduation: The system passed controlled pre-production testing. Full autonomous operation requires closing the E2E veto gap and accumulating more repair history.
The Thesis
The fundamental bottleneck in AI-assisted software engineering is not model capability — it's system-level coordination. A single LLM can fix a single file. But nobody has built the infrastructure to make an LLM fix an entire microservice universe autonomously, safely, and with self-improving efficiency.
The Flywheel System demonstrates that this is achievable with four ingredients: a declarative intent layer (S-expressions as constraint + invariant + transformation source), physical guardrails (Rust's type system + Git worktree isolation + compiler-as-judge that makes reward hacking impossible), a feedback loop (repair logs → distillation → template evolution → fewer repairs), and stratified governance (topology-aware mutation rates enabling safe co-existence of human development and AI evolution).
The entire research trajectory — from thought experiment to literature survey (57 sources) to feasibility analysis to dual-model execution design to 7-epoch validation to 8-round cross-model audit — was completed by a single developer using AI as the implementation layer.
The flywheel gets lighter with every turn. Each repair teaches the system to generate better code next time. The end state: you change one line of Lisp, close your laptop, and come back to a fully repaired, fully tested, fully audited universe.
Data Flow
    [Commander] --modify--> [service-A/intent.lisp]
                                       |
                                       v
                     Intent Sentinel (30s poll, git diff)
                                       |
                                       v
                      Cascade Controller (General Staff)
                        1. compute_blast_radius()
                        2. topological sort -> CascadePlan
                        3. Forge stamp -> cargo check -> Mechanic repair
                        4. all green -> auto-commit
                                       |
                              +--------+--------+
                              v                 v
                       Forge Shipyard    Mechanic Squad
                        generated.rs    worktree sandbox
                              |                 |
                              +---- all green --+
                                       |
                                       v
                         KB: record repair experience
                                       |
                                       v
                        Strategy Distillation (weekly)
                                       |
                                       v
                           Mold Fission (on-demand)
                     utility >= 0.6 -> PR -> human approve