Flywheel System

Self-Evolving Microservice Cascade: A Four-Stroke Engine for Intent-Driven Autonomous Code Repair

Ruoqi Jin · April 2026 · Engineering Report

The Problem

In a microservice universe, changing a single upstream type — say, User.id from i32 to Uuid — shatters every downstream service that consumes it. Today, developers trace the blast radius by hand, open each project, fix compilation errors one by one, and pray they didn't miss a transitive dependency.

AI coding assistants can fix individual files, but they lack cross-service awareness. They don't know the dependency graph, can't compute which services are affected, and have no mechanism to coordinate repairs in topological order. The result: engineers still spend most of their time doing integration plumbing, not building features.

A survey of 57 sources across six research directions (self-adaptive systems, LLM-based program repair, multi-agent SE, declarative architecture, software evolution, and self-improving AI) reveals a critical gap: no existing system uses a single declarative blueprint simultaneously as a synthesis constraint, a compiler-validated invariant, and a transformation source.

Intellectual Genesis: Three Thought Experiments

The system originated from three thought experiments about using Rust's type system as a physical fitness function for AI-driven code evolution:

  1. The Compilation Sandbox — Can an LLM improve code without a teacher model? Let it generate variants at high temperature, use cargo check as the sole fitness judge. The codebase state becomes the model's “weights”; each successful merge is a weight update. The compiler is the only oracle.
  2. Homoiconic Bootstrapping — Since the system's own logic is expressed in S-expressions (code as data), the AI can evolve at the meta-level by modifying Lisp declarations rather than Rust code. When it discovers recurring patterns, it synthesizes new high-level macros — vocabulary bootstrapping that expands the system's cognitive capacity.
  3. The Topology Guardian — Inspired by self-distillation's principle of “suppressing distractor tails where precision matters.” Core infrastructure is locked to deterministic mode (Temperature = 0). Peripheral tools allow high-temperature exploration. Topology-aware mutation rates prevent catastrophic self-modification.

These crystallized into three design principles: compiler as oracle, declarative intent as evolution substrate, and topology-aware safety boundaries.
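The "compiler as oracle" principle can be sketched as a selection loop in which a pass/fail judge is the only arbiter between candidate rewrites. This is a minimal illustration, not the system's code: `Variant`, `select_survivor`, and `balanced_braces` are hypothetical names, and the oracle is an injected closure standing in for cargo check so that the sketch runs without a Cargo project.

```rust
/// A candidate rewrite of some source file (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
pub struct Variant {
    pub label: String,
    pub source: String,
}

/// Returns the first variant the oracle accepts, or `None` if all are rejected.
/// Assumption: in practice the oracle would run `cargo check`; here it is any
/// closure, so the selection logic can be exercised in isolation.
pub fn select_survivor<F>(variants: &[Variant], oracle: F) -> Option<&Variant>
where
    F: Fn(&str) -> bool,
{
    variants.iter().find(|v| oracle(v.source.as_str()))
}

/// Toy oracle: accepts sources whose braces are balanced.
/// A stand-in only; it is far weaker than a real compiler.
pub fn balanced_braces(source: &str) -> bool {
    let mut depth: i32 = 0;
    for ch in source.chars() {
        match ch {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth < 0 {
                    return false;
                }
            }
            _ => {}
        }
    }
    depth == 0
}
```

The point of the design is that the selection function never inspects a variant's quality itself; swapping the toy oracle for a real compiler invocation changes the strength of the judge, not the shape of the loop.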

The Insight: Four-Stroke Engine

The system implements a dual-loop control architecture: an inner loop for deterministic delivery (intent change → generate → repair → validate) and an outer loop for evolutionary self-improvement (repair logs → distillation → template fission → reduced cost). The inner loop is convergent and idempotent. The outer loop is stochastic but bounded by meta-utility evaluation.

  1. Sense — Detect intent changes, parse the cross-service dependency graph, compute blast radius
  2. Plan — Topologically sort affected services, generate a repair queue with dependency-aware scheduling
  3. Execute — Deterministic code stamping (Forge) + sandboxed AI repair (Mechanic) + compiler-verified guardrails
  4. Learn — Distill repair patterns, propose template evolution, evaluate with meta-utility scoring
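The Sense and Plan strokes above can be sketched with two standard graph routines: a breadth-first search for the blast radius and Kahn's algorithm for the repair order. The function name `compute_blast_radius` appears in the data-flow diagram of this report; its signature here, the `Graph` alias, and `repair_order` are assumptions made for illustration.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Maps each service to the services that directly consume it
/// (edges point from upstream to downstream). Hypothetical representation.
pub type Graph = BTreeMap<&'static str, Vec<&'static str>>;

/// Sense: every service reachable downstream of `changed`, found by BFS.
/// The changed service itself is excluded unless a cycle leads back to it.
pub fn compute_blast_radius(graph: &Graph, changed: &'static str) -> BTreeSet<&'static str> {
    let mut affected = BTreeSet::new();
    let mut queue = VecDeque::from([changed]);
    while let Some(service) = queue.pop_front() {
        for &consumer in graph.get(service).into_iter().flatten() {
            if affected.insert(consumer) {
                queue.push_back(consumer);
            }
        }
    }
    affected
}

/// Plan: Kahn's algorithm restricted to the affected set, so a service is
/// scheduled only after every affected upstream it consumes.
/// Returns `None` if the affected subgraph contains a cycle.
pub fn repair_order(
    graph: &Graph,
    affected: &BTreeSet<&'static str>,
) -> Option<Vec<&'static str>> {
    let mut indegree: BTreeMap<&'static str, usize> =
        affected.iter().map(|&s| (s, 0)).collect();
    for &upstream in affected {
        for &consumer in graph.get(upstream).into_iter().flatten() {
            if let Some(d) = indegree.get_mut(consumer) {
                *d += 1;
            }
        }
    }
    let mut ready: VecDeque<&'static str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(s, _)| *s)
        .collect();
    let mut order = Vec::new();
    while let Some(service) = ready.pop_front() {
        order.push(service);
        for &consumer in graph.get(service).into_iter().flatten() {
            if let Some(d) = indegree.get_mut(consumer) {
                *d -= 1;
                if *d == 0 {
                    ready.push_back(consumer);
                }
            }
        }
    }
    (order.len() == affected.len()).then_some(order)
}
```

Sorted containers are used so that the same graph always yields the same queue, which matches the report's description of the inner loop as deterministic.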

The flywheel's essential dynamic: custom.rs shrinks, generated.rs expands. Each evolution cycle transfers hand-written patterns into deterministic templates, monotonically reducing the surface area that requires AI repair.

Theoretical Foundations

The architecture synthesizes ideas from five academic papers and one key inspiration into a single closed-loop system:

  • Rainbow (Garlan et al., CMU 2004) — Architecture-based self-adaptation. Direct 1:1 mapping: Probes = cargo check / git diff, Effectors = Forge / Mechanic, Model Manager = intent.lisp topology, Adaptation Engine = Cascade Controller. Key adoption: separate decision-making from execution.
  • SRepair (Gao et al., ISSTA 2024) — Dual-LLM program repair. First system to achieve multi-function repair (32 bugs, $0.029/bug). Shaped Mechanic's Researcher + Coder separation. Extended with a Global Researcher phase for cross-service cascade repairs.
  • STOP (Zelikman et al., COLM 2024) — Recursive self-improvement. Core mechanism behind the evolution loop. Key safety argument: Rust's type checker makes reward hacking theoretically impossible — a template that passes cargo check genuinely produces structurally valid code.
  • SWE-agent (Yang et al., NeurIPS 2024) — Agent-computer interfaces. Four ACI mechanisms adopted: stateful file viewer (100-line window), guarded edit (auto-rollback on failure, 40%+ recovery improvement), history context collapsing (30+ round coherence), bounded search (50-result cutoff).
  • Towards CIA (Cerny et al., 2025) — Dual-level IR graph: micro-level (Endpoint → Service → Repository → Entity) + macro-level (cross-service via remote calls). JSON-driven conflict rules pre-filter blast radius before invoking LLMs.
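Two of the ACI mechanisms listed for SWE-agent, the stateful 100-line file viewer and the 50-result bounded search, are simple enough to sketch directly. The `FileViewer` type and its methods below are hypothetical and only illustrate the idea of capping what an agent sees per step; they are not taken from SWE-agent or from this system.

```rust
pub const WINDOW_LINES: usize = 100; // window size quoted in the text
pub const SEARCH_CUTOFF: usize = 50; // result cutoff quoted in the text

/// Hypothetical stateful viewer: remembers where the agent is in the file.
pub struct FileViewer {
    lines: Vec<String>,
    top: usize,
}

impl FileViewer {
    pub fn new(text: &str) -> Self {
        Self {
            lines: text.lines().map(String::from).collect(),
            top: 0,
        }
    }

    /// The lines currently visible to the agent (at most WINDOW_LINES).
    pub fn window(&self) -> &[String] {
        let end = (self.top + WINDOW_LINES).min(self.lines.len());
        &self.lines[self.top..end]
    }

    /// Moves one window down, clamping so the window stays inside the file.
    pub fn scroll_down(&mut self) {
        let max_top = self.lines.len().saturating_sub(WINDOW_LINES);
        self.top = (self.top + WINDOW_LINES).min(max_top);
    }

    /// Returns 1-based line numbers of matches, truncated at SEARCH_CUTOFF.
    pub fn search(&self, needle: &str) -> Vec<usize> {
        self.lines
            .iter()
            .enumerate()
            .filter(|(_, line)| line.contains(needle))
            .map(|(i, _)| i + 1)
            .take(SEARCH_CUTOFF)
            .collect()
    }
}
```

Both limits bound how much text reaches the model per step, regardless of file size or how common the search term is.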

The initial spark came from Embarrassingly Simple Self-Distillation Improves Code Generation (Zhang et al., arXiv:2604.01193). The system generalizes self-distillation from model weights to infrastructure: repair logs become the signal for evolving deterministic code generation templates. Cost data from the literature confirms viability: $0.03–$0.42 per bug via structured compiler feedback.

Architecture: Four Components

Forge — The Shipyard

Deterministic code stamping engine. Reads intent.lisp, parses S-expressions into typed IR, generates Rust code through template lookup. Same input always produces the same output. 8 stamping patterns: CRUD gateway, event listener / cron worker, MCP tool, state machine, bootstrap (DI), pure utility, RPC gateway, domain engine. The codebase is split via the Generation Gap pattern: generated.rs (machine-owned, overwritten on every stamp) vs custom.rs (human-owned, never touched by Forge). Y-Pipeline enables the same Lisp to output Rust / TypeScript / Python.
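A minimal sketch of what "same input always produces the same output" can mean in practice is given below. It assumes a hypothetical typed IR (`EntityIr`) and a single hypothetical template (`stamp_generated`); Forge's real IR and its eight stamping patterns are not shown in this report.

```rust
use std::collections::BTreeMap;

/// Hypothetical typed IR for one entity parsed out of intent.lisp.
pub struct EntityIr {
    pub name: String,
    /// Field name -> Rust type. BTreeMap iterates in sorted key order,
    /// so the output does not depend on insertion order.
    pub fields: BTreeMap<String, String>,
}

/// Stamps the machine-owned half (generated.rs) for one entity.
/// Pure function of its input: no clock, no randomness, no environment.
pub fn stamp_generated(entity: &EntityIr) -> String {
    let mut out = String::from("// generated.rs: machine-owned, overwritten on every stamp\n");
    out.push_str(&format!("pub struct {} {{\n", entity.name));
    for (field, ty) in &entity.fields {
        out.push_str(&format!("    pub {}: {},\n", field, ty));
    }
    out.push_str("}\n");
    out
}
```

Because the stamp is a pure function over an ordered map, regenerating generated.rs after an unrelated edit produces a byte-identical file, which keeps diffs limited to real intent changes.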

Mechanic — The Repair Squad

Five-engine pipeline: Observer (scan compile errors) → Dispatcher (create git worktree sandbox) → Claude Code (dual-model repair) → Assessor (validate) → Cherry-pick. When Forge stamps new interfaces that break custom.rs, Mechanic deploys two LLMs: a Researcher (read-only, outputs natural language strategy) and a Coder (executes edits with cargo check guardrails after every change). Five governance modes control what the AI is permitted to do: experimental, strict-codegen, cartography, self-distillation, polisher.
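The Coder's "guardrail after every change" can be sketched as a guarded edit: apply the change, run a check, and restore the previous contents if the check fails. The types and names below are hypothetical, the file is held in memory rather than in a git worktree, and the check is a closure standing in for cargo check.

```rust
/// In-memory stand-in for a file inside the worktree sandbox (hypothetical).
pub struct SandboxFile {
    pub contents: String,
}

/// Applies `edit`, runs `check`, and rolls back if the check fails.
/// Returns `true` when the edit was kept.
pub fn guarded_edit<E, C>(file: &mut SandboxFile, edit: E, check: C) -> bool
where
    E: FnOnce(&str) -> String,
    C: Fn(&str) -> bool,
{
    let backup = file.contents.clone();
    file.contents = edit(backup.as_str());
    if check(file.contents.as_str()) {
        true
    } else {
        // Auto-rollback: the file never stays in a state the check rejected.
        file.contents = backup;
        false
    }
}
```

The invariant is that every state the file is left in has passed the check, so a failed attempt costs one cycle but cannot compound into a broken sandbox.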

MissionD — The General Staff

Multi-agent orchestration daemon. Slot-based management of 1 foreground + N background Claude Code processes. Houses the Cascade Controller, Intent Sentinel, knowledge base (SQLite FTS5 + embedding hybrid search, 380+ architecture memories), 67 MCP tools across 4 domains, and DAG-based task board. Dispatches work to Forge and Mechanic, tracks progress, records repair history.

Jarvis — The Observatory

Data capture and drift detection with three-pillar architecture (Memory / Control / Tools). RealityMirror extracts AST snapshots into S-expressions via Tree-sitter. DeltaDetector (909 lines, 27 tests) compares intent vs reality with four typed drift categories: ImplementationGap (critical), ArchitecturalDrift, StructuralGap, LocationMismatch. TopologyGuardian triggers audits on codebase mutations. All cross-pillar communication via EventBus.

Governance Modes: Stratified Evolution

A system cannot apply the same evolution strategy to a brand-new prototype and a battle-tested production service. Governance modes are Lisp-level declarations that control AI permissions per component:

Mode               AI Permissions
newborn            No AI access — human-only editing during active development
cartography        Read-only analysis — may propose Lisp abstractions, may NOT modify Rust
survival-patching  Fix compilation errors and panics — may NOT alter signatures
strict-codegen     Modify only custom.rs within existing trait boundaries
self-distillation  High-temperature variant generation with cargo bench evaluation
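One way to read the table is as a permission check consulted before any AI action is dispatched. The sketch below encodes only what each row states; the `Action` enum is hypothetical, and two rules are assumptions rather than facts from the table: read access for every mode except newborn, and deny-by-default for anything a row does not mention.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GovernanceMode {
    Newborn,
    Cartography,
    SurvivalPatching,
    StrictCodegen,
    SelfDistillation,
}

/// Hypothetical action vocabulary, chosen to mirror the wording of the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    ReadSource,
    ProposeLisp,
    FixCompileError,
    EditCustomRs,
    ChangeSignature,
    GenerateVariants,
}

impl GovernanceMode {
    /// Parses the Lisp-level mode name used in the table.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "newborn" => Some(Self::Newborn),
            "cartography" => Some(Self::Cartography),
            "survival-patching" => Some(Self::SurvivalPatching),
            "strict-codegen" => Some(Self::StrictCodegen),
            "self-distillation" => Some(Self::SelfDistillation),
            _ => None,
        }
    }
}

pub fn is_permitted(mode: GovernanceMode, action: Action) -> bool {
    use Action::*;
    use GovernanceMode::*;
    match (mode, action) {
        (Newborn, _) => false,     // no AI access at all
        (_, ReadSource) => true,   // assumption: every other mode may read
        (Cartography, ProposeLisp) => true,
        (SurvivalPatching, FixCompileError) => true,
        (StrictCodegen, EditCustomRs) => true,
        (SelfDistillation, GenerateVariants) => true,
        _ => false,                // assumption: deny anything not listed
    }
}
```

Deny-by-default means an unrecognized mode or a new action type fails closed instead of silently widening what the AI may touch.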

Real-world observation revealed three service maturity tiers:

  • Tier 1 — The Frontier (e.g., MissionD): Intent files marked DRAFT with [GAP] annotations. AI operates in cartography — reading source and reverse-generating Lisp macros to expand vocabulary.
  • Tier 2 — The Blueprint (e.g., Auth, ASR): Complete state machines and schemas. Bugs in custom.rs only. AI operates in strict-codegen or survival-patching.
  • Tier 3 — The Behemoth (e.g., Router, 21K lines): Concurrent streaming, dynamic routing, microsecond billing. AI operates in self-distillation — generating lock-free and zero-copy variants.

Defense Protocols

AI-driven code repair is dangerous without physical guardrails — mechanisms whose correctness does not depend on AI behavior:

Mechanism           Protection
Hard Halt           Single node exceeds N repair cycles → immediate abort, preserve scene
Git 2PC             All repairs in shadow worktrees; failures discard cleanly, main branch stays compilable
Epoch Preemption    New intent overwrites old → kill stale pipeline via monotonic epoch ID + cancel token
Layered Validation  cargo check → clippy -D warnings → cargo test (three-gate gauntlet)
Path Whitelist      Operations restricted to declared UNIVERSE_ROOT
Human Sovereignty   AI evolution proposals stop at Draft PR — never auto-push to main
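Hard Halt and Epoch Preemption compose naturally in a single repair loop, sketched below under stated assumptions. `EpochClock`, `Outcome`, and `repair_node` are hypothetical names; the cancel token mentioned in the table is omitted, and the sketch only compares epoch IDs before each attempt.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Repaired { cycles: u32 },
    HardHalt,  // cycle budget exhausted: abort and leave the scene intact
    Preempted, // a newer intent superseded this pipeline
}

/// Shared monotonic counter; advancing it invalidates every older pipeline.
#[derive(Clone, Default)]
pub struct EpochClock(Arc<AtomicU64>);

impl EpochClock {
    /// Called when a new intent change arrives; returns the new epoch ID.
    pub fn advance(&self) -> u64 {
        self.0.fetch_add(1, Ordering::SeqCst) + 1
    }

    pub fn current(&self) -> u64 {
        self.0.load(Ordering::SeqCst)
    }
}

/// Runs up to `max_cycles` repair attempts for one node.
/// `attempt` is a stand-in for one Mechanic cycle; it returns `true` on success.
pub fn repair_node<F>(clock: &EpochClock, my_epoch: u64, max_cycles: u32, mut attempt: F) -> Outcome
where
    F: FnMut(u32) -> bool,
{
    for cycle in 1..=max_cycles {
        // Checked before every attempt so a stale pipeline does no further work.
        if clock.current() != my_epoch {
            return Outcome::Preempted;
        }
        if attempt(cycle) {
            return Outcome::Repaired { cycles: cycle };
        }
    }
    Outcome::HardHalt
}
```

Neither exit path depends on the model behaving well: the cycle budget and the epoch comparison are plain integer checks made outside the AI's control.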

Self-Evolution: The STOP Mechanism

The outermost loop answers: how do future bugs become fewer?

  1. Accumulate — Every successful repair's Git diff and corresponding intent change are persisted as JSONL repair logs.
  2. Distill — Strategy distillation clusters high-frequency repair patterns. When the same custom.rs fix appears 5+ times with consistent structure, it's flagged as a template candidate.
  3. Evolve — An Opus-class model proposes a new Forge template (mold fission). The proposal includes modified generator code + affected service list + evidence chain.
  4. Evaluate — Meta-utility scoring: replay historical repair scenarios with the new template. If utility_score ≥ 0.6 (i.e., 60%+ of past Mechanic repairs become unnecessary), the proposal qualifies for a Draft PR for human review.
  5. Merge — Human reviews and approves. The template is absorbed into Forge's code generation core. Future stamps produce correct code without needing Mechanic at all.
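The two numeric gates in the steps above (a pattern seen 5+ times, and a utility score of at least 0.6) can be written down as a worked sketch. `RepairLog` and the function names are hypothetical, the "pattern" is assumed to be an already-normalized fix signature, and the replay is reduced to a predicate saying whether a past repair would still be needed under the new template.

```rust
use std::collections::BTreeMap;

pub const CANDIDATE_THRESHOLD: usize = 5; // "5+ times" in the Distill step
pub const UTILITY_THRESHOLD: f64 = 0.6;   // "utility_score >= 0.6" in the Evaluate step

/// One persisted repair. Hypothetical shape; the report stores these as JSONL.
pub struct RepairLog {
    /// Assumed: a normalized signature of the fix, so equal fixes compare equal.
    pub pattern: String,
}

/// Distill: patterns that recur often enough to become template candidates.
pub fn template_candidates(logs: &[RepairLog]) -> Vec<String> {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for log in logs {
        *counts.entry(log.pattern.as_str()).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|(_, n)| *n >= CANDIDATE_THRESHOLD)
        .map(|(pattern, _)| pattern.to_string())
        .collect()
}

/// Evaluate: fraction of replayed repairs the new template makes unnecessary.
pub fn utility_score<F>(logs: &[RepairLog], still_needs_repair: F) -> f64
where
    F: Fn(&RepairLog) -> bool,
{
    if logs.is_empty() {
        return 0.0;
    }
    let unnecessary = logs.iter().filter(|log| !still_needs_repair(*log)).count();
    unnecessary as f64 / logs.len() as f64
}

pub fn qualifies_for_pr(score: f64) -> bool {
    score >= UTILITY_THRESHOLD
}
```

For example, with 10 logged repairs of which 7 share one pattern, a template that absorbs that pattern scores 7 / 10 = 0.7 and clears the gate; a template that absorbs only 5 of the 10 scores 0.5 and does not.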

Experimental Validation

Development proceeded through 7 epochs with strict acceptance criteria. Each epoch was verified by reproducible shell scripts and an independent cross-model audit (Gemini reviewed Claude's implementation across 8 rounds).

Epoch  Scope                                        Tests  Status
1      Sense & blast radius (Universe Graph)        7/7    Passed
2      Cascade & hard halt (dry-run + fuse)         7/7    Passed
3      Anti-collapse (Git 2PC + epoch preemption)   6/6    Passed
4      Real-project integration (4-project universe) 10/10 Passed
5–7    Distillation + evolution + backwards compat  29/29  Passed

Total: 134 unit tests, 59/59 epoch tests, all green. New modules: 3,795 lines across 5 files, test density 1 per 74 lines.

Scenario A: Upstream Type Change Cascade

Injected UserProfile.id: i32 → String in the upstream service. Downstream service compilation broke with 2 type mismatch errors. Claude Code automatically made 3 precise edits — zero hallucination, zero collateral changes. cargo check and cargo test both green after repair.
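The report does not reproduce the code involved, so the following is a purely hypothetical illustration of this class of breakage and repair. Only `UserProfile.id` and its change from i32 to String come from the scenario; the `name` field and both downstream functions are invented for the example.

```rust
/// Upstream (machine-generated): `id` was `i32` before the intent change.
pub struct UserProfile {
    pub id: String,
    pub name: String, // hypothetical extra field
}

/// Downstream custom.rs, shown in its repaired state.
/// Before the change this hypothetical helper read
/// `fn cache_key(id: i32) -> String`, so passing the new `String` id
/// fails with E0308 (mismatched types) until the signature is updated.
pub fn cache_key(id: &str) -> String {
    format!("user:{}", id)
}

/// A second hypothetical call site that had to change: it previously
/// passed `profile.id` by value as an `i32` and now borrows it as `&str`.
pub fn describe(profile: &UserProfile) -> String {
    format!("{} ({})", profile.name, cache_key(&profile.id))
}
```

Errors of this kind are well suited to compiler-guided repair because each diagnostic names the exact expression and the expected type, which keeps the edit set small and local.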

Scenario B: Business Logic Completion

Added a VIP interception rule to router.intent.lisp. The system generated the structural code and Mechanic filled in the business logic in custom.rs within the sandbox guardrails.

Audit Results

Dimension                                     Score
Architecture completeness (4-stroke closure)  100%
Defense protocol coverage                     80%
Code quality                                  9/10
Test coverage                                 7.5/10
Production readiness                          70%

Independent cross-model audit conducted by Gemini reviewing Claude's implementation. 8 rounds of review with iterative fixes.

The New Development Paradigm

The Flywheel System redefines the developer's workflow:

  1. Declare intent — Write or modify intent.lisp with Claude Code
  2. Forge stamps skeleton — Deterministic generation of schemas, RPC interfaces, trait contracts
  3. Write happy path — Implement only the core business logic, leave edge cases rough
  4. Hand off — Change governance mode from newborn to survival-patching, commit
  5. Come back to clean code — The next morning: compilable, lint-clean, edge-case-handled

Humans handle “what” and “why” (business intent, architectural decisions). AI handles “how” (structural generation, error handling, edge cases). The governance mode lifecycle for a typical component: newborn → survival-patching → strict-codegen → self-distillation.

Known Gaps

  • Evolution loop not fully closed — forge evolve outputs proposals but doesn't auto-create PRs yet. A human must apply them manually.
  • E2E black-box veto missing — Validation stops at cargo test. No independent external API test serves as a final veto gate.
  • Single-machine scope — Cascade controller operates locally. Cross-host distributed dispatch is not implemented.
  • Controlled trial, not graduation — The system passed controlled pre-production testing. Full autonomous operation requires closing the E2E veto gap and accumulating more repair history.

The Thesis

The fundamental bottleneck in AI-assisted software engineering is not model capability — it's system-level coordination. A single LLM can fix a single file. But nobody has built the infrastructure to make an LLM fix an entire microservice universe autonomously, safely, and with self-improving efficiency.

The Flywheel System demonstrates that this is achievable with four ingredients: a declarative intent layer (S-expressions as constraint + invariant + transformation source), physical guardrails (Rust's type system + Git worktree isolation + compiler-as-judge that makes reward hacking impossible), a feedback loop (repair logs → distillation → template evolution → fewer repairs), and stratified governance (topology-aware mutation rates enabling safe co-existence of human development and AI evolution).

The entire research trajectory — from thought experiment to literature survey (57 sources) to feasibility analysis to dual-model execution design to 7-epoch validation to 8-round cross-model audit — was completed by a single developer using AI as the implementation layer.

The flywheel gets lighter with every turn. Each repair teaches the system to generate better code next time. The end state: you change one line of Lisp, close your laptop, and come back to a fully repaired, fully tested, fully audited universe.

Data Flow

[Commander] --modify--> [service-A/intent.lisp]
                             |
                             v
      Intent Sentinel (30s poll, git diff)
                             |
                             v
      Cascade Controller (General Staff)
        1. compute_blast_radius()
        2. topological sort -> CascadePlan
        3. Forge stamp -> cargo check -> Mechanic repair
        4. all green -> auto-commit
                             |
                    +--------+--------+
                    v                 v
              Forge Shipyard    Mechanic Squad
              generated.rs      worktree sandbox
                    |                 |
                    +---- all green --+
                             |
                             v
                   KB: record repair experience
                             |
                             v
                   Strategy Distillation (weekly)
                             |
                             v
                   Mold Fission (on-demand)
                   utility >= 0.6 -> PR -> human approve