Ruoqi Jin

Self-Evolving Microservice Cascade: A Four-Stroke Engine for Intent-Driven Autonomous Code Repair

The Problem

In a microservice universe, changing a single upstream type — say, User.id from i32 to Uuid — shatters every downstream service that consumes it. Today, developers trace the blast radius by hand, open each project, fix compilation errors one by one, and pray they didn't miss a transitive dependency.

AI coding assistants can fix individual files, but they lack cross-service awareness. They don't know the dependency graph, can't compute which services are affected, and have no mechanism to coordinate repairs in topological order. The result: engineers still spend most of their time doing integration plumbing, not building features.

A survey of 57 sources across six research directions (self-adaptive systems, LLM-based program repair, multi-agent SE, declarative architecture, software evolution, and self-improving AI) reveals a critical gap: no existing system treats a single declarative blueprint as a synthesis constraint, a compiler-validated invariant, and a transformation source at the same time.

Intellectual Genesis: Three Thought Experiments

The system originated from three thought experiments about using Rust's type system as a physical fitness function for AI-driven code evolution:

The Compilation Sandbox — Can an LLM improve code without a teacher model? Let it generate variants at high temperature, use cargo check as the sole fitness judge. The codebase state becomes the model's “weights”; each successful merge is a weight update. The compiler is the only oracle.
Homoiconic Bootstrapping — Since the system's own logic is expressed in S-expressions (code as data), the AI can evolve at the meta-level by modifying Lisp declarations rather than Rust code. When it discovers recurring patterns, it synthesizes new high-level macros — vocabulary bootstrapping that expands the system's cognitive capacity.
The Topology Guardian — Inspired by self-distillation's principle of “suppressing distractor tails where precision matters.” Core infrastructure is locked to deterministic mode (Temperature = 0). Peripheral tools allow high-temperature exploration. Topology-aware mutation rates prevent catastrophic self-modification.

These crystallized into three design principles: compiler as oracle, declarative intent as evolution substrate, and topology-aware safety boundaries.
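The "compiler as oracle" principle can be illustrated with a minimal sketch. It assumes each candidate variant lives in its own crate directory and that an exit status of 0 from `cargo check` is the only acceptance signal. The function names `cargo_check` and `select_survivors` are hypothetical, not taken from the system.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable


def cargo_check(crate_dir: Path) -> bool:
    """Return True when `cargo check` exits with status 0 in crate_dir."""
    result = subprocess.run(
        ["cargo", "check", "--quiet"],
        cwd=crate_dir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def select_survivors(
    variant_dirs: Iterable[Path],
    oracle: Callable[[Path], bool] = cargo_check,
) -> list[Path]:
    """Keep only the candidate crates that the oracle accepts.

    The oracle is injectable so the selection logic can be exercised
    without a Rust toolchain (assumption made for this sketch).
    """
    return [d for d in variant_dirs if oracle(d)]
```

Keeping the oracle as a plain callable means the same selection step could later be pointed at a stricter gate, such as the three-gate validation described under Defense Protocols.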

The Insight: Four-Stroke Engine

The system implements a dual-loop control architecture: an inner loop for deterministic delivery (intent change → generate → repair → validate) and an outer loop for evolutionary self-improvement (repair logs → distillation → template fission → reduced cost). The inner loop is convergent and idempotent. The outer loop is stochastic but bounded by meta-utility evaluation. Together the two loops run as four strokes:

Sense — Detect intent changes, parse the cross-service dependency graph, compute blast radius
Plan — Topologically sort affected services, generate a repair queue with dependency-aware scheduling
Execute — Deterministic code stamping (Forge) + sandboxed AI repair (Mechanic) + compiler-verified guardrails
Learn — Distill repair patterns, propose template evolution, evaluate with meta-utility scoring
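The Sense and Plan strokes above can be sketched as a reverse-dependency traversal followed by a topological sort. This is a minimal illustration, not the Cascade Controller's actual code: the `DEPENDS_ON` universe is made up, and `plan_repairs` is a hypothetical name. `compute_blast_radius` borrows its name from the data-flow diagram at the end of this document.

```python
from collections import deque
from graphlib import TopologicalSorter

# Hypothetical universe: each service maps to the upstream services it consumes.
DEPENDS_ON = {
    "user-service": set(),
    "auth": {"user-service"},
    "billing": {"user-service"},
    "router": {"auth", "billing"},
    "asr": set(),
}


def compute_blast_radius(depends_on, changed):
    """Return every service that transitively depends on `changed`."""
    # Invert the edges: upstream -> direct downstream consumers.
    consumers = {name: set() for name in depends_on}
    for service, upstreams in depends_on.items():
        for upstream in upstreams:
            consumers.setdefault(upstream, set()).add(service)

    # Breadth-first walk over the inverted graph.
    affected, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for downstream in consumers.get(current, ()):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected


def plan_repairs(depends_on, changed):
    """Order the affected services so upstreams are repaired first."""
    affected = compute_blast_radius(depends_on, changed)
    # Restrict the graph to affected nodes; values are each node's predecessors.
    subgraph = {s: depends_on[s] & affected for s in affected}
    return list(TopologicalSorter(subgraph).static_order())
```

With this sample graph, changing `user-service` yields a plan that repairs `auth` and `billing` (in either order) before `router`, and leaves `asr` untouched.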

The flywheel's essential dynamic: custom.rs shrinks, generated.rs expands. Each evolution cycle transfers hand-written patterns into deterministic templates, monotonically reducing the surface area that requires AI repair.

Theoretical Foundations

The architecture synthesizes ideas from five academic papers and one key inspiration into a single closed-loop system:

Rainbow (Garlan et al., CMU 2004): Architecture-based self-adaptation. Direct 1:1 mapping: Probes = cargo check / git diff, Effectors = Forge / Mechanic, Model Manager = intent.lisp topology, Adaptation Engine = Cascade Controller. Key adoption: separate decision-making from execution.
SRepair (Gao et al., ISSTA 2024): Dual-LLM program repair. First system to achieve multi-function repair (32 bugs, $0.029/bug). Shaped Mechanic's Researcher + Coder separation. Extended with a Global Researcher phase for cross-service cascade repairs.
STOP (Zelikman et al., COLM 2024): Recursive self-improvement. Core mechanism behind the evolution loop. Key safety argument: Rust's type checker makes reward hacking theoretically impossible, because a template that passes cargo check genuinely produces structurally valid code.
SWE-agent (Yang et al., NeurIPS 2024): Agent-computer interfaces. Four ACI mechanisms adopted: stateful file viewer (100-line window), guarded edit (auto-rollback on failure, 40%+ recovery improvement), history context collapsing (30+ round coherence), bounded search (50-result cutoff).
Towards CIA (Cerny et al., 2025): Dual-level IR graph with a micro level (Endpoint → Service → Repository → Entity) and a macro level (cross-service via remote calls). JSON-driven conflict rules pre-filter blast radius before invoking LLMs.

The initial spark came from Embarrassingly Simple Self-Distillation Improves Code Generation (Zhang et al., arXiv:2604.01193). The system generalizes self-distillation from model weights to infrastructure: repair logs become the signal for evolving deterministic code generation templates. Cost data from the literature confirms viability: $0.03–$0.42 per bug via structured compiler feedback.

Architecture: Four Components

Forge — The Shipyard

Deterministic code stamping engine. Reads intent.lisp, parses S-expressions into typed IR, generates Rust code through template lookup. Same input always produces the same output. 8 stamping patterns: CRUD gateway, event listener / cron worker, MCP tool, state machine, bootstrap (DI), pure utility, RPC gateway, domain engine. The codebase is split via the Generation Gap pattern: generated.rs (machine-owned, overwritten on every stamp) vs custom.rs (human-owned, never touched by Forge). Y-Pipeline enables the same Lisp to output Rust / TypeScript / Python.
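The parse, typed IR, and template-lookup pipeline can be sketched in a few lines. The actual intent.lisp grammar is not shown in this document, so the `(entity Name (field Type) ...)` form below is an assumption, as are the names `Entity`, `parse_sexpr`, `to_ir`, and `stamp`. The point of the sketch is the determinism property: the output is a pure function of the input.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Typed IR for one hypothetical `(entity Name (field Type) ...)` form."""
    name: str
    fields: tuple  # tuple of (field_name, rust_type) pairs


def parse_sexpr(text):
    """Parse a single S-expression into nested Python lists of atoms."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        return token

    return read()


def to_ir(form):
    """Lower a parsed form into the typed IR, rejecting unknown heads."""
    head, name, *fields = form
    if head != "entity":
        raise ValueError(f"unsupported form: {head}")
    return Entity(name, tuple((f, t) for f, t in fields))


# Template lookup keyed by IR type; a fuller version would hold one
# entry per stamping pattern.
TEMPLATES = {
    Entity: "pub struct {name} {{\n{body}\n}}\n",
}


def stamp(ir):
    """Render Rust source from the IR with no randomness or external state."""
    body = "\n".join(f"    pub {f}: {t}," for f, t in ir.fields)
    return TEMPLATES[type(ir)].format(name=ir.name, body=body)
```

Because `stamp` reads nothing but its argument and a static template table, stamping the same declaration twice produces byte-identical text, which is what lets generated.rs be overwritten safely on every run.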

Mechanic — The Repair Squad

Five-engine pipeline: Observer (scan compile errors) → Dispatcher (create git worktree sandbox) → Claude Code (dual-model repair) → Assessor (validate) → Cherry-pick. When Forge stamps new interfaces that break custom.rs, Mechanic deploys two LLMs: a Researcher (read-only, outputs natural language strategy) and a Coder (executes edits with cargo check guardrails after every change). Five governance modes control what the AI is permitted to do: experimental, strict-codegen, cartography, self-distillation, polisher.
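The Coder's "guardrail after every change" behavior can be approximated by a guarded edit that rolls back when validation fails. The sketch below is an assumption about the shape of that mechanism, not Mechanic's code: `guarded_edit` is a hypothetical name, and `check` stands in for whatever runs `cargo check` inside the worktree sandbox.

```python
from pathlib import Path
from typing import Callable


def guarded_edit(path: Path, new_text: str, check: Callable[[], bool]) -> bool:
    """Write new_text to path; restore the old content if check() fails.

    `check` is expected to return True when the tree still compiles.
    An exception from the checker is treated the same as a failure.
    """
    original = path.read_text()
    path.write_text(new_text)
    try:
        ok = check()
    except Exception:
        ok = False
    if not ok:
        path.write_text(original)  # auto-rollback to the last good state
    return ok
```

A caller would typically loop over proposed edits and feed the checker's error output back to the model whenever `guarded_edit` returns False, so the file on disk never stays in a non-compiling state between attempts.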

MissionD — The General Staff

Multi-agent orchestration daemon. Slot-based management of 1 foreground + N background Claude Code processes. Houses the Cascade Controller, Intent Sentinel, knowledge base (SQLite FTS5 + embedding hybrid search, 380+ architecture memories), 67 MCP tools across 4 domains, and DAG-based task board. Dispatches work to Forge and Mechanic, tracks progress, records repair history.

Jarvis — The Observatory

Data capture and drift detection with three-pillar architecture (Memory / Control / Tools). RealityMirror extracts AST snapshots into S-expressions via Tree-sitter. DeltaDetector (909 lines, 27 tests) compares intent vs reality with four typed drift categories: ImplementationGap (critical), ArchitecturalDrift, StructuralGap, LocationMismatch. TopologyGuardian triggers audits on codebase mutations. All cross-pillar communication via EventBus.

Governance Modes: Stratified Evolution

A system cannot apply the same evolution strategy to a brand-new prototype and a battle-tested production service. Governance modes are Lisp-level declarations that control AI permissions per component:

Mode	AI Permissions
`newborn`	No AI access — human-only editing during active development
`cartography`	Read-only analysis — may propose Lisp abstractions, may NOT modify Rust
`survival-patching`	Fix compilation errors and panics — may NOT alter signatures
`strict-codegen`	Modify only `custom.rs` within existing trait boundaries
`self-distillation`	High-temperature variant generation with `cargo bench` evaluation
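One way to enforce the table above is a deny-by-default permission lookup. The mode names come from the table; the action strings (`read`, `propose-lisp`, and so on) are hypothetical labels chosen for this sketch, and granting `read` to the three editing modes is an assumption the table does not state explicitly.

```python
from enum import Enum


class Mode(Enum):
    NEWBORN = "newborn"
    CARTOGRAPHY = "cartography"
    SURVIVAL_PATCHING = "survival-patching"
    STRICT_CODEGEN = "strict-codegen"
    SELF_DISTILLATION = "self-distillation"


# Hypothetical action names; the mapping paraphrases the governance table.
PERMISSIONS = {
    Mode.NEWBORN: frozenset(),  # no AI access at all
    Mode.CARTOGRAPHY: frozenset({"read", "propose-lisp"}),
    Mode.SURVIVAL_PATCHING: frozenset({"read", "fix-compile-error", "fix-panic"}),
    Mode.STRICT_CODEGEN: frozenset({"read", "edit-custom-rs"}),
    Mode.SELF_DISTILLATION: frozenset({"read", "generate-variants", "run-bench"}),
}


def is_allowed(mode: Mode, action: str) -> bool:
    """Deny by default: an action is permitted only if listed for the mode."""
    return action in PERMISSIONS[mode]
```

Under this mapping, `is_allowed(Mode("cartography"), "edit-custom-rs")` is False, matching the rule that cartography may not modify Rust.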

Real-world observation revealed three service maturity tiers:

Tier 1 — The Frontier (e.g., MissionD): Intent files marked DRAFT with [GAP] annotations. AI operates in cartography — reading source and reverse-generating Lisp macros to expand vocabulary.
Tier 2 — The Blueprint (e.g., Auth, ASR): Complete state machines and schemas. Bugs in custom.rs only. AI operates in strict-codegen or survival-patching.
Tier 3 — The Behemoth (e.g., Router, 21K lines): Concurrent streaming, dynamic routing, microsecond billing. AI operates in self-distillation — generating lock-free and zero-copy variants.

Defense Protocols

AI-driven code repair is dangerous without physical guardrails — mechanisms whose correctness does not depend on AI behavior:

Mechanism	Protection
Hard Halt	A single node exceeding N repair cycles triggers an immediate abort, with the failing state preserved
Git 2PC	Two-phase commit over Git: all repairs run in shadow worktrees; failures discard cleanly and the main branch stays compilable
Epoch Preemption	New intent overwrites old → kill stale pipeline via monotonic epoch ID + cancel token
Layered Validation	`cargo check` → `clippy -D warnings` → `cargo test` three-gate gauntlet
Path Whitelist	Operations restricted to declared UNIVERSE_ROOT
Human Sovereignty	AI evolution proposals stop at Draft PR — never auto-push to main
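Two of the mechanisms in the table, Epoch Preemption and Hard Halt, lend themselves to a compact sketch. The class and function names (`EpochGate`, `HardHalt`, `repair_node`) and the default of three cycles are assumptions for illustration; the table only specifies a monotonic epoch ID, a cancel token, and a limit of N cycles.

```python
import itertools
import threading


class EpochGate:
    """Hands out monotonically increasing epoch IDs; only the newest is live."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()
        self._current = 0
        self._cancel = threading.Event()

    def begin(self):
        """Start a new epoch and cancel whichever pipeline ran before it."""
        with self._lock:
            self._cancel.set()                # signal the stale pipeline
            self._cancel = threading.Event()  # fresh token for the new one
            self._current = next(self._counter)
            return self._current, self._cancel

    def is_stale(self, epoch):
        with self._lock:
            return epoch != self._current


class HardHalt(Exception):
    """Raised when a node exhausts its repair budget."""


def repair_node(attempt_repair, cancel, max_cycles=3):
    """Run repair attempts until success, preemption, or the fuse blows."""
    for cycle in range(1, max_cycles + 1):
        if cancel.is_set():
            return "preempted"
        if attempt_repair(cycle):
            return "repaired"
    raise HardHalt(f"gave up after {max_cycles} cycles")
```

Neither check consults the model: the cycle counter and the cancel token are plain control flow, which is what makes them guardrails in the sense defined above.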

Self-Evolution: The STOP Mechanism

The outermost loop answers: how do future bugs become fewer?

Accumulate: Every successful repair's Git diff and corresponding intent change are persisted as JSONL repair logs.
Distill: Strategy distillation clusters high-frequency repair patterns. When the same custom.rs fix appears 5+ times with consistent structure, it is flagged as a template candidate.
Evolve: An Opus-class model proposes a new Forge template (mold fission). The proposal includes modified generator code, the affected service list, and an evidence chain.
Evaluate: Meta-utility scoring replays historical repair scenarios with the new template. If utility_score ≥ 0.6 (i.e., at least 60% of past Mechanic repairs become unnecessary), the proposal qualifies for a PR. PR creation is intended to be automatic but is currently a manual step (see Known Gaps).
Merge: A human reviews and approves. The template is absorbed into Forge's code generation core. Future stamps produce correct code without needing Mechanic at all.
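The Distill and Evaluate steps reduce to a frequency count and a ratio. The sketch below assumes each repair log entry has already been reduced to a `pattern` key (a hypothetical field; the real logs store diffs and intent changes) and that `still_needs_repair` is a stand-in for replaying a historical scenario against the proposed template.

```python
from collections import Counter

TEMPLATE_CANDIDATE_MIN = 5   # "5+ times" in the Distill step
UTILITY_THRESHOLD = 0.6      # threshold from the Evaluate step


def template_candidates(repair_logs):
    """Return pattern keys that recur often enough to merit a new template."""
    counts = Counter(entry["pattern"] for entry in repair_logs)
    return sorted(p for p, n in counts.items() if n >= TEMPLATE_CANDIDATE_MIN)


def utility_score(repair_logs, still_needs_repair):
    """Fraction of past repairs the proposed template would make unnecessary."""
    if not repair_logs:
        return 0.0
    avoided = sum(1 for entry in repair_logs if not still_needs_repair(entry))
    return avoided / len(repair_logs)


def should_propose(repair_logs, still_needs_repair):
    """Apply the 0.6 acceptance threshold to the replay result."""
    return utility_score(repair_logs, still_needs_repair) >= UTILITY_THRESHOLD
```

For example, if 6 of 10 logged repairs share one pattern and the new template eliminates exactly those 6, the score is 6 / 10 = 0.6 and the proposal just clears the threshold.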

Experimental Validation

Development proceeded through 7 epochs with strict acceptance criteria. Each epoch was verified by reproducible shell scripts and an independent cross-model audit (Gemini reviewed Claude's implementation across 8 rounds).

Epoch	Scope	Tests	Status
1	Sense & blast radius (Universe Graph)	7/7	Passed
2	Cascade & hard halt (dry-run + fuse)	7/7	Passed
3	Anti-collapse (Git 2PC + epoch preemption)	6/6	Passed
4	Real-project integration (4-project universe)	10/10	Passed
5–7	Distillation + evolution + backwards compat	29/29	Passed

Total: 134 unit tests, 59/59 epoch tests, all green. New modules: 3,795 lines across 5 files, test density 1 per 74 lines.

Scenario A: Upstream Type Change Cascade

Injected UserProfile.id: i32 → String in the upstream service. Downstream service compilation broke with 2 type mismatch errors. Claude Code automatically made 3 precise edits — zero hallucination, zero collateral changes. cargo check and cargo test both green after repair.

Scenario B: Business Logic Completion

Added a VIP interception rule to router.intent.lisp. The system generated the structural code and Mechanic filled in the business logic in custom.rs within the sandbox guardrails.

Audit Results

Dimension	Score
Architecture completeness (4-stroke closure)	100%
Defense protocol coverage	80%
Code quality	9/10
Test coverage	7.5/10
Production readiness	70%

Independent cross-model audit conducted by Gemini reviewing Claude's implementation. 8 rounds of review with iterative fixes.

The New Development Paradigm

The Flywheel System redefines the developer's workflow:

Declare intent — Write or modify intent.lisp with Claude Code
Forge stamps skeleton — Deterministic generation of schemas, RPC interfaces, trait contracts
Write happy path — Implement only the core business logic, leave edge cases rough
Hand off — Change governance mode from newborn to survival-patching, commit
Come back to clean code — The next morning: compilable, lint-clean, edge-case-handled

Humans handle “what” and “why” (business intent, architectural decisions). AI handles “how” (structural generation, error handling, edge cases). The governance mode lifecycle for a typical component: newborn → survival-patching → strict-codegen → self-distillation.

Known Gaps

Evolution loop not fully closed — forge evolve outputs proposals but doesn't auto-create PRs yet. Human must manually apply.
E2E black-box veto missing — Validation stops at cargo test. No independent external API test serves as a final veto gate.
Single-machine scope — Cascade controller operates locally. Cross-host distributed dispatch is not implemented.
Controlled trial, not graduation — The system passed controlled pre-production testing. Full autonomous operation requires closing the E2E veto gap and accumulating more repair history.

The Thesis

The fundamental bottleneck in AI-assisted software engineering is not model capability — it's system-level coordination. A single LLM can fix a single file. But nobody has built the infrastructure to make an LLM fix an entire microservice universe autonomously, safely, and with self-improving efficiency.

The Flywheel System demonstrates that this is achievable with four ingredients: a declarative intent layer (S-expressions as constraint + invariant + transformation source), physical guardrails (Rust's type system + Git worktree isolation + compiler-as-judge that makes reward hacking impossible), a feedback loop (repair logs → distillation → template evolution → fewer repairs), and stratified governance (topology-aware mutation rates enabling safe co-existence of human development and AI evolution).

The entire research trajectory — from thought experiment to literature survey (57 sources) to feasibility analysis to dual-model execution design to 7-epoch validation to 8-round cross-model audit — was completed by a single developer using AI as the implementation layer.

The flywheel gets lighter with every turn. Each repair teaches the system to generate better code next time. The end state: you change one line of Lisp, close your laptop, and come back to a fully repaired, fully tested, fully audited universe.

Data Flow

[Commander] --modify--> [service-A/intent.lisp]
                             |
                             v
      Intent Sentinel (30s poll, git diff)
                             |
                             v
      Cascade Controller (General Staff)
        1. compute_blast_radius()
        2. topological sort -> CascadePlan
        3. Forge stamp -> cargo check -> Mechanic repair
        4. all green -> auto-commit
                             |
                    +--------+--------+
                    v                 v
              Forge Shipyard    Mechanic Squad
              generated.rs      worktree sandbox
                    |                 |
                    +---- all green --+
                             |
                             v
                   KB: record repair experience
                             |
                             v
                   Strategy Distillation (weekly)
                             |
                             v
                   Mold Fission (on-demand)
                   utility_score >= 0.6 -> PR -> human approve
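The Intent Sentinel step at the top of the diagram (poll, then `git diff`) could look roughly like the following. This is a sketch under assumptions: the function names are hypothetical, the 30-second polling loop is left to the caller, and intent files are identified purely by a path ending in `intent.lisp`, which covers both `service-A/intent.lisp` and `router.intent.lisp` as they appear in this document.

```python
import subprocess


def filter_intent_paths(diff_output):
    """Keep only the changed paths that name an intent file."""
    return [
        line for line in diff_output.splitlines()
        if line.endswith("intent.lisp")
    ]


def changed_intent_files(repo_dir, old_rev, new_rev):
    """List intent files that differ between two revisions of a repository."""
    out = subprocess.run(
        ["git", "diff", "--name-only", old_rev, new_rev],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git itself fails
    ).stdout
    return filter_intent_paths(out)
```

Each path returned here would be the `changed` input to the blast-radius computation in the Cascade Controller.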