Neural Codegen
Neuro-Symbolic Code Generation via S-Expression Intermediate Representation and Deterministic Harness Engineering
The Problem
Every AI coding tool today follows the same loop: prompt the LLM, get code back, try to compile it, feed the error back, retry. With Rust, a single endpoint handler must simultaneously satisfy the type system, borrow checker, async semantics, error handling conventions, and project-specific middleware. The probability of getting every dimension correct on the first pass is near zero.
The result is that Claude Opus — one of the most capable models available — achieves only 62% Pass@1 compilation when generating Rust API handlers directly. The remaining 38% are stochastic failures: hallucinated crate imports, deprecated API calls, lifetime errors. Each failure requires another round-trip through the model, burning tokens and time.
The Insight: GPU Mode
GPUs don't think. They execute predefined shader programs from a fixed instruction set. They can't invent new instructions — and that's exactly why they're reliable. No GPU has ever “hallucinated” a pixel.
Neural Codegen applies the same principle to code generation. Instead of asking the LLM to create code (CPU mode — Turing-complete, unbounded, unreliable), we ask it to select from a finite, pre-verified instruction set (GPU mode). The LLM fills out a structured form in S-expressions. A deterministic engine validates and assembles the final code.
Architecture: Three Stages
Stage 1 — S-Expression Parser
The LLM outputs S-expressions — the simplest possible syntax. Only atoms and parenthesized lists. Zero parsing ambiguity, impossible to break syntactically. The parser is 150 lines of Rust with zero dependencies.
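As an illustration of how small that surface is, here is a toy parser in the same spirit. This is a sketch, not the project's actual implementation: the `Sexp` type and tokenizer are hypothetical, and it splits purely on whitespace and parentheses, so quoted atoms may not contain spaces.

```rust
// Toy S-expression parser: atoms and parenthesized lists, nothing else.
#[derive(Debug, PartialEq)]
enum Sexp {
    Atom(String),
    List(Vec<Sexp>),
}

// Pad parentheses with spaces, then split on whitespace.
fn tokenize(src: &str) -> Vec<String> {
    src.replace('(', " ( ")
        .replace(')', " ) ")
        .split_whitespace()
        .map(String::from)
        .collect()
}

// Recursive-descent parse over the token slice, advancing `pos`.
fn parse(tokens: &[String], pos: &mut usize) -> Result<Sexp, String> {
    match tokens.get(*pos) {
        None => Err("unexpected end of input".into()),
        Some(t) if t == "(" => {
            *pos += 1;
            let mut items = Vec::new();
            while tokens.get(*pos).map(|t| t != ")").unwrap_or(false) {
                items.push(parse(tokens, pos)?);
            }
            if tokens.get(*pos).is_none() {
                return Err("unclosed list".into());
            }
            *pos += 1; // consume ")"
            Ok(Sexp::List(items))
        }
        Some(t) if t == ")" => Err("unexpected )".into()),
        Some(t) => {
            let atom = Sexp::Atom(t.clone());
            *pos += 1;
            Ok(atom)
        }
    }
}

fn main() {
    let tokens = tokenize("(api :method POST)");
    let mut pos = 0;
    println!("{:?}", parse(&tokens, &mut pos));
}
```

There is no operator precedence, no statement grammar, no indentation sensitivity to get wrong, which is what makes the format hard for an LLM to break syntactically.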
```lisp
(api :method POST :path "/users/me/avatar"
     :input (file :max-size "5MB"
                  :types ("image/png" "image/jpeg"))
     :output (json :schema UserAvatar)
     :auth required
     :rate-limit "10/min")
```
Stage 2 — Typed IR Validation
S-expressions are lowered to a typed intermediate representation defined as Rust enums. Each enum variant is a whitelist entry. Anything not in the whitelist is rejected with a structured error designed for LLM self-correction:
```rust
enum HttpMethod { Get, Post, Put, Delete }        // 4 variants
enum InputSpec { Json, File, Query, None }        // 4 variants
enum OutputSpec { Json, Text, NoContent }         // 3 variants
enum AuthRequirement { Required, Optional, None } // 3 variants
```
Total valid configuration space: 4 × 4 × 3 × 3 = 144 base combinations. Finite, enumerable, exhaustively testable. When the LLM outputs `:method PATCH`, the IR layer rejects it immediately with `expected: ["GET", "POST", "PUT", "DELETE"]` — not a Rust compiler stack trace.
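A minimal sketch of this whitelist lowering might look as follows. The error shape and function names here are illustrative, not the project's real API; only the enum variants and the `expected: [...]` idea come from the article.

```rust
#[derive(Debug, PartialEq)]
enum HttpMethod { Get, Post, Put, Delete }

// Structured rejection designed to be fed straight back to the LLM
// for self-correction, rather than a compiler stack trace.
#[derive(Debug, PartialEq)]
struct ValidationError {
    field: &'static str,
    got: String,
    expected: &'static [&'static str],
}

fn lower_method(atom: &str) -> Result<HttpMethod, ValidationError> {
    match atom {
        "GET" => Ok(HttpMethod::Get),
        "POST" => Ok(HttpMethod::Post),
        "PUT" => Ok(HttpMethod::Put),
        "DELETE" => Ok(HttpMethod::Delete),
        other => Err(ValidationError {
            field: ":method",
            got: other.to_string(),
            expected: &["GET", "POST", "PUT", "DELETE"],
        }),
    }
}

fn main() {
    // An out-of-whitelist value fails immediately with the allowed set.
    let err = lower_method("PATCH").unwrap_err();
    println!("{}: got {:?}, expected {:?}", err.field, err.got, err.expected);
}
```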
Stage 3 — Deterministic Code Generation
Every valid IR variant maps to exactly one pre-verified code template. The generator performs lookup, not creation. Same input always produces the same output. Rust's exhaustive match ensures every IR variant has a corresponding template — the compiler itself guarantees completeness.
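The lookup-not-creation step can be sketched like this. The template strings are placeholders standing in for the real pre-verified axum fragments, which the article does not reproduce.

```rust
#[derive(Debug)]
enum OutputSpec { Json, Text, NoContent }

// Exhaustive match over the IR: adding a variant to OutputSpec
// without a corresponding arm here is a compile-time error, so the
// Rust compiler itself guarantees template coverage.
fn response_template(out: &OutputSpec) -> &'static str {
    match out {
        OutputSpec::Json => "axum::Json(body)",
        OutputSpec::Text => "(StatusCode::OK, body)",
        OutputSpec::NoContent => "StatusCode::NO_CONTENT",
    }
}

fn main() {
    // Pure lookup: same input, same output, every time.
    println!("{}", response_template(&OutputSpec::Json));
}
```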
Results
| Test Case | Pipeline | Raw LLM |
|---|---|---|
| simple_health | Pass | Pass |
| crud_users | Pass | Pass |
| file_upload | Pass | Fail (hallucinated crate) |
| stateful_api | Pass | Fail (deprecated API) |
| mixed_io | Fail | Pass |
| auth_variants | Fail | Fail |
| rate_limited | Pass | Pass |
| complex_state | Pass | Pass |
Pipeline: 75% Pass@1 (6/8). Raw Claude Opus: 62% (5/8).
The critical difference is in failure modes. Raw LLM failures are stochastic hallucinations — importing crates that don't exist, using APIs removed three versions ago. You can't fix these without changing the model. Pipeline failures are deterministic engineering bugs — edge cases in template generation. Fixable by adding a template, without touching the architecture.
The Intent-Reality-Delta Loop
Neural Codegen is the code generation engine. In my development system (Jarvis), it's embedded in a continuous closed loop:
- Intent — The architecture is declared in `intent.lisp`, a 2,700-line S-expression specification of what the system should be
- Reality — Tree-sitter extracts the actual codebase AST into an S-expression snapshot every 3 seconds
- Delta — A pure deterministic algorithm (no LLM) compares intent vs reality, producing typed deltas: `ImplementationGap`, `ArchitecturalDrift`, `LocationMismatch`
- Repair — Deltas dispatch to a task board. The LLM generates S-expression patches. Neural Codegen compiles them to verified code. The cycle repeats until zero deltas remain.
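The Delta step can be pictured as a plain set comparison. The variant names come from the article; the string-keyed representation and `diff` function are a simplification (`LocationMismatch` is omitted, since it would need positional data this sketch does not model).

```rust
use std::collections::BTreeSet;

// Declared-but-missing vs present-but-undeclared, as typed deltas.
#[derive(Debug, PartialEq)]
enum Delta {
    ImplementationGap(String),  // in intent, missing from reality
    ArchitecturalDrift(String), // in reality, never declared in intent
}

// Pure, deterministic: no LLM involved, same inputs -> same deltas.
fn diff(intent: &BTreeSet<String>, reality: &BTreeSet<String>) -> Vec<Delta> {
    intent
        .difference(reality)
        .map(|i| Delta::ImplementationGap(i.clone()))
        .chain(
            reality
                .difference(intent)
                .map(|r| Delta::ArchitecturalDrift(r.clone())),
        )
        .collect()
}

fn main() {
    let intent: BTreeSet<String> =
        ["GET /health", "POST /users"].iter().map(|s| s.to_string()).collect();
    let reality: BTreeSet<String> =
        ["GET /health", "GET /debug"].iter().map(|s| s.to_string()).collect();
    for delta in diff(&intent, &reality) {
        println!("{:?}", delta);
    }
}
```

Because the comparison is a pure function over two snapshots, it can run on every Reality refresh without any model call in the loop.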
The boundary is sharp: AI handles creative work (understanding intent, proposing repairs). Deterministic algorithms handle critical work (drift detection, validation, code generation). The typed S-expression is the interface between them.
Why S-Expressions?
This isn't an arbitrary syntax choice. It's a consequence of what LLMs actually are.
LLMs are stateless transformers — they perform Lambda Calculus, not Turing computation. They have no mutable tape, no program counter. Yet we ask them to output imperative code: mutable variables, control flow, memory management. We're asking a Lambda machine to pretend to be a Turing machine. The mismatch causes hallucinations.
S-expressions are the natural output format of a Lambda engine. Pure data — no mutation, no side effects, no state. The LLM declares what should exist. The deterministic engine handles how it compiles.
Honest Limitations
- Expressivity ceiling — The IR strips Turing-completeness. You can't express arbitrary algorithms (sorting, graph traversal). Only structural code patterns.
- Compilation ≠ correctness — The guarantee is structural. Code compiles, but business logic must still be validated through testing.
- Single target — Only Rust (axum) backend exists today. The architecture is language-agnostic in theory.
- State space growth — 144 base configurations is small. Real-world systems need thousands. Scaling the IR without losing the finite-verifiable property is the main open problem.
The Thesis
We've spent 2023–2025 making models bigger, training data larger, context windows wider. The hallucination rate barely moved. The reason: we're optimizing the wrong variable.
The moat in agentic software development is not the model. It's the harness. The same Claude Opus swings from 62% to 75% Pass@1 based purely on surrounding constraints. No fine-tuning, no RLHF, no extra training data. Just a 150-line parser, a whitelist of Rust enums, and a deterministic template engine.
The model is the engine. The harness is the track. Without the track, even the most powerful engine drives into walls.
Citation
Jin, R. (2026). Neuro-Symbolic Code Generation via S-Expression Intermediate Representation and Deterministic Harness Engineering. Zenodo. https://doi.org/10.5281/zenodo.19372158