Neural Codegen

Neuro-Symbolic Code Generation via S-Expression Intermediate Representation and Deterministic Harness Engineering

Ruoqi Jin · April 2026 · Preprint

The Problem

Every AI coding tool today follows the same loop: prompt the LLM, get code back, try to compile it, feed the error back, retry. With Rust, a single endpoint handler must simultaneously satisfy the type system, borrow checker, async semantics, error handling conventions, and project-specific middleware. The probability of getting every dimension correct on the first pass is near zero.

The result is that Claude Opus — one of the most capable models available — achieves only 62% Pass@1 compilation when generating Rust API handlers directly. The remaining 38% are stochastic failures: hallucinated crate imports, deprecated API calls, lifetime errors. Each failure requires another round-trip through the model, burning tokens and time.

The Insight: GPU Mode

GPUs don't think. They execute predefined shader programs from a fixed instruction set. They can't invent new instructions — and that's exactly why they're reliable. No GPU has ever “hallucinated” a pixel.

Neural Codegen applies the same principle to code generation. Instead of asking the LLM to create code (CPU mode — Turing-complete, unbounded, unreliable), we ask it to select from a finite, pre-verified instruction set (GPU mode). The LLM fills out a structured form in S-expressions. A deterministic engine validates and assembles the final code.

Architecture: Three Stages

Stage 1 — S-Expression Parser

The LLM outputs S-expressions — the simplest possible syntax. Only atoms and parenthesized lists. Zero parsing ambiguity, impossible to break syntactically. The parser is 150 lines of Rust with zero dependencies.

(api :method POST :path "/users/me/avatar"
     :input (file :max-size "5MB"
                  :types ("image/png" "image/jpeg"))
     :output (json :schema UserAvatar)
     :auth required
     :rate-limit "10/min")
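The parser itself can be sketched in a few dozen lines. The following is an illustrative minimal version, not the paper's actual implementation — names like Sexp and parse are assumptions, and string literals containing spaces are not handled:

```rust
// Minimal S-expression parser sketch: only atoms and parenthesized lists.
// Illustrative only; the real parser is ~150 lines with proper error spans.

#[derive(Debug, PartialEq)]
enum Sexp {
    Atom(String),
    List(Vec<Sexp>),
}

// Split source into parens and whitespace-delimited atoms.
fn tokenize(src: &str) -> Vec<String> {
    src.replace('(', " ( ")
        .replace(')', " ) ")
        .split_whitespace()
        .map(String::from)
        .collect()
}

// Recursive-descent parse over the token stream.
fn parse_tokens(tokens: &[String], pos: &mut usize) -> Option<Sexp> {
    let tok = tokens.get(*pos)?.clone();
    *pos += 1;
    if tok == "(" {
        let mut items = Vec::new();
        while tokens.get(*pos)? != ")" {
            items.push(parse_tokens(tokens, pos)?);
        }
        *pos += 1; // consume ")"
        Some(Sexp::List(items))
    } else if tok == ")" {
        None // unbalanced close paren
    } else {
        Some(Sexp::Atom(tok))
    }
}

fn parse(src: &str) -> Option<Sexp> {
    let tokens = tokenize(src);
    let mut pos = 0;
    parse_tokens(&tokens, &mut pos)
}
```

Because the grammar has exactly two productions, there is no ambiguity for the LLM to exploit: any output either parses or fails immediately at a known position.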

Stage 2 — Typed IR Validation

S-expressions are lowered to a typed intermediate representation defined as Rust enums. Each enum variant is a whitelist entry. Anything not in the whitelist is rejected with a structured error designed for LLM self-correction:

enum HttpMethod  { Get, Post, Put, Delete }       // 4 variants
enum InputSpec   { Json, File, Query, None }       // 4 variants
enum OutputSpec  { Json, Text, NoContent }         // 3 variants
enum AuthRequirement { Required, Optional, None }  // 3 variants

Total valid configuration space: 4 × 4 × 3 × 3 = 144 base combinations. Finite, enumerable, exhaustively testable. When the LLM outputs :method PATCH, the IR layer rejects it immediately with expected: ["GET", "POST", "PUT", "DELETE"] — not a Rust compiler stack trace.
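The lowering step for a single field can be sketched as follows. The error shape mirrors the structured rejection described above, but the exact type and field names here are assumptions:

```rust
// Hypothetical sketch of whitelist lowering for one IR field.
// Anything outside the enum's variants is rejected with a machine-readable
// error listing the valid alternatives, so the LLM can self-correct.

#[derive(Debug, PartialEq)]
enum HttpMethod { Get, Post, Put, Delete }

#[derive(Debug, PartialEq)]
struct ValidationError {
    field: &'static str,
    got: String,
    expected: &'static [&'static str],
}

fn lower_method(atom: &str) -> Result<HttpMethod, ValidationError> {
    match atom {
        "GET" => Ok(HttpMethod::Get),
        "POST" => Ok(HttpMethod::Post),
        "PUT" => Ok(HttpMethod::Put),
        "DELETE" => Ok(HttpMethod::Delete),
        other => Err(ValidationError {
            field: ":method",
            got: other.to_string(),
            expected: &["GET", "POST", "PUT", "DELETE"],
        }),
    }
}
```

The contrast with the raw-LLM loop is that the feedback arrives before any Rust compilation happens, and it names the full set of valid choices rather than describing a symptom.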

Stage 3 — Deterministic Code Generation

Every valid IR variant maps to exactly one pre-verified code template. The generator performs lookup, not creation. Same input always produces the same output. Rust's exhaustive match ensures every IR variant has a corresponding template — the compiler itself guarantees completeness.
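The lookup-not-creation principle can be illustrated with a toy fragment. The template strings below are placeholders, not the actual generated code:

```rust
// Sketch of deterministic template lookup: each IR variant maps to exactly
// one pre-verified snippet via an exhaustive `match`. Adding a variant
// without a corresponding arm is a compile error, so completeness is
// enforced by rustc itself.

enum OutputSpec { Json, Text, NoContent }

fn response_template(out: &OutputSpec) -> &'static str {
    match out {
        OutputSpec::Json => "axum::Json(body)",
        OutputSpec::Text => "body.to_string()",
        OutputSpec::NoContent => "StatusCode::NO_CONTENT",
    }
}
```

Because there is no string interpolation of LLM output into the templates, the generator cannot emit code that was not reviewed in advance.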

Results

Test Case        Pipeline   Raw LLM
simple_health    Pass       Pass
crud_users       Pass       Pass
file_upload      Pass       Fail (hallucinated crate)
stateful_api     Pass       Fail (deprecated API)
mixed_io         Fail       Pass
auth_variants    Fail       Fail
rate_limited     Pass       Pass
complex_state    Pass       Pass

Pipeline: 75% Pass@1 (6/8). Raw Claude Opus: 62% (5/8).

The critical difference is in failure modes. Raw LLM failures are stochastic hallucinations — importing crates that don't exist, using APIs removed three versions ago. You can't fix these without changing the model. Pipeline failures are deterministic engineering bugs — edge cases in template generation. Fixable by adding a template, without touching the architecture.

The Intent-Reality-Delta Loop

Neural Codegen is the code generation engine. In my development system (Jarvis), it's embedded in a continuous closed loop:

  1. Intent — The architecture is declared in intent.lisp, a 2,700-line S-expression specification of what the system should be
  2. Reality — Tree-sitter extracts the actual codebase AST into an S-expression snapshot every 3 seconds
  3. Delta — A pure deterministic algorithm (no LLM) compares intent vs reality, producing typed deltas: ImplementationGap, ArchitecturalDrift, LocationMismatch
  4. Repair — Deltas dispatch to a task board. The LLM generates S-expression patches. Neural Codegen compiles them to verified code. The cycle repeats until zero deltas remain.
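The delta step (stage 3 of the loop) is pure set comparison, which is why it needs no LLM. A minimal sketch, assuming intent and reality are each reduced to a set of declared items — the struct names echo the delta kinds above, but the implementation is illustrative, not the Jarvis code:

```rust
// Illustrative delta detection: compare declared intent against the
// observed reality snapshot and emit typed deltas. Deterministic:
// same inputs always yield the same deltas, in the same order.

use std::collections::BTreeSet;

#[derive(Debug, PartialEq)]
enum Delta {
    ImplementationGap(String),   // declared in intent, absent in reality
    ArchitecturalDrift(String),  // present in reality, never declared
}

fn diff(intent: &BTreeSet<String>, reality: &BTreeSet<String>) -> Vec<Delta> {
    let mut deltas = Vec::new();
    for item in intent.difference(reality) {
        deltas.push(Delta::ImplementationGap(item.clone()));
    }
    for item in reality.difference(intent) {
        deltas.push(Delta::ArchitecturalDrift(item.clone()));
    }
    deltas
}
```

Using ordered sets (BTreeSet) rather than hash sets keeps the output stable across runs, which matters when deltas feed a task board.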

The boundary is sharp: AI handles creative work (understanding intent, proposing repairs). Deterministic algorithms handle critical work (drift detection, validation, code generation). The typed S-expression is the interface between them.

Why S-Expressions?

This isn't an arbitrary syntax choice. It's a consequence of what LLMs actually are.

LLMs are stateless transformers — they perform Lambda Calculus, not Turing computation. They have no mutable tape, no program counter. Yet we ask them to output imperative code: mutable variables, control flow, memory management. We're asking a Lambda machine to pretend to be a Turing machine. The mismatch causes hallucinations.

S-expressions are the natural output format of a Lambda engine. Pure data — no mutation, no side effects, no state. The LLM declares what should exist. The deterministic engine handles how it compiles.

Honest Limitations

  • Expressivity ceiling — The IR strips Turing-completeness. You can't express arbitrary algorithms (sorting, graph traversal). Only structural code patterns.
  • Compilation ≠ correctness — The guarantee is structural. Code compiles, but business logic must still be validated through testing.
  • Single target — Only Rust (axum) backend exists today. The architecture is language-agnostic in theory.
  • State space growth — 144 base configurations is small. Real-world systems need thousands. Scaling the IR without losing the finite-verifiable property is the main open problem.

The Thesis

We've spent 2023–2025 making models bigger, training data larger, context windows wider. The hallucination rate barely moved. The reason: we're optimizing the wrong variable.

The moat in agentic software development is not the model. It's the harness. The same Claude Opus swings from 62% to 75% Pass@1 based purely on surrounding constraints. No fine-tuning, no RLHF, no extra training data. Just a 150-line parser, a whitelist of Rust enums, and a deterministic template engine.

The model is the engine. The harness is the track. Without the track, even the most powerful engine drives into walls.

Citation

Jin, R. (2026). Neuro-Symbolic Code Generation via S-Expression Intermediate Representation and Deterministic Harness Engineering. Zenodo. https://doi.org/10.5281/zenodo.19372158
