TL;DR: Spec in → verified, working app out. On Feb 2nd our agents autonomously solved a hard problem through multiple implement/critique/judge cycles with zero human help. We can now watch specs compile into code in real-time. Demo in 2-3 weeks. 10-17 years runway.
Apologies for missing January - we were heads-down trying to ship and kept thinking "one more day and we can show something even more interesting." Bad habit, working on it. But hopefully this update and the attached screenshots give a good glimpse into what the system is becoming capable of.
Quick synopsis: We missed the Dec 22 holiday launch but got critical learnings on unstructured doc editing. We then pivoted to spec → coding agent compilation, and on Feb 2nd saw the first autonomous implementer/critic/judge cycles complete without human intervention. We now have a live report viewer and are 2-3 weeks from an end-to-end demo.


December 2025
- Cash: $5,679,219
- Burn (gross): $33,878
- Treasury yield (Dreyfus DGVXX @ ~2.9% APY): +$13,018
- Burn (net): $20,860
- Note: Partial month - Edu started Dec 15, no health insurance yet
January 2026
- Cash: $5,651,519
- Burn (gross): $44,145
- Treasury yield (Dreyfus DGVXX @ ~3.5% APY): +$16,446
- Burn (net): $27,699
- Runway: 10-17 years
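The 10-17 year range is just cash divided by monthly burn, bracketed by gross burn (pessimistic: no treasury yield) and net burn (optimistic). A minimal sketch with the January numbers above:

```python
# Runway = cash / monthly burn, expressed in years.
def runway_years(cash: float, monthly_burn: float) -> float:
    return cash / monthly_burn / 12

cash = 5_651_519     # January 2026 cash
gross_burn = 44_145  # January gross burn (no yield)
net_burn = 27_699    # January net burn (after treasury yield)

low = runway_years(cash, gross_burn)   # ~10.7 years
high = runway_years(cash, net_burn)    # ~17.0 years
print(f"Runway: {low:.1f}-{high:.1f} years")
```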
Incoming since Dec 1:
- SV Angel X, L.P.: $150,000
- Pioneer Fund III: $100,000
- Bengler AS: $9,988
- Treasury dividends: $29,464
- Total: +$289,452
Timeline:
- Dec 22 (missed): Interactive holiday storybook - pared down from original mini-Linzumi games idea
- Mid-Jan: Wrapped learnings on unstructured document editing, pivoted to spec compilation
- Late Jan: Implemented full Implementer → Critic → Judge cycle with container isolation
- Feb 2: First autonomous success - agent solved screenshot generation without human help
- Now: Live report viewer, preparing for end-to-end demo
We built an interactive storybook creator where the entire story spec (characters, scenes, narration) lives in a single markdown file edited via LLM. Key challenges solved:
- Catastrophic edit prevention: How to make LLM edits reliable without corrupting the document
- Parallelized generation: DAG-based planning for simultaneous character art, scene images, and audio narration
- Model routing: Balancing intelligence (Claude), speed (Groq/Cerebras), and creativity (Sonnet for prose) dynamically
- Zero parsing: all UI is rendered from structure the LLM extracts from the markdown - no hand-written parsers
The storybook generates genuinely good narrated stories. We'll return to ship it once we have better agents to implement the remaining pieces.
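The parallelized generation above can be sketched as a small DAG scheduler: tasks declare their dependencies, and anything whose dependencies are satisfied runs concurrently. This is a hypothetical sketch (task names and structure are illustrative, not the actual system):

```python
import asyncio

# Illustrative DAG: character art and narration are independent,
# scene images reuse the character art, assembly waits on both.
DAG = {
    "character_art": [],
    "scene_images": ["character_art"],
    "narration_audio": [],
    "final_assembly": ["scene_images", "narration_audio"],
}

async def run_task(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for an LLM / image / TTS call
    return name

async def run_dag(dag: dict[str, list[str]]) -> list[list[str]]:
    done: set[str] = set()
    waves: list[list[str]] = []
    pending = dict(dag)
    while pending:
        # Every task whose dependencies are all done runs in parallel.
        ready = [t for t, deps in pending.items() if set(deps) <= done]
        if not ready:
            raise ValueError("dependency cycle in DAG")
        results = await asyncio.gather(*(run_task(t) for t in ready))
        done.update(results)
        waves.append(sorted(results))
        for t in ready:
            del pending[t]
    return waves

# asyncio.run(run_dag(DAG)) executes in three waves:
# [character_art, narration_audio] -> [scene_images] -> [final_assembly]
```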
Scaling exposed major reliability issues. Gemini 2.0 Flash was ideal on paper but hung constantly at scale. We built evals to compare models systematically:
- Winner: Groq - Very reliable, very fast
- Runner-up: Cerebras - Faster but less reliable (talking to their team)
- Both running Llama 3.3 70B for mid-intelligence, ultra-low-latency structured outputs
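The reliability eval boils down to firing many identical structured-output requests at a provider and tracking failure rate and latency. A minimal sketch of that harness (`call_model` is a stand-in for the real client, not an actual API):

```python
import time

# Hypothetical eval harness: run N identical requests against a
# provider, count failures (timeouts, hangs, malformed output),
# and record the median latency of the successes.
def eval_provider(call_model, n=100, timeout_s=10.0):
    failures, latencies = 0, []
    for _ in range(n):
        start = time.monotonic()
        try:
            call_model(timeout=timeout_s)
            latencies.append(time.monotonic() - start)
        except Exception:
            failures += 1
    return {
        "failure_rate": failures / n,
        "p50_latency_s": sorted(latencies)[len(latencies) // 2]
                         if latencies else None,
    }
```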
We took an underspecified Hello World React/TypeScript app and tried to compile it via coding agents (Codex CLI, Claude Code). Initial results were mediocre, which led to:
- Implementer/Critic/Judge cycle: Three-agent loop where work is reviewed and rejected until correct
- Prompt optimization: Agents are extremely sensitive to prompting. Built evals to systematically improve prompts.
- Meta-spec: Wrote a "spec about how to write specs" and had Claude interview us to produce higher-quality input specs
- Container isolation: Every coding agent has sandbox quirks (Codex can't launch Chrome on macOS, etc.). Moved to containers for reliability.
A Hello World spec now compiles to several hundred tasks organized into a DAG, with each task going through the implement/critique/judge cycle.
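The three-agent cycle each task goes through looks roughly like this in pseudocode (a minimal sketch; names and signatures are ours, not the actual system):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    output: str
    evidence: list[str]

# Implementer produces work, critic reviews it, judge decides whether
# the critique stands. Rejected work goes back to the implementer with
# the critique attached, up to a retry budget.
def run_cycle(task, implement, critique, judge, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        attempt = implement(task, feedback)
        review = critique(task, attempt)
        if judge(task, attempt, review) == "approve":
            return attempt
        feedback = review  # rejected: retry with the critique as context
    raise RuntimeError(f"no approved attempt after {max_rounds} rounds")
```

The key property is that the judge arbitrates between implementer and critic, so the loop terminates on agreement rather than on the implementer declaring itself done.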
The spec required screenshot evidence to prove the UI was correct. The implementer hit real problems with Playwright in our container environment. After several cycles:
- Implementer tried workarounds
- Critic rejected insufficient evidence
- Judge agreed with Critic
- Implementer eventually figured out a working solution
All without human intervention. We watched the evidence appear, proving the app was running correctly.

Several hundred tasks produce too much evidence for humans to review manually. We built a report viewer with progressive disclosure:


Features:
- Click any spec section to see granular evidence
- View the full discussion between agents
- See evidence presented, rejected, and approved
- Build trust by drilling down, then zoom out once confident


Live monitoring: Watch the spec being implemented in real-time with a color-coded progress heat map.

Learnings:
- Unstructured ↔ structured interleaving is the sweet spot. Specs are unstructured, compile to structured DAGs, individual tasks produce structured evidence, which informs unstructured discussion.
- Specs aren't enough - you need the motivating conversations. When modifying a spec, you need access to why it's currently written that way. We're keeping conversation history as context for future edits.
- Agents should log ambiguous decisions. Every spec has gaps. It's fine for agents to fill them, but they must surface what decisions they made so humans can course-correct.
- Imperative → declarative is the UX. Saying "make this bigger" is natural. Having that automatically update the spec so future regenerations incorporate it - that's the magic.
- Model reliability varies wildly at scale. Always eval before committing to a model for production use.
Next up:
- Spec authoring experience: Good UX for creating and editing specs
- Socratic partner: Finds gaps, contradictions, and ambiguities in specs before compilation
- Decision capture: Surfaces agent decisions for human review/override
- Cloud infrastructure: Scale beyond local hardware
- End-to-end demo: Hello World → Todo MVC level apps, fully autonomous
Asks: none specific right now.
But: If any of you would like to try building a small app with us in the next couple weeks, we'd love to walk you through the experience and get feedback on what's useful, what's missing, and how it compares to your current workflow.
Edu has been fantastic. We're fully in sync now and moving fast.
- Sean Grove - Product, agent orchestration, infrastructure
- Eduardo Rafael - Compiler/PL, spec compilation, verification system
Best, Sean
P.S. As always, if you'd like fewer/more/different updates, just reply and let me know.