
Linzumi February Update (Dec + Jan catch-up)

TL;DR: Spec in → verified, working app out. On Feb 2nd our agents autonomously solved a hard problem through multiple implement/critique/judge cycles with zero human help. We can now watch specs compile into code in real-time. Demo in 2-3 weeks. 10-17 years runway.


Apologies for missing January - we were heads-down trying to ship and kept thinking "one more day and we can show something even more interesting." Bad habit, working on it. But hopefully this update and the attached screenshots give a good glimpse into what the system is becoming capable of.

Quick synopsis: We missed the Dec 22 holiday launch but got critical learnings on unstructured doc editing. We then pivoted to spec → coding agent compilation, and on Feb 2nd saw the first autonomous implementer/critic/judge cycles complete without human intervention. We now have a live report viewer and are 2-3 weeks from an end-to-end demo.


Financials

[Chart: Runway (months), actual and projected, next 6 months]

[Chart: Cash in bank, actual and projected, next 6 months]

December 2025

  • Cash: $5,679,219
  • Burn (gross): $33,878
  • Treasury yield (Dreyfus DGVXX @ ~2.9% APY): +$13,018
  • Burn (net): $20,860
  • Note: Partial month - Edu started Dec 15, no health insurance yet

January 2026

  • Cash: $5,651,519
  • Burn (gross): $44,145
  • Treasury yield (Dreyfus DGVXX @ ~3.5% APY): +$16,446
  • Burn (net): $27,699
  • Runway: 10-17 years (cash / gross burn ≈ 128 months ≈ 10.7 years; cash / net burn ≈ 204 months ≈ 17 years)

Incoming since Dec 1:

  • SV Angel X, L.P.: $150,000
  • Pioneer Fund III: $100,000
  • Bengler AS: $9,988
  • Treasury dividends: $29,464
  • Total: +$289,452

Timeline

  • Dec 22 (missed): Interactive holiday storybook - pared down from the original mini-Linzumi games idea
  • Mid-Jan: Wrapped learnings on unstructured document editing, pivoted to spec compilation
  • Late Jan: Implemented full Implementer → Critic → Judge cycle with container isolation
  • Feb 2: First autonomous success - agent solved screenshot generation without human help
  • Now: Live report viewer, preparing for end-to-end demo

What We Built

1. Unstructured Document Editing (Dec-Jan)

We built an interactive storybook creator where the entire story spec (characters, scenes, narration) lives in a single markdown file edited via LLM. Key challenges solved:

  • Catastrophic edit prevention: making LLM edits reliable without corrupting the document
  • Parallelized generation: DAG-based planning so character art, scene images, and audio narration generate simultaneously (see the sketch after this list)
  • Model routing: dynamically balancing intelligence (Claude), speed (Groq/Cerebras), and creativity (Sonnet for prose)
  • Zero parsing: all UI structure is extracted from the markdown by the LLM itself, with no hand-written parser
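
For a concrete flavor of the DAG scheduling, here's a minimal TypeScript sketch; the task names and stubs are illustrative, not our production planner:

```typescript
// Minimal DAG scheduler sketch: a task starts the moment all of its
// dependencies have finished, so independent tasks run concurrently.
type Task = { id: string; deps: string[]; run: () => Promise<void> };

async function runDag(tasks: Task[]): Promise<void> {
  const started = new Map<string, Promise<void>>();
  const start = (id: string): Promise<void> => {
    let p = started.get(id);
    if (!p) {
      const t = tasks.find((task) => task.id === id)!;
      p = Promise.all(t.deps.map(start)).then(() => t.run());
      started.set(id, p);
    }
    return p;
  };
  await Promise.all(tasks.map((t) => start(t.id)));
}

// Illustrative plan: character art and narration are independent and run
// in parallel; scene images wait on character art.
const stub = (label: string) => async () => console.log(`generated ${label}`);
await runDag([
  { id: "character-art", deps: [], run: stub("character art") },
  { id: "narration", deps: [], run: stub("audio narration") },
  { id: "scene-images", deps: ["character-art"], run: stub("scene images") },
]);
```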

The storybook generates genuinely good narrated stories. We'll return to ship it once we have better agents to implement the remaining pieces.

2. Model Reliability Evals (Jan)

Scaling exposed major reliability issues: Gemini Flash 2.0 was ideal on paper but hung constantly at scale. We built evals to compare models systematically (a sketch of the harness follows this list):

  • Winner: Groq - Very reliable, very fast
  • Runner-up: Cerebras - Faster but less reliable (talking to their team)
  • Both running Llama 3.3 70B for mid-intelligence, ultra-low-latency structured outputs
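
The harness itself is simple; here's a sketch of its shape, where `callModel` is a hypothetical wrapper around a provider SDK that honors an abort signal:

```typescript
// Reliability eval sketch: hammer a provider with repeated
// structured-output requests under a hard timeout, then compare
// success counts and median latency. `callModel` is hypothetical.
type EvalResult = { provider: string; ok: number; failedOrHung: number; p50Ms: number };

async function evalProvider(
  provider: string,
  callModel: (provider: string, signal: AbortSignal) => Promise<unknown>,
  runs = 100,
  timeoutMs = 10_000,
): Promise<EvalResult> {
  const latencies: number[] = [];
  let ok = 0;
  let failedOrHung = 0;
  for (let i = 0; i < runs; i++) {
    const ctrl = new AbortController();
    const timer = setTimeout(() => ctrl.abort(), timeoutMs);
    const t0 = Date.now();
    try {
      await callModel(provider, ctrl.signal); // one structured-output request
      latencies.push(Date.now() - t0);
      ok++;
    } catch {
      failedOrHung++; // timeouts (hangs) and API errors both count against reliability
    } finally {
      clearTimeout(timer);
    }
  }
  latencies.sort((a, b) => a - b);
  return { provider, ok, failedOrHung, p50Ms: latencies[latencies.length >> 1] ?? NaN };
}
```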

3. Spec → Code Compilation (Jan-Feb)

We took an underspecified Hello World React/TypeScript app and tried to compile it via coding agents (Codex CLI, Claude Code). Initial results were mediocre, which led to:

  • Implementer/Critic/Judge cycle: Three-agent loop where work is reviewed and rejected until correct
  • Prompt optimization: Agents are extremely sensitive to prompting. Built evals to systematically improve prompts.
  • Meta-spec: Wrote a "spec about how to write specs" and had Claude interview us to produce higher-quality input specs
  • Container isolation: Every coding agent has sandbox quirks (Codex can't launch Chrome on macOS, etc.). Moved to containers for reliability.

A Hello World spec now compiles to several hundred tasks organized into a DAG, with each task going through the implement/critique/judge cycle.
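
The control flow of the cycle is easy to sketch; the three agent calls below are stand-ins (in our system they're coding agents running in isolated containers):

```typescript
// One task's Implementer -> Critic -> Judge loop. Work is redone with
// the reviewers' feedback until the Judge issues a PASS verdict.
type Verdict = { pass: boolean; feedback: string };

async function runTask(
  spec: string,
  implement: (spec: string, feedback?: string) => Promise<string>, // returns evidence
  critique: (spec: string, evidence: string) => Promise<Verdict>,
  judge: (spec: string, evidence: string, crit: Verdict) => Promise<Verdict>,
  maxCycles = 5,
): Promise<string> {
  let feedback: string | undefined;
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const evidence = await implement(spec, feedback);   // do (or redo) the work
    const crit = await critique(spec, evidence);        // review the evidence
    const verdict = await judge(spec, evidence, crit);  // settle the dispute
    if (verdict.pass) return evidence;                  // verified: task done
    feedback = `${crit.feedback}\n${verdict.feedback}`; // retry with notes
  }
  throw new Error("Max cycles exceeded; escalate to a human");
}
```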

4. First Autonomous Success (Feb 2)

The spec required screenshot evidence to prove the UI was correct. The implementer hit real problems with Playwright in our container environment. After several cycles:

  1. Implementer tried workarounds
  2. Critic rejected insufficient evidence
  3. Judge agreed with Critic
  4. Implementer eventually figured out a working solution

All without human intervention. We watched the evidence appear, proving the app was running correctly.

[Screenshot: Critic reviewing evidence and issuing a PASS verdict with verification badges]

5. Live Report Viewer (Now)

Several hundred tasks means too much evidence for humans to review manually. We built a report viewer with progressive disclosure:

[Screenshot: Full report dashboard with spec sections, verification chips, and live timeline]

[Screenshot: Visual status map with 88 pass (green), 47 fail (red), 7 in progress, 127 pending]

Features:

  • Click any spec section to see granular evidence
  • View the full discussion between agents
  • See evidence presented, rejected, and approved
  • Build trust by drilling down, then zoom out once confident

[Screenshot: The Implementer thinking in real-time with streaming output]

[Screenshot: The Critic following up with a live review of the Implementer's work]

Live monitoring: Watch the spec being implemented in real-time with a color-coded progress heat map.

[Screenshot: Dark mode with running tasks, evidence coverage metrics, and spec table of contents]


Key Learnings

  1. Unstructured ↔ structured interleaving is the sweet spot. Specs are unstructured; they compile into structured DAGs; individual tasks produce structured evidence; and that evidence feeds back into unstructured discussion.

  2. Specs aren't enough - you need the motivating conversations. When modifying a spec, you need access to why it's currently written that way. We're keeping conversation history as context for future edits.

  3. Agents should log ambiguous decisions. Every spec has gaps. It's fine for agents to fill them, but they must surface what decisions they made so humans can course-correct (see the sketch after this list).

  4. Imperative → declarative is the UX. Saying "make this bigger" is natural. Having that automatically update the spec so future regenerations incorporate it - that's the magic.

  5. Model reliability varies wildly at scale. Always eval before committing to a model for production use.
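
To make #3 concrete, here's a sketch of what a decision log entry could look like; the field names and the example are hypothetical, not a fixed schema:

```typescript
// Hypothetical decision log: agents record every gap they fill so a
// human can review, accept, or override the choice later.
type Decision = {
  taskId: string;
  gap: string;       // what the spec left ambiguous
  choice: string;    // how the agent filled the gap
  rationale: string; // why, so a human can course-correct
};

const decisionLog: Decision[] = [];

function logDecision(d: Decision): void {
  decisionLog.push(d); // surfaced in the report viewer for review
}

// Hypothetical example entry:
logDecision({
  taskId: "hello-world-ui",
  gap: "Spec doesn't specify the button's color",
  choice: "Used the framework's default theme color",
  rationale: "No branding guidance in the spec; trivial to override later",
});
```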


What's Next (2-3 weeks)

  • Spec authoring experience: Good UX for creating and editing specs
  • Socratic partner: Finds gaps, contradictions, and ambiguities in specs before compilation
  • Decision capture: Surfaces agent decisions for human review/override
  • Cloud infrastructure: Scale beyond local hardware
  • End-to-end demo: Hello World → Todo MVC level apps, fully autonomous

Asks

None specific right now.

But: If any of you would like to try building a small app with us in the next couple weeks, we'd love to walk you through the experience and get feedback on what's useful, what's missing, and how it compares to your current workflow.


Team

Edu has been fantastic. We're fully in sync now and moving fast.

  • Sean Grove - Product, agent orchestration, infrastructure
  • Eduardo Rafael - Compiler/PL, spec compilation, verification system

Best, Sean

P.S. As always, if you'd like fewer/more/different updates, just reply and let me know.
