TL;DR: Spec in → verified, working app out. On Feb 2nd our agents autonomously solved a hard problem through multiple implement/critique/judge cycles with zero human help. We can now watch specs compile into code in real-time. Demo in 2-3 weeks. 10-17 years runway.
Apologies for missing January - we were heads-down trying to ship and kept thinking "one more day and we can show something even more interesting." Bad habit, working on it. But hopefully this update and the attached screenshots give a good glimpse into what the system is becoming capable of.
Quick synopsis: We missed the Dec 22 holiday launch but got critical learnings on unstructured doc editing. We then pivoted to spec → coding agent compilation, and on Feb 2nd saw the first autonomous implementer/critic/judge cycles complete without human intervention. We now have a live report viewer and are 2-3 weeks from an end-to-end demo.


December 2025
- Cash: $5,679,219
- Burn (gross): $33,878
- Treasury yield (Dreyfus DGVXX @ ~2.9% APY): +$13,018
- Burn (net): $20,860
- Note: Partial month - Edu started Dec 15, no health insurance yet
January 2026
- Cash: $5,651,519
- Burn (gross): $44,145
- Treasury yield (Dreyfus DGVXX @ ~3.5% APY): +$16,446
- Burn (net): $27,699
- Runway: 10-17 years
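The 10-17 year range is just cash divided by monthly burn, bracketed by gross burn (pessimistic: no treasury yield) and net burn (optimistic). A minimal sketch with the January numbers above:

```python
# Runway = cash / monthly burn, expressed in years.
def runway_years(cash: float, monthly_burn: float) -> float:
    return cash / monthly_burn / 12

cash = 5_651_519     # January 2026 cash
gross_burn = 44_145  # January gross burn (no yield)
net_burn = 27_699    # January net burn (after treasury yield)

low = runway_years(cash, gross_burn)   # ~10.7 years
high = runway_years(cash, net_burn)    # ~17.0 years
print(f"Runway: {low:.1f}-{high:.1f} years")
```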
Incoming since Dec 1:
- SV Angel X, L.P.: $150,000
- Pioneer Fund III: $100,000
- Bengler AS: $9,988
- Treasury dividends: $29,464
- Total: +$289,452
Timeline:
- Dec 22 (missed): Interactive holiday storybook - pared down from original mini-Linzumi games idea
- Mid-Jan: Wrapped learnings on unstructured document editing, pivoted to spec compilation
- Late Jan: Implemented full Implementer → Critic → Judge cycle with container isolation
- Feb 2: First autonomous success - agent solved screenshot generation without human help
- Now: Live report viewer, preparing for end-to-end demo
We built an interactive storybook creator where the entire story spec (characters, scenes, narration) lives in a single markdown file edited via LLM. Key challenges solved:
- Catastrophic edit prevention: How to make LLM edits reliable without corrupting the document
- Parallelized generation: DAG-based planning for simultaneous character art, scene images, and audio narration
- Model routing: Balancing intelligence (Claude), speed (Groq/Cerebras), and creativity (Sonnet for prose) dynamically
- Zero parsing: all UI is rendered from structure the LLM extracts from the markdown - no hand-written parsers
The storybook generates genuinely good narrated stories. We'll return to ship it once we have better agents to implement the remaining pieces.
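The parallelized generation above can be sketched as a small DAG scheduler: tasks declare their dependencies, and anything whose dependencies are satisfied runs concurrently. This is a hypothetical sketch (task names and structure are illustrative, not the actual system):

```python
import asyncio

# Illustrative DAG: character art and narration are independent,
# scene images reuse the character art, assembly waits on both.
DAG = {
    "character_art": [],
    "scene_images": ["character_art"],
    "narration_audio": [],
    "final_assembly": ["scene_images", "narration_audio"],
}

async def run_task(name: str) -> str:
    await asyncio.sleep(0)  # stand-in for an LLM / image / TTS call
    return name

async def run_dag(dag: dict[str, list[str]]) -> list[list[str]]:
    done: set[str] = set()
    waves: list[list[str]] = []
    pending = dict(dag)
    while pending:
        # Every task whose dependencies are all done runs in parallel.
        ready = [t for t, deps in pending.items() if set(deps) <= done]
        if not ready:
            raise ValueError("dependency cycle in DAG")
        results = await asyncio.gather(*(run_task(t) for t in ready))
        done.update(results)
        waves.append(sorted(results))
        for t in ready:
            del pending[t]
    return waves

# asyncio.run(run_dag(DAG)) executes in three waves:
# [character_art, narration_audio] -> [scene_images] -> [final_assembly]
```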
Scaling exposed major reliability issues. Gemini 2.0 Flash was ideal on paper but hung constantly at scale. We built evals to compare models systematically:
- Winner: Groq - Very reliable, very fast
- Runner-up: Cerebras - Faster but less reliable (talking to their team)
- Both running Llama 3.3 70B for mid-intelligence, ultra-low-latency structured outputs
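The reliability eval boils down to firing many identical structured-output requests at a provider and tracking failure rate and latency. A minimal sketch of that harness (`call_model` is a stand-in for the real client, not an actual API):

```python
import time

# Hypothetical eval harness: run N identical requests against a
# provider, count failures (timeouts, hangs, malformed output),
# and record the median latency of the successes.
def eval_provider(call_model, n=100, timeout_s=10.0):
    failures, latencies = 0, []
    for _ in range(n):
        start = time.monotonic()
        try:
            call_model(timeout=timeout_s)
            latencies.append(time.monotonic() - start)
        except Exception:
            failures += 1
    return {
        "failure_rate": failures / n,
        "p50_latency_s": sorted(latencies)[len(latencies) // 2]
                         if latencies else None,
    }
```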
We took an underspecified Hello World React/TypeScript app and tried to compile it via coding agents (Codex CLI, Claude Code). Initial results were mediocre, which led to:
- Implementer/Critic/Judge cycle: Three-agent loop where work is reviewed and rejected until correct
- Prompt optimization: Agents are extremely sensitive to prompting. Built evals to systematically improve prompts.
- Meta-spec: Wrote a "spec about how to write specs" and had Claude interview us to produce higher-quality input specs
- Container isolation: Every coding agent has sandbox quirks (Codex can't launch Chrome on macOS, etc.). Moved to containers for reliability.
A Hello World spec now compiles to several hundred tasks organized into a DAG, with each task going through the implement/critique/judge cycle.
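The three-agent cycle each task goes through looks roughly like this in pseudocode (a minimal sketch; names and signatures are ours, not the actual system):

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    output: str
    evidence: list[str]

# Implementer produces work, critic reviews it, judge decides whether
# the critique stands. Rejected work goes back to the implementer with
# the critique attached, up to a retry budget.
def run_cycle(task, implement, critique, judge, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        attempt = implement(task, feedback)
        review = critique(task, attempt)
        if judge(task, attempt, review) == "approve":
            return attempt
        feedback = review  # rejected: retry with the critique as context
    raise RuntimeError(f"no approved attempt after {max_rounds} rounds")
```

The key property is that the judge arbitrates between implementer and critic, so the loop terminates on agreement rather than on the implementer declaring itself done.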
The spec required screenshot evidence to prove the UI was correct. The implementer hit real problems with Playwright in our container environment. After several cycles:
- Implementer tried workarounds
- Critic rejected insufficient evidence
- Judge agreed with Critic
- Implementer eventually figured out a working solution
All without human intervention. We watched the evidence appear, proving the app was running correctly.

Several hundred tasks produce too much evidence for humans to review manually. We built a report viewer with progressive disclosure:


Features:
- Click any spec section to see granular evidence
- View the full discussion between agents
- See evidence presented, rejected, and approved
- Build trust by drilling down, then zoom out once confident


Live monitoring: Watch the spec being implemented in real-time with a color-coded progress heat map.

Learnings:
- Unstructured ↔ structured interleaving is the sweet spot. Specs are unstructured, compile to structured DAGs, individual tasks produce structured evidence, which informs unstructured discussion.
- Specs aren't enough - you need the motivating conversations. When modifying a spec, you need access to why it's currently written that way. We're keeping conversation history as context for future edits.
- Agents should log ambiguous decisions. Every spec has gaps. It's fine for agents to fill them, but they must surface what decisions they made so humans can course-correct.
- Imperative → declarative is the UX. Saying "make this bigger" is natural. Having that automatically update the spec so future regenerations incorporate it - that's the magic.
- Model reliability varies wildly at scale. Always eval before committing to a model for production use.
Next up:
- Spec authoring experience: Good UX for creating and editing specs
- Socratic partner: Finds gaps, contradictions, and ambiguities in specs before compilation
- Decision capture: Surfaces agent decisions for human review/override
- Cloud infrastructure: Scale beyond local hardware
- End-to-end demo: Hello World → Todo MVC level apps, fully autonomous
Asks: none specific right now.
But: If any of you would like to try building a small app with us in the next couple weeks, we'd love to walk you through the experience and get feedback on what's useful, what's missing, and how it compares to your current workflow.
Edu has been fantastic. We're fully in sync now and moving fast.
- Sean Grove - Product, agent orchestration, infrastructure
- Eduardo Rafael - Compiler/PL, spec compilation, verification system
Best, Sean
P.S. As always, if you'd like fewer/more/different updates, just reply and let me know.