Gaps for Long-Running LLM Task Execution in Endo

Analysis of the llm-durable-messages branch: what works today and what's missing to enable an LLM agent to work on a task over hours/days with human-in-the-loop approval.

End-to-End User Flows

Flow 1: Give the LLM a coding task

Human                          Daemon                         LLM Agent
  |                              |                              |
  |-- endo send llm "build X" ->|-- deliver message ---------->|
  |                              |                              |-- calls Anthropic API
  |                              |                              |-- gets tool_use: define_code
  |                              |<- E(powers).define(src,slots)|
  |<- endo inbox shows defn ----|                              |
  |-- endo endow 0 --bind ... ->|-- eval formula created ----->|
  |                              |-- result resolves ---------->|
  |                              |                              |-- sends response to HOST
  |<- endo inbox shows result --|                              |

Works today. The happy path holds: the LLM proposes code via define, the host endows, the code runs, and the LLM gets the result.

Breaks at: The LLM gets the result back as the resolution of the define() promise inside executeTool. The result is JSON-stringified and fed back into the tool call loop, then the LLM produces a text response sent to HOST via send. But the human has no way to send a follow-up message about the result scoped to the task -- the inbox is a flat list of unrelated messages.
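
A rough sketch of the agent-side tool handler this flow implies, assuming the E(powers).define(source, slots) surface shown in the diagram; the tool name, input shape, and surrounding structure are illustrative rather than the branch's actual code:

  import { E } from '@endo/far';

  // Illustrative handler for the define_code tool call. Only
  // define(source, slots) comes from the flow above; the rest is assumed.
  const executeTool = async (powers, toolUse) => {
    if (toolUse.name === 'define_code') {
      const { source, slots } = toolUse.input;
      // Resolves only after the host runs `endo endow ... --bind ...`.
      const result = await E(powers).define(source, slots);
      // The result is stringified and fed back into the tool call loop.
      return JSON.stringify(result);
    }
    return JSON.stringify({ error: `unknown tool ${toolUse.name}` });
  };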

Flow 2: Multi-step task (LLM needs to iterate)

Human: "Analyze my data, build a model, then test it"

Step 1: LLM defines code to load data
  -> Host endows with data-source
  -> Result stored as "loaded-data"

Step 2: LLM defines code to build model, needs "loaded-data"
  -> ??? How does LLM reference "loaded-data" in a define?

Breaks at: define() creates slots that the host fills. The LLM can't say "use the result from my last step" -- it can only describe capability slots for the host to bind. There's no mechanism for the agent to chain its own prior results into the next step without the host re-binding them each time.

With requestEvaluation, the LLM could reference last-result by pet name. But with define, that authority is deliberately removed. The gap: there's no way for an agent to build on its own prior results without host intervention at every step.
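
An illustrative contrast between the two verbs, with hypothetical signatures (the exact shapes on the branch may differ):

  import { E } from '@endo/far';

  const nextStep = async powers => {
    // requestEvaluation: the agent names pet names itself, including its
    // own prior result -- authority the agent holds directly.
    const model = await E(powers).requestEvaluation(
      'buildModel(data)',
      ['data'],          // names used inside the source
      ['loaded-data'],   // pet names chosen by the agent
    );

    // define: the agent only declares slots; the host decides what, if
    // anything, gets bound, so the agent cannot reach 'loaded-data' alone.
    const proposal = await E(powers).define('buildModel(data)', ['data']);
    return { model, proposal };
  };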

Flow 3: Daemon restarts mid-task

Before restart:
  - LLM has 15-message conversation history in memory
  - LLM was mid-way through a tool call loop
  - Host had approved 3 define requests

After restart:
  - Durable messages: OK (inbox messages survive)
  - Formula graph: OK (counter objects, eval results survive)
  - LLM conversation history: LOST (in-memory array in anthropic-backend.js:141)
  - LLM agent process: KILLED (unconfined worker dies)
  - Tool call loop state: LOST

Breaks at: The make-unconfined formula for the LLM agent will re-evaluate on restart (re-run make(powers)), creating a fresh agent with empty conversation history. It will see its old durable messages via followMessages(), but it has no way to reconstruct the conversation context from those messages. The Anthropic API needs the full messages array to maintain coherence.
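
A sketch of the shape of the agent caplet implied here (not the branch's actual code), assuming followMessages() hands back a far async iterator; it shows why the conversation array starts empty after every restart:

  import { E } from '@endo/far';

  // The daemon re-runs make(powers) on restart, so everything closed over
  // here -- including the conversation array -- starts out empty again.
  export const make = powers => {
    const messages = []; // Anthropic-style {role, content} turns, memory only

    const run = async () => {
      // Old durable messages are replayed first, then new ones arrive, but
      // nothing reconstructs the prior API context from them.
      const iterator = E(powers).followMessages();
      for (;;) {
        const { value: message, done } = await E(iterator).next();
        if (done) break;
        // A package message carries strings + edge names + pet names; the
        // sketch just appends them and would then run the tool loop.
        messages.push({ role: 'user', content: message.strings.join(' ') });
      }
    };
    run().catch(error => console.error(error));

    return harden({});
  };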

Flow 4: Host monitors progress

Human: "What's the LLM working on? How far along is it?"

  endo inbox
  > 0. "llm-handle" sent "Llamadrome ready for work." at ...
  > 1. you sent "build a counter" at ...
  > 2. "llm-handle" proposed code (slots: counter) at ...
  > 3. "llm-handle" sent "Here's your counter [result: 42]" at ...

Partially works. The inbox shows messages chronologically. But there's no concept of a task grouping these messages. If you give the LLM two tasks, messages interleave. There's no progress indicator, no "step 3 of 7", no way to see what the LLM is currently thinking about vs. waiting for.

Flow 5: Human takes a break, comes back hours later

10:00am  Human sends task to LLM
10:01am  LLM proposes define, waits for host endow
           ... human goes to lunch ...
 2:00pm  Human runs "endo inbox"
           -> Sees the pending definition
           -> Endows it
           -> LLM gets result... but the Anthropic API call timed out hours ago

Breaks at: The define() call in executeTool is an await that blocks the tool call loop. If the host doesn't endow promptly, the Anthropic backend is sitting on a hanging promise. The Ollama backend doesn't even try -- it fire-and-forgets code blocks. There's no mechanism to park a pending request and resume the conversation when the approval arrives.

Key Architectural Gaps

| Gap | What it blocks | Difficulty |
| --- | --- | --- |
| No conversation persistence | Flow 3 -- LLM loses all context on restart | Medium |
| No self-referencing results | Flow 2 -- LLM can't chain steps without host re-binding every time | Medium |
| No async approval handling | Flow 5 -- define() blocks the tool loop; human can't take time to review | High |
| No task concept | Flow 4 -- messages are a flat stream, no grouping or progress | Medium |
| No conversation resumption | Flow 3 -- even if messages are durable, LLM can't rebuild its API context from them | High |

Gap 1: No conversation persistence

The Anthropic backend holds the full conversation in messages: Array<{role, content}> (anthropic-backend.js:141). The Ollama backend holds transcript in memory (ollama-backend.js:34-36). Both are lost on daemon restart or worker termination.

What's needed: Persist conversation turns to the guest's directory (via storeValue or a dedicated conversation store). On restart, reconstruct the messages array from stored turns before resuming the followMessages loop.
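
A minimal sketch of that approach, assuming a storeValue call on the guest powers as suggested above; lookup and the conversation-N naming scheme are placeholders, and the turn count itself would also need to be stored durably:

  import { E } from '@endo/far';

  // Persist one API turn under a pet name as it happens.
  const appendTurn = async (powers, index, turn) => {
    await E(powers).storeValue(harden(turn), `conversation-${index}`);
  };

  // On restart, rebuild the messages array before resuming followMessages.
  const restoreConversation = async (powers, count) => {
    const messages = [];
    for (let index = 0; index < count; index += 1) {
      messages.push(await E(powers).lookup(`conversation-${index}`));
    }
    return messages;
  };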

Gap 2: No self-referencing results

With define(), the agent proposes code with named slots and the host binds capabilities. This is good for security (the agent can't grab capabilities by name). But it means the agent can't say "use the result I got from my last define" without the host manually binding it.

What's needed: Either (a) an autoEndow variant of define where the agent can specify which of its own prior results to bind (host still approves the code), or (b) a task-scoped workspace where evaluation results are automatically available to subsequent steps.
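
A hypothetical shape for option (a); nothing like this exists on the branch, and the extra argument name is made up:

  import { E } from '@endo/far';

  const proposeNextStep = async powers => {
    // The agent proposes code and names which of its *own* prior results
    // should fill some slots; the host still reviews the code and binds
    // the remaining slots as usual.
    return E(powers).define(
      'buildModel(loadedData, plot)',   // proposed source, host-reviewed
      ['loadedData', 'plot'],           // slot names
      { selfBindings: { loadedData: 'loaded-data' } }, // agent's own result
    );
  };

Because 'loaded-data' is a result the agent itself produced in an earlier step, re-binding it grants no authority the agent did not already exercise; the host's review burden shrinks to the genuinely new slots.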

Gap 3: No async approval handling

The tool call loop in anthropic-backend.js:158-192 is synchronous: it awaits each tool result before continuing. When define() is called, it blocks until the host endows. If the host takes hours to review, the Anthropic API connection may time out, or the LLM context may grow stale.

What's needed: The agent needs to be event-driven rather than synchronous. When a define() is pending approval, the agent should go idle. When the approval arrives (as a message or event), the agent wakes up, feeds the result into a new API call with the conversation history, and continues.
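
A sketch of the event-driven shape, with hypothetical names throughout: instead of awaiting define() inside the tool loop, the agent records the pending step, lets the turn end, and finishes when the approval lands:

  import { E } from '@endo/far';

  const pending = new Map(); // stepId -> conversation snapshot

  const proposeStep = (powers, stepId, source, slots, conversation) => {
    pending.set(stepId, conversation);
    // Deliberately not awaited: the promise settles whenever the host endows.
    E(powers)
      .define(source, slots)
      .then(
        result => onApproved(stepId, result),
        error => console.error(`step ${stepId} rejected`, error),
      );
  };

  const onApproved = async (stepId, result) => {
    const conversation = pending.get(stepId);
    pending.delete(stepId);
    // Open a fresh API call with the stored history plus the tool result,
    // rather than holding one HTTP request open for hours:
    // await callAnthropic([...conversation, toolResultFor(stepId, result)]);
  };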

Gap 4: No task concept

Messages in the inbox are a flat chronological stream. There's no way to group related messages into a task, track progress across steps, or distinguish between concurrent tasks.

What's needed: A task envelope or thread ID that groups related messages. The define → endow → result chain should be visible as a single task with steps. The host should be able to run endo tasks to see active task threads and their status.
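
A hypothetical task envelope, just to make the idea concrete; every field name here is illustrative:

  // Each message a guest or host sends would carry a taskId so a command
  // like `endo tasks` could group the define -> endow -> result chain.
  const exampleTask = harden({
    taskId: 'task-7f3a',
    title: 'build a counter',
    status: 'done', // or 'awaiting-endow', 'in-progress', ...
    steps: [
      { step: 0, kind: 'request', messageNumber: 1 },
      { step: 1, kind: 'define', messageNumber: 2, status: 'endowed' },
      { step: 2, kind: 'result', messageNumber: 3, status: 'done' },
    ],
  });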

Gap 5: No conversation resumption

Even if messages are durable and the agent restarts, there's no way to reconstruct a valid Anthropic/Ollama conversation from the durable messages. The message format (strings + edge names + pet names) doesn't capture the LLM-specific structure (role, tool_use blocks, tool_result blocks).

What's needed: Either (a) persist the raw LLM API messages alongside the Endo messages, or (b) define a reconstruction protocol where the agent replays durable messages into a fresh conversation context on startup.
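
A sketch of what option (a) would have to capture per turn so a restarted agent can rebuild a context the API will accept; the envelope is illustrative, while the inner blocks follow the tool_use / tool_result structure described above:

  const persistedTurn = harden({
    endoMessageNumber: 2, // ties the turn to the durable Endo message
    apiMessage: {
      role: 'assistant',
      content: [
        { type: 'text', text: 'Defining a counter.' },
        {
          type: 'tool_use',
          id: 'toolu_01',
          name: 'define_code',
          input: { source: '/* counter source */', slots: ['counter'] },
        },
      ],
    },
  });

A later turn would store the matching tool_result block under the same id, so replaying turns in order yields a valid messages array.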

What Exists as Foundation

The branch has solid foundations to build on:

  • Durable messages -- fully implemented, persisted to disk, survive restarts
  • Formula persistence -- core daemon handles formulation and reincarnation
  • define/endow/form verbs -- authority-separating message protocol
  • Guest/host distinction -- clear capability boundaries
  • followMessages() -- async iterable that yields existing messages first, then new ones
  • Tool calling (Anthropic backend) -- LLM can propose tools with structured args

The most impactful gaps are async approval handling and conversation persistence. Today the LLM agent is fundamentally synchronous: receive a message, call the API, maybe call tools, send a response. Long-running tasks need the agent to be event-driven: propose something, go idle, wake up when the approval arrives, and continue from where it left off.

The durable messages infrastructure already provides the wake-up mechanism (followMessages yields new messages as they arrive). What's missing is the agent-side state machine that can park a pending step, serialize its state, and resume when the relevant message arrives.
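
A sketch of what that might look like, combining the parked-step table from Gap 3 with the durable store from Gap 1; every name here is a placeholder:

  import { E } from '@endo/far';

  // Park everything needed to re-issue the API call once the endow arrives.
  const parkStep = async (powers, step) => {
    // step: { stepId, conversation, source, slots }
    await E(powers).storeValue(harden(step), `parked-${step.stepId}`);
  };

  // After a restart, reload parked steps and pick up where the agent left off.
  const resumeParkedSteps = async (powers, stepIds) => {
    for (const stepId of stepIds) {
      const step = await E(powers).lookup(`parked-${stepId}`);
      // Re-attach to the still-pending define, or re-propose it, then
      // continue the conversation from step.conversation.
    }
  };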
