Training agentic models that can effectively use tools remains one of the harder problems in applied ML. Models trained on purely synthetic data - where tool calls and their responses are both generated by an LLM - consistently underperform when deployed against real systems. They struggle with error recovery, mishandle state dependencies, and often exhibit what we call "time travel" errors: acting on information they haven't actually received yet.
This post introduces DeepFabric's execution-based tool tracing system, which replaces simulated tool outputs with real execution inside WebAssembly sandboxes. The result is training data grounded in actual system behavior, including the messy parts that make real-world tool use challenging.
Consider a typical synthetic data generation pipeline for tool-using agents. An LLM generates a user request, then generates an assistant response with tool calls, then generates what those tools might return. The fundamental issue: the same model is playing both sides of an interaction that should involve genuine uncertainty.
This leads to several failure modes in the resulting training data:
Time Travel Errors: The model "knows" what the tool will return because it's generating both the call and the response. Training on this data produces agents that skip verification steps - why check if a file exists when you already know what's in it?
State Inconsistency: When the model hallucinates tool outputs, it can drift from any coherent state. A file written in turn 1 might have different contents when "read" in turn 5, because the model forgot what it generated earlier.
Missing Error Paths: Simulated tools tend toward happy paths. Real systems fail in specific, recoverable ways. Models trained without exposure to FileNotFoundError, rate limits, or malformed responses handle these poorly in production.
Sequence Violations: Models will sometimes write to a file and then check if the file exists, or modify data without reading it first. These inversions are rare in real interactions but common when both sides are generated.
The solution is to replace simulated tool responses with actual execution. When the model generates read_file("config.json"), we actually read a file. When it generates write_file("output.txt", content), we actually write content. The model must then reason about the real result.
DeepFabric implements this using the Spin framework--a WebAssembly runtime designed for serverless functions. The architecture is straightforward:
+-------------------+     HTTP POST /execute     +------------------+
|    DeepFabric     | -------------------------> |   Spin Service   |
|     (Python)      |                            |      (Wasm)      |
|                   | <------------------------- |                  |
|  - ReAct Loop     |        JSON Response       |  - Tool Comps    |
|  - LLM Calls      |                            |  - KV Store      |
|  - Session Mgmt   |                            |  - Sandboxed     |
+-------------------+                            +------------------+
The generation loop follows a ReAct (Reason-Act-Observe) pattern:
- Reason: LLM decides what tool to call based on current context
- Act: Tool executes via Spin sandbox
- Observe: Actual result is fed back to the LLM
- Repeat: LLM decides next action based on real outcomes
This eliminates time travel by construction. The model cannot know what a tool will return because it hasn't returned yet. Each decision is made using only information that has actually been observed.
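To make the loop concrete, here is a minimal sketch of one iteration. It assumes the VFS component's /execute route described later in this post; execute_tool and llm_next_action are illustrative helper names, not part of the DeepFabric API.

```python
import json
import requests

SPIN_ENDPOINT = "http://localhost:3000/vfs/execute"  # assumed route for the VFS component

def llm_next_action(messages: list):
    """Placeholder for the actual LLM call; returns a parsed tool call or None."""
    ...

def execute_tool(session_id: str, tool: str, args: dict) -> dict:
    """Act: run the tool for real inside the Wasm sandbox and return its JSON result."""
    resp = requests.post(SPIN_ENDPOINT, json={
        "session_id": session_id,
        "tool": tool,
        "args": args,
    })
    resp.raise_for_status()
    return resp.json()  # e.g. {"success": true, "result": "...", "error_type": null}

def react_step(session_id: str, messages: list) -> list:
    # Reason: the LLM proposes the next tool call given the conversation so far.
    action = llm_next_action(messages)
    if action is None:
        return messages  # the model chose to answer instead of acting

    # Act + Observe: execute against the sandbox and append the *real* result.
    observation = execute_tool(session_id, action["tool"], action["args"])
    messages.append({
        "role": "tool",
        "name": action["tool"],
        "content": json.dumps(observation),
    })
    return messages
```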
DeepFabric supports three categories of tool execution, each suited to different use cases.
The Virtual Filesystem component provides session-isolated file operations. Each generation session gets its own namespace in a key-value store, ensuring complete isolation between concurrent generations.
The initial built-in set of tools (which will expand considerably) provides the following operations:
| Tool | Description | Parameters |
|---|---|---|
| read_file | Read file content | file_path (required) |
| write_file | Write content to file | file_path, content (required) |
| list_files | List files in session | None |
| delete_file | Remove a file | file_path (required) |
The implementation is a Rust WebAssembly component that uses Spin's built-in key-value store for persistence within a session. Files are namespaced by session ID, so session_001:main.py and session_002:main.py are completely independent.
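Conceptually, the isolation is nothing more than a composite key per session. A hypothetical Python equivalent of the component's key scheme:

```python
def vfs_key(session_id: str, file_path: str) -> str:
    """Compose the KV key so each session only ever sees its own files."""
    return f"{session_id}:{file_path}"

# Two concurrent sessions writing the same path never collide:
assert vfs_key("session_001", "main.py") == "session_001:main.py"
assert vfs_key("session_002", "main.py") == "session_002:main.py"
```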
Response format is consistent across all tools:
{
"success": true,
"result": "file content here...",
"error_type": null
}

Or on failure:
{
"success": false,
"result": "File not found: config.yaml",
"error_type": "FileNotFound"
}

Error types are structured (FileNotFound, IOError, InvalidArguments) rather than free-form strings, making them suitable for training models to handle specific failure modes.
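For instance, a generation loop (or an evaluation harness) can branch on error_type directly instead of pattern-matching message strings. The helper below is a hypothetical sketch, not DeepFabric's implementation:

```python
def observation_from(response: dict) -> str:
    """Turn a structured tool result into an observation the model can reason about."""
    if response["success"]:
        return response["result"]
    hints = {
        "FileNotFound": "The file does not exist yet; list files or create it first.",
        "IOError": "The operation failed at the I/O layer; a retry may be appropriate.",
        "InvalidArguments": "The tool call was malformed; check parameter names and types.",
    }
    hint = hints.get(response["error_type"], "Unrecognized error type.")
    return f"{response['result']} ({hint})"
```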
For scenarios requiring interaction with external services, DeepFabric supports full API integration through Spin components. The GitHub component demonstrates this pattern:
# Start with GitHub token for API access
SPIN_VARIABLE_GITHUB_TOKEN=ghp_xxx spin up
# Optionally restrict to specific repositories
SPIN_VARIABLE_ALLOWED_REPOS="myorg/repo1,myorg/repo2" spin up

Available GitHub tools include repository search, file content retrieval, issue and PR listing, commit details, and more. Each tool hits the real GitHub API, returning actual repository data.
Safety controls are built in:
- Repository allowlisting: Restrict which repos can be accessed during generation
- Write protection: Mutation operations disabled by default
- Structured errors: Non-allowed access returns clear, actionable error messages
This enables generating training data for code analysis tasks using real codebases, with guard rails that prevent unintended modifications.
For APIs where real access isn't practical during training (payment processors, production databases, rate-limited services), the mock component provides deterministic responses based on tool schemas. Critically, the mock component uses the Model Context Protocol (MCP) schema format, enabling direct integration with any MCP server.
The mock component accepts tool definitions in the standard MCP tools/list response format:
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "inputSchema": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City name or coordinates"
          },
          "units": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "default": "celsius"
          }
        },
        "required": ["location"]
      },
      "annotations": {
        "title": "Weather Lookup",
        "readOnlyHint": true,
        "openWorldHint": true
      }
    }
  ]
}

This means you can take the tool schema from any MCP server and load it directly into Spin for mock execution--all parameter definitions, types, and constraints are automatically mapped.
The mock component can pull tool definitions directly from a running MCP server:
# Pull tools from an MCP server endpoint
curl -X POST http://localhost:3000/mock/pull \
-H "Content-Type: application/json" \
-d '{"url": "http://localhost:8080/mcp"}'This sends a JSON-RPC tools/list request to the MCP server and loads all returned tool definitions. The response includes the tool count and names:
{
"loaded": 12,
"tools": ["get_weather", "search_files", "run_query", ...]
}

You can also load schemas from a local file or inline JSON:
# Load from inline schema (array format)
curl -X POST http://localhost:3000/mock/load-schema \
-H "Content-Type: application/json" \
-d '[{
"name": "get_weather",
"description": "Get weather for a location",
"inputSchema": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}]'
# Load from wrapped format (MCP tools/list response)
curl -X POST http://localhost:3000/mock/load-schema \
-H "Content-Type: application/json" \
-d '{"tools": [...]}'Both array and wrapped formats are supported for flexibility.
Once tools are loaded, you can define mock responses. The default response echoes the tool name and input:
{
"tool": "get_weather",
"description": "Get weather for a location",
"input_received": {"location": "Paris"},
"mock_result": "Successfully executed get_weather",
"status": "success"
}

For more realistic training data, define custom response templates with argument interpolation:
# Set a mock response template
curl -X POST http://localhost:3000/mock/update-response \
-H "Content-Type: application/json" \
-d '{
"name": "get_weather",
"mockResponse": {
"temperature": 72,
"conditions": "Partly cloudy",
"location": "{{location}}"
}
}'

The {{location}} placeholder expands to the actual argument value at execution time.
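The interpolation behaves roughly as follows; this is a simplified Python sketch of the substitution, not the component's actual Rust code:

```python
import json
import re

def render_mock(template: dict, args: dict) -> dict:
    """Replace {{name}} placeholders in the mock template with the call's arguments."""
    text = json.dumps(template)
    rendered = re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(args.get(m.group(1), m.group(0))),
        text,
    )
    return json.loads(rendered)

print(render_mock(
    {"temperature": 72, "conditions": "Partly cloudy", "location": "{{location}}"},
    {"location": "Paris"},
))
# {'temperature': 72, 'conditions': 'Partly cloudy', 'location': 'Paris'}
```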
For argument-specific responses, use fixtures:
# Add a fixture for specific arguments
curl -X POST http://localhost:3000/mock/add-fixture \
-H "Content-Type: application/json" \
-d '{
"name": "get_weather",
"match": {"location": "Seattle"},
"response": {"temperature": 55, "conditions": "Rainy", "location": "Seattle"}
}'
# Fixtures can match multiple arguments
curl -X POST http://localhost:3000/mock/add-fixture \
-H "Content-Type: application/json" \
-d '{
"name": "search_files",
"match": {"directory": "/src", "pattern": "*.py"},
"response": {"files": ["main.py", "utils.py", "config.py"], "count": 3}
}'

Fixtures are matched by specificity--a fixture matching two arguments takes precedence over one matching a single argument. This enables building up realistic response libraries for complex tool interactions.
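The selection logic is easy to picture: keep only the fixtures whose match keys all agree with the call's arguments, then prefer the one that constrains the most arguments. An illustrative sketch (the component implements this in Rust):

```python
def select_fixture(fixtures: list, args: dict):
    """Return the response of the most specific fixture matching the arguments, if any."""
    candidates = [
        f for f in fixtures
        if all(args.get(k) == v for k, v in f["match"].items())
    ]
    if not candidates:
        return None  # fall back to the response template or the default echo
    return max(candidates, key=lambda f: len(f["match"]))["response"]

fixtures = [
    {"match": {"location": "Seattle"},
     "response": {"temperature": 55, "conditions": "Rainy"}},
    {"match": {"location": "Seattle", "units": "celsius"},
     "response": {"temperature": 13, "conditions": "Rainy"}},
]
select_fixture(fixtures, {"location": "Seattle", "units": "celsius"})
# -> {'temperature': 13, 'conditions': 'Rainy'}  (two matched keys beat one)
```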
MCP is becoming the standard protocol for tool integration in AI systems. By conforming to MCP schemas, the mock component enables:
- Direct schema import: Pull tool definitions from any MCP server without manual transcription
- Ecosystem compatibility: Use the same tool schemas across Claude Desktop, IDEs, and training pipelines
- Schema validation: Tool parameters are validated against the inputSchema before mock execution (see the sketch after this list)
- Future-proofing: As MCP servers proliferate, their tools can immediately be used for training data generation
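Because inputSchema is plain JSON Schema, validation needs nothing exotic. A sketch using the off-the-shelf jsonschema package (the mock component performs the equivalent check inside the sandbox):

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

input_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

validator = Draft202012Validator(input_schema)
for error in validator.iter_errors({"units": "kelvin"}):
    print(error.message)
# 'location' is a required property
# 'kelvin' is not one of ['celsius', 'fahrenheit']
```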
macOS:
brew install fermyon/tap/spin

Linux:

curl -fsSL https://developer.fermyon.com/downloads/install.sh | bash
sudo mv spin /usr/local/bin/

Then build and start the service:

cd tools-sdk
spin build
spin up

The service starts on http://localhost:3000 with endpoints for each component (/vfs/..., /github/..., /mock/...).
A typical configuration for VFS-based generation:
topics:
  prompt: "File manipulation and code analysis tasks"
  mode: graph
  depth: 3
  degree: 3

generation:
  system_prompt: |
    You are an AI assistant with access to file system tools.
    When given a task, analyze what files need to be read or modified.

  conversation:
    type: chain_of_thought
    reasoning_style: agent
    agent_mode: multi_turn

  tools:
    spin_endpoint: "http://localhost:3000"
    available:
      - read_file
      - write_file
      - list_files
    max_per_query: 3
    max_agent_steps: 5

  # Pre-populate the virtual filesystem
  scenario_seed:
    files:
      "main.py": |
        def greet(name):
            return f"Hello, {name}!"

        if __name__ == "__main__":
            print(greet("World"))
      "config.json": |
        {
          "version": "1.0.0",
          "debug": true
        }

output:
  include_system_message: true
  num_samples: 100
  save_as: "vfs-dataset.jsonl"

The scenario_seed option pre-populates the virtual filesystem before generation begins, creating realistic starting conditions. The seeded files are loaded into the session's store before any tool calls execute. More sophisticated seeding options are coming soon, such as setting up specific directory structures or importing initial state from external sources like Git repositories.
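Conceptually, seeding is equivalent to issuing ordinary write_file calls against the session before the ReAct loop starts. A sketch, reusing the hypothetical execute_tool helper from the loop example above:

```python
def seed_session(session_id: str, files: dict) -> None:
    """Pre-populate the virtual filesystem so generation starts from realistic state."""
    for path, content in files.items():
        result = execute_tool(session_id, "write_file",
                              {"file_path": path, "content": content})
        assert result["success"], f"seeding {path} failed: {result['result']}"

seed_session("session_001", {
    "main.py": 'def greet(name):\n    return f"Hello, {name}!"\n',
    "config.json": '{"version": "1.0.0", "debug": true}\n',
})
```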
A natural question: why not just use Docker containers for tool execution?
WebAssembly provides comparable (some would argue stronger) isolation guarantees with significantly lower overhead and faster cold starts. This matters when generating large datasets, where introducing seconds of latency per tool call is unacceptable.
| Property | Docker | WebAssembly |
|---|---|---|
| Filesystem access | Must explicitly restrict | Denied by default |
| Network access | Must explicitly restrict | Denied by default |
| System calls | Full syscall access | Capability-based model |
| Memory isolation | Process-level | Module-level |
| Cold start time | Seconds | Milliseconds |
| Resource overhead | ~100MB+ per container | ~1MB per module |
The capability-based security model is particularly valuable for training data generation. Consider:
Agent: read_file("/etc/passwd")
Wasm: {"success": false, "result": "Access denied", "error_type": "PermissionDenied"}
The sandbox rejection isn't just a safety feature - it's valuable training data. Models learn that certain paths are off-limits, and how to recover when access is denied by finding a more appropriate way to achieve their goals.
In spin.toml, capabilities are explicitly granted:
[component.vfs]
# No network access
allowed_outbound_hosts = []
# Only KV store access
key_value_stores = ["default"]
[component.github]
# Specific external APIs only
allowed_outbound_hosts = [
"https://api.github.com"
]

If a tool attempts anything not explicitly allowed, the Wasm runtime rejects it. There's no possibility of sandbox escape through configuration oversight.
Execution-based tool tracing produces qualitatively different training data:
| Metric | Simulated Tools | Real Execution |
|---|---|---|
| Time-travel errors | Common | Impossible |
| Error recovery samples | Rare | Natural |
| State consistency | Drifting | Guaranteed |
| Output realism | Hallucinated | Actual |
The natural error recovery samples are particularly valuable. When a model tries to read a file that doesn't exist, it must decide what to do next. Try a different path? Create the file? Ask for clarification? These decision points, forced by real failures, teach models how to navigate uncertainty.
The ReAct loop structure also ensures that models learn appropriate verification patterns. After writing a file, a model might read it back to confirm success. This behavior emerges naturally when the model can't assume its writes succeeded.
# Clone DeepFabric
git clone https://github.com/deepfabric/deepfabric
cd deepfabric
# Build and start the Spin service
cd tools-sdk
spin build
spin up &
# Run generation with VFS tools
cd ..
deepfabric start examples/spin-vfs-tools.yaml

The TUI displays tool executions in real-time, including failures:
Events
------
T read_file("config.json") -> success
T list_files() -> success
X T write_file("output.txt") -> IOError
T write_file("output.txt") -> success
The built-in VFS and GitHub components cover common use cases, but the real power of this architecture is extensibility. You can package your own tool components, host them on GitHub, and run them in Docker for production deployments.
Each component is a WebAssembly module that handles HTTP requests. Here's the minimal structure for a Rust component:
my-tools-sdk/
├── spin.toml # Application manifest
├── components/
│ └── mytool/
│ ├── Cargo.toml # Rust dependencies
│ └── src/
│ └── lib.rs # Tool implementation
The spin.toml defines routing and capabilities:
spin_manifest_version = 2
[application]
name = "my-tools"
version = "1.0.0"
[[trigger.http]]
route = "/mytool/..."
component = "mytool"
[component.mytool]
source = "components/mytool/target/wasm32-wasip1/release/mytool.wasm"
key_value_stores = ["default"]
allowed_outbound_hosts = ["https://api.myservice.com"]
[component.mytool.build]
command = "cargo build --target wasm32-wasip1 --release"
workdir = "components/mytool"

A minimal Rust component implementing the execute pattern:
use spin_sdk::{
    http::{Request, Response},
    http_component,
    key_value::Store, // available if your tool needs session persistence
};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct ExecuteRequest {
    session_id: String,
    tool: String,
    args: serde_json::Value,
}

#[derive(Serialize)]
struct ExecuteResponse {
    success: bool,
    result: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    error_type: Option<String>,
}

#[http_component]
fn handle_request(req: Request) -> anyhow::Result<Response> {
    let path = req.path();
    match path {
        p if p.ends_with("/execute") => handle_execute(req),
        p if p.ends_with("/health") => handle_health(),
        p if p.ends_with("/components") => handle_components(),
        _ => Ok(Response::builder()
            .status(404)
            .body(r#"{"error": "Not found"}"#)
            .build()),
    }
}

fn handle_health() -> anyhow::Result<Response> {
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(r#"{"status": "healthy"}"#)
        .build())
}

fn handle_components() -> anyhow::Result<Response> {
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(r#"{"components": ["my_tool"]}"#)
        .build())
}

fn handle_execute(req: Request) -> anyhow::Result<Response> {
    let request: ExecuteRequest = serde_json::from_slice(req.body())?;
    let response = match request.tool.as_str() {
        "my_tool" => execute_my_tool(&request),
        _ => ExecuteResponse {
            success: false,
            result: format!("Unknown tool: {}", request.tool),
            error_type: Some("UnknownTool".to_string()),
        },
    };
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(serde_json::to_vec(&response)?)
        .build())
}

fn execute_my_tool(req: &ExecuteRequest) -> ExecuteResponse {
    // Your tool logic here
    ExecuteResponse {
        success: true,
        result: "Tool executed successfully".to_string(),
        error_type: None,
    }
}

For Python components, use componentize-py:
from spin_sdk import http
from spin_sdk.http import Request, Response
import json
class IncomingHandler(http.IncomingHandler):
    def handle_request(self, request: Request) -> Response:
        if request.uri.endswith("/execute"):
            return self.handle_execute(request)
        elif request.uri.endswith("/health"):
            return Response(200,
                            {"content-type": "application/json"},
                            b'{"status": "healthy"}')
        return Response(404, {}, b'{"error": "Not found"}')

    def handle_execute(self, request: Request) -> Response:
        body = json.loads(request.body)
        tool = body.get("tool")
        args = body.get("args", {})
        if tool == "my_tool":
            result = self.execute_my_tool(args)
            return Response(200,
                            {"content-type": "application/json"},
                            json.dumps(result).encode())
        return Response(400, {},
                        json.dumps({"error": f"Unknown tool: {tool}"}).encode())

    def execute_my_tool(self, args: dict) -> dict:
        return {
            "success": True,
            "result": "Tool executed successfully"
        }

Structure your repository for easy consumption:
my-tools-sdk/
├── README.md
├── spin.toml
├── Dockerfile
├── components/
│ ├── tool-a/
│ └── tool-b/
└── examples/
└── config.yaml
Users can clone and run directly:
git clone https://github.com/myorg/my-tools-sdk
cd my-tools-sdk
spin build
spin up

For production deployments, package your tools SDK in a Docker container. Create a Dockerfile:
FROM ghcr.io/fermyon/spin:v2.7
# Install build dependencies
RUN apt-get update && apt-get install -y \
build-essential \
curl \
&& rm -rf /var/lib/apt/lists/*
# Install Rust and wasm target
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
RUN rustup target add wasm32-wasip1
# Copy source
WORKDIR /app
COPY . .
# Build components
RUN spin build
# Expose default port
EXPOSE 3000
# Run Spin
CMD ["spin", "up", "--listen", "0.0.0.0:3000"]

Build and run:
# Build the image
docker build -t my-tools-sdk .
# Run with environment variables for API keys
docker run -p 3000:3000 \
-e SPIN_VARIABLE_API_KEY=xxx \
my-tools-sdk
# Run with volume mount for persistent KV store
docker run -p 3000:3000 \
-v tools-data:/root/.spin \
my-tools-sdk

Run DeepFabric and your tools SDK together with Docker Compose:
version: '3.8'

services:
  tools-sdk:
    build: ./my-tools-sdk
    ports:
      - "3000:3000"
    environment:
      - SPIN_VARIABLE_GITHUB_TOKEN=${GITHUB_TOKEN}
      - SPIN_VARIABLE_ALLOWED_REPOS=myorg/repo1,myorg/repo2
    volumes:
      - tools-data:/root/.spin

  deepfabric:
    image: python:3.11
    volumes:
      - ./:/workspace
    working_dir: /workspace
    command: >
      sh -c "pip install deepfabric &&
             deepfabric start config.yaml"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    depends_on:
      - tools-sdk

volumes:
  tools-data:

Your DeepFabric configuration points to the containerized service:
generation:
  tools:
    spin_endpoint: "http://tools-sdk:3000"
    available:
      - my_tool
      - another_tool

For easier distribution, publish pre-built images to a container registry:
# Build for multiple platforms
docker buildx build --platform linux/amd64,linux/arm64 \
-t ghcr.io/myorg/my-tools-sdk:latest \
  --push .

Users can then run without building:
docker run -p 3000:3000 ghcr.io/myorg/my-tools-sdk:latest

This pattern enables teams to share specialized tool components--internal APIs, domain-specific services, proprietary integrations--while maintaining the security and isolation guarantees of WebAssembly execution.
Execution-based tool tracing addresses a fundamental limitation in synthetic data generation for agentic models. By replacing simulated tool responses with real execution in WebAssembly sandboxes, we produce training data that reflects actual system behavior--including the error conditions and state dependencies that make tool use challenging.
The Spin framework provides the necessary isolation with minimal overhead, and the three-tier tool system (VFS, Component, Mock) covers the full spectrum from sandboxed file operations to real API access to controlled mock responses.
If you're building agentic training data, consider whether your current approach captures the iterative, uncertainty-driven nature of real tool use. Execution-based tracing ensures it does.