Training agentic models that can effectively use tools remains one of the harder problems in applied ML. Models trained on purely synthetic data - where tool calls and their responses are both generated by an LLM - consistently underperform when deployed against real systems. They struggle with error recovery, mishandle state dependencies, and often exhibit what we call "time travel" errors: acting on information they haven't actually received yet.
This post introduces DeepFabric's execution-based tool tracing system, which replaces simulated tool outputs with real execution inside WebAssembly sandboxes. The result is training data grounded in actual system behavior, including the messy parts that make real-world tool use challenging.
Consider a typical synthetic data generation pipeline for tool-using agents. An LLM generates a user request, then generates an assistant response with tool calls, then generates what those tools might return. The fundamental issue: the same model is playing both sides of an interaction that should involve genuine uncertainty.
This leads to several failure modes in the resulting training data:
Time Travel Errors: The model "knows" what the tool will return because it's generating both the call and the response. Training on this data produces agents that skip verification steps - why check if a file exists when you already know what's in it?
State Inconsistency: When the model hallucinates tool outputs, it can drift from any coherent state. A file written in turn 1 might have different contents when "read" in turn 5, because the model forgot what it generated earlier.
Missing Error Paths: Simulated tools tend toward happy paths. Real systems fail in specific, recoverable ways. Models trained without exposure to FileNotFoundError, rate limits, or malformed responses handle these poorly in production.
Sequence Violations: Models will sometimes write to a file and then check if the file exists, or modify data without reading it first. These inversions are rare in real interactions but common when both sides are generated.
The solution is to replace simulated tool responses with actual execution. When the model generates read_file("config.json"), we actually read a file. When it generates write_file("output.txt", content), we actually write content. The model must then reason about the real result.
DeepFabric implements this using the Spin framework--a WebAssembly runtime designed for serverless functions. The architecture is straightforward:
+-------------------+     HTTP POST /execute     +------------------+
|    DeepFabric     | -------------------------> |   Spin Service   |
|     (Python)      |                            |      (Wasm)      |
|                   | <------------------------- |                  |
|  - ReAct Loop     |        JSON Response       |  - Tool Comps    |
|  - LLM Calls      |                            |  - KV Store      |
|  - Session Mgmt   |                            |  - Sandboxed     |
+-------------------+                            +------------------+
The generation loop follows a ReAct (Reason-Act-Observe) pattern:
- Reason: LLM decides what tool to call based on current context
- Act: Tool executes via Spin sandbox
- Observe: Actual result is fed back to the LLM
- Repeat: LLM decides next action based on real outcomes
This eliminates time travel by construction. The model cannot know what a tool will return because it hasn't returned yet. Each decision is made using only information that has actually been observed.
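To make the loop concrete, here is a minimal sketch of one iteration. It assumes the VFS component's /execute route described later in this post; execute_tool and llm_next_action are illustrative helper names, not part of the DeepFabric API.

```python
import json
import requests

SPIN_ENDPOINT = "http://localhost:3000/vfs/execute"  # assumed route for the VFS component

def llm_next_action(messages: list):
    """Placeholder for the actual LLM call; returns a parsed tool call or None."""
    ...

def execute_tool(session_id: str, tool: str, args: dict) -> dict:
    """Act: run the tool for real inside the Wasm sandbox and return its JSON result."""
    resp = requests.post(SPIN_ENDPOINT, json={
        "session_id": session_id,
        "tool": tool,
        "args": args,
    })
    resp.raise_for_status()
    return resp.json()  # e.g. {"success": true, "result": "...", "error_type": null}

def react_step(session_id: str, messages: list) -> list:
    # Reason: the LLM proposes the next tool call given the conversation so far.
    action = llm_next_action(messages)
    if action is None:
        return messages  # the model chose to answer instead of acting

    # Act + Observe: execute against the sandbox and append the *real* result.
    observation = execute_tool(session_id, action["tool"], action["args"])
    messages.append({
        "role": "tool",
        "name": action["tool"],
        "content": json.dumps(observation),
    })
    return messages
```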
DeepFabric supports three categories of tool execution, each suited to different use cases.
The Virtual Filesystem component provides session-isolated file operations. Each generation session gets its own namespace in a key-value store, ensuring complete isolation between concurrent generations.
The initial built-in set of tools (which will expand considerably) provides the following operations:
| Tool | Description | Parameters |
|---|---|---|
| read_file | Read file content | file_path (required) |
| write_file | Write content to file | file_path, content (required) |
| list_files | List files in session | None |
| delete_file | Remove a file | file_path (required) |
The implementation is a Rust WebAssembly component that uses Spin's built-in key-value store for persistence within a session. Files are namespaced by session ID, so session_001:main.py and session_002:main.py are completely independent.
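Conceptually, the isolation is nothing more than a composite key per session. A hypothetical Python equivalent of the component's key scheme:

```python
def vfs_key(session_id: str, file_path: str) -> str:
    """Compose the KV key so each session only ever sees its own files."""
    return f"{session_id}:{file_path}"

# Two concurrent sessions writing the same path never collide:
assert vfs_key("session_001", "main.py") == "session_001:main.py"
assert vfs_key("session_002", "main.py") == "session_002:main.py"
```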
Response format is consistent across all tools:
{
"success": true,
"result": "file content here...",
"error_type": null
}

Or on failure:
{
"success": false,
"result": "File not found: config.yaml",
"error_type": "FileNotFound"
}

Error types are structured (FileNotFound, IOError, InvalidArguments) rather than free-form strings, making them suitable for training models to handle specific failure modes.
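For instance, a generation loop (or an evaluation harness) can branch on error_type directly instead of pattern-matching message strings. The helper below is a hypothetical sketch, not DeepFabric's implementation:

```python
def observation_from(response: dict) -> str:
    """Turn a structured tool result into an observation the model can reason about."""
    if response["success"]:
        return response["result"]
    hints = {
        "FileNotFound": "The file does not exist yet; list files or create it first.",
        "IOError": "The operation failed at the I/O layer; a retry may be appropriate.",
        "InvalidArguments": "The tool call was malformed; check parameter names and types.",
    }
    hint = hints.get(response["error_type"], "Unrecognized error type.")
    return f"{response['result']} ({hint})"
```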
For scenarios requiring interaction with external services, DeepFabric supports full API integration through Spin components. The GitHub component demonstrates this pattern:
# Start with GitHub token for API access
SPIN_VARIABLE_GITHUB_TOKEN=ghp_xxx spin up
# Optionally restrict to specific repositories
SPIN_VARIABLE_ALLOWED_REPOS="myorg/repo1,myorg/repo2" spin up

Available GitHub tools include repository search, file content retrieval, issue and PR listing, commit details, and more. Each tool hits the real GitHub API, returning actual repository data.
Safety controls are built in:
- Repository allowlisting: Restrict which repos can be accessed during generation
- Write protection: Mutation operations disabled by default
- Structured errors: Non-allowed access returns clear, actionable error messages
This enables generating training data for code analysis tasks using real codebases, with guard rails that prevent unintended modifications.
For APIs where real access isn't practical during training (payment processors, production databases, rate-limited services), the mock component provides deterministic responses based on tool schemas. Critically, the mock component uses the Model Context Protocol (MCP) schema format, enabling direct integration with any MCP server.
The mock component accepts tool definitions in the standard MCP tools/list response format:
{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "inputSchema": {
        "type": "object",
        "properties": {
          "location": {
            "type": "string",
            "description": "City name or coordinates"
          },
          "units": {
            "type": "string",
            "enum": ["celsius", "fahrenheit"],
            "default": "celsius"
          }
        },
        "required": ["location"]
      },
      "annotations": {
        "title": "Weather Lookup",
        "readOnlyHint": true,
        "openWorldHint": true
      }
    }
  ]
}

This means you can take the tool schema from any MCP server and load it directly into Spin for mock execution--all parameter definitions, types, and constraints are automatically mapped.
The mock component can pull tool definitions directly from a running MCP server:
# Pull tools from an MCP server endpoint
curl -X POST http://localhost:3000/mock/pull \
-H "Content-Type: application/json" \
-d '{"url": "http://localhost:8080/mcp"}'This sends a JSON-RPC tools/list request to the MCP server and loads all returned tool definitions. The response includes the tool count and names:
{
"loaded": 12,
"tools": ["get_weather", "search_files", "run_query", ...]
}

You can also load schemas from a local file or inline JSON:
# Load from inline schema (array format)
curl -X POST http://localhost:3000/mock/load-schema \
-H "Content-Type: application/json" \
-d '[{
"name": "get_weather",
"description": "Get weather for a location",
"inputSchema": {
"type": "object",
"properties": {
"location": {"type": "string"}
},
"required": ["location"]
}
}]'
# Load from wrapped format (MCP tools/list response)
curl -X POST http://localhost:3000/mock/load-schema \
-H "Content-Type: application/json" \
-d '{"tools": [...]}'Both array and wrapped formats are supported for flexibility.
Once tools are loaded, you can define mock responses. The default response echoes the tool name and input:
{
"tool": "get_weather",
"description": "Get weather for a location",
"input_received": {"location": "Paris"},
"mock_result": "Successfully executed get_weather",
"status": "success"
}

For more realistic training data, define custom response templates with argument interpolation:
# Set a mock response template
curl -X POST http://localhost:3000/mock/update-response \
-H "Content-Type: application/json" \
-d '{
"name": "get_weather",
"mockResponse": {
"temperature": 72,
"conditions": "Partly cloudy",
"location": "{{location}}"
}
}'

The {{location}} placeholder expands to the actual argument value at execution time.
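The interpolation behaves roughly as follows; this is a simplified Python sketch of the substitution, not the component's actual Rust code:

```python
import json
import re

def render_mock(template: dict, args: dict) -> dict:
    """Replace {{name}} placeholders in the mock template with the call's arguments."""
    text = json.dumps(template)
    rendered = re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(args.get(m.group(1), m.group(0))),
        text,
    )
    return json.loads(rendered)

print(render_mock(
    {"temperature": 72, "conditions": "Partly cloudy", "location": "{{location}}"},
    {"location": "Paris"},
))
# {'temperature': 72, 'conditions': 'Partly cloudy', 'location': 'Paris'}
```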
For argument-specific responses, use fixtures:
# Add a fixture for specific arguments
curl -X POST http://localhost:3000/mock/add-fixture \
-H "Content-Type: application/json" \
-d '{
"name": "get_weather",
"match": {"location": "Seattle"},
"response": {"temperature": 55, "conditions": "Rainy", "location": "Seattle"}
}'
# Fixtures can match multiple arguments
curl -X POST http://localhost:3000/mock/add-fixture \
-H "Content-Type: application/json" \
-d '{
"name": "search_files",
"match": {"directory": "/src", "pattern": "*.py"},
"response": {"files": ["main.py", "utils.py", "config.py"], "count": 3}
}'

Fixtures are matched by specificity--a fixture matching two arguments takes precedence over one matching a single argument. This enables building up realistic response libraries for complex tool interactions.
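The selection logic is easy to picture: keep only the fixtures whose match keys all agree with the call's arguments, then prefer the one that constrains the most arguments. An illustrative sketch (the component implements this in Rust):

```python
def select_fixture(fixtures: list, args: dict):
    """Return the response of the most specific fixture matching the arguments, if any."""
    candidates = [
        f for f in fixtures
        if all(args.get(k) == v for k, v in f["match"].items())
    ]
    if not candidates:
        return None  # fall back to the response template or the default echo
    return max(candidates, key=lambda f: len(f["match"]))["response"]

fixtures = [
    {"match": {"location": "Seattle"},
     "response": {"temperature": 55, "conditions": "Rainy"}},
    {"match": {"location": "Seattle", "units": "celsius"},
     "response": {"temperature": 13, "conditions": "Rainy"}},
]
select_fixture(fixtures, {"location": "Seattle", "units": "celsius"})
# -> {'temperature': 13, 'conditions': 'Rainy'}  (two matched keys beat one)
```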
MCP is becoming the standard protocol for tool integration in AI systems. By conforming to MCP schemas, the mock component enables:
- Direct schema import: Pull tool definitions from any MCP server without manual transcription
- Ecosystem compatibility: Use the same tool schemas across Claude Desktop, IDEs, and training pipelines
- Schema validation: Tool parameters are validated against the inputSchema before mock execution (see the sketch after this list)
- Future-proofing: As MCP servers proliferate, their tools can immediately be used for training data generation
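Because inputSchema is plain JSON Schema, validation needs nothing exotic. A sketch using the off-the-shelf jsonschema package (the mock component performs the equivalent check inside the sandbox):

```python
from jsonschema import Draft202012Validator  # pip install jsonschema

input_schema = {
    "type": "object",
    "properties": {
        "location": {"type": "string"},
        "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["location"],
}

validator = Draft202012Validator(input_schema)
for error in validator.iter_errors({"units": "kelvin"}):
    print(error.message)
# 'location' is a required property
# 'kelvin' is not one of ['celsius', 'fahrenheit']
```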
macOS:
brew install fermyon/tap/spin

Linux:

curl -fsSL https://developer.fermyon.com/downloads/install.sh | bash
sudo mv spin /usr/local/bin/

Then build and start the service:

cd tools-sdk
spin build
spin up

The service starts on http://localhost:3000 with endpoints for each component (/vfs/..., /github/..., /mock/...).
A typical configuration for VFS-based generation:
topics:
  prompt: "File manipulation and code analysis tasks"
  mode: graph
  depth: 3
  degree: 3

generation:
  system_prompt: |
    You are an AI assistant with access to file system tools.
    When given a task, analyze what files need to be read or modified.

  conversation:
    type: chain_of_thought
    reasoning_style: agent
    agent_mode: multi_turn

  tools:
    spin_endpoint: "http://localhost:3000"
    available:
      - read_file
      - write_file
      - list_files
    max_per_query: 3
    max_agent_steps: 5

  # Pre-populate the virtual filesystem
  scenario_seed:
    files:
      "main.py": |
        def greet(name):
            return f"Hello, {name}!"

        if __name__ == "__main__":
            print(greet("World"))
      "config.json": |
        {
          "version": "1.0.0",
          "debug": true
        }

output:
  include_system_message: true
  num_samples: 100
  save_as: "vfs-dataset.jsonl"

The scenario_seed option pre-populates the virtual filesystem before generation begins, creating realistic starting conditions. The seeded files are loaded into the session's store before any tool calls execute. More sophisticated seeding options are coming soon, such as setting up specific directory structures or importing initial state from external sources like Git repositories.
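Conceptually, seeding is equivalent to issuing ordinary write_file calls against the session before the ReAct loop starts. A sketch, reusing the hypothetical execute_tool helper from the loop example above:

```python
def seed_session(session_id: str, files: dict) -> None:
    """Pre-populate the virtual filesystem so generation starts from realistic state."""
    for path, content in files.items():
        result = execute_tool(session_id, "write_file",
                              {"file_path": path, "content": content})
        assert result["success"], f"seeding {path} failed: {result['result']}"

seed_session("session_001", {
    "main.py": 'def greet(name):\n    return f"Hello, {name}!"\n',
    "config.json": '{"version": "1.0.0", "debug": true}\n',
})
```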
A natural question: why not just use Docker containers for tool execution?
WebAssembly provides comparable (some would argue stronger) isolation guarantees with significantly lower overhead and faster cold starts. This matters when generating large datasets, where introducing seconds of latency per tool call is unacceptable.
| Property | Docker | WebAssembly |
|---|---|---|
| Filesystem access | Must explicitly restrict | Denied by default |
| Network access | Must explicitly restrict | Denied by default |
| System calls | Full syscall access | Capability-based model |
| Memory isolation | Process-level | Module-level |
| Cold start time | Seconds | Milliseconds |
| Resource overhead | ~100MB+ per container | ~1MB per module |
The capability-based security model is particularly valuable for training data generation. Consider:
Agent: read_file("/etc/passwd")
Wasm: {"success": false, "result": "Access denied", "error_type": "PermissionDenied"}
The sandbox rejection isn't just a safety feature - it's valuable training data. Models learn that certain paths are off-limits, and how to recover when access is denied by finding a more appropriate way to achieve their goals.
In spin.toml, capabilities are explicitly granted:
[component.vfs]
# No network access
allowed_outbound_hosts = []
# Only KV store access
key_value_stores = ["default"]
[component.github]
# Specific external APIs only
allowed_outbound_hosts = [
"https://api.github.com"
]

If a tool attempts anything not explicitly allowed, the Wasm runtime rejects it. There's no possibility of sandbox escape through configuration oversight.
Execution-based tool tracing produces qualitatively different training data:
| Metric | Simulated Tools | Real Execution |
|---|---|---|
| Time-travel errors | Common | Impossible |
| Error recovery samples | Rare | Natural |
| State consistency | Drifting | Guaranteed |
| Output realism | Hallucinated | Actual |
The natural error recovery samples are particularly valuable. When a model tries to read a file that doesn't exist, it must decide what to do next. Try a different path? Create the file? Ask for clarification? These decision points, forced by real failures, teach models how to navigate uncertainty.
The ReAct loop structure also ensures that models learn appropriate verification patterns. After writing a file, a model might read it back to confirm success. This behavior emerges naturally when the model can't assume its writes succeeded.
# Clone DeepFabric
git clone https://github.com/deepfabric/deepfabric
cd deepfabric
# Build and start the Spin service
cd tools-sdk
spin build
spin up &
# Run generation with VFS tools
cd ..
deepfabric start examples/spin-vfs-tools.yaml

The TUI displays tool executions in real-time, including failures:
Events
------
T read_file("config.json") -> success
T list_files() -> success
X T write_file("output.txt") -> IOError
T write_file("output.txt") -> success
The built-in VFS and GitHub components cover common use cases, but the real power of this architecture is extensibility. You can package your own tool components, host them on GitHub, and run them in Docker for production deployments.
Each component is a WebAssembly module that handles HTTP requests. Here's the minimal structure for a Rust component:
my-tools-sdk/
├── spin.toml # Application manifest
├── components/
│ └── mytool/
│ ├── Cargo.toml # Rust dependencies
│ └── src/
│ └── lib.rs # Tool implementation
The spin.toml defines routing and capabilities:
spin_manifest_version = 2
[application]
name = "my-tools"
version = "1.0.0"
[[trigger.http]]
route = "/mytool/..."
component = "mytool"
[component.mytool]
source = "components/mytool/target/wasm32-wasip1/release/mytool.wasm"
key_value_stores = ["default"]
allowed_outbound_hosts = ["https://api.myservice.com"]
[component.mytool.build]
command = "cargo build --target wasm32-wasip1 --release"
workdir = "components/mytool"

A minimal Rust component implementing the execute pattern:
use spin_sdk::{
    http::{Request, Response},
    http_component,
    key_value::Store, // available if your tool needs session persistence
};
use serde::{Deserialize, Serialize};

#[derive(Deserialize)]
struct ExecuteRequest {
    session_id: String,
    tool: String,
    args: serde_json::Value,
}

#[derive(Serialize)]
struct ExecuteResponse {
    success: bool,
    result: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    error_type: Option<String>,
}

#[http_component]
fn handle_request(req: Request) -> anyhow::Result<Response> {
    let path = req.path();
    match path {
        p if p.ends_with("/execute") => handle_execute(req),
        p if p.ends_with("/health") => handle_health(),
        p if p.ends_with("/components") => handle_components(),
        _ => Ok(Response::builder()
            .status(404)
            .body(r#"{"error": "Not found"}"#)
            .build()),
    }
}

fn handle_health() -> anyhow::Result<Response> {
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(r#"{"status": "healthy"}"#)
        .build())
}

fn handle_components() -> anyhow::Result<Response> {
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(r#"{"components": ["my_tool"]}"#)
        .build())
}

fn handle_execute(req: Request) -> anyhow::Result<Response> {
    let request: ExecuteRequest = serde_json::from_slice(req.body())?;
    let response = match request.tool.as_str() {
        "my_tool" => execute_my_tool(&request),
        _ => ExecuteResponse {
            success: false,
            result: format!("Unknown tool: {}", request.tool),
            error_type: Some("UnknownTool".to_string()),
        },
    };
    Ok(Response::builder()
        .status(200)
        .header("content-type", "application/json")
        .body(serde_json::to_vec(&response)?)
        .build())
}

fn execute_my_tool(req: &ExecuteRequest) -> ExecuteResponse {
    // Your tool logic here
    ExecuteResponse {
        success: true,
        result: "Tool executed successfully".to_string(),
        error_type: None,
    }
}

For Python components, use componentize-py:
from spin_sdk import http
from spin_sdk.http import Request, Response
import json
class IncomingHandler(http.IncomingHandler):
    def handle_request(self, request: Request) -> Response:
        if request.uri.endswith("/execute"):
            return self.handle_execute(request)
        elif request.uri.endswith("/health"):
            return Response(200,
                            {"content-type": "application/json"},
                            b'{"status": "healthy"}')
        return Response(404, {}, b'{"error": "Not found"}')

    def handle_execute(self, request: Request) -> Response:
        body = json.loads(request.body)
        tool = body.get("tool")
        args = body.get("args", {})
        if tool == "my_tool":
            result = self.execute_my_tool(args)
            return Response(200,
                            {"content-type": "application/json"},
                            json.dumps(result).encode())
        return Response(400, {},
                        json.dumps({"error": f"Unknown tool: {tool}"}).encode())

    def execute_my_tool(self, args: dict) -> dict:
        return {
            "success": True,
            "result": "Tool executed successfully"
        }

Structure your repository for easy consumption:
my-tools-sdk/
├── README.md
├── spin.toml
├── Dockerfile
├── components/
│ ├── tool-a/
│ └── tool-b/
└── examples/
└── config.yaml
Users can clone and run directly:
git clone https://github.com/myorg/my-tools-sdk
cd my-tools-sdk
spin build
spin up

For production deployments, package your tools SDK in a Docker container. Create a Dockerfile:
FROM ghcr.io/fermyon/spin:v2.7
# Install build dependencies
RUN apt-get update && apt-get install -y \
build-essential \
curl \
&& rm -rf /var/lib/apt/lists/*
# Install Rust and wasm target
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
RUN rustup target add wasm32-wasip1
# Copy source
WORKDIR /app
COPY . .
# Build components
RUN spin build
# Expose default port
EXPOSE 3000
# Run Spin
CMD ["spin", "up", "--listen", "0.0.0.0:3000"]

Build and run:
# Build the image
docker build -t my-tools-sdk .
# Run with environment variables for API keys
docker run -p 3000:3000 \
-e SPIN_VARIABLE_API_KEY=xxx \
my-tools-sdk
# Run with volume mount for persistent KV store
docker run -p 3000:3000 \
-v tools-data:/root/.spin \
my-tools-sdk

Run DeepFabric and your tools SDK together with Docker Compose:
version: '3.8'

services:
  tools-sdk:
    build: ./my-tools-sdk
    ports:
      - "3000:3000"
    environment:
      - SPIN_VARIABLE_GITHUB_TOKEN=${GITHUB_TOKEN}
      - SPIN_VARIABLE_ALLOWED_REPOS=myorg/repo1,myorg/repo2
    volumes:
      - tools-data:/root/.spin

  deepfabric:
    image: python:3.11
    volumes:
      - ./:/workspace
    working_dir: /workspace
    command: >
      sh -c "pip install deepfabric &&
             deepfabric start config.yaml"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    depends_on:
      - tools-sdk

volumes:
  tools-data:

Your DeepFabric configuration points to the containerized service:
generation:
  tools:
    spin_endpoint: "http://tools-sdk:3000"
    available:
      - my_tool
      - another_tool

For easier distribution, publish pre-built images to a container registry:
# Build for multiple platforms
docker buildx build --platform linux/amd64,linux/arm64 \
-t ghcr.io/myorg/my-tools-sdk:latest \
  --push .

Users can then run without building:
docker run -p 3000:3000 ghcr.io/myorg/my-tools-sdk:latest

This pattern enables teams to share specialized tool components--internal APIs, domain-specific services, proprietary integrations--while maintaining the security and isolation guarantees of WebAssembly execution.
Execution-based tool tracing addresses a fundamental limitation in synthetic data generation for agentic models. By replacing simulated tool responses with real execution in WebAssembly sandboxes, we produce training data that reflects actual system behavior--including the error conditions and state dependencies that make tool use challenging.
The Spin framework provides the necessary isolation with minimal overhead, and the three-tier tool system (VFS, Component, Mock) covers the full spectrum from sandboxed file operations to real API access to controlled mock responses.
If you're building agentic training data, consider whether your current approach captures the iterative, uncertainty-driven nature of real tool use. Execution-based tracing ensures it does.