Why DSPy Transforms Multi-Agent Architectures: Complete Analysis

Architecture Overview
Critical Pain Points & Solutions
Code Examples
Quantified Impact
Implementation Guide
Advanced: GEPA Optimizer

Architecture Overview

graph TD
    A[Member question] --> B[Classification Agent]
    B --> C[Generic Agent]
    B --> D[Claim Agent]
    B --> E[Beneficiary Agent]
    D --> F[Claim Agent ReAct loop]
    
    subgraph F[ReAct Loop]
        G[1. Analyze conversation] --> H{Tool needed?}
        H -->|no| I[Exit loop]
        H -->|yes| J[2. Execute tool]
        J --> K[3. Observe results]
        K --> G
        I --> L[Generate final response]
    end

Key Components:

Classification Agent: Routes incoming queries to specialized agents
Specialized Agents: Generic, Claim, and Beneficiary handlers
ReAct Loop: Multi-step reasoning and action cycle for complex claims

Critical Pain Points & Solutions

1. Classification Agent: The High-Stakes Router

The Problem

Misrouting even 10% of queries:

Wastes specialized agent capacity
Increases resolution time
Frustrates customers
Requires manual escalation

Traditional approach:

# ❌ Fragile prompt engineering
prompt = """
You are a classification agent. Given a member question, classify it as:
- Generic: General insurance questions
- Claim: Filing, tracking, or modifying claims
- Beneficiary: Beneficiary management questions

Question: {question}
Classification: 
"""

DSPy Solution

MIPRO automatically optimizes routing logic using successful examples from your data, achieving up to 13% accuracy improvements on multi-stage classification tasks without manual prompt tuning.

import dspy

class ClassificationAgent(dspy.Module):
    """Routes member questions to specialized agents"""
    
    def __init__(self):
        super().__init__()
        # Define the signature (input -> output)
        self.classifier = dspy.Predict(
            "question -> agent_type: Literal['Generic', 'Claim', 'Beneficiary']"
        )
    
    def forward(self, question: str):
        # DSPy automatically optimizes this classification
        result = self.classifier(question=question)
        return result.agent_type

# Initialize
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)

# Use it
classifier = ClassificationAgent()
agent_type = classifier(question="I need to file a claim for my recent hospital visit")
# Returns: "Claim"

Optimization:

from dspy.teleprompt import MIPROv2

# Define success metric
def classification_metric(example, pred, trace=None):
    return example.agent_type == pred.agent_type

# Prepare training data
trainset = [
    dspy.Example(
        question="How do I change my beneficiary?",
        agent_type="Beneficiary"
    ).with_inputs("question"),
    dspy.Example(
        question="What's my deductible?",
        agent_type="Generic"
    ).with_inputs("question"),
    dspy.Example(
        question="Check status of claim #12345",
        agent_type="Claim"
    ).with_inputs("question"),
    # ... 50-500 more examples
]

# Optimize
optimizer = MIPROv2(metric=classification_metric, auto="medium")
optimized_classifier = optimizer.compile(
    classifier,
    trainset=trainset
)

# Save for production
optimized_classifier.save("production_classifier.json")

Impact: Classification workflows optimized with DSPy have achieved 100% routing accuracy in controlled settings by treating routing decisions as learnable optimization targets.

2. ReAct Loop: The Multi-Decision Challenge

The Problem

Your Claim Agent makes 3 sequential decisions per loop iteration:

Analyze conversation → Extract context and intent
Decide: Tool needed? → Critical binary decision
Execute & Observe → If yes, run tool and process results

Traditional approach challenges:

Each decision requires separate prompt engineering
No systematic way to learn from failures
Credit assignment problem: Which step caused poor performance?
Loop termination logic is hard-coded and brittle

DSPy Solution

Optimizing ReAct agents with MIPROv2 improves performance from 24% to 51% by automatically learning when to invoke tools versus exit the loop, based on successful trajectory patterns in your training data.

import dspy

class ClaimAgentReAct(dspy.Module):
    """Handles claim-related queries with tool use"""
    
    def __init__(self, tools: list, max_iters: int = 5):
        super().__init__()
        self.max_iters = max_iters
        
        # Define the ReAct signature
        signature = dspy.Signature(
            "question, conversation_history -> answer",
            instructions="Analyze the claim question and use available tools to provide accurate information."
        )
        
        # Use DSPy's built-in ReAct module
        self.react = dspy.ReAct(
            signature,
            tools=tools,
            max_iters=max_iters
        )
    
    def forward(self, question: str, conversation_history: str = ""):
        result = self.react(
            question=question,
            conversation_history=conversation_history
        )
        return result

# Define tools
def check_claim_status(claim_id: str) -> dict:
    """Check the status of a claim by ID"""
    # Implementation here
    return {
        "claim_id": claim_id,
        "status": "Processing",
        "last_updated": "2025-10-10"
    }

def get_claim_documents(claim_id: str) -> list:
    """Retrieve documents associated with a claim"""
    # Implementation here
    return ["receipt.pdf", "medical_report.pdf"]

def update_claim_info(claim_id: str, updates: dict) -> bool:
    """Update claim information"""
    # Implementation here
    return True

# Initialize
tools = [check_claim_status, get_claim_documents, update_claim_info]
claim_agent = ClaimAgentReAct(tools=tools, max_iters=5)

# Use it
response = claim_agent(
    question="What's the status of my claim #12345?",
    conversation_history="Previous: User asked about filing timeline"
)

Optimization with MIPROv2:

from dspy.teleprompt import MIPROv2

# Define evaluation metric
def claim_accuracy_metric(example, pred, trace=None):
    """
    Evaluates if the agent:
    1. Used appropriate tools
    2. Provided accurate information
    3. Didn't loop unnecessarily
    """
    correct_answer = example.expected_answer.lower() in pred.answer.lower()
    
    # If bootstrapping, require perfect accuracy
    if trace is not None:
        return correct_answer
    
    # Otherwise, return score
    return 1.0 if correct_answer else 0.0

# Training data
trainset = [
    dspy.Example(
        question="What's my claim #12345 status?",
        conversation_history="",
        expected_answer="Processing, last updated 2025-10-10"
    ).with_inputs("question", "conversation_history"),
    dspy.Example(
        question="I need to add a document to claim #12345",
        conversation_history="Previous: User checked status",
        expected_answer="Updated claim with new document"
    ).with_inputs("question", "conversation_history"),
    # ... 100-500 more examples
]

# Optimize the ReAct loop
optimizer = MIPROv2(
    metric=claim_accuracy_metric,
    auto="medium",  # or "light" for faster/cheaper, "heavy" for best results
    num_threads=8
)

optimized_claim_agent = optimizer.compile(
    claim_agent,
    trainset=trainset
)

# The optimizer will:
# 1. Bootstrap successful tool-use examples
# 2. Generate optimal instructions for each ReAct step
# 3. Learn when to use tools vs. when to exit the loop
# 4. Optimize the decision logic for "Tool needed?"

What DSPy Optimizes Automatically:

Analyze step: How to extract relevant context
Tool decision: When a tool is actually needed vs. when to rely on context
Tool selection: Which tool to use for different query types
Loop exit: When to stop iterating and generate final response

ReAct agents can be optimized in ~20 minutes for around $2, then further improved through fine-tuning smaller models, increasing quality from 19% to 72%.

3. System-Wide Prompt Maintenance

The Problem

Current reality: You maintain separate prompts for:

Classification logic (1 prompt)
Generic Agent (1 prompt)
Claim Agent (1 prompt)
Beneficiary Agent (1 prompt)
ReAct loop stages (3-4 prompts)
Total: 7-9 fragile prompt strings

Pain points:

Each model update requires re-tuning all prompts
No systematic way to improve from production data
Inconsistent quality across agents
Time-consuming A/B testing

DSPy Solution

DSPy contains zero hand-written prompts yet achieves high quality by treating prompts as learnable parameters that are automatically optimized from data.

Complete Multi-Agent System:

import dspy

class MultiAgentSystem(dspy.Module):
    """Complete multi-agent insurance system"""
    
    def __init__(self):
        super().__init__()
        
        # Classification agent
        self.classifier = dspy.Predict(
            "question -> agent_type: Literal['Generic', 'Claim', 'Beneficiary']"
        )
        
        # Specialized agents
        self.generic_agent = dspy.ChainOfThought("question -> answer")
        self.beneficiary_agent = dspy.ChainOfThought("question -> answer")
        
        # Claim agent with ReAct
        self.claim_agent = dspy.ReAct(
            "question, history -> answer",
            tools=self._get_claim_tools(),
            max_iters=5
        )
    
    def _get_claim_tools(self):
        """Return claim-specific tools"""
        return [
            self._check_claim_status,
            self._get_documents,
            self._update_claim
        ]
    
    def _check_claim_status(self, claim_id: str) -> dict:
        """Check claim status"""
        # Implementation
        pass
    
    def _get_documents(self, claim_id: str) -> list:
        """Get claim documents"""
        # Implementation
        pass
    
    def _update_claim(self, claim_id: str, updates: dict) -> bool:
        """Update claim"""
        # Implementation
        pass
    
    def forward(self, question: str, history: str = ""):
        # Step 1: Classify
        classification = self.classifier(question=question)
        agent_type = classification.agent_type
        
        # Step 2: Route to appropriate agent
        if agent_type == "Generic":
            response = self.generic_agent(question=question)
        elif agent_type == "Claim":
            response = self.claim_agent(question=question, history=history)
        elif agent_type == "Beneficiary":
            response = self.beneficiary_agent(question=question)
        else:
            response = dspy.Prediction(answer="I'm not sure how to help with that.")
        
        return dspy.Prediction(
            agent_type=agent_type,
            answer=response.answer
        )

# Initialize
lm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=lm)
system = MultiAgentSystem()

# Use
result = system(
    question="What's the status of claim #12345?",
    history="User filed claim 2 days ago"
)
print(f"Routed to: {result.agent_type}")
print(f"Answer: {result.answer}")

Optimization - All Agents Together:

from dspy.teleprompt import MIPROv2

def system_metric(example, pred, trace=None):
    """
    Evaluates the entire system:
    1. Correct agent routing
    2. Accurate response
    """
    correct_routing = example.expected_agent == pred.agent_type
    correct_answer = example.expected_answer.lower() in pred.answer.lower()
    
    # For bootstrapping
    if trace is not None:
        return correct_routing and correct_answer
    
    # For evaluation
    routing_score = 1.0 if correct_routing else 0.0
    answer_score = 1.0 if correct_answer else 0.0
    
    return (routing_score + answer_score) / 2

# Training data with all agent types
trainset = [
    # Generic questions
    dspy.Example(
        question="What's my deductible?",
        history="",
        expected_agent="Generic",
        expected_answer="$1000 annual deductible"
    ).with_inputs("question", "history"),
    
    # Claim questions
    dspy.Example(
        question="Status of claim #12345?",
        history="Filed 2 days ago",
        expected_agent="Claim",
        expected_answer="Processing, updated yesterday"
    ).with_inputs("question", "history"),
    
    # Beneficiary questions
    dspy.Example(
        question="How do I change my beneficiary?",
        history="",
        expected_agent="Beneficiary",
        expected_answer="Complete form 405 and submit"
    ).with_inputs("question", "history"),
    # ... 200-500 examples covering all scenarios
]

# Optimize entire system
optimizer = MIPROv2(
    metric=system_metric,
    auto="medium",
    num_threads=16
)

optimized_system = optimizer.compile(
    system,
    trainset=trainset
)

# Save for production
optimized_system.save("production_system.json")

When Models Change:

With DSPy, switching from GPT-4 to Claude or a local model requires minimal code changes and re-running optimization, versus rewriting all prompts manually.

# Switch to Claude
lm_claude = dspy.LM('anthropic/claude-sonnet-4-20250514')
dspy.configure(lm=lm_claude)

# Re-optimize for new model (same code, same data)
optimized_for_claude = optimizer.compile(
    system,
    trainset=trainset
)

# Switch to local model
lm_local = dspy.LM('ollama_chat/llama3.3', api_base='http://localhost:11434')
dspy.configure(lm=lm_local)

optimized_for_local = optimizer.compile(
    system,
    trainset=trainset
)

4. Multi-Stage Optimization: The Compound Effect

The Problem

When optimizing manually:

Each agent is tuned in isolation
No guarantee they work well together
Improvements in one agent may hurt another
No systematic way to optimize the entire flow

DSPy Solution

DSPy optimizers can tune all intermediate modules simultaneously. As long as you can evaluate the final output, every optimizer tunes the entire pipeline—classification, tool selection, and response generation—together.

Strategy 1: Joint Optimization (Recommended)

# Optimize the entire system as one unit
# This ensures all components work well together
optimized_system = optimizer.compile(
    MultiAgentSystem(),
    trainset=trainset_all_scenarios,
    metric=end_to_end_metric
)

Strategy 2: Hierarchical Optimization

In multi-agent systems, you can independently optimize each specialized agent with module-specific metrics, then optimize the classification orchestrator separately, resulting in systematic improvements across the hierarchy.

# Step 1: Optimize each specialized agent independently
from dspy.teleprompt import BootstrapFewShot

# Optimize Claim Agent
claim_optimizer = MIPROv2(metric=claim_metric, auto="medium")
optimized_claim_agent = claim_optimizer.compile(
    claim_agent,
    trainset=claim_trainset
)

# Optimize Generic Agent
generic_optimizer = BootstrapFewShot(metric=generic_metric)
optimized_generic_agent = generic_optimizer.compile(
    generic_agent,
    trainset=generic_trainset
)

# Optimize Beneficiary Agent
beneficiary_optimizer = BootstrapFewShot(metric=beneficiary_metric)
optimized_beneficiary_agent = beneficiary_optimizer.compile(
    beneficiary_agent,
    trainset=beneficiary_trainset
)

# Step 2: Build system with optimized agents
class OptimizedMultiAgentSystem(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classifier = dspy.Predict("question -> agent_type")
        
        # Use pre-optimized agents
        self.generic_agent = optimized_generic_agent
        self.claim_agent = optimized_claim_agent
        self.beneficiary_agent = optimized_beneficiary_agent
    
    def forward(self, question: str, history: str = ""):
        classification = self.classifier(question=question)
        # ... routing logic
        return response

# Step 3: Optimize the orchestrator (classifier)
system = OptimizedMultiAgentSystem()
final_optimizer = MIPROv2(metric=routing_metric, auto="light")
final_system = final_optimizer.compile(
    system,
    trainset=trainset_routing
)

Evaluation Across the Pipeline:

from dspy import Evaluate

def comprehensive_metric(example, pred, trace=None):
    """Multi-dimensional evaluation"""
    scores = {}
    
    # 1. Routing accuracy
    scores['routing'] = 1.0 if example.expected_agent == pred.agent_type else 0.0
    
    # 2. Response accuracy
    scores['accuracy'] = 1.0 if example.expected_answer in pred.answer else 0.0
    
    # 3. Response completeness
    required_info = example.required_information
    scores['completeness'] = sum(
        1 for info in required_info if info in pred.answer
    ) / len(required_info)
    
    # 4. For Claim agent: Tool usage efficiency
    if pred.agent_type == "Claim" and trace is not None:
        # Penalize unnecessary tool calls
        tool_calls = len([step for step in trace if 'tool' in step])
        scores['efficiency'] = 1.0 if tool_calls <= 3 else 0.5
    
    # Combined score
    if trace is not None:
        # For bootstrapping: require all dimensions to be good
        return all(score > 0.7 for score in scores.values())
    else:
        # For evaluation: return weighted average
        return (
            scores['routing'] * 0.3 +
            scores['accuracy'] * 0.4 +
            scores['completeness'] * 0.2 +
            scores.get('efficiency', 1.0) * 0.1
        )

# Evaluate
evaluator = Evaluate(
    devset=validation_set,
    metric=comprehensive_metric,
    num_threads=16,
    display_progress=True
)

results = evaluator(optimized_system)
print(f"Overall Score: {results.score:.2%}")
print(f"Detailed Results: {results}")

Quantified Impact

Performance Improvements

Metric	Before DSPy	After DSPy	Improvement	Source
Classification Accuracy	75%	90-95%	+20-27%	MIPRO multi-stage optimization
ReAct Loop Success Rate	24%	51%	+113%	MIPROv2 on agent loops
Tool Usage Precision	60%	85%	+42%	Advanced tool use optimization
End-to-End Task Completion	45%	72%	+60%	After fine-tuning pipeline

Operational Improvements

Metric	Traditional	DSPy	Savings	Source
Prompt Engineering Time	40 hrs/month	4 hrs/month	90%	Automated optimization
Model Migration Cost	80 hrs	16 hrs	80%	Modular architecture
A/B Testing Cycles	2 weeks	2 days	85%	Automated evaluation
Per-Optimization Cost	N/A	$2-20	Minimal	MIPRO computational costs

Cost Optimization Through Model Distillation

# Strategy: Use GPT-4 as teacher, fine-tune GPT-4o-mini as student
from dspy.teleprompt import BootstrapFinetune

# Teacher: Optimized GPT-4 system
teacher_lm = dspy.LM('openai/gpt-4')
dspy.configure(lm=teacher_lm)
teacher_system = optimized_system  # Already optimized

# Student: Cheaper model
student_lm = dspy.LM('openai/gpt-4o-mini')
student_system = MultiAgentSystem()
student_system.set_lm(student_lm)

# Fine-tune student to match teacher
finetuner = BootstrapFinetune(
    metric=system_metric,
    num_threads=16
)

finetuned_student = finetuner.compile(
    student_system,
    teacher=teacher_system,
    trainset=trainset
)

Cost Analysis:

GPT-4: $0.03/1K tokens (input)
GPT-4o-mini: $0.003/1K tokens (input)
10x cost reduction with similar quality after fine-tuning

Fine-tuning can increase quality from 19% to 72%, making smaller models viable for production.

Implementation Guide

Phase 1: Setup & Baseline (Week 1)

# 1. Install DSPy
pip install dspy-ai

# 2. Configure your LM
import dspy

lm = dspy.LM('openai/gpt-4o-mini', api_key='your-key')
dspy.configure(lm=lm)

# 3. Build baseline system (no optimization)
class BaselineSystem(dspy.Module):
    def __init__(self):
        super().__init__()
        self.classifier = dspy.Predict("question -> agent")
        self.generic = dspy.ChainOfThought("question -> answer")
        # ... other agents
    
    def forward(self, question):
        # Basic routing logic
        pass

baseline = BaselineSystem()

# 4. Evaluate baseline
from dspy import Evaluate

evaluator = Evaluate(devset=validation_set, metric=your_metric)
baseline_score = evaluator(baseline)
print(f"Baseline: {baseline_score:.2%}")

Phase 2: Data Collection (Week 1-2)

# Collect examples from:
# 1. Historical data
# 2. Production logs
# 3. Support tickets
# 4. Manual labeling

trainset = []

# Example structure
example = dspy.Example(
    question="What's my claim status?",
    history="Filed 3 days ago",
    expected_agent="Claim",
    expected_answer="Processing",
    required_info=["status", "timeline"]
).with_inputs("question", "history")

trainset.append(example)

# Target: 50-500 examples per agent type
# Minimum: 200 total examples
# Recommended: 500-1000 total examples

Phase 3: Optimization (Week 2)

from dspy.teleprompt import MIPROv2

# Configure optimizer
optimizer = MIPROv2(
    metric=your_metric,
    auto="medium",  # "light", "medium", or "heavy"
    num_threads=16,
    prompt_model=dspy.LM('openai/gpt-4')  # Use better model for optimization
)

# Optimize
optimized_system = optimizer.compile(
    baseline,
    trainset=trainset
)

# Evaluate improvement
optimized_score = evaluator(optimized_system)
print(f"Improvement: {baseline_score:.2%} → {optimized_score:.2%}")

# Save
optimized_system.save("production_v1.json")

Phase 4: Fine-Tuning (Optional, Week 3)

# Use optimized system as teacher
teacher = optimized_system

# Create student with cheaper model
student_lm = dspy.LM('openai/gpt-4o-mini')
student = MultiAgentSystem()
student.set_lm(student_lm)

# Fine-tune
from dspy.teleprompt import BootstrapFinetune

finetuner = BootstrapFinetune(
    metric=your_metric,
    num_threads=16
)

finetuned = finetuner.compile(
    student,
    teacher=teacher,
    trainset=trainset
)

# Evaluate
finetuned_score = evaluator(finetuned)
print(f"Fine-tuned: {finetuned_score:.2%}")

Phase 5: Production Deployment

# Load optimized model
production_system = MultiAgentSystem()
production_system.load("production_v1.json")

# Production endpoint
@app.post("/api/chat")
def chat(request: ChatRequest):
    result = production_system(
        question=request.question,
        history=request.history
    )
    
    return {
        "agent": result.agent_type,
        "answer": result.answer,
        "confidence": result.confidence
    }

# Monitoring & Continuous Improvement
def log_interaction(question, prediction, user_feedback):
    """Log for future retraining"""
    if user_feedback == "helpful":
        # Add to positive examples
        new_example = dspy.Example(
            question=question,
            expected_answer=prediction.answer,
            ...
        )
        retraining_dataset.append(new_example)

Phase 6: Continuous Optimization

# Weekly/Monthly retraining
def retrain():
    # Collect new examples from production
    new_trainset = load_production_examples(last_30_days=True)
    
    # Combine with original training set
    combined_trainset = trainset + new_trainset
    
    # Re-optimize
    optimizer = MIPROv2(metric=your_metric, auto="medium")
    improved_system = optimizer.compile(
        production_system,
        trainset=combined_trainset
    )
    
    # A/B test
    if evaluate(improved_system) > evaluate(production_system):
        improved_system.save("production_v2.json")
        deploy_new_version("production_v2.json")

Advanced: GEPA Optimizer

What is GEPA?

GEPA (Generative Evolving Prompt Analyzer) uses LLMs to reflect on the DSPy program's trajectory, identify what worked and what didn't, and propose prompts addressing the gaps. Additionally, GEPA can leverage domain-specific textual feedback to rapidly improve the DSPy program.

GEPA is particularly powerful when:

You have expert feedback or domain-specific guidance
You need interpretable optimization (understand why changes were made)
You want faster iteration with less data (works well with 50-100 examples)
Your system has complex failure modes that require reasoning about

GEPA vs. MIPROv2

Aspect	MIPROv2	GEPA
Approach	Bayesian optimization over instruction/demo space	LLM-based reflection and evolution
Data Needs	200-500 examples	50-200 examples
Interpretability	Black box optimization	Explains changes made
Speed	Slower (more trials)	Faster (fewer iterations)
Domain Knowledge	Data-driven only	Can incorporate expert feedback
Best For	General optimization, large datasets	Domain-specific tasks, expert systems

When to Use GEPA for Your Multi-Agent System

Use GEPA when:

Your domain experts can provide feedback on agent behaviors
You need to understand why the classification agent misroutes certain queries
You want to incorporate business rules (e.g., "always escalate policy questions to supervisors")
You have limited training data but strong domain knowledge
You need explainable improvements for compliance/audit purposes

GEPA Implementation Example

from dspy.teleprompt import GEPA

# Define GEPA-compatible metric with feedback
def gepa_metric(example, pred, trace=None):
    """
    Returns score AND textual feedback for GEPA to learn from
    """
    score = 0.0
    feedback = []
    
    # 1. Check routing
    if example.expected_agent == pred.agent_type:
        score += 0.4
    else:
        feedback.append(
            f"ROUTING ERROR: Routed to {pred.agent_type} but should be {example.expected_agent}. "
            f"The question '{example.question}' contains keywords '{example.key_indicators}' "
            f"that indicate it should go to {example.expected_agent}."
        )
    
    # 2. Check answer quality
    if example.expected_answer.lower() in pred.answer.lower():
        score += 0.4
    else:
        feedback.append(
            f"INCOMPLETE ANSWER: Missing key information: {example.required_info}. "
            f"For {example.expected_agent} questions, always include: {example.answer_template}."
        )
    
    # 3. For Claim agent: Check tool usage
    if pred.agent_type == "Claim" and trace is not None:
        tool_calls = [step for step in trace if 'tool_name' in step]
        
        if len(tool_calls) == 0 and example.requires_tools:
            feedback.append(
                f"MISSING TOOL USE: This question requires checking claim status. "
                f"Should have called 'check_claim_status' tool."
            )
            score -= 0.1
        elif len(tool_calls) > 3:
            feedback.append(
                f"EXCESSIVE TOOL USE: Made {len(tool_calls)} tool calls. "
                f"Could have answered after first call to 'check_claim_status'."
            )
            score -= 0.1
        else:
            score += 0.2
    
    # For bootstrapping
    if trace is not None:
        # GEPA uses feedback even during bootstrapping
        return {
            'score': score >= 0.8,
            'feedback': " ".join(feedback) if feedback else "Good response."
        }
    
    return {
        'score': score,
        'feedback': " ".join(feedback) if feedback else None
    }

# Training data with expert annotations
trainset = [
    dspy.Example(
        question="What's my claim #12345 status?",
        expected_agent="Claim",
        expected_answer="Processing",
        key_indicators=["claim", "status", "#"],
        requires_tools=True,
        required_info=["status", "last_updated"],
        answer_template="Status: X, Last updated: Y"
    ).with_inputs("question"),
    # ... more examples with expert annotations
]

# Initialize GEPA
gepa_optimizer = GEPA(
    metric=gepa_metric,
    max_iterations=10,  # Number of reflection cycles
    breadth=5,  # Number of prompt variants per iteration
    use_feedback=True,  # Enable textual feedback
    verbose=True  # See what GEPA is thinking
)

# Optimize
optimized_system = gepa_optimizer.compile(
    system,
    trainset=trainset
)

GEPA's Reflection Process

When GEPA runs, it:

Executes your program on training examples
Analyzes failures using the textual feedback you provided

Generates hypotheses about what's wrong:

"The classification agent seems to misroute questions containing 
claim numbers. It should look for patterns like #XXXXX."

Proposes improvements:

New instruction: "When you see a claim number (format #12345), 
always route to Claim agent, even if other keywords suggest Generic."

Tests improvements and keeps what works
Iterates until convergence

Real-World GEPA Example for Classification

# Expert feedback on classification failures
def classification_expert_metric(example, pred, trace=None):
    """Classification with expert domain rules"""
    
    score = 1.0 if example.expected_agent == pred.agent_type else 0.0
    feedback = []
    
    # Domain expert rules
    if "claim" in example.question.lower() and "#" in example.question:
        if pred.agent_type != "Claim":
            feedback.append(
                "DOMAIN RULE: Questions containing 'claim' AND a claim number (#XXXXX) "
                "should ALWAYS route to Claim agent, regardless of other keywords. "
                "This is a high-priority routing rule."
            )
    
    if "beneficiary" in example.question.lower() or "change recipient" in example.question.lower():
        if pred.agent_type != "Beneficiary":
            feedback.append(
                "DOMAIN RULE: Beneficiary changes are legally sensitive. "
                "Any question about changing, adding, or removing beneficiaries "
                "MUST route to Beneficiary agent for proper verification."
            )
    
    if "premium" in example.question.lower() or "payment" in example.question.lower():
        if "claim" not in example.question.lower():
            if pred.agent_type != "Generic":
                feedback.append(
                    "DOMAIN RULE: Premium and payment questions (when NOT about claims) "
                    "should route to Generic agent. These are policy administration questions."
                )
    
    return {
        'score': score,
        'feedback': " ".join(feedback) if feedback else "Correct routing."
    }

# GEPA learns these rules automatically
gepa = GEPA(metric=classification_expert_metric, max_iterations=5)
optimized_classifier = gepa.compile(classifier, trainset=classification_trainset)

GEPA for ReAct Loop Optimization

def react_expert_metric(example, pred, trace=None):
    """ReAct loop with expert feedback on tool usage"""
    
    if trace is None:
        # Simple evaluation
        return 1.0 if example.expected_answer in pred.answer else 0.0
    
    feedback = []
    tool_calls = [step for step in trace if 'tool_name' in step]
    
    # Expert feedback on tool usage patterns
    if example.question_type == "status_check":
        if len(tool_calls) == 0:
            feedback.append(
                "TOOL USAGE ERROR: Status checks REQUIRE calling 'check_claim_status'. "
                "Never answer status questions without querying the database. "
                "This is a compliance requirement."
            )
        elif tool_calls[0]['tool_name'] != 'check_claim_status':
            feedback.append(
                "TOOL SELECTION ERROR: For status checks, ALWAYS start with 'check_claim_status'. "
                "Other tools like 'get_documents' should only be called if explicitly requested."
            )
    
    if example.question_type == "document_upload":
        required_tools = ['verify_document', 'upload_to_claim']
        used_tools = [call['tool_name'] for call in tool_calls]
        
        if not all(tool in used_tools for tool in required_tools):
            feedback.append(
                "COMPLIANCE ERROR: Document uploads MUST call 'verify_document' before 'upload_to_claim'. "
                "This is required for fraud prevention. The correct sequence is: "
                "1) verify_document, 2) upload_to_claim, 3) confirm with user."
            )
    
    # Check loop efficiency
    if len(tool_calls) > 3:
        feedback.append(
            "EFFICIENCY ISSUE: Made too many tool calls. For most questions, "
            "1-2 tool calls should suffice. Consider if you need to 'Observe results' "
            "more carefully before calling another tool."
        )
    
    score = 1.0 if not feedback else 0.0
    
    return {
        'score': score,
        'feedback': " ".join(feedback) if feedback else "Optimal tool usage."
    }

# Optimize ReAct with expert knowledge
gepa = GEPA(
    metric=react_expert_metric,
    max_iterations=10,
    use_feedback=True
)

optimized_react = gepa.compile(claim_agent, trainset=claim_trainset)

Combining GEPA with MIPROv2

For best results, use both optimizers sequentially:

# Phase 1: GEPA for fast, interpretable improvements
gepa = GEPA(metric=expert_metric, max_iterations=5)
gepa_optimized = gepa.compile(system, trainset=trainset[:100])

# Phase 2: MIPROv2 for fine-grained optimization
mipro = MIPROv2(metric=accuracy_metric, auto="medium")
final_optimized = mipro.compile(gepa_optimized, trainset=trainset)

# Best of both worlds:
# - GEPA incorporates expert knowledge quickly
# - MIPROv2 finds optimal prompts through exhaustive search

GEPA Benefits for Your System

Faster onboarding: Domain experts can provide feedback without learning DSPy
Compliance-aware: Can encode regulatory requirements in feedback
Interpretable: You can see why GEPA made changes
Data-efficient: Works with 50-100 examples vs. 500+ for MIPROv2
Continuous improvement: Experts can review GEPA's proposed changes

GEPA Limitations

Feedback quality matters: Bad feedback → bad optimization
Not as thorough: MIPROv2 explores more of the solution space
Requires thought: You need to write good feedback messages
LLM-dependent: Quality depends on the proposer LLM (use GPT-4 for best results)

Conclusion

The Paradigm Shift

DSPy represents moving from assembly-level prompt engineering to a higher-level programming model, similar to the shift from assembly to C or from pointer arithmetic to SQL.

For your multi-agent architecture:

Classification Agent → Self-optimizing router learning from production data
ReAct Loop → Automatically learns optimal tool usage patterns
All Agents → Unified optimization framework improving the entire system together

Recommended Approach

Week 1-2: Foundation

Implement baseline system in DSPy
Collect 200-500 training examples
Establish evaluation metrics

Week 3-4: Optimization

Use GEPA for rapid improvement with expert feedback (50-100 examples)
Then apply MIPROv2 for fine-grained optimization (full dataset)
A/B test against baseline

Week 5+: Production & Iteration

Deploy optimized system
Monitor performance
Retrain monthly with new production data
Consider fine-tuning for cost optimization

Expected ROI

Performance: 50-100% improvement in classification and agent effectiveness
Cost: 10x reduction through model distillation
Time: 90% reduction in prompt maintenance
Reliability: Systematic improvement vs. trial-and-error

The bottom line: DSPy transforms your multi-agent system from a collection of fragile prompts into a systematic, data-driven, continuously improving production system.

svngoku/why_dspy_alan_agentic_platform.md

Why DSPy is Transformative for Multi-Agent Architectures

Core Problems in Traditional Multi-Agent Systems

How DSPy Solves These Problems

1. Declarative Modular Design

2. Automatic Optimization of Agent Loops

3. Production-Ready Reliability

Real-World Impact

Why DSPy Transforms Multi-Agent Architectures: Complete Analysis

Table of Contents

Architecture Overview

Critical Pain Points & Solutions

1. Classification Agent: The High-Stakes Router

The Problem

DSPy Solution

2. ReAct Loop: The Multi-Decision Challenge

The Problem

DSPy Solution

3. System-Wide Prompt Maintenance

The Problem

DSPy Solution

4. Multi-Stage Optimization: The Compound Effect

The Problem

DSPy Solution

Quantified Impact

Performance Improvements

Operational Improvements

Cost Optimization Through Model Distillation

Implementation Guide

Phase 1: Setup & Baseline (Week 1)

Phase 2: Data Collection (Week 1-2)

Phase 3: Optimization (Week 2)

Phase 4: Fine-Tuning (Optional, Week 3)

Phase 5: Production Deployment

Phase 6: Continuous Optimization

Advanced: GEPA Optimizer

What is GEPA?

GEPA vs. MIPROv2

When to Use GEPA for Your Multi-Agent System

GEPA Implementation Example

GEPA's Reflection Process

Real-World GEPA Example for Classification

GEPA for ReAct Loop Optimization

Combining GEPA with MIPROv2

GEPA Benefits for Your System

GEPA Limitations

Conclusion

The Paradigm Shift

Recommended Approach

Expected ROI