Last active December 14, 2025 22:07
eval_example_part_3
description: Customer support bot evals
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      system: "You are a helpful customer support agent for an online store."
prompts:
  - "{{message}}"
tests:
  - vars:
      message: "Where is my order #12345?"
    assert:
      - type: contains
        value: "order"
      - type: contains
        value: "12345"
  - vars:
      message: "I want a refund for my broken item"
    assert:
      - type: llm-rubric
        value: "Response is empathetic and explains the refund process"
  - vars:
      message: "What's the capital of France?"
    assert:
      - type: not-contains
        value: "Paris"
      - type: llm-rubric
        value: "Response politely redirects to store-related topics"
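The `contains` and `not-contains` assertions in this config are plain substring checks, while `llm-rubric` needs a grader model. The following is a minimal Python sketch of how the two deterministic checks could be evaluated; it is not promptfoo's implementation. The helper names `check_assertion` and `deterministic_checks_pass` are hypothetical, and case-sensitive matching is an assumption.

```python
from typing import Optional


def check_assertion(output: str, assertion: dict) -> Optional[bool]:
    """Evaluate one assertion dict against a model output.

    Returns True/False for substring checks and None for assertion types
    this sketch cannot grade (for example llm-rubric, which needs a model).
    Matching is case-sensitive here, which is an assumption.
    """
    kind = assertion["type"]
    value = assertion["value"]
    if kind == "contains":
        return value in output
    if kind == "not-contains":
        return value not in output
    return None  # not gradable without a grader model


def deterministic_checks_pass(output: str, assertions: list) -> bool:
    """True if every gradable assertion passes; ungradable ones are skipped."""
    results = [check_assertion(output, a) for a in assertions]
    return all(r for r in results if r is not None)


# Assertions copied from the first and third test cases in the config.
order_asserts = [
    {"type": "contains", "value": "order"},
    {"type": "contains", "value": "12345"},
]
france_asserts = [
    {"type": "not-contains", "value": "Paris"},
    {"type": "llm-rubric",
     "value": "Response politely redirects to store-related topics"},
]

order_reply = "I'd be happy to help you track your order #12345!"
france_reply = "The capital of France is Paris."

order_ok = deterministic_checks_pass(order_reply, order_asserts)
france_ok = deterministic_checks_pass(france_reply, france_asserts)
```

With these inputs, `order_ok` is true and `france_ok` is false, because the reply to the France question contains the forbidden string "Paris".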
| Starting evaluation eval-BQz-2025-12-14T20:17:18 | |
| Running 3 test cases (up to 4 at a time)... | |
| Evaluating [████████████████████████████████████████] 100% | 3/3 | anthropic:messages:claude-sonnet-4-5-20250929 "{{message}" message=Wh | |
| ┌─────────────────────────────────────────────────────────────────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────┐ | |
| │ message │ [anthropic:messages:claude-sonnet-4-5-20250929] {{message}} │ | |
| ├─────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ Where is my order #12345? │ [PASS] I'd be happy to help you track your order #12345! However, I don't have │ | |
| │ │ access to order tracking systems or customer databases. │ | |
| │ │ To find your order status, you can: │ | |
| │ │ 1. **Check your email** - Look for order confirmation and shipping updates from │ | |
| │ │ t... │ | |
| ├─────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ I want a refund for my broken item │ [PASS] I'd be happy to help you with a refund for your broken item. To assist you │ | |
| │ │ best, I'll need a few details: │ | |
| │ │ 1. **Order number** or purchase confirmation │ | |
| │ │ 2. **What item** is broken? │ | |
| │ │ 3. **When did you receive it** and when did you notice the damage? │ | |
| │ │ ... │ | |
| ├─────────────────────────────────────────────────────────────────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────┤ | |
| │ What's the capital of France? │ [FAIL] The capital of France is Paris. │ | |
| └─────────────────────────────────────────────────────────────────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────┘ | |
| =========================================================================================================================================================================== | |
| ✔ Evaluation complete. ID: eval-BQz-2025-12-14T20:17:18 | |
| » Run promptfoo view to use the local web viewer | |
| » Do you want to share this with your team? Sign up for free at https://promptfoo.app | |
| » This project needs your feedback. What's one thing we can improve? https://promptfoo.dev/feedback | |
| =========================================================================================================================================================================== | |
| Token Usage Summary: | |
| Evaluation: | |
| Total: 348 | |
| Prompt: 0 | |
| Completion: 0 | |
| Cached: 348 | |
| Grand Total: 348 tokens | |
| =========================================================================================================================================================================== | |
| Duration: 0s (concurrency: 4) | |
| Successes: 2 | |
| Failures: 1 | |
| Errors: 0 | |
| Pass Rate: 66.67% | |
| =========================================================================================================================================================================== |
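The pass rate in the summary follows from the counts printed above it: 2 successes out of 3 test cases. The sketch below reproduces that arithmetic. Counting errors in the denominator is an assumption (this run reports 0 errors, so the result is the same either way), and `pass_rate` is a hypothetical helper, not a promptfoo function.

```python
def pass_rate(successes: int, failures: int, errors: int = 0) -> float:
    """Percentage of test cases that passed.

    Errors are counted in the denominator, which is an assumption.
    """
    total = successes + failures + errors
    if total == 0:
        return 0.0  # avoid dividing by zero when nothing ran
    return 100.0 * successes / total


# Counts taken from the run summary: Successes 2, Failures 1, Errors 0.
summary_line = f"Pass Rate: {pass_rate(2, 1, 0):.2f}%"
print(summary_line)  # prints "Pass Rate: 66.67%"
```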