diffprompt

model: groq/llama-3.3-70b-versatile  ·  judge: local/qwen2.5:7b  ·  tests: 20  ·  diversity: 0.81

Prompts

v1 — baseline
You are a customer support agent. Be helpful, empathetic, and thorough
v2 — candidate
You are a customer support agent. Be concise and solution-focused. Skip pleasantries

Summary

15.4/100
▲ 5 improved ▼ 15 regressed → 0 neutral

Behavioral Profile

v2 performs well when…
emotional_state:neutral  ·  0.71  ·  4 tests
request_type:troubleshooting  ·  0.75  ·  3 tests
user_intent:complaint  ·  0.77  ·  4 tests
v2 struggles when…
emotional_state:appreciative  ·  0.53  ·  2 tests ⚠ low n
topic_complexity:general_inquiry  ·  0.55  ·  3 tests ⚠ no typical tests
request_type:status_update  ·  0.58  ·  5 tests
user_intent:informational  ·  0.63  ·  5 tests
request_type:explanation  ·  0.64  ·  2 tests ⚠ low n
emotional_state:confused  ·  0.68  ·  5 tests
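
The per-tag rows above can be read as averages over the tests that carry each tag. The following is a minimal sketch of that kind of slicing, not diffprompt's actual implementation: the function name `profile_by_tag`, the 0.70 cut-off between "performs well" and "struggles", and the rule that fewer than 3 tests counts as low n are all assumptions inferred from the numbers in this report.

```python
from collections import defaultdict
from statistics import mean


def profile_by_tag(results, threshold=0.70, min_n=3):
    """Split tags into strong and weak slices by mean score.

    results: iterable of (tags, score) pairs, one per test.
    threshold and min_n are assumed values, not taken from the tool.
    """
    by_tag = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            by_tag[tag].append(score)

    strong, weak = [], []
    for tag, scores in by_tag.items():
        row = {
            "tag": tag,
            "mean": round(mean(scores), 2),
            "n": len(scores),
            "low_n": len(scores) < min_n,  # too few tests to trust the mean
        }
        (strong if row["mean"] >= threshold else weak).append(row)

    # Both lists are ordered by ascending mean, as in the report above.
    strong.sort(key=lambda r: r["mean"])
    weak.sort(key=lambda r: r["mean"])
    return strong, weak


# Hypothetical scores for illustration only, not the data behind this report.
example = [
    (["intent:complaint", "state:neutral"], 0.9),
    (["intent:complaint"], 0.7),
    (["state:neutral"], 0.8),
    (["state:neutral"], 0.7),
    (["intent:info"], 0.5),
]
strong, weak = profile_by_tag(example)
for row in strong + weak:
    flag = " (low n)" if row["low_n"] else ""
    print(f'{row["tag"]}  {row["mean"]}  {row["n"]} tests{flag}')
```

Because a single test carries several tags, the slices overlap, which is why the test counts across rows add up to more than the 20 tests in the run.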

Key Examples

Most Important
user_intent:informational emotional_state:appreciative topic_complexity:general_inquiry request_type:status_update  ·  divergence 0.70
input: HELLO
v1: Hello. I'm so glad you reached out to us. My name is Alex, and I'll be happy to assist you with any questions or concerns you may have. How can I help you today? Is there something specific you'd like to know or an issue
v2: What's the issue you're experiencing?
why: This reveals that the prompt change caused a regression because the revised model (V2) lost the ability to generate a friendly and introductory response, instead opting for a more direct and abrupt query, indicating a shift in the model's understanding of the input or its response generation mechanism.
Best Improvement
user_intent:complaint emotional_state:frustrated topic_complexity:technical_issue request_type:troubleshooting  ·  divergence 0.30
input: I've been having this issue with my laptop for months now and I've tried everything to fix it, from restarting it to reinstalling the operat
v1: I'm so sorry to hear that you've been dealing with this frustrating issue for months. It can be really challenging to troubleshoot a problem that's been persistent, especially when you've already tried a variety of troub
v2: To troubleshoot the issue, I'll need more information. Can you please provide the following details: 1. Laptop model 2. The new software you installed before the issue started 3. A detailed description of the problem y
why: The change from V1 to V2 reveals that the new mechanism is more direct and to-the-point, prioritizing the efficient collection of specific information necessary for troubleshooting over empathetic and exploratory conversation, indicating a shift towards a more structured and pragmatic approach.
Most Surprising
user_intent:informational emotional_state:neutral topic_complexity:general_inquiry request_type:explanation  ·  divergence 0.40
input: cat
v1: It seems like you might be thinking about cats. Are you a cat owner or just a cat lover? Is there something specific you'd like to know about cats, such as their behavior, health, or care? I'm here to help with any quest
v2: Is there an issue with your cat or do you need information on cat care?
why: This reveals that the prompt change caused a regression in the model's ability to generate engaging and open-ended outputs, likely due to V2 being optimized for more direct and concise responses, resulting in a loss of conversational tone and helpfulness in its output.

Failure Mode Clusters

CONTEXT_LOSS (11)  ·  BREVITY_GAIN (3)  ·  TONE_SHIFT (2)  ·  BREVITY_LOSS (2)
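
A cluster line like the one above amounts to a frequency count over one failure-mode label per test. The sketch below shows one way such a tally could be produced, assuming the judge emits a single label string per test; `format_clusters` is a hypothetical helper, not part of diffprompt.

```python
from collections import Counter


def format_clusters(labels):
    """Render failure-mode labels as 'NAME (count)' pairs, most frequent first."""
    counts = Counter(labels)
    return "  ·  ".join(f"{name} ({n})" for name, n in counts.most_common())


# Hypothetical labels for illustration only, not the labels from this run.
sample_labels = [
    "CONTEXT_LOSS", "BREVITY_GAIN", "CONTEXT_LOSS",
    "TONE_SHIFT", "BREVITY_GAIN", "CONTEXT_LOSS",
]
print(format_clusters(sample_labels))
```

Tests that receive no label would simply be absent from the tally, so the counts need not sum to the total number of tests.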

Verdict

DO NOT SHIP
Safe to ship v2 for emotional_state:neutral, request_type:troubleshooting. Keep v1 for emotional_state:appreciative, topic_complexity:general_inquiry. Primary failure mode: CONTEXT_LOSS (6 cases).