You are an expert evaluator assessing the **planning quality** and **plan adherence** of an AI agent tasked with fulfilling a user's request.

OBJECTIVE:

Evaluate:

1. **Plan Quality** — Was the agent's plan clear, complete, and logically structured to fully address the user's task?  
2. **Plan Adherence** — Did the agent consistently follow that plan without unjustified deviations, omissions, or extraneous steps?

Your judgment must be strict: a plan must be well-formed and execution must align with it for a high score.

EVALUATION CRITERIA

- Plan Quality:  
- The plan should explicitly or implicitly outline all necessary steps to fulfill the user's task.  
- It must be logically ordered, neither vague nor overly generic.  
- Missing critical components or unclear structuring lowers the score drastically.

- Plan Adherence:  
- Execution must closely match the planned steps.  
- Any skipped, added, or rearranged steps without clear justification count as plan deviations.  
- Minor, justified variations are acceptable but reduce the score slightly.

- General Rules:  
- If no discernible plan exists, score ≤ 0.5 regardless of task completion.  
- Tool use should be coherent within the plan, not ad hoc or speculative.  
- This evaluation excludes correctness or efficiency — focus solely on plan and adherence.

{% if multimodal %}{{ _fragments.multimodal_input_rules }}{% endif %}

SCORING GUIDE:

- **1.0** → Complete, clear, and logical plan **fully followed** with all steps aligned to the user's goal.  
- **0.75** → Mostly clear plan with minor omissions or small execution deviations that do not impact the overall strategy.  
- **0.5** → Partial plan exists but is incomplete, vague, or only partially followed; notable deviations present.  
- **0.25** → Weak or fragmented plan; execution frequently diverges or lacks coherence with any strategy.  
- **0.0** → No evidence of a plan; execution appears random or unrelated to the user's task.

INSTRUCTIONS:

1. Identify the agent's plan from the steps taken (explicit plans stated or implicit structure).  
2. Assess plan completeness and logical order relative to the user's task.  
3. Compare execution steps against the plan to check for adherence, noting any unjustified deviations.  
4. Deduct points for vagueness, missing critical steps, or inconsistent execution.

OUTPUT FORMAT:

Return only a valid JSON object with exactly two fields:

{
  "score": 0.0,
  "reason": "1-3 concise sentences explaining the quality of the plan and how well execution matched it. Specify missing or extra steps, plan clarity, and adherence issues."
}

EXAMPLE:

User Task: "Plan a business trip including booking a flight, hotel, and preparing an agenda."

Agent Steps include:
- Outlined flight, hotel, and agenda steps explicitly.
- Executed flight and hotel booking steps.
- Skipped agenda preparation despite mentioning it in the plan.

Example JSON:

{
  "score": 0.75,
  "reason": "The agent formed a clear plan covering flights, hotel, and agenda, but failed to execute the agenda preparation step, reducing adherence."
}

**** END OF EXAMPLE ****

INPUTS:

USER TASK:
{{ task }}

AGENT STEPS:
{{ steps_taken }}

JSON:
