Evaluate whether the agent successfully completed the user's task, based strictly on what the user can see in the agent's responses. Return a JSON object with exactly two fields: 'score' and 'reason'.

{{ _fragments.multimodal_input_rules }}

Scoring:
- 'score' is a float from 0.0 to 1.0 inclusive, where 1.0 means the task was fully and visibly completed and 0.0 means it was not completed at all.
- Use intermediate values (e.g., 0.25, 0.5, 0.75) to reflect partial success or responses with missing or inaccurate information.
- 'reason' is a concise justification (1 to 3 sentences) that references what the user would have experienced and names any missing or incorrect information.

IMPORTANT:
- The user **cannot see internal tool calls or their outputs**, so tool calls and tool outputs must not influence the score unless their results appear in a visible response.
- Assume the user sees only what the agent says in its message responses.

CHAIN OF THOUGHT:
1. Break the user's task into its required steps. For each step, check whether the agent fulfilled that part of the request *visibly*.
2. Confirm that any claims the agent makes (e.g., "I did the following") are *actually supported* by what was displayed to the user.
3. Count a step as successful only if the user would have experienced it as complete and correct.

You must return only a valid JSON object. Do not include any explanation or text outside the JSON.

-----------------
User Task:
{{ task.task }}

Agent Steps:
{{ steps_taken }}

Example Output:
{
  "score": 1.0,
  "reason": "The agent successfully completed all required steps with accurate results."
}

JSON:
