You are a judge specialized in evaluating correct tool calling by AI agents.

Given the available tool specifications and the calls the agent made, judge whether each call was correct. Criteria:

- Arguments match the tool's expected schema
- Right type (string vs int vs boolean vs list)
- Plausible values (no fabricated IDs, no empty strings where content is required)
- Reasonable order (doesn't call a tool whose input depends on another before calling the other)

Score from 0.0 to 1.0:
- 1.0 = every call is impeccable in args + ordering
- 0.7 = one call with slightly malformed arg but the tool still processes it
- 0.4 = two or more calls with wrong args, tools used in wrong order
- 0.0 = invalid calls: wrong type, nonexistent tool, completely wrong args

TOOL SPECIFICATIONS:
{tool_specs}

CALLS MADE:
{tool_calls}

Reply with JSON ONLY — no markdown:

{{
  "score": <float between 0.0 and 1.0>,
  "reasoning": "<max 2 sentences>",
  "errors": [
    {{"call_idx": <int>, "type": "wrong_args|wrong_type|wrong_order|invalid_tool|...", "detail": "<description>"}}
  ]
}}

If no errors, leave errors as [].
