Metadata-Version: 2.4
Name: agentforce-probe
Version: 0.1.1
Summary: Run automated tests against Salesforce Agentforce agents (External + Internal Copilot) and score them into evidence — fully local, privacy-first, no API key required.
Author: Ray Kuo
License-Expression: MIT
Project-URL: Homepage, https://github.com/raykuonz/agentforce-probe
Project-URL: Repository, https://github.com/raykuonz/agentforce-probe
Project-URL: Bug Tracker, https://github.com/raykuonz/agentforce-probe/issues
Project-URL: Changelog, https://github.com/raykuonz/agentforce-probe/blob/main/CHANGELOG.md
Keywords: salesforce,agentforce,testing,qa,evaluation,llm,agent,copilot,llm-as-judge
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml>=5.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: pip-audit>=2.7; extra == "dev"
Requires-Dist: pre-commit>=3.5; extra == "dev"
Dynamic: license-file

# agentforce-probe

**A local, privacy-first CLI to run automated tests against Salesforce
Agentforce agents — and score the results into evidence.**

> **TL;DR** — Salesforce's Testing Center can score your *customer-facing*
> agents but **silently can't touch your employee-facing ones**.
> `agentforce-probe` tests **both** from one command and hands you a single
> evidence report. It runs entirely on your machine, sends nothing to third
> parties, and needs **no API key** to get started.

### Do I need to set anything up? (the 30-second version)

| You're testing… | What you need | headless API / ECA? |
| --- | --- | --- |
| **ExternalCopilot** (customer/service agents) | just an `sf`-authenticated org | ❌ none — Testing Center judges for you, zero secrets |
| **InternalCopilot** (employee agents) | the above **+ a one-time External Client App** (consumer key/secret in `.env`) | ✅ yes — this is the headless path Testing Center can't do |
| *Optional:* grade with a live LLM judge | an OpenAI/Anthropic API key | the default judge is a **no-key Claude Code handoff** |

So: **External agents work out of the box.** The only real setup is a one-time
ECA for the Internal path — and even then the judge needs no API key by default.
Full steps are in [Configure secrets](#configure-secrets-env).

`agentforce-probe` auto-detects the agent type and picks the right path:

- **ExternalCopilot** (customer/service agents) → drives
  `sf agent test create/run/results`. Salesforce **Testing Center** provides the
  LLM judge (`output_validation`) for you — no extra setup.
- **InternalCopilot** (employee agents) → **this is the tool's core value.**
  Testing Center *cannot* run employee agents, so `agentforce-probe` walks the
  headless path instead: **External Client App → Client Credentials mint (JWT) →
  Agent API headless session → one message per utterance → a configurable
  LLM-as-judge** scores each response.

Both paths emit **one unified evidence markdown report** (per case: utterance /
topic / agent response / each assertion), using the same assertion-filtering
rules.

## Why this exists

Salesforce's built-in Testing Center (`sf agent test`) only runs
**ExternalCopilot** agents — the customer-facing ones that have a Bot User to
impersonate. **InternalCopilot** (employee/internal) agents have no run-as Bot
User, so the Testing Center judge never fires and you simply cannot get an
automated test score for them through the supported tooling.

That's a real product gap. `agentforce-probe` closes it: for Internal agents it
bypasses the Testing Center and drives the **headless Agent API** directly,
replaying each utterance through a real session and grading the responses with
an LLM-as-judge. One command, one evidence report, both agent types.

## Privacy

Everything runs on your machine. The **only** outbound network calls are:

1. to the target Salesforce org (`sf` CLI + the Agent API), and
2. **(InternalCopilot path only, and only if you opt into a live API-key
   judge)** to the judge LLM you configure.

No telemetry, no third parties. Secrets (ECA consumer key/secret, judge API key)
are read from a gitignored `.env` (or env vars), held in memory only, and are
**never** printed, logged, written to evidence, or passed through a shell. Token
diagnostics only ever expose length + JWT segment count — never bytes.
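
As a rough illustration of shape-only token diagnostics, the sketch below reports a token's length and segment count without ever echoing its content. The function name `describe_token` and the returned keys are hypothetical, not the tool's actual API.

```python
def describe_token(token: str) -> dict:
    """Return shape-only diagnostics for a bearer token (hypothetical helper).

    Only the length and the number of dot-separated segments are exposed;
    the token bytes themselves never appear in the result.
    """
    segments = token.split(".")
    return {
        "length": len(token),
        "segments": len(segments),
        # A compact JWT has exactly three non-empty dot-separated segments.
        "looks_like_jwt": len(segments) == 3 and all(segments),
    }


if __name__ == "__main__":
    # Dummy value for illustration only; never pass a real token on a command line.
    print(describe_token("aaaa.bbbb.cccc"))
```

Anything derived from this dictionary is safe to log, because no part of the token can be reconstructed from it.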

## Verification status & known limitations

Read this before you trust a score in anger. The tool is deliberately honest
about what has and hasn't been validated.

### What's verified

- **Logic layer — fully tested.** 207 unit tests, 100% line coverage across all
  modules. Spec loading, assertion filtering, scoring, evidence rendering, the
  judge contract, token-shape validation, and the Agent API error ladder are all
  exercised — with the network and the `sf` CLI mocked.

### What's *not* yet verified (the real gap)

- **100% coverage is not the same as "proven against a live org."** Every test
  mocks the network and `sf`. The genuine end-to-end paths — `sf agent test`
  against a real **External** agent, and ECA mint → JWT → live **Agent API**
  session against a real **Internal** agent — have **not** been re-run against a
  live Salesforce org in this open-source extraction. The InternalCopilot
  gotchas baked into the code (opaque-token 404, 412 config errors,
  `bypassUser` handling) were learned from real-world use, but treat **your
  first live run as the first true end-to-end validation** and sanity-check the
  evidence by hand.

### Known limitations

- **Internal path needs a one-time manual UI step.** The External Client App
  must have `isNamedUserJwtEnabled` **on**, or the mint returns an opaque token
  and the session endpoint 404s. The tool *detects and reports* this, but cannot
  fix it for you — see the ECA prerequisite below. This is the most common place
  to get stuck.
- **Agent-type detection relies on a live org query** (`BotDefinition.Type`). If
  your org's metadata shape differs, auto-detection can misfire; override with
  `--force-type internal|external` (and `--bot-id` for the Internal path).
- **The `handoff` judge is an LLM, so verdicts are not perfectly reproducible.**
  Two graders (or the same grader twice) may disagree on a borderline case. The
  score is a well-evidenced judgment, not a deterministic measurement, so always
  read the captured agent responses rather than rubber-stamping the score.
- **Single-turn only.** Each utterance runs in its own fresh session; the tool
  does **not** test multi-turn context or memory.
- **`endSession` is best-effort** and silently ignores failures, so an
  unreachable org could leave a dangling session server-side (low risk, no
  effect on the score).
- **`--from-results` accepts External-shaped payloads only** (offline re-scoring
  of `sf agent test results`); there's no offline replay for the Internal path.

## Install

From [PyPI](https://pypi.org/project/agentforce-probe/):

```bash
pip install agentforce-probe
```

This installs the `agentforce-probe` console command. You can also run it as a
module:

```bash
agentforce-probe --help
python3 -m agentforce_probe --help
```

The only runtime dependency is `pyyaml`.

To install from source instead (e.g. to track `main` or hack on it):

```bash
git clone https://github.com/raykuonz/agentforce-probe
cd agentforce-probe
pip install -e .
```

### Install the Claude skill (no clone needed)

This repo ships a [Claude Code](https://docs.anthropic.com/claude/docs/agent-skills)
skill (`probe-agentforce-agents`) that teaches an agent when and how to drive
`agentforce-probe`. Install it into your agent in one command with
[`vercel-labs/skills`](https://github.com/vercel-labs/skills) — no clone, no
install, just `npx`:

```bash
# Preview the skill without installing
npx skills add raykuonz/agentforce-probe --list

# Install it globally into Claude Code
npx skills add raykuonz/agentforce-probe -g -a claude-code -y
```

It also works with Cursor, Codex, OpenCode, and
[50+ other agents](https://github.com/vercel-labs/skills#supported-agents) — drop
the `-a claude-code` flag to pick interactively. The skill assumes the
`agentforce-probe` CLI is installed (see above).

## Configure secrets (`.env`)

Only the **InternalCopilot** path needs secrets. Copy the template into the
directory you run `agentforce-probe` from and fill it in (the file is
gitignored):

```bash
cp .env.example .env
# then edit .env:
#   AGENTPROBE_SF_CONSUMER_KEY=...      (Internal path: ECA consumer key)
#   AGENTPROBE_SF_CONSUMER_SECRET=...   (Internal path: ECA consumer secret)
#   AGENTPROBE_ANTHROPIC_API_KEY=...    (only if you use a live API-key judge)
#   AGENTPROBE_OPENAI_API_KEY=...       (only if you use a live API-key judge)
```

Environment variables take precedence over `.env`. The ExternalCopilot path
needs **none** of these (Testing Center judges for you). You can also point at a
specific file with `AGENTPROBE_ENV_FILE=/path/to/.env`.
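
The precedence rule can be sketched as follows. This is a minimal illustration, not the code in `config.py`: the helper names (`env_file_path`, `parse_env_file`, `resolve_setting`) are hypothetical, and defaulting to `.env` in the current directory is an assumption based on the copy step above.

```python
import os
from pathlib import Path


def env_file_path(environ=None) -> Path:
    """Pick the .env location; AGENTPROBE_ENV_FILE overrides the assumed default."""
    environ = os.environ if environ is None else environ
    return Path(environ.get("AGENTPROBE_ENV_FILE", ".env"))


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name: str, file_values: dict, environ=None):
    """Environment variables win; the .env file is only a fallback."""
    environ = os.environ if environ is None else environ
    if name in environ:
        return environ[name]
    return file_values.get(name)


if __name__ == "__main__":
    path = env_file_path()
    file_values = parse_env_file(path.read_text()) if path.exists() else {}
    key = resolve_setting("AGENTPROBE_SF_CONSUMER_KEY", file_values)
    # Report presence only, never the value itself.
    print("consumer key:", "present" if key else "absent")
```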

### Prerequisite for the Internal path — the External Client App

The InternalCopilot path needs an **External Client App (ECA)** configured for
the Client Credentials flow. To get its consumer key/secret:

1. **Setup → App Manager** (or **External Client App Manager**).
2. Find your ECA → row dropdown → **View / Manage Consumer Details** (you may be
   asked to verify your identity).
3. Copy the **Consumer Key** and **Consumer Secret** into `.env`.
4. Confirm the ECA has Client Credentials enabled, a Run-As user
   (`clientCredentialsFlowUser`), and **`isNamedUserJwtEnabled` ON** — otherwise
   the mint returns an *opaque* token instead of a JWT and the Agent API session
   endpoint 404s. (`agentforce-probe` detects this and tells you.)

That's the only step that requires the Salesforce UI. Everything else is CLI.

## Usage

### `doctor` — preflight (local + read-only)

```bash
agentforce-probe doctor --org my-org
```

Reports whether `sf` is installed, whether the org connects, whether External
Client Apps are present, whether ECA secrets and judge keys are configured, and
where `.env` lives. It never spends Einstein credits, and secrets are shown only
as present/absent.

### Run an ExternalCopilot agent (Testing Center)

```bash
agentforce-probe run \
  --org my-org \
  --agent Support_Concierge \
  --spec examples/specs/Support_Concierge-testSpec.yaml \
  --out support_concierge-evidence.md
```

### Run an InternalCopilot agent (headless Agent API + judge)

```bash
agentforce-probe run \
  --org my-org \
  --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --out it_helpdesk-evidence.md
  # --judge handoff is the default (grade with Claude Code, no API key)
```

`--judge` selects the Internal-path judge:

- **`handoff` (default)** — *no API key needed.* Grade with Claude Code via a
  file-handoff protocol. See [Judge via Claude Code](#judge-via-claude-code-no-api-key-needed).
- `openai:<model>` / `anthropic:<model>` — grade live in one step using a raw
  LLM API key from `.env`.
- `mock` — offline heuristic (no network), for dry runs / smoke tests.
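
One way to read a `--judge` value of this shape is sketched below. The function `parse_judge` and its error messages are hypothetical; the real argument handling in `cli.py` / `judge.py` may differ.

```python
def parse_judge(value: str = "handoff") -> tuple:
    """Split a --judge value into (provider, model); model is None when unused."""
    provider, sep, model = value.partition(":")
    if provider in ("handoff", "mock"):
        if sep:
            raise ValueError(f"'{provider}' does not take a model")
        return provider, None
    if provider in ("openai", "anthropic"):
        if not model:
            raise ValueError(f"'{provider}' needs a model, e.g. {provider}:<model>")
        return provider, model
    raise ValueError(f"unknown judge: {value!r}")


if __name__ == "__main__":
    print(parse_judge())                     # default judge
    print(parse_judge("openai:some-model"))  # placeholder model name
```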

> ⚠️ Running a real test (`sf agent test run` or a live Internal Agent API
> session) **spends Einstein credits**. `doctor`, `--dry-run`, `--from-results`,
> and `--from-verdicts` are all free / offline.

## Judge via Claude Code (no API key needed)

The InternalCopilot path needs an LLM to grade each agent response PASS/FAIL.
If your team has **Claude Code** (or a similar coding agent) open in the editor
but **no raw LLM API key**, use the default `handoff` judge — a three-step file
protocol where Claude Code *is* the judge runtime and `agentforce-probe` just
defines the contract. No secret ever leaves your machine; the handoff files
contain only test data.

**Step ① — produce the judge task package** (replays the agent; contacts no LLM):

```bash
agentforce-probe run \
  --org my-org --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --out it_helpdesk-evidence.md            # --judge handoff is the default
```

This mints the token, opens the headless Agent API session, sends every
utterance, captures `response` / `topic` / `invokedActions`, and writes two
files next to `--out`, then exits:

- `IT_Helpdesk_Assistant-judge-task.json` — the grading materials (schema below).
- `IT_Helpdesk_Assistant-JUDGING.md` — a block you paste into Claude Code.

**Step ② — grade in Claude Code.** Open Claude Code in this repo and paste the
block from `*-JUDGING.md`. It instructs Claude Code to read the task package,
apply the rubric, and write `*-judge-verdicts.json` (verdict is strictly
`PASS`/`FAIL`, one entry per case id, no skips).

**Step ③ — collect the verdicts into evidence** (offline; no org/LLM call):

```bash
agentforce-probe run \
  --org my-org --agent IT_Helpdesk_Assistant \
  --spec examples/specs/IT_Helpdesk_Assistant-testSpec.yaml \
  --from-verdicts IT_Helpdesk_Assistant-judge-verdicts.json \
  --out it_helpdesk-evidence.md
```

`agentforce-probe` reads the verdicts back, aligns them to the task package by
`id`, recomputes topic/actions (from the recorded live values + the spec), uses
each verdict as the `output` signal, applies the **same** assertion-filtering
rules, and writes the unified evidence markdown. It validates that every case id
has a verdict (missing ids = error) and that each verdict is `PASS`/`FAIL`.
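
The alignment and validation described above can be sketched roughly like this. It assumes the two payloads follow the schemas shown in the next section; `align_verdicts` is a hypothetical name, not the tool's actual function.

```python
VALID_VERDICTS = {"PASS", "FAIL"}


def align_verdicts(task: dict, verdicts: dict) -> dict:
    """Map each case id in the task package to its verdict entry.

    Raises ValueError if a verdict is not PASS/FAIL or a case id has no verdict.
    """
    by_id = {}
    for entry in verdicts["verdicts"]:
        if entry["verdict"] not in VALID_VERDICTS:
            raise ValueError(f"case {entry['id']}: verdict must be PASS or FAIL")
        by_id[entry["id"]] = entry

    case_ids = [case["id"] for case in task["cases"]]
    missing = [case_id for case_id in case_ids if case_id not in by_id]
    if missing:
        raise ValueError(f"missing verdicts for case ids: {missing}")

    # Preserve the task package's case order in the result.
    return {case_id: by_id[case_id] for case_id in case_ids}


if __name__ == "__main__":
    task = {"cases": [{"id": 1}, {"id": 2}]}
    verdicts = {"verdicts": [
        {"id": 2, "verdict": "FAIL", "reason": "..."},
        {"id": 1, "verdict": "PASS", "reason": "..."},
    ]}
    print(list(align_verdicts(task, verdicts)))
```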

### Schemas

**`<agent>-judge-task.json`** (`agentforce-probe/judge-task@1`):

```json
{
  "schema": "agentforce-probe/judge-task@1",
  "agent": "IT_Helpdesk_Assistant", "org": "my-org",
  "rubric": "<strict-QA-grader rubric>",
  "instructions": "For each case, decide if actual_response satisfies expected_outcome. Write {id,verdict,reason} for every case into the verdicts file. Do not skip any case.",
  "cases": [
    {"id": 1, "utterance": "...", "expected_outcome": "...",
     "actual_response": "...", "actual_topic": "...", "actual_actions": ["..."]}
  ]
}
```

**`<agent>-judge-verdicts.json`** (`agentforce-probe/judge-verdicts@1`):

```json
{"schema": "agentforce-probe/judge-verdicts@1",
 "agent": "IT_Helpdesk_Assistant",
 "verdicts": [{"id": 1, "verdict": "PASS", "reason": "..."}]}
```

All handoff files (`*-judge-task.json`, `*-judge-verdicts.json`, `*-JUDGING.md`)
are run artifacts (test data) and are gitignored.

## Test spec format (`*.yaml`)

```yaml
name: "My Suite"
subjectType: AGENT
subjectName: IT_Helpdesk_Assistant
testCases:
  - utterance: "..."             # required
    expectedTopic: account_help  # optional (= subagent / topic name)
    expectedActions: [foo, bar]  # optional (Level-2 invocation names)
    expectedOutcome: "..."       # used by the judge; almost always present
```

See [`examples/specs/`](examples/specs/) for complete, runnable examples (using
fictional demo data).

## Scoring rules (assertion filtering)

- `topic_assertion` is scored **only** if the case declares `expectedTopic`.
- `actions_assertion` is scored **only** if the case declares `expectedActions`.
- `output_validation` (LLM-as-judge) is the **primary** behavioral signal and is
  scored for every case.
- A dimension with no declared expectation renders as `-` and never counts
  against the score.

> A `topic` FAIL with an `output` PASS usually means the agent behaved correctly
> even though single-turn routing picked a semantically adjacent topic — look at
> the primary `output` signal first.
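
The filtering rules above can be sketched as a small function. This is an illustration only: `score_case` and `tally` are hypothetical names, and the way `tally` aggregates a score (PASS cells over scored cells) is an assumption, not necessarily what `scorer.py` does.

```python
def score_case(case: dict, observed: dict) -> dict:
    """Apply assertion filtering to one test case.

    `case` is one `testCases` entry from the spec; `observed` maps each
    dimension ("topic", "actions", "output") to True/False.
    """
    def mark(passed: bool) -> str:
        return "PASS" if passed else "FAIL"

    return {
        # Scored only when the case declares an expectation; otherwise "-".
        "topic_assertion": mark(observed["topic"]) if "expectedTopic" in case else "-",
        "actions_assertion": mark(observed["actions"]) if "expectedActions" in case else "-",
        # The judge's output verdict is scored for every case.
        "output_validation": mark(observed["output"]),
    }


def tally(rows: list) -> tuple:
    """Count (passed, scored) cells, ignoring "-" so it never hurts the score."""
    scored = [cell for row in rows for cell in row.values() if cell != "-"]
    return scored.count("PASS"), len(scored)


if __name__ == "__main__":
    case = {"utterance": "...", "expectedTopic": "account_help", "expectedOutcome": "..."}
    row = score_case(case, {"topic": False, "output": True})
    print(row, tally([row]))
```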

## Module layout

| file | responsibility |
|---|---|
| `cli.py` | argparse entrypoint; dispatches `run` / `doctor` |
| `config.py` | reads secrets from `.env` / env; never exposes values |
| `doctor.py` | local + read-only preflight checks |
| `agent_meta.py` | resolves `BotDefinition.Type`/Id (Internal vs External) |
| `sf_external.py` | ExternalCopilot path via `sf agent test` |
| `agent_api.py` | InternalCopilot mint + headless Agent API (urllib, token-safe) |
| `sf_internal.py` | Internal path orchestration (session → judge → score) |
| `judge.py` | configurable judge: `handoff` (default), live `openai`/`anthropic`, offline `mock` |
| `scorer.py` | spec loading + assertion-filtering scorer |
| `evidence.py` | unified evidence markdown generator |
| `sfcli.py` | `sf` CLI wrapper + banner-tolerant JSON parsing |
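
"Banner-tolerant" parsing means a JSON document can still be recovered when a CLI prints warning text before it. One possible approach is sketched below; it is not necessarily how `sfcli.py` does it, and `parse_json_with_banner` is a hypothetical name. It assumes the payload is a JSON object.

```python
import json


def parse_json_with_banner(stdout: str) -> dict:
    """Return the first JSON object found in `stdout`, skipping leading banner text."""
    decoder = json.JSONDecoder()
    for index, char in enumerate(stdout):
        if char != "{":
            continue
        try:
            # raw_decode parses one document starting at `index` and
            # tolerates trailing text after it.
            value, _end = decoder.raw_decode(stdout, index)
        except json.JSONDecodeError:
            continue  # a stray "{" inside the banner; keep scanning
        return value
    raise ValueError("no JSON object found in CLI output")


if __name__ == "__main__":
    noisy = 'Warning: a new CLI version is available\n{"status": 0, "result": {"ok": true}}\n'
    print(parse_json_with_banner(noisy))
```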

## InternalCopilot Agent API — gotchas baked in

These are battle-tested; the code enforces them so you don't re-learn them:

1. **Mint:** `grant_type=client_credentials` → `{instance}/services/oauth2/token`;
   read `access_token` **and** `api_instance_url`.
2. **Token must be a JWT** (~1700 chars, three dot-separated segments). An opaque
   token → 404 → `isNamedUserJwtEnabled` is off. The tool refuses to proceed on an
   opaque token.
3. **Host = `api_instance_url`** from the mint response (sandbox/scratch =
   `https://test.api.salesforce.com`). Never hardcoded.
4. **Session:** `POST .../einstein/ai-agent/v1/agents/{0Xx...}/sessions` with
   `bypassUser:false` (true → 400 "Invalid user ID"). Run-as = the ECA's
   `clientCredentialsFlowUser`; no `userId` in the body.
5. **Message:** `POST .../sessions/{id}/messages` with
   `{"message":{"sequenceId":N,"type":"Text","text":"..."}}`, N increments.
6. **Error ladder:** 404 empty = wrong host / opaque token; 400 "Invalid user ID"
   = use `bypassUser:false`; 412 "Invalid Config" = auth OK but planner config
   broken (usually an action missing its `inputs` block).
7. **Bearer hygiene:** the auth header is built at runtime from an in-memory
   variable (never a source literal, never `echo`'d) to dodge both shell-quoting
   and log-redaction traps.
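
Items 5 and 6 can be illustrated with a short sketch: building the message body and turning an HTTP failure into a hint. The helper names and the hint wording are hypothetical, and the status/body matching is a simplification of the ladder above rather than the code in `agent_api.py`.

```python
import json


def build_message_body(sequence_id: int, text: str) -> bytes:
    """Encode one utterance in the message shape listed in item 5."""
    payload = {"message": {"sequenceId": sequence_id, "type": "Text", "text": text}}
    return json.dumps(payload).encode("utf-8")


def explain_error(status: int, body: str) -> str:
    """Map an Agent API failure to a likely cause, following the error ladder."""
    if status == 404 and not body.strip():
        return "wrong host or opaque token: check api_instance_url and isNamedUserJwtEnabled"
    if status == 400 and "Invalid user ID" in body:
        return "open the session with bypassUser:false"
    if status == 412 and "Invalid Config" in body:
        return "auth OK but planner config broken (usually an action missing its inputs block)"
    return f"unexpected HTTP {status}"


if __name__ == "__main__":
    # The sequenceId increments with each message sent in a session.
    for sequence_id, utterance in enumerate(["first message", "second message"], start=1):
        print(build_message_body(sequence_id, utterance).decode("utf-8"))
    print(explain_error(404, ""))
```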

## Development

```bash
pip install -e ".[dev]"
pre-commit install   # gate every commit/push on the same checks CI runs
pytest               # run the test suite (pure logic; no network, no secrets)
ruff check .         # lint
```

### Pre-commit / pre-push gate

The repo ships a [`pre-commit`](https://pre-commit.com) config with `local`
hooks (no external hook repos, works offline). After `pre-commit install`:

- **on every commit** — a privacy/hygiene scan (`scripts/check-secrets.sh`:
  no secrets, JWTs, org IDs, customer data, agent footprints, or run artifacts)
  plus `ruff check` and `ruff format --check`.
- **on every push** — the full `pytest` suite with the 100% coverage gate.

So a commit that would leak a secret, or a push that would break a test or drop
coverage, is blocked locally before it ever reaches GitHub. CI re-runs the same
checks, so green-local means green-pipeline.

## License

[MIT](LICENSE).
