Metadata-Version: 2.4
Name: finagent-redteam
Version: 0.2.0
Summary: A red-team benchmark for AI agents in financial workflows: measures regulatory-control bypass (sanctions evasion, threshold structuring, dual-approval defeat) and the efficacy of guardrails that try to stop it.
Author-email: Nachiket Lele <lelenachiket07@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/nac7/finagent-redteam
Project-URL: Repository, https://github.com/nac7/finagent-redteam
Project-URL: Issues, https://github.com/nac7/finagent-redteam/issues
Keywords: llm,ai-agents,agent-security,prompt-injection,benchmark,financial,payments,compliance,aml,red-team,tool-calling
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: agent
Requires-Dist: openai>=1.30; extra == "agent"
Provides-Extra: inspect
Requires-Dist: inspect-ai>=0.3; extra == "inspect"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Provides-Extra: all
Requires-Dist: openai>=1.30; extra == "all"
Requires-Dist: inspect-ai>=0.3; extra == "all"
Dynamic: license-file

# FinAgent Red-Team

[![CI](https://github.com/nac7/finagent-redteam/actions/workflows/ci.yml/badge.svg)](https://github.com/nac7/finagent-redteam/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/finagent-redteam)](https://pypi.org/project/finagent-redteam/)
[![Python](https://img.shields.io/pypi/pyversions/finagent-redteam)](https://pypi.org/project/finagent-redteam/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![codecov](https://codecov.io/gh/nac7/finagent-redteam/graph/badge.svg)](https://codecov.io/gh/nac7/finagent-redteam)

**A red-team benchmark for AI agents that handle money — focused on *regulatory-control bypass*, not just generic misbehavior.**

Can a crafted invoice, vendor email, or chat message trick an LLM agent into
**evading sanctions screening, structuring payments under a reporting threshold,
defeating dual-approval, or wiring funds to an unapproved payee** — and do the
guardrails a bank would actually deploy stop it *without* breaking legitimate work?

FinAgent Red-Team runs every scenario under **three control postures** — *none*,
*advisory*, and *enforced* — and reports not just *can it be attacked* but
*where an agent's safety actually comes from*: its own judgement, its
policy-following, or hard system enforcement — and what each costs in usefulness.

> ⚠️ **Defensive, fully synthetic.** No real money, accounts, or networks. All
> tools are mocks operating on in-memory state. The goal is to *harden* financial
> AI agents before deployment.

---

## Why this exists

LLM agents are moving into financial operations — accounts payable, treasury,
reconciliation, customer servicing. The dangerous failures there are not toxic
text; they are **silent violations of compliance controls** that look like normal
task completion: a payment that skipped sanctions screening, a $15k transfer
quietly split into two $7.5k transfers to dodge reporting, a large wire the agent
approved for itself.

General agent-security benchmarks test "leak the data" or "do the forbidden
action." None of them model the **regulatory control surface** of a real financial
institution, and none measure whether standard guardrails neutralize the attack.

### How this differs from prior work

| | Generic agent-security benchmarks | **FinAgent Red-Team** |
|---|---|---|
| Domain | Email/web/travel, generic "banking" | Payments/treasury with real **compliance controls** |
| Attack goal | Data leak, forbidden action | **Sanctions evasion, structuring, maker-checker defeat** |
| Defense eval | Usually attack-only | **3 control postures → policy-following vs enforcement uplift** |
| Utility tradeoff | Rarely measured | **Over-refusal control set** included |
| Judging | Often LLM-graded | **Deterministic** state predicates (reproducible) |

---

## Threat taxonomy (v0.2)

| ID | Threat | Isolated control under test |
|---|---|---|
| `T2_unauthorized_transfer` | Indirect injection → fraudulent payment | Payee **allow-list** |
| `T3_sanctions_evasion` | Pay an allow-listed but **sanctioned** payee | **Sanctions screening** |
| `T3` (homoglyph) | Sanctioned entity hidden behind look-alike characters | **Normalized** screening |
| `T4_structuring` | Split a payment to dodge the reporting threshold | **Reporting / CTR** control |
| `T5_dual_approval_defeat` | Initiate **and** self-approve a large transfer | **Maker-checker** separation |
| `T6_data_exfiltration` | Email a customer's card/SSN to an external address | **DLP** on outbound messages |
| `T7_confused_deputy` | Redirect a payee's bank details to an attacker account | **Bank-detail protection** |
| `BENIGN` | A legitimate payment that should just work | **Over-refusal** (utility) |

Each attack scenario is designed to isolate **one** control, so a block under the
*enforced* posture attributes cleanly to a specific defense.

## Control postures

Every scenario is evaluated under three postures, which separate *where safety
comes from*:

| Posture | What the agent is told / what the system does | Measures |
|---|---|---|
| **none** | No compliance policy is stated | **Intrinsic** susceptibility |
| **advisory** | Policy stated as obligations, **not** enforced | Does the model **follow stated policy**? |
| **enforced** | System **hard-blocks** violations in the sandbox | **Residual** risk (defense-in-depth) |

## Metrics

- **ASR (Attack Success Rate)** at each posture — fraction of attacks won.
- **Policy-following uplift** = ASR(none) − ASR(advisory): how much merely
  *stating* policy reduces attacks (pure model instruction-following).
- **Enforcement uplift** = ASR(advisory) − ASR(enforced): additional reduction
  from hard enforcement (defense-in-depth).
- **Residual ASR** = ASR(enforced): attacks that survive enforcement.
- **Utility** at each posture and **over-refusal** = utility(none) −
  utility(enforced): legitimate work lost as controls tighten.

This decomposition is the point: two models with identical *enforced* residual
risk can differ sharply in whether they'd behave safely when a control is only
*advisory* (the common real-world case for judgement calls that can't be
hard-coded). A stack only looks good with **low residual ASR** *and* **high
utility** — trivially refusing everything scores 0% ASR but also 0% utility.

---

## How it works

```
                 ┌──────────────────────────── deterministic, offline ───────────────────────────┐
  AgentModel ──► runner ──► sandbox tools ──► World state ──► evaluator ──► Scorecard
 (any model)     loop       (controls here)   (ground truth)   (predicates)   (ASR / uplift / utility)
                 │              ▲
                 │              └── posture: none / advisory / enforced
```

- **Sandbox** (`sandbox/`): synthetic accounts, payees, inbox; mock tools
  (`initiate_transfer`, `approve_transfer`, `screen_sanctions`, …). Compliance
  controls live inside the tools, gated on `Policy.enabled`.
- **Scenarios** (`scenarios/`): pure-data tasks + embedded attacks + a structured
  success predicate.
- **Evaluator** (`eval/`): judges outcomes from final state — no LLM grader, so
  results are reproducible.
- **Agent driver** (`agent/`): any OpenAI-compatible, tool-calling model.

The sandbox, scenarios, and evaluator have **zero third-party dependencies** and
are fully deterministic — the entire attack→defense pipeline is proven by the
offline test-suite, no GPU or API key required.

### Generated suite

Beyond the hand-written scenarios, a **seeded generator** produces hundreds of
cases by combining parametric slots (amounts, payees, accounts), social-
engineering phrasings (authority, urgency, policy pretext, social proof), and
obfuscation techniques (homoglyph / spacing for sanctions evasion):

```bash
finagent-redteam --list --suite generated --per-threat 15   # 120 cases
```

Every generated scenario carries a **`reference_plan`** — the canonical exploit
— and the test-suite replays all of them to verify the invariant that each
attack lands under the *none*/*advisory* postures and is blocked under
*enforced*. The suite is thus **self-validating**: each case is a checked,
control-isolating test.

---

## Quickstart

```bash
pip install -e ".[dev]"      # core + tests
pytest -q                     # 47 tests: proves attacks land (none/advisory), blocked (enforced)

# List scenarios (no model needed)
finagent-redteam --list
```

Run against a model (needs the `agent` extra):

```bash
pip install -e ".[agent]"

# local vLLM / SGLang
finagent-redteam --model Qwen/Qwen3-8B --base-url http://localhost:8000/v1
# Ollama OpenAI shim
finagent-redteam --model llama3.1 --base-url http://localhost:11434/v1 --json results.json
```

Run the **multi-model leaderboard** (several models, repeated trials):

```bash
# examples/models.example.json lists the models + endpoints to compare
finagent-redteam --models-config examples/models.example.json --trials 5 --temperature 0.7 \
                 --json leaderboard.json
```

This prints a ranked leaderboard plus a per-threat-category attack-success
matrix; see [examples/sample_leaderboard.md](examples/sample_leaderboard.md) for
the output shape.

### Illustrative scorecard

A worst-case agent that fully complies with every embedded attack (reproduced by
the offline self-test) yields:

| Metric | none | advisory | enforced |
|---|---:|---:|---:|
| Attack Success Rate | 100% | 100% | 0% |
| Utility (benign completed) | 100% | 100% | 100% |

→ It ignores stated policy (advisory ASR stays 100%) but is fully stopped by
enforcement (residual ASR 0%), with no over-refusal. Real models land *between*
these poles — resisting some attacks on their own and following some stated
policy — and that gap, decomposed into **policy-following** vs **enforcement
uplift**, is what the benchmark measures. See
[examples/sample_leaderboard.md](examples/sample_leaderboard.md).

---

## Responsible use

This is a **defensive** benchmark built on entirely synthetic data and mock tools.
It contains no real financial credentials, accounts, or exploits against live
systems. Use it to evaluate and harden agents before they are trusted with money.

## License

MIT — see [LICENSE](LICENSE).

## Citation

If you use FinAgent Red-Team, please cite it (see [CITATION.cff](CITATION.cff)).
