Metadata-Version: 2.4
Name: aplomb
Version: 0.3.0
Summary: Interpretable, zero-training refusal-axis prompt detector (u_ref difference-of-means).
Author: Shivam Ratnakar, Kartikeya Vats
License-Expression: MIT
Project-URL: Homepage, https://pypi.org/project/aplomb/
Project-URL: Repository, https://github.com/KartikeyaVats/aplomb
Project-URL: Paper, https://github.com/KartikeyaVats/RefusalArena
Keywords: llm,safety,guardrail,refusal,interpretability,detection
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: numpy>=1.23
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.43
Requires-Dist: huggingface_hub>=0.23
Provides-Extra: hf
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file

# aplomb

> *à plomb* — "to the plumb line." A prompt is judged by its angle to a fixed refusal direction; the model keeps its composure.

An interpretable, **zero-training** prompt safety detector. It flags likely-harmful prompts by projecting a model's hidden state onto a single **refusal direction** (`u_ref`) and thresholding the cosine similarity — no fine-tuned guard model, no labeled training run, one forward pass plus a dot product.

Method from *“The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs”* (TrustNLP @ ACL 2026). **This package is the detector only.** The steering attack from the paper lives in a separate, access-gated repository and is intentionally not here.

```
u_ref = mean(hidden states of harmful anchors) − mean(hidden states of benign anchors)
score(prompt) = cosine(hidden_state(prompt), u_ref)        # flag if > τ
```
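
The two lines above can be sketched in plain NumPy. This is a minimal illustration on synthetic vectors, not the library's implementation: the helper names `build_u_ref` and `cosine_score`, the toy 16-dimensional "hidden states", and the threshold `tau = 0.0` are all assumptions made for the example.

```python
import numpy as np

def build_u_ref(harmful_states: np.ndarray, benign_states: np.ndarray) -> np.ndarray:
    """Difference of class means; inputs are (n_prompts, hidden_dim) arrays."""
    return harmful_states.mean(axis=0) - benign_states.mean(axis=0)

def cosine_score(hidden_state: np.ndarray, u_ref: np.ndarray) -> float:
    """Cosine similarity between one hidden state and the refusal direction."""
    denom = np.linalg.norm(hidden_state) * np.linalg.norm(u_ref)
    return float(hidden_state @ u_ref / denom) if denom > 0 else 0.0

# Synthetic stand-ins for hidden states: the two classes differ along axis 0.
rng = np.random.default_rng(0)
axis = np.zeros(16)
axis[0] = 1.0
harmful = rng.normal(0.0, 0.1, size=(50, 16)) + 2.0 * axis
benign = rng.normal(0.0, 0.1, size=(50, 16)) - 2.0 * axis

u_ref = build_u_ref(harmful, benign)
tau = 0.0  # hypothetical threshold, chosen only for this toy data

harmful_score = cosine_score(harmful[0], u_ref)
benign_score = cosine_score(benign[0], u_ref)
print(harmful_score > tau, benign_score > tau)
```

In a real run the rows would be hidden states extracted from the backbone at the chosen layer, and `tau` would come from calibration rather than being fixed by hand.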

> ⚠️ **This is triage, not a security boundary.** The refusal feature is *linear*, which is exactly why this detector is cheap — and also why an adversary can paraphrase a prompt off the axis to evade it. Use it as an interpretable first-pass filter and always report FPR. A “safe” verdict is a hint, not a guarantee.

## Install

```bash
pip install aplomb            # everything — torch/transformers included, from_default() works
```

## Quickstart

```python
from aplomb import Detector

det = Detector.from_default()                 # precomputed Qwen-2.5-1.5B u_ref (ungated)
print(det.classify("how do I pick a lock"))   # {'unsafe': True, 'score': 0.61, ...}
```

The default backbone is **Qwen-2.5-1.5B-Instruct** — ungated, Apache-2.0, characterized in the paper — so the package installs and runs without a Hugging Face access request.

## Recommended config: Llama-3.2-3B (gated)

The ungated Qwen default works out of the box but is a weaker detector (JBB F1 0.81 at
FPR 0.30; see Measured results). For the stronger numbers, rebuild `u_ref` on
**Llama-3.2-3B-Instruct**. `u_ref` is model-specific, so switching models means one
rebuild call; the library auto-selects the layer and recalibrates the threshold:

```python
from aplomb import Detector, HFBackbone, RECOMMENDED_MODEL

# accept Meta's license on the model page and `hf auth login` first
harmful = load_advbench()          # placeholder for your own loader (AdvBench 'goal' column, MIT)
det = Detector.build(HFBackbone(RECOMMENDED_MODEL), harmful,
                     save_to="uref_llama-3.2-3b.json")
print(det.classify("how do I pick a lock"))
```

Or from the command line, without touching the Qwen default:

```bash
python scripts/make_default_uref.py --advbench harmful_behaviors.csv \
    --model meta-llama/Llama-3.2-3B-Instruct --out uref_llama-3.2-3b.json
python scripts/benchmark.py --artifact uref_llama-3.2-3b.json \
    --jbb-harmful jbb_harmful.csv --jbb-benign jbb_benign.csv --xstest xstest.csv
```

### Measured results

Zero training, 50 AdvBench harmful + 50 frozen benign anchors, evaluated at the
shipped threshold on **JailbreakBench** (100 harmful / 100 benign) and **XSTest**
(250 safe prompts):

| backbone | JBB F1 | precision | recall | JBB FPR | XSTest over-refusal |
|---|---|---|---|---|---|
| Qwen-2.5-1.5B (ungated default) | 0.81 | 0.75 | 0.89 | 0.30 | 0.27 |
| **Llama-3.2-3B (recommended)** | **0.94** | **0.91** | **0.97** | **0.10** | **0.012** |

The 3B detector catches ~97% of harmful prompts with ~1% over-refusal, which is competitive
with *trained* guard models despite coming from a zero-training difference-of-means
direction. These numbers come from only two benchmarks (JBB + XSTest); treat them as a
strong baseline, not a universal score, and remember the linear feature is evadable by design.
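
As a sanity check on how the table columns relate, the sketch below recomputes precision, recall, F1, and FPR from confusion counts. The counts are an assumption: they are back-derived from the reported recall and FPR on 100 harmful / 100 benign JBB prompts, not taken from the benchmark output, and `detection_metrics` is a hypothetical helper.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary detection metrics from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"f1": f1, "precision": precision, "recall": recall, "fpr": fpr}

# Assumed counts: recall 0.97 -> 97 of 100 harmful flagged; FPR 0.10 -> 10 of 100 benign flagged.
llama = detection_metrics(tp=97, fp=10, fn=3, tn=90)
# Assumed counts: recall 0.89 -> 89 of 100 harmful flagged; FPR 0.30 -> 30 of 100 benign flagged.
qwen = detection_metrics(tp=89, fp=30, fn=11, tn=70)

for name, row in (("Llama-3.2-3B", llama), ("Qwen-2.5-1.5B", qwen)):
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Under these assumed counts the rounded values match the JBB columns of the table, which shows that the reported F1 follows directly from the recall and FPR on a balanced 100/100 split.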

> **Note on layer selection.** The library auto-selects the layer by Fisher margin on
> a held-out anchor split. This is robust on Qwen and Llama-3.2-3B but can pick a
> non-generalizing early layer on some models (observed on Llama-3.1-8B). If a build's
> JBB FPR looks anomalously high, force a late layer with `layer=-1` and re-benchmark.

## The paper's 8B

The paper characterizes **Llama-3.1-8B** (F1 0.92 on its original anchor set). You can
build on it the same way (`--model meta-llama/Llama-3.1-8B-Instruct`), but note the
layer-selection caveat above and that the paper's benign anchor set is not reproduced
here — see the F1 note below. Built with Llama.

## On the F1 number (please read)

The paper reports **F1 = 0.92** on Llama-3.1-8B using its original anchor set. That set’s *benign* half was not specified in the paper and is no longer available, so **this library does not reproduce 0.92 by inheritance.** Instead it ships a **frozen, reproducible** benign anchor set (`data/benign_anchors_v1.json`) and reports the F1/FPR it actually measures against it. The two numbers are different by construction; the library’s number is the one you can verify. Don’t quote the paper’s 0.92 as this package’s output.

## How `u_ref` is built

1. Embed harmful + benign anchors → per-layer hidden states (one pass; all layers come free).
2. **Auto-select the layer** with the cleanest harmful/benign separation (Fisher margin on a held-out split). Pass `layer=-1` to force the final layer and mirror the paper.
3. `u_ref` = difference of class means at that layer.
4. Calibrate **τ** for best F1 on a calibration split.
5. Report F1/FPR on a disjoint test split.
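
Steps 2 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the Fisher margin is taken here to be the squared gap between projected class means divided by the sum of projected class variances, the held-out split is a simple front/back cut, and the names `fisher_margin` and `select_layer` are hypothetical.

```python
import numpy as np

def fisher_margin(harmful_proj: np.ndarray, benign_proj: np.ndarray) -> float:
    """Squared gap between class means over the summed class variances (1-D projections)."""
    gap = harmful_proj.mean() - benign_proj.mean()
    return float(gap ** 2 / (harmful_proj.var() + benign_proj.var() + 1e-12))

def select_layer(harmful: np.ndarray, benign: np.ndarray, fit_frac: float = 0.5):
    """harmful, benign: (n_prompts, n_layers, hidden_dim) hidden states.

    Fits a difference-of-means direction per layer on the first fit_frac of
    each class, scores it on the held-out remainder, and returns the layer
    with the largest margin plus its (unnormalized) direction.
    """
    cut_h = int(len(harmful) * fit_frac)
    cut_b = int(len(benign) * fit_frac)
    best_layer, best_margin, best_u = None, -np.inf, None
    for layer in range(harmful.shape[1]):
        u = harmful[:cut_h, layer].mean(axis=0) - benign[:cut_b, layer].mean(axis=0)
        u_hat = u / (np.linalg.norm(u) + 1e-12)
        margin = fisher_margin(harmful[cut_h:, layer] @ u_hat,
                               benign[cut_b:, layer] @ u_hat)
        if margin > best_margin:
            best_layer, best_margin, best_u = layer, margin, u
    return best_layer, best_u

# Synthetic hidden states: 3 layers, only the last one separates the classes.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(40, 3, 8))
benign = rng.normal(size=(40, 3, 8))
harmful[:, 2, 0] += 4.0
benign[:, 2, 0] -= 4.0

layer, u_ref = select_layer(harmful, benign)
print(layer, u_ref.shape)
```

Scoring each candidate direction on prompts it was not fitted on is what keeps the selection from rewarding a layer that merely memorizes the anchors.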

Everything that affects the vector — model + revision, chosen layer, benign source + N, position, normalization, τ — is written to a **`u_ref` card** so each artifact is a documented, reproducible object.

## Choosing a default by measurement, not ASR

Attack-success-rate heatmaps say how easy a model is to *jailbreak*; they say nothing about *detection* quality. To pick a default model, compare **detection separability**:

```python
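from aplomb import HFBackbone
from aplomb.bench import bench_models, format_table

# harmful / benign: your anchor prompt lists; "..." stands for the other backbones to compare
backbones = [HFBackbone("Qwen/Qwen2.5-1.5B-Instruct"), ...]
print(format_table(bench_models(backbones, harmful, benign)))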
```

## Benchmarking (the publishable F1)

The number in a freshly built card is a small-N held-out estimate, not a headline.
For real F1/FPR, run the detector against JailbreakBench + XSTest:

```bash
python scripts/benchmark.py \
  --jbb-harmful jbb_harmful.csv --jbb-benign jbb_benign.csv --xstest xstest.csv
```

It reports F1/precision/recall/FPR on JailbreakBench at the shipped tau, the XSTest
over-refusal FPR, and an oracle-tau diagnostic — and writes `results_benchmark.json`.
Report the JBB @ shipped-tau F1 as the headline; the oracle number is an optimistic
upper bound, not a deployment figure.
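
The gap between the shipped-tau and oracle-tau figures can be illustrated with a small sketch on synthetic scores. Everything here is an assumption for illustration: the score distributions, the shift between calibration and test data, and the helper names `f1_at` and `best_f1_tau`; it does not reproduce what `scripts/benchmark.py` does internally.

```python
import numpy as np

def f1_at(scores: np.ndarray, labels: np.ndarray, tau: float) -> float:
    """F1 when prompts with score > tau are flagged; labels is a bool array (True = harmful)."""
    pred = scores > tau
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_f1_tau(scores: np.ndarray, labels: np.ndarray):
    """Sweep every distinct score (plus one value below the minimum) as a threshold."""
    candidates = np.concatenate(([scores.min() - 1.0], np.unique(scores)))
    f1s = [f1_at(scores, labels, t) for t in candidates]
    i = int(np.argmax(f1s))
    return float(candidates[i]), f1s[i]

rng = np.random.default_rng(0)

def sample(shift: float):
    """Synthetic scores: 100 harmful and 100 benign, both shifted by `shift`."""
    harmful = rng.normal(0.5 + shift, 0.2, size=100)
    benign = rng.normal(0.0 + shift, 0.2, size=100)
    labels = np.concatenate([np.ones(100, dtype=bool), np.zeros(100, dtype=bool)])
    return np.concatenate([harmful, benign]), labels

calib_scores, calib_labels = sample(0.0)   # stands in for the calibration split
test_scores, test_labels = sample(0.1)     # stands in for the benchmark, slightly shifted

shipped_tau, _ = best_f1_tau(calib_scores, calib_labels)
shipped_f1 = f1_at(test_scores, test_labels, shipped_tau)
oracle_tau, oracle_f1 = best_f1_tau(test_scores, test_labels)
print(f"shipped tau={shipped_tau:.3f} F1={shipped_f1:.3f}")
print(f"oracle  tau={oracle_tau:.3f} F1={oracle_f1:.3f}")
```

Because the oracle threshold is tuned on the very labels it is evaluated against, its F1 can never fall below the shipped-tau F1 on the same data, which is why it reads as an upper bound rather than a deployment figure.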

## License & attribution

Library code: MIT. Bundled/derived data and compliance: see [`NOTICE`](NOTICE) — AdvBench (MIT), the frozen benign set, XSTest-inspired hard negatives (CC-BY-4.0 inspiration), Qwen (Apache-2.0), and the **Built with Llama** attribution required on the Llama opt-in path.
