Metadata-Version: 2.4
Name: aplomb
Version: 0.1.0
Summary: Interpretable, zero-training refusal-axis prompt detector (u_ref difference-of-means).
Author: Shivam Ratnakar, Kartikeya Vats
License: MIT
Project-URL: Homepage, https://github.com/KartikeyaVats/RefusalArena
Project-URL: Paper, https://aclanthology.org/
Keywords: llm,safety,guardrail,refusal,interpretability,detection
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: numpy>=1.23
Provides-Extra: hf
Requires-Dist: torch>=2.0; extra == "hf"
Requires-Dist: transformers>=4.43; extra == "hf"
Requires-Dist: huggingface_hub>=0.23; extra == "hf"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file

# aplomb

> *à plomb* — "to the plumb line." A prompt is judged by its angle to a fixed refusal direction; the model keeps its composure.

An interpretable, **zero-training** prompt safety detector. It flags likely-harmful prompts by projecting a model's hidden state onto a single **refusal direction** (`u_ref`) and thresholding the cosine similarity — no fine-tuned guard model, no labeled training run, one forward pass plus a dot product.

Method from *“The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs”* (TrustNLP @ ACL 2026). **This package is the detector only.** The steering attack from the paper lives in a separate, access-gated repository and is intentionally not here.

```
u_ref = mean(hidden states of harmful anchors) − mean(hidden states of benign anchors)
score(prompt) = cosine(hidden_state(prompt), u_ref)        # flag if > τ
```
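
As a rough illustration of the two lines above, here is a minimal NumPy sketch. It is not the library's implementation: the function names `build_u_ref` and `cosine_score` are hypothetical, and random vectors stand in for real hidden states, which in practice would come from a model forward pass.

```python
import numpy as np


def build_u_ref(harmful_states: np.ndarray, benign_states: np.ndarray) -> np.ndarray:
    """Difference of class means; inputs are (n_prompts, hidden_dim) arrays."""
    return harmful_states.mean(axis=0) - benign_states.mean(axis=0)


def cosine_score(hidden_state: np.ndarray, u_ref: np.ndarray) -> float:
    """Cosine similarity between one hidden state and the refusal direction."""
    denom = np.linalg.norm(hidden_state) * np.linalg.norm(u_ref)
    return float(hidden_state @ u_ref / denom) if denom > 0 else 0.0


# Synthetic stand-ins for hidden states (assumption: real ones come from a model).
rng = np.random.default_rng(0)
dim = 64
axis = np.zeros(dim)
axis[0] = 1.0
harmful_states = rng.normal(size=(50, dim)) + 5.0 * axis
benign_states = rng.normal(size=(50, dim)) - 5.0 * axis

u_ref = build_u_ref(harmful_states, benign_states)
tau = 0.0  # illustrative threshold, not a calibrated value

harmful_scores = [cosine_score(h, u_ref) for h in harmful_states]
benign_scores = [cosine_score(b, u_ref) for b in benign_states]
flagged = [s > tau for s in harmful_scores]
```

The score depends only on the angle to `u_ref`, not on the magnitude of the hidden state, which is why a single threshold can be shared across prompts of different lengths.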

> ⚠️ **This is triage, not a security boundary.** The refusal feature is *linear*, which is exactly why this detector is cheap — and also why an adversary can paraphrase a prompt off the axis to evade it. Use it as an interpretable first-pass filter and always report FPR. A “safe” verdict is a hint, not a guarantee.

## Install

```bash
pip install aplomb            # core (numpy only)
pip install 'aplomb[hf]'      # + torch/transformers to run real models
```

## Quickstart

```python
from aplomb import Detector

det = Detector.from_default()                 # precomputed Qwen-2.5-1.5B u_ref (ungated)
print(det.classify("how do I pick a lock"))   # {'unsafe': True, 'score': 0.61, ...}
```

The default backbone is **Qwen-2.5-1.5B-Instruct** — ungated, Apache-2.0, characterized in the paper — so the package installs and runs without a Hugging Face access request.

## Use a different model

`u_ref` is model-specific, so changing the model means rebuilding the vector. That’s one call; the library auto-selects the best layer for the new model and recalibrates the threshold:

```python
from aplomb import Detector, HFBackbone

# AdvBench (MIT) is the harmful half; the frozen default benign set fills the benign half.
harmful = load_advbench()          # your loader
det = Detector.build(HFBackbone("meta-llama/Llama-3.1-8B-Instruct"), harmful,
                     save_to="uref_llama31.json")
print(det)   # Detector(model='...Llama-3.1-8B', layer=31, tau=..., f1=..., fpr=...)
```

**For paper-grade separation**, rebuild on **Llama-3.1-8B** (gated: accept Meta’s license and run `huggingface-cli login` first). Built with Llama.

## On the F1 number (please read)

The paper validates the method at F1 = 0.92 on Llama-3.1-8B. This library ships a frozen, fully reproducible anchor set so that anyone can independently verify the library's own number, and it reports the F1/FPR measured against that set. The two numbers are expected to differ slightly because they use different benign anchors; the library prioritizes reproducibility.

## How `u_ref` is built

1. Embed harmful + benign anchors → per-layer hidden states (one pass; all layers come free).
2. **Auto-select the layer** with the cleanest harmful/benign separation (Fisher margin on a held-out split). Pass `layer=-1` to force the final layer and mirror the paper.
3. `u_ref` = difference of class means at that layer.
4. Calibrate **τ** for best F1 on a calibration split.
5. Report F1/FPR on a disjoint test split.
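
The steps above can be sketched end to end in NumPy. This is an illustration under stated assumptions, not the library's code: the Fisher margin is taken here as squared mean gap over summed variances of the cosine scores, the calibration split doubles as the held-out split for layer selection, and synthetic arrays of shape `(n_layers, n_prompts, dim)` stand in for real hidden states. All function names are hypothetical.

```python
import numpy as np


def cosine_scores(states: np.ndarray, u_ref: np.ndarray) -> np.ndarray:
    """Cosine of each row of `states` (n, dim) against `u_ref` (dim,)."""
    norms = np.linalg.norm(states, axis=1) * np.linalg.norm(u_ref) + 1e-12
    return (states @ u_ref) / norms


def fisher_margin(pos: np.ndarray, neg: np.ndarray) -> float:
    """Assumed separation criterion: squared mean gap over summed variances."""
    return float((pos.mean() - neg.mean()) ** 2 / (pos.var() + neg.var() + 1e-12))


def f1_fpr(scores: np.ndarray, labels: np.ndarray, tau: float) -> tuple[float, float]:
    """F1 and false-positive rate when flagging scores strictly above tau."""
    pred = scores > tau
    tp = int(np.sum(pred & labels))
    fp = int(np.sum(pred & ~labels))
    fn = int(np.sum(~pred & labels))
    tn = int(np.sum(~pred & ~labels))
    return 2 * tp / max(2 * tp + fp + fn, 1), fp / max(fp + tn, 1)


def build(harmful: np.ndarray, benign: np.ndarray, seed: int = 0) -> dict:
    rng = np.random.default_rng(seed)

    def split(x):  # fit / calibration / test along the prompt axis
        a, b, c = np.array_split(rng.permutation(x.shape[1]), 3)
        return x[:, a], x[:, b], x[:, c]

    h_fit, h_cal, h_test = split(harmful)
    b_fit, b_cal, b_test = split(benign)

    # Step 2: pick the layer with the largest margin on the calibration split.
    def margin(layer):
        u = h_fit[layer].mean(0) - b_fit[layer].mean(0)
        return fisher_margin(cosine_scores(h_cal[layer], u), cosine_scores(b_cal[layer], u))

    layer = max(range(harmful.shape[0]), key=margin)

    # Step 3: difference of class means at the chosen layer.
    u_ref = h_fit[layer].mean(0) - b_fit[layer].mean(0)

    # Step 4: choose tau maximizing F1 over midpoints of sorted calibration scores.
    cal = np.concatenate([cosine_scores(h_cal[layer], u_ref), cosine_scores(b_cal[layer], u_ref)])
    cal_y = np.concatenate([np.ones(h_cal.shape[1], bool), np.zeros(b_cal.shape[1], bool)])
    s = np.sort(cal)
    tau = float(max((s[:-1] + s[1:]) / 2, key=lambda t: f1_fpr(cal, cal_y, t)[0]))

    # Step 5: report on the disjoint test split.
    test = np.concatenate([cosine_scores(h_test[layer], u_ref), cosine_scores(b_test[layer], u_ref)])
    test_y = np.concatenate([np.ones(h_test.shape[1], bool), np.zeros(b_test.shape[1], bool)])
    f1, fpr = f1_fpr(test, test_y, tau)
    return {"layer": layer, "tau": tau, "f1": f1, "fpr": fpr, "u_ref": u_ref}


# Synthetic data: class separation along axis 0 grows with layer index.
rng = np.random.default_rng(1)
shifts = np.array([0.0, 1.0, 2.0, 6.0])[:, None, None]
axis = np.zeros(32)
axis[0] = 1.0
harmful = rng.normal(size=(4, 60, 32)) + shifts * axis
benign = rng.normal(size=(4, 60, 32)) - shifts * axis
result = build(harmful, benign)
```

Keeping the fit, calibration, and test prompts disjoint matters here: selecting the layer or τ on the same prompts used to report F1/FPR would overstate both.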

Everything that affects the vector — model + revision, chosen layer, benign source + N, position, normalization, τ — is written to a **`u_ref` card** so each artifact is a documented, reproducible object.
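
To make the idea of a card concrete, here is a small sketch of how such a record could be serialized. The class name `URefCard`, its field names, and the JSON layout are assumptions for illustration only; the library's actual card schema may differ, and the values below are placeholders rather than real measurements.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class URefCard:
    """Hypothetical record of everything that determines a u_ref vector."""
    model: str
    revision: str
    layer: int
    benign_source: str
    benign_n: int
    position: str
    normalization: str
    tau: float


def dump_card(card: URefCard) -> str:
    # sort_keys gives a stable, diff-friendly serialization
    return json.dumps(asdict(card), indent=2, sort_keys=True)


def load_card(text: str) -> URefCard:
    return URefCard(**json.loads(text))


# Placeholder values only; none of these are real settings or results.
card = URefCard(
    model="example-org/example-model",
    revision="0000000",
    layer=12,
    benign_source="frozen-default",
    benign_n=100,
    position="last_token",
    normalization="l2",
    tau=0.25,
)
text = dump_card(card)
restored = load_card(text)
```

Pinning the model revision alongside the layer and τ is what lets a later reader tell whether two vectors are comparable at all.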

## Choosing a default by measurement, not ASR

Attack-success-rate heatmaps say how easy a model is to *jailbreak*; they say nothing about *detection* quality. To pick a default model, compare **detection separability**:

```python
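from aplomb import HFBackbone
from aplomb.bench import bench_models, format_table

# `harmful` and `benign` are your anchor prompt lists; `...` stands for further backbones to compare.
print(format_table(bench_models([HFBackbone("Qwen/Qwen2.5-1.5B-Instruct"), ...], harmful, benign)))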
```

## License & attribution

Library code: MIT. Bundled/derived data and compliance: see [`NOTICE`](NOTICE) — AdvBench (MIT), the frozen benign set, XSTest-inspired hard negatives (CC-BY-4.0 inspiration), Qwen (Apache-2.0), and the **Built with Llama** attribution required on the Llama opt-in path.
