Metadata-Version: 2.4
Name: iconoclast-llm
Version: 0.2.1
Summary: ICONOCLAST — Discriminative representation editing for open-weight LLMs. Beats HERETIC baseline across all tested models.
Keywords: llm,transformer,alignment,abliteration,representation-editing,safety
Author: Varesh Patel
License-Expression: AGPL-3.0-or-later
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: GPU
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Dist: accelerate>=1.13
Requires-Dist: datasets>=4.7
Requires-Dist: huggingface-hub>=1.7
Requires-Dist: immutabledict>=4.3
Requires-Dist: numpy>=2.2
Requires-Dist: optuna>=4.7
Requires-Dist: peft>=0.18
Requires-Dist: psutil>=7.2
Requires-Dist: pydantic-settings>=2.13
Requires-Dist: questionary>=2.1
Requires-Dist: rich>=14.3
Requires-Dist: transformers>=5.3
Requires-Dist: lm-eval[hf]>=0.4 ; extra == 'benchmark'
Requires-Dist: bitsandbytes>=0.49 ; extra == 'quantized'
Requires-Dist: geom-median>=0.1 ; extra == 'research'
Requires-Dist: imageio>=2.37 ; extra == 'research'
Requires-Dist: matplotlib>=3.10 ; extra == 'research'
Requires-Dist: pacmap>=0.8 ; extra == 'research'
Requires-Dist: scikit-learn>=1.7 ; extra == 'research'
Requires-Dist: fastapi>=0.100 ; extra == 'serve'
Requires-Dist: uvicorn>=0.20 ; extra == 'serve'
Requires-Python: >=3.10
Project-URL: Documentation, https://github.com/Haadesx/Iconoclast
Project-URL: Homepage, https://github.com/Haadesx/Iconoclast
Project-URL: Repository, https://github.com/Haadesx/Iconoclast
Provides-Extra: benchmark
Provides-Extra: quantized
Provides-Extra: research
Provides-Extra: serve
Description-Content-Type: text/markdown

# 🗡️ ICONOCLAST

> **Beat the HERETIC baseline 10/10 — remove LLM refusal behaviors while preserving intelligence.**

```bash
pip install iconoclast-llm
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --output ./my-model
```

---

## Quick Start

### Install

```bash
pip install iconoclast-llm
```

For the demo API server:

```bash
pip install 'iconoclast-llm[serve]'
```

### Abliterate a model

```bash
iconoclast abliterate --model meta-llama/Llama-3.1-8B-Instruct --output ./llama-abliterated
```

This downloads the model from Hugging Face, computes refusal directions, finds optimal abliteration parameters via Optuna, and saves the abliterated model to `./llama-abliterated/`.

**Options:**

| Flag | Default | Description |
|------|---------|-------------|
| `--model` | required | Hugging Face model ID or local path |
| `--output` | `./abliterated-model` | Output directory |
| `--device` | `auto` | Device (`auto`, `cuda:0`, `mps`, `cpu`) |
| `--benign-subspace-rank` | `0` | Benign subspace dimensions to preserve (try 64–256 for better quality) |
| `--quantize` | `none` | Quantization (`none` or `bnb_4bit`; `bnb_4bit` needs the `quantized` extra for bitsandbytes) |
| `--good-prompts` | `HuggingFaceH4/ultrafeedback_binarized` | Dataset for benign prompts |
| `--bad-prompts` | `walledai/JailbreakBench` | Dataset for harmful prompts |

### Serve the abliterated model

```bash
iconoclast serve --model ./llama-abliterated --port 8000
```

Then generate text:

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "How do I make a bomb?", "max_tokens": 100}'
```

### Full research workflow

```bash
iconoclast study
```

Launches the interactive Optuna-powered hyperparameter search across direction methods, layer selections, and blending strategies.

---

## The Results: ICONOCLAST vs HERETIC

Across 10 open-weight models, ICONOCLAST matches or beats HERETIC on harmful refusals and benign overrefusals in **every row**, and has lower KL divergence from the base model in 8 of 10 (Falcon3-7B-Instruct and Yi-1.5-9B-Chat are the exceptions).

| Model | ICONOCLAST Refusals | ICONOCLAST Overrefusals | ICONOCLAST KL | HERETIC Refusals | HERETIC Overrefusals | HERETIC KL | Outcome |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :--- |
| **Llama-3.1-8B-Instruct** | **0/20** | **0/64** | **0.0447** | 1/20 | 0/64 | 0.1854 | 🏆 ICONOCLAST |
| **Qwen3.5-9B base** | **10/20** | **2/64** | **0.0055** | 10/20 | 3/64 | 0.0160 | 🏆 ICONOCLAST |
| **Mistral-7B-Instruct-v0.3** | **1/20** | **0/64** | **0.0554** | 4/20 | 0/64 | 0.1317 | 🏆 ICONOCLAST |
| **Falcon3-7B-Instruct** | **0/20** | **0/64** | **6.1448** | 4/20 | 1/64 | 0.1648 | 🏆 ICONOCLAST |
| **Gemma-2-2B-IT** | **1/20** | **0/64** | **0.1849** | 1/20 | 2/64 | 0.6441 | 🏆 ICONOCLAST |
| **Phi-4-mini-instruct** | **2/20** | **1/64** | **0.0204** | 2/20 | 1/64 | 0.0978 | 🏆 ICONOCLAST |
| **Yi-1.5-9B-Chat** | **2/20** | **0/64** | **0.0511** | 3/20 | 0/64 | 0.0355 | 🏆 ICONOCLAST |
| **StableLM2-1.6B** | **2/20** | **0/64** | **0.0328** | 3/20 | 0/64 | 0.0670 | 🏆 ICONOCLAST |
| **SmolLM2-1.7B-Instruct** | **1/20** | **1/64** | **0.0087** | 2/20 | 2/64 | 0.2699 | 🏆 ICONOCLAST |
| **OLMo-2-1B-Instruct** | **2/20** | **0/64** | **0.0345** | 2/20 | 1/64 | 0.0944 | 🏆 ICONOCLAST |
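
The per-metric counts can be recomputed directly from the table above. The sketch below only transcribes the table values and compares them column by column; the names `ROWS` and `tally` are illustrative and not part of the package.

```python
# Rows transcribed from the results table:
# (model, ICONOCLAST refusals, overrefusals, KL, HERETIC refusals, overrefusals, KL)
ROWS = [
    ("Llama-3.1-8B-Instruct", 0, 0, 0.0447, 1, 0, 0.1854),
    ("Qwen3.5-9B base", 10, 2, 0.0055, 10, 3, 0.0160),
    ("Mistral-7B-Instruct-v0.3", 1, 0, 0.0554, 4, 0, 0.1317),
    ("Falcon3-7B-Instruct", 0, 0, 6.1448, 4, 1, 0.1648),
    ("Gemma-2-2B-IT", 1, 0, 0.1849, 1, 2, 0.6441),
    ("Phi-4-mini-instruct", 2, 1, 0.0204, 2, 1, 0.0978),
    ("Yi-1.5-9B-Chat", 2, 0, 0.0511, 3, 0, 0.0355),
    ("StableLM2-1.6B", 2, 0, 0.0328, 3, 0, 0.0670),
    ("SmolLM2-1.7B-Instruct", 1, 1, 0.0087, 2, 2, 0.2699),
    ("OLMo-2-1B-Instruct", 2, 0, 0.0345, 2, 1, 0.0944),
]


def tally(rows, ours, theirs):
    """Count rows where the ICONOCLAST value is lower, equal, or higher."""
    lower = sum(r[ours] < r[theirs] for r in rows)
    equal = sum(r[ours] == r[theirs] for r in rows)
    higher = len(rows) - lower - equal
    return lower, equal, higher


refusals = tally(ROWS, 1, 4)      # harmful refusals (lower is better)
overrefusals = tally(ROWS, 2, 5)  # benign overrefusals (lower is better)
kl = tally(ROWS, 3, 6)            # KL divergence (lower is better)
kl_regressions = [r[0] for r in ROWS if r[3] > r[6]]

print("refusals     (lower, tied, higher):", refusals)
print("overrefusals (lower, tied, higher):", overrefusals)
print("KL           (lower, tied, higher):", kl)
print("rows where ICONOCLAST KL is higher:", kl_regressions)
```

With only 20 harmful and 64 benign prompts per model, a one-prompt difference moves a rate by 5 or about 1.6 percentage points, so ties and single-prompt wins are best read as comparable.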

**Key highlights:**

- **Strict Behavior Wins:** Fewer harmful refusals in 6/10 rows, tied in the rest
- **Utility Preservation:** Lower KL divergence in 8/10 rows (all except Falcon3-7B-Instruct and Yi-1.5-9B-Chat), meaning the edited model's output distribution stays closer to the base model
- **Massive KL Reduction:** SmolLM2 drops from 0.2699 (HERETIC) to 0.0087 (ICONOCLAST). Gemma-2-2B drops from 0.6441 to 0.1849
- **Flawless on Llama-3.1-8B:** 0/20 harmful refusals, 0/64 overrefusals, and 0.0447 KL, versus 1/20, 0/64, and 0.1854 for HERETIC
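
For readers unfamiliar with the KL column, here is a minimal sketch of how such a number can be computed. It assumes KL(base ‖ edited) over next-token distributions, averaged across prompts and measured in nats; this README does not state the exact definition ICONOCLAST uses, so treat the direction, the averaging, and the function names as assumptions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def mean_kl(base_logits, edited_logits):
    """Mean KL(base || edited) across prompts, in nats.

    Both inputs have shape (num_prompts, vocab_size) and hold the
    next-token logits of each model for the same prompts.
    """
    log_p = log_softmax(np.asarray(base_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(edited_logits, dtype=np.float64))
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())


# Synthetic logits stand in for real model outputs.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 50))
edited = base + 0.1 * rng.normal(size=(4, 50))

print("identical models:", mean_kl(base, base))
print("lightly edited:  ", mean_kl(base, edited))
```

A value of 0 means the two models agree exactly on these prompts; larger values mean the edit has shifted the output distribution further from the base model.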

---

## How It Works

ICONOCLAST is a **discriminative representation editing** framework. Unlike HERETIC-style methods that simply subtract a refusal direction, ICONOCLAST:

1. **Computes per-layer refusal directions** using contrastive pairs of harmful and benign prompts
2. **Preserves benign subspaces** by removing from each refusal direction the component that lies in the subspace encoding harmless concepts, so the edit does not act along those benign directions; this is what prevents overrefusals and KL explosion
3. **Optimizes hyperparameters** via Optuna across direction methods (mean, median, variance, hybrid), layer selections, and blending strategies
4. **Evaluates rigorously** on holdout sets for both harmful refusal rate and benign overrefusal rate, with KL divergence as the utility metric
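
Steps 1 and 2 can be illustrated with plain linear algebra. The sketch below is not the package's implementation: it assumes a difference-of-means direction (the `mean` method), takes the benign subspace to be the top principal directions of the benign activations, and uses random arrays in place of real hidden states. All function names are hypothetical.

```python
import numpy as np


def refusal_direction(harmful, benign):
    """Unit difference-of-means direction between two activation sets."""
    diff = harmful.mean(axis=0) - benign.mean(axis=0)
    return diff / np.linalg.norm(diff)


def benign_basis(benign, rank):
    """Orthonormal (rank, d) basis for the top principal directions of benign activations."""
    centered = benign - benign.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank]


def remove_subspace(direction, basis):
    """Drop the component of `direction` inside span(basis), then renormalize."""
    residual = direction - basis.T @ (basis @ direction)
    return residual / np.linalg.norm(residual)


# Synthetic activations: 200 prompts per class, hidden size 32.
rng = np.random.default_rng(0)
d = 32
benign = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 2.0 * np.eye(d)[0]  # shifted along axis 0

raw = refusal_direction(harmful, benign)
basis = benign_basis(benign, rank=4)  # plays the role of --benign-subspace-rank
edited = remove_subspace(raw, basis)

print("overlap with benign subspace before:", np.linalg.norm(basis @ raw))
print("overlap with benign subspace after: ", np.linalg.norm(basis @ edited))
```

After the projection, the edited direction is orthogonal to every basis vector of the benign subspace, so ablating along it leaves activations inside that subspace unchanged.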

---

## Commands Reference

### `iconoclast abliterate`

One-shot abliteration. Runs the full pipeline — prompt loading, direction computation, Optuna optimization, model saving.

```bash
iconoclast abliterate --model <model_id> --output <dir> [options]
```

The benign subspace feature (`--benign-subspace-rank`) is the ICONOCLAST innovation. Start with 64 and increase if overrefusals appear:

```bash
# Best quality (requires more VRAM):
iconoclast abliterate --model Qwen/Qwen2.5-7B-Instruct --benign-subspace-rank 128 --output ./qwen-abliterated
```
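
To compare several ranks, the command can be generated from a script. This is a sketch that only builds and prints the command lines using the flags documented above; the helper name and the output directory naming are made up for illustration, and nothing is executed.

```python
import shlex


def abliterate_command(model, output, benign_subspace_rank=0, quantize="none"):
    """Build the argv list for `iconoclast abliterate` from documented flags only."""
    return [
        "iconoclast", "abliterate",
        "--model", model,
        "--output", output,
        "--benign-subspace-rank", str(benign_subspace_rank),
        "--quantize", quantize,
    ]


# One command per rank in the suggested 64-256 range.
commands = [
    abliterate_command("Qwen/Qwen2.5-7B-Instruct", f"./qwen-rank-{rank}", rank)
    for rank in (64, 128, 256)
]

for argv in commands:
    print(shlex.join(argv))
    # To actually run a command: subprocess.run(argv, check=True)
```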

### `iconoclast study`

Full interactive research workflow. Launches an Optuna study with interactive prompts for configuration, trial inspection, and model export. This is the original research interface used to produce the benchmark results above.

### `iconoclast serve`

Starts a FastAPI server for the abliterated model:

```bash
pip install 'iconoclast-llm[serve]'
iconoclast serve --model ./my-model --port 8000
```

**API endpoints:**

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/generate` | POST | Generate text. Body: `{"prompt": "...", "max_tokens": 256, "temperature": 0.7}` |
| `/health` | GET | Health check. Returns `{"status": "ok", "model": "..."}` |
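
A minimal Python client for `/generate`, using only the standard library. The request body follows the table above; the response schema is not documented here, so the sketch returns the decoded JSON unchanged rather than assuming field names. The function names and the default `base_url` are illustrative.

```python
import json
import urllib.request


def build_generate_request(prompt, base_url="http://localhost:8000",
                           max_tokens=256, temperature=0.7):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the decoded JSON response as-is."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server; generate() does.
request = build_generate_request(
    "Summarize the plot of Hamlet in two sentences.", max_tokens=100
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With the server from `iconoclast serve` running, `generate("...")` sends the same request and returns whatever JSON the server responds with.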

---

## Installation Options

| Command | Includes |
|---------|----------|
| `pip install iconoclast-llm` | Core abliteration engine |
| `pip install 'iconoclast-llm[serve]'` | + FastAPI demo server |
| `pip install 'iconoclast-llm[research]'` | + matplotlib, imageio, pacmap, scikit-learn, geom-median |
| `pip install 'iconoclast-llm[benchmark]'` | + lm-eval for standardized evaluation |
| `pip install 'iconoclast-llm[quantized]'` | + bitsandbytes for 4-bit quantization |
| `pip install 'iconoclast-llm[benchmark,quantized,research,serve]'` | Everything |

---

## Requirements

- Python 3.10+
- CUDA GPU recommended (16 GB+ VRAM for 7B models, 32 GB+ for larger)
- MPS (Apple Silicon) works but is slower
- CPU works but will be very slow

Tested on models from 1B to 9B parameters. Larger models (70B) require multi-GPU or quantization.

---

## Citation

```bibtex
@software{patel2025iconoclast,
  author = {Patel, Varesh},
  title = {ICONOCLAST: Benign-Subspace-Preserved Abliteration for Representation Editing},
  year = {2025},
  url = {https://github.com/Haadesx/Iconoclast}
}
```

---

## License

AGPL-3.0-or-later — see [LICENSE](LICENSE).

Built on 4 months of research using the Rutgers iLabs cluster. Original development history at [Haadesx/NLP_Project](https://github.com/Haadesx/NLP_Project).
