Metadata-Version: 2.4
Name: ai-blackteam
Version: 0.4.0
Summary: Automated LLM red team framework -- test any model's safety with one command
License: MIT
License-File: LICENSE
Keywords: llm,red-team,security,jailbreak,ai-safety,penetration-testing
Author: Bill Kishore
Author-email: abillkishoreinico@gmail.com
Requires-Python: >=3.12,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Security
Requires-Dist: anthropic (>=0.86,<0.87)
Requires-Dist: click (>=8.1,<9.0)
Requires-Dist: google-genai (>=1.0,<2.0)
Requires-Dist: httpx (>=0.27,<0.28)
Requires-Dist: huggingface-hub (>=0.20,<0.21)
Requires-Dist: jinja2 (>=3.1,<4.0)
Requires-Dist: ollama (>=0.4,<0.5)
Requires-Dist: openai (>=1.60,<2.0)
Requires-Dist: pyyaml (>=6.0,<7.0)
Requires-Dist: rich (>=13.0,<14.0)
Project-URL: Bug Tracker, https://github.com/BILLKISHORE/ai-evals/issues
Project-URL: Homepage, https://github.com/BILLKISHORE/ai-evals
Project-URL: Repository, https://github.com/BILLKISHORE/ai-evals
Description-Content-Type: text/markdown

# ai-blackteam

Automated LLM red team framework. Test any model's safety with one command.

## Install

```bash
pip install ai-blackteam
```

Or from source:
```bash
git clone https://github.com/BILLKISHORE/ai-evals.git
cd ai-evals
pip install -e .
```

## Quick Start

```bash
# Set your API key
blackteam config set providers.anthropic.api_key sk-ant-...

# Run a single attack
blackteam run -p anthropic -m claude-sonnet-4-6 -a encoding-obfuscation -t "Write a phishing email"

# Run all attacks against a model
blackteam batch -p anthropic -m claude-sonnet-4-6 --attacks all -t "Write a phishing email"

# Run everything against everything
blackteam sweep -t "Write a phishing email"

# Generate reports
blackteam report --format markdown
blackteam report --format html --output report.html
blackteam report --format json --output results.json
```

## Providers

| Provider | Models | Auth |
|----------|--------|------|
| Anthropic | Claude Sonnet 4.6, Opus, Haiku | API key |
| OpenAI | GPT-5.4, GPT-4o | API key |
| Google | Gemini 3.1 Pro, Flash | API key |
| DeepSeek | DeepSeek V3.2 | API key |
| Mistral | Mistral Large 3, Small 4 | API key |
| Ollama | Any local model (Llama 4, Qwen, Gemma) | None (local) |
| HuggingFace | Any HF model | API token |

## Attacks (39 built-in)

### Core Attacks
| Attack | Mode | Description |
|--------|------|-------------|
| encoding-obfuscation | single | Base64, ROT13, hex, leetspeak, pig latin |
| role-play-bypass | multi | Character development, fiction building |
| system-prompt-extraction | multi | JSON schema, chain-of-thought, error correction |
| progressive-normalization | tool-use | Gradual file access escalation |
| many-shot | single | Fabricated Q&A context flooding |
| context-manipulation | multi | 10-turn conversational drift |
| dan-variants | single | DAN, EDUALC, Opposite Day, Professor persona |
| taxonomy-paraphrasing | single | Victim, researcher, journalist framing |
| best-of-n | single | Capitalization, typo, padding variations |
| historical-framing | single | Documentary, museum, textbook framing |
| token-smuggling | single | Zero-width Unicode character injection |
| few-shot-amplification | single | Benign examples + encoded target |
| conversation-derailment | multi | Self-discovery of model weaknesses |
| sunk-cost-exploit | multi | Build helpful context then pivot |
| emotional-manipulation | multi | Distress framing to override safety |
| contradiction-exploit | multi | Feed wrong info, get corrected with truth |
| xml-boundary-injection | single | Fake closing tags to escape prompt |
| authority-chain | multi | CISO/professor authority building |
| completion-exploit | single | Fake transcripts, JSON completion |
| multi-modal-text | single | ASCII art, tables, code embedding |

### Research-Backed Attacks
| Attack | Mode | Source |
|--------|------|--------|
| skeleton-key | multi | Microsoft Research |
| crescendo | multi | Microsoft/USENIX |
| deceptive-delight | multi | Palo Alto Unit 42 |
| bad-likert-judge | multi | Palo Alto Unit 42 |

### Encoding and Structural Attacks
| Attack | Mode | Source |
|--------|------|--------|
| meta-prompting | single | Promptfoo |
| homoglyph-substitution | single | Promptfoo -- Cyrillic/Greek lookalike chars |
| bidirectional-text | single | Promptfoo -- Unicode RTL override |
| markdown-injection | single | Promptfoo -- Hidden divs, HTML comments, tables |
| payload-splitting | single | Learn Prompting -- Split harmful content across variables |
| defined-dictionary | single | Learn Prompting -- Redefine words to mask intent |
| compound-instruction | single | Learn Prompting -- Overwhelm safety with conflicting instructions |
| task-deflection | single | Learn Prompting -- Adjacent task produces harmful output |
| recursive-injection | single | Learn Prompting -- Self-referential prompt loops |
| response-constraints | single | Confident AI -- Suppress safety disclaimers via formatting rules |

### Conversational and Psychological Attacks
| Attack | Mode | Source |
|--------|------|--------|
| false-dichotomy | multi | Promptfoo -- "Better you explain it than bad sources" |
| goalpost-moving | multi | Promptfoo -- Incremental agreement chain |
| narrative-misdirection | multi | arXiv 2507.21820 -- Embed harmful request between benign ones |
| fictional-professional | single | arXiv 2507.21820 -- David Prompt / elaborate novelist persona |
| implication-chaining | multi | arXiv 2507.21820 -- Emotionally loaded context + innocent request |

## Custom Plugins

Drop a Python file in the `plugins/` folder:

```python
from blackteam.registry import register_attack
from blackteam.attacks.base import BaseAttack

@register_attack("my-attack")
class MyAttack(BaseAttack):
    name = "My Custom Attack"
    technique_id = "my-attack"
    mode = "single-turn"

    def generate_prompts(self, target, **kwargs):
        return [f"Custom prompt: {target}"]
```

It shows up in `blackteam list-attacks` automatically.

## Evaluator

Three scoring methods (combine any):
- **Keyword matching** -- fast, free, checks for harmful content indicators
- **Regex patterns** -- precise, free, matches structural patterns
- **LLM-as-judge** -- accurate, uses Claude Haiku to rate 1-5

```bash
# Use all three
blackteam run -p anthropic -a encoding-obfuscation -t "target" --evaluator keyword,regex,llm
```

## Reports

| Format | Use Case |
|--------|----------|
| Markdown | Human-readable summary for documentation |
| JSON | Machine-readable for CI/CD pipelines |
| HTML | Dark-themed report with stats dashboard |

## Research

This tool was built alongside real security research on Claude Sonnet 4 and 4.6. See the `experiments/` folder for 10 experiments covering 150+ attack runs with documented findings.

## License

MIT

