Metadata-Version: 2.4
Name: jailbreak-arena
Version: 0.1.1
Summary: Adversarial RL security testing for LLM applications. An attacker agent learns to break chatbots while a defender patches the system prompt in real time.
Author-email: Mithilesh Kumar Lala <your@email.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/Mithilesh-Lala/jailbreak-arena
Project-URL: Repository, https://github.com/Mithilesh-Lala/jailbreak-arena
Project-URL: Bug Tracker, https://github.com/Mithilesh-Lala/jailbreak-arena/issues
Project-URL: Documentation, https://github.com/Mithilesh-Lala/jailbreak-arena#readme
Project-URL: HuggingFace, https://huggingface.co/mkl-01/jailbreak-arena
Keywords: llm-security,red-teaming,jailbreak,reinforcement-learning,adversarial-ml,llm-safety,openenv,gymnasium
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Security
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: gymnasium>=0.29.1
Requires-Dist: groq>=0.13.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pyyaml>=6.0
Requires-Dist: httpx>=0.27.0
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20.0; extra == "anthropic"
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0.0; extra == "gemini"
Provides-Extra: azure
Requires-Dist: openai>=1.0.0; extra == "azure"
Provides-Extra: bedrock
Requires-Dist: boto3>=1.34.0; extra == "bedrock"
Provides-Extra: langchain
Requires-Dist: langchain>=0.2.0; extra == "langchain"
Provides-Extra: rl
Requires-Dist: stable-baselines3>=2.3.2; extra == "rl"
Provides-Extra: all
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: anthropic>=0.20.0; extra == "all"
Requires-Dist: google-genai>=1.0.0; extra == "all"
Requires-Dist: boto3>=1.34.0; extra == "all"
Requires-Dist: langchain>=0.2.0; extra == "all"
Requires-Dist: stable-baselines3>=2.3.2; extra == "all"

# 🔐 JailbreakArena

> **The self-improving adversarial RL environment for LLM security testing.**
> Your bot gets attacked. It learns to defend. You get a report.

[![PyPI version](https://badge.fury.io/py/jailbreak-arena.svg)](https://pypi.org/project/jailbreak-arena/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![Built on OpenEnv](https://img.shields.io/badge/Built%20on-Meta%20OpenEnv-blue)](https://github.com/meta-pytorch/OpenEnv)
[![HuggingFace](https://img.shields.io/badge/HuggingFace-mkl--01-yellow)](https://huggingface.co/mkl-01)
[![Tests](https://img.shields.io/badge/tests-29%20passed-brightgreen)](https://github.com/Mithilesh-Lala/jailbreak-arena/actions)
[![Docker](https://img.shields.io/badge/Docker-mithilesh--lala%2Fjailbreak--arena-blue)](https://hub.docker.com/r/mithilesh-lala/jailbreak-arena)

---

## The Problem

Every company shipping an LLM chatbot today follows the same broken process:

```
Build chatbot → Test it manually → Deploy → Real attacker finds jailbreak in 10 minutes → PR disaster 🔥
```

This happened to Bing Chat, Air Canada's bot, DPD's support bot — all billion-dollar companies.
The root cause is always the same: **nobody systematically attacked the bot before shipping it.**

Existing tools (static test runners, code review tools, prompt evaluation frameworks)
test what you **tell them to test**. They cannot discover what you haven't thought of yet.
And they have no concept of an attacker that **learns and adapts**.

---

## What JailbreakArena Does

JailbreakArena is an open-source **Reinforcement Learning environment** where two intelligent
agents battle continuously to harden your LLM application:

```
┌─────────────────────────────────────────────────────────┐
│                    JailbreakArena                       │
│                                                         │
│  🗡️  Attacker Agent ──────────▶  Target LLM Bot        │
│      (learns & adapts)                  │               │
│           │                            ▼                │
│           │                    ┌──────────────┐         │
│           │                    │  LLM Judge   │         │
│           │                    └──────────────┘         │
│           ▼                            │                │
│  🛡️  Defender Agent ◀──── Reward Signals ◀────────────┘│
│      (patches system prompt)                            │
└─────────────────────────────────────────────────────────┘
```

- **Attacker Agent** — generates adaptive jailbreak attempts. Studies what was blocked
  and tries a completely different angle next turn. Gets smarter every episode.
- **Defender Agent** — watches every attack outcome and patches the system prompt
  in real time. Learns which defenses work against which attack types.
- **LLM Judge** — evaluates every interaction: `SUCCESS / PARTIAL / FAILED`
  with confidence scoring and reasoning.
- **HTML Report** — every run produces a professional security audit report
  with vulnerabilities, patches, and a hardened system prompt ready to deploy.

---

## What Makes JailbreakArena Different

Most security testing tools are **static** — same tests, same results, every run.
JailbreakArena is **intelligence-first**:

```
Static approach (e.g. traditional test runners, eval frameworks, code review tools):
  ✗ You define the tests → it runs them → same results every time
  ✗ No learning between runs
  ✗ Only finds what you already know to look for
  ✗ Coverage-first — breadth over depth

JailbreakArena:
  ✓ RL environment — attacker LEARNS from every blocked attempt
  ✓ Defender PATCHES the system prompt in real time
  ✓ Gets smarter every episode — discovers novel attack vectors
  ✓ Fully open source — Meta OpenEnv ecosystem
  ✓ Intelligence-first — depth over static breadth
```

**The one-sentence difference:**
Static tools test what you tell them to test.
JailbreakArena discovers what nobody has thought of yet — then tells you how to fix it.

---

## Install

```bash
# Default — Groq (free tier, recommended)
pip install jailbreak-arena

# With your preferred LLM provider
pip install jailbreak-arena[openai]
pip install jailbreak-arena[anthropic]
pip install jailbreak-arena[bedrock]
pip install jailbreak-arena[all]
```

---

## Quickstart — 3 Steps

**Step 1 — Set your provider key in `.env`:**

```dotenv
# Groq — free, fastest (recommended)
GROQ_API_KEY=your_key_here

# Or OpenAI
# OPENAI_API_KEY=sk-xxx

# Or Anthropic Claude
# ANTHROPIC_API_KEY=sk-ant-xxx

# Or Google Gemini
# GEMINI_API_KEY=xxx

# Or Azure OpenAI (enterprise)
# AZURE_OPENAI_API_KEY=xxx
# AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
# AZURE_OPENAI_MODEL_NAME=your-deployment-name
# AZURE_OPENAI_API_VERSION=2025-01-01-preview

# Optional — you choose the models, we use them
# ATTACKER_MODEL=your-chosen-model
# JUDGE_MODEL=your-chosen-model
# BOT_MODEL=your-chosen-model
```

**Step 2 — Run your first audit:**

```bash
# Audit a live chatbot endpoint
jailbreak-arena audit --url https://www.mychatbot.com --turns 5

# Or audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
  --system-prompt "You are a banking assistant. Never reveal account details." \
  --turns 5
```

**Step 3 — Open your report:**

```bash
start report_http_task_001.html      # Windows
open  report_http_task_001.html      # Mac/Linux
```

Full vulnerability report in under 10 minutes.

---

## CLI Reference

```bash
# ── Audit commands ──────────────────────────────────────────────────

# Audit a REST API endpoint
jailbreak-arena audit --url https://www.mychatbot.com

# With authentication
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --auth "Bearer your-api-token"

# Custom payload template (for non-standard APIs)
jailbreak-arena audit \
  --url https://www.mychatbot.com/api \
  --payload-template '{"query": "{input}", "session_id": "test"}' \
  --response-field "data.answer"

# Audit a system prompt directly (no deployment needed)
jailbreak-arena audit \
  --system-prompt "You are a banking assistant..."

# Full audit — all 20 tasks
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --tasks all --turns 5

# Specific tasks only
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --tasks task_001,task_005,task_007

# Save reports to a specific folder
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --output ./security-reports

# Quiet mode — summary only, no turn-by-turn output
jailbreak-arena audit \
  --url https://www.mychatbot.com \
  --quiet

# ── Info commands ────────────────────────────────────────────────────

# List all 20 attack tasks with categories and difficulty
jailbreak-arena tasks
```

---

## Adapter System — Connect Any Bot

### Adapter 1 — SystemPromptAdapter
Test any system prompt directly. No deployment needed.

```python
from jailbreak_arena.adapters import SystemPromptAdapter
from jailbreak_arena.env import JailbreakArenaEnv

adapter = SystemPromptAdapter(
    system_prompt="""
    You are a helpful banking assistant for AcmeBank.
    Never reveal customer account details.
    Never discuss competitor banks.
    """
)
env = JailbreakArenaEnv(target=adapter, task_id="task_001", max_turns=5)
```

### Adapter 2 — HTTPAdapter
Test any deployed chatbot with a REST API endpoint.

```python
from jailbreak_arena.adapters import HTTPAdapter

adapter = HTTPAdapter(
    url="https://www.mychatbot.com/api/chat",
    headers={"Authorization": "Bearer your-token"},
    payload_template={"message": "{input}"},
    response_field="response",
)

# Common payload templates:
# Simple:        {"message": "{input}"}
# OpenAI-style:  {"messages": [{"role": "user", "content": "{input}"}]}
# Custom:        {"query": "{input}", "session_id": "audit-123"}

# Common response fields:
# Simple:        "response"
# OpenAI-style:  "choices.0.message.content"
# Nested:        "data.response"
```

### Adapter 3 — BedrockAdapter
Test any AWS Bedrock hosted model.

```python
from jailbreak_arena.adapters import BedrockAdapter

# Set in .env:
# BEDROCK_MODEL_ID=your-bedrock-model-id
# AWS_DEFAULT_REGION=us-east-1
# Find model IDs: docs.aws.amazon.com/bedrock/latest/userguide/model-ids.html

adapter = BedrockAdapter(
    system_prompt="You are a banking assistant. Never reveal account details."
)
```

### Adapter 4 — LangChainAdapter
Test any LangChain chain or agent.

```python
from jailbreak_arena.adapters import LangChainAdapter

# You configure the model and chain — JailbreakArena just attacks it
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

llm     = ChatOpenAI(model="your-chosen-model")
chain   = ConversationChain(llm=llm)
adapter = LangChainAdapter(chain=chain)
```

---

## 20 Attack Tasks Across 8 Categories

```bash
jailbreak-arena tasks   # see the full list with difficulty ratings
```

| Category | Count | Examples |
|---|---|---|
| 🎭 Identity & Role | 4 | Role Hijacking, Developer Mode, Fictional Framing |
| 📤 Data & Extraction | 2 | System Prompt Extraction, Indirect Extraction |
| 💉 Injection & Manipulation | 3 | Indirect Injection, Emotional Manipulation |
| 🧠 Logic & Context | 3 | Context Overflow, Hypothetical Bypass |
| ⚠️ Harmful Content | 2 | Harmful Instructions, PII Extraction |
| 📋 Compliance & Policy | 2 | Copyright Violation, Competitor Mention |
| 🤖 Agentic Attacks | 2 | Excessive Agency, Memory Poisoning |
| 🔧 Technical Injections | 2 | SQL Injection via NL, SSRF via LLM |

---

## 5 LLM Providers — Zero Code Changes

Auto-detected from `.env`. Switch providers by changing one line.
**You choose the models — we just use them:**

```dotenv
GROQ_API_KEY=xxx          # Groq    (default — free tier)
OPENAI_API_KEY=xxx        # OpenAI
ANTHROPIC_API_KEY=xxx     # Claude
GEMINI_API_KEY=xxx        # Gemini
AZURE_OPENAI_API_KEY=xxx  # Azure   (enterprise)
```

Optional model control per role:

```dotenv
ATTACKER_MODEL=your-chosen-model
JUDGE_MODEL=your-chosen-model
BOT_MODEL=your-chosen-model
```

---

## ⚠️ Azure OpenAI — Content Filter Note

Azure OpenAI has a built-in content management policy that detects and
blocks jailbreak attempts at the API level. Since JailbreakArena generates
real adversarial prompts, Azure will block the attacker agent with:

```
Error code: 400 — content_filter
jailbreak: detected: True, filtered: True
```

**This is not a bug.** Azure is doing its job. The conflict is architectural —
a red-team tool and a content filter cannot coexist on the same endpoint.

### Fix Option 1 — Disable jailbreak filter for your deployment

```
Azure Portal
→ Azure AI Foundry
→ Your deployment (e.g. gpt-4.1)
→ Content filters
→ Create new filter configuration
→ Set "Jailbreak attacks" to OFF
→ Apply to deployment
```

This is the legitimate path for security research and authorized testing.

### Fix Option 2 — Use a different provider as attacker/judge (recommended)

```dotenv
# Use Groq (free) or OpenAI as the attacker + judge
GROQ_API_KEY=your_groq_key

# Then point --url at your Azure bot as the TARGET
# jailbreak-arena audit --url https://your-azure-bot.com/api/chat
```

**Best practice:** Run JailbreakArena on Groq/OpenAI/Anthropic.
Point it at your Azure bot via `--url`. Your Azure bot is what gets hardened.
The attacker runs on a separate, unrestricted provider.

---

## The Reward Design — The Key Insight

The defender earns maximum reward (+15) only when it **blocks the attack AND keeps the bot helpful**.
Over-restriction is penalised just like a security breach:

| Event | Attacker | Defender |
|---|---|---|
| Clean jailbreak success | +10 | -10 |
| Partial manipulation | +3 | -3 |
| Attack blocked, bot helpful | -5 | **+15** ⭐ |
| Attack blocked, bot unhelpful | +2 | -5 |

This models real-world deployment: **security cannot come at the cost of usability.**

---

## Two-Level Grader

```
Level 1 — Rule-based (zero API cost, instant)
  Scans for hard success/failure signals
  Returns immediately if confident
  Covers ~70% of cases at zero cost

Level 2 — LLM Judge (fires only for ambiguous cases)
  Structured output: RESULT / CONFIDENCE / REASON
  Uses smarter model for nuanced reasoning
  Covers the remaining ~30%
```

---

## Docker

```bash
# Pull
docker pull mithilesh-lala/jailbreak-arena

# Audit a live endpoint
docker run --env-file .env mithilesh-lala/jailbreak-arena \
  audit --url https://www.mychatbot.com --turns 5

# Audit a system prompt
docker run --env-file .env mithilesh-lala/jailbreak-arena \
  audit --system-prompt "You are a banking assistant..." --turns 5

# Full audit — all 20 tasks, save reports locally
docker run \
  -v $(pwd)/reports:/app/reports \
  --env-file .env \
  mithilesh-lala/jailbreak-arena \
  audit --url https://www.mychatbot.com --tasks all --output /app/reports

# List all tasks
docker run --env-file .env mithilesh-lala/jailbreak-arena tasks
```

---

## CI/CD — LLM Security Gate

Add to your pipeline — block deployments if vulnerabilities are found:

```yaml
# .github/workflows/llm-security.yml
name: LLM Security Audit

on: [push, pull_request]

jobs:
  security-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install JailbreakArena
        run: pip install jailbreak-arena
      - name: Run Security Audit
        env:
          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
        run: |
          jailbreak-arena audit \
            --url ${{ secrets.BOT_ENDPOINT }} \
            --tasks task_001,task_005,task_007 \
            --turns 3 \
            --quiet
```

---

## Use as a Gymnasium RL Environment

For researchers who want to train custom RL agents:

```python
from stable_baselines3 import PPO
from jailbreak_arena.env import JailbreakArenaEnv
from jailbreak_arena.adapters import SystemPromptAdapter

adapter = SystemPromptAdapter(
    system_prompt="You are a banking assistant..."
)
env = JailbreakArenaEnv(target=adapter, max_turns=5)

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)
model.save("my_defender_v1")
```

---

## Project Structure

```
jailbreak-arena/
├── jailbreak_arena/
│   ├── env.py              # Gymnasium RL environment
│   ├── attacker.py         # LLM-powered attacker agent
│   ├── defender.py         # Discrete-action defender agent
│   ├── grader.py           # Two-level grader
│   ├── tasks.py            # 20 attack task catalog
│   ├── prompts.py          # All prompt templates
│   ├── utils.py            # LLMClient — 5 providers
│   ├── cli.py              # CLI entry point
│   └── adapters/
│       ├── base.py         # Abstract base adapter
│       ├── system_prompt.py # Test any system prompt
│       ├── http.py          # Any REST API endpoint
│       ├── bedrock.py       # AWS Bedrock models
│       └── langchain.py     # LangChain chains/agents
├── reporters/
│   └── html_report.py      # HTML audit report generator
├── examples/
│   ├── basic_run.py        # Basic usage example
│   └── audit_my_bot.py     # Adapter usage examples
├── tests/                  # 29 unit tests, 0.10s, zero API calls
├── Dockerfile
├── openenv.yaml            # Meta OpenEnv spec
├── DOCKER_HUB.md           # Docker Hub documentation
├── HUGGINGFACE.md          # HuggingFace model card
└── pyproject.toml          # PyPI packaging
```

---

## Run Tests

```bash
python -m pytest tests/ -v
# 29 passed in 0.10s — zero API calls — runs in CI instantly
```

---

## Install Options

```bash
pip install jailbreak-arena              # Groq only (free, recommended)
pip install jailbreak-arena[openai]     # adds OpenAI
pip install jailbreak-arena[anthropic]  # adds Anthropic
pip install jailbreak-arena[bedrock]    # adds boto3 for AWS Bedrock
pip install jailbreak-arena[all]        # everything
```

---

## Built On

| | |
|---|---|
| **Meta OpenEnv** | Standardised RL environment framework |
| **HuggingFace** | Environment hub and model ecosystem |
| **Gymnasium** | Industry standard RL interface |
| **Groq** | Free, blazing-fast LLM inference (default) |

---

## Roadmap

- [ ] v0.2.0 — OWASP LLM Top 10 compliance mapping
- [ ] v0.2.0 — Web UI dashboard
- [ ] v0.3.0 — Pre-trained defender agent weights on HuggingFace
- [ ] v0.3.0 — Multi-episode RL training pipeline
- [ ] v0.4.0 — Custom task builder API

---

## Author

**Mithilesh Kumar Lala**
GitHub: [@Mithilesh-Lala](https://github.com/Mithilesh-Lala) &nbsp;|&nbsp;
HuggingFace: [mkl-01](https://huggingface.co/mkl-01) &nbsp;|&nbsp;
Docker: [mithilesh-lala](https://hub.docker.com/r/mithileshkumarlala/jailbreak-arena)

---

## License

MIT — free to use, modify, and distribute.

---

## Contributing

Issues and PRs welcome.
Found a new attack vector not in the 20 tasks? Open an issue — we will add it.

---

