Metadata-Version: 2.4
Name: safetydiff
Version: 1.0.0
Summary: Safety regression comparison for AI systems.
Author: Neuralchemy
License: MIT
Keywords: ai-safety,llm-security,regression-testing,prompt-injection,model-evaluation
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# SafetyDiff 🛡️ 

**The Git-Diff for LLM Safety Posture**

SafetyDiff is an open-source CI/CD and analytics engine for Large Language Models. It addresses the "Black Box Versioning" problem: when you upgrade a model from version 1 to version 2 (or switch from a Qwen model to an OpenAI model), is the new model *actually* safer, or does it just have different vulnerabilities?

Instead of relying on a single benchmark score, SafetyDiff reads evaluation databases and produces a direct, side-by-side quantitative diff of how two models respond to the exact same adversarial attacks.

## Why SafetyDiff?
Current AI security benchmarks output static numbers (e.g., "Model A scored 82%"). SafetyDiff treats LLM safety like software engineering:
*   **Regression Tracking:** See exactly which vulnerabilities were fixed, and which new vulnerabilities were introduced.
*   **Cross-Model Transferability:** Take an attack that broke Llama-3 and diff it against Qwen2.5 to map the vulnerabilities the two models share.
*   **Granular Taxonomy:** Break down safety results by attack intent (e.g., `role_hijack`, `data_exfiltration`, `tool_abuse`).
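
The regression-tracking idea above can be sketched in a few lines of Python. This is a minimal illustration, not SafetyDiff's actual implementation: the attack IDs, the outcome dictionaries, and the `diff_outcomes` helper are all hypothetical, and it assumes each attack has a simple succeeded/blocked result per model.

```python
# Hypothetical per-attack outcomes: attack ID -> True if the attack succeeded.
# These IDs and results are illustrative, not taken from the demo database.
baseline = {"atk-001": True, "atk-002": False, "atk-003": True, "atk-004": False}
candidate = {"atk-001": False, "atk-002": False, "atk-003": True, "atk-004": True}


def diff_outcomes(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Classify each attack present in both runs by how its outcome changed."""
    shared = sorted(old.keys() & new.keys())
    return {
        "fixed": [a for a in shared if old[a] and not new[a]],
        "regressed": [a for a in shared if not old[a] and new[a]],
        "still_vulnerable": [a for a in shared if old[a] and new[a]],
        "still_safe": [a for a in shared if not old[a] and not new[a]],
    }


result = diff_outcomes(baseline, candidate)
for label, attacks in result.items():
    print(f"{label}: {attacks}")
```

Two models can have the same overall success rate while differing in which attacks succeed. Here both runs fail on two of four attacks, yet one vulnerability was fixed and one was introduced, which a single aggregate score would hide.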

## Installation
```bash
git clone https://github.com/m4vic/SafetyDiff.git
cd SafetyDiff
pip install -r requirements.txt
```

## Quick Start (Demo)
SafetyDiff ships with a `demo_safety_history.db` containing thousands of pre-computed red-team evaluations across `qwen2.5-coder:3b`, `qwen3.5:4b`, and `gpt-4o-mini`. You can run comparisons out of the box without generating your own data!

Compare two models:
```bash
python safetydiff.py --compare gpt-4o-mini qwen2.5-coder:3b
```

Filter by a specific vulnerability category:
```bash
python safetydiff.py --compare gpt-4o-mini qwen2.5-coder:3b --intent role_hijack
```

## Architecture & Data Generation
SafetyDiff is an **Analytics Engine**. It does not generate attacks itself. Instead, it consumes SQLite databases produced by automated red-teaming pipelines. The demo database was generated using **ASRT** (Automated Safety Regression Testing), a proprietary, zero-human (fully automated) adversarial generation engine that uses TF-IDF routers and Mixture-of-Experts (MoE) LLM-as-a-Judge evaluations.
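
To make the "consume a SQLite database" step concrete, here is a rough sketch of a per-intent comparison query. The `evaluations` table, its column names, and the `model-a`/`model-b` labels are assumptions made for illustration. The schema that ASRT and `demo_safety_history.db` actually use is not described here and may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; the real database layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE evaluations ("
    "attack_id TEXT, model TEXT, intent TEXT, attack_succeeded INTEGER)"
)
rows = [
    ("atk-001", "model-a", "role_hijack", 1),
    ("atk-001", "model-b", "role_hijack", 0),
    ("atk-002", "model-a", "tool_abuse", 0),
    ("atk-002", "model-b", "tool_abuse", 1),
    ("atk-003", "model-a", "role_hijack", 1),
    ("atk-003", "model-b", "role_hijack", 1),
]
conn.executemany("INSERT INTO evaluations VALUES (?, ?, ?, ?)", rows)

# Self-join on attack_id so both models are compared on exactly the same attacks.
query = """
SELECT a.intent,
       SUM(a.attack_succeeded) AS successes_a,
       SUM(b.attack_succeeded) AS successes_b,
       COUNT(*)                AS shared_attacks
FROM evaluations AS a
JOIN evaluations AS b ON a.attack_id = b.attack_id
WHERE a.model = ? AND b.model = ?
GROUP BY a.intent
ORDER BY a.intent
"""
per_intent = conn.execute(query, ("model-a", "model-b")).fetchall()

for intent, succ_a, succ_b, total in per_intent:
    print(f"{intent}: model-a {succ_a}/{total}, model-b {succ_b}/{total}")

conn.close()
```

Joining on the attack identifier, rather than comparing two independent aggregates, keeps the comparison restricted to attacks that both models were actually evaluated on.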

## Roadmap
*   **v1.0 (Current):** Direct Prompt Injection & Chat Vulnerability Diffing.
*   **v2.0 (In Development):** Agentic Trajectory Evaluation & Indirect Prompt Injections (IPI).

---
**Author:** Sanskar Jajoo ([@m4vic](https://github.com/m4vic))
