Metadata-Version: 2.4
Name: contamination-audit
Version: 0.1.1
Summary: N-gram, embedding, canary, answer-pattern, and corpus contamination auditor.
Author: AuraOne
License-Expression: MIT
Project-URL: Homepage, https://auraone.ai/open
Project-URL: Source, https://github.com/auraoneai/contamination-audit
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: embedding
Requires-Dist: sentence-transformers>=2.7; extra == "embedding"
Dynamic: license-file

# contamination-audit

`contamination-audit` audits evaluation data for contamination by combining five checks: n-gram overlap, optional embedding similarity, canary matching, answer-pattern checks, and public-corpus hash matching.
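
To make the n-gram overlap check concrete, here is a minimal sketch of the general technique in Python. It is not the package's implementation: the function names `ngrams` and `ngram_overlap`, the whitespace tokenization, and the default `n=3` are all assumptions chosen for illustration.

```python
def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(eval_text: str, corpus_text: str, n: int = 3) -> float:
    """Fraction of the eval item's n-grams that also appear in the corpus text.

    Hypothetical helper: returns 0.0 when the eval text is shorter than n tokens.
    """
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    shared = eval_grams & ngrams(corpus_text, n)
    return len(shared) / len(eval_grams)


if __name__ == "__main__":
    score = ngram_overlap(
        "the quick brown fox jumps",
        "a quick brown fox jumps high",
    )
    print(f"trigram overlap: {score:.2f}")
```

A high score suggests that an evaluation item shares long verbatim spans with corpus text. Where to set a flagging threshold is a judgment call that this sketch does not address.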

## Quickstart

```bash
pip install contamination-audit
contamination-audit run --eval-data examples/eval.jsonl --corpora pile,c4,hf-mmlu
```

By default, embedding checks use a dependency-free lexical cosine fallback. To run semantic embedding checks locally, install the `embedding` extra and select the `sentence-transformers` backend:

```bash
pip install 'contamination-audit[embedding]'
contamination-audit run --eval-data examples/eval.jsonl --embedding-backend sentence-transformers --embedding-model all-MiniLM-L6-v2
```
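
For intuition about the lexical cosine fallback mentioned above, one common formulation is cosine similarity over bag-of-words term counts. The sketch below assumes whitespace tokenization and raw counts. The name `lexical_cosine` is hypothetical, and the package's actual fallback may tokenize or weight terms differently.

```python
import math
from collections import Counter


def lexical_cosine(a: str, b: str) -> float:
    """Cosine similarity between term-count vectors of two texts.

    Illustrative only: lowercases and splits on whitespace, with no
    stemming, stop-word removal, or TF-IDF weighting.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    # Dot product over tokens in `a`; Counter returns 0 for missing tokens.
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(lexical_cosine("what is the capital of France",
                         "the capital of France is Paris"))
```

A lexical measure like this only rewards shared surface tokens, so paraphrased duplicates can score low. That gap is what the semantic `sentence-transformers` backend is meant to cover.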

## What This Is Not

This tool is not proof that data is uncontaminated; it is a code-only diagnostic. The bundled examples are synthetic.
