Metadata-Version: 2.4
Name: judicator
Version: 0.2.3
Summary: Judging LLM-as-a-Judge — a screening tool for bias and miscalibration.
Project-URL: Homepage, https://github.com/ankurpand3y/judicator
Project-URL: Repository, https://github.com/ankurpand3y/judicator
Project-URL: Issues, https://github.com/ankurpand3y/judicator/issues
License: Apache-2.0
Keywords: audit,bias,evaluation,judge,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: numpy
Description-Content-Type: text/markdown

<p align="center">
  <img src="judicator_logo.svg" alt="Judicator" width="640">
</p>

<h3 align="center">Judging LLM-as-a-Judge</h3>

<p align="center">An LLM-as-a-Judge screening tool for bias and miscalibration.</p>

<p align="center">
  <a href="https://pypi.org/project/judicator/"><img src="https://badge.fury.io/py/judicator.svg" alt="PyPI version"></a>
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-blue.svg" alt="License"></a>
  <a href="https://www.python.org/"><img src="https://img.shields.io/badge/python-3.10%2B-blue.svg" alt="Python"></a>
</p>

---

## Install

```bash
pip install judicator
```

> **Windows note:** the report uses Unicode box-drawing characters. If
> `print(report.summary())` raises a `UnicodeEncodeError`, set the
> `PYTHONUTF8=1` environment variable before launching Python, for example:
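```bash
# Linux/macOS (one-off; UTF-8 is usually already the default):
PYTHONUTF8=1 python my_audit.py

# Windows cmd (sets it for the current session):
set PYTHONUTF8=1
python my_audit.py

# Windows PowerShell:
$env:PYTHONUTF8 = "1"
python my_audit.py
```

(`my_audit.py` stands in for whatever script runs your audit.)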

---

## Quickstart

```python
import openai
from judicator import Judge, JudgeAuditor

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = "You are an expert evaluator. Score responses objectively."
eval_template = "Question: {question}\nResponse: {response}\nScore 1-10."

def my_judge_call(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content

judge = Judge(
    llm_fn=my_judge_call,
    system_prompt=system_prompt,
    eval_template=eval_template,
    judge_name="my_first_judge"
)

# Shows cost estimate and prompts Y/n. Pass confirm=False to skip.
# max_workers=20 runs API calls in parallel — typically 10–15× faster.
report = JudgeAuditor(
    judge=judge,
    domain="qa",
    cost_per_call=0.0003,
    max_workers=20,
).audit()
print(report.summary())
report.save_json("my_audit.json")
```

---

## Speed (`max_workers`)

A full audit makes ~1,000 LLM calls (at the quickstart's `cost_per_call=0.0003`,
roughly $0.30 total). Run sequentially, that takes 20–25 minutes.
Set `max_workers` to run calls in parallel via a thread pool:

```python
JudgeAuditor(judge=judge, domain="qa", max_workers=20).audit()
```

| `max_workers` | Wall time (~1k calls) | Speedup |
|---|---|---|
| 1 (default) | 20–25 min | 1× |
| 10 | 2.5 min | 8× |
| **20** | **1.5 min** | **13×** |
| 50+ | diminishing returns; rate-limit risk | — |

**Caveats**

- **Rate limits.** Cost is unchanged, but the request rate is much higher. Lower `max_workers` if you see 429 errors; there is no auto-backoff built in (see the wrapper sketch after this list).
- **Thread-safe `llm_fn` required.** Stateless calls are safe (OpenAI/Anthropic/OpenRouter clients are thread-safe). Don't share conversation state across calls.
- **Parallelism is per-test.** Within a single bias test, fixture items run concurrently; tests still execute one after another.
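
If lowering `max_workers` isn't enough, one option is to add retries yourself
before handing the function to `Judge`. A minimal sketch (`with_backoff` is a
hypothetical user-side helper, not part of Judicator):

```python
import time

def with_backoff(llm_fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry llm_fn with exponential backoff.

    Hypothetical helper, not part of Judicator. In real use, narrow the
    `except` to your client's rate-limit error (e.g. openai.RateLimitError).
    """
    def wrapped(prompt: str) -> str:
        delay = base_delay
        for attempt in range(max_retries):
            try:
                return llm_fn(prompt)
            except Exception:
                if attempt == max_retries - 1:
                    raise  # out of retries; surface the error
                time.sleep(delay)
                delay *= 2  # 1s, 2s, 4s, ...
        raise RuntimeError("unreachable")
    return wrapped

# Usage: judge = Judge(llm_fn=with_backoff(my_judge_call), ...)
```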

---

## What it tests

| Bias | What it catches | Applies to |
|---|---|---|
| **position** | Judge picks slot A/B regardless of content | pairwise |
| **verbosity** | Judge inflates scores for longer responses | all types |
| **self_consistency** | Judge gives different scores to the same input | pointwise, binary |
| **scale_anchoring** | Judge compresses all scores into a narrow band | pointwise |
| **authority** | Judge inflates scores for fake credentials | all types |
| **concreteness** | Judge prefers fabricated specifics over accurate vague answers | pointwise, pairwise |
| **yes_bias** | Binary judge over-approves false statements | binary |

---

## Supported judge types

| Type | Template shape | Detected by |
|---|---|---|
| **pointwise** | `{question}` + `{response}` → numeric score | `{response}` placeholder |
| **pairwise** | `{question}` + `{response_a}` + `{response_b}` → A or B | `{response_a}` and `{response_b}` |
| **binary** | `{statement}` → Yes or No | yes/no keyword in template |

Judge type is auto-detected from your `eval_template`. Override with
`judge_type="pointwise"` if detection fails.
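
For illustration, one template of each shape (the wording is yours to choose;
only the placeholders and the yes/no keyword drive detection):

```python
# Detected as pointwise: has {response} (plus {question}).
pointwise_template = "Question: {question}\nResponse: {response}\nScore 1-10."

# Detected as pairwise: has {response_a} and {response_b}.
pairwise_template = (
    "Question: {question}\n"
    "Response A: {response_a}\n"
    "Response B: {response_b}\n"
    "Which response is better, A or B?"
)

# Detected as binary: has {statement} and a yes/no keyword.
binary_template = "Statement: {statement}\nIs this true? Answer Yes or No."

# If detection guesses wrong, override it (type names follow the table above):
# judge = Judge(llm_fn=my_fn, eval_template=binary_template,
#               judge_type="binary", ...)
```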

---

## Works with any LLM

Judicator never touches your API keys or model configuration.
You wrap your LLM call in a function — Judicator calls that function.

> **Stateless calls required.** Each call to `llm_fn` must be independent with no shared
> conversation context between calls. Judicator calls it multiple times per fixture item —
> if your judge accumulates history across calls, bias measurements will be invalid.
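
To make that concrete, here is a hypothetical anti-pattern next to the correct
shape (`call_model` is a stand-in for any real client call):

```python
def call_model(messages: list[dict]) -> str:
    return "7"  # stand-in for a real LLM call

history: list[dict] = []

def stateful_fn(prompt: str) -> str:
    # DON'T: every audit call sees all previous fixture items.
    history.append({"role": "user", "content": prompt})
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply

def stateless_fn(prompt: str) -> str:
    # DO: build the message list fresh on every call.
    return call_model([{"role": "user", "content": prompt}])
```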

**OpenAI**
```python
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
    ).choices[0].message.content
```

**Anthropic**
```python
import anthropic
client = anthropic.Anthropic()

def my_fn(prompt: str) -> str:
    return client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=system_prompt,
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text
```

**OpenRouter** (access 200+ models with one API key)
```python
import os
import openai

client = openai.OpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],
    base_url="https://openrouter.ai/api/v1",
)

def my_fn(prompt: str) -> str:
    return client.chat.completions.create(
        model="meta-llama/llama-3.2-3b-instruct",
        max_tokens=256,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}
        ]
    ).choices[0].message.content
```

**Ollama (local)**
```python
import ollama

def my_fn(prompt: str) -> str:
    return ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}]
    )["message"]["content"]
```

Pass any of these as `llm_fn` to `Judge`. Judicator works identically with all four.

---

## Understanding the report

```
╔══════════════════════════════════════════════════════════════╗
║  JUDICATOR — AUDIT REPORT                                    ║
╠══════════════════════════════════════════════════════════════╣
║  Judge:   my_qa_judge   Domain:  qa   Type:  pointwise       ║
╠════════════════════╦═══════╦═══════╦══════════╦══════════════╣
║   BIAS TEST        ║  SCORE║  RANK ║  VERDICT ║  SEVERITY    ║
╠════════════════════╬═══════╬═══════╬══════════╬══════════════╣
║   scale_anchoring  ║  0.312║  1/5  ║  FAIL    ║  CRITICAL    ║
║   verbosity        ║  0.620║  2/5  ║  FAIL    ║  SIGNIFICANT ║
║   concreteness     ║  0.714║  3/5  ║  PASS    ║  MINOR       ║
║   authority        ║  0.810║  4/5  ║  PASS    ║  NONE        ║
║   self_consistency ║  0.950║  5/5  ║  PASS    ║  NONE        ║
╚══════════════════════════════════════════════════════════════╝
```

**Score:** 0–1. Higher = more calibrated. No composite score — each test is independent.

**Rank:** 1 = worst bias. Address rank 1 first.

**Severity bands:**
- `CRITICAL` (< 0.50): strong bias, investigate immediately
- `SIGNIFICANT` (0.50–0.65): meaningful bias, likely affects production quality
- `MINOR` (0.65–0.80): borderline — PASS if ≥ 0.70, FAIL otherwise
- `NONE` (≥ 0.80): no detectable bias

**N/A** results mean the test does not apply to your judge type or domain,
not that the judge passed the test.
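
As a sanity check, the documented thresholds in code (Judicator computes these
internally; this just restates the bands above):

```python
def severity(score: float) -> str:
    """Map a per-test score (0-1, higher = more calibrated) to its band."""
    if score < 0.50:
        return "CRITICAL"
    if score < 0.65:
        return "SIGNIFICANT"
    if score < 0.80:
        return "MINOR"
    return "NONE"

def verdict(score: float) -> str:
    """PASS/FAIL cutoff is 0.70, per the MINOR band note."""
    return "PASS" if score >= 0.70 else "FAIL"

# Matches the sample report rows:
assert severity(0.312) == "CRITICAL" and verdict(0.312) == "FAIL"  # scale_anchoring
assert severity(0.714) == "MINOR" and verdict(0.714) == "PASS"     # concreteness
```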

---

## What v0.2 does NOT cover

- Composite scoring / single overall grade
- Sycophancy, compassion fade, bandwagon, sentiment, fallacy oversight, and
  other biases beyond the 7 tested
- Listwise, reference-based, CoT, or multi-turn judge types
- Translation, medical, legal, financial, or creative writing domains
- Position pairs for summarization, safety, or dialogue (qa + code only)
- Custom bias tests or BYO-data mode
- Token-aware cost estimation (currently flat-per-call)
- GitHub Actions integration or SaaS dashboard

---

## Coming in future versions

- **Stronger statistical power** — expanded fixture sets; concreteness currently runs on n=14 items (a coarse signal)
- **Domain coverage expansion** — position pairs for summarization, safety, and dialogue
- **User-provided data** — BYO-data mode to run bias tests on your own examples
- **Labeling sheet output** — export structured sheets for human annotation workflows

---

## Citation

If you use Judicator in your research, please cite:

```bibtex
@software{judicator2026,
  author = {Pandey, Ankur},
  title  = {Judicator: An LLM-as-a-Judge Bias Auditing Library},
  year   = {2026},
  url    = {https://github.com/ankurpand3y/judicator},
  version = {0.2.3}
}
```

---

## Built on

Judicator ships with fixtures derived from the following datasets.
All are used in accordance with their licenses.

| Dataset | Paper | License |
|---|---|---|
| [OffsetBias](https://github.com/ncsoft/offsetbias) | Park et al. 2024 | Apache 2.0 |
| [JudgeBench](https://github.com/ScalerLab/JudgeBench) | Tan et al. 2024 | MIT |
| [MT-Bench](https://github.com/lm-sys/FastChat) | Zheng et al. 2023 | Apache 2.0 |
| [BeaverTails](https://github.com/PKU-Alignment/beavertails) | Ji et al. 2023 | CC-BY-NC-4.0 |
| [SummEval](https://huggingface.co/datasets/mteb/summeval) | Fabbri et al. 2021 | MIT |
| [DSTC11-Track4](https://huggingface.co/datasets/mario-rc/dstc11.t4) | Rodriguez-Cantelar et al. 2023 | Apache 2.0 |

See [ATTRIBUTION.md](ATTRIBUTION.md) for full item counts.
