Metadata-Version: 2.4
Name: yuragi
Version: 0.5.5
Summary: LLM Confidence Fragility Analyzer — Measure how fragile your AI's confidence really is
Project-URL: Homepage, https://github.com/hinanohart/yuragi
Project-URL: Documentation, https://hinanohart.github.io/yuragi
Project-URL: Repository, https://github.com/hinanohart/yuragi
Project-URL: Issues, https://github.com/hinanohart/yuragi/issues
Project-URL: Changelog, https://github.com/hinanohart/yuragi/blob/main/CHANGELOG.md
Author: hinanohart
License-Expression: MIT
License-File: LICENSE
Keywords: ai-safety,confidence,confidence-calibration,evaluation,explainability,fragility,hallucination-detection,llm,llm-evaluation,model-testing,neural-network,nlp,perturbation-testing,prompt-engineering,robustness,stress-testing,uncertainty-quantification
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: click<9,>=8.0.0
Requires-Dist: litellm<2,>=1.83.7
Requires-Dist: rich<16,>=13.0.0
Provides-Extra: all
Requires-Dist: datasets>=2.14; extra == 'all'
Requires-Dist: matplotlib>=3.7.0; extra == 'all'
Requires-Dist: numpy>=1.24.0; extra == 'all'
Requires-Dist: plotly>=5.15.0; extra == 'all'
Requires-Dist: scipy>=1.10.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.2; extra == 'all'
Provides-Extra: benchmarks
Requires-Dist: datasets>=2.14; extra == 'benchmarks'
Provides-Extra: dev
Requires-Dist: bandit>=1.7.0; extra == 'dev'
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: scikit-learn>=1.3; extra == 'dev'
Requires-Dist: scipy>=1.10; extra == 'dev'
Requires-Dist: statsmodels>=0.14; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24; extra == 'docs'
Provides-Extra: guardrails
Provides-Extra: guardrails-autogen
Requires-Dist: pyautogen>=0.2; extra == 'guardrails-autogen'
Provides-Extra: guardrails-langgraph
Requires-Dist: langgraph>=0.0.30; extra == 'guardrails-langgraph'
Provides-Extra: guardrails-nats
Requires-Dist: nats-py>=2.7; extra == 'guardrails-nats'
Provides-Extra: semantic
Requires-Dist: numpy>=1.24.0; extra == 'semantic'
Requires-Dist: scipy>=1.10.0; extra == 'semantic'
Requires-Dist: sentence-transformers>=2.2; extra == 'semantic'
Provides-Extra: stats
Requires-Dist: numpy>=1.24.0; extra == 'stats'
Requires-Dist: scipy>=1.10.0; extra == 'stats'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7.0; extra == 'viz'
Requires-Dist: numpy>=1.24.0; extra == 'viz'
Requires-Dist: plotly>=5.15.0; extra == 'viz'
Description-Content-Type: text/markdown

# yuragi — A measurement harness establishing that perturbation-derived features do not add signal over logprob baseline

[![PyPI version](https://img.shields.io/pypi/v/yuragi)](https://pypi.org/project/yuragi/)
[![PyPI downloads](https://img.shields.io/pypi/dm/yuragi)](https://pypi.org/project/yuragi/)
[![CI](https://github.com/hinanohart/yuragi/actions/workflows/ci.yml/badge.svg)](https://github.com/hinanohart/yuragi/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)

## Instant Demo

No API key needed:

```bash
pip install yuragi
yuragi demo
```

<p align="center"><img src="docs/demo_en.svg" alt="yuragi demo output" width="700"></p>

---

## What It Does

yuragi measures **confidence fragility**: how much a model's certainty shifts when you rephrase the same question. It generates 13 perturbation variants of your prompt (typos, tone changes, paraphrases, authority framing), calls your model, and compares the confidence across responses. When the answer text stays the same but confidence moves, that's fragility — a property of the prompt wording, not the model's knowledge.
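
The comparison described above can be sketched in a few lines. This is an illustration only, not yuragi's scoring code: the `Response` container, the word-set Jaccard check, the `noise` threshold of 0.02, and the mean-absolute-shift summary are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model reply to a prompt variant (hypothetical container)."""
    text: str
    confidence: float  # in [0, 1]


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two answer strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def summarize(baseline: Response, variants: list[Response], noise: float = 0.02):
    """Return (mean absolute confidence shift, share of dissociated variants).

    A variant counts as dissociated when its answer text matches the baseline
    (Jaccard == 1.0) but its confidence moved by more than `noise`.
    """
    shifts = [abs(v.confidence - baseline.confidence) for v in variants]
    dissociated = [
        s > noise and jaccard(v.text, baseline.text) == 1.0
        for v, s in zip(variants, shifts)
    ]
    return sum(shifts) / len(shifts), sum(dissociated) / len(variants)


baseline = Response("Paris", 0.90)
variants = [
    Response("Paris", 0.90),  # same answer, same confidence
    Response("paris", 0.70),  # same answer, confidence dropped
    Response("Lyon", 0.60),   # answer changed
]
mean_shift, dissociation = summarize(baseline, variants)
print(round(mean_shift, 3), round(dissociation, 3))
```

Only the second variant counts as dissociated here: its answer is unchanged while its confidence moved, which is the pattern the paragraph above calls fragility.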

Since **v0.5.0**, yuragi also ships `yuragi.guardrails`, a confidence-aware multi-agent runtime with append-only audit logging, Git-like state snapshots, and AutoGen / LangGraph integrations. The confidence signal it relies on is equivalent to the Kadavath 2022 logprob-entropy baseline, not a novel UQ signal. See the [Guardrails](#guardrails-v050) section below.

---

## Install

```bash
pip install yuragi
```

Optional extras:

```bash
pip install yuragi[viz]                  # heatmap / reliability diagram output
pip install yuragi[semantic]             # sentence-transformers for semantic entropy
pip install yuragi[stats]                # numpy/scipy for statistical tests
pip install yuragi[guardrails]           # confidence-aware LLM guardrails (stdlib only)
pip install yuragi[guardrails-autogen]   # AutoGen integration
pip install yuragi[guardrails-langgraph] # LangGraph integration
pip install yuragi[guardrails-nats]      # NATS JetStream distributed transport
pip install yuragi[all]                  # everything
```

Supports any [litellm](https://docs.litellm.ai/docs/providers)-compatible provider — OpenAI, Anthropic, Google, local Ollama, and 100+ others.

---

## Python API

```python
from yuragi import Scanner

result = Scanner(model="cerebras/llama-3.1-8b-instruct").scan("Is quantum computing practical?")
print(result.fragility_score)    # 0.056
print(result.dissociation_rate)  # 0.07 — answer same, confidence shifted
```

<details>
<summary>Psychology experiments / Trilayer / Semantic Entropy API</summary>

```python
from yuragi.experiments.registry import get_experiment
from yuragi.experiments.runner import run_experiment

result = run_experiment(get_experiment("asch"), model="ollama/llama3.2", num_samples=5)
print(result.avg_delta)        # average confidence change
print(result.effect_confirmed) # True if max_delta >= 0.15
```

```python
from yuragi.analysis.trilayer import measure_trilayer

result = measure_trilayer("Is AI dangerous?", model="ollama/llama3.2")
print(result.logprob_confidence)     # Layer 1: token probability
print(result.sampling_confidence)    # Layer 2: behavioral consistency
print(result.verbalized_confidence)  # Layer 3: self-reported
print(result.internal_conflict)      # True if discrepancy > 0.2
```

```python
from yuragi.metrics.semantic_entropy import semantic_entropy
h_sem = semantic_entropy(samples=["Paris", "It's Paris.", "The capital is Paris"])
```

</details>

---

## CLI Quickstart

```bash
# Scan a prompt for fragility
yuragi scan "Is quantum computing practical?" --model cerebras/llama-3.1-8b-instruct

# Find the single weakest word
yuragi find-weakness "Explain the theory of relativity" --model ollama/llama3.2

# Run a psychology stress test
yuragi experiment asch --model ollama/llama3.2
```

---

## Use Cases

**CI/CD regression detection** — catch fragility regressions before they reach production:

```bash
yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
```

**Fragility-aware routing** — route each prompt to the model that answers most stably:

```bash
yuragi route "What causes inflation?" --models gpt-4o-mini,ollama/llama3.2,cerebras/llama-3.1-8b
```

**Abstention guard** — refuse to answer unless fragility stays below the domain's safety threshold (medical: < 0.03, safety: < 0.02):

```bash
yuragi guard "What medication should I take?" --domain medical --model gpt-4o-mini
```
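
The decision rule behind an abstention guard can be sketched as a simple threshold check. The two thresholds are the ones quoted above; the dictionary layout and the `should_answer` helper are hypothetical and are not part of yuragi's API.

```python
# Thresholds quoted in the README text; the names below are hypothetical.
DOMAIN_THRESHOLDS = {"medical": 0.03, "safety": 0.02}


def should_answer(fragility_score: float, domain: str) -> bool:
    """Answer only when fragility is strictly below the domain threshold."""
    try:
        threshold = DOMAIN_THRESHOLDS[domain]
    except KeyError:
        raise ValueError(f"unknown domain: {domain!r}") from None
    return fragility_score < threshold


print(should_answer(0.056, "medical"))  # fragility above 0.03, so abstain
print(should_answer(0.010, "safety"))   # fragility below 0.02, so answer
```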

**Model selection** — find the best model for your use case by fragility profile:

```bash
yuragi recommend --use-case factual --models gpt-4o-mini,ollama/llama3.2 --budget medium
```

**Automated red teaming** — discover model weaknesses across all 13 perturbation types:

```bash
yuragi red-team prompts.txt --model gpt-4o-mini --output report.json
```

---

## Status & Research Findings

yuragi is primarily a **measurement and stress-testing library** for confidence-under-perturbation. Earlier versions framed ensemble AUC numbers as a hallucination *detector*; after a multi-round internal audit (permutation + BH-FDR + length-residualization + split-conformal finite-sample CIs), **most headline predictive claims did not survive multiple-testing correction or independent replication**. The measurement instruments below are stable and usable; the specific predictive findings are exploratory and should be treated as hypotheses pending external replication at n≥400 per model with an independent hold-out.

Real-data empirical results on llama-3.1-8B-Instruct (Cerebras + NVIDIA NIM endpoints) and Pythia-410m, April 2026.

### ✅ Findings that survived multiple-testing correction (2)

1. **Decisive null: the 13 perturbations HURT predictive performance over `baseline_confidence`.** 54 tests across 3 subsets × 2 label sources × 9 fragility features, all |z|<2; BH-FDR q=0.05 survivor count 0/60. Paired bootstrap (Round 5-W, preregistration 566823a) further shows the 13-perturbation feature set actively reduces AUC: LogReg Δ=−0.0288 CI [−0.052, −0.006] p=0.016; GB Δ=−0.0470 CI [−0.080, −0.012] p=0.007; CatBoost Δ=−0.0409 CI [−0.081, −0.000] p=0.046. All three CIs exclude zero on the negative side — *perturbation-derived features are not only uninformative but actively harmful*. Source: [`experiments/ablation_pivot_subset_robust_report.txt`](https://github.com/hinanohart/yuragi/blob/main/experiments/ablation_pivot_subset_robust_report.txt).
2. **TriviaQA / Pythia-410m, n=85**: `baseline_confidence` AUC with split-conformal 95% CI **[0.596, 0.783]**, not length-confounded (length-residualized Δ < 0.01). Single formal survivor of our conformal-prediction sweep. **Caveat**: `baseline_confidence = 1 − H/log(K)` is mathematically identical to the Kadavath 2022 / Farquhar 2024 logprob-entropy baseline; this result is a replication of prior art, not an independent yuragi contribution. Source: [`experiments/uq_ensemble_sweep/verdict.md`](https://github.com/hinanohart/yuragi/blob/main/experiments/uq_ensemble_sweep/verdict.md).
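
The `baseline_confidence = 1 − H/log(K)` definition quoted in the caveat can be written out directly. The sketch below assumes the input is a list of top-K token log-probabilities and that they are renormalized to sum to 1 before the entropy is taken; both choices are assumptions for illustration, not a description of yuragi's internals.

```python
import math


def baseline_confidence(top_k_logprobs: list[float]) -> float:
    """1 - H / log(K), with H the Shannon entropy (nats) over the top-K tokens.

    Assumption: the K probabilities are renormalized to sum to 1, since the
    top-K mass returned by an API is usually slightly below 1.
    """
    k = len(top_k_logprobs)
    if k < 2:
        raise ValueError("need at least two candidates (log(1) = 0)")
    probs = [math.exp(lp) for lp in top_k_logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(k)


# Uniform top-4 distribution: entropy is maximal, confidence is ~0.
uniform = baseline_confidence([math.log(0.25)] * 4)
# One dominant token: entropy is low, confidence is close to 1.
peaked = baseline_confidence([math.log(p) for p in (0.97, 0.01, 0.01, 0.01)])
print(f"{uniform:.3f} {peaked:.3f}")
```

Dividing by `log(K)` normalizes the entropy to [0, 1], so the score is 0 for a uniform distribution over the K candidates and approaches 1 as the mass concentrates on one token.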

### ⚠️ Exploratory — do not rely on these as validated claims

- **Ensemble AUC 0.73 on TruthfulQA n=412 / 105 features.** Wilson CI [0.678, 0.779] does not cross 0.5, but the finding was not pre-registered, there is no independent hold-out, and a scaled-down replication (n=100, 4-feature ensemble across 3 independent datasets) produced out-of-fold (OOF) AUCs **below** the best single feature in all 3, with BH-FDR q=0.05 yielding a single survivor that is the single feature `baseline_confidence` (`bc`), not the ensemble. Adding the 13 perturbation features over a no-perturbation baseline produces Δ=−0.027, 95% CI [−0.085, +0.035], p=0.35, which is not statistically significant. On `is_correct` (flexible-judge) labels, perturbations in fact **reduce** AUC significantly (Cerebras n=382: Δ=−0.05 p=0.012). Treat as a hypothesis pending n≥400 replication with an independent hold-out. Sources: [`experiments/ensemble_final.txt`](https://github.com/hinanohart/yuragi/blob/main/experiments/ensemble_final.txt), [`experiments/ablation_delta_significance_report.txt`](https://github.com/hinanohart/yuragi/blob/main/experiments/ablation_delta_significance_report.txt).
- **Confidence sign-inversion on 8B.** Higher self-reported confidence correlating with *higher* hallucination probability (TriviaQA n=200 raw AUC 0.252 → inverted 0.748). Length-residualized AUC falls to 0.612. Single provider family, single hardware pair, no cross-family replication at n≥400.
- **Multi-judge majority n=200 AUC 0.635** — subset-artifact: on the rest-212 complement the same method drops to 0.502 (chance), permutation p=0.002. Retracted as a standalone claim.
- **Single-signal solo AUC across 6 evaluation settings** (datasets: TruthfulQA, TriviaQA, NQ-Open; providers: NIM, Cohere, Mistral) is ~0.50 for `fragility_score`; the earlier "0.62 noise floor" claim is retracted.
- **Fragility scaling trend F(N)=a/√N+b (R²=0.987)** — 5-model curve fit, no multiple-testing correction, no hold-out, exploratory.
- **Activation-patching L12–L13 "double dissociation"** (paper Contribution 2) — rescued with an L23-control experiment at n=10 on Pythia-410m; an independent causal-tracing sweep on the same prompts places the per-prompt peak layer across L2–L17 (mean 7.5, std 4.3; only 1/10 near L12–L13). Patching-level dissociation stands; the stronger "the circuit lives at L12–L13" reading does not survive an independent method at this n.
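
Several items above report a ΔAUC with a bootstrap confidence interval. A minimal, dependency-free sketch of a paired bootstrap is shown below. It is a generic percentile bootstrap on synthetic data, written under the assumption that both feature sets are scored on the same examples; it is not the exact procedure used in the audit reports, and all names are illustrative.

```python
import random


def auc(scores: list[float], labels: list[int]) -> float:
    """ROC AUC: probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_delta(base, extended, labels, n_boot=400, seed=0):
    """Point estimate and percentile 95% CI for AUC(extended) - AUC(base).

    Both scores are evaluated on the same resampled indices, which is what
    makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample lacks one class; draw again
            continue
        deltas.append(
            auc([extended[i] for i in idx], ys) - auc([base[i] for i in idx], ys)
        )
    deltas.sort()
    lo = deltas[int(0.025 * n_boot)]
    hi = deltas[int(0.975 * n_boot) - 1]
    return auc(extended, labels) - auc(base, labels), (lo, hi)


# Synthetic data: an informative base score, and the same score plus noise.
rng = random.Random(1)
labels = [i % 2 for i in range(120)]
base = [y + rng.gauss(0, 1) for y in labels]
extended = [s + rng.gauss(0, 2) for s in base]
point, (lo, hi) = paired_bootstrap_delta(base, extended, labels)
print(f"delta AUC = {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
```

A CI that lies entirely below zero is the pattern reported for the 13-perturbation feature set; a CI that straddles zero corresponds to the "not statistically significant" case.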

### 🧪 Measurement reliability (what the library is good at today)

Test–retest Pearson correlation on paired scans (same prompt, different seed):

| Signal | r | Recommendation |
|:-------|:--|:---------------|
| baseline_confidence | 0.88 | ✓ Primary |
| paraphrase_fragility | 0.80 | ✓ Primary |
| adaptive_fragility | 0.78 | ✓ Primary |
| impostor_fragility | 0.70 | ○ Supporting |
| fragility_score (aggregate) | 0.64 | ○ Supporting |
| **counterfactual_fragility** | **0.18** | **✗ Noise-dominated, do not use** |

The perturbation-and-confidence measurement suite is stable. Whether fragility *predicts* hallucination is open and strongly dataset-dependent: it works on single-path factoids ("Who discovered argon?" → AUC ~0.75) and fails on imitative-falsehood benchmarks ("What happens if you break a mirror?" → AUC ~0.50). See [`paper/domain_boundary_section.md`](https://github.com/hinanohart/yuragi/blob/main/paper/domain_boundary_section.md).

**Supporting observation:** when answer text is identical (Jaccard=1.0), max confidence shift is 0.021 (below the noise floor); when text differs, confidence shifts up to 0.528. This is evidence that verbalised confidence tracks surface text more than underlying knowledge. See [`RESEARCH.md`](https://github.com/hinanohart/yuragi/blob/main/RESEARCH.md).

### 📉 Methodological caveats

- 20+ post-hoc audits on the same dataset ⇒ p-hacking risk; multiple-testing correction was not applied across audits.
- Most results come from a single base model (Cerebras Llama-3.1-8B); the n=9 frontier-model pilots are underpowered.
- No pre-registration, no independent hold-out, no 3-seed replication.
- Length bias (Spearman ρ=+0.35, longer answers graded more leniently) is partially entangled with `fragility_score` (Δ=+0.022 after length-residualization); prefer length-residualized AUC.
- Public benchmarks with higher AUC exist (SSP ~0.786 output-side, LSD ~0.96 activation-side). Output-level methods like yuragi are bounded by the mutual information `I(correct; h_internal)` accessible from the output surface.
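
Length-residualization, recommended in the caveats above, can be sketched as a one-variable least-squares fit: regress the score on answer length and keep the residual. The helper below is a generic illustration under that assumption; the audit reports may use a different model, and the function name is hypothetical.

```python
def residualize(scores: list[float], lengths: list[float]) -> list[float]:
    """Remove the linear effect of answer length from a score (simple OLS)."""
    n = len(scores)
    mx, my = sum(lengths) / n, sum(scores) / n
    sxx = sum((x - mx) ** 2 for x in lengths)
    if sxx == 0:
        return [y - my for y in scores]  # constant length: only center
    slope = sum((x - mx) * (y - my) for x, y in zip(lengths, scores)) / sxx
    return [y - (my + slope * (x - mx)) for x, y in zip(lengths, scores)]


lengths = [10, 20, 30, 40, 50]
driven_by_length = [0.1 + 0.01 * x for x in lengths]
residuals = residualize(driven_by_length, lengths)
print(max(abs(r) for r in residuals) < 1e-9)  # nothing is left once length is removed
```

Computing AUC on the residuals instead of the raw score shows how much of a signal's apparent predictive power was carried by answer length alone.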

See [`KNOWN_LIMITATIONS.md`](https://github.com/hinanohart/yuragi/blob/main/KNOWN_LIMITATIONS.md) and [`experiments/`](https://github.com/hinanohart/yuragi/tree/main/experiments/) for 20+ raw audit reports.

---

## Integration

**pandas — score a DataFrame of prompts:**

```python
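import pandas as pd
from yuragi import Scanner

df = pd.DataFrame({"prompt": ["Is quantum computing practical?", "What causes inflation?"]})

scanner = Scanner(model="gpt-4o-mini")
# One scan per row; each scan calls the model for every perturbation variant.
df["fragility"] = df["prompt"].apply(lambda p: scanner.scan(p).fragility_score)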
```

**pytest — assert stability in tests:**

```python
from yuragi import Scanner

def test_prompt_stability():
    result = Scanner(model="gpt-4o-mini").scan("What is the capital of France?")
    assert result.fragility_score < 0.05, f"Fragility too high: {result.fragility_score}"
```

**GitHub Actions — CI/CD fragility gate:**

```yaml
- name: Check fragility regression
  run: yuragi check prompts.txt --baseline baseline.json --model gpt-4o-mini
```

A [reusable GitHub Actions workflow](.github/workflows/yuragi-check.yml) is included.

---

## Guardrails (v0.5.0)

`yuragi.guardrails` is an opt-in subpackage that turns yuragi from a measurement library into a confidence-aware **LLM guardrail platform**. It is shipped inside the `yuragi` wheel — no extra install needed for the core — and adds zero runtime dependencies (only the standard library).

```python
import asyncio

from yuragi.guardrails import (
    AuditLog,
    ConfidencePolicy,
    ConfidenceReport,
    Runtime,
    PlannerAgent, ExecutorAgent, CriticAgent,
    ResearcherAgent, VerifierAgent,
)


async def main() -> None:
    # 1. Append-only audit log with SHA-256 hash chain
    log = AuditLog("./audit.db")

    # 2. A multi-agent mesh with confidence-aware routing
    async with Runtime(audit_log=log) as rt:
        await rt.spawn(PlannerAgent, name="planner")
        await rt.spawn(ExecutorAgent, name="executor")
        await rt.spawn(CriticAgent, name="critic", policy=ConfidencePolicy(tau=0.85))
        await rt.spawn(ResearcherAgent, name="researcher")
        await rt.spawn(VerifierAgent, name="verifier")
        await rt.publish("planner", {"task": "summarise quantum tunneling", "complexity": 6})

    # 3. Verify nobody tampered with the audit trail later
    assert await log.verify_chain()


# `async with` and `await` need a running event loop outside a REPL.
asyncio.run(main())
```
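
For readers unfamiliar with hash-chained logs, the idea behind a tamper-evident audit trail can be sketched with the standard library alone. This is a generic illustration of the technique, not the `AuditLog` implementation: the entry layout, the `GENESIS` value, and the `append` / `verify` helpers are assumptions for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def append(chain: list[dict], payload: dict) -> None:
    """Append an entry whose hash covers the payload and the previous hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
    chain.append({"prev": prev, "payload": payload, "hash": digest})


def verify(chain: list[dict]) -> bool:
    """Recompute every hash; an edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        body = json.dumps(entry["payload"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode("utf-8")).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True


log: list[dict] = []
append(log, {"agent": "planner", "event": "task_received"})
append(log, {"agent": "critic", "event": "approved", "confidence": 0.91})
print(verify(log))                     # True: chain is intact
log[0]["payload"]["event"] = "edited"  # tamper with history
print(verify(log))                     # False: stored hash no longer matches
```

Because each hash depends on the one before it, changing any past entry invalidates every later entry, which is what makes after-the-fact edits detectable.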

**Differentiators against existing OSS guardrails**:

| Feature | NeMo Guardrails | Guardrails AI | Llama Guard | LangKit | **yuragi.guardrails** |
|---|---|---|---|---|---|
| Confidence-aware routing | – | – | – | – | **fused 4-signal score** |
| Tamper-evident audit log | – | – | – | – | **SHA-256 hash chain** |
| Crash-resume snapshots | – | – | – | – | **Merkle DAG, ≤ 1 s target** |
| Public benchmarks | – | – | partial | – | exploratory (see Status & Research §) |

**Framework integrations** (each behind an extras gate so the core stays light):

```python
from yuragi.guardrails.integrations.autogen import AutoGenGuardrail   # pip install yuragi[guardrails-autogen]
from yuragi.guardrails.integrations.langgraph import guardrail_node   # pip install yuragi[guardrails-langgraph]
```

The runtime ships with `InMemoryTransport` by default; for distributed deployments install `yuragi[guardrails-nats]` and pass `NatsTransport(...)` instead. NATS support is **experimental** in v0.5.0 — see `KNOWN_LIMITATIONS.md` (G1–G3) before relying on it in production.

A complete demo lives at [`examples/guardrails_smoke.py`](examples/guardrails_smoke.py).

---

## Full CLI Reference

<details>
<summary>All 19 commands</summary>

| Command | Description |
|---------|-------------|
| `demo` | Run pre-computed demo (no API key needed) |
| `scan` | Full fragility scan (13 perturbation types) |
| `find-weakness` | Find the single word that most collapses confidence |
| `experiment` | Run a psychology template (11 types) |
| `compare-models` | Multi-model fragility comparison with heatmap |
| `check` | CI/CD fragility regression detection |
| `route` | Fragility-aware multi-model routing |
| `guard` | Abstention system for high-stakes domains |
| `recommend` | Model selection based on fragility profiles |
| `red-team` | Automated vulnerability discovery |
| `trajectory` | Track confidence across a prompt sequence |
| `stats` | Statistical analysis (Cohen's d, Wilcoxon, bootstrap CI) |
| `trilayer` | Measure confidence via 3 simultaneous methods |
| `profile` | Fragility profile: CCI / RE / NLS |
| `linguistic` | Analyze linguistic confidence markers (hedges, assertiveness) |
| `volatility` | Financial-engineering metrics (VIX, Sharpe ratio) for confidence |
| `phase-map` | Map confidence phase transitions across parameter space |
| `compare` | Compare two scan results (A/B test) |
| `export` | Export scan results to CSV/JSON |

</details>

---

## Research

Key discoveries, empirical data, and scaling trends: [RESEARCH.md](https://github.com/hinanohart/yuragi/blob/main/RESEARCH.md)

White-box layer entropy experiments:

```bash
python experiments/whitebox_design.py --exp entropy_trajectory
python experiments/whitebox_design.py --exp critical_layer_heatmap
python experiments/whitebox_design.py --exp cpu  # lightweight demo
```

See also [docs/related_work.md](https://github.com/hinanohart/yuragi/blob/main/docs/related_work.md) for comparison with lm-polygraph, SelfCheckGPT, PromptBench, CCPS, SYCON-Bench, TRUTH DECAY, SycEval, and FRS.

---

## Papers

**ICML 2026 MI Workshop** (submission target 2026-05-07, negative-result pivot complete):

> "When the Baseline Is the Ceiling: 13 Perturbations Add Zero Signal Over Top-k Logprob Entropy"

arXiv submitted: 2026-04-19 (identifier TBD)

Source: [`paper/icml2026_mi/`](https://github.com/hinanohart/yuragi/tree/main/paper/icml2026_mi/). arXiv bundle: [`arxiv_submission.tar.gz`](https://github.com/hinanohart/yuragi/blob/main/paper/icml2026_mi/arxiv_submission.tar.gz). Categories: cs.LG (primary), cs.CL, cs.AI. Three confirmed negative findings (5-round audit): (1) all 13 perturbations hurt AUC relative to `baseline_confidence` alone; (2) the `baseline_confidence`-only (`bc`) pivot null was confirmed across 54 bootstrap samples; (3) a cross-dataset confirmatory replication reached AUC=0.782 with the baseline alone. The identity between `baseline_confidence` and the Kadavath 2022 / Farquhar 2024 logprob-entropy baseline was confirmed both analytically and empirically.

**EMNLP ARR 2026** (in preparation, target 2026-05-25):

> "Intent-Misalignment Hallucination: Perturbation-Driven Detection of Specification-Ignored LLM Generation"

Outline: [`paper/emnlp2026_intent/OUTLINE.md`](https://github.com/hinanohart/yuragi/blob/main/paper/emnlp2026_intent/OUTLINE.md). Introduces *intent-misalignment hallucination* — outputs that are syntactically correct and instruction-compliant yet ignore per-user project context — and proposes **context-stripping perturbation (CSP)** for detection. Seed dataset (30 tasks × 3 ecosystems) at [`seed_tasks.jsonl`](https://github.com/hinanohart/yuragi/blob/main/paper/emnlp2026_intent/seed_tasks.jsonl).

## Citation

```bibtex
@misc{yuragi2026,
  title  = {yuragi: Confidence Fragility in Neural Networks},
  author = {hinanohart},
  year   = {2026},
  url    = {https://github.com/hinanohart/yuragi}
}
```

## Contributing / License

Issues and PRs welcome. See [CONTRIBUTING.md](https://github.com/hinanohart/yuragi/blob/main/CONTRIBUTING.md).

Known limitations: [KNOWN_LIMITATIONS.md](https://github.com/hinanohart/yuragi/blob/main/KNOWN_LIMITATIONS.md). Raw benchmark data: [`docs/bench/real/`](https://github.com/hinanohart/yuragi/tree/main/docs/bench/real/). Theory and metric definitions: [`docs/theory.md`](https://github.com/hinanohart/yuragi/blob/main/docs/theory.md).

MIT License for human use.

### AI / ML training opt-out

This repository is **opted out of AI/ML training, fine-tuning, evaluation, and embedding generation**. See [ai.txt](./ai.txt). Using this work to train machine-learning models without separately negotiated written permission is explicitly disallowed. The MIT License covers human use and software redistribution; it does not grant a training data license.

## Verification (sigstore)

Releases from **v_next_** (released after 2026-05-16) include a sigstore keyless signature bundle
(`.sigstore` per artifact) attached to the GitHub Release.

### Verify a PyPI install

```bash
pip download <pkg-name>==<version> --no-deps -d ./verify
python -m sigstore verify identity \
    --cert-identity 'https://github.com/hinanohart/yuragi/.github/workflows/release.yml@refs/tags/v<version>' \
    --cert-oidc-issuer 'https://token.actions.githubusercontent.com' \
    ./verify/*.whl ./verify/*.tar.gz
```

The corresponding `.sigstore` bundles can be downloaded from the GitHub Release page.

### Historic releases (pre-2026-05-16)

Earlier releases were published without sigstore bundles. Re-installing those versions
provides no cryptographic provenance — pin to a current release if assurance matters.
