Metadata-Version: 2.4
Name: gavagai
Version: 0.1.2
Summary: Quantify translation indeterminacy between sparse autoencoder feature dictionaries (Quine × Mechanistic Interpretability).
Author: gavagai contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/hinanohart/gavagai
Project-URL: Repository, https://github.com/hinanohart/gavagai
Project-URL: Documentation, https://github.com/hinanohart/gavagai#readme
Project-URL: Issues, https://github.com/hinanohart/gavagai/issues
Keywords: interpretability,mechanistic-interpretability,sparse-autoencoder,sae,alignment,philosophy-of-mind,quine
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: numpy<3.0,>=1.26
Requires-Dist: scipy>=1.12
Provides-Extra: saelens
Requires-Dist: sae-lens<6.0,>=5.0; extra == "saelens"
Requires-Dist: torch<3.0,>=2.3; extra == "saelens"
Provides-Extra: holism
Requires-Dist: circuit-tracer>=0.1; extra == "holism"
Provides-Extra: behavior
Requires-Dist: torch<3.0,>=2.3; extra == "behavior"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: hypothesis>=6.100; extra == "dev"
Dynamic: license-file

# gavagai

> *“The very fact of the indeterminacy of translation is a finding about
> meaning, not a failure of method.”* — paraphrased after W. V. O. Quine,
> *Ontological Relativity* (1968)

[![PyPI](https://img.shields.io/pypi/v/gavagai.svg)](https://pypi.org/project/gavagai/)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)

**gavagai** quantifies *translation indeterminacy* between two Sparse
Autoencoder (SAE) feature dictionaries — how many empirically valid
feature-to-feature alignments exist, not which single alignment is "the
right one". It is a Mechanistic Interpretability tool grounded in
Quine's philosophy of language.

> 🇯🇵 A Japanese version of this README is available at [`docs/README.ja.md`](docs/README.ja.md).

## Why this exists

Cross-SAE alignment tools (Universal SAE, SPARC, Sparse Crosscoders) ask:
*what is the correct mapping between two SAEs' features?* Quine's *gavagai*
thought experiment suggests this question is **structurally
underdetermined**: the observational data fixes an *equivalence class of
translations*, not a single one. `gavagai` does not solve that
underdetermination — it measures it.

Concretely:

- Train two SAEs (differing in seed, model checkpoint, or layer) on
  aligned activations.
- Run `gavagai_score(sae_a, sae_b)`.
- Get a number in `[0, 1]`: **0 = deterministic alignment exists**;
  **1 = radical indeterminacy** (many empirically valid alignments).

The score drops into CI as a regression gate: refuse model pushes whose
indeterminacy with the baseline exceeds a threshold.

## Install

```bash
pip install gavagai
```

Optional extras:

```bash
pip install "gavagai[saelens]"    # SAELens SAE objects
pip install "gavagai[behavior]"   # downstream-KL behavior equivalence
pip install "gavagai[holism]"     # circuit-tracer integration (v0.2)
```

## Quick start

```python
import numpy as np
from gavagai import gavagai_score

# Decoder matrices: shape (n_features, d_model). gavagai also accepts
# SAELens SAE instances and {"W_dec": ndarray} dicts.
sae_a = np.random.default_rng(0).standard_normal((1024, 768))
sae_b = np.random.default_rng(1).standard_normal((1024, 768))

score = gavagai_score(sae_a, sae_b)
print(f"indeterminacy: {score:.4f}")

# With diagnostics
score, details = gavagai_score(sae_a, sae_b, return_details=True)
print(f"  candidates : {details.n_equivalent_translations}")
print(f"  95% CI     : [{details.ci_low:.4f}, {details.ci_high:.4f}]")
```

## CI gate (`gavagai-lint`)

The killer app. Drop it into your pre-push hook or GitHub Action:

```bash
gavagai-lint \
    --before sae_baseline.npz \
    --after  sae_after_abliteration.npz \
    --threshold 0.3
```

Exits with status `0` if the indeterminacy is below the threshold
(acceptable drift) and `1` otherwise. Designed for gating Hugging Face
uploads, abliteration patches, and post-training fine-tuning steps where the
model's feature semantics may silently rearrange.
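
The same pass/fail rule can be sketched in Python for pipelines that do not shell out to the CLI. This is a minimal illustration, not the `gavagai-lint` implementation: `run_gate` is a hypothetical helper, the scoring function is passed in as a parameter, and treating a score exactly equal to the threshold as a failure is one reading of "below threshold".

```python
from typing import Callable

import numpy as np


def run_gate(
    score_fn: Callable[[np.ndarray, np.ndarray], float],
    before: np.ndarray,
    after: np.ndarray,
    threshold: float,
) -> int:
    """Return 0 when the score is strictly below the threshold, else 1.

    Hypothetical helper: `score_fn` is any callable that maps two decoder
    matrices to a float in [0, 1].
    """
    score = float(score_fn(before, after))
    status = 0 if score < threshold else 1
    print(f"indeterminacy={score:.4f} threshold={threshold} exit={status}")
    return status


# Usage with a stand-in scorer that always reports 0.12.
w_before = np.zeros((4, 8))
w_after = np.zeros((4, 8))
status = run_gate(lambda a, b: 0.12, w_before, w_after, threshold=0.3)
```

In a real hook, `gavagai_score` would be passed as `score_fn` and the result handed to `sys.exit`.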

GitHub Action (composite):

```yaml
- uses: hinanohart/gavagai/.github/actions/gavagai@v0.1.0
  with:
    before: artifacts/baseline.npz
    after:  artifacts/candidate.npz
    threshold: 0.3
```

## Equivalence relations

The score is **relative to a choice of equivalence relation** — this is the
Ontological Relativity commitment, made explicit:

| `equivalence=`  | What "two features are equivalent" means                  | Needs                  |
|-----------------|-----------------------------------------------------------|------------------------|
| `"cosine"`      | decoder directions within ε cosine distance               | decoder matrices       |
| `"activation"`  | overlapping token-firing patterns (Jaccard ≥ 1−ε)         | `activations_*` arrays |
| `"behavior"`    | similar downstream KL when ablated (`1/(1+kl) ≥ 1−ε`)     | `ablation_kl` matrix   |

```python
score = gavagai_score(sae_a, sae_b, equivalence="cosine", epsilon=0.1)
```

Different relations yield different scores. **That is the point**: there is
no relation-independent "true" indeterminacy.

> Caveat (v0.1): the same `epsilon` is applied across all three relations
> despite their differing scales (cosine ∈ [−1,1], Jaccard ∈ [0,1], KL-derived
> ∈ (0,1]), so a given `epsilon` is not equally strict under each relation.
> Compare scores across relations qualitatively, not threshold for
> threshold. v0.1.x will add per-relation epsilon normalization.
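
The three predicates in the table can be sketched with NumPy as follows. This is an illustration under assumptions, not the library's code: the function names are hypothetical, "within ε cosine distance" is read as cosine similarity ≥ 1−ε, firing patterns are assumed to be boolean arrays of shape `(n_features, n_tokens)`, and `ablation_kl` is assumed to have shape `(n_features_a, n_features_b)`.

```python
import numpy as np


def cosine_adjacency(w_a, w_b, epsilon):
    """True where decoder directions have cosine similarity >= 1 - epsilon."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return (a @ b.T) >= 1.0 - epsilon


def jaccard_adjacency(fires_a, fires_b, epsilon):
    """True where token-firing sets have Jaccard index >= 1 - epsilon."""
    fa = fires_a.astype(np.int64)
    fb = fires_b.astype(np.int64)
    inter = fa @ fb.T
    union = fa.sum(axis=1)[:, None] + fb.sum(axis=1)[None, :] - inter
    # Pairs of features that never fire have an empty union; score them 0.
    jac = np.divide(
        inter, union, out=np.zeros(inter.shape, dtype=float), where=union > 0
    )
    return jac >= 1.0 - epsilon


def behavior_adjacency(ablation_kl, epsilon):
    """True where the KL-derived similarity 1 / (1 + kl) >= 1 - epsilon."""
    return 1.0 / (1.0 + np.asarray(ablation_kl, dtype=float)) >= 1.0 - epsilon


# Small hand-checkable inputs.
cos_adj = cosine_adjacency(np.eye(3), np.eye(3), epsilon=0.1)
fires_a = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=bool)
fires_b = np.array([[1, 1, 0, 0], [0, 1, 1, 1]], dtype=bool)
jac_adj = jaccard_adjacency(fires_a, fires_b, epsilon=0.4)
kl_adj = behavior_adjacency([[0.0, 1.0], [0.05, 3.0]], epsilon=0.1)
```

With `epsilon=0.1`, the behavior relation admits a pair only when its KL is at most about 0.11, while the cosine relation admits any pair with similarity of at least 0.9, which illustrates why a shared ε is not comparable across relations.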

## How it works

1. Extract decoder matrices `W_A`, `W_B`.
2. Compute similarity matrix `S` under the chosen relation.
3. Threshold by `ε` to get a candidate adjacency `A_ε`.
4. Count valid bipartite matchings of `A_ε` (DFS with backtracking, capped
   at `cap=1000`). Empty adjacency ⇒ cap (radical indeterminacy).
5. Compress matching count to `[0, 1]` via `1 − 1 / (1 + log(n))`.
6. Bootstrap a 95% CI over feature-row resamples.
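
Steps 4 and 5 can be sketched for small dictionaries as below. This is a minimal sketch under stated assumptions, not the library's implementation: the function names are hypothetical, a "valid matching" is taken to be an assignment of every row feature to a distinct column feature, `log` is taken to be the natural logarithm, and an adjacency that admits no such assignment is treated like an empty one.

```python
import math

import numpy as np


def count_matchings(adj, cap=1000):
    """Count row-to-distinct-column assignments by DFS, stopping at `cap`."""
    if not adj.any():
        return cap  # empty adjacency is reported as the cap (step 4)
    n_a, n_b = adj.shape
    neighbours = [np.flatnonzero(row).tolist() for row in adj]
    used = [False] * n_b
    count = 0

    def dfs(i):
        # Recursion depth equals the number of rows, so this sketch only
        # suits small matrices.
        nonlocal count
        if i == n_a:
            count += 1
            return
        for j in neighbours[i]:
            if count >= cap:
                return  # stop exploring once the cap is reached
            if not used[j]:
                used[j] = True
                dfs(i + 1)
                used[j] = False  # backtrack

    dfs(0)
    return min(count, cap)


def compress(n):
    """Map a matching count n >= 1 into [0, 1) via 1 - 1 / (1 + log n)."""
    return 1.0 - 1.0 / (1.0 + math.log(n))


def indeterminacy(adj, cap=1000):
    n = count_matchings(adj, cap)
    if n == 0:
        n = cap  # assumption: no valid matching is handled like an empty adjacency
    return compress(n)


unique = indeterminacy(np.eye(3, dtype=bool))        # exactly one matching
dense = count_matchings(np.ones((3, 3), dtype=bool))  # 3! = 6 matchings
capped = count_matchings(np.ones((7, 7), dtype=bool))  # 7! = 5040, capped
```

Under these assumptions a unique matching scores exactly 0, and a count at the default cap of 1000 scores about 0.874, so the upper end of the range is approached rather than reached.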

Step 4 is the Quinean heart: we never collapse the candidate space to a
single bijection.
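
Step 6 can be illustrated with a percentile bootstrap. The sketch below assumes that "feature-row resamples" means drawing decoder rows with replacement from each dictionary independently; `bootstrap_ci` and `mean_best_cosine` are hypothetical names, and the toy scorer merely stands in for the real score.

```python
import numpy as np


def bootstrap_ci(score_fn, w_a, w_b, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for score_fn over resampled feature rows."""
    rng = np.random.default_rng(seed)
    n_a, n_b = w_a.shape[0], w_b.shape[0]
    scores = np.empty(n_boot)
    for k in range(n_boot):
        # Resample feature rows with replacement, independently per dictionary.
        rows_a = rng.integers(0, n_a, size=n_a)
        rows_b = rng.integers(0, n_b, size=n_b)
        scores[k] = score_fn(w_a[rows_a], w_b[rows_b])
    low, high = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return float(low), float(high)


def mean_best_cosine(w_a, w_b):
    """Toy statistic: mean over A-features of the best cosine match in B."""
    a = w_a / np.linalg.norm(w_a, axis=1, keepdims=True)
    b = w_b / np.linalg.norm(w_b, axis=1, keepdims=True)
    return float((a @ b.T).max(axis=1).mean())


w_a = np.random.default_rng(0).standard_normal((32, 16))
w_b = np.random.default_rng(1).standard_normal((32, 16))
ci_low, ci_high = bootstrap_ci(mean_best_cosine, w_a, w_b)
```

Resampling with replacement duplicates some rows, which can inflate a matching count, so the interval is best read as a rough stability check.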

## Roadmap

| version | adds                                                                     |
|---------|--------------------------------------------------------------------------|
| v0.1.0  | scalar `gavagai_score`, CLI gate, 3 equivalence relations                |
| v0.1.x  | per-relation ε normalization, `coverage` diagnostic for sparse adjacency |
| v0.2.0  | holism propagator (Duhem-Quine) via `circuit-tracer`                     |
| v0.3.0  | ontological commitment detector (AlignSAE binding)                       |
| v1.0.0  | cross-paradigm translator (probe ↔ SAE ↔ patching)                       |

## Anti-goals

- Not an SAE *trainer*. Use [SAELens](https://github.com/jbloomAus/SAELens).
- Not a *circuit visualizer*. Use [circuit-tracer](https://github.com/decoderesearch/circuit-tracer).
- Not a *universal feature library*. We measure indeterminacy; we do not
  pretend to eliminate it.

## Reading

- W. V. O. Quine, *Word and Object* (1960), ch. 2.
- W. V. O. Quine, *Ontological Relativity and Other Essays* (1969).
- Marks et al., *Sparse Feature Circuits*, ICLR 2025
  ([arXiv:2403.19647](https://arxiv.org/abs/2403.19647)).
- Bricken et al., *Towards Monosemanticity*, Anthropic 2023.
- Arditi et al., *Refusal in Language Models Is Mediated by a Single
  Direction*, NeurIPS 2024
  ([arXiv:2406.11717](https://arxiv.org/abs/2406.11717)).
- *Mechanistic Interpretability Needs Philosophy*
  ([arXiv:2506.18852](https://arxiv.org/abs/2506.18852)).

## License

Apache 2.0. See [`LICENSE`](LICENSE) and [`NOTICE`](NOTICE).

The name draws on Quine's philosophical work as a *scholarly reference*. No
endorsement or affiliation is claimed or implied.
