Metadata-Version: 2.4
Name: dictcollision
Version: 0.3.0
Summary: Calibrate dictionary hit rates in computational decipherment. Detects when short decoded strings collide with dictionary entries by chance.
Project-URL: Repository, https://github.com/mruckman1/dictcollision
Project-URL: Documentation, https://github.com/mruckman1/dictcollision#usage
Project-URL: Paper, https://github.com/mruckman1/signal-isolation-paper
Project-URL: Issues, https://github.com/mruckman1/dictcollision/issues
Author-email: Matthew Ruckman <mruckman1@gmail.com>
License: MIT
License-File: LICENSE
Keywords: NLP,calibration,collision,computational-linguistics,cryptanalysis,decipherment,dictionary,signal-isolation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: hypothesis>=6.0; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.5; extra == 'viz'
Description-Content-Type: text/markdown

# dictcollision

[![PyPI](https://img.shields.io/pypi/v/dictcollision)](https://pypi.org/project/dictcollision/)
[![Python](https://img.shields.io/pypi/pyversions/dictcollision)](https://pypi.org/project/dictcollision/)
[![License](https://img.shields.io/pypi/l/dictcollision)](https://github.com/mruckman1/dictcollision/blob/main/LICENSE)

**Calibrate dictionary hit rates.** Given a list of short strings and a
reference dictionary, separate real matches from chance collisions.

---

## The general problem

You have a stream of short strings and a big reference dictionary. Some
fraction match. How many are **real** matches vs. the dictionary being
large enough that anything would match?

| Domain | "Decoded tokens" | "Dictionary" | "Real signal" means |
|---|---|---|---|
| **Decipherment / cryptanalysis** | candidate plaintext (¹) | language wordlist | the decode works |
| **OCR validation** | extracted strings | lexicon | the OCR read correctly |
| **Spell-check eval** | candidate corrections | target vocabulary | the correction fired |
| **Autocomplete ranking** | prefix expansions | vocab | the candidate is meaningful |
| **Password audit** | cracked-string attempts | common-words list | the password was weak, not a random collision |
| **Fuzzy dedup** | near-match candidates | canonical set | they are actually duplicates |
| **RNG / fuzzer QA** | generated strings | wordlist | did your generator accidentally emit real words? |

If the input is short (2–4 chars) and the dictionary is large (10K+),
naive hit-rate metrics are badly inflated by chance. This package fixes
that.

(¹) When the candidate plaintext was produced by a stochastic search
over a key space (SA, hill-climbing), the *cipher symbols* themselves
are also a relevant input — see [When your decode came from a
search](#when-your-decode-came-from-a-search).
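To see the inflation concretely, here is a minimal, self-contained toy (not using the package): a "dictionary" dense enough in short strings that purely random 2-character tokens match it every time.

```python
import itertools
import random

# Toy "dictionary": every 2-letter combination over a 6-letter alphabet
# is a word — a stand-in for a large lexicon's dense short-word coverage.
alphabet = "abcdef"
dictionary = {a + b for a, b in itertools.product(alphabet, repeat=2)}

# Random 2-char "decoded" tokens carry no linguistic signal at all...
rng = random.Random(0)
tokens = ["".join(rng.choices(alphabet, k=2)) for _ in range(1000)]

# ...yet the naive hit rate is 100%: every token collides with the lexicon.
hit_rate = sum(t in dictionary for t in tokens) / len(tokens)
print(f"naive hit rate: {hit_rate:.0%}")  # 100% purely by chance
```

Real lexicons are not this dense, but the same effect operates in degree: the shorter the tokens and the larger the dictionary, the higher the chance floor.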

## Install

```bash
pip install dictcollision
pip install "dictcollision[viz]"   # with matplotlib
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add dictcollision                 # into a uv project
uv pip install dictcollision         # into the active venv
uv tool install dictcollision        # install the CLI globally
```

## Quick start

```python
from dictcollision import noise_floor, classify, recommend

# 1. One-line prediction: how much of my 43% hit rate is chance?
predicted = noise_floor(decoded_tokens, dictionary)
print(f"Chance alone predicts {predicted:.1%}; observed 43.0%; "
      f"excess {0.43 - predicted:.1%}")

# 2. Full four-category analysis.
result = classify(decoded_tokens, dictionary)
print(result.summary())

# 3. Rank candidate dictionaries when you don't know the language.
ranked = recommend(decoded_tokens,
                   {"latin_10k": latin_words, "german_50k": german_words})
print(ranked[0].name, ranked[0].excess)
```

`result.summary()` prints:

```
ClassifyResult (n=5000 tokens)
  apparent hit rate :  99.6%
  net signal        :  70.1%   <- calibrated metric
  correction        :  29.5%   <- amount subtracted

  signal       70.1%  ████████████████░░░░░░░  real matches
  shared_hit   19.4%  ████░░░░░░░░░░░░░░░░░░░  chance collisions
  anti_signal   0.6%  ░░░░░░░░░░░░░░░░░░░░░░░  phantom matches
  shared_miss   9.9%  ██░░░░░░░░░░░░░░░░░░░░░  non-dict tokens

  Interpretation: strong signal — dictionary is a good fit
```

## Command line

No Python code required:

```bash
python -m dictcollision --tokens decoded.txt --dict latin_50k.txt
python -m dictcollision --tokens decoded.txt --dict latin.txt --baselines
python -m dictcollision --tokens decoded.txt --dict latin.txt --json > report.json
```

Supported dictionary formats: one word per line, `word count` (hermitdave
FrequencyWords), or CSV. See `python -m dictcollision --help`.
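The package's `load_dictionary` handles these formats; for reference, a minimal standalone parser for the two plain-text shapes (one word per line, and hermitdave-style `word count` pairs) might look like this — `parse_dictionary_lines` is a hypothetical helper, not part of the package API:

```python
def parse_dictionary_lines(lines):
    """Parse plain-text dictionary lines: either one word per line,
    or hermitdave-style 'word count' pairs (word is the first field)."""
    words = set()
    for line in lines:
        parts = line.split()
        if not parts:          # skip blank lines
            continue
        words.add(parts[0])    # first field is always the word
    return words

lines = ["the 23135851162", "of 13151942776", "cat", ""]
print(parse_dictionary_lines(lines))  # {'the', 'of', 'cat'}
```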

## Input and output contract

**Input:**

```python
decoded_tokens : list[str]          # e.g. ["the", "cat", "ab", "cd", ...]
                                    # any Unicode; no preprocessing assumed
dictionary     : set[str] | list[str]   # reference words, same encoding
```

**Output (`classify`)** → `ClassifyResult` dataclass:

| Field | Type | Range | Meaning |
|---|---|---|---|
| `net_signal` | float | `[-1, 1]` | **The calibrated metric.** signal − anti_signal |
| `signal` | float | `[0, 1]` | real hits |
| `shared_hit` | float | `[0, 1]` | chance collisions that happen to also be real words |
| `anti_signal` | float | `[0, 1]` | phantom matches (null-only) |
| `shared_miss` | float | `[0, 1]` | non-dictionary tokens |
| `apparent_hit_rate` | float | `[0, 1]` | what a naive evaluator would report |
| `correction` | float | `≥ 0` | apparent − net_signal |
| `signal_words` | `list[str]` |  | types driving real signal |
| `anti_signal_words` | `list[str]` |  | types that inflate chance — inspect to diagnose |
| `n_tokens` | int |  | total count |

### Interpreting `net_signal`

| Range | Meaning |
|---|---|
| `≥ 0.20` | **Strong signal** — dictionary is a good fit |
| `0.05 – 0.20` | **Partial signal** — possibly correct with caveats |
| `≈ 0` | **No signal beyond chance** |
| `< 0` | **Worse than random** — wrong language or wrong decode key |
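The table above can be expressed as a small threshold function. This is a hypothetical helper for illustration (the package prints an interpretation via `result.summary()` itself); the `eps` tolerance defining "about zero" is an assumption:

```python
def interpret(net_signal, eps=0.02):
    """Map a net_signal value to the interpretation bands in the table.
    eps defines how close to zero counts as 'no signal beyond chance'."""
    if net_signal >= 0.20:
        return "strong signal"
    if net_signal >= 0.05:
        return "partial signal"
    if net_signal > -eps:
        return "no signal beyond chance"
    return "worse than random"
```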

## The core equation

The predicted noise floor for dictionary $D$ against decoded text with
character distribution $p$ is:

$$\hat{r} \;=\; \sum_{w \in D} \prod_{i=1}^{|w|} p(w_i)$$

For every word in the dictionary, multiply together the character
frequencies of your decoded output across that word's characters, then
sum over all words. The result is the fraction of your tokens expected
to match by chance alone.
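As a sketch, the equation translates directly into a few lines of Python. `noise_floor_estimate` here is an illustrative standalone implementation, not the package's `noise_floor` (which may differ in details such as length handling):

```python
from collections import Counter

def noise_floor_estimate(tokens, dictionary):
    """Sum over dictionary words of the product of per-character
    frequencies observed in the decoded output — the equation above."""
    chars = Counter("".join(tokens))
    total = sum(chars.values())
    p = {c: n / total for c, n in chars.items()}
    r_hat = 0.0
    for w in dictionary:
        prod = 1.0
        for ch in w:
            prod *= p.get(ch, 0.0)   # characters absent from the output contribute 0
        r_hat += prod
    return r_hat

# p(a) = p(b) = 0.5, so "ab" and "aa" each contribute 0.25:
print(noise_floor_estimate(["ab", "ba"], {"ab", "aa"}))  # → 0.5
```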

## Four-category framework

| Category | In dictionary? | In real text? | In null corpora? |
|---|---|---|---|
| **Signal** | yes | yes | no |
| **Shared hit** | yes | yes | yes |
| **Anti-signal** | yes | no | yes |
| **Shared miss** | no | — | — |

**Net signal = Signal − Anti-signal** is the calibrated metric.

Null corpora are generated from the decoded text's character bigram
distribution (configurable: unigram / bigram / trigram), preserving
character-pair frequencies and token lengths while destroying word
identity. On wrong-language evaluations the four-category framework
is the only method among six tested that correctly reports signal as
≤ 0.
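A bigram-preserving null corpus of the kind described above can be sketched as follows. This is an illustrative standalone generator, not the package's internal implementation; the fallback-to-start-distribution rule for dead-end characters is an assumption:

```python
import random
from collections import Counter, defaultdict

def bigram_null_tokens(tokens, rng):
    """Resample each token from the corpus's character-bigram
    distribution, preserving token lengths but destroying word identity."""
    starts = Counter(t[0] for t in tokens if t)
    trans = defaultdict(Counter)           # transition counts a -> b
    for t in tokens:
        for a, b in zip(t, t[1:]):
            trans[a][b] += 1

    def sample(counter):
        items, weights = zip(*counter.items())
        return rng.choices(items, weights=weights)[0]

    null = []
    for t in tokens:
        if not t:
            null.append(t)
            continue
        c = sample(starts)
        out = [c]
        for _ in range(len(t) - 1):
            nxt = trans.get(c)
            c = sample(nxt) if nxt else sample(starts)  # dead end: restart
            out.append(c)
        null.append("".join(out))
    return null
```

Scoring the null corpus against the dictionary with the same hit-rate machinery then gives the "in null corpora?" column of the table.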

## When your decode came from a search

If your decoded tokens are the *output of a stochastic search* over a
key space (simulated annealing on a substitution alphabet, hill-climbing,
AZdecrypt, etc.), `net_signal` alone can mislead. The search itself can
manufacture apparent signal: a quadgram-optimised key on a short cipher
will find local optima that resolve into a handful of high-frequency
dictionary words even when the cipher has no underlying linguistic
structure. The Dorabella case (Ruckman 2026) documents this failure
mode at +0.55 net_signal.

The fix is to give the same search procedure the same matched-budget
opportunity on **shuffles of the cipher** — multiset-preserving
permutations that destroy positional content but keep the character
budget constant. If the search finds materially more signal on the real
cipher than on its shuffles, that excess is the calibrated signal.
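The mechanics can be sketched independently of the package. `shuffle_calibrated_z` below is an illustrative standalone version, not `search_calibrated_signal` itself; the z-score normalisation against the shuffle distribution is the core idea:

```python
import random
import statistics

def shuffle_calibrated_z(cipher, search_fn, score_fn, n_shuffles=30, seed=0):
    """Run the same search with the same budget on multiset-preserving
    shuffles of the cipher; report the real score as a z-score against
    the shuffle distribution."""
    rng = random.Random(seed)
    real = score_fn(search_fn(cipher))
    null_scores = []
    for _ in range(n_shuffles):
        shuffled = list(cipher)
        rng.shuffle(shuffled)          # keeps the symbol multiset constant
        null_scores.append(score_fn(search_fn(shuffled)))
    mu = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores) or 1e-12  # guard degenerate nulls
    return (real - mu) / sd
```

Any apparent signal the search can also manufacture on shuffles is subtracted out by construction, which is exactly what `net_signal` alone cannot do.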

```python
from dictcollision import search_calibrated_signal

result = search_calibrated_signal(
    cipher_symbols=cipher,        # the cipher itself, not decoded tokens
    search_fn=my_sa_search,       # cipher -> decoded tokens
    dictionary=word_set,
    n_shuffles=30,
)
print(result.summary())
# z_score >= 3 → search finds real signal
# −1 ≤ z < 1   → indistinguishable from a shuffle baseline
```

`search_calibrated_signal` and `null_distribution` solve different
problems:

| Question | Use |
|---|---|
| "is this fixed decode's signal distinguishable from a bigram null?" | `null_distribution()` |
| "does my search procedure find more signal on the real cipher than on shuffles of it?" | `search_calibrated_signal()` |

Both can be informative; reach for `search_calibrated_signal` whenever
the decoded tokens were chosen by a key-space optimiser.

## Full API

```python
from dictcollision import (
    noise_floor,                  # analytical collision prediction
    classify,                     # four-category classification
    classify_by_length,           # per-length-bucket breakdown
    recommend,                    # rank candidate dictionaries
    null_distribution,            # Monte Carlo null distribution
    bootstrap_ci,                 # bootstrap CI on net_signal
    search_calibrated_signal,     # matched-budget shuffle calibration
    load_dictionary, load_tokens, # file loaders
)

from dictcollision.baselines import (
    apparent_hit_rate,            # no correction
    subtract_null,                # naive baseline
    permutation_test,             # per-word Poisson test
    bh_fdr,                       # Benjamini-Hochberg
    blast_evalue,                 # BLAST-style E-value
    all_methods,                  # all six in one dict (Table 2 style)
)

from dictcollision.viz import (
    plot_decomposition,           # paper Figure 1
    plot_size_sweep,              # paper Figure 2
    plot_method_comparison,       # paper Figure 5
    plot_length_stratified,       # paper Figure 13
)
```

## Examples

Self-contained scripts in [examples/](examples/):

- [01_vigenere.py](examples/01_vigenere.py) — evaluate a Vigenere candidate key
- [02_paper_table2.py](examples/02_paper_table2.py) — reproduce the six-method comparison
- [03_dictionary_recommender.py](examples/03_dictionary_recommender.py) — pick the right dictionary without knowing the language
- [04_search_calibrated.py](examples/04_search_calibrated.py) — calibrate a stochastic search against shuffle baseline

## Paper

Methodology, experiments, validation:

Ruckman, M. (2026). *The Dictionary Collision Effect in Computational
Decipherment.* Source, figures, and reproduction code:
<https://github.com/mruckman1/signal-isolation-paper>

## Citation

```bibtex
@article{ruckman2026dictcollision,
  title={The Dictionary Collision Effect in Computational Decipherment},
  author={Ruckman, Matthew},
  year={2026},
  url={https://github.com/mruckman1/signal-isolation-paper}
}
```

## License

MIT
