Metadata-Version: 2.4
Name: suffix-smoother
Version: 0.2.0
Summary: Suffix smoothing classifier: three research-backed methods, conformal prediction, streaming.
License: MIT
Project-URL: Source Code, https://github.com/loofy147/suffix-smoother
Project-URL: Bug Tracker, https://github.com/loofy147/suffix-smoother/issues
Keywords: nlp,sequence-classification,suffix-smoothing,conformal-prediction,uncertainty-quantification,kneser-ney,witten-bell,log-anomaly,genomics,oov
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: numpy

# suffix-smoother

A lightweight, production-ready sequence classifier using **recursive suffix smoothing**.

Zero neural networks. Zero model files. Zero corpus downloads. Handles any unseen input via progressive backoff — the same technique that powered the TnT POS tagger (Brants 2000) before deep learning.

---

## Why

Most sequence classifiers fail silently on out-of-vocabulary inputs. A neural tagger trained on English news text will output garbage when it sees a neologism, a domain-specific term, or a malformed token. This library degrades gracefully: if the full context is unseen, it backs off to a shorter suffix, then shorter still, until it reaches the uniform prior. It always returns a calibrated probability.

The same algorithm works across domains because the math is domain-agnostic. You provide (context_tuple, label_id) pairs — the library doesn't care whether those represent characters in a word, nucleotides in a genome, or event codes in a server log.

---

## Install

```bash
pip install suffix-smoother
```

---

## Quick Start

```python
from suffix_smoother import SuffixSmoother, SuffixConfig

config = SuffixConfig(max_suffix_length=5, n_classes=2)
smoother = SuffixSmoother(config)

# Training: (context_tuple, label_id) pairs
smoother.train([
    ((101, 102, 103), 0),   # Normal sequence
    ((404, 404, 500), 1),   # Anomaly sequence
])

# Predict
label, confidence = smoother.predict((101, 102, 103))
# → (0, 0.87)

# Full distribution
dist = smoother.predict_distribution((101, 102))
# → {0: 0.72, 1: 0.28}

# Uncertainty in bits (0 = certain, log2(n_classes) = random)
bits = smoother.uncertainty((101, 102))

# Fraction of maximum uncertainty eliminated
reduction = smoother.uncertainty_reduction((101, 102))
```

---

## The Math

```
P(label | seq_k) = λ · P_MLE(label | seq_k) + (1-λ) · P(label | seq_{k-1})

Base case: P(label | ∅) = 1 / n_classes
```

Where `seq_k` is the last `k` symbols of the input sequence. The recursion blends the maximum likelihood estimate at each suffix level with the estimate from shorter contexts, all the way back to a uniform prior. This is Jelinek-Mercer smoothing applied to suffix trees.

---

## Use Cases

### NLP — POS Tagging

```python
from suffix_smoother import SuffixSmoother, SuffixConfig

TAGS = {"NOUN": 0, "VERB": 1, "ADJ": 2, "ADV": 3}

config = SuffixConfig(max_suffix_length=6, n_classes=len(TAGS))
smoother = SuffixSmoother(config)

# Encode suffix as char codes
def encode(word, maxlen=6):
    return tuple(ord(c) % 32 for c in word[-maxlen:])

# Train on (suffix_encoding, tag_id) pairs
smoother.train([
    (encode("running"), TAGS["VERB"]),
    (encode("quickly"), TAGS["ADV"]),
    (encode("creation"), TAGS["NOUN"]),
    # ... more training pairs
])

# Predict — works on any word, including OOV
label, conf = smoother.predict(encode("antidisestablishmentarianism"))
# Backs off: "ism" → suffix known → NOUN
```

**Benchmark (UD English-EWT corpus):**
- Overall accuracy: **81.12%**
- OOV accuracy: **78.57%** — nearly identical to in-vocabulary performance

### Log Anomaly Detection

```python
# Event codes: 101=LOGIN, 102=VIEW, 103=LOGOUT, 404=ERROR, 500=CRASH
config = SuffixConfig(max_suffix_length=4, n_classes=2)
smoother = SuffixSmoother(config)

smoother.train([
    ((101, 102, 103), 0),        # Normal
    ((101, 102, 102, 103), 0),   # Normal variant
    ((101, 404, 404, 404), 1),   # Anomaly: repeated errors
    ((102, 102, 500), 1),        # Anomaly: crash after double view
])

# Novel sequence — unseen but suffix matches
label, conf = smoother.predict((999, 102, 103))
# Backs off from full context to (102, 103) → NORMAL
```

Advantage: handles new service names, new error codes, and new event IDs without retraining.

### Genomics — Pathogenicity Prediction

```python
BASE = {"A": 0, "T": 1, "G": 2, "C": 3, "N": 4}

def encode_kmer(seq, k=6):
    return tuple(BASE.get(b.upper(), 4) for b in seq[:k])

# Classes: 0=BENIGN, 1=LIKELY_BENIGN, ... 4=PATHOGENIC
config = SuffixConfig(max_suffix_length=6, n_classes=8)
smoother = SuffixSmoother(config)

smoother.train(clinvar_training_data)  # (kmer_tuple, class_id) pairs

label, conf = smoother.predict(encode_kmer("TGCGAT"))
```

**Benchmark (ClinVar, real hg38 flanking sequences via Ensembl REST API):**
- Pathogenic recall: **69.23%** vs 0% naive baseline (always predict BENIGN)

---

## API Reference

### `SuffixConfig`

| Parameter | Default | Description |
|---|---|---|
| `max_suffix_length` | `5` | Maximum context length |
| `smoothing_lambda` | `0.7` | λ weight: MLE vs backoff |
| `n_classes` | `16` | Number of output labels |
| `min_count` | `1` | Minimum observations to trust a suffix level |

### `SuffixSmoother`

| Method | Returns | Description |
|---|---|---|
| `train(data)` | `dict` | Train on list of `(tuple, int)` pairs |
| `predict(seq)` | `(int, float)` | Best label and confidence |
| `predict_distribution(seq)` | `dict[int, float]` | Full probability distribution |
| `uncertainty(seq)` | `float` | Entropy in bits |
| `uncertainty_reduction(seq)` | `float` | Fraction of max entropy eliminated |
| `max_uncertainty()` | `float` | `log2(n_classes)` — theoretical maximum |
| `n_nodes` | `int` | Number of suffix nodes built |

---

## Performance

- **Inference**: < 2ms per sequence (Python, no JIT)
- **Training throughput**: > 50,000 samples/second
- **Memory**: scales with unique suffix patterns observed, not total vocabulary size
- **Dependencies**: `numpy` only

---

## License

MIT
