Metadata-Version: 2.4
Name: harmonic-token-projection
Version: 0.1.0
Summary: Vocabulary-free, training-free, deterministic and reversible text embeddings via harmonic modular projection (HTP).
Project-URL: Homepage, https://pypi.org/project/harmonic-token-projection/
Project-URL: Repository, https://github.com/tcharliesschmitz/harmonic-token-projection
Project-URL: Documentation, https://github.com/tcharliesschmitz/harmonic-token-projection#readme
Project-URL: Bug Tracker, https://github.com/tcharliesschmitz/harmonic-token-projection/issues
Project-URL: Paper, https://arxiv.org/abs/2511.20665
Project-URL: DOI, https://doi.org/10.5281/zenodo.17575155
Author-email: Tcharlies Schmitz <tcharliesschmitz@gmail.com>
Maintainer-email: Tcharlies Schmitz <tcharliesschmitz@gmail.com>
License: MIT
License-File: LICENSE
Keywords: deterministic,embeddings,harmonic,nlp,reversible,semantic-similarity,training-free,vocabulary-free
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: numpy>=1.21
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Provides-Extra: eval
Requires-Dist: datasets>=2.0; extra == 'eval'
Requires-Dist: scipy>=1.7; extra == 'eval'
Provides-Extra: zh
Requires-Dist: jieba>=0.42; extra == 'zh'
Description-Content-Type: text/markdown

[![PyPI version](https://img.shields.io/pypi/v/harmonic-token-projection?color=orange)](https://pypi.org/project/harmonic-token-projection/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![GitHub](https://img.shields.io/badge/github-harmonic--token--projection-black?logo=github)](https://github.com/tcharliesschmitz/harmonic-token-projection)

# 🎵 Harmonic Token Projection (HTP)

A **vocabulary-free**, **training-free**, **deterministic** and **reversible** text-embedding methodology.  
HTP encodes each token analytically as a **harmonic trajectory** derived from its Unicode integer representation, establishing a *bijective* and interpretable mapping between discrete symbols and a continuous vector space — with **no learned parameters, no corpus, and no randomness**.

> 📘 **Reference**  
> Schmitz, T. (2025). *Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology*.  
> arXiv: [2511.20665](https://arxiv.org/abs/2511.20665) · DOI: [10.5281/zenodo.17575155](https://doi.org/10.5281/zenodo.17575155)

---

## 🔖 Key Features

- 🚫 **No training, no vocabulary** — pure analytic transform, works on any Unicode string
- 🔁 **Fully reversible** — exact token recovery via the **Chinese Remainder Theorem**
- 🎯 **Deterministic** — identical input always yields identical output (no randomness)
- 🪶 **Lightweight** — sub-megabyte footprint, sub-millisecond per sentence pair, CPU-only
- 🔍 **Interpretable** — every coordinate is a harmonic of a modular residue
- 🌍 **Language-agnostic** — ρ ≈ 0.68–0.70 (EN) and ρ ≈ 0.64 averaged over 10 languages on STS-B
- 🧩 **Minimal dependencies** — only `numpy` (optional `jieba`, `scipy`, `datasets`)

---

## 📦 Installation

```bash
pip install harmonic-token-projection
```

Optional extras:

```bash
pip install 'harmonic-token-projection[zh]'    # jieba segmenter for Chinese
pip install 'harmonic-token-projection[eval]'  # scipy + datasets for STS evaluation
pip install 'harmonic-token-projection[dev]'   # test / lint / build tooling
```

---

## ⚙️ How It Works

For a token `t = [c₁, …, c_ℓ]`:

| Step | Equation | Description |
|------|----------|-------------|
| **1. Unicode** | `uᵢ = ord(cᵢ)` | character → code point |
| **2. Padding** | `ũ = [u₁,…,u_ℓ,0,…,0]` | zero-pad to fixed length `L_max` |
| **3. Integer** | `Nₜ = Σ ũⱼ·B^(L_max−j)`, `B = 2¹⁶` | read as a base-`B` number |
| **4. Residues** | `rᵢ = Nₜ mod mᵢ` | decompose over pairwise-coprime moduli |
| **5. Harmonics** | `Eᵢ = [sin(2πrᵢ/mᵢ), cos(2πrᵢ/mᵢ)]` | project each residue → `E(t) ∈ ℝ²ᵏ` |
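
The five steps in the table can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names (`first_primes`, `token_to_int`, `encode_token`) are hypothetical, the interleaved `[sin, cos, sin, cos, …]` coordinate layout is assumed, and code points above `0xFFFF` are rejected here because they do not fit in a single base-`B` digit (how the library handles them is not stated above).

```python
import math

B = 2 ** 16  # base used in step 3


def first_primes(k):
    """Return the first k primes by trial division (fine for small k)."""
    primes = []
    candidate = 2
    while len(primes) < k:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def token_to_int(token, max_len):
    """Steps 1-3: code points, zero-padding, base-B integer."""
    codes = [ord(ch) for ch in token]
    if len(codes) > max_len:
        raise ValueError("token longer than max_len")
    if any(u >= B for u in codes):
        # Assumption of this sketch: one character must fit in one digit.
        raise ValueError("code point does not fit in one base-B digit")
    padded = codes + [0] * (max_len - len(codes))
    n = 0
    for u in padded:  # Horner's rule: the first character is the most significant digit
        n = n * B + u
    return n


def encode_token(token, moduli, max_len):
    """Steps 4-5: residues, then one (sin, cos) pair per modulus."""
    n = token_to_int(token, max_len)
    vec = []
    for m in moduli:
        angle = 2.0 * math.pi * (n % m) / m
        vec.extend((math.sin(angle), math.cos(angle)))
    return vec


moduli = first_primes(4)                      # [2, 3, 5, 7]
vec = encode_token("hi", moduli, max_len=4)
print(len(vec))                               # 2 * k = 8 coordinates
```

Each `(sin, cos)` pair lies on the unit circle, so a token vector built this way has norm `sqrt(k)` before any sentence-level normalization.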

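**Inversion** recovers each residue from the phase of its harmonic pair, `r̃ᵢ = round(atan2(sᵢ, cᵢ)/2π · mᵢ) mod mᵢ`, where `sᵢ` and `cᵢ` here denote the sine and cosine coordinates of `Eᵢ` (not the characters of the token). The residue is taken modulo `mᵢ` because `atan2` returns angles in `(−π, π]`. HTP then reconstructs `Nₜ` via the **Chinese Remainder Theorem** and decodes the base-`B` digits back to characters. By default HTP uses the **first `k = D/2` primes** as moduli, where `D` is the embedding dimension (`dim`). Primes are pairwise coprime, and their product `M` is large enough that every token of up to `model.reversible_max_len` characters is exactly reversible.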
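
The inversion path can be sketched the same way. The block below is an illustrative round trip under assumptions, not the library's code: all function names are hypothetical, the vector layout is assumed to be interleaved `[sin, cos, …]`, and twelve primes are used only because their product exceeds `B²`, which is enough for a two-character token.

```python
import math

B = 2 ** 16


def encode_int(n, moduli):
    """Forward map used for the round trip: one (sin, cos) pair per modulus."""
    vec = []
    for m in moduli:
        angle = 2.0 * math.pi * (n % m) / m
        vec += [math.sin(angle), math.cos(angle)]
    return vec


def residues_from_vector(vec, moduli):
    """Recover each residue from the phase of its (sin, cos) pair."""
    residues = []
    for i, m in enumerate(moduli):
        s, c = vec[2 * i], vec[2 * i + 1]
        phase = math.atan2(s, c) / (2.0 * math.pi)  # fraction of a turn in [-0.5, 0.5]
        residues.append(round(phase * m) % m)       # wrap negative phases into [0, m)
    return residues


def crt(residues, moduli):
    """Combine residues over pairwise-coprime moduli into one integer in [0, M)."""
    n, product = 0, 1
    for r, m in zip(residues, moduli):
        # Find t with n + product * t congruent to r (mod m).
        t = ((r - n) * pow(product, -1, m)) % m     # modular inverse, Python 3.8+
        n += product * t
        product *= m
    return n


def int_to_token(n, max_len):
    """Read the base-B digits back and strip the trailing zero padding."""
    digits = []
    for _ in range(max_len):
        n, u = divmod(n, B)
        digits.append(u)
    digits.reverse()
    while digits and digits[-1] == 0:
        digits.pop()
    return "".join(chr(u) for u in digits)


moduli = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]  # product exceeds B**2
n = ord("h") * B + ord("i")                            # "hi" with max_len = 2
recovered = crt(residues_from_vector(encode_int(n, moduli), moduli), moduli)
print(int_to_token(recovered, max_len=2))              # hi
```

The reconstruction is exact only while `Nₜ < M`; beyond that, the CRT returns `Nₜ mod M` and the decoded characters no longer match.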

---

## 🧪 Detailed Examples

### **1️⃣ Token-level: deterministic & reversible**

```python
from htp import HTP

model = HTP(dim=512, max_len=32)

vec = model.encode_token("harmonic")   # numpy array, shape (512,)
print(vec.shape)                        # (512,)
print(model.decode_token(vec))          # -> 'harmonic'   (lossless)
print(model.token_to_int("harmonic"))   # -> deterministic integer Nₜ
print(model.reversible_max_len)          # -> 143 (chars guaranteed to round-trip)
```
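
One plausible way to arrive at a figure like `reversible_max_len` is to ask how many base-`B` digits fit below the modulus product `M`. The sketch below assumes the bound is "largest `L` with `B^L ≤ M`" over the first `dim/2` primes. That reading follows from the description in *How It Works* but is not confirmed as the library's exact rule, and the helper names are hypothetical.

```python
import math


def first_primes(k):
    """Return the first k primes by trial division."""
    primes = []
    candidate = 2
    while len(primes) < k:
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


def max_reversible_len(moduli, base=2 ** 16):
    """Largest L with base**L <= prod(moduli), so every L-digit integer is below M."""
    product = math.prod(moduli)
    length = 0
    power = base
    while power <= product:
        length += 1
        power *= base
    return length


# dim = 512 gives k = 256 moduli; compare the result with model.reversible_max_len.
print(max_reversible_len(first_primes(256)))
```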

### **2️⃣ Sentence-level: harmonic pooling & similarity**

```python
from htp import HTP

model = HTP(dim=512)

emb = model.encode("the cat sat on the mat")        # (512,) L2-normalized
mat = model.encode_batch(["first sentence",
                          "second one"])             # (2, 512)

print(model.similarity("a man is playing a guitar",
                       "a person plays the guitar")) # ~0.44
print(model.similarity("a man is playing a guitar",
                       "the stock market fell"))     # ~-0.03
```
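
For intuition, sentence-level pooling and similarity can be sketched without the library. The block below assumes a sentence embedding is a weighted average of token vectors followed by L2 normalization, and that `similarity` is a cosine. The function names and the toy two-dimensional vectors are illustrative only, not HTP output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def pool(token_vectors, weights=None):
    """Weighted average of token vectors, then L2-normalize."""
    if weights is None:
        weights = [1.0] * len(token_vectors)  # uniform weights, i.e. "mean" pooling
    dim = len(token_vectors[0])
    total = sum(weights)
    pooled = [
        sum(w * v[d] for w, v in zip(weights, token_vectors)) / total
        for d in range(dim)
    ]
    return l2_normalize(pooled)


def cosine(a, b):
    """Cosine similarity; for unit vectors this equals the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


s1 = pool([[1.0, 0.0], [0.0, 1.0]])  # two toy "token" vectors
s2 = pool([[1.0, 0.0]])              # one toy "token" vector
print(round(cosine(s1, s2), 4))      # 0.7071
```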

### **3️⃣ Frequency-aware pooling (ITF / TF-IDF)**

```python
from htp import HTP

corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]

model = HTP(dim=512, pooling="tfidf")
model.fit(corpus)        # collects token frequencies — trains NO parameters
emb = model.encode("the rare cat")   # common words ("the") down-weighted
```

Pooling strategies (`pooling=...`):

| Strategy | Weighting |
|----------|-----------|
| `"itf"` *(default)* | Inverse Token Frequency, `w = 1/log(1+f(t))`, where `f(t)` is the frequency of token `t` |
| `"tfidf"` | TF-IDF (call `model.fit(corpus)` first) |
| `"mean"` | uniform |
| `"stopword"` | drop stopwords, then mean |
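
The ITF weighting in the table can be illustrated with a few lines of standard-library Python. This sketch makes several assumptions that the table does not settle: `f(t)` is taken to be a raw corpus count, tokens are split on whitespace rather than with the library's tokenizer, and unseen tokens are treated as seen once so that `log(1 + 0) = 0` never ends up in the denominator. The name `itf_weight` is hypothetical.

```python
import math
from collections import Counter

corpus = ["the cat sat on the mat", "a dog ran in the park", "the bird flew away"]

# Raw token counts over the corpus (whitespace tokenization is an assumption).
counts = Counter(tok for sentence in corpus for tok in sentence.split())


def itf_weight(token, counts):
    """w = 1 / log(1 + f(t)), with f(t) clamped to at least 1."""
    f = max(counts.get(token, 0), 1)  # assumption: unseen tokens count as seen once
    return 1.0 / math.log(1.0 + f)


for tok in ["the", "cat", "rare"]:
    print(tok, counts.get(tok, 0), round(itf_weight(tok, counts), 3))
```

Frequent tokens such as "the" receive a smaller weight than tokens that occur once, which is the down-weighting effect described in the example above.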

### **4️⃣ Multilingual round-trip & STS evaluation**

```python
from htp import HTP
from htp.evaluate import evaluate_pairs   # requires the [eval] extra

model = HTP(dim=512, max_len=32)

# Reversible across scripts
for t in ["représentation", "Schlüssel", "coração", "язык", "日本語"]:
    assert model.decode_token(model.encode_token(t)) == t

# Correlate against human similarity judgments
pairs = [("a man is eating food", "a man eats something"),
         ("a plane is taking off", "a dog is running")]
gold  = [4.2, 0.5]
print(evaluate_pairs(model, pairs, gold))   # {'spearman': ..., 'pearson': ...}
```
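
To make the two reported statistics concrete, here is a dependency-free sketch of Pearson and Spearman correlation between model scores and gold ratings. It is not the code behind `evaluate_pairs` (the `[eval]` extra pulls in `scipy`, which presumably does this work). The function names and the sample scores are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


scores = [0.9, 0.1, 0.5]  # hypothetical model similarities
gold = [4.2, 0.5, 2.0]    # hypothetical human ratings
print({"spearman": spearman(scores, gold), "pearson": pearson(scores, gold)})
```

Spearman depends only on the ordering of the scores, which is why it is the usual headline metric for STS: a model can rank pairs correctly without matching the 0–5 rating scale.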

---

## 🧰 API Reference

```python
HTP(dim=512, max_len=32, moduli=None, pooling="itf",
    tokenizer="regex", lowercase=False, stopwords="en")

model.encode_token(token)        # str      -> ndarray (dim,)
model.decode_token(vector)       # ndarray  -> str
model.token_to_int(token)        # str      -> int  (Nₜ)
model.int_to_token(value)        # int      -> str
model.encode(text, pooling=None) # str      -> ndarray (dim,)
model.encode_batch(texts)        # list     -> ndarray (n, dim)
model.similarity(a, b)           # str, str -> float
model.fit(corpus)                # collect ITF/TF-IDF statistics
model.reversible_max_len         # max token length guaranteed to round-trip
```

---

## 🔬 Properties

| Property | HTP |
|----------|-----|
| Training | none (analytic) |
| Vocabulary | none (any Unicode string) |
| Determinism | identical input → identical output |
| Reversibility | exact token recovery via CRT |
| Footprint | sub-megabyte, sub-millisecond per sentence pair, CPU-only |
| Interpretability | each coordinate is a harmonic of a modular residue |

---

## 📖 Citation

```bibtex
@article{schmitz2025htp,
  title   = {Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free,
             Deterministic, and Reversible Embedding Methodology},
  author  = {Schmitz, Tcharlies},
  journal = {arXiv preprint arXiv:2511.20665},
  year    = {2025},
  doi     = {10.5281/zenodo.17575155}
}
```

---

## 📝 License

MIT © 2025 **Tcharlies Schmitz** — Data Science, PX.Center
