Metadata-Version: 2.4
Name: hsl-embedding
Version: 0.1.0
Summary: HSL (Holistic Signal Language): a non-learned, byte-level signal encoder for PyTorch — change-rate features, no tokenizer, losslessly invertible.
Project-URL: Homepage, https://github.com/Woojiggun/holo-hsl
Project-URL: Paper, https://doi.org/10.5281/zenodo.20581805
Project-URL: Demo, https://holo-demo-p5txmh4dda-as.a.run.app
Author-email: Jinhyun Woo <ggunio5782@gmail.com>
License: MIT
License-File: LICENSE
Keywords: byte-native,change-rate,embedding,multimodal,pytorch,signal,tokenizer-free
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Requires-Dist: numpy>=1.21
Requires-Dist: torch>=1.12
Description-Content-Type: text/markdown

# HSL — Holistic Signal Language

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20581805.svg)](https://doi.org/10.5281/zenodo.20581805)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)

**A non-learned, byte-level signal encoder for PyTorch.** Instead of splitting text into tokens, it reads
raw bytes *holistically as signal*: change-rate (Δ, XOR-delta), 2nd-order change (Δ²), boundary,
Fourier bands, and exact complex phase. That is 21 dimensions per byte by default (29 with the optional
raw bits), losslessly invertible. One modality-agnostic input layer for text, image, audio, video: any byte stream.

> Everything is information — a fluctuation between 0 and 1. HSL doesn't ask *what a token means*; it
> measures *how the signal changes*, with exact formulas, so the same representation works across every modality.

```python
import hsl_embedding as hsl

feats, phase = hsl.embed(b"hello")          # -> Tensor [L, 21], Tensor [L]
emb = hsl.Embedding()                        # an nn.Module like nn.Embedding, but with no parameters
feats = emb("강아지".encode())               # UTF-8 bytes of a Korean word ("puppy") -> [L, 21]
assert hsl.decode(hsl.encode(b"hello")) == b"hello"   # lossless, by construction
```

## Install

```bash
pip install hsl-embedding      # distribution name; import as `import hsl_embedding as hsl`
# deps: numpy, torch
```

## Why not just `nn.Embedding`?

They solve **different problems**. This is *not* a performance claim; it is a guide to when to use which.

| | `torch.nn.Embedding` | `hsl.Embedding` |
|---|---|---|
| what it is | a **learned lookup table** (trainable params) | an **exact formula** (zero params, deterministic) |
| input | a token id (`int`) | raw `bytes` |
| needs | a tokenizer + fixed vocab + training data | nothing — works on any bytes, day one |
| dimensions | opaque, learned | **named & interpretable** (Δ / Δ² / boundary / Fourier / phase) |
| modality | one tokenizer per modality (text ≠ image ≠ audio) | **one substrate for all** (byte-native) |
| invertible | no | **yes** (`decode(encode(x)) == x`) |
| new scripts / formats | breaks / out-of-vocab | just bytes — never breaks |

**They compose.** HSL is an *input substrate*, not a replacement for learned representations: `nn.Embedding`
learns *what tokens mean*; HSL gives *exact structural signal* for free. Stack learned layers **on top** of
HSL features.
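
As a rough illustration of "stacking learned layers on top", the sketch below feeds fixed 21-D per-byte features into a small trainable head. The class name `ByteFeatureEncoder`, the hidden size, and the choice of a linear projection plus a GRU are all assumptions made for this example, not part of the package. A random tensor stands in for real HSL features so the snippet runs without `hsl_embedding` installed.

```python
import torch
from torch import nn


class ByteFeatureEncoder(nn.Module):
    """Hypothetical learned head over fixed per-byte features (not part of hsl_embedding)."""

    def __init__(self, in_dim: int = 21, hidden_dim: int = 64):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)                    # learned projection of the fixed channels
        self.mix = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # learned mixing along the byte axis

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [B, L, in_dim] -> hidden states: [B, L, hidden_dim]
        hidden, _ = self.mix(torch.relu(self.proj(feats)))
        return hidden


# Stand-in for a batch of HSL features: 2 sequences, 8 bytes each, 21 channels.
feats = torch.randn(2, 8, 21)
model = ByteFeatureEncoder()
out = model(feats)
print(tuple(out.shape))
```

Only the head has trainable parameters here; the input features stay fixed, which is the division of labour described above.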

**Reach for HSL when** you want: tokenizer-free input · one model across modalities · structure/change-aware
features · exact reconstruction · small-data or from-scratch training · interpretable input channels.

## What each channel captures (and where it's good)

HSL is built from **exact formulas**, each chosen to carry information a plain learned embedding tends to
throw away. The default is **21-D** — the pure change-rate substrate, one row per channel:

| channel (dims) | exact formula | captures | especially good for |
|---|---|---|---|
| **Δ** `dxor` 0–7 (8) | `XOR(bitₜ, bitₜ₋₁)` from origin 0 | **change / transitions** — *where the signal flips* | edges, topic/region shifts, the modality-shared "rate of change". *Measured: shift-detection AUC **0.725** vs content **0.698**.* |
| **Δ²** `d2xor` 0–7 (8) | `XOR(Δₜ, Δₜ₋₁)` | **acceleration of change** (2nd order), a *partial-derivative boundary* | sharp **boundaries / corners / onsets**; where the rate-of-change itself jumps (segment cuts, audio attacks, image corners) |
| **boundary** (1) | `\|Δ\| + 0.5\|Δ²\| + 0.25·HF` | **transition-energy peaks** | **tokenizer-free segmentation** — natural byte/word/chunk cuts without decoding |
| **Fourier** low/high (2) | per-byte 8-bit rFFT amplitude bands | **frequency / texture / periodicity** | smooth vs busy, periodic vs random — audio timbre, image texture, repetitive vs novel content |
| **phase** cos/sin (2) | exact phasor `z = e^{iθ}, θ = 2π·byte/256` | **cyclic relation / angle** — exact `cos(θᵢ−θⱼ)` | **affect / mood** and relative/positional structure. *Measured: phase-variation tracks the audio affect-line **0.912**, better than loudness alone.* |

The point: a single learned vector blurs all of this together. HSL keeps **change (Δ), curvature (Δ²),
spectrum (Fourier), and phase** as separate, exact, interpretable channels — and adds them only where a
modality needs them.
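
To make the Δ, Δ², and phase formulas in the table concrete, here is a minimal pure-Python sketch. It is not the package's implementation: the helper names (`to_bits`, `xor_delta`, `phase`) are hypothetical, and the most-significant-bit-first ordering is an assumption, since the bit order is not stated here. The boundary and Fourier channels are left out because their exact definitions are not spelled out in this section.

```python
import math


def to_bits(value):
    """Return the 8 bits of a byte value, most significant bit first (assumed order)."""
    return [(value >> shift) & 1 for shift in range(7, -1, -1)]


def xor_delta(values):
    """XOR each value with its predecessor, starting from an origin of 0."""
    previous = 0
    out = []
    for v in values:
        out.append(v ^ previous)
        previous = v
    return out


def phase(value):
    """Map a byte onto the unit circle with theta = 2*pi*byte/256; return (cos, sin)."""
    theta = 2.0 * math.pi * value / 256.0
    return math.cos(theta), math.sin(theta)


data = b"ab"
delta = xor_delta(data)      # first-order change (the table's Δ)
delta2 = xor_delta(delta)    # second-order change (the table's Δ²), also from origin 0

for byte, d1, d2 in zip(data, delta, delta2):
    print(byte, to_bits(d1), to_bits(d2), phase(byte))

# The relative angle between two bytes follows from the stored cos/sin pairs:
# cos(a)cos(b) + sin(a)sin(b) = cos(a - b).
(ci, si), (cj, sj) = phase(data[0]), phase(data[1])
relative = ci * cj + si * sj
```

The last two lines show why storing both `cos` and `sin` is enough to recover `cos(θᵢ−θⱼ)` exactly, without recomputing the angles.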

*Legacy 29-D:* `include_bits=True` prepends the 8 raw byte bits. They're **redundant** (Δ-from-origin-0
already encodes the bytes losslessly), included only to match the original trained HoLo model.

## Lossless by construction

The features are grounded in a lossless codec, so the substrate is byte-exact:

```python
frame = hsl.encode(b"any bytes \x00\xff")
hsl.decode(frame) == b"any bytes \x00\xff"     # True
```
Δ-from-origin-0 *is* the codec's XOR-delta, so it already encodes the bytes losslessly — which is why the
raw `bits` channel is redundant and can be dropped.
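
The invertibility argument can be checked with a few lines of standard-library Python. This is a sketch of the XOR-delta idea only, with hypothetical function names (`delta_encode`, `delta_decode`); the package's own `encode`/`decode` frame format is not described here and may differ.

```python
from itertools import accumulate
from operator import xor


def delta_encode(data: bytes) -> bytes:
    """XOR each byte with the previous one, starting from an origin of 0."""
    previous = 0
    out = bytearray()
    for b in data:
        out.append(b ^ previous)
        previous = b
    return bytes(out)


def delta_decode(delta: bytes) -> bytes:
    """Invert delta_encode with a running XOR: byte_t = delta_0 ^ delta_1 ^ ... ^ delta_t."""
    return bytes(accumulate(delta, xor))


payload = b"any bytes \x00\xff"
roundtrip = delta_decode(delta_encode(payload))
print(roundtrip == payload)
```

Because the first delta is the first byte XOR 0 (the byte itself), the running XOR rebuilds every later byte in turn, so no information is lost.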

## 21-D (default) vs 29-D (legacy)

```python
hsl.embed(data)                      # 21-D  (default; pure change-rate, no redundant bits)
hsl.embed(data, include_bits=True)   # 29-D  (also prepend the 8 raw bits — original HoLo model)
hsl.Embedding(include_bits=True).out_dim   # 29
```

## Batch

```python
emb = hsl.Embedding()
feats, phase, mask = emb.pack([b"a", b"abcdef"], max_len=8)   # [B, L, D], [B, L], [B, L]
```
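
For readers unfamiliar with padded batches, the sketch below shows the general pad-and-mask pattern that a call like `pack` implies. Everything in it is an assumption for illustration: the helper name `pad_with_mask`, right-padding with zeros, truncation at `max_len`, and a mask of 1 for real bytes and 0 for padding. The package's actual padding value, truncation behaviour, and mask dtype are not stated in this section.

```python
def pad_with_mask(items, max_len, pad_value=0):
    """Right-pad (or truncate) each byte string to max_len and build a matching 0/1 mask."""
    padded, mask = [], []
    for item in items:
        values = list(item[:max_len])          # bytes -> list of ints, truncated if too long
        n = len(values)
        padded.append(values + [pad_value] * (max_len - n))
        mask.append([1] * n + [0] * (max_len - n))
    return padded, mask


padded, mask = pad_with_mask([b"a", b"abcdef"], max_len=8)
for row, m in zip(padded, mask):
    print(row, m)
```

A downstream model would typically use such a mask to ignore padded positions, for example when pooling over the length axis.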

## Examples

```bash
python examples/quickstart.py        # bytes in, features out; named channels
python examples/roundtrip_all.py     # text / image / audio / video -> embed -> EXACT reconstruction
python examples/vs_nn_embedding.py   # nn.Embedding vs hsl.Embedding — when to use which
python examples/benchmark_vs_nn.py   # honest capability + speed comparison
```

`roundtrip_all.py` — one modality-agnostic encoder, lossless by construction:

```
modality              bytes     feat shape   reconstruction
----------------------------------------------------------------
text  (utf-8)            98       (98, 21)   EXACT ✓
image (RGB u8)         3072     (3072, 21)   EXACT ✓
audio (PCM i16)        8000     (8000, 21)   EXACT ✓
video (6 frames)       4608     (4608, 21)   EXACT ✓
```

## Scope (honest)

HSL is a **non-learned input substrate**: a proof of concept from an independent, single-GPU project, not a
benchmark-beating system. It gives exact structural signal; the *meaning* still comes from a model you stack on
top. See the paper and live demo:

- 📄 Paper: [A Feasibility Study of Change-Rate-Based Multimodal Unification](https://doi.org/10.5281/zenodo.20581805) (Zenodo)
- 🌐 Live demo: https://holo-demo-p5txmh4dda-as.a.run.app
- 💻 HoLo project: https://github.com/Woojiggun/holo-hsl

## License & citation

**MIT License — © 2026 Jinhyun Woo (ggunio5782@gmail.com).**
Free to use, modify, and **distribute, including for commercial use** — the only condition is that the
copyright notice and attribution to **Jinhyun Woo** are kept. See [LICENSE](LICENSE).

```bibtex
@software{woo_hsl_2026,
  author = {Jinhyun Woo},
  title  = {HSL: a byte-native, modality-agnostic signal embedding},
  year   = {2026},
  doi    = {10.5281/zenodo.20581805},
  url    = {https://github.com/Woojiggun/holo-hsl}
}
```
