Metadata-Version: 2.4
Name: kolmgformers
Version: 0.0.3
Summary: KOLMGformers: Unified KAN attention-free sequence modeling (KOLMOGformers) + Parallel Diffusion LM (OMGformers)
Author: Ömür Bera Işık
License-Expression: Apache-2.0
Keywords: deep-learning,transformers,diffusion,language-model,nlp,pytorch,kolmogorov-arnold,kan,attention-free,lora,moe
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Provides-Extra: hf
Requires-Dist: transformers>=4.35; extra == "hf"
Requires-Dist: tokenizers>=0.15; extra == "hf"
Provides-Extra: safetensors
Requires-Dist: safetensors>=0.4; extra == "safetensors"
Provides-Extra: flash
Requires-Dist: flash-attn>=2.0; extra == "flash"
Provides-Extra: quant
Requires-Dist: bitsandbytes>=0.41; extra == "quant"
Provides-Extra: all
Requires-Dist: transformers>=4.35; extra == "all"
Requires-Dist: tokenizers>=0.15; extra == "all"
Requires-Dist: safetensors>=0.4; extra == "all"
Requires-Dist: bitsandbytes>=0.41; extra == "all"
Dynamic: license-file

# KOLMGformers v0.0.3

Unified Python library merging two research-grade model families:

| Family | Architecture | Key property |
|--------|-------------|-------------|
| **KOLMOG** | Kolmogorov-Arnold Cumulative Context | Attention-free, O(d) memory |
| **OMG** | Parallel Diffusion Transformer | Masked diffusion, full feature set |

---

## What's New in v0.0.3 — Bug Fixes & Improvements

### Bug Fixes (KOLMOG)

| ID | Component | Issue | Fix |
|----|-----------|-------|-----|
| #K5 | `KOLMOGformerLayer` | `phi` received raw `kappa_out` and `context` — shape contract was implicit and fragile | Documented and type-checked; context extractor now propagates `attention_mask` |
| #K6 | `generate()` | Repetition penalty used a Python loop over `set(generated[b].tolist())` — O(seq·vocab) per step | Vectorised with `tensor.unique()` + `scatter_` |
| #K7 | `generate()` | Top-p nucleus sampling: `cumprobs − softmax(sorted)` double-subtracted the pivot token, causing off-by-one exclusions | Replaced with correct shifted-cumsum implementation |
| #K8 | `PLKANLayer` | `breakpoints` initialized via `expand().clone()` left non-contiguous memory; subtle autograd issues under in-place ops | Replaced with `linspace(...).repeat()` → always contiguous |
| #K9 | `InnerKolmogorovFunction` | Always used slow B-spline `KANLayer` for φ layers, ignoring `config.use_plkan` | `build_kan_layer` factory now honoured for φ too (~3–5× speedup with PLKAN) |
| #K10 | `CumulativeContextExtractor` | Causal pad+shift produced wrong exclusive prefix at position 0 (current token leaked into its own context) | Replaced with correct exclusive cumsum: `C^{<i} = cumsum[i] − kappa_w[i]` |
| #K11 | `CumulativeContextExtractor` | `attention_mask` was accepted by model but never threaded to the context extractor; pad tokens polluted context vectors | Mask is now applied to `kappa` before accumulation at every layer |
| #K12 | `KOLMOGformerForCausalLM` | Logits returned only `[:, :-1, :]` (already shifted), breaking downstream use of the full logit tensor | Full-sequence logits returned; shift applied only inside loss computation (matches HF API) |
| #K13 | `save_pretrained` | Direct `torch.save` to final path could leave a corrupted checkpoint on interruption | Atomic write via `tempfile` + `os.replace`; safetensors support added |
| #K14 | `KOLMOGformerModel` | No gradient checkpointing — OOM on long sequences during training | `enable_gradient_checkpointing()` added; controlled via `TrainingArguments.gradient_checkpointing` |
| #K15 | `KANLayer.b_splines` | Grid buffer could be float32 while activations are bfloat16/float16, causing dtype mismatch | Grid is cast to `x.dtype` on every forward pass |
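
Fix #K7 concerns where the nucleus cutoff falls. The sketch below shows the common shifted-mask form of top-p filtering, in which the token that crosses the threshold is kept. It illustrates the technique and is not the library's `generate()` code; `top_p_filter` is a hypothetical helper name.

```python
import torch


def top_p_filter(logits: torch.Tensor, top_p: float) -> torch.Tensor:
    """Set logits outside the nucleus to -inf. logits: (batch, vocab)."""
    sorted_logits, sorted_idx = torch.sort(logits, dim=-1, descending=True)
    cumprobs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)

    # Mark tokens whose inclusive cumulative mass exceeds top_p ...
    remove = cumprobs > top_p
    # ... then shift the mask right by one so the pivot token (the first
    # one to cross the threshold) stays inside the nucleus.
    remove[..., 1:] = remove[..., :-1].clone()
    remove[..., 0] = False

    # Map the mask from sorted order back to vocabulary order.
    remove = torch.zeros_like(remove).scatter(-1, sorted_idx, remove)
    return logits.masked_fill(remove, float("-inf"))


# Illustrative distribution: tokens 1 and 3 carry 0.8 of the mass.
probs = torch.tensor([[0.15, 0.5, 0.05, 0.3]])
filtered = top_p_filter(probs.log(), top_p=0.7)
```

With `top_p=0.7`, the two most likely tokens (mass 0.5 and 0.3) survive, because the second one is the pivot that crosses the threshold.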

### Bug Fixes (Training)

| ID | Component | Issue | Fix |
|----|-----------|-------|-----|
| #T1 | `Trainer` | No early stopping — training continued even after convergence | `early_stopping_patience` added to `TrainingArguments` |
| #T2 | `DataCollatorForCausalLM` | Over-length sequences were silently truncated or passed through without warning | New `warn_length` parameter warns when a batch contains sequences longer than the model's position limit |
| #T3 | `Trainer._save_checkpoint` | Saved only model weights — optimizer/scheduler lost, training couldn't truly resume | Optimizer + scheduler state saved in `trainer_state.pt` |
| #T4 | `Trainer.load_checkpoint` | Restored only model weights and step count | Now restores optimizer, scheduler, early-stopping state |
| #T5 | `get_scheduler` | Missing `"constant_with_warmup"` type | Added |
| #T6 | `Trainer` | `bf16=True` on CPU silently fell back to float32 with no warning | Warning emitted; autocast errors caught gracefully |
| #T7 | `DataCollatorForMaskedLM` | Random-replacement tokens were drawn from `[0, vocab_size)` including `[PAD]`/`[BOS]`/`[EOS]` | Now draws from `[num_special_tokens, vocab_size)` |
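
The `"constant_with_warmup"` schedule added in #T5 is conventionally a linear ramp followed by a flat learning rate. The sketch below assumes that conventional definition; the function name and the `base_lr` value are illustrative and are not taken from `get_scheduler`.

```python
def constant_with_warmup_factor(step: int, warmup_steps: int) -> float:
    """LR multiplier: linear ramp from 0 to 1 over warmup_steps, then 1."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return 1.0


base_lr = 3e-4  # hypothetical base learning rate
lrs = [base_lr * constant_with_warmup_factor(s, warmup_steps=4) for s in range(7)]
```

A function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's base learning rate by the returned factor.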

### New Features (v0.0.3)

**`KOLMOGformerConfig` additions:**
- `context_dropout` — independent dropout on the context vector path (default `0.0`).
- `ffn_type` — FFN activation: `"gelu"` (default) | `"silu"` | `"swiglu"`.
- `max_position_embeddings_dynamic` — RoPE cache auto-extends beyond limit instead of erroring (default `True`).
- `validate()` — called in `__post_init__`; surfaces config errors early with helpful messages.
- `__repr__` — readable summary of key config fields.

**`TrainingArguments` additions:**
- `early_stopping_patience` — stop after N evaluations without improvement.
- `gradient_checkpointing` — enable memory-efficient training automatically.
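
The patience rule behind `early_stopping_patience` can be sketched as a small counter. This is a minimal illustration that assumes lower evaluation loss is better; the `EarlyStopper` class and its `min_delta` parameter are hypothetical and not part of the library.

```python
class EarlyStopper:
    """Stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta  # smallest decrease that counts as progress
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, eval_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0  # improvement resets the counter
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=3)
decisions = [stopper.step(loss) for loss in [1.0, 0.9, 0.95, 0.92, 0.91]]
```

Here the best loss (0.9) is followed by three evaluations that fail to beat it, so only the last call returns `True`.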

---

## Installation

```bash
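pip install -e .
# Optional extras
pip install -e ".[hf]"           # Hugging Face transformers + tokenizers
pip install -e ".[safetensors]"  # safetensors checkpoint format
pip install -e ".[flash]"        # Flash Attention 2
pip install -e ".[quant]"        # bitsandbytes quantization
pip install -e ".[all]"          # hf + safetensors + quant (flash-attn not included)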
```

---

## Quick Start

### KOLMOG — Attention-Free Causal LM

```python
import torch
from kolmgformers import KOLMOGformerConfig, KOLMOGformerForCausalLM

config = KOLMOGformerConfig(
    vocab_size=32000,
    hidden_size=512,
    num_channels=8,
    num_layers=6,
    causal=True,
    use_nce=True,    # Normalized Context Extraction
    use_wcc=True,    # Weighted Cumulative Context (v0.0.2+)
    use_plkan=True,  # Piecewise Linear KAN — 3-5x faster (v0.0.2+)
)
model = KOLMOGformerForCausalLM(config)
print(config)  # v0.0.3: readable repr

ids = torch.tensor([[1, 42, 100]])
out = model.generate(ids, max_new_tokens=50, temperature=0.8)
```
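
`generate()` also applies a repetition penalty, which #K6 made loop-free. The sketch below shows one loop-free formulation using `gather` and `scatter`; it differs from the `unique()` + `scatter_` approach named in the changelog and assumes the common convention of dividing positive logits and multiplying negative ones by the penalty. The function name is hypothetical.

```python
import torch


def apply_repetition_penalty(
    logits: torch.Tensor, generated: torch.Tensor, penalty: float
) -> torch.Tensor:
    """logits: (batch, vocab); generated: (batch, seq) token ids seen so far."""
    # Pick out the logits of every token that has already been generated.
    seen = logits.gather(1, generated)
    # Make seen tokens less likely: shrink positive logits, push negative ones down.
    seen = torch.where(seen < 0, seen * penalty, seen / penalty)
    # Write the penalised values back; repeated ids write the same value.
    return logits.scatter(1, generated, seen)


logits = torch.tensor([[2.0, -2.0, 1.0]])
generated = torch.tensor([[0, 1, 1]])  # token 1 appears twice
penalised = apply_repetition_penalty(logits, generated, penalty=2.0)
```

Token 2 has not been generated, so its logit is left unchanged, and token 1 is penalised once even though it appears twice.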

### KOLMOG — Training with Early Stopping

```python
from kolmgformers import (
    KOLMOGTrainer, KOLMOGTrainingArguments,
    KOLMOGDataCollatorForCausalLM,
)

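# Assumes `model` from the previous example and that `train_ds` / `val_ds`
# are datasets you have already prepared.
args = KOLMOGTrainingArguments(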
    output_dir="runs/my_run",
    num_train_epochs=10,
    early_stopping_patience=3,   # v0.0.3: stop after 3 bad evals
    gradient_checkpointing=True, # v0.0.3: save memory on long seqs
    evaluation_strategy="steps",
    eval_steps=500,
)
trainer = KOLMOGTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=KOLMOGDataCollatorForCausalLM(pad_token_id=0),
)
trainer.train()
```

### KOLMOG — True Checkpoint Resume

```python
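# v0.0.3: optimizer, scheduler, and early-stopping state are restored (#T3, #T4),
# so training resumes where it stopped. `trainer` is the KOLMOGTrainer built above.
trainer.load_checkpoint("runs/my_run/checkpoint-5000")
trainer.train()  # continues from the restored state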
```
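
Resuming only works if the checkpoint on disk is intact, which is what the atomic write in #K13 protects. The pattern (`tempfile` + `os.replace`) can be sketched with the standard library alone; `atomic_write_bytes` is a hypothetical helper, not the library's `save_pretrained`.

```python
import os
import tempfile


def atomic_write_bytes(path: str, payload: bytes) -> None:
    """Write payload so that `path` holds either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # On any failure, remove the partial temp file and leave `path` untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

For a model checkpoint, the `f.write(payload)` call could be replaced with `torch.save(state_dict, f)`, since `torch.save` accepts an open binary file object.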

### OMG — Diffusion LM

```python
import torch
from kolmgformers import OMGConfig, OMGModel

config = OMGConfig(vocab_size=32000, hidden_size=768, num_layers=12)
model = OMGModel(config)

prompt = torch.tensor([[1, 42]])  # (batch=1, seq=2) prompt token ids
out = model.generate(prompt, new_tokens=128, steps=10)
```

---

## Architecture: KOLMOG

KOLMOG builds on the Kolmogorov-Arnold representation theorem, composing inner functions φ with outer functions Φ:

```
F(X) = Σ_q Φ_q( Σ_i φ_{q,i}( xᵢ ⊕ eᵢ ⊕ c_{q,i} ) )
```

Key innovations:
- **KAN layers** — learnable B-spline activations per edge (not fixed non-linearities)
- **NCE** — Normalized Context Extraction: jackknife leave-one-out mean context
- **WCC** — Weighted Cumulative Context: attention-like token selectivity at O(n·d)
- **PLKAN** — Piecewise Linear KAN: 3–5× faster than B-spline, same expressivity
- **No attention** — O(n·d) time, O(d) memory (independent of sequence length)
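
The causal cumulative context behind the O(n·d) cost can be sketched as a masked running sum, using the exclusive-prefix form given in #K10 (`C^{<i} = cumsum[i] − kappa_w[i]`). This is a simplified illustration: it assumes per-token features `kappa_w` of shape `(batch, seq, dim)`, omits any normalisation or weighting the real extractor applies, and uses a hypothetical function name.

```python
import torch


def exclusive_prefix_context(
    kappa_w: torch.Tensor, attention_mask: torch.Tensor
) -> torch.Tensor:
    """kappa_w: (batch, seq, dim); attention_mask: (batch, seq), 1 = token, 0 = pad."""
    # Zero out pad positions so they cannot contribute to any context (#K11).
    masked = kappa_w * attention_mask.unsqueeze(-1).to(kappa_w.dtype)
    # Inclusive running sum along the sequence axis: one pass, O(n*d).
    inclusive = masked.cumsum(dim=1)
    # Subtract each token's own term so position i sees only positions < i (#K10).
    return inclusive - masked


kappa_w = torch.tensor([[[1.0], [2.0], [3.0], [4.0]]])  # batch=1, seq=4, dim=1
attention_mask = torch.tensor([[1, 1, 1, 0]])           # last position is padding
context = exclusive_prefix_context(kappa_w, attention_mask)
```

Position 0 receives an all-zero context, so the current token never leaks into its own context, and the padded position adds nothing to the sum.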

---

## Architecture: OMG

Parallel Diffusion Language Model with:
- GQA / MLA / Sliding-Window / Linear / Block-Sparse attention
- MoE (dense + soft MoE)
- DS-PDLM dual-stream (understanding + generation)
- LoRA / DoRA PEFT
- TASA + MFS + DI efficiency trilogy
- TWE temporal embeddings, NCA neuro-creative routing

---

## Bug Fix History

| Version | Changes |
|---------|-------|
| v0.0.1 | #K1–#K4 (config, imports, KANLayer rightmost knot) |
| v0.0.2 | WCC + PLKANLayer added |
| v0.0.3 | #K5–#K15 (context mask, generate, PLKAN, phi layers, causal prefix, gradient checkpointing) + #T1–#T7 (early stopping, optimizer save/load, bf16 CPU, special token masking) |

---

## License

Apache-2.0
