Metadata-Version: 2.4
Name: thermoclaw
Version: 0.1.6
Summary: Catches training failures before they waste GPU hours. Thermodynamic diagnostics for any PyTorch optimiser — per-layer collapse detection, entropy decomposition, and plain-English recommendations.
Author: Christopher Gardner
License: Apache-2.0
Project-URL: Homepage, https://github.com/christophergardner-star/Thermoclaw
Project-URL: Repository, https://github.com/christophergardner-star/Thermoclaw
Project-URL: Bug Tracker, https://github.com/christophergardner-star/Thermoclaw/issues
Keywords: deep-learning,thermodynamics,optimizer,diagnostics,entropy
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: numpy>=1.24
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7; extra == "viz"
Provides-Extra: hf
Requires-Dist: transformers>=4.30; extra == "hf"
Requires-Dist: accelerate>=1.1.0; extra == "hf"
Provides-Extra: all
Requires-Dist: matplotlib>=3.7; extra == "all"
Requires-Dist: transformers>=4.30; extra == "all"
Requires-Dist: accelerate>=1.1.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: matplotlib>=3.7; extra == "dev"
Dynamic: license-file

# Thermoclaw

**Thermoclaw catches training failures before they waste GPU hours.**

A lightweight diagnostic layer for any PyTorch optimiser. One line wraps AdamW, SGD, or anything else — and you get real-time alerts when your training is about to collapse, plateau, or diverge, with layer-level explanations of why.

Works alongside W&B, TensorBoard, and the HuggingFace Trainer: they show you *what* your loss curve is doing; Thermoclaw tells you *why* it looks the way it does.

---

## It found the problem. Acting on it worked.

We trained GPT-2 small (124M parameters) from scratch on WikiText-103 with SGD + momentum and `weight_decay=5.0`. At **step 19**, `CollapseDetector` flagged a HIGH-confidence weight-decay collapse:

```
[HIGH] Weight decay is eroding 2 embedding layers: param norms dropped
36% during training. Reduce weight_decay.
```

We branched at that point — one arm continued unchanged, the other reduced weight decay to `0.01`.

| Run | Final PPL (600 steps post-branch) |
|-----|-----------------------------------|
| Unmodified (`wd=5.0` throughout) | **50,257** — model completely dead, outputting uniform noise |
| Thermoclaw intervention (`wd→0.01`) | **1,377** |
| **Improvement** | **36× lower perplexity** |

Replicated across 3 seeds (42, 137, 2024). The unmodified arm locks at `ppl = vocab_size` — zero information. The intervention arm learns.

No hyperparameter search. No manual inspection. One warning, one change. Because the two arms differ only in that single change, the comparison is causal, not merely correlational.

---

## Install

```bash
pip install thermoclaw                    # core (PyTorch + NumPy only)
pip install "thermoclaw[viz]"             # + Matplotlib dashboards
pip install "thermoclaw[hf]"              # + HuggingFace Trainer callback
pip install "thermoclaw[all]"             # everything
```

---

## Quick Start

### Observe any optimiser

```python
import torch

from thermoclaw import Observer, diagnose

model     = YourModel()
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
observer  = Observer(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    optimiser.zero_grad()

report = diagnose(observer)
print(report)
observer.plot_dashboard(save_path='dashboard.png')
```

### HuggingFace Trainer (one line)

```python
from transformers import Trainer

from thermoclaw.integrations.huggingface import ThermoclawCallback

trainer = Trainer(
    model=model,
    args=training_args,
    callbacks=[ThermoclawCallback()],      # ← that's it
)
trainer.train()
# Dashboard PNG, CSV, and diagnostic report saved to output_dir automatically
# Thermodynamic metrics logged to W&B / TensorBoard automatically
```

See [BLOG_HF.md](BLOG_HF.md) for a full walkthrough.

### Catch weight-decay collapse in real time

```python
import torch

from thermoclaw import CollapseDetector, make_param_groups

# model, criterion, and loader as in the Quick Start above.
# Per-layer groups give full resolution (recommended)
groups    = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
detector  = CollapseDetector(model, optimiser)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    detector.step()             # call before zero_grad
    optimiser.zero_grad()

    if detector.is_collapsing:
        for pg in optimiser.param_groups:
            pg['weight_decay'] *= 0.1

recs = detector.get_recommendations()
# → ["[HIGH] Weight decay collapse in 5 mlp layers: grad/param ratio
#     dropped 4.2× from early to late training. Reduce weight_decay."]
```

`is_collapsing` fires as soon as any HIGH or MEDIUM signal is confirmed. See [Known Issues](KNOWN_ISSUES.md) for AdamW behaviour at typical weight-decay values.

### Decompose entropy into productive vs overhead

```python
from thermoclaw import Observer, EntropySplit, diagnose

# model, optimiser, criterion, and loader as in the Quick Start above
observer = Observer(model, optimiser)
splitter = EntropySplit(model, optimiser, observer)

for batch in loader:
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    observer.step(loss=loss.item())
    splitter.step()             # decomposes entropy each step
    optimiser.zero_grad()

report = diagnose(observer, splitter)
print(report)
# Example output:
#   [HIGH] Weight decay is the dominant entropy source for 12 attention
#   layers (mean R_ie=4.2). Consider reducing weight_decay for attention
#   layers by 4-8×, or excluding them.

splitter.plot_entropy_split(save_path='entropy_split.png')
```

### Thermodynamically-aware LR schedule (drop-in)

```python
import torch

from thermoclaw import ThermoScheduler

optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = ThermoScheduler(optimiser, total_steps=10000)

for step, batch in enumerate(loader):
    loss = criterion(model(batch))
    loss.backward()
    optimiser.step()
    scheduler.step()            # replaces cosine_scheduler.step()
    optimiser.zero_grad()
```

---

## What am I looking at?

> You've run Thermoclaw and have numbers. Here is what they mean in plain English.

### Entropy ratio `r_l = σ / σ*`

The ratio of how much thermodynamic work layer `l` is doing now versus its own historical baseline.

| Value | Meaning | What to do |
|-------|---------|------------|
| `r ≈ 1.0` | Layer is at equilibrium — learning at a steady, sustainable rate | Nothing |
| `r < 0.85` | Under-trained — this layer is doing less work than its baseline | Consider a higher LR or check for gradient starvation |
| `r > 1.15` | Over-trained — this layer is overheating | Consider a lower LR, more weight decay, or gradient clipping |
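
For intuition, here is a minimal sketch of the ratio for a single layer, using `σ = η‖g‖²` from the "What Thermoclaw measures" table below. The exponential-moving-average baseline for `σ*` is an assumption made for illustration; Thermoclaw's own baseline may be defined differently.

```python
import torch

def entropy_ratio(layer_params, lr, baseline, ema=0.99):
    """Sketch of r = sigma / sigma*: sigma = lr * ||g||^2 for one layer,
    with sigma* approximated by an exponential moving average (assumption)."""
    grad_sq = sum(p.grad.pow(2).sum().item() for p in layer_params if p.grad is not None)
    sigma = lr * grad_sq                        # entropy production this step
    baseline = ema * baseline + (1 - ema) * sigma
    r = sigma / max(baseline, 1e-12)            # r ~ 1.0 means the layer is at equilibrium
    return r, baseline
```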

### Dispersion `D = Var(r_l)` across layers

How uniformly your model is learning across all layers simultaneously.

| Value | Meaning |
|-------|---------|
| `D < 0.05` | Clean — all layers are coordinated, gradients are coherent |
| `D ≈ 0.1–0.4` | Some inter-layer tension — often a sign of noisy labels or an aggressive LR |
| `D > 0.5` | Significant fragmentation — some layers are updating aggressively whilst others stall |
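
A tiny worked example with hypothetical per-layer ratios:

```python
import numpy as np

# Hypothetical r_l values for a six-layer model: one layer stalled, one overheating
r = np.array([0.95, 1.02, 1.10, 0.40, 1.85, 1.01])
D = r.var()
print(f"D = {D:.2f}")   # ~0.18: some inter-layer tension
```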

### Internal/external entropy ratio `R_ie = d_iS / d_eS`

The single most actionable number. It tells you how much of the optimiser's entropy budget is spent on *overhead* (weight decay, momentum friction, noise) versus *productive gradient descent*.

| Value | Meaning | What to do |
|-------|---------|------------|
| `R_ie < 1` | Most entropy is productive — training is efficient | Nothing |
| `R_ie 1–2` | Modest overhead — normal for most runs | Monitor |
| `R_ie 2–5` | **Warning** — overhead is dominating. Check weight decay and momentum | Reduce `weight_decay`, lower `β₁`, or clip gradients |
| `R_ie > 5` | **Critical** — the optimiser is mostly generating heat. Training may plateau or collapse | Intervene immediately |
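
Applying the bands above programmatically; the layer names and values here are hypothetical:

```python
r_ie = {"embed": 7.5, "attn.0": 4.2, "mlp.0": 1.3, "attn.1": 0.8}

for name, value in r_ie.items():
    if value > 5:
        status = "CRITICAL, intervene immediately"
    elif value > 2:
        status = "warning, check weight decay and momentum"
    elif value > 1:
        status = "modest overhead, monitor"
    else:
        status = "efficient"
    print(f"{name}: R_ie={value} ({status})")
```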

### Gradient coherence `ρ = cos(g_t, g_{t-1})`

How consistent the gradient direction is from step to step.

| Value | Meaning |
|-------|---------|
| `ρ > 0.3` | Coherent — training is stable, loss is likely decreasing smoothly |
| `ρ ≈ 0` | Incoherent — gradients are effectively a random walk. Check batch size, LR, and data ordering |
| `ρ < -0.1` | Oscillating — gradients are reversing direction. Reduce LR or increase batch size |

> **Non-obvious:** high momentum (`β₁ → 1`) *increases* ρ by smoothing consecutive gradient vectors. A high ρ reading does not necessarily mean training is healthy if the equilibrium fraction is low. Use both together.
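
The coherence statistic itself is just a cosine similarity between consecutive flattened gradient vectors. A minimal sketch (not the library's internal code), called right after `loss.backward()` each step:

```python
import torch
import torch.nn.functional as F

def gradient_coherence(model, prev_grad=None):
    """Return (rho, current flattened gradient). rho is None on the first step."""
    flat = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
    rho = None if prev_grad is None else F.cosine_similarity(flat, prev_grad, dim=0).item()
    return rho, flat.detach().clone()
```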

### Equilibrium fraction `eq_frac`

The fraction of recent steps where layers were operating at or near their equilibrium entropy ratio (`r ≈ 1`). Think of it as a "steady-state score".

| Value | Meaning |
|-------|---------|
| `eq_frac > 0.5` | More than half of steps are at equilibrium — stable training |
| `eq_frac < 0.2` | Training is rarely at equilibrium — unstable. Check LR schedule and weight decay |
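
A rough sketch of the statistic, assuming "near equilibrium" means the `0.85–1.15` band from the `r_l` table above; Thermoclaw's exact window and threshold may differ.

```python
import numpy as np

# Synthetic history of the mean entropy ratio over the last 200 steps
rng = np.random.default_rng(0)
r_history = rng.normal(loc=1.0, scale=0.2, size=200)

eq_frac = np.mean((r_history > 0.85) & (r_history < 1.15))
print(f"eq_frac = {eq_frac:.2f}")
```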

### CollapseDetector confidence levels

| Level | Trigger | Action |
|-------|---------|--------|
| `[HIGH]` | Grad/param ratio dropped >2× AND confirmed across multiple layers | Act immediately |
| `[MEDIUM]` | Clear signal, moderate severity | Investigate |
| `[LOW]` | Signal present, multiple contributing sources | Informational |

---

## What Thermoclaw measures

| Quantity | Symbol | What it means |
|----------|--------|---------------|
| Entropy production | `σ_l = η‖g‖²` | How much thermodynamic work each layer is doing |
| Entropy ratio | `r_l = σ/σ*` | 1.0 = equilibrium. <0.85 = under-trained. >1.15 = over-trained |
| **External entropy** | **`d_eS`** | **Entropy that reduces loss — productive learning** |
| **Internal entropy** | **`d_iS`** | **Entropy from weight decay, momentum, noise — overhead** |
| **Internal/external ratio** | **`R_ie = d_iS/d_eS`** | **The diagnostic ratio. >2 = warning, >5 = critical** |
| Grad/param ratio | `‖g‖/‖θ‖` | CollapseDetector signal. A falling trend signals weight-decay erosion |
| Dispersion | `D = Var(r_l)` | Inter-layer training uniformity |
| Gradient alignment | `ρ = cos(g_t, g_{t-1})` | Step coherence. Negative = oscillation |
| Parameter distance | `E = ‖θ−θ₀‖²` | How far weights have moved from initialisation |
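
The last two rows are cheap to compute directly with plain PyTorch. A sketch, where `theta0` is an assumed snapshot of the parameters taken once at initialisation:

```python
import torch

@torch.no_grad()
def layer_signals(model, theta0):
    """Per-parameter grad/param ratio and squared distance from initialisation."""
    signals = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        ratio = (p.grad.norm() / (p.norm() + 1e-12)).item()   # falling trend -> weight-decay erosion
        dist = (p - theta0[name]).pow(2).sum().item()          # E = ||theta - theta0||^2
        signals[name] = (ratio, dist)
    return signals

# theta0 = {n: p.detach().clone() for n, p in model.named_parameters()}  # taken at init
```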

---

## The `d_iS / d_eS` decomposition

Standard training observes total loss and calls it a day. But total entropy production `σ` conflates two fundamentally different thermodynamic flows:

- **`d_eS` (external)** — gradient-driven parameter updates that reduce loss. This is productive work.
- **`d_iS` (internal)** — entropy from weight decay, momentum friction, and stochastic noise. This is heat.

When `d_iS >> d_eS`, the optimiser is spending most of its entropy budget on overhead. Thermoclaw decomposes `d_iS` further:

- **`d_iS_wd`** — weight-decay contribution
- **`d_iS_momentum`** — momentum friction
- **`d_iS_noise`** — stochastic gradient noise

This tells you exactly which layers, at which step, are wasting compute — and why.
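
As a rough illustration of the split for an AdamW-style (decoupled) update: treat the gradient term as the external flow and the decay term as the weight-decay part of the internal flow. This is a simplified sketch, not Thermoclaw's exact decomposition, which also accounts for momentum and noise.

```python
import torch

@torch.no_grad()
def wd_split_sketch(param, lr, weight_decay):
    """d_eS ~ lr * ||grad||^2 (productive), d_iS_wd ~ lr * ||wd * theta||^2 (overhead)."""
    d_eS = lr * param.grad.pow(2).sum().item()
    d_iS_wd = lr * (weight_decay * param).pow(2).sum().item()
    return d_eS, d_iS_wd
```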

---

## Per-layer parameter groups

For full per-layer resolution, use `make_param_groups`:

```python
from thermoclaw import make_param_groups

groups    = make_param_groups(model, lr=3e-4, weight_decay=0.01)
optimiser = torch.optim.AdamW(groups)
observer  = Observer(model, optimiser)
```
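
For intuition, per-layer groups are roughly what you would get by building one optimiser param group per top-level module by hand. A sketch, assuming `make_param_groups` produces something of this shape; the exact fields it sets may differ:

```python
import torch

groups = [
    {"params": list(m.parameters()), "lr": 3e-4, "weight_decay": 0.01, "name": name}
    for name, m in model.named_children()
    if any(p.requires_grad for p in m.parameters())
]
optimiser = torch.optim.AdamW(groups)   # extra keys like "name" are kept alongside each group
```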

---

## Confidence scoring

Recommendations are **conservative**. Thermoclaw only flags issues where the physics signal is unambiguous.

- **[HIGH]** — Single dominant source (>60% of `d_iS`), `R_ie > 5`, consistent across regions. Safe to act on.
- **[MEDIUM]** — Clear signal but moderate `R_ie` (2–5). Worth investigating.
- **[LOW]** — Signal present but multiple sources contribute. Informational only.

Wrong recommendations that sound authoritative destroy trust faster than no recommendations at all.

---

## Validated

Three-tier validation on H100 80 GB (Pythia-410M, WikiText-103, bfloat16):

| Tier | Test | Result |
|------|------|--------|
| T1: Analytical | σ, ρ, d_iS_wd, d_eS+d_iS=σ, E, D — 8 ground-truth checks | **8/8 PASS** |
| T2A: High LR | `lr=3e-2` → `R_ie=1.6×10²⁰`, `eq=0.014`, flagged HIGH | **PASS** |
| T2B: High WD | `wd=5.0` → unhealthy, flagged MEDIUM | **PASS** |
| T2C: Over-damped | `β₁=0.999` → `ρ=0.63` (vs baseline 0.39), `eq=0.15`, flagged HIGH | **PASS** |
| T2D: Baseline | `lr=3e-4 / wd=0.01 / β₁=0.9` → no collapse or WD pathology flagged | **PASS** |
| T3: Intervention | CollapseDetector fires step 19 HIGH (SGD `wd=5.0`), PPL gap +48,880 vs dead arm (3/3 seeds) | **PASS** |

---

## Origin

Thermoclaw's thermodynamic framework comes from the [EPTO (Entropy-Production Targeted Optimisation)](https://github.com/christophergardner-star/EPTO_-) research project. The key insight: neural network training is a non-equilibrium thermodynamic process, and the quantities that matter — entropy production, entropy ratios, equilibrium fraction — can be measured for *any* optimiser, not just EPTO.

---

## Licence

Apache 2.0
