Metadata-Version: 2.4
Name: diffmax
Version: 0.1.0
Summary: α-Diffmax: sparse power-law normalizing operator for attention (0 < α < 1)
Author-email: dengzhaowork@gmail.com
License: MIT
Keywords: attention,sparse,entmax,normalizing-operator,deep-learning,pytorch
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=8.4.2; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: scipy>=1.10; extra == "dev"
Provides-Extra: examples
Requires-Dist: matplotlib; extra == "examples"
Requires-Dist: numpy; extra == "examples"
Requires-Dist: pandas; extra == "examples"
Provides-Extra: monitoring
Requires-Dist: swanlab>=0.3; extra == "monitoring"
Dynamic: license-file

# diffmax

**α-Diffmax** is a sparse, power-law normalizing operator for attention (0 < α < 1). It generalises softmax: lower α produces sparser attention weights while each output row remains a valid probability distribution. The threshold τ is found by bisection, which keeps the forward pass numerically stable, and the operator is differentiable end-to-end.
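
To make the "threshold by bisection" idea concrete, here is a minimal pure-Python sketch. It uses sparsemax as a stand-in, because the α-Diffmax formula is not spelled out in this README. The function name `sparsemax_bisect`, the bracket choice, and the final renormalisation are assumptions of the sketch, not the library's implementation.

```python
def sparsemax_bisect(scores, n_iter=50):
    """Sparsemax via bisection on the threshold tau.

    Stand-in illustration only: this is NOT the alpha-Diffmax operator.
    Weights are max(s - tau, 0), with tau chosen so that they sum to 1.
    """
    hi = max(scores)   # at tau = max(scores) the total mass is 0 (too little)
    lo = hi - 1.0      # at tau = max(scores) - 1 the total mass is >= 1
    for _ in range(n_iter):
        tau = 0.5 * (lo + hi)
        mass = sum(max(s - tau, 0.0) for s in scores)
        if mass >= 1.0:
            lo = tau   # threshold too low: still too much mass
        else:
            hi = tau   # threshold too high: too little mass
    tau = 0.5 * (lo + hi)
    weights = [max(s - tau, 0.0) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]  # remove the residual bisection error


scores = [1.0, 0.8, 0.1, -1.0]
weights = sparsemax_bisect(scores)
print(weights)  # approximately [0.6, 0.4, 0.0, 0.0]
```

The total mass is monotone in τ, so each iteration halves the bracket. Entries below the threshold come out as exact zeros rather than small positive values.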

## Installation

```bash
pip install diffmax
```

Colab / CPU-only environments work out of the box. A CUDA GPU with Triton unlocks the fused kernel automatically.

```python
from diffmax import diffmax_bisect
```

Optional extras:

```bash
pip install "diffmax[monitoring]"   # swanlab metric logging
pip install "diffmax[dev]"          # pytest, pytest-cov, scipy (for development)
pip install "diffmax[examples]"     # matplotlib, numpy, pandas
```

## Quick start

```python
import torch
from diffmax import diffmax_bisect, DiffmaxBisectModule

# Functional API — drop-in for torch.softmax
scores = torch.randn(2, 8, 64, 64)          # (B, H, L, L)
weights = diffmax_bisect(scores, alpha=0.85, dim=-1)
# each row of weights sums to 1 (up to floating-point error); many entries are exactly zero

# Module API — use inside nn.Sequential or nn.Module
layer = DiffmaxBisectModule(alpha=0.85, dim=-1)
weights = layer(scores)
```

Learnable α via a β→α map:

```python
import torch
from diffmax import DiffmaxBisectModule, HillMap

layer = DiffmaxBisectModule(alpha=0.85, dim=-1, alpha_map=HillMap(c=0.3))
# Exposes `log_beta` as a trainable parameter; optimiser tunes α.
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-3)
```

## API

| Symbol | Description |
|---|---|
| `diffmax_bisect(X, alpha, dim, ...)` | Functional forward pass |
| `DiffmaxBisectModule(alpha, dim, ...)` | `nn.Module` wrapper, optionally learnable α |
| `HillMap(c, beta0)` | Hill β→α map (recommended) |
| `TanhMap(c, beta0)` | Tanh β→α map |
| `ExpMap(c, beta0)` | Exponential β→α map |
| `diffmax_bisect_monitored(...)` | Monitored variant (requires `swanlab`) |
| `__version__` | Package version string |

### `diffmax_bisect` signature

```python
diffmax_bisect(
    X: Tensor,
    alpha: float | Tensor | Callable = 0.9,
    dim: int = -1,
    n_iter: int = 50,
    ensure_sum_one: bool = True,
) -> Tensor
```

- `alpha`: must satisfy 0 < α < 1. Scalars, broadcast tensors, and zero-arg callables are all accepted.
- `n_iter=50` is sufficient for fp32 convergence; raise to 200 for fp64.
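
As a rough back-of-envelope check on the `n_iter` guidance, the sketch below computes how narrow a bisection bracket gets after `n` halvings and compares it with machine epsilon. It assumes an initial bracket of width 1, which is not stated in this README; the library's actual bracket and stopping rule may differ. `bracket_width` is a hypothetical helper, not part of the package.

```python
import sys

FP32_EPS = 2.0 ** -23               # machine epsilon of IEEE-754 float32
FP64_EPS = sys.float_info.epsilon   # machine epsilon of float64 (2**-52)


def bracket_width(n_iter, initial_width=1.0):
    """Width of the bisection bracket after n_iter halvings (assumed width 1 at start)."""
    return initial_width * 2.0 ** -n_iter


for n in (50, 200):
    w = bracket_width(n)
    print(f"n_iter={n}: width={w:.3e}, below fp32 eps: {w < FP32_EPS}, below fp64 eps: {w < FP64_EPS}")
```

Under that assumption, 50 halvings already put the bracket far below fp32 resolution but still slightly above fp64 epsilon, which is consistent with raising `n_iter` for fp64 inputs.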

## Backends

| Device | Implementation | Notes |
|---|---|---|
| CPU | Pure PyTorch bisection | Default; always available |
| CUDA (NVIDIA) | Triton fused kernel | Auto-selected when Triton is installed |
| ROCm (AMD) | Placeholder | Plugs into CUDA dispatch key |
| Ascend NPU | Placeholder | Activated when `torch_npu` is installed |

Rows with N > 4096 or dtype `float64` fall back to the CPU backend even on CUDA.
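
The selection rule can be restated as a small predicate. This is only an illustration of the rule described here: `pick_backend` and the returned labels are hypothetical and do not exist in the package, and the behaviour on CUDA without Triton is an assumption, since the table does not state it.

```python
MAX_FUSED_ROW = 4096  # row-length limit quoted in this section


def pick_backend(device: str, row_len: int, dtype: str, has_triton: bool) -> str:
    """Hypothetical restatement of the documented fallback rule."""
    if device != "cuda" or not has_triton:
        # Assumption: without CUDA + Triton, the pure PyTorch path is used.
        return "pytorch-bisection"
    if row_len > MAX_FUSED_ROW or dtype == "float64":
        # Documented fallback: long rows and float64 skip the fused kernel.
        return "pytorch-bisection"
    return "triton-fused"


print(pick_backend("cuda", 1024, "float32", True))
print(pick_backend("cuda", 8192, "float32", True))
print(pick_backend("cuda", 1024, "float64", True))
```

In practice this matters mostly for benchmarking: very long rows or `float64` inputs will not exercise the fused kernel even on a GPU.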

## Development

```bash
# Install with uv (recommended)
uv venv && uv sync --extra dev

# Run tests
uv run pytest tests/ -v

# Install with conda / pip
conda create -n diffmax python=3.9 -y && conda activate diffmax
pip install -r requirements.txt && pip install -e ".[dev]"
pytest tests/ -v
```

CUDA benchmarks (require a GPU):

```bash
python -m benchmarks.bench_forward
python -m benchmarks.bench_backward
python -m benchmarks.bench_e2e_attention
```

## Monitoring

Install `swanlab` to log per-call metrics:

```bash
pip install "diffmax[monitoring]"
```

```python
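import torch
import swanlab
from diffmax import diffmax_bisect_monitored

swanlab.init(project="diffmax-experiments")

scores = torch.randn(2, 8, 64, 64)  # (B, H, L, L), as in the quick start
global_step = 0                     # step counter from your training loop
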
Y = diffmax_bisect_monitored(
    scores, alpha=0.85, dim=-1,
    name_prefix="diffmax/encoder_0",
    step=global_step,
)
```

Logged keys: `{prefix}/forward/*`, `{prefix}/alpha/*`, `{prefix}/convergence/*`, `{prefix}/backward/*`.

## Dispatch flow

```
diffmax_bisect(X, alpha, ...)
    → _DiffmaxAutocast.apply(...)       # AMP-aware autograd.Function
        → torch.ops.diffmax.bisect(...) # PyTorch custom dispatcher op
            → CPU kernel (default)
            → CUDA / Triton kernel
            → NPU kernel

loss.backward()
    → torch.library.register_autograd  # _autograd.py
        → torch.ops.diffmax.bisect_backward(...)
            → CPU / CUDA / NPU backward kernel
```

## License

MIT
