Metadata-Version: 2.4
Name: tsugi-mend
Version: 0.1.0
Summary: tsugiai-mend-sdk. Maximum-uplift cross-rack distributed-training reducer built on Decoupled DiLoCo + DES-LOC + async tensor parallelism + FALCON fail-slow mitigation. Patent-independent by deliberate construction; see LICENSE preamble.
Project-URL: Homepage, https://tsugilabs.ai
Project-URL: Repository, https://github.com/tsugiai/tsugi-mend
Project-URL: Unified-SDK, https://github.com/tsugiai/tsugi
Project-URL: Companion-SDK-Patent-Aligned, https://github.com/tsugiai/tsugi-kpool
Author-email: Tong Liu <tong@tsugicinema.com>
License: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Keywords: async-tp,cross-rack,decoupled-diloco,des-loc,distributed-training,failslow,pytorch
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: numpy>=1.26
Requires-Dist: torch>=2.4
Provides-Extra: benchmark
Requires-Dist: accelerate>=0.34; extra == 'benchmark'
Requires-Dist: datasets>=2.20; extra == 'benchmark'
Requires-Dist: evaluate>=0.4; extra == 'benchmark'
Requires-Dist: hf-transfer>=0.1; extra == 'benchmark'
Requires-Dist: matplotlib>=3.8; extra == 'benchmark'
Requires-Dist: protobuf>=4.25; extra == 'benchmark'
Requires-Dist: pyyaml>=6.0; extra == 'benchmark'
Requires-Dist: scipy>=1.13; extra == 'benchmark'
Requires-Dist: sentencepiece>=0.2; extra == 'benchmark'
Requires-Dist: transformers>=4.45; extra == 'benchmark'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# tsugiai-mend-sdk

Maximum-uplift cross-rack distributed-training reducer for PyTorch.

It is a software-only drop-in that replaces the cross-rack data-parallel
all-reduce with an integration of public-art techniques designed to
maximize measured tokens-per-second uplift on realistic cross-rack
topologies:

- **Decoupled DiLoCo** (Douillard et al., arXiv:2604.21428, April 2026): independent asynchronous learners, minimum quorum, adaptive grace window, token-weighted merge.
- **DES-LOC / Local Adam** (Iacob et al., arXiv:2505.22549, May 2025; ICLR 2026): desynchronized synchronization periods for adaptive-optimizer momenta.
- **Async tensor parallelism** (PyTorch / TorchTitan, September 2024): intra-node overlap of all-gather / reduce-scatter with matmul.
- **FALCON fail-slow mitigation** (arXiv:2410.12588, October 2024): sliding-window z-score detection of slow nodes to exclude them from the current quorum round.

The SDK keeps intra-rack TP / CP / PP / FSDP collectives unchanged (vanilla NCCL inside the NVLink domain) and replaces only the off-rack DP boundary with the four-technique stack above.
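
The sliding-window z-score idea in the FALCON bullet can be pictured with a small sketch. This is not the SDK's implementation: the class name `SlidingZScoreDetector`, the `min_samples` warm-up, and the choice to compare each node against its own recent history are assumptions made for illustration. Only the window length (50 steps) and the threshold (3.0) echo the `failslow_window_steps` and `failslow_zscore_threshold` values shown in the Quickstart.

```python
from collections import deque
from statistics import mean, pstdev


class SlidingZScoreDetector:
    """Hypothetical per-node detector: flags a step time that sits far above
    the node's own recent history (illustration only, not the SDK API)."""

    def __init__(self, window_steps: int = 50, threshold: float = 3.0,
                 min_samples: int = 10) -> None:
        self.window = deque(maxlen=window_steps)  # most recent healthy step times
        self.threshold = threshold
        self.min_samples = min_samples            # assumed warm-up before judging

    def observe(self, step_ms: float) -> bool:
        """Return True if this step looks fail-slow relative to the window."""
        slow = False
        if len(self.window) >= self.min_samples:
            mu = mean(self.window)
            sigma = pstdev(self.window)
            if sigma > 0:
                slow = (step_ms - mu) / sigma > self.threshold
        if not slow:
            # Only healthy samples update the baseline, so a slow streak
            # cannot drag the mean upward and mask itself.
            self.window.append(step_ms)
        return slow


detector = SlidingZScoreDetector()
for i in range(50):
    detector.observe(100.0 + (i % 5))  # synthetic healthy step times
print(detector.observe(500.0))  # True: a candidate to exclude from this quorum round
print(detector.observe(102.0))  # False
```

Keeping flagged samples out of the window is one possible design choice; a detector that also needs to notice a permanent slowdown would have to re-baseline at some point.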

## License and IP posture

This SDK is licensed under **Apache-2.0** with its full automatic patent grant. The SDK is patent-independent by deliberate construction: it does not exercise the K-Pool LoRA (US App. 64/060,315) or Infinity (US App. 64/055,093) patent estates that the companion SDK (`tsugiai-kpool-sdk`, also Apache-2.0) does. Read the preamble at the top of the `LICENSE` file for the full posture explanation.

The companion patent-aligned SDK at `github.com/tsugiai/tsugi-kpool` is the reduction-to-practice artifact for those two TsugiCinema patent estates. The two SDKs share zero code.

## Headline measurement

| Workload | Hardware | Measurement | Reference |
|---|---|---|---|
| **Statistical-confidence headline (Hopper 3-seed CI)** | Modal H100:1, Qwen-2.5-1.5B, 200 steps × 3 seeds at 2000ms | **+71.49% ± 2.83% (95% CI)** throughput uplift | `docs/phase2_multi_seed_ci_results.md` |
| **Cross-rack grace-window overlap on Hopper at 7B production scale** | Modal H100:1, Qwen-2.5-7B, 200 steps × 5 delays | **+76.58% at 2000ms; +39.72% at 1000ms; +19.20% at 500ms** | `docs/phase2_delay_sweep_qwen7b_results.md` |
| **Intermediate model-scale (3B) confirmation (Hopper 3-seed CI)** | Modal H100:1, Qwen-2.5-3B, 200 steps × 4 delays × 3 seeds | **+41.31% ± 0.29% at 2000ms (n=3); +20.40% ± 0.03% at 1000ms; +9.75% ± 0.04% at 500ms** | Continuation doc 2026-05-23 |
| Cross-rack grace-window overlap on Hopper at 1.5B model scale | Modal H100:1, Qwen-2.5-1.5B, 200 steps × 7 delays | +70.64% at 2000ms; +34.73% at 1000ms; +16.86% at 500ms; concurrent path constant ~30,840 tok/s across all delays | `docs/phase2_delay_sweep_results.md` |
| Cross-rack grace-window overlap on A10G | Modal A10G, SmolLM-135M, 200 steps × 7 delays | +52.75% at 2000ms (constant); +11.61% at 500ms; -0.06% overhead at 0ms; bit-exact loss preserved across all cells | `docs/phase2_delay_sweep_results.md` |
| **A10G delay-distribution stress-test sweep** | Modal A10G, SmolLM-135M, 200 steps × 4 delays × 3 distributions | **constant +52.53% / bimodal-stress +8.29% / long-tail-stress +19.44% at delay=2000ms** (parameters NOT FALCON-anchored; see briefs 06/07/08 in MasterVision) | Continuation doc 2026-05-23 |
| **Production-realistic multi-GPU FSDP + 7B model (3-seed CI)** | Modal 8xH100 FSDP FULL_SHARD, Qwen-2.5-7B + simulated 2-rack, 4 delays × 3 seeds | **+6.37% ± 1.31% at delay=2000ms (n=3; replaces the prior +7.36% single-seed)** | Continuation doc 2026-05-23 |
| Multi-GPU FSDP smaller model | Modal 8xH100 FSDP, Qwen-2.5-1.5B + simulated 2-rack, 7 delays | +3.08% at 2000ms (synthetic 8xGPU floor) | `docs/stage_c_phase2_delay_sweep_results.md` |
| Real cross-network 2-node 8xV100 (synchronous reducer; preserved baseline) | Lambda Labs commodity Ethernet, SmolLM-135M, 500 paired steps | +28.58% tokens-per-second uplift vs vanilla FSDP, bit-exact-identical loss | `docs/stage_d_proper_results.md` |
| H100 Hopper single-instance (synchronous reducer baseline) | Modal 8x H100 SXM5, Llama-3-8B, 2000 paired steps × 3 seeds | -0.97% ± 1.5% (predicted null; Hopper NVLink absorbs the synchronous-path cross-rack tax) | `docs/stage_e_results.md` |
| **H100 Hopper single-instance + concurrent orchestrator (Track A, 3-seed CI)** | Modal 8x H100 SXM5, Llama-3-8B, 200 paired steps × 3 seeds at delay=2000ms | **+2.88% ± 0.43% (n=3)** throughput uplift; loss equivalence preserved (sync 0.288, conc 0.278) | Continuation doc 2026-05-23 |
| **Item E DeepSpeed ZeRO-3 head-to-head measured orthogonality** | Modal 8x H100 SXM5, Qwen-2.5-7B + DeepSpeed ZeRO-3 (stage=3, overlap_comm=True), 200 paired steps × 3 seeds at delay=2000ms | **+29.6% concurrent vs sync (n=3 paired; per-seed jitter <1pp)**; concurrent +0.44% vs no-delay baseline. DeepSpeed's intra-iteration overlap cannot recover the outer-step wait; Mend's outer-step concurrent_outer_step recovers essentially all of it | (internal benchmark report) |
| **Real cross-network 2-pod 8xH100 (Stage E-prime; production-fabric floor; n=1 paired)** | RunPod 2x 8x H100 SXM5 over real InfiniBand or RoCE v2 3.2 Tbps (AP-IN-1), Llama-3-8B, 500 paired steps × 1 seed | **+1.42% tps with a 0.18% loss delta (near-equivalent loss, not bit-exact)**; production-fabric floor anchor; n=1 caveat is load-bearing (same order of magnitude as +1.40% baseline-only seed variance); n=3 CI pending | `docs/stage_e_prime_results.md` |

The **+71.49% ± 2.83% (95% CI; n=3) Hopper headline** is the canonical enterprise cross-rack due-diligence-grade measurement. The orchestrator's uplift is governed by the ratio `N * T_step / G`, where `N` is `sync_period_steps`, `T_step` is the per-step compute time, and `G` is `grace_window_ms`. The apparent "non-monotonicity with model size" at H100:1 (Qwen-3B at +41.31% sits below both Qwen-1.5B at +71.49% and Qwen-7B at +76.58%) is fully explained by the Qwen-7B measurement using 1/8 the tokens-per-step (seq_len=1024 mbs=1 vs seq_len=2048 mbs=4): fewer tokens per step shortens `T_step`, which lowers `N * T_step / G` and raises the measured uplift. At fixed tokens-per-step, uplift is monotonically decreasing in model size. The analytical model and the proposed `N* = ceil(G / T_step)` auto-tuner spec are documented in an internal companion brief on uplift-surface characterization.

Production-realistic multi-GPU FSDP yields a smaller honest floor (**+6.37% ± 1.31%, n=3**, paired at delay=2000ms; replaces the prior +7.36% single-seed) because 8-rank NCCL pipelining absorbs some of the synthetic delay; a real Stage D-proper measurement on an actual cross-network fabric is expected to lift this number back up.
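
The `N* = ceil(G / T_step)` rule and the tokens-per-step ratio can be checked with a few lines of Python. The sketch below is illustrative only. The function names are hypothetical, the 250 ms step time is made up, and the uplift formula assumes a simplified cost model (the synchronous path pays compute plus the full wait, the concurrent path pays whichever is longer). That cost model is an assumption of this example, not the analytical model from the internal brief.

```python
import math


def auto_tune_sync_period(grace_window_ms: float, t_step_ms: float) -> int:
    """N* = ceil(G / T_step): smallest N whose inner compute covers the grace window."""
    if t_step_ms <= 0:
        raise ValueError("t_step_ms must be positive")
    return max(1, math.ceil(grace_window_ms / t_step_ms))


def modeled_uplift(n_steps: int, t_step_ms: float, grace_window_ms: float) -> float:
    """Fractional uplift under the assumed cost model (not a measured value)."""
    compute_ms = n_steps * t_step_ms
    sync_ms = compute_ms + grace_window_ms            # wait serialized after compute
    concurrent_ms = max(compute_ms, grace_window_ms)  # wait hidden behind compute
    return sync_ms / concurrent_ms - 1.0


def tokens_per_step(seq_len: int, micro_batch_size: int) -> int:
    """Tokens processed per step for one micro-batch configuration."""
    return seq_len * micro_batch_size


# Hypothetical step time of 250 ms, chosen only for illustration.
n_star = auto_tune_sync_period(grace_window_ms=2000, t_step_ms=250)
print(n_star)                                # 8
print(modeled_uplift(n_star, 250, 2000))     # 1.0
print(modeled_uplift(2 * n_star, 250, 2000)) # 0.5

# Qwen-7B cell (seq_len=1024, mbs=1) vs the other cells (seq_len=2048, mbs=4).
print(tokens_per_step(1024, 1) / tokens_per_step(2048, 4))  # 0.125
```

Under this assumed model the uplift peaks when the inner compute exactly covers the grace window and falls off as `G / (N * T_step)` beyond that point, which is why a shorter `T_step` at fixed `N` shows a larger uplift.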

**Delay-distribution stress-test disclosure (Track D 2026-05-23)**: the constant-delay headlines (e.g., +52.53% A10G at delay=2000ms) are ceiling-case stress tests. The bimodal and log-normal variants are alternative stress-test shapes whose parameters are NOT directly FALCON-anchored: FALCON Table 2 reports only inter-node RDMA CoV=0.29 as the quantitative variance number; it does NOT publish per-iteration percentile breakdowns or bimodal characterizations. The current `bimodal` (80/20 at 50ms/base) delivers +8.29% on the same A10G + SmolLM-135M workload; the current `long_tail` (sigma=1.0) delivers +19.44%. A FALCON-CoV-anchored re-tune (95/5 bimodal, sigma~0.285 log-normal) is proposed in an internal FALCON-distribution-verification brief. Until that re-measurement lands, the +28.58% Stage D-proper Lambda V100 cross-network result remains the most defensible production-grounded headline.
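
The three delay shapes and the CoV argument can be made concrete with a short sketch. The function names are hypothetical and the parameter readings are assumptions: "80/20 at 50ms/base" is read here as 80% of draws at 50 ms and 20% at the base delay, and the long tail is read as a log-normal whose median is the base delay. The benchmark harness may define them differently. The one exact fact used is the log-normal coefficient of variation, `sqrt(exp(sigma^2) - 1)`.

```python
import math
import random


def lognormal_cov(sigma: float) -> float:
    """Coefficient of variation of a log-normal distribution with shape sigma."""
    return math.sqrt(math.exp(sigma ** 2) - 1.0)


def sample_delay_ms(kind: str, base_ms: float, rng: random.Random) -> float:
    """Draw one injected cross-rack delay (illustrative parameter readings)."""
    if kind == "constant":
        return base_ms
    if kind == "bimodal":
        # Assumed reading of "80/20 at 50ms/base".
        return 50.0 if rng.random() < 0.8 else base_ms
    if kind == "long_tail":
        # Assumed: log-normal with median base_ms and sigma=1.0.
        return rng.lognormvariate(math.log(base_ms), 1.0)
    raise ValueError(f"unknown delay distribution: {kind!r}")


rng = random.Random(0)
draws = [sample_delay_ms("bimodal", 2000.0, rng) for _ in range(1000)]
print(sorted(set(draws)))  # [50.0, 2000.0]

print(round(lognormal_cov(1.0), 3))    # 1.311
print(round(lognormal_cov(0.285), 3))  # 0.291
```

A shape parameter of 1.0 implies a CoV of about 1.31, well above the 0.29 that FALCON reports, while sigma of about 0.285 lands close to 0.29. That is consistent with the proposed re-tune.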

At every scale, the concurrent path's throughput stays nearly constant across all delays (Qwen-7B single-process: 4,300 ± 80 tok/s; Qwen-3B Hopper: 18,153 ± 4 tok/s; Qwen-1.5B Hopper: 30,840 ± 35 tok/s; SmolLM-135M A10G: 23,610 ± 80 tok/s), while the synchronous baseline's throughput falls steadily as the delay grows. The FALCON paper (arXiv:2410.12588) documents cross-rack inter-node RDMA variance (CoV=0.29) but does not characterize the per-iteration latency distribution shape that this sweep parameterizes; the delay sweep is a stress test, not a literal FALCON replay.
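
Several rows in the table report a 95% confidence interval over n=3 seeds. One common way to compute such an interval is a Student-t interval on the per-seed uplifts, sketched below. Whether the benchmark reports use exactly this method is an assumption; the function name is hypothetical and the three input values are placeholders, not measured results.

```python
import math
from statistics import mean, stdev

# Two-sided 95% Student-t critical values, keyed by sample count n (df = n - 1).
T_CRIT_95 = {2: 12.706, 3: 4.303, 4: 3.182, 5: 2.776}


def mean_with_ci95(samples: list[float]) -> tuple[float, float]:
    """Return (mean, half_width) of a 95% Student-t interval over the samples."""
    n = len(samples)
    if n not in T_CRIT_95:
        raise ValueError("this sketch only tabulates n = 2..5")
    half_width = T_CRIT_95[n] * stdev(samples) / math.sqrt(n)
    return mean(samples), half_width


# Placeholder per-seed uplifts in percent (NOT measurements from this SDK).
center, half = mean_with_ci95([10.0, 12.0, 14.0])
print(f"{center:+.2f}% ± {half:.2f}% (95% CI, n=3)")  # +12.00% ± 4.97% (95% CI, n=3)
```

With only three seeds the critical value is large (4.303), so a tight interval requires very low seed-to-seed spread.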

## Status

Stage A through Stage E-prime all pass. Phase 2 Week 1 (concurrent async-TP overlap with cross-rack reducer) shipped with the `ConcurrentOuterStep` orchestrator integrated. **Stage D-proper for Hopper cross-network real fabric is closed as a point estimate via Stage E-prime (RunPod 2x 8x H100, InfiniBand or RoCE v2 at 3.2 Tbps, n=1 paired, +1.42% production-fabric floor); n=3 CI pending.** Item E head-to-head against DeepSpeed ZeRO-3 confirms orthogonality at +29.6% concurrent vs sync on a Tier-1 hyperscaler stack. See `docs/60_day_plan.md` for the Phase 2 nine-week sprint.

## Quickstart

```python
from tsugi_mend import MendConfig, mend_init, mend_shutdown

config = MendConfig(
    quorum_min_learners=4,
    grace_window_ms=2000,
    token_weighted_merge=True,
    sync_period_steps=128,
    momentum_sync_period_steps=512,
    async_tp_enabled=True,
    # Phase 2 Week 1 (2026-05-22): orchestrator overlaps the cross-rack
    # outer-step wait with inner-step async-TP compute. Default True.
    # See docs/phase2_delay_sweep_results.md for the +52.75% measurement.
    concurrent_outer_step=True,
    failslow_zscore_threshold=3.0,
    failslow_window_steps=50,
    rack_aware=True,
    sideband_addr="tcp://0.0.0.0:51900",
    sideband_peers=("tcp://peer1:51900", "tcp://peer2:51900"),
    sideband_heartbeat_ms=100,
    diagnostics_dir="./results/mend_diag",
)

# `model` is assumed to be your already-constructed torch.nn.Module.
mend_init(model, config)
# ... train normally ...
mend_shutdown(model)
```
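
To give a feel for what `quorum_min_learners`, `grace_window_ms`, and `token_weighted_merge` control, here is a minimal pure-Python sketch. It is not the SDK's code. The function names, the rule "merge when every learner has reported, or when quorum is met and the grace window has elapsed", and the representation of updates as plain float lists are all assumptions for illustration.

```python
from typing import Sequence


def ready_to_merge(arrived: int, total: int, elapsed_ms: float,
                   quorum_min_learners: int = 4,
                   grace_window_ms: float = 2000.0) -> bool:
    """Assumed round-closing rule: close early if everyone reported, otherwise
    require the minimum quorum and let the grace window run out."""
    if arrived >= total:
        return True
    return arrived >= quorum_min_learners and elapsed_ms >= grace_window_ms


def token_weighted_merge(
    updates: Sequence[tuple[int, Sequence[float]]],
) -> list[float]:
    """Average learner updates, weighting each by the tokens it consumed."""
    total_tokens = sum(tokens for tokens, _ in updates)
    if total_tokens <= 0:
        raise ValueError("at least one update must carry a positive token count")
    merged = [0.0] * len(updates[0][1])
    for tokens, delta in updates:
        for i, value in enumerate(delta):
            merged[i] += tokens * value
    return [m / total_tokens for m in merged]


# Two hypothetical learners: the one that saw 3x the tokens dominates the merge.
print(token_weighted_merge([(100, [1.0, 2.0]), (300, [5.0, 6.0])]))  # [4.0, 5.0]
print(ready_to_merge(arrived=4, total=6, elapsed_ms=1000.0))  # False
print(ready_to_merge(arrived=4, total=6, elapsed_ms=2000.0))  # True
```

Weighting by tokens rather than by learner count means a learner that was slowed mid-round, and so processed fewer tokens, contributes proportionally less to the merged update.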

## Layout

```
src/tsugi_mend/        SDK source
tests/                 Stage A unit and integration tests (CPU-only)
benchmarks/            Stage B/C/D/E launch scripts (cloud-gated)
docs/                  architecture, benchmark protocol, convergence-equivalence sketch
examples/              minimal training-loop integration examples
scripts/               utility scripts (env audit, cost estimator)
```

## Companion SDK

For LoRA-adapter-granularity productization that exercises the K-Pool LoRA and Infinity patent estates, see `tsugiai-kpool-sdk`. The two SDKs serve different acquirer-due-diligence legs:

- **Patent moat leg (kpool):** the IP that goes into the Definitive Agreement's assignment schedule.
- **Operational uplift leg (mend):** the engineering artifact a partner can run Monday morning on their cluster.

Both legs are independent. Either can stand on its own.
