tsugiai-mend-sdk
Copyright 2026 TsugiCinema, Inc.

This product includes software developed by TsugiCinema, Inc. and is
licensed under the Apache License, Version 2.0 (see LICENSE).

This software composes public prior art from the following sources.
Each citation is to the published reference describing the technique
exercised in this codebase; the implementation here is original
TsugiCinema, Inc. code that reproduces the published mechanism.

1. Cross-rack reducer (GraceWindowSyncer state machine).
   Decoupled DiLoCo: Continual Pre-Training of Language Models
   without Aggregator Synchronization.
   Douillard, A.; Donchev, A.; Rush, A.; Riedel, S. (2026).
   arXiv:2604.21428.

2. Desynchronized optimizer momenta (DES-LOC).
   Local Adam: Globally-Sparse Local Updates for Distributed Adam.
   Mishchenko, K. et al. (2025). arXiv:2505.22549.

3. Async tensor parallelism (intra-node async-TP overlap hooks).
   TorchTitan: PyTorch Pre-training Native Library.
   Liang, W.; Liu, T.; Wright, L.; Constable, W.; Gu, A.;
   Huang, C.-C.; Zhang, I.; Feng, W.; Huang, H.; Wang, J.;
   Purandare, S.; Nadathur, G.; Idreos, S. (2024).
   PyTorch. https://github.com/pytorch/torchtitan

4. Fail-slow detection (FALCON sliding-window z-score).
   FALCON: Pinpointing and Mitigating Stragglers for Large-Scale
   Hybrid-Parallel Training.
   Wu, T. et al. (2024). arXiv:2410.12588.

5. Optional gradient compression (PowerSGD with error feedback).
   PowerSGD: Practical Low-Rank Gradient Compression for Distributed
   Optimization.
   Vogels, T.; Karimireddy, S. P.; Jaggi, M. (2019).
   NeurIPS 2019. arXiv:1905.13727.
   The PowerSGD primitive in `src/tsugi_mend/compression.py` is a
   from-scratch reproduction of the rank-r power-iteration algorithm
   with persistent error feedback. PyTorch's native
   `torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook` was
   considered but is DDP-bucket-bound and not directly reusable for
   the GraceWindowSyncer fragment-merge path.

6. Concurrent outer-step orchestrator (Phase 2 Week 1 marquee feature).
   Original TsugiCinema, Inc. work. The orchestrator wraps the
   Decoupled DiLoCo control law (item 1) in an asyncio-task-based
   pattern that overlaps the cross-rack grace window with inner-step
   forward/backward compute. The orchestrator is convergence-equivalent
   to Decoupled DiLoCo by the Algorithm 2 staggering analysis; see
   `docs/phase2_week1_async_tp_overlap.md` for the proof sketch.

Patent posture:
The Apache-2.0 license includes an express patent license from each
contributor (Section 3), covering the patent claims necessarily
infringed by that contributor's contributions to this codebase. The
mend SDK is patent-independent by deliberate construction; it does
NOT exercise the K-Pool LoRA (US App. 64/060,315) or Infinity
(US App. 64/055,093) patent estates owned by TsugiCinema, Inc. Those
mechanisms (variance-threshold trigger; K-of-N adapter routing;
elastic-adapter-buffer quorum-based merge) live in the companion
patent-aligned SDK at github.com/tsugiai/tsugi-kpool and are not
present in this repository.
