The core idea is publishable and timely, and the system is unusually complete (real 224-rank Trainium cluster, production baseline, 2500-step real-token loss-matched runs, anonymized artifact). The two genuinely novel pieces are (1) the strategy layer formulation — searching over vendor-primitive compositions and tensor transformations above a closed collective API — and (2) the LLM-calibrated, per-deployment cost model used as the in-loop reward, validated by a clean ablation (hiding the simulator collapses 1.40× → 1.00×). The cross-scope inversion measurement study is a valuable systems finding in its own right.
However, the paper is exposed on four fronts that an ATC PC will probe hard: (i) the 3.24× Llama headline rests on one known idea (bundling M per-microbatch collectives into one graph — what XLA's own collective-combiner passes do within a graph, and what DDP/FSDP bucketing does on GPUs) against a possibly under-tuned per_mb baseline; (ii) generalization claims to TPU/MTIA are asserted, never tested; (iii) the cost model's structure is hand-engineered and Trainium-specific while the abstract implies the LLM "builds the simulator"; (iv) related work misses several directly relevant systems (Centauri, MSCCL++, ForestColl, NKI, XLA collective combining). All four are fixable before the deadline. Detailed actions below.
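Front (i) can be made concrete with a first-order launch-amortization model. The sketch below is illustrative only: the function names, the additive form, and every number are assumptions rather than anything taken from the paper, and it ignores compute overlap, memory pressure, and compile time.

```python
def per_microbatch_time(m: int, launch_s: float, bytes_per_mb: float, bw_Bps: float) -> float:
    """M separate graphs: every microbatch pays the fixed launch cost."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return m * (launch_s + bytes_per_mb / bw_Bps)


def bundled_time(m: int, launch_s: float, bytes_per_mb: float, bw_Bps: float) -> float:
    """One graph: a single launch cost, same total payload."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return launch_s + m * bytes_per_mb / bw_Bps


def bundling_speedup(m: int, launch_s: float, bytes_per_mb: float, bw_Bps: float) -> float:
    """Ratio of per-microbatch time to bundled time under this toy model."""
    return (per_microbatch_time(m, launch_s, bytes_per_mb, bw_Bps)
            / bundled_time(m, launch_s, bytes_per_mb, bw_Bps))


if __name__ == "__main__":
    # Hypothetical inputs: 8 microbatches, 2 ms launch tax, 1 MB per collective, 1 GB/s.
    print(bundling_speedup(8, 2e-3, 1e6, 1e9))
```

With those made-up inputs the model gives roughly 2.4×, and the ratio approaches M as the launch cost dominates the transfer time. If a hand-bundled baseline lands close to the agent's result, a term of this kind is probably doing most of the work, which is the point a decomposition table would settle.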
OverlayCCL targets collective-communication selection on closed vendor stacks (AWS Trainium, with claims of relevance to Google TPU and Meta MTIA), where (C1) the runtime is opaque and (C2) no schedule-level IR (MSCCL/TACCL-style) is exposed. The system runs a five-phase loop. In Phase 1, an LLM agent runs probe tools on the deployment to fit a six-term cost model (graph-launch tax, collective bandwidth, back-to-back amortization, NEFF compile/reload, memcopy bandwidth, net bytes) plus anti-reward-hacking structural terms. Phase 2 scores a seed library that includes AWS's production strategy. Phase 3 runs a "strategy-enumerate" LLM search (K=5 sketches + top-2 refinement, 10 LLM calls/problem). Phase 4 applies hardware and training-validation gates. Phase 5 deploys the gated survivor with the lowest simulator score.
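One way to read the six-term model and the Phase 5 selection rule is as an additive score minimized over gated candidates. The sketch below is a reviewer's reading, not the paper's implementation: the additive form, the treatment of back-to-back amortization as a discount on the launch tax, and all class, field, and function names are assumptions, and the structural anti-reward-hacking terms are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Calibration:
    """Per-deployment constants that Phase 1 would fit from probes (hypothetical fields)."""
    launch_tax_s: float       # fixed cost per graph launch
    collective_bw_Bps: float  # effective collective bandwidth
    b2b_discount: float       # fraction of the launch tax saved on a back-to-back launch
    neff_reload_s: float      # cost per NEFF compile/reload
    memcpy_bw_Bps: float      # memcopy bandwidth
    net_bw_Bps: float         # bandwidth applied to net bytes


@dataclass(frozen=True)
class Strategy:
    """A candidate's resource counts (hypothetical fields)."""
    name: str
    graph_launches: int
    back_to_back_launches: int  # subset of graph_launches issued back-to-back
    collective_bytes: float
    neff_reloads: int
    memcpy_bytes: float
    net_bytes: float
    passed_gates: bool          # outcome of the Phase 4 validation gates


def predicted_step_cost(s: Strategy, c: Calibration) -> float:
    """Sum the six terms; lower is better."""
    launch = s.graph_launches * c.launch_tax_s
    amortized = s.back_to_back_launches * c.launch_tax_s * c.b2b_discount
    return (launch - amortized
            + s.collective_bytes / c.collective_bw_Bps
            + s.neff_reloads * c.neff_reload_s
            + s.memcpy_bytes / c.memcpy_bw_Bps
            + s.net_bytes / c.net_bw_Bps)


def select_for_deployment(candidates: list[Strategy], c: Calibration) -> Strategy:
    """Phase 5 as described: the lowest-scoring candidate among gated survivors."""
    survivors = [s for s in candidates if s.passed_gates]
    if not survivors:
        raise ValueError("no candidate passed the validation gates")
    return min(survivors, key=lambda s: predicted_step_cost(s, c))
```

Writing the model this way also makes the requested term ablations easy to state: zero one term at a time and check whether the selected strategy changes.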
Results are reported on 7× trn1.32xlarge (224 ranks): a 1.40× end-to-end steady-step speedup on OLMoE-10B (loss-matched, 2500 steps, real OpenWebText) and 3.24× on Llama-7B (random tokens, 200 steps). The evaluation covers 8 collective problems. The paper documents cross-scope inversion (microbenchmark winners lose in training) and a no-simulator ablation that collapses the speedup to 1.00×. The search costs about $19 in LLM calls plus about 1 h of cluster time, and cluster-size redeployment at 3 and 5 nodes preserves the speedups (1.79×/1.50× on OLMoE; 2.49×/2.74× on Llama).
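The "loss-matched" qualifier is only as strong as the seed-to-seed spread it is compared against, which matters for the 6.945 vs 6.843 final-loss gap raised in the objections table. A minimal sketch of a seed-noise band check follows; the function names and the choice of mean ± k sample standard deviations are assumptions, and the baseline losses would have to come from repeated baseline runs with different seeds.

```python
from statistics import mean, stdev


def seed_noise_band(baseline_losses: list[float], k: float = 2.0) -> tuple[float, float]:
    """Return (low, high) = mean -/+ k sample standard deviations of baseline final losses."""
    if len(baseline_losses) < 2:
        raise ValueError("need at least two baseline seeds to estimate spread")
    mu = mean(baseline_losses)
    sd = stdev(baseline_losses)
    return mu - k * sd, mu + k * sd


def within_seed_noise(candidate_loss: float, baseline_losses: list[float], k: float = 2.0) -> bool:
    """True if the candidate's final loss falls inside the baseline-vs-baseline band."""
    low, high = seed_noise_band(baseline_losses, k)
    return low <= candidate_loss <= high
```

Calling `within_seed_noise(6.945, measured_baseline_losses)` with the measured per-seed baseline losses would show whether the gap is distinguishable from noise; with only a handful of seeds the band is itself a rough estimate.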
mark_step graph boundary; that is an interesting deployment-specific insight, but the paper currently presents the bundled strategy as a discovery rather than as a known transformation that the search re-derived and validated under Trainium's memory/compile-time constraints.

Novelty is sufficient for ATC if the claims are scoped precisely: a closed-stack system contribution with a novel search-surface formulation and a novel calibration mechanism, validated in production-like conditions. It is neither a new algorithm-synthesis technique nor a new class of collective algorithm (Appendix D already concedes that the loop "does not and cannot invent collective algorithms"; that is good, and the same honesty belongs in Section 1).
per_mb → bundled. A skeptical reviewer's first question: could a developer get the same 3.24× with a one-line manual change? Add a hand-written "developer-bundled" baseline (an AWS engineer or the authors manually fusing the M collectives) and report (a) how close it gets to the agent strategy, and (b) how much engineering time it took, including handling the memory/compile-time limits the paper says make practitioners avoid bundling. If the agent's version contains additional non-obvious elements (masked-AR packing, layout choices), quantify their marginal contribution with a decomposition table. Also measure XLA collective-combiner behavior when the mark_step boundary is removed, to show the compiler alone does not recover the win.

nki.collectives.*, a lower-level programmable surface that complicates the absolute "no low-level control (C2)" claim. One paragraph explaining why NKI collectives don't (yet) provide an MSCCL-style schedule IR would inoculate the paper.

| Work | Why it matters to this paper | Suggested treatment |
|---|---|---|
| Centauri — Chen et al., ASPLOS'24 Best Paper (DOI) | Communication partitioning + hierarchical scheduling for comm-compute overlap; its partition space explicitly includes primitive substitution — the same design dimension OverlayCCL searches. Up to ~45% training speedup. Closest conceptual neighbor at the "which primitive, how partitioned" layer, albeit on open GPU stacks with analytic models. | Must cite & differentiate (closed stack, LLM search, runtime-effect cost model vs. analytic overlap scheduling). |
| MSCCL++ — Shah et al., arXiv:2504.09014 (GitHub) | The current Microsoft communication stack (successor to MSCCL): GPU-driven 1-sided put/get channels; used by RCCL/SGLang/FlashInfer. The paper cites MSCCL (2023) but not its successor; reviewers from that community will notice. | Cite in the schedule-synthesis paragraph; strengthens the open-vs-closed-stack contrast. |
| ForestColl — Zhao et al., arXiv:2402.06787 | Throughput-optimal collective schedules for arbitrary/heterogeneous topologies (AMD + NVIDIA), beating vendor libraries and TACCL-class tools — the modern SOTA in schedule synthesis. | Cite as SOTA schedule-layer synthesis; reinforces that none of it applies without a programmable runtime. |
| XLA collective-combiner / latency-hiding-scheduler passes (OpenXLA docs) | XLA itself merges adjacent collectives within a graph and schedules async collectives (AllReduceStart/Done). The Llama bundling win is partially "give the compiler one graph instead of M." Also: XLA has RaggedAllToAll — relevant to the AllToAllV search space if/when Neuron exposes it. | Discuss explicitly; this is the most likely "isn't this just what the compiler does?" review question. |
| SimAI — Wang et al., NSDI'25 (USENIX) | Already cited, but the paper's drift characterization conflicts with SimAI's stated 98.1% cross-environment fidelity. Needs a quantitative or rhetorical fix (Action 4). | Re-position: closed stacks make simulator calibration impossible from outside, which is the sharper argument. |
| AWS NKI collectives & TorchNeuron transition (Neuron docs) | nki.collectives.all_gather etc. is a programmable kernel surface on Trainium; AWS also announced migration off PyTorch/XLA. Both bear directly on C2 and on the longevity of the cost-model terms. | Address in Background + Future Work (Action 9). |
| KernelBench — Ouyang et al., arXiv:2502.10517; and the Sakana "AI CUDA Engineer" reward-hacking incident (Feb 2025) | Establishes evaluation norms (correctness + speedup distributions, multiple runs/models) and documents LLM optimizers gaming evaluation harnesses — the exact failure mode the paper's structural simulator terms defend against. | Cite to motivate the anti-reward-hacking design and to justify adding multi-run variance (Action 3). |
| AlphaEvolve — Novikov et al., arXiv:2506.13131 (already cited) | Already optimized Google production training infrastructure (data-center scheduling, training of its own underlying LLM). Means "LLM search for training-stack performance" is precedent, not novelty. | Scope contribution (i) precisely: first above a closed vendor collective API, with model-in-the-loop reward. |
| Flux (ByteDance) / Lagom (comm-compute overlap line) | Recent kernel-fusion-based comm/compute overlap systems that share the "per-call latency ≠ per-step cost" motivation. | Optional one-line citations alongside Centauri/CoCoNet. |

| Likely reviewer objection | Severity | Pre-emptive fix |
|---|---|---|
| "3.24× comes from one known transformation (bundling) vs. a weak per-microbatch baseline." | High | Action 1: hand-bundled baseline + decomposition + compiler-combiner discussion. |
| "TPU/MTIA claims have no evidence." | High | Action 7: TPU Phase-1 + 1–2 problems, or remove claims. |
| "Single search run, single LLM — is this reproducible?" | High | Action 3: multi-run variance, second LLM, K/R sensitivity. |
| "Show me the simulator is actually accurate." | High | Action 2: predicted-vs-measured scatter + rank correlation + term ablations. |
| "Agent's final loss is worse (6.945 vs 6.843)." | Med | Action 5: seed-noise band for baseline-vs-baseline. |
| "Isn't NKI low-level control on Trainium?" / "What about TorchNeuron/trn2?" | Med | Action 9: background + future-work paragraphs, optional trn2 probe. |
| "Correctness at ws∈{4,8} doesn't imply correctness at 224 ranks / other shapes." | Med | Action 8: shape fuzzing; clarify expert-choice sidesteps variable counts. |
| "Anonymization is compromised by the AWS-internal baseline narrative." | Med | Reword baseline provenance; check ATC double-blind rules. |
| "The cost-calculus table inflates the E2E alternative." | Low | Action 10: recompute with honest per-candidate times. |
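
For Action 2, the rank-correlation part of the simulator-fidelity evidence needs only the predicted and measured per-strategy step times. Below is a dependency-free sketch of Spearman's rank correlation with average ranks for ties; the function names are illustrative, and the inputs would be the paper's own predicted and measured values, which are not reproduced here.

```python
from math import sqrt


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def spearman(predicted: list[float], measured: list[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    return pearson(average_ranks(predicted), average_ranks(measured))
```

A high rho on held-out strategies would support using the simulator as an in-loop reward even where absolute predictions are off, since the search only needs the ordering to be right. The scatter plot remains necessary to expose systematic bias that a rank statistic hides.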
This is a strong systems submission in the making: a real problem on increasingly important hardware, a complete loop, a production-grade baseline, an honest anti-reward-hacking design, and the rare end-to-end loss-matched validation. To make it robust for ATC 2026: (1) scope the novelty claims precisely (strategy layer + calibrated-model-in-the-loop, not "LLM finds new algorithms"), (2) bullet-proof the Llama baseline and explain the bundling win relative to compiler collective-combining, (3) add simulator-fidelity and search-variance evidence, (4) either substantiate or remove TPU/MTIA claims, and (5) engage Centauri, MSCCL++, ForestColl, NKI, and the XLA pass pipeline in related work. With those revisions, the paper has a credible path to acceptance; without them, the headline numbers are likely to be discounted as a baseline artifact plus a single known transformation.