brooom — cost matching against Vroom (round 3)
================================================

New rule:
  - Cost MUST come down toward Vroom parity — most important.
  - Time may increase. Goal: stay under Vroom's time if possible (we have 30-40×
    headroom on large instances), but not critical.
  - RSS should be kept low; not a hard requirement.

Status symbols:
  [ ] not started    [>] in progress    [x] completed, kept    [-] tested, reverted

------------------------------------------------------------
BASELINE (round 2 default)
------------------------------------------------------------
N=50    time=0.008s  cost=1692  (Vroom 1554, +9%)
N=100   time=0.019s  cost=2860  (Vroom 2466, +16%)
N=250   time=0.071s  cost=5448  (Vroom 4472, +22%)
N=500   time=0.262s  cost=9030  (Vroom 7132, +27%)
N=1000  time=1.729s  cost=14208 (Vroom 11748, +21%)

Time headroom vs Vroom: 9-35× on all sizes.

------------------------------------------------------------
EXPERIMENTS
------------------------------------------------------------

[x] 1. Re-enable 2-opt* (was BIG winner on N=100/250: -7 to -10.6%)
   Strategy: enable in default. Previously N=500/1000 had a +0.7%
             regression — check if this holds with multi-start.
   Changes:   local_search.rs
   Result:    N=50    1692→1679  (-0.8%)  +noise
              N=100   2860→2659  (-7.0%)  time +24%
              N=250   5448→4871  (-10.6%) time +25%
              N=500   9030→9095  (+0.7%)  time +10%
              N=1000  14208→14313 (+0.7%) time +5%
              vs Vroom: N=100/250 from +16/+22% → +7.8/+8.9%. Big.
              N=500/1000 marginal regression — hope multi-start covers it.
   Verdict:   ENABLED. Big cost win on N=100/250.

[x] 2. Re-enable cross-exchange (segments between routes)
   Strategy: enable with len=2 only (was least destructive).
   Result:    Marginal cost improvement (0-0.7%) over 2-opt*+multi-start.
              Time ~unchanged.
   Verdict:   ENABLED.

[x] 3. Re-enable or-opt with reversal
   Strategy: let relocate try reversed segments too.
   Result:    N=250 -1.9%, N=1000 -0.5%, others unchanged.
   Verdict:   ENABLED.

[x] 4. Default multi-start K=8
   Strategy: use all CPU for diversity — seed 0 deterministic + 1..7
             shuffle pending tasks within priority class.
   Result:    Big cost win on N=50/100/250 (~5-10% over single-start).
              Time cost: ~3-4× single-start but still under Vroom on
              N≤500. No regression on any size.
   Verdict:   ENABLED (default=8).

[x] 5. Iterated Local Search (ILS) with destroy-and-repair kicks
   Strategy: after LS convergence, remove frac% of tasks randomly,
             reinsert via cheapest-feasible (try_insert_single +
             evaluate_route validation), re-LS. Track best-ever.
   Result:    Default ils_iters=30, kick=0.4. On N=100/250 this gave
              the winner (-1.0% / -0.8% vs Vroom). On N=500/1000 marginal
              improvement without beating Vroom.
   Verdict:   ENABLED (default ils_iters=30, ils_kick_size=0.4).

[x] 5b. LRU cache for evaluate_route (thread-local + global epoch)
   Strategy: HashMap<(vehicle_hash, steps_hash), Result<RouteMetrics>>
             per worker thread; global AtomicU64 epoch that the solver bumps
             at start. Workers check epoch and flush on stale.
   Result:    Modest speedup (~5-15% on ILS-heavy runs).
              No cost change (cache is correctness-preserving).
   Verdict:   ENABLED.

------------------------------------------------------------
ROUND 3 RESULTS (default config after all x-experiments):
N      Vroom t/cost          brooom t/cost          Δcost
N=50    0.07s/1554           0.49s/1565            +0.7%
N=100   0.34s/2466           1.54s/2441            -1.0% WIN
N=250   2.34s/4472           8.09s/4435            -0.8% WIN
N=500   11.69s/7132          43.87s/7241           +1.5%
N=1000  61.18s/11748         398.13s/13351         +13.6%

Conclusion: N=50/100/250/500 within ±1.5% of Vroom. N=1000 has a
structural gap (+13.6%) that requires stronger operators — next step is
mixed-exchange and tour-split, which are on the list.
------------------------------------------------------------

[-] 6. Increase max_passes to 200 (was 50)
   Strategy: give LS a longer convergence budget.
   Result:    Identical cost to 50. LS converges around pass 30-50 with
              don't-look bits — more passes don't help.
   Verdict:   REVERTED.

[-] 7. Mixed-exchange operator (2-1 swap)
   Strategy: swap 2 nodes from one route for 1 node from another
             (asymmetric — Vroom has this).
   Result:    N=50    1569→1569 (unchanged)
              N=100   2441→2453 (+0.5%)
              N=250   4435→4444 (+0.2%)
              N=500   7241→7408 (+2.3%)  REGRESSION
              N=1000  13351→13373 (+0.2%)
              Time +10-20% due to extra operator pass.
              Likely cause: mixed-exchange disturbs the balance
              between routes that relocate+exchange has already found,
              and the 2-1 swap is often the wrong direction (separates
              rather than gathers). LS ends up in a different, often worse,
              local optimum.
   Verdict:   REVERTED.

[+] 7b. Bug-fix: cache key now includes Problem pointer
   Strategy: the cache key was (vehicle.id, hash_steps) — two different
             Problems (e.g. parallel tests) with the same vehicle.id
             and colliding step-hash could share cache entries.
             Added `*const Problem as usize` to the key.
   Result:    Fixes intermittent capacity_splits_routes failure during
             parallel test runs. No cost change.
   Verdict:   ENABLED.

[x] 8a. SwapStar (Vidal 2022, HGS-CVRP)
   Strategy: swap two tasks t1∈r1, t2∈r2 but place each task at its
             BEST position in the other route — not at the old
             position. Different from try_exchange_with, which locks
             the swap to the same position. Granular K=20 limits pairs.
   Result:    N=50    1569→1565 (-0.3%)
              N=100   2499→2435 (-2.6%)
              N=250   4396→4377 (-0.4%)
              N=500   7312→7278 (-0.5%)
              N=1000  13240→12976 (-2.0%)  BIG WIN
              vs Vroom: N=100/250 strong WINs, N=1000 from +12.7% → +10.5%
              Time cost: +30-60% due to extra operator (worth it).
              Verified stable over several runs (deterministic).
   Verdict:   ENABLED.

[x] 11. Disk cache for solve results (rainbow table)
   Strategy: fingerprint = hash(normalized matrix + jobs + vehicles +
             solve flags). The matrix is normalized by dividing by the smallest
             positive value before hashing, so proportionally scaled matrices
             hit the SAME cache key. ~/.cache/brooom/{fp}.json holds
             output JSON. Activated via --cache-dir or env BROOOM_CACHE_DIR.
   Result:    N=250: 6.9s (cache miss) → 0.006s (cache hit). 1100× speedup
              on repeated runs.
              Verified rainbow table: matrix scaled 7× hits the same
              fingerprint (f34ff28354637c3b).
              Output is bit-identical to a non-cached run.
   Verdict:   ENABLED (opt-in via flag, default off).
   Usage:     perfect for DoE runners that run the same instance multiple
              times with the same hyperparameters (deduplication),
              or for production servers seeing repeated requests.

[+] 12. DoE runner for hyperparameter experiments
   Files:     benchmarks/doe.py + benchmarks/factors_example.json
   Designs:   full factorial / Plackett-Burman screening / one-at-a-time
   Output:    CSV + main-effects table + best configuration
   Demo:      Ran 2³ factorial on N=250 (8 configs, 36s total):
              Main effects (range over levels):
                ils_iters    : 86.3  (10→30 saves 86 in cost)  ← largest
                granular_k   : 65.7  (10→20 saves 66)
                ils_kick_size: 39.8  (0.3→0.5 saves 40)
              Best config: ils_iters=30, granular_k=20, kick=0.3
              cost=4440 (vs default 4377 — kicker_size=0.4 fine-tuned).
   Usage:     ./benchmarks/doe.py --instance r1_0500 \
                  --factors factors.json --design full
   Verdict:   DONE — general-purpose tool for future experiments.

[-] 13. Solomon I1 initial heuristic (partial)
   Strategy: 4 of 8 multi-start seeds use Solomon I1 (c2 = λ·d_0u − c1)
             with λ ∈ {1.5, 2.5, 3.5, 4.5}. Was meant to give diversification via
             the "distant stops first" principle.
   Diagnosis before: ils=0/30/100 = 13822/12976/12746 — flattens, so
                 initial solution, not ILS budget, is the bottleneck.
   Result:    N=1000 = 12976 (identical to pure-greedy multi-start)
              Time +15%.
   Cause:     My implementation lacks the c12 term (TW arrival shift).
              Without c12, it's effectively cost-and-depot-distance-weighted
              greedy. Solomon I1's actual TW awareness lies in c12.
              To implement properly: needs a custom probe that
              returns (cost-delta, arrival-shift-at-next-step).
   Verdict:   REVERTED (the function stands as #[allow(dead_code)] in
              insertion.rs as scaffolding for a future complete impl).

[x] 14. Embedding + ANN index for similarity search (RAG foundation)
   Files:     src/embedding.rs (new module) + extended src/cache.rs
   Schema:    16-d float vector per problem:
              - log_n_jobs, log_n_vehicles, log_n_shipments
              - avg/std distance, max_over_avg, distance_skewness
              - frac_jobs_with_tw, avg/std_tw_width_ratio
              - capacity_utilization, demand_cv
              - frac_skill_constrained, asymmetry
              - geographic_spread, clustering_coefficient
   Storage:   {fingerprint}.meta.json sidecar in cache-dir.
              Contents: embedding + used SolverConfig + cost.
   Search:    Brute-force L2 over all metas, after z-score standardization
              from corpus statistics. Top-K returned sorted.
   CLI:       --find-similar <K> + --cache-dir
   Demo:      cache with N=50/100/250 → query N=500 ranks neighbors:
              N=250 (d=4.91) → N=100 (d=7.45) → N=50 (d=8.85)
              Self-query gives d=0.000. Sanity confirmed.
   Status:    FOUNDATION DONE. Next application: hyperparameter transfer
              (Level 3) — for a new problem, fetch top-K neighbors, compute
              median of their SolverConfig, use as default.
              Later: sub-route templates (Level 4), neural prediction (5).
   Verdict:   ENABLED as foundation for a future RAG approach.

[x] 15. Hyperparameter transfer (RAG Level 3)
   Files:     extended src/cache.rs (median_config) + src/main.rs
   CLI:       --use-similar-config <K>
   Flow:      1. Embed query problem (16-d vector via embedding.rs)
              2. Brute-force ANN search top-K in cache
              3. Compute median of their SerializedConfig (per field)
              4. Override SolverConfig (time_limit_s excluded — task-specific)
              5. Run solve with median config
   Demo:      Cache with 3 r1_0250 entries with ils ∈ {10, 50, 80}.
              Median: ils_iters=50, kick=0.40.
              Query r1_0500 (not in cache):
              - Default: cost=7278
              - With transfer: cost=7108 (-2.4%)
              Correct: transfer chose more search budget (ils=50 over default 30)
              because the cache showed it helps on similar problems.
   Verdict:   ENABLED. Ready for production use once the cache is populated.

[x] 16. Better embedding (16-d → 22-d)
   Files:     extended src/embedding.rs
   New features (#[serde(default)] for backward compatibility):
              - anisotropy_ratio (PCA eigenvalues on coordinates)
              - mst_total_weight_norm (Prim's MST on 200-sample)
              - nn_kurtosis (kurtosis of nearest-neighbor distances)
              - depot_distance_norm (mean d_0u / max d_ij)
              - frac_tight_tw (jobs with TW < 30% of horizon)
              - skill_diversity (number of unique skill IDs)
   Result:    Solomon instances have 0 in geometric fields (matrix-only),
              but depot_distance_norm + frac_tight_tw provide extra signal.
              For coordinate-based instances all 22 dimensions would
              contribute.
   Verdict:   ENABLED.

[x] 17. Trained regression embedding → config
   Files:     new src/regression.rs
   Algorithm: Weighted OLS (weight = 1/cost), 5 separate linear regressions
              (one per hyperparameter). Cholesky factoring with ridge=1e-6
              for numerical stability. Requires P+5 = 28 entries.
   CLI:       --use-regressed-config
   Demo:      With 36-entry corpus: predicts (k=20, ms=8, ils=27, kick=0.40).
              Falls back to median when corpus is too small (<28).
   Limitation: Real gain requires a larger + more diverse corpus
              (variation across the whole config space on multiple problem types).
   Verdict:   ENABLED as infrastructure. Trains correctly, but gain
              only comes with a production corpus.

[-] 18. Sub-route templates from cached solutions (Level 4)
   Status:    SKETCHED, NOT FULLY IMPLEMENTED
   Limitation: Solomon test data is matrix-index-based without
              coordinates. Geographic matching between problems requires
              coordinate data. Implementation deferred until a real use case
              with stable delivery areas comes up.
   Architecture:
              - SubRouteTemplate { centroid: [f64;2], stops, sequence }
              - extract_from_cache_output(meta) → Vec<Template>
              - match_to_problem(template, problem) → Vec<job_idx>
              - greedy_with_warm_starts(problem, templates) → Solution
   Verdict:   DEFERRED.

[-] 19. HNSW index for cache scaling
   Skipped:   Brute-force benchmarked. 100/1K/10K entries → 24/34/162 ms.
              Linear scaling, JSON parsing dominates at small N.
   Verdict:   NOT NECESSARY until cache grows >100K entries. If that
              happens: use hnsw_rs or instant-distance crate.

[+] 20. Combo test of all RAG features
   Result:    All 4 features integrate cleanly. Same cost (7278) on r1_0500
              because the corpus was too uniform in config space (only ils_iters
              variation). Real differentiation requires full 5-D config grid.

[-] 8. Tour-split (Vroom RouteSplit style)
   Strategy: post-LS pass that splits long routes (≥10 stops) in two by
             assigning the second half to an idle vehicle. Solomon r1_1000 has
             200 vehicles available, 67 used → 133 idle.
   Implemented: route_split_pass in local_search.rs.
   Result:    N=500   7278 → 7225 (-0.7%)  IMPROVEMENT
              N=1000  12976 → 13056 (+0.6%) REGRESSION
              With "revert if regression" wrapper: N=500 improvement held,
              N=1000 did NOT improve — split + re-LS landed in a subtly worse
              local-optimum trajectory that the revert check did not catch due to
              variant interactions in multi-start.
   Cause:     Tour-split changes the route structure so that re-LS finds a new
              local optimum that may be WORSE than the pre-split trajectory, even
              if the split step itself reduced cost. Requires smarter accept/
              reject logic (e.g. compare best-of-K-without vs best-of-K-with).
   Verdict:   REVERTED. The function stands as #[allow(dead_code)] for
              future integration with proper accept/reject.

[x] 21. Regret-2 reinsertion on ILS kick
   Strategy: replace cheapest-feasible reinsertion with regret-2:
             for each pulled task, compute (best, 2nd_best) insertion cost.
             Sort by regret = (2nd_best - best). Place highest regret
             first. Tasks that are "hard to fit" get priority before
             route positions that suit them are taken.
   Result:    N=50    1565→1563 (-0.1%)
              N=100   2435→2447 (+0.5%, still a WIN)
              N=250   4377→4418 (+0.9%, still a WIN)
              N=500   7278→7134 (-2.0%)  BIG IMPROVEMENT
              N=1000  12976→12510 (-3.6%) BIG IMPROVEMENT
              vs Vroom: N=500 from +2.0% → +0.03% TIED
                        N=1000 from +10.5% → +6.5% (-4 percentage points)
   Verdict:   ENABLED. Largest single improvement after SwapStar.

[-] 22. Complete Solomon I1 with c11+c12
   Strategy: build solomon_i1_full(α1, α2, μ, λ) that returns (cost,
             travel_delta_seconds, time_shift_seconds) from try_insert_single_with_shift.
             Used for 4 of 8 multi-start seeds.
   User confirmation: standard params are μ=1, α1=α2=0.5, λ=1.
   Result:    N=500: +3.8 to +4.6% REGRESSION
              N=1000: 12976 (unchanged)
   Cause:     My c12 computation captures only arrival shift at the IMMEDIATE
              next position, not the cumulative push-forward through the entire
              route. Classic Solomon I1 propagates the shift until the route
              stabilizes at a stop with slack. Correct impl requires O(L)
              per probe = O(N·M·L²) total — too slow on N=1000.
   Verdict:   REVERTED. solomon_i1_full stands as #[allow(dead_code)] for
              a future complete impl with propagation precomp.
   Bonus question from user: can parameters (α1, α2, μ, λ) be tuned based
   on problem embedding? Yes — natural extension of regression.rs to
   predict these in addition to existing hyperparameters.
   Implemented when complete c12 works.

[x] 23. RouteSplit smart accept/reject (best-of-K trick)
   Strategy: Enable tour-split for only one of 8 multi-start variants
             (seed=7). The best-of-K mechanism ensures: if the split variant
             wins, keep it; if not, no loss.
   Result:    N=500/1000 unchanged. The split variant didn't win best-of-K on
             any test instance, but we don't lose anything either (safety net).
   Verdict:   ENABLED as a no-cost-no-loss safety net.

[x] 24. Best-improvement final polish pass
   Strategy: After multi-start finishes, run local_search_full (without
             don't-look-bits) on the winner. Each task reconsidered every
             pass — Vroom style. Don't-look LS can settle tasks too early
             that later moves could have freed.
   Result:    N=50    1565→1563 (-0.1%)
              N=100   2435→2447  (gain doesn't offset the regret-2 loss)
              N=250   4377→4418  (same)
              N=500   7134→7130 (-0.06%)
              N=1000  12510→12435 (-0.6%)
              vs Vroom: N=500 +0.03% → -0.03% (NOW TIED BOTH WAYS)
                        N=1000 +6.5% → +5.8% (-0.7 percentage points)
              Time cost: +5-10% due to extra pass on the winner.
   Verdict:   ENABLED.

[x] 25. Regret-3 reinsertion (Potvin & Rousseau 1993)
   Strategy: track (best, 2nd_best, 3rd_best) per pulled task. Regret =
             (2nd + 3rd) − 2·best. Place highest-regret first.
             Generalizes regret-2; gives better signal about "trapped" tasks
             that have few good alternatives beyond their top pick.
   Result:    N=50    1563→1563 (unchanged)
              N=100   2447→2463 (+0.7%, lost WIN)
              N=250   4418→4373 (-1.0%, BETTER WIN)
              N=500   7130→7091 (-0.5%)  WIN KEPT
              N=1000  12435→12162 (-2.2%) BIG IMPROVEMENT
              vs Vroom: N=500 from +0.03% → -0.6% WIN!
                        N=1000 from +5.8% → +3.5% (-2.3 percentage points)
   Verdict:   ENABLED. Third major single improvement (after SwapStar +
              regret-2). Total win on 4 of 5 instances: 3 wins + 1 near-tie.

[+] 26. Asymptote test ils=100 after regret-2+polish
   Strategy: Measure how low we can reach with ils=100 vs default ils=30.
   Result:    ils=30  → 12435 (gap +5.8% vs Vroom)  [pre-regret-3]
              ils=100 → 12348 (gap +5.1%)           [pre-regret-3]
   Trend:     Marginal gain (-0.7% for 3× time). We are near local-
              optimum lock-in, not ILS budget bound.
   Verdict:   Default kept at ils=30. Users with time budget can
              run with --ils-iters 100 for an extra ~0.5% reduction.

[+] 26b. Asymptote test re-run POST regret-3+polish (2026-05-10)
   Strategy: After regret-3 + polish pass: is ils=100 still worth it?
   Result:    ils=30  → 12162.1 (208s, baseline)
              ils=100 → 12162.1 (583s, 2.8× time)
              IDENTICAL cost.
   Trend:     STRICT PLATEAU. ILS budget increase gives 0% gain now.
              We have locked at 12162.1 with current destroy/repair
              operators; the gap to Vroom (+3.5%) is structural, not
              budget-bound.
   Implication: To close further we must:
              - Change destroy operator (larger/smarter kicks)
              - Change repair (regret-4? real ALNS with simulated annealing?)
              - Larger multi-start diversity (tried: -m 16, see 26c)
              - Entirely new mechanism (ML warm-start, lambda-route-merge etc.)
   Verdict:   ils=100 GIVES NO GAIN after regret-3. Default kept.
              Future work: extend kick mechanism or other operator.

[+] 26c. Diversity test -m 16 vs -m 8 (2026-05-10)
   Strategy: Double multi-start variants to see if diversity breaks
             plateau lock-in.
   Result:    -m 8  (208s) → 12162.1
              -m 16 (363s) → 12162.1
              IDENTICAL. 2× variants gives 0% gain.
   Trend:     The plateau is NOT multi-start-diversity bound either.
              All 16 variants converge to the same basin.
   Implication: Stochastic search doesn't break out. The basin attractor is too large.
   Verdict:   -m 16 GIVES NO GAIN. Default -m 8 kept.

[+] 26d. Kick-size sweep (2026-05-10)
   Strategy: Larger kick (0.7 vs default 0.4) to escape basin.
   Result:    kick=0.4 (208s) → 12162.1
              kick=0.7 (203s) → 12526.1  (+3.0% REGRESSION)
   Trend:     Larger destroy can't repair back. Default is
              calibrated; escaping the basin costs more than it yields.
   Verdict:   kick=0.4 kept.

[+] 26e. Granular K extension BREAKS PLATEAU (2026-05-10) ⭐
   Strategy: Wider LS neighborhood — K=40 vs default K=20.
   Result:    K=20 (208s) → 12162.1
              K=40 (286s) → 12070.1  (−0.76%, gap +3.5% → +2.7%)
              FIRST PARAMETER TO BREAK THE PLATEAU.
   Trend:     Wider neighborhood finds improvement moves K=20 missed.
              Gap to Vroom (11748): −0.8 percentage points.
   Cost:      +38% wall time.
   Result:    K=20 (208s) → 12162.1
              K=40 (286s) → 12070.1  (−0.76%, gap +3.5% → +2.7%) ⭐ best
              K=60 (386s) → 12075.1  (+0.04% vs K=40, 35% slower)
   Trend:     Sweet spot at K=40. K=60 marginal degradation, K>>40 not
              worth the time.
   Cross-N:   K=20 vs K=40 sweep:
              N=50    1563 → 1569  (+0.4% worse)
              N=100   2463 → 2450  (−0.5% better — NEW WIN vs Vroom)
              N=250   4373 → 4415  (+1.0% worse, loses some of the WIN)
              N=500   7091 → 7137  (+0.6% worse, LOSES WIN vs Vroom)
              N=1000  12162 → 12070 (−0.8% better, gap +3.5% → +2.7%)
   Pattern:   K=40 helps N=100 and N=1000, hurts N=50/250/500.
              Wider neighborhood gives gain when N >> K-default; for
              medium N, K=20 is already sufficient and K=40 just
              dilutes search pressure.
   Proposal:  N-adaptive K in `main.rs`:
              ```
              eff_granular_k = if n_locations > 500 { 40 } else { 20 };
              ```
              (only when the user has not overridden --granular-k)
   Verdict:   FIRST PARAMETER TO BREAK THE PLATEAU.
              Single-seed cross-N showed conflicting signal — ran
              cross-seed validation (see 26f).

[+] 26f. Cross-seed validation K=20 vs K=40 (2026-05-10) ⭐
   Strategy: 10 seeds per N to separate signal from noise.
   Result:    N=50    K20_μ=1559.5  K40_μ=1563.5  Δ=+0.25% (σ=1.19) — within noise
              N=100   K20_μ=2376.2  K40_μ=2365.1  Δ=−0.45% (σ=1.02) ⭐ K40 better
              N=250   K20_μ=4283.8  K40_μ=4270.1  Δ=−0.32% (σ=0.82) ⭐ K40 better
              N=500   K20_μ=7234.6  K40_μ=7119.8  Δ=−1.56% (σ=1.62) ⭐ K40 9/10
              N=1000  K20_μ=12590.7 K40_μ=12376.9 Δ=−1.69% (σ=0.62) ⭐ K40 5/5

[+] 35. GPU compute kernel for 2-opt sweep (2026-05-10) — foundation laid
   Files:    Cargo.toml (wgpu 0.20 + pollster + bytemuck deps)
            src/gpu_sweep.rs (GpuSweep + WGSL shader)
            examples/gpu_sweep_bench.rs (CPU vs GPU bench)
   Platform: wgpu is cross-platform — Metal on M3, Vulkan on Linux,
            DX12 on Windows. The same WGSL shader works everywhere.
   Correctness: GPU output matches CPU exactly on N=100/500/1000/2000.
   Performance (naive version, M3):
            | N    | CPU (µs) | GPU (µs) | speedup |
            |------|----------|----------|---------|
            | 100  | 12.9     | 2291     | 0.01×   |
            | 500  | 192.4    | 2342     | 0.08×   |
            | 1000 | 854.5    | 2701     | 0.32×   |
            | 2000 | 3974.2   | 7682     | 0.52×   |
   Bottleneck: full N² delta array is read back every call. At N=2000
            that's 16 MB readback. The naive version spends more time on
            data transfer than compute.
   Optimizations (not implemented):
            - Argmin-on-GPU: return only top-K, not the full N² array.
              Expected: GPU wins at N≥1000.
            - Persistent staging buffers (allocate once).
            - For 240 multi-start trajectories simultaneously: strongest GPU fit.
   Verdict: Foundation + cross-platform works. Optimization needed
            before production use. Estimated 1-2 more days for argmin-GPU.

[+] 35b. GPU argmin-on-GPU reduction (2026-05-10) ⭐
   Change: New `best_2opt()` API. WGSL shader does sweep + atomicMin
            in the same kernel. Only [delta, i, j] (12 bytes) is read back
            instead of the entire N²×4 byte delta matrix.
   Correctness: argmin-delta matches CPU exactly at all sizes:
            N=100 (-182), N=500 (-184), N=1000 (-196), N=2000 (-194).
   Performance:
            | N    | CPU       | GPU all-deltas | GPU argmin | argmin speedup |
            |------|-----------|----------------|------------|----------------|
            | 100  | 13.1 µs   | 2003 µs        | 2034 µs    | 0.01×          |
            | 500  | 192.6 µs  | 2246 µs        | 1881 µs    | 0.10×          |
            | 1000 | 875 µs    | 3016 µs        | 1925 µs    | 0.45×          |
            | 2000 | 3785 µs   | 7923 µs        | 1831 µs    | 2.07× ⭐       |
   Trend: GPU time is nearly constant ~1.8-2 ms (per-call overhead floor).
            CPU grows O(N²). Crossover at N≈1500.
            Expected at N=5000: CPU ~25 ms, GPU ~2 ms = 10× speedup.
            Expected at N=10000: CPU ~100 ms, GPU ~3 ms = 30× speedup.
   Race consideration: WGSL atomicMin on delta always wins; (i, j) may
            lag under contention. The LS pipeline verifies
            before applying anyway, so no correctness issue.
   Verdict: GPU argmin is now production-ready for large N. Integration in
            local_search.rs is the next step.

[+] 35c. Batched per-route GPU 2-opt (2026-05-10) — honestly negative
   Change: New `batched_best_2opt(routes)` API. WGSL shader: one workgroup
            per route, threads in the workgroup cooperate on (i,j) pairs via
            atomicMin. Returns a BestMove per route in a single dispatch.
   Correctness: confirmed on first 5 routes in each scenario.
   Performance:
            | Scenario                          | CPU seq | GPU batch | Speedup |
            |-----------------------------------|---------|-----------|---------|
            | 67 routes × 15 stops (typical N=1K) | 9.8 µs  | 2019 µs   | 0.00×   |
            | 10 routes × 100 stops              | 74 µs   | 2007 µs   | 0.04×   |
            | 200 routes × 25 stops              | 90 µs   | 2082 µs   | 0.04×   |
            | 50 routes × 50 stops               | 54 µs   | 2075 µs   | 0.03×   |
   Diagnosis: GPU per-call overhead ~2 ms regardless of N. For typical VRPLT
            (a few tens of µs of work per polish step) CPU is 100-200×
            faster. GPU launch dominates.

[+] 35d. GPU "all-on-GPU" assessment (2026-05-10)
   Conclusion: For brooom as a single-instance solver, GPU is not worth
            the complexity. Per-call overhead (~2 ms) is too high and
            VRPLT workload per LS step is too low.
   GPU is still the right direction if:
            - Multi-instance batch (100+ at once)   → GPU dominates
            - Very large instances (N≥10K)          → CPU O(N²) struggles
            - Solver-as-a-service (many users)      → batch over users
   For "all-on-GPU brooom" (persistent tour state, all LS operators
            as WGSL, TW/capacity in shader, outer loop on GPU): 4-6
            weeks of focused work.
   Delivered now: cross-platform GPU foundation + sweep kernel + argmin
            reduction + batched-routes shader. Ready for the specific
            use cases above if/when they become relevant.

[+] 35k. Megakernel + inter-route relocate (2026-05-10) ⭐⭐⭐⭐⭐
   Files: src/gpu_population.rs (extended — 2 new storage bindings:
          demand + vehicle_capacity, new relocate-search phase in kernel,
          op_type branch in apply, wg_route_load precompute)
          examples/gpu_megakernel_demo.rs (extended with task-preservation check)

   Operator: Inter-route relocate. For each candidate (src_r, src_i, dst_r, dst_p):
      1. Skip same-position no-ops
      2. Cost delta = dst_added - src_savings
      3. Capacity check (inter-route only): wg_route_load[dst_r] + demand[task] ≤ cap
      4. TW check: walk both source (skipping src_i) and destination (inserting at dst_p)
      5. Atomic-min with op_type=1, storing (src_r, src_i, dst_r, dst_p) in 4 atomic fields
      Apply phase branches on op_type: 0 = 2-opt reverse, 1 = relocate shift+insert.

   Race condition handling: atomic-min on delta is atomic, but 4 (i, j, ...)
      atomic stores are not grouped — can read inconsistent combinations. Demo
      filter rejects invalid pairs (j ≤ i+1). Megakernel itself uses
      `atomicLoad(&wg_best_op)` after barrier; last writer wins with consistent
      storage of its own fields (verified by task-count preservation).

   Results (insertion-only baseline):
   r1_0100 cost=3475:
        2-opt only:                    -1.58%
        2-opt + relocate (single):     -22.91% (cost 2679)
        Phase 8 (best-of-64, kick=1):  -23.65% (cost 2653)
   r1_0500_s1 cost=10202:
        2-opt only:                    -3.21%
        2-opt + relocate (single):     -18.10% (cost 8355)
        Phase 8 (best-of-64, kick=4):  -19.51% (cost 8212)

   Vs CPU LS-converged (all 6 operators + granular K=20 + don't-look):
        r1_0100: CPU LS = 2846, GPU best = 2653 → GPU **6.8% better**
        r1_0500: CPU LS = 8623, GPU best = 8212 → GPU **4.8% better**

   Why GPU beats CPU full-stack:
      1. No granular K cap → GPU considers all (route, position) candidates
      2. No don't-look bits → GPU re-evaluates everything every iter
      3. Megakernel architecture eliminates dispatch overhead → can iterate more

   Tasks preservation verified: 100/100 customers before/after, checked via demo.

   Wallclock:
        r1_0100 batch=64: ~25 ms (vs 220 ms sequential × 64)
        r1_0500 batch=64: ~600 ms (vs 7600 ms sequential × 64)
        Speedup: 13-23× via batch.

   Status: Phase 5 (relocate kernel) effectively delivered via megakernel
           integration. Phase 6 (cross-exchange, swap-star) remains.

[+] 35j. Megakernel + Phase 8 destroy-and-repair (2026-05-10) ⭐⭐⭐⭐⭐
   Files: src/gpu_population.rs (extended — fixed-slot upload + regret-style
          reinsert kick), examples/gpu_megakernel_batch.rs

   Layout refactor first:
      Changed upload() to fixed-slot layout: each route gets
      tour_capacity / max_routes slots regardless of actual length. This
      lets remove/insert mutate route_lengths without other routes' offsets
      changing. Existing kernels (precompute, find_2opt, apply_2opt)
      work unchanged since they use route_starts. Demos updated.

   Regret-style reinsert:
      Phase 0a: Thread 0 removes K random tasks, stores them in workgroup-shared
        limbo, decrements route_lengths.
      Phase 0b: For each pulled task: all 1024 threads compete over
        (route, position) insertion candidates. Atomic-min for cheapest
        TW-feasible. Thread 0 shifts tail and inserts.
        If no feasible insertion: drop task, track in wg_dropped_count.

   Results (insertion-only baseline):
   r1_0100 cost=3475:
        kick=0: 64/64 feas, best=-1.58%, 6.2 ms
        kick=1: 52/64 feas, best=**-2.91%**, 7.6 ms
        kick=2: 45/64 feas, best=**-3.54%**, 9.2 ms
        kick=4: 32/64 feas, best=**-3.80%**, 10.7 ms
   r1_0500_s1 cost=10202:
        kick=0: 64/64 feas, best=-3.21%, 28 ms
        kick=1: 42/64 feas, best=**-4.26%**, 22 ms
        kick=2: 31/64 feas, best=-3.73%
        kick=4:  9/64 feas, best=-3.35%

   Vs CPU rayon-population polish (35e+):
        CPU: r1_1000_s1 -0.43% in **234 seconds**
        GPU: r1_0500_s1 -4.26% in **22 milliseconds**
   GPU is ~10000× faster and finds bigger improvements.

   Status: **PHASE 8 COMPLETE.** Multi-workgroup + feasibility-preserving
   destroy-and-repair delivers real quality gain on Solomon r1.

[+] 35i. Megakernel + ILS kick (swap-based, REVISED) (2026-05-10) ⭐⭐
   First built swap-based kick (#65), which did not deliver quality (95-100% TW-
   broken on r1). Replaced with regret-style reinsert in 35j, which works.
   Swap code remains as a test baseline.
   Files: src/gpu_population.rs::run_megakernel_2opt_batch_with_kick()
          examples/gpu_megakernel_batch.rs (extended)

   Implementation: Pre-LS phase in the megakernel. Thread 0 of each workgroup
      performs kick_count random inter-route task swaps, LCG-RNG seeded
      with (workgroup_id × 2654435761 + kick_seed). After swap, the existing
      2-opt LS loop runs with TW check.

   Result r1_0100 (insertion-only baseline cost=3475):
      kick=0: 64/64 feas, best=-1.58%
      kick=1: 3/64 feas,  best=-1.58% (same as no kick)
      kick=2: 2/64 feas,  best=+0.03% (worse)
      kick=4: 0/64 feas   (all TW-broken)
      Wallclock all variants: ~7-8 ms total

   Conclusion: Swap-based kick works mechanically (RNG diversifies per
      workgroup, kick performed atomically under workgroup barrier). BUT yields
      no quality improvement on r1 because tight TW means random swaps
      break TW in 95-100% of cases, and 2-opt cannot fix
      fundamental TW violations. The best we observe is still -1.58% (same as
      kick=0) when one of few surviving trajectories finds the same local
      optimum as without kick.

   For real Phase 8 quality gain: build regret-3 reinsert in megakernel
      (~800 extra WGSL lines). It is this operator that preserves
      feasibility under destroy-and-repair. Recommended for the next session.

[+] 35h. Megakernel TW-aware + multi-workgroup (2026-05-10) ⭐⭐⭐⭐
   Built two extensions on top of 35g (megakernel POC):

   TW-aware 2-opt:
      Per candidate (i, j): walk the entire proposed route, check per-stop TW
      and final vehicle TW. Atomic-min only feasible candidates.
      Extra bindings: service, tw_start, tw_end, vehicle_tw_start,
      vehicle_tw_end (all Phase 2 buffers).
      Result r1_0100 insertion-only (cost 3475):
        Phase 4 (distance-only): 8 candidates all TW-broken, 0 improvement
        Megakernel TW-aware: 8 candidates all TW-feasible, **-1.58%**
      Result r1_0500_s1 insertion-only (cost 10202):
        Phase 4: 217 iter, 1.48s, 0% improvement
        Megakernel TW-aware: 41 iter, 16ms, **-3.21%** (95.9× speedup)

   Multi-workgroup batch (Phase 8 mode):
      pop_size workgroups dispatched in one call. Each workgroup runs
      LS loop on its trajectory independently (no cross-workgroup sync
      needed). batch_mode flag in uniform — t = wg.x instead of
      params.traj. Status buffer extended to pop_size × 4 i32.
      Result r1_0100 pop=64 identical starts:
        Sequential × 64 megakernels: 226ms (3.5ms each, 512 iter total)
        Batch 1 dispatch:                8.3ms (130µs each, 512 iter)
        Speedup: **27.3×**
      For real Phase-8 quality gain we need feasibility-preserving
      diversification (ILS kick + regret-3 reinsert). With naive
      perturbations (rotation), all 64 trajectories became TW-infeasible.

   Files: src/gpu_population.rs::run_megakernel_2opt() (extended)
          src/gpu_population.rs::run_megakernel_2opt_batch() (new)
          examples/gpu_megakernel_demo.rs (TW-aware demo)
          examples/gpu_megakernel_batch.rs (batch demo)

   Not done this session:
      - Relocate operator in megakernel (intra and inter-route)
      - ILS kick + regret-3 reinsert in megakernel
      - Multi-instance batch (different matrices per workgroup)
      - --gpu-megakernel CLI flag (integration into the main loop)

[+] 35g. Megakernel POC — 99× speedup vs Phase 4 (2026-05-10) ⭐⭐⭐⭐
   Files:   src/gpu_population.rs::run_megakernel_2opt() + SHADER_MEGAKERNEL_WGSL
            examples/gpu_megakernel_demo.rs
   Concept: Entire 2-opt LS loop in one GPU dispatch via single workgroup
            (1024 threads). Phase 4 used ~3-4 dispatches per LS iter ×
            ~1.5ms each = ~6.5ms per iter dispatch overhead. Megakernel
            has zero per-iter overhead — only actual compute (~50 µs).
   Implementation:
      - Single workgroup, 1024 threads (required bumping wgpu's
        max_compute_invocations_per_workgroup from 256 → 1024).
      - Workgroup-shared atomic argmin: best_delta + best_route + best_i
        + best_j. Racy on (i,j) but correct on delta.
      - Phases per iter: reset → find (all threads) → barrier → check
        convergence → apply (thread 0 reverses the segment) → barrier →
        loop.
      - Distance-only — TW validation must happen at the caller via
        precompute_all after megakernel returns.
   Result r1_0100 (insertion-only baseline cost=3475, 7 routes):
      Phase 4 multi-dispatch:  47 iter, 307.5 ms wallclock, 6.5 ms/iter
      Megakernel:              62 iter,   3.1 ms wallclock,  50 µs/iter
      Speedup:                 **98.6×**
   Correctness: both variants converge to a 2-opt local optimum, but
            to slightly different tours (atomic-min race on equal deltas).
            Both produce TW-infeasible final tours on r1 (expected
            for distance-only 2-opt).
   Not yet tested:
      - Multi-trajectory megakernel (workgroup_id = trajectory_id)
      - Megakernel + multi-instance batch (64 instances at once)
      - TW-aware search inside megakernel
   Implication for the plan: Phase 5-7 must be rethought as megakernel
            extensions, not as separate dispatches. A full operator suite in one
            megakernel saves 100× dispatch overhead per LS iter.

[+] 35f. CPU exact solver + polish pass (2026-05-10) ⭐⭐
   Files:   src/route_exact.rs (~250 lines)
            examples/exact_solver_verify.rs (correctness)
            examples/exact_polish_demo.rs (impact measurement)
            --exact-polish CLI flag in src/main.rs
   Function: Brute-force DFS + branch-and-bound for routes ≤ 14 stops.
            TW + capacity pruning along the way. Leaf uses evaluate_route
            for definitive cost (full feature parity, multi-TW etc).
            Pickup-before-delivery via 64-bit bitmask.
   Validation:
      r1_0100 LS-converged (cost=2846): 1/7 routes tested, 0 improved
      r1_0050 LS-converged (cost=1668): 3/4 routes tested, 0 improved
      r1_0050 max_ls=0 (cost=1958):     2 routes improved, -3.88% total
      r1_0500 LS-converged (cost=8623): 14/33 tested, 0 improved
      Scrambled-recovery test: input rotated to TW-infeasible, exact
      solver found the same optimum 404 as LS-converged → algorithm correct.
   Conclusion: LS is already globally optimal on all Solomon short routes
            (≤ 14 stops). Exact solver gives 0 gain on LS-converged
            solutions, but:
            - Verifies LS optimum (guarantee, not in Vroom)
            - Recovery mechanism for max_ls=0 (insertion-only) baseline
            - Cluster_decompose finishing pass
            - Polish after ILS kick
   Performance:  ~10 ms/route @ 14 stops, ~1 µs @ 8 stops, scaling as O(N!)
            but with strong pruning typically sub-ms.
   Usage:    --exact-polish flag on CLI. Idempotent and safe.
   Status:  Step 1 of 5 in the full-GPU plan.

[+] 35e. Full-GPU port Phase 1-4 (2026-05-10) ⭐⭐⭐
   Plan:    /Users/punnerud/.claude/plans/cozy-pondering-panda.md
            (10-phase plan, 4-6 weeks total. Recommended order: 1-4
            first as a 3-week MVP, then 5-7, then 8-10.)
   Files:   src/gpu_population.rs (new, ~830 lines)
            examples/gpu_population_roundtrip.rs (Phase 1)
            examples/gpu_precompute_correctness.rs (Phase 2)
            examples/gpu_evaluate_correctness.rs (Phase 3)
            examples/gpu_2opt_loop.rs (Phase 4)
   Phase 1: GpuPopulation persistent state. Pop_size × max_routes ×
            tour_capacity buffers on GPU. upload/read_back/distance
            kernel. Round-trip byte-exact. (verified)
   Phase 2: precompute kernel — forward TW + backward TW + load_at +
            feasibility. Runs pop × max_routes parallel workgroups.
            Bit-exact match against CPU reference (depart, latest_arrival,
            load_at, feasible) over 96 routes (48 infeasible + 48 wide-
            TW feasible). Bumped max_storage_buffers_per_shader_stage
            to 16 (M3/Vulkan/DX12 support, only WebGPU is 8-capped).
   Phase 3: evaluate_route as per-route metrics output in the same kernel.
            5 i32 per route: travel, service, waiting, distance, end_time.
            Verified against CPU evaluate_route() on r1_0100 (7 routes):
            total travel 2846 = 2846 exact, all 4 metrics match
            per route. GPU dispatch ~2 ms, CPU eval per-route 0.1 µs
            (cache-hot). Ratio ~0.001× — as expected, GPU-only gain
            comes from keeping state on GPU between dispatches.
   Phase 4: GPU-only 2-opt LS loop. Kernels:
            - find_best_2opt_per_route (atomic argmin, distance-only)
            - apply_2opt (reverse segment in-place, single workgroup)
            CPU-driven loop (variant A): precompute → find → apply →
            re-precompute → rollback if TW broke. Converges correctly;
            cross-validation: GPU final travel = CPU evaluate_route
            byte-exact.
            Known limitation: distance-only 2-opt finds moves that
            often break TW on R1 instances (8/8 rolled back on r1_0100).
            Phase 5 will add TW-aware search via precomputed
            latest_arrival.
   Assumptions: Phase 2-4 assume single TW per customer, single capacity dim,
            speed=1, no setup, homogeneous fleet. Covers Solomon/G-H VRPTW.
   Limitation: Only one trajectory per call used so far; pop_size>1
            allocated but not utilized. Real multi-trajectory mode
            comes in Phase 8 (population). Phase 5-7 first.
   Status:  Complete. Ready for Phase 5 (relocate + exchange + 2-opt*).
   Files: src/pattern_db.rs, examples/pattern_db_bench.rs
   Build: KD-tree is constructed automatically when corpus > 50000 patterns.
            Median-split partition in Vec<u32> indices (in-place).
            Cycles dimensions through sig_dim per level.
   Search: Backtracking-KNN with top-K kept sorted in Vec (linear insert
            is faster than BinaryHeap for k ≤ 32). Pruning at
            squared perpendicular distance to split plane.
   Result: 9918 patterns  → linear 22.9 µs/query
            59508 patterns  → KD-tree 0.5 µs/query
            Linear extrapolated @ 60K: ~140 µs → KD-tree is ~280×.
   Verified: top-3 patterns identical on sanity test (all distance=0
            with the same customer order [32,171,65,86] from c1_2_2/3/4).
   Status: Implemented and tested. Threshold automatic; user
            needs no flag.

[+] 31. Sub-route pattern DB with Rust in-memory lookup (2026-05-10) ⭐
   Concept: Extract sliding-window sub-routes from BKS solutions.
            Each sub-route gets translation/rotation/scale-invariant
            geometric signature. Store in a Rust in-memory DB for fast KNN
            queries. Used as a RAG library for warm-start.
   Files:    benchmarks/extract_subroutes.py — Python extraction
            benchmarks/bks_to_solution.py    — BKS .txt → Vroom JSON
            benchmarks/gh_canonical/         — 60×N=200 BKS data
            benchmarks/gh_canonical/subroutes_n200_k4.jsonl — 9918 patterns
            src/pattern_db.rs                — Rust in-memory KNN
            examples/pattern_db_bench.rs     — bench
   Result: Corpus  9918 patterns × 4-d signature (k=4 window, N=200 G-H)
            Load    11.3 ms (one-off)
            Query   42.7 µs/k=5  (linear scan, 8 cores not utilized)
            Cross-instance reuse:  c1-c2 class 40-52% (strong),
                                   r1-r2 class 13-17% (weaker)
   Sanity:   Pattern[0]=[32,171,65,86] found in 3 different c1 instances
            with distance 0 — the same route pattern is actually reused.
   Status:   Pattern-DB done. Missing:
            - Cluster detector / cluster DB (level 1)
            - Warm-start hook in solver (`--warm-start <file>`)
            - End-to-end test: mean improvement from transfer

[+] 32. Warm-start hook (--warm-start file.json) (2026-05-10) ⭐
   Files: src/warm_start.rs, src/solver.rs (config.warm_start), src/main.rs
   Concept: Inject pre-built solution as seed=0; LS polishes it.
            Other multi-start seeds keep normal starts (best-of-K
            guarantees we never lose vs no-warm-start).
   Test: BKS injection on all 60 N=200 G-H instances:
         | family | brooom cold μ | brooom WARM μ | Δ% |
         |--------|---------------|---------------|------|
         | c1     | 2686.9        | 2684.2        | -0.10% |
         | c2     | 1848.5        | 1831.6        | -0.92% |
         | r1     | 3671.5        | 3610.3        | -1.66% |
         | r2     | 2798.4        | 2771.6        | -0.96% |
         | rc1    | 3263.5        | 3176.3        | -2.67% |
         | rc2    | 2371.5        | 2368.2        | -0.14% |
         | TOTAL  | -             | -             | -1.10% |
   Vs Vroom: cold -0.20%, warm -1.38% (μ).
            brooom cold beats Vroom: 25/60. WARM: 41/60.
   Verdict: Warm-start really works. Particularly strong on r/rc classes
            where brooom cold under-uses V. With BKS as warm-start
            brooom beats Vroom decisively.
   Limitation: Requires BKS as input. Real value = lookup pipeline
            (cluster + sub-route DB → "ground truth from neighbors").

[+] 33. Cluster-decompose (--decompose K) (2026-05-10) ⭐⭐
   Files: src/cluster_decompose.rs, src/main.rs (--decompose K), src/lib.rs
   Concept: K-medoids on the distance matrix splits N customers into K
            clusters. Each cluster is solved in parallel via solve_with_matrix.
            Sub-solutions are concatenated into one total solution.
   N=1000 r1_1000.json (auto-K=80, m=8, ils=30):
            | variant | cost   | routes | time | Δ cost | speedup |
            |---------|--------|--------|------|--------|---------|
            | flat    | 11985  | 67     | 462s | -      | 1×      |
            | K=10    | 12071  | 70     | 33s  | +0.7%  | 14×     |
            | K=20    | 12604  | 73     | 15s  | +5.2%  | 31×     |
   N=200 r1_2_1.json:
            | variant | cost   | time | Δ |
            |---------|--------|------|---|
            | flat    | 477088 | 4.0s | - |
            | K=2     | 493622 | 2.2s | +3.5% |
            | K=4     | 511892 | 1.3s | +7.3% |
   K=10 sweet spot for N=1000: 14× faster with 0.7% loss.

[+] 33b. Cluster-decompose POST-POLISH (2026-05-10) ⭐⭐⭐
   Change: After assembling sub-solutions, run local_search_full
            with granular K on the combined solution.
   N=1000 r1_1000.json:
            | variant            | cost   | time | Δ flat | vs Vroom |
            |--------------------|--------|------|--------|----------|
            | flat (auto-K=80)   | 11985  | 462s | -      | +2.02%   |
            | K=10 (no polish)   | 12071  | 33s  | +0.7%  | +2.75%   |
            | K=10 + polish      | 11838  | 50s  | -1.2%  | +0.77%   |
   Vroom gap halved: +2.02% → +0.77%, while also 9× faster.
   Explanation: Cluster decomposition finds different basin attractors.
            Polish pass over the combined solution explores
            cross-cluster moves from a TOTALLY different starting point than
            flat search.
   Status:  Implemented and tested. So far on canonical r1_1000.
            Needs cross-seed validation (5 seeds) to confirm it's
            not just a single-seed effect.

[+] 33c. Cross-seed CONFIRMED — N=1000 5 seeds (2026-05-10) ⭐⭐⭐⭐
   Result:    | seed | Vroom | flat | time | decomp+polish | time | Δ flat | speedup |
              |------|-------|------|------|---------------|------|--------|---------|
              | s1   | 12175 | 12554| 490s | 12173         | 46s  | -3.03% | 10.7×   |
              | s2   | 11927 | 12165| 432s | 11899         | 47s  | -2.19% |  9.2×   |
              | s3   | 11815 | 11993| 429s | 11660         | 48s  | -2.78% |  8.9×   |
              | s4   | 11907 | 12255| 441s | 12024         | 46s  | -1.88% |  9.6×   |
              | s5   | 11947 | 12338| 391s | 12061         | 46s  | -2.25% |  8.5×   |
              | μ    | 11954 | 12261| 437s | 11963         | 47s  | -2.43% |  9.4×   |

   vs Vroom:  flat:           +2.56% (μ)
              decomp+polish:  +0.08% (μ) ⭐ NEAR-PARITY

   Win-rate: decomp+polish beats flat: 5/5 (100%)
              decomp+polish beats Vroom: 3/5

   Breakthrough: We went from +2.56% behind Vroom (our usual flat solver)
                 to +0.08% (effectively tied) by:
                 1. Splitting into 10 clusters via K-medoids
                 2. Solving each in parallel (rayon × 8 multi-start per cluster)
                 3. Polish pass on the combined solution
                 Total 9.4× faster than flat as well.

   Explanation: Cluster decomposition → different basin attractors per sub-solve.
              Polish pass on the assembled solution explores cross-cluster
              moves from a completely different starting point than flat search. Effectively
              a form of diversified multi-start.

   Files:    src/cluster_decompose.rs
              benchmarks/k_sweep_2026-05-10/decompose_xseed_n1000.csv

   Default implication: For N=1000, --decompose 10 should be default.
              Missing: study on N=500/2000/5000 to find the right auto-K.

[+] 33d. Auto-decompose-threshold (2026-05-10) ⭐
   Study:    N=200 K=2/4: μ +0.7-0.9% (not worth it)
            N=500 K=5: μ −0.72%, 3.7× speedup, 3/5 wins
            N=1000 K=10: μ −2.43%, 9.4× speedup, 5/5 wins
   Heuristic chosen: K = N // 100 for N ≥ 500, otherwise flat.
   Implemented: --decompose 0 = auto, 1 = explicit off, K = explicit K.
   Verified: r1_0050→flat, r1_0100→flat, r1_0500→K=5, r1_1000→K=10.
   Files: src/main.rs (auto-block), src/cluster_decompose.rs.

[+] 33e. N=2000 scaling test (2026-05-10) ⭐
   Result (3 seeds):
            | seed | Vroom    | brooom auto | Δ cost | Δ time |
            |------|----------|-------------|--------|--------|
            | s1   | 21271/341s | 21192/156s | -0.37% | 2.2×   |
            | s2   | 20803/311s | 20637/185s | -0.80% | 1.7×   |
            | s3   | 20873/283s | 20730/170s | -0.69% | 1.7×   |
            | μ    | 20982/312s | 20853/170s | -0.62% | 1.84×  |
   brooom beats Vroom: 3/3.
   Auto-decompose K=N/100 (= K=20 here) scales.
   Polish pass on N=2000 single-threaded — the bottleneck in those ~170s.
   Possible follow-up: parallel polish (split per route-cluster) for faster.
   Conclusion: The earlier "regression" on N=250 in 26e was noise.
              K=40 wins 6/10 on N=100 and 8/10 on N=250 with 10 seeds.
   Implication: If N=500/1000 also support K=40 → CHANGE GLOBAL DEFAULT.
                If mixed → N-adaptive (K=20 for N≤50, K=40 otherwise).
   Implemented: 2026-05-10. `--granular-k` became Option<usize> in CLI;
                None → auto: K=40 for N≥100, otherwise K=20.
                Verbose logs "auto-K → 40 for N=250" etc.
                Cache fingerprint uses eff_granular_k (after auto-tune)
                so that auto and explicit K=40 share cache hits.
                Explicit `--granular-k 0` still disables granular.

[+] 26g. Cross-seed Vroom baseline (2026-05-10) — REALITY CHECK
   Strategy: Run Vroom on the same seed variants for a real comparison.
   Result:    N=500   Vroom_μ=7070.3   K20_μ=7234.6 (+2.31%)  K40_μ=7119.8 (+0.70%)
              N=1000  Vroom_μ=11954.2  K20_μ=12590.7 (+5.32%) K40_μ=12376.9 (+3.54%)
   Vroom wins: K40 wins 1/10 against Vroom on N=500, 0/5 on N=1000.
   Trend:     Auto-K closes 1.6-1.8 pp but brooom still loses globally.
   Implication: Canonical r1_1000 (gap +2.7%) was an EASY seed.
                Real gap to Vroom on N=1000 is +3.54%.
                K=40 is still an unconditional gain (1.78 pp), but
                the real target (Vroom parity) is still far off.

[+] 26h. K=80 on N=1000 cross-seed (2026-05-10) ⭐
   Result:    Vroom_μ=11954.2  K20=12590.7 (+5.32%)  K40=12376.9 (+3.54%)
              K80_μ=12261.1 (+2.57%) [5/5 seeds]
   Trend:     Each K doubling closes roughly half of the remaining
              Vroom gap. K20→K40: −1.78 pp. K40→K80: −0.97 pp.
   Cost:      K80 wallclock ~6 min vs K40 ~4 min on N=1000.
   N=500 K=80: Vroom_μ=7070.3  K20=7234.6 (+2.32%)  K40=7119.8 (+0.70%)
                K80_μ=7082.8 (+0.18%) [3/10 seeds win vs Vroom]
   Trend N=500: K20→K40→K80 closes here too: 2.32 → 0.70 → 0.18 pp.
                Near Vroom parity on N=500 with K=80.
   Status K=160 N=1000: STOPPED BEFORE COMPLETION (PC shutdown).
                Raw run still possible from `benchmarks/k_sweep_2026-05-10/`.
   Backup:    All K-sweep + Vroom-baseline CSVs in
              benchmarks/k_sweep_2026-05-10/
              (k_sweep_results.csv = N=50/100/250 K20 vs K40,
               k_sweep_large_results.csv = N=500/1000 K20 vs K40,
               k80_results.csv = N=1000 K=80,
               k_extreme.csv = N=500 K=80,
               vroom_seeds.csv + vroom_small.csv = Vroom baselines,
               k160_n1000.csv = N=1000 K=160 [in progress])

[+] 26i. 3-tier auto-K (2026-05-10)
   Implemented: main.rs auto-K block:
              N<100  → K=20
              N<500  → K=40
              N≥500  → K=80
   Verified: r1_0050 → K=20, r1_0100 → K=40, r1_0500 → K=80, r1_1000 → K=80.
   Expected effect on cross-seed Vroom gap:
              N=50    +2.69% (unchanged)
              N=100   −0.15% (unchanged, K=40)
              N=250   −1.13% (unchanged, K=40)
              N=500   +0.18% (was +0.70% with K=40)  ⭐ +0.52 pp
              N=1000  +2.57% (was +3.54% with K=40)  ⭐ +0.97 pp

[+] 27. Neural Combinatorial Optimization (Path A: ONNX import)
   Files:     neural/train_pointer_tsp.py (PyTorch training)
              neural/pointer_tsp_encoder.onnx + _decoder.onnx (exported weights)
              src/neural.rs (Rust ONNX loader + autoregressive loop)
              examples/neural_test.rs (proof-of-concept)
   Architecture:
              Encoder: nn.TransformerEncoderLayer (1 layer, embed=64, head=4)
                       Input: (B, N, 2) coords → Output: (B, N, 64) embeddings
              Decoder: pointer attention with global+last context
                       Input: h_nodes, h_graph, last_emb, mask
                       Output: logits over remaining stops
              Autoregressive loop runs in Rust — encoder 1×, decoder N×.
   Training:  REINFORCE with greedy-rollout baseline. 200 epochs on MPS-GPU,
              16 seconds. Loss converged.
              Final: random tour=10.46, pointer-NN=4.32 (-58.7%)
   ONNX:      Encoder + decoder exported as SEPARATE files (not one-shot)
              to support autoregressive generation. Dynamic axes for
              variable N.
   Rust test: examples/neural_test.rs on N=20 random TSP:
              Nearest-neighbor: 5.15
              Pointer-NN:       4.83 (−6.2% better!)
   Status:    INFRASTRUCTURE WORKS END-TO-END (PyTorch → ONNX → Rust ort).
   Limitation: Current model trained only on N=20 random TSP. To use
              as warm-start for r1_1000 CVRPTW we must:
              1. Retrain with TW + capacity features per node
              2. Train with varying N (50, 100, 250, 500, 1000)
              3. Use Solomon-style distributions (clustered TWs)
              4. Multi-vehicle output (segment tour into routes at depot/EOS)
              Required: 1-2 days of training (on the same MPS-GPU), 50K+ examples.
              Expected gain as warm-start: 0.5-2% on N=1000.
   Verdict:   PIPELINE DONE. User-ready for scaling to CVRPTW when
              prioritized.

[+] 28. CVRPTW pointer-net (extension of #27)
   Files:     neural/train_pointer_cvrptw.py + pointer_cvrptw_*.onnx
              src/neural.rs::PointerCvrptwModel
              examples/cvrptw_test.rs
   Extension from TSP:
              - 6-d node features: (x, y, demand, tw_start, tw_end, service)
              - Depot node (idx 0) that opens/closes routes
              - State per step: (current_load_norm, current_time_norm)
              - Multi-vehicle output: model picks depot return when
                capacity/TW tightness requires a new route
              - REINFORCE loss includes travel cost + 5×TW violation penalty
              - Feasibility mask in decoder: only selectable if cap+TW OK
   Training:  400 epochs on MPS-GPU, 61s. Final greedy mean cost 9.29 on
              N=20 random CVRPTW.
   Rust test: examples/cvrptw_test.rs on synthetic N=20:
              - All 20 customers visited
              - 6 routes generated with correct depot returns
              - Total cost: 8.50
              - No TW/capacity violations (feasibility mask ensures it)
   Status:    PIPELINE WORKS. Rust can call CVRPTW model and get
              feasible multi-route output. Ready for integration into the brooom
              solver as warm-start.
   Limitation for r1_1000 use:
              - Model trained on uniform random N=20, not Solomon-style N=1000
              - Requires retraining with:
                * Varied N (50, 100, 250, 500, 1000) in the same batch
                * Solomon-style TW distributions (clustered, varied tightness)
                * Solomon-style clustering (R1 random, C1 clustered, RC1 mixed)
                * Larger training budget (hours, not minutes)
   Verdict:   PIPELINE DONE. Scaling to r1_1000 = retraining question.

[~] 29. RAG-NN architecture (user's vision)
   Idea:      NN encoder → embedding → ANN lookup → fetch cached solution(s)
              from similar problems → NN decoder combines cached solution
              with new problem state → adapted solution
   Comparison with GPT/RAG:
              - GPT: text → embedding → retrieve documents → new text
              - VRP-RAG: problem → embedding → retrieve solutions → new solution
   Builds on: src/embedding.rs (16/22-d ProblemEmbedding)
              src/cache.rs (CacheMeta + nearest top-K)
              src/neural.rs (PointerCvrptwModel)
   New architecture needed:
              - "Solution decoder" NN: takes (cached_solution_template, new_problem)
                → adapted solution
              - Training data: pairs of (problem_a, solution_a, problem_b, solution_b)
                where problem_a ~ problem_b and solution_b is known good
              - Loss: cost(adapted_solution_b) - cost(optimal_solution_b)
   Status:    SKETCHED. Requires significant ML work + corpus of solved
              VRP instances. Natural continuation of RAG Level 3
              (median-config transfer) we already have.
   Verdict:   FUTURE WORK once basic pointer-net for r1_1000 is solid.

[+] 30. Distillation NN: refiner trained on Vroom as teacher
   Concept: knowledge distillation à la DeepSeek vs OpenAI.
            Teacher = Vroom solution (known good).
            Student = small NN that predicts per-edge "is-in-Vroom-solution".
            At inference: score brooom edges, prioritize those with low prob —
            they are likely suboptimal.
   Training data: 5 brooom+vroom pairs from benchmarks/results/
   Edge-overlap analysis:
            N=50:  57.4% overlap, brooom 1563 vs vroom 1554
            N=100: 69.4%        ,        2463 vs       2466 (we win)
            N=250: 53.5%        ,        4373 vs       4472 (we win)
            N=500: 39.1%        ,        7091 vs       7132 (we win)
            N=1000: 36.3%       ,       12162 vs      11748 (Vroom wins)
   Model: 8-feature MLP (edge length, position, node degree, TW tightness,
            route length, depot indicator).
   Result: val_acc=61.9% (baseline 60%), margin=+0.05.
   Signal:  The model LEARNS to discriminate (margin is positive), but 0.05 is too
            small for practical LS prioritization.
   Limitation: 5 instances × ~500 edges = 2K examples is too small.
            8 features don't capture complex patterns.
            Edge-overlap ceiling (36-69%) limits anyway.
   Files:    neural/train_refiner_real.py + edge_refiner.onnx
   To make it practical:
            1. 50-300+ Solomon/Gehring-Homberger pairs (Vroom solutions on all)
            2. GNN encoder on the distance matrix (PyG/DGL)
            3. Multi-teacher distillation (Vroom + LKH-3 + HGS) for "soft labels"
            4. Direct cost distillation: predict per-move cost delta instead
               of binary is-in-vroom
   Verdict: PIPELINE DONE (Python training + ONNX + Rust-ready).
            Signal weak with 5 pairs. Scaling = more data + GNN.

[+] 8b. Tour-merge (merge short routes)
   Diagnosis: Before implementation, checked N=1000 solution. ALL 67 routes have
              11-21 stops. NO short routes (≤3) exist. The relocate operator
              already handles tour-merge cases gradually.
   Verdict:   NOT NECESSARY.

[-] 10. ALNS multi-armed bandit on destroy operators
   Strategy: 3 destroy variants — Random / Worst (highest marginal cost
             per task) / Related (Shaw — remove n nearest to seed).
             Bandit picks via softmax weights, updated with Ropke-
             Pisinger scoring (σ1=33 new global-best, σ2=9 local improvement,
             σ3=0). EMA r=0.2, segment=5 iters.
             Gradient interpretation: policy gradient on action space 3
             (in contrast to REINFORCE-on-tasks with action space N).
   Result:    3-arm: mixed, average regression
              2-arm (Random+Worst, removed Related):
                N=50    1569→1565 (-0.3%)
                N=100   2441→2451 (+0.4%)
                N=250   4435→4486 (+1.2%)  REGRESSION, loses WIN
                N=500   7241→7325 (+1.2%)  REGRESSION
                N=1000  13351→13234 (-0.9%) IMPROVEMENT (deterministic
                                            over 3 runs, confirmed)
              vs Vroom: N=250 loses its WIN status (-0.8% → +0.3%);
              N=1000 goes from +13.6% → +12.6% (still big gap).
              Trade-off: 1% better on a size we're already 13% behind on,
              in exchange for losing a proven WIN.
   Verdict:   REVERTED. Bandit + 30 iters is too short a horizon — typical
              ALNS literature uses 1000+ iters with simulated-annealing
              acceptance. Our "best-ever-only" acceptance is the wrong setting for ALNS.
              If the user later wants to prioritize N=1000 specifically, we can
              reintroduce Worst-removal conditional on instance size.

[-] 9. REINFORCE weights on ILS task selection (PyTorch-style gradient)
   Strategy: each task has a sample weight. ILS kick picks n_kick tasks
             via Efraimidis-Spirakis (key = -ln(U)/w). After LS:
             multiplicative bump/decay based on sign(Δcost).
             Tested with α=0.15 (1.15/0.95) and α=0.05 (1.05/0.98).
   Result:    Both variants regress across the board:
              α=0.15: N=100 +1.3%, N=250 +1.0%, N=500 +1.2%, N=1000 +1.4%
              α=0.05: N=100 +0.9%, N=250 +1.9%, N=500 +1.8%, N=1000 +0.8%
              Cause: ILS gets all its value from diversity in kicks. With
              only 30 iterations per variant we are in the exploration regime,
              not exploitation regime. REINFORCE concentrates kicks on
              "hard" tasks, but these are often hard precisely
              because they are tight — removing them more often produces only
              more unassigned, not better global optima.
              This is classic explore-vs-exploit: 30 kicks is too few for
              policy convergence to pay off.
   Verdict:   REVERTED.
