======================================================================
Round 41 — P100 fp64 Verification Summary
======================================================================
diffcb         : 0.1.11
PyTorch        : 2.2.2+cu118
GPU            : Tesla P100-PCIE-16GB (compute capability 6.0)
forward_batched: YES (B1+B2)
R+multimode    : YES

── Part 1: fp64 vs fp32 (bimodal, DCBLayer, best-of-5) ──────────────
           n     dtype          ms
──────────────────────────────────────
      50,000   float64     38.0818
      50,000   float32     24.2779
     100,000   float64     38.8386
     100,000   float32     24.5947
   1,000,000   float64     40.0615
   1,000,000   float32     25.6655

fp64 / fp32 latency ratio (P100 has strong fp64; peak fp64 rate is half of fp32):
  n=   50,000  ratio=1.569x
  n=  100,000  ratio=1.579x
  n=1,000,000  ratio=1.561x

── Part 2: forward_batched B1+B2 speedup (fp64, CUDA) ───────────────
   K         n    serial_ms    batch_ms    speedup     batch_dps
─────────────────────────────────────────────────────────────────
   1    50,000      64.1652     11.7122     5.4785x          85.4
   4    50,000     257.8352     20.2539    12.7302x         197.5
   8    50,000     515.6841     33.4900    15.3981x         238.9
  16    50,000    1035.7828     57.4761    18.0211x         278.4
  32    50,000    2070.0996    104.3010    19.8474x         306.8
  64    50,000    4134.9243    211.2463    19.5739x         303.0
   1   100,000      39.5547     11.3051     3.4988x          88.5
   4   100,000     161.2752     21.4318     7.5250x         186.6
   8   100,000     318.4005     33.5456     9.4916x         238.5
  16   100,000     638.6050     59.2827    10.7722x         269.9
  32   100,000    1278.9561    111.9748    11.4218x         285.8
  64   100,000    2586.5457    224.3666    11.5282x         285.2

Peak B1+B2 speedup: 19.85x at K=32 n=50,000
10x target: PASS (>=10x)
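
The derived columns in Part 2 can be recomputed from serial_ms and batch_ms. The sketch below is not the benchmark's own code: the function names are hypothetical, and the batch_dps formula (K divided by batch time in seconds) is an assumption that happens to reproduce every printed row to the shown precision. The three rows are copied from the table above.

```python
# Sketch: recompute the Part 2 derived columns from the raw timings.
# Function names are hypothetical; rows are copied from the Part 2 table.
rows = [
    # (K, n, serial_ms, batch_ms)
    (1, 50_000, 64.1652, 11.7122),
    (32, 50_000, 2070.0996, 104.3010),
    (64, 100_000, 2586.5457, 224.3666),
]


def speedup(serial_ms, batch_ms):
    """Serial time divided by batched time."""
    return serial_ms / batch_ms


def batch_dps(k, batch_ms):
    """Assumed definition: K items per batch / batch time in seconds."""
    return k / (batch_ms / 1000.0)


for k, n, serial_ms, batch_ms in rows:
    s = speedup(serial_ms, batch_ms)
    d = batch_dps(k, batch_ms)
    print(f"{k:4d} {n:9,d} {s:10.4f}x {d:10.1f}")

# Peak over the sampled rows, and the 10x target check.
peak = max(speedup(r[2], r[3]) for r in rows)
target_pass = peak >= 10.0
print(f"peak={peak:.2f}x  10x target: {'PASS' if target_pass else 'FAIL'}")
```

Only three rows are included to keep the sketch short; adding the remaining rows does not change the peak, which sits at K=32 n=50,000.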

── Part 3: DCB CUDA vs R speedup ───────────────────────────────────
           n    dcb_gpu_ms        r_ms   speedup_gpu
────────────────────────────────────────────────────
       1,000       13.8954    1515.485       109.064x
      10,000       77.8050    1583.335        20.350x
     100,000       38.7515    1710.029        44.128x
   1,000,000       39.4587    4650.677       117.862x

── Part 4: Accuracy (mean overest_pct vs R) ────────────────────────
           n        dist       mean%      min%      max%
───────────────────────────────────────────────────────
       1,000     bimodal      -0.002    -0.004    -0.001
       1,000    gaussian      -0.054    -0.098    -0.003
      10,000     bimodal      -0.002    -0.004    -0.000
      10,000    gaussian      -0.102    -0.223    -0.008
     100,000     bimodal      -0.000    -0.001    +0.001
     100,000    gaussian      +0.027    -0.075    +0.132

======================================================================