identity layers + randn queries

/usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7627: UserWarning: 
Online softmax is disabled on the fly since Inductor decides to
split the reduction. Cut an issue to PyTorch if this is an
important use case and you want to speed it up with online
softmax.

  warnings.warn(
/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
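The TensorFloat32 warning above can be addressed exactly as the message suggests. A minimal sketch (precision string values per the PyTorch docs):

```python
import torch

# Allow TF32 tensor cores for float32 matmuls on Ampere+ GPUs,
# trading a small amount of precision for throughput.
# "highest" restores full-precision float32 matmuls.
torch.set_float32_matmul_precision("high")
```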
/usr/local/lib/python3.12/dist-packages/torch/_inductor/select_algorithm.py:3464: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  current_size = base.storage().size()
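The TypedStorage deprecation warning recommends the byte-oriented replacement API. A small illustration of the difference (note the replacement counts bytes, not elements):

```python
import torch

t = torch.zeros(4, dtype=torch.float32)

# Deprecated spelling referenced by the warning:
#   t.storage().size()  -> element count via TypedStorage
# Replacement: UntypedStorage measures bytes.
nbytes = t.untyped_storage().nbytes()  # 4 elements * 4 bytes each
```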
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] Runtime error during autotuning: 
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] CUDA driver error: invalid argument
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] 
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] This may mean this GPU is too small for max_autotune mode.
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] 
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] . 
E0429 17:42:45.063000 8448 torch/_inductor/select_algorithm.py:3727] [0/1] Ignoring this choice.
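These errors come from Inductor benchmarking candidate Triton GEMM kernels under max-autotune; they are non-fatal, since each failing choice is simply skipped. If the sweep (and its log spam) is unwanted, compiling without max-autotune avoids it. A hedged sketch (mode names per the torch.compile documentation; `f` is a stand-in for the real forward):

```python
import torch

def f(x, w):
    # stand-in for the model's matmul-heavy forward pass
    return x @ w

compiled = torch.compile(f)  # default mode: no Triton GEMM autotune sweep
# compiled = torch.compile(f, mode="max-autotune")  # enables the sweep logged above
```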
[the same "CUDA driver error: invalid argument" autotuning error repeated 11 more times (17:42:45.077 through 17:42:45.201); each choice was ignored]
Autotune Choices Stats:
{"num_choices": 13, "num_triton_choices": 12, "best_kernel": "bmm", "best_time": 3.4744319915771484, "best_triton_pos": 1, "best_triton_time": Infinity, "best_triton_kernel": "triton_bmm_0", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2"}
AUTOTUNE bmm(131072x2x1, 131072x1x512)
strides: [1, 131072, 0], [512, 0, 1]
dtypes: torch.float32, torch.float32
  bmm 3.4744 ms 100.0% 
  triton_bmm_0 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2
  triton_bmm_1 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
  triton_bmm_2 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_bmm_3 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
  triton_bmm_4 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_bmm_5 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_bmm_6 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_bmm_7 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_bmm_8 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.4249 seconds and 0.0006 seconds precompiling for 13 choices
Autotune Choices Stats:
{"num_choices": 18, "num_triton_choices": 17, "best_kernel": "triton_mm_19", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4", "best_time": 0.2959359884262085, "best_triton_pos": 0}
AUTOTUNE mm(512x1, 1x262144)
strides: [1, 512], [0, 1]
dtypes: torch.float32, torch.float32
  triton_mm_19 0.2959 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_17 0.2970 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_mm_23 0.2970 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_22 0.2980 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
  triton_mm_20 0.2990 ms 99.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_16 0.3154 ms 93.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_24 0.3195 ms 92.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  mm 0.3226 ms 91.7% 
  triton_mm_18 0.3256 ms 90.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8
  triton_mm_21 0.3318 ms 89.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 0.7168 seconds and 0.4730 seconds precompiling for 18 choices
Autotune Choices Stats:
{"num_choices": 18, "num_triton_choices": 17, "best_kernel": "mm", "best_time": 0.14643199741840363, "best_triton_pos": 1, "best_triton_time": 0.15462400019168854, "best_triton_kernel": "triton_mm_39", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4"}
AUTOTUNE mm(512x1, 1x131072)
strides: [1, 512], [0, 1]
dtypes: torch.float32, torch.float32
  mm 0.1464 ms 100.0% 
  triton_mm_39 0.1546 ms 94.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
  triton_mm_34 0.1556 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_mm_36 0.1556 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_37 0.1556 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_40 0.1556 ms 94.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_35 0.1587 ms 92.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8
  triton_mm_41 0.1618 ms 90.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_29 0.1628 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=32, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2
  triton_mm_33 0.1628 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.5355 seconds and 0.6913 seconds precompiling for 18 choices

paper_forward fwd+bwd:  385.238 ms
paper_forward bwd-only: 304.971 ms
paper_forward peak allocated: fwd=29.705 GiB, fwd+bwd=31.823 GiB
paper_forward peak reserved:  fwd=29.760 GiB, fwd+bwd=32.510 GiB
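The peak allocated/reserved figures above correspond to PyTorch's CUDA memory statistics. A small helper, assuming the stats are reset before the measured region (returns zeros on a machine without a GPU):

```python
import torch

def peak_memory_gib():
    """Peak CUDA memory since the last reset, in GiB."""
    if not torch.cuda.is_available():
        return {"allocated": 0.0, "reserved": 0.0}
    return {
        "allocated": torch.cuda.max_memory_allocated() / 2**30,
        "reserved": torch.cuda.max_memory_reserved() / 2**30,
    }

# Typical use:
#   torch.cuda.reset_peak_memory_stats()
#   run fwd (or fwd+bwd), then read peak_memory_gib()
```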
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 11.01s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
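The sweep above is Triton's autotuner timing every (num_warps, num_stages) combination and caching the winner per key. The selection logic, reduced to plain Python (a sketch only; real Triton kernels use the `@triton.autotune` decorator with `triton.Config` objects):

```python
import time

def autotune(kernel, configs, *args):
    """Time each candidate config and return the fastest one."""
    best_cfg, best_time = None, float("inf")
    for cfg in configs:
        start = time.perf_counter()
        kernel(*args, **cfg)           # benchmark this configuration
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_cfg, best_time = cfg, elapsed
    return best_cfg
```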
[phase_2_online_softmax_merge_intrablock_out_kernel autotuned over the same 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4}, num_ctas: 1, maxnreg: None]
Triton autotuning for function phase_2_online_softmax_merge_intrablock_out_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16'),
finished after 5.24s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
[the same 20-config sweep (num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4}) ran before each of the following summaries; repeats elided]
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 13.90s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 17.12s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 17.54s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 8.03s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
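Each attention-kernel sweep above enumerates the full cross product of warp counts and pipeline stages, so every sweep times 20 candidates. A minimal sketch reconstructing that candidate grid (plain Python, not the actual Triton config declaration):

```python
from itertools import product

# The swept values, as reported by the autotune log lines above.
warp_counts = [1, 2, 4, 8, 16]
stage_counts = [1, 2, 3, 4]

configs = [
    {"num_warps": w, "num_ctas": 1, "num_stages": s, "maxnreg": None}
    for w, s in product(warp_counts, stage_counts)
]
print(len(configs))  # 20 candidates, one timing run each per sweep
```

This explains why the per-kernel tuning times range from seconds to nearly a minute: the grid is fixed, but each candidate's benchmark cost depends on the kernel and its key (shape/dtype signature).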
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 13.52s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel over the config grid BLOCK_BATCH_SEQ in {128, 256} x BLOCK_HIDDEN in {32, 64} x num_warps in {4, 8} (num_ctas: 1, num_stages: 1, maxnreg: None; 8 candidates)
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel,
with key as (131072, 512, 1, 'torch.float32', 'torch.float32'),
finished after 2.04s,
best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
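The reduce-grad sweeps are much smaller: two tile sizes per axis crossed with two warp counts gives only eight candidates, which is why these sweeps finish in about two seconds. The grid, reconstructed in plain Python:

```python
from itertools import product

# Swept values from the phase_1_reduce_grad_pseudo_queries_kernel lines above.
candidates = [
    {"BLOCK_BATCH_SEQ": bs, "BLOCK_HIDDEN": bh, "num_warps": w, "num_stages": 1}
    for bs, bh, w in product([128, 256], [32, 64], [4, 8])
]
print(len(candidates))  # 8 candidates per reduce-grad sweep
```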
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 6.64s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel over the config grid BLOCK_BATCH_SEQ in {128, 256} x BLOCK_HIDDEN in {32, 64} x num_warps in {4, 8} (num_ctas: 1, num_stages: 1, maxnreg: None; 8 candidates)
Triton autotuning for function phase_2_reduce_grad_pseudo_query_kernel,
with key as (131072, 512, 'torch.float32', 'torch.float32'),
finished after 1.99s,
best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 46.99s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel over the config grid BLOCK_BATCH_SEQ in {128, 256} x BLOCK_HIDDEN in {32, 64} x num_warps in {4, 8} (num_ctas: 1, num_stages: 1, maxnreg: None; 8 candidates)
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel,
with key as (131072, 512, 8, 'torch.float32', 'torch.float32'),
finished after 1.96s,
best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 42.45s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 31.21s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 18.82s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
production_forward2 fwd+bwd:  191.605 ms
production_forward2 bwd-only: 172.599 ms
production_forward2 peak allocated: fwd=2.567 GiB, fwd+bwd=5.946 GiB
production_forward2 peak reserved:  fwd=2.963 GiB, fwd+bwd=8.713 GiB
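For reading the peak-memory lines: "allocated" is memory actually held by live tensors, while "reserved" is what PyTorch's caching allocator has claimed from the CUDA driver (so reserved >= allocated, and the gap is allocator cache plus fragmentation). A small arithmetic check on the figures above:

```python
# Figures copied from the production_forward2 log lines (GiB).
fwd_alloc, fwdbwd_alloc = 2.567, 5.946
fwd_res, fwdbwd_res = 2.963, 8.713

# Extra allocation attributable to the backward pass (saved activations
# plus gradient buffers), and the reserved-but-unallocated slack at peak.
bwd_extra = fwdbwd_alloc - fwd_alloc
slack = fwdbwd_res - fwdbwd_alloc
print(f"backward adds {bwd_extra:.3f} GiB allocated; "
      f"{slack:.3f} GiB reserved but unallocated at peak")
```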
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 6.95s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over the config grid num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 candidates)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 54.09s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 44.19s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 32.62s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 20.30s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None;
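Each autotuning pass above sweeps the same grid: num_warps in {1, 2, 4, 8, 16} crossed with num_stages in {1..4} (num_ctas fixed at 1, maxnreg unset), i.e. 20 candidates per key, keeping the fastest. A minimal pure-Python sketch of that selection loop, with a hypothetical `time_kernel` cost function standing in for the real Triton benchmark run:

```python
import itertools

def pick_best_config(time_kernel):
    """Enumerate the config grid shown in the log (num_ctas fixed at 1,
    maxnreg unset) and return the candidate time_kernel rates fastest."""
    configs = [
        {"num_warps": w, "num_ctas": 1, "num_stages": s, "maxnreg": None}
        for w, s in itertools.product([1, 2, 4, 8, 16], range(1, 5))
    ]
    assert len(configs) == 20  # matches the 20 "Autotuning kernel ..." lines per key
    return min(configs, key=time_kernel)

# Hypothetical cost model just to exercise the loop; real autotuning
# times each compiled kernel variant on the GPU instead.
best = pick_best_config(
    lambda c: abs(c["num_warps"] - 1) + abs(c["num_stages"] - 2)
)
# -> num_warps=1, num_stages=2, mirroring the first key's selection above
```

In the real run Triton also caches the winner per key tuple (the shape/dtype signature printed after "with key as"), which is why each new key triggers a fresh 20-config sweep.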
production_forward fwd+bwd:  113.491 ms
production_forward bwd-only: 95.886 ms
production_forward peak allocated: fwd=3.071 GiB, fwd+bwd=9.821 GiB
production_forward peak reserved:  fwd=3.338 GiB, fwd+bwd=12.338 GiB
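The timing and memory figures above are averaged wall-clock numbers. A CPU-only sketch of the measurement shape, assuming the usual PyTorch practice (the names `time_ms` and the dummy workload are illustrative; real GPU timing brackets the loop with `torch.cuda.synchronize()` or CUDA events, and the peak-allocated/peak-reserved numbers come from `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()`):

```python
import time

def time_ms(fn, iters=10):
    """Average wall-clock milliseconds per call of fn.
    Warm-up call excludes one-time compilation/caching cost."""
    fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

# Dummy stand-in for a forward (or forward+backward) pass.
fwd_bwd_ms = time_ms(lambda: sum(i * i for i in range(1000)))
```

A "bwd-only" figure like the one above is typically derived by timing the forward pass alone and the combined fwd+bwd pass separately, then subtracting.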

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0015801729168742895, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.007999137975275517, max_abs=0.349609375, mean_rel=0.07028037309646606, max_rel=93.23179626464844, norm_rel=0.019216591492295265, ref_abs_avg=0.4511302709579468, test_abs_avg=0.4511508643627167
production_forward grad[1] vs paper_forward: mean_abs=6.821956157684326, max_abs=48.0, mean_rel=0.147918701171875, max_rel=259.58331298828125, norm_rel=0.019457440823316574, ref_abs_avg=308.0956726074219, test_abs_avg=308.1716613769531
production_forward grad[2] vs paper_forward: mean_abs=1.1876044273376465, max_abs=5.0, mean_rel=0.0886216089129448, max_rel=2.856550693511963, norm_rel=0.0223385039716959, ref_abs_avg=53.860626220703125, test_abs_avg=53.85081481933594
production_forward grad[3] vs paper_forward: mean_abs=1.5219930410385132, max_abs=11.0, mean_rel=0.16117843985557556, max_rel=1847.185302734375, norm_rel=0.023943297564983368, ref_abs_avg=63.968955993652344, test_abs_avg=63.9727783203125
production_forward grad[4] vs paper_forward: mean_abs=1.4006023406982422, max_abs=10.0, mean_rel=0.4023421108722687, max_rel=4375.0, norm_rel=0.022372353821992874, ref_abs_avg=62.93260192871094, test_abs_avg=62.93924331665039
production_forward grad[5] vs paper_forward: mean_abs=1.0514183044433594, max_abs=3.75, mean_rel=0.07834570854902267, max_rel=2.8421409130096436, norm_rel=0.021730298176407814, ref_abs_avg=48.237098693847656, test_abs_avg=48.24871063232422
production_forward grad[6] vs paper_forward: mean_abs=1.356330156326294, max_abs=9.0, mean_rel=0.16842123866081238, max_rel=3362.71630859375, norm_rel=0.02380526252090931, ref_abs_avg=57.3785285949707, test_abs_avg=57.3852424621582
production_forward grad[7] vs paper_forward: mean_abs=1.2472559213638306, max_abs=8.0, mean_rel=0.28839191794395447, max_rel=2999.999755859375, norm_rel=0.022086873650550842, ref_abs_avg=56.85024642944336, test_abs_avg=56.86254119873047
production_forward grad[8] vs paper_forward: mean_abs=0.9643962383270264, max_abs=4.25, mean_rel=0.16699998080730438, max_rel=13.089797019958496, norm_rel=0.02179776132106781, ref_abs_avg=43.664306640625, test_abs_avg=43.641883850097656
production_forward grad[9] vs paper_forward: mean_abs=1.228745698928833, max_abs=8.5, mean_rel=0.1626926064491272, max_rel=1925.9468994140625, norm_rel=0.02355739288032055, ref_abs_avg=52.54914093017578, test_abs_avg=52.553253173828125
production_forward grad[10] vs paper_forward: mean_abs=1.1219182014465332, max_abs=7.25, mean_rel=0.3235069513320923, max_rel=3874.999755859375, norm_rel=0.02174711413681507, ref_abs_avg=51.93836975097656, test_abs_avg=51.94451141357422
production_forward grad[11] vs paper_forward: mean_abs=0.8524951934814453, max_abs=3.375, mean_rel=0.10369281470775604, max_rel=4.980968475341797, norm_rel=0.021579822525382042, ref_abs_avg=38.89253234863281, test_abs_avg=38.89323425292969
production_forward grad[12] vs paper_forward: mean_abs=1.1401891708374023, max_abs=9.0, mean_rel=0.15680965781211853, max_rel=1563.9097900390625, norm_rel=0.023392660543322563, ref_abs_avg=49.10293960571289, test_abs_avg=49.10669708251953
production_forward grad[13] vs paper_forward: mean_abs=1.0396099090576172, max_abs=6.5, mean_rel=0.3004627227783203, max_rel=3843.749755859375, norm_rel=0.021689999848604202, ref_abs_avg=48.20381546020508, test_abs_avg=48.20499801635742
production_forward grad[14] vs paper_forward: mean_abs=0.8077888488769531, max_abs=3.375, mean_rel=0.08194241672754288, max_rel=8.170308113098145, norm_rel=0.02220870740711689, ref_abs_avg=36.91057205200195, test_abs_avg=36.958091735839844
production_forward grad[15] vs paper_forward: mean_abs=1.0596719980239868, max_abs=7.0, mean_rel=0.15346470475196838, max_rel=1301.7252197265625, norm_rel=0.02321195788681507, ref_abs_avg=45.97185516357422, test_abs_avg=45.97437286376953
production_forward grad[16] vs paper_forward: mean_abs=0.9777671098709106, max_abs=6.0, mean_rel=0.23717117309570312, max_rel=2375.0, norm_rel=0.021600546315312386, ref_abs_avg=45.45899200439453, test_abs_avg=45.47166442871094
production_forward grad[17] vs paper_forward: mean_abs=0.7815446853637695, max_abs=3.25, mean_rel=0.092159703373909, max_rel=7.500978469848633, norm_rel=0.022490741685032845, ref_abs_avg=34.417518615722656, test_abs_avg=34.328269958496094
production_forward grad[18] vs paper_forward: mean_abs=0.9955465793609619, max_abs=9.0, mean_rel=0.15455813705921173, max_rel=2222.07763671875, norm_rel=0.02316720224916935, ref_abs_avg=43.235050201416016, test_abs_avg=43.23835372924805
production_forward grad[19] vs paper_forward: mean_abs=0.9133306741714478, max_abs=5.5, mean_rel=0.26848167181015015, max_rel=2125.0, norm_rel=0.021407293155789375, ref_abs_avg=42.94965362548828, test_abs_avg=42.95027160644531
production_forward grad[20] vs paper_forward: mean_abs=0.7646365165710449, max_abs=3.0, mean_rel=0.08095085620880127, max_rel=7.8364410400390625, norm_rel=0.022511132061481476, ref_abs_avg=34.571624755859375, test_abs_avg=34.577720642089844
production_forward grad[21] vs paper_forward: mean_abs=0.9484402537345886, max_abs=8.0, mean_rel=0.14872711896896362, max_rel=1455.66455078125, norm_rel=0.02306726574897766, ref_abs_avg=41.408870697021484, test_abs_avg=41.41416931152344
production_forward grad[22] vs paper_forward: mean_abs=0.8634872436523438, max_abs=5.25, mean_rel=0.33720603585243225, max_rel=3187.499755859375, norm_rel=0.02119280770421028, ref_abs_avg=40.971107482910156, test_abs_avg=40.978736877441406
production_forward grad[23] vs paper_forward: mean_abs=0.684910774230957, max_abs=2.5625, mean_rel=0.10807637870311737, max_rel=11.711923599243164, norm_rel=0.0202614888548851, ref_abs_avg=33.58743667602539, test_abs_avg=33.56614685058594
production_forward grad[24] vs paper_forward: mean_abs=0.9008255004882812, max_abs=6.5, mean_rel=0.15570871531963348, max_rel=999.0344848632812, norm_rel=0.022913780063390732, ref_abs_avg=39.542457580566406, test_abs_avg=39.54249954223633
production_forward grad[25] vs paper_forward: mean_abs=0.8208962678909302, max_abs=5.5, mean_rel=0.30059754848480225, max_rel=2375.0, norm_rel=0.02096746489405632, ref_abs_avg=39.327392578125, test_abs_avg=39.33070755004883
production_forward grad[26] vs paper_forward: mean_abs=0.8272948265075684, max_abs=3.0, mean_rel=0.08208966255187988, max_rel=5.42167329788208, norm_rel=0.023283453658223152, ref_abs_avg=36.0911750793457, test_abs_avg=36.191864013671875
production_forward grad[27] vs paper_forward: mean_abs=1.0290260314941406, max_abs=8.0, mean_rel=0.1758023500442505, max_rel=1304.1146240234375, norm_rel=0.02480878122150898, ref_abs_avg=41.69692611694336, test_abs_avg=41.69957733154297
production_forward grad[28] vs paper_forward: mean_abs=0.9491353034973145, max_abs=6.0, mean_rel=0.2610308527946472, max_rel=3062.499755859375, norm_rel=0.023024680092930794, ref_abs_avg=41.39656448364258, test_abs_avg=41.395751953125
production_forward grad[29] vs paper_forward: mean_abs=0.7881612777709961, max_abs=3.859375, mean_rel=0.14162355661392212, max_rel=21.613378524780273, norm_rel=0.025011155754327774, ref_abs_avg=31.045223236083984, test_abs_avg=31.017831802368164
production_forward grad[30] vs paper_forward: mean_abs=0.9521849155426025, max_abs=7.0, mean_rel=0.16213256120681763, max_rel=1273.078125, norm_rel=0.02499374933540821, ref_abs_avg=38.286094665527344, test_abs_avg=38.289764404296875
production_forward grad[31] vs paper_forward: mean_abs=0.8826069831848145, max_abs=5.0, mean_rel=0.30339938402175903, max_rel=1999.9998779296875, norm_rel=0.02354077249765396, ref_abs_avg=37.6838264465332, test_abs_avg=37.684181213378906
production_forward grad[32] vs paper_forward: mean_abs=0.6987738609313965, max_abs=2.75, mean_rel=0.08816632628440857, max_rel=7.0089921951293945, norm_rel=0.024784661829471588, ref_abs_avg=27.616657257080078, test_abs_avg=27.662824630737305
production_forward grad[33] vs paper_forward: mean_abs=0.8819955587387085, max_abs=6.25, mean_rel=0.15808187425136566, max_rel=1328.5572509765625, norm_rel=0.024799296632409096, ref_abs_avg=35.73487854003906, test_abs_avg=35.7369499206543
production_forward grad[34] vs paper_forward: mean_abs=0.821435272693634, max_abs=5.5, mean_rel=0.2931009531021118, max_rel=2812.499755859375, norm_rel=0.023541221395134926, ref_abs_avg=35.062904357910156, test_abs_avg=35.065528869628906
production_forward grad[35] vs paper_forward: mean_abs=0.6666642427444458, max_abs=2.25, mean_rel=0.10705417394638062, max_rel=4.545058250427246, norm_rel=0.023944439366459846, ref_abs_avg=27.525596618652344, test_abs_avg=27.51608657836914
production_forward grad[36] vs paper_forward: mean_abs=0.8276935815811157, max_abs=5.5, mean_rel=0.16265171766281128, max_rel=912.8562622070312, norm_rel=0.02474815584719181, ref_abs_avg=33.59124755859375, test_abs_avg=33.59613800048828
production_forward grad[37] vs paper_forward: mean_abs=0.7766587138175964, max_abs=4.9375, mean_rel=0.3118058443069458, max_rel=2156.25, norm_rel=0.02342105843126774, ref_abs_avg=33.26607894897461, test_abs_avg=33.264442443847656
production_forward grad[38] vs paper_forward: mean_abs=0.6323413848876953, max_abs=2.0, mean_rel=0.1741843968629837, max_rel=20.304025650024414, norm_rel=0.02413232810795307, ref_abs_avg=25.3636531829834, test_abs_avg=25.413253784179688
production_forward grad[39] vs paper_forward: mean_abs=0.7816578149795532, max_abs=5.0, mean_rel=0.16436298191547394, max_rel=932.9815673828125, norm_rel=0.02454269491136074, ref_abs_avg=31.968656539916992, test_abs_avg=31.96959114074707
production_forward grad[40] vs paper_forward: mean_abs=0.7302777767181396, max_abs=4.5, mean_rel=0.27089035511016846, max_rel=2187.5, norm_rel=0.022977758198976517, ref_abs_avg=31.872711181640625, test_abs_avg=31.880603790283203
production_forward grad[41] vs paper_forward: mean_abs=0.5971198081970215, max_abs=2.3125, mean_rel=0.11402087658643723, max_rel=7.963451385498047, norm_rel=0.024664470925927162, ref_abs_avg=24.537538528442383, test_abs_avg=24.508337020874023
production_forward grad[42] vs paper_forward: mean_abs=0.7412899732589722, max_abs=6.0, mean_rel=0.15955549478530884, max_rel=1164.5166015625, norm_rel=0.02412276342511177, ref_abs_avg=30.831432342529297, test_abs_avg=30.833627700805664
production_forward grad[43] vs paper_forward: mean_abs=0.6881078481674194, max_abs=4.25, mean_rel=0.28407031297683716, max_rel=2593.749755859375, norm_rel=0.02269728295505047, ref_abs_avg=30.378013610839844, test_abs_avg=30.390684127807617
production_forward grad[44] vs paper_forward: mean_abs=0.5594196319580078, max_abs=2.5, mean_rel=0.06800157576799393, max_rel=3.5526068210601807, norm_rel=0.023040466010570526, ref_abs_avg=25.062545776367188, test_abs_avg=25.012174606323242
production_forward grad[45] vs paper_forward: mean_abs=0.7089194059371948, max_abs=5.0, mean_rel=0.16214615106582642, max_rel=1472.3270263671875, norm_rel=0.02402377873659134, ref_abs_avg=29.626323699951172, test_abs_avg=29.630712509155273
production_forward grad[46] vs paper_forward: mean_abs=0.6592225432395935, max_abs=4.5, mean_rel=0.24710428714752197, max_rel=2749.999755859375, norm_rel=0.02269175834953785, ref_abs_avg=29.173206329345703, test_abs_avg=29.173778533935547
production_forward grad[47] vs paper_forward: mean_abs=0.5333414077758789, max_abs=2.125, mean_rel=0.1267337203025818, max_rel=8.716750144958496, norm_rel=0.022967441007494926, ref_abs_avg=23.23651885986328, test_abs_avg=23.304088592529297
production_forward grad[48] vs paper_forward: mean_abs=0.6779374480247498, max_abs=5.5, mean_rel=0.15909512341022491, max_rel=1173.2646484375, norm_rel=0.02387855388224125, ref_abs_avg=28.477386474609375, test_abs_avg=28.480735778808594
production_forward grad[49] vs paper_forward: mean_abs=0.6282257437705994, max_abs=3.875, mean_rel=0.23102322220802307, max_rel=1874.9998779296875, norm_rel=0.022533109411597252, ref_abs_avg=27.99240493774414, test_abs_avg=27.994800567626953
production_forward grad[50] vs paper_forward: mean_abs=0.5941143035888672, max_abs=2.2890625, mean_rel=0.12670087814331055, max_rel=11.816856384277344, norm_rel=0.024654509499669075, ref_abs_avg=23.858749389648438, test_abs_avg=23.870285034179688
production_forward grad[51] vs paper_forward: mean_abs=0.751095175743103, max_abs=5.125, mean_rel=0.16962634027004242, max_rel=1259.693115234375, norm_rel=0.02556668408215046, ref_abs_avg=29.450700759887695, test_abs_avg=29.450637817382812
production_forward grad[52] vs paper_forward: mean_abs=0.7044152021408081, max_abs=5.625, mean_rel=0.24386027455329895, max_rel=1781.2498779296875, norm_rel=0.024063589051365852, ref_abs_avg=29.37651252746582, test_abs_avg=29.377788543701172
production_forward grad[53] vs paper_forward: mean_abs=0.5401620864868164, max_abs=2.125, mean_rel=0.08842748403549194, max_rel=9.429885864257812, norm_rel=0.02515404112637043, ref_abs_avg=21.564620971679688, test_abs_avg=21.565448760986328
production_forward grad[54] vs paper_forward: mean_abs=0.6929619908332825, max_abs=5.0, mean_rel=0.16519665718078613, max_rel=1392.2119140625, norm_rel=0.02511824481189251, ref_abs_avg=27.648632049560547, test_abs_avg=27.651893615722656
production_forward grad[55] vs paper_forward: mean_abs=0.6502715945243835, max_abs=4.5, mean_rel=0.2551685571670532, max_rel=2500.0, norm_rel=0.023906933143734932, ref_abs_avg=27.25778579711914, test_abs_avg=27.259765625
production_forward grad[56] vs paper_forward: mean_abs=0.48848867416381836, max_abs=2.4140625, mean_rel=0.10896407067775726, max_rel=9.092547416687012, norm_rel=0.02338814176619053, ref_abs_avg=20.837825775146484, test_abs_avg=20.839780807495117
production_forward grad[57] vs paper_forward: mean_abs=0.6445450782775879, max_abs=5.25, mean_rel=0.15786118805408478, max_rel=1658.14501953125, norm_rel=0.024624282494187355, ref_abs_avg=26.214710235595703, test_abs_avg=26.21505355834961
production_forward grad[58] vs paper_forward: mean_abs=0.5980971455574036, max_abs=3.75, mean_rel=0.2041841447353363, max_rel=1437.4998779296875, norm_rel=0.023040344938635826, ref_abs_avg=26.009925842285156, test_abs_avg=26.014442443847656
production_forward grad[59] vs paper_forward: mean_abs=0.48030900955200195, max_abs=1.75, mean_rel=0.08248366415500641, max_rel=2.374539375305176, norm_rel=0.02335011214017868, ref_abs_avg=20.542221069335938, test_abs_avg=20.546314239501953
production_forward grad[60] vs paper_forward: mean_abs=0.6046392917633057, max_abs=5.0, mean_rel=0.15462318062782288, max_rel=728.8182983398438, norm_rel=0.02448323182761669, ref_abs_avg=24.75127410888672, test_abs_avg=24.752538681030273
production_forward grad[61] vs paper_forward: mean_abs=0.5648778676986694, max_abs=4.25, mean_rel=0.21095716953277588, max_rel=1187.5, norm_rel=0.02311522327363491, ref_abs_avg=24.48411750793457, test_abs_avg=24.487579345703125
production_forward grad[62] vs paper_forward: mean_abs=0.45525646209716797, max_abs=1.75, mean_rel=0.10034479945898056, max_rel=9.158077239990234, norm_rel=0.02339322119951248, ref_abs_avg=19.722618103027344, test_abs_avg=19.72789764404297
production_forward grad[63] vs paper_forward: mean_abs=0.5796430110931396, max_abs=5.0, mean_rel=0.1538151204586029, max_rel=1267.0419921875, norm_rel=0.02396349050104618, ref_abs_avg=24.185443878173828, test_abs_avg=24.184688568115234
production_forward grad[64] vs paper_forward: mean_abs=0.5355815887451172, max_abs=3.5625, mean_rel=0.25874510407447815, max_rel=1874.9998779296875, norm_rel=0.022678302600979805, ref_abs_avg=23.64425277709961, test_abs_avg=23.649097442626953
production_forward grad[65] vs paper_forward: mean_abs=0.44606733322143555, max_abs=1.625, mean_rel=0.12179648876190186, max_rel=12.782829284667969, norm_rel=0.023624621331691742, ref_abs_avg=18.49039077758789, test_abs_avg=18.49079132080078
production_forward grad[66] vs paper_forward: mean_abs=0.5441453456878662, max_abs=4.0, mean_rel=0.14889255166053772, max_rel=1094.702880859375, norm_rel=0.023699473589658737, ref_abs_avg=22.970991134643555, test_abs_avg=22.969694137573242
production_forward grad[67] vs paper_forward: mean_abs=0.5010315179824829, max_abs=3.75, mean_rel=0.22136341035366058, max_rel=1515.6248779296875, norm_rel=0.02183806523680687, ref_abs_avg=23.005884170532227, test_abs_avg=23.000547409057617
production_forward grad[68] vs paper_forward: mean_abs=0.3989924192428589, max_abs=1.75, mean_rel=0.08449646830558777, max_rel=4.982295989990234, norm_rel=0.020937997847795486, ref_abs_avg=19.38930892944336, test_abs_avg=19.351301193237305
production_forward grad[69] vs paper_forward: mean_abs=0.5206108689308167, max_abs=4.0, mean_rel=0.15255196392536163, max_rel=823.933837890625, norm_rel=0.02320694364607334, ref_abs_avg=22.443153381347656, test_abs_avg=22.445480346679688
production_forward grad[70] vs paper_forward: mean_abs=0.4760805368423462, max_abs=3.5, mean_rel=0.23439663648605347, max_rel=2562.5, norm_rel=0.021289411932229996, ref_abs_avg=22.3358097076416, test_abs_avg=22.336034774780273
production_forward grad[71] vs paper_forward: mean_abs=0.3925960659980774, max_abs=1.75, mean_rel=0.10336746275424957, max_rel=14.632308006286621, norm_rel=0.022274749353528023, ref_abs_avg=17.909225463867188, test_abs_avg=17.928428649902344
production_forward grad[72] vs paper_forward: mean_abs=0.4969756007194519, max_abs=4.5, mean_rel=0.14322346448898315, max_rel=887.9810180664062, norm_rel=0.023244304582476616, ref_abs_avg=21.39801788330078, test_abs_avg=21.400638580322266
production_forward grad[73] vs paper_forward: mean_abs=0.4606819152832031, max_abs=3.25, mean_rel=0.1951185017824173, max_rel=1499.9998779296875, norm_rel=0.02169525995850563, ref_abs_avg=21.26293182373047, test_abs_avg=21.263078689575195
production_forward grad[74] vs paper_forward: mean_abs=0.41828203201293945, max_abs=1.5625, mean_rel=0.10873236507177353, max_rel=8.077588081359863, norm_rel=0.02125866338610649, ref_abs_avg=20.01227569580078, test_abs_avg=20.007604598999023
production_forward grad[75] vs paper_forward: mean_abs=0.5456652045249939, max_abs=4.5, mean_rel=0.14863905310630798, max_rel=669.96435546875, norm_rel=0.024389026686549187, ref_abs_avg=22.40944480895996, test_abs_avg=22.41143798828125
production_forward grad[76] vs paper_forward: mean_abs=0.5079202055931091, max_abs=3.75, mean_rel=0.24707555770874023, max_rel=2562.5, norm_rel=0.023145750164985657, ref_abs_avg=22.01817512512207, test_abs_avg=22.018157958984375
production_forward grad[77] vs paper_forward: mean_abs=0.3705463409423828, max_abs=1.375, mean_rel=0.06604862213134766, max_rel=2.2728285789489746, norm_rel=0.022852551192045212, ref_abs_avg=16.69402313232422, test_abs_avg=16.665634155273438
production_forward grad[78] vs paper_forward: mean_abs=0.5049273371696472, max_abs=4.0, mean_rel=0.15534508228302002, max_rel=960.7106323242188, norm_rel=0.023875655606389046, ref_abs_avg=21.19204330444336, test_abs_avg=21.19339370727539
production_forward grad[79] vs paper_forward: mean_abs=0.47029101848602295, max_abs=4.25, mean_rel=0.19620710611343384, max_rel=1531.2498779296875, norm_rel=0.022109871730208397, ref_abs_avg=21.307357788085938, test_abs_avg=21.311588287353516
production_forward grad[80] vs paper_forward: mean_abs=0.3725825548171997, max_abs=1.5625, mean_rel=0.10879392921924591, max_rel=13.581185340881348, norm_rel=0.023728037253022194, ref_abs_avg=16.046401977539062, test_abs_avg=16.077804565429688
production_forward grad[81] vs paper_forward: mean_abs=0.46906986832618713, max_abs=5.0, mean_rel=0.1429491639137268, max_rel=836.4044799804688, norm_rel=0.02343125082552433, ref_abs_avg=20.10074234008789, test_abs_avg=20.101375579833984
production_forward grad[82] vs paper_forward: mean_abs=0.4323064088821411, max_abs=3.625, mean_rel=0.2281939685344696, max_rel=1531.2498779296875, norm_rel=0.021503707394003868, ref_abs_avg=20.061359405517578, test_abs_avg=20.06161880493164
production_forward grad[83] vs paper_forward: mean_abs=0.32420259714126587, max_abs=1.5, mean_rel=0.243489608168602, max_rel=91.86126708984375, norm_rel=0.020064478740096092, ref_abs_avg=16.713485717773438, test_abs_avg=16.692058563232422
production_forward grad[84] vs paper_forward: mean_abs=0.44470036029815674, max_abs=4.75, mean_rel=0.1440318077802658, max_rel=731.9725341796875, norm_rel=0.022783121094107628, ref_abs_avg=19.61675262451172, test_abs_avg=19.618431091308594
production_forward grad[85] vs paper_forward: mean_abs=0.4008271098136902, max_abs=3.5, mean_rel=0.19514010846614838, max_rel=1226.5625, norm_rel=0.0206306129693985, ref_abs_avg=19.487581253051758, test_abs_avg=19.493196487426758
production_forward grad[86] vs paper_forward: mean_abs=0.32839691638946533, max_abs=1.375, mean_rel=0.20261134207248688, max_rel=27.812543869018555, norm_rel=0.02022990956902504, ref_abs_avg=15.989163398742676, test_abs_avg=16.015104293823242
production_forward grad[87] vs paper_forward: mean_abs=0.42105042934417725, max_abs=4.0, mean_rel=0.14031442999839783, max_rel=944.9521484375, norm_rel=0.02246115356683731, ref_abs_avg=18.86562156677246, test_abs_avg=18.866804122924805
production_forward grad[88] vs paper_forward: mean_abs=0.3820262551307678, max_abs=3.125, mean_rel=0.1963883936405182, max_rel=1187.5, norm_rel=0.02048252336680889, ref_abs_avg=18.732297897338867, test_abs_avg=18.72952651977539
production_forward grad[89] vs paper_forward: mean_abs=0.316713809967041, max_abs=1.125, mean_rel=0.06921830028295517, max_rel=2.442599058151245, norm_rel=0.02137608453631401, ref_abs_avg=15.07958984375, test_abs_avg=15.083768844604492
production_forward grad[90] vs paper_forward: mean_abs=0.40174850821495056, max_abs=3.875, mean_rel=0.1409684419631958, max_rel=1054.2421875, norm_rel=0.022062130272388458, ref_abs_avg=18.384151458740234, test_abs_avg=18.38436508178711
production_forward grad[91] vs paper_forward: mean_abs=0.3598005771636963, max_abs=3.59375, mean_rel=0.1897587776184082, max_rel=1765.6248779296875, norm_rel=0.0199285838752985, ref_abs_avg=18.209444046020508, test_abs_avg=18.211118698120117
production_forward grad[92] vs paper_forward: mean_abs=0.29330992698669434, max_abs=1.0, mean_rel=0.1690371334552765, max_rel=42.69853591918945, norm_rel=0.01997462660074234, ref_abs_avg=14.67477798461914, test_abs_avg=14.699028015136719
production_forward grad[93] vs paper_forward: mean_abs=0.37302446365356445, max_abs=4.5, mean_rel=0.12540914118289948, max_rel=796.4561157226562, norm_rel=0.02145211771130562, ref_abs_avg=17.606304168701172, test_abs_avg=17.60614776611328
production_forward grad[94] vs paper_forward: mean_abs=0.3381348252296448, max_abs=3.25, mean_rel=0.17974679172039032, max_rel=1046.875, norm_rel=0.01938444934785366, ref_abs_avg=17.543960571289062, test_abs_avg=17.541593551635742
production_forward grad[95] vs paper_forward: mean_abs=0.2782384157180786, max_abs=1.2265625, mean_rel=0.08368503302335739, max_rel=8.076253890991211, norm_rel=0.01933291181921959, ref_abs_avg=14.471559524536133, test_abs_avg=14.45144271850586
production_forward grad[96] vs paper_forward: mean_abs=0.3588622510433197, max_abs=4.0, mean_rel=0.12639814615249634, max_rel=959.7847290039062, norm_rel=0.021311288699507713, ref_abs_avg=17.133275985717773, test_abs_avg=17.132884979248047
production_forward grad[97] vs paper_forward: mean_abs=0.33246442675590515, max_abs=3.18359375, mean_rel=0.1630283147096634, max_rel=1250.0, norm_rel=0.019512254744768143, ref_abs_avg=17.319629669189453, test_abs_avg=17.32330894470215
production_forward2 vs paper_forward output: mean_abs=0.0015801729168742895, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.00833134911954403, max_abs=0.3359375, mean_rel=0.07284270226955414, max_rel=102.0990982055664, norm_rel=0.019888751208782196, ref_abs_avg=0.4511302709579468, test_abs_avg=0.4511384665966034
production_forward2 grad[1] vs paper_forward: mean_abs=6.971301555633545, max_abs=48.0, mean_rel=0.1412653625011444, max_rel=162.60482788085938, norm_rel=0.019786003977060318, ref_abs_avg=308.0956726074219, test_abs_avg=308.17974853515625
production_forward2 grad[2] vs paper_forward: mean_abs=1.2595691680908203, max_abs=4.171875, mean_rel=0.10270194709300995, max_rel=3.3499412536621094, norm_rel=0.0230999868363142, ref_abs_avg=53.860626220703125, test_abs_avg=53.84275817871094
production_forward2 grad[3] vs paper_forward: mean_abs=1.5711992979049683, max_abs=11.5, mean_rel=0.16865158081054688, max_rel=1647.4505615234375, norm_rel=0.02470334805548191, ref_abs_avg=63.968955993652344, test_abs_avg=63.97236633300781
production_forward2 grad[4] vs paper_forward: mean_abs=1.4549511671066284, max_abs=9.5, mean_rel=0.4087912440299988, max_rel=4125.0, norm_rel=0.02320156805217266, ref_abs_avg=62.93260192871094, test_abs_avg=62.936485290527344
production_forward2 grad[5] vs paper_forward: mean_abs=1.0633376836776733, max_abs=4.0, mean_rel=0.0812385231256485, max_rel=2.8653738498687744, norm_rel=0.022042838856577873, ref_abs_avg=48.237098693847656, test_abs_avg=48.25542449951172
production_forward2 grad[6] vs paper_forward: mean_abs=1.395132064819336, max_abs=9.125, mean_rel=0.17287078499794006, max_rel=3267.713134765625, norm_rel=0.024486392736434937, ref_abs_avg=57.3785285949707, test_abs_avg=57.38132858276367
production_forward2 grad[7] vs paper_forward: mean_abs=1.2850887775421143, max_abs=8.0, mean_rel=0.31623566150665283, max_rel=3812.499755859375, norm_rel=0.02275383286178112, ref_abs_avg=56.85024642944336, test_abs_avg=56.86354064941406
production_forward2 grad[8] vs paper_forward: mean_abs=0.9964500665664673, max_abs=4.125, mean_rel=0.18337413668632507, max_rel=18.42840576171875, norm_rel=0.0225396528840065, ref_abs_avg=43.664306640625, test_abs_avg=43.677642822265625
production_forward2 grad[9] vs paper_forward: mean_abs=1.263685941696167, max_abs=8.5, mean_rel=0.1727929562330246, max_rel=1358.2781982421875, norm_rel=0.024205723777413368, ref_abs_avg=52.54914093017578, test_abs_avg=52.55200958251953
production_forward2 grad[10] vs paper_forward: mean_abs=1.1597559452056885, max_abs=7.5, mean_rel=0.3429591655731201, max_rel=4500.0, norm_rel=0.0224579069763422, ref_abs_avg=51.93836975097656, test_abs_avg=51.9425048828125
production_forward2 grad[11] vs paper_forward: mean_abs=0.9067230224609375, max_abs=3.5, mean_rel=0.09711585193872452, max_rel=5.932556629180908, norm_rel=0.02311374805867672, ref_abs_avg=38.89253234863281, test_abs_avg=38.89991760253906
production_forward2 grad[12] vs paper_forward: mean_abs=1.1694940328598022, max_abs=8.5, mean_rel=0.16033849120140076, max_rel=1470.18701171875, norm_rel=0.02400130406022072, ref_abs_avg=49.10293960571289, test_abs_avg=49.107078552246094
production_forward2 grad[13] vs paper_forward: mean_abs=1.0764610767364502, max_abs=7.0, mean_rel=0.3146788477897644, max_rel=4250.0, norm_rel=0.022459156811237335, ref_abs_avg=48.20381546020508, test_abs_avg=48.200927734375
production_forward2 grad[14] vs paper_forward: mean_abs=0.8269405364990234, max_abs=3.125, mean_rel=0.0849846750497818, max_rel=7.347563743591309, norm_rel=0.02259059064090252, ref_abs_avg=36.91057205200195, test_abs_avg=36.965362548828125
production_forward2 grad[15] vs paper_forward: mean_abs=1.086055040359497, max_abs=8.0, mean_rel=0.1563347727060318, max_rel=1218.923583984375, norm_rel=0.02378532849252224, ref_abs_avg=45.97185516357422, test_abs_avg=45.97493362426758
production_forward2 grad[16] vs paper_forward: mean_abs=1.0040544271469116, max_abs=6.5, mean_rel=0.24627043306827545, max_rel=2968.749755859375, norm_rel=0.022191952913999557, ref_abs_avg=45.45899200439453, test_abs_avg=45.468570709228516
production_forward2 grad[17] vs paper_forward: mean_abs=0.8117904663085938, max_abs=3.5, mean_rel=0.08467389643192291, max_rel=4.268043518066406, norm_rel=0.023467224091291428, ref_abs_avg=34.417518615722656, test_abs_avg=34.335391998291016
production_forward2 grad[18] vs paper_forward: mean_abs=1.0182745456695557, max_abs=9.0, mean_rel=0.15401259064674377, max_rel=2004.219970703125, norm_rel=0.023678546771407127, ref_abs_avg=43.235050201416016, test_abs_avg=43.236961364746094
production_forward2 grad[19] vs paper_forward: mean_abs=0.9364341497421265, max_abs=5.375, mean_rel=0.26395902037620544, max_rel=2125.0, norm_rel=0.021937023848295212, ref_abs_avg=42.94965362548828, test_abs_avg=42.95050811767578
production_forward2 grad[20] vs paper_forward: mean_abs=0.7885036468505859, max_abs=3.5, mean_rel=0.08731287717819214, max_rel=9.758219718933105, norm_rel=0.023298311978578568, ref_abs_avg=34.571624755859375, test_abs_avg=34.53955841064453
production_forward2 grad[21] vs paper_forward: mean_abs=0.9675105810165405, max_abs=10.0, mean_rel=0.1516885608434677, max_rel=1499.788818359375, norm_rel=0.023542800918221474, ref_abs_avg=41.408870697021484, test_abs_avg=41.41301727294922
production_forward2 grad[22] vs paper_forward: mean_abs=0.8874686360359192, max_abs=5.5, mean_rel=0.31410303711891174, max_rel=2999.999755859375, norm_rel=0.021760763600468636, ref_abs_avg=40.971107482910156, test_abs_avg=40.97706604003906
production_forward2 grad[23] vs paper_forward: mean_abs=0.6800411939620972, max_abs=2.625, mean_rel=0.10210034251213074, max_rel=10.098803520202637, norm_rel=0.02055731788277626, ref_abs_avg=33.58743667602539, test_abs_avg=33.588783264160156
production_forward2 grad[24] vs paper_forward: mean_abs=0.9183217287063599, max_abs=7.0, mean_rel=0.16182419657707214, max_rel=1181.707275390625, norm_rel=0.023351816460490227, ref_abs_avg=39.542457580566406, test_abs_avg=39.54326629638672
production_forward2 grad[25] vs paper_forward: mean_abs=0.841976523399353, max_abs=5.0, mean_rel=0.3077215850353241, max_rel=2562.5, norm_rel=0.021478980779647827, ref_abs_avg=39.327392578125, test_abs_avg=39.32990264892578
production_forward2 grad[26] vs paper_forward: mean_abs=0.8379359245300293, max_abs=3.15625, mean_rel=0.09625478833913803, max_rel=14.10975170135498, norm_rel=0.023862862959504128, ref_abs_avg=36.0911750793457, test_abs_avg=36.20096206665039
production_forward2 grad[27] vs paper_forward: mean_abs=1.051483392715454, max_abs=10.0, mean_rel=0.1822453737258911, max_rel=1487.962646484375, norm_rel=0.02534669078886509, ref_abs_avg=41.69692611694336, test_abs_avg=41.69641876220703
production_forward2 grad[28] vs paper_forward: mean_abs=0.9689689874649048, max_abs=6.0, mean_rel=0.2936398386955261, max_rel=3843.749755859375, norm_rel=0.023502955213189125, ref_abs_avg=41.39656448364258, test_abs_avg=41.39537048339844
production_forward2 grad[29] vs paper_forward: mean_abs=0.7714567184448242, max_abs=3.640625, mean_rel=0.18723933398723602, max_rel=50.08063507080078, norm_rel=0.024517923593521118, ref_abs_avg=31.045223236083984, test_abs_avg=31.02707862854004
production_forward2 grad[30] vs paper_forward: mean_abs=0.9697825908660889, max_abs=6.5, mean_rel=0.1647874116897583, max_rel=1512.856201171875, norm_rel=0.02543608471751213, ref_abs_avg=38.286094665527344, test_abs_avg=38.28928756713867
production_forward2 grad[31] vs paper_forward: mean_abs=0.9014438390731812, max_abs=5.625, mean_rel=0.31578540802001953, max_rel=2312.5, norm_rel=0.02403622679412365, ref_abs_avg=37.6838264465332, test_abs_avg=37.682918548583984
production_forward2 grad[32] vs paper_forward: mean_abs=0.6914219856262207, max_abs=2.5, mean_rel=0.0833844244480133, max_rel=3.909022331237793, norm_rel=0.02438829280436039, ref_abs_avg=27.616657257080078, test_abs_avg=27.66583251953125
production_forward2 grad[33] vs paper_forward: mean_abs=0.8977620601654053, max_abs=6.25, mean_rel=0.16037115454673767, max_rel=1394.75390625, norm_rel=0.025238458067178726, ref_abs_avg=35.73487854003906, test_abs_avg=35.73480987548828
production_forward2 grad[34] vs paper_forward: mean_abs=0.839061975479126, max_abs=5.4375, mean_rel=0.3012286424636841, max_rel=2749.999755859375, norm_rel=0.024038853123784065, ref_abs_avg=35.062904357910156, test_abs_avg=35.06382751464844
production_forward2 grad[35] vs paper_forward: mean_abs=0.6831631660461426, max_abs=2.875, mean_rel=0.10697352886199951, max_rel=5.039560317993164, norm_rel=0.024481236934661865, ref_abs_avg=27.525596618652344, test_abs_avg=27.515239715576172
production_forward2 grad[36] vs paper_forward: mean_abs=0.84147709608078, max_abs=6.0, mean_rel=0.16381624341011047, max_rel=735.1447143554688, norm_rel=0.025148531422019005, ref_abs_avg=33.59124755859375, test_abs_avg=33.59552001953125
production_forward2 grad[37] vs paper_forward: mean_abs=0.7887670993804932, max_abs=5.25, mean_rel=0.3205450773239136, max_rel=2375.0, norm_rel=0.023806819692254066, ref_abs_avg=33.26607894897461, test_abs_avg=33.262699127197266
production_forward2 grad[38] vs paper_forward: mean_abs=0.627495288848877, max_abs=2.25, mean_rel=0.19521328806877136, max_rel=40.429630279541016, norm_rel=0.024380626156926155, ref_abs_avg=25.3636531829834, test_abs_avg=25.413270950317383
production_forward2 grad[39] vs paper_forward: mean_abs=0.7940101027488708, max_abs=5.0, mean_rel=0.16597980260849, max_rel=1051.7913818359375, norm_rel=0.02493179589509964, ref_abs_avg=31.968656539916992, test_abs_avg=31.96921157836914
production_forward2 grad[40] vs paper_forward: mean_abs=0.7410464286804199, max_abs=5.0, mean_rel=0.2581097483634949, max_rel=2187.5, norm_rel=0.02330523356795311, ref_abs_avg=31.872711181640625, test_abs_avg=31.879688262939453
production_forward2 grad[41] vs paper_forward: mean_abs=0.6065592765808105, max_abs=2.3125, mean_rel=0.13279706239700317, max_rel=15.335221290588379, norm_rel=0.02475316822528839, ref_abs_avg=24.537538528442383, test_abs_avg=24.504852294921875
production_forward2 grad[42] vs paper_forward: mean_abs=0.7512693405151367, max_abs=5.0, mean_rel=0.16438651084899902, max_rel=1467.983154296875, norm_rel=0.024444853886961937, ref_abs_avg=30.831432342529297, test_abs_avg=30.833343505859375
production_forward2 grad[43] vs paper_forward: mean_abs=0.7007213830947876, max_abs=4.5, mean_rel=0.2803649306297302, max_rel=2218.75, norm_rel=0.02308785915374756, ref_abs_avg=30.378013610839844, test_abs_avg=30.390453338623047
production_forward2 grad[44] vs paper_forward: mean_abs=0.5747795104980469, max_abs=2.5, mean_rel=0.0711066946387291, max_rel=3.489867687225342, norm_rel=0.023804599419236183, ref_abs_avg=25.062545776367188, test_abs_avg=25.0183048248291
production_forward2 grad[45] vs paper_forward: mean_abs=0.7179675102233887, max_abs=5.0, mean_rel=0.16536206007003784, max_rel=1085.400146484375, norm_rel=0.024321552366018295, ref_abs_avg=29.626323699951172, test_abs_avg=29.629520416259766
production_forward2 grad[46] vs paper_forward: mean_abs=0.6685274243354797, max_abs=4.0, mean_rel=0.2522820234298706, max_rel=2843.749755859375, norm_rel=0.023015595972537994, ref_abs_avg=29.173206329345703, test_abs_avg=29.173248291015625
production_forward2 grad[47] vs paper_forward: mean_abs=0.5318918228149414, max_abs=2.5, mean_rel=0.13210542500019073, max_rel=8.851545333862305, norm_rel=0.023154940456151962, ref_abs_avg=23.23651885986328, test_abs_avg=23.31402587890625
production_forward2 grad[48] vs paper_forward: mean_abs=0.6851575374603271, max_abs=4.75, mean_rel=0.1590128242969513, max_rel=1476.2957763671875, norm_rel=0.024123989045619965, ref_abs_avg=28.477386474609375, test_abs_avg=28.479915618896484
production_forward2 grad[49] vs paper_forward: mean_abs=0.6364837884902954, max_abs=4.625, mean_rel=0.2438737154006958, max_rel=1874.9998779296875, norm_rel=0.02283676527440548, ref_abs_avg=27.99240493774414, test_abs_avg=27.994163513183594
production_forward2 grad[50] vs paper_forward: mean_abs=0.6106181144714355, max_abs=2.375, mean_rel=0.13307347893714905, max_rel=9.724213600158691, norm_rel=0.02559657208621502, ref_abs_avg=23.858749389648438, test_abs_avg=23.874372482299805
production_forward2 grad[51] vs paper_forward: mean_abs=0.7615518569946289, max_abs=5.5, mean_rel=0.1739758551120758, max_rel=1587.6485595703125, norm_rel=0.025923307985067368, ref_abs_avg=29.450700759887695, test_abs_avg=29.45035171508789
production_forward2 grad[52] vs paper_forward: mean_abs=0.7138957977294922, max_abs=6.125, mean_rel=0.24853627383708954, max_rel=1624.9998779296875, norm_rel=0.024380508810281754, ref_abs_avg=29.37651252746582, test_abs_avg=29.376375198364258
production_forward2 grad[53] vs paper_forward: mean_abs=0.5398209095001221, max_abs=2.0, mean_rel=0.08572444319725037, max_rel=8.585868835449219, norm_rel=0.025142036378383636, ref_abs_avg=21.564620971679688, test_abs_avg=21.559864044189453
production_forward2 grad[54] vs paper_forward: mean_abs=0.702695369720459, max_abs=5.0, mean_rel=0.1675463169813156, max_rel=1558.4862060546875, norm_rel=0.02544451877474785, ref_abs_avg=27.648632049560547, test_abs_avg=27.650774002075195
production_forward2 grad[55] vs paper_forward: mean_abs=0.6592301726341248, max_abs=4.5, mean_rel=0.2553025186061859, max_rel=2437.5, norm_rel=0.024235650897026062, ref_abs_avg=27.25778579711914, test_abs_avg=27.258846282958984
production_forward2 grad[56] vs paper_forward: mean_abs=0.4878082275390625, max_abs=2.546875, mean_rel=0.11301272362470627, max_rel=10.259688377380371, norm_rel=0.023465152829885483, ref_abs_avg=20.837825775146484, test_abs_avg=20.827106475830078
production_forward2 grad[57] vs paper_forward: mean_abs=0.6530168056488037, max_abs=5.125, mean_rel=0.15997593104839325, max_rel=1848.6441650390625, norm_rel=0.024926329031586647, ref_abs_avg=26.214710235595703, test_abs_avg=26.214622497558594
production_forward2 grad[58] vs paper_forward: mean_abs=0.6062284708023071, max_abs=4.0, mean_rel=0.21648937463760376, max_rel=1546.8748779296875, norm_rel=0.023338375613093376, ref_abs_avg=26.009925842285156, test_abs_avg=26.013341903686523
production_forward2 grad[59] vs paper_forward: mean_abs=0.4842081069946289, max_abs=1.78125, mean_rel=0.0834861472249031, max_rel=2.8302078247070312, norm_rel=0.023477653041481972, ref_abs_avg=20.542221069335938, test_abs_avg=20.555755615234375
production_forward2 grad[60] vs paper_forward: mean_abs=0.6113079786300659, max_abs=4.5, mean_rel=0.1598133146762848, max_rel=669.9639282226562, norm_rel=0.0247470922768116, ref_abs_avg=24.75127410888672, test_abs_avg=24.75185775756836
production_forward2 grad[61] vs paper_forward: mean_abs=0.57210773229599, max_abs=4.40625, mean_rel=0.21686694025993347, max_rel=1312.4998779296875, norm_rel=0.023404447361826897, ref_abs_avg=24.48411750793457, test_abs_avg=24.486244201660156
production_forward2 grad[62] vs paper_forward: mean_abs=0.44089412689208984, max_abs=1.875, mean_rel=0.100521981716156, max_rel=9.799623489379883, norm_rel=0.023006701841950417, ref_abs_avg=19.722618103027344, test_abs_avg=19.729778289794922
production_forward2 grad[63] vs paper_forward: mean_abs=0.5853639245033264, max_abs=5.0, mean_rel=0.15440265834331512, max_rel=1404.5101318359375, norm_rel=0.02418556809425354, ref_abs_avg=24.185443878173828, test_abs_avg=24.184913635253906
production_forward2 grad[64] vs paper_forward: mean_abs=0.540602445602417, max_abs=3.8125, mean_rel=0.2638394236564636, max_rel=1562.4998779296875, norm_rel=0.02287594974040985, ref_abs_avg=23.64425277709961, test_abs_avg=23.649404525756836
production_forward2 grad[65] vs paper_forward: mean_abs=0.4510650634765625, max_abs=1.5, mean_rel=0.1147349625825882, max_rel=8.027643203735352, norm_rel=0.023963473737239838, ref_abs_avg=18.49039077758789, test_abs_avg=18.4779109954834
production_forward2 grad[66] vs paper_forward: mean_abs=0.5486444234848022, max_abs=4.5, mean_rel=0.14865410327911377, max_rel=1406.5703125, norm_rel=0.02389048971235752, ref_abs_avg=22.970991134643555, test_abs_avg=22.969528198242188
production_forward2 grad[67] vs paper_forward: mean_abs=0.5060833692550659, max_abs=3.75, mean_rel=0.2225947380065918, max_rel=1578.1248779296875, norm_rel=0.022046471014618874, ref_abs_avg=23.005884170532227, test_abs_avg=23.001272201538086
production_forward2 grad[68] vs paper_forward: mean_abs=0.397610604763031, max_abs=1.625, mean_rel=0.0775567963719368, max_rel=5.720808982849121, norm_rel=0.021066997200250626, ref_abs_avg=19.38930892944336, test_abs_avg=19.343273162841797
production_forward2 grad[69] vs paper_forward: mean_abs=0.5239505767822266, max_abs=4.0, mean_rel=0.15253421664237976, max_rel=827.490234375, norm_rel=0.02334672212600708, ref_abs_avg=22.443153381347656, test_abs_avg=22.44511604309082
production_forward2 grad[70] vs paper_forward: mean_abs=0.47991225123405457, max_abs=3.5, mean_rel=0.23355567455291748, max_rel=2187.5, norm_rel=0.021464578807353973, ref_abs_avg=22.3358097076416, test_abs_avg=22.335512161254883
production_forward2 grad[71] vs paper_forward: mean_abs=0.39649486541748047, max_abs=1.78125, mean_rel=0.10342514514923096, max_rel=10.179777145385742, norm_rel=0.02257380075752735, ref_abs_avg=17.909225463867188, test_abs_avg=17.92047119140625
production_forward2 grad[72] vs paper_forward: mean_abs=0.49913012981414795, max_abs=4.0, mean_rel=0.1433752477169037, max_rel=796.03173828125, norm_rel=0.023346563801169395, ref_abs_avg=21.39801788330078, test_abs_avg=21.400115966796875
production_forward2 grad[73] vs paper_forward: mean_abs=0.46357858180999756, max_abs=3.5625, mean_rel=0.19650323688983917, max_rel=1562.4998779296875, norm_rel=0.021836115047335625, ref_abs_avg=21.26293182373047, test_abs_avg=21.262723922729492
production_forward2 grad[74] vs paper_forward: mean_abs=0.41805410385131836, max_abs=1.625, mean_rel=0.09928035736083984, max_rel=7.0690155029296875, norm_rel=0.021317480131983757, ref_abs_avg=20.01227569580078, test_abs_avg=20.00028419494629
production_forward2 grad[75] vs paper_forward: mean_abs=0.5512063503265381, max_abs=4.1875, mean_rel=0.1494000107049942, max_rel=577.7836303710938, norm_rel=0.02461468055844307, ref_abs_avg=22.40944480895996, test_abs_avg=22.411455154418945
production_forward2 grad[76] vs paper_forward: mean_abs=0.5134758949279785, max_abs=3.75, mean_rel=0.2552650570869446, max_rel=2437.5, norm_rel=0.02338479645550251, ref_abs_avg=22.01817512512207, test_abs_avg=22.01757049560547
production_forward2 grad[77] vs paper_forward: mean_abs=0.38560962677001953, max_abs=1.5, mean_rel=0.06609345227479935, max_rel=2.7463345527648926, norm_rel=0.023501306772232056, ref_abs_avg=16.69402313232422, test_abs_avg=16.66502571105957
production_forward2 grad[78] vs paper_forward: mean_abs=0.5090547204017639, max_abs=4.0, mean_rel=0.1547989696264267, max_rel=1045.8052978515625, norm_rel=0.02405383251607418, ref_abs_avg=21.19204330444336, test_abs_avg=21.192981719970703
production_forward2 grad[79] vs paper_forward: mean_abs=0.4752117991447449, max_abs=4.25, mean_rel=0.19475586712360382, max_rel=1453.1248779296875, norm_rel=0.022346636280417442, ref_abs_avg=21.307357788085938, test_abs_avg=21.31088638305664
production_forward2 grad[80] vs paper_forward: mean_abs=0.38099634647369385, max_abs=1.625, mean_rel=0.08644095063209534, max_rel=3.865645408630371, norm_rel=0.024147989228367805, ref_abs_avg=16.046401977539062, test_abs_avg=16.079063415527344
production_forward2 grad[81] vs paper_forward: mean_abs=0.47326424717903137, max_abs=4.140625, mean_rel=0.14457252621650696, max_rel=799.3977661132812, norm_rel=0.02361590601503849, ref_abs_avg=20.10074234008789, test_abs_avg=20.101341247558594
production_forward2 grad[82] vs paper_forward: mean_abs=0.4355027973651886, max_abs=3.625, mean_rel=0.231225848197937, max_rel=1531.2498779296875, norm_rel=0.021666334941983223, ref_abs_avg=20.061359405517578, test_abs_avg=20.061399459838867
production_forward2 grad[83] vs paper_forward: mean_abs=0.33263808488845825, max_abs=1.5, mean_rel=0.17381373047828674, max_rel=57.13203048706055, norm_rel=0.02040300890803337, ref_abs_avg=16.713485717773438, test_abs_avg=16.689624786376953
production_forward2 grad[84] vs paper_forward: mean_abs=0.4478401243686676, max_abs=4.75, mean_rel=0.14358225464820862, max_rel=777.489990234375, norm_rel=0.022925080731511116, ref_abs_avg=19.61675262451172, test_abs_avg=19.618183135986328
production_forward2 grad[85] vs paper_forward: mean_abs=0.40386950969696045, max_abs=3.25, mean_rel=0.19477719068527222, max_rel=1406.2498779296875, norm_rel=0.02076108194887638, ref_abs_avg=19.487581253051758, test_abs_avg=19.492525100708008
production_forward2 grad[86] vs paper_forward: mean_abs=0.3361092805862427, max_abs=1.4375, mean_rel=0.21288520097732544, max_rel=32.15769958496094, norm_rel=0.02065706066787243, ref_abs_avg=15.989163398742676, test_abs_avg=16.026683807373047
production_forward2 grad[87] vs paper_forward: mean_abs=0.42301395535469055, max_abs=4.0, mean_rel=0.1413320004940033, max_rel=879.8621826171875, norm_rel=0.02255510725080967, ref_abs_avg=18.86562156677246, test_abs_avg=18.866531372070312
production_forward2 grad[88] vs paper_forward: mean_abs=0.38395676016807556, max_abs=3.0, mean_rel=0.19896036386489868, max_rel=1156.25, norm_rel=0.02057197131216526, ref_abs_avg=18.732297897338867, test_abs_avg=18.72930908203125
production_forward2 grad[89] vs paper_forward: mean_abs=0.32111406326293945, max_abs=1.125, mean_rel=0.06942754983901978, max_rel=1.9947892427444458, norm_rel=0.021888624876737595, ref_abs_avg=15.07958984375, test_abs_avg=15.078781127929688
production_forward2 grad[90] vs paper_forward: mean_abs=0.40301382541656494, max_abs=4.125, mean_rel=0.1400572508573532, max_rel=1184.2080078125, norm_rel=0.02212313376367092, ref_abs_avg=18.384151458740234, test_abs_avg=18.384492874145508
production_forward2 grad[91] vs paper_forward: mean_abs=0.36136385798454285, max_abs=3.625, mean_rel=0.18892177939414978, max_rel=1749.9998779296875, norm_rel=0.020003845915198326, ref_abs_avg=18.209444046020508, test_abs_avg=18.211223602294922
production_forward2 grad[92] vs paper_forward: mean_abs=0.29979491233825684, max_abs=1.125, mean_rel=0.1677035242319107, max_rel=39.602928161621094, norm_rel=0.020013388246297836, ref_abs_avg=14.67477798461914, test_abs_avg=14.70892333984375
production_forward2 grad[93] vs paper_forward: mean_abs=0.37406158447265625, max_abs=4.5, mean_rel=0.12567131221294403, max_rel=885.4730224609375, norm_rel=0.021493976935744286, ref_abs_avg=17.606304168701172, test_abs_avg=17.6060791015625
production_forward2 grad[94] vs paper_forward: mean_abs=0.33891409635543823, max_abs=3.5, mean_rel=0.1831257939338684, max_rel=1070.3125, norm_rel=0.019428538158535957, ref_abs_avg=17.543960571289062, test_abs_avg=17.541263580322266
production_forward2 grad[95] vs paper_forward: mean_abs=0.2782384157180786, max_abs=1.2265625, mean_rel=0.08368503302335739, max_rel=8.076253890991211, norm_rel=0.01933291181921959, ref_abs_avg=14.471559524536133, test_abs_avg=14.45144271850586
production_forward2 grad[96] vs paper_forward: mean_abs=0.3588622510433197, max_abs=4.0, mean_rel=0.12639814615249634, max_rel=959.7847290039062, norm_rel=0.021311288699507713, ref_abs_avg=17.133275985717773, test_abs_avg=17.132884979248047
production_forward2 grad[97] vs paper_forward: mean_abs=0.33246442675590515, max_abs=3.18359375, mean_rel=0.1630283147096634, max_rel=1250.0, norm_rel=0.019512254744768143, ref_abs_avg=17.319629669189453, test_abs_avg=17.32330894470215
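The per-tensor statistics in the lines above (`mean_abs`, `max_abs`, `mean_rel`, `max_rel`, `norm_rel`, `ref_abs_avg`, `test_abs_avg`) can be reproduced with a helper along these lines. This is a sketch, not the harness's actual code: the function name `grad_stats` and the tiny-denominator guard are assumptions. Note how the relative errors are normalized element-wise by `|ref|`, which is why `max_rel` can blow up to ~1e3 on near-zero reference entries even while the whole-tensor `norm_rel` stays around 2%.

```python
import torch

def grad_stats(ref: torch.Tensor, test: torch.Tensor) -> dict:
    """Compare a test gradient against a reference gradient.

    All math is done in float32 so low-precision inputs (e.g. bf16
    gradients) don't lose accuracy in the statistics themselves.
    """
    ref32, test32 = ref.float(), test.float()
    diff = (test32 - ref32).abs()
    ref_abs = ref32.abs()
    # Guard against division by exactly-zero reference entries
    # (assumed; the real harness may handle zeros differently).
    rel = diff / ref_abs.clamp_min(torch.finfo(torch.float32).tiny)
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        "max_rel": rel.max().item(),
        # one scalar for the whole tensor: ||test - ref|| / ||ref||
        "norm_rel": (diff.norm() / ref32.norm()).item(),
        "ref_abs_avg": ref_abs.mean().item(),
        "test_abs_avg": test32.abs().mean().item(),
    }
```

With this normalization, `norm_rel` is the most robust single number to compare runs on: it is insensitive to a handful of tiny-denominator outliers that dominate `max_rel`.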
identity layers + randn queries
production_forward fwd+bwd:  113.691 ms
production_forward bwd-only: 96.058 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=10.118 GiB
production_forward peak reserved:  fwd=3.635 GiB, fwd+bwd=12.635 GiB
production_forward2 fwd+bwd:  193.870 ms
production_forward2 bwd-only: 172.735 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.260 GiB, fwd+bwd=9.010 GiB
paper_forward fwd+bwd:  385.037 ms
paper_forward bwd-only: 305.015 ms
paper_forward peak allocated: fwd=30.002 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.057 GiB, fwd+bwd=32.807 GiB
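Figures like the fwd+bwd times and the peak allocated/reserved memory above are typically collected with a timed forward+backward loop plus `torch.cuda`'s peak-memory counters. A minimal sketch follows; the helper name `bench_fwd_bwd`, the iteration counts, and the toy model in the usage note are placeholders, not the benchmark's actual configuration. On a CPU-only machine the memory figures are simply reported as zero.

```python
import time
import torch

def bench_fwd_bwd(fn, *args, iters=10, warmup=3):
    """Time a forward+backward loop; return (ms/iter, peak GiB alloc, peak GiB reserved)."""
    def step():
        fn(*args).sum().backward()

    for _ in range(warmup):
        step()
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        # Reset counters so we measure only the timed loop's peak.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        step()
    if use_cuda:
        # Kernels are async; wait for them before stopping the clock.
        torch.cuda.synchronize()
    ms = (time.perf_counter() - t0) * 1000 / iters
    gib = 1024 ** 3
    peak_alloc = torch.cuda.max_memory_allocated() / gib if use_cuda else 0.0
    peak_reserved = torch.cuda.max_memory_reserved() / gib if use_cuda else 0.0
    return ms, peak_alloc, peak_reserved
```

"Allocated" is memory held by live tensors at the peak; "reserved" is what the caching allocator has claimed from the driver, so it is always at least as large — consistent with the reserved numbers above exceeding the allocated ones.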

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.00167427072301507, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008495984598994255, max_abs=0.4375, mean_rel=0.07230208814144135, max_rel=100.38377380371094, norm_rel=0.01965581625699997, ref_abs_avg=0.4681438207626343, test_abs_avg=0.4681679606437683
production_forward grad[1] vs paper_forward: mean_abs=7.34219217300415, max_abs=52.0, mean_rel=0.12500512599945068, max_rel=81.87227630615234, norm_rel=0.02048587240278721, ref_abs_avg=322.4056091308594, test_abs_avg=322.5775146484375
production_forward grad[2] vs paper_forward: mean_abs=1.3563385009765625, max_abs=5.5, mean_rel=0.12020817399024963, max_rel=17.21140480041504, norm_rel=0.023534361273050308, ref_abs_avg=57.2935791015625, test_abs_avg=57.246002197265625
production_forward grad[3] vs paper_forward: mean_abs=1.6487653255462646, max_abs=11.0, mean_rel=0.17704443633556366, max_rel=2685.53955078125, norm_rel=0.024427147582173347, ref_abs_avg=67.95283508300781, test_abs_avg=67.95597839355469
production_forward grad[4] vs paper_forward: mean_abs=1.519158959388733, max_abs=10.0, mean_rel=0.45047813653945923, max_rel=5874.99951171875, norm_rel=0.022975966334342957, ref_abs_avg=66.53125762939453, test_abs_avg=66.54426574707031
production_forward grad[5] vs paper_forward: mean_abs=1.0568599700927734, max_abs=4.6875, mean_rel=0.08537128567695618, max_rel=2.9949560165405273, norm_rel=0.0231463760137558, ref_abs_avg=46.43482208251953, test_abs_avg=46.46527862548828
production_forward grad[6] vs paper_forward: mean_abs=1.446023941040039, max_abs=10.0, mean_rel=0.16673025488853455, max_rel=2850.4794921875, norm_rel=0.024224242195487022, ref_abs_avg=60.129356384277344, test_abs_avg=60.137630462646484
production_forward grad[7] vs paper_forward: mean_abs=1.3350818157196045, max_abs=8.0, mean_rel=0.39710286259651184, max_rel=3421.874755859375, norm_rel=0.022646760568022728, ref_abs_avg=59.17636489868164, test_abs_avg=59.185508728027344
production_forward grad[8] vs paper_forward: mean_abs=1.038320541381836, max_abs=3.5, mean_rel=0.09617407619953156, max_rel=5.757630348205566, norm_rel=0.022475676611065865, ref_abs_avg=45.700164794921875, test_abs_avg=45.76554870605469
production_forward grad[9] vs paper_forward: mean_abs=1.3033950328826904, max_abs=9.0, mean_rel=0.15793760120868683, max_rel=1524.1038818359375, norm_rel=0.023997826501727104, ref_abs_avg=54.68836212158203, test_abs_avg=54.694496154785156
production_forward grad[10] vs paper_forward: mean_abs=1.1993350982666016, max_abs=7.625, mean_rel=0.38057276606559753, max_rel=5437.49951171875, norm_rel=0.022368023172020912, ref_abs_avg=53.95429992675781, test_abs_avg=53.95518493652344
production_forward grad[11] vs paper_forward: mean_abs=0.9412527084350586, max_abs=4.5, mean_rel=0.1336839646100998, max_rel=27.02436637878418, norm_rel=0.02365770749747753, ref_abs_avg=40.45931625366211, test_abs_avg=40.42897033691406
production_forward grad[12] vs paper_forward: mean_abs=1.1935793161392212, max_abs=10.0, mean_rel=0.15301603078842163, max_rel=1996.6639404296875, norm_rel=0.023813463747501373, ref_abs_avg=50.42420196533203, test_abs_avg=50.431976318359375
production_forward grad[13] vs paper_forward: mean_abs=1.1048119068145752, max_abs=6.625, mean_rel=0.31415075063705444, max_rel=4062.499755859375, norm_rel=0.022278858348727226, ref_abs_avg=49.8963623046875, test_abs_avg=49.90303421020508
production_forward grad[14] vs paper_forward: mean_abs=0.9071612358093262, max_abs=3.0, mean_rel=0.1062684878706932, max_rel=8.802139282226562, norm_rel=0.023898085579276085, ref_abs_avg=37.27079391479492, test_abs_avg=37.259521484375
production_forward grad[15] vs paper_forward: mean_abs=1.1149592399597168, max_abs=10.0, mean_rel=0.15122678875923157, max_rel=2480.274658203125, norm_rel=0.02358618751168251, ref_abs_avg=47.632625579833984, test_abs_avg=47.63754653930664
production_forward grad[16] vs paper_forward: mean_abs=1.032855749130249, max_abs=6.5, mean_rel=0.35000067949295044, max_rel=3624.999755859375, norm_rel=0.0221839789301157, ref_abs_avg=46.79708480834961, test_abs_avg=46.801788330078125
production_forward grad[17] vs paper_forward: mean_abs=0.8159170150756836, max_abs=2.75, mean_rel=0.11182674765586853, max_rel=6.781965255737305, norm_rel=0.023134397342801094, ref_abs_avg=34.6031608581543, test_abs_avg=34.64836883544922
production_forward grad[18] vs paper_forward: mean_abs=1.0488910675048828, max_abs=7.0, mean_rel=0.16460449993610382, max_rel=1987.8072509765625, norm_rel=0.02354065142571926, ref_abs_avg=44.83739471435547, test_abs_avg=44.837890625
production_forward grad[19] vs paper_forward: mean_abs=0.9574331641197205, max_abs=5.5, mean_rel=0.3293505311012268, max_rel=3656.249755859375, norm_rel=0.02164224348962307, ref_abs_avg=44.48262023925781, test_abs_avg=44.48533630371094
production_forward grad[20] vs paper_forward: mean_abs=0.7559447288513184, max_abs=2.75, mean_rel=0.3626256585121155, max_rel=110.44869995117188, norm_rel=0.02111574076116085, ref_abs_avg=34.76023864746094, test_abs_avg=34.7723388671875
production_forward grad[21] vs paper_forward: mean_abs=0.9960402250289917, max_abs=6.5, mean_rel=0.1594613790512085, max_rel=1900.168701171875, norm_rel=0.023313116282224655, ref_abs_avg=42.955169677734375, test_abs_avg=42.95820617675781
production_forward grad[22] vs paper_forward: mean_abs=0.9157662391662598, max_abs=5.78125, mean_rel=0.27085745334625244, max_rel=2421.875, norm_rel=0.02163202501833439, ref_abs_avg=42.614173889160156, test_abs_avg=42.615234375
production_forward grad[23] vs paper_forward: mean_abs=0.7145577073097229, max_abs=2.5, mean_rel=0.20535695552825928, max_rel=34.790592193603516, norm_rel=0.02145494893193245, ref_abs_avg=33.73601531982422, test_abs_avg=33.726287841796875
production_forward grad[24] vs paper_forward: mean_abs=0.9544306397438049, max_abs=7.0, mean_rel=0.15591192245483398, max_rel=3317.6328125, norm_rel=0.023247694596648216, ref_abs_avg=41.29481887817383, test_abs_avg=41.297359466552734
production_forward grad[25] vs paper_forward: mean_abs=0.8745827078819275, max_abs=5.3125, mean_rel=0.2799738645553589, max_rel=3281.249755859375, norm_rel=0.02154836244881153, ref_abs_avg=40.82671356201172, test_abs_avg=40.825294494628906
production_forward grad[26] vs paper_forward: mean_abs=0.8269872665405273, max_abs=3.5, mean_rel=0.1713326871395111, max_rel=26.938852310180664, norm_rel=0.024088380858302116, ref_abs_avg=34.329612731933594, test_abs_avg=34.27132034301758
production_forward grad[27] vs paper_forward: mean_abs=1.0860615968704224, max_abs=7.0, mean_rel=0.18316367268562317, max_rel=1708.3538818359375, norm_rel=0.02504732459783554, ref_abs_avg=43.60613250732422, test_abs_avg=43.611122131347656
production_forward grad[28] vs paper_forward: mean_abs=1.0139987468719482, max_abs=6.25, mean_rel=0.34396278858184814, max_rel=3124.999755859375, norm_rel=0.023662803694605827, ref_abs_avg=43.02670669555664, test_abs_avg=43.0348014831543
production_forward grad[29] vs paper_forward: mean_abs=0.799957275390625, max_abs=3.0, mean_rel=0.08610229194164276, max_rel=10.864806175231934, norm_rel=0.02343604899942875, ref_abs_avg=34.410552978515625, test_abs_avg=34.390907287597656
production_forward grad[30] vs paper_forward: mean_abs=1.0265401601791382, max_abs=8.0, mean_rel=0.16525694727897644, max_rel=1669.7071533203125, norm_rel=0.02551463060081005, ref_abs_avg=40.423500061035156, test_abs_avg=40.42768859863281
production_forward grad[31] vs paper_forward: mean_abs=0.968579888343811, max_abs=5.75, mean_rel=0.3011246919631958, max_rel=2687.499755859375, norm_rel=0.024156389757990837, ref_abs_avg=40.31534957885742, test_abs_avg=40.32232666015625
production_forward grad[32] vs paper_forward: mean_abs=0.7442073822021484, max_abs=3.0, mean_rel=0.09471534192562103, max_rel=10.739518165588379, norm_rel=0.021510466933250427, ref_abs_avg=34.49700927734375, test_abs_avg=34.51905059814453
production_forward grad[33] vs paper_forward: mean_abs=0.9592751860618591, max_abs=6.5, mean_rel=0.16208183765411377, max_rel=1521.922119140625, norm_rel=0.025316815823316574, ref_abs_avg=38.0476188659668, test_abs_avg=38.048667907714844
production_forward grad[34] vs paper_forward: mean_abs=0.8946535587310791, max_abs=5.59375, mean_rel=0.3096373677253723, max_rel=2593.749755859375, norm_rel=0.023940250277519226, ref_abs_avg=37.50889587402344, test_abs_avg=37.50871276855469
production_forward grad[35] vs paper_forward: mean_abs=0.6824268102645874, max_abs=2.75, mean_rel=0.24830570816993713, max_rel=64.16674041748047, norm_rel=0.02315434440970421, ref_abs_avg=28.72698211669922, test_abs_avg=28.78818130493164
production_forward grad[36] vs paper_forward: mean_abs=0.894045352935791, max_abs=6.0, mean_rel=0.17399978637695312, max_rel=879.2643432617188, norm_rel=0.02522609382867813, ref_abs_avg=35.58241271972656, test_abs_avg=35.584205627441406
production_forward grad[37] vs paper_forward: mean_abs=0.8328204154968262, max_abs=5.4375, mean_rel=0.28943413496017456, max_rel=3531.249755859375, norm_rel=0.023543886840343475, ref_abs_avg=35.509212493896484, test_abs_avg=35.50791549682617
production_forward grad[38] vs paper_forward: mean_abs=0.6581323146820068, max_abs=3.08203125, mean_rel=0.19342495501041412, max_rel=52.3952522277832, norm_rel=0.023608066141605377, ref_abs_avg=28.473552703857422, test_abs_avg=28.36518669128418
production_forward grad[39] vs paper_forward: mean_abs=0.843734860420227, max_abs=6.0, mean_rel=0.1675419807434082, max_rel=1634.6832275390625, norm_rel=0.024892322719097137, ref_abs_avg=34.033626556396484, test_abs_avg=34.03528594970703
production_forward grad[40] vs paper_forward: mean_abs=0.7806597948074341, max_abs=5.0, mean_rel=0.24933093786239624, max_rel=2250.0, norm_rel=0.02322142943739891, ref_abs_avg=33.714595794677734, test_abs_avg=33.71295928955078
production_forward grad[41] vs paper_forward: mean_abs=0.6088762283325195, max_abs=2.0625, mean_rel=0.09543748199939728, max_rel=10.425860404968262, norm_rel=0.022242816165089607, ref_abs_avg=27.527416229248047, test_abs_avg=27.523536682128906
production_forward grad[42] vs paper_forward: mean_abs=0.7966699004173279, max_abs=5.5, mean_rel=0.16176116466522217, max_rel=919.811767578125, norm_rel=0.02456839196383953, ref_abs_avg=32.541664123535156, test_abs_avg=32.544288635253906
production_forward grad[43] vs paper_forward: mean_abs=0.7441866993904114, max_abs=4.5, mean_rel=0.26247674226760864, max_rel=1999.9998779296875, norm_rel=0.023177798837423325, ref_abs_avg=32.172691345214844, test_abs_avg=32.1744270324707
production_forward grad[44] vs paper_forward: mean_abs=0.611363410949707, max_abs=2.5, mean_rel=0.1109541654586792, max_rel=10.179722785949707, norm_rel=0.02381048910319805, ref_abs_avg=26.296485900878906, test_abs_avg=26.28832244873047
production_forward grad[45] vs paper_forward: mean_abs=0.7614661455154419, max_abs=6.0, mean_rel=0.1604669988155365, max_rel=1736.1876220703125, norm_rel=0.02447644993662834, ref_abs_avg=31.20000648498535, test_abs_avg=31.19872283935547
production_forward grad[46] vs paper_forward: mean_abs=0.7080999612808228, max_abs=4.8125, mean_rel=0.2460121214389801, max_rel=1499.9998779296875, norm_rel=0.022839773446321487, ref_abs_avg=31.045419692993164, test_abs_avg=31.047222137451172
production_forward grad[47] vs paper_forward: mean_abs=0.5640945434570312, max_abs=2.25, mean_rel=0.07940587401390076, max_rel=4.078694820404053, norm_rel=0.023413773626089096, ref_abs_avg=24.876232147216797, test_abs_avg=24.839797973632812
production_forward grad[48] vs paper_forward: mean_abs=0.7230427265167236, max_abs=5.5, mean_rel=0.15529608726501465, max_rel=1128.181640625, norm_rel=0.02415657415986061, ref_abs_avg=29.981693267822266, test_abs_avg=29.982725143432617
production_forward grad[49] vs paper_forward: mean_abs=0.6713954210281372, max_abs=4.4375, mean_rel=0.2560778260231018, max_rel=2562.5, norm_rel=0.022593243047595024, ref_abs_avg=29.80673599243164, test_abs_avg=29.806650161743164
production_forward grad[50] vs paper_forward: mean_abs=0.6185836791992188, max_abs=2.5, mean_rel=0.11435335874557495, max_rel=15.358036994934082, norm_rel=0.022741470485925674, ref_abs_avg=27.287424087524414, test_abs_avg=27.325273513793945
production_forward grad[51] vs paper_forward: mean_abs=0.7861454486846924, max_abs=5.75, mean_rel=0.1639748215675354, max_rel=1184.8995361328125, norm_rel=0.025448348373174667, ref_abs_avg=30.98583984375, test_abs_avg=30.988483428955078
production_forward grad[52] vs paper_forward: mean_abs=0.7379940748214722, max_abs=4.5, mean_rel=0.27845299243927, max_rel=2500.0, norm_rel=0.024221932515501976, ref_abs_avg=30.55096435546875, test_abs_avg=30.550304412841797
production_forward grad[53] vs paper_forward: mean_abs=0.5944503545761108, max_abs=2.25, mean_rel=0.18747682869434357, max_rel=48.98942947387695, norm_rel=0.02457975596189499, ref_abs_avg=23.666423797607422, test_abs_avg=23.64743423461914
production_forward grad[54] vs paper_forward: mean_abs=0.7304859161376953, max_abs=5.5, mean_rel=0.16159963607788086, max_rel=771.8173828125, norm_rel=0.02512533962726593, ref_abs_avg=29.183340072631836, test_abs_avg=29.183944702148438
production_forward grad[55] vs paper_forward: mean_abs=0.6832846999168396, max_abs=4.5, mean_rel=0.21084818243980408, max_rel=1624.9998779296875, norm_rel=0.0237293504178524, ref_abs_avg=28.82940673828125, test_abs_avg=28.840160369873047
production_forward grad[56] vs paper_forward: mean_abs=0.527101993560791, max_abs=2.3125, mean_rel=0.08705764263868332, max_rel=13.071003913879395, norm_rel=0.022649304941296577, ref_abs_avg=23.839563369750977, test_abs_avg=23.848173141479492
production_forward grad[57] vs paper_forward: mean_abs=0.6792005896568298, max_abs=5.25, mean_rel=0.16270366311073303, max_rel=2182.296875, norm_rel=0.024574028328061104, ref_abs_avg=27.684616088867188, test_abs_avg=27.686548233032227
production_forward grad[58] vs paper_forward: mean_abs=0.6298417448997498, max_abs=4.25, mean_rel=0.24614755809307098, max_rel=1781.2498779296875, norm_rel=0.023330505937337875, ref_abs_avg=27.03676414489746, test_abs_avg=27.042177200317383
production_forward grad[59] vs paper_forward: mean_abs=0.5185203552246094, max_abs=2.35546875, mean_rel=0.098924919962883, max_rel=8.012650489807129, norm_rel=0.02450886368751526, ref_abs_avg=22.01021957397461, test_abs_avg=22.009490966796875
production_forward grad[60] vs paper_forward: mean_abs=0.6417582631111145, max_abs=5.0, mean_rel=0.16463440656661987, max_rel=1241.511474609375, norm_rel=0.02416864223778248, ref_abs_avg=26.580398559570312, test_abs_avg=26.579458236694336
production_forward grad[61] vs paper_forward: mean_abs=0.5970516204833984, max_abs=5.0, mean_rel=0.2557668387889862, max_rel=1656.2498779296875, norm_rel=0.022824693471193314, ref_abs_avg=26.223793029785156, test_abs_avg=26.227693557739258
production_forward grad[62] vs paper_forward: mean_abs=0.4934356212615967, max_abs=2.0, mean_rel=0.13569313287734985, max_rel=17.477895736694336, norm_rel=0.021875636652112007, ref_abs_avg=22.684589385986328, test_abs_avg=22.69605255126953
production_forward grad[63] vs paper_forward: mean_abs=0.6063588857650757, max_abs=5.0, mean_rel=0.15300700068473816, max_rel=838.7886962890625, norm_rel=0.023983368650078773, ref_abs_avg=25.347156524658203, test_abs_avg=25.347702026367188
production_forward grad[64] vs paper_forward: mean_abs=0.5633997321128845, max_abs=3.75, mean_rel=0.2355705201625824, max_rel=1593.7498779296875, norm_rel=0.02197163924574852, ref_abs_avg=25.552875518798828, test_abs_avg=25.544376373291016
production_forward grad[65] vs paper_forward: mean_abs=0.47482824325561523, max_abs=1.625, mean_rel=0.13753017783164978, max_rel=12.344863891601562, norm_rel=0.023573148995637894, ref_abs_avg=19.968151092529297, test_abs_avg=19.96009063720703
production_forward grad[66] vs paper_forward: mean_abs=0.5767370462417603, max_abs=6.0, mean_rel=0.15761056542396545, max_rel=1077.62841796875, norm_rel=0.023579556494951248, ref_abs_avg=24.478294372558594, test_abs_avg=24.47699737548828
production_forward grad[67] vs paper_forward: mean_abs=0.5274789333343506, max_abs=4.0, mean_rel=0.23316895961761475, max_rel=1874.9998779296875, norm_rel=0.021784480661153793, ref_abs_avg=24.2264461517334, test_abs_avg=24.228759765625
production_forward grad[68] vs paper_forward: mean_abs=0.44173765182495117, max_abs=2.0, mean_rel=0.10182711482048035, max_rel=13.788098335266113, norm_rel=0.022714462131261826, ref_abs_avg=19.90361213684082, test_abs_avg=19.89154815673828
production_forward grad[69] vs paper_forward: mean_abs=0.5470362901687622, max_abs=4.75, mean_rel=0.15443480014801025, max_rel=1215.2408447265625, norm_rel=0.02313379943370819, ref_abs_avg=23.664127349853516, test_abs_avg=23.66676139831543
production_forward grad[70] vs paper_forward: mean_abs=0.5047788023948669, max_abs=4.0, mean_rel=0.22316738963127136, max_rel=1624.9998779296875, norm_rel=0.021500518545508385, ref_abs_avg=23.52410316467285, test_abs_avg=23.52166748046875
production_forward grad[71] vs paper_forward: mean_abs=0.40282630920410156, max_abs=1.5, mean_rel=0.07025282829999924, max_rel=7.756196022033691, norm_rel=0.019868802279233932, ref_abs_avg=20.52445411682129, test_abs_avg=20.52219009399414
production_forward grad[72] vs paper_forward: mean_abs=0.5290144085884094, max_abs=4.375, mean_rel=0.1592932790517807, max_rel=1060.260009765625, norm_rel=0.022823568433523178, ref_abs_avg=23.200336456298828, test_abs_avg=23.200510025024414
production_forward grad[73] vs paper_forward: mean_abs=0.4846799075603485, max_abs=3.5, mean_rel=0.20936152338981628, max_rel=1624.9998779296875, norm_rel=0.02159937284886837, ref_abs_avg=22.528831481933594, test_abs_avg=22.532978057861328
production_forward grad[74] vs paper_forward: mean_abs=0.46051812171936035, max_abs=2.0625, mean_rel=0.09119769185781479, max_rel=5.9307684898376465, norm_rel=0.023060910403728485, ref_abs_avg=20.179931640625, test_abs_avg=20.255207061767578
production_forward grad[75] vs paper_forward: mean_abs=0.5735039114952087, max_abs=5.0, mean_rel=0.16555169224739075, max_rel=1558.2039794921875, norm_rel=0.02415209263563156, ref_abs_avg=23.788494110107422, test_abs_avg=23.78882598876953
production_forward grad[76] vs paper_forward: mean_abs=0.5208908319473267, max_abs=3.8125, mean_rel=0.24813181161880493, max_rel=1999.9998779296875, norm_rel=0.02235662192106247, ref_abs_avg=23.284582138061523, test_abs_avg=23.286840438842773
production_forward grad[77] vs paper_forward: mean_abs=0.4249448776245117, max_abs=1.5, mean_rel=0.12864281237125397, max_rel=26.404571533203125, norm_rel=0.021678108721971512, ref_abs_avg=19.565731048583984, test_abs_avg=19.591793060302734
production_forward grad[78] vs paper_forward: mean_abs=0.5261340141296387, max_abs=4.5, mean_rel=0.1457940638065338, max_rel=1431.318359375, norm_rel=0.023614004254341125, ref_abs_avg=22.33417320251465, test_abs_avg=22.335628509521484
production_forward grad[79] vs paper_forward: mean_abs=0.4936220645904541, max_abs=3.875, mean_rel=0.23125915229320526, max_rel=1874.9998779296875, norm_rel=0.02243432216346264, ref_abs_avg=22.11951446533203, test_abs_avg=22.120466232299805
production_forward grad[80] vs paper_forward: mean_abs=0.37738609313964844, max_abs=1.5625, mean_rel=0.06191369146108627, max_rel=2.2241342067718506, norm_rel=0.020982297137379646, ref_abs_avg=18.191280364990234, test_abs_avg=18.199216842651367
production_forward grad[81] vs paper_forward: mean_abs=0.48566171526908875, max_abs=4.5, mean_rel=0.1442442238330841, max_rel=605.303955078125, norm_rel=0.022957751527428627, ref_abs_avg=21.218608856201172, test_abs_avg=21.219959259033203
production_forward grad[82] vs paper_forward: mean_abs=0.45185142755508423, max_abs=4.5, mean_rel=0.1987299919128418, max_rel=1125.0, norm_rel=0.021683523431420326, ref_abs_avg=20.940235137939453, test_abs_avg=20.948505401611328
production_forward grad[83] vs paper_forward: mean_abs=0.3700141906738281, max_abs=1.5, mean_rel=0.08640418946743011, max_rel=12.580554008483887, norm_rel=0.021580174565315247, ref_abs_avg=17.694616317749023, test_abs_avg=17.700958251953125
production_forward grad[84] vs paper_forward: mean_abs=0.4566783308982849, max_abs=4.5, mean_rel=0.13952568173408508, max_rel=701.0369873046875, norm_rel=0.02258659526705742, ref_abs_avg=20.305404663085938, test_abs_avg=20.305828094482422
production_forward grad[85] vs paper_forward: mean_abs=0.4228667616844177, max_abs=4.0, mean_rel=0.21276840567588806, max_rel=1499.9998779296875, norm_rel=0.0211531613022089, ref_abs_avg=20.075254440307617, test_abs_avg=20.079696655273438
production_forward grad[86] vs paper_forward: mean_abs=0.3311808109283447, max_abs=1.25, mean_rel=0.07116561383008957, max_rel=4.1626877784729, norm_rel=0.019174661487340927, ref_abs_avg=17.580787658691406, test_abs_avg=17.5505428314209
production_forward grad[87] vs paper_forward: mean_abs=0.4351266026496887, max_abs=4.0, mean_rel=0.13017573952674866, max_rel=1167.4150390625, norm_rel=0.022283997386693954, ref_abs_avg=19.624736785888672, test_abs_avg=19.625518798828125
production_forward grad[88] vs paper_forward: mean_abs=0.3999180197715759, max_abs=3.75, mean_rel=0.18957966566085815, max_rel=2125.0, norm_rel=0.020330967381596565, ref_abs_avg=19.857440948486328, test_abs_avg=19.85664176940918
production_forward grad[89] vs paper_forward: mean_abs=0.32820725440979004, max_abs=1.28125, mean_rel=0.06000109016895294, max_rel=1.932955265045166, norm_rel=0.020371047779917717, ref_abs_avg=16.451780319213867, test_abs_avg=16.45427703857422
production_forward grad[90] vs paper_forward: mean_abs=0.41800397634506226, max_abs=4.375, mean_rel=0.13272210955619812, max_rel=749.6936645507812, norm_rel=0.02174953930079937, ref_abs_avg=19.413997650146484, test_abs_avg=19.41402244567871
production_forward grad[91] vs paper_forward: mean_abs=0.37571340799331665, max_abs=3.75, mean_rel=0.1931062489748001, max_rel=1312.4998779296875, norm_rel=0.019948503002524376, ref_abs_avg=18.955860137939453, test_abs_avg=18.965051651000977
production_forward grad[92] vs paper_forward: mean_abs=0.29691800475120544, max_abs=1.375, mean_rel=0.17262600362300873, max_rel=15.580656051635742, norm_rel=0.018498921766877174, ref_abs_avg=15.522010803222656, test_abs_avg=15.521303176879883
production_forward grad[93] vs paper_forward: mean_abs=0.3882138729095459, max_abs=4.0, mean_rel=0.12539774179458618, max_rel=728.1190185546875, norm_rel=0.021220838651061058, ref_abs_avg=18.517122268676758, test_abs_avg=18.5183162689209
production_forward grad[94] vs paper_forward: mean_abs=0.35832643508911133, max_abs=3.75, mean_rel=0.17102500796318054, max_rel=1812.4998779296875, norm_rel=0.019317420199513435, ref_abs_avg=18.791885375976562, test_abs_avg=18.79835319519043
production_forward grad[95] vs paper_forward: mean_abs=0.28693437576293945, max_abs=1.34375, mean_rel=0.06455793231725693, max_rel=1.854966402053833, norm_rel=0.019872218370437622, ref_abs_avg=14.658702850341797, test_abs_avg=14.697797775268555
production_forward grad[96] vs paper_forward: mean_abs=0.3684746026992798, max_abs=4.5, mean_rel=0.12000785768032074, max_rel=816.7664184570312, norm_rel=0.020907564088702202, ref_abs_avg=17.957286834716797, test_abs_avg=17.957229614257812
production_forward grad[97] vs paper_forward: mean_abs=0.3513179421424866, max_abs=3.59375, mean_rel=0.18224453926086426, max_rel=1312.4998779296875, norm_rel=0.01942899078130722, ref_abs_avg=18.288753509521484, test_abs_avg=18.290855407714844
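For reference, metrics of the shape reported in these lines can be reproduced with a small helper like the one below. This is a sketch, not the script that produced this log: the exact zero-division guard and reduction order used by the original comparison code are assumptions, and `compare` is a hypothetical name.

```python
import torch

def compare(ref: torch.Tensor, test: torch.Tensor) -> dict:
    """Summarize elementwise differences between a reference and a test tensor.

    Mirrors the fields in the log lines: absolute error stats, relative error
    stats (relative to |ref|), a whole-tensor relative error in the Frobenius
    norm, and the mean absolute magnitude of each tensor.
    """
    diff = (test.float() - ref.float()).abs()
    denom = ref.float().abs()
    # Guard against division by zero; the original script's handling of
    # zero reference entries is unknown, so this clamp is an assumption.
    rel = diff / denom.clamp_min(1e-12)
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        "max_rel": rel.max().item(),
        # Relative error of the tensor as a whole: ||test - ref|| / ||ref||.
        "norm_rel": (diff.norm() / denom.norm().clamp_min(1e-12)).item(),
        "ref_abs_avg": denom.mean().item(),
        "test_abs_avg": test.float().abs().mean().item(),
    }
```

Note the pattern visible throughout the log: `max_rel` can be enormous (thousands) on entries where the reference gradient is near zero, while `norm_rel` stays around 0.02, which is why the norm-level metric is the more meaningful parity signal here.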
production_forward2 vs paper_forward output: mean_abs=0.00167427072301507, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.00883730873465538, max_abs=0.421875, mean_rel=0.07488694041967392, max_rel=101.8014907836914, norm_rel=0.020323682576417923, ref_abs_avg=0.4681438207626343, test_abs_avg=0.46815410256385803
production_forward2 grad[1] vs paper_forward: mean_abs=7.446203708648682, max_abs=52.0, mean_rel=0.12971118092536926, max_rel=100.8126449584961, norm_rel=0.02075253613293171, ref_abs_avg=322.4056091308594, test_abs_avg=322.51007080078125
production_forward2 grad[2] vs paper_forward: mean_abs=1.3863639831542969, max_abs=5.5, mean_rel=0.14223706722259521, max_rel=29.612886428833008, norm_rel=0.024044230580329895, ref_abs_avg=57.2935791015625, test_abs_avg=57.2308235168457
production_forward2 grad[3] vs paper_forward: mean_abs=1.6986663341522217, max_abs=15.0, mean_rel=0.18520832061767578, max_rel=2627.721923828125, norm_rel=0.025156548246741295, ref_abs_avg=67.95283508300781, test_abs_avg=67.95195007324219
production_forward2 grad[4] vs paper_forward: mean_abs=1.5712807178497314, max_abs=10.0, mean_rel=0.4465593695640564, max_rel=5249.99951171875, norm_rel=0.023765048012137413, ref_abs_avg=66.53125762939453, test_abs_avg=66.5399169921875
production_forward2 grad[5] vs paper_forward: mean_abs=1.0887904167175293, max_abs=5.3125, mean_rel=0.09159412980079651, max_rel=3.993274688720703, norm_rel=0.023978205397725105, ref_abs_avg=46.43482208251953, test_abs_avg=46.449981689453125
production_forward2 grad[6] vs paper_forward: mean_abs=1.4879729747772217, max_abs=10.0, mean_rel=0.17586107552051544, max_rel=2416.628173828125, norm_rel=0.024918070062994957, ref_abs_avg=60.129356384277344, test_abs_avg=60.133121490478516
production_forward2 grad[7] vs paper_forward: mean_abs=1.3794021606445312, max_abs=9.5, mean_rel=0.3858422040939331, max_rel=3546.874755859375, norm_rel=0.02341579645872116, ref_abs_avg=59.17636489868164, test_abs_avg=59.18207550048828
production_forward2 grad[8] vs paper_forward: mean_abs=1.0384788513183594, max_abs=4.0546875, mean_rel=0.09483065456151962, max_rel=6.567413330078125, norm_rel=0.023082010447978973, ref_abs_avg=45.700164794921875, test_abs_avg=45.7929573059082
production_forward2 grad[9] vs paper_forward: mean_abs=1.3415367603302002, max_abs=9.0, mean_rel=0.16759999096393585, max_rel=1237.8822021484375, norm_rel=0.024662818759679794, ref_abs_avg=54.68836212158203, test_abs_avg=54.69226837158203
production_forward2 grad[10] vs paper_forward: mean_abs=1.237959384918213, max_abs=8.0, mean_rel=0.38185638189315796, max_rel=4312.5, norm_rel=0.023114461451768875, ref_abs_avg=53.95429992675781, test_abs_avg=53.956199645996094
production_forward2 grad[11] vs paper_forward: mean_abs=0.9667387008666992, max_abs=4.5, mean_rel=0.1396588236093521, max_rel=28.18470001220703, norm_rel=0.023936277255415916, ref_abs_avg=40.45931625366211, test_abs_avg=40.409942626953125
production_forward2 grad[12] vs paper_forward: mean_abs=1.2248090505599976, max_abs=8.0, mean_rel=0.16167369484901428, max_rel=1657.574462890625, norm_rel=0.024422846734523773, ref_abs_avg=50.42420196533203, test_abs_avg=50.42835998535156
production_forward2 grad[13] vs paper_forward: mean_abs=1.1402904987335205, max_abs=7.5, mean_rel=0.32322293519973755, max_rel=4062.499755859375, norm_rel=0.022994421422481537, ref_abs_avg=49.8963623046875, test_abs_avg=49.89755630493164
production_forward2 grad[14] vs paper_forward: mean_abs=0.9473433494567871, max_abs=3.5, mean_rel=0.12172534316778183, max_rel=13.859455108642578, norm_rel=0.024955468252301216, ref_abs_avg=37.27079391479492, test_abs_avg=37.237335205078125
production_forward2 grad[15] vs paper_forward: mean_abs=1.1427392959594727, max_abs=10.0, mean_rel=0.15363925695419312, max_rel=2178.789794921875, norm_rel=0.024157607927918434, ref_abs_avg=47.632625579833984, test_abs_avg=47.63533020019531
production_forward2 grad[16] vs paper_forward: mean_abs=1.0569086074829102, max_abs=6.5, mean_rel=0.37349221110343933, max_rel=3874.999755859375, norm_rel=0.02271581068634987, ref_abs_avg=46.79708480834961, test_abs_avg=46.79874801635742
production_forward2 grad[17] vs paper_forward: mean_abs=0.8392276763916016, max_abs=3.75, mean_rel=0.1200798749923706, max_rel=16.98206901550293, norm_rel=0.023919550701975822, ref_abs_avg=34.6031608581543, test_abs_avg=34.618133544921875
production_forward2 grad[18] vs paper_forward: mean_abs=1.0742231607437134, max_abs=7.0, mean_rel=0.16972580552101135, max_rel=1494.30322265625, norm_rel=0.024094358086586, ref_abs_avg=44.83739471435547, test_abs_avg=44.83820343017578
production_forward2 grad[19] vs paper_forward: mean_abs=0.9857242107391357, max_abs=6.0, mean_rel=0.3338693082332611, max_rel=3562.499755859375, norm_rel=0.022265376523137093, ref_abs_avg=44.48262023925781, test_abs_avg=44.482261657714844
production_forward2 grad[20] vs paper_forward: mean_abs=0.7852921485900879, max_abs=2.625, mean_rel=0.4384108781814575, max_rel=143.9312744140625, norm_rel=0.021803049370646477, ref_abs_avg=34.76023864746094, test_abs_avg=34.74544143676758
production_forward2 grad[21] vs paper_forward: mean_abs=1.0172615051269531, max_abs=7.0, mean_rel=0.16155150532722473, max_rel=1528.8414306640625, norm_rel=0.023800034075975418, ref_abs_avg=42.955169677734375, test_abs_avg=42.95719909667969
production_forward2 grad[22] vs paper_forward: mean_abs=0.9369173049926758, max_abs=6.3125, mean_rel=0.2836134433746338, max_rel=2375.0, norm_rel=0.022121110931038857, ref_abs_avg=42.614173889160156, test_abs_avg=42.615936279296875
production_forward2 grad[23] vs paper_forward: mean_abs=0.7216987013816833, max_abs=3.0625, mean_rel=0.3514109253883362, max_rel=77.10189056396484, norm_rel=0.021781807765364647, ref_abs_avg=33.73601531982422, test_abs_avg=33.744224548339844
production_forward2 grad[24] vs paper_forward: mean_abs=0.9737893342971802, max_abs=7.5, mean_rel=0.1597708761692047, max_rel=3880.7216796875, norm_rel=0.023700542747974396, ref_abs_avg=41.29481887817383, test_abs_avg=41.29820251464844
production_forward2 grad[25] vs paper_forward: mean_abs=0.8939109444618225, max_abs=5.5, mean_rel=0.28942790627479553, max_rel=3156.249755859375, norm_rel=0.02202051505446434, ref_abs_avg=40.82671356201172, test_abs_avg=40.825164794921875
production_forward2 grad[26] vs paper_forward: mean_abs=0.8537626266479492, max_abs=3.75, mean_rel=0.21264804899692535, max_rel=44.409820556640625, norm_rel=0.02485327050089836, ref_abs_avg=34.329612731933594, test_abs_avg=34.294334411621094
production_forward2 grad[27] vs paper_forward: mean_abs=1.108325481414795, max_abs=8.0, mean_rel=0.18861746788024902, max_rel=1557.4989013671875, norm_rel=0.025555798783898354, ref_abs_avg=43.60613250732422, test_abs_avg=43.609130859375
production_forward2 grad[28] vs paper_forward: mean_abs=1.0367085933685303, max_abs=6.5, mean_rel=0.34181398153305054, max_rel=2749.999755859375, norm_rel=0.02418556623160839, ref_abs_avg=43.02670669555664, test_abs_avg=43.03211975097656
production_forward2 grad[29] vs paper_forward: mean_abs=0.8321685791015625, max_abs=3.0, mean_rel=0.08983084559440613, max_rel=12.324149131774902, norm_rel=0.024279586970806122, ref_abs_avg=34.410552978515625, test_abs_avg=34.39881896972656
production_forward2 grad[30] vs paper_forward: mean_abs=1.0462708473205566, max_abs=8.0, mean_rel=0.17138171195983887, max_rel=1086.4266357421875, norm_rel=0.025994881987571716, ref_abs_avg=40.423500061035156, test_abs_avg=40.42729949951172
production_forward2 grad[31] vs paper_forward: mean_abs=0.9888306856155396, max_abs=6.0, mean_rel=0.3094267249107361, max_rel=2937.499755859375, norm_rel=0.024632057175040245, ref_abs_avg=40.31534957885742, test_abs_avg=40.32122802734375
production_forward2 grad[32] vs paper_forward: mean_abs=0.7608060836791992, max_abs=2.75, mean_rel=0.10730238258838654, max_rel=14.605411529541016, norm_rel=0.02211974374949932, ref_abs_avg=34.49700927734375, test_abs_avg=34.50794219970703
production_forward2 grad[33] vs paper_forward: mean_abs=0.9756625294685364, max_abs=6.5, mean_rel=0.1639944314956665, max_rel=1329.326416015625, norm_rel=0.025755878537893295, ref_abs_avg=38.0476188659668, test_abs_avg=38.04762649536133
production_forward2 grad[34] vs paper_forward: mean_abs=0.9119421243667603, max_abs=5.5, mean_rel=0.3252089321613312, max_rel=3249.999755859375, norm_rel=0.024387430399656296, ref_abs_avg=37.50889587402344, test_abs_avg=37.50786590576172
production_forward2 grad[35] vs paper_forward: mean_abs=0.6847602128982544, max_abs=2.25, mean_rel=0.23948392271995544, max_rel=56.26515197753906, norm_rel=0.02343154512345791, ref_abs_avg=28.72698211669922, test_abs_avg=28.7769718170166
production_forward2 grad[36] vs paper_forward: mean_abs=0.9078948497772217, max_abs=6.0, mean_rel=0.17946574091911316, max_rel=1139.4097900390625, norm_rel=0.025602595880627632, ref_abs_avg=35.58241271972656, test_abs_avg=35.584110260009766
production_forward2 grad[37] vs paper_forward: mean_abs=0.8455997705459595, max_abs=5.625, mean_rel=0.283198744058609, max_rel=2609.374755859375, norm_rel=0.023918723687529564, ref_abs_avg=35.509212493896484, test_abs_avg=35.50625991821289
production_forward2 grad[38] vs paper_forward: mean_abs=0.6611735820770264, max_abs=3.617919921875, mean_rel=0.18941693007946014, max_rel=51.01586151123047, norm_rel=0.0237228125333786, ref_abs_avg=28.473552703857422, test_abs_avg=28.35515594482422
production_forward2 grad[39] vs paper_forward: mean_abs=0.8560061454772949, max_abs=6.0, mean_rel=0.16771544516086578, max_rel=1361.0496826171875, norm_rel=0.025252526625990868, ref_abs_avg=34.033626556396484, test_abs_avg=34.035179138183594
production_forward2 grad[40] vs paper_forward: mean_abs=0.792831540107727, max_abs=5.0, mean_rel=0.24885398149490356, max_rel=1999.9998779296875, norm_rel=0.023573579266667366, ref_abs_avg=33.714595794677734, test_abs_avg=33.71150207519531
production_forward2 grad[41] vs paper_forward: mean_abs=0.6351265907287598, max_abs=2.03125, mean_rel=0.12715084850788116, max_rel=11.0326566696167, norm_rel=0.022986628115177155, ref_abs_avg=27.527416229248047, test_abs_avg=27.525070190429688
production_forward2 grad[42] vs paper_forward: mean_abs=0.8079731464385986, max_abs=5.0, mean_rel=0.16518515348434448, max_rel=1203.1273193359375, norm_rel=0.024915633723139763, ref_abs_avg=32.541664123535156, test_abs_avg=32.5435676574707
production_forward2 grad[43] vs paper_forward: mean_abs=0.7553790807723999, max_abs=4.5, mean_rel=0.264037162065506, max_rel=2156.25, norm_rel=0.023521658033132553, ref_abs_avg=32.172691345214844, test_abs_avg=32.17488098144531
production_forward2 grad[44] vs paper_forward: mean_abs=0.617462158203125, max_abs=2.875, mean_rel=0.12504246830940247, max_rel=17.74901008605957, norm_rel=0.02399436943233013, ref_abs_avg=26.296485900878906, test_abs_avg=26.257293701171875
production_forward2 grad[45] vs paper_forward: mean_abs=0.770231306552887, max_abs=5.0, mean_rel=0.16326302289962769, max_rel=1647.909912109375, norm_rel=0.024755582213401794, ref_abs_avg=31.20000648498535, test_abs_avg=31.198762893676758
production_forward2 grad[46] vs paper_forward: mean_abs=0.7187782526016235, max_abs=4.5, mean_rel=0.24514725804328918, max_rel=1921.8748779296875, norm_rel=0.023182425647974014, ref_abs_avg=31.045419692993164, test_abs_avg=31.04790496826172
production_forward2 grad[47] vs paper_forward: mean_abs=0.5634894371032715, max_abs=2.25, mean_rel=0.08440598845481873, max_rel=5.09170389175415, norm_rel=0.022996509447693825, ref_abs_avg=24.876232147216797, test_abs_avg=24.84101676940918
production_forward2 grad[48] vs paper_forward: mean_abs=0.7312638163566589, max_abs=6.0, mean_rel=0.15890134871006012, max_rel=1048.555908203125, norm_rel=0.02441651187837124, ref_abs_avg=29.981693267822266, test_abs_avg=29.982402801513672
production_forward2 grad[49] vs paper_forward: mean_abs=0.6786341667175293, max_abs=5.0, mean_rel=0.2434346079826355, max_rel=2187.5, norm_rel=0.022845610976219177, ref_abs_avg=29.80673599243164, test_abs_avg=29.80678367614746
production_forward2 grad[50] vs paper_forward: mean_abs=0.6368656158447266, max_abs=2.5, mean_rel=0.10610849410295486, max_rel=12.365327835083008, norm_rel=0.02321632206439972, ref_abs_avg=27.287424087524414, test_abs_avg=27.337337493896484
production_forward2 grad[51] vs paper_forward: mean_abs=0.7974120378494263, max_abs=6.0, mean_rel=0.16610999405384064, max_rel=1091.8363037109375, norm_rel=0.025784319266676903, ref_abs_avg=30.98583984375, test_abs_avg=30.98828887939453
production_forward2 grad[52] vs paper_forward: mean_abs=0.748039722442627, max_abs=5.0, mean_rel=0.2716617286205292, max_rel=2843.749755859375, norm_rel=0.024536091834306717, ref_abs_avg=30.55096435546875, test_abs_avg=30.549806594848633
production_forward2 grad[53] vs paper_forward: mean_abs=0.5915825366973877, max_abs=2.5, mean_rel=0.17923790216445923, max_rel=38.1779670715332, norm_rel=0.024871312081813812, ref_abs_avg=23.666423797607422, test_abs_avg=23.667139053344727
production_forward2 grad[54] vs paper_forward: mean_abs=0.7400399446487427, max_abs=5.5, mean_rel=0.16376249492168427, max_rel=1038.33447265625, norm_rel=0.025455942377448082, ref_abs_avg=29.183340072631836, test_abs_avg=29.182292938232422
production_forward2 grad[55] vs paper_forward: mean_abs=0.6940728425979614, max_abs=4.75, mean_rel=0.21313732862472534, max_rel=1515.6248779296875, norm_rel=0.02409769780933857, ref_abs_avg=28.82940673828125, test_abs_avg=28.83809471130371
production_forward2 grad[56] vs paper_forward: mean_abs=0.5317051410675049, max_abs=2.375, mean_rel=0.08977831900119781, max_rel=12.404580116271973, norm_rel=0.022528275847434998, ref_abs_avg=23.839563369750977, test_abs_avg=23.847610473632812
production_forward2 grad[57] vs paper_forward: mean_abs=0.6876702904701233, max_abs=5.0, mean_rel=0.1636536717414856, max_rel=2156.916015625, norm_rel=0.02486785501241684, ref_abs_avg=27.684616088867188, test_abs_avg=27.686058044433594
production_forward2 grad[58] vs paper_forward: mean_abs=0.6378949284553528, max_abs=4.25, mean_rel=0.25846514105796814, max_rel=2062.5, norm_rel=0.02363399788737297, ref_abs_avg=27.03676414489746, test_abs_avg=27.042028427124023
production_forward2 grad[59] vs paper_forward: mean_abs=0.5345659255981445, max_abs=2.46484375, mean_rel=0.09990601986646652, max_rel=8.384714126586914, norm_rel=0.025021841749548912, ref_abs_avg=22.01021957397461, test_abs_avg=21.99523162841797
production_forward2 grad[60] vs paper_forward: mean_abs=0.6481809616088867, max_abs=4.5, mean_rel=0.1642158180475235, max_rel=1060.330322265625, norm_rel=0.024411987513303757, ref_abs_avg=26.580398559570312, test_abs_avg=26.57992935180664
production_forward2 grad[61] vs paper_forward: mean_abs=0.604654848575592, max_abs=5.0, mean_rel=0.2581660747528076, max_rel=1749.9998779296875, norm_rel=0.023115353658795357, ref_abs_avg=26.223793029785156, test_abs_avg=26.22820472717285
production_forward2 grad[62] vs paper_forward: mean_abs=0.5079014301300049, max_abs=2.5, mean_rel=0.1307692676782608, max_rel=12.654948234558105, norm_rel=0.022500725463032722, ref_abs_avg=22.684589385986328, test_abs_avg=22.692964553833008
production_forward2 grad[63] vs paper_forward: mean_abs=0.6123476028442383, max_abs=5.0, mean_rel=0.15214146673679352, max_rel=890.8284912109375, norm_rel=0.024212783202528954, ref_abs_avg=25.347156524658203, test_abs_avg=25.347675323486328
production_forward2 grad[64] vs paper_forward: mean_abs=0.5700463056564331, max_abs=4.0, mean_rel=0.2370235025882721, max_rel=1624.9998779296875, norm_rel=0.022227628156542778, ref_abs_avg=25.552875518798828, test_abs_avg=25.543567657470703
production_forward2 grad[65] vs paper_forward: mean_abs=0.468508243560791, max_abs=1.65625, mean_rel=0.15190134942531586, max_rel=9.607608795166016, norm_rel=0.023150794208049774, ref_abs_avg=19.968151092529297, test_abs_avg=19.962541580200195
production_forward2 grad[66] vs paper_forward: mean_abs=0.5821294784545898, max_abs=5.0, mean_rel=0.16040007770061493, max_rel=1097.021484375, norm_rel=0.023779556155204773, ref_abs_avg=24.478294372558594, test_abs_avg=24.476749420166016
production_forward2 grad[67] vs paper_forward: mean_abs=0.5333154797554016, max_abs=3.75, mean_rel=0.24015994369983673, max_rel=1812.4998779296875, norm_rel=0.022016210481524467, ref_abs_avg=24.2264461517334, test_abs_avg=24.229206085205078
production_forward2 grad[68] vs paper_forward: mean_abs=0.4505312442779541, max_abs=2.0, mean_rel=0.11196520179510117, max_rel=21.044992446899414, norm_rel=0.022985154762864113, ref_abs_avg=19.90361213684082, test_abs_avg=19.890731811523438
production_forward2 grad[69] vs paper_forward: mean_abs=0.551729679107666, max_abs=4.5, mean_rel=0.15562163293361664, max_rel=1055.5089111328125, norm_rel=0.023315168917179108, ref_abs_avg=23.664127349853516, test_abs_avg=23.666545867919922
production_forward2 grad[70] vs paper_forward: mean_abs=0.5091567039489746, max_abs=4.0, mean_rel=0.23334789276123047, max_rel=1499.9998779296875, norm_rel=0.021678129211068153, ref_abs_avg=23.52410316467285, test_abs_avg=23.52204132080078
production_forward2 grad[71] vs paper_forward: mean_abs=0.404742956161499, max_abs=1.5, mean_rel=0.07696115970611572, max_rel=10.129196166992188, norm_rel=0.019818028435111046, ref_abs_avg=20.52445411682129, test_abs_avg=20.514646530151367
production_forward2 grad[72] vs paper_forward: mean_abs=0.5319814682006836, max_abs=4.5, mean_rel=0.1614052802324295, max_rel=1079.7701416015625, norm_rel=0.022943109273910522, ref_abs_avg=23.200336456298828, test_abs_avg=23.200225830078125
production_forward2 grad[73] vs paper_forward: mean_abs=0.48800140619277954, max_abs=4.0, mean_rel=0.20321720838546753, max_rel=1437.4998779296875, norm_rel=0.02174353413283825, ref_abs_avg=22.528831481933594, test_abs_avg=22.5333194732666
production_forward2 grad[74] vs paper_forward: mean_abs=0.45280325412750244, max_abs=2.09375, mean_rel=0.08584239333868027, max_rel=3.3548402786254883, norm_rel=0.022959113121032715, ref_abs_avg=20.179931640625, test_abs_avg=20.24346160888672
production_forward2 grad[75] vs paper_forward: mean_abs=0.5786446928977966, max_abs=4.6875, mean_rel=0.1667805314064026, max_rel=1653.890869140625, norm_rel=0.02435302548110485, ref_abs_avg=23.788494110107422, test_abs_avg=23.78887176513672
production_forward2 grad[76] vs paper_forward: mean_abs=0.5267118215560913, max_abs=3.75, mean_rel=0.24698466062545776, max_rel=1757.8123779296875, norm_rel=0.02260185219347477, ref_abs_avg=23.284582138061523, test_abs_avg=23.286428451538086
production_forward2 grad[77] vs paper_forward: mean_abs=0.4328024387359619, max_abs=1.5625, mean_rel=0.1537601798772812, max_rel=43.066646575927734, norm_rel=0.02226065844297409, ref_abs_avg=19.565731048583984, test_abs_avg=19.588939666748047
production_forward2 grad[78] vs paper_forward: mean_abs=0.5301731824874878, max_abs=4.5, mean_rel=0.14736609160900116, max_rel=1186.5765380859375, norm_rel=0.02379222773015499, ref_abs_avg=22.33417320251465, test_abs_avg=22.335073471069336
production_forward2 grad[79] vs paper_forward: mean_abs=0.4981447458267212, max_abs=3.9375, mean_rel=0.23264861106872559, max_rel=1968.7498779296875, norm_rel=0.022630225867033005, ref_abs_avg=22.11951446533203, test_abs_avg=22.119781494140625
production_forward2 grad[80] vs paper_forward: mean_abs=0.37668323516845703, max_abs=1.5, mean_rel=0.06374654173851013, max_rel=1.986649751663208, norm_rel=0.020921437069773674, ref_abs_avg=18.191280364990234, test_abs_avg=18.201690673828125
production_forward2 grad[81] vs paper_forward: mean_abs=0.48958179354667664, max_abs=4.75, mean_rel=0.14628316462039948, max_rel=612.59033203125, norm_rel=0.02314288169145584, ref_abs_avg=21.218608856201172, test_abs_avg=21.219432830810547
production_forward2 grad[82] vs paper_forward: mean_abs=0.4560922384262085, max_abs=4.4609375, mean_rel=0.20036476850509644, max_rel=1499.9998779296875, norm_rel=0.02186525985598564, ref_abs_avg=20.940235137939453, test_abs_avg=20.948192596435547
production_forward2 grad[83] vs paper_forward: mean_abs=0.3786786198616028, max_abs=1.5, mean_rel=0.11791560053825378, max_rel=26.122560501098633, norm_rel=0.021564532071352005, ref_abs_avg=17.694616317749023, test_abs_avg=17.706165313720703
production_forward2 grad[84] vs paper_forward: mean_abs=0.45948266983032227, max_abs=4.5, mean_rel=0.13892358541488647, max_rel=745.060546875, norm_rel=0.022724298760294914, ref_abs_avg=20.305404663085938, test_abs_avg=20.30575942993164
production_forward2 grad[85] vs paper_forward: mean_abs=0.4253101050853729, max_abs=4.0, mean_rel=0.2167009562253952, max_rel=1499.9998779296875, norm_rel=0.02127811871469021, ref_abs_avg=20.075254440307617, test_abs_avg=20.079879760742188
production_forward2 grad[86] vs paper_forward: mean_abs=0.327883243560791, max_abs=1.375, mean_rel=0.06967271864414215, max_rel=2.767665147781372, norm_rel=0.018928997218608856, ref_abs_avg=17.580787658691406, test_abs_avg=17.550960540771484
production_forward2 grad[87] vs paper_forward: mean_abs=0.43682897090911865, max_abs=4.25, mean_rel=0.13180696964263916, max_rel=1162.52880859375, norm_rel=0.022376541048288345, ref_abs_avg=19.624736785888672, test_abs_avg=19.62552833557129
production_forward2 grad[88] vs paper_forward: mean_abs=0.4018942415714264, max_abs=3.59375, mean_rel=0.18558304011821747, max_rel=2093.75, norm_rel=0.020417595282197, ref_abs_avg=19.857440948486328, test_abs_avg=19.857078552246094
production_forward2 grad[89] vs paper_forward: mean_abs=0.33490681648254395, max_abs=1.25, mean_rel=0.059094808995723724, max_rel=1.8995131254196167, norm_rel=0.020724892616271973, ref_abs_avg=16.451780319213867, test_abs_avg=16.453815460205078
production_forward2 grad[90] vs paper_forward: mean_abs=0.41963356733322144, max_abs=4.125, mean_rel=0.1346537321805954, max_rel=751.05078125, norm_rel=0.021817415952682495, ref_abs_avg=19.413997650146484, test_abs_avg=19.413768768310547
production_forward2 grad[91] vs paper_forward: mean_abs=0.3770175576210022, max_abs=4.0, mean_rel=0.19499890506267548, max_rel=1562.4998779296875, norm_rel=0.02000907063484192, ref_abs_avg=18.955860137939453, test_abs_avg=18.96493148803711
production_forward2 grad[92] vs paper_forward: mean_abs=0.3051161468029022, max_abs=1.25, mean_rel=0.16603180766105652, max_rel=16.201744079589844, norm_rel=0.01893666759133339, ref_abs_avg=15.522010803222656, test_abs_avg=15.517412185668945
production_forward2 grad[93] vs paper_forward: mean_abs=0.3890523612499237, max_abs=4.0, mean_rel=0.1265622079372406, max_rel=793.7297973632812, norm_rel=0.02127128653228283, ref_abs_avg=18.517122268676758, test_abs_avg=18.518028259277344
production_forward2 grad[94] vs paper_forward: mean_abs=0.35917696356773376, max_abs=3.75, mean_rel=0.17512956261634827, max_rel=1867.1873779296875, norm_rel=0.01936638541519642, ref_abs_avg=18.791885375976562, test_abs_avg=18.797832489013672
production_forward2 grad[95] vs paper_forward: mean_abs=0.28693437576293945, max_abs=1.34375, mean_rel=0.06455793231725693, max_rel=1.854966402053833, norm_rel=0.019872218370437622, ref_abs_avg=14.658702850341797, test_abs_avg=14.697797775268555
production_forward2 grad[96] vs paper_forward: mean_abs=0.3684746026992798, max_abs=4.5, mean_rel=0.12000785768032074, max_rel=816.7664184570312, norm_rel=0.020907564088702202, ref_abs_avg=17.957286834716797, test_abs_avg=17.957229614257812
production_forward2 grad[97] vs paper_forward: mean_abs=0.3513179421424866, max_abs=3.59375, mean_rel=0.18224453926086426, max_rel=1312.4998779296875, norm_rel=0.01942899078130722, ref_abs_avg=18.288753509521484, test_abs_avg=18.290855407714844
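Each per-gradient line above reports mean_abs, max_abs, mean_rel, max_rel, norm_rel, and the average reference/test magnitudes. A minimal pure-Python sketch of how such stats could be computed is below; this is a hypothetical reconstruction (the `eps` floor in particular is an assumption), and the actual script presumably operates on torch tensors rather than lists:

```python
import math

def grad_metrics(ref, test, eps=1e-12):
    # Hypothetical reconstruction of the stats in this log.
    # ref/test: flat lists of gradient values for one parameter.
    diff = [abs(t - r) for r, t in zip(ref, test)]
    # Per-element relative error; eps guards near-zero reference entries.
    rel = [d / max(abs(r), eps) for d, r in zip(diff, ref)]
    n = len(ref)
    return {
        "mean_abs": sum(diff) / n,
        "max_abs": max(diff),
        "mean_rel": sum(rel) / n,
        "max_rel": max(rel),
        # norm_rel: ||test - ref|| / ||ref||, a global error measure.
        "norm_rel": math.sqrt(sum(d * d for d in diff))
                    / math.sqrt(sum(r * r for r in ref)),
        "ref_abs_avg": sum(abs(r) for r in ref) / n,
        "test_abs_avg": sum(abs(t) for t in test) / n,
    }
```

Under this reading, norm_rel is the most robust single number: it aggregates the whole tensor and is not inflated by individual near-zero reference entries the way max_rel is.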
identity layers + randn queries
production_forward fwd+bwd:  113.614 ms
production_forward bwd-only: 95.979 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=10.118 GiB
production_forward peak reserved:  fwd=3.637 GiB, fwd+bwd=12.637 GiB
paper_forward fwd+bwd:  385.134 ms
paper_forward bwd-only: 305.065 ms
paper_forward peak allocated: fwd=30.002 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.057 GiB, fwd+bwd=32.807 GiB
production_forward2 fwd+bwd:  191.566 ms
production_forward2 bwd-only: 172.526 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.262 GiB, fwd+bwd=9.012 GiB
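The "fwd+bwd" times and "peak allocated/reserved" figures above are consistent with the standard CUDA-event benchmarking pattern: warm up, reset the peak-memory counters, record events around the timed region, then read `torch.cuda.max_memory_allocated()`. A hedged sketch follows; `benchmark`, `warmup`, and `iters` are illustrative names, not the script's actual API:

```python
import torch

def benchmark(fn, *args, warmup=3, iters=10):
    """Time fwd+bwd and record peak allocated memory for a callable
    that returns a tensor. Returns (ms_per_iter, peak_gib), or None
    when no GPU is available."""
    if not torch.cuda.is_available():
        return None
    for _ in range(warmup):
        fn(*args).sum().backward()
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args).sum().backward()
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters  # elapsed_time is in ms
    peak_gib = torch.cuda.max_memory_allocated() / 2**30
    return ms, peak_gib
```

Note that "peak reserved" would come from `torch.cuda.max_memory_reserved()` instead, which includes the caching allocator's slack and is therefore always at least as large as the allocated peak, matching the gaps in the table above.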

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016944250091910362, max_abs=0.046875
production_forward grad[0] vs paper_forward: mean_abs=0.00867659505456686, max_abs=0.53125, mean_rel=0.07321739941835403, max_rel=124.63219451904297, norm_rel=0.019954819232225418, ref_abs_avg=0.471544474363327, test_abs_avg=0.47156447172164917
production_forward grad[1] vs paper_forward: mean_abs=7.578455448150635, max_abs=56.0, mean_rel=0.2104467898607254, max_rel=1216.5186767578125, norm_rel=0.020562605932354927, ref_abs_avg=327.9366760253906, test_abs_avg=328.0084228515625
production_forward grad[2] vs paper_forward: mean_abs=1.2460532188415527, max_abs=5.5, mean_rel=0.15871012210845947, max_rel=36.776466369628906, norm_rel=0.023400574922561646, ref_abs_avg=54.21361541748047, test_abs_avg=54.21880340576172
production_forward grad[3] vs paper_forward: mean_abs=1.6581065654754639, max_abs=11.0, mean_rel=0.16494345664978027, max_rel=2383.708251953125, norm_rel=0.02452120929956436, ref_abs_avg=68.07452392578125, test_abs_avg=68.08357238769531
production_forward grad[4] vs paper_forward: mean_abs=1.5371887683868408, max_abs=10.5, mean_rel=0.4273166358470917, max_rel=4375.0, norm_rel=0.022921331226825714, ref_abs_avg=67.36068725585938, test_abs_avg=67.37266540527344
production_forward grad[5] vs paper_forward: mean_abs=1.1405872106552124, max_abs=5.0, mean_rel=0.12713830173015594, max_rel=27.62506675720215, norm_rel=0.022715475410223007, ref_abs_avg=50.39482879638672, test_abs_avg=50.44035339355469
production_forward grad[6] vs paper_forward: mean_abs=1.455998182296753, max_abs=12.0, mean_rel=0.1629435420036316, max_rel=2168.180419921875, norm_rel=0.024280447512865067, ref_abs_avg=60.31103515625, test_abs_avg=60.31386184692383
production_forward grad[7] vs paper_forward: mean_abs=1.3362102508544922, max_abs=8.5, mean_rel=0.4162226915359497, max_rel=4718.75, norm_rel=0.022583164274692535, ref_abs_avg=59.52531433105469, test_abs_avg=59.52722930908203
production_forward grad[8] vs paper_forward: mean_abs=1.0230870246887207, max_abs=4.0, mean_rel=0.14554035663604736, max_rel=13.887773513793945, norm_rel=0.023944305256009102, ref_abs_avg=42.403907775878906, test_abs_avg=42.49391174316406
production_forward grad[9] vs paper_forward: mean_abs=1.3308244943618774, max_abs=9.0, mean_rel=0.16731634736061096, max_rel=1519.4586181640625, norm_rel=0.02408367395401001, ref_abs_avg=55.67478561401367, test_abs_avg=55.67680358886719
production_forward grad[10] vs paper_forward: mean_abs=1.2245714664459229, max_abs=7.75, mean_rel=0.37741613388061523, max_rel=3874.999755859375, norm_rel=0.022540176287293434, ref_abs_avg=54.68236541748047, test_abs_avg=54.68693542480469
production_forward grad[11] vs paper_forward: mean_abs=0.8984889984130859, max_abs=4.0, mean_rel=0.11262694001197815, max_rel=13.103195190429688, norm_rel=0.02055414207279682, ref_abs_avg=43.56941604614258, test_abs_avg=43.549224853515625
production_forward grad[12] vs paper_forward: mean_abs=1.2269561290740967, max_abs=10.0, mean_rel=0.16671574115753174, max_rel=1864.410400390625, norm_rel=0.023949241265654564, ref_abs_avg=51.5570068359375, test_abs_avg=51.5603141784668
production_forward grad[13] vs paper_forward: mean_abs=1.1251869201660156, max_abs=6.75, mean_rel=0.3355218768119812, max_rel=3562.499755859375, norm_rel=0.022264085710048676, ref_abs_avg=50.81504821777344, test_abs_avg=50.81586456298828
production_forward grad[14] vs paper_forward: mean_abs=0.8764381408691406, max_abs=3.125, mean_rel=0.06737782806158066, max_rel=3.0495967864990234, norm_rel=0.020920302718877792, ref_abs_avg=41.694114685058594, test_abs_avg=41.702232360839844
production_forward grad[15] vs paper_forward: mean_abs=1.1433703899383545, max_abs=10.0, mean_rel=0.1552388072013855, max_rel=1338.0107421875, norm_rel=0.023837508633732796, ref_abs_avg=48.346282958984375, test_abs_avg=48.350746154785156
production_forward grad[16] vs paper_forward: mean_abs=1.0507299900054932, max_abs=6.625, mean_rel=0.33414435386657715, max_rel=3124.999755859375, norm_rel=0.0221074391156435, ref_abs_avg=47.823753356933594, test_abs_avg=47.826480865478516
production_forward grad[17] vs paper_forward: mean_abs=0.8805718421936035, max_abs=4.0, mean_rel=0.1670830100774765, max_rel=23.77618408203125, norm_rel=0.02365901879966259, ref_abs_avg=37.93742370605469, test_abs_avg=38.04853439331055
production_forward grad[18] vs paper_forward: mean_abs=1.0726972818374634, max_abs=9.0, mean_rel=0.16376379132270813, max_rel=2688.40283203125, norm_rel=0.02372969686985016, ref_abs_avg=45.48407745361328, test_abs_avg=45.48825454711914
production_forward grad[19] vs paper_forward: mean_abs=0.9841387271881104, max_abs=6.0, mean_rel=0.2830381691455841, max_rel=2671.874755859375, norm_rel=0.0219570305198431, ref_abs_avg=45.034305572509766, test_abs_avg=45.037559509277344
production_forward grad[20] vs paper_forward: mean_abs=0.7990732192993164, max_abs=3.5, mean_rel=0.1019066721200943, max_rel=7.573921203613281, norm_rel=0.022038400173187256, ref_abs_avg=36.205623626708984, test_abs_avg=36.24888610839844
production_forward grad[21] vs paper_forward: mean_abs=1.0242335796356201, max_abs=7.0, mean_rel=0.16547361016273499, max_rel=2576.286376953125, norm_rel=0.02359006740152836, ref_abs_avg=43.67195510864258, test_abs_avg=43.675750732421875
production_forward grad[22] vs paper_forward: mean_abs=0.941569447517395, max_abs=6.375, mean_rel=0.31988394260406494, max_rel=3093.749755859375, norm_rel=0.02203778550028801, ref_abs_avg=42.91346740722656, test_abs_avg=42.91944122314453
production_forward grad[23] vs paper_forward: mean_abs=0.7454137802124023, max_abs=2.75, mean_rel=0.15053990483283997, max_rel=20.134132385253906, norm_rel=0.020657559856772423, ref_abs_avg=36.471858978271484, test_abs_avg=36.525413513183594
production_forward grad[24] vs paper_forward: mean_abs=0.9716944694519043, max_abs=7.0, mean_rel=0.15677373111248016, max_rel=2023.7755126953125, norm_rel=0.023455779999494553, ref_abs_avg=41.73587417602539, test_abs_avg=41.734657287597656
production_forward grad[25] vs paper_forward: mean_abs=0.886613130569458, max_abs=5.5, mean_rel=0.3077815771102905, max_rel=2937.499755859375, norm_rel=0.021718008443713188, ref_abs_avg=41.0641975402832, test_abs_avg=41.06960678100586
production_forward grad[26] vs paper_forward: mean_abs=0.9050021171569824, max_abs=3.25, mean_rel=0.21427983045578003, max_rel=38.9661979675293, norm_rel=0.0243892353028059, ref_abs_avg=36.72212600708008, test_abs_avg=36.78898620605469
production_forward grad[27] vs paper_forward: mean_abs=1.132497787475586, max_abs=11.0, mean_rel=0.17098085582256317, max_rel=2598.28125, norm_rel=0.025073006749153137, ref_abs_avg=45.42576599121094, test_abs_avg=45.428062438964844
production_forward grad[28] vs paper_forward: mean_abs=1.042846441268921, max_abs=6.0, mean_rel=0.2835661768913269, max_rel=2687.499755859375, norm_rel=0.02349141053855419, ref_abs_avg=44.639556884765625, test_abs_avg=44.65174865722656
production_forward grad[29] vs paper_forward: mean_abs=0.7817974090576172, max_abs=3.25, mean_rel=0.10193194448947906, max_rel=11.151738166809082, norm_rel=0.02396545186638832, ref_abs_avg=33.39048385620117, test_abs_avg=33.407352447509766
production_forward grad[30] vs paper_forward: mean_abs=1.0339994430541992, max_abs=7.0, mean_rel=0.1772030144929886, max_rel=1619.09765625, norm_rel=0.02541476860642433, ref_abs_avg=40.88751220703125, test_abs_avg=40.892303466796875
production_forward grad[31] vs paper_forward: mean_abs=0.9697123765945435, max_abs=6.5, mean_rel=0.37977135181427, max_rel=2874.999755859375, norm_rel=0.024084072560071945, ref_abs_avg=40.46000289916992, test_abs_avg=40.47281265258789
production_forward grad[32] vs paper_forward: mean_abs=0.741729736328125, max_abs=3.5, mean_rel=0.08658923953771591, max_rel=4.091216087341309, norm_rel=0.024427957832813263, ref_abs_avg=30.589160919189453, test_abs_avg=30.61587142944336
production_forward grad[33] vs paper_forward: mean_abs=0.9680207371711731, max_abs=8.0, mean_rel=0.16511622071266174, max_rel=1665.5821533203125, norm_rel=0.025296062231063843, ref_abs_avg=38.41847229003906, test_abs_avg=38.41780090332031
production_forward grad[34] vs paper_forward: mean_abs=0.8996748924255371, max_abs=5.5625, mean_rel=0.3124430775642395, max_rel=2874.999755859375, norm_rel=0.023711565881967545, ref_abs_avg=38.07161331176758, test_abs_avg=38.072452545166016
production_forward grad[35] vs paper_forward: mean_abs=0.7342908382415771, max_abs=2.75, mean_rel=0.08756765723228455, max_rel=4.797880172729492, norm_rel=0.025688905268907547, ref_abs_avg=29.086322784423828, test_abs_avg=29.07509994506836
production_forward grad[36] vs paper_forward: mean_abs=0.9062782526016235, max_abs=8.0, mean_rel=0.1656145453453064, max_rel=1787.051513671875, norm_rel=0.02518761344254017, ref_abs_avg=36.101707458496094, test_abs_avg=36.10306167602539
production_forward grad[37] vs paper_forward: mean_abs=0.8478465676307678, max_abs=5.5, mean_rel=0.24688294529914856, max_rel=2343.75, norm_rel=0.023829003795981407, ref_abs_avg=35.689300537109375, test_abs_avg=35.692466735839844
production_forward grad[38] vs paper_forward: mean_abs=0.6929621696472168, max_abs=2.5, mean_rel=0.14227578043937683, max_rel=15.355753898620605, norm_rel=0.024917615577578545, ref_abs_avg=28.06127166748047, test_abs_avg=28.15256118774414
production_forward grad[39] vs paper_forward: mean_abs=0.8612948060035706, max_abs=6.0, mean_rel=0.15951430797576904, max_rel=1793.682373046875, norm_rel=0.024943465366959572, ref_abs_avg=34.66069793701172, test_abs_avg=34.66144943237305
production_forward grad[40] vs paper_forward: mean_abs=0.7987704277038574, max_abs=5.0, mean_rel=0.23382365703582764, max_rel=1968.7498779296875, norm_rel=0.023264138028025627, ref_abs_avg=34.3978271484375, test_abs_avg=34.392826080322266
production_forward grad[41] vs paper_forward: mean_abs=0.6247332096099854, max_abs=3.0, mean_rel=0.10479429364204407, max_rel=7.795053958892822, norm_rel=0.023666363209486008, ref_abs_avg=26.794721603393555, test_abs_avg=26.84156608581543
production_forward grad[42] vs paper_forward: mean_abs=0.8129171133041382, max_abs=5.5, mean_rel=0.1628219038248062, max_rel=1261.778564453125, norm_rel=0.02465084381401539, ref_abs_avg=33.082969665527344, test_abs_avg=33.084144592285156
production_forward grad[43] vs paper_forward: mean_abs=0.7552131414413452, max_abs=4.5, mean_rel=0.24504220485687256, max_rel=3156.249755859375, norm_rel=0.023220885545015335, ref_abs_avg=32.60271453857422, test_abs_avg=32.60188674926758
production_forward grad[44] vs paper_forward: mean_abs=0.6296277046203613, max_abs=3.1171875, mean_rel=0.14603111147880554, max_rel=28.900665283203125, norm_rel=0.024549562484025955, ref_abs_avg=25.15070152282715, test_abs_avg=25.150827407836914
production_forward grad[45] vs paper_forward: mean_abs=0.7734966278076172, max_abs=5.5, mean_rel=0.15594220161437988, max_rel=1209.5341796875, norm_rel=0.024248361587524414, ref_abs_avg=31.993125915527344, test_abs_avg=31.994522094726562
production_forward grad[46] vs paper_forward: mean_abs=0.7204782962799072, max_abs=4.875, mean_rel=0.2287464439868927, max_rel=1343.7498779296875, norm_rel=0.022955190390348434, ref_abs_avg=31.466360092163086, test_abs_avg=31.473745346069336
production_forward grad[47] vs paper_forward: mean_abs=0.5744080543518066, max_abs=2.25, mean_rel=0.1572435200214386, max_rel=20.606019973754883, norm_rel=0.022238371893763542, ref_abs_avg=25.784523010253906, test_abs_avg=25.766633987426758
production_forward grad[48] vs paper_forward: mean_abs=0.7428089380264282, max_abs=5.25, mean_rel=0.15766078233718872, max_rel=1771.0948486328125, norm_rel=0.02404894307255745, ref_abs_avg=30.94145393371582, test_abs_avg=30.943592071533203
production_forward grad[49] vs paper_forward: mean_abs=0.6913974285125732, max_abs=4.0, mean_rel=0.2853131890296936, max_rel=2484.375, norm_rel=0.02279454842209816, ref_abs_avg=30.34742546081543, test_abs_avg=30.348134994506836
production_forward grad[50] vs paper_forward: mean_abs=0.6412239074707031, max_abs=2.75, mean_rel=0.06993088126182556, max_rel=3.8075716495513916, norm_rel=0.023528488352894783, ref_abs_avg=27.057315826416016, test_abs_avg=27.04735565185547
production_forward grad[51] vs paper_forward: mean_abs=0.8314387202262878, max_abs=7.0, mean_rel=0.1731773465871811, max_rel=1955.9871826171875, norm_rel=0.02601274847984314, ref_abs_avg=32.102787017822266, test_abs_avg=32.103424072265625
production_forward grad[52] vs paper_forward: mean_abs=0.7740999460220337, max_abs=5.0, mean_rel=0.26243454217910767, max_rel=2156.25, norm_rel=0.024615399539470673, ref_abs_avg=31.540058135986328, test_abs_avg=31.538179397583008
production_forward grad[53] vs paper_forward: mean_abs=0.5836334228515625, max_abs=2.5, mean_rel=0.13894565403461456, max_rel=18.125150680541992, norm_rel=0.025212828069925308, ref_abs_avg=23.22336196899414, test_abs_avg=23.26556396484375
production_forward grad[54] vs paper_forward: mean_abs=0.7643808126449585, max_abs=6.0, mean_rel=0.16563299298286438, max_rel=979.9451293945312, norm_rel=0.02563430368900299, ref_abs_avg=29.8712158203125, test_abs_avg=29.873382568359375
production_forward grad[55] vs paper_forward: mean_abs=0.7134031057357788, max_abs=4.25, mean_rel=0.31857284903526306, max_rel=2156.25, norm_rel=0.024178538471460342, ref_abs_avg=29.563995361328125, test_abs_avg=29.574506759643555
production_forward grad[56] vs paper_forward: mean_abs=0.5927920937538147, max_abs=2.25, mean_rel=0.09969806671142578, max_rel=10.000214576721191, norm_rel=0.02532566525042057, ref_abs_avg=23.20798683166504, test_abs_avg=23.246389389038086
production_forward grad[57] vs paper_forward: mean_abs=0.7174323797225952, max_abs=6.0, mean_rel=0.1643865406513214, max_rel=795.7351684570312, norm_rel=0.025216611102223396, ref_abs_avg=28.515277862548828, test_abs_avg=28.519561767578125
production_forward grad[58] vs paper_forward: mean_abs=0.6615198850631714, max_abs=4.0, mean_rel=0.2982995808124542, max_rel=2187.5, norm_rel=0.023693306371569633, ref_abs_avg=27.981082916259766, test_abs_avg=27.987707138061523
production_forward grad[59] vs paper_forward: mean_abs=0.5251665115356445, max_abs=2.0, mean_rel=0.08269380033016205, max_rel=7.379617214202881, norm_rel=0.02196710743010044, ref_abs_avg=23.95697784423828, test_abs_avg=23.920001983642578
production_forward grad[60] vs paper_forward: mean_abs=0.6719384789466858, max_abs=5.5, mean_rel=0.1534997969865799, max_rel=1097.626220703125, norm_rel=0.02476605400443077, ref_abs_avg=27.147167205810547, test_abs_avg=27.146827697753906
production_forward grad[61] vs paper_forward: mean_abs=0.6220197677612305, max_abs=4.5, mean_rel=0.25945356488227844, max_rel=1937.4998779296875, norm_rel=0.023589331656694412, ref_abs_avg=26.440204620361328, test_abs_avg=26.43628692626953
production_forward grad[62] vs paper_forward: mean_abs=0.4802093505859375, max_abs=2.0, mean_rel=0.11557485163211823, max_rel=13.73488998413086, norm_rel=0.02422880381345749, ref_abs_avg=20.250507354736328, test_abs_avg=20.267784118652344
production_forward grad[63] vs paper_forward: mean_abs=0.6252901554107666, max_abs=4.5, mean_rel=0.149371936917305, max_rel=1045.616943359375, norm_rel=0.024525556713342667, ref_abs_avg=25.521678924560547, test_abs_avg=25.519969940185547
production_forward grad[64] vs paper_forward: mean_abs=0.5825854539871216, max_abs=4.25, mean_rel=0.222584068775177, max_rel=1640.6248779296875, norm_rel=0.023000283166766167, ref_abs_avg=25.358774185180664, test_abs_avg=25.362106323242188
production_forward grad[65] vs paper_forward: mean_abs=0.42965197563171387, max_abs=1.75, mean_rel=0.08808176964521408, max_rel=14.383683204650879, norm_rel=0.021156294271349907, ref_abs_avg=21.105972290039062, test_abs_avg=21.144275665283203
production_forward grad[66] vs paper_forward: mean_abs=0.5985939502716064, max_abs=5.0, mean_rel=0.14966228604316711, max_rel=915.6497192382812, norm_rel=0.02395443245768547, ref_abs_avg=24.988842010498047, test_abs_avg=24.99122428894043
production_forward grad[67] vs paper_forward: mean_abs=0.55069899559021, max_abs=3.875, mean_rel=0.23633694648742676, max_rel=1437.4998779296875, norm_rel=0.022758275270462036, ref_abs_avg=24.201351165771484, test_abs_avg=24.205059051513672
production_forward grad[68] vs paper_forward: mean_abs=0.4516434669494629, max_abs=1.9375, mean_rel=0.0818285346031189, max_rel=3.751523494720459, norm_rel=0.023134121671319008, ref_abs_avg=19.37499237060547, test_abs_avg=19.36838150024414
production_forward grad[69] vs paper_forward: mean_abs=0.5679569244384766, max_abs=5.0, mean_rel=0.15381580591201782, max_rel=1386.4169921875, norm_rel=0.02378973923623562, ref_abs_avg=23.88779640197754, test_abs_avg=23.889190673828125
production_forward grad[70] vs paper_forward: mean_abs=0.525844156742096, max_abs=4.5, mean_rel=0.21951451897621155, max_rel=1726.5623779296875, norm_rel=0.022283390164375305, ref_abs_avg=23.635971069335938, test_abs_avg=23.635799407958984
production_forward grad[71] vs paper_forward: mean_abs=0.4350271224975586, max_abs=1.5, mean_rel=0.11200488358736038, max_rel=25.8072509765625, norm_rel=0.021474407985806465, ref_abs_avg=19.75170135498047, test_abs_avg=19.717016220092773
production_forward grad[72] vs paper_forward: mean_abs=0.5440172553062439, max_abs=4.25, mean_rel=0.1490817666053772, max_rel=922.9968872070312, norm_rel=0.023234190419316292, ref_abs_avg=23.364566802978516, test_abs_avg=23.365339279174805
production_forward grad[73] vs paper_forward: mean_abs=0.4958559572696686, max_abs=3.5, mean_rel=0.23745965957641602, max_rel=1437.4998779296875, norm_rel=0.02226119302213192, ref_abs_avg=22.318124771118164, test_abs_avg=22.322742462158203
production_forward grad[74] vs paper_forward: mean_abs=0.4414398670196533, max_abs=1.75, mean_rel=0.14879818260669708, max_rel=28.053386688232422, norm_rel=0.02192765101790428, ref_abs_avg=19.587509155273438, test_abs_avg=19.545124053955078
production_forward grad[75] vs paper_forward: mean_abs=0.5853841304779053, max_abs=5.0, mean_rel=0.15708935260772705, max_rel=1161.7763671875, norm_rel=0.024617038667201996, ref_abs_avg=23.83563232421875, test_abs_avg=23.838497161865234
production_forward grad[76] vs paper_forward: mean_abs=0.550157368183136, max_abs=4.125, mean_rel=0.22632677853107452, max_rel=1624.9998779296875, norm_rel=0.02311587892472744, ref_abs_avg=23.892593383789062, test_abs_avg=23.89405632019043
production_forward grad[77] vs paper_forward: mean_abs=0.4298582077026367, max_abs=2.0, mean_rel=0.11405492573976517, max_rel=5.213605880737305, norm_rel=0.023123545572161674, ref_abs_avg=18.711082458496094, test_abs_avg=18.739700317382812
production_forward grad[78] vs paper_forward: mean_abs=0.5441073775291443, max_abs=4.5, mean_rel=0.14730429649353027, max_rel=1040.475341796875, norm_rel=0.02412096969783306, ref_abs_avg=22.59975242614746, test_abs_avg=22.600631713867188
production_forward grad[79] vs paper_forward: mean_abs=0.501993715763092, max_abs=4.0, mean_rel=0.27518898248672485, max_rel=1749.9998779296875, norm_rel=0.022495508193969727, ref_abs_avg=22.33602523803711, test_abs_avg=22.336570739746094
production_forward grad[80] vs paper_forward: mean_abs=0.42535948753356934, max_abs=1.9375, mean_rel=0.14302124083042145, max_rel=30.4447021484375, norm_rel=0.022307442501187325, ref_abs_avg=19.310749053955078, test_abs_avg=19.28689956665039
production_forward grad[81] vs paper_forward: mean_abs=0.5099785327911377, max_abs=7.0, mean_rel=0.14335289597511292, max_rel=908.9422607421875, norm_rel=0.023391610011458397, ref_abs_avg=21.882003784179688, test_abs_avg=21.883380889892578
production_forward grad[82] vs paper_forward: mean_abs=0.46774813532829285, max_abs=4.40625, mean_rel=0.20177221298217773, max_rel=1374.9998779296875, norm_rel=0.021442960947752, ref_abs_avg=21.852340698242188, test_abs_avg=21.86615753173828
production_forward grad[83] vs paper_forward: mean_abs=0.35397231578826904, max_abs=1.5, mean_rel=0.1794106811285019, max_rel=19.521116256713867, norm_rel=0.021323952823877335, ref_abs_avg=16.71011734008789, test_abs_avg=16.753311157226562
production_forward grad[84] vs paper_forward: mean_abs=0.47346746921539307, max_abs=4.0, mean_rel=0.14297573268413544, max_rel=999.1276245117188, norm_rel=0.023042837157845497, ref_abs_avg=20.601329803466797, test_abs_avg=20.602157592773438
production_forward grad[85] vs paper_forward: mean_abs=0.4407387971878052, max_abs=4.0, mean_rel=0.19201534986495972, max_rel=1499.9998779296875, norm_rel=0.021380634978413582, ref_abs_avg=20.669601440429688, test_abs_avg=20.670377731323242
production_forward grad[86] vs paper_forward: mean_abs=0.36491161584854126, max_abs=1.3125, mean_rel=0.17719954252243042, max_rel=19.74152374267578, norm_rel=0.021089909598231316, ref_abs_avg=17.264392852783203, test_abs_avg=17.261024475097656
production_forward grad[87] vs paper_forward: mean_abs=0.45104408264160156, max_abs=4.5, mean_rel=0.13815373182296753, max_rel=490.831298828125, norm_rel=0.022288518026471138, ref_abs_avg=20.346012115478516, test_abs_avg=20.348007202148438
production_forward grad[88] vs paper_forward: mean_abs=0.4104350209236145, max_abs=4.0, mean_rel=0.18967989087104797, max_rel=1125.0, norm_rel=0.020656943321228027, ref_abs_avg=20.047954559326172, test_abs_avg=20.048532485961914
production_forward grad[89] vs paper_forward: mean_abs=0.32824623584747314, max_abs=1.25, mean_rel=0.14350509643554688, max_rel=11.64082145690918, norm_rel=0.01931178569793701, ref_abs_avg=17.45874786376953, test_abs_avg=17.488243103027344
production_forward grad[90] vs paper_forward: mean_abs=0.4206145405769348, max_abs=4.5, mean_rel=0.13932359218597412, max_rel=1470.619873046875, norm_rel=0.021750053390860558, ref_abs_avg=19.525564193725586, test_abs_avg=19.524850845336914
production_forward grad[91] vs paper_forward: mean_abs=0.38528087735176086, max_abs=3.5, mean_rel=0.18524566292762756, max_rel=1562.4998779296875, norm_rel=0.02008529007434845, ref_abs_avg=19.325223922729492, test_abs_avg=19.333250045776367
production_forward grad[92] vs paper_forward: mean_abs=0.31329917907714844, max_abs=1.1875, mean_rel=0.05494491010904312, max_rel=1.2952630519866943, norm_rel=0.019483575597405434, ref_abs_avg=15.932411193847656, test_abs_avg=15.923181533813477
production_forward grad[93] vs paper_forward: mean_abs=0.3903471827507019, max_abs=4.0, mean_rel=0.1343567669391632, max_rel=1177.506103515625, norm_rel=0.02139955386519432, ref_abs_avg=18.458660125732422, test_abs_avg=18.459625244140625
production_forward grad[94] vs paper_forward: mean_abs=0.3642662763595581, max_abs=3.75, mean_rel=0.18372252583503723, max_rel=2218.75, norm_rel=0.019806407392024994, ref_abs_avg=18.65676498413086, test_abs_avg=18.661434173583984
production_forward grad[95] vs paper_forward: mean_abs=0.28230148553848267, max_abs=1.25, mean_rel=0.14400561153888702, max_rel=27.254241943359375, norm_rel=0.018662648275494576, ref_abs_avg=15.893695831298828, test_abs_avg=15.88918399810791
production_forward grad[96] vs paper_forward: mean_abs=0.38709673285484314, max_abs=4.5, mean_rel=0.13378648459911346, max_rel=891.5759887695312, norm_rel=0.021218597888946533, ref_abs_avg=18.537742614746094, test_abs_avg=18.53917694091797
production_forward grad[97] vs paper_forward: mean_abs=0.3527044653892517, max_abs=4.0, mean_rel=0.16790977120399475, max_rel=1203.125, norm_rel=0.0192963145673275, ref_abs_avg=18.47671127319336, test_abs_avg=18.483076095581055
production_forward2 vs paper_forward output: mean_abs=0.0016944250091910362, max_abs=0.046875
production_forward2 grad[0] vs paper_forward: mean_abs=0.009019304998219013, max_abs=0.5390625, mean_rel=0.07574337720870972, max_rel=105.64639282226562, norm_rel=0.020618710666894913, ref_abs_avg=0.471544474363327, test_abs_avg=0.47154873609542847
production_forward2 grad[1] vs paper_forward: mean_abs=7.710519313812256, max_abs=56.0, mean_rel=0.20963618159294128, max_rel=1171.8641357421875, norm_rel=0.02093992568552494, ref_abs_avg=327.9366760253906, test_abs_avg=328.00018310546875
production_forward2 grad[2] vs paper_forward: mean_abs=1.372910499572754, max_abs=5.0, mean_rel=0.21413575112819672, max_rel=53.771366119384766, norm_rel=0.024882253259420395, ref_abs_avg=54.21361541748047, test_abs_avg=54.27068328857422
production_forward2 grad[3] vs paper_forward: mean_abs=1.705960750579834, max_abs=12.0, mean_rel=0.17521782219409943, max_rel=2626.17724609375, norm_rel=0.02521919086575508, ref_abs_avg=68.07452392578125, test_abs_avg=68.07958984375
production_forward2 grad[4] vs paper_forward: mean_abs=1.5832816362380981, max_abs=10.125, mean_rel=0.414620578289032, max_rel=4625.0, norm_rel=0.02362097054719925, ref_abs_avg=67.36068725585938, test_abs_avg=67.36373138427734
production_forward2 grad[5] vs paper_forward: mean_abs=1.1122288703918457, max_abs=4.5, mean_rel=0.10741738975048065, max_rel=19.362998962402344, norm_rel=0.02257089503109455, ref_abs_avg=50.39482879638672, test_abs_avg=50.43891143798828
production_forward2 grad[6] vs paper_forward: mean_abs=1.4978017807006836, max_abs=10.0, mean_rel=0.176942378282547, max_rel=1909.6514892578125, norm_rel=0.024980010464787483, ref_abs_avg=60.31103515625, test_abs_avg=60.3106575012207
production_forward2 grad[7] vs paper_forward: mean_abs=1.3802382946014404, max_abs=8.25, mean_rel=0.4222088158130646, max_rel=4375.0, norm_rel=0.023327399045228958, ref_abs_avg=59.52531433105469, test_abs_avg=59.52540588378906
production_forward2 grad[8] vs paper_forward: mean_abs=1.037703037261963, max_abs=4.0, mean_rel=0.17483875155448914, max_rel=27.308435440063477, norm_rel=0.024111399427056313, ref_abs_avg=42.403907775878906, test_abs_avg=42.483341217041016
production_forward2 grad[9] vs paper_forward: mean_abs=1.3656418323516846, max_abs=9.0, mean_rel=0.17533716559410095, max_rel=1819.4443359375, norm_rel=0.024694060906767845, ref_abs_avg=55.67478561401367, test_abs_avg=55.674835205078125
production_forward2 grad[10] vs paper_forward: mean_abs=1.262343168258667, max_abs=7.9609375, mean_rel=0.3971484303474426, max_rel=4750.0, norm_rel=0.02323365956544876, ref_abs_avg=54.68236541748047, test_abs_avg=54.68528747558594
production_forward2 grad[11] vs paper_forward: mean_abs=0.9256086349487305, max_abs=4.2265625, mean_rel=0.12319114059209824, max_rel=12.864502906799316, norm_rel=0.02142568677663803, ref_abs_avg=43.56941604614258, test_abs_avg=43.52777862548828
production_forward2 grad[12] vs paper_forward: mean_abs=1.2582402229309082, max_abs=10.0, mean_rel=0.16852906346321106, max_rel=1332.0926513671875, norm_rel=0.024549460038542747, ref_abs_avg=51.5570068359375, test_abs_avg=51.559837341308594
production_forward2 grad[13] vs paper_forward: mean_abs=1.1583497524261475, max_abs=8.5, mean_rel=0.3305513262748718, max_rel=4625.0, norm_rel=0.022909685969352722, ref_abs_avg=50.81504821777344, test_abs_avg=50.81187438964844
production_forward2 grad[14] vs paper_forward: mean_abs=0.8997516632080078, max_abs=3.5, mean_rel=0.06519521027803421, max_rel=2.7423691749572754, norm_rel=0.02155700884759426, ref_abs_avg=41.694114685058594, test_abs_avg=41.70030212402344
production_forward2 grad[15] vs paper_forward: mean_abs=1.1700422763824463, max_abs=8.0, mean_rel=0.16108745336532593, max_rel=2374.623779296875, norm_rel=0.024379994720220566, ref_abs_avg=48.346282958984375, test_abs_avg=48.349308013916016
production_forward2 grad[16] vs paper_forward: mean_abs=1.076753854751587, max_abs=7.0, mean_rel=0.3376467525959015, max_rel=2999.999755859375, norm_rel=0.02266070432960987, ref_abs_avg=47.823753356933594, test_abs_avg=47.8272705078125
production_forward2 grad[17] vs paper_forward: mean_abs=0.8890109062194824, max_abs=3.5, mean_rel=0.1828838288784027, max_rel=28.697349548339844, norm_rel=0.02335549332201481, ref_abs_avg=37.93742370605469, test_abs_avg=38.0358772277832
production_forward2 grad[18] vs paper_forward: mean_abs=1.0983808040618896, max_abs=8.0, mean_rel=0.1619817018508911, max_rel=1907.9205322265625, norm_rel=0.024286402389407158, ref_abs_avg=45.48407745361328, test_abs_avg=45.48664093017578
production_forward2 grad[19] vs paper_forward: mean_abs=1.0106008052825928, max_abs=6.75, mean_rel=0.29403582215309143, max_rel=2781.249755859375, norm_rel=0.022544609382748604, ref_abs_avg=45.034305572509766, test_abs_avg=45.03453063964844
production_forward2 grad[20] vs paper_forward: mean_abs=0.798980712890625, max_abs=3.75, mean_rel=0.09144143760204315, max_rel=6.916845321655273, norm_rel=0.02250923216342926, ref_abs_avg=36.205623626708984, test_abs_avg=36.264400482177734
production_forward2 grad[21] vs paper_forward: mean_abs=1.046286940574646, max_abs=7.0, mean_rel=0.17035146057605743, max_rel=1936.7637939453125, norm_rel=0.024102916941046715, ref_abs_avg=43.67195510864258, test_abs_avg=43.67376708984375
production_forward2 grad[22] vs paper_forward: mean_abs=0.9648611545562744, max_abs=6.0625, mean_rel=0.3398778736591339, max_rel=2687.499755859375, norm_rel=0.022569259628653526, ref_abs_avg=42.91346740722656, test_abs_avg=42.91609191894531
production_forward2 grad[23] vs paper_forward: mean_abs=0.7690060138702393, max_abs=2.75, mean_rel=0.15103477239608765, max_rel=19.82798194885254, norm_rel=0.021032536402344704, ref_abs_avg=36.471858978271484, test_abs_avg=36.5107307434082
production_forward2 grad[24] vs paper_forward: mean_abs=0.9912945032119751, max_abs=7.0, mean_rel=0.16011951863765717, max_rel=1827.155029296875, norm_rel=0.02390131540596485, ref_abs_avg=41.73587417602539, test_abs_avg=41.73461151123047
production_forward2 grad[25] vs paper_forward: mean_abs=0.9097201824188232, max_abs=5.5, mean_rel=0.3206062912940979, max_rel=3374.999755859375, norm_rel=0.022280694916844368, ref_abs_avg=41.0641975402832, test_abs_avg=41.07093811035156
production_forward2 grad[26] vs paper_forward: mean_abs=0.9030065536499023, max_abs=3.75, mean_rel=0.23521281778812408, max_rel=43.15658187866211, norm_rel=0.02458861656486988, ref_abs_avg=36.72212600708008, test_abs_avg=36.766380310058594
production_forward2 grad[27] vs paper_forward: mean_abs=1.1557958126068115, max_abs=10.0, mean_rel=0.17445418238639832, max_rel=1413.1253662109375, norm_rel=0.02556985430419445, ref_abs_avg=45.42576599121094, test_abs_avg=45.426361083984375
production_forward2 grad[28] vs paper_forward: mean_abs=1.0707451105117798, max_abs=7.0, mean_rel=0.27908122539520264, max_rel=2812.499755859375, norm_rel=0.02410677634179592, ref_abs_avg=44.639556884765625, test_abs_avg=44.64912414550781
production_forward2 grad[29] vs paper_forward: mean_abs=0.8060169219970703, max_abs=3.912109375, mean_rel=0.09758526086807251, max_rel=9.030754089355469, norm_rel=0.024827871471643448, ref_abs_avg=33.39048385620117, test_abs_avg=33.41899490356445
production_forward2 grad[30] vs paper_forward: mean_abs=1.0551035404205322, max_abs=7.0, mean_rel=0.18180839717388153, max_rel=2118.829345703125, norm_rel=0.025918155908584595, ref_abs_avg=40.88751220703125, test_abs_avg=40.88973617553711
production_forward2 grad[31] vs paper_forward: mean_abs=0.992024302482605, max_abs=6.5, mean_rel=0.37073686718940735, max_rel=2812.499755859375, norm_rel=0.024625834077596664, ref_abs_avg=40.46000289916992, test_abs_avg=40.46954345703125
production_forward2 grad[32] vs paper_forward: mean_abs=0.768146276473999, max_abs=3.5, mean_rel=0.08936377614736557, max_rel=3.022719621658325, norm_rel=0.02527942880988121, ref_abs_avg=30.589160919189453, test_abs_avg=30.59789276123047
production_forward2 grad[33] vs paper_forward: mean_abs=0.9847206473350525, max_abs=7.0, mean_rel=0.17044824361801147, max_rel=1630.012939453125, norm_rel=0.02574225142598152, ref_abs_avg=38.41847229003906, test_abs_avg=38.416778564453125
production_forward2 grad[34] vs paper_forward: mean_abs=0.9180730581283569, max_abs=5.5, mean_rel=0.3212459087371826, max_rel=2749.999755859375, norm_rel=0.02418893575668335, ref_abs_avg=38.07161331176758, test_abs_avg=38.070777893066406
production_forward2 grad[35] vs paper_forward: mean_abs=0.7562274932861328, max_abs=3.0, mean_rel=0.0892530009150505, max_rel=4.847257614135742, norm_rel=0.02642994001507759, ref_abs_avg=29.086322784423828, test_abs_avg=29.065271377563477
production_forward2 grad[36] vs paper_forward: mean_abs=0.9206024408340454, max_abs=6.0, mean_rel=0.17104819416999817, max_rel=1254.68212890625, norm_rel=0.025590872392058372, ref_abs_avg=36.101707458496094, test_abs_avg=36.101806640625
production_forward2 grad[37] vs paper_forward: mean_abs=0.8625766038894653, max_abs=6.0, mean_rel=0.26953211426734924, max_rel=2093.75, norm_rel=0.024248499423265457, ref_abs_avg=35.689300537109375, test_abs_avg=35.691932678222656
production_forward2 grad[38] vs paper_forward: mean_abs=0.6899440288543701, max_abs=2.59375, mean_rel=0.12272660434246063, max_rel=5.56398868560791, norm_rel=0.025096971541643143, ref_abs_avg=28.06127166748047, test_abs_avg=28.161399841308594
production_forward2 grad[39] vs paper_forward: mean_abs=0.8758043050765991, max_abs=6.0, mean_rel=0.1630820333957672, max_rel=1929.0335693359375, norm_rel=0.025341516360640526, ref_abs_avg=34.66069793701172, test_abs_avg=34.662078857421875
production_forward2 grad[40] vs paper_forward: mean_abs=0.8125466108322144, max_abs=5.0, mean_rel=0.24216629564762115, max_rel=1749.9998779296875, norm_rel=0.02365100383758545, ref_abs_avg=34.3978271484375, test_abs_avg=34.39118194580078
production_forward2 grad[41] vs paper_forward: mean_abs=0.6239557266235352, max_abs=2.875, mean_rel=0.10226424038410187, max_rel=7.9950337409973145, norm_rel=0.023661315441131592, ref_abs_avg=26.794721603393555, test_abs_avg=26.838714599609375
production_forward2 grad[42] vs paper_forward: mean_abs=0.8248244524002075, max_abs=6.0, mean_rel=0.164412260055542, max_rel=1314.45166015625, norm_rel=0.025000298395752907, ref_abs_avg=33.082969665527344, test_abs_avg=33.0826301574707
production_forward2 grad[43] vs paper_forward: mean_abs=0.7681612372398376, max_abs=4.75, mean_rel=0.2693605124950409, max_rel=3468.749755859375, norm_rel=0.023611467331647873, ref_abs_avg=32.60271453857422, test_abs_avg=32.601036071777344
production_forward2 grad[44] vs paper_forward: mean_abs=0.6386961936950684, max_abs=2.77734375, mean_rel=0.13690684735774994, max_rel=23.316200256347656, norm_rel=0.025282621383666992, ref_abs_avg=25.15070152282715, test_abs_avg=25.154247283935547
production_forward2 grad[45] vs paper_forward: mean_abs=0.7840807437896729, max_abs=6.0, mean_rel=0.16108037531375885, max_rel=1114.1810302734375, norm_rel=0.024577340111136436, ref_abs_avg=31.993125915527344, test_abs_avg=31.99462127685547
production_forward2 grad[46] vs paper_forward: mean_abs=0.7305727601051331, max_abs=5.0, mean_rel=0.23422324657440186, max_rel=1624.9998779296875, norm_rel=0.023278629407286644, ref_abs_avg=31.466360092163086, test_abs_avg=31.473438262939453
production_forward2 grad[47] vs paper_forward: mean_abs=0.5829563140869141, max_abs=2.5234375, mean_rel=0.16087226569652557, max_rel=25.49040985107422, norm_rel=0.022454461082816124, ref_abs_avg=25.784523010253906, test_abs_avg=25.755395889282227
production_forward2 grad[48] vs paper_forward: mean_abs=0.7516865134239197, max_abs=5.5, mean_rel=0.15857931971549988, max_rel=1755.82666015625, norm_rel=0.024323072284460068, ref_abs_avg=30.94145393371582, test_abs_avg=30.942556381225586
production_forward2 grad[49] vs paper_forward: mean_abs=0.6999231576919556, max_abs=4.75, mean_rel=0.2849196791648865, max_rel=2624.999755859375, norm_rel=0.023061014711856842, ref_abs_avg=30.34742546081543, test_abs_avg=30.347688674926758
production_forward2 grad[50] vs paper_forward: mean_abs=0.6474189758300781, max_abs=2.75, mean_rel=0.0731356143951416, max_rel=4.740718364715576, norm_rel=0.023738062009215355, ref_abs_avg=27.057315826416016, test_abs_avg=27.026466369628906
production_forward2 grad[51] vs paper_forward: mean_abs=0.843794584274292, max_abs=7.0, mean_rel=0.17897328734397888, max_rel=2532.810791015625, norm_rel=0.026388658210635185, ref_abs_avg=32.102787017822266, test_abs_avg=32.10285186767578
production_forward2 grad[52] vs paper_forward: mean_abs=0.7874205112457275, max_abs=5.0, mean_rel=0.27850276231765747, max_rel=2312.5, norm_rel=0.025028038769960403, ref_abs_avg=31.540058135986328, test_abs_avg=31.538002014160156
production_forward2 grad[53] vs paper_forward: mean_abs=0.6002388000488281, max_abs=2.5, mean_rel=0.14569544792175293, max_rel=25.992494583129883, norm_rel=0.02582979016005993, ref_abs_avg=23.22336196899414, test_abs_avg=23.255691528320312
production_forward2 grad[54] vs paper_forward: mean_abs=0.7738048434257507, max_abs=5.0, mean_rel=0.16670538485050201, max_rel=1167.390625, norm_rel=0.025952879339456558, ref_abs_avg=29.8712158203125, test_abs_avg=29.87250518798828
production_forward2 grad[55] vs paper_forward: mean_abs=0.7252292037010193, max_abs=4.5, mean_rel=0.32199418544769287, max_rel=2531.25, norm_rel=0.024569828063249588, ref_abs_avg=29.563995361328125, test_abs_avg=29.57294273376465
production_forward2 grad[56] vs paper_forward: mean_abs=0.6137961149215698, max_abs=2.5, mean_rel=0.09060205519199371, max_rel=4.579669952392578, norm_rel=0.0263105146586895, ref_abs_avg=23.20798683166504, test_abs_avg=23.239511489868164
production_forward2 grad[57] vs paper_forward: mean_abs=0.7263238430023193, max_abs=6.0, mean_rel=0.16649498045444489, max_rel=939.0256958007812, norm_rel=0.025513723492622375, ref_abs_avg=28.515277862548828, test_abs_avg=28.52035903930664
production_forward2 grad[58] vs paper_forward: mean_abs=0.6706212162971497, max_abs=4.125, mean_rel=0.29346489906311035, max_rel=2187.5, norm_rel=0.024001074954867363, ref_abs_avg=27.981082916259766, test_abs_avg=27.986604690551758
production_forward2 grad[59] vs paper_forward: mean_abs=0.5304298400878906, max_abs=2.0, mean_rel=0.08159893751144409, max_rel=8.286609649658203, norm_rel=0.022298302501440048, ref_abs_avg=23.95697784423828, test_abs_avg=23.933399200439453
production_forward2 grad[60] vs paper_forward: mean_abs=0.6788558959960938, max_abs=5.0, mean_rel=0.15362197160720825, max_rel=1097.626220703125, norm_rel=0.02502402849495411, ref_abs_avg=27.147167205810547, test_abs_avg=27.145992279052734
production_forward2 grad[61] vs paper_forward: mean_abs=0.6294646859169006, max_abs=4.5, mean_rel=0.2539452016353607, max_rel=1999.9998779296875, norm_rel=0.023870252072811127, ref_abs_avg=26.440204620361328, test_abs_avg=26.43460464477539
production_forward2 grad[62] vs paper_forward: mean_abs=0.49500560760498047, max_abs=2.0625, mean_rel=0.12791705131530762, max_rel=17.73379898071289, norm_rel=0.02467712014913559, ref_abs_avg=20.250507354736328, test_abs_avg=20.243144989013672
production_forward2 grad[63] vs paper_forward: mean_abs=0.6321080923080444, max_abs=5.5, mean_rel=0.1526007056236267, max_rel=1059.835693359375, norm_rel=0.024786336347460747, ref_abs_avg=25.521678924560547, test_abs_avg=25.520526885986328
production_forward2 grad[64] vs paper_forward: mean_abs=0.5893478393554688, max_abs=4.375, mean_rel=0.22343406081199646, max_rel=1718.7498779296875, norm_rel=0.023260176181793213, ref_abs_avg=25.358774185180664, test_abs_avg=25.362163543701172
production_forward2 grad[65] vs paper_forward: mean_abs=0.4367401599884033, max_abs=1.75, mean_rel=0.12445999681949615, max_rel=32.50165557861328, norm_rel=0.02137076109647751, ref_abs_avg=21.105972290039062, test_abs_avg=21.141742706298828
production_forward2 grad[66] vs paper_forward: mean_abs=0.6042413711547852, max_abs=4.5, mean_rel=0.15322156250476837, max_rel=1272.121826171875, norm_rel=0.024174217134714127, ref_abs_avg=24.988842010498047, test_abs_avg=24.99030113220215
production_forward2 grad[67] vs paper_forward: mean_abs=0.5554693341255188, max_abs=3.75, mean_rel=0.23526716232299805, max_rel=1374.9998779296875, norm_rel=0.02296867035329342, ref_abs_avg=24.201351165771484, test_abs_avg=24.204607009887695
production_forward2 grad[68] vs paper_forward: mean_abs=0.4569563865661621, max_abs=1.75, mean_rel=0.09680028259754181, max_rel=6.665143013000488, norm_rel=0.023185839876532555, ref_abs_avg=19.37499237060547, test_abs_avg=19.36512565612793
production_forward2 grad[69] vs paper_forward: mean_abs=0.5720822811126709, max_abs=6.0, mean_rel=0.15480270981788635, max_rel=1386.4169921875, norm_rel=0.02394868992269039, ref_abs_avg=23.88779640197754, test_abs_avg=23.889545440673828
production_forward2 grad[70] vs paper_forward: mean_abs=0.5300226807594299, max_abs=4.5, mean_rel=0.22372233867645264, max_rel=1749.9998779296875, norm_rel=0.022449182346463203, ref_abs_avg=23.635971069335938, test_abs_avg=23.63572883605957
production_forward2 grad[71] vs paper_forward: mean_abs=0.43300342559814453, max_abs=1.5, mean_rel=0.11485762149095535, max_rel=27.025741577148438, norm_rel=0.021518204361200333, ref_abs_avg=19.75170135498047, test_abs_avg=19.720531463623047
production_forward2 grad[72] vs paper_forward: mean_abs=0.5474525094032288, max_abs=4.5, mean_rel=0.1508297622203827, max_rel=853.7799682617188, norm_rel=0.023368511348962784, ref_abs_avg=23.364566802978516, test_abs_avg=23.365131378173828
production_forward2 grad[73] vs paper_forward: mean_abs=0.49914342164993286, max_abs=4.0, mean_rel=0.240987166762352, max_rel=1656.2498779296875, norm_rel=0.022400055080652237, ref_abs_avg=22.318124771118164, test_abs_avg=22.323087692260742
production_forward2 grad[74] vs paper_forward: mean_abs=0.4398179054260254, max_abs=1.6875, mean_rel=0.14608527719974518, max_rel=17.502161026000977, norm_rel=0.021991020068526268, ref_abs_avg=19.587509155273438, test_abs_avg=19.534137725830078
production_forward2 grad[75] vs paper_forward: mean_abs=0.5907162427902222, max_abs=4.75, mean_rel=0.1608891487121582, max_rel=1040.2286376953125, norm_rel=0.024823859333992004, ref_abs_avg=23.83563232421875, test_abs_avg=23.837909698486328
production_forward2 grad[76] vs paper_forward: mean_abs=0.5560286045074463, max_abs=4.09375, mean_rel=0.2291560024023056, max_rel=1781.2498779296875, norm_rel=0.023367665708065033, ref_abs_avg=23.892593383789062, test_abs_avg=23.89342498779297
production_forward2 grad[77] vs paper_forward: mean_abs=0.43558788299560547, max_abs=1.9375, mean_rel=0.11621802300214767, max_rel=8.832759857177734, norm_rel=0.023133601993322372, ref_abs_avg=18.711082458496094, test_abs_avg=18.72931671142578
production_forward2 grad[78] vs paper_forward: mean_abs=0.5485280752182007, max_abs=5.0, mean_rel=0.14876125752925873, max_rel=954.6997680664062, norm_rel=0.024303095415234566, ref_abs_avg=22.59975242614746, test_abs_avg=22.600444793701172
production_forward2 grad[79] vs paper_forward: mean_abs=0.5061594247817993, max_abs=4.5, mean_rel=0.2813813388347626, max_rel=1999.9998779296875, norm_rel=0.022673992440104485, ref_abs_avg=22.33602523803711, test_abs_avg=22.33632469177246
production_forward2 grad[80] vs paper_forward: mean_abs=0.42137765884399414, max_abs=2.046875, mean_rel=0.15002888441085815, max_rel=33.281715393066406, norm_rel=0.02209617756307125, ref_abs_avg=19.310749053955078, test_abs_avg=19.28689956665039
production_forward2 grad[81] vs paper_forward: mean_abs=0.5144051313400269, max_abs=6.0, mean_rel=0.1437763273715973, max_rel=1116.9249267578125, norm_rel=0.023563524708151817, ref_abs_avg=21.882003784179688, test_abs_avg=21.882633209228516
production_forward2 grad[82] vs paper_forward: mean_abs=0.47148627042770386, max_abs=3.9375, mean_rel=0.2060556709766388, max_rel=1453.1248779296875, norm_rel=0.021621234714984894, ref_abs_avg=21.852340698242188, test_abs_avg=21.86646270751953
production_forward2 grad[83] vs paper_forward: mean_abs=0.3579418659210205, max_abs=1.75, mean_rel=0.20338580012321472, max_rel=35.39742660522461, norm_rel=0.021583257243037224, ref_abs_avg=16.71011734008789, test_abs_avg=16.746601104736328
production_forward2 grad[84] vs paper_forward: mean_abs=0.47607216238975525, max_abs=4.5, mean_rel=0.14455947279930115, max_rel=1003.9561767578125, norm_rel=0.023149168118834496, ref_abs_avg=20.601329803466797, test_abs_avg=20.602079391479492
production_forward2 grad[85] vs paper_forward: mean_abs=0.4444824755191803, max_abs=3.65625, mean_rel=0.19736911356449127, max_rel=1656.2498779296875, norm_rel=0.021557293832302094, ref_abs_avg=20.669601440429688, test_abs_avg=20.669965744018555
production_forward2 grad[86] vs paper_forward: mean_abs=0.3682372570037842, max_abs=1.3125, mean_rel=0.1901179552078247, max_rel=27.976160049438477, norm_rel=0.021314384415745735, ref_abs_avg=17.264392852783203, test_abs_avg=17.259933471679688
production_forward2 grad[87] vs paper_forward: mean_abs=0.45361238718032837, max_abs=5.0, mean_rel=0.14119139313697815, max_rel=509.0126647949219, norm_rel=0.022402511909604073, ref_abs_avg=20.346012115478516, test_abs_avg=20.34699058532715
production_forward2 grad[88] vs paper_forward: mean_abs=0.41285189986228943, max_abs=4.0, mean_rel=0.1894039362668991, max_rel=1125.0, norm_rel=0.020778581500053406, ref_abs_avg=20.047954559326172, test_abs_avg=20.04821014404297
production_forward2 grad[89] vs paper_forward: mean_abs=0.33487749099731445, max_abs=1.25, mean_rel=0.13597679138183594, max_rel=12.044665336608887, norm_rel=0.019453441724181175, ref_abs_avg=17.45874786376953, test_abs_avg=17.48635482788086
production_forward2 grad[90] vs paper_forward: mean_abs=0.4216790795326233, max_abs=4.5, mean_rel=0.13769865036010742, max_rel=1174.240966796875, norm_rel=0.021787971258163452, ref_abs_avg=19.525564193725586, test_abs_avg=19.52503204345703
production_forward2 grad[91] vs paper_forward: mean_abs=0.38682639598846436, max_abs=3.5, mean_rel=0.18417346477508545, max_rel=1531.2498779296875, norm_rel=0.020161280408501625, ref_abs_avg=19.325223922729492, test_abs_avg=19.332759857177734
production_forward2 grad[92] vs paper_forward: mean_abs=0.32056713104248047, max_abs=1.1875, mean_rel=0.054565928876399994, max_rel=0.9963562488555908, norm_rel=0.019995415583252907, ref_abs_avg=15.932411193847656, test_abs_avg=15.923909187316895
production_forward2 grad[93] vs paper_forward: mean_abs=0.39102715253829956, max_abs=3.75, mean_rel=0.1364191323518753, max_rel=1372.5406494140625, norm_rel=0.021435895934700966, ref_abs_avg=18.458660125732422, test_abs_avg=18.45987319946289
production_forward2 grad[94] vs paper_forward: mean_abs=0.36510494351387024, max_abs=3.875, mean_rel=0.18601053953170776, max_rel=2375.0, norm_rel=0.019854098558425903, ref_abs_avg=18.65676498413086, test_abs_avg=18.661407470703125
production_forward2 grad[95] vs paper_forward: mean_abs=0.28230148553848267, max_abs=1.25, mean_rel=0.14400561153888702, max_rel=27.254241943359375, norm_rel=0.018662648275494576, ref_abs_avg=15.893695831298828, test_abs_avg=15.88918399810791
production_forward2 grad[96] vs paper_forward: mean_abs=0.38709673285484314, max_abs=4.5, mean_rel=0.13378648459911346, max_rel=891.5759887695312, norm_rel=0.021218597888946533, ref_abs_avg=18.537742614746094, test_abs_avg=18.53917694091797
production_forward2 grad[97] vs paper_forward: mean_abs=0.3527044653892517, max_abs=4.0, mean_rel=0.16790977120399475, max_rel=1203.125, norm_rel=0.0192963145673275, ref_abs_avg=18.47671127319336, test_abs_avg=18.483076095581055
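The per-tensor statistics printed above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) can be reproduced with a small helper. The exact zero-division handling the real script uses is not visible in the log, so the formulas below are assumptions; `compare` and its inputs are illustrative only:

```python
import math

def compare(ref, test):
    """Hypothetical reconstruction of the comparison statistics in the log.

    ref/test are flat lists of floats (e.g. flattened gradient tensors);
    entries with a zero reference are skipped in the relative metrics,
    which is an assumed convention.
    """
    diff = [abs(r - t) for r, t in zip(ref, test)]
    n = len(diff)
    rel = [d / abs(r) for d, r in zip(diff, ref) if r != 0]
    return {
        "mean_abs": sum(diff) / n,
        "max_abs": max(diff),
        "mean_rel": sum(rel) / n,
        "max_rel": max(rel, default=0.0),
        # ||ref - test|| / ||ref||: a single-number summary that is not
        # dominated by near-zero reference entries
        "norm_rel": math.sqrt(sum(d * d for d in diff))
                    / math.sqrt(sum(r * r for r in ref)),
        "ref_abs_avg": sum(abs(r) for r in ref) / n,
        "test_abs_avg": sum(abs(t) for t in test) / n,
    }

m = compare([1.0, 2.0, 4.0], [1.0, 2.5, 3.0])
# m["mean_abs"] == 0.5, m["max_abs"] == 1.0
```

Note that the large max_rel spikes in the log (e.g. 4625.0) most likely come from reference gradients that are themselves near zero, where the element-wise relative error blows up; the uniformly small norm_rel (~0.02) is the more meaningful agreement signal.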
identity layers + randn queries
production_forward fwd+bwd:  113.730 ms
production_forward bwd-only: 96.121 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=10.118 GiB
production_forward peak reserved:  fwd=3.637 GiB, fwd+bwd=12.637 GiB
paper_forward fwd+bwd:  384.908 ms
paper_forward bwd-only: 304.763 ms
paper_forward peak allocated: fwd=30.002 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.057 GiB, fwd+bwd=32.807 GiB
production_forward2 fwd+bwd:  191.666 ms
production_forward2 bwd-only: 172.549 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.262 GiB, fwd+bwd=9.012 GiB
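The timing and peak-memory lines above follow a standard warmup-then-average pattern. A minimal CPU stand-in is sketched below; `bench_ms` and `fwd_bwd` are hypothetical names, and the actual GPU measurements would additionally need `torch.cuda.synchronize()` around the timed region, with the "peak allocated/reserved" figures read from `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()` after `torch.cuda.reset_peak_memory_stats()`:

```python
import time

def bench_ms(fn, *args, warmup=3, iters=10):
    """Wall-clock timing of a callable, reported as ms per call.

    On GPU the timed region must be bracketed by torch.cuda.synchronize()
    (or measured with CUDA events), since kernel launches are asynchronous.
    """
    for _ in range(warmup):      # warm up caches / compilation before timing
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e3

def fwd_bwd(n):                  # hypothetical stand-in workload
    return sum(i * i for i in range(n))

ms = bench_ms(fwd_bwd, 10_000)
```

The separate "bwd-only" figures in the log are presumably derived from the fwd+bwd and forward-only timings of the same runs.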

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016692555509507656, max_abs=0.052734375
production_forward grad[0] vs paper_forward: mean_abs=0.008367432281374931, max_abs=0.34375, mean_rel=0.07217346131801605, max_rel=103.3956069946289, norm_rel=0.019633600488305092, ref_abs_avg=0.4606306850910187, test_abs_avg=0.460652619600296
production_forward grad[1] vs paper_forward: mean_abs=7.27675199508667, max_abs=56.0, mean_rel=0.13126003742218018, max_rel=98.1521224975586, norm_rel=0.02042669802904129, ref_abs_avg=319.6033630371094, test_abs_avg=319.5823974609375
production_forward grad[2] vs paper_forward: mean_abs=1.3036117553710938, max_abs=4.1875, mean_rel=0.08459466695785522, max_rel=2.966717004776001, norm_rel=0.02455769293010235, ref_abs_avg=52.78914260864258, test_abs_avg=52.877872467041016
production_forward grad[3] vs paper_forward: mean_abs=1.5951943397521973, max_abs=11.0, mean_rel=0.16615377366542816, max_rel=1416.141357421875, norm_rel=0.024338360875844955, ref_abs_avg=65.97113037109375, test_abs_avg=65.9808349609375
production_forward grad[4] vs paper_forward: mean_abs=1.4724993705749512, max_abs=9.5, mean_rel=0.3520490229129791, max_rel=4125.0, norm_rel=0.022852206602692604, ref_abs_avg=64.75505065917969, test_abs_avg=64.76541137695312
production_forward grad[5] vs paper_forward: mean_abs=1.0742712020874023, max_abs=4.5, mean_rel=0.11561573296785355, max_rel=25.187808990478516, norm_rel=0.022249849513173103, ref_abs_avg=49.014137268066406, test_abs_avg=48.99629211425781
production_forward grad[6] vs paper_forward: mean_abs=1.3870548009872437, max_abs=10.0, mean_rel=0.1614280343055725, max_rel=1755.18212890625, norm_rel=0.023989154025912285, ref_abs_avg=58.23811340332031, test_abs_avg=58.242774963378906
production_forward grad[7] vs paper_forward: mean_abs=1.2702856063842773, max_abs=7.5, mean_rel=0.37514111399650574, max_rel=3312.499755859375, norm_rel=0.02233201265335083, ref_abs_avg=57.188453674316406, test_abs_avg=57.20013427734375
production_forward grad[8] vs paper_forward: mean_abs=0.9349365234375, max_abs=5.0, mean_rel=0.07738332450389862, max_rel=6.981945991516113, norm_rel=0.021587662398815155, ref_abs_avg=43.72166442871094, test_abs_avg=43.672271728515625
production_forward grad[9] vs paper_forward: mean_abs=1.262728214263916, max_abs=8.0, mean_rel=0.16725683212280273, max_rel=1722.9189453125, norm_rel=0.023881372064352036, ref_abs_avg=53.212158203125, test_abs_avg=53.21460723876953
production_forward grad[10] vs paper_forward: mean_abs=1.167963981628418, max_abs=7.0, mean_rel=0.3591291606426239, max_rel=2999.999755859375, norm_rel=0.02225283347070217, ref_abs_avg=52.84107208251953, test_abs_avg=52.85026168823242
production_forward grad[11] vs paper_forward: mean_abs=0.9813122749328613, max_abs=3.75, mean_rel=0.19700965285301208, max_rel=20.406070709228516, norm_rel=0.024674300104379654, ref_abs_avg=39.578834533691406, test_abs_avg=39.6586799621582
production_forward grad[12] vs paper_forward: mean_abs=1.1703970432281494, max_abs=9.0, mean_rel=0.1552598774433136, max_rel=1746.923583984375, norm_rel=0.0237021092325449, ref_abs_avg=49.71336364746094, test_abs_avg=49.71857833862305
production_forward grad[13] vs paper_forward: mean_abs=1.0850050449371338, max_abs=6.0, mean_rel=0.3020724654197693, max_rel=3749.999755859375, norm_rel=0.022218747064471245, ref_abs_avg=49.12849807739258, test_abs_avg=49.13011932373047
production_forward grad[14] vs paper_forward: mean_abs=0.8324699401855469, max_abs=3.5, mean_rel=0.11858099699020386, max_rel=8.254061698913574, norm_rel=0.021851802244782448, ref_abs_avg=38.73619842529297, test_abs_avg=38.65831756591797
production_forward grad[15] vs paper_forward: mean_abs=1.1020687818527222, max_abs=7.0, mean_rel=0.1549317091703415, max_rel=1474.2550048828125, norm_rel=0.02359911985695362, ref_abs_avg=47.03253936767578, test_abs_avg=47.03474044799805
production_forward grad[16] vs paper_forward: mean_abs=1.0099802017211914, max_abs=6.5625, mean_rel=0.2994152903556824, max_rel=3124.999755859375, norm_rel=0.02181820571422577, ref_abs_avg=46.55182647705078, test_abs_avg=46.55499267578125
production_forward grad[17] vs paper_forward: mean_abs=0.8240089416503906, max_abs=3.0625, mean_rel=0.1353454887866974, max_rel=14.114784240722656, norm_rel=0.023911595344543457, ref_abs_avg=33.786338806152344, test_abs_avg=33.752532958984375
production_forward grad[18] vs paper_forward: mean_abs=1.0358197689056396, max_abs=8.0, mean_rel=0.15634506940841675, max_rel=1149.5037841796875, norm_rel=0.023436279967427254, ref_abs_avg=44.45613098144531, test_abs_avg=44.458343505859375
production_forward grad[19] vs paper_forward: mean_abs=0.9451475143432617, max_abs=5.75, mean_rel=0.34001511335372925, max_rel=3062.499755859375, norm_rel=0.021698782220482826, ref_abs_avg=43.82069778442383, test_abs_avg=43.81959915161133
production_forward grad[20] vs paper_forward: mean_abs=0.799168586730957, max_abs=3.5, mean_rel=0.09388105571269989, max_rel=4.798721790313721, norm_rel=0.0245287474244833, ref_abs_avg=32.54723358154297, test_abs_avg=32.52356719970703
production_forward grad[21] vs paper_forward: mean_abs=0.9816886782646179, max_abs=8.0, mean_rel=0.15490713715553284, max_rel=2333.427490234375, norm_rel=0.023343320935964584, ref_abs_avg=42.31281661987305, test_abs_avg=42.317569732666016
production_forward grad[22] vs paper_forward: mean_abs=0.9050906300544739, max_abs=6.0, mean_rel=0.2600560784339905, max_rel=2953.124755859375, norm_rel=0.021828191354870796, ref_abs_avg=41.60742950439453, test_abs_avg=41.61028289794922
production_forward grad[23] vs paper_forward: mean_abs=0.7361173629760742, max_abs=3.0, mean_rel=0.13757917284965515, max_rel=13.849886894226074, norm_rel=0.023208225145936012, ref_abs_avg=32.999855041503906, test_abs_avg=32.90528106689453
production_forward grad[24] vs paper_forward: mean_abs=0.9267005920410156, max_abs=6.0, mean_rel=0.14670516550540924, max_rel=998.0828857421875, norm_rel=0.023152688518166542, ref_abs_avg=40.29560089111328, test_abs_avg=40.29788589477539
production_forward grad[25] vs paper_forward: mean_abs=0.853918194770813, max_abs=5.65625, mean_rel=0.2857879400253296, max_rel=2812.499755859375, norm_rel=0.021721595898270607, ref_abs_avg=39.533748626708984, test_abs_avg=39.53830337524414
production_forward grad[26] vs paper_forward: mean_abs=0.8316001892089844, max_abs=3.5, mean_rel=0.12791648507118225, max_rel=8.390947341918945, norm_rel=0.024165021255612373, ref_abs_avg=34.29248809814453, test_abs_avg=34.29423904418945
production_forward grad[27] vs paper_forward: mean_abs=1.0740432739257812, max_abs=7.0, mean_rel=0.15920911729335785, max_rel=2431.278076171875, norm_rel=0.0249284990131855, ref_abs_avg=43.31253433227539, test_abs_avg=43.31118392944336
production_forward grad[28] vs paper_forward: mean_abs=0.9873768091201782, max_abs=6.75, mean_rel=0.3240997791290283, max_rel=3562.499755859375, norm_rel=0.023172985762357712, ref_abs_avg=42.77947998046875, test_abs_avg=42.78002166748047
production_forward grad[29] vs paper_forward: mean_abs=0.7872066497802734, max_abs=2.875, mean_rel=0.12341852486133575, max_rel=12.343315124511719, norm_rel=0.023968903347849846, ref_abs_avg=33.19596862792969, test_abs_avg=33.22032928466797
production_forward grad[30] vs paper_forward: mean_abs=1.0017739534378052, max_abs=7.0, mean_rel=0.1595573127269745, max_rel=942.3499755859375, norm_rel=0.025356709957122803, ref_abs_avg=39.698577880859375, test_abs_avg=39.700584411621094
production_forward grad[31] vs paper_forward: mean_abs=0.929305911064148, max_abs=5.75, mean_rel=0.3475557267665863, max_rel=2437.5, norm_rel=0.023914137855172157, ref_abs_avg=39.02936553955078, test_abs_avg=39.0262336730957
production_forward grad[32] vs paper_forward: mean_abs=0.7724370956420898, max_abs=2.75, mean_rel=0.12880593538284302, max_rel=9.35562515258789, norm_rel=0.024278294295072556, ref_abs_avg=31.965587615966797, test_abs_avg=32.027252197265625
production_forward grad[33] vs paper_forward: mean_abs=0.9448290467262268, max_abs=7.0, mean_rel=0.16373947262763977, max_rel=1159.709228515625, norm_rel=0.02527298405766487, ref_abs_avg=37.56528091430664, test_abs_avg=37.56704330444336
production_forward grad[34] vs paper_forward: mean_abs=0.8662083148956299, max_abs=5.3125, mean_rel=0.2817445993423462, max_rel=2812.499755859375, norm_rel=0.023667797446250916, ref_abs_avg=36.67671203613281, test_abs_avg=36.67948913574219
production_forward grad[35] vs paper_forward: mean_abs=0.7067661285400391, max_abs=2.875, mean_rel=0.10007220506668091, max_rel=9.1643705368042, norm_rel=0.024357469752430916, ref_abs_avg=28.652847290039062, test_abs_avg=28.628032684326172
production_forward grad[36] vs paper_forward: mean_abs=0.8754832744598389, max_abs=6.5, mean_rel=0.17332248389720917, max_rel=2329.81787109375, norm_rel=0.02502809464931488, ref_abs_avg=35.08371353149414, test_abs_avg=35.08281707763672
production_forward grad[37] vs paper_forward: mean_abs=0.8079229593276978, max_abs=4.90625, mean_rel=0.29674965143203735, max_rel=2624.999755859375, norm_rel=0.023622902110219002, ref_abs_avg=34.33649444580078, test_abs_avg=34.33318328857422
production_forward grad[38] vs paper_forward: mean_abs=0.6240229606628418, max_abs=2.625, mean_rel=0.17912942171096802, max_rel=35.3676643371582, norm_rel=0.02242489717900753, ref_abs_avg=28.147232055664062, test_abs_avg=28.15011215209961
production_forward grad[39] vs paper_forward: mean_abs=0.821725606918335, max_abs=6.0, mean_rel=0.17102447152137756, max_rel=1519.3624267578125, norm_rel=0.02476091869175434, ref_abs_avg=33.32065963745117, test_abs_avg=33.32201385498047
production_forward grad[40] vs paper_forward: mean_abs=0.7643351554870605, max_abs=4.625, mean_rel=0.2851673364639282, max_rel=2718.749755859375, norm_rel=0.023233335465192795, ref_abs_avg=33.015804290771484, test_abs_avg=33.01616668701172
production_forward grad[41] vs paper_forward: mean_abs=0.6038637161254883, max_abs=2.5, mean_rel=0.08896610140800476, max_rel=3.0772182941436768, norm_rel=0.0225068312138319, ref_abs_avg=25.899127960205078, test_abs_avg=25.898414611816406
production_forward grad[42] vs paper_forward: mean_abs=0.7804937362670898, max_abs=5.5, mean_rel=0.15468965470790863, max_rel=1168.359619140625, norm_rel=0.02454676479101181, ref_abs_avg=31.958797454833984, test_abs_avg=31.95785903930664
production_forward grad[43] vs paper_forward: mean_abs=0.7273859977722168, max_abs=5.0625, mean_rel=0.24183014035224915, max_rel=2015.6248779296875, norm_rel=0.022938430309295654, ref_abs_avg=31.803184509277344, test_abs_avg=31.798049926757812
production_forward grad[44] vs paper_forward: mean_abs=0.5613698959350586, max_abs=2.25, mean_rel=0.14560121297836304, max_rel=25.59054946899414, norm_rel=0.023197857663035393, ref_abs_avg=24.671966552734375, test_abs_avg=24.684627532958984
production_forward grad[45] vs paper_forward: mean_abs=0.7450909614562988, max_abs=6.0, mean_rel=0.1636943817138672, max_rel=2303.89111328125, norm_rel=0.024332376196980476, ref_abs_avg=30.715974807739258, test_abs_avg=30.717178344726562
production_forward grad[46] vs paper_forward: mean_abs=0.6915737390518188, max_abs=4.1875, mean_rel=0.22521334886550903, max_rel=1937.4998779296875, norm_rel=0.02260349690914154, ref_abs_avg=30.648426055908203, test_abs_avg=30.64870262145996
production_forward grad[47] vs paper_forward: mean_abs=0.5099716186523438, max_abs=2.5625, mean_rel=0.11244945973157883, max_rel=18.69729232788086, norm_rel=0.02122340351343155, ref_abs_avg=24.47433090209961, test_abs_avg=24.478591918945312
production_forward grad[48] vs paper_forward: mean_abs=0.7119970917701721, max_abs=5.0, mean_rel=0.15479323267936707, max_rel=1293.0745849609375, norm_rel=0.0240995604544878, ref_abs_avg=29.658885955810547, test_abs_avg=29.6597900390625
production_forward grad[49] vs paper_forward: mean_abs=0.662859320640564, max_abs=4.03125, mean_rel=0.2553595304489136, max_rel=1937.4998779296875, norm_rel=0.02240416407585144, ref_abs_avg=29.640748977661133, test_abs_avg=29.6475830078125
production_forward grad[50] vs paper_forward: mean_abs=0.5974749326705933, max_abs=2.375, mean_rel=0.07523833960294724, max_rel=3.2642862796783447, norm_rel=0.024325203150510788, ref_abs_avg=25.495983123779297, test_abs_avg=25.45309066772461
production_forward grad[51] vs paper_forward: mean_abs=0.7881186008453369, max_abs=5.5, mean_rel=0.16592025756835938, max_rel=977.0147094726562, norm_rel=0.0248404610902071, ref_abs_avg=31.79586410522461, test_abs_avg=31.797466278076172
production_forward grad[52] vs paper_forward: mean_abs=0.7361429929733276, max_abs=4.75, mean_rel=0.28365224599838257, max_rel=2125.0, norm_rel=0.023631669580936432, ref_abs_avg=31.224498748779297, test_abs_avg=31.237104415893555
production_forward grad[53] vs paper_forward: mean_abs=0.5577181577682495, max_abs=2.3125, mean_rel=0.11642120778560638, max_rel=14.638978004455566, norm_rel=0.022379226982593536, ref_abs_avg=25.197498321533203, test_abs_avg=25.240604400634766
production_forward grad[54] vs paper_forward: mean_abs=0.7386075854301453, max_abs=6.5, mean_rel=0.15548020601272583, max_rel=1835.102783203125, norm_rel=0.024537626653909683, ref_abs_avg=30.17728042602539, test_abs_avg=30.18027114868164
production_forward grad[55] vs paper_forward: mean_abs=0.6774354577064514, max_abs=4.5, mean_rel=0.22700247168540955, max_rel=2718.749755859375, norm_rel=0.02313322015106678, ref_abs_avg=29.351694107055664, test_abs_avg=29.361400604248047
production_forward grad[56] vs paper_forward: mean_abs=0.5742430686950684, max_abs=2.125, mean_rel=0.12698334455490112, max_rel=21.06427764892578, norm_rel=0.02635510265827179, ref_abs_avg=21.987037658691406, test_abs_avg=21.990966796875
production_forward grad[57] vs paper_forward: mean_abs=0.6886683702468872, max_abs=5.0, mean_rel=0.1684725135564804, max_rel=1652.1661376953125, norm_rel=0.02445092424750328, ref_abs_avg=28.205718994140625, test_abs_avg=28.208351135253906
production_forward grad[58] vs paper_forward: mean_abs=0.6400173902511597, max_abs=4.0, mean_rel=0.22924405336380005, max_rel=1874.9998779296875, norm_rel=0.02285061404109001, ref_abs_avg=28.061603546142578, test_abs_avg=28.06647491455078
production_forward grad[59] vs paper_forward: mean_abs=0.4971351623535156, max_abs=2.125, mean_rel=0.14223149418830872, max_rel=40.07985305786133, norm_rel=0.02173045091331005, ref_abs_avg=22.85171890258789, test_abs_avg=22.80966567993164
production_forward grad[60] vs paper_forward: mean_abs=0.6421754956245422, max_abs=5.0, mean_rel=0.14553366601467133, max_rel=664.2814331054688, norm_rel=0.02410311810672283, ref_abs_avg=26.66686248779297, test_abs_avg=26.670928955078125
production_forward grad[61] vs paper_forward: mean_abs=0.6030222177505493, max_abs=4.0, mean_rel=0.22966593503952026, max_rel=2187.5, norm_rel=0.022879479452967644, ref_abs_avg=26.419246673583984, test_abs_avg=26.417932510375977
production_forward grad[62] vs paper_forward: mean_abs=0.48526859283447266, max_abs=1.8125, mean_rel=0.10328693687915802, max_rel=7.419461727142334, norm_rel=0.023160971701145172, ref_abs_avg=21.308061599731445, test_abs_avg=21.303726196289062
production_forward grad[63] vs paper_forward: mean_abs=0.6097719669342041, max_abs=5.0, mean_rel=0.16105437278747559, max_rel=1590.0321044921875, norm_rel=0.023539120331406593, ref_abs_avg=25.90411376953125, test_abs_avg=25.904014587402344
production_forward grad[64] vs paper_forward: mean_abs=0.566797137260437, max_abs=4.0625, mean_rel=0.22801178693771362, max_rel=2749.999755859375, norm_rel=0.022439803928136826, ref_abs_avg=25.32202911376953, test_abs_avg=25.323301315307617
production_forward grad[65] vs paper_forward: mean_abs=0.42422914505004883, max_abs=1.875, mean_rel=0.08983872830867767, max_rel=17.79216766357422, norm_rel=0.02012506127357483, ref_abs_avg=21.65802001953125, test_abs_avg=21.625873565673828
production_forward grad[66] vs paper_forward: mean_abs=0.5745663046836853, max_abs=4.375, mean_rel=0.1482854038476944, max_rel=737.7196655273438, norm_rel=0.023396963253617287, ref_abs_avg=24.60637664794922, test_abs_avg=24.606876373291016
production_forward grad[67] vs paper_forward: mean_abs=0.5352972149848938, max_abs=3.5, mean_rel=0.20695066452026367, max_rel=1289.0623779296875, norm_rel=0.021625231951475143, ref_abs_avg=24.73621368408203, test_abs_avg=24.74117660522461
production_forward grad[68] vs paper_forward: mean_abs=0.4402613639831543, max_abs=1.875, mean_rel=0.13012373447418213, max_rel=13.042980194091797, norm_rel=0.021467743441462517, ref_abs_avg=21.348648071289062, test_abs_avg=21.350622177124023
production_forward grad[69] vs paper_forward: mean_abs=0.5546700358390808, max_abs=5.0, mean_rel=0.14569103717803955, max_rel=854.8538208007812, norm_rel=0.023094266653060913, ref_abs_avg=24.007537841796875, test_abs_avg=24.01087760925293
production_forward grad[70] vs paper_forward: mean_abs=0.5107940435409546, max_abs=4.25, mean_rel=0.22391003370285034, max_rel=2218.75, norm_rel=0.021702388301491737, ref_abs_avg=23.518917083740234, test_abs_avg=23.513200759887695
production_forward grad[71] vs paper_forward: mean_abs=0.3844120502471924, max_abs=1.75, mean_rel=0.2851294279098511, max_rel=49.70917510986328, norm_rel=0.021721545606851578, ref_abs_avg=18.24908447265625, test_abs_avg=18.249727249145508
production_forward grad[72] vs paper_forward: mean_abs=0.5256432294845581, max_abs=6.0, mean_rel=0.14169223606586456, max_rel=1021.59423828125, norm_rel=0.02270633541047573, ref_abs_avg=23.179424285888672, test_abs_avg=23.18313980102539
production_forward grad[73] vs paper_forward: mean_abs=0.4860668182373047, max_abs=3.5, mean_rel=0.2034997195005417, max_rel=1562.4998779296875, norm_rel=0.021066725254058838, ref_abs_avg=23.061370849609375, test_abs_avg=23.0662899017334
production_forward grad[74] vs paper_forward: mean_abs=0.45488595962524414, max_abs=1.828125, mean_rel=0.08879904448986053, max_rel=12.076525688171387, norm_rel=0.02309969998896122, ref_abs_avg=19.45409393310547, test_abs_avg=19.44952964782715
production_forward grad[75] vs paper_forward: mean_abs=0.5770856142044067, max_abs=4.5, mean_rel=0.14796070754528046, max_rel=762.0631713867188, norm_rel=0.02434203214943409, ref_abs_avg=23.782541275024414, test_abs_avg=23.781173706054688
production_forward grad[76] vs paper_forward: mean_abs=0.5343586206436157, max_abs=3.609375, mean_rel=0.20325931906700134, max_rel=1374.9998779296875, norm_rel=0.022316304966807365, ref_abs_avg=23.891578674316406, test_abs_avg=23.890316009521484
production_forward grad[77] vs paper_forward: mean_abs=0.42492103576660156, max_abs=1.625, mean_rel=0.08762266486883163, max_rel=11.122859001159668, norm_rel=0.021903321146965027, ref_abs_avg=19.239776611328125, test_abs_avg=19.23635482788086
production_forward grad[78] vs paper_forward: mean_abs=0.532296895980835, max_abs=4.5, mean_rel=0.1592014580965042, max_rel=642.873291015625, norm_rel=0.023744646459817886, ref_abs_avg=22.455286026000977, test_abs_avg=22.452192306518555
production_forward grad[79] vs paper_forward: mean_abs=0.48468881845474243, max_abs=3.5, mean_rel=0.19102251529693604, max_rel=1624.9998779296875, norm_rel=0.02214951626956463, ref_abs_avg=21.942079544067383, test_abs_avg=21.944049835205078
production_forward grad[80] vs paper_forward: mean_abs=0.3917989730834961, max_abs=1.6875, mean_rel=0.09515026211738586, max_rel=11.25976848602295, norm_rel=0.022091491147875786, ref_abs_avg=18.063270568847656, test_abs_avg=18.057275772094727
production_forward grad[81] vs paper_forward: mean_abs=0.4997174143791199, max_abs=5.5, mean_rel=0.14596781134605408, max_rel=840.8245849609375, norm_rel=0.023409821093082428, ref_abs_avg=21.414649963378906, test_abs_avg=21.414321899414062
production_forward grad[82] vs paper_forward: mean_abs=0.4629077911376953, max_abs=3.875, mean_rel=0.2083463966846466, max_rel=1437.4998779296875, norm_rel=0.02219248190522194, ref_abs_avg=20.959104537963867, test_abs_avg=20.96160316467285
production_forward grad[83] vs paper_forward: mean_abs=0.3825938105583191, max_abs=1.75, mean_rel=0.09848922491073608, max_rel=10.371542930603027, norm_rel=0.02229178324341774, ref_abs_avg=17.705463409423828, test_abs_avg=17.748140335083008
production_forward grad[84] vs paper_forward: mean_abs=0.46141016483306885, max_abs=4.0, mean_rel=0.13967525959014893, max_rel=722.4832763671875, norm_rel=0.02269245870411396, ref_abs_avg=20.449275970458984, test_abs_avg=20.449260711669922
production_forward grad[85] vs paper_forward: mean_abs=0.4207003712654114, max_abs=3.75, mean_rel=0.20078934729099274, max_rel=1250.0, norm_rel=0.020687459036707878, ref_abs_avg=20.420846939086914, test_abs_avg=20.415264129638672
production_forward grad[86] vs paper_forward: mean_abs=0.3382495641708374, max_abs=1.1875, mean_rel=0.07128286361694336, max_rel=2.8766751289367676, norm_rel=0.01991586945950985, ref_abs_avg=16.925655364990234, test_abs_avg=16.956655502319336
production_forward grad[87] vs paper_forward: mean_abs=0.43441230058670044, max_abs=4.0, mean_rel=0.1322430521249771, max_rel=565.0244750976562, norm_rel=0.02221296913921833, ref_abs_avg=19.7126522064209, test_abs_avg=19.71221160888672
production_forward grad[88] vs paper_forward: mean_abs=0.40171676874160767, max_abs=3.5, mean_rel=0.18137279152870178, max_rel=1093.75, norm_rel=0.02040158584713936, ref_abs_avg=19.840517044067383, test_abs_avg=19.838359832763672
production_forward grad[89] vs paper_forward: mean_abs=0.3209397792816162, max_abs=1.234375, mean_rel=0.13242468237876892, max_rel=26.82636833190918, norm_rel=0.020440060645341873, ref_abs_avg=15.983186721801758, test_abs_avg=15.970901489257812
production_forward grad[90] vs paper_forward: mean_abs=0.41286659240722656, max_abs=4.0, mean_rel=0.13250090181827545, max_rel=970.2008666992188, norm_rel=0.021937379613518715, ref_abs_avg=18.97901725769043, test_abs_avg=18.9776668548584
production_forward grad[91] vs paper_forward: mean_abs=0.38490617275238037, max_abs=3.75, mean_rel=0.18809141218662262, max_rel=2250.0, norm_rel=0.019890349358320236, ref_abs_avg=19.404441833496094, test_abs_avg=19.402816772460938
production_forward grad[92] vs paper_forward: mean_abs=0.3251206874847412, max_abs=1.1875, mean_rel=0.07793432474136353, max_rel=3.577719211578369, norm_rel=0.02004345878958702, ref_abs_avg=16.13847541809082, test_abs_avg=16.136058807373047
production_forward grad[93] vs paper_forward: mean_abs=0.4001951813697815, max_abs=4.875, mean_rel=0.1313926726579666, max_rel=1679.1583251953125, norm_rel=0.021462110802531242, ref_abs_avg=18.87130355834961, test_abs_avg=18.87096405029297
production_forward grad[94] vs paper_forward: mean_abs=0.3573710024356842, max_abs=3.25, mean_rel=0.17847277224063873, max_rel=1874.9998779296875, norm_rel=0.019590679556131363, ref_abs_avg=18.42091941833496, test_abs_avg=18.42438507080078
production_forward grad[95] vs paper_forward: mean_abs=0.291196346282959, max_abs=1.1875, mean_rel=0.07488848268985748, max_rel=4.278465747833252, norm_rel=0.019785230979323387, ref_abs_avg=15.13236141204834, test_abs_avg=15.11037826538086
production_forward grad[96] vs paper_forward: mean_abs=0.3652556240558624, max_abs=5.0, mean_rel=0.12935373187065125, max_rel=837.6895141601562, norm_rel=0.02088243141770363, ref_abs_avg=17.785863876342773, test_abs_avg=17.786062240600586
production_forward grad[97] vs paper_forward: mean_abs=0.3424161374568939, max_abs=4.0, mean_rel=0.17207801342010498, max_rel=1218.75, norm_rel=0.01941348798573017, ref_abs_avg=18.03641700744629, test_abs_avg=18.03240203857422
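The fields in the lines above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) read as standard elementwise-difference diagnostics between a reference gradient and a test gradient. The exact formulas used by the original harness are not shown, so the following is a minimal sketch under the assumption that `rel` is per-element absolute difference over the reference magnitude (with a small floor to avoid division by zero) and `norm_rel` is the L2 norm of the difference over the L2 norm of the reference; `compare` is a hypothetical helper name:

```python
import math

def compare(ref, test, eps=1e-6):
    """Summarize agreement between flat lists of reference and test values.

    Field names mirror the log above; the exact formulas the original
    harness uses are assumptions.
    """
    diffs = [abs(t - r) for r, t in zip(ref, test)]
    # Per-element relative error, with eps guarding near-zero references.
    rels = [d / max(abs(r), eps) for d, r in zip(diffs, ref)]
    ref_norm = math.sqrt(sum(r * r for r in ref))
    diff_norm = math.sqrt(sum(d * d for d in diffs))
    return {
        "mean_abs": sum(diffs) / len(diffs),
        "max_abs": max(diffs),
        "mean_rel": sum(rels) / len(rels),
        "max_rel": max(rels),
        # Whole-tensor relative error in the L2 sense.
        "norm_rel": diff_norm / max(ref_norm, eps),
        "ref_abs_avg": sum(abs(r) for r in ref) / len(ref),
        "test_abs_avg": sum(abs(t) for t in test) / len(test),
    }
```

On real tensors the same statistics would be computed with `torch` reductions rather than Python loops; the list form here just makes the definitions explicit.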
production_forward2 vs paper_forward output: mean_abs=0.0016692555509507656, max_abs=0.052734375
production_forward2 grad[0] vs paper_forward: mean_abs=0.008704417385160923, max_abs=0.3671875, mean_rel=0.0747205913066864, max_rel=100.72709655761719, norm_rel=0.020304029807448387, ref_abs_avg=0.4606306850910187, test_abs_avg=0.46063995361328125
production_forward2 grad[1] vs paper_forward: mean_abs=7.442819595336914, max_abs=56.0, mean_rel=0.132306769490242, max_rel=103.28716278076172, norm_rel=0.020898621529340744, ref_abs_avg=319.6033630371094, test_abs_avg=319.582275390625
production_forward2 grad[2] vs paper_forward: mean_abs=1.3950462341308594, max_abs=4.84375, mean_rel=0.0910838395357132, max_rel=2.7924611568450928, norm_rel=0.02588549256324768, ref_abs_avg=52.78914260864258, test_abs_avg=52.92759704589844
production_forward2 grad[3] vs paper_forward: mean_abs=1.643997311592102, max_abs=12.5, mean_rel=0.16233181953430176, max_rel=1304.049560546875, norm_rel=0.025078631937503815, ref_abs_avg=65.97113037109375, test_abs_avg=65.97602844238281
production_forward2 grad[4] vs paper_forward: mean_abs=1.5205836296081543, max_abs=9.5, mean_rel=0.359817236661911, max_rel=4937.5, norm_rel=0.023587413132190704, ref_abs_avg=64.75505065917969, test_abs_avg=64.76727294921875
production_forward2 grad[5] vs paper_forward: mean_abs=1.1474666595458984, max_abs=5.375, mean_rel=0.1390107125043869, max_rel=36.092933654785156, norm_rel=0.02359594963490963, ref_abs_avg=49.014137268066406, test_abs_avg=48.968109130859375
production_forward2 grad[6] vs paper_forward: mean_abs=1.4258830547332764, max_abs=10.0, mean_rel=0.17065563797950745, max_rel=2429.34912109375, norm_rel=0.024636657908558846, ref_abs_avg=58.23811340332031, test_abs_avg=58.241546630859375
production_forward2 grad[7] vs paper_forward: mean_abs=1.3125196695327759, max_abs=8.25, mean_rel=0.40896934270858765, max_rel=3468.749755859375, norm_rel=0.023059619590640068, ref_abs_avg=57.188453674316406, test_abs_avg=57.199127197265625
production_forward2 grad[8] vs paper_forward: mean_abs=1.0120697021484375, max_abs=5.5, mean_rel=0.07848871499300003, max_rel=5.571451663970947, norm_rel=0.02286633662879467, ref_abs_avg=43.72166442871094, test_abs_avg=43.64573669433594
production_forward2 grad[9] vs paper_forward: mean_abs=1.2963300943374634, max_abs=9.0, mean_rel=0.17089484632015228, max_rel=1968.71337890625, norm_rel=0.024509258568286896, ref_abs_avg=53.212158203125, test_abs_avg=53.21305465698242
production_forward2 grad[10] vs paper_forward: mean_abs=1.201237440109253, max_abs=7.0, mean_rel=0.34806299209594727, max_rel=4375.0, norm_rel=0.02288191206753254, ref_abs_avg=52.84107208251953, test_abs_avg=52.85337448120117
production_forward2 grad[11] vs paper_forward: mean_abs=1.0072290897369385, max_abs=3.625, mean_rel=0.22111137211322784, max_rel=27.652446746826172, norm_rel=0.02532707154750824, ref_abs_avg=39.578834533691406, test_abs_avg=39.68394470214844
production_forward2 grad[12] vs paper_forward: mean_abs=1.2014466524124146, max_abs=8.0, mean_rel=0.16354647278785706, max_rel=2138.91064453125, norm_rel=0.02432161197066307, ref_abs_avg=49.71336364746094, test_abs_avg=49.717437744140625
production_forward2 grad[13] vs paper_forward: mean_abs=1.114342212677002, max_abs=6.5, mean_rel=0.3136664628982544, max_rel=3874.999755859375, norm_rel=0.022821934893727303, ref_abs_avg=49.12849807739258, test_abs_avg=49.12904357910156
production_forward2 grad[14] vs paper_forward: mean_abs=0.8541984558105469, max_abs=4.0, mean_rel=0.13157744705677032, max_rel=8.501143455505371, norm_rel=0.02252541482448578, ref_abs_avg=38.73619842529297, test_abs_avg=38.67530822753906
production_forward2 grad[15] vs paper_forward: mean_abs=1.1287494897842407, max_abs=8.0, mean_rel=0.15761059522628784, max_rel=1499.37646484375, norm_rel=0.024146191775798798, ref_abs_avg=47.03253936767578, test_abs_avg=47.035430908203125
production_forward2 grad[16] vs paper_forward: mean_abs=1.038329839706421, max_abs=6.5, mean_rel=0.2944047749042511, max_rel=3312.499755859375, norm_rel=0.022409001365303993, ref_abs_avg=46.55182647705078, test_abs_avg=46.55284881591797
production_forward2 grad[17] vs paper_forward: mean_abs=0.8275966644287109, max_abs=3.125, mean_rel=0.12174077332019806, max_rel=9.510421752929688, norm_rel=0.024364404380321503, ref_abs_avg=33.786338806152344, test_abs_avg=33.73944854736328
production_forward2 grad[18] vs paper_forward: mean_abs=1.0595173835754395, max_abs=7.0, mean_rel=0.1651839017868042, max_rel=1126.39990234375, norm_rel=0.023961886763572693, ref_abs_avg=44.45613098144531, test_abs_avg=44.45702362060547
production_forward2 grad[19] vs paper_forward: mean_abs=0.9714393615722656, max_abs=5.75, mean_rel=0.36453163623809814, max_rel=3249.999755859375, norm_rel=0.02230420522391796, ref_abs_avg=43.82069778442383, test_abs_avg=43.81773376464844
production_forward2 grad[20] vs paper_forward: mean_abs=0.8062334060668945, max_abs=3.5, mean_rel=0.09214691817760468, max_rel=5.896882057189941, norm_rel=0.024999486282467842, ref_abs_avg=32.54723358154297, test_abs_avg=32.530059814453125
production_forward2 grad[21] vs paper_forward: mean_abs=1.0021181106567383, max_abs=7.0, mean_rel=0.15857642889022827, max_rel=2833.51904296875, norm_rel=0.023835187777876854, ref_abs_avg=42.31281661987305, test_abs_avg=42.31696319580078
production_forward2 grad[22] vs paper_forward: mean_abs=0.9251136183738708, max_abs=5.75, mean_rel=0.26105791330337524, max_rel=3343.749755859375, norm_rel=0.022319654002785683, ref_abs_avg=41.60742950439453, test_abs_avg=41.608238220214844
production_forward2 grad[23] vs paper_forward: mean_abs=0.7477569580078125, max_abs=2.6875, mean_rel=0.15493528544902802, max_rel=20.015565872192383, norm_rel=0.0232708640396595, ref_abs_avg=32.999855041503906, test_abs_avg=32.92512130737305
production_forward2 grad[24] vs paper_forward: mean_abs=0.9453413486480713, max_abs=6.0, mean_rel=0.150853231549263, max_rel=1004.607421875, norm_rel=0.02360653318464756, ref_abs_avg=40.29560089111328, test_abs_avg=40.29801940917969
production_forward2 grad[25] vs paper_forward: mean_abs=0.8705145120620728, max_abs=5.28125, mean_rel=0.27833396196365356, max_rel=2624.999755859375, norm_rel=0.022147873416543007, ref_abs_avg=39.533748626708984, test_abs_avg=39.53740692138672
production_forward2 grad[26] vs paper_forward: mean_abs=0.8728446960449219, max_abs=3.25, mean_rel=0.12945029139518738, max_rel=8.913497924804688, norm_rel=0.0251361932605505, ref_abs_avg=34.29248809814453, test_abs_avg=34.27968215942383
production_forward2 grad[27] vs paper_forward: mean_abs=1.096805453300476, max_abs=9.0, mean_rel=0.1628335416316986, max_rel=1925.92626953125, norm_rel=0.02544301375746727, ref_abs_avg=43.31253433227539, test_abs_avg=43.310508728027344
production_forward2 grad[28] vs paper_forward: mean_abs=1.0120139122009277, max_abs=7.0, mean_rel=0.32508647441864014, max_rel=3249.999755859375, norm_rel=0.02375073917210102, ref_abs_avg=42.77947998046875, test_abs_avg=42.78086853027344
production_forward2 grad[29] vs paper_forward: mean_abs=0.7994022369384766, max_abs=3.953125, mean_rel=0.12117087095975876, max_rel=8.259599685668945, norm_rel=0.02444637194275856, ref_abs_avg=33.19596862792969, test_abs_avg=33.20660400390625
production_forward2 grad[30] vs paper_forward: mean_abs=1.0203500986099243, max_abs=7.5, mean_rel=0.16547399759292603, max_rel=1057.2626953125, norm_rel=0.025814520195126534, ref_abs_avg=39.698577880859375, test_abs_avg=39.700286865234375
production_forward2 grad[31] vs paper_forward: mean_abs=0.9509063959121704, max_abs=6.25, mean_rel=0.3874853849411011, max_rel=2828.124755859375, norm_rel=0.024458879604935646, ref_abs_avg=39.02936553955078, test_abs_avg=39.02398681640625
production_forward2 grad[32] vs paper_forward: mean_abs=0.8056421279907227, max_abs=3.5, mean_rel=0.146907240152359, max_rel=11.382514953613281, norm_rel=0.025088444352149963, ref_abs_avg=31.965587615966797, test_abs_avg=32.0468864440918
production_forward2 grad[33] vs paper_forward: mean_abs=0.9614039659500122, max_abs=7.0, mean_rel=0.16944557428359985, max_rel=1227.9576416015625, norm_rel=0.025712862610816956, ref_abs_avg=37.56528091430664, test_abs_avg=37.5657844543457
production_forward2 grad[34] vs paper_forward: mean_abs=0.8833189010620117, max_abs=5.25, mean_rel=0.29039180278778076, max_rel=2874.999755859375, norm_rel=0.02414938621222973, ref_abs_avg=36.67671203613281, test_abs_avg=36.679222106933594
production_forward2 grad[35] vs paper_forward: mean_abs=0.7247586250305176, max_abs=3.125, mean_rel=0.08533935248851776, max_rel=6.139017581939697, norm_rel=0.02529711276292801, ref_abs_avg=28.652847290039062, test_abs_avg=28.652435302734375
production_forward2 grad[36] vs paper_forward: mean_abs=0.8894493579864502, max_abs=6.5, mean_rel=0.17206299304962158, max_rel=2090.164306640625, norm_rel=0.025421874597668648, ref_abs_avg=35.08371353149414, test_abs_avg=35.08155059814453
production_forward2 grad[37] vs paper_forward: mean_abs=0.8238810896873474, max_abs=5.34375, mean_rel=0.30942028760910034, max_rel=3046.874755859375, norm_rel=0.02406579814851284, ref_abs_avg=34.33649444580078, test_abs_avg=34.33444595336914
production_forward2 grad[38] vs paper_forward: mean_abs=0.6435322761535645, max_abs=2.8125, mean_rel=0.173305481672287, max_rel=32.01834487915039, norm_rel=0.02316238544881344, ref_abs_avg=28.147232055664062, test_abs_avg=28.14044189453125
production_forward2 grad[39] vs paper_forward: mean_abs=0.8334475159645081, max_abs=5.5, mean_rel=0.17136487364768982, max_rel=1504.015625, norm_rel=0.025129735469818115, ref_abs_avg=33.32065963745117, test_abs_avg=33.3213005065918
production_forward2 grad[40] vs paper_forward: mean_abs=0.7773706912994385, max_abs=5.1875, mean_rel=0.2990826964378357, max_rel=2500.0, norm_rel=0.023626238107681274, ref_abs_avg=33.015804290771484, test_abs_avg=33.01499938964844
production_forward2 grad[41] vs paper_forward: mean_abs=0.6122856140136719, max_abs=2.375, mean_rel=0.0862644761800766, max_rel=4.981109142303467, norm_rel=0.023087285459041595, ref_abs_avg=25.899127960205078, test_abs_avg=25.87853240966797
production_forward2 grad[42] vs paper_forward: mean_abs=0.7920737266540527, max_abs=6.0, mean_rel=0.16046024858951569, max_rel=1213.6171875, norm_rel=0.024891143664717674, ref_abs_avg=31.958797454833984, test_abs_avg=31.9582576751709
production_forward2 grad[43] vs paper_forward: mean_abs=0.7398664951324463, max_abs=4.71875, mean_rel=0.25597208738327026, max_rel=2125.0, norm_rel=0.023328863084316254, ref_abs_avg=31.803184509277344, test_abs_avg=31.796648025512695
production_forward2 grad[44] vs paper_forward: mean_abs=0.5854387283325195, max_abs=2.5, mean_rel=0.1461319923400879, max_rel=23.37598419189453, norm_rel=0.023952282965183258, ref_abs_avg=24.671966552734375, test_abs_avg=24.68348503112793
production_forward2 grad[45] vs paper_forward: mean_abs=0.7544109225273132, max_abs=5.25, mean_rel=0.16597267985343933, max_rel=2464.29345703125, norm_rel=0.02462819777429104, ref_abs_avg=30.715974807739258, test_abs_avg=30.716846466064453
production_forward2 grad[46] vs paper_forward: mean_abs=0.7010898590087891, max_abs=4.5, mean_rel=0.23687973618507385, max_rel=1906.2498779296875, norm_rel=0.022908931598067284, ref_abs_avg=30.648426055908203, test_abs_avg=30.648157119750977
production_forward2 grad[47] vs paper_forward: mean_abs=0.5186548233032227, max_abs=2.5, mean_rel=0.1175713837146759, max_rel=21.24944496154785, norm_rel=0.021718386560678482, ref_abs_avg=24.47433090209961, test_abs_avg=24.490264892578125
production_forward2 grad[48] vs paper_forward: mean_abs=0.7212687134742737, max_abs=5.0, mean_rel=0.15658029913902283, max_rel=921.6132202148438, norm_rel=0.02438901923596859, ref_abs_avg=29.658885955810547, test_abs_avg=29.659488677978516
production_forward2 grad[49] vs paper_forward: mean_abs=0.6705402731895447, max_abs=4.0, mean_rel=0.2610119581222534, max_rel=1874.9998779296875, norm_rel=0.022670920938253403, ref_abs_avg=29.640748977661133, test_abs_avg=29.64695167541504
production_forward2 grad[50] vs paper_forward: mean_abs=0.5926562547683716, max_abs=2.25, mean_rel=0.07292614877223969, max_rel=3.213911533355713, norm_rel=0.024126270785927773, ref_abs_avg=25.495983123779297, test_abs_avg=25.44000244140625
production_forward2 grad[51] vs paper_forward: mean_abs=0.8003004193305969, max_abs=6.0, mean_rel=0.16402468085289001, max_rel=971.3359375, norm_rel=0.02522500418126583, ref_abs_avg=31.79586410522461, test_abs_avg=31.797298431396484
production_forward2 grad[52] vs paper_forward: mean_abs=0.7478345632553101, max_abs=5.0, mean_rel=0.2838699519634247, max_rel=2093.75, norm_rel=0.024012558162212372, ref_abs_avg=31.224498748779297, test_abs_avg=31.237096786499023
production_forward2 grad[53] vs paper_forward: mean_abs=0.5705751180648804, max_abs=2.1875, mean_rel=0.16804400086402893, max_rel=33.56757736206055, norm_rel=0.02242439053952694, ref_abs_avg=25.197498321533203, test_abs_avg=25.238338470458984
production_forward2 grad[54] vs paper_forward: mean_abs=0.747531533241272, max_abs=5.25, mean_rel=0.158052459359169, max_rel=1947.4437255859375, norm_rel=0.024828394874930382, ref_abs_avg=30.17728042602539, test_abs_avg=30.179105758666992
production_forward2 grad[55] vs paper_forward: mean_abs=0.688691258430481, max_abs=5.0, mean_rel=0.23161157965660095, max_rel=3093.749755859375, norm_rel=0.02351888082921505, ref_abs_avg=29.351694107055664, test_abs_avg=29.358991622924805
production_forward2 grad[56] vs paper_forward: mean_abs=0.5998215675354004, max_abs=2.125, mean_rel=0.15253791213035583, max_rel=31.71994400024414, norm_rel=0.027558062225580215, ref_abs_avg=21.987037658691406, test_abs_avg=21.99923324584961
production_forward2 grad[57] vs paper_forward: mean_abs=0.6967460513114929, max_abs=5.5, mean_rel=0.16864022612571716, max_rel=1848.4586181640625, norm_rel=0.024742621928453445, ref_abs_avg=28.205718994140625, test_abs_avg=28.208454132080078
production_forward2 grad[58] vs paper_forward: mean_abs=0.6508207321166992, max_abs=4.125, mean_rel=0.22710293531417847, max_rel=1656.2498779296875, norm_rel=0.023210469633340836, ref_abs_avg=28.061603546142578, test_abs_avg=28.066017150878906
production_forward2 grad[59] vs paper_forward: mean_abs=0.5031723976135254, max_abs=2.0, mean_rel=0.1873699128627777, max_rel=59.58230972290039, norm_rel=0.02213600091636181, ref_abs_avg=22.85171890258789, test_abs_avg=22.819074630737305
production_forward2 grad[60] vs paper_forward: mean_abs=0.6501423716545105, max_abs=5.0, mean_rel=0.1472635269165039, max_rel=858.55908203125, norm_rel=0.024400584399700165, ref_abs_avg=26.66686248779297, test_abs_avg=26.669240951538086
production_forward2 grad[61] vs paper_forward: mean_abs=0.611315131187439, max_abs=4.0, mean_rel=0.23407144844532013, max_rel=2406.25, norm_rel=0.02318071946501732, ref_abs_avg=26.419246673583984, test_abs_avg=26.418167114257812
production_forward2 grad[62] vs paper_forward: mean_abs=0.48900604248046875, max_abs=1.6875, mean_rel=0.10778996348381042, max_rel=8.129803657531738, norm_rel=0.023380260914564133, ref_abs_avg=21.308061599731445, test_abs_avg=21.305749893188477
production_forward2 grad[63] vs paper_forward: mean_abs=0.6163332462310791, max_abs=5.0, mean_rel=0.16431650519371033, max_rel=1590.0321044921875, norm_rel=0.023801004514098167, ref_abs_avg=25.90411376953125, test_abs_avg=25.903514862060547
production_forward2 grad[64] vs paper_forward: mean_abs=0.5730370879173279, max_abs=4.03125, mean_rel=0.2351302206516266, max_rel=2593.749755859375, norm_rel=0.02267991565167904, ref_abs_avg=25.32202911376953, test_abs_avg=25.32353973388672
production_forward2 grad[65] vs paper_forward: mean_abs=0.42508363723754883, max_abs=1.875, mean_rel=0.10066781938076019, max_rel=21.143535614013672, norm_rel=0.02003687247633934, ref_abs_avg=21.65802001953125, test_abs_avg=21.618938446044922
production_forward2 grad[66] vs paper_forward: mean_abs=0.5802939534187317, max_abs=5.0, mean_rel=0.15104293823242188, max_rel=684.9261474609375, norm_rel=0.02360987849533558, ref_abs_avg=24.60637664794922, test_abs_avg=24.606029510498047
production_forward2 grad[67] vs paper_forward: mean_abs=0.5403093695640564, max_abs=3.5, mean_rel=0.2150055170059204, max_rel=1554.6873779296875, norm_rel=0.021812867373228073, ref_abs_avg=24.73621368408203, test_abs_avg=24.74100112915039
production_forward2 grad[68] vs paper_forward: mean_abs=0.44213438034057617, max_abs=1.8125, mean_rel=0.11345821619033813, max_rel=13.2311429977417, norm_rel=0.021376417949795723, ref_abs_avg=21.348648071289062, test_abs_avg=21.347169876098633
production_forward2 grad[69] vs paper_forward: mean_abs=0.5590904951095581, max_abs=5.0, mean_rel=0.15041619539260864, max_rel=619.1943969726562, norm_rel=0.0232580304145813, ref_abs_avg=24.007537841796875, test_abs_avg=24.010292053222656
production_forward2 grad[70] vs paper_forward: mean_abs=0.514975905418396, max_abs=4.0, mean_rel=0.22458207607269287, max_rel=2125.0, norm_rel=0.02188529260456562, ref_abs_avg=23.518917083740234, test_abs_avg=23.513221740722656
production_forward2 grad[71] vs paper_forward: mean_abs=0.38928425312042236, max_abs=1.8125, mean_rel=0.2586701512336731, max_rel=38.7173957824707, norm_rel=0.022022414952516556, ref_abs_avg=18.24908447265625, test_abs_avg=18.249326705932617
production_forward2 grad[72] vs paper_forward: mean_abs=0.5288577079772949, max_abs=5.0, mean_rel=0.141877219080925, max_rel=978.1190795898438, norm_rel=0.022842390462756157, ref_abs_avg=23.179424285888672, test_abs_avg=23.18260383605957
production_forward2 grad[73] vs paper_forward: mean_abs=0.4884072244167328, max_abs=3.375, mean_rel=0.20547820627689362, max_rel=1531.2498779296875, norm_rel=0.02117154374718666, ref_abs_avg=23.061370849609375, test_abs_avg=23.06727409362793
production_forward2 grad[74] vs paper_forward: mean_abs=0.4640507698059082, max_abs=1.75, mean_rel=0.07853081077337265, max_rel=6.8615522384643555, norm_rel=0.023523926734924316, ref_abs_avg=19.45409393310547, test_abs_avg=19.446245193481445
production_forward2 grad[75] vs paper_forward: mean_abs=0.5826761722564697, max_abs=4.75, mean_rel=0.15107284486293793, max_rel=937.9971923828125, norm_rel=0.02455264888703823, ref_abs_avg=23.782541275024414, test_abs_avg=23.78076934814453
production_forward2 grad[76] vs paper_forward: mean_abs=0.5389406681060791, max_abs=3.796875, mean_rel=0.20784008502960205, max_rel=1499.9998779296875, norm_rel=0.022514358162879944, ref_abs_avg=23.891578674316406, test_abs_avg=23.889848709106445
production_forward2 grad[77] vs paper_forward: mean_abs=0.4330625534057617, max_abs=1.5, mean_rel=0.09307838976383209, max_rel=12.63033390045166, norm_rel=0.022305665537714958, ref_abs_avg=19.239776611328125, test_abs_avg=19.239046096801758
production_forward2 grad[78] vs paper_forward: mean_abs=0.5372641682624817, max_abs=4.5, mean_rel=0.16221332550048828, max_rel=705.0414428710938, norm_rel=0.023952236399054527, ref_abs_avg=22.455286026000977, test_abs_avg=22.451982498168945
production_forward2 grad[79] vs paper_forward: mean_abs=0.4905153214931488, max_abs=3.75, mean_rel=0.1956378072500229, max_rel=1250.0, norm_rel=0.022393865510821342, ref_abs_avg=21.942079544067383, test_abs_avg=21.943496704101562
production_forward2 grad[80] vs paper_forward: mean_abs=0.3782968521118164, max_abs=1.8125, mean_rel=0.09485377371311188, max_rel=10.11722469329834, norm_rel=0.021485477685928345, ref_abs_avg=18.063270568847656, test_abs_avg=18.05716323852539
production_forward2 grad[81] vs paper_forward: mean_abs=0.5037364363670349, max_abs=5.5, mean_rel=0.14748108386993408, max_rel=938.6140747070312, norm_rel=0.02358364872634411, ref_abs_avg=21.414649963378906, test_abs_avg=21.41377830505371
production_forward2 grad[82] vs paper_forward: mean_abs=0.4669698178768158, max_abs=3.9375, mean_rel=0.20968294143676758, max_rel=1640.6248779296875, norm_rel=0.022369470447301865, ref_abs_avg=20.959104537963867, test_abs_avg=20.96111297607422
production_forward2 grad[83] vs paper_forward: mean_abs=0.383026123046875, max_abs=1.75, mean_rel=0.0931645929813385, max_rel=7.844607353210449, norm_rel=0.022264685481786728, ref_abs_avg=17.705463409423828, test_abs_avg=17.74740219116211
production_forward2 grad[84] vs paper_forward: mean_abs=0.46470963954925537, max_abs=4.5, mean_rel=0.141163170337677, max_rel=576.5512084960938, norm_rel=0.022843943908810616, ref_abs_avg=20.449275970458984, test_abs_avg=20.448909759521484
production_forward2 grad[85] vs paper_forward: mean_abs=0.42342811822891235, max_abs=3.625, mean_rel=0.20226356387138367, max_rel=1406.2498779296875, norm_rel=0.020822934806346893, ref_abs_avg=20.420846939086914, test_abs_avg=20.414752960205078
production_forward2 grad[86] vs paper_forward: mean_abs=0.3458171486854553, max_abs=1.25, mean_rel=0.077504463493824, max_rel=8.003925323486328, norm_rel=0.02022918313741684, ref_abs_avg=16.925655364990234, test_abs_avg=16.96115493774414
production_forward2 grad[87] vs paper_forward: mean_abs=0.4367764890193939, max_abs=4.0, mean_rel=0.13169848918914795, max_rel=607.3173217773438, norm_rel=0.022310467436909676, ref_abs_avg=19.7126522064209, test_abs_avg=19.712329864501953
production_forward2 grad[88] vs paper_forward: mean_abs=0.4038355350494385, max_abs=4.0, mean_rel=0.18386104702949524, max_rel=1343.7498779296875, norm_rel=0.02050936408340931, ref_abs_avg=19.840517044067383, test_abs_avg=19.838233947753906
production_forward2 grad[89] vs paper_forward: mean_abs=0.3182065486907959, max_abs=1.1875, mean_rel=0.11835555732250214, max_rel=21.429895401000977, norm_rel=0.020403485745191574, ref_abs_avg=15.983186721801758, test_abs_avg=15.967275619506836
production_forward2 grad[90] vs paper_forward: mean_abs=0.4142974019050598, max_abs=4.1875, mean_rel=0.13269227743148804, max_rel=995.3958740234375, norm_rel=0.022004393860697746, ref_abs_avg=18.97901725769043, test_abs_avg=18.9774169921875
production_forward2 grad[91] vs paper_forward: mean_abs=0.3862840533256531, max_abs=3.5, mean_rel=0.18872731924057007, max_rel=2250.0, norm_rel=0.019963722676038742, ref_abs_avg=19.404441833496094, test_abs_avg=19.402565002441406
production_forward2 grad[92] vs paper_forward: mean_abs=0.32200855016708374, max_abs=1.244140625, mean_rel=0.08676451444625854, max_rel=5.504568099975586, norm_rel=0.019817883148789406, ref_abs_avg=16.13847541809082, test_abs_avg=16.13672637939453
production_forward2 grad[93] vs paper_forward: mean_abs=0.4009334444999695, max_abs=4.875, mean_rel=0.13219422101974487, max_rel=1666.2430419921875, norm_rel=0.02150082215666771, ref_abs_avg=18.87130355834961, test_abs_avg=18.87146759033203
production_forward2 grad[94] vs paper_forward: mean_abs=0.3574952483177185, max_abs=3.5, mean_rel=0.1804439127445221, max_rel=1624.9998779296875, norm_rel=0.01960158161818981, ref_abs_avg=18.42091941833496, test_abs_avg=18.424461364746094
production_forward2 grad[95] vs paper_forward: mean_abs=0.291196346282959, max_abs=1.1875, mean_rel=0.07488848268985748, max_rel=4.278465747833252, norm_rel=0.019785230979323387, ref_abs_avg=15.13236141204834, test_abs_avg=15.11037826538086
production_forward2 grad[96] vs paper_forward: mean_abs=0.3652556240558624, max_abs=5.0, mean_rel=0.12935373187065125, max_rel=837.6895141601562, norm_rel=0.02088243141770363, ref_abs_avg=17.785863876342773, test_abs_avg=17.786062240600586
production_forward2 grad[97] vs paper_forward: mean_abs=0.3424161374568939, max_abs=4.0, mean_rel=0.17207801342010498, max_rel=1218.75, norm_rel=0.01941348798573017, ref_abs_avg=18.03641700744629, test_abs_avg=18.03240203857422
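The comparison columns above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) are consistent with elementwise diff statistics between a reference gradient and a test gradient. A minimal stdlib-only sketch of how such numbers could be produced — the helper name is hypothetical; the actual harness is not shown in this log:

```python
import math

def grad_diff_stats(ref, test):
    """Elementwise difference statistics between two flat gradient sequences.

    Mirrors the log columns: mean_abs/max_abs are absolute differences,
    mean_rel/max_rel are differences relative to |ref| (skipping zeros),
    norm_rel is ||ref - test|| / ||ref||, and ref_abs_avg/test_abs_avg
    are the mean magnitudes of each gradient.
    """
    diffs = [abs(r - t) for r, t in zip(ref, test)]
    rels = [d / abs(r) for d, r in zip(diffs, ref) if r != 0]
    ref_norm = math.sqrt(sum(r * r for r in ref))
    diff_norm = math.sqrt(sum(d * d for d in diffs))
    return {
        "mean_abs": sum(diffs) / len(diffs),
        "max_abs": max(diffs),
        "mean_rel": sum(rels) / len(rels),
        "max_rel": max(rels),
        "norm_rel": diff_norm / ref_norm,
        "ref_abs_avg": sum(abs(r) for r in ref) / len(ref),
        "test_abs_avg": sum(abs(t) for t in test) / len(test),
    }
```

Note the pattern visible in the rows above: max_rel can blow up into the thousands where a reference element is tiny, while norm_rel stays around 0.02, which is why norm_rel is the more robust agreement signal for bf16-scale gradients.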
identity layers + randn queries
paper_forward fwd+bwd:  385.141 ms
paper_forward bwd-only: 304.880 ms
paper_forward peak allocated: fwd=30.002 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.059 GiB, fwd+bwd=32.809 GiB
production_forward2 fwd+bwd:  191.650 ms
production_forward2 bwd-only: 172.728 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.264 GiB, fwd+bwd=9.014 GiB
production_forward fwd+bwd:  114.669 ms
production_forward bwd-only: 96.031 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=10.118 GiB
production_forward peak reserved:  fwd=3.639 GiB, fwd+bwd=12.639 GiB
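Timings like the fwd+bwd and bwd-only rows above are typically averaged over repeated calls after a warmup pass. A stdlib-only sketch of such a timer — torch-free, so the CUDA-specific parts (calling torch.cuda.synchronize() before reading the clock, and reading torch.cuda.max_memory_allocated()/max_memory_reserved() after torch.cuda.reset_peak_memory_stats() for the peak-memory rows) are mentioned in comments only; the helper name and defaults are illustrative, not the actual harness:

```python
import time

def bench_ms(fn, warmup=3, iters=10):
    """Average wall-clock milliseconds per call of `fn`.

    For GPU workloads like the runs logged above, a real harness would
    also call torch.cuda.synchronize() before each perf_counter() read
    so queued kernels are included, and would collect the peak-memory
    rows via torch.cuda.max_memory_allocated()/max_memory_reserved().
    """
    for _ in range(warmup):  # warmup: exclude compile/autotune/cache cost
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters
```

A bwd-only figure like the ones above can then be estimated either by timing the backward call in isolation or as the difference between a fwd+bwd measurement and a forward-only measurement.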

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016269661718979478, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008272310718894005, max_abs=0.34765625, mean_rel=0.07217934727668762, max_rel=114.78594970703125, norm_rel=0.01975233480334282, ref_abs_avg=0.454330176115036, test_abs_avg=0.4543377757072449
production_forward grad[1] vs paper_forward: mean_abs=7.257564544677734, max_abs=56.0, mean_rel=0.16241642832756042, max_rel=547.3978271484375, norm_rel=0.020596951246261597, ref_abs_avg=311.4980773925781, test_abs_avg=311.57257080078125
production_forward grad[2] vs paper_forward: mean_abs=1.2032794952392578, max_abs=5.25, mean_rel=0.09617684036493301, max_rel=13.090683937072754, norm_rel=0.022502977401018143, ref_abs_avg=53.065330505371094, test_abs_avg=53.072288513183594
production_forward grad[3] vs paper_forward: mean_abs=1.5929301977157593, max_abs=12.0, mean_rel=0.17912182211875916, max_rel=4280.90673828125, norm_rel=0.024438366293907166, ref_abs_avg=65.63932800292969, test_abs_avg=65.6378173828125
production_forward grad[4] vs paper_forward: mean_abs=1.4728662967681885, max_abs=8.75, mean_rel=0.4544370174407959, max_rel=4937.5, norm_rel=0.023047778755426407, ref_abs_avg=64.2689208984375, test_abs_avg=64.28046417236328
production_forward grad[5] vs paper_forward: mean_abs=1.086397647857666, max_abs=5.0, mean_rel=0.4697164297103882, max_rel=116.19541931152344, norm_rel=0.02279251255095005, ref_abs_avg=48.00648498535156, test_abs_avg=48.00059127807617
production_forward grad[6] vs paper_forward: mean_abs=1.388651967048645, max_abs=10.0, mean_rel=0.15678855776786804, max_rel=1263.4718017578125, norm_rel=0.024165915325284004, ref_abs_avg=57.7789306640625, test_abs_avg=57.78206253051758
production_forward grad[7] vs paper_forward: mean_abs=1.2865211963653564, max_abs=7.5, mean_rel=0.3688121438026428, max_rel=3499.999755859375, norm_rel=0.02269044518470764, ref_abs_avg=57.05849075317383, test_abs_avg=57.06510543823242
production_forward grad[8] vs paper_forward: mean_abs=0.9809989929199219, max_abs=3.9375, mean_rel=0.0840141549706459, max_rel=7.459312915802002, norm_rel=0.021462053060531616, ref_abs_avg=45.90339660644531, test_abs_avg=45.95078659057617
production_forward grad[9] vs paper_forward: mean_abs=1.2590296268463135, max_abs=9.0, mean_rel=0.15725496411323547, max_rel=1980.7449951171875, norm_rel=0.024012167006731033, ref_abs_avg=52.79206085205078, test_abs_avg=52.79169464111328
production_forward grad[10] vs paper_forward: mean_abs=1.1609443426132202, max_abs=7.25, mean_rel=0.40142959356307983, max_rel=4062.499755859375, norm_rel=0.022370008751749992, ref_abs_avg=52.166534423828125, test_abs_avg=52.16400909423828
production_forward grad[11] vs paper_forward: mean_abs=0.9208801984786987, max_abs=4.125, mean_rel=0.5515568852424622, max_rel=218.69439697265625, norm_rel=0.022590016946196556, ref_abs_avg=43.131561279296875, test_abs_avg=43.18050003051758
production_forward grad[12] vs paper_forward: mean_abs=1.1591620445251465, max_abs=8.0, mean_rel=0.17006877064704895, max_rel=2321.937255859375, norm_rel=0.023813847452402115, ref_abs_avg=49.00739669799805, test_abs_avg=49.00899887084961
production_forward grad[13] vs paper_forward: mean_abs=1.0691910982131958, max_abs=6.5, mean_rel=0.3137010335922241, max_rel=3374.999755859375, norm_rel=0.022168807685375214, ref_abs_avg=48.49418640136719, test_abs_avg=48.49772644042969
production_forward grad[14] vs paper_forward: mean_abs=0.9035477638244629, max_abs=4.4580078125, mean_rel=0.2756529748439789, max_rel=76.42922973632812, norm_rel=0.02322428487241268, ref_abs_avg=38.55906677246094, test_abs_avg=38.60285186767578
production_forward grad[15] vs paper_forward: mean_abs=1.088478446006775, max_abs=7.5, mean_rel=0.15877047181129456, max_rel=931.5259399414062, norm_rel=0.023754509165883064, ref_abs_avg=46.1102294921875, test_abs_avg=46.10869598388672
production_forward grad[16] vs paper_forward: mean_abs=0.9982225894927979, max_abs=6.0, mean_rel=0.390274316072464, max_rel=3312.499755859375, norm_rel=0.021897906437516212, ref_abs_avg=45.862186431884766, test_abs_avg=45.85905075073242
production_forward grad[17] vs paper_forward: mean_abs=0.8104012608528137, max_abs=3.0, mean_rel=0.11003085970878601, max_rel=15.410441398620605, norm_rel=0.021007493138313293, ref_abs_avg=38.93220138549805, test_abs_avg=38.961158752441406
production_forward grad[18] vs paper_forward: mean_abs=1.0262198448181152, max_abs=7.0, mean_rel=0.1550491750240326, max_rel=1298.049072265625, norm_rel=0.02356438711285591, ref_abs_avg=43.859683990478516, test_abs_avg=43.86311721801758
production_forward grad[19] vs paper_forward: mean_abs=0.9423691034317017, max_abs=5.9375, mean_rel=0.3072469234466553, max_rel=2624.999755859375, norm_rel=0.021967079490423203, ref_abs_avg=43.14059829711914, test_abs_avg=43.14125061035156
production_forward grad[20] vs paper_forward: mean_abs=0.7470741271972656, max_abs=3.25, mean_rel=0.09701792895793915, max_rel=12.67348861694336, norm_rel=0.02258685603737831, ref_abs_avg=33.73028564453125, test_abs_avg=33.70225524902344
production_forward grad[21] vs paper_forward: mean_abs=0.9744461178779602, max_abs=6.0, mean_rel=0.1515541672706604, max_rel=808.4613037109375, norm_rel=0.02345292642712593, ref_abs_avg=41.813865661621094, test_abs_avg=41.816219329833984
production_forward grad[22] vs paper_forward: mean_abs=0.8898379802703857, max_abs=5.5, mean_rel=0.30159395933151245, max_rel=2375.0, norm_rel=0.02173752710223198, ref_abs_avg=41.072288513183594, test_abs_avg=41.07570266723633
production_forward grad[23] vs paper_forward: mean_abs=0.7494773864746094, max_abs=3.25, mean_rel=0.07732155919075012, max_rel=2.6102445125579834, norm_rel=0.0213924590498209, ref_abs_avg=34.36039733886719, test_abs_avg=34.345062255859375
production_forward grad[24] vs paper_forward: mean_abs=0.9182044267654419, max_abs=7.0, mean_rel=0.15132829546928406, max_rel=1087.6922607421875, norm_rel=0.023360267281532288, ref_abs_avg=39.57351303100586, test_abs_avg=39.57609558105469
production_forward grad[25] vs paper_forward: mean_abs=0.8446333408355713, max_abs=5.09375, mean_rel=0.2709250748157501, max_rel=2593.749755859375, norm_rel=0.021659569814801216, ref_abs_avg=39.21565246582031, test_abs_avg=39.21929168701172
production_forward grad[26] vs paper_forward: mean_abs=0.8348188400268555, max_abs=3.625, mean_rel=0.15102052688598633, max_rel=13.835572242736816, norm_rel=0.02397249825298786, ref_abs_avg=34.87395095825195, test_abs_avg=34.8807373046875
production_forward grad[27] vs paper_forward: mean_abs=1.0629186630249023, max_abs=7.0, mean_rel=0.16391827166080475, max_rel=925.7825317382812, norm_rel=0.025202179327607155, ref_abs_avg=42.374935150146484, test_abs_avg=42.3729362487793
production_forward grad[28] vs paper_forward: mean_abs=0.9828612208366394, max_abs=6.0, mean_rel=0.2687155604362488, max_rel=2312.5, norm_rel=0.02337263710796833, ref_abs_avg=42.25495147705078, test_abs_avg=42.243194580078125
production_forward grad[29] vs paper_forward: mean_abs=0.7163181304931641, max_abs=3.34375, mean_rel=0.10575488954782486, max_rel=8.822660446166992, norm_rel=0.023966914042830467, ref_abs_avg=30.681467056274414, test_abs_avg=30.647750854492188
production_forward grad[30] vs paper_forward: mean_abs=0.9799099564552307, max_abs=6.5, mean_rel=0.1760951578617096, max_rel=1462.211669921875, norm_rel=0.025551360100507736, ref_abs_avg=38.54536437988281, test_abs_avg=38.545379638671875
production_forward grad[31] vs paper_forward: mean_abs=0.913643479347229, max_abs=6.0, mean_rel=0.23999717831611633, max_rel=3249.999755859375, norm_rel=0.02403920143842697, ref_abs_avg=38.20744323730469, test_abs_avg=38.212135314941406
production_forward grad[32] vs paper_forward: mean_abs=0.7258520126342773, max_abs=3.0, mean_rel=0.12076548486948013, max_rel=15.944066047668457, norm_rel=0.024215294048190117, ref_abs_avg=30.159446716308594, test_abs_avg=30.111875534057617
production_forward grad[33] vs paper_forward: mean_abs=0.9248281717300415, max_abs=6.0, mean_rel=0.17018666863441467, max_rel=1302.78466796875, norm_rel=0.02542971260845661, ref_abs_avg=36.54292678833008, test_abs_avg=36.543251037597656
production_forward grad[34] vs paper_forward: mean_abs=0.8588067889213562, max_abs=5.5, mean_rel=0.3244500160217285, max_rel=2250.0, norm_rel=0.024002499878406525, ref_abs_avg=35.944236755371094, test_abs_avg=35.94295120239258
production_forward grad[35] vs paper_forward: mean_abs=0.661186695098877, max_abs=2.875, mean_rel=0.08218412846326828, max_rel=3.313528299331665, norm_rel=0.023673687130212784, ref_abs_avg=28.572223663330078, test_abs_avg=28.54350471496582
production_forward grad[36] vs paper_forward: mean_abs=0.8686901330947876, max_abs=6.0, mean_rel=0.15784645080566406, max_rel=1284.078857421875, norm_rel=0.025159046053886414, ref_abs_avg=34.68816375732422, test_abs_avg=34.68803405761719
production_forward grad[37] vs paper_forward: mean_abs=0.8031822443008423, max_abs=4.875, mean_rel=0.257688045501709, max_rel=2343.75, norm_rel=0.02365846186876297, ref_abs_avg=34.053688049316406, test_abs_avg=34.05131530761719
production_forward grad[38] vs paper_forward: mean_abs=0.6556453704833984, max_abs=2.625, mean_rel=0.09426814317703247, max_rel=5.58593225479126, norm_rel=0.02364872395992279, ref_abs_avg=27.754619598388672, test_abs_avg=27.746095657348633
production_forward grad[39] vs paper_forward: mean_abs=0.8230023980140686, max_abs=5.75, mean_rel=0.1626662313938141, max_rel=794.2646484375, norm_rel=0.02503174915909767, ref_abs_avg=32.98695373535156, test_abs_avg=32.98882293701172
production_forward grad[40] vs paper_forward: mean_abs=0.7651584148406982, max_abs=4.5, mean_rel=0.2837027907371521, max_rel=1945.3123779296875, norm_rel=0.023557933047413826, ref_abs_avg=32.60231018066406, test_abs_avg=32.60057830810547
production_forward grad[41] vs paper_forward: mean_abs=0.5959906578063965, max_abs=2.375, mean_rel=0.13959553837776184, max_rel=28.30171012878418, norm_rel=0.023554977029561996, ref_abs_avg=25.12900161743164, test_abs_avg=25.124807357788086
production_forward grad[42] vs paper_forward: mean_abs=0.779330849647522, max_abs=5.5, mean_rel=0.16253319382667542, max_rel=1097.828125, norm_rel=0.02482193522155285, ref_abs_avg=31.501087188720703, test_abs_avg=31.502994537353516
production_forward grad[43] vs paper_forward: mean_abs=0.7255394458770752, max_abs=4.25, mean_rel=0.2923443615436554, max_rel=1968.7498779296875, norm_rel=0.023356668651103973, ref_abs_avg=31.12778091430664, test_abs_avg=31.119827270507812
production_forward grad[44] vs paper_forward: mean_abs=0.5926622152328491, max_abs=2.4375, mean_rel=0.23535311222076416, max_rel=34.74821472167969, norm_rel=0.023151634261012077, ref_abs_avg=24.983293533325195, test_abs_avg=24.953048706054688
production_forward grad[45] vs paper_forward: mean_abs=0.7404825687408447, max_abs=5.5, mean_rel=0.15680468082427979, max_rel=897.65283203125, norm_rel=0.02455640584230423, ref_abs_avg=30.280364990234375, test_abs_avg=30.27843475341797
production_forward grad[46] vs paper_forward: mean_abs=0.6881668567657471, max_abs=4.375, mean_rel=0.28860536217689514, max_rel=2624.999755859375, norm_rel=0.023129789158701897, ref_abs_avg=29.84454345703125, test_abs_avg=29.84374237060547
production_forward grad[47] vs paper_forward: mean_abs=0.5662431716918945, max_abs=2.125, mean_rel=0.12141866981983185, max_rel=17.926618576049805, norm_rel=0.022949952632188797, ref_abs_avg=24.572437286376953, test_abs_avg=24.600818634033203
production_forward grad[48] vs paper_forward: mean_abs=0.7119478583335876, max_abs=5.0, mean_rel=0.1597621738910675, max_rel=1079.304931640625, norm_rel=0.024232134222984314, ref_abs_avg=29.456157684326172, test_abs_avg=29.45647621154785
production_forward grad[49] vs paper_forward: mean_abs=0.6605685949325562, max_abs=3.8125, mean_rel=0.19001899659633636, max_rel=1343.7498779296875, norm_rel=0.022970981895923615, ref_abs_avg=28.785499572753906, test_abs_avg=28.784584045410156
production_forward grad[50] vs paper_forward: mean_abs=0.5940892696380615, max_abs=2.125, mean_rel=0.07352590560913086, max_rel=6.007003307342529, norm_rel=0.02418663538992405, ref_abs_avg=25.114810943603516, test_abs_avg=25.173973083496094
production_forward grad[51] vs paper_forward: mean_abs=0.7814022302627563, max_abs=7.0, mean_rel=0.17466704547405243, max_rel=1412.267578125, norm_rel=0.025603458285331726, ref_abs_avg=30.608728408813477, test_abs_avg=30.606889724731445
production_forward grad[52] vs paper_forward: mean_abs=0.7271133065223694, max_abs=4.5, mean_rel=0.31475746631622314, max_rel=2593.749755859375, norm_rel=0.024281524121761322, ref_abs_avg=29.988391876220703, test_abs_avg=29.986740112304688
production_forward grad[53] vs paper_forward: mean_abs=0.5752806663513184, max_abs=2.25, mean_rel=0.1412545144557953, max_rel=24.3167781829834, norm_rel=0.025868233293294907, ref_abs_avg=22.255334854125977, test_abs_avg=22.22386932373047
production_forward grad[54] vs paper_forward: mean_abs=0.7243630290031433, max_abs=5.375, mean_rel=0.1708623617887497, max_rel=1400.978515625, norm_rel=0.025370193645358086, ref_abs_avg=28.60346031188965, test_abs_avg=28.603336334228516
production_forward grad[55] vs paper_forward: mean_abs=0.6740953922271729, max_abs=4.625, mean_rel=0.28578534722328186, max_rel=1937.4998779296875, norm_rel=0.0240468829870224, ref_abs_avg=28.08812141418457, test_abs_avg=28.081523895263672
production_forward grad[56] vs paper_forward: mean_abs=0.5139293670654297, max_abs=2.3125, mean_rel=0.10815879702568054, max_rel=7.118788242340088, norm_rel=0.02393331564962864, ref_abs_avg=21.67684555053711, test_abs_avg=21.62590980529785
production_forward grad[57] vs paper_forward: mean_abs=0.6699817776679993, max_abs=5.109375, mean_rel=0.17215415835380554, max_rel=1473.1875, norm_rel=0.024902991950511932, ref_abs_avg=26.954294204711914, test_abs_avg=26.954736709594727
production_forward grad[58] vs paper_forward: mean_abs=0.621259331703186, max_abs=4.5, mean_rel=0.23708607256412506, max_rel=1812.4998779296875, norm_rel=0.023221371695399284, ref_abs_avg=26.801342010498047, test_abs_avg=26.802215576171875
production_forward grad[59] vs paper_forward: mean_abs=0.5055720806121826, max_abs=2.125, mean_rel=0.11432188749313354, max_rel=13.804101943969727, norm_rel=0.023727020248770714, ref_abs_avg=20.951313018798828, test_abs_avg=20.984588623046875
production_forward grad[60] vs paper_forward: mean_abs=0.6317020058631897, max_abs=5.0, mean_rel=0.1587199568748474, max_rel=1004.366943359375, norm_rel=0.02453601360321045, ref_abs_avg=25.796756744384766, test_abs_avg=25.794116973876953
production_forward grad[61] vs paper_forward: mean_abs=0.5897541642189026, max_abs=4.0, mean_rel=0.2693408131599426, max_rel=2312.5, norm_rel=0.023316599428653717, ref_abs_avg=25.356014251708984, test_abs_avg=25.360023498535156
production_forward grad[62] vs paper_forward: mean_abs=0.4567677974700928, max_abs=2.0390625, mean_rel=0.13803252577781677, max_rel=20.9173583984375, norm_rel=0.022632397711277008, ref_abs_avg=20.349971771240234, test_abs_avg=20.327348709106445
production_forward grad[63] vs paper_forward: mean_abs=0.5973716974258423, max_abs=5.0, mean_rel=0.15516753494739532, max_rel=994.8053588867188, norm_rel=0.024051759392023087, ref_abs_avg=24.874818801879883, test_abs_avg=24.875041961669922
production_forward grad[64] vs paper_forward: mean_abs=0.551512598991394, max_abs=4.0, mean_rel=0.2579526901245117, max_rel=2187.5, norm_rel=0.022642921656370163, ref_abs_avg=24.36040496826172, test_abs_avg=24.360952377319336
production_forward grad[65] vs paper_forward: mean_abs=0.4281580448150635, max_abs=1.5, mean_rel=0.0955209881067276, max_rel=5.797560214996338, norm_rel=0.021829815581440926, ref_abs_avg=19.525100708007812, test_abs_avg=19.52468490600586
production_forward grad[66] vs paper_forward: mean_abs=0.5621124505996704, max_abs=5.0, mean_rel=0.1526491791009903, max_rel=955.9750366210938, norm_rel=0.023722751066088676, ref_abs_avg=23.751354217529297, test_abs_avg=23.74895477294922
production_forward grad[67] vs paper_forward: mean_abs=0.5197053551673889, max_abs=3.8125, mean_rel=0.216737300157547, max_rel=1578.1248779296875, norm_rel=0.022133860737085342, ref_abs_avg=23.476116180419922, test_abs_avg=23.47559356689453
production_forward grad[68] vs paper_forward: mean_abs=0.4396878480911255, max_abs=2.0, mean_rel=0.1969241052865982, max_rel=66.03558349609375, norm_rel=0.022602567449212074, ref_abs_avg=19.306177139282227, test_abs_avg=19.281600952148438
production_forward grad[69] vs paper_forward: mean_abs=0.5407679080963135, max_abs=5.0, mean_rel=0.15063795447349548, max_rel=691.9144897460938, norm_rel=0.023285644128918648, ref_abs_avg=23.23855972290039, test_abs_avg=23.23841094970703
production_forward grad[70] vs paper_forward: mean_abs=0.4953898787498474, max_abs=4.5, mean_rel=0.20398461818695068, max_rel=2687.499755859375, norm_rel=0.02165716327726841, ref_abs_avg=22.919837951660156, test_abs_avg=22.914947509765625
production_forward grad[71] vs paper_forward: mean_abs=0.3962669372558594, max_abs=1.5, mean_rel=0.10935751348733902, max_rel=12.42589282989502, norm_rel=0.021652519702911377, ref_abs_avg=18.86334228515625, test_abs_avg=18.871002197265625
production_forward grad[72] vs paper_forward: mean_abs=0.5160830020904541, max_abs=5.0, mean_rel=0.14063072204589844, max_rel=1349.95263671875, norm_rel=0.022914821282029152, ref_abs_avg=22.48540496826172, test_abs_avg=22.486133575439453
production_forward grad[73] vs paper_forward: mean_abs=0.47426357865333557, max_abs=4.25, mean_rel=0.21995630860328674, max_rel=1328.1248779296875, norm_rel=0.021033329889178276, ref_abs_avg=22.447120666503906, test_abs_avg=22.452945709228516
production_forward grad[74] vs paper_forward: mean_abs=0.43337059020996094, max_abs=1.75, mean_rel=0.09688501805067062, max_rel=9.259259223937988, norm_rel=0.023774638772010803, ref_abs_avg=18.863500595092773, test_abs_avg=18.8309326171875
production_forward grad[75] vs paper_forward: mean_abs=0.5643517971038818, max_abs=4.5, mean_rel=0.1551828533411026, max_rel=1084.09375, norm_rel=0.02466922625899315, ref_abs_avg=22.925931930541992, test_abs_avg=22.92420196533203
production_forward grad[76] vs paper_forward: mean_abs=0.5228561162948608, max_abs=4.0, mean_rel=0.21199043095111847, max_rel=1656.2498779296875, norm_rel=0.023005831986665726, ref_abs_avg=22.695152282714844, test_abs_avg=22.692745208740234
production_forward grad[77] vs paper_forward: mean_abs=0.3953545093536377, max_abs=1.40625, mean_rel=0.0655881017446518, max_rel=1.386226773262024, norm_rel=0.02244919165968895, ref_abs_avg=17.87263298034668, test_abs_avg=17.861120223999023
production_forward grad[78] vs paper_forward: mean_abs=0.5145094394683838, max_abs=4.5, mean_rel=0.15576235949993134, max_rel=1167.1090087890625, norm_rel=0.024069013074040413, ref_abs_avg=21.396961212158203, test_abs_avg=21.395465850830078
production_forward grad[79] vs paper_forward: mean_abs=0.482464075088501, max_abs=4.5, mean_rel=0.20996402204036713, max_rel=1578.1248779296875, norm_rel=0.022666307166218758, ref_abs_avg=21.370182037353516, test_abs_avg=21.369369506835938
production_forward grad[80] vs paper_forward: mean_abs=0.3681192398071289, max_abs=1.3125, mean_rel=0.06487856060266495, max_rel=3.593470811843872, norm_rel=0.020711693912744522, ref_abs_avg=18.398649215698242, test_abs_avg=18.367462158203125
production_forward grad[81] vs paper_forward: mean_abs=0.4869614839553833, max_abs=4.125, mean_rel=0.14392676949501038, max_rel=668.939453125, norm_rel=0.023271122947335243, ref_abs_avg=20.95767593383789, test_abs_avg=20.95659637451172
production_forward grad[82] vs paper_forward: mean_abs=0.44546186923980713, max_abs=4.125, mean_rel=0.18935182690620422, max_rel=1562.4998779296875, norm_rel=0.022058380767703056, ref_abs_avg=20.268108367919922, test_abs_avg=20.271270751953125
production_forward grad[83] vs paper_forward: mean_abs=0.36258363723754883, max_abs=1.4375, mean_rel=0.09284837543964386, max_rel=9.608016967773438, norm_rel=0.022066982463002205, ref_abs_avg=16.27475357055664, test_abs_avg=16.240591049194336
production_forward grad[84] vs paper_forward: mean_abs=0.4548048973083496, max_abs=4.5, mean_rel=0.1479768007993698, max_rel=1431.642578125, norm_rel=0.02278086729347706, ref_abs_avg=20.054670333862305, test_abs_avg=20.05410385131836
production_forward grad[85] vs paper_forward: mean_abs=0.4153518080711365, max_abs=3.5, mean_rel=0.21229296922683716, max_rel=1148.4375, norm_rel=0.021104393526911736, ref_abs_avg=19.720355987548828, test_abs_avg=19.72344207763672
production_forward grad[86] vs paper_forward: mean_abs=0.32803231477737427, max_abs=1.375, mean_rel=0.08649048209190369, max_rel=9.66224479675293, norm_rel=0.02071678824722767, ref_abs_avg=15.653578758239746, test_abs_avg=15.653852462768555
production_forward grad[87] vs paper_forward: mean_abs=0.42371708154678345, max_abs=4.0, mean_rel=0.13637375831604004, max_rel=718.2374877929688, norm_rel=0.022618146613240242, ref_abs_avg=18.867969512939453, test_abs_avg=18.86785125732422
production_forward grad[88] vs paper_forward: mean_abs=0.39704573154449463, max_abs=4.25, mean_rel=0.22169242799282074, max_rel=1937.4998779296875, norm_rel=0.02131623402237892, ref_abs_avg=18.797740936279297, test_abs_avg=18.806034088134766
production_forward grad[89] vs paper_forward: mean_abs=0.31407976150512695, max_abs=1.3125, mean_rel=0.09510256350040436, max_rel=7.323867321014404, norm_rel=0.02003581076860428, ref_abs_avg=15.807146072387695, test_abs_avg=15.794352531433105
production_forward grad[90] vs paper_forward: mean_abs=0.40817341208457947, max_abs=4.5, mean_rel=0.13956129550933838, max_rel=1158.6297607421875, norm_rel=0.021912259981036186, ref_abs_avg=18.789466857910156, test_abs_avg=18.789794921875
production_forward grad[91] vs paper_forward: mean_abs=0.3631652593612671, max_abs=3.5, mean_rel=0.21164387464523315, max_rel=1234.375, norm_rel=0.01981283538043499, ref_abs_avg=18.47493553161621, test_abs_avg=18.47027015686035
production_forward grad[92] vs paper_forward: mean_abs=0.3111060857772827, max_abs=1.25, mean_rel=0.08756454288959503, max_rel=7.009345531463623, norm_rel=0.0202195942401886, ref_abs_avg=15.470868110656738, test_abs_avg=15.463235855102539
production_forward grad[93] vs paper_forward: mean_abs=0.38891351222991943, max_abs=4.375, mean_rel=0.13160939514636993, max_rel=649.5192260742188, norm_rel=0.0215225238353014, ref_abs_avg=18.305553436279297, test_abs_avg=18.30628204345703
production_forward grad[94] vs paper_forward: mean_abs=0.34717434644699097, max_abs=4.0, mean_rel=0.20071373879909515, max_rel=1968.7498779296875, norm_rel=0.020259760320186615, ref_abs_avg=17.495689392089844, test_abs_avg=17.48689842224121
production_forward grad[95] vs paper_forward: mean_abs=0.27758145332336426, max_abs=1.373046875, mean_rel=0.08083498477935791, max_rel=6.957773208618164, norm_rel=0.019029822200536728, ref_abs_avg=15.03215217590332, test_abs_avg=15.039073944091797
production_forward grad[96] vs paper_forward: mean_abs=0.3536849915981293, max_abs=5.5, mean_rel=0.12042669951915741, max_rel=1265.0533447265625, norm_rel=0.0205574631690979, ref_abs_avg=17.52997589111328, test_abs_avg=17.52911376953125
production_forward grad[97] vs paper_forward: mean_abs=0.3166807293891907, max_abs=3.75, mean_rel=0.1615704596042633, max_rel=1437.4998779296875, norm_rel=0.018237045034766197, ref_abs_avg=17.710445404052734, test_abs_avg=17.71649932861328
production_forward2 vs paper_forward output: mean_abs=0.0016269661718979478, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008603539317846298, max_abs=0.3515625, mean_rel=0.0747237503528595, max_rel=105.81389617919922, norm_rel=0.020418575033545494, ref_abs_avg=0.454330176115036, test_abs_avg=0.45432502031326294
production_forward2 grad[1] vs paper_forward: mean_abs=7.340235233306885, max_abs=56.0, mean_rel=0.15736503899097443, max_rel=473.2980041503906, norm_rel=0.02081245929002762, ref_abs_avg=311.4980773925781, test_abs_avg=311.55389404296875
production_forward2 grad[2] vs paper_forward: mean_abs=1.2691869735717773, max_abs=6.375, mean_rel=0.08710305392742157, max_rel=10.917054176330566, norm_rel=0.023396631702780724, ref_abs_avg=53.065330505371094, test_abs_avg=53.08853530883789
production_forward2 grad[3] vs paper_forward: mean_abs=1.6432948112487793, max_abs=14.0, mean_rel=0.1802554428577423, max_rel=3367.837158203125, norm_rel=0.02520095743238926, ref_abs_avg=65.63932800292969, test_abs_avg=65.63832092285156
production_forward2 grad[4] vs paper_forward: mean_abs=1.5196017026901245, max_abs=9.125, mean_rel=0.4713708162307739, max_rel=5031.25, norm_rel=0.023761775344610214, ref_abs_avg=64.2689208984375, test_abs_avg=64.2720947265625
production_forward2 grad[5] vs paper_forward: mean_abs=1.104250431060791, max_abs=5.0, mean_rel=0.756954550743103, max_rel=251.3223114013672, norm_rel=0.023203840479254723, ref_abs_avg=48.00648498535156, test_abs_avg=47.965579986572266
production_forward2 grad[6] vs paper_forward: mean_abs=1.428511142730713, max_abs=9.0, mean_rel=0.16627579927444458, max_rel=1228.25439453125, norm_rel=0.024877920746803284, ref_abs_avg=57.7789306640625, test_abs_avg=57.77933120727539
production_forward2 grad[7] vs paper_forward: mean_abs=1.330101490020752, max_abs=8.0, mean_rel=0.38725268840789795, max_rel=4062.499755859375, norm_rel=0.023450899869203568, ref_abs_avg=57.05849075317383, test_abs_avg=57.06422424316406
production_forward2 grad[8] vs paper_forward: mean_abs=0.9925813674926758, max_abs=4.375, mean_rel=0.08291974663734436, max_rel=5.685770034790039, norm_rel=0.021988915279507637, ref_abs_avg=45.90339660644531, test_abs_avg=45.93618392944336
production_forward2 grad[9] vs paper_forward: mean_abs=1.2946419715881348, max_abs=8.3125, mean_rel=0.16526898741722107, max_rel=2148.344482421875, norm_rel=0.024686740711331367, ref_abs_avg=52.79206085205078, test_abs_avg=52.792266845703125
production_forward2 grad[10] vs paper_forward: mean_abs=1.1963505744934082, max_abs=7.0, mean_rel=0.4045356512069702, max_rel=3499.999755859375, norm_rel=0.023038992658257484, ref_abs_avg=52.166534423828125, test_abs_avg=52.1579475402832
production_forward2 grad[11] vs paper_forward: mean_abs=0.9377661943435669, max_abs=4.0, mean_rel=0.4283298850059509, max_rel=152.8173065185547, norm_rel=0.022810954600572586, ref_abs_avg=43.131561279296875, test_abs_avg=43.189247131347656
production_forward2 grad[12] vs paper_forward: mean_abs=1.188856601715088, max_abs=8.0, mean_rel=0.1763152927160263, max_rel=2372.4052734375, norm_rel=0.024397505447268486, ref_abs_avg=49.00739669799805, test_abs_avg=49.00704574584961
production_forward2 grad[13] vs paper_forward: mean_abs=1.1011171340942383, max_abs=7.0, mean_rel=0.2993939518928528, max_rel=3687.499755859375, norm_rel=0.02280525118112564, ref_abs_avg=48.49418640136719, test_abs_avg=48.49439239501953
production_forward2 grad[14] vs paper_forward: mean_abs=0.9354075193405151, max_abs=4.74609375, mean_rel=0.2638217508792877, max_rel=63.546043395996094, norm_rel=0.02430465817451477, ref_abs_avg=38.55906677246094, test_abs_avg=38.60405731201172
production_forward2 grad[15] vs paper_forward: mean_abs=1.1140440702438354, max_abs=7.0, mean_rel=0.1636945754289627, max_rel=1080.345947265625, norm_rel=0.024305710569024086, ref_abs_avg=46.1102294921875, test_abs_avg=46.109130859375
production_forward2 grad[16] vs paper_forward: mean_abs=1.0249288082122803, max_abs=6.0, mean_rel=0.4000420570373535, max_rel=3624.999755859375, norm_rel=0.022489439696073532, ref_abs_avg=45.862186431884766, test_abs_avg=45.85589599609375
production_forward2 grad[17] vs paper_forward: mean_abs=0.8219714164733887, max_abs=3.25, mean_rel=0.11099527776241302, max_rel=13.901463508605957, norm_rel=0.02143651805818081, ref_abs_avg=38.93220138549805, test_abs_avg=38.93480682373047
production_forward2 grad[18] vs paper_forward: mean_abs=1.0495553016662598, max_abs=7.0, mean_rel=0.15993261337280273, max_rel=1643.2440185546875, norm_rel=0.024086788296699524, ref_abs_avg=43.859683990478516, test_abs_avg=43.86099624633789
production_forward2 grad[19] vs paper_forward: mean_abs=0.9680705070495605, max_abs=6.0, mean_rel=0.3223469853401184, max_rel=3187.499755859375, norm_rel=0.022566944360733032, ref_abs_avg=43.14059829711914, test_abs_avg=43.14078140258789
production_forward2 grad[20] vs paper_forward: mean_abs=0.7688703536987305, max_abs=3.0, mean_rel=0.10071144253015518, max_rel=14.555397987365723, norm_rel=0.02303420938551426, ref_abs_avg=33.73028564453125, test_abs_avg=33.70538330078125
production_forward2 grad[21] vs paper_forward: mean_abs=0.9945414066314697, max_abs=7.0, mean_rel=0.1559711992740631, max_rel=1194.7677001953125, norm_rel=0.02391163259744644, ref_abs_avg=41.813865661621094, test_abs_avg=41.81437301635742
production_forward2 grad[22] vs paper_forward: mean_abs=0.9110760688781738, max_abs=5.5, mean_rel=0.2963036894798279, max_rel=3187.499755859375, norm_rel=0.022268153727054596, ref_abs_avg=41.072288513183594, test_abs_avg=41.07442092895508
production_forward2 grad[23] vs paper_forward: mean_abs=0.7542667388916016, max_abs=3.5, mean_rel=0.07505322992801666, max_rel=3.0532824993133545, norm_rel=0.02176760882139206, ref_abs_avg=34.36039733886719, test_abs_avg=34.34663772583008
production_forward2 grad[24] vs paper_forward: mean_abs=0.9354227781295776, max_abs=7.0, mean_rel=0.15492656826972961, max_rel=1166.0576171875, norm_rel=0.023777473717927933, ref_abs_avg=39.57351303100586, test_abs_avg=39.57558059692383
production_forward2 grad[25] vs paper_forward: mean_abs=0.8637136816978455, max_abs=5.25, mean_rel=0.2901611924171448, max_rel=2437.5, norm_rel=0.022153185680508614, ref_abs_avg=39.21565246582031, test_abs_avg=39.2185173034668
production_forward2 grad[26] vs paper_forward: mean_abs=0.877436637878418, max_abs=4.0, mean_rel=0.17905664443969727, max_rel=20.873319625854492, norm_rel=0.025144405663013458, ref_abs_avg=34.87395095825195, test_abs_avg=34.850040435791016
production_forward2 grad[27] vs paper_forward: mean_abs=1.086024522781372, max_abs=7.0, mean_rel=0.16798606514930725, max_rel=1101.2652587890625, norm_rel=0.025733429938554764, ref_abs_avg=42.374935150146484, test_abs_avg=42.37338638305664
production_forward2 grad[28] vs paper_forward: mean_abs=1.0064921379089355, max_abs=6.0, mean_rel=0.29631680250167847, max_rel=2312.5, norm_rel=0.023919515311717987, ref_abs_avg=42.25495147705078, test_abs_avg=42.24335479736328
production_forward2 grad[29] vs paper_forward: mean_abs=0.7547788619995117, max_abs=3.3125, mean_rel=0.09850063920021057, max_rel=3.589447498321533, norm_rel=0.024672934785485268, ref_abs_avg=30.681467056274414, test_abs_avg=30.657556533813477
production_forward2 grad[30] vs paper_forward: mean_abs=0.9984951019287109, max_abs=7.0, mean_rel=0.17583662271499634, max_rel=1242.0648193359375, norm_rel=0.02601362019777298, ref_abs_avg=38.54536437988281, test_abs_avg=38.544921875
production_forward2 grad[31] vs paper_forward: mean_abs=0.9339554905891418, max_abs=5.75, mean_rel=0.25603845715522766, max_rel=3249.999755859375, norm_rel=0.024577828124165535, ref_abs_avg=38.20744323730469, test_abs_avg=38.210899353027344
production_forward2 grad[32] vs paper_forward: mean_abs=0.7551316022872925, max_abs=2.6875, mean_rel=0.1075187474489212, max_rel=7.633076190948486, norm_rel=0.025036891922354698, ref_abs_avg=30.159446716308594, test_abs_avg=30.10416030883789
production_forward2 grad[33] vs paper_forward: mean_abs=0.9403800964355469, max_abs=6.5, mean_rel=0.17061735689640045, max_rel=1238.4256591796875, norm_rel=0.025844944640994072, ref_abs_avg=36.54292678833008, test_abs_avg=36.54292678833008
production_forward2 grad[34] vs paper_forward: mean_abs=0.8765392303466797, max_abs=5.75, mean_rel=0.32182279229164124, max_rel=2218.75, norm_rel=0.02449679747223854, ref_abs_avg=35.944236755371094, test_abs_avg=35.943756103515625
production_forward2 grad[35] vs paper_forward: mean_abs=0.6573944091796875, max_abs=2.75, mean_rel=0.08353500813245773, max_rel=3.7701241970062256, norm_rel=0.02359316498041153, ref_abs_avg=28.572223663330078, test_abs_avg=28.554195404052734
production_forward2 grad[36] vs paper_forward: mean_abs=0.8837111592292786, max_abs=6.0, mean_rel=0.16064685583114624, max_rel=1370.7010498046875, norm_rel=0.02557828463613987, ref_abs_avg=34.68816375732422, test_abs_avg=34.68751525878906
production_forward2 grad[37] vs paper_forward: mean_abs=0.8183210492134094, max_abs=5.25, mean_rel=0.27350014448165894, max_rel=2656.249755859375, norm_rel=0.02409796603024006, ref_abs_avg=34.053688049316406, test_abs_avg=34.05327606201172
production_forward2 grad[38] vs paper_forward: mean_abs=0.6711301803588867, max_abs=2.625, mean_rel=0.10117356479167938, max_rel=6.145445346832275, norm_rel=0.024357253685593605, ref_abs_avg=27.754619598388672, test_abs_avg=27.740310668945312
production_forward2 grad[39] vs paper_forward: mean_abs=0.8344653844833374, max_abs=5.25, mean_rel=0.16686458885669708, max_rel=736.9459228515625, norm_rel=0.02536408044397831, ref_abs_avg=32.98695373535156, test_abs_avg=32.987701416015625
production_forward2 grad[40] vs paper_forward: mean_abs=0.7784978151321411, max_abs=4.5, mean_rel=0.2925926446914673, max_rel=2093.75, norm_rel=0.02396981045603752, ref_abs_avg=32.60231018066406, test_abs_avg=32.60011672973633
production_forward2 grad[41] vs paper_forward: mean_abs=0.6161065101623535, max_abs=2.75, mean_rel=0.1910952776670456, max_rel=45.18376159667969, norm_rel=0.024372173473238945, ref_abs_avg=25.12900161743164, test_abs_avg=25.175941467285156
production_forward2 grad[42] vs paper_forward: mean_abs=0.7898640632629395, max_abs=5.5, mean_rel=0.16244696080684662, max_rel=1461.3150634765625, norm_rel=0.02516462281346321, ref_abs_avg=31.501087188720703, test_abs_avg=31.50201416015625
production_forward2 grad[43] vs paper_forward: mean_abs=0.7340517044067383, max_abs=4.5, mean_rel=0.3029893636703491, max_rel=2031.2498779296875, norm_rel=0.02363400347530842, ref_abs_avg=31.12778091430664, test_abs_avg=31.11994171142578
production_forward2 grad[44] vs paper_forward: mean_abs=0.5881036520004272, max_abs=2.625, mean_rel=0.2322639524936676, max_rel=42.91425704956055, norm_rel=0.0233335979282856, ref_abs_avg=24.983293533325195, test_abs_avg=24.948829650878906
production_forward2 grad[45] vs paper_forward: mean_abs=0.7500511407852173, max_abs=5.5, mean_rel=0.1578921675682068, max_rel=951.3858032226562, norm_rel=0.024856436997652054, ref_abs_avg=30.280364990234375, test_abs_avg=30.278554916381836
production_forward2 grad[46] vs paper_forward: mean_abs=0.6971662640571594, max_abs=4.5, mean_rel=0.2877563238143921, max_rel=2062.5, norm_rel=0.02344328537583351, ref_abs_avg=29.84454345703125, test_abs_avg=29.84465217590332
production_forward2 grad[47] vs paper_forward: mean_abs=0.580632209777832, max_abs=2.0, mean_rel=0.13067519664764404, max_rel=20.41229820251465, norm_rel=0.02331133745610714, ref_abs_avg=24.572437286376953, test_abs_avg=24.592422485351562
production_forward2 grad[48] vs paper_forward: mean_abs=0.7187924385070801, max_abs=5.5, mean_rel=0.1590188890695572, max_rel=829.762451171875, norm_rel=0.024473292753100395, ref_abs_avg=29.456157684326172, test_abs_avg=29.456491470336914
production_forward2 grad[49] vs paper_forward: mean_abs=0.6672040224075317, max_abs=4.203125, mean_rel=0.18989238142967224, max_rel=1281.25, norm_rel=0.0231980811804533, ref_abs_avg=28.785499572753906, test_abs_avg=28.785240173339844
production_forward2 grad[50] vs paper_forward: mean_abs=0.5961475372314453, max_abs=2.3125, mean_rel=0.08275102078914642, max_rel=7.908718109130859, norm_rel=0.02428424544632435, ref_abs_avg=25.114810943603516, test_abs_avg=25.149675369262695
production_forward2 grad[51] vs paper_forward: mean_abs=0.7918640971183777, max_abs=7.0, mean_rel=0.17773538827896118, max_rel=1705.8741455078125, norm_rel=0.025940299034118652, ref_abs_avg=30.608728408813477, test_abs_avg=30.606136322021484
production_forward2 grad[52] vs paper_forward: mean_abs=0.7375127077102661, max_abs=4.53125, mean_rel=0.3366779685020447, max_rel=2843.749755859375, norm_rel=0.024627694860100746, ref_abs_avg=29.988391876220703, test_abs_avg=29.98762321472168
production_forward2 grad[53] vs paper_forward: mean_abs=0.5916171073913574, max_abs=2.0, mean_rel=0.1527390480041504, max_rel=24.203380584716797, norm_rel=0.02645164728164673, ref_abs_avg=22.255334854125977, test_abs_avg=22.21895408630371
production_forward2 grad[54] vs paper_forward: mean_abs=0.7330062985420227, max_abs=5.625, mean_rel=0.172784686088562, max_rel=1978.5745849609375, norm_rel=0.025662412866950035, ref_abs_avg=28.60346031188965, test_abs_avg=28.602365493774414
production_forward2 grad[55] vs paper_forward: mean_abs=0.6824338436126709, max_abs=4.5, mean_rel=0.2803148627281189, max_rel=1812.4998779296875, norm_rel=0.024325361475348473, ref_abs_avg=28.08812141418457, test_abs_avg=28.08111000061035
production_forward2 grad[56] vs paper_forward: mean_abs=0.5230340957641602, max_abs=1.875, mean_rel=0.12410467863082886, max_rel=8.636969566345215, norm_rel=0.024226555600762367, ref_abs_avg=21.67684555053711, test_abs_avg=21.61315155029297
production_forward2 grad[57] vs paper_forward: mean_abs=0.6777467727661133, max_abs=5.21875, mean_rel=0.17305123805999756, max_rel=1418.2288818359375, norm_rel=0.02518109790980816, ref_abs_avg=26.954294204711914, test_abs_avg=26.954898834228516
production_forward2 grad[58] vs paper_forward: mean_abs=0.6297339200973511, max_abs=4.5, mean_rel=0.23765771090984344, max_rel=1656.2498779296875, norm_rel=0.023547958582639694, ref_abs_avg=26.801342010498047, test_abs_avg=26.801029205322266
production_forward2 grad[59] vs paper_forward: mean_abs=0.5112588405609131, max_abs=2.0, mean_rel=0.13522857427597046, max_rel=14.53665828704834, norm_rel=0.024209214374423027, ref_abs_avg=20.951313018798828, test_abs_avg=20.98870277404785
production_forward2 grad[60] vs paper_forward: mean_abs=0.6384062767028809, max_abs=4.5, mean_rel=0.15748462080955505, max_rel=1163.1419677734375, norm_rel=0.024793973192572594, ref_abs_avg=25.796756744384766, test_abs_avg=25.793476104736328
production_forward2 grad[61] vs paper_forward: mean_abs=0.5961412191390991, max_abs=3.75, mean_rel=0.2702096700668335, max_rel=1937.4998779296875, norm_rel=0.023551713675260544, ref_abs_avg=25.356014251708984, test_abs_avg=25.3598690032959
production_forward2 grad[62] vs paper_forward: mean_abs=0.45859575271606445, max_abs=2.1953125, mean_rel=0.0961524024605751, max_rel=9.633611679077148, norm_rel=0.02265065163373947, ref_abs_avg=20.349971771240234, test_abs_avg=20.338218688964844
production_forward2 grad[63] vs paper_forward: mean_abs=0.6031525135040283, max_abs=4.5, mean_rel=0.15710678696632385, max_rel=1089.815185546875, norm_rel=0.02427632361650467, ref_abs_avg=24.874818801879883, test_abs_avg=24.874385833740234
production_forward2 grad[64] vs paper_forward: mean_abs=0.5561782121658325, max_abs=4.0, mean_rel=0.27008938789367676, max_rel=2375.0, norm_rel=0.022838426753878593, ref_abs_avg=24.36040496826172, test_abs_avg=24.359962463378906
production_forward2 grad[65] vs paper_forward: mean_abs=0.44983506202697754, max_abs=2.0, mean_rel=0.09750518202781677, max_rel=4.128783226013184, norm_rel=0.02263113670051098, ref_abs_avg=19.525100708007812, test_abs_avg=19.52507209777832
production_forward2 grad[66] vs paper_forward: mean_abs=0.567574143409729, max_abs=4.953125, mean_rel=0.1575322151184082, max_rel=1422.0408935546875, norm_rel=0.023923072963953018, ref_abs_avg=23.751354217529297, test_abs_avg=23.749174118041992
production_forward2 grad[67] vs paper_forward: mean_abs=0.523800253868103, max_abs=4.0, mean_rel=0.22066041827201843, max_rel=1367.1873779296875, norm_rel=0.022314131259918213, ref_abs_avg=23.476116180419922, test_abs_avg=23.47486114501953
production_forward2 grad[68] vs paper_forward: mean_abs=0.4318324327468872, max_abs=1.75, mean_rel=0.19069477915763855, max_rel=65.09226989746094, norm_rel=0.022288745269179344, ref_abs_avg=19.306177139282227, test_abs_avg=19.291423797607422
production_forward2 grad[69] vs paper_forward: mean_abs=0.5447412133216858, max_abs=6.0, mean_rel=0.15170714259147644, max_rel=653.18359375, norm_rel=0.02342471480369568, ref_abs_avg=23.23855972290039, test_abs_avg=23.238046646118164
production_forward2 grad[70] vs paper_forward: mean_abs=0.49937593936920166, max_abs=4.5, mean_rel=0.20343691110610962, max_rel=2781.249755859375, norm_rel=0.021827302873134613, ref_abs_avg=22.919837951660156, test_abs_avg=22.91472625732422
production_forward2 grad[71] vs paper_forward: mean_abs=0.38855838775634766, max_abs=1.5, mean_rel=0.1125921905040741, max_rel=14.218255043029785, norm_rel=0.021427638828754425, ref_abs_avg=18.86334228515625, test_abs_avg=18.878267288208008
production_forward2 grad[72] vs paper_forward: mean_abs=0.5191453099250793, max_abs=5.0, mean_rel=0.1411609798669815, max_rel=1336.248779296875, norm_rel=0.023048484697937965, ref_abs_avg=22.48540496826172, test_abs_avg=22.485671997070312
production_forward2 grad[73] vs paper_forward: mean_abs=0.4774993658065796, max_abs=4.25, mean_rel=0.22547069191932678, max_rel=1250.0, norm_rel=0.02118191495537758, ref_abs_avg=22.447120666503906, test_abs_avg=22.45252227783203
production_forward2 grad[74] vs paper_forward: mean_abs=0.4450950622558594, max_abs=1.9375, mean_rel=0.09203670918941498, max_rel=4.59455680847168, norm_rel=0.024468550458550453, ref_abs_avg=18.863500595092773, test_abs_avg=18.836000442504883
production_forward2 grad[75] vs paper_forward: mean_abs=0.5700110197067261, max_abs=5.0, mean_rel=0.15711568295955658, max_rel=1005.7373657226562, norm_rel=0.024906400591135025, ref_abs_avg=22.925931930541992, test_abs_avg=22.92365264892578
production_forward2 grad[76] vs paper_forward: mean_abs=0.5282340049743652, max_abs=4.0, mean_rel=0.2195453941822052, max_rel=2062.5, norm_rel=0.02322359010577202, ref_abs_avg=22.695152282714844, test_abs_avg=22.693115234375
production_forward2 grad[77] vs paper_forward: mean_abs=0.4044778347015381, max_abs=1.625, mean_rel=0.06509783118963242, max_rel=1.4154962301254272, norm_rel=0.02308596670627594, ref_abs_avg=17.87263298034668, test_abs_avg=17.857851028442383
production_forward2 grad[78] vs paper_forward: mean_abs=0.5190153121948242, max_abs=5.5, mean_rel=0.15683868527412415, max_rel=943.451171875, norm_rel=0.024272272363305092, ref_abs_avg=21.396961212158203, test_abs_avg=21.395450592041016
production_forward2 grad[79] vs paper_forward: mean_abs=0.4872264564037323, max_abs=4.0, mean_rel=0.21815912425518036, max_rel=1656.2498779296875, norm_rel=0.022897420451045036, ref_abs_avg=21.370182037353516, test_abs_avg=21.368547439575195
production_forward2 grad[80] vs paper_forward: mean_abs=0.3748779296875, max_abs=1.375, mean_rel=0.06629811972379684, max_rel=2.6868488788604736, norm_rel=0.021061472594738007, ref_abs_avg=18.398649215698242, test_abs_avg=18.37382698059082
production_forward2 grad[81] vs paper_forward: mean_abs=0.4900139570236206, max_abs=4.375, mean_rel=0.14463438093662262, max_rel=662.401123046875, norm_rel=0.023416174575686455, ref_abs_avg=20.95767593383789, test_abs_avg=20.956920623779297
production_forward2 grad[82] vs paper_forward: mean_abs=0.44973593950271606, max_abs=4.0, mean_rel=0.19213606417179108, max_rel=1624.9998779296875, norm_rel=0.02226782590150833, ref_abs_avg=20.268108367919922, test_abs_avg=20.270689010620117
production_forward2 grad[83] vs paper_forward: mean_abs=0.3708038330078125, max_abs=1.375, mean_rel=0.08483059704303741, max_rel=6.016234874725342, norm_rel=0.022374490275979042, ref_abs_avg=16.27475357055664, test_abs_avg=16.243080139160156
production_forward2 grad[84] vs paper_forward: mean_abs=0.45747110247612, max_abs=4.5, mean_rel=0.14819622039794922, max_rel=1180.157470703125, norm_rel=0.022917652502655983, ref_abs_avg=20.054670333862305, test_abs_avg=20.05373764038086
production_forward2 grad[85] vs paper_forward: mean_abs=0.41880807280540466, max_abs=3.5, mean_rel=0.20790567994117737, max_rel=1062.5, norm_rel=0.02126414328813553, ref_abs_avg=19.720355987548828, test_abs_avg=19.723417282104492
production_forward2 grad[86] vs paper_forward: mean_abs=0.3315901756286621, max_abs=1.28125, mean_rel=0.10430194437503815, max_rel=16.092458724975586, norm_rel=0.020893177017569542, ref_abs_avg=15.653578758239746, test_abs_avg=15.652765274047852
production_forward2 grad[87] vs paper_forward: mean_abs=0.42598357796669006, max_abs=4.0, mean_rel=0.13694007694721222, max_rel=904.6170654296875, norm_rel=0.022730669006705284, ref_abs_avg=18.867969512939453, test_abs_avg=18.867952346801758
production_forward2 grad[88] vs paper_forward: mean_abs=0.3996274471282959, max_abs=4.5, mean_rel=0.22751590609550476, max_rel=2078.125, norm_rel=0.021431798115372658, ref_abs_avg=18.797740936279297, test_abs_avg=18.805809020996094
production_forward2 grad[89] vs paper_forward: mean_abs=0.3125038146972656, max_abs=1.140625, mean_rel=0.08954944461584091, max_rel=6.791053771972656, norm_rel=0.019892482087016106, ref_abs_avg=15.807146072387695, test_abs_avg=15.789091110229492
production_forward2 grad[90] vs paper_forward: mean_abs=0.4095069169998169, max_abs=4.5, mean_rel=0.14299416542053223, max_rel=1146.864501953125, norm_rel=0.021975373849272728, ref_abs_avg=18.789466857910156, test_abs_avg=18.78976058959961
production_forward2 grad[91] vs paper_forward: mean_abs=0.3646374046802521, max_abs=3.75, mean_rel=0.21541249752044678, max_rel=1398.4373779296875, norm_rel=0.019881607964634895, ref_abs_avg=18.47493553161621, test_abs_avg=18.470294952392578
production_forward2 grad[92] vs paper_forward: mean_abs=0.31771087646484375, max_abs=1.25, mean_rel=0.09073954820632935, max_rel=6.20057487487793, norm_rel=0.02075655572116375, ref_abs_avg=15.470868110656738, test_abs_avg=15.470526695251465
production_forward2 grad[93] vs paper_forward: mean_abs=0.38947945833206177, max_abs=4.625, mean_rel=0.1304137110710144, max_rel=619.9315185546875, norm_rel=0.02153906226158142, ref_abs_avg=18.305553436279297, test_abs_avg=18.306236267089844
production_forward2 grad[94] vs paper_forward: mean_abs=0.3476641774177551, max_abs=3.625, mean_rel=0.2032485157251358, max_rel=2218.75, norm_rel=0.020290717482566833, ref_abs_avg=17.495689392089844, test_abs_avg=17.486927032470703
production_forward2 grad[95] vs paper_forward: mean_abs=0.27758145332336426, max_abs=1.373046875, mean_rel=0.08083498477935791, max_rel=6.957773208618164, norm_rel=0.019029822200536728, ref_abs_avg=15.03215217590332, test_abs_avg=15.039073944091797
production_forward2 grad[96] vs paper_forward: mean_abs=0.3536849915981293, max_abs=5.5, mean_rel=0.12042669951915741, max_rel=1265.0533447265625, norm_rel=0.0205574631690979, ref_abs_avg=17.52997589111328, test_abs_avg=17.52911376953125
production_forward2 grad[97] vs paper_forward: mean_abs=0.3166807293891907, max_abs=3.75, mean_rel=0.1615704596042633, max_rel=1437.4998779296875, norm_rel=0.018237045034766197, ref_abs_avg=17.710445404052734, test_abs_avg=17.71649932861328
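For reference, the statistics printed on each line above can be reproduced with a small comparison helper. This is a sketch under assumptions: the function name `compare_tensors` and the `eps` guard are hypothetical, not the actual test harness, but the formulas match the reported fields (elementwise absolute/relative error plus a whole-tensor norm ratio).

```python
# Hypothetical sketch of the per-tensor metrics behind each log line.
# `compare_tensors` and `eps` are assumed names, not the real harness.
import torch

def compare_tensors(ref: torch.Tensor, test: torch.Tensor,
                    eps: float = 1e-12) -> dict:
    """Error statistics in the same format as the log lines above."""
    ref, test = ref.float(), test.float()
    diff = (test - ref).abs()
    # Elementwise relative error; eps guards division by near-zero refs.
    rel = diff / ref.abs().clamp_min(eps)
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        "max_rel": rel.max().item(),
        # Whole-tensor ratio: stays small even when a few near-zero
        # reference elements inflate the elementwise max_rel.
        "norm_rel": (diff.norm() / ref.norm().clamp_min(eps)).item(),
        "ref_abs_avg": ref.abs().mean().item(),
        "test_abs_avg": test.abs().mean().item(),
    }
```

Note the recurring pattern in the log: `max_rel` in the hundreds or thousands alongside `norm_rel` near 0.02. That combination is typical when comparing reduced-precision kernels, since elementwise relative error blows up wherever the reference value is close to zero while the tensors as a whole still agree to about 2%.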

