identity layers + randn queries

/usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7627: UserWarning: 
Online softmax is disabled on the fly since Inductor decides to
split the reduction. Cut an issue to PyTorch if this is an
important use case and you want to speed it up with online
softmax.

  warnings.warn(
/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
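The TensorFloat32 warning above points at a one-line opt-in. A minimal sketch of the setting it suggests (assuming an Ampere-or-newer GPU where TF32 tensor cores actually apply):

```python
import torch

# Opt in to TF32 tensor cores for float32 matmuls, as the Inductor
# warning suggests. "high" trades a little float32 mantissa precision
# for throughput; "highest" (the default) keeps full float32 precision.
torch.set_float32_matmul_precision("high")

print(torch.get_float32_matmul_precision())  # "high"
```

Set this once at startup, before compiling or benchmarking, so every matmul lowered by Inductor sees the same precision mode.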
/usr/local/lib/python3.12/dist-packages/torch/_inductor/select_algorithm.py:3464: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  current_size = base.storage().size()
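The TypedStorage deprecation is raised inside Inductor itself, not user code, but the migration it points at is straightforward. A minimal sketch of the old versus new storage accessors (note the unit change: `TypedStorage.size()` counts elements, `UntypedStorage.nbytes()` counts bytes):

```python
import torch

t = torch.arange(6, dtype=torch.float32)

# Deprecated path (as in the traceback above): t.storage() returns a
# TypedStorage whose size() is measured in elements of the dtype.
# Preferred path: untyped_storage(), measured in raw bytes.
ustorage = t.untyped_storage()
assert ustorage.nbytes() == t.numel() * t.element_size()  # 6 * 4 bytes
```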
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] Runtime error during autotuning: 
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] CUDA driver error: invalid argument
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] 
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] This may mean this GPU is too small for max_autotune mode.
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] 
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] . 
E0429 23:06:57.575000 3153 torch/_inductor/select_algorithm.py:3727] [0/1] Ignoring this choice.
[... the same autotuning runtime error repeated 11 more times; each Triton choice failed with "CUDA driver error: invalid argument" and was ignored ...]
Autotune Choices Stats:
{"num_choices": 13, "num_triton_choices": 12, "best_kernel": "bmm", "best_time": 2.382848024368286, "best_triton_pos": 1, "best_triton_time": Infinity, "best_triton_kernel": "triton_bmm_0", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2"}
AUTOTUNE bmm(131072x2x1, 131072x1x512)
strides: [1, 131072, 0], [512, 0, 1]
dtypes: torch.float32, torch.float32
  bmm 2.3828 ms 100.0% 
  triton_bmm_0 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2
  triton_bmm_1 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
  triton_bmm_2 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_bmm_3 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
  triton_bmm_4 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_bmm_5 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_bmm_6 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_bmm_7 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_bmm_8 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.2378 seconds and 0.0003 seconds precompiling for 13 choices
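The "Autotune Choices Stats" line is JSON, except that it uses the bare token `Infinity` for kernels that never completed (here, every Triton bmm choice). Python's `json` module accepts that token by default (`allow_nan`), so the line can be parsed directly; a small sketch, with the stats abbreviated from this log:

```python
import json
import math

line = ('{"num_choices": 13, "num_triton_choices": 12, '
        '"best_kernel": "bmm", "best_time": 2.382848024368286, '
        '"best_triton_pos": 1, "best_triton_time": Infinity}')

stats = json.loads(line)  # json.loads maps Infinity -> float("inf")

# An infinite best_triton_time means no Triton candidate ever ran,
# so the ATen bmm won by default rather than by benchmark.
assert math.isinf(stats["best_triton_time"])
print(stats["best_kernel"], stats["best_time"])
```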
Autotune Choices Stats:
{"num_choices": 18, "num_triton_choices": 17, "best_kernel": "mm", "best_time": 0.35363200306892395, "best_triton_pos": 1, "best_triton_time": 0.35366401076316833, "best_triton_kernel": "triton_mm_18", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8"}
AUTOTUNE mm(512x1, 1x262144)
strides: [1, 512], [0, 1]
dtypes: torch.float32, torch.float32
  mm 0.3536 ms 100.0% 
  triton_mm_18 0.3537 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8
  triton_mm_24 0.3537 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_21 0.3543 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_23 0.3543 ms 99.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_15 0.3547 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
  triton_mm_17 0.3553 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_mm_19 0.3556 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_22 0.3557 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
  triton_mm_16 0.3564 ms 99.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.8293 seconds and 0.0520 seconds precompiling for 18 choices
Autotune Choices Stats:
{"num_choices": 18, "num_triton_choices": 17, "best_kernel": "triton_mm_38", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8", "best_time": 0.17609600722789764, "best_triton_pos": 0}
AUTOTUNE mm(512x1, 1x131072)
strides: [1, 512], [0, 1]
dtypes: torch.float32, torch.float32
  triton_mm_38 0.1761 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  mm 0.1770 ms 99.5% 
  triton_mm_33 0.1774 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_37 0.1774 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_35 0.1774 ms 99.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8
  triton_mm_34 0.1781 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_mm_36 0.1781 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_39 0.1781 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
  triton_mm_40 0.1781 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_41 0.1781 ms 98.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 0.7150 seconds and 0.3453 seconds precompiling for 18 choices

paper_forward fwd+bwd:  379.556 ms
paper_forward bwd-only: 293.947 ms
paper_forward peak allocated: fwd=29.705 GiB, fwd+bwd=31.823 GiB
paper_forward peak reserved:  fwd=29.744 GiB, fwd+bwd=32.494 GiB
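The peak-memory lines above look like readings from `torch.cuda.max_memory_allocated` and `torch.cuda.max_memory_reserved`. A hedged sketch of how such figures are typically collected (the helper name is ours, not from the benchmark script):

```python
import torch

def peak_gib():
    """Return (allocated, reserved) peak memory in GiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    return (torch.cuda.max_memory_allocated() / 2**30,
            torch.cuda.max_memory_reserved() / 2**30)

# Typical usage: reset the counters, run the workload, then read the peaks.
if torch.cuda.is_available():
    torch.cuda.reset_peak_memory_stats()
# ... run fwd / fwd+bwd here ...
print(peak_gib())
```

"Allocated" is what tensors actually occupy; "reserved" additionally counts memory the CUDA caching allocator holds but has not handed out, which is why the reserved figures above are slightly larger.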
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 6.41s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
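The 20 "Autotuning kernel ... with config" lines above enumerate a full cross product of num_warps ∈ {1, 2, 4, 8, 16} and num_stages ∈ {1, 2, 3, 4}, which is how a `triton.autotune` config list is commonly built. A sketch of that grid in plain Python (the actual kernel's decorator may differ):

```python
from itertools import product

# The sweep printed above: 5 warp counts x 4 stage counts = 20 configs,
# each with num_ctas=1 and maxnreg=None left at their defaults.
configs = [{"num_warps": w, "num_ctas": 1, "num_stages": s, "maxnreg": None}
           for w, s in product((1, 2, 4, 8, 16), (1, 2, 3, 4))]

assert len(configs) == 20
assert configs[0]["num_warps"] == 1 and configs[0]["num_stages"] == 1
```

This matches the log's ordering (stages vary fastest, warps slowest), and explains why each autotune pass takes several seconds: every key triggers 20 compile-and-benchmark cycles.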
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_out_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16'),
finished after 3.65s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
[20-config sweep (num_warps 1-16 x num_stages 1-4) elided; identical to the enumeration above]
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 7.30s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 8.45s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 8.61s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 4.79s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
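Each sweep above is the full cross product of launch parameters that the autotuner times before caching the fastest choice per key (the tuple of shapes and dtypes). A minimal, GPU-free sketch of that selection loop; the `bench` cost function here is a hypothetical stand-in for timing the compiled kernel, not the real Triton machinery:

```python
from itertools import product

# Hypothetical stand-in for benchmarking one launch config; the real
# autotuner times the compiled Triton kernel on the GPU instead.
def bench(num_warps: int, num_stages: int) -> float:
    # Toy cost model: favors few warps and moderate pipelining.
    return num_warps * 1.0 + abs(num_stages - 2) * 0.5

# The grid seen in the log: num_warps x num_stages (num_ctas fixed at 1).
configs = list(product([1, 2, 4, 8, 16], [1, 2, 3, 4]))

# Time every config and keep the fastest. The winner is cached per input
# "key" (shapes and dtypes), which is why the sweep reruns for each new key.
best = min(configs, key=lambda c: bench(*c))
print(best)  # -> (1, 2), i.e. num_warps=1, num_stages=2
```

This also explains why the selected config differs between runs of the same kernel: each distinct key is tuned independently.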
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32'), finished after 7.51s; best config selected: num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel: swept BLOCK_BATCH_SEQ {128, 256} x BLOCK_HIDDEN {32, 64} x num_warps {4, 8}, all with num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel, with key as (131072, 512, 1, 'torch.float32', 'torch.float32'), finished after 1.51s; best config selected: BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel, with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 4.79s; best config selected: num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel: swept BLOCK_BATCH_SEQ {128, 256} x BLOCK_HIDDEN {32, 64} x num_warps {4, 8}, all with num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function phase_2_reduce_grad_pseudo_query_kernel, with key as (131072, 512, 'torch.float32', 'torch.float32'), finished after 1.47s; best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'), finished after 24.12s; best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel: swept BLOCK_BATCH_SEQ {128, 256} x BLOCK_HIDDEN {32, 64} x num_warps {4, 8}, all with num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel, with key as (131072, 512, 8, 'torch.float32', 'torch.float32'), finished after 1.52s; best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'), finished after 21.15s; best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'), finished after 15.44s; best config selected: num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'), finished after 10.45s; best config selected: num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
production_forward fwd+bwd:  112.013 ms
production_forward bwd-only: 91.617 ms
production_forward peak allocated: fwd=2.067 GiB, fwd+bwd=5.946 GiB
production_forward peak reserved:  fwd=2.203 GiB, fwd+bwd=6.078 GiB
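Since the two timings above are fwd+bwd and bwd-only, the forward-only cost can be backed out by subtraction, and the memory lines give the extra peak allocation the backward pass adds. A quick check of the arithmetic:

```python
# Numbers reported by the benchmark above.
fwd_bwd_ms = 112.013   # production_forward fwd+bwd
bwd_only_ms = 91.617   # production_forward bwd-only
print(f"fwd-only ~= {fwd_bwd_ms - bwd_only_ms:.3f} ms")  # -> 20.396 ms

# Backward roughly triples the peak allocated memory over forward alone.
extra_gib = 5.946 - 2.067
print(f"extra peak allocated for bwd ~= {extra_gib:.3f} GiB")  # -> 3.879 GiB
```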
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 7.65s; best config selected: num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel: swept num_warps {1, 2, 4, 8, 16} x num_stages {1, 2, 3, 4}, all with num_ctas: 1, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel, with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 4.58s; best config selected: num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 20.28s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 19.00s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 15.08s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 10.35s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
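The sweeps above show Triton's autotuner timing every (num_warps, num_ctas, num_stages) combination for a given key and keeping the fastest. A minimal stdlib-only sketch of that selection loop, with the real kernel launch replaced by a hypothetical `run_config` stand-in (the actual Triton kernels are not reproduced here, and the toy workload is only meant to make timings config-dependent):

```python
import itertools
import time

def run_config(num_warps, num_stages):
    # Hypothetical stand-in for launching the kernel under one config;
    # here the "work" is a trivial loop whose cost scales with the config.
    total = 0
    for _ in range(1000 * num_warps * num_stages):
        total += 1
    return total

def autotune():
    # Cartesian sweep mirroring the log: warps in {1, 2, 4, 8, 16}, stages 1..4.
    best_cfg, best_t = None, float("inf")
    for num_warps, num_stages in itertools.product([1, 2, 4, 8, 16], range(1, 5)):
        start = time.perf_counter()
        run_config(num_warps, num_stages)
        elapsed = time.perf_counter() - start
        if elapsed < best_t:
            best_cfg, best_t = (num_warps, num_stages), elapsed
    return best_cfg
```

The real autotuner additionally caches the winner per key (the tuple of shapes and dtypes printed in the log), which is why each distinct key triggers its own sweep.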
production_forward2 fwd+bwd:  224.389 ms
production_forward2 bwd-only: 202.214 ms
production_forward2 peak allocated: fwd=2.567 GiB, fwd+bwd=5.946 GiB
production_forward2 peak reserved:  fwd=2.953 GiB, fwd+bwd=8.703 GiB
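The timing and peak-memory lines above are typical of a harness that resets CUDA's peak-memory counters, runs the forward (and forward+backward) passes repeatedly, and reads back `torch.cuda.max_memory_allocated` / `torch.cuda.max_memory_reserved`. A hedged, GPU-free sketch of just the timing part (`bench_ms` is an illustrative name, not from the original harness):

```python
import time

def bench_ms(fn, warmup=3, iters=10):
    # Time a callable and report mean milliseconds per call.
    # A real CUDA harness would call torch.cuda.synchronize()
    # before starting and after stopping the clock.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

ms = bench_ms(lambda: sum(range(10_000)))
print(f"fwd+bwd: {ms:8.3f} ms")
```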

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016437419690191746, max_abs=0.0419921875
production_forward grad[0] vs paper_forward: mean_abs=0.008816814050078392, max_abs=0.703125, mean_rel=0.07575202733278275, max_rel=115.92130279541016, norm_rel=0.020706573501229286, ref_abs_avg=0.4606746435165405, test_abs_avg=0.46069106459617615
production_forward grad[1] vs paper_forward: mean_abs=7.482598781585693, max_abs=72.0, mean_rel=0.38754597306251526, max_rel=2564.451416015625, norm_rel=0.020601509138941765, ref_abs_avg=320.8928527832031, test_abs_avg=320.81964111328125
production_forward grad[2] vs paper_forward: mean_abs=1.3616628646850586, max_abs=5.5, mean_rel=0.14114511013031006, max_rel=7.425687789916992, norm_rel=0.024179114028811455, ref_abs_avg=54.85408020019531, test_abs_avg=54.964263916015625
production_forward grad[3] vs paper_forward: mean_abs=1.6893064975738525, max_abs=11.75, mean_rel=0.16541613638401031, max_rel=3830.77978515625, norm_rel=0.02474384754896164, ref_abs_avg=68.65489196777344, test_abs_avg=68.65849304199219
production_forward grad[4] vs paper_forward: mean_abs=1.6553542613983154, max_abs=12.0, mean_rel=0.1866776943206787, max_rel=1097.7088623046875, norm_rel=0.02452625334262848, ref_abs_avg=67.88655090332031, test_abs_avg=67.88615417480469
production_forward grad[5] vs paper_forward: mean_abs=1.1399710178375244, max_abs=4.5, mean_rel=0.08585543930530548, max_rel=3.4445040225982666, norm_rel=0.025316277518868446, ref_abs_avg=45.44083786010742, test_abs_avg=45.460906982421875
production_forward grad[6] vs paper_forward: mean_abs=1.4234594106674194, max_abs=9.0, mean_rel=0.18932577967643738, max_rel=2827.30224609375, norm_rel=0.024377737194299698, ref_abs_avg=58.71061706542969, test_abs_avg=58.70948028564453
production_forward grad[7] vs paper_forward: mean_abs=1.3952919244766235, max_abs=8.75, mean_rel=0.16047191619873047, max_rel=1412.1375732421875, norm_rel=0.02415282092988491, ref_abs_avg=58.09241485595703, test_abs_avg=58.089210510253906
production_forward grad[8] vs paper_forward: mean_abs=1.057131290435791, max_abs=4.0625, mean_rel=0.17117345333099365, max_rel=23.636411666870117, norm_rel=0.022910984233021736, ref_abs_avg=45.92921447753906, test_abs_avg=45.903255462646484
production_forward grad[9] vs paper_forward: mean_abs=1.2789781093597412, max_abs=9.0, mean_rel=0.15485917031764984, max_rel=1257.8006591796875, norm_rel=0.024027103558182716, ref_abs_avg=53.49093246459961, test_abs_avg=53.49431228637695
production_forward grad[10] vs paper_forward: mean_abs=1.2482545375823975, max_abs=8.0, mean_rel=0.15753622353076935, max_rel=940.3853759765625, norm_rel=0.023810598999261856, ref_abs_avg=52.71092224121094, test_abs_avg=52.71435546875
production_forward grad[11] vs paper_forward: mean_abs=0.9203147888183594, max_abs=4.0, mean_rel=0.07215717434883118, max_rel=3.98907470703125, norm_rel=0.022787833586335182, ref_abs_avg=41.09593963623047, test_abs_avg=41.11952590942383
production_forward grad[12] vs paper_forward: mean_abs=1.177842378616333, max_abs=8.0, mean_rel=0.15740156173706055, max_rel=2593.891845703125, norm_rel=0.023843001574277878, ref_abs_avg=49.64415740966797, test_abs_avg=49.643890380859375
production_forward grad[13] vs paper_forward: mean_abs=1.1462562084197998, max_abs=7.0, mean_rel=0.16036581993103027, max_rel=1088.5201416015625, norm_rel=0.023520803079009056, ref_abs_avg=49.01813507080078, test_abs_avg=49.017784118652344
production_forward grad[14] vs paper_forward: mean_abs=0.8727321624755859, max_abs=3.0, mean_rel=0.07149948179721832, max_rel=2.857724189758301, norm_rel=0.02276647463440895, ref_abs_avg=38.26472091674805, test_abs_avg=38.18743896484375
production_forward grad[15] vs paper_forward: mean_abs=1.092236876487732, max_abs=7.5, mean_rel=0.16962739825248718, max_rel=1706.4913330078125, norm_rel=0.023639438673853874, ref_abs_avg=46.45845413208008, test_abs_avg=46.462223052978516
production_forward grad[16] vs paper_forward: mean_abs=1.0612945556640625, max_abs=6.5, mean_rel=0.166168212890625, max_rel=2763.694091796875, norm_rel=0.023208213970065117, ref_abs_avg=46.022666931152344, test_abs_avg=46.02880859375
production_forward grad[17] vs paper_forward: mean_abs=0.8273859024047852, max_abs=3.25, mean_rel=0.07700466364622116, max_rel=4.150567531585693, norm_rel=0.023045126348733902, ref_abs_avg=36.42527770996094, test_abs_avg=36.429222106933594
production_forward grad[18] vs paper_forward: mean_abs=1.0263092517852783, max_abs=7.0, mean_rel=0.15577590465545654, max_rel=1315.024169921875, norm_rel=0.0234123133122921, ref_abs_avg=44.05963134765625, test_abs_avg=44.065860748291016
production_forward grad[19] vs paper_forward: mean_abs=0.9980553388595581, max_abs=6.25, mean_rel=0.14705100655555725, max_rel=1106.7664794921875, norm_rel=0.02294529415667057, ref_abs_avg=43.728694915771484, test_abs_avg=43.734561920166016
production_forward grad[20] vs paper_forward: mean_abs=0.7833776473999023, max_abs=3.625, mean_rel=0.0997304618358612, max_rel=9.266443252563477, norm_rel=0.021975206211209297, ref_abs_avg=35.592742919921875, test_abs_avg=35.60558319091797
production_forward grad[21] vs paper_forward: mean_abs=0.9811875820159912, max_abs=7.5, mean_rel=0.1624719798564911, max_rel=1170.7974853515625, norm_rel=0.0232100673019886, ref_abs_avg=42.479183197021484, test_abs_avg=42.479095458984375
production_forward grad[22] vs paper_forward: mean_abs=0.9540704488754272, max_abs=6.25, mean_rel=0.15586791932582855, max_rel=956.131103515625, norm_rel=0.02297532930970192, ref_abs_avg=41.73329162597656, test_abs_avg=41.737361907958984
production_forward grad[23] vs paper_forward: mean_abs=0.7098884582519531, max_abs=3.0, mean_rel=0.14895924925804138, max_rel=12.985095024108887, norm_rel=0.022291090339422226, ref_abs_avg=32.07307052612305, test_abs_avg=32.13265609741211
production_forward grad[24] vs paper_forward: mean_abs=0.9251799583435059, max_abs=6.0, mean_rel=0.16195990145206451, max_rel=1783.8736572265625, norm_rel=0.023216374218463898, ref_abs_avg=40.03559875488281, test_abs_avg=40.038875579833984
production_forward grad[25] vs paper_forward: mean_abs=0.900181770324707, max_abs=6.0, mean_rel=0.14982536435127258, max_rel=1038.9853515625, norm_rel=0.02282264642417431, ref_abs_avg=39.66090393066406, test_abs_avg=39.6663818359375
production_forward grad[26] vs paper_forward: mean_abs=0.9207088947296143, max_abs=3.5, mean_rel=0.1730339080095291, max_rel=14.866291999816895, norm_rel=0.02442532777786255, ref_abs_avg=37.35575485229492, test_abs_avg=37.312843322753906
production_forward grad[27] vs paper_forward: mean_abs=1.1052955389022827, max_abs=7.25, mean_rel=0.1678743064403534, max_rel=1013.5103149414062, norm_rel=0.024953676387667656, ref_abs_avg=44.50688934326172, test_abs_avg=44.507545471191406
production_forward grad[28] vs paper_forward: mean_abs=1.0802810192108154, max_abs=7.25, mean_rel=0.180691659450531, max_rel=1251.5318603515625, norm_rel=0.024868423119187355, ref_abs_avg=43.597782135009766, test_abs_avg=43.6014404296875
production_forward grad[29] vs paper_forward: mean_abs=0.8435540199279785, max_abs=3.125, mean_rel=0.08095486462116241, max_rel=1.6950793266296387, norm_rel=0.023372743278741837, ref_abs_avg=35.97844696044922, test_abs_avg=35.85871505737305
production_forward grad[30] vs paper_forward: mean_abs=1.0163602828979492, max_abs=7.0, mean_rel=0.1688426434993744, max_rel=1052.03955078125, norm_rel=0.025198178365826607, ref_abs_avg=40.47704315185547, test_abs_avg=40.47380065917969
production_forward grad[31] vs paper_forward: mean_abs=0.9935811758041382, max_abs=6.0, mean_rel=0.16750568151474, max_rel=1078.310302734375, norm_rel=0.025009972974658012, ref_abs_avg=39.8715705871582, test_abs_avg=39.867774963378906
production_forward grad[32] vs paper_forward: mean_abs=0.7495404481887817, max_abs=3.5, mean_rel=0.22967121005058289, max_rel=36.571292877197266, norm_rel=0.025032490491867065, ref_abs_avg=30.48016357421875, test_abs_avg=30.512371063232422
production_forward grad[33] vs paper_forward: mean_abs=0.9388604164123535, max_abs=6.0, mean_rel=0.18161484599113464, max_rel=1924.329833984375, norm_rel=0.02492120862007141, ref_abs_avg=37.762351989746094, test_abs_avg=37.76023864746094
production_forward grad[34] vs paper_forward: mean_abs=0.9151344299316406, max_abs=6.0, mean_rel=0.17475759983062744, max_rel=1467.5626220703125, norm_rel=0.024658827111124992, ref_abs_avg=37.234310150146484, test_abs_avg=37.23210906982422
production_forward grad[35] vs paper_forward: mean_abs=0.7148661613464355, max_abs=3.125, mean_rel=0.10999496281147003, max_rel=4.843050956726074, norm_rel=0.024039417505264282, ref_abs_avg=29.338233947753906, test_abs_avg=29.290491104125977
production_forward grad[36] vs paper_forward: mean_abs=0.8744961619377136, max_abs=5.515625, mean_rel=0.1585027426481247, max_rel=1471.89208984375, norm_rel=0.024790871888399124, ref_abs_avg=35.369808197021484, test_abs_avg=35.37042236328125
production_forward grad[37] vs paper_forward: mean_abs=0.8559786081314087, max_abs=5.5, mean_rel=0.16511493921279907, max_rel=1214.287841796875, norm_rel=0.024393824860453606, ref_abs_avg=35.15031433105469, test_abs_avg=35.152435302734375
production_forward grad[38] vs paper_forward: mean_abs=0.631777286529541, max_abs=2.75, mean_rel=0.1803465038537979, max_rel=45.88920974731445, norm_rel=0.023027848452329636, ref_abs_avg=28.093454360961914, test_abs_avg=28.12567901611328
production_forward grad[39] vs paper_forward: mean_abs=0.8156226873397827, max_abs=5.25, mean_rel=0.1703992784023285, max_rel=1683.54443359375, norm_rel=0.024382853880524635, ref_abs_avg=33.515655517578125, test_abs_avg=33.512210845947266
production_forward grad[40] vs paper_forward: mean_abs=0.8005096316337585, max_abs=5.5, mean_rel=0.14497065544128418, max_rel=544.26318359375, norm_rel=0.024241456761956215, ref_abs_avg=33.18366241455078, test_abs_avg=33.183067321777344
production_forward grad[41] vs paper_forward: mean_abs=0.6247611045837402, max_abs=2.5, mean_rel=0.09909714758396149, max_rel=4.801375865936279, norm_rel=0.024074025452136993, ref_abs_avg=25.901668548583984, test_abs_avg=25.87092399597168
production_forward grad[42] vs paper_forward: mean_abs=0.7721772193908691, max_abs=5.5, mean_rel=0.1625087559223175, max_rel=1696.8695068359375, norm_rel=0.024219337850809097, ref_abs_avg=31.946491241455078, test_abs_avg=31.949262619018555
production_forward grad[43] vs paper_forward: mean_abs=0.7578346729278564, max_abs=5.0, mean_rel=0.17687919735908508, max_rel=1581.1436767578125, norm_rel=0.023909425362944603, ref_abs_avg=31.75054168701172, test_abs_avg=31.749481201171875
production_forward grad[44] vs paper_forward: mean_abs=0.6249544620513916, max_abs=2.625, mean_rel=0.14354993402957916, max_rel=27.30851936340332, norm_rel=0.02407355047762394, ref_abs_avg=25.902637481689453, test_abs_avg=25.888795852661133
production_forward grad[45] vs paper_forward: mean_abs=0.7350080013275146, max_abs=5.5, mean_rel=0.16310761868953705, max_rel=1461.2901611328125, norm_rel=0.023867113515734673, ref_abs_avg=30.844890594482422, test_abs_avg=30.845964431762695
production_forward grad[46] vs paper_forward: mean_abs=0.7233396768569946, max_abs=4.625, mean_rel=0.16005438566207886, max_rel=983.6210327148438, norm_rel=0.02357723005115986, ref_abs_avg=30.724163055419922, test_abs_avg=30.71904754638672
production_forward grad[47] vs paper_forward: mean_abs=0.5560379028320312, max_abs=2.375, mean_rel=0.10209402441978455, max_rel=12.801383972167969, norm_rel=0.02172333374619484, ref_abs_avg=26.603153228759766, test_abs_avg=26.62710189819336
production_forward grad[48] vs paper_forward: mean_abs=0.7030609250068665, max_abs=4.625, mean_rel=0.1539401113986969, max_rel=867.5191040039062, norm_rel=0.023619981482625008, ref_abs_avg=29.81830406188965, test_abs_avg=29.820911407470703
production_forward grad[49] vs paper_forward: mean_abs=0.6925461292266846, max_abs=4.625, mean_rel=0.15776684880256653, max_rel=621.2492065429688, norm_rel=0.023289140313863754, ref_abs_avg=29.760005950927734, test_abs_avg=29.756664276123047
production_forward grad[50] vs paper_forward: mean_abs=0.6460399627685547, max_abs=2.5, mean_rel=0.11287176609039307, max_rel=8.834131240844727, norm_rel=0.02647511288523674, ref_abs_avg=24.23819351196289, test_abs_avg=24.24705696105957
production_forward grad[51] vs paper_forward: mean_abs=0.7952213883399963, max_abs=5.875, mean_rel=0.17569556832313538, max_rel=2003.9730224609375, norm_rel=0.025446411222219467, ref_abs_avg=31.310768127441406, test_abs_avg=31.30965232849121
production_forward grad[52] vs paper_forward: mean_abs=0.7804017066955566, max_abs=5.125, mean_rel=0.17366915941238403, max_rel=1343.5623779296875, norm_rel=0.025068117305636406, ref_abs_avg=31.143884658813477, test_abs_avg=31.14417266845703
production_forward grad[53] vs paper_forward: mean_abs=0.5873053073883057, max_abs=2.34375, mean_rel=0.06680150330066681, max_rel=3.8770759105682373, norm_rel=0.024821531027555466, ref_abs_avg=24.69098663330078, test_abs_avg=24.68634796142578
production_forward grad[54] vs paper_forward: mean_abs=0.7252108454704285, max_abs=5.0, mean_rel=0.16998718678951263, max_rel=1575.28564453125, norm_rel=0.02486657164990902, ref_abs_avg=29.218984603881836, test_abs_avg=29.218685150146484
production_forward grad[55] vs paper_forward: mean_abs=0.7138490676879883, max_abs=4.65625, mean_rel=0.16601018607616425, max_rel=777.5525512695312, norm_rel=0.024929126724600792, ref_abs_avg=28.676355361938477, test_abs_avg=28.683094024658203
production_forward grad[56] vs paper_forward: mean_abs=0.5703296661376953, max_abs=2.1875, mean_rel=0.20406624674797058, max_rel=50.061424255371094, norm_rel=0.02535710297524929, ref_abs_avg=22.279172897338867, test_abs_avg=22.28266143798828
production_forward grad[57] vs paper_forward: mean_abs=0.6714335083961487, max_abs=4.5, mean_rel=0.16000288724899292, max_rel=803.3784790039062, norm_rel=0.024235747754573822, ref_abs_avg=27.66536521911621, test_abs_avg=27.666534423828125
production_forward grad[58] vs paper_forward: mean_abs=0.6529416441917419, max_abs=4.25, mean_rel=0.15540297329425812, max_rel=484.9640808105469, norm_rel=0.023987390100955963, ref_abs_avg=27.217971801757812, test_abs_avg=27.218425750732422
production_forward grad[59] vs paper_forward: mean_abs=0.5052402019500732, max_abs=2.0625, mean_rel=0.10767503082752228, max_rel=11.832927703857422, norm_rel=0.02348112128674984, ref_abs_avg=21.6556396484375, test_abs_avg=21.61128807067871
production_forward grad[60] vs paper_forward: mean_abs=0.6195583343505859, max_abs=5.75, mean_rel=0.15361689031124115, max_rel=823.81884765625, norm_rel=0.023743683472275734, ref_abs_avg=26.078887939453125, test_abs_avg=26.080459594726562
production_forward grad[61] vs paper_forward: mean_abs=0.6100759506225586, max_abs=4.0, mean_rel=0.16041138768196106, max_rel=1441.0157470703125, norm_rel=0.02329404279589653, ref_abs_avg=26.153697967529297, test_abs_avg=26.157241821289062
production_forward grad[62] vs paper_forward: mean_abs=0.4723140597343445, max_abs=1.75, mean_rel=0.08689624816179276, max_rel=5.842272758483887, norm_rel=0.02328657917678356, ref_abs_avg=20.377391815185547, test_abs_avg=20.378023147583008
production_forward grad[63] vs paper_forward: mean_abs=0.584986686706543, max_abs=4.625, mean_rel=0.14853805303573608, max_rel=1208.37841796875, norm_rel=0.023403197526931763, ref_abs_avg=25.00870704650879, test_abs_avg=25.010208129882812
production_forward grad[64] vs paper_forward: mean_abs=0.574589729309082, max_abs=3.84765625, mean_rel=0.16786983609199524, max_rel=1551.2772216796875, norm_rel=0.02327664941549301, ref_abs_avg=24.68195915222168, test_abs_avg=24.679929733276367
production_forward grad[65] vs paper_forward: mean_abs=0.41602087020874023, max_abs=1.5, mean_rel=0.09241342544555664, max_rel=8.716750144958496, norm_rel=0.020914016291499138, ref_abs_avg=19.890369415283203, test_abs_avg=19.904979705810547
production_forward grad[66] vs paper_forward: mean_abs=0.5467660427093506, max_abs=4.0, mean_rel=0.14906758069992065, max_rel=1249.6907958984375, norm_rel=0.02292049489915371, ref_abs_avg=23.83356285095215, test_abs_avg=23.83346176147461
production_forward grad[67] vs paper_forward: mean_abs=0.5419193506240845, max_abs=3.5, mean_rel=0.14680436253547668, max_rel=737.8959350585938, norm_rel=0.02249504067003727, ref_abs_avg=24.11071014404297, test_abs_avg=24.115312576293945
production_forward grad[68] vs paper_forward: mean_abs=0.4400343894958496, max_abs=1.375, mean_rel=0.12221448123455048, max_rel=19.114253997802734, norm_rel=0.022677112370729446, ref_abs_avg=19.429149627685547, test_abs_avg=19.443572998046875
production_forward grad[69] vs paper_forward: mean_abs=0.5327223539352417, max_abs=4.0, mean_rel=0.15623517334461212, max_rel=726.9307861328125, norm_rel=0.022692782804369926, ref_abs_avg=23.41402816772461, test_abs_avg=23.415634155273438
production_forward grad[70] vs paper_forward: mean_abs=0.5169877409934998, max_abs=3.625, mean_rel=0.1541823446750641, max_rel=2215.748779296875, norm_rel=0.022422775626182556, ref_abs_avg=22.995624542236328, test_abs_avg=22.989892959594727
production_forward grad[71] vs paper_forward: mean_abs=0.41585487127304077, max_abs=1.53125, mean_rel=0.09893353283405304, max_rel=4.819463729858398, norm_rel=0.022717487066984177, ref_abs_avg=18.31871795654297, test_abs_avg=18.333566665649414
production_forward grad[72] vs paper_forward: mean_abs=0.4994838833808899, max_abs=3.875, mean_rel=0.14381001889705658, max_rel=802.3468017578125, norm_rel=0.022000398486852646, ref_abs_avg=22.637577056884766, test_abs_avg=22.638011932373047
production_forward grad[73] vs paper_forward: mean_abs=0.49031132459640503, max_abs=4.5, mean_rel=0.14501240849494934, max_rel=673.5411376953125, norm_rel=0.021621521562337875, ref_abs_avg=22.658477783203125, test_abs_avg=22.663434982299805
production_forward grad[74] vs paper_forward: mean_abs=0.43056726455688477, max_abs=1.75, mean_rel=0.14513075351715088, max_rel=26.368349075317383, norm_rel=0.021622007712721825, ref_abs_avg=20.038665771484375, test_abs_avg=20.045665740966797
production_forward grad[75] vs paper_forward: mean_abs=0.5444650650024414, max_abs=3.875, mean_rel=0.15718728303909302, max_rel=819.8798828125, norm_rel=0.024388134479522705, ref_abs_avg=22.322551727294922, test_abs_avg=22.323413848876953
production_forward grad[76] vs paper_forward: mean_abs=0.5377369523048401, max_abs=4.25, mean_rel=0.15495818853378296, max_rel=854.646728515625, norm_rel=0.02398495562374592, ref_abs_avg=22.421655654907227, test_abs_avg=22.42613983154297
production_forward grad[77] vs paper_forward: mean_abs=0.39663660526275635, max_abs=1.375, mean_rel=0.3861289322376251, max_rel=110.76757049560547, norm_rel=0.021719105541706085, ref_abs_avg=18.447174072265625, test_abs_avg=18.426204681396484
production_forward grad[78] vs paper_forward: mean_abs=0.5001846551895142, max_abs=4.0, mean_rel=0.14932112395763397, max_rel=657.03125, norm_rel=0.023755131289362907, ref_abs_avg=21.047748565673828, test_abs_avg=21.04777717590332
production_forward grad[79] vs paper_forward: mean_abs=0.4921099841594696, max_abs=4.5, mean_rel=0.15455502271652222, max_rel=1605.3646240234375, norm_rel=0.023467814549803734, ref_abs_avg=21.0301456451416, test_abs_avg=21.029563903808594
production_forward grad[80] vs paper_forward: mean_abs=0.35842621326446533, max_abs=1.4375, mean_rel=0.19366121292114258, max_rel=36.5760383605957, norm_rel=0.020795617252588272, ref_abs_avg=17.345293045043945, test_abs_avg=17.32439613342285
production_forward grad[81] vs paper_forward: mean_abs=0.4710250496864319, max_abs=4.875, mean_rel=0.14564359188079834, max_rel=742.9908447265625, norm_rel=0.02290983684360981, ref_abs_avg=20.51471710205078, test_abs_avg=20.51609230041504
production_forward grad[82] vs paper_forward: mean_abs=0.45605865120887756, max_abs=4.0, mean_rel=0.14230147004127502, max_rel=1635.4329833984375, norm_rel=0.022477395832538605, ref_abs_avg=20.243885040283203, test_abs_avg=20.237619400024414
production_forward grad[83] vs paper_forward: mean_abs=0.34921741485595703, max_abs=2.0, mean_rel=0.18309813737869263, max_rel=12.271023750305176, norm_rel=0.02332095243036747, ref_abs_avg=15.508295059204102, test_abs_avg=15.513388633728027
production_forward grad[84] vs paper_forward: mean_abs=0.4357345998287201, max_abs=3.625, mean_rel=0.1432793140411377, max_rel=729.1466674804688, norm_rel=0.022381898015737534, ref_abs_avg=19.476016998291016, test_abs_avg=19.47641372680664
production_forward grad[85] vs paper_forward: mean_abs=0.4244496822357178, max_abs=3.953125, mean_rel=0.13473737239837646, max_rel=463.63592529296875, norm_rel=0.021601157262921333, ref_abs_avg=19.727210998535156, test_abs_avg=19.72119903564453
production_forward grad[86] vs paper_forward: mean_abs=0.36565661430358887, max_abs=1.3125, mean_rel=0.08529555052518845, max_rel=4.45060920715332, norm_rel=0.022988511249423027, ref_abs_avg=15.67647933959961, test_abs_avg=15.648396492004395
production_forward grad[87] vs paper_forward: mean_abs=0.41498345136642456, max_abs=3.5, mean_rel=0.132937490940094, max_rel=601.6880493164062, norm_rel=0.021892448887228966, ref_abs_avg=18.99587631225586, test_abs_avg=18.994670867919922
production_forward grad[88] vs paper_forward: mean_abs=0.3959193229675293, max_abs=3.5, mean_rel=0.12767396867275238, max_rel=470.4412536621094, norm_rel=0.02134728617966175, ref_abs_avg=18.59715461730957, test_abs_avg=18.595468521118164
production_forward grad[89] vs paper_forward: mean_abs=0.3306533098220825, max_abs=1.25, mean_rel=0.1964043527841568, max_rel=30.861068725585938, norm_rel=0.022368498146533966, ref_abs_avg=14.846080780029297, test_abs_avg=14.816324234008789
production_forward grad[90] vs paper_forward: mean_abs=0.3825991451740265, max_abs=4.0, mean_rel=0.13396765291690826, max_rel=495.6353759765625, norm_rel=0.021517643705010414, ref_abs_avg=17.864595413208008, test_abs_avg=17.864971160888672
production_forward grad[91] vs paper_forward: mean_abs=0.38030150532722473, max_abs=3.375, mean_rel=0.13062307238578796, max_rel=569.8818969726562, norm_rel=0.021199427545070648, ref_abs_avg=18.01663589477539, test_abs_avg=18.01374053955078
production_forward grad[92] vs paper_forward: mean_abs=0.3227614164352417, max_abs=1.28125, mean_rel=0.09637042880058289, max_rel=6.3803019523620605, norm_rel=0.021693043410778046, ref_abs_avg=15.099485397338867, test_abs_avg=15.12238883972168
production_forward grad[93] vs paper_forward: mean_abs=0.3609021306037903, max_abs=3.75, mean_rel=0.12837886810302734, max_rel=767.3264770507812, norm_rel=0.020934609696269035, ref_abs_avg=17.414005279541016, test_abs_avg=17.415002822875977
production_forward grad[94] vs paper_forward: mean_abs=0.35644376277923584, max_abs=4.0, mean_rel=0.13476116955280304, max_rel=668.099365234375, norm_rel=0.02092677354812622, ref_abs_avg=17.227882385253906, test_abs_avg=17.21976661682129
production_forward grad[95] vs paper_forward: mean_abs=0.308890163898468, max_abs=1.125, mean_rel=0.20351234078407288, max_rel=57.52255630493164, norm_rel=0.01997840404510498, ref_abs_avg=15.359833717346191, test_abs_avg=15.361616134643555
production_forward grad[96] vs paper_forward: mean_abs=0.34265565872192383, max_abs=3.5, mean_rel=0.12814366817474365, max_rel=893.501953125, norm_rel=0.020327292382717133, ref_abs_avg=17.131135940551758, test_abs_avg=17.131237030029297
production_forward grad[97] vs paper_forward: mean_abs=0.3299690783023834, max_abs=3.5, mean_rel=0.11867514252662659, max_rel=479.639404296875, norm_rel=0.01934143900871277, ref_abs_avg=17.26851463317871, test_abs_avg=17.2684268951416
production_forward2 vs paper_forward output: mean_abs=0.0016437419690191746, max_abs=0.0419921875
production_forward2 grad[0] vs paper_forward: mean_abs=0.008813943713903427, max_abs=0.71875, mean_rel=0.07562369108200073, max_rel=104.14708709716797, norm_rel=0.02068948745727539, ref_abs_avg=0.4606746435165405, test_abs_avg=0.46068212389945984
production_forward2 grad[1] vs paper_forward: mean_abs=7.456153392791748, max_abs=72.0, mean_rel=0.4687599241733551, max_rel=4048.075927734375, norm_rel=0.02050580456852913, ref_abs_avg=320.8928527832031, test_abs_avg=320.8064270019531
production_forward2 grad[2] vs paper_forward: mean_abs=1.365107774734497, max_abs=4.75, mean_rel=0.13741850852966309, max_rel=13.648441314697266, norm_rel=0.0243825763463974, ref_abs_avg=54.85408020019531, test_abs_avg=55.007080078125
production_forward2 grad[3] vs paper_forward: mean_abs=1.683678150177002, max_abs=14.3515625, mean_rel=0.16620460152626038, max_rel=1483.6334228515625, norm_rel=0.024642083793878555, ref_abs_avg=68.65489196777344, test_abs_avg=68.65522766113281
production_forward2 grad[4] vs paper_forward: mean_abs=1.64499831199646, max_abs=10.8125, mean_rel=0.18276798725128174, max_rel=1658.228271484375, norm_rel=0.024345099925994873, ref_abs_avg=67.88655090332031, test_abs_avg=67.88026428222656
production_forward2 grad[5] vs paper_forward: mean_abs=1.155664324760437, max_abs=4.0, mean_rel=0.09308946132659912, max_rel=4.580146312713623, norm_rel=0.025445157662034035, ref_abs_avg=45.44083786010742, test_abs_avg=45.44548034667969
production_forward2 grad[6] vs paper_forward: mean_abs=1.425268292427063, max_abs=9.0, mean_rel=0.18316467106342316, max_rel=2027.3848876953125, norm_rel=0.02441095933318138, ref_abs_avg=58.71061706542969, test_abs_avg=58.70701217651367
production_forward2 grad[7] vs paper_forward: mean_abs=1.39663565158844, max_abs=9.0, mean_rel=0.15463422238826752, max_rel=1167.5560302734375, norm_rel=0.02417331375181675, ref_abs_avg=58.09241485595703, test_abs_avg=58.088584899902344
production_forward2 grad[8] vs paper_forward: mean_abs=1.0203056335449219, max_abs=4.25, mean_rel=0.2934616804122925, max_rel=89.57867431640625, norm_rel=0.022250933572649956, ref_abs_avg=45.92921447753906, test_abs_avg=45.914337158203125
production_forward2 grad[9] vs paper_forward: mean_abs=1.2844889163970947, max_abs=8.0, mean_rel=0.15729624032974243, max_rel=1765.9251708984375, norm_rel=0.024131815880537033, ref_abs_avg=53.49093246459961, test_abs_avg=53.493431091308594
production_forward2 grad[10] vs paper_forward: mean_abs=1.2552281618118286, max_abs=7.5, mean_rel=0.15272757411003113, max_rel=1146.6143798828125, norm_rel=0.023931236937642097, ref_abs_avg=52.71092224121094, test_abs_avg=52.71331787109375
production_forward2 grad[11] vs paper_forward: mean_abs=0.9371795654296875, max_abs=4.0, mean_rel=0.07746320962905884, max_rel=4.386458396911621, norm_rel=0.022959792986512184, ref_abs_avg=41.09593963623047, test_abs_avg=41.114662170410156
production_forward2 grad[12] vs paper_forward: mean_abs=1.1806282997131348, max_abs=7.0, mean_rel=0.15438011288642883, max_rel=2790.1787109375, norm_rel=0.023901451379060745, ref_abs_avg=49.64415740966797, test_abs_avg=49.64453125
production_forward2 grad[13] vs paper_forward: mean_abs=1.1498491764068604, max_abs=7.5, mean_rel=0.15661005675792694, max_rel=764.16162109375, norm_rel=0.023593274876475334, ref_abs_avg=49.01813507080078, test_abs_avg=49.017311096191406
production_forward2 grad[14] vs paper_forward: mean_abs=0.8705661296844482, max_abs=3.25, mean_rel=0.07904624938964844, max_rel=3.8759186267852783, norm_rel=0.02302595041692257, ref_abs_avg=38.26472091674805, test_abs_avg=38.20808029174805
production_forward2 grad[15] vs paper_forward: mean_abs=1.0962471961975098, max_abs=7.0, mean_rel=0.16807854175567627, max_rel=1763.3612060546875, norm_rel=0.023720787838101387, ref_abs_avg=46.45845413208008, test_abs_avg=46.463623046875
production_forward2 grad[16] vs paper_forward: mean_abs=1.0650309324264526, max_abs=6.046875, mean_rel=0.1685754805803299, max_rel=2996.168701171875, norm_rel=0.023282112553715706, ref_abs_avg=46.022666931152344, test_abs_avg=46.02653121948242
production_forward2 grad[17] vs paper_forward: mean_abs=0.8238277435302734, max_abs=3.5, mean_rel=0.07844950258731842, max_rel=4.663728713989258, norm_rel=0.022889304906129837, ref_abs_avg=36.42527770996094, test_abs_avg=36.41705322265625
production_forward2 grad[18] vs paper_forward: mean_abs=1.027726173400879, max_abs=6.87890625, mean_rel=0.15743732452392578, max_rel=1343.76611328125, norm_rel=0.02343699336051941, ref_abs_avg=44.05963134765625, test_abs_avg=44.065670013427734
production_forward2 grad[19] vs paper_forward: mean_abs=1.0048882961273193, max_abs=6.875, mean_rel=0.15563973784446716, max_rel=1702.7669677734375, norm_rel=0.023104358464479446, ref_abs_avg=43.728694915771484, test_abs_avg=43.73503875732422
production_forward2 grad[20] vs paper_forward: mean_abs=0.7754089832305908, max_abs=3.421875, mean_rel=0.10632605850696564, max_rel=10.9376859664917, norm_rel=0.021899720653891563, ref_abs_avg=35.592742919921875, test_abs_avg=35.64160919189453
production_forward2 grad[21] vs paper_forward: mean_abs=0.9834905862808228, max_abs=6.5, mean_rel=0.16089491546154022, max_rel=1115.07861328125, norm_rel=0.023278165608644485, ref_abs_avg=42.479183197021484, test_abs_avg=42.47901153564453
production_forward2 grad[22] vs paper_forward: mean_abs=0.953899621963501, max_abs=5.75, mean_rel=0.1544266790151596, max_rel=798.9163818359375, norm_rel=0.02298853173851967, ref_abs_avg=41.73329162597656, test_abs_avg=41.736595153808594
production_forward2 grad[23] vs paper_forward: mean_abs=0.7147960662841797, max_abs=3.0, mean_rel=0.13680678606033325, max_rel=16.347814559936523, norm_rel=0.022295642644166946, ref_abs_avg=32.07307052612305, test_abs_avg=32.1283073425293
production_forward2 grad[24] vs paper_forward: mean_abs=0.9265792369842529, max_abs=5.75, mean_rel=0.1609601229429245, max_rel=1910.7568359375, norm_rel=0.023261990398168564, ref_abs_avg=40.03559875488281, test_abs_avg=40.037994384765625
production_forward2 grad[25] vs paper_forward: mean_abs=0.9028737545013428, max_abs=5.5, mean_rel=0.14905257523059845, max_rel=1082.576171875, norm_rel=0.022884158417582512, ref_abs_avg=39.66090393066406, test_abs_avg=39.66534423828125
production_forward2 grad[26] vs paper_forward: mean_abs=0.9093306064605713, max_abs=3.6875, mean_rel=0.16785705089569092, max_rel=18.36908531188965, norm_rel=0.0242286529392004, ref_abs_avg=37.35575485229492, test_abs_avg=37.31934356689453
production_forward2 grad[27] vs paper_forward: mean_abs=1.1032053232192993, max_abs=7.25, mean_rel=0.16743507981300354, max_rel=1090.3934326171875, norm_rel=0.024891173467040062, ref_abs_avg=44.50688934326172, test_abs_avg=44.50593566894531
production_forward2 grad[28] vs paper_forward: mean_abs=1.0786640644073486, max_abs=7.375, mean_rel=0.17530539631843567, max_rel=1135.2764892578125, norm_rel=0.024830332025885582, ref_abs_avg=43.597782135009766, test_abs_avg=43.597259521484375
production_forward2 grad[29] vs paper_forward: mean_abs=0.8342494964599609, max_abs=3.125, mean_rel=0.08471039682626724, max_rel=1.9078114032745361, norm_rel=0.023334214463829994, ref_abs_avg=35.97844696044922, test_abs_avg=35.87055587768555
production_forward2 grad[30] vs paper_forward: mean_abs=1.0172314643859863, max_abs=6.5, mean_rel=0.16909745335578918, max_rel=903.369140625, norm_rel=0.025215547531843185, ref_abs_avg=40.47704315185547, test_abs_avg=40.47343444824219
production_forward2 grad[31] vs paper_forward: mean_abs=0.9943839311599731, max_abs=6.0, mean_rel=0.16377267241477966, max_rel=735.5571899414062, norm_rel=0.025020049884915352, ref_abs_avg=39.8715705871582, test_abs_avg=39.865753173828125
production_forward2 grad[32] vs paper_forward: mean_abs=0.747584342956543, max_abs=3.0, mean_rel=0.28081488609313965, max_rel=42.97257614135742, norm_rel=0.025219684466719627, ref_abs_avg=30.48016357421875, test_abs_avg=30.52093505859375
production_forward2 grad[33] vs paper_forward: mean_abs=0.9398350715637207, max_abs=6.5, mean_rel=0.1832880973815918, max_rel=2102.032470703125, norm_rel=0.024959484115242958, ref_abs_avg=37.762351989746094, test_abs_avg=37.75959396362305
production_forward2 grad[34] vs paper_forward: mean_abs=0.9163400530815125, max_abs=6.0, mean_rel=0.1699177622795105, max_rel=1432.0628662109375, norm_rel=0.024707864969968796, ref_abs_avg=37.234310150146484, test_abs_avg=37.231056213378906
production_forward2 grad[35] vs paper_forward: mean_abs=0.7048006057739258, max_abs=3.0625, mean_rel=0.10704730451107025, max_rel=5.411706447601318, norm_rel=0.02348945289850235, ref_abs_avg=29.338233947753906, test_abs_avg=29.28986930847168
production_forward2 grad[36] vs paper_forward: mean_abs=0.8747209906578064, max_abs=6.1875, mean_rel=0.15883782505989075, max_rel=1747.8895263671875, norm_rel=0.0248129740357399, ref_abs_avg=35.369808197021484, test_abs_avg=35.37084197998047
production_forward2 grad[37] vs paper_forward: mean_abs=0.8571913242340088, max_abs=6.0, mean_rel=0.16736572980880737, max_rel=1445.916748046875, norm_rel=0.024442294612526894, ref_abs_avg=35.15031433105469, test_abs_avg=35.15195083618164
production_forward2 grad[38] vs paper_forward: mean_abs=0.615186870098114, max_abs=2.75, mean_rel=0.16766512393951416, max_rel=41.13265609741211, norm_rel=0.02255474030971527, ref_abs_avg=28.093454360961914, test_abs_avg=28.120956420898438
production_forward2 grad[39] vs paper_forward: mean_abs=0.8166543841362, max_abs=5.1875, mean_rel=0.17021530866622925, max_rel=1382.201171875, norm_rel=0.02442026510834694, ref_abs_avg=33.515655517578125, test_abs_avg=33.51166534423828
production_forward2 grad[40] vs paper_forward: mean_abs=0.8016395568847656, max_abs=5.375, mean_rel=0.14629600942134857, max_rel=498.3329162597656, norm_rel=0.02427206002175808, ref_abs_avg=33.18366241455078, test_abs_avg=33.183284759521484
production_forward2 grad[41] vs paper_forward: mean_abs=0.6364765167236328, max_abs=2.5, mean_rel=0.09796633571386337, max_rel=5.033232688903809, norm_rel=0.024460619315505028, ref_abs_avg=25.901668548583984, test_abs_avg=25.85688591003418
production_forward2 grad[42] vs paper_forward: mean_abs=0.7734895348548889, max_abs=4.75, mean_rel=0.16498008370399475, max_rel=1654.439697265625, norm_rel=0.024251151829957962, ref_abs_avg=31.946491241455078, test_abs_avg=31.949325561523438
production_forward2 grad[43] vs paper_forward: mean_abs=0.7592495679855347, max_abs=5.5, mean_rel=0.1767617017030716, max_rel=1570.311767578125, norm_rel=0.023973139002919197, ref_abs_avg=31.75054168701172, test_abs_avg=31.749897003173828
production_forward2 grad[44] vs paper_forward: mean_abs=0.628652811050415, max_abs=2.5, mean_rel=0.18770407140254974, max_rel=39.64554214477539, norm_rel=0.024059878662228584, ref_abs_avg=25.902637481689453, test_abs_avg=25.89960479736328
production_forward2 grad[45] vs paper_forward: mean_abs=0.7356818914413452, max_abs=4.875, mean_rel=0.16382133960723877, max_rel=1260.1583251953125, norm_rel=0.023890484124422073, ref_abs_avg=30.844890594482422, test_abs_avg=30.84520721435547
production_forward2 grad[46] vs paper_forward: mean_abs=0.7232319116592407, max_abs=4.875, mean_rel=0.16133737564086914, max_rel=893.690673828125, norm_rel=0.02357303909957409, ref_abs_avg=30.724163055419922, test_abs_avg=30.718183517456055
production_forward2 grad[47] vs paper_forward: mean_abs=0.5714073181152344, max_abs=2.5, mean_rel=0.1189807802438736, max_rel=19.366195678710938, norm_rel=0.022202810272574425, ref_abs_avg=26.603153228759766, test_abs_avg=26.62865447998047
production_forward2 grad[48] vs paper_forward: mean_abs=0.7041980028152466, max_abs=5.0, mean_rel=0.1555008590221405, max_rel=966.8958740234375, norm_rel=0.02365235611796379, ref_abs_avg=29.81830406188965, test_abs_avg=29.820240020751953
production_forward2 grad[49] vs paper_forward: mean_abs=0.6943973302841187, max_abs=4.625, mean_rel=0.1561422497034073, max_rel=765.5364379882812, norm_rel=0.023348364979028702, ref_abs_avg=29.760005950927734, test_abs_avg=29.757158279418945
production_forward2 grad[50] vs paper_forward: mean_abs=0.6564802527427673, max_abs=2.75, mean_rel=0.10052305459976196, max_rel=5.6356401443481445, norm_rel=0.02684200555086136, ref_abs_avg=24.23819351196289, test_abs_avg=24.252683639526367
production_forward2 grad[51] vs paper_forward: mean_abs=0.7933751344680786, max_abs=6.0, mean_rel=0.17449459433555603, max_rel=1803.5367431640625, norm_rel=0.025384265929460526, ref_abs_avg=31.310768127441406, test_abs_avg=31.309539794921875
production_forward2 grad[52] vs paper_forward: mean_abs=0.7774540185928345, max_abs=5.25, mean_rel=0.17587801814079285, max_rel=1574.505126953125, norm_rel=0.02496335282921791, ref_abs_avg=31.143884658813477, test_abs_avg=31.141813278198242
production_forward2 grad[53] vs paper_forward: mean_abs=0.592750072479248, max_abs=2.5, mean_rel=0.06812308728694916, max_rel=4.404270648956299, norm_rel=0.025235610082745552, ref_abs_avg=24.69098663330078, test_abs_avg=24.660642623901367
production_forward2 grad[54] vs paper_forward: mean_abs=0.7245990037918091, max_abs=5.25, mean_rel=0.1680596023797989, max_rel=1453.075439453125, norm_rel=0.024844923987984657, ref_abs_avg=29.218984603881836, test_abs_avg=29.218265533447266
production_forward2 grad[55] vs paper_forward: mean_abs=0.713770866394043, max_abs=4.5, mean_rel=0.1660013496875763, max_rel=877.1976318359375, norm_rel=0.024945849552750587, ref_abs_avg=28.676355361938477, test_abs_avg=28.682498931884766
production_forward2 grad[56] vs paper_forward: mean_abs=0.5669689178466797, max_abs=2.25, mean_rel=0.1798797845840454, max_rel=35.31940841674805, norm_rel=0.0252968929708004, ref_abs_avg=22.279172897338867, test_abs_avg=22.267841339111328
production_forward2 grad[57] vs paper_forward: mean_abs=0.6715998649597168, max_abs=4.5, mean_rel=0.16027386486530304, max_rel=1075.1214599609375, norm_rel=0.024251120164990425, ref_abs_avg=27.66536521911621, test_abs_avg=27.66642951965332
production_forward2 grad[58] vs paper_forward: mean_abs=0.6524440050125122, max_abs=4.5, mean_rel=0.15906023979187012, max_rel=594.4666137695312, norm_rel=0.023944467306137085, ref_abs_avg=27.217971801757812, test_abs_avg=27.21780014038086
production_forward2 grad[59] vs paper_forward: mean_abs=0.5091683864593506, max_abs=2.0, mean_rel=0.11511389911174774, max_rel=17.265193939208984, norm_rel=0.02351202256977558, ref_abs_avg=21.6556396484375, test_abs_avg=21.627296447753906
production_forward2 grad[60] vs paper_forward: mean_abs=0.6202861070632935, max_abs=5.75, mean_rel=0.1557692438364029, max_rel=1012.7952270507812, norm_rel=0.02376868762075901, ref_abs_avg=26.078887939453125, test_abs_avg=26.080245971679688
production_forward2 grad[61] vs paper_forward: mean_abs=0.6106748580932617, max_abs=4.0, mean_rel=0.16194963455200195, max_rel=1357.000244140625, norm_rel=0.023319028317928314, ref_abs_avg=26.153697967529297, test_abs_avg=26.156246185302734
production_forward2 grad[62] vs paper_forward: mean_abs=0.48594164848327637, max_abs=1.734375, mean_rel=0.1077118068933487, max_rel=12.646510124206543, norm_rel=0.023843914270401, ref_abs_avg=20.377391815185547, test_abs_avg=20.381874084472656
production_forward2 grad[63] vs paper_forward: mean_abs=0.586689829826355, max_abs=4.5, mean_rel=0.14797823131084442, max_rel=1300.90576171875, norm_rel=0.023460133001208305, ref_abs_avg=25.00870704650879, test_abs_avg=25.009689331054688
production_forward2 grad[64] vs paper_forward: mean_abs=0.5755957365036011, max_abs=4.28515625, mean_rel=0.16429105401039124, max_rel=1116.397705078125, norm_rel=0.023319188505411148, ref_abs_avg=24.68195915222168, test_abs_avg=24.679685592651367
production_forward2 grad[65] vs paper_forward: mean_abs=0.42231035232543945, max_abs=1.625, mean_rel=0.09332137554883957, max_rel=8.896476745605469, norm_rel=0.021284591406583786, ref_abs_avg=19.890369415283203, test_abs_avg=19.896535873413086
production_forward2 grad[66] vs paper_forward: mean_abs=0.5474060773849487, max_abs=4.5, mean_rel=0.1506061851978302, max_rel=1076.184326171875, norm_rel=0.02293647639453411, ref_abs_avg=23.83356285095215, test_abs_avg=23.832963943481445
production_forward2 grad[67] vs paper_forward: mean_abs=0.542088508605957, max_abs=3.25, mean_rel=0.1483502984046936, max_rel=780.250244140625, norm_rel=0.0225132517516613, ref_abs_avg=24.11071014404297, test_abs_avg=24.11441421508789
production_forward2 grad[68] vs paper_forward: mean_abs=0.44381141662597656, max_abs=1.5, mean_rel=0.11697551608085632, max_rel=14.662847518920898, norm_rel=0.02273436263203621, ref_abs_avg=19.429149627685547, test_abs_avg=19.439895629882812
production_forward2 grad[69] vs paper_forward: mean_abs=0.5336666107177734, max_abs=3.9375, mean_rel=0.15801985561847687, max_rel=782.491455078125, norm_rel=0.022717008367180824, ref_abs_avg=23.41402816772461, test_abs_avg=23.41567611694336
production_forward2 grad[70] vs paper_forward: mean_abs=0.5172065496444702, max_abs=3.5, mean_rel=0.1565990149974823, max_rel=2582.58447265625, norm_rel=0.022455468773841858, ref_abs_avg=22.995624542236328, test_abs_avg=22.990535736083984
production_forward2 grad[71] vs paper_forward: mean_abs=0.4116654098033905, max_abs=1.5, mean_rel=0.10047857463359833, max_rel=5.831510543823242, norm_rel=0.022679323330521584, ref_abs_avg=18.31871795654297, test_abs_avg=18.321836471557617
production_forward2 grad[72] vs paper_forward: mean_abs=0.4997127950191498, max_abs=3.625, mean_rel=0.1440693438053131, max_rel=725.56982421875, norm_rel=0.022007012739777565, ref_abs_avg=22.637577056884766, test_abs_avg=22.637981414794922
production_forward2 grad[73] vs paper_forward: mean_abs=0.49109020829200745, max_abs=5.5, mean_rel=0.14568988978862762, max_rel=746.4622802734375, norm_rel=0.021660806611180305, ref_abs_avg=22.658477783203125, test_abs_avg=22.66381072998047
production_forward2 grad[74] vs paper_forward: mean_abs=0.4245467185974121, max_abs=1.875, mean_rel=0.1272754967212677, max_rel=23.08204460144043, norm_rel=0.021500062197446823, ref_abs_avg=20.038665771484375, test_abs_avg=20.058746337890625
production_forward2 grad[75] vs paper_forward: mean_abs=0.5416516065597534, max_abs=4.125, mean_rel=0.15654931962490082, max_rel=872.8126831054688, norm_rel=0.02426540106534958, ref_abs_avg=22.322551727294922, test_abs_avg=22.322124481201172
production_forward2 grad[76] vs paper_forward: mean_abs=0.5357367396354675, max_abs=3.75, mean_rel=0.15325510501861572, max_rel=672.6669921875, norm_rel=0.02388455532491207, ref_abs_avg=22.421655654907227, test_abs_avg=22.425392150878906
production_forward2 grad[77] vs paper_forward: mean_abs=0.397281289100647, max_abs=1.375, mean_rel=0.3939286470413208, max_rel=103.08969116210938, norm_rel=0.021642189472913742, ref_abs_avg=18.447174072265625, test_abs_avg=18.418197631835938
production_forward2 grad[78] vs paper_forward: mean_abs=0.4994845688343048, max_abs=4.0, mean_rel=0.14956331253051758, max_rel=826.0472412109375, norm_rel=0.023725716397166252, ref_abs_avg=21.047748565673828, test_abs_avg=21.04758644104004
production_forward2 grad[79] vs paper_forward: mean_abs=0.4913824796676636, max_abs=4.0, mean_rel=0.15467658638954163, max_rel=1400.120361328125, norm_rel=0.02344266138970852, ref_abs_avg=21.0301456451416, test_abs_avg=21.030170440673828
production_forward2 grad[80] vs paper_forward: mean_abs=0.3501788377761841, max_abs=1.5, mean_rel=0.15904586017131805, max_rel=29.448667526245117, norm_rel=0.020759565755724907, ref_abs_avg=17.345293045043945, test_abs_avg=17.315523147583008
production_forward2 grad[81] vs paper_forward: mean_abs=0.470694899559021, max_abs=4.875, mean_rel=0.14491897821426392, max_rel=784.0895385742188, norm_rel=0.02289501577615738, ref_abs_avg=20.51471710205078, test_abs_avg=20.51618003845215
production_forward2 grad[82] vs paper_forward: mean_abs=0.4558284878730774, max_abs=4.0, mean_rel=0.14068394899368286, max_rel=1493.19091796875, norm_rel=0.02247309312224388, ref_abs_avg=20.243885040283203, test_abs_avg=20.237781524658203
production_forward2 grad[83] vs paper_forward: mean_abs=0.34940487146377563, max_abs=2.0, mean_rel=0.16516269743442535, max_rel=11.506175994873047, norm_rel=0.02306399494409561, ref_abs_avg=15.508295059204102, test_abs_avg=15.51375675201416
production_forward2 grad[84] vs paper_forward: mean_abs=0.4357872009277344, max_abs=3.53125, mean_rel=0.14099128544330597, max_rel=664.7911987304688, norm_rel=0.02238231897354126, ref_abs_avg=19.476016998291016, test_abs_avg=19.476036071777344
production_forward2 grad[85] vs paper_forward: mean_abs=0.42441508173942566, max_abs=3.859375, mean_rel=0.13490909337997437, max_rel=458.01104736328125, norm_rel=0.02161249704658985, ref_abs_avg=19.727210998535156, test_abs_avg=19.72045135498047
production_forward2 grad[86] vs paper_forward: mean_abs=0.3678709864616394, max_abs=1.4375, mean_rel=0.08841906487941742, max_rel=6.525908946990967, norm_rel=0.023343436419963837, ref_abs_avg=15.67647933959961, test_abs_avg=15.650154113769531
production_forward2 grad[87] vs paper_forward: mean_abs=0.4151420295238495, max_abs=3.515625, mean_rel=0.13158179819583893, max_rel=542.0914916992188, norm_rel=0.0219030249863863, ref_abs_avg=18.99587631225586, test_abs_avg=18.99401092529297
production_forward2 grad[88] vs paper_forward: mean_abs=0.3961561918258667, max_abs=3.5, mean_rel=0.1283004879951477, max_rel=500.0845947265625, norm_rel=0.021356971934437752, ref_abs_avg=18.59715461730957, test_abs_avg=18.595298767089844
production_forward2 grad[89] vs paper_forward: mean_abs=0.33418136835098267, max_abs=1.1875, mean_rel=0.18985000252723694, max_rel=29.111495971679688, norm_rel=0.022357888519763947, ref_abs_avg=14.846080780029297, test_abs_avg=14.817723274230957
production_forward2 grad[90] vs paper_forward: mean_abs=0.3826516270637512, max_abs=4.0, mean_rel=0.1332913637161255, max_rel=524.8927001953125, norm_rel=0.02152426168322563, ref_abs_avg=17.864595413208008, test_abs_avg=17.86488151550293
production_forward2 grad[91] vs paper_forward: mean_abs=0.3806414306163788, max_abs=3.125, mean_rel=0.13152757287025452, max_rel=703.7401123046875, norm_rel=0.02121872454881668, ref_abs_avg=18.01663589477539, test_abs_avg=18.013465881347656
production_forward2 grad[92] vs paper_forward: mean_abs=0.3270777761936188, max_abs=1.34375, mean_rel=0.09827888011932373, max_rel=5.796189785003662, norm_rel=0.02201308123767376, ref_abs_avg=15.099485397338867, test_abs_avg=15.123533248901367
production_forward2 grad[93] vs paper_forward: mean_abs=0.36095744371414185, max_abs=3.75, mean_rel=0.1277514547109604, max_rel=811.4507446289062, norm_rel=0.0209373589605093, ref_abs_avg=17.414005279541016, test_abs_avg=17.41497039794922
production_forward2 grad[94] vs paper_forward: mean_abs=0.3564188480377197, max_abs=3.9375, mean_rel=0.1346338391304016, max_rel=686.1622924804688, norm_rel=0.02092723734676838, ref_abs_avg=17.227882385253906, test_abs_avg=17.219959259033203
production_forward2 grad[95] vs paper_forward: mean_abs=0.3082159161567688, max_abs=1.125, mean_rel=0.19284316897392273, max_rel=52.050167083740234, norm_rel=0.019927699118852615, ref_abs_avg=15.359833717346191, test_abs_avg=15.361814498901367
production_forward2 grad[96] vs paper_forward: mean_abs=0.34267911314964294, max_abs=3.5, mean_rel=0.12818095088005066, max_rel=893.501953125, norm_rel=0.020330006256699562, ref_abs_avg=17.131135940551758, test_abs_avg=17.131237030029297
production_forward2 grad[97] vs paper_forward: mean_abs=0.32995080947875977, max_abs=3.5, mean_rel=0.11866197735071182, max_rel=479.639404296875, norm_rel=0.01934068091213703, ref_abs_avg=17.26851463317871, test_abs_avg=17.268451690673828
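The per-gradient lines above report seven statistics for each parameter gradient. A minimal sketch of how they could be computed is below; the function name `grad_stats` and its exact form are assumptions, not the actual script, but each field matches the columns printed above.

```python
import torch

def grad_stats(ref: torch.Tensor, test: torch.Tensor, eps: float = 1e-12) -> dict:
    """Compare a test gradient against a reference gradient (assumed layout)."""
    diff = (test - ref).abs()
    # Elementwise relative error; eps guards division where ref is exactly 0.
    rel = diff / ref.abs().clamp_min(eps)
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        # max_rel blows up wherever ref is near zero, which is why it can be
        # in the hundreds while norm_rel stays around 0.02.
        "max_rel": rel.max().item(),
        # Global relative error: ||test - ref|| / ||ref||.
        "norm_rel": (diff.norm() / ref.norm().clamp_min(eps)).item(),
        "ref_abs_avg": ref.abs().mean().item(),
        "test_abs_avg": test.abs().mean().item(),
    }
```

Under this reading, the consistently small `norm_rel` (~0.02) is the meaningful agreement signal, while the occasional huge `max_rel` values (e.g. 2582 on grad[70]) reflect near-zero reference entries rather than a real mismatch.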
identity layers + randn queries
production_forward2 fwd+bwd:  224.367 ms
production_forward2 bwd-only: 202.204 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.240 GiB, fwd+bwd=8.990 GiB
paper_forward fwd+bwd:  379.533 ms
paper_forward bwd-only: 293.897 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB
production_forward fwd+bwd:  112.041 ms
production_forward bwd-only: 91.655 ms
production_forward peak allocated: fwd=2.364 GiB, fwd+bwd=6.243 GiB
production_forward peak reserved:  fwd=2.490 GiB, fwd+bwd=6.365 GiB
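The timing and peak-memory lines above (fwd+bwd milliseconds, peak allocated/reserved GiB) could plausibly be produced with a harness like the following. This is a sketch, not the benchmark script itself: the name `bench_fwd_bwd`, the iteration count, and the choice of a sum loss are assumptions, and the CUDA calls are guarded so the sketch also runs on CPU (where peak memory is reported as NaN).

```python
import time
import torch

def bench_fwd_bwd(fn, inputs, iters: int = 5):
    """Average fwd+bwd wall time (ms) and peak allocated memory (GiB, CUDA only)."""
    cuda = torch.cuda.is_available()
    if cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(*inputs)
        out.float().sum().backward()   # assumed scalar loss for the sketch
        for x in inputs:               # drop accumulated grads between iters
            if x.grad is not None:
                x.grad = None
    if cuda:
        torch.cuda.synchronize()       # wait for queued kernels before timing
    ms = (time.perf_counter() - t0) / iters * 1e3
    peak_gib = torch.cuda.max_memory_allocated() / 2**30 if cuda else float("nan")
    return ms, peak_gib
```

A "bwd-only" figure like the one above would then follow by timing the forward pass separately and subtracting, and `torch.cuda.max_memory_reserved()` would give the "peak reserved" column.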

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016339446883648634, max_abs=0.03533935546875
production_forward grad[0] vs paper_forward: mean_abs=0.008523550815880299, max_abs=0.3125, mean_rel=0.0745019018650055, max_rel=86.07611083984375, norm_rel=0.02034677192568779, ref_abs_avg=0.4510049521923065, test_abs_avg=0.4510270953178406
production_forward grad[1] vs paper_forward: mean_abs=7.3386969566345215, max_abs=64.0, mean_rel=0.2518211901187897, max_rel=1913.2183837890625, norm_rel=0.021131020039319992, ref_abs_avg=311.27972412109375, test_abs_avg=311.1899719238281
production_forward grad[2] vs paper_forward: mean_abs=1.215484619140625, max_abs=4.771484375, mean_rel=0.16978593170642853, max_rel=18.29798126220703, norm_rel=0.022964371368288994, ref_abs_avg=52.96739959716797, test_abs_avg=52.958526611328125
production_forward grad[3] vs paper_forward: mean_abs=1.5363690853118896, max_abs=11.0, mean_rel=0.16485148668289185, max_rel=1838.7110595703125, norm_rel=0.024295225739479065, ref_abs_avg=63.545631408691406, test_abs_avg=63.548004150390625
production_forward grad[4] vs paper_forward: mean_abs=1.5043118000030518, max_abs=12.25, mean_rel=0.1733400672674179, max_rel=989.1339111328125, norm_rel=0.02423481084406376, ref_abs_avg=62.619667053222656, test_abs_avg=62.61482620239258
production_forward grad[5] vs paper_forward: mean_abs=1.0994988679885864, max_abs=4.25, mean_rel=0.34141701459884644, max_rel=123.12017059326172, norm_rel=0.023716745898127556, ref_abs_avg=47.023658752441406, test_abs_avg=46.9762077331543
production_forward grad[6] vs paper_forward: mean_abs=1.3403995037078857, max_abs=9.0, mean_rel=0.17339037358760834, max_rel=2084.107421875, norm_rel=0.023817412555217743, ref_abs_avg=56.563297271728516, test_abs_avg=56.56904220581055
production_forward grad[7] vs paper_forward: mean_abs=1.3078054189682007, max_abs=9.0, mean_rel=0.1613456755876541, max_rel=940.7881469726562, norm_rel=0.02340233512222767, ref_abs_avg=56.14898681640625, test_abs_avg=56.15489959716797
production_forward grad[8] vs paper_forward: mean_abs=0.929046630859375, max_abs=3.5, mean_rel=0.09820327907800674, max_rel=12.387948989868164, norm_rel=0.022328050807118416, ref_abs_avg=42.10066223144531, test_abs_avg=42.17039489746094
production_forward grad[9] vs paper_forward: mean_abs=1.21310555934906, max_abs=7.5, mean_rel=0.15980200469493866, max_rel=1977.03271484375, norm_rel=0.023695925250649452, ref_abs_avg=51.44599914550781, test_abs_avg=51.448692321777344
production_forward grad[10] vs paper_forward: mean_abs=1.186708927154541, max_abs=8.0, mean_rel=0.16924722492694855, max_rel=1919.2286376953125, norm_rel=0.02353939227759838, ref_abs_avg=50.704044342041016, test_abs_avg=50.69997024536133
production_forward grad[11] vs paper_forward: mean_abs=0.9102630615234375, max_abs=3.75, mean_rel=0.11736400425434113, max_rel=9.26883602142334, norm_rel=0.023262040689587593, ref_abs_avg=39.63140869140625, test_abs_avg=39.60282897949219
production_forward grad[12] vs paper_forward: mean_abs=1.1364713907241821, max_abs=7.5, mean_rel=0.15852200984954834, max_rel=1354.3065185546875, norm_rel=0.023535307496786118, ref_abs_avg=48.4886360168457, test_abs_avg=48.48750305175781
production_forward grad[13] vs paper_forward: mean_abs=1.101170539855957, max_abs=6.859375, mean_rel=0.171571284532547, max_rel=939.5382690429688, norm_rel=0.023280441761016846, ref_abs_avg=47.58503723144531, test_abs_avg=47.587738037109375
production_forward grad[14] vs paper_forward: mean_abs=0.885284423828125, max_abs=3.625, mean_rel=0.08742527663707733, max_rel=5.305602550506592, norm_rel=0.023573409765958786, ref_abs_avg=37.51253890991211, test_abs_avg=37.51247787475586
production_forward grad[15] vs paper_forward: mean_abs=1.059715986251831, max_abs=6.625, mean_rel=0.16645336151123047, max_rel=1626.08935546875, norm_rel=0.023401716724038124, ref_abs_avg=45.560611724853516, test_abs_avg=45.563926696777344
production_forward grad[16] vs paper_forward: mean_abs=1.0336644649505615, max_abs=6.5, mean_rel=0.16105619072914124, max_rel=1306.667236328125, norm_rel=0.02294311113655567, ref_abs_avg=45.34004211425781, test_abs_avg=45.345638275146484
production_forward grad[17] vs paper_forward: mean_abs=0.7957015037536621, max_abs=2.75, mean_rel=0.09153357148170471, max_rel=7.5100297927856445, norm_rel=0.02195076458156109, ref_abs_avg=36.63607406616211, test_abs_avg=36.597755432128906
production_forward grad[18] vs paper_forward: mean_abs=0.9947335720062256, max_abs=6.5, mean_rel=0.14838792383670807, max_rel=1475.991455078125, norm_rel=0.02302122674882412, ref_abs_avg=43.417816162109375, test_abs_avg=43.41745376586914
production_forward grad[19] vs paper_forward: mean_abs=0.9687694907188416, max_abs=6.0, mean_rel=0.16117990016937256, max_rel=1258.8426513671875, norm_rel=0.022878844290971756, ref_abs_avg=42.56714630126953, test_abs_avg=42.56529235839844
production_forward grad[20] vs paper_forward: mean_abs=0.7392964363098145, max_abs=2.5, mean_rel=0.0853051170706749, max_rel=3.793034791946411, norm_rel=0.02174750715494156, ref_abs_avg=33.978580474853516, test_abs_avg=33.97658920288086
production_forward grad[21] vs paper_forward: mean_abs=0.9459748268127441, max_abs=7.0, mean_rel=0.1575508713722229, max_rel=1163.19775390625, norm_rel=0.022958919405937195, ref_abs_avg=41.40408706665039, test_abs_avg=41.40499496459961
production_forward grad[22] vs paper_forward: mean_abs=0.9212068319320679, max_abs=5.9375, mean_rel=0.15304851531982422, max_rel=2146.01220703125, norm_rel=0.022812550887465477, ref_abs_avg=40.517181396484375, test_abs_avg=40.519073486328125
production_forward grad[23] vs paper_forward: mean_abs=0.7296862602233887, max_abs=3.0, mean_rel=0.16590428352355957, max_rel=24.330463409423828, norm_rel=0.022399330511689186, ref_abs_avg=33.5982666015625, test_abs_avg=33.56813430786133
production_forward grad[24] vs paper_forward: mean_abs=0.8946283459663391, max_abs=6.25, mean_rel=0.1609686315059662, max_rel=1500.975830078125, norm_rel=0.02276487648487091, ref_abs_avg=39.511390686035156, test_abs_avg=39.51347351074219
production_forward grad[25] vs paper_forward: mean_abs=0.8779659867286682, max_abs=5.40625, mean_rel=0.14642751216888428, max_rel=826.029541015625, norm_rel=0.02266618423163891, ref_abs_avg=38.94001770019531, test_abs_avg=38.94541931152344
production_forward grad[26] vs paper_forward: mean_abs=0.8690282106399536, max_abs=3.59375, mean_rel=0.12447760999202728, max_rel=12.76047134399414, norm_rel=0.023976342752575874, ref_abs_avg=36.60013961791992, test_abs_avg=36.609840393066406
production_forward grad[27] vs paper_forward: mean_abs=1.0559442043304443, max_abs=7.5, mean_rel=0.18873441219329834, max_rel=3038.20068359375, norm_rel=0.024817069992423058, ref_abs_avg=42.74725341796875, test_abs_avg=42.74984359741211
production_forward grad[28] vs paper_forward: mean_abs=1.0300745964050293, max_abs=6.5, mean_rel=0.15432438254356384, max_rel=1040.16455078125, norm_rel=0.02454046905040741, ref_abs_avg=42.14956283569336, test_abs_avg=42.149070739746094
production_forward grad[29] vs paper_forward: mean_abs=0.8038511872291565, max_abs=4.0, mean_rel=0.26243704557418823, max_rel=30.814184188842773, norm_rel=0.02651851251721382, ref_abs_avg=29.856916427612305, test_abs_avg=29.836212158203125
production_forward grad[30] vs paper_forward: mean_abs=0.9705819487571716, max_abs=6.75, mean_rel=0.16264253854751587, max_rel=1093.9320068359375, norm_rel=0.024901360273361206, ref_abs_avg=39.153297424316406, test_abs_avg=39.158348083496094
production_forward grad[31] vs paper_forward: mean_abs=0.9524379968643188, max_abs=6.5625, mean_rel=0.15648344159126282, max_rel=1196.955810546875, norm_rel=0.024650588631629944, ref_abs_avg=38.777862548828125, test_abs_avg=38.786258697509766
production_forward grad[32] vs paper_forward: mean_abs=0.700650691986084, max_abs=2.875, mean_rel=0.24255791306495667, max_rel=38.90521240234375, norm_rel=0.022894781082868576, ref_abs_avg=30.79623031616211, test_abs_avg=30.81075096130371
production_forward grad[33] vs paper_forward: mean_abs=0.9051888585090637, max_abs=5.5, mean_rel=0.1643543243408203, max_rel=1061.7991943359375, norm_rel=0.024766555055975914, ref_abs_avg=36.64909744262695, test_abs_avg=36.6517333984375
production_forward grad[34] vs paper_forward: mean_abs=0.8852900862693787, max_abs=6.5, mean_rel=0.16629326343536377, max_rel=1135.7852783203125, norm_rel=0.02448827214539051, ref_abs_avg=36.25239562988281, test_abs_avg=36.254005432128906
production_forward grad[35] vs paper_forward: mean_abs=0.6586980819702148, max_abs=2.5, mean_rel=0.12381965667009354, max_rel=13.646809577941895, norm_rel=0.023250550031661987, ref_abs_avg=28.789337158203125, test_abs_avg=28.766910552978516
production_forward grad[36] vs paper_forward: mean_abs=0.8465631604194641, max_abs=6.25, mean_rel=0.1641363799571991, max_rel=938.4401245117188, norm_rel=0.02457357384264469, ref_abs_avg=34.54619598388672, test_abs_avg=34.549556732177734
production_forward grad[37] vs paper_forward: mean_abs=0.831247091293335, max_abs=5.125, mean_rel=0.16855022311210632, max_rel=1343.0875244140625, norm_rel=0.024460965767502785, ref_abs_avg=34.12697219848633, test_abs_avg=34.12654113769531
production_forward grad[38] vs paper_forward: mean_abs=0.6216330528259277, max_abs=2.75, mean_rel=0.09652248024940491, max_rel=5.389153480529785, norm_rel=0.024187834933400154, ref_abs_avg=26.306594848632812, test_abs_avg=26.260608673095703
production_forward grad[39] vs paper_forward: mean_abs=0.7986589670181274, max_abs=5.5, mean_rel=0.1680157631635666, max_rel=1465.1551513671875, norm_rel=0.024348268285393715, ref_abs_avg=32.88127899169922, test_abs_avg=32.88207244873047
production_forward grad[40] vs paper_forward: mean_abs=0.7850584983825684, max_abs=5.25, mean_rel=0.15984103083610535, max_rel=963.3591918945312, norm_rel=0.024243012070655823, ref_abs_avg=32.516380310058594, test_abs_avg=32.516170501708984
production_forward grad[41] vs paper_forward: mean_abs=0.6075882911682129, max_abs=2.75, mean_rel=0.17372022569179535, max_rel=29.15001106262207, norm_rel=0.024603411555290222, ref_abs_avg=24.798471450805664, test_abs_avg=24.752687454223633
production_forward grad[42] vs paper_forward: mean_abs=0.7576977014541626, max_abs=5.5, mean_rel=0.16251468658447266, max_rel=926.780517578125, norm_rel=0.024023331701755524, ref_abs_avg=31.622047424316406, test_abs_avg=31.6221923828125
production_forward grad[43] vs paper_forward: mean_abs=0.7403115630149841, max_abs=4.5, mean_rel=0.14628253877162933, max_rel=506.1983642578125, norm_rel=0.023854224011301994, ref_abs_avg=31.148555755615234, test_abs_avg=31.1453857421875
production_forward grad[44] vs paper_forward: mean_abs=0.5881608128547668, max_abs=2.375, mean_rel=0.08589960634708405, max_rel=4.672832012176514, norm_rel=0.023159336298704147, ref_abs_avg=25.358549118041992, test_abs_avg=25.355785369873047
production_forward grad[45] vs paper_forward: mean_abs=0.7226152420043945, max_abs=4.75, mean_rel=0.1587749719619751, max_rel=1075.0750732421875, norm_rel=0.023706164211034775, ref_abs_avg=30.501262664794922, test_abs_avg=30.50479507446289
production_forward grad[46] vs paper_forward: mean_abs=0.7078600525856018, max_abs=4.25, mean_rel=0.16750097274780273, max_rel=831.0198364257812, norm_rel=0.02378789708018303, ref_abs_avg=29.845813751220703, test_abs_avg=29.84423065185547
production_forward grad[47] vs paper_forward: mean_abs=0.5434846878051758, max_abs=2.375, mean_rel=0.0789235457777977, max_rel=3.077110528945923, norm_rel=0.021699491888284683, ref_abs_avg=25.25756072998047, test_abs_avg=25.26747703552246
production_forward grad[48] vs paper_forward: mean_abs=0.6882701516151428, max_abs=4.5, mean_rel=0.16598986089229584, max_rel=1545.1011962890625, norm_rel=0.02347653917968273, ref_abs_avg=29.32159423828125, test_abs_avg=29.321063995361328
production_forward grad[49] vs paper_forward: mean_abs=0.675572395324707, max_abs=4.5, mean_rel=0.1500629186630249, max_rel=761.3557739257812, norm_rel=0.023372359573841095, ref_abs_avg=28.951385498046875, test_abs_avg=28.96027374267578
production_forward grad[50] vs paper_forward: mean_abs=0.6774587631225586, max_abs=3.03125, mean_rel=0.11185634881258011, max_rel=11.079692840576172, norm_rel=0.026320625096559525, ref_abs_avg=25.896194458007812, test_abs_avg=25.95083236694336
production_forward grad[51] vs paper_forward: mean_abs=0.7800997495651245, max_abs=6.0, mean_rel=0.17286765575408936, max_rel=1183.6612548828125, norm_rel=0.02515728771686554, ref_abs_avg=31.06922721862793, test_abs_avg=31.066665649414062
production_forward grad[52] vs paper_forward: mean_abs=0.763526201248169, max_abs=5.4375, mean_rel=0.16071859002113342, max_rel=711.91796875, norm_rel=0.025165507569909096, ref_abs_avg=30.482013702392578, test_abs_avg=30.480743408203125
production_forward grad[53] vs paper_forward: mean_abs=0.5844297409057617, max_abs=3.0, mean_rel=0.09577706456184387, max_rel=2.8251519203186035, norm_rel=0.026123005896806717, ref_abs_avg=23.45013427734375, test_abs_avg=23.489700317382812
production_forward grad[54] vs paper_forward: mean_abs=0.7156933546066284, max_abs=4.5, mean_rel=0.17710033059120178, max_rel=1290.302978515625, norm_rel=0.024728598073124886, ref_abs_avg=28.96306800842285, test_abs_avg=28.963376998901367
production_forward grad[55] vs paper_forward: mean_abs=0.6957818269729614, max_abs=4.5, mean_rel=0.15520760416984558, max_rel=544.6774291992188, norm_rel=0.024938693270087242, ref_abs_avg=27.96462631225586, test_abs_avg=27.958219528198242
production_forward grad[56] vs paper_forward: mean_abs=0.5267841815948486, max_abs=2.75, mean_rel=0.1388193964958191, max_rel=15.20055866241455, norm_rel=0.024360811337828636, ref_abs_avg=21.98488998413086, test_abs_avg=21.93983268737793
production_forward grad[57] vs paper_forward: mean_abs=0.6593796014785767, max_abs=4.8125, mean_rel=0.16267916560173035, max_rel=993.857666015625, norm_rel=0.024537822231650352, ref_abs_avg=26.880720138549805, test_abs_avg=26.881298065185547
production_forward grad[58] vs paper_forward: mean_abs=0.6497863531112671, max_abs=4.5, mean_rel=0.16201432049274445, max_rel=991.811279296875, norm_rel=0.024331985041499138, ref_abs_avg=26.797229766845703, test_abs_avg=26.804885864257812
production_forward grad[59] vs paper_forward: mean_abs=0.5212650299072266, max_abs=2.15625, mean_rel=0.0735655426979065, max_rel=1.9687085151672363, norm_rel=0.02416916936635971, ref_abs_avg=21.95478057861328, test_abs_avg=21.947458267211914
production_forward grad[60] vs paper_forward: mean_abs=0.6118730902671814, max_abs=4.5, mean_rel=0.16175922751426697, max_rel=700.7633666992188, norm_rel=0.023733995854854584, ref_abs_avg=25.777950286865234, test_abs_avg=25.77909278869629
production_forward grad[61] vs paper_forward: mean_abs=0.6022527813911438, max_abs=4.0, mean_rel=0.1507735550403595, max_rel=430.8342590332031, norm_rel=0.02396594174206257, ref_abs_avg=25.209514617919922, test_abs_avg=25.22008514404297
production_forward grad[62] vs paper_forward: mean_abs=0.4604593515396118, max_abs=2.25, mean_rel=0.07916378974914551, max_rel=3.148740530014038, norm_rel=0.02374902367591858, ref_abs_avg=19.993507385253906, test_abs_avg=19.95187759399414
production_forward grad[63] vs paper_forward: mean_abs=0.5781821608543396, max_abs=4.25, mean_rel=0.1580623984336853, max_rel=834.0225219726562, norm_rel=0.02333935722708702, ref_abs_avg=24.779861450195312, test_abs_avg=24.7803897857666
production_forward grad[64] vs paper_forward: mean_abs=0.5646373629570007, max_abs=3.875, mean_rel=0.15819531679153442, max_rel=776.5609130859375, norm_rel=0.02331208996474743, ref_abs_avg=24.20841407775879, test_abs_avg=24.21450424194336
production_forward grad[65] vs paper_forward: mean_abs=0.4556083679199219, max_abs=2.4375, mean_rel=0.06105604022741318, max_rel=1.8701540231704712, norm_rel=0.023369543254375458, ref_abs_avg=19.855815887451172, test_abs_avg=19.863990783691406
production_forward grad[66] vs paper_forward: mean_abs=0.5474441051483154, max_abs=4.0, mean_rel=0.1482546329498291, max_rel=807.1937255859375, norm_rel=0.02297588437795639, ref_abs_avg=23.814138412475586, test_abs_avg=23.817211151123047
production_forward grad[67] vs paper_forward: mean_abs=0.539167046546936, max_abs=3.75, mean_rel=0.1384921669960022, max_rel=455.19342041015625, norm_rel=0.02301640808582306, ref_abs_avg=23.479145050048828, test_abs_avg=23.473691940307617
production_forward grad[68] vs paper_forward: mean_abs=0.39667415618896484, max_abs=1.375, mean_rel=0.09224195778369904, max_rel=4.582228660583496, norm_rel=0.020453311502933502, ref_abs_avg=19.128559112548828, test_abs_avg=19.13576889038086
production_forward grad[69] vs paper_forward: mean_abs=0.5230207443237305, max_abs=4.0, mean_rel=0.15699878334999084, max_rel=1488.0345458984375, norm_rel=0.022527117282152176, ref_abs_avg=23.20309066772461, test_abs_avg=23.204421997070312
production_forward grad[70] vs paper_forward: mean_abs=0.5116389393806458, max_abs=5.0, mean_rel=0.14226281642913818, max_rel=414.4055480957031, norm_rel=0.022690359503030777, ref_abs_avg=22.59861183166504, test_abs_avg=22.60449981689453
production_forward grad[71] vs paper_forward: mean_abs=0.3817131519317627, max_abs=1.5, mean_rel=0.07678033411502838, max_rel=2.9386813640594482, norm_rel=0.020072724670171738, ref_abs_avg=18.66067886352539, test_abs_avg=18.660991668701172
production_forward grad[72] vs paper_forward: mean_abs=0.5013245344161987, max_abs=5.25, mean_rel=0.14282411336898804, max_rel=972.0000610351562, norm_rel=0.02213647961616516, ref_abs_avg=22.605003356933594, test_abs_avg=22.60479736328125
production_forward grad[73] vs paper_forward: mean_abs=0.4887549579143524, max_abs=3.1640625, mean_rel=0.14034751057624817, max_rel=763.4874267578125, norm_rel=0.022252509370446205, ref_abs_avg=21.967506408691406, test_abs_avg=21.97032356262207
production_forward grad[74] vs paper_forward: mean_abs=0.45169591903686523, max_abs=1.625, mean_rel=0.15652573108673096, max_rel=15.492462158203125, norm_rel=0.023745013400912285, ref_abs_avg=18.971771240234375, test_abs_avg=18.954586029052734
production_forward grad[75] vs paper_forward: mean_abs=0.5475396513938904, max_abs=4.0, mean_rel=0.14935626089572906, max_rel=1155.6514892578125, norm_rel=0.023674972355365753, ref_abs_avg=23.178905487060547, test_abs_avg=23.18112564086914
production_forward grad[76] vs paper_forward: mean_abs=0.5396044254302979, max_abs=4.0, mean_rel=0.15848681330680847, max_rel=770.9146728515625, norm_rel=0.023409375920891762, ref_abs_avg=23.01299476623535, test_abs_avg=23.0194149017334
production_forward grad[77] vs paper_forward: mean_abs=0.4026336669921875, max_abs=1.6875, mean_rel=0.08844168484210968, max_rel=6.962070465087891, norm_rel=0.022771025076508522, ref_abs_avg=18.295745849609375, test_abs_avg=18.311309814453125
production_forward grad[78] vs paper_forward: mean_abs=0.5074889063835144, max_abs=4.25, mean_rel=0.1438567042350769, max_rel=719.4124755859375, norm_rel=0.02301896922290325, ref_abs_avg=22.05719757080078, test_abs_avg=22.05783462524414
production_forward grad[79] vs paper_forward: mean_abs=0.5002720355987549, max_abs=4.0, mean_rel=0.13615047931671143, max_rel=265.9692077636719, norm_rel=0.022466113790869713, ref_abs_avg=22.225933074951172, test_abs_avg=22.226747512817383
production_forward grad[80] vs paper_forward: mean_abs=0.37306880950927734, max_abs=1.5, mean_rel=0.24472512304782867, max_rel=60.01567840576172, norm_rel=0.023392120376229286, ref_abs_avg=16.261877059936523, test_abs_avg=16.26717758178711
production_forward grad[81] vs paper_forward: mean_abs=0.4764306843280792, max_abs=3.9375, mean_rel=0.14744821190834045, max_rel=911.2456665039062, norm_rel=0.022707784548401833, ref_abs_avg=21.014572143554688, test_abs_avg=21.01439666748047
production_forward grad[82] vs paper_forward: mean_abs=0.46229350566864014, max_abs=4.5, mean_rel=0.13750147819519043, max_rel=1076.8927001953125, norm_rel=0.022297250106930733, ref_abs_avg=20.84571075439453, test_abs_avg=20.846960067749023
production_forward grad[83] vs paper_forward: mean_abs=0.3703470230102539, max_abs=1.75, mean_rel=0.09176339209079742, max_rel=4.941919803619385, norm_rel=0.020958073437213898, ref_abs_avg=17.780963897705078, test_abs_avg=17.783748626708984
production_forward grad[84] vs paper_forward: mean_abs=0.4442231059074402, max_abs=4.0, mean_rel=0.13311004638671875, max_rel=487.8998718261719, norm_rel=0.021946746855974197, ref_abs_avg=20.298931121826172, test_abs_avg=20.299724578857422
production_forward grad[85] vs paper_forward: mean_abs=0.43399426341056824, max_abs=3.5, mean_rel=0.13200052082538605, max_rel=1224.44775390625, norm_rel=0.021743599325418472, ref_abs_avg=20.10053825378418, test_abs_avg=20.104463577270508
production_forward grad[86] vs paper_forward: mean_abs=0.3438434600830078, max_abs=1.25, mean_rel=0.09714880585670471, max_rel=13.547765731811523, norm_rel=0.02061077579855919, ref_abs_avg=16.32069969177246, test_abs_avg=16.33487319946289
production_forward grad[87] vs paper_forward: mean_abs=0.41876864433288574, max_abs=4.0, mean_rel=0.13963556289672852, max_rel=950.1488037109375, norm_rel=0.021516932174563408, ref_abs_avg=19.556758880615234, test_abs_avg=19.55708885192871
production_forward grad[88] vs paper_forward: mean_abs=0.4057801067829132, max_abs=4.09375, mean_rel=0.13555321097373962, max_rel=529.8212890625, norm_rel=0.020787745714187622, ref_abs_avg=19.53902244567871, test_abs_avg=19.537782669067383
production_forward grad[89] vs paper_forward: mean_abs=0.3059931993484497, max_abs=1.5625, mean_rel=0.09927834570407867, max_rel=6.993710041046143, norm_rel=0.02023618295788765, ref_abs_avg=15.19472885131836, test_abs_avg=15.183088302612305
production_forward grad[90] vs paper_forward: mean_abs=0.38999563455581665, max_abs=4.875, mean_rel=0.12718157470226288, max_rel=631.857177734375, norm_rel=0.02093650959432125, ref_abs_avg=18.781946182250977, test_abs_avg=18.783594131469727
production_forward grad[91] vs paper_forward: mean_abs=0.3862818479537964, max_abs=3.5, mean_rel=0.13489823043346405, max_rel=764.603515625, norm_rel=0.020904984325170517, ref_abs_avg=18.641342163085938, test_abs_avg=18.64048194885254
production_forward grad[92] vs paper_forward: mean_abs=0.310072660446167, max_abs=1.0625, mean_rel=0.13449016213417053, max_rel=15.639699935913086, norm_rel=0.020335055887699127, ref_abs_avg=15.20871639251709, test_abs_avg=15.187637329101562
production_forward grad[93] vs paper_forward: mean_abs=0.3704710006713867, max_abs=4.5, mean_rel=0.12403886020183563, max_rel=889.0316772460938, norm_rel=0.020407503470778465, ref_abs_avg=18.38072967529297, test_abs_avg=18.382862091064453
production_forward grad[94] vs paper_forward: mean_abs=0.3603334128856659, max_abs=3.0, mean_rel=0.1185590922832489, max_rel=319.8901062011719, norm_rel=0.019705036655068398, ref_abs_avg=18.469219207763672, test_abs_avg=18.46902084350586
production_forward grad[95] vs paper_forward: mean_abs=0.29749107360839844, max_abs=1.25, mean_rel=0.05322272330522537, max_rel=1.8387523889541626, norm_rel=0.018833469599485397, ref_abs_avg=15.92737102508545, test_abs_avg=15.966055870056152
production_forward grad[96] vs paper_forward: mean_abs=0.35731297731399536, max_abs=4.75, mean_rel=0.1227983683347702, max_rel=654.9535522460938, norm_rel=0.019907508045434952, ref_abs_avg=18.21951675415039, test_abs_avg=18.219913482666016
production_forward grad[97] vs paper_forward: mean_abs=0.34821754693984985, max_abs=3.5, mean_rel=0.12332479655742645, max_rel=799.885498046875, norm_rel=0.01979195885360241, ref_abs_avg=18.015567779541016, test_abs_avg=18.02071762084961
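The per-tensor fields repeated in every line above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) can plausibly be reproduced with a helper along these lines. The harness's exact definitions are not shown in the log, so this is a sketch reconstructed from the reported values (e.g. norm_rel tracking mean_abs / ref_abs_avg suggests ref_abs_avg is the mean of |ref|, and the relative errors need a floor to guard division by near-zero reference entries):

```python
import numpy as np

def error_metrics(ref: np.ndarray, test: np.ndarray) -> dict:
    """Per-tensor comparison metrics mirroring the log fields above.

    Assumed definitions, not the harness's actual code:
    - *_abs metrics are over the elementwise difference |ref - test|
    - *_rel metrics divide by |ref| with a tiny floor against /0
    - norm_rel is the global relative error ||ref - test|| / ||ref||
    """
    ref = ref.astype(np.float64)
    test = test.astype(np.float64)
    diff = np.abs(ref - test)
    denom = np.maximum(np.abs(ref), np.finfo(np.float64).tiny)
    return {
        "mean_abs": float(diff.mean()),
        "max_abs": float(diff.max()),
        "mean_rel": float((diff / denom).mean()),
        "max_rel": float((diff / denom).max()),
        "norm_rel": float(np.linalg.norm(diff) / np.linalg.norm(ref)),
        "ref_abs_avg": float(np.abs(ref).mean()),
        "test_abs_avg": float(np.abs(test).mean()),
    }
```

Under these assumed definitions, the occasional huge max_rel values in the log (hundreds to thousands) would come from reference entries near zero, where elementwise relative error is meaningless; the consistently small norm_rel (~0.02 throughout) is the more stable agreement signal.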
production_forward2 vs paper_forward output: mean_abs=0.0016339446883648634, max_abs=0.03533935546875
production_forward2 grad[0] vs paper_forward: mean_abs=0.008532322011888027, max_abs=0.359375, mean_rel=0.07447247952222824, max_rel=85.61058044433594, norm_rel=0.020362459123134613, ref_abs_avg=0.4510049521923065, test_abs_avg=0.45101457834243774
production_forward2 grad[1] vs paper_forward: mean_abs=7.268146991729736, max_abs=64.0, mean_rel=0.2888554632663727, max_rel=1547.7899169921875, norm_rel=0.02089790627360344, ref_abs_avg=311.27972412109375, test_abs_avg=311.2301330566406
production_forward2 grad[2] vs paper_forward: mean_abs=1.260512351989746, max_abs=4.7392578125, mean_rel=0.18241086602210999, max_rel=26.661319732666016, norm_rel=0.02377714402973652, ref_abs_avg=52.96739959716797, test_abs_avg=52.91389465332031
production_forward2 grad[3] vs paper_forward: mean_abs=1.5358123779296875, max_abs=10.0, mean_rel=0.16639696061611176, max_rel=2153.188720703125, norm_rel=0.02429383620619774, ref_abs_avg=63.545631408691406, test_abs_avg=63.54780578613281
production_forward2 grad[4] vs paper_forward: mean_abs=1.5030865669250488, max_abs=12.25, mean_rel=0.16108818352222443, max_rel=745.9318237304688, norm_rel=0.024162499234080315, ref_abs_avg=62.619667053222656, test_abs_avg=62.61431121826172
production_forward2 grad[5] vs paper_forward: mean_abs=1.1398049592971802, max_abs=4.0, mean_rel=0.35887181758880615, max_rel=129.00193786621094, norm_rel=0.024627504870295525, ref_abs_avg=47.023658752441406, test_abs_avg=46.948326110839844
production_forward2 grad[6] vs paper_forward: mean_abs=1.3451595306396484, max_abs=8.25, mean_rel=0.17232833802700043, max_rel=1912.4927978515625, norm_rel=0.023886341601610184, ref_abs_avg=56.563297271728516, test_abs_avg=56.567413330078125
production_forward2 grad[7] vs paper_forward: mean_abs=1.314695954322815, max_abs=8.0, mean_rel=0.15931913256645203, max_rel=950.46484375, norm_rel=0.023511042818427086, ref_abs_avg=56.14898681640625, test_abs_avg=56.152854919433594
production_forward2 grad[8] vs paper_forward: mean_abs=0.9549732208251953, max_abs=4.0, mean_rel=0.10021407902240753, max_rel=9.648904800415039, norm_rel=0.02262183278799057, ref_abs_avg=42.10066223144531, test_abs_avg=42.235328674316406
production_forward2 grad[9] vs paper_forward: mean_abs=1.2176629304885864, max_abs=8.5, mean_rel=0.16615769267082214, max_rel=2251.845703125, norm_rel=0.023773852735757828, ref_abs_avg=51.44599914550781, test_abs_avg=51.44635009765625
production_forward2 grad[10] vs paper_forward: mean_abs=1.1920992136001587, max_abs=8.5, mean_rel=0.16440463066101074, max_rel=1663.817626953125, norm_rel=0.023638363927602768, ref_abs_avg=50.704044342041016, test_abs_avg=50.698692321777344
production_forward2 grad[11] vs paper_forward: mean_abs=0.9221773147583008, max_abs=4.75, mean_rel=0.12315364181995392, max_rel=9.429400444030762, norm_rel=0.023945212364196777, ref_abs_avg=39.63140869140625, test_abs_avg=39.61819839477539
production_forward2 grad[12] vs paper_forward: mean_abs=1.141063928604126, max_abs=7.5, mean_rel=0.16044881939888, max_rel=1684.9986572265625, norm_rel=0.02362724393606186, ref_abs_avg=48.4886360168457, test_abs_avg=48.48614501953125
production_forward2 grad[13] vs paper_forward: mean_abs=1.1050901412963867, max_abs=7.0, mean_rel=0.17236220836639404, max_rel=1056.9306640625, norm_rel=0.023372560739517212, ref_abs_avg=47.58503723144531, test_abs_avg=47.587928771972656
production_forward2 grad[14] vs paper_forward: mean_abs=0.8918156623840332, max_abs=4.0, mean_rel=0.09626033902168274, max_rel=9.06033706665039, norm_rel=0.023618150502443314, ref_abs_avg=37.51253890991211, test_abs_avg=37.51865768432617
production_forward2 grad[15] vs paper_forward: mean_abs=1.0629417896270752, max_abs=7.0, mean_rel=0.16319094598293304, max_rel=1501.48046875, norm_rel=0.023466601967811584, ref_abs_avg=45.560611724853516, test_abs_avg=45.56303405761719
production_forward2 grad[16] vs paper_forward: mean_abs=1.0378808975219727, max_abs=6.5, mean_rel=0.15898741781711578, max_rel=1260.0042724609375, norm_rel=0.02301602065563202, ref_abs_avg=45.34004211425781, test_abs_avg=45.345054626464844
production_forward2 grad[17] vs paper_forward: mean_abs=0.7971429824829102, max_abs=3.0, mean_rel=0.115994393825531, max_rel=17.183086395263672, norm_rel=0.021949218586087227, ref_abs_avg=36.63607406616211, test_abs_avg=36.58076858520508
production_forward2 grad[18] vs paper_forward: mean_abs=0.9985055923461914, max_abs=6.1875, mean_rel=0.14829716086387634, max_rel=1427.1300048828125, norm_rel=0.023096280172467232, ref_abs_avg=43.417816162109375, test_abs_avg=43.41695022583008
production_forward2 grad[19] vs paper_forward: mean_abs=0.9700755476951599, max_abs=6.0, mean_rel=0.16269558668136597, max_rel=909.482177734375, norm_rel=0.022914182394742966, ref_abs_avg=42.56714630126953, test_abs_avg=42.564292907714844
production_forward2 grad[20] vs paper_forward: mean_abs=0.7417845726013184, max_abs=2.5, mean_rel=0.0873975083231926, max_rel=3.7559125423431396, norm_rel=0.021560169756412506, ref_abs_avg=33.978580474853516, test_abs_avg=34.008026123046875
production_forward2 grad[21] vs paper_forward: mean_abs=0.9474940299987793, max_abs=6.5, mean_rel=0.15423977375030518, max_rel=1009.6300659179688, norm_rel=0.02298816666007042, ref_abs_avg=41.40408706665039, test_abs_avg=41.404014587402344
production_forward2 grad[22] vs paper_forward: mean_abs=0.9226537346839905, max_abs=6.0, mean_rel=0.15772566199302673, max_rel=2192.664794921875, norm_rel=0.022859303280711174, ref_abs_avg=40.517181396484375, test_abs_avg=40.51549530029297
production_forward2 grad[23] vs paper_forward: mean_abs=0.7341251373291016, max_abs=3.5, mean_rel=0.21761806309223175, max_rel=46.876068115234375, norm_rel=0.022557955235242844, ref_abs_avg=33.5982666015625, test_abs_avg=33.55127716064453
production_forward2 grad[24] vs paper_forward: mean_abs=0.8974259495735168, max_abs=6.5, mean_rel=0.16019804775714874, max_rel=1288.8555908203125, norm_rel=0.022823816165328026, ref_abs_avg=39.511390686035156, test_abs_avg=39.510719299316406
production_forward2 grad[25] vs paper_forward: mean_abs=0.8788795471191406, max_abs=5.125, mean_rel=0.14891420304775238, max_rel=949.7882690429688, norm_rel=0.02269577980041504, ref_abs_avg=38.94001770019531, test_abs_avg=38.94631576538086
production_forward2 grad[26] vs paper_forward: mean_abs=0.8545284271240234, max_abs=3.734375, mean_rel=0.12971708178520203, max_rel=18.794288635253906, norm_rel=0.023541871458292007, ref_abs_avg=36.60013961791992, test_abs_avg=36.59710693359375
production_forward2 grad[27] vs paper_forward: mean_abs=1.0543594360351562, max_abs=7.0, mean_rel=0.18781259655952454, max_rel=3335.141357421875, norm_rel=0.024793582037091255, ref_abs_avg=42.74725341796875, test_abs_avg=42.74974822998047
production_forward2 grad[28] vs paper_forward: mean_abs=1.0285658836364746, max_abs=6.4296875, mean_rel=0.15734125673770905, max_rel=1245.9697265625, norm_rel=0.024489236995577812, ref_abs_avg=42.14956283569336, test_abs_avg=42.15037536621094
production_forward2 grad[29] vs paper_forward: mean_abs=0.7822686433792114, max_abs=3.75, mean_rel=0.3283790946006775, max_rel=78.459228515625, norm_rel=0.026281170547008514, ref_abs_avg=29.856916427612305, test_abs_avg=29.843416213989258
production_forward2 grad[30] vs paper_forward: mean_abs=0.9714449048042297, max_abs=6.25, mean_rel=0.16135069727897644, max_rel=811.225341796875, norm_rel=0.02492787502706051, ref_abs_avg=39.153297424316406, test_abs_avg=39.158409118652344
production_forward2 grad[31] vs paper_forward: mean_abs=0.9550557136535645, max_abs=6.3125, mean_rel=0.16072499752044678, max_rel=1241.0469970703125, norm_rel=0.02470880188047886, ref_abs_avg=38.777862548828125, test_abs_avg=38.78385925292969
production_forward2 grad[32] vs paper_forward: mean_abs=0.7117390632629395, max_abs=2.75, mean_rel=0.2275083065032959, max_rel=35.82265853881836, norm_rel=0.02287009358406067, ref_abs_avg=30.79623031616211, test_abs_avg=30.798686981201172
production_forward2 grad[33] vs paper_forward: mean_abs=0.9060828685760498, max_abs=5.5, mean_rel=0.16349026560783386, max_rel=1143.265380859375, norm_rel=0.024795494973659515, ref_abs_avg=36.64909744262695, test_abs_avg=36.65180206298828
production_forward2 grad[34] vs paper_forward: mean_abs=0.8865634202957153, max_abs=6.0, mean_rel=0.1678238809108734, max_rel=1368.22119140625, norm_rel=0.02453687973320484, ref_abs_avg=36.25239562988281, test_abs_avg=36.25123596191406
production_forward2 grad[35] vs paper_forward: mean_abs=0.664517879486084, max_abs=2.40625, mean_rel=0.12715649604797363, max_rel=12.751447677612305, norm_rel=0.0229671411216259, ref_abs_avg=28.789337158203125, test_abs_avg=28.79026222229004
production_forward2 grad[36] vs paper_forward: mean_abs=0.847906768321991, max_abs=6.0, mean_rel=0.16647371649742126, max_rel=1335.98291015625, norm_rel=0.02462908998131752, ref_abs_avg=34.54619598388672, test_abs_avg=34.54961395263672
production_forward2 grad[37] vs paper_forward: mean_abs=0.8322216272354126, max_abs=4.75, mean_rel=0.17027470469474792, max_rel=1125.056884765625, norm_rel=0.024485457688570023, ref_abs_avg=34.12697219848633, test_abs_avg=34.1270751953125
production_forward2 grad[38] vs paper_forward: mean_abs=0.6289710402488708, max_abs=2.5, mean_rel=0.08789065480232239, max_rel=3.361553907394409, norm_rel=0.024084175005555153, ref_abs_avg=26.306594848632812, test_abs_avg=26.265182495117188
production_forward2 grad[39] vs paper_forward: mean_abs=0.8005647659301758, max_abs=5.25, mean_rel=0.16725532710552216, max_rel=1544.543701171875, norm_rel=0.024398429319262505, ref_abs_avg=32.88127899169922, test_abs_avg=32.88273620605469
production_forward2 grad[40] vs paper_forward: mean_abs=0.7853378057479858, max_abs=4.75, mean_rel=0.15989463031291962, max_rel=1133.1336669921875, norm_rel=0.024232830852270126, ref_abs_avg=32.516380310058594, test_abs_avg=32.515846252441406
production_forward2 grad[41] vs paper_forward: mean_abs=0.6261172294616699, max_abs=2.25, mean_rel=0.17831261456012726, max_rel=28.578908920288086, norm_rel=0.024986792355775833, ref_abs_avg=24.798471450805664, test_abs_avg=24.76211166381836
production_forward2 grad[42] vs paper_forward: mean_abs=0.7595928907394409, max_abs=5.5, mean_rel=0.1638333946466446, max_rel=1106.64501953125, norm_rel=0.02406899444758892, ref_abs_avg=31.622047424316406, test_abs_avg=31.621553421020508
production_forward2 grad[43] vs paper_forward: mean_abs=0.7396818995475769, max_abs=4.5625, mean_rel=0.1481669694185257, max_rel=895.5626220703125, norm_rel=0.023826725780963898, ref_abs_avg=31.148555755615234, test_abs_avg=31.144704818725586
production_forward2 grad[44] vs paper_forward: mean_abs=0.5859698057174683, max_abs=2.40625, mean_rel=0.08680611848831177, max_rel=3.7563183307647705, norm_rel=0.022958941757678986, ref_abs_avg=25.358549118041992, test_abs_avg=25.366735458374023
production_forward2 grad[45] vs paper_forward: mean_abs=0.7240505814552307, max_abs=4.75, mean_rel=0.157611683011055, max_rel=1180.68701171875, norm_rel=0.023751547560095787, ref_abs_avg=30.501262664794922, test_abs_avg=30.50395965576172
production_forward2 grad[46] vs paper_forward: mean_abs=0.7097501754760742, max_abs=4.5, mean_rel=0.1670645773410797, max_rel=730.2241821289062, norm_rel=0.023844195529818535, ref_abs_avg=29.845813751220703, test_abs_avg=29.843242645263672
production_forward2 grad[47] vs paper_forward: mean_abs=0.5205507278442383, max_abs=2.1875, mean_rel=0.08729211241006851, max_rel=4.955759525299072, norm_rel=0.0210507120937109, ref_abs_avg=25.25756072998047, test_abs_avg=25.26769256591797
production_forward2 grad[48] vs paper_forward: mean_abs=0.6893773674964905, max_abs=4.5, mean_rel=0.16738323867321014, max_rel=1512.921630859375, norm_rel=0.023511124774813652, ref_abs_avg=29.32159423828125, test_abs_avg=29.321260452270508
production_forward2 grad[49] vs paper_forward: mean_abs=0.6760608553886414, max_abs=4.5, mean_rel=0.15090274810791016, max_rel=893.914306640625, norm_rel=0.02338065207004547, ref_abs_avg=28.951385498046875, test_abs_avg=28.960527420043945
production_forward2 grad[50] vs paper_forward: mean_abs=0.6768608093261719, max_abs=2.9375, mean_rel=0.11057856678962708, max_rel=11.884198188781738, norm_rel=0.02642274647951126, ref_abs_avg=25.896194458007812, test_abs_avg=25.944881439208984
production_forward2 grad[51] vs paper_forward: mean_abs=0.7776833772659302, max_abs=5.75, mean_rel=0.17329131066799164, max_rel=1048.366943359375, norm_rel=0.025075094774365425, ref_abs_avg=31.06922721862793, test_abs_avg=31.06503677368164
production_forward2 grad[52] vs paper_forward: mean_abs=0.7603618502616882, max_abs=5.888671875, mean_rel=0.15909725427627563, max_rel=628.9111938476562, norm_rel=0.025058075785636902, ref_abs_avg=30.482013702392578, test_abs_avg=30.482210159301758
production_forward2 grad[53] vs paper_forward: mean_abs=0.586329460144043, max_abs=2.375, mean_rel=0.09162615984678268, max_rel=3.3251638412475586, norm_rel=0.026064451783895493, ref_abs_avg=23.45013427734375, test_abs_avg=23.48140525817871
production_forward2 grad[54] vs paper_forward: mean_abs=0.7151180505752563, max_abs=5.0, mean_rel=0.17463970184326172, max_rel=1152.74072265625, norm_rel=0.024718448519706726, ref_abs_avg=28.96306800842285, test_abs_avg=28.96348762512207
production_forward2 grad[55] vs paper_forward: mean_abs=0.6933726072311401, max_abs=4.6875, mean_rel=0.1548575758934021, max_rel=541.842041015625, norm_rel=0.024861307814717293, ref_abs_avg=27.96462631225586, test_abs_avg=27.957901000976562
production_forward2 grad[56] vs paper_forward: mean_abs=0.5229079723358154, max_abs=2.75, mean_rel=0.1495605707168579, max_rel=15.808370590209961, norm_rel=0.024150289595127106, ref_abs_avg=21.98488998413086, test_abs_avg=21.95840072631836
production_forward2 grad[57] vs paper_forward: mean_abs=0.6593454480171204, max_abs=4.5, mean_rel=0.16428112983703613, max_rel=1343.7708740234375, norm_rel=0.02454390563070774, ref_abs_avg=26.880720138549805, test_abs_avg=26.880874633789062
production_forward2 grad[58] vs paper_forward: mean_abs=0.6502381563186646, max_abs=4.375, mean_rel=0.16043201088905334, max_rel=846.6124877929688, norm_rel=0.02433028258383274, ref_abs_avg=26.797229766845703, test_abs_avg=26.805072784423828
production_forward2 grad[59] vs paper_forward: mean_abs=0.5159206390380859, max_abs=2.5, mean_rel=0.07586265355348587, max_rel=2.6803524494171143, norm_rel=0.02415393479168415, ref_abs_avg=21.95478057861328, test_abs_avg=21.94583511352539
production_forward2 grad[60] vs paper_forward: mean_abs=0.6126079559326172, max_abs=5.0, mean_rel=0.16224384307861328, max_rel=643.94921875, norm_rel=0.02375427819788456, ref_abs_avg=25.777950286865234, test_abs_avg=25.77851104736328
production_forward2 grad[61] vs paper_forward: mean_abs=0.6029808521270752, max_abs=4.2421875, mean_rel=0.14881952106952667, max_rel=418.6601867675781, norm_rel=0.023976681753993034, ref_abs_avg=25.209514617919922, test_abs_avg=25.220081329345703
production_forward2 grad[62] vs paper_forward: mean_abs=0.45537805557250977, max_abs=2.25, mean_rel=0.08009149134159088, max_rel=2.5239903926849365, norm_rel=0.023134995251893997, ref_abs_avg=19.993507385253906, test_abs_avg=19.946857452392578
production_forward2 grad[63] vs paper_forward: mean_abs=0.5780619382858276, max_abs=4.0, mean_rel=0.1560220569372177, max_rel=967.5984497070312, norm_rel=0.023337556049227715, ref_abs_avg=24.779861450195312, test_abs_avg=24.780376434326172
production_forward2 grad[64] vs paper_forward: mean_abs=0.5652702450752258, max_abs=4.0, mean_rel=0.1621030867099762, max_rel=900.1466674804688, norm_rel=0.023333929479122162, ref_abs_avg=24.20841407775879, test_abs_avg=24.21477699279785
production_forward2 grad[65] vs paper_forward: mean_abs=0.44084739685058594, max_abs=2.25, mean_rel=0.056592755019664764, max_rel=1.5514284372329712, norm_rel=0.02293243259191513, ref_abs_avg=19.855815887451172, test_abs_avg=19.858654022216797
production_forward2 grad[66] vs paper_forward: mean_abs=0.5486572980880737, max_abs=4.0, mean_rel=0.14962546527385712, max_rel=807.1937255859375, norm_rel=0.02301873452961445, ref_abs_avg=23.814138412475586, test_abs_avg=23.817005157470703
production_forward2 grad[67] vs paper_forward: mean_abs=0.5398985147476196, max_abs=4.0, mean_rel=0.142048642039299, max_rel=628.653564453125, norm_rel=0.023046456277370453, ref_abs_avg=23.479145050048828, test_abs_avg=23.47386932373047
production_forward2 grad[68] vs paper_forward: mean_abs=0.4030795097351074, max_abs=1.4375, mean_rel=0.09906429052352905, max_rel=4.787914752960205, norm_rel=0.020549271255731583, ref_abs_avg=19.128559112548828, test_abs_avg=19.133045196533203
production_forward2 grad[69] vs paper_forward: mean_abs=0.5235167741775513, max_abs=4.0, mean_rel=0.158034548163414, max_rel=1748.8262939453125, norm_rel=0.02254946529865265, ref_abs_avg=23.20309066772461, test_abs_avg=23.204383850097656
production_forward2 grad[70] vs paper_forward: mean_abs=0.5125386714935303, max_abs=6.0, mean_rel=0.14295917749404907, max_rel=402.7288818359375, norm_rel=0.022735709324479103, ref_abs_avg=22.59861183166504, test_abs_avg=22.60511016845703
production_forward2 grad[71] vs paper_forward: mean_abs=0.385847806930542, max_abs=1.625, mean_rel=0.08206899464130402, max_rel=2.9184844493865967, norm_rel=0.020505251362919807, ref_abs_avg=18.66067886352539, test_abs_avg=18.655506134033203
production_forward2 grad[72] vs paper_forward: mean_abs=0.5015807747840881, max_abs=4.875, mean_rel=0.14225861430168152, max_rel=936.7786254882812, norm_rel=0.022146981209516525, ref_abs_avg=22.605003356933594, test_abs_avg=22.604820251464844
production_forward2 grad[73] vs paper_forward: mean_abs=0.4889581799507141, max_abs=3.5, mean_rel=0.14120110869407654, max_rel=773.8795776367188, norm_rel=0.022258056327700615, ref_abs_avg=21.967506408691406, test_abs_avg=21.971084594726562
production_forward2 grad[74] vs paper_forward: mean_abs=0.45319223403930664, max_abs=1.53125, mean_rel=0.17455585300922394, max_rel=17.81905746459961, norm_rel=0.023800823837518692, ref_abs_avg=18.971771240234375, test_abs_avg=18.978710174560547
production_forward2 grad[75] vs paper_forward: mean_abs=0.5457332134246826, max_abs=4.5, mean_rel=0.15049424767494202, max_rel=1078.090576171875, norm_rel=0.023588716983795166, ref_abs_avg=23.178905487060547, test_abs_avg=23.180641174316406
production_forward2 grad[76] vs paper_forward: mean_abs=0.5360802412033081, max_abs=4.0, mean_rel=0.15421099960803986, max_rel=825.1034545898438, norm_rel=0.023246916010975838, ref_abs_avg=23.01299476623535, test_abs_avg=23.01948356628418
production_forward2 grad[77] vs paper_forward: mean_abs=0.40624427795410156, max_abs=1.59375, mean_rel=0.08849040418863297, max_rel=7.584646224975586, norm_rel=0.02271699532866478, ref_abs_avg=18.295745849609375, test_abs_avg=18.30855941772461
production_forward2 grad[78] vs paper_forward: mean_abs=0.5071314573287964, max_abs=4.25, mean_rel=0.14622917771339417, max_rel=826.0391845703125, norm_rel=0.02299444004893303, ref_abs_avg=22.05719757080078, test_abs_avg=22.05799102783203
production_forward2 grad[79] vs paper_forward: mean_abs=0.500230610370636, max_abs=4.0, mean_rel=0.13584989309310913, max_rel=264.89599609375, norm_rel=0.02245679497718811, ref_abs_avg=22.225933074951172, test_abs_avg=22.22623062133789
production_forward2 grad[80] vs paper_forward: mean_abs=0.3807497024536133, max_abs=1.5, mean_rel=0.21847563982009888, max_rel=52.32881546020508, norm_rel=0.023588333278894424, ref_abs_avg=16.261877059936523, test_abs_avg=16.262235641479492
production_forward2 grad[81] vs paper_forward: mean_abs=0.47655731439590454, max_abs=3.8125, mean_rel=0.1472059190273285, max_rel=809.3916625976562, norm_rel=0.022714050486683846, ref_abs_avg=21.014572143554688, test_abs_avg=21.01395034790039
production_forward2 grad[82] vs paper_forward: mean_abs=0.4623424708843231, max_abs=4.375, mean_rel=0.1370891034603119, max_rel=1037.896484375, norm_rel=0.022281503304839134, ref_abs_avg=20.84571075439453, test_abs_avg=20.84688949584961
production_forward2 grad[83] vs paper_forward: mean_abs=0.3678274154663086, max_abs=1.75, mean_rel=0.09119271486997604, max_rel=4.814550399780273, norm_rel=0.020831400528550148, ref_abs_avg=17.780963897705078, test_abs_avg=17.789018630981445
production_forward2 grad[84] vs paper_forward: mean_abs=0.4443776309490204, max_abs=4.0, mean_rel=0.13147467374801636, max_rel=417.7476501464844, norm_rel=0.021957214921712875, ref_abs_avg=20.298931121826172, test_abs_avg=20.299636840820312
production_forward2 grad[85] vs paper_forward: mean_abs=0.4339979290962219, max_abs=3.25, mean_rel=0.13079048693180084, max_rel=1134.397705078125, norm_rel=0.021740369498729706, ref_abs_avg=20.10053825378418, test_abs_avg=20.104141235351562
production_forward2 grad[86] vs paper_forward: mean_abs=0.3420238494873047, max_abs=1.25, mean_rel=0.10426633059978485, max_rel=15.261317253112793, norm_rel=0.020581943914294243, ref_abs_avg=16.32069969177246, test_abs_avg=16.334716796875
production_forward2 grad[87] vs paper_forward: mean_abs=0.41889166831970215, max_abs=5.0, mean_rel=0.14091798663139343, max_rel=962.173095703125, norm_rel=0.02152073197066784, ref_abs_avg=19.556758880615234, test_abs_avg=19.55681037902832
production_forward2 grad[88] vs paper_forward: mean_abs=0.40586793422698975, max_abs=4.125, mean_rel=0.1330600529909134, max_rel=384.2916564941406, norm_rel=0.02078843116760254, ref_abs_avg=19.53902244567871, test_abs_avg=19.537872314453125
production_forward2 grad[89] vs paper_forward: mean_abs=0.3110746145248413, max_abs=1.453125, mean_rel=0.12213227152824402, max_rel=19.207561492919922, norm_rel=0.02036285772919655, ref_abs_avg=15.19472885131836, test_abs_avg=15.176265716552734
production_forward2 grad[90] vs paper_forward: mean_abs=0.39011046290397644, max_abs=4.875, mean_rel=0.12635111808776855, max_rel=644.435302734375, norm_rel=0.020946063101291656, ref_abs_avg=18.781946182250977, test_abs_avg=18.783191680908203
production_forward2 grad[91] vs paper_forward: mean_abs=0.38650989532470703, max_abs=3.5, mean_rel=0.13447588682174683, max_rel=701.9723510742188, norm_rel=0.020924793556332588, ref_abs_avg=18.641342163085938, test_abs_avg=18.640501022338867
production_forward2 grad[92] vs paper_forward: mean_abs=0.3066573143005371, max_abs=1.125, mean_rel=0.13458368182182312, max_rel=14.62057113647461, norm_rel=0.020380999892950058, ref_abs_avg=15.20871639251709, test_abs_avg=15.186300277709961
production_forward2 grad[93] vs paper_forward: mean_abs=0.3703208565711975, max_abs=4.25, mean_rel=0.12417876720428467, max_rel=836.9517211914062, norm_rel=0.020400047302246094, ref_abs_avg=18.38072967529297, test_abs_avg=18.38287353515625
production_forward2 grad[94] vs paper_forward: mean_abs=0.36066263914108276, max_abs=3.0, mean_rel=0.11876664310693741, max_rel=408.17535400390625, norm_rel=0.019725235179066658, ref_abs_avg=18.469219207763672, test_abs_avg=18.468692779541016
production_forward2 grad[95] vs paper_forward: mean_abs=0.2986893653869629, max_abs=1.25, mean_rel=0.05311250314116478, max_rel=1.813632845878601, norm_rel=0.018845614045858383, ref_abs_avg=15.92737102508545, test_abs_avg=15.965547561645508
production_forward2 grad[96] vs paper_forward: mean_abs=0.35730820894241333, max_abs=4.75, mean_rel=0.12261667102575302, max_rel=654.9535522460938, norm_rel=0.01990816369652748, ref_abs_avg=18.21951675415039, test_abs_avg=18.21990966796875
production_forward2 grad[97] vs paper_forward: mean_abs=0.3482164740562439, max_abs=3.5, mean_rel=0.12330137193202972, max_rel=799.885498046875, norm_rel=0.019791990518569946, ref_abs_avg=18.015567779541016, test_abs_avg=18.020713806152344
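
The per-gradient lines above can plausibly be reproduced with a small comparison helper like the following. This is a hypothetical reconstruction (the function name, `eps` guard, and exact reduction order are assumptions, not taken from the script that emitted this log); it computes the same seven statistics each line reports for a reference/test gradient pair.

```python
import torch

def grad_metrics(ref: torch.Tensor, test: torch.Tensor, eps: float = 1e-12) -> dict:
    """Elementwise comparison stats between a reference and a test gradient.

    Hypothetical sketch of the metrics printed in the log above:
    mean_abs/max_abs are absolute-difference stats, mean_rel/max_rel are
    relative to |ref| (eps-guarded), norm_rel is ||diff|| / ||ref||.
    """
    ref_f, test_f = ref.float(), test.float()
    diff = (test_f - ref_f).abs()
    ref_abs = ref_f.abs()
    rel = diff / (ref_abs + eps)  # per-element relative error
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        "max_rel": rel.max().item(),
        "norm_rel": (diff.norm() / (ref_f.norm() + eps)).item(),
        "ref_abs_avg": ref_abs.mean().item(),
        "test_abs_avg": test_f.abs().mean().item(),
    }
```

Note how this explains the pattern in the data: `max_rel` can blow up into the hundreds or thousands wherever the reference gradient has near-zero entries, even while `norm_rel` stays around 0.02 and the average magnitudes (`ref_abs_avg` vs `test_abs_avg`) agree closely, which is the usual signature of benign low-precision (e.g. bf16) noise rather than a real mismatch.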
identity layers + randn queries
paper_forward fwd+bwd:  379.802 ms
paper_forward bwd-only: 294.040 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.043 GiB, fwd+bwd=32.793 GiB
production_forward2 fwd+bwd:  224.406 ms
production_forward2 bwd-only: 202.219 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.252 GiB, fwd+bwd=9.002 GiB
production_forward fwd+bwd:  112.045 ms
production_forward bwd-only: 91.677 ms
production_forward peak allocated: fwd=2.364 GiB, fwd+bwd=6.243 GiB
production_forward peak reserved:  fwd=2.502 GiB, fwd+bwd=6.377 GiB
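
The timing and peak-memory figures above could be gathered with a harness along these lines. This is an illustrative sketch, not the script that produced the log: the function name and iteration scheme are assumptions, and the CUDA-event timing / peak-stats reset is the standard PyTorch pattern for measurements like "fwd+bwd ms" and "peak allocated/reserved GiB".

```python
import time
import torch

def bench_fwd_bwd(fn, x: torch.Tensor, iters: int = 10):
    """Time fwd+bwd of fn(x) and report CUDA peak-memory stats if available.

    Illustrative sketch: resets the CUDA peak-memory counters, runs fwd+bwd
    `iters` times, and returns (avg ms per iter, memory stats in GiB).
    """
    use_cuda = x.is_cuda and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(x)
        out.sum().backward()  # scalar loss so backward is well-defined
        x.grad = None  # drop accumulated grads between iterations
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    elapsed_ms = (time.perf_counter() - t0) / iters * 1e3
    stats = {}
    if use_cuda:
        gib = 1024 ** 3
        stats["peak_allocated_gib"] = torch.cuda.max_memory_allocated() / gib
        stats["peak_reserved_gib"] = torch.cuda.max_memory_reserved() / gib
    return elapsed_ms, stats
```

The allocated-vs-reserved gap in the log (e.g. 6.243 GiB allocated vs 9.002 GiB reserved for `production_forward2`) reflects the caching allocator holding freed blocks; `max_memory_reserved` counts the cached pool, not just live tensors.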

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0015964774647727609, max_abs=0.03515625
production_forward grad[0] vs paper_forward: mean_abs=0.008500732481479645, max_abs=0.34375, mean_rel=0.07476009428501129, max_rel=175.08518981933594, norm_rel=0.02053343877196312, ref_abs_avg=0.44746002554893494, test_abs_avg=0.4474821984767914
production_forward grad[1] vs paper_forward: mean_abs=7.362330913543701, max_abs=64.0, mean_rel=0.21622200310230255, max_rel=1027.5665283203125, norm_rel=0.02072373405098915, ref_abs_avg=312.3734130859375, test_abs_avg=312.3000183105469
production_forward grad[2] vs paper_forward: mean_abs=1.2679448127746582, max_abs=4.75, mean_rel=0.09050151705741882, max_rel=6.511669158935547, norm_rel=0.024683156982064247, ref_abs_avg=49.227684020996094, test_abs_avg=49.27251434326172
production_forward grad[3] vs paper_forward: mean_abs=1.543885588645935, max_abs=10.0, mean_rel=0.17091158032417297, max_rel=3244.729736328125, norm_rel=0.02442041039466858, ref_abs_avg=63.50519561767578, test_abs_avg=63.51051330566406
production_forward grad[4] vs paper_forward: mean_abs=1.5055376291275024, max_abs=9.5, mean_rel=0.16551819443702698, max_rel=1693.95068359375, norm_rel=0.024176396429538727, ref_abs_avg=62.587303161621094, test_abs_avg=62.58953857421875
production_forward grad[5] vs paper_forward: mean_abs=1.135448694229126, max_abs=4.75, mean_rel=0.2428051233291626, max_rel=52.03708267211914, norm_rel=0.02567044086754322, ref_abs_avg=45.97639465332031, test_abs_avg=46.04011535644531
production_forward grad[6] vs paper_forward: mean_abs=1.3736932277679443, max_abs=9.0, mean_rel=0.1701066792011261, max_rel=1994.685302734375, norm_rel=0.024158352985978127, ref_abs_avg=57.16047668457031, test_abs_avg=57.166297912597656
production_forward grad[7] vs paper_forward: mean_abs=1.3421151638031006, max_abs=8.0, mean_rel=0.16589495539665222, max_rel=1505.4951171875, norm_rel=0.02391684800386429, ref_abs_avg=56.481849670410156, test_abs_avg=56.488521575927734
production_forward grad[8] vs paper_forward: mean_abs=1.0179576873779297, max_abs=4.0, mean_rel=0.09049347043037415, max_rel=3.6495072841644287, norm_rel=0.02342974953353405, ref_abs_avg=44.10752487182617, test_abs_avg=44.07745361328125
production_forward grad[9] vs paper_forward: mean_abs=1.2501626014709473, max_abs=8.0, mean_rel=0.1648455113172531, max_rel=1655.3089599609375, norm_rel=0.02384544536471367, ref_abs_avg=52.69188690185547, test_abs_avg=52.69657516479492
production_forward grad[10] vs paper_forward: mean_abs=1.2236831188201904, max_abs=7.75, mean_rel=0.174168661236763, max_rel=1854.69091796875, norm_rel=0.023817747831344604, ref_abs_avg=51.64215850830078, test_abs_avg=51.64569091796875
production_forward grad[11] vs paper_forward: mean_abs=0.895024299621582, max_abs=4.0, mean_rel=0.0903351828455925, max_rel=13.600430488586426, norm_rel=0.021331479772925377, ref_abs_avg=43.20718765258789, test_abs_avg=43.230552673339844
production_forward grad[12] vs paper_forward: mean_abs=1.155586838722229, max_abs=7.5, mean_rel=0.16160368919372559, max_rel=2403.9658203125, norm_rel=0.02369983121752739, ref_abs_avg=49.020111083984375, test_abs_avg=49.025794982910156
production_forward grad[13] vs paper_forward: mean_abs=1.1184766292572021, max_abs=7.0, mean_rel=0.16443821787834167, max_rel=1244.1533203125, norm_rel=0.023204533383250237, ref_abs_avg=48.41257858276367, test_abs_avg=48.417118072509766
production_forward grad[14] vs paper_forward: mean_abs=0.8527898788452148, max_abs=3.5, mean_rel=0.08090081065893173, max_rel=4.1825947761535645, norm_rel=0.023979250341653824, ref_abs_avg=36.459747314453125, test_abs_avg=36.49633026123047
production_forward grad[15] vs paper_forward: mean_abs=1.0745668411254883, max_abs=6.5, mean_rel=0.152847558259964, max_rel=1195.5081787109375, norm_rel=0.023449737578630447, ref_abs_avg=46.01848602294922, test_abs_avg=46.01877975463867
production_forward grad[16] vs paper_forward: mean_abs=1.042942762374878, max_abs=6.5, mean_rel=0.16765618324279785, max_rel=1260.9307861328125, norm_rel=0.023255284875631332, ref_abs_avg=45.028751373291016, test_abs_avg=45.04185104370117
production_forward grad[17] vs paper_forward: mean_abs=0.8022980690002441, max_abs=3.0, mean_rel=0.07907778024673462, max_rel=3.1453869342803955, norm_rel=0.021968040615320206, ref_abs_avg=36.424644470214844, test_abs_avg=36.375335693359375
production_forward grad[18] vs paper_forward: mean_abs=1.007512092590332, max_abs=7.5, mean_rel=0.16516155004501343, max_rel=1522.3399658203125, norm_rel=0.023307671770453453, ref_abs_avg=43.452613830566406, test_abs_avg=43.456695556640625
production_forward grad[19] vs paper_forward: mean_abs=0.9832732081413269, max_abs=6.0, mean_rel=0.16334854066371918, max_rel=1094.6781005859375, norm_rel=0.02297057770192623, ref_abs_avg=42.98231887817383, test_abs_avg=42.981231689453125
production_forward grad[20] vs paper_forward: mean_abs=0.7842657566070557, max_abs=3.1875, mean_rel=0.09534823894500732, max_rel=7.208502769470215, norm_rel=0.024812882766127586, ref_abs_avg=32.57734680175781, test_abs_avg=32.57322692871094
production_forward grad[21] vs paper_forward: mean_abs=0.9529092311859131, max_abs=6.5, mean_rel=0.15322929620742798, max_rel=1088.685546875, norm_rel=0.023135704919695854, ref_abs_avg=41.40481185913086, test_abs_avg=41.40530014038086
production_forward grad[22] vs paper_forward: mean_abs=0.9284202456474304, max_abs=5.3125, mean_rel=0.1549929678440094, max_rel=692.3433227539062, norm_rel=0.022855397313833237, ref_abs_avg=40.77063751220703, test_abs_avg=40.77923583984375
production_forward grad[23] vs paper_forward: mean_abs=0.7218952178955078, max_abs=3.0, mean_rel=0.07933320105075836, max_rel=5.052509307861328, norm_rel=0.021657828241586685, ref_abs_avg=33.75258255004883, test_abs_avg=33.81642150878906
production_forward grad[24] vs paper_forward: mean_abs=0.8999160528182983, max_abs=6.0, mean_rel=0.15104609727859497, max_rel=1306.494384765625, norm_rel=0.023032337427139282, ref_abs_avg=39.285423278808594, test_abs_avg=39.288429260253906
production_forward grad[25] vs paper_forward: mean_abs=0.8782625198364258, max_abs=6.0, mean_rel=0.14728544652462006, max_rel=802.178466796875, norm_rel=0.022724179551005363, ref_abs_avg=38.833396911621094, test_abs_avg=38.83006286621094
production_forward grad[26] vs paper_forward: mean_abs=0.8033714294433594, max_abs=3.25, mean_rel=0.06493847817182541, max_rel=2.9202089309692383, norm_rel=0.02278890646994114, ref_abs_avg=35.83210372924805, test_abs_avg=35.837646484375
production_forward grad[27] vs paper_forward: mean_abs=1.0435857772827148, max_abs=6.75, mean_rel=0.1722404658794403, max_rel=1595.4110107421875, norm_rel=0.025184763595461845, ref_abs_avg=41.60702133178711, test_abs_avg=41.613040924072266
production_forward grad[28] vs paper_forward: mean_abs=1.007512092590332, max_abs=7.5, mean_rel=0.16822993755340576, max_rel=917.9278564453125, norm_rel=0.024671277031302452, ref_abs_avg=41.04370880126953, test_abs_avg=41.049346923828125
production_forward grad[29] vs paper_forward: mean_abs=0.761347770690918, max_abs=3.0, mean_rel=0.22242620587348938, max_rel=42.6309700012207, norm_rel=0.023805350065231323, ref_abs_avg=31.924396514892578, test_abs_avg=31.926122665405273
production_forward grad[30] vs paper_forward: mean_abs=0.9530014395713806, max_abs=6.5, mean_rel=0.17236483097076416, max_rel=1715.330322265625, norm_rel=0.025324825197458267, ref_abs_avg=37.76667022705078, test_abs_avg=37.770751953125
production_forward grad[31] vs paper_forward: mean_abs=0.9399752616882324, max_abs=7.0, mean_rel=0.17835760116577148, max_rel=1698.673583984375, norm_rel=0.025078799575567245, ref_abs_avg=37.63287353515625, test_abs_avg=37.62977600097656
production_forward grad[32] vs paper_forward: mean_abs=0.7499450445175171, max_abs=2.78125, mean_rel=0.2674749195575714, max_rel=88.73460388183594, norm_rel=0.025454191491007805, ref_abs_avg=29.362869262695312, test_abs_avg=29.396867752075195
production_forward grad[33] vs paper_forward: mean_abs=0.8852285146713257, max_abs=5.5, mean_rel=0.1597721427679062, max_rel=1131.974609375, norm_rel=0.02505839429795742, ref_abs_avg=35.41510772705078, test_abs_avg=35.41845703125
production_forward grad[34] vs paper_forward: mean_abs=0.8734877109527588, max_abs=6.0, mean_rel=0.1705067902803421, max_rel=1033.4873046875, norm_rel=0.024970397353172302, ref_abs_avg=35.081031799316406, test_abs_avg=35.08720016479492
production_forward grad[35] vs paper_forward: mean_abs=0.6374759674072266, max_abs=2.25, mean_rel=0.09352324157953262, max_rel=6.164260387420654, norm_rel=0.02291577309370041, ref_abs_avg=27.733989715576172, test_abs_avg=27.766624450683594
production_forward grad[36] vs paper_forward: mean_abs=0.827873170375824, max_abs=5.5, mean_rel=0.16392025351524353, max_rel=1577.7572021484375, norm_rel=0.024954425171017647, ref_abs_avg=33.268226623535156, test_abs_avg=33.27035903930664
production_forward grad[37] vs paper_forward: mean_abs=0.8138488531112671, max_abs=5.0, mean_rel=0.1673916131258011, max_rel=779.0196533203125, norm_rel=0.02465343289077282, ref_abs_avg=33.08883285522461, test_abs_avg=33.08518981933594
production_forward grad[38] vs paper_forward: mean_abs=0.6634440422058105, max_abs=2.78125, mean_rel=0.0895896628499031, max_rel=2.8574447631835938, norm_rel=0.025702202692627907, ref_abs_avg=25.37755584716797, test_abs_avg=25.38373374938965
production_forward grad[39] vs paper_forward: mean_abs=0.7839435338973999, max_abs=5.15625, mean_rel=0.1693934202194214, max_rel=2125.6630859375, norm_rel=0.024724936112761497, ref_abs_avg=31.815505981445312, test_abs_avg=31.81808090209961
production_forward grad[40] vs paper_forward: mean_abs=0.7734546661376953, max_abs=5.0, mean_rel=0.16860322654247284, max_rel=1325.1251220703125, norm_rel=0.024368803948163986, ref_abs_avg=31.876300811767578, test_abs_avg=31.8729190826416
production_forward grad[41] vs paper_forward: mean_abs=0.614334225654602, max_abs=2.375, mean_rel=0.28479140996932983, max_rel=96.53919982910156, norm_rel=0.0236568171530962, ref_abs_avg=26.364852905273438, test_abs_avg=26.360248565673828
production_forward grad[42] vs paper_forward: mean_abs=0.7484722137451172, max_abs=5.0, mean_rel=0.15742400288581848, max_rel=1002.81396484375, norm_rel=0.024360110983252525, ref_abs_avg=30.776161193847656, test_abs_avg=30.778850555419922
production_forward grad[43] vs paper_forward: mean_abs=0.7333158850669861, max_abs=4.578125, mean_rel=0.16280892491340637, max_rel=868.812255859375, norm_rel=0.02421037293970585, ref_abs_avg=30.369983673095703, test_abs_avg=30.37533950805664
production_forward grad[44] vs paper_forward: mean_abs=0.5694985389709473, max_abs=2.6015625, mean_rel=0.11870982497930527, max_rel=15.761075019836426, norm_rel=0.023121973499655724, ref_abs_avg=24.980836868286133, test_abs_avg=25.004261016845703
production_forward grad[45] vs paper_forward: mean_abs=0.709856390953064, max_abs=4.75, mean_rel=0.1582670658826828, max_rel=782.5536499023438, norm_rel=0.02413768321275711, ref_abs_avg=29.467309951782227, test_abs_avg=29.470455169677734
production_forward grad[46] vs paper_forward: mean_abs=0.6949677467346191, max_abs=4.5, mean_rel=0.16435369849205017, max_rel=2469.27294921875, norm_rel=0.024000506848096848, ref_abs_avg=29.0704345703125, test_abs_avg=29.068632125854492
production_forward grad[47] vs paper_forward: mean_abs=0.5481945276260376, max_abs=1.8125, mean_rel=0.08382941782474518, max_rel=6.407116889953613, norm_rel=0.02386670559644699, ref_abs_avg=22.78373908996582, test_abs_avg=22.718669891357422
production_forward grad[48] vs paper_forward: mean_abs=0.6784297227859497, max_abs=4.5, mean_rel=0.1562502086162567, max_rel=1675.8160400390625, norm_rel=0.02377277798950672, ref_abs_avg=28.563251495361328, test_abs_avg=28.563907623291016
production_forward grad[49] vs paper_forward: mean_abs=0.6631155014038086, max_abs=4.5, mean_rel=0.16807898879051208, max_rel=1204.126708984375, norm_rel=0.023414937779307365, ref_abs_avg=28.328466415405273, test_abs_avg=28.328651428222656
production_forward grad[50] vs paper_forward: mean_abs=0.5943665504455566, max_abs=2.25, mean_rel=0.09472986310720444, max_rel=3.288501501083374, norm_rel=0.023855332285165787, ref_abs_avg=25.09334945678711, test_abs_avg=25.115530014038086
production_forward grad[51] vs paper_forward: mean_abs=0.7721626162528992, max_abs=5.0, mean_rel=0.1643909513950348, max_rel=1172.558837890625, norm_rel=0.02565106377005577, ref_abs_avg=30.211429595947266, test_abs_avg=30.214263916015625
production_forward grad[52] vs paper_forward: mean_abs=0.7561149597167969, max_abs=4.75, mean_rel=0.1714630126953125, max_rel=2145.600341796875, norm_rel=0.025510545819997787, ref_abs_avg=29.697174072265625, test_abs_avg=29.693099975585938
production_forward grad[53] vs paper_forward: mean_abs=0.5689294934272766, max_abs=2.1875, mean_rel=0.3048349618911743, max_rel=113.82649993896484, norm_rel=0.024621956050395966, ref_abs_avg=23.28278350830078, test_abs_avg=23.318172454833984
production_forward grad[54] vs paper_forward: mean_abs=0.705531120300293, max_abs=6.0, mean_rel=0.16365112364292145, max_rel=1305.4749755859375, norm_rel=0.024983657523989677, ref_abs_avg=28.260740280151367, test_abs_avg=28.26292610168457
production_forward grad[55] vs paper_forward: mean_abs=0.6937624216079712, max_abs=4.75, mean_rel=0.17743878066539764, max_rel=1708.2689208984375, norm_rel=0.02513764798641205, ref_abs_avg=27.7144775390625, test_abs_avg=27.715572357177734
production_forward grad[56] vs paper_forward: mean_abs=0.4977019429206848, max_abs=2.25, mean_rel=0.08936417102813721, max_rel=4.5417256355285645, norm_rel=0.02323915623128414, ref_abs_avg=21.41238021850586, test_abs_avg=21.407760620117188
production_forward grad[57] vs paper_forward: mean_abs=0.654055118560791, max_abs=5.0, mean_rel=0.1698618084192276, max_rel=1225.4925537109375, norm_rel=0.024345722049474716, ref_abs_avg=26.891101837158203, test_abs_avg=26.89403533935547
production_forward grad[58] vs paper_forward: mean_abs=0.6418321132659912, max_abs=4.25, mean_rel=0.1557873785495758, max_rel=1031.090576171875, norm_rel=0.024624472483992577, ref_abs_avg=26.129535675048828, test_abs_avg=26.135164260864258
production_forward grad[59] vs paper_forward: mean_abs=0.48058629035949707, max_abs=1.75, mean_rel=0.06968705356121063, max_rel=2.327746629714966, norm_rel=0.023175733163952827, ref_abs_avg=20.643327713012695, test_abs_avg=20.614700317382812
production_forward grad[60] vs paper_forward: mean_abs=0.6079273223876953, max_abs=4.0, mean_rel=0.16178351640701294, max_rel=782.32958984375, norm_rel=0.024112612009048462, ref_abs_avg=25.27296257019043, test_abs_avg=25.27280044555664
production_forward grad[61] vs paper_forward: mean_abs=0.6000787615776062, max_abs=4.25, mean_rel=0.1488012671470642, max_rel=798.2662353515625, norm_rel=0.023697054013609886, ref_abs_avg=25.307159423828125, test_abs_avg=25.30057144165039
production_forward grad[62] vs paper_forward: mean_abs=0.46502685546875, max_abs=1.5, mean_rel=0.08984807133674622, max_rel=5.185436248779297, norm_rel=0.022418322041630745, ref_abs_avg=20.594602584838867, test_abs_avg=20.565303802490234
production_forward grad[63] vs paper_forward: mean_abs=0.575316309928894, max_abs=4.125, mean_rel=0.1525849997997284, max_rel=1411.9283447265625, norm_rel=0.023686610162258148, ref_abs_avg=24.291881561279297, test_abs_avg=24.293575286865234
production_forward grad[64] vs paper_forward: mean_abs=0.5643606781959534, max_abs=4.25, mean_rel=0.14396241307258606, max_rel=711.1770629882812, norm_rel=0.023296786472201347, ref_abs_avg=24.226627349853516, test_abs_avg=24.226539611816406
production_forward grad[65] vs paper_forward: mean_abs=0.41059446334838867, max_abs=1.5, mean_rel=0.08553285896778107, max_rel=5.205163955688477, norm_rel=0.020941412076354027, ref_abs_avg=19.40099334716797, test_abs_avg=19.405284881591797
production_forward grad[66] vs paper_forward: mean_abs=0.549530029296875, max_abs=4.25, mean_rel=0.14826485514640808, max_rel=908.9880981445312, norm_rel=0.023028112947940826, ref_abs_avg=23.822044372558594, test_abs_avg=23.82305145263672
production_forward grad[67] vs paper_forward: mean_abs=0.5393935441970825, max_abs=4.0, mean_rel=0.15433263778686523, max_rel=909.5457153320312, norm_rel=0.023142840713262558, ref_abs_avg=23.27069091796875, test_abs_avg=23.278587341308594
production_forward grad[68] vs paper_forward: mean_abs=0.4132475256919861, max_abs=1.71875, mean_rel=0.19640493392944336, max_rel=69.35003662109375, norm_rel=0.020678052678704262, ref_abs_avg=20.089399337768555, test_abs_avg=20.074527740478516
production_forward grad[69] vs paper_forward: mean_abs=0.5219429731369019, max_abs=5.0, mean_rel=0.14741042256355286, max_rel=633.275634765625, norm_rel=0.02272491715848446, ref_abs_avg=22.954875946044922, test_abs_avg=22.956287384033203
production_forward grad[70] vs paper_forward: mean_abs=0.513386070728302, max_abs=4.5, mean_rel=0.15541544556617737, max_rel=1050.2421875, norm_rel=0.022959811612963676, ref_abs_avg=22.37519073486328, test_abs_avg=22.377216339111328
production_forward grad[71] vs paper_forward: mean_abs=0.41741693019866943, max_abs=1.875, mean_rel=0.3379693031311035, max_rel=127.36577606201172, norm_rel=0.02240210399031639, ref_abs_avg=18.701766967773438, test_abs_avg=18.688093185424805
production_forward grad[72] vs paper_forward: mean_abs=0.49198445677757263, max_abs=3.625, mean_rel=0.14221510291099548, max_rel=883.3890991210938, norm_rel=0.02251722663640976, ref_abs_avg=21.903289794921875, test_abs_avg=21.905078887939453
production_forward grad[73] vs paper_forward: mean_abs=0.48690640926361084, max_abs=4.0, mean_rel=0.13354191184043884, max_rel=681.4431762695312, norm_rel=0.022175323218107224, ref_abs_avg=21.928569793701172, test_abs_avg=21.936466217041016
production_forward grad[74] vs paper_forward: mean_abs=0.42742785811424255, max_abs=2.0, mean_rel=0.5726978182792664, max_rel=233.8248291015625, norm_rel=0.020699936896562576, ref_abs_avg=20.750747680664062, test_abs_avg=20.741260528564453
production_forward grad[75] vs paper_forward: mean_abs=0.5642870664596558, max_abs=5.0, mean_rel=0.16283847391605377, max_rel=1142.899169921875, norm_rel=0.02394244447350502, ref_abs_avg=23.58047103881836, test_abs_avg=23.58185386657715
production_forward grad[76] vs paper_forward: mean_abs=0.553646981716156, max_abs=5.0, mean_rel=0.15973971784114838, max_rel=531.6668090820312, norm_rel=0.023906243965029716, ref_abs_avg=23.212100982666016, test_abs_avg=23.214946746826172
production_forward grad[77] vs paper_forward: mean_abs=0.43439626693725586, max_abs=1.64453125, mean_rel=0.09385901689529419, max_rel=6.43120813369751, norm_rel=0.02271248959004879, ref_abs_avg=19.47956085205078, test_abs_avg=19.448406219482422
production_forward grad[78] vs paper_forward: mean_abs=0.5140027403831482, max_abs=4.5, mean_rel=0.15326815843582153, max_rel=1207.3978271484375, norm_rel=0.023408068343997, ref_abs_avg=22.000978469848633, test_abs_avg=22.00067138671875
production_forward grad[79] vs paper_forward: mean_abs=0.5048242211341858, max_abs=4.5, mean_rel=0.14897999167442322, max_rel=733.7786254882812, norm_rel=0.023010747507214546, ref_abs_avg=21.973033905029297, test_abs_avg=21.968215942382812
production_forward grad[80] vs paper_forward: mean_abs=0.37358856201171875, max_abs=1.375, mean_rel=0.06756393611431122, max_rel=4.507688045501709, norm_rel=0.020305806770920753, ref_abs_avg=18.79967498779297, test_abs_avg=18.803695678710938
production_forward grad[81] vs paper_forward: mean_abs=0.4744136333465576, max_abs=4.0, mean_rel=0.1373082995414734, max_rel=1102.34130859375, norm_rel=0.02245980314910412, ref_abs_avg=21.123943328857422, test_abs_avg=21.124366760253906
production_forward grad[82] vs paper_forward: mean_abs=0.4608392119407654, max_abs=4.5, mean_rel=0.14380761981010437, max_rel=537.3432006835938, norm_rel=0.022155188024044037, ref_abs_avg=20.875926971435547, test_abs_avg=20.875518798828125
production_forward grad[83] vs paper_forward: mean_abs=0.3468616008758545, max_abs=1.388671875, mean_rel=0.08489890396595001, max_rel=7.22909688949585, norm_rel=0.019682252779603004, ref_abs_avg=17.784404754638672, test_abs_avg=17.7777099609375
production_forward grad[84] vs paper_forward: mean_abs=0.440884530544281, max_abs=4.25, mean_rel=0.13950955867767334, max_rel=725.5100708007812, norm_rel=0.02176823280751705, ref_abs_avg=20.284955978393555, test_abs_avg=20.285030364990234
production_forward grad[85] vs paper_forward: mean_abs=0.4175448417663574, max_abs=4.0, mean_rel=0.13806495070457458, max_rel=739.71923828125, norm_rel=0.021349549293518066, ref_abs_avg=19.594491958618164, test_abs_avg=19.59190559387207
production_forward grad[86] vs paper_forward: mean_abs=0.32700204849243164, max_abs=1.34375, mean_rel=0.11205822974443436, max_rel=17.358652114868164, norm_rel=0.019749414175748825, ref_abs_avg=16.291658401489258, test_abs_avg=16.28445816040039
production_forward grad[87] vs paper_forward: mean_abs=0.40819090604782104, max_abs=4.5, mean_rel=0.13595789670944214, max_rel=929.5814819335938, norm_rel=0.021291758865118027, ref_abs_avg=19.26221466064453, test_abs_avg=19.26259994506836
production_forward grad[88] vs paper_forward: mean_abs=0.4012467861175537, max_abs=5.5, mean_rel=0.1266402304172516, max_rel=370.42510986328125, norm_rel=0.02094915695488453, ref_abs_avg=19.286663055419922, test_abs_avg=19.28818130493164
production_forward grad[89] vs paper_forward: mean_abs=0.31595784425735474, max_abs=1.625, mean_rel=0.32066795229911804, max_rel=110.7616195678711, norm_rel=0.020044932141900063, ref_abs_avg=15.836316108703613, test_abs_avg=15.831429481506348
production_forward grad[90] vs paper_forward: mean_abs=0.3851398825645447, max_abs=4.5, mean_rel=0.1314808428287506, max_rel=685.1091918945312, norm_rel=0.020757215097546577, ref_abs_avg=18.67788314819336, test_abs_avg=18.6788330078125
production_forward grad[91] vs paper_forward: mean_abs=0.379202663898468, max_abs=4.25, mean_rel=0.12529167532920837, max_rel=394.84088134765625, norm_rel=0.02028338797390461, ref_abs_avg=18.82501983642578, test_abs_avg=18.83222770690918
production_forward grad[92] vs paper_forward: mean_abs=0.3072361946105957, max_abs=1.25, mean_rel=0.09760486334562302, max_rel=16.343191146850586, norm_rel=0.020302947610616684, ref_abs_avg=15.251312255859375, test_abs_avg=15.224691390991211
production_forward grad[93] vs paper_forward: mean_abs=0.3632262945175171, max_abs=3.328125, mean_rel=0.1266712099313736, max_rel=907.1834716796875, norm_rel=0.020203087478876114, ref_abs_avg=18.188758850097656, test_abs_avg=18.189430236816406
production_forward grad[94] vs paper_forward: mean_abs=0.3705618679523468, max_abs=5.0, mean_rel=0.1358235478401184, max_rel=651.55810546875, norm_rel=0.02045699767768383, ref_abs_avg=18.355924606323242, test_abs_avg=18.364681243896484
production_forward grad[95] vs paper_forward: mean_abs=0.2843835651874542, max_abs=1.171875, mean_rel=0.057326629757881165, max_rel=1.4455313682556152, norm_rel=0.01850901171565056, ref_abs_avg=15.45756721496582, test_abs_avg=15.445938110351562
production_forward grad[96] vs paper_forward: mean_abs=0.34789490699768066, max_abs=4.25, mean_rel=0.12025538831949234, max_rel=410.9007568359375, norm_rel=0.0199016984552145, ref_abs_avg=17.77570343017578, test_abs_avg=17.775287628173828
production_forward grad[97] vs paper_forward: mean_abs=0.3421793580055237, max_abs=4.0, mean_rel=0.1134328842163086, max_rel=641.7582397460938, norm_rel=0.019895579665899277, ref_abs_avg=17.47018814086914, test_abs_avg=17.473812103271484
production_forward2 vs paper_forward output: mean_abs=0.0015964774647727609, max_abs=0.03515625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008514334447681904, max_abs=0.3505859375, mean_rel=0.07477651536464691, max_rel=186.2578582763672, norm_rel=0.02055584453046322, ref_abs_avg=0.44746002554893494, test_abs_avg=0.44747188687324524
production_forward2 grad[1] vs paper_forward: mean_abs=7.330048561096191, max_abs=64.0, mean_rel=0.22678600251674652, max_rel=997.623046875, norm_rel=0.020641261711716652, ref_abs_avg=312.3734130859375, test_abs_avg=312.3153381347656
production_forward2 grad[2] vs paper_forward: mean_abs=1.2447433471679688, max_abs=5.0, mean_rel=0.09863224625587463, max_rel=12.029818534851074, norm_rel=0.025038782507181168, ref_abs_avg=49.227684020996094, test_abs_avg=49.23200225830078
production_forward2 grad[3] vs paper_forward: mean_abs=1.5434544086456299, max_abs=10.625, mean_rel=0.16811016201972961, max_rel=2902.4921875, norm_rel=0.02442769519984722, ref_abs_avg=63.50519561767578, test_abs_avg=63.50965881347656
production_forward2 grad[4] vs paper_forward: mean_abs=1.510716438293457, max_abs=9.125, mean_rel=0.1635153591632843, max_rel=1221.6536865234375, norm_rel=0.024264123290777206, ref_abs_avg=62.587303161621094, test_abs_avg=62.58612823486328
production_forward2 grad[5] vs paper_forward: mean_abs=1.0902302265167236, max_abs=4.75, mean_rel=0.23243853449821472, max_rel=32.416805267333984, norm_rel=0.024915801361203194, ref_abs_avg=45.97639465332031, test_abs_avg=46.087059020996094
production_forward2 grad[6] vs paper_forward: mean_abs=1.3776613473892212, max_abs=9.0, mean_rel=0.1709991693496704, max_rel=1713.0655517578125, norm_rel=0.02422412484884262, ref_abs_avg=57.16047668457031, test_abs_avg=57.16297912597656
production_forward2 grad[7] vs paper_forward: mean_abs=1.3461171388626099, max_abs=8.0, mean_rel=0.1715887486934662, max_rel=2242.314697265625, norm_rel=0.023992743343114853, ref_abs_avg=56.481849670410156, test_abs_avg=56.48515701293945
production_forward2 grad[8] vs paper_forward: mean_abs=1.0142936706542969, max_abs=4.375, mean_rel=0.09217080473899841, max_rel=4.304642677307129, norm_rel=0.023647749796509743, ref_abs_avg=44.10752487182617, test_abs_avg=44.077667236328125
production_forward2 grad[9] vs paper_forward: mean_abs=1.2542741298675537, max_abs=8.0, mean_rel=0.16707579791545868, max_rel=1625.5662841796875, norm_rel=0.02391594648361206, ref_abs_avg=52.69188690185547, test_abs_avg=52.69478225708008
production_forward2 grad[10] vs paper_forward: mean_abs=1.2254881858825684, max_abs=8.0, mean_rel=0.1708730310201645, max_rel=1745.9942626953125, norm_rel=0.023871691897511482, ref_abs_avg=51.64215850830078, test_abs_avg=51.6463623046875
production_forward2 grad[11] vs paper_forward: mean_abs=0.8982620239257812, max_abs=3.5, mean_rel=0.08831686526536942, max_rel=13.600430488586426, norm_rel=0.02121366374194622, ref_abs_avg=43.20718765258789, test_abs_avg=43.22237777709961
production_forward2 grad[12] vs paper_forward: mean_abs=1.1597316265106201, max_abs=7.5, mean_rel=0.16116313636302948, max_rel=1621.9000244140625, norm_rel=0.023779965937137604, ref_abs_avg=49.020111083984375, test_abs_avg=49.026084899902344
production_forward2 grad[13] vs paper_forward: mean_abs=1.1252251863479614, max_abs=7.0, mean_rel=0.16636836528778076, max_rel=1577.8125, norm_rel=0.023344390094280243, ref_abs_avg=48.41257858276367, test_abs_avg=48.41764450073242
production_forward2 grad[14] vs paper_forward: mean_abs=0.8691015243530273, max_abs=3.25, mean_rel=0.09612829983234406, max_rel=8.200148582458496, norm_rel=0.024269983172416687, ref_abs_avg=36.459747314453125, test_abs_avg=36.460052490234375
production_forward2 grad[15] vs paper_forward: mean_abs=1.0759356021881104, max_abs=6.5, mean_rel=0.15209805965423584, max_rel=960.3114013671875, norm_rel=0.023477405309677124, ref_abs_avg=46.01848602294922, test_abs_avg=46.01826858520508
production_forward2 grad[16] vs paper_forward: mean_abs=1.0460416078567505, max_abs=6.0, mean_rel=0.1677863895893097, max_rel=1549.0906982421875, norm_rel=0.02331703156232834, ref_abs_avg=45.028751373291016, test_abs_avg=45.0414924621582
production_forward2 grad[17] vs paper_forward: mean_abs=0.8090882301330566, max_abs=3.5, mean_rel=0.0793817937374115, max_rel=3.3007147312164307, norm_rel=0.02249273844063282, ref_abs_avg=36.424644470214844, test_abs_avg=36.35987091064453
production_forward2 grad[18] vs paper_forward: mean_abs=1.0122010707855225, max_abs=7.5, mean_rel=0.16497762501239777, max_rel=1885.6485595703125, norm_rel=0.023414146155118942, ref_abs_avg=43.452613830566406, test_abs_avg=43.45601272583008
production_forward2 grad[19] vs paper_forward: mean_abs=0.9846063256263733, max_abs=5.75, mean_rel=0.16609054803848267, max_rel=1357.4991455078125, norm_rel=0.023026056587696075, ref_abs_avg=42.98231887817383, test_abs_avg=42.982852935791016
production_forward2 grad[20] vs paper_forward: mean_abs=0.7769486904144287, max_abs=3.0, mean_rel=0.077436164021492, max_rel=4.098264694213867, norm_rel=0.024729160591959953, ref_abs_avg=32.57734680175781, test_abs_avg=32.563079833984375
production_forward2 grad[21] vs paper_forward: mean_abs=0.9560104608535767, max_abs=6.0, mean_rel=0.1524072289466858, max_rel=1561.173828125, norm_rel=0.023199312388896942, ref_abs_avg=41.40481185913086, test_abs_avg=41.40464782714844
production_forward2 grad[22] vs paper_forward: mean_abs=0.9311396479606628, max_abs=5.75, mean_rel=0.15997213125228882, max_rel=669.9729614257812, norm_rel=0.02293478697538376, ref_abs_avg=40.77063751220703, test_abs_avg=40.77788162231445
production_forward2 grad[23] vs paper_forward: mean_abs=0.7305355072021484, max_abs=2.875, mean_rel=0.08077657967805862, max_rel=4.655635356903076, norm_rel=0.021861160174012184, ref_abs_avg=33.75258255004883, test_abs_avg=33.82075500488281
production_forward2 grad[24] vs paper_forward: mean_abs=0.9029815196990967, max_abs=6.0, mean_rel=0.14807990193367004, max_rel=1057.997314453125, norm_rel=0.023105563595891, ref_abs_avg=39.285423278808594, test_abs_avg=39.28822326660156
production_forward2 grad[25] vs paper_forward: mean_abs=0.8804470300674438, max_abs=6.0, mean_rel=0.1443527489900589, max_rel=782.6781616210938, norm_rel=0.02277623489499092, ref_abs_avg=38.833396911621094, test_abs_avg=38.829917907714844
production_forward2 grad[26] vs paper_forward: mean_abs=0.777191162109375, max_abs=3.25, mean_rel=0.06187066063284874, max_rel=3.0457191467285156, norm_rel=0.022222554311156273, ref_abs_avg=35.83210372924805, test_abs_avg=35.85348892211914
production_forward2 grad[27] vs paper_forward: mean_abs=1.041905403137207, max_abs=7.25, mean_rel=0.17326441407203674, max_rel=1595.4110107421875, norm_rel=0.025151079520583153, ref_abs_avg=41.60702133178711, test_abs_avg=41.612396240234375
production_forward2 grad[28] vs paper_forward: mean_abs=1.0058550834655762, max_abs=6.5, mean_rel=0.17166846990585327, max_rel=699.5382080078125, norm_rel=0.02463541366159916, ref_abs_avg=41.04370880126953, test_abs_avg=41.04656982421875
production_forward2 grad[29] vs paper_forward: mean_abs=0.7213239669799805, max_abs=3.375, mean_rel=0.19631338119506836, max_rel=35.05680847167969, norm_rel=0.023319313302636147, ref_abs_avg=31.924396514892578, test_abs_avg=31.91339683532715
production_forward2 grad[30] vs paper_forward: mean_abs=0.9544259309768677, max_abs=6.5, mean_rel=0.1741088628768921, max_rel=1649.169189453125, norm_rel=0.02535288780927658, ref_abs_avg=37.76667022705078, test_abs_avg=37.77051544189453
production_forward2 grad[31] vs paper_forward: mean_abs=0.9388411045074463, max_abs=5.984375, mean_rel=0.18199194967746735, max_rel=1302.9442138671875, norm_rel=0.025053244084119797, ref_abs_avg=37.63287353515625, test_abs_avg=37.62742614746094
production_forward2 grad[32] vs paper_forward: mean_abs=0.7678017616271973, max_abs=2.8125, mean_rel=0.33492594957351685, max_rel=107.33631134033203, norm_rel=0.025938736274838448, ref_abs_avg=29.362869262695312, test_abs_avg=29.37984275817871
production_forward2 grad[33] vs paper_forward: mean_abs=0.8868009448051453, max_abs=5.75, mean_rel=0.16232837736606598, max_rel=859.3128662109375, norm_rel=0.02510954812169075, ref_abs_avg=35.41510772705078, test_abs_avg=35.41748046875
production_forward2 grad[34] vs paper_forward: mean_abs=0.8766968250274658, max_abs=5.5, mean_rel=0.17255862057209015, max_rel=947.1445922851562, norm_rel=0.025068184360861778, ref_abs_avg=35.081031799316406, test_abs_avg=35.0875358581543
production_forward2 grad[35] vs paper_forward: mean_abs=0.6579856872558594, max_abs=2.5, mean_rel=0.1125054806470871, max_rel=11.668307304382324, norm_rel=0.02373422123491764, ref_abs_avg=27.733989715576172, test_abs_avg=27.753780364990234
production_forward2 grad[36] vs paper_forward: mean_abs=0.8300467729568481, max_abs=5.5, mean_rel=0.16402031481266022, max_rel=1332.515869140625, norm_rel=0.02503157965838909, ref_abs_avg=33.268226623535156, test_abs_avg=33.26982879638672
production_forward2 grad[37] vs paper_forward: mean_abs=0.8143125176429749, max_abs=5.0, mean_rel=0.1657983511686325, max_rel=871.2003173828125, norm_rel=0.024671753868460655, ref_abs_avg=33.08883285522461, test_abs_avg=33.08580017089844
production_forward2 grad[38] vs paper_forward: mean_abs=0.6527886390686035, max_abs=2.96875, mean_rel=0.09728903323411942, max_rel=7.054176330566406, norm_rel=0.025252554565668106, ref_abs_avg=25.37755584716797, test_abs_avg=25.379989624023438
production_forward2 grad[39] vs paper_forward: mean_abs=0.7862949371337891, max_abs=5.34375, mean_rel=0.17145103216171265, max_rel=1594.1549072265625, norm_rel=0.024791304022073746, ref_abs_avg=31.815505981445312, test_abs_avg=31.817256927490234
production_forward2 grad[40] vs paper_forward: mean_abs=0.7765393257141113, max_abs=4.5, mean_rel=0.17053338885307312, max_rel=1903.789794921875, norm_rel=0.024459287524223328, ref_abs_avg=31.876300811767578, test_abs_avg=31.872936248779297
production_forward2 grad[41] vs paper_forward: mean_abs=0.5971394777297974, max_abs=2.375, mean_rel=0.2161770761013031, max_rel=65.0419692993164, norm_rel=0.02327662706375122, ref_abs_avg=26.364852905273438, test_abs_avg=26.36315155029297
production_forward2 grad[42] vs paper_forward: mean_abs=0.7501049041748047, max_abs=5.0, mean_rel=0.15792730450630188, max_rel=1135.885986328125, norm_rel=0.024428317323327065, ref_abs_avg=30.776161193847656, test_abs_avg=30.778079986572266
production_forward2 grad[43] vs paper_forward: mean_abs=0.7340559959411621, max_abs=5.0, mean_rel=0.16360627114772797, max_rel=776.1513671875, norm_rel=0.024222146719694138, ref_abs_avg=30.369983673095703, test_abs_avg=30.373844146728516
production_forward2 grad[44] vs paper_forward: mean_abs=0.5656471252441406, max_abs=2.4375, mean_rel=0.11984594166278839, max_rel=13.015902519226074, norm_rel=0.023148056119680405, ref_abs_avg=24.980836868286133, test_abs_avg=25.01266098022461
production_forward2 grad[45] vs paper_forward: mean_abs=0.7111434936523438, max_abs=4.328125, mean_rel=0.15991829335689545, max_rel=688.9244995117188, norm_rel=0.024191319942474365, ref_abs_avg=29.467309951782227, test_abs_avg=29.469921112060547
production_forward2 grad[46] vs paper_forward: mean_abs=0.6942193508148193, max_abs=4.5, mean_rel=0.1619430035352707, max_rel=2180.460205078125, norm_rel=0.023977400735020638, ref_abs_avg=29.0704345703125, test_abs_avg=29.070236206054688
production_forward2 grad[47] vs paper_forward: mean_abs=0.5553305745124817, max_abs=1.9375, mean_rel=0.08582787215709686, max_rel=6.189465522766113, norm_rel=0.02448268048465252, ref_abs_avg=22.78373908996582, test_abs_avg=22.722036361694336
production_forward2 grad[48] vs paper_forward: mean_abs=0.6790128946304321, max_abs=4.5, mean_rel=0.15712665021419525, max_rel=1534.0408935546875, norm_rel=0.02380787767469883, ref_abs_avg=28.563251495361328, test_abs_avg=28.56357192993164
production_forward2 grad[49] vs paper_forward: mean_abs=0.6652892827987671, max_abs=4.203125, mean_rel=0.16742803156375885, max_rel=1227.389892578125, norm_rel=0.02349437214434147, ref_abs_avg=28.328466415405273, test_abs_avg=28.327789306640625
production_forward2 grad[50] vs paper_forward: mean_abs=0.588133692741394, max_abs=2.75, mean_rel=0.08225458860397339, max_rel=2.457138776779175, norm_rel=0.02359681949019432, ref_abs_avg=25.09334945678711, test_abs_avg=25.12154769897461
production_forward2 grad[51] vs paper_forward: mean_abs=0.7696739435195923, max_abs=5.5, mean_rel=0.16383349895477295, max_rel=1137.5645751953125, norm_rel=0.025565125048160553, ref_abs_avg=30.211429595947266, test_abs_avg=30.213279724121094
production_forward2 grad[52] vs paper_forward: mean_abs=0.7537317276000977, max_abs=5.125, mean_rel=0.16320177912712097, max_rel=1507.39111328125, norm_rel=0.025434201583266258, ref_abs_avg=29.697174072265625, test_abs_avg=29.69123649597168
production_forward2 grad[53] vs paper_forward: mean_abs=0.578055739402771, max_abs=2.8125, mean_rel=0.3419286608695984, max_rel=132.5880584716797, norm_rel=0.024917738512158394, ref_abs_avg=23.28278350830078, test_abs_avg=23.325759887695312
production_forward2 grad[54] vs paper_forward: mean_abs=0.7058029770851135, max_abs=6.0, mean_rel=0.1638663411140442, max_rel=1264.52978515625, norm_rel=0.02499835379421711, ref_abs_avg=28.260740280151367, test_abs_avg=28.26211929321289
production_forward2 grad[55] vs paper_forward: mean_abs=0.6931471228599548, max_abs=4.75, mean_rel=0.18115770816802979, max_rel=1807.0888671875, norm_rel=0.025106046348810196, ref_abs_avg=27.7144775390625, test_abs_avg=27.715755462646484
production_forward2 grad[56] vs paper_forward: mean_abs=0.49992257356643677, max_abs=2.0, mean_rel=0.08638834953308105, max_rel=4.321964740753174, norm_rel=0.023330703377723694, ref_abs_avg=21.41238021850586, test_abs_avg=21.403717041015625
production_forward2 grad[57] vs paper_forward: mean_abs=0.6539793610572815, max_abs=4.5, mean_rel=0.16961269080638885, max_rel=1139.302490234375, norm_rel=0.02434059977531433, ref_abs_avg=26.891101837158203, test_abs_avg=26.893733978271484
production_forward2 grad[58] vs paper_forward: mean_abs=0.6433296203613281, max_abs=4.25, mean_rel=0.15627789497375488, max_rel=1129.953125, norm_rel=0.024688106030225754, ref_abs_avg=26.129535675048828, test_abs_avg=26.134296417236328
production_forward2 grad[59] vs paper_forward: mean_abs=0.4783592224121094, max_abs=1.75, mean_rel=0.067782923579216, max_rel=1.6414750814437866, norm_rel=0.023123962804675102, ref_abs_avg=20.643327713012695, test_abs_avg=20.62727165222168
production_forward2 grad[60] vs paper_forward: mean_abs=0.6088365912437439, max_abs=4.5, mean_rel=0.16284379363059998, max_rel=721.9495239257812, norm_rel=0.024147341027855873, ref_abs_avg=25.27296257019043, test_abs_avg=25.272815704345703
production_forward2 grad[61] vs paper_forward: mean_abs=0.6010763645172119, max_abs=4.09375, mean_rel=0.1480087786912918, max_rel=847.7954711914062, norm_rel=0.023737363517284393, ref_abs_avg=25.307159423828125, test_abs_avg=25.30084228515625
production_forward2 grad[62] vs paper_forward: mean_abs=0.4632112979888916, max_abs=1.625, mean_rel=0.0917549878358841, max_rel=4.9871087074279785, norm_rel=0.02249763160943985, ref_abs_avg=20.594602584838867, test_abs_avg=20.577688217163086
production_forward2 grad[63] vs paper_forward: mean_abs=0.5762015581130981, max_abs=4.0, mean_rel=0.1523621380329132, max_rel=1301.2667236328125, norm_rel=0.023707862943410873, ref_abs_avg=24.291881561279297, test_abs_avg=24.293689727783203
production_forward2 grad[64] vs paper_forward: mean_abs=0.5655860304832458, max_abs=4.25, mean_rel=0.14494043588638306, max_rel=730.6773681640625, norm_rel=0.023349303752183914, ref_abs_avg=24.226627349853516, test_abs_avg=24.22690200805664
production_forward2 grad[65] vs paper_forward: mean_abs=0.43220579624176025, max_abs=1.625, mean_rel=0.08459053933620453, max_rel=4.574893474578857, norm_rel=0.02180466242134571, ref_abs_avg=19.40099334716797, test_abs_avg=19.40892219543457
production_forward2 grad[66] vs paper_forward: mean_abs=0.5504754781723022, max_abs=4.75, mean_rel=0.15014293789863586, max_rel=923.8955078125, norm_rel=0.023064684122800827, ref_abs_avg=23.822044372558594, test_abs_avg=23.822486877441406
production_forward2 grad[67] vs paper_forward: mean_abs=0.5398831963539124, max_abs=4.25, mean_rel=0.15485429763793945, max_rel=929.7656860351562, norm_rel=0.023166872560977936, ref_abs_avg=23.27069091796875, test_abs_avg=23.27817153930664
production_forward2 grad[68] vs paper_forward: mean_abs=0.4111332297325134, max_abs=1.5625, mean_rel=0.5497846603393555, max_rel=248.06961059570312, norm_rel=0.020620523020625114, ref_abs_avg=20.089399337768555, test_abs_avg=20.07487678527832
production_forward2 grad[69] vs paper_forward: mean_abs=0.5227987170219421, max_abs=4.046875, mean_rel=0.14598841965198517, max_rel=572.8463134765625, norm_rel=0.022761810570955276, ref_abs_avg=22.954875946044922, test_abs_avg=22.95611572265625
production_forward2 grad[70] vs paper_forward: mean_abs=0.5142853260040283, max_abs=4.5, mean_rel=0.15544429421424866, max_rel=885.967529296875, norm_rel=0.023012010380625725, ref_abs_avg=22.37519073486328, test_abs_avg=22.377197265625
production_forward2 grad[71] vs paper_forward: mean_abs=0.4050239324569702, max_abs=1.75, mean_rel=0.3085031509399414, max_rel=113.12423706054688, norm_rel=0.02201738767325878, ref_abs_avg=18.701766967773438, test_abs_avg=18.685054779052734
production_forward2 grad[72] vs paper_forward: mean_abs=0.4926142394542694, max_abs=3.75, mean_rel=0.14235082268714905, max_rel=751.162109375, norm_rel=0.022543443366885185, ref_abs_avg=21.903289794921875, test_abs_avg=21.904748916625977
production_forward2 grad[73] vs paper_forward: mean_abs=0.48795637488365173, max_abs=4.0, mean_rel=0.13383543491363525, max_rel=751.28955078125, norm_rel=0.022211244329810143, ref_abs_avg=21.928569793701172, test_abs_avg=21.935653686523438
production_forward2 grad[74] vs paper_forward: mean_abs=0.4299839437007904, max_abs=1.75, mean_rel=0.5765526294708252, max_rel=255.96835327148438, norm_rel=0.020802030339837074, ref_abs_avg=20.750747680664062, test_abs_avg=20.743595123291016
production_forward2 grad[75] vs paper_forward: mean_abs=0.5615090131759644, max_abs=5.0, mean_rel=0.16218255460262299, max_rel=1196.077392578125, norm_rel=0.023824116215109825, ref_abs_avg=23.58047103881836, test_abs_avg=23.581707000732422
production_forward2 grad[76] vs paper_forward: mean_abs=0.5503392815589905, max_abs=4.1875, mean_rel=0.1576509177684784, max_rel=583.4078979492188, norm_rel=0.02375032939016819, ref_abs_avg=23.212100982666016, test_abs_avg=23.214580535888672
production_forward2 grad[77] vs paper_forward: mean_abs=0.4345855712890625, max_abs=1.75, mean_rel=0.08994271606206894, max_rel=4.470122337341309, norm_rel=0.022651338949799538, ref_abs_avg=19.47956085205078, test_abs_avg=19.449005126953125
production_forward2 grad[78] vs paper_forward: mean_abs=0.5127205848693848, max_abs=4.0, mean_rel=0.1526976227760315, max_rel=1318.93603515625, norm_rel=0.023347511887550354, ref_abs_avg=22.000978469848633, test_abs_avg=22.00057601928711
production_forward2 grad[79] vs paper_forward: mean_abs=0.5036699175834656, max_abs=4.09375, mean_rel=0.15090569853782654, max_rel=697.5897216796875, norm_rel=0.022984998300671577, ref_abs_avg=21.973033905029297, test_abs_avg=21.96845054626465
production_forward2 grad[80] vs paper_forward: mean_abs=0.37311744689941406, max_abs=1.4375, mean_rel=0.06794770061969757, max_rel=5.512760162353516, norm_rel=0.020067548379302025, ref_abs_avg=18.79967498779297, test_abs_avg=18.79555320739746
production_forward2 grad[81] vs paper_forward: mean_abs=0.47408971190452576, max_abs=3.75, mean_rel=0.1377672553062439, max_rel=1132.6712646484375, norm_rel=0.022451534867286682, ref_abs_avg=21.123943328857422, test_abs_avg=21.123878479003906
production_forward2 grad[82] vs paper_forward: mean_abs=0.4614517390727997, max_abs=4.5, mean_rel=0.1444728672504425, max_rel=533.611328125, norm_rel=0.02218095026910305, ref_abs_avg=20.875926971435547, test_abs_avg=20.874706268310547
production_forward2 grad[83] vs paper_forward: mean_abs=0.3567521572113037, max_abs=1.490234375, mean_rel=0.07764419913291931, max_rel=5.871340751647949, norm_rel=0.020140333101153374, ref_abs_avg=17.784404754638672, test_abs_avg=17.77912139892578
production_forward2 grad[84] vs paper_forward: mean_abs=0.4410199820995331, max_abs=4.0, mean_rel=0.13985179364681244, max_rel=680.1952514648438, norm_rel=0.021779701113700867, ref_abs_avg=20.284955978393555, test_abs_avg=20.285179138183594
production_forward2 grad[85] vs paper_forward: mean_abs=0.41681385040283203, max_abs=4.0, mean_rel=0.14067193865776062, max_rel=721.8495483398438, norm_rel=0.021311501041054726, ref_abs_avg=19.594491958618164, test_abs_avg=19.592193603515625
production_forward2 grad[86] vs paper_forward: mean_abs=0.3256392478942871, max_abs=1.1875, mean_rel=0.12323751300573349, max_rel=21.60665512084961, norm_rel=0.01960296742618084, ref_abs_avg=16.291658401489258, test_abs_avg=16.291109085083008
production_forward2 grad[87] vs paper_forward: mean_abs=0.40831995010375977, max_abs=5.5, mean_rel=0.1361648142337799, max_rel=891.1085815429688, norm_rel=0.021288014948368073, ref_abs_avg=19.26221466064453, test_abs_avg=19.262401580810547
production_forward2 grad[88] vs paper_forward: mean_abs=0.40188735723495483, max_abs=5.5, mean_rel=0.1272047460079193, max_rel=387.7757568359375, norm_rel=0.020976819097995758, ref_abs_avg=19.286663055419922, test_abs_avg=19.28830909729004
production_forward2 grad[89] vs paper_forward: mean_abs=0.31709975004196167, max_abs=1.375, mean_rel=0.29372262954711914, max_rel=99.71609497070312, norm_rel=0.019845521077513695, ref_abs_avg=15.836316108703613, test_abs_avg=15.843574523925781
production_forward2 grad[90] vs paper_forward: mean_abs=0.3851071894168854, max_abs=4.25, mean_rel=0.13163262605667114, max_rel=576.3731689453125, norm_rel=0.02076021395623684, ref_abs_avg=18.67788314819336, test_abs_avg=18.678531646728516
production_forward2 grad[91] vs paper_forward: mean_abs=0.3790318965911865, max_abs=4.25, mean_rel=0.12450678646564484, max_rel=392.69677734375, norm_rel=0.020285900682210922, ref_abs_avg=18.82501983642578, test_abs_avg=18.831684112548828
production_forward2 grad[92] vs paper_forward: mean_abs=0.3081197738647461, max_abs=1.25, mean_rel=0.09751217067241669, max_rel=14.735115051269531, norm_rel=0.02033548429608345, ref_abs_avg=15.251312255859375, test_abs_avg=15.221834182739258
production_forward2 grad[93] vs paper_forward: mean_abs=0.36302274465560913, max_abs=3.5, mean_rel=0.12718281149864197, max_rel=1168.57080078125, norm_rel=0.020190054550766945, ref_abs_avg=18.188758850097656, test_abs_avg=18.189773559570312
production_forward2 grad[94] vs paper_forward: mean_abs=0.37072473764419556, max_abs=4.5, mean_rel=0.13599756360054016, max_rel=632.947998046875, norm_rel=0.020458048209547997, ref_abs_avg=18.355924606323242, test_abs_avg=18.364452362060547
production_forward2 grad[95] vs paper_forward: mean_abs=0.2832142114639282, max_abs=1.1875, mean_rel=0.05702280253171921, max_rel=1.43226957321167, norm_rel=0.018434729427099228, ref_abs_avg=15.45756721496582, test_abs_avg=15.445026397705078
production_forward2 grad[96] vs paper_forward: mean_abs=0.3478938341140747, max_abs=4.25, mean_rel=0.12038305401802063, max_rel=410.9007568359375, norm_rel=0.019901029765605927, ref_abs_avg=17.77570343017578, test_abs_avg=17.775325775146484
production_forward2 grad[97] vs paper_forward: mean_abs=0.34221169352531433, max_abs=4.0, mean_rel=0.1134493499994278, max_rel=641.7582397460938, norm_rel=0.01989717409014702, ref_abs_avg=17.47018814086914, test_abs_avg=17.473819732666016
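The per-tensor statistics printed above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) could be computed with a small helper like the sketch below. This is an illustrative reconstruction, not the actual harness code; the function name and the `eps` guard against division by zero are assumptions.

```python
import torch

def grad_diff_stats(ref: torch.Tensor, test: torch.Tensor, eps: float = 1e-12):
    """Compare a test gradient against a reference gradient (hypothetical helper)."""
    ref, test = ref.float(), test.float()
    diff = (test - ref).abs()
    rel = diff / (ref.abs() + eps)  # elementwise relative error; huge near ref ~ 0
    return {
        "mean_abs": diff.mean().item(),          # average absolute error
        "max_abs": diff.max().item(),            # worst-case absolute error
        "mean_rel": rel.mean().item(),           # average relative error
        "max_rel": rel.max().item(),             # worst-case relative error
        "norm_rel": (diff.norm() / (ref.norm() + eps)).item(),  # global relative error
        "ref_abs_avg": ref.abs().mean().item(),  # scale of the reference gradient
        "test_abs_avg": test.abs().mean().item(),
    }
```

Note that max_rel spikes into the hundreds whenever a reference element is near zero, which is why norm_rel (error norm over reference norm, consistently ~0.02 in the log) is the more meaningful aggregate for low-precision comparisons.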
identity layers + randn queries
production_forward2 fwd+bwd:  224.373 ms
production_forward2 bwd-only: 202.212 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.242 GiB, fwd+bwd=8.992 GiB
production_forward fwd+bwd:  112.053 ms
production_forward bwd-only: 91.676 ms
production_forward peak allocated: fwd=2.364 GiB, fwd+bwd=6.243 GiB
production_forward peak reserved:  fwd=2.492 GiB, fwd+bwd=6.367 GiB
paper_forward fwd+bwd:  379.488 ms
paper_forward bwd-only: 293.939 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB
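The fwd+bwd timings and peak-memory figures above could be collected with a loop like the following sketch. It is a minimal, device-agnostic reconstruction under assumed names (`bench_fwd_bwd`, `iters`); the real harness presumably also measures bwd-only time separately and reads both `max_memory_allocated` and `max_memory_reserved` after the forward pass alone versus after forward+backward.

```python
import time
import torch

def bench_fwd_bwd(fn, *inputs, iters: int = 10):
    """Average forward+backward wall time; on CUDA, also peak memory in GiB."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
    # warm-up iteration so torch.compile tracing/autotuning is excluded
    fn(*inputs).sum().backward()
    if use_cuda:
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*inputs).sum().backward()
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    stats = {"fwd+bwd ms": (time.perf_counter() - t0) * 1000 / iters}
    if use_cuda:
        gib = 1024 ** 3
        stats["peak allocated GiB"] = torch.cuda.max_memory_allocated() / gib
        stats["peak reserved GiB"] = torch.cuda.max_memory_reserved() / gib
    return stats
```

The allocated-vs-reserved gap in the log (e.g. 6.243 vs 8.992 GiB for production_forward2) reflects the caching allocator holding freed blocks; only the allocated number tracks live tensors.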

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016229460015892982, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008588670752942562, max_abs=0.328125, mean_rel=0.07432816922664642, max_rel=108.2647705078125, norm_rel=0.020272977650165558, ref_abs_avg=0.4566798806190491, test_abs_avg=0.45668768882751465
production_forward grad[1] vs paper_forward: mean_abs=7.383102893829346, max_abs=64.0, mean_rel=0.13207845389842987, max_rel=122.37740325927734, norm_rel=0.020428672432899475, ref_abs_avg=319.4510498046875, test_abs_avg=319.4057312011719
production_forward grad[2] vs paper_forward: mean_abs=1.2626590728759766, max_abs=5.3125, mean_rel=0.14472167193889618, max_rel=16.72856330871582, norm_rel=0.023821035400032997, ref_abs_avg=54.36587142944336, test_abs_avg=54.33789825439453
production_forward grad[3] vs paper_forward: mean_abs=1.602068543434143, max_abs=10.0, mean_rel=0.17349061369895935, max_rel=2213.934814453125, norm_rel=0.02419586479663849, ref_abs_avg=66.59817504882812, test_abs_avg=66.60411834716797
production_forward grad[4] vs paper_forward: mean_abs=1.5465039014816284, max_abs=12.0, mean_rel=0.16415423154830933, max_rel=1134.520263671875, norm_rel=0.02384529635310173, ref_abs_avg=65.29682922363281, test_abs_avg=65.30427551269531
production_forward grad[5] vs paper_forward: mean_abs=1.0957517623901367, max_abs=4.25, mean_rel=0.08988241851329803, max_rel=8.805488586425781, norm_rel=0.02297719195485115, ref_abs_avg=48.6696891784668, test_abs_avg=48.67201232910156
production_forward grad[6] vs paper_forward: mean_abs=1.3755866289138794, max_abs=9.9375, mean_rel=0.16663840413093567, max_rel=2815.919921875, norm_rel=0.023801444098353386, ref_abs_avg=58.14122009277344, test_abs_avg=58.144264221191406
production_forward grad[7] vs paper_forward: mean_abs=1.3357956409454346, max_abs=8.75, mean_rel=0.1656607985496521, max_rel=1466.7449951171875, norm_rel=0.02318454161286354, ref_abs_avg=57.94084548950195, test_abs_avg=57.946998596191406
production_forward grad[8] vs paper_forward: mean_abs=0.9966409206390381, max_abs=3.875, mean_rel=0.43780380487442017, max_rel=137.4667205810547, norm_rel=0.023005977272987366, ref_abs_avg=42.002174377441406, test_abs_avg=42.040809631347656
production_forward grad[9] vs paper_forward: mean_abs=1.2326469421386719, max_abs=8.0, mean_rel=0.16248399019241333, max_rel=1759.6732177734375, norm_rel=0.0235259048640728, ref_abs_avg=52.68552780151367, test_abs_avg=52.68986511230469
production_forward grad[10] vs paper_forward: mean_abs=1.2063137292861938, max_abs=9.0, mean_rel=0.1515202820301056, max_rel=1401.1131591796875, norm_rel=0.02326403371989727, ref_abs_avg=52.179298400878906, test_abs_avg=52.17951202392578
production_forward grad[11] vs paper_forward: mean_abs=0.9166951179504395, max_abs=4.0, mean_rel=0.0839623361825943, max_rel=6.782078266143799, norm_rel=0.023440521210432053, ref_abs_avg=40.86144256591797, test_abs_avg=40.891197204589844
production_forward grad[12] vs paper_forward: mean_abs=1.1489444971084595, max_abs=8.0, mean_rel=0.16370919346809387, max_rel=1156.6710205078125, norm_rel=0.023429961875081062, ref_abs_avg=49.3016357421875, test_abs_avg=49.29976272583008
production_forward grad[13] vs paper_forward: mean_abs=1.1153786182403564, max_abs=7.5, mean_rel=0.1677720993757248, max_rel=1899.94287109375, norm_rel=0.02311174012720585, ref_abs_avg=48.4852294921875, test_abs_avg=48.487159729003906
production_forward grad[14] vs paper_forward: mean_abs=0.873403787612915, max_abs=3.25, mean_rel=0.07768803089857101, max_rel=5.251276016235352, norm_rel=0.023675814270973206, ref_abs_avg=37.395301818847656, test_abs_avg=37.42988586425781
production_forward grad[15] vs paper_forward: mean_abs=1.0671768188476562, max_abs=7.5, mean_rel=0.15173110365867615, max_rel=1147.872802734375, norm_rel=0.023106548935174942, ref_abs_avg=46.427947998046875, test_abs_avg=46.432559967041016
production_forward grad[16] vs paper_forward: mean_abs=1.0390493869781494, max_abs=7.0, mean_rel=0.1439964771270752, max_rel=823.3201904296875, norm_rel=0.022708464413881302, ref_abs_avg=45.968746185302734, test_abs_avg=45.970298767089844
production_forward grad[17] vs paper_forward: mean_abs=0.8488368988037109, max_abs=3.0, mean_rel=0.09040920436382294, max_rel=5.605448246002197, norm_rel=0.023278964683413506, ref_abs_avg=35.65180587768555, test_abs_avg=35.630882263183594
production_forward grad[18] vs paper_forward: mean_abs=1.006636619567871, max_abs=6.5, mean_rel=0.1479821801185608, max_rel=1253.97021484375, norm_rel=0.022973133251070976, ref_abs_avg=44.050655364990234, test_abs_avg=44.05126953125
production_forward grad[19] vs paper_forward: mean_abs=0.9773061275482178, max_abs=6.0, mean_rel=0.16111992299556732, max_rel=1072.2144775390625, norm_rel=0.022696923464536667, ref_abs_avg=43.32176971435547, test_abs_avg=43.32956314086914
production_forward grad[20] vs paper_forward: mean_abs=0.7468090057373047, max_abs=3.0625, mean_rel=0.07037077844142914, max_rel=4.191877841949463, norm_rel=0.021359199658036232, ref_abs_avg=34.579978942871094, test_abs_avg=34.63010787963867
production_forward grad[21] vs paper_forward: mean_abs=0.9467519521713257, max_abs=6.0, mean_rel=0.1535046100616455, max_rel=1733.1182861328125, norm_rel=0.022765470668673515, ref_abs_avg=41.78212356567383, test_abs_avg=41.78398895263672
production_forward grad[22] vs paper_forward: mean_abs=0.9287713766098022, max_abs=6.0, mean_rel=0.14367830753326416, max_rel=781.6016235351562, norm_rel=0.022473998367786407, ref_abs_avg=41.53359603881836, test_abs_avg=41.53932571411133
production_forward grad[23] vs paper_forward: mean_abs=0.6966918706893921, max_abs=2.875, mean_rel=0.20003479719161987, max_rel=52.71698760986328, norm_rel=0.02075217105448246, ref_abs_avg=33.946075439453125, test_abs_avg=33.979461669921875
production_forward grad[24] vs paper_forward: mean_abs=0.9076492190361023, max_abs=5.75, mean_rel=0.15441767871379852, max_rel=2230.122802734375, norm_rel=0.02261159010231495, ref_abs_avg=40.28291320800781, test_abs_avg=40.28326416015625
production_forward grad[25] vs paper_forward: mean_abs=0.8876711130142212, max_abs=5.75, mean_rel=0.1535990983247757, max_rel=1490.745361328125, norm_rel=0.022459233179688454, ref_abs_avg=39.77630615234375, test_abs_avg=39.780460357666016
production_forward grad[26] vs paper_forward: mean_abs=0.8971099853515625, max_abs=3.75, mean_rel=0.13239723443984985, max_rel=15.125089645385742, norm_rel=0.026151595637202263, ref_abs_avg=33.9775390625, test_abs_avg=34.023780822753906
production_forward grad[27] vs paper_forward: mean_abs=1.0470013618469238, max_abs=7.25, mean_rel=0.16802389919757843, max_rel=1353.9833984375, norm_rel=0.024698542430996895, ref_abs_avg=42.52191162109375, test_abs_avg=42.523475646972656
production_forward grad[28] vs paper_forward: mean_abs=1.0233893394470215, max_abs=7.0, mean_rel=0.16218185424804688, max_rel=842.8072509765625, norm_rel=0.02449920028448105, ref_abs_avg=41.96208190917969, test_abs_avg=41.95814514160156
production_forward grad[29] vs paper_forward: mean_abs=0.8072891235351562, max_abs=3.25, mean_rel=0.08315648138523102, max_rel=3.5014395713806152, norm_rel=0.026646941900253296, ref_abs_avg=30.99641227722168, test_abs_avg=30.967348098754883
production_forward grad[30] vs paper_forward: mean_abs=0.9704281091690063, max_abs=6.0, mean_rel=0.17330385744571686, max_rel=1614.812744140625, norm_rel=0.025055093690752983, ref_abs_avg=38.87725830078125, test_abs_avg=38.87665557861328
production_forward grad[31] vs paper_forward: mean_abs=0.953650951385498, max_abs=6.0, mean_rel=0.17166590690612793, max_rel=1161.87353515625, norm_rel=0.024971984326839447, ref_abs_avg=38.38410568237305, test_abs_avg=38.38580322265625
production_forward grad[32] vs paper_forward: mean_abs=0.7877926826477051, max_abs=3.1640625, mean_rel=0.11493880301713943, max_rel=6.113072872161865, norm_rel=0.02487904764711857, ref_abs_avg=31.44898796081543, test_abs_avg=31.464263916015625
production_forward grad[33] vs paper_forward: mean_abs=0.9101883769035339, max_abs=5.875, mean_rel=0.17302155494689941, max_rel=1589.7962646484375, norm_rel=0.024936294183135033, ref_abs_avg=36.628997802734375, test_abs_avg=36.630043029785156
production_forward grad[34] vs paper_forward: mean_abs=0.8912882804870605, max_abs=5.5, mean_rel=0.1544608771800995, max_rel=737.4617919921875, norm_rel=0.02448488026857376, ref_abs_avg=36.524658203125, test_abs_avg=36.52427673339844
production_forward grad[35] vs paper_forward: mean_abs=0.6330118179321289, max_abs=2.6640625, mean_rel=0.14067357778549194, max_rel=28.0155086517334, norm_rel=0.02360915206372738, ref_abs_avg=27.16545295715332, test_abs_avg=27.118215560913086
production_forward grad[36] vs paper_forward: mean_abs=0.83951735496521, max_abs=5.625, mean_rel=0.17752814292907715, max_rel=1233.7391357421875, norm_rel=0.024634383618831635, ref_abs_avg=34.17245864868164, test_abs_avg=34.171112060546875
production_forward grad[37] vs paper_forward: mean_abs=0.8338083624839783, max_abs=5.75, mean_rel=0.1581241339445114, max_rel=1030.7381591796875, norm_rel=0.024596519768238068, ref_abs_avg=33.99601745605469, test_abs_avg=33.98944854736328
production_forward grad[38] vs paper_forward: mean_abs=0.6716231107711792, max_abs=2.28125, mean_rel=0.19575083255767822, max_rel=61.16152572631836, norm_rel=0.025801563635468483, ref_abs_avg=25.728870391845703, test_abs_avg=25.674480438232422
production_forward grad[39] vs paper_forward: mean_abs=0.7972327470779419, max_abs=5.25, mean_rel=0.16594162583351135, max_rel=1196.749267578125, norm_rel=0.02446429245173931, ref_abs_avg=32.69658660888672, test_abs_avg=32.697425842285156
production_forward grad[40] vs paper_forward: mean_abs=0.7838715314865112, max_abs=6.0, mean_rel=0.16665498912334442, max_rel=687.55859375, norm_rel=0.024328084662556648, ref_abs_avg=32.32502746582031, test_abs_avg=32.32444763183594
production_forward grad[41] vs paper_forward: mean_abs=0.5639152526855469, max_abs=2.3125, mean_rel=0.06898651272058487, max_rel=3.4509201049804688, norm_rel=0.0204786229878664, ref_abs_avg=27.470170974731445, test_abs_avg=27.525239944458008
production_forward grad[42] vs paper_forward: mean_abs=0.7513373494148254, max_abs=4.7109375, mean_rel=0.16778121888637543, max_rel=1570.4696044921875, norm_rel=0.024014310911297798, ref_abs_avg=31.3145809173584, test_abs_avg=31.314701080322266
production_forward grad[43] vs paper_forward: mean_abs=0.7428562641143799, max_abs=4.5625, mean_rel=0.16163033246994019, max_rel=855.4825439453125, norm_rel=0.02384115383028984, ref_abs_avg=31.24480438232422, test_abs_avg=31.247760772705078
production_forward grad[44] vs paper_forward: mean_abs=0.5692517757415771, max_abs=2.25, mean_rel=0.14779800176620483, max_rel=9.51554012298584, norm_rel=0.022833893075585365, ref_abs_avg=25.11896514892578, test_abs_avg=25.116174697875977
production_forward grad[45] vs paper_forward: mean_abs=0.7237149477005005, max_abs=5.0, mean_rel=0.16451358795166016, max_rel=1728.431396484375, norm_rel=0.023831471800804138, ref_abs_avg=30.437231063842773, test_abs_avg=30.438228607177734
production_forward grad[46] vs paper_forward: mean_abs=0.7075284719467163, max_abs=4.875, mean_rel=0.14444442093372345, max_rel=682.2599487304688, norm_rel=0.02377256192266941, ref_abs_avg=29.863391876220703, test_abs_avg=29.86319351196289
production_forward grad[47] vs paper_forward: mean_abs=0.5659719705581665, max_abs=2.25, mean_rel=0.11849477887153625, max_rel=18.000614166259766, norm_rel=0.02302774041891098, ref_abs_avg=24.627105712890625, test_abs_avg=24.657878875732422
production_forward grad[48] vs paper_forward: mean_abs=0.6914016604423523, max_abs=4.875, mean_rel=0.15505272150039673, max_rel=924.4122314453125, norm_rel=0.02353942207992077, ref_abs_avg=29.411209106445312, test_abs_avg=29.411914825439453
production_forward grad[49] vs paper_forward: mean_abs=0.6773900985717773, max_abs=5.0, mean_rel=0.14266395568847656, max_rel=873.4143676757812, norm_rel=0.02335703745484352, ref_abs_avg=29.06414031982422, test_abs_avg=29.063880920410156
production_forward grad[50] vs paper_forward: mean_abs=0.6285400390625, max_abs=2.625, mean_rel=0.1457928717136383, max_rel=11.701452255249023, norm_rel=0.024986891075968742, ref_abs_avg=25.834129333496094, test_abs_avg=25.899280548095703
production_forward grad[51] vs paper_forward: mean_abs=0.7813936471939087, max_abs=5.0, mean_rel=0.1711936593055725, max_rel=1285.2978515625, norm_rel=0.02521328441798687, ref_abs_avg=31.070266723632812, test_abs_avg=31.069198608398438
production_forward grad[52] vs paper_forward: mean_abs=0.764757513999939, max_abs=5.0, mean_rel=0.16301599144935608, max_rel=1353.5894775390625, norm_rel=0.025127530097961426, ref_abs_avg=30.57146644592285, test_abs_avg=30.56915283203125
production_forward grad[53] vs paper_forward: mean_abs=0.5742216110229492, max_abs=2.25, mean_rel=0.10029160231351852, max_rel=10.09273910522461, norm_rel=0.024047976359725, ref_abs_avg=23.901485443115234, test_abs_avg=23.847890853881836
production_forward grad[54] vs paper_forward: mean_abs=0.7217593193054199, max_abs=5.0, mean_rel=0.17339465022087097, max_rel=1168.472900390625, norm_rel=0.024815095588564873, ref_abs_avg=29.155595779418945, test_abs_avg=29.156030654907227
production_forward grad[55] vs paper_forward: mean_abs=0.7041159868240356, max_abs=5.5, mean_rel=0.15643389523029327, max_rel=622.9767456054688, norm_rel=0.024867402389645576, ref_abs_avg=28.39717674255371, test_abs_avg=28.39178466796875
production_forward grad[56] vs paper_forward: mean_abs=0.5272135734558105, max_abs=2.125, mean_rel=0.09668940305709839, max_rel=6.1767754554748535, norm_rel=0.02368474006652832, ref_abs_avg=22.47003936767578, test_abs_avg=22.453571319580078
production_forward grad[57] vs paper_forward: mean_abs=0.6598565578460693, max_abs=5.0, mean_rel=0.15611322224140167, max_rel=1466.7723388671875, norm_rel=0.02427498623728752, ref_abs_avg=27.166549682617188, test_abs_avg=27.16469955444336
production_forward grad[58] vs paper_forward: mean_abs=0.6449761390686035, max_abs=4.125, mean_rel=0.1642300933599472, max_rel=1220.4739990234375, norm_rel=0.024404199793934822, ref_abs_avg=26.462759017944336, test_abs_avg=26.459087371826172
production_forward grad[59] vs paper_forward: mean_abs=0.4709463119506836, max_abs=2.0, mean_rel=0.1318950206041336, max_rel=19.9630069732666, norm_rel=0.022674191743135452, ref_abs_avg=20.67945098876953, test_abs_avg=20.677000045776367
production_forward grad[60] vs paper_forward: mean_abs=0.6108841896057129, max_abs=4.125, mean_rel=0.15028497576713562, max_rel=834.6204833984375, norm_rel=0.02367711253464222, ref_abs_avg=25.818300247192383, test_abs_avg=25.82060432434082
production_forward grad[61] vs paper_forward: mean_abs=0.5970454216003418, max_abs=4.0, mean_rel=0.1457623839378357, max_rel=609.3441772460938, norm_rel=0.022941328585147858, ref_abs_avg=25.988523483276367, test_abs_avg=25.98978042602539
production_forward grad[62] vs paper_forward: mean_abs=0.46241796016693115, max_abs=1.875, mean_rel=0.19628798961639404, max_rel=63.5007438659668, norm_rel=0.021665144711732864, ref_abs_avg=21.82489776611328, test_abs_avg=21.838058471679688
production_forward grad[63] vs paper_forward: mean_abs=0.577373206615448, max_abs=4.25, mean_rel=0.14787733554840088, max_rel=1085.7078857421875, norm_rel=0.023389097303152084, ref_abs_avg=24.671432495117188, test_abs_avg=24.670520782470703
production_forward grad[64] vs paper_forward: mean_abs=0.5649731159210205, max_abs=4.375, mean_rel=0.1554698348045349, max_rel=1031.6796875, norm_rel=0.023209698498249054, ref_abs_avg=24.414480209350586, test_abs_avg=24.412960052490234
production_forward grad[65] vs paper_forward: mean_abs=0.4260205924510956, max_abs=2.03125, mean_rel=0.21872451901435852, max_rel=62.67537307739258, norm_rel=0.02220248617231846, ref_abs_avg=19.65349769592285, test_abs_avg=19.662464141845703
production_forward grad[66] vs paper_forward: mean_abs=0.5536061525344849, max_abs=4.0, mean_rel=0.15511472523212433, max_rel=1170.880859375, norm_rel=0.022979609668254852, ref_abs_avg=24.039844512939453, test_abs_avg=24.041702270507812
production_forward grad[67] vs paper_forward: mean_abs=0.5403638482093811, max_abs=4.125, mean_rel=0.1469656229019165, max_rel=1449.3265380859375, norm_rel=0.022806625813245773, ref_abs_avg=23.738548278808594, test_abs_avg=23.737070083618164
production_forward grad[68] vs paper_forward: mean_abs=0.4027571678161621, max_abs=1.75, mean_rel=0.07701057195663452, max_rel=4.1480889320373535, norm_rel=0.021410943940281868, ref_abs_avg=19.141939163208008, test_abs_avg=19.149341583251953
production_forward grad[69] vs paper_forward: mean_abs=0.5179953575134277, max_abs=4.9375, mean_rel=0.149922177195549, max_rel=875.8407592773438, norm_rel=0.02263292483985424, ref_abs_avg=22.89664077758789, test_abs_avg=22.896312713623047
production_forward grad[70] vs paper_forward: mean_abs=0.5113294124603271, max_abs=4.0, mean_rel=0.1403866708278656, max_rel=870.251220703125, norm_rel=0.022728189826011658, ref_abs_avg=22.591339111328125, test_abs_avg=22.590295791625977
production_forward grad[71] vs paper_forward: mean_abs=0.40564537048339844, max_abs=2.0, mean_rel=0.09953664243221283, max_rel=5.871340751647949, norm_rel=0.021213753148913383, ref_abs_avg=18.93880844116211, test_abs_avg=18.92000389099121
production_forward grad[72] vs paper_forward: mean_abs=0.4991510808467865, max_abs=4.75, mean_rel=0.13412615656852722, max_rel=993.128662109375, norm_rel=0.022148311138153076, ref_abs_avg=22.49756622314453, test_abs_avg=22.49802589416504
production_forward grad[73] vs paper_forward: mean_abs=0.4924241304397583, max_abs=4.0, mean_rel=0.14494378864765167, max_rel=587.5317993164062, norm_rel=0.02233058400452137, ref_abs_avg=22.115306854248047, test_abs_avg=22.11670684814453
production_forward grad[74] vs paper_forward: mean_abs=0.4525703489780426, max_abs=1.625, mean_rel=0.2050132006406784, max_rel=61.28660583496094, norm_rel=0.02337905950844288, ref_abs_avg=19.420740127563477, test_abs_avg=19.378686904907227
production_forward grad[75] vs paper_forward: mean_abs=0.5453085899353027, max_abs=4.0, mean_rel=0.14495082199573517, max_rel=671.3549194335938, norm_rel=0.02357448637485504, ref_abs_avg=23.118427276611328, test_abs_avg=23.11854362487793
production_forward grad[76] vs paper_forward: mean_abs=0.5326699018478394, max_abs=3.875, mean_rel=0.14566098153591156, max_rel=748.3289184570312, norm_rel=0.023356638848781586, ref_abs_avg=22.838623046875, test_abs_avg=22.832740783691406
production_forward grad[77] vs paper_forward: mean_abs=0.3914012312889099, max_abs=1.5, mean_rel=0.06773054599761963, max_rel=2.3173880577087402, norm_rel=0.022073421627283096, ref_abs_avg=18.136014938354492, test_abs_avg=18.14108657836914
production_forward grad[78] vs paper_forward: mean_abs=0.5079646110534668, max_abs=4.0, mean_rel=0.14784908294677734, max_rel=676.3755493164062, norm_rel=0.02283621020615101, ref_abs_avg=22.255352020263672, test_abs_avg=22.254505157470703
production_forward grad[79] vs paper_forward: mean_abs=0.4988310933113098, max_abs=3.875, mean_rel=0.1472501903772354, max_rel=414.8955078125, norm_rel=0.022994989529252052, ref_abs_avg=21.775779724121094, test_abs_avg=21.77609634399414
production_forward grad[80] vs paper_forward: mean_abs=0.38713765144348145, max_abs=1.6875, mean_rel=0.1864500343799591, max_rel=26.205873489379883, norm_rel=0.02134902961552143, ref_abs_avg=17.583553314208984, test_abs_avg=17.586360931396484
production_forward grad[81] vs paper_forward: mean_abs=0.4817574620246887, max_abs=4.5, mean_rel=0.15310776233673096, max_rel=787.5411987304688, norm_rel=0.02271883934736252, ref_abs_avg=21.265182495117188, test_abs_avg=21.264192581176758
production_forward grad[82] vs paper_forward: mean_abs=0.4673374891281128, max_abs=3.875, mean_rel=0.14438246190547943, max_rel=1127.243896484375, norm_rel=0.022161424160003662, ref_abs_avg=21.122957229614258, test_abs_avg=21.12504768371582
production_forward grad[83] vs paper_forward: mean_abs=0.3650045394897461, max_abs=1.75, mean_rel=0.10586778819561005, max_rel=14.7095365524292, norm_rel=0.021604545414447784, ref_abs_avg=17.18977165222168, test_abs_avg=17.23015022277832
production_forward grad[84] vs paper_forward: mean_abs=0.44046229124069214, max_abs=4.25, mean_rel=0.13891534507274628, max_rel=881.694091796875, norm_rel=0.02208053693175316, ref_abs_avg=19.983755111694336, test_abs_avg=19.983789443969727
production_forward grad[85] vs paper_forward: mean_abs=0.44026899337768555, max_abs=3.765625, mean_rel=0.1360180377960205, max_rel=893.8779907226562, norm_rel=0.02246047556400299, ref_abs_avg=19.790740966796875, test_abs_avg=19.79541015625
production_forward grad[86] vs paper_forward: mean_abs=0.3496098518371582, max_abs=1.40625, mean_rel=0.09575402736663818, max_rel=11.58483600616455, norm_rel=0.02087450958788395, ref_abs_avg=16.953689575195312, test_abs_avg=16.968778610229492
production_forward grad[87] vs paper_forward: mean_abs=0.4184781610965729, max_abs=4.25, mean_rel=0.13528797030448914, max_rel=835.1321411132812, norm_rel=0.021505501121282578, ref_abs_avg=19.5283203125, test_abs_avg=19.52709197998047
production_forward grad[88] vs paper_forward: mean_abs=0.41455432772636414, max_abs=3.5, mean_rel=0.14549154043197632, max_rel=904.0180053710938, norm_rel=0.021648526191711426, ref_abs_avg=19.243497848510742, test_abs_avg=19.243595123291016
production_forward grad[89] vs paper_forward: mean_abs=0.33249831199645996, max_abs=1.5, mean_rel=0.09113624691963196, max_rel=9.618058204650879, norm_rel=0.020646879449486732, ref_abs_avg=16.29256820678711, test_abs_avg=16.268096923828125
production_forward grad[90] vs paper_forward: mean_abs=0.39563116431236267, max_abs=4.625, mean_rel=0.12950509786605835, max_rel=1074.9156494140625, norm_rel=0.020808758214116096, ref_abs_avg=19.147178649902344, test_abs_avg=19.14651870727539
production_forward grad[91] vs paper_forward: mean_abs=0.377538800239563, max_abs=3.5, mean_rel=0.135277658700943, max_rel=651.9191284179688, norm_rel=0.02064225822687149, ref_abs_avg=18.49658203125, test_abs_avg=18.496904373168945
production_forward grad[92] vs paper_forward: mean_abs=0.30061280727386475, max_abs=1.5, mean_rel=0.07531043142080307, max_rel=4.991943836212158, norm_rel=0.01942535862326622, ref_abs_avg=15.920970916748047, test_abs_avg=15.951584815979004
production_forward grad[93] vs paper_forward: mean_abs=0.376589298248291, max_abs=5.0, mean_rel=0.12661227583885193, max_rel=791.56201171875, norm_rel=0.020253950729966164, ref_abs_avg=18.775611877441406, test_abs_avg=18.77432632446289
production_forward grad[94] vs paper_forward: mean_abs=0.356414794921875, max_abs=3.5625, mean_rel=0.12536481022834778, max_rel=386.53460693359375, norm_rel=0.02021758444607258, ref_abs_avg=17.900623321533203, test_abs_avg=17.902935028076172
production_forward grad[95] vs paper_forward: mean_abs=0.30221155285835266, max_abs=1.125, mean_rel=0.11021646857261658, max_rel=20.352020263671875, norm_rel=0.0209357887506485, ref_abs_avg=14.327051162719727, test_abs_avg=14.363245010375977
production_forward grad[96] vs paper_forward: mean_abs=0.3544993996620178, max_abs=3.5, mean_rel=0.12053817510604858, max_rel=386.7577819824219, norm_rel=0.02013021521270275, ref_abs_avg=17.86845588684082, test_abs_avg=17.86788558959961
production_forward grad[97] vs paper_forward: mean_abs=0.3449382483959198, max_abs=3.75, mean_rel=0.120578333735466, max_rel=589.2901000976562, norm_rel=0.019972966983914375, ref_abs_avg=17.525388717651367, test_abs_avg=17.520614624023438
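The per-tensor metrics in the lines above can be reproduced with a small helper. This is a sketch only: the function name `compare` is made up here, and the metric definitions (e.g. `norm_rel` as the ratio of Frobenius norms, `mean_rel` computed elementwise against `|ref|`) are assumptions inferred from the log format, not confirmed by the test harness that produced it.

```python
import torch

def compare(ref: torch.Tensor, test: torch.Tensor, eps: float = 1e-12) -> dict:
    """Reconstruct the per-tensor error metrics printed in the log above.

    Assumed definitions (inferred from the log fields):
      mean_abs / max_abs : mean and max of |ref - test|
      mean_rel / max_rel : mean and max of |ref - test| / (|ref| + eps)
      norm_rel           : ||ref - test||_F / ||ref||_F
      ref_abs_avg, test_abs_avg : mean of |ref| and |test|
    """
    ref = ref.float()
    test = test.float()
    diff = (ref - test).abs()
    rel = diff / (ref.abs() + eps)
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        "max_rel": rel.max().item(),
        "norm_rel": (diff.norm() / ref.norm().clamp_min(eps)).item(),
        "ref_abs_avg": ref.abs().mean().item(),
        "test_abs_avg": test.abs().mean().item(),
    }
```

Note that under these assumed definitions a large `max_rel` (e.g. the four-digit values above) can coexist with a small `norm_rel`: a single near-zero reference element inflates the elementwise ratio while contributing almost nothing to the norm ratio, which is why `norm_rel` stays near 0.02 throughout.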
production_forward2 vs paper_forward output: mean_abs=0.0016229460015892982, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008596159517765045, max_abs=0.328125, mean_rel=0.07428693771362305, max_rel=110.95108032226562, norm_rel=0.020282598212361336, ref_abs_avg=0.4566798806190491, test_abs_avg=0.45668014883995056
production_forward2 grad[1] vs paper_forward: mean_abs=7.324682235717773, max_abs=64.0, mean_rel=0.14005796611309052, max_rel=261.70623779296875, norm_rel=0.020274605602025986, ref_abs_avg=319.4510498046875, test_abs_avg=319.37164306640625
production_forward2 grad[2] vs paper_forward: mean_abs=1.2530508041381836, max_abs=4.75, mean_rel=0.11043649911880493, max_rel=8.813007354736328, norm_rel=0.023634279146790504, ref_abs_avg=54.36587142944336, test_abs_avg=54.3376350402832
production_forward2 grad[3] vs paper_forward: mean_abs=1.5975675582885742, max_abs=11.0, mean_rel=0.17202341556549072, max_rel=2178.36474609375, norm_rel=0.02413368970155716, ref_abs_avg=66.59817504882812, test_abs_avg=66.59981536865234
production_forward2 grad[4] vs paper_forward: mean_abs=1.5459327697753906, max_abs=10.0, mean_rel=0.16192853450775146, max_rel=1211.6473388671875, norm_rel=0.02380705252289772, ref_abs_avg=65.29682922363281, test_abs_avg=65.3004150390625
production_forward2 grad[5] vs paper_forward: mean_abs=1.1363630294799805, max_abs=4.5, mean_rel=0.11474967002868652, max_rel=13.73782730102539, norm_rel=0.023696167394518852, ref_abs_avg=48.6696891784668, test_abs_avg=48.6644287109375
production_forward2 grad[6] vs paper_forward: mean_abs=1.3805146217346191, max_abs=9.53125, mean_rel=0.16892436146736145, max_rel=3654.154296875, norm_rel=0.023876752704381943, ref_abs_avg=58.14122009277344, test_abs_avg=58.14325714111328
production_forward2 grad[7] vs paper_forward: mean_abs=1.3412398099899292, max_abs=9.0, mean_rel=0.16739729046821594, max_rel=1909.4122314453125, norm_rel=0.023294392973184586, ref_abs_avg=57.94084548950195, test_abs_avg=57.945579528808594
production_forward2 grad[8] vs paper_forward: mean_abs=0.9805984497070312, max_abs=4.0, mean_rel=0.5332189202308655, max_rel=186.84942626953125, norm_rel=0.023174773901700974, ref_abs_avg=42.002174377441406, test_abs_avg=42.0634765625
production_forward2 grad[9] vs paper_forward: mean_abs=1.236822247505188, max_abs=7.9375, mean_rel=0.15953975915908813, max_rel=1216.674560546875, norm_rel=0.02359997108578682, ref_abs_avg=52.68552780151367, test_abs_avg=52.687843322753906
production_forward2 grad[10] vs paper_forward: mean_abs=1.2082828283309937, max_abs=7.5, mean_rel=0.1530151069164276, max_rel=989.1600341796875, norm_rel=0.023293789476156235, ref_abs_avg=52.179298400878906, test_abs_avg=52.17986297607422
production_forward2 grad[11] vs paper_forward: mean_abs=0.9427409172058105, max_abs=3.5, mean_rel=0.10258608311414719, max_rel=9.570905685424805, norm_rel=0.02377728745341301, ref_abs_avg=40.86144256591797, test_abs_avg=40.948944091796875
production_forward2 grad[12] vs paper_forward: mean_abs=1.152787208557129, max_abs=8.0, mean_rel=0.16698583960533142, max_rel=1863.2713623046875, norm_rel=0.023510567843914032, ref_abs_avg=49.3016357421875, test_abs_avg=49.30076599121094
production_forward2 grad[13] vs paper_forward: mean_abs=1.119236707687378, max_abs=7.5, mean_rel=0.16950608789920807, max_rel=2077.64697265625, norm_rel=0.023172402754426003, ref_abs_avg=48.4852294921875, test_abs_avg=48.48825454711914
production_forward2 grad[14] vs paper_forward: mean_abs=0.9105072021484375, max_abs=3.5, mean_rel=0.07636818289756775, max_rel=4.564193248748779, norm_rel=0.024606727063655853, ref_abs_avg=37.395301818847656, test_abs_avg=37.43971252441406
production_forward2 grad[15] vs paper_forward: mean_abs=1.0705735683441162, max_abs=6.59375, mean_rel=0.15046629309654236, max_rel=762.684814453125, norm_rel=0.023190369829535484, ref_abs_avg=46.427947998046875, test_abs_avg=46.43134307861328
production_forward2 grad[16] vs paper_forward: mean_abs=1.0411970615386963, max_abs=7.5, mean_rel=0.1485598236322403, max_rel=1023.3294067382812, norm_rel=0.02276073582470417, ref_abs_avg=45.968746185302734, test_abs_avg=45.96983337402344
production_forward2 grad[17] vs paper_forward: mean_abs=0.8392314910888672, max_abs=2.875, mean_rel=0.09015613049268723, max_rel=5.308076858520508, norm_rel=0.02319616824388504, ref_abs_avg=35.65180587768555, test_abs_avg=35.641510009765625
production_forward2 grad[18] vs paper_forward: mean_abs=1.0104206800460815, max_abs=6.25, mean_rel=0.15114428102970123, max_rel=1583.5406494140625, norm_rel=0.02305028960108757, ref_abs_avg=44.050655364990234, test_abs_avg=44.051631927490234
production_forward2 grad[19] vs paper_forward: mean_abs=0.9788872003555298, max_abs=7.0, mean_rel=0.15904876589775085, max_rel=988.3104858398438, norm_rel=0.022724397480487823, ref_abs_avg=43.32176971435547, test_abs_avg=43.32765579223633
production_forward2 grad[20] vs paper_forward: mean_abs=0.7785601615905762, max_abs=3.0, mean_rel=0.07933686673641205, max_rel=5.26869010925293, norm_rel=0.021958233788609505, ref_abs_avg=34.579978942871094, test_abs_avg=34.63597106933594
production_forward2 grad[21] vs paper_forward: mean_abs=0.9500480890274048, max_abs=6.0, mean_rel=0.15571141242980957, max_rel=1321.1082763671875, norm_rel=0.02282991074025631, ref_abs_avg=41.78212356567383, test_abs_avg=41.7840576171875
production_forward2 grad[22] vs paper_forward: mean_abs=0.9309330582618713, max_abs=5.75, mean_rel=0.14326760172843933, max_rel=722.0546264648438, norm_rel=0.02252282202243805, ref_abs_avg=41.53359603881836, test_abs_avg=41.5365104675293
production_forward2 grad[23] vs paper_forward: mean_abs=0.7125257253646851, max_abs=3.0, mean_rel=0.16753895580768585, max_rel=27.807584762573242, norm_rel=0.02095276676118374, ref_abs_avg=33.946075439453125, test_abs_avg=33.94062423706055
production_forward2 grad[24] vs paper_forward: mean_abs=0.9103871583938599, max_abs=6.0, mean_rel=0.15368354320526123, max_rel=1737.452880859375, norm_rel=0.02267993800342083, ref_abs_avg=40.28291320800781, test_abs_avg=40.283058166503906
production_forward2 grad[25] vs paper_forward: mean_abs=0.8883934020996094, max_abs=5.5, mean_rel=0.15195772051811218, max_rel=738.1356811523438, norm_rel=0.02246808260679245, ref_abs_avg=39.77630615234375, test_abs_avg=39.77866744995117
production_forward2 grad[26] vs paper_forward: mean_abs=0.9245262145996094, max_abs=3.5, mean_rel=0.1471620351076126, max_rel=17.29011344909668, norm_rel=0.02683369070291519, ref_abs_avg=33.9775390625, test_abs_avg=34.004581451416016
production_forward2 grad[27] vs paper_forward: mean_abs=1.0456160306930542, max_abs=6.25, mean_rel=0.16990111768245697, max_rel=1504.416748046875, norm_rel=0.024665001779794693, ref_abs_avg=42.52191162109375, test_abs_avg=42.52277755737305
production_forward2 grad[28] vs paper_forward: mean_abs=1.0193018913269043, max_abs=7.5, mean_rel=0.15960043668746948, max_rel=985.9193115234375, norm_rel=0.024426294490695, ref_abs_avg=41.96208190917969, test_abs_avg=41.95742416381836
production_forward2 grad[29] vs paper_forward: mean_abs=0.8321638107299805, max_abs=3.375, mean_rel=0.08812672644853592, max_rel=4.733170986175537, norm_rel=0.027219798415899277, ref_abs_avg=30.99641227722168, test_abs_avg=30.975345611572266
production_forward2 grad[30] vs paper_forward: mean_abs=0.9710173606872559, max_abs=6.0, mean_rel=0.17287829518318176, max_rel=1581.843505859375, norm_rel=0.02507754974067211, ref_abs_avg=38.87725830078125, test_abs_avg=38.87630844116211
production_forward2 grad[31] vs paper_forward: mean_abs=0.9553402662277222, max_abs=7.0, mean_rel=0.17317864298820496, max_rel=1540.201904296875, norm_rel=0.02500857599079609, ref_abs_avg=38.38410568237305, test_abs_avg=38.386131286621094
production_forward2 grad[32] vs paper_forward: mean_abs=0.777996301651001, max_abs=3.40625, mean_rel=0.15980610251426697, max_rel=19.455406188964844, norm_rel=0.02436359040439129, ref_abs_avg=31.44898796081543, test_abs_avg=31.454700469970703
production_forward2 grad[33] vs paper_forward: mean_abs=0.9120519161224365, max_abs=5.75, mean_rel=0.1760452389717102, max_rel=1062.84716796875, norm_rel=0.024973025545477867, ref_abs_avg=36.628997802734375, test_abs_avg=36.629642486572266
production_forward2 grad[34] vs paper_forward: mean_abs=0.8910658359527588, max_abs=5.5, mean_rel=0.15537527203559875, max_rel=782.9070434570312, norm_rel=0.024484556168317795, ref_abs_avg=36.524658203125, test_abs_avg=36.523887634277344
production_forward2 grad[35] vs paper_forward: mean_abs=0.6623997688293457, max_abs=2.75, mean_rel=0.12832531332969666, max_rel=21.469388961791992, norm_rel=0.024465736001729965, ref_abs_avg=27.16545295715332, test_abs_avg=27.113576889038086
production_forward2 grad[36] vs paper_forward: mean_abs=0.8407718539237976, max_abs=5.625, mean_rel=0.17494550347328186, max_rel=1162.6429443359375, norm_rel=0.024677058681845665, ref_abs_avg=34.17245864868164, test_abs_avg=34.17109680175781
production_forward2 grad[37] vs paper_forward: mean_abs=0.8340845108032227, max_abs=5.625, mean_rel=0.16057458519935608, max_rel=937.7869262695312, norm_rel=0.024620085954666138, ref_abs_avg=33.99601745605469, test_abs_avg=33.98936462402344
production_forward2 grad[38] vs paper_forward: mean_abs=0.6593703031539917, max_abs=2.5, mean_rel=0.21499986946582794, max_rel=64.17603302001953, norm_rel=0.025407610461115837, ref_abs_avg=25.728870391845703, test_abs_avg=25.68442153930664
production_forward2 grad[39] vs paper_forward: mean_abs=0.7987837195396423, max_abs=5.25, mean_rel=0.16494667530059814, max_rel=1217.1905517578125, norm_rel=0.02450699545443058, ref_abs_avg=32.69658660888672, test_abs_avg=32.69664764404297
production_forward2 grad[40] vs paper_forward: mean_abs=0.7859101295471191, max_abs=6.0, mean_rel=0.16834640502929688, max_rel=784.023681640625, norm_rel=0.02440839260816574, ref_abs_avg=32.32502746582031, test_abs_avg=32.32162857055664
production_forward2 grad[41] vs paper_forward: mean_abs=0.5665159225463867, max_abs=2.125, mean_rel=0.07746720314025879, max_rel=7.77826452255249, norm_rel=0.02066735178232193, ref_abs_avg=27.470170974731445, test_abs_avg=27.540122985839844
production_forward2 grad[42] vs paper_forward: mean_abs=0.752912163734436, max_abs=5.0, mean_rel=0.16611745953559875, max_rel=1592.2774658203125, norm_rel=0.02407200261950493, ref_abs_avg=31.3145809173584, test_abs_avg=31.314594268798828
production_forward2 grad[43] vs paper_forward: mean_abs=0.7438141107559204, max_abs=5.125, mean_rel=0.1642780750989914, max_rel=1124.522216796875, norm_rel=0.023885713890194893, ref_abs_avg=31.24480438232422, test_abs_avg=31.24711036682129
production_forward2 grad[44] vs paper_forward: mean_abs=0.5651059150695801, max_abs=2.078125, mean_rel=0.134089857339859, max_rel=8.878783226013184, norm_rel=0.02276519499719143, ref_abs_avg=25.11896514892578, test_abs_avg=25.11333465576172
production_forward2 grad[45] vs paper_forward: mean_abs=0.725528359413147, max_abs=5.421875, mean_rel=0.16481629014015198, max_rel=1618.5775146484375, norm_rel=0.02388594299554825, ref_abs_avg=30.437231063842773, test_abs_avg=30.438243865966797
production_forward2 grad[46] vs paper_forward: mean_abs=0.7088203430175781, max_abs=4.625, mean_rel=0.14481189846992493, max_rel=436.73565673828125, norm_rel=0.023811697959899902, ref_abs_avg=29.863391876220703, test_abs_avg=29.863710403442383
production_forward2 grad[47] vs paper_forward: mean_abs=0.5512939691543579, max_abs=2.0, mean_rel=0.1409280002117157, max_rel=17.349172592163086, norm_rel=0.022495469078421593, ref_abs_avg=24.627105712890625, test_abs_avg=24.633342742919922
production_forward2 grad[48] vs paper_forward: mean_abs=0.6923808455467224, max_abs=4.5, mean_rel=0.15529152750968933, max_rel=892.3406982421875, norm_rel=0.02357516624033451, ref_abs_avg=29.411209106445312, test_abs_avg=29.41147232055664
production_forward2 grad[49] vs paper_forward: mean_abs=0.6791940927505493, max_abs=5.0, mean_rel=0.14722412824630737, max_rel=904.7158203125, norm_rel=0.02341112308204174, ref_abs_avg=29.06414031982422, test_abs_avg=29.064971923828125
production_forward2 grad[50] vs paper_forward: mean_abs=0.6172065734863281, max_abs=2.5, mean_rel=0.14126905798912048, max_rel=11.17612361907959, norm_rel=0.02462124079465866, ref_abs_avg=25.834129333496094, test_abs_avg=25.909687042236328
production_forward2 grad[51] vs paper_forward: mean_abs=0.779529333114624, max_abs=5.5, mean_rel=0.17252174019813538, max_rel=1510.8514404296875, norm_rel=0.02515603043138981, ref_abs_avg=31.070266723632812, test_abs_avg=31.068965911865234
production_forward2 grad[52] vs paper_forward: mean_abs=0.7622144818305969, max_abs=5.5, mean_rel=0.164017453789711, max_rel=1260.782470703125, norm_rel=0.025044012814760208, ref_abs_avg=30.57146644592285, test_abs_avg=30.5699405670166
production_forward2 grad[53] vs paper_forward: mean_abs=0.5691342353820801, max_abs=2.25, mean_rel=0.09479035437107086, max_rel=7.403640270233154, norm_rel=0.023791372776031494, ref_abs_avg=23.901485443115234, test_abs_avg=23.85812759399414
production_forward2 grad[54] vs paper_forward: mean_abs=0.7215868234634399, max_abs=5.125, mean_rel=0.17164207994937897, max_rel=1312.2305908203125, norm_rel=0.024806419387459755, ref_abs_avg=29.155595779418945, test_abs_avg=29.156322479248047
production_forward2 grad[55] vs paper_forward: mean_abs=0.7043523788452148, max_abs=5.25, mean_rel=0.15496978163719177, max_rel=630.0826416015625, norm_rel=0.02488485351204872, ref_abs_avg=28.39717674255371, test_abs_avg=28.39177703857422
production_forward2 grad[56] vs paper_forward: mean_abs=0.5379204750061035, max_abs=2.125, mean_rel=0.10898664593696594, max_rel=9.073321342468262, norm_rel=0.024029474705457687, ref_abs_avg=22.47003936767578, test_abs_avg=22.46091079711914
production_forward2 grad[57] vs paper_forward: mean_abs=0.6603826284408569, max_abs=5.0625, mean_rel=0.15877144038677216, max_rel=1581.843505859375, norm_rel=0.024292072281241417, ref_abs_avg=27.166549682617188, test_abs_avg=27.164722442626953
production_forward2 grad[58] vs paper_forward: mean_abs=0.6444318294525146, max_abs=4.5, mean_rel=0.16160213947296143, max_rel=1135.3310546875, norm_rel=0.02440473809838295, ref_abs_avg=26.462759017944336, test_abs_avg=26.458683013916016
production_forward2 grad[59] vs paper_forward: mean_abs=0.4764280319213867, max_abs=2.0, mean_rel=0.13546165823936462, max_rel=22.055526733398438, norm_rel=0.022524608299136162, ref_abs_avg=20.67945098876953, test_abs_avg=20.685327529907227
production_forward2 grad[60] vs paper_forward: mean_abs=0.6124788522720337, max_abs=4.25, mean_rel=0.15095128118991852, max_rel=891.8468627929688, norm_rel=0.023740191012620926, ref_abs_avg=25.818300247192383, test_abs_avg=25.820547103881836
production_forward2 grad[61] vs paper_forward: mean_abs=0.5990291237831116, max_abs=4.0, mean_rel=0.1434711217880249, max_rel=500.6251525878906, norm_rel=0.02301490679383278, ref_abs_avg=25.988523483276367, test_abs_avg=25.990406036376953
production_forward2 grad[62] vs paper_forward: mean_abs=0.4496692419052124, max_abs=1.75, mean_rel=0.19761303067207336, max_rel=66.38392639160156, norm_rel=0.021210001781582832, ref_abs_avg=21.82489776611328, test_abs_avg=21.8211669921875
production_forward2 grad[63] vs paper_forward: mean_abs=0.578144907951355, max_abs=4.25, mean_rel=0.14611433446407318, max_rel=1093.3004150390625, norm_rel=0.023420438170433044, ref_abs_avg=24.671432495117188, test_abs_avg=24.67066192626953
production_forward2 grad[64] vs paper_forward: mean_abs=0.564812183380127, max_abs=4.375, mean_rel=0.15160083770751953, max_rel=685.8316650390625, norm_rel=0.0232157614082098, ref_abs_avg=24.414480209350586, test_abs_avg=24.412681579589844
production_forward2 grad[65] vs paper_forward: mean_abs=0.4302637577056885, max_abs=2.04296875, mean_rel=0.20865687727928162, max_rel=58.07072448730469, norm_rel=0.02227918803691864, ref_abs_avg=19.65349769592285, test_abs_avg=19.66622543334961
production_forward2 grad[66] vs paper_forward: mean_abs=0.5546749830245972, max_abs=4.0, mean_rel=0.15766336023807526, max_rel=1278.9407958984375, norm_rel=0.023023003712296486, ref_abs_avg=24.039844512939453, test_abs_avg=24.041385650634766
production_forward2 grad[67] vs paper_forward: mean_abs=0.5411562919616699, max_abs=3.875, mean_rel=0.14843809604644775, max_rel=1402.57421875, norm_rel=0.022833023220300674, ref_abs_avg=23.738548278808594, test_abs_avg=23.737327575683594
production_forward2 grad[68] vs paper_forward: mean_abs=0.3988039493560791, max_abs=1.625, mean_rel=0.07525402307510376, max_rel=3.6875486373901367, norm_rel=0.021200500428676605, ref_abs_avg=19.141939163208008, test_abs_avg=19.15186309814453
production_forward2 grad[69] vs paper_forward: mean_abs=0.518679678440094, max_abs=4.5625, mean_rel=0.15080222487449646, max_rel=862.7709350585938, norm_rel=0.022670621052384377, ref_abs_avg=22.89664077758789, test_abs_avg=22.896312713623047
production_forward2 grad[70] vs paper_forward: mean_abs=0.5118734240531921, max_abs=3.75, mean_rel=0.13936805725097656, max_rel=792.2079467773438, norm_rel=0.022766828536987305, ref_abs_avg=22.591339111328125, test_abs_avg=22.589794158935547
production_forward2 grad[71] vs paper_forward: mean_abs=0.4018516540527344, max_abs=1.75, mean_rel=0.0948110967874527, max_rel=4.623326301574707, norm_rel=0.02121138572692871, ref_abs_avg=18.93880844116211, test_abs_avg=18.92520523071289
production_forward2 grad[72] vs paper_forward: mean_abs=0.49974724650382996, max_abs=4.75, mean_rel=0.13491225242614746, max_rel=1017.64501953125, norm_rel=0.02216562256217003, ref_abs_avg=22.49756622314453, test_abs_avg=22.498619079589844
production_forward2 grad[73] vs paper_forward: mean_abs=0.49291300773620605, max_abs=4.0, mean_rel=0.1464325338602066, max_rel=642.523681640625, norm_rel=0.022357026115059853, ref_abs_avg=22.115306854248047, test_abs_avg=22.116992950439453
production_forward2 grad[74] vs paper_forward: mean_abs=0.4473111927509308, max_abs=1.5, mean_rel=0.14895766973495483, max_rel=36.7933235168457, norm_rel=0.022918758913874626, ref_abs_avg=19.420740127563477, test_abs_avg=19.38153839111328
production_forward2 grad[75] vs paper_forward: mean_abs=0.543493926525116, max_abs=4.0, mean_rel=0.14586800336837769, max_rel=874.20751953125, norm_rel=0.023497410118579865, ref_abs_avg=23.118427276611328, test_abs_avg=23.118589401245117
production_forward2 grad[76] vs paper_forward: mean_abs=0.5307236909866333, max_abs=3.625, mean_rel=0.1469448357820511, max_rel=730.513671875, norm_rel=0.023257983848452568, ref_abs_avg=22.838623046875, test_abs_avg=22.833045959472656
production_forward2 grad[77] vs paper_forward: mean_abs=0.38919878005981445, max_abs=1.4375, mean_rel=0.06656712293624878, max_rel=2.9359560012817383, norm_rel=0.021862434223294258, ref_abs_avg=18.136014938354492, test_abs_avg=18.135772705078125
production_forward2 grad[78] vs paper_forward: mean_abs=0.5072197914123535, max_abs=4.5, mean_rel=0.14956393837928772, max_rel=802.6237182617188, norm_rel=0.022798361256718636, ref_abs_avg=22.255352020263672, test_abs_avg=22.25428009033203
production_forward2 grad[79] vs paper_forward: mean_abs=0.4977867007255554, max_abs=3.8125, mean_rel=0.14743030071258545, max_rel=471.29876708984375, norm_rel=0.02295009419322014, ref_abs_avg=21.775779724121094, test_abs_avg=21.77509307861328
production_forward2 grad[80] vs paper_forward: mean_abs=0.39788174629211426, max_abs=1.625, mean_rel=0.17862898111343384, max_rel=22.637998580932617, norm_rel=0.021938735619187355, ref_abs_avg=17.583553314208984, test_abs_avg=17.58342742919922
production_forward2 grad[81] vs paper_forward: mean_abs=0.48163533210754395, max_abs=4.5, mean_rel=0.15112516283988953, max_rel=730.0558471679688, norm_rel=0.02269653230905533, ref_abs_avg=21.265182495117188, test_abs_avg=21.26333999633789
production_forward2 grad[82] vs paper_forward: mean_abs=0.46626585721969604, max_abs=3.625, mean_rel=0.14270099997520447, max_rel=950.7137451171875, norm_rel=0.022125061601400375, ref_abs_avg=21.122957229614258, test_abs_avg=21.12485694885254
production_forward2 grad[83] vs paper_forward: mean_abs=0.36034607887268066, max_abs=1.75, mean_rel=0.12049414962530136, max_rel=14.890229225158691, norm_rel=0.021093403920531273, ref_abs_avg=17.18977165222168, test_abs_avg=17.218250274658203
production_forward2 grad[84] vs paper_forward: mean_abs=0.4406537711620331, max_abs=4.0, mean_rel=0.13928335905075073, max_rel=878.1248168945312, norm_rel=0.02208878844976425, ref_abs_avg=19.983755111694336, test_abs_avg=19.98345947265625
production_forward2 grad[85] vs paper_forward: mean_abs=0.44046545028686523, max_abs=3.796875, mean_rel=0.13699494302272797, max_rel=1028.8817138671875, norm_rel=0.022457798942923546, ref_abs_avg=19.790740966796875, test_abs_avg=19.795116424560547
production_forward2 grad[86] vs paper_forward: mean_abs=0.3538050651550293, max_abs=1.265625, mean_rel=0.102479487657547, max_rel=11.58483600616455, norm_rel=0.020847322419285774, ref_abs_avg=16.953689575195312, test_abs_avg=16.970609664916992
production_forward2 grad[87] vs paper_forward: mean_abs=0.4185746908187866, max_abs=4.0, mean_rel=0.1361064910888672, max_rel=866.0552368164062, norm_rel=0.0215232502669096, ref_abs_avg=19.5283203125, test_abs_avg=19.527084350585938
production_forward2 grad[88] vs paper_forward: mean_abs=0.41449737548828125, max_abs=3.5, mean_rel=0.14380189776420593, max_rel=857.5037231445312, norm_rel=0.02166067250072956, ref_abs_avg=19.243497848510742, test_abs_avg=19.243864059448242
production_forward2 grad[89] vs paper_forward: mean_abs=0.3361162543296814, max_abs=1.4375, mean_rel=0.08759044110774994, max_rel=8.262667655944824, norm_rel=0.020729046314954758, ref_abs_avg=16.29256820678711, test_abs_avg=16.261608123779297
production_forward2 grad[90] vs paper_forward: mean_abs=0.39591482281684875, max_abs=4.5, mean_rel=0.12811017036437988, max_rel=1107.7855224609375, norm_rel=0.020820405334234238, ref_abs_avg=19.147178649902344, test_abs_avg=19.14651870727539
production_forward2 grad[91] vs paper_forward: mean_abs=0.3780975937843323, max_abs=3.5, mean_rel=0.135423943400383, max_rel=656.5469360351562, norm_rel=0.02065601944923401, ref_abs_avg=18.49658203125, test_abs_avg=18.49679183959961
production_forward2 grad[92] vs paper_forward: mean_abs=0.2962644100189209, max_abs=1.5, mean_rel=0.07264915108680725, max_rel=5.10703182220459, norm_rel=0.01919534243643284, ref_abs_avg=15.920970916748047, test_abs_avg=15.947786331176758
production_forward2 grad[93] vs paper_forward: mean_abs=0.37676364183425903, max_abs=5.0, mean_rel=0.12727200984954834, max_rel=787.8257446289062, norm_rel=0.020261386409401894, ref_abs_avg=18.775611877441406, test_abs_avg=18.774398803710938
production_forward2 grad[94] vs paper_forward: mean_abs=0.3568536639213562, max_abs=3.5625, mean_rel=0.1245967298746109, max_rel=311.0126647949219, norm_rel=0.020231740549206734, ref_abs_avg=17.900623321533203, test_abs_avg=17.90304946899414
production_forward2 grad[95] vs paper_forward: mean_abs=0.30229854583740234, max_abs=1.125, mean_rel=0.11181701719760895, max_rel=20.935983657836914, norm_rel=0.02092159539461136, ref_abs_avg=14.327051162719727, test_abs_avg=14.36539363861084
production_forward2 grad[96] vs paper_forward: mean_abs=0.35447338223457336, max_abs=3.5, mean_rel=0.12061244994401932, max_rel=444.7468566894531, norm_rel=0.02012930065393448, ref_abs_avg=17.86845588684082, test_abs_avg=17.8679141998291
production_forward2 grad[97] vs paper_forward: mean_abs=0.34492164850234985, max_abs=3.75, mean_rel=0.12056393176317215, max_rel=589.2901000976562, norm_rel=0.019972797483205795, ref_abs_avg=17.525388717651367, test_abs_avg=17.520610809326172
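The per-tensor statistics reported above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) can be reproduced with a small helper along these lines. This is a sketch: the function name, the eps floor, and the exact formulas are assumptions, chosen to match the statistic names in the log. Note how max_rel blows up (values in the hundreds or thousands) wherever the reference entry is near zero, while norm_rel stays a stable whole-tensor error around 0.02.

```python
import torch

def compare(ref: torch.Tensor, test: torch.Tensor, eps: float = 1e-6) -> dict:
    """Summarize elementwise and norm-based differences between two tensors.

    Hypothetical helper: eps and the clamping strategy are assumptions,
    not taken from the script that produced the log above.
    """
    ref = ref.float()
    test = test.float()
    diff = (test - ref).abs()
    # elementwise relative error; huge max_rel values arise where |ref| ~ 0
    rel = diff / ref.abs().clamp_min(eps)
    return {
        "mean_abs": diff.mean().item(),
        "max_abs": diff.max().item(),
        "mean_rel": rel.mean().item(),
        "max_rel": rel.max().item(),
        # whole-tensor relative error, insensitive to isolated near-zero entries
        "norm_rel": (diff.norm() / ref.norm().clamp_min(eps)).item(),
        "ref_abs_avg": ref.abs().mean().item(),
        "test_abs_avg": test.abs().mean().item(),
    }
```

For bf16 kernels, norm_rel is usually the metric to gate on; mean_rel and max_rel are dominated by near-zero reference entries.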
identity layers + randn queries
paper_forward fwd+bwd:  379.752 ms
paper_forward bwd-only: 293.967 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.043 GiB, fwd+bwd=32.793 GiB
production_forward fwd+bwd:  112.030 ms
production_forward bwd-only: 91.654 ms
production_forward peak allocated: fwd=2.364 GiB, fwd+bwd=6.243 GiB
production_forward peak reserved:  fwd=2.504 GiB, fwd+bwd=6.379 GiB
production_forward2 fwd+bwd:  224.345 ms
production_forward2 bwd-only: 202.155 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.254 GiB, fwd+bwd=9.004 GiB
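Numbers like "fwd+bwd: 379.752 ms" and "peak allocated: 30.001 GiB" above are typically collected with a loop timer plus `torch.cuda.reset_peak_memory_stats` / `torch.cuda.max_memory_allocated` / `torch.cuda.max_memory_reserved`. The sketch below is an assumption about the harness (warmup, iteration count, and reduction to a scalar loss are all hypothetical); it falls back to a wall-clock timer off-GPU so it stays runnable anywhere.

```python
import time
import torch

def bench_fwd_bwd(fn, *args, iters: int = 5) -> dict:
    """Rough fwd+bwd benchmark. A sketch only: the real harness behind the
    numbers above (warmup policy, iters, CUDA-event timing) is an assumption.
    Reports ms per iteration, plus peak allocated/reserved GiB when on CUDA."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        out = fn(*args)
        out.sum().backward()  # scalar loss so backward() needs no grad arg
    if use_cuda:
        torch.cuda.synchronize()  # flush async kernels before reading the clock
    stats = {"fwd+bwd ms": (time.perf_counter() - t0) / iters * 1e3}
    if use_cuda:
        stats["peak allocated GiB"] = torch.cuda.max_memory_allocated() / 2**30
        stats["peak reserved GiB"] = torch.cuda.max_memory_reserved() / 2**30
    return stats
```

The allocated/reserved gap in the log (e.g. 6.243 vs 9.004 GiB for production_forward2) reflects caching-allocator blocks held but not in use; only "allocated" tracks live tensors.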

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016783958999440074, max_abs=0.046875
production_forward grad[0] vs paper_forward: mean_abs=0.009018725715577602, max_abs=0.40625, mean_rel=0.07653272897005081, max_rel=89.16954040527344, norm_rel=0.020885031670331955, ref_abs_avg=0.4666064381599426, test_abs_avg=0.46662378311157227
production_forward grad[1] vs paper_forward: mean_abs=7.6883111000061035, max_abs=64.0, mean_rel=0.1533886194229126, max_rel=252.47593688964844, norm_rel=0.021499495953321457, ref_abs_avg=321.979248046875, test_abs_avg=321.9114074707031
production_forward grad[2] vs paper_forward: mean_abs=1.3534846305847168, max_abs=4.5, mean_rel=0.11235209554433823, max_rel=13.890604019165039, norm_rel=0.024569587782025337, ref_abs_avg=52.61597442626953, test_abs_avg=52.69371795654297
production_forward grad[3] vs paper_forward: mean_abs=1.651113748550415, max_abs=12.0, mean_rel=0.16813357174396515, max_rel=3262.03076171875, norm_rel=0.02456614002585411, ref_abs_avg=67.55751037597656, test_abs_avg=67.56929016113281
production_forward grad[4] vs paper_forward: mean_abs=1.6044175624847412, max_abs=11.25, mean_rel=0.18019258975982666, max_rel=2387.820068359375, norm_rel=0.024239279329776764, ref_abs_avg=66.50546264648438, test_abs_avg=66.50694274902344
production_forward grad[5] vs paper_forward: mean_abs=1.1310901641845703, max_abs=5.25, mean_rel=0.08904597163200378, max_rel=7.085958003997803, norm_rel=0.02478969842195511, ref_abs_avg=46.634037017822266, test_abs_avg=46.59540557861328
production_forward grad[6] vs paper_forward: mean_abs=1.444562315940857, max_abs=9.25, mean_rel=0.16219091415405273, max_rel=3545.083251953125, norm_rel=0.024223417043685913, ref_abs_avg=59.95556640625, test_abs_avg=59.960113525390625
production_forward grad[7] vs paper_forward: mean_abs=1.4042105674743652, max_abs=8.25, mean_rel=0.16621050238609314, max_rel=934.4742431640625, norm_rel=0.02393556758761406, ref_abs_avg=58.878936767578125, test_abs_avg=58.87932586669922
production_forward grad[8] vs paper_forward: mean_abs=1.051309585571289, max_abs=4.6875, mean_rel=0.0816490575671196, max_rel=4.063589096069336, norm_rel=0.02321922779083252, ref_abs_avg=46.277183532714844, test_abs_avg=46.30421829223633
production_forward grad[9] vs paper_forward: mean_abs=1.310922622680664, max_abs=8.0, mean_rel=0.16834750771522522, max_rel=3319.919921875, norm_rel=0.024027938023209572, ref_abs_avg=54.84776306152344, test_abs_avg=54.854461669921875
production_forward grad[10] vs paper_forward: mean_abs=1.2798280715942383, max_abs=8.0, mean_rel=0.16377538442611694, max_rel=1583.2384033203125, norm_rel=0.02380768209695816, ref_abs_avg=54.05445861816406, test_abs_avg=54.05792236328125
production_forward grad[11] vs paper_forward: mean_abs=0.9462051391601562, max_abs=3.59375, mean_rel=0.0782119557261467, max_rel=3.5614097118377686, norm_rel=0.022688765078783035, ref_abs_avg=42.40625, test_abs_avg=42.391448974609375
production_forward grad[12] vs paper_forward: mean_abs=1.2093923091888428, max_abs=7.5, mean_rel=0.15975448489189148, max_rel=1371.953125, norm_rel=0.023826908320188522, ref_abs_avg=50.956077575683594, test_abs_avg=50.956085205078125
production_forward grad[13] vs paper_forward: mean_abs=1.1826002597808838, max_abs=7.5, mean_rel=0.15652114152908325, max_rel=951.7760009765625, norm_rel=0.0235787071287632, ref_abs_avg=50.353614807128906, test_abs_avg=50.35969543457031
production_forward grad[14] vs paper_forward: mean_abs=0.926976203918457, max_abs=3.25, mean_rel=0.11022976785898209, max_rel=8.152307510375977, norm_rel=0.023059872910380363, ref_abs_avg=40.940826416015625, test_abs_avg=40.92821502685547
production_forward grad[15] vs paper_forward: mean_abs=1.128537654876709, max_abs=7.0, mean_rel=0.15476979315280914, max_rel=775.7255859375, norm_rel=0.02360176481306553, ref_abs_avg=48.069549560546875, test_abs_avg=48.07030487060547
production_forward grad[16] vs paper_forward: mean_abs=1.1028763055801392, max_abs=6.5, mean_rel=0.16796188056468964, max_rel=1010.2222900390625, norm_rel=0.02335871197283268, ref_abs_avg=47.431190490722656, test_abs_avg=47.42724609375
production_forward grad[17] vs paper_forward: mean_abs=0.8346624374389648, max_abs=3.1875, mean_rel=0.12447378039360046, max_rel=12.172797203063965, norm_rel=0.023185310885310173, ref_abs_avg=35.8472900390625, test_abs_avg=35.91075897216797
production_forward grad[18] vs paper_forward: mean_abs=1.0579055547714233, max_abs=7.0, mean_rel=0.16014757752418518, max_rel=1091.7418212890625, norm_rel=0.023536590859293938, ref_abs_avg=45.175270080566406, test_abs_avg=45.178123474121094
production_forward grad[19] vs paper_forward: mean_abs=1.0364539623260498, max_abs=6.5, mean_rel=0.15290844440460205, max_rel=459.76556396484375, norm_rel=0.023386523127555847, ref_abs_avg=44.57703399658203, test_abs_avg=44.57962417602539
production_forward grad[20] vs paper_forward: mean_abs=0.8432207107543945, max_abs=3.0, mean_rel=0.08909648656845093, max_rel=3.6284162998199463, norm_rel=0.024497469887137413, ref_abs_avg=34.3343620300293, test_abs_avg=34.33009338378906
production_forward grad[21] vs paper_forward: mean_abs=1.0023221969604492, max_abs=6.25, mean_rel=0.16289271414279938, max_rel=900.0052490234375, norm_rel=0.02329923026263714, ref_abs_avg=43.193931579589844, test_abs_avg=43.19474792480469
production_forward grad[22] vs paper_forward: mean_abs=0.9757983088493347, max_abs=6.0, mean_rel=0.14989589154720306, max_rel=839.8678588867188, norm_rel=0.02307927794754505, ref_abs_avg=42.49159622192383, test_abs_avg=42.490257263183594
production_forward grad[23] vs paper_forward: mean_abs=0.7695178985595703, max_abs=3.0, mean_rel=0.06614388525485992, max_rel=4.19492244720459, norm_rel=0.02238183096051216, ref_abs_avg=34.74391174316406, test_abs_avg=34.79534149169922
production_forward grad[24] vs paper_forward: mean_abs=0.9556578397750854, max_abs=6.5, mean_rel=0.15192130208015442, max_rel=1136.1685791015625, norm_rel=0.023159051313996315, ref_abs_avg=41.42169952392578, test_abs_avg=41.420860290527344
production_forward grad[25] vs paper_forward: mean_abs=0.9308003187179565, max_abs=6.5, mean_rel=0.1446104794740677, max_rel=1100.9014892578125, norm_rel=0.02296549640595913, ref_abs_avg=40.746192932128906, test_abs_avg=40.7430419921875
production_forward grad[26] vs paper_forward: mean_abs=0.8718414306640625, max_abs=3.25, mean_rel=0.11599496752023697, max_rel=12.25355052947998, norm_rel=0.023401489481329918, ref_abs_avg=37.5480842590332, test_abs_avg=37.50065231323242
production_forward grad[27] vs paper_forward: mean_abs=1.1216670274734497, max_abs=7.5, mean_rel=0.17164894938468933, max_rel=1688.1278076171875, norm_rel=0.025295279920101166, ref_abs_avg=44.509796142578125, test_abs_avg=44.50927734375
production_forward grad[28] vs paper_forward: mean_abs=1.0950837135314941, max_abs=7.0, mean_rel=0.17812520265579224, max_rel=1278.357177734375, norm_rel=0.025157546624541283, ref_abs_avg=43.68766784667969, test_abs_avg=43.693443298339844
production_forward grad[29] vs paper_forward: mean_abs=0.7921478748321533, max_abs=3.25, mean_rel=0.1250024437904358, max_rel=18.220937728881836, norm_rel=0.02659795619547367, ref_abs_avg=30.555490493774414, test_abs_avg=30.581756591796875
production_forward grad[30] vs paper_forward: mean_abs=1.0304265022277832, max_abs=6.5, mean_rel=0.17580871284008026, max_rel=1874.852783203125, norm_rel=0.02564314194023609, ref_abs_avg=40.348854064941406, test_abs_avg=40.34967803955078
production_forward grad[31] vs paper_forward: mean_abs=1.011915922164917, max_abs=6.0625, mean_rel=0.15098637342453003, max_rel=416.6002197265625, norm_rel=0.02539646439254284, ref_abs_avg=39.981842041015625, test_abs_avg=39.98600769042969
production_forward grad[32] vs paper_forward: mean_abs=0.7786140441894531, max_abs=3.0, mean_rel=0.11629191040992737, max_rel=6.848598957061768, norm_rel=0.025175072252750397, ref_abs_avg=31.24591064453125, test_abs_avg=31.20025634765625
production_forward grad[33] vs paper_forward: mean_abs=0.9635522365570068, max_abs=7.0, mean_rel=0.17529414594173431, max_rel=1601.5953369140625, norm_rel=0.025516360998153687, ref_abs_avg=37.86491394042969, test_abs_avg=37.86497497558594
production_forward grad[34] vs paper_forward: mean_abs=0.9416854381561279, max_abs=6.0, mean_rel=0.18537166714668274, max_rel=2007.55224609375, norm_rel=0.025317329913377762, ref_abs_avg=37.34291458129883, test_abs_avg=37.346519470214844
production_forward grad[35] vs paper_forward: mean_abs=0.7464025020599365, max_abs=3.375, mean_rel=0.23036092519760132, max_rel=79.77733612060547, norm_rel=0.02667916938662529, ref_abs_avg=29.283279418945312, test_abs_avg=29.294208526611328
production_forward grad[36] vs paper_forward: mean_abs=0.8995441794395447, max_abs=5.5, mean_rel=0.16618214547634125, max_rel=2128.137939453125, norm_rel=0.02510991506278515, ref_abs_avg=35.89112854003906, test_abs_avg=35.88971710205078
production_forward grad[37] vs paper_forward: mean_abs=0.8798415064811707, max_abs=5.25, mean_rel=0.16491371393203735, max_rel=1635.995849609375, norm_rel=0.024880781769752502, ref_abs_avg=35.40528869628906, test_abs_avg=35.40654373168945
production_forward grad[38] vs paper_forward: mean_abs=0.6571855545043945, max_abs=2.9375, mean_rel=0.12779881060123444, max_rel=9.984554290771484, norm_rel=0.02525174990296364, ref_abs_avg=26.91118621826172, test_abs_avg=26.844707489013672
production_forward grad[39] vs paper_forward: mean_abs=0.8438758254051208, max_abs=5.388671875, mean_rel=0.16740256547927856, max_rel=1550.312255859375, norm_rel=0.024843018501996994, ref_abs_avg=34.03230285644531, test_abs_avg=34.03377914428711
production_forward grad[40] vs paper_forward: mean_abs=0.8295835256576538, max_abs=5.5, mean_rel=0.15662989020347595, max_rel=578.2865600585938, norm_rel=0.024694636464118958, ref_abs_avg=33.71221923828125, test_abs_avg=33.7104606628418
production_forward grad[41] vs paper_forward: mean_abs=0.6311523914337158, max_abs=2.7109375, mean_rel=0.5202684998512268, max_rel=210.7470703125, norm_rel=0.024646049365401268, ref_abs_avg=26.388696670532227, test_abs_avg=26.407154083251953
production_forward grad[42] vs paper_forward: mean_abs=0.794550895690918, max_abs=6.5, mean_rel=0.16306376457214355, max_rel=1640.6134033203125, norm_rel=0.024751810356974602, ref_abs_avg=32.190223693847656, test_abs_avg=32.192081451416016
production_forward grad[43] vs paper_forward: mean_abs=0.7875984907150269, max_abs=4.75, mean_rel=0.1649836003780365, max_rel=637.0800170898438, norm_rel=0.024801481515169144, ref_abs_avg=31.867464065551758, test_abs_avg=31.86838722229004
production_forward grad[44] vs paper_forward: mean_abs=0.6119937896728516, max_abs=2.625, mean_rel=0.12222690880298615, max_rel=19.146467208862305, norm_rel=0.023129787296056747, ref_abs_avg=26.447664260864258, test_abs_avg=26.429988861083984
production_forward grad[45] vs paper_forward: mean_abs=0.7611609101295471, max_abs=4.5, mean_rel=0.16131767630577087, max_rel=760.525390625, norm_rel=0.024430986493825912, ref_abs_avg=31.216625213623047, test_abs_avg=31.218971252441406
production_forward grad[46] vs paper_forward: mean_abs=0.744926393032074, max_abs=5.0, mean_rel=0.16786590218544006, max_rel=1026.874755859375, norm_rel=0.0243960153311491, ref_abs_avg=30.63884735107422, test_abs_avg=30.64288330078125
production_forward grad[47] vs paper_forward: mean_abs=0.5678725242614746, max_abs=2.375, mean_rel=0.10763295739889145, max_rel=8.708515167236328, norm_rel=0.024024834856390953, ref_abs_avg=23.377899169921875, test_abs_avg=23.361495971679688
production_forward grad[48] vs paper_forward: mean_abs=0.7262421250343323, max_abs=5.0, mean_rel=0.16540849208831787, max_rel=852.6757202148438, norm_rel=0.02407725714147091, ref_abs_avg=30.169517517089844, test_abs_avg=30.17084312438965
production_forward grad[49] vs paper_forward: mean_abs=0.7140703201293945, max_abs=4.5625, mean_rel=0.1636471152305603, max_rel=1184.126708984375, norm_rel=0.023853279650211334, ref_abs_avg=29.9547119140625, test_abs_avg=29.957027435302734
production_forward grad[50] vs paper_forward: mean_abs=0.6925976872444153, max_abs=2.40625, mean_rel=0.0923134833574295, max_rel=9.181962013244629, norm_rel=0.02545522153377533, ref_abs_avg=27.129663467407227, test_abs_avg=27.10671615600586
production_forward grad[51] vs paper_forward: mean_abs=0.8257571458816528, max_abs=6.0, mean_rel=0.1705995798110962, max_rel=1305.1051025390625, norm_rel=0.025687340646982193, ref_abs_avg=32.287208557128906, test_abs_avg=32.289188385009766
production_forward grad[52] vs paper_forward: mean_abs=0.8014971613883972, max_abs=5.5, mean_rel=0.16109338402748108, max_rel=965.5984497070312, norm_rel=0.025262344628572464, ref_abs_avg=31.848388671875, test_abs_avg=31.844953536987305
production_forward grad[53] vs paper_forward: mean_abs=0.6411721706390381, max_abs=2.75, mean_rel=0.11528624594211578, max_rel=18.363021850585938, norm_rel=0.02540162019431591, ref_abs_avg=25.32941436767578, test_abs_avg=25.300037384033203
production_forward grad[54] vs paper_forward: mean_abs=0.7510112524032593, max_abs=5.0, mean_rel=0.17201685905456543, max_rel=1148.0093994140625, norm_rel=0.025126444175839424, ref_abs_avg=29.88825035095215, test_abs_avg=29.889991760253906
production_forward grad[55] vs paper_forward: mean_abs=0.7413041591644287, max_abs=5.0, mean_rel=0.17391851544380188, max_rel=559.605224609375, norm_rel=0.025173280388116837, ref_abs_avg=29.42934799194336, test_abs_avg=29.433197021484375
production_forward grad[56] vs paper_forward: mean_abs=0.552238941192627, max_abs=2.5, mean_rel=0.20052176713943481, max_rel=32.9840202331543, norm_rel=0.02282608672976494, ref_abs_avg=24.35857582092285, test_abs_avg=24.309303283691406
production_forward grad[57] vs paper_forward: mean_abs=0.696236252784729, max_abs=4.75, mean_rel=0.16452345252037048, max_rel=1405.5802001953125, norm_rel=0.02467181906104088, ref_abs_avg=28.237863540649414, test_abs_avg=28.240169525146484
production_forward grad[58] vs paper_forward: mean_abs=0.6846718192100525, max_abs=5.0, mean_rel=0.16135136783123016, max_rel=658.6934204101562, norm_rel=0.02449425496160984, ref_abs_avg=27.933534622192383, test_abs_avg=27.93576431274414
production_forward grad[59] vs paper_forward: mean_abs=0.5298285484313965, max_abs=2.0625, mean_rel=0.14044871926307678, max_rel=21.6568546295166, norm_rel=0.024611910805106163, ref_abs_avg=21.021013259887695, test_abs_avg=21.025882720947266
production_forward grad[60] vs paper_forward: mean_abs=0.6544067859649658, max_abs=5.0, mean_rel=0.15785884857177734, max_rel=1193.3597412109375, norm_rel=0.024042097851634026, ref_abs_avg=27.231056213378906, test_abs_avg=27.231731414794922
production_forward grad[61] vs paper_forward: mean_abs=0.6393285393714905, max_abs=4.25, mean_rel=0.15807147324085236, max_rel=578.7509155273438, norm_rel=0.024331454187631607, ref_abs_avg=26.322437286376953, test_abs_avg=26.322307586669922
production_forward grad[62] vs paper_forward: mean_abs=0.5273303985595703, max_abs=2.0, mean_rel=0.09670573472976685, max_rel=6.019129276275635, norm_rel=0.024572964757680893, ref_abs_avg=21.36260414123535, test_abs_avg=21.365036010742188
production_forward grad[63] vs paper_forward: mean_abs=0.6121429800987244, max_abs=4.625, mean_rel=0.15543460845947266, max_rel=1048.30419921875, norm_rel=0.02393260784447193, ref_abs_avg=25.568246841430664, test_abs_avg=25.56998062133789
production_forward grad[64] vs paper_forward: mean_abs=0.6023268699645996, max_abs=4.5, mean_rel=0.15491822361946106, max_rel=867.7451171875, norm_rel=0.023791620507836342, ref_abs_avg=25.33572006225586, test_abs_avg=25.334163665771484
production_forward grad[65] vs paper_forward: mean_abs=0.4960792064666748, max_abs=2.0, mean_rel=0.3120906352996826, max_rel=78.06546020507812, norm_rel=0.02352505922317505, ref_abs_avg=21.371051788330078, test_abs_avg=21.41416358947754
production_forward grad[66] vs paper_forward: mean_abs=0.5824577808380127, max_abs=4.53125, mean_rel=0.1542447805404663, max_rel=1258.4649658203125, norm_rel=0.023206980898976326, ref_abs_avg=25.032135009765625, test_abs_avg=25.03315544128418
production_forward grad[67] vs paper_forward: mean_abs=0.5640071630477905, max_abs=4.25, mean_rel=0.1432076394557953, max_rel=1266.9251708984375, norm_rel=0.022869044914841652, ref_abs_avg=24.62250518798828, test_abs_avg=24.627338409423828
production_forward grad[68] vs paper_forward: mean_abs=0.44683074951171875, max_abs=1.90625, mean_rel=0.0874103233218193, max_rel=10.653280258178711, norm_rel=0.020639309659600258, ref_abs_avg=21.852203369140625, test_abs_avg=21.828969955444336
production_forward grad[69] vs paper_forward: mean_abs=0.5516817569732666, max_abs=4.5, mean_rel=0.14969158172607422, max_rel=1018.705078125, norm_rel=0.022985078394412994, ref_abs_avg=23.967864990234375, test_abs_avg=23.967498779296875
production_forward grad[70] vs paper_forward: mean_abs=0.5423461198806763, max_abs=4.0, mean_rel=0.15264122188091278, max_rel=1049.5908203125, norm_rel=0.02292010933160782, ref_abs_avg=23.616294860839844, test_abs_avg=23.618885040283203
production_forward grad[71] vs paper_forward: mean_abs=0.3982011079788208, max_abs=1.625, mean_rel=0.21712787449359894, max_rel=76.2305679321289, norm_rel=0.02127140946686268, ref_abs_avg=19.17057991027832, test_abs_avg=19.169282913208008
production_forward grad[72] vs paper_forward: mean_abs=0.5278517007827759, max_abs=4.5, mean_rel=0.14722967147827148, max_rel=712.499267578125, norm_rel=0.022393004968762398, ref_abs_avg=23.540348052978516, test_abs_avg=23.541677474975586
production_forward grad[73] vs paper_forward: mean_abs=0.5110368728637695, max_abs=4.0, mean_rel=0.13874974846839905, max_rel=526.5842895507812, norm_rel=0.02199828252196312, ref_abs_avg=23.199155807495117, test_abs_avg=23.19499969482422
production_forward grad[74] vs paper_forward: mean_abs=0.4948740005493164, max_abs=2.01953125, mean_rel=0.09068777412176132, max_rel=4.8202433586120605, norm_rel=0.024984685704112053, ref_abs_avg=20.098783493041992, test_abs_avg=20.095561981201172
production_forward grad[75] vs paper_forward: mean_abs=0.5883982181549072, max_abs=4.5, mean_rel=0.15641948580741882, max_rel=873.1707763671875, norm_rel=0.024445760995149612, ref_abs_avg=24.092079162597656, test_abs_avg=24.094144821166992
production_forward grad[76] vs paper_forward: mean_abs=0.5731430053710938, max_abs=4.25, mean_rel=0.14711715281009674, max_rel=546.763427734375, norm_rel=0.02393747679889202, ref_abs_avg=23.94029998779297, test_abs_avg=23.939128875732422
production_forward grad[77] vs paper_forward: mean_abs=0.4385547637939453, max_abs=1.75, mean_rel=0.10767306387424469, max_rel=11.452925682067871, norm_rel=0.02410929463803768, ref_abs_avg=18.296144485473633, test_abs_avg=18.311429977416992
production_forward grad[78] vs paper_forward: mean_abs=0.5321985483169556, max_abs=3.875, mean_rel=0.1464947611093521, max_rel=941.99267578125, norm_rel=0.023777863010764122, ref_abs_avg=22.415796279907227, test_abs_avg=22.416242599487305
production_forward grad[79] vs paper_forward: mean_abs=0.5291585326194763, max_abs=4.5, mean_rel=0.15416067838668823, max_rel=1138.9124755859375, norm_rel=0.023557379841804504, ref_abs_avg=22.482563018798828, test_abs_avg=22.480867385864258
production_forward grad[80] vs paper_forward: mean_abs=0.4064614772796631, max_abs=1.5625, mean_rel=0.12064800411462784, max_rel=22.962318420410156, norm_rel=0.022472361102700233, ref_abs_avg=18.39719009399414, test_abs_avg=18.387840270996094
production_forward grad[81] vs paper_forward: mean_abs=0.49913614988327026, max_abs=4.0, mean_rel=0.1551058441400528, max_rel=1064.3377685546875, norm_rel=0.022965483367443085, ref_abs_avg=21.756053924560547, test_abs_avg=21.75756072998047
production_forward grad[82] vs paper_forward: mean_abs=0.48965609073638916, max_abs=4.0, mean_rel=0.1472814530134201, max_rel=730.5680541992188, norm_rel=0.022875094786286354, ref_abs_avg=21.44923973083496, test_abs_avg=21.444181442260742
production_forward grad[83] vs paper_forward: mean_abs=0.38070958852767944, max_abs=1.560546875, mean_rel=0.2387218177318573, max_rel=56.1719970703125, norm_rel=0.021892402321100235, ref_abs_avg=17.429889678955078, test_abs_avg=17.431814193725586
production_forward grad[84] vs paper_forward: mean_abs=0.4596624970436096, max_abs=4.0, mean_rel=0.13734447956085205, max_rel=593.8145751953125, norm_rel=0.022328315302729607, ref_abs_avg=20.621240615844727, test_abs_avg=20.622236251831055
production_forward grad[85] vs paper_forward: mean_abs=0.4621378779411316, max_abs=4.0, mean_rel=0.1544138789176941, max_rel=980.604736328125, norm_rel=0.02233043685555458, ref_abs_avg=20.715782165527344, test_abs_avg=20.71569061279297
production_forward grad[86] vs paper_forward: mean_abs=0.3410038948059082, max_abs=1.5, mean_rel=0.11380624771118164, max_rel=10.218619346618652, norm_rel=0.020597685128450394, ref_abs_avg=16.724586486816406, test_abs_avg=16.718461990356445
production_forward grad[87] vs paper_forward: mean_abs=0.434830904006958, max_abs=4.25, mean_rel=0.13229629397392273, max_rel=726.6380004882812, norm_rel=0.021679922938346863, ref_abs_avg=20.161094665527344, test_abs_avg=20.1612548828125
production_forward grad[88] vs paper_forward: mean_abs=0.42806416749954224, max_abs=3.875, mean_rel=0.14267723262310028, max_rel=1429.859619140625, norm_rel=0.021768005564808846, ref_abs_avg=19.801250457763672, test_abs_avg=19.799915313720703
production_forward grad[89] vs paper_forward: mean_abs=0.3405466079711914, max_abs=1.25, mean_rel=0.05680720880627632, max_rel=4.109104156494141, norm_rel=0.020510785281658173, ref_abs_avg=16.977779388427734, test_abs_avg=16.99998664855957
production_forward grad[90] vs paper_forward: mean_abs=0.4037334620952606, max_abs=4.25, mean_rel=0.13121354579925537, max_rel=1120.843017578125, norm_rel=0.02111000195145607, ref_abs_avg=19.248943328857422, test_abs_avg=19.248737335205078
production_forward grad[91] vs paper_forward: mean_abs=0.39326971769332886, max_abs=4.0, mean_rel=0.13174059987068176, max_rel=498.28472900390625, norm_rel=0.020901072770357132, ref_abs_avg=18.95298957824707, test_abs_avg=18.958879470825195
production_forward grad[92] vs paper_forward: mean_abs=0.3275771141052246, max_abs=1.5, mean_rel=0.08920446783304214, max_rel=5.5637712478637695, norm_rel=0.022139467298984528, ref_abs_avg=15.001148223876953, test_abs_avg=15.016141891479492
production_forward grad[93] vs paper_forward: mean_abs=0.38387683033943176, max_abs=3.75, mean_rel=0.12468347698450089, max_rel=532.1742553710938, norm_rel=0.020594701170921326, ref_abs_avg=18.81970977783203, test_abs_avg=18.820043563842773
production_forward grad[94] vs paper_forward: mean_abs=0.3720816373825073, max_abs=4.25, mean_rel=0.1297030746936798, max_rel=429.1856994628906, norm_rel=0.020411107689142227, ref_abs_avg=18.48053550720215, test_abs_avg=18.481468200683594
production_forward grad[95] vs paper_forward: mean_abs=0.29590272903442383, max_abs=1.125, mean_rel=0.06375239789485931, max_rel=2.9064972400665283, norm_rel=0.019579140469431877, ref_abs_avg=15.437275886535645, test_abs_avg=15.425424575805664
production_forward grad[96] vs paper_forward: mean_abs=0.36201637983322144, max_abs=4.5, mean_rel=0.12120760232210159, max_rel=728.4627685546875, norm_rel=0.020232100039720535, ref_abs_avg=18.14980697631836, test_abs_avg=18.150318145751953
production_forward grad[97] vs paper_forward: mean_abs=0.35453906655311584, max_abs=4.0, mean_rel=0.11826055496931076, max_rel=495.4944152832031, norm_rel=0.02013898640871048, ref_abs_avg=17.874176025390625, test_abs_avg=17.884004592895508
production_forward2 vs paper_forward output: mean_abs=0.0016783958999440074, max_abs=0.046875
production_forward2 grad[0] vs paper_forward: mean_abs=0.009022878482937813, max_abs=0.453125, mean_rel=0.07647556066513062, max_rel=98.8233642578125, norm_rel=0.02089221030473709, ref_abs_avg=0.4666064381599426, test_abs_avg=0.4666116237640381
production_forward2 grad[1] vs paper_forward: mean_abs=7.637601375579834, max_abs=64.0, mean_rel=0.16852280497550964, max_rel=233.5771484375, norm_rel=0.021383026614785194, ref_abs_avg=321.979248046875, test_abs_avg=321.88543701171875
production_forward2 grad[2] vs paper_forward: mean_abs=1.3273921012878418, max_abs=5.0, mean_rel=0.15586291253566742, max_rel=37.54206848144531, norm_rel=0.024559704586863518, ref_abs_avg=52.61597442626953, test_abs_avg=52.73335266113281
production_forward2 grad[3] vs paper_forward: mean_abs=1.650674819946289, max_abs=11.0, mean_rel=0.16956213116645813, max_rel=1975.671630859375, norm_rel=0.0245516337454319, ref_abs_avg=67.55751037597656, test_abs_avg=67.5634536743164
production_forward2 grad[4] vs paper_forward: mean_abs=1.6039894819259644, max_abs=11.0, mean_rel=0.18803924322128296, max_rel=3368.163330078125, norm_rel=0.0242081917822361, ref_abs_avg=66.50546264648438, test_abs_avg=66.50218963623047
production_forward2 grad[5] vs paper_forward: mean_abs=1.2065153121948242, max_abs=5.625, mean_rel=0.08771176636219025, max_rel=7.187005519866943, norm_rel=0.02656141296029091, ref_abs_avg=46.634037017822266, test_abs_avg=46.56190872192383
production_forward2 grad[6] vs paper_forward: mean_abs=1.4459229707717896, max_abs=10.0, mean_rel=0.16300800442695618, max_rel=3081.478759765625, norm_rel=0.024254359304904938, ref_abs_avg=59.95556640625, test_abs_avg=59.96099090576172
production_forward2 grad[7] vs paper_forward: mean_abs=1.4039195775985718, max_abs=8.9375, mean_rel=0.1707601398229599, max_rel=1314.453857421875, norm_rel=0.023949338123202324, ref_abs_avg=58.878936767578125, test_abs_avg=58.880210876464844
production_forward2 grad[8] vs paper_forward: mean_abs=1.0895235538482666, max_abs=4.59375, mean_rel=0.08019774407148361, max_rel=3.2284984588623047, norm_rel=0.023628048598766327, ref_abs_avg=46.277183532714844, test_abs_avg=46.319854736328125
production_forward2 grad[9] vs paper_forward: mean_abs=1.3160459995269775, max_abs=8.375, mean_rel=0.16807714104652405, max_rel=4110.41455078125, norm_rel=0.024107666686177254, ref_abs_avg=54.84776306152344, test_abs_avg=54.8524169921875
production_forward2 grad[10] vs paper_forward: mean_abs=1.2881252765655518, max_abs=7.5, mean_rel=0.16087891161441803, max_rel=875.8909912109375, norm_rel=0.023949632421135902, ref_abs_avg=54.05445861816406, test_abs_avg=54.05247497558594
production_forward2 grad[11] vs paper_forward: mean_abs=0.9330959320068359, max_abs=3.5390625, mean_rel=0.0797027051448822, max_rel=3.4518280029296875, norm_rel=0.02252994105219841, ref_abs_avg=42.40625, test_abs_avg=42.363006591796875
production_forward2 grad[12] vs paper_forward: mean_abs=1.213114619255066, max_abs=8.0, mean_rel=0.1555897295475006, max_rel=1529.449951171875, norm_rel=0.023908834904432297, ref_abs_avg=50.956077575683594, test_abs_avg=50.953887939453125
production_forward2 grad[13] vs paper_forward: mean_abs=1.187537431716919, max_abs=7.25, mean_rel=0.15388573706150055, max_rel=1355.2550048828125, norm_rel=0.02366741932928562, ref_abs_avg=50.353614807128906, test_abs_avg=50.358787536621094
production_forward2 grad[14] vs paper_forward: mean_abs=0.8831119537353516, max_abs=4.0, mean_rel=0.09822575747966766, max_rel=8.396937370300293, norm_rel=0.02227058820426464, ref_abs_avg=40.940826416015625, test_abs_avg=40.92082214355469
production_forward2 grad[15] vs paper_forward: mean_abs=1.1319836378097534, max_abs=7.0, mean_rel=0.15426012873649597, max_rel=1177.7008056640625, norm_rel=0.023676561191678047, ref_abs_avg=48.069549560546875, test_abs_avg=48.06861877441406
production_forward2 grad[16] vs paper_forward: mean_abs=1.1049052476882935, max_abs=8.0, mean_rel=0.1676202118396759, max_rel=1334.1990966796875, norm_rel=0.023400597274303436, ref_abs_avg=47.431190490722656, test_abs_avg=47.42731475830078
production_forward2 grad[17] vs paper_forward: mean_abs=0.8313636779785156, max_abs=3.25, mean_rel=0.13971573114395142, max_rel=21.431278228759766, norm_rel=0.02351388707756996, ref_abs_avg=35.8472900390625, test_abs_avg=35.89466094970703
production_forward2 grad[18] vs paper_forward: mean_abs=1.0609933137893677, max_abs=7.0, mean_rel=0.16227327287197113, max_rel=1517.37060546875, norm_rel=0.02359084226191044, ref_abs_avg=45.175270080566406, test_abs_avg=45.178436279296875
production_forward2 grad[19] vs paper_forward: mean_abs=1.0383329391479492, max_abs=7.0, mean_rel=0.15659603476524353, max_rel=690.9895629882812, norm_rel=0.02342669665813446, ref_abs_avg=44.57703399658203, test_abs_avg=44.58129119873047
production_forward2 grad[20] vs paper_forward: mean_abs=0.8701114654541016, max_abs=2.78125, mean_rel=0.09509895741939545, max_rel=3.8127992153167725, norm_rel=0.02476966753602028, ref_abs_avg=34.3343620300293, test_abs_avg=34.32598876953125
production_forward2 grad[21] vs paper_forward: mean_abs=1.0045336484909058, max_abs=7.0, mean_rel=0.16166824102401733, max_rel=1042.3682861328125, norm_rel=0.023341644555330276, ref_abs_avg=43.193931579589844, test_abs_avg=43.19430923461914
production_forward2 grad[22] vs paper_forward: mean_abs=0.9776444435119629, max_abs=6.0, mean_rel=0.15298110246658325, max_rel=854.2911376953125, norm_rel=0.02312697283923626, ref_abs_avg=42.49159622192383, test_abs_avg=42.488853454589844
production_forward2 grad[23] vs paper_forward: mean_abs=0.7487277984619141, max_abs=3.0, mean_rel=0.06580226868391037, max_rel=3.905117988586426, norm_rel=0.022068332880735397, ref_abs_avg=34.74391174316406, test_abs_avg=34.80162048339844
production_forward2 grad[24] vs paper_forward: mean_abs=0.9590047001838684, max_abs=6.25, mean_rel=0.15682749450206757, max_rel=1020.086181640625, norm_rel=0.02323184162378311, ref_abs_avg=41.42169952392578, test_abs_avg=41.4204216003418
production_forward2 grad[25] vs paper_forward: mean_abs=0.934503436088562, max_abs=6.0, mean_rel=0.14711779356002808, max_rel=1083.25, norm_rel=0.023061463609337807, ref_abs_avg=40.746192932128906, test_abs_avg=40.74182891845703
production_forward2 grad[26] vs paper_forward: mean_abs=0.8794784545898438, max_abs=3.5, mean_rel=0.12194141000509262, max_rel=14.252723693847656, norm_rel=0.02359912544488907, ref_abs_avg=37.5480842590332, test_abs_avg=37.50094223022461
production_forward2 grad[27] vs paper_forward: mean_abs=1.118035078048706, max_abs=7.875, mean_rel=0.16954274475574493, max_rel=1633.6988525390625, norm_rel=0.025222063064575195, ref_abs_avg=44.509796142578125, test_abs_avg=44.509033203125
production_forward2 grad[28] vs paper_forward: mean_abs=1.0924031734466553, max_abs=8.0, mean_rel=0.1718573421239853, max_rel=1163.728515625, norm_rel=0.0251141507178545, ref_abs_avg=43.68766784667969, test_abs_avg=43.69081497192383
production_forward2 grad[29] vs paper_forward: mean_abs=0.8237011432647705, max_abs=3.0, mean_rel=0.11025052517652512, max_rel=10.785717010498047, norm_rel=0.02684602327644825, ref_abs_avg=30.555490493774414, test_abs_avg=30.541122436523438
production_forward2 grad[30] vs paper_forward: mean_abs=1.031160593032837, max_abs=6.25, mean_rel=0.17489910125732422, max_rel=1397.992431640625, norm_rel=0.025667648762464523, ref_abs_avg=40.348854064941406, test_abs_avg=40.34873962402344
production_forward2 grad[31] vs paper_forward: mean_abs=1.0118799209594727, max_abs=6.25, mean_rel=0.15053272247314453, max_rel=425.8769226074219, norm_rel=0.025371572002768517, ref_abs_avg=39.981842041015625, test_abs_avg=39.98600769042969
production_forward2 grad[32] vs paper_forward: mean_abs=0.7794380187988281, max_abs=2.75, mean_rel=0.11358664184808731, max_rel=8.521538734436035, norm_rel=0.025040702894330025, ref_abs_avg=31.24591064453125, test_abs_avg=31.184127807617188
production_forward2 grad[33] vs paper_forward: mean_abs=0.9657036066055298, max_abs=8.0, mean_rel=0.1721792221069336, max_rel=1727.8739013671875, norm_rel=0.02558215893805027, ref_abs_avg=37.86491394042969, test_abs_avg=37.86468505859375
production_forward2 grad[34] vs paper_forward: mean_abs=0.943932831287384, max_abs=6.0, mean_rel=0.18457597494125366, max_rel=1378.3065185546875, norm_rel=0.025376204401254654, ref_abs_avg=37.34291458129883, test_abs_avg=37.34376525878906
production_forward2 grad[35] vs paper_forward: mean_abs=0.7666852474212646, max_abs=3.375, mean_rel=0.21387556195259094, max_rel=70.9372329711914, norm_rel=0.027035262435674667, ref_abs_avg=29.283279418945312, test_abs_avg=29.298538208007812
production_forward2 grad[36] vs paper_forward: mean_abs=0.9014501571655273, max_abs=6.0, mean_rel=0.16231101751327515, max_rel=1687.5274658203125, norm_rel=0.025161849334836006, ref_abs_avg=35.89112854003906, test_abs_avg=35.88783645629883
production_forward2 grad[37] vs paper_forward: mean_abs=0.8790780305862427, max_abs=5.0, mean_rel=0.165424644947052, max_rel=1405.5762939453125, norm_rel=0.024856174364686012, ref_abs_avg=35.40528869628906, test_abs_avg=35.40739440917969
production_forward2 grad[38] vs paper_forward: mean_abs=0.6651301383972168, max_abs=3.125, mean_rel=0.14395813643932343, max_rel=10.72542953491211, norm_rel=0.025157634168863297, ref_abs_avg=26.91118621826172, test_abs_avg=26.864896774291992
production_forward2 grad[39] vs paper_forward: mean_abs=0.8466267585754395, max_abs=5.75, mean_rel=0.16602067649364471, max_rel=1457.7410888671875, norm_rel=0.024911852553486824, ref_abs_avg=34.03230285644531, test_abs_avg=34.033363342285156
production_forward2 grad[40] vs paper_forward: mean_abs=0.8303477764129639, max_abs=5.75, mean_rel=0.1582484394311905, max_rel=734.4627075195312, norm_rel=0.024707535281777382, ref_abs_avg=33.71221923828125, test_abs_avg=33.70966339111328
production_forward2 grad[41] vs paper_forward: mean_abs=0.6192476749420166, max_abs=2.75, mean_rel=0.5168123841285706, max_rel=208.9982452392578, norm_rel=0.02446209080517292, ref_abs_avg=26.388696670532227, test_abs_avg=26.389921188354492
production_forward2 grad[42] vs paper_forward: mean_abs=0.7979806661605835, max_abs=6.5, mean_rel=0.16149622201919556, max_rel=1054.24462890625, norm_rel=0.024844421073794365, ref_abs_avg=32.190223693847656, test_abs_avg=32.191871643066406
production_forward2 grad[43] vs paper_forward: mean_abs=0.7892100811004639, max_abs=5.0, mean_rel=0.1681751012802124, max_rel=635.8423461914062, norm_rel=0.024845410138368607, ref_abs_avg=31.867464065551758, test_abs_avg=31.868165969848633
production_forward2 grad[44] vs paper_forward: mean_abs=0.6053023338317871, max_abs=2.375, mean_rel=0.10445468127727509, max_rel=11.023723602294922, norm_rel=0.02299227938055992, ref_abs_avg=26.447664260864258, test_abs_avg=26.43097496032715
production_forward2 grad[45] vs paper_forward: mean_abs=0.7625554800033569, max_abs=5.25, mean_rel=0.1636919379234314, max_rel=899.9296875, norm_rel=0.024478193372488022, ref_abs_avg=31.216625213623047, test_abs_avg=31.217763900756836
production_forward2 grad[46] vs paper_forward: mean_abs=0.7449090480804443, max_abs=4.75, mean_rel=0.17306077480316162, max_rel=909.4189453125, norm_rel=0.024382458999753, ref_abs_avg=30.63884735107422, test_abs_avg=30.64209747314453
production_forward2 grad[47] vs paper_forward: mean_abs=0.5625371932983398, max_abs=2.25, mean_rel=0.10149647295475006, max_rel=8.27422046661377, norm_rel=0.023602310568094254, ref_abs_avg=23.377899169921875, test_abs_avg=23.35555648803711
production_forward2 grad[48] vs paper_forward: mean_abs=0.7276812195777893, max_abs=5.0, mean_rel=0.1661090850830078, max_rel=1011.74072265625, norm_rel=0.02413012646138668, ref_abs_avg=30.169517517089844, test_abs_avg=30.16986656188965
production_forward2 grad[49] vs paper_forward: mean_abs=0.7144039273262024, max_abs=5.5, mean_rel=0.16143065690994263, max_rel=1222.9381103515625, norm_rel=0.023865122348070145, ref_abs_avg=29.9547119140625, test_abs_avg=29.957157135009766
production_forward2 grad[50] vs paper_forward: mean_abs=0.7198934555053711, max_abs=2.5, mean_rel=0.09925933182239532, max_rel=12.330655097961426, norm_rel=0.02590087242424488, ref_abs_avg=27.129663467407227, test_abs_avg=27.102283477783203
production_forward2 grad[51] vs paper_forward: mean_abs=0.8240229487419128, max_abs=5.5, mean_rel=0.16876022517681122, max_rel=1520.2373046875, norm_rel=0.025610076263546944, ref_abs_avg=32.287208557128906, test_abs_avg=32.289344787597656
production_forward2 grad[52] vs paper_forward: mean_abs=0.7981166839599609, max_abs=5.8125, mean_rel=0.15917223691940308, max_rel=893.5101318359375, norm_rel=0.025153350085020065, ref_abs_avg=31.848388671875, test_abs_avg=31.84695053100586
production_forward2 grad[53] vs paper_forward: mean_abs=0.6205824613571167, max_abs=2.875, mean_rel=0.09456942230463028, max_rel=10.764005661010742, norm_rel=0.024774642661213875, ref_abs_avg=25.32941436767578, test_abs_avg=25.288711547851562
production_forward2 grad[54] vs paper_forward: mean_abs=0.7503334283828735, max_abs=5.25, mean_rel=0.170192688703537, max_rel=1238.82177734375, norm_rel=0.02510838769376278, ref_abs_avg=29.88825035095215, test_abs_avg=29.889270782470703
production_forward2 grad[55] vs paper_forward: mean_abs=0.7405713796615601, max_abs=5.0, mean_rel=0.17277191579341888, max_rel=555.4939575195312, norm_rel=0.0251503624022007, ref_abs_avg=29.42934799194336, test_abs_avg=29.432754516601562
production_forward2 grad[56] vs paper_forward: mean_abs=0.5472589731216431, max_abs=2.0, mean_rel=0.22230280935764313, max_rel=41.5207405090332, norm_rel=0.02254045009613037, ref_abs_avg=24.35857582092285, test_abs_avg=24.32244300842285
production_forward2 grad[57] vs paper_forward: mean_abs=0.6961708664894104, max_abs=4.875, mean_rel=0.16708871722221375, max_rel=1577.373291015625, norm_rel=0.024674588814377785, ref_abs_avg=28.237863540649414, test_abs_avg=28.239744186401367
production_forward2 grad[58] vs paper_forward: mean_abs=0.6841784715652466, max_abs=5.0, mean_rel=0.16176065802574158, max_rel=791.6399536132812, norm_rel=0.02447827346622944, ref_abs_avg=27.933534622192383, test_abs_avg=27.935131072998047
production_forward2 grad[59] vs paper_forward: mean_abs=0.5205168724060059, max_abs=1.9375, mean_rel=0.12962964177131653, max_rel=22.077375411987305, norm_rel=0.024323496967554092, ref_abs_avg=21.021013259887695, test_abs_avg=21.032882690429688
production_forward2 grad[60] vs paper_forward: mean_abs=0.6543118953704834, max_abs=5.0, mean_rel=0.16099099814891815, max_rel=1235.6612548828125, norm_rel=0.024040432646870613, ref_abs_avg=27.231056213378906, test_abs_avg=27.23098373413086
production_forward2 grad[61] vs paper_forward: mean_abs=0.6406433582305908, max_abs=5.25, mean_rel=0.16086146235466003, max_rel=687.1514892578125, norm_rel=0.02437213435769081, ref_abs_avg=26.322437286376953, test_abs_avg=26.32175064086914
production_forward2 grad[62] vs paper_forward: mean_abs=0.5366077423095703, max_abs=2.25, mean_rel=0.09957358241081238, max_rel=6.101583003997803, norm_rel=0.024897174909710884, ref_abs_avg=21.36260414123535, test_abs_avg=21.3653564453125
production_forward2 grad[63] vs paper_forward: mean_abs=0.6127718687057495, max_abs=4.75, mean_rel=0.15513461828231812, max_rel=935.4883422851562, norm_rel=0.023955820128321648, ref_abs_avg=25.568246841430664, test_abs_avg=25.56995391845703
production_forward2 grad[64] vs paper_forward: mean_abs=0.6033756732940674, max_abs=4.25, mean_rel=0.15269014239311218, max_rel=764.7522583007812, norm_rel=0.023826435208320618, ref_abs_avg=25.33572006225586, test_abs_avg=25.334556579589844
production_forward2 grad[65] vs paper_forward: mean_abs=0.49687838554382324, max_abs=2.0, mean_rel=0.2677159607410431, max_rel=57.629024505615234, norm_rel=0.023457743227481842, ref_abs_avg=21.371051788330078, test_abs_avg=21.411754608154297
production_forward2 grad[66] vs paper_forward: mean_abs=0.5828847885131836, max_abs=4.75, mean_rel=0.15238486230373383, max_rel=1035.9345703125, norm_rel=0.023220552131533623, ref_abs_avg=25.032135009765625, test_abs_avg=25.032930374145508
production_forward2 grad[67] vs paper_forward: mean_abs=0.5655452013015747, max_abs=4.0, mean_rel=0.14317074418067932, max_rel=1071.0343017578125, norm_rel=0.022927816957235336, ref_abs_avg=24.62250518798828, test_abs_avg=24.627412796020508
production_forward2 grad[68] vs paper_forward: mean_abs=0.445864200592041, max_abs=1.8125, mean_rel=0.08467745780944824, max_rel=9.381246566772461, norm_rel=0.02056247554719448, ref_abs_avg=21.852203369140625, test_abs_avg=21.833833694458008
production_forward2 grad[69] vs paper_forward: mean_abs=0.5523216724395752, max_abs=4.5, mean_rel=0.14902594685554504, max_rel=959.5595703125, norm_rel=0.02301863580942154, ref_abs_avg=23.967864990234375, test_abs_avg=23.967653274536133
production_forward2 grad[70] vs paper_forward: mean_abs=0.5418680310249329, max_abs=4.0, mean_rel=0.1523999124765396, max_rel=1048.3355712890625, norm_rel=0.022919701412320137, ref_abs_avg=23.616294860839844, test_abs_avg=23.61882781982422
production_forward2 grad[71] vs paper_forward: mean_abs=0.4025360345840454, max_abs=1.625, mean_rel=0.2104186713695526, max_rel=70.98426055908203, norm_rel=0.021431559696793556, ref_abs_avg=19.17057991027832, test_abs_avg=19.16880989074707
production_forward2 grad[72] vs paper_forward: mean_abs=0.5281492471694946, max_abs=4.25, mean_rel=0.14615073800086975, max_rel=580.9986572265625, norm_rel=0.02240722067654133, ref_abs_avg=23.540348052978516, test_abs_avg=23.541732788085938
production_forward2 grad[73] vs paper_forward: mean_abs=0.5112648010253906, max_abs=4.0, mean_rel=0.14012622833251953, max_rel=730.9381713867188, norm_rel=0.022001923993229866, ref_abs_avg=23.199155807495117, test_abs_avg=23.19411849975586
production_forward2 grad[74] vs paper_forward: mean_abs=0.49236011505126953, max_abs=1.78125, mean_rel=0.08875587582588196, max_rel=3.791245460510254, norm_rel=0.024639811366796494, ref_abs_avg=20.098783493041992, test_abs_avg=20.088672637939453
production_forward2 grad[75] vs paper_forward: mean_abs=0.5858088135719299, max_abs=4.5, mean_rel=0.15729032456874847, max_rel=892.6189575195312, norm_rel=0.024332070723176003, ref_abs_avg=24.092079162597656, test_abs_avg=24.094451904296875
production_forward2 grad[76] vs paper_forward: mean_abs=0.5693521499633789, max_abs=4.25, mean_rel=0.14737150073051453, max_rel=611.4273071289062, norm_rel=0.023780375719070435, ref_abs_avg=23.94029998779297, test_abs_avg=23.93967056274414
production_forward2 grad[77] vs paper_forward: mean_abs=0.4293999671936035, max_abs=1.625, mean_rel=0.10859805345535278, max_rel=8.553153991699219, norm_rel=0.02371920272707939, ref_abs_avg=18.296144485473633, test_abs_avg=18.32198143005371
production_forward2 grad[78] vs paper_forward: mean_abs=0.5315108299255371, max_abs=4.0, mean_rel=0.14544618129730225, max_rel=799.9840698242188, norm_rel=0.02374044619500637, ref_abs_avg=22.415796279907227, test_abs_avg=22.416156768798828
production_forward2 grad[79] vs paper_forward: mean_abs=0.5284096002578735, max_abs=4.5, mean_rel=0.15434405207633972, max_rel=1009.9695434570312, norm_rel=0.023529477417469025, ref_abs_avg=22.482563018798828, test_abs_avg=22.480697631835938
production_forward2 grad[80] vs paper_forward: mean_abs=0.40698981285095215, max_abs=1.5625, mean_rel=0.11843665689229965, max_rel=23.098846435546875, norm_rel=0.022551482543349266, ref_abs_avg=18.39719009399414, test_abs_avg=18.39286231994629
production_forward2 grad[81] vs paper_forward: mean_abs=0.498618483543396, max_abs=4.0, mean_rel=0.15381598472595215, max_rel=939.1719360351562, norm_rel=0.022941775619983673, ref_abs_avg=21.756053924560547, test_abs_avg=21.75751304626465
production_forward2 grad[82] vs paper_forward: mean_abs=0.4894745945930481, max_abs=4.0, mean_rel=0.14992880821228027, max_rel=699.886474609375, norm_rel=0.02287568524479866, ref_abs_avg=21.44923973083496, test_abs_avg=21.443496704101562
production_forward2 grad[83] vs paper_forward: mean_abs=0.3760460615158081, max_abs=1.529296875, mean_rel=0.2195642590522766, max_rel=47.16383361816406, norm_rel=0.021474597975611687, ref_abs_avg=17.429889678955078, test_abs_avg=17.431503295898438
production_forward2 grad[84] vs paper_forward: mean_abs=0.4594880938529968, max_abs=3.78125, mean_rel=0.13803714513778687, max_rel=461.5159606933594, norm_rel=0.022320054471492767, ref_abs_avg=20.621240615844727, test_abs_avg=20.621931076049805
production_forward2 grad[85] vs paper_forward: mean_abs=0.461381196975708, max_abs=4.0, mean_rel=0.15517361462116241, max_rel=881.94873046875, norm_rel=0.022306090220808983, ref_abs_avg=20.715782165527344, test_abs_avg=20.71588897705078
production_forward2 grad[86] vs paper_forward: mean_abs=0.33748674392700195, max_abs=1.625, mean_rel=0.10337439924478531, max_rel=9.165030479431152, norm_rel=0.020408354699611664, ref_abs_avg=16.724586486816406, test_abs_avg=16.708066940307617
production_forward2 grad[87] vs paper_forward: mean_abs=0.43467119336128235, max_abs=4.25, mean_rel=0.1327774077653885, max_rel=812.1619873046875, norm_rel=0.02167363464832306, ref_abs_avg=20.161094665527344, test_abs_avg=20.161239624023438
production_forward2 grad[88] vs paper_forward: mean_abs=0.4284205138683319, max_abs=3.8125, mean_rel=0.1429806649684906, max_rel=1346.196044921875, norm_rel=0.021786386147141457, ref_abs_avg=19.801250457763672, test_abs_avg=19.800180435180664
production_forward2 grad[89] vs paper_forward: mean_abs=0.3464641571044922, max_abs=1.25, mean_rel=0.06121542677283287, max_rel=5.965137004852295, norm_rel=0.020818045362830162, ref_abs_avg=16.977779388427734, test_abs_avg=16.999584197998047
production_forward2 grad[90] vs paper_forward: mean_abs=0.4040161967277527, max_abs=4.25, mean_rel=0.13165733218193054, max_rel=1116.3433837890625, norm_rel=0.021119246259331703, ref_abs_avg=19.248943328857422, test_abs_avg=19.248363494873047
production_forward2 grad[91] vs paper_forward: mean_abs=0.39348578453063965, max_abs=3.875, mean_rel=0.13022392988204956, max_rel=506.9052734375, norm_rel=0.02091638743877411, ref_abs_avg=18.95298957824707, test_abs_avg=18.958877563476562
production_forward2 grad[92] vs paper_forward: mean_abs=0.3240180015563965, max_abs=1.375, mean_rel=0.08442378044128418, max_rel=4.413590431213379, norm_rel=0.021963289007544518, ref_abs_avg=15.001148223876953, test_abs_avg=15.010171890258789
production_forward2 grad[93] vs paper_forward: mean_abs=0.3839735984802246, max_abs=3.75, mean_rel=0.12476561963558197, max_rel=582.8965454101562, norm_rel=0.02059721015393734, ref_abs_avg=18.81970977783203, test_abs_avg=18.820232391357422
production_forward2 grad[94] vs paper_forward: mean_abs=0.37192216515541077, max_abs=4.0, mean_rel=0.13032235205173492, max_rel=532.5422973632812, norm_rel=0.020404795184731483, ref_abs_avg=18.48053550720215, test_abs_avg=18.481191635131836
production_forward2 grad[95] vs paper_forward: mean_abs=0.2952847480773926, max_abs=1.125, mean_rel=0.06365829706192017, max_rel=2.7828166484832764, norm_rel=0.01956447772681713, ref_abs_avg=15.437275886535645, test_abs_avg=15.423669815063477
production_forward2 grad[96] vs paper_forward: mean_abs=0.36196818947792053, max_abs=4.5, mean_rel=0.1211463063955307, max_rel=728.4627685546875, norm_rel=0.020231764763593674, ref_abs_avg=18.14980697631836, test_abs_avg=18.150264739990234
production_forward2 grad[97] vs paper_forward: mean_abs=0.35454481840133667, max_abs=4.0, mean_rel=0.11825211346149445, max_rel=495.4944152832031, norm_rel=0.020139005035161972, ref_abs_avg=17.874176025390625, test_abs_avg=17.883987426757812
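The per-tensor statistics printed above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) can be reproduced with a small helper. This is a hypothetical reconstruction inferred from the field names, not the actual script that produced the log; in particular, the `eps` guard and the L2 definition of `norm_rel` are assumptions.

```python
import math

def compare(ref, test, eps=1e-12):
    # Hypothetical reconstruction of the metrics in the log above.
    # ref/test are flat sequences of floats (e.g. a flattened gradient).
    diff = [abs(t - r) for r, t in zip(ref, test)]
    # Per-element relative error; eps avoids division by zero (assumption).
    rel = [d / max(abs(r), eps) for d, r in zip(diff, ref)]
    l2 = lambda xs: math.sqrt(sum(x * x for x in xs))
    return {
        "mean_abs": sum(diff) / len(diff),
        "max_abs": max(diff),
        "mean_rel": sum(rel) / len(rel),
        "max_rel": max(rel),
        # Assumed to be ||test - ref||_2 / ||ref||_2.
        "norm_rel": l2(diff) / max(l2(ref), eps),
        "ref_abs_avg": sum(abs(r) for r in ref) / len(ref),
        "test_abs_avg": sum(abs(t) for t in test) / len(test),
    }
```

Note that a large `max_rel` alongside a small `norm_rel` (as in many rows above) typically indicates a few near-zero reference elements inflating the element-wise relative error while the overall tensors still agree closely.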

