identity layers + randn queries
Autotuning kernel phase_1_batched_interblock_attention_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_kernel, with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'), finished after 8.21s, best config selected: num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None;
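The 20 configurations tried per key correspond to a full cross product of num_warps and num_stages. A minimal sketch of how such a candidate list might be built for `@triton.autotune` (the kernel's actual decorator is not shown in this log, so the grid below is reconstructed from the log output; in real code each dict would be a `triton.Config`):

```python
from itertools import product

# Reconstruct the sweep seen in the log: every combination of
# num_warps in {1, 2, 4, 8, 16} and num_stages in {1, 2, 3, 4},
# with num_ctas fixed at 1 and maxnreg unset.
configs = [
    {"num_warps": w, "num_ctas": 1, "num_stages": s, "maxnreg": None}
    for w, s in product([1, 2, 4, 8, 16], [1, 2, 3, 4])
]

print(len(configs))  # 20 candidate configs per autotuning key
```

Because the autotuning key includes the tensor shapes and dtypes, the same 20-config sweep reruns whenever a new (shape, dtype) combination is first encountered, which is why the sweep repeats below.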
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_2_online_softmax_merge_intrablock_out_kernel, with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16'), finished after 3.60s, best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_kernel, with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'), finished after 7.24s, best config selected: num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_kernel, with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'), finished after 8.46s, best config selected: num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_kernel, with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'), finished after 8.60s, best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_kernel, with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'), finished after 4.74s, best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 7.37s, best config selected: num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel over 8 configs: BLOCK_BATCH_SEQ in {128, 256} x BLOCK_HIDDEN in {32, 64} x num_warps in {4, 8} (num_stages: 1, num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel, with key as (131072, 512, 1, 'torch.float32', 'torch.float32'), finished after 1.40s, best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
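Unlike the attention kernels, the reduce kernels sweep 2D tile shapes (plus warp count) rather than pipeline depth. A sketch of that candidate grid as reconstructed from the log (again, the project's actual `@triton.autotune` decorator is not shown here, and real entries would be `triton.Config` objects):

```python
from itertools import product

# Tile-shape sweep for the grad-reduction kernels:
# BLOCK_BATCH_SEQ in {128, 256}, BLOCK_HIDDEN in {32, 64}, num_warps in {4, 8};
# num_stages stays at 1 since these are simple reductions with no pipelining.
configs = [
    {"BLOCK_BATCH_SEQ": bs, "BLOCK_HIDDEN": bh,
     "num_warps": w, "num_ctas": 1, "num_stages": 1, "maxnreg": None}
    for bs, bh, w in product([128, 256], [32, 64], [4, 8])
]

print(len(configs))  # 8 candidate configs
```

The much smaller grid explains why these kernels finish tuning in well under two seconds while the 20-config attention sweeps take several seconds each.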
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel, with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 4.43s, best config selected: num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel over 8 configs: BLOCK_BATCH_SEQ in {128, 256} x BLOCK_HIDDEN in {32, 64} x num_warps in {4, 8} (num_stages: 1, num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_2_reduce_grad_pseudo_query_kernel, with key as (131072, 512, 'torch.float32', 'torch.float32'), finished after 1.38s, best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over 20 configs: num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 19.47s, best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel over 8 configs: BLOCK_BATCH_SEQ in {128, 256} x BLOCK_HIDDEN in {32, 64} x num_warps in {4, 8} (num_stages: 1, num_ctas: 1, maxnreg: None throughout)
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel, with key as (131072, 512, 8, 'torch.float32', 'torch.float32'), finished after 1.46s, best config selected: BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel over configs: num_warps in {1, 2, 4, 8} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None throughout)
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 18.90s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 14.38s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 9.93s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
production_forward2 fwd+bwd:  224.502 ms
production_forward2 bwd-only: 202.252 ms
production_forward2 peak allocated: fwd=2.551 GiB, fwd+bwd=5.930 GiB
production_forward2 peak reserved:  fwd=2.818 GiB, fwd+bwd=8.568 GiB
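Peak "allocated" (bytes held by live tensors) and "reserved" (bytes cached by the CUDA caching allocator) are typically read via `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()`. A quick arithmetic check on the figures logged above, showing the extra footprint the backward pass adds (the helper is illustrative):

```python
def gib(n_bytes):
    """Convert a raw byte count to GiB, the unit used in the log above."""
    return n_bytes / 2**30

# Peaks from the production_forward2 run above (GiB); the backward pass more
# than doubles the allocated footprint because saved activations and gradients
# are live at the same time.
fwd_alloc, fwd_bwd_alloc = 2.551, 5.930
bwd_overhead = fwd_bwd_alloc - fwd_alloc
print(f"backward overhead: {bwd_overhead:.3f} GiB")
```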

/usr/local/lib/python3.12/dist-packages/torch/_inductor/lowering.py:7627: UserWarning: 
Online softmax is disabled on the fly since Inductor decides to
split the reduction. Cut an issue to PyTorch if this is an
important use case and you want to speed it up with online
softmax.

  warnings.warn(
/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py:321: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/torch/_inductor/select_algorithm.py:3464: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  current_size = base.storage().size()
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] Runtime error during autotuning: 
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] CUDA driver error: invalid argument
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] 
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] This may mean this GPU is too small for max_autotune mode.
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] 
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] . 
E0429 22:00:35.736000 3744 torch/_inductor/select_algorithm.py:3727] [0/1] Ignoring this choice.
Autotune Choices Stats:
{"num_choices": 13, "num_triton_choices": 12, "best_kernel": "bmm", "best_time": 2.427903890609741, "best_triton_pos": 1, "best_triton_time": Infinity, "best_triton_kernel": "triton_bmm_0", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2"}
AUTOTUNE bmm(131072x2x1, 131072x1x512)
strides: [1, 131072, 0], [512, 0, 1]
dtypes: torch.float32, torch.float32
  bmm 2.4279 ms 100.0% 
  triton_bmm_0 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=1, num_warps=2
  triton_bmm_1 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=2
  triton_bmm_2 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_bmm_3 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=2
  triton_bmm_4 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
  triton_bmm_5 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_bmm_6 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_bmm_7 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_bmm_8 inf ms 0.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=16, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.2370 seconds and 0.0003 seconds precompiling for 13 choices
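Every `triton_bmm_*` candidate in this table reports `inf ms` because it hit the "CUDA driver error: invalid argument" failures logged above and was ignored, so the extern cuBLAS `bmm` wins by default. A minimal sketch of how such a table reduces to a winner (rows condensed from the log):

```python
import math

# (kernel, time_ms) rows condensed from the AUTOTUNE bmm(131072x2x1, 131072x1x512)
# table above; inf marks a candidate that crashed during benchmarking.
rows = [
    ("bmm", 2.4279),
    ("triton_bmm_0", math.inf),
    ("triton_bmm_1", math.inf),
    ("triton_bmm_2", math.inf),
]

# Drop failed candidates, then pick the fastest survivor.
viable = [(name, t) for name, t in rows if math.isfinite(t)]
best = min(viable, key=lambda r: r[1])
print(best)  # ('bmm', 2.4279)
```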
Autotune Choices Stats:
{"num_choices": 18, "num_triton_choices": 17, "best_kernel": "triton_mm_18", "best_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8", "best_time": 0.35366401076316833, "best_triton_pos": 0}
AUTOTUNE mm(512x1, 1x262144)
strides: [1, 512], [0, 1]
dtypes: torch.float32, torch.float32
  triton_mm_18 0.3537 ms 100.0% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8
  mm 0.3542 ms 99.8% 
  triton_mm_24 0.3546 ms 99.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_23 0.3555 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_21 0.3556 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_16 0.3556 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_20 0.3556 ms 99.5% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_15 0.3557 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
  triton_mm_19 0.3557 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_22 0.3557 ms 99.4% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.8275 seconds and 0.0295 seconds precompiling for 18 choices
Autotune Choices Stats:
{"num_choices": 18, "num_triton_choices": 17, "best_kernel": "mm", "best_time": 0.1759680062532425, "best_triton_pos": 1, "best_triton_time": 0.17609600722789764, "best_triton_kernel": "triton_mm_38", "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8"}
AUTOTUNE mm(512x1, 1x131072)
strides: [1, 512], [0, 1]
dtypes: torch.float32, torch.float32
  mm 0.1760 ms 100.0% 
  triton_mm_38 0.1761 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_37 0.1761 ms 99.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4
  triton_mm_32 0.1772 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8
  triton_mm_35 0.1773 ms 99.3% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=8
  triton_mm_39 0.1773 ms 99.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=128, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4
  triton_mm_33 0.1775 ms 99.2% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=32, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_36 0.1781 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4
  triton_mm_41 0.1781 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=128, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=8
  triton_mm_34 0.1782 ms 98.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=16, BLOCK_M=64, BLOCK_N=64, EVEN_K=False, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=2, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.7168 seconds and 0.3409 seconds precompiling for 18 choices

paper_forward fwd+bwd:  379.749 ms
paper_forward bwd-only: 294.094 ms
paper_forward peak allocated: fwd=29.705 GiB, fwd+bwd=31.823 GiB
paper_forward peak reserved:  fwd=29.740 GiB, fwd+bwd=32.490 GiB
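A quick arithmetic check comparing the two implementations' headline fwd+bwd numbers logged above:

```python
# Headline fwd+bwd figures from the two benchmark blocks above.
prod_ms, paper_ms = 224.502, 379.749    # wall time (ms)
prod_gib, paper_gib = 5.930, 31.823     # peak allocated (GiB)

speedup = paper_ms / prod_ms
mem_ratio = paper_gib / prod_gib
print(f"production_forward2 is {speedup:.2f}x faster and peaks at "
      f"{mem_ratio:.1f}x less allocated memory than paper_forward")
```

Most of the memory gap comes from the forward pass alone (2.551 GiB vs 29.705 GiB peak allocated), consistent with the fused kernels avoiding materialization of large intermediates.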
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 4.73s,
best config selected: num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 20.38s, best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: full sweep over num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 configs, identical to the sweep above)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 18.97s, best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: full sweep over num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 configs, identical to the sweep above)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 14.76s, best config selected: num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel: full sweep over num_warps in {1, 2, 4, 8, 16} x num_stages in {1, 2, 3, 4} (num_ctas: 1, maxnreg: None; 20 configs, identical to the sweep above)
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel, with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'), finished after 10.19s, best config selected: num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None;
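Each autotune key above triggers the same candidate sweep. A minimal sketch of how that config grid could be generated (hypothetical reconstruction from the log; the kernel's actual @triton.autotune decorator is not shown here — with Triton installed, each dict would become a triton.Config(num_warps=..., num_stages=...)):

```python
from itertools import product

# Grid inferred from the log: num_warps in {1, 2, 4, 8, 16},
# num_stages in {1, 2, 3, 4}, num_ctas fixed at 1, maxnreg unset.
WARPS = (1, 2, 4, 8, 16)
STAGES = (1, 2, 3, 4)

def config_grid():
    """Yield one launch-config dict per autotune candidate."""
    for num_warps, num_stages in product(WARPS, STAGES):
        yield {"num_warps": num_warps, "num_ctas": 1, "num_stages": num_stages}

configs = list(config_grid())
print(len(configs))  # 20 candidates benchmarked per autotune key
```

The 20-candidate sweep is re-run once per distinct key tuple (shape/dtype signature), which is why the same config list is benchmarked four times above.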
production_forward fwd+bwd:  126.740 ms
production_forward bwd-only: 106.302 ms
production_forward peak allocated: fwd=3.071 GiB, fwd+bwd=7.571 GiB
production_forward peak reserved:  fwd=3.318 GiB, fwd+bwd=8.568 GiB
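The timings above bundle forward and backward into one number; the forward-only share and the backward overhead follow by simple arithmetic on the logged values:

```python
# Derived quantities from the logged benchmark numbers.
fwd_bwd_ms = 126.740
bwd_only_ms = 106.302

fwd_only_ms = fwd_bwd_ms - bwd_only_ms      # time attributable to the forward pass
bwd_over_fwd = bwd_only_ms / fwd_only_ms    # backward costs ~5x the forward here

peak_fwd_gib = 3.071
peak_fwd_bwd_gib = 7.571
extra_for_bwd_gib = peak_fwd_bwd_gib - peak_fwd_gib  # activations/workspace held for backward

print(f"fwd-only: {fwd_only_ms:.3f} ms, bwd/fwd: {bwd_over_fwd:.1f}x, "
      f"extra memory for bwd: {extra_for_bwd_gib:.3f} GiB")
```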

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.001673883176408708, max_abs=0.046875
production_forward grad[0] vs paper_forward: mean_abs=0.00875028781592846, max_abs=0.38671875, mean_rel=0.07360494136810303, max_rel=121.49032592773438, norm_rel=0.02015526033937931, ref_abs_avg=0.4729507565498352, test_abs_avg=0.4729730486869812
production_forward grad[1] vs paper_forward: mean_abs=7.563207626342773, max_abs=64.0, mean_rel=0.35067927837371826, max_rel=3052.84228515625, norm_rel=0.020356083288788795, ref_abs_avg=327.7489013671875, test_abs_avg=327.70916748046875
production_forward grad[2] vs paper_forward: mean_abs=1.3568072319030762, max_abs=6.0, mean_rel=0.33377549052238464, max_rel=110.1822280883789, norm_rel=0.02329576574265957, ref_abs_avg=59.42229080200195, test_abs_avg=59.470428466796875
production_forward grad[3] vs paper_forward: mean_abs=1.6711535453796387, max_abs=11.5, mean_rel=0.1623333990573883, max_rel=2243.426513671875, norm_rel=0.02399125136435032, ref_abs_avg=70.08052062988281, test_abs_avg=70.09129333496094
production_forward grad[4] vs paper_forward: mean_abs=1.620690107345581, max_abs=10.5, mean_rel=0.1654537171125412, max_rel=1388.6895751953125, norm_rel=0.023721233010292053, ref_abs_avg=68.79881286621094, test_abs_avg=68.7919921875
production_forward grad[5] vs paper_forward: mean_abs=1.1496429443359375, max_abs=5.5, mean_rel=0.22786866128444672, max_rel=78.63928985595703, norm_rel=0.02299303002655506, ref_abs_avg=51.28330612182617, test_abs_avg=51.237274169921875
production_forward grad[6] vs paper_forward: mean_abs=1.4419267177581787, max_abs=9.0, mean_rel=0.1660376787185669, max_rel=1979.19580078125, norm_rel=0.02363251894712448, ref_abs_avg=61.33770751953125, test_abs_avg=61.34748077392578
production_forward grad[7] vs paper_forward: mean_abs=1.4126349687576294, max_abs=8.625, mean_rel=0.18127718567848206, max_rel=1949.9393310546875, norm_rel=0.023468198254704475, ref_abs_avg=60.51335525512695, test_abs_avg=60.52195739746094
production_forward grad[8] vs paper_forward: mean_abs=1.058436632156372, max_abs=4.0, mean_rel=0.11691969633102417, max_rel=11.501931190490723, norm_rel=0.02207200974225998, ref_abs_avg=48.238182067871094, test_abs_avg=48.23567581176758
production_forward grad[9] vs paper_forward: mean_abs=1.3018391132354736, max_abs=8.75, mean_rel=0.17046701908111572, max_rel=1016.9502563476562, norm_rel=0.02345513366162777, ref_abs_avg=55.81283187866211, test_abs_avg=55.81768798828125
production_forward grad[10] vs paper_forward: mean_abs=1.2665942907333374, max_abs=8.0, mean_rel=0.16722962260246277, max_rel=2122.9580078125, norm_rel=0.02312367968261242, ref_abs_avg=55.13685607910156, test_abs_avg=55.145301818847656
production_forward grad[11] vs paper_forward: mean_abs=1.0063769817352295, max_abs=3.75, mean_rel=0.16707313060760498, max_rel=32.957725524902344, norm_rel=0.02544548176229, ref_abs_avg=39.62942123413086, test_abs_avg=39.5983772277832
production_forward grad[12] vs paper_forward: mean_abs=1.1898460388183594, max_abs=7.5, mean_rel=0.15568426251411438, max_rel=985.91845703125, norm_rel=0.0233759768307209, ref_abs_avg=51.140342712402344, test_abs_avg=51.143341064453125
production_forward grad[13] vs paper_forward: mean_abs=1.1676931381225586, max_abs=7.25, mean_rel=0.15705446898937225, max_rel=1764.2684326171875, norm_rel=0.023131098598241806, ref_abs_avg=50.76123046875, test_abs_avg=50.76155090332031
production_forward grad[14] vs paper_forward: mean_abs=0.8643671274185181, max_abs=3.375, mean_rel=0.381902277469635, max_rel=63.619041442871094, norm_rel=0.02110164426267147, ref_abs_avg=39.91735076904297, test_abs_avg=39.986785888671875
production_forward grad[15] vs paper_forward: mean_abs=1.108821153640747, max_abs=6.375, mean_rel=0.16388598084449768, max_rel=1809.867919921875, norm_rel=0.023208873346447945, ref_abs_avg=47.99644470214844, test_abs_avg=47.999916076660156
production_forward grad[16] vs paper_forward: mean_abs=1.0809390544891357, max_abs=7.0, mean_rel=0.1633094996213913, max_rel=3280.819091796875, norm_rel=0.022854311391711235, ref_abs_avg=47.59370422363281, test_abs_avg=47.5968017578125
production_forward grad[17] vs paper_forward: mean_abs=0.8059911727905273, max_abs=3.375, mean_rel=0.10608753561973572, max_rel=12.890905380249023, norm_rel=0.02210160344839096, ref_abs_avg=36.524532318115234, test_abs_avg=36.4943733215332
production_forward grad[18] vs paper_forward: mean_abs=1.0433986186981201, max_abs=6.25, mean_rel=0.17550550401210785, max_rel=2116.06494140625, norm_rel=0.023098532110452652, ref_abs_avg=45.3787841796875, test_abs_avg=45.38494110107422
production_forward grad[19] vs paper_forward: mean_abs=1.0222420692443848, max_abs=6.25, mean_rel=0.17575353384017944, max_rel=1805.6829833984375, norm_rel=0.022975053638219833, ref_abs_avg=44.68083953857422, test_abs_avg=44.691619873046875
production_forward grad[20] vs paper_forward: mean_abs=0.7872288227081299, max_abs=4.375, mean_rel=0.12189443409442902, max_rel=8.799932479858398, norm_rel=0.02279515005648136, ref_abs_avg=35.798431396484375, test_abs_avg=35.84804916381836
production_forward grad[21] vs paper_forward: mean_abs=0.9819096326828003, max_abs=6.0, mean_rel=0.17543652653694153, max_rel=2277.6484375, norm_rel=0.022955020889639854, ref_abs_avg=42.96687316894531, test_abs_avg=42.970787048339844
production_forward grad[22] vs paper_forward: mean_abs=0.9630776643753052, max_abs=5.5, mean_rel=0.14455629885196686, max_rel=993.2799072265625, norm_rel=0.022689029574394226, ref_abs_avg=42.647701263427734, test_abs_avg=42.646751403808594
production_forward grad[23] vs paper_forward: mean_abs=0.7196497917175293, max_abs=2.875, mean_rel=0.14066235721111298, max_rel=11.222718238830566, norm_rel=0.021245796233415604, ref_abs_avg=33.71043395996094, test_abs_avg=33.70086669921875
production_forward grad[24] vs paper_forward: mean_abs=0.9421075582504272, max_abs=6.5, mean_rel=0.15345679223537445, max_rel=1099.128662109375, norm_rel=0.02283116988837719, ref_abs_avg=41.438987731933594, test_abs_avg=41.442588806152344
production_forward grad[25] vs paper_forward: mean_abs=0.9169961214065552, max_abs=6.25, mean_rel=0.16206243634223938, max_rel=1591.049560546875, norm_rel=0.02249056287109852, ref_abs_avg=40.97686767578125, test_abs_avg=40.975128173828125
production_forward grad[26] vs paper_forward: mean_abs=0.8756788969039917, max_abs=3.5, mean_rel=0.08003269881010056, max_rel=4.0842108726501465, norm_rel=0.021807733923196793, ref_abs_avg=39.373512268066406, test_abs_avg=39.399654388427734
production_forward grad[27] vs paper_forward: mean_abs=1.0922353267669678, max_abs=8.0, mean_rel=0.15994369983673096, max_rel=1077.9559326171875, norm_rel=0.024747805669903755, ref_abs_avg=44.36196517944336, test_abs_avg=44.36524963378906
production_forward grad[28] vs paper_forward: mean_abs=1.0645787715911865, max_abs=6.5, mean_rel=0.1617605984210968, max_rel=1627.184326171875, norm_rel=0.02437509223818779, ref_abs_avg=43.869327545166016, test_abs_avg=43.874351501464844
production_forward grad[29] vs paper_forward: mean_abs=0.8371410369873047, max_abs=3.125, mean_rel=0.09301713109016418, max_rel=3.8509576320648193, norm_rel=0.025328876450657845, ref_abs_avg=32.58767318725586, test_abs_avg=32.605934143066406
production_forward grad[30] vs paper_forward: mean_abs=1.0042228698730469, max_abs=7.0, mean_rel=0.1710343062877655, max_rel=1695.6641845703125, norm_rel=0.025039128959178925, ref_abs_avg=40.24940872192383, test_abs_avg=40.252830505371094
production_forward grad[31] vs paper_forward: mean_abs=0.9835143685340881, max_abs=6.0, mean_rel=0.16912941634655, max_rel=1460.633056640625, norm_rel=0.024758344516158104, ref_abs_avg=39.83796310424805, test_abs_avg=39.8455696105957
production_forward grad[32] vs paper_forward: mean_abs=0.7166967391967773, max_abs=3.5, mean_rel=0.0839337706565857, max_rel=9.336565971374512, norm_rel=0.023626020178198814, ref_abs_avg=30.8895320892334, test_abs_avg=30.894996643066406
production_forward grad[33] vs paper_forward: mean_abs=0.928148090839386, max_abs=6.4375, mean_rel=0.16610094904899597, max_rel=1135.1966552734375, norm_rel=0.02500397525727749, ref_abs_avg=37.260650634765625, test_abs_avg=37.265052795410156
production_forward grad[34] vs paper_forward: mean_abs=0.9140356779098511, max_abs=5.375, mean_rel=0.16153092682361603, max_rel=1606.3092041015625, norm_rel=0.024800799787044525, ref_abs_avg=37.018585205078125, test_abs_avg=37.02977752685547
production_forward grad[35] vs paper_forward: mean_abs=0.7170166969299316, max_abs=3.03125, mean_rel=0.0991172343492508, max_rel=11.192898750305176, norm_rel=0.025998111814260483, ref_abs_avg=28.147558212280273, test_abs_avg=28.109661102294922
production_forward grad[36] vs paper_forward: mean_abs=0.8799848556518555, max_abs=6.0, mean_rel=0.16347180306911469, max_rel=1121.535888671875, norm_rel=0.024828443303704262, ref_abs_avg=35.52450942993164, test_abs_avg=35.52647399902344
production_forward grad[37] vs paper_forward: mean_abs=0.8578837513923645, max_abs=5.0, mean_rel=0.16530893743038177, max_rel=1321.6419677734375, norm_rel=0.02430902048945427, ref_abs_avg=35.40971755981445, test_abs_avg=35.419002532958984
production_forward grad[38] vs paper_forward: mean_abs=0.6473473310470581, max_abs=2.5, mean_rel=0.10048863291740417, max_rel=15.657307624816895, norm_rel=0.02396877482533455, ref_abs_avg=27.198488235473633, test_abs_avg=27.240245819091797
production_forward grad[39] vs paper_forward: mean_abs=0.8289036154747009, max_abs=5.5, mean_rel=0.17344655096530914, max_rel=1105.6861572265625, norm_rel=0.024425137788057327, ref_abs_avg=34.01494598388672, test_abs_avg=34.019195556640625
production_forward grad[40] vs paper_forward: mean_abs=0.8138435482978821, max_abs=5.0, mean_rel=0.1541774421930313, max_rel=699.6306762695312, norm_rel=0.02431553602218628, ref_abs_avg=33.56377410888672, test_abs_avg=33.56916046142578
production_forward grad[41] vs paper_forward: mean_abs=0.664135217666626, max_abs=2.25, mean_rel=0.15137530863285065, max_rel=20.25432586669922, norm_rel=0.023946965113282204, ref_abs_avg=26.673519134521484, test_abs_avg=26.632301330566406
production_forward grad[42] vs paper_forward: mean_abs=0.787259578704834, max_abs=5.0, mean_rel=0.1547311544418335, max_rel=1484.580810546875, norm_rel=0.02436640113592148, ref_abs_avg=32.36856460571289, test_abs_avg=32.37409210205078
production_forward grad[43] vs paper_forward: mean_abs=0.775510847568512, max_abs=5.0, mean_rel=0.15544173121452332, max_rel=1269.1390380859375, norm_rel=0.02415335364639759, ref_abs_avg=32.20361328125, test_abs_avg=32.20717239379883
production_forward grad[44] vs paper_forward: mean_abs=0.5857601165771484, max_abs=2.75, mean_rel=0.1549915373325348, max_rel=28.921911239624023, norm_rel=0.02395491674542427, ref_abs_avg=25.707618713378906, test_abs_avg=25.71414566040039
production_forward grad[45] vs paper_forward: mean_abs=0.748866617679596, max_abs=4.5, mean_rel=0.16307422518730164, max_rel=1631.59912109375, norm_rel=0.023974483832716942, ref_abs_avg=31.320575714111328, test_abs_avg=31.323654174804688
production_forward grad[46] vs paper_forward: mean_abs=0.7432219386100769, max_abs=4.5, mean_rel=0.1619643270969391, max_rel=1134.138916015625, norm_rel=0.023957645520567894, ref_abs_avg=31.122512817382812, test_abs_avg=31.126575469970703
production_forward grad[47] vs paper_forward: mean_abs=0.5874519348144531, max_abs=2.25, mean_rel=0.09001569449901581, max_rel=8.905205726623535, norm_rel=0.02378779835999012, ref_abs_avg=24.972869873046875, test_abs_avg=24.968936920166016
production_forward grad[48] vs paper_forward: mean_abs=0.7237948179244995, max_abs=5.0, mean_rel=0.16133855283260345, max_rel=899.454345703125, norm_rel=0.0239105261862278, ref_abs_avg=30.285953521728516, test_abs_avg=30.28815269470215
production_forward grad[49] vs paper_forward: mean_abs=0.7123599052429199, max_abs=4.5, mean_rel=0.16679199039936066, max_rel=944.5772705078125, norm_rel=0.023771969601511955, ref_abs_avg=30.008193969726562, test_abs_avg=30.006710052490234
production_forward grad[50] vs paper_forward: mean_abs=0.663428783416748, max_abs=3.0, mean_rel=0.0913933739066124, max_rel=6.600301265716553, norm_rel=0.026337023824453354, ref_abs_avg=25.07156753540039, test_abs_avg=25.012256622314453
production_forward grad[51] vs paper_forward: mean_abs=0.8179484009742737, max_abs=6.0625, mean_rel=0.17845961451530457, max_rel=2202.14892578125, norm_rel=0.025815829634666443, ref_abs_avg=31.785259246826172, test_abs_avg=31.788002014160156
production_forward grad[52] vs paper_forward: mean_abs=0.8010509014129639, max_abs=6.0, mean_rel=0.16022738814353943, max_rel=1250.2261962890625, norm_rel=0.025223108008503914, ref_abs_avg=31.82931137084961, test_abs_avg=31.83075714111328
production_forward grad[53] vs paper_forward: mean_abs=0.6124789714813232, max_abs=2.46875, mean_rel=0.08156886696815491, max_rel=5.030136585235596, norm_rel=0.025286206975579262, ref_abs_avg=24.80777359008789, test_abs_avg=24.802310943603516
production_forward grad[54] vs paper_forward: mean_abs=0.7476738691329956, max_abs=6.0, mean_rel=0.17599047720432281, max_rel=1212.9698486328125, norm_rel=0.025321533903479576, ref_abs_avg=29.565385818481445, test_abs_avg=29.568416595458984
production_forward grad[55] vs paper_forward: mean_abs=0.7300270795822144, max_abs=5.0, mean_rel=0.15203389525413513, max_rel=693.3604125976562, norm_rel=0.024897903203964233, ref_abs_avg=29.345653533935547, test_abs_avg=29.343936920166016
production_forward grad[56] vs paper_forward: mean_abs=0.5600142478942871, max_abs=2.25, mean_rel=0.12395842373371124, max_rel=12.746944427490234, norm_rel=0.02459966205060482, ref_abs_avg=23.043071746826172, test_abs_avg=23.03419303894043
production_forward grad[57] vs paper_forward: mean_abs=0.6898181438446045, max_abs=5.25, mean_rel=0.1747007668018341, max_rel=1277.672607421875, norm_rel=0.02479216828942299, ref_abs_avg=27.853893280029297, test_abs_avg=27.854381561279297
production_forward grad[58] vs paper_forward: mean_abs=0.6794959306716919, max_abs=4.5, mean_rel=0.16818487644195557, max_rel=914.5076904296875, norm_rel=0.02482348494231701, ref_abs_avg=27.412681579589844, test_abs_avg=27.41257095336914
production_forward grad[59] vs paper_forward: mean_abs=0.4826946258544922, max_abs=2.0, mean_rel=0.09459982812404633, max_rel=9.298521995544434, norm_rel=0.02111893706023693, ref_abs_avg=23.111724853515625, test_abs_avg=23.082252502441406
production_forward grad[60] vs paper_forward: mean_abs=0.6420836448669434, max_abs=6.0, mean_rel=0.16325071454048157, max_rel=1340.6123046875, norm_rel=0.02410615049302578, ref_abs_avg=26.637025833129883, test_abs_avg=26.639373779296875
production_forward grad[61] vs paper_forward: mean_abs=0.6219294667243958, max_abs=4.5, mean_rel=0.15311074256896973, max_rel=957.0800170898438, norm_rel=0.023604756221175194, ref_abs_avg=26.320571899414062, test_abs_avg=26.315948486328125
production_forward grad[62] vs paper_forward: mean_abs=0.49296140670776367, max_abs=2.0, mean_rel=0.12206117808818817, max_rel=14.53040885925293, norm_rel=0.024700557813048363, ref_abs_avg=19.772724151611328, test_abs_avg=19.72897720336914
production_forward grad[63] vs paper_forward: mean_abs=0.6049481630325317, max_abs=4.0, mean_rel=0.16096368432044983, max_rel=1526.1239013671875, norm_rel=0.023713067173957825, ref_abs_avg=25.520221710205078, test_abs_avg=25.52133560180664
production_forward grad[64] vs paper_forward: mean_abs=0.5959091186523438, max_abs=5.125, mean_rel=0.1527521014213562, max_rel=1276.5208740234375, norm_rel=0.023589730262756348, ref_abs_avg=25.27552032470703, test_abs_avg=25.282283782958984
production_forward grad[65] vs paper_forward: mean_abs=0.4526351988315582, max_abs=1.875, mean_rel=0.09565576165914536, max_rel=11.822286605834961, norm_rel=0.02288680337369442, ref_abs_avg=19.88360595703125, test_abs_avg=19.884571075439453
production_forward grad[66] vs paper_forward: mean_abs=0.5715605616569519, max_abs=3.75, mean_rel=0.15094472467899323, max_rel=954.8440551757812, norm_rel=0.02336355671286583, ref_abs_avg=24.451759338378906, test_abs_avg=24.453210830688477
production_forward grad[67] vs paper_forward: mean_abs=0.5647128820419312, max_abs=4.0, mean_rel=0.15358048677444458, max_rel=1100.2550048828125, norm_rel=0.023046037182211876, ref_abs_avg=24.4775390625, test_abs_avg=24.478620529174805
production_forward grad[68] vs paper_forward: mean_abs=0.4574543833732605, max_abs=1.75, mean_rel=0.19796675443649292, max_rel=24.843544006347656, norm_rel=0.02378740906715393, ref_abs_avg=18.980548858642578, test_abs_avg=18.960838317871094
production_forward grad[69] vs paper_forward: mean_abs=0.5456110239028931, max_abs=4.5, mean_rel=0.15170010924339294, max_rel=1193.6817626953125, norm_rel=0.022881152108311653, ref_abs_avg=23.787677764892578, test_abs_avg=23.78700065612793
production_forward grad[70] vs paper_forward: mean_abs=0.5351316928863525, max_abs=4.0, mean_rel=0.14619599282741547, max_rel=560.428955078125, norm_rel=0.022860875353217125, ref_abs_avg=23.403709411621094, test_abs_avg=23.402345657348633
production_forward grad[71] vs paper_forward: mean_abs=0.416551411151886, max_abs=1.875, mean_rel=0.08620674908161163, max_rel=5.917757034301758, norm_rel=0.021416503936052322, ref_abs_avg=19.597774505615234, test_abs_avg=19.658336639404297
production_forward grad[72] vs paper_forward: mean_abs=0.5220699906349182, max_abs=5.0, mean_rel=0.14830563962459564, max_rel=633.2644653320312, norm_rel=0.02257818914949894, ref_abs_avg=23.101943969726562, test_abs_avg=23.103452682495117
production_forward grad[73] vs paper_forward: mean_abs=0.5086336135864258, max_abs=4.0, mean_rel=0.140050008893013, max_rel=414.9127197265625, norm_rel=0.0225117988884449, ref_abs_avg=22.56458854675293, test_abs_avg=22.562782287597656
production_forward grad[74] vs paper_forward: mean_abs=0.47113025188446045, max_abs=1.75, mean_rel=0.15986308455467224, max_rel=40.09233474731445, norm_rel=0.022413117811083794, ref_abs_avg=20.98248863220215, test_abs_avg=20.99041175842285
production_forward grad[75] vs paper_forward: mean_abs=0.5843691825866699, max_abs=4.375, mean_rel=0.16097527742385864, max_rel=2353.747802734375, norm_rel=0.023636702448129654, ref_abs_avg=24.715063095092773, test_abs_avg=24.716224670410156
production_forward grad[76] vs paper_forward: mean_abs=0.574413537979126, max_abs=6.0, mean_rel=0.15062201023101807, max_rel=525.5191040039062, norm_rel=0.02384749799966812, ref_abs_avg=24.164119720458984, test_abs_avg=24.159072875976562
production_forward grad[77] vs paper_forward: mean_abs=0.4197702407836914, max_abs=1.75, mean_rel=0.09723837673664093, max_rel=9.399643898010254, norm_rel=0.023298827931284904, ref_abs_avg=18.276769638061523, test_abs_avg=18.267501831054688
production_forward grad[78] vs paper_forward: mean_abs=0.5316588282585144, max_abs=4.0, mean_rel=0.15161308646202087, max_rel=1124.5001220703125, norm_rel=0.023080971091985703, ref_abs_avg=23.033334732055664, test_abs_avg=23.033531188964844
production_forward grad[79] vs paper_forward: mean_abs=0.5257145166397095, max_abs=4.0, mean_rel=0.1379028856754303, max_rel=439.88800048828125, norm_rel=0.023032817989587784, ref_abs_avg=22.854938507080078, test_abs_avg=22.859468460083008
production_forward grad[80] vs paper_forward: mean_abs=0.40943479537963867, max_abs=1.375, mean_rel=0.07747547328472137, max_rel=4.384320259094238, norm_rel=0.022280966863036156, ref_abs_avg=18.228397369384766, test_abs_avg=18.253700256347656
production_forward grad[81] vs paper_forward: mean_abs=0.49755221605300903, max_abs=4.0, mean_rel=0.14092528820037842, max_rel=1096.4677734375, norm_rel=0.022452373057603836, ref_abs_avg=22.166339874267578, test_abs_avg=22.16650390625
production_forward grad[82] vs paper_forward: mean_abs=0.484072208404541, max_abs=4.0, mean_rel=0.1487128734588623, max_rel=652.676513671875, norm_rel=0.022167084738612175, ref_abs_avg=21.817258834838867, test_abs_avg=21.816192626953125
production_forward grad[83] vs paper_forward: mean_abs=0.36567652225494385, max_abs=1.484375, mean_rel=0.06699052453041077, max_rel=3.610342502593994, norm_rel=0.021195566281676292, ref_abs_avg=17.619945526123047, test_abs_avg=17.58397674560547
production_forward grad[84] vs paper_forward: mean_abs=0.46518784761428833, max_abs=4.25, mean_rel=0.1468418389558792, max_rel=1547.8336181640625, norm_rel=0.02217472344636917, ref_abs_avg=21.006040573120117, test_abs_avg=21.007957458496094
production_forward grad[85] vs paper_forward: mean_abs=0.4497992992401123, max_abs=3.75, mean_rel=0.12745119631290436, max_rel=366.5052185058594, norm_rel=0.02161382883787155, ref_abs_avg=20.86577033996582, test_abs_avg=20.861665725708008
production_forward grad[86] vs paper_forward: mean_abs=0.3669261932373047, max_abs=1.3125, mean_rel=0.07520271837711334, max_rel=2.680476188659668, norm_rel=0.021757058799266815, ref_abs_avg=16.83729362487793, test_abs_avg=16.86414909362793
production_forward grad[87] vs paper_forward: mean_abs=0.4342445433139801, max_abs=5.25, mean_rel=0.13700789213180542, max_rel=964.2108764648438, norm_rel=0.02152552828192711, ref_abs_avg=20.244394302368164, test_abs_avg=20.244892120361328
production_forward grad[88] vs paper_forward: mean_abs=0.42387279868125916, max_abs=4.0, mean_rel=0.1253470480442047, max_rel=508.6669006347656, norm_rel=0.020945042371749878, ref_abs_avg=20.304956436157227, test_abs_avg=20.307537078857422
production_forward grad[89] vs paper_forward: mean_abs=0.3493351936340332, max_abs=1.125, mean_rel=0.06387798488140106, max_rel=5.794774532318115, norm_rel=0.02047072909772396, ref_abs_avg=16.879880905151367, test_abs_avg=16.851877212524414
production_forward grad[90] vs paper_forward: mean_abs=0.40536558628082275, max_abs=4.125, mean_rel=0.13056680560112, max_rel=590.9719848632812, norm_rel=0.020889349281787872, ref_abs_avg=19.546546936035156, test_abs_avg=19.54693603515625
production_forward grad[91] vs paper_forward: mean_abs=0.4013783633708954, max_abs=3.96875, mean_rel=0.12924626469612122, max_rel=650.8419799804688, norm_rel=0.021063733845949173, ref_abs_avg=19.25585174560547, test_abs_avg=19.250057220458984
production_forward grad[92] vs paper_forward: mean_abs=0.3146045207977295, max_abs=1.25, mean_rel=0.08605049550533295, max_rel=10.05296516418457, norm_rel=0.01971462182700634, ref_abs_avg=16.007877349853516, test_abs_avg=16.01409912109375
production_forward grad[93] vs paper_forward: mean_abs=0.38549479842185974, max_abs=4.5, mean_rel=0.1266452670097351, max_rel=960.6503295898438, norm_rel=0.020615562796592712, ref_abs_avg=18.8951416015625, test_abs_avg=18.895742416381836
production_forward grad[94] vs paper_forward: mean_abs=0.382468581199646, max_abs=3.5, mean_rel=0.12937939167022705, max_rel=663.3552856445312, norm_rel=0.021118683740496635, ref_abs_avg=18.426490783691406, test_abs_avg=18.41607666015625
production_forward grad[95] vs paper_forward: mean_abs=0.33185720443725586, max_abs=1.1875, mean_rel=0.13022646307945251, max_rel=19.846405029296875, norm_rel=0.02084497921168804, ref_abs_avg=16.0375919342041, test_abs_avg=16.01487159729004
production_forward grad[96] vs paper_forward: mean_abs=0.37774986028671265, max_abs=4.75, mean_rel=0.12338448315858841, max_rel=496.4707946777344, norm_rel=0.020128177478909492, ref_abs_avg=19.048044204711914, test_abs_avg=19.048324584960938
production_forward grad[97] vs paper_forward: mean_abs=0.35328209400177, max_abs=3.75, mean_rel=0.12353117763996124, max_rel=823.4065551757812, norm_rel=0.01977241411805153, ref_abs_avg=18.24317169189453, test_abs_avg=18.237384796142578
production_forward2 vs paper_forward output: mean_abs=0.001673883176408708, max_abs=0.046875
production_forward2 grad[0] vs paper_forward: mean_abs=0.009091182611882687, max_abs=0.3828125, mean_rel=0.07613058388233185, max_rel=112.3377685546875, norm_rel=0.02080419659614563, ref_abs_avg=0.4729507565498352, test_abs_avg=0.472959965467453
production_forward2 grad[1] vs paper_forward: mean_abs=7.6421332359313965, max_abs=64.0, mean_rel=0.3998664915561676, max_rel=3667.672607421875, norm_rel=0.02059810608625412, ref_abs_avg=327.7489013671875, test_abs_avg=327.6632995605469
production_forward2 grad[2] vs paper_forward: mean_abs=1.4098162651062012, max_abs=6.5, mean_rel=0.2750718295574188, max_rel=87.68677520751953, norm_rel=0.02422199957072735, ref_abs_avg=59.42229080200195, test_abs_avg=59.457244873046875
production_forward2 grad[3] vs paper_forward: mean_abs=1.7241909503936768, max_abs=11.828125, mean_rel=0.16824179887771606, max_rel=3843.0732421875, norm_rel=0.02472502738237381, ref_abs_avg=70.08052062988281, test_abs_avg=70.08814239501953
production_forward2 grad[4] vs paper_forward: mean_abs=1.6727421283721924, max_abs=11.0, mean_rel=0.1690148264169693, max_rel=857.1200561523438, norm_rel=0.024462467059493065, ref_abs_avg=68.79881286621094, test_abs_avg=68.79098510742188
production_forward2 grad[5] vs paper_forward: mean_abs=1.1949834823608398, max_abs=5.5, mean_rel=0.23136265575885773, max_rel=78.63928985595703, norm_rel=0.0234978124499321, ref_abs_avg=51.28330612182617, test_abs_avg=51.276180267333984
production_forward2 grad[6] vs paper_forward: mean_abs=1.483616828918457, max_abs=8.625, mean_rel=0.1661980152130127, max_rel=1947.940185546875, norm_rel=0.024298902601003647, ref_abs_avg=61.33770751953125, test_abs_avg=61.343421936035156
production_forward2 grad[7] vs paper_forward: mean_abs=1.4461290836334229, max_abs=8.5, mean_rel=0.18278947472572327, max_rel=1633.227783203125, norm_rel=0.024045279249548912, ref_abs_avg=60.51335525512695, test_abs_avg=60.51791763305664
production_forward2 grad[8] vs paper_forward: mean_abs=1.087113618850708, max_abs=4.0625, mean_rel=0.11296284198760986, max_rel=15.714592933654785, norm_rel=0.022742491215467453, ref_abs_avg=48.238182067871094, test_abs_avg=48.18970489501953
production_forward2 grad[9] vs paper_forward: mean_abs=1.3412463665008545, max_abs=8.5, mean_rel=0.17379190027713776, max_rel=1233.866943359375, norm_rel=0.024151504039764404, ref_abs_avg=55.81283187866211, test_abs_avg=55.81473159790039
production_forward2 grad[10] vs paper_forward: mean_abs=1.3046653270721436, max_abs=8.03125, mean_rel=0.1738928109407425, max_rel=1480.62109375, norm_rel=0.023800846189260483, ref_abs_avg=55.13685607910156, test_abs_avg=55.14855194091797
production_forward2 grad[11] vs paper_forward: mean_abs=1.0450513362884521, max_abs=3.5, mean_rel=0.12272201478481293, max_rel=5.086413383483887, norm_rel=0.026400301605463028, ref_abs_avg=39.62942123413086, test_abs_avg=39.56659698486328
production_forward2 grad[12] vs paper_forward: mean_abs=1.223320722579956, max_abs=8.0, mean_rel=0.15896376967430115, max_rel=855.6468505859375, norm_rel=0.02403295785188675, ref_abs_avg=51.140342712402344, test_abs_avg=51.141780853271484
production_forward2 grad[13] vs paper_forward: mean_abs=1.1982417106628418, max_abs=7.5, mean_rel=0.16561493277549744, max_rel=1451.0391845703125, norm_rel=0.02374076470732689, ref_abs_avg=50.76123046875, test_abs_avg=50.759910583496094
production_forward2 grad[14] vs paper_forward: mean_abs=0.9214416742324829, max_abs=3.171875, mean_rel=0.42672133445739746, max_rel=81.73797607421875, norm_rel=0.022372277453541756, ref_abs_avg=39.91735076904297, test_abs_avg=39.99298858642578
production_forward2 grad[15] vs paper_forward: mean_abs=1.1372863054275513, max_abs=7.5, mean_rel=0.16431692242622375, max_rel=1261.162841796875, norm_rel=0.023776903748512268, ref_abs_avg=47.99644470214844, test_abs_avg=47.997840881347656
production_forward2 grad[16] vs paper_forward: mean_abs=1.1115927696228027, max_abs=7.0, mean_rel=0.16554135084152222, max_rel=3252.039306640625, norm_rel=0.02349839173257351, ref_abs_avg=47.59370422363281, test_abs_avg=47.59446716308594
production_forward2 grad[17] vs paper_forward: mean_abs=0.8246822357177734, max_abs=3.875, mean_rel=0.12386039644479752, max_rel=18.476789474487305, norm_rel=0.022648703306913376, ref_abs_avg=36.524532318115234, test_abs_avg=36.5172119140625
production_forward2 grad[18] vs paper_forward: mean_abs=1.0669819116592407, max_abs=7.0, mean_rel=0.17764878273010254, max_rel=2127.501708984375, norm_rel=0.02362547442317009, ref_abs_avg=45.3787841796875, test_abs_avg=45.38349914550781
production_forward2 grad[19] vs paper_forward: mean_abs=1.0454139709472656, max_abs=6.875, mean_rel=0.17991553246974945, max_rel=2219.859130859375, norm_rel=0.023502381518483162, ref_abs_avg=44.68083953857422, test_abs_avg=44.689842224121094
production_forward2 grad[20] vs paper_forward: mean_abs=0.7807574272155762, max_abs=3.75, mean_rel=0.10687029361724854, max_rel=8.75174617767334, norm_rel=0.022362040355801582, ref_abs_avg=35.798431396484375, test_abs_avg=35.82166290283203
production_forward2 grad[21] vs paper_forward: mean_abs=1.0044307708740234, max_abs=6.25, mean_rel=0.17625834047794342, max_rel=2612.9921875, norm_rel=0.02348223514854908, ref_abs_avg=42.96687316894531, test_abs_avg=42.969791412353516
production_forward2 grad[22] vs paper_forward: mean_abs=0.9866832494735718, max_abs=6.0, mean_rel=0.1537390649318695, max_rel=1550.1220703125, norm_rel=0.0232410691678524, ref_abs_avg=42.647701263427734, test_abs_avg=42.649314880371094
production_forward2 grad[23] vs paper_forward: mean_abs=0.716245174407959, max_abs=2.625, mean_rel=0.12618312239646912, max_rel=7.130456447601318, norm_rel=0.02115839533507824, ref_abs_avg=33.71043395996094, test_abs_avg=33.74874496459961
production_forward2 grad[24] vs paper_forward: mean_abs=0.9606051445007324, max_abs=6.25, mean_rel=0.158002108335495, max_rel=1405.685302734375, norm_rel=0.023277755826711655, ref_abs_avg=41.438987731933594, test_abs_avg=41.440433502197266
production_forward2 grad[25] vs paper_forward: mean_abs=0.9402209520339966, max_abs=6.0, mean_rel=0.1664314717054367, max_rel=1346.0225830078125, norm_rel=0.023052625358104706, ref_abs_avg=40.97686767578125, test_abs_avg=40.97516632080078
production_forward2 grad[26] vs paper_forward: mean_abs=0.8853944540023804, max_abs=3.65625, mean_rel=0.16188167035579681, max_rel=47.21193313598633, norm_rel=0.022680679336190224, ref_abs_avg=39.373512268066406, test_abs_avg=39.42082977294922
production_forward2 grad[27] vs paper_forward: mean_abs=1.1154656410217285, max_abs=7.5, mean_rel=0.1639987826347351, max_rel=1118.330810546875, norm_rel=0.025257853791117668, ref_abs_avg=44.36196517944336, test_abs_avg=44.36328125
production_forward2 grad[28] vs paper_forward: mean_abs=1.085361361503601, max_abs=7.0, mean_rel=0.16875791549682617, max_rel=1959.9339599609375, norm_rel=0.02484320104122162, ref_abs_avg=43.869327545166016, test_abs_avg=43.870582580566406
production_forward2 grad[29] vs paper_forward: mean_abs=0.8385210037231445, max_abs=3.125, mean_rel=0.09521219879388809, max_rel=3.4049007892608643, norm_rel=0.02566368132829666, ref_abs_avg=32.58767318725586, test_abs_avg=32.6133918762207
production_forward2 grad[30] vs paper_forward: mean_abs=1.02434504032135, max_abs=7.0, mean_rel=0.1744888573884964, max_rel=1568.79736328125, norm_rel=0.025529464706778526, ref_abs_avg=40.24940872192383, test_abs_avg=40.253170013427734
production_forward2 grad[31] vs paper_forward: mean_abs=1.001715064048767, max_abs=6.0, mean_rel=0.17081007361412048, max_rel=1385.7391357421875, norm_rel=0.025204168632626534, ref_abs_avg=39.83796310424805, test_abs_avg=39.84585952758789
production_forward2 grad[32] vs paper_forward: mean_abs=0.7481465339660645, max_abs=3.5, mean_rel=0.08205719292163849, max_rel=9.069926261901855, norm_rel=0.024437110871076584, ref_abs_avg=30.8895320892334, test_abs_avg=30.9029598236084
production_forward2 grad[33] vs paper_forward: mean_abs=0.9455336332321167, max_abs=6.1875, mean_rel=0.16998063027858734, max_rel=1128.3160400390625, norm_rel=0.02545037865638733, ref_abs_avg=37.260650634765625, test_abs_avg=37.26400375366211
production_forward2 grad[34] vs paper_forward: mean_abs=0.9300980567932129, max_abs=6.0, mean_rel=0.16157662868499756, max_rel=981.722412109375, norm_rel=0.025203609839081764, ref_abs_avg=37.018585205078125, test_abs_avg=37.02915573120117
production_forward2 grad[35] vs paper_forward: mean_abs=0.7159061431884766, max_abs=2.75, mean_rel=0.09670404344797134, max_rel=10.647643089294434, norm_rel=0.0258718840777874, ref_abs_avg=28.147558212280273, test_abs_avg=28.12895965576172
production_forward2 grad[36] vs paper_forward: mean_abs=0.8946809768676758, max_abs=6.0, mean_rel=0.16510635614395142, max_rel=1394.935302734375, norm_rel=0.025244809687137604, ref_abs_avg=35.52450942993164, test_abs_avg=35.52488327026367
production_forward2 grad[37] vs paper_forward: mean_abs=0.8743388056755066, max_abs=5.125, mean_rel=0.17129263281822205, max_rel=1368.6451416015625, norm_rel=0.024755246937274933, ref_abs_avg=35.40971755981445, test_abs_avg=35.4175910949707
production_forward2 grad[38] vs paper_forward: mean_abs=0.6916959285736084, max_abs=2.5, mean_rel=0.09251061081886292, max_rel=11.086625099182129, norm_rel=0.025224676355719566, ref_abs_avg=27.198488235473633, test_abs_avg=27.258317947387695
production_forward2 grad[39] vs paper_forward: mean_abs=0.8427994251251221, max_abs=5.5, mean_rel=0.17141814529895782, max_rel=1111.6390380859375, norm_rel=0.02482691779732704, ref_abs_avg=34.01494598388672, test_abs_avg=34.017860412597656
production_forward2 grad[40] vs paper_forward: mean_abs=0.8269331455230713, max_abs=5.25, mean_rel=0.15887847542762756, max_rel=809.5316162109375, norm_rel=0.024678917601704597, ref_abs_avg=33.56377410888672, test_abs_avg=33.568321228027344
production_forward2 grad[41] vs paper_forward: mean_abs=0.680673360824585, max_abs=2.5, mean_rel=0.15637922286987305, max_rel=16.911378860473633, norm_rel=0.024835528805851936, ref_abs_avg=26.673519134521484, test_abs_avg=26.639270782470703
production_forward2 grad[42] vs paper_forward: mean_abs=0.7984885573387146, max_abs=5.0, mean_rel=0.15495449304580688, max_rel=1547.7626953125, norm_rel=0.024719299748539925, ref_abs_avg=32.36856460571289, test_abs_avg=32.37437438964844
production_forward2 grad[43] vs paper_forward: mean_abs=0.7877905368804932, max_abs=5.0, mean_rel=0.15698888897895813, max_rel=1413.315673828125, norm_rel=0.02452709525823593, ref_abs_avg=32.20361328125, test_abs_avg=32.207096099853516
production_forward2 grad[44] vs paper_forward: mean_abs=0.6053664684295654, max_abs=2.875, mean_rel=0.14503291249275208, max_rel=26.141237258911133, norm_rel=0.02451002411544323, ref_abs_avg=25.707618713378906, test_abs_avg=25.70974349975586
production_forward2 grad[45] vs paper_forward: mean_abs=0.7595114707946777, max_abs=5.5, mean_rel=0.16269618272781372, max_rel=1670.6883544921875, norm_rel=0.02431098371744156, ref_abs_avg=31.320575714111328, test_abs_avg=31.324026107788086
production_forward2 grad[46] vs paper_forward: mean_abs=0.7509976029396057, max_abs=4.5, mean_rel=0.1643875539302826, max_rel=1193.8543701171875, norm_rel=0.02419334463775158, ref_abs_avg=31.122512817382812, test_abs_avg=31.123809814453125
production_forward2 grad[47] vs paper_forward: mean_abs=0.6103191375732422, max_abs=2.5, mean_rel=0.08478754758834839, max_rel=7.269555568695068, norm_rel=0.024546895176172256, ref_abs_avg=24.972869873046875, test_abs_avg=24.966745376586914
production_forward2 grad[48] vs paper_forward: mean_abs=0.731781005859375, max_abs=5.125, mean_rel=0.16699081659317017, max_rel=1340.5606689453125, norm_rel=0.024175496771931648, ref_abs_avg=30.285953521728516, test_abs_avg=30.287771224975586
production_forward2 grad[49] vs paper_forward: mean_abs=0.7207295894622803, max_abs=5.0, mean_rel=0.167384535074234, max_rel=584.3239135742188, norm_rel=0.024046748876571655, ref_abs_avg=30.008193969726562, test_abs_avg=30.006790161132812
production_forward2 grad[50] vs paper_forward: mean_abs=0.6682872772216797, max_abs=3.25, mean_rel=0.09398001432418823, max_rel=6.2303786277771, norm_rel=0.026889638975262642, ref_abs_avg=25.07156753540039, test_abs_avg=24.996692657470703
production_forward2 grad[51] vs paper_forward: mean_abs=0.8288816809654236, max_abs=5.5, mean_rel=0.1834477335214615, max_rel=2086.91064453125, norm_rel=0.02615981735289097, ref_abs_avg=31.785259246826172, test_abs_avg=31.78736686706543
production_forward2 grad[52] vs paper_forward: mean_abs=0.8127488493919373, max_abs=5.5, mean_rel=0.1621084213256836, max_rel=1324.3377685546875, norm_rel=0.02557753399014473, ref_abs_avg=31.82931137084961, test_abs_avg=31.82986831665039
production_forward2 grad[53] vs paper_forward: mean_abs=0.622859001159668, max_abs=2.40625, mean_rel=0.08401045948266983, max_rel=5.739230632781982, norm_rel=0.025428835302591324, ref_abs_avg=24.80777359008789, test_abs_avg=24.788135528564453
production_forward2 grad[54] vs paper_forward: mean_abs=0.7579026818275452, max_abs=5.25, mean_rel=0.1805654764175415, max_rel=1080.931884765625, norm_rel=0.0256607998162508, ref_abs_avg=29.565385818481445, test_abs_avg=29.56797981262207
production_forward2 grad[55] vs paper_forward: mean_abs=0.7384685277938843, max_abs=5.5, mean_rel=0.15665993094444275, max_rel=789.8692626953125, norm_rel=0.02521497942507267, ref_abs_avg=29.345653533935547, test_abs_avg=29.343467712402344
production_forward2 grad[56] vs paper_forward: mean_abs=0.5631352663040161, max_abs=2.375, mean_rel=0.1240554079413414, max_rel=14.226482391357422, norm_rel=0.02467501163482666, ref_abs_avg=23.043071746826172, test_abs_avg=23.017934799194336
production_forward2 grad[57] vs paper_forward: mean_abs=0.6985735893249512, max_abs=5.25, mean_rel=0.17501544952392578, max_rel=1176.0970458984375, norm_rel=0.025101350620388985, ref_abs_avg=27.853893280029297, test_abs_avg=27.854446411132812
production_forward2 grad[58] vs paper_forward: mean_abs=0.6860606074333191, max_abs=4.5, mean_rel=0.17202910780906677, max_rel=1093.987548828125, norm_rel=0.025077378377318382, ref_abs_avg=27.412681579589844, test_abs_avg=27.412172317504883
production_forward2 grad[59] vs paper_forward: mean_abs=0.49961090087890625, max_abs=1.875, mean_rel=0.10105016827583313, max_rel=13.660043716430664, norm_rel=0.02131854183971882, ref_abs_avg=23.111724853515625, test_abs_avg=23.097728729248047
production_forward2 grad[60] vs paper_forward: mean_abs=0.6499648094177246, max_abs=4.25, mean_rel=0.16644920408725739, max_rel=1207.540283203125, norm_rel=0.024385401979088783, ref_abs_avg=26.637025833129883, test_abs_avg=26.639156341552734
production_forward2 grad[61] vs paper_forward: mean_abs=0.6287118196487427, max_abs=4.0, mean_rel=0.15592463314533234, max_rel=799.2062377929688, norm_rel=0.023852620273828506, ref_abs_avg=26.320571899414062, test_abs_avg=26.316123962402344
production_forward2 grad[62] vs paper_forward: mean_abs=0.49610090255737305, max_abs=1.875, mean_rel=0.1217721477150917, max_rel=9.733819007873535, norm_rel=0.024912534281611443, ref_abs_avg=19.772724151611328, test_abs_avg=19.72917938232422
production_forward2 grad[63] vs paper_forward: mean_abs=0.611083447933197, max_abs=4.5, mean_rel=0.1633206009864807, max_rel=1467.168701171875, norm_rel=0.02394857630133629, ref_abs_avg=25.520221710205078, test_abs_avg=25.520244598388672
production_forward2 grad[64] vs paper_forward: mean_abs=0.6021437048912048, max_abs=5.375, mean_rel=0.1566399335861206, max_rel=1404.925048828125, norm_rel=0.02382424846291542, ref_abs_avg=25.27552032470703, test_abs_avg=25.28304672241211
production_forward2 grad[65] vs paper_forward: mean_abs=0.4562298059463501, max_abs=1.625, mean_rel=0.11925356090068817, max_rel=19.972776412963867, norm_rel=0.022804800420999527, ref_abs_avg=19.88360595703125, test_abs_avg=19.868064880371094
production_forward2 grad[66] vs paper_forward: mean_abs=0.5764349699020386, max_abs=4.25, mean_rel=0.153776615858078, max_rel=979.475341796875, norm_rel=0.023562580347061157, ref_abs_avg=24.451759338378906, test_abs_avg=24.453418731689453
production_forward2 grad[67] vs paper_forward: mean_abs=0.5697500705718994, max_abs=4.0, mean_rel=0.1549946367740631, max_rel=994.020751953125, norm_rel=0.023263927549123764, ref_abs_avg=24.4775390625, test_abs_avg=24.47732162475586
production_forward2 grad[68] vs paper_forward: mean_abs=0.46483200788497925, max_abs=1.75, mean_rel=0.21203570067882538, max_rel=29.688602447509766, norm_rel=0.024090595543384552, ref_abs_avg=18.980548858642578, test_abs_avg=18.956615447998047
production_forward2 grad[69] vs paper_forward: mean_abs=0.5498090982437134, max_abs=5.0, mean_rel=0.15201708674430847, max_rel=1024.477294921875, norm_rel=0.02304893732070923, ref_abs_avg=23.787677764892578, test_abs_avg=23.786806106567383
production_forward2 grad[70] vs paper_forward: mean_abs=0.5393561124801636, max_abs=4.0, mean_rel=0.14463216066360474, max_rel=577.5713500976562, norm_rel=0.023041008040308952, ref_abs_avg=23.403709411621094, test_abs_avg=23.401695251464844
production_forward2 grad[71] vs paper_forward: mean_abs=0.4170604348182678, max_abs=1.6875, mean_rel=0.08948585391044617, max_rel=8.301017761230469, norm_rel=0.021467462182044983, ref_abs_avg=19.597774505615234, test_abs_avg=19.658308029174805
production_forward2 grad[72] vs paper_forward: mean_abs=0.5255558490753174, max_abs=4.0, mean_rel=0.15133512020111084, max_rel=850.690185546875, norm_rel=0.022724654525518417, ref_abs_avg=23.101943969726562, test_abs_avg=23.103553771972656
production_forward2 grad[73] vs paper_forward: mean_abs=0.5124273300170898, max_abs=3.5, mean_rel=0.13965848088264465, max_rel=385.8189697265625, norm_rel=0.02266189455986023, ref_abs_avg=22.56458854675293, test_abs_avg=22.563255310058594
production_forward2 grad[74] vs paper_forward: mean_abs=0.4922245740890503, max_abs=2.0, mean_rel=0.13729865849018097, max_rel=32.53554153442383, norm_rel=0.023334547877311707, ref_abs_avg=20.98248863220215, test_abs_avg=20.974632263183594
production_forward2 grad[75] vs paper_forward: mean_abs=0.5898417234420776, max_abs=5.0, mean_rel=0.1634202003479004, max_rel=1934.969482421875, norm_rel=0.023847103118896484, ref_abs_avg=24.715063095092773, test_abs_avg=24.714981079101562
production_forward2 grad[76] vs paper_forward: mean_abs=0.5783233046531677, max_abs=5.5, mean_rel=0.15310677886009216, max_rel=695.4952392578125, norm_rel=0.024020997807383537, ref_abs_avg=24.164119720458984, test_abs_avg=24.15865135192871
production_forward2 grad[77] vs paper_forward: mean_abs=0.4102039337158203, max_abs=1.75, mean_rel=0.10194940119981766, max_rel=13.675996780395508, norm_rel=0.022868212312459946, ref_abs_avg=18.276769638061523, test_abs_avg=18.263595581054688
production_forward2 grad[78] vs paper_forward: mean_abs=0.5361894369125366, max_abs=4.0, mean_rel=0.15318673849105835, max_rel=1154.255126953125, norm_rel=0.023277463391423225, ref_abs_avg=23.033334732055664, test_abs_avg=23.03314208984375
production_forward2 grad[79] vs paper_forward: mean_abs=0.5298696160316467, max_abs=4.0, mean_rel=0.1390269696712494, max_rel=327.45709228515625, norm_rel=0.02320980653166771, ref_abs_avg=22.854938507080078, test_abs_avg=22.859291076660156
production_forward2 grad[80] vs paper_forward: mean_abs=0.41576671600341797, max_abs=1.373046875, mean_rel=0.08229223638772964, max_rel=3.301257848739624, norm_rel=0.022500278428196907, ref_abs_avg=18.228397369384766, test_abs_avg=18.25340461730957
production_forward2 grad[81] vs paper_forward: mean_abs=0.5020014643669128, max_abs=4.0, mean_rel=0.1414555460214615, max_rel=1138.8638916015625, norm_rel=0.022627530619502068, ref_abs_avg=22.166339874267578, test_abs_avg=22.16617774963379
production_forward2 grad[82] vs paper_forward: mean_abs=0.4871331453323364, max_abs=4.0, mean_rel=0.15086497366428375, max_rel=564.8643188476562, norm_rel=0.02228192239999771, ref_abs_avg=21.817258834838867, test_abs_avg=21.815549850463867
production_forward2 grad[83] vs paper_forward: mean_abs=0.36673736572265625, max_abs=1.59375, mean_rel=0.06107418239116669, max_rel=2.6807332038879395, norm_rel=0.02133041061460972, ref_abs_avg=17.619945526123047, test_abs_avg=17.585575103759766
production_forward2 grad[84] vs paper_forward: mean_abs=0.4683309495449066, max_abs=4.125, mean_rel=0.14890338480472565, max_rel=1476.5357666015625, norm_rel=0.022311076521873474, ref_abs_avg=21.006040573120117, test_abs_avg=21.00797462463379
production_forward2 grad[85] vs paper_forward: mean_abs=0.4526609480381012, max_abs=3.75, mean_rel=0.12823373079299927, max_rel=374.6114501953125, norm_rel=0.02175523154437542, ref_abs_avg=20.86577033996582, test_abs_avg=20.861297607421875
production_forward2 grad[86] vs paper_forward: mean_abs=0.3630361557006836, max_abs=1.25, mean_rel=0.07431980967521667, max_rel=2.736611843109131, norm_rel=0.021649008616805077, ref_abs_avg=16.83729362487793, test_abs_avg=16.856014251708984
production_forward2 grad[87] vs paper_forward: mean_abs=0.43639037013053894, max_abs=4.0, mean_rel=0.1381918489933014, max_rel=889.7752685546875, norm_rel=0.021617325022816658, ref_abs_avg=20.244394302368164, test_abs_avg=20.244503021240234
production_forward2 grad[88] vs paper_forward: mean_abs=0.4264582395553589, max_abs=4.5, mean_rel=0.1259411871433258, max_rel=431.4689025878906, norm_rel=0.021054798737168312, ref_abs_avg=20.304956436157227, test_abs_avg=20.30763053894043
production_forward2 grad[89] vs paper_forward: mean_abs=0.3538796603679657, max_abs=1.25, mean_rel=0.07437001168727875, max_rel=9.59605884552002, norm_rel=0.020767638459801674, ref_abs_avg=16.879880905151367, test_abs_avg=16.847694396972656
production_forward2 grad[90] vs paper_forward: mean_abs=0.4071503281593323, max_abs=4.25, mean_rel=0.12904274463653564, max_rel=541.6814575195312, norm_rel=0.020970875397324562, ref_abs_avg=19.546546936035156, test_abs_avg=19.54680633544922
production_forward2 grad[91] vs paper_forward: mean_abs=0.40353745222091675, max_abs=4.0, mean_rel=0.12914690375328064, max_rel=694.900390625, norm_rel=0.021180907264351845, ref_abs_avg=19.25585174560547, test_abs_avg=19.24994659423828
production_forward2 grad[92] vs paper_forward: mean_abs=0.31215357780456543, max_abs=1.375, mean_rel=0.08073952794075012, max_rel=7.452483654022217, norm_rel=0.01966635137796402, ref_abs_avg=16.007877349853516, test_abs_avg=16.01376724243164
production_forward2 grad[93] vs paper_forward: mean_abs=0.3863708972930908, max_abs=4.625, mean_rel=0.12581023573875427, max_rel=946.2033081054688, norm_rel=0.020647911354899406, ref_abs_avg=18.8951416015625, test_abs_avg=18.895877838134766
production_forward2 grad[94] vs paper_forward: mean_abs=0.38296857476234436, max_abs=3.5, mean_rel=0.12908343970775604, max_rel=645.4863891601562, norm_rel=0.021140826866030693, ref_abs_avg=18.426490783691406, test_abs_avg=18.41619873046875
production_forward2 grad[95] vs paper_forward: mean_abs=0.33185720443725586, max_abs=1.1875, mean_rel=0.13022646307945251, max_rel=19.846405029296875, norm_rel=0.02084497921168804, ref_abs_avg=16.0375919342041, test_abs_avg=16.01487159729004
production_forward2 grad[96] vs paper_forward: mean_abs=0.37774986028671265, max_abs=4.75, mean_rel=0.12338448315858841, max_rel=496.4707946777344, norm_rel=0.020128177478909492, ref_abs_avg=19.048044204711914, test_abs_avg=19.048324584960938
production_forward2 grad[97] vs paper_forward: mean_abs=0.35328209400177, max_abs=3.75, mean_rel=0.12353117763996124, max_rel=823.4065551757812, norm_rel=0.01977241411805153, ref_abs_avg=18.24317169189453, test_abs_avg=18.237384796142578
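The statistics reported on each comparison line above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) can be reproduced by a small helper. This is a hedged pure-Python sketch, not the actual harness: the real code presumably operates on GPU tensors, and the `eps` guard and flattening are assumptions.

```python
def grad_diff_stats(ref, test, eps=1e-12):
    """Compare a reference and a test gradient, given as flat lists of floats.

    Returns the statistics the log reports: mean/max absolute error,
    mean/max elementwise relative error (relative to |ref|), a normwise
    relative error, and the mean |value| of each tensor.
    """
    n = len(ref)
    diffs = [abs(r - t) for r, t in zip(ref, test)]
    rels = [d / (abs(r) + eps) for d, r in zip(diffs, ref)]
    l2 = lambda xs: sum(x * x for x in xs) ** 0.5
    return {
        "mean_abs": sum(diffs) / n,
        "max_abs": max(diffs),
        "mean_rel": sum(rels) / n,
        "max_rel": max(rels),
        "norm_rel": l2(diffs) / (l2(ref) + eps),
        "ref_abs_avg": sum(abs(r) for r in ref) / n,
        "test_abs_avg": sum(abs(t) for t in test) / n,
    }
```

Read this way, the log's pattern makes sense: norm_rel stays near 0.02 throughout (aggregate agreement at roughly half-precision accuracy), while max_rel occasionally spikes into the hundreds or thousands on entries where the reference value is close to zero, so the elementwise denominator is tiny.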
identity layers + randn queries
production_forward fwd+bwd:  126.721 ms
production_forward bwd-only: 106.302 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=7.868 GiB
production_forward peak reserved:  fwd=3.615 GiB, fwd+bwd=8.865 GiB
production_forward2 fwd+bwd:  224.338 ms
production_forward2 bwd-only: 202.106 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.240 GiB, fwd+bwd=8.990 GiB
paper_forward fwd+bwd:  379.729 ms
paper_forward bwd-only: 293.996 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB
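The peak-memory lines above report GiB values; the usual sources for these numbers are `torch.cuda.max_memory_allocated()` and `torch.cuda.max_memory_reserved()`, which return bytes (with `torch.cuda.reset_peak_memory_stats()` called between the fwd and fwd+bwd measurements). A minimal formatting sketch, with the report layout inferred from the log:

```python
def gib(nbytes):
    """Format a byte count as GiB with three decimals, matching the log."""
    return f"{nbytes / 2**30:.3f} GiB"

def memory_report(name, fwd_alloc, total_alloc, fwd_res, total_res):
    """Render the two peak-memory lines the benchmark prints per variant."""
    return (
        f"{name} peak allocated: fwd={gib(fwd_alloc)}, fwd+bwd={gib(total_alloc)}\n"
        f"{name} peak reserved:  fwd={gib(fwd_res)}, fwd+bwd={gib(total_res)}"
    )
```

The gap between the numbers is the point of the table: paper_forward materializes roughly 30 GiB in the forward pass alone, while both production variants stay under 4 GiB, trading a ~1.7-3x slowdown band for an order-of-magnitude memory reduction.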

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016820961609482765, max_abs=0.0859375
production_forward grad[0] vs paper_forward: mean_abs=0.008568214252591133, max_abs=0.484375, mean_rel=0.07304348796606064, max_rel=100.99864959716797, norm_rel=0.019987955689430237, ref_abs_avg=0.46524301171302795, test_abs_avg=0.4652577042579651
production_forward grad[1] vs paper_forward: mean_abs=7.437660217285156, max_abs=64.0, mean_rel=0.16136576235294342, max_rel=530.7662963867188, norm_rel=0.02036156877875328, ref_abs_avg=322.52032470703125, test_abs_avg=322.59942626953125
production_forward grad[2] vs paper_forward: mean_abs=1.3257884979248047, max_abs=5.5, mean_rel=0.0782962366938591, max_rel=4.22437047958374, norm_rel=0.02477269247174263, ref_abs_avg=53.421268463134766, test_abs_avg=53.24395751953125
production_forward grad[3] vs paper_forward: mean_abs=1.5577000379562378, max_abs=11.0, mean_rel=0.18008020520210266, max_rel=4028.9189453125, norm_rel=0.023747071623802185, ref_abs_avg=65.8941650390625, test_abs_avg=65.89947509765625
production_forward grad[4] vs paper_forward: mean_abs=1.526275634765625, max_abs=11.0, mean_rel=0.15469223260879517, max_rel=2137.68359375, norm_rel=0.023636702448129654, ref_abs_avg=64.90705871582031, test_abs_avg=64.90762329101562
production_forward grad[5] vs paper_forward: mean_abs=1.0850486755371094, max_abs=4.59375, mean_rel=0.08801986277103424, max_rel=5.339099884033203, norm_rel=0.02231239527463913, ref_abs_avg=49.716217041015625, test_abs_avg=49.68996047973633
production_forward grad[6] vs paper_forward: mean_abs=1.3774125576019287, max_abs=9.5, mean_rel=0.1722404956817627, max_rel=2060.615234375, norm_rel=0.023543652147054672, ref_abs_avg=58.79616928100586, test_abs_avg=58.80021286010742
production_forward grad[7] vs paper_forward: mean_abs=1.3384876251220703, max_abs=8.0, mean_rel=0.1591777801513672, max_rel=929.7014770507812, norm_rel=0.023266959935426712, ref_abs_avg=57.95061492919922, test_abs_avg=57.950191497802734
production_forward grad[8] vs paper_forward: mean_abs=1.0336952209472656, max_abs=3.59375, mean_rel=0.10112292319536209, max_rel=5.783870697021484, norm_rel=0.023651104420423508, ref_abs_avg=43.2030029296875, test_abs_avg=43.30338668823242
production_forward grad[9] vs paper_forward: mean_abs=1.259847640991211, max_abs=8.0, mean_rel=0.16951674222946167, max_rel=1583.2423095703125, norm_rel=0.02329752780497074, ref_abs_avg=54.316749572753906, test_abs_avg=54.3196907043457
production_forward grad[10] vs paper_forward: mean_abs=1.2339887619018555, max_abs=8.0, mean_rel=0.16446484625339508, max_rel=1587.330078125, norm_rel=0.023110244423151016, ref_abs_avg=53.68578338623047, test_abs_avg=53.69240951538086
production_forward grad[11] vs paper_forward: mean_abs=1.0230255126953125, max_abs=5.25, mean_rel=0.09449882805347443, max_rel=5.192941665649414, norm_rel=0.025179564952850342, ref_abs_avg=41.62025451660156, test_abs_avg=41.621498107910156
production_forward grad[12] vs paper_forward: mean_abs=1.1538426876068115, max_abs=7.5, mean_rel=0.15339955687522888, max_rel=1105.568603515625, norm_rel=0.02311488799750805, ref_abs_avg=50.16999435424805, test_abs_avg=50.174293518066406
production_forward grad[13] vs paper_forward: mean_abs=1.135871171951294, max_abs=6.322265625, mean_rel=0.17063641548156738, max_rel=1828.2779541015625, norm_rel=0.022904502227902412, ref_abs_avg=49.79469299316406, test_abs_avg=49.79216766357422
production_forward grad[14] vs paper_forward: mean_abs=0.8847131729125977, max_abs=3.5, mean_rel=0.07328912615776062, max_rel=2.195005178451538, norm_rel=0.02414202131330967, ref_abs_avg=36.50392150878906, test_abs_avg=36.49293899536133
production_forward grad[15] vs paper_forward: mean_abs=1.0759713649749756, max_abs=6.75, mean_rel=0.15389415621757507, max_rel=866.635009765625, norm_rel=0.02301614359021187, ref_abs_avg=46.99489974975586, test_abs_avg=46.99531555175781
production_forward grad[16] vs paper_forward: mean_abs=1.0618056058883667, max_abs=6.75, mean_rel=0.15538716316223145, max_rel=1378.75830078125, norm_rel=0.022875845432281494, ref_abs_avg=46.65675735473633, test_abs_avg=46.663673400878906
production_forward grad[17] vs paper_forward: mean_abs=0.7731719017028809, max_abs=3.375, mean_rel=0.15773272514343262, max_rel=44.49634552001953, norm_rel=0.022027524188160896, ref_abs_avg=35.93389129638672, test_abs_avg=35.987823486328125
production_forward grad[18] vs paper_forward: mean_abs=1.0204343795776367, max_abs=7.09375, mean_rel=0.16042032837867737, max_rel=1076.5892333984375, norm_rel=0.022886740043759346, ref_abs_avg=44.813438415527344, test_abs_avg=44.81504440307617
production_forward grad[19] vs paper_forward: mean_abs=1.0016882419586182, max_abs=6.0, mean_rel=0.1690407544374466, max_rel=1947.43359375, norm_rel=0.02269872836768627, ref_abs_avg=44.32984161376953, test_abs_avg=44.330482482910156
production_forward grad[20] vs paper_forward: mean_abs=0.732258677482605, max_abs=2.75, mean_rel=0.6059631705284119, max_rel=278.1405334472656, norm_rel=0.021909374743700027, ref_abs_avg=33.394718170166016, test_abs_avg=33.36488342285156
production_forward grad[21] vs paper_forward: mean_abs=0.9698789715766907, max_abs=6.0, mean_rel=0.16265982389450073, max_rel=2343.172607421875, norm_rel=0.02274220436811447, ref_abs_avg=42.81679153442383, test_abs_avg=42.81600570678711
production_forward grad[22] vs paper_forward: mean_abs=0.9452978372573853, max_abs=6.0, mean_rel=0.15470853447914124, max_rel=778.97216796875, norm_rel=0.022575819864869118, ref_abs_avg=42.06793975830078, test_abs_avg=42.067787170410156
production_forward grad[23] vs paper_forward: mean_abs=0.7596273422241211, max_abs=2.625, mean_rel=0.07305736839771271, max_rel=5.884166717529297, norm_rel=0.02184753306210041, ref_abs_avg=35.324607849121094, test_abs_avg=35.277244567871094
production_forward grad[24] vs paper_forward: mean_abs=0.9303935766220093, max_abs=6.5, mean_rel=0.16003704071044922, max_rel=943.3816528320312, norm_rel=0.02272637188434601, ref_abs_avg=41.13105773925781, test_abs_avg=41.132568359375
production_forward grad[25] vs paper_forward: mean_abs=0.9087162017822266, max_abs=5.875, mean_rel=0.1561742126941681, max_rel=924.517333984375, norm_rel=0.022549157962203026, ref_abs_avg=40.52745819091797, test_abs_avg=40.53108596801758
production_forward grad[26] vs paper_forward: mean_abs=0.8604297637939453, max_abs=3.25, mean_rel=0.12566661834716797, max_rel=11.733548164367676, norm_rel=0.02303742803633213, ref_abs_avg=37.24034881591797, test_abs_avg=37.25697708129883
production_forward grad[27] vs paper_forward: mean_abs=1.0835368633270264, max_abs=7.5, mean_rel=0.1653217226266861, max_rel=1303.0550537109375, norm_rel=0.024712208658456802, ref_abs_avg=44.051025390625, test_abs_avg=44.0545654296875
production_forward grad[28] vs paper_forward: mean_abs=1.0509628057479858, max_abs=6.5, mean_rel=0.15427541732788086, max_rel=815.8631591796875, norm_rel=0.024399857968091965, ref_abs_avg=43.288578033447266, test_abs_avg=43.29039001464844
production_forward grad[29] vs paper_forward: mean_abs=0.8077459335327148, max_abs=3.375, mean_rel=0.09044010937213898, max_rel=5.811342239379883, norm_rel=0.024931926280260086, ref_abs_avg=32.89411163330078, test_abs_avg=32.86835479736328
production_forward grad[30] vs paper_forward: mean_abs=1.0120105743408203, max_abs=7.0, mean_rel=0.16644582152366638, max_rel=1363.341064453125, norm_rel=0.025070374831557274, ref_abs_avg=40.4921875, test_abs_avg=40.49402618408203
production_forward grad[31] vs paper_forward: mean_abs=0.9964011907577515, max_abs=6.125, mean_rel=0.1823202669620514, max_rel=2591.522705078125, norm_rel=0.02497941441833973, ref_abs_avg=40.04267120361328, test_abs_avg=40.04616928100586
production_forward grad[32] vs paper_forward: mean_abs=0.7549962997436523, max_abs=3.25, mean_rel=0.1532375067472458, max_rel=34.540443420410156, norm_rel=0.024768393486738205, ref_abs_avg=30.916309356689453, test_abs_avg=30.867290496826172
production_forward grad[33] vs paper_forward: mean_abs=0.9329124689102173, max_abs=6.625, mean_rel=0.16790294647216797, max_rel=1044.8631591796875, norm_rel=0.024927901104092598, ref_abs_avg=37.558570861816406, test_abs_avg=37.559715270996094
production_forward grad[34] vs paper_forward: mean_abs=0.9191275835037231, max_abs=5.5, mean_rel=0.17741000652313232, max_rel=1079.6976318359375, norm_rel=0.02460414357483387, ref_abs_avg=37.496036529541016, test_abs_avg=37.500144958496094
production_forward grad[35] vs paper_forward: mean_abs=0.725027322769165, max_abs=2.75, mean_rel=0.2258382886648178, max_rel=76.16470336914062, norm_rel=0.024927319958806038, ref_abs_avg=29.222278594970703, test_abs_avg=29.259363174438477
production_forward grad[36] vs paper_forward: mean_abs=0.8742406368255615, max_abs=6.1875, mean_rel=0.16270095109939575, max_rel=1291.6146240234375, norm_rel=0.024596426635980606, ref_abs_avg=35.63239288330078, test_abs_avg=35.63532257080078
production_forward grad[37] vs paper_forward: mean_abs=0.859063982963562, max_abs=5.25, mean_rel=0.18705733120441437, max_rel=1677.3092041015625, norm_rel=0.024232057854533195, ref_abs_avg=35.490840911865234, test_abs_avg=35.49141311645508
production_forward grad[38] vs paper_forward: mean_abs=0.675724983215332, max_abs=2.75, mean_rel=0.13623563945293427, max_rel=19.319664001464844, norm_rel=0.02410014718770981, ref_abs_avg=28.758941650390625, test_abs_avg=28.713232040405273
production_forward grad[39] vs paper_forward: mean_abs=0.827082633972168, max_abs=5.5, mean_rel=0.16336724162101746, max_rel=1804.9107666015625, norm_rel=0.02426314540207386, ref_abs_avg=34.17284393310547, test_abs_avg=34.17841339111328
production_forward grad[40] vs paper_forward: mean_abs=0.8094398379325867, max_abs=5.25, mean_rel=0.15686960518360138, max_rel=890.447265625, norm_rel=0.024262312799692154, ref_abs_avg=33.4842529296875, test_abs_avg=33.48369598388672
production_forward grad[41] vs paper_forward: mean_abs=0.6694364547729492, max_abs=2.375, mean_rel=0.13874730467796326, max_rel=17.412837982177734, norm_rel=0.025947388261556625, ref_abs_avg=25.334026336669922, test_abs_avg=25.343563079833984
production_forward grad[42] vs paper_forward: mean_abs=0.7783033847808838, max_abs=5.125, mean_rel=0.15646816790103912, max_rel=942.6016235351562, norm_rel=0.02426878921687603, ref_abs_avg=32.18136215209961, test_abs_avg=32.18020248413086
production_forward grad[43] vs paper_forward: mean_abs=0.7702073454856873, max_abs=4.625, mean_rel=0.16233980655670166, max_rel=1003.8763427734375, norm_rel=0.024053413420915604, ref_abs_avg=32.146888732910156, test_abs_avg=32.145721435546875
production_forward grad[44] vs paper_forward: mean_abs=0.5933628082275391, max_abs=2.78515625, mean_rel=0.07606258988380432, max_rel=3.4737110137939453, norm_rel=0.022701449692249298, ref_abs_avg=26.568052291870117, test_abs_avg=26.56072235107422
production_forward grad[45] vs paper_forward: mean_abs=0.7507379055023193, max_abs=5.5, mean_rel=0.153892382979393, max_rel=1041.8743896484375, norm_rel=0.023908177390694618, ref_abs_avg=31.437137603759766, test_abs_avg=31.438255310058594
production_forward grad[46] vs paper_forward: mean_abs=0.7348200082778931, max_abs=5.0, mean_rel=0.1479155719280243, max_rel=494.2731018066406, norm_rel=0.02410515770316124, ref_abs_avg=30.633237838745117, test_abs_avg=30.633689880371094
production_forward grad[47] vs paper_forward: mean_abs=0.5704813003540039, max_abs=2.25, mean_rel=0.06550101190805435, max_rel=3.4534859657287598, norm_rel=0.023959118872880936, ref_abs_avg=24.46044158935547, test_abs_avg=24.450767517089844
production_forward grad[48] vs paper_forward: mean_abs=0.7140445709228516, max_abs=4.5, mean_rel=0.1662064492702484, max_rel=1944.689453125, norm_rel=0.02383234165608883, ref_abs_avg=29.99386215209961, test_abs_avg=29.99703598022461
production_forward grad[49] vs paper_forward: mean_abs=0.6962897777557373, max_abs=4.5, mean_rel=0.16496196389198303, max_rel=1310.5535888671875, norm_rel=0.023626232519745827, ref_abs_avg=29.54489517211914, test_abs_avg=29.540298461914062
production_forward grad[50] vs paper_forward: mean_abs=0.6571967601776123, max_abs=3.0, mean_rel=0.1502155065536499, max_rel=12.149027824401855, norm_rel=0.025202423334121704, ref_abs_avg=25.763437271118164, test_abs_avg=25.750152587890625
production_forward grad[51] vs paper_forward: mean_abs=0.8094774484634399, max_abs=5.25, mean_rel=0.16592255234718323, max_rel=1234.00830078125, norm_rel=0.02486594021320343, ref_abs_avg=32.60972595214844, test_abs_avg=32.61236572265625
production_forward grad[52] vs paper_forward: mean_abs=0.7865681052207947, max_abs=5.0, mean_rel=0.15229883790016174, max_rel=422.87158203125, norm_rel=0.0246945358812809, ref_abs_avg=31.959857940673828, test_abs_avg=31.96051025390625
production_forward grad[53] vs paper_forward: mean_abs=0.5899462699890137, max_abs=2.5, mean_rel=0.15643563866615295, max_rel=12.486084938049316, norm_rel=0.023879479616880417, ref_abs_avg=24.76683807373047, test_abs_avg=24.753559112548828
production_forward grad[54] vs paper_forward: mean_abs=0.7429358959197998, max_abs=5.0, mean_rel=0.16518545150756836, max_rel=796.9097900390625, norm_rel=0.024677645415067673, ref_abs_avg=30.13473129272461, test_abs_avg=30.13488006591797
production_forward grad[55] vs paper_forward: mean_abs=0.7262692451477051, max_abs=5.0, mean_rel=0.15059319138526917, max_rel=585.4270629882812, norm_rel=0.024312758818268776, ref_abs_avg=29.86803436279297, test_abs_avg=29.864376068115234
production_forward grad[56] vs paper_forward: mean_abs=0.5806765556335449, max_abs=2.25, mean_rel=0.17740795016288757, max_rel=25.621488571166992, norm_rel=0.02541579306125641, ref_abs_avg=22.615304946899414, test_abs_avg=22.58190155029297
production_forward grad[57] vs paper_forward: mean_abs=0.6897568106651306, max_abs=6.0, mean_rel=0.16384698450565338, max_rel=1251.4498291015625, norm_rel=0.024156970903277397, ref_abs_avg=28.594741821289062, test_abs_avg=28.595035552978516
production_forward grad[58] vs paper_forward: mean_abs=0.6788959503173828, max_abs=5.0, mean_rel=0.16472218930721283, max_rel=1131.58544921875, norm_rel=0.024069322273135185, ref_abs_avg=28.283002853393555, test_abs_avg=28.283864974975586
production_forward grad[59] vs paper_forward: mean_abs=0.5137395262718201, max_abs=2.1875, mean_rel=0.46179091930389404, max_rel=189.42935180664062, norm_rel=0.023065363988280296, ref_abs_avg=22.847686767578125, test_abs_avg=22.87141227722168
production_forward grad[60] vs paper_forward: mean_abs=0.6455360054969788, max_abs=6.0, mean_rel=0.15934419631958008, max_rel=1773.4158935546875, norm_rel=0.02351296693086624, ref_abs_avg=27.45551300048828, test_abs_avg=27.4571590423584
production_forward grad[61] vs paper_forward: mean_abs=0.6340899467468262, max_abs=4.5, mean_rel=0.15797880291938782, max_rel=668.0254516601562, norm_rel=0.023739084601402283, ref_abs_avg=26.777164459228516, test_abs_avg=26.773452758789062
production_forward grad[62] vs paper_forward: mean_abs=0.4946300983428955, max_abs=1.75, mean_rel=0.10800080001354218, max_rel=8.272940635681152, norm_rel=0.02228495664894581, ref_abs_avg=22.54256248474121, test_abs_avg=22.54253387451172
production_forward grad[63] vs paper_forward: mean_abs=0.6071977019309998, max_abs=5.0, mean_rel=0.15407925844192505, max_rel=682.1332397460938, norm_rel=0.023183992132544518, ref_abs_avg=26.168624877929688, test_abs_avg=26.169410705566406
production_forward grad[64] vs paper_forward: mean_abs=0.5931186676025391, max_abs=4.5, mean_rel=0.15772472321987152, max_rel=899.8643798828125, norm_rel=0.023191798478364944, ref_abs_avg=25.611127853393555, test_abs_avg=25.60932731628418
production_forward grad[65] vs paper_forward: mean_abs=0.49445563554763794, max_abs=2.5, mean_rel=0.375313937664032, max_rel=132.0709228515625, norm_rel=0.02335522696375847, ref_abs_avg=21.69015121459961, test_abs_avg=21.71408462524414
production_forward grad[66] vs paper_forward: mean_abs=0.5785109996795654, max_abs=4.25, mean_rel=0.15188482403755188, max_rel=951.3046264648438, norm_rel=0.022747594863176346, ref_abs_avg=25.395645141601562, test_abs_avg=25.397085189819336
production_forward grad[67] vs paper_forward: mean_abs=0.5596446990966797, max_abs=4.25, mean_rel=0.14486442506313324, max_rel=759.2935791015625, norm_rel=0.02240588888525963, ref_abs_avg=24.89643096923828, test_abs_avg=24.896787643432617
production_forward grad[68] vs paper_forward: mean_abs=0.44203734397888184, max_abs=1.75, mean_rel=0.07190313190221786, max_rel=1.9417357444763184, norm_rel=0.02221446856856346, ref_abs_avg=19.432865142822266, test_abs_avg=19.395530700683594
production_forward grad[69] vs paper_forward: mean_abs=0.5460543632507324, max_abs=4.25, mean_rel=0.14175769686698914, max_rel=1051.8282470703125, norm_rel=0.022366032004356384, ref_abs_avg=24.35999298095703, test_abs_avg=24.359519958496094
production_forward grad[70] vs paper_forward: mean_abs=0.5301939249038696, max_abs=4.25, mean_rel=0.1432150900363922, max_rel=659.5297241210938, norm_rel=0.02218589559197426, ref_abs_avg=23.842426300048828, test_abs_avg=23.83563804626465
production_forward grad[71] vs paper_forward: mean_abs=0.4103791117668152, max_abs=1.625, mean_rel=0.40949034690856934, max_rel=156.60182189941406, norm_rel=0.02094062976539135, ref_abs_avg=19.687633514404297, test_abs_avg=19.688732147216797
production_forward grad[72] vs paper_forward: mean_abs=0.5208470821380615, max_abs=4.5, mean_rel=0.14769496023654938, max_rel=1657.82568359375, norm_rel=0.022123944014310837, ref_abs_avg=23.483463287353516, test_abs_avg=23.486011505126953
production_forward grad[73] vs paper_forward: mean_abs=0.5120419263839722, max_abs=3.75, mean_rel=0.14604002237319946, max_rel=1296.18798828125, norm_rel=0.022146277129650116, ref_abs_avg=23.14725112915039, test_abs_avg=23.15387725830078
production_forward grad[74] vs paper_forward: mean_abs=0.46886491775512695, max_abs=2.09375, mean_rel=0.17912057042121887, max_rel=25.989831924438477, norm_rel=0.023718401789665222, ref_abs_avg=19.71849822998047, test_abs_avg=19.684810638427734
production_forward grad[75] vs paper_forward: mean_abs=0.5639331340789795, max_abs=4.5, mean_rel=0.15021492540836334, max_rel=709.6846923828125, norm_rel=0.02363904006779194, ref_abs_avg=23.87097930908203, test_abs_avg=23.871816635131836
production_forward grad[76] vs paper_forward: mean_abs=0.5454604029655457, max_abs=4.5, mean_rel=0.1584429293870926, max_rel=751.5864868164062, norm_rel=0.023873208090662956, ref_abs_avg=22.925540924072266, test_abs_avg=22.929941177368164
production_forward grad[77] vs paper_forward: mean_abs=0.4113047122955322, max_abs=1.75, mean_rel=0.09352421760559082, max_rel=8.42152214050293, norm_rel=0.02289959043264389, ref_abs_avg=18.294147491455078, test_abs_avg=18.297700881958008
production_forward grad[78] vs paper_forward: mean_abs=0.5051392912864685, max_abs=4.25, mean_rel=0.14629584550857544, max_rel=809.0137329101562, norm_rel=0.023354295641183853, ref_abs_avg=21.659976959228516, test_abs_avg=21.66156768798828
production_forward grad[79] vs paper_forward: mean_abs=0.504896342754364, max_abs=3.75, mean_rel=0.1641595959663391, max_rel=1111.421142578125, norm_rel=0.022913826629519463, ref_abs_avg=22.034934997558594, test_abs_avg=22.041606903076172
production_forward grad[80] vs paper_forward: mean_abs=0.41327768564224243, max_abs=1.625, mean_rel=0.12872377038002014, max_rel=16.11601448059082, norm_rel=0.022566333413124084, ref_abs_avg=18.85759925842285, test_abs_avg=18.87358856201172
production_forward grad[81] vs paper_forward: mean_abs=0.49004822969436646, max_abs=4.0, mean_rel=0.13909871876239777, max_rel=591.4596557617188, norm_rel=0.022417224943637848, ref_abs_avg=21.832122802734375, test_abs_avg=21.83296775817871
production_forward grad[82] vs paper_forward: mean_abs=0.46499037742614746, max_abs=4.0, mean_rel=0.1336362212896347, max_rel=433.5435791015625, norm_rel=0.022303925827145576, ref_abs_avg=21.003826141357422, test_abs_avg=21.005596160888672
production_forward grad[83] vs paper_forward: mean_abs=0.37849557399749756, max_abs=1.75, mean_rel=0.08796282112598419, max_rel=7.309277057647705, norm_rel=0.023214098066091537, ref_abs_avg=16.75933074951172, test_abs_avg=16.752132415771484
production_forward grad[84] vs paper_forward: mean_abs=0.45721083879470825, max_abs=5.109375, mean_rel=0.14853815734386444, max_rel=1309.9359130859375, norm_rel=0.022036563605070114, ref_abs_avg=20.78106117248535, test_abs_avg=20.782215118408203
production_forward grad[85] vs paper_forward: mean_abs=0.4416811466217041, max_abs=4.0, mean_rel=0.14819622039794922, max_rel=997.562744140625, norm_rel=0.021970102563500404, ref_abs_avg=20.153518676757812, test_abs_avg=20.161338806152344
production_forward grad[86] vs paper_forward: mean_abs=0.338146448135376, max_abs=1.296875, mean_rel=0.18633565306663513, max_rel=35.97634506225586, norm_rel=0.019377244636416435, ref_abs_avg=17.340560913085938, test_abs_avg=17.35193634033203
production_forward grad[87] vs paper_forward: mean_abs=0.42041754722595215, max_abs=4.0, mean_rel=0.1285051852464676, max_rel=733.7348022460938, norm_rel=0.02139132469892502, ref_abs_avg=19.719942092895508, test_abs_avg=19.719120025634766
production_forward grad[88] vs paper_forward: mean_abs=0.4064018130302429, max_abs=4.5, mean_rel=0.14353258907794952, max_rel=856.595458984375, norm_rel=0.021228771656751633, ref_abs_avg=19.287059783935547, test_abs_avg=19.289833068847656
production_forward grad[89] vs paper_forward: mean_abs=0.33175694942474365, max_abs=1.43359375, mean_rel=0.22137150168418884, max_rel=47.780677795410156, norm_rel=0.021395979449152946, ref_abs_avg=15.550066947937012, test_abs_avg=15.552927017211914
production_forward grad[90] vs paper_forward: mean_abs=0.3957904875278473, max_abs=4.5, mean_rel=0.1320531666278839, max_rel=639.95654296875, norm_rel=0.02103804238140583, ref_abs_avg=18.957012176513672, test_abs_avg=18.958385467529297
production_forward grad[91] vs paper_forward: mean_abs=0.39685845375061035, max_abs=4.375, mean_rel=0.12038563936948776, max_rel=571.4920654296875, norm_rel=0.02100561000406742, ref_abs_avg=19.058778762817383, test_abs_avg=19.064069747924805
production_forward grad[92] vs paper_forward: mean_abs=0.3077600598335266, max_abs=1.25, mean_rel=0.11730258911848068, max_rel=10.47403335571289, norm_rel=0.019664589315652847, ref_abs_avg=16.30659294128418, test_abs_avg=16.303455352783203
production_forward grad[93] vs paper_forward: mean_abs=0.384639173746109, max_abs=5.0, mean_rel=0.1268494874238968, max_rel=623.3812866210938, norm_rel=0.020816104486584663, ref_abs_avg=18.680299758911133, test_abs_avg=18.681007385253906
production_forward grad[94] vs paper_forward: mean_abs=0.37489885091781616, max_abs=4.0, mean_rel=0.12480415403842926, max_rel=432.8782653808594, norm_rel=0.021012306213378906, ref_abs_avg=18.163606643676758, test_abs_avg=18.161535263061523
production_forward grad[95] vs paper_forward: mean_abs=0.2969689965248108, max_abs=1.28125, mean_rel=0.24226710200309753, max_rel=71.99095153808594, norm_rel=0.020371844992041588, ref_abs_avg=14.829898834228516, test_abs_avg=14.829328536987305
production_forward grad[96] vs paper_forward: mean_abs=0.3492930829524994, max_abs=3.5234375, mean_rel=0.12782025337219238, max_rel=537.837890625, norm_rel=0.019995519891381264, ref_abs_avg=17.709074020385742, test_abs_avg=17.70987319946289
production_forward grad[97] vs paper_forward: mean_abs=0.3506707549095154, max_abs=3.125, mean_rel=0.12330993264913559, max_rel=389.4765319824219, norm_rel=0.019971398636698723, ref_abs_avg=17.710805892944336, test_abs_avg=17.7100772857666
production_forward2 vs paper_forward output: mean_abs=0.0016820961609482765, max_abs=0.0859375
production_forward2 grad[0] vs paper_forward: mean_abs=0.008902087807655334, max_abs=0.455078125, mean_rel=0.0755700096487999, max_rel=116.2011489868164, norm_rel=0.020647527649998665, ref_abs_avg=0.46524301171302795, test_abs_avg=0.465243935585022
production_forward2 grad[1] vs paper_forward: mean_abs=7.520091533660889, max_abs=64.0, mean_rel=0.15591393411159515, max_rel=508.993896484375, norm_rel=0.02065141499042511, ref_abs_avg=322.52032470703125, test_abs_avg=322.6023864746094
production_forward2 grad[2] vs paper_forward: mean_abs=1.3407344818115234, max_abs=5.5, mean_rel=0.08108974993228912, max_rel=4.295269012451172, norm_rel=0.02557392232120037, ref_abs_avg=53.421268463134766, test_abs_avg=53.32030487060547
production_forward2 grad[3] vs paper_forward: mean_abs=1.6039762496948242, max_abs=12.0, mean_rel=0.18639546632766724, max_rel=4303.61962890625, norm_rel=0.02445867471396923, ref_abs_avg=65.8941650390625, test_abs_avg=65.89372253417969
production_forward2 grad[4] vs paper_forward: mean_abs=1.5744593143463135, max_abs=12.0, mean_rel=0.160089910030365, max_rel=1737.83447265625, norm_rel=0.024367166683077812, ref_abs_avg=64.90705871582031, test_abs_avg=64.89982604980469
production_forward2 grad[5] vs paper_forward: mean_abs=1.1324310302734375, max_abs=4.25, mean_rel=0.10105350613594055, max_rel=10.754745483398438, norm_rel=0.02314077503979206, ref_abs_avg=49.716217041015625, test_abs_avg=49.69640350341797
production_forward2 grad[6] vs paper_forward: mean_abs=1.41887629032135, max_abs=9.0, mean_rel=0.17890577018260956, max_rel=1773.791748046875, norm_rel=0.024245360866189003, ref_abs_avg=58.79616928100586, test_abs_avg=58.797279357910156
production_forward2 grad[7] vs paper_forward: mean_abs=1.3833541870117188, max_abs=8.0, mean_rel=0.16411077976226807, max_rel=1441.4896240234375, norm_rel=0.024019964039325714, ref_abs_avg=57.95061492919922, test_abs_avg=57.94959259033203
production_forward2 grad[8] vs paper_forward: mean_abs=1.081939697265625, max_abs=3.5, mean_rel=0.09763841331005096, max_rel=3.472041368484497, norm_rel=0.024099884554743767, ref_abs_avg=43.2030029296875, test_abs_avg=43.281150817871094
production_forward2 grad[9] vs paper_forward: mean_abs=1.295773983001709, max_abs=9.0, mean_rel=0.17327028512954712, max_rel=2448.570556640625, norm_rel=0.023953605443239212, ref_abs_avg=54.316749572753906, test_abs_avg=54.31662368774414
production_forward2 grad[10] vs paper_forward: mean_abs=1.2668113708496094, max_abs=7.5, mean_rel=0.17323097586631775, max_rel=1420.2078857421875, norm_rel=0.02373380959033966, ref_abs_avg=53.68578338623047, test_abs_avg=53.69140625
production_forward2 grad[11] vs paper_forward: mean_abs=1.0323047637939453, max_abs=4.75, mean_rel=0.10301069915294647, max_rel=7.451849937438965, norm_rel=0.025031691417098045, ref_abs_avg=41.62025451660156, test_abs_avg=41.63972854614258
production_forward2 grad[12] vs paper_forward: mean_abs=1.1859674453735352, max_abs=8.0, mean_rel=0.15559938549995422, max_rel=1153.897216796875, norm_rel=0.023750701919198036, ref_abs_avg=50.16999435424805, test_abs_avg=50.172908782958984
production_forward2 grad[13] vs paper_forward: mean_abs=1.1673773527145386, max_abs=6.5, mean_rel=0.1682436466217041, max_rel=2430.223388671875, norm_rel=0.02355470135807991, ref_abs_avg=49.79469299316406, test_abs_avg=49.791534423828125
production_forward2 grad[14] vs paper_forward: mean_abs=0.9150428771972656, max_abs=4.0, mean_rel=0.07363758981227875, max_rel=2.65718150138855, norm_rel=0.02483414113521576, ref_abs_avg=36.50392150878906, test_abs_avg=36.47834014892578
production_forward2 grad[15] vs paper_forward: mean_abs=1.1040520668029785, max_abs=7.25, mean_rel=0.16138258576393127, max_rel=1405.0672607421875, norm_rel=0.023610034957528114, ref_abs_avg=46.99489974975586, test_abs_avg=46.993988037109375
production_forward2 grad[16] vs paper_forward: mean_abs=1.087806224822998, max_abs=6.625, mean_rel=0.16012042760849, max_rel=1277.1986083984375, norm_rel=0.023423215374350548, ref_abs_avg=46.65675735473633, test_abs_avg=46.66188049316406
production_forward2 grad[17] vs paper_forward: mean_abs=0.8171348571777344, max_abs=3.125, mean_rel=0.15524548292160034, max_rel=42.374088287353516, norm_rel=0.023223813623189926, ref_abs_avg=35.93389129638672, test_abs_avg=36.008968353271484
production_forward2 grad[18] vs paper_forward: mean_abs=1.0457364320755005, max_abs=6.84375, mean_rel=0.1643594205379486, max_rel=1169.39892578125, norm_rel=0.023451635614037514, ref_abs_avg=44.813438415527344, test_abs_avg=44.8111572265625
production_forward2 grad[19] vs paper_forward: mean_abs=1.023056983947754, max_abs=6.0, mean_rel=0.17470264434814453, max_rel=1741.4649658203125, norm_rel=0.023178789764642715, ref_abs_avg=44.32984161376953, test_abs_avg=44.33348846435547
production_forward2 grad[20] vs paper_forward: mean_abs=0.7738200426101685, max_abs=3.5, mean_rel=0.5463393330574036, max_rel=247.55038452148438, norm_rel=0.023155709728598595, ref_abs_avg=33.394718170166016, test_abs_avg=33.36867141723633
production_forward2 grad[21] vs paper_forward: mean_abs=0.9913192391395569, max_abs=6.25, mean_rel=0.16345342993736267, max_rel=2435.051513671875, norm_rel=0.023232437670230865, ref_abs_avg=42.81679153442383, test_abs_avg=42.816246032714844
production_forward2 grad[22] vs paper_forward: mean_abs=0.9679818153381348, max_abs=5.6514892578125, mean_rel=0.1621173918247223, max_rel=844.7191772460938, norm_rel=0.023107662796974182, ref_abs_avg=42.06793975830078, test_abs_avg=42.06752395629883
production_forward2 grad[23] vs paper_forward: mean_abs=0.7631216049194336, max_abs=3.125, mean_rel=0.08339399099349976, max_rel=11.25601863861084, norm_rel=0.02265346795320511, ref_abs_avg=35.324607849121094, test_abs_avg=35.31226348876953
production_forward2 grad[24] vs paper_forward: mean_abs=0.9506402611732483, max_abs=7.0, mean_rel=0.1632114052772522, max_rel=1106.4039306640625, norm_rel=0.023189295083284378, ref_abs_avg=41.13105773925781, test_abs_avg=41.13180923461914
production_forward2 grad[25] vs paper_forward: mean_abs=0.9281258583068848, max_abs=6.0, mean_rel=0.15891841053962708, max_rel=1047.2333984375, norm_rel=0.023029109463095665, ref_abs_avg=40.52745819091797, test_abs_avg=40.53049850463867
production_forward2 grad[26] vs paper_forward: mean_abs=0.8821916580200195, max_abs=3.75, mean_rel=0.12121027708053589, max_rel=13.980368614196777, norm_rel=0.023976877331733704, ref_abs_avg=37.24034881591797, test_abs_avg=37.265220642089844
production_forward2 grad[27] vs paper_forward: mean_abs=1.1071455478668213, max_abs=7.0, mean_rel=0.16961801052093506, max_rel=1346.0875244140625, norm_rel=0.025241075083613396, ref_abs_avg=44.051025390625, test_abs_avg=44.05144500732422
production_forward2 grad[28] vs paper_forward: mean_abs=1.078810214996338, max_abs=6.5, mean_rel=0.1610579788684845, max_rel=1190.919189453125, norm_rel=0.025025324895977974, ref_abs_avg=43.288578033447266, test_abs_avg=43.291473388671875
production_forward2 grad[29] vs paper_forward: mean_abs=0.8398056030273438, max_abs=3.75, mean_rel=0.10782276093959808, max_rel=9.329155921936035, norm_rel=0.025938082486391068, ref_abs_avg=32.89411163330078, test_abs_avg=32.8688850402832
production_forward2 grad[30] vs paper_forward: mean_abs=1.0316908359527588, max_abs=6.8984375, mean_rel=0.17033492028713226, max_rel=1194.8406982421875, norm_rel=0.025545595213770866, ref_abs_avg=40.4921875, test_abs_avg=40.493038177490234
production_forward2 grad[31] vs paper_forward: mean_abs=1.0161662101745605, max_abs=7.375, mean_rel=0.1800820529460907, max_rel=2569.087646484375, norm_rel=0.025464510545134544, ref_abs_avg=40.04267120361328, test_abs_avg=40.04471969604492
production_forward2 grad[32] vs paper_forward: mean_abs=0.7653365135192871, max_abs=3.75, mean_rel=0.18170662224292755, max_rel=42.788909912109375, norm_rel=0.025177067145705223, ref_abs_avg=30.916309356689453, test_abs_avg=30.87375831604004
production_forward2 grad[33] vs paper_forward: mean_abs=0.9511733651161194, max_abs=7.0, mean_rel=0.1718561202287674, max_rel=1167.8289794921875, norm_rel=0.02542286366224289, ref_abs_avg=37.558570861816406, test_abs_avg=37.558021545410156
production_forward2 grad[34] vs paper_forward: mean_abs=0.9377697706222534, max_abs=6.25, mean_rel=0.17762963473796844, max_rel=1118.7513427734375, norm_rel=0.02507404424250126, ref_abs_avg=37.496036529541016, test_abs_avg=37.499671936035156
production_forward2 grad[35] vs paper_forward: mean_abs=0.7103416919708252, max_abs=3.0, mean_rel=0.2239179015159607, max_rel=78.36032104492188, norm_rel=0.02500092051923275, ref_abs_avg=29.222278594970703, test_abs_avg=29.242809295654297
production_forward2 grad[36] vs paper_forward: mean_abs=0.8896191120147705, max_abs=6.0, mean_rel=0.16556400060653687, max_rel=1110.8475341796875, norm_rel=0.025011980906128883, ref_abs_avg=35.63239288330078, test_abs_avg=35.63378143310547
production_forward2 grad[37] vs paper_forward: mean_abs=0.8727555871009827, max_abs=5.125, mean_rel=0.19113337993621826, max_rel=2093.966796875, norm_rel=0.02460683509707451, ref_abs_avg=35.490840911865234, test_abs_avg=35.491310119628906
production_forward2 grad[38] vs paper_forward: mean_abs=0.682581901550293, max_abs=3.0, mean_rel=0.15423423051834106, max_rel=23.857126235961914, norm_rel=0.024410434067249298, ref_abs_avg=28.758941650390625, test_abs_avg=28.723087310791016
production_forward2 grad[39] vs paper_forward: mean_abs=0.8404600024223328, max_abs=5.5, mean_rel=0.16741088032722473, max_rel=1687.87060546875, norm_rel=0.0246554147452116, ref_abs_avg=34.17284393310547, test_abs_avg=34.17682647705078
production_forward2 grad[40] vs paper_forward: mean_abs=0.8227752447128296, max_abs=5.75, mean_rel=0.16309598088264465, max_rel=1275.929443359375, norm_rel=0.024660756811499596, ref_abs_avg=33.4842529296875, test_abs_avg=33.480552673339844
production_forward2 grad[41] vs paper_forward: mean_abs=0.6767764091491699, max_abs=2.5, mean_rel=0.13507351279258728, max_rel=16.994667053222656, norm_rel=0.026461094617843628, ref_abs_avg=25.334026336669922, test_abs_avg=25.349098205566406
production_forward2 grad[42] vs paper_forward: mean_abs=0.7898576259613037, max_abs=5.0, mean_rel=0.1556462049484253, max_rel=917.3888549804688, norm_rel=0.02462119236588478, ref_abs_avg=32.18136215209961, test_abs_avg=32.181190490722656
production_forward2 grad[43] vs paper_forward: mean_abs=0.7817589044570923, max_abs=5.0, mean_rel=0.15949612855911255, max_rel=775.9701538085938, norm_rel=0.02442990057170391, ref_abs_avg=32.146888732910156, test_abs_avg=32.14447784423828
production_forward2 grad[44] vs paper_forward: mean_abs=0.5836429595947266, max_abs=2.64453125, mean_rel=0.0701606497168541, max_rel=3.2983202934265137, norm_rel=0.022331641986966133, ref_abs_avg=26.568052291870117, test_abs_avg=26.575599670410156
production_forward2 grad[45] vs paper_forward: mean_abs=0.7606068849563599, max_abs=5.0, mean_rel=0.15799051523208618, max_rel=1071.2213134765625, norm_rel=0.024222252890467644, ref_abs_avg=31.437137603759766, test_abs_avg=31.437908172607422
production_forward2 grad[46] vs paper_forward: mean_abs=0.7447078227996826, max_abs=4.5, mean_rel=0.15353311598300934, max_rel=639.5390625, norm_rel=0.024419348686933517, ref_abs_avg=30.633237838745117, test_abs_avg=30.63233184814453
production_forward2 grad[47] vs paper_forward: mean_abs=0.5597586631774902, max_abs=2.25, mean_rel=0.06908731162548065, max_rel=3.6018779277801514, norm_rel=0.023711208254098892, ref_abs_avg=24.46044158935547, test_abs_avg=24.443668365478516
production_forward2 grad[48] vs paper_forward: mean_abs=0.7220913767814636, max_abs=5.0, mean_rel=0.16883453726768494, max_rel=1624.20947265625, norm_rel=0.024097563698887825, ref_abs_avg=29.99386215209961, test_abs_avg=29.996789932250977
production_forward2 grad[49] vs paper_forward: mean_abs=0.704515278339386, max_abs=5.0, mean_rel=0.1703542023897171, max_rel=1310.5535888671875, norm_rel=0.023910611867904663, ref_abs_avg=29.54489517211914, test_abs_avg=29.541114807128906
production_forward2 grad[50] vs paper_forward: mean_abs=0.6616822481155396, max_abs=3.25, mean_rel=0.1489258110523224, max_rel=13.111422538757324, norm_rel=0.02546123042702675, ref_abs_avg=25.763437271118164, test_abs_avg=25.744491577148438
production_forward2 grad[51] vs paper_forward: mean_abs=0.8204278945922852, max_abs=6.0, mean_rel=0.16887018084526062, max_rel=1571.6136474609375, norm_rel=0.025193417444825172, ref_abs_avg=32.60972595214844, test_abs_avg=32.609893798828125
production_forward2 grad[52] vs paper_forward: mean_abs=0.8015843629837036, max_abs=4.96875, mean_rel=0.15744107961654663, max_rel=524.0939331054688, norm_rel=0.025152122601866722, ref_abs_avg=31.959857940673828, test_abs_avg=31.959943771362305
production_forward2 grad[53] vs paper_forward: mean_abs=0.5985846519470215, max_abs=2.5, mean_rel=0.1677575260400772, max_rel=13.599332809448242, norm_rel=0.024030540138483047, ref_abs_avg=24.76683807373047, test_abs_avg=24.75359535217285
production_forward2 grad[54] vs paper_forward: mean_abs=0.7538468241691589, max_abs=5.0, mean_rel=0.16527977585792542, max_rel=802.9336547851562, norm_rel=0.0250216256827116, ref_abs_avg=30.13473129272461, test_abs_avg=30.133256912231445
production_forward2 grad[55] vs paper_forward: mean_abs=0.7364315390586853, max_abs=5.0, mean_rel=0.1558372974395752, max_rel=633.7889404296875, norm_rel=0.024664951488375664, ref_abs_avg=29.86803436279297, test_abs_avg=29.863922119140625
production_forward2 grad[56] vs paper_forward: mean_abs=0.585507869720459, max_abs=2.3125, mean_rel=0.16714876890182495, max_rel=18.100309371948242, norm_rel=0.025684379041194916, ref_abs_avg=22.615304946899414, test_abs_avg=22.580101013183594
production_forward2 grad[57] vs paper_forward: mean_abs=0.6974720358848572, max_abs=6.0, mean_rel=0.16362321376800537, max_rel=993.88232421875, norm_rel=0.024419380351901054, ref_abs_avg=28.594741821289062, test_abs_avg=28.594301223754883
production_forward2 grad[58] vs paper_forward: mean_abs=0.6856638193130493, max_abs=5.0, mean_rel=0.16137710213661194, max_rel=1249.9267578125, norm_rel=0.02430794946849346, ref_abs_avg=28.283002853393555, test_abs_avg=28.282474517822266
production_forward2 grad[59] vs paper_forward: mean_abs=0.5042781233787537, max_abs=2.0, mean_rel=0.2609820067882538, max_rel=89.85830688476562, norm_rel=0.022780997678637505, ref_abs_avg=22.847686767578125, test_abs_avg=22.876937866210938
production_forward2 grad[60] vs paper_forward: mean_abs=0.652949333190918, max_abs=6.0, mean_rel=0.160992830991745, max_rel=1734.3001708984375, norm_rel=0.02378014475107193, ref_abs_avg=27.45551300048828, test_abs_avg=27.456199645996094
production_forward2 grad[61] vs paper_forward: mean_abs=0.6410208940505981, max_abs=4.75, mean_rel=0.15848436951637268, max_rel=741.292724609375, norm_rel=0.023990420624613762, ref_abs_avg=26.777164459228516, test_abs_avg=26.773042678833008
production_forward2 grad[62] vs paper_forward: mean_abs=0.49989986419677734, max_abs=1.75, mean_rel=0.1312279999256134, max_rel=16.09030532836914, norm_rel=0.022787902504205704, ref_abs_avg=22.54256248474121, test_abs_avg=22.52652359008789
production_forward2 grad[63] vs paper_forward: mean_abs=0.6128783226013184, max_abs=4.5, mean_rel=0.1563158631324768, max_rel=897.9307250976562, norm_rel=0.0234020184725523, ref_abs_avg=26.168624877929688, test_abs_avg=26.169761657714844
production_forward2 grad[64] vs paper_forward: mean_abs=0.597530722618103, max_abs=4.5, mean_rel=0.15937510132789612, max_rel=1198.901123046875, norm_rel=0.023364650085568428, ref_abs_avg=25.611127853393555, test_abs_avg=25.607589721679688
production_forward2 grad[65] vs paper_forward: mean_abs=0.4958565831184387, max_abs=2.875, mean_rel=0.3861176669597626, max_rel=126.78836822509766, norm_rel=0.02354111149907112, ref_abs_avg=21.69015121459961, test_abs_avg=21.714488983154297
production_forward2 grad[66] vs paper_forward: mean_abs=0.5832506418228149, max_abs=5.0, mean_rel=0.15511080622673035, max_rel=875.2280883789062, norm_rel=0.02292485535144806, ref_abs_avg=25.395645141601562, test_abs_avg=25.397274017333984
production_forward2 grad[67] vs paper_forward: mean_abs=0.565223753452301, max_abs=4.0, mean_rel=0.14584305882453918, max_rel=559.3359375, norm_rel=0.022618694230914116, ref_abs_avg=24.89643096923828, test_abs_avg=24.897069931030273
production_forward2 grad[68] vs paper_forward: mean_abs=0.44438838958740234, max_abs=1.875, mean_rel=0.07865014672279358, max_rel=1.9026365280151367, norm_rel=0.02241973578929901, ref_abs_avg=19.432865142822266, test_abs_avg=19.394624710083008
production_forward2 grad[69] vs paper_forward: mean_abs=0.5497758984565735, max_abs=4.25, mean_rel=0.14246338605880737, max_rel=994.9677734375, norm_rel=0.022518768906593323, ref_abs_avg=24.35999298095703, test_abs_avg=24.35950469970703
production_forward2 grad[70] vs paper_forward: mean_abs=0.5344833731651306, max_abs=4.5625, mean_rel=0.1475583016872406, max_rel=788.4091186523438, norm_rel=0.022374091669917107, ref_abs_avg=23.842426300048828, test_abs_avg=23.83504867553711
production_forward2 grad[71] vs paper_forward: mean_abs=0.41208702325820923, max_abs=1.6875, mean_rel=0.2510722875595093, max_rel=78.57261657714844, norm_rel=0.021021027117967606, ref_abs_avg=19.687633514404297, test_abs_avg=19.677932739257812
production_forward2 grad[72] vs paper_forward: mean_abs=0.5239012241363525, max_abs=4.25, mean_rel=0.1486581712961197, max_rel=1657.82568359375, norm_rel=0.022253207862377167, ref_abs_avg=23.483463287353516, test_abs_avg=23.4857234954834
production_forward2 grad[73] vs paper_forward: mean_abs=0.5150101184844971, max_abs=4.0, mean_rel=0.1432284414768219, max_rel=1121.3226318359375, norm_rel=0.022270910441875458, ref_abs_avg=23.14725112915039, test_abs_avg=23.15300750732422
production_forward2 grad[74] vs paper_forward: mean_abs=0.4725315570831299, max_abs=2.0625, mean_rel=0.21262706816196442, max_rel=36.01658630371094, norm_rel=0.023817995563149452, ref_abs_avg=19.71849822998047, test_abs_avg=19.674663543701172
production_forward2 grad[75] vs paper_forward: mean_abs=0.5699920654296875, max_abs=4.25, mean_rel=0.15345287322998047, max_rel=895.0142211914062, norm_rel=0.023881256580352783, ref_abs_avg=23.87097930908203, test_abs_avg=23.871034622192383
production_forward2 grad[76] vs paper_forward: mean_abs=0.5506932735443115, max_abs=5.0, mean_rel=0.16618266701698303, max_rel=946.6729125976562, norm_rel=0.024097435176372528, ref_abs_avg=22.925540924072266, test_abs_avg=22.92920684814453
production_forward2 grad[77] vs paper_forward: mean_abs=0.4237235188484192, max_abs=2.0, mean_rel=0.1251084953546524, max_rel=18.69495391845703, norm_rel=0.023241780698299408, ref_abs_avg=18.294147491455078, test_abs_avg=18.28862762451172
production_forward2 grad[78] vs paper_forward: mean_abs=0.5101096630096436, max_abs=4.25, mean_rel=0.14577797055244446, max_rel=742.4776611328125, norm_rel=0.02356305904686451, ref_abs_avg=21.659976959228516, test_abs_avg=21.661575317382812
production_forward2 grad[79] vs paper_forward: mean_abs=0.5087785720825195, max_abs=3.75, mean_rel=0.1650683879852295, max_rel=917.035400390625, norm_rel=0.02306952513754368, ref_abs_avg=22.034934997558594, test_abs_avg=22.040372848510742
production_forward2 grad[80] vs paper_forward: mean_abs=0.4059755206108093, max_abs=1.75, mean_rel=0.17158102989196777, max_rel=40.2261848449707, norm_rel=0.022510331124067307, ref_abs_avg=18.85759925842285, test_abs_avg=18.867734909057617
production_forward2 grad[81] vs paper_forward: mean_abs=0.4939006567001343, max_abs=4.0, mean_rel=0.1412915289402008, max_rel=680.099853515625, norm_rel=0.02257394604384899, ref_abs_avg=21.832122802734375, test_abs_avg=21.832969665527344
production_forward2 grad[82] vs paper_forward: mean_abs=0.46927982568740845, max_abs=4.0, mean_rel=0.13757064938545227, max_rel=478.90032958984375, norm_rel=0.02249227650463581, ref_abs_avg=21.003826141357422, test_abs_avg=21.005661010742188
production_forward2 grad[83] vs paper_forward: mean_abs=0.3866915702819824, max_abs=1.625, mean_rel=0.09386463463306427, max_rel=8.036417961120605, norm_rel=0.023409761488437653, ref_abs_avg=16.75933074951172, test_abs_avg=16.750043869018555
production_forward2 grad[84] vs paper_forward: mean_abs=0.4598216414451599, max_abs=5.109375, mean_rel=0.14875099062919617, max_rel=1419.658447265625, norm_rel=0.02216157503426075, ref_abs_avg=20.78106117248535, test_abs_avg=20.782411575317383
production_forward2 grad[85] vs paper_forward: mean_abs=0.4444122612476349, max_abs=4.0, mean_rel=0.150542214512825, max_rel=1124.1961669921875, norm_rel=0.02211124636232853, ref_abs_avg=20.153518676757812, test_abs_avg=20.16106414794922
production_forward2 grad[86] vs paper_forward: mean_abs=0.34045112133026123, max_abs=1.359375, mean_rel=0.14851480722427368, max_rel=21.23400115966797, norm_rel=0.019655950367450714, ref_abs_avg=17.340560913085938, test_abs_avg=17.35377311706543
production_forward2 grad[87] vs paper_forward: mean_abs=0.42289626598358154, max_abs=4.0, mean_rel=0.12987478077411652, max_rel=674.3831176757812, norm_rel=0.021500403061509132, ref_abs_avg=19.719942092895508, test_abs_avg=19.719402313232422
production_forward2 grad[88] vs paper_forward: mean_abs=0.40875163674354553, max_abs=4.5, mean_rel=0.14412683248519897, max_rel=1031.9898681640625, norm_rel=0.021353766322135925, ref_abs_avg=19.287059783935547, test_abs_avg=19.289398193359375
production_forward2 grad[89] vs paper_forward: mean_abs=0.33074700832366943, max_abs=1.5, mean_rel=0.19809259474277496, max_rel=36.990203857421875, norm_rel=0.021177317947149277, ref_abs_avg=15.550066947937012, test_abs_avg=15.551788330078125
production_forward2 grad[90] vs paper_forward: mean_abs=0.3973521888256073, max_abs=5.0, mean_rel=0.13262839615345, max_rel=643.2716064453125, norm_rel=0.02110998146235943, ref_abs_avg=18.957012176513672, test_abs_avg=18.95831871032715
production_forward2 grad[91] vs paper_forward: mean_abs=0.39791443943977356, max_abs=4.25, mean_rel=0.12095974385738373, max_rel=583.0907592773438, norm_rel=0.021070601418614388, ref_abs_avg=19.058778762817383, test_abs_avg=19.064231872558594
production_forward2 grad[92] vs paper_forward: mean_abs=0.31196433305740356, max_abs=1.375, mean_rel=0.09935900568962097, max_rel=7.654938220977783, norm_rel=0.01994306780397892, ref_abs_avg=16.30659294128418, test_abs_avg=16.301713943481445
production_forward2 grad[93] vs paper_forward: mean_abs=0.38535696268081665, max_abs=5.0, mean_rel=0.127061665058136, max_rel=537.18994140625, norm_rel=0.020850133150815964, ref_abs_avg=18.680299758911133, test_abs_avg=18.680675506591797
production_forward2 grad[94] vs paper_forward: mean_abs=0.37588343024253845, max_abs=4.0, mean_rel=0.1255241483449936, max_rel=433.1152648925781, norm_rel=0.021062156185507774, ref_abs_avg=18.163606643676758, test_abs_avg=18.161563873291016
production_forward2 grad[95] vs paper_forward: mean_abs=0.2969689965248108, max_abs=1.28125, mean_rel=0.24226710200309753, max_rel=71.99095153808594, norm_rel=0.020371844992041588, ref_abs_avg=14.829898834228516, test_abs_avg=14.829328536987305
production_forward2 grad[96] vs paper_forward: mean_abs=0.3492930829524994, max_abs=3.5234375, mean_rel=0.12782025337219238, max_rel=537.837890625, norm_rel=0.019995519891381264, ref_abs_avg=17.709074020385742, test_abs_avg=17.70987319946289
production_forward2 grad[97] vs paper_forward: mean_abs=0.3506707549095154, max_abs=3.125, mean_rel=0.12330993264913559, max_rel=389.4765319824219, norm_rel=0.019971398636698723, ref_abs_avg=17.710805892944336, test_abs_avg=17.7100772857666
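The per-gradient lines above report seven error statistics. Their exact definitions are not shown in the log, but the field names suggest the standard elementwise and Frobenius-norm comparisons; a minimal sketch (in NumPy, with an assumed `eps` guard against division by zero — the helper name and epsilon are illustrative, not taken from the harness) would be:

```python
import numpy as np

def compare_tensors(ref: np.ndarray, test: np.ndarray, eps: float = 1e-6) -> dict:
    """Assumed definitions for the metric fields printed in the log above."""
    diff = np.abs(ref - test)
    rel = diff / (np.abs(ref) + eps)  # elementwise relative error
    return {
        "mean_abs": float(diff.mean()),
        "max_abs": float(diff.max()),
        "mean_rel": float(rel.mean()),
        "max_rel": float(rel.max()),
        # whole-tensor relative error in the Frobenius norm; this is the
        # metric that stays ~0.02 even when max_rel spikes into the hundreds
        # on near-zero reference entries
        "norm_rel": float(np.linalg.norm(diff) / (np.linalg.norm(ref) + eps)),
        "ref_abs_avg": float(np.abs(ref).mean()),
        "test_abs_avg": float(np.abs(test).mean()),
    }
```

Note the pattern this makes visible in the log: `max_rel` values in the hundreds or thousands coexist with `norm_rel` around 0.02, which is what you expect when a few reference entries are close to zero — the elementwise ratio blows up while the norm-level agreement stays tight.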
identity layers + randn queries
production_forward2 fwd+bwd:  224.380 ms
production_forward2 bwd-only: 202.147 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.242 GiB, fwd+bwd=8.992 GiB
paper_forward fwd+bwd:  379.773 ms
paper_forward bwd-only: 294.084 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB
production_forward fwd+bwd:  126.734 ms
production_forward bwd-only: 106.292 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=7.868 GiB
production_forward peak reserved:  fwd=3.617 GiB, fwd+bwd=8.867 GiB
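The timing and peak-memory figures above come from a harness that is not shown in the log. A minimal, CPU-only sketch of the timing side is below; the function name, warmup count, and iteration count are assumptions. For the GPU numbers one would additionally call `torch.cuda.synchronize()` before each timestamp, and read the memory figures via `torch.cuda.reset_peak_memory_stats()` followed by `torch.cuda.max_memory_allocated()` / `torch.cuda.max_memory_reserved()` — those calls are omitted here so the sketch runs anywhere:

```python
import time

def benchmark_ms(fn, iters: int = 10, warmup: int = 3) -> float:
    """Mean wall-clock time per call of fn(), in milliseconds.

    Warmup iterations are discarded so one-time costs (autotuning,
    compilation, allocator growth) do not pollute the measurement --
    relevant here given the kernel autotuning sweep earlier in the log.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1e3
```

Separating "fwd+bwd" from "bwd-only" as in the log presumably means timing the full forward-plus-backward loop and, in a second run, timing only the `backward()` call against a retained forward result.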

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0017196618719026446, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008939512073993683, max_abs=0.375, mean_rel=0.07455046474933624, max_rel=166.03102111816406, norm_rel=0.02042793482542038, ref_abs_avg=0.47740787267684937, test_abs_avg=0.4774242639541626
production_forward grad[1] vs paper_forward: mean_abs=7.763253211975098, max_abs=64.0, mean_rel=0.23765446245670319, max_rel=1463.9794921875, norm_rel=0.020940447226166725, ref_abs_avg=332.32659912109375, test_abs_avg=332.39129638671875
production_forward grad[2] vs paper_forward: mean_abs=1.4243707656860352, max_abs=5.875, mean_rel=0.23034192621707916, max_rel=66.27867889404297, norm_rel=0.0242898128926754, ref_abs_avg=58.31346893310547, test_abs_avg=58.33824157714844
production_forward grad[3] vs paper_forward: mean_abs=1.7091848850250244, max_abs=11.5, mean_rel=0.17684395611286163, max_rel=2442.892578125, norm_rel=0.0243212953209877, ref_abs_avg=70.67276000976562, test_abs_avg=70.68009185791016
production_forward grad[4] vs paper_forward: mean_abs=1.6484819650650024, max_abs=10.546875, mean_rel=0.17643415927886963, max_rel=1727.515625, norm_rel=0.02394958958029747, ref_abs_avg=69.20582580566406, test_abs_avg=69.2136001586914
production_forward grad[5] vs paper_forward: mean_abs=1.2476162910461426, max_abs=5.0, mean_rel=0.10245129466056824, max_rel=9.519281387329102, norm_rel=0.024872945621609688, ref_abs_avg=50.33711624145508, test_abs_avg=50.40508270263672
production_forward grad[6] vs paper_forward: mean_abs=1.4517791271209717, max_abs=9.25, mean_rel=0.1732971966266632, max_rel=2802.275146484375, norm_rel=0.024003447964787483, ref_abs_avg=60.78848648071289, test_abs_avg=60.78912353515625
production_forward grad[7] vs paper_forward: mean_abs=1.4144761562347412, max_abs=9.0, mean_rel=0.1640150547027588, max_rel=1926.859130859375, norm_rel=0.023801397532224655, ref_abs_avg=59.739479064941406, test_abs_avg=59.75233840942383
production_forward grad[8] vs paper_forward: mean_abs=1.0874700546264648, max_abs=3.859375, mean_rel=0.19179187715053558, max_rel=35.57412338256836, norm_rel=0.024206452071666718, ref_abs_avg=44.44817352294922, test_abs_avg=44.409629821777344
production_forward grad[9] vs paper_forward: mean_abs=1.3263211250305176, max_abs=9.0625, mean_rel=0.16660377383232117, max_rel=2258.646484375, norm_rel=0.023828238248825073, ref_abs_avg=55.981143951416016, test_abs_avg=55.982269287109375
production_forward grad[10] vs paper_forward: mean_abs=1.2941256761550903, max_abs=8.0703125, mean_rel=0.16343159973621368, max_rel=1464.0826416015625, norm_rel=0.023634467273950577, ref_abs_avg=55.08021545410156, test_abs_avg=55.07617950439453
production_forward grad[11] vs paper_forward: mean_abs=1.0050716400146484, max_abs=3.8125, mean_rel=0.09885263442993164, max_rel=10.835041999816895, norm_rel=0.024367447942495346, ref_abs_avg=43.31646728515625, test_abs_avg=43.3197135925293
production_forward grad[12] vs paper_forward: mean_abs=1.218443512916565, max_abs=8.0, mean_rel=0.15788722038269043, max_rel=1727.8331298828125, norm_rel=0.023644374683499336, ref_abs_avg=51.7795524597168, test_abs_avg=51.78376388549805
production_forward grad[13] vs paper_forward: mean_abs=1.1943368911743164, max_abs=8.0, mean_rel=0.16086003184318542, max_rel=1020.6281127929688, norm_rel=0.02349003776907921, ref_abs_avg=51.052608489990234, test_abs_avg=51.05347442626953
production_forward grad[14] vs paper_forward: mean_abs=0.9586870670318604, max_abs=4.3125, mean_rel=0.0911562442779541, max_rel=4.716589450836182, norm_rel=0.023577092215418816, ref_abs_avg=40.09723663330078, test_abs_avg=40.094146728515625
production_forward grad[15] vs paper_forward: mean_abs=1.1423271894454956, max_abs=7.125, mean_rel=0.164841428399086, max_rel=1060.625244140625, norm_rel=0.023522334173321724, ref_abs_avg=48.80442428588867, test_abs_avg=48.80879211425781
production_forward grad[16] vs paper_forward: mean_abs=1.1139284372329712, max_abs=7.0, mean_rel=0.1673082858324051, max_rel=1349.9111328125, norm_rel=0.023452667519450188, ref_abs_avg=47.72216796875, test_abs_avg=47.71600341796875
production_forward grad[17] vs paper_forward: mean_abs=0.8706600666046143, max_abs=3.5, mean_rel=0.1629161536693573, max_rel=26.86135482788086, norm_rel=0.022784609347581863, ref_abs_avg=38.48554229736328, test_abs_avg=38.4503288269043
production_forward grad[18] vs paper_forward: mean_abs=1.0705454349517822, max_abs=8.0, mean_rel=0.1670723557472229, max_rel=1556.9669189453125, norm_rel=0.02331014722585678, ref_abs_avg=46.12638854980469, test_abs_avg=46.13096618652344
production_forward grad[19] vs paper_forward: mean_abs=1.0492018461227417, max_abs=6.515625, mean_rel=0.16282883286476135, max_rel=2041.3636474609375, norm_rel=0.023144982755184174, ref_abs_avg=45.485382080078125, test_abs_avg=45.48396301269531
production_forward grad[20] vs paper_forward: mean_abs=0.7813692092895508, max_abs=3.75, mean_rel=0.09847721457481384, max_rel=9.913809776306152, norm_rel=0.02257402054965496, ref_abs_avg=35.504119873046875, test_abs_avg=35.51291275024414
production_forward grad[21] vs paper_forward: mean_abs=1.013794183731079, max_abs=6.5, mean_rel=0.16026057302951813, max_rel=1925.185791015625, norm_rel=0.023238137364387512, ref_abs_avg=43.854644775390625, test_abs_avg=43.85541915893555
production_forward grad[22] vs paper_forward: mean_abs=0.9969931840896606, max_abs=6.3125, mean_rel=0.15232397615909576, max_rel=777.4852905273438, norm_rel=0.02293046936392784, ref_abs_avg=43.725624084472656, test_abs_avg=43.720890045166016
production_forward grad[23] vs paper_forward: mean_abs=0.7733488082885742, max_abs=3.0, mean_rel=0.09909257292747498, max_rel=5.6836652755737305, norm_rel=0.02352319099009037, ref_abs_avg=33.25893020629883, test_abs_avg=33.25531005859375
production_forward grad[24] vs paper_forward: mean_abs=0.97611403465271, max_abs=6.3125, mean_rel=0.1586330682039261, max_rel=1853.123291015625, norm_rel=0.02328363060951233, ref_abs_avg=42.1633415222168, test_abs_avg=42.16669845581055
production_forward grad[25] vs paper_forward: mean_abs=0.9497367143630981, max_abs=5.75, mean_rel=0.15279227495193481, max_rel=1176.9112548828125, norm_rel=0.022970687597990036, ref_abs_avg=41.52214813232422, test_abs_avg=41.518558502197266
production_forward grad[26] vs paper_forward: mean_abs=0.9071042537689209, max_abs=4.0, mean_rel=0.09978314489126205, max_rel=10.897127151489258, norm_rel=0.025377057492733, ref_abs_avg=36.79756164550781, test_abs_avg=36.744773864746094
production_forward grad[27] vs paper_forward: mean_abs=1.1233665943145752, max_abs=7.03125, mean_rel=0.1786898523569107, max_rel=2349.30419921875, norm_rel=0.025152212008833885, ref_abs_avg=44.871315002441406, test_abs_avg=44.872249603271484
production_forward grad[28] vs paper_forward: mean_abs=1.1044914722442627, max_abs=8.0, mean_rel=0.1700236201286316, max_rel=1437.7867431640625, norm_rel=0.024743998423218727, ref_abs_avg=44.82758331298828, test_abs_avg=44.820037841796875
production_forward grad[29] vs paper_forward: mean_abs=0.8446559906005859, max_abs=2.9375, mean_rel=0.08977227658033371, max_rel=4.03573751449585, norm_rel=0.023802500218153, ref_abs_avg=34.334747314453125, test_abs_avg=34.38562774658203
production_forward grad[30] vs paper_forward: mean_abs=1.0431838035583496, max_abs=7.5, mean_rel=0.17009124159812927, max_rel=1512.8990478515625, norm_rel=0.025345608592033386, ref_abs_avg=41.35286331176758, test_abs_avg=41.35539245605469
production_forward grad[31] vs paper_forward: mean_abs=1.0245869159698486, max_abs=6.0, mean_rel=0.16174539923667908, max_rel=670.4713134765625, norm_rel=0.02527116984128952, ref_abs_avg=40.70027160644531, test_abs_avg=40.69403076171875
production_forward grad[32] vs paper_forward: mean_abs=0.770355224609375, max_abs=3.1328125, mean_rel=0.08495587855577469, max_rel=8.17410945892334, norm_rel=0.02458762563765049, ref_abs_avg=32.42280578613281, test_abs_avg=32.422462463378906
production_forward grad[33] vs paper_forward: mean_abs=0.971011757850647, max_abs=6.0, mean_rel=0.16578516364097595, max_rel=1413.921875, norm_rel=0.02518322691321373, ref_abs_avg=38.663917541503906, test_abs_avg=38.66764450073242
production_forward grad[34] vs paper_forward: mean_abs=0.9523009657859802, max_abs=6.0, mean_rel=0.1750354766845703, max_rel=1015.8922729492188, norm_rel=0.02505478821694851, ref_abs_avg=38.14471435546875, test_abs_avg=38.13883972167969
production_forward grad[35] vs paper_forward: mean_abs=0.7175769805908203, max_abs=2.59375, mean_rel=0.09945078939199448, max_rel=6.247838020324707, norm_rel=0.02337232418358326, ref_abs_avg=30.270885467529297, test_abs_avg=30.329111099243164
production_forward grad[36] vs paper_forward: mean_abs=0.9046376943588257, max_abs=5.9375, mean_rel=0.17409324645996094, max_rel=1201.5228271484375, norm_rel=0.024926552549004555, ref_abs_avg=36.37940216064453, test_abs_avg=36.38198471069336
production_forward grad[37] vs paper_forward: mean_abs=0.8908690810203552, max_abs=5.5, mean_rel=0.16188669204711914, max_rel=1088.541015625, norm_rel=0.02466551773250103, ref_abs_avg=36.23287582397461, test_abs_avg=36.23414611816406
production_forward grad[38] vs paper_forward: mean_abs=0.6702032089233398, max_abs=2.375, mean_rel=0.07760170847177505, max_rel=4.558474063873291, norm_rel=0.022715583443641663, ref_abs_avg=29.3680419921875, test_abs_avg=29.325733184814453
production_forward grad[39] vs paper_forward: mean_abs=0.8530189990997314, max_abs=5.5, mean_rel=0.1664665937423706, max_rel=1103.6175537109375, norm_rel=0.02472371980547905, ref_abs_avg=34.595458984375, test_abs_avg=34.59322738647461
production_forward grad[40] vs paper_forward: mean_abs=0.8373831510543823, max_abs=5.5, mean_rel=0.16006314754486084, max_rel=1053.554931640625, norm_rel=0.024475157260894775, ref_abs_avg=34.32914733886719, test_abs_avg=34.327537536621094
production_forward grad[41] vs paper_forward: mean_abs=0.6236124038696289, max_abs=2.78125, mean_rel=0.12070474028587341, max_rel=19.622093200683594, norm_rel=0.023266304284334183, ref_abs_avg=27.694366455078125, test_abs_avg=27.67181968688965
production_forward grad[42] vs paper_forward: mean_abs=0.8122238516807556, max_abs=5.5, mean_rel=0.16725707054138184, max_rel=1437.9351806640625, norm_rel=0.02452935464680195, ref_abs_avg=33.13909149169922, test_abs_avg=33.13837432861328
production_forward grad[43] vs paper_forward: mean_abs=0.7923709750175476, max_abs=5.0, mean_rel=0.1750601828098297, max_rel=1195.32470703125, norm_rel=0.024348795413970947, ref_abs_avg=32.630489349365234, test_abs_avg=32.63262939453125
production_forward grad[44] vs paper_forward: mean_abs=0.6358184814453125, max_abs=2.375, mean_rel=0.08932466804981232, max_rel=2.914013624191284, norm_rel=0.0237729512155056, ref_abs_avg=26.751880645751953, test_abs_avg=26.759788513183594
production_forward grad[45] vs paper_forward: mean_abs=0.7721138000488281, max_abs=5.0, mean_rel=0.17871379852294922, max_rel=2158.733154296875, norm_rel=0.02424563281238079, ref_abs_avg=31.913082122802734, test_abs_avg=31.91336441040039
production_forward grad[46] vs paper_forward: mean_abs=0.757412314414978, max_abs=5.25, mean_rel=0.16266053915023804, max_rel=799.2874145507812, norm_rel=0.023908581584692, ref_abs_avg=31.724531173706055, test_abs_avg=31.725933074951172
production_forward grad[47] vs paper_forward: mean_abs=0.5740878582000732, max_abs=2.25, mean_rel=0.1407880336046219, max_rel=18.980972290039062, norm_rel=0.023531366139650345, ref_abs_avg=24.740617752075195, test_abs_avg=24.747268676757812
production_forward grad[48] vs paper_forward: mean_abs=0.7290170788764954, max_abs=4.75, mean_rel=0.1623184233903885, max_rel=1141.80908203125, norm_rel=0.024171048775315285, ref_abs_avg=30.202136993408203, test_abs_avg=30.20342445373535
production_forward grad[49] vs paper_forward: mean_abs=0.7258074283599854, max_abs=5.0, mean_rel=0.15553921461105347, max_rel=740.8943481445312, norm_rel=0.024195872247219086, ref_abs_avg=30.032634735107422, test_abs_avg=30.034509658813477
production_forward grad[50] vs paper_forward: mean_abs=0.7146129608154297, max_abs=2.5, mean_rel=0.0865192636847496, max_rel=3.89015531539917, norm_rel=0.025217801332473755, ref_abs_avg=27.438201904296875, test_abs_avg=27.39848518371582
production_forward grad[51] vs paper_forward: mean_abs=0.8438250422477722, max_abs=5.40625, mean_rel=0.18293139338493347, max_rel=1170.91162109375, norm_rel=0.025524981319904327, ref_abs_avg=33.118553161621094, test_abs_avg=33.11772918701172
production_forward grad[52] vs paper_forward: mean_abs=0.8158951997756958, max_abs=6.125, mean_rel=0.17066729068756104, max_rel=1317.6475830078125, norm_rel=0.024958891794085503, ref_abs_avg=32.803955078125, test_abs_avg=32.79763412475586
production_forward grad[53] vs paper_forward: mean_abs=0.6084592342376709, max_abs=3.0, mean_rel=0.20088991522789001, max_rel=35.576202392578125, norm_rel=0.02361023798584938, ref_abs_avg=26.734756469726562, test_abs_avg=26.789432525634766
production_forward grad[54] vs paper_forward: mean_abs=0.7631487250328064, max_abs=5.5, mean_rel=0.15886405110359192, max_rel=1086.8865966796875, norm_rel=0.024953119456768036, ref_abs_avg=30.597240447998047, test_abs_avg=30.59593963623047
production_forward grad[55] vs paper_forward: mean_abs=0.7477630376815796, max_abs=5.0, mean_rel=0.16978409886360168, max_rel=871.8328857421875, norm_rel=0.024710506200790405, ref_abs_avg=30.26093292236328, test_abs_avg=30.264604568481445
production_forward grad[56] vs paper_forward: mean_abs=0.5501744747161865, max_abs=2.0, mean_rel=0.1031278595328331, max_rel=16.053035736083984, norm_rel=0.023455139249563217, ref_abs_avg=23.819652557373047, test_abs_avg=23.829944610595703
production_forward grad[57] vs paper_forward: mean_abs=0.7038362622261047, max_abs=5.5, mean_rel=0.16961130499839783, max_rel=1319.3187255859375, norm_rel=0.024533553048968315, ref_abs_avg=28.687862396240234, test_abs_avg=28.6873779296875
production_forward grad[58] vs paper_forward: mean_abs=0.6859627962112427, max_abs=4.375, mean_rel=0.16557812690734863, max_rel=833.3555908203125, norm_rel=0.024269716814160347, ref_abs_avg=28.269990921020508, test_abs_avg=28.26403045654297
production_forward grad[59] vs paper_forward: mean_abs=0.5200958251953125, max_abs=2.0, mean_rel=0.0984603762626648, max_rel=10.571632385253906, norm_rel=0.02261483110487461, ref_abs_avg=23.851655960083008, test_abs_avg=23.89588737487793
production_forward grad[60] vs paper_forward: mean_abs=0.6643633246421814, max_abs=5.625, mean_rel=0.16320553421974182, max_rel=1103.1317138671875, norm_rel=0.02383965253829956, ref_abs_avg=27.821685791015625, test_abs_avg=27.822067260742188
production_forward grad[61] vs paper_forward: mean_abs=0.6494307518005371, max_abs=4.75, mean_rel=0.17579999566078186, max_rel=1124.20703125, norm_rel=0.024059701710939407, ref_abs_avg=27.025251388549805, test_abs_avg=27.024826049804688
production_forward grad[62] vs paper_forward: mean_abs=0.5234951972961426, max_abs=2.0, mean_rel=0.09651198238134384, max_rel=5.13456916809082, norm_rel=0.02373696304857731, ref_abs_avg=22.034395217895508, test_abs_avg=22.09856605529785
production_forward grad[63] vs paper_forward: mean_abs=0.6230542659759521, max_abs=4.515625, mean_rel=0.1648721694946289, max_rel=1088.2977294921875, norm_rel=0.023633994162082672, ref_abs_avg=26.387710571289062, test_abs_avg=26.38827896118164
production_forward grad[64] vs paper_forward: mean_abs=0.6098629832267761, max_abs=5.0, mean_rel=0.14924928545951843, max_rel=743.3729858398438, norm_rel=0.023342669010162354, ref_abs_avg=26.182479858398438, test_abs_avg=26.182432174682617
production_forward grad[65] vs paper_forward: mean_abs=0.46587085723876953, max_abs=1.625, mean_rel=0.07762832939624786, max_rel=3.46267032623291, norm_rel=0.022270340472459793, ref_abs_avg=21.06168556213379, test_abs_avg=21.097431182861328
production_forward grad[66] vs paper_forward: mean_abs=0.5925254225730896, max_abs=4.5, mean_rel=0.15075621008872986, max_rel=802.5377197265625, norm_rel=0.02318825013935566, ref_abs_avg=25.532400131225586, test_abs_avg=25.53256607055664
production_forward grad[67] vs paper_forward: mean_abs=0.5795226693153381, max_abs=5.0, mean_rel=0.15745273232460022, max_rel=1514.46337890625, norm_rel=0.023389996960759163, ref_abs_avg=24.752660751342773, test_abs_avg=24.748029708862305
production_forward grad[68] vs paper_forward: mean_abs=0.46496105194091797, max_abs=2.27734375, mean_rel=0.08266796171665192, max_rel=6.279860973358154, norm_rel=0.02246956340968609, ref_abs_avg=21.03182601928711, test_abs_avg=21.025409698486328
production_forward grad[69] vs paper_forward: mean_abs=0.5646331310272217, max_abs=4.625, mean_rel=0.15018069744110107, max_rel=788.49658203125, norm_rel=0.02274690382182598, ref_abs_avg=24.760311126708984, test_abs_avg=24.75872230529785
production_forward grad[70] vs paper_forward: mean_abs=0.5501946210861206, max_abs=5.0, mean_rel=0.15938687324523926, max_rel=1346.4490966796875, norm_rel=0.022743187844753265, ref_abs_avg=24.282764434814453, test_abs_avg=24.285289764404297
production_forward grad[71] vs paper_forward: mean_abs=0.43326830863952637, max_abs=1.625, mean_rel=0.08187076449394226, max_rel=3.724822521209717, norm_rel=0.021051255986094475, ref_abs_avg=20.223278045654297, test_abs_avg=20.18220329284668
production_forward grad[72] vs paper_forward: mean_abs=0.5323240756988525, max_abs=4.0, mean_rel=0.15289048850536346, max_rel=1397.3486328125, norm_rel=0.022413453087210655, ref_abs_avg=23.717998504638672, test_abs_avg=23.717540740966797
production_forward grad[73] vs paper_forward: mean_abs=0.5166153907775879, max_abs=3.75, mean_rel=0.14368900656700134, max_rel=516.9876708984375, norm_rel=0.021953845396637917, ref_abs_avg=23.520706176757812, test_abs_avg=23.51628875732422
production_forward grad[74] vs paper_forward: mean_abs=0.48476552963256836, max_abs=1.75, mean_rel=0.09811193495988846, max_rel=11.956940650939941, norm_rel=0.023192111402750015, ref_abs_avg=20.314353942871094, test_abs_avg=20.32588005065918
production_forward grad[75] vs paper_forward: mean_abs=0.5900394916534424, max_abs=4.875, mean_rel=0.16162529587745667, max_rel=820.2802734375, norm_rel=0.023993492126464844, ref_abs_avg=24.646133422851562, test_abs_avg=24.6448974609375
production_forward grad[76] vs paper_forward: mean_abs=0.580504298210144, max_abs=4.75, mean_rel=0.16328881680965424, max_rel=798.0999145507812, norm_rel=0.023473503068089485, ref_abs_avg=24.685100555419922, test_abs_avg=24.675207138061523
production_forward grad[77] vs paper_forward: mean_abs=0.4481797218322754, max_abs=1.5, mean_rel=0.07953536510467529, max_rel=3.7249057292938232, norm_rel=0.023292100057005882, ref_abs_avg=19.69887351989746, test_abs_avg=19.658870697021484
production_forward grad[78] vs paper_forward: mean_abs=0.552211344242096, max_abs=4.125, mean_rel=0.15196123719215393, max_rel=740.1456909179688, norm_rel=0.023447973653674126, ref_abs_avg=23.530475616455078, test_abs_avg=23.52702522277832
production_forward grad[79] vs paper_forward: mean_abs=0.5350534915924072, max_abs=4.5, mean_rel=0.13860875368118286, max_rel=572.1358642578125, norm_rel=0.02282400242984295, ref_abs_avg=23.452308654785156, test_abs_avg=23.451921463012695
production_forward grad[80] vs paper_forward: mean_abs=0.4057067632675171, max_abs=1.5, mean_rel=0.26918768882751465, max_rel=59.773380279541016, norm_rel=0.02225775085389614, ref_abs_avg=18.247282028198242, test_abs_avg=18.22575569152832
production_forward grad[81] vs paper_forward: mean_abs=0.5158642530441284, max_abs=4.0, mean_rel=0.1456502079963684, max_rel=1515.79345703125, norm_rel=0.02310936152935028, ref_abs_avg=22.325157165527344, test_abs_avg=22.324504852294922
production_forward grad[82] vs paper_forward: mean_abs=0.512336015701294, max_abs=4.5, mean_rel=0.14380742609500885, max_rel=660.3223876953125, norm_rel=0.022954130545258522, ref_abs_avg=22.346233367919922, test_abs_avg=22.349388122558594
production_forward grad[83] vs paper_forward: mean_abs=0.38260746002197266, max_abs=1.59375, mean_rel=0.07906144857406616, max_rel=5.185259819030762, norm_rel=0.02095532976090908, ref_abs_avg=18.605737686157227, test_abs_avg=18.609256744384766
production_forward grad[84] vs paper_forward: mean_abs=0.4735477566719055, max_abs=4.5, mean_rel=0.1505710482597351, max_rel=674.188232421875, norm_rel=0.02224603481590748, ref_abs_avg=21.29195785522461, test_abs_avg=21.292497634887695
production_forward grad[85] vs paper_forward: mean_abs=0.46416038274765015, max_abs=5.75, mean_rel=0.15082822740077972, max_rel=979.8134155273438, norm_rel=0.02251463755965233, ref_abs_avg=20.694190979003906, test_abs_avg=20.701187133789062
production_forward grad[86] vs paper_forward: mean_abs=0.34786272048950195, max_abs=1.5, mean_rel=0.11327804625034332, max_rel=12.435100555419922, norm_rel=0.020397307351231575, ref_abs_avg=17.406631469726562, test_abs_avg=17.40161895751953
production_forward grad[87] vs paper_forward: mean_abs=0.44423580169677734, max_abs=4.125, mean_rel=0.13069549202919006, max_rel=497.41119384765625, norm_rel=0.02141794003546238, ref_abs_avg=20.811359405517578, test_abs_avg=20.81082534790039
production_forward grad[88] vs paper_forward: mean_abs=0.44057968258857727, max_abs=5.0, mean_rel=0.139848530292511, max_rel=621.4778442382812, norm_rel=0.02195962704718113, ref_abs_avg=20.164554595947266, test_abs_avg=20.163747787475586
production_forward grad[89] vs paper_forward: mean_abs=0.3432779312133789, max_abs=1.46875, mean_rel=0.0871109813451767, max_rel=3.339465618133545, norm_rel=0.021093184128403664, ref_abs_avg=16.530242919921875, test_abs_avg=16.554031372070312
production_forward grad[90] vs paper_forward: mean_abs=0.41377073526382446, max_abs=4.5, mean_rel=0.13282953202724457, max_rel=1395.1357421875, norm_rel=0.02089782990515232, ref_abs_avg=19.9345760345459, test_abs_avg=19.933530807495117
production_forward grad[91] vs paper_forward: mean_abs=0.40499481558799744, max_abs=4.5, mean_rel=0.13185933232307434, max_rel=517.6785278320312, norm_rel=0.020540926605463028, ref_abs_avg=19.80035400390625, test_abs_avg=19.809175491333008
production_forward grad[92] vs paper_forward: mean_abs=0.3061767816543579, max_abs=1.3125, mean_rel=0.15756793320178986, max_rel=26.08028793334961, norm_rel=0.0198653731495142, ref_abs_avg=15.287399291992188, test_abs_avg=15.316259384155273
production_forward grad[93] vs paper_forward: mean_abs=0.3913017511367798, max_abs=4.0, mean_rel=0.12859150767326355, max_rel=568.1455078125, norm_rel=0.02054259367287159, ref_abs_avg=19.228591918945312, test_abs_avg=19.227184295654297
production_forward grad[94] vs paper_forward: mean_abs=0.38146770000457764, max_abs=4.0, mean_rel=0.12729015946388245, max_rel=781.0700073242188, norm_rel=0.02042376436293125, ref_abs_avg=18.87894058227539, test_abs_avg=18.88327407836914
production_forward grad[95] vs paper_forward: mean_abs=0.31255149841308594, max_abs=1.5, mean_rel=0.0587751567363739, max_rel=3.1315951347351074, norm_rel=0.01979190483689308, ref_abs_avg=15.87424087524414, test_abs_avg=15.882394790649414
production_forward grad[96] vs paper_forward: mean_abs=0.3719303011894226, max_abs=4.0, mean_rel=0.11941838264465332, max_rel=552.377685546875, norm_rel=0.020220620557665825, ref_abs_avg=18.67150115966797, test_abs_avg=18.67222785949707
production_forward grad[97] vs paper_forward: mean_abs=0.3682655692100525, max_abs=5.0, mean_rel=0.12461499124765396, max_rel=652.4827880859375, norm_rel=0.020325327292084694, ref_abs_avg=18.535049438476562, test_abs_avg=18.531471252441406
production_forward2 vs paper_forward output: mean_abs=0.0017196618719026446, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.009287435561418533, max_abs=0.40625, mean_rel=0.07708054035902023, max_rel=180.87696838378906, norm_rel=0.02109231799840927, ref_abs_avg=0.47740787267684937, test_abs_avg=0.4774102568626404
production_forward2 grad[1] vs paper_forward: mean_abs=7.934568881988525, max_abs=64.0, mean_rel=0.1438485085964203, max_rel=134.6349639892578, norm_rel=0.021380754187703133, ref_abs_avg=332.32659912109375, test_abs_avg=332.3601989746094
production_forward2 grad[2] vs paper_forward: mean_abs=1.44630765914917, max_abs=6.0, mean_rel=0.23043757677078247, max_rel=61.79471206665039, norm_rel=0.024950623512268066, ref_abs_avg=58.31346893310547, test_abs_avg=58.36314010620117
production_forward2 grad[3] vs paper_forward: mean_abs=1.7608686685562134, max_abs=10.75, mean_rel=0.18055644631385803, max_rel=2099.952392578125, norm_rel=0.0250522680580616, ref_abs_avg=70.67276000976562, test_abs_avg=70.67587280273438
production_forward2 grad[4] vs paper_forward: mean_abs=1.6985448598861694, max_abs=11.0, mean_rel=0.1819838583469391, max_rel=1484.882568359375, norm_rel=0.024686286225914955, ref_abs_avg=69.20582580566406, test_abs_avg=69.21847534179688
production_forward2 grad[5] vs paper_forward: mean_abs=1.281888484954834, max_abs=4.546875, mean_rel=0.09453552961349487, max_rel=8.698814392089844, norm_rel=0.02471509762108326, ref_abs_avg=50.33711624145508, test_abs_avg=50.377418518066406
production_forward2 grad[6] vs paper_forward: mean_abs=1.4946032762527466, max_abs=9.0, mean_rel=0.18402650952339172, max_rel=2679.120361328125, norm_rel=0.024706700816750526, ref_abs_avg=60.78848648071289, test_abs_avg=60.7874755859375
production_forward2 grad[7] vs paper_forward: mean_abs=1.4553539752960205, max_abs=10.0, mean_rel=0.16794559359550476, max_rel=1632.8985595703125, norm_rel=0.024473242461681366, ref_abs_avg=59.739479064941406, test_abs_avg=59.75012969970703
production_forward2 grad[8] vs paper_forward: mean_abs=1.073594093322754, max_abs=4.25, mean_rel=0.18610745668411255, max_rel=31.299518585205078, norm_rel=0.024214908480644226, ref_abs_avg=44.44817352294922, test_abs_avg=44.44963455200195
production_forward2 grad[9] vs paper_forward: mean_abs=1.3639640808105469, max_abs=9.0, mean_rel=0.1705489605665207, max_rel=1575.907470703125, norm_rel=0.024499161168932915, ref_abs_avg=55.981143951416016, test_abs_avg=55.979087829589844
production_forward2 grad[10] vs paper_forward: mean_abs=1.331441879272461, max_abs=8.0, mean_rel=0.1664704829454422, max_rel=1488.100341796875, norm_rel=0.024308795109391212, ref_abs_avg=55.08021545410156, test_abs_avg=55.074005126953125
production_forward2 grad[11] vs paper_forward: mean_abs=1.009054183959961, max_abs=4.625, mean_rel=0.09456457942724228, max_rel=8.75121784210205, norm_rel=0.024376051500439644, ref_abs_avg=43.31646728515625, test_abs_avg=43.3153190612793
production_forward2 grad[12] vs paper_forward: mean_abs=1.2509334087371826, max_abs=8.5, mean_rel=0.15764442086219788, max_rel=1635.699462890625, norm_rel=0.02427075244486332, ref_abs_avg=51.7795524597168, test_abs_avg=51.781150817871094
production_forward2 grad[13] vs paper_forward: mean_abs=1.2261934280395508, max_abs=8.0, mean_rel=0.15712082386016846, max_rel=859.8744506835938, norm_rel=0.02413223870098591, ref_abs_avg=51.052608489990234, test_abs_avg=51.05087661743164
production_forward2 grad[14] vs paper_forward: mean_abs=0.9489574432373047, max_abs=3.8046875, mean_rel=0.0985158160328865, max_rel=5.880153179168701, norm_rel=0.023545190691947937, ref_abs_avg=40.09723663330078, test_abs_avg=40.11810302734375
production_forward2 grad[15] vs paper_forward: mean_abs=1.1703414916992188, max_abs=8.0, mean_rel=0.17062903940677643, max_rel=1391.1251220703125, norm_rel=0.024101508781313896, ref_abs_avg=48.80442428588867, test_abs_avg=48.805946350097656
production_forward2 grad[16] vs paper_forward: mean_abs=1.1419140100479126, max_abs=7.0, mean_rel=0.1720801293849945, max_rel=1567.73583984375, norm_rel=0.024038653820753098, ref_abs_avg=47.72216796875, test_abs_avg=47.71232986450195
production_forward2 grad[17] vs paper_forward: mean_abs=0.9102423191070557, max_abs=3.25, mean_rel=0.13016074895858765, max_rel=7.946765422821045, norm_rel=0.02384355664253235, ref_abs_avg=38.48554229736328, test_abs_avg=38.445960998535156
production_forward2 grad[18] vs paper_forward: mean_abs=1.0945780277252197, max_abs=7.25, mean_rel=0.17170268297195435, max_rel=1879.5931396484375, norm_rel=0.023842524737119675, ref_abs_avg=46.12638854980469, test_abs_avg=46.12810134887695
production_forward2 grad[19] vs paper_forward: mean_abs=1.0714622735977173, max_abs=7.390625, mean_rel=0.16649207472801208, max_rel=1861.2637939453125, norm_rel=0.023639613762497902, ref_abs_avg=45.485382080078125, test_abs_avg=45.48213577270508
production_forward2 grad[20] vs paper_forward: mean_abs=0.8177733421325684, max_abs=3.75, mean_rel=0.09582965075969696, max_rel=7.794064521789551, norm_rel=0.023586278781294823, ref_abs_avg=35.504119873046875, test_abs_avg=35.481685638427734
production_forward2 grad[21] vs paper_forward: mean_abs=1.0360724925994873, max_abs=7.0, mean_rel=0.16559842228889465, max_rel=2146.505126953125, norm_rel=0.023745298385620117, ref_abs_avg=43.854644775390625, test_abs_avg=43.85588073730469
production_forward2 grad[22] vs paper_forward: mean_abs=1.0192922353744507, max_abs=6.125, mean_rel=0.15044090151786804, max_rel=644.9957275390625, norm_rel=0.023443402722477913, ref_abs_avg=43.725624084472656, test_abs_avg=43.723175048828125
production_forward2 grad[23] vs paper_forward: mean_abs=0.7724189758300781, max_abs=3.0, mean_rel=0.10092964768409729, max_rel=4.799622535705566, norm_rel=0.023578451946377754, ref_abs_avg=33.25893020629883, test_abs_avg=33.27009582519531
production_forward2 grad[24] vs paper_forward: mean_abs=0.9955698251724243, max_abs=6.75, mean_rel=0.16922804713249207, max_rel=2269.69873046875, norm_rel=0.023740040138363838, ref_abs_avg=42.1633415222168, test_abs_avg=42.16501998901367
production_forward2 grad[25] vs paper_forward: mean_abs=0.968890368938446, max_abs=6.5, mean_rel=0.15399055182933807, max_rel=905.6304931640625, norm_rel=0.02342541143298149, ref_abs_avg=41.52214813232422, test_abs_avg=41.51930618286133
production_forward2 grad[26] vs paper_forward: mean_abs=0.9173882007598877, max_abs=3.5, mean_rel=0.15898698568344116, max_rel=35.26221466064453, norm_rel=0.02535819821059704, ref_abs_avg=36.79756164550781, test_abs_avg=36.7416877746582
production_forward2 grad[27] vs paper_forward: mean_abs=1.149367332458496, max_abs=7.0, mean_rel=0.18081054091453552, max_rel=2735.958740234375, norm_rel=0.025730887427926064, ref_abs_avg=44.871315002441406, test_abs_avg=44.871665954589844
production_forward2 grad[28] vs paper_forward: mean_abs=1.1277166604995728, max_abs=8.0, mean_rel=0.171078622341156, max_rel=1444.3492431640625, norm_rel=0.025269415229558945, ref_abs_avg=44.82758331298828, test_abs_avg=44.81455993652344
production_forward2 grad[29] vs paper_forward: mean_abs=0.8390111923217773, max_abs=3.0, mean_rel=0.0858706384897232, max_rel=4.8267059326171875, norm_rel=0.02409861423075199, ref_abs_avg=34.334747314453125, test_abs_avg=34.34563446044922
production_forward2 grad[30] vs paper_forward: mean_abs=1.0632460117340088, max_abs=6.5, mean_rel=0.1749313622713089, max_rel=1653.81103515625, norm_rel=0.025822684168815613, ref_abs_avg=41.35286331176758, test_abs_avg=41.35333251953125
production_forward2 grad[31] vs paper_forward: mean_abs=1.0460418462753296, max_abs=6.0, mean_rel=0.16392001509666443, max_rel=871.9945068359375, norm_rel=0.02579626254737377, ref_abs_avg=40.70027160644531, test_abs_avg=40.68962097167969
production_forward2 grad[32] vs paper_forward: mean_abs=0.7742166519165039, max_abs=2.796875, mean_rel=0.08166391402482986, max_rel=8.3592529296875, norm_rel=0.024198412895202637, ref_abs_avg=32.42280578613281, test_abs_avg=32.420082092285156
production_forward2 grad[33] vs paper_forward: mean_abs=0.9892681837081909, max_abs=6.265625, mean_rel=0.17161071300506592, max_rel=1511.8409423828125, norm_rel=0.025654030963778496, ref_abs_avg=38.663917541503906, test_abs_avg=38.66650390625
production_forward2 grad[34] vs paper_forward: mean_abs=0.9708176851272583, max_abs=6.0, mean_rel=0.17400920391082764, max_rel=1045.141845703125, norm_rel=0.025507811456918716, ref_abs_avg=38.14471435546875, test_abs_avg=38.13914489746094
production_forward2 grad[35] vs paper_forward: mean_abs=0.7422335147857666, max_abs=2.75, mean_rel=0.08924072235822678, max_rel=3.08845853805542, norm_rel=0.024335317313671112, ref_abs_avg=30.270885467529297, test_abs_avg=30.295251846313477
production_forward2 grad[36] vs paper_forward: mean_abs=0.9205905199050903, max_abs=6.1875, mean_rel=0.17675811052322388, max_rel=1164.96484375, norm_rel=0.025356870144605637, ref_abs_avg=36.37940216064453, test_abs_avg=36.3804817199707
production_forward2 grad[37] vs paper_forward: mean_abs=0.9062092304229736, max_abs=5.75, mean_rel=0.1646886169910431, max_rel=1126.9537353515625, norm_rel=0.02508980594575405, ref_abs_avg=36.23287582397461, test_abs_avg=36.23565673828125
production_forward2 grad[38] vs paper_forward: mean_abs=0.6803951263427734, max_abs=2.5, mean_rel=0.07980238646268845, max_rel=2.6372780799865723, norm_rel=0.023189488798379898, ref_abs_avg=29.3680419921875, test_abs_avg=29.3313045501709
production_forward2 grad[39] vs paper_forward: mean_abs=0.8665021657943726, max_abs=5.875, mean_rel=0.172390878200531, max_rel=1270.6693115234375, norm_rel=0.025114402174949646, ref_abs_avg=34.595458984375, test_abs_avg=34.59233093261719
production_forward2 grad[40] vs paper_forward: mean_abs=0.8535370230674744, max_abs=5.5, mean_rel=0.16264373064041138, max_rel=1669.7216796875, norm_rel=0.024941271170973778, ref_abs_avg=34.32914733886719, test_abs_avg=34.32575607299805
production_forward2 grad[41] vs paper_forward: mean_abs=0.6266307830810547, max_abs=2.71875, mean_rel=0.11209864169359207, max_rel=10.53779125213623, norm_rel=0.023192688822746277, ref_abs_avg=27.694366455078125, test_abs_avg=27.682289123535156
production_forward2 grad[42] vs paper_forward: mean_abs=0.8237879872322083, max_abs=5.5, mean_rel=0.1667931079864502, max_rel=865.6714477539062, norm_rel=0.02487407624721527, ref_abs_avg=33.13909149169922, test_abs_avg=33.138771057128906
production_forward2 grad[43] vs paper_forward: mean_abs=0.8044192790985107, max_abs=5.0, mean_rel=0.17840193212032318, max_rel=1023.2320556640625, norm_rel=0.024698181077837944, ref_abs_avg=32.630489349365234, test_abs_avg=32.63385772705078
production_forward2 grad[44] vs paper_forward: mean_abs=0.6447726488113403, max_abs=2.3125, mean_rel=0.08780254423618317, max_rel=2.8937735557556152, norm_rel=0.02420659549534321, ref_abs_avg=26.751880645751953, test_abs_avg=26.765504837036133
production_forward2 grad[45] vs paper_forward: mean_abs=0.7829751968383789, max_abs=5.0, mean_rel=0.1814025640487671, max_rel=1768.156494140625, norm_rel=0.02456783689558506, ref_abs_avg=31.913082122802734, test_abs_avg=31.912731170654297
production_forward2 grad[46] vs paper_forward: mean_abs=0.7680525183677673, max_abs=4.75, mean_rel=0.15773113071918488, max_rel=740.0021362304688, norm_rel=0.02424451895058155, ref_abs_avg=31.724531173706055, test_abs_avg=31.726329803466797
production_forward2 grad[47] vs paper_forward: mean_abs=0.5841162204742432, max_abs=2.375, mean_rel=0.10945741832256317, max_rel=11.03575325012207, norm_rel=0.023903729394078255, ref_abs_avg=24.740617752075195, test_abs_avg=24.732173919677734
production_forward2 grad[48] vs paper_forward: mean_abs=0.7384025454521179, max_abs=5.0, mean_rel=0.16525812447071075, max_rel=1135.095947265625, norm_rel=0.02447337657213211, ref_abs_avg=30.202136993408203, test_abs_avg=30.20298957824707
production_forward2 grad[49] vs paper_forward: mean_abs=0.7344692349433899, max_abs=4.75, mean_rel=0.15845926105976105, max_rel=919.9728393554688, norm_rel=0.024484943598508835, ref_abs_avg=30.032634735107422, test_abs_avg=30.03311538696289
production_forward2 grad[50] vs paper_forward: mean_abs=0.7162905931472778, max_abs=2.5, mean_rel=0.0886891633272171, max_rel=4.372315406799316, norm_rel=0.025673113763332367, ref_abs_avg=27.438201904296875, test_abs_avg=27.398862838745117
production_forward2 grad[51] vs paper_forward: mean_abs=0.8555301427841187, max_abs=5.25, mean_rel=0.18382695317268372, max_rel=1353.5390625, norm_rel=0.02588970586657524, ref_abs_avg=33.118553161621094, test_abs_avg=33.11695861816406
production_forward2 grad[52] vs paper_forward: mean_abs=0.8274307250976562, max_abs=6.5, mean_rel=0.1714089810848236, max_rel=1488.92138671875, norm_rel=0.02530759945511818, ref_abs_avg=32.803955078125, test_abs_avg=32.79859924316406
production_forward2 grad[53] vs paper_forward: mean_abs=0.6180417537689209, max_abs=3.125, mean_rel=0.1964668333530426, max_rel=33.749061584472656, norm_rel=0.023701906204223633, ref_abs_avg=26.734756469726562, test_abs_avg=26.790924072265625
production_forward2 grad[54] vs paper_forward: mean_abs=0.7729984521865845, max_abs=5.6484375, mean_rel=0.16262589395046234, max_rel=1146.1397705078125, norm_rel=0.02528407610952854, ref_abs_avg=30.597240447998047, test_abs_avg=30.594867706298828
production_forward2 grad[55] vs paper_forward: mean_abs=0.7571642994880676, max_abs=5.0, mean_rel=0.17975591123104095, max_rel=1172.69677734375, norm_rel=0.025037480518221855, ref_abs_avg=30.26093292236328, test_abs_avg=30.263622283935547
production_forward2 grad[56] vs paper_forward: mean_abs=0.5477628707885742, max_abs=2.125, mean_rel=0.08235423266887665, max_rel=4.602505683898926, norm_rel=0.02359968051314354, ref_abs_avg=23.819652557373047, test_abs_avg=23.82607078552246
production_forward2 grad[57] vs paper_forward: mean_abs=0.7128925323486328, max_abs=5.3125, mean_rel=0.16948920488357544, max_rel=1319.3187255859375, norm_rel=0.02484028972685337, ref_abs_avg=28.687862396240234, test_abs_avg=28.686614990234375
production_forward2 grad[58] vs paper_forward: mean_abs=0.6952999830245972, max_abs=4.5, mean_rel=0.16292527318000793, max_rel=820.908447265625, norm_rel=0.024592261761426926, ref_abs_avg=28.269990921020508, test_abs_avg=28.26448631286621
production_forward2 grad[59] vs paper_forward: mean_abs=0.5269523859024048, max_abs=1.8125, mean_rel=0.0873175710439682, max_rel=11.7943754196167, norm_rel=0.02264641784131527, ref_abs_avg=23.851655960083008, test_abs_avg=23.871925354003906
production_forward2 grad[60] vs paper_forward: mean_abs=0.6715685129165649, max_abs=6.0, mean_rel=0.1622714400291443, max_rel=1291.4732666015625, norm_rel=0.024095937609672546, ref_abs_avg=27.821685791015625, test_abs_avg=27.821258544921875
production_forward2 grad[61] vs paper_forward: mean_abs=0.6549782752990723, max_abs=5.0, mean_rel=0.17130440473556519, max_rel=1171.7296142578125, norm_rel=0.024270113557577133, ref_abs_avg=27.025251388549805, test_abs_avg=27.02519416809082
production_forward2 grad[62] vs paper_forward: mean_abs=0.5226306915283203, max_abs=2.0, mean_rel=0.09751001000404358, max_rel=6.899025917053223, norm_rel=0.02370958775281906, ref_abs_avg=22.034395217895508, test_abs_avg=22.098533630371094
production_forward2 grad[63] vs paper_forward: mean_abs=0.6294592022895813, max_abs=5.0, mean_rel=0.16627778112888336, max_rel=939.9435424804688, norm_rel=0.023866886273026466, ref_abs_avg=26.387710571289062, test_abs_avg=26.387619018554688
production_forward2 grad[64] vs paper_forward: mean_abs=0.6161550879478455, max_abs=4.5, mean_rel=0.14932510256767273, max_rel=842.0982666015625, norm_rel=0.02357354760169983, ref_abs_avg=26.182479858398438, test_abs_avg=26.182710647583008
production_forward2 grad[65] vs paper_forward: mean_abs=0.4795083999633789, max_abs=1.671875, mean_rel=0.09111184626817703, max_rel=3.958914041519165, norm_rel=0.022585861384868622, ref_abs_avg=21.06168556213379, test_abs_avg=21.107418060302734
production_forward2 grad[66] vs paper_forward: mean_abs=0.5973265171051025, max_abs=4.5, mean_rel=0.15193553268909454, max_rel=762.654052734375, norm_rel=0.023372173309326172, ref_abs_avg=25.532400131225586, test_abs_avg=25.5325984954834
production_forward2 grad[67] vs paper_forward: mean_abs=0.5853366851806641, max_abs=5.0, mean_rel=0.1587614119052887, max_rel=1474.779052734375, norm_rel=0.02362859435379505, ref_abs_avg=24.752660751342773, test_abs_avg=24.746334075927734
production_forward2 grad[68] vs paper_forward: mean_abs=0.46333765983581543, max_abs=2.08203125, mean_rel=0.08154745399951935, max_rel=5.990299701690674, norm_rel=0.02242269180715084, ref_abs_avg=21.03182601928711, test_abs_avg=21.01927947998047
production_forward2 grad[69] vs paper_forward: mean_abs=0.5690106749534607, max_abs=4.5, mean_rel=0.15528368949890137, max_rel=854.7440185546875, norm_rel=0.02292483113706112, ref_abs_avg=24.760311126708984, test_abs_avg=24.758831024169922
production_forward2 grad[70] vs paper_forward: mean_abs=0.5544582009315491, max_abs=5.0, mean_rel=0.161323681473732, max_rel=1476.5245361328125, norm_rel=0.02289889194071293, ref_abs_avg=24.282764434814453, test_abs_avg=24.285110473632812
production_forward2 grad[71] vs paper_forward: mean_abs=0.43109679222106934, max_abs=1.625, mean_rel=0.08200716227293015, max_rel=3.8689510822296143, norm_rel=0.02125689759850502, ref_abs_avg=20.223278045654297, test_abs_avg=20.177387237548828
production_forward2 grad[72] vs paper_forward: mean_abs=0.5359998941421509, max_abs=4.0, mean_rel=0.15695111453533173, max_rel=1351.421142578125, norm_rel=0.022558238357305527, ref_abs_avg=23.717998504638672, test_abs_avg=23.71689796447754
production_forward2 grad[73] vs paper_forward: mean_abs=0.5208126902580261, max_abs=4.0, mean_rel=0.14819841086864471, max_rel=605.4867553710938, norm_rel=0.022127998992800713, ref_abs_avg=23.520706176757812, test_abs_avg=23.5158634185791
production_forward2 grad[74] vs paper_forward: mean_abs=0.4779510498046875, max_abs=2.0, mean_rel=0.08218619227409363, max_rel=6.153068542480469, norm_rel=0.023033563047647476, ref_abs_avg=20.314353942871094, test_abs_avg=20.31888771057129
production_forward2 grad[75] vs paper_forward: mean_abs=0.5969283580780029, max_abs=4.875, mean_rel=0.16483964025974274, max_rel=1026.6734619140625, norm_rel=0.024251261726021767, ref_abs_avg=24.646133422851562, test_abs_avg=24.644189834594727
production_forward2 grad[76] vs paper_forward: mean_abs=0.5866432785987854, max_abs=4.5, mean_rel=0.16670875251293182, max_rel=825.9547729492188, norm_rel=0.023723136633634567, ref_abs_avg=24.685100555419922, test_abs_avg=24.67644500732422
production_forward2 grad[77] vs paper_forward: mean_abs=0.44350385665893555, max_abs=1.6875, mean_rel=0.08594512939453125, max_rel=5.071099758148193, norm_rel=0.023134099319577217, ref_abs_avg=19.69887351989746, test_abs_avg=19.662700653076172
production_forward2 grad[78] vs paper_forward: mean_abs=0.557438850402832, max_abs=4.5, mean_rel=0.15231715142726898, max_rel=748.1226196289062, norm_rel=0.02365795522928238, ref_abs_avg=23.530475616455078, test_abs_avg=23.526721954345703
production_forward2 grad[79] vs paper_forward: mean_abs=0.5414096117019653, max_abs=4.25, mean_rel=0.1407509744167328, max_rel=624.8416137695312, norm_rel=0.02309093438088894, ref_abs_avg=23.452308654785156, test_abs_avg=23.451499938964844
production_forward2 grad[80] vs paper_forward: mean_abs=0.4191373586654663, max_abs=1.75, mean_rel=0.268146276473999, max_rel=59.32071304321289, norm_rel=0.022743599489331245, ref_abs_avg=18.247282028198242, test_abs_avg=18.2198543548584
production_forward2 grad[81] vs paper_forward: mean_abs=0.5194944739341736, max_abs=4.125, mean_rel=0.14837628602981567, max_rel=1764.882080078125, norm_rel=0.023265527561306953, ref_abs_avg=22.325157165527344, test_abs_avg=22.324207305908203
production_forward2 grad[82] vs paper_forward: mean_abs=0.5165143013000488, max_abs=5.0, mean_rel=0.14484024047851562, max_rel=722.807861328125, norm_rel=0.02314370684325695, ref_abs_avg=22.346233367919922, test_abs_avg=22.34880828857422
production_forward2 grad[83] vs paper_forward: mean_abs=0.38945484161376953, max_abs=1.75, mean_rel=0.08092574775218964, max_rel=7.71675968170166, norm_rel=0.021474644541740417, ref_abs_avg=18.605737686157227, test_abs_avg=18.605213165283203
production_forward2 grad[84] vs paper_forward: mean_abs=0.4765891432762146, max_abs=4.5, mean_rel=0.1513630449771881, max_rel=831.832763671875, norm_rel=0.02237658202648163, ref_abs_avg=21.29195785522461, test_abs_avg=21.29241371154785
production_forward2 grad[85] vs paper_forward: mean_abs=0.4675784707069397, max_abs=6.125, mean_rel=0.15201719105243683, max_rel=984.6419677734375, norm_rel=0.022666625678539276, ref_abs_avg=20.694190979003906, test_abs_avg=20.701393127441406
production_forward2 grad[86] vs paper_forward: mean_abs=0.3599514961242676, max_abs=1.5, mean_rel=0.12483364343643188, max_rel=15.3153076171875, norm_rel=0.021094713360071182, ref_abs_avg=17.406631469726562, test_abs_avg=17.40213394165039
production_forward2 grad[87] vs paper_forward: mean_abs=0.44657832384109497, max_abs=4.0, mean_rel=0.13163037598133087, max_rel=450.39599609375, norm_rel=0.021519368514418602, ref_abs_avg=20.811359405517578, test_abs_avg=20.810768127441406
production_forward2 grad[88] vs paper_forward: mean_abs=0.4427359700202942, max_abs=4.75, mean_rel=0.14078152179718018, max_rel=713.8753051757812, norm_rel=0.022072702646255493, ref_abs_avg=20.164554595947266, test_abs_avg=20.163597106933594
production_forward2 grad[89] vs paper_forward: mean_abs=0.3437521457672119, max_abs=1.53125, mean_rel=0.08735824376344681, max_rel=3.659104108810425, norm_rel=0.021174123510718346, ref_abs_avg=16.530242919921875, test_abs_avg=16.54783058166504
production_forward2 grad[90] vs paper_forward: mean_abs=0.4153810143470764, max_abs=4.5, mean_rel=0.13405635952949524, max_rel=1173.3582763671875, norm_rel=0.02096714824438095, ref_abs_avg=19.9345760345459, test_abs_avg=19.93332290649414
production_forward2 grad[91] vs paper_forward: mean_abs=0.4062725007534027, max_abs=4.5, mean_rel=0.1316976547241211, max_rel=676.0693359375, norm_rel=0.020598813891410828, ref_abs_avg=19.80035400390625, test_abs_avg=19.80886459350586
production_forward2 grad[92] vs paper_forward: mean_abs=0.3087502717971802, max_abs=1.375, mean_rel=0.14570415019989014, max_rel=25.530988693237305, norm_rel=0.020378999412059784, ref_abs_avg=15.287399291992188, test_abs_avg=15.316557884216309
production_forward2 grad[93] vs paper_forward: mean_abs=0.39206135272979736, max_abs=4.0, mean_rel=0.12891633808612823, max_rel=481.2362365722656, norm_rel=0.020578833296895027, ref_abs_avg=19.228591918945312, test_abs_avg=19.2271671295166
production_forward2 grad[94] vs paper_forward: mean_abs=0.38132616877555847, max_abs=4.0, mean_rel=0.12711390852928162, max_rel=859.1722412109375, norm_rel=0.020424552261829376, ref_abs_avg=18.87894058227539, test_abs_avg=18.883319854736328
production_forward2 grad[95] vs paper_forward: mean_abs=0.31255149841308594, max_abs=1.5, mean_rel=0.0587751567363739, max_rel=3.1315951347351074, norm_rel=0.01979190483689308, ref_abs_avg=15.87424087524414, test_abs_avg=15.882394790649414
production_forward2 grad[96] vs paper_forward: mean_abs=0.3719303011894226, max_abs=4.0, mean_rel=0.11941838264465332, max_rel=552.377685546875, norm_rel=0.020220620557665825, ref_abs_avg=18.67150115966797, test_abs_avg=18.67222785949707
production_forward2 grad[97] vs paper_forward: mean_abs=0.3682655692100525, max_abs=5.0, mean_rel=0.12461499124765396, max_rel=652.4827880859375, norm_rel=0.020325327292084694, ref_abs_avg=18.535049438476562, test_abs_avg=18.531471252441406
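The per-gradient comparison lines above can be reproduced with a small helper. This is a sketch only: the exact formulas are inferred from the metric names (e.g. `norm_rel` taken as the relative L2 norm of the difference), not from the benchmark source.

```python
import math

def compare(ref, test):
    """Compute the comparison metrics printed in the log (inferred formulas).

    ref, test: flat lists of floats from the reference and test implementations.
    """
    diffs = [abs(t - r) for r, t in zip(ref, test)]
    # Elementwise relative error; entries with a zero reference are skipped
    # here (the real harness may instead clamp the denominator with an eps).
    rels = [d / abs(r) for d, r in zip(diffs, ref) if r != 0.0]
    return {
        "mean_abs": sum(diffs) / len(diffs),
        "max_abs": max(diffs),
        "mean_rel": sum(rels) / len(rels),
        "max_rel": max(rels),
        # norm_rel = ||test - ref||_2 / ||ref||_2
        "norm_rel": math.sqrt(sum(d * d for d in diffs))
                    / math.sqrt(sum(r * r for r in ref)),
        "ref_abs_avg": sum(abs(r) for r in ref) / len(ref),
        "test_abs_avg": sum(abs(t) for t in test) / len(test),
    }
```

Note that a small `norm_rel` (~0.02 throughout the log) alongside a huge `max_rel` (hundreds or thousands) is the usual bf16/fp16 signature: a few near-zero reference entries blow up the elementwise relative error while the overall tensors agree closely.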
identity layers + randn queries
production_forward fwd+bwd:  126.683 ms
production_forward bwd-only: 106.261 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=7.868 GiB
production_forward peak reserved:  fwd=3.617 GiB, fwd+bwd=8.867 GiB
production_forward2 fwd+bwd:  224.338 ms
production_forward2 bwd-only: 202.099 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.242 GiB, fwd+bwd=8.992 GiB
paper_forward fwd+bwd:  379.710 ms
paper_forward bwd-only: 293.972 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB
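The `fwd+bwd` / `bwd-only` timings and the `peak allocated` / `peak reserved` figures above are characteristic of a standard PyTorch CUDA benchmark loop. A minimal generic harness is sketched below; it is an assumption about how the numbers were produced, not the benchmark's actual code. On GPU, the timed region would additionally need `torch.cuda.synchronize()` before reading the clock, and the memory columns correspond to `torch.cuda.reset_peak_memory_stats()` followed by `torch.cuda.max_memory_allocated()` / `torch.cuda.max_memory_reserved()`.

```python
import time

def bench_ms(fn, iters=10, warmup=3):
    """Average wall-clock time of fn() in milliseconds.

    Warmup iterations are discarded so autotuning and lazy compilation
    (e.g. the Triton autotune sweep at the top of this log) do not skew
    the measurement.
    """
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3
```

With this shape of harness, a `bwd-only` number is typically obtained as the `fwd+bwd` time of a loop that retains the graph minus a forward-only loop, or by timing `loss.backward(retain_graph=True)` in isolation.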

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016148764407262206, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.00819210521876812, max_abs=0.46875, mean_rel=0.07188614457845688, max_rel=122.01921844482422, norm_rel=0.019827570766210556, ref_abs_avg=0.44994795322418213, test_abs_avg=0.4499640464782715
production_forward grad[1] vs paper_forward: mean_abs=7.232047080993652, max_abs=66.0, mean_rel=0.135599747300148, max_rel=243.49815368652344, norm_rel=0.02044159546494484, ref_abs_avg=312.08782958984375, test_abs_avg=312.2342529296875
production_forward grad[2] vs paper_forward: mean_abs=1.2522563934326172, max_abs=5.375, mean_rel=0.09875061362981796, max_rel=5.007403373718262, norm_rel=0.023451944813132286, ref_abs_avg=53.47811508178711, test_abs_avg=53.60884094238281
production_forward grad[3] vs paper_forward: mean_abs=1.4746677875518799, max_abs=10.0, mean_rel=0.1746571660041809, max_rel=3142.5966796875, norm_rel=0.023441815748810768, ref_abs_avg=63.25735092163086, test_abs_avg=63.26285934448242
production_forward grad[4] vs paper_forward: mean_abs=1.439340353012085, max_abs=10.0, mean_rel=0.14944450557231903, max_rel=630.3737182617188, norm_rel=0.023288249969482422, ref_abs_avg=62.23832321166992, test_abs_avg=62.252906799316406
production_forward grad[5] vs paper_forward: mean_abs=1.0400023460388184, max_abs=4.0, mean_rel=0.25651779770851135, max_rel=81.03218841552734, norm_rel=0.021825440227985382, ref_abs_avg=48.04098129272461, test_abs_avg=48.068016052246094
production_forward grad[6] vs paper_forward: mean_abs=1.3150544166564941, max_abs=8.0, mean_rel=0.15497446060180664, max_rel=1407.2181396484375, norm_rel=0.023271238431334496, ref_abs_avg=56.79205322265625, test_abs_avg=56.79557418823242
production_forward grad[7] vs paper_forward: mean_abs=1.2832506895065308, max_abs=8.0, mean_rel=0.16599102318286896, max_rel=2094.7578125, norm_rel=0.023140693083405495, ref_abs_avg=55.75505447387695, test_abs_avg=55.759559631347656
production_forward grad[8] vs paper_forward: mean_abs=1.0446548461914062, max_abs=4.625, mean_rel=0.08974762260913849, max_rel=6.802282333374023, norm_rel=0.02373734675347805, ref_abs_avg=43.07315444946289, test_abs_avg=43.10915756225586
production_forward grad[9] vs paper_forward: mean_abs=1.2000998258590698, max_abs=7.8203125, mean_rel=0.1617850959300995, max_rel=3316.220947265625, norm_rel=0.023170964792370796, ref_abs_avg=52.08430480957031, test_abs_avg=52.0888786315918
production_forward grad[10] vs paper_forward: mean_abs=1.1732233762741089, max_abs=7.75, mean_rel=0.14786377549171448, max_rel=750.3930053710938, norm_rel=0.022850139066576958, ref_abs_avg=51.54917907714844, test_abs_avg=51.555213928222656
production_forward grad[11] vs paper_forward: mean_abs=0.9121603965759277, max_abs=3.5, mean_rel=0.11216564476490021, max_rel=8.189122200012207, norm_rel=0.022196048870682716, ref_abs_avg=42.383155822753906, test_abs_avg=42.36076736450195
production_forward grad[12] vs paper_forward: mean_abs=1.118602991104126, max_abs=7.0, mean_rel=0.16350094974040985, max_rel=1694.733154296875, norm_rel=0.022940076887607574, ref_abs_avg=48.98735809326172, test_abs_avg=48.99036407470703
production_forward grad[13] vs paper_forward: mean_abs=1.0990455150604248, max_abs=7.5, mean_rel=0.16516545414924622, max_rel=1593.7950439453125, norm_rel=0.02272617444396019, ref_abs_avg=48.6209716796875, test_abs_avg=48.61638641357422
production_forward grad[14] vs paper_forward: mean_abs=0.866765022277832, max_abs=3.13671875, mean_rel=0.08420468866825104, max_rel=3.7374918460845947, norm_rel=0.0228200051933527, ref_abs_avg=38.203102111816406, test_abs_avg=38.23146438598633
production_forward grad[15] vs paper_forward: mean_abs=1.0456552505493164, max_abs=6.875, mean_rel=0.15434008836746216, max_rel=1862.4569091796875, norm_rel=0.022845428436994553, ref_abs_avg=45.967437744140625, test_abs_avg=45.97412872314453
production_forward grad[16] vs paper_forward: mean_abs=1.0198185443878174, max_abs=6.375, mean_rel=0.15983650088310242, max_rel=1397.9730224609375, norm_rel=0.022622644901275635, ref_abs_avg=45.30883026123047, test_abs_avg=45.31132125854492
production_forward grad[17] vs paper_forward: mean_abs=0.7865076065063477, max_abs=3.25, mean_rel=0.10981482267379761, max_rel=7.207898139953613, norm_rel=0.022870942950248718, ref_abs_avg=34.33709716796875, test_abs_avg=34.4229621887207
production_forward grad[18] vs paper_forward: mean_abs=0.9820119142532349, max_abs=6.0, mean_rel=0.15373195707798004, max_rel=922.9744873046875, norm_rel=0.02271946147084236, ref_abs_avg=43.41645050048828, test_abs_avg=43.420753479003906
production_forward grad[19] vs paper_forward: mean_abs=0.9610809683799744, max_abs=5.75, mean_rel=0.14402440190315247, max_rel=1221.2213134765625, norm_rel=0.022497113794088364, ref_abs_avg=42.87748718261719, test_abs_avg=42.880760192871094
production_forward grad[20] vs paper_forward: mean_abs=0.759562611579895, max_abs=2.875, mean_rel=0.08884280920028687, max_rel=7.626760959625244, norm_rel=0.021562496200203896, ref_abs_avg=35.16055679321289, test_abs_avg=35.12287139892578
production_forward grad[21] vs paper_forward: mean_abs=0.931492805480957, max_abs=6.0, mean_rel=0.14684021472930908, max_rel=1048.177001953125, norm_rel=0.022449292242527008, ref_abs_avg=41.712345123291016, test_abs_avg=41.714630126953125
production_forward grad[22] vs paper_forward: mean_abs=0.9111759662628174, max_abs=5.5, mean_rel=0.14443083107471466, max_rel=545.730224609375, norm_rel=0.022258836776018143, ref_abs_avg=41.09059524536133, test_abs_avg=41.09453582763672
production_forward grad[23] vs paper_forward: mean_abs=0.6859569549560547, max_abs=2.75, mean_rel=0.14582468569278717, max_rel=12.986615180969238, norm_rel=0.021010320633649826, ref_abs_avg=32.53321838378906, test_abs_avg=32.509918212890625
production_forward grad[24] vs paper_forward: mean_abs=0.8881245255470276, max_abs=6.0, mean_rel=0.15716803073883057, max_rel=2257.2353515625, norm_rel=0.02245747484266758, ref_abs_avg=39.697166442871094, test_abs_avg=39.70073318481445
production_forward grad[25] vs paper_forward: mean_abs=0.8679365515708923, max_abs=5.150390625, mean_rel=0.1574488878250122, max_rel=1176.1888427734375, norm_rel=0.02212817780673504, ref_abs_avg=39.402381896972656, test_abs_avg=39.41553497314453
production_forward grad[26] vs paper_forward: mean_abs=0.8297042846679688, max_abs=3.25, mean_rel=0.09881696105003357, max_rel=5.89796257019043, norm_rel=0.022767366841435432, ref_abs_avg=37.07672882080078, test_abs_avg=37.055938720703125
production_forward grad[27] vs paper_forward: mean_abs=1.0152465105056763, max_abs=6.5, mean_rel=0.18413114547729492, max_rel=2509.205078125, norm_rel=0.02432790957391262, ref_abs_avg=41.90745162963867, test_abs_avg=41.91088104248047
production_forward grad[28] vs paper_forward: mean_abs=0.9903823137283325, max_abs=6.75, mean_rel=0.16013775765895844, max_rel=1049.271240234375, norm_rel=0.024088740348815918, ref_abs_avg=41.327537536621094, test_abs_avg=41.326255798339844
production_forward grad[29] vs paper_forward: mean_abs=0.7737109661102295, max_abs=3.0, mean_rel=0.22477856278419495, max_rel=36.75638198852539, norm_rel=0.023601971566677094, ref_abs_avg=33.15580749511719, test_abs_avg=33.18090057373047
production_forward grad[30] vs paper_forward: mean_abs=0.9519970417022705, max_abs=6.375, mean_rel=0.16622579097747803, max_rel=1068.25146484375, norm_rel=0.024607142433524132, ref_abs_avg=38.82341384887695, test_abs_avg=38.82483673095703
production_forward grad[31] vs paper_forward: mean_abs=0.9272276759147644, max_abs=6.0, mean_rel=0.1584717035293579, max_rel=858.1363525390625, norm_rel=0.02429559826850891, ref_abs_avg=38.2624397277832, test_abs_avg=38.260658264160156
production_forward grad[32] vs paper_forward: mean_abs=0.7275418043136597, max_abs=2.5, mean_rel=0.42011919617652893, max_rel=173.92440795898438, norm_rel=0.0226580873131752, ref_abs_avg=32.70454406738281, test_abs_avg=32.70307159423828
production_forward grad[33] vs paper_forward: mean_abs=0.8976107835769653, max_abs=6.0, mean_rel=0.17273814976215363, max_rel=2925.18115234375, norm_rel=0.024448545649647713, ref_abs_avg=36.81173324584961, test_abs_avg=36.81058120727539
production_forward grad[34] vs paper_forward: mean_abs=0.8800326585769653, max_abs=5.5, mean_rel=0.168723002076149, max_rel=961.0487060546875, norm_rel=0.02444835565984249, ref_abs_avg=36.12456512451172, test_abs_avg=36.12498474121094
production_forward grad[35] vs paper_forward: mean_abs=0.6817375421524048, max_abs=3.0, mean_rel=0.2818429172039032, max_rel=96.84809875488281, norm_rel=0.024718478322029114, ref_abs_avg=28.185205459594727, test_abs_avg=28.316444396972656
production_forward grad[36] vs paper_forward: mean_abs=0.8540627956390381, max_abs=5.0, mean_rel=0.15747418999671936, max_rel=1792.0614013671875, norm_rel=0.024533692747354507, ref_abs_avg=34.94243621826172, test_abs_avg=34.9425163269043
production_forward grad[37] vs paper_forward: mean_abs=0.8371286392211914, max_abs=5.3125, mean_rel=0.16244105994701385, max_rel=597.8568115234375, norm_rel=0.024402473121881485, ref_abs_avg=34.42631912231445, test_abs_avg=34.432518005371094
production_forward grad[38] vs paper_forward: mean_abs=0.6718850135803223, max_abs=2.40625, mean_rel=0.10285988450050354, max_rel=15.757012367248535, norm_rel=0.02558731846511364, ref_abs_avg=26.068958282470703, test_abs_avg=26.090377807617188
production_forward grad[39] vs paper_forward: mean_abs=0.7985662221908569, max_abs=6.0, mean_rel=0.15034039318561554, max_rel=782.4580688476562, norm_rel=0.024154337123036385, ref_abs_avg=33.10108184814453, test_abs_avg=33.10199737548828
production_forward grad[40] vs paper_forward: mean_abs=0.7887276411056519, max_abs=5.0, mean_rel=0.16696375608444214, max_rel=1152.39208984375, norm_rel=0.024218415841460228, ref_abs_avg=32.63813781738281, test_abs_avg=32.6369514465332
production_forward grad[41] vs paper_forward: mean_abs=0.6127815246582031, max_abs=2.5625, mean_rel=0.09153743088245392, max_rel=4.745724678039551, norm_rel=0.023783499374985695, ref_abs_avg=26.182395935058594, test_abs_avg=26.235157012939453
production_forward grad[42] vs paper_forward: mean_abs=0.7566425800323486, max_abs=4.921875, mean_rel=0.16127222776412964, max_rel=980.9312744140625, norm_rel=0.02390735037624836, ref_abs_avg=31.72336196899414, test_abs_avg=31.723800659179688
production_forward grad[43] vs paper_forward: mean_abs=0.742997407913208, max_abs=5.0, mean_rel=0.1646813601255417, max_rel=800.0780029296875, norm_rel=0.024107150733470917, ref_abs_avg=30.874961853027344, test_abs_avg=30.874814987182617
production_forward grad[44] vs paper_forward: mean_abs=0.5886101722717285, max_abs=2.25, mean_rel=0.08441051840782166, max_rel=2.3853163719177246, norm_rel=0.025592537596821785, ref_abs_avg=23.18548011779785, test_abs_avg=23.17049217224121
production_forward grad[45] vs paper_forward: mean_abs=0.7141045331954956, max_abs=4.5, mean_rel=0.1452517807483673, max_rel=841.96826171875, norm_rel=0.023807324469089508, ref_abs_avg=30.037872314453125, test_abs_avg=30.03933334350586
production_forward grad[46] vs paper_forward: mean_abs=0.7053037881851196, max_abs=5.5, mean_rel=0.1560695618391037, max_rel=970.7449951171875, norm_rel=0.02371101826429367, ref_abs_avg=29.823802947998047, test_abs_avg=29.81897735595703
production_forward grad[47] vs paper_forward: mean_abs=0.5741157531738281, max_abs=2.25, mean_rel=0.08128310739994049, max_rel=4.993743896484375, norm_rel=0.023189755156636238, ref_abs_avg=24.54905128479004, test_abs_avg=24.5054988861084
production_forward grad[48] vs paper_forward: mean_abs=0.684701681137085, max_abs=5.5, mean_rel=0.1583593189716339, max_rel=1233.096923828125, norm_rel=0.023323586210608482, ref_abs_avg=29.363859176635742, test_abs_avg=29.364219665527344
production_forward grad[49] vs paper_forward: mean_abs=0.6702260971069336, max_abs=5.0, mean_rel=0.1518346071243286, max_rel=820.2814331054688, norm_rel=0.023469170555472374, ref_abs_avg=28.583637237548828, test_abs_avg=28.588275909423828
production_forward grad[50] vs paper_forward: mean_abs=0.5745944976806641, max_abs=2.125, mean_rel=0.06556368619203568, max_rel=2.83935809135437, norm_rel=0.022134073078632355, ref_abs_avg=25.726213455200195, test_abs_avg=25.721332550048828
production_forward grad[51] vs paper_forward: mean_abs=0.7458934783935547, max_abs=5.15625, mean_rel=0.16585052013397217, max_rel=1414.4425048828125, norm_rel=0.025024980306625366, ref_abs_avg=29.873699188232422, test_abs_avg=29.87486457824707
production_forward grad[52] vs paper_forward: mean_abs=0.735723614692688, max_abs=5.0, mean_rel=0.17443141341209412, max_rel=1185.3529052734375, norm_rel=0.02485906332731247, ref_abs_avg=29.681381225585938, test_abs_avg=29.683616638183594
production_forward grad[53] vs paper_forward: mean_abs=0.5824146270751953, max_abs=2.640625, mean_rel=0.08397450298070908, max_rel=2.916766881942749, norm_rel=0.024542273953557014, ref_abs_avg=23.938743591308594, test_abs_avg=23.952688217163086
production_forward grad[54] vs paper_forward: mean_abs=0.7035871744155884, max_abs=4.7099609375, mean_rel=0.16427773237228394, max_rel=1053.027587890625, norm_rel=0.0246596522629261, ref_abs_avg=28.5637264251709, test_abs_avg=28.565078735351562
production_forward grad[55] vs paper_forward: mean_abs=0.6820811033248901, max_abs=4.5, mean_rel=0.17562329769134521, max_rel=1089.2791748046875, norm_rel=0.02461492270231247, ref_abs_avg=27.77375030517578, test_abs_avg=27.78121566772461
production_forward grad[56] vs paper_forward: mean_abs=0.5268598794937134, max_abs=2.25, mean_rel=0.12558355927467346, max_rel=17.300296783447266, norm_rel=0.025134995579719543, ref_abs_avg=21.584531784057617, test_abs_avg=21.593345642089844
production_forward grad[57] vs paper_forward: mean_abs=0.6605557799339294, max_abs=5.0, mean_rel=0.1624346673488617, max_rel=809.7138061523438, norm_rel=0.02434580959379673, ref_abs_avg=27.18332290649414, test_abs_avg=27.186084747314453
production_forward grad[58] vs paper_forward: mean_abs=0.6401075124740601, max_abs=5.0, mean_rel=0.1435127854347229, max_rel=750.98486328125, norm_rel=0.024204466491937637, ref_abs_avg=26.494863510131836, test_abs_avg=26.503314971923828
production_forward grad[59] vs paper_forward: mean_abs=0.5138187408447266, max_abs=1.9375, mean_rel=0.08484645187854767, max_rel=9.553254127502441, norm_rel=0.023101797327399254, ref_abs_avg=22.087648391723633, test_abs_avg=22.1392822265625
production_forward grad[60] vs paper_forward: mean_abs=0.6153074502944946, max_abs=5.0, mean_rel=0.14830008149147034, max_rel=1139.716796875, norm_rel=0.02380109205842018, ref_abs_avg=25.81564712524414, test_abs_avg=25.81928825378418
production_forward grad[61] vs paper_forward: mean_abs=0.5972363948822021, max_abs=3.8125, mean_rel=0.1600833535194397, max_rel=1016.1434936523438, norm_rel=0.023677870631217957, ref_abs_avg=25.236995697021484, test_abs_avg=25.23975372314453
production_forward grad[62] vs paper_forward: mean_abs=0.4387434720993042, max_abs=1.6875, mean_rel=0.12093694508075714, max_rel=9.16049861907959, norm_rel=0.02284240908920765, ref_abs_avg=20.340808868408203, test_abs_avg=20.330482482910156
production_forward grad[63] vs paper_forward: mean_abs=0.5765589475631714, max_abs=5.0, mean_rel=0.16341423988342285, max_rel=1393.4788818359375, norm_rel=0.023564787581562996, ref_abs_avg=24.470314025878906, test_abs_avg=24.471965789794922
production_forward grad[64] vs paper_forward: mean_abs=0.5683016180992126, max_abs=4.5, mean_rel=0.16751879453659058, max_rel=2363.251953125, norm_rel=0.02330620028078556, ref_abs_avg=24.417078018188477, test_abs_avg=24.41015625
production_forward grad[65] vs paper_forward: mean_abs=0.4376649856567383, max_abs=1.5, mean_rel=0.08647830784320831, max_rel=10.0576753616333, norm_rel=0.02117474004626274, ref_abs_avg=20.781909942626953, test_abs_avg=20.771739959716797
production_forward grad[66] vs paper_forward: mean_abs=0.5523298978805542, max_abs=3.75, mean_rel=0.1503066122531891, max_rel=787.5731201171875, norm_rel=0.023172127082943916, ref_abs_avg=23.844778060913086, test_abs_avg=23.844470977783203
production_forward grad[67] vs paper_forward: mean_abs=0.5374207496643066, max_abs=4.0, mean_rel=0.15246430039405823, max_rel=1187.6834716796875, norm_rel=0.02284347265958786, ref_abs_avg=23.485158920288086, test_abs_avg=23.489826202392578
production_forward grad[68] vs paper_forward: mean_abs=0.4259098172187805, max_abs=2.0, mean_rel=0.0855979174375534, max_rel=2.796800374984741, norm_rel=0.021809572353959084, ref_abs_avg=19.73990249633789, test_abs_avg=19.70072364807129
production_forward grad[69] vs paper_forward: mean_abs=0.5249317288398743, max_abs=4.625, mean_rel=0.14316707849502563, max_rel=1044.737060546875, norm_rel=0.022810179740190506, ref_abs_avg=22.9874267578125, test_abs_avg=22.989118576049805
production_forward grad[70] vs paper_forward: mean_abs=0.518179714679718, max_abs=4.1484375, mean_rel=0.13771361112594604, max_rel=935.7142333984375, norm_rel=0.022658616304397583, ref_abs_avg=22.851093292236328, test_abs_avg=22.85083770751953
production_forward grad[71] vs paper_forward: mean_abs=0.38541245460510254, max_abs=1.75, mean_rel=0.06822367012500763, max_rel=5.71719217300415, norm_rel=0.020376555621623993, ref_abs_avg=19.494266510009766, test_abs_avg=19.449459075927734
production_forward grad[72] vs paper_forward: mean_abs=0.503851056098938, max_abs=4.0, mean_rel=0.15404212474822998, max_rel=1381.5489501953125, norm_rel=0.022324776276946068, ref_abs_avg=22.498714447021484, test_abs_avg=22.501916885375977
production_forward grad[73] vs paper_forward: mean_abs=0.4876607060432434, max_abs=4.25, mean_rel=0.1321967989206314, max_rel=648.9069213867188, norm_rel=0.022309474647045135, ref_abs_avg=21.882320404052734, test_abs_avg=21.883434295654297
production_forward grad[74] vs paper_forward: mean_abs=0.46945953369140625, max_abs=2.0, mean_rel=0.11608262360095978, max_rel=12.605558395385742, norm_rel=0.023545565083622932, ref_abs_avg=20.321762084960938, test_abs_avg=20.333202362060547
production_forward grad[75] vs paper_forward: mean_abs=0.5685954093933105, max_abs=4.5, mean_rel=0.1650407910346985, max_rel=1255.786376953125, norm_rel=0.024853181093931198, ref_abs_avg=22.873035430908203, test_abs_avg=22.87461280822754
production_forward grad[76] vs paper_forward: mean_abs=0.5505011677742004, max_abs=4.5, mean_rel=0.16053816676139832, max_rel=821.7787475585938, norm_rel=0.024309823289513588, ref_abs_avg=22.691490173339844, test_abs_avg=22.69110107421875
production_forward grad[77] vs paper_forward: mean_abs=0.41134071350097656, max_abs=2.0, mean_rel=0.08044148236513138, max_rel=4.434621334075928, norm_rel=0.023237910121679306, ref_abs_avg=18.182876586914062, test_abs_avg=18.159757614135742
production_forward grad[78] vs paper_forward: mean_abs=0.5158631801605225, max_abs=4.0, mean_rel=0.1527842879295349, max_rel=919.3702392578125, norm_rel=0.023926744237542152, ref_abs_avg=21.552841186523438, test_abs_avg=21.55280303955078
production_forward grad[79] vs paper_forward: mean_abs=0.5040347576141357, max_abs=4.125, mean_rel=0.1564597189426422, max_rel=952.3319702148438, norm_rel=0.02420753985643387, ref_abs_avg=20.955177307128906, test_abs_avg=20.952056884765625
production_forward grad[80] vs paper_forward: mean_abs=0.3862419128417969, max_abs=1.75, mean_rel=0.08640813827514648, max_rel=3.5807127952575684, norm_rel=0.022253423929214478, ref_abs_avg=17.313169479370117, test_abs_avg=17.29557991027832
production_forward grad[81] vs paper_forward: mean_abs=0.4702146053314209, max_abs=4.78125, mean_rel=0.14130443334579468, max_rel=596.0447387695312, norm_rel=0.02307949960231781, ref_abs_avg=20.350406646728516, test_abs_avg=20.351062774658203
production_forward grad[82] vs paper_forward: mean_abs=0.457530677318573, max_abs=4.75, mean_rel=0.14379070699214935, max_rel=1201.2137451171875, norm_rel=0.023148203268647194, ref_abs_avg=19.831222534179688, test_abs_avg=19.837013244628906
production_forward grad[83] vs paper_forward: mean_abs=0.36802124977111816, max_abs=1.75, mean_rel=0.16340643167495728, max_rel=12.670269012451172, norm_rel=0.02258034609258175, ref_abs_avg=15.780231475830078, test_abs_avg=15.780410766601562
production_forward grad[84] vs paper_forward: mean_abs=0.4321349561214447, max_abs=3.75, mean_rel=0.1375046968460083, max_rel=658.939208984375, norm_rel=0.02235652320086956, ref_abs_avg=19.400131225585938, test_abs_avg=19.400287628173828
production_forward grad[85] vs paper_forward: mean_abs=0.41763120889663696, max_abs=3.625, mean_rel=0.13874685764312744, max_rel=729.6727294921875, norm_rel=0.021879171952605247, ref_abs_avg=19.100723266601562, test_abs_avg=19.096935272216797
production_forward grad[86] vs paper_forward: mean_abs=0.3322865962982178, max_abs=1.46875, mean_rel=0.15458258986473083, max_rel=11.610735893249512, norm_rel=0.021726597100496292, ref_abs_avg=15.191651344299316, test_abs_avg=15.185486793518066
production_forward grad[87] vs paper_forward: mean_abs=0.40706831216812134, max_abs=4.0625, mean_rel=0.13103345036506653, max_rel=689.5433959960938, norm_rel=0.021768731996417046, ref_abs_avg=18.771224975585938, test_abs_avg=18.77088165283203
production_forward grad[88] vs paper_forward: mean_abs=0.39663541316986084, max_abs=3.5, mean_rel=0.13890132308006287, max_rel=784.8770141601562, norm_rel=0.021775076165795326, ref_abs_avg=18.35757827758789, test_abs_avg=18.350725173950195
production_forward grad[89] vs paper_forward: mean_abs=0.29272156953811646, max_abs=1.25, mean_rel=0.11843037605285645, max_rel=8.120078086853027, norm_rel=0.02026907540857792, ref_abs_avg=14.393152236938477, test_abs_avg=14.394679069519043
production_forward grad[90] vs paper_forward: mean_abs=0.37520262598991394, max_abs=3.5, mean_rel=0.13793732225894928, max_rel=937.94140625, norm_rel=0.02146897092461586, ref_abs_avg=17.597553253173828, test_abs_avg=17.596065521240234
production_forward grad[91] vs paper_forward: mean_abs=0.381355881690979, max_abs=4.25, mean_rel=0.13787923753261566, max_rel=559.3565673828125, norm_rel=0.02160450629889965, ref_abs_avg=17.85057258605957, test_abs_avg=17.850873947143555
production_forward grad[92] vs paper_forward: mean_abs=0.3026895523071289, max_abs=1.25, mean_rel=0.08483367413282394, max_rel=3.8443989753723145, norm_rel=0.02000734955072403, ref_abs_avg=15.655653953552246, test_abs_avg=15.64249324798584
production_forward grad[93] vs paper_forward: mean_abs=0.3625352382659912, max_abs=4.25, mean_rel=0.12483949959278107, max_rel=1026.989013671875, norm_rel=0.020668616518378258, ref_abs_avg=17.74825668334961, test_abs_avg=17.748260498046875
production_forward grad[94] vs paper_forward: mean_abs=0.3413453698158264, max_abs=3.75, mean_rel=0.12823759019374847, max_rel=839.5266723632812, norm_rel=0.019462816417217255, ref_abs_avg=17.565946578979492, test_abs_avg=17.56557273864746
production_forward grad[95] vs paper_forward: mean_abs=0.2870502471923828, max_abs=1.25, mean_rel=0.0508425310254097, max_rel=1.0385780334472656, norm_rel=0.021736659109592438, ref_abs_avg=13.872093200683594, test_abs_avg=13.870119094848633
production_forward grad[96] vs paper_forward: mean_abs=0.336063414812088, max_abs=4.0, mean_rel=0.12127788364887238, max_rel=1318.4185791015625, norm_rel=0.019987262785434723, ref_abs_avg=17.08713722229004, test_abs_avg=17.086727142333984
production_forward grad[97] vs paper_forward: mean_abs=0.3283958435058594, max_abs=3.4375, mean_rel=0.12167129665613174, max_rel=580.4605102539062, norm_rel=0.020144354552030563, ref_abs_avg=16.573074340820312, test_abs_avg=16.57986068725586
production_forward2 vs paper_forward output: mean_abs=0.0016148764407262206, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.00851622223854065, max_abs=0.46875, mean_rel=0.07441391050815582, max_rel=111.69783020019531, norm_rel=0.02048468589782715, ref_abs_avg=0.44994795322418213, test_abs_avg=0.4499530792236328
production_forward2 grad[1] vs paper_forward: mean_abs=7.363110065460205, max_abs=66.0, mean_rel=0.1418275237083435, max_rel=303.0342102050781, norm_rel=0.020812949165701866, ref_abs_avg=312.08782958984375, test_abs_avg=312.2216491699219
production_forward2 grad[2] vs paper_forward: mean_abs=1.2587488889694214, max_abs=4.0, mean_rel=0.0970991998910904, max_rel=5.451798915863037, norm_rel=0.023380782455205917, ref_abs_avg=53.47811508178711, test_abs_avg=53.57908630371094
production_forward2 grad[3] vs paper_forward: mean_abs=1.5228838920593262, max_abs=9.5, mean_rel=0.1719159185886383, max_rel=2537.1572265625, norm_rel=0.024187516421079636, ref_abs_avg=63.25735092163086, test_abs_avg=63.26028060913086
production_forward2 grad[4] vs paper_forward: mean_abs=1.4843719005584717, max_abs=10.0, mean_rel=0.15050914883613586, max_rel=600.1909790039062, norm_rel=0.023999040946364403, ref_abs_avg=62.23832321166992, test_abs_avg=62.24689483642578
production_forward2 grad[5] vs paper_forward: mean_abs=1.020735263824463, max_abs=3.75, mean_rel=0.21829023957252502, max_rel=64.35973358154297, norm_rel=0.021762564778327942, ref_abs_avg=48.04098129272461, test_abs_avg=48.08491897583008
production_forward2 grad[6] vs paper_forward: mean_abs=1.353582501411438, max_abs=9.0, mean_rel=0.16007806360721588, max_rel=1219.08251953125, norm_rel=0.023940935730934143, ref_abs_avg=56.79205322265625, test_abs_avg=56.79270935058594
production_forward2 grad[7] vs paper_forward: mean_abs=1.3276476860046387, max_abs=8.0625, mean_rel=0.16945220530033112, max_rel=2133.025390625, norm_rel=0.023933710530400276, ref_abs_avg=55.75505447387695, test_abs_avg=55.75458526611328
production_forward2 grad[8] vs paper_forward: mean_abs=1.0602750778198242, max_abs=4.125, mean_rel=0.09363114833831787, max_rel=8.190104484558105, norm_rel=0.02408062480390072, ref_abs_avg=43.07315444946289, test_abs_avg=43.14000701904297
production_forward2 grad[9] vs paper_forward: mean_abs=1.233568787574768, max_abs=8.0, mean_rel=0.16171970963478088, max_rel=3439.056884765625, norm_rel=0.023812411352992058, ref_abs_avg=52.08430480957031, test_abs_avg=52.087806701660156
production_forward2 grad[10] vs paper_forward: mean_abs=1.2082395553588867, max_abs=7.5, mean_rel=0.1475835144519806, max_rel=647.8099365234375, norm_rel=0.023531107231974602, ref_abs_avg=51.54917907714844, test_abs_avg=51.553009033203125
production_forward2 grad[11] vs paper_forward: mean_abs=0.9525299072265625, max_abs=3.25, mean_rel=0.12930436432361603, max_rel=12.260960578918457, norm_rel=0.022887904196977615, ref_abs_avg=42.383155822753906, test_abs_avg=42.42186737060547
production_forward2 grad[12] vs paper_forward: mean_abs=1.1493290662765503, max_abs=7.15625, mean_rel=0.1656791716814041, max_rel=1362.2440185546875, norm_rel=0.023574111983180046, ref_abs_avg=48.98735809326172, test_abs_avg=48.990875244140625
production_forward2 grad[13] vs paper_forward: mean_abs=1.1308109760284424, max_abs=7.5, mean_rel=0.16642703115940094, max_rel=1372.1865234375, norm_rel=0.02338644675910473, ref_abs_avg=48.6209716796875, test_abs_avg=48.61300277709961
production_forward2 grad[14] vs paper_forward: mean_abs=0.8596992492675781, max_abs=3.5, mean_rel=0.0855257660150528, max_rel=3.6326324939727783, norm_rel=0.023222507908940315, ref_abs_avg=38.203102111816406, test_abs_avg=38.215423583984375
production_forward2 grad[15] vs paper_forward: mean_abs=1.0716041326522827, max_abs=7.125, mean_rel=0.1553286910057068, max_rel=2176.529296875, norm_rel=0.02341216802597046, ref_abs_avg=45.967437744140625, test_abs_avg=45.97367477416992
production_forward2 grad[16] vs paper_forward: mean_abs=1.0444108247756958, max_abs=6.25, mean_rel=0.16203245520591736, max_rel=1569.7349853515625, norm_rel=0.0231776162981987, ref_abs_avg=45.30883026123047, test_abs_avg=45.3099365234375
production_forward2 grad[17] vs paper_forward: mean_abs=0.7782125473022461, max_abs=3.25, mean_rel=0.1284150928258896, max_rel=15.221768379211426, norm_rel=0.02286055125296116, ref_abs_avg=34.33709716796875, test_abs_avg=34.39835739135742
production_forward2 grad[18] vs paper_forward: mean_abs=1.0070762634277344, max_abs=6.328125, mean_rel=0.15652722120285034, max_rel=628.8102416992188, norm_rel=0.023287929594516754, ref_abs_avg=43.41645050048828, test_abs_avg=43.42020034790039
production_forward2 grad[19] vs paper_forward: mean_abs=0.98505699634552, max_abs=6.0, mean_rel=0.15179038047790527, max_rel=1297.3944091796875, norm_rel=0.023046521469950676, ref_abs_avg=42.87748718261719, test_abs_avg=42.87955856323242
production_forward2 grad[20] vs paper_forward: mean_abs=0.7848339080810547, max_abs=3.0, mean_rel=0.08480386435985565, max_rel=8.515531539916992, norm_rel=0.02215133234858513, ref_abs_avg=35.16055679321289, test_abs_avg=35.117515563964844
production_forward2 grad[21] vs paper_forward: mean_abs=0.9522467255592346, max_abs=6.0, mean_rel=0.1535658985376358, max_rel=1247.6103515625, norm_rel=0.022948214784264565, ref_abs_avg=41.712345123291016, test_abs_avg=41.71326446533203
production_forward2 grad[22] vs paper_forward: mean_abs=0.9319502711296082, max_abs=6.203125, mean_rel=0.15315186977386475, max_rel=779.7614135742188, norm_rel=0.022769879549741745, ref_abs_avg=41.09059524536133, test_abs_avg=41.09540557861328
production_forward2 grad[23] vs paper_forward: mean_abs=0.6846520900726318, max_abs=2.75, mean_rel=0.14530807733535767, max_rel=13.501741409301758, norm_rel=0.021527454257011414, ref_abs_avg=32.53321838378906, test_abs_avg=32.52325439453125
production_forward2 grad[24] vs paper_forward: mean_abs=0.9063184857368469, max_abs=6.0, mean_rel=0.16281060874462128, max_rel=2216.1923828125, norm_rel=0.02293008379638195, ref_abs_avg=39.697166442871094, test_abs_avg=39.700225830078125
production_forward2 grad[25] vs paper_forward: mean_abs=0.8862032294273376, max_abs=5.125, mean_rel=0.15550370514392853, max_rel=1027.804443359375, norm_rel=0.022566525265574455, ref_abs_avg=39.402381896972656, test_abs_avg=39.413307189941406
production_forward2 grad[26] vs paper_forward: mean_abs=0.856769323348999, max_abs=3.5, mean_rel=0.10314860194921494, max_rel=7.750566005706787, norm_rel=0.02363491803407669, ref_abs_avg=37.07672882080078, test_abs_avg=37.0587272644043
production_forward2 grad[27] vs paper_forward: mean_abs=1.0373982191085815, max_abs=6.7734375, mean_rel=0.18767768144607544, max_rel=2133.27978515625, norm_rel=0.02485162764787674, ref_abs_avg=41.90745162963867, test_abs_avg=41.909385681152344
production_forward2 grad[28] vs paper_forward: mean_abs=1.0147194862365723, max_abs=7.0, mean_rel=0.15646815299987793, max_rel=964.9027709960938, norm_rel=0.02465982921421528, ref_abs_avg=41.327537536621094, test_abs_avg=41.32396697998047
production_forward2 grad[29] vs paper_forward: mean_abs=0.7794212102890015, max_abs=3.5, mean_rel=0.18664103746414185, max_rel=27.4191837310791, norm_rel=0.024065671488642693, ref_abs_avg=33.15580749511719, test_abs_avg=33.19740676879883
production_forward2 grad[30] vs paper_forward: mean_abs=0.9724976420402527, max_abs=6.5, mean_rel=0.1714954376220703, max_rel=1232.4979248046875, norm_rel=0.025131678208708763, ref_abs_avg=38.82341384887695, test_abs_avg=38.82404327392578
production_forward2 grad[31] vs paper_forward: mean_abs=0.9485917091369629, max_abs=5.5, mean_rel=0.16428326070308685, max_rel=1003.4332885742188, norm_rel=0.024832356721162796, ref_abs_avg=38.2624397277832, test_abs_avg=38.256866455078125
production_forward2 grad[32] vs paper_forward: mean_abs=0.7528892755508423, max_abs=2.75, mean_rel=0.12360928952693939, max_rel=24.536897659301758, norm_rel=0.023418782278895378, ref_abs_avg=32.70454406738281, test_abs_avg=32.7144660949707
production_forward2 grad[33] vs paper_forward: mean_abs=0.9143328666687012, max_abs=6.0, mean_rel=0.17293989658355713, max_rel=2595.3720703125, norm_rel=0.02487502619624138, ref_abs_avg=36.81173324584961, test_abs_avg=36.81017303466797
production_forward2 grad[34] vs paper_forward: mean_abs=0.8997844457626343, max_abs=5.5, mean_rel=0.17086100578308105, max_rel=946.3753051757812, norm_rel=0.024978408589959145, ref_abs_avg=36.12456512451172, test_abs_avg=36.124549865722656
production_forward2 grad[35] vs paper_forward: mean_abs=0.685705304145813, max_abs=3.25, mean_rel=0.30889180302619934, max_rel=105.41073608398438, norm_rel=0.024653680622577667, ref_abs_avg=28.185205459594727, test_abs_avg=28.30522918701172
production_forward2 grad[36] vs paper_forward: mean_abs=0.8677185773849487, max_abs=5.5, mean_rel=0.16069349646568298, max_rel=1948.8914794921875, norm_rel=0.024929596111178398, ref_abs_avg=34.94243621826172, test_abs_avg=34.94167709350586
production_forward2 grad[37] vs paper_forward: mean_abs=0.8528746366500854, max_abs=5.75, mean_rel=0.16370901465415955, max_rel=509.90362548828125, norm_rel=0.024851812049746513, ref_abs_avg=34.42631912231445, test_abs_avg=34.432456970214844
production_forward2 grad[38] vs paper_forward: mean_abs=0.6650509834289551, max_abs=2.75, mean_rel=0.10970843583345413, max_rel=14.179903030395508, norm_rel=0.025737503543496132, ref_abs_avg=26.068958282470703, test_abs_avg=26.08294677734375
production_forward2 grad[39] vs paper_forward: mean_abs=0.8104313611984253, max_abs=5.5, mean_rel=0.15182426571846008, max_rel=645.6123657226562, norm_rel=0.024513004347682, ref_abs_avg=33.10108184814453, test_abs_avg=33.10166549682617
production_forward2 grad[40] vs paper_forward: mean_abs=0.7985630035400391, max_abs=5.0, mean_rel=0.1695362627506256, max_rel=896.4443969726562, norm_rel=0.02451510913670063, ref_abs_avg=32.63813781738281, test_abs_avg=32.63713455200195
production_forward2 grad[41] vs paper_forward: mean_abs=0.6188449859619141, max_abs=2.3125, mean_rel=0.10141408443450928, max_rel=4.94633674621582, norm_rel=0.02403981238603592, ref_abs_avg=26.182395935058594, test_abs_avg=26.22578239440918
production_forward2 grad[42] vs paper_forward: mean_abs=0.7676536440849304, max_abs=4.65625, mean_rel=0.1652819663286209, max_rel=1176.7554931640625, norm_rel=0.02424798347055912, ref_abs_avg=31.72336196899414, test_abs_avg=31.72353172302246
production_forward2 grad[43] vs paper_forward: mean_abs=0.7535957098007202, max_abs=5.1171875, mean_rel=0.1657567024230957, max_rel=770.9913940429688, norm_rel=0.02443673089146614, ref_abs_avg=30.874961853027344, test_abs_avg=30.874210357666016
production_forward2 grad[44] vs paper_forward: mean_abs=0.6233223676681519, max_abs=2.5, mean_rel=0.09075833857059479, max_rel=2.3700222969055176, norm_rel=0.026944078505039215, ref_abs_avg=23.18548011779785, test_abs_avg=23.152267456054688
production_forward2 grad[45] vs paper_forward: mean_abs=0.7231475114822388, max_abs=4.90625, mean_rel=0.1478247493505478, max_rel=630.4822998046875, norm_rel=0.024101585149765015, ref_abs_avg=30.037872314453125, test_abs_avg=30.038972854614258
production_forward2 grad[46] vs paper_forward: mean_abs=0.7141057848930359, max_abs=5.0, mean_rel=0.16170673072338104, max_rel=875.699951171875, norm_rel=0.0239923894405365, ref_abs_avg=29.823802947998047, test_abs_avg=29.81920623779297
production_forward2 grad[47] vs paper_forward: mean_abs=0.6060304641723633, max_abs=2.125, mean_rel=0.0866384208202362, max_rel=5.173172950744629, norm_rel=0.024376584216952324, ref_abs_avg=24.54905128479004, test_abs_avg=24.49627685546875
production_forward2 grad[48] vs paper_forward: mean_abs=0.6933692693710327, max_abs=5.5, mean_rel=0.16126549243927002, max_rel=1363.3018798828125, norm_rel=0.023613635450601578, ref_abs_avg=29.363859176635742, test_abs_avg=29.365081787109375
production_forward2 grad[49] vs paper_forward: mean_abs=0.6783739328384399, max_abs=5.0, mean_rel=0.15267309546470642, max_rel=848.5574340820312, norm_rel=0.023741601034998894, ref_abs_avg=28.583637237548828, test_abs_avg=28.58831214904785
production_forward2 grad[50] vs paper_forward: mean_abs=0.5866999626159668, max_abs=2.3125, mean_rel=0.07052002847194672, max_rel=3.18062686920166, norm_rel=0.022803420200943947, ref_abs_avg=25.726213455200195, test_abs_avg=25.71309471130371
production_forward2 grad[51] vs paper_forward: mean_abs=0.75702965259552, max_abs=4.6875, mean_rel=0.16785380244255066, max_rel=1497.662109375, norm_rel=0.02539399079978466, ref_abs_avg=29.873699188232422, test_abs_avg=29.87470054626465
production_forward2 grad[52] vs paper_forward: mean_abs=0.7469996213912964, max_abs=4.8125, mean_rel=0.1766379475593567, max_rel=1072.998779296875, norm_rel=0.02522560954093933, ref_abs_avg=29.681381225585938, test_abs_avg=29.682809829711914
production_forward2 grad[53] vs paper_forward: mean_abs=0.5784988403320312, max_abs=3.140625, mean_rel=0.08304107189178467, max_rel=2.60103440284729, norm_rel=0.0244035292416811, ref_abs_avg=23.938743591308594, test_abs_avg=23.966590881347656
production_forward2 grad[54] vs paper_forward: mean_abs=0.7125468254089355, max_abs=4.5, mean_rel=0.16656208038330078, max_rel=1192.1932373046875, norm_rel=0.02498537488281727, ref_abs_avg=28.5637264251709, test_abs_avg=28.56481170654297
production_forward2 grad[55] vs paper_forward: mean_abs=0.6909307241439819, max_abs=4.5, mean_rel=0.17561154067516327, max_rel=1152.56787109375, norm_rel=0.02493247017264366, ref_abs_avg=27.77375030517578, test_abs_avg=27.780738830566406
production_forward2 grad[56] vs paper_forward: mean_abs=0.534371018409729, max_abs=2.125, mean_rel=0.08577011525630951, max_rel=3.928436040878296, norm_rel=0.02522902563214302, ref_abs_avg=21.584531784057617, test_abs_avg=21.58770179748535
production_forward2 grad[57] vs paper_forward: mean_abs=0.6678338646888733, max_abs=5.0, mean_rel=0.166610985994339, max_rel=1080.44775390625, norm_rel=0.02460300177335739, ref_abs_avg=27.18332290649414, test_abs_avg=27.184986114501953
production_forward2 grad[58] vs paper_forward: mean_abs=0.6482315063476562, max_abs=4.5, mean_rel=0.14233142137527466, max_rel=610.5009765625, norm_rel=0.024510888382792473, ref_abs_avg=26.494863510131836, test_abs_avg=26.503704071044922
production_forward2 grad[59] vs paper_forward: mean_abs=0.5212312340736389, max_abs=2.0, mean_rel=0.07710358500480652, max_rel=8.229171752929688, norm_rel=0.023538796231150627, ref_abs_avg=22.087648391723633, test_abs_avg=22.14818000793457
production_forward2 grad[60] vs paper_forward: mean_abs=0.622916579246521, max_abs=5.0, mean_rel=0.15206600725650787, max_rel=1229.241943359375, norm_rel=0.02408703789114952, ref_abs_avg=25.81564712524414, test_abs_avg=25.819782257080078
production_forward2 grad[61] vs paper_forward: mean_abs=0.6037896871566772, max_abs=4.5, mean_rel=0.16294029355049133, max_rel=763.4769897460938, norm_rel=0.023933351039886475, ref_abs_avg=25.236995697021484, test_abs_avg=25.2398624420166
production_forward2 grad[62] vs paper_forward: mean_abs=0.45031750202178955, max_abs=1.66015625, mean_rel=0.13000741600990295, max_rel=8.68886947631836, norm_rel=0.023283246904611588, ref_abs_avg=20.340808868408203, test_abs_avg=20.333419799804688
production_forward2 grad[63] vs paper_forward: mean_abs=0.5819253921508789, max_abs=4.5, mean_rel=0.16424256563186646, max_rel=1183.1796875, norm_rel=0.023786621168255806, ref_abs_avg=24.470314025878906, test_abs_avg=24.471290588378906
production_forward2 grad[64] vs paper_forward: mean_abs=0.5756364464759827, max_abs=4.7578125, mean_rel=0.1689024269580841, max_rel=2241.2763671875, norm_rel=0.023575086146593094, ref_abs_avg=24.417078018188477, test_abs_avg=24.409595489501953
production_forward2 grad[65] vs paper_forward: mean_abs=0.4312458038330078, max_abs=1.515625, mean_rel=0.09656112641096115, max_rel=14.445192337036133, norm_rel=0.021139245480298996, ref_abs_avg=20.781909942626953, test_abs_avg=20.775341033935547
production_forward2 grad[66] vs paper_forward: mean_abs=0.5571066737174988, max_abs=4.25, mean_rel=0.15070447325706482, max_rel=804.617431640625, norm_rel=0.02336873859167099, ref_abs_avg=23.844778060913086, test_abs_avg=23.844161987304688
production_forward2 grad[67] vs paper_forward: mean_abs=0.5420958995819092, max_abs=3.875, mean_rel=0.1513228714466095, max_rel=1118.23486328125, norm_rel=0.023047706112265587, ref_abs_avg=23.485158920288086, test_abs_avg=23.48995018005371
production_forward2 grad[68] vs paper_forward: mean_abs=0.42805004119873047, max_abs=2.0, mean_rel=0.08952312171459198, max_rel=4.58292293548584, norm_rel=0.02182781510055065, ref_abs_avg=19.73990249633789, test_abs_avg=19.699777603149414
production_forward2 grad[69] vs paper_forward: mean_abs=0.5288720726966858, max_abs=4.625, mean_rel=0.14654266834259033, max_rel=975.0669555664062, norm_rel=0.022985972464084625, ref_abs_avg=22.9874267578125, test_abs_avg=22.98888397216797
production_forward2 grad[70] vs paper_forward: mean_abs=0.5219509601593018, max_abs=4.1484375, mean_rel=0.14048297703266144, max_rel=992.6998901367188, norm_rel=0.02282748930156231, ref_abs_avg=22.851093292236328, test_abs_avg=22.850666046142578
production_forward2 grad[71] vs paper_forward: mean_abs=0.39443516731262207, max_abs=1.96875, mean_rel=0.07865504920482635, max_rel=6.55056619644165, norm_rel=0.02067558467388153, ref_abs_avg=19.494266510009766, test_abs_avg=19.452903747558594
production_forward2 grad[72] vs paper_forward: mean_abs=0.5065125226974487, max_abs=4.0, mean_rel=0.1522672325372696, max_rel=1293.8504638671875, norm_rel=0.022442469373345375, ref_abs_avg=22.498714447021484, test_abs_avg=22.501419067382812
production_forward2 grad[73] vs paper_forward: mean_abs=0.4901639223098755, max_abs=4.375, mean_rel=0.13257475197315216, max_rel=655.845947265625, norm_rel=0.022412234917283058, ref_abs_avg=21.882320404052734, test_abs_avg=21.883262634277344
production_forward2 grad[74] vs paper_forward: mean_abs=0.4753885269165039, max_abs=2.0, mean_rel=0.09731672704219818, max_rel=11.980027198791504, norm_rel=0.023774776607751846, ref_abs_avg=20.321762084960938, test_abs_avg=20.340055465698242
production_forward2 grad[75] vs paper_forward: mean_abs=0.5746580958366394, max_abs=4.5, mean_rel=0.1673535704612732, max_rel=1347.5081787109375, norm_rel=0.025095060467720032, ref_abs_avg=22.873035430908203, test_abs_avg=22.874805450439453
production_forward2 grad[76] vs paper_forward: mean_abs=0.5561278462409973, max_abs=4.5, mean_rel=0.16432569921016693, max_rel=795.0873413085938, norm_rel=0.024560296908020973, ref_abs_avg=22.691490173339844, test_abs_avg=22.690216064453125
production_forward2 grad[77] vs paper_forward: mean_abs=0.42702245712280273, max_abs=1.875, mean_rel=0.08382456004619598, max_rel=4.86635160446167, norm_rel=0.024138392880558968, ref_abs_avg=18.182876586914062, test_abs_avg=18.164474487304688
production_forward2 grad[78] vs paper_forward: mean_abs=0.5207130908966064, max_abs=4.5, mean_rel=0.15481412410736084, max_rel=847.0211791992188, norm_rel=0.024136444553732872, ref_abs_avg=21.552841186523438, test_abs_avg=21.55231475830078
production_forward2 grad[79] vs paper_forward: mean_abs=0.508136510848999, max_abs=4.375, mean_rel=0.155361607670784, max_rel=936.0379638671875, norm_rel=0.024404777213931084, ref_abs_avg=20.955177307128906, test_abs_avg=20.952102661132812
production_forward2 grad[80] vs paper_forward: mean_abs=0.38048791885375977, max_abs=1.625, mean_rel=0.07928061485290527, max_rel=2.7558364868164062, norm_rel=0.02184869721531868, ref_abs_avg=17.313169479370117, test_abs_avg=17.290138244628906
production_forward2 grad[81] vs paper_forward: mean_abs=0.47321754693984985, max_abs=4.28125, mean_rel=0.1436941772699356, max_rel=510.9827880859375, norm_rel=0.02323441207408905, ref_abs_avg=20.350406646728516, test_abs_avg=20.351572036743164
production_forward2 grad[82] vs paper_forward: mean_abs=0.46121394634246826, max_abs=4.5, mean_rel=0.14550167322158813, max_rel=1288.69921875, norm_rel=0.023344943299889565, ref_abs_avg=19.831222534179688, test_abs_avg=19.8354549407959
production_forward2 grad[83] vs paper_forward: mean_abs=0.35894107818603516, max_abs=1.5, mean_rel=0.14255543053150177, max_rel=12.612590789794922, norm_rel=0.022107671946287155, ref_abs_avg=15.780231475830078, test_abs_avg=15.77756118774414
production_forward2 grad[84] vs paper_forward: mean_abs=0.43478089570999146, max_abs=4.0, mean_rel=0.13881495594978333, max_rel=632.441162109375, norm_rel=0.022480830550193787, ref_abs_avg=19.400131225585938, test_abs_avg=19.40049934387207
production_forward2 grad[85] vs paper_forward: mean_abs=0.4199048578739166, max_abs=3.875, mean_rel=0.13944467902183533, max_rel=718.6956787109375, norm_rel=0.021985461935400963, ref_abs_avg=19.100723266601562, test_abs_avg=19.0961971282959
production_forward2 grad[86] vs paper_forward: mean_abs=0.3352482318878174, max_abs=1.578125, mean_rel=0.12861476838588715, max_rel=9.833516120910645, norm_rel=0.021897468715906143, ref_abs_avg=15.191651344299316, test_abs_avg=15.177911758422852
production_forward2 grad[87] vs paper_forward: mean_abs=0.40930506587028503, max_abs=4.1875, mean_rel=0.1322866827249527, max_rel=676.0146484375, norm_rel=0.021871425211429596, ref_abs_avg=18.771224975585938, test_abs_avg=18.77072525024414
production_forward2 grad[88] vs paper_forward: mean_abs=0.39886438846588135, max_abs=3.5, mean_rel=0.13922695815563202, max_rel=807.73681640625, norm_rel=0.021888822317123413, ref_abs_avg=18.35757827758789, test_abs_avg=18.350616455078125
production_forward2 grad[89] vs paper_forward: mean_abs=0.28719615936279297, max_abs=1.125, mean_rel=0.1086178570985794, max_rel=5.813238143920898, norm_rel=0.019967379048466682, ref_abs_avg=14.393152236938477, test_abs_avg=14.390357971191406
production_forward2 grad[90] vs paper_forward: mean_abs=0.3762568235397339, max_abs=3.75, mean_rel=0.13831117749214172, max_rel=980.9923706054688, norm_rel=0.02152317389845848, ref_abs_avg=17.597553253173828, test_abs_avg=17.596139907836914
production_forward2 grad[91] vs paper_forward: mean_abs=0.3828244209289551, max_abs=4.25, mean_rel=0.1382984220981598, max_rel=571.5781860351562, norm_rel=0.021678803488612175, ref_abs_avg=17.85057258605957, test_abs_avg=17.850948333740234
production_forward2 grad[92] vs paper_forward: mean_abs=0.3017532229423523, max_abs=1.1875, mean_rel=0.08660348504781723, max_rel=5.0789008140563965, norm_rel=0.019652847200632095, ref_abs_avg=15.655653953552246, test_abs_avg=15.637381553649902
production_forward2 grad[93] vs paper_forward: mean_abs=0.36323800683021545, max_abs=4.25, mean_rel=0.1257804036140442, max_rel=1011.878662109375, norm_rel=0.020717822015285492, ref_abs_avg=17.74825668334961, test_abs_avg=17.748046875
production_forward2 grad[94] vs paper_forward: mean_abs=0.3421534299850464, max_abs=3.75, mean_rel=0.12913599610328674, max_rel=776.3727416992188, norm_rel=0.01950220949947834, ref_abs_avg=17.565946578979492, test_abs_avg=17.56553840637207
production_forward2 grad[95] vs paper_forward: mean_abs=0.2870502471923828, max_abs=1.25, mean_rel=0.0508425310254097, max_rel=1.0385780334472656, norm_rel=0.021736659109592438, ref_abs_avg=13.872093200683594, test_abs_avg=13.870119094848633
production_forward2 grad[96] vs paper_forward: mean_abs=0.336063414812088, max_abs=4.0, mean_rel=0.12127788364887238, max_rel=1318.4185791015625, norm_rel=0.019987262785434723, ref_abs_avg=17.08713722229004, test_abs_avg=17.086727142333984
production_forward2 grad[97] vs paper_forward: mean_abs=0.3283958435058594, max_abs=3.4375, mean_rel=0.12167129665613174, max_rel=580.4605102539062, norm_rel=0.020144354552030563, ref_abs_avg=16.573074340820312, test_abs_avg=16.57986068725586
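The comparison columns above are not defined anywhere in the log itself. A plausible reconstruction (an assumption, not the script's actual code): `mean_rel`/`max_rel` as element-wise error normalized by `|ref|` plus a small epsilon, `norm_rel` as a Frobenius-norm ratio, and `*_abs_avg` as the mean absolute value of each tensor:

```python
import numpy as np

def compare(ref: np.ndarray, test: np.ndarray, eps: float = 1e-6) -> dict:
    """Per-tensor error statistics matching the column names in the log lines.

    The exact epsilon and normalization are guesses; the script that produced
    the log is not shown here.
    """
    diff = np.abs(ref - test)
    rel = diff / (np.abs(ref) + eps)  # element-wise relative error
    return {
        "mean_abs": float(diff.mean()),
        "max_abs": float(diff.max()),
        "mean_rel": float(rel.mean()),
        "max_rel": float(rel.max()),
        # whole-tensor relative error; dominated by large entries, which is why
        # it stays ~0.02 even when max_rel blows up on near-zero elements
        "norm_rel": float(np.linalg.norm(diff) / (np.linalg.norm(ref) + eps)),
        "ref_abs_avg": float(np.abs(ref).mean()),
        "test_abs_avg": float(np.abs(test).mean()),
    }
```

Under this reading, the huge `max_rel` values (e.g. 1000+) on some layers come from dividing a small absolute error by a near-zero reference element, while the consistently small `norm_rel` (~0.02) indicates the implementations agree to roughly bf16 precision overall.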
identity layers + randn queries
production_forward fwd+bwd:  126.691 ms
production_forward bwd-only: 106.265 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=7.868 GiB
production_forward peak reserved:  fwd=3.619 GiB, fwd+bwd=8.869 GiB
production_forward2 fwd+bwd:  224.348 ms
production_forward2 bwd-only: 202.102 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.244 GiB, fwd+bwd=8.994 GiB
paper_forward fwd+bwd:  379.650 ms
paper_forward bwd-only: 293.968 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.039 GiB, fwd+bwd=32.789 GiB
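The harness that produced the timings above is not shown. A minimal sketch of the usual pattern, assuming the standard approach of warmup iterations followed by averaged wall-clock timing (on GPU one would additionally call `torch.cuda.synchronize()` before each timestamp, and read the peak-memory columns from `torch.cuda.max_memory_allocated()` / `torch.cuda.max_memory_reserved()` after a `torch.cuda.reset_peak_memory_stats()`):

```python
import time

def bench_ms(fn, iters: int = 10, warmup: int = 3) -> float:
    """Average wall-clock time of fn() in milliseconds.

    Hypothetical harness: warmup runs amortize one-time costs (autotuning,
    allocator growth), then the timed loop averages over `iters` calls.
    With async GPU work, a synchronize before each perf_counter() read is
    required or the measurement only captures kernel-launch overhead.
    """
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1000.0
```

The "bwd-only" rows are presumably derived by subtracting a forward-only timing from the fwd+bwd timing, rather than timed directly.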

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016231349436566234, max_abs=0.0439453125
production_forward grad[0] vs paper_forward: mean_abs=0.008202368393540382, max_abs=0.5, mean_rel=0.07142987847328186, max_rel=111.99144744873047, norm_rel=0.019490133970975876, ref_abs_avg=0.4561946392059326, test_abs_avg=0.4562225341796875
production_forward grad[1] vs paper_forward: mean_abs=7.069005489349365, max_abs=52.0, mean_rel=0.14410725235939026, max_rel=128.28334045410156, norm_rel=0.020211316645145416, ref_abs_avg=312.6710510253906, test_abs_avg=312.65777587890625
production_forward grad[2] vs paper_forward: mean_abs=1.2221717834472656, max_abs=4.25, mean_rel=0.1381787657737732, max_rel=11.604737281799316, norm_rel=0.022456800565123558, ref_abs_avg=53.74199295043945, test_abs_avg=53.78874969482422
production_forward grad[3] vs paper_forward: mean_abs=1.5128307342529297, max_abs=10.0, mean_rel=0.1566934585571289, max_rel=1703.6798095703125, norm_rel=0.02324220910668373, ref_abs_avg=65.44303131103516, test_abs_avg=65.44607543945312
production_forward grad[4] vs paper_forward: mean_abs=1.4745575189590454, max_abs=9.0, mean_rel=0.18625789880752563, max_rel=2586.64599609375, norm_rel=0.023012883961200714, ref_abs_avg=64.44645690917969, test_abs_avg=64.45024108886719
production_forward grad[5] vs paper_forward: mean_abs=1.1354866027832031, max_abs=4.5, mean_rel=0.08515092730522156, max_rel=5.503993034362793, norm_rel=0.023588811978697777, ref_abs_avg=47.72909164428711, test_abs_avg=47.72785186767578
production_forward grad[6] vs paper_forward: mean_abs=1.3194830417633057, max_abs=8.40625, mean_rel=0.14933812618255615, max_rel=1549.720947265625, norm_rel=0.023087991401553154, ref_abs_avg=57.450862884521484, test_abs_avg=57.453041076660156
production_forward grad[7] vs paper_forward: mean_abs=1.2994134426116943, max_abs=8.0, mean_rel=0.14708179235458374, max_rel=1620.953857421875, norm_rel=0.022823117673397064, ref_abs_avg=57.15382385253906, test_abs_avg=57.15272903442383
production_forward grad[8] vs paper_forward: mean_abs=0.9982175827026367, max_abs=3.953125, mean_rel=0.14469951391220093, max_rel=8.01907730102539, norm_rel=0.024650683626532555, ref_abs_avg=39.90187072753906, test_abs_avg=40.03410720825195
production_forward grad[9] vs paper_forward: mean_abs=1.2132009267807007, max_abs=8.0, mean_rel=0.16250118613243103, max_rel=2541.376220703125, norm_rel=0.022894207388162613, ref_abs_avg=53.25398254394531, test_abs_avg=53.25823974609375
production_forward grad[10] vs paper_forward: mean_abs=1.1727190017700195, max_abs=8.25, mean_rel=0.15545418858528137, max_rel=895.8404541015625, norm_rel=0.02253156714141369, ref_abs_avg=52.29208755493164, test_abs_avg=52.29747772216797
production_forward grad[11] vs paper_forward: mean_abs=0.8870697021484375, max_abs=4.0, mean_rel=0.07012651860713959, max_rel=7.0039801597595215, norm_rel=0.02126496657729149, ref_abs_avg=43.207794189453125, test_abs_avg=43.180572509765625
production_forward grad[12] vs paper_forward: mean_abs=1.122385025024414, max_abs=7.5, mean_rel=0.16128051280975342, max_rel=2831.19482421875, norm_rel=0.022729117423295975, ref_abs_avg=49.60670852661133, test_abs_avg=49.60799789428711
production_forward grad[13] vs paper_forward: mean_abs=1.0922496318817139, max_abs=6.5, mean_rel=0.1544962227344513, max_rel=1380.20849609375, norm_rel=0.022466151043772697, ref_abs_avg=48.90270233154297, test_abs_avg=48.90146255493164
production_forward grad[14] vs paper_forward: mean_abs=0.8768351078033447, max_abs=3.171875, mean_rel=0.21881774067878723, max_rel=25.475589752197266, norm_rel=0.023323381319642067, ref_abs_avg=37.41636657714844, test_abs_avg=37.40574645996094
production_forward grad[15] vs paper_forward: mean_abs=1.0417009592056274, max_abs=6.5, mean_rel=0.14947357773780823, max_rel=1519.68310546875, norm_rel=0.022515742108225822, ref_abs_avg=46.50901412963867, test_abs_avg=46.5095329284668
production_forward grad[16] vs paper_forward: mean_abs=1.0123329162597656, max_abs=6.375, mean_rel=0.1412186473608017, max_rel=985.01513671875, norm_rel=0.02223474346101284, ref_abs_avg=45.758056640625, test_abs_avg=45.767738342285156
production_forward grad[17] vs paper_forward: mean_abs=0.7598333358764648, max_abs=3.125, mean_rel=0.07395973056554794, max_rel=3.064089059829712, norm_rel=0.02086273394525051, ref_abs_avg=37.296630859375, test_abs_avg=37.309425354003906
production_forward grad[18] vs paper_forward: mean_abs=0.9809708595275879, max_abs=6.5, mean_rel=0.1545180082321167, max_rel=2060.5595703125, norm_rel=0.022349461913108826, ref_abs_avg=44.10239791870117, test_abs_avg=44.105934143066406
production_forward grad[19] vs paper_forward: mean_abs=0.9573341608047485, max_abs=7.0, mean_rel=0.13825619220733643, max_rel=1402.3238525390625, norm_rel=0.02200707048177719, ref_abs_avg=43.78697967529297, test_abs_avg=43.793704986572266
production_forward grad[20] vs paper_forward: mean_abs=0.7938003540039062, max_abs=3.25, mean_rel=0.1020379364490509, max_rel=10.399460792541504, norm_rel=0.023454830050468445, ref_abs_avg=34.33401107788086, test_abs_avg=34.35636901855469
production_forward grad[21] vs paper_forward: mean_abs=0.9278916120529175, max_abs=6.0, mean_rel=0.15921816229820251, max_rel=1688.974853515625, norm_rel=0.02226286381483078, ref_abs_avg=41.87453079223633, test_abs_avg=41.87535858154297
production_forward grad[22] vs paper_forward: mean_abs=0.9068251252174377, max_abs=6.25, mean_rel=0.1500694453716278, max_rel=1208.4254150390625, norm_rel=0.022003185003995895, ref_abs_avg=41.480838775634766, test_abs_avg=41.480709075927734
production_forward grad[23] vs paper_forward: mean_abs=0.7082438468933105, max_abs=2.375, mean_rel=0.14640173316001892, max_rel=18.237548828125, norm_rel=0.021659962832927704, ref_abs_avg=32.72911834716797, test_abs_avg=32.744911193847656
production_forward grad[24] vs paper_forward: mean_abs=0.8810765743255615, max_abs=5.5, mean_rel=0.15278692543506622, max_rel=986.0179443359375, norm_rel=0.02220662496984005, ref_abs_avg=39.87667465209961, test_abs_avg=39.87839889526367
production_forward grad[25] vs paper_forward: mean_abs=0.8539217114448547, max_abs=5.5, mean_rel=0.16003775596618652, max_rel=906.6905517578125, norm_rel=0.021577991545200348, ref_abs_avg=39.75249481201172, test_abs_avg=39.75946044921875
production_forward grad[26] vs paper_forward: mean_abs=0.8314080238342285, max_abs=3.25, mean_rel=0.09757791459560394, max_rel=8.623128890991211, norm_rel=0.023390917107462883, ref_abs_avg=35.54334259033203, test_abs_avg=35.57000732421875
production_forward grad[27] vs paper_forward: mean_abs=1.0361955165863037, max_abs=7.25, mean_rel=0.16291671991348267, max_rel=936.5283813476562, norm_rel=0.023976963013410568, ref_abs_avg=43.434478759765625, test_abs_avg=43.437435150146484
production_forward grad[28] vs paper_forward: mean_abs=1.0040419101715088, max_abs=7.0, mean_rel=0.151160329580307, max_rel=871.8035888671875, norm_rel=0.02380814217031002, ref_abs_avg=42.370750427246094, test_abs_avg=42.368865966796875
production_forward grad[29] vs paper_forward: mean_abs=0.7614307403564453, max_abs=2.6875, mean_rel=0.09672901034355164, max_rel=9.71117877960205, norm_rel=0.023255614563822746, ref_abs_avg=32.726810455322266, test_abs_avg=32.754058837890625
production_forward grad[30] vs paper_forward: mean_abs=0.955717921257019, max_abs=6.0, mean_rel=0.1689947247505188, max_rel=2359.935546875, norm_rel=0.02424127608537674, ref_abs_avg=39.549964904785156, test_abs_avg=39.554412841796875
production_forward grad[31] vs paper_forward: mean_abs=0.9349290728569031, max_abs=6.02734375, mean_rel=0.14826156198978424, max_rel=760.3617553710938, norm_rel=0.024187911301851273, ref_abs_avg=38.86224365234375, test_abs_avg=38.869285583496094
production_forward grad[32] vs paper_forward: mean_abs=0.7091951370239258, max_abs=2.9375, mean_rel=0.08007023483514786, max_rel=3.9515278339385986, norm_rel=0.02284630388021469, ref_abs_avg=32.33365249633789, test_abs_avg=32.30158996582031
production_forward grad[33] vs paper_forward: mean_abs=0.8895496129989624, max_abs=6.0, mean_rel=0.17631295323371887, max_rel=1604.5450439453125, norm_rel=0.024058030918240547, ref_abs_avg=37.088043212890625, test_abs_avg=37.09318542480469
production_forward grad[34] vs paper_forward: mean_abs=0.8719642162322998, max_abs=6.0, mean_rel=0.17125621438026428, max_rel=2276.813232421875, norm_rel=0.024147402495145798, ref_abs_avg=36.292694091796875, test_abs_avg=36.29402160644531
production_forward grad[35] vs paper_forward: mean_abs=0.649574875831604, max_abs=2.25, mean_rel=0.23505762219429016, max_rel=81.4344253540039, norm_rel=0.02292148768901825, ref_abs_avg=29.466806411743164, test_abs_avg=29.469160079956055
production_forward grad[36] vs paper_forward: mean_abs=0.8291550874710083, max_abs=6.0, mean_rel=0.15984974801540375, max_rel=1712.5074462890625, norm_rel=0.023984676226973534, ref_abs_avg=34.664268493652344, test_abs_avg=34.666996002197266
production_forward grad[37] vs paper_forward: mean_abs=0.8123198747634888, max_abs=6.0, mean_rel=0.1503593474626541, max_rel=504.7900390625, norm_rel=0.02395867183804512, ref_abs_avg=34.049835205078125, test_abs_avg=34.05084228515625
production_forward grad[38] vs paper_forward: mean_abs=0.5927233695983887, max_abs=2.5, mean_rel=0.13904541730880737, max_rel=19.21510124206543, norm_rel=0.022416602820158005, ref_abs_avg=26.789583206176758, test_abs_avg=26.750057220458984
production_forward grad[39] vs paper_forward: mean_abs=0.77805495262146, max_abs=5.5, mean_rel=0.16106818616390228, max_rel=1541.2144775390625, norm_rel=0.023799944669008255, ref_abs_avg=32.78636932373047, test_abs_avg=32.7892951965332
production_forward grad[40] vs paper_forward: mean_abs=0.7636687159538269, max_abs=5.125, mean_rel=0.16061538457870483, max_rel=1321.053955078125, norm_rel=0.02340627834200859, ref_abs_avg=32.721153259277344, test_abs_avg=32.72969055175781
production_forward grad[41] vs paper_forward: mean_abs=0.618074893951416, max_abs=2.75, mean_rel=0.08752336353063583, max_rel=7.873482704162598, norm_rel=0.023730754852294922, ref_abs_avg=26.15036964416504, test_abs_avg=26.173419952392578
production_forward grad[42] vs paper_forward: mean_abs=0.7380757331848145, max_abs=5.0, mean_rel=0.15516796708106995, max_rel=1099.1868896484375, norm_rel=0.023588698357343674, ref_abs_avg=31.380786895751953, test_abs_avg=31.386337280273438
production_forward grad[43] vs paper_forward: mean_abs=0.7272266745567322, max_abs=5.0, mean_rel=0.14523278176784515, max_rel=555.2518310546875, norm_rel=0.023233475163578987, ref_abs_avg=31.412017822265625, test_abs_avg=31.413619995117188
production_forward grad[44] vs paper_forward: mean_abs=0.5782150030136108, max_abs=2.5, mean_rel=0.11872988939285278, max_rel=9.424025535583496, norm_rel=0.023788204416632652, ref_abs_avg=24.543964385986328, test_abs_avg=24.509109497070312
production_forward grad[45] vs paper_forward: mean_abs=0.7089139223098755, max_abs=4.5, mean_rel=0.15143239498138428, max_rel=936.8543701171875, norm_rel=0.023255670443177223, ref_abs_avg=30.55695915222168, test_abs_avg=30.55813980102539
production_forward grad[46] vs paper_forward: mean_abs=0.70073401927948, max_abs=5.0, mean_rel=0.15447549521923065, max_rel=888.7251586914062, norm_rel=0.023142045363783836, ref_abs_avg=30.332225799560547, test_abs_avg=30.336875915527344
production_forward grad[47] vs paper_forward: mean_abs=0.5411128997802734, max_abs=2.3125, mean_rel=0.12425751984119415, max_rel=10.556507110595703, norm_rel=0.02231776900589466, ref_abs_avg=24.045625686645508, test_abs_avg=24.055944442749023
production_forward grad[48] vs paper_forward: mean_abs=0.6804653406143188, max_abs=4.75, mean_rel=0.16220133006572723, max_rel=1239.678466796875, norm_rel=0.023001166060566902, ref_abs_avg=29.607189178466797, test_abs_avg=29.610225677490234
production_forward grad[49] vs paper_forward: mean_abs=0.6666305065155029, max_abs=4.5, mean_rel=0.1432184875011444, max_rel=408.4417419433594, norm_rel=0.02306126244366169, ref_abs_avg=28.970823287963867, test_abs_avg=28.9693603515625
production_forward grad[50] vs paper_forward: mean_abs=0.5992186069488525, max_abs=2.6875, mean_rel=0.10698242485523224, max_rel=8.89913272857666, norm_rel=0.024060536175966263, ref_abs_avg=24.87539291381836, test_abs_avg=24.9271297454834
production_forward grad[51] vs paper_forward: mean_abs=0.7530868053436279, max_abs=5.125, mean_rel=0.17188668251037598, max_rel=1155.4444580078125, norm_rel=0.024634428322315216, ref_abs_avg=30.66454315185547, test_abs_avg=30.666284561157227
production_forward grad[52] vs paper_forward: mean_abs=0.732327938079834, max_abs=4.5, mean_rel=0.1532529592514038, max_rel=830.2971801757812, norm_rel=0.02416861616075039, ref_abs_avg=30.402219772338867, test_abs_avg=30.40797233581543
production_forward grad[53] vs paper_forward: mean_abs=0.5580079555511475, max_abs=2.125, mean_rel=0.11877886205911636, max_rel=10.299698829650879, norm_rel=0.023401008918881416, ref_abs_avg=23.949472427368164, test_abs_avg=23.932933807373047
production_forward grad[54] vs paper_forward: mean_abs=0.6840729713439941, max_abs=5.0, mean_rel=0.15644419193267822, max_rel=1157.2041015625, norm_rel=0.024065740406513214, ref_abs_avg=28.45552635192871, test_abs_avg=28.456703186035156
production_forward grad[55] vs paper_forward: mean_abs=0.6660884618759155, max_abs=4.25, mean_rel=0.14775750041007996, max_rel=523.3829345703125, norm_rel=0.02396599017083645, ref_abs_avg=27.843069076538086, test_abs_avg=27.843015670776367
production_forward grad[56] vs paper_forward: mean_abs=0.5148448944091797, max_abs=1.98046875, mean_rel=0.11389017105102539, max_rel=6.271699905395508, norm_rel=0.022858992218971252, ref_abs_avg=22.217540740966797, test_abs_avg=22.226110458374023
production_forward grad[57] vs paper_forward: mean_abs=0.6335282325744629, max_abs=4.75, mean_rel=0.1602470874786377, max_rel=1080.0960693359375, norm_rel=0.023549562320113182, ref_abs_avg=26.901691436767578, test_abs_avg=26.902990341186523
production_forward grad[58] vs paper_forward: mean_abs=0.6238285303115845, max_abs=4.765625, mean_rel=0.16409045457839966, max_rel=2078.762939453125, norm_rel=0.023449605330824852, ref_abs_avg=26.639766693115234, test_abs_avg=26.638504028320312
production_forward grad[59] vs paper_forward: mean_abs=0.4929533004760742, max_abs=2.0, mean_rel=0.11285508424043655, max_rel=3.8023335933685303, norm_rel=0.023889537900686264, ref_abs_avg=20.40161895751953, test_abs_avg=20.466567993164062
production_forward grad[60] vs paper_forward: mean_abs=0.5859620571136475, max_abs=5.0, mean_rel=0.14667928218841553, max_rel=1051.5216064453125, norm_rel=0.02307651937007904, ref_abs_avg=25.37647247314453, test_abs_avg=25.379859924316406
production_forward grad[61] vs paper_forward: mean_abs=0.5791559219360352, max_abs=4.0, mean_rel=0.15350578725337982, max_rel=1012.251708984375, norm_rel=0.02293907105922699, ref_abs_avg=25.275344848632812, test_abs_avg=25.274198532104492
production_forward grad[62] vs paper_forward: mean_abs=0.4554004669189453, max_abs=2.0625, mean_rel=0.09508654475212097, max_rel=12.671890258789062, norm_rel=0.022343840450048447, ref_abs_avg=20.54904556274414, test_abs_avg=20.580188751220703
production_forward grad[63] vs paper_forward: mean_abs=0.56324303150177, max_abs=4.5, mean_rel=0.14645013213157654, max_rel=1304.814697265625, norm_rel=0.02238435670733452, ref_abs_avg=25.126930236816406, test_abs_avg=25.128263473510742
production_forward grad[64] vs paper_forward: mean_abs=0.5503032207489014, max_abs=4.5, mean_rel=0.14735785126686096, max_rel=669.7771606445312, norm_rel=0.022745374590158463, ref_abs_avg=24.223270416259766, test_abs_avg=24.2247257232666
production_forward grad[65] vs paper_forward: mean_abs=0.42724698781967163, max_abs=1.6875, mean_rel=0.13543328642845154, max_rel=33.342838287353516, norm_rel=0.02154812403023243, ref_abs_avg=20.282310485839844, test_abs_avg=20.268310546875
production_forward grad[66] vs paper_forward: mean_abs=0.5342848300933838, max_abs=4.0, mean_rel=0.14593040943145752, max_rel=986.6531372070312, norm_rel=0.022294064983725548, ref_abs_avg=23.920902252197266, test_abs_avg=23.921682357788086
production_forward grad[67] vs paper_forward: mean_abs=0.5209929943084717, max_abs=4.0, mean_rel=0.15488535165786743, max_rel=866.7924194335938, norm_rel=0.02173810452222824, ref_abs_avg=23.963069915771484, test_abs_avg=23.967304229736328
production_forward grad[68] vs paper_forward: mean_abs=0.40216994285583496, max_abs=1.75, mean_rel=0.14596673846244812, max_rel=35.5409049987793, norm_rel=0.021028276532888412, ref_abs_avg=19.562633514404297, test_abs_avg=19.508647918701172
production_forward grad[69] vs paper_forward: mean_abs=0.5116706490516663, max_abs=4.0, mean_rel=0.1436798870563507, max_rel=986.568603515625, norm_rel=0.022009003907442093, ref_abs_avg=23.223215103149414, test_abs_avg=23.224483489990234
production_forward grad[70] vs paper_forward: mean_abs=0.506386399269104, max_abs=5.0, mean_rel=0.14471335709095, max_rel=604.2644653320312, norm_rel=0.02211884595453739, ref_abs_avg=22.930404663085938, test_abs_avg=22.927631378173828
production_forward grad[71] vs paper_forward: mean_abs=0.38883447647094727, max_abs=1.625, mean_rel=0.4899235963821411, max_rel=176.59329223632812, norm_rel=0.02120918780565262, ref_abs_avg=18.359397888183594, test_abs_avg=18.392921447753906
production_forward grad[72] vs paper_forward: mean_abs=0.4906879663467407, max_abs=3.5, mean_rel=0.14790871739387512, max_rel=1234.7449951171875, norm_rel=0.021948881447315216, ref_abs_avg=22.36186981201172, test_abs_avg=22.364704132080078
production_forward grad[73] vs paper_forward: mean_abs=0.4793280065059662, max_abs=4.25, mean_rel=0.13523246347904205, max_rel=591.8935546875, norm_rel=0.021629247814416885, ref_abs_avg=22.156492233276367, test_abs_avg=22.15962791442871
production_forward grad[74] vs paper_forward: mean_abs=0.4421689510345459, max_abs=1.75, mean_rel=0.25535428524017334, max_rel=61.434967041015625, norm_rel=0.024701431393623352, ref_abs_avg=17.837505340576172, test_abs_avg=17.851015090942383
production_forward grad[75] vs paper_forward: mean_abs=0.5343669652938843, max_abs=4.25, mean_rel=0.15699532628059387, max_rel=883.1285400390625, norm_rel=0.023751843720674515, ref_abs_avg=22.502689361572266, test_abs_avg=22.505130767822266
production_forward grad[76] vs paper_forward: mean_abs=0.5267729759216309, max_abs=4.0, mean_rel=0.1459144800901413, max_rel=899.6312866210938, norm_rel=0.023457566276192665, ref_abs_avg=22.47954750061035, test_abs_avg=22.483217239379883
production_forward grad[77] vs paper_forward: mean_abs=0.39438584446907043, max_abs=1.6171875, mean_rel=0.5154292583465576, max_rel=211.0104217529297, norm_rel=0.024528946727514267, ref_abs_avg=16.644752502441406, test_abs_avg=16.64809799194336
production_forward grad[78] vs paper_forward: mean_abs=0.4979880750179291, max_abs=4.0, mean_rel=0.14628206193447113, max_rel=821.6402587890625, norm_rel=0.023221122100949287, ref_abs_avg=21.459819793701172, test_abs_avg=21.459205627441406
production_forward grad[79] vs paper_forward: mean_abs=0.48520588874816895, max_abs=4.0, mean_rel=0.1472884714603424, max_rel=930.02099609375, norm_rel=0.022714441642165184, ref_abs_avg=21.34961700439453, test_abs_avg=21.35420799255371
production_forward grad[80] vs paper_forward: mean_abs=0.3703460693359375, max_abs=1.625, mean_rel=0.0542372390627861, max_rel=1.8222250938415527, norm_rel=0.022124873474240303, ref_abs_avg=17.42441177368164, test_abs_avg=17.432170867919922
production_forward grad[81] vs paper_forward: mean_abs=0.45981624722480774, max_abs=3.625, mean_rel=0.1496787667274475, max_rel=797.3955688476562, norm_rel=0.02245207689702511, ref_abs_avg=20.51032829284668, test_abs_avg=20.508285522460938
production_forward grad[82] vs paper_forward: mean_abs=0.4524974524974823, max_abs=3.5, mean_rel=0.14015349745750427, max_rel=943.37646484375, norm_rel=0.022259723395109177, ref_abs_avg=20.334754943847656, test_abs_avg=20.343170166015625
production_forward grad[83] vs paper_forward: mean_abs=0.34897172451019287, max_abs=1.5, mean_rel=0.11223762482404709, max_rel=9.736260414123535, norm_rel=0.022333454340696335, ref_abs_avg=15.683279037475586, test_abs_avg=15.713173866271973
production_forward grad[84] vs paper_forward: mean_abs=0.4296683669090271, max_abs=4.5, mean_rel=0.14374913275241852, max_rel=1353.980224609375, norm_rel=0.02190801315009594, ref_abs_avg=19.638259887695312, test_abs_avg=19.637075424194336
production_forward grad[85] vs paper_forward: mean_abs=0.4154134690761566, max_abs=4.0, mean_rel=0.1404159665107727, max_rel=487.2715148925781, norm_rel=0.021477779373526573, ref_abs_avg=19.420913696289062, test_abs_avg=19.420185089111328
production_forward grad[86] vs paper_forward: mean_abs=0.3055119514465332, max_abs=1.25, mean_rel=0.08148100972175598, max_rel=2.4786856174468994, norm_rel=0.020002933219075203, ref_abs_avg=15.38685131072998, test_abs_avg=15.364839553833008
production_forward grad[87] vs paper_forward: mean_abs=0.396406888961792, max_abs=3.75, mean_rel=0.1336958408355713, max_rel=1217.267822265625, norm_rel=0.021470695734024048, ref_abs_avg=18.540294647216797, test_abs_avg=18.540136337280273
production_forward grad[88] vs paper_forward: mean_abs=0.3952564001083374, max_abs=4.0625, mean_rel=0.131818950176239, max_rel=701.9594116210938, norm_rel=0.021317901089787483, ref_abs_avg=18.642480850219727, test_abs_avg=18.642759323120117
production_forward grad[89] vs paper_forward: mean_abs=0.3218662738800049, max_abs=1.15625, mean_rel=0.17701375484466553, max_rel=30.795032501220703, norm_rel=0.02114611491560936, ref_abs_avg=15.084699630737305, test_abs_avg=15.044017791748047
production_forward grad[90] vs paper_forward: mean_abs=0.3820255696773529, max_abs=4.0, mean_rel=0.13506793975830078, max_rel=541.5349731445312, norm_rel=0.020854363217949867, ref_abs_avg=18.416748046875, test_abs_avg=18.416677474975586
production_forward grad[91] vs paper_forward: mean_abs=0.3778453469276428, max_abs=4.0, mean_rel=0.12206259369850159, max_rel=459.2255554199219, norm_rel=0.020994264632463455, ref_abs_avg=18.196592330932617, test_abs_avg=18.196598052978516
production_forward grad[92] vs paper_forward: mean_abs=0.28172022104263306, max_abs=1.15625, mean_rel=0.08067204058170319, max_rel=5.426504135131836, norm_rel=0.019838422536849976, ref_abs_avg=14.22260570526123, test_abs_avg=14.243579864501953
production_forward grad[93] vs paper_forward: mean_abs=0.35399314761161804, max_abs=4.0, mean_rel=0.12691831588745117, max_rel=882.4988403320312, norm_rel=0.02049463428556919, ref_abs_avg=17.446277618408203, test_abs_avg=17.445432662963867
production_forward grad[94] vs paper_forward: mean_abs=0.35317546129226685, max_abs=4.0, mean_rel=0.12010563910007477, max_rel=582.4267578125, norm_rel=0.020494811236858368, ref_abs_avg=17.551193237304688, test_abs_avg=17.54855728149414
production_forward grad[95] vs paper_forward: mean_abs=0.2912784814834595, max_abs=1.125, mean_rel=0.10004758834838867, max_rel=10.652393341064453, norm_rel=0.019303573295474052, ref_abs_avg=15.131488800048828, test_abs_avg=15.11849308013916
production_forward grad[96] vs paper_forward: mean_abs=0.34107306599617004, max_abs=3.5, mean_rel=0.12104420363903046, max_rel=1035.1175537109375, norm_rel=0.020096097141504288, ref_abs_avg=17.245519638061523, test_abs_avg=17.24357032775879
production_forward grad[97] vs paper_forward: mean_abs=0.33767178654670715, max_abs=4.0, mean_rel=0.11956118047237396, max_rel=650.9840087890625, norm_rel=0.019871458411216736, ref_abs_avg=17.227571487426758, test_abs_avg=17.223678588867188
production_forward2 vs paper_forward output: mean_abs=0.0016231349436566234, max_abs=0.0439453125
production_forward2 grad[0] vs paper_forward: mean_abs=0.0085368063300848, max_abs=0.5, mean_rel=0.07394835352897644, max_rel=111.7230224609375, norm_rel=0.020158080384135246, ref_abs_avg=0.4561946392059326, test_abs_avg=0.456207275390625
production_forward2 grad[1] vs paper_forward: mean_abs=7.233854293823242, max_abs=52.0, mean_rel=0.14785951375961304, max_rel=168.7191925048828, norm_rel=0.020624006167054176, ref_abs_avg=312.6710510253906, test_abs_avg=312.6339416503906
production_forward2 grad[2] vs paper_forward: mean_abs=1.3038063049316406, max_abs=4.5, mean_rel=0.1510871946811676, max_rel=14.529216766357422, norm_rel=0.02405633218586445, ref_abs_avg=53.74199295043945, test_abs_avg=53.81633758544922
production_forward2 grad[3] vs paper_forward: mean_abs=1.559901237487793, max_abs=11.0, mean_rel=0.16042107343673706, max_rel=1380.61572265625, norm_rel=0.023955119773745537, ref_abs_avg=65.44303131103516, test_abs_avg=65.44554138183594
production_forward2 grad[4] vs paper_forward: mean_abs=1.5265119075775146, max_abs=9.5, mean_rel=0.18827089667320251, max_rel=2345.701416015625, norm_rel=0.023807810619473457, ref_abs_avg=64.44645690917969, test_abs_avg=64.44258117675781
production_forward2 grad[5] vs paper_forward: mean_abs=1.1432514190673828, max_abs=4.5, mean_rel=0.08642919361591339, max_rel=6.01662015914917, norm_rel=0.023900166153907776, ref_abs_avg=47.72909164428711, test_abs_avg=47.6868896484375
production_forward2 grad[6] vs paper_forward: mean_abs=1.361575961112976, max_abs=10.0, mean_rel=0.1563664972782135, max_rel=1569.2191162109375, norm_rel=0.02381584607064724, ref_abs_avg=57.450862884521484, test_abs_avg=57.45085906982422
production_forward2 grad[7] vs paper_forward: mean_abs=1.3393586874008179, max_abs=9.0, mean_rel=0.15289802849292755, max_rel=852.2718505859375, norm_rel=0.023541994392871857, ref_abs_avg=57.15382385253906, test_abs_avg=57.15412902832031
production_forward2 grad[8] vs paper_forward: mean_abs=1.029348611831665, max_abs=4.546875, mean_rel=0.14422118663787842, max_rel=5.955781936645508, norm_rel=0.025663863867521286, ref_abs_avg=39.90187072753906, test_abs_avg=40.00415802001953
production_forward2 grad[9] vs paper_forward: mean_abs=1.2487024068832397, max_abs=8.0, mean_rel=0.1663348525762558, max_rel=3670.119140625, norm_rel=0.02356577105820179, ref_abs_avg=53.25398254394531, test_abs_avg=53.25670623779297
production_forward2 grad[10] vs paper_forward: mean_abs=1.2071518898010254, max_abs=8.0, mean_rel=0.16321924328804016, max_rel=1158.0142822265625, norm_rel=0.023207131773233414, ref_abs_avg=52.29208755493164, test_abs_avg=52.296302795410156
production_forward2 grad[11] vs paper_forward: mean_abs=0.9308090209960938, max_abs=4.0, mean_rel=0.06752660125494003, max_rel=4.975505352020264, norm_rel=0.0223127119243145, ref_abs_avg=43.207794189453125, test_abs_avg=43.151222229003906
production_forward2 grad[12] vs paper_forward: mean_abs=1.1536431312561035, max_abs=8.5, mean_rel=0.16280019283294678, max_rel=3258.981689453125, norm_rel=0.02336931601166725, ref_abs_avg=49.60670852661133, test_abs_avg=49.604026794433594
production_forward2 grad[13] vs paper_forward: mean_abs=1.121949553489685, max_abs=6.375, mean_rel=0.15504877269268036, max_rel=1184.6807861328125, norm_rel=0.023063885048031807, ref_abs_avg=48.90270233154297, test_abs_avg=48.90228271484375
production_forward2 grad[14] vs paper_forward: mean_abs=0.8945908546447754, max_abs=3.609375, mean_rel=0.15295737981796265, max_rel=18.88880157470703, norm_rel=0.023686163127422333, ref_abs_avg=37.41636657714844, test_abs_avg=37.357025146484375
production_forward2 grad[15] vs paper_forward: mean_abs=1.0692627429962158, max_abs=7.25, mean_rel=0.15213733911514282, max_rel=1397.8970947265625, norm_rel=0.023097598925232887, ref_abs_avg=46.50901412963867, test_abs_avg=46.50838088989258
production_forward2 grad[16] vs paper_forward: mean_abs=1.0396252870559692, max_abs=6.25, mean_rel=0.1484898030757904, max_rel=1252.406005859375, norm_rel=0.022838743403553963, ref_abs_avg=45.758056640625, test_abs_avg=45.76667022705078
production_forward2 grad[17] vs paper_forward: mean_abs=0.7805633544921875, max_abs=3.0, mean_rel=0.06897766888141632, max_rel=2.7673802375793457, norm_rel=0.021278556436300278, ref_abs_avg=37.296630859375, test_abs_avg=37.2939453125
production_forward2 grad[18] vs paper_forward: mean_abs=1.0042619705200195, max_abs=6.25, mean_rel=0.15748928487300873, max_rel=1665.86962890625, norm_rel=0.022874346002936363, ref_abs_avg=44.10239791870117, test_abs_avg=44.10443878173828
production_forward2 grad[19] vs paper_forward: mean_abs=0.9828306436538696, max_abs=6.5, mean_rel=0.1439083069562912, max_rel=1402.3238525390625, norm_rel=0.022582292556762695, ref_abs_avg=43.78697967529297, test_abs_avg=43.79229736328125
production_forward2 grad[20] vs paper_forward: mean_abs=0.7980213165283203, max_abs=3.125, mean_rel=0.10254740715026855, max_rel=8.443449020385742, norm_rel=0.02330104634165764, ref_abs_avg=34.33401107788086, test_abs_avg=34.345367431640625
production_forward2 grad[21] vs paper_forward: mean_abs=0.948824405670166, max_abs=6.875, mean_rel=0.16111844778060913, max_rel=1152.3695068359375, norm_rel=0.022771600633859634, ref_abs_avg=41.87453079223633, test_abs_avg=41.875396728515625
production_forward2 grad[22] vs paper_forward: mean_abs=0.9270789623260498, max_abs=6.125, mean_rel=0.15838691592216492, max_rel=1300.9857177734375, norm_rel=0.022507289424538612, ref_abs_avg=41.480838775634766, test_abs_avg=41.48149108886719
production_forward2 grad[23] vs paper_forward: mean_abs=0.6985745429992676, max_abs=2.75, mean_rel=0.1384754478931427, max_rel=19.697547912597656, norm_rel=0.021254763007164, ref_abs_avg=32.72911834716797, test_abs_avg=32.74481201171875
production_forward2 grad[24] vs paper_forward: mean_abs=0.9001549482345581, max_abs=6.0, mean_rel=0.15762467682361603, max_rel=1636.3009033203125, norm_rel=0.022673726081848145, ref_abs_avg=39.87667465209961, test_abs_avg=39.877708435058594
production_forward2 grad[25] vs paper_forward: mean_abs=0.8760212659835815, max_abs=5.5, mean_rel=0.16458749771118164, max_rel=1234.309326171875, norm_rel=0.022090177983045578, ref_abs_avg=39.75249481201172, test_abs_avg=39.760738372802734
production_forward2 grad[26] vs paper_forward: mean_abs=0.8278923034667969, max_abs=3.728515625, mean_rel=0.09818742424249649, max_rel=8.365914344787598, norm_rel=0.023580331355333328, ref_abs_avg=35.54334259033203, test_abs_avg=35.53522872924805
production_forward2 grad[27] vs paper_forward: mean_abs=1.0601345300674438, max_abs=7.0, mean_rel=0.16822972893714905, max_rel=1476.8990478515625, norm_rel=0.0245120320469141, ref_abs_avg=43.434478759765625, test_abs_avg=43.4371452331543
production_forward2 grad[28] vs paper_forward: mean_abs=1.0267140865325928, max_abs=7.5, mean_rel=0.15739664435386658, max_rel=889.8271484375, norm_rel=0.024355171248316765, ref_abs_avg=42.370750427246094, test_abs_avg=42.37059020996094
production_forward2 grad[29] vs paper_forward: mean_abs=0.769141674041748, max_abs=2.9375, mean_rel=0.1053880900144577, max_rel=10.812817573547363, norm_rel=0.02372819371521473, ref_abs_avg=32.726810455322266, test_abs_avg=32.75971984863281
production_forward2 grad[30] vs paper_forward: mean_abs=0.9762019515037537, max_abs=6.5, mean_rel=0.17383161187171936, max_rel=1992.8094482421875, norm_rel=0.024760760366916656, ref_abs_avg=39.549964904785156, test_abs_avg=39.55363464355469
production_forward2 grad[31] vs paper_forward: mean_abs=0.95533287525177, max_abs=6.0625, mean_rel=0.15081310272216797, max_rel=887.965087890625, norm_rel=0.02471458539366722, ref_abs_avg=38.86224365234375, test_abs_avg=38.86717987060547
production_forward2 grad[32] vs paper_forward: mean_abs=0.7339353561401367, max_abs=3.3125, mean_rel=0.08635318279266357, max_rel=5.520164489746094, norm_rel=0.02349591813981533, ref_abs_avg=32.33365249633789, test_abs_avg=32.29685974121094
production_forward2 grad[33] vs paper_forward: mean_abs=0.905964732170105, max_abs=6.0, mean_rel=0.17628315091133118, max_rel=1575.9952392578125, norm_rel=0.0244913212954998, ref_abs_avg=37.088043212890625, test_abs_avg=37.09259796142578
production_forward2 grad[34] vs paper_forward: mean_abs=0.8889766335487366, max_abs=6.0, mean_rel=0.17308568954467773, max_rel=2419.132568359375, norm_rel=0.02459709905087948, ref_abs_avg=36.292694091796875, test_abs_avg=36.293304443359375
production_forward2 grad[35] vs paper_forward: mean_abs=0.6747728586196899, max_abs=2.5, mean_rel=0.26516515016555786, max_rel=100.71359252929688, norm_rel=0.02365705743432045, ref_abs_avg=29.466806411743164, test_abs_avg=29.465965270996094
production_forward2 grad[36] vs paper_forward: mean_abs=0.8438146710395813, max_abs=5.5, mean_rel=0.165817528963089, max_rel=2047.2490234375, norm_rel=0.02441088855266571, ref_abs_avg=34.664268493652344, test_abs_avg=34.66548538208008
production_forward2 grad[37] vs paper_forward: mean_abs=0.8274201154708862, max_abs=5.0, mean_rel=0.15442192554473877, max_rel=571.8408203125, norm_rel=0.024379022419452667, ref_abs_avg=34.049835205078125, test_abs_avg=34.0481071472168
production_forward2 grad[38] vs paper_forward: mean_abs=0.6072216033935547, max_abs=2.5, mean_rel=0.18132779002189636, max_rel=28.216419219970703, norm_rel=0.02301778271794319, ref_abs_avg=26.789583206176758, test_abs_avg=26.7689266204834
production_forward2 grad[39] vs paper_forward: mean_abs=0.7907517552375793, max_abs=5.5, mean_rel=0.164362370967865, max_rel=1644.7890625, norm_rel=0.024192580953240395, ref_abs_avg=32.78636932373047, test_abs_avg=32.78883743286133
production_forward2 grad[40] vs paper_forward: mean_abs=0.7748735547065735, max_abs=4.625, mean_rel=0.16145843267440796, max_rel=1618.847412109375, norm_rel=0.023745384067296982, ref_abs_avg=32.721153259277344, test_abs_avg=32.72795104980469
production_forward2 grad[41] vs paper_forward: mean_abs=0.6168160438537598, max_abs=2.75, mean_rel=0.08897039294242859, max_rel=7.873482704162598, norm_rel=0.023794693872332573, ref_abs_avg=26.15036964416504, test_abs_avg=26.169803619384766
production_forward2 grad[42] vs paper_forward: mean_abs=0.748701810836792, max_abs=5.25, mean_rel=0.1553553193807602, max_rel=1044.233642578125, norm_rel=0.023928552865982056, ref_abs_avg=31.380786895751953, test_abs_avg=31.384944915771484
production_forward2 grad[43] vs paper_forward: mean_abs=0.738577127456665, max_abs=5.0, mean_rel=0.149070143699646, max_rel=821.4141235351562, norm_rel=0.023601766675710678, ref_abs_avg=31.412017822265625, test_abs_avg=31.4124755859375
production_forward2 grad[44] vs paper_forward: mean_abs=0.5860954523086548, max_abs=2.1875, mean_rel=0.11644802987575531, max_rel=6.3873772621154785, norm_rel=0.023888224735856056, ref_abs_avg=24.543964385986328, test_abs_avg=24.517776489257812
production_forward2 grad[45] vs paper_forward: mean_abs=0.7183966636657715, max_abs=4.625, mean_rel=0.1511751115322113, max_rel=958.7538452148438, norm_rel=0.02356637269258499, ref_abs_avg=30.55695915222168, test_abs_avg=30.557769775390625
production_forward2 grad[46] vs paper_forward: mean_abs=0.7105545997619629, max_abs=4.6875, mean_rel=0.15833762288093567, max_rel=799.7041015625, norm_rel=0.023466143757104874, ref_abs_avg=30.332225799560547, test_abs_avg=30.33612823486328
production_forward2 grad[47] vs paper_forward: mean_abs=0.5333664417266846, max_abs=1.9375, mean_rel=0.11602705717086792, max_rel=9.747215270996094, norm_rel=0.021827174350619316, ref_abs_avg=24.045625686645508, test_abs_avg=24.054336547851562
production_forward2 grad[48] vs paper_forward: mean_abs=0.6887603402137756, max_abs=4.625, mean_rel=0.16483013331890106, max_rel=1250.975341796875, norm_rel=0.02326393686234951, ref_abs_avg=29.607189178466797, test_abs_avg=29.61074447631836
production_forward2 grad[49] vs paper_forward: mean_abs=0.6755468845367432, max_abs=4.5, mean_rel=0.14643406867980957, max_rel=677.8930053710938, norm_rel=0.023348482325673103, ref_abs_avg=28.970823287963867, test_abs_avg=28.96782684326172
production_forward2 grad[50] vs paper_forward: mean_abs=0.6040902137756348, max_abs=2.625, mean_rel=0.09292261302471161, max_rel=4.686164379119873, norm_rel=0.024461718276143074, ref_abs_avg=24.87539291381836, test_abs_avg=24.901565551757812
production_forward2 grad[51] vs paper_forward: mean_abs=0.7639540433883667, max_abs=5.5, mean_rel=0.16982752084732056, max_rel=1073.9744873046875, norm_rel=0.0249799732118845, ref_abs_avg=30.66454315185547, test_abs_avg=30.666141510009766
production_forward2 grad[52] vs paper_forward: mean_abs=0.7446799874305725, max_abs=4.75, mean_rel=0.15273600816726685, max_rel=766.460205078125, norm_rel=0.024570481851696968, ref_abs_avg=30.402219772338867, test_abs_avg=30.406259536743164
production_forward2 grad[53] vs paper_forward: mean_abs=0.5675432682037354, max_abs=2.0625, mean_rel=0.12059913575649261, max_rel=11.569784164428711, norm_rel=0.02361353114247322, ref_abs_avg=23.949472427368164, test_abs_avg=23.94467544555664
production_forward2 grad[54] vs paper_forward: mean_abs=0.6932523250579834, max_abs=5.0, mean_rel=0.15902546048164368, max_rel=1128.252197265625, norm_rel=0.024397261440753937, ref_abs_avg=28.45552635192871, test_abs_avg=28.455791473388672
production_forward2 grad[55] vs paper_forward: mean_abs=0.6764398813247681, max_abs=4.25, mean_rel=0.15184836089611053, max_rel=497.93450927734375, norm_rel=0.024341212585568428, ref_abs_avg=27.843069076538086, test_abs_avg=27.84381675720215
production_forward2 grad[56] vs paper_forward: mean_abs=0.5005078315734863, max_abs=2.19921875, mean_rel=0.11338236182928085, max_rel=6.748247146606445, norm_rel=0.022353343665599823, ref_abs_avg=22.217540740966797, test_abs_avg=22.235734939575195
production_forward2 grad[57] vs paper_forward: mean_abs=0.6417543888092041, max_abs=4.5, mean_rel=0.16458278894424438, max_rel=1262.5335693359375, norm_rel=0.02383904531598091, ref_abs_avg=26.901691436767578, test_abs_avg=26.903226852416992
production_forward2 grad[58] vs paper_forward: mean_abs=0.6307289600372314, max_abs=5.078125, mean_rel=0.16529545187950134, max_rel=1943.6651611328125, norm_rel=0.023704132065176964, ref_abs_avg=26.639766693115234, test_abs_avg=26.63736343383789
production_forward2 grad[59] vs paper_forward: mean_abs=0.5049858093261719, max_abs=2.25, mean_rel=0.1286350041627884, max_rel=4.637358665466309, norm_rel=0.02435479499399662, ref_abs_avg=20.40161895751953, test_abs_avg=20.47576141357422
production_forward2 grad[60] vs paper_forward: mean_abs=0.5929771065711975, max_abs=4.75, mean_rel=0.1506020724773407, max_rel=1164.2059326171875, norm_rel=0.023344505578279495, ref_abs_avg=25.37647247314453, test_abs_avg=25.379281997680664
production_forward2 grad[61] vs paper_forward: mean_abs=0.5860636234283447, max_abs=4.5, mean_rel=0.15639731287956238, max_rel=1183.4493408203125, norm_rel=0.023197419941425323, ref_abs_avg=25.275344848632812, test_abs_avg=25.27410888671875
production_forward2 grad[62] vs paper_forward: mean_abs=0.46736860275268555, max_abs=2.125, mean_rel=0.09227289259433746, max_rel=13.204882621765137, norm_rel=0.023030735552310944, ref_abs_avg=20.54904556274414, test_abs_avg=20.573688507080078
production_forward2 grad[63] vs paper_forward: mean_abs=0.5690721869468689, max_abs=5.0, mean_rel=0.14636769890785217, max_rel=1187.5, norm_rel=0.022609371691942215, ref_abs_avg=25.126930236816406, test_abs_avg=25.127443313598633
production_forward2 grad[64] vs paper_forward: mean_abs=0.5562343597412109, max_abs=4.0, mean_rel=0.14711076021194458, max_rel=566.8875122070312, norm_rel=0.022984899580478668, ref_abs_avg=24.223270416259766, test_abs_avg=24.223350524902344
production_forward2 grad[65] vs paper_forward: mean_abs=0.4397854804992676, max_abs=1.75, mean_rel=0.12226638197898865, max_rel=27.639726638793945, norm_rel=0.022158470004796982, ref_abs_avg=20.282310485839844, test_abs_avg=20.26214599609375
production_forward2 grad[66] vs paper_forward: mean_abs=0.538933515548706, max_abs=4.1875, mean_rel=0.14757974445819855, max_rel=835.4263305664062, norm_rel=0.022496292367577553, ref_abs_avg=23.920902252197266, test_abs_avg=23.922039031982422
production_forward2 grad[67] vs paper_forward: mean_abs=0.5261110663414001, max_abs=4.0, mean_rel=0.15839120745658875, max_rel=787.6403198242188, norm_rel=0.02196248434484005, ref_abs_avg=23.963069915771484, test_abs_avg=23.967660903930664
production_forward2 grad[68] vs paper_forward: mean_abs=0.40822768211364746, max_abs=1.5625, mean_rel=0.13074883818626404, max_rel=25.372140884399414, norm_rel=0.02109115570783615, ref_abs_avg=19.562633514404297, test_abs_avg=19.51043701171875
production_forward2 grad[69] vs paper_forward: mean_abs=0.5151759386062622, max_abs=4.0, mean_rel=0.14534831047058105, max_rel=858.97265625, norm_rel=0.02215767838060856, ref_abs_avg=23.223215103149414, test_abs_avg=23.224609375
production_forward2 grad[70] vs paper_forward: mean_abs=0.5101027488708496, max_abs=4.75, mean_rel=0.14740949869155884, max_rel=550.7055053710938, norm_rel=0.02227121964097023, ref_abs_avg=22.930404663085938, test_abs_avg=22.927059173583984
production_forward2 grad[71] vs paper_forward: mean_abs=0.3834364414215088, max_abs=1.5, mean_rel=0.22225111722946167, max_rel=41.13284683227539, norm_rel=0.021026963368058205, ref_abs_avg=18.359397888183594, test_abs_avg=18.3833065032959
production_forward2 grad[72] vs paper_forward: mean_abs=0.4934343695640564, max_abs=4.0, mean_rel=0.14750707149505615, max_rel=1204.3270263671875, norm_rel=0.02208143100142479, ref_abs_avg=22.36186981201172, test_abs_avg=22.364013671875
production_forward2 grad[73] vs paper_forward: mean_abs=0.48225682973861694, max_abs=4.25, mean_rel=0.13401466608047485, max_rel=671.4895629882812, norm_rel=0.021754469722509384, ref_abs_avg=22.156492233276367, test_abs_avg=22.159610748291016
production_forward2 grad[74] vs paper_forward: mean_abs=0.45100855827331543, max_abs=1.6484375, mean_rel=0.26332977414131165, max_rel=60.664772033691406, norm_rel=0.025526808574795723, ref_abs_avg=17.837505340576172, test_abs_avg=17.842479705810547
production_forward2 grad[75] vs paper_forward: mean_abs=0.5399566292762756, max_abs=6.0, mean_rel=0.16188842058181763, max_rel=1095.89501953125, norm_rel=0.02399331144988537, ref_abs_avg=22.502689361572266, test_abs_avg=22.504505157470703
production_forward2 grad[76] vs paper_forward: mean_abs=0.5323134660720825, max_abs=4.125, mean_rel=0.14811328053474426, max_rel=774.1597290039062, norm_rel=0.02368898317217827, ref_abs_avg=22.47954750061035, test_abs_avg=22.481754302978516
production_forward2 grad[77] vs paper_forward: mean_abs=0.39226213097572327, max_abs=1.5625, mean_rel=0.5741264820098877, max_rel=241.86460876464844, norm_rel=0.024618864059448242, ref_abs_avg=16.644752502441406, test_abs_avg=16.64539337158203
production_forward2 grad[78] vs paper_forward: mean_abs=0.5024604797363281, max_abs=4.5, mean_rel=0.1474827527999878, max_rel=976.6134033203125, norm_rel=0.0234332587569952, ref_abs_avg=21.459819793701172, test_abs_avg=21.45924186706543
production_forward2 grad[79] vs paper_forward: mean_abs=0.4891645312309265, max_abs=4.25, mean_rel=0.14918260276317596, max_rel=1054.4820556640625, norm_rel=0.022911803796887398, ref_abs_avg=21.34961700439453, test_abs_avg=21.353492736816406
production_forward2 grad[80] vs paper_forward: mean_abs=0.3685874938964844, max_abs=1.625, mean_rel=0.05361485108733177, max_rel=2.248779296875, norm_rel=0.022362643852829933, ref_abs_avg=17.42441177368164, test_abs_avg=17.428245544433594
production_forward2 grad[81] vs paper_forward: mean_abs=0.46368446946144104, max_abs=3.75, mean_rel=0.15064668655395508, max_rel=707.8726806640625, norm_rel=0.02262934297323227, ref_abs_avg=20.51032829284668, test_abs_avg=20.50771141052246
production_forward2 grad[82] vs paper_forward: mean_abs=0.45615461468696594, max_abs=3.875, mean_rel=0.1412508487701416, max_rel=985.0205078125, norm_rel=0.022449219599366188, ref_abs_avg=20.334754943847656, test_abs_avg=20.34302520751953
production_forward2 grad[83] vs paper_forward: mean_abs=0.35776519775390625, max_abs=1.5, mean_rel=0.11787446588277817, max_rel=10.691658020019531, norm_rel=0.022656364366412163, ref_abs_avg=15.683279037475586, test_abs_avg=15.704843521118164
production_forward2 grad[84] vs paper_forward: mean_abs=0.43263697624206543, max_abs=4.5, mean_rel=0.14515629410743713, max_rel=1216.3228759765625, norm_rel=0.022033480927348137, ref_abs_avg=19.638259887695312, test_abs_avg=19.63714599609375
production_forward2 grad[85] vs paper_forward: mean_abs=0.4187648296356201, max_abs=4.0, mean_rel=0.14241833984851837, max_rel=577.7850341796875, norm_rel=0.021646788343787193, ref_abs_avg=19.420913696289062, test_abs_avg=19.41971778869629
production_forward2 grad[86] vs paper_forward: mean_abs=0.31652355194091797, max_abs=1.28125, mean_rel=0.08544166386127472, max_rel=2.70619797706604, norm_rel=0.02044801414012909, ref_abs_avg=15.38685131072998, test_abs_avg=15.368825912475586
production_forward2 grad[87] vs paper_forward: mean_abs=0.3986048996448517, max_abs=4.0, mean_rel=0.13462214171886444, max_rel=1116.4742431640625, norm_rel=0.021581165492534637, ref_abs_avg=18.540294647216797, test_abs_avg=18.539968490600586
production_forward2 grad[88] vs paper_forward: mean_abs=0.39739182591438293, max_abs=4.2421875, mean_rel=0.13235676288604736, max_rel=715.717529296875, norm_rel=0.021412121132016182, ref_abs_avg=18.642480850219727, test_abs_avg=18.642580032348633
production_forward2 grad[89] vs paper_forward: mean_abs=0.3237459659576416, max_abs=1.1875, mean_rel=0.1874111294746399, max_rel=33.07878875732422, norm_rel=0.021044328808784485, ref_abs_avg=15.084699630737305, test_abs_avg=15.039830207824707
production_forward2 grad[90] vs paper_forward: mean_abs=0.38339680433273315, max_abs=4.0, mean_rel=0.13455702364444733, max_rel=499.4460754394531, norm_rel=0.02091851644217968, ref_abs_avg=18.416748046875, test_abs_avg=18.416614532470703
production_forward2 grad[91] vs paper_forward: mean_abs=0.37926650047302246, max_abs=4.0, mean_rel=0.12257910519838333, max_rel=446.72418212890625, norm_rel=0.021066531538963318, ref_abs_avg=18.196592330932617, test_abs_avg=18.196508407592773
production_forward2 grad[92] vs paper_forward: mean_abs=0.28215527534484863, max_abs=1.125, mean_rel=0.07671727985143661, max_rel=4.573978900909424, norm_rel=0.020056119188666344, ref_abs_avg=14.22260570526123, test_abs_avg=14.24687385559082
production_forward2 grad[93] vs paper_forward: mean_abs=0.354915052652359, max_abs=4.0, mean_rel=0.1273878663778305, max_rel=807.8148193359375, norm_rel=0.020531274378299713, ref_abs_avg=17.446277618408203, test_abs_avg=17.445749282836914
production_forward2 grad[94] vs paper_forward: mean_abs=0.3539581596851349, max_abs=4.0, mean_rel=0.12169669568538666, max_rel=773.1045532226562, norm_rel=0.020536882802844048, ref_abs_avg=17.551193237304688, test_abs_avg=17.548534393310547
production_forward2 grad[95] vs paper_forward: mean_abs=0.2912784814834595, max_abs=1.125, mean_rel=0.10004758834838867, max_rel=10.652393341064453, norm_rel=0.019303573295474052, ref_abs_avg=15.131488800048828, test_abs_avg=15.11849308013916
production_forward2 grad[96] vs paper_forward: mean_abs=0.34107306599617004, max_abs=3.5, mean_rel=0.12104420363903046, max_rel=1035.1175537109375, norm_rel=0.020096097141504288, ref_abs_avg=17.245519638061523, test_abs_avg=17.24357032775879
production_forward2 grad[97] vs paper_forward: mean_abs=0.33767178654670715, max_abs=4.0, mean_rel=0.11956118047237396, max_rel=650.9840087890625, norm_rel=0.019871458411216736, ref_abs_avg=17.227571487426758, test_abs_avg=17.223678588867188
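For reference, the statistics printed in each comparison line above (mean_abs, max_abs, mean_rel, max_rel, norm_rel, ref_abs_avg, test_abs_avg) can be reproduced with a small helper. This is a minimal sketch: the exact formulas and epsilon handling used by the actual test harness are assumptions inferred from the field names, not taken from its source.

```python
# Hypothetical reimplementation of the comparison metrics in the log above.
# Assumed definitions (not confirmed against the harness):
#   mean_abs / max_abs : mean / max of |ref - test|
#   mean_rel / max_rel : mean / max of |ref - test| / (|ref| + eps)
#   norm_rel           : L2 norm of the error over L2 norm of the reference
#   ref_abs_avg        : mean of |ref| (test_abs_avg analogously)
import math

def compare(ref, test, eps=1e-12):
    diffs = [abs(r - t) for r, t in zip(ref, test)]
    rels = [d / (abs(r) + eps) for d, r in zip(diffs, ref)]
    err_norm = math.sqrt(sum(d * d for d in diffs))
    ref_norm = math.sqrt(sum(r * r for r in ref))
    n = len(ref)
    return {
        "mean_abs": sum(diffs) / n,
        "max_abs": max(diffs),
        "mean_rel": sum(rels) / n,
        "max_rel": max(rels),
        "norm_rel": err_norm / (ref_norm + eps),
        "ref_abs_avg": sum(abs(r) for r in ref) / n,
        "test_abs_avg": sum(abs(t) for t in test) / n,
    }
```

Under these definitions the pattern in the log is consistent: a huge max_rel (in the hundreds or thousands) alongside a small norm_rel (~0.02) indicates the relative error blows up only on near-zero reference elements, while the overall error energy stays around 2% of the reference, which is what one expects from reduced-precision (e.g. bf16) gradient accumulation rather than a logic bug.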