identity layers + randn queries
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 6.49s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_out_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_out_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16'),
finished after 3.61s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 7.26s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 8.45s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 8.52s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_kernel,
with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32'),
finished after 4.74s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 7.78s,
best config selected: num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 32, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 32, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel,
with key as (131072, 512, 1, 'torch.float32', 'torch.float32'),
finished after 1.50s,
best config selected: BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 4.69s,
best config selected: num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 32, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 32, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_reduce_grad_pseudo_query_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function phase_2_reduce_grad_pseudo_query_kernel,
with key as (131072, 512, 'torch.float32', 'torch.float32'),
finished after 1.47s,
best config selected: BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 20.14s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 32, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 32, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_reduce_grad_pseudo_queries_kernel with config BLOCK_BATCH_SEQ: 256, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Triton autotuning for function phase_1_reduce_grad_pseudo_queries_kernel,
with key as (131072, 512, 8, 'torch.float32', 'torch.float32'),
finished after 1.50s,
best config selected: BLOCK_BATCH_SEQ: 128, BLOCK_HIDDEN: 64, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 18.77s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 14.51s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 9.97s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None;
production_forward2 fwd+bwd:  224.445 ms
production_forward2 bwd-only: 202.236 ms
production_forward2 peak allocated: fwd=2.567 GiB, fwd+bwd=5.946 GiB
production_forward2 peak reserved:  fwd=2.949 GiB, fwd+bwd=8.699 GiB
paper_forward fwd+bwd:  380.170 ms
paper_forward bwd-only: 294.187 ms
paper_forward peak allocated: fwd=29.705 GiB, fwd+bwd=31.823 GiB
paper_forward peak reserved:  fwd=29.742 GiB, fwd+bwd=32.492 GiB
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (5, 512, 1, 8, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.bfloat16', 'torch.float32'),
finished after 7.03s,
best config selected: num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None;
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_2_online_softmax_merge_intrablock_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_2_online_softmax_merge_intrablock_backward_kernel,
with key as (512, 'torch.bfloat16', 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32'),
finished after 4.41s,
best config selected: num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (4, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'),
finished after 24.07s,
best config selected: num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (3, 512, 8, 4, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'),
finished after 22.47s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (2, 512, 8, 2, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'),
finished after 16.05s,
best config selected: num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None;
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 1, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 2, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 4, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 8, num_ctas: 1, num_stages: 4, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 1, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 2, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 3, maxnreg: None
Autotuning kernel phase_1_batched_interblock_attention_backward_kernel with config num_warps: 16, num_ctas: 1, num_stages: 4, maxnreg: None
Triton autotuning for function phase_1_batched_interblock_attention_backward_kernel,
with key as (1, 512, 8, 1, 'torch.bfloat16', 'torch.bfloat16', 'torch.float32', 'torch.float32', 'torch.float32', 'torch.bfloat16', 'torch.float32'),
finished after 10.93s,
best config selected: num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
production_forward fwd+bwd:  109.593 ms
production_forward bwd-only: 89.259 ms
production_forward peak allocated: fwd=3.071 GiB, fwd+bwd=6.696 GiB
production_forward peak reserved:  fwd=3.324 GiB, fwd+bwd=7.824 GiB

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016365128103643656, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008517708629369736, max_abs=0.5, mean_rel=0.07327231764793396, max_rel=132.022705078125, norm_rel=0.020076081156730652, ref_abs_avg=0.45930689573287964, test_abs_avg=0.4593246877193451
production_forward grad[1] vs paper_forward: mean_abs=7.290658473968506, max_abs=72.0, mean_rel=0.2889525294303894, max_rel=1233.1947021484375, norm_rel=0.0203513503074646, ref_abs_avg=318.4886474609375, test_abs_avg=318.4306640625
production_forward grad[2] vs paper_forward: mean_abs=1.3072388172149658, max_abs=5.0, mean_rel=0.12653468549251556, max_rel=9.843981742858887, norm_rel=0.023780405521392822, ref_abs_avg=55.638912200927734, test_abs_avg=55.579559326171875
production_forward grad[3] vs paper_forward: mean_abs=1.5688705444335938, max_abs=10.0, mean_rel=0.16016718745231628, max_rel=1597.4559326171875, norm_rel=0.023970816284418106, ref_abs_avg=65.84440612792969, test_abs_avg=65.84793090820312
production_forward grad[4] vs paper_forward: mean_abs=1.5299543142318726, max_abs=10.5, mean_rel=0.1590738445520401, max_rel=914.84326171875, norm_rel=0.02367556467652321, ref_abs_avg=64.99253845214844, test_abs_avg=64.99272155761719
production_forward grad[5] vs paper_forward: mean_abs=1.1054401397705078, max_abs=4.3125, mean_rel=0.15576690435409546, max_rel=15.531587600708008, norm_rel=0.02325001358985901, ref_abs_avg=47.214988708496094, test_abs_avg=47.209861755371094
production_forward grad[6] vs paper_forward: mean_abs=1.3806780576705933, max_abs=9.0, mean_rel=0.15976482629776, max_rel=1351.6221923828125, norm_rel=0.02359931170940399, ref_abs_avg=58.82403564453125, test_abs_avg=58.83041000366211
production_forward grad[7] vs paper_forward: mean_abs=1.3438646793365479, max_abs=8.0, mean_rel=0.1466902494430542, max_rel=1215.4757080078125, norm_rel=0.023453282192349434, ref_abs_avg=57.65807342529297, test_abs_avg=57.65431213378906
production_forward grad[8] vs paper_forward: mean_abs=0.9779667854309082, max_abs=3.7890625, mean_rel=0.18761390447616577, max_rel=23.6821346282959, norm_rel=0.023540880531072617, ref_abs_avg=41.21825408935547, test_abs_avg=41.184288024902344
production_forward grad[9] vs paper_forward: mean_abs=1.2485325336456299, max_abs=8.0, mean_rel=0.16239459812641144, max_rel=2449.32275390625, norm_rel=0.023417053744196892, ref_abs_avg=53.58391571044922, test_abs_avg=53.589744567871094
production_forward grad[10] vs paper_forward: mean_abs=1.2160425186157227, max_abs=7.0, mean_rel=0.16198492050170898, max_rel=1078.9337158203125, norm_rel=0.023195000365376472, ref_abs_avg=52.694129943847656, test_abs_avg=52.69224166870117
production_forward grad[11] vs paper_forward: mean_abs=0.9289655685424805, max_abs=3.5, mean_rel=0.07123110443353653, max_rel=3.7820839881896973, norm_rel=0.022782523185014725, ref_abs_avg=41.41888427734375, test_abs_avg=41.487632751464844
production_forward grad[12] vs paper_forward: mean_abs=1.1537361145019531, max_abs=7.0, mean_rel=0.1439945101737976, max_rel=649.5744018554688, norm_rel=0.02319464460015297, ref_abs_avg=50.004371643066406, test_abs_avg=50.01032257080078
production_forward grad[13] vs paper_forward: mean_abs=1.125583291053772, max_abs=7.0, mean_rel=0.15319553017616272, max_rel=820.2575073242188, norm_rel=0.023044001311063766, ref_abs_avg=49.059173583984375, test_abs_avg=49.05575942993164
production_forward grad[14] vs paper_forward: mean_abs=0.8640031814575195, max_abs=3.65625, mean_rel=0.0758562684059143, max_rel=1.938589096069336, norm_rel=0.022538205608725548, ref_abs_avg=39.35063934326172, test_abs_avg=39.34767150878906
production_forward grad[15] vs paper_forward: mean_abs=1.0734326839447021, max_abs=7.25, mean_rel=0.1557338833808899, max_rel=1364.46484375, norm_rel=0.02305809035897255, ref_abs_avg=46.75851821899414, test_abs_avg=46.76094055175781
production_forward grad[16] vs paper_forward: mean_abs=1.0503888130187988, max_abs=7.0, mean_rel=0.15000972151756287, max_rel=606.3138427734375, norm_rel=0.022811176255345345, ref_abs_avg=46.26638412475586, test_abs_avg=46.2708854675293
production_forward grad[17] vs paper_forward: mean_abs=0.8078536987304688, max_abs=3.125, mean_rel=0.061310216784477234, max_rel=1.2673321962356567, norm_rel=0.022162161767482758, ref_abs_avg=36.20037841796875, test_abs_avg=36.24199676513672
production_forward grad[18] vs paper_forward: mean_abs=1.0104495286941528, max_abs=7.0, mean_rel=0.15557557344436646, max_rel=957.8049926757812, norm_rel=0.022974595427513123, ref_abs_avg=44.19859313964844, test_abs_avg=44.203941345214844
production_forward grad[19] vs paper_forward: mean_abs=0.9864587187767029, max_abs=6.0546875, mean_rel=0.14655253291130066, max_rel=1095.6534423828125, norm_rel=0.022744838148355484, ref_abs_avg=43.57270050048828, test_abs_avg=43.573829650878906
production_forward grad[20] vs paper_forward: mean_abs=0.7566068172454834, max_abs=3.0, mean_rel=0.05866319686174393, max_rel=2.17533016204834, norm_rel=0.020458418875932693, ref_abs_avg=36.999610900878906, test_abs_avg=37.13446807861328
production_forward grad[21] vs paper_forward: mean_abs=0.959540843963623, max_abs=6.25, mean_rel=0.1580410897731781, max_rel=868.0388793945312, norm_rel=0.022774435579776764, ref_abs_avg=42.31806945800781, test_abs_avg=42.31918716430664
production_forward grad[22] vs paper_forward: mean_abs=0.9302812814712524, max_abs=6.0, mean_rel=0.1526857316493988, max_rel=1128.0906982421875, norm_rel=0.022539544850587845, ref_abs_avg=41.50236511230469, test_abs_avg=41.511383056640625
production_forward grad[23] vs paper_forward: mean_abs=0.7545291185379028, max_abs=2.75, mean_rel=0.0687120109796524, max_rel=5.610956192016602, norm_rel=0.022219102829694748, ref_abs_avg=34.72947692871094, test_abs_avg=34.708534240722656
production_forward grad[24] vs paper_forward: mean_abs=0.9122262597084045, max_abs=5.5, mean_rel=0.15441007912158966, max_rel=1034.5906982421875, norm_rel=0.022626761347055435, ref_abs_avg=40.47168731689453, test_abs_avg=40.4736328125
production_forward grad[25] vs paper_forward: mean_abs=0.8884584903717041, max_abs=6.0, mean_rel=0.1511957347393036, max_rel=476.92950439453125, norm_rel=0.022390611469745636, ref_abs_avg=39.87950897216797, test_abs_avg=39.88086700439453
production_forward grad[26] vs paper_forward: mean_abs=0.8144617080688477, max_abs=2.75, mean_rel=0.1075543463230133, max_rel=10.95190143585205, norm_rel=0.022992629557847977, ref_abs_avg=35.634674072265625, test_abs_avg=35.61433792114258
production_forward grad[27] vs paper_forward: mean_abs=1.0157074928283691, max_abs=6.6875, mean_rel=0.15424995124340057, max_rel=986.69775390625, norm_rel=0.02460465393960476, ref_abs_avg=41.486846923828125, test_abs_avg=41.48859786987305
production_forward grad[28] vs paper_forward: mean_abs=0.9918763637542725, max_abs=7.0, mean_rel=0.15809260308742523, max_rel=1101.4693603515625, norm_rel=0.024168003350496292, ref_abs_avg=41.2172737121582, test_abs_avg=41.212974548339844
production_forward grad[29] vs paper_forward: mean_abs=0.7564752101898193, max_abs=3.5, mean_rel=0.2010241150856018, max_rel=55.6343879699707, norm_rel=0.024898357689380646, ref_abs_avg=30.342864990234375, test_abs_avg=30.31675910949707
production_forward grad[30] vs paper_forward: mean_abs=0.9630250930786133, max_abs=6.5, mean_rel=0.1707482635974884, max_rel=2000.904052734375, norm_rel=0.02493487484753132, ref_abs_avg=38.7740592956543, test_abs_avg=38.773094177246094
production_forward grad[31] vs paper_forward: mean_abs=0.9424465894699097, max_abs=6.0, mean_rel=0.16461095213890076, max_rel=943.3816528320312, norm_rel=0.024505607783794403, ref_abs_avg=38.60586166381836, test_abs_avg=38.608917236328125
production_forward grad[32] vs paper_forward: mean_abs=0.7392752170562744, max_abs=3.0, mean_rel=0.6545445919036865, max_rel=252.63421630859375, norm_rel=0.02416612207889557, ref_abs_avg=30.78614044189453, test_abs_avg=30.78415298461914
production_forward grad[33] vs paper_forward: mean_abs=0.9064749479293823, max_abs=6.0, mean_rel=0.17812538146972656, max_rel=1552.2015380859375, norm_rel=0.02492809295654297, ref_abs_avg=36.456207275390625, test_abs_avg=36.45799255371094
production_forward grad[34] vs paper_forward: mean_abs=0.8893953561782837, max_abs=5.25, mean_rel=0.16124820709228516, max_rel=1340.765380859375, norm_rel=0.024779701605439186, ref_abs_avg=36.030784606933594, test_abs_avg=36.03058624267578
production_forward grad[35] vs paper_forward: mean_abs=0.6976804733276367, max_abs=2.875, mean_rel=0.1672750562429428, max_rel=25.850797653198242, norm_rel=0.023193562403321266, ref_abs_avg=30.22367286682129, test_abs_avg=30.225696563720703
production_forward grad[36] vs paper_forward: mean_abs=0.8514313697814941, max_abs=5.326171875, mean_rel=0.16512352228164673, max_rel=1206.3968505859375, norm_rel=0.02457723207771778, ref_abs_avg=34.70581817626953, test_abs_avg=34.705726623535156
production_forward grad[37] vs paper_forward: mean_abs=0.8380588889122009, max_abs=5.25, mean_rel=0.1555805504322052, max_rel=1287.6171875, norm_rel=0.02445637807250023, ref_abs_avg=34.34699249267578, test_abs_avg=34.34834671020508
production_forward grad[38] vs paper_forward: mean_abs=0.6920220851898193, max_abs=2.59375, mean_rel=0.12656892836093903, max_rel=8.746953964233398, norm_rel=0.025315193459391594, ref_abs_avg=27.552608489990234, test_abs_avg=27.535186767578125
production_forward grad[39] vs paper_forward: mean_abs=0.8007455468177795, max_abs=5.375, mean_rel=0.18171590566635132, max_rel=1543.5257568359375, norm_rel=0.024475082755088806, ref_abs_avg=32.8138313293457, test_abs_avg=32.81493377685547
production_forward grad[40] vs paper_forward: mean_abs=0.790579080581665, max_abs=6.0, mean_rel=0.16370302438735962, max_rel=699.557861328125, norm_rel=0.02435225248336792, ref_abs_avg=32.584373474121094, test_abs_avg=32.58856201171875
production_forward grad[41] vs paper_forward: mean_abs=0.6206445693969727, max_abs=2.125, mean_rel=0.08332228660583496, max_rel=4.547646522521973, norm_rel=0.024104507640004158, ref_abs_avg=26.08892059326172, test_abs_avg=26.06949234008789
production_forward grad[42] vs paper_forward: mean_abs=0.7684463262557983, max_abs=4.75, mean_rel=0.1522388458251953, max_rel=1956.774169921875, norm_rel=0.024204451590776443, ref_abs_avg=31.825828552246094, test_abs_avg=31.827165603637695
production_forward grad[43] vs paper_forward: mean_abs=0.7525383234024048, max_abs=4.75, mean_rel=0.16445058584213257, max_rel=1021.3534545898438, norm_rel=0.02411465160548687, ref_abs_avg=31.294788360595703, test_abs_avg=31.291101455688477
production_forward grad[44] vs paper_forward: mean_abs=0.58982253074646, max_abs=2.375, mean_rel=1.4202911853790283, max_rel=687.253662109375, norm_rel=0.023517660796642303, ref_abs_avg=25.60738754272461, test_abs_avg=25.643455505371094
production_forward grad[45] vs paper_forward: mean_abs=0.7277047634124756, max_abs=4.625, mean_rel=0.16455140709877014, max_rel=883.9599609375, norm_rel=0.02384328655898571, ref_abs_avg=30.550065994262695, test_abs_avg=30.54941177368164
production_forward grad[46] vs paper_forward: mean_abs=0.7178628444671631, max_abs=5.0, mean_rel=0.15375420451164246, max_rel=1144.9609375, norm_rel=0.023858731612563133, ref_abs_avg=30.172595977783203, test_abs_avg=30.178327560424805
production_forward grad[47] vs paper_forward: mean_abs=0.5512175559997559, max_abs=2.625, mean_rel=0.10192456841468811, max_rel=7.142060279846191, norm_rel=0.02379407174885273, ref_abs_avg=23.64335060119629, test_abs_avg=23.558956146240234
production_forward grad[48] vs paper_forward: mean_abs=0.6951755285263062, max_abs=5.0, mean_rel=0.15514302253723145, max_rel=887.6544189453125, norm_rel=0.02351222187280655, ref_abs_avg=29.545948028564453, test_abs_avg=29.546977996826172
production_forward grad[49] vs paper_forward: mean_abs=0.6806567311286926, max_abs=5.0, mean_rel=0.15453603863716125, max_rel=583.8931884765625, norm_rel=0.023842042312026024, ref_abs_avg=28.61752700805664, test_abs_avg=28.619775772094727
production_forward grad[50] vs paper_forward: mean_abs=0.6404123306274414, max_abs=3.125, mean_rel=0.15019428730010986, max_rel=26.733861923217773, norm_rel=0.024872461333870888, ref_abs_avg=25.861202239990234, test_abs_avg=25.83339500427246
production_forward grad[51] vs paper_forward: mean_abs=0.7589672803878784, max_abs=6.0, mean_rel=0.1574096977710724, max_rel=1049.7392578125, norm_rel=0.02476537972688675, ref_abs_avg=30.71734619140625, test_abs_avg=30.717432022094727
production_forward grad[52] vs paper_forward: mean_abs=0.7456493973731995, max_abs=5.0, mean_rel=0.1633913815021515, max_rel=673.4480590820312, norm_rel=0.024588007479906082, ref_abs_avg=30.43000602722168, test_abs_avg=30.4296817779541
production_forward grad[53] vs paper_forward: mean_abs=0.5928306579589844, max_abs=2.125, mean_rel=0.07482579350471497, max_rel=2.1361000537872314, norm_rel=0.02430967055261135, ref_abs_avg=24.64533233642578, test_abs_avg=24.616901397705078
production_forward grad[54] vs paper_forward: mean_abs=0.703442394733429, max_abs=5.0, mean_rel=0.16806231439113617, max_rel=805.31103515625, norm_rel=0.024547060951590538, ref_abs_avg=28.703580856323242, test_abs_avg=28.705184936523438
production_forward grad[55] vs paper_forward: mean_abs=0.6966938972473145, max_abs=5.0, mean_rel=0.15765072405338287, max_rel=678.3071899414062, norm_rel=0.024410132318735123, ref_abs_avg=28.618202209472656, test_abs_avg=28.623493194580078
production_forward grad[56] vs paper_forward: mean_abs=0.5716552734375, max_abs=2.625, mean_rel=0.10266795754432678, max_rel=4.832780838012695, norm_rel=0.024082748219370842, ref_abs_avg=23.819778442382812, test_abs_avg=23.793956756591797
production_forward grad[57] vs paper_forward: mean_abs=0.671097457408905, max_abs=4.5, mean_rel=0.16590334475040436, max_rel=979.7628784179688, norm_rel=0.024324413388967514, ref_abs_avg=27.631607055664062, test_abs_avg=27.632354736328125
production_forward grad[58] vs paper_forward: mean_abs=0.6592422723770142, max_abs=4.625, mean_rel=0.15205618739128113, max_rel=1778.6614990234375, norm_rel=0.024077583104372025, ref_abs_avg=27.40127944946289, test_abs_avg=27.398414611816406
production_forward grad[59] vs paper_forward: mean_abs=0.5191566944122314, max_abs=2.25, mean_rel=0.2199876457452774, max_rel=39.912445068359375, norm_rel=0.023572467267513275, ref_abs_avg=21.46916961669922, test_abs_avg=21.4682559967041
production_forward grad[60] vs paper_forward: mean_abs=0.626258134841919, max_abs=4.25, mean_rel=0.15380309522151947, max_rel=930.3441162109375, norm_rel=0.02385528013110161, ref_abs_avg=26.262882232666016, test_abs_avg=26.263629913330078
production_forward grad[61] vs paper_forward: mean_abs=0.6168071031570435, max_abs=5.4375, mean_rel=0.14737319946289062, max_rel=774.1316528320312, norm_rel=0.023543715476989746, ref_abs_avg=26.200870513916016, test_abs_avg=26.206584930419922
production_forward grad[62] vs paper_forward: mean_abs=0.47431421279907227, max_abs=2.0, mean_rel=0.11159995198249817, max_rel=10.944701194763184, norm_rel=0.022925162687897682, ref_abs_avg=20.012805938720703, test_abs_avg=20.038850784301758
production_forward grad[63] vs paper_forward: mean_abs=0.5959306955337524, max_abs=4.0625, mean_rel=0.1519746631383896, max_rel=1344.4783935546875, norm_rel=0.023173697292804718, ref_abs_avg=25.672008514404297, test_abs_avg=25.675113677978516
production_forward grad[64] vs paper_forward: mean_abs=0.5816498398780823, max_abs=4.5, mean_rel=0.14694392681121826, max_rel=417.76458740234375, norm_rel=0.023272046819329262, ref_abs_avg=25.047161102294922, test_abs_avg=25.05197525024414
production_forward grad[65] vs paper_forward: mean_abs=0.43164050579071045, max_abs=2.125, mean_rel=0.1394716501235962, max_rel=14.017390251159668, norm_rel=0.021747615188360214, ref_abs_avg=20.180583953857422, test_abs_avg=20.175323486328125
production_forward grad[66] vs paper_forward: mean_abs=0.5652035474777222, max_abs=5.0, mean_rel=0.15502622723579407, max_rel=897.6536865234375, norm_rel=0.022919990122318268, ref_abs_avg=24.629276275634766, test_abs_avg=24.629615783691406
production_forward grad[67] vs paper_forward: mean_abs=0.551037609577179, max_abs=5.0, mean_rel=0.14817507565021515, max_rel=737.8359375, norm_rel=0.0226756539195776, ref_abs_avg=24.325992584228516, test_abs_avg=24.321748733520508
production_forward grad[68] vs paper_forward: mean_abs=0.42707109451293945, max_abs=1.75, mean_rel=0.21140077710151672, max_rel=25.175228118896484, norm_rel=0.021462081000208855, ref_abs_avg=20.281299591064453, test_abs_avg=20.27437973022461
production_forward grad[69] vs paper_forward: mean_abs=0.5389654636383057, max_abs=4.5, mean_rel=0.14536058902740479, max_rel=1081.6781005859375, norm_rel=0.022370485588908195, ref_abs_avg=24.04323959350586, test_abs_avg=24.04229736328125
production_forward grad[70] vs paper_forward: mean_abs=0.525962233543396, max_abs=4.6875, mean_rel=0.14590363204479218, max_rel=1017.9614868164062, norm_rel=0.022597044706344604, ref_abs_avg=23.33050537109375, test_abs_avg=23.32831573486328
production_forward grad[71] vs paper_forward: mean_abs=0.4244399070739746, max_abs=1.64453125, mean_rel=0.08527900278568268, max_rel=2.2521955966949463, norm_rel=0.022245170548558235, ref_abs_avg=19.13631820678711, test_abs_avg=19.163782119750977
production_forward grad[72] vs paper_forward: mean_abs=0.5117916464805603, max_abs=3.5, mean_rel=0.14312955737113953, max_rel=1147.7982177734375, norm_rel=0.022358421236276627, ref_abs_avg=22.879478454589844, test_abs_avg=22.879863739013672
production_forward grad[73] vs paper_forward: mean_abs=0.5021141767501831, max_abs=4.0, mean_rel=0.16379061341285706, max_rel=1544.39013671875, norm_rel=0.02221146784722805, ref_abs_avg=22.585609436035156, test_abs_avg=22.58970069885254
production_forward grad[74] vs paper_forward: mean_abs=0.4603266716003418, max_abs=1.625, mean_rel=0.1298535168170929, max_rel=17.13467025756836, norm_rel=0.023050177842378616, ref_abs_avg=20.144926071166992, test_abs_avg=20.191608428955078
production_forward grad[75] vs paper_forward: mean_abs=0.5766505002975464, max_abs=4.25, mean_rel=0.16216148436069489, max_rel=1124.9766845703125, norm_rel=0.024279793724417686, ref_abs_avg=23.786865234375, test_abs_avg=23.788593292236328
production_forward grad[76] vs paper_forward: mean_abs=0.5647298097610474, max_abs=5.0, mean_rel=0.16023793816566467, max_rel=581.2081909179688, norm_rel=0.02346702478826046, ref_abs_avg=24.014469146728516, test_abs_avg=24.015716552734375
production_forward grad[77] vs paper_forward: mean_abs=0.4412112236022949, max_abs=1.75, mean_rel=0.08739335834980011, max_rel=10.93916130065918, norm_rel=0.02387966774404049, ref_abs_avg=18.65048599243164, test_abs_avg=18.623865127563477
production_forward grad[78] vs paper_forward: mean_abs=0.5289715528488159, max_abs=5.0, mean_rel=0.15122932195663452, max_rel=1109.8311767578125, norm_rel=0.023356791585683823, ref_abs_avg=22.616313934326172, test_abs_avg=22.61635398864746
production_forward grad[79] vs paper_forward: mean_abs=0.5079725384712219, max_abs=5.0, mean_rel=0.1482885777950287, max_rel=590.3880004882812, norm_rel=0.02356226183474064, ref_abs_avg=21.580432891845703, test_abs_avg=21.579784393310547
production_forward grad[80] vs paper_forward: mean_abs=0.376270055770874, max_abs=1.625, mean_rel=0.08386671543121338, max_rel=2.8426265716552734, norm_rel=0.020595597103238106, ref_abs_avg=18.217004776000977, test_abs_avg=18.217666625976562
production_forward grad[81] vs paper_forward: mean_abs=0.4821992516517639, max_abs=4.5, mean_rel=0.13976794481277466, max_rel=886.6112060546875, norm_rel=0.02269829623401165, ref_abs_avg=21.244964599609375, test_abs_avg=21.244281768798828
production_forward grad[82] vs paper_forward: mean_abs=0.46980059146881104, max_abs=6.0, mean_rel=0.14416643977165222, max_rel=837.1168212890625, norm_rel=0.022514158859848976, ref_abs_avg=20.96836280822754, test_abs_avg=20.966293334960938
production_forward grad[83] vs paper_forward: mean_abs=0.3934168517589569, max_abs=1.625, mean_rel=0.11401237547397614, max_rel=14.93607234954834, norm_rel=0.022071028128266335, ref_abs_avg=18.100650787353516, test_abs_avg=18.13239097595215
production_forward grad[84] vs paper_forward: mean_abs=0.45918041467666626, max_abs=4.0, mean_rel=0.14576175808906555, max_rel=917.0162963867188, norm_rel=0.022236505523324013, ref_abs_avg=20.655357360839844, test_abs_avg=20.655319213867188
production_forward grad[85] vs paper_forward: mean_abs=0.4354056119918823, max_abs=5.0, mean_rel=0.12949322164058685, max_rel=464.47088623046875, norm_rel=0.021493427455425262, ref_abs_avg=20.245498657226562, test_abs_avg=20.249483108520508
production_forward grad[86] vs paper_forward: mean_abs=0.3274350166320801, max_abs=1.5, mean_rel=0.07368452847003937, max_rel=2.8871638774871826, norm_rel=0.020212439820170403, ref_abs_avg=16.49778938293457, test_abs_avg=16.509031295776367
production_forward grad[87] vs paper_forward: mean_abs=0.41976115107536316, max_abs=3.75, mean_rel=0.13810399174690247, max_rel=658.662109375, norm_rel=0.021758638322353363, ref_abs_avg=19.372629165649414, test_abs_avg=19.371654510498047
production_forward grad[88] vs paper_forward: mean_abs=0.4166228473186493, max_abs=4.5, mean_rel=0.13945356011390686, max_rel=798.9203491210938, norm_rel=0.02174421213567257, ref_abs_avg=19.266395568847656, test_abs_avg=19.267292022705078
production_forward grad[89] vs paper_forward: mean_abs=0.3258323669433594, max_abs=1.15625, mean_rel=0.07530684769153595, max_rel=5.765129089355469, norm_rel=0.02080763317644596, ref_abs_avg=15.555839538574219, test_abs_avg=15.56987190246582
production_forward grad[90] vs paper_forward: mean_abs=0.3976089656352997, max_abs=5.125, mean_rel=0.13023145496845245, max_rel=477.4818115234375, norm_rel=0.021299831569194794, ref_abs_avg=18.77954864501953, test_abs_avg=18.77811050415039
production_forward grad[91] vs paper_forward: mean_abs=0.3933745324611664, max_abs=4.625, mean_rel=0.13256597518920898, max_rel=939.0464477539062, norm_rel=0.02092224545776844, ref_abs_avg=18.92862319946289, test_abs_avg=18.924142837524414
production_forward grad[92] vs paper_forward: mean_abs=0.29543614387512207, max_abs=1.1875, mean_rel=0.06842759996652603, max_rel=1.8173247575759888, norm_rel=0.018862172961235046, ref_abs_avg=15.92001724243164, test_abs_avg=15.917838096618652
production_forward grad[93] vs paper_forward: mean_abs=0.3676437437534332, max_abs=4.5, mean_rel=0.1353803426027298, max_rel=734.463623046875, norm_rel=0.020628925412893295, ref_abs_avg=18.005046844482422, test_abs_avg=18.00568389892578
production_forward grad[94] vs paper_forward: mean_abs=0.3596079647541046, max_abs=4.5625, mean_rel=0.1257164478302002, max_rel=383.5495910644531, norm_rel=0.02007835917174816, ref_abs_avg=18.07044219970703, test_abs_avg=18.06277084350586
production_forward grad[95] vs paper_forward: mean_abs=0.28116273880004883, max_abs=1.0, mean_rel=0.07603488862514496, max_rel=11.263258934020996, norm_rel=0.019092245027422905, ref_abs_avg=14.899686813354492, test_abs_avg=14.892122268676758
production_forward grad[96] vs paper_forward: mean_abs=0.35446488857269287, max_abs=3.5, mean_rel=0.12706410884857178, max_rel=693.93212890625, norm_rel=0.020156463608145714, ref_abs_avg=17.890613555908203, test_abs_avg=17.891021728515625
production_forward grad[97] vs paper_forward: mean_abs=0.33651304244995117, max_abs=5.5, mean_rel=0.11087584495544434, max_rel=336.51153564453125, norm_rel=0.019415512681007385, ref_abs_avg=17.645587921142578, test_abs_avg=17.634374618530273
production_forward2 vs paper_forward output: mean_abs=0.0016365128103643656, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008649139665067196, max_abs=0.5, mean_rel=0.07431268692016602, max_rel=129.69593811035156, norm_rel=0.020349198952317238, ref_abs_avg=0.45930689573287964, test_abs_avg=0.45931535959243774
production_forward2 grad[1] vs paper_forward: mean_abs=7.3276214599609375, max_abs=72.0, mean_rel=0.360895574092865, max_rel=1913.3739013671875, norm_rel=0.02045941911637783, ref_abs_avg=318.4886474609375, test_abs_avg=318.4442443847656
production_forward2 grad[2] vs paper_forward: mean_abs=1.296884536743164, max_abs=5.0, mean_rel=0.22590509057044983, max_rel=48.046627044677734, norm_rel=0.023810766637325287, ref_abs_avg=55.638912200927734, test_abs_avg=55.49795150756836
production_forward2 grad[3] vs paper_forward: mean_abs=1.5848839282989502, max_abs=10.5, mean_rel=0.1675734668970108, max_rel=2737.85888671875, norm_rel=0.024209078401327133, ref_abs_avg=65.84440612792969, test_abs_avg=65.84698486328125
production_forward2 grad[4] vs paper_forward: mean_abs=1.545162558555603, max_abs=10.25, mean_rel=0.16299819946289062, max_rel=1190.4146728515625, norm_rel=0.023929212242364883, ref_abs_avg=64.99253845214844, test_abs_avg=64.99417114257812
production_forward2 grad[5] vs paper_forward: mean_abs=1.1116962432861328, max_abs=5.125, mean_rel=0.1597716212272644, max_rel=16.824203491210938, norm_rel=0.023084523156285286, ref_abs_avg=47.214988708496094, test_abs_avg=47.183837890625
production_forward2 grad[6] vs paper_forward: mean_abs=1.3951023817062378, max_abs=8.5, mean_rel=0.16719579696655273, max_rel=1941.9290771484375, norm_rel=0.023838620632886887, ref_abs_avg=58.82403564453125, test_abs_avg=58.82997131347656
production_forward2 grad[7] vs paper_forward: mean_abs=1.3544180393218994, max_abs=8.5, mean_rel=0.14881351590156555, max_rel=1240.2962646484375, norm_rel=0.023630887269973755, ref_abs_avg=57.65807342529297, test_abs_avg=57.65503692626953
production_forward2 grad[8] vs paper_forward: mean_abs=1.003720760345459, max_abs=3.6103515625, mean_rel=0.1883593499660492, max_rel=24.35796356201172, norm_rel=0.02385055460035801, ref_abs_avg=41.21825408935547, test_abs_avg=41.20866394042969
production_forward2 grad[9] vs paper_forward: mean_abs=1.2609527111053467, max_abs=9.0, mean_rel=0.16993466019630432, max_rel=3604.67041015625, norm_rel=0.02364303544163704, ref_abs_avg=53.58391571044922, test_abs_avg=53.589019775390625
production_forward2 grad[10] vs paper_forward: mean_abs=1.2305243015289307, max_abs=7.0, mean_rel=0.1662188321352005, max_rel=1341.2401123046875, norm_rel=0.02344372309744358, ref_abs_avg=52.694129943847656, test_abs_avg=52.69368362426758
production_forward2 grad[11] vs paper_forward: mean_abs=0.9491081237792969, max_abs=3.75, mean_rel=0.07235444337129593, max_rel=3.157721519470215, norm_rel=0.023344557732343674, ref_abs_avg=41.41888427734375, test_abs_avg=41.49247360229492
production_forward2 grad[12] vs paper_forward: mean_abs=1.1641180515289307, max_abs=8.0, mean_rel=0.14435900747776031, max_rel=848.3018798828125, norm_rel=0.02340514399111271, ref_abs_avg=50.004371643066406, test_abs_avg=50.01091766357422
production_forward2 grad[13] vs paper_forward: mean_abs=1.1361401081085205, max_abs=7.0, mean_rel=0.16145595908164978, max_rel=1792.03271484375, norm_rel=0.023266900330781937, ref_abs_avg=49.059173583984375, test_abs_avg=49.055328369140625
production_forward2 grad[14] vs paper_forward: mean_abs=0.8676671981811523, max_abs=3.5, mean_rel=0.0744514912366867, max_rel=3.0046162605285645, norm_rel=0.02248632349073887, ref_abs_avg=39.35063934326172, test_abs_avg=39.32392883300781
production_forward2 grad[15] vs paper_forward: mean_abs=1.0827406644821167, max_abs=6.5, mean_rel=0.16159117221832275, max_rel=1479.770751953125, norm_rel=0.0232620257884264, ref_abs_avg=46.75851821899414, test_abs_avg=46.75977325439453
production_forward2 grad[16] vs paper_forward: mean_abs=1.0591005086898804, max_abs=6.25, mean_rel=0.15200847387313843, max_rel=600.6414794921875, norm_rel=0.023005830124020576, ref_abs_avg=46.26638412475586, test_abs_avg=46.26945495605469
production_forward2 grad[17] vs paper_forward: mean_abs=0.7832803726196289, max_abs=2.875, mean_rel=0.0651785135269165, max_rel=1.5013294219970703, norm_rel=0.02170397713780403, ref_abs_avg=36.20037841796875, test_abs_avg=36.24009704589844
production_forward2 grad[18] vs paper_forward: mean_abs=1.018786907196045, max_abs=6.5, mean_rel=0.15933701395988464, max_rel=1029.79248046875, norm_rel=0.02316317893564701, ref_abs_avg=44.19859313964844, test_abs_avg=44.203121185302734
production_forward2 grad[19] vs paper_forward: mean_abs=0.9947777390480042, max_abs=6.375, mean_rel=0.14833895862102509, max_rel=946.983154296875, norm_rel=0.022932201623916626, ref_abs_avg=43.57270050048828, test_abs_avg=43.573978424072266
production_forward2 grad[20] vs paper_forward: mean_abs=0.7873268127441406, max_abs=3.25, mean_rel=0.05665519833564758, max_rel=1.9202425479888916, norm_rel=0.021193893626332283, ref_abs_avg=36.999610900878906, test_abs_avg=37.151676177978516
production_forward2 grad[21] vs paper_forward: mean_abs=0.9677062034606934, max_abs=5.75, mean_rel=0.15582598745822906, max_rel=738.177978515625, norm_rel=0.022975392639636993, ref_abs_avg=42.31806945800781, test_abs_avg=42.31849670410156
production_forward2 grad[22] vs paper_forward: mean_abs=0.9413378238677979, max_abs=6.0, mean_rel=0.1544073224067688, max_rel=1020.32275390625, norm_rel=0.022788692265748978, ref_abs_avg=41.50236511230469, test_abs_avg=41.50801086425781
production_forward2 grad[23] vs paper_forward: mean_abs=0.7567787170410156, max_abs=2.875, mean_rel=0.06575550138950348, max_rel=3.1902856826782227, norm_rel=0.022194556891918182, ref_abs_avg=34.72947692871094, test_abs_avg=34.66459655761719
production_forward2 grad[24] vs paper_forward: mean_abs=0.9208624362945557, max_abs=6.0, mean_rel=0.15419930219650269, max_rel=838.607421875, norm_rel=0.022838594391942024, ref_abs_avg=40.47168731689453, test_abs_avg=40.47216796875
production_forward2 grad[25] vs paper_forward: mean_abs=0.8968448638916016, max_abs=6.0, mean_rel=0.15341278910636902, max_rel=896.0987548828125, norm_rel=0.022606253623962402, ref_abs_avg=39.87950897216797, test_abs_avg=39.881690979003906
production_forward2 grad[26] vs paper_forward: mean_abs=0.8132543563842773, max_abs=3.0, mean_rel=0.11690452694892883, max_rel=11.32270336151123, norm_rel=0.022857554256916046, ref_abs_avg=35.634674072265625, test_abs_avg=35.612091064453125
production_forward2 grad[27] vs paper_forward: mean_abs=1.0240042209625244, max_abs=6.75, mean_rel=0.15941175818443298, max_rel=878.23583984375, norm_rel=0.024785233661532402, ref_abs_avg=41.486846923828125, test_abs_avg=41.48636245727539
production_forward2 grad[28] vs paper_forward: mean_abs=0.9974775314331055, max_abs=6.5, mean_rel=0.15909117460250854, max_rel=947.0679931640625, norm_rel=0.024292927235364914, ref_abs_avg=41.2172737121582, test_abs_avg=41.212738037109375
production_forward2 grad[29] vs paper_forward: mean_abs=0.7982528209686279, max_abs=3.0, mean_rel=0.19498983025550842, max_rel=50.69211196899414, norm_rel=0.025902176275849342, ref_abs_avg=30.342864990234375, test_abs_avg=30.272438049316406
production_forward2 grad[30] vs paper_forward: mean_abs=0.9707046747207642, max_abs=6.0, mean_rel=0.16921523213386536, max_rel=2012.405029296875, norm_rel=0.02512282133102417, ref_abs_avg=38.7740592956543, test_abs_avg=38.772613525390625
production_forward2 grad[31] vs paper_forward: mean_abs=0.9491995573043823, max_abs=6.072265625, mean_rel=0.163264662027359, max_rel=859.4531860351562, norm_rel=0.024676566943526268, ref_abs_avg=38.60586166381836, test_abs_avg=38.60749816894531
production_forward2 grad[32] vs paper_forward: mean_abs=0.7427756786346436, max_abs=3.0, mean_rel=0.4120386242866516, max_rel=136.59368896484375, norm_rel=0.024216266348958015, ref_abs_avg=30.78614044189453, test_abs_avg=30.77564811706543
production_forward2 grad[33] vs paper_forward: mean_abs=0.9111765623092651, max_abs=7.0, mean_rel=0.17569857835769653, max_rel=1544.5902099609375, norm_rel=0.025070037692785263, ref_abs_avg=36.456207275390625, test_abs_avg=36.45625305175781
production_forward2 grad[34] vs paper_forward: mean_abs=0.8946990370750427, max_abs=5.5, mean_rel=0.16385945677757263, max_rel=1150.18896484375, norm_rel=0.02493044175207615, ref_abs_avg=36.030784606933594, test_abs_avg=36.029212951660156
production_forward2 grad[35] vs paper_forward: mean_abs=0.7103891372680664, max_abs=2.5, mean_rel=0.14572444558143616, max_rel=16.908477783203125, norm_rel=0.023576289415359497, ref_abs_avg=30.22367286682129, test_abs_avg=30.223657608032227
production_forward2 grad[36] vs paper_forward: mean_abs=0.8564132452011108, max_abs=5.9375, mean_rel=0.16746479272842407, max_rel=1381.9022216796875, norm_rel=0.02473125047981739, ref_abs_avg=34.70581817626953, test_abs_avg=34.70604705810547
production_forward2 grad[37] vs paper_forward: mean_abs=0.8445319533348083, max_abs=5.25, mean_rel=0.153732031583786, max_rel=1374.0045166015625, norm_rel=0.02465112693607807, ref_abs_avg=34.34699249267578, test_abs_avg=34.34764862060547
production_forward2 grad[38] vs paper_forward: mean_abs=0.6881859302520752, max_abs=2.625, mean_rel=0.15784107148647308, max_rel=19.639923095703125, norm_rel=0.025532232597470284, ref_abs_avg=27.552608489990234, test_abs_avg=27.547691345214844
production_forward2 grad[39] vs paper_forward: mean_abs=0.8056604862213135, max_abs=5.125, mean_rel=0.18086138367652893, max_rel=1396.9764404296875, norm_rel=0.02461947128176689, ref_abs_avg=32.8138313293457, test_abs_avg=32.81458282470703
production_forward2 grad[40] vs paper_forward: mean_abs=0.7950272560119629, max_abs=6.5, mean_rel=0.16203320026397705, max_rel=726.7323608398438, norm_rel=0.024487460032105446, ref_abs_avg=32.584373474121094, test_abs_avg=32.58782958984375
production_forward2 grad[41] vs paper_forward: mean_abs=0.6293730735778809, max_abs=2.25, mean_rel=0.07911628484725952, max_rel=3.6709916591644287, norm_rel=0.02438298426568508, ref_abs_avg=26.08892059326172, test_abs_avg=26.063013076782227
production_forward2 grad[42] vs paper_forward: mean_abs=0.7717956304550171, max_abs=5.0, mean_rel=0.15248923003673553, max_rel=2021.1475830078125, norm_rel=0.02430851198732853, ref_abs_avg=31.825828552246094, test_abs_avg=31.826871871948242
production_forward2 grad[43] vs paper_forward: mean_abs=0.7546249032020569, max_abs=5.25, mean_rel=0.16794493794441223, max_rel=1198.46923828125, norm_rel=0.02419048547744751, ref_abs_avg=31.294788360595703, test_abs_avg=31.29180908203125
production_forward2 grad[44] vs paper_forward: mean_abs=0.5820573568344116, max_abs=2.25, mean_rel=2.397627592086792, max_rel=1177.058837890625, norm_rel=0.023516174405813217, ref_abs_avg=25.60738754272461, test_abs_avg=25.63929557800293
production_forward2 grad[45] vs paper_forward: mean_abs=0.7319144606590271, max_abs=5.0, mean_rel=0.16615304350852966, max_rel=1109.0478515625, norm_rel=0.02397831901907921, ref_abs_avg=30.550065994262695, test_abs_avg=30.54949951171875
production_forward2 grad[46] vs paper_forward: mean_abs=0.720294713973999, max_abs=5.0, mean_rel=0.15752612054347992, max_rel=1161.91650390625, norm_rel=0.023939166218042374, ref_abs_avg=30.172595977783203, test_abs_avg=30.178218841552734
production_forward2 grad[47] vs paper_forward: mean_abs=0.5466266870498657, max_abs=2.375, mean_rel=0.09506578743457794, max_rel=5.644338607788086, norm_rel=0.023630185052752495, ref_abs_avg=23.64335060119629, test_abs_avg=23.57010269165039
production_forward2 grad[48] vs paper_forward: mean_abs=0.6983293294906616, max_abs=4.75, mean_rel=0.15133561193943024, max_rel=898.7289428710938, norm_rel=0.023621195927262306, ref_abs_avg=29.545948028564453, test_abs_avg=29.546646118164062
production_forward2 grad[49] vs paper_forward: mean_abs=0.6835477352142334, max_abs=4.5, mean_rel=0.15551340579986572, max_rel=631.179931640625, norm_rel=0.02394811064004898, ref_abs_avg=28.61752700805664, test_abs_avg=28.61941146850586
production_forward2 grad[50] vs paper_forward: mean_abs=0.6253633499145508, max_abs=3.0, mean_rel=0.16101974248886108, max_rel=35.824005126953125, norm_rel=0.024455387145280838, ref_abs_avg=25.861202239990234, test_abs_avg=25.83147621154785
production_forward2 grad[51] vs paper_forward: mean_abs=0.7622671127319336, max_abs=6.0, mean_rel=0.16056139767169952, max_rel=1481.4586181640625, norm_rel=0.024854226037859917, ref_abs_avg=30.71734619140625, test_abs_avg=30.717052459716797
production_forward2 grad[52] vs paper_forward: mean_abs=0.7489731311798096, max_abs=5.0, mean_rel=0.16807156801223755, max_rel=827.2188720703125, norm_rel=0.0247027724981308, ref_abs_avg=30.43000602722168, test_abs_avg=30.428247451782227
production_forward2 grad[53] vs paper_forward: mean_abs=0.5805317163467407, max_abs=2.125, mean_rel=0.07168254256248474, max_rel=2.2284154891967773, norm_rel=0.02392404153943062, ref_abs_avg=24.64533233642578, test_abs_avg=24.615684509277344
production_forward2 grad[54] vs paper_forward: mean_abs=0.7063490748405457, max_abs=4.5, mean_rel=0.17058904469013214, max_rel=882.3023681640625, norm_rel=0.024640001356601715, ref_abs_avg=28.703580856323242, test_abs_avg=28.704347610473633
production_forward2 grad[55] vs paper_forward: mean_abs=0.6985267996788025, max_abs=5.0, mean_rel=0.15942344069480896, max_rel=843.3934936523438, norm_rel=0.024471459910273552, ref_abs_avg=28.618202209472656, test_abs_avg=28.623390197753906
production_forward2 grad[56] vs paper_forward: mean_abs=0.5548629760742188, max_abs=2.75, mean_rel=0.10214164853096008, max_rel=5.08633279800415, norm_rel=0.02357541024684906, ref_abs_avg=23.819778442382812, test_abs_avg=23.7967529296875
production_forward2 grad[57] vs paper_forward: mean_abs=0.6739095449447632, max_abs=4.5, mean_rel=0.16528481245040894, max_rel=1082.3624267578125, norm_rel=0.024440938606858253, ref_abs_avg=27.631607055664062, test_abs_avg=27.631000518798828
production_forward2 grad[58] vs paper_forward: mean_abs=0.6617718935012817, max_abs=4.8203125, mean_rel=0.15219679474830627, max_rel=1609.2559814453125, norm_rel=0.024174584075808525, ref_abs_avg=27.40127944946289, test_abs_avg=27.399444580078125
production_forward2 grad[59] vs paper_forward: mean_abs=0.5371362566947937, max_abs=2.5, mean_rel=0.22268256545066833, max_rel=38.40882873535156, norm_rel=0.024315668269991875, ref_abs_avg=21.46916961669922, test_abs_avg=21.475555419921875
production_forward2 grad[60] vs paper_forward: mean_abs=0.6290737986564636, max_abs=4.5, mean_rel=0.1568298190832138, max_rel=1166.9837646484375, norm_rel=0.023945270106196404, ref_abs_avg=26.262882232666016, test_abs_avg=26.26313018798828
production_forward2 grad[61] vs paper_forward: mean_abs=0.6181807518005371, max_abs=5.375, mean_rel=0.14543169736862183, max_rel=635.03857421875, norm_rel=0.023596977815032005, ref_abs_avg=26.200870513916016, test_abs_avg=26.20589828491211
production_forward2 grad[62] vs paper_forward: mean_abs=0.47695446014404297, max_abs=1.9375, mean_rel=0.10599537193775177, max_rel=7.047980785369873, norm_rel=0.023300405591726303, ref_abs_avg=20.012805938720703, test_abs_avg=20.030868530273438
production_forward2 grad[63] vs paper_forward: mean_abs=0.5980592370033264, max_abs=4.25, mean_rel=0.15381956100463867, max_rel=1344.4783935546875, norm_rel=0.02325161173939705, ref_abs_avg=25.672008514404297, test_abs_avg=25.674997329711914
production_forward2 grad[64] vs paper_forward: mean_abs=0.5839277505874634, max_abs=4.5, mean_rel=0.14853698015213013, max_rel=444.6294250488281, norm_rel=0.023364009335637093, ref_abs_avg=25.047161102294922, test_abs_avg=25.05199432373047
production_forward2 grad[65] vs paper_forward: mean_abs=0.4392913579940796, max_abs=1.77734375, mean_rel=0.1343577355146408, max_rel=11.618692398071289, norm_rel=0.02192629873752594, ref_abs_avg=20.180583953857422, test_abs_avg=20.167808532714844
production_forward2 grad[66] vs paper_forward: mean_abs=0.5667614936828613, max_abs=4.5, mean_rel=0.15686604380607605, max_rel=1046.1602783203125, norm_rel=0.022991815581917763, ref_abs_avg=24.629276275634766, test_abs_avg=24.629701614379883
production_forward2 grad[67] vs paper_forward: mean_abs=0.5541797876358032, max_abs=5.0, mean_rel=0.14831236004829407, max_rel=668.9738159179688, norm_rel=0.02278592251241207, ref_abs_avg=24.325992584228516, test_abs_avg=24.32219886779785
production_forward2 grad[68] vs paper_forward: mean_abs=0.42784953117370605, max_abs=1.8125, mean_rel=0.24450761079788208, max_rel=24.291120529174805, norm_rel=0.02142193354666233, ref_abs_avg=20.281299591064453, test_abs_avg=20.27739143371582
production_forward2 grad[69] vs paper_forward: mean_abs=0.5408951044082642, max_abs=4.5, mean_rel=0.14488470554351807, max_rel=1005.4139404296875, norm_rel=0.022449437528848648, ref_abs_avg=24.04323959350586, test_abs_avg=24.042469024658203
production_forward2 grad[70] vs paper_forward: mean_abs=0.5280194282531738, max_abs=5.0, mean_rel=0.14852213859558105, max_rel=1075.548583984375, norm_rel=0.02268792875111103, ref_abs_avg=23.33050537109375, test_abs_avg=23.328350067138672
production_forward2 grad[71] vs paper_forward: mean_abs=0.4365115165710449, max_abs=1.75, mean_rel=0.09135817736387253, max_rel=2.1630465984344482, norm_rel=0.022817330434918404, ref_abs_avg=19.13631820678711, test_abs_avg=19.1632080078125
production_forward2 grad[72] vs paper_forward: mean_abs=0.5132428407669067, max_abs=3.75, mean_rel=0.14289432764053345, max_rel=1103.0770263671875, norm_rel=0.022424524649977684, ref_abs_avg=22.879478454589844, test_abs_avg=22.880353927612305
production_forward2 grad[73] vs paper_forward: mean_abs=0.5034840106964111, max_abs=4.0, mean_rel=0.16429859399795532, max_rel=1739.039794921875, norm_rel=0.022263703867793083, ref_abs_avg=22.585609436035156, test_abs_avg=22.589073181152344
production_forward2 grad[74] vs paper_forward: mean_abs=0.4596235752105713, max_abs=1.75, mean_rel=0.1377040147781372, max_rel=20.795204162597656, norm_rel=0.023035550490021706, ref_abs_avg=20.144926071166992, test_abs_avg=20.181089401245117
production_forward2 grad[75] vs paper_forward: mean_abs=0.5781445503234863, max_abs=4.75, mean_rel=0.16312474012374878, max_rel=961.8745727539062, norm_rel=0.024335909634828568, ref_abs_avg=23.786865234375, test_abs_avg=23.787559509277344
production_forward2 grad[76] vs paper_forward: mean_abs=0.5657855868339539, max_abs=4.125, mean_rel=0.16323837637901306, max_rel=644.4830322265625, norm_rel=0.023499956354498863, ref_abs_avg=24.014469146728516, test_abs_avg=24.015254974365234
production_forward2 grad[77] vs paper_forward: mean_abs=0.44106197357177734, max_abs=1.5, mean_rel=0.09114906936883926, max_rel=11.586380004882812, norm_rel=0.02403097413480282, ref_abs_avg=18.65048599243164, test_abs_avg=18.629545211791992
production_forward2 grad[78] vs paper_forward: mean_abs=0.5299547910690308, max_abs=5.0, mean_rel=0.1500946581363678, max_rel=862.2008056640625, norm_rel=0.023393813520669937, ref_abs_avg=22.616313934326172, test_abs_avg=22.616413116455078
production_forward2 grad[79] vs paper_forward: mean_abs=0.508898913860321, max_abs=4.875, mean_rel=0.1470308005809784, max_rel=794.3692626953125, norm_rel=0.023593632504343987, ref_abs_avg=21.580432891845703, test_abs_avg=21.579208374023438
production_forward2 grad[80] vs paper_forward: mean_abs=0.38179194927215576, max_abs=1.5, mean_rel=0.09619536995887756, max_rel=4.386184215545654, norm_rel=0.020847272127866745, ref_abs_avg=18.217004776000977, test_abs_avg=18.227012634277344
production_forward2 grad[81] vs paper_forward: mean_abs=0.4834379255771637, max_abs=4.5, mean_rel=0.14042696356773376, max_rel=985.1709594726562, norm_rel=0.022739147767424583, ref_abs_avg=21.244964599609375, test_abs_avg=21.244125366210938
production_forward2 grad[82] vs paper_forward: mean_abs=0.4708186984062195, max_abs=6.0, mean_rel=0.14420577883720398, max_rel=779.2256469726562, norm_rel=0.02256060019135475, ref_abs_avg=20.96836280822754, test_abs_avg=20.965957641601562
production_forward2 grad[83] vs paper_forward: mean_abs=0.3955569267272949, max_abs=1.625, mean_rel=0.10443627834320068, max_rel=9.154366493225098, norm_rel=0.02214834652841091, ref_abs_avg=18.100650787353516, test_abs_avg=18.136287689208984
production_forward2 grad[84] vs paper_forward: mean_abs=0.45975810289382935, max_abs=4.0, mean_rel=0.1445768177509308, max_rel=855.1506958007812, norm_rel=0.022263363003730774, ref_abs_avg=20.655357360839844, test_abs_avg=20.655593872070312
production_forward2 grad[85] vs paper_forward: mean_abs=0.436434268951416, max_abs=5.0, mean_rel=0.13029730319976807, max_rel=430.7215270996094, norm_rel=0.021533695980906487, ref_abs_avg=20.245498657226562, test_abs_avg=20.249420166015625
production_forward2 grad[86] vs paper_forward: mean_abs=0.3305835723876953, max_abs=1.5, mean_rel=0.07389247417449951, max_rel=3.194812536239624, norm_rel=0.02030046470463276, ref_abs_avg=16.49778938293457, test_abs_avg=16.50808334350586
production_forward2 grad[87] vs paper_forward: mean_abs=0.41985630989074707, max_abs=3.75, mean_rel=0.1392548382282257, max_rel=692.3721313476562, norm_rel=0.021756500005722046, ref_abs_avg=19.372629165649414, test_abs_avg=19.371793746948242
production_forward2 grad[88] vs paper_forward: mean_abs=0.4169960618019104, max_abs=5.0, mean_rel=0.1395747810602188, max_rel=802.5377807617188, norm_rel=0.021761775016784668, ref_abs_avg=19.266395568847656, test_abs_avg=19.266603469848633
production_forward2 grad[89] vs paper_forward: mean_abs=0.3214731216430664, max_abs=1.25, mean_rel=0.07721023261547089, max_rel=6.875944137573242, norm_rel=0.020806703716516495, ref_abs_avg=15.555839538574219, test_abs_avg=15.567767143249512
production_forward2 grad[90] vs paper_forward: mean_abs=0.3979703187942505, max_abs=5.0, mean_rel=0.13044863939285278, max_rel=477.4818115234375, norm_rel=0.02131534367799759, ref_abs_avg=18.77954864501953, test_abs_avg=18.777957916259766
production_forward2 grad[91] vs paper_forward: mean_abs=0.3934304416179657, max_abs=4.375, mean_rel=0.13267338275909424, max_rel=925.2386474609375, norm_rel=0.020938513800501823, ref_abs_avg=18.92862319946289, test_abs_avg=18.923770904541016
production_forward2 grad[92] vs paper_forward: mean_abs=0.29431724548339844, max_abs=1.1875, mean_rel=0.06718580424785614, max_rel=1.7564938068389893, norm_rel=0.018930623307824135, ref_abs_avg=15.92001724243164, test_abs_avg=15.916105270385742
production_forward2 grad[93] vs paper_forward: mean_abs=0.36785370111465454, max_abs=5.0, mean_rel=0.1359367072582245, max_rel=714.8330078125, norm_rel=0.020635517314076424, ref_abs_avg=18.005046844482422, test_abs_avg=18.005756378173828
production_forward2 grad[94] vs paper_forward: mean_abs=0.3598744869232178, max_abs=4.5, mean_rel=0.12499538064002991, max_rel=389.9347229003906, norm_rel=0.020087042823433876, ref_abs_avg=18.07044219970703, test_abs_avg=18.0626220703125
production_forward2 grad[95] vs paper_forward: mean_abs=0.2820568084716797, max_abs=1.0, mean_rel=0.0767333135008812, max_rel=11.263258934020996, norm_rel=0.019116928800940514, ref_abs_avg=14.899686813354492, test_abs_avg=14.891429901123047
production_forward2 grad[96] vs paper_forward: mean_abs=0.3544340133666992, max_abs=3.5, mean_rel=0.12704291939735413, max_rel=690.8351440429688, norm_rel=0.020153820514678955, ref_abs_avg=17.890613555908203, test_abs_avg=17.891094207763672
production_forward2 grad[97] vs paper_forward: mean_abs=0.3365124464035034, max_abs=5.5, mean_rel=0.11084876954555511, max_rel=336.51153564453125, norm_rel=0.019414911046624184, ref_abs_avg=17.645587921142578, test_abs_avg=17.63439178466797
identity layers + randn queries
production_forward fwd+bwd:  109.539 ms
production_forward bwd-only: 89.181 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=6.993 GiB
production_forward peak reserved:  fwd=3.615 GiB, fwd+bwd=8.115 GiB
production_forward2 fwd+bwd:  224.436 ms
production_forward2 bwd-only: 202.277 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.240 GiB, fwd+bwd=8.990 GiB
paper_forward fwd+bwd:  379.782 ms
paper_forward bwd-only: 294.193 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016604398842900991, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008655373007059097, max_abs=0.375, mean_rel=0.07419001311063766, max_rel=120.12228393554688, norm_rel=0.02029390260577202, ref_abs_avg=0.4615171253681183, test_abs_avg=0.4615219831466675
production_forward grad[1] vs paper_forward: mean_abs=7.555171489715576, max_abs=56.0, mean_rel=0.17143331468105316, max_rel=649.3106079101562, norm_rel=0.020889325067400932, ref_abs_avg=320.78814697265625, test_abs_avg=320.71563720703125
production_forward grad[2] vs paper_forward: mean_abs=1.3337440490722656, max_abs=5.0, mean_rel=0.08453452587127686, max_rel=5.217286109924316, norm_rel=0.025551222264766693, ref_abs_avg=51.854393005371094, test_abs_avg=51.80387496948242
production_forward grad[3] vs paper_forward: mean_abs=1.5913078784942627, max_abs=10.0, mean_rel=0.17044904828071594, max_rel=1785.32666015625, norm_rel=0.024122126400470734, ref_abs_avg=66.30470275878906, test_abs_avg=66.30607604980469
production_forward grad[4] vs paper_forward: mean_abs=1.5475722551345825, max_abs=10.0, mean_rel=0.16670802235603333, max_rel=845.3486328125, norm_rel=0.02381405234336853, ref_abs_avg=65.31829833984375, test_abs_avg=65.33592224121094
production_forward grad[5] vs paper_forward: mean_abs=1.1891560554504395, max_abs=4.3671875, mean_rel=0.1430671662092209, max_rel=16.732707977294922, norm_rel=0.024609053507447243, ref_abs_avg=47.773597717285156, test_abs_avg=47.755027770996094
production_forward grad[6] vs paper_forward: mean_abs=1.406358242034912, max_abs=9.25, mean_rel=0.1727713793516159, max_rel=1392.6063232421875, norm_rel=0.023731568828225136, ref_abs_avg=59.550899505615234, test_abs_avg=59.54702377319336
production_forward grad[7] vs paper_forward: mean_abs=1.3657150268554688, max_abs=9.09375, mean_rel=0.1518065333366394, max_rel=2170.659423828125, norm_rel=0.023545773699879646, ref_abs_avg=58.301673889160156, test_abs_avg=58.29902648925781
production_forward grad[8] vs paper_forward: mean_abs=0.9921150207519531, max_abs=4.125, mean_rel=0.09170831739902496, max_rel=3.9643282890319824, norm_rel=0.022324858233332634, ref_abs_avg=44.51505661010742, test_abs_avg=44.57447814941406
production_forward grad[9] vs paper_forward: mean_abs=1.2699427604675293, max_abs=8.0, mean_rel=0.16907590627670288, max_rel=3000.59814453125, norm_rel=0.023570863530039787, ref_abs_avg=54.1474494934082, test_abs_avg=54.14990234375
production_forward grad[10] vs paper_forward: mean_abs=1.2431085109710693, max_abs=8.0, mean_rel=0.15843507647514343, max_rel=1680.0079345703125, norm_rel=0.0233832448720932, ref_abs_avg=53.377418518066406, test_abs_avg=53.37738037109375
production_forward grad[11] vs paper_forward: mean_abs=0.9255108833312988, max_abs=4.0, mean_rel=0.15175339579582214, max_rel=35.33940505981445, norm_rel=0.02264680154621601, ref_abs_avg=40.48884582519531, test_abs_avg=40.64854049682617
production_forward grad[12] vs paper_forward: mean_abs=1.172921895980835, max_abs=8.0, mean_rel=0.16722725331783295, max_rel=2832.60400390625, norm_rel=0.023333458229899406, ref_abs_avg=50.48628234863281, test_abs_avg=50.48872756958008
production_forward grad[13] vs paper_forward: mean_abs=1.1372017860412598, max_abs=7.0, mean_rel=0.1597953587770462, max_rel=2249.92041015625, norm_rel=0.023065922781825066, ref_abs_avg=49.58500671386719, test_abs_avg=49.59392547607422
production_forward grad[14] vs paper_forward: mean_abs=0.8755321502685547, max_abs=4.0, mean_rel=0.092776820063591, max_rel=6.564387798309326, norm_rel=0.022844459861516953, ref_abs_avg=39.42327117919922, test_abs_avg=39.46425247192383
production_forward grad[15] vs paper_forward: mean_abs=1.0900185108184814, max_abs=7.5, mean_rel=0.1566314697265625, max_rel=1443.61376953125, norm_rel=0.023099545389413834, ref_abs_avg=47.445526123046875, test_abs_avg=47.45049285888672
production_forward grad[16] vs paper_forward: mean_abs=1.0660024881362915, max_abs=6.5, mean_rel=0.15780413150787354, max_rel=1383.7105712890625, norm_rel=0.02296319603919983, ref_abs_avg=46.69757080078125, test_abs_avg=46.699676513671875
production_forward grad[17] vs paper_forward: mean_abs=0.7731671333312988, max_abs=3.1875, mean_rel=0.11455394327640533, max_rel=10.089597702026367, norm_rel=0.02141597494482994, ref_abs_avg=35.788490295410156, test_abs_avg=35.77061462402344
production_forward grad[18] vs paper_forward: mean_abs=1.0216410160064697, max_abs=6.5, mean_rel=0.14453324675559998, max_rel=722.9647216796875, norm_rel=0.0229948777705431, ref_abs_avg=44.64289093017578, test_abs_avg=44.642189025878906
production_forward grad[19] vs paper_forward: mean_abs=1.0000333786010742, max_abs=7.0, mean_rel=0.1553339958190918, max_rel=1164.6392822265625, norm_rel=0.022715788334608078, ref_abs_avg=44.26605224609375, test_abs_avg=44.264095306396484
production_forward grad[20] vs paper_forward: mean_abs=0.7739617824554443, max_abs=2.640625, mean_rel=0.07426932454109192, max_rel=5.51226806640625, norm_rel=0.022885940968990326, ref_abs_avg=34.654029846191406, test_abs_avg=34.64965057373047
production_forward grad[21] vs paper_forward: mean_abs=0.9654162526130676, max_abs=6.0, mean_rel=0.15771082043647766, max_rel=1767.823974609375, norm_rel=0.022847570478916168, ref_abs_avg=42.449684143066406, test_abs_avg=42.450469970703125
production_forward grad[22] vs paper_forward: mean_abs=0.9438139200210571, max_abs=5.75, mean_rel=0.14118996262550354, max_rel=851.957763671875, norm_rel=0.022738782688975334, ref_abs_avg=41.6693000793457, test_abs_avg=41.66779327392578
production_forward grad[23] vs paper_forward: mean_abs=0.732614278793335, max_abs=2.5, mean_rel=0.12855112552642822, max_rel=20.134050369262695, norm_rel=0.02068537287414074, ref_abs_avg=35.13788604736328, test_abs_avg=35.158329010009766
production_forward grad[24] vs paper_forward: mean_abs=0.9162219166755676, max_abs=6.0, mean_rel=0.15618515014648438, max_rel=1462.86865234375, norm_rel=0.02282879129052162, ref_abs_avg=40.33368682861328, test_abs_avg=40.33592987060547
production_forward grad[25] vs paper_forward: mean_abs=0.8962032794952393, max_abs=6.0, mean_rel=0.14733874797821045, max_rel=1116.9766845703125, norm_rel=0.022510547190904617, ref_abs_avg=39.98896026611328, test_abs_avg=39.9842643737793
production_forward grad[26] vs paper_forward: mean_abs=0.8400938510894775, max_abs=3.25, mean_rel=0.43333613872528076, max_rel=173.04013061523438, norm_rel=0.024100545793771744, ref_abs_avg=35.57957077026367, test_abs_avg=35.550926208496094
production_forward grad[27] vs paper_forward: mean_abs=1.0687681436538696, max_abs=7.5, mean_rel=0.16125576198101044, max_rel=1102.505615234375, norm_rel=0.02481147088110447, ref_abs_avg=43.237857818603516, test_abs_avg=43.2390251159668
production_forward grad[28] vs paper_forward: mean_abs=1.041853427886963, max_abs=6.5, mean_rel=0.1736200898885727, max_rel=2558.41162109375, norm_rel=0.024510880932211876, ref_abs_avg=42.665077209472656, test_abs_avg=42.660621643066406
production_forward grad[29] vs paper_forward: mean_abs=0.7895326614379883, max_abs=3.125, mean_rel=0.08614268898963928, max_rel=4.336615085601807, norm_rel=0.0268560741096735, ref_abs_avg=29.825225830078125, test_abs_avg=29.83840560913086
production_forward grad[30] vs paper_forward: mean_abs=0.9840579628944397, max_abs=6.0, mean_rel=0.19479221105575562, max_rel=2601.666748046875, norm_rel=0.025128349661827087, ref_abs_avg=39.2843017578125, test_abs_avg=39.286563873291016
production_forward grad[31] vs paper_forward: mean_abs=0.9712424278259277, max_abs=5.8125, mean_rel=0.16963256895542145, max_rel=2199.074462890625, norm_rel=0.024829652160406113, ref_abs_avg=39.22703170776367, test_abs_avg=39.227455139160156
production_forward grad[32] vs paper_forward: mean_abs=0.7485332489013672, max_abs=2.6875, mean_rel=0.08047536015510559, max_rel=6.548464298248291, norm_rel=0.023579780012369156, ref_abs_avg=31.874191284179688, test_abs_avg=31.90073013305664
production_forward grad[33] vs paper_forward: mean_abs=0.9235823750495911, max_abs=6.25, mean_rel=0.17808839678764343, max_rel=1491.8343505859375, norm_rel=0.025089653208851814, ref_abs_avg=36.933433532714844, test_abs_avg=36.93422317504883
production_forward grad[34] vs paper_forward: mean_abs=0.9081632494926453, max_abs=6.0, mean_rel=0.17352333664894104, max_rel=1757.92431640625, norm_rel=0.024899963289499283, ref_abs_avg=36.5896110534668, test_abs_avg=36.594024658203125
production_forward grad[35] vs paper_forward: mean_abs=0.7114325165748596, max_abs=2.75, mean_rel=0.12101392447948456, max_rel=9.51121997833252, norm_rel=0.024291563779115677, ref_abs_avg=29.39659309387207, test_abs_avg=29.363590240478516
production_forward grad[36] vs paper_forward: mean_abs=0.8620476722717285, max_abs=5.53125, mean_rel=0.1704787015914917, max_rel=1262.2578125, norm_rel=0.0247204452753067, ref_abs_avg=34.956817626953125, test_abs_avg=34.95779037475586
production_forward grad[37] vs paper_forward: mean_abs=0.8509800434112549, max_abs=5.25, mean_rel=0.15880689024925232, max_rel=1038.5689697265625, norm_rel=0.024812303483486176, ref_abs_avg=34.38996124267578, test_abs_avg=34.389923095703125
production_forward grad[38] vs paper_forward: mean_abs=0.6018385887145996, max_abs=3.0, mean_rel=0.07073187828063965, max_rel=2.3338782787323, norm_rel=0.02196286991238594, ref_abs_avg=28.460037231445312, test_abs_avg=28.422237396240234
production_forward grad[39] vs paper_forward: mean_abs=0.8152339458465576, max_abs=5.5, mean_rel=0.17906849086284637, max_rel=1372.334716796875, norm_rel=0.024526117369532585, ref_abs_avg=33.30704116821289, test_abs_avg=33.307373046875
production_forward grad[40] vs paper_forward: mean_abs=0.8051204681396484, max_abs=5.5, mean_rel=0.16432499885559082, max_rel=1640.1697998046875, norm_rel=0.024511240422725677, ref_abs_avg=32.91267395019531, test_abs_avg=32.915924072265625
production_forward grad[41] vs paper_forward: mean_abs=0.6143178939819336, max_abs=2.5, mean_rel=0.14070987701416016, max_rel=15.885594367980957, norm_rel=0.023533690720796585, ref_abs_avg=26.328224182128906, test_abs_avg=26.301002502441406
production_forward grad[42] vs paper_forward: mean_abs=0.7750909328460693, max_abs=5.0, mean_rel=0.16077342629432678, max_rel=1078.2120361328125, norm_rel=0.02435164712369442, ref_abs_avg=31.931285858154297, test_abs_avg=31.930938720703125
production_forward grad[43] vs paper_forward: mean_abs=0.7634937167167664, max_abs=5.75, mean_rel=0.15459038317203522, max_rel=1247.5008544921875, norm_rel=0.023992933332920074, ref_abs_avg=31.863449096679688, test_abs_avg=31.859254837036133
production_forward grad[44] vs paper_forward: mean_abs=0.5921874046325684, max_abs=2.75, mean_rel=0.0799788385629654, max_rel=3.359846591949463, norm_rel=0.022220462560653687, ref_abs_avg=27.332645416259766, test_abs_avg=27.415180206298828
production_forward grad[45] vs paper_forward: mean_abs=0.7387930750846863, max_abs=5.0, mean_rel=0.16357305645942688, max_rel=1603.32080078125, norm_rel=0.02393459528684616, ref_abs_avg=30.92675018310547, test_abs_avg=30.9284725189209
production_forward grad[46] vs paper_forward: mean_abs=0.7235425710678101, max_abs=5.0, mean_rel=0.14610669016838074, max_rel=663.3648681640625, norm_rel=0.02380361594259739, ref_abs_avg=30.511547088623047, test_abs_avg=30.508760452270508
production_forward grad[47] vs paper_forward: mean_abs=0.5560183525085449, max_abs=2.1875, mean_rel=0.0885315090417862, max_rel=3.021517276763916, norm_rel=0.022587083280086517, ref_abs_avg=24.97911262512207, test_abs_avg=24.955080032348633
production_forward grad[48] vs paper_forward: mean_abs=0.7089219093322754, max_abs=4.5, mean_rel=0.17888876795768738, max_rel=1780.5244140625, norm_rel=0.02377936616539955, ref_abs_avg=29.860218048095703, test_abs_avg=29.860843658447266
production_forward grad[49] vs paper_forward: mean_abs=0.6941736936569214, max_abs=5.0, mean_rel=0.15962523221969604, max_rel=725.0121459960938, norm_rel=0.02358878403902054, ref_abs_avg=29.518951416015625, test_abs_avg=29.51999282836914
production_forward grad[50] vs paper_forward: mean_abs=0.6665477752685547, max_abs=2.5, mean_rel=0.11056938022375107, max_rel=6.360832691192627, norm_rel=0.025120489299297333, ref_abs_avg=26.06386947631836, test_abs_avg=26.021696090698242
production_forward grad[51] vs paper_forward: mean_abs=0.7992990016937256, max_abs=5.0, mean_rel=0.1604815125465393, max_rel=926.4519653320312, norm_rel=0.025072768330574036, ref_abs_avg=31.93474578857422, test_abs_avg=31.934402465820312
production_forward grad[52] vs paper_forward: mean_abs=0.7787255644798279, max_abs=5.0, mean_rel=0.1637851893901825, max_rel=1090.83984375, norm_rel=0.02500382997095585, ref_abs_avg=31.203887939453125, test_abs_avg=31.205547332763672
production_forward grad[53] vs paper_forward: mean_abs=0.5554225444793701, max_abs=2.375, mean_rel=0.10668444633483887, max_rel=6.845199108123779, norm_rel=0.022304195910692215, ref_abs_avg=24.771018981933594, test_abs_avg=24.84078025817871
production_forward grad[54] vs paper_forward: mean_abs=0.725460410118103, max_abs=5.0, mean_rel=0.16449081897735596, max_rel=661.8245849609375, norm_rel=0.024670297279953957, ref_abs_avg=29.437198638916016, test_abs_avg=29.43775177001953
production_forward grad[55] vs paper_forward: mean_abs=0.7129696607589722, max_abs=4.625, mean_rel=0.16844594478607178, max_rel=1039.3970947265625, norm_rel=0.02464832179248333, ref_abs_avg=28.990779876708984, test_abs_avg=28.984619140625
production_forward grad[56] vs paper_forward: mean_abs=0.53913813829422, max_abs=2.25, mean_rel=0.1596505045890808, max_rel=18.480627059936523, norm_rel=0.0235873032361269, ref_abs_avg=22.762758255004883, test_abs_avg=22.777509689331055
production_forward grad[57] vs paper_forward: mean_abs=0.6766067743301392, max_abs=5.0, mean_rel=0.15618334710597992, max_rel=931.1597290039062, norm_rel=0.024117618799209595, ref_abs_avg=28.060997009277344, test_abs_avg=28.057891845703125
production_forward grad[58] vs paper_forward: mean_abs=0.6663297414779663, max_abs=5.0, mean_rel=0.16234390437602997, max_rel=924.8695678710938, norm_rel=0.023947594687342644, ref_abs_avg=27.809734344482422, test_abs_avg=27.802453994750977
production_forward grad[59] vs paper_forward: mean_abs=0.5177326202392578, max_abs=2.375, mean_rel=0.06818295270204544, max_rel=3.04577898979187, norm_rel=0.024089563637971878, ref_abs_avg=21.679567337036133, test_abs_avg=21.691974639892578
production_forward grad[60] vs paper_forward: mean_abs=0.6335534453392029, max_abs=4.25, mean_rel=0.1613142192363739, max_rel=791.6520385742188, norm_rel=0.023737061768770218, ref_abs_avg=26.71576690673828, test_abs_avg=26.715003967285156
production_forward grad[61] vs paper_forward: mean_abs=0.6215237379074097, max_abs=4.0, mean_rel=0.1464289128780365, max_rel=453.25054931640625, norm_rel=0.02339600771665573, ref_abs_avg=26.5617733001709, test_abs_avg=26.563138961791992
production_forward grad[62] vs paper_forward: mean_abs=0.5007648468017578, max_abs=2.375, mean_rel=0.2565246522426605, max_rel=36.1165885925293, norm_rel=0.02501867339015007, ref_abs_avg=20.61704444885254, test_abs_avg=20.607276916503906
production_forward grad[63] vs paper_forward: mean_abs=0.6057407855987549, max_abs=4.75, mean_rel=0.14884448051452637, max_rel=1417.40283203125, norm_rel=0.02310767211019993, ref_abs_avg=26.18055534362793, test_abs_avg=26.179147720336914
production_forward grad[64] vs paper_forward: mean_abs=0.5871766209602356, max_abs=4.25, mean_rel=0.15328702330589294, max_rel=1042.206298828125, norm_rel=0.023199589923024178, ref_abs_avg=25.379165649414062, test_abs_avg=25.375564575195312
production_forward grad[65] vs paper_forward: mean_abs=0.4484996795654297, max_abs=1.875, mean_rel=0.11631444096565247, max_rel=17.227914810180664, norm_rel=0.020614491775631905, ref_abs_avg=20.816579818725586, test_abs_avg=20.83118438720703
production_forward grad[66] vs paper_forward: mean_abs=0.5710949897766113, max_abs=4.0, mean_rel=0.14683392643928528, max_rel=760.9776000976562, norm_rel=0.02311277948319912, ref_abs_avg=24.711149215698242, test_abs_avg=24.7109375
production_forward grad[67] vs paper_forward: mean_abs=0.5612092614173889, max_abs=4.25, mean_rel=0.14677965641021729, max_rel=536.327880859375, norm_rel=0.022715596482157707, ref_abs_avg=24.694822311401367, test_abs_avg=24.69192123413086
production_forward grad[68] vs paper_forward: mean_abs=0.41431665420532227, max_abs=1.5, mean_rel=0.10744966566562653, max_rel=7.111060619354248, norm_rel=0.02101968228816986, ref_abs_avg=19.904457092285156, test_abs_avg=19.949342727661133
production_forward grad[69] vs paper_forward: mean_abs=0.5449683666229248, max_abs=4.125, mean_rel=0.14391008019447327, max_rel=554.7623291015625, norm_rel=0.02244708314538002, ref_abs_avg=24.258502960205078, test_abs_avg=24.259140014648438
production_forward grad[70] vs paper_forward: mean_abs=0.5282570719718933, max_abs=4.5, mean_rel=0.1552748829126358, max_rel=787.4630126953125, norm_rel=0.021980909630656242, ref_abs_avg=24.043758392333984, test_abs_avg=24.042984008789062
production_forward grad[71] vs paper_forward: mean_abs=0.4483771324157715, max_abs=1.75, mean_rel=0.1413535475730896, max_rel=38.08760070800781, norm_rel=0.022700605913996696, ref_abs_avg=20.63714027404785, test_abs_avg=20.638370513916016
production_forward grad[72] vs paper_forward: mean_abs=0.5210627317428589, max_abs=4.75, mean_rel=0.1502370685338974, max_rel=951.2993774414062, norm_rel=0.022209396585822105, ref_abs_avg=23.40911102294922, test_abs_avg=23.409631729125977
production_forward grad[73] vs paper_forward: mean_abs=0.5047048330307007, max_abs=4.0, mean_rel=0.1348896026611328, max_rel=858.402587890625, norm_rel=0.021934786811470985, ref_abs_avg=23.031715393066406, test_abs_avg=23.020437240600586
production_forward grad[74] vs paper_forward: mean_abs=0.4720497131347656, max_abs=1.8125, mean_rel=0.05840185284614563, max_rel=1.9108996391296387, norm_rel=0.021647291257977486, ref_abs_avg=21.931224822998047, test_abs_avg=21.906631469726562
production_forward grad[75] vs paper_forward: mean_abs=0.5802635550498962, max_abs=4.75, mean_rel=0.14794516563415527, max_rel=673.5474853515625, norm_rel=0.023985151201486588, ref_abs_avg=24.196929931640625, test_abs_avg=24.1988525390625
production_forward grad[76] vs paper_forward: mean_abs=0.5663990378379822, max_abs=4.5, mean_rel=0.1571321189403534, max_rel=1240.6710205078125, norm_rel=0.023979485034942627, ref_abs_avg=23.63653564453125, test_abs_avg=23.626386642456055
production_forward grad[77] vs paper_forward: mean_abs=0.43101632595062256, max_abs=1.71875, mean_rel=0.377887099981308, max_rel=98.59956359863281, norm_rel=0.022203365340828896, ref_abs_avg=19.514131546020508, test_abs_avg=19.513586044311523
production_forward grad[78] vs paper_forward: mean_abs=0.5373777151107788, max_abs=4.0, mean_rel=0.15594983100891113, max_rel=1032.2247314453125, norm_rel=0.02360062673687935, ref_abs_avg=22.757076263427734, test_abs_avg=22.75824546813965
production_forward grad[79] vs paper_forward: mean_abs=0.5230613946914673, max_abs=4.0, mean_rel=0.14025773108005524, max_rel=922.2696533203125, norm_rel=0.023282915353775024, ref_abs_avg=22.508697509765625, test_abs_avg=22.50606918334961
production_forward grad[80] vs paper_forward: mean_abs=0.39141273498535156, max_abs=1.5, mean_rel=0.06841059029102325, max_rel=1.83766770362854, norm_rel=0.020931920036673546, ref_abs_avg=18.77070426940918, test_abs_avg=18.765522003173828
production_forward grad[81] vs paper_forward: mean_abs=0.49734267592430115, max_abs=5.0, mean_rel=0.15015935897827148, max_rel=919.6904907226562, norm_rel=0.022843746468424797, ref_abs_avg=21.77242660522461, test_abs_avg=21.772014617919922
production_forward grad[82] vs paper_forward: mean_abs=0.48154062032699585, max_abs=4.5, mean_rel=0.14861807227134705, max_rel=692.6996459960938, norm_rel=0.022904187440872192, ref_abs_avg=21.075559616088867, test_abs_avg=21.06932830810547
production_forward grad[83] vs paper_forward: mean_abs=0.3773670196533203, max_abs=1.5, mean_rel=0.09212207794189453, max_rel=4.335909366607666, norm_rel=0.021547378972172737, ref_abs_avg=17.47693634033203, test_abs_avg=17.491518020629883
production_forward grad[84] vs paper_forward: mean_abs=0.4567811191082001, max_abs=4.140625, mean_rel=0.1341754049062729, max_rel=820.2733154296875, norm_rel=0.022218098863959312, ref_abs_avg=20.602462768554688, test_abs_avg=20.602968215942383
production_forward grad[85] vs paper_forward: mean_abs=0.4482678174972534, max_abs=4.125, mean_rel=0.1437588632106781, max_rel=969.9064331054688, norm_rel=0.02192009799182415, ref_abs_avg=20.4941349029541, test_abs_avg=20.496665954589844
production_forward grad[86] vs paper_forward: mean_abs=0.3496975898742676, max_abs=1.5, mean_rel=0.07603636384010315, max_rel=4.883431434631348, norm_rel=0.019863905385136604, ref_abs_avg=17.91931915283203, test_abs_avg=17.9212646484375
production_forward grad[87] vs paper_forward: mean_abs=0.4243236184120178, max_abs=4.5, mean_rel=0.13782191276550293, max_rel=1041.8984375, norm_rel=0.021267954260110855, ref_abs_avg=20.020856857299805, test_abs_avg=20.02299690246582
production_forward grad[88] vs paper_forward: mean_abs=0.4109542965888977, max_abs=4.5, mean_rel=0.13394488394260406, max_rel=596.6416625976562, norm_rel=0.021567609161138535, ref_abs_avg=19.176576614379883, test_abs_avg=19.1832332611084
production_forward grad[89] vs paper_forward: mean_abs=0.32046616077423096, max_abs=1.25, mean_rel=0.14948758482933044, max_rel=35.58154296875, norm_rel=0.02066524140536785, ref_abs_avg=15.470318794250488, test_abs_avg=15.48874282836914
production_forward grad[90] vs paper_forward: mean_abs=0.3942573368549347, max_abs=4.0, mean_rel=0.1251412034034729, max_rel=567.5999145507812, norm_rel=0.020888201892375946, ref_abs_avg=19.005434036254883, test_abs_avg=19.005332946777344
production_forward grad[91] vs paper_forward: mean_abs=0.38621658086776733, max_abs=3.5, mean_rel=0.12897242605686188, max_rel=589.6367797851562, norm_rel=0.02072911150753498, ref_abs_avg=18.759857177734375, test_abs_avg=18.765674591064453
production_forward grad[92] vs paper_forward: mean_abs=0.33408644795417786, max_abs=1.46875, mean_rel=0.16619595885276794, max_rel=38.626319885253906, norm_rel=0.022130461409687996, ref_abs_avg=15.242249488830566, test_abs_avg=15.245769500732422
production_forward grad[93] vs paper_forward: mean_abs=0.3781376779079437, max_abs=4.5, mean_rel=0.12445962429046631, max_rel=466.5350646972656, norm_rel=0.020467551425099373, ref_abs_avg=18.678123474121094, test_abs_avg=18.678604125976562
production_forward grad[94] vs paper_forward: mean_abs=0.37745314836502075, max_abs=4.5, mean_rel=0.13158485293388367, max_rel=601.8326416015625, norm_rel=0.02100149355828762, ref_abs_avg=18.227066040039062, test_abs_avg=18.22655487060547
production_forward grad[95] vs paper_forward: mean_abs=0.2977020740509033, max_abs=1.25, mean_rel=0.14132699370384216, max_rel=13.298373222351074, norm_rel=0.018603255972266197, ref_abs_avg=16.047056198120117, test_abs_avg=16.069595336914062
production_forward grad[96] vs paper_forward: mean_abs=0.3537580370903015, max_abs=4.0, mean_rel=0.1197807639837265, max_rel=410.3683166503906, norm_rel=0.020086785778403282, ref_abs_avg=17.884000778198242, test_abs_avg=17.885847091674805
production_forward grad[97] vs paper_forward: mean_abs=0.3544250726699829, max_abs=5.0, mean_rel=0.12742355465888977, max_rel=515.4738159179688, norm_rel=0.02069537341594696, ref_abs_avg=17.467761993408203, test_abs_avg=17.473350524902344
production_forward2 vs paper_forward output: mean_abs=0.0016604398842900991, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008786197751760483, max_abs=0.375, mean_rel=0.07522515952587128, max_rel=114.46782684326172, norm_rel=0.02056765928864479, ref_abs_avg=0.4615171253681183, test_abs_avg=0.46151426434516907
production_forward2 grad[1] vs paper_forward: mean_abs=7.642151832580566, max_abs=56.0, mean_rel=0.14079204201698303, max_rel=257.4383239746094, norm_rel=0.02114836312830448, ref_abs_avg=320.78814697265625, test_abs_avg=320.7096252441406
production_forward2 grad[2] vs paper_forward: mean_abs=1.3422517776489258, max_abs=5.25, mean_rel=0.08886174857616425, max_rel=2.969500780105591, norm_rel=0.025733312591910362, ref_abs_avg=51.854393005371094, test_abs_avg=51.769283294677734
production_forward2 grad[3] vs paper_forward: mean_abs=1.610721230506897, max_abs=11.03125, mean_rel=0.1837109923362732, max_rel=3520.7431640625, norm_rel=0.024409065023064613, ref_abs_avg=66.30470275878906, test_abs_avg=66.30616760253906
production_forward2 grad[4] vs paper_forward: mean_abs=1.5642732381820679, max_abs=9.0, mean_rel=0.16678205132484436, max_rel=910.3619384765625, norm_rel=0.024071499705314636, ref_abs_avg=65.31829833984375, test_abs_avg=65.33984375
production_forward2 grad[5] vs paper_forward: mean_abs=1.2144360542297363, max_abs=3.75, mean_rel=0.11111552268266678, max_rel=6.010865688323975, norm_rel=0.024748072028160095, ref_abs_avg=47.773597717285156, test_abs_avg=47.72322463989258
production_forward2 grad[6] vs paper_forward: mean_abs=1.4212193489074707, max_abs=8.5, mean_rel=0.1777675300836563, max_rel=1749.9205322265625, norm_rel=0.02397306263446808, ref_abs_avg=59.550899505615234, test_abs_avg=59.54496765136719
production_forward2 grad[7] vs paper_forward: mean_abs=1.3838341236114502, max_abs=9.40625, mean_rel=0.15568749606609344, max_rel=1803.47998046875, norm_rel=0.023848852142691612, ref_abs_avg=58.301673889160156, test_abs_avg=58.296817779541016
production_forward2 grad[8] vs paper_forward: mean_abs=1.0452156066894531, max_abs=3.5, mean_rel=0.10024060308933258, max_rel=6.056023597717285, norm_rel=0.023402078077197075, ref_abs_avg=44.51505661010742, test_abs_avg=44.54436492919922
production_forward2 grad[9] vs paper_forward: mean_abs=1.2824733257293701, max_abs=8.625, mean_rel=0.16897708177566528, max_rel=2110.963134765625, norm_rel=0.023800818249583244, ref_abs_avg=54.1474494934082, test_abs_avg=54.150665283203125
production_forward2 grad[10] vs paper_forward: mean_abs=1.254078984260559, max_abs=8.0, mean_rel=0.17264379560947418, max_rel=2350.011962890625, norm_rel=0.023622380569577217, ref_abs_avg=53.377418518066406, test_abs_avg=53.37762451171875
production_forward2 grad[11] vs paper_forward: mean_abs=0.9421563148498535, max_abs=3.25, mean_rel=0.15743251144886017, max_rel=39.8785285949707, norm_rel=0.02326063998043537, ref_abs_avg=40.48884582519531, test_abs_avg=40.66261291503906
production_forward2 grad[12] vs paper_forward: mean_abs=1.1839148998260498, max_abs=8.0, mean_rel=0.16806314885616302, max_rel=2098.74755859375, norm_rel=0.023559315130114555, ref_abs_avg=50.48628234863281, test_abs_avg=50.4876594543457
production_forward2 grad[13] vs paper_forward: mean_abs=1.151161551475525, max_abs=7.0, mean_rel=0.16002893447875977, max_rel=2437.383056640625, norm_rel=0.023341815918684006, ref_abs_avg=49.58500671386719, test_abs_avg=49.591461181640625
production_forward2 grad[14] vs paper_forward: mean_abs=0.9117898941040039, max_abs=3.75, mean_rel=0.09377994388341904, max_rel=5.41172981262207, norm_rel=0.023767266422510147, ref_abs_avg=39.42327117919922, test_abs_avg=39.463722229003906
production_forward2 grad[15] vs paper_forward: mean_abs=1.1009596586227417, max_abs=7.0, mean_rel=0.15998205542564392, max_rel=1707.3353271484375, norm_rel=0.023328270763158798, ref_abs_avg=47.445526123046875, test_abs_avg=47.4487190246582
production_forward2 grad[16] vs paper_forward: mean_abs=1.078532338142395, max_abs=6.6875, mean_rel=0.1633334457874298, max_rel=1527.88720703125, norm_rel=0.023212620988488197, ref_abs_avg=46.69757080078125, test_abs_avg=46.70121765136719
production_forward2 grad[17] vs paper_forward: mean_abs=0.7760815620422363, max_abs=3.0, mean_rel=0.11417440325021744, max_rel=8.325984954833984, norm_rel=0.021349795162677765, ref_abs_avg=35.788490295410156, test_abs_avg=35.787628173828125
production_forward2 grad[18] vs paper_forward: mean_abs=1.0319602489471436, max_abs=6.5, mean_rel=0.1508244276046753, max_rel=970.062744140625, norm_rel=0.023225268349051476, ref_abs_avg=44.64289093017578, test_abs_avg=44.64148712158203
production_forward2 grad[19] vs paper_forward: mean_abs=1.009247064590454, max_abs=6.5, mean_rel=0.1522042155265808, max_rel=1365.921142578125, norm_rel=0.022925931960344315, ref_abs_avg=44.26605224609375, test_abs_avg=44.265140533447266
production_forward2 grad[20] vs paper_forward: mean_abs=0.7604781985282898, max_abs=2.875, mean_rel=0.06959539651870728, max_rel=4.4761271476745605, norm_rel=0.02291693538427353, ref_abs_avg=34.654029846191406, test_abs_avg=34.63684844970703
production_forward2 grad[21] vs paper_forward: mean_abs=0.9746689796447754, max_abs=5.875, mean_rel=0.16508802771568298, max_rel=1447.7784423828125, norm_rel=0.02307332120835781, ref_abs_avg=42.449684143066406, test_abs_avg=42.44987487792969
production_forward2 grad[22] vs paper_forward: mean_abs=0.95144122838974, max_abs=5.75, mean_rel=0.14329293370246887, max_rel=893.8201904296875, norm_rel=0.022937553003430367, ref_abs_avg=41.6693000793457, test_abs_avg=41.66626739501953
production_forward2 grad[23] vs paper_forward: mean_abs=0.7348980903625488, max_abs=2.75, mean_rel=0.12249967455863953, max_rel=17.311969757080078, norm_rel=0.020785192027688026, ref_abs_avg=35.13788604736328, test_abs_avg=35.14164733886719
production_forward2 grad[24] vs paper_forward: mean_abs=0.9248224496841431, max_abs=6.25, mean_rel=0.15693353116512299, max_rel=1057.5181884765625, norm_rel=0.02304757945239544, ref_abs_avg=40.33368682861328, test_abs_avg=40.33543395996094
production_forward2 grad[25] vs paper_forward: mean_abs=0.903106689453125, max_abs=6.0, mean_rel=0.14687001705169678, max_rel=982.1134643554688, norm_rel=0.022695258259773254, ref_abs_avg=39.98896026611328, test_abs_avg=39.98394012451172
production_forward2 grad[26] vs paper_forward: mean_abs=0.8765404224395752, max_abs=3.0177001953125, mean_rel=0.4967343509197235, max_rel=207.4048614501953, norm_rel=0.024704305455088615, ref_abs_avg=35.57957077026367, test_abs_avg=35.56146240234375
production_forward2 grad[27] vs paper_forward: mean_abs=1.0767524242401123, max_abs=8.0, mean_rel=0.1586616039276123, max_rel=900.6412963867188, norm_rel=0.024983949959278107, ref_abs_avg=43.237857818603516, test_abs_avg=43.237449645996094
production_forward2 grad[28] vs paper_forward: mean_abs=1.0490612983703613, max_abs=6.75, mean_rel=0.17464351654052734, max_rel=1875.63720703125, norm_rel=0.024686546996235847, ref_abs_avg=42.665077209472656, test_abs_avg=42.662498474121094
production_forward2 grad[29] vs paper_forward: mean_abs=0.7898173332214355, max_abs=3.5, mean_rel=0.07965343445539474, max_rel=3.691983222961426, norm_rel=0.026706790551543236, ref_abs_avg=29.825225830078125, test_abs_avg=29.818504333496094
production_forward2 grad[30] vs paper_forward: mean_abs=0.9908835887908936, max_abs=6.0, mean_rel=0.20079666376113892, max_rel=2437.882080078125, norm_rel=0.0252932608127594, ref_abs_avg=39.2843017578125, test_abs_avg=39.28602981567383
production_forward2 grad[31] vs paper_forward: mean_abs=0.9787445664405823, max_abs=6.0, mean_rel=0.17552052438259125, max_rel=2133.62109375, norm_rel=0.025022117421030998, ref_abs_avg=39.22703170776367, test_abs_avg=39.22564697265625
production_forward2 grad[32] vs paper_forward: mean_abs=0.7493610382080078, max_abs=2.5, mean_rel=0.07865124940872192, max_rel=6.567147731781006, norm_rel=0.02344394475221634, ref_abs_avg=31.874191284179688, test_abs_avg=31.861764907836914
production_forward2 grad[33] vs paper_forward: mean_abs=0.9287641048431396, max_abs=6.5, mean_rel=0.1805863380432129, max_rel=1394.7611083984375, norm_rel=0.02522864192724228, ref_abs_avg=36.933433532714844, test_abs_avg=36.934059143066406
production_forward2 grad[34] vs paper_forward: mean_abs=0.9137636423110962, max_abs=5.5, mean_rel=0.17123043537139893, max_rel=1748.5758056640625, norm_rel=0.025060180574655533, ref_abs_avg=36.5896110534668, test_abs_avg=36.59254837036133
production_forward2 grad[35] vs paper_forward: mean_abs=0.7167863845825195, max_abs=2.875, mean_rel=0.14455482363700867, max_rel=13.724100112915039, norm_rel=0.02453133836388588, ref_abs_avg=29.39659309387207, test_abs_avg=29.377918243408203
production_forward2 grad[36] vs paper_forward: mean_abs=0.8671666383743286, max_abs=5.5, mean_rel=0.16986139118671417, max_rel=1296.9288330078125, norm_rel=0.024863509461283684, ref_abs_avg=34.956817626953125, test_abs_avg=34.95746612548828
production_forward2 grad[37] vs paper_forward: mean_abs=0.8548534512519836, max_abs=6.25, mean_rel=0.15987557172775269, max_rel=778.9879150390625, norm_rel=0.024937668815255165, ref_abs_avg=34.38996124267578, test_abs_avg=34.38832092285156
production_forward2 grad[38] vs paper_forward: mean_abs=0.6224384307861328, max_abs=3.0, mean_rel=0.07405410706996918, max_rel=2.050168991088867, norm_rel=0.02254260517656803, ref_abs_avg=28.460037231445312, test_abs_avg=28.43109703063965
production_forward2 grad[39] vs paper_forward: mean_abs=0.8207752704620361, max_abs=5.0625, mean_rel=0.1815604567527771, max_rel=1586.0294189453125, norm_rel=0.024691281840205193, ref_abs_avg=33.30704116821289, test_abs_avg=33.30752944946289
production_forward2 grad[40] vs paper_forward: mean_abs=0.8091104030609131, max_abs=5.0546875, mean_rel=0.16735023260116577, max_rel=1649.9888916015625, norm_rel=0.024622030556201935, ref_abs_avg=32.91267395019531, test_abs_avg=32.91630554199219
production_forward2 grad[41] vs paper_forward: mean_abs=0.6190261840820312, max_abs=2.625, mean_rel=0.11709102243185043, max_rel=8.10193920135498, norm_rel=0.02352396957576275, ref_abs_avg=26.328224182128906, test_abs_avg=26.30718231201172
production_forward2 grad[42] vs paper_forward: mean_abs=0.7791117429733276, max_abs=5.0, mean_rel=0.15661059319972992, max_rel=1047.80908203125, norm_rel=0.024471793323755264, ref_abs_avg=31.931285858154297, test_abs_avg=31.929691314697266
production_forward2 grad[43] vs paper_forward: mean_abs=0.7694486975669861, max_abs=5.5, mean_rel=0.16059798002243042, max_rel=1422.9835205078125, norm_rel=0.024169063195586205, ref_abs_avg=31.863449096679688, test_abs_avg=31.8586483001709
production_forward2 grad[44] vs paper_forward: mean_abs=0.5999984741210938, max_abs=2.4375, mean_rel=0.07924889028072357, max_rel=3.9693424701690674, norm_rel=0.02235330082476139, ref_abs_avg=27.332645416259766, test_abs_avg=27.415870666503906
production_forward2 grad[45] vs paper_forward: mean_abs=0.7426400780677795, max_abs=4.75, mean_rel=0.16352611780166626, max_rel=1512.994873046875, norm_rel=0.024057727307081223, ref_abs_avg=30.92675018310547, test_abs_avg=30.927270889282227
production_forward2 grad[46] vs paper_forward: mean_abs=0.7282342910766602, max_abs=4.75, mean_rel=0.1471891701221466, max_rel=635.9088134765625, norm_rel=0.023933913558721542, ref_abs_avg=30.511547088623047, test_abs_avg=30.508506774902344
production_forward2 grad[47] vs paper_forward: mean_abs=0.5650553703308105, max_abs=2.25, mean_rel=0.10051491856575012, max_rel=5.095603942871094, norm_rel=0.0228925459086895, ref_abs_avg=24.97911262512207, test_abs_avg=24.948667526245117
production_forward2 grad[48] vs paper_forward: mean_abs=0.712212085723877, max_abs=5.0, mean_rel=0.1748388707637787, max_rel=1619.282958984375, norm_rel=0.023898854851722717, ref_abs_avg=29.860218048095703, test_abs_avg=29.86065673828125
production_forward2 grad[49] vs paper_forward: mean_abs=0.6982526779174805, max_abs=4.609375, mean_rel=0.1564701348543167, max_rel=828.5084228515625, norm_rel=0.023728054016828537, ref_abs_avg=29.518951416015625, test_abs_avg=29.520124435424805
production_forward2 grad[50] vs paper_forward: mean_abs=0.6788482666015625, max_abs=2.5, mean_rel=0.10806906968355179, max_rel=7.067591667175293, norm_rel=0.02561214007437229, ref_abs_avg=26.06386947631836, test_abs_avg=26.00998306274414
production_forward2 grad[51] vs paper_forward: mean_abs=0.80228590965271, max_abs=5.0, mean_rel=0.15987485647201538, max_rel=892.644775390625, norm_rel=0.02517201378941536, ref_abs_avg=31.93474578857422, test_abs_avg=31.934654235839844
production_forward2 grad[52] vs paper_forward: mean_abs=0.7817501425743103, max_abs=5.0, mean_rel=0.15936912596225739, max_rel=1059.6634521484375, norm_rel=0.025093017145991325, ref_abs_avg=31.203887939453125, test_abs_avg=31.203962326049805
production_forward2 grad[53] vs paper_forward: mean_abs=0.5614011287689209, max_abs=2.3125, mean_rel=0.10712308436632156, max_rel=8.637078285217285, norm_rel=0.022345084697008133, ref_abs_avg=24.771018981933594, test_abs_avg=24.847070693969727
production_forward2 grad[54] vs paper_forward: mean_abs=0.7285683155059814, max_abs=5.125, mean_rel=0.16540782153606415, max_rel=936.8634033203125, norm_rel=0.02476782165467739, ref_abs_avg=29.437198638916016, test_abs_avg=29.43692398071289
production_forward2 grad[55] vs paper_forward: mean_abs=0.714631199836731, max_abs=4.6875, mean_rel=0.17026835680007935, max_rel=1456.4188232421875, norm_rel=0.024725863710045815, ref_abs_avg=28.990779876708984, test_abs_avg=28.98396873474121
production_forward2 grad[56] vs paper_forward: mean_abs=0.5228197574615479, max_abs=2.0, mean_rel=0.13533979654312134, max_rel=10.816341400146484, norm_rel=0.02304149977862835, ref_abs_avg=22.762758255004883, test_abs_avg=22.762073516845703
production_forward2 grad[57] vs paper_forward: mean_abs=0.6785148978233337, max_abs=4.375, mean_rel=0.15559983253479004, max_rel=836.5154418945312, norm_rel=0.02419123984873295, ref_abs_avg=28.060997009277344, test_abs_avg=28.057727813720703
production_forward2 grad[58] vs paper_forward: mean_abs=0.6696332693099976, max_abs=5.125, mean_rel=0.1626516729593277, max_rel=1078.916259765625, norm_rel=0.02406211942434311, ref_abs_avg=27.809734344482422, test_abs_avg=27.80221939086914
production_forward2 grad[59] vs paper_forward: mean_abs=0.5277614593505859, max_abs=2.375, mean_rel=0.07119300961494446, max_rel=4.071293830871582, norm_rel=0.02471953257918358, ref_abs_avg=21.679567337036133, test_abs_avg=21.68964385986328
production_forward2 grad[60] vs paper_forward: mean_abs=0.6362976431846619, max_abs=4.25, mean_rel=0.16302859783172607, max_rel=706.9712524414062, norm_rel=0.023832909762859344, ref_abs_avg=26.71576690673828, test_abs_avg=26.714370727539062
production_forward2 grad[61] vs paper_forward: mean_abs=0.6226370930671692, max_abs=4.0, mean_rel=0.14311832189559937, max_rel=429.0064697265625, norm_rel=0.0234502162784338, ref_abs_avg=26.5617733001709, test_abs_avg=26.56267547607422
production_forward2 grad[62] vs paper_forward: mean_abs=0.49938106536865234, max_abs=2.125, mean_rel=0.2460608333349228, max_rel=36.992916107177734, norm_rel=0.02500130981206894, ref_abs_avg=20.61704444885254, test_abs_avg=20.61601448059082
production_forward2 grad[63] vs paper_forward: mean_abs=0.608130931854248, max_abs=5.375, mean_rel=0.14876890182495117, max_rel=1359.146240234375, norm_rel=0.023192517459392548, ref_abs_avg=26.18055534362793, test_abs_avg=26.179479598999023
production_forward2 grad[64] vs paper_forward: mean_abs=0.5889400839805603, max_abs=4.25, mean_rel=0.1525990515947342, max_rel=781.6989135742188, norm_rel=0.023258071392774582, ref_abs_avg=25.379165649414062, test_abs_avg=25.37626075744629
production_forward2 grad[65] vs paper_forward: mean_abs=0.4472336769104004, max_abs=1.875, mean_rel=0.1222754716873169, max_rel=23.85042953491211, norm_rel=0.02086828462779522, ref_abs_avg=20.816579818725586, test_abs_avg=20.837276458740234
production_forward2 grad[66] vs paper_forward: mean_abs=0.5731709003448486, max_abs=4.5, mean_rel=0.1464538723230362, max_rel=696.9379272460938, norm_rel=0.023194575682282448, ref_abs_avg=24.711149215698242, test_abs_avg=24.710678100585938
production_forward2 grad[67] vs paper_forward: mean_abs=0.5626193284988403, max_abs=4.34375, mean_rel=0.14686886966228485, max_rel=498.7902526855469, norm_rel=0.022771719843149185, ref_abs_avg=24.694822311401367, test_abs_avg=24.692325592041016
production_forward2 grad[68] vs paper_forward: mean_abs=0.42609548568725586, max_abs=1.75, mean_rel=0.11479290574789047, max_rel=7.193742752075195, norm_rel=0.021747274324297905, ref_abs_avg=19.904457092285156, test_abs_avg=19.943622589111328
production_forward2 grad[69] vs paper_forward: mean_abs=0.5464178323745728, max_abs=4.0, mean_rel=0.14450356364250183, max_rel=568.29638671875, norm_rel=0.022505393251776695, ref_abs_avg=24.258502960205078, test_abs_avg=24.259122848510742
production_forward2 grad[70] vs paper_forward: mean_abs=0.5294719934463501, max_abs=4.5, mean_rel=0.15562300384044647, max_rel=617.4578857421875, norm_rel=0.022023746743798256, ref_abs_avg=24.043758392333984, test_abs_avg=24.042985916137695
production_forward2 grad[71] vs paper_forward: mean_abs=0.4535393714904785, max_abs=1.75, mean_rel=0.16980019211769104, max_rel=52.75304412841797, norm_rel=0.02302299067378044, ref_abs_avg=20.63714027404785, test_abs_avg=20.635250091552734
production_forward2 grad[72] vs paper_forward: mean_abs=0.5222998857498169, max_abs=4.75, mean_rel=0.1511281132698059, max_rel=804.9700317382812, norm_rel=0.0222687479108572, ref_abs_avg=23.40911102294922, test_abs_avg=23.40938949584961
production_forward2 grad[73] vs paper_forward: mean_abs=0.5055489540100098, max_abs=4.0, mean_rel=0.13420966267585754, max_rel=743.9304809570312, norm_rel=0.02198408544063568, ref_abs_avg=23.031715393066406, test_abs_avg=23.0203800201416
production_forward2 grad[74] vs paper_forward: mean_abs=0.46820640563964844, max_abs=2.0, mean_rel=0.053386494517326355, max_rel=2.0420398712158203, norm_rel=0.021451614797115326, ref_abs_avg=21.931224822998047, test_abs_avg=21.899168014526367
production_forward2 grad[75] vs paper_forward: mean_abs=0.5821281671524048, max_abs=4.625, mean_rel=0.1482396423816681, max_rel=848.455078125, norm_rel=0.024062462151050568, ref_abs_avg=24.196929931640625, test_abs_avg=24.198884963989258
production_forward2 grad[76] vs paper_forward: mean_abs=0.5683156251907349, max_abs=4.125, mean_rel=0.1553083062171936, max_rel=1131.96240234375, norm_rel=0.024070588871836662, ref_abs_avg=23.63653564453125, test_abs_avg=23.626781463623047
production_forward2 grad[77] vs paper_forward: mean_abs=0.43449264764785767, max_abs=1.9140625, mean_rel=0.22958329319953918, max_rel=25.04867935180664, norm_rel=0.022468117997050285, ref_abs_avg=19.514131546020508, test_abs_avg=19.50502586364746
production_forward2 grad[78] vs paper_forward: mean_abs=0.5389862060546875, max_abs=4.5, mean_rel=0.1562868058681488, max_rel=990.9404907226562, norm_rel=0.023658467456698418, ref_abs_avg=22.757076263427734, test_abs_avg=22.758037567138672
production_forward2 grad[79] vs paper_forward: mean_abs=0.5240886211395264, max_abs=4.0, mean_rel=0.13986527919769287, max_rel=842.9658203125, norm_rel=0.023327061906456947, ref_abs_avg=22.508697509765625, test_abs_avg=22.505409240722656
production_forward2 grad[80] vs paper_forward: mean_abs=0.3892054557800293, max_abs=1.5, mean_rel=0.06577074527740479, max_rel=1.5210946798324585, norm_rel=0.020744742825627327, ref_abs_avg=18.77070426940918, test_abs_avg=18.775339126586914
production_forward2 grad[81] vs paper_forward: mean_abs=0.49816229939460754, max_abs=4.125, mean_rel=0.15063989162445068, max_rel=726.2166748046875, norm_rel=0.02288336120545864, ref_abs_avg=21.77242660522461, test_abs_avg=21.772262573242188
production_forward2 grad[82] vs paper_forward: mean_abs=0.4829489290714264, max_abs=3.625, mean_rel=0.14882302284240723, max_rel=704.7686767578125, norm_rel=0.022964585572481155, ref_abs_avg=21.075559616088867, test_abs_avg=21.06969451904297
production_forward2 grad[83] vs paper_forward: mean_abs=0.3761155605316162, max_abs=1.5, mean_rel=0.09347699582576752, max_rel=5.30194091796875, norm_rel=0.021592210978269577, ref_abs_avg=17.47693634033203, test_abs_avg=17.48568344116211
production_forward2 grad[84] vs paper_forward: mean_abs=0.45732173323631287, max_abs=4.5, mean_rel=0.13488547503948212, max_rel=906.8492431640625, norm_rel=0.022247610613703728, ref_abs_avg=20.602462768554688, test_abs_avg=20.602617263793945
production_forward2 grad[85] vs paper_forward: mean_abs=0.4490792751312256, max_abs=4.25, mean_rel=0.1450350284576416, max_rel=1010.3300170898438, norm_rel=0.021963944658637047, ref_abs_avg=20.4941349029541, test_abs_avg=20.496204376220703
production_forward2 grad[86] vs paper_forward: mean_abs=0.3571109175682068, max_abs=1.75, mean_rel=0.07233120501041412, max_rel=4.76149845123291, norm_rel=0.020208731293678284, ref_abs_avg=17.91931915283203, test_abs_avg=17.920764923095703
production_forward2 grad[87] vs paper_forward: mean_abs=0.4244701862335205, max_abs=4.5, mean_rel=0.13685284554958344, max_rel=1075.1412353515625, norm_rel=0.021274497732520103, ref_abs_avg=20.020856857299805, test_abs_avg=20.023029327392578
production_forward2 grad[88] vs paper_forward: mean_abs=0.41146260499954224, max_abs=5.0, mean_rel=0.13272526860237122, max_rel=542.2771606445312, norm_rel=0.02159327082335949, ref_abs_avg=19.176576614379883, test_abs_avg=19.182811737060547
production_forward2 grad[89] vs paper_forward: mean_abs=0.32550013065338135, max_abs=1.375, mean_rel=0.13939771056175232, max_rel=29.77301788330078, norm_rel=0.02093944326043129, ref_abs_avg=15.470318794250488, test_abs_avg=15.490196228027344
production_forward2 grad[90] vs paper_forward: mean_abs=0.394569456577301, max_abs=4.0, mean_rel=0.1258847415447235, max_rel=468.22210693359375, norm_rel=0.020897481590509415, ref_abs_avg=19.005434036254883, test_abs_avg=19.004993438720703
production_forward2 grad[91] vs paper_forward: mean_abs=0.3861503601074219, max_abs=3.5, mean_rel=0.1300344318151474, max_rel=586.2079467773438, norm_rel=0.02072390727698803, ref_abs_avg=18.759857177734375, test_abs_avg=18.765331268310547
production_forward2 grad[92] vs paper_forward: mean_abs=0.329894095659256, max_abs=1.59375, mean_rel=0.1847495138645172, max_rel=49.17251205444336, norm_rel=0.021905014291405678, ref_abs_avg=15.242249488830566, test_abs_avg=15.243606567382812
production_forward2 grad[93] vs paper_forward: mean_abs=0.3779604136943817, max_abs=4.5, mean_rel=0.1236313208937645, max_rel=430.4165954589844, norm_rel=0.02046113647520542, ref_abs_avg=18.678123474121094, test_abs_avg=18.67850112915039
production_forward2 grad[94] vs paper_forward: mean_abs=0.37767866253852844, max_abs=4.5, mean_rel=0.13127218186855316, max_rel=601.8326416015625, norm_rel=0.021006692200899124, ref_abs_avg=18.227066040039062, test_abs_avg=18.226272583007812
production_forward2 grad[95] vs paper_forward: mean_abs=0.2964843511581421, max_abs=1.25, mean_rel=0.14019981026649475, max_rel=13.486502647399902, norm_rel=0.01855626329779625, ref_abs_avg=16.047056198120117, test_abs_avg=16.071279525756836
production_forward2 grad[96] vs paper_forward: mean_abs=0.35376179218292236, max_abs=4.0, mean_rel=0.11970588564872742, max_rel=410.3683166503906, norm_rel=0.020086346194148064, ref_abs_avg=17.884000778198242, test_abs_avg=17.885848999023438
production_forward2 grad[97] vs paper_forward: mean_abs=0.35441190004348755, max_abs=5.0, mean_rel=0.12742292881011963, max_rel=515.4738159179688, norm_rel=0.020695187151432037, ref_abs_avg=17.467761993408203, test_abs_avg=17.47335433959961
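Note on reading the comparison fields: the log does not define its metrics, so the definitions in this sketch are assumptions inferred from the field names (elementwise absolute and relative error, plus a norm-based relative error). The `compare` helper and the sample values are hypothetical. Under these assumptions, it shows how one near-zero reference element can push `max_rel` into the hundreds while `norm_rel` stays near 0.02, the same pattern seen in the lines above.

```python
import math

def compare(ref, test, eps=1e-12):
    """Assumed metric definitions; the script that produced the log may differ."""
    diffs = [abs(t - r) for r, t in zip(ref, test)]
    # Elementwise relative error: blows up wherever the reference is near zero.
    rels = [d / (abs(r) + eps) for d, r in zip(diffs, ref)]
    n = len(ref)
    diff_norm = math.sqrt(sum(d * d for d in diffs))
    ref_norm = math.sqrt(sum(r * r for r in ref))
    return {
        "mean_abs": sum(diffs) / n,
        "max_abs": max(diffs),
        "mean_rel": sum(rels) / n,
        "max_rel": max(rels),
        # Norm-based relative error: dominated by the large-magnitude elements.
        "norm_rel": diff_norm / (ref_norm + eps),
        "ref_abs_avg": sum(abs(r) for r in ref) / n,
        "test_abs_avg": sum(abs(t) for t in test) / n,
    }

# Hypothetical values: three large gradient entries and one near-zero entry.
ref = [20.0, -25.0, 0.001, 30.0]
test = [20.5, -24.5, 0.501, 29.5]
metrics = compare(ref, test)
for name, value in metrics.items():
    print(f"{name}={value}")
```

If the real definitions match, `norm_rel` together with the agreement between `ref_abs_avg` and `test_abs_avg` is a more stable signal than `max_rel`, which a single tiny reference value can dominate.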
identity layers + randn queries
paper_forward fwd+bwd:  380.043 ms
paper_forward bwd-only: 294.326 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.041 GiB, fwd+bwd=32.791 GiB
production_forward fwd+bwd:  109.519 ms
production_forward bwd-only: 89.147 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=6.993 GiB
production_forward peak reserved:  fwd=3.623 GiB, fwd+bwd=8.123 GiB
production_forward2 fwd+bwd:  224.407 ms
production_forward2 bwd-only: 202.244 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.248 GiB, fwd+bwd=8.998 GiB
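Note on the benchmark block above: the sketch below works out each production variant's speedup and peak-memory reduction relative to `paper_forward`. The numbers are copied from the log lines; the `results` layout and the `ratios_vs_baseline` helper are hypothetical names introduced only for this illustration.

```python
# Timings (ms) and fwd+bwd peak allocated memory (GiB) copied from the log above.
results = {
    "paper_forward": {"fwd_bwd_ms": 380.043, "bwd_ms": 294.326, "peak_alloc_gib": 32.120},
    "production_forward": {"fwd_bwd_ms": 109.519, "bwd_ms": 89.147, "peak_alloc_gib": 6.993},
    "production_forward2": {"fwd_bwd_ms": 224.407, "bwd_ms": 202.244, "peak_alloc_gib": 6.243},
}

def ratios_vs_baseline(results, baseline="paper_forward"):
    """Return baseline/variant ratios; values above 1 mean the variant is better."""
    base = results[baseline]
    out = {}
    for name, r in results.items():
        if name == baseline:
            continue
        out[name] = {
            "fwd_bwd_speedup": base["fwd_bwd_ms"] / r["fwd_bwd_ms"],
            "bwd_speedup": base["bwd_ms"] / r["bwd_ms"],
            "peak_alloc_reduction": base["peak_alloc_gib"] / r["peak_alloc_gib"],
        }
    return out

ratios = ratios_vs_baseline(results)
for name, r in ratios.items():
    print(
        f"{name}: fwd+bwd {r['fwd_bwd_speedup']:.2f}x, "
        f"bwd {r['bwd_speedup']:.2f}x, "
        f"peak allocated {r['peak_alloc_reduction']:.2f}x"
    )
```

On these numbers, `production_forward` runs fwd+bwd about 3.5x faster than `paper_forward` and `production_forward2` about 1.7x faster. Both variants cut peak allocated memory by roughly 4.6x to 5.1x.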

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016778269782662392, max_abs=0.046875
production_forward grad[0] vs paper_forward: mean_abs=0.008819693699479103, max_abs=0.4375, mean_rel=0.07509204745292664, max_rel=139.26486206054688, norm_rel=0.020572200417518616, ref_abs_avg=0.46387702226638794, test_abs_avg=0.4638863205909729
production_forward grad[1] vs paper_forward: mean_abs=7.586562156677246, max_abs=64.0, mean_rel=0.12805545330047607, max_rel=92.17593383789062, norm_rel=0.020908750593662262, ref_abs_avg=323.14129638671875, test_abs_avg=323.0251159667969
production_forward grad[2] vs paper_forward: mean_abs=1.2635499238967896, max_abs=5.0, mean_rel=0.11342085897922516, max_rel=5.875558853149414, norm_rel=0.022379117086529732, ref_abs_avg=55.947696685791016, test_abs_avg=56.073638916015625
production_forward grad[3] vs paper_forward: mean_abs=1.6070828437805176, max_abs=10.5, mean_rel=0.18411552906036377, max_rel=3170.081298828125, norm_rel=0.024323169142007828, ref_abs_avg=66.40446472167969, test_abs_avg=66.40675354003906
production_forward grad[4] vs paper_forward: mean_abs=1.5693248510360718, max_abs=10.0, mean_rel=0.15588802099227905, max_rel=1388.3218994140625, norm_rel=0.02412821725010872, ref_abs_avg=65.33675384521484, test_abs_avg=65.32304382324219
production_forward grad[5] vs paper_forward: mean_abs=1.1039643287658691, max_abs=4.75, mean_rel=0.1282111555337906, max_rel=11.158130645751953, norm_rel=0.025462165474891663, ref_abs_avg=45.08155822753906, test_abs_avg=45.07331848144531
production_forward grad[6] vs paper_forward: mean_abs=1.4103296995162964, max_abs=9.0, mean_rel=0.1653684377670288, max_rel=2613.896728515625, norm_rel=0.024170441552996635, ref_abs_avg=58.60868835449219, test_abs_avg=58.6124267578125
production_forward grad[7] vs paper_forward: mean_abs=1.3710575103759766, max_abs=9.0, mean_rel=0.16893434524536133, max_rel=1723.8411865234375, norm_rel=0.023778121918439865, ref_abs_avg=57.94591522216797, test_abs_avg=57.95030975341797
production_forward grad[8] vs paper_forward: mean_abs=0.9948415756225586, max_abs=4.0, mean_rel=0.06365146487951279, max_rel=1.8111305236816406, norm_rel=0.022501150146126747, ref_abs_avg=43.88523864746094, test_abs_avg=43.824798583984375
production_forward grad[9] vs paper_forward: mean_abs=1.2743765115737915, max_abs=8.5, mean_rel=0.1574891060590744, max_rel=1727.8306884765625, norm_rel=0.023780353367328644, ref_abs_avg=53.84579849243164, test_abs_avg=53.84616470336914
production_forward grad[10] vs paper_forward: mean_abs=1.246034026145935, max_abs=8.75, mean_rel=0.15244004130363464, max_rel=1236.4903564453125, norm_rel=0.023581555113196373, ref_abs_avg=53.052276611328125, test_abs_avg=53.051387786865234
production_forward grad[11] vs paper_forward: mean_abs=0.9083021879196167, max_abs=3.5, mean_rel=0.11850842833518982, max_rel=7.854800701141357, norm_rel=0.021583544090390205, ref_abs_avg=42.190147399902344, test_abs_avg=42.249263763427734
production_forward grad[12] vs paper_forward: mean_abs=1.185396671295166, max_abs=8.4375, mean_rel=0.16688212752342224, max_rel=1337.0030517578125, norm_rel=0.02371716871857643, ref_abs_avg=50.194358825683594, test_abs_avg=50.195106506347656
production_forward grad[13] vs paper_forward: mean_abs=1.1576623916625977, max_abs=7.0, mean_rel=0.17933207750320435, max_rel=1264.1302490234375, norm_rel=0.023511553183197975, ref_abs_avg=49.51398468017578, test_abs_avg=49.51616668701172
production_forward grad[14] vs paper_forward: mean_abs=0.9151077270507812, max_abs=3.5625, mean_rel=0.11709016561508179, max_rel=10.003653526306152, norm_rel=0.02293763868510723, ref_abs_avg=39.41645812988281, test_abs_avg=39.45381546020508
production_forward grad[15] vs paper_forward: mean_abs=1.112140417098999, max_abs=6.875, mean_rel=0.16556037962436676, max_rel=1913.2584228515625, norm_rel=0.02353483997285366, ref_abs_avg=47.4984130859375, test_abs_avg=47.50520324707031
production_forward grad[16] vs paper_forward: mean_abs=1.0852138996124268, max_abs=6.5, mean_rel=0.17462284862995148, max_rel=2008.98974609375, norm_rel=0.02312658540904522, ref_abs_avg=47.162784576416016, test_abs_avg=47.16891860961914
production_forward grad[17] vs paper_forward: mean_abs=0.8535127639770508, max_abs=3.5, mean_rel=0.07217654585838318, max_rel=2.7577335834503174, norm_rel=0.02251710183918476, ref_abs_avg=37.55598449707031, test_abs_avg=37.61599349975586
production_forward grad[18] vs paper_forward: mean_abs=1.0502619743347168, max_abs=6.5, mean_rel=0.15490204095840454, max_rel=936.025634765625, norm_rel=0.023469217121601105, ref_abs_avg=44.94068908691406, test_abs_avg=44.94379425048828
production_forward grad[19] vs paper_forward: mean_abs=1.0283818244934082, max_abs=6.625, mean_rel=0.16458609700202942, max_rel=1499.8038330078125, norm_rel=0.023188965395092964, ref_abs_avg=44.558807373046875, test_abs_avg=44.56409454345703
production_forward grad[20] vs paper_forward: mean_abs=0.8305926322937012, max_abs=3.125, mean_rel=0.2649461627006531, max_rel=94.62898254394531, norm_rel=0.025120770558714867, ref_abs_avg=32.7713737487793, test_abs_avg=32.78209686279297
production_forward grad[21] vs paper_forward: mean_abs=0.9955532550811768, max_abs=6.25, mean_rel=0.15487945079803467, max_rel=2240.824462890625, norm_rel=0.02323434315621853, ref_abs_avg=43.03125, test_abs_avg=43.03499221801758
production_forward grad[22] vs paper_forward: mean_abs=0.9649022221565247, max_abs=6.0, mean_rel=0.15222813189029694, max_rel=837.2147827148438, norm_rel=0.022886106744408607, ref_abs_avg=42.38042449951172, test_abs_avg=42.38554382324219
production_forward grad[23] vs paper_forward: mean_abs=0.7685461044311523, max_abs=2.921875, mean_rel=0.12269462645053864, max_rel=10.019996643066406, norm_rel=0.022508947178721428, ref_abs_avg=34.010292053222656, test_abs_avg=34.016902923583984
production_forward grad[24] vs paper_forward: mean_abs=0.9449782371520996, max_abs=5.59375, mean_rel=0.14961063861846924, max_rel=1410.829833984375, norm_rel=0.023016495630145073, ref_abs_avg=41.21314239501953, test_abs_avg=41.21504211425781
production_forward grad[25] vs paper_forward: mean_abs=0.921478271484375, max_abs=5.75, mean_rel=0.17557589709758759, max_rel=3005.81689453125, norm_rel=0.022695442661643028, ref_abs_avg=40.75272750854492, test_abs_avg=40.75279998779297
production_forward grad[26] vs paper_forward: mean_abs=0.8969619274139404, max_abs=3.65625, mean_rel=0.19636185467243195, max_rel=42.25633239746094, norm_rel=0.025371260941028595, ref_abs_avg=36.071311950683594, test_abs_avg=36.04569625854492
production_forward grad[27] vs paper_forward: mean_abs=1.1017507314682007, max_abs=7.0, mean_rel=0.17662173509597778, max_rel=1301.482666015625, norm_rel=0.025120750069618225, ref_abs_avg=44.042938232421875, test_abs_avg=44.04182052612305
production_forward grad[28] vs paper_forward: mean_abs=1.0740363597869873, max_abs=6.75, mean_rel=0.19040173292160034, max_rel=1944.554931640625, norm_rel=0.02497570589184761, ref_abs_avg=43.18413543701172, test_abs_avg=43.18840026855469
production_forward grad[29] vs paper_forward: mean_abs=0.7813234329223633, max_abs=3.25, mean_rel=0.08566887676715851, max_rel=2.8617825508117676, norm_rel=0.024945925921201706, ref_abs_avg=31.788360595703125, test_abs_avg=31.817119598388672
production_forward grad[30] vs paper_forward: mean_abs=1.0071847438812256, max_abs=6.5, mean_rel=0.1860036551952362, max_rel=1900.1812744140625, norm_rel=0.025271860882639885, ref_abs_avg=40.00803756713867, test_abs_avg=40.01084899902344
production_forward grad[31] vs paper_forward: mean_abs=0.9914898872375488, max_abs=6.0, mean_rel=0.16562563180923462, max_rel=988.0288696289062, norm_rel=0.02523217536509037, ref_abs_avg=39.439170837402344, test_abs_avg=39.43986511230469
production_forward grad[32] vs paper_forward: mean_abs=0.8230247497558594, max_abs=3.125, mean_rel=0.2761451005935669, max_rel=47.27593994140625, norm_rel=0.02570999227464199, ref_abs_avg=31.269487380981445, test_abs_avg=31.2493896484375
production_forward grad[33] vs paper_forward: mean_abs=0.9513914585113525, max_abs=6.0, mean_rel=0.16889359056949615, max_rel=2493.821533203125, norm_rel=0.02528080902993679, ref_abs_avg=37.747589111328125, test_abs_avg=37.74858856201172
production_forward grad[34] vs paper_forward: mean_abs=0.9328591227531433, max_abs=6.0, mean_rel=0.18051281571388245, max_rel=1701.02978515625, norm_rel=0.024845220148563385, ref_abs_avg=37.63447952270508, test_abs_avg=37.637638092041016
production_forward grad[35] vs paper_forward: mean_abs=0.7293024063110352, max_abs=3.0, mean_rel=0.15578970313072205, max_rel=15.50261116027832, norm_rel=0.026380449533462524, ref_abs_avg=26.658363342285156, test_abs_avg=26.68994903564453
production_forward grad[36] vs paper_forward: mean_abs=0.8931478261947632, max_abs=6.0, mean_rel=0.16154265403747559, max_rel=1798.9300537109375, norm_rel=0.02511173114180565, ref_abs_avg=35.686065673828125, test_abs_avg=35.687652587890625
production_forward grad[37] vs paper_forward: mean_abs=0.8767743110656738, max_abs=5.375, mean_rel=0.16346053779125214, max_rel=923.897216796875, norm_rel=0.024854028597474098, ref_abs_avg=35.336647033691406, test_abs_avg=35.33909225463867
production_forward grad[38] vs paper_forward: mean_abs=0.6994175910949707, max_abs=2.75, mean_rel=0.1572263538837433, max_rel=27.69024658203125, norm_rel=0.025623442605137825, ref_abs_avg=26.868736267089844, test_abs_avg=26.871389389038086
production_forward grad[39] vs paper_forward: mean_abs=0.83624267578125, max_abs=5.5, mean_rel=0.16502021253108978, max_rel=1167.340576171875, norm_rel=0.024664951488375664, ref_abs_avg=33.95132827758789, test_abs_avg=33.953880310058594
production_forward grad[40] vs paper_forward: mean_abs=0.8259384632110596, max_abs=5.375, mean_rel=0.17193010449409485, max_rel=876.2930297851562, norm_rel=0.024652842432260513, ref_abs_avg=33.59394073486328, test_abs_avg=33.59407043457031
production_forward grad[41] vs paper_forward: mean_abs=0.6347527503967285, max_abs=2.875, mean_rel=0.12384510040283203, max_rel=27.265396118164062, norm_rel=0.024163231253623962, ref_abs_avg=26.63653564453125, test_abs_avg=26.661226272583008
production_forward grad[42] vs paper_forward: mean_abs=0.7938330769538879, max_abs=5.5, mean_rel=0.16257303953170776, max_rel=1100.0853271484375, norm_rel=0.02467314712703228, ref_abs_avg=32.262332916259766, test_abs_avg=32.26405715942383
production_forward grad[43] vs paper_forward: mean_abs=0.783516526222229, max_abs=5.5, mean_rel=0.1609129011631012, max_rel=964.6288452148438, norm_rel=0.024444347247481346, ref_abs_avg=32.127777099609375, test_abs_avg=32.1337890625
production_forward grad[44] vs paper_forward: mean_abs=0.6169290542602539, max_abs=2.25, mean_rel=0.09716498851776123, max_rel=5.841294288635254, norm_rel=0.024402368813753128, ref_abs_avg=25.159400939941406, test_abs_avg=25.219350814819336
production_forward grad[45] vs paper_forward: mean_abs=0.7562582492828369, max_abs=5.0, mean_rel=0.16817383468151093, max_rel=2852.400390625, norm_rel=0.02414802461862564, ref_abs_avg=31.398605346679688, test_abs_avg=31.40009307861328
production_forward grad[46] vs paper_forward: mean_abs=0.7418948411941528, max_abs=4.625, mean_rel=0.1512179672718048, max_rel=997.5658569335938, norm_rel=0.02387986145913601, ref_abs_avg=31.15121841430664, test_abs_avg=31.15201187133789
production_forward grad[47] vs paper_forward: mean_abs=0.5835480690002441, max_abs=2.5, mean_rel=0.10834508389234543, max_rel=5.540256977081299, norm_rel=0.02339058928191662, ref_abs_avg=24.93267059326172, test_abs_avg=24.914438247680664
production_forward grad[48] vs paper_forward: mean_abs=0.7196254134178162, max_abs=4.75, mean_rel=0.1611986756324768, max_rel=1271.93505859375, norm_rel=0.023936133831739426, ref_abs_avg=30.127859115600586, test_abs_avg=30.128524780273438
production_forward grad[49] vs paper_forward: mean_abs=0.7111508846282959, max_abs=5.0, mean_rel=0.1488180160522461, max_rel=852.3567504882812, norm_rel=0.023738063871860504, ref_abs_avg=29.959205627441406, test_abs_avg=29.956697463989258
production_forward grad[50] vs paper_forward: mean_abs=0.6406164169311523, max_abs=3.0, mean_rel=0.08557412773370743, max_rel=6.160498142242432, norm_rel=0.026941118761897087, ref_abs_avg=24.716670989990234, test_abs_avg=24.672115325927734
production_forward grad[51] vs paper_forward: mean_abs=0.8062909841537476, max_abs=5.875, mean_rel=0.1743355244398117, max_rel=1682.355224609375, norm_rel=0.02568240463733673, ref_abs_avg=31.45572280883789, test_abs_avg=31.45560073852539
production_forward grad[52] vs paper_forward: mean_abs=0.7940571904182434, max_abs=5.5, mean_rel=0.16602730751037598, max_rel=1017.555908203125, norm_rel=0.025494802743196487, ref_abs_avg=31.182554244995117, test_abs_avg=31.178693771362305
production_forward grad[53] vs paper_forward: mean_abs=0.5540082454681396, max_abs=2.5, mean_rel=0.1420055329799652, max_rel=15.892372131347656, norm_rel=0.023448986932635307, ref_abs_avg=24.072540283203125, test_abs_avg=24.082050323486328
production_forward grad[54] vs paper_forward: mean_abs=0.7498501539230347, max_abs=5.75, mean_rel=0.15689152479171753, max_rel=921.072021484375, norm_rel=0.025335466489195824, ref_abs_avg=29.653017044067383, test_abs_avg=29.654102325439453
production_forward grad[55] vs paper_forward: mean_abs=0.7295956015586853, max_abs=6.0, mean_rel=0.1716885268688202, max_rel=1619.2105712890625, norm_rel=0.025187857449054718, ref_abs_avg=29.00450325012207, test_abs_avg=29.0072078704834
production_forward grad[56] vs paper_forward: mean_abs=0.5587315559387207, max_abs=2.0, mean_rel=0.11866086721420288, max_rel=13.427159309387207, norm_rel=0.023934198543429375, ref_abs_avg=23.128978729248047, test_abs_avg=23.12603759765625
production_forward grad[57] vs paper_forward: mean_abs=0.6947517991065979, max_abs=5.0, mean_rel=0.17126679420471191, max_rel=1145.6190185546875, norm_rel=0.02489383891224861, ref_abs_avg=27.942779541015625, test_abs_avg=27.942302703857422
production_forward grad[58] vs paper_forward: mean_abs=0.6790505647659302, max_abs=4.5, mean_rel=0.15952080488204956, max_rel=592.4072875976562, norm_rel=0.02469208836555481, ref_abs_avg=27.49941635131836, test_abs_avg=27.496112823486328
production_forward grad[59] vs paper_forward: mean_abs=0.5125529766082764, max_abs=2.0625, mean_rel=0.17824706435203552, max_rel=21.497621536254883, norm_rel=0.023282496258616447, ref_abs_avg=22.239093780517578, test_abs_avg=22.223264694213867
production_forward grad[60] vs paper_forward: mean_abs=0.6447656750679016, max_abs=4.5, mean_rel=0.15279264748096466, max_rel=993.5814819335938, norm_rel=0.02446771413087845, ref_abs_avg=26.372915267944336, test_abs_avg=26.37217903137207
production_forward grad[61] vs paper_forward: mean_abs=0.6328056454658508, max_abs=4.625, mean_rel=0.1608973890542984, max_rel=596.15234375, norm_rel=0.02458237297832966, ref_abs_avg=25.756221771240234, test_abs_avg=25.751686096191406
production_forward grad[62] vs paper_forward: mean_abs=0.5186702013015747, max_abs=2.125, mean_rel=0.6792846322059631, max_rel=280.0923767089844, norm_rel=0.024034850299358368, ref_abs_avg=21.511198043823242, test_abs_avg=21.483963012695312
production_forward grad[63] vs paper_forward: mean_abs=0.6126201152801514, max_abs=4.1875, mean_rel=0.15641307830810547, max_rel=924.3482666015625, norm_rel=0.023826275020837784, ref_abs_avg=25.671436309814453, test_abs_avg=25.673437118530273
production_forward grad[64] vs paper_forward: mean_abs=0.5931767821311951, max_abs=5.84375, mean_rel=0.15670067071914673, max_rel=1103.5640869140625, norm_rel=0.02383148856461048, ref_abs_avg=24.893402099609375, test_abs_avg=24.889667510986328
production_forward grad[65] vs paper_forward: mean_abs=0.4563635587692261, max_abs=2.0, mean_rel=0.23925632238388062, max_rel=69.60071563720703, norm_rel=0.022694693878293037, ref_abs_avg=20.505937576293945, test_abs_avg=20.51739501953125
production_forward grad[66] vs paper_forward: mean_abs=0.5810115337371826, max_abs=4.0, mean_rel=0.15859416127204895, max_rel=1161.08544921875, norm_rel=0.023739414289593697, ref_abs_avg=24.471420288085938, test_abs_avg=24.472986221313477
production_forward grad[67] vs paper_forward: mean_abs=0.5707640051841736, max_abs=4.125, mean_rel=0.153143048286438, max_rel=1322.326904296875, norm_rel=0.02373657375574112, ref_abs_avg=24.01433753967285, test_abs_avg=24.018524169921875
production_forward grad[68] vs paper_forward: mean_abs=0.4536590576171875, max_abs=1.5, mean_rel=0.11241434514522552, max_rel=7.1605119705200195, norm_rel=0.02367202192544937, ref_abs_avg=18.815624237060547, test_abs_avg=18.812612533569336
production_forward grad[69] vs paper_forward: mean_abs=0.5479117035865784, max_abs=4.0, mean_rel=0.14993616938591003, max_rel=843.5813598632812, norm_rel=0.02336709015071392, ref_abs_avg=23.454010009765625, test_abs_avg=23.454219818115234
production_forward grad[70] vs paper_forward: mean_abs=0.5326708555221558, max_abs=4.5, mean_rel=0.13873037695884705, max_rel=654.3226318359375, norm_rel=0.023079028353095055, ref_abs_avg=23.15987777709961, test_abs_avg=23.16299819946289
production_forward grad[71] vs paper_forward: mean_abs=0.4192502498626709, max_abs=1.734375, mean_rel=0.15212072432041168, max_rel=17.564373016357422, norm_rel=0.02139931172132492, ref_abs_avg=19.682703018188477, test_abs_avg=19.684383392333984
production_forward grad[72] vs paper_forward: mean_abs=0.526160717010498, max_abs=3.6630859375, mean_rel=0.15099585056304932, max_rel=679.486328125, norm_rel=0.02309238724410534, ref_abs_avg=22.78891372680664, test_abs_avg=22.789636611938477
production_forward grad[73] vs paper_forward: mean_abs=0.5165937542915344, max_abs=4.75, mean_rel=0.15386436879634857, max_rel=787.1521606445312, norm_rel=0.022614486515522003, ref_abs_avg=22.737306594848633, test_abs_avg=22.743106842041016
production_forward grad[74] vs paper_forward: mean_abs=0.4679412841796875, max_abs=1.75, mean_rel=0.10484075546264648, max_rel=6.0710954666137695, norm_rel=0.022891728207468987, ref_abs_avg=19.66635513305664, test_abs_avg=19.69414520263672
production_forward grad[75] vs paper_forward: mean_abs=0.5969364643096924, max_abs=4.640625, mean_rel=0.1589103639125824, max_rel=1052.7681884765625, norm_rel=0.02396290935575962, ref_abs_avg=24.94350814819336, test_abs_avg=24.944686889648438
production_forward grad[76] vs paper_forward: mean_abs=0.5872565507888794, max_abs=5.0, mean_rel=0.15215586125850677, max_rel=1029.013427734375, norm_rel=0.023913051933050156, ref_abs_avg=24.58226776123047, test_abs_avg=24.57859230041504
production_forward grad[77] vs paper_forward: mean_abs=0.47649192810058594, max_abs=2.203125, mean_rel=0.08686794340610504, max_rel=5.71843147277832, norm_rel=0.02372872829437256, ref_abs_avg=20.26376724243164, test_abs_avg=20.290122985839844
production_forward grad[78] vs paper_forward: mean_abs=0.5584374666213989, max_abs=5.5, mean_rel=0.1496906578540802, max_rel=774.9761962890625, norm_rel=0.023402195423841476, ref_abs_avg=23.8387451171875, test_abs_avg=23.83931541442871
production_forward grad[79] vs paper_forward: mean_abs=0.5355476140975952, max_abs=5.0, mean_rel=0.14947587251663208, max_rel=1188.6552734375, norm_rel=0.022820906713604927, ref_abs_avg=23.36388397216797, test_abs_avg=23.367496490478516
production_forward grad[80] vs paper_forward: mean_abs=0.42055368423461914, max_abs=1.5, mean_rel=0.08728188276290894, max_rel=7.7403717041015625, norm_rel=0.022435162216424942, ref_abs_avg=18.79374122619629, test_abs_avg=18.787403106689453
production_forward grad[81] vs paper_forward: mean_abs=0.5126689076423645, max_abs=4.0, mean_rel=0.1496407836675644, max_rel=911.40576171875, norm_rel=0.022882752120494843, ref_abs_avg=22.425045013427734, test_abs_avg=22.425851821899414
production_forward grad[82] vs paper_forward: mean_abs=0.4998708665370941, max_abs=5.0, mean_rel=0.14569979906082153, max_rel=858.4813842773438, norm_rel=0.022747213020920753, ref_abs_avg=22.033357620239258, test_abs_avg=22.038545608520508
production_forward grad[83] vs paper_forward: mean_abs=0.3764770030975342, max_abs=1.5625, mean_rel=0.1863240897655487, max_rel=43.25996017456055, norm_rel=0.020805615931749344, ref_abs_avg=18.3857479095459, test_abs_avg=18.357568740844727
production_forward grad[84] vs paper_forward: mean_abs=0.47490134835243225, max_abs=4.0, mean_rel=0.14226862788200378, max_rel=562.0394897460938, norm_rel=0.022234098985791206, ref_abs_avg=21.393634796142578, test_abs_avg=21.393817901611328
production_forward grad[85] vs paper_forward: mean_abs=0.46249157190322876, max_abs=4.837890625, mean_rel=0.13115113973617554, max_rel=338.83184814453125, norm_rel=0.021642467007040977, ref_abs_avg=21.269319534301758, test_abs_avg=21.283645629882812
production_forward grad[86] vs paper_forward: mean_abs=0.34887218475341797, max_abs=1.5, mean_rel=0.0875176414847374, max_rel=6.090692520141602, norm_rel=0.019844479858875275, ref_abs_avg=17.65019989013672, test_abs_avg=17.625280380249023
production_forward grad[87] vs paper_forward: mean_abs=0.4512318968772888, max_abs=4.0, mean_rel=0.14201313257217407, max_rel=758.631103515625, norm_rel=0.0216791033744812, ref_abs_avg=20.87883758544922, test_abs_avg=20.879291534423828
production_forward grad[88] vs paper_forward: mean_abs=0.4350760877132416, max_abs=5.5, mean_rel=0.14640003442764282, max_rel=886.7650756835938, norm_rel=0.021134043112397194, ref_abs_avg=20.622474670410156, test_abs_avg=20.62122344970703
production_forward grad[89] vs paper_forward: mean_abs=0.3702775239944458, max_abs=1.5, mean_rel=0.13814669847488403, max_rel=34.102630615234375, norm_rel=0.02230883575975895, ref_abs_avg=16.772802352905273, test_abs_avg=16.759769439697266
production_forward grad[90] vs paper_forward: mean_abs=0.42502880096435547, max_abs=4.5, mean_rel=0.13182175159454346, max_rel=553.2325439453125, norm_rel=0.021179579198360443, ref_abs_avg=20.191429138183594, test_abs_avg=20.19194221496582
production_forward grad[91] vs paper_forward: mean_abs=0.41332709789276123, max_abs=6.0, mean_rel=0.12846998870372772, max_rel=503.3224792480469, norm_rel=0.021512698382139206, ref_abs_avg=19.39676856994629, test_abs_avg=19.396015167236328
production_forward grad[92] vs paper_forward: mean_abs=0.3428843021392822, max_abs=1.375, mean_rel=0.12410199642181396, max_rel=35.99085998535156, norm_rel=0.022327935323119164, ref_abs_avg=15.752962112426758, test_abs_avg=15.772756576538086
production_forward grad[93] vs paper_forward: mean_abs=0.39308029413223267, max_abs=4.25, mean_rel=0.13368859887123108, max_rel=657.728271484375, norm_rel=0.020789097994565964, ref_abs_avg=19.065410614013672, test_abs_avg=19.066123962402344
production_forward grad[94] vs paper_forward: mean_abs=0.3871277868747711, max_abs=4.5, mean_rel=0.13538450002670288, max_rel=544.2405395507812, norm_rel=0.02045116201043129, ref_abs_avg=19.090248107910156, test_abs_avg=19.07848358154297
production_forward grad[95] vs paper_forward: mean_abs=0.31035566329956055, max_abs=1.25, mean_rel=0.09271592646837234, max_rel=12.044967651367188, norm_rel=0.02042246051132679, ref_abs_avg=15.079561233520508, test_abs_avg=15.088156700134277
production_forward grad[96] vs paper_forward: mean_abs=0.36988577246665955, max_abs=3.5, mean_rel=0.12349102646112442, max_rel=590.1181640625, norm_rel=0.020007997751235962, ref_abs_avg=18.746349334716797, test_abs_avg=18.74555015563965
production_forward grad[97] vs paper_forward: mean_abs=0.3568822145462036, max_abs=4.125, mean_rel=0.11156591773033142, max_rel=354.5289306640625, norm_rel=0.019179247319698334, ref_abs_avg=18.83529281616211, test_abs_avg=18.816463470458984
production_forward2 vs paper_forward output: mean_abs=0.0016778269782662392, max_abs=0.046875
production_forward2 grad[0] vs paper_forward: mean_abs=0.008947137743234634, max_abs=0.4375, mean_rel=0.07609616219997406, max_rel=148.07289123535156, norm_rel=0.02083456888794899, ref_abs_avg=0.46387702226638794, test_abs_avg=0.46387583017349243
production_forward2 grad[1] vs paper_forward: mean_abs=7.632449150085449, max_abs=64.0, mean_rel=0.13311901688575745, max_rel=97.42778778076172, norm_rel=0.02101024053990841, ref_abs_avg=323.14129638671875, test_abs_avg=322.9778747558594
production_forward2 grad[2] vs paper_forward: mean_abs=1.2976312637329102, max_abs=4.75, mean_rel=0.12266647815704346, max_rel=8.559673309326172, norm_rel=0.023116538301110268, ref_abs_avg=55.947696685791016, test_abs_avg=56.02701950073242
production_forward2 grad[3] vs paper_forward: mean_abs=1.6243282556533813, max_abs=10.5, mean_rel=0.1874522864818573, max_rel=2855.924560546875, norm_rel=0.024562979117035866, ref_abs_avg=66.40446472167969, test_abs_avg=66.40575408935547
production_forward2 grad[4] vs paper_forward: mean_abs=1.5851123332977295, max_abs=10.0, mean_rel=0.15465697646141052, max_rel=835.5953979492188, norm_rel=0.02436051517724991, ref_abs_avg=65.33675384521484, test_abs_avg=65.32095336914062
production_forward2 grad[5] vs paper_forward: mean_abs=1.126817226409912, max_abs=4.5, mean_rel=0.13780143857002258, max_rel=9.614324569702148, norm_rel=0.025336092337965965, ref_abs_avg=45.08155822753906, test_abs_avg=45.019859313964844
production_forward2 grad[6] vs paper_forward: mean_abs=1.4221540689468384, max_abs=9.0, mean_rel=0.16344638168811798, max_rel=2351.289794921875, norm_rel=0.024372786283493042, ref_abs_avg=58.60868835449219, test_abs_avg=58.61137390136719
production_forward2 grad[7] vs paper_forward: mean_abs=1.38873291015625, max_abs=9.0, mean_rel=0.16092173755168915, max_rel=1638.0897216796875, norm_rel=0.024079836905002594, ref_abs_avg=57.94591522216797, test_abs_avg=57.94554138183594
production_forward2 grad[8] vs paper_forward: mean_abs=1.0180482864379883, max_abs=3.8125, mean_rel=0.06720961630344391, max_rel=2.9783270359039307, norm_rel=0.022938061505556107, ref_abs_avg=43.88523864746094, test_abs_avg=43.79988098144531
production_forward2 grad[9] vs paper_forward: mean_abs=1.287092924118042, max_abs=8.25, mean_rel=0.1569092869758606, max_rel=1374.75, norm_rel=0.023998526856303215, ref_abs_avg=53.84579849243164, test_abs_avg=53.84255599975586
production_forward2 grad[10] vs paper_forward: mean_abs=1.2599215507507324, max_abs=8.5, mean_rel=0.15621836483478546, max_rel=985.4058837890625, norm_rel=0.023828888311982155, ref_abs_avg=53.052276611328125, test_abs_avg=53.05059051513672
production_forward2 grad[11] vs paper_forward: mean_abs=0.9041886329650879, max_abs=3.8125, mean_rel=0.1253347098827362, max_rel=13.816073417663574, norm_rel=0.021775757893919945, ref_abs_avg=42.190147399902344, test_abs_avg=42.28386688232422
production_forward2 grad[12] vs paper_forward: mean_abs=1.1965094804763794, max_abs=8.0, mean_rel=0.16240674257278442, max_rel=1302.9508056640625, norm_rel=0.023939259350299835, ref_abs_avg=50.194358825683594, test_abs_avg=50.19330978393555
production_forward2 grad[13] vs paper_forward: mean_abs=1.1680238246917725, max_abs=7.25, mean_rel=0.1863836944103241, max_rel=1388.7391357421875, norm_rel=0.023732466623187065, ref_abs_avg=49.51398468017578, test_abs_avg=49.51332092285156
production_forward2 grad[14] vs paper_forward: mean_abs=0.9459865093231201, max_abs=3.5, mean_rel=0.1146206259727478, max_rel=10.579020500183105, norm_rel=0.02339955046772957, ref_abs_avg=39.41645812988281, test_abs_avg=39.471153259277344
production_forward2 grad[15] vs paper_forward: mean_abs=1.1231718063354492, max_abs=7.0, mean_rel=0.16368669271469116, max_rel=1623.4659423828125, norm_rel=0.023749615997076035, ref_abs_avg=47.4984130859375, test_abs_avg=47.50315475463867
production_forward2 grad[16] vs paper_forward: mean_abs=1.097033977508545, max_abs=6.75, mean_rel=0.16807019710540771, max_rel=1396.1263427734375, norm_rel=0.023381773382425308, ref_abs_avg=47.162784576416016, test_abs_avg=47.16590118408203
production_forward2 grad[17] vs paper_forward: mean_abs=0.8530998229980469, max_abs=3.78125, mean_rel=0.08167646825313568, max_rel=3.2931876182556152, norm_rel=0.022639954462647438, ref_abs_avg=37.55598449707031, test_abs_avg=37.622379302978516
production_forward2 grad[18] vs paper_forward: mean_abs=1.0608465671539307, max_abs=7.0, mean_rel=0.1560632586479187, max_rel=1455.234619140625, norm_rel=0.02370384894311428, ref_abs_avg=44.94068908691406, test_abs_avg=44.94202423095703
production_forward2 grad[19] vs paper_forward: mean_abs=1.0382790565490723, max_abs=6.5, mean_rel=0.16064532101154327, max_rel=575.6215209960938, norm_rel=0.023428387939929962, ref_abs_avg=44.558807373046875, test_abs_avg=44.5624885559082
production_forward2 grad[20] vs paper_forward: mean_abs=0.8124494552612305, max_abs=3.5625, mean_rel=0.2950285077095032, max_rel=117.39356231689453, norm_rel=0.025034217163920403, ref_abs_avg=32.7713737487793, test_abs_avg=32.770469665527344
production_forward2 grad[21] vs paper_forward: mean_abs=1.0044255256652832, max_abs=7.0, mean_rel=0.15420794486999512, max_rel=1571.1837158203125, norm_rel=0.023455297574400902, ref_abs_avg=43.03125, test_abs_avg=43.03330612182617
production_forward2 grad[22] vs paper_forward: mean_abs=0.9726443290710449, max_abs=6.0, mean_rel=0.15278467535972595, max_rel=880.7026977539062, norm_rel=0.02308214083313942, ref_abs_avg=42.38042449951172, test_abs_avg=42.38111114501953
production_forward2 grad[23] vs paper_forward: mean_abs=0.7877998352050781, max_abs=3.0, mean_rel=0.11589589715003967, max_rel=8.773568153381348, norm_rel=0.02309684455394745, ref_abs_avg=34.010292053222656, test_abs_avg=34.01513671875
production_forward2 grad[24] vs paper_forward: mean_abs=0.9531131982803345, max_abs=6.09375, mean_rel=0.15424199402332306, max_rel=1406.70947265625, norm_rel=0.023224469274282455, ref_abs_avg=41.21314239501953, test_abs_avg=41.212955474853516
production_forward2 grad[25] vs paper_forward: mean_abs=0.9290642738342285, max_abs=5.5, mean_rel=0.1776646375656128, max_rel=2924.946533203125, norm_rel=0.022887347266077995, ref_abs_avg=40.75272750854492, test_abs_avg=40.75263214111328
production_forward2 grad[26] vs paper_forward: mean_abs=0.8815205097198486, max_abs=4.0, mean_rel=0.2591906487941742, max_rel=68.50605010986328, norm_rel=0.025014804676175117, ref_abs_avg=36.071311950683594, test_abs_avg=36.02732467651367
production_forward2 grad[27] vs paper_forward: mean_abs=1.1092555522918701, max_abs=7.5625, mean_rel=0.17893153429031372, max_rel=1655.45703125, norm_rel=0.025281384587287903, ref_abs_avg=44.042938232421875, test_abs_avg=44.04054260253906
production_forward2 grad[28] vs paper_forward: mean_abs=1.0814628601074219, max_abs=7.0, mean_rel=0.18959683179855347, max_rel=1985.5001220703125, norm_rel=0.025161102414131165, ref_abs_avg=43.18413543701172, test_abs_avg=43.18572235107422
production_forward2 grad[29] vs paper_forward: mean_abs=0.797297477722168, max_abs=2.75, mean_rel=0.09095705300569534, max_rel=3.5519845485687256, norm_rel=0.02516954392194748, ref_abs_avg=31.788360595703125, test_abs_avg=31.84025001525879
production_forward2 grad[30] vs paper_forward: mean_abs=1.0138146877288818, max_abs=6.5, mean_rel=0.18173834681510925, max_rel=1627.0716552734375, norm_rel=0.025416048243641853, ref_abs_avg=40.00803756713867, test_abs_avg=40.00938415527344
production_forward2 grad[31] vs paper_forward: mean_abs=0.9957993030548096, max_abs=6.5, mean_rel=0.1703643649816513, max_rel=1004.7556762695312, norm_rel=0.02533913403749466, ref_abs_avg=39.439170837402344, test_abs_avg=39.43889236450195
production_forward2 grad[32] vs paper_forward: mean_abs=0.8248996734619141, max_abs=3.375, mean_rel=0.26946359872817993, max_rel=53.040164947509766, norm_rel=0.025799017399549484, ref_abs_avg=31.269487380981445, test_abs_avg=31.245098114013672
production_forward2 grad[33] vs paper_forward: mean_abs=0.9566731452941895, max_abs=6.0, mean_rel=0.16992038488388062, max_rel=2133.263427734375, norm_rel=0.025429997593164444, ref_abs_avg=37.747589111328125, test_abs_avg=37.74798583984375
production_forward2 grad[34] vs paper_forward: mean_abs=0.9393116235733032, max_abs=6.0, mean_rel=0.18167373538017273, max_rel=1297.35107421875, norm_rel=0.025016438215970993, ref_abs_avg=37.63447952270508, test_abs_avg=37.63587951660156
production_forward2 grad[35] vs paper_forward: mean_abs=0.7054510116577148, max_abs=2.75, mean_rel=0.1260165125131607, max_rel=13.952350616455078, norm_rel=0.02587118372321129, ref_abs_avg=26.658363342285156, test_abs_avg=26.690065383911133
production_forward2 grad[36] vs paper_forward: mean_abs=0.8993678092956543, max_abs=6.0, mean_rel=0.16266457736492157, max_rel=1756.60986328125, norm_rel=0.025261322036385536, ref_abs_avg=35.686065673828125, test_abs_avg=35.686485290527344
production_forward2 grad[37] vs paper_forward: mean_abs=0.8815248012542725, max_abs=5.5, mean_rel=0.16233614087104797, max_rel=859.4014892578125, norm_rel=0.024994421750307083, ref_abs_avg=35.336647033691406, test_abs_avg=35.33600616455078
production_forward2 grad[38] vs paper_forward: mean_abs=0.7211074829101562, max_abs=2.75, mean_rel=0.17632146179676056, max_rel=34.462894439697266, norm_rel=0.02674369141459465, ref_abs_avg=26.868736267089844, test_abs_avg=26.870229721069336
production_forward2 grad[39] vs paper_forward: mean_abs=0.8399879336357117, max_abs=5.25, mean_rel=0.16464465856552124, max_rel=1087.0616455078125, norm_rel=0.024790246039628983, ref_abs_avg=33.95132827758789, test_abs_avg=33.95314025878906
production_forward2 grad[40] vs paper_forward: mean_abs=0.8306870460510254, max_abs=5.75, mean_rel=0.17281393706798553, max_rel=927.2200927734375, norm_rel=0.024802180007100105, ref_abs_avg=33.59394073486328, test_abs_avg=33.59398651123047
production_forward2 grad[41] vs paper_forward: mean_abs=0.6457110643386841, max_abs=2.625, mean_rel=0.1329706609249115, max_rel=27.078367233276367, norm_rel=0.02421470172703266, ref_abs_avg=26.63653564453125, test_abs_avg=26.67660140991211
production_forward2 grad[42] vs paper_forward: mean_abs=0.7987064123153687, max_abs=5.5, mean_rel=0.1615685224533081, max_rel=1411.9151611328125, norm_rel=0.024817924946546555, ref_abs_avg=32.262332916259766, test_abs_avg=32.263755798339844
production_forward2 grad[43] vs paper_forward: mean_abs=0.7869943380355835, max_abs=5.0, mean_rel=0.15805983543395996, max_rel=1182.880126953125, norm_rel=0.024552270770072937, ref_abs_avg=32.127777099609375, test_abs_avg=32.131736755371094
production_forward2 grad[44] vs paper_forward: mean_abs=0.6146011352539062, max_abs=2.3125, mean_rel=0.10501204431056976, max_rel=12.265774726867676, norm_rel=0.024578426033258438, ref_abs_avg=25.159400939941406, test_abs_avg=25.217769622802734
production_forward2 grad[45] vs paper_forward: mean_abs=0.7606997489929199, max_abs=5.0, mean_rel=0.1664639115333557, max_rel=2508.402099609375, norm_rel=0.024286018684506416, ref_abs_avg=31.398605346679688, test_abs_avg=31.3988037109375
production_forward2 grad[46] vs paper_forward: mean_abs=0.7453061938285828, max_abs=4.375, mean_rel=0.15370891988277435, max_rel=810.2112426757812, norm_rel=0.023989709094166756, ref_abs_avg=31.15121841430664, test_abs_avg=31.151582717895508
production_forward2 grad[47] vs paper_forward: mean_abs=0.5830841064453125, max_abs=2.75, mean_rel=0.1075739860534668, max_rel=3.8355624675750732, norm_rel=0.023591363802552223, ref_abs_avg=24.93267059326172, test_abs_avg=24.90576171875
production_forward2 grad[48] vs paper_forward: mean_abs=0.7234479188919067, max_abs=4.75, mean_rel=0.1618301421403885, max_rel=1235.0003662109375, norm_rel=0.02404528111219406, ref_abs_avg=30.127859115600586, test_abs_avg=30.12826919555664
production_forward2 grad[49] vs paper_forward: mean_abs=0.7148580551147461, max_abs=5.0, mean_rel=0.15170206129550934, max_rel=1398.6959228515625, norm_rel=0.023847004398703575, ref_abs_avg=29.959205627441406, test_abs_avg=29.95620346069336
production_forward2 grad[50] vs paper_forward: mean_abs=0.6278982162475586, max_abs=2.75, mean_rel=0.07254341244697571, max_rel=2.038001298904419, norm_rel=0.026256004348397255, ref_abs_avg=24.716670989990234, test_abs_avg=24.661577224731445
production_forward2 grad[51] vs paper_forward: mean_abs=0.8087428212165833, max_abs=6.125, mean_rel=0.17612281441688538, max_rel=1906.6829833984375, norm_rel=0.025764524936676025, ref_abs_avg=31.45572280883789, test_abs_avg=31.45535659790039
production_forward2 grad[52] vs paper_forward: mean_abs=0.798608660697937, max_abs=5.25, mean_rel=0.1656225472688675, max_rel=837.65869140625, norm_rel=0.025620795786380768, ref_abs_avg=31.182554244995117, test_abs_avg=31.178211212158203
production_forward2 grad[53] vs paper_forward: mean_abs=0.559689998626709, max_abs=2.5, mean_rel=0.17485380172729492, max_rel=23.968496322631836, norm_rel=0.02366708219051361, ref_abs_avg=24.072540283203125, test_abs_avg=24.076704025268555
production_forward2 grad[54] vs paper_forward: mean_abs=0.7522866725921631, max_abs=5.0, mean_rel=0.156412735581398, max_rel=887.0552368164062, norm_rel=0.025420408695936203, ref_abs_avg=29.653017044067383, test_abs_avg=29.653139114379883
production_forward2 grad[55] vs paper_forward: mean_abs=0.7321594953536987, max_abs=6.0, mean_rel=0.17191001772880554, max_rel=1332.178955078125, norm_rel=0.02528960257768631, ref_abs_avg=29.00450325012207, test_abs_avg=29.007640838623047
production_forward2 grad[56] vs paper_forward: mean_abs=0.5579086542129517, max_abs=2.1875, mean_rel=0.1291847825050354, max_rel=15.099837303161621, norm_rel=0.02385811135172844, ref_abs_avg=23.128978729248047, test_abs_avg=23.1199951171875
production_forward2 grad[57] vs paper_forward: mean_abs=0.6966612339019775, max_abs=4.75, mean_rel=0.17361655831336975, max_rel=1110.8475341796875, norm_rel=0.024960041046142578, ref_abs_avg=27.942779541015625, test_abs_avg=27.94207763671875
production_forward2 grad[58] vs paper_forward: mean_abs=0.6832084655761719, max_abs=4.5, mean_rel=0.15663528442382812, max_rel=516.7999267578125, norm_rel=0.024838704615831375, ref_abs_avg=27.49941635131836, test_abs_avg=27.494983673095703
production_forward2 grad[59] vs paper_forward: mean_abs=0.5190956592559814, max_abs=2.1875, mean_rel=0.1594974398612976, max_rel=18.12314224243164, norm_rel=0.023153360933065414, ref_abs_avg=22.239093780517578, test_abs_avg=22.228893280029297
production_forward2 grad[60] vs paper_forward: mean_abs=0.6477972269058228, max_abs=4.25, mean_rel=0.15470480918884277, max_rel=1069.593994140625, norm_rel=0.02457129955291748, ref_abs_avg=26.372915267944336, test_abs_avg=26.372093200683594
production_forward2 grad[61] vs paper_forward: mean_abs=0.635553240776062, max_abs=4.875, mean_rel=0.16125673055648804, max_rel=683.606201171875, norm_rel=0.024698320776224136, ref_abs_avg=25.756221771240234, test_abs_avg=25.75164222717285
production_forward2 grad[62] vs paper_forward: mean_abs=0.5079046487808228, max_abs=2.0, mean_rel=0.8666607141494751, max_rel=376.8852844238281, norm_rel=0.02374703250825405, ref_abs_avg=21.511198043823242, test_abs_avg=21.492202758789062
production_forward2 grad[63] vs paper_forward: mean_abs=0.6150108575820923, max_abs=4.25, mean_rel=0.1578337401151657, max_rel=894.0590209960938, norm_rel=0.023916857317090034, ref_abs_avg=25.671436309814453, test_abs_avg=25.673002243041992
production_forward2 grad[64] vs paper_forward: mean_abs=0.5947889089584351, max_abs=5.84375, mean_rel=0.15916374325752258, max_rel=1127.371337890625, norm_rel=0.023899944499135017, ref_abs_avg=24.893402099609375, test_abs_avg=24.8900203704834
production_forward2 grad[65] vs paper_forward: mean_abs=0.47314536571502686, max_abs=1.828125, mean_rel=0.2800367772579193, max_rel=91.47913360595703, norm_rel=0.023123901337385178, ref_abs_avg=20.505937576293945, test_abs_avg=20.51901626586914
production_forward2 grad[66] vs paper_forward: mean_abs=0.5830057263374329, max_abs=4.03125, mean_rel=0.16092273592948914, max_rel=1120.2188720703125, norm_rel=0.023831728845834732, ref_abs_avg=24.471420288085938, test_abs_avg=24.473514556884766
production_forward2 grad[67] vs paper_forward: mean_abs=0.5737971067428589, max_abs=4.0, mean_rel=0.1556737720966339, max_rel=1367.1500244140625, norm_rel=0.02387166954576969, ref_abs_avg=24.01433753967285, test_abs_avg=24.018108367919922
production_forward2 grad[68] vs paper_forward: mean_abs=0.44620251655578613, max_abs=1.5, mean_rel=0.11421984434127808, max_rel=8.759814262390137, norm_rel=0.023299837484955788, ref_abs_avg=18.815624237060547, test_abs_avg=18.801158905029297
production_forward2 grad[69] vs paper_forward: mean_abs=0.5493741035461426, max_abs=4.25, mean_rel=0.14964433014392853, max_rel=761.294677734375, norm_rel=0.02343153953552246, ref_abs_avg=23.454010009765625, test_abs_avg=23.454391479492188
production_forward2 grad[70] vs paper_forward: mean_abs=0.5353195071220398, max_abs=4.59375, mean_rel=0.14043930172920227, max_rel=734.4551391601562, norm_rel=0.023185066878795624, ref_abs_avg=23.15987777709961, test_abs_avg=23.162147521972656
production_forward2 grad[71] vs paper_forward: mean_abs=0.42828965187072754, max_abs=1.75, mean_rel=0.1503801941871643, max_rel=20.6937313079834, norm_rel=0.021875381469726562, ref_abs_avg=19.682703018188477, test_abs_avg=19.685468673706055
production_forward2 grad[72] vs paper_forward: mean_abs=0.527190625667572, max_abs=3.666015625, mean_rel=0.15207912027835846, max_rel=711.4783325195312, norm_rel=0.02313356101512909, ref_abs_avg=22.78891372680664, test_abs_avg=22.789297103881836
production_forward2 grad[73] vs paper_forward: mean_abs=0.5183815956115723, max_abs=4.5, mean_rel=0.15263521671295166, max_rel=651.070068359375, norm_rel=0.022694172337651253, ref_abs_avg=22.737306594848633, test_abs_avg=22.743579864501953
production_forward2 grad[74] vs paper_forward: mean_abs=0.4752988815307617, max_abs=1.75, mean_rel=0.10013457387685776, max_rel=6.0710954666137695, norm_rel=0.023136252537369728, ref_abs_avg=19.66635513305664, test_abs_avg=19.693572998046875
production_forward2 grad[75] vs paper_forward: mean_abs=0.598484992980957, max_abs=4.75, mean_rel=0.15826144814491272, max_rel=860.7550659179688, norm_rel=0.024019412696361542, ref_abs_avg=24.94350814819336, test_abs_avg=24.944183349609375
production_forward2 grad[76] vs paper_forward: mean_abs=0.5882447957992554, max_abs=5.0, mean_rel=0.15170858800411224, max_rel=1176.0140380859375, norm_rel=0.0239583570510149, ref_abs_avg=24.58226776123047, test_abs_avg=24.577049255371094
production_forward2 grad[77] vs paper_forward: mean_abs=0.4637327194213867, max_abs=1.9453125, mean_rel=0.09070342779159546, max_rel=6.792422294616699, norm_rel=0.023190999403595924, ref_abs_avg=20.26376724243164, test_abs_avg=20.28766632080078
production_forward2 grad[78] vs paper_forward: mean_abs=0.5589460134506226, max_abs=5.125, mean_rel=0.1506769061088562, max_rel=775.5128173828125, norm_rel=0.02342539280653, ref_abs_avg=23.8387451171875, test_abs_avg=23.839431762695312
production_forward2 grad[79] vs paper_forward: mean_abs=0.5364141464233398, max_abs=5.0, mean_rel=0.15084466338157654, max_rel=1080.588623046875, norm_rel=0.02286495454609394, ref_abs_avg=23.36388397216797, test_abs_avg=23.367088317871094
production_forward2 grad[80] vs paper_forward: mean_abs=0.41820788383483887, max_abs=1.75, mean_rel=0.08117803931236267, max_rel=7.30254602432251, norm_rel=0.02227150648832321, ref_abs_avg=18.79374122619629, test_abs_avg=18.80282211303711
production_forward2 grad[81] vs paper_forward: mean_abs=0.5135045051574707, max_abs=4.0, mean_rel=0.14815551042556763, max_rel=753.4673461914062, norm_rel=0.022916167974472046, ref_abs_avg=22.425045013427734, test_abs_avg=22.425891876220703
production_forward2 grad[82] vs paper_forward: mean_abs=0.5002610683441162, max_abs=5.0, mean_rel=0.14689116179943085, max_rel=830.779052734375, norm_rel=0.02276848442852497, ref_abs_avg=22.033357620239258, test_abs_avg=22.037673950195312
production_forward2 grad[83] vs paper_forward: mean_abs=0.37746667861938477, max_abs=1.5625, mean_rel=0.1780092567205429, max_rel=39.5131950378418, norm_rel=0.020817391574382782, ref_abs_avg=18.3857479095459, test_abs_avg=18.358144760131836
production_forward2 grad[84] vs paper_forward: mean_abs=0.4755585789680481, max_abs=4.0, mean_rel=0.14259478449821472, max_rel=520.6092529296875, norm_rel=0.022265024483203888, ref_abs_avg=21.393634796142578, test_abs_avg=21.393268585205078
production_forward2 grad[85] vs paper_forward: mean_abs=0.46308016777038574, max_abs=4.556640625, mean_rel=0.13177761435508728, max_rel=268.8749084472656, norm_rel=0.021662410348653793, ref_abs_avg=21.269319534301758, test_abs_avg=21.283315658569336
production_forward2 grad[86] vs paper_forward: mean_abs=0.34658002853393555, max_abs=1.25, mean_rel=0.08883444964885712, max_rel=9.845606803894043, norm_rel=0.01966249570250511, ref_abs_avg=17.65019989013672, test_abs_avg=17.633365631103516
production_forward2 grad[87] vs paper_forward: mean_abs=0.4517577886581421, max_abs=4.0, mean_rel=0.1412331610918045, max_rel=808.5647583007812, norm_rel=0.021704865619540215, ref_abs_avg=20.87883758544922, test_abs_avg=20.87934112548828
production_forward2 grad[88] vs paper_forward: mean_abs=0.43560296297073364, max_abs=5.5, mean_rel=0.14661812782287598, max_rel=841.2685546875, norm_rel=0.021167166531085968, ref_abs_avg=20.622474670410156, test_abs_avg=20.621295928955078
production_forward2 grad[89] vs paper_forward: mean_abs=0.3692718744277954, max_abs=1.5, mean_rel=0.15267938375473022, max_rel=34.876224517822266, norm_rel=0.022350821644067764, ref_abs_avg=16.772802352905273, test_abs_avg=16.76380729675293
production_forward2 grad[90] vs paper_forward: mean_abs=0.42546403408050537, max_abs=4.5, mean_rel=0.13286808133125305, max_rel=585.1289672851562, norm_rel=0.021201753988862038, ref_abs_avg=20.191429138183594, test_abs_avg=20.192110061645508
production_forward2 grad[91] vs paper_forward: mean_abs=0.41386836767196655, max_abs=5.5, mean_rel=0.1293126940727234, max_rel=493.030029296875, norm_rel=0.021534351631999016, ref_abs_avg=19.39676856994629, test_abs_avg=19.395801544189453
production_forward2 grad[92] vs paper_forward: mean_abs=0.3384134769439697, max_abs=1.4375, mean_rel=0.12792852520942688, max_rel=38.1324348449707, norm_rel=0.02221822366118431, ref_abs_avg=15.752962112426758, test_abs_avg=15.773672103881836
production_forward2 grad[93] vs paper_forward: mean_abs=0.3929488956928253, max_abs=4.5, mean_rel=0.13370439410209656, max_rel=684.807861328125, norm_rel=0.020781900733709335, ref_abs_avg=19.065410614013672, test_abs_avg=19.06610107421875
production_forward2 grad[94] vs paper_forward: mean_abs=0.387325257062912, max_abs=4.5, mean_rel=0.13526302576065063, max_rel=486.3438720703125, norm_rel=0.020458746701478958, ref_abs_avg=19.090248107910156, test_abs_avg=19.07851791381836
production_forward2 grad[95] vs paper_forward: mean_abs=0.31050682067871094, max_abs=1.25, mean_rel=0.08981074392795563, max_rel=10.438971519470215, norm_rel=0.020437654107809067, ref_abs_avg=15.079561233520508, test_abs_avg=15.087919235229492
production_forward2 grad[96] vs paper_forward: mean_abs=0.36992743611335754, max_abs=3.5, mean_rel=0.12351689487695694, max_rel=590.1181640625, norm_rel=0.020009109750390053, ref_abs_avg=18.746349334716797, test_abs_avg=18.745567321777344
production_forward2 grad[97] vs paper_forward: mean_abs=0.35689425468444824, max_abs=4.125, mean_rel=0.11158265173435211, max_rel=354.5289306640625, norm_rel=0.01917933113873005, ref_abs_avg=18.83529281616211, test_abs_avg=18.816452026367188
identity layers + randn queries
paper_forward fwd+bwd:  379.774 ms
paper_forward bwd-only: 294.263 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.037 GiB, fwd+bwd=32.787 GiB
production_forward2 fwd+bwd:  224.353 ms
production_forward2 bwd-only: 202.157 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.242 GiB, fwd+bwd=8.992 GiB
production_forward fwd+bwd:  109.502 ms
production_forward bwd-only: 89.125 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=6.993 GiB
production_forward peak reserved:  fwd=3.617 GiB, fwd+bwd=8.117 GiB

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.0016557115595787764, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008647083304822445, max_abs=0.53125, mean_rel=0.07444658875465393, max_rel=93.9976577758789, norm_rel=0.02051621675491333, ref_abs_avg=0.45794737339019775, test_abs_avg=0.4579586386680603
production_forward grad[1] vs paper_forward: mean_abs=7.419989585876465, max_abs=60.0, mean_rel=0.17610500752925873, max_rel=184.46676635742188, norm_rel=0.020590675994753838, ref_abs_avg=322.20343017578125, test_abs_avg=322.1251220703125
production_forward grad[2] vs paper_forward: mean_abs=1.1958229541778564, max_abs=5.5, mean_rel=0.1741032600402832, max_rel=32.01493835449219, norm_rel=0.023025892674922943, ref_abs_avg=52.78519058227539, test_abs_avg=52.7757453918457
production_forward grad[3] vs paper_forward: mean_abs=1.5967587232589722, max_abs=11.0, mean_rel=0.17297735810279846, max_rel=1294.84228515625, norm_rel=0.02440503053367138, ref_abs_avg=65.76707458496094, test_abs_avg=65.76817321777344
production_forward grad[4] vs paper_forward: mean_abs=1.5589382648468018, max_abs=10.0, mean_rel=0.17169228196144104, max_rel=1170.5167236328125, norm_rel=0.024281607940793037, ref_abs_avg=64.51104736328125, test_abs_avg=64.51138305664062
production_forward grad[5] vs paper_forward: mean_abs=1.135547399520874, max_abs=4.75, mean_rel=0.1411714404821396, max_rel=9.182108879089355, norm_rel=0.025144722312688828, ref_abs_avg=45.962158203125, test_abs_avg=46.036590576171875
production_forward grad[6] vs paper_forward: mean_abs=1.395115852355957, max_abs=8.75, mean_rel=0.16953197121620178, max_rel=1320.60205078125, norm_rel=0.024234389886260033, ref_abs_avg=57.82122802734375, test_abs_avg=57.82206344604492
production_forward grad[7] vs paper_forward: mean_abs=1.3647788763046265, max_abs=8.25, mean_rel=0.15028342604637146, max_rel=1859.7598876953125, norm_rel=0.023794878274202347, ref_abs_avg=57.71215057373047, test_abs_avg=57.7097282409668
production_forward grad[8] vs paper_forward: mean_abs=0.9667868614196777, max_abs=4.5, mean_rel=0.103883758187294, max_rel=9.858180046081543, norm_rel=0.021970398724079132, ref_abs_avg=43.123695373535156, test_abs_avg=43.074806213378906
production_forward grad[9] vs paper_forward: mean_abs=1.26808762550354, max_abs=9.0, mean_rel=0.16626417636871338, max_rel=2245.2109375, norm_rel=0.02386622317135334, ref_abs_avg=53.3665771484375, test_abs_avg=53.36591720581055
production_forward grad[10] vs paper_forward: mean_abs=1.228734016418457, max_abs=7.0, mean_rel=0.1801159381866455, max_rel=2181.25830078125, norm_rel=0.02347630262374878, ref_abs_avg=52.5525016784668, test_abs_avg=52.561641693115234
production_forward grad[11] vs paper_forward: mean_abs=0.9762408137321472, max_abs=3.578125, mean_rel=0.20688185095787048, max_rel=46.547645568847656, norm_rel=0.024025334045290947, ref_abs_avg=39.87432861328125, test_abs_avg=39.96821594238281
production_forward grad[12] vs paper_forward: mean_abs=1.1696364879608154, max_abs=7.125, mean_rel=0.18502679467201233, max_rel=2453.421142578125, norm_rel=0.023708442226052284, ref_abs_avg=49.582672119140625, test_abs_avg=49.58489990234375
production_forward grad[13] vs paper_forward: mean_abs=1.1403143405914307, max_abs=6.875, mean_rel=0.16506557166576385, max_rel=1751.19580078125, norm_rel=0.02340615540742874, ref_abs_avg=49.0069694519043, test_abs_avg=49.00767517089844
production_forward grad[14] vs paper_forward: mean_abs=0.9068024158477783, max_abs=3.5, mean_rel=0.22172674536705017, max_rel=50.85640335083008, norm_rel=0.023669561371207237, ref_abs_avg=38.46793746948242, test_abs_avg=38.565860748291016
production_forward grad[15] vs paper_forward: mean_abs=1.097604513168335, max_abs=6.5, mean_rel=0.16305799782276154, max_rel=1384.3023681640625, norm_rel=0.023548265919089317, ref_abs_avg=46.82510757446289, test_abs_avg=46.82598876953125
production_forward grad[16] vs paper_forward: mean_abs=1.0703179836273193, max_abs=6.5, mean_rel=0.16213887929916382, max_rel=814.7374877929688, norm_rel=0.023215800523757935, ref_abs_avg=46.343265533447266, test_abs_avg=46.34345626831055
production_forward grad[17] vs paper_forward: mean_abs=0.8226108551025391, max_abs=3.5, mean_rel=0.0706930086016655, max_rel=1.9235590696334839, norm_rel=0.02292228303849697, ref_abs_avg=36.10519790649414, test_abs_avg=36.180747985839844
production_forward grad[18] vs paper_forward: mean_abs=1.0347506999969482, max_abs=7.0, mean_rel=0.1497339904308319, max_rel=1621.6702880859375, norm_rel=0.023455696180462837, ref_abs_avg=44.33214569091797, test_abs_avg=44.33429718017578
production_forward grad[19] vs paper_forward: mean_abs=1.0115973949432373, max_abs=6.0, mean_rel=0.14914420247077942, max_rel=2146.71240234375, norm_rel=0.023272693157196045, ref_abs_avg=43.75754928588867, test_abs_avg=43.7625732421875
production_forward grad[20] vs paper_forward: mean_abs=0.840479850769043, max_abs=3.25, mean_rel=0.09957912564277649, max_rel=8.58029556274414, norm_rel=0.023493314161896706, ref_abs_avg=34.868412017822266, test_abs_avg=34.86433410644531
production_forward grad[21] vs paper_forward: mean_abs=0.9860739707946777, max_abs=6.25, mean_rel=0.1603074073791504, max_rel=1181.1033935546875, norm_rel=0.023233000189065933, ref_abs_avg=42.63328170776367, test_abs_avg=42.63587188720703
production_forward grad[22] vs paper_forward: mean_abs=0.9671036005020142, max_abs=6.0, mean_rel=0.1642177402973175, max_rel=1619.0872802734375, norm_rel=0.023146135732531548, ref_abs_avg=41.960697174072266, test_abs_avg=41.96345138549805
production_forward grad[23] vs paper_forward: mean_abs=0.7568232417106628, max_abs=3.0, mean_rel=0.4059329926967621, max_rel=160.71249389648438, norm_rel=0.02228201925754547, ref_abs_avg=34.02057647705078, test_abs_avg=34.01245880126953
production_forward grad[24] vs paper_forward: mean_abs=0.932388961315155, max_abs=6.25, mean_rel=0.14940878748893738, max_rel=908.3023071289062, norm_rel=0.023028673604130745, ref_abs_avg=40.65593719482422, test_abs_avg=40.657447814941406
production_forward grad[25] vs paper_forward: mean_abs=0.9145640134811401, max_abs=5.375, mean_rel=0.14059707522392273, max_rel=731.1488037109375, norm_rel=0.022939244285225868, ref_abs_avg=40.06853103637695, test_abs_avg=40.067466735839844
production_forward grad[26] vs paper_forward: mean_abs=0.8610939979553223, max_abs=4.5, mean_rel=0.07296936213970184, max_rel=2.1791698932647705, norm_rel=0.025392305105924606, ref_abs_avg=34.86878967285156, test_abs_avg=34.92964553833008
production_forward grad[27] vs paper_forward: mean_abs=1.0806281566619873, max_abs=7.0, mean_rel=0.16629399359226227, max_rel=1370.0567626953125, norm_rel=0.024851223453879356, ref_abs_avg=43.666648864746094, test_abs_avg=43.666526794433594
production_forward grad[28] vs paper_forward: mean_abs=1.051676869392395, max_abs=7.5, mean_rel=0.17583364248275757, max_rel=1123.1383056640625, norm_rel=0.024670735001564026, ref_abs_avg=42.816993713378906, test_abs_avg=42.82403564453125
production_forward grad[29] vs paper_forward: mean_abs=0.7745912075042725, max_abs=3.0, mean_rel=0.09804299473762512, max_rel=8.524206161499023, norm_rel=0.024104434996843338, ref_abs_avg=32.487762451171875, test_abs_avg=32.557952880859375
production_forward grad[30] vs paper_forward: mean_abs=0.9820987582206726, max_abs=7.0, mean_rel=0.17202648520469666, max_rel=1501.667724609375, norm_rel=0.025173043832182884, ref_abs_avg=39.1555290222168, test_abs_avg=39.160728454589844
production_forward grad[31] vs paper_forward: mean_abs=0.9612032175064087, max_abs=6.5, mean_rel=0.17849351465702057, max_rel=1188.38671875, norm_rel=0.024798544123768806, ref_abs_avg=38.92695236206055, test_abs_avg=38.92713165283203
production_forward grad[32] vs paper_forward: mean_abs=0.7399711608886719, max_abs=3.5, mean_rel=0.1773705631494522, max_rel=19.36162567138672, norm_rel=0.0254057664424181, ref_abs_avg=29.115924835205078, test_abs_avg=29.003902435302734
production_forward grad[33] vs paper_forward: mean_abs=0.9105772376060486, max_abs=6.0, mean_rel=0.1704070270061493, max_rel=1184.3836669921875, norm_rel=0.02507612109184265, ref_abs_avg=36.42646026611328, test_abs_avg=36.43138885498047
production_forward grad[34] vs paper_forward: mean_abs=0.9021360874176025, max_abs=5.3125, mean_rel=0.1688794046640396, max_rel=978.5478515625, norm_rel=0.0250689834356308, ref_abs_avg=36.13591003417969, test_abs_avg=36.13423156738281
production_forward grad[35] vs paper_forward: mean_abs=0.7182564735412598, max_abs=3.5, mean_rel=0.11305609345436096, max_rel=10.853876113891602, norm_rel=0.024680202826857567, ref_abs_avg=29.463335037231445, test_abs_avg=29.44955062866211
production_forward grad[36] vs paper_forward: mean_abs=0.8574110269546509, max_abs=6.0, mean_rel=0.16393594443798065, max_rel=670.7465209960938, norm_rel=0.024716220796108246, ref_abs_avg=34.78630828857422, test_abs_avg=34.789161682128906
production_forward grad[37] vs paper_forward: mean_abs=0.8448619842529297, max_abs=5.25, mean_rel=0.16432251036167145, max_rel=797.3294067382812, norm_rel=0.024563245475292206, ref_abs_avg=34.51625442504883, test_abs_avg=34.51305389404297
production_forward grad[38] vs paper_forward: mean_abs=0.6741471290588379, max_abs=2.5, mean_rel=0.11707744002342224, max_rel=16.902965545654297, norm_rel=0.024431893602013588, ref_abs_avg=27.127506256103516, test_abs_avg=27.089996337890625
production_forward grad[39] vs paper_forward: mean_abs=0.8150081038475037, max_abs=5.25, mean_rel=0.1562751829624176, max_rel=998.698974609375, norm_rel=0.024528397247195244, ref_abs_avg=33.305885314941406, test_abs_avg=33.30663299560547
production_forward grad[40] vs paper_forward: mean_abs=0.8024208545684814, max_abs=5.0, mean_rel=0.16068008542060852, max_rel=811.578125, norm_rel=0.02430141717195511, ref_abs_avg=33.120697021484375, test_abs_avg=33.11715316772461
production_forward grad[41] vs paper_forward: mean_abs=0.5851845741271973, max_abs=2.3671875, mean_rel=0.18462930619716644, max_rel=62.98189163208008, norm_rel=0.022149067372083664, ref_abs_avg=26.287353515625, test_abs_avg=26.265087127685547
production_forward grad[42] vs paper_forward: mean_abs=0.7768542170524597, max_abs=5.0625, mean_rel=0.16518563032150269, max_rel=1346.496337890625, norm_rel=0.02428751438856125, ref_abs_avg=32.06194305419922, test_abs_avg=32.0629768371582
production_forward grad[43] vs paper_forward: mean_abs=0.7651394605636597, max_abs=5.0, mean_rel=0.15501850843429565, max_rel=909.9998779296875, norm_rel=0.024042444303631783, ref_abs_avg=31.88930320739746, test_abs_avg=31.890832901000977
production_forward grad[44] vs paper_forward: mean_abs=0.6090917587280273, max_abs=2.25, mean_rel=0.09286698698997498, max_rel=3.7565481662750244, norm_rel=0.02492230385541916, ref_abs_avg=24.67707061767578, test_abs_avg=24.669002532958984
production_forward grad[45] vs paper_forward: mean_abs=0.7415100336074829, max_abs=5.0, mean_rel=0.16559864580631256, max_rel=919.6280517578125, norm_rel=0.023906804621219635, ref_abs_avg=31.05063247680664, test_abs_avg=31.05344581604004
production_forward grad[46] vs paper_forward: mean_abs=0.7273008227348328, max_abs=5.0, mean_rel=0.1527230441570282, max_rel=1007.5457153320312, norm_rel=0.02373575232923031, ref_abs_avg=30.695472717285156, test_abs_avg=30.69473648071289
production_forward grad[47] vs paper_forward: mean_abs=0.5816688537597656, max_abs=2.625, mean_rel=0.06324754655361176, max_rel=2.8007283210754395, norm_rel=0.02500586025416851, ref_abs_avg=24.019821166992188, test_abs_avg=23.951221466064453
production_forward grad[48] vs paper_forward: mean_abs=0.7129138112068176, max_abs=5.25, mean_rel=0.14582407474517822, max_rel=782.0953979492188, norm_rel=0.023731332272291183, ref_abs_avg=30.096996307373047, test_abs_avg=30.096275329589844
production_forward grad[49] vs paper_forward: mean_abs=0.6962313652038574, max_abs=4.375, mean_rel=0.15090379118919373, max_rel=890.0330810546875, norm_rel=0.023819660767912865, ref_abs_avg=29.347537994384766, test_abs_avg=29.34739112854004
production_forward grad[50] vs paper_forward: mean_abs=0.6385159492492676, max_abs=2.1875, mean_rel=0.13738378882408142, max_rel=13.198258399963379, norm_rel=0.023430556058883667, ref_abs_avg=26.200777053833008, test_abs_avg=26.194610595703125
production_forward grad[51] vs paper_forward: mean_abs=0.792736291885376, max_abs=5.625, mean_rel=0.17363856732845306, max_rel=1017.2449340820312, norm_rel=0.02547568455338478, ref_abs_avg=31.163330078125, test_abs_avg=31.162757873535156
production_forward grad[52] vs paper_forward: mean_abs=0.7777585983276367, max_abs=5.0, mean_rel=0.1930398941040039, max_rel=1498.60205078125, norm_rel=0.025580493733286858, ref_abs_avg=30.496641159057617, test_abs_avg=30.49776268005371
production_forward grad[53] vs paper_forward: mean_abs=0.6236968040466309, max_abs=2.4375, mean_rel=0.09148657321929932, max_rel=4.829143524169922, norm_rel=0.025749310851097107, ref_abs_avg=24.538482666015625, test_abs_avg=24.56861686706543
production_forward grad[54] vs paper_forward: mean_abs=0.7275432348251343, max_abs=5.0, mean_rel=0.1656714528799057, max_rel=1098.9173583984375, norm_rel=0.02508019097149372, ref_abs_avg=29.054534912109375, test_abs_avg=29.054656982421875
production_forward grad[55] vs paper_forward: mean_abs=0.713981568813324, max_abs=4.75, mean_rel=0.16685956716537476, max_rel=987.9557495117188, norm_rel=0.024847589433193207, ref_abs_avg=28.79184341430664, test_abs_avg=28.789844512939453
production_forward grad[56] vs paper_forward: mean_abs=0.5370063781738281, max_abs=2.25, mean_rel=0.08128133416175842, max_rel=4.779754638671875, norm_rel=0.02388034202158451, ref_abs_avg=22.585140228271484, test_abs_avg=22.589576721191406
production_forward grad[57] vs paper_forward: mean_abs=0.6792566180229187, max_abs=5.046875, mean_rel=0.17583715915679932, max_rel=1776.7421875, norm_rel=0.02470473386347294, ref_abs_avg=27.526811599731445, test_abs_avg=27.525474548339844
production_forward grad[58] vs paper_forward: mean_abs=0.6653908491134644, max_abs=4.625, mean_rel=0.15538142621517181, max_rel=768.1167602539062, norm_rel=0.024368679150938988, ref_abs_avg=27.317047119140625, test_abs_avg=27.314674377441406
production_forward grad[59] vs paper_forward: mean_abs=0.5138874053955078, max_abs=2.5, mean_rel=0.06439042836427689, max_rel=1.798367977142334, norm_rel=0.02475261501967907, ref_abs_avg=21.12051773071289, test_abs_avg=21.114194869995117
production_forward grad[60] vs paper_forward: mean_abs=0.6389089822769165, max_abs=4.0, mean_rel=0.16194429993629456, max_rel=723.592529296875, norm_rel=0.024493379518389702, ref_abs_avg=26.095714569091797, test_abs_avg=26.095006942749023
production_forward grad[61] vs paper_forward: mean_abs=0.6338229179382324, max_abs=4.0, mean_rel=0.16289356350898743, max_rel=966.4421997070312, norm_rel=0.024762628600001335, ref_abs_avg=25.624282836914062, test_abs_avg=25.621238708496094
production_forward grad[62] vs paper_forward: mean_abs=0.48497915267944336, max_abs=2.0, mean_rel=0.07520264387130737, max_rel=3.218031644821167, norm_rel=0.023045718669891357, ref_abs_avg=21.310548782348633, test_abs_avg=21.322006225585938
production_forward grad[63] vs paper_forward: mean_abs=0.6017441749572754, max_abs=4.0, mean_rel=0.15181587636470795, max_rel=1058.19091796875, norm_rel=0.023718981072306633, ref_abs_avg=25.37474822998047, test_abs_avg=25.37413787841797
production_forward grad[64] vs paper_forward: mean_abs=0.5869329571723938, max_abs=4.0, mean_rel=0.1579849272966385, max_rel=736.62646484375, norm_rel=0.023557309061288834, ref_abs_avg=24.89907455444336, test_abs_avg=24.902687072753906
production_forward grad[65] vs paper_forward: mean_abs=0.42840003967285156, max_abs=1.75, mean_rel=0.12323467433452606, max_rel=11.535706520080566, norm_rel=0.02206023782491684, ref_abs_avg=20.2623233795166, test_abs_avg=20.241552352905273
production_forward grad[66] vs paper_forward: mean_abs=0.5773810744285583, max_abs=4.0, mean_rel=0.1536065638065338, max_rel=762.1742553710938, norm_rel=0.023472556844353676, ref_abs_avg=24.55636215209961, test_abs_avg=24.556243896484375
production_forward grad[67] vs paper_forward: mean_abs=0.5629420876502991, max_abs=4.25, mean_rel=0.15338575839996338, max_rel=935.2672119140625, norm_rel=0.023464176803827286, ref_abs_avg=24.05465316772461, test_abs_avg=24.061237335205078
production_forward grad[68] vs paper_forward: mean_abs=0.4021139144897461, max_abs=1.625, mean_rel=0.05815295875072479, max_rel=1.632193684577942, norm_rel=0.02039499022066593, ref_abs_avg=20.01357650756836, test_abs_avg=19.980113983154297
production_forward grad[69] vs paper_forward: mean_abs=0.5440539717674255, max_abs=3.75, mean_rel=0.1520068496465683, max_rel=730.9443359375, norm_rel=0.023169422522187233, ref_abs_avg=23.450950622558594, test_abs_avg=23.450313568115234
production_forward grad[70] vs paper_forward: mean_abs=0.5270853638648987, max_abs=4.0, mean_rel=0.1436123102903366, max_rel=542.0455932617188, norm_rel=0.022891277447342873, ref_abs_avg=23.035188674926758, test_abs_avg=23.032543182373047
production_forward grad[71] vs paper_forward: mean_abs=0.43011581897735596, max_abs=1.5, mean_rel=0.07705336809158325, max_rel=3.785924196243286, norm_rel=0.02278669737279415, ref_abs_avg=18.70030975341797, test_abs_avg=18.7225399017334
production_forward grad[72] vs paper_forward: mean_abs=0.5199478268623352, max_abs=3.75, mean_rel=0.15389791131019592, max_rel=1463.3939208984375, norm_rel=0.02288762293756008, ref_abs_avg=22.743389129638672, test_abs_avg=22.74519920349121
production_forward grad[73] vs paper_forward: mean_abs=0.5051224231719971, max_abs=3.421875, mean_rel=0.15590503811836243, max_rel=602.1382446289062, norm_rel=0.02256569266319275, ref_abs_avg=22.381113052368164, test_abs_avg=22.378978729248047
production_forward grad[74] vs paper_forward: mean_abs=0.47850990295410156, max_abs=1.75, mean_rel=0.053688280284404755, max_rel=1.3373165130615234, norm_rel=0.023308193311095238, ref_abs_avg=21.225528717041016, test_abs_avg=21.212438583374023
production_forward grad[75] vs paper_forward: mean_abs=0.6027132868766785, max_abs=4.75, mean_rel=0.1593271940946579, max_rel=1305.2266845703125, norm_rel=0.024592509493231773, ref_abs_avg=24.548419952392578, test_abs_avg=24.54714012145996
production_forward grad[76] vs paper_forward: mean_abs=0.5803203582763672, max_abs=4.3125, mean_rel=0.1594952642917633, max_rel=821.8238525390625, norm_rel=0.024061664938926697, ref_abs_avg=24.11787223815918, test_abs_avg=24.116859436035156
production_forward grad[77] vs paper_forward: mean_abs=0.4411029815673828, max_abs=1.75, mean_rel=0.06986751407384872, max_rel=5.1371259689331055, norm_rel=0.02231553941965103, ref_abs_avg=20.3289737701416, test_abs_avg=20.3062744140625
production_forward grad[78] vs paper_forward: mean_abs=0.5393706560134888, max_abs=5.0, mean_rel=0.1570262461900711, max_rel=1868.5335693359375, norm_rel=0.023501412943005562, ref_abs_avg=22.941619873046875, test_abs_avg=22.941328048706055
production_forward grad[79] vs paper_forward: mean_abs=0.5315498113632202, max_abs=4.0, mean_rel=0.1496068388223648, max_rel=1042.7642822265625, norm_rel=0.02352353185415268, ref_abs_avg=22.67870330810547, test_abs_avg=22.674301147460938
production_forward grad[80] vs paper_forward: mean_abs=0.4147945046424866, max_abs=1.75, mean_rel=0.20210160315036774, max_rel=33.498390197753906, norm_rel=0.022816242650151253, ref_abs_avg=18.204681396484375, test_abs_avg=18.23587417602539
production_forward grad[81] vs paper_forward: mean_abs=0.5030409693717957, max_abs=4.5, mean_rel=0.1590387374162674, max_rel=1048.1077880859375, norm_rel=0.023037191480398178, ref_abs_avg=21.82576560974121, test_abs_avg=21.825572967529297
production_forward grad[82] vs paper_forward: mean_abs=0.4907826781272888, max_abs=4.125, mean_rel=0.1416601538658142, max_rel=743.8416748046875, norm_rel=0.022739090025424957, ref_abs_avg=21.640939712524414, test_abs_avg=21.6524600982666
production_forward grad[83] vs paper_forward: mean_abs=0.3685283660888672, max_abs=1.625, mean_rel=0.15273220837116241, max_rel=17.31910514831543, norm_rel=0.021674606949090958, ref_abs_avg=17.348146438598633, test_abs_avg=17.31388282775879
production_forward grad[84] vs paper_forward: mean_abs=0.4658661186695099, max_abs=4.25, mean_rel=0.14363494515419006, max_rel=896.6525268554688, norm_rel=0.022342989221215248, ref_abs_avg=20.90148162841797, test_abs_avg=20.901914596557617
production_forward grad[85] vs paper_forward: mean_abs=0.4472093880176544, max_abs=3.5, mean_rel=0.13775423169136047, max_rel=722.0128784179688, norm_rel=0.021512525156140327, ref_abs_avg=20.79904556274414, test_abs_avg=20.795419692993164
production_forward grad[86] vs paper_forward: mean_abs=0.3492845296859741, max_abs=1.5625, mean_rel=0.09322140365839005, max_rel=8.83416748046875, norm_rel=0.021156957373023033, ref_abs_avg=16.74285125732422, test_abs_avg=16.771053314208984
production_forward grad[87] vs paper_forward: mean_abs=0.43659698963165283, max_abs=4.0, mean_rel=0.1386345773935318, max_rel=799.1500244140625, norm_rel=0.021720046177506447, ref_abs_avg=20.177597045898438, test_abs_avg=20.176511764526367
production_forward grad[88] vs paper_forward: mean_abs=0.4283953309059143, max_abs=3.5, mean_rel=0.1359933316707611, max_rel=846.0619506835938, norm_rel=0.021395724266767502, ref_abs_avg=20.11143684387207, test_abs_avg=20.108104705810547
production_forward grad[89] vs paper_forward: mean_abs=0.3042795658111572, max_abs=1.625, mean_rel=0.1381833702325821, max_rel=29.358245849609375, norm_rel=0.01936357095837593, ref_abs_avg=16.32723617553711, test_abs_avg=16.326099395751953
production_forward grad[90] vs paper_forward: mean_abs=0.4118589162826538, max_abs=3.76171875, mean_rel=0.12984465062618256, max_rel=628.966552734375, norm_rel=0.021306555718183517, ref_abs_avg=19.447284698486328, test_abs_avg=19.447546005249023
production_forward grad[91] vs paper_forward: mean_abs=0.4045849144458771, max_abs=3.125, mean_rel=0.12469659000635147, max_rel=433.85345458984375, norm_rel=0.021136395633220673, ref_abs_avg=19.231040954589844, test_abs_avg=19.23032569885254
production_forward grad[92] vs paper_forward: mean_abs=0.3149987459182739, max_abs=1.0859375, mean_rel=0.20023301243782043, max_rel=56.09769058227539, norm_rel=0.019637562334537506, ref_abs_avg=15.876129150390625, test_abs_avg=15.911626815795898
production_forward grad[93] vs paper_forward: mean_abs=0.3902817666530609, max_abs=4.5, mean_rel=0.12698467075824738, max_rel=581.1889038085938, norm_rel=0.020551949739456177, ref_abs_avg=19.1722412109375, test_abs_avg=19.17306137084961
production_forward grad[94] vs paper_forward: mean_abs=0.37538081407546997, max_abs=3.9375, mean_rel=0.13084755837917328, max_rel=524.0731201171875, norm_rel=0.02059595286846161, ref_abs_avg=18.42205047607422, test_abs_avg=18.424522399902344
production_forward grad[95] vs paper_forward: mean_abs=0.30156534910202026, max_abs=1.125, mean_rel=0.11827749758958817, max_rel=13.763952255249023, norm_rel=0.019260630011558533, ref_abs_avg=15.947969436645508, test_abs_avg=15.951085090637207
production_forward grad[96] vs paper_forward: mean_abs=0.3740866780281067, max_abs=3.78125, mean_rel=0.12122548371553421, max_rel=1001.7108154296875, norm_rel=0.020292110741138458, ref_abs_avg=18.699857711791992, test_abs_avg=18.700523376464844
production_forward grad[97] vs paper_forward: mean_abs=0.360917866230011, max_abs=3.25, mean_rel=0.13077625632286072, max_rel=1027.0816650390625, norm_rel=0.020631490275263786, ref_abs_avg=17.84792709350586, test_abs_avg=17.850814819335938
production_forward2 vs paper_forward output: mean_abs=0.0016557115595787764, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008777580223977566, max_abs=0.546875, mean_rel=0.07549860328435898, max_rel=92.59261322021484, norm_rel=0.020788978785276413, ref_abs_avg=0.45794737339019775, test_abs_avg=0.4579527676105499
production_forward2 grad[1] vs paper_forward: mean_abs=7.494889736175537, max_abs=60.0, mean_rel=0.17627637088298798, max_rel=241.2286834716797, norm_rel=0.020815828815102577, ref_abs_avg=322.20343017578125, test_abs_avg=322.12158203125
production_forward2 grad[2] vs paper_forward: mean_abs=1.2165436744689941, max_abs=5.25, mean_rel=0.2583891749382019, max_rel=69.0039291381836, norm_rel=0.023738745599985123, ref_abs_avg=52.78519058227539, test_abs_avg=52.7684326171875
production_forward2 grad[3] vs paper_forward: mean_abs=1.6147832870483398, max_abs=11.25, mean_rel=0.17296627163887024, max_rel=1925.6558837890625, norm_rel=0.02466760203242302, ref_abs_avg=65.76707458496094, test_abs_avg=65.77130126953125
production_forward2 grad[4] vs paper_forward: mean_abs=1.576956868171692, max_abs=10.25, mean_rel=0.17218747735023499, max_rel=1353.326904296875, norm_rel=0.02453458495438099, ref_abs_avg=64.51104736328125, test_abs_avg=64.51206970214844
production_forward2 grad[5] vs paper_forward: mean_abs=1.1579399108886719, max_abs=5.0, mean_rel=0.16246813535690308, max_rel=10.470826148986816, norm_rel=0.025454118847846985, ref_abs_avg=45.962158203125, test_abs_avg=46.03502655029297
production_forward2 grad[6] vs paper_forward: mean_abs=1.409844160079956, max_abs=9.0, mean_rel=0.16739322245121002, max_rel=1008.6610717773438, norm_rel=0.024490106850862503, ref_abs_avg=57.82122802734375, test_abs_avg=57.822998046875
production_forward2 grad[7] vs paper_forward: mean_abs=1.3804084062576294, max_abs=8.5, mean_rel=0.15279856324195862, max_rel=1956.9056396484375, norm_rel=0.024038095027208328, ref_abs_avg=57.71215057373047, test_abs_avg=57.7121696472168
production_forward2 grad[8] vs paper_forward: mean_abs=0.9969825744628906, max_abs=4.5, mean_rel=0.10359331965446472, max_rel=9.425804138183594, norm_rel=0.02301180176436901, ref_abs_avg=43.123695373535156, test_abs_avg=43.01206588745117
production_forward2 grad[9] vs paper_forward: mean_abs=1.2809182405471802, max_abs=9.0, mean_rel=0.16794684529304504, max_rel=2170.369140625, norm_rel=0.024122074246406555, ref_abs_avg=53.3665771484375, test_abs_avg=53.365997314453125
production_forward2 grad[10] vs paper_forward: mean_abs=1.239365577697754, max_abs=7.78125, mean_rel=0.17993378639221191, max_rel=1660.246337890625, norm_rel=0.02367347665131092, ref_abs_avg=52.5525016784668, test_abs_avg=52.56089782714844
production_forward2 grad[11] vs paper_forward: mean_abs=0.9761524200439453, max_abs=3.7421875, mean_rel=0.2049148827791214, max_rel=45.646541595458984, norm_rel=0.024279652163386345, ref_abs_avg=39.87432861328125, test_abs_avg=39.93117904663086
production_forward2 grad[12] vs paper_forward: mean_abs=1.1801915168762207, max_abs=7.34375, mean_rel=0.18019208312034607, max_rel=2231.7314453125, norm_rel=0.023914629593491554, ref_abs_avg=49.582672119140625, test_abs_avg=49.58484649658203
production_forward2 grad[13] vs paper_forward: mean_abs=1.1526825428009033, max_abs=7.0, mean_rel=0.1715153604745865, max_rel=1817.1898193359375, norm_rel=0.02363581582903862, ref_abs_avg=49.0069694519043, test_abs_avg=49.01171112060547
production_forward2 grad[14] vs paper_forward: mean_abs=0.8972752094268799, max_abs=3.875, mean_rel=0.35180819034576416, max_rel=121.94978332519531, norm_rel=0.023776158690452576, ref_abs_avg=38.46793746948242, test_abs_avg=38.56096649169922
production_forward2 grad[15] vs paper_forward: mean_abs=1.1083557605743408, max_abs=7.0, mean_rel=0.15999442338943481, max_rel=971.5167846679688, norm_rel=0.023771965876221657, ref_abs_avg=46.82510757446289, test_abs_avg=46.827430725097656
production_forward2 grad[16] vs paper_forward: mean_abs=1.081837773323059, max_abs=6.5, mean_rel=0.1608695387840271, max_rel=1109.6573486328125, norm_rel=0.023454949259757996, ref_abs_avg=46.343265533447266, test_abs_avg=46.34513854980469
production_forward2 grad[17] vs paper_forward: mean_abs=0.8067817687988281, max_abs=3.25, mean_rel=0.0727958232164383, max_rel=2.119440793991089, norm_rel=0.023010266944766045, ref_abs_avg=36.10519790649414, test_abs_avg=36.2003173828125
production_forward2 grad[18] vs paper_forward: mean_abs=1.0451805591583252, max_abs=7.375, mean_rel=0.15090425312519073, max_rel=1604.9569091796875, norm_rel=0.023684818297624588, ref_abs_avg=44.33214569091797, test_abs_avg=44.33390426635742
production_forward2 grad[19] vs paper_forward: mean_abs=1.0230436325073242, max_abs=6.0, mean_rel=0.1482800841331482, max_rel=1364.505859375, norm_rel=0.02349705807864666, ref_abs_avg=43.75754928588867, test_abs_avg=43.76041793823242
production_forward2 grad[20] vs paper_forward: mean_abs=0.858922004699707, max_abs=3.859375, mean_rel=0.10649767518043518, max_rel=9.365340232849121, norm_rel=0.024241778999567032, ref_abs_avg=34.868412017822266, test_abs_avg=34.85943603515625
production_forward2 grad[21] vs paper_forward: mean_abs=0.9949871897697449, max_abs=6.0, mean_rel=0.16382694244384766, max_rel=1700.5361328125, norm_rel=0.02344558574259281, ref_abs_avg=42.63328170776367, test_abs_avg=42.6358757019043
production_forward2 grad[22] vs paper_forward: mean_abs=0.9770257472991943, max_abs=6.25, mean_rel=0.16429075598716736, max_rel=1426.5994873046875, norm_rel=0.023379171267151833, ref_abs_avg=41.960697174072266, test_abs_avg=41.96361541748047
production_forward2 grad[23] vs paper_forward: mean_abs=0.7554317712783813, max_abs=3.25, mean_rel=0.3571142554283142, max_rel=136.61973571777344, norm_rel=0.022453835234045982, ref_abs_avg=34.02057647705078, test_abs_avg=34.029380798339844
production_forward2 grad[24] vs paper_forward: mean_abs=0.9390476942062378, max_abs=6.5, mean_rel=0.15318289399147034, max_rel=1145.0565185546875, norm_rel=0.02319757454097271, ref_abs_avg=40.65593719482422, test_abs_avg=40.656837463378906
production_forward2 grad[25] vs paper_forward: mean_abs=0.9233332276344299, max_abs=5.5, mean_rel=0.14269009232521057, max_rel=661.4771118164062, norm_rel=0.023166602477431297, ref_abs_avg=40.06853103637695, test_abs_avg=40.06498336791992
production_forward2 grad[26] vs paper_forward: mean_abs=0.8851233720779419, max_abs=4.0, mean_rel=0.08359622955322266, max_rel=1.916701078414917, norm_rel=0.025773609057068825, ref_abs_avg=34.86878967285156, test_abs_avg=34.92217254638672
production_forward2 grad[27] vs paper_forward: mean_abs=1.0869715213775635, max_abs=7.15625, mean_rel=0.16682752966880798, max_rel=1340.370361328125, norm_rel=0.024995410814881325, ref_abs_avg=43.666648864746094, test_abs_avg=43.666542053222656
production_forward2 grad[28] vs paper_forward: mean_abs=1.0578789710998535, max_abs=8.0, mean_rel=0.17658784985542297, max_rel=1158.8228759765625, norm_rel=0.024828342720866203, ref_abs_avg=42.816993713378906, test_abs_avg=42.82337951660156
production_forward2 grad[29] vs paper_forward: mean_abs=0.7723608016967773, max_abs=3.0, mean_rel=0.0889260545372963, max_rel=4.227336883544922, norm_rel=0.024089336395263672, ref_abs_avg=32.487762451171875, test_abs_avg=32.53263854980469
production_forward2 grad[30] vs paper_forward: mean_abs=0.9887337684631348, max_abs=6.4375, mean_rel=0.17644080519676208, max_rel=1481.0921630859375, norm_rel=0.025321075692772865, ref_abs_avg=39.1555290222168, test_abs_avg=39.15868377685547
production_forward2 grad[31] vs paper_forward: mean_abs=0.9668841361999512, max_abs=6.5, mean_rel=0.17686162889003754, max_rel=939.4633178710938, norm_rel=0.024948162958025932, ref_abs_avg=38.92695236206055, test_abs_avg=38.9276123046875
production_forward2 grad[32] vs paper_forward: mean_abs=0.7582778930664062, max_abs=3.5, mean_rel=0.17898598313331604, max_rel=22.29282569885254, norm_rel=0.02592562325298786, ref_abs_avg=29.115924835205078, test_abs_avg=28.99700164794922
production_forward2 grad[33] vs paper_forward: mean_abs=0.9152559638023376, max_abs=5.5, mean_rel=0.17177334427833557, max_rel=1030.517333984375, norm_rel=0.02519182674586773, ref_abs_avg=36.42646026611328, test_abs_avg=36.431396484375
production_forward2 grad[34] vs paper_forward: mean_abs=0.9073268175125122, max_abs=6.0, mean_rel=0.16731034219264984, max_rel=837.9691162109375, norm_rel=0.025197584182024002, ref_abs_avg=36.13591003417969, test_abs_avg=36.13410186767578
production_forward2 grad[35] vs paper_forward: mean_abs=0.7187767028808594, max_abs=2.75, mean_rel=0.13641469180583954, max_rel=22.734844207763672, norm_rel=0.024618105962872505, ref_abs_avg=29.463335037231445, test_abs_avg=29.405179977416992
production_forward2 grad[36] vs paper_forward: mean_abs=0.862000584602356, max_abs=5.5, mean_rel=0.16674739122390747, max_rel=641.1935424804688, norm_rel=0.024859679862856865, ref_abs_avg=34.78630828857422, test_abs_avg=34.789310455322266
production_forward2 grad[37] vs paper_forward: mean_abs=0.8490423560142517, max_abs=5.0, mean_rel=0.16494232416152954, max_rel=873.0656127929688, norm_rel=0.02469087764620781, ref_abs_avg=34.51625442504883, test_abs_avg=34.51215362548828
production_forward2 grad[38] vs paper_forward: mean_abs=0.6671915054321289, max_abs=2.5, mean_rel=0.11823859810829163, max_rel=15.015892028808594, norm_rel=0.024172645062208176, ref_abs_avg=27.127506256103516, test_abs_avg=27.087068557739258
production_forward2 grad[39] vs paper_forward: mean_abs=0.8200491666793823, max_abs=5.5, mean_rel=0.15872687101364136, max_rel=1137.9947509765625, norm_rel=0.02468392439186573, ref_abs_avg=33.305885314941406, test_abs_avg=33.30560302734375
production_forward2 grad[40] vs paper_forward: mean_abs=0.8067929148674011, max_abs=5.0, mean_rel=0.16013486683368683, max_rel=1029.5517578125, norm_rel=0.024429427459836006, ref_abs_avg=33.120697021484375, test_abs_avg=33.11652755737305
production_forward2 grad[41] vs paper_forward: mean_abs=0.5949215888977051, max_abs=2.5, mean_rel=0.15146106481552124, max_rel=45.72221755981445, norm_rel=0.02288176119327545, ref_abs_avg=26.287353515625, test_abs_avg=26.267593383789062
production_forward2 grad[42] vs paper_forward: mean_abs=0.7807071805000305, max_abs=5.0, mean_rel=0.16380146145820618, max_rel=1100.1871337890625, norm_rel=0.024411987513303757, ref_abs_avg=32.06194305419922, test_abs_avg=32.06184768676758
production_forward2 grad[43] vs paper_forward: mean_abs=0.7684321999549866, max_abs=5.0, mean_rel=0.15436244010925293, max_rel=954.4111938476562, norm_rel=0.024156859144568443, ref_abs_avg=31.88930320739746, test_abs_avg=31.891801834106445
production_forward2 grad[44] vs paper_forward: mean_abs=0.6120753288269043, max_abs=2.0859375, mean_rel=0.10540750622749329, max_rel=10.879055976867676, norm_rel=0.024828286841511726, ref_abs_avg=24.67707061767578, test_abs_avg=24.680679321289062
production_forward2 grad[45] vs paper_forward: mean_abs=0.74517422914505, max_abs=5.0, mean_rel=0.16553156077861786, max_rel=811.581298828125, norm_rel=0.024019047617912292, ref_abs_avg=31.05063247680664, test_abs_avg=31.05340576171875
production_forward2 grad[46] vs paper_forward: mean_abs=0.7302632331848145, max_abs=4.5, mean_rel=0.1513608992099762, max_rel=1007.5457153320312, norm_rel=0.023849910125136375, ref_abs_avg=30.695472717285156, test_abs_avg=30.69317054748535
production_forward2 grad[47] vs paper_forward: mean_abs=0.5948944091796875, max_abs=2.75, mean_rel=0.06894312798976898, max_rel=3.2640678882598877, norm_rel=0.025145120918750763, ref_abs_avg=24.019821166992188, test_abs_avg=23.945133209228516
production_forward2 grad[48] vs paper_forward: mean_abs=0.7169768810272217, max_abs=5.3125, mean_rel=0.14417122304439545, max_rel=543.6375732421875, norm_rel=0.023853978142142296, ref_abs_avg=30.096996307373047, test_abs_avg=30.095314025878906
production_forward2 grad[49] vs paper_forward: mean_abs=0.6998094320297241, max_abs=4.5, mean_rel=0.15087094902992249, max_rel=791.1075439453125, norm_rel=0.023935426026582718, ref_abs_avg=29.347537994384766, test_abs_avg=29.34748649597168
production_forward2 grad[50] vs paper_forward: mean_abs=0.6426920890808105, max_abs=2.25, mean_rel=0.13843441009521484, max_rel=12.267640113830566, norm_rel=0.023875746876001358, ref_abs_avg=26.200777053833008, test_abs_avg=26.201881408691406
production_forward2 grad[51] vs paper_forward: mean_abs=0.7953622341156006, max_abs=5.25, mean_rel=0.17208707332611084, max_rel=1035.8988037109375, norm_rel=0.025574347004294395, ref_abs_avg=31.163330078125, test_abs_avg=31.163414001464844
production_forward2 grad[52] vs paper_forward: mean_abs=0.7799612283706665, max_abs=5.0, mean_rel=0.19294798374176025, max_rel=1549.033203125, norm_rel=0.025644781067967415, ref_abs_avg=30.496641159057617, test_abs_avg=30.496292114257812
production_forward2 grad[53] vs paper_forward: mean_abs=0.6273888349533081, max_abs=2.375, mean_rel=0.0844189003109932, max_rel=3.604853630065918, norm_rel=0.025746723636984825, ref_abs_avg=24.538482666015625, test_abs_avg=24.563095092773438
production_forward2 grad[54] vs paper_forward: mean_abs=0.7308011054992676, max_abs=5.25, mean_rel=0.16682980954647064, max_rel=931.5221557617188, norm_rel=0.02518179640173912, ref_abs_avg=29.054534912109375, test_abs_avg=29.05437660217285
production_forward2 grad[55] vs paper_forward: mean_abs=0.7152382731437683, max_abs=5.0, mean_rel=0.16254985332489014, max_rel=803.6041259765625, norm_rel=0.024897759780287743, ref_abs_avg=28.79184341430664, test_abs_avg=28.789012908935547
production_forward2 grad[56] vs paper_forward: mean_abs=0.5506000518798828, max_abs=2.125, mean_rel=0.08536951243877411, max_rel=5.174331188201904, norm_rel=0.02466435357928276, ref_abs_avg=22.585140228271484, test_abs_avg=22.587717056274414
production_forward2 grad[57] vs paper_forward: mean_abs=0.6815601587295532, max_abs=5.25, mean_rel=0.17603516578674316, max_rel=1691.76708984375, norm_rel=0.024789374321699142, ref_abs_avg=27.526811599731445, test_abs_avg=27.524600982666016
production_forward2 grad[58] vs paper_forward: mean_abs=0.6688162684440613, max_abs=4.296875, mean_rel=0.15613821148872375, max_rel=726.7530517578125, norm_rel=0.024500004947185516, ref_abs_avg=27.317047119140625, test_abs_avg=27.31409454345703
production_forward2 grad[59] vs paper_forward: mean_abs=0.5150737762451172, max_abs=2.5, mean_rel=0.06394688040018082, max_rel=1.5975428819656372, norm_rel=0.024688677862286568, ref_abs_avg=21.12051773071289, test_abs_avg=21.102792739868164
production_forward2 grad[60] vs paper_forward: mean_abs=0.6408146023750305, max_abs=4.0, mean_rel=0.16190704703330994, max_rel=737.5902099609375, norm_rel=0.02457529865205288, ref_abs_avg=26.095714569091797, test_abs_avg=26.09423065185547
production_forward2 grad[61] vs paper_forward: mean_abs=0.6362333297729492, max_abs=4.0625, mean_rel=0.1617751121520996, max_rel=1004.9108276367188, norm_rel=0.02484499290585518, ref_abs_avg=25.624282836914062, test_abs_avg=25.62140655517578
production_forward2 grad[62] vs paper_forward: mean_abs=0.49498918652534485, max_abs=2.0625, mean_rel=0.07909086346626282, max_rel=4.1412200927734375, norm_rel=0.02346300333738327, ref_abs_avg=21.310548782348633, test_abs_avg=21.325336456298828
production_forward2 grad[63] vs paper_forward: mean_abs=0.6039029955863953, max_abs=4.5, mean_rel=0.15322253108024597, max_rel=731.8175659179688, norm_rel=0.023805661126971245, ref_abs_avg=25.37474822998047, test_abs_avg=25.373680114746094
production_forward2 grad[64] vs paper_forward: mean_abs=0.5893523693084717, max_abs=4.0, mean_rel=0.15882177650928497, max_rel=1066.8040771484375, norm_rel=0.023657092824578285, ref_abs_avg=24.89907455444336, test_abs_avg=24.90178680419922
production_forward2 grad[65] vs paper_forward: mean_abs=0.43268463015556335, max_abs=2.0, mean_rel=0.16652929782867432, max_rel=30.565786361694336, norm_rel=0.022253111004829407, ref_abs_avg=20.2623233795166, test_abs_avg=20.253145217895508
production_forward2 grad[66] vs paper_forward: mean_abs=0.5786443948745728, max_abs=4.25, mean_rel=0.15315282344818115, max_rel=962.3609008789062, norm_rel=0.02353307604789734, ref_abs_avg=24.55636215209961, test_abs_avg=24.555936813354492
production_forward2 grad[67] vs paper_forward: mean_abs=0.5644108653068542, max_abs=4.25, mean_rel=0.15655067563056946, max_rel=1069.9786376953125, norm_rel=0.023516230285167694, ref_abs_avg=24.05465316772461, test_abs_avg=24.06093978881836
production_forward2 grad[68] vs paper_forward: mean_abs=0.4076204299926758, max_abs=1.75, mean_rel=0.07041728496551514, max_rel=6.838621616363525, norm_rel=0.020607585087418556, ref_abs_avg=20.01357650756836, test_abs_avg=19.98341941833496
production_forward2 grad[69] vs paper_forward: mean_abs=0.5451713800430298, max_abs=3.662109375, mean_rel=0.15176533162593842, max_rel=659.9449462890625, norm_rel=0.02321893721818924, ref_abs_avg=23.450950622558594, test_abs_avg=23.450260162353516
production_forward2 grad[70] vs paper_forward: mean_abs=0.52788907289505, max_abs=4.0, mean_rel=0.14634522795677185, max_rel=604.5116577148438, norm_rel=0.022922763600945473, ref_abs_avg=23.035188674926758, test_abs_avg=23.032344818115234
production_forward2 grad[71] vs paper_forward: mean_abs=0.43073558807373047, max_abs=1.5, mean_rel=0.08888132870197296, max_rel=8.661833763122559, norm_rel=0.02303456701338291, ref_abs_avg=18.70030975341797, test_abs_avg=18.721904754638672
production_forward2 grad[72] vs paper_forward: mean_abs=0.5211590528488159, max_abs=4.0, mean_rel=0.15466450154781342, max_rel=1473.919677734375, norm_rel=0.02293662540614605, ref_abs_avg=22.743389129638672, test_abs_avg=22.744857788085938
production_forward2 grad[73] vs paper_forward: mean_abs=0.5059949159622192, max_abs=3.265625, mean_rel=0.15701980888843536, max_rel=566.0809326171875, norm_rel=0.022605743259191513, ref_abs_avg=22.381113052368164, test_abs_avg=22.378276824951172
production_forward2 grad[74] vs paper_forward: mean_abs=0.47434425354003906, max_abs=1.6875, mean_rel=0.05177416652441025, max_rel=0.7070730328559875, norm_rel=0.02333921194076538, ref_abs_avg=21.225528717041016, test_abs_avg=21.207353591918945
production_forward2 grad[75] vs paper_forward: mean_abs=0.6036542654037476, max_abs=4.5, mean_rel=0.15714417397975922, max_rel=1265.4781494140625, norm_rel=0.024622708559036255, ref_abs_avg=24.548419952392578, test_abs_avg=24.546762466430664
production_forward2 grad[76] vs paper_forward: mean_abs=0.5810902118682861, max_abs=4.5625, mean_rel=0.1585487723350525, max_rel=791.52734375, norm_rel=0.02410469949245453, ref_abs_avg=24.11787223815918, test_abs_avg=24.116905212402344
production_forward2 grad[77] vs paper_forward: mean_abs=0.4510669708251953, max_abs=1.75, mean_rel=0.07546522468328476, max_rel=4.320899486541748, norm_rel=0.02255452238023281, ref_abs_avg=20.3289737701416, test_abs_avg=20.311765670776367
production_forward2 grad[78] vs paper_forward: mean_abs=0.5402970314025879, max_abs=4.5, mean_rel=0.15822988748550415, max_rel=1911.989501953125, norm_rel=0.023533226922154427, ref_abs_avg=22.941619873046875, test_abs_avg=22.94157600402832
production_forward2 grad[79] vs paper_forward: mean_abs=0.5330554246902466, max_abs=4.0, mean_rel=0.15136562287807465, max_rel=1130.5458984375, norm_rel=0.02358505316078663, ref_abs_avg=22.67870330810547, test_abs_avg=22.672941207885742
production_forward2 grad[80] vs paper_forward: mean_abs=0.407562792301178, max_abs=1.75, mean_rel=0.22139711678028107, max_rel=46.59020233154297, norm_rel=0.02279169298708439, ref_abs_avg=18.204681396484375, test_abs_avg=18.23261833190918
production_forward2 grad[81] vs paper_forward: mean_abs=0.5038443803787231, max_abs=5.5, mean_rel=0.15942202508449554, max_rel=1091.776123046875, norm_rel=0.023081788793206215, ref_abs_avg=21.82576560974121, test_abs_avg=21.825319290161133
production_forward2 grad[82] vs paper_forward: mean_abs=0.49144941568374634, max_abs=4.25, mean_rel=0.14298169314861298, max_rel=684.76025390625, norm_rel=0.02277851663529873, ref_abs_avg=21.640939712524414, test_abs_avg=21.651508331298828
production_forward2 grad[83] vs paper_forward: mean_abs=0.36061692237854004, max_abs=1.5, mean_rel=0.12834814190864563, max_rel=16.358797073364258, norm_rel=0.021705105900764465, ref_abs_avg=17.348146438598633, test_abs_avg=17.31696319580078
production_forward2 grad[84] vs paper_forward: mean_abs=0.4661515951156616, max_abs=4.0, mean_rel=0.14551864564418793, max_rel=809.2914428710938, norm_rel=0.022363314405083656, ref_abs_avg=20.90148162841797, test_abs_avg=20.901758193969727
production_forward2 grad[85] vs paper_forward: mean_abs=0.44745147228240967, max_abs=3.5, mean_rel=0.136757031083107, max_rel=685.172607421875, norm_rel=0.02151041477918625, ref_abs_avg=20.79904556274414, test_abs_avg=20.794673919677734
production_forward2 grad[86] vs paper_forward: mean_abs=0.35024577379226685, max_abs=1.625, mean_rel=0.12582038342952728, max_rel=18.665367126464844, norm_rel=0.02124897949397564, ref_abs_avg=16.74285125732422, test_abs_avg=16.7691650390625
production_forward2 grad[87] vs paper_forward: mean_abs=0.4370751678943634, max_abs=4.25, mean_rel=0.1388441026210785, max_rel=757.8359985351562, norm_rel=0.021740086376667023, ref_abs_avg=20.177597045898438, test_abs_avg=20.17670440673828
production_forward2 grad[88] vs paper_forward: mean_abs=0.4286007881164551, max_abs=3.5, mean_rel=0.13593775033950806, max_rel=960.6016845703125, norm_rel=0.021408701315522194, ref_abs_avg=20.11143684387207, test_abs_avg=20.108489990234375
production_forward2 grad[89] vs paper_forward: mean_abs=0.3137068748474121, max_abs=1.625, mean_rel=0.13164973258972168, max_rel=25.926090240478516, norm_rel=0.019752057269215584, ref_abs_avg=16.32723617553711, test_abs_avg=16.329437255859375
production_forward2 grad[90] vs paper_forward: mean_abs=0.41249603033065796, max_abs=3.51171875, mean_rel=0.12980180978775024, max_rel=623.8633422851562, norm_rel=0.021326687186956406, ref_abs_avg=19.447284698486328, test_abs_avg=19.447778701782227
production_forward2 grad[91] vs paper_forward: mean_abs=0.40499308705329895, max_abs=3.5, mean_rel=0.12431652843952179, max_rel=461.01007080078125, norm_rel=0.021157391369342804, ref_abs_avg=19.231040954589844, test_abs_avg=19.230525970458984
production_forward2 grad[92] vs paper_forward: mean_abs=0.3228144645690918, max_abs=1.1328125, mean_rel=0.2209193855524063, max_rel=67.31183624267578, norm_rel=0.019878512248396873, ref_abs_avg=15.876129150390625, test_abs_avg=15.914185523986816
production_forward2 grad[93] vs paper_forward: mean_abs=0.3903038501739502, max_abs=4.125, mean_rel=0.12645955383777618, max_rel=653.8042602539062, norm_rel=0.020557107403874397, ref_abs_avg=19.1722412109375, test_abs_avg=19.17322540283203
production_forward2 grad[94] vs paper_forward: mean_abs=0.375423789024353, max_abs=4.1875, mean_rel=0.12984462082386017, max_rel=527.431640625, norm_rel=0.020599491894245148, ref_abs_avg=18.42205047607422, test_abs_avg=18.424612045288086
production_forward2 grad[95] vs paper_forward: mean_abs=0.30081313848495483, max_abs=1.125, mean_rel=0.11832452565431595, max_rel=13.680660247802734, norm_rel=0.019231054931879044, ref_abs_avg=15.947969436645508, test_abs_avg=15.950471878051758
production_forward2 grad[96] vs paper_forward: mean_abs=0.3740437626838684, max_abs=3.7734375, mean_rel=0.12108957022428513, max_rel=1001.7108154296875, norm_rel=0.02028912864625454, ref_abs_avg=18.699857711791992, test_abs_avg=18.700469970703125
production_forward2 grad[97] vs paper_forward: mean_abs=0.360921174287796, max_abs=3.25, mean_rel=0.1308044195175171, max_rel=1027.0816650390625, norm_rel=0.02063087746500969, ref_abs_avg=17.84792709350586, test_abs_avg=17.850805282592773
identity layers + randn queries
paper_forward fwd+bwd:  380.036 ms
paper_forward bwd-only: 294.280 ms
paper_forward peak allocated: fwd=30.001 GiB, fwd+bwd=32.120 GiB
paper_forward peak reserved:  fwd=30.041 GiB, fwd+bwd=32.791 GiB
production_forward fwd+bwd:  109.537 ms
production_forward bwd-only: 89.149 ms
production_forward peak allocated: fwd=3.368 GiB, fwd+bwd=6.993 GiB
production_forward peak reserved:  fwd=3.625 GiB, fwd+bwd=8.125 GiB
production_forward2 fwd+bwd:  224.434 ms
production_forward2 bwd-only: 202.243 ms
production_forward2 peak allocated: fwd=2.864 GiB, fwd+bwd=6.243 GiB
production_forward2 peak reserved:  fwd=3.250 GiB, fwd+bwd=9.000 GiB

grads check for swiglu layers + randn queries
production_forward vs paper_forward output: mean_abs=0.001598995877429843, max_abs=0.0390625
production_forward grad[0] vs paper_forward: mean_abs=0.008288578130304813, max_abs=0.35546875, mean_rel=0.07292834669351578, max_rel=107.41061401367188, norm_rel=0.0199262835085392, ref_abs_avg=0.4484460949897766, test_abs_avg=0.44845741987228394
production_forward grad[1] vs paper_forward: mean_abs=7.081351280212402, max_abs=56.0, mean_rel=0.1956329047679901, max_rel=1009.050537109375, norm_rel=0.020090969279408455, ref_abs_avg=310.4834289550781, test_abs_avg=310.5332946777344
production_forward grad[2] vs paper_forward: mean_abs=1.2656574249267578, max_abs=5.0, mean_rel=0.14384521543979645, max_rel=16.347814559936523, norm_rel=0.02324247919023037, ref_abs_avg=54.8831672668457, test_abs_avg=54.89973449707031
production_forward grad[3] vs paper_forward: mean_abs=1.5280731916427612, max_abs=10.125, mean_rel=0.16875168681144714, max_rel=2844.029052734375, norm_rel=0.023919429630041122, ref_abs_avg=64.19351196289062, test_abs_avg=64.19265747070312
production_forward grad[4] vs paper_forward: mean_abs=1.4964182376861572, max_abs=9.5, mean_rel=0.1537240743637085, max_rel=590.3612670898438, norm_rel=0.023491691797971725, ref_abs_avg=64.02261352539062, test_abs_avg=64.0263671875
production_forward grad[5] vs paper_forward: mean_abs=1.0611400604248047, max_abs=4.125, mean_rel=0.06299668550491333, max_rel=1.9985464811325073, norm_rel=0.022401096299290657, ref_abs_avg=48.73845672607422, test_abs_avg=48.75597381591797
production_forward grad[6] vs paper_forward: mean_abs=1.3457257747650146, max_abs=8.5, mean_rel=0.15783441066741943, max_rel=3232.291259765625, norm_rel=0.023587940260767937, ref_abs_avg=57.370582580566406, test_abs_avg=57.370853424072266
production_forward grad[7] vs paper_forward: mean_abs=1.3175104856491089, max_abs=8.0, mean_rel=0.14106261730194092, max_rel=647.16845703125, norm_rel=0.0232516061514616, ref_abs_avg=56.91973114013672, test_abs_avg=56.91423797607422
production_forward grad[8] vs paper_forward: mean_abs=0.9863090515136719, max_abs=4.109375, mean_rel=0.075393907725811, max_rel=5.294484615325928, norm_rel=0.02112535946071148, ref_abs_avg=45.58454513549805, test_abs_avg=45.59862518310547
production_forward grad[9] vs paper_forward: mean_abs=1.2095518112182617, max_abs=7.5, mean_rel=0.16205689311027527, max_rel=2198.355712890625, norm_rel=0.023305639624595642, ref_abs_avg=52.19643783569336, test_abs_avg=52.19556427001953
production_forward grad[10] vs paper_forward: mean_abs=1.1801409721374512, max_abs=7.25, mean_rel=0.1629350185394287, max_rel=1151.958251953125, norm_rel=0.02298276498913765, ref_abs_avg=51.634132385253906, test_abs_avg=51.63665771484375
production_forward grad[11] vs paper_forward: mean_abs=0.901752233505249, max_abs=3.25, mean_rel=0.305527001619339, max_rel=88.89692687988281, norm_rel=0.02243700996041298, ref_abs_avg=40.377845764160156, test_abs_avg=40.347068786621094
production_forward grad[12] vs paper_forward: mean_abs=1.1187939643859863, max_abs=7.0, mean_rel=0.1566839963197708, max_rel=1662.0665283203125, norm_rel=0.023125778883695602, ref_abs_avg=48.611053466796875, test_abs_avg=48.61178970336914
production_forward grad[13] vs paper_forward: mean_abs=1.0925472974777222, max_abs=7.0, mean_rel=0.16528457403182983, max_rel=1527.7606201171875, norm_rel=0.02289363369345665, ref_abs_avg=47.90937423706055, test_abs_avg=47.9150390625
production_forward grad[14] vs paper_forward: mean_abs=0.8498403429985046, max_abs=3.875, mean_rel=0.18065977096557617, max_rel=35.018131256103516, norm_rel=0.023391228169202805, ref_abs_avg=36.97727584838867, test_abs_avg=36.887184143066406
production_forward grad[15] vs paper_forward: mean_abs=1.0377991199493408, max_abs=6.3125, mean_rel=0.17130157351493835, max_rel=1421.679443359375, norm_rel=0.022967729717493057, ref_abs_avg=45.370079040527344, test_abs_avg=45.37201690673828
production_forward grad[16] vs paper_forward: mean_abs=1.007034182548523, max_abs=6.5, mean_rel=0.13743944466114044, max_rel=950.0743408203125, norm_rel=0.02261751890182495, ref_abs_avg=44.694664001464844, test_abs_avg=44.69583511352539
production_forward grad[17] vs paper_forward: mean_abs=0.7823125720024109, max_abs=3.5, mean_rel=0.20988929271697998, max_rel=23.00533103942871, norm_rel=0.02367115579545498, ref_abs_avg=33.04541015625, test_abs_avg=33.166439056396484
production_forward grad[18] vs paper_forward: mean_abs=0.9760280847549438, max_abs=6.5, mean_rel=0.15217578411102295, max_rel=1004.3948364257812, norm_rel=0.022823169827461243, ref_abs_avg=42.98725891113281, test_abs_avg=42.988121032714844
production_forward grad[19] vs paper_forward: mean_abs=0.9532803297042847, max_abs=5.5, mean_rel=0.14119048416614532, max_rel=1032.6776123046875, norm_rel=0.022638602182269096, ref_abs_avg=42.357398986816406, test_abs_avg=42.35993194580078
production_forward grad[20] vs paper_forward: mean_abs=0.7404556274414062, max_abs=2.75, mean_rel=0.08091916888952255, max_rel=2.2821779251098633, norm_rel=0.023855768144130707, ref_abs_avg=31.256664276123047, test_abs_avg=31.231319427490234
production_forward grad[21] vs paper_forward: mean_abs=0.9224860072135925, max_abs=6.0, mean_rel=0.1596052646636963, max_rel=1981.6336669921875, norm_rel=0.022713512182235718, ref_abs_avg=40.847232818603516, test_abs_avg=40.84765625
production_forward grad[22] vs paper_forward: mean_abs=0.8990029692649841, max_abs=5.0, mean_rel=0.16134881973266602, max_rel=826.2901000976562, norm_rel=0.022399868816137314, ref_abs_avg=40.344058990478516, test_abs_avg=40.34706115722656
production_forward grad[23] vs paper_forward: mean_abs=0.7006509304046631, max_abs=2.59765625, mean_rel=0.20987659692764282, max_rel=62.52695083618164, norm_rel=0.02220282517373562, ref_abs_avg=32.08567810058594, test_abs_avg=32.046905517578125
production_forward grad[24] vs paper_forward: mean_abs=0.8754158020019531, max_abs=5.5, mean_rel=0.17102470993995667, max_rel=1583.719482421875, norm_rel=0.022529134526848793, ref_abs_avg=39.062042236328125, test_abs_avg=39.06341552734375
production_forward grad[25] vs paper_forward: mean_abs=0.8573504686355591, max_abs=5.3203125, mean_rel=0.15887150168418884, max_rel=1359.4892578125, norm_rel=0.022307544946670532, ref_abs_avg=38.61917495727539, test_abs_avg=38.6202392578125
production_forward grad[26] vs paper_forward: mean_abs=0.8589382171630859, max_abs=3.75, mean_rel=0.13595303893089294, max_rel=13.886634826660156, norm_rel=0.024358080700039864, ref_abs_avg=35.501075744628906, test_abs_avg=35.48157501220703
production_forward grad[27] vs paper_forward: mean_abs=1.0077848434448242, max_abs=6.75, mean_rel=0.1545708179473877, max_rel=1006.8345336914062, norm_rel=0.024230970069766045, ref_abs_avg=41.74311828613281, test_abs_avg=41.74680709838867
production_forward grad[28] vs paper_forward: mean_abs=0.9871214628219604, max_abs=6.0, mean_rel=0.16768665611743927, max_rel=1278.0721435546875, norm_rel=0.02406231500208378, ref_abs_avg=41.173187255859375, test_abs_avg=41.18024444580078
production_forward grad[29] vs paper_forward: mean_abs=0.7684764862060547, max_abs=3.0, mean_rel=0.08458306640386581, max_rel=3.3239545822143555, norm_rel=0.023728670552372932, ref_abs_avg=31.73857879638672, test_abs_avg=31.807832717895508
production_forward grad[30] vs paper_forward: mean_abs=0.9310947060585022, max_abs=6.0, mean_rel=0.16203586757183075, max_rel=803.509765625, norm_rel=0.024630235508084297, ref_abs_avg=37.93181610107422, test_abs_avg=37.9319953918457
production_forward grad[31] vs paper_forward: mean_abs=0.914505124092102, max_abs=5.5, mean_rel=0.1663823425769806, max_rel=2117.497314453125, norm_rel=0.02433951571583748, ref_abs_avg=37.716552734375, test_abs_avg=37.716156005859375
production_forward grad[32] vs paper_forward: mean_abs=0.7176399230957031, max_abs=2.75, mean_rel=0.0750516876578331, max_rel=9.849782943725586, norm_rel=0.02433556132018566, ref_abs_avg=30.817049026489258, test_abs_avg=30.90110206604004
production_forward grad[33] vs paper_forward: mean_abs=0.8723030090332031, max_abs=6.0, mean_rel=0.15825489163398743, max_rel=1017.247802734375, norm_rel=0.02442355267703533, ref_abs_avg=35.81037521362305, test_abs_avg=35.81218338012695
production_forward grad[34] vs paper_forward: mean_abs=0.8595989942550659, max_abs=5.5, mean_rel=0.16030049324035645, max_rel=1061.163818359375, norm_rel=0.024600310251116753, ref_abs_avg=35.081459045410156, test_abs_avg=35.083396911621094
production_forward grad[35] vs paper_forward: mean_abs=0.6466665267944336, max_abs=3.0, mean_rel=0.1063254177570343, max_rel=13.412796974182129, norm_rel=0.022848624736070633, ref_abs_avg=29.224056243896484, test_abs_avg=29.22913360595703
production_forward grad[36] vs paper_forward: mean_abs=0.811225950717926, max_abs=5.0, mean_rel=0.1633222997188568, max_rel=1797.68505859375, norm_rel=0.024180656298995018, ref_abs_avg=33.62384033203125, test_abs_avg=33.625022888183594
production_forward grad[37] vs paper_forward: mean_abs=0.8048603534698486, max_abs=5.0, mean_rel=0.1628207564353943, max_rel=738.7098999023438, norm_rel=0.024335019290447235, ref_abs_avg=33.20044708251953, test_abs_avg=33.19568634033203
production_forward grad[38] vs paper_forward: mean_abs=0.609623908996582, max_abs=2.875, mean_rel=0.14153870940208435, max_rel=21.825397491455078, norm_rel=0.022769184783101082, ref_abs_avg=26.875974655151367, test_abs_avg=26.87994384765625
production_forward grad[39] vs paper_forward: mean_abs=0.7676750421524048, max_abs=7.0, mean_rel=0.1590031385421753, max_rel=1730.15771484375, norm_rel=0.023844383656978607, ref_abs_avg=32.269744873046875, test_abs_avg=32.269203186035156
production_forward grad[40] vs paper_forward: mean_abs=0.760809063911438, max_abs=4.75, mean_rel=0.14718955755233765, max_rel=677.6763305664062, norm_rel=0.02381826564669609, ref_abs_avg=32.033409118652344, test_abs_avg=32.03887176513672
production_forward grad[41] vs paper_forward: mean_abs=0.5855464935302734, max_abs=2.875, mean_rel=0.12852919101715088, max_rel=21.68138885498047, norm_rel=0.023423466831445694, ref_abs_avg=25.60778045654297, test_abs_avg=25.64255142211914
production_forward grad[42] vs paper_forward: mean_abs=0.7319337129592896, max_abs=5.5, mean_rel=0.16770236194133759, max_rel=1467.1849365234375, norm_rel=0.023712750524282455, ref_abs_avg=30.938379287719727, test_abs_avg=30.940876007080078
production_forward grad[43] vs paper_forward: mean_abs=0.7188779711723328, max_abs=4.75, mean_rel=0.16641105711460114, max_rel=1016.7569580078125, norm_rel=0.023715445771813393, ref_abs_avg=30.37830924987793, test_abs_avg=30.378948211669922
production_forward grad[44] vs paper_forward: mean_abs=0.5641007423400879, max_abs=2.6171875, mean_rel=0.11448574811220169, max_rel=5.877610206604004, norm_rel=0.023788975551724434, ref_abs_avg=23.288761138916016, test_abs_avg=23.290498733520508
production_forward grad[45] vs paper_forward: mean_abs=0.691692054271698, max_abs=4.5625, mean_rel=0.15514042973518372, max_rel=1410.0625, norm_rel=0.02347080036997795, ref_abs_avg=29.542495727539062, test_abs_avg=29.543182373046875
production_forward grad[46] vs paper_forward: mean_abs=0.6806788444519043, max_abs=5.0, mean_rel=0.13864728808403015, max_rel=1132.5992431640625, norm_rel=0.023308532312512398, ref_abs_avg=29.20745086669922, test_abs_avg=29.204452514648438
production_forward grad[47] vs paper_forward: mean_abs=0.5277500152587891, max_abs=2.21875, mean_rel=0.21281884610652924, max_rel=57.805389404296875, norm_rel=0.0224510058760643, ref_abs_avg=23.65066146850586, test_abs_avg=23.66516876220703
production_forward grad[48] vs paper_forward: mean_abs=0.6665143966674805, max_abs=4.0, mean_rel=0.15888077020645142, max_rel=1112.2279052734375, norm_rel=0.023253416642546654, ref_abs_avg=28.688676834106445, test_abs_avg=28.690357208251953
production_forward grad[49] vs paper_forward: mean_abs=0.6501166820526123, max_abs=4.5, mean_rel=0.1652449667453766, max_rel=1724.449951171875, norm_rel=0.022976292297244072, ref_abs_avg=28.343481063842773, test_abs_avg=28.34246826171875
production_forward grad[50] vs paper_forward: mean_abs=0.603940486907959, max_abs=2.0, mean_rel=0.11260071396827698, max_rel=16.607376098632812, norm_rel=0.02358432300388813, ref_abs_avg=25.975366592407227, test_abs_avg=26.012142181396484
production_forward grad[51] vs paper_forward: mean_abs=0.7400772571563721, max_abs=5.25, mean_rel=0.16337309777736664, max_rel=1053.984130859375, norm_rel=0.024808403104543686, ref_abs_avg=29.86263656616211, test_abs_avg=29.86254119873047
production_forward grad[52] vs paper_forward: mean_abs=0.7273421287536621, max_abs=5.015625, mean_rel=0.15883874893188477, max_rel=701.1464233398438, norm_rel=0.02499982714653015, ref_abs_avg=29.18985939025879, test_abs_avg=29.19192123413086
production_forward grad[53] vs paper_forward: mean_abs=0.5512580871582031, max_abs=2.34375, mean_rel=0.3012048006057739, max_rel=55.73248291015625, norm_rel=0.02420349046587944, ref_abs_avg=22.667766571044922, test_abs_avg=22.663219451904297
production_forward grad[54] vs paper_forward: mean_abs=0.6806259155273438, max_abs=5.5, mean_rel=0.1576574146747589, max_rel=805.2994384765625, norm_rel=0.024716144427657127, ref_abs_avg=27.59446144104004, test_abs_avg=27.5922794342041
production_forward grad[55] vs paper_forward: mean_abs=0.6660135984420776, max_abs=4.0, mean_rel=0.16119502484798431, max_rel=643.9343872070312, norm_rel=0.02450133115053177, ref_abs_avg=27.251012802124023, test_abs_avg=27.245803833007812
production_forward grad[56] vs paper_forward: mean_abs=0.5035877227783203, max_abs=1.9375, mean_rel=0.08083337545394897, max_rel=5.055439472198486, norm_rel=0.021339913830161095, ref_abs_avg=23.49966049194336, test_abs_avg=23.526540756225586
production_forward grad[57] vs paper_forward: mean_abs=0.633513331413269, max_abs=6.0, mean_rel=0.15042535960674286, max_rel=584.8497924804688, norm_rel=0.02410111576318741, ref_abs_avg=26.337615966796875, test_abs_avg=26.337970733642578
production_forward grad[58] vs paper_forward: mean_abs=0.6220766305923462, max_abs=4.75, mean_rel=0.15921899676322937, max_rel=1127.3560791015625, norm_rel=0.024166399613022804, ref_abs_avg=25.766258239746094, test_abs_avg=25.76342010498047
production_forward grad[59] vs paper_forward: mean_abs=0.4817540645599365, max_abs=2.0, mean_rel=0.17679548263549805, max_rel=33.759490966796875, norm_rel=0.022978339344263077, ref_abs_avg=21.326473236083984, test_abs_avg=21.305805206298828
production_forward grad[60] vs paper_forward: mean_abs=0.5921012163162231, max_abs=4.78125, mean_rel=0.15051138401031494, max_rel=946.7360229492188, norm_rel=0.023751115426421165, ref_abs_avg=24.91204833984375, test_abs_avg=24.914026260375977
production_forward grad[61] vs paper_forward: mean_abs=0.5870222449302673, max_abs=4.5, mean_rel=0.15912263095378876, max_rel=1088.7843017578125, norm_rel=0.023919444531202316, ref_abs_avg=24.58203887939453, test_abs_avg=24.586597442626953
production_forward grad[62] vs paper_forward: mean_abs=0.4565774202346802, max_abs=2.0625, mean_rel=0.2525251805782318, max_rel=78.74299621582031, norm_rel=0.02421986125409603, ref_abs_avg=19.484519958496094, test_abs_avg=19.492374420166016
production_forward grad[63] vs paper_forward: mean_abs=0.568342924118042, max_abs=4.0, mean_rel=0.16083519160747528, max_rel=874.0186157226562, norm_rel=0.023342721164226532, ref_abs_avg=24.343616485595703, test_abs_avg=24.343231201171875
production_forward grad[64] vs paper_forward: mean_abs=0.5585085153579712, max_abs=5.0, mean_rel=0.15730556845664978, max_rel=1156.794921875, norm_rel=0.023498032242059708, ref_abs_avg=23.834144592285156, test_abs_avg=23.834392547607422
production_forward grad[65] vs paper_forward: mean_abs=0.4249236583709717, max_abs=1.625, mean_rel=0.1464722752571106, max_rel=19.89228630065918, norm_rel=0.022007696330547333, ref_abs_avg=19.25201416015625, test_abs_avg=19.288494110107422
production_forward grad[66] vs paper_forward: mean_abs=0.5410266518592834, max_abs=4.25, mean_rel=0.1504644751548767, max_rel=1250.755859375, norm_rel=0.023248596116900444, ref_abs_avg=23.3234920501709, test_abs_avg=23.32605743408203
production_forward grad[67] vs paper_forward: mean_abs=0.5237786769866943, max_abs=3.625, mean_rel=0.13908223807811737, max_rel=596.3701171875, norm_rel=0.02274477481842041, ref_abs_avg=23.01348114013672, test_abs_avg=23.012834548950195
production_forward grad[68] vs paper_forward: mean_abs=0.45627617835998535, max_abs=1.75, mean_rel=0.0806334912776947, max_rel=5.111059665679932, norm_rel=0.023941101506352425, ref_abs_avg=18.7249813079834, test_abs_avg=18.742643356323242
production_forward grad[69] vs paper_forward: mean_abs=0.5144160985946655, max_abs=3.5, mean_rel=0.14128074049949646, max_rel=1062.6270751953125, norm_rel=0.022793568670749664, ref_abs_avg=22.579694747924805, test_abs_avg=22.581743240356445
production_forward grad[70] vs paper_forward: mean_abs=0.5072731375694275, max_abs=3.5625, mean_rel=0.1437688171863556, max_rel=809.0127563476562, norm_rel=0.022963592782616615, ref_abs_avg=22.11968994140625, test_abs_avg=22.120471954345703
production_forward grad[71] vs paper_forward: mean_abs=0.4129199981689453, max_abs=1.5, mean_rel=0.0804847776889801, max_rel=8.259167671203613, norm_rel=0.022644925862550735, ref_abs_avg=17.850954055786133, test_abs_avg=17.846466064453125
production_forward grad[72] vs paper_forward: mean_abs=0.49510735273361206, max_abs=3.5, mean_rel=0.1492491513490677, max_rel=982.1070556640625, norm_rel=0.022356783971190453, ref_abs_avg=22.112873077392578, test_abs_avg=22.112937927246094
production_forward grad[73] vs paper_forward: mean_abs=0.4850125014781952, max_abs=3.5, mean_rel=0.13861329853534698, max_rel=584.2521362304688, norm_rel=0.022434638813138008, ref_abs_avg=21.63409996032715, test_abs_avg=21.640655517578125
production_forward grad[74] vs paper_forward: mean_abs=0.44619089365005493, max_abs=1.96875, mean_rel=0.12451951205730438, max_rel=12.48934268951416, norm_rel=0.023539436981081963, ref_abs_avg=18.988122940063477, test_abs_avg=19.01197052001953
production_forward grad[75] vs paper_forward: mean_abs=0.5530115365982056, max_abs=5.0, mean_rel=0.1570143699645996, max_rel=1133.969482421875, norm_rel=0.02370152622461319, ref_abs_avg=23.32159423828125, test_abs_avg=23.32469940185547
production_forward grad[76] vs paper_forward: mean_abs=0.5321067571640015, max_abs=4.0, mean_rel=0.1402251422405243, max_rel=994.4103393554688, norm_rel=0.02353569306433201, ref_abs_avg=22.630910873413086, test_abs_avg=22.63231658935547
production_forward grad[77] vs paper_forward: mean_abs=0.41439348459243774, max_abs=1.5, mean_rel=0.28127747774124146, max_rel=102.14999389648438, norm_rel=0.022601131349802017, ref_abs_avg=18.42852783203125, test_abs_avg=18.443965911865234
production_forward grad[78] vs paper_forward: mean_abs=0.5040008425712585, max_abs=4.0, mean_rel=0.1481814980506897, max_rel=1028.211669921875, norm_rel=0.022991621866822243, ref_abs_avg=21.89897918701172, test_abs_avg=21.902976989746094
production_forward grad[79] vs paper_forward: mean_abs=0.4919624924659729, max_abs=4.0, mean_rel=0.16276973485946655, max_rel=658.2900390625, norm_rel=0.023004993796348572, ref_abs_avg=21.47727394104004, test_abs_avg=21.483959197998047
production_forward grad[80] vs paper_forward: mean_abs=0.3794928789138794, max_abs=1.375, mean_rel=0.18159857392311096, max_rel=45.14019775390625, norm_rel=0.021390778943896294, ref_abs_avg=17.740741729736328, test_abs_avg=17.709720611572266
production_forward grad[81] vs paper_forward: mean_abs=0.4782363176345825, max_abs=4.5, mean_rel=0.14297039806842804, max_rel=983.7157592773438, norm_rel=0.02271643467247486, ref_abs_avg=21.058176040649414, test_abs_avg=21.05921173095703
production_forward grad[82] vs paper_forward: mean_abs=0.4602818489074707, max_abs=3.875, mean_rel=0.13719142973423004, max_rel=543.8551025390625, norm_rel=0.022482063621282578, ref_abs_avg=20.555500030517578, test_abs_avg=20.556324005126953
production_forward grad[83] vs paper_forward: mean_abs=0.3519860804080963, max_abs=1.484375, mean_rel=0.24509532749652863, max_rel=89.10694122314453, norm_rel=0.021294597536325455, ref_abs_avg=16.49883460998535, test_abs_avg=16.493593215942383
production_forward grad[84] vs paper_forward: mean_abs=0.43584713339805603, max_abs=4.25, mean_rel=0.14633628726005554, max_rel=878.4048461914062, norm_rel=0.0219841580837965, ref_abs_avg=19.848026275634766, test_abs_avg=19.84939956665039
production_forward grad[85] vs paper_forward: mean_abs=0.4304916262626648, max_abs=3.875, mean_rel=0.14407026767730713, max_rel=743.0347900390625, norm_rel=0.022252772003412247, ref_abs_avg=19.48004913330078, test_abs_avg=19.485265731811523
production_forward grad[86] vs paper_forward: mean_abs=0.3269081711769104, max_abs=1.25, mean_rel=0.4759497344493866, max_rel=218.8824920654297, norm_rel=0.020478099584579468, ref_abs_avg=16.435558319091797, test_abs_avg=16.432861328125
production_forward grad[87] vs paper_forward: mean_abs=0.4097139835357666, max_abs=4.125, mean_rel=0.1454780101776123, max_rel=616.2506103515625, norm_rel=0.021681047976017, ref_abs_avg=18.988901138305664, test_abs_avg=18.989234924316406
production_forward grad[88] vs paper_forward: mean_abs=0.4023958444595337, max_abs=4.0, mean_rel=0.1315116137266159, max_rel=647.3062133789062, norm_rel=0.021468114107847214, ref_abs_avg=18.894126892089844, test_abs_avg=18.89950180053711
production_forward grad[89] vs paper_forward: mean_abs=0.30925726890563965, max_abs=1.25, mean_rel=0.1161336600780487, max_rel=29.43600082397461, norm_rel=0.020435690879821777, ref_abs_avg=15.507356643676758, test_abs_avg=15.50868034362793
production_forward grad[90] vs paper_forward: mean_abs=0.38978829979896545, max_abs=4.0, mean_rel=0.1327289342880249, max_rel=855.6271362304688, norm_rel=0.021048571914434433, ref_abs_avg=18.622215270996094, test_abs_avg=18.62249755859375
production_forward grad[91] vs paper_forward: mean_abs=0.38569164276123047, max_abs=4.0, mean_rel=0.13141925632953644, max_rel=790.6815185546875, norm_rel=0.02108919434249401, ref_abs_avg=18.503368377685547, test_abs_avg=18.504779815673828
production_forward grad[92] vs paper_forward: mean_abs=0.31705760955810547, max_abs=1.25, mean_rel=0.06281159818172455, max_rel=4.285227298736572, norm_rel=0.02104104869067669, ref_abs_avg=15.035136222839355, test_abs_avg=15.035673141479492
production_forward grad[93] vs paper_forward: mean_abs=0.37458500266075134, max_abs=4.0, mean_rel=0.12507463991641998, max_rel=922.9122314453125, norm_rel=0.020803265273571014, ref_abs_avg=18.205066680908203, test_abs_avg=18.205493927001953
production_forward grad[94] vs paper_forward: mean_abs=0.3572072982788086, max_abs=3.625, mean_rel=0.12369617819786072, max_rel=607.1964111328125, norm_rel=0.021007679402828217, ref_abs_avg=17.159446716308594, test_abs_avg=17.152143478393555
production_forward grad[95] vs paper_forward: mean_abs=0.27359336614608765, max_abs=1.09375, mean_rel=0.10556816309690475, max_rel=15.799345970153809, norm_rel=0.019311968237161636, ref_abs_avg=14.141448020935059, test_abs_avg=14.145090103149414
production_forward grad[96] vs paper_forward: mean_abs=0.3348672688007355, max_abs=3.5, mean_rel=0.12259964644908905, max_rel=843.1222534179688, norm_rel=0.01964593306183815, ref_abs_avg=17.300182342529297, test_abs_avg=17.299549102783203
production_forward grad[97] vs paper_forward: mean_abs=0.3300829231739044, max_abs=3.703125, mean_rel=0.11930444836616516, max_rel=706.7255859375, norm_rel=0.0191425122320652, ref_abs_avg=17.50154685974121, test_abs_avg=17.50971221923828
production_forward2 vs paper_forward output: mean_abs=0.001598995877429843, max_abs=0.0390625
production_forward2 grad[0] vs paper_forward: mean_abs=0.008417130447924137, max_abs=0.33984375, mean_rel=0.07399022579193115, max_rel=130.4309844970703, norm_rel=0.02020755410194397, ref_abs_avg=0.4484460949897766, test_abs_avg=0.44844865798950195
production_forward2 grad[1] vs paper_forward: mean_abs=7.1378302574157715, max_abs=56.0, mean_rel=0.1836176961660385, max_rel=729.9352416992188, norm_rel=0.02023393288254738, ref_abs_avg=310.4834289550781, test_abs_avg=310.5020751953125
production_forward2 grad[2] vs paper_forward: mean_abs=1.315037488937378, max_abs=5.5, mean_rel=0.11928556859493256, max_rel=10.576071739196777, norm_rel=0.024149512872099876, ref_abs_avg=54.8831672668457, test_abs_avg=54.960697174072266
production_forward2 grad[3] vs paper_forward: mean_abs=1.5466539859771729, max_abs=10.0, mean_rel=0.1748879998922348, max_rel=3921.750244140625, norm_rel=0.02419653907418251, ref_abs_avg=64.19351196289062, test_abs_avg=64.19412231445312
production_forward2 grad[4] vs paper_forward: mean_abs=1.516738772392273, max_abs=10.0, mean_rel=0.15162745118141174, max_rel=483.3612976074219, norm_rel=0.02382010966539383, ref_abs_avg=64.02261352539062, test_abs_avg=64.02618408203125
production_forward2 grad[5] vs paper_forward: mean_abs=1.0894088745117188, max_abs=4.5, mean_rel=0.06428211182355881, max_rel=1.5219806432724, norm_rel=0.022992592304944992, ref_abs_avg=48.73845672607422, test_abs_avg=48.82788848876953
production_forward2 grad[6] vs paper_forward: mean_abs=1.36234450340271, max_abs=9.0, mean_rel=0.16230158507823944, max_rel=3528.57177734375, norm_rel=0.023871123790740967, ref_abs_avg=57.370582580566406, test_abs_avg=57.36933898925781
production_forward2 grad[7] vs paper_forward: mean_abs=1.3323771953582764, max_abs=9.25, mean_rel=0.14142028987407684, max_rel=731.3009643554688, norm_rel=0.023510318249464035, ref_abs_avg=56.91973114013672, test_abs_avg=56.91271209716797
production_forward2 grad[8] vs paper_forward: mean_abs=0.9751243591308594, max_abs=4.203125, mean_rel=0.07556581497192383, max_rel=4.616573810577393, norm_rel=0.021301882341504097, ref_abs_avg=45.58454513549805, test_abs_avg=45.63379669189453
production_forward2 grad[9] vs paper_forward: mean_abs=1.2218576669692993, max_abs=7.625, mean_rel=0.16293925046920776, max_rel=2130.021484375, norm_rel=0.023536138236522675, ref_abs_avg=52.19643783569336, test_abs_avg=52.19416046142578
production_forward2 grad[10] vs paper_forward: mean_abs=1.1915953159332275, max_abs=7.0, mean_rel=0.1562541127204895, max_rel=968.0574340820312, norm_rel=0.023203864693641663, ref_abs_avg=51.634132385253906, test_abs_avg=51.63443374633789
production_forward2 grad[11] vs paper_forward: mean_abs=0.9505360126495361, max_abs=3.34375, mean_rel=0.32207733392715454, max_rel=105.48511505126953, norm_rel=0.02359248884022236, ref_abs_avg=40.377845764160156, test_abs_avg=40.36412048339844
production_forward2 grad[12] vs paper_forward: mean_abs=1.1307896375656128, max_abs=7.375, mean_rel=0.16374506056308746, max_rel=2047.9288330078125, norm_rel=0.02338392660021782, ref_abs_avg=48.611053466796875, test_abs_avg=48.611122131347656
production_forward2 grad[13] vs paper_forward: mean_abs=1.1057090759277344, max_abs=7.0, mean_rel=0.16454148292541504, max_rel=1045.141845703125, norm_rel=0.023169467225670815, ref_abs_avg=47.90937423706055, test_abs_avg=47.912109375
production_forward2 grad[14] vs paper_forward: mean_abs=0.8466776013374329, max_abs=3.5, mean_rel=0.2495899200439453, max_rel=51.4991569519043, norm_rel=0.023132413625717163, ref_abs_avg=36.97727584838867, test_abs_avg=36.906333923339844
production_forward2 grad[15] vs paper_forward: mean_abs=1.0500357151031494, max_abs=6.1875, mean_rel=0.16810861229896545, max_rel=1630.828125, norm_rel=0.023224925622344017, ref_abs_avg=45.370079040527344, test_abs_avg=45.37031173706055
production_forward2 grad[16] vs paper_forward: mean_abs=1.0165882110595703, max_abs=7.0, mean_rel=0.13960498571395874, max_rel=891.9424438476562, norm_rel=0.022857949137687683, ref_abs_avg=44.694664001464844, test_abs_avg=44.69491958618164
production_forward2 grad[17] vs paper_forward: mean_abs=0.7875666618347168, max_abs=2.75, mean_rel=0.2370598316192627, max_rel=41.26922607421875, norm_rel=0.023521138355135918, ref_abs_avg=33.04541015625, test_abs_avg=33.180442810058594
production_forward2 grad[18] vs paper_forward: mean_abs=0.985622763633728, max_abs=6.0, mean_rel=0.15045097470283508, max_rel=910.0482788085938, norm_rel=0.023047376424074173, ref_abs_avg=42.98725891113281, test_abs_avg=42.987205505371094
production_forward2 grad[19] vs paper_forward: mean_abs=0.9634947776794434, max_abs=6.0, mean_rel=0.1414477527141571, max_rel=670.1197509765625, norm_rel=0.02288658358156681, ref_abs_avg=42.357398986816406, test_abs_avg=42.35887908935547
production_forward2 grad[20] vs paper_forward: mean_abs=0.7465229034423828, max_abs=3.0, mean_rel=0.09007473289966583, max_rel=3.6529173851013184, norm_rel=0.02399730123579502, ref_abs_avg=31.256664276123047, test_abs_avg=31.214296340942383
production_forward2 grad[21] vs paper_forward: mean_abs=0.9311317205429077, max_abs=5.75, mean_rel=0.16727131605148315, max_rel=2646.21435546875, norm_rel=0.02292165346443653, ref_abs_avg=40.847232818603516, test_abs_avg=40.848243713378906
production_forward2 grad[22] vs paper_forward: mean_abs=0.9065781235694885, max_abs=5.5, mean_rel=0.16254526376724243, max_rel=879.9954833984375, norm_rel=0.022594526410102844, ref_abs_avg=40.344058990478516, test_abs_avg=40.346458435058594
production_forward2 grad[23] vs paper_forward: mean_abs=0.7245934009552002, max_abs=3.0, mean_rel=0.22552596032619476, max_rel=65.44578552246094, norm_rel=0.02284400537610054, ref_abs_avg=32.08567810058594, test_abs_avg=32.06752014160156
production_forward2 grad[24] vs paper_forward: mean_abs=0.8845046758651733, max_abs=6.0, mean_rel=0.17450068891048431, max_rel=2268.794921875, norm_rel=0.022760001942515373, ref_abs_avg=39.062042236328125, test_abs_avg=39.06342315673828
production_forward2 grad[25] vs paper_forward: mean_abs=0.8634569048881531, max_abs=5.5, mean_rel=0.15827980637550354, max_rel=1326.762939453125, norm_rel=0.022476376965641975, ref_abs_avg=38.61917495727539, test_abs_avg=38.61845397949219
production_forward2 grad[26] vs paper_forward: mean_abs=0.8509774208068848, max_abs=4.25, mean_rel=0.107310950756073, max_rel=8.569428443908691, norm_rel=0.024585923179984093, ref_abs_avg=35.501075744628906, test_abs_avg=35.49390411376953
production_forward2 grad[27] vs paper_forward: mean_abs=1.016782283782959, max_abs=6.5, mean_rel=0.15648937225341797, max_rel=1198.3079833984375, norm_rel=0.02446535974740982, ref_abs_avg=41.74311828613281, test_abs_avg=41.74506759643555
production_forward2 grad[28] vs paper_forward: mean_abs=0.9956233501434326, max_abs=6.25, mean_rel=0.16455775499343872, max_rel=1205.2081298828125, norm_rel=0.02428598143160343, ref_abs_avg=41.173187255859375, test_abs_avg=41.179603576660156
production_forward2 grad[29] vs paper_forward: mean_abs=0.7793941497802734, max_abs=2.625, mean_rel=0.09228262305259705, max_rel=4.8620285987854, norm_rel=0.024146972224116325, ref_abs_avg=31.73857879638672, test_abs_avg=31.808340072631836
production_forward2 grad[30] vs paper_forward: mean_abs=0.9380691647529602, max_abs=6.0, mean_rel=0.16225773096084595, max_rel=841.5665283203125, norm_rel=0.024812454357743263, ref_abs_avg=37.93181610107422, test_abs_avg=37.93146514892578
production_forward2 grad[31] vs paper_forward: mean_abs=0.9217289686203003, max_abs=5.5, mean_rel=0.16758303344249725, max_rel=1905.722900390625, norm_rel=0.02451625093817711, ref_abs_avg=37.716552734375, test_abs_avg=37.71440887451172
production_forward2 grad[32] vs paper_forward: mean_abs=0.7267918586730957, max_abs=2.5625, mean_rel=0.0631515234708786, max_rel=3.5530638694763184, norm_rel=0.024249514564871788, ref_abs_avg=30.817049026489258, test_abs_avg=30.924772262573242
production_forward2 grad[33] vs paper_forward: mean_abs=0.8782477378845215, max_abs=6.296875, mean_rel=0.16237449645996094, max_rel=1017.247802734375, norm_rel=0.024599231779575348, ref_abs_avg=35.81037521362305, test_abs_avg=35.81159973144531
production_forward2 grad[34] vs paper_forward: mean_abs=0.864910900592804, max_abs=5.0, mean_rel=0.16028648614883423, max_rel=913.9681396484375, norm_rel=0.024751173332333565, ref_abs_avg=35.081459045410156, test_abs_avg=35.084041595458984
production_forward2 grad[35] vs paper_forward: mean_abs=0.6625967025756836, max_abs=3.0, mean_rel=0.1254865527153015, max_rel=16.598281860351562, norm_rel=0.02339252643287182, ref_abs_avg=29.224056243896484, test_abs_avg=29.249645233154297
production_forward2 grad[36] vs paper_forward: mean_abs=0.8171240091323853, max_abs=5.5, mean_rel=0.16940714418888092, max_rel=2029.1256103515625, norm_rel=0.0243493914604187, ref_abs_avg=33.62384033203125, test_abs_avg=33.6249885559082
production_forward2 grad[37] vs paper_forward: mean_abs=0.8084268569946289, max_abs=5.0, mean_rel=0.1635316014289856, max_rel=822.7898559570312, norm_rel=0.0244526956230402, ref_abs_avg=33.20044708251953, test_abs_avg=33.19432830810547
production_forward2 grad[38] vs paper_forward: mean_abs=0.6028990745544434, max_abs=2.71875, mean_rel=0.14283107221126556, max_rel=25.5329647064209, norm_rel=0.022561321035027504, ref_abs_avg=26.875974655151367, test_abs_avg=26.859840393066406
production_forward2 grad[39] vs paper_forward: mean_abs=0.7717875838279724, max_abs=5.59375, mean_rel=0.15899139642715454, max_rel=1636.044677734375, norm_rel=0.02396128512918949, ref_abs_avg=32.269744873046875, test_abs_avg=32.269371032714844
production_forward2 grad[40] vs paper_forward: mean_abs=0.767099916934967, max_abs=5.5, mean_rel=0.149441659450531, max_rel=709.45751953125, norm_rel=0.024008439853787422, ref_abs_avg=32.033409118652344, test_abs_avg=32.03839874267578
production_forward2 grad[41] vs paper_forward: mean_abs=0.5862951278686523, max_abs=2.875, mean_rel=0.12166673690080643, max_rel=18.895984649658203, norm_rel=0.023493263870477676, ref_abs_avg=25.60778045654297, test_abs_avg=25.636747360229492
production_forward2 grad[42] vs paper_forward: mean_abs=0.735992431640625, max_abs=5.25, mean_rel=0.17037469148635864, max_rel=1655.34765625, norm_rel=0.023839881643652916, ref_abs_avg=30.938379287719727, test_abs_avg=30.941770553588867
production_forward2 grad[43] vs paper_forward: mean_abs=0.7222283482551575, max_abs=4.625, mean_rel=0.16852092742919922, max_rel=1041.034912109375, norm_rel=0.023819485679268837, ref_abs_avg=30.37830924987793, test_abs_avg=30.378385543823242
production_forward2 grad[44] vs paper_forward: mean_abs=0.5615091323852539, max_abs=2.375, mean_rel=0.11036492884159088, max_rel=5.4320149421691895, norm_rel=0.02349812351167202, ref_abs_avg=23.288761138916016, test_abs_avg=23.286052703857422
production_forward2 grad[45] vs paper_forward: mean_abs=0.6953315734863281, max_abs=4.5, mean_rel=0.1561114490032196, max_rel=1341.673828125, norm_rel=0.023592893034219742, ref_abs_avg=29.542495727539062, test_abs_avg=29.542449951171875
production_forward2 grad[46] vs paper_forward: mean_abs=0.6837981939315796, max_abs=4.5, mean_rel=0.1407090723514557, max_rel=1036.9632568359375, norm_rel=0.023428982123732567, ref_abs_avg=29.20745086669922, test_abs_avg=29.20412826538086
production_forward2 grad[47] vs paper_forward: mean_abs=0.5207879543304443, max_abs=2.0625, mean_rel=0.2771298289299011, max_rel=92.48648071289062, norm_rel=0.02221687324345112, ref_abs_avg=23.65066146850586, test_abs_avg=23.67160415649414
production_forward2 grad[48] vs paper_forward: mean_abs=0.670508623123169, max_abs=4.03125, mean_rel=0.158400297164917, max_rel=1263.8192138671875, norm_rel=0.023388464003801346, ref_abs_avg=28.688676834106445, test_abs_avg=28.69040298461914
production_forward2 grad[49] vs paper_forward: mean_abs=0.6532014608383179, max_abs=4.5, mean_rel=0.1677946150302887, max_rel=1578.3045654296875, norm_rel=0.023077664896845818, ref_abs_avg=28.343481063842773, test_abs_avg=28.342466354370117
production_forward2 grad[50] vs paper_forward: mean_abs=0.6159553527832031, max_abs=2.400390625, mean_rel=0.09345303475856781, max_rel=6.485077857971191, norm_rel=0.024250810965895653, ref_abs_avg=25.975366592407227, test_abs_avg=25.9971981048584
production_forward2 grad[51] vs paper_forward: mean_abs=0.7421501874923706, max_abs=5.25, mean_rel=0.1583646684885025, max_rel=979.176025390625, norm_rel=0.024875622242689133, ref_abs_avg=29.86263656616211, test_abs_avg=29.862346649169922
production_forward2 grad[52] vs paper_forward: mean_abs=0.7318546772003174, max_abs=5.5, mean_rel=0.1570589542388916, max_rel=740.0711059570312, norm_rel=0.025163015350699425, ref_abs_avg=29.18985939025879, test_abs_avg=29.19344711303711
production_forward2 grad[53] vs paper_forward: mean_abs=0.5564454793930054, max_abs=2.21875, mean_rel=0.3519323766231537, max_rel=83.99681091308594, norm_rel=0.024576397612690926, ref_abs_avg=22.667766571044922, test_abs_avg=22.6805419921875
production_forward2 grad[54] vs paper_forward: mean_abs=0.6836181282997131, max_abs=5.0, mean_rel=0.15552249550819397, max_rel=707.5098876953125, norm_rel=0.024819353595376015, ref_abs_avg=27.59446144104004, test_abs_avg=27.592205047607422
production_forward2 grad[55] vs paper_forward: mean_abs=0.6678851842880249, max_abs=4.5, mean_rel=0.16274210810661316, max_rel=633.2855834960938, norm_rel=0.024574585258960724, ref_abs_avg=27.251012802124023, test_abs_avg=27.244892120361328
production_forward2 grad[56] vs paper_forward: mean_abs=0.4928255081176758, max_abs=1.875, mean_rel=0.07627836614847183, max_rel=3.873943328857422, norm_rel=0.02112317644059658, ref_abs_avg=23.49966049194336, test_abs_avg=23.530738830566406
production_forward2 grad[57] vs paper_forward: mean_abs=0.6356067657470703, max_abs=5.0, mean_rel=0.1517331600189209, max_rel=710.9891967773438, norm_rel=0.02418680116534233, ref_abs_avg=26.337615966796875, test_abs_avg=26.33727264404297
production_forward2 grad[58] vs paper_forward: mean_abs=0.6239827871322632, max_abs=4.875, mean_rel=0.15735386312007904, max_rel=1454.010498046875, norm_rel=0.02425876073539257, ref_abs_avg=25.766258239746094, test_abs_avg=25.765121459960938
production_forward2 grad[59] vs paper_forward: mean_abs=0.48372364044189453, max_abs=1.75, mean_rel=0.1702309250831604, max_rel=35.56972122192383, norm_rel=0.0228581465780735, ref_abs_avg=21.326473236083984, test_abs_avg=21.29134750366211
production_forward2 grad[60] vs paper_forward: mean_abs=0.5944118499755859, max_abs=4.65625, mean_rel=0.15203164517879486, max_rel=802.8275146484375, norm_rel=0.023854421451687813, ref_abs_avg=24.91204833984375, test_abs_avg=24.91437530517578
production_forward2 grad[61] vs paper_forward: mean_abs=0.5892993211746216, max_abs=4.0, mean_rel=0.1610298454761505, max_rel=1181.302978515625, norm_rel=0.024013498798012733, ref_abs_avg=24.58203887939453, test_abs_avg=24.585948944091797
production_forward2 grad[62] vs paper_forward: mean_abs=0.4557873010635376, max_abs=2.125, mean_rel=0.2881014049053192, max_rel=93.68103790283203, norm_rel=0.024200472980737686, ref_abs_avg=19.484519958496094, test_abs_avg=19.491477966308594
production_forward2 grad[63] vs paper_forward: mean_abs=0.5701484680175781, max_abs=3.828125, mean_rel=0.16063648462295532, max_rel=773.68798828125, norm_rel=0.02342427894473076, ref_abs_avg=24.343616485595703, test_abs_avg=24.343231201171875
production_forward2 grad[64] vs paper_forward: mean_abs=0.5599788427352905, max_abs=5.0, mean_rel=0.1592545509338379, max_rel=1240.1435546875, norm_rel=0.023559685796499252, ref_abs_avg=23.834144592285156, test_abs_avg=23.833721160888672
production_forward2 grad[65] vs paper_forward: mean_abs=0.42492175102233887, max_abs=1.75, mean_rel=0.11975084990262985, max_rel=13.811675071716309, norm_rel=0.02208539843559265, ref_abs_avg=19.25201416015625, test_abs_avg=19.282991409301758
production_forward2 grad[66] vs paper_forward: mean_abs=0.5423471927642822, max_abs=4.0, mean_rel=0.1508217453956604, max_rel=1064.3795166015625, norm_rel=0.02329743281006813, ref_abs_avg=23.3234920501709, test_abs_avg=23.325809478759766
production_forward2 grad[67] vs paper_forward: mean_abs=0.5253357887268066, max_abs=3.5, mean_rel=0.13840070366859436, max_rel=720.2380981445312, norm_rel=0.02280128002166748, ref_abs_avg=23.01348114013672, test_abs_avg=23.01325225830078
production_forward2 grad[68] vs paper_forward: mean_abs=0.45285606384277344, max_abs=1.8125, mean_rel=0.07897348701953888, max_rel=4.575402736663818, norm_rel=0.0238818172365427, ref_abs_avg=18.7249813079834, test_abs_avg=18.74066734313965
production_forward2 grad[69] vs paper_forward: mean_abs=0.516310453414917, max_abs=3.625, mean_rel=0.14253026247024536, max_rel=1212.405517578125, norm_rel=0.022866718471050262, ref_abs_avg=22.579694747924805, test_abs_avg=22.581790924072266
production_forward2 grad[70] vs paper_forward: mean_abs=0.508080005645752, max_abs=3.5, mean_rel=0.14416202902793884, max_rel=842.3411254882812, norm_rel=0.023009562864899635, ref_abs_avg=22.11968994140625, test_abs_avg=22.120647430419922
production_forward2 grad[71] vs paper_forward: mean_abs=0.4152202606201172, max_abs=1.5, mean_rel=0.07870376110076904, max_rel=6.822790622711182, norm_rel=0.022827932611107826, ref_abs_avg=17.850954055786133, test_abs_avg=17.831871032714844
production_forward2 grad[72] vs paper_forward: mean_abs=0.49628427624702454, max_abs=4.0, mean_rel=0.1485747992992401, max_rel=973.2631225585938, norm_rel=0.022409822791814804, ref_abs_avg=22.112873077392578, test_abs_avg=22.112951278686523
production_forward2 grad[73] vs paper_forward: mean_abs=0.4857838749885559, max_abs=3.5, mean_rel=0.13886411488056183, max_rel=520.5053100585938, norm_rel=0.022475143894553185, ref_abs_avg=21.63409996032715, test_abs_avg=21.640417098999023
production_forward2 grad[74] vs paper_forward: mean_abs=0.44960105419158936, max_abs=1.875, mean_rel=0.12966099381446838, max_rel=11.027770042419434, norm_rel=0.023172840476036072, ref_abs_avg=18.988122940063477, test_abs_avg=19.020484924316406
production_forward2 grad[75] vs paper_forward: mean_abs=0.5534656047821045, max_abs=5.0, mean_rel=0.15906625986099243, max_rel=1220.997314453125, norm_rel=0.023722080513834953, ref_abs_avg=23.32159423828125, test_abs_avg=23.32473373413086
production_forward2 grad[76] vs paper_forward: mean_abs=0.5325906872749329, max_abs=4.0, mean_rel=0.14086000621318817, max_rel=1012.0581665039062, norm_rel=0.02355210855603218, ref_abs_avg=22.630910873413086, test_abs_avg=22.632728576660156
production_forward2 grad[77] vs paper_forward: mean_abs=0.39708203077316284, max_abs=1.625, mean_rel=0.32806068658828735, max_rel=101.42994689941406, norm_rel=0.0220657866448164, ref_abs_avg=18.42852783203125, test_abs_avg=18.442073822021484
production_forward2 grad[78] vs paper_forward: mean_abs=0.504716157913208, max_abs=4.0, mean_rel=0.14781391620635986, max_rel=928.3676147460938, norm_rel=0.023019857704639435, ref_abs_avg=21.89897918701172, test_abs_avg=21.90283966064453
production_forward2 grad[79] vs paper_forward: mean_abs=0.4939812421798706, max_abs=4.0, mean_rel=0.16443997621536255, max_rel=683.7554931640625, norm_rel=0.023080192506313324, ref_abs_avg=21.47727394104004, test_abs_avg=21.484149932861328
production_forward2 grad[80] vs paper_forward: mean_abs=0.38066112995147705, max_abs=1.4375, mean_rel=0.17649638652801514, max_rel=39.10722351074219, norm_rel=0.021330807358026505, ref_abs_avg=17.740741729736328, test_abs_avg=17.715076446533203
production_forward2 grad[81] vs paper_forward: mean_abs=0.4788781404495239, max_abs=4.5, mean_rel=0.1451491266489029, max_rel=934.264404296875, norm_rel=0.022747427225112915, ref_abs_avg=21.058176040649414, test_abs_avg=21.058807373046875
production_forward2 grad[82] vs paper_forward: mean_abs=0.46173322200775146, max_abs=4.0, mean_rel=0.1397753804922104, max_rel=570.4446411132812, norm_rel=0.022553304210305214, ref_abs_avg=20.555500030517578, test_abs_avg=20.556716918945312
production_forward2 grad[83] vs paper_forward: mean_abs=0.3552091419696808, max_abs=1.5625, mean_rel=0.2541903555393219, max_rel=96.14180755615234, norm_rel=0.02131609246134758, ref_abs_avg=16.49883460998535, test_abs_avg=16.485153198242188
production_forward2 grad[84] vs paper_forward: mean_abs=0.43618643283843994, max_abs=4.75, mean_rel=0.14634498953819275, max_rel=944.4386596679688, norm_rel=0.022008348256349564, ref_abs_avg=19.848026275634766, test_abs_avg=19.84903335571289
production_forward2 grad[85] vs paper_forward: mean_abs=0.43096110224723816, max_abs=4.125, mean_rel=0.1437268853187561, max_rel=915.3633422851562, norm_rel=0.022288357838988304, ref_abs_avg=19.48004913330078, test_abs_avg=19.48518180847168
production_forward2 grad[86] vs paper_forward: mean_abs=0.33694988489151, max_abs=1.5, mean_rel=0.41930824518203735, max_rel=188.0444793701172, norm_rel=0.020881399512290955, ref_abs_avg=16.435558319091797, test_abs_avg=16.4277286529541
production_forward2 grad[87] vs paper_forward: mean_abs=0.4103529453277588, max_abs=4.0, mean_rel=0.1452469825744629, max_rel=555.319091796875, norm_rel=0.02170548588037491, ref_abs_avg=18.988901138305664, test_abs_avg=18.989395141601562
production_forward2 grad[88] vs paper_forward: mean_abs=0.40290337800979614, max_abs=4.0, mean_rel=0.13198131322860718, max_rel=628.3662719726562, norm_rel=0.0214716587215662, ref_abs_avg=18.894126892089844, test_abs_avg=18.899494171142578
production_forward2 grad[89] vs paper_forward: mean_abs=0.31325316429138184, max_abs=1.21875, mean_rel=0.12056121230125427, max_rel=30.66512107849121, norm_rel=0.02072966657578945, ref_abs_avg=15.507356643676758, test_abs_avg=15.510200500488281
production_forward2 grad[90] vs paper_forward: mean_abs=0.3901630640029907, max_abs=4.0, mean_rel=0.13324546813964844, max_rel=752.6569213867188, norm_rel=0.02107473835349083, ref_abs_avg=18.622215270996094, test_abs_avg=18.622657775878906
production_forward2 grad[91] vs paper_forward: mean_abs=0.38611286878585815, max_abs=4.0, mean_rel=0.1305273473262787, max_rel=735.1561279296875, norm_rel=0.021116195246577263, ref_abs_avg=18.503368377685547, test_abs_avg=18.50482940673828
production_forward2 grad[92] vs paper_forward: mean_abs=0.3061861991882324, max_abs=1.125, mean_rel=0.061359815299510956, max_rel=4.936147689819336, norm_rel=0.020557856187224388, ref_abs_avg=15.035136222839355, test_abs_avg=15.033342361450195
production_forward2 grad[93] vs paper_forward: mean_abs=0.3746916651725769, max_abs=4.0, mean_rel=0.12604941427707672, max_rel=972.0453491210938, norm_rel=0.020811326801776886, ref_abs_avg=18.205066680908203, test_abs_avg=18.205345153808594
production_forward2 grad[94] vs paper_forward: mean_abs=0.3571212887763977, max_abs=3.625, mean_rel=0.12367358803749084, max_rel=603.9827880859375, norm_rel=0.020999470725655556, ref_abs_avg=17.159446716308594, test_abs_avg=17.15211296081543
production_forward2 grad[95] vs paper_forward: mean_abs=0.27363890409469604, max_abs=1.09375, mean_rel=0.10551035404205322, max_rel=15.799345970153809, norm_rel=0.019295472651720047, ref_abs_avg=14.141448020935059, test_abs_avg=14.144002914428711
production_forward2 grad[96] vs paper_forward: mean_abs=0.33487260341644287, max_abs=3.5, mean_rel=0.1225823163986206, max_rel=843.1222534179688, norm_rel=0.019644835963845253, ref_abs_avg=17.300182342529297, test_abs_avg=17.299522399902344
production_forward2 grad[97] vs paper_forward: mean_abs=0.3300769329071045, max_abs=3.703125, mean_rel=0.1192450150847435, max_rel=706.7255859375, norm_rel=0.01914205215871334, ref_abs_avg=17.50154685974121, test_abs_avg=17.509708404541016

