mamf-finder.py: Matrix Multiply Performance Finder
dtype: bf16 | device: NVIDIA H100 80GB HBM3 | GPU 0

  M     N     K   TFLOPS
1024  1024  1024   312.5
2048  2048  2048   567.8
4096  4096  4096   789.2
8192  8192  8192   891.3

Best: M=8192, N=8192, K=8192, 891.3 TFLOPS
