2026-02-19-SCALED_GEMM

MXFP4 GEMM Benchmark Comparison After Enabling XOR Swizzling

Old run: January 28, 2026 (Top of Main scaled GEMM before swizzle)

New run: February 19, 2026 (Top of Main scaled GEMM after swizzle)

Hardware: MI355, profiled with rocprofv3

Summary

Across all 50 shapes tested, the new scaled_matmul kernel delivers a geometric mean throughput improvement of 10.0% over the previous Top of Main (ToM) kernel. The gains are not uniform; they correlate strongly with problem geometry:

  • Largest gains (20–28%) appear on shapes where M is not tile-aligned (e.g. M=500, 1000, 8100). The top performer is 500_16384_8192_512 at +27.9%.
  • Moderate gains (9–19%) appear on medium-to-large shapes with non-aligned M dimensions (e.g. M=16300, 53200).
  • Smallest gains (0–5%) appear on shapes where M is already tile-aligned (e.g. M=512, 1024, 8192, 16384, 53248, all multiples of 512), where the old kernel was already reasonably efficient. A few aligned shapes still gain more, up to +10.7% for 1024_512_256_16.
  • One near-neutral result: 16384_512_256_16 shows a marginal regression of −0.2%, effectively a wash.
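The per-shape improvement and the geometric mean quoted above can be recomputed from the throughput columns. The following is a minimal sketch, assuming the improvement is the new/old throughput ratio minus one and that the geometric mean is taken over those ratios; the function names are illustrative and not part of the benchmark tooling.

```python
import math

def improvement_pct(old_tflops, new_tflops):
    """Relative throughput change of the new kernel, in percent."""
    return (new_tflops / old_tflops - 1.0) * 100.0

def geomean_improvement_pct(pairs):
    """Geometric mean of the new/old ratios, expressed as a percent gain."""
    log_sum = sum(math.log(new / old) for old, new in pairs)
    return (math.exp(log_sum / len(pairs)) - 1.0) * 100.0

# Three (old, new) TFLOP/s pairs copied from the Detailed Comparison table.
sample = [
    (645.64, 825.64),  # 500_16384_8192_512
    (301.61, 302.44),  # 8192_512_256_16
    (501.75, 500.55),  # 16384_512_256_16
]

for old, new in sample:
    print(f"{improvement_pct(old, new):+.1f}%")
print(f"geomean over this sample: {geomean_improvement_pct(sample):+.1f}%")
```

Averaging in log space keeps a single large speedup from dominating the summary, which is why a geometric mean is the usual choice for ratios like these.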

Beyond throughput, the new kernel completely eliminates LDS bank conflicts. The old kernel recorded between 6,144 and roughly 2.1 billion (2,126,512,128) SQ_LDS_BANK_CONFLICT events per shape, while the new kernel reports zero for every shape.
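To illustrate why an XOR swizzle can remove bank conflicts, here is a toy model, not the kernel's actual LDS layout. It assumes 32 banks of 4-byte words and a tile whose rows are exactly 32 words wide; both values are assumptions chosen for the illustration, and the model does not reproduce how the hardware increments SQ_LDS_BANK_CONFLICT.

```python
from collections import Counter

NUM_BANKS = 32  # assumption: toy value, not taken from MI355 documentation
ROW_WORDS = 32  # assumption: each tile row spans exactly NUM_BANKS words

def bank_linear(row, col):
    """Bank of element (row, col) in a plain row-major layout."""
    return (row * ROW_WORDS + col) % NUM_BANKS

def bank_xor(row, col):
    """Bank after XOR-ing the column index with the row index."""
    swizzled_col = col ^ (row % ROW_WORDS)
    return (row * ROW_WORDS + swizzled_col) % NUM_BANKS

def extra_accesses(bank_fn, col, num_threads=32):
    """Serialized extra accesses when thread t reads element (t, col)."""
    hits = Counter(bank_fn(t, col) for t in range(num_threads))
    return max(hits.values()) - 1

print(extra_accesses(bank_linear, col=3))  # 31: every thread hits bank 3
print(extra_accesses(bank_xor, col=3))     # 0: each thread hits its own bank
```

In the linear layout a column read maps every thread to the same bank, so the accesses serialize. XOR-ing the column with the row permutes the banks within each row, so the same column read touches 32 distinct banks.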

Detailed Comparison

| Shape (M_N_K/2_K/32) | Old Throughput (TFLOP/s) | New Throughput (TFLOP/s) | Improvement (%) | Old LDS Bank Conflicts | New LDS Bank Conflicts |
|---|---:|---:|---:|---:|---:|
| 500_512_256_16 | 14.00 | 15.46 | +10.4% | 6,144 | 0 |
| 512_512_256_16 | 22.22 | 22.83 | +2.7% | 6,144 | 0 |
| 1000_512_256_16 | 28.07 | 30.91 | +10.1% | 12,288 | 0 |
| 1024_512_256_16 | 41.94 | 46.44 | +10.7% | 12,288 | 0 |
| 8100_512_256_16 | 217.56 | 236.44 | +8.7% | 98,304 | 0 |
| 8192_512_256_16 | 301.61 | 302.44 | +0.3% | 98,304 | 0 |
| 16300_512_256_16 | 406.93 | 444.15 | +9.1% | 196,608 | 0 |
| 16384_512_256_16 | 501.75 | 500.55 | −0.2% | 196,608 | 0 |
| 53200_512_256_16 | 701.49 | 768.78 | +9.6% | 638,976 | 0 |
| 53248_512_256_16 | 817.20 | 835.85 | +2.3% | 638,976 | 0 |
| 500_1024_8192_512 | 43.69 | 53.16 | +21.7% | 393,216 | 0 |
| 512_1024_8192_512 | 77.81 | 81.34 | +4.5% | 393,216 | 0 |
| 500_16384_8192_512 | 645.64 | 825.64 | +27.9% | 6,291,456 | 0 |
| 500_53248_8192_512 | 949.38 | 1,135.41 | +19.6% | 20,447,232 | 0 |
| 512_16384_8192_512 | 1,187.05 | 1,261.35 | +6.3% | 6,291,456 | 0 |
| 1000_1024_8192_512 | 86.41 | 107.97 | +25.0% | 786,432 | 0 |
| 512_53248_8192_512 | 1,841.33 | 1,887.23 | +2.5% | 20,447,232 | 0 |
| 1024_1024_8192_512 | 160.29 | 168.82 | +5.3% | 786,432 | 0 |
| 500_16384_26624_1664 | 723.84 | 830.95 | +14.8% | 20,447,232 | 0 |
| 512_16384_26624_1664 | 1,225.43 | 1,283.31 | +4.7% | 20,447,232 | 0 |
| 8100_1024_8192_512 | 666.34 | 823.10 | +23.5% | 6,291,456 | 0 |
| 8192_1024_8192_512 | 1,161.77 | 1,205.17 | +3.7% | 6,291,456 | 0 |
| 1000_16384_8192_512 | 1,247.13 | 1,554.31 | +24.6% | 12,582,912 | 0 |
| 16300_1024_8192_512 | 1,219.84 | 1,397.22 | +14.5% | 12,582,912 | 0 |
| 16384_1024_8192_512 | 2,058.36 | 2,151.48 | +4.5% | 12,582,912 | 0 |
| 1024_16384_8192_512 | 2,284.15 | 2,391.45 | +4.7% | 12,582,912 | 0 |
| 1000_53248_8192_512 | 984.10 | 1,178.35 | +19.7% | 40,894,464 | 0 |
| 53200_1024_8192_512 | 1,031.52 | 1,186.85 | +15.1% | 40,894,464 | 0 |
| 53248_1024_8192_512 | 1,842.47 | 1,923.14 | +4.4% | 40,894,464 | 0 |
| 1024_53248_8192_512 | 1,832.04 | 1,975.62 | +7.8% | 40,894,464 | 0 |
| 1000_16384_26624_1664 | 1,366.24 | 1,577.01 | +15.4% | 40,894,464 | 0 |
| 1024_16384_26624_1664 | 2,327.50 | 2,416.00 | +3.8% | 40,894,464 | 0 |
| 8100_16384_8192_512 | 1,301.23 | 1,550.58 | +19.2% | 100,663,296 | 0 |
| 8192_16384_8192_512 | 2,400.54 | 2,516.69 | +4.8% | 100,663,296 | 0 |
| 8100_53248_8192_512 | 1,018.71 | 1,183.00 | +16.1% | 327,155,712 | 0 |
| 8192_53248_8192_512 | 1,728.77 | 1,748.14 | +1.1% | 327,155,712 | 0 |
| 16300_16384_8192_512 | 1,277.47 | 1,444.97 | +13.1% | 201,326,592 | 0 |
| 16384_16384_8192_512 | 2,135.50 | 2,340.34 | +9.6% | 201,326,592 | 0 |
| 16300_53248_8192_512 | 1,037.25 | 1,227.77 | +18.4% | 654,311,424 | 0 |
| 53200_16384_8192_512 | 1,337.54 | 1,525.86 | +14.1% | 654,311,424 | 0 |
| 53248_16384_8192_512 | 2,163.97 | 2,248.16 | +3.9% | 654,311,424 | 0 |
| 16384_53248_8192_512 | 1,722.03 | 1,828.19 | +6.2% | 654,311,424 | 0 |
| 53200_53248_8192_512 | 1,107.45 | 1,297.45 | +17.2% | 2,126,512,128 | 0 |
| 53248_53248_8192_512 | 1,855.86 | 1,910.54 | +2.9% | 2,126,512,128 | 0 |
| 8100_16384_26624_1664 | 1,295.17 | 1,466.85 | +13.3% | 327,155,712 | 0 |
| 8192_16384_26624_1664 | 2,036.87 | 2,127.11 | +4.4% | 327,155,712 | 0 |
| 16300_16384_26624_1664 | 1,361.67 | 1,526.44 | +12.1% | 654,311,424 | 0 |
| 16384_16384_26624_1664 | 2,111.21 | 2,167.88 | +2.7% | 654,311,424 | 0 |
| 53200_16384_26624_1664 | 1,397.51 | 1,583.55 | +13.3% | 2,126,512,128 | 0 |
| 53248_16384_26624_1664 | 2,251.06 | 2,296.20 | +2.0% | 2,126,512,128 | 0 |
| Geometric Mean | | | +10.0% | | |
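The shape labels pack four integers, M, N, K/2 and K/32. The last two presumably reflect the MXFP4 format (two 4-bit values per byte, one shared scale per block of 32 elements), though the report does not say so. The sketch below decodes a label and checks that both K fields agree. The helper names are hypothetical, and the 2·M·N·K FLOP count is an assumption about how the TFLOP/s figures were derived.

```python
def parse_shape(label):
    """Split an 'M_N_K/2_K/32' label into (M, N, K)."""
    m, n, k_half, k_blocks = (int(part) for part in label.split("_"))
    k = k_half * 2
    # Both encodings of K must describe the same reduction dimension.
    if k != k_blocks * 32:
        raise ValueError(f"inconsistent K fields in {label!r}")
    return m, n, k

def gemm_flops(m, n, k):
    """FLOPs of an M x N x K GEMM, counting 2 per multiply-add (assumed)."""
    return 2 * m * n * k

def implied_runtime_us(label, tflops):
    """Kernel time implied by a throughput figure, under the FLOP assumption."""
    return gemm_flops(*parse_shape(label)) / (tflops * 1e12) * 1e6

print(parse_shape("500_16384_8192_512"))     # (500, 16384, 16384)
print(parse_shape("8192_16384_26624_1664"))  # (8192, 16384, 53248)
```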