Old run: January 28, 2026 (Top of Main scaled GEMM before swizzle)
New run: February 19, 2026 (Top of Main scaled GEMM after swizzle)
Hardware: MI355, profiled with rocprofv3
Across all 50 shapes tested, the new `scaled_matmul` kernel delivers a geometric mean throughput improvement of 10.0% over the previous ToM kernel. The gains are not uniform and correlate strongly with problem geometry:
- Largest gains (20–28%) appear on shapes where M is not tile-aligned (e.g. M=500, 1000, 8100). The top performer is `500_16384_8192_512` at +27.9%.
- Moderate gains (9–19%) appear on medium-to-large shapes with non-aligned M dimensions (e.g. M=16300, 53200).
- Smallest gains (mostly 0–5%) appear on shapes where M is already tile-aligned (e.g. M=512, 1024, 8192, 16384, 53248), where the old kernel was already reasonably efficient.
- One near-neutral result: `16384_512_256_16` shows a marginal regression of −0.2%, effectively a wash.
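
The alignment pattern in the bullets above can be checked mechanically. The sketch below parses a shape label using the `M_N_K/2_K/32` layout from the table header and tests M against a tile size. The tile size of 256 is an assumption made for illustration, since the kernel's actual tile dimensions are not stated here, and the helper names are hypothetical.

```python
TILE_M = 256  # ASSUMPTION: illustrative tile size, not taken from the kernel source


def parse_shape(label: str) -> dict:
    """Split an 'M_N_K/2_K/32' label into integer fields and recover K."""
    m, n, k_half, k_blocks = (int(part) for part in label.split("_"))
    return {"M": m, "N": n, "K": 2 * k_half, "scale_blocks": k_blocks}


def is_tile_aligned(m: int, tile: int = TILE_M) -> bool:
    """True when M is an exact multiple of the (assumed) tile size."""
    return m % tile == 0


def is_power_of_two(x: int) -> bool:
    """True when x has exactly one bit set."""
    return x > 0 and (x & (x - 1)) == 0


# M values that appear in the benchmark shapes.
for m in (500, 512, 1000, 1024, 8100, 8192, 16300, 16384, 53200, 53248):
    print(m, "aligned" if is_tile_aligned(m) else "not aligned")
```

Under this assumed tile size, 53248 counts as aligned even though it is not a power of two (53248 = 13 × 4096), so divisibility by the tile size is the property to test rather than power-of-two-ness.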
Beyond throughput, the new kernel completely eliminates LDS bank conflicts. The old kernel accumulated between 6,144 and 2.1 billion SQ_LDS_BANK_CONFLICT events per shape, while the new kernel reports zero across every shape.
| Shape (M_N_K/2_K/32) | Old Throughput (TFLOP/s) | New Throughput (TFLOP/s) | Improvement (%) | Old LDS Bank Conflicts | New LDS Bank Conflicts |
|---|---|---|---|---|---|
| 500_512_256_16 | 14.00 | 15.46 | +10.4% | 6,144 | 0 |
| 512_512_256_16 | 22.22 | 22.83 | +2.7% | 6,144 | 0 |
| 1000_512_256_16 | 28.07 | 30.91 | +10.1% | 12,288 | 0 |
| 1024_512_256_16 | 41.94 | 46.44 | +10.7% | 12,288 | 0 |
| 8100_512_256_16 | 217.56 | 236.44 | +8.7% | 98,304 | 0 |
| 8192_512_256_16 | 301.61 | 302.44 | +0.3% | 98,304 | 0 |
| 16300_512_256_16 | 406.93 | 444.15 | +9.1% | 196,608 | 0 |
| 16384_512_256_16 | 501.75 | 500.55 | −0.2% | 196,608 | 0 |
| 53200_512_256_16 | 701.49 | 768.78 | +9.6% | 638,976 | 0 |
| 53248_512_256_16 | 817.20 | 835.85 | +2.3% | 638,976 | 0 |
| 500_1024_8192_512 | 43.69 | 53.16 | +21.7% | 393,216 | 0 |
| 512_1024_8192_512 | 77.81 | 81.34 | +4.5% | 393,216 | 0 |
| 500_16384_8192_512 | 645.64 | 825.64 | +27.9% | 6,291,456 | 0 |
| 500_53248_8192_512 | 949.38 | 1,135.41 | +19.6% | 20,447,232 | 0 |
| 512_16384_8192_512 | 1,187.05 | 1,261.35 | +6.3% | 6,291,456 | 0 |
| 1000_1024_8192_512 | 86.41 | 107.97 | +25.0% | 786,432 | 0 |
| 512_53248_8192_512 | 1,841.33 | 1,887.23 | +2.5% | 20,447,232 | 0 |
| 1024_1024_8192_512 | 160.29 | 168.82 | +5.3% | 786,432 | 0 |
| 500_16384_26624_1664 | 723.84 | 830.95 | +14.8% | 20,447,232 | 0 |
| 512_16384_26624_1664 | 1,225.43 | 1,283.31 | +4.7% | 20,447,232 | 0 |
| 8100_1024_8192_512 | 666.34 | 823.10 | +23.5% | 6,291,456 | 0 |
| 8192_1024_8192_512 | 1,161.77 | 1,205.17 | +3.7% | 6,291,456 | 0 |
| 1000_16384_8192_512 | 1,247.13 | 1,554.31 | +24.6% | 12,582,912 | 0 |
| 16300_1024_8192_512 | 1,219.84 | 1,397.22 | +14.5% | 12,582,912 | 0 |
| 16384_1024_8192_512 | 2,058.36 | 2,151.48 | +4.5% | 12,582,912 | 0 |
| 1024_16384_8192_512 | 2,284.15 | 2,391.45 | +4.7% | 12,582,912 | 0 |
| 1000_53248_8192_512 | 984.10 | 1,178.35 | +19.7% | 40,894,464 | 0 |
| 53200_1024_8192_512 | 1,031.52 | 1,186.85 | +15.1% | 40,894,464 | 0 |
| 53248_1024_8192_512 | 1,842.47 | 1,923.14 | +4.4% | 40,894,464 | 0 |
| 1024_53248_8192_512 | 1,832.04 | 1,975.62 | +7.8% | 40,894,464 | 0 |
| 1000_16384_26624_1664 | 1,366.24 | 1,577.01 | +15.4% | 40,894,464 | 0 |
| 1024_16384_26624_1664 | 2,327.50 | 2,416.00 | +3.8% | 40,894,464 | 0 |
| 8100_16384_8192_512 | 1,301.23 | 1,550.58 | +19.2% | 100,663,296 | 0 |
| 8192_16384_8192_512 | 2,400.54 | 2,516.69 | +4.8% | 100,663,296 | 0 |
| 8100_53248_8192_512 | 1,018.71 | 1,183.00 | +16.1% | 327,155,712 | 0 |
| 8192_53248_8192_512 | 1,728.77 | 1,748.14 | +1.1% | 327,155,712 | 0 |
| 16300_16384_8192_512 | 1,277.47 | 1,444.97 | +13.1% | 201,326,592 | 0 |
| 16384_16384_8192_512 | 2,135.50 | 2,340.34 | +9.6% | 201,326,592 | 0 |
| 16300_53248_8192_512 | 1,037.25 | 1,227.77 | +18.4% | 654,311,424 | 0 |
| 53200_16384_8192_512 | 1,337.54 | 1,525.86 | +14.1% | 654,311,424 | 0 |
| 53248_16384_8192_512 | 2,163.97 | 2,248.16 | +3.9% | 654,311,424 | 0 |
| 16384_53248_8192_512 | 1,722.03 | 1,828.19 | +6.2% | 654,311,424 | 0 |
| 53200_53248_8192_512 | 1,107.45 | 1,297.45 | +17.2% | 2,126,512,128 | 0 |
| 53248_53248_8192_512 | 1,855.86 | 1,910.54 | +2.9% | 2,126,512,128 | 0 |
| 8100_16384_26624_1664 | 1,295.17 | 1,466.85 | +13.3% | 327,155,712 | 0 |
| 8192_16384_26624_1664 | 2,036.87 | 2,127.11 | +4.4% | 327,155,712 | 0 |
| 16300_16384_26624_1664 | 1,361.67 | 1,526.44 | +12.1% | 654,311,424 | 0 |
| 16384_16384_26624_1664 | 2,111.21 | 2,167.88 | +2.7% | 654,311,424 | 0 |
| 53200_16384_26624_1664 | 1,397.51 | 1,583.55 | +13.3% | 2,126,512,128 | 0 |
| 53248_16384_26624_1664 | 2,251.06 | 2,296.20 | +2.0% | 2,126,512,128 | 0 |
| Geometric Mean | | | +10.0% | | |
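
For readers who want to reproduce the Improvement column and the geometric mean from their own measurements, here is a minimal sketch. It assumes the improvement is the plain ratio of new to old throughput and that the geometric mean is taken over those ratios. The function names are hypothetical, and only three rows from the table are included as sample input, so the printed mean is for that sample rather than for all 50 shapes.

```python
import math


def improvement_pct(old_tflops: float, new_tflops: float) -> float:
    """Percentage change in throughput from the old to the new kernel."""
    return (new_tflops / old_tflops - 1.0) * 100.0


def geomean_improvement_pct(pairs) -> float:
    """Geometric mean of new/old ratios, expressed as a percentage change."""
    log_sum = sum(math.log(new / old) for old, new in pairs)
    return (math.exp(log_sum / len(pairs)) - 1.0) * 100.0


# (old, new) throughput in TFLOP/s, copied from three rows of the table.
sample = [(14.00, 15.46), (22.22, 22.83), (645.64, 825.64)]

for old, new in sample:
    print(f"{improvement_pct(old, new):+.1f}%")
print(f"geometric mean over sample: {geomean_improvement_pct(sample):+.1f}%")
```

Averaging in log space keeps a doubling and a halving symmetric, which is why a geometric mean is the usual choice for summarizing speedup ratios.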