Performance Model
(theoretical peak)
Write Kernel
(Triton / CUDA)
Profile
(Nsight Compute)
Gap from
peak?
Small
Ship It
(benchmark locked)
Large
Optimize
(tiles, memory, fusion)