Performance Model (theoretical peak) Write Kernel (Triton / CUDA) Profile (Nsight Compute) Gap from peak? Small Ship It (benchmark locked) Large Optimize (tiles, memory, fusion)