In the software era, applications did wildly diverse things (auth, payments, search, analytics), so the only common abstraction was the process itself. The container became the basic unit of cloud computing. The infrastructure built to power it produced a $300B+ cloud infrastructure market.
The unit has moved down the stack
Models need their Kernels
Between a model’s math and the silicon sits a layer of code that determines whether inference costs $0.001 or $0.01 per token, whether latency is 50ms or 500ms, whether one GPU serves 10 or 100 concurrent users. When a transformer processes a prompt, it launches an attention kernel that tiles the Q, K, and V matrices into blocks that fit in fast on-chip SRAM, computes partial results, and writes the output back to slower HBM. How that kernel manages memory access patterns and instruction scheduling impacts performance much more than any model architecture decision.
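To make the tiling idea concrete, here is a minimal pure-Python sketch of blockwise attention for a single query vector, using the online-softmax trick mentioned later in this piece. It is an illustration only: real kernels do this in on-chip SRAM with CUDA or Triton, and the function names and `block_size` value are hypothetical, not any library's API.

```python
import math

def naive_attention(q, keys, values):
    """Reference version: materialize every score, then softmax and mix."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def tiled_attention(q, keys, values, block_size=2):
    """Same result, but keys/values are visited one block at a time.

    Only a running max, a running normalizer, and a running output are
    kept, so the full score vector is never stored (online softmax).
    """
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]
```

Both functions return the same vector up to floating-point rounding. The tiled version only ever holds one block of scores at a time, which is the property that lets a GPU kernel keep its working set in fast memory.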
Workload shape determines Kernel design
The MLPerf Inference benchmark suite tests four scenarios that capture distinct workload shapes: Offline (maximum throughput, all samples at once), Server (Poisson-distributed queries under a latency SLA), Single Stream (minimum latency, one query at a time), and the newer Interactive scenario with tighter latency constraints for agentic and conversational applications.
These scenarios map directly to which kernel optimizations matter:
- High-batch offline workloads (document processing, batch embeddings) are compute-bound. Kernel optimization focuses on maximizing tensor core utilization, FP8/FP4 quantization, and large-tile GEMM configurations. The NVIDIA Blackwell platform holds every per-GPU MLPerf record because its 5th-gen tensor cores and native FP4 support dominate these scenarios.
- Low-batch interactive workloads (chatbots, code assistants, voice AI) are memory-bound. Each decode step generates one token per sequence, making arithmetic intensity extremely low. Here, FlashAttention’s memory-traffic reduction, speculative decoding’s parallelized verification, and PagedAttention’s batch-size amplification matter most.
- Mixed prefill/decode workloads (real-time serving with variable prompt lengths) stress both regimes simultaneously. Disaggregated prefilling separates compute-heavy prompt processing from memory-heavy decode, routing each to differently optimized kernel configurations.
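The compute-bound versus memory-bound split above can be estimated with a simple roofline check. The sketch below is a rough model under stated assumptions: it looks at a single square weight matrix multiply in FP16, counts only weight, input, and output traffic, and uses round hardware numbers (1e15 FLOP/s, 3e12 bytes/s) that are illustrative rather than the spec of any particular GPU. The function names are hypothetical.

```python
def gemm_arithmetic_intensity(batch_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for (batch_tokens x d_model) @ (d_model x d_model)."""
    flops = 2 * batch_tokens * d_model * d_model
    # Weights are read once; activations are read once and written once.
    elements_moved = d_model * d_model + 2 * batch_tokens * d_model
    return flops / (bytes_per_elem * elements_moved)

def classify(batch_tokens, d_model, peak_flops, mem_bandwidth):
    """Compare arithmetic intensity with the hardware's ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte needed to saturate compute
    intensity = gemm_arithmetic_intensity(batch_tokens, d_model)
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Assumed, illustrative hardware numbers (not a real GPU spec).
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

decode_regime = classify(1, 4096, PEAK_FLOPS, MEM_BANDWIDTH)      # one token per step
offline_regime = classify(1024, 4096, PEAK_FLOPS, MEM_BANDWIDTH)  # large batch
```

With one token per step the intensity is about 1 FLOP per byte, far below the ridge point, so the step is limited by memory bandwidth. At a batch of 1024 tokens the same matrix multiply moves well past the ridge and becomes limited by compute.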
Kernel improvements now happen in months, not years. As AI apps move into the consumer and enterprise landscape, the workload also keeps shifting (longer contexts, multimodal inputs, reasoning chains, MoE routing) and each shift demands a brand new set of kernel specializations.
For horizontal application builders
What does innovation in the kernel look like?
FlashAttention delivered 2-4x speedups by rethinking how attention uses GPU memory. Four generations later, FlashAttention-4 reaches 1605 TFLOPs/s on Blackwell (71% hardware utilization). PagedAttention cut KV cache memory waste from 60-80% to under 4%, improving throughput 2-4x. Quantization kernels compress weights to 4-bit with 3x speedups. Speculative decoding reaches 500 tok/s on DeepSeek-V3.1.
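As a rough illustration of where the PagedAttention savings come from, the sketch below compares reserving a maximum-length contiguous KV cache per request against allocating fixed-size blocks on demand. It is a toy model, not vLLM's implementation: it only captures over-reservation and last-block slack, and the sequence lengths, `max_len`, and `block_size` are made-up values.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Fraction of KV slots unused when each request reserves max_len slots."""
    reserved = max_len * len(seq_lens)
    return 1 - sum(seq_lens) / reserved

def paged_waste(seq_lens, block_size=16):
    """Fraction unused when memory is handed out in fixed-size blocks.

    Only the tail of each request's last block can be empty.
    """
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved

# Hypothetical batch of three requests with very different lengths.
seq_lens = [100, 350, 900]
waste_contiguous = contiguous_waste(seq_lens, max_len=2048)
waste_paged = paged_waste(seq_lens, block_size=16)
```

The memory freed this way is what allows a server to fit more concurrent sequences into the same GPU, which is where the throughput gain comes from.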
Discovering kernels
Kernel discovery follows a tight loop.
The creative work here is restructuring tile sizes, memory access patterns, and instruction ordering, and fusing operations to eliminate HBM writes.
The domain has perfect verifiability: TFLOPs/s against theoretical hardware peak. That verifiability makes kernel optimization tractable for AI agents. Karpathy’s autoresearch pattern (edit code, run experiment, evaluate, keep or revert, repeat) was immediately adapted for kernels. AutoKernel takes any PyTorch model, profiles it, extracts bottleneck operations, then runs 300+ automated experiments on Triton or CUDA C++ kernels overnight. NVIDIA demonstrated a closed-loop workflow using DeepSeek-R1 with a hardware verifier to auto-generate optimized attention kernels, achieving 100% numerical correctness on Stanford’s KernelBench Level-1 problems and 96% on Level-2 in just 15 minutes of inference-time compute per problem.
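The edit, run, evaluate, keep-or-revert pattern described above can be sketched in a few lines. This is a toy under loud assumptions: `toy_benchmark` is a made-up scoring function standing in for compiling and timing a real kernel, the `tile_m` and `tile_n` knobs are hypothetical, and the mutation is random rather than proposed by a model. It is not how AutoKernel or NVIDIA's workflow is implemented.

```python
import random

def toy_benchmark(config):
    """Stand-in for 'compile and time the kernel'; higher is better.

    Hypothetical landscape that peaks at tile_m=128, tile_n=64.
    """
    return 1000 - abs(config["tile_m"] - 128) - 2 * abs(config["tile_n"] - 64)

def propose(config, rng):
    """The 'edit' step: nudge one knob up or down by 16, never below 16."""
    candidate = dict(config)
    knob = rng.choice(sorted(candidate))
    candidate[knob] = max(16, candidate[knob] + rng.choice([-16, 16]))
    return candidate

def search(start, benchmark, steps=300, seed=0):
    """Run the loop: propose, measure, keep if better, otherwise revert."""
    rng = random.Random(seed)
    best, best_score = dict(start), benchmark(start)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = benchmark(candidate)
        if score > best_score:  # keep the change
            best, best_score = candidate, score
        # otherwise the candidate is discarded, i.e. the change is reverted
    return best, best_score

start_config = {"tile_m": 32, "tile_n": 32}
best_config, best_score = search(start_config, toy_benchmark)
```

A real loop would also gate each candidate on numerical correctness against a reference implementation before accepting its speed, which is the role the hardware verifier plays in the workflow above.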
AI writing kernels for AI
AI models are now writing the kernels that make AI models run faster.
But the humans behind them still define the search space, set the objective function, and architect the verification infrastructure. The autoresearch loop accelerates kernel discovery; it hasn’t yet replaced the insight that decides which operations to fuse or which memory access pattern to rethink.
The new lever for scale
Inference engines (vLLM, TensorRT-LLM, SGLang) are kernel orchestrators: they select kernels, batch requests, schedule phases, choose precision, parallelize across GPUs. They sit in the middle of a value chain where applications consume inference, models define computation, and kernels execute on silicon.
Value concentrates at the kernel layer because the margin for error is razor-thin and the talent is still scarce. A misaligned memory access or suboptimal tile size means running 2x slower on identical hardware. Understanding the GPU memory hierarchy isn’t enough; you need the insight to rethink the algorithm itself (online softmax, pingpong scheduling, asymmetric hardware pipelining). You can’t yet vibe-code a CUDA kernel. You can’t prompt-engineer your way to 71% hardware utilization on Blackwell. Together AI, Tri Dao’s lab at Princeton, NVIDIA’s CUTLASS team, a few engineers at Fireworks and Meta: the entire industry runs on a small pool of frontier talent and arcane techniques.
Hyperscalers emerged because running applications at scale was too complex to do in-house. The models era is producing inference providers: Together AI ($12.6B), Fireworks AI ($4B), Groq, Cerebras, Baseten ($5B), DeepInfra, SambaNova. They compete on kernel quality, hardware optimization, and serving infrastructure.
| Cloud Hyperscalers | Inference Providers |
|---|---|
| Abstracted servers + networking | Abstract GPUs + kernel optimization |
| Competed on price/performance per VM | Compete on cost/latency per token |
| Built proprietary hardware (Graviton, TPU) | Build proprietary kernels, some build custom silicon |
| Economies of scale drove margins | Kernel efficiency compounds across fleet |
| Vendor lock-in via platform services | Lock-in via optimized model serving + fine-tuning |
Together AI co-authors FlashAttention and maintains the Together Kernel collection. Fireworks was built by ex-PyTorch engineers, serving 10T+ tokens/day. Groq built entirely custom silicon. Each has a kernel-level moat. The market is already stratifying: custom silicon (Groq, Cerebras) on raw speed, GPU platforms (Together AI, Fireworks, Baseten) on flexibility, API-first (DeepInfra, Replicate) on simplicity. IaaS, PaaS, and managed services, reborn for inference.
The cost model
Cost per Token = (GPU Cost per Hour) / (Tokens per Second * 3600 * Utilization)
An H100 at ~$3/hour with baseline FP16 kernels at 50% utilization costs ~$1.66/M tokens, which by the formula above implies a peak throughput of roughly 1,000 tokens/s. FlashAttention-3 + FP8 quantization doubles throughput and pushes utilization to 80%. Combined kernel optimization drops cost to roughly $0.50/M tokens on the same hardware. Groq serves Llama 3.1 8B at $0.05/M input tokens. The gap between generic and optimized inference is 2-7x.
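The arithmetic can be checked directly from the cost formula. In the sketch below, the 1,000 tokens/s baseline is an assumption back-solved from the ~$1.66/M figure rather than a measured number, and the function name is hypothetical.

```python
def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second, utilization):
    """Cost per token from the formula above, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Assumed baseline: ~$3/hour, ~1,000 tokens/s peak, 50% utilization.
baseline = cost_per_million_tokens(3.0, 1000, 0.50)

# Optimized: throughput doubled, utilization raised to 80%.
optimized = cost_per_million_tokens(3.0, 2000, 0.80)

improvement = baseline / optimized
```

Under these assumptions the baseline comes out near $1.67/M tokens and the optimized case near $0.52/M tokens, a 3.2x improvement. Getting strictly below $0.50 would take slightly more than a 2x throughput gain or utilization above 80%.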
Yesterday, I was in a conversation about how to find durable alpha when the underlying techniques and models are evolving so rapidly. My take: in fast-moving domains, you don’t build walls; you build speedier engines.
Locate the specific leverage points in your stack where small gains compound across scale, where the domain is verifiable enough to iterate fast, and where expertise is scarce enough that the advantage holds while you accelerate. Find the kernels.
References
- Dao, T. et al. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv:2205.14135
- Dao, T. (2023). “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.” arXiv:2307.08691
- Zadouri, T. et al. (2026). “FlashAttention-4: Algorithm and Kernel Pipelining Co-Design.” Together AI
- Kwon, W. et al. (2023). “Efficient Memory Management for LLM Serving with PagedAttention.” arXiv:2309.06180
- Tillet, P. et al. (2021). “Introducing Triton: Open-source GPU programming for neural networks.” OpenAI
- Frantar, E. et al. (2022). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers.” arXiv:2210.17323
- Lin, J. et al. (2023). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration.” arXiv:2306.00978
- Together AI. (2025). “Best practices to accelerate inference for large-scale production workloads.” Together AI
- Groq. (2025). “What is a Language Processing Unit?” Groq
- Groq. (2025). “Inside the LPU: Deconstructing Groq’s Speed.” Groq
- NVIDIA. (2026). “Tuning Flash Attention for Peak Performance in NVIDIA CUDA Tile.” NVIDIA
- Upadhyay, A. (2024). “The Architecture of Groq’s LPU.” Coding Confessions
- vLLM Blog. (2023). “vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention.” vLLM
- NVIDIA. (2017). “CUTLASS: Fast Linear Algebra in CUDA C++.” NVIDIA
- Together AI. (2024). “Announcing Together Inference Engine 2.0.” Together AI
- IBM. (2025). “From microservices to AI agents: The evolution of application architecture.” IBM Think
- Leviathan, Y. et al. (2023). “Fast Inference from Transformers via Speculative Decoding.” arXiv:2302.01318
- Cerebras. “Wafer-Scale Engine Overview.” EmergentMind
- d-Matrix. (2025). “Why optimizing every layer of AI workloads is now critical.” d-Matrix
- Pure Storage. (2025). “LPU vs GPU: What’s the Difference?” Pure Storage