Why Speed Matters
At L1 (copilot), the human drives. They can tolerate a slow AI; it’s just making suggestions. At L3 (consultant), the AI executes and the human approves. If that approval window takes 3 seconds to render, the human disengages. By L4, slow means unsafe: once oversight lags behind execution, the human can’t intervene in time.
Latency isn’t a nice-to-have. It’s the difference between a tool that augments and one that frustrates into abandonment.
This part covers the physics: making AI fast enough that humans stay engaged and trust it enough to delegate.
Think of the foundation model as the CPU; your product is the computer you build around it.
A CPU without memory, I/O, and an OS is useless. Same for a foundation model without context management, tool orchestration, and verification. Model-adjacent infrastructure turns stochastic text generation into shippable software.
The Stack
Seven layers make up the model-adjacent stack. The lower layers (1-3) enable capability; the upper layers (5-7) gate trust. You can’t safely increase autonomy without investing in both.
```mermaid
block-beta
  columns 1
  L7["7. Alignment & Governance"]
  L6["6. Observability & Evals"]
  L5["5. Verification"]
  L4["4. Memory"]
  L3["3. Tools & Action"]
  L2["2. Retrieval & Context"]
  L1["1. Latency & Interactivity"]
  FM["Foundation Model"]
```
Text description of diagram
Vertical stack diagram showing the Foundation Model at the base with seven model-adjacent layers above it, from bottom to top: Layer 1: Latency & Interactivity, Layer 2: Retrieval & Context, Layer 3: Tools & Action, Layer 4: Memory, Layer 5: Verification, Layer 6: Observability & Evals, Layer 7: Alignment & Governance. Lower layers (1-3) enable capability; upper layers (5-7) gate trust.
Layers 1-3 determine what’s possible. Latency keeps humans in the loop. Retrieval reduces hallucination. Tool permissions create hard boundaries.
Layers 5-7 determine what’s safe. Verification gates autonomous execution. Observability enables audit trails. Governance defines the ceiling.
Latency Engineering
Classic SaaS tolerated 200ms response times. Model-adjacent products need sub-50ms perceived latency, or immediate streaming.
Fast-Path / Slow-Path
Route most requests through a fast path. Reserve expensive reasoning for the tail.
> “The correct way to think about LLMs is that you are not optimizing for a single specific model but for a family of models controlled by a single dial (the compute you wish to spend) to achieve monotonically better results. This allows you to do…”
>
> — Andrej Karpathy (@karpathy), announcing the nanochat miniseries v1, January 7, 2026
```mermaid
flowchart LR
  Q[Query] --> R{Router}
  R -->|80%| F[Fast Path]
  R -->|20%| S[Slow Path]
  F --> O[Output]
  S --> O
```
Text description of diagram
Left-to-right flowchart showing request routing. Query enters a Router which splits traffic: 80% goes to Fast Path, 20% goes to Slow Path. Both paths converge to Output. Fast path uses cache plus small model. Slow path requires retrieval plus large model plus tools.
Most requests are served from the cache plus a small model. The remaining ~20% need retrieval, the large model, and tools.
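A minimal routing sketch of this split, assuming a cache, a small model, and a slow pipeline already exist in your stack (the object names and the length heuristic here are illustrative, not a specific framework):

```python
import hashlib

# Illustrative fast-path / slow-path router. `cache`, `small_model`, and
# `slow_pipeline` are placeholders for your own serving components.
class Router:
    def __init__(self, cache, small_model, slow_pipeline):
        self.cache = cache                  # e.g. an in-process LRU or Redis
        self.small_model = small_model      # cheap, low-latency model
        self.slow_pipeline = slow_pipeline  # retrieval + large model + tools

    def handle(self, query: str, needs_tools: bool = False) -> str:
        key = hashlib.sha256(query.encode()).hexdigest()

        # Fast path 1: exact cache hit (~10ms of the budget).
        cached = self.cache.get(key)
        if cached is not None:
            return cached

        # Fast path 2: simple queries go straight to the small model.
        if not needs_tools and len(query) < 500:
            answer = self.small_model.generate(query)
            self.cache.set(key, answer)
            return answer

        # Slow path: retrieval + large model + tool calls for the ~20% tail.
        return self.slow_pipeline.run(query)
```

In practice the 80/20 split is a product of the router’s heuristics: start with cheap signals (cache hits, query length, explicit tool requests) and tune them against your actual traffic.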
Latency Budget (500ms target)
| Stage | Budget |
|---|---|
| Routing | 30ms |
| Cache lookup | 10ms |
| Retrieval | 80ms |
| Model (TTFT) | 200ms |
| Safety check | 50ms |
| Tools | 100ms |
| Buffer | 30ms |
Track p50 and p99 separately. Tail latency is where users churn.
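A sketch of enforcing the budget per stage while tracking p50 and p99 separately; the stage names mirror the table, and the print-based alert is a stand-in for real monitoring:

```python
import time
from contextlib import contextmanager
from statistics import quantiles

# Per-stage budgets from the 500ms target above (buffer excluded).
BUDGET_MS = {"routing": 30, "cache": 10, "retrieval": 80,
             "model_ttft": 200, "safety": 50, "tools": 100}

samples: dict[str, list[float]] = {stage: [] for stage in BUDGET_MS}

@contextmanager
def timed(stage: str):
    """Time one stage, record the sample, and flag budget overruns."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        samples[stage].append(elapsed_ms)
        if elapsed_ms > BUDGET_MS[stage]:
            print(f"[budget] {stage}: {elapsed_ms:.0f}ms (budget {BUDGET_MS[stage]}ms)")

def report(stage: str) -> tuple[float, float]:
    """Return (p50, p99) for a stage; needs at least two samples."""
    cuts = quantiles(samples[stage], n=100)  # 99 cut points
    return cuts[49], cuts[98]                # p50, p99
```

Wrap each stage (`with timed("retrieval"): ...`) and alert on the p99, not the mean.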
Techniques
Streaming. Show partial tokens. Users tolerate longer waits when they see progress.
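A minimal streaming sketch; `client.stream(...)` is a stand-in for whatever SDK you use, since most expose an iterator of partial text chunks:

```python
# Stream partial output to the user as it is generated.
def answer_with_streaming(client, prompt: str) -> str:
    parts = []
    for chunk in client.stream(prompt):    # yields text as it is generated
        print(chunk, end="", flush=True)   # user sees progress immediately
        parts.append(chunk)
    return "".join(parts)
```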
Speculative decoding. Draft model proposes, target model verifies batches. vLLM with Eagle 3 achieves 2.5x inference speedup and 1.8x latency reduction in memory-bound scenarios (low request rates). Benefits diminish at high throughput without workload-specific tuning; test your actual traffic pattern.
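Schematically, the draft/verify loop looks like the sketch below. This is not the vLLM or Eagle 3 implementation: `draft_model` and `target_model` are placeholder objects, and real systems verify all proposed positions in a single batched forward pass rather than one at a time.

```python
# Schematic speculative decoding step (illustrative, not a real serving API).
def speculative_step(draft_model, target_model, prefix: list[int], k: int = 4) -> list[int]:
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model.next_token(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. The target model checks the proposals (in production, one batched pass).
    accepted = []
    for tok in proposed:
        if target_model.accepts(prefix + accepted, tok):
            accepted.append(tok)  # agreement: the token is effectively free
        else:
            # 3. First disagreement: take the target's own token and stop.
            accepted.append(target_model.next_token(prefix + accepted))
            break
    return prefix + accepted
```

Because the draft model is much cheaper, every accepted run of tokens amortizes the target model’s forward pass.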
Two-pass generation. Fast draft now, refinement later. Let users interrupt if the draft suffices.
Async tools. “Let me check that…” with a spinner beats blocking.
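A sketch of the acknowledge-then-work pattern with asyncio; `send` and `call_tool` are placeholders for your chat transport and tool runner:

```python
import asyncio

# "Let me check that…" pattern: acknowledge instantly, run the tool
# concurrently, deliver the result when it lands.
async def answer_with_async_tool(send, call_tool, query: str) -> None:
    await send("Let me check that…")              # instant acknowledgement
    task = asyncio.create_task(call_tool(query))  # tool runs in the background
    try:
        result = await asyncio.wait_for(task, timeout=5.0)
        await send(result)
    except asyncio.TimeoutError:
        await send("Still working on it; I'll follow up shortly.")
```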
Products
| Product | Why Model-Adjacent |
|---|---|
| vLLM | PagedAttention requires understanding KV cache memory patterns |
| TensorRT-LLM | Kernel fusion and quantization require compute graph knowledge |
| llama.cpp | INT4/INT8 quantization without quality loss requires weight distribution knowledge |
| Fireworks AI | Draft/verify pattern requires understanding token prediction |
Token Economics
Tokens translate directly to compute, latency, and cost; manage them like CPU and memory budgets.
Prompt Structure
Prompt caching rewards stable prefixes:
- STABLE: system instructions, tool definitions, examples
- SEMI-STABLE: retrieved context, user preferences
- VARIABLE: current conversation, query
Put stable content first. Cache hit rates go from 0% to 70%+.
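A sketch of cache-friendly prompt assembly; the field names are illustrative, and the only real requirement is the ordering:

```python
# Cache-friendly prompt assembly: stable content first, variable content last.
def build_prompt(system: str, tool_defs: str, examples: str,
                 retrieved_context: str, history: list[str], query: str) -> str:
    stable = [system, tool_defs, examples]  # identical across requests -> cacheable prefix
    semi_stable = [retrieved_context]       # changes per session or user
    variable = history[-3:] + [query]       # changes every request
    return "\n\n".join(stable + semi_stable + variable)
```

Anything that changes per request must come after everything that doesn’t, or it invalidates the cached prefix.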
Cost Impact
| Structure | Cache Rate | Cost/1K requests |
|---|---|---|
| Bad (variable first) | 0% | $12.00 |
| Good (stable first) | 70% | $4.80 |
| Optimal (prefix sharing) | 85% | $2.70 |
Context Compaction
Long conversations accumulate tokens. After N turns: summarize into structured facts, drop raw history, keep last 2-3 turns.
- Before: [System] + [20 turns] = 12,000 tokens
- After: [System] + [Facts] + [3 turns] = 3,000 tokens
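A compaction sketch along these lines, where `summarize_to_facts` stands in for an LLM call that extracts structured facts from the turns being dropped:

```python
# Compact the conversation once it exceeds MAX_RAW_TURNS.
MAX_RAW_TURNS = 20
KEEP_RECENT = 3

def compact(history: list[str], facts: list[str], summarize_to_facts) -> tuple[list[str], list[str]]:
    if len(history) <= MAX_RAW_TURNS:
        return history, facts
    old, recent = history[:-KEEP_RECENT], history[-KEEP_RECENT:]
    facts = facts + summarize_to_facts(old)  # e.g. "user prefers metric units"
    return recent, facts                     # drop raw history, keep facts + last turns
```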
Token SLOs
Establish Service Level Objectives for cost and latency:
- p95 latency target per request type (e.g., <500ms for chat, <2s for analysis)
- Cost-per-request ceiling by feature (e.g., $0.01 for suggestions, $0.05 for generation)
- Cache hit rate floor (e.g., >70% for prompt cache)
Breaches trigger alerts or automated fallbacks to smaller models. Track per user/feature for attribution.
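A minimal SLO check along these lines; the thresholds mirror the examples above, and the `alert` and `fallback` hooks are placeholders for your paging and routing infrastructure:

```python
from dataclasses import dataclass

# Illustrative token/latency SLO check.
@dataclass
class TokenSLO:
    p95_latency_ms: float        # e.g. 500 for chat, 2000 for analysis
    max_cost_per_request: float  # e.g. 0.01 for suggestions, 0.05 for generation
    min_cache_hit_rate: float    # e.g. 0.70 prompt-cache floor

def check_slo(slo: TokenSLO, p95_ms: float, cost: float, cache_rate: float,
              alert, fallback) -> None:
    if p95_ms > slo.p95_latency_ms or cost > slo.max_cost_per_request:
        alert(f"SLO breach: p95={p95_ms:.0f}ms, cost=${cost:.4f}")
        fallback()  # e.g. route to a smaller model
    if cache_rate < slo.min_cache_hit_rate:
        alert(f"cache hit rate {cache_rate:.0%} below {slo.min_cache_hit_rate:.0%} floor")
```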
Products
| Product | Why Model-Adjacent |
|---|---|
| Anthropic Prompt Caching | Requires understanding attention computation reuse |
| SGLang | Radix attention and prefix sharing require tree-structured attention knowledge |
| Martian / Not Diamond | Routing requires understanding model capability boundaries |
Sources
Latency & Serving
- Efficient Memory Management for LLMs with PagedAttention (vLLM)
- Fast Inference via Speculative Decoding
- TensorRT-LLM
Token Economics
- Prompt Caching (Anthropic)
- SGLang: Efficient Execution
2025-2026 Updates
- Faster Inference with vLLM & Speculative Decoding (Red Hat, 2025) — Eagle 3 benchmarks
- AI Agent Landscape 2025-2026 (Tao An) — Context compaction patterns
What’s Next
Latency and token economics are the physics. They determine what’s possible. But physics alone doesn’t give the system memory or the ability to act.
Part 2 tackles memory (and the cost of forgetting) and tools (and the cost of breaking things).
Navigation
← Part 0: The Autonomy Ladder | Series Index | Part 2: Context & Tools →
Part of a 6-part series on building production AI systems.