The Goldfish Problem

A model without memory is a goldfish; every conversation starts from zero, every user a stranger. A model without tools is a brain in a jar, capable of thought but incapable of action. Part 1 made the model fast; this part gives it memory and hands. But memory that hallucinates is worse than no memory. Tools without permissions are security holes. Most products fail here, not from lack of capability, but from lack of discipline.


Retrieval, memory, and tool systems define what the model knows and what it can do. Get these wrong and the model hallucinates or fails silently.


Retrieval Systems

RAG is not “add a vector database.” It’s a cache hierarchy with freshness policies and provenance.

Cache Hierarchy

flowchart TB
    subgraph L1["L1: Prompt Cache — 60-90% hit"]
        P[System prompts]
    end
    subgraph L2["L2: Embedding Cache — 20-40% hit"]
        E[Query embeddings]
    end
    subgraph L3["L3: Result Cache — 10-30% hit"]
        RC["(query, version) → chunks"]
    end
    subgraph L4["L4: Document Store"]
        D[Ground truth]
    end
    L1 --> L2 --> L3 --> L4
Text description of diagram

Top-to-bottom flowchart showing a 4-level RAG cache hierarchy. L1 Prompt Cache (60-90% hit rate) contains system prompts. L2 Embedding Cache (20-40% hit rate) contains query embeddings. L3 Result Cache (10-30% hit rate) maps (query, version) pairs to chunks. L4 Document Store holds ground truth. Each level flows to the next, trading freshness for speed.

Each level trades freshness for speed. Make these tradeoffs explicit.
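The fallthrough from L1 to L4 can be sketched as a tiered lookup, where each tier enforces its own max age and a stale hit falls through to the next, slower-but-fresher tier. The class and tier names here are illustrative, not from any particular library:

```python
import time

class TieredCache:
    """Sketch of the L1-L4 fallthrough: tiers are checked in order,
    and a miss or stale entry falls through to the next tier."""

    def __init__(self, tiers):
        # tiers: ordered list of (name, lookup_fn, max_age_seconds).
        # lookup_fn(key) returns (value, stored_at_timestamp) or None.
        self.tiers = tiers

    def get(self, key):
        for name, lookup, max_age in self.tiers:
            hit = lookup(key)
            if hit is not None:
                value, stored_at = hit
                if time.time() - stored_at <= max_age:
                    return value, name  # report which tier served it
        return None, "miss"  # caller falls back to the document store
```

Reporting which tier served each request is what makes the hit rates in the diagram measurable in the first place.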

Freshness SLAs

| Source | Max Staleness | Trigger |
| --- | --- | --- |
| Support tickets | 5 min | Webhook |
| Product docs | 4 hours | Git push |
| Policies | 24 hours | Manual publish |
| Historical data | 7 days | Batch job |
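Making these SLAs explicit means encoding them where the retrieval path can check them. A minimal sketch, with source names and budgets mirroring the table above (the helper name is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Max-staleness budget per source, mirroring the SLA table.
FRESHNESS_SLA = {
    "support_tickets": timedelta(minutes=5),
    "product_docs": timedelta(hours=4),
    "policies": timedelta(hours=24),
    "historical": timedelta(days=7),
}

def is_stale(source, indexed_at, now=None):
    """True when a cached entry from `source` has outlived its SLA
    and must be re-fetched from the document store."""
    now = now or datetime.now(timezone.utc)
    return now - indexed_at > FRESHNESS_SLA[source]
```

The webhook and batch triggers in the table then become the two ways an entry's `indexed_at` gets refreshed: push-based for fast sources, scheduled for slow ones.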

Hybrid Retrieval

Vector search misses keyword matches. Keyword search misses semantic similarity. Use both.

flowchart LR
    Q[Query] --> V[Vector Search]
    Q --> B[BM25 Search]
    V --> RRF[Reciprocal Rank Fusion]
    B --> RRF
    RRF --> R[Final Results]
Text description of diagram

Left-to-right flowchart showing hybrid retrieval. Query splits into two parallel paths: Vector Search (semantic similarity) and BM25 Search (keyword matching). Both results feed into Reciprocal Rank Fusion (RRF) which combines rankings, then outputs Final Results. This hybrid approach catches both semantic and exact keyword matches.

The reranker is where quality is won or lost.
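The fusion step itself is simple. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) across the ranked lists that contain it, so a document ranked well by either retriever surfaces without any score calibration between BM25 and cosine similarity:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (best first) into one.
    Each doc scores sum(1 / (k + rank)) over the lists containing it;
    k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that RRF only merges candidate lists; the reranker still decides final ordering quality downstream.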

Decoupled Retrieval (2025 Pattern)

Separate search from retrieve:

  • Search stage: Small chunks (100-256 tokens) maximize recall during initial lookup
  • Retrieve stage: Larger spans (1024+ tokens) provide sufficient context for comprehension

This mirrors how humans research: scan many sources quickly, then read deeply.
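The decoupling is usually implemented by indexing small chunks that each point back to a larger parent span, then deduplicating parents at retrieve time. A minimal sketch, with hypothetical data shapes:

```python
def search_then_retrieve(query_hits, chunks, parents):
    """query_hits: chunk ids from the search stage, best first.
    chunks: chunk_id -> (parent_id, chunk_text); parents: parent_id -> span.
    Returns deduplicated parent spans, preserving hit order."""
    seen, spans = set(), []
    for chunk_id in query_hits:
        parent_id, _ = chunks[chunk_id]
        if parent_id not in seen:  # two hits in one doc yield one span
            seen.add(parent_id)
            spans.append(parents[parent_id])
    return spans
```

Deduplication matters: without it, several small-chunk hits from the same document waste context-window budget on repeated text.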

Retrieval Design Trade-offs

| Design | When It Shines | Failure Modes |
| --- | --- | --- |
| Vector (HNSW) | Unstructured semantic search | Misses exact matches; embedding drift |
| Hybrid (BM25+Vector) | Mixed keyword + semantic | Higher latency; reranker costs |
| GraphRAG | Entity/relationship Q&A | Schema governance overhead |
| Tool Retrieval Index | Agent tool selection at scale | Tool sprawl; index staleness |

Products

| Product | Why Model-Adjacent |
| --- | --- |
| Cohere Rerank | Attention over query-document pairs |
| Voyage AI | Embedding geometry optimization |
| Jina AI | Token-level similarity (ColBERT) |

Memory Architecture

Memory has become a product category in its own right; users expect AI to remember. They also expect control.

Three Types

Episodic: What happened, when. “Last week you asked about refund policies.”

Semantic: Stable facts. “User’s company is Acme Corp.”

Procedural: How to work with this user. “When user says ‘ship it’, deploy to staging.”
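Tagging each record with its type keeps the three kinds queryable separately, which matters because they have different retrieval and expiry policies. A minimal sketch (names are illustrative):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MemoryType(Enum):
    EPISODIC = "episodic"      # what happened, when
    SEMANTIC = "semantic"      # stable facts
    PROCEDURAL = "procedural"  # how to work with this user

@dataclass
class Memory:
    type: MemoryType
    content: str
    source_turn: Optional[int] = None  # episodic memories keep provenance

def of_type(memories, mem_type):
    """Filter a memory list down to one type, e.g. only stable facts."""
    return [m for m in memories if m.type is mem_type]
```

Keeping `source_turn` on episodic records is what later lets a user ask "why do you think that?" and get a real answer.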

Compaction

Raw logs grow unbounded. Convert to structured facts periodically.

Before: 200 turns, 50KB. After: facts + preferences + recent context, 2KB.
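The shape of a compaction pass: old turns are summarized into facts (in practice via an LLM call, abstracted here as any callable), while the most recent turns stay verbatim. Names and the keep-recent cutoff are illustrative:

```python
def compact(turns, extract_facts, keep_recent=10):
    """Replace old raw turns with extracted facts, keeping only the
    most recent turns verbatim. `extract_facts` is typically an LLM
    call; here it is any callable old_turns -> list[str]."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return {"facts": extract_facts(old), "recent": recent}
```

Running this periodically (rather than per turn) amortizes the extraction cost and gives facts time to stabilize before they are written down.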

User Control (Non-Negotiable)

  • View what’s remembered
  • Correct inaccuracies
  • Delete specific memories
  • Export data

Regulation increasingly mandates this. Build it in from day one.
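The four operations map directly onto a store's public surface. A toy sketch (not any particular product's API) showing that view/correct/delete/export are just CRUD plus serialization, so there is no excuse for omitting them:

```python
import json

class MemoryStore:
    """Minimal sketch of the four non-negotiable user-control operations."""

    def __init__(self):
        self._memories = {}  # id -> text
        self._next_id = 0

    def remember(self, text):
        self._next_id += 1
        self._memories[self._next_id] = text
        return self._next_id

    def view(self):                      # view what's remembered
        return dict(self._memories)

    def correct(self, memory_id, text):  # correct inaccuracies
        self._memories[memory_id] = text

    def delete(self, memory_id):         # delete specific memories
        self._memories.pop(memory_id, None)

    def export(self):                    # export data
        return json.dumps(self._memories)
```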

Memory Governance Layer

Enterprise memory architectures now define:

  • Working memory: Immediate context for current task
  • Episodic memory: Logs of past sessions and actions
  • Semantic memory: Consolidated facts and relationships
  • Governance policies: Who owns memory, how it updates, when it must be forgotten

Products

| Product | Why Model-Adjacent |
| --- | --- |
| Zep | Temporal knowledge graphs, entity relationships |
| Mem0 | Automatic memory extraction from conversations |
| LangGraph | Checkpoint/restore for multi-step agents |

Tool Ecosystems

Once agents call real systems, stringly-typed prompt integrations break; tools graduate from convenience features to load-bearing infrastructure.

MCP: The Protocol Shift

Model Context Protocol makes tools discoverable and self-describing.

Before: Every integration is custom code. After: Tools discovered at runtime with typed schemas.
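The shift is easiest to see in code. This is not the real MCP SDK, just a toy registry showing the shape of runtime discovery: tools publish typed schemas, and a client lists and dispatches by name instead of shipping custom glue per integration:

```python
class ToolRegistry:
    """Toy MCP-style registry: tools are self-describing and
    discoverable at runtime (names and shapes are illustrative)."""

    def __init__(self):
        self._tools = {}  # name -> (schema, callable)

    def register(self, schema, fn):
        self._tools[schema["name"]] = (schema, fn)

    def discover(self):
        """What an MCP-style client sees at connect time."""
        return [schema for schema, _ in self._tools.values()]

    def call(self, name, **kwargs):
        _, fn = self._tools[name]
        return fn(**kwargs)
```

The point is the `discover()` step: the client learns the schema at connect time, so adding a tool changes no client code.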

MCP Industry Status (2026)

MCP has become the “USB-C for AI”:

  • 5,800+ servers published, 300+ clients integrated
  • Adopted by OpenAI (Agents SDK), Google (Gemini), Microsoft (VS Code, Copilot), Salesforce (Agentforce)
  • Donated to Linux Foundation’s Agentic AI Foundation (Dec 2025)

Stop building bespoke connectors. The protocol war is over.

Schema Quality

Bad:

{"name": "search", "description": "Search for stuff"}

Good:

{
  "name": "knowledge_base_search",
  "description": "Search internal docs. Use for policy questions. NOT for real-time data.",
  "parameters": {
    "query": {"type": "string", "minLength": 3},
    "doc_type": {"enum": ["policy", "product", "how-to"]}
  }
}

The model will misuse bad schemas.
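Schemas are only worth their constraints if the server enforces them before executing the call. A hand-rolled sketch of the two constraints from the schema above (a production service would use a proper JSON Schema validator instead):

```python
# Constraints mirrored from the "good" schema above.
SCHEMA = {
    "query": {"type": "string", "minLength": 3},
    "doc_type": {"enum": ["policy", "product", "how-to"]},
}

def validate_args(args):
    """Return a list of violations; empty list means the call may proceed."""
    errors = []
    q = args.get("query")
    if not isinstance(q, str) or len(q) < SCHEMA["query"]["minLength"]:
        errors.append("query must be a string of at least 3 characters")
    if args.get("doc_type") not in SCHEMA["doc_type"]["enum"]:
        errors.append("doc_type must be one of: policy, product, how-to")
    return errors
```

Returning the violations (rather than a bare failure) matters: fed back to the model, they let it repair the call on the next attempt.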

Permissions

tool: database_query
permissions:
  allowed_tables: [orders, customers]
  denied_columns: [ssn, credit_card]
  rate_limit: 100/hour

Log every call: who, what, when, and the prompt context that triggered it.
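An enforcement sketch for a policy like the one above: table allowlist, column denylist, and a sliding-hour rate limit, checked before every call. Names mirror the YAML; the in-memory call log stands in for a real audit store that would also record who, what, and the triggering prompt:

```python
import time

POLICY = {  # mirrors the YAML policy above
    "allowed_tables": {"orders", "customers"},
    "denied_columns": {"ssn", "credit_card"},
    "rate_limit_per_hour": 100,
}

_calls = []  # timestamps of recent calls (stand-in for the audit log)

def authorize(table, columns, now=None):
    """Return (allowed, reason); deny before touching the database."""
    now = now or time.time()
    _calls[:] = [t for t in _calls if now - t < 3600]  # sliding window
    if table not in POLICY["allowed_tables"]:
        return False, f"table {table!r} not allowed"
    denied = set(columns) & POLICY["denied_columns"]
    if denied:
        return False, f"denied columns: {sorted(denied)}"
    if len(_calls) >= POLICY["rate_limit_per_hour"]:
        return False, "rate limit exceeded"
    _calls.append(now)
    return True, "ok"
```

Checks run cheapest-and-most-specific first; the rate limiter only counts calls that passed the other gates, so a misbehaving agent can't exhaust its budget on denied requests.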

Products

| Product | Why Model-Adjacent |
| --- | --- |
| Anthropic MCP | Typed schemas models can reliably parse |
| OpenAI Function Calling | JSON mode requires output constraint knowledge |
| Toolhouse | Sandboxing for unpredictable model calls |


What’s Next

Context and tools give models knowledge and capability. But capability without verification is liability.

Every output is a hypothesis. Every action is a proposal. Part 3 covers the quality gates that turn proposals into safe executions.




Part of a 6-part series on building production AI systems.