The Goldfish Problem

A model without memory is a goldfish; every conversation starts from zero, every user a stranger. A model without tools is a brain in a jar, capable of thought but incapable of action. Part 1 made the model fast; this part gives it memory and hands. But memory that hallucinates is worse than no memory. Tools without permissions are security holes. Most products fail here, not from lack of capability, but from lack of discipline.


Retrieval, memory, and tool systems define what the model knows and what it can do. Get these wrong and the model hallucinates or fails silently.


Retrieval Systems

RAG is not “add a vector database.” It’s a cache hierarchy with freshness policies and provenance.

Cache Hierarchy

flowchart TB
    subgraph L1["L1: Prompt Cache — 60-90% hit"]
        P[System prompts]
    end
    subgraph L2["L2: Embedding Cache — 20-40% hit"]
        E[Query embeddings]
    end
    subgraph L3["L3: Result Cache — 10-30% hit"]
        RC["(query, version) → chunks"]
    end
    subgraph L4["L4: Document Store"]
        D[Ground truth]
    end
    L1 --> L2 --> L3 --> L4
Text description of diagram

Top-to-bottom flowchart showing a 4-level RAG cache hierarchy. L1 Prompt Cache (60-90% hit rate) contains system prompts. L2 Embedding Cache (20-40% hit rate) contains query embeddings. L3 Result Cache (10-30% hit rate) maps (query, version) pairs to chunks. L4 Document Store holds ground truth. Each level flows to the next, trading freshness for speed.

Each level trades freshness for speed. Make these tradeoffs explicit.
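The fallthrough from L1 to L4 can be sketched as a tiered lookup, where each tier enforces its own max age and a stale hit falls through to the next, slower-but-fresher tier. The class and tier names here are illustrative, not from any particular library:

```python
import time

class TieredCache:
    """Sketch of the L1-L4 fallthrough: tiers are checked in order,
    and a miss or stale entry falls through to the next tier."""

    def __init__(self, tiers):
        # tiers: ordered list of (name, lookup_fn, max_age_seconds).
        # lookup_fn(key) returns (value, stored_at_timestamp) or None.
        self.tiers = tiers

    def get(self, key):
        for name, lookup, max_age in self.tiers:
            hit = lookup(key)
            if hit is not None:
                value, stored_at = hit
                if time.time() - stored_at <= max_age:
                    return value, name  # report which tier served it
        return None, "miss"  # caller falls back to the document store
```

Reporting which tier served each request is what makes the hit rates in the diagram measurable in the first place.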

Freshness SLAs

| Source | Max Staleness | Trigger |
| --- | --- | --- |
| Support tickets | 5 min | Webhook |
| Product docs | 4 hours | Git push |
| Policies | 24 hours | Manual publish |
| Historical data | 7 days | Batch job |
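Making these SLAs explicit means encoding them where the retrieval path can check them. A minimal sketch, with source names and budgets mirroring the table above (the helper name is illustrative):

```python
from datetime import datetime, timedelta, timezone

# Max-staleness budget per source, mirroring the SLA table.
FRESHNESS_SLA = {
    "support_tickets": timedelta(minutes=5),
    "product_docs": timedelta(hours=4),
    "policies": timedelta(hours=24),
    "historical": timedelta(days=7),
}

def is_stale(source, indexed_at, now=None):
    """True when a cached entry from `source` has outlived its SLA
    and must be re-fetched from the document store."""
    now = now or datetime.now(timezone.utc)
    return now - indexed_at > FRESHNESS_SLA[source]
```

The webhook and batch triggers in the table then become the two ways an entry's `indexed_at` gets refreshed: push-based for fast sources, scheduled for slow ones.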

Hybrid Retrieval

Vector search misses keyword matches. Keyword search misses semantic similarity. Use both.

flowchart LR
    Q[Query] --> V[Vector Search]
    Q --> B[BM25 Search]
    V --> RRF[Reciprocal Rank Fusion]
    B --> RRF
    RRF --> R[Final Results]
Text description of diagram

Left-to-right flowchart showing hybrid retrieval. Query splits into two parallel paths: Vector Search (semantic similarity) and BM25 Search (keyword matching). Both results feed into Reciprocal Rank Fusion (RRF) which combines rankings, then outputs Final Results. This hybrid approach catches both semantic and exact keyword matches.

The reranker is where quality is won or lost.
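The fusion step itself is simple. Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) across the ranked lists that contain it, so a document ranked well by either retriever surfaces without any score calibration between BM25 and cosine similarity:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists (best first) into one.
    Each doc scores sum(1 / (k + rank)) over the lists containing it;
    k=60 is the constant from the original RRF paper."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that RRF only merges candidate lists; the reranker still decides final ordering quality downstream.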

Decoupled Retrieval (2025 Pattern)

Separate search from retrieve:

  • Search stage: Small chunks (100-256 tokens) maximize recall during initial lookup
  • Retrieve stage: Larger spans (1024+ tokens) provide sufficient context for comprehension

This mirrors how humans research: scan many sources quickly, then read deeply.
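The decoupling is usually implemented by indexing small chunks that each point back to a larger parent span, then deduplicating parents at retrieve time. A minimal sketch, with hypothetical data shapes:

```python
def search_then_retrieve(query_hits, chunks, parents):
    """query_hits: chunk ids from the search stage, best first.
    chunks: chunk_id -> (parent_id, chunk_text); parents: parent_id -> span.
    Returns deduplicated parent spans, preserving hit order."""
    seen, spans = set(), []
    for chunk_id in query_hits:
        parent_id, _ = chunks[chunk_id]
        if parent_id not in seen:  # two hits in one doc yield one span
            seen.add(parent_id)
            spans.append(parents[parent_id])
    return spans
```

Deduplication matters: without it, several small-chunk hits from the same document waste context-window budget on repeated text.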

Retrieval Design Trade-offs

| Design | When It Shines | Failure Modes |
| --- | --- | --- |
| Vector (HNSW) | Unstructured semantic search | Misses exact matches; embedding drift |
| Hybrid (BM25+Vector) | Mixed keyword + semantic | Higher latency; reranker costs |
| GraphRAG | Entity/relationship Q&A | Schema governance overhead |
| Tool Retrieval Index | Agent tool selection at scale | Tool sprawl; index staleness |

Products

| Product | Why Model-Adjacent |
| --- | --- |
| Cohere Rerank | Attention over query-document pairs |
| Voyage AI | Embedding geometry optimization |
| Jina AI | Token-level similarity (ColBERT) |

Memory Architecture

Memory has become a product category in its own right; users expect AI to remember. They also expect control.

Three Types

Episodic: What happened, when. “Last week you asked about refund policies.”

Semantic: Stable facts. “User’s company is Acme Corp.”

Procedural: How to work with this user. “When user says ‘ship it’, deploy to staging.”
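Tagging each record with its type keeps the three kinds queryable separately, which matters because they have different retrieval and expiry policies. A minimal sketch (names are illustrative):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class MemoryType(Enum):
    EPISODIC = "episodic"      # what happened, when
    SEMANTIC = "semantic"      # stable facts
    PROCEDURAL = "procedural"  # how to work with this user

@dataclass
class Memory:
    type: MemoryType
    content: str
    source_turn: Optional[int] = None  # episodic memories keep provenance

def of_type(memories, mem_type):
    """Filter a memory list down to one type, e.g. only stable facts."""
    return [m for m in memories if m.type is mem_type]
```

Keeping `source_turn` on episodic records is what later lets a user ask "why do you think that?" and get a real answer.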

Compaction

Raw logs grow unbounded. Convert to structured facts periodically.

Before: 200 turns, 50KB. After: facts + preferences + recent context, 2KB.
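The shape of a compaction pass: old turns are summarized into facts (in practice via an LLM call, abstracted here as any callable), while the most recent turns stay verbatim. Names and the keep-recent cutoff are illustrative:

```python
def compact(turns, extract_facts, keep_recent=10):
    """Replace old raw turns with extracted facts, keeping only the
    most recent turns verbatim. `extract_facts` is typically an LLM
    call; here it is any callable old_turns -> list[str]."""
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    return {"facts": extract_facts(old), "recent": recent}
```

Running this periodically (rather than per turn) amortizes the extraction cost and gives facts time to stabilize before they are written down.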

User Control (Non-Negotiable)

  • View what’s remembered
  • Correct inaccuracies
  • Delete specific memories
  • Export data

Regulation increasingly mandates this. Build it in from day one.
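The four operations map directly onto a store's public surface. A toy sketch (not any particular product's API) showing that view/correct/delete/export are just CRUD plus serialization, so there is no excuse for omitting them:

```python
import json

class MemoryStore:
    """Minimal sketch of the four non-negotiable user-control operations."""

    def __init__(self):
        self._memories = {}  # id -> text
        self._next_id = 0

    def remember(self, text):
        self._next_id += 1
        self._memories[self._next_id] = text
        return self._next_id

    def view(self):                      # view what's remembered
        return dict(self._memories)

    def correct(self, memory_id, text):  # correct inaccuracies
        self._memories[memory_id] = text

    def delete(self, memory_id):         # delete specific memories
        self._memories.pop(memory_id, None)

    def export(self):                    # export data
        return json.dumps(self._memories)
```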

Memory Governance Layer

Enterprise memory architectures now define:

  • Working memory: Immediate context for current task
  • Episodic memory: Logs of past sessions and actions
  • Semantic memory: Consolidated facts and relationships
  • Governance policies: Who owns memory, how it updates, when it must be forgotten

Products

| Product | Why Model-Adjacent |
| --- | --- |
| Zep | Temporal knowledge graphs, entity relationships |
| Mem0 | Automatic memory extraction from conversations |
| LangGraph | Checkpoint/restore for multi-step agents |

Tool Ecosystems

Once agents call real systems, stringly-typed prompt integrations break; tools graduate from convenience features to load-bearing infrastructure.

MCP: The Protocol Shift

Model Context Protocol makes tools discoverable and self-describing.

Before: Every integration is custom code. After: Tools discovered at runtime with typed schemas.
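The shift is easiest to see in code. This is not the real MCP SDK, just a toy registry showing the shape of runtime discovery: tools publish typed schemas, and a client lists and dispatches by name instead of shipping custom glue per integration:

```python
class ToolRegistry:
    """Toy MCP-style registry: tools are self-describing and
    discoverable at runtime (names and shapes are illustrative)."""

    def __init__(self):
        self._tools = {}  # name -> (schema, callable)

    def register(self, schema, fn):
        self._tools[schema["name"]] = (schema, fn)

    def discover(self):
        """What an MCP-style client sees at connect time."""
        return [schema for schema, _ in self._tools.values()]

    def call(self, name, **kwargs):
        _, fn = self._tools[name]
        return fn(**kwargs)
```

The point is the `discover()` step: the client learns the schema at connect time, so adding a tool changes no client code.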

MCP Industry Status (2026)

MCP has become the “USB-C for AI”:

  • 5,800+ servers published, 300+ clients integrated
  • Adopted by OpenAI (Agents SDK), Google (Gemini), Microsoft (VS Code, Copilot), Salesforce (Agentforce)
  • Donated to Linux Foundation’s Agentic AI Foundation (Dec 2025)

Stop building bespoke connectors. The protocol war is over.

Schema Quality

Bad:

{"name": "search", "description": "Search for stuff"}

Good:

{
  "name": "knowledge_base_search",
  "description": "Search internal docs. Use for policy questions. NOT for real-time data.",
  "parameters": {
    "query": {"type": "string", "minLength": 3},
    "doc_type": {"enum": ["policy", "product", "how-to"]}
  }
}

The model will misuse bad schemas.
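Schemas are only worth their constraints if the server enforces them before executing the call. A hand-rolled sketch of the two constraints from the schema above (a production service would use a proper JSON Schema validator instead):

```python
# Constraints mirrored from the "good" schema above.
SCHEMA = {
    "query": {"type": "string", "minLength": 3},
    "doc_type": {"enum": ["policy", "product", "how-to"]},
}

def validate_args(args):
    """Return a list of violations; empty list means the call may proceed."""
    errors = []
    q = args.get("query")
    if not isinstance(q, str) or len(q) < SCHEMA["query"]["minLength"]:
        errors.append("query must be a string of at least 3 characters")
    if args.get("doc_type") not in SCHEMA["doc_type"]["enum"]:
        errors.append("doc_type must be one of: policy, product, how-to")
    return errors
```

Returning the violations (rather than a bare failure) matters: fed back to the model, they let it repair the call on the next attempt.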

Permissions

tool: database_query
permissions:
  allowed_tables: [orders, customers]
  denied_columns: [ssn, credit_card]
  rate_limit: 100/hour

Log every call: who, what, when, and the prompt context that triggered it.
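An enforcement sketch for a policy like the one above: table allowlist, column denylist, and a sliding-hour rate limit, checked before every call. Names mirror the YAML; the in-memory call log stands in for a real audit store that would also record who, what, and the triggering prompt:

```python
import time

POLICY = {  # mirrors the YAML policy above
    "allowed_tables": {"orders", "customers"},
    "denied_columns": {"ssn", "credit_card"},
    "rate_limit_per_hour": 100,
}

_calls = []  # timestamps of recent calls (stand-in for the audit log)

def authorize(table, columns, now=None):
    """Return (allowed, reason); deny before touching the database."""
    now = now or time.time()
    _calls[:] = [t for t in _calls if now - t < 3600]  # sliding window
    if table not in POLICY["allowed_tables"]:
        return False, f"table {table!r} not allowed"
    denied = set(columns) & POLICY["denied_columns"]
    if denied:
        return False, f"denied columns: {sorted(denied)}"
    if len(_calls) >= POLICY["rate_limit_per_hour"]:
        return False, "rate limit exceeded"
    _calls.append(now)
    return True, "ok"
```

Checks run cheapest-and-most-specific first; the rate limiter only counts calls that passed the other gates, so a misbehaving agent can't exhaust its budget on denied requests.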

Products

| Product | Why Model-Adjacent |
| --- | --- |
| Anthropic MCP | Typed schemas models can reliably parse |
| OpenAI Function Calling | JSON mode requires output constraint knowledge |
| Toolhouse | Sandboxing for unpredictable model calls |


What’s Next

Context and tools give models knowledge and capability. But capability without verification is liability.

Every output is a hypothesis. Every action is a proposal. Part 3 covers the quality gates that turn proposals into safe executions.




Part of a 6-part series on building production AI systems.