The Hypothesis Problem
You’ve built a fast model (Part 1) with memory and tools (Part 2). It now generates answers and takes actions at scale. An uncomfortable truth: every output is a hypothesis. The model doesn’t know it’s right; it’s predicting tokens. Sometimes those predictions are brilliant. Sometimes they’re confident hallucinations. Your job is to tell the difference before users do.
Model outputs are hypotheses. Verification, observability, and evals determine whether those hypotheses survive production.
Verifiability
“Verifiability is to AI what specifiability was to 1980s computing.” — Karpathy
A verifiable environment is resettable (new attempts are possible), efficient (attempts are cheap enough to run many of them), and rewardable (attempts can be scored automatically). Models spike in math, code, and puzzles because reinforcement learning from verifiable rewards (RLVR) applies. They struggle with open-ended creative tasks because there’s no clear reward signal.
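A toy illustration of “rewardable”, assuming a task record that carries its own expected answer; the grade_attempt helper is illustrative, not from any framework:

def grade_attempt(task: dict, attempt: str) -> float:
    # Verifiable task: the expected answer is known, so scoring is automatic
    # and attempts can be reset and retried cheaply.
    if task["kind"] == "math":
        return 1.0 if attempt.strip() == task["expected"] else 0.0
    # Open-ended task: no programmatic check exists, so no reward signal.
    raise ValueError("no automated reward for open-ended tasks")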
Your verification infrastructure doesn’t just catch errors; it enables capability. The tasks where you build strong verification become the tasks where your product improves.
Verification Pipeline
Treat outputs as proposals to be tested.
flowchart LR
O[Output] --> S{Syntax}
S -->|Pass| E{Execute}
S -->|Fail| F1[Error]
E -->|Pass| J{Judge}
E -->|Fail| F2[Error]
J -->|Pass| Done["✓ Verified"]
J -->|Escalate| H[Human Review]
Text description of diagram
Left-to-right flowchart showing the verification pipeline for model outputs. Output enters Syntax check: if fail, goes to Error; if pass, proceeds to Execute check: if fail, goes to Error; if pass, proceeds to Judge (LLM or human evaluation): if pass, marked as Verified with checkmark; if Escalate, sent to Human Review. Three gates (Syntax, Execute, Judge) filter outputs before they reach production.
Execution-Based
For code, SQL, or executable output:
import ast

def verify_code(code: str, tests: list) -> Result:
    # Gate 1: syntax — reject code that doesn't parse.
    try:
        ast.parse(code)
    except SyntaxError:
        return Result(passed=False, stage="syntax")
    # Gate 2: execute — run in an isolated sandbox with a timeout.
    sandbox = Sandbox(timeout=5)
    try:
        sandbox.exec(code)
    except Exception:
        return Result(passed=False, stage="execute")
    # Gate 3: validate — every test must pass against the sandbox state.
    for test in tests:
        if not test.passes(sandbox):
            return Result(passed=False, stage="test")
    return Result(passed=True)
LLM-as-Judge
For subjective quality, use a separate model:
Evaluate on: Accuracy, Completeness, Clarity, Safety
Score 1-5. Flag issues below 3.
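A minimal judge call, sketched against an OpenAI-style chat completion API; the model name, rubric wording, and JSON score parsing are assumptions rather than a fixed contract:

import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Evaluate the answer on Accuracy, Completeness, Clarity, Safety.
Score each 1-5 and reply as JSON: {"accuracy": n, "completeness": n, "clarity": n, "safety": n}."""

def judge(question: str, answer: str) -> dict:
    # Use a separate model as the judge so it isn't grading its own output.
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; use whichever model you trust
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
        response_format={"type": "json_object"},
    )
    scores = json.loads(response.choices[0].message.content)
    # Per the rubric above, flag any dimension scoring below 3 for review.
    scores["flagged"] = any(v < 3 for v in scores.values() if isinstance(v, (int, float)))
    return scores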
Self-Consistency
For high-stakes decisions, sample the same prompt several times at non-zero temperature. Agreement raises confidence; disagreement triggers majority voting or human review.
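A sketch of the voting step; generate stands in for one sampled call to your model, and the string normalization here is deliberately crude:

from collections import Counter
from typing import Callable

def self_consistent_answer(
    generate: Callable[[], str],  # placeholder: one temperature-sampled model call
    n_samples: int = 5,
    min_agreement: float = 0.6,
) -> tuple[str | None, bool]:
    # Sample the same prompt several times and normalize the answers.
    answers = [generate().strip().lower() for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    # If the majority is weak, escalate instead of returning a guess.
    if votes / n_samples < min_agreement:
        return None, True   # (no answer, needs human review)
    return answer, False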
Observability
What you don’t measure, you can’t improve. What you don’t trace, you can’t debug.
Trace Structure
Every request produces a trace:
{
"trace_id": "tr_abc123",
"stages": [
{"name": "routing", "duration_ms": 25},
{"name": "cache", "result": "hit"},
{"name": "generation", "tokens": {"in": 1200, "out": 340}}
],
"total_ms": 258,
"cost_usd": 0.0034
}
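One way to produce that record, sketched as a tiny recorder with a per-stage timer; field names mirror the example above, and nothing here depends on a particular tracing library:

import time
import uuid
from contextlib import contextmanager

class Trace:
    def __init__(self):
        self.data = {"trace_id": f"tr_{uuid.uuid4().hex[:6]}", "stages": []}
        self._start = time.perf_counter()

    @contextmanager
    def stage(self, name: str, **fields):
        # Time a single pipeline stage and append it to the trace.
        t0 = time.perf_counter()
        try:
            yield
        finally:
            duration = round((time.perf_counter() - t0) * 1000)
            self.data["stages"].append({"name": name, "duration_ms": duration, **fields})

    def finish(self, cost_usd: float) -> dict:
        self.data["total_ms"] = round((time.perf_counter() - self._start) * 1000)
        self.data["cost_usd"] = cost_usd
        return self.data

# Usage:
#   trace = Trace()
#   with trace.stage("routing"):
#       route(request)
#   record = trace.finish(cost_usd=0.0034)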
What to Track
Per-request: Latency by stage, token counts, cache hit/miss, tool calls, safety outcomes.
Aggregate: p50/p95/p99 latency, cache hit rates, error rates by type, cost per user/feature.
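Aggregates fall straight out of the traces; a small sketch over trace records shaped like the JSON above:

import statistics

def latency_percentiles(traces: list[dict]) -> dict:
    # p50/p95/p99 end-to-end latency from collected traces.
    latencies = sorted(t["total_ms"] for t in traces)
    q = statistics.quantiles(latencies, n=100)  # q[i] is the (i+1)th percentile
    return {"p50": q[49], "p95": q[94], "p99": q[98]}

def cache_hit_rate(traces: list[dict]) -> float:
    # Fraction of requests whose cache stage recorded a hit.
    hits = sum(
        1 for t in traces
        for s in t["stages"]
        if s["name"] == "cache" and s.get("result") == "hit"
    )
    return hits / len(traces)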
The Minimal Viable Stack
- Comprehensive Logging — Capture prompts, contexts, tool calls, responses
- Tracing — Step-level timing and token usage for chains and agents
- Evaluation — Automated scoring (LLM-as-judge or heuristics) against golden sets
- Gates — Block merges if faithfulness or latency SLOs regress
Drift Detection
Embedding drift tracks semantic shifts in model behavior that traditional metrics miss: embed a rolling sample of responses (or queries), compare the recent distribution against a baseline window, and alert when the two diverge.
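A minimal drift check, assuming you already embed a sample of responses; the centroid-cosine comparison and the 0.1 threshold are illustrative choices, not how any particular product implements it:

import numpy as np

def embedding_drift(baseline: np.ndarray, recent: np.ndarray, threshold: float = 0.1) -> bool:
    # Compare the centroid of recent response embeddings against a baseline window.
    b, r = baseline.mean(axis=0), recent.mean(axis=0)
    cosine = float(b @ r / (np.linalg.norm(b) * np.linalg.norm(r)))
    # Cosine distance above the threshold suggests the response distribution has shifted.
    return (1.0 - cosine) > threshold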
Products (2026 Landscape)
| Product | Best For | Notable Features |
|---|---|---|
| Braintrust | Live monitoring + alerts | BTQL filters; tool tracing; cost attribution; AI agents for baseline detection |
| Arize Phoenix | Drift + RAG metrics | Advanced embedding drift detection; RAG-specific observability |
| DeepEval | LLM unit tests | Pytest-style testing; synthetic data monitoring; CI/CD integration |
| Langfuse | Prompt versioning | Token-level cost tracking; open-source option |
| Helicone | Full-spectrum observability | User/model interaction journeys; caching; cost attribution |
| W&B Weave | Experiment tracking | Evaluations Playground; systematic tests against golden datasets |
Evals
Shipping without evals is shipping without tests.
Dataset Types
Golden set: Hand-curated, known-good answers. Run on every change.
Regression set: Previously failed cases. Prevents backsliding.
Adversarial set: Jailbreaks, prompt injections, edge cases.
Evals as CI
on:
  push:
    paths: ['prompts/**']
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python eval/run.py --suite golden --threshold 0.95
      - run: python eval/run.py --suite regression --threshold 1.0
      - run: python eval/run.py --suite adversarial --threshold 0.99
Prompt changes don’t merge without passing evals.
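One shape eval/run.py could take so the gate actually fails the job; the JSONL suite layout, the generate placeholder, and the exact-match scorer are assumptions:

import argparse
import json
import sys
from pathlib import Path

def generate(prompt: str) -> str:
    # Placeholder: call your model, chain, or agent here.
    raise NotImplementedError

def score(case: dict) -> float:
    # Simplest scorer: exact match against the reference answer.
    # Swap in heuristics or LLM-as-judge for fuzzier tasks.
    return 1.0 if generate(case["input"]).strip() == case["expected"].strip() else 0.0

def run_suite(suite: str, threshold: float) -> int:
    path = Path("eval") / f"{suite}.jsonl"
    cases = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    pass_rate = sum(score(c) for c in cases) / len(cases)
    print(f"{suite}: {pass_rate:.1%} passed (threshold {threshold:.0%})")
    # A non-zero exit code fails the CI job, which blocks the merge.
    return 0 if pass_rate >= threshold else 1

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--suite", required=True)
    parser.add_argument("--threshold", type=float, required=True)
    args = parser.parse_args()
    sys.exit(run_suite(args.suite, args.threshold))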
Products
| Product | Why Model-Adjacent |
|---|---|
| Braintrust | LLM-as-judge with calibration |
| Patronus AI | Test case generation from production failures |
| Galileo | Factual inconsistency detection |
Sources
Verification
- Verifiability (Karpathy)
- Chain-of-Verification (ACL 2024)
- Self-Consistency Improves CoT
Evals
2025-2026 Updates
- Production RAG in 2025: Evaluation Suites, CI/CD Quality Gates (Dextralabs) — Golden sets, automated gates
- Top 10 LLM Observability Tools 2025 (Braintrust) — Tooling comparison
- Complete Guide to LLM Observability 2026 (Portkey) — Guardrails vs evals distinction
What’s Next
Verification catches errors. But not all errors are quality problems; some are policy violations.
“Should I help with this?” is different from “Is this answer correct?”
Part 4 covers governance: the runtime policy layer that defines what the model should and shouldn’t do.
Navigation
← Part 2: Context & Tools | Series Index | Part 4: Governance →
Part of a 6-part series on building production AI systems.