Baseten

The reference deployment lives on Baseten. It's the fastest path: a single truss push gives you a managed, autoscaled GPU endpoint. The full Truss runbook (including cost and --no-cache guidance) lives at deploy/baseten/README.md; this page is the operator's checklist.

Prerequisites

  • Baseten account with a project + an API key
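  • truss ≥ 0.15.2 on your dev machine (needed for --no-cache). The commands below run it through uvx; pip install --upgrade truss also works if you then call truss directly
  • curl, jq, openssl, and python3, which the snippets on this page use
  • A clone of the repo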

1. Provision Baseten secrets

These are the named secrets the container reads from /secrets/<name>. Set them once via the Baseten dashboard (Workspace → Secrets) or via the API:

export BASETEN_API_KEY="..."

# Generate a tenant token (save it — this is what callers use)
export TOK=$(openssl rand -hex 32)  # exported so the python3 call below can read os.environ['TOK']
echo "Save this token: $TOK"

# Create the secret
python3 -c "import json,os; print(json.dumps({'name':'mantis_api_token','value':os.environ['TOK']}))" > /tmp/payload.json
curl -sS -X POST -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "Content-Type: application/json" \
  --data-binary @/tmp/payload.json \
  https://api.baseten.co/v1/secrets

# Repeat for: anthropic_api_key, proxy_url, proxy_user, proxy_pass

For multi-tenant deployments, also create mantis_tenant_keys (a JSON keys file — see Tenant keys).
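
The "repeat for" step above can be scripted. This is a minimal sketch, assuming each secret's value is already exported in an environment variable named after the secret in upper case (ANTHROPIC_API_KEY, PROXY_URL, and so on). That naming is an assumption made for illustration, and the helper names secret_payload, create_secret, and create_remaining_secrets are hypothetical. The endpoint and headers are the ones used in the snippet above.

```shell
# Build the JSON body for one secret; python3 handles quoting and escaping.
secret_payload() {
  python3 -c 'import json, sys; print(json.dumps({"name": sys.argv[1], "value": sys.argv[2]}))' "$1" "$2"
}

# POST one secret to the Baseten secrets endpoint used above.
# The body is piped on stdin, so the value never lands in a temp file.
create_secret() {
  secret_payload "$1" "$2" | curl -fsS -X POST \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @- \
    https://api.baseten.co/v1/secrets
}

# Assumption: values live in upper-cased env vars named after each secret.
create_remaining_secrets() {
  for name in anthropic_api_key proxy_url proxy_user proxy_pass; do
    var=$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "skipping $name: \$$var is not set" >&2
      continue
    fi
    create_secret "$name" "$value"
  done
}
```

With BASETEN_API_KEY and the four value variables exported, calling create_remaining_secrets posts each secret in turn and reports any variable that is missing.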

2. Push the Truss

DEPLOY_NAME="mantis-$(date -u +%Y%m%d-%H%M)"
uvx truss push deploy/baseten/holo3 --no-cache \
  --promote \
  --deployment-name "$DEPLOY_NAME" \
  --include-git-info

To deploy Fara-7B instead, point the push at deploy/baseten/fara:

uvx truss push deploy/baseten/fara --no-cache \
  --promote --deployment-name "$DEPLOY_NAME" --include-git-info

Both directories ship the same FastAPI surface (/v1/predict, /v1/cua, /v1/chat/completions, etc.) — only the in-pod inference server differs (llama.cpp + GGUF for Holo3; vLLM + bf16 weights for Fara). Fara skips the llama.cpp compile step, so the first build is ~5 min instead of ~50.

--no-cache is required on the first push after any change to src/mantis_agent/ (the package code is shipped via external_package_dirs and isn't always part of the image hash — without --no-cache Baseten can serve stale code). Subsequent pushes that only change build_commands / requirements / environment_variables can omit the flag.

The first Holo3 build does the full llama.cpp + CUDA compile (~50 min). Subsequent builds are ~5 min if you don't change build_commands. The Fara build skips llama.cpp entirely.
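
The --no-cache rule above can be checked mechanically before a push. A minimal sketch, assuming you track the git ref of the last deployed commit yourself (LAST_DEPLOYED_REF in the usage note is a hypothetical variable, and both helper names are made up for illustration). It only looks for changes under src/mantis_agent/, the one case the text above calls out.

```shell
# Reads changed paths on stdin (one per line) and succeeds if any of them
# is under src/mantis_agent/, the case that requires --no-cache.
needs_no_cache() {
  grep -q '^src/mantis_agent/'
}

# Prints "--no-cache" if package code changed between ref $1 and HEAD,
# and prints nothing otherwise.
cache_flag_since() {
  if git diff --name-only "$1" HEAD | needs_no_cache; then
    echo "--no-cache"
  fi
}
```

One way to use it is uvx truss push deploy/baseten/holo3 $(cache_flag_since "$LAST_DEPLOYED_REF") --promote --deployment-name "$DEPLOY_NAME" --include-git-info. If the ref is unknown or the diff fails, the helper prints nothing, so pass --no-cache by hand whenever you are unsure.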

3. Wait for it to go ACTIVE

# Poll until terminal
while true; do
  STATE=$(curl -sS -H "Authorization: Api-Key $BASETEN_API_KEY" \
    "https://api.baseten.co/v1/models/$MODEL_ID/deployments/$DEPLOY_ID" \
    | jq -r .status)
  echo "$(date '+%H:%M:%S')  $STATE"
  case "$STATE" in ACTIVE|BUILD_FAILED|DEPLOY_FAILED) break ;; esac
  sleep 60
done

MODEL_ID and DEPLOY_ID come from the push output (https://app.baseten.co/models/<MODEL_ID>/logs/<DEPLOY_ID>).
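
The loop above runs forever if the deployment never reaches one of the three listed states. The sketch below pulls both IDs out of the logs URL and polls with an upper bound. The function names are hypothetical, the status call is the same one used above, and the 90-poll default is an arbitrary choice.

```shell
# Split the logs URL printed by the push into MODEL_ID and DEPLOY_ID.
parse_logs_url() {
  case "$1" in
    */models/*/logs/*) ;;
    *) echo "unrecognized logs URL: $1" >&2; return 1 ;;
  esac
  rest=${1#*/models/}            # MODEL_ID/logs/DEPLOY_ID...
  MODEL_ID=${rest%%/*}
  rest=${rest#*/logs/}           # DEPLOY_ID, maybe followed by /, ? or #
  DEPLOY_ID=${rest%%[/?#]*}
}

# Fetch the current deployment status (same request as the loop above).
fetch_state() {
  curl -sS -H "Authorization: Api-Key $BASETEN_API_KEY" \
    "https://api.baseten.co/v1/models/$MODEL_ID/deployments/$DEPLOY_ID" \
    | jq -r .status
}

# Poll until a terminal state, or give up after $1 polls spaced $2 seconds apart.
# Returns 0 on ACTIVE, 1 on a failed build or deploy, 2 on giving up.
wait_for_deploy() {
  max_polls=${1:-90}
  interval=${2:-60}
  n=0
  while [ "$n" -lt "$max_polls" ]; do
    state=$(fetch_state) || state="FETCH_ERROR"   # set when jq cannot parse the reply
    echo "$(date '+%H:%M:%S')  $state"
    case "$state" in
      ACTIVE) return 0 ;;
      BUILD_FAILED|DEPLOY_FAILED) return 1 ;;
    esac
    n=$((n + 1))
    sleep "$interval"
  done
  echo "gave up after $max_polls polls" >&2
  return 2
}
```

Typical use is parse_logs_url "<URL from the push output>" followed by wait_for_deploy, so a script can branch on the return code instead of scraping the printed state.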

4. Test the live endpoint

The Baseten gateway exposes two route families against every truss-server deployment:

URL | What it serves
https://model-${MODEL_ID}.api.baseten.co/production/predict | The configured predict_endpoint (default /predict): the orchestrated run/status/resume entry point
https://model-${MODEL_ID}.api.baseten.co/production/sync/<any-path> | Pass-through to arbitrary FastAPI routes in the container, e.g. /v1/chat/completions, /v1/models, /v1/health, /v1/cua

For a canary (non-promoted) deployment, swap production for deployment/<DEPLOY_ID> in either form.
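
The production/canary swap is easy to get wrong by hand, so a small helper can build the base URL for either case. A minimal sketch; the function name endpoint_base is hypothetical, and the URL shapes are the ones described above.

```shell
# Print the base URL for a deployment.
#   $1 = MODEL_ID
#   $2 = DEPLOY_ID (optional; omit it to target the promoted deployment)
endpoint_base() {
  if [ -n "${2:-}" ]; then
    printf 'https://model-%s.api.baseten.co/deployment/%s\n' "$1" "$2"
  else
    printf 'https://model-%s.api.baseten.co/production\n' "$1"
  fi
}

# Example (not executed here): target a canary, then append /predict or
# /sync/<any-path> exactly as with the production URL.
#   ENDPOINT=$(endpoint_base "$MODEL_ID" "$DEPLOY_ID")
```

Setting ENDPOINT this way lets the same smoke commands run unchanged against either a canary or the promoted deployment.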

Quick orchestrated-mode smoke:

ENDPOINT="https://model-${MODEL_ID}.api.baseten.co/production"

curl -fsS -X POST "$ENDPOINT/predict" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $TOK" \
  -H "Content-Type: application/json" \
  -d '{
    "detached": true,
    "micro": "plans/example/extract_listings.json",
    "profile_id":  "smoke",
    "workflow_id": "smoke-test-v1",
    "max_cost": 2,
    "max_time_minutes": 20
  }'

Expected: a queued response with a run_id. Then poll with {"action":"status","run_id":"..."} until terminal, then {"action":"result","run_id":"..."} for the leads.
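
The follow-up calls reuse the same endpoint and headers with a different body. A minimal sketch with hypothetical helper names, assuming ENDPOINT, BASETEN_API_KEY, and TOK are set as in the steps above. The response schema isn't described on this page, so the sketch prints the raw response instead of parsing it.

```shell
# Build the JSON body for a follow-up action ("status" or "result").
# Assumption: run IDs contain no characters that need JSON escaping.
action_payload() {
  printf '{"action":"%s","run_id":"%s"}\n' "$1" "$2"
}

# POST a JSON body to the orchestrated endpoint with both auth headers.
predict() {
  curl -fsS -X POST "$ENDPOINT/predict" \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -H "X-Mantis-Token: $TOK" \
    -H "Content-Type: application/json" \
    -d "$1"
}
```

With RUN_ID copied from the queued response, predict "$(action_payload status "$RUN_ID")" checks progress and predict "$(action_payload result "$RUN_ID")" fetches the output once the run is terminal.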

Raw-inference smoke (remote-brain shape — model returns OpenAI-format tool calls; the caller runs its own CUA loop against its own browser):

curl -fsS -X POST "$ENDPOINT/sync/v1/chat/completions" \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -H "X-Mantis-Token: $TOK" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "model",
    "messages": [{"role": "user", "content": "Say hi."}],
    "max_tokens": 32
  }'

Auth model

Every request to a Baseten deployment needs both headers:

Header | Layer | Purpose
Authorization: Api-Key <BASETEN_API_KEY> | gateway | Authenticates the platform request
X-Mantis-Token: <tenant_token> | container | Authenticates the tenant

The gateway passes Authorization: Api-Key … through to the container; we deliberately use a custom X-Mantis-Token header for the container layer so it doesn't collide with the platform header.

Cost guardrails

Knob | Default | Where to set
MAX_COST_USD | $25 per run | Container env (MANTIS_MAX_COST_USD); clamps the request's max_cost
MAX_RUNTIME_MINUTES | 60 per run | MANTIS_MAX_RUNTIME_MINUTES
Per-tenant max_cost_per_run | tenant config | tenant keys file
Per-tenant rate_limit_per_minute | 30 | tenant keys file
Replica autoscale | min=0 max=1 | runtime.health_checks in deploy/baseten/holo3/config.yaml

If your traffic is bursty, leave min=0 and accept the ~50 min cold-start image build the first time the replica scales from zero. For low-latency production, set min=1 (always-on GPU charge).
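
To make "clamps the request's max_cost" concrete: the effective budget for a run is the smaller of what the caller asked for and MAX_COST_USD. The helper below only illustrates that arithmetic. It is not the container's code, and the name effective_max_cost is hypothetical.

```shell
# Illustrative only: the smaller of the requested max_cost and the cap.
#   $1 = max_cost from the request
#   $2 = cap (defaults to 25, the documented MAX_COST_USD default)
effective_max_cost() {
  awk -v req="$1" -v cap="${2:-25}" \
    'BEGIN { v = (req + 0 < cap + 0) ? req + 0 : cap + 0; print v }'
}
```

With the default cap, the smoke request's "max_cost": 2 stays at 2, while a request asking for 40 would be held to 25.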

Updating

Change | What you need
Tenant keys | Update the mantis_tenant_keys Baseten secret. Hot reload (5 s cache), no redeploy.
Anthropic key for a tenant | Update its anthropic_api_key_<tenant> secret. Hot reload via env.
Plan files (plans/...) | Push a new deployment. (External package dirs aren't hot-reloaded.)
src/mantis_agent/ code | truss push --no-cache.
build_commands / system deps | truss push (no --no-cache needed).

Smoke test before promoting

If you want to canary, push without --promote:

uvx truss push deploy/baseten/holo3 --no-cache \
  --deployment-name "$DEPLOY_NAME-canary" \
  --include-git-info

You'll get a non-production environment URL. Run smoke tests there, then promote via the Baseten dashboard.

Serving a challenger for the promotion gate (#911/#918)

The slow-loop gate evaluates a trained challenger against the champion (base). Unlike Modal — which boots a fresh inference server per run and picks the challenger per request — the Baseten pod boots one shared inference server at model-load, so the challenger is a deployment-level choice.

Holo3 (qwen3_5_moe) — full-model swap (#918). Holo3's LoRA adapter can't be GGUF-converted (convert_lora_to_gguf lacks the MoE arch), but the merged model can. So the Holo3 challenger is a full merged-GGUF model, not a --lora overlay — point MANTIS_HOLO3_MODEL_DIR at the merged model (mounted via the truss weights: block, alongside the base mmproj). The ready-made config (deploy/baseten/holo3_challenger/config.yaml) does this; swap its weights: source for your merged-model repo, then:

uvx truss push deploy/baseten/holo3_challenger --no-cache \
  --deployment-name "holo3-challenger-$(git rev-parse --short HEAD)" \
  --include-git-info

The trainer produces the artifact: peft merge_and_unload (adapter + base) → convert_hf_to_gguf.py → quantize Q8_0 → publish.

Other (non-MoE / vLLM) bases — --lora overlay. For bases whose adapter can be converted/served, set instead (deployment-level env):

Env | Effect
MANTIS_LORA_ADAPTER | Serve base + this adapter. A pre-converted GGUF adapter (llama.cpp bases) or a PEFT dir (vLLM bases), mounted via weights:. Unset ⇒ base (champion).
MANTIS_LORA_SCALE | llama.cpp only: adapter scale (default 1.0; emits --lora-scaled).
MANTIS_LORA_NAME | vLLM only: served-model-name for the adapter (default challenger).

The serving image has no torch/transformers, so it can't convert a raw PEFT dir for llama.cpp bases — point MANTIS_LORA_ADAPTER at a .gguf. A non-.gguf ref for a llama.cpp base fails fast at boot.
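
Because a bad adapter reference only surfaces when the pod boots, it can be worth checking the value before pushing. The sketch below mirrors the rule above as a local preflight. It is not the container's own check, and the function name and the backend labels llamacpp and vllm are hypothetical.

```shell
# Preflight for the value you intend to set as MANTIS_LORA_ADAPTER.
#   $1 = backend label ("llamacpp" or "vllm"; hypothetical names)
#   $2 = adapter reference (empty means "serve the base")
check_lora_adapter() {
  backend=$1
  adapter=${2:-}
  # Unset adapter serves the base (champion), which is always valid.
  if [ -z "$adapter" ]; then
    return 0
  fi
  case "$backend" in
    llamacpp)
      case "$adapter" in
        *.gguf) return 0 ;;
        *)
          echo "llama.cpp base needs a pre-converted .gguf adapter: $adapter" >&2
          return 1 ;;
      esac ;;
    vllm)
      return 0 ;;   # PEFT directory; its contents are not inspected here
    *)
      echo "unknown backend: $backend" >&2
      return 2 ;;
  esac
}
```

Running check_lora_adapter llamacpp "$MANTIS_LORA_ADAPTER" ahead of truss push catches the non-.gguf case without waiting for a build and a failed boot.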

Point the gate's MANTIS_CHAMPION_ENDPOINT at the holo3 deploy and MANTIS_EVAL_ENDPOINT at this one, then compare win-rate over the frozen holdout (training/eval_harness.py / mantis_trainer.gate).

See also