Baseten¶
The reference deployment lives on Baseten. It's the fastest path: one Truss push and you have a managed, autoscaled GPU endpoint. The full Truss runbook (with cost-cache --no-cache guidance) lives at deploy/baseten/README.md; this page is the operator's checklist.
Prerequisites¶
- Baseten account with a project + an API key
- uvx truss on your dev machine (pip install --upgrade truss, requires ≥ 0.15.2 for --no-cache)
- A clone of the repo
1. Provision Baseten secrets¶
These are the named secrets the container reads from /secrets/<name>. Set them once via the Baseten dashboard (Workspace → Secrets) or via the API:
export BASETEN_API_KEY="..."
# Generate a tenant token (save it — this is what callers use)
TOK=$(openssl rand -hex 32)
export TOK  # the python3 one-liner below reads it from os.environ
echo "Save this token: $TOK"
# Create the secret
python3 -c "import json,os; print(json.dumps({'name':'mantis_api_token','value':os.environ['TOK']}))" > /tmp/payload.json
curl -sS -X POST -H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "Content-Type: application/json" \
--data-binary @/tmp/payload.json \
https://api.baseten.co/v1/secrets
# Repeat for: anthropic_api_key, proxy_url, proxy_user, proxy_pass
For multi-tenant deployments, also create mantis_tenant_keys (a JSON keys file — see Tenant keys).
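The "repeat for" step can be scripted. The following is a minimal sketch, assuming the POST /v1/secrets call shown above accepts the same name/value body for every secret. `secret_payload` and `create_secret` are hypothetical helper names, not part of the repo.

```shell
# Build the JSON body for one secret. Name and value travel through the
# environment so python3 handles JSON quoting and the value never appears in argv.
secret_payload() {
  SECRET_NAME="$1" SECRET_VALUE="$2" python3 -c '
import json, os
print(json.dumps({"name": os.environ["SECRET_NAME"], "value": os.environ["SECRET_VALUE"]}))'
}

# POST one secret to the same endpoint used above. The body is piped on stdin
# (--data-binary @-), so no temp file holding the secret is left on disk.
create_secret() {
  secret_payload "$1" "$2" | curl -fsS -X POST \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -H "Content-Type: application/json" \
    --data-binary @- \
    https://api.baseten.co/v1/secrets
}
```

Usage: `create_secret proxy_user "$PROXY_USER"`, then the same for anthropic_api_key, proxy_url, and proxy_pass. The `PROXY_USER` variable is a placeholder you would set yourself.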
2. Push the Truss¶
DEPLOY_NAME="mantis-$(date -u +%Y%m%d-%H%M)"
uvx truss push deploy/baseten/holo3 --no-cache \
--promote \
--deployment-name "$DEPLOY_NAME" \
--include-git-info
To deploy Fara-7B instead, point the push at deploy/baseten/fara:
uvx truss push deploy/baseten/fara --no-cache \
--promote --deployment-name "$DEPLOY_NAME" --include-git-info
Both directories ship the same FastAPI surface (/v1/predict, /v1/cua, /v1/chat/completions, etc.) — only the in-pod inference server differs (llama.cpp + GGUF for Holo3; vLLM + bf16 weights for Fara). Fara skips the llama.cpp compile step, so the first build is ~5 min instead of ~50.
--no-cache is required on the first push after any change to src/mantis_agent/ (the package code is shipped via external_package_dirs and isn't always part of the image hash — without --no-cache Baseten can serve stale code). Subsequent pushes that only change build_commands / requirements / environment_variables can omit the flag.
The first Holo3 build does the full llama.cpp + CUDA compile (~50 min). Subsequent builds take ~5 min as long as build_commands is unchanged.
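The --no-cache rule above can be turned into a pre-push check. This is a heuristic sketch, not part of the repo: `needs_no_cache` and `cache_flag_since` are hypothetical helpers, and the git ref you compare against (the commit of your last deployment) is something you would have to track yourself.

```shell
# Reads changed file paths on stdin, one per line.
# Succeeds (exit 0) if any path is under src/mantis_agent/.
needs_no_cache() {
  grep -q '^src/mantis_agent/'
}

# Prints "--no-cache" if src/mantis_agent/ changed between the given git ref
# and HEAD, and prints nothing otherwise.
cache_flag_since() {
  if git diff --name-only "$1" HEAD | needs_no_cache; then
    echo "--no-cache"
  fi
}
```

Usage: `uvx truss push deploy/baseten/holo3 $(cache_flag_since "$LAST_DEPLOYED_REF") --promote --deployment-name "$DEPLOY_NAME" --include-git-info`, where `LAST_DEPLOYED_REF` is a tag or SHA you maintain. When in doubt, passing --no-cache is the safe choice; it only costs build time.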
3. Wait for it to go ACTIVE¶
# Poll until terminal
while true; do
  STATE=$(curl -sS -H "Authorization: Api-Key $BASETEN_API_KEY" \
    "https://api.baseten.co/v1/models/$MODEL_ID/deployments/$DEPLOY_ID" \
    | jq -r .status)
  echo "$(date '+%H:%M:%S') $STATE"
  case "$STATE" in ACTIVE|BUILD_FAILED|DEPLOY_FAILED) break ;; esac
  sleep 60
done
MODEL_ID and DEPLOY_ID come from the push output (https://app.baseten.co/models/<MODEL_ID>/logs/<DEPLOY_ID>).
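If you copy the logs URL from the push output, the two IDs can be split out mechanically. A small sketch, assuming the URL has exactly the shape shown above; `parse_deploy_url` is a hypothetical helper name.

```shell
# Extract "<MODEL_ID> <DEPLOY_ID>" from a logs URL of the form
# https://app.baseten.co/models/<MODEL_ID>/logs/<DEPLOY_ID>.
# Prints nothing and returns non-zero if the URL does not match.
parse_deploy_url() {
  ids=$(printf '%s\n' "$1" |
    sed -n 's#^https://app\.baseten\.co/models/\([^/]*\)/logs/\([^/?]*\).*$#\1 \2#p')
  [ -n "$ids" ] && printf '%s\n' "$ids"
}
```

Usage: `set -- $(parse_deploy_url "$PUSH_URL"); MODEL_ID=$1; DEPLOY_ID=$2`, where `PUSH_URL` is the URL you copied from the push output.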
4. Test the live endpoint¶
The Baseten gateway exposes two route families against every truss-server deployment:
| URL | What it serves |
|---|---|
| https://model-${MODEL_ID}.api.baseten.co/production/predict | The configured predict_endpoint (default /predict), the orchestrated run/status/resume entry point |
| https://model-${MODEL_ID}.api.baseten.co/production/sync/<any-path> | Pass-through to arbitrary FastAPI routes in the container: /v1/chat/completions, /v1/models, /v1/health, /v1/cua |
For a canary (non-promoted) deployment, swap production for deployment/<DEPLOY_ID> in either form.
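The production/canary swap is easy to get wrong by hand, so it can help to build the base URL in one place. A minimal sketch using only the URL shapes described above; `endpoint_base` is a hypothetical helper name.

```shell
# Print the base URL for a deployment.
#   endpoint_base MODEL_ID            -> production environment
#   endpoint_base MODEL_ID DEPLOY_ID  -> one specific (canary) deployment
endpoint_base() {
  if [ -n "${2:-}" ]; then
    printf 'https://model-%s.api.baseten.co/deployment/%s\n' "$1" "$2"
  else
    printf 'https://model-%s.api.baseten.co/production\n' "$1"
  fi
}
```

Usage: `ENDPOINT=$(endpoint_base "$MODEL_ID")` for production, or `ENDPOINT=$(endpoint_base "$MODEL_ID" "$DEPLOY_ID")` for a canary. Append /predict or /sync/<any-path> as in the table.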
Quick orchestrated-mode smoke:
ENDPOINT="https://model-${MODEL_ID}.api.baseten.co/production"
curl -fsS -X POST "$ENDPOINT/predict" \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "X-Mantis-Token: $TOK" \
-H "Content-Type: application/json" \
-d '{
"detached": true,
"micro": "plans/example/extract_listings.json",
"profile_id": "smoke",
"workflow_id": "smoke-test-v1",
"max_cost": 2,
"max_time_minutes": 20
}'
Expected: a queued response with a run_id. Then poll with {"action":"status","run_id":"..."} until terminal, then {"action":"result","run_id":"..."} for the leads.
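The status and result calls reuse the same route and headers as the smoke request, so they can share one helper. This is a sketch under two assumptions: the run_id contains no characters that need JSON escaping, and `action_payload` / `run_action` are hypothetical names. It deliberately does not loop, because the terminal status values are not listed on this page.

```shell
# Build the JSON body for a follow-up action ("status" or "result").
# Assumes run_id needs no JSON escaping.
action_payload() {
  printf '{"action":"%s","run_id":"%s"}\n' "$1" "$2"
}

# POST a follow-up action to the predict route and print the raw response.
# Expects ENDPOINT, BASETEN_API_KEY, and TOK to be set as in the smoke test.
run_action() {
  action_payload "$1" "$2" | curl -fsS -X POST "$ENDPOINT/predict" \
    -H "Authorization: Api-Key $BASETEN_API_KEY" \
    -H "X-Mantis-Token: $TOK" \
    -H "Content-Type: application/json" \
    --data-binary @-
}
```

Usage: `run_action status "$RUN_ID"` while the run is in progress, then `run_action result "$RUN_ID"` once it reports a terminal state.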
Raw-inference smoke (remote-brain shape — model returns OpenAI-format tool calls; the caller runs its own CUA loop against its own browser):
curl -fsS -X POST "$ENDPOINT/sync/v1/chat/completions" \
-H "Authorization: Api-Key $BASETEN_API_KEY" \
-H "X-Mantis-Token: $TOK" \
-H "Content-Type: application/json" \
-d '{
"model": "model",
"messages": [{"role": "user", "content": "Say hi."}],
"max_tokens": 32
}'
Auth model¶
Requests to a Baseten deployment need both headers:
| Header | Layer | What |
|---|---|---|
| Authorization: Api-Key <BASETEN_API_KEY> | gateway | Authenticates the platform request |
| X-Mantis-Token: <tenant_token> | container | Authenticates the tenant |
The gateway passes Authorization: Api-Key … through to the container; we deliberately use a custom X-Mantis-Token header for the container layer so it doesn't collide with the platform header.
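Because a request needs a credential for each layer, a missing variable tends to surface as a confusing 401 or 403. A small preflight sketch, assuming the variable names used in the examples on this page (BASETEN_API_KEY for the gateway, TOK for the tenant token); `check_auth_env` is a hypothetical helper name.

```shell
# Fail fast with a clear message if either credential is missing or empty.
check_auth_env() {
  missing=""
  [ -n "${BASETEN_API_KEY:-}" ] || missing="$missing BASETEN_API_KEY"
  [ -n "${TOK:-}" ] || missing="$missing TOK"
  if [ -n "$missing" ]; then
    echo "missing credentials:$missing" >&2
    return 1
  fi
  return 0
}
```

Usage: call `check_auth_env || exit 1` at the top of any script that sends requests to the endpoint.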
Cost guardrails¶
| Knob | Default | Where to set |
|---|---|---|
| MAX_COST_USD | $25 per run | Container env (MANTIS_MAX_COST_USD); clamps the request's max_cost |
| MAX_RUNTIME_MINUTES | 60 per run | MANTIS_MAX_RUNTIME_MINUTES |
| Per-tenant max_cost_per_run | tenant config | tenant keys file |
| Per-tenant rate_limit_per_minute | 30 | tenant keys file |
| Replica autoscale | min=0 max=1 | runtime.health_checks in deploy/baseten/holo3/config.yaml |
If your traffic is bursty, leave min=0 and accept the ~50 min cold-start image build the first time the replica scales from zero. For low-latency production, set min=1 (always-on GPU charge).
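To make the clamping in the table concrete: a request's max_cost is honoured only up to the container cap. The sketch below illustrates that rule for MAX_COST_USD alone. It is not the container's actual code, and `effective_max_cost` is a hypothetical name.

```shell
# Print the smaller of the requested max_cost and the operator cap.
# The cap is the second argument, else MANTIS_MAX_COST_USD, else the
# documented default of 25.
effective_max_cost() {
  awk -v req="$1" -v cap="${2:-${MANTIS_MAX_COST_USD:-25}}" \
    'BEGIN { m = (req + 0 < cap + 0) ? req : cap; print m }'
}
```

With the default cap, a request asking for `"max_cost": 40` would run with a budget of 25, while the smoke test's `"max_cost": 2` is left unchanged.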
Updating¶
| Change | What you need |
|---|---|
| Tenant keys | Update the mantis_tenant_keys Baseten secret. Hot reload (5 s cache) — no redeploy. |
| Anthropic key for a tenant | Update its anthropic_api_key_<tenant> secret. Hot reload via env. |
| Plan files (plans/...) | Push a new deployment. (External package dirs aren't hot-reloaded.) |
| src/mantis_agent/ code | truss push --no-cache. |
| build_commands / system deps | truss push (no --no-cache needed). |
Smoke test before promoting¶
If you want to canary, push without --promote:
uvx truss push deploy/baseten/holo3 --no-cache \
--deployment-name "$DEPLOY_NAME-canary" \
--include-git-info
You'll get a non-production environment URL. Run smoke tests there, then promote via the Baseten dashboard.
Serving a challenger for the promotion gate (#911/#918)¶
The slow-loop gate evaluates a trained challenger against the champion
(base). Unlike Modal — which boots a fresh inference server per run and picks
the challenger per request — the Baseten pod boots one shared inference server
at model-load, so the challenger is a deployment-level choice.
Holo3 (qwen3_5_moe) — full-model swap (#918). Holo3's LoRA adapter can't be
GGUF-converted (convert_lora_to_gguf lacks the MoE arch), but the merged model
can. So the Holo3 challenger is a full merged-GGUF model, not a --lora
overlay — point MANTIS_HOLO3_MODEL_DIR at the merged model (mounted via the
truss weights: block, alongside the base mmproj). The ready-made config
(deploy/baseten/holo3_challenger/config.yaml)
does this; swap its weights: source for your merged-model repo, then:
uvx truss push deploy/baseten/holo3_challenger --no-cache \
--deployment-name "holo3-challenger-$(git rev-parse --short HEAD)" \
--include-git-info
The trainer produces the artifact: peft merge_and_unload (adapter + base) →
convert_hf_to_gguf.py → quantize Q8_0 → publish.
Other (non-MoE / vLLM) bases — --lora overlay. For bases whose adapter
can be converted/served, set instead (deployment-level env):
| Env | Effect |
|---|---|
| MANTIS_LORA_ADAPTER | Serve base + this adapter. A pre-converted GGUF adapter (llama.cpp bases) or a PEFT dir (vLLM bases), mounted via weights:. Unset ⇒ base (champion). |
| MANTIS_LORA_SCALE | llama.cpp only: adapter scale (default 1.0; emits --lora-scaled). |
| MANTIS_LORA_NAME | vLLM only: served-model-name for the adapter (default challenger). |
The serving image has no torch/transformers, so it can't convert a raw PEFT dir for llama.cpp bases; point MANTIS_LORA_ADAPTER at a .gguf. A non-.gguf ref for a llama.cpp base fails fast at boot.
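The same rule can be checked locally before a push, which is cheaper than discovering it at boot. This is a sketch of the rule as described above, not the container's boot code. `check_lora_adapter` is a hypothetical helper, and the backend labels `llamacpp` and `vllm` are names chosen for this example.

```shell
# check_lora_adapter BACKEND [ADAPTER_PATH]
# Returns 0 if the adapter reference is acceptable for the backend,
# 1 if a llama.cpp base is given a non-.gguf adapter, 2 for an unknown backend.
check_lora_adapter() {
  case "$1" in
    llamacpp|vllm) ;;
    *) echo "unknown backend: $1" >&2; return 2 ;;
  esac
  # Unset or empty adapter means the base (champion) is served.
  if [ -z "${2:-}" ]; then
    return 0
  fi
  if [ "$1" = "llamacpp" ]; then
    case "$2" in
      *.gguf) ;;
      *) echo "MANTIS_LORA_ADAPTER must be a .gguf file for llama.cpp bases: $2" >&2
         return 1 ;;
    esac
  fi
  return 0
}
```

Usage: `check_lora_adapter llamacpp "$MANTIS_LORA_ADAPTER" || exit 1` before running the push.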
Point the gate's MANTIS_CHAMPION_ENDPOINT at the holo3 deploy and
MANTIS_EVAL_ENDPOINT at this one, then compare win-rate over the frozen holdout
(training/eval_harness.py / mantis_trainer.gate).
See also¶
- deploy/baseten/README.md: the source of truth for the Truss config
- Tenant keys: how to provision per-tenant tokens
- Metrics: wiring Prometheus scrape from Baseten