Mantis CUA

A unified perception-reasoning-action agent for computer use. Given a structured plan, Mantis drives a real browser (or any Xvfb-rendered application), takes actions, and extracts structured data. Each run produces a JSON result and, optionally, a polished video walkthrough.

       ┌──────────────────────┐         ┌─────────────────────────┐
3p ──► │ Mantis CUA service   │ ──────► │ Target app (Chrome,     │
caller │ Holo3 + Claude       │         │ file manager, terminal, │
       │ /v1/predict          │         │ LibreOffice, …)         │
       └──────────┬───────────┘         └─────────────────────────┘
                  │
                  ▼
       ┌──────────────────────┐
       │ Result + lead CSV +  │
       │ polished screencast  │
       └──────────────────────┘

What you get

Reliable multi-step plans. A structured MicroPlanRunner enforces section / gate / loop semantics so even small models behave on long workflows.
Cheap inference at click-and-scroll latency. Holo3 (35B GGUF on a single GPU) for tactical actions; Claude API only for surgical reasoning steps (extract / verify / ground a click).
Real browser, real desktop. Xvfb + Chrome + xdotool. No Playwright fingerprints. Works against sites with bot detection.
Cloud-portable. Same image runs on Baseten, Modal, EKS, GKE, or your own Docker host.
Multi-tenant out of the box. Per-key auth, per-tenant rate limits, idempotency, webhooks, URL allowlists, Prometheus metrics.
Screencast included. Every run can produce a title-card → captioned-run-with-action-overlays → outro video that's ready to share.
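
The split between tactical and reasoning steps, and the section / gate / loop structure, can be illustrated with a small sketch. This is not the actual MicroPlanRunner: the `Step` and `Section` classes, the `pick_model` and `run_plan` helpers, and the exact gate behaviour (a failed gate skips the rest of the current iteration) are all assumptions made for illustration. Only the three reasoning step kinds (extract / verify / ground) come from the list above.

```python
from dataclasses import dataclass
from typing import Callable

# Step kinds routed to Claude; everything else is treated as tactical.
REASONING_KINDS = {"extract", "verify", "ground"}


def pick_model(kind: str) -> str:
    """Route surgical reasoning steps to Claude, all other steps to Holo3."""
    return "claude" if kind in REASONING_KINDS else "holo3"


@dataclass
class Step:
    kind: str                    # e.g. "click", "scroll", "extract"
    run: Callable[[str], bool]   # receives the model name, returns success
    gate: bool = False           # hypothetical: a failed gate stops the iteration


@dataclass
class Section:
    name: str
    steps: list[Step]
    repeat: int = 1              # hypothetical loop count for the section


def run_plan(sections: list[Section]) -> list[tuple[str, int, str, str, bool]]:
    """Run sections in order; log (section, iteration, kind, model, ok)."""
    log = []
    for section in sections:
        for iteration in range(section.repeat):
            for step in section.steps:
                model = pick_model(step.kind)
                ok = step.run(model)
                log.append((section.name, iteration, step.kind, model, ok))
                if step.gate and not ok:
                    break  # skip the remaining steps of this iteration
    return log
```

Keeping the routing decision in one small function means the expensive model is only reached through an explicit step kind, which is one plausible way to keep per-task cost predictable.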

Pick a path

I just want to try it

Hit the live Baseten endpoint with a curl. No deploy needed.

Quickstart

I want to host it

Deploy on Baseten / Modal / EKS / GKE / your own Docker host.

Hosting

I want to integrate from my app

Auth, sending plans, polling for results, downloading recordings.

Client

I run a multi-tenant fleet

Provision tenant keys, enforce rate limits, wire webhooks + metrics.

Operations

Verified end-to-end

Path	Run	Result
Modal	3-listing extraction	2 / 3 leads with phone, ~$0.42, 13 min
Baseten	3-listing extraction	3 / 3 leads with phone, ~$0.42, 9.5 min

Both deployments produce structured JSON rows (year / make / model / price / phone / url) for every successfully extracted listing.
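
A client consuming these rows might model them as follows. This is a minimal sketch: the field names mirror the columns listed above, but the field types, the assumption that the result is a JSON array of objects, and the `Lead`, `parse_leads`, and `phone_coverage` names are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lead:
    # Field names follow the listed columns; the types are assumptions.
    url: str
    year: Optional[int] = None
    make: Optional[str] = None
    model: Optional[str] = None
    price: Optional[str] = None
    phone: Optional[str] = None


def parse_leads(payload: str) -> list[Lead]:
    """Parse a JSON array of row objects, tolerating missing optional fields."""
    optional = ("year", "make", "model", "price", "phone")
    rows = json.loads(payload)
    return [Lead(url=row["url"], **{k: row.get(k) for k in optional}) for row in rows]


def phone_coverage(leads: list[Lead]) -> str:
    """Summarise how many leads carry a phone number, e.g. '2 / 3'."""
    with_phone = sum(1 for lead in leads if lead.phone)
    return f"{with_phone} / {len(leads)}"
```

A summary like the "leads with phone" figures in the table above could then be computed with `phone_coverage(parse_leads(payload))`.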

At a glance

Aspect	Notes
Languages	Python 3.11+
GPU footprint	1× H100 / A100 80 GB / L40S 48 GB (for Holo3 inference). Orchestrator can run on CPU.
Cost per task (3-listing reference)	GPU ~$0.12 + Claude ~$0.12 + proxy ~$0.18 = ~$0.42
Auth	`X-Mantis-Token` (custom header) + Baseten gateway `Authorization: Api-Key`
API style	OpenAI-compatible `/v1/chat/completions` for inference; Mantis-shape `/v1/predict` for orchestrated runs
Cloud paths	Baseten Truss · Modal · EKS (Terraform + k8s) · GKE (Terraform + k8s) · raw Docker
Multi-tenancy	File-backed JSON keys, per-tenant scopes / caps / Anthropic key / allowed domains / webhooks
Recording	Optional `record_video: true` produces a polished walkthrough with overlays for clicks / keys / scrolls / typing / drags
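
The Auth, API style, and Recording rows together suggest what an orchestrated-run request might look like. The sketch below only builds the request and does not send it. The two header names and the `/v1/predict` path come from the table; the request body shape (a top-level `plan` key next to `record_video`), the `build_predict_request` helper, and its parameters are assumptions, so check the Client documentation for the real schema.

```python
import json
import urllib.request


def build_predict_request(base_url, mantis_token, baseten_api_key, plan,
                          record_video=False):
    """Build (but do not send) a POST to /v1/predict with both auth headers."""
    # Body layout is hypothetical: "plan" and "record_video" as top-level keys.
    body = {"plan": plan, "record_video": record_video}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/predict",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Mantis-Token": mantis_token,                  # Mantis tenant key
            "Authorization": f"Api-Key {baseten_api_key}",   # Baseten gateway key
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. On deployments that are not behind the Baseten gateway, the `Authorization` header is presumably unnecessary.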

License

MIT. Source on GitHub.