Integrating any agent with Mantis¶
This is the agent-side integration playbook — a prescriptive checklist for plugging any third-party agent (OpenAI Computer Use, Anthropic Computer Use, Voyager, AutoGPT-style frameworks, your own bespoke loop) into a Mantis backend without rediscovering the integration mistakes that produce silent halts.
Companion reading:
- Generic CUA over HTTP — the simplest shape (no library, no env wiring; just
/v1/predict).- Embedding MicroPlanRunner — the library shape, where you drive the runner in your own process.
Pick your shape¶
Two integration shapes; pick by what your agent already owns.
| Shape | When | What you write |
|---|---|---|
| HTTP client only | Your agent has no browser stack and just needs extraction. | A 30-line client that POSTs to /v1/predict, polls until done, reads the result. See Generic CUA. |
| Library embed with your own env | Your agent owns the desktop / Xvfb / Chrome stack and wants Mantis as the brain. | A GymEnvironment adapter (~150 LoC) + a MicroPlanRunner driver. See Embedding MicroPlanRunner. |
If you're unsure, start with HTTP. It's harder to hit silent integration bugs because the runtime contract is enforced server-side.
The runtime contract — five things your env must do¶
These are the contracts the runner relies on. Most integration halts trace back to a violation of one of these. They apply to the library embed shape; the HTTP shape inherits them from Mantis's own xdotool env.
1. current_url is a @property, not a method¶
The runner reads self.env.current_url as a property. If you define it as
a method:
# WRONG — bound method is truthy, runner returns it as if it were the URL.
def current_url(self) -> str:
return self._cdp_get_url()
# RIGHT — @property so attribute access returns the value.
@property
def current_url(self) -> str:
return self._cdp_get_url()
Since PR #91 the runner
defensively detects callable(raw) and invokes it, but this is a fallback
— other readers (operators, dashboards, custom plan verifiers) may not.
2. CDP must be reachable from the runner's process¶
The runner queries Chrome DevTools Protocol's /json/list endpoint to
read the active tab's URL. Two things must be true:
- Chrome launched with
--remote-debugging-port=<port> --remote-debugging-address=127.0.0.1. - The runner's Python process can reach
127.0.0.1:<port>(same network namespace, no firewall blocking the loopback port).
If your env is a thin wrapper around someone else's Chrome instance, verify CDP independently before you ship:
import urllib.request, json
tabs = json.loads(urllib.request.urlopen(
"http://127.0.0.1:9222/json/list", timeout=2,
).read())
print([t["url"] for t in tabs if t["type"] == "page"])
Should print one or more page URLs. If empty / connection-refused, fix that before writing any agent code — every URL-dependent verify path silently falls back to OCR which is slower and less reliable.
3. reset(start_url=...) must actually navigate¶
The runner's _execute_navigate calls env.reset(task="navigate", start_url=url)
and expects the browser to be on url when it returns. Common
integration miss: a wrapper env's reset ignores start_url, expecting
the agent to issue an explicit navigate action afterwards.
def reset(self, task: str, *, start_url: str = "", **_) -> GymObservation:
if start_url:
self._navigate(start_url) # don't skip this
return self._capture()
Verifiable: a unit test that calls env.reset(start_url="https://example.com")
and asserts env.current_url resolves to https://example.com within the
first-paint wait.
4. step(action) synthesises real input events¶
Particularly clicks. SPAs that bind handlers to row containers via
addEventListener("click", …) need a real pointerdown → pointerup
pair. Some Xvfb / xdotool integrations send only the mousedown half
or skip the move-and-click choreography. Symptom: every click in the
trace logs but no navigation happens.
If your env can't send real events end-to-end, escalate via CDP
Input.dispatchMouseEvent for the small set of actions that matter
(issue #89 §1 follow-up
tracks this). Until then, expect the click-verify cascade
(plain → middle-click → 4 probe points) to bottom out and the runner to
mark the step CLICK FAILED.
5. screenshot() returns a fresh image, not a cached one¶
The verify loop calls env.screenshot() after every action and assumes
the returned PIL image reflects the current display. Wrappers that cache
the last screenshot for performance — or that take the screenshot
asynchronously and return a stale frame — break extraction silently
because the OCR-based URL fallback then reads a pre-navigation address
bar.
The micro-plan contract — what your agent emits¶
Your agent doesn't author MicroIntent JSON by hand. Send plan_text
(English) and let the server's PlanDecomposer turn it into a typed
plan:
{
"plan_text": "Open https://crm.example.com. Log in as alice / <password>. Open the Leads tab. Click the first qualified lead. In the edit form, set Industry Vertical to Space Exploration. Save changes.",
"max_cost": 3.0,
"max_time_minutes": 15,
"detached": true
}
The decomposer turns that into navigate / fill_field × 2 / submit
× 3 / select_option / submit. Cached by hash for repeat calls.
If you're hand-authoring micro-plans (volume use cases, regression testing), the glossary lists every step type. Two patterns specifically to know about:
navigatefirst-paint wait: slow proxied SPAs needparams={"wait_after_load_seconds": 35}on the navigate step, or setMANTIS_NAV_WAIT_SECONDS=35on the deployment. The default is 18 s (covers Cloudflare auto-solve).submitaliases: primary-submit buttons whose copy varies across products (Update Lead/Save Changes/Save) should setparams={"label": "Update Lead", "aliases": ["Save", "Save Changes", "Update"]}. The grounder treats any alias as a valid match, and the form-finder scrolls Page_Down up to 4× to find buttons below the fold.
Both of these are params-only — no env-side or library-side wiring
needed.
Diagnosing failures — the log lines that matter¶
When something halts, the runner's trace is the source of truth. The high-signal log lines:
| Line | Meaning |
|---|---|
[url] cdp=https://... |
CDP returned a URL — the verify path is healthy. |
[url] cdp unavailable (URLError); falling back to OCR |
Chrome isn't reachable on the CDP port. Fix the launch flags or the network namespace. |
[url] cdp empty; falling back to OCR |
CDP reachable but returned no tabs. Chrome started but no page is loaded yet — usually a navigate-too-fast or first-paint timing bug. |
[url] ocr=<empty> |
Both CDP and OCR returned empty. Either the page is mid-render or the address bar is occluded — most actionable as a wait_after_load_seconds bump. |
[claude-click] Plain click did not navigate |
The click landed but no navigation followed. Either the SPA needs synthetic events your env doesn't dispatch (see contract §4), or the URL-read source is wrong (see the [url] lines above). |
[claude-form] submit '<label>' not in viewport — scrolling Page_Down |
Form-finder is scrolling for a below-fold button. Healthy. |
[claude-form] submit: button '<label>' not found after 4 scroll(s) |
Genuine miss. Try setting params["aliases"] to cover copy variation. |
If you see (url=) empty across every retry of a click step, the
[url] … lines tell you exactly which leg to fix. Without them, you're
guessing.
The five integration mistakes (and how to detect them)¶
These are the issues real-world integrations have surfaced — preflight your integration against them.
| Mistake | Detection | Fix |
|---|---|---|
current_url defined as method instead of @property |
Trace shows (url=) empty even when CDP is reachable. The bound-method object short-circuits the truthy check. |
Change to @property. PR #91 also detects callable and invokes it. |
Chrome launched without --remote-debugging-port |
[url] cdp unavailable (URLError) on every step. |
Add the flag (and --remote-debugging-address=127.0.0.1) to the launch cmd. |
env.reset(start_url=...) ignored |
The first navigate step doesn't actually load the URL. Subsequent steps run against about:blank or a stale page. |
Make reset honour start_url. Pin with a unit test. |
Wrapper screenshot() returns cached frame |
OCR reads the pre-navigation address bar. CDP reads the new URL. The two disagree, often silently. | Force a fresh capture per call. |
| Pip-installed wheel doesn't match the SHA tag | Code at +abc1234 doesn't match git show abc1234. Build pipeline cached an older sdist. |
Rebuild without cache. Verify with pip show mantis-agent + a single line grep on the installed file. |
Pre-flight checklist¶
Before claiming your integration works, run all of these:
-
env.current_urlis a@property(not a method). - CDP
curl http://127.0.0.1:9222/json/listreturns the current page from inside the runner's container / process. -
env.reset(start_url="https://example.com")lands onhttps://example.com(assert viaenv.current_url). -
env.step(Action(CLICK, ...))synthesises a real click — verify with a test page thatconsole.logs onpointerdownandpointerup. -
env.screenshot()returns a fresh image — diff two consecutive calls after aKEY_PRESS Page_Downand confirm they differ. - Your
mantis-agentinstall matches the SHA you intended:python -c "from mantis_agent.gym.micro_runner import MicroPlanRunner; import inspect; print(inspect.getsourcefile(MicroPlanRunner))"thengrep _read_current_url <that-file>should hit. - A trivial
plan_textround-trip succeeds end-to-end (HTTP shape) or aMicroPlanround-trip succeeds end-to-end (library shape). - The trace shows
[url] cdp=...on at least one step.
If every box is ticked, you've avoided the known-failure surface. New
failure modes go to the issue tracker —
favour reproducers with the offending [url] / [claude-click] /
[claude-form] log lines attached.
Versioning and SHA pinning¶
Pin mantis-agent to a specific tag or git SHA in your agent's lockfile.
The decomposer prompt is versioned (v13_submit_aliases at time of
writing); cached plans regenerate automatically when the version bumps,
so you don't need cache-busting code on your side, but you do want
your agent's behaviour reproducible across deploys.
If you're tracking a Mantis branch (e.g. for an integration that hasn't
landed in main yet), use MANTIS_GIT_SHA=<sha> as the build arg the
deployer reads — that pattern keeps pip cache-aware while letting the
SHA tag remain accurate.
Plugging in your own brain¶
Mantis ships several reference brains (holo3, claude, opencua,
llamacpp, gemma4, agent-s) registered under the
mantis_agent.brain_protocol registry. Swap one of these — or add your
own — without forking the runtime.
1. Pick a built-in by name¶
Set MANTIS_BRAIN on the deployment:
export MANTIS_BRAIN=claude # surgical-reasoning only
# or
export MANTIS_BRAIN=opencua # local OpenCUA-7B endpoint
The legacy MANTIS_MODEL env var keeps working for one minor release.
When both are set, MANTIS_BRAIN wins. The legacy alias gemma4-cua
maps to gemma4.
2. Register your own backend¶
Implement the Brain protocol — load() plus think(frames, task,
action_history=None, screen_size=(1920, 1080)) returning an
InferenceResult-shaped object — then register it before the runtime
starts:
from mantis_agent import Brain, register_brain
class MyBrain:
def load(self) -> None:
...
def think(self, frames, task, action_history=None, screen_size=(1920, 1080)):
...
register_brain("my-brain", lambda: MyBrain())
Then run with MANTIS_BRAIN=my-brain. The runtime asks
mantis_agent.resolve_brain("my-brain") and gets a fresh instance.
The protocol is runtime_checkable, so isinstance(b, Brain) works for
quick smoke tests without forcing inheritance from a base class.
register_brain and resolve_brain are pure typing + dict — they pull
no GPU / API dependencies, so a plugin package can register a brain on
import without growing the slim install.
Where to go next¶
- Recipes — copy-paste micro-plans for jobs, e-commerce, news, real-estate.
- Embedding MicroPlanRunner — full Python surface for the library shape.
- Reference glossary — every step type, every helper, every env var.
- Issue tracker —
filing a bug? Include the
[url] …log lines.