
Mantis CUA Agent — Learnings & Experiment Log

Updated: 2026-04-18 | 200+ commits | ~$176+ GPU spend | Branch: feat/gym-anything-integration

What this doc is. A journey log of the lessons learned while hardening Mantis on one canonical extraction workflow — walking consumer-marketplace listings, classifying private vs dealer sellers, and pulling structured rows. Some examples below reference specific sites and scripts from that engagement. The lessons generalize to any extraction or form-flow workflow; the Recipes page abstracts the patterns into reusable shapes.


Executive Summary

We're building a CUA (Computer Use Agent) that can drive any web app or desktop application with open-weight models instead of proprietary APIs. We've achieved 83.3% on the OSWorld benchmark (OS domain) and can extract structured data from listings sites end-to-end. The hardest open problem is unit economics: flows that depend on variable third-party data quality (e.g. only ~5-10% of private sellers on consumer marketplaces post phone numbers) make per-row costs hard to amortize, with extraction running $4-29 per row depending on model and run length.


What's Working

1. XdotoolGymEnv — Zero-Fingerprint Browser Automation

  • Real Chrome + Xvfb + xdotool = undetectable by Cloudflare/bot detection
  • X11 events are indistinguishable from human input
  • Screenshots via mss/scrot (pixel capture, no browser API)
  • Proven: No Cloudflare blocks once we switched from Playwright/CDP
  • Key lesson: Every automation framework (Playwright, CDP, Puppeteer) leaks signals. Only OS-level input injection is truly undetectable.
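
For flavor, OS-level injection is just a thin subprocess wrapper around xdotool — a minimal sketch, with the display number and helper names ours, not the repo's:

```python
import os
import subprocess

DISPLAY = ":99"  # Xvfb display the real Chrome instance renders to (assumed)

def xdo(*args: str) -> None:
    # xdotool emits genuine X11 input events, so the browser sees the same
    # event stream a physical mouse/keyboard would produce.
    subprocess.run(["xdotool", *args],
                   env={**os.environ, "DISPLAY": DISPLAY}, check=True)

def click(x: int, y: int) -> None:
    xdo("mousemove", str(x), str(y))
    xdo("click", "1")  # button 1 = left click

def type_text(text: str) -> None:
    xdo("type", "--delay", "80", text)  # --delay paces keystrokes like a human
```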

2. Gemma4 on OSWorld — 83.3% (20/24 OS tasks)

  • Gemma4 26B-A4B via llama.cpp on single A100-80GB
  • pyautogui as primary action language (model was trained on it)
  • Distillation loop: analyze failure → diagnose → store learning → retry
  • ReliableController: transparently upgrades pyautogui.write() to xdotool for special chars
  • Key lesson: Respect the model's training distribution. Making subprocess.run() the default dropped score from 83% to 67%.
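
A sketch of the ReliableController upgrade path described above — the trigger set and delays are illustrative assumptions, not the real implementation:

```python
import os
import subprocess

import pyautogui

# Characters pyautogui.write() mistypes or drops on some X keymaps
# (illustrative set — the real trigger list lives in ReliableController).
SPECIAL = set("@#$%^&*~{}|\\<>")

def reliable_write(text: str) -> None:
    # Stay inside the model's training distribution (pyautogui) when safe;
    # transparently route special characters through layout-independent xdotool.
    if any(c in SPECIAL for c in text):
        subprocess.run(["xdotool", "type", "--delay", "50", text],
                       env={**os.environ, "DISPLAY": ":99"}, check=True)
    else:
        pyautogui.write(text, interval=0.05)
```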

3. Custom Gemma4-CUA Fine-Tune — 100% CRM in 3 Steps

  • QLoRA fine-tune on 5000 AgentNet tasks, 3 epochs, rank=32
  • Loss: 3.8 → 0.27
  • Fastest of all models tested (3 steps vs 12 for EvoCUA-32B)
  • Parse actions from reasoning_content (model thinks then acts)
  • Key lesson: Small, targeted fine-tunes massively outperform prompting alone.

4. Reasoning Budget is Task-Dependent

  • budget=0 for CLI/terminal tasks (OSWorld) — model over-thinks on shell commands
  • budget=512 sweet spot for browser CUA — enough to reason about visual layouts
  • budget=4096 for complex form-filling — but 3-5x slower
  • Key lesson: There's no universal budget. Match reasoning depth to task complexity.

5. Parallel Workers (1 Page Per Worker)

  • 5x wall-clock speedup over sequential
  • Each worker owns a full page (~25 listings)
  • Retry 3x on crash/Cloudflare/preemption
  • Key lesson: Don't pre-slice listings across workers. Dynamic page queue is simpler and more fault-tolerant.
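
The dynamic page queue is small enough to sketch in full — worker count and retry cap match the bullets above; names are ours:

```python
import queue
import threading

MAX_RETRIES = 3

def run_workers(page_urls, process_page, n_workers=5):
    # Dynamic queue: each worker pulls the next unclaimed page instead of
    # owning a pre-sliced range, so a crashed worker never strands its share.
    q = queue.Queue()
    for url in page_urls:
        q.put((url, 0))

    def worker():
        while True:
            try:
                url, attempts = q.get_nowait()
            except queue.Empty:
                return
            try:
                process_page(url)  # walks the ~25 listings on one page
            except Exception:
                # Crash / Cloudflare / preemption: requeue up to MAX_RETRIES
                if attempts + 1 < MAX_RETRIES:
                    q.put((url, attempts + 1))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```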

6. Persistent Chrome Profile on Modal Volume

  • Cookies survive across runs (login once, extract many times)
  • Clean session files only, not full profile (avoids corruption)
  • IPRoyal Miami residential proxy with sticky sessions per run

7. Strict Extraction Validation

  • Real phone validation: 555-exchange filtered, URL fragments filtered
  • Dedup by phone digits + listing URL
  • Eliminates ~90% of false positives (social media links, ad tracking numbers)
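
Roughly what the validation layer looks like — a sketch, with the regex and dedup key inferred from the bullets above:

```python
import re

# Require a separator-formatted US number so bare digit runs inside URLs
# or ad-tracking parameters don't match.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-.\s](\d{3})[-.\s](\d{4})\b")

def extract_phone(text: str) -> str | None:
    m = PHONE_RE.search(text)
    if not m:
        return None
    area, exchange, line = m.groups()
    if exchange == "555":  # fictional/placeholder exchange — filter it
        return None
    return area + exchange + line

seen: set[tuple[str, str]] = set()

def is_new_lead(phone_digits: str, listing_url: str) -> bool:
    # Dedup on (phone digits, listing URL) so reposts don't double-count.
    key = (phone_digits, listing_url)
    if key in seen:
        return False
    seen.add(key)
    return True
```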

8. Screenshot Replay — 60x Faster Prompt Iteration

  • Every run now saves screenshots to Modal volume (/data/screenshots/)
  • ReplayGymEnv replays cached screenshots locally without browser/GPU
  • scripts/test_extraction_prompt.py tests prompts against screenshots via Claude API in 30sec
  • scripts/replay_test.py CLI: download, test single prompts, run full replay
  • Key finding from screenshots: Phone was visible on step 20 but model clicked into photo gallery and got trapped for 60 steps
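
The replay env is conceptually tiny. A sketch of the idea behind ReplayGymEnv (file naming and method signatures are assumptions):

```python
from pathlib import Path

class ReplayGymEnv:
    # Serves cached screenshots instead of driving a live browser, so prompt
    # and parser changes can be tested without GPU, proxy, or Chrome.
    # Assumes frames were saved as step_000.png, step_001.png, ...
    def __init__(self, run_dir: str):
        self.frames = sorted(Path(run_dir).glob("step_*.png"))
        self.i = 0

    def reset(self) -> bytes:
        self.i = 0
        return self.frames[0].read_bytes()

    def step(self, action) -> tuple[bytes, bool]:
        # The action is ignored — we just advance the recorded trajectory.
        # That's enough to exercise prompting and action parsing offline.
        self.i = min(self.i + 1, len(self.frames) - 1)
        done = self.i == len(self.frames) - 1
        return self.frames[self.i].read_bytes(), done
```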

9. RegionGrounding — Heuristic Click Safety

  • Clamps click coordinates to safe content area (y>80, y<660, x>20, x<1260)
  • Prevents footer social icon clicks (y>660) and header menu clicks (y<80)
  • Zero overhead — no model, no CDP, just coordinate math
  • Graceful fallback: if grounding fails, uses brain's original coordinates
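
The whole mechanism is a coordinate clamp — a sketch using the bounds above:

```python
# Safe content area on a 1280x720 screenshot (bounds from the bullets above)
X_MIN, X_MAX = 20, 1260
Y_MIN, Y_MAX = 80, 660   # y<80 = header menus, y>660 = footer social icons

def ground_click(x: int, y: int) -> tuple[int, int]:
    # Clamp into the content region; on any upstream failure, callers
    # fall back to the brain's original coordinates.
    return (min(max(x, X_MIN), X_MAX), min(max(y, Y_MIN), Y_MAX))
```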

10. JSON Nested Action Parsing

  • Gemma4-CUA sometimes outputs {"action":"key_press","parameters":{"keys":"alt+left"}}
  • Parser now flattens nested parameters/arguments/params dicts
  • Also handles markdown-fenced JSON (```json … ``` wrappers)
  • Eliminated parse failures that were wasting 30%+ of steps
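
A condensed sketch of the flattening and fence-stripping (function name is ours):

```python
import json
import re

def parse_action(raw: str) -> dict:
    # Strip markdown fences: ```json\n{...}\n```
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    obj = json.loads(raw)
    # Flatten one level of nesting, e.g.
    # {"action":"key_press","parameters":{"keys":"alt+left"}}
    #   -> {"action":"key_press","keys":"alt+left"}
    for key in ("parameters", "arguments", "params"):
        nested = obj.pop(key, None)
        if isinstance(nested, dict):
            obj.update(nested)
    return obj
```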

What's NOT Working

1. Prompt Engineering Has Hit a Ceiling

  • Models click Facebook/Instagram icons despite explicit "NEVER CLICK" instructions in 25-50% of iterations
  • Adding more negative examples, caps, emphasis — no measurable improvement
  • Root cause: Text instructions can't override visual saliency. Colorful social media icons are visually dominant. The model sees them and clicks them regardless of prompt.
  • Spent: 15+ prompt iterations, multiple commit cycles, no improvement beyond ~75% adherence

2. The Fundamental Lead Economics Problem

  • Only ~5-10% of private sellers on BoatTrader post phone numbers
  • Best case: $4.26/lead (EvoCUA-32B, good proxy day)
  • Worst case: $29+/lead (Gemma4 parallel, prompt-only off-site avoidance)
  • 80 listings scanned → 3 unique phone leads total
  • Implication: Even with perfect extraction accuracy, the hit rate caps economic viability. Need higher-yield sources or broader lead definition.

3. EvoCUA-8B Parse Failures

  • 225/500 steps unparseable in brain_opencua
  • The model outputs valid reasoning but action format doesn't match our parser
  • EvoCUA-32B works fine with same parser — the 8B model is genuinely worse at structured output
  • Cost: Cheapest model is unusable, forcing us to 2x A100 for 32B

4. E4B Planner Produces Generic Plans

  • Gemma4-E4B (4B parameter model) as text planner generates instructions without visual anchors
  • "Click the boat listing" vs "Click the LARGE rectangle with boat photo and price text"
  • CUA models need spatially grounded instructions to avoid clicking wrong elements
  • Spent: Multiple iterations on planner prompt, fundamental model capability issue

5. CDP-Based Runtime Guardrails Don't Work in xdotool Env

  • JS injection for cookie dismissal, URL detection, CSS hiding
  • Timing races with xdotool (event arrives before JS executes)
  • WebSocket failures when Chrome is under load
  • Violates the zero-fingerprint architecture we chose for good reason
  • Decision: Removed all CDP guardrails. Model handles everything visually.

6. Off-Site Navigation Has No Runtime Safety Net

  • CDP backtrack was removed (for good reason — fingerprinting)
  • RegionGrounding clamps footer/header clicks but doesn't prevent inline social links
  • Off-site avoidance is prompt-based + region clamping, but ~25% still leak through
  • Mitigation: RegionGrounding reduced from 50% to ~25%. Full fix needs Opus visual planner.

7. Stale URLs Cause Silent 0% Runs

  • BoatTrader changed URL schema — condition-used/type-power/price-35000,/zip-33101/radius-100/ returns 404
  • seller-private/ filter URL also returns 404
  • Only boattrader.com/boats/ (base URL) works, but shows ALL boats including dealers
  • This caused ALL recent 0% runs — model was on 404 pages, not extraction failures
  • Fix needed: Opus visual planner discovers correct URL by browsing, or model applies filters visually via s1_search
8. Gallery Trap — Photo Clicks Open a Lightbox

  • Model clicks boat photos → enters lightbox/image viewer → spends 40+ steps trying to close
  • Screenshot evidence: step 20 had phone visible, steps 40-80 stuck in gallery
  • Prompt says "click TITLE TEXT, NOT the photo" but model clicks photos anyway — visual saliency overrides text instructions
  • Tried: LLM grounding with same Gemma4 — FAILED. Same visual bias = same click targets
  • Fix: Use Claude Sonnet as separate grounding model — different model, no photo bias
  • Key learning: same-model grounding is useless for fixing visual biases. Must use a DIFFERENT model.
  • Cost: ~$0.01 per grounding call, ~$0.25 per listing (vs $5-15 wasted on gallery traps)

9. Holo3-35B-A3B — vLLM Blocked, llama.cpp Works

  • 77.8% OSWorld-Verified (SOTA open-weight, released Mar 31 2026)
  • Qwen3.5-based MoE: 35B total params, only 3B active per token
  • vLLM cannot serve it: vLLM pins transformers<5, Holo3's qwen3_5_moe architecture needs transformers>=5.2. Hard dependency conflict across ALL vLLM versions (0.12–0.19).
  • H Company hosted API: works but free tier rate-limits (429 every few requests), no proper tool_calls returned, inconsistent action formats
  • llama.cpp GGUF: WORKS — Q8_0 (34GB) + mmproj-f16 (0.8GB) on 1x A100, server boots in 22-26 seconds
  • Model generates actions but uses varying output formats: {"code":"wait()"}, {"command":"click","x":200,"y":400}, Action: scroll({...}), plus standard tool_calls
  • Required 5-strategy fallback parser (tool_calls → Holo3 text → JSON → pyautogui → keywords)
  • Coordinate values sometimes arrive as strings or comma-separated pairs — needed robust int extraction
  • First VIABLE extraction in 104 seconds — model navigated to listing, scrolled, found data, reported back
  • Parse gaps fixed: Escape(), {"code":"wait()"}, {"command":"click"} all handled
  • Chrome --no-sandbox warning bar was shifting coordinates ~30px — fixed with --test-type flag
  • reasoning-budget=512 makes Holo3 WORSE — overthinks, generates paragraphs instead of action calls. Budget=0 (default) is better.
  • Best run: 1/4 viable (25%), 0 parse failures, $0.13 total, 9-17s per listing
  • Persistent issues (same as Gemma4/EvoCUA):
      • Gallery trap: model clicks boat PHOTOS instead of title text despite explicit instructions → enters fullscreen gallery ("1 of 85") → gets stuck
      • No structured done(): model finds data but doesn't format VIABLE | Year: ... | Make: ... — outputs reasoning text instead
      • Action loops: repeats same click coordinates without progress until hard loop detector kills it
  • Opus browse-enhanced plan tested: generated pixel-level coordinates from cached screenshots ($0.06 one-time). Pixel coords shift between sessions (ads/banners). Updated to visual descriptions instead. Result: model still doesn't reliably click listing titles.
  • Key insight: These are NOT Holo3-specific problems. Every model (Gemma4, EvoCUA, Holo3) hits the same gallery trap and formatting issues. Neither prompt engineering NOR Opus visual planning alone can fix this. The fix is distillation — showing the model correct click trajectories via fine-tuning.

Holo3 Integration — Technical Details

What We Tried (in order)

  1. vLLM self-hosted (2x A100) — BLOCKED. transformers<5 in vLLM vs transformers>=5.2 for qwen3_5_moe. No version of vLLM can load this model.

  2. H Company hosted API — PARTIAL. Free tier, OpenAI-compatible. But:
      • 429 rate limits every 3-5 rapid requests
      • Model returns text reasoning without tool_calls ~50% of the time
      • Coordinates arrive as strings ("420, 84") or in non-standard JSON keys ("command" instead of "action")
      • Cost: $0 (free tier) but unreliable for production

  3. llama.cpp GGUF (1x A100) — WORKS. Using mradermacher/Holo3-35B-A3B-GGUF:
      • Q8_0 quant (34.4 GB) + mmproj-f16 (0.84 GB) = ~35 GB total
      • Fits 1x A100-80GB with ~40 GB to spare for KV cache
      • llama-server with --jinja --flash-attn on -ngl 99 -c 8192
      • Boots in 22-26 seconds (vs 2-5 min for vLLM cold start)
      • Same architecture as our Gemma4-CUA executor

Action Format Challenges

Holo3 outputs actions in at least 5 different formats depending on context:

```
# Format 1: Standard OpenAI tool_calls (rare from llama.cpp)
tool_calls: [{function: {name: "click", arguments: "{\"x\":640,\"y\":360}"}}]

# Format 2: Holo3 native text
Action: scroll({'direction': 'down', 'amount': 5})

# Format 3: JSON with "command" key (not "action")
{"command":"click","x":200,"y":400}

# Format 4: JSON with "code" key (function call as string)
{"code":"wait()","description":"Wait for page to load"}

# Format 5: Pure reasoning (no action — falls back to wait)
"I need to scroll down to find the phone number..."
```
Built a 5-strategy parser chain to handle all of these, with safe int extraction for malformed coordinates.
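
A compressed sketch of that chain — strategy 1 (OpenAI tool_calls) is handled at the client layer and omitted here; all names are illustrative:

```python
import ast
import json
import re

def safe_int(v) -> int | None:
    # Coordinates arrive as 420, "420", or "420, 84" — take the first integer.
    m = re.search(r"-?\d+", str(v))
    return int(m.group()) if m else None

def parse_action(text: str) -> dict:
    # Strategy 2: Holo3 native text — Action: scroll({'direction': 'down', 'amount': 5})
    m = re.search(r"Action:\s*(\w+)\((.*?)\)\s*$", text, re.M)
    if m:
        try:
            args = ast.literal_eval(m.group(2)) if m.group(2).strip() else {}
        except (ValueError, SyntaxError):
            args = {}
        return {"action": m.group(1), **(args if isinstance(args, dict) else {})}
    # Strategy 3: bare JSON with "action", "command", or "code" keys
    m = re.search(r"\{.*\}", text, re.S)
    if m:
        try:
            obj = json.loads(m.group())
        except json.JSONDecodeError:
            obj = {}
        if "code" in obj:  # {"code":"wait()"} — function call packed in a string
            return {"action": obj["code"].split("(")[0]}
        if "command" in obj or "action" in obj:
            obj["action"] = obj.pop("command", obj.get("action"))
            for k in ("x", "y"):
                if k in obj:
                    obj[k] = safe_int(obj[k])
            return obj
    # Strategy 4: raw pyautogui code in the text
    if "pyautogui." in text:
        return {"action": "pyautogui", "code": text}
    # Strategy 5: keyword fallback — pure reasoning degrades to a wait
    return {"action": "wait"}
```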

Holo3 Run Results

| Run | Viable/Total | Time | Cost | Notes |
|---|---|---|---|---|
| API run 1 | 1/4 (25%) | 1 min | $0 | Rate limited, format chaos |
| API run 2 | 0/4 (0%) | 4 min | $0 | 400 errors from tool_calls |
| llama.cpp run 1 | 0/4 (0%) | 2 min | $0.12 | 400s from tool calling, fixed by dropping tools |
| llama.cpp run 2 | 1/4 (25%) | 2 min | $0.13 | First VIABLE! Zero parse failures |
| llama.cpp run 3 | 0/3 (0%) | 7 min | $0.38 | reasoning-budget=512 → overthinking, regressed |
| llama.cpp run 4 | 0/2 (0%) | 3 min | $0.16 | Gallery trap on every listing |

Cost Comparison (Holo3 vs alternatives)

| Approach | GPU | Cost/listing | Speed/listing | Status |
|---|---|---|---|---|
| Holo3 llama.cpp (Q8_0) | 1x A100 | ~$0.03 | 9-17s | Working — gallery trap limits accuracy |
| Holo3 H Company API | None | $0 (free) | 15-30s | Rate limited, unreliable format |
| EvoCUA-32B vLLM | 2x A100 | ~$0.70 | 10 min | Working, proven 23% hit rate |
| Gemma4-CUA llama.cpp | 1x A100 | ~$0.50 | 15-20 min | Working, best per-listing accuracy |
| Claude API | None | ~$1-5 | 30-60s | Working, gold standard |

Holo3 is 20-50x cheaper and 30-60x faster per listing than Gemma4/EvoCUA. The bottleneck is now purely accuracy — gallery trap and done() formatting — not cost or speed.


Why Testing & Iteration Cycles Are Slow

This is the single biggest drag on progress. Each experiment cycle takes 30-90 minutes wall clock and costs $4-12 in GPU.

The Feedback Loop

Code change → Modal deploy (~2min) → Container build (~3min) → Model load (~2min) →
Chrome launch → Navigate to site → Process listings (~10-20min each) →
Check results on Modal volume → Diagnose failure from logs → Repeat

Total: 30-90 minutes per iteration, $4-12 per run

Why It's Inherently Slow

  1. No local testing possible: The full pipeline requires A100 GPU (llama.cpp + Gemma4 26B), Xvfb display, real Chrome, and IPRoyal proxy. Can't run locally on Mac.

  2. Modal cold starts: Every code change requires a new container build. Image building (apt-get, pip install, model download) adds 2-5 minutes before any code runs.

  3. Real website latency: BoatTrader pages take 3-10 seconds to load through residential proxy. Can't mock this without losing the zero-fingerprint guarantee.

  4. Model inference is slow: Gemma4 with budget=512 takes 5-15 seconds per step. At ~40 steps per listing plus page loads, that's 15-20 minutes per listing — and there are 25 listings per page.

  5. Failures are only visible at the end: A prompt engineering change might look fine for the first 5 listings then fall apart on listing #6 when the page layout shifts. Need full-page runs to validate.

  6. No unit tests for visual grounding: We can't unit-test "does the model click the right thing on a boat listing page" without actually running the model on the page. There's no synthetic benchmark for this.

  7. Log inspection is manual: Results land on Modal volumes. scripts/check_boattrader.sh helps but debugging requires reading through JSON traces step-by-step.

What We've Done to Speed Things Up

  • human_speed=False + removed inter-iteration sleep → 2-3x faster per listing
  • Parallel workers → 5x wall-clock reduction for multi-page runs
  • scripts/check_boattrader.sh / scripts/monitor_boattrader.sh → live progress without waiting for completion
  • --detach runs → start and check later instead of blocking terminal
  • Fast-fail on 404/error pages → skip stale listings immediately

What Would Actually Help (but haven't built yet)

  • Local Gemma4 inference via llama.cpp on Mac (M-series) for prompt iteration without Modal
  • Cached screenshot replay: Record screenshots from a real run, replay them locally to test prompt/parsing changes without GPU or network (since built — see Screenshot Replay under What's Working)
  • Synthetic visual benchmark: 20 annotated BoatTrader screenshots with ground-truth click targets, testable offline

Cross-Model Pattern: The Same 3 Problems

Every model we've tested (Gemma4-CUA, EvoCUA-32B, EvoCUA-8B, Holo3-35B-A3B) hits the same three failure modes on BoatTrader, regardless of model quality or cost:

1. Gallery Trap (all models)

Model clicks the boat photo instead of the title text. Opens fullscreen image viewer ("1 of 85"). Gets stuck pressing Escape/Back for 10-40 steps. Prompt says "click TITLE TEXT not photo" — model ignores it because the photo is visually larger and more prominent.

2. No Structured Output (all models)

Model finds boat data (Year, Make, Model, Price visible on screen) but doesn't format done(summary="VIABLE | Year: 2024 | Make: ..."). Instead outputs reasoning text or incomplete summaries that fail validation.

3. Action Loops (all models)

After failing to make progress (gallery trap, wrong click, slow page load), model repeats the same action 10+ times until hard loop detector kills the iteration.
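
For reference, the hard loop detector needs only a handful of lines — a sketch with assumed thresholds:

```python
from collections import deque

class LoopDetector:
    # Kill the iteration when the same action repeats N times in a row —
    # the cross-model symptom described above.
    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)

    def stuck(self, action: dict) -> bool:
        key = (action.get("action"), action.get("x"), action.get("y"))
        self.recent.append(key)
        return (len(self.recent) == self.max_repeats
                and len(set(self.recent)) == 1)
```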

What This Means

Prompt engineering cannot fix these. We've tried 15+ prompt iterations across 4 models. The fix is one of:

  • Distillation: show the model correct trajectories (click HERE not THERE) via fine-tuning
  • Opus visual planner: generate instructions with pixel-level visual anchors from actual site browsing
  • Runtime guardrails: detect gallery/lightbox state from the screenshot, auto-press Escape + go back

The model choice affects speed and cost, not accuracy on this task. Holo3 at $0.03/listing and 9-17s is the right execution engine — it just needs better instructions or training.


Hypotheses Remaining to Test

Highest Priority (execution engine is ready — need accuracy)

H1: Opus Visual Planner (Issue #40)

Hypothesis: Claude Opus browses the site first via xdotool, discovers layout visually, then generates a rich task suite with visual anchors, error handlers, and negative examples. Cheap models execute the loop.

Why we believe this: Prompt engineering has plateaued because models need visual ground truth from the actual site, not text descriptions. Opus has strong visual reasoning and can generate CUA-aware instructions that include spatial descriptions ("the LARGE rectangle", "small colored squares in footer = social links, NEVER click").

Cost to test: ~$2-5 (Opus API for planning) + existing GPU for execution
Expected impact: Reduce off-site navigation from 25-50% to <5%

H2: Cached Screenshot Replay for Fast Iteration (since built — see Screenshot Replay under What's Working)

Hypothesis: Record screenshots + action traces from one real run. Replay the screenshots locally to test prompt/parsing changes without GPU, network, or real websites.

Why we believe this: 80% of our iteration time is waiting for infrastructure, not thinking about the problem. If prompt changes could be tested in 30 seconds instead of 30 minutes, we'd move 60x faster on the prompt engineering axis.

Cost to test: ~1 day of engineering
Expected impact: Prompt iteration cycle from 30min → 30sec

H3: Broader Lead Definition (Email + Contact Form)

Hypothesis: If we expand "viable lead" beyond phone-only to include email addresses and contact form submissions, hit rate jumps from 5-10% to 30-50%.

Why we believe this: Most BoatTrader sellers have a "Contact Seller" button even when they don't show phone numbers. The CUA can fill out contact forms directly.

Cost to test: Modify extraction validation + add form-filling task to workflow
Expected impact: 5-10x more leads per run at similar GPU cost

H4: Distillation from Claude Trajectories → Holo3

Hypothesis: Run Claude (API) on BoatTrader to generate perfect trajectories. Fine-tune the executor model (Gemma4 or Holo3) on these trajectories for site-specific CUA behavior.

Why we believe this: Our Gemma4-CUA fine-tune (AgentNet data) achieved 100% on CRM in 3 steps. Site-specific distillation could achieve similar gains on BoatTrader. Now even more compelling: Holo3 is 20-50x cheaper than Gemma4/EvoCUA at $0.03/listing — even small accuracy gains from distillation would make it production-viable.

Cost to test: ~$10-20 Claude API (50 listings) + ~$15-20 fine-tuning on Modal
Expected impact: Fix gallery trap + structured output → 60-80% viable rate at $0.03/listing

Medium Priority

H5: Runtime Gallery-State Detection

Hypothesis: Detect fullscreen image gallery state (dark background, "X of Y" text, large centered image) from screenshots using a simple classifier or heuristic. Auto-press Escape + Alt+Left when detected.

Why we believe this: Every model clicks photos. A runtime check is faster than retraining. Could be as simple as "if >60% of pixels are dark and 'of' text detected near top center → gallery state."

Cost to test: ~2 hours of engineering
Expected impact: Eliminate gallery trap entirely (currently wastes 30-50% of steps)
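
A sketch of that heuristic — thresholds are the guesses above, untested:

```python
import numpy as np
from PIL import Image

def looks_like_gallery(screenshot_path: str) -> bool:
    # Fullscreen lightboxes are mostly near-black around one centered image.
    gray = np.asarray(Image.open(screenshot_path).convert("L"))
    dark_frac = (gray < 40).mean()  # fraction of near-black pixels
    # A fuller version would also OCR the top-center strip for "X of Y"
    # (e.g. "1 of 85") before deciding — omitted to stay dependency-light.
    return dark_frac > 0.6

# Recovery on detection: press Escape, then Alt+Left to back out.
```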

Lower Priority / Exploratory

H7: Multi-Site Expansion

Hypothesis: The architecture (xdotool env + parallel workers + workflow runner) is general enough to work on other boat listing sites (YachtWorld, Boats.com) with only task file changes.

Risk: Each site has different layouts, anti-bot measures, and data formats.

H8: SoM (Set-of-Marks) for Visual Grounding

Hypothesis: Overlaying numbered bounding boxes on screenshots helps models identify clickable elements more accurately, reducing off-site clicks.

Why we haven't tested: Adds DOM dependency (need element detection), which conflicts with zero-fingerprint xdotool approach. Would need a vision-only SoM (e.g., YOLO-based element detection on screenshots).

H9: Reward Model for Self-Evaluation

Hypothesis: Train a small model to evaluate "did this step make progress?" from before/after screenshots. Use as a runtime guardrail to detect and recover from off-site navigation.

Cost to test: Significant — need labeled data + training pipeline
Expected impact: Runtime safety net without CDP/DOM dependency


Decision Log

| Date | Decision | Outcome |
|---|---|---|
| Apr 8 | Start with Playwright for browser CUA | FAILED — Cloudflare blocks |
| Apr 9 | Switch to CDP (connect to real Chrome) | PARTIAL — works but leaks signals |
| Apr 10 | Switch to xdotool + Xvfb | SUCCESS — undetectable |
| Apr 10 | Use EvoCUA-8B for cheap extraction | FAILED — 225/500 parse failures |
| Apr 11 | Fine-tune Gemma4 on AgentNet | SUCCESS — 100% CRM in 3 steps |
| Apr 12 | Use Gemma4-E4B as text planner | FAILED — generic instructions |
| Apr 13 | Add CDP guardrails back into xdotool env | FAILED — timing races, fingerprint risk |
| Apr 14 | Pure prompt-based off-site avoidance | PARTIAL — 50-75% effective |
| Apr 15 | Parallel workers (1 page each) | SUCCESS — 5x speedup |
| Apr 15 | EvoCUA-32B for volume extraction | SUCCESS — $4.26/lead on good runs |
| Apr 16 | Gemma4-CUA budget=512 for accuracy | SUCCESS — best per-listing accuracy |
| Apr 17 | Fix 10 code bugs + fast-fail 404s | SUCCESS — reduced waste significantly |
| Apr 17 | Prompt engineering for off-site avoidance | FAILED — ceiling reached |
| Apr 17 | Holo3 via vLLM (2x A100) | BLOCKED — transformers<5 vs >=5.2 conflict |
| Apr 17 | Holo3 via H Company API | PARTIAL — rate limited, no tool_calls, format chaos |
| Apr 17 | Holo3 via llama.cpp GGUF (1x A100) | WORKING — boots 22s, $0.03/listing, 25% viable |
| Apr 17 | Holo3 with reasoning-budget=512 | FAILED — overthinks, worse than no budget |
| Apr 17 | Chrome --test-type flag | SUCCESS — removed warning bar coordinate shift |
| Apr 18 | Screenshot analysis of Holo3 runs | CONFIRMED — gallery trap is the bottleneck, not parsing |
| Apr 18 | Opus browse-enhanced plan + Holo3 | FAILED — 0/2 viable, pixel coords shift, visual descriptions not enough |

Key Numbers

| Metric | Value |
|---|---|
| Total commits | 200+ |
| Total GPU spend | ~$176+ |
| OSWorld best (OS domain) | 83.3% (20/24) |
| CRM best (Gemma4-CUA) | 100% in 3 steps |
| BoatTrader listings scanned | ~80 |
| Unique phone leads found | 3 |
| Best cost/lead | $4.26 (EvoCUA-32B) |
| Worst cost/lead | $29+ (Gemma4 parallel) |
| Phone number hit rate | ~5-10% |
| Off-site click waste | 25-50% of iterations |
| Iteration cycle time | 30-90 minutes |
| Iteration cost | $4-12 per run |