
Rate limits

Two dimensions, both per-tenant, both in-process per replica.

| Dimension           | Source                       | Default | On exceed                             |
|---------------------|------------------------------|---------|---------------------------------------|
| Concurrent runs     | tenant.max_concurrent_runs   | 5       | 429 with Retry-After: 5               |
| Rate (token bucket) | tenant.rate_limit_per_minute | 30      | 429 with Retry-After: <s-until-token> |
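
Both dimensions surface to clients the same way (429 plus a Retry-After header), so a caller only needs one retry path. A minimal client-side sketch, assuming the requests library and a bearer-token Authorization header (the actual auth scheme isn't covered here):

```python
import time
import requests

def predict_with_retry(url: str, payload: dict, token: str, max_attempts: int = 5):
    """POST /v1/predict, sleeping for Retry-After seconds on 429."""
    for _ in range(max_attempts):
        resp = requests.post(
            url,
            json=payload,
            headers={"Authorization": f"Bearer {token}"},  # assumed auth scheme
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Both limit dimensions set Retry-After (seconds) on the 429.
        time.sleep(float(resp.headers.get("Retry-After", "1")))
    raise RuntimeError("still rate-limited after max_attempts")
```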

How the rate dimension works

A standard token bucket: each tenant has a bucket with rate_limit_per_minute capacity, refilled at rate_limit_per_minute / 60 tokens/sec. Every POST /v1/predict (run mode only — polling actions don't consume tokens) takes one token. When the bucket is empty, the request gets 429 with Retry-After set to the time until the next token will be available.

Set rate_limit_per_minute: 0 in the tenant config to disable the rate limit entirely (useful for trusted internal tenants).
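
Condensed, the bucket logic looks roughly like this, including the 0-disables behavior. A sketch only — names are illustrative; the real implementation lives in src/mantis_agent/rate_limit.py:

```python
import time

class TokenBucket:
    """Per-tenant token bucket; rate_per_minute == 0 disables the limit."""

    def __init__(self, rate_per_minute: int):
        self.capacity = rate_per_minute
        self.refill_per_sec = rate_per_minute / 60
        self.tokens = float(rate_per_minute)   # start full
        self.last = time.monotonic()

    def try_acquire(self) -> tuple[bool, float]:
        """Returns (allowed, retry_after_seconds)."""
        if self.capacity == 0:                 # 0 means "no rate limit"
            return True, 0.0
        now = time.monotonic()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        # Retry-After: time until one full token has accrued.
        return False, (1 - self.tokens) / self.refill_per_sec
```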

How the concurrency dimension works

A simple counter per tenant. Each accepted run increments it; each finished/failed/cancelled run decrements it. When the counter hits max_concurrent_runs, new requests get 429. The Prometheus gauge mantis_concurrent_runs{tenant_id=...} exposes the live count.
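
In sketch form, assuming the prometheus_client library (class and method names are illustrative, not the actual Mantis internals):

```python
import threading
from collections import defaultdict
from prometheus_client import Gauge

CONCURRENT = Gauge("mantis_concurrent_runs", "Live in-flight runs", ["tenant_id"])

class ConcurrencyLimiter:
    def __init__(self):
        self._lock = threading.Lock()
        self._counts: dict[str, int] = defaultdict(int)

    def try_start(self, tenant_id: str, max_concurrent: int) -> bool:
        with self._lock:
            if self._counts[tenant_id] >= max_concurrent:
                return False               # caller responds 429
            self._counts[tenant_id] += 1
            CONCURRENT.labels(tenant_id=tenant_id).inc()
            return True

    def finish(self, tenant_id: str) -> None:
        """Call once per run, whether it finished, failed, or was cancelled."""
        with self._lock:
            self._counts[tenant_id] -= 1
            CONCURRENT.labels(tenant_id=tenant_id).dec()
```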

Tuning

| Symptom                                              | Fix                                                                                          |
|------------------------------------------------------|----------------------------------------------------------------------------------------------|
| Tenant reports frequent 429s on bursts               | Raise rate_limit_per_minute                                                                  |
| Tenant reports 429s on long-running parallel jobs    | Raise max_concurrent_runs                                                                    |
| GPU keeps OOMing from too many concurrent runs       | Lower max_concurrent_runs for the busy tenants (caps are per-tenant; there is no global knob) |
| Want strict cluster-wide caps across N replicas      | Today: run replicas: 1 and accept the SPOF. Future: Redis-backed limiter (Tier 2.5)          |

Per-replica vs cluster-wide

The current limiter is in-process per replica. With N replicas, the effective per-tenant cap is roughly N × configured_cap because each replica tracks its own counter.

For most workloads this is fine — Mantis runs are detached and stick to one replica, so traffic spreads naturally. For strict cluster-wide enforcement (e.g., regulatory caps), either:

  1. Run a single replica (already the default for autoscale-aware deployments).
  2. Wait for the planned Tier 2.5 Redis-backed limiter.
  3. Roll your own — src/mantis_agent/rate_limit.py is small and the TenantRateLimiter interface is intentionally swap-friendly (a rough sketch follows below).
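
For option 3, a hypothetical Redis-backed rate dimension might look like the following, assuming the redis-py client and a fixed-window counter rather than a token bucket (coarser, but atomic without Lua). The method name is a guess — adapt it to whatever TenantRateLimiter actually requires:

```python
import time
import redis  # assumed dependency, not currently shipped with Mantis

class RedisRateLimiter:
    """Cluster-wide fixed-window limiter: one counter per tenant per minute."""

    def __init__(self, client: redis.Redis):
        self.r = client

    def try_acquire(self, tenant_id: str, rate_per_minute: int) -> bool:
        if rate_per_minute == 0:           # mirror the in-process "0 disables" rule
            return True
        window = int(time.time() // 60)
        key = f"mantis:rl:{tenant_id}:{window}"
        count = self.r.incr(key)           # atomic across all replicas
        if count == 1:
            self.r.expire(key, 120)        # garbage-collect old windows
        return count <= rate_per_minute
```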

Observability

```
mantis_predict_requests_total{tenant_id="acme", mode="run", outcome="rate_limited"}  ← rate-limit hits
mantis_rate_limit_rejections_total{tenant_id="acme", kind="rate"}                    ← bucket empty
mantis_rate_limit_rejections_total{tenant_id="acme", kind="concurrent"}              ← concurrency cap
mantis_concurrent_runs{tenant_id="acme"}                                             ← live in-flight
```

Alert when rate(mantis_rate_limit_rejections_total[5m]) exceeds 0.1/s for any tenant to catch chronic limit hitting.
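
Expressed as a Prometheus alerting rule, that might look like this (group/alert names and the 10m hold are illustrative; the expression fires per tenant_id/kind series thanks to the labels):

```yaml
groups:
  - name: mantis-rate-limits
    rules:
      - alert: TenantChronicallyRateLimited
        expr: rate(mantis_rate_limit_rejections_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Tenant {{ $labels.tenant_id }} is hitting the {{ $labels.kind }} limit"
```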

See also

  • Tenant keys — where to set rate_limit_per_minute / max_concurrent_runs
  • Metrics — full label set + scrape setup