Rate limits
Two dimensions, both per-tenant, both in-process per replica.
| Dimension | Source | Default | On exceed |
|---|---|---|---|
| Concurrent runs | tenant.max_concurrent_runs | 5 | 429 with Retry-After: 5 |
| Rate (token bucket) | tenant.rate_limit_per_minute | 30 | 429 with Retry-After: <s-until-token> |
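Both dimensions reject with 429 and a Retry-After header, so a client can back off for exactly as long as the server suggests. The sketch below shows one way to do that. It is illustrative only: `call_with_retry`, the `send` callable, and the `(status, headers, body)` response shape are hypothetical names chosen for this example, not part of any Mantis client.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape used only in this sketch: (status, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_wait_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns a non-429 status or attempts run out."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers, _body = response
        if status != 429:
            break
        # Assumes the header value is a number of seconds, as in the table above.
        # A real HTTP client should look the header up case-insensitively.
        try:
            wait_s = float(headers.get("Retry-After", default_wait_s))
        except ValueError:
            wait_s = default_wait_s
        sleep(max(wait_s, 0.0))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the helper testable without real delays. If every attempt is rejected, the last 429 response is returned so the caller can decide what to do next.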
How the rate dimension works
A standard token bucket: each tenant has a bucket with rate_limit_per_minute capacity, refilled at rate_limit_per_minute / 60 tokens/sec. Every POST /v1/predict (run mode only — polling actions don't consume tokens) takes one token. When the bucket is empty, the request gets 429 with Retry-After set to the time until the next token will be available.
Set rate_limit_per_minute: 0 in the tenant config to disable the rate limit entirely (useful for trusted internal tenants).
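The token bucket described above can be sketched in a few lines. This is a minimal illustration, not the code in src/mantis_agent/rate_limit.py: the `TokenBucket` class, its `try_acquire` method, and the choice to start with a full bucket are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative per-tenant token bucket (hypothetical class name)."""

    def __init__(
        self,
        rate_limit_per_minute: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = float(rate_limit_per_minute)
        self.refill_per_s = rate_limit_per_minute / 60.0
        self.tokens = self.capacity  # assumption: the bucket starts full
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> float:
        """Return 0.0 if a token was taken, else seconds until the next token."""
        if self.capacity == 0:
            return 0.0  # rate_limit_per_minute: 0 disables the limit
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0.0
        # Time needed to accumulate the missing fraction of one token.
        return (1.0 - self.tokens) / self.refill_per_s
```

With the default of 30 per minute, the bucket holds 30 tokens and refills at 0.5 tokens/sec, so a tenant that drains it in a burst waits 2 seconds for the next token. A non-zero return value is what would be reported in Retry-After.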
How the concurrency dimension works
A simple counter per tenant. Each accepted run increments it; each finished/failed/cancelled run decrements it. When the counter hits max_concurrent_runs, new requests get 429. The Prometheus gauge mantis_concurrent_runs{tenant_id=...} exposes the live count.
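A per-tenant counter of this kind might look like the sketch below. The `ConcurrencyLimiter` class and its method names are hypothetical and chosen for illustration, and the lock is an assumption about how a multi-threaded server would keep the check-and-increment atomic.

```python
import threading
from collections import defaultdict
from typing import DefaultDict


class ConcurrencyLimiter:
    """Illustrative in-process counter of in-flight runs per tenant."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running: DefaultDict[str, int] = defaultdict(int)

    def try_start(self, tenant_id: str, max_concurrent_runs: int) -> bool:
        """Increment and return True, or return False if the cap is reached."""
        with self._lock:
            if self._running[tenant_id] >= max_concurrent_runs:
                return False  # the caller would answer 429 here
            self._running[tenant_id] += 1
            return True

    def finish(self, tenant_id: str) -> None:
        """Call once when a run finishes, fails, or is cancelled."""
        with self._lock:
            if self._running[tenant_id] > 0:
                self._running[tenant_id] -= 1

    def live_count(self, tenant_id: str) -> int:
        """Value a gauge such as mantis_concurrent_runs would report."""
        with self._lock:
            return self._running[tenant_id]
```

The important property is that every accepted run is paired with exactly one `finish` call, including on failure and cancellation paths. A missed decrement would permanently shrink the tenant's usable capacity.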
Tuning
| Symptom | Fix |
|---|---|
| Tenant reports frequent 429s on bursts | Raise rate_limit_per_minute |
| Tenant reports 429s on long-running parallel jobs | Raise max_concurrent_runs |
| GPU keeps OOMing because too many runs execute concurrently | Both limits are per-tenant, so reduce the total load by lowering max_concurrent_runs for the busiest tenants |
| Want strict cluster-wide caps across N replicas | Today: run replicas: 1 and accept the SPOF. Future: Redis-backed limiter (Tier 2.5) |
Per-replica vs cluster-wide
The current limiter is in-process per replica. With N replicas, the effective per-tenant cap is roughly N × configured_cap because each replica tracks its own counter.
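As a quick worked example of that multiplication, the helper below computes the worst-case cluster-wide limits for a tenant. The function name `effective_caps` is hypothetical, and the result is an upper bound that assumes requests are spread across all replicas.

```python
from typing import Dict


def effective_caps(
    replicas: int,
    max_concurrent_runs: int = 5,
    rate_limit_per_minute: int = 30,
) -> Dict[str, int]:
    """Upper bound on per-tenant limits when each replica counts separately."""
    return {
        "max_concurrent_runs": replicas * max_concurrent_runs,
        "rate_limit_per_minute": replicas * rate_limit_per_minute,
    }


# Three replicas with the default tenant settings.
print(effective_caps(3))
```

With three replicas and the defaults, a tenant could in the worst case reach 15 concurrent runs and 90 requests per minute across the cluster.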
For most workloads this is fine — Mantis runs are detached and stick to one replica, so traffic spreads naturally. For strict cluster-wide enforcement (e.g., regulatory caps), either:
- Run a single replica (already the default for autoscale-aware deployments).
- Wait for the planned Tier 2.5 Redis-backed limiter.
- Roll your own: src/mantis_agent/rate_limit.py is small and the TenantRateLimiter interface is intentionally swap-friendly.
Observability
mantis_predict_requests_total{tenant_id="acme", mode="run", outcome="rate_limited"} ← rate-limit hits
mantis_rate_limit_rejections_total{tenant_id="acme", kind="rate"} ← bucket empty
mantis_rate_limit_rejections_total{tenant_id="acme", kind="concurrent"} ← concurrency cap
mantis_concurrent_runs{tenant_id="acme"} ← live in-flight
To catch tenants that chronically hit their limits, alert when rate(mantis_rate_limit_rejections_total[5m]) stays above 0.1 rejections per second for a given tenant.
See also
- Tenant keys: where to set rate_limit_per_minute / max_concurrent_runs
- Metrics: full label set and scrape setup