Performance benchmarks
Every governed call adds one network round-trip to OPA plus a JSON serialise/deserialise. The bundled benchmark script measures that overhead across the five evaluation paths a production agent actually hits, so you can size capacity against your own hardware before you ship.
What it measures
benchmarks/bench_gate.py exercises the gate end-to-end: build a SessionContext, build a ToolCallInput (or GovernanceEvent), call PolicyGate.evaluate_*, time the wall clock. There is no model in the loop and no tool execution — just the gate's critical-path latency.
| Scenario | What it exercises |
|---|---|
| simple_allow | read_customer lookup: the fast-path allow |
| hard_deny | read_file /app/.env, hard-blocked by security.rego |
| hitl_trigger | approve_refund $350: allow=False, requires_hitl=True |
| agent_spawn | agent.spawn event at delegation_depth=0 |
| agent_delegate | agent.delegate event at delegation_depth=1 |
The script reports p50, p95, p99, and max for each. Those are the numbers that matter — every governed tool call is one round-trip on the agent's critical path, so the long tail is what users feel.
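To make the percentile columns concrete, here is a minimal sketch of how p50, p95, p99, and max can be derived from a list of per-call latencies. It uses the nearest-rank method, which is an assumption: the bundled script may compute percentiles differently. The names `percentile` and `summarise` and the sample values are illustrative only, not taken from the repo.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(samples_ms):
    """Return the four figures the benchmark reports for one scenario."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# Illustrative input (not a measurement): 97 fast calls plus three slow outliers.
latencies = [2.0] * 97 + [25.0, 40.0, 120.0]
print(summarise(latencies))
```

With this input the median and p95 both stay at the fast-path value while p99 and max expose the outliers, which is why the tail columns deserve more attention than the median.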
Run it
# Prerequisite: OPA running locally
docker compose up -d opa
# Default — 1000 runs, concurrency 5, 50-call warmup
python benchmarks/bench_gate.py
# Heavier sweep
python benchmarks/bench_gate.py --runs 2000 --concurrency 10
Flags:
| Flag | Default | Meaning |
|---|---|---|
| --runs | 1000 | Iterations per scenario |
| --concurrency | 5 | Async semaphore size: the maximum number of in-flight calls |
| --warmup | 50 | Discarded calls before measurement starts |
| --opa | http://localhost:8181 | Override the OPA URL |
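The way the runs, concurrency, and warmup settings interact can be sketched as a small asyncio harness. This is not the code in bench_gate.py: `fake_gate_call` is a hypothetical stand-in that sleeps instead of calling OPA, and timing the call inside the semaphore (so queueing time is excluded) is an assumption about how such a harness might be built.

```python
import asyncio
import time


async def fake_gate_call():
    """Hypothetical stand-in for one gate round-trip; sleeps instead of calling OPA."""
    await asyncio.sleep(0.001)


async def run_scenario(call, runs=1000, concurrency=5, warmup=50):
    # Warmup calls run first and their timings are thrown away.
    for _ in range(warmup):
        await call()

    limiter = asyncio.Semaphore(concurrency)  # caps the number of in-flight calls
    samples_ms = []

    async def timed():
        async with limiter:
            start = time.perf_counter()
            await call()
            samples_ms.append((time.perf_counter() - start) * 1000)

    # Schedule every measured call; the semaphore decides how many run at once.
    await asyncio.gather(*(timed() for _ in range(runs)))
    return samples_ms


samples = asyncio.run(run_scenario(fake_gate_call, runs=200, concurrency=5, warmup=10))
print(len(samples), "measured calls")
```

Raising `concurrency` in a harness like this does not change the number of samples, only how many calls compete for the backend at once, which is what makes it useful for finding a saturation point.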
The script's progress display colour-codes each percentile against a practical threshold: green < 10 ms, yellow < 30 ms, red ≥ 30 ms. If your p95 sits in the green band you're fine for any synchronous agent loop; yellow is the budget for chatty multi-tool turns; red means OPA is overloaded or the network hop is the problem.
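The banding rule described above is simple enough to restate as code. The thresholds come from the text; the function name `band` is hypothetical and the script's own implementation may differ.

```python
def band(latency_ms):
    """Map a latency in milliseconds to the colour band described in the text."""
    if latency_ms < 10:
        return "green"
    if latency_ms < 30:
        return "yellow"
    return "red"


for value in (4.2, 10.0, 29.9, 30.0):
    print(value, band(value))
```

Note that the boundaries are half-open: exactly 10 ms is already yellow and exactly 30 ms is already red.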
What "good" looks like
The headline numbers in the README — single-digit-ms p50, low-tens-ms p95 against a local OPA — come from this script on developer hardware. Re-measure on your own infrastructure before you trust them as a capacity number; the variables that move the result are:
- OPA placement. From fastest to slowest: local Docker, sidecar, shared cluster service. Each extra network hop adds round-trip latency that dominates the gate's own work.
- Policy bundle size. simple_allow against the OSS modules (~100 LOC of Rego) is the lower bound. A bundle of thousands of rules will be measurably slower at p95.
- OPA decision-log mode. Console-only logging is fastest; pushing to a remote service adds backpressure.
- Concurrency. --concurrency 1 measures the lonely path; --concurrency 50 is closer to a busy multi-agent system.
Use the numbers
Two practical questions the bench script answers:
1. Can my agent's latency budget absorb governance? A typical agent turn is dominated by the model (hundreds of milliseconds to seconds). The gate's p95 should be a small fraction of that — if it isn't, profile where OPA is sitting.
2. How many requests per second can one OPA instance handle? Run with --concurrency matching your expected peak in-flight tool calls. If p99 starts climbing as you raise --concurrency, you've found the saturation point — provision OPA accordingly or move it in-process via Regorus.
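Both questions reduce to small calculations once you have the benchmark output. The sketch below assumes you have collected one p99 figure per concurrency level by running the script several times; the helper names, the 2x threshold for "climbing", and every number in the example are placeholders chosen for illustration, not measurements.

```python
def gate_share(gate_p95_ms, calls_per_turn, turn_budget_ms):
    """Fraction of one agent turn's latency budget spent waiting on the gate."""
    return gate_p95_ms * calls_per_turn / turn_budget_ms


def saturation_point(p99_by_concurrency, factor=2.0):
    """First concurrency level whose p99 exceeds `factor` times the p99 at the lowest level.

    Returns None if no level in the sweep crosses that threshold.
    """
    levels = sorted(p99_by_concurrency)
    baseline = p99_by_concurrency[levels[0]]
    for level in levels[1:]:
        if p99_by_concurrency[level] > factor * baseline:
            return level
    return None


# Placeholder sweep: concurrency level -> p99 in ms (made-up values).
sweep = {1: 6.0, 5: 7.0, 10: 8.5, 25: 19.0, 50: 55.0}
print(saturation_point(sweep))

# Placeholder budget: 12 ms gate p95, 4 governed calls, 2000 ms turn.
print(gate_share(12.0, 4, 2000.0))
```

The 2x factor is arbitrary; pick whatever multiple of the uncontended p99 your own latency budget can tolerate.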
Memory store benchmark
The repo also ships benchmarks/bench_memory_session.py, which measures MemoryStore write/read throughput against a real SQLite file (not in-memory — see Memory with provenance for why). Run it the same way:
python benchmarks/bench_memory_session.py
This one matters less for the agent's per-turn latency (memory access is opportunistic, not on every tool call), but it's the right number to look at if you're caching tool outputs aggressively or storing large per-session context blobs.
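For a rough sense of what a file-backed throughput measurement involves, the sketch below times writes and reads against an on-disk SQLite database using only the standard library. It does not use MemoryStore, whose API is not shown here; the table layout, the 256-byte payload, and the function name `sqlite_throughput` are all assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
import time


def sqlite_throughput(rows=1000):
    """Write then read `rows` small records against a real SQLite file on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        conn.execute("CREATE TABLE memory (key TEXT PRIMARY KEY, value TEXT)")

        start = time.perf_counter()
        with conn:  # commit all inserts as a single transaction
            conn.executemany(
                "INSERT INTO memory VALUES (?, ?)",
                ((f"k{i}", "x" * 256) for i in range(rows)),
            )
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        found = 0
        for i in range(rows):
            row = conn.execute(
                "SELECT value FROM memory WHERE key = ?", (f"k{i}",)
            ).fetchone()
            found += row is not None
        read_s = time.perf_counter() - start

        conn.close()  # close before the temporary directory is removed

    return {
        "rows_read": found,
        "writes_per_s": rows / write_s,
        "reads_per_s": rows / read_s,
    }


result = sqlite_throughput()
print(result)
```

Batching all writes into one transaction flatters the write figure; committing per row would be a harsher and, for some workloads, more realistic test.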
Related
- Architecture — where the gate sits in the request path
- OpenTelemetry — turn on tracing to see per-call breakdown in production
- Installation — in-process Regorus vs. networked OPA