Performance benchmarks

Every governed call adds one network round-trip to OPA plus a JSON serialise/deserialise. The bundled benchmark script measures that overhead across the five evaluation paths a production agent actually hits, so you can size capacity against your own hardware before you ship.

What it measures

benchmarks/bench_gate.py exercises the gate end-to-end: build a SessionContext, build a ToolCallInput (or GovernanceEvent), call PolicyGate.evaluate_*, time the wall clock. There is no model in the loop and no tool execution — just the gate's critical-path latency.

| Scenario | What it exercises |
| --- | --- |
| `simple_allow` | `read_customer` lookup: the fast-path allow |
| `hard_deny` | `read_file /app/.env`: hard-blocked by `security.rego` |
| `hitl_trigger` | `approve_refund $350`: returns `allow=False, requires_hitl=True` |
| `agent_spawn` | `agent.spawn` event at `delegation_depth=0` |
| `agent_delegate` | `agent.delegate` event at `delegation_depth=1` |

The script reports p50, p95, p99, and max for each. Those are the numbers that matter — every governed tool call is one round-trip on the agent's critical path, so the long tail is what users feel.
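For reference, the four summary statistics can be computed from a list of raw latencies as in the sketch below. It assumes the nearest-rank percentile definition; the bundled script may use a different interpolation method, and the names `percentile` and `summarise` are illustrative rather than taken from the repo.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarise(samples_ms):
    """Return the same four statistics the benchmark reports."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }

# Synthetic data for illustration only: 1.0 ms, 2.0 ms, ..., 100.0 ms
latencies = [float(i) for i in range(1, 101)]
summary = summarise(latencies)
```

With only a few hundred samples, p99 is determined by a handful of calls, which is one reason to raise `--runs` when the tail matters.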

Run it

```bash
# Prerequisite: OPA running locally
docker compose up -d opa

# Default: 1000 runs, concurrency 5, 50-call warmup
python benchmarks/bench_gate.py

# Heavier sweep
python benchmarks/bench_gate.py --runs 2000 --concurrency 10
```

Flags:

| Flag | Default | Meaning |
| --- | --- | --- |
| `--runs` | `1000` | Measured iterations per scenario |
| `--concurrency` | `5` | Async semaphore size: the maximum number of in-flight calls |
| `--warmup` | `50` | Calls discarded before measurement starts |
| `--opa` | `http://localhost:8181` | OPA URL to benchmark against |
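To make the interaction between runs, concurrency, and warmup concrete, here is a minimal sketch of a harness with the same three knobs. It is not the script's actual code: `fake_evaluate` is a hypothetical stand-in that sleeps instead of calling OPA, and `run_scenario` is an illustrative name.

```python
import asyncio
import time

async def fake_evaluate():
    """Hypothetical stand-in for a gate call; sleeps instead of contacting OPA."""
    await asyncio.sleep(0.001)

async def run_scenario(call, runs=1000, concurrency=5, warmup=50):
    """Return per-call latencies in milliseconds for `runs` measured calls."""
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight calls

    async def one_call():
        async with semaphore:
            # The timer starts after the semaphore is acquired, so time spent
            # queueing for a slot is not counted (an assumption of this sketch).
            start = time.perf_counter()
            await call()
            return (time.perf_counter() - start) * 1000.0

    # Warmup calls run through the same path; their timings are discarded.
    await asyncio.gather(*(one_call() for _ in range(warmup)))
    return await asyncio.gather(*(one_call() for _ in range(runs)))

latencies_ms = asyncio.run(
    run_scenario(fake_evaluate, runs=200, concurrency=5, warmup=10)
)
```

Whether queueing time counts toward latency is a design choice: excluding it isolates the round-trip itself, while including it is closer to what a caller waiting behind other calls would experience.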

The script's progress display colour-codes each percentile against a practical threshold: green below 10 ms, yellow from 10 ms to below 30 ms, red at 30 ms or above. If your p95 sits in the green band you're fine for any synchronous agent loop; yellow is acceptable but eats into the budget on chatty multi-tool turns; red means OPA is overloaded or the network hop is the problem.
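The banding rule is simple enough to reproduce when post-processing your own measurements. A minimal sketch, using the thresholds stated above (the function name `band` is illustrative, not part of the script):

```python
def band(latency_ms):
    """Map a latency in milliseconds to the colour band described above."""
    if latency_ms < 10:
        return "green"
    if latency_ms < 30:
        return "yellow"
    return "red"

# Example: classify a set of percentile values (illustrative numbers)
bands = {name: band(value) for name, value in {"p50": 4.2, "p95": 18.0, "p99": 41.5}.items()}
```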

What "good" looks like

The headline numbers in the README — single-digit-ms p50, low-tens-ms p95 against a local OPA — come from this script on developer hardware. Re-measure on your own infrastructure before you trust them as a capacity number; the variables that move the result are:

OPA placement. Local Docker is the fastest option, followed by a sidecar, then a shared cluster service. Each additional hop adds round-trip latency that dominates the gate's own work.
Policy bundle size. simple_allow against the OSS modules (~100 LOC of Rego) is the lower bound. A bundle of thousands of rules will be measurably slower at p95.
OPA decision-log mode. Console-only logging is fastest; pushing to a remote service adds backpressure.
Concurrency. --concurrency 1 measures the uncontended single-caller path; --concurrency 50 is closer to a busy multi-agent system.
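One way to see how concurrency moves the result is to run the script at several levels in sequence. The sketch below only builds the command lines by default, using the flags documented on this page; the helper names `build_command` and `sweep` are illustrative, and passing `run=True` assumes you are in the repo root with OPA running.

```python
import subprocess
import sys

SCRIPT = "benchmarks/bench_gate.py"

def build_command(concurrency, runs=1000, opa_url=None):
    """Build one bench_gate.py invocation from the documented flags."""
    cmd = [sys.executable, SCRIPT, "--runs", str(runs), "--concurrency", str(concurrency)]
    if opa_url is not None:
        cmd += ["--opa", opa_url]
    return cmd

def sweep(levels, run=False):
    """Return the commands for each concurrency level; execute them if run=True."""
    commands = [build_command(level) for level in levels]
    if run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # raises if the script exits non-zero
    return commands

commands = sweep([1, 5, 50])
```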

Use the numbers

Two practical questions the bench script answers:

1. Can my agent's latency budget absorb governance? A typical agent turn is dominated by the model (hundreds of milliseconds to seconds). The gate's p95 should be a small fraction of that — if it isn't, profile where OPA is sitting.

2. How many requests per second can one OPA instance handle? Run with --concurrency matching your expected peak in-flight tool calls. If p99 starts climbing as you raise --concurrency, you've found the saturation point — provision OPA accordingly or move it in-process via Regorus.
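Both questions reduce to a little arithmetic once you have the benchmark's numbers. The sketch below is illustrative only: the function names are hypothetical, the input values are made up, the throughput estimate relies on Little's law and assumes every concurrency slot stays busy, and the "p99 more than doubled" rule for spotting saturation is an arbitrary heuristic rather than anything the script implements.

```python
def governance_share(gate_p95_ms, tool_calls_per_turn, model_ms_per_turn):
    """Fraction of a turn's latency spent in the gate, at p95 per call."""
    gate_ms = gate_p95_ms * tool_calls_per_turn
    return gate_ms / (gate_ms + model_ms_per_turn)

def estimated_throughput(concurrency, mean_latency_ms):
    """Little's law: calls per second = in-flight calls / mean latency in seconds."""
    return concurrency * 1000.0 / mean_latency_ms

def saturation_point(p99_by_concurrency, factor=2.0):
    """First concurrency level whose p99 exceeds factor x the lowest level's p99.

    Returns None if no level crosses the threshold.
    """
    levels = sorted(p99_by_concurrency)
    baseline = p99_by_concurrency[levels[0]]
    for level in levels[1:]:
        if p99_by_concurrency[level] > factor * baseline:
            return level
    return None

# Question 1: four tool calls at a 12 ms p95, alongside 1200 ms of model time
share = governance_share(gate_p95_ms=12.0, tool_calls_per_turn=4, model_ms_per_turn=1200.0)

# Question 2: made-up p99 values (ms) collected at increasing --concurrency
observed_p99 = {1: 8.0, 5: 9.0, 10: 11.0, 25: 15.0, 50: 40.0}
knee = saturation_point(observed_p99)
rate = estimated_throughput(concurrency=10, mean_latency_ms=5.0)
```

In this made-up example the gate accounts for under 4% of the turn, and p99 first exceeds twice its baseline at concurrency 50.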

Memory store benchmark

The repo also ships benchmarks/bench_memory_session.py, which measures MemoryStore write/read throughput against a real SQLite file (not in-memory — see Memory with provenance for why). Run it the same way:

bash

python benchmarks/bench_memory_session.py

This one matters less for the agent's per-turn latency (memory access is opportunistic, not on every tool call), but it's the right number to look at if you're caching tool outputs aggressively or storing large per-session context blobs.
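For a rough sense of what a file-backed SQLite measurement involves, the sketch below times bulk writes and a full read against a temporary database file using only the standard library. It does not use MemoryStore: the `kv` table, the 100-byte values, and the function name are assumptions made for illustration, so treat its numbers as a baseline for the disk rather than for the store.

```python
import os
import sqlite3
import tempfile
import time

def sqlite_write_read_rate(rows=1000):
    """Return (rows_read, writes_per_second, reads_per_second) for a file-backed DB."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        try:
            conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")

            start = time.perf_counter()
            with conn:  # commit all inserts as a single transaction
                conn.executemany(
                    "INSERT INTO kv VALUES (?, ?)",
                    ((f"key-{i}", "x" * 100) for i in range(rows)),
                )
            write_s = time.perf_counter() - start

            start = time.perf_counter()
            count = sum(1 for _ in conn.execute("SELECT k, v FROM kv"))
            read_s = time.perf_counter() - start
        finally:
            conn.close()  # close before the temporary directory is removed
    return count, rows / write_s, count / read_s

rows_read, writes_per_s, reads_per_s = sqlite_write_read_rate(rows=1000)
```

Committing each insert in its own transaction would be far slower than the single-transaction batch shown here, so match the commit pattern to how your agent actually writes.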

Related

Architecture: where the gate sits in the request path
OpenTelemetry: turn on tracing to see per-call breakdown in production
Installation: in-process Regorus vs. networked OPA
