Performance benchmarks

Every governed call adds one network round-trip to OPA plus a JSON serialise/deserialise. The bundled benchmark script measures that overhead across the five evaluation paths a production agent actually hits, so you can size capacity against your own hardware before you ship.

What it measures

benchmarks/bench_gate.py exercises the gate end-to-end: build a SessionContext, build a ToolCallInput (or GovernanceEvent), call PolicyGate.evaluate_*, time the wall clock. There is no model in the loop and no tool execution — just the gate's critical-path latency.

| Scenario | What it exercises |
| --- | --- |
| simple_allow | read_customer lookup, the fast-path allow |
| hard_deny | read_file /app/.env, hard-blocked by security.rego |
| hitl_trigger | approve_refund $350: allow=False, requires_hitl=True |
| agent_spawn | agent.spawn event at delegation_depth=0 |
| agent_delegate | agent.delegate event at delegation_depth=1 |

The script reports p50, p95, p99, and max for each. Those are the numbers that matter — every governed tool call is one round-trip on the agent's critical path, so the long tail is what users feel.
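To make the percentile summary concrete, here is a minimal sketch of how p50, p95, p99, and max can be derived from a list of wall-clock latencies. It is an illustration, not the code in bench_gate.py: the `summarize` helper and the sample data are assumptions, and the script may compute its percentiles with a different method.

```python
import statistics


def summarize(latencies_ms):
    """Return p50, p95, p99 and max for latencies given in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


# Hypothetical sample: 98 fast calls and two slow outliers.
samples = [2.0] * 98 + [40.0, 120.0]
stats = summarize(samples)
print(stats)
```

With this sample the two outliers leave p50 and p95 untouched and only show up in p99 and max, which is why the tail columns deserve the attention.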

Run it

```bash
# Prerequisite: OPA running locally
docker compose up -d opa

# Default: 1000 runs, concurrency 5, 50-call warmup
python benchmarks/bench_gate.py

# Heavier sweep
python benchmarks/bench_gate.py --runs 2000 --concurrency 10
```

Flags:

| Flag | Default | Meaning |
| --- | --- | --- |
| --runs | 1000 | Iterations per scenario |
| --concurrency | 5 | Async semaphore: how many calls are in flight at once |
| --warmup | 50 | Discarded calls before measurement starts |
| --opa | http://localhost:8181 | Override the OPA URL |
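The way --runs, --concurrency, and --warmup interact can be sketched with a small asyncio harness. This is an assumption-laden illustration rather than the script's source: `fake_gate_call` is a hypothetical stand-in for a PolicyGate.evaluate_* call, and `run_scenario` is a made-up name.

```python
import asyncio
import time


async def fake_gate_call():
    """Hypothetical stand-in for one gate evaluation (no OPA involved)."""
    await asyncio.sleep(0.001)


async def run_scenario(call, runs=1000, concurrency=5, warmup=50):
    # Warmup calls run first; their timings are never recorded.
    for _ in range(warmup):
        await call()

    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight calls
    latencies_ms = []
    in_flight = 0
    peak_in_flight = 0

    async def timed_call():
        nonlocal in_flight, peak_in_flight
        async with semaphore:
            in_flight += 1
            peak_in_flight = max(peak_in_flight, in_flight)
            start = time.perf_counter()
            await call()
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
            in_flight -= 1

    await asyncio.gather(*(timed_call() for _ in range(runs)))
    return latencies_ms, peak_in_flight


latencies, peak = asyncio.run(
    run_scenario(fake_gate_call, runs=40, concurrency=5, warmup=5)
)
print(len(latencies), "measured calls, peak in flight:", peak)
```

The timer starts only after the semaphore is acquired, so time spent waiting for a slot does not count toward a call's latency in this sketch.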

The script's progress display colour-codes each percentile against a practical threshold: green < 10 ms, yellow < 30 ms, red ≥ 30 ms. If your p95 sits in the green band you're fine for any synchronous agent loop; yellow is the budget for chatty multi-tool turns; red means OPA is overloaded or the network hop is the problem.
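The banding rule is simple enough to reuse in your own dashboards or CI checks. A minimal sketch, assuming the thresholds quoted above; the `band` function name is hypothetical and not part of the benchmark script:

```python
def band(latency_ms):
    """Map a latency in milliseconds to the colour bands described above."""
    if latency_ms < 10:
        return "green"
    if latency_ms < 30:
        return "yellow"
    return "red"


for value in (4.2, 10.0, 29.9, 30.0):
    print(value, band(value))
```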

What "good" looks like

The headline numbers in the README — single-digit-ms p50, low-tens-ms p95 against a local OPA — come from this script on developer hardware. Re-measure on your own infrastructure before you trust them as a capacity number; the variables that move the result are:

  • OPA placement. From fastest to slowest: local Docker, sidecar, shared cluster service. Each extra network hop adds round-trip latency that dominates the gate's own work.
  • Policy bundle size. simple_allow against the OSS modules (~100 LOC of Rego) is the lower bound. A bundle of thousands of rules will be measurably slower at p95.
  • OPA decision-log mode. Console-only logging is fastest; pushing to a remote service adds backpressure.
  • Concurrency. --concurrency 1 measures the lonely path; --concurrency 50 is closer to a busy multi-agent system.

Use the numbers

Two practical questions the bench script answers:

1. Can my agent's latency budget absorb governance? A typical agent turn is dominated by the model (hundreds of milliseconds to seconds). The gate's p95 should be a small fraction of that — if it isn't, profile where OPA is sitting.

2. How many requests per second can one OPA instance handle? Run with --concurrency matching your expected peak in-flight tool calls. If p99 starts climbing as you raise --concurrency, you've found the saturation point — provision OPA accordingly or move it in-process via Regorus.
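Both questions can be turned into quick arithmetic once you have a concurrency sweep. The sketch below makes two assumptions: throughput is estimated with Little's law (in-flight calls divided by mean latency, which holds when the concurrency slots stay full), and "saturation" is defined as the first level where p99 exceeds twice the baseline. The helper names, the factor of two, and the numbers in `sweep` are all illustrative, not measured results.

```python
def estimated_rps(concurrency, mean_latency_ms):
    """Little's law: throughput = in-flight requests / mean time per request."""
    return concurrency / (mean_latency_ms / 1000.0)


def saturation_point(p99_by_concurrency, growth_factor=2.0):
    """Return the first concurrency level whose p99 exceeds
    growth_factor times the p99 at the lowest level, or None."""
    levels = sorted(p99_by_concurrency)
    baseline = p99_by_concurrency[levels[0]]
    for level in levels[1:]:
        if p99_by_concurrency[level] > growth_factor * baseline:
            return level
    return None


# Made-up p99 values (ms) per --concurrency setting, for illustration only.
sweep = {1: 4.0, 5: 4.5, 10: 5.0, 25: 7.0, 50: 19.0}
knee = saturation_point(sweep)
rps = estimated_rps(concurrency=10, mean_latency_ms=5.0)
print("saturation at concurrency", knee, "| estimated rps:", round(rps))
```

Feed it the p99 values from your own runs at each --concurrency setting; the threshold factor is a judgment call and worth tuning to your latency budget.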

Memory store benchmark

The repo also ships benchmarks/bench_memory_session.py, which measures MemoryStore write/read throughput against a real SQLite file (not in-memory — see Memory with provenance for why). Run it the same way:

```bash
python benchmarks/bench_memory_session.py
```

This one matters less for the agent's per-turn latency (memory access is opportunistic, not on every tool call), but it's the right number to look at if you're caching tool outputs aggressively or storing large per-session context blobs.
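For a rough sense of what a file-backed SQLite benchmark looks like, here is a minimal sketch using only the standard library. It does not use MemoryStore or reproduce bench_memory_session.py; the table layout, row count, and blob size are assumptions chosen for illustration.

```python
import os
import sqlite3
import tempfile
import time


def bench_sqlite(rows=1000, blob_size=256):
    """Time batched writes and point reads against an on-disk SQLite file."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        try:
            conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")
            payload = b"x" * blob_size

            start = time.perf_counter()
            with conn:  # commit all inserts in a single transaction
                conn.executemany(
                    "INSERT INTO kv (k, v) VALUES (?, ?)",
                    ((f"key-{i}", payload) for i in range(rows)),
                )
            write_s = time.perf_counter() - start

            start = time.perf_counter()
            for i in range(rows):
                conn.execute(
                    "SELECT v FROM kv WHERE k = ?", (f"key-{i}",)
                ).fetchone()
            read_s = time.perf_counter() - start

            count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
        finally:
            conn.close()  # close before the temp directory is removed
    return {
        "rows": count,
        "writes_per_s": rows / write_s,
        "reads_per_s": rows / read_s,
    }


result = bench_sqlite(rows=200)
print(result)
```

Raising `blob_size` is a quick way to see how large per-session context blobs shift the write numbers on your disk.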

Released under the Apache 2.0 License.