Designing a multi-agent code reviewer — and measuring it honestly
A single LLM told to "review this code" spreads itself thin and misses things. So I built Quorum as a panel: a dispatcher routes a diff to specialist agents, they review in parallel, and a synthesizer returns one verdict. The interesting parts weren't the prompts — they were the routing, the output contract, and learning to measure the thing without lying to myself.
Why a panel beats one prompt
Ask one model to check security and performance and correctness and tests and style in a single pass and it does all of them at 60%. Each concern competes for the same attention budget, and the cross-cutting instruction ("also consider…") dilutes every individual check.
Quorum gives each concern its own agent with a focused system prompt — security, performance, correctness, tests, style — and runs them concurrently. Two practical wins fall out of that:
- Model tiering. Each agent declares a tier: `strong` or `fast`. High-stakes checks (security, correctness) get the stronger model; nits (style) get the cheaper, faster one. You spend tokens where they matter.
- Routing. A docs-only change skips the security and performance agents entirely. If the diff already updates tests, the "missing tests" agent is dropped. Less noise, lower cost.
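To make tiering and routing concrete, here is a minimal sketch rather than Quorum's actual code. The names `DiffInfo`, `AgentSpec`, and `route` are hypothetical, the path heuristics are deliberately crude, and the tiers for the performance and tests agents are my assumption (the text above only pins security, correctness, and style).

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

DOC_SUFFIXES = (".md", ".rst", ".txt")  # assumed definition of "docs"


@dataclass(frozen=True)
class DiffInfo:
    """Hypothetical summary of a diff: just the touched paths."""
    paths: Tuple[str, ...]

    @property
    def docs_only(self) -> bool:
        return bool(self.paths) and all(p.endswith(DOC_SUFFIXES) for p in self.paths)

    @property
    def updates_tests(self) -> bool:
        return any(p.startswith("tests/") or "/tests/" in p for p in self.paths)


@dataclass(frozen=True)
class AgentSpec:
    name: str
    tier: str  # "strong" or "fast"
    applies: Callable[[DiffInfo], bool]  # routing predicate


AGENTS = [
    AgentSpec("security", "strong", lambda d: not d.docs_only),
    AgentSpec("performance", "fast", lambda d: not d.docs_only),  # tier assumed
    AgentSpec("correctness", "strong", lambda d: True),
    AgentSpec("tests", "fast", lambda d: not d.updates_tests),    # tier assumed
    AgentSpec("style", "fast", lambda d: True),
]


def route(diff: DiffInfo) -> List[AgentSpec]:
    """Return only the agents whose predicate accepts this diff."""
    return [agent for agent in AGENTS if agent.applies(diff)]
```

Keeping the predicate next to the tier means each agent carries its own cost and relevance rules, so adding a sixth agent is one more entry in the list.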
Failure is a first-class case
The moment you have five agents calling a model provider, one of them will time out or return garbage. The rule I held to: one agent failing never aborts the panel. Each agent runs inside a wrapper that turns any exception into a recorded failed result, and the review continues with whatever the other agents produced. A flaky model call degrades the review; it doesn't fail it closed.
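The wrapper idea fits in a few lines. This is a sketch under assumptions: `AgentResult` and `run_isolated` are illustrative names, and findings are plain dicts here purely to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class AgentResult:
    """Hypothetical result record: either findings or a recorded failure."""
    agent: str
    ok: bool
    findings: List[dict] = field(default_factory=list)
    error: Optional[str] = None


def run_isolated(name: str, review: Callable[[str], Iterable[dict]], diff: str) -> AgentResult:
    """Run one agent; convert any exception into a failed result."""
    try:
        return AgentResult(agent=name, ok=True, findings=list(review(diff)))
    except Exception as exc:  # deliberately broad: timeouts, bad JSON, provider errors
        return AgentResult(agent=name, ok=False, error=f"{type(exc).__name__}: {exc}")
```

Because the wrapper never raises, the dispatcher can treat every agent the same way and leave it to the synthesizer to report which agents failed.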
That same instinct paid off when I put the demo in the browser via Pyodide (WebAssembly), which has no threads. The parallel dispatcher would have thrown `can't start new thread`, so it falls back to sequential execution when threads aren't available. Same agents, same results, just slower. The pipeline runs anywhere.
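One way to write that fallback, sketched with the standard library's `ThreadPoolExecutor`. The function name `run_panel` and the injectable `executor_cls` parameter are my additions for illustration, and the sketch assumes each task is already wrapped so that it never raises.

```python
from concurrent.futures import ThreadPoolExecutor


def run_panel(tasks, diff, executor_cls=ThreadPoolExecutor):
    """Run every task on the diff, in parallel when threads exist.

    Results come back in task order either way.
    """
    tasks = list(tasks)
    if not tasks:
        return []
    try:
        # Leaving the with-block waits for all submitted work to finish.
        with executor_cls(max_workers=len(tasks)) as pool:
            futures = [pool.submit(task, diff) for task in tasks]
    except RuntimeError:
        # Thread creation failed (for example "can't start new thread").
        # Fall back to running everything sequentially. Tasks may be re-run,
        # so this assumes they have no side effects.
        return [task(diff) for task in tasks]
    return [future.result() for future in futures]
```

Only the submission step sits inside the `try`, so a `RuntimeError` raised by a task itself surfaces from `future.result()` instead of silently triggering the fallback.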
The output contract: structured-first, with a safety net
"AI doesn't give you clean JSON" is half true. Modern providers do — if you ask correctly. Quorum requests native structured output per provider: OpenAI's json_schema mode, Anthropic's forced tool-use, Ollama's format field. The reply is schema-valid by construction, not scraped out of prose.
But not every backend can enforce a schema — a CLI subprocess, or a small local model that drifts. So there's a tolerant parser underneath that extracts the findings array whether the model returned a clean object, a bare array, or JSON buried in chatter. Structured-first, with a fallback that keeps the panel from breaking on a model that gets chatty.
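A tolerant parser along those lines can lean on `json.JSONDecoder.raw_decode`, which decodes one JSON value starting at a given offset and ignores whatever follows. This is a sketch, not Quorum's parser: `extract_findings` is a hypothetical name, and the rule that a bare array only counts if every element is an object is my assumption, there to avoid mistaking a stray `[1]` in prose for the payload.

```python
import json


def _findings_from(value):
    """Return the findings list if value looks like one, else None."""
    if isinstance(value, dict):
        value = value.get("findings")
    if isinstance(value, list) and all(isinstance(item, dict) for item in value):
        return value
    return None


def extract_findings(text: str) -> list:
    """Pull findings out of a clean object, a bare array, or JSON inside chatter."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # Decode one JSON value beginning at this bracket, if there is one.
            value, _end = decoder.raw_decode(text, start)
        except ValueError:
            continue
        findings = _findings_from(value)
        if findings is not None:
            return findings
    return []  # nothing parseable: treat as no findings rather than crash
```

Returning an empty list on total failure is a design choice worth revisiting: it keeps the panel alive, but you may prefer to record a failed result so that a garbled reply is not mistaken for a clean review.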
The synthesizer is deterministic on purpose
Findings from all agents are merged, de-duplicated on (file, line, title), and sorted by severity. The verdict itself is not left to a model — it's a pure function of the findings, so the same review always yields the same call:
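    def decide_verdict(findings):
        # Pure function of the findings: same input, same verdict.
        severities = {f.severity for f in findings}
        # Any critical or high finding blocks the merge.
        if severities & {"critical", "high"}:
            return "REQUEST_CHANGES"
        if "medium" in severities:
            return "COMMENT"
        return "APPROVE"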
An LLM writes the human-readable summary, but the gate that can block a merge is deterministic and auditable. You don't want the thing wired into CI to be a coin flip.
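The merge step that feeds the verdict can be just as deterministic. Here is a sketch of de-duplicating on (file, line, title) and sorting by severity. The `Finding` dataclass, the `low` severity level, and the rule that the most severe duplicate wins are all assumptions on my part.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Lower rank sorts first. "low" is an assumed fourth level.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    title: str
    severity: str


def _rank(finding: Finding) -> int:
    # Unknown severities sort after every known one.
    return SEVERITY_RANK.get(finding.severity, len(SEVERITY_RANK))


def merge_findings(per_agent: Iterable[Iterable[Finding]]) -> List[Finding]:
    """Merge all agents' findings, de-duplicate, and sort by severity."""
    kept = {}
    for findings in per_agent:
        for finding in findings:
            key = (finding.file, finding.line, finding.title)
            current = kept.get(key)
            # Assumption: when two agents report the same issue, keep the more severe one.
            if current is None or _rank(finding) < _rank(current):
                kept[key] = finding
    # Tie-break on location and title so the order never depends on agent timing.
    return sorted(kept.values(), key=lambda f: (_rank(f), f.file, f.line, f.title))
```

The explicit tie-break matters when agents run in parallel: without it, two findings of equal severity could swap places between runs.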
Measuring it without lying to myself
Here's the part I'm proudest of, because it's where I caught my own mistake.
I built a benchmark: a corpus of diffs with known planted defects (an eval() call, a SQL injection by string concatenation, an off-by-one, a bare except) plus some clean diffs. My first scoring used the obvious metric — precision, recall, F1 — counting any finding that didn't match a planted label as a false positive.
That metric lied. It told me the deterministic mock reviewer (F1 0.80) was better than the real LLM panel (F1 0.46). Which is nonsense — the LLM had perfect recall. What actually happened: on a diff that contains a known bug, a thorough reviewer also flags other real issues the corpus never labeled. My precision metric punished the good reviewer for being thorough, conflating "wrong" with "unlabeled-but-valid."
A metric that rewards silence is worse than no metric. It will quietly push you toward the dumber system.
So I threw out global precision/F1 and measured only the two things I can label with confidence:
- Recall — of the planted defects, how many did the panel catch? (scored on the defect diffs)
- False alarms — how many findings did it raise on the clean diffs, where the answer is "nothing"? (scored on the clean diffs)
Extra findings on defect diffs are reported separately, as information, not as errors. With that framing the numbers finally tell the truth:
| Provider | Recall | False alarms (clean) |
|---|---|---|
| mock (deterministic) | 0.67 | 0 |
| LLM panel | 1.00 | 0 |
The mock is precise but blind to anything without a keyword tell; the LLM panel catches the semantic defects too. The benchmark is deterministic under the mock provider, so it gates CI with no API key — and you can re-run it against a real provider to measure the live panel on the same corpus.
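The scoring described above can be sketched as follows. The `Case` shape and the idea of matching findings to planted defects by label are assumptions, and the sample cases in the usage note are made-up illustrations, not the benchmark corpus.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Case:
    """One benchmark diff. An empty `planted` set means a clean diff."""
    planted: FrozenSet[str]   # labels of the planted defects
    reported: FrozenSet[str]  # labels the reviewer raised


def score(cases: Iterable[Case]) -> dict:
    planted_total = caught = false_alarms = extra = 0
    for case in cases:
        if case.planted:
            # Defect diff: only recall is scored here.
            planted_total += len(case.planted)
            caught += len(case.planted & case.reported)
            # Unlabeled findings are counted as information, not as errors.
            extra += len(case.reported - case.planted)
        else:
            # Clean diff: the right answer is "nothing", so every finding is a false alarm.
            false_alarms += len(case.reported)
    recall = caught / planted_total if planted_total else 0.0
    return {"recall": recall, "false_alarms": false_alarms, "extra_on_defect_diffs": extra}
```

With three hypothetical cases, say one defect diff where the reviewer catches the planted `eval` and also flags an unused import, one where it misses a planted SQL injection, and one clean diff with no findings, `score` reports a recall of 0.5, zero false alarms, and one extra finding.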
What I'd tell someone building one
- Split concerns into agents with focused prompts and route them; don't ask one model to do everything.
- Make failure cheap — isolate each agent so one bad call degrades, never aborts.
- Pin the output contract with native structured output, and keep a tolerant parser as a seatbelt.
- Keep the decision deterministic where it has consequences.
- Be ruthless about your eval. The wrong metric doesn't just mislead — it actively steers you toward the worse system. Measure what you can label honestly.
Quorum is open source (zero runtime dependencies) and runs entirely in your browser in the demo: