Multi-agent code review system that shows its work - specialised AI agents analyse PRs, then deliberate to produce unified feedback.
Most AI code review tools are a single LLM call with 'review this PR' as the prompt. Results are generic and hard to trust because you can't see the reasoning. Arbiter splits review into specialised agents with focused mandates. They analyse independently, then deliberate to resolve conflicts and produce unified feedback.
The deliberation transcript is visible. You can see exactly how agents reasoned and where they agreed or disagreed. That transparency is the difference between a tool that generates suggestions and one you actually trust.
Before agents see the code, a static analysis pipeline runs: ruff for linting, mypy for type checking, bandit for security scanning, radon for complexity metrics. Results are injected into each agent's context, grounding LLM analysis in concrete findings rather than pure pattern matching. This catches obvious issues deterministically so agents can focus on higher-level concerns.
Three agents, each with a focused mandate: Security (vulnerabilities, injection risks, auth issues, secret exposure), Style (consistency, naming, readability, project conventions), and Complexity (cyclomatic complexity, function length, abstraction depth, maintainability). Each gets the diff, static analysis results, and a specialised system prompt.
Independence matters: agents don't see each other's initial analysis. This prevents groupthink and produces genuinely different perspectives. LiteLLM provides model-agnostic LLM access, so swapping models doesn't require changing agent code.
After independent analysis, agents enter a deliberation round. Each sees the others' findings and can agree, disagree, or add context. The system synthesises deliberation into a unified review with consensus ratings.
Conflicts are surfaced explicitly: if Security flags something that Style thinks is fine, both perspectives are shown with reasoning. The full deliberation transcript is stored and browsable. You see why a recommendation was made, not just what it recommends. This is what makes it different from single-prompt review tools.
GitHub and GitLab webhook integration. Push a PR, review starts automatically. Results are posted as PR comments with a summary and per-file annotations. A React dashboard lets you explore reviews: filter by project, severity, agent, or time range.
Cost controls keep things practical: token budgets per review, and response caching for unchanged files between pushes to the same PR. Redis handles job queuing and caching, PostgreSQL stores reviews, deliberation transcripts, and cost tracking.