fullstack

Arbiter

Multi-agent code review system that shows its work - specialised AI agents analyse PRs, then deliberate to produce unified feedback.

PythonFastAPIPostgreSQLRedisReactLiteLLM

View Source

Overview

Most AI code review tools are a single LLM call with 'review this PR' as the prompt. Results are generic and hard to trust because you can't see the reasoning. Arbiter splits review into specialised agents with focused mandates. They analyse independently, then deliberate to resolve conflicts and produce unified feedback.

The deliberation transcript is visible. You can see exactly how agents reasoned and where they agreed or disagreed. That transparency is the difference between a tool that generates suggestions and one you actually trust.

Static Analysis Pre-Pass

Before agents see the code, a static analysis pipeline runs: ruff for linting, mypy for type checking, bandit for security scanning, radon for complexity metrics. Results are injected into each agent's context, grounding LLM analysis in concrete findings rather than pure pattern matching. This catches obvious issues deterministically so agents can focus on higher-level concerns.

Agent Specialisation

Three agents, each with a focused mandate: Security (vulnerabilities, injection risks, auth issues, secret exposure), Style (consistency, naming, readability, project conventions), and Complexity (cyclomatic complexity, function length, abstraction depth, maintainability). Each gets the diff, static analysis results, and a specialised system prompt.

Independence matters: agents don't see each other's initial analysis. This prevents groupthink and produces genuinely different perspectives. LiteLLM provides model-agnostic LLM access, so swapping models doesn't require changing agent code.

Deliberation

After independent analysis, agents enter a deliberation round. Each sees the others' findings and can agree, disagree, or add context. The system synthesises deliberation into a unified review with consensus ratings.

Conflicts are surfaced explicitly: if Security flags something that Style thinks is fine, both perspectives are shown with reasoning. The full deliberation transcript is stored and browsable. You see why a recommendation was made, not just what it recommends. This is what makes it different from single-prompt review tools.

Integration & Dashboard

GitHub and GitLab webhook integration. Push a PR, review starts automatically. Results are posted as PR comments with a summary and per-file annotations. A React dashboard lets you explore reviews: filter by project, severity, agent, or time range.

Cost controls keep things practical: token budgets per review, and response caching for unchanged files between pushes to the same PR. Redis handles job queuing and caching, PostgreSQL stores reviews, deliberation transcripts, and cost tracking.

Related Projects

CodeTutor

fullstack

Interactive coding interview preparation platform with step-by-step algorithm visualisations.

TypeScriptNext.jsReact+4

Portfolio Site

fullstack

This site. Next.js frontend with a FastAPI backend, self-hosted CI/CD, and live infrastructure monitoring.

Next.jsTypeScriptTailwind CSS+4

Overview

Static Analysis Pre-Pass

Agent Specialisation

Deliberation

Integration & Dashboard