EXP-0010 — Karpathy's llm-council: four frontier LLMs that judge each other
David Olsson
When you ask GPT-5, Gemini, Claude, or Grok the same question, you'll often get four meaningfully different answers: different facts emphasized, different framings, sometimes different conclusions. Picking which one to trust is hard. You can't really tell, from outside, which model knows what it's talking about for any given topic.
Andrej Karpathy — one of the most respected researchers in AI — published a small open-source weekend project that turns this into a tool. Instead of choosing one model, you build a council of all four. You ask your question once; the council does three things:
- All four answer in parallel. You see all four responses side-by-side.
- The four models then rank each other's answers. Crucially, they don't know which answer is whose — the responses are anonymized, so they can't play favorites with their own output.
- A designated chairman model takes all the answers and all the rankings and produces one final synthesized response.
The result is something more honest than any single model can give you: you see how the leading AIs disagree, you see which ones the others actually respect on this question, and you see a final answer that takes all that into account.
Forge — our experiment harness — cloned the project and confirmed that the install works cleanly and the design is solid: 6 Python files in the backend, a React + Vite frontend, a clear three-stage protocol. We could not actually run a council query because every request goes through a paid AI router (OpenRouter), and our sandbox is no-secrets by design — it doesn't carry API keys. What we can confirm is that the protocol is real, the design is honest, and anyone with an OpenRouter account and a few dollars of credit can run their own council in a few minutes.
Karpathy is explicit in the README that this is a "vibe coded Saturday hack" he doesn't intend to maintain — but the design pattern (multiple frontier LLMs, anonymized cross-review, chairman synthesis) is the real contribution and travels well beyond this codebase.
Status: experimented, result partial. Backend uv sync clean. Frontend stack inventoried. Forge cannot exercise a real council query — every request needs an OpenRouter API key with credit, which the no-secrets sandbox doesn't carry. The substantive forge findings are the three-stage protocol (collect → review → synthesize) and the self-judge frontier-model panel that Karpathy assembled.
This is a forge writeup of karpathy/llm-council at commit 92e1fcc. Karpathy's own framing in the README: "This project was 99% vibe coded as a fun Saturday hack… I'm not going to support it in any way, it's provided here as is for other people's inspiration." That self-described disposability is itself part of the point — the design is what's reusable, not the codebase.
TL;DR
- Stack: FastAPI backend (6 Python files), React 19 + Vite 7 frontend, uv for Python, npm for JS. OpenRouter as the single LLM gateway.
- Council members (at HEAD): openai/gpt-5.1, google/gemini-3-pro-preview, anthropic/claude-sonnet-4.5, x-ai/grok-4. Four frontier models, four vendors.
- Chairman: google/gemini-3-pro-preview. The same Gemini that participates as a council member also synthesizes the final response. (Choice of chairman is configurable.)
- Three-stage protocol: collect first opinions → anonymized cross-review → chairman synthesis. Each stage maps to one function in backend/council.py.
- Install: clean. Backend uv sync resolved the full FastAPI + uvicorn + httpx + watchfiles + websockets graph; frontend declares React 19 + react-markdown.
- Smoke probe: not attempted. Every council request hits OpenRouter; forge sandbox carries no API key. Static inventory only.
- License: no LICENSE file at HEAD. README is explicit: provided as-is, no support.
What it is
The Karpathy framing, briefly: instead of asking GPT-5 (or Gemini, or Claude) a question and trusting one answer, send the same question to all four at once and show the answers side-by-side. Each model then ranks the full set of answers, its own included, without knowing which answer came from which model, and a designated Chairman model synthesizes a final answer from all of it. You get four first-pass opinions, four cross-rankings, and one synthesis. The cost is roughly 9× a normal query (4 first opinions + 4 reviews + 1 chairman call). The benefit, if you're trying to evaluate the quality of frontier LLMs against each other, is that you see how each model judges the others, including how each model judges its own output when blinded to the source.
Karpathy's stated use case: reading books with LLMs, where the marginal value of a second opinion is high. The repo is positioned as a vibe-coded Saturday hack, with the README explicitly disclaiming maintenance. Forge's purpose in writing this up is to capture the design pattern, not to assess production readiness — there is no production claim to assess.
The three-stage protocol
From backend/council.py:
| stage | function | what happens |
|---|---|---|
| 1 | stage1_collect_responses(user_query) | Send the user's prompt to all 4 council members in parallel. Collect 4 raw responses. |
| 2 | stage2_collect_rankings(...) | For each member, show it the anonymized responses of the other 3 (and its own, also anonymized). Ask it to rank them for accuracy and insight. Parse the ranking with parse_ranking_from_text. |
| 3 | stage3_synthesize_final(...) | Hand the chairman model the original query + all 4 responses + all 4 rankings. It produces the final answer. |
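The three stages in the table can be sketched end to end without a network connection. This is a minimal illustration, not the code in backend/council.py: `ask` is a hypothetical stub that stands in for an OpenRouter request, the function names and prompt wording are invented for the example, and whether the real chairman prompt names the models is an assumption.

```python
import asyncio
import random

# Model ids copied from the council list in this writeup.
COUNCIL = [
    "openai/gpt-5.1",
    "google/gemini-3-pro-preview",
    "anthropic/claude-sonnet-4.5",
    "x-ai/grok-4",
]
CHAIRMAN = "google/gemini-3-pro-preview"
CALLS = []  # records which model each stubbed call went to


async def ask(model, prompt):
    """Hypothetical stub standing in for an OpenRouter request (no network)."""
    CALLS.append(model)
    await asyncio.sleep(0)
    return f"stub answer to a {len(prompt)}-character prompt"


async def collect(query):
    """Stage 1: every member answers the same query in parallel."""
    answers = await asyncio.gather(*(ask(m, query) for m in COUNCIL))
    return dict(zip(COUNCIL, answers))


def anonymize(answers, rng):
    """Shuffle authorship and relabel as 'Response A', 'Response B', ..."""
    models = list(answers)
    rng.shuffle(models)
    return {f"Response {chr(ord('A') + i)}": m for i, m in enumerate(models)}


async def review(query, answers, label_to_model):
    """Stage 2: every member ranks the full anonymized set, its own included."""
    listing = "\n\n".join(
        f"{label}:\n{answers[model]}" for label, model in label_to_model.items()
    )
    prompt = f"Question: {query}\n\n{listing}\n\nRank the responses, best first."
    rankings = await asyncio.gather(*(ask(m, prompt) for m in COUNCIL))
    return dict(zip(COUNCIL, rankings))


async def synthesize(query, answers, rankings):
    """Stage 3: the chairman sees the query, all answers, and all rankings.

    Assumption: model names are shown to the chairman here; the real prompt may differ.
    """
    parts = [f"Question: {query}"]
    parts += [f"Answer from {m}:\n{a}" for m, a in answers.items()]
    parts += [f"Ranking by {m}:\n{r}" for m, r in rankings.items()]
    return await ask(CHAIRMAN, "\n\n".join(parts))


async def run_council(query, seed=0):
    answers = await collect(query)
    label_to_model = anonymize(answers, random.Random(seed))
    rankings = await review(query, answers, label_to_model)
    final = await synthesize(query, answers, rankings)
    return answers, label_to_model, rankings, final
```

Calling `asyncio.run(run_council("What is entropy?"))` makes nine stubbed calls: four in stage 1, four in stage 2, and one to the chairman. Swapping the stub for a real client is the only change needed to turn the sketch into a live query, which is also why a mocked gateway is enough to test the orchestration.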
A small but meaningful detail: calculate_aggregate_rankings() exists, which suggests the system aggregates the four cross-rankings into an overall scoreboard rather than letting the chairman implicitly weight them. Worth reading that function if you want to understand exactly how the council's collective judgment is reduced to one ordering.
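One plausible way to reduce four rankings to a single ordering is to average each model's position across reviewers. The sketch below assumes reviewers refer to answers by labels such as "Response B" and that a lower mean position is better. Both function names are hypothetical and this has not been checked against the repo's parse_ranking_from_text or calculate_aggregate_rankings.

```python
import re
from collections import defaultdict
from statistics import mean


def parse_ranking(text):
    """Return labels like 'Response B' in first-mention order (assumed format)."""
    seen = []
    for label in re.findall(r"Response [A-Z]", text):
        if label not in seen:
            seen.append(label)
    return seen


def aggregate_by_mean_rank(rankings, label_to_model):
    """Average each model's position across reviewers; best (lowest) first."""
    positions = defaultdict(list)
    for ranking in rankings:
        for position, label in enumerate(ranking, start=1):
            positions[label_to_model[label]].append(position)
    board = [(model, mean(places)) for model, places in positions.items()]
    return sorted(board, key=lambda row: row[1])


# Illustrative labels and review texts only; not output from a real council run.
label_to_model = {"Response A": "model-1", "Response B": "model-2", "Response C": "model-3"}
review_texts = [
    "1. Response B\n2. Response A\n3. Response C",
    "1. Response B\n2. Response C\n3. Response A",
    "1. Response A\n2. Response B\n3. Response C",
]
board = aggregate_by_mean_rank([parse_ranking(t) for t in review_texts], label_to_model)
```

In this made-up example model-2 comes out on top (positions 1, 1, 2), followed by model-1 and model-3. Mean position is only one choice; a Borda count or a pairwise method would be equally defensible, and the repo may use something else.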
The frontend
frontend/package.json is small — React 19, react-dom, react-markdown. Vite as the build tool. ESLint with the React-hooks and React-refresh plugins. That's it. Karpathy resisted the urge to add a state-management library, a routing library, or a UI kit. For a side project this is the right call — the surface is small enough that React state hooks are sufficient.
backend/main.py exposes six FastAPI routes — list/create/read conversations, plus message and streaming-message endpoints. The streaming variant uses SSE for incremental display of council responses as they arrive. This is the right UX choice: the four-vendor parallel call still has tail-latency in the slowest model, and showing the others' responses live keeps the wait useful.
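The streaming idea can be illustrated with the standard library alone: emit one Server-Sent Events frame per model as soon as that model finishes, rather than waiting for the slowest one. This is a sketch under assumptions. The event names, the `fake_member` stub, and the payload shape are all hypothetical, and the actual endpoint in main.py may be organized differently.

```python
import asyncio
import json


async def fake_member(model, delay):
    """Hypothetical stand-in for one council member with a given latency."""
    await asyncio.sleep(delay)
    return {"model": model, "text": f"answer from {model}"}


def sse_frame(event, payload):
    """Format one SSE frame: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


async def stream_stage1(delays):
    """Yield a frame per member in completion order, then a closing frame."""
    tasks = [asyncio.create_task(fake_member(m, d)) for m, d in delays.items()]
    for finished in asyncio.as_completed(tasks):
        yield sse_frame("stage1_response", await finished)
    yield sse_frame("stage1_complete", {"count": len(tasks)})


async def collect_frames(delays):
    """Drain the generator into a list, for inspection in a script or test."""
    return [frame async for frame in stream_stage1(delays)]
```

With `{"slow-model": 0.05, "fast-model": 0.0}` as input, the fast model's frame is emitted first even though it was listed second. In a FastAPI app, an async generator like `stream_stage1` would typically be passed to `StreamingResponse` with `media_type="text/event-stream"`.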
What forge actually verified
git clone https://github.com/karpathy/llm-council.git
cd llm-council && git checkout 92e1fcc
# inside ghcr.io/astral-sh/uv:python3.12-bookworm
uv sync --no-install-project
# 6 backend .py files: __init__.py, config.py, council.py, main.py, openrouter.py, storage.py
uv sync resolved every dependency. The dependency graph is small and modern: FastAPI + uvicorn for the server, httpx for the OpenRouter calls, websockets and watchfiles for dev reloading, python-dotenv for the API-key load. No surprise — the project is what its README claims.
The frontend package.json declares React 19.2 + react-dom + react-markdown as runtime, with Vite 7.2 as the dev/build tool. No frontend install was attempted in this run: it would add another 200+ packages and another minute of wall time, the frontend can't really be tested without the backend running, and the backend can't really be tested without an OpenRouter key.
What forge could not verify
- End-to-end council query. Every stage of the protocol calls OpenRouter. No key, no query. We could supply a mock OpenRouter and exercise the orchestration, which would prove the three-stage protocol is wired correctly — that's a reasonable follow-up.
- Tail-latency behavior. When one council member is slow, does the UI block on it, or does it stream the others through? Reading main.py and council.py strongly suggests streaming, but the test that proves it requires a live call.
- Ranking consistency. Does Gemini consistently rank itself first when blinded to identity? Does Claude defer to GPT-5 systematically? These are the interesting questions the system is built to answer; they need a key + a budget + multiple test prompts.
All three are doable with an OpenRouter key and ~$5 of credit per test session.
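For the ranking-consistency question, the analysis itself is small once rankings have been mapped back from anonymous labels to model ids. The sketch below tallies how often each reviewer puts its own answer first. The function name and input shape are hypothetical, and the sample data is made up to show the calculation; it is not a measurement.

```python
from collections import Counter


def self_preference_rate(trials):
    """Fraction of reviews in which a reviewer ranked its own answer first.

    trials: iterable of (reviewer, ranking) pairs, where ranking lists
    model ids best-first after de-anonymization.
    """
    reviews = Counter()
    self_first = Counter()
    for reviewer, ranking in trials:
        reviews[reviewer] += 1
        if ranking and ranking[0] == reviewer:
            self_first[reviewer] += 1
    return {r: self_first[r] / reviews[r] for r in reviews}


# Made-up trials with placeholder ids, purely to exercise the function.
sample_trials = [
    ("model-a", ["model-a", "model-b"]),
    ("model-a", ["model-b", "model-a"]),
    ("model-b", ["model-a", "model-b"]),
]
rates = self_preference_rate(sample_trials)
```

With a four-member council, a reviewer with no self-preference would rank its own answer first about a quarter of the time, so rates well above 0.25 across many prompts would be the signal to look for.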
Why this matters as a design
Multi-model ensemble has been a research idea for years (mixture-of-experts at training time; majority-vote at inference time). The Karpathy contribution here is to make the ensemble visible to the user, not just synthesize it silently:
- Stage 1 surfaces the four raw opinions side-by-side. You see disagreement directly.
- Stage 2 makes each model justify its judgments of the others. This is implicit calibration data.
- Stage 3 produces the synthesis only after the user has had a chance to compare.
A normal multi-vendor product would hide the four opinions behind a single "best" answer. The council shows you the panel. For tasks where calibration matters — research, due diligence, anything where being wrong has a real cost — this is the right UX even if you don't end up trusting the synthesis.
The cost is also worth naming: 9× a normal query, sometimes 10× because the chairman gets a longer prompt. At current OpenRouter prices that's pennies for a chat-length prompt and meaningful for a book-length one. Karpathy's stated use case ("reading books with LLMs") sits exactly at the boundary where the cost is real but defensible.
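The arithmetic behind the multiplier can be made explicit. The sketch below counts calls and estimates input tokens under stated assumptions: every stage 2 prompt carries the query plus all first answers, and the chairman prompt carries the query, all answers, and all rankings. The token figures in the example are illustrative placeholders, not measurements, and output tokens are left out.

```python
def council_calls(members):
    """One first answer and one review per member, plus one chairman call."""
    return 2 * members + 1


def council_input_tokens(members, query_tokens, answer_tokens, ranking_tokens):
    """Rough input-token total across the three stages (assumed prompt contents)."""
    stage1 = members * query_tokens
    stage2 = members * (query_tokens + members * answer_tokens)
    stage3 = query_tokens + members * answer_tokens + members * ranking_tokens
    return stage1 + stage2 + stage3


# Illustrative long-context case: a 100,000-token excerpt, 1,000-token answers,
# 300-token rankings.
calls = council_calls(4)
total = council_input_tokens(4, 100_000, 1_000, 300)
ratio = total / 100_000
```

Under these assumptions the council makes 9 calls and reads 921,200 input tokens, about 9.2× the single-query input. For short prompts the answers dominate the stage 2 and stage 3 inputs, so the input-token multiple is higher than 9× even though the absolute cost stays small.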
License note
There is no LICENSE file at HEAD. Per the README: "provided here as is for other people's inspiration." The intent is permissive; the legal status is ambiguous in the same way as EXP-0009 (autoresearch). For anyone wanting to fork or re-publish, contact the author.
Reproducibility
| item | value |
|---|---|
| upstream repo | https://github.com/karpathy/llm-council |
| commit pinned | 92e1fccb1bdcf1bab7221aa9ed90f9dc72529131 |
| license | none (LICENSE file absent at HEAD) |
| base image (backend) | ghcr.io/astral-sh/uv:python3.12-bookworm |
| backend install | uv sync --no-install-project — exit 0 |
| backend files | 6 (config.py, council.py, main.py, openrouter.py, storage.py, __init__.py) |
| frontend stack | React 19.2 + react-dom + react-markdown + Vite 7.2 |
| council members | gpt-5.1 / gemini-3-pro-preview / claude-sonnet-4.5 / grok-4 |
| chairman | gemini-3-pro-preview |
| smoke probe | not attempted (requires OpenRouter key) |
Companion gist holds the install log, the env manifest, and the council config so the panel composition is preserved alongside the writeup.
See also
- EXP-0009 — autoresearch — Karpathy's other recent project, paired ship in the same forge run.
- EXP-0006 — Agentic RL — forge's own ensemble-and-judging harness. Related shape (multiple rollouts, group advantage) but applied to RL rather than inference.
- Meet forge — the operationalization rule.
Built and verified by forge. The council protocol is the substantive finding; running an actual council query is a follow-up that needs an OpenRouter key and a small budget.