The Warrant Reader Leaderboard is a frozen-retrieval benchmark: every submission consumes the identical 500-question evidence file (R@5 = 96.2%), uses the canonical prompt template, and is judged by the same GPT-4o protocol (K=5 seeds, 3-of-5 majority vote). The reader is the only variable; retrieval, prompt, and judge are held fixed. The protocol is stable, and the runner stubs and schema are below. The frozen retrieval artifact is in final QA; email for early access.
The CLI shape and JSONL format are stable: a v0 row written today will be wire-compatible with v1. The internal calls (reader backend, judge call) are stubs in the v0 runner; v1 wires them against the public OpenAI API and configurable HuggingFace / vLLM backends.
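For concreteness, a sketch of what a v0 row could look like. The field names here are illustrative assumptions, not the published schema; submission_schema.json is authoritative:

```python
import json

# Hypothetical v0 submission row; field names are illustrative, not the published schema.
row = {
    "question_id": "q0042",                             # assumed identifier field
    "reader": "hf://meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "benchmark_prompt.v1",                    # assumed provenance fields
    "artifact": "frozen_retrieval_topK_500q.v1",
    "answer": "...",                                    # the reader's answer text
}
print(json.dumps(row))  # one JSON object per line in the submission JSONL
```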
Clone the harness, install Python deps, and fetch the frozen retrieval artifact. The fetcher verifies SHA-256 against the published manifest before it lets the reader touch the bytes.
```bash
pip install -r runner/requirements.txt
python runner/fetch_artifacts.py
```
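The verification step amounts to hashing the file and comparing against the manifest. A minimal sketch, assuming a JSON manifest that maps filenames to hex digests (the manifest path and layout are assumptions):

```python
import hashlib
import json
import pathlib

# Hypothetical manifest layout: {"frozen_retrieval_topK_500q.v1.jsonl": "<hex digest>", ...}
def verify_artifact(path: str, manifest_path: str = "artifacts/manifest.json") -> None:
    """Refuse to expose the artifact unless its SHA-256 matches the published manifest."""
    expected = json.loads(pathlib.Path(manifest_path).read_text())[pathlib.Path(path).name]
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    if h.hexdigest() != expected:
        raise RuntimeError(f"SHA-256 mismatch for {path}")
```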
Point at any HuggingFace model id, vLLM endpoint, or OpenAI / Anthropic / Mistral API. Reader URI schemes are documented in the runner; a sketch of the dispatch follows the command below.
```bash
python runner/run_reader.py \
  --reader hf://meta-llama/Llama-3.1-8B-Instruct \
  --artifact artifacts/frozen_retrieval_topK_500q.v1.jsonl \
  --prompt artifacts/benchmark_prompt.v1.md \
  --out submissions/llama-3.1-8b.jsonl
```
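Only hf:// appears in the example above, so the vllm:// and provider schemes in this sketch are assumptions about how the runner's dispatch might look:

```python
def resolve_reader(uri: str) -> tuple[str, str]:
    """Split a reader URI into (backend, target)."""
    scheme, _, rest = uri.partition("://")
    if scheme == "hf":                                # hf://<model id> (from the example above)
        return ("huggingface", rest)
    if scheme == "vllm":                              # assumed: vllm://<host:port> endpoint
        return ("vllm", rest)
    if scheme in ("openai", "anthropic", "mistral"):  # assumed: <provider>://<model name>
        return (scheme, rest)
    raise ValueError(f"unknown reader scheme: {scheme!r}")
```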
The judge runs K=5 GPT-4o seeds and takes a 3-of-5 majority vote; expect roughly $3 per submission at current OpenAI list prices. Alternatively, ship pre-computed judge logs and we'll verify them.
```bash
python runner/judge.py \
  --submission submissions/llama-3.1-8b.jsonl \
  --out submissions/llama-3.1-8b.judged.jsonl
```
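The vote itself is simple. A sketch of the K=5 aggregation, where judge_call is a hypothetical wrapper around a single GPT-4o judging request:

```python
def judge_question(judge_call, question: str, answer: str, evidence: list[str], k: int = 5) -> bool:
    """Run the judge k times with distinct seeds; correct iff a majority agrees."""
    votes = [judge_call(question, answer, evidence, seed=s) for s in range(k)]
    return sum(votes) >= k // 2 + 1  # 3-of-5 for k=5
```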
The aggregator computes overall accuracy, a 95% Wilson confidence interval, refusal rate, and a per-question-type breakdown. The output validates against submission_schema.json.
```bash
python runner/score.py \
  --judged submissions/llama-3.1-8b.judged.jsonl \
  --out submissions/llama-3.1-8b.row.json
```
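The Wilson interval is standard; a self-contained sketch of the 95% CI the aggregator reports (the implementation details here are ours, not necessarily the runner's):

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for accuracy (z = 1.96)."""
    if n == 0:
        return (0.0, 0.0)
    p = correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)
```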
Email the *.row.json and judge logs to the maintainer. We re-run the judge on a 5% sample to verify, then ingest the row into the leaderboard manifest.
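Maintainer-side, the spot check can be as simple as re-judging a seeded 5% sample of the submitted rows; a sketch (the function and file layout are assumptions):

```python
import json
import random

def sample_for_spot_check(judged_path: str, frac: float = 0.05, seed: int = 0) -> list[dict]:
    """Draw a reproducible sample of judged rows for maintainer-side re-judging."""
    with open(judged_path) as f:
        rows = [json.loads(line) for line in f]
    k = max(1, round(frac * len(rows)))
    return random.Random(seed).sample(rows, k)
```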
contact@manifoldmemory.ai
subject: Warrant Reader Leaderboard — submission
The protocol's supporting files are mirrored at manifoldmemory.ai/warrant-leaderboard/. The frozen retrieval JSONL and the judge service are still pending publication; everything else is live.
This is a reader-only benchmark. Submissions cannot bring their own retrieval — that’s the entire point. If your system performs additional retrieval beyond the supplied 10 chunks, does multi-pass self-critique that hits an LLM more than twice per question, or makes external tool calls, it is out of scope for this leaderboard. Agent harnesses are a separate problem and we will not pretend the comparison is apples-to-apples.
Open-weights rows are ranked. Closed/API rows ship as a reference track — same retrieval contract, same judge, but reproducibility is limited to whatever the API serves on the run date.
Email the maintainer with the reader you want to evaluate. We will share the frozen retrieval artifact, run the canonical prompt + judge, and ingest your row into the next manifest update. First five submissions get top-of-page acknowledgement.