# Warrant Reader Leaderboard — Submission Harness (v0)

**Status:** v0 scaffold. The public submission flow is **in progress**. This repo documents the protocol, schemas, and runner shape so external teams can begin preparing submissions while we finalise the frozen-retrieval artifact and the hosted judge service.

The leaderboard itself is live at:

- **Page:**
- **Machine-readable manifest:**

---

## What this benchmark measures

A single, narrow question:

> Given the **same 500 LongMemEval-S questions**, the **same top-10 retrieved
> evidence chunks** (`R@5 = 96.20%` on a frozen retrieval pipeline), and the
> **same GPT-4o judge** (5 seeds, 3-of-5 majority vote), **which reader can
> actually use the evidence?**

It is **not** a general LongMemEval leaderboard. Submissions cannot bring their own retrieval — that's the entire point. Reader quality is the variable; retrieval, prompting, and judging are held fixed.

## Tracks

| Track        | What it accepts                                      | Ranked? |
|--------------|------------------------------------------------------|--------:|
| Open-weights | HuggingFace model id or local weights, single-pass   | yes     |
| Closed / API | OpenAI / Anthropic / Google / Mistral API endpoints  | no      |
| Experimental | Routed / fallback variants (must declare router)     | yes\*   |

\* Experimental rows ship alongside the canonical row but are clearly labelled on the page. The canonical row is the one we publicly stand behind.

## Frozen retrieval contract

Every submission consumes the **identical** evidence file:

- `BGE-large ∪ QNDN v0 ∪ BM25` candidate union
- Reciprocal-Rank Fusion, `k = 60`
- MixK reranker; top-10 chunks passed to the reader
- `481 / 500` questions have the gold-source fragment in the top-5 (`R@5 = 96.20%`)

The artifact is `artifacts/frozen_retrieval_topK_500q.v1.jsonl` (size: ~12 MB). Each line is one question with its top-10 chunks, qtype, and a deterministic `evidence_hash`.

**This artifact is currently `[PENDING UPLOAD]`. Email `contact@manifoldmemory.ai` for early access.**

## How a submission works

```bash
# 1. Install
pip install -r runner/requirements.txt

# 2. Pull the frozen retrieval artifact (manifest verifies SHA-256)
python runner/fetch_artifacts.py

# 3. Run your reader against the artifact
python runner/run_reader.py \
  --reader hf://meta-llama/Llama-3.1-8B-Instruct \
  --artifact artifacts/frozen_retrieval_topK_500q.v1.jsonl \
  --prompt artifacts/benchmark_prompt.v1.md \
  --out submissions/llama-3.1-8b.jsonl

# 4. Judge (calls GPT-4o K=5; ~$3 per submission)
python runner/judge.py \
  --submission submissions/llama-3.1-8b.jsonl \
  --out submissions/llama-3.1-8b.judged.jsonl

# 5. Score and emit a leaderboard row
python runner/score.py \
  --judged submissions/llama-3.1-8b.judged.jsonl \
  --out submissions/llama-3.1-8b.row.json
```

The output `*.row.json` matches `artifacts/submission_schema.json` exactly. That's what we ingest into `leaderboard.json`.

## Schema

See [`artifacts/submission_schema.json`](./artifacts/submission_schema.json) for the JSON-Schema definition. Every existing leaderboard row is a valid instance.
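## Rolling your own reader loop (illustrative)

Until the artifact and reference reader implementations are published, the sketch below shows the shape of the per-question loop a submission needs: read the frozen artifact line by line, answer from the top-10 chunks only, and emit one submission line per question. Field names such as `question_id`, `question`, `chunks`, and `answer` are placeholders, not the published schema; the only documented guarantees are that each artifact line carries the question, its top-10 chunks, its qtype, and a deterministic `evidence_hash`.

```python
import json

# Illustrative sketch only: field names below are assumptions, not the
# authoritative schema (see artifacts/submission_schema.json once published).
ARTIFACT = "artifacts/frozen_retrieval_topK_500q.v1.jsonl"
OUT = "submissions/my-reader.jsonl"


def my_reader(question: str, chunks: list[str]) -> str:
    """Your reader: answer the question using only the frozen top-10 evidence."""
    raise NotImplementedError


with open(ARTIFACT, encoding="utf-8") as src, open(OUT, "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        # Assumed layout: "chunks" is a list of dicts each carrying a "text" field.
        chunk_texts = [c["text"] for c in record["chunks"]]
        answer = my_reader(record["question"], chunk_texts)
        dst.write(json.dumps({
            "question_id": record["question_id"],
            "qtype": record["qtype"],
            # Echo the deterministic hash so downstream steps can confirm the
            # evidence set was the frozen one (whether this field is required
            # in submissions is not yet specified).
            "evidence_hash": record["evidence_hash"],
            "answer": answer,
        }) + "\n")
```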
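## Scoring aggregation (illustrative)

The judging protocol above is fixed: five seeded GPT-4o judge calls per question, with a question counted correct only on a 3-of-5 majority. A minimal sketch of that aggregation follows, assuming a hypothetical `verdicts` field holding the five per-seed labels in the judged file; the real field names live in the published schemas.

```python
import json
from collections import Counter


def majority_correct(verdicts: list[str]) -> bool:
    """A question counts as correct when at least 3 of the 5 seeded
    judge calls label it "correct"."""
    return Counter(verdicts)["correct"] >= 3


# "verdicts" is an assumed field name for the five per-seed judge labels.
with open("submissions/llama-3.1-8b.judged.jsonl", encoding="utf-8") as f:
    judged = [json.loads(line) for line in f]

accuracy = sum(majority_correct(r["verdicts"]) for r in judged) / len(judged)
print(f"overall accuracy over {len(judged)} questions: {accuracy:.2%}")
```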
## Roadmap

- [x] Public leaderboard page with track separation and qtype breakdown
- [x] Machine-readable `leaderboard.json` manifest
- [x] Schema definition for submission rows
- [ ] Publish `frozen_retrieval_topK_500q.v1.jsonl` (size: ~12 MB)
- [ ] Publish `benchmark_prompt.v1.md`
- [ ] Publish reference reader implementations (HF + vLLM)
- [ ] Publish hosted judge service or self-host instructions
- [ ] Add Mistral-7B as a third-party open row
- [ ] Replace oracle-qtype routing with a learned classifier in the canonical row

## Contact

Submissions, questions, methodology disputes: [contact@manifoldmemory.ai](mailto:contact@manifoldmemory.ai)

This repo will eventually live on GitHub. Until then, this directory is the canonical source of truth for the protocol shape.