# Warrant Reader Leaderboard — Submission Harness (v0)

**Status:** v0 scaffold. The public submission flow is **in progress**. This repo documents the protocol, schemas, and runner shape so external teams can begin preparing submissions while we finalise the frozen-retrieval artifact and the hosted judge service.

The leaderboard itself is live at:

- **Page:**
- **Machine-readable manifest:**

---

## What this benchmark measures

A single, narrow question:

> Given the **same 500 LongMemEval-S questions**, the **same top-10 retrieved
> evidence chunks** (`R@5 = 96.20%` on a frozen retrieval pipeline), and the
> **same GPT-4o judge** (5 seeds, 3-of-5 majority vote), **which reader can
> actually use the evidence?**

It is **not** a general LongMemEval leaderboard. Submissions cannot bring their own retrieval — that's the entire point. Reader quality is the variable; retrieval, prompting, and judging are held fixed.

## Tracks

| Track        | What it accepts                                      | Ranked? |
|--------------|------------------------------------------------------|--------:|
| Open-weights | HuggingFace model id or local weights, single-pass   | yes     |
| Closed / API | OpenAI / Anthropic / Google / Mistral API endpoints  | no      |
| Experimental | Routed / fallback variants (must declare router)     | yes\*   |

\* Experimental rows ship alongside the canonical row but are clearly labelled on the page. The canonical row is the one we publicly stand behind.

## Frozen retrieval contract

Every submission consumes the **identical** evidence file:

- `BGE-large ∪ QNDN v0 ∪ BM25` candidate union
- Reciprocal-Rank Fusion, `k = 60`
- MixK reranker; top-10 chunks passed to the reader
- `481 / 500` questions have the gold-source fragment in the top-5 (`R@5 = 96.20%`)

The artifact is `artifacts/frozen_retrieval_topK_500q.v1.jsonl` (size: ~12 MB). Each line is one question with its top-10 chunks, qtype, and a deterministic `evidence_hash`.

**This artifact is currently `[PENDING UPLOAD]`. Email `contact@manifoldmemory.ai` for early access.**

## How a submission works

```bash
# 1. Install
pip install -r runner/requirements.txt

# 2. Pull the frozen retrieval artifact (manifest verifies SHA-256)
python runner/fetch_artifacts.py

# 3. Run your reader against the artifact
python runner/run_reader.py \
  --reader hf://meta-llama/Llama-3.1-8B-Instruct \
  --artifact artifacts/frozen_retrieval_topK_500q.v1.jsonl \
  --prompt artifacts/benchmark_prompt.v1.md \
  --out submissions/llama-3.1-8b.jsonl

# 4. Judge (calls GPT-4o K=5; ~$3 per submission)
python runner/judge.py \
  --submission submissions/llama-3.1-8b.jsonl \
  --out submissions/llama-3.1-8b.judged.jsonl

# 5. Score and emit a leaderboard row
python runner/score.py \
  --judged submissions/llama-3.1-8b.judged.jsonl \
  --out submissions/llama-3.1-8b.row.json
```

The output `*.row.json` matches `artifacts/submission_schema.json` exactly. That's what we ingest into `leaderboard.json`.

## Schema

See [`artifacts/submission_schema.json`](./artifacts/submission_schema.json) for the JSON-Schema definition. Every existing leaderboard row is a valid instance.
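## Rolling your own reader loop (illustrative)

Until the artifact and reference reader implementations are published, the sketch below shows the shape of the per-question loop a submission needs: read the frozen artifact line by line, answer from the top-10 chunks only, and emit one submission line per question. Field names such as `question_id`, `question`, `chunks`, and `answer` are placeholders, not the published schema; the only documented guarantees are that each artifact line carries the question, its top-10 chunks, its qtype, and a deterministic `evidence_hash`.

```python
import json

# Illustrative sketch only: field names below are assumptions, not the
# authoritative schema (see artifacts/submission_schema.json once published).
ARTIFACT = "artifacts/frozen_retrieval_topK_500q.v1.jsonl"
OUT = "submissions/my-reader.jsonl"


def my_reader(question: str, chunks: list[str]) -> str:
    """Your reader: answer the question using only the frozen top-10 evidence."""
    raise NotImplementedError


with open(ARTIFACT, encoding="utf-8") as src, open(OUT, "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        # Assumed layout: "chunks" is a list of dicts each carrying a "text" field.
        chunk_texts = [c["text"] for c in record["chunks"]]
        answer = my_reader(record["question"], chunk_texts)
        dst.write(json.dumps({
            "question_id": record["question_id"],
            "qtype": record["qtype"],
            # Echo the deterministic hash so downstream steps can confirm the
            # evidence set was the frozen one (whether this field is required
            # in submissions is not yet specified).
            "evidence_hash": record["evidence_hash"],
            "answer": answer,
        }) + "\n")
```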
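## Scoring aggregation (illustrative)

The judging protocol above is fixed: five seeded GPT-4o judge calls per question, with a question counted correct only on a 3-of-5 majority. A minimal sketch of that aggregation follows, assuming a hypothetical `verdicts` field holding the five per-seed labels in the judged file; the real field names live in the published schemas.

```python
import json
from collections import Counter


def majority_correct(verdicts: list[str]) -> bool:
    """A question counts as correct when at least 3 of the 5 seeded
    judge calls label it "correct"."""
    return Counter(verdicts)["correct"] >= 3


# "verdicts" is an assumed field name for the five per-seed judge labels.
with open("submissions/llama-3.1-8b.judged.jsonl", encoding="utf-8") as f:
    judged = [json.loads(line) for line in f]

accuracy = sum(majority_correct(r["verdicts"]) for r in judged) / len(judged)
print(f"overall accuracy over {len(judged)} questions: {accuracy:.2%}")
```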
## Roadmap

- [x] Public leaderboard page with track separation and qtype breakdown
- [x] Machine-readable `leaderboard.json` manifest
- [x] Schema definition for submission rows
- [ ] Publish `frozen_retrieval_topK_500q.v1.jsonl` (size: ~12 MB)
- [ ] Publish `benchmark_prompt.v1.md`
- [ ] Publish reference reader implementations (HF + vLLM)
- [ ] Publish hosted judge service or self-host instructions
- [ ] Add Mistral-7B as a third-party open row
- [ ] Replace oracle-qtype routing with a learned classifier in the canonical row

## Contact

Submissions, questions, methodology disputes: [contact@manifoldmemory.ai](mailto:contact@manifoldmemory.ai)

This repo will eventually live on GitHub. Until then, this directory is the canonical source of truth for the protocol shape.