Warrant holds retrieval fixed. This leaderboard measures what happens next.
The reader is the bottleneck. Same 500 LongMemEval-S questions, same retrieval contract (R@5 = 96.2%), same GPT-4o judge; vary only the reading stage (model, evidence routing, prompts) and accuracy moves from 53.2% to 71.4%, an 18.2 pp spread owned entirely by the reader side. The canonical open-weights Hybrid reaches 70.0%; the retrieval-only Stack reaches 63.8%; the best experimental routed variant reaches 71.4%. Under the same frozen retrieval contract, the closed gpt-5-mini reference lands at 59.0%.
The frozen retrieval contract: a first-stage union of three retrievers (BGE ∪ QNDN ∪ BM25), fused via reciprocal rank fusion (RRF), reranked by a MixK cross-encoder, top-10 chunks delivered. Same 500 LongMemEval-S questions, same haystack, same chunk boundaries, same answer prompts, same judge. A sketch of the fusion step follows.
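A minimal sketch of the fusion step, assuming standard RRF with the usual k = 60 constant; the function names and the candidate-pool width are illustrative, not Warrant's actual code:

```python
# Illustrative RRF sketch; retriever names and pool width are assumptions.
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, pool_size=50):
    """Fuse best-first ranked lists of chunk ids (e.g. BGE, QNDN, BM25) via RRF.

    k is the standard RRF smoothing constant; pool_size is an assumed width
    for the candidate pool handed to the MixK cross-encoder, which then
    selects the final top-10 delivered to the reader.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:pool_size]
```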
Open-weights readers ranked by overall accuracy on the same 500 LongMemEval-S questions and the same top-10 evidence chunks. Each row carries a track: canonical is the headline Warrant publicly stands behind; experimental rows are routed variants we have measured but do not headline; retrieval-only and no retrieval are ablations. Refusals are scored as incorrect in accuracy; the Ref % column reports them separately because abstention on missing evidence is a different failure mode from fabrication.
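The 95% CI column reports a two-sided interval on the 500-question overall accuracy. The published bounds are consistent with Wilson score intervals at n = 500; a sketch of that construction (the build script is the authoritative source):

```python
# 95% Wilson score interval sketch; reproduces the published bounds,
# e.g. 350/500 correct -> [65.8, 73.9] on the canonical row.
import math

def wilson_ci(k, n, z=1.96):
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_ci(350, 500)            # canonical row: 70.0% overall
print(f"[{100 * lo:.1f}-{100 * hi:.1f}]")  # [65.8-73.9]
```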
| # | Track | Reader / Pipeline | Overall acc · 95% CI | SSA | SSU | SSP | TR | KU | MS | Ref % |
|---|---|---|---|---|---|---|---|---|---|---|
| #1 | Experimental | Gemma-4-26B-A4B-it · Hybrid + F3-on-TR (oracle qtype + date-anchor on TR slice) | 71.4% [67.3–75.2] | 92.9% (52/56) | 88.6% (62/70) | 26.7% (8/30) | 63.9% (85/133) | 74.4% (58/78) | 69.2% (92/133) | 10.8% |
| ★ #2 | Canonical | Gemma-4-26B-A4B-it · Hybrid (oracle qtype: SSA→Naked, else→Stack) | 70.0% [65.8–73.9] | 94.6% (53/56) | 88.6% (62/70) | 33.3% (10/30) | 56.4% (75/133) | 73.1% (57/78) | 69.9% (93/133) | 10.0% |
| #3 | Open swap | Qwen3.6-27B · Hybrid (oracle qtype: SSA→Naked, else→Stack) | 66.2% [61.9–70.2] | 96.4% (54/56) | 88.6% (62/70) | 40.0% (12/30) | 45.9% (61/133) | 71.8% (56/78) | 64.7% (86/133) | 14.8% |
| #4 | Retrieval-only | Gemma-4-26B-A4B-it · Stack (BGE∪QNDN∪BM25 → RRF → top-10) | 63.8% [59.5–67.9] | 39.3% (22/56) | 88.6% (62/70) | 33.3% (10/30) | 55.6% (74/133) | 75.6% (59/78) | 69.2% (92/133) | 14.4% |
| #5 | Retrieval-only | Qwen3.6-27B · Stack (same retrieval contract) | 60.0% [55.6–64.2] | 37.5% (21/56) | 90.0% (63/70) | 40.0% (12/30) | 46.6% (62/133) | 73.1% (57/78) | 63.9% (85/133) | 21.2% |
| #6 | No retrieval | Qwen3.6-27B · Naked (no retrieval) | 58.4% [54.0–62.6] | 96.4% (54/56) | 91.4% (64/70) | 16.7% (5/30) | 36.1% (48/133) | 75.6% (59/78) | 46.6% (62/133) | 38.8% |
| #7 | Experimental | Gemma-4-26B-A4B-it · CWFIX (Stack + F1 refusal-retry + F2 chronological-KU + F3 date-anchor, all qtypes) | 56.6% [52.2–60.9] | 30.4% (17/56) | 87.1% (61/70) | 20.0% (6/30) | 62.4% (83/133) | 64.1% (50/78) | 49.6% (66/133) | 20.4% |
| #8 | No retrieval | Gemma-4-26B-A4B-it · Naked (no retrieval, LME-S 110 K-token haystack) | 53.2% [48.8–57.5] | 94.6% (53/56) | 88.6% (62/70) | 13.3% (4/30) | 33.8% (45/133) | 65.4% (51/78) | 38.3% (51/133) | 39.2% |

Row notes:

- **#1** · Best routed variant we have measured. Adds a date-anchor pre-prompt on the temporal-reasoning slice only (F3). Allowed but not the public headline; +1.4 pp over Canonical, McNemar p ≈ 0.38, directional but not yet decisive (test sketch directly below the table).
- **#2 (★ Canonical)** · The number Warrant publicly stands behind. Pure routing over the existing Stack and Naked answers; no extra inference. The production equivalent uses a learned qtype classifier in place of the oracle (pending; expected 1–3 pp gap).
- **#3** · Same canonical pipeline, different open-weights reader. Phase 91.8 swap. The 3.8 pp gap to the Gemma Canonical is paid mostly on the TR and MS slices; the SSA full-context branch actually holds up (96.4% vs 94.6%).
- **#4** · Stack only, no full-context fallback on the SSA slice. The +6.2 pp the Canonical Hybrid earns over this row is the routing contribution; the rest is the reader.
- **#5** · Reader swap, Stack only. Higher refusal rate than Gemma on the same evidence; Qwen declines more readily when the top-10 chunks under-cover the question.
- **#6** · No-retrieval floor for the open-swap track.
- **#7** · Cheap-win-fix bundle applied unconditionally. Regresses vs Canonical (F1 and F2 are harmful when ungated). Published to keep the negative result auditable; F3 alone, scoped to TR, is the only keeper.
- **#8** · Floor row: the reader receives the full LME-S haystack and no retrieval at all. The +10.6 pp Stack earns over this row is the retrieval contribution. Run on 8×A100 (≈140 GB); H200/H100 OOMs without quantisation.
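The McNemar figure in the #1 note compares paired per-question outcomes of the two rows on the same 500 questions. A sketch of the test, assuming the exact-binomial variant (the journal does not name which variant was run):

```python
# McNemar on paired correctness; exact-binomial variant is an assumption.
from math import comb

def mcnemar_exact_p(correct_a, correct_b):
    """correct_a, correct_b: dicts mapping question_id -> bool for two rows."""
    b = sum(correct_a[q] and not correct_b[q] for q in correct_a)  # A right, B wrong
    c = sum(correct_b[q] and not correct_a[q] for q in correct_a)  # B right, A wrong
    n = b + c
    # Two-sided exact p-value: only the n discordant questions carry signal.
    p = 2 * sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, p)
```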
Click any row for model id, judge metadata, and methodology notes.
qtype legend: SSA = single-session-assistant · SSU = single-session-user · SSP = single-session-preference ·
TR = temporal-reasoning · KU = knowledge-update · MS = multi-session.
Per-qtype cells show accuracy as k/n. Each answer is judged K = 5 times by GPT-4o and scored correct on a 3-of-5 majority vote.
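A sketch of that scoring rule (the verdict layout here is hypothetical; the five per-seed judge files in the receipts are the primary sources):

```python
# 3-of-5 majority over K=5 GPT-4o judge calls; verdict layout is hypothetical.
def majority_correct(verdicts):
    """verdicts: five booleans for one answer, one per judge seed."""
    assert len(verdicts) == 5
    return sum(verdicts) >= 3

def overall_accuracy(all_verdicts):
    """all_verdicts: one 5-long verdict list per question (n = 500 here)."""
    return sum(map(majority_correct, all_verdicts)) / len(all_verdicts)
```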
Hybrid rows route by oracle qtype label from LongMemEval-S; the production-equivalent learned classifier is pending and is
expected to land within 1–3 pp of oracle. Naked rows have no retrieval, only the full LME-S 110 K-token haystack.
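In outline, a Hybrid row is pure selection between two existing answer files; a minimal sketch with a hypothetical schema (the committed synth scripts, `_tmp_scripts/_synth_hybrid_gemma.py` and `_tmp_scripts/_synth_hybrid_qwen.py`, are authoritative):

```python
# Oracle-qtype router sketch: SSA -> Naked answer, every other qtype -> Stack.
# The answer-file schema here is hypothetical; the synth scripts define the real one.
import json

def synth_hybrid(stack_path, naked_path, out_path):
    stack = {r["question_id"]: r for r in json.load(open(stack_path))}
    naked = {r["question_id"]: r for r in json.load(open(naked_path))}
    hybrid = []
    for qid, row in stack.items():
        # qtype is the oracle label shipped with LongMemEval-S, not a prediction
        pick = naked[qid] if row["qtype"] == "single-session-assistant" else row
        hybrid.append(pick)
    json.dump(hybrid, open(out_path, "w"), indent=2)
```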
Closed-weights API readers are not part of the open-weights ranking. They are reported in a separate reference track on the identical retrieval contract so the open and closed regimes can be read side-by-side without conflating them into one ranking. Reproducibility is limited to whatever the API serves on the run date; the row is published as a comparison anchor, not a contestant.
| # | Track | Reader / Pipeline | Overall acc · 95% CI | SSA | SSU | SSP | TR | KU | MS | Ref % |
|---|---|---|---|---|---|---|---|---|---|---|
| ref | Closed reference | gpt-5-mini · same retrieval contract, closed-weights reader | 59.0% [54.6–63.2] | 37.5% (21/56) | 90.0% (63/70) | 36.7% (11/30) | 59.4% (79/133) | 66.7% (52/78) | 51.9% (69/133) | 25.0% |

Row note: Closed-weights API reader on the identical retrieval contract. Reference row for cross-class comparison; reproducibility is limited to whatever the API serves on the run date.
Same retrieval contract, same 500 questions, same K=5 GPT-4o majority-vote judge. Same protocol — different track. For context: the canonical open-weights row above lands at 70.0%; the closed reference row reads the identical evidence at 59.0%. Take this as evidence that fixed-evidence reading is its own bottleneck, not as a head-to-head claim on either model's preferred pipeline.
The board is not a single ranking. It is a stack of comparisons under the same retrieval contract: canonical vs. experimental routing, retrieval vs. no retrieval, open-weights vs. closed-weights. Read each pair against the canonical anchor.
Track legend:

- **Canonical**: the number Warrant stands behind publicly. Production-equivalent (oracle qtype disclosed; learned classifier expected within 1–3 pp).
- **Experimental**: routed or prompt variants we have measured under the same protocol. Allowed on the board, but not the public headline.
- **Retrieval-only**: the Stack output handed straight to the reader. No qtype routing, no full-context fallback. Measures the contribution of retrieval without routing.
- **No retrieval**: the reader receives the full LME-S 110 K-token haystack and zero retrieved chunks. Floor row; the contribution of retrieval is measured against it.
- **Open swap**: the same canonical pipeline with a different open-weights reader. Quantifies how much of the result is reader-specific.
- **Closed reference**: a closed-weights API reader on the identical retrieval contract. Reference comparison only, not a head-to-head; reproducibility is limited to whatever the API serves on the run date.
What this board is:

- A frozen-retrieval, reader-only comparison. Every row received the identical top-10 chunks for the same 500 LongMemEval-S questions, judged by the same K=5 GPT-4o protocol.
- Per-qtype breakdowns that expose where each reader fails (SSA for chunk-rerank disruption, KU for stale knowledge, TR for date arithmetic) rather than aggregating the failures away into a single number.
- Refusals scored as incorrect in accuracy and reported separately in the Ref % column, because abstention on missing evidence is a different failure mode from fabrication, and the two have different downstream consequences for any system built on top (scoring sketch after this list).
- Attached receipts: per-row answer files, all five seed-judge files, and the synth scripts that produce the Hybrid rows from the underlying Stack and Naked answers.
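A sketch of that two-column split, continuing the hypothetical verdict layout used above (the refusal flag is taken as given; how refusals are detected is out of scope here):

```python
# Accuracy folds refusals in as incorrect; Ref % reports them on the side.
def accuracy_and_ref(rows):
    """rows: dicts with 'correct' (3-of-5 majority) and 'refused' booleans."""
    n = len(rows)
    acc = sum(r["correct"] and not r["refused"] for r in rows) / n
    ref = sum(r["refused"] for r in rows) / n
    return acc, ref
```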
What this board is not:

- A general LongMemEval leaderboard. Public LME-S numbers blend retrieval, routing, and reading; this board pins retrieval and routing and isolates the reader. Systems do not bring their own retrieval.
- A claim that "small lab beats OpenAI." gpt-5-mini is published as a reference row on a fixed retrieval contract; it is not a head-to-head on either model's preferred pipeline.
- An agent-harness benchmark. Every row is a single-pass reader call; multi-step planners (Hindsight, ReAct, etc.) are out of scope by design, since they conflate reader skill with planner skill.
- A frozen artifact. Submissions of additional readers (open weights or API) are welcome under the same contract; a public harness ships next.
The leaderboard is a derived artifact. The underlying answer-and-judge files are committed alongside the build script so any row can be re-scored from primary sources in one command.
- Run pulls: `_remote_pulls/phase91_a50/`, `_remote_pulls/phase91_a100_gemma_naked/`, `_remote_pulls/phase91_hybrid_gemma/`, `_remote_pulls/phase91_hybrid_qwen/`, `_remote_pulls/phase91_cwfix/`, `_remote_pulls/phase91_frontier/`, `_remote_pulls/phase91_hybrid_f3tr/`.
- Journal: `EXPERIMENT_JOURNAL_RECOVERED_2026-04-24.md` § Phase 86 (R@5 receipt) → Phase 91 (Stack vs Naked) → Phase 95-K (oracle-qtype disclosure).
- Synth scripts: `_tmp_scripts/_synth_hybrid_gemma.py` (oracle-qtype router for Gemma), `_tmp_scripts/_synth_hybrid_qwen.py` (same, for Qwen), `_tmp_scripts/_synth_hybrid_f3_tr.py` (Hybrid + F3-on-TR, best-of-board).
- Build: `_tmp_scripts/_build_leaderboard.py` · machine-readable copy: `leaderboard.json`.