Benchmark · v0 · LongMemEval-S 500Q · 9 readers

Reader Leaderboard.

A frozen-retrieval benchmark for evidence readers.

Warrant holds retrieval fixed. This leaderboard measures what happens next.

Reader is the bottleneck. Same 500 LongMemEval-S questions, same retrieval contract (R@5 = 96.2%), same GPT-4o judge — swap only the reader and accuracy moves from 53.2% to 71.4%, an 18.2 pp spread driven entirely by reader choice. The canonical open-weights Hybrid reaches 70.0%; retrieval-only Stack reaches 63.8%; the best experimental routed variant reaches 71.4%. Under the same frozen retrieval contract, the closed gpt-5-mini reference lands at 59.0%.

Read this first: Scores are reader-side results over a frozen Warrant retrieval artifact. Systems do not bring their own retrieval. Same 500 LongMemEval-S questions, same top-10 chunks, same prompt family, same GPT-4o judge (K=5 seeds, 3-of-5 majority). This is not a general LongMemEval leaderboard.

Built 2026-04-25T02:16:34Z · 9 reader configurations · 500 questions each · K=5 seeds {42,1,2,3,4}

Frozen retrieval. Identical for every row.

First-stage union of three retrievers, fused via Reciprocal Rank Fusion, reranked by a MixK cross-encoder, and the top-10 chunks delivered to the reader; a minimal RRF sketch follows the spec below. Same 500 LongMemEval-S questions, same haystack, same chunk boundaries, same answer prompts, same judge.

First stage: BGE-large-en-v1.5 ∪ QNDN v0 ∪ BM25
Fusion: Reciprocal Rank Fusion (RRF, k=60)
Rerank: MixK-trained cross-encoder
Delivered to reader: top-10 chunks
Recall@5: 96.20% (481/500 questions)
Reader gap: 32.6 pp of end-to-end upside available under this retrieval contract
Judge: GPT-4o, K=5 seeds, 3-of-5 majority
Benchmark: LongMemEval-S (LME-S, 500 questions)
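The fusion step is standard Reciprocal Rank Fusion over the three first-stage rankings. A minimal sketch of that step, assuming each retriever returns an ordered list of chunk ids; function and variable names here are illustrative, not the pipeline's actual code:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:                      # one ranking per retriever
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; this pool is then passed to the cross-encoder.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative usage with the three first-stage retrievers named in the spec:
# fused_pool = rrf_fuse([bge_ranking, qndn_ranking, bm25_ranking], k=60)
```

The fused pool then goes to the MixK cross-encoder, which selects the final top-10 chunks every reader on the board receives.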

The reader is the variable. Retrieval is held fixed.

Open-weights readers ranked by overall accuracy on the same 500 LongMemEval-S questions and the same top-10 evidence chunks. Each row carries a track: canonical is the headline Warrant publicly stands behind; experimental rows are routed variants we have measured but do not headline; retrieval-only and no retrieval are ablations. Refusals are scored as incorrect in accuracy; the Ref % column reports them separately because abstention on missing evidence is a different failure mode from fabrication.

| # | Track | Reader | Pipeline | Overall acc [95% CI] | SSA | SSU | SSP | TR | KU | MS | Ref % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Experimental | Gemma-4-26B-A4B-it | Hybrid + F3-on-TR (oracle qtype + date-anchor on TR slice) | 71.4% [67.3–75.2] | 92.9% (52/56) | 88.6% (62/70) | 26.7% (8/30) | 63.9% (85/133) | 74.4% (58/78) | 69.2% (92/133) | 10.8% |
| 2 | Canonical | Gemma-4-26B-A4B-it | Hybrid (oracle qtype: SSA→Naked, else→Stack) | 70.0% [65.8–73.9] | 94.6% (53/56) | 88.6% (62/70) | 33.3% (10/30) | 56.4% (75/133) | 73.1% (57/78) | 69.9% (93/133) | 10.0% |
| 3 | Open swap | Qwen3.6-27B | Hybrid (oracle qtype: SSA→Naked, else→Stack) | 66.2% [61.9–70.2] | 96.4% (54/56) | 88.6% (62/70) | 40.0% (12/30) | 45.9% (61/133) | 71.8% (56/78) | 64.7% (86/133) | 14.8% |
| 4 | Retrieval-only | Gemma-4-26B-A4B-it | Stack (BGE∪QNDN∪BM25 → RRF → top-10) | 63.8% [59.5–67.9] | 39.3% (22/56) | 88.6% (62/70) | 33.3% (10/30) | 55.6% (74/133) | 75.6% (59/78) | 69.2% (92/133) | 14.4% |
| 5 | Retrieval-only | Qwen3.6-27B | Stack (same retrieval contract) | 60.0% [55.6–64.2] | 37.5% (21/56) | 90.0% (63/70) | 40.0% (12/30) | 46.6% (62/133) | 73.1% (57/78) | 63.9% (85/133) | 21.2% |
| 6 | No retrieval | Qwen3.6-27B | Naked (no retrieval) | 58.4% [54.0–62.6] | 96.4% (54/56) | 91.4% (64/70) | 16.7% (5/30) | 36.1% (48/133) | 75.6% (59/78) | 46.6% (62/133) | 38.8% |
| 7 | Experimental | Gemma-4-26B-A4B-it | CWFIX (Stack + F1 refusal-retry + F2 chronological-KU + F3 date-anchor, all qtypes) | 56.6% [52.2–60.9] | 30.4% (17/56) | 87.1% (61/70) | 20.0% (6/30) | 62.4% (83/133) | 64.1% (50/78) | 49.6% (66/133) | 20.4% |
| 8 | No retrieval | Gemma-4-26B-A4B-it | Naked (no retrieval, LME-S 110 K-token haystack) | 53.2% [48.8–57.5] | 94.6% (53/56) | 88.6% (62/70) | 13.3% (4/30) | 33.8% (45/133) | 65.4% (51/78) | 38.3% (51/133) | 39.2% |

Click any row for model id, judge metadata, and methodology notes.

qtype legend: SSA = single-session-assistant · SSU = single-session-user · SSP = single-session-preference · TR = temporal-reasoning · KU = knowledge-update · MS = multi-session. Cells show accuracy as k/n questions correct after the K=5-seed, 3-of-5 GPT-4o majority vote. Hybrid rows route by the oracle qtype label from LongMemEval-S; the production-equivalent learned classifier is pending and is expected to land within 1–3 pp of oracle. Naked rows use no retrieval, only the full LME-S 110 K-token haystack.
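To make the scoring concrete: each question receives five independent GPT-4o judge calls (seeds 42, 1, 2, 3, 4) and counts as correct only on a 3-of-5 majority, and the bracketed 95% intervals are consistent with Wilson score intervals at n=500. A hedged sketch, assuming per-question records with illustrative field names rather than the committed file schema:

```python
import math
from collections import Counter

def majority_correct(judge_verdicts):
    """3-of-5 majority over the K=5 GPT-4o judge verdicts for one question."""
    return Counter(judge_verdicts)["correct"] >= 3

def wilson_ci(correct, n, z=1.96):
    """Wilson score interval (in percent); consistent with the bracketed CIs on the board."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return 100 * (center - half), 100 * (center + half)

def score_row(rows):
    """rows: per-question records with hypothetical fields 'refused' (bool, a property
    of the reader's answer) and 'judge_verdicts' (5 strings, one per seed)."""
    n = len(rows)
    correct = sum(majority_correct(r["judge_verdicts"]) for r in rows)
    refused = sum(r["refused"] for r in rows)   # scored incorrect, reported separately as Ref %
    acc = 100 * correct / n
    return acc, wilson_ci(correct, n), 100 * refused / n
```

For example, 350/500 correct gives 70.0% with a Wilson interval of roughly [65.8, 73.9], matching the canonical row.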

Reference row. Read alongside, not against.

Closed-weights API readers are not part of the open-weights ranking. They are reported in a separate reference track on the identical retrieval contract so the open and closed regimes can be read side-by-side without conflating them into one ranking. Reproducibility is limited to whatever the API serves on the run date; the row is published as a comparison anchor, not a contestant.

| # | Track | Reader | Pipeline | Overall acc [95% CI] | SSA | SSU | SSP | TR | KU | MS | Ref % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ref | Closed reference | gpt-5-mini | Same retrieval contract, closed-weights reader | 59.0% [54.6–63.2] | 37.5% (21/56) | 90.0% (63/70) | 36.7% (11/30) | 59.4% (79/133) | 66.7% (52/78) | 51.9% (69/133) | 25.0% |

Same retrieval contract, same 500 questions, same K=5 GPT-4o majority-vote judge. Same protocol — different track. For context: the canonical open-weights row above lands at 70.0%; the closed reference row reads the identical evidence at 59.0%. Take this as evidence that fixed-evidence reading is its own bottleneck, not as a head-to-head claim on either model's preferred pipeline.

Six tracks. Each measures a different thing.

The board is not a single ranking. It is a stack of comparisons under the same retrieval contract: canonical vs. experimental routing, retrieval vs. no retrieval, open-weights vs. closed-weights. Read each pair against the canonical anchor.

Canonical

The number Warrant stands behind publicly. Production-equivalent (oracle-qtype disclosed; learned classifier expected within 1–3 pp). A minimal sketch of the routing follows the track descriptions.

Experimental

Routed or prompt variants we have measured under the same protocol. Allowed on the board, but not the public headline.

Retrieval-only

Stack handed straight to the reader. No qtype routing, no full-context fallback. The contribution of retrieval-without-routing.

No retrieval

Reader receives the full LME-S 110 K-token haystack and zero retrieved chunks. Floor row; the contribution of retrieval is measured against it.

Open swap

Same canonical pipeline, different open-weights reader. Quantifies how much of the result is reader-specific.

Closed reference

Closed-weights API reader on the identical retrieval contract. Reference comparison only — not a head-to-head; reproducibility is limited to whatever the API serves on the run date.
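For concreteness, the canonical Hybrid row is synthesized per question from the Stack and Naked answer files by the oracle qtype label, exactly as its row description states; the experimental Hybrid + F3-on-TR row additionally swaps in a date-anchored variant on the temporal-reasoning slice. A minimal sketch of that routing, with illustrative names rather than the actual synth-script code:

```python
def hybrid_answer(qtype, naked_answer, stack_answer):
    """Canonical Hybrid routing: single-session-assistant (SSA) questions take the
    full-context Naked answer; every other qtype takes the frozen-retrieval Stack answer."""
    return naked_answer if qtype == "single-session-assistant" else stack_answer

def hybrid_f3_tr_answer(qtype, naked_answer, stack_answer, stack_f3_answer):
    """Experimental Hybrid + F3-on-TR (row #1): same routing, except the
    temporal-reasoning (TR) slice takes the date-anchored F3 answer.
    Illustrative only, not the actual _synth_hybrid_f3_tr.py logic."""
    if qtype == "single-session-assistant":
        return naked_answer
    if qtype == "temporal-reasoning":
        return stack_f3_answer
    return stack_answer
```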

Fixed-evidence reading is a separate bottleneck.

On this evidence-reading task, under a fixed retrieval contract, the open-weights canonical Hybrid reaches 70.0%; the closed-weights gpt-5-mini reference reads the identical top-10 chunks and lands at 59.0%.
The point is not that one reader beats another; it is that once retrieval is held fixed, the spread is no longer dominated by retrieval quality. Closed frontier-class APIs do not automatically ingest evidence better than open-weights mid-class readers; on this contract, in this regime, they don't. Reader choice matters. Whatever extra performance a closed-weights vendor sells on a long-context end-to-end benchmark, a meaningful share of it comes from the vendor's own retrieval, not from the reader's reading. This board separates the two so the reader can be priced honestly.
Caveats: gpt-5-mini is an API reader; the run reflects whatever model OpenAI served on the run date and is not strictly reproducible. It is published as a reference row, not a head-to-head claim. The canonical row uses oracle-qtype routing — the production learned-classifier path is pending, expected within 1–3 pp.

What this board is

A frozen-retrieval, reader-only comparison. Every row received the identical top-10 chunks for the same 500 LongMemEval-S questions, judged by the same K=5 GPT-4o protocol.

Per-qtype breakdowns expose where each reader fails (SSA for chunk-rerank disruption, KU for stale knowledge, TR for date arithmetic) rather than aggregating them away into a single number.

Refusals are scored as incorrect in accuracy. They are also reported separately in the Ref % column because abstention on missing evidence is a different failure mode from fabrication, and the two have different downstream consequences for any system built on top.

Receipts are attached: per-row answer files, all five seed-judge files, and the synth scripts that produce the Hybrid rows from the underlying Stack and Naked answers.

What this board isn't

A general LongMemEval leaderboard. Public LME-S numbers blend retrieval, routing, and reading; this board pins retrieval and routing and isolates the reader. Systems do not bring their own retrieval.

A claim that "small lab beats OpenAI." gpt-5-mini is published as a reference row on a fixed retrieval contract; it is not a head-to-head on either model's preferred pipeline.

An agent-harness benchmark. Every row is a single-pass reader call — multi-step planners (Hindsight, ReAct, etc.) are out of scope by design, since they conflate reader skill with planner skill.

A frozen artifact. Submissions of additional readers (open weights or API) are welcome under the same contract: a public harness ships next.

Every number traceable to a file.

The leaderboard is a derived artifact. The underlying answer-and-judge files are committed alongside the build script so any row can be re-scored from primary sources in one command.

  • Per-row answer + 5-seed judge files
    _remote_pulls/phase91_a50/, _remote_pulls/phase91_a100_gemma_naked/, _remote_pulls/phase91_hybrid_gemma/, _remote_pulls/phase91_hybrid_qwen/, _remote_pulls/phase91_cwfix/, _remote_pulls/phase91_frontier/, _remote_pulls/phase91_hybrid_f3tr/.
  • Frozen retrieval contract spec · R@5 derivation
    EXPERIMENT_JOURNAL_RECOVERED_2026-04-24.md § Phase 86 (R@5 receipt) → Phase 91 (Stack vs Naked) → Phase 95-K (oracle-qtype disclosure).
  • Hybrid synthesis scripts
    _tmp_scripts/_synth_hybrid_gemma.py (oracle-qtype router for Gemma), _tmp_scripts/_synth_hybrid_qwen.py (Qwen), _tmp_scripts/_synth_hybrid_f3_tr.py (Hybrid + F3-on-TR best-of-board).
  • Build script for this page
    _tmp_scripts/_build_leaderboard.py · machine-readable copy: leaderboard.json.
  • Judge harness · submit your own reader
    Coming next: a stand-alone CLI that takes a reader callable + the frozen top-10 chunks per question and produces the answer + judge artifacts in the same schema as the rows above. Watch this page or contact us for early access.
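Until that CLI ships, the submission surface can be read off the protocol above: a reader is a single-pass callable from a question plus the frozen top-10 chunks to an answer string. A hypothetical sketch of that interface; the names, types, and record fields are assumptions, not the harness's published schema:

```python
from typing import Protocol

class Reader(Protocol):
    """Hypothetical reader interface for the coming harness (illustrative only)."""
    def __call__(self, question: str, chunks: list[str]) -> str:
        """One single-pass call: the question plus the frozen top-10 chunk texts in,
        an answer string (or an explicit refusal) out. No extra retrieval, no tools."""
        ...

def run_reader(reader: Reader, examples: list[dict]) -> list[dict]:
    """Produce per-question answer records; the field names ('qid', 'question',
    'chunks', 'answer') are assumptions, not the published artifact schema."""
    return [{"qid": ex["qid"], "answer": reader(ex["question"], ex["chunks"])}
            for ex in examples]
```

The resulting answer records would then be judged under the same K=5 GPT-4o, 3-of-5 majority protocol described above.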