Benchmark · v0 · LongMemEval-S 500Q · 9 readers

Reader Leaderboard.

A frozen-retrieval benchmark for evidence readers.

Warrant holds retrieval fixed. This leaderboard measures what happens next.

Reader is the bottleneck. Same 500 LongMemEval-S questions, same retrieval contract (R@5 = 96.2%), same GPT-4o judge — swap only the reader and accuracy moves from 53.2% to 71.4%, an 18.2 pp spread driven entirely by reader choice. The canonical open-weights Hybrid reaches 70.0%; retrieval-only Stack reaches 63.8%; the best experimental routed variant reaches 71.4%. Under the same frozen retrieval contract, the closed gpt-5-mini reference lands at 59.0%.

Read this first: Scores are reader-side results over a frozen Warrant retrieval artifact. Systems do not bring their own retrieval. Same 500 LongMemEval-S questions, same top-10 chunks, same prompt family, same GPT-4o judge (K=5 seeds, 3-of-5 majority). This is not a general LongMemEval leaderboard.

Built 2026-04-25T02:16:34Z · 9 reader configurations · 500 questions each · K=5 seeds {42,1,2,3,4}

Frozen retrieval. Identical for every row.

First-stage union of three retrievers, fused via Reciprocal Rank Fusion, reranked by a MixK cross-encoder, and the top-10 chunks delivered to the reader; a minimal RRF sketch follows the spec below. Same 500 LongMemEval-S questions, same haystack, same chunk boundaries, same answer prompts, same judge.

First stage: BGE-large-en-v1.5 ∪ QNDN v0 ∪ BM25
Fusion: Reciprocal Rank Fusion (RRF, k=60)
Rerank: MixK-trained cross-encoder
Delivered to reader: top-10 chunks
Recall@5: 96.20% (481/500 questions)
Reader gap: 32.6 pp of end-to-end upside available under this retrieval contract
Judge: GPT-4o, K=5 seeds, 3-of-5 majority
Benchmark: LongMemEval-S (LME-S, 500 questions)
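The fusion step is standard Reciprocal Rank Fusion over the three first-stage rankings. A minimal sketch of that step, assuming each retriever returns an ordered list of chunk ids; function and variable names here are illustrative, not the pipeline's actual code:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Reciprocal Rank Fusion: score(chunk) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:                      # one ranking per retriever
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; this pool is then passed to the cross-encoder.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Illustrative usage with the three first-stage retrievers named in the spec:
# fused_pool = rrf_fuse([bge_ranking, qndn_ranking, bm25_ranking], k=60)
```

The fused pool then goes to the MixK cross-encoder, which selects the final top-10 chunks every reader on the board receives.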

The reader is the variable. Retrieval is held fixed.

Open-weights readers ranked by overall accuracy on the same 500 LongMemEval-S questions and the same top-10 evidence chunks. Each row carries a track: canonical is the headline Warrant publicly stands behind; experimental rows are routed variants we have measured but do not headline; retrieval-only and no retrieval are ablations. Refusals are scored as incorrect in accuracy; the Ref % column reports them separately because abstention on missing evidence is a different failure mode from fabrication.

| # | Track | Reader | Pipeline | Overall acc [95% CI] | SSA | SSU | SSP | TR | KU | MS | Ref % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Experimental | Gemma-4-26B-A4B-it | Hybrid + F3-on-TR (oracle qtype + date-anchor on TR slice) | 71.4% [67.3–75.2] | 92.9% (52/56) | 88.6% (62/70) | 26.7% (8/30) | 63.9% (85/133) | 74.4% (58/78) | 69.2% (92/133) | 10.8% |
| 2 | Canonical | Gemma-4-26B-A4B-it | Hybrid (oracle qtype: SSA→Naked, else→Stack) | 70.0% [65.8–73.9] | 94.6% (53/56) | 88.6% (62/70) | 33.3% (10/30) | 56.4% (75/133) | 73.1% (57/78) | 69.9% (93/133) | 10.0% |
| 3 | Open swap | Qwen3.6-27B | Hybrid (oracle qtype: SSA→Naked, else→Stack) | 66.2% [61.9–70.2] | 96.4% (54/56) | 88.6% (62/70) | 40.0% (12/30) | 45.9% (61/133) | 71.8% (56/78) | 64.7% (86/133) | 14.8% |
| 4 | Retrieval-only | Gemma-4-26B-A4B-it | Stack (BGE∪QNDN∪BM25 → RRF → top-10) | 63.8% [59.5–67.9] | 39.3% (22/56) | 88.6% (62/70) | 33.3% (10/30) | 55.6% (74/133) | 75.6% (59/78) | 69.2% (92/133) | 14.4% |
| 5 | Retrieval-only | Qwen3.6-27B | Stack (same retrieval contract) | 60.0% [55.6–64.2] | 37.5% (21/56) | 90.0% (63/70) | 40.0% (12/30) | 46.6% (62/133) | 73.1% (57/78) | 63.9% (85/133) | 21.2% |
| 6 | No retrieval | Qwen3.6-27B | Naked (no retrieval) | 58.4% [54.0–62.6] | 96.4% (54/56) | 91.4% (64/70) | 16.7% (5/30) | 36.1% (48/133) | 75.6% (59/78) | 46.6% (62/133) | 38.8% |
| 7 | Experimental | Gemma-4-26B-A4B-it | CWFIX (Stack + F1 refusal-retry + F2 chronological-KU + F3 date-anchor, all qtypes) | 56.6% [52.2–60.9] | 30.4% (17/56) | 87.1% (61/70) | 20.0% (6/30) | 62.4% (83/133) | 64.1% (50/78) | 49.6% (66/133) | 20.4% |
| 8 | No retrieval | Gemma-4-26B-A4B-it | Naked (no retrieval, LME-S 110 K-token haystack) | 53.2% [48.8–57.5] | 94.6% (53/56) | 88.6% (62/70) | 13.3% (4/30) | 33.8% (45/133) | 65.4% (51/78) | 38.3% (51/133) | 39.2% |

Click any row for model id, judge metadata, and methodology notes.

qtype legend: SSA = single-session-assistant · SSU = single-session-user · SSP = single-session-preference · TR = temporal-reasoning · KU = knowledge-update · MS = multi-session. Cells show accuracy as k/n questions correct after the K=5-seed, 3-of-5 GPT-4o majority vote. Hybrid rows route by the oracle qtype label from LongMemEval-S; the production-equivalent learned classifier is pending and is expected to land within 1–3 pp of oracle. Naked rows use no retrieval, only the full LME-S 110 K-token haystack.
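To make the scoring concrete: each question receives five independent GPT-4o judge calls (seeds 42, 1, 2, 3, 4) and counts as correct only on a 3-of-5 majority, and the bracketed 95% intervals are consistent with Wilson score intervals at n=500. A hedged sketch, assuming per-question records with illustrative field names rather than the committed file schema:

```python
import math
from collections import Counter

def majority_correct(judge_verdicts):
    """3-of-5 majority over the K=5 GPT-4o judge verdicts for one question."""
    return Counter(judge_verdicts)["correct"] >= 3

def wilson_ci(correct, n, z=1.96):
    """Wilson score interval (in percent); consistent with the bracketed CIs on the board."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return 100 * (center - half), 100 * (center + half)

def score_row(rows):
    """rows: per-question records with hypothetical fields 'refused' (bool, a property
    of the reader's answer) and 'judge_verdicts' (5 strings, one per seed)."""
    n = len(rows)
    correct = sum(majority_correct(r["judge_verdicts"]) for r in rows)
    refused = sum(r["refused"] for r in rows)   # scored incorrect, reported separately as Ref %
    acc = 100 * correct / n
    return acc, wilson_ci(correct, n), 100 * refused / n
```

For example, 350/500 correct gives 70.0% with a Wilson interval of roughly [65.8, 73.9], matching the canonical row.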

Reference row. Read alongside, not against.

Closed-weights API readers are not part of the open-weights ranking. They are reported in a separate reference track on the identical retrieval contract so the open and closed regimes can be read side-by-side without conflating them into one ranking. Reproducibility is limited to whatever the API serves on the run date; the row is published as a comparison anchor, not a contestant.

| # | Track | Reader | Pipeline | Overall acc [95% CI] | SSA | SSU | SSP | TR | KU | MS | Ref % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ref | Closed reference | gpt-5-mini | Same retrieval contract, closed-weights reader | 59.0% [54.6–63.2] | 37.5% (21/56) | 90.0% (63/70) | 36.7% (11/30) | 59.4% (79/133) | 66.7% (52/78) | 51.9% (69/133) | 25.0% |

Same retrieval contract, same 500 questions, same K=5 GPT-4o majority-vote judge. Same protocol — different track. For context: the canonical open-weights row above lands at 70.0%; the closed reference row reads the identical evidence at 59.0%. Take this as evidence that fixed-evidence reading is its own bottleneck, not as a head-to-head claim on either model's preferred pipeline.

Six tracks. Each measures a different thing.

The board is not a single ranking. It is a stack of comparisons under the same retrieval contract: canonical vs. experimental routing, retrieval vs. no retrieval, open-weights vs. closed-weights. Read each pair against the canonical anchor.

Canonical

The number Warrant stands behind publicly. Production-equivalent (oracle-qtype disclosed; learned classifier expected within 1–3 pp). A minimal sketch of the routing follows the track descriptions.

Experimental

Routed or prompt variants we have measured under the same protocol. Allowed on the board, but not the public headline.

Retrieval-only

Stack handed straight to the reader. No qtype routing, no full-context fallback. The contribution of retrieval-without-routing.

No retrieval

Reader receives the full LME-S 110 K-token haystack and zero retrieved chunks. Floor row; the contribution of retrieval is measured against it.

Open swap

Same canonical pipeline, different open-weights reader. Quantifies how much of the result is reader-specific.

Closed reference

Closed-weights API reader on the identical retrieval contract. Reference comparison only — not a head-to-head; reproducibility is limited to whatever the API serves on the run date.
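For concreteness, the canonical Hybrid row is synthesized per question from the Stack and Naked answer files by the oracle qtype label, exactly as its row description states; the experimental Hybrid + F3-on-TR row additionally swaps in a date-anchored variant on the temporal-reasoning slice. A minimal sketch of that routing, with illustrative names rather than the actual synth-script code:

```python
def hybrid_answer(qtype, naked_answer, stack_answer):
    """Canonical Hybrid routing: single-session-assistant (SSA) questions take the
    full-context Naked answer; every other qtype takes the frozen-retrieval Stack answer."""
    return naked_answer if qtype == "single-session-assistant" else stack_answer

def hybrid_f3_tr_answer(qtype, naked_answer, stack_answer, stack_f3_answer):
    """Experimental Hybrid + F3-on-TR (row #1): same routing, except the
    temporal-reasoning (TR) slice takes the date-anchored F3 answer.
    Illustrative only, not the actual _synth_hybrid_f3_tr.py logic."""
    if qtype == "single-session-assistant":
        return naked_answer
    if qtype == "temporal-reasoning":
        return stack_f3_answer
    return stack_answer
```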

Fixed-evidence reading is a separate bottleneck.

On this evidence-reading task, under a fixed retrieval contract, the open-weights canonical Hybrid reaches 70.0%; the closed-weights gpt-5-mini reference reads the identical top-10 chunks and lands at 59.0%.
The point is not that one reader beats another; it is that once retrieval is held fixed, the spread is no longer dominated by retrieval quality. Closed frontier-class APIs do not automatically ingest evidence better than open-weights mid-class readers; on this contract, in this regime, they don't. Reader choice matters. Whatever extra performance a closed-weights vendor sells on a long-context end-to-end benchmark, a meaningful share of it comes from the vendor's own retrieval, not from the reader's reading. This board separates the two so the reader can be priced honestly.
Caveats: gpt-5-mini is an API reader; the run reflects whatever model OpenAI served on the run date and is not strictly reproducible. It is published as a reference row, not a head-to-head claim. The canonical row uses oracle-qtype routing — the production learned-classifier path is pending, expected within 1–3 pp.

What this board is

A frozen-retrieval, reader-only comparison. Every row received the identical top-10 chunks for the same 500 LongMemEval-S questions, judged by the same K=5 GPT-4o protocol.

Per-qtype breakdowns expose where each reader fails (SSA for chunk-rerank disruption, KU for stale knowledge, TR for date arithmetic) rather than aggregating them away into a single number.

Refusals are scored as incorrect in accuracy. They are also reported separately in the Ref % column because abstention on missing evidence is a different failure mode from fabrication, and the two have different downstream consequences for any system built on top.

Receipts are attached: per-row answer files, all five seed-judge files, and the synth scripts that produce the Hybrid rows from the underlying Stack and Naked answers.

What this board isn't

A general LongMemEval leaderboard. Public LME-S numbers blend retrieval, routing, and reading; this board pins retrieval and routing and isolates the reader. Systems do not bring their own retrieval.

A claim that "small lab beats OpenAI." gpt-5-mini is published as a reference row on a fixed retrieval contract; it is not a head-to-head on either model's preferred pipeline.

An agent-harness benchmark. Every row is a single-pass reader call — multi-step planners (Hindsight, ReAct, etc.) are out of scope by design, since they conflate reader skill with planner skill.

A frozen artifact. Submissions of additional readers (open weights or API) are welcome under the same contract: a public harness ships next.

Every number traceable to a file.

The leaderboard is a derived artifact. The underlying answer-and-judge files are committed alongside the build script so any row can be re-scored from primary sources in one command.

  • Per-row answer + 5-seed judge files
    _remote_pulls/phase91_a50/, _remote_pulls/phase91_a100_gemma_naked/, _remote_pulls/phase91_hybrid_gemma/, _remote_pulls/phase91_hybrid_qwen/, _remote_pulls/phase91_cwfix/, _remote_pulls/phase91_frontier/, _remote_pulls/phase91_hybrid_f3tr/.
  • Frozen retrieval contract spec · R@5 derivation
    EXPERIMENT_JOURNAL_RECOVERED_2026-04-24.md § Phase 86 (R@5 receipt) → Phase 91 (Stack vs Naked) → Phase 95-K (oracle-qtype disclosure).
  • Hybrid synthesis scripts
    _tmp_scripts/_synth_hybrid_gemma.py (oracle-qtype router for Gemma), _tmp_scripts/_synth_hybrid_qwen.py (Qwen), _tmp_scripts/_synth_hybrid_f3_tr.py (Hybrid + F3-on-TR best-of-board).
  • Build script for this page
    _tmp_scripts/_build_leaderboard.py · machine-readable copy: leaderboard.json.
  • Judge harness · submit your own reader
    Coming next: a stand-alone CLI that takes a reader callable + the frozen top-10 chunks per question and produces the answer + judge artifacts in the same schema as the rows above. Watch this page or contact us for early access.
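Until that CLI ships, the submission surface can be read off the protocol above: a reader is a single-pass callable from a question plus the frozen top-10 chunks to an answer string. A hypothetical sketch of that interface; the names, types, and record fields are assumptions, not the harness's published schema:

```python
from typing import Protocol

class Reader(Protocol):
    """Hypothetical reader interface for the coming harness (illustrative only)."""
    def __call__(self, question: str, chunks: list[str]) -> str:
        """One single-pass call: the question plus the frozen top-10 chunk texts in,
        an answer string (or an explicit refusal) out. No extra retrieval, no tools."""
        ...

def run_reader(reader: Reader, examples: list[dict]) -> list[dict]:
    """Produce per-question answer records; the field names ('qid', 'question',
    'chunks', 'answer') are assumptions, not the published artifact schema."""
    return [{"qid": ex["qid"], "answer": reader(ex["question"], ex["chunks"])}
            for ex in examples]
```

The resulting answer records would then be judged under the same K=5 GPT-4o, 3-of-5 majority protocol described above.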