Benchmark · v0 · LongMemEval-S 500Q · 27 readers

ManifoldMemory Reader Leaderboard.

A frozen-retrieval benchmark for the reader half of AI memory.

ManifoldMemory measures the two hidden bottlenecks in AI memory: evidence access and evidence reading. This page holds retrieval fixed and measures what happens next.

Reader is the bottleneck. Same 500 LongMemEval-S questions, same retrieval contract, same GPT-4o judge — swap only the reader and accuracy ranges from 8.8% to 75.6% across tested readers. Within the 26B/27B-class reader group, the spread is 18.2 pp. The current open-weights ceiling under this contract is 73.4% (gpt-oss-120b @ high, Stack + F3-on-TR).

What this measures: reader synthesis ability under a fixed retrieval contract — i.e. an internal-manifold benchmark conditional on the external retrieval manifold being held constant at R@5 = 96.2%. The new % of ceiling annotation under each accuracy shows how much of that 96.2% upper bound a reader actually captures (overall acc / R@5).
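As a worked check of the annotation, here is a minimal Python sketch that recomputes % of ceiling as overall accuracy divided by R@5. The function name is hypothetical and not part of the leaderboard tooling; the inputs are the Open-Max and Efficient accuracies quoted on this page.

```python
def pct_of_ceiling(overall_acc_pct: float, ceiling_pct: float = 96.2) -> float:
    """Share of the retrieval ceiling captured by the reader, in percent."""
    if not 0 < ceiling_pct <= 100:
        raise ValueError("ceiling_pct must be in (0, 100]")
    return round(100.0 * overall_acc_pct / ceiling_pct, 1)


open_max = pct_of_ceiling(73.4)   # Open-Max row
efficient = pct_of_ceiling(70.0)  # Efficient row
print(open_max, efficient)        # 76.3 72.8
```

Both values agree with the "% of ceiling" figures shown next to those rows on the v1 board.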

Production reader policies
Open-Max 73.4%
gpt-oss-120b @ high · Stack + F3-on-TR
Max-quality production policy. Self-host-verified on the frozen retrieval contract; no closed-weights dependency. The number ManifoldMemory publicly stands behind for highest-accuracy reads.
Efficient 70.0%
Gemma-4-26B · Hybrid (oracle qtype)
Efficient production policy. Smaller, cheaper, easier to self-host than Open-Max; slightly lower accuracy but the same retrieval contract and the same prompt family.
New · Open-weight SOTA (any reader size) ManifoldMemory + Qwen-3.6-27B (v2.1-K64.1 routed): 91.60% on LongMemEval-S 500q — new ManifoldMemory single-reader best; new open-weight SOTA, any reader size.

+0.60 pp over Gemma-4-31B-IT v2.1-K64 routed (91.00%, prior SOTA, different reader, ~1 d old); +1.60 pp over the same Qwen reader on the v2.1-K64 baseline (90.00%, ~36 h old); +2.60 pp over Hindsight + gpt-oss-120b (~89.0%) with a ~4.4× smaller reader. The gain comes from a single-knob retrieval intervention vs v2.1-K64: SSP routing K=24 → K=48. The SSP K=48 signal first surfaced in Phase 97.2.18 micro-test B at +3 q on a 30-SSP slice. It replicated in Phase 97.2.19 v2.1-K64+ at +4 q on the full 500-q SSP slice, where a hardened LIST rider masked the gain via −4 q on MS. Here it is isolated and replicated again at +4 q with the LIST rider dropped, on the same canonical retrieval, same K=64 cache, same v2.1-K64 prompts file, and same 5-seed GPT-4o MV-3/5 strict judge.

Per-qtype attribution vs Qwen v2.1-K64 baseline (90.00%, same reader, same prompts, same K=64 cache; only difference is K_BY_QTYPE[SSP] 24→48): SSP +13.33 pp (21→25 / 30, +4 q; the targeted, isolated, thrice-replicated intervention); TR +3.76 pp (116→121 / 133, +5 q; Novita run-to-run noise on a slice that did not change); MS −0.75 pp (116→115 / 133, −1 q within Novita ±1 pp noise); KU −1.28 pp (−1 q noise); SSU +1.43 pp (+1 q noise); SSA 0.00 pp (tied at near-ceiling). Net +8 questions, +1.60 pp over baseline. Clean attribution: +4 q from SSP K=48 plus +4 q from run-to-run noise on slices that did not change. Validates Field Notes Law 4: a simpler retrieval-side contract (one K-routing knob) outperforms the more complex hardened LIST rider that v2.1-K64+ tried in Phase 97.2.19.

Closed-reader SOTA (Hindsight v0.4.19 with Claude, 94.6%) remains untouched. Reader: novita://qwen/qwen3.6-27b (Apache-2.0, 27 B dense), per-qtype K routing (K=24 default, K=48 on SSP, K=64 on TR+MS) + v2.1-K64 rider package unchanged (chronological + forbidden-older + scan-ALL-64 + SHORT+ABS), conc=4, 0 fails on full pass. Total cost ~$1.40 (reader ~$0.55 + judge ~$0.85). See the frontier board ↓
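The per-qtype K routing described above can be sketched as a lookup table. K_BY_QTYPE is the name used on this page; K_DEFAULT, k_for and evidence_panel are hypothetical helpers added for illustration, and the real pipeline may structure this differently.

```python
from typing import List

K_DEFAULT = 24
# v2.1-K64.1 routing as stated on this page: K=48 on SSP, K=64 on TR and MS.
K_BY_QTYPE = {"SSP": 48, "TR": 64, "MS": 64}


def k_for(qtype: str) -> int:
    """Panel depth for a question type; unknown types fall back to the default."""
    return K_BY_QTYPE.get(qtype, K_DEFAULT)


def evidence_panel(ranked_chunks: List[str], qtype: str) -> List[str]:
    """Truncate an already-reranked chunk list to the routed depth."""
    return ranked_chunks[: k_for(qtype)]


chunks = [f"chunk-{i}" for i in range(64)]  # placeholder ids, not real data
for qtype in ("SSA", "SSP", "TR", "MS"):
    print(qtype, len(evidence_panel(chunks, qtype)))
```

Because routing depends only on the question type, it is deterministic and needs no access to gold answers.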

Methodology · retrieval recall vs QA accuracy Two different axes, both measured here. R@K asks "is the right memory in the evidence panel?". End-to-end QA accuracy asks "did the reader produce the right answer?". They are not interchangeable; the page reports both so you can compare like-for-like.
Retrieval · R@5
96.20%
481 / 500 · post-MixK rerank top-5
Retrieval · loose R@64
100.00%
500 / 500 · ≥1 gold session in top-64
End-to-end · strict QA
91.60%
458 / 500 · GPT-4o MV-3/5 strict judge

Why both? Retrieval recall (R@K) measures whether the correct memory appears in the retrieved evidence. End-to-end QA accuracy measures whether the reader produces the correct answer once that evidence is in front of it. They are different lanes and should be compared lane-by-lane — a system can win retrieval recall and still lose QA accuracy, or the reverse.

What 100.0% loose R@64 means precisely: at K=64, at least one gold session is present in the top-64 evidence panel for every one of the 500 questions. Strict R@64 = 99.0% (495/500): for multi-session questions strict R@K requires every gold session in top-K, and 5 MS questions still have at least one gold source missing at K=64. Loose R@K (the metric reported here and by the LongMemEval paper) is the recall ceiling that bounds end-to-end accuracy.
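The loose versus strict distinction can be made concrete with a small Python sketch. It assumes each question has a ranked list of session ids and a set of gold session ids; the data layout and names are illustrative, not the benchmark's actual file format.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    gold: Dict[str, Set[str]],
    k: int,
    strict: bool = False,
) -> float:
    """Fraction of questions whose top-k panel covers the gold sessions.

    loose  (strict=False): at least one gold session is in the top-k.
    strict (strict=True):  every gold session is in the top-k.
    """
    hits = 0
    for qid, gold_sessions in gold.items():
        top_k = set(ranked.get(qid, [])[:k])
        found = gold_sessions & top_k
        hits += (found == gold_sessions) if strict else bool(found)
    return hits / len(gold)


# Toy data (hypothetical ids): q2 is a multi-session question with two gold sessions.
ranked = {"q1": ["s1", "s9", "s4"], "q2": ["s2", "s7", "s3"]}
gold = {"q1": {"s1"}, "q2": {"s2", "s3"}}

print(recall_at_k(ranked, gold, k=2))               # loose: 1.0
print(recall_at_k(ranked, gold, k=2, strict=True))  # strict: 0.5
```

In the toy data, q2 counts as a loose hit at k=2 but a strict miss, which is the same pattern as the 5 MS questions that separate strict R@64 from loose R@64.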

How this lands vs MemPalace: MemPalace's published headline on LongMemEval is 96.6% R@5 raw and 100% R@5 hybrid — both retrieval recall numbers, the same metric family as our R@5 = 96.20% and loose R@64 = 100.00%. MemPalace does not report final-answer QA accuracy on this benchmark. Our 91.60% is the end-to-end strict-judge QA score on the same 500 questions, an independent axis. Different lane, different question. ManifoldMemory holds #1 in the open-weight single-reader QA accuracy lane on LongMemEval-S 500q; MemPalace's claim is in the retrieval-recall lane and should be read as such.

What this implies for the next ceiling-break: at loose R@64 = 100%, retrieval is no longer the limiting factor under loose gold-session coverage. The remaining 8.4 pp to perfect QA accuracy is split between (a) strict multi-source coverage for MS questions (5 strict-bound failures at K=64), (b) reader-side counting / list / qualifier reasoning on multi-session aggregations, (c) temporal-reasoning arithmetic precision, and (d) answer normalization under the strict judge protocol. Phase 97.3 work targets (a) and (b) at retain-time via auto-consolidation, atomic-fact extraction, and time-aware indexing: reader-side architecture, not retrieval.

MIX track ManifoldMemory MIX-v1-closed: 89.00% on LongMemEval-S 500q — first multi-reader configuration on the ManifoldMemory board.

+0.20 pp over the prior single-reader best (Gemma-31B v2.1 at 88.80%; statistical tie within Novita's ±1 pp run-to-run band; now superseded on a single reader by A3 Gemma stacked at 89.40%). Six candidate readers × contracts (gpt-5.5 v1 K=10, gpt-5.5 v2 K=24, claude-opus-4-6 v2 K=24, gemma-4-31b v2.1 K=24/48, claude-opus-4-6 v2.1 K=24/48, gemma-4-31b v2 K=24) produce independent answers.

gpt-5.5 medium acts as a blind evidence arbiter — sees only question, K=48 evidence panel, and neutrally-labeled candidate answers (A–F shuffled per qid). No qtype labels, no gold answers, no model identities, no contract ids. Synthesis fallback: 0.6% (3/500) — well under the 15% threshold; clean MIX label, not a re-reader.

Honest framing: not a single-model score; reported in its own MIX track. See the frontier board ↓

v2 K=24 board Qwen-3.6-27B (Apache-2.0) takes the v2 K=24 board: 88.60% on LongMemEval-S 500q — cheapest competitive reader.

+0.60 pp over Claude Opus 4.6 (88.00%, prior leader) at <1/100 the per-row cost ($0.42 vs $46.94), and +0.80 pp over Gemma-4-31B (87.80%, prior open-weight peer). Wins SSU, TR, KU, MS on slice (4 of 6 qtypes) under uniform K=24; MS reaches 81.20% without the K=48 routing patch — structural long-evidence advantage from Qwen's 256K-context architecture.

See the Qwen row ↓

v2.1 contract Gemma-4-31B-IT under the v2.1 routed contract: 88.80% on LongMemEval-S 500q — earlier ManifoldMemory high-water mark (now superseded by Qwen K=64 routed at 90.00% and A3 Gemma stacked at 89.40% above).

+0.20 pp over the v2 K=24 leader (Qwen-3.6-27B at 88.60%) and +0.80 pp over the prior Claude Opus 4.6 v2 K=24 row at ~66x lower per-row reader+judge cost ($0.71 total vs $47.34). Per-qtype K routing (K=24 default, K=48 on multi-session + temporal-reasoning only) recovers the multi-session retrieval bottleneck, lifting MS from 76.69% to 81.95% (+7 questions on slice) on the same reader and same canonical pipeline (BGE + QNDN + BM25 → RRF → MixK).

A1h Gemma-4-31B at 87.80% (uniform K=24) holds SOTA in the open-weight ≤40B reader class on the AMB·LongMemEval leaderboard (+4.2 pp over Hindsight + OSS-20B, +13.8 pp over the AMB-verified hybrid-search baseline). The same uniform K=24 contract scores 86.80% with GPT-5.5 medium (closed-weights, OpenAI direct) and 84.80% with the smaller A4B MoE variant (26B / ~4B active, single-24 GB-GPU footprint).

Behind the AMB-verified overall SOTA — Hindsight v0.4.19 at 94.6% — by 5.8 pp; that gap reflects architectural memory features (auto-consolidation, time-aware retrieval, multi-session graph linkage at retain-time) that Phase 97.3 will build. Honest framing: this is ManifoldMemory's open-weight best on a routed K=24/48 contract, not a global LME-S SOTA claim.

See the frontier board ↓

Read this first Scores are reader-side results over a frozen ManifoldMemory retrieval artifact — systems do not bring their own retrieval.

The frontier board below ranks every ManifoldMemory row that goes beyond the v1 canonical contract (six tracks: v2.1-K64, A3 stacked, K=48 v2, MIX-v1-closed, v2.1 K=24/48, v2 K=24) by overall accuracy. Each row carries a coloured track pill linking back to its contract definition (see Tracks explained at the top of that section). The v1 main canonical board further down uses the v1 contract (top-10 chunks, canonical prompt v1) so every row stays apples-to-apples with the published K=10 baseline.

Same 500 LongMemEval-S questions, same GPT-4o judge (K=5, 3-of-5 majority) across all contracts. This is not a general LongMemEval leaderboard.
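The 3-of-5 majority rule can be sketched in a few lines of Python. The seed tuple matches the seeds listed in the build line of this page; the function names and the boolean-verdict representation are assumptions for illustration only.

```python
from typing import Sequence

SEEDS = (42, 1, 2, 3, 4)  # judge seeds listed on this page


def majority_correct(verdicts: Sequence[bool], threshold: int = 3) -> bool:
    """A question passes when at least `threshold` of the seeded judge runs say correct."""
    if len(verdicts) != len(SEEDS):
        raise ValueError(f"expected {len(SEEDS)} verdicts, got {len(verdicts)}")
    return sum(verdicts) >= threshold


def accuracy(per_question_verdicts: Sequence[Sequence[bool]]) -> float:
    """Share of questions that pass the majority vote."""
    passed = sum(majority_correct(v) for v in per_question_verdicts)
    return passed / len(per_question_verdicts)


runs = [
    [True, True, True, False, False],   # 3 of 5: counted correct
    [True, True, False, False, False],  # 2 of 5: counted incorrect
]
print(accuracy(runs))  # 0.5
```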

Built 2026-05-01T01:53Z · 45 configurations (44 single-reader + 1 MIX) · 500 questions each · K=5 seeds {42,1,2,3,4}

Frozen retrieval. Identical for every row.

First-stage union of three retrievers, fused via reciprocal rank, reranked by a MixK cross-encoder, top-10 chunks delivered to the reader. Same 500 LongMemEval-S questions, same haystack, same chunk boundaries, same answer prompts, same judge.

First stage
BGE-large-en-v1.5 ∪ QNDN v0 ∪ BM25
Fusion
Reciprocal Rank Fusion (RRF k=60)
Rerank
MixK-trained cross-encoder
Delivered to reader
top-10 chunks to reader (post-MixK rerank from 500 candidates)
Recall@5
96.20%  ·  481/500 questions
Reader gap
32.4 pp e2e upside available given retrieval contract
Ceiling captured
per row: overall acc / R@5 — share of the 96.2% retrieval ceiling captured by reader synthesis
Judge
GPT-4o, K=5 seeds, 3-of-5 majority
Benchmark
LongMemEval-S (LME-S 500-Q)
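For readers unfamiliar with the fusion step, here is a minimal sketch of Reciprocal Rank Fusion with k=60 over three ranked lists. The retriever keys and chunk ids are placeholders, and the production pipeline's candidate depth and tie-breaking are not documented here, so treat the details as assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per chunk, rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


rankings = {
    "bge": ["a", "b", "c"],   # dense retriever (placeholder results)
    "qndn": ["b", "a", "d"],  # second retriever (placeholder results)
    "bm25": ["b", "c", "a"],  # lexical retriever (placeholder results)
}
print(rrf_fuse(rankings))  # ['b', 'a', 'c', 'd']
```

The fused list would then go to the cross-encoder reranker, which this sketch does not model.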

The frontier board. 20 rows, ten contracts, one ranking.

Every ManifoldMemory row that goes beyond the v1 canonical contract, ranked together by overall accuracy on LongMemEval-S 500q. Each row shows a coloured track pill (v2.1-K64, A3 stacked, K=48 v2, MIX-v1-closed, v2.1 K=24/48, v2 K=24). Full contract definitions are under Tracks explained below.

Tracks explained
v2-K64-NoChrono+T16K   v2-K64-NoChrono + 16K thinking · closed-weights thinking-budget axis

Same canonical retrieval pipeline + same K-routing + same NoChrono prompts as track-v2-k64-noproto. The ONLY varying element is the Anthropic API thinking parameter: disabled in the baseline, enabled at budget_tokens=16384 here (with the API-mandated temperature=1.0).

Phase 97.9 result on Opus 4.6 NoChrono: 92.00 % (460/500), +0.40 pp = +2 q over the NoChrono baseline at 91.60 %. 95 % CI on the paired diff is [−1.60, +2.40] — includes zero, so the lift is sampling-noise-scale; H_S1 (clean SOTA beat) FAILED, H_S2 (marginal lift) TRIGGERED. Net flips +2 (wins 14, losses 12). Phase 97.8 subset projection of +11 flips eroded ~5× at the full-set level by 12 regressions on previously-passing qids; subset transfer audit shows 9/11 of the flips reproduced under the full-set judge (82 %, within sampling-noise budget). Honest framing: HIGHEST point-estimate accuracy on the board, but NOT a new SOTA claim — closed-weights, CI overlaps, H_S1 missed by 1 question. Reported as a separate track for methodological transparency.
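The page does not state how the paired-difference interval was computed. One standard choice, a normal-approximation (Wald) interval on the difference of paired proportions, reproduces the reported bounds from the published win/loss counts; the sketch below is offered as a plausible reconstruction, not as the authors' method.

```python
import math
from typing import Tuple


def paired_diff_ci(wins: int, losses: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wald CI for the accuracy difference between two runs on the same n questions.

    wins   = questions the new run gets right and the baseline gets wrong
    losses = questions the baseline gets right and the new run gets wrong
    """
    diff = (wins - losses) / n
    variance = (wins + losses - (wins - losses) ** 2 / n) / n ** 2
    half_width = z * math.sqrt(variance)
    return diff - half_width, diff + half_width


low, high = paired_diff_ci(wins=14, losses=12, n=500)
print(f"[{100 * low:+.2f}, {100 * high:+.2f}] pp")  # [-1.60, +2.40] pp
```

Since the interval spans zero, the +2 q net flip cannot be distinguished from sampling noise, which matches the framing above.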

v2.1-K64   v2.1-K64 contract · K=24/64 routed + v2.1 rider package

Per-qtype K routing with K=64 on TR+MS only, plus the v2.1 rider package (chronological + KU forbidden-older + scan-ALL-64 + SHORT+ABS).

Same canonical retrieval pipeline as v2 (BGE+QNDN+BM25 → RRF k=60 → MixK rerank). Two changes from v2.1 K=24/48: (1) K raised to K=24/64, i.e. K=24 by default and K=64 on the two retrieval-bottleneck qtypes (TR + MS) only; (2) SYSTEM_RIDER_LIST scan-ALL count updated from "Scan ALL 48" → "Scan ALL 64" so the rider's self-described K matches the K=64 evidence delivered to MS. The K=64 cache is a byte-identical strict superset of the K=48 and K=24 caches (verified 500/500 byte-match at top-48 and top-24). Strict R@64 = 99.0% (495/500), loose R@64 = 100.0%: the first contract where at least one gold session is in the evidence panel for every question. First ManifoldMemory single-reader to clear 90% on LME-S 500q.

A3 stacked   A3 stacked contract · per-qtype best-in-slot K + rider stacked

Stacks F_K=48 (where retrieval depth wins) and F_v2.1-rider (where rider+routing wins) per qtype on Gemma-4-31B.

Routing fixed before evaluation, applied to all 500 questions consistently, no gold-answer / judged-correct / post-hoc picking. Per-qtype routing table: SSA K=48+SHORT · SSU K=48+SHORT · SSP K=24+PREF · TR K=48+TEMPORAL · KU K=24+KU(forbidden-older) · MS K=48+LIST(K=48) · _abs SHORT+ABS shortcut. Single byte-level prompt diff vs v2.1: SYSTEM_RIDER_LIST "Scan ALL 24" → "Scan ALL 48" so the rider's self-described K matches the K=48 evidence delivered to MS. Stacking is additive within Novita's ±1 pp non-determinism: A2 K=48 (89.00%) + recovered KU/_abs from v2.1 (~+0.40 pp) = predicted 89.40%, landed at 89.40%.

K=48 v2   K=48 v2 contract · uniform K=48, v2 prompts, no rider

F_K=48 force-isolation diagnostic. Same v2 prompts as the K=24 board, no rider, every qtype receives K=48 chunks.

The K=48 cache is a byte-identical strict superset of the public K=24 artifact (verified 500/500 byte-match on top-24 chunks per qid), so cross-contract comparison is exact across the top 24 ranks. By holding rider+routing fixed at the v2 baseline and only changing K, this contract decomposes the v2.1 contract's win into two independently attributable forces: F_K=48 (this contract) and F_v2.1-rider (the rider+routing diff). Per-reader F_K=48 isolation: Qwen MS +5 questions (K=24→K=48), Gemma SSA +2 / SSU +1 / TR −1 / MS +1. Both readers regress on KU and _abs at K=48 without the rider, so F_v2.1-rider is the right knob for those qtypes, not F_K.

MIX-v1-closed   MIX-v1-closed · multi-reader champion pool + GPT-5.5 evidence arbiter

First multi-reader configuration on the ManifoldMemory board. Six pre-registered candidate readers; GPT-5.5 medium blindly arbitrates each question.

Six candidate readers × contracts (gpt-5.5 v1 K=10, gpt-5.5 v2 K=24, claude-opus-4-6 v2 K=24, gemma-4-31b v2.1 K=24/48, claude-opus-4-6 v2.1 K=24/48, gemma-4-31b v2 K=24) produce independent answers; gpt-5.5 medium then arbitrates each question by reading question + K=48 evidence panel + neutrally-labeled candidate answers (A–F shuffled per qid with sha256-hashed seed). The arbiter sees no qtype labels, no gold answers, no model identities, no contract ids — only question, evidence, and candidate text. Pre-registered protocol with frozen prompt fingerprints (system sha256 5a3df742a1b2f5f4, user template sha256 a3071a14cbb9a31a, shuffle seed MIX-v1-closed-2026-04-29-shuffle-seed-42). Synthesis fallback (selected_label = S) only when no candidate is supportable: 3/500 = 0.6% synth rate, well under the 15% threshold. Reported in its own MIX track — not a single-model score.
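The page says candidate labels A–F are shuffled per qid with a sha256-hashed seed but does not give the exact derivation. The sketch below shows one way to get a reproducible per-question shuffle; the way the seed phrase and qid are combined, and the function name, are assumptions.

```python
import hashlib
import random
from typing import Dict, List

LABELS = "ABCDEF"
SEED_PHRASE = "MIX-v1-closed-2026-04-29-shuffle-seed-42"  # seed string quoted on this page


def label_candidates(qid: str, answers: List[str], seed_phrase: str = SEED_PHRASE) -> Dict[str, str]:
    """Assign neutral labels to candidate answers in a per-question shuffled order."""
    if len(answers) > len(LABELS):
        raise ValueError("more candidates than labels")
    # Assumed derivation: hash "seed:qid" and use the first 8 bytes as the RNG seed.
    digest = hashlib.sha256(f"{seed_phrase}:{qid}".encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    order = list(range(len(answers)))
    rng.shuffle(order)
    return {LABELS[i]: answers[src] for i, src in enumerate(order)}


answers = ["ans-1", "ans-2", "ans-3", "ans-4", "ans-5", "ans-6"]  # placeholder texts
print(label_candidates("q-001", answers))
```

The same qid always yields the same labeling, so an arbitration run can be replayed, while the arbiter cannot infer which reader produced which candidate from label position.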

v2.1 K=24/48   v2.1 contract · K=24/48 routed + chronological rider

Per-qtype K routing — K=24 default, K=48 on TR + MS only — plus a tightened chronological + KU rider package.

Three changes from v2 uniform K=24: (1) per-qtype K routing — deterministic from question_type, no gold-answer picking; (2) upgraded SYSTEM_RIDER_TEMPORAL with an explicit 7-step chronological protocol (identify events → list YYYY/MM/DD → sort → use QUESTION_DATE as today → compute → show arithmetic → Answer); (3) tightened knowledge-update rider (latest-value enforcement). The K=48 cache used here is a byte-identical strict superset of the public K=24 artifact (top-24 of K=48 == K=24 cache, verified 500/500 byte-match). Reader-specific: Gemma uses the chronological rider well (TR +1 q, KU +rider gain) while Opus collapses on TR under the same rider (−15.79 pp) — the Field Notes Law 2 reader-specific scaffold finding.

v2 K=24   v2 contract · uniform K=24 + Track 2-lite v2 prompts

Same canonical retrieval pipeline as v1, with two changes: top-K bumped 10 → 24, and Track 2-lite v2 prompts (per-qtype oracle routing + LME-S session-date back-fill).

R@24 = 99.6% (498/500). The K=24 artifact is a strict superset of the published K=10 artifact — sanity-checked for chunk-text equality & cosine-drift on the 500-q overlap. The +12.4 pp lift over the K=10 + canonical-v1 Claude row decomposes as roughly +3.2 pp from K=24 and +9.2 pp from the Track 2-lite v2 prompt + date back-fill — the prompt + data layer is doing most of the work, not the reader's parameter count or the retrieval contract. Qwen-3.6-27B (open-weights), Claude Opus 4.6 (closed-weights frontier), and Gemma-4-31B-IT all converge to the 87.8–88.6% band on this recipe.

# · Track · Reader / Pipeline · Overall acc (95% CI) · % of ceiling · SSA · SSU · SSP · TR · KU · MS · Ref %
#1 v2-K64-NoChrono+T16K
claude-opus-4-6 + 16K thinking closed-weights · #1 by point estimate · NOT a new SOTA (CI overlaps 91.60%)
Stack v2-K64-NoChrono + 16K thinking (per-qtype K routing: K=24 default, K=64 on TR+MS; v2 NoChrono prompts unchanged from track-v2-k64-noproto; reader thinking enabled at budget_tokens=16384 + API-mandated temperature=1.0)
92.0% (95% CI 89.3–94.1%) · 92.0% of ceiling · SSA 94.6% (53/56) · SSU 97.1% (68/70) · SSP 83.3% (25/30) · TR 92.5% (123/133) · KU 96.2% (75/78) · MS 87.2% (116/133) · Ref 0.0%
#2 v2.1-K64.1
qwen-3.6-27b-novita SOTA · open-weight (any size) · new ManifoldMemory best
Stack v2.1-K64.1 (per-qtype K routing: K=24 default, K=48 on SSP, K=64 on TR+MS; v2.1-K64 prompts unchanged; chronological + forbidden-older + scan-ALL-64 + SHORT+ABS riders)
91.6% (95% CI 88.8–93.7%) · 91.6% of ceiling · SSA 96.4% (54/56) · SSU 98.6% (69/70) · SSP 83.3% (25/30) · TR 91.0% (121/133) · KU 94.9% (74/78) · MS 86.5% (115/133) · Ref 0.0%
#3 v2.1-K64.1
gemma-4-31b-novita SOTA tied · open-weight (any size) · cross-reader replication
Stack v2.1-K64.1 (per-qtype K routing: K=24 default, K=48 on SSP, K=64 on TR+MS; v2.1-K64 prompts unchanged; chronological + forbidden-older + scan-ALL-64 + SHORT+ABS riders)
91.6% (95% CI 88.8–93.7%) · 91.6% of ceiling · SSA 92.9% (52/56) · SSU 97.1% (68/70) · SSP 83.3% (25/30) · TR 92.5% (123/133) · KU 98.7% (77/78) · MS 85.0% (113/133) · Ref 0.0%
#4 v2-K64-NoChrono
claude-opus-4-6 closed-weights BiS · ties open-weight SOTA · protocol mitigation
Stack v2-K64-NoChrono (per-qtype K routing: K=24 default, K=64 on TR+MS; v2 prompts; 7-step chronological protocol REPLACED by SHORT+TEMPORAL_HINT on TR only; KU/LIST/_abs unchanged from v2)
91.6% (95% CI 88.8–93.7%) · 91.6% of ceiling · SSA 96.4% (54/56) · SSU 97.1% (68/70) · SSP 86.7% (26/30) · TR 91.0% (121/133) · KU 94.9% (74/78) · MS 86.5% (115/133) · Ref 0.0%
#5 v2.1-K64
gemma-4-31b-novita prior SOTA · v2.1-K64 baseline · different reader
Stack v2.1-K64 (per-qtype K routing: K=24 default, K=64 on TR+MS; v2.1 rider package: chronological + forbidden-older + scan-ALL-64 + SHORT+ABS)
91.0% (95% CI 88.2–93.2%) · 91.0% of ceiling · SSA 92.9% (52/56) · SSU 95.7% (67/70) · SSP 80.0% (24/30) · TR 91.7% (122/133) · KU 96.2% (75/78) · MS 86.5% (115/133) · Ref 0.0%
° #6 EventOS-v0
gemma-4-31b-novita event-sidecar architectural diagnostic · not a SOTA candidate
Stack EventOS-v0 (per-qtype K routing identical to v2.1-K64.1: K=24 default, K=48 on SSP, K=64 on TR+MS; v2.1-K64 prompts unchanged + EVENT-INDEX HINT line on TEMPORAL + LIST riders only; user message gains a single-pass EVENT INDEX block of up to 64 events sourced from offline Gemini Flash extraction over K=64 panel sessions)
90.8% (95% CI 87.9–93.0%) · 90.8% of ceiling · SSA 94.6% (53/56) · SSU 97.1% (68/70) · SSP 80.0% (24/30) · TR 90.2% (120/133) · KU 94.9% (74/78) · MS 86.5% (115/133) · Ref 0.0%
° #7 v2.1-K64
qwen-3.6-27b-novita first 90% · v2.1-K64 baseline · same reader as #2
Stack v2.1-K64 (per-qtype K routing: K=24 default, K=64 on TR+MS; v2.1 rider package: chronological + forbidden-older + scan-ALL-64 + SHORT+ABS)
90.0% (95% CI 87.1–92.3%) · 90.0% of ceiling · SSA 96.4% (54/56) · SSU 97.1% (68/70) · SSP 70.0% (21/30) · TR 87.2% (116/133) · KU 96.2% (75/78) · MS 87.2% (116/133) · Ref 0.0%
° #8 A3 stacked
gemma-4-31b-novita earlier high-water mark
Stack v2.A3 (per-qtype K + rider stacked: SSA/SSU=K48+SHORT, SSP=K24+PREF, TR=K48+TEMPORAL, KU=K24+KU, MS=K48+LIST(K=48), _abs=SHORT+ABS)
89.4% (95% CI 86.4–91.8%) · 89.8% of ceiling · SSA 94.6% (53/56) · SSU 94.3% (66/70) · SSP 83.3% (25/30) · TR 89.5% (119/133) · KU 93.6% (73/78) · MS 83.5% (111/133) · Ref 0.0%
° #9 K=48 v2
qwen-3.6-27b-novita cheap K=48 baseline · open-weight
Stack v2.K48u (canonical retrieval @ K=48 uniform; v2 prompts; no rider)
89.2% (95% CI 86.2–91.7%) · 89.6% of ceiling · SSA 94.6% (53/56) · SSU 95.7% (67/70) · SSP 76.7% (23/30) · TR 88.0% (117/133) · KU 93.6% (73/78) · MS 85.0% (113/133) · Ref 0.0%
° #10 K=48 v2
gemma-4-31b-novita
Stack v2.K48u (canonical retrieval @ K=48 uniform; v2 prompts; no rider)
89.0% (95% CI 85.9–91.5%) · 89.4% of ceiling · SSA 96.4% (54/56) · SSU 95.7% (67/70) · SSP 80.0% (24/30) · TR 88.0% (117/133) · KU 93.6% (73/78) · MS 82.7% (110/133) · Ref 0.0%
° #11 MIX-v1-closed
MIX-v1-closed (champion pool + GPT-5.5 arbiter)
MIX (6-candidate champion pool + GPT-5.5 evidence arbiter; K=48 evidence panel; pick-or-synthesize)
89.0% (95% CI 85.9–91.5%) · 89.4% of ceiling · SSA 96.4% (54/56) · SSU 98.6% (69/70) · SSP 66.7% (20/30) · TR 87.2% (116/133) · KU 92.3% (72/78) · MS 85.7% (114/133) · Ref 0.0%
° #12 v2.1 K=24/48
gemma-4-31b-novita
Stack v2.1 (canonical retrieval @ K=24/48 per-qtype; Track 2-lite v2 + chronological rider)
88.8% (95% CI 85.7–91.3%) · 89.2% of ceiling · SSA 92.9% (52/56) · SSU 94.3% (66/70) · SSP 80.0% (24/30) · TR 88.7% (118/133) · KU 96.2% (75/78) · MS 82.0% (109/133) · Ref 0.0%
° #13 v2 K=24
qwen-3.6-27b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
88.6% (95% CI 85.5–91.1%) · 89.0% of ceiling · SSA 94.6% (53/56) · SSU 98.6% (69/70) · SSP 73.3% (22/30) · TR 87.2% (116/133) · KU 96.2% (75/78) · MS 81.2% (108/133) · Ref 0.0%
° #14 v2 K=24
claude-opus-4-6
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
88.0% (95% CI 84.9–90.6%) · 88.4% of ceiling · SSA 96.4% (54/56) · SSU 95.7% (67/70) · SSP 86.7% (26/30) · TR 86.5% (115/133) · KU 93.6% (73/78) · MS 78.9% (105/133) · Ref 5.8%
° #15 v2 K=24
gemma-4-31b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
87.8% (95% CI 84.6–90.4%) · 88.2% of ceiling · SSA 96.4% (54/56) · SSU 95.7% (67/70) · SSP 86.7% (26/30) · TR 87.2% (116/133) · KU 94.9% (74/78) · MS 76.7% (102/133) · Ref 6.6%
° #16 v2-K64-DropStep2
claude-opus-4-6 protocol bisection diagnostic · not a SOTA candidate
Stack v2-K64-DropStep2 (per-qtype K routing: K=24 default, K=64 on TR+MS; v2 prompts with chronological protocol's STEP 2 ablated; remaining steps 1, 3-7 byte-identical, renumbered 1, 2-6)
87.8% (95% CI 84.6–90.5%) · 87.8% of ceiling · SSA 92.9% (52/56) · SSU 98.6% (69/70) · SSP 83.3% (25/30) · TR 76.7% (102/133) · KU 94.9% (74/78) · MS 88.0% (117/133) · Ref 0.0%
° #17 v2 K=24
gpt-5.5-openai
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
86.8% (95% CI 83.6–89.5%) · 87.1% of ceiling · SSA 96.4% (54/56) · SSU 98.6% (69/70) · SSP 80.0% (24/30) · TR 82.0% (109/133) · KU 92.3% (72/78) · MS 79.7% (106/133) · Ref 0.0%
° #18 v2.1 K=24/48
claude-opus-4-6
Stack v2.1 (canonical retrieval @ K=24/48 per-qtype; Track 2-lite v2 + chronological rider)
85.0% (95% CI 81.6–87.9%) · 85.3% of ceiling · SSA 96.4% (54/56) · SSU 97.1% (68/70) · SSP 83.3% (25/30) · TR 70.7% (94/133) · KU 92.3% (72/78) · MS 84.2% (112/133) · Ref 0.0%
° #19 v2 K=24
gemma-4-26b-a4b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
84.8% (95% CI 82.0–88.2%) · 85.1% of ceiling · SSA 94.6% (53/56) · SSU 97.1% (68/70) · SSP 70.0% (21/30) · TR 79.7% (106/133) · KU 94.9% (74/78) · MS 76.7% (102/133) · Ref 7.6%
° #20 v2 K=24
gpt-oss-120b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
84.6% (95% CI 81.2–87.5%) · 84.9% of ceiling · SSA 94.6% (53/56) · SSU 95.7% (67/70) · SSP 70.0% (21/30) · TR 82.7% (110/133) · KU 93.6% (73/78) · MS 74.4% (99/133) · Ref 7.6%

Frozen retrieval artifacts: K=24 (R@24=99.6%) · K=48 cache (sha256 f973dd45797f344a..., R@48=100.0%) · K=64 cache (R@64-loose=100.0%, R@64-strict=99.0%). The three artifacts nest as byte-identical strict supersets: top-24 of K=48 == K=24 cache and top-48 of K=64 == K=48 cache, both verified 500/500 byte-match. Same 500 LongMemEval-S questions, same GPT-4o judge (K=5, 3-of-5 majority) across all six contracts.
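The nesting check described above (the top-24 of the K=48 cache equals the K=24 cache, per qid) can be sketched as follows. The cache layout, a mapping from qid to a ranked list of chunk texts, and the function name are assumptions; the real artifacts may be stored differently.

```python
from typing import Dict, List

Cache = Dict[str, List[str]]  # qid -> ranked chunk texts (hypothetical layout)


def prefix_mismatches(small: Cache, large: Cache, k_small: int) -> List[str]:
    """Return the qids whose top-k_small chunks in `large` differ byte-for-byte from `small`."""
    bad = []
    for qid, chunks in small.items():
        expected = [c.encode("utf-8") for c in chunks[:k_small]]
        actual = [c.encode("utf-8") for c in large.get(qid, [])[:k_small]]
        if expected != actual:
            bad.append(qid)
    return bad


# Tiny stand-ins for the K=24 and K=48 caches, truncated to depth 2 and 3.
k24 = {"q1": ["a", "b"], "q2": ["c", "d"]}
k48 = {"q1": ["a", "b", "e"], "q2": ["c", "d", "f"]}
print(prefix_mismatches(k24, k48, k_small=2))  # [] means the caches nest
```

An empty result over all 500 qids corresponds to the "500/500 byte-match" statement.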

The reader is the variable. Retrieval is held fixed.

Open-weights readers ranked by overall accuracy on the same 500 LongMemEval-S questions and the same top-10 evidence chunks (post-MixK rerank). Two rows carry ManifoldMemory's production tags: ★ Open-Max — the max-quality production policy, currently 73.4% — and ☆ Efficient — the cheaper-to-self-host production policy, currently 70.0%. Every other row is a comparison anchor under the same retrieval contract.

Hosting matters: Open-Max, Efficient, experimental, retrieval-only, no retrieval, and open swap rows are self-hosted on our own H100/A100/H200 cluster — we control checkpoint, prompt, runtime, and quantisation. Open · API-hosted rows are open-weights readers run through the vendor's hosted API; weights are public, but provider-side serving template, sampling defaults, and version routing are not under our control, so those rows are reported as hosted open-weight references in the same ranked board. Refusals are scored as incorrect in accuracy; the Ref % column reports them separately because abstention on missing evidence is a different failure mode from fabrication.

# · Track · Reader / Pipeline · Overall acc (95% CI) · % of ceiling · SSA · SSU · SSP · TR · KU · MS · Ref %
#2 ManifoldMemory · Open-Max
gpt-oss-120b @ high
Stack + F3-on-TR (oracle qtype: TR -> F3 date-anchor, else -> Stack; no Naked routing)
73.4% (95% CI 69.4–77.1%) · 76.3% of ceiling · SSA 96.4% (54/56) · SSU 92.9% (65/70) · SSP 40.0% (12/30) · TR 68.4% (91/133) · KU 87.2% (68/78) · MS 57.9% (77/133) · Ref 17.6%
#2 Open swap
gpt-oss-120b @ high
Stack (frozen retrieval contract; canonical prompt v1)
72.0% (95% CI 67.9–75.8%) · 74.8% of ceiling · SSA 96.4% (54/56) · SSU 92.9% (65/70) · SSP 40.0% (12/30) · TR 63.2% (84/133) · KU 87.2% (68/78) · MS 57.9% (77/133) · Ref 18.4%
#3 Experimental
Gemma-4-26B-A4B-it
Hybrid + F3-on-TR (oracle qtype + date-anchor on TR slice)
71.4% (95% CI 67.3–75.2%) · 74.2% of ceiling · SSA 92.9% (52/56) · SSU 88.6% (62/70) · SSP 26.7% (8/30) · TR 63.9% (85/133) · KU 74.4% (58/78) · MS 69.2% (92/133) · Ref 10.8%
#4 Open · API-hosted
gemma-4-31b-novita
Stack (frozen retrieval contract; canonical prompt v1)
70.8% (95% CI 66.7–74.6%) · 73.6% of ceiling · SSA 94.6% (53/56) · SSU 91.4% (64/70) · SSP 33.3% (10/30) · TR 63.2% (84/133) · KU 83.3% (65/78) · MS 58.6% (78/133) · Ref 18.8%
#5 Open swap
gpt-oss-120b @ high
Hybrid v2 + F3-on-TR (corrected Hybrid: SSA -> Naked v2, TR -> F3 date-anchor, else -> Stack)
70.4% (95% CI 66.3–74.2%) · 73.2% of ceiling · SSA 69.6% (39/56) · SSU 92.9% (65/70) · SSP 40.0% (12/30) · TR 68.4% (91/133) · KU 87.2% (68/78) · MS 57.9% (77/133) · Ref 17.4%
#6 ManifoldMemory · Efficient
Gemma-4-26B-A4B-it
Hybrid (oracle qtype: SSA→Naked, else→Stack)
70.0% (95% CI 65.8–73.9%) · 72.8% of ceiling · SSA 94.6% (53/56) · SSU 88.6% (62/70) · SSP 33.3% (10/30) · TR 56.4% (75/133) · KU 73.1% (57/78) · MS 69.9% (93/133) · Ref 10.0%
#7 Open swap
gpt-oss-120b @ high
Hybrid v2 (corrected Hybrid: SSA -> Naked v2 [haystack-first prompt], else -> Stack)
69.0% (95% CI 64.8–72.9%) · 71.7% of ceiling · SSA 69.6% (39/56) · SSU 92.9% (65/70) · SSP 40.0% (12/30) · TR 63.2% (84/133) · KU 87.2% (68/78) · MS 57.9% (77/133) · Ref 18.2%
#8 Open · API-hosted
deepseek-v4-pro
Stack (frozen retrieval contract; canonical prompt v1)
67.4% (95% CI 63.2–71.4%) · 70.1% of ceiling · SSA 96.4% (54/56) · SSU 85.7% (60/70) · SSP 36.7% (11/30) · TR 63.2% (84/133) · KU 80.8% (63/78) · MS 48.9% (65/133) · Ref 25.8%
#9 Open swap
Qwen3.6-27B
Hybrid (oracle qtype: SSA→Naked, else→Stack)
66.2% (95% CI 61.9–70.2%) · 68.8% of ceiling · SSA 96.4% (54/56) · SSU 88.6% (62/70) · SSP 40.0% (12/30) · TR 45.9% (61/133) · KU 71.8% (56/78) · MS 64.7% (86/133) · Ref 14.8%
#10 Open · API-hosted
Qwen2.5-72B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
65.2% (95% CI 60.9–69.2%) · 67.8% of ceiling · SSA 96.4% (54/56) · SSU 87.1% (61/70) · SSP 33.3% (10/30) · TR 53.4% (71/133) · KU 74.4% (58/78) · MS 54.1% (72/133) · Ref 21.8%
#11 Open swap
gpt-oss-20b @ low
Stack (frozen retrieval contract; canonical prompt v1)
64.0% (95% CI 59.7–68.1%) · 66.5% of ceiling · SSA 94.6% (53/56) · SSU 84.3% (59/70) · SSP 43.3% (13/30) · TR 49.6% (66/133) · KU 83.3% (65/78) · MS 48.1% (64/133) · Ref 24.8%
#12 Retrieval-only
Gemma-4-26B-A4B-it
Stack (BGE∪QNDN∪BM25 → RRF → top-10)
63.8% (95% CI 59.5–67.9%) · 66.3% of ceiling · SSA 39.3% (22/56) · SSU 88.6% (62/70) · SSP 33.3% (10/30) · TR 55.6% (74/133) · KU 75.6% (59/78) · MS 69.2% (92/133) · Ref 14.4%
#13 Open · API-hosted
gpt-oss-20b @ high
Stack (frozen retrieval contract; canonical prompt v1)
62.8% (95% CI 58.5–66.9%) · 65.3% of ceiling · SSA 89.3% (50/56) · SSU 84.3% (59/70) · SSP 23.3% (7/30) · TR 57.9% (77/133) · KU 73.1% (57/78) · MS 48.1% (64/133) · Ref 19.2%
#14 Retrieval-only
Qwen3.6-27B
Stack (same retrieval contract)
60.0% (95% CI 55.6–64.2%) · 62.4% of ceiling · SSA 37.5% (21/56) · SSU 90.0% (63/70) · SSP 40.0% (12/30) · TR 46.6% (62/133) · KU 73.1% (57/78) · MS 63.9% (85/133) · Ref 21.2%
#15 Open · API-hosted
kimi-k2.6
Stack (frozen retrieval contract; canonical prompt v1)
58.6% (95% CI 54.2–62.8%) · 60.9% of ceiling · SSA 92.9% (52/56) · SSU 80.0% (56/70) · SSP 16.7% (5/30) · TR 54.1% (72/133) · KU 61.5% (48/78) · MS 45.1% (60/133) · Ref 19.6%
#16 No retrieval
Qwen3.6-27B
Naked (no retrieval)
58.4% (95% CI 54.0–62.6%) · 60.7% of ceiling · SSA 96.4% (54/56) · SSU 91.4% (64/70) · SSP 16.7% (5/30) · TR 36.1% (48/133) · KU 75.6% (59/78) · MS 46.6% (62/133) · Ref 38.8%
#17 Experimental
Gemma-4-26B-A4B-it
CWFIX (Stack + F1 refusal-retry + F2 chronological-KU + F3 date-anchor, all qtypes)
56.6% (95% CI 52.2–60.9%) · 58.8% of ceiling · SSA 30.4% (17/56) · SSU 87.1% (61/70) · SSP 20.0% (6/30) · TR 62.4% (83/133) · KU 64.1% (50/78) · MS 49.6% (66/133) · Ref 20.4%
#18 No retrieval
Gemma-4-26B-A4B-it
Naked (no retrieval, LME-S 110 K-token haystack)
53.2% (95% CI 48.8–57.5%) · 55.3% of ceiling · SSA 94.6% (53/56) · SSU 88.6% (62/70) · SSP 13.3% (4/30) · TR 33.8% (45/133) · KU 65.4% (51/78) · MS 38.3% (51/133) · Ref 39.2%
#19 Open swap
Qwen2.5-7B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
25.0% (95% CI 21.4–29.0%) · 26.0% of ceiling · SSA 57.1% (32/56) · SSU 38.6% (27/70) · SSP 20.0% (6/30) · TR 18.8% (25/133) · KU 21.8% (17/78) · MS 13.5% (18/133) · Ref 52.2%
#20 Open swap
Qwen2.5-3B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
12.4% (95% CI 9.8–15.6%) · 12.9% of ceiling · SSA 10.7% (6/56) · SSU 14.3% (10/70) · SSP 6.7% (2/30) · TR 10.5% (14/133) · KU 15.4% (12/78) · MS 13.5% (18/133) · Ref 23.6%
#21 Open swap
Qwen2.5-1.5B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
8.8% (95% CI 6.6–11.6%) · 9.1% of ceiling · SSA 10.7% (6/56) · SSU 15.7% (11/70) · SSP 6.7% (2/30) · TR 6.8% (9/133) · KU 9.0% (7/78) · MS 6.8% (9/133) · Ref 51.2%


qtype legend: SSA = single-session-assistant · SSU = single-session-user · SSP = single-session-preference · TR = temporal-reasoning · KU = knowledge-update · MS = multi-session. Cells show acc% over k/n after K=5 3-of-5 GPT-4o majority vote. Hybrid rows route by oracle qtype label from LongMemEval-S; the production-equivalent learned classifier is pending and is expected to land within 1–3 pp of oracle. Naked rows have no retrieval, only the full LME-S 110 K-token haystack.
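The page does not say how the 95% CI column is computed. The published intervals are consistent with a Wilson score interval on k/n, so the sketch below uses that method as an inference rather than a documented fact; the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(k: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(460, 500)  # the 92.0% row (460/500)
print(f"{100 * low:.1f} to {100 * high:.1f}")  # 89.3 to 94.1
```

The same function applied to 367/500 gives roughly 69.4 to 77.1, matching the Open-Max row.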

Behavioural projection of the reader manifold. Pipeline acts as a directional transformation.

Is this benchmark just who has highest accuracy? No — the single accuracy column is a scalar collapse of a richer 2D structure. Decomposed by qtype family, each reader+pipeline pair lives at a point in the plane below: X = associative reading (answer-in-one-chunk lookup), Y = geometric synthesis (answer-across-chunks composition). Same axes Google / CMU use to distinguish associative from geometric internal memory in deep sequence models (Noroozizadeh et al. 2025) — but read off behaviour, not hidden states.

Behavioural projection of the reader manifold: 2D scatter of top-10 reader+pipeline pairs. X-axis is associative memory strength (avg of single-session-assistant and single-session-user accuracies); Y-axis is geometric memory strength (avg of single-session-preference, temporal-reasoning, knowledge-update, and multi-session accuracies). Marker size scales with overall accuracy; ring opacity scales with answer reliability (1 minus refusal rate). Axes regenerate from the leaderboard data on every build.
X · avg(SSA, SSU) · lookup against the external manifold Y · avg(SSP, TR, KU, MS) · synthesis on the internal manifold size · overall accuracy  ·  ring · 1 − refusal rate
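The axis definitions above translate directly into code. This sketch computes the projection for one board row; the dictionary keys and the function name are illustrative, and the example values are the Open-Max row from the v1 board.

```python
from statistics import mean
from typing import Dict


def project(row: Dict[str, float]) -> Dict[str, float]:
    """Map a leaderboard row to the 2D behavioural projection."""
    return {
        "x": mean([row["SSA"], row["SSU"]]),                        # associative axis
        "y": mean([row["SSP"], row["TR"], row["KU"], row["MS"]]),   # geometric axis
        "size": row["overall"],                                     # marker size
        "ring": 1.0 - row["ref_pct"] / 100.0,                       # 1 minus refusal rate
    }


open_max = {
    "SSA": 96.4, "SSU": 92.9, "SSP": 40.0, "TR": 68.4, "KU": 87.2, "MS": 57.9,
    "overall": 73.4, "ref_pct": 17.6,
}
print(project(open_max))
```

For this row the projection lands at roughly x = 94.65 and y = 63.4, i.e. strong on lookup and noticeably weaker on synthesis.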

What the four regions mean

Balanced leaders (top-right): high on both axes; readers that can both look up locally and compose across fragments.
Strong-local / moderate-synthesis (middle-right): local-memory pillar intact, synthesis lags 4–6 points; the canonical "more associative than geometric" failure mode.
Local-strong / synthesis-weak (bottom-right): highest local-memory scores on the board but synthesis falls off the cliff; "associative-only" readers in the Google / CMU sense.
Hybrid trade-off / SSA penalty (top-left): a fixed reader stranded ~15 points to the left of itself on a different pipeline; the visible imprint of pipeline-as-vector-field.

Pipeline = directional transformation

When the same model appears at multiple points (e.g. gpt-oss-120b @ high on Stack vs. Hybrid-v2), the gap is not a reader-geometry property — it's the pipeline itself acting as a vector field over a fixed substrate. Routing can over-optimise one direction and break another (e.g. SSA → Naked v2 trades a 15 pp local-memory drop for a small synthesis gain). That is the literal visualisation of why holding retrieval fixed and varying the reader+pipeline pair is the right experimental knob to twist.

What this projection does not claim

This is a 2D projection of behaviour — the actual hidden-state geometry is unmeasured. Position on this projection is mediated by chat template, sampling defaults, refusal posture, and serving host; the behavioural fingerprint is not a hidden-state read. We do not claim "reader X has more geometric parametric memory than reader Y." The projection is the cheap (Audit-B-lite) version of the full Reader Manifold Audit; the hidden-state work is queued in the experiment journal.

How to read the projection. Each marker is one reader+pipeline pair. Distance from origin = total memory strength; angle = associative-vs-geometric ratio. The dashed lines mark the median X and Y across the top-10 — readers above-and-right of both medians are the balanced leaders, readers in the bottom-right are strong on lookup but weak on composition. The chart re-renders from leaderboard.json on every build, so it never falls out of sync with the table.
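The projection arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the code in _build_leaderboard.py: the function names are hypothetical, the unweighted means follow the axis definitions in the legend, and the input rows are per-qtype accuracies copied from the closed-reference table on this page rather than the actual top-10 set the chart plots.

```python
import math
from statistics import median

# Per-qtype accuracies (%) copied from the closed-reference table on this page.
# Illustrative inputs only; the real chart plots the top-10 reader+pipeline pairs.
ROWS = {
    "claude-opus-4-6": {"SSA": 91.1, "SSU": 92.9, "SSP": 56.7, "TR": 72.2, "KU": 87.2, "MS": 60.9},
    "gpt-5.5": {"SSA": 100.0, "SSU": 95.7, "SSP": 56.7, "TR": 69.9, "KU": 84.6, "MS": 57.1},
    "gpt-5-mini": {"SSA": 37.5, "SSU": 90.0, "SSP": 36.7, "TR": 59.4, "KU": 66.7, "MS": 51.9},
}

def project(acc):
    """Return (X, Y): associative and geometric strength as unweighted means."""
    x = (acc["SSA"] + acc["SSU"]) / 2
    y = (acc["SSP"] + acc["TR"] + acc["KU"] + acc["MS"]) / 4
    return x, y

def polar(x, y):
    """Distance from origin (total strength) and angle in degrees (Y-vs-X ratio)."""
    return math.hypot(x, y), math.degrees(math.atan2(y, x))

points = {name: project(acc) for name, acc in ROWS.items()}

# The dashed lines on the chart are the medians of X and Y over the plotted set.
med_x = median(x for x, _ in points.values())
med_y = median(y for _, y in points.values())

for name, (x, y) in points.items():
    r, theta = polar(x, y)
    balanced = x >= med_x and y >= med_y
    print(f"{name}: X={x:.2f} Y={y:.2f} r={r:.1f} angle={theta:.1f} balanced={balanced}")
```

An angle below 45 degrees means the reader leans associative (X larger than Y); the median test mirrors the "above-and-right of both medians" rule for balanced leaders.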

MRCRv2 vs Reader Leaderboard. The weak correlation is the result.

If our two-axis (LMS / GSS) framing is right, then scores from a benchmark that emphasises native long-context system ability (multi-round retrieval, planning, in-context computation, distractor density; e.g. MRCRv2) should not rank-correlate strongly with scores from a benchmark that isolates fixed-evidence reader synthesis (frozen retrieval, top-10 chunks, structured grounded-summary output; this leaderboard). Below is the five-row overlap of models that appear in both, with per-model deltas. The rows show a two-population structure with a convergence point at Opus 4.6. The cleanest single data point is GPT-5.4: #1 on MRCRv2 at 97%, mid-pack here at 71.2%.

DISPLAY ONLY: source verification pending. MRCRv2 numbers in the table below come from the BenchLM-tracked snapshot captured 2026-04-24. BenchLM tracks MRCRv2 in its local dataset, but exact-source verification records for these rows are still being attached; until they are, the rows should not be treated as fully verified public benchmark rows. We treat them as a directional cross-check, not a metric claim. The two-population structure and rank inversions are robust to ±5 pp uncertainty on individual MRCRv2 rows; per-row magnitudes may shift when verified attachments land. We will re-run the correlation and journal a follow-up at that point.
Model · pipeline | Track | MRCRv2 | Reader (best row) | Δ (Reader − MRCR) | Reading
Gemma-4-26B-A4B-it · Hybrid + F3-on-TR | open-weights | 44.1% | 71.4% | +27.3 pp | Mediocre at native long-context, excellent reader given clean evidence.
GPT-OSS-120B @ high · Open-Max (Stack) | open-weights | 59.0% | 73.4% | +14.4 pp | Same direction as Gemma; weaker magnitude.
Claude Opus 4.6 · fixed-retrieval reference | closed-API | 76.0% | 75.6% | -0.4 pp | The only overlapping model where the two benchmarks numerically agree.
GPT-5 mini · fixed-retrieval reference | closed-API | 79.0% | 59.0% | -20.0 pp | Strong at native long-context; mediocre reader on fixed evidence.
GPT-5.4 · fixed-retrieval reference | closed-API | 97.0% | 71.2% | -25.8 pp | Largest inverted gap. The cleanest single refutation of "MRCRv2 ranking == Reader Leaderboard ranking".
contract · Reader column = best ManifoldMemory pipeline per model on this leaderboard's 500 LME-S questions (K=10 to reader, GPT-4o judge K=5 3-of-5)
MRCRv2 · BenchLM-tracked snapshot (2026-04-24, status display_only)
best-row Pearson · 0.00 (pipeline / routing dominates rank order)

Two-population structure

Positive-Δ readers (Gemma-4-26B-A4B-it, GPT-OSS-120B @ high) punch above their long-context score once retrieval does the heavy lifting — they read better than their MRCRv2 implies, by 14–27 pp. Negative-Δ readers (GPT-5.4, GPT-5 mini) regress 20–26 pp on fixed evidence — a meaningful share of their MRCRv2 score is non-reading work (planning, multi-round routing). Convergence point (Opus 4.6, Δ ≈ 0): the only overlapping model where the two benchmarks numerically agree. Models strong on both axes are rare.

Why this is the prediction, not noise

Phase 96-T.3.1 already mapped Anthropic's MRCRv2 / GraphWalks split onto the same two axes the projection above names (LMS / GSS). T.3.1 was the theoretical cross-check; this table is the empirical one: the rank order disagrees between the two benchmarks, and disagrees in a structured way (positive-Δ / negative-Δ / convergence). Anyone arguing "your benchmark just measures general model strength" has to explain why the #1 MRCRv2 model is mid-pack on ManifoldMemory.

What this table does not claim

Per-row MRCRv2 magnitudes are display-only pending source verification. The two-population direction is robust to per-row noise; the numerical deltas are not. We frame this as a directional cross-check, journal the falsifiability note in Phase 96-T.3.3, and re-run the correlation when verified MRCRv2 numbers ship. Internal evidence for the framework remains the 2D projection above and the 5-model overlap shown here; the verified-MRCR re-run is the next public step.

How to read the table. Rows are sorted by ascending MRCRv2. The Reader (best row) column is the highest-accuracy ManifoldMemory pipeline for that model on this leaderboard's 500-Q LME-S evaluation; Δ is the gap between the two benchmarks on the same model. Green Δ = stronger reader than long-context score predicts; red Δ = weaker. The table regenerates from mrcr_snapshot.v1.json on every build, and auto-suppresses when the snapshot is removed or unstaged.
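The Δ column and the positive / negative / convergence grouping can be recomputed directly from the five rows above. The sketch below assumes a 1 pp band for "numerically agree"; that threshold is hypothetical, since the page does not define one, and the MRCRv2 inputs remain display-only.

```python
# (model, MRCRv2 %, Reader best-row %) copied from the overlap table above.
# MRCRv2 values are display-only pending source verification.
OVERLAP = [
    ("Gemma-4-26B-A4B-it", 44.1, 71.4),
    ("GPT-OSS-120B @ high", 59.0, 73.4),
    ("Claude Opus 4.6", 76.0, 75.6),
    ("GPT-5 mini", 79.0, 59.0),
    ("GPT-5.4", 97.0, 71.2),
]

CONVERGENCE_BAND_PP = 1.0  # hypothetical threshold, not defined by the page

def classify(delta_pp, band=CONVERGENCE_BAND_PP):
    """Label a row by the sign of Reader minus MRCRv2, with a small agreement band."""
    if abs(delta_pp) <= band:
        return "convergence"
    return "positive" if delta_pp > 0 else "negative"

# Rows stay in ascending-MRCRv2 order, matching the table.
deltas = {model: round(reader - mrcr, 1) for model, mrcr, reader in OVERLAP}
groups = {model: classify(d) for model, d in deltas.items()}

for model, d in deltas.items():
    print(f"{model}: {d:+.1f} pp ({groups[model]})")
```

Widening or narrowing the band changes only which rows count as convergence; the sign-based split into two populations is unaffected.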

Reference row. Read alongside, not against.

These rows are genuinely closed-weights — the model can only be reached through the vendor's API, and the weights are not public. They are reported in a separate reference track on the identical retrieval contract so the open and closed regimes can be read side-by-side without conflating them into one ranking. Reproducibility is limited to whatever the API serves on the run date; these rows are published as comparison anchors, not contestants. Hosted open-weight rows (DeepSeek V4 Pro, Kimi K2.6) are not here — they sit in the ranked board above under the Open · API-hosted tag.

Track | Reader · Pipeline | Overall acc · 95% CI · % of ceiling | SSA | SSU | SSP | TR | KU | MS | Ref %
Closed reference | claude-opus-4-6 · Stack (frozen retrieval contract; canonical prompt v1) | 75.6% [71.6–79.2]% · 78.6% of ceiling | 91.1% (51/56) | 92.9% (65/70) | 56.7% (17/30) | 72.2% (96/133) | 87.2% (68/78) | 60.9% (81/133) | 15.8%
Closed reference | gpt-5.5 · Stack (frozen retrieval contract; canonical prompt v1) | 75.0% [71.0–78.6]% · 78.0% of ceiling | 100.0% (56/56) | 95.7% (67/70) | 56.7% (17/30) | 69.9% (93/133) | 84.6% (66/78) | 57.1% (76/133) | 16.8%
Closed reference | claude-opus-4-7 · Stack (frozen retrieval contract; canonical prompt v1) | 72.6% [68.5–76.3]% · 75.5% of ceiling | 98.2% (55/56) | 85.7% (60/70) | 46.7% (14/30) | 69.9% (93/133) | 80.8% (63/78) | 58.6% (78/133) | 26.6%
Closed reference | gpt-5.4 · Stack (frozen retrieval contract; canonical prompt v1) | 71.2% [67.1–75.0]% · 74.0% of ceiling | 96.4% (54/56) | 90.0% (63/70) | 43.3% (13/30) | 66.9% (89/133) | 80.8% (63/78) | 55.6% (74/133) | 20.4%
Closed reference | gemini-3-flash · Stack (frozen retrieval contract; canonical prompt v1) | 70.8% [66.7–74.6]% · 73.6% of ceiling | 92.9% (52/56) | 92.9% (65/70) | 36.7% (11/30) | 64.7% (86/133) | 83.3% (65/78) | 56.4% (75/133) | 17.0%
Closed reference | gpt-5-mini · Same retrieval contract, closed-weights reader | 59.0% [54.6–63.2]% · 61.3% of ceiling | 37.5% (21/56) | 90.0% (63/70) | 36.7% (11/30) | 59.4% (79/133) | 66.7% (52/78) | 51.9% (69/133) | 25.0%

Same retrieval contract, same 500 questions, same K=5 GPT-4o majority-vote judge. Same protocol, different track. For context: the open-weights ★ Open-Max row above lands at 73.4% (gpt-oss-120b @ high, Stack + F3-on-TR); the top closed reference row (claude-opus-4-6) reads the identical evidence at 75.6%. Take this as evidence that fixed-evidence reading is its own bottleneck, not as a head-to-head claim on either model's preferred pipeline.
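The overall accuracy, the "% of ceiling" annotation, and the confidence interval in each row can all be rebuilt from the per-qtype counts. The sketch below does this for the claude-opus-4-6 row. One assumption is loud here: the page does not name its interval method, and a Wilson score interval is used only because it happens to match the published bounds; the function names are illustrative.

```python
import math

R_AT_5 = 96.2        # retrieval recall (%) stated for the frozen contract
N_QUESTIONS = 500

# Correct counts per qtype for claude-opus-4-6, copied from the reference table.
OPUS_46_CORRECT = {"SSA": 51, "SSU": 65, "SSP": 17, "TR": 96, "KU": 68, "MS": 81}

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score interval for a binomial proportion (assumed method)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

def pct_of_ceiling(acc_pct, ceiling_pct=R_AT_5):
    """Share of the retrieval upper bound the reader captures: overall acc / R@5."""
    return 100 * acc_pct / ceiling_pct

correct = sum(OPUS_46_CORRECT.values())      # micro-average over all 500 questions
acc = 100 * correct / N_QUESTIONS
lo, hi = wilson_interval(correct, N_QUESTIONS)

# Prints: 75.6% [71.6-79.2] 78.6% of ceiling
print(f"{acc:.1f}% [{100 * lo:.1f}-{100 * hi:.1f}] {pct_of_ceiling(acc):.1f}% of ceiling")
```

Because overall accuracy is a micro-average, the 133-question TR and MS families move the headline number far more than the 30-question SSP family does.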

Eight tracks. Each measures a different thing.

The board is not a single ranking. It is a stack of comparisons under the same retrieval contract: Open-Max vs. Efficient production policies, canonical vs. experimental routing, retrieval vs. no retrieval, open-weights vs. closed-weights. Read each pair against the two ManifoldMemory production rows (Open-Max for max quality, Efficient for cheap-to-self-host).

ManifoldMemory · Open-Max

ManifoldMemory's max-quality reader policy. The strongest open-weight reader configuration measured under the frozen retrieval contract — self-host-verified, no closed-weights dependency. The number ManifoldMemory publicly stands behind for the highest-accuracy mode.

ManifoldMemory · Efficient

ManifoldMemory's efficient reader policy: smaller, cheaper, easier to self-host than Open-Max, slightly lower accuracy. Production-equivalent (oracle-qtype disclosed; learned classifier expected within 1–3 pp).

Experimental

Routed or prompt variants we have measured under the same protocol. Allowed on the board, but not the public headline.

Retrieval-only

Stack handed straight to the reader. No qtype routing, no full-context fallback. The contribution of retrieval-without-routing.

No retrieval

Reader receives the full LME-S 110K-token haystack and zero retrieved chunks. Floor row; the contribution of retrieval is measured against it.

Open swap

Same Hybrid-family pipeline as the Efficient row, different open-weights reader, run locally (HF/vLLM/H200). Quantifies how much of the result is reader-specific. Self-host-verified — no closed-weights or vendor-API dependency.

Open · API-hosted

Public open weights (Hugging Face, permissive licence) but the row was produced through the vendor's hosted endpoint, not a local checkpoint. We did not independently verify checkpoint hash, serving template, thinking-mode toggle, sampling defaults, hidden system prompt, provider-side patches, or version/route pinning — so this is a hosted open-weight reference, not equivalent to a self-hosted swap. Ranked alongside open-weights rows because the underlying weights are public.

Closed reference

Closed-weights API reader on the identical retrieval contract. Reference comparison only — not a head-to-head; reproducibility is limited to whatever the API serves on the run date.

Fixed-evidence reading is a separate bottleneck.

Once retrieval is fixed, a reproducible open-weight reader reaches the same band as frontier API readers: 73.4% for gpt-oss-120b @ high (Stack + F3-on-TR) vs 72.6% for Opus 4.7 and 75.0% / 75.6% for GPT-5.5 / Opus 4.6 on the identical top-10 evidence.
The point isn't that one reader beats another — it's that once retrieval is held fixed, a self-host-verified open-weight reader can sit inside the same accuracy band as the frontier closed-weights APIs. Whatever extra performance a closed-weights vendor sells on a long-context end-to-end benchmark, a meaningful share of it is its own retrieval, not the reader's reading. This board separates the two so the reader can be priced honestly — and so ManifoldMemory can ship an open-weight reader as the max-quality production policy without a closed-weights dependency.
Caveats: closed-weights API readers (claude-opus-4-6, gpt-5.5, claude-opus-4-7, gpt-5.4, gemini-3-flash, gpt-5-mini) are published as reference rows, not head-to-head claims; the run reflects whatever the API served on the run date and is not strictly reproducible. Both ManifoldMemory production rows (Open-Max and Efficient) use oracle-qtype routing — the production learned-classifier path is pending, expected within 1–3 pp.

What this board is

A frozen-retrieval, reader-only comparison. Every row received the identical top-10 chunks (post-MixK rerank from 500 candidates) for the same 500 LongMemEval-S questions, judged by the same K=5 GPT-4o protocol.

Per-qtype breakdowns expose where each reader fails (SSA for chunk-rerank disruption, KU for stale knowledge, TR for date arithmetic) rather than aggregating them away into a single number.

Refusals are scored as incorrect in accuracy. They are also reported separately in the Ref % column because abstention on missing evidence is a different failure mode from fabrication, and the two have different downstream consequences for any system built on top.
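A minimal sketch of how the two columns relate, assuming each answer carries five judge verdicts and a refusal flag. The 3-of-5 threshold follows the judge protocol stated on this page; the JudgedAnswer structure, the idea that a refusal flag overrides the votes, and the function names are assumptions for illustration, since the page does not describe how refusals are detected.

```python
from dataclasses import dataclass

@dataclass
class JudgedAnswer:
    votes: list[bool]   # one verdict per judge seed (K=5 on this board)
    refused: bool       # hypothetical flag: reader abstained instead of answering

def is_correct(ans, threshold=3):
    """Majority vote (3-of-5); a refusal is scored incorrect regardless of votes."""
    if ans.refused:
        return False
    return sum(ans.votes) >= threshold

def score(answers):
    """Return (accuracy %, refusal %) over the same denominator."""
    n = len(answers)
    acc = 100 * sum(is_correct(a) for a in answers) / n
    ref = 100 * sum(a.refused for a in answers) / n
    return acc, ref

answers = [
    JudgedAnswer([True, True, True, True, True], refused=False),      # unanimous pass
    JudgedAnswer([True, True, True, False, False], refused=False),    # passes 3-of-5
    JudgedAnswer([True, True, False, False, False], refused=False),   # fails 2-of-5
    JudgedAnswer([False, False, False, False, False], refused=True),  # abstention
]
acc, ref = score(answers)
print(f"acc={acc:.1f}% ref={ref:.1f}%")  # acc=50.0% ref=25.0%
```

Keeping both numbers on the same denominator is what lets a reader of the table separate "wrong because it abstained" from "wrong because it answered incorrectly".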

Receipts are attached: per-row answer files, all five seed-judge files, and the synth scripts that produce the Hybrid rows from the underlying Stack and Naked answers.
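The Hybrid rows are described as synthesised from existing Stack and Naked answers using the oracle qtype. A rough sketch of that selection step follows. Everything specific in it is hypothetical: the ROUTE table, the function name, and the dict-based answer format are invented for illustration, and the real per-reader logic lives in the _synth_hybrid_*.py scripts.

```python
# Hypothetical routing table: which pipeline's answer to keep for each qtype.
# The actual per-reader choices are defined in the synth scripts, not here.
ROUTE = {
    "SSA": "stack", "SSU": "stack", "TR": "stack",
    "KU": "stack", "SSP": "naked", "MS": "naked",
}

def synth_hybrid(questions, stack_answers, naked_answers, route=ROUTE):
    """Pick one already-generated answer per question using the gold (oracle) qtype.

    questions: {question_id: qtype}; *_answers: {question_id: answer text}.
    No reader is re-run; the hybrid is assembled from existing answer files.
    """
    pools = {"stack": stack_answers, "naked": naked_answers}
    return {qid: pools[route[qtype]][qid] for qid, qtype in questions.items()}

questions = {"q1": "SSA", "q2": "MS"}
stack = {"q1": "stack answer 1", "q2": "stack answer 2"}
naked = {"q1": "naked answer 1", "q2": "naked answer 2"}
print(synth_hybrid(questions, stack, naked))
```

Because the qtype here is the gold label, a sketch like this measures the oracle-routed upper bound; a learned classifier would replace the lookup on the qtype.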

What this board isn't

A general LongMemEval leaderboard. Public LME-S numbers blend retrieval, routing, and reading; this board pins retrieval and routing and isolates the reader. Systems do not bring their own retrieval.

A claim that "small lab beats OpenAI." Closed-weights API readers (claude-opus-4-6, gpt-5.5, claude-opus-4-7, gpt-5.4, gemini-3-flash, gpt-5-mini) are published as reference rows on a fixed retrieval contract; they are not head-to-head claims on either model's preferred pipeline.

An agent-harness benchmark. Every row is a single-pass reader call — multi-step planners (Hindsight, ReAct, etc.) are out of scope by design, since they conflate reader skill with planner skill.

A frozen artifact. Submissions of additional readers (open weights or API) are welcome under the same contract: a public harness ships next.

Every number traceable to a file.

The leaderboard is a derived artifact. The underlying answer-and-judge files are committed alongside the build script so any row can be re-scored from primary sources in one command.

  • Per-row answer + 5-seed judge files
    _remote_pulls/phase91_a50/, _remote_pulls/phase91_a100_gemma_naked/, _remote_pulls/phase91_hybrid_gemma/, _remote_pulls/phase91_hybrid_qwen/, _remote_pulls/phase91_cwfix/, _remote_pulls/phase91_frontier/, _remote_pulls/phase91_hybrid_f3tr/.
  • Frozen retrieval contract spec · R@5 derivation
    EXPERIMENT_JOURNAL_RECOVERED_2026-04-24.md § Phase 86 (R@5 receipt) → Phase 91 (Stack vs Naked) → Phase 95-K (oracle-qtype disclosure).
  • Hybrid synthesis scripts
    _tmp_scripts/_synth_hybrid_gemma.py (oracle-qtype router for Gemma), _tmp_scripts/_synth_hybrid_qwen.py (Qwen), _tmp_scripts/_synth_hybrid_f3_tr.py (Hybrid + F3-on-TR best-of-board).
  • Build script for this page
    _tmp_scripts/_build_leaderboard.py · machine-readable copy: leaderboard.json.
  • Frozen retrieval substrate · live
    The 500-question top-10 chunks the readers above actually saw, content-addressed by SHA-256: frozen_retrieval_topK_500q.v1.jsonl (~6 MB) · manifest · submission protocol. Anyone can re-run this contract: same questions, same chunks, same prompt, same judge.
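Since the frozen substrate is content-addressed by SHA-256, a downloaded copy can be checked before any reader is run against it. The sketch below uses only the Python standard library; the expected digest must come from the published manifest, whose format is not shown on this page, so the verify helper simply takes the hex string as an argument.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

def verify(path, expected_hex):
    """True if the file's digest matches the value taken from the manifest."""
    return sha256_of(path) == expected_hex.strip().lower()

# Usage (the digest is a placeholder to be copied from the manifest):
# ok = verify("frozen_retrieval_topK_500q.v1.jsonl", "<sha256 from manifest>")
```

Streaming in chunks keeps memory flat regardless of file size, which matters little at ~6 MB but keeps the helper reusable for the larger answer and judge files.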