ManifoldMemory measures the two hidden bottlenecks in AI memory: evidence access and evidence reading. This page holds retrieval fixed and measures what happens next.
Reader is the bottleneck. Same 500 LongMemEval-S questions, same retrieval contract,
same GPT-4o judge — swap only the reader and accuracy ranges from
8.8% to 75.6% across tested readers. Within the 26B/27B-class reader group,
the spread is 18.2 pp. The current open-weights ceiling under this contract is
73.4% (gpt-oss-120b @ high, Stack + F3-on-TR).
What this measures: reader synthesis ability under a fixed retrieval contract —
i.e. an internal-manifold benchmark conditional on the external retrieval manifold being
held constant at R@5 = 96.2%. The new % of ceiling annotation under each accuracy
shows how much of that 96.2% upper bound a reader actually captures (overall acc / R@5).
First-stage union of three retrievers, fused via reciprocal rank, reranked by a MixK cross-encoder, top-10 chunks delivered to the reader. Same 500 LongMemEval-S questions, same haystack, same chunk boundaries, same answer prompts, same judge.
Every ManifoldMemory row that goes beyond the v1 canonical contract, ranked together by overall accuracy on LongMemEval-S 500q. Each row shows a coloured track pill (v2.1-K64, A3 stacked, K=48 v2, MIX-v1-closed, v2.1 K=24/48, v2 K=24) plus its existing expand chevron for the per-row detail card. Open Tracks explained below for full contract definitions.
Same canonical retrieval pipeline + same K-routing + same NoChrono prompts as track-v2-k64-noproto. The ONLY varying element is the Anthropic API thinking parameter: disabled in the baseline, enabled at budget_tokens=16384 here (with the API-mandated temperature=1.0).
Phase 97.9 result on Opus 4.6 NoChrono: 92.00 % (460/500), +0.40 pp = +2 q over the NoChrono baseline at 91.60 %. 95 % CI on the paired diff is [−1.60, +2.40] — includes zero, so the lift is sampling-noise-scale; H_S1 (clean SOTA beat) FAILED, H_S2 (marginal lift) TRIGGERED. Net flips +2 (wins 14, losses 12). Phase 97.8 subset projection of +11 flips eroded ~5× at the full-set level by 12 regressions on previously-passing qids; subset transfer audit shows 9/11 of the flips reproduced under the full-set judge (82 %, within sampling-noise budget). Honest framing: HIGHEST point-estimate accuracy on the board, but NOT a new SOTA claim — closed-weights, CI overlaps, H_S1 missed by 1 question. Reported as a separate track for methodological transparency.
Per-qtype K routing with K=64 on TR+MS only, plus the v2.1 rider package (chronological + KU forbidden-older + scan-ALL-64 + SHORT+ABS).
Same canonical retrieval pipeline as v2 (BGE+QNDN+BM25 → RRF k=60 → MixK rerank). Two changes from v2.1 K=24/48: (1) K raised to K=24/64 — K=24 by default, K=64 on the two retrieval-bottleneck qtypes (TR + MS) only; (2) SYSTEM_RIDER_LIST scan-ALL count updated from "Scan ALL 48" → "Scan ALL 64" so the rider's self-described K matches the K=64 evidence delivered to MS. K=64 cache is byte-identical-strict-superset of K=48 and K=24 caches (verified 500/500 byte-match at top-48 and top-24). Strict R@64 = 99.0% (495/500), loose R@64 = 100.0% — first contract where every gold session is in the evidence panel for every question. First ManifoldMemory single-reader to clear 90% on LME-S 500q.
Stacks F_K=48 (where retrieval depth wins) and F_v2.1-rider (where rider+routing wins) per qtype on Gemma-4-31B.
Routing fixed before evaluation, applied to all 500 questions consistently, no gold-answer / judged-correct / post-hoc picking. Per-qtype routing table: SSA K=48+SHORT · SSU K=48+SHORT · SSP K=24+PREF · TR K=48+TEMPORAL · KU K=24+KU(forbidden-older) · MS K=48+LIST(K=48) · _abs SHORT+ABS shortcut. Single byte-level prompt diff vs v2.1: SYSTEM_RIDER_LIST "Scan ALL 24" → "Scan ALL 48" so the rider's self-described K matches the K=48 evidence delivered to MS. Stacking is additive within Novita's ±1 pp non-determinism: A2 K=48 (89.00%) + recovered KU/_abs from v2.1 (~+0.40 pp) = predicted 89.40%, landed at 89.40%.
F_K=48 force-isolation diagnostic. Same v2 prompts as the K=24 board, no rider, every qtype receives K=48 chunks.
The K=48 cache is a byte-identical strict superset of the public K=24 artifact (verified 500/500 byte-match on top-24 chunks per qid) so cross-contract comparison is exact below the 24th rank. By holding rider+routing fixed at the v2 baseline and only changing K, this contract decomposes the v2.1 contract's win into two independently-attributable forces: F_K=48 (this contract) and F_v2.1-rider (the rider+routing diff). Per-reader F_K=48 isolation: Qwen MS +5 questions (K=24→K=48), Gemma SSA +2 / SSU +1 / TR −1 / MS +1. Both readers regress on KU and _abs at K=48 without the rider — F_v2.1-rider is the right knob for those qtypes, not F_K.
First multi-reader configuration on the ManifoldMemory board. Six pre-registered candidate readers; GPT-5.5 medium blindly arbitrates each question.
Six candidate readers × contracts (gpt-5.5 v1 K=10, gpt-5.5 v2 K=24, claude-opus-4-6 v2 K=24, gemma-4-31b v2.1 K=24/48, claude-opus-4-6 v2.1 K=24/48, gemma-4-31b v2 K=24) produce independent answers; gpt-5.5 medium then arbitrates each question by reading question + K=48 evidence panel + neutrally-labeled candidate answers (A–F shuffled per qid with sha256-hashed seed). The arbiter sees no qtype labels, no gold answers, no model identities, no contract ids — only question, evidence, and candidate text. Pre-registered protocol with frozen prompt fingerprints (system sha256 5a3df742a1b2f5f4, user template sha256 a3071a14cbb9a31a, shuffle seed MIX-v1-closed-2026-04-29-shuffle-seed-42). Synthesis fallback (selected_label = S) only when no candidate is supportable: 3/500 = 0.6% synth rate, well under the 15% threshold. Reported in its own MIX track — not a single-model score.
Per-qtype K routing — K=24 default, K=48 on TR + MS only — plus a tightened chronological + KU rider package.
Three changes from v2 uniform K=24: (1) per-qtype K routing — deterministic from question_type, no gold-answer picking; (2) upgraded SYSTEM_RIDER_TEMPORAL with an explicit 7-step chronological protocol (identify events → list YYYY/MM/DD → sort → use QUESTION_DATE as today → compute → show arithmetic → Answer); (3) tightened knowledge-update rider (latest-value enforcement). The K=48 cache used here is a byte-identical strict superset of the public K=24 artifact (top-24 of K=48 == K=24 cache, verified 500/500 byte-match). Reader-specific: Gemma uses the chronological rider well (TR +1 q, KU +rider gain) while Opus collapses on TR under the same rider (−15.79 pp) — the Field Notes Law 2 reader-specific scaffold finding.
Same canonical retrieval pipeline as v1, with two changes: top-K bumped 10 → 24, and Track 2-lite v2 prompts (per-qtype oracle routing + LMS-S session-date back-fill).
R@24 = 99.6% (498/500). The K=24 artifact is a strict superset of the published K=10 artifact — sanity-checked for chunk-text equality & cosine-drift on the 500-q overlap. The +12.4 pp lift over the K=10 + canonical-v1 Claude row decomposes as roughly +3.2 pp from K=24 and +9.2 pp from the Track 2-lite v2 prompt + date back-fill — the prompt + data layer is doing most of the work, not the reader's parameter count or the retrieval contract. Qwen-3.6-27B (open-weights), Claude Opus 4.6 (closed-weights frontier), and Gemma-4-31B-IT all converge to the 87.8–88.6% band on this recipe.
| # | Track | Reader / Pipeline | Overall acc · 95% CI |
SSA | SSU | SSP | TR | KU | MS | Ref % | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ★ #1 | v2-K64-NoChrono+T16K |
claude-opus-4-6 + 16K thinking closed-weights · #1 by point estimate · NOT a new SOTA (CI overlaps 91.60%)
Stack v2-K64-NoChrono + 16K thinking (per-qtype K routing: K=24 default, K=64 on TR+MS; v2 NoChrono prompts unchanged from track-v2-k64-noproto; reader thinking enabled at budget_tokens=16384 + API-mandated temperature=1.0)
|
92.0%[89.3–94.1]%92.0% of ceiling | 94.6%53/56 | 97.1%68/70 | 83.3%25/30 | 92.5%123/133 | 96.2%75/78 | 87.2%116/133 | 0.0% | |
|
HIGHEST point-estimate accuracy on the unified ManifoldMemory frontier board (#1 by overall acc), but NOT a new SOTA claim. 460/500 = 92.00%. +0.40 pp over the open-weight v2.1-K64.1 SOTA (Qwen + Gemma at 91.60 %) and over the Opus 4.6 NoChrono baseline (91.60 %); 95 % CI on the paired diff vs the 91.60 % baseline is [−1.60, +2.40] — includes zero, so the +0.40 pp delta (= +2 questions) is indistinguishable from sampling noise. |
|||||||||||
| ☆ #2 | v2.1-K64.1 |
qwen-3.6-27b-novita SOTA · open-weight (any size) · new ManifoldMemory best
Stack v2.1-K64.1 (per-qtype K routing: K=24 default, K=48 on SSP, K=64 on TR+MS; v2.1-K64 prompts unchanged; chronological + forbidden-older + scan-ALL-64 + SHORT+ABS riders)
|
91.6%[88.8–93.7]%91.6% of ceiling | 96.4%54/56 | 98.6%69/70 | 83.3%25/30 | 91.0%121/133 | 94.9%74/78 | 86.5%115/133 | 0.0% | |
|
NEW ManifoldMemory single-reader best AND new open-weight SOTA at any reader size. 458/500 = 91.60%. Beats Gemma-4-31B-IT v2.1-K64 routed (91.00%, prior SOTA, different reader, ~1 d old) by +0.60 pp; beats the same Qwen reader on the v2.1-K64 baseline (90.00%, prior open-weight SOTA, ~36 h old) by +1.60 pp; beats A3 Gemma stacked (89.40%, different reader, more complex contract) by +2.20 pp; beats A1l Qwen v2 K=24 (88.60%, same reader, K-routing only intervention) by +3.00 pp. |
|||||||||||
| ☆ #3 | v2.1-K64.1 |
gemma-4-31b-novita SOTA tied · open-weight (any size) · cross-reader replication
Stack v2.1-K64.1 (per-qtype K routing: K=24 default, K=48 on SSP, K=64 on TR+MS; v2.1-K64 prompts unchanged; chronological + forbidden-older + scan-ALL-64 + SHORT+ABS riders)
|
91.6%[88.8–93.7]%91.6% of ceiling | 92.9%52/56 | 97.1%68/70 | 83.3%25/30 | 92.5%123/133 | 98.7%77/78 | 85.0%113/133 | 0.0% | |
|
Cross-reader replication of the v2.1-K64.1 open-weight SOTA. Ties Qwen-3.6-27B v2.1-K64.1 at 91.60 %. 458/500 = 91.60%. Single isolated intervention vs Gemma v2.1-K64 (91.00 %, prior Gemma BiS, same reader, same retrieval cache, same prompts; only difference is K_BY_QTYPE[SSP] 24→48): SSP +3.33 pp (24→25 / 30, +1 q — the targeted intervention; cross-reader replication of the Qwen +4 q signal but smaller magnitude on Gemma); KU +2.57 pp (75→77 / 78, +2 q on Novita run-to-run noise, slice unchanged); TR +0.75 pp (122→123 / 133, +1 q noise); SSU +1.43 pp (67→68 / 70, +1 q noise); SSA 0.0 pp (52→52 / 56); MS −1.50 pp (115→113 / 133, −2 q noise). Net +3 questions, 6 wrong→right vs 3 right→wrong. |
|||||||||||
| ☆ #4 | v2-K64-NoChrono |
claude-opus-4-6 closed-weights BiS · ties open-weight SOTA · protocol mitigation
Stack v2-K64-NoChrono (per-qtype K routing: K=24 default, K=64 on TR+MS; v2 prompts; 7-step chronological protocol REPLACED by SHORT+TEMPORAL_HINT on TR only; KU/LIST/_abs unchanged from v2)
|
91.6%[88.8–93.7]%91.6% of ceiling | 96.4%54/56 | 97.1%68/70 | 86.7%26/30 | 91.0%121/133 | 94.9%74/78 | 86.5%115/133 | 0.0% | |
|
NEW Opus BiS on the ManifoldMemory retrieval substrate. Ties the open-weight SOTA at 91.60 %. 458/500 = 91.60%. +3.60 pp over A1g K=24 v2 (88.00 %, prior Opus BiS); +6.20 pp over A1g_K64v2 (85.40 %, K=64 with-protocol regression). Mitigation prescription for Field Notes Law 10 (Protocol-Reader Coupling). |
|||||||||||
| ☆ #5 | v2.1-K64 |
gemma-4-31b-novita prior SOTA · v2.1-K64 baseline · different reader
Stack v2.1-K64 (per-qtype K routing: K=24 default, K=64 on TR+MS; v2.1 rider package: chronological + forbidden-older + scan-ALL-64 + SHORT+ABS)
|
91.0%[88.2–93.2]%91.0% of ceiling | 92.9%52/56 | 95.7%67/70 | 80.0%24/30 | 91.7%122/133 | 96.2%75/78 | 86.5%115/133 | 0.0% | |
|
NEW ManifoldMemory single-reader best AND new open-weight SOTA at any reader size. 455/500 = 91.00%. Beats Qwen-3.6-27B v2.1-K64 routed (90.00%, prior SOTA, same contract, different reader) by +1.00 pp; beats A3 Gemma stacked (89.40%, same reader, earlier high-water mark) by +1.60 pp; beats A2 Gemma K=48 v2 (89.00%) by +2.00 pp; beats A1h_v3 Gemma v2.1 K=24/48 routed (88.80%) by +2.20 pp. First reader-by-reader replication of the v2.1-K64 contract — same canonical retrieval, same K=64 frozen cache, same v2.1-K64 prompts file, same strict judge — only the reader changes. |
|||||||||||
| ° #6 | EventOS-v0 |
gemma-4-31b-novita event-sidecar architectural diagnostic · not a SOTA candidate
Stack EventOS-v0 (per-qtype K routing identical to v2.1-K64.1: K=24 default, K=48 on SSP, K=64 on TR+MS; v2.1-K64 prompts unchanged + EVENT-INDEX HINT line on TEMPORAL + LIST riders only; user message gains a single-pass EVENT INDEX block of up to 64 events sourced from offline Gemini Flash extraction over K=64 panel sessions)
|
90.8%[87.9–93.0]%90.8% of ceiling | 94.6%53/56 | 97.1%68/70 | 80.0%24/30 | 90.2%120/133 | 94.9%74/78 | 86.5%115/133 | 0.0% | |
|
First event-sidecar architectural test on the ManifoldMemory retrieval substrate (Phase 97.3.0). 454/500 = 90.80%. −4 q vs Gemma v2.1-K64.1 baseline (91.60%). Inspired by the Chronos paper (arxiv 2603.16862), which shows the next ceiling on LME-S after loose-R@64 = 100 % is structured temporal/event memory, not retrieval depth. EventOS-v0 builds a deterministic, single-pass event sidecar: (1) offline event extraction with Gemini 2.5 Flash over 9,271 unique sessions in the K=64 panels yielded 61,785 structured events with subject / verb / object / date_start / aliases / state_type; (2) at query time, events from sessions in the K=64 panel are sorted oldest-first and rendered as a compact EVENT INDEX block appended after the canonical evidence panel; (3) the TEMPORAL + LIST riders gain a single EVENT-INDEX HINT line. No agent loop, no tool-calling, no dynamic prompting — the v0 objective is to test whether the ARCHITECTURE alone (not the retrieval / extraction quality, not the agent loop) shifts the ceiling. |
|||||||||||
| ° #7 | v2.1-K64 |
qwen-3.6-27b-novita first 90% · v2.1-K64 baseline · same reader as #1
Stack v2.1-K64 (per-qtype K routing: K=24 default, K=64 on TR+MS; v2.1 rider package: chronological + forbidden-older + scan-ALL-64 + SHORT+ABS)
|
90.0%[87.1–92.3]%90.0% of ceiling | 96.4%54/56 | 97.1%68/70 | 70.0%21/30 | 87.2%116/133 | 96.2%75/78 | 87.2%116/133 | 0.0% | |
|
FIRST ManifoldMemory single-reader to clear 90% on LongMemEval-S 500q (now superseded by Gemma-4-31B-IT v2.1-K64 routed at 91.00% above — same contract, different reader). 450/500 = 90.00%. Beats A3 Gemma stacked (89.40%, prior SOTA, 24 h old, different reader / different contract) by +0.60 pp; beats A2 Qwen K=48 v2 same reader (89.20%) by +0.80 pp; beats A1l Qwen v2 K=24 same reader (88.60%) by +1.40 pp; +1.00 pp over MIX-v1-closed (89.00%) on a single reader. |
|||||||||||
| ° #8 | A3 stacked |
gemma-4-31b-novita earlier high-water mark
Stack v2.A3 (per-qtype K + rider stacked: SSA/SSU=K48+SHORT, SSP=K24+PREF, TR=K48+TEMPORAL, KU=K24+KU, MS=K48+LIST(K=48), _abs=SHORT+ABS)
|
89.4%[86.4–91.8]%89.8% of ceiling | 94.6%53/56 | 94.3%66/70 | 83.3%25/30 | 89.5%119/133 | 93.6%73/78 | 83.5%111/133 | 0.0% | |
|
Earlier ManifoldMemory single-reader high-water mark (now superseded by Gemma-4-31B-IT v2.1-K64 routed at 91.00% above — same reader, simpler contract, +1.60 pp). 447/500 = 89.40%. Beats A2 Qwen K=48 v2 (89.20%, prior open-weight peer at K=48 uniform) by +0.20 pp; beats A2 Gemma K=48 v2 (89.00%, same reader different contract) by +0.40 pp; beats A1h_v3 Gemma v2.1 K=24/48 routed (88.80%, prior Gemma single-reader best) by +0.60 pp; +0.40 pp over MIX-v1-closed (89.00%) on a single reader. Surpassed by Qwen-3.6-27B + v2.1-K64 routed (90.00%) on a different reader / different contract — F_K=48→K=64 retrieval-breadth force buys another +3 questions on MS that Gemma at K=48 could not reach. |
|||||||||||
| ° #9 | K=48 v2 |
qwen-3.6-27b-novita cheap K=48 baseline · open-weight
Stack v2.K48u (canonical retrieval @ K=48 uniform; v2 prompts; no rider)
|
89.2%[86.2–91.7]%89.6% of ceiling | 94.6%53/56 | 95.7%67/70 | 76.7%23/30 | 88.0%117/133 | 93.6%73/78 | 85.0%113/133 | 0.0% | |
|
F_K=48 force-isolation baseline (prior open-weight SOTA, now superseded by Qwen K=64 routed at 90.00% same reader and A3 Gemma stacked at 89.40% above). Still the cheapest competitive open-weight reader on the K=48 board. 446/500 = 89.20%. Beats A1h_v3 Gemma v2.1 K=24/48 routed (88.80%) by +0.40 pp; beats A1l Qwen v2 K=24 (88.60%, same reader at half the K) by +0.60 pp; +0.20 pp over the multi-reader MIX-v1-closed (89.00%) on a single reader. |
|||||||||||
| ° #10 | K=48 v2 |
gemma-4-31b-novita
Stack v2.K48u (canonical retrieval @ K=48 uniform; v2 prompts; no rider)
|
89.0%[85.9–91.5]%89.4% of ceiling | 96.4%54/56 | 95.7%67/70 | 80.0%24/30 | 88.0%117/133 | 93.6%73/78 | 82.7%110/133 | 0.0% | |
|
F_K=48 diagnostic on Gemma-4-31B-IT. 445/500 = 89.00% — statistically tied with A1h_v3 Gemma v2.1 K=24/48 routed (88.80%, +1 question / +0.20 pp; within Novita's ±1 pp non-determinism band). The two contracts reach the same ~89% Gemma ceiling via different paths. |
|||||||||||
| ° #11 | MIX-v1-closed |
MIX-v1-closed (champion pool + GPT-5.5 arbiter)
MIX (6-candidate champion pool + GPT-5.5 evidence arbiter; K=48 evidence panel; pick-or-synthesize)
|
89.0%[85.9–91.5]%89.4% of ceiling | 96.4%54/56 | 98.6%69/70 | 66.7%20/30 | 87.2%116/133 | 92.3%72/78 | 85.7%114/133 | 0.0% | |
|
First multi-reader score on the ManifoldMemory board. 445/500 = 89.00%, +0.20 pp over A1h_v3 Gemma-31B v2.1 (88.80%) — statistical tie within ±1 pp Novita non-determinism. Lands −0.5 pp below the pre-registered Expected band (89.5–91.5%) and −2.20 pp below the per-qtype-oracle ceiling (91.20%); MIX recovered ~50% of the oracle-best-of-N gap above the single-reader best. |
|||||||||||
| ° #12 | v2.1 K=24/48 |
gemma-4-31b-novita
Stack v2.1 (canonical retrieval @ K=24/48 per-qtype; Track 2-lite v2 + chronological rider)
|
88.8%[85.7–91.3]%89.2% of ceiling | 92.9%52/56 | 94.3%66/70 | 80.0%24/30 | 88.7%118/133 | 96.2%75/78 | 82.0%109/133 | 0.0% | |
|
New ManifoldMemory open-weight best — first row to clear the v2 ceiling on the frozen retrieval substrate. 444/500 = 88.80% with Gemma-4-31B-IT, +0.80 pp over the prior closed-weights Claude Opus 4.6 leader (88.00%) at ~66x lower per-row reader+judge cost ($0.71 total for A1h_v3 vs $47.34 for A1g). Same reader as the v2 A1h v1 row (87.80%); the +1.00 pp lift comes entirely from the v2.1 routing/rider package, not the reader. |
|||||||||||
| ° #13 | v2 K=24 |
qwen-3.6-27b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
|
88.6%[85.5–91.1]%89.0% of ceiling | 94.6%53/56 | 98.6%69/70 | 73.3%22/30 | 87.2%116/133 | 96.2%75/78 | 81.2%108/133 | 0.0% | |
|
New top of the v2 uniform K=24 board, +0.60 pp over Claude Opus 4.6 (88.00 %, prior leader) at < 1/100 the per-row cost ($0.42 reader vs $46.94), and +0.80 pp over Gemma-4-31B (87.80 %, prior open-weight peer). 27B dense Apache-2.0 model wins SSU, TR, KU, MS on slice (4 of 6 qtypes); loses to Opus on SSA (-1.79 pp) and SSP (-13.34 pp). |
|||||||||||
| ° #14 | v2 K=24 |
claude-opus-4-6
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
|
88.0%[84.9–90.6]%88.4% of ceiling | 96.4%54/56 | 95.7%67/70 | 86.7%26/30 | 86.5%115/133 | 93.6%73/78 | 78.9%105/133 | 5.8% | |
|
Phase 97.2.10-B2 best-in-slot Opus row. Same canonical retrieval substrate as the K=10 closed-reference Claude row below (BGE + QNDN v0 + BM25 → RRF k=60 → MixK rerank); pipeline byte-identical below the rerank, only the persisted top-K (10 → 24) and the prompt class (canonical v1 → Track 2-lite v2) differ. Refusal rate dropped from 15.8% (K=10 + canonical v1) to 5.8% on this contract because (a) K=24 reduces the unanswerable-evidence floor and (b) the conversational-recommendation rider on single-session-preference reframes them as helpful answers rather than terse refusals. +12.4 pp over the K=10 + canonical v1 Claude row, decomposed into ~3.2 pp from K=24 and ~9.2 pp from the Track 2-lite v2 prompt + LMS-S session-date back-fill (against A1f = 78.80% on K=10 + Track 2-lite v2). Multi-session breaks the 70% ceiling for the first time at 78.95% (105/133, +18.05 pp over the K=10 + canonical v1 Claude row). 0.4 pp behind Ensue (88.2%) and ~1.2 pp behind Hindsight + GPT-OSS-120B (~89.0%) on the published full-LME-S leaderboard. Reader: Anthropic Messages API, max_tokens=768, sampling omitted (Opus 4.6 rejects them at the API level), reader_concurrency=4. |
|||||||||||
| ° #15 | v2 K=24 |
gemma-4-31b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
|
87.8%[84.6–90.4]%88.2% of ceiling | 96.4%54/56 | 95.7%67/70 | 86.7%26/30 | 87.2%116/133 | 94.9%74/78 | 76.7%102/133 | 6.6% | |
|
SOTA on AMB·LongMemEval in the open-weight, ≤40B-parameter reader class — +4.2 pp over Hindsight + OSS-20B (83.6% paper) and +13.8 pp over the AMB-verified hybrid-search baseline (74.0%). Phase 97.2.10-B2 best-in-slot open-weight row. Same recipe as the Claude row above; only the reader differs. +17.0 pp over the K=10 + canonical v1 Gemma-4-31B row (70.8%), and within 0.2 pp of Claude Opus 4.6 at ~1/175 the per-token cost ($0.10/$0.20 per Mt vs $15/$75 per Mt). Confirms that the +12.4 pp lift over canonical v1 is dominated by the prompt + data layer (Track 2-lite v2 + date back-fill), not the reader's parameter count or vendor — both readers converge to the 87.8−88.0% band when given the same K=24 evidence and Track 2-lite v2 prompts. Reader: Novita.ai (api.novita.ai/v3/openai), google/gemma-4-31b-it, max_tokens=768, temperature=0, reader_concurrency=8 (with 141 rate-limit retries during the run, resolved at concurrency=3 then concurrency=1). |
|||||||||||
| ° #16 | v2-K64-DropStep2 |
claude-opus-4-6 protocol bisection diagnostic · not a SOTA candidate
Stack v2-K64-DropStep2 (per-qtype K routing: K=24 default, K=64 on TR+MS; v2 prompts with chronological protocol's STEP 2 ablated; remaining steps 1, 3-7 byte-identical, renumbered 1, 2-6)
|
87.8%[84.6–90.5]%87.8% of ceiling | 92.9%52/56 | 98.6%69/70 | 83.3%25/30 | 76.7%102/133 | 94.9%74/78 | 88.0%117/133 | 0.0% | |
|
Within-protocol bisection diagnostic for Field Notes Law 10. 439/500 = 87.80%. Same K-routing as A1g_K64v2 (K=24 default, K=64 on TR+MS); only the chronological protocol's literal-format enumeration step is dropped (step 2: "Write a structured list, one event per line, in the EXACT format YYYY/MM/DD: <phrase>"). Result vs A1g_K64v2 (full protocol, 85.40 %): +2.40 pp / +12 q net overall. Result vs A1g_K64v2_NoChrono (no protocol, 91.60 %): −3.80 pp / −19 q overall. |
|||||||||||
| ° #17 | v2 K=24 |
gpt-5.5-openai
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
|
86.8%[83.6–89.5]%87.1% of ceiling | 96.4%54/56 | 98.6%69/70 | 80.0%24/30 | 82.0%109/133 | 92.3%72/78 | 79.7%106/133 | 0.0% | |
|
Best reader on this stack for single-session-user (98.6%) and multi-session (79.7%, +0.8 pp over Claude Opus 4.6) — the first non-Claude reader to clear 79% on multi-session under the v2 contract. Phase 97.2.11 A1k row. Same byte-identical retrieval substrate as the Claude / Gemma rows above (BGE + QNDN v0 + BM25 → RRF k=60 → MixK rerank → top-24); only the reader vendor + model differ. Reader: OpenAI direct (api.openai.com/v1), gpt-5.5 (snapshot 2026-04-23), reasoning_effort=medium, max_completion_tokens=4096, concurrency=4. Methodology note: GPT-5+ models bill reason+output combined under max_completion_tokens; an initial run at max_tokens=768 truncated 33 temporal-reasoning rows (cap hit during reasoning, empty output). Re-run on the truncated subset at max_tokens=4096 recovered +4.00 pp over the truncated baseline (82.80% → 86.80%); journaled as a permanent harness default for retrieval-grounded QA on GPT-5+ readers. Cost: $4.39 / 500q = $0.0101 per correct answer — ~10x cheaper per correct than A1g Claude Opus 4.6 ($0.107/correct), ~3x more expensive than A1h Gemma-4-31B (~$0.0030/correct). 0 refusals (vs 5.8% Claude / 6.6% Gemma-31B): GPT-5.5's reasoning mode commits to a best-evidence answer rather than abstaining on borderline single-session-preference questions, but pays a ~5 pp gap on temporal-reasoning vs Gemma-31B (81.95% vs 87.22%) where reasoning_effort=high is expected to close most of the slack (future A1k_high run). |
|||||||||||
| ° #18 | v2.1 K=24/48 |
claude-opus-4-6
Stack v2.1 (canonical retrieval @ K=24/48 per-qtype; Track 2-lite v2 + chronological rider)
|
85.0%[81.6–87.9]%85.3% of ceiling | 96.4%54/56 | 97.1%68/70 | 83.3%25/30 | 70.7%94/133 | 92.3%72/78 | 84.2%112/133 | 0.0% | |
|
Documented negative result — the v2.1 prompt patch is reader-specific. A1g_K48T is the closed-weights companion to A1h_v3 on the v2.1 contract: same retrieval substrate, same rider package, only the reader differs. Result: 425/500 = 85.00%, −3.00 pp below A1g v2 K=24 (88.00%) and −3.80 pp below A1h_v3 v2.1 (88.80%). Net regression. |
|||||||||||
| ° #19 | v2 K=24 |
gemma-4-26b-a4b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
|
84.8%[82.0–88.2]%85.1% of ceiling | 94.6%53/56 | 97.1%68/70 | 70.0%21/30 | 79.7%106/133 | 94.9%74/78 | 76.7%102/133 | 7.6% | |
|
Cheapest-to-deploy entry in this score band on AMB·LongMemEval — A4B MoE with ~4B active params per token runs on a single 24 GB consumer GPU; +1.2 pp over Hindsight + OSS-20B (83.6% paper, dense 20B requires ~40 GB VRAM) at less than half the active-compute footprint. Phase 97.2.10-B2-FOLLOWUP A1i row. Reader: novita://google/gemma-4-26b-a4b-it, max_tokens=768, temperature=0, reader_concurrency=2 (Novita rate-limited; 68 retries during the run). |
|||||||||||
| ° #20 | v2 K=24 |
gpt-oss-120b-novita
Stack v2 (canonical retrieval @ K=24; Track 2-lite v2 prompts)
|
84.6%[81.2–87.5]%84.9% of ceiling | 94.6%53/56 | 95.7%67/70 | 70.0%21/30 | 82.7%110/133 | 93.6%73/78 | 74.4%99/133 | 7.6% | |
|
Direct head-to-head with Hindsight on the same open-weight reader — gpt-oss-120b @ reasoning_effort=high is the same backbone Hindsight paper-Table 3 reports 89.0% with on LongMemEval-S, and AMB-verified Hindsight v0.4.19 reports 94.6% with. Our retrieval substrate (BGE + QNDN + BM25 → RRF → MixK → top-24) + Track 2-lite v2 prompts on the same reader land 84.6%, −4.4 pp behind Hindsight-paper and −10.0 pp behind AMB-verified Hindsight v0.4.19. The gap is architectural, not retrieval — their TEMPR (graph- and time-aware retrieval), CARA (consolidated agent recall), and Observations (auto knowledge consolidation) stack does work above and below our pipeline that we don't have yet (Phase 97.3 roadmap). v1 of this run hit `stop=length` on 28 of 500 questions at max_tokens=4096 because gpt-oss harmony-format consumes the hidden analysis channel first and exposes content only at the final channel; those 28 returned empty -> guaranteed-zero. v2 re-ran them at max_tokens=16384, recovering 25 of 28 (3 still hit length even at 16K — the model genuinely uses >16K reasoning tokens on the hardest multi-session questions). Final 84.60% = 423/500, +2.60 pp over v1's 82.00%. Reader: Novita.ai, openai/gpt-oss-120b, reasoning_effort=high, temperature=0, concurrency=4. Inference cost ~$2.36 + ~$0.18 for the truncated re-run + ~$0.20 judge. |
|||||||||||
Frozen retrieval artifacts: K=24 (R@24=99.6%) · K=48 cache (sha256 f973dd45797f344a..., R@48=100.0%) · K=64 cache (R@64-loose=100.0%, R@64-strict=99.0%). All three are byte-identical strict supersets — top-24 of K=48 == K=24 cache, top-48 of K=64 == K=48 cache, both verified 500/500 byte-match. Same 500 LongMemEval-S questions, same GPT-4o judge (K=5, 3-of-5 majority) across all six contracts.
Open-weights readers ranked by overall accuracy on the same 500 LongMemEval-S questions and the same top-10 evidence chunks (post-MixK rerank).
Two rows carry ManifoldMemory's production tags: ★ Open-Max — the
max-quality production policy, currently 73.4% — and ☆ Efficient
— the cheaper-to-self-host production policy, currently 70.0%. Every other row is a comparison
anchor under the same retrieval contract.
Hosting matters: Open-Max, Efficient, experimental, retrieval-only, no retrieval,
and open swap rows are self-hosted on our own H100/A100/H200 cluster — we control checkpoint, prompt, runtime,
and quantisation. Open · API-hosted rows are open-weights readers run through the vendor's hosted API; weights
are public, but provider-side serving template, sampling defaults, and version routing are not under our control, so those rows
are reported as hosted open-weight references in the same ranked board.
Refusals are scored as incorrect in accuracy; the Ref % column reports them separately because abstention on missing
evidence is a different failure mode from fabrication.
| # | Track | Reader / Pipeline | Overall acc · 95% CI |
SSA | SSU | SSP | TR | KU | MS | Ref % | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ★ #2 | ManifoldMemory · Open-Max |
gpt-oss-120b @ high
Stack + F3-on-TR (oracle qtype: TR -> F3 date-anchor, else -> Stack; no Naked routing)
|
73.4%[69.4–77.1]%76.3% of ceiling | 96.4%54/56 | 92.9%65/70 | 40.0%12/30 | 68.4%91/133 | 87.2%68/78 | 57.9%77/133 | 17.6% | |
|
RL v1 Phase 95-M.10b: Stack + F3-on-TR companion to the M.9b Stack baseline, and the row that earns ManifoldMemory's Open-Max production tag. Routes the 133 temporal-reasoning questions through the F3 date-anchor variant (CURRENT_DATE prefix + standard scaffold, prompt benchmark_prompt.f3tr.v1.md); reuses the M.9b Stack answers verbatim for the other 367 questions. F3 lifts TR from 84/133 (63.2%) to 91/133 (68.4%), +5.2 pp on slice and +1.4 pp overall (72.0% -> 73.4%). Beats Gemma-4 Hybrid (70.0%, Efficient track) by 3.4 pp and Gemma-4 Hybrid+F3-on-TR (71.4%, experimental) by 2.0 pp -- first measured open-weight reader to clear the Gemma-4 Hybrid Efficient threshold by >=3 pp on the LME-S 500-q frozen retrieval contract. The M.10c canonical-Hybrid v2 attempt landed below this row (corrected Naked-v2 prompt: SSA Naked = 39/56 = 69.6% measured, well below Stack-SSA's 96.4%); Naked routing offers no marginal value over Stack at this model + provider + 4096-token-budget combination. Reader: novita://openai/gpt-oss-120b@high, max_tokens=4096, concurrency=4, temperature=0. Judged file synthesised from M.9b Stack judged + M.10 F3-on-TR judged (judge is deterministic per (qid, answer) pair: gpt-4o-2024-08-06, T=0, fixed seeds {42,1,2,3,4}); 0 answer-text mismatches verified before scoring. |
|||||||||||
| #2 | Open swap |
gpt-oss-120b @ high
Stack (frozen retrieval contract; canonical prompt v1)
|
72.0%[67.9–75.8]%74.8% of ceiling | 96.4%54/56 | 92.9%65/70 | 40.0%12/30 | 63.2%84/133 | 87.2%68/78 | 57.9%77/133 | 18.4% | |
|
RL v1 Phase 95-M.9 self-host-verified open-weight reader: openai/gpt-oss-120b @ reasoning_effort=high. Apache-2.0 weights; MoE 117B total / 5.1B active params, native MXFP4 (~60 GB on disk). The published run was produced via Novita.ai (api.novita.ai/v3/openai) for cost/concurrency reasons, then cross-host-calibrated against a self-hosted H200 141GB deployment (HF transformers + openai-harmony stack, identical sampling and prompt contract). M.10d cross-host calibration: Δ ≈ 0 pp accuracy on spot-checks (informal — no formal 500q replay was preserved; the M.9 20b @ low pair already established a formal Δ = 0.0 pp anchor on the same gpt-oss family on plain-vLLM hosts). Production self-hosting target: a single H200 141GB or B200 192GB; H100 80GB is too tight for 5.1B-active forward + KV cache once concurrency rises. max_tokens=4096 (matches @20b-high budget policy: extend total budget so effective visible-answer budget stays ~1024 despite the analysis channel growing 3-4× at high effort). concurrency=8, temperature=0. Result: 72.0% (CI [67.9, 75.8]) — within the CI band of claude-opus-4-7 (72.6%), beats gemini-3-flash (70.8%), trails gpt-5.5 (73.4%) by 1.4 pp. Scales +9.2 pp over the 20b @ high companion (62.8%) at the same effort, with every qtype improving (notably SSP 23.3 → 40.0%, KU 73.1 → 87.2%). Audition-grade open-weight result. |
|||||||||||
| #3 | Experimental |
Gemma-4-26B-A4B-it
Hybrid + F3-on-TR (oracle qtype + date-anchor on TR slice)
|
71.4%[67.3–75.2]%74.2% of ceiling | 92.9%52/56 | 88.6%62/70 | 26.7%8/30 | 63.9%85/133 | 74.4%58/78 | 69.2%92/133 | 10.8% | |
|
Best Gemma-4 routed variant we have measured. Adds a date-anchor pre-prompt on the temporal-reasoning slice only (F3). Allowed but not the public headline; +1.4 pp over the Efficient row (Gemma-4 Hybrid), McNemar p ≈ 0.38 — directional, not yet decisive. |
|||||||||||
| #4 | Open · API-hosted |
gemma-4-31b-novita
Stack (frozen retrieval contract; canonical prompt v1)
|
70.8%[66.7–74.6]%73.6% of ceiling | 94.6%53/56 | 91.4%64/70 | 33.3%10/30 | 63.2%84/133 | 83.3%65/78 | 58.6%78/133 | 18.8% | |
|
Hosted open-weight reader: google/gemma-4-31b-it via Novita.ai (api.novita.ai/v3/openai). Gemma 4 flagship dense architecture, 31.3B all-active, 256K context, native bf16. Provider-side checkpoint hash / serving template / sampling defaults / thinking-mode toggle unverified. max_tokens=2048, concurrency=8, temperature=0. Architectural comparison row to the self-hosted Gemma-4-26B-A4B-it (MoE 26.8B/3.8B-active) rows already on the board: same family, dense vs MoE; serving-stack also differs (Novita serverless vs H100 HF transformers). |
|||||||||||
| #5 | Open swap |
gpt-oss-120b @ high
Hybrid v2 + F3-on-TR (corrected Hybrid: SSA -> Naked v2, TR -> F3 date-anchor, else -> Stack)
|
70.4%[66.3–74.2]%73.2% of ceiling | 69.6%39/56 | 92.9%65/70 | 40.0%12/30 | 68.4%91/133 | 87.2%68/78 | 57.9%77/133 | 17.4% | |
|
RL v1 Phase 95-M.10c: corrected Hybrid + F3-on-TR (oracle qtype: SSA -> Naked-v2 [haystack-first prompt], TR -> F3 date-anchor, else -> Stack). The original v1 Hybrid attempt placed the question BEFORE a 125K-token haystack and produced 0/56 SSA Naked (lost-in-the-question failure); the v2 prompt rearranges to haystack first, question after, with anchor 'Answer the QUESTION above using ONLY the CONVERSATION HISTORY shown above the separator'. Smoke check 3/3 on the same SSA qids that v1 missed. Full-run measured: SSA Naked = 39/56 = 69.6%, well below Stack-SSA's 96.4%. Two failure modes: (a) 6/56 cells returned empty completions (refused=False, raw='') -- likely max_tokens=4096 consumed by the analysis channel on the longer haystacks, and (b) of the 50 non-empty answers, only ~33 (66%) pass K=5 GPT-4o judge under 3-of-5 majority vote -- gpt-oss-120b produces fluent answers that frequently miss gold on nuanced wording, exact numbers, and named entities. Net: 70.4% overall, 3.0 pp BELOW the Stack+F3-on-TR companion (M.10b, 73.4%): Naked routing on the SSA slice destroys ~1.5x more accuracy than F3-on-TR adds on the TR slice for this reader. Other 444 cells reuse the M.9b Stack answers (311 cells) and the M.10 F3-on-TR answers (133 cells) verbatim; judge file built deterministically from existing per-(qid, answer) judges plus K=5 GPT-4o on the 56 corrected SSA cells. Reader: novita://openai/gpt-oss-120b@high, max_tokens=4096, concurrency=4, temperature=0. Conclusion: corrected Hybrid+F3 does not surpass Stack+F3 for gpt-oss-120b @ high. ManifoldMemory reader policy: Stack+F3-on-TR (M.10b). |
|||||||||||
| ☆ #6 | ManifoldMemory · Efficient |
Gemma-4-26B-A4B-it
Hybrid (oracle qtype: SSA→Naked, else→Stack)
|
70.0%[65.8–73.9]%72.8% of ceiling | 94.6%53/56 | 88.6%62/70 | 33.3%10/30 | 56.4%75/133 | 73.1%57/78 | 69.9%93/133 | 10.0% | |
|
The number ManifoldMemory publicly stands behind. Pure routing over the existing Stack and Naked answers; no extra inference. Production-equivalent uses a learned qtype classifier in place of oracle (pending; expected 1–3 pp gap). |
|||||||||||
| #7 | Open swap |
gpt-oss-120b @ high
Hybrid v2 (corrected Hybrid: SSA -> Naked v2 [haystack-first prompt], else -> Stack)
|
69.0%[64.8–72.9]%71.7% of ceiling | 69.6%39/56 | 92.9%65/70 | 40.0%12/30 | 63.2%84/133 | 87.2%68/78 | 57.9%77/133 | 18.2% | |
|
RL v1 Phase 95-M.10c: corrected Hybrid v2 (oracle qtype: SSA -> Naked-v2 [haystack-first prompt], else -> Stack). The v1 attempt placed the question BEFORE the 125K-token haystack and produced 0/56 SSA Naked (lost-in-the-question failure); the v2 prompt rearranges to haystack first, question after, with anchor 'Answer the QUESTION above using ONLY the CONVERSATION HISTORY shown above the separator'. Smoke check 3/3 correct on the same SSA qids that v1 missed. Full-run measured Naked SSA = 39/56 = 69.6%, well below the Stack-SSA baseline of 54/56 = 96.4%, because (a) 6/56 cells returned empty completions (refused=False, raw='') -- likely the 4096-token max-tokens budget consumed entirely by the analysis channel on the longer haystacks -- and (b) of the 50 non-empty answers, only ~33 (66%) pass the K=5 GPT-4o judge under 3-of-5 majority vote, since gpt-oss-120b produces fluent answers that frequently miss gold on nuanced wording, exact numbers, and named entities. Net: corrected Hybrid v2 lands 4.4 pp BELOW Stack+F3-on-TR (M.10b, 73.4%) and 3.0 pp below Stack alone (M.9b, 72.0%). Other 444 cells reuse the M.9b Stack answers verbatim; judge file built deterministically from existing per-(qid, answer) judges plus K=5 GPT-4o on the 56 corrected SSA cells. Reader: novita://openai/gpt-oss-120b@high, max_tokens=4096, concurrency=4, temperature=0. Conclusion: for gpt-oss-120b @ high, Naked routing offers NO marginal value on the SSA slice; ship Stack everywhere with F3-on-TR (M.10b) as the ManifoldMemory reader policy. |
|||||||||||
| #8 | Open · API-hosted |
deepseek-v4-pro
Stack (frozen retrieval contract; canonical prompt v1)
|
67.4%[63.2–71.4]%70.1% of ceiling | 96.4%54/56 | 85.7%60/70 | 36.7%11/30 | 63.2%84/133 | 80.8%63/78 | 48.9%65/133 | 25.8% | |
|
RL v1 Phase 95-M.8 hosted open-weight reference: deepseek-ai/DeepSeek-V4-Pro on Hugging Face under MIT (1.6T total / 49B active MoE, FP4+FP8, 1M ctx). Run via DeepSeek's first-party API (api.deepseek.com/v1) for cost/speed, NOT a local checkpoint — checkpoint hash, serving template, thinking-mode toggle, sampling defaults beyond temperature, hidden system prompt, provider-side patches, and version/route pinning were not independently verified. Thinking-on-by-default — completions return both content and reasoning_content; reasoning tokens consume the same completion budget. max_tokens=2048 (doubled internally to leave headroom for visible answer + cite array), temperature=0, reader_concurrency=16. Same 500q frozen retrieval contract. Local-vLLM re-verification queued when budget permits. |
|||||||||||
| #9 | Open swap |
Qwen3.6-27B
Hybrid (oracle qtype: SSA→Naked, else→Stack)
|
66.2%[61.9–70.2]%68.8% of ceiling | 96.4%54/56 | 88.6%62/70 | 40.0%12/30 | 45.9%61/133 | 71.8%56/78 | 64.7%86/133 | 14.8% | |
|
Same Hybrid pipeline as the ManifoldMemory Efficient row, different open-weights reader. Phase 91.8 swap. The 3.8 pp gap to the Gemma-4 Hybrid (Efficient track) is paid mostly on the SSA branch (Qwen's 110K reading is weaker). |
|||||||||||
| #10 | Open · API-hosted |
Qwen2.5-72B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
|
65.2%[60.9–69.2]%67.8% of ceiling | 96.4%54/56 | 87.1%61/70 | 33.3%10/30 | 53.4%71/133 | 74.4%58/78 | 54.1%72/133 | 21.8% | |
|
RL v1 Phase 95-N.1 hosted open-weight reference: Qwen/Qwen2.5-72B-Instruct via Novita.ai (api.novita.ai/v3/openai), model_id=qwen/qwen-2.5-72b-instruct. Apache-2.0 weights, dense 72.7B, 128K context, non-thinking (no analysis channel; full max_tokens budget is the visible answer). Provider-side checkpoint hash / quantization (FP8 vs GPTQ-Int4 vs BF16 unknown) / serving template / hidden system prompt unverified — same open-weights-api caveat that flags hosted DS V4 Pro and Kimi K2.6. User intent referenced GPTQ-Int4 specifically; Novita does not expose quant choice on this endpoint, so the measured score reflects whatever quant the provider serves, not necessarily GPTQ-Int4. Cross-host calibration on H100 80GB with the actual Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 weights is follow-on work if needed (mirrors the M.10d cross-host pattern for gpt-oss-120b). max_tokens=1024, temperature=0, reader_concurrency=8. Refusal rate 21.8% (109/500) — the dominant headline-suppressor; Qwen 2.5 is conspicuously over-conservative on single-session-preference (10/30 = 33.3%, vs Gemma's ~80%) and weak on temporal-reasoning (71/133 = 53.4%, vs gpt-oss-120b Stack's ~63% and Stack+F3's ~68%). SSA branch is healthy at 96.4% (54/56), matching the field. Net: this row sits in the 60-67% open-weights-api band alongside DeepSeek V4 Pro (67.4%) and Qwen 3.6 Hybrid (66.2%), comfortably below the ManifoldMemory Open-Max reader (gpt-oss-120b Stack+F3, 73.4%) and ManifoldMemory Efficient reader (Gemma-4 Hybrid, 70.0%) on the same frozen retrieval contract. |
|||||||||||
| #11 | Open swap |
gpt-oss-20b @ low
Stack (frozen retrieval contract; canonical prompt v1)
|
64.0%[59.7–68.1]%66.5% of ceiling | 94.6%53/56 | 84.3%59/70 | 43.3%13/30 | 49.6%66/133 | 83.3%65/78 | 48.1%64/133 | 24.8% | |
|
RL v1 Phase 95-M.9 self-hosted open-weight reader: openai/gpt-oss-20b @ reasoning_effort=low (Apache 2.0, MoE 21B total / 3.6B active params, native MXFP4 ~13 GB on disk). Run on H100 80GB via HF transformers + the openai-harmony / kernels stack for native MXFP4; max_new_tokens=1024 (analysis channel is brief at low effort, so the effective visible-answer budget is ~1024, matching the leaderboard policy for thinking models). Output uses the harmony multi-channel format (analysis | final); we parse out the <|channel|>final<|message|> block and discard analysis before scoring. CROSS-HOST CALIBRATION (open-weights-api confound, measured): the same Apache-2.0 weights served via Novita.ai's hosted MXFP4 vLLM stack — different inference engine, different host, different OpenAI-compatible chat-template path — produced an identical 64.0% / 320-of-500 / matching per-qtype breakdown on the same 500-question contract (Δ = 0.0 pp accuracy, +0.2 pp refusal = 1 question of 500). For this reader, the open-weights-api caveat that flags hosted DS V4 Pro / Kimi K2.6 is empirically zero on plain-vLLM hosts; we therefore do not publish the Novita-low row separately — it would be visual padding without measurement value. Caveat does not auto-extend to first-party APIs (DeepSeek/Moonshot) which customize chat templates and thinking-mode toggles. Companion @high row lands separately (canonical = Novita; self-hosted @high terminated mid-run after the @low calibration retired the need for a self-hosted re-confirmation at 6–10× the wall-clock cost). |
|||||||||||
| #12 | Retrieval-only |
Gemma-4-26B-A4B-it
Stack (BGE∪QNDN∪BM25 → RRF → top-10)
|
63.8%[59.5–67.9]%66.3% of ceiling | 39.3%22/56 | 88.6%62/70 | 33.3%10/30 | 55.6%74/133 | 75.6%59/78 | 69.2%92/133 | 14.4% | |
|
Stack only, no full-context fallback on the SSA slice. The +6.2 pp the Efficient Hybrid (Gemma-4) earns over this row is the routing contribution; the rest is the reader. |
|||||||||||
| #13 | Open · API-hosted |
gpt-oss-20b @ high
Stack (frozen retrieval contract; canonical prompt v1)
|
62.8%[58.5–66.9]%65.3% of ceiling | 89.3%50/56 | 84.3%59/70 | 23.3%7/30 | 57.9%77/133 | 73.1%57/78 | 48.1%64/133 | 19.2% | |
|
RL v1 Phase 95-M.9 hosted open-weight reader: openai/gpt-oss-20b @ reasoning_effort=high via Novita.ai (api.novita.ai/v3/openai). Apache-2.0 weights, MoE 21B total / 3.6B active, native MXFP4. max_tokens=4096 (analysis channel is ~3-4× longer at high effort; we extend total budget so the effective visible-answer budget stays ~1024, matching the DeepSeek/Kimi thinking-model policy and the @low row). Output uses the harmony multi-channel format (analysis | final); Novita auto-routes the analysis channel into reasoning_content and the final answer into content, which we use directly. Hosted/canonical for this row: a self-hosted H100 companion at @high was launched but terminated early — at sustained ~0.013 q/s under the 4096-token budget (high reasoning emits ~3-4× more analysis tokens, lengthening every forward pass), the row needed 6-10 hr wall-clock to confirm what the @low cross-host calibration (Δ=0.0 pp) already implied. Within-row ablation against the @low companion: HIGH UNDERPERFORMS LOW BY 1.2 PP on this benchmark (62.8% vs 64.0%), trading -5.8 pp refusal for -1.2 pp accuracy — at high effort the model talks itself out of correct retrieval-grounded reads (SSA 89.3% vs 94.6% at low; SSP collapses to 23.3%). For this reader on this contract, low is the right config; high is anti-helpful. |
|||||||||||
| #14 | Retrieval-only |
Qwen3.6-27B
Stack (same retrieval contract)
|
60.0%[55.6–64.2]%62.4% of ceiling | 37.5%21/56 | 90.0%63/70 | 40.0%12/30 | 46.6%62/133 | 73.1%57/78 | 63.9%85/133 | 21.2% | |
|
Reader swap, Stack-only. Higher refusal rate than Gemma at the same evidence — Qwen declines more readily when the top-10 chunks under-cover the question. |
|||||||||||
| #15 | Open · API-hosted |
kimi-k2.6
Stack (frozen retrieval contract; canonical prompt v1)
|
58.6%[54.2–62.8]%60.9% of ceiling | 92.9%52/56 | 80.0%56/70 | 16.7%5/30 | 54.1%72/133 | 61.5%48/78 | 45.1%60/133 | 19.6% | |
|
RL v1 Phase 95-M.8 hosted open-weight reference: moonshotai/Kimi-K2.6 on Hugging Face under Modified MIT (≈1T total / ≈32B active MoE, INT4, 256K ctx). Run via Moonshot AI's hosted API (api.moonshot.ai/v1) for cost/speed, NOT a local checkpoint — checkpoint hash, serving template, thinking-mode toggle, sampling defaults, hidden system prompt, provider-side patches, and version/route pinning were not independently verified. K2.x is thinking-on-by-default with very verbose reasoning_content (~3–5K chars of CoT on real evidence prompts) that shares the completion budget; max_tokens=2048 (KimiReader doubles 1024 internally), temperature omitted (kimi-k2.x rejects temperature ≠ 1), reader_concurrency=8. Same 500q frozen retrieval contract. Local-vLLM re-verification queued when budget permits. |
|||||||||||
| #16 | No retrieval |
Qwen3.6-27B
Naked (no retrieval)
|
58.4%[54.0–62.6]%60.7% of ceiling | 96.4%54/56 | 91.4%64/70 | 16.7%5/30 | 36.1%48/133 | 75.6%59/78 | 46.6%62/133 | 38.8% | |
|
No-retrieval floor for the open-swap track. |
|||||||||||
| #17 | Experimental |
Gemma-4-26B-A4B-it
CWFIX (Stack + F1 refusal-retry + F2 chronological-KU + F3 date-anchor, all qtypes)
|
56.6%[52.2–60.9]%58.8% of ceiling | 30.4%17/56 | 87.1%61/70 | 20.0%6/30 | 62.4%83/133 | 64.1%50/78 | 49.6%66/133 | 20.4% | |
|
Cheap-win-fix bundle applied unconditionally. Regresses vs Efficient (F1 and F2 are harmful when ungated). Published here to keep the negative result auditable; F3 alone, scoped to TR, is the only keeper. |
|||||||||||
| #18 | No retrieval |
Gemma-4-26B-A4B-it
Naked (no retrieval, LME-S 110 K-token haystack)
|
53.2%[48.8–57.5]%55.3% of ceiling | 94.6%53/56 | 88.6%62/70 | 13.3%4/30 | 33.8%45/133 | 65.4%51/78 | 38.3%51/133 | 39.2% | |
|
Floor row: reader receives the full LME-S haystack and no retrieval at all. The +10.6 pp Stack earns over this row is the retrieval contribution. Run on 8×A100 (≈140 GB); H200/H100 OOMs without quantisation. |
|||||||||||
| #19 | Open swap |
Qwen2.5-7B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
|
25.0%[21.4–29.0]%26.0% of ceiling | 57.1%32/56 | 38.6%27/70 | 20.0%6/30 | 18.8%25/133 | 21.8%17/78 | 13.5%18/133 | 52.2% | |
|
RL v1 proof-of-flow: first row produced by the public submission harness end-to-end (frozen evidence -> public runner -> real reader -> canonical judge -> leaderboard row). H100 80GB, bf16, 500q, max_new_tokens=256, temperature=0. Substantially weaker than the 27B Qwen row at the same evidence; the row's purpose is harness reproducibility, not a Qwen2.5-class performance claim. |
|||||||||||
| #20 | Open swap |
Qwen2.5-3B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
|
12.4%[9.8–15.6]%12.9% of ceiling | 10.7%6/56 | 14.3%10/70 | 6.7%2/30 | 10.5%14/133 | 15.4%12/78 | 13.5%18/133 | 23.6% | |
|
RL v1 Phase 95-M.5: 3B point on the Qwen2.5/3.6 family parameter-scaling spine (1.5B -> 3B -> 7B -> 27B Stack), same harness, same retrieval contract. H100 80GB, bf16, 500q, max_new_tokens=256, temperature=0. |
|||||||||||
| #21 | Open swap |
Qwen2.5-1.5B-Instruct
Stack (frozen retrieval contract; canonical prompt v1)
|
8.8%[6.6–11.6]%9.1% of ceiling | 10.7%6/56 | 15.7%11/70 | 6.7%2/30 | 6.8%9/133 | 9.0%7/78 | 6.8%9/133 | 51.2% | |
|
RL v1 Phase 95-M.5: 1.5B floor on the Qwen2.5/3.6 family parameter-scaling spine. Same harness, same retrieval contract. H100 80GB, bf16, 500q, max_new_tokens=256, temperature=0. |
|||||||||||
Click any row for model id, judge metadata, and methodology notes.
qtype legend: SSA = single-session-assistant · SSU = single-session-user · SSP = single-session-preference ·
TR = temporal-reasoning · KU = knowledge-update · MS = multi-session.
Cells show acc% over k/n after K=5 3-of-5 GPT-4o majority vote.
Hybrid rows route by oracle qtype label from LongMemEval-S; the production-equivalent learned classifier is pending and is
expected to land within 1–3 pp of oracle. Naked rows have no retrieval, only the full LME-S 110 K-token haystack.
Is this benchmark just who has highest accuracy? No — the single accuracy column is a scalar collapse of a richer 2D structure. Decomposed by qtype family, each reader+pipeline pair lives at a point in the plane below: X = associative reading (answer-in-one-chunk lookup), Y = geometric synthesis (answer-across-chunks composition). Same axes Google / CMU use to distinguish associative from geometric internal memory in deep sequence models (Noroozizadeh et al. 2025) — but read off behaviour, not hidden states.
Balanced leaders (top-right): high on both axes — readers that can both look up locally and compose across fragments. Strong-local / moderate-synthesis (middle-right): local-memory pillar intact, synthesis lags 4–6 points — the canonical "more associative than geometric" failure mode. Local-strong / synthesis-weak (bottom-right): highest local-memory scores on the board but synthesis falls off the cliff — "associative-only" readers in the Google / CMU sense. Hybrid trade-off / SSA penalty (top-left): a fixed reader stranded ~15 points to the left of itself on a different pipeline — the visible imprint of pipeline-as-vector-field.
When the same model appears at multiple points (e.g. gpt-oss-120b @ high on Stack vs. Hybrid-v2), the gap is not a reader-geometry property — it's the pipeline itself acting as a vector field over a fixed substrate. Routing can over-optimise one direction and break another (e.g. SSA → Naked v2 trades a 15 pp local-memory drop for a small synthesis gain). That is the literal visualisation of why holding retrieval fixed and varying the reader+pipeline pair is the right experimental knob to twist.
This is a 2D projection of behaviour — the actual hidden-state geometry is unmeasured. Position on this projection is mediated by chat template, sampling defaults, refusal posture, and serving host; the behavioural fingerprint is not a hidden-state read. We do not claim "reader X has more geometric parametric memory than reader Y." The projection is the cheap (Audit-B-lite) version of the full Reader Manifold Audit; the hidden-state work is queued in the experiment journal.
How to read the projection.
Each marker is one reader+pipeline pair. Distance from origin = total memory strength;
angle = associative-vs-geometric ratio. The dashed lines mark the median X and Y across the top-10
— readers above-and-right of both medians are the balanced leaders, readers in the bottom-right are
strong on lookup but weak on composition.
The chart re-renders from leaderboard.json on every build, so it
never falls out of sync with the table.
If our two-axis (LMS / GSS) framing is right, then numerical scores from a benchmark that emphasises native long-context system ability (multi-round retrieval, planning, in-context computation, distractor density — e.g. MRCRv2) should not rank-correlate strongly with numerical scores from a benchmark that isolates fixed-evidence reader synthesis (frozen retrieval, top-10 chunks, structured grounded-summary output — this leaderboard). Below: 5-row overlap of models that appear in both, with per-model deltas. Two-population structure with a convergence point at Opus 4.6. The cleanest single data point is GPT-5.4: #1 on MRCRv2 at 97%, mid-pack here at 71.2%.
| Model | Track | MRCRv2 | Reader (best row) | Δ (Reader − MRCR) | Reading |
|---|---|---|---|---|---|
Gemma-4-26B-A4B-it Hybrid + F3-on-TR | open-weights | 44.1% | 71.4% | +27.3 pp | Mediocre at native long-context, excellent reader given clean evidence. |
GPT-OSS-120B @ high Open-Max (Stack) | open-weights | 59.0% | 73.4% | +14.4 pp | Same direction as Gemma; weaker magnitude. |
Claude Opus 4.6 fixed-retrieval reference | closed-API | 76.0% | 75.6% | -0.4 pp | The only overlapping model where the two benchmarks numerically agree. |
GPT-5 mini fixed-retrieval reference | closed-API | 79.0% | 59.0% | -20.0 pp | Strong at native long-context; mediocre reader on fixed evidence. |
GPT-5.4 fixed-retrieval reference | closed-API | 97.0% | 71.2% | -25.8 pp | Largest inverted gap. The cleanest single refutation of "MRCRv2 ranking == Reader Leaderboard ranking". |
Positive-Δ readers (Gemma-4-26B-A4B-it, GPT-OSS-120B @ high) punch above their long-context score once retrieval does the heavy lifting — they read better than their MRCRv2 implies, by 14–27 pp. Negative-Δ readers (GPT-5.4, GPT-5 mini) regress 20–26 pp on fixed evidence — a meaningful share of their MRCRv2 score is non-reading work (planning, multi-round routing). Convergence point (Opus 4.6, Δ ≈ 0): the only overlapping model where the two benchmarks numerically agree. Models strong on both axes are rare.
Phase 96-T.3.1 already mapped Anthropic's MRCRv2 / GraphWalks split onto the same two axes the projection above names (LMS / GSS). T.3.1 was the theoretical cross-check; this table is the empirical one — the rank-order disagrees between the two benchmarks, and disagrees in a structured way (positive-Δ / negative-Δ / convergence). Anyone arguing "your benchmark just measures general model strength" has to explain why the #1 MRCRv2 model is mid-pack on ManifoldMemory.
Per-row MRCRv2 magnitudes are display-only pending source verification. The two-population direction is robust to per-row noise; the numerical deltas are not. We frame this as a directional cross-check, journal the falsifiability note in Phase 96-T.3.3, and re-run the correlation when verified MRCRv2 numbers ship. Internal evidence for the framework remains the 2D projection above and the 5-model overlap shown here; the verified-MRCR re-run is the next public step.
How to read the table.
Rows are sorted by ascending MRCRv2. The Reader (best row) column is the highest-accuracy ManifoldMemory
pipeline for that model on this leaderboard's 500-Q LME-S evaluation; Δ is the gap between
the two benchmarks on the same model. Green Δ = stronger reader than long-context score predicts;
red Δ = weaker. The table regenerates from
mrcr_snapshot.v1.json on every build, and auto-suppresses
when the snapshot is removed or unstaged.
These rows are genuinely closed-weights — the model can only be reached through the vendor's API, and the weights are not public. They are reported in a separate reference track on the identical retrieval contract so the open and closed regimes can be read side-by-side without conflating them into one ranking. Reproducibility is limited to whatever the API serves on the run date; these rows are published as comparison anchors, not contestants. Hosted open-weight rows (DeepSeek V4 Pro, Kimi K2.6) are not here — they sit in the ranked board above under the Open · API-hosted tag.
| Track | Reader / Pipeline | Overall acc · 95% CI |
SSA | SSU | SSP | TR | KU | MS | Ref % | ||
|---|---|---|---|---|---|---|---|---|---|---|---|
| ref | Closed reference |
claude-opus-4-6
Stack (frozen retrieval contract; canonical prompt v1)
|
75.6%[71.6–79.2]%78.6% of ceiling | 91.1%51/56 | 92.9%65/70 | 56.7%17/30 | 72.2%96/133 | 87.2%68/78 | 60.9%81/133 | 15.8% | |
|
RL v1 Phase 95-M.7 predecessor closed-reference reader: anthropic://claude-opus-4-6 (the version before 4.7). Anthropic Messages API, max_tokens=1024, sampling parameters omitted (4.6 also rejects temperature/top_p/top_k at the API level), reader_concurrency=4 (Tier-1 30K input-tpm cap), SDK max_retries=8. Same 500q frozen retrieval contract. |
|||||||||||
| ref | Closed reference |
gpt-5.5
Stack (frozen retrieval contract; canonical prompt v1)
|
75.0%[71.0–78.6]%78.0% of ceiling | 100.0%56/56 | 95.7%67/70 | 56.7%17/30 | 69.9%93/133 | 84.6%66/78 | 57.1%76/133 | 16.8% | |
|
RL v1 Phase 95-M.5 frontier closed-reference reader: openai://gpt-5.5 (snapshot gpt-5.5-2026-04-23). Chat Completions API, max_completion_tokens=1024, reasoning_effort=low, reader_concurrency=16. Same 500q frozen retrieval contract; visible answer tokens share completion budget with internal CoT. |
|||||||||||
| ref | Closed reference |
claude-opus-4-7
Stack (frozen retrieval contract; canonical prompt v1)
|
72.6%[68.5–76.3]%75.5% of ceiling | 98.2%55/56 | 85.7%60/70 | 46.7%14/30 | 69.9%93/133 | 80.8%63/78 | 58.6%78/133 | 26.6% | |
|
RL v1 Phase 95-M.6 frontier closed-reference reader: anthropic://claude-opus-4-7 (released 2026-04-16). Anthropic Messages API, max_tokens=1024, adaptive thinking (default), sampling parameters omitted (Opus 4.7 rejects temperature/top_p/top_k at the API level), reader_concurrency=4 (Tier-1 30K input-tpm cap), SDK max_retries=8 to absorb transient 429s. Same 500q frozen retrieval contract. |
|||||||||||
| ref | Closed reference |
gpt-5.4
Stack (frozen retrieval contract; canonical prompt v1)
|
71.2%[67.1–75.0]%74.0% of ceiling | 96.4%54/56 | 90.0%63/70 | 43.3%13/30 | 66.9%89/133 | 80.8%63/78 | 55.6%74/133 | 20.4% | |
|
RL v1 Phase 95-M.7 predecessor closed-reference reader: openai://gpt-5.4 (snapshot gpt-5.4-2026-03-05; the version before gpt-5.5). Chat Completions API, max_completion_tokens=1024, reasoning_effort=low, reader_concurrency=16. Same 500q frozen retrieval contract. |
|||||||||||
| ref | Closed reference |
gemini-3-flash
Stack (frozen retrieval contract; canonical prompt v1)
|
70.8%[66.7–74.6]%73.6% of ceiling | 92.9%52/56 | 92.9%65/70 | 36.7%11/30 | 64.7%86/133 | 83.3%65/78 | 56.4%75/133 | 17.0% | |
|
RL v1 Phase 95-M.6 frontier-Flash closed-reference reader: gemini://gemini-3-flash-preview (Gemini 3 family, Flash tier). google-genai SDK, max_output_tokens=1024, temperature=0, thinking_budget=512. Tier 1 limits on Flash are generous (multi-thousand RPD vs Pro 250 RPD), so reader_concurrency=16 sustained ~8 q/s. Same 500q frozen retrieval contract. (Pro Preview row deferred: gemini-3.1-pro-preview hit Tier 1 daily quota at row 89/500; will re-run when daily window resets.) |
|||||||||||
| ref | Closed reference |
gpt-5-mini
Same retrieval contract, closed-weights reader
|
59.0%[54.6–63.2]%61.3% of ceiling | 37.5%21/56 | 90.0%63/70 | 36.7%11/30 | 59.4%79/133 | 66.7%52/78 | 51.9%69/133 | 25.0% | |
|
Closed-weights API reader on the identical retrieval contract. Reference row for cross-class comparison; reproducibility is limited to whatever the API serves on the run date. |
|||||||||||
Same retrieval contract, same 500 questions, same K=5 GPT-4o majority-vote judge. Same protocol — different track. For context: the open-weights ★ Open-Max row above lands at 73.4% (gpt-oss-120b @ high, Stack + F3-on-TR); the closed reference row reads the identical evidence at 75.6%. Take this as evidence that fixed-evidence reading is its own bottleneck, not as a head-to-head claim on either model's preferred pipeline.
The board is not a single ranking. It is a stack of comparisons under the same retrieval contract: Open-Max vs. Efficient production policies, canonical vs. experimental routing, retrieval vs. no retrieval, open-weights vs. closed-weights. Read each pair against the two ManifoldMemory production rows (Open-Max for max quality, Efficient for cheap-to-self-host).
ManifoldMemory's max-quality reader policy. The strongest open-weight reader configuration measured under the frozen retrieval contract — self-host-verified, no closed-weights dependency. The number ManifoldMemory publicly stands behind for the highest-accuracy mode.
ManifoldMemory's efficient reader policy: smaller, cheaper, easier to self-host than Open-Max, slightly lower accuracy. Production-equivalent (oracle-qtype disclosed; learned classifier expected within 1–3 pp).
Routed or prompt variants we have measured under the same protocol. Allowed on the board, but not the public headline.
Stack handed straight to the reader. No qtype routing, no full-context fallback. The contribution of retrieval-without-routing.
Reader receives the full LME-S 110 K-token haystack and zero retrieved chunks. Floor row; the contribution of retrieval is measured against it.
Same Hybrid-family pipeline as the Efficient row, different open-weights reader, run locally (HF/vLLM/H200). Quantifies how much of the result is reader-specific. Self-host-verified — no closed-weights or vendor-API dependency.
Public open weights (Hugging Face, permissive licence) but the row was produced through the vendor's hosted endpoint, not a local checkpoint. We did not independently verify checkpoint hash, serving template, thinking-mode toggle, sampling defaults, hidden system prompt, provider-side patches, or version/route pinning — so this is a hosted open-weight reference, not equivalent to a self-hosted swap. Ranked alongside open-weights rows because the underlying weights are public.
Closed-weights API reader on the identical retrieval contract. Reference comparison only — not a head-to-head; reproducibility is limited to whatever the API serves on the run date.
A frozen-retrieval, reader-only comparison. Every row received the identical top-10 chunks (post-MixK rerank from 500 candidates) for the same 500 LongMemEval-S questions, judged by the same K=5 GPT-4o protocol.
Per-qtype breakdowns expose where each reader fails (SSA for chunk-rerank disruption, KU for stale knowledge, TR for date arithmetic) rather than aggregating them away into a single number.
Refusals are scored as incorrect in accuracy. They are also reported separately in the Ref % column because abstention on missing evidence is a different failure mode from fabrication, and the two have different downstream consequences for any system built on top.
Receipts are attached: per-row answer files, all five seed-judge files, and the synth scripts that produce the Hybrid rows from the underlying Stack and Naked answers.
A general LongMemEval leaderboard. Public LME-S numbers blend retrieval, routing, and reading; this board pins retrieval and routing and isolates the reader. Systems do not bring their own retrieval.
A claim that "small lab beats OpenAI." Closed-weights API readers (claude-opus-4-6, gpt-5.5, claude-opus-4-7, gpt-5.4, gemini-3-flash, gpt-5-mini) are published as reference rows on a fixed retrieval contract; they are not head-to-head claims on either model's preferred pipeline.
An agent-harness benchmark. Every row is a single-pass reader call — multi-step planners (Hindsight, ReAct, etc.) are out of scope by design, since they conflate reader skill with planner skill.
A frozen artifact. Submissions of additional readers (open weights or API) are welcome under the same contract: a public harness ships next.
The leaderboard is a derived artifact. The underlying answer-and-judge files are committed alongside the build script so any row can be re-scored from primary sources in one command.
_remote_pulls/phase91_a50/, _remote_pulls/phase91_a100_gemma_naked/,
_remote_pulls/phase91_hybrid_gemma/, _remote_pulls/phase91_hybrid_qwen/,
_remote_pulls/phase91_cwfix/, _remote_pulls/phase91_frontier/,
_remote_pulls/phase91_hybrid_f3tr/.
EXPERIMENT_JOURNAL_RECOVERED_2026-04-24.md § Phase 86 (R@5 receipt) → Phase 91 (Stack vs Naked) → Phase 95-K (oracle-qtype disclosure).
_tmp_scripts/_synth_hybrid_gemma.py (oracle-qtype router for Gemma),
_tmp_scripts/_synth_hybrid_qwen.py (Qwen),
_tmp_scripts/_synth_hybrid_f3_tr.py (Hybrid + F3-on-TR best-of-board).
_tmp_scripts/_build_leaderboard.py · machine-readable copy: leaderboard.json.