Warrant is not just a leaderboard. It is an instrumented memory engine where retrieval depth, prompt contracts, reader geometry, and arbiter selection create measurable forces over the task manifold. Every K bump, every rider, every reader swap, every arbiter is a vector in ℝ⁶, with one component per question type. Some forces lift one category and break another. Some are reader-agnostic. Some are reader-specific.
This page is the explanatory layer behind the benchmark: what each intervention actually did, why it did it, and what to do next. The numbers come from the same 500-question LongMemEval-S frozen-retrieval substrate the leaderboard uses, with the 5-seed GPT-4o judge protocol applied uniformly.
For each major intervention we ran, the table below lists the observed per-qtype delta in percentage points and the force class doing the work. Numbers are ground-truth measured deltas, not estimates — same 500 LongMemEval-S questions, same 5-seed GPT-4o judge protocol on both sides of every comparison. Δ > 0 means the intervention helped that qtype; Δ < 0 means it hurt.
| Intervention | Force class | Δ SSA | Δ SSU | Δ SSP | Δ TR | Δ KU | Δ MS | Δ overall | Reading |
|---|---|---|---|---|---|---|---|---|---|
| K=10 v1 → K=24 v2 + Track 2-lite v2 prompts (Opus 4.6, A1f → A1g) | retrieval-depth + evidence-format | +5.4 | +10.0 | +13.3 | +11.3 | +8.9 | +18.0 | +12.4 | The single biggest jump in the project. ~3.2 pp from K=24 alone, ~9.2 pp from Track 2-lite v2 + LMS-S session-date back-fill. Multi-session breaks the 70 % ceiling for the first time; date back-fill is the silent star. |
| v2 K=24 → v2.1 K=24/48 routed + chronological rider (Gemma-31B, A1h v1 → A1h_v3) | retrieval-depth + temporal-ordering | −3.57 | −1.43 | −6.67 | +1.50 | +1.28 | +5.26 | +1.00 | Net positive on a small open-weight reader. K=48 fixed the multi-session retrieval miss (+5.26 pp = +7 questions); the rider scaffolded TR. SSP paid the cost (the rider tightens answer style and crowds the preference voice). |
| v2 K=24 → v2.1 K=24/48 routed + chronological rider (Opus 4.6, A1g → A1g_K48T) | retrieval-depth + temporal-ordering | +0.00 | +1.43 | −3.33 | −15.79 | −1.28 | +5.26 | −3.00 | Same intervention, opposite outcome on a frontier reader. Identical K=48 MS lift (+5.26 pp), but the rigid 7-step YYYY/MM/DD protocol overrides Opus's free-form temporal arithmetic and TR collapses by 21 questions. The rider is reader-specific. |
| v2 K=24 → v2.1 K=24/48 routed + chronological rider (Qwen-3.6-27B, A1l → A1l_K48T; new) | retrieval-depth + temporal-ordering | +1.79 | −1.43 | −6.67 | −0.75 | −2.56 | +4.51 | +0.20 | The third corner of the natural experiment. Qwen sits in the neutral zone: still gains MS from K=48 (+4.51 pp, slightly less than Gemma/Opus's +5.26 because Qwen's K=24 already saturated MS at 81.20 %), barely moves on TR (the rider neither scaffolds nor collapses), loses SSP and KU. Confirms: K=48-on-MS is reader-agnostic; chronological rider is reader-class-dependent. |
| K=24 default → K=48, MS+TR only (isolated retrieval-depth, averaged over 3 readers) | retrieval-depth | ~0 | ~0 | ~0 | mixed | ~0 | +5.0 avg | ~+1.5 avg | Pure retrieval-depth force, isolated. Gemma +5.26, Opus +5.26, Qwen +4.51 on MS; the variance is small relative to the lift. K=48 recovers chunks 25–48 that scattered MS evidence sits in. Reader-agnostic. |
| Chronological rider, 7-step YYYY/MM/DD protocol (isolated rider force, TR slice) | temporal-ordering + reader-calibration | ~0 | ~0 | small | +1.5 / −0.75 / −15.79 | small | ~0 | reader-dep. | The rider is a scaffold for some readers and a constraint for others. Lift on Gemma-31B (weaker free-form temporal arithmetic), neutral on Qwen-27B (medium intrinsic), collapse on Opus 4.6 (strong free-form gets overridden). Reader-class force, not universal. |
| KU rider tightening, latest-value enforcement (isolated rider, KU slice) | temporal-ordering + evidence-format | ~0 | ~0 | −6.67 to −3.33 | ~0 | +1.28 / −1.28 / −2.56 | ~0 | small | Gemma gains KU +1.28; Opus and Qwen lose. The "always pick latest value" rule helps when KU answers genuinely sit in the latest assistant turn (Gemma's KU was already 94.87 %, ceiling room exists), but fights against readers that already do this implicitly. SSP collateral damage is real and reader-agnostic. |
| max_tokens 768 → 4096 (GPT-5.5 medium, A1k truncation fix) | format / harness | ~0 | ~0 | ~0 | +13.5 | ~0 | ~0 | +4.0 | Pure harness force, not a tuning insight. GPT-5+ models bill reasoning + output combined under max_completion_tokens; 768 truncated 33 TR rows mid-thinking. Lesson: instrument truncation rate per qtype before reading scores. |
| MIX-v1-closed: GPT-5.5 evidence arbiter over 6-candidate champion pool (vs single-reader best: Gemma v2.1, 88.80 %) | arbiter-selection | +3.57 | +4.29 | −13.33 | −1.50 | −3.85 | +3.76 | +0.20 | Net positive but stylistically biased. Arbiter recovered MS (+3.76 by routing to Opus when MS evidence demands it) and lifted SSU/SSA where GPT-5.5 candidates were genuinely strongest. SSP collapsed by 13.33 pp because the arbiter never picked Opus on SSP: a pure style preference for GPT-5.5's compact "Answer: X" format. 81.6 % of all picks went to GPT-5.5 candidates. |
| Reader swap: Opus 4.6 → Qwen-3.6-27B (same v2 K=24 contract) | reader-calibration | −1.79 | +2.86 | −13.33 | +0.75 | +2.56 | +2.26 | +0.60 | 27B Apache-2.0 beats Opus by +0.60 pp at <1/100 the cost. Qwen wins on 4 of 6 qtypes; loses SSP catastrophically (Opus's hedged-but-correct preference voice is structurally hard to replace) and SSA marginally. |
| Reader swap: Opus 4.6 → Gemma-4-31B (same v2 K=24 contract) | reader-calibration | +0.00 | +0.00 | +0.00 | +0.75 | +1.28 | −2.26 | −0.20 | Statistical tie at the qtype level: Gemma and Opus converge on this contract, Gemma trades MS for KU. Confirms the "+12.4 pp from contract upgrade is reader-independent" claim: both readers land in the 87.8–88.0 % band on the same recipe. |
| Reader swap: Opus 4.6 → GPT-5.5 medium (same v2 K=24 contract) | reader-calibration | +0.00 | +2.86 | −6.67 | −4.51 | −1.28 | +0.75 | −1.20 | GPT-5.5 wins SSU/MS, loses TR/SSP. The TR loss is the rider-vs-reader story in another form: GPT-5.5's medium reasoning effort underperforms Opus on multi-event arithmetic without the chronological rider. reasoning_effort=high is the obvious next experiment. |
| Reader swap: Opus 4.6 → Gemma-26B-A4B MoE (same v2 K=24 contract) | reader-calibration | −1.79 | +1.43 | −16.67 | −6.77 | +1.28 | −2.26 | −3.20 | Cheapest viable reader (~4B active). Loses SSP and TR hard; wins SSU/KU. The ~4B active-compute reader cannot replace a 27–31B dense reader on the qtypes that need synthesis (SSP) or multi-step reasoning (TR), but it's a perfectly fine SSU/KU machine on a single 24 GB GPU. |
| Reader swap: small reader → OSS-120B (scale alone, no contract change) | reader-calibration | ~0 | ~0 | ~0 | moderate | ~0 | ~0 | small | Scale alone does not buy accuracy on this benchmark. OSS-120B (cleaned of harness/truncation issues) lands at 73.4 % with the open-max F3-on-TR variant, below Gemma-31B's 87.8 % on the same retrieval contract. The reader-quality variance within a parameter class dwarfs the cross-class variance. |
How to read this table: every row is a controlled comparison. Same 500 questions, same retrieval substrate (or the explicit substrate change is the intervention), same judge. Δ values are signed percentage-point shifts in the per-qtype slice, not absolute scores. The "force class" column tags the dominant mechanism — though most interventions stack two or three forces, the table credits whichever one drives the largest delta.
The 500 LongMemEval-S questions break into 6 question types with very different physics. Some are surface-recall and saturate fast; some are retrieval-bound; some are reader-bound; some are style-bound. Below is what every qtype is sensitive to, what it isn't, and which row holds the current best on the Warrant board. n < 60 qtypes (SSP, SSA) swing harder per question and need to be read in counts, not percentages.
SSU (single-session user): Local user-fact recall. Saturates early, rewards clean evidence and decisive readers.
SSA (single-session assistant): Local assistant-fact recall. The same physics as SSU, but reads prior assistant turns; some readers refuse them.
SSP (single-session preference): Subjective preference recall. The trickiest small-n category. Style-sensitive, terse-reader-hostile, and the canary on every prompt change.
TR (temporal reasoning): Multi-event date arithmetic. The most reader-prompt-coupled qtype on the board.
KU (knowledge update): Latest-value-wins recall. Benefits from clean ordering and a "use the most recent" rule. Hurt by extra K when chunks 25–48 contain stale values.
MS (multi-session): Synthesis across sessions. The most retrieval-depth-sensitive qtype on the board, by a long way.
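The advice to read small-n qtypes in counts rather than percentages can be sketched in a few lines of Python. The per-qtype question counts below are an assumption: they are inferred from deltas quoted on this page (for example, +5.26 pp on MS being +7 questions) and sum to 500, but they are not read from a dataset manifest. The function names are illustrative only.

```python
# Assumed per-qtype question counts for the 500-question set (hypothetical:
# inferred from the deltas quoted on this page, not from a manifest).
QTYPE_N = {"SSA": 56, "SSU": 70, "SSP": 30, "TR": 133, "KU": 78, "MS": 133}


def pp_to_questions(qtype: str, delta_pp: float) -> int:
    """Convert a percentage-point delta on one qtype slice to a question count."""
    return round(delta_pp * QTYPE_N[qtype] / 100)


def overall_pp(deltas_pp: dict[str, float]) -> float:
    """Overall delta in pp implied by per-qtype deltas, via question counts."""
    total = sum(QTYPE_N.values())
    questions = sum(pp_to_questions(q, d) for q, d in deltas_pp.items())
    return 100 * questions / total


# Gemma-31B, v2 K=24 -> v2.1 K=24/48 routed + chronological rider.
gemma_v21 = {"SSA": -3.57, "SSU": -1.43, "SSP": -6.67,
             "TR": 1.50, "KU": 1.28, "MS": 5.26}

counts = {q: pp_to_questions(q, d) for q, d in gemma_v21.items()}
print(counts)                 # per-qtype change in questions
print(overall_pp(gemma_v21))  # implied overall change in pp
```

Under these assumed counts, the −6.67 pp on SSP is only two questions, while the +5.26 pp on MS is seven. With a 5-seed judge protocol the real per-seed counts may be fractional, so rounding to whole questions is itself a simplification.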
We can describe each reader as a point in 6-dimensional qtype space and as a response function: how it deforms when we apply a force (K bump, rider, prompt change). Below is the catalogue of every reader we've measured under the Warrant frozen-retrieval contract, with a one-line geometry summary and the failure mode that defines its envelope.
Gemma-4-31B: The most predictable reader on the board. Track record: best v2.1 contract row, best KU, best TR.
Qwen-3.6-27B: The structural surprise of the project. Beats Opus on v2 K=24 at <1/100 the cost; ties Gemma on v2.1.
Claude Opus 4.6: The SSP champion and the frontier-reader collapse case. Strong free-form reasoning that fights against rigid riders.
GPT-5.5 (medium): Strong on compact-answer qtypes (SSU, MS), weak on TR/SSP. The arbiter's stylistic preferences mirror its reader behavior.
Gemma-26B-A4B (MoE): The cheapest viable reader on the board. Single-24-GB-GPU footprint. Useful third voice in a multi-reader setup.
OSS-120B: The scale-doesn't-help case. 120B params in name; 5.1B active in practice; 73.4 % accuracy in cleaned numbers.
The cross-reader pattern: small/open-weight readers (Gemma-31B, Qwen-3.6-27B, A4B-MoE) form a scaffold-friendly class: rigid prompts help them or, as with Qwen on TR, leave them roughly neutral. Frontier closed-weights (Opus, GPT-5.5) form a scaffold-resistant class: their intrinsic reasoning is strong enough that rigid rider language overrides it and costs accuracy. The clean implication is that SYSTEM_RIDER_TEMPORAL should be reader-conditional (apply on smaller / open-weight readers, skip on frontier closed-weights). Phase 97.2.13 will pick this up.
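A reader-conditional rider could look roughly like the following Python sketch. Only the name SYSTEM_RIDER_TEMPORAL comes from this page; the reader identifiers, the class membership sets, the placeholder rider text, and the `build_system_prompt` function are all hypothetical and stand in for whatever the harness actually uses.

```python
# Reader classes as described above. The identifiers and the membership sets
# are an illustrative encoding, not the project's actual configuration.
SCAFFOLD_FRIENDLY = {"gemma-4-31b", "qwen-3.6-27b", "gemma-26b-a4b"}
SCAFFOLD_RESISTANT = {"claude-opus-4.6", "gpt-5.5"}

# Placeholder only: the real rider wording is not reproduced on this page.
SYSTEM_RIDER_TEMPORAL = "<7-step YYYY/MM/DD chronological protocol>"


def build_system_prompt(base_prompt: str, reader: str) -> str:
    """Append the temporal rider only for scaffold-friendly readers."""
    key = reader.lower()
    if key in SCAFFOLD_FRIENDLY:
        return f"{base_prompt}\n\n{SYSTEM_RIDER_TEMPORAL}"
    if key in SCAFFOLD_RESISTANT:
        return base_prompt
    # Unknown readers fail loudly so a new reader is classified on purpose
    # instead of silently inheriting one behavior.
    raise ValueError(f"reader {reader!r} has no scaffold class assigned")


print(build_system_prompt("Answer from the evidence.", "Gemma-4-31B"))
print(build_system_prompt("Answer from the evidence.", "Claude-Opus-4.6"))
```

Raising on an unclassified reader is a design choice of this sketch: it forces a measurement before a new reader gets a default rider policy.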
Treat each (reader, contract) pair as a point in 6-dimensional category space: R = [SSA, SSU, SSP, TR, KU, MS]. An intervention is then a vector Δ = R_after − R_before. The same intervention applied to different readers gives different vectors, and the structure of those vectors tells us which forces are universal and which are reader-conditional. Below is the K=48 transfer triangle — identical intervention (v2 K=24 → v2.1 K=24/48 + chronological rider + KU rider tightening), three readers, three different responses.
| Reader | Δ SSA | Δ SSU | Δ SSP | Δ TR | Δ KU | Δ MS | Δ overall | Net classification |
|---|---|---|---|---|---|---|---|---|
| Gemma-4-31B (A1h v1 → A1h_v3) | −3.57 | −1.43 | −6.67 | +1.50 | +1.28 | +5.26 | +1.00 | scaffold-friendly: rider lifts TR, K=48 lifts MS, costs offset |
| Claude Opus 4.6 (A1g → A1g_K48T) | +0.00 | +1.43 | −3.33 | −15.79 | −1.28 | +5.26 | −3.00 | scaffold-resistant: rider collapses TR, costs swamp K=48 gain |
| Qwen-3.6-27B (A1l → A1l_K48T; new) | +1.79 | −1.43 | −6.67 | −0.75 | −2.56 | +4.51 | +0.20 | neutral: rider neither helps nor hurts; partial K=48 saturation |
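The transfer triangle can be inspected mechanically. The sketch below, in plain Python, treats each table row as a vector in the order [SSA, SSU, SSP, TR, KU, MS], finds the components whose sign agrees across all three readers, and labels each reader by the sign of its TR component. The ±1.0 pp threshold and the function names are assumptions made for illustration; they are not part of the Warrant protocol.

```python
QTYPES = ["SSA", "SSU", "SSP", "TR", "KU", "MS"]

# Per-qtype deltas in pp, copied from the transfer-triangle table.
DELTAS = {
    "Gemma-4-31B":     [-3.57, -1.43, -6.67,   1.50,  1.28, 5.26],
    "Claude Opus 4.6": [ 0.00,  1.43, -3.33, -15.79, -1.28, 5.26],
    "Qwen-3.6-27B":    [ 1.79, -1.43, -6.67,  -0.75, -2.56, 4.51],
}


def sign(x: float) -> int:
    """Return -1, 0 or +1."""
    return (x > 0) - (x < 0)


def reader_agnostic_components(deltas: dict[str, list[float]]) -> dict[str, str]:
    """Qtypes whose delta has the same nonzero sign for every reader."""
    agnostic = {}
    for i, qtype in enumerate(QTYPES):
        signs = {sign(vec[i]) for vec in deltas.values()}
        if signs == {1}:
            agnostic[qtype] = "+"
        elif signs == {-1}:
            agnostic[qtype] = "-"
    return agnostic


def classify_by_tr(delta: list[float], threshold_pp: float = 1.0) -> str:
    """Label a reader by its TR response; the threshold is a hypothetical choice."""
    tr = delta[QTYPES.index("TR")]
    if tr > threshold_pp:
        return "scaffold-friendly"
    if tr < -threshold_pp:
        return "scaffold-resistant"
    return "neutral"


print(reader_agnostic_components(DELTAS))
for reader, vec in DELTAS.items():
    print(reader, classify_by_tr(vec))
```

On these three rows only SSP (always negative) and MS (always positive) keep one sign across readers, which matches the reading that the K=48 lift on MS and the SSP cost are reader-agnostic while TR and KU are not. Three readers is a small sample, so the sign test is a descriptive check and not a statistical one.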
What the structure says:
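The compact statement: under the v2.1 contract, the K=48 retrieval-depth vector and the chronological/KU rider vector partially commute (both apply additively to the total), but the rider vector has a reader-dependent sign on TR/KU. The clean factorization is Δ_v2.1(reader) ≈ F_{K=48} + F_rider(reader), where F_{K=48} is the reader-agnostic retrieval-depth force and F_rider(reader) is the reader-dependent rider force.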
What we still want to measure: the K=48 retrieval-depth force in isolation, by running each reader at K=48 with v2 prompts (no riders). That tells us the exact size of F_{K=48} and lets us subtract it cleanly from the v2.1 net to recover F_rider(reader). Pre-registered as Option A in the next-experiment matrix below.
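The subtraction described above is a component-wise vector difference. The Python sketch below shows the arithmetic with a stand-in for the retrieval-depth force: it assumes K=48 acts only on MS and contributes exactly the reader's own MS lift, which is a placeholder until the isolated K=48 run is measured. The function name and the stand-in vector are hypothetical.

```python
QTYPES = ["SSA", "SSU", "SSP", "TR", "KU", "MS"]


def recover_rider_force(net: list[float], f_k48: list[float]) -> list[float]:
    """Rider force = v2.1 net delta minus the retrieval-depth force, per qtype."""
    return [round(n - k, 2) for n, k in zip(net, f_k48)]


# Gemma-31B v2 -> v2.1 net delta in pp, from the transfer-triangle table.
gemma_net = [-3.57, -1.43, -6.67, 1.50, 1.28, 5.26]

# ASSUMPTION: placeholder retrieval-depth force, nonzero on MS only. The real
# vector, including its TR component, has not been measured in isolation.
f_k48_assumed = [0.0, 0.0, 0.0, 0.0, 0.0, 5.26]

gemma_rider = recover_rider_force(gemma_net, f_k48_assumed)
print(dict(zip(QTYPES, gemma_rider)))
```

With a measured retrieval-depth vector in place of the placeholder, the same call would give the rider force for each reader, and differences between readers would then be attributable to the riders alone.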
Contract drift: when an intervention intended to fix one qtype changes the evidence or prompt distribution enough to degrade another qtype. Every row of the leaderboard sits inside an implicit contract — retrieval depth K, candidate session ordering, system prompt, riders, judging protocol — and that contract is what makes scores comparable. Drift is what happens when one of those components changes silently while you compare to a row that no longer matches.
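One way to make the implicit contract explicit is to record its components per row and diff them before comparing scores. The Python sketch below is a minimal illustration: the `Contract` fields mirror the components listed above, but the class, the field values, and the `undeclared_drift` helper are all hypothetical and not taken from the Warrant codebase.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Contract:
    """The components a leaderboard row implicitly depends on (illustrative)."""
    retrieval_k: str
    session_ordering: str
    system_prompt: str
    riders: tuple[str, ...]
    judge_protocol: str


def contract_diff(a: Contract, b: Contract) -> set[str]:
    """Names of the contract components that differ between two rows."""
    return {f.name for f in fields(Contract)
            if getattr(a, f.name) != getattr(b, f.name)}


def undeclared_drift(a: Contract, b: Contract, declared: set[str]) -> set[str]:
    """Components that changed without being part of the declared intervention."""
    return contract_diff(a, b) - declared


# Hypothetical field values, for illustration only.
v2 = Contract("24", "as-retrieved", "track2-lite-v2", (), "gpt-4o-5seed")
v21 = Contract("24/48 routed", "as-retrieved", "track2-lite-v2",
               ("chronological", "ku-latest-value"), "gpt-4o-5seed")

print(contract_diff(v2, v21))
print(undeclared_drift(v2, v21, declared={"retrieval_k", "riders"}))
```

An empty result from `undeclared_drift` means every difference between the two rows is part of the stated intervention; anything else is a silent change that makes the comparison suspect.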
Documented drift cases from the Phase 97 series (each one is a delta we actually measured, not a hypothetical):
K=24/48 routing: K=48 fixes Multi-Session retrieval; the same K=48 inflates the noise floor for tasks that already had their answer in the top 24.
Chronological rider: Helped Gemma on TR; broke Opus on TR; neutral on Qwen. Same prompt, three different worlds.
KU rider tightening: Wanted "always pick the latest value"; got stylistic tightening that hurts SSP across all readers.
MIX-v1-closed arbiter: Recovered MS to 91.5 %; destroyed SSP via self-preference style bias.
max_tokens 768 on GPT-5.5 medium: Looked like a 4 pp reader regression; was a truncation budget.
OSS-120B: Massive truncation rates from token-budget and reasoning-mode mismatches; cleaned score still under Gemma-31B.
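The truncation cases above suggest measuring the truncation rate per qtype before reading any score. A minimal Python sketch follows. It assumes each result row is a dict with a `qtype` key and an OpenAI-style `finish_reason` key whose value is `"length"` when the completion hit the token limit; the row shape, the function names, and the 2 % alert threshold are hypothetical.

```python
from collections import Counter


def truncation_rate_by_qtype(rows: list[dict]) -> dict[str, float]:
    """Fraction of rows per qtype whose completion stopped at the token limit."""
    total: Counter = Counter()
    truncated: Counter = Counter()
    for row in rows:
        total[row["qtype"]] += 1
        if row["finish_reason"] == "length":
            truncated[row["qtype"]] += 1
    return {qtype: truncated[qtype] / n for qtype, n in total.items()}


def flag_truncated_qtypes(rates: dict[str, float], max_rate: float = 0.02) -> list[str]:
    """Qtypes whose truncation rate exceeds a (hypothetical) alert threshold."""
    return sorted(q for q, r in rates.items() if r > max_rate)


# Tiny made-up sample, only to show the shape of the input and output.
sample_rows = [
    {"qtype": "TR", "finish_reason": "length"},
    {"qtype": "TR", "finish_reason": "stop"},
    {"qtype": "TR", "finish_reason": "stop"},
    {"qtype": "TR", "finish_reason": "stop"},
    {"qtype": "SSU", "finish_reason": "stop"},
]

rates = truncation_rate_by_qtype(sample_rows)
print(rates)
print(flag_truncated_qtypes(rates))
```

A flagged qtype means its score is at least partly a harness artifact, so the row should be re-run with a larger budget before the delta is interpreted as a reader effect.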
Safety rules we now enforce on every Phase 97+ row before it lands on the leaderboard:
Each of these is a compressed restatement of a measurement we made at least twice. They are intended as defaults, not absolutes — the whole point of the magnetic-field model is that any law can be locally violated by a sufficiently exotic reader/contract pair. But they hold across every Phase 97 row we have run so far.
Each row is a single, factor-isolated experiment. The "isolates" column is the force we expect to cleanly measure if it runs to a clean result. The "expected band" is our prior; "informative if outside" is what makes the row worth running even if our prior is wrong.
| Experiment | Spec | Isolates | Expected band | Informative if outside |
|---|---|---|---|---|
| A · K=48 + v2 prompts | Gemma + Qwen, K=48 default (no per-qtype routing), no chronological rider, no KU rider tightening | retrieval-depth alone | 87.8 – 88.5 % (Gemma); 88.4 – 88.9 % (Qwen) | >88.8 %: K=48 alone is enough; rider was redundant. <87.6 %: rider was load-bearing on Gemma's net +1.00 pp. |
| B · K=48 + chronological only | Gemma + Qwen + Opus, K=48 default, chronological rider, no KU rider tightening | temporal-ordering alone | Gemma 88.5 ± 0.5 %; Opus collapses on TR; Qwen 88.5 ± 0.5 % | If Opus does not collapse on TR: the KU rider was the real source of TR damage, not the chronological one. |
| C · K=64 contract | Gemma + Qwen, K=64 default or K=24/64 routed; requires re-running the canonical retrieval pipeline at K=64 (BGE+QNDN+BM25 → RRF → MixK) | retrieval-depth monotonicity past K=48 | +0 to +1.0 pp net over K=48; MS keeps lifting, SSU/SSA/KU may regress from distractors | If MS continues lifting without SSU/KU regression: distractor cost is sub-linear in K. If MS plateaus: K=48 is the practical retrieval ceiling on this substrate. |
| D · MIX-v1-closed-noGPT | Same arbiter, same evidence panel; drop GPT-5.5 candidates from the pool entirely | arbiter-selection bias source | ~88.0 – 88.6 % overall; SSP recovers, MS partially regresses | If overall improves: GPT-5.5 candidates were a net drag. If SSP doesn't recover: the bias was in the arbiter prompt, not the candidate pool. |
| E · MIX-v2-style-balanced | Same pool as v1-closed; arbiter prompt rewritten with explicit anti-style-bias language and per-candidate evidence verification | arbiter-selection bias magnitude | 89.0 – 90.0 %; GPT-5.5 pick share drops from 81.6 % toward ~50 %; SSP recovers | If pick share doesn't move: style bias is structural, not promptable. If pick share moves but SSP doesn't recover: the wrong candidates were in the pool to begin with. |
| F · MIX-v2-with-Qwen | Add Qwen-3.6-27B (88.60 % v2 K=24 + 88.80 % v2.1) as a 7th candidate | arbiter × reader diversity | 89.2 – 90.0 %; gains concentrated in SSA/SSP where Qwen is strong | If Qwen pick share is meaningful (>15 %): the pool was narrow, not the arbiter. If pick share <5 %: arbiter bias dominates pool diversity. |
| G · SSP-rich rider stripped | Best v2.1 reader (Gemma or Qwen) with the KU rider removed for the whole run | evidence-format source of SSP loss | SSP recovers ~+5 pp; KU regresses ~−1 to −2 pp; net flat or slight positive | If SSP doesn't recover: the chronological rider, not the KU rider, was the SSP killer. |
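Pre-registered bands are easy to check automatically once a run finishes. The Python sketch below encodes a few of the overall-accuracy bands from the table and reports where an observed score falls. The `Band` class and the dictionary keys are hypothetical, and the observed value in the example is made up to show the call, not a measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A pre-registered expected range for overall accuracy, in percent."""
    low: float
    high: float

    def verdict(self, observed: float) -> str:
        if observed < self.low:
            return "below band"
        if observed > self.high:
            return "above band"
        return "inside band"


# Overall-accuracy priors copied from the experiment matrix.
PRIORS = {
    ("A", "Gemma"): Band(87.8, 88.5),
    ("A", "Qwen"): Band(88.4, 88.9),
    ("D", "overall"): Band(88.0, 88.6),
    ("E", "overall"): Band(89.0, 90.0),
    ("F", "overall"): Band(89.2, 90.0),
}

# Made-up observation, for illustration only.
print(PRIORS[("A", "Gemma")].verdict(88.9))
```

A verdict outside the band is only a coarse trigger. The "informative if outside" column uses its own thresholds for some rows (for example >88.8 % and <87.6 % for experiment A), so those still need to be checked by hand or encoded separately.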
Priority ordering (as of this writing):
Closing statement: