Tuning theory · Phase 97.2 series · LongMemEval-S 500q

Field Notes.

A tuning bible for the Warrant memory engine.

Warrant is not just a leaderboard. It is an instrumented memory engine where retrieval depth, prompt contracts, reader geometry, and arbiter selection create measurable forces over the task manifold. Every K bump, every rider, every reader swap, every arbiter is a vector in ℝ⁶, one component per question type. Some forces lift one category and break another. Some are reader-agnostic. Some are reader-specific.

This page is the explanatory layer behind the benchmark: what each intervention actually did, why it did it, and what to do next. The numbers come from the same 500-question LongMemEval-S frozen-retrieval substrate the leaderboard uses, with the 5-seed GPT-4o judge protocol applied uniformly.

Every change is a force. Every force has a signature.

For each major intervention we ran, the table below lists the observed per-qtype delta in percentage points and the force class doing the work. Numbers are ground-truth measured deltas, not estimates — same 500 LongMemEval-S questions, same 5-seed GPT-4o judge protocol on both sides of every comparison. Δ > 0 means the intervention helped that qtype; Δ < 0 means it hurt.

Force classes: retrieval-depth · temporal-ordering · evidence-format · reader-calibration · arbiter-selection · distractor / noise · format / harness

| Intervention | Force class | Δ SSA | Δ SSU | Δ SSP | Δ TR | Δ KU | Δ MS | Δ overall | Reading |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| K=10 v1 → K=24 v2 + Track 2-lite v2 prompts; Opus 4.6, A1f → A1g | retrieval-depth + evidence-format | +5.4 | +10.0 | +13.3 | +11.3 | +8.9 | +18.0 | +12.4 | The single biggest jump in the project. ~3.2 pp from K=24 alone, ~9.2 pp from Track 2-lite v2 + LMS-S session-date back-fill. Multi-session breaks the 70 % ceiling for the first time; date back-fill is the silent star. |
| v2 K=24 → v2.1 K=24/48 routed + chronological rider; Gemma-31B, A1h v1 → A1h_v3 | retrieval-depth + temporal-ordering | −3.57 | −1.43 | −6.67 | +1.50 | +1.28 | +5.26 | +1.00 | Net positive on a small open-weight reader. K=48 fixed the multi-session retrieval miss (+5.26 pp = +7 questions); the rider scaffolded TR. SSP paid the cost (the rider tightens answer style and crowds the preference voice). |
| v2 K=24 → v2.1 K=24/48 routed + chronological rider; Opus 4.6, A1g → A1g_K48T | retrieval-depth + temporal-ordering | +0.00 | +1.43 | −3.33 | −15.79 | −1.28 | +5.26 | −3.00 | Same intervention, opposite outcome on a frontier reader. Identical K=48 MS lift (+5.26 pp), but the rigid 7-step YYYY/MM/DD protocol overrides Opus's free-form temporal arithmetic and TR collapses by 21 questions. The rider is reader-specific. |
| v2 K=24 → v2.1 K=24/48 routed + chronological rider; Qwen-3.6-27B, A1l → A1l_K48T · new | retrieval-depth + temporal-ordering | +1.79 | −1.43 | −6.67 | −0.75 | −2.56 | +4.51 | +0.20 | The third corner of the natural experiment. Qwen sits in the neutral zone: still gains MS from K=48 (+4.51 pp, slightly less than Gemma/Opus's +5.26 because Qwen's K=24 already saturated MS at 81.20 %), barely moves on TR (the rider neither scaffolds nor collapses), loses SSP and KU. Confirms: K=48-on-MS is reader-agnostic; chronological rider is reader-class-dependent. |
| K=24 default → K=48 (MS+TR only); isolated retrieval-depth, averaged over 3 readers | retrieval-depth | ~0 | ~0 | ~0 | mixed | ~0 | +5.0 avg | ~+1.5 avg | Pure retrieval-depth force, isolated. Gemma +5.26, Opus +5.26, Qwen +4.51 on MS; the variance is small relative to the lift. K=48 recovers chunks 25–48 that scattered MS evidence sits in. Reader-agnostic. |
| Chronological rider (7-step YYYY/MM/DD protocol); isolated rider force, TR slice | temporal-ordering + reader-calibration | ~0 | ~0 | small | +1.5 / −0.75 / −15.79 | small | ~0 | reader-dep. | The rider is a scaffold for some readers and a constraint for others. Lift on Gemma-31B (weaker free-form temporal arithmetic), neutral on Qwen-27B (medium intrinsic), collapse on Opus 4.6 (strong free-form gets overridden). Reader-class force, not universal. |
| KU rider tightening (latest-value enforcement); isolated rider, KU slice | temporal-ordering + evidence-format | ~0 | ~0 | −6.67 to −3.33 | ~0 | +1.28 / −1.28 / −2.56 | ~0 | small | Gemma gains KU +1.28; Opus and Qwen lose. The "always pick latest value" rule helps when KU answers genuinely sit in the latest assistant turn (Gemma's KU was already 94.87 %, ceiling room exists), but fights against readers that already do this implicitly. SSP collateral damage is real and reader-agnostic. |
| max_tokens 768 → 4096; GPT-5.5 medium, A1k truncation fix | format / harness | ~0 | ~0 | ~0 | +13.5 | ~0 | ~0 | +4.0 | Pure harness force, not a tuning insight. GPT-5+ models bill reasoning + output combined under max_completion_tokens; 768 truncated 33 TR rows mid-thinking. Lesson: instrument truncation rate per qtype before reading scores. |
| MIX-v1-closed: GPT-5.5 evidence arbiter over 6-candidate champion pool; vs single-reader best (Gemma v2.1, 88.80 %) | arbiter-selection | +3.57 | +4.29 | −13.33 | −1.50 | −3.85 | +3.76 | +0.20 | Net positive but stylistically biased. Arbiter recovered MS (+3.76 by routing to Opus when MS evidence demands it) and lifted SSU/SSA where GPT-5.5 candidates were genuinely strongest. SSP collapsed by 13.33 pp because the arbiter never picked Opus on SSP, a pure style preference for GPT-5.5's compact "Answer: X" format. 81.6 % of all picks went to GPT-5.5 candidates. |
| Reader swap: Opus 4.6 → Qwen-3.6-27B; same v2 K=24 contract | reader-calibration | −1.79 | +2.86 | −13.33 | +0.75 | +2.56 | +2.26 | +0.60 | 27B Apache-2.0 beats Opus by +0.60 pp at <1/100 the cost. Qwen wins on 4 of 6 qtypes; loses SSP catastrophically (Opus's hedged-but-correct preference voice is structurally hard to replace) and SSA marginally. |
| Reader swap: Opus 4.6 → Gemma-4-31B; same v2 K=24 contract | reader-calibration | +0.00 | +0.00 | +0.00 | +0.75 | +1.28 | −2.26 | −0.20 | Statistical tie at the qtype level: Gemma and Opus converge on this contract; Gemma trades MS for KU. Confirms the "+12.4 pp from contract upgrade is reader-independent" claim: both readers land in the 87.8–88.0 % band on the same recipe. |
| Reader swap: Opus 4.6 → GPT-5.5 medium; same v2 K=24 contract | reader-calibration | +0.00 | +2.86 | −6.67 | −4.51 | −1.28 | +0.75 | −1.20 | GPT-5.5 wins SSU/MS, loses TR/SSP. The TR loss is the rider-vs-reader story in another form: GPT-5.5's medium reasoning effort underperforms Opus on multi-event arithmetic without the chronological rider. reasoning_effort=high is the obvious next experiment. |
| Reader swap: Opus 4.6 → Gemma-26B-A4B (MoE); same v2 K=24 contract | reader-calibration | −1.79 | +1.43 | −16.67 | −6.77 | +1.28 | −2.26 | −3.20 | Cheapest viable reader (~4B active). Loses SSP and TR hard; wins SSU/KU. The ~4B active-compute reader cannot replace a 27–31B dense reader on the qtypes that need synthesis (SSP) or multi-step reasoning (TR), but it's a perfectly fine SSU/KU machine on a single 24 GB GPU. |
| Reader swap: small reader → OSS-120B; scale alone, no contract change | reader-calibration | ~0 | ~0 | ~0 | moderate | ~0 | ~0 | small | Scale alone does not buy accuracy on this benchmark. OSS-120B (cleaned of harness/truncation issues) lands at 73.4 % with the open-max F3-on-TR variant, below Gemma-31B's 87.8 % on the same retrieval contract. The reader-quality variance within a parameter class dwarfs the cross-class variance. |

How to read this table: every row is a controlled comparison. Same 500 questions, same retrieval substrate (or the explicit substrate change is the intervention), same judge. Δ values are signed percentage-point shifts in the per-qtype slice, not absolute scores. The "force class" column tags the dominant mechanism — though most interventions stack two or three forces, the table credits whichever one drives the largest delta.
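Because the qtype slices have different sizes, per-qtype deltas cannot simply be added to get the overall delta; they have to be converted to question counts first. The sketch below does that conversion for the Gemma v2 → v2.1 row. The function names are illustrative, not part of the Warrant harness; the slice sizes are the n values listed in the qtype section of this page.

```python
# Per-qtype question counts for LongMemEval-S (n values from the qtype section).
QTYPE_N = {"SSA": 56, "SSU": 70, "SSP": 30, "TR": 133, "KU": 78, "MS": 133}


def delta_to_questions(delta_pp: dict) -> dict:
    """Convert per-qtype deltas in percentage points to whole-question swings."""
    return {q: round(d * QTYPE_N[q] / 100) for q, d in delta_pp.items()}


def overall_delta_pp(delta_pp: dict) -> float:
    """Overall delta implied by the per-qtype deltas, weighted by slice size."""
    total = sum(QTYPE_N.values())  # 500 questions
    return 100 * sum(delta_to_questions(delta_pp).values()) / total


# Gemma-31B row (A1h v1 -> A1h_v3), copied from the table.
gemma_v21 = {"SSA": -3.57, "SSU": -1.43, "SSP": -6.67,
             "TR": 1.50, "KU": 1.28, "MS": 5.26}

print(delta_to_questions(gemma_v21))  # question swings per qtype
print(overall_delta_pp(gemma_v21))    # net delta in pp over all 500 questions
```

Run on the Gemma row, the question swings are −2 SSA, −1 SSU, −2 SSP, +2 TR, +1 KU, +7 MS, which nets to +5 questions, i.e. the reported +1.00 pp.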

Each qtype is a different material. Each one bends differently under load.

The 500 LongMemEval-S questions break into 6 question types with very different physics. Some are surface-recall and saturate fast; some are retrieval-bound; some are reader-bound; some are style-bound. Below is what every qtype is sensitive to, what it isn't, and which row holds the current best on the Warrant board. n < 60 qtypes (SSP, SSA) swing harder per question and need to be read in counts, not percentages.

SSU · single-session-user (n=70)

Local user-fact recall. Saturates early, rewards clean evidence and decisive readers.

  • Sensitive to: evidence cleanliness, the LMS-S session-date back-fill (date-anchored questions need the date), reader decisiveness on borderline phrasing.
  • Not sensitive to: K depth (K=24 already saturates at ~95–99 %), the chronological rider, multi-session synthesis tooling.
  • Failure mode: a reader that hedges instead of committing on a single mention — refusal-rate is the leading indicator.
  • Counts: 1–3 question swings between top readers. The 70 / 56 small-n caveat applies less here than on SSP/SSA because most readers are above 95 %.
Best: Qwen-3.6-27B v2 K=24 = 98.57 % · tied with GPT-5.5 v2 K=24 · MIX-v1-closed 98.57 %

SSA · single-session-assistant (n=56)

Local assistant-fact recall. The same physics as SSU, but reads prior assistant turns; some readers refuse them.

  • Sensitive to: a reader's willingness to treat assistant turns as canonical evidence (some refuse on principle), short-evidence prompt scaffolds.
  • Not sensitive to: K depth, retrieval breadth.
  • Failure mode: reader treats assistant content as "speculative" and refuses or hedges. Gemma family does this less than Opus does on a few rows.
  • Counts: a 1-question swing on n=56 is 1.79 pp. Read deltas in questions.
Best: GPT-5.5 v1 K=10 = 100 % (56/56) · A1g/A1g_K48T Opus v2/v2.1 = 96.43 %

SSP · single-session-preference (n=30)

Subjective preference recall. The trickiest small-n category. Style-sensitive, terse-reader-hostile, and the canary on every prompt change.

  • Sensitive to: answer richness (terse "Answer: X" formats lose; hedged-but-correct prose wins), arbiter style preference, the chronological/KU riders (which tighten output style and crowd the preference voice).
  • Not sensitive to: K depth (K=24 sufficient), retrieval contract changes that don't touch prompts.
  • Failure mode: a reader trained for compactness ranks a preference question as "list candidates and pick one" instead of "explain the user's preference". GPT-5.5 falls into this; Opus avoids it.
  • The MIX-v1-closed lesson: the GPT-5.5 arbiter never picked Opus on SSP over 30 questions, despite Opus being the SSP champion at 86.67 %. Pure style preference, not evidence preference.
  • Counts: 1 question on n=30 = 3.33 pp. Almost any rider change shows up here as noise.
Best: Claude Opus 4.6 v2 K=24 = 86.67 % (26/30) · same on Gemma-31B v2 K=24

TR · temporal-reasoning (n=133)

Multi-event date arithmetic. The most reader-prompt-coupled qtype on the board.

  • Sensitive to: the chronological rider conditioned on reader class; QUESTION_DATE back-fill (the date-anchored questions need the date in the prompt header); reader's intrinsic temporal-arithmetic skill.
  • Not very sensitive to: K depth above ~24 (TR's remaining errors are mostly reader-bound, not retrieval-bound — A1h_v3 saw TR move only +1.50 pp from the K=24 → K=48 lift, much smaller than MS's +5.26).
  • Failure mode A (open-weight): struggles to enumerate events without a scaffold. Gemma-31B is weaker at free-form TR and gains +1.50 pp (2 questions) from the chronological rider.
  • Failure mode B (frontier): rigid rider overrides good free-form reasoning. Opus's TR collapsed by 15.79 pp (21 questions) under the v2.1 rider, ending at 94/133 = 70.7 %; the rider fights Opus's intrinsic skill.
  • Failure mode C (harness): max_tokens=768 truncates GPT-5.5's reasoning mid-thinking on TR; +13.5 pp recovery at max_tokens=4096.
Best: Gemma-4-31B v2.1 K=24/48 = 88.72 % (118/133) · Qwen-3.6-27B v2 K=24 / v2.1 = 87.22 / 86.47 %

KU · knowledge-update (n=78)

Latest-value-wins recall. Benefits from clean ordering and a "use the most recent" rule. Hurt by extra K when chunks 25–48 contain stale values.

  • Sensitive to: the latest-value rider, chronological ordering of evidence, retrieval that doesn't pull obsolete values to the top.
  • Not sensitive to: SSU-class evidence, the chronological rider's date arithmetic (KU is "newest wins", not "compute date").
  • Failure mode A: K=48 pulls in stale values that confuse a reader without the latest-value rule. Qwen lost 2.56 pp (2 questions) on KU under v2.1, consistent with extra K=48 evidence adding stale distractors, although v2.1 fuses the K change with the rider package, so the split is not yet isolated.
  • Failure mode B: a reader that already does latest-value implicitly is fought by the explicit rider — Opus −1.28 pp on KU under v2.1.
Best: Gemma-4-31B v2.1 K=24/48 = 96.15 % (75/78) · Qwen-3.6-27B v2 K=24 ties at 96.15 %

MS · multi-session (n=133)

Synthesis across sessions. The most retrieval-depth-sensitive qtype on the board, by a long way.

  • Sensitive to: K depth (the only qtype where K=24 → K=48 reliably moves the needle), session-diversity in the retrieved chunks, reader's long-context synthesis skill (Opus & Qwen are stronger here than Gemma-31B).
  • The K=48 transfer triangle: identical +5.26 pp lift on Gemma and Opus, +4.51 pp on Qwen (which was already partially saturated on K=24). Reader-agnostic retrieval-depth force.
  • Reader cap before retrieval cap: Qwen-3.6-27B reaches MS = 81.20 % on uniform K=24 (no routing), almost matching Gemma's K=48-routed 81.95 % — a structural long-evidence advantage from Qwen's 256K-context architecture.
  • Failure mode: 7 of 31 v1 multi-session misses on Gemma A1h v1 were retrieval-bound to chunks 25–48. The rest are reader-bound.
Best: MIX-v1-closed = 85.71 % (114/133) · Opus 4.6 v2.1 K=24/48 = 84.21 % · Qwen v2.1 K=24/48 = 85.71 %

Each reader is a different material. Same field, different shape.

We can describe each reader as a point in 6-dimensional qtype space and as a response function: how it deforms when we apply a force (K bump, rider, prompt change). Below is the catalogue of every reader we've measured under the Warrant frozen-retrieval contract, with a one-line geometry summary and the failure mode that defines its envelope.

Gemma-4-31B-IT · 31.3 B dense · novita://google/gemma-4-31b-it

The most predictable reader on the board. Track records: best v2.1 contract row, best KU, best TR.

  • Strengths: TR with the chronological rider (88.72 % — the rider scaffolds Gemma's weaker free-form temporal arithmetic), KU with the latest-value rider, predictable response to retrieval-depth interventions.
  • Weaknesses: SSP variance (the v2.1 rider crowds the preference voice; SSP −6.67 pp under v2.1), MS reader-cap (76.69 % on uniform K=24 — only a K=48 lift recovers it).
  • Force response: +1.00 pp overall on v2 → v2.1; absorbs both the K=48 retrieval force and the chronological rider productively.
  • Cost: ~$0.71 per 500 questions on Novita ($0.10/$0.20 per million tokens). 1/66 the per-row cost of Opus.
Best row: A1h_v3 v2.1 K=24/48 = 88.80 % · cross-contract Warrant single-reader best

Qwen-3.6-27B · 27 B dense · Apache-2.0 · novita://qwen/qwen3.6-27b

The structural surprise of the project. Beats Opus on v2 K=24 at <1/100 the cost; ties Gemma on v2.1.

  • Strengths: SSU (98.57 %, board-leading), MS without routing (81.20 % on uniform K=24 — nearly matches Gemma's K=48-routed 81.95 %), KU (96.15 %, ties Gemma v2.1).
  • Weaknesses: SSP (73.33 %, 13.34 pp behind Opus; the same style-bias gap GPT-5.5 has, only larger, since GPT-5.5 trails Opus by 6.67 pp).
  • Force response: partially saturated on K=24. v2 → v2.1 nets +0.20 pp because Qwen already extracts much of the MS K=48 evidence at K=24, and the chronological rider neither scaffolds nor collapses it (TR −0.75).
  • Architecture note: 256K-context training appears to give Qwen a genuine long-evidence synthesis edge. This is the first reader on the board where MS approaches the reader-cap before the retrieval-cap.
Best row: v2 K=24 = 88.60 % · v2.1 K=24/48 = 88.80 %

Claude Opus 4.6 · proprietary · anthropic://claude-opus-4-6

The SSP champion and the frontier-reader collapse case. Strong free-form reasoning that fights against rigid riders.

  • Strengths: SSP (86.67 %, 13.34 pp ahead of Qwen-3.6-27B, though Gemma-31B ties it on v2 K=24), SSA (96.43 %), MS with K=48 (84.21 %), refusal calibration on borderline questions.
  • Weaknesses: TR under the v2.1 chronological rider — collapses 15.79 pp because the rigid 7-step protocol overrides Opus's natural multi-event arithmetic. The rider fights the reader.
  • Force response: v2 → v2.1 nets −3.00 pp (same K=48 MS gain as Gemma, but TR collapse swamps it). The cleanest documented "reader-prompt incompatibility" on the board.
  • Cost: ~$47 per 500 questions. 66x more expensive than Gemma. Now beaten by Qwen-3.6-27B on v2 K=24.
Best row: v2 K=24 = 88.00 % · v2.1 K=24/48 documented negative result at 85.00 %

GPT-5.5 medium · closed-weights · openai://gpt-5.5 (reasoning_effort=medium)

Strong on compact-answer qtypes (SSU, MS), weak on TR/SSP. The arbiter's stylistic preferences mirror its reader behavior.

  • Strengths: SSU (98.57 %, ties for board lead), MS (79.70 % on v2 K=24 — first non-Claude reader to clear 79 % on MS), 0 % refusal rate.
  • Weaknesses: TR (81.95 % on v2 K=24 — medium reasoning effort underperforms Opus by 4.51 pp on multi-event arithmetic), SSP (80 %, style preference for terse answers).
  • Force response (as reader): max_tokens matters more than usual — bills reasoning + output combined under max_completion_tokens; truncated 33 TR rows at 768 tokens, gained +13.5 pp on TR after raising to 4096.
  • Force response (as arbiter): when given 6 candidate answers including its own, picks itself 81.6 % of the time. Style preference, not evidence preference.
Best row: v2 K=24 = 86.80 % · reasoning_effort=high not yet tested

Gemma-4-26B-A4B (MoE) · 26 B / ~4 B active · novita://google/gemma-4-26b-a4b-it

The cheapest viable reader on the board. Single-24-GB-GPU footprint. Useful third voice in a multi-reader setup.

  • Strengths: SSU (97.14 %, board-grade), KU (94.87 %, ties open-weight peers), commodity hardware deployability.
  • Weaknesses: SSP (70 %, the worst on the board), TR (79.70 %), MS (76.69 %, tied with Gemma-31B's v2 number). The synthesis-heavy qtypes need more than 4 B active params.
  • Force response: behaves like Gemma-31B's geometry but with thinner reasoning — same direction, smaller magnitude.
  • Use case: SSU/KU machine, third voice in MIX, baseline for "can a 4B-active reader hold its own?" experiments.
Best row: v2 K=24 = 84.80 %

OSS-120B · openai/gpt-oss-120b @ high · 117 B total / 5.1 B active (MXFP4)

The scale-doesn't-help case. 120B params in name; 5.1B active in practice; 73.4 % accuracy in cleaned numbers.

  • Strengths: with the F3-on-TR variant, lifts TR by +5.2 pp on slice (84/133 → 91/133). First measured open-weight reader to clear Gemma-4 Hybrid Efficient by ≥3 pp on the LME-S 500-q frozen retrieval contract.
  • Weaknesses: parameter count is misleading (only 5.1 B parameters are active per token because the model is a sparse MoE; MXFP4 is the weight quantization format, not the source of the sparsity). Did not beat Gemma-31B in any honest cleaned comparison; v2 K=24 attempts had severe truncation issues (320/500 truncated rows in A1j) before harness fixes.
  • Force response: highly sensitive to harness configuration (max_tokens, reasoning effort). Output-format and budget mismatches dominated the early A1j numbers.
  • Decision: explicitly retired from the experiment plan as of 2026-04-29 — "do not launch in any way, shape or form".
Best row: Stack + F3-on-TR (Open-Max) = 73.4 %

The cross-reader pattern: small/open-weight readers (Gemma-31B, Qwen-3.6-27B, A4B-MoE) form a scaffold-friendly class: rigid prompts help them or, as with Qwen, leave them roughly unchanged. Frontier closed-weights (Opus, GPT-5.5) form a scaffold-resistant class: their intrinsic reasoning is good enough that rigid rider language overrides it (measured directly only on Opus so far). The clean implication is that SYSTEM_RIDER_TEMPORAL should be reader-conditional (apply on smaller / open-weight readers, skip on frontier closed-weights). Phase 97.2.13 will pick this up.
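A reader-conditional rider could be wired in roughly as follows. This is a minimal sketch, not the Warrant implementation: the reader identifiers, the `build_system_prompt` helper, and the rider placeholder text are all assumptions made for illustration, and the actual rider wording is not reproduced on this page.

```python
# Hypothetical reader ids; the grouping mirrors the cross-reader pattern above.
SCAFFOLD_FRIENDLY = {"gemma-4-31b-it", "qwen3.6-27b", "gemma-4-26b-a4b-it"}

# Placeholder only: stands in for the real 7-step chronological rider text.
SYSTEM_RIDER_TEMPORAL = "<7-step YYYY/MM/DD chronological protocol>"


def build_system_prompt(base_prompt: str, reader_id: str) -> str:
    """Append the temporal rider only for scaffold-friendly readers."""
    if reader_id in SCAFFOLD_FRIENDLY:
        return f"{base_prompt}\n\n{SYSTEM_RIDER_TEMPORAL}"
    return base_prompt  # frontier readers keep their free-form reasoning


print(build_system_prompt("<v2 system prompt>", "gemma-4-31b-it"))
print(build_system_prompt("<v2 system prompt>", "claude-opus-4-6"))
```

Keeping the class membership in one explicit set makes the choice auditable: each (reader, rider on/off) combination is then a distinct contract that can be reported as its own row.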

Every row is a point. Every intervention is a vector.

Treat each (reader, contract) pair as a point in 6-dimensional category space: R = [SSA, SSU, SSP, TR, KU, MS]. An intervention is then a vector Δ = R_after − R_before. The same intervention applied to different readers gives different vectors, and the structure of those vectors tells us which forces are universal and which are reader-conditional. Below is the K=48 transfer triangle — identical intervention (v2 K=24 → v2.1 K=24/48 + chronological rider + KU rider tightening), three readers, three different responses.

| Reader | Δ SSA | Δ SSU | Δ SSP | Δ TR | Δ KU | Δ MS | Δ overall | Net classification |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma-4-31B; A1h v1 → A1h_v3 | −3.57 | −1.43 | −6.67 | +1.50 | +1.28 | +5.26 | +1.00 | scaffold-friendly: rider lifts TR, K=48 lifts MS, costs offset |
| Claude Opus 4.6; A1g → A1g_K48T | +0.00 | +1.43 | −3.33 | −15.79 | −1.28 | +5.26 | −3.00 | scaffold-resistant: rider collapses TR, costs swamp K=48 gain |
| Qwen-3.6-27B; A1l → A1l_K48T · new | +1.79 | −1.43 | −6.67 | −0.75 | −2.56 | +4.51 | +0.20 | neutral: rider neither helps nor hurts; partial K=48 saturation |

What the structure says:

The compact statement: under the v2.1 contract, the K=48 retrieval-depth vector and the chronological/KU rider vector partially commute (both apply additively to the total) but the rider vector has reader-dependent sign on TR/KU. The clean factorization is

factorization
$$\Delta_{\text{v2.1 vs v2}}(\text{reader}) = F_{K=48} + F_{\text{rider}}(\text{reader})$$
$F_{K=48}$ is reader-agnostic and roughly constant across readers on the MS / TR slices it is routed to (~+5 pp on MS, ~0 elsewhere). $F_{\text{rider}}(\text{reader})$ is reader-conditional and varies wildly on TR (−16 to +1.5) and KU (−2.5 to +1.3). Reading both separately, not as one fused "v2.1 contract", is the basic discipline of any future rider experiment.
Receipts: A1h v1 → A1h_v3 (Gemma, +1.00 net) · A1g → A1g_K48T (Opus, −3.00 net) · A1l → A1l_K48T (Qwen, +0.20 net). Same K=48 cache, same rider package, three different overall outcomes from the same vector arithmetic.

What we still want to measure: the K=48 retrieval-depth force in isolation, by running each reader at K=48 with v2 prompts (no riders). That tells us the exact size of $F_{K=48}$ and lets us subtract it cleanly from the v2.1 net to recover $F_{\text{rider}}(\text{reader})$. Pre-registered as Option A in the next-experiment matrix below.
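Once the K=48-only run exists, the subtraction is plain component-wise arithmetic. The sketch below shows the shape of that calculation using the three measured v2 → v2.1 vectors. The K=48 vector used here is a placeholder assumption (~+5 pp on MS, 0 elsewhere, taken from the rough isolated estimate in the force table), not a measured result, so the residuals it prints are illustrative only.

```python
QTYPES = ["SSA", "SSU", "SSP", "TR", "KU", "MS"]

# Measured v2 -> v2.1 deltas in pp, copied from the transfer-triangle table.
DELTA_V21 = {
    "Gemma-4-31B":     [-3.57, -1.43, -6.67,   1.50,  1.28, 5.26],
    "Claude Opus 4.6": [ 0.00,  1.43, -3.33, -15.79, -1.28, 5.26],
    "Qwen-3.6-27B":    [ 1.79, -1.43, -6.67,  -0.75, -2.56, 4.51],
}

# ASSUMPTION: stand-in for the not-yet-measured K=48-only vector (Option A).
F_K48_ASSUMED = [0.0, 0.0, 0.0, 0.0, 0.0, 5.0]


def rider_residual(delta, f_k48):
    """F_rider(reader) = delta(reader) - F_K48, component by component."""
    return [round(d - f, 2) for d, f in zip(delta, f_k48)]


for reader, delta in DELTA_V21.items():
    print(reader, dict(zip(QTYPES, rider_residual(delta, F_K48_ASSUMED))))
```

Replacing `F_K48_ASSUMED` with per-reader vectors from the Option A run would turn these illustrative residuals into the actual rider force for each reader.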

The intervention you meant to make is rarely the only intervention you made.

Contract drift: when an intervention intended to fix one qtype changes the evidence or prompt distribution enough to degrade another qtype. Every row of the leaderboard sits inside an implicit contract — retrieval depth K, candidate session ordering, system prompt, riders, judging protocol — and that contract is what makes scores comparable. Drift is what happens when one of those components changes silently while you compare to a row that no longer matches.

Documented drift cases from the Phase 97 series (each one is a delta we actually measured, not a hypothetical):

K=48 → MS gain, KU/TR risk · retrieval-depth drift

K=48 fixes Multi-Session retrieval; the same K=48 inflates the noise floor for tasks that already had their answer in the top 24.

  • What we wanted: scattered-session evidence recovered — achieved (+5.0 pp MS, universal across readers).
  • What we also got: 24 extra chunks of distractor risk on KU and TR. On Opus this combined with the rider into a −15.79 pp TR collapse.
  • Lesson: K is a contract change, not a "free zoom". Comparing a K=24 row to a K=48 row is comparing two different evidence distributions.

Chronological rider · temporal-ordering drift

Helped Gemma on TR; broke Opus on TR; neutral on Qwen. Same prompt, three different worlds.

  • What we wanted: stronger temporal arithmetic via explicit chronological scaffolding.
  • What we also got: Opus already had a stronger free-form temporal voice; the rider overrode it and forced a worse one. Reader-conditional sign on TR (+1.50 / −15.79 / −0.75).
  • Lesson: a "system rider" is never global. It interacts with reader internals. Always measure the rider on every reader before stacking.

KU rider tightening · evidence-format drift

Wanted "always pick the latest value"; got stylistic tightening that hurts SSP across all readers.

  • What we wanted: an explicit "latest-value wins" rule for KU questions.
  • What we also got: a terser overall answer voice that bleeds into SSP (preference questions), where richness and synthesis are rewarded by the judge. SSP loss universal: −6.67 / −3.33 / −6.67 across Gemma, Opus, Qwen.
  • Lesson: a "qtype-specific" rider rarely stays inside its qtype. The reader doesn't know which qtype it's answering.

GPT-5.5 arbiter · arbiter-selection drift

Recovered MS to 85.71 %; destroyed SSP via self-preference style bias.

  • What we wanted: a blind evidence-grounded arbiter that picks the best-supported candidate per question.
  • What we also got: the arbiter picked its own style 81.6 % of the time. SSP regressed by 20 pp vs the per-qtype best, because the arbiter consistently rejected Opus's richer SSP voice in favor of GPT-5.5's terser one.
  • Lesson: an arbiter is not neutral. Style preference is a real, measurable force. A finalizer must be evaluated per-qtype, not just overall.

max_tokens lift on GPT-5.5 · format / harness drift

Looked like a 4 pp reader regression; was a truncation budget.

  • What we wanted: GPT-5.5 evaluated under the v2 K=24 contract.
  • What we also got: 33 truncated answers because max_completion_tokens=768 caps reasoning and output combined for GPT-5+. After lifting to 4096, score recovered from 82.80 % to 86.80 %.
  • Lesson: a "reader regression" can be a harness regression in disguise. Always check truncation, refusal, and cite-empty rates before claiming a reader is weaker.
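The truncation check in this lesson can be automated with a small per-qtype counter. The sketch below assumes each result row is a dict with a `qtype` field and an OpenAI-style `finish_reason` field, where `"length"` marks a generation that stopped at the token cap; that row schema is an assumption for illustration, not the harness's actual format, and the rows shown are synthetic.

```python
from collections import defaultdict


def truncation_rate_by_qtype(rows):
    """Return {qtype: fraction of rows whose generation hit the token cap}."""
    total = defaultdict(int)
    truncated = defaultdict(int)
    for row in rows:
        total[row["qtype"]] += 1
        # "length" is the finish_reason reported when the token budget ran out.
        if row["finish_reason"] == "length":
            truncated[row["qtype"]] += 1
    return {q: truncated[q] / total[q] for q in total}


# Synthetic rows, for illustration only.
rows = [
    {"qtype": "TR", "finish_reason": "length"},
    {"qtype": "TR", "finish_reason": "stop"},
    {"qtype": "SSU", "finish_reason": "stop"},
]
print(truncation_rate_by_qtype(rows))
```

A qtype whose truncation rate is far above the others, as TR was at max_tokens=768, is a signal to fix the budget before interpreting the score.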

OSS-120B harness incompatibility · format / harness drift

Massive truncation rates from token-budget and reasoning-mode mismatches; cleaned score still under Gemma-31B.

  • What we wanted: a fair shot for a 120B reader on the same substrate.
  • What we also got: 320/500 truncated rows in the first run, hidden reasoning tokens eating the output budget, prompt/evidence-contract mismatch with our v2 template.
  • Lesson: bigger is not automatically better when the harness contract was tuned for a different reader family. A reader has to be brought into the contract, not assumed compatible.

Safety rules we now enforce on every Phase 97+ row before it lands on the leaderboard:

rule 1
Never report a changed K, prompt, or rider as the same contract.
If any of { K_default, K_per_qtype, system_prompt, user_template, riders } changes, the contract is new. It gets its own track on the leaderboard or its own column in the comparison table. "v2.1" is its own contract, distinct from "v2", even though some readers share both rows.
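One way to make rule 1 mechanical is to fingerprint the contract fields and refuse to compare rows whose fingerprints differ. The sketch below hashes the five fields named in the rule; the `contract_fingerprint` helper, the example field values, and the rider labels are hypothetical.

```python
import hashlib
import json

# The five components that define a contract under rule 1.
CONTRACT_FIELDS = ("K_default", "K_per_qtype", "system_prompt",
                   "user_template", "riders")


def contract_fingerprint(contract: dict) -> str:
    """Short stable hash over the contract-defining fields."""
    missing = [f for f in CONTRACT_FIELDS if f not in contract]
    if missing:
        raise KeyError(f"contract is missing fields: {missing}")
    payload = json.dumps({f: contract[f] for f in CONTRACT_FIELDS},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Placeholder values; the real prompts are not reproduced here.
v2 = {"K_default": 24, "K_per_qtype": {}, "system_prompt": "<v2 system>",
      "user_template": "<v2 user>", "riders": []}
v21 = {**v2, "K_per_qtype": {"MS": 48, "TR": 48},
       "riders": ["chronological", "ku_latest_value"]}

print(contract_fingerprint(v2), contract_fingerprint(v21))
```

Any change to K, prompts, or riders produces a different fingerprint, so a changed contract cannot silently share a leaderboard track with the old one.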
rule 2
Always isolate intervention deltas before stacking.
If you bump K and add a rider in the same run, you cannot attribute the delta. Run K-only and rider-only on at least one reader before stacking, otherwise you are reporting a fused vector you cannot decompose later.
Concrete example: A1h_v3 (Gemma v2.1) is K=48-routed AND chronological rider AND KU rider tightening, all at once. The +1.00 pp net hides what each piece contributed. The pre-registered K=48-only test (Option A in §7) exists specifically to recover that decomposition.
rule 3
Always check regressions by qtype, not just overall.
A −3 pp overall regression on Opus K48T hides a +5.26 pp MS gain. A +1 pp Gemma overall hides a −6.67 pp SSP loss. The 6-vector tells you what's actually moving; the scalar tells you almost nothing.
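A per-qtype gate for rule 3 can be as small as a filter over the delta vector. The sketch below lists the qtypes that regressed past a threshold, worst first; the `qtype_regressions` name and the −1.0 pp default threshold are illustrative choices, not project settings.

```python
def qtype_regressions(delta_pp: dict, threshold_pp: float = -1.0):
    """Return (qtype, delta) pairs at or below threshold_pp, worst first."""
    hits = {q: d for q, d in delta_pp.items() if d <= threshold_pp}
    return sorted(hits.items(), key=lambda item: item[1])


# Opus A1g -> A1g_K48T deltas, copied from the transfer-triangle table.
opus_k48t = {"SSA": 0.00, "SSU": 1.43, "SSP": -3.33,
             "TR": -15.79, "KU": -1.28, "MS": 5.26}

print(qtype_regressions(opus_k48t))
```

On the Opus row this surfaces TR, SSP, and KU in that order, which the −3.00 pp scalar alone would not reveal.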
rule 4
Small qtypes swing hard. Report counts, not only percentages.
SSP has n=30 questions, SSA has n=56. A single question on SSP is ~3.3 pp; a 2-question swing is ±6.7 pp. Rider deltas on SSP across the K=48 transfer triangle are −2 / −1 / −2 questions — statistically modest, headline-loud. Always show k/N alongside %.
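A formatting helper keeps k/N attached to every percentage so small-n swings stay visible. This is a minimal sketch with hypothetical helper names; the two example scores are the SSP and TR bests quoted earlier on this page.

```python
def fmt_score(correct: int, n: int) -> str:
    """Render a score as a percentage with its k/N count alongside."""
    return f"{100 * correct / n:.2f} % ({correct}/{n})"


def pp_per_question(n: int) -> float:
    """How many percentage points a single question is worth on a slice."""
    return 100 / n


print(fmt_score(26, 30))     # SSP best: Opus v2 K=24
print(fmt_score(118, 133))   # TR best: Gemma v2.1
print(round(pp_per_question(30), 2), round(pp_per_question(56), 2))
```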
rule 5
Separate retrieval-bound failures from reader-bound failures.
If the gold session isn't in the top-K candidates, no reader can answer. That's a retrieval failure, not a reader failure — and it's the upper bound for any reader on that question. We track retrieval_hit_at_k per row precisely so we can attribute correctly. Don't blame the reader for evidence it didn't see.
In the K=48 transfer triangle, the MS misses that K=48 recovers were retrieval-bound at K=24: the gold evidence sat in chunks 25–48, so no reader could see it. That's why the K=24 → K=48 lift on MS is universal across readers (the bottleneck moved); what remains on MS at K=48 is mostly reader-bound.
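The attribution in rule 5 reduces to a two-way split on each missed question. The sketch below assumes each row carries a boolean `correct` field (a hypothetical name) next to the `retrieval_hit_at_k` flag mentioned above; the rows are synthetic.

```python
from collections import Counter


def classify_row(row) -> str:
    """Label a row as correct, retrieval-bound miss, or reader-bound miss."""
    if row["correct"]:
        return "correct"
    # Gold evidence never reached the reader: no reader could have answered.
    if not row["retrieval_hit_at_k"]:
        return "retrieval-bound"
    return "reader-bound"


# Synthetic rows, for illustration only.
rows = [
    {"qtype": "MS", "correct": False, "retrieval_hit_at_k": False},
    {"qtype": "MS", "correct": False, "retrieval_hit_at_k": True},
    {"qtype": "MS", "correct": True, "retrieval_hit_at_k": True},
]
print(Counter(classify_row(r) for r in rows))
```

Counting the two miss classes separately per qtype shows directly whether a K change or a reader change is the lever worth pulling.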
rule 6
An arbiter, finalizer, or judge is not part of the contract — it is its own row.
MIX-v1-closed is not a "stack v2.1 result with a finalizer attached". It is a separate evaluation contract: 6-candidate champion pool + GPT-5.5 evidence arbiter, with its own bias profile (style preference, MS recovery), its own per-qtype signature, and its own gold-leakage guards. We give it its own track on the leaderboard for exactly this reason.

Seven laws we keep re-discovering.

Each of these is a compressed restatement of a measurement we made at least twice. They are intended as defaults, not absolutes — the whole point of the magnetic-field model is that any law can be locally violated by a sufficiently exotic reader/contract pair. But they hold across every Phase 97 row we have run so far.

law 1
Retrieval depth is not monotonic.
More K helps when the bottleneck is recall (scattered MS evidence). It hurts when the bottleneck is signal density: extra distractors confuse temporal arithmetic and update-tracking. Symptom of "too much K": TR or KU regress while MS gains. Symptom of "too little K": MS regresses while everything else holds.
Evidence: K=10 → K=24 lifted MS without TR/KU cost. K=24 → K=48 lifted MS by ~+5 pp universally but cost Opus −15.79 on TR. K=48 is at the edge of the productive range; K=64 is pre-registered to test whether MS continues lifting or whether distractor cost catches up first.
law 2
Temporal scaffolds are reader-specific.
A chronological rider is a scaffold for readers with weaker free-form temporal handling and a constraint for readers with stronger ones. There is no globally good rider; there are only readers a given rider helps.
Same chronological rider on the same K=48 cache: Gemma TR +1.50, Qwen TR −0.75, Opus TR −15.79. Sign-flip across readers, not across questions.
law 3
Multi-Session is breadth-limited before it is reasoning-limited.
If the gold evidence is split across 3+ sessions, what moves MS most is candidate-pool breadth (more K, more session diversity in the candidate set), not a smarter synthesizer. Once breadth saturates, then and only then does reader synthesis quality become the binding constraint.
+5.0 pp average MS lift from K=24 → K=48 across Gemma / Opus / Qwen, while reader swaps at fixed K=24 moved MS by <2 pp on average. Retrieval before readers, every time.
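The comparison behind this law can be re-derived from the force table: the three K=24 → K=48 MS lifts against the four Opus → X reader swaps that report a numeric MS delta at fixed K=24 (the OSS-120B row only reports "~0" and is left out). Variable names are illustrative.

```python
from statistics import mean

# MS deltas in pp, copied from the force table.
k48_ms_lift = [5.26, 5.26, 4.51]            # Gemma, Opus, Qwen at K=24 -> K=48
swap_ms_delta = [2.26, -2.26, 0.75, -2.26]  # Opus -> Qwen, Gemma-31B, GPT-5.5, A4B

avg_lift = mean(k48_ms_lift)
avg_swap = mean(abs(d) for d in swap_ms_delta)  # magnitude, ignoring sign

print(f"K lift: {avg_lift:.2f} pp, reader swap: {avg_swap:.2f} pp")
```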
law 4
SSP is style-sensitive.
Preference questions reward richness, hedging, and synthesis voice. Any rider that tightens output style — chronological, latest-value, terse-finalizer — will cost SSP points. The cheapest fix is to not apply tightening riders to SSP; the deepest fix is a richer per-qtype response budget.
SSP loss universal under v2.1: −6.67 / −3.33 / −6.67 across Gemma, Opus, Qwen. SSP loss universal under MIX arbiter: −20 pp vs per-qtype best, driven by GPT-5.5 style preference.
law 5
Arbiter choice is not neutral.
A finalizer prefers candidates that match its own voice unless explicitly forced to verify evidence per candidate. A blind, evidence-grounded prompt narrows the bias but does not erase it. Always measure arbiter pick-share by candidate identity and check for self-preference.
MIX-v1-closed: GPT-5.5 picked GPT-5.5 candidates 81.6 % of the time despite candidate labels being shuffled and identities hidden. Style is its own signature.
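Pick-share is simple to compute once the shuffled candidate labels have been mapped back to reader identities after the run. The sketch below assumes a list of `(qtype, picked_reader)` pairs, which is a hypothetical log format; the picks shown are synthetic.

```python
from collections import Counter, defaultdict


def pick_share_by_qtype(picks):
    """picks: iterable of (qtype, picked_reader) -> {qtype: {reader: share}}."""
    counts = defaultdict(Counter)
    for qtype, reader in picks:
        counts[qtype][reader] += 1
    return {
        qtype: {reader: n / sum(c.values()) for reader, n in c.items()}
        for qtype, c in counts.items()
    }


# Synthetic picks, for illustration only.
picks = [("SSP", "gpt-5.5"), ("SSP", "gpt-5.5"),
         ("MS", "opus-4.6"), ("MS", "gpt-5.5")]
shares = pick_share_by_qtype(picks)
print(shares)
```

A reader that is the per-qtype champion but has a pick-share near zero on that qtype, as Opus did on SSP, is the signature of style preference rather than evidence preference.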
law 6
Small readers can beat large readers when the contract matches their geometry.
A reader's effective score is the dot product of its strengths with the contract's demands. A 27 B reader tuned to the contract beats a 120 B reader fighting the contract. Scale is one axis among many; harness compatibility, prompt template, and rider package are co-equal.
Qwen-3.6-27B (88.60 %) beats GPT-OSS-120B (73.4 % cleaned) under the same v2 K=24 substrate. Gemma-4-31B (87.80 %) beats GPT-OSS-120B too. Reader-contract fit dominates parameter count.
law 7
Overall accuracy hides opposing forces.
A +1 pp net result can contain a +7 MS recovery and three small-qtype regressions. A −3 pp net can contain a +5 MS recovery and a −15 TR collapse. The 6-vector is the truth; the scalar is the headline. Always paper-trail the vector.
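A1h_v3: +1.00 net = +5.26 MS − 6.67 SSP − 3.57 SSA + 1.50 TR + 1.28 KU − 1.43 SSU, with each per-qtype delta weighted by its share of the 500 questions rather than summed directly (the raw sum would be −3.63). Same shape, different magnitudes, on every v2.1 reader.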
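The A1h_v3 vector can be folded back into its scalar with a few lines of arithmetic. This sketch assumes the commonly reported LongMemEval-S per-type question counts; only the MS count of 133 is implied by this page (+5.26 pp = +7 questions), so treat the other five counts as an assumption to check against your own split.

```python
# Assumed per-qtype question counts (sum to 500); verify against your dataset split.
QTYPE_COUNTS = {"SSA": 56, "SSU": 70, "SSP": 30, "TR": 133, "KU": 78, "MS": 133}

# Per-qtype deltas in percentage points for Gemma A1h v1 -> A1h_v3, from the text.
DELTA_PP = {"SSA": -3.57, "SSU": -1.43, "SSP": -6.67, "TR": 1.50, "KU": 1.28, "MS": 5.26}


def net_delta_pp(delta_pp: dict[str, float], counts: dict[str, int]) -> float:
    """Overall delta in pp: per-qtype deltas weighted by question count."""
    total = sum(counts.values())
    return sum(delta_pp[q] * counts[q] for q in counts) / total


def question_swing(delta_pp: dict[str, float], counts: dict[str, int]) -> dict[str, int]:
    """Convert each per-qtype delta back into a whole number of questions."""
    return {q: round(delta_pp[q] * counts[q] / 100) for q in counts}


net = net_delta_pp(DELTA_PP, QTYPE_COUNTS)
swing = question_swing(DELTA_PP, QTYPE_COUNTS)
```

Under these counts the swing is +7 MS, +2 TR, +1 KU against −2 SSA, −2 SSP, −1 SSU: five questions net, or one point on 500.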

What we'd run next, in priority order.

Each row is a single, factor-isolated experiment. The "Isolates" column names the force we expect to measure cleanly if the run completes without confounds. "Expected band" is our prior; "Informative if outside" says what we learn if the result lands outside that band, which is what makes the row worth running even if our prior is wrong.

| Experiment | Spec | Isolates | Expected band | Informative if outside |
| --- | --- | --- | --- | --- |
| A · K=48 + v2 prompts | Gemma + Qwen, K=48 default (no per-qtype routing), no chronological rider, no KU rider tightening | retrieval-depth alone | 87.8 – 88.5 % (Gemma); 88.4 – 88.9 % (Qwen) | >88.8 %: K=48 alone is enough; rider was redundant. <87.6 %: rider was load-bearing on Gemma's net +1.00 pp. |
| B · K=48 + chronological only | Gemma + Qwen + Opus, K=48 default, chronological rider, no KU rider tightening | temporal-ordering alone | Gemma 88.5 ± 0.5 %; Opus collapses on TR; Qwen 88.5 ± 0.5 % | If Opus does not collapse on TR: the KU rider was the real source of TR damage, not the chronological one. |
| C · K=64 contract | Gemma + Qwen, K=64 default or K=24/64 routed; requires re-running the canonical retrieval pipeline at K=64 (BGE+QNDN+BM25 → RRF → MixK) | retrieval-depth monotonicity past K=48 | +0 to +1.0 pp net over K=48; MS keeps lifting, SSU/SSA/KU may regress from distractors | If MS continues lifting without SSU/KU regression: distractor cost is sub-linear in K. If MS plateaus: K=48 is the practical retrieval ceiling on this substrate. |
| D · MIX-v1-closed-noGPT | Same arbiter, same evidence panel; drop GPT-5.5 candidates from the pool entirely | arbiter-selection bias source | ~88.0 – 88.6 % overall; SSP recovers, MS partially regresses | If overall improves: GPT-5.5 candidates were a net drag. If SSP doesn't recover: the bias was in the arbiter prompt, not the candidate pool. |
| E · MIX-v2-style-balanced | Same pool as v1-closed; arbiter prompt rewritten with explicit anti-style-bias language and per-candidate evidence verification | arbiter-selection bias magnitude | 89.0 – 90.0 %; GPT-5.5 pick share drops from 81.6 % toward ~50 %; SSP recovers | If pick share doesn't move: style bias is structural, not promptable. If pick share moves but SSP doesn't recover: the wrong candidates were in the pool to begin with. |
| F · MIX-v2-with-Qwen | Add Qwen-3.6-27B (88.60 % v2 K=24 + 88.80 % v2.1) as a 7th candidate | arbiter × reader diversity | 89.2 – 90.0 %; gains concentrated in SSA/SSP where Qwen is strong | If Qwen pick share is meaningful (>15 %): the pool was narrow, not the arbiter. If pick share <5 %: arbiter bias dominates pool diversity. |
| G · SSP-rich rider stripped | Best v2.1 reader (Gemma or Qwen) with the KU rider removed for the whole run | evidence-format source of SSP loss | SSP recovers ~+5 pp; KU regresses ~−1 to −2 pp; net flat or slight positive | If SSP doesn't recover: the chronological rider, not the KU rider, was the SSP killer. |
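The expected-band logic in the table can be made mechanical so that a finished run is classified before anyone interprets it. The sketch below encodes only experiment A's two bands from the table; the `Band` class is hypothetical and the observed values passed in are placeholders, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Prior for an experiment's overall accuracy, in percent."""
    low: float
    high: float

    def classify(self, observed: float) -> str:
        """Return where an observed score falls relative to the prior band."""
        if observed < self.low:
            return "below"
        if observed > self.high:
            return "above"
        return "inside"


# Experiment A priors, copied from the table.
GEMMA_A = Band(low=87.8, high=88.5)
QWEN_A = Band(low=88.4, high=88.9)

# Placeholder scores purely to exercise the three outcomes.
inside = GEMMA_A.classify(88.0)
above = GEMMA_A.classify(88.9)
below = QWEN_A.classify(88.0)
```

Writing the band down before the run is the point: a result of "above" or "below" triggers the "Informative if outside" reading, and "inside" confirms the prior without post-hoc adjustment.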

Priority ordering (as of this writing):

  1. A first: cheapest, most diagnostic, completes the K=48 transfer triangle into a 4-point factorial. Gives us the size of the K=48 force, F_{K=48}, in isolation and lets every previous v2.1 row be re-decomposed.
  2. G next: tests the KU-rider hypothesis without requiring new retrieval. SSP recovery is worth a track-1 row.
  3. D and E in parallel: both are arbiter-only experiments, no new retrieval cache, both attack the same MIX bias problem from different angles.
  4. B: only useful after A clarifies which rider is doing what; otherwise we are stacking untested factors again.
  5. C: the most expensive and the most hypothesis-driven. Worth building K=64 retrieval only if A and G show retrieval is still the binding constraint.
  6. F: adds candidate diversity; expected to be incremental, not transformative.
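Once experiment A lands, re-decomposing a v2.1 row is simple subtraction. The sketch below is a rough illustration: the baseline (87.80 %) and stacked (87.80 + 1.00 = 88.80 %) Gemma scores come from the text, while the K=48-only score is a placeholder standing in for A's not-yet-measured result. Because v2.1 used K=24/48 routing, not K=48 everywhere, the residual mixes routing and rider effects.

```python
def decompose(baseline: float, k_only: float, stacked: float) -> dict[str, float]:
    """Split a stacked delta into a K-only force and everything else.

    baseline: v2 K=24, no riders
    k_only:   v2 prompts at K=48, no riders (experiment A)
    stacked:  v2.1 (K=24/48 routed + riders)
    """
    return {
        "F_K48": k_only - baseline,                      # retrieval-depth force alone
        "rider_and_routing_residual": stacked - k_only,  # what K=48 alone does not explain
    }


# 88.20 is a hypothetical placeholder for experiment A's Gemma result.
parts = decompose(baseline=87.80, k_only=88.20, stacked=88.80)
total = parts["F_K48"] + parts["rider_and_routing_residual"]
```

The same subtraction applies per qtype, which is where it matters: a near-zero residual overall can still hide a positive TR term and a negative SSP term.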

Closing statement:

framework
Warrant is a measurement instrument, not a benchmark row.
Every row of every leaderboard is a 6-vector, not a scalar. Every intervention is a vector field acting on those points, with reader-conditional sign on the rider components and reader-agnostic sign on the retrieval components. Tuning is the practice of selecting force combinations whose net vector is positive on the qtypes you care about, while explicitly accepting and reporting the regressions on the qtypes you don't.

The benchmark is the answer key. Field Notes is the calculus.