Independent retrieval research lab

Retrieval, measured.

Evidence-first memory for corpora that can't leave the building.

ManifoldMemory is an independent research lab working on the mathematics and engineering of retrieval at scale: how to find the right fragment in a million-document private archive, how to make a language model refuse when the evidence isn't there, and how to deploy both inside infrastructure that cannot make external API calls. Our research ships as Warrant, an evidence-first retrieval primitive for regulated enterprises.

See the product → · What we research →
Journal-first. Reproducible. No vibes.

A small research group. One hard problem.

Retrieval quality degrades as archives grow. Dense retrievers that work at 100 K documents lose double-digit accuracy at 10 M. The lab exists to understand why, fix it at the mathematical level, and ship the fix as a deployable product. We publish our measurements; we don't publish hype.

01 · Measure or don't claim

Every number on this site has a receipt.

If a metric doesn't have a pipeline, a pool definition, and a reproducible evaluation script behind it, it doesn't appear. We maintain a ~25,000-line experiment journal where every claim on every page is cross-referenced to raw data.

02 · Negative results are load-bearing

The things that didn't work are part of the product.

We walked away from four large programs of work after honest measurement: a 7× bigger encoder, a 6.7× bigger reranker, an "obviously better" training objective that composed antagonistically, and a generation-from-latents track that mode-collapsed under every loss we tried. Each closure is documented.

03 · Ship the smallest model that survives

Parameter count is not the moat.

Our production reranker is 30 M parameters. We tested 200 M and 500 M variants; they lost. The scaling advantage is in the training objective and manifold geometry, not in how many floats the model has.

04 · Nothing leaves the customer

Private-corpus is the design constraint, not an afterthought.

Every component runs on-premise on a single commodity GPU. No external API, no shared index, no data egress. The threat model is a regulated customer who literally cannot send data to a managed service, and the architecture reflects that.

Four open questions. Measured answers.

The lab operates on four active research threads. Each one has a measurable question, a reproducible experimental protocol, and a set of findings that feed directly into the product.

Thread 01 · Manifold geometry

Hopfield retrieval on real natural language

Modern Hopfield networks (Ramsauer et al.) predict an exponential-in-Δ² storage capacity and a β-regime phase transition in retrieval dynamics. Both results had been demonstrated mostly on synthetic data. We measured them on a learned 7.47 M-row natural-language manifold.

Finding. CLOOB fine-tuning sharpens the minimum pairwise separation from Δ = 0.051 to Δ = 0.146 (2.86×). The predicted β-phase transition appears cleanly at β ≈ 10 on the sharpened manifold, and is absent on the un-sharpened one. One-step iterative refinement flips sign between regimes (−0.021 → +0.076).

Thread 02 · Antagonistic levers

When two good ideas fight each other

The naive assumption in retrieval research is that orthogonal interventions compose additively. We repeatedly find they don't. A sharper within-modality manifold, combined with a mixed-K curriculum, collapses P@1 by 60%. A CLOOB-sharpened substrate, combined with a cross-modal projection head, retrieves 1.9× worse than the un-sharpened substrate does.

Finding. Two interventions that individually buy +11.9% and +34.5% P@1 compose into −54.8%, not +46.4%. Interaction term −0.264. Pattern holds across a second pair of levers measured a month later. "Orthogonal" is an empirical claim, not an assumption.

Thread 03 · Retrieval at scale

The geometry of retriever failure modes

At 1 M+ documents, different retrievers fail in different shapes. A sharp-head retriever lands the gold answer at rank 1 or not at all. A shallow-tail retriever reliably lands it near rank 50 but rarely at rank 1. Reporting a single P@K number hides the shape. We characterise retrievers across the full depth curve.

Finding. At 3.52 M, BGE-large P@1 = 0.149 > our QNDN v0 P@1 = 0.068. But their top-1000 union coverage is 65.1 % vs 39.5 % / 43.9 % for either alone — genuine decorrelation. A two-retriever union reranked by a shared scorer lifts production P@1 by +83 % over the single-retriever baseline.

Thread 04 · Calibrated refusal

The LLM contract for evidence-bounded corpora

A language model grounded on retrieved evidence should answer when the evidence is sufficient, cite when it answers, and refuse otherwise. Most deployed RAG systems fail the third clause. We treat refusal as a first-class primitive with its own audit suite, including a cross-encoder-style similarity floor and a distinct UX for three outcome states.

Finding. Adversarial 10-question audit on the public demo: 5 / 5 decisions correct, 0 hallucinations. Four distinct RAG failure modes (wrong-topic-plausible, false-confident-wrong, confidently-adjacent, on-topic-sub-question-gap) caught at the LLM, not by a cosine threshold. 100-question audit at scale: ~94 / 100.

The product ships what the research measures.

Every number on the Warrant page is traceable to a finding on this page. The two pages are not separate: the Lab page describes what we study; the Warrant page describes what shipped out of it. If a claim on the Warrant page does not have an experiment behind it here, it is not on the Warrant page.

Field-level contributions, honestly scoped.

Quantities we have measured that, to our knowledge, had not previously been measured on a learned natural-language manifold at this scale. Each has a file path and a reproducible evaluation behind it. None has been peer-reviewed yet; that's the gate between our journal and a submittable paper.

CLOOB → Δ bridge
2.86×

Minimum pairwise cosine separation Δ on a 7.47 M-row NL manifold, before and after CLOOB continued training. 0.051 → 0.146. First direct measurement of the CLOOB → Hopfield capacity bridge at M+ scale on real prose.

Phase transition
β ≈ 10

Continuation-retrieval P@1 traces a clean metastable-regime optimum at β ≈ 10 on the sharpened manifold. The transition is absent from the un-sharpened manifold under identical evaluation. Present-with / absent-without causal cut.

Refinement sign flip
± 0.08

One step of iterative Hopfield refinement: ΔP@1 = −0.021 on the soft manifold, +0.076 on the sharp one. Same operator, opposite effect — the in-regime / out-of-regime cut the theory predicts.

Cross-modal antagonism
1.9×

QPhead trained on the CLOOB-sharpened substrate retrieves 1.9× worse at native P@1 than the same head on the un-sharpened substrate. Within-modality sharpening is measurably antagonistic to cross-modal alignment.

Scaling decay gap
3.3×

197 K → 3.72 M real distractors. Off-the-shelf dense P@1 drops 21.1%; our stack drops 6.4%. Median rank of the correct answer: 14 vs 16,197. Rank degradation of 1.17× for our stack vs 13.4× for the dense baseline.

Union supra-additivity
+83%

Two complementary first-stage retrievers (sharp-head + shallow-tail) reranked over their union lift P@1 by +83% over the best single-retriever baseline at 3.52 M. Top-1000 union coverage: 65.1% vs 39.5 / 43.9 % alone.

We have not beaten SOTA dense retrievers in isolation.
Our single-retriever P@1 at 3.52 M is 0.068; BGE-large is 0.149. The measured win is in scaling decay, complementarity under rerank, and calibrated refusal — not in native P@1 on a small pool. We say this here so nothing on the Warrant page reads as an over-claim.
Receipt: `EXPERIMENT_JOURNAL.md`, Phase 86 final 3.52 M K_first=1000 matrix.

Journal-first, measured always, small where possible.

The lab's operating mode isn't a publication pipeline; it's a measurement pipeline. Experiments run nightly, results go into a single versioned journal, and the product inherits only what survives ablation.

What the lab does (active)

  • Builds and measures retrieval primitives at up to 7.5 M real documents on one- or two-GPU workstations.
  • Maintains a strict apples-to-apples benchmarking hygiene: matched pools, matched queries, matched evaluators across every head-to-head.
  • Ships production-grade pipelines (Warrant) on the same hardware researchers prototype on — no ops gap between research and deploy.
  • Publishes negative results with the same rigour as positive ones; the pinned "do-not-chase" list is public in our internal journal and summarised on this site.

What the lab explicitly does not do (out of scope)

  • Train foundation models. We build retrieval primitives that sit in front of any reader. The reader is the customer's choice.
  • Chase benchmark leaderboards for their own sake. LongMemEval is a public receipt, not a goal; MTEB / BEIR replication is future work, not a current optimisation target.
  • Serve web-scale open-domain search. That market sits behind roughly $100 B of cumulative R&D spend; we cannot compete on its axes and don't try.
  • Ship a managed cloud API. The product is deployable on commodity hardware inside the customer's infrastructure. That is the whole design constraint.

The lab is open to technical collaboration.

Researchers interested in Hopfield-on-natural-language, retrieval scaling, or evidence-first LLM contracts can write to the lab directly. We share reproducible pipelines for any published number on this site, typically under an NDA or academic-collaboration agreement. Commercial evaluation requests should use the Warrant page below.

research@manifoldmemory.ai →
Warrant · enterprise retrieval · private pilots

The retrieval contract for corpora you can't send anywhere.

Million-scale. Calibrated refusal. One commodity GPU.

Warrant is a production retrieval primitive for regulated enterprises. It takes a natural-language question, returns the top source fragments from a private corpus in milliseconds, and — when evidence is insufficient — refuses rather than hallucinates. Deployable on a single commodity GPU inside your existing infrastructure. No data egress. No external APIs. No managed service in the path.

Request a pilot → · See the benchmarks →
For legal, healthcare, defence, finance, critical infrastructure.
Scales where dense retrieval breaks
−6.4% P@1 decay
197 K → 3.72 M real in-distribution distractors. Off-the-shelf dense baselines lose −21%. Warrant loses −6.4%. Median rank of the correct answer stays at 14 instead of blowing up to 16,197.
Fits on commodity hardware
2.32 ms/query
Measured on a V100-16GB, batch=1 single-query latency, bare inference (corpus encoding amortised separately). 431 queries per second. Fits T4, L4, A10, RTX 4090, H100. $0.089 per million queries at on-demand cloud pricing — roughly 22,000× cheaper than a managed cross-encoder API at list price. Methodology & receipts →
Measured calibrated refusal
5 / 5 correct decisions
Adversarial audit of 10 fuzzy, keyword-poor queries. Zero hallucinations. Four distinct RAG failure modes caught at the LLM. Refusal is a first-class primitive, not a prompt trick. 100-Q audit at scale: ~94/100 correct.
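The throughput and cost figures in the latency card above are simple arithmetic on the measured per-query latency. A minimal sketch; the hourly GPU price below is an assumed illustrative input, not a quoted rate:

```python
# Derive queries-per-second and cost-per-million-queries from the
# measured single-query latency. LATENCY_MS is from the page;
# GPU_PRICE_PER_HOUR is an ASSUMPTION for illustration only.
LATENCY_MS = 2.32          # measured, batch=1, V100-16GB
GPU_PRICE_PER_HOUR = 0.14  # assumed on-demand/spot price in USD

qps = 1000.0 / LATENCY_MS                 # ≈ 431 queries per second
queries_per_hour = qps * 3600.0
cost_per_million = GPU_PRICE_PER_HOUR / queries_per_hour * 1e6

print(f"{qps:.0f} qps, ${cost_per_million:.3f} per million queries")
```

At that assumed hourly price the cost lands in the $0.09-per-million range the card quotes; the exact figure scales linearly with whatever the deployment actually pays per GPU-hour.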

Four qualifiers. Four disqualifiers.

Warrant is not a general search product. It is engineered for a specific intersection of constraints. If your deployment satisfies all four, the alternatives have measurable problems. If it satisfies none, there are cheaper tools. This page is for the first group.

Million-scale archives

Disqualifies: vanilla dense baselines

Standard dense retrievers lose ~21% of top-1 accuracy going from 197 K to 3.7 M documents. A 10×-larger encoder does not fix it. Warrant decays ~6.4% over the same corpus growth because the ranking advantage lives in the training objective, not in parameter count.

Private corpus, no egress

Disqualifies: OpenAI · Cohere · Voyage · Gemini · managed SaaS

Every API-based retrieval and rerank service is disqualified at step zero in a regulated procurement: the customer's data cannot contractually or legally be sent to an external endpoint. Warrant runs entirely inside your environment — single GPU, no network egress required.

Calibrated refusal as a primitive

Disqualifies: every off-the-shelf RAG stack

In regulated verticals, a confident-wrong answer is a compliance or liability event, not a user-experience bug. Warrant treats refusal as a first-class output state with its own audit suite: answered-with-cites, refused-no-cites, and refused-low-confidence are visually and programmatically distinct.

One commodity GPU

Disqualifies: CE rerankers at scale · multi-node search platforms

The full serving stack fits in < 16 GB VRAM. A cross-encoder reranker over a 3.7 M pool is ~29 minutes per query on an H100 — operationally unshippable. Warrant is ~11 ms on the same hardware: ~158,000× faster, at equal or better first-stage quality. Methodology & receipts →

A staged retrieval contract. Every stage audited.

The production pipeline is a staged union-retrieval system with a learned reranker and an evidence-bounded reader. Each stage has its own measurable contribution, documented on the Lab page and reproducible from the customer's own gold queries during pilot.

Stage 1a · dense retrieval
BGE-large-en-v1.5
Industry-standard dense encoder, fully self-hosted. Retrieves the top-1000 candidates by cosine similarity on a BGE-encoded FP16 corpus. Sharp-head bias: lands the gold near rank 1 when it finds it; median rank 4 when found.
Stage 1b · latent retrieval
QNDN v0 — native latent encoder + query head
Our 13 M-parameter query head over a 92 M-parameter document encoder trained on attractor dynamics. Shallow-tail bias: gold reliably surfaces in the deep top-1000 (P@1000 = 0.439 vs BGE's 0.395 at 3.52 M; lead widens to +27 % at P@10000). Complementary to BGE, not a replacement. Same corpus, same row ordering. Retrieves the top-1000 in parallel.
→ union, deduplicated →
Stage 2 · learned rerank
MixK — 30 M-parameter Perceiver reranker
Re-scores the union (~1,500 unique candidates) in a single forward pass using cross-attention over latent representations. Sub-quadratic in K: scales monotonically from K_first = 100 to K_first = 1000 without retraining. Returns top-10.
Stage 3 · evidence-bounded reader
Grounded summary with calibrated refusal
Optional. An open-weight reader (customer's choice of Gemma, Qwen, Llama, or similar) reads the top-10 fragments, produces a cited answer or explicitly refuses when the evidence is insufficient. The similarity-floor gate is tuned per-corpus during onboarding.
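The four stages above can be sketched as a single function. Names and signatures here are illustrative stand-ins, not Warrant's actual API; the scorers and reader are supplied by the deployment:

```python
from typing import Callable, Dict, List

def staged_answer(
    query: str,
    retrieve_dense: Callable[[str, int], List[str]],   # Stage 1a (e.g. BGE)
    retrieve_latent: Callable[[str, int], List[str]],  # Stage 1b (e.g. QNDN)
    rerank: Callable[[str, List[str]], List[str]],     # Stage 2 (learned reranker)
    read: Callable[[str, List[str]], Dict],            # Stage 3 (evidence-bounded reader)
    k_first: int = 1000,
) -> Dict:
    # Both first-stage retrievers run over the same corpus; conceptually
    # in parallel, sequentially here. Union is deduplicated, order kept.
    union = list(dict.fromkeys(
        retrieve_dense(query, k_first) + retrieve_latent(query, k_first)
    ))
    top10 = rerank(query, union)[:10]
    # The reader returns either a cited answer or an explicit refusal.
    return read(query, top10)
```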

The two retrievers measure different things. The flip happens around K = 200.

BGE-large is a sharp-head retriever: when it finds the gold passage, it places it at rank 1 and rarely lower — but a hard tail of queries it never ranks at all. QNDN v0 is a shallow-tail retriever: less precise at rank 1, but it lands gold somewhere in the deep top-1000+ much more reliably. The two curves cross around K = 200 and QNDN's lead widens monotonically with K. Identical 2,000-query evaluation, same corpus row-ordering, seed = 42, K_first = 10,000.

Pool    | Retriever                  | P@1   | P@10  | P@100 | P@1000 | P@10000 | Δ vs BGE @ deep-K
3.52 M  | BGE-large-en-v1.5          | 0.149 | 0.224 | 0.302 | 0.395  | 0.510   | baseline
3.52 M  | QNDN v0 (Warrant Stage 1b) | 0.068 | 0.167 | 0.274 | 0.439  | 0.645   | +11% @ K=1000 · +27% @ K=10000
7.52 M  | BGE-large-en-v1.5          | 0.147 | 0.222 | 0.300 | 0.393  | 0.501   | baseline
7.52 M  | QNDN v0 (Warrant Stage 1b) | 0.066 | 0.160 | 0.262 | 0.421  | 0.628   | +7% @ K=1000 · +25% @ K=10000

The 27 % relative lead at P@10000 (0.645 vs 0.510 at 3.52 M) is what makes the union architecturally meaningful: BGE alone misses 49 % of gold passages at K = 10,000; the BGE ∪ QNDN v0 union misses 18 %. A 31 pp recall improvement before the rerank layer ever fires — that is why we union the two first-stage retrievers rather than picking one. The pipeline does not need QNDN to win at P@1; it needs QNDN to be holding the gold at K = 1,000+ when the union is handed to MixK. Receipts: EXPERIMENT_JOURNAL.md Phase 86 final matrix (final_matrix_3m5_k10000.md, final_matrix_7m52_k10000.md; 2,000 queries, K_first = 10,000, seed = 42).
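The union-coverage numbers above are computable directly from two retrievers' ranked lists and a gold mapping. A minimal sketch; the function name and data layout are illustrative, not the lab's evaluation code:

```python
from typing import Dict, List

def union_coverage(
    ranks_a: Dict[str, List[str]],  # query_id -> ranked doc_ids (retriever A)
    ranks_b: Dict[str, List[str]],  # query_id -> ranked doc_ids (retriever B)
    gold: Dict[str, str],           # query_id -> gold doc_id
    k: int,
) -> float:
    """Fraction of queries whose gold doc appears in either top-K list."""
    hits = sum(
        1 for q, g in gold.items()
        if g in ranks_a.get(q, [])[:k] or g in ranks_b.get(q, [])[:k]
    )
    return hits / len(gold)
```

Single-retriever coverage is the same computation with one list; the "union misses 18%" style of claim is `1 - union_coverage(...)` at the stated K.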

Your existing reader. Your existing embedding index. Warrant slots into both.
If you already deploy BGE or a similar dense retriever, Stage 1a is drop-in. If you already run a rerank layer, Warrant replaces it at measurably better scaling behaviour. If you already have a grounded-reading prompt for Gemma or Qwen, Stage 3 is your prompt plus a calibrated floor. The product is the pipeline, not any single component.

The number that decides million-document RAG.

The only metric that survives a production deployment is how the stack behaves as the archive grows. Below: six pool sizes, three retrievers, one evaluation protocol. The Warrant row is the one that stays flat.

P@1 and median rank at 197 K → 3.72 M real distractors

1,000 gold queries, real in-distribution distractors (GitHub-MD corpus, same population as the training questions). Higher P@1 is better; lower median rank is better. Same evaluator, same pipeline, apples-to-apples across all three rows.

Pool size  | Warrant P@1 | BGE-small P@1 | BGE-large P@1 | Warrant median rank | BGE-small median rank | BGE-large median rank
197,578    | 0.3585      | 0.1490        | 0.1600        | 12                  | 1,209                 | 647
500,000    | 0.3545      | 0.1460        | 0.1570        | 12                  | 2,616                 | 1,389
1,000,000  | 0.3490      | 0.1410        | 0.1510        | 12                  | 4,898                 | 2,465
2,000,000  | 0.3455      | 0.1315        | 0.1400        | 13                  | 8,996                 | 4,593
3,000,000  | 0.3380      | 0.1215        | 0.1340        | 14                  | 12,944                | 6,851
3,720,173  | 0.3355      | 0.1175        | 0.1265        | 14                  | 16,197                | 8,671

Relative P@1 decay from 197 K to 3.72 M: Warrant −6.4%, BGE-small −21.1%, BGE-large −20.9%. Median-rank blow-up over the same range: Warrant 1.17×, BGE-small 13.4×, BGE-large 13.4×. BGE-large carries 10× the parameters of Warrant's reranker and does not close the gap — the scaling advantage is from the training objective, not parameter count.

Retrieval plus routing. Hybrid beats either alone.

A public benchmark to anchor the claim. 500 conversational-memory questions, ~120 K-token haystacks, GPT-4o judge with K=5 seeds + 3-of-5 majority vote. Same reader (Gemma-4-26B-A4B, a 26B MoE with 4B active parameters), same decoding, same prompt — only the context-delivery mechanism is swapped. The hybrid row uses oracle qtype routing: routing decisions are made on the dataset's ground-truth question-type label, which isolates the upper bound of what a one-line router can buy on top of stack-only retrieval. Productionising the router via a lightweight learned classifier is a one-day add we have not yet shipped; the expected gap to oracle, given how distinguishable LME-s qtypes are, is 1–3 pp. We disclose the routing assumption rather than overstate the production state.

Configuration                                           | Retrieval                                 | R@5    | Judge accuracy | Δ vs stack-only
Gemma-4-26B-A4B · naked (110 K full-context, 8×A100)    | none                                      | —      | 53.2%          | −10.6 pp
Gemma-4-26B-A4B + Warrant retrieval (stack-only)        | BGE ∪ QNDN ∪ BM25 → RRF → MixK → top-100  | 96.20% | 63.8%          | baseline
Gemma-4-26B-A4B + Warrant Hybrid (oracle-qtype routing) | oracle qtype router → stack or 110K-naked | 96.20% | 70.0%          | +6.2 pp

R@5 = 96.20%: on 481 of 500 questions, the correct ground-truth source fragment appears inside the top-5 chunks the retriever hands to the reader. The retrieval layer is doing its job. The remaining gap breaks down cleanly by question type: for the ~11% of questions where the gold evidence is a single assistant turn, the retriever's chunk-level similarity ranking destroys the dialogue structure the reader needs to use it — so the hybrid router hands those questions directly to the 110 K-context naked reader, which recovers +50 pp on that slice without touching anything else. Receipts: EXPERIMENT_JOURNAL.md, Phase 91.12 (SSA falsification) + Phase 91.13 (hybrid close-out, K=5 multi-seed judge) + Phase 95-K (oracle-qtype disclosure; predicted-qtype variant pending).
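The stack-only configuration fuses its three first-stage rankings with reciprocal-rank fusion before the reranker. A minimal RRF sketch; the constant k = 60 is the commonly used default from the original formulation, not necessarily the stack's setting:

```python
from collections import defaultdict
from typing import List

def rrf(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1/(k + rank(d))."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs only ranks, not comparable scores, which is what makes it a natural way to merge a dense retriever, a latent retriever, and BM25 whose raw scores live on different scales.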

The open-weights leaderboard, in context

Most “top-of-LongMemEval” results on public leaderboards run closed-weights frontier readers (Claude Opus, GPT-5-mini, Gemini-3 Pro) behind multi-step agentic loops. Those aren't open-source deployments — they're API rentals. The open-weights field is much smaller. Here's where Warrant sits.

System                             | Reader                | Active params | QA accuracy
Hindsight + gpt-oss-120B           | gpt-oss-120B (MoE)    | 5.1 B         | 89.0%
Hindsight + gpt-oss-20B            | gpt-oss-20B (MoE)     | 3.6 B         | 83.6%
Warrant Hybrid + Gemma-4           | Gemma-4-26B-A4B (MoE) | 4.0 B         | 70.0%
Warrant Hybrid + Qwen3.6-27B       | Qwen3.6-27B (dense)   | 27 B          | 66.2%
naked gpt-oss-20B (full-context)   | gpt-oss-20B           | 3.6 B         | 39.0%
naked Llama-3.1-70B (full-context) | Llama-3.1-70B         | 70 B          | ~35%

Single-pass pipeline with a one-line qtype router (the published 70.0% number uses oracle qtype labels; the production-equivalent learned classifier is pending and expected to land within 1–3 pp of oracle). No agent loops, no ensembles, no closed APIs, no external calls. ~$0.0004 per query (stack path) at on-demand single-GPU cloud pricing; naked path amortises on whatever full-context serving the customer already runs. The gap to Hindsight is primarily a reader gap, not a retrieval gap — R@5 = 96.2% confirms retrieval is not the bottleneck. Receipt for every row: published paper leaderboards (Hindsight, OpenAI gpt-oss release, Meta Llama-3.1 report); Warrant rows from our own A5.0 stack + A100-hosted naked 110 K runs, judged under the identical LME-s protocol with K=5 GPT-4o seeds + 3-of-5 majority vote (Phase 91.13; oracle-qtype disclosure in Phase 95-K).

New · Apr 2026
The dedicated Reader Leaderboard ships with 9 reader configurations on the same frozen retrieval contract.

Same 500 LongMemEval-S questions, same top-10 chunks, swap the reader. Per-qtype breakdowns, refusal rates, 95% Wilson CIs, K=5 GPT-4o judge receipts. The canonical open-weights Hybrid lands at 70.0%; under the same retrieval contract, the closed gpt-5-mini reference reads at 59.0% — evidence that fixed-evidence reading is its own bottleneck, separable from retrieval.

Open the leaderboard →
Decomposed honestly: retrieval earns +10.6 pp, routing earns another +6.2 pp. Both matter.
At 110 K context the naked reader lands 53.2% end-to-end. Adding the retrieval stack takes it to 63.8% (+10.6 pp) — the stack puts the right evidence in front of the reader on the four categories where cross-session navigation matters (multi-session, temporal-reasoning, single-session-preference, knowledge-update). Adding the qtype router on top — which hands the ~11% of single-session-assistant questions directly to the naked reader — takes it to 70.0% (+6.2 pp over stack-only, 95% CI [+5.63, +6.69]). The router uses oracle qtype labels for the published number; productionising it via a learned classifier is a one-day add expected to land within 1–3 pp. The router exists because retrieval's chunk-level ranking is structurally unable to preserve the dialogue ordering those questions need; once we measured that, routing became cheaper than trying to fix it inside the stack.
Phase 91.13 close-out · K=5 multi-seed judge · paired hybrid-lift 95% CI is the statistical floor on the headline. Routing-mechanism disclosure: Phase 95-K.

The part regulated buyers actually pay for.

In legal, healthcare, defence, and finance, a confidently wrong answer is a liability event. Warrant's refusal behaviour is measured, audited, and deployed as a three-state UX: answered with citations, refused without citations, refused on low confidence. The customer never receives an ungrounded claim presented as fact.

Adversarial query                               | What retrieval returned                    | Cosine | Decision
Novel character: Crime and Punishment / Porfiry | Correct book                               | 0.909  | Answered & cited
First electronic general-purpose computer       | IBM PC (1981) — wrong decade               | 0.858  | Refused
F1 champion — Brazil                            | Schumacher — false-confident wrong         | 0.858  | Refused
Einstein's day job before 1905                  | Hertz — adjacent physicist                 | 0.971  | Refused
Sacagawea's tribal origin                       | Right person, no tribal detail in evidence | —      | Refused

5 of 5 decisions correct. Zero hallucinations. Four canonical RAG failure modes (wrong-topic-but-plausible, false-confident-wrong, confidently-adjacent, on-topic-sub-question-gap) caught at the LLM, not by a cosine threshold — all four adversarial cases scored above 0.85 similarity. At scale (100 Q): ~94/100 correct behaviours, 4 confident hallucinations, 2 premise-acceptance errors. Full audit is reproducible on the customer's pilot corpus during onboarding.

A refusal must look like a refusal.
Warrant returns three visually and programmatically distinct states: answered with cites (hero-card with bold quote, only cited fragments rendered prominently), refused with no cites (warn-coloured card, candidate list suppressed), and refused on low confidence (red-coloured card, reader explicitly not consulted). A pilot user can tell at a glance which state they are in. For regulated buyers this is not a cosmetic detail — it is the product.
One stack, two deployment profiles. Same retrieval, different refusal calibration.
The refusal-audit numbers above (~94 / 100 correct, zero hallucinations) and the LongMemEval-s accuracy (70.0% hybrid / 63.8% stack-only) use the same retrieval pipeline. What differs is the similarity-floor gate: the production profile holds SIM_FLOOR at 0.45 (so out-of-corpus queries get refused rather than guessed), and the benchmark profile holds the reader's own grounded-refusal behaviour without a retrieval-side floor (so LongMemEval's universally in-corpus questions aren't over-gated). Both profiles ship in the container, are per-corpus-tunable during pilot week 3, and are audit-logged distinctly. Regulated customers get the production profile. The benchmark profile exists so the LongMemEval number is comparable to other published numbers.
Measured directly in journal entry 91.11 (H9 falsified — retrieval-side gates contribute < 1 pp on LME; reader-side refusal is the load-bearing component).

Where every qualifier is a hard constraint.

These are not hypothetical. Every segment below is a buyer where at least three of Warrant's four qualifiers (scale, privacy, refusal, commodity hardware) are non-negotiable before a vendor conversation can begin.

Legal

Law firms & in-house counsel

M-scale case files, contracts, discovery archives. Privileged material cannot leave the firm. Hallucination is a malpractice-tier risk. Typical archive: 1–50 M documents.

Healthcare

Hospitals & payers

Clinical-note archives and clinical literature. HIPAA. PHI cannot be sent to external APIs. "Maybe" is medically dangerous — refusal is the correct behaviour when evidence is absent.

Defence & Intelligence

Classified & controlled corpora

Air-gapped deployment is the baseline, not a feature. M-scale intel reports, operational archives. No external API, no shared index, no cloud dependency on any part of the pipeline.

Financial services

Banks with MNPI walls

Research archives, compliance records, trading documentation. Material non-public information boundaries, audit trails, regulators on the other end of every query. Evidence-traceable retrieval is a procurement requirement.

Critical infrastructure

Utilities, nuclear, pharma R&D

Regulated, audited, often air-gapped. Operational archives, compliance documentation, R&D notebooks. Same pattern: rich private corpus, fuzzy queries, zero tolerance for ungrounded claims.

EU enterprises

GDPR & data-residency

Data-residency constraints that make US-cloud managed APIs legally painful or impossible. Self-hosted, EU-jurisdiction deployment with full provenance of every model weight and every piece of evidence used.

Measurable pilot. Weeks, not quarters.

We take a small number of pilots per quarter. Each runs inside a dedicated customer instance against the customer's own corpus and gold queries. If Warrant doesn't beat your current stack on your own numbers, there is nothing to buy.

Week 1

Ingest & encode

Corpus ingest, dense + latent encode, row-aligned index build. Customer provides 50–200 gold QA pairs for evaluation. No data leaves your environment.

Week 2

Head-to-head

Apples-to-apples benchmark against your current retrieval stack (BGE / Elastic / Cohere / in-house) on the customer's own gold set. P@1, P@10, median rank, latency, cost.

Week 3

Refusal calibration

Similarity-floor fit on a customer-specific adversarial suite. Zero-hallucination target on ≥ 100-question audit. SIM_FLOOR and UX states frozen against measured distribution.

Week 4

Hand-off or end

Production-ready container, deployment playbook, monitoring dashboards, on-call runbook. If the numbers don't justify it, we hand you the benchmark results and end the engagement cleanly.

Core · shipping
Warrant Core
  • Dense stage-1 + MixK rerank
  • Single-GPU deployment (V100 / L4 / RTX 4090 / A10)
  • 2–5 M documents
  • Calibrated refusal with standard UX states
  • Container + deployment playbook
Fits a single 16 GB GPU. Covers the majority of regulated mid-size archives.
Union · recommended
Warrant Union
  • Dense stage-1 ∪ QNDN v0 → MixK rerank, K_first = 1000
  • +83% P@1 lift over Core at 3.5 M
  • Single-GPU or 2-GPU deployment (H100 or dual L4)
  • 5–15 M documents
  • All Core features + union-coverage reporting
The production reference stack. Best price/performance at M-scale regulated archives.
Enterprise · custom
Warrant Enterprise
  • Union stack + cross-encoder tier-3 rerank on shortlist
  • Custom encoder fine-tuning on customer corpus
  • Multi-GPU / multi-node deployment
  • 15 M+ documents
  • Dedicated audit suite, on-call engineering, quarterly review
For large regulated archives with specialist vocabulary (legal, biomedical, classified).
Pilot commitment

A four-week engagement on your corpus, your gold set, your hardware. Measurable outcome, no lock-in.

We take one or two pilots per quarter. Preference goes to regulated verticals with a live retrieval stack already in production to compare against.

The questions procurement always asks.

Is any customer data sent to ManifoldMemory?

No. Warrant is deployed as a container inside your environment. Customer data, embeddings, queries, and logs remain on customer infrastructure. There is no telemetry, no phone-home, and no managed component in the inference path. Model weights ship with the container and can be audited offline.

How does Warrant compare to our existing BGE / Elasticsearch / Cohere stack?

Warrant is designed to sit on top of your existing dense retriever or to replace it. The published 3.72 M benchmark above is the head-to-head against BGE-small and BGE-large. For Elasticsearch (lexical) pipelines, Warrant adds semantic recall that BM25 misses; for Cohere-rerank pipelines, Warrant runs inside your perimeter and is measurably cheaper per query. Pilot week 2 reproduces the comparison on your corpus.

What hardware is required to operate Warrant?

Core fits on a single 16 GB GPU (V100, T4, L4, A10, RTX 4090). Union is comfortable on a single H100 or two L4s. Enterprise with cross-encoder rerank is typically 2–4 GPUs depending on QPS. No specialist hardware, no TPU, no custom networking. Warrant has been measured on commodity spot instances at $0.04–$2 per hour.

How does calibrated refusal actually work?

The retrieval stage returns candidates with similarity scores. A SIM_FLOOR is fitted per-corpus during onboarding against an adversarial audit suite (≥ 100 known out-of-corpus / false-premise queries). Below SIM_FLOOR, the reader is not invoked and the UI returns the "refused — low confidence" state. Above SIM_FLOOR but without sufficient citation coverage, the reader is invoked with a refusal-preferred prompt and its refusals are passed through. All three outcome states are programmatically distinguishable and audit-loggable.
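The three-state gate described above can be sketched as follows. The threshold, field names, and reader interface are illustrative; the real SIM_FLOOR is fitted per-corpus during onboarding:

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple

class Outcome(Enum):
    ANSWERED_WITH_CITES = "answered_with_cites"
    REFUSED_NO_CITES = "refused_no_cites"
    REFUSED_LOW_CONFIDENCE = "refused_low_confidence"

def decide(
    top_score: float,                            # best retrieval similarity
    reader: Callable[[str, List[str]], Dict],    # refusal-preferred reader
    query: str,
    fragments: List[str],
    sim_floor: float = 0.45,                     # illustrative; fitted per-corpus
) -> Tuple[Outcome, Optional[Dict]]:
    if top_score < sim_floor:
        # Below the floor the reader is never invoked.
        return Outcome.REFUSED_LOW_CONFIDENCE, None
    answer = reader(query, fragments)
    if answer.get("citations"):
        return Outcome.ANSWERED_WITH_CITES, answer
    # Reader declined or produced no citation coverage: pass the refusal through.
    return Outcome.REFUSED_NO_CITES, None
```

Each outcome maps one-to-one onto the three UX states, so the same enum value can drive both the rendered card and the audit log.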

Can we bring our own reader LLM?

Yes. Warrant is reader-agnostic. The hybrid pipeline has been measured end-to-end with Gemma-4-26B-A4B (70.0% hybrid, 63.8% stack-only) and Qwen3.6-27B (66.2% hybrid, 60.0% stack-only) on LongMemEval-s under identical protocols. gpt-4o-mini as reader is also supported (receipt on request). Any open-weight reader that accepts a grounded-reading prompt works. Reader weights and inference remain on customer infrastructure.

What does the contract look like?

Pilots are a fixed-fee four-week engagement. Production licensing is annual, per-deployment, with tiering by corpus size and update cadence rather than per-query. No share of customer data, no joint-IP claims, no model-weight lock-in. All artefacts are auditable.

What would make Warrant the wrong choice?

Three honest cases. (1) Open-domain web search — we are not a Google competitor. (2) Multimodal corpora (images, video, audio) — Warrant is text-only. (3) < 100 K-document archives — at that scale, our scaling advantage does not yet materialise and a plain BGE deployment is cheaper. We will tell you this in the first call rather than after the pilot.

Run Warrant against your hardest gold queries.

Send us a short description of your corpus size, domain, current retrieval stack, and the top three gold queries that currently fail. We respond within 2–5 business days with a pilot scope, an NDA, and an honest assessment of whether Warrant is the right tool. If it isn't, we'll say so.

contact@manifoldmemory.ai →