Independent retrieval research lab · self-funded

One geometric retrieval engine. Many domains.

We don't do semantic search. We do asymmetric, geometry-based navigation of million-scale corpora.

QNDN is our retrieval substrate — a Hopfield-inspired, asymmetric encoder/decoder whose measured edge lives in the deep tail of very large corpora, exactly where standard dense and lexical search structurally fail. One engine, trained once, generalises across biomedicine, patents, security, proteins, materials, and regulated archives — with zero per-vertical retraining.

Try Materia, live → · How the engine works

Materia is our biomedical discovery engine, running on the full PubMed corpus.

An asymmetric retriever, not a semantic one.

Semantic encoders (BGE, ColBERT) map queries and documents into the same symmetric space. QNDN is deliberately asymmetric: a task-conditioned query head navigates a learned latent manifold toward documents via attractor dynamics — closer to a Hopfield associative memory than to cosine similarity.
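
As a rough illustration of what "attractor dynamics" means in this context, the sketch below runs the standard modern-Hopfield update, in which a query state is repeatedly pulled toward a softmax-weighted mix of stored patterns. It is not QNDN's implementation: the `query_head` projection, the inverse temperature `beta`, the step count, and the toy data are all assumptions made for illustration. The asymmetry is modelled in the simplest possible way, by passing the query through a projection that documents never see.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

def hopfield_retrieve(query, docs, query_head, beta=8.0, steps=3):
    """Rank documents via a few attractor-style updates (illustrative only).

    query:      (d_q,) raw query vector
    docs:       (n, d) stored document patterns, one per row
    query_head: (d, d_q) hypothetical query-only projection (the asymmetric part)
    """
    state = query_head @ query                    # queries get their own map; docs do not
    for _ in range(steps):
        weights = softmax(beta * (docs @ state))  # attention over stored patterns
        state = docs.T @ weights                  # move toward the weighted mix
    scores = docs @ state
    return np.argsort(-scores), scores

# Toy usage: a noisy copy of document 2 is used as the query.
rng = np.random.default_rng(0)
docs = rng.standard_normal((5, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query_head = np.eye(16)                           # identity stands in for a learned head
noisy_query = docs[2] + 0.1 * rng.standard_normal(16)
order, scores = hopfield_retrieve(noisy_query, docs, query_head)
print(order[:3])
```

With a large `beta` the state snaps toward a single stored pattern; with a small one it settles on a blend, which is one way to think about a retriever that trades rank-1 sharpness for broader tail coverage.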

The production stack fuses three complementary first-stage retrievers and reranks over their union. Each retriever has a different bias: a sharp head that lands obvious hits near rank 1, a shallow-tail head that reliably surfaces the non-obvious evidence ranked 800th that a human never scrolls to, and a lexical head for exact terms. Their union, reranked, is worth far more than any one alone.

Stage 1a · dense

BGE-large

Industry-standard self-hosted dense encoder. Sharp-head bias — lands the gold near rank 1 when it finds it.

Stage 1b · latent

QNDN v0 — NDN + QPhead

Our 13M-param asymmetric query head over a 92M-param attractor-trained document encoder. Shallow-tail bias — gold reliably surfaces in the deep top-1000.

Stage 1c · lexical

BM25

Classical exact-term retrieval. Anchors rare identifiers, codes, and names that embeddings smear together.

BGE + QNDN + BM25 → RRF union, deduplicated → MixK 30M-param Perceiver rerank → evidence-bounded reader
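
The fusion step in this pipeline can be sketched with plain reciprocal rank fusion (RRF): each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant `k = 60` is the conventional default from the RRF literature, not a value stated on this page, and the document ids and the `rrf_union` helper are hypothetical. The sketch assumes each input list contains a given id at most once.

```python
from collections import defaultdict

def rrf_union(ranked_lists, k=60, top_n=None):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Ranks start at 1. Accumulating into a dict deduplicates the union:
    a document found by several retrievers gets one entry with a summed score.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n else fused

# Toy first-stage outputs (ids are made up).
bge = ["d1", "d2", "d3"]
qndn = ["d7", "d2", "d9"]
bm25 = ["d2", "d7", "d4"]

candidates = rrf_union([bge, qndn, bm25])
print(candidates)
```

Here `d2` is found by all three retrievers and `d7` by two, so they lead the fused list that would be handed to the reranker.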

Six domains. One substrate.

The same QNDN substrate plus an evidence-first reader generalises from legal text to patents to drug-repurposing atlases with zero per-vertical retraining. Swap the corpus, keep the geometry. Each vertical is named for what it does.

Materia

Live
Biomedical discovery

Grounded drug-repurposing and target discovery. Ask a research question; get a source-backed brief plus candidate molecules and proteins, every claim tied to its evidence.

Corpus: 40.5M PubMed passages + 575K UniProt candidate bridge
Open Materia →

Caveat

Research
Patents & prior art

Prior-art search and novelty / IP critique. Surfaces the non-obvious filings that anticipate a claim, and flags where a stack is crowded versus genuinely open.

Corpus: USPTO full-text

Vestige

Research
Security & threat intel

Exploit-centric retrieval for bug-bounty and CTI. Turns a raw signal into "here's what that exact pattern led to in the wild" — disclosed reports, audit findings, before/after-fix pairs.

Corpus: disclosed reports + smart-contract audits

Kith

Research
Protein discovery

Asymmetric retrieval over protein space: sequence-only → candidate structural cousins, no structure required up front. The same Hopfield geometry, applied to biology.

Corpus: UniProt-scale sequence space

Forge

Research
Materials discovery

Evidence-grounded materials scouting — e.g. thermal-interface and accelerator materials. Pairs the discovery engine with prior-art checks before you ever touch a lab.

Corpus: materials literature + patents

Warrant

Enterprise
Regulated archives

Evidence-first retrieval for corpora you can't send anywhere. Air-gapped, calibrated refusal as a first-class output, one commodity GPU, no external APIs.

Corpus: the customer's private million-scale archive

The moat is in the deep tail — and it's measured.

We don't beat SOTA dense retrievers on small-pool top-1. The measured win is where it matters for discovery and regulated search: scaling decay, complementarity under rerank, and calibrated refusal. Numbers below are from our internal scaling protocol, not a buyer corpus.

Scales where dense retrieval breaks
−6.4% P@1 decay
197K → 3.72M real distractors. Off-the-shelf dense baselines lose 21%; our stack loses 6.4%. Median rank of the gold answer stays at 14 instead of blowing up to 16,197: 13.4× shallower degradation.
Union supra-additivity
+83% P@1 vs best single
Two complementary first-stage retrievers (sharp-head + shallow-tail) reranked over their union lift P@1 by 83% over the best single retriever at 3.52M. Top-1000 union coverage: 65.1% vs 39.5% / 43.9% alone.
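
To make the coverage claim concrete, the sketch below shows how top-K coverage of a union can exceed either retriever alone when their misses fall on different queries. The function names and the four-query toy data are invented for illustration and are not taken from the lab's evaluation code.

```python
def precision_at_1(rankings, gold):
    """Fraction of queries whose top-ranked doc is the gold doc."""
    return sum(r[:1] == [gold[q]] for q, r in rankings.items()) / len(rankings)

def coverage_at_k(rankings, gold, k=1000):
    """Fraction of queries whose gold doc appears anywhere in the top k."""
    return sum(gold[q] in r[:k] for q, r in rankings.items()) / len(rankings)

def union_coverage(a, b, gold, k=1000):
    """Coverage when the top-k lists of two retrievers are pooled per query."""
    return sum(gold[q] in (set(a[q][:k]) | set(b[q][:k])) for q in gold) / len(gold)

# Toy data: each retriever finds the gold doc for a different subset of queries.
gold = {"q1": "d1", "q2": "d2", "q3": "d3", "q4": "d4"}
sharp = {"q1": ["d1", "x"], "q2": ["x", "y"], "q3": ["d3", "x"], "q4": ["x", "y"]}
tail = {"q1": ["x", "y"], "q2": ["y", "d2"], "q3": ["x", "d3"], "q4": ["x", "y"]}

print(coverage_at_k(sharp, gold, k=2), coverage_at_k(tail, gold, k=2))
print(union_coverage(sharp, tail, gold, k=2))
```

Coverage only bounds what the reranker can reach: the P@1 lift still depends on the reranker pulling the gold document to the top of the pooled candidates.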
Calibrated refusal
~94 / 100 correct
Refusal is a first-class primitive, not a prompt trick. Adversarial audit of fuzzy, keyword-poor queries: zero hallucinations, four distinct RAG failure modes caught. R@5 = 96.2% on LongMemEval-S.
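
One simple way to treat refusal as a measured output rather than a prompt instruction is to pick a score cut-off on a held-out calibration set and refuse below it. The page does not describe how the lab calibrates refusal, so the sketch below is an assumption throughout: `pick_threshold`, `answer_or_refuse`, the 0.95 precision target, and the toy scores are all hypothetical.

```python
def pick_threshold(scores, correct, target_precision=0.95):
    """Lowest score cut-off whose answered set meets the target precision.

    scores:  top rerank score per calibration query (assumed distinct)
    correct: whether the top passage actually supported the answer
    Returns None if no cut-off reaches the target.
    """
    pairs = sorted(zip(scores, correct), reverse=True)
    best, hits = None, 0
    for answered, (score, ok) in enumerate(pairs, start=1):
        hits += ok
        if hits / answered >= target_precision:
            best = score   # answering everything at or above this score meets the target
    return best

def answer_or_refuse(top_score, threshold):
    """Refuse whenever no safe threshold exists or the evidence scores below it."""
    if threshold is not None and top_score >= threshold:
        return "answer"
    return "refuse"

# Toy calibration set (values are made up).
cal_scores = [0.9, 0.8, 0.7, 0.4, 0.3]
cal_correct = [True, True, True, False, True]
threshold = pick_threshold(cal_scores, cal_correct)
print(threshold, answer_or_refuse(0.75, threshold), answer_or_refuse(0.35, threshold))
```

The design choice worth noting is that the threshold comes from labelled outcomes, so the refusal rate can be audited the same way as any other retrieval metric.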

We have not beaten SOTA dense retrievers in isolation, and we say so here.

Our single-retriever P@1 at 3.52M is 0.068; BGE-large's is 0.149. The advantage is in scaling decay, complementarity under rerank, and calibrated refusal, not native P@1 on a small pool. The honest comparison is on a buyer's own corpus, not on this page.

Receipts: EXPERIMENT_JOURNAL.md · Phase 86 final 3.52M K_first=1000 matrix · public model leaderboard at /leaderboard.

Journal-first, measured always, small where possible.

The lab's operating mode isn't a publication pipeline; it's a measurement pipeline. Experiments run nightly, results go into a single versioned journal, and the product inherits only what survives ablation.

What the lab does

  • Builds and measures retrieval primitives at up to 7.5M real documents on one- or two-GPU stations.
  • Holds strict apples-to-apples hygiene: matched pools, matched queries, matched evaluators in every head-to-head.
  • Ships production pipelines on the same hardware researchers prototype on — no ops gap between research and deploy.
  • Publishes negative results with the same rigour as positive ones.

What the lab does not do

  • Train foundation models. We build retrieval primitives that sit in front of any reader. The reader is the customer's choice.
  • Chase leaderboards for their own sake. Public benchmarks are receipts, not goals.
  • Serve web-scale open-domain search. That's a $100B R&D moat, and we don't try to compete on its own axes.
  • Ship a managed cloud API in the data path. Everything is deployable on commodity hardware inside your perimeter.

The lab is open to technical collaboration.

Researchers and teams interested in Hopfield-on-natural-language, retrieval scaling, asymmetric protein/biomedical retrieval, or a domain pilot can write to the lab directly. We share reproducible pipelines for any number on this site, typically under an NDA or an academic-collaboration agreement.

research@manifoldmemory.ai →