The Gnosis Gate Simulation
Observe the mechanism in action. Toggle between standard and governed memory states to see how the Bayesian Gnosis Gate enforces reviewed/current state before prompt assembly.
Silent Version Contamination
The primary failure mode in production cognitive architectures isn't that the model fails to remember. It is that the model remembers the wrong version. In regulated and enterprise environments, persistent memory is what transforms an AI from a transactional tool into a context-aware partner. However, it is also the vector most prone to silent degradation.
“Adding more context — the common RAG remedy for memory retrieval gaps — actively worsens supersession conflicts. When old decisions and current facts carry equal mathematical weight, more retrieval simply means more competing noise.”
In production, stale memory is not a harmless retrieval miss. It surfaces as the superseded contract clause, the deprecated API endpoint, the obsolete compliance standard, or the rejected research hypothesis influencing a live answer. Standard commodity vector databases rank candidates strictly by semantic similarity, recency, or simple metadata tags. Without active, first-class governance, standard memory eventually serves obsolete data to the answer path by design, not by accident.
Lumenais handles this by making authoritative memory explicit. In implementation, that means defining review_status and canonical_status as core schema fields. The Bayesian Gnosis Gate evaluates these parameters *before* prompt assembly, ensuring stale context remains fully auditable in the system logs, but is strictly quarantined from entering the generative workspace.
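A minimal sketch of the gating idea described above, assuming a simple record type and a plain boolean filter. Only the field names review_status and canonical_status and the values "reviewed" and "current" come from this page; the class name, function names, the "superseded" value, and the log format are hypothetical. This does not reproduce the Bayesian arbitration or precedence scoring, and it admits only reviewed/current records rather than a supporting record as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryRecord:
    # Hypothetical record shape; only the two status field names are from the page.
    record_id: str
    text: str
    review_status: str      # "reviewed" per the page; other values are assumed
    canonical_status: str   # "current" per the page; other values are assumed


def gate(records):
    """Split records into (admitted, quarantined) before any prompt is built."""
    admitted, quarantined = [], []
    for record in records:
        is_authoritative = (
            record.review_status == "reviewed"
            and record.canonical_status == "current"
        )
        (admitted if is_authoritative else quarantined).append(record)
    return admitted, quarantined


def assemble_prompt(question, records, audit_log):
    """Build the prompt from admitted records only; log everything else."""
    admitted, quarantined = gate(records)
    for record in quarantined:
        # Quarantined records stay auditable but never reach the prompt.
        audit_log.append({
            "record_id": record.record_id,
            "action": "quarantined",
            "review_status": record.review_status,
            "canonical_status": record.canonical_status,
        })
    context = "\n".join(f"- {record.text}" for record in admitted)
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The point of the sketch is ordering: the filter runs before the context string exists, so a stale record cannot reach the model even if it ranks highly on similarity.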
This page covers the enforcement half of the governance loop: given a reviewed/current authority signal, does the system keep stale memories from steering the answer? It does not claim that Lumenais infers the authority label from raw text better than a same-model selector; authority inference is measured separately in the automatic-promotion diagnostics.
A later memory-framework diagnostic tests the adjacent retrieval-vs-resolution question: whether a memory system can surface the current record in candidates but still allow stale context to shape the answer. This page remains the seeded-authority pressure test; the diagnostic is supporting evidence for the same failure mode.
The Performance Portfolio
A unified look at governed-state enforcement, stale-candidate suppression, and cross-provider replication under controlled conflict.
I. Seeded Authority Enforcement
Frontier LLMs + Lumenais governed state: 600 / 600 exact
Same frontier LLMs + retrieval-only context: 415 / 600 exact
Gemini Run
Full evaluation run with row-level provider/model guard.
+60 exact
Claude Opus 4.7 Run
Full evaluation run; 3 control bridge errors counted as misses.
+60 exact
GPT-5.5 Run
Full evaluation run with row-level provider/model guard.
+65 exact
Across three provider runs of n=200 each, the Gnosis Gate recovered the approved canonical decision in all 600 cases (100% exact recall) when reviewed/current authority metadata was present. The retrieval-only baseline, where that structured governed state was withheld and all memories were exposed as ordinary context, missed 185 of 600 cases.
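The headline totals can be checked against the per-provider counts reported in the Multi-Provider Telemetry section. A short arithmetic sketch follows; the dictionary layout and variable names are illustrative only.

```python
# Exact-recall counts per provider run (n=200 each), as reported on this page.
runs = {
    "Gemini": {"governed": 200, "retrieval_only": 140},
    "Claude Opus 4.7": {"governed": 200, "retrieval_only": 140},
    "GPT-5.5": {"governed": 200, "retrieval_only": 135},
}

total_cases = 200 * len(runs)
governed_total = sum(r["governed"] for r in runs.values())
baseline_total = sum(r["retrieval_only"] for r in runs.values())

# Per-run gain of the governed path over the retrieval-only baseline.
gains = {name: r["governed"] - r["retrieval_only"] for name, r in runs.items()}
baseline_missed = total_cases - baseline_total

print(governed_total, baseline_total, baseline_missed)  # 600 415 185
print(gains)  # {'Gemini': 60, 'Claude Opus 4.7': 60, 'GPT-5.5': 65}
```

The per-run gains (+60, +60, +65) sum to the 185 cases the baseline missed, which is about 30.8% of the 600 cases.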
II. Stale Candidate Suppression
Gemini Run
80.00% less · 2.00 / 10 shown
Claude Opus 4.7 Run
79.95% less · 2.00 / 10 shown
GPT-5.5 Run
80.00% less · 2.00 / 10 shown
By applying authority gates before prompt compilation, Lumenais exposed the base models to the approved-current record plus one supporting record on average, while suppressing stale competitors from the answer path. This is influence control, not a broad token-savings claim.
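The suppression percentages appear to follow from the exposure counts, assuming the figure is computed as one minus the mean number of records shown divided by the 10 candidates. That formula is an assumption; the page does not state it. A small sketch:

```python
def suppression_pct(mean_shown, candidates=10):
    """Percentage of candidate records kept out of the answer path."""
    return 100.0 * (1.0 - mean_shown / candidates)


def implied_mean_shown(pct, candidates=10):
    """Invert the formula: mean records shown for a given suppression percentage."""
    return candidates * (1.0 - pct / 100.0)


# 2.00 of 10 candidates shown corresponds to 80% suppressed.
print(round(suppression_pct(2.00), 2))

# Under the same assumed formula, a reported 79.95% implies a mean of about
# 2.005 records shown, slightly above the two-decimal figure on the card.
print(round(implied_mean_shown(79.95), 3))
```

Because the metric counts records rather than tokens, it describes which memories can influence the answer, which is why the page frames it as influence control.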
III. Multi-Provider Telemetry
Gemini Synthesis
Exact Recall: 200/200 · Exposed: 2.00 / 10
Retrieval-only baseline recovered 140/200 with 42 decoy mentions.
Claude Opus 4.7 Synthesis
Exact Recall: 200/200 · Exposed: 2.00 / 10
Retrieval-only baseline recovered 140/200; bridge errors counted as misses.
GPT-5.5 Synthesis
Exact Recall: 200/200 · Exposed: 2.00 / 10
Retrieval-only baseline recovered 135/200 with 50 decoy mentions.
Prompt-Only Instruction Diagnostic
To test whether ordinary context simply needed stronger instructions, we gave the baseline models explicit directions to prioritize current/reviewed records and use recency as a tie-breaker. Recall improved to 464/600 but still failed on 136 cases. This supports the enforcement result, but it is not a same-information comparison because the governed path reads structured authority state that the prompt-only baseline does not receive.
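To make the information gap concrete, here is a minimal sketch of what a prompt-only baseline might assemble. The instruction wording, function name, and numbering format are hypothetical; the page states only that the baseline was told to prioritize current/reviewed records and use recency as a tie-breaker.

```python
# Hypothetical instruction text; the actual wording used in the diagnostic is not given.
INSTRUCTION = (
    "Prioritize records that are current or reviewed. "
    "If records conflict, use recency as a tie-breaker."
)


def build_prompt_only(question, retrieved_texts):
    """Prepend the instruction and expose every retrieved record as plain text."""
    context = "\n".join(
        f"[{index}] {text}" for index, text in enumerate(retrieved_texts, start=1)
    )
    return f"{INSTRUCTION}\n\nContext:\n{context}\n\nQuestion: {question}"
```

In this shape, all retrieved records still reach the model (10.00 / 10 exposed), and the model must infer authority from the prose of each record because no structured status field is passed.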
Recall: 464/600 · Exposed: 10.00 / 10
Technical Diligence & Scope
The rigorous mathematical constraints, baseline parameters, and structural boundaries of the evaluation.
Paired Adversarial Testing
Each evaluation item generated a fresh synthetic state containing 1 reviewed/current project decision and 80 plausible historical decoy fragments. The governed path received structured authority state; the baselines received ordinary retrieved context. The benchmark asks whether governed state is enforced under pressure without leaking the target codeword.
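A rough sketch of how one such synthetic case could be generated and scored. Everything here is illustrative: the codeword format, record wording, decoy status values, and the substring-based exact-match check are assumptions, not the harness used for the reported runs.

```python
import random
import string


def make_codeword(rng, length=8):
    """Random uppercase codeword; the real codeword format is not specified."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def make_case(rng, n_decoys=80):
    """One reviewed/current decision plus n_decoys historical decoys, shuffled."""
    target = make_codeword(rng)
    decoys = set()
    while len(decoys) < n_decoys:
        word = make_codeword(rng)
        if word != target:
            decoys.add(word)
    records = [{
        "text": f"Project decision: codeword {target}",
        "review_status": "reviewed",
        "canonical_status": "current",
    }]
    for word in sorted(decoys):  # sorted so a fixed seed gives a fixed case
        records.append({
            "text": f"Earlier project decision: codeword {word}",
            "review_status": "unreviewed",       # assumed decoy status
            "canonical_status": "superseded",    # assumed decoy status
        })
    rng.shuffle(records)
    return {"target": target, "records": records}


def is_exact(answer, case):
    """Assumed scoring rule: the answer must contain the target codeword."""
    return case["target"] in answer


case = make_case(random.Random(0))
print(len(case["records"]))  # 81
```

A fresh state per item means the target cannot be memorized across cases; the model can only recover it from the records it is shown.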
Total Cases: 600
Decoys per Case: 80
Methodology at a Glance
Schema Definitions & Validation
Retrieval-Only Baseline
A standard RAG architecture where retrieved memories are appended directly as prompt context. The approved fact is present, but the answer path does not receive Lumenais canonical metadata, precedence scoring, or active Bayesian arbitration.
Stale-Fragment Substitution
The primary error mode where a base model accepts a superseded, outdated context record as current, outputting obsolete parameters.
The "Approved" Schema
Explicit metadata flags mapping to review_status: reviewed and canonical_status: current. In this diagnostic, those flags are seeded by the harness to test enforcement after authority state exists; in production they are set via user validation or trusted ingestion gates.
Biotech & Workflow Context
One illustration is a biomarker panel whose selection criteria change across study iterations (e.g. APOE-3 vs APOE-4). Governed memory quarantines the historical records from the answer path while preserving their audit trail.
What This Benchmark Supports
Approved decisions carry forward safely under context pressure once authority state exists, and structural governance reduced model exposure to obsolete data with no loss of exact recall in these runs. A separate local/on-prem diagnostic exercised the same enforcement path with Ollama generation, hash-chained events, FieldHash-compatible certificate evidence, and a transparency anchor; in a Dilithium-enabled configuration, the checkpoint and certificate used CRYSTALS-Dilithium3 signatures. This validates governed-state enforcement and audit portability under controlled conditions, not open-ended intelligence.
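For readers unfamiliar with hash-chained events, the general technique can be sketched as follows. This is a generic illustration using SHA-256 from the Python standard library; the hash function, event layout, and function names are assumptions, and the sketch does not cover FieldHash certificates, the transparency anchor, or Dilithium signatures.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first event


def _digest(prev_hash, payload):
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain, payload):
    """Append an event whose hash commits to everything before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev": prev_hash,
        "payload": payload,
        "hash": _digest(prev_hash, payload),
    })


def verify_chain(chain):
    """Recompute every link; any edited or reordered event breaks verification."""
    prev_hash = GENESIS
    for event in chain:
        if event["prev"] != prev_hash:
            return False
        if _digest(prev_hash, event["payload"]) != event["hash"]:
            return False
        prev_hash = event["hash"]
    return True
```

Because each hash covers the previous one, changing a quarantine decision after the fact would require recomputing every later event, which is what makes the log useful as audit evidence.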
Verified Scope
What This Benchmark Does Not Prove
This does not claim broad reasoning superiority, general memory safety, billing token savings, legal compliance, or superior authority inference from raw text, and it is not independent external validation — the suite is designed and run in-house and has not yet been replicated by a third party. The local audit diagnostic proves explicit governed-state enforcement and artifact verification where configured; it reports whether PQC was required and whether fallback occurred. It does not prove arbitrary prose inference. This benchmark does not measure learning over time or iterative compound growth; it is a single-point enforcement evaluation under controlled metadata pressure.
Explicit Limits
Technical diligence
The benchmark is narrow by design.
The point is not to claim universal intelligence. The point is to show a specific governed-memory behavior under pressure: reviewed state survives, stale context is filtered, and the resulting answer remains inspectable after authority has been established.