Case Study · Governed Memory

When Memory Conflicts, Governance Decides.

An internal seeded-authority memory-pressure benchmark testing whether reviewed, current state is enforced when plausible stale context competes for the next answer, now repeated across three 200-case runs.

Governed-state enforcement

600/600

seeded reviewed/current authority state

Provider replications

3×200

Gemini, Opus, and GPT refreshed

LLM memory context exposure

2.00/10

mean answer-path exposure

The everyday version is simple: a team approved one authoritative memory, but older plausible answers are still in the archive. Standard retrieval can surface the old version as if it were still current. Lumenais keeps the archive intact while enforcing which version is allowed to influence the model after authority state has been set.

Chapter I

The Gnosis Gate Simulation

Observe the mechanism in action. Toggle between standard and governed memory states to see how the Bayesian Gnosis Gate enforces reviewed/current state before prompt assembly.

Interactive Gnosis Gate Simulation

Toggle to observe governed-state enforcement under controlled conflict.

Retrieved Context (81 records total) · Showing 6 of 81

AXIOM_014_CURRENT · approved
Project Alpha controlling codeword. Validated by Gnosis synthesis gate.
canonical_status: current · review_status: reviewed

DECOY_014_03 · stale
Prior operational codeword. Superseded 2 weeks ago.
canonical_status: superseded · review_status: unreviewed

DECOY_014_01 · stale
Codeword draft. Rejected during static scan.
canonical_status: ignored · review_status: rejected

DECOY_014_02 · stale
Codeword proposal. Never promoted.
canonical_status: ordinary · review_status: unreviewed

DECOY_014_04 · stale
Codeword alternative. Superseded.
canonical_status: superseded · review_status: unreviewed

DECOY_014_05 · stale
Prior run tag. Outdated.
canonical_status: superseded · review_status: unreviewed

+75 stale context records repeating in unstructured history...
Base LLM Output Log
Execution Assessment: ❌ Wrong Codeword (Stale)
Chapter II

Silent Version Contamination

The primary failure mode in production cognitive architectures isn't that the model fails to remember. It is that the model remembers the wrong version. In regulated and enterprise environments, persistent memory is what transforms an AI from a transactional tool into a context-aware partner. However, it is also the vector most prone to silent degradation.

“Adding more context — the common RAG remedy for memory retrieval gaps — actively worsens supersession conflicts. When old decisions and current facts carry equal mathematical weight, more retrieval simply means more competing noise.”

In production, stale memory is not a harmless retrieval miss. It surfaces as the superseded contract clause, the deprecated API endpoint, the obsolete compliance standard, or the rejected research hypothesis influencing a live answer. Commodity vector databases rank candidates by semantic similarity, recency, or simple metadata tags. Without first-class governance, that kind of memory eventually serves obsolete data to the answer path by design, not by accident.

Lumenais handles this by making authoritative memory explicit. In implementation, that means defining review_status and canonical_status as core schema fields. The Bayesian Gnosis Gate evaluates these parameters *before* prompt assembly, ensuring stale context remains fully auditable in the system logs, but is strictly quarantined from entering the generative workspace.
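As a rough illustration of gating before prompt assembly, the sketch below filters records on the two schema fields named above. It is a minimal sketch under stated assumptions, not the Lumenais implementation: the class and function names (MemoryRecord, is_authoritative, gate, assemble_prompt) are hypothetical, the filter is a plain deterministic check rather than Bayesian arbitration, and it admits only the reviewed/current record, since this page does not describe how a supporting record is chosen.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MemoryRecord:
    # Hypothetical record shape; only the two status fields come from the page.
    record_id: str
    text: str
    review_status: str     # e.g. "reviewed", "unreviewed", "rejected"
    canonical_status: str  # e.g. "current", "superseded", "ignored", "ordinary"


def is_authoritative(record: MemoryRecord) -> bool:
    """True only for records that are both reviewed and current."""
    return record.review_status == "reviewed" and record.canonical_status == "current"


def gate(records: List[MemoryRecord]) -> Tuple[List[MemoryRecord], List[MemoryRecord]]:
    """Split records into (admitted, quarantined). Nothing is deleted."""
    admitted = [r for r in records if is_authoritative(r)]
    quarantined = [r for r in records if not is_authoritative(r)]
    return admitted, quarantined


def assemble_prompt(question: str, admitted: List[MemoryRecord]) -> str:
    """Build the prompt from admitted records only."""
    context = "\n".join(f"[{r.record_id}] {r.text}" for r in admitted)
    return f"Context:\n{context}\n\nQuestion: {question}"


records = [
    MemoryRecord("AXIOM_014_CURRENT", "Project Alpha controlling codeword.", "reviewed", "current"),
    MemoryRecord("DECOY_014_03", "Prior operational codeword.", "unreviewed", "superseded"),
    MemoryRecord("DECOY_014_01", "Codeword draft.", "rejected", "ignored"),
]

admitted, quarantined = gate(records)
prompt = assemble_prompt("What is the current codeword for Project Alpha?", admitted)
```

The quarantined list is returned rather than discarded, which mirrors the point that stale records stay available for audit while being kept out of the prompt.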

This page supports the enforcement half of that loop: given a reviewed/current authority signal, does the system keep stale memories from steering the answer? It does not claim that Lumenais inferred the authority label from raw text better than a same-model selector; authority inference is measured separately in the automatic-promotion diagnostics.

A later memory-framework diagnostic tests the adjacent retrieval-vs-resolution question: whether a memory system can surface the current record in candidates but still allow stale context to shape the answer. This page remains the seeded-authority pressure test; the diagnostic is supporting evidence for the same failure mode.

Chapter III

The Performance Portfolio

A unified look at governed-state enforcement, stale-candidate suppression, and cross-provider replication under controlled conflict.

I. Seeded Authority Enforcement

Frontier LLMs + Lumenais governed state

600 / 600 exact

Same frontier LLMs + retrieval-only context

415 / 600 exact

Gemini Run

Full evaluation run with row-level provider/model guard.

+60 exact

LLM + Lumenais: 200/200
Same LLM + retrieval-only memory: 140/200

Claude Opus 4.7 Run

Full evaluation run; 3 control bridge errors counted as misses.

+60 exact

LLM + Lumenais: 200/200
Same LLM + retrieval-only memory: 140/200

GPT-5.5 Run

Full evaluation run with row-level provider/model guard.

+65 exact

LLM + Lumenais: 200/200
Same LLM + retrieval-only memory: 135/200

Across three distinct n=200 provider runs, the Gnosis Gate recovered the approved canonical decision in all 600 cases (100% exact match) when reviewed/current authority metadata was present. Retrieval-only context recovered 415 and missed 185 when that structured governed state was withheld and all memories were exposed as ordinary context.
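The headline totals follow directly from the three per-run results reported above. The short tally below only re-adds those published counts; the dictionary layout and variable names are illustrative.

```python
# Per-run exact-match counts as reported on this page.
runs = {
    "Gemini": {"cases": 200, "governed": 200, "retrieval_only": 140},
    "Claude Opus 4.7": {"cases": 200, "governed": 200, "retrieval_only": 140},
    "GPT-5.5": {"cases": 200, "governed": 200, "retrieval_only": 135},
}

total_cases = sum(r["cases"] for r in runs.values())
governed_total = sum(r["governed"] for r in runs.values())
baseline_total = sum(r["retrieval_only"] for r in runs.values())
baseline_missed = total_cases - baseline_total

# Per-run gain of the governed path over the retrieval-only baseline.
deltas = {name: r["governed"] - r["retrieval_only"] for name, r in runs.items()}

print(f"governed: {governed_total}/{total_cases}")
print(f"retrieval-only: {baseline_total}/{total_cases} (missed {baseline_missed})")
print(deltas)
```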

II. Stale Candidate Suppression

Gemini Run

80.00% less · 2.00 / 10 shown

Claude Opus 4.7 Run

79.95% less · 2.00 / 10 shown

GPT-5.5 Run

80.00% less · 2.00 / 10 shown

By applying authority gates before prompt compilation, Lumenais exposed the base models to the approved-current record plus one supporting record on average, while suppressing stale competitors from the answer path. This is influence control, not a broad token-savings claim.
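The "% less" figures can be read as a relative reduction in records shown to the model. The sketch below assumes that reading (the page does not spell out the formula) and uses hypothetical function names. Under the same assumption, a reported 79.95% would correspond to a mean of 2.005 records shown; the page does not state the unrounded mean, so that is an inference, not a reported value.

```python
def exposure_reduction(shown: float, baseline: float) -> float:
    """Percent fewer records shown, relative to the baseline count."""
    return (1.0 - shown / baseline) * 100.0


def implied_shown(reduction_pct: float, baseline: float) -> float:
    """Mean records shown that would produce a given percent reduction."""
    return baseline * (1.0 - reduction_pct / 100.0)


# 2.00 records shown out of a 10-record baseline.
reduction = exposure_reduction(2.00, 10.0)
print(f"{reduction:.2f}% less")

# Mean exposure implied by a 79.95% reduction, under the assumed formula.
implied = implied_shown(79.95, 10.0)
print(f"implied mean shown: {implied:.3f} / 10")
```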

III. Multi-Provider Telemetry

Gemini Synthesis

Exact Recall

200/200

Exposed

2.00 / 10

Retrieval-only baseline recovered 140/200 with 42 decoy mentions.

Claude Opus 4.7 Synthesis

Exact Recall

200/200

Exposed

2.00 / 10

Retrieval-only baseline recovered 140/200; bridge errors counted as misses.

GPT-5.5 Synthesis

Exact Recall

200/200

Exposed

2.00 / 10

Retrieval-only baseline recovered 135/200 with 50 decoy mentions.

Prompt-Only Instruction Diagnostic

To test whether ordinary context simply needed stronger instructions, we gave the baseline models explicit directions to prioritize current/reviewed records and use recency as a tie-breaker. Recall improved to 464/600 but still failed on 136 cases. This supports the enforcement result, but it is not a same-information comparison because the governed path reads structured authority state that the prompt-only baseline does not receive.

Recall

464/600

Exposed

10.00 / 10
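To make the information gap concrete, the sketch below contrasts what each path could see. It is illustrative only: the instruction wording, the record dictionaries, and both function names are assumptions, not the prompts or code used in the diagnostic. The prompt-only path gets stronger instructions but only record text; the governed path reads the structured status fields.

```python
from typing import Dict, List

# Hypothetical instruction text; the actual diagnostic wording is not published here.
INSTRUCTION = (
    "Prefer records that are current and reviewed. "
    "If records conflict, use recency as a tie-breaker."
)


def prompt_only_context(records: List[Dict[str, str]]) -> str:
    """Baseline: instruction plus every record's text; status fields are dropped."""
    lines = [f"- {r['text']}" for r in records]
    return INSTRUCTION + "\n\n" + "\n".join(lines)


def governed_context(records: List[Dict[str, str]]) -> str:
    """Governed path: filter on structured status fields before building context."""
    allowed = [
        r for r in records
        if r["review_status"] == "reviewed" and r["canonical_status"] == "current"
    ]
    return "\n".join(f"- {r['text']}" for r in allowed)


records = [
    {"text": "Project Alpha controlling codeword.",
     "review_status": "reviewed", "canonical_status": "current"},
    {"text": "Prior operational codeword.",
     "review_status": "unreviewed", "canonical_status": "superseded"},
]

baseline_prompt = prompt_only_context(records)
governed_prompt = governed_context(records)
```

In this sketch the baseline has to infer authority from prose alone, which is why the comparison is supporting evidence rather than a same-information test.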

Chapter IV

Technical Diligence & Scope

The rigorous mathematical constraints, baseline parameters, and structural boundaries of the evaluation.

Harness Parameters

Paired Adversarial Testing

Each evaluation item generated a fresh synthetic state containing 1 reviewed/current project decision and 80 plausible historical decoy fragments. The governed path received structured authority state; the baselines received ordinary retrieved context. The benchmark asks whether governed state is enforced under pressure without leaking the target codeword.

Total Cases

600

Decoys per Case

80
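A case of this shape (one reviewed/current record plus 80 decoys) could be generated along the following lines. This is a hedged sketch, not the benchmark harness: the function names, the random 8-letter codewords, and the seeding scheme are all assumptions. The decoy status pairs are the ones shown in the simulation panel above.

```python
import random
import string
from typing import Dict, List

# (canonical_status, review_status) pairs seen on the decoys in the simulation panel.
DECOY_STATUSES = [
    ("superseded", "unreviewed"),
    ("ignored", "rejected"),
    ("ordinary", "unreviewed"),
]


def make_case(case_id: int, n_decoys: int = 80, seed: int = 0) -> Dict:
    """Build one synthetic case: 1 reviewed/current record plus n_decoys stale ones."""
    rng = random.Random(seed * 100_000 + case_id)  # deterministic per case
    used = set()

    def fresh_codeword() -> str:
        # Draw until unique so no decoy can collide with the target.
        while True:
            word = "".join(rng.choice(string.ascii_uppercase) for _ in range(8))
            if word not in used:
                used.add(word)
                return word

    target = fresh_codeword()
    records: List[Dict[str, str]] = [{
        "record_id": f"AXIOM_{case_id:03d}_CURRENT",
        "codeword": target,
        "canonical_status": "current",
        "review_status": "reviewed",
    }]
    for i in range(1, n_decoys + 1):
        canonical, review = rng.choice(DECOY_STATUSES)
        records.append({
            "record_id": f"DECOY_{case_id:03d}_{i:02d}",
            "codeword": fresh_codeword(),
            "canonical_status": canonical,
            "review_status": review,
        })
    rng.shuffle(records)  # avoid positional cues about which record is current
    return {"case_id": case_id, "target": target, "records": records}


case = make_case(14)
print(len(case["records"]))
```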

Methodology at a Glance

Task

Recover the reviewed/current codeword for a named project.

Seed

One governed reviewed/current fact plus 80 plausible stale context records per synthetic user.

Scoring

Deterministic exact-substring match against the approved codeword.

Runs

Three complete comparisons (Gemini, Opus, GPT) with row-level provider guards.

Smart Baseline

Added explicit current/reviewed and recency tie-breaker instructions, but not structured governance metadata.

Exposure Bound

The governed path filters active context down to 2 memories shown instead of 10.

Provenance

Scenario generated by an external model, run through the Lumenais harness.

Boundary

Internal supported mechanism benchmark; it does not prove authority inference from raw text or independent external validation.
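The scoring rule above is simple enough to sketch. The version below assumes a case-sensitive substring check and counts a decoy mention whenever a decoy codeword appears in the answer; the page does not specify either detail, and the function names and example codewords are hypothetical.

```python
from typing import List


def exact_match(answer: str, approved_codeword: str) -> bool:
    """Deterministic exact-substring check (assumed case-sensitive)."""
    return approved_codeword in answer


def decoy_mentions(answer: str, decoy_codewords: List[str]) -> List[str]:
    """Decoy codewords that appear anywhere in the answer."""
    return [d for d in decoy_codewords if d in answer]


# Hypothetical codewords, for illustration only.
approved = "HALCYON-7"
decoys = ["EMBER-2", "QUARTZ-9"]

good = "The current codeword for the project is HALCYON-7."
stale = "The codeword is EMBER-2."

print(exact_match(good, approved), decoy_mentions(good, decoys))
print(exact_match(stale, approved), decoy_mentions(stale, decoys))
```

A substring rule keeps scoring reproducible without a judge model, at the cost of crediting an answer that lists the approved codeword alongside a decoy, which is one reason to track decoy mentions separately.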

Schema Definitions & Validation

Retrieval-Only Baseline

A standard RAG architecture where retrieved memories are appended directly as prompt context. The approved fact is present, but the answer path does not receive Lumenais canonical metadata, precedence scoring, or active Bayesian arbitration.

Stale-Fragment Substitution

The primary error mode, in which a base model accepts a superseded context record as current and outputs obsolete parameters.

The "Approved" Schema

Explicit metadata flags mapping to review_status: reviewed and canonical_status: current. In this diagnostic, those flags are seeded by the harness to test enforcement after authority state exists; in production they are set via user validation or trusted ingestion gates.

Biotech & Workflow Context

One illustration is a biomarker panel whose selection criteria change across study iterations (e.g. APOE-3 vs APOE-4). Governed memory quarantines the historical records from the answer path while preserving their audit trail.

What This Benchmark Supports

Approved decisions carry forward safely under context pressure after authority state exists, and structural governance minimizes model exposure to obsolete data without loss of precision. A separate local/on-prem diagnostic exercised the same enforcement path with Ollama generation, hash-chained events, FieldHash-compatible certificate evidence, and a transparency anchor; in a Dilithium-enabled configuration, the checkpoint and certificate used CRYSTALS-Dilithium3 signatures. This validates governed-state enforcement and audit portability under controlled conditions, not open-ended intelligence.

Verified Scope
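For readers unfamiliar with hash-chained events, the general idea is that each event stores the hash of the one before it, so editing any earlier event breaks verification. The sketch below shows that idea only. It assumes SHA-256 over canonical JSON and hypothetical field names; it is not the Lumenais event format, and it omits the certificate, transparency anchor, and Dilithium signatures mentioned above.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first event


def event_hash(prev_hash: str, payload: Dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_event(chain: List[Dict], payload: Dict) -> None:
    """Append an event linked to the current tail of the chain."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev, "payload": payload, "hash": event_hash(prev, payload)})


def verify_chain(chain: List[Dict]) -> bool:
    """Recompute every link; any edited payload or broken link fails."""
    prev = GENESIS
    for event in chain:
        if event["prev"] != prev or event["hash"] != event_hash(prev, event["payload"]):
            return False
        prev = event["hash"]
    return True


log: List[Dict] = []
append_event(log, {"event": "admitted", "record_id": "AXIOM_014_CURRENT"})
append_event(log, {"event": "quarantined", "record_id": "DECOY_014_03"})
print(verify_chain(log))
```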

What This Benchmark Does Not Prove

This does not claim broad reasoning superiority, general memory safety, billing token savings, legal compliance, or superior authority inference from raw text, and it is not independent external validation — the suite is designed and run in-house and has not yet been replicated by a third party. The local audit diagnostic proves explicit governed-state enforcement and artifact verification where configured; it reports whether PQC was required and whether fallback occurred. It does not prove arbitrary prose inference. This benchmark does not measure learning over time or iterative compound growth; it is a single-point enforcement evaluation under controlled metadata pressure.

Explicit Limits

Technical diligence

The benchmark is narrow by design.

The point is not to claim universal intelligence. The point is to show a specific governed-memory behavior under pressure: reviewed state survives, stale context is filtered, and the resulting answer remains inspectable after authority has been established.