◎

Memory Framework Diagnostic

Retrieval recall is not the same as authority resolution.

This diagnostic separates candidate recall from answer-path authority. Hosted Mem0 and Zep SDK paths surfaced the authoritative current record under the same retrieval budget, but stale context still controlled most answers. Lumenais tests the additional layer: whether the current record is allowed to govern the response.

governed answer path

Lumenais governed

30/30

answered from the authoritative current record

mem0ai 2.0.4 · infer=False

Mem0 real SDK

0/30

answered from the authoritative current record

zep-cloud · fair-ingestion gate · graph-search top result

Zep real SDK

1/30

answered from the authoritative current record

n=30 semantic-authority cases · top_k=5 · same corpus, same budget · diagnostic only

This is not a general benchmark of Mem0 or Zep, and it does not claim those systems cannot be configured, prompted, or extended to handle this task. It tests a narrower behavior: whether the disclosed hosted SDK paths, under the same corpus and retrieval budget, resolve stale-vs-current authority conflicts without an added governance layer.

The distinction that matters

High candidate recall did not guarantee the current answer.

This is not a claim that Mem0 or Zep failed to retrieve the right record. In the current real-SDK diagnostic, both surfaced the authoritative current record in 30/30 rows.

The measured gap is downstream. When a current fact and plausible stale context both sit in memory, which one reaches the answer? That is authority resolution under stale-context pressure, and it is the specific behavior this diagnostic isolates.

ArmIn candidatesAnswered currentStale answer

Lumenais governed

governed answer path

30/3030/300/30

Mem0 real SDK

mem0ai 2.0.4 · infer=False

30/300/3030/30

Zep real SDK

zep-cloud · fair-ingestion gate · graph-search top result

30/301/3029/30

In this diagnostic, both hosted SDK paths surfaced the current record in candidates, but the answer path still followed stale context in 59/60 SDK rows. The measured gap is not retrieval recall; it is authority resolution after retrieval.

Calibration

This is a harder diagnostic than the first run.

The current hardened result should not be read as if Mem0 and Zep suddenly became worse. It comes from a different, deliberately harder corpus built to answer a narrower question: when retrieval already includes the current record, can the system resist stale context that looks more recent, more query-similar, or more authority-sounding?

We keep the earlier, more forgiving run visible because it is a useful calibration point: the authority-resolution gap appeared before hardening, then the hardened semantic corpus made the same failure mode sharper and easier to inspect. The two rows below are not apples-to-apples.

DiagnosticBudgetLumenaisMem0Zep

Earlier opaque-token diagnostic

n=30 × 2 · top_k=1060/609/600/60

Less hardened; Mem0 retrieved current 59/60 and selected it sometimes. Corrected Zep retrieved current 60/60 but selected stale 58/60.

Current semantic-authority diagnostic

n=30 · top_k=530/300/301/30

Harder corpus: surface signals point at stale context while the true authority is textually determinable. Zep reports the hosted graph-search/top-result path.

Values report rows that answered from the authoritative current record. The earlier Zep row uses the corrected fair-ingestion artifact, not the superseded 0-candidate run. The current Zep row uses the latest Flash Lite semantic-label run and reports raw graph-search/top-result behavior. Hardening was applied to separate authority resolution from surface retrieval heuristics, not to measure general product superiority.

Fairness control

We tested the Mem0 default-mode caveat directly.

The primary Mem0 comparison uses direct-memory mode because authority resolution is only measurable when the benchmark record survives into the candidate set. That is a real configuration caveat, so rather than leave it as an assertion we also ran Mem0 with its default extraction behavior and report the result.

Default extraction did not close the gap, but the reason matters: it did not preserve the benchmark-mapped current record. That makes default mode a boundary check on extraction/value survival, not a clean authority-resolution contest. Direct-memory mode is reported because it gives Mem0 the current record in candidates, then tests whether the answer is governed by it.

Mem0 modeRaw resultsCurrent in candidatesAnswered currentStale answer

Direct memory

infer=False · record text preserved

5.0030/300/3030/30

Default extraction

normal Mem0 extraction behavior

1.030/300/3023/30

Same n=30 semantic-authority corpus, top_k=5, hosted Mem0 SDK. The direct-memory row is the authority-resolution measurement; the default-extraction row is disclosed as a boundary check because mapped current records did not survive extraction.

Specificity control

The governor did not win by demoting everything.

A useful authority layer has to be sensitive and specific: suppress stale authority under conflict, but preserve valid current memory when no conflict exists. We added a no-conflict control corpus to test the second half.

With live Gemini and GPT-5.5 semantic promotion, Lumenais selected the valid current record 30/30 in the no-conflict control with 0/30 false demotions on both provider paths. That means the conflict-corpus win is not simply aggressive filtering or a Gemini-only artifact.

We also ran the no-conflict control through the hosted Mem0 and Zep SDK paths. Both returned the correct current vendor 30/30 with zero stale answers; in the latest refreshed-key run, Zep selected the authoritative current record 1/30 and otherwise answered from agreeing historical context, so we treat that as answer-level specificity, not record-authority proof.

ConditionCurrent selectedStale answerFalse demotion

Conflict corpus

Gemini and GPT-5.5 both suppressed stale authority

30/300/300/30

No-conflict control

Gemini and GPT-5.5 both kept valid current memory

30/300/300/30

False demotion means the valid current record was present in candidates but the policy declined to select it. The no-conflict corpus also passed cue, structural-trigger, and self-labeling audits.

How it was run

Methods disclosed before interpretation.

The corpus

A semantic-authority fixture: 30 unique cases, each with one authoritative current record, meaningful vendor values, plausible stale context, and scope distractors. The hardened version intentionally makes lexical overlap, recency, and authority-sounding language favor stale or out-of-scope records. It is our corpus — that is a real limitation, named below.

Why harden it

The goal was not to make retrieval fail. The hardening keeps the current record in candidates, then asks whether the answer path follows reviewed authority rather than the loudest surface signal. That is the specific failure mode the benchmark is meant to isolate.

Retrieval & arms

Three arms on identical inputs at top_k=5: Lumenais governed answer path, the real hosted Mem0 SDK direct-memory path, and the real hosted Zep SDK graph-search path. Each record was written to an isolated synthetic user. The latest refreshed-key artifacts were run separately for Mem0 and Zep with Gemini 3.1 Flash Lite semantic labeling.

Scoring rule

Deterministic value match against the authoritative current record. We separately record whether the record appeared in candidates (retrieval), whether it was selected for the answer (resolution), stale substitutions, and false demotions.

Lumenais label path

The latest governed label path used live Gemini 3.1 Flash Lite semantic promotion. Earlier provider checks with Gemini 3.5 Flash and GPT-5.5 also selected the authoritative current record 30/30 on the conflict corpus and 30/30 on the no-conflict control.

Configuration notes

Mem0 direct-memory mode (infer=False) preserves source records; the separate default-extraction control is disclosed above. Zep used the fair-ingestion gate, then the raw graph-search top result was scored. We report high candidate recall for both precisely because we are not claiming they fail to retrieve.

The public interpretation is intentionally narrow: this is a stress test of authority resolution after retrieval has already succeeded. It is not evidence that named frameworks fail at memory generally, and it is not external validation because the corpus and scoring rule are ours.

Reproducibility

Reproducibility and review.

The harness, the frozen result artifacts, the SDK versions, and the adapter tests are kept together for technical review. A reviewer can inspect the scoring rule, re-run the adapters, and verify that retrieval recall and answer selection are recorded separately.

If the Mem0 or Zep teams recommend a materially different configuration, that should be tested directly. The Mem0 default-extraction control and corrected Zep ingestion gate are included here to make that review standard explicit; the current Zep number should be read as the hosted graph-search/top-result path, not a temporal-optimization ceiling.

What this supports

An arbitration layer closed the retrieval-to-answer gap on this diagnostic.

On this corpus, the named frameworks showed perfect candidate recall but usually selected stale context under conflict. The governed path preserved the authoritative current record through the answer step in 30/30 rows with zero stale substitutions; the latest live Lumenais label path used Gemini 3.1 Flash Lite, and earlier provider checks held on Gemini and GPT. The Lumenais no-conflict control also showed 0/30 false demotions on both provider paths.

What this does not prove

This is an internal diagnostic, not independent validation.

The corpus is ours, n=30, with an internal scoring rule shaped around authoritative-value selection. The Lumenais label path was checked across provider paths, but that does not make the corpus external. Mem0 direct-memory mode is not its default, so we also ran a default-extraction control; because it did not preserve the benchmark-mapped current records, that row bounds the caveat rather than proving default-mode authority failure. Zep was scored through hosted graph search with a fair-ingestion gate and raw top-result selection; temporal/currency-aware Zep configurations should be tested separately. This measures one narrow capability — current-authority resolution under seeded conflict — not broad memory superiority or general reasoning.

A same-model two-pass LLM selection pass over these same candidate records also resolves this conflict. The measured gap is that these hosted SDK paths do not run such a pass by default. What Lumenais adds beyond a one-off selection prompt is durable, auditable, reversible governed state and amortized per-read cost — not a claim of superior selection intelligence.

The result that would settle it is the same comparison on a public, third-party, multi-session corpus we did not author. That remains the next proof bar.

The artifacts behind the comparison.

The frozen result artifacts and methods note are retained for technical review, alongside the loop they sit inside.

The governed learning loop Memory pressure benchmark Automatic promotion Whitepaper

Internal diagnostic, not external validation, and not yet an approved public named-competitor superiority claim (the committed summaries keep named_competitor_claim_allowed=false pending deliberate methods/legal review and external-corpus replication). The current semantic-authority corpus is Lumenais-authored, n=30, and deliberately hardened so surface signals such as recency and lexical overlap favor stale or out-of-scope records. The corpus and authority-resolution scoring rule remain ours and are shaped around governed answer-path behavior. Mem0 direct-memory mode used infer=False to preserve source records; the separate Mem0 default-extraction fairness run did not preserve benchmark-mapped current records, so it bounds the extraction caveat rather than proving a default-mode authority-resolution loss. The Zep row used hosted graph search with a fair-ingestion gate and top returned record selection; it should not be read as Zep’s best possible temporal/currency-aware configuration. A same-model two-pass LLM selection pass over the same candidates also resolves this conflict, so the result measures default hosted SDK arbitration behavior and durable governed state rather than superior selection intelligence. The result measures current-authority resolution under seeded stale-context conflict, not broad memory superiority, retrieval quality, or general reasoning.