Smaller tool menu. Supporting diagnostic.
Across Gemini, GPT, and Claude, Lumenais cut the model's tool surface from ~28 to ~5 with no measurable cost to function-call accuracy against full-tool and prompt-smart arms. This is useful context hygiene, not the core enterprise wedge, and it still needs a matched top-k retrieval baseline before being treated as differentiated.
Tool surface reduction: ~83% (vs full-tool arms; matched top-k pending)
Gold-function retention: 240/240 (near-equivalent distractors excluded)
Function-name correctness: 343/360 (baseline 339/360; prompt-smart 341/360)
AST-like correctness: 335/360 (baseline 330/360; prompt-smart 334/360)
False tool calls: 14/360 (baseline 17/360; prompt-smart 16/360)
For an agent, a long tool list is like a crowded control panel. The right button may be present, but every extra option adds room for confusion, review burden, and prompt load. This diagnostic shows the panel can be narrowed before the model acts; it does not yet show that the narrowing beats commodity top-k retrieval.
The outcome: a smaller, more auditable answer path that can reduce prompt load and make agent behavior easier to review. The current evidence supports parity against the full-tool and prompt-smart baselines, not a unique routing moat.
Tool calling fails when the relevant tool is surrounded by plausible neighbors.
Modern agents often expose the base model to a large menu of tools, schemas, and distractors. That improves capability coverage, but it also increases the chance of irrelevant calls, argument confusion, and unnecessary prompt load.
This benchmark tests a narrower question: can a governed routing layer reduce what reaches the model without hiding the function the task actually requires?
Visible tool surface
Full-tool baseline: 27.53 mean tools
Prompt-smart baseline: 27.53 mean tools
Lumenais filtered path: 4.74 mean tools
The measured result is a much smaller tool surface, the required function retained in every target-tool case, and function-call quality held at parity against full-tool and prompt-smart arms. A matched top-k retrieval arm is the next control needed before claiming differentiated efficiency.
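The headline reduction follows directly from the two mean tool counts reported above. The short calculation below reproduces it; the function name is illustrative and not part of any Lumenais API.

```python
# Mean visible tools per request, as reported above.
FULL_TOOL_MEAN = 27.53
FILTERED_MEAN = 4.74


def surface_reduction(before: float, after: float) -> float:
    """Fraction of the visible tool surface removed by filtering."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


reduction = surface_reduction(FULL_TOOL_MEAN, FILTERED_MEAN)
print(f"Tool surface reduction: {reduction:.1%}")  # prints "Tool surface reduction: 82.8%"
```

Rounded to a whole percent, this is the ~83% figure quoted at the top of the page.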
Three provider paths, same control-plane pattern.
The run used BFCL V4 simple, multiple-function, and irrelevance task families across Gemini 3.1 Flash Lite, GPT-5.5, and Claude Opus 4.7. Each arm saw the same task rows; the difference was how much visible tool context was sent to the model.
Function-call quality
AST-like correctness: whether the selected function and structured arguments match the expected call pattern.
Full-tool baseline: 330/360 AST-like
Prompt-smart baseline: 334/360 AST-like
Lumenais filtered path: 335/360 AST-like
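To make "AST-like correctness" concrete, here is a minimal sketch of such a check. It is not the scorer used in this run: the record layout (a `name` plus an `arguments` map, with a list of acceptable values per parameter and `None` marking an optional parameter) is an assumption chosen for illustration.

```python
from typing import Any


def ast_like_match(predicted: dict[str, Any], expected: dict[str, Any]) -> bool:
    """Return True when the function name and every argument match.

    Assumed format: expected["arguments"] maps each parameter to a list of
    acceptable values; None in that list marks the parameter as optional.
    """
    if predicted.get("name") != expected["name"]:
        return False
    pred_args = predicted.get("arguments", {})
    allowed = expected["arguments"]
    # Reject arguments the expected call does not define.
    if any(param not in allowed for param in pred_args):
        return False
    for param, acceptable in allowed.items():
        if param not in pred_args:
            # A missing parameter is acceptable only if it is optional.
            if None not in acceptable:
                return False
        elif pred_args[param] not in acceptable:
            return False
    return True


# Hypothetical example calls.
expected = {"name": "get_weather", "arguments": {"city": ["Paris"], "unit": ["celsius", None]}}
good = {"name": "get_weather", "arguments": {"city": "Paris"}}
wrong_arg = {"name": "get_weather", "arguments": {"city": "London"}}
wrong_name = {"name": "get_forecast", "arguments": {"city": "Paris"}}

print(ast_like_match(good, expected), ast_like_match(wrong_arg, expected))
```

Function-name correctness corresponds to the first comparison alone; the AST-like score additionally requires the arguments to match, which is why it is the stricter of the two numbers.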
False tool calls on irrelevance
Full-tool baseline: 17 false calls
Prompt-smart baseline: 16 false calls
Lumenais filtered path: 14 false calls
Lower is better here. The effect is modest but directionally useful: filtering reduces the surface where irrelevant calls can be selected.
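A false tool call can be counted as any irrelevance row where the model still emitted a call. The sketch below assumes a hypothetical row format with a `family` label and a `predicted_call` that is `None` when the model declined to call a tool; the actual run artifacts may be shaped differently.

```python
def count_false_calls(rows: list[dict]) -> int:
    """Count irrelevance rows where the model called any tool at all."""
    return sum(
        1
        for row in rows
        if row["family"] == "irrelevance" and row["predicted_call"] is not None
    )


# Toy rows for illustration only; these are not data from the run.
rows = [
    {"family": "irrelevance", "predicted_call": None},
    {"family": "irrelevance", "predicted_call": {"name": "get_weather"}},
    {"family": "simple", "predicted_call": {"name": "get_weather"}},
]

false_calls = count_false_calls(rows)
print(false_calls)  # prints 1
```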
Provider replication
The aggregate is not hiding a single-provider result.
Each provider saw the same 120 BFCL-derived rows and the same three arms. Values are shown as full-tool / prompt-smart / Lumenais filtered, except visible tools, which shows full-tool to filtered.
Gemini 3.1 Flash Lite
GPT-5.5
Claude Opus 4.7
The compression result is checked against the raw run artifacts.
Because compression can look good by hiding necessary context, the review checks focus on whether the filtered path kept the required tool, survived a different distractor shuffle, and matched the underlying run records.
Retention is measured under distractor sets that exclude near-equivalent tools, so exact-function scoring stays unambiguous. The result tests clutter reduction without hiding the required function; it does not claim robustness to every possible tool synonym or duplicate schema.
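The retention measurement reduces to a membership test on each target-tool row. A minimal sketch, assuming hypothetical `gold_function` and `visible_tools` fields and exact-name matching:

```python
def retention(rows: list[dict]) -> tuple[int, int]:
    """Return (kept, total) over rows that have a required gold function.

    Irrelevance rows carry no gold function and are skipped.
    """
    target_rows = [r for r in rows if r["gold_function"] is not None]
    kept = sum(r["gold_function"] in r["visible_tools"] for r in target_rows)
    return kept, len(target_rows)


# Toy rows for illustration only; these are not data from the run.
rows = [
    {"gold_function": "get_weather", "visible_tools": ["get_weather", "send_email"]},
    {"gold_function": "send_email", "visible_tools": ["get_weather"]},
    {"gold_function": None, "visible_tools": ["get_weather"]},
]

kept, total = retention(rows)
print(f"{kept}/{total}")  # prints "1/2"
```

Exact-name matching is only unambiguous because near-equivalent tools are excluded from the distractor sets; with synonyms or duplicate schemas in play, a membership test like this would no longer settle whether the right capability stayed visible.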
The close function-calling differences are treated as quality parity, not a statistically significant quality lift. Because the benchmark does not yet include a matched top-k retrieval baseline, the load-bearing result is context-hygiene instrumentation rather than a public superiority claim.
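A rough check shows why the gap is read as parity. The sketch below runs an unpaired two-proportion z-test on the reported AST-like counts (335/360 filtered vs 330/360 full-tool). Because every arm saw the same rows, a paired test such as McNemar's would be the more appropriate choice, but it needs per-row outcomes that are not shown here; treat this as an approximation.

```python
import math


def two_proportion_z(success_a: int, success_b: int, n: int) -> tuple[float, float]:
    """Unpaired two-proportion z-test for two arms of equal size n.

    Returns (z, two-sided p-value) under a normal approximation.
    """
    p_a, p_b = success_a / n, success_b / n
    pooled = (success_a + success_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z, p = two_proportion_z(335, 330, 360)
print(f"z = {z:.2f}, p = {p:.2f}")
```

The statistic comes out at roughly 0.7, well under the conventional 1.96 threshold, which is consistent with calling the difference parity rather than a lift.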
Retention check: 240/240
Every target-tool case in the three-provider aggregate kept the required gold function visible to the model.
Seed replication: 80/80
A second Gemini 3.1 Flash Lite run with a different distractor seed retained every required function and preserved the compression pattern.
Consistency gate: 18 tests
The tool benchmark test suite recomputes the public numbers from the raw run artifacts, including retention and seed-replication checks.
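A consistency gate of this kind can be sketched as a function that recomputes per-arm counts from raw rows and reports any arm whose published figure disagrees. The field names (`arm`, `name_ok`) and the toy data are assumptions for illustration; this is not the actual test suite.

```python
from collections import defaultdict


def recompute_by_arm(rows: list[dict]) -> dict[str, tuple[int, int]]:
    """Recompute (correct, total) function-name counts for each arm."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for row in rows:
        counts[row["arm"]][0] += int(row["name_ok"])
        counts[row["arm"]][1] += 1
    return {arm: (ok, n) for arm, (ok, n) in counts.items()}


def failed_checks(rows: list[dict], published: dict[str, tuple[int, int]]) -> list[str]:
    """List the arms whose published counts do not match the raw rows."""
    actual = recompute_by_arm(rows)
    return sorted(arm for arm, claim in published.items() if actual.get(arm) != claim)


# Toy rows and claims for illustration only; these are not data from the run.
rows = [
    {"arm": "baseline", "name_ok": True},
    {"arm": "baseline", "name_ok": False},
    {"arm": "filtered", "name_ok": True},
    {"arm": "filtered", "name_ok": True},
]
published = {"baseline": (1, 2), "filtered": (2, 2)}

print(failed_checks(rows, published))  # prints []
```

Running a check like this in CI means a published number cannot drift from the artifacts without a test failing.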
Context control is useful infrastructure.
The strong claim is not that symbolic manifolds beat frontier models at generation, or that this filter beats commodity top-k retrieval. The supported claim is narrower: a governed layer can shape and audit the model's operating context before generation begins.
In practice, that can mean fewer irrelevant tools, shorter prompts, and a cleaner decision surface for the same underlying model.
What this does not prove.
This is a BFCL-derived internal distractor-pressure diagnostic, not an official BFCL leaderboard score. The run compares against full-tool and prompt-smart arms, but not against a matched BM25/lexical/embedding top-k retriever. Distractor sets exclude near-equivalent tools to keep exact-function scoring unambiguous, so gold-function retention should not be read as a differentiated selection moat. The result supports context-hygiene instrumentation, not a public claim that Lumenais beats commodity tool retrieval.
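For reference, the missing control could be as simple as the lexical retriever sketched below, with `k` matched to the filtered path's ~5 visible tools. This is a hypothetical baseline, not something that was run: the scoring rule (raw token overlap between the query and each tool's name and description) and the tool record format are assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_tools(query: str, tools: list[dict], k: int = 5) -> list[str]:
    """Rank tools by token overlap with the query and keep the k best.

    Ties keep their original order because Python's sort is stable.
    """
    query_tokens = tokens(query)
    ranked = sorted(
        tools,
        key=lambda t: len(query_tokens & tokens(t["name"] + " " + t["description"])),
        reverse=True,
    )
    return [t["name"] for t in ranked[:k]]


# Hypothetical tool menu for illustration only.
tools = [
    {"name": "send_email", "description": "Send an email message"},
    {"name": "get_weather", "description": "Current weather for a city"},
    {"name": "convert_currency", "description": "Convert an amount between currencies"},
]

print(top_k_tools("What is the weather in Paris?", tools, k=1))  # prints ['get_weather']
```

A matched comparison would run a retriever like this over the same task rows with the same visible-tool budget, then score retention and call quality exactly as for the other arms.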
The result is best read as a supporting product-architecture diagnostic: under tool clutter, Lumenais can compress the answer path while keeping the base model's function-calling performance intact. The commercial wedge remains governed memory and influence control.
Read the full benchmark context.
The benchmark page places this result beside memory pressure, governed-learning controls, and the broader workflow-quality suite.