Tool-context compression

Smaller tool menu. Supporting diagnostic.

Across Gemini, GPT, and Claude, Lumenais cut the model's tool surface from ~28 to ~5 with no measurable cost to function-call accuracy against full-tool and prompt-smart arms. This is useful context hygiene, not the core enterprise wedge, and it still needs a matched top-k retrieval baseline before being treated as differentiated.

Tool surface reduction

~83%

vs full-tool arms; matched top-k pending

Gold-function retention

240/240

near-equivalent distractors excluded

Function-name correctness

343/360

baseline 339/360; prompt-smart 341/360

AST-like correctness

335/360

baseline 330/360; prompt-smart 334/360

False tool calls

14/360

baseline 17/360; prompt-smart 16/360

For an agent, a long tool list is like a crowded control panel. The right button may be present, but every extra option adds room for confusion, review burden, and prompt load. This diagnostic shows the panel can be narrowed before the model acts; it does not yet show that the narrowing beats commodity top-k retrieval.

The outcome: a smaller, more auditable answer path that can reduce prompt load and make agent behavior easier to review. The current evidence supports parity against full-context baselines, not a unique routing moat.

Failure mode

Tool calling fails when the relevant tool is surrounded by plausible neighbors.

Modern agents often expose the base model to a large menu of tools, schemas, and distractors. That improves capability coverage, but it also increases the chance of irrelevant calls, argument confusion, and unnecessary prompt load.

This benchmark tests a narrower question: can a governed routing layer reduce what reaches the model without hiding the function the task actually requires?

Visible tool surface

Full-tool baseline

27.53 mean tools

Prompt-smart baseline

27.53 mean tools

Lumenais filtered path

4.74 mean tools

The measured result is a much smaller tool surface, the required function retained in every target-tool case, and function-call quality held at parity against full-tool and prompt-smart arms. A matched top-k retrieval arm is the next control needed before claiming differentiated efficiency.
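The headline reduction figure follows directly from the two mean tool counts reported above. A minimal sketch of that arithmetic (the function and constant names are illustrative, not taken from the benchmark code):

```python
# Mean visible tools per arm, taken from the figures reported above.
FULL_TOOL_MEAN = 27.53
FILTERED_MEAN = 4.74


def surface_reduction(before: float, after: float) -> float:
    """Fraction of the visible tool surface removed by filtering."""
    return (before - after) / before


reduction = surface_reduction(FULL_TOOL_MEAN, FILTERED_MEAN)
print(f"{reduction:.1%}")  # prints 82.8%
```

This rounds to the ~83% shown in the summary.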

Results

Three provider paths, same control-plane pattern.

The run used BFCL V4 simple, multiple-function, and irrelevance task families across Gemini 3.1 Flash Lite, GPT-5.5, and Claude Opus 4.7. Each arm saw the same task rows; the difference was how much visible tool context was sent to the model.

Function-call quality

AST-like correctness: whether the selected function and structured arguments match the expected call pattern.

Full-tool baseline

330/360 AST-like

Prompt-smart baseline

334/360 AST-like

Lumenais filtered path

335/360 AST-like
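To make the AST-like metric concrete, here is a minimal sketch of such a check. It assumes a hypothetical record shape in which the expected call lists the accepted values for each parameter; the benchmark's actual scorer and data format are not shown in this page and may differ.

```python
from typing import Any


def ast_like_match(call: dict[str, Any], expected: dict[str, Any]) -> bool:
    """True when the function name matches and every argument takes an accepted value.

    Both dict shapes are hypothetical, chosen for this sketch:
    call     = {"name": str, "args": {param: value}}
    expected = {"name": str, "args": {param: [accepted values]}}
    """
    if call.get("name") != expected["name"]:
        return False
    args = call.get("args", {})
    accepted = expected["args"]
    # Reject parameters the expected call does not define.
    if set(args) - set(accepted):
        return False
    # Every expected parameter must be present with an accepted value.
    return all(p in args and args[p] in values for p, values in accepted.items())


expected = {"name": "get_weather", "args": {"city": ["Paris"], "unit": ["celsius", "C"]}}
good_call = {"name": "get_weather", "args": {"city": "Paris", "unit": "C"}}
wrong_arg = {"name": "get_weather", "args": {"city": "Paris", "unit": "kelvin"}}
wrong_name = {"name": "get_forecast", "args": {"city": "Paris", "unit": "C"}}

print(ast_like_match(good_call, expected))   # True
print(ast_like_match(wrong_arg, expected))   # False
print(ast_like_match(wrong_name, expected))  # False
```

Under this reading, function-name correctness is the first check alone, which is why it can never score lower than AST-like correctness on the same rows.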

False tool calls on irrelevance

Full-tool baseline

17 false calls

Prompt-smart baseline

16 false calls

Lumenais filtered path

14 false calls

Lower is better here. The effect is modest but directionally useful: filtering reduces the surface where irrelevant calls can be selected.

Provider replication

The aggregate is not hiding a single-provider result.

Each provider saw the same 120 BFCL-derived rows and the same three arms. Values are shown as full-tool / prompt-smart / Lumenais filtered, except visible tools, which shows full-tool to filtered.

Gemini 3.1 Flash Lite

Visible tools: 27.53 → 4.74
Function name: 113 / 114 / 114
AST-like: 111 / 111 / 111
False calls: 5 / 5 / 5

GPT-5.5

Visible tools: 27.53 → 4.74
Function name: 115 / 114 / 116
AST-like: 110 / 112 / 113
False calls: 4 / 5 / 3

Claude Opus 4.7

Visible tools: 27.53 → 4.74
Function name: 111 / 113 / 113
AST-like: 109 / 111 / 111
False calls: 8 / 6 / 6

Audit checks

The compression result is checked against the raw run artifacts.

Because compression can look good by hiding necessary context, the review checks focus on whether the filtered path kept the required tool, survived a different distractor shuffle, and matches the underlying run records.

Retention is measured under distractor sets that exclude near-equivalent tools, so exact-function scoring stays unambiguous. The result tests clutter reduction without hiding the required function; it does not claim robustness to every possible tool synonym or duplicate schema.

The close function-calling differences are treated as quality parity, not a statistically significant quality lift. Because the benchmark does not yet include a matched top-k retrieval baseline, the load-bearing result is context-hygiene instrumentation rather than a public superiority claim.
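One rough way to see why the differences are read as parity is a pooled two-proportion z-test on the aggregate AST-like counts. This is a sketch only: it treats the arms as independent samples, whereas the arms share the same task rows, so a paired test such as McNemar's on per-row outcomes would be the more appropriate check once row-level data is in hand.

```python
import math


def two_proportion_z(success_a: int, success_b: int, n: int) -> tuple[float, float]:
    """Pooled two-proportion z-test for two arms with the same sample size n.

    Returns (z, two-sided p-value). Assumes independent samples, which is
    a simplification for arms that were scored on identical rows.
    """
    p_a, p_b = success_a / n, success_b / n
    pooled = (success_a + success_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF, expressed via erf.
    p_value = 1 - math.erf(abs(z) / math.sqrt(2))
    return z, p_value


# Aggregate AST-like correctness: filtered path vs full-tool baseline.
z, p = two_proportion_z(335, 330, 360)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these counts the p-value is far above 0.05, consistent with treating the gap as noise rather than a quality lift.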

Retention check

240/240

Every target-tool case in the three-provider aggregate kept the required gold function visible to the model.
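A retention check of this kind can be sketched in a few lines. The row shape below (a `gold` function name, or `None` for irrelevance rows, plus the `visible` tool names after filtering) is a hypothetical stand-in for the real run records.

```python
from typing import Optional, TypedDict


class Row(TypedDict):
    gold: Optional[str]   # required function, or None for irrelevance rows
    visible: list[str]    # tool names that reached the model after filtering


def gold_retention(rows: list[Row]) -> tuple[int, int]:
    """Return (rows where the gold function stayed visible, target-tool rows)."""
    target_rows = [r for r in rows if r["gold"] is not None]
    kept = sum(1 for r in target_rows if r["gold"] in r["visible"])
    return kept, len(target_rows)


# Illustrative rows only; not benchmark data.
rows: list[Row] = [
    {"gold": "get_weather", "visible": ["get_weather", "get_time"]},
    {"gold": "send_email", "visible": ["send_email", "send_sms", "get_time"]},
    {"gold": None, "visible": ["get_time", "convert_currency"]},
]
kept, total = gold_retention(rows)
print(f"{kept}/{total}")  # prints 2/2
```

Irrelevance rows have no gold function, so they are excluded from the denominator.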

Seed replication

80/80

A second Gemini 3.1 Flash Lite run with a different distractor seed retained every required function and preserved the compression pattern.

Consistency gate

18 tests

The tool benchmark test suite recomputes the public numbers from the raw run artifacts, including retention and seed-replication checks.
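The actual suite works from raw run artifacts, which are not reproduced on this page. As a lighter-weight illustration of the same idea, the sketch below checks that the per-provider figures listed under provider replication sum to the published aggregates. Arm order is full-tool, prompt-smart, Lumenais filtered; the variable names are illustrative.

```python
# Per-provider counts as (full-tool, prompt-smart, filtered), copied from this page.
PER_PROVIDER = {
    "Gemini 3.1 Flash Lite": {
        "function_name": (113, 114, 114),
        "ast_like": (111, 111, 111),
        "false_calls": (5, 5, 5),
    },
    "GPT-5.5": {
        "function_name": (115, 114, 116),
        "ast_like": (110, 112, 113),
        "false_calls": (4, 5, 3),
    },
    "Claude Opus 4.7": {
        "function_name": (111, 113, 113),
        "ast_like": (109, 111, 111),
        "false_calls": (8, 6, 6),
    },
}

# Headline aggregates as published in the summary.
PUBLISHED = {
    "function_name": (339, 341, 343),
    "ast_like": (330, 334, 335),
    "false_calls": (17, 16, 14),
}


def aggregate(per_provider: dict) -> dict:
    """Sum each metric arm-by-arm across providers."""
    totals: dict = {}
    for metrics in per_provider.values():
        for metric, arms in metrics.items():
            running = totals.get(metric, (0, 0, 0))
            totals[metric] = tuple(a + b for a, b in zip(running, arms))
    return totals


recomputed = aggregate(PER_PROVIDER)
assert recomputed == PUBLISHED, "per-provider rows do not match the headline numbers"
print("aggregates match")
```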

What this supports

Context control is useful infrastructure.

The claim is not that symbolic manifolds beat frontier models at generation, nor that this filter beats commodity top-k retrieval. The supported claim is narrower: a governed layer can shape and audit the model's operating context before generation begins.

In practice, that can mean fewer irrelevant tools, shorter prompts, and a cleaner decision surface for the same underlying model.

Caveats

What this does not prove.

This is a BFCL-derived internal distractor-pressure diagnostic, not an official BFCL leaderboard score. The run compares against full-tool and prompt-smart arms, but not against a matched BM25/lexical/embedding top-k retriever. Distractor sets exclude near-equivalent tools to keep exact-function scoring unambiguous, so gold-function retention should not be read as a differentiated selection moat. The result supports context-hygiene instrumentation, not a public claim that Lumenais beats commodity tool retrieval.
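For readers unfamiliar with the missing control, a matched top-k arm could be as simple as the lexical retriever sketched below: score each tool by token overlap with the query and keep the best k, with k set near the filtered path's roughly five visible tools. This is an assumed illustration of a commodity baseline, not the Lumenais filter and not an arm that was run; the tool records and function names are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; underscores in tool names act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_tools(query: str, tools: list[dict], k: int = 5) -> list[dict]:
    """Keep the k tools whose name and description share the most tokens with the query."""
    q = tokens(query)

    def overlap(tool: dict) -> int:
        return len(q & tokens(tool["name"] + " " + tool["description"]))

    # sorted() is stable, so ties keep their original order.
    return sorted(tools, key=overlap, reverse=True)[:k]


# Hypothetical tool menu for illustration.
tools = [
    {"name": "send_email", "description": "Send an email message"},
    {"name": "get_weather", "description": "Get the weather forecast for a city"},
    {"name": "convert_currency", "description": "Convert an amount between currencies"},
]
selected = top_k_tools("What is the weather forecast in Paris?", tools, k=1)
print([t["name"] for t in selected])  # ['get_weather']
```

Running the same rows through an arm like this, at the same k, is what would show whether the governed filter does anything a plain retriever does not.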

The result is best read as a supporting product-architecture diagnostic: under tool clutter, Lumenais can compress the answer path while keeping the base model's function-calling performance intact. The commercial wedge remains governed memory and influence control.

Read the full benchmark context.

The benchmark page places this result beside memory pressure, governed-learning controls, and the broader workflow-quality suite.