feat: add retrieval neighbor enrichment#897
Conversation
Code Review (Agent)Verdict: APPROVE SummaryFeature adds default-off hybrid retrieval neighbor enrichment — attaches bounded same-scope BM25 neighbors to top-level hybrid results without changing result counts. What is Good
Notes
Verdict: APPROVEClean feature, well-scoped, defaults off, good tests. Safe to merge. |
rickthomasjr
left a comment
There was a problem hiding this comment.
Summary
Verdict: APPROVE (with one minor concern)
Andy has implemented B-2 from Issue #445 — retrieval-time neighbor enrichment for hybrid search results. The feature is default-off, well-scoped, and follows the existing patterns in the codebase cleanly.
What This Does
Adds optional same-scope BM25 neighbor enrichment to hybrid retrieval results. When enabled:
- After MMR diversity filtering, for each result the retriever does an additional
bm25Search()using the result entry's text as the query - Filters out primary results, cross-scope entries, expired entries, and noise
- Caps at
maxPerResult(default 2, range 1-5) - Attaches neighbors as a
neighborsarray on each result - Tools render neighbors inline in recall output and debug/explain-rank tools
Key design decisions (matching the B-2 proposal from Issue #445):
- Default-off — opt-in feature
- Same-scope only — respects scope boundaries
- Doesn't change top-level result count — pure enrichment
- BM25 for neighbor search — consistent with B-2 discussion
- Graceful degradation — BM25 failure falls back to primary-only results
Code Quality Assessment
Strengths:
normalizeRetrievalConfig()— proper nested merge pattern forneighborEnrichment. The explicit nesting avoids shallow-spread issues.enrichHybridNeighbors()— clean filter chain (id exclusion → scope check → category → expiry → noise). Easy to follow.- Test coverage is thorough: disabled mode, enabled mode, failure fallback, and config parsing all covered.
sanitizeMemoryForSerialization()properly propagates neighbors through the tool result pipeline.- Plugin config schema (
openclaw.plugin.json) has correct types, defaults, min/max bounds, and advanced UI labels.
Changes look mechanical and consistent — dist/ built from src/, CI manifest updated, all tool rendering paths updated.
One Minor Concern: Per-Result BM25 Latency
Each enriched result triggers an additional bm25Search() call. With maxPerResult=2 and limit=20, that's up to 20 extra BM25 lookups per retrieval. The candidates limit is capped at 25 via Math.min(maxPerResult + primaryIds.size + 4, 25), which is reasonable, but:
- In a busy system with many concurrent retrievals and neighbor enrichment enabled, this could add noticeable latency
- The B-2 issue (Issue #445) mentioned this as an open question but the PR doesn't address batching
This is not a blocker — the feature is default-off, and Andy has correctly gated it behind config.enabled. If latency becomes an issue, a batched approach (single bm25Search() for all results at once) could be added later. Worth a TODO comment or a follow-up issue if it's a real concern.
Rating: NIT — no action required for merge.
Verdict: APPROVE
This is a clean implementation of an already-designed feature (B-2 from Issue #445). It's default-off, well-tested, follows existing patterns, and degrades gracefully. The one latency concern is a future optimization, not a blocking issue.
Ready to merge once Rick gives a thumbs-up.
rwmjhb
left a comment
There was a problem hiding this comment.
Approved. The feature is default-off, same-scope, bounded per primary result, and the main retrieval/tool/config plumbing is covered by regression tests.
Verification checked:
- orchestrator targeted tests passed
- orchestrator full npm test passed
npm run build --if-presentpassed on head0e9452cb735605f9f1079429f23a054fef95f12anode --test test/retriever-neighbor-enrichment.test.mjspassed, 4/4
Non-blocking follow-ups worth considering:
- Neighbor entries are not passed through the same auto-recall governance filters as top-level results. If enrichment is enabled, an eligible primary memory can carry a nested neighbor that would have been excluded as a top-level auto-recall result by confirmed-state, archive/reflection layer, Tier-1 suppression, or USER.md boundary rules.
- Enrichment performs one supplemental BM25 lookup per top-level result; keep an eye on latency/load if this is enabled with larger limits.
- The same neighbor can be attached under multiple primary results because there is no cross-result
seenNeighborIdsset.
rwmjhb
left a comment
There was a problem hiding this comment.
Approved. The current head addresses the earlier auto-recall governance concern by filtering/deduping related neighbor snippets before injection, while keeping the feature default-off and bounded.
Verification checked:
- orchestrator targeted tests passed
- orchestrator full npm test passed
npm run build --if-presentpassed on headc6c9ade030f567756112d7dd4b6e188e4e97ebfcnode --test test/retriever-neighbor-enrichment.test.mjspassed, 4/4node --test test/recall-text-cleanup.test.mjspassed, 23/23
Non-blocking follow-ups worth considering:
- Manual
memory_recallsummary output appliesmaxCharsPerItemto the primary text, then appends neighbor snippets afterward. With neighbor enrichment enabled, per-item output can exceed that budget. - Enrichment performs one supplemental BM25 lookup per top-level result using full entry text as the query; monitor latency/load if this is enabled with larger result limits.
- The neighbor candidate fetch cap can under-deliver
maxPerResulton large primary result sets after excluding primary ids and filtered candidates. maxPerResultis clamped at use time, not in normalized config returned bygetConfig().
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes. The auto-recall neighbor governance fix is good, but the manual memory_recall path still has a blocking boundary leak.
Must fix
Nested neighbors are still rendered and serialized without applying the USER.md-exclusive recall filter.
The top-level memory_recall results are filtered with filterUserMdExclusiveRecallResults(...), but the new neighbor paths are not:
sanitizeMemoryForSerialization()serializesr.neighborsdirectly, includingneighbor.entry.text.- The manual
memory_recalltext renderer appendsr.neighborsdirectly asNeighbors: .... - The retriever enrichment path only knows about id/scope/category/expiry/noise and has no
workspaceBoundarycontext.
So with workspaceBoundary.userMdExclusive.enabled / filterRecall enabled, a safe top-level memory can still carry a USER.md-exclusive profile/name/addressing fact as a nested neighbor. That leaks the hidden memory through both rendered tool text and details.memories[].neighbors.
Please apply the same USER.md-exclusive filtering to nested neighbors before manual recall rendering and serialization, or strip neighbors from any result set after top-level recall filtering removes boundary-sensitive entries. Add a regression test that enables USER.md-exclusive filtering and verifies a hidden neighbor is absent from both rendered output and serialized details.
Also worth fixing
- Manual
memory_recallneighbors also bypass other governance predicates that auto-recall applies to neighbors, such as confirmed state, archive/reflection layers, and Tier-1 suppression. - Auto-recall repeat suppression/history still tracks only top-level ids, not injected neighbor snippets.
- Neighbor enrichment performs one extra BM25 query per top-level result with no visible cost ceiling or operator warning.
rwmjhb
left a comment
There was a problem hiding this comment.
Approved. The latest commit fixes the prior blocker by filtering manual memory_recall neighbors in both rendered output and serialized details. I verified this head independently:
npm run build --if-presentnode --test test/retriever-neighbor-enrichment.test.mjsnode --test test/recall-text-cleanup.test.mjs
Non-blocking follow-ups: neighbor over-fetch can still be limited by MemoryStore.bm25Search's 20-result clamp after primary IDs are filtered, and enabled enrichment performs one supplemental BM25 lookup per top-level result. Both are worth tracking for high-limit/large-store usage.
rwmjhb
left a comment
There was a problem hiding this comment.
This PR is currently blocked by merge conflicts (mergeable=CONFLICTING, merge_state_status=DIRTY).
Please rebase this branch onto the latest base branch, resolve the conflicts, and push the updated branch. I'm deferring any further review until the branch is mergeable again, since conflict resolution can invalidate the current diff and prior findings.
b015b97 to
2e37896
Compare
Thanks, rebased onto current upstream/master, resolved the conflicts, and force-pushed the branch. |
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes. The rebase fixed the conflict and the main recall paths have neighbor filtering, but there is still an unfiltered serialization path for nested neighbors.
Must fix
sanitizeMemoryForSerialization() serializes r.neighbors directly, including id/text/category/scope/importance/score. Manual memory_recall filters neighbors before calling it, and auto-recall filters before rendering, but this serializer is also used by other tool responses that can receive retriever results directly:
resolveMemoryId()ambiguous-match details:details.candidates = sanitizeMemoryForSerialization(results)- retrieval trace/debug details:
details.memories = sanitizeMemoryForSerialization(results) memory_explain_rank:details.results = sanitizeMemoryForSerialization(results)
Those paths do not apply filterManualRecallResultNeighbors() or the auto-recall governance filter. With neighbor enrichment enabled, a top-level eligible result can therefore carry nested neighbors that bypass USER.md-exclusive filtering, state/layer governance, tier suppression, or workspace-boundary filtering in serialized details.
Please centralize neighbor serialization behind the same governance/boundary filter used for manual recall, or make sanitizeMemoryForSerialization() exclude neighbors by default unless the caller explicitly passes already-filtered neighbors. Add regression coverage for at least one non-memory_recall caller, e.g. ambiguous resolveMemoryId() candidates or memory_explain_rank, proving USER.md-exclusive/governance-ineligible neighbors are not present in details.
Also worth addressing
- Neighbor category filtering uses raw
candidate.entry.category !== categoryinstead of the existing smart category matcher, so canonical filters likepreferencescan silently drop legacypreferenceneighbors. - Supplemental neighbor lookups still do one BM25 query per top-level result using full entry text and do not observe the recall abort signal/deadline.
- The BM25 over-fetch slack is still limited by the store-level 20-result clamp before primary IDs are filtered.
rwmjhb
left a comment
There was a problem hiding this comment.
Approved. The latest commit fixes the unfiltered serialized-neighbor issue, including coverage for non-recall serialized tool details. I verified this head independently:
npm run build --if-presentnode --test test/retriever-neighbor-enrichment.test.mjsnode --test test/recall-text-cleanup.test.mjs- build left source/dist files clean
Remaining notes are non-blocking: neighbor category filtering should eventually use the canonical category matcher, high-limit enrichment can under-deliver because store BM25 search is capped at 20, and injected auto-recall neighbors are supplemental context rather than first-class tracked recall IDs.
Summary
Refs #538
Refs #445
Verification