feat: add retrieval neighbor enrichment #897

Merged
rwmjhb merged 5 commits into CortexReach:master from TurboTheTurtle:codex/issue-538-neighbor-enrichment-v2
Jun 17, 2026

@TurboTheTurtle

Contributor

Summary

  • add default-off hybrid retrieval neighbor enrichment using same-scope BM25 lookups
  • attach bounded neighbors as supplemental recall context without changing top-level result counts
  • add config schema/UI fields and focused regression coverage

Refs #538
Refs #445

Verification

  • node --test test/retriever-neighbor-enrichment.test.mjs
  • node test/retriever-rerank-regression.mjs
  • node scripts/verify-ci-test-manifest.mjs
  • node test/plugin-manifest-regression.mjs
  • npm run build
  • npm run test:core-regression
  • npm run test:packaging-and-workflow

@rickthomasjr

Code Review (Agent)

Verdict: APPROVE

Summary

Feature adds default-off hybrid retrieval neighbor enrichment — attaches bounded same-scope BM25 neighbors to top-level hybrid results without changing result counts.

What is Good

  • Default-off design: Neighbor enrichment is opt-in (enabled: false), zero impact on existing users. Low risk.
  • Scope-aware filtering: Neighbors are restricted to the same scope and (when provided) same category as the primary result. No cross-contamination.
  • Graceful degradation: BM25 neighbor lookup failure falls through with a warning, not an exception. Primary results still return.
  • Config normalization: normalizeRetrievalConfig() properly deep-merges neighborEnrichment so partial configs do not lose other settings.
  • Test coverage: 183-line test file covers disabled mode, enabled mode with neighbors, BM25 fallback failure, and partial config parsing. All CI checks pass.
  • Bounding: maxPerResult clamped to 1-5. Supplemental BM25 fetch capped at 25. Good defense against runaway queries.
  • Serialization: sanitizeMemoryForSerialization() and tool output render properly handle the optional neighbors field.

Notes

  1. Performance: Each primary result triggers a supplemental BM25 lookup. The Math.min(maxPerResult + primaryIds.size + 4, 25) cap is sensible. BM25 is fast, but worth monitoring if retrieval latency is already tight.
  2. Context budget: Neighbor context is appended to contentText before sanitizeForContext().slice(0, effectivePerItemMaxChars). Total context window budget (primary + neighbors) is still capped per item — good.
  3. Test mocking: The test uses a mock store. Fine for unit-level logic testing; consider a lightweight integration test with a real in-memory index for the enrichment path.
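The bounds mentioned in the notes above can be written out as two small helpers. This is a minimal sketch, not the PR's code: `clampMaxPerResult` and `neighborCandidateLimit` are hypothetical names, and only the 1-5 range, the default of 2, and the `Math.min(maxPerResult + primaryIds.size + 4, 25)` expression come from the review.

```javascript
// Hypothetical helpers illustrating the bounds quoted in the review.

// Clamp the per-result neighbor count to the 1-5 range (default 2).
function clampMaxPerResult(value, fallback = 2) {
  const n = Number.isFinite(value) ? Math.floor(value) : fallback;
  return Math.min(5, Math.max(1, n));
}

// Candidate fetch size for one supplemental BM25 lookup:
// enough slack to skip primaries, but never more than 25.
function neighborCandidateLimit(maxPerResult, primaryCount) {
  return Math.min(maxPerResult + primaryCount + 4, 25);
}
```

With 5 primary results and `maxPerResult` of 2 the lookup asks for 11 candidates; from 19 primaries upward the cap of 25 takes over.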

Verdict: APPROVE

Clean feature, well-scoped, defaults off, good tests. Safe to merge.

@rickthomasjr rickthomasjr left a comment

Summary

Verdict: APPROVE (with one minor concern)

Andy has implemented B-2 from Issue #445 — retrieval-time neighbor enrichment for hybrid search results. The feature is default-off, well-scoped, and follows the existing patterns in the codebase cleanly.


What This Does

Adds optional same-scope BM25 neighbor enrichment to hybrid retrieval results. When enabled:

  1. After MMR diversity filtering, for each result the retriever does an additional bm25Search() using the result entry's text as the query
  2. Filters out primary results, cross-scope entries, expired entries, and noise
  3. Caps at maxPerResult (default 2, range 1-5)
  4. Attaches neighbors as a neighbors array on each result
  5. Tools render neighbors inline in recall output and debug/explain-rank tools
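The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `enrichHybridNeighbors()`: the result shape (`entry.id`, `entry.text`, `entry.scope`, `entry.category`, `entry.expiresAt`), the `store.bm25Search(query, limit)` signature, and the `isNoise` callback are all hypothetical.

```javascript
// Sketch only: field names and the bm25Search(query, limit) signature are
// assumptions for illustration, not the plugin's actual API.
async function enrichWithNeighbors(results, store, options = {}) {
  const {
    enabled = false,          // default-off, as in the PR
    maxPerResult = 2,
    category,                 // optional category filter
    isNoise = () => false,    // hypothetical noise predicate
    now = Date.now(),
  } = options;
  if (!enabled || results.length === 0) return results;

  const perResult = Math.min(5, Math.max(1, maxPerResult));
  const primaryIds = new Set(results.map((r) => r.entry.id));
  const limit = Math.min(perResult + primaryIds.size + 4, 25);

  const enriched = [];
  for (const result of results) {
    let candidates = [];
    try {
      // One supplemental lookup per primary result, using its text as query.
      candidates = await store.bm25Search(result.entry.text, limit);
    } catch (err) {
      // Graceful degradation: warn and keep the primary result as-is.
      console.warn(`neighbor lookup failed: ${err.message}`);
    }
    const neighbors = candidates
      .filter((c) => !primaryIds.has(c.entry.id))             // id exclusion
      .filter((c) => c.entry.scope === result.entry.scope)    // same scope
      .filter((c) => !category || c.entry.category === category)
      .filter((c) => !(c.entry.expiresAt && c.entry.expiresAt <= now))
      .filter((c) => !isNoise(c.entry.text))
      .slice(0, perResult);
    enriched.push(neighbors.length > 0 ? { ...result, neighbors } : result);
  }
  return enriched;
}
```

The top-level array keeps its length in every branch, which is the "pure enrichment" property the review describes.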

Key design decisions (matching the B-2 proposal from Issue #445):

  • Default-off — opt-in feature
  • Same-scope only — respects scope boundaries
  • Doesn't change top-level result count — pure enrichment
  • BM25 for neighbor search — consistent with B-2 discussion
  • Graceful degradation — BM25 failure falls back to primary-only results

Code Quality Assessment

Strengths:

  • normalizeRetrievalConfig() — proper nested merge pattern for neighborEnrichment. The explicit nesting avoids shallow-spread issues.
  • enrichHybridNeighbors() — clean filter chain (id exclusion → scope check → category → expiry → noise). Easy to follow.
  • Test coverage is thorough: disabled mode, enabled mode, failure fallback, and config parsing all covered.
  • sanitizeMemoryForSerialization() properly propagates neighbors through the tool result pipeline.
  • Plugin config schema (openclaw.plugin.json) has correct types, defaults, min/max bounds, and advanced UI labels.
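To make the shallow-spread point concrete, here is a minimal sketch of a nested merge for `neighborEnrichment`. The defaults object and the `mode` field are hypothetical stand-ins; only `enabled: false` and the `maxPerResult` default of 2 are taken from the review.

```javascript
// Hypothetical defaults; only the neighborEnrichment values come from the PR.
const DEFAULT_RETRIEVAL_CONFIG = {
  mode: "hybrid",
  neighborEnrichment: { enabled: false, maxPerResult: 2 },
};

// Sketch of a nested merge: top-level keys are spread first, then the
// neighborEnrichment object is merged key by key so that a partial
// override such as { enabled: true } keeps the default maxPerResult.
function normalizeRetrievalConfigSketch(partial = {}) {
  return {
    ...DEFAULT_RETRIEVAL_CONFIG,
    ...partial,
    neighborEnrichment: {
      ...DEFAULT_RETRIEVAL_CONFIG.neighborEnrichment,
      ...(partial.neighborEnrichment ?? {}),
    },
  };
}
```

A plain `{ ...defaults, ...partial }` would replace the whole nested object, so `{ neighborEnrichment: { enabled: true } }` would silently drop `maxPerResult`.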

Changes look mechanical and consistent: dist/ built from src/, CI manifest updated, all tool rendering paths updated.


One Minor Concern: Per-Result BM25 Latency

Each enriched result triggers an additional bm25Search() call. With maxPerResult=2 and limit=20, that's up to 20 extra BM25 lookups per retrieval. The candidates limit is capped at 25 via Math.min(maxPerResult + primaryIds.size + 4, 25), which is reasonable, but:

  • In a busy system with many concurrent retrievals and neighbor enrichment enabled, this could add noticeable latency
  • The B-2 issue (Issue #445) mentioned this as an open question but the PR doesn't address batching

This is not a blocker — the feature is default-off, and Andy has correctly gated it behind config.enabled. If latency becomes an issue, a batched approach (single bm25Search() for all results at once) could be added later. Worth a TODO comment or a follow-up issue if it's a real concern.

Rating: NIT — no action required for merge.


Verdict: APPROVE

This is a clean implementation of an already-designed feature (B-2 from Issue #445). It's default-off, well-tested, follows existing patterns, and degrades gracefully. The one latency concern is a future optimization, not a blocking issue.

Ready to merge once Rick gives a thumbs-up.

@rwmjhb rwmjhb left a comment

Collaborator

Approved. The feature is default-off, same-scope, bounded per primary result, and the main retrieval/tool/config plumbing is covered by regression tests.

Verification checked:

  • orchestrator targeted tests passed
  • orchestrator full npm test passed
  • npm run build --if-present passed on head 0e9452cb735605f9f1079429f23a054fef95f12a
  • node --test test/retriever-neighbor-enrichment.test.mjs passed, 4/4

Non-blocking follow-ups worth considering:

  • Neighbor entries are not passed through the same auto-recall governance filters as top-level results. If enrichment is enabled, an eligible primary memory can carry a nested neighbor that would have been excluded as a top-level auto-recall result by confirmed-state, archive/reflection layer, Tier-1 suppression, or USER.md boundary rules.
  • Enrichment performs one supplemental BM25 lookup per top-level result; keep an eye on latency/load if this is enabled with larger limits.
  • The same neighbor can be attached under multiple primary results because there is no cross-result seenNeighborIds set.
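One possible shape for the cross-result `seenNeighborIds` idea in the last bullet, as a sketch only. The function name and the `neighbors[].entry.id` shape are assumptions; the PR does not include this helper.

```javascript
// Hypothetical post-processing step: keep each neighbor id only under the
// first primary result that carries it.
function dedupeNeighborsAcrossResults(results) {
  const seenNeighborIds = new Set();
  return results.map((result) => {
    if (!Array.isArray(result.neighbors)) return result;
    const neighbors = result.neighbors.filter((n) => {
      if (seenNeighborIds.has(n.entry.id)) return false;
      seenNeighborIds.add(n.entry.id);
      return true;
    });
    return { ...result, neighbors };
  });
}
```

First-come ordering favors higher-ranked primaries, since results are processed in rank order.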

@rwmjhb rwmjhb left a comment

Collaborator

Approved. The current head addresses the earlier auto-recall governance concern by filtering/deduping related neighbor snippets before injection, while keeping the feature default-off and bounded.

Verification checked:

  • orchestrator targeted tests passed
  • orchestrator full npm test passed
  • npm run build --if-present passed on head c6c9ade030f567756112d7dd4b6e188e4e97ebfc
  • node --test test/retriever-neighbor-enrichment.test.mjs passed, 4/4
  • node --test test/recall-text-cleanup.test.mjs passed, 23/23

Non-blocking follow-ups worth considering:

  • Manual memory_recall summary output applies maxCharsPerItem to the primary text, then appends neighbor snippets afterward. With neighbor enrichment enabled, per-item output can exceed that budget.
  • Enrichment performs one supplemental BM25 lookup per top-level result using full entry text as the query; monitor latency/load if this is enabled with larger result limits.
  • The neighbor candidate fetch cap can under-deliver maxPerResult on large primary result sets after excluding primary ids and filtered candidates.
  • maxPerResult is clamped at use time, not in normalized config returned by getConfig().
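For the first follow-up, one way to keep manual recall within budget is to truncate after the neighbor snippets are appended rather than before. A minimal sketch, assuming a hypothetical `renderRecallItem` helper and a `" | "` separator; only the `Neighbors: ...` label and `maxCharsPerItem` appear in the review.

```javascript
// Hypothetical renderer: build the full item text (primary plus neighbor
// snippets) first, then apply the per-item character budget once.
function renderRecallItem(primaryText, neighborSnippets, maxCharsPerItem) {
  const parts = [primaryText];
  if (neighborSnippets.length > 0) {
    parts.push(`Neighbors: ${neighborSnippets.join(" | ")}`);
  }
  return parts.join("\n").slice(0, maxCharsPerItem);
}
```

The trade-off is that a long primary text can crowd out its neighbors entirely; reserving a fixed share of the budget for neighbors would be an alternative.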

@rwmjhb rwmjhb left a comment

Collaborator

Requesting changes. The auto-recall neighbor governance fix is good, but the manual memory_recall path still has a blocking boundary leak.

Must fix

Nested neighbors are still rendered and serialized without applying the USER.md-exclusive recall filter.

The top-level memory_recall results are filtered with filterUserMdExclusiveRecallResults(...), but the new neighbor paths are not:

  • sanitizeMemoryForSerialization() serializes r.neighbors directly, including neighbor.entry.text.
  • The manual memory_recall text renderer appends r.neighbors directly as Neighbors: ....
  • The retriever enrichment path only knows about id/scope/category/expiry/noise and has no workspaceBoundary context.

So with workspaceBoundary.userMdExclusive.enabled / filterRecall enabled, a safe top-level memory can still carry a USER.md-exclusive profile/name/addressing fact as a nested neighbor. That leaks the hidden memory through both rendered tool text and details.memories[].neighbors.

Please apply the same USER.md-exclusive filtering to nested neighbors before manual recall rendering and serialization, or strip neighbors from any result set after top-level recall filtering removes boundary-sensitive entries. Add a regression test that enables USER.md-exclusive filtering and verifies a hidden neighbor is absent from both rendered output and serialized details.
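A sketch of what the requested fix could look like: run the same boundary predicate over nested neighbors that is applied to top-level results, before anything is rendered or serialized. `filterResultNeighbors` and the `isAllowed` callback are hypothetical; `isAllowed` stands in for whatever check `filterUserMdExclusiveRecallResults(...)` performs, which this thread does not show.

```javascript
// Hypothetical helper: apply one predicate to top-level entries and to
// their nested neighbors, dropping the neighbors key when none survive.
function filterResultNeighbors(results, isAllowed) {
  return results
    .filter((r) => isAllowed(r.entry))
    .map((r) => {
      if (!Array.isArray(r.neighbors)) return r;
      const { neighbors: original, ...rest } = r;
      const neighbors = original.filter((n) => isAllowed(n.entry));
      return neighbors.length > 0 ? { ...rest, neighbors } : rest;
    });
}
```

A regression check along the lines requested would then assert that the hidden neighbor's text appears in neither the rendered string nor the JSON-serialized details.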

Also worth fixing

  • Manual memory_recall neighbors also bypass other governance predicates that auto-recall applies to neighbors, such as confirmed state, archive/reflection layers, and Tier-1 suppression.
  • Auto-recall repeat suppression/history still tracks only top-level ids, not injected neighbor snippets.
  • Neighbor enrichment performs one extra BM25 query per top-level result with no visible cost ceiling or operator warning.

@rwmjhb rwmjhb left a comment

Collaborator

Approved. The latest commit fixes the prior blocker by filtering manual memory_recall neighbors in both rendered output and serialized details. I verified this head independently:

  • npm run build --if-present
  • node --test test/retriever-neighbor-enrichment.test.mjs
  • node --test test/recall-text-cleanup.test.mjs

Non-blocking follow-ups: neighbor over-fetch can still be limited by MemoryStore.bm25Search's 20-result clamp after primary IDs are filtered, and enabled enrichment performs one supplemental BM25 lookup per top-level result. Both are worth tracking for high-limit/large-store usage.
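The over-fetch limitation can be checked with a little arithmetic. The sketch below is a hypothetical worst-case estimate, assuming every primary result also ranks inside the fetched BM25 candidates; the 25 cap and the 20-result store clamp are the values quoted in this thread.

```javascript
// Hypothetical worst-case estimate of neighbors available to one primary
// result when all primaries occupy slots in the fetched candidate list.
function worstCaseNeighborCount(maxPerResult, primaryCount, storeClamp = 20) {
  const requested = Math.min(maxPerResult + primaryCount + 4, 25);
  const fetched = Math.min(requested, storeClamp);      // store-level clamp
  const survivors = Math.max(0, fetched - primaryCount); // after id exclusion
  return Math.min(maxPerResult, survivors);
}
```

With `maxPerResult` of 2, five primaries still leave room for both neighbors, 19 primaries leave room for one, and 20 primaries can leave none.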

@rwmjhb rwmjhb left a comment

Collaborator

This PR is currently blocked by merge conflicts (mergeable=CONFLICTING, merge_state_status=DIRTY).

Please rebase this branch onto the latest base branch, resolve the conflicts, and push the updated branch. I'm deferring any further review until the branch is mergeable again, since conflict resolution can invalidate the current diff and prior findings.

@TurboTheTurtle TurboTheTurtle force-pushed the codex/issue-538-neighbor-enrichment-v2 branch from b015b97 to 2e37896 on June 15, 2026 01:25
@TurboTheTurtle

Contributor Author

> This PR is currently blocked by merge conflicts (mergeable=CONFLICTING, merge_state_status=DIRTY).
>
> Please rebase this branch onto the latest base branch, resolve the conflicts, and push the updated branch. I'm deferring any further review until the branch is mergeable again, since conflict resolution can invalidate the current diff and prior findings.

Thanks, rebased onto current upstream/master, resolved the conflicts, and force-pushed the branch.

@rwmjhb rwmjhb left a comment

Collaborator

Requesting changes. The rebase fixed the conflict and the main recall paths have neighbor filtering, but there is still an unfiltered serialization path for nested neighbors.

Must fix

sanitizeMemoryForSerialization() serializes r.neighbors directly, including id/text/category/scope/importance/score. Manual memory_recall filters neighbors before calling it, and auto-recall filters before rendering, but this serializer is also used by other tool responses that can receive retriever results directly:

  • resolveMemoryId() ambiguous-match details: details.candidates = sanitizeMemoryForSerialization(results)
  • retrieval trace/debug details: details.memories = sanitizeMemoryForSerialization(results)
  • memory_explain_rank: details.results = sanitizeMemoryForSerialization(results)

Those paths do not apply filterManualRecallResultNeighbors() or the auto-recall governance filter. With neighbor enrichment enabled, a top-level eligible result can therefore carry nested neighbors that bypass USER.md-exclusive filtering, state/layer governance, tier suppression, or workspace-boundary filtering in serialized details.

Please centralize neighbor serialization behind the same governance/boundary filter used for manual recall, or make sanitizeMemoryForSerialization() exclude neighbors by default unless the caller explicitly passes already-filtered neighbors. Add regression coverage for at least one non-memory_recall caller, e.g. ambiguous resolveMemoryId() candidates or memory_explain_rank, proving USER.md-exclusive/governance-ineligible neighbors are not present in details.
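The second option, excluding neighbors unless the caller opts in, might look like the sketch below. It is not the actual `sanitizeMemoryForSerialization()`: the function name, the `includeFilteredNeighbors` flag, and the exact output shape are assumptions. The serialized field names follow the list given in this comment (id/text/category/scope/importance/score).

```javascript
// Hypothetical serializer: neighbors are dropped by default and emitted
// only when the caller states that they have already been filtered.
function serializeMemoriesSketch(results, { includeFilteredNeighbors = false } = {}) {
  const pick = (r) => ({
    id: r.entry.id,
    text: r.entry.text,
    category: r.entry.category,
    scope: r.entry.scope,
    importance: r.entry.importance,
    score: r.score,
  });
  return results.map((r) => {
    const out = pick(r);
    if (includeFilteredNeighbors && Array.isArray(r.neighbors) && r.neighbors.length > 0) {
      out.neighbors = r.neighbors.map(pick);
    }
    return out;
  });
}
```

Callers such as the ambiguous-match or explain-rank paths would then be safe without changes, and only paths that run the governance filter first would pass the flag.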

Also worth addressing

  • Neighbor category filtering uses raw candidate.entry.category !== category instead of the existing smart category matcher, so canonical filters like preferences can silently drop legacy preference neighbors.
  • Supplemental neighbor lookups still do one BM25 query per top-level result using full entry text and do not observe the recall abort signal/deadline.
  • The BM25 over-fetch slack is still limited by the store-level 20-result clamp before primary IDs are filtered.
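On the abort signal/deadline point, a per-result loop can check both before each supplemental lookup. A minimal sketch with hypothetical names (`lookupNeighborsWithBudget`, `lookup`, `deadlineMs`); it uses the standard `AbortSignal.aborted` property and makes no claim about how the plugin's recall deadline is actually represented.

```javascript
// Hypothetical loop: stop issuing supplemental lookups once the caller's
// AbortSignal fires or an absolute deadline (epoch ms) has passed.
async function lookupNeighborsWithBudget(results, lookup, { signal, deadlineMs } = {}) {
  const neighborsById = new Map();
  for (const result of results) {
    if (signal?.aborted) break;
    if (deadlineMs !== undefined && Date.now() >= deadlineMs) break;
    neighborsById.set(result.entry.id, await lookup(result));
  }
  return neighborsById;
}
```

Results skipped this way simply carry no neighbors, which matches the feature's existing fall-back to primary-only output.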

@rwmjhb rwmjhb left a comment

Collaborator

Approved. The latest commit fixes the unfiltered serialized-neighbor issue, including coverage for non-recall serialized tool details. I verified this head independently:

  • npm run build --if-present
  • node --test test/retriever-neighbor-enrichment.test.mjs
  • node --test test/recall-text-cleanup.test.mjs
  • build left source/dist files clean

Remaining notes are non-blocking: neighbor category filtering should eventually use the canonical category matcher, high-limit enrichment can under-deliver because store BM25 search is capped at 20, and injected auto-recall neighbors are supplemental context rather than first-class tracked recall IDs.

@rwmjhb rwmjhb merged commit 37f5d8c into CortexReach:master Jun 17, 2026
8 checks passed
@TurboTheTurtle TurboTheTurtle deleted the codex/issue-538-neighbor-enrichment-v2 branch June 17, 2026 19:25
3 participants