feat: add optional Redis write lock#879
Conversation
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes. The direction is valuable, but the current fallback behavior breaks the distributed-lock guarantee under contention.
Must fix
RedisLockManager.acquire() throws RedisLockUnavailableError both when Redis commands are unavailable and when SET NX cannot acquire the lock before acquireTimeoutMs. MemoryStore.runWithWriteLock() catches any RedisLockUnavailableError and falls back to runWithFileLock().
That means if process A still owns the Redis lock, process B can time out waiting and then proceed under only the local file lock. Process A never took that file lock, so both writers can run concurrently in different lock domains. This fails exactly in the high-contention/distributed scenario the Redis lock is meant to fix.
Please separate these cases:
- Redis genuinely unavailable before entering the Redis lock protocol: file-lock fallback can be acceptable.
- Redis reachable but lock is held / acquire timeout: do not switch lock domains. Keep retrying, fail the write so the caller can retry, or otherwise preserve Redis-domain mutual exclusion.
Also worth fixing
- The Redis lease uses
SET PX ttlMs NX, but callbacks run without renewal or an operation timeout. A write longer thanttlMscan overlap with another writer after the lease expires. - A transient Redis command error currently aborts acquisition rather than retrying until the deadline, amplifying the fallback problem.
- Fallback to file locking is logged only once per
MemoryStore, so sustained Redis lock degradation can become invisible. - Static analysis flags one new
anycast in the dynamic ioredis import path.
Verification note: targeted tests and full npm test passed, but the current tests do not cover Redis-lock-held timeout falling back into the file-lock domain or lease expiry during an active write.
998e701 to
1d733a4
Compare
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes. The new version fixes the previous Redis-lock-domain fallback direction, but there is still one blocking correctness issue in the lease-loss path.
Must fix
RedisLockManager.withLock() starts the write callback, suppresses its rejection, and then returns:
Promise.race([operation, leaseLostPromise])If lease renewal fails or lock ownership is lost, leaseLostPromise can reject first and the caller receives RedisLockLeaseLostError. But the original LanceDB write callback is not cancelled, awaited to completion before reporting, or given an AbortSignal. That means a caller can retry while the original write is still running and may later commit outside the caller's control.
For a distributed write lock, this is not safe: lease loss must not make the caller believe the write aborted unless the write operation can actually be cancelled.
Please change the lease-loss behavior so one of these is true:
- the active write can be cancelled/aborted when the lease is lost, or
withLock()waits for the write to settle and returns a distinct non-retryable post-write lease-integrity error, or- lease loss is handled in a way that cannot cause retry overlap with the original in-flight write.
Also worth fixing
runStorageMaintenance()still usesrunWithFileLock()while Redis-enabled writes userunWithWriteLock().table.optimize()is storage-mutating, so Redis deployments can split lock domains between maintenance and normal writes.- Plugin stop/reload does not close Redis clients.
MemoryStore.destroy()closes lock resources, but the service stop handler only clears timers. openclaw.plugin.jsonstill describesacquireTimeoutMsas falling back to the file lock, but lock-held/acquire-timeout now fails/retries instead.- A single transient renewal error currently marks the lease lost immediately.
- Consider whether
ioredisshould be a hard dependency for an optional feature. - The
process.report.excludeNetworkand legacy ID broadening changes look unrelated to the Redis locking scope and should be split out or justified separately.
Verification note: targeted tests and full npm test passed, but the current tests do not cover lease loss while a write continues, storage maintenance under Redis locking, or Redis client cleanup on plugin stop.
|
Pushed d292e6b to address the lease-loss path: Ran:
|
Agent Code Review — #879 (Redis write lock)Scope: Redis write lock feature (~207 lines src, ~478 lines tests, config plumbing) P1 —
|
| Severity | Count | Details |
|---|---|---|
| P1 | 2 | Shutdown race, renew interval at low TTL |
| P2 | 2 | No warning on Redis connection fail, retry delay minimum surprising |
| P3 | 1 | Empty resource fallback |
The implementation is solid overall. Lua scripts are atomic, token-based release prevents stolen-lock issues, and auto-renewal with lease-lost detection is well-designed.
rwmjhb
left a comment
There was a problem hiding this comment.
This needs a rebase before the next full review.
GitHub currently reports this branch as mergeable=CONFLICTING / merge_state_status=DIRTY, so the orchestrator stopped at the conflict gate after R0/R1. The PR is still valuable, but reviewing the implementation now would be against a branch that has to be rewritten for conflict resolution.
Please rebase or merge the latest base branch, resolve the conflicts, and push the updated branch. The re-review will run against the new head.
326e4f1 to
d356888
Compare
rwmjhb
left a comment
There was a problem hiding this comment.
Approved. The current head resolves the earlier blocking Redis lock issues: lease-loss now waits for the active write before reporting integrity loss, storage maintenance uses the Redis write-lock domain, plugin stop closes lock resources, and the manifest/tests were updated.
Verification checked:
- orchestrator targeted tests passed
- orchestrator full npm test passed
npm run build --if-presentpassed on headd35688898b123cbffa03124c59e89b832f5286bbnode --test test/redis-lock.test.mjspassed, 17/17node --test test/storage-maintenance.test.mjspassed, 6/6node test/plugin-manifest-regression.mjspassed
Non-blocking follow-ups worth considering:
- Service stop now destroys the cached singleton store, but
_singletonStateis not cleared there. A same-process stop followed by re-register can reuse a destroyed store unless the host also callsresetRegistration(). RedisLockManager.close()can still race an active Redis-locked write during shutdown; it closes the shared client without tracking active holders.locking.redis.enabled=falsewithlocking.redis.url="${REDIS_URL}"can still fail config parsing ifREDIS_URLis unset, because the URL is resolved before the disabled check.ioredisis now a hard runtime dependency for an optional/default-off feature.- The
process.report.excludeNetworkworkaround is host-global and unrelated to Redis locking; it would be cleaner split out or justified separately. redisUrlstill implicitly enables Redis locking without explicitlocking.redis.enabled=true.
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes. The Redis lock direction is valuable, and the normal-path tests are green, but this head still has two correctness issues that can break the distributed-lock guarantee or active registrations.
Must fix
- Stopping one API registration tears down the shared singleton used by the others.
register() reuses the process-wide _singletonState/MemoryStore across API registrations, but each registration's service stop handler unconditionally calls store.destroy() and clears _singletonState for that one API. In a process with multiple live registrations, stopping or reloading one registration can close the shared store and Redis lock resources while the other registrations still have hooks/tools closing over that store.
Please remove the stopping API from the registration map first and only destroy/clear the singleton when no registered APIs remain, or introduce explicit reference-counted ownership for the shared store. Add a regression test with multiple register() calls where stopping one registration does not invalidate the other.
- Redis unavailability can silently split the lock domain.
MemoryStore.runWithWriteLock() catches RedisLockUnavailableError and falls back to runWithFileLock(). That means one process with a failed Redis client/connect path can write under only the local file lock while another process with a working Redis client is still coordinating through Redis for the same dbPath. This recreates the split-lock-domain problem the PR is intended to solve. The shutdown path has a related variant because RedisLockManager.close() sets isClosed before active locks drain, causing racing writes to throw the same unavailable error and fall back to file locking.
Please distinguish "Redis was never configured/genuinely unavailable before using Redis mode" from "Redis lock mode is configured but this process cannot currently acquire/use it". In Redis-lock mode, acquisition/client/shutdown failures should fail closed rather than falling back into a different lock domain while peers may still use Redis. Add tests for a Redis client failure and shutdown race.
Also worth addressing
- Lease-integrity loss is skipped when the protected write callback rejects, so an ambiguous write can surface as an ordinary storage error.
redisUrlis resolved before higher-prioritylocking.redis.url, so a stale legacy placeholder can abort config parsing even when the nested URL is valid.loadLanceDBnow mutates globalprocess.report.excludeNetwork; that host-wide side effect is unrelated to Redis locking and should be split out or scoped/restored.
|
Pushed eeadda7 to keep the shared singleton alive while other plugin registrations are still active, and still clear it once the final registration stops so a later registration gets a fresh store. Ran:
|
rwmjhb
left a comment
There was a problem hiding this comment.
This head is currently blocked by merge conflicts (mergeable=CONFLICTING, merge_state_status=DIRTY).
I can see the new commit is aimed at keeping the shared store alive for active registrations, but the branch needs to be rebased onto the latest base branch first. Please rebase/merge the base branch, resolve the conflicts, and push the updated branch. I'll re-run the full review once it is mergeable again, since conflict resolution may change the Redis lock and store lifecycle code paths.
eeadda7 to
c865d70
Compare
|
Pushed the Redis-lock follow-ups: close now keeps the manager shut down after shutdown, lease renewal is floored at 100ms, Redis connection failures emit the configured warning before falling back, retry delay uses a 100ms floor, and empty resources are rejected instead of sharing a default key.\n\nRan:\n- node --test test/redis-lock.test.mjs test/storage-maintenance.test.mjs\n- node test/plugin-manifest-regression.mjs\n- npm run build --if-present |
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes for generated output sync.
The Redis-lock review verdict is otherwise no longer blocking, and the focused checks pass on this head, but the checked-in dist output is stale. I verified locally on c865d70:
npm run build --if-presentpassesnode --test test/redis-lock.test.mjspassesnode --test test/storage-maintenance.test.mjspassesnode test/plugin-manifest-regression.mjspasses
After running the build, the worktree is dirty:
M dist/src/retriever.jsThe generated change is the compiled comment for the FTS routing initialization patch in dist/src/retriever.js. Please run the build and commit the updated dist output so the source/dist artifacts are reproducible from this branch.
Non-blocking follow-ups from review: Redis-unavailable fallback still deliberately degrades to file locking, automatic index folding currently uses the file-lock domain, and legacy redisUrl precedence can still surprise migrations. Those are documented tradeoffs/edge cases rather than blockers for this head.
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes on the latest head. The generated-output sync issue is fixed, but the Redis lock implementation still has distributed-lock correctness gaps.
Must fix
- Automatic index folding still bypasses the Redis write-lock domain.
noteDataModification() can trigger foldIndices(), and foldIndices() runs table.optimize() under runWithFileLock() rather than runWithWriteLock(). In Redis-enabled deployments, normal writes coordinate through Redis while this storage-mutating optimize path only takes the local file lock, so it can overlap with Redis-locked writes and recreate the split lock-domain problem. Please route index folding through the same Redis write lock, or explicitly defer/disable it when Redis locking is active, with a regression test.
- Redis connection/unavailable failures still fail open into a separate lock domain.
runWithWriteLock() catches RedisLockUnavailableError and falls back to runWithFileLock(). createClient() also maps Redis connection failures into that unavailable error. In a multi-host deployment, one host with a failed Redis connection can then write under only its local file lock while another host is still using Redis for the same dbPath. Those writers are no longer mutually exclusive. Redis-configured mode should fail closed by default after Redis locking is selected, or require an explicit best-effort fallback option with clear docs.
Also worth addressing
- Lease loss is only checked after a successful callback; if the protected LanceDB write rejects after a partial write while the lease was lost, callers can receive an ordinary storage error instead of a lease-integrity error.
withLock()tracks active locks only afteracquire()resolves, soclose()can return while an acquisition/write is still in flight.- Legacy
redisUrlis resolved before higher-prioritylocking.redis.url; a stale legacy env placeholder can break startup even when the nested URL is valid.
Verification note: R0 targeted checks and full npm test passed on this head, but those tests do not cover the lock-domain split paths above.
Summary
Fixes #659
Verification