Skip to content

feat: add optional Redis write lock#879

Open
TurboTheTurtle wants to merge 6 commits into
CortexReach:masterfrom
TurboTheTurtle:codex/issue-659-redis-lock
Open

feat: add optional Redis write lock#879
TurboTheTurtle wants to merge 6 commits into
CortexReach:masterfrom
TurboTheTurtle:codex/issue-659-redis-lock

Conversation

@TurboTheTurtle

Copy link
Copy Markdown
Contributor

Summary

  • add an optional Redis write lock manager using token ownership and Lua safe release
  • route LanceDB write paths through Redis when enabled, with file-lock fallback when Redis is unavailable
  • expose Redis lock config and ioredis dependency with unit and storage CI coverage

Fixes #659

Verification

  • node --test test/redis-lock.test.mjs
  • npm run build
  • npm run test:storage-and-schema
  • npm run test:packaging-and-workflow

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes. The direction is valuable, but the current fallback behavior breaks the distributed-lock guarantee under contention.

Must fix

RedisLockManager.acquire() throws RedisLockUnavailableError both when Redis commands are unavailable and when SET NX cannot acquire the lock before acquireTimeoutMs. MemoryStore.runWithWriteLock() catches any RedisLockUnavailableError and falls back to runWithFileLock().

That means if process A still owns the Redis lock, process B can time out waiting and then proceed under only the local file lock. Process A never took that file lock, so both writers can run concurrently in different lock domains. This fails exactly in the high-contention/distributed scenario the Redis lock is meant to fix.

Please separate these cases:

  • Redis genuinely unavailable before entering the Redis lock protocol: file-lock fallback can be acceptable.
  • Redis reachable but lock is held / acquire timeout: do not switch lock domains. Keep retrying, fail the write so the caller can retry, or otherwise preserve Redis-domain mutual exclusion.

Also worth fixing

  • The Redis lease uses SET PX ttlMs NX, but callbacks run without renewal or an operation timeout. A write longer than ttlMs can overlap with another writer after the lease expires.
  • A transient Redis command error currently aborts acquisition rather than retrying until the deadline, amplifying the fallback problem.
  • Fallback to file locking is logged only once per MemoryStore, so sustained Redis lock degradation can become invisible.
  • Static analysis flags one new any cast in the dynamic ioredis import path.

Verification note: targeted tests and full npm test passed, but the current tests do not cover Redis-lock-held timeout falling back into the file-lock domain or lease expiry during an active write.

@TurboTheTurtle TurboTheTurtle force-pushed the codex/issue-659-redis-lock branch from 998e701 to 1d733a4 Compare June 12, 2026 08:09

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes. The new version fixes the previous Redis-lock-domain fallback direction, but there is still one blocking correctness issue in the lease-loss path.

Must fix

RedisLockManager.withLock() starts the write callback, suppresses its rejection, and then returns:

Promise.race([operation, leaseLostPromise])

If lease renewal fails or lock ownership is lost, leaseLostPromise can reject first and the caller receives RedisLockLeaseLostError. But the original LanceDB write callback is not cancelled, awaited to completion before reporting, or given an AbortSignal. That means a caller can retry while the original write is still running and may later commit outside the caller's control.

For a distributed write lock, this is not safe: lease loss must not make the caller believe the write aborted unless the write operation can actually be cancelled.

Please change the lease-loss behavior so one of these is true:

  • the active write can be cancelled/aborted when the lease is lost, or
  • withLock() waits for the write to settle and returns a distinct non-retryable post-write lease-integrity error, or
  • lease loss is handled in a way that cannot cause retry overlap with the original in-flight write.

Also worth fixing

  • runStorageMaintenance() still uses runWithFileLock() while Redis-enabled writes use runWithWriteLock(). table.optimize() is storage-mutating, so Redis deployments can split lock domains between maintenance and normal writes.
  • Plugin stop/reload does not close Redis clients. MemoryStore.destroy() closes lock resources, but the service stop handler only clears timers.
  • openclaw.plugin.json still describes acquireTimeoutMs as falling back to the file lock, but lock-held/acquire-timeout now fails/retries instead.
  • A single transient renewal error currently marks the lease lost immediately.
  • Consider whether ioredis should be a hard dependency for an optional feature.
  • The process.report.excludeNetwork and legacy ID broadening changes look unrelated to the Redis locking scope and should be split out or justified separately.

Verification note: targeted tests and full npm test passed, but the current tests do not cover lease loss while a write continues, storage maintenance under Redis locking, or Redis client cleanup on plugin stop.

@TurboTheTurtle

Copy link
Copy Markdown
Contributor Author

Pushed d292e6b to address the lease-loss path: withLock() now waits for the active write to settle and reports a non-retryable lease-integrity error if ownership was lost before completion. I also moved storage maintenance onto the same write-lock path, close Redis lock resources through destroy(), and updated the acquire-timeout wording.

Ran:

  • node --test test/redis-lock.test.mjs test/storage-maintenance.test.mjs
  • node test/plugin-manifest-regression.mjs

@rickthomasjr

Copy link
Copy Markdown

Agent Code Review — #879 (Redis write lock)

Scope: Redis write lock feature (~207 lines src, ~478 lines tests, config plumbing)
CI: All 6 checks pass


P1 — close() race with concurrent withLock()

close() sets this.clientPromise = null synchronously. If a withLock() call has just awaited getClient() and received a client ref, close() could be called concurrently (from stop() in the plugin), nulling clientPromise. The client ref is already resolved, so this is fine for that call — but after close() returns, any concurrent withLock() will call getClient(), see clientPromise === null, call createClient() to create a new client, and proceed with the new client while the old one was just closed.

Recommendation: Add an isClosed flag (boolean, set before nulling clientPromise in close()) and check it in getClient() — throw RedisLockUnavailableError instead of creating a new client after shutdown.

P1 — withLock() renew interval too aggressive at low TTL

With the default 60s TTL, renew fires every 20s. But when the user sets ttlMs: 1_000 (1s), renewIntervalMs = Math.floor(1/3) = 0 — and Math.max(1, 0) = 1, so it fires every 1ms. This means renewal runs ~1000x/sec on a 1s lease, which is wasteful and could cause Redis load issues under contention.

Recommendation: Use Math.max(100, Math.floor(ttlMs / 5)) instead of Math.max(1, Math.floor(this.config.ttlMs / 3)).

P2 — No onWarning emitted when Redis connection fails

In createClient(), the catch block wraps the error in RedisLockUnavailableError. The calling code in runWithWriteLock catches this and falls back to file lock, but no warning is emitted about the Redis connection failure. The admin won't know Redis locking is unavailable.

Recommendation: Call onWarning in the catch block of createClient() before wrapping the error.

P2 — Retry delay minimum of 1s is surprising

Math.max(this.retryDelayMs, 1_000) in acquire() means a user-specified retryDelayMs: 50 becomes 1000ms. With acquireTimeoutMs: 5000, the first retry takes at least 1000ms, so maximum 4-5 attempts. This is fine for most cases but the effective minimum is undocumented.

Recommendation: Reduce the floor to 100ms (Math.max(this.retryDelayMs, 100)) or document the 1s minimum in the config interface.

P3 — Empty resource falls back to "default" silently

resource || "default" in makeKey(). If resource is empty string, it hashes "default". Two different empty-string resources would collide on the same lock key. In practice dbPath is always non-empty, so this is low risk but the fallback is silent.

Recommendation: Use resource ?? "default" or throw if resource is falsy.


Summary

Severity Count Details
P1 2 Shutdown race, renew interval at low TTL
P2 2 No warning on Redis connection fail, retry delay minimum surprising
P3 1 Empty resource fallback

The implementation is solid overall. Lua scripts are atomic, token-based release prevents stolen-lock issues, and auto-renewal with lease-lost detection is well-designed.

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs a rebase before the next full review.

GitHub currently reports this branch as mergeable=CONFLICTING / merge_state_status=DIRTY, so the orchestrator stopped at the conflict gate after R0/R1. The PR is still valuable, but reviewing the implementation now would be against a branch that has to be rewritten for conflict resolution.

Please rebase or merge the latest base branch, resolve the conflicts, and push the updated branch. The re-review will run against the new head.

@TurboTheTurtle TurboTheTurtle force-pushed the codex/issue-659-redis-lock branch from 326e4f1 to d356888 Compare June 14, 2026 07:05

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved. The current head resolves the earlier blocking Redis lock issues: lease-loss now waits for the active write before reporting integrity loss, storage maintenance uses the Redis write-lock domain, plugin stop closes lock resources, and the manifest/tests were updated.

Verification checked:

  • orchestrator targeted tests passed
  • orchestrator full npm test passed
  • npm run build --if-present passed on head d35688898b123cbffa03124c59e89b832f5286bb
  • node --test test/redis-lock.test.mjs passed, 17/17
  • node --test test/storage-maintenance.test.mjs passed, 6/6
  • node test/plugin-manifest-regression.mjs passed

Non-blocking follow-ups worth considering:

  • Service stop now destroys the cached singleton store, but _singletonState is not cleared there. A same-process stop followed by re-register can reuse a destroyed store unless the host also calls resetRegistration().
  • RedisLockManager.close() can still race an active Redis-locked write during shutdown; it closes the shared client without tracking active holders.
  • locking.redis.enabled=false with locking.redis.url="${REDIS_URL}" can still fail config parsing if REDIS_URL is unset, because the URL is resolved before the disabled check.
  • ioredis is now a hard runtime dependency for an optional/default-off feature.
  • The process.report.excludeNetwork workaround is host-global and unrelated to Redis locking; it would be cleaner split out or justified separately.
  • redisUrl still implicitly enables Redis locking without explicit locking.redis.enabled=true.

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes. The Redis lock direction is valuable, and the normal-path tests are green, but this head still has two correctness issues that can break the distributed-lock guarantee or active registrations.

Must fix

  1. Stopping one API registration tears down the shared singleton used by the others.

register() reuses the process-wide _singletonState/MemoryStore across API registrations, but each registration's service stop handler unconditionally calls store.destroy() and clears _singletonState for that one API. In a process with multiple live registrations, stopping or reloading one registration can close the shared store and Redis lock resources while the other registrations still have hooks/tools closing over that store.

Please remove the stopping API from the registration map first and only destroy/clear the singleton when no registered APIs remain, or introduce explicit reference-counted ownership for the shared store. Add a regression test with multiple register() calls where stopping one registration does not invalidate the other.

  1. Redis unavailability can silently split the lock domain.

MemoryStore.runWithWriteLock() catches RedisLockUnavailableError and falls back to runWithFileLock(). That means one process with a failed Redis client/connect path can write under only the local file lock while another process with a working Redis client is still coordinating through Redis for the same dbPath. This recreates the split-lock-domain problem the PR is intended to solve. The shutdown path has a related variant because RedisLockManager.close() sets isClosed before active locks drain, causing racing writes to throw the same unavailable error and fall back to file locking.

Please distinguish "Redis was never configured/genuinely unavailable before using Redis mode" from "Redis lock mode is configured but this process cannot currently acquire/use it". In Redis-lock mode, acquisition/client/shutdown failures should fail closed rather than falling back into a different lock domain while peers may still use Redis. Add tests for a Redis client failure and shutdown race.

Also worth addressing

  • Lease-integrity loss is skipped when the protected write callback rejects, so an ambiguous write can surface as an ordinary storage error.
  • redisUrl is resolved before higher-priority locking.redis.url, so a stale legacy placeholder can abort config parsing even when the nested URL is valid.
  • loadLanceDB now mutates global process.report.excludeNetwork; that host-wide side effect is unrelated to Redis locking and should be split out or scoped/restored.

@TurboTheTurtle

Copy link
Copy Markdown
Contributor Author

Pushed eeadda7 to keep the shared singleton alive while other plugin registrations are still active, and still clear it once the final registration stops so a later registration gets a fresh store.

Ran:

  • node test/plugin-manifest-regression.mjs
  • node --test test/redis-lock.test.mjs test/storage-maintenance.test.mjs
  • npm run build --if-present

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This head is currently blocked by merge conflicts (mergeable=CONFLICTING, merge_state_status=DIRTY).

I can see the new commit is aimed at keeping the shared store alive for active registrations, but the branch needs to be rebased onto the latest base branch first. Please rebase/merge the base branch, resolve the conflicts, and push the updated branch. I'll re-run the full review once it is mergeable again, since conflict resolution may change the Redis lock and store lifecycle code paths.

@TurboTheTurtle TurboTheTurtle force-pushed the codex/issue-659-redis-lock branch from eeadda7 to c865d70 Compare June 15, 2026 02:26
@TurboTheTurtle

Copy link
Copy Markdown
Contributor Author

Pushed the Redis-lock follow-ups: close now keeps the manager shut down after shutdown, lease renewal is floored at 100ms, Redis connection failures emit the configured warning before falling back, retry delay uses a 100ms floor, and empty resources are rejected instead of sharing a default key.\n\nRan:\n- node --test test/redis-lock.test.mjs test/storage-maintenance.test.mjs\n- node test/plugin-manifest-regression.mjs\n- npm run build --if-present

@TurboTheTurtle TurboTheTurtle requested a review from rwmjhb June 16, 2026 04:41

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes for generated output sync.

The Redis-lock review verdict is otherwise no longer blocking, and the focused checks pass on this head, but the checked-in dist output is stale. I verified locally on c865d70:

  • npm run build --if-present passes
  • node --test test/redis-lock.test.mjs passes
  • node --test test/storage-maintenance.test.mjs passes
  • node test/plugin-manifest-regression.mjs passes

After running the build, the worktree is dirty:

M dist/src/retriever.js

The generated change is the compiled comment for the FTS routing initialization patch in dist/src/retriever.js. Please run the build and commit the updated dist output so the source/dist artifacts are reproducible from this branch.

Non-blocking follow-ups from review: Redis-unavailable fallback still deliberately degrades to file locking, automatic index folding currently uses the file-lock domain, and legacy redisUrl precedence can still surprise migrations. Those are documented tradeoffs/edge cases rather than blockers for this head.

@TurboTheTurtle TurboTheTurtle requested a review from rwmjhb June 16, 2026 08:39

@rwmjhb rwmjhb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes on the latest head. The generated-output sync issue is fixed, but the Redis lock implementation still has distributed-lock correctness gaps.

Must fix

  1. Automatic index folding still bypasses the Redis write-lock domain.

noteDataModification() can trigger foldIndices(), and foldIndices() runs table.optimize() under runWithFileLock() rather than runWithWriteLock(). In Redis-enabled deployments, normal writes coordinate through Redis while this storage-mutating optimize path only takes the local file lock, so it can overlap with Redis-locked writes and recreate the split lock-domain problem. Please route index folding through the same Redis write lock, or explicitly defer/disable it when Redis locking is active, with a regression test.

  1. Redis connection/unavailable failures still fail open into a separate lock domain.

runWithWriteLock() catches RedisLockUnavailableError and falls back to runWithFileLock(). createClient() also maps Redis connection failures into that unavailable error. In a multi-host deployment, one host with a failed Redis connection can then write under only its local file lock while another host is still using Redis for the same dbPath. Those writers are no longer mutually exclusive. Redis-configured mode should fail closed by default after Redis locking is selected, or require an explicit best-effort fallback option with clear docs.

Also worth addressing

  • Lease loss is only checked after a successful callback; if the protected LanceDB write rejects after a partial write while the lease was lost, callers can receive an ordinary storage error instead of a lease-integrity error.
  • withLock() tracks active locks only after acquire() resolves, so close() can return while an acquisition/write is still in flight.
  • Legacy redisUrl is resolved before higher-priority locking.redis.url; a stale legacy env placeholder can break startup even when the nested URL is valid.

Verification note: R0 targeted checks and full npm test passed on this head, but those tests do not cover the lock-domain split paths above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Solution: Redis Distributed Lock for High Concurrency (extends #643)

3 participants