feat(learning): harden rebuild and recovery semantics#245
Conversation
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
📝 WalkthroughWalkthroughAdds a persistent LearningMaintenanceRecoveryStore, wires recovery-aware workspace registration into AceClawDaemon and LearningMaintenanceScheduler, extends scheduler control flow with a RECOVERY trigger and workspace registration, strengthens rebuild consistency checks, and improves atomic writes and JSON parsing resilience. Changes
Sequence Diagram(s)sequenceDiagram
participant D as AceClawDaemon
participant S as LearningMaintenanceScheduler
participant R as LearningMaintenanceRecoveryStore
participant M as MaintenancePipeline
Note over D,R: Startup
D->>R: new LearningMaintenanceRecoveryStore()
D->>S: registerWorkspace(workspaceHash, workingDir)
Note over S,R: Recovery check
S->>R: needsRecovery(workspaceHash)?
R-->>S: true/false
alt Recovery Needed
S->>R: markStarted(workspaceHash, trigger="recovery")
R-->>S: RecoveryState(RUNNING)
S->>M: execute(recoveryScopes)
M-->>S: success / failure
alt success
S->>R: clear(workspaceHash)
else failure
S->>R: markFailed(workspaceHash, exception)
R-->>S: RecoveryState(FAILED)
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error, 1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java`:
- Around line 68-82: The code currently swallows parse errors in load(Path)
causing an unreadable-but-present state file to be treated like "no state";
instead, change load(Path projectRoot) to distinguish a missing file from an
unreadable file by not returning Optional.empty() on parse failures: if
Files.isRegularFile(file) and reading/parsing fails, propagate an IOException
(or wrap parsing RuntimeExceptions in an IOException with the underlying cause)
so callers (e.g. needsRecovery()) can treat an unreadable existing file as
recovery-needed; apply the same change to the corresponding error-handling block
referenced around lines 85-98 so any parsing/IO failure for an existing state
file is surfaced rather than swallowed.
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`:
- Around line 191-202: The batch failure handling currently requeues every scope
in scopesToRun and only performs per-scope cleanup after the whole loop, causing
successful scopes to be retried and recovery state to be resurrected; modify
LearningMaintenanceScheduler so you track completed scopes (e.g., a local Set
completedScopes) and perform the success cleanup (pendingScopes.remove(...) and
recoveryScopes.remove(...), and recoveryStore.clear(...)) immediately after each
successful pipeline.run(...) for that scope; in the catch block only requeue the
scope that threw and the scopes after it (i.e., scopesToRun minus
completedScopes), and do not requeue scopes present in completedScopes. Ensure
references: pipeline.run(trigger.id(), scope),
recoveryStore.clear(scope.workingDir()),
pendingScopes.remove(scope.workspaceHash(), scope),
recoveryScopes.remove(scope.workspaceHash(), scope), and
lastRunAt/sessionsSinceLastRun logic remain correct.
In
`@aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.java`:
- Around line 222-231: The test races because waitForTriggers(triggers, 1) only
ensures the pipeline lambda appended to triggers but not that pipeline.run
completed and recoveryStore.clear executed; after
scheduler.registerWorkspace("ws-a", workspace) add a call to
waitForMaintenanceToSettle(scheduler) (or equivalent helper) before asserting
recoveryStore.load(workspace). In short: after waitForTriggers(triggers, 1) call
waitForMaintenanceToSettle(scheduler) to wait for maintenance to finish, then
perform assertThat(recoveryStore.load(workspace)).isEmpty().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: eec73ed1-4cb5-4667-a68c-1fef876ebdfb
📒 Files selected for processing (10)
aceclaw-daemon/src/main/java/dev/aceclaw/daemon/AceClawDaemon.javaaceclaw-daemon/src/main/java/dev/aceclaw/daemon/HistoricalIndexRebuilder.javaaceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.javaaceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.javaaceclaw-daemon/src/main/java/dev/aceclaw/daemon/SessionHistoryStore.javaaceclaw-daemon/src/test/java/dev/aceclaw/daemon/HistoricalIndexRebuilderTest.javaaceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStoreTest.javaaceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.javaaceclaw-daemon/src/test/java/dev/aceclaw/daemon/SessionHistoryStoreTest.javaaceclaw-memory/src/main/java/dev/aceclaw/memory/HistoricalLogIndex.java
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
…ing-rebuild-hardening
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java (1)
24-31:⚠️ Potential issue | 🟠 MajorA corrupt recovery marker currently prevents recovery from ever starting.
needsRecovery()now treats an unreadable state file as recovery-needed, butmarkStarted()andmarkFailed()both callload(projectRoot)again before writing. InLearningMaintenanceScheduler.tryTrigger(...), that exception is thrown beforepipeline.run(...), so malformed JSON becomes an endless recovery loop instead of a one-time cleanup pass. Existing blank files are also still collapsed toOptional.empty(), which skips recovery entirely.Please keep “missing” separate from “present but unusable”, and let the write path fall back to a fresh state when the previous file cannot be parsed.
Also applies to: 47-50, 74-84
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java` around lines 24 - 31, markStarted and markFailed call load(projectRoot) and treat unreadable/malformed state files the same as "missing", causing malformed JSON to block recovery; change both methods (markStarted, markFailed) to distinguish "present but unreadable" from "missing" by catching parse/IO exceptions from load(projectRoot) and proceeding with a fresh RecoveryState fallback for the write path while still allowing needsRecovery() to return true for unreadable files; specifically, modify calls to load(...) used inside markStarted and markFailed to handle Optional.empty() (missing) vs exceptions (corrupt) separately—log the parse error, create a new default RecoveryState to continue writing, and ensure LearningMaintenanceScheduler.tryTrigger can proceed instead of failing on malformed state files.aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java (1)
217-229:⚠️ Potential issue | 🟠 MajorPersist the scopes skipped after the first batch failure.
Only
activeScopeis written torecoveryStore. The scopes after it are requeued inrecoveryScopesonly in memory, so a daemon restart before the next recovery pass drops them completely and breaks the new resume-after-interruption semantics.Possible fix
for (var scope : scopesToRun) { if (completedScopes.contains(scope)) { continue; } - if (scope.equals(activeScope) && recoveryStore != null) { + if (recoveryStore != null) { try { recoveryStore.markFailed(scope.workingDir(), scope.workspaceHash(), trigger.id(), e); } catch (Exception ignored) { // best-effort recovery metadata only } } recoveryScopes.put(scope.workspaceHash(), scope); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java` around lines 217 - 229, When an activeScope failure occurs the code only persists that single scope to recoveryStore, leaving subsequent scopes only in-memory; change the block inside the for-loop handling scope.equals(activeScope) so that after calling recoveryStore.markFailed(...) for the activeScope you also persist every remaining scope in scopesToRun that is not in completedScopes to recoveryStore (call the same recoveryStore.markFailed API with scope.workingDir(), scope.workspaceHash(), trigger.id() and a clear placeholder exception/message indicating the scope was skipped due to the earlier failure), and still add them into recoveryScopes as currently done; reference the symbols scopesToRun, completedScopes, activeScope, recoveryStore.markFailed(...), recoveryScopes.put(...), scope.workingDir(), scope.workspaceHash(), trigger.id(), and the caught exception variable e when persisting the activeScope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`:
- Around line 188-208: The code sets sessionsAtStart to 0 for Trigger.RECOVERY
which prevents recovery runs from consuming the pre-run session counter; change
the logic so sessionsAtStart captures the current sessionsSinceLastRun value for
all triggers (remove the special-case zeroing for Trigger.RECOVERY) so that the
later update sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current -
sessionsAtStart)) correctly reduces the counter after successful recovery runs;
adjust the assignment of sessionsAtStart in LearningMaintenanceScheduler (the
variable named sessionsAtStart and the conditional using Trigger.RECOVERY)
accordingly.
---
Duplicate comments:
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java`:
- Around line 24-31: markStarted and markFailed call load(projectRoot) and treat
unreadable/malformed state files the same as "missing", causing malformed JSON
to block recovery; change both methods (markStarted, markFailed) to distinguish
"present but unreadable" from "missing" by catching parse/IO exceptions from
load(projectRoot) and proceeding with a fresh RecoveryState fallback for the
write path while still allowing needsRecovery() to return true for unreadable
files; specifically, modify calls to load(...) used inside markStarted and
markFailed to handle Optional.empty() (missing) vs exceptions (corrupt)
separately—log the parse error, create a new default RecoveryState to continue
writing, and ensure LearningMaintenanceScheduler.tryTrigger can proceed instead
of failing on malformed state files.
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`:
- Around line 217-229: When an activeScope failure occurs the code only persists
that single scope to recoveryStore, leaving subsequent scopes only in-memory;
change the block inside the for-loop handling scope.equals(activeScope) so that
after calling recoveryStore.markFailed(...) for the activeScope you also persist
every remaining scope in scopesToRun that is not in completedScopes to
recoveryStore (call the same recoveryStore.markFailed API with
scope.workingDir(), scope.workspaceHash(), trigger.id() and a clear placeholder
exception/message indicating the scope was skipped due to the earlier failure),
and still add them into recoveryScopes as currently done; reference the symbols
scopesToRun, completedScopes, activeScope, recoveryStore.markFailed(...),
recoveryScopes.put(...), scope.workingDir(), scope.workspaceHash(),
trigger.id(), and the caught exception variable e when persisting the
activeScope.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 13a0d824-b28b-43fd-839c-4ecde9c4a7d6
📒 Files selected for processing (4)
aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.javaaceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.javaaceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStoreTest.javaaceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.java
| int sessionsAtStart = trigger == Trigger.RECOVERY ? 0 : sessionsSinceLastRun.get(); | ||
|
|
||
| Thread.ofVirtual().name("learning-maintenance-" + trigger.id()).start(() -> { | ||
| var completedScopes = new HashSet<WorkspaceScope>(); | ||
| WorkspaceScope activeScope = null; | ||
| try { | ||
| for (var scope : scopesToRun) { | ||
| activeScope = scope; | ||
| if (recoveryStore != null) { | ||
| recoveryStore.markStarted(scope.workingDir(), scope.workspaceHash(), trigger.id()); | ||
| } | ||
| pipeline.run(trigger.id(), scope); | ||
| if (recoveryStore != null) { | ||
| recoveryStore.clear(scope.workingDir()); | ||
| } | ||
| pendingScopes.remove(scope.workspaceHash(), scope); | ||
| recoveryScopes.remove(scope.workspaceHash(), scope); | ||
| completedScopes.add(scope); | ||
| } | ||
| lastRunAt = clock.instant(); | ||
| sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current - sessionsAtStart)); |
There was a problem hiding this comment.
Recovery runs should also consume the pre-run session counter.
RECOVERY snapshots sessionsAtStart as 0, so a failed SESSION_COUNT run that later succeeds through recovery leaves the old count in sessionsSinceLastRun. The next close can retrigger maintenance immediately even though a full maintenance pass just completed.
Possible fix
- int sessionsAtStart = trigger == Trigger.RECOVERY ? 0 : sessionsSinceLastRun.get();
+ int sessionsAtStart = sessionsSinceLastRun.get();📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| int sessionsAtStart = trigger == Trigger.RECOVERY ? 0 : sessionsSinceLastRun.get(); | |
| Thread.ofVirtual().name("learning-maintenance-" + trigger.id()).start(() -> { | |
| var completedScopes = new HashSet<WorkspaceScope>(); | |
| WorkspaceScope activeScope = null; | |
| try { | |
| for (var scope : scopesToRun) { | |
| activeScope = scope; | |
| if (recoveryStore != null) { | |
| recoveryStore.markStarted(scope.workingDir(), scope.workspaceHash(), trigger.id()); | |
| } | |
| pipeline.run(trigger.id(), scope); | |
| if (recoveryStore != null) { | |
| recoveryStore.clear(scope.workingDir()); | |
| } | |
| pendingScopes.remove(scope.workspaceHash(), scope); | |
| recoveryScopes.remove(scope.workspaceHash(), scope); | |
| completedScopes.add(scope); | |
| } | |
| lastRunAt = clock.instant(); | |
| sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current - sessionsAtStart)); | |
| int sessionsAtStart = sessionsSinceLastRun.get(); | |
| Thread.ofVirtual().name("learning-maintenance-" + trigger.id()).start(() -> { | |
| var completedScopes = new HashSet<WorkspaceScope>(); | |
| WorkspaceScope activeScope = null; | |
| try { | |
| for (var scope : scopesToRun) { | |
| activeScope = scope; | |
| if (recoveryStore != null) { | |
| recoveryStore.markStarted(scope.workingDir(), scope.workspaceHash(), trigger.id()); | |
| } | |
| pipeline.run(trigger.id(), scope); | |
| if (recoveryStore != null) { | |
| recoveryStore.clear(scope.workingDir()); | |
| } | |
| pendingScopes.remove(scope.workspaceHash(), scope); | |
| recoveryScopes.remove(scope.workspaceHash(), scope); | |
| completedScopes.add(scope); | |
| } | |
| lastRunAt = clock.instant(); | |
| sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current - sessionsAtStart)); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`
around lines 188 - 208, The code sets sessionsAtStart to 0 for Trigger.RECOVERY
which prevents recovery runs from consuming the pre-run session counter; change
the logic so sessionsAtStart captures the current sessionsSinceLastRun value for
all triggers (remove the special-case zeroing for Trigger.RECOVERY) so that the
later update sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current -
sessionsAtStart)) correctly reduces the counter after successful recovery runs;
adjust the assignment of sessionsAtStart in LearningMaintenanceScheduler (the
variable named sessionsAtStart and the conditional using Trigger.RECOVERY)
accordingly.
Summary
Testing
Closes #232
Summary by CodeRabbit
New Features
Bug Fixes
Tests