feat(learning): harden rebuild and recovery semantics #245

Merged: xinhuagu merged 3 commits into main from codex/issue-232-learning-rebuild-hardening on Mar 14, 2026

@xinhuagu xinhuagu commented Mar 13, 2026

Summary

  • add persisted learning maintenance recovery state and recovery-trigger scheduling
  • harden historical index rebuild with per-session coverage verification and legacy-aware consistency checks
  • make session history/snapshot persistence and loading more resilient under partial writes and malformed data

Testing

  • ./gradlew :aceclaw-memory:test --tests dev.aceclaw.memory.HistoricalLogIndexTest :aceclaw-daemon:test --tests dev.aceclaw.daemon.HistoricalIndexRebuilderTest --tests dev.aceclaw.daemon.LearningMaintenanceSchedulerTest --tests dev.aceclaw.daemon.LearningMaintenanceRecoveryStoreTest --tests dev.aceclaw.daemon.SessionHistoryStoreTest --no-daemon

Closes #232

Summary by CodeRabbit

  • New Features

    • Automatic recovery for failed maintenance runs with persisted recovery state and workspace registration for scheduled maintenance
    • New session-coverage reporting for historical logs
  • Bug Fixes

    • More robust rebuild consistency checks for historical indexes
    • Better handling of malformed historical data and fully atomic session-history writes
  • Tests

    • Expanded tests for recovery flows and data-consistency scenarios
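
The "fully atomic session-history writes" item above can be illustrated with the common write-to-temp-then-rename pattern. This is a minimal sketch, not the PR's actual helper: the class name `AtomicWriteSketch` and the method `writeAtomically` are hypothetical, and the real `SessionHistoryStore` helper is not shown in this thread.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; names are illustrative and not taken from the PR.
final class AtomicWriteSketch {

    // Writes content to a temp file in the target's directory, then moves it
    // into place so readers never observe a half-written target file.
    static void writeAtomically(Path target, String content) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        Files.createDirectories(dir);
        Path tmp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            Files.writeString(tmp, content, StandardCharsets.UTF_8);
            try {
                // With ATOMIC_MOVE, replacing an existing target is
                // implementation specific; common file systems replace it.
                Files.move(tmp, target,
                        StandardCopyOption.ATOMIC_MOVE,
                        StandardCopyOption.REPLACE_EXISTING);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback for file systems without atomic rename (not atomic).
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // No-op after a successful move; cleans up if the write failed.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("atomic-write-demo");
        Path target = dir.resolve("session.json");
        writeAtomically(target, "{\"messages\":[]}");
        System.out.println(Files.readString(target));
    }
}
```

Creating the temp file in the same directory as the target matters: a rename is generally only atomic within a single file system.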

coderabbitai Bot commented Mar 13, 2026

Walkthrough

Adds a persistent LearningMaintenanceRecoveryStore, wires recovery-aware workspace registration into AceClawDaemon and LearningMaintenanceScheduler, extends scheduler control flow with a RECOVERY trigger and workspace registration, strengthens rebuild consistency checks, and improves atomic writes and JSON parsing resilience.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Daemon wiring & recovery store: aceclaw-daemon/src/main/java/dev/aceclaw/daemon/AceClawDaemon.java, aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java | New LearningMaintenanceRecoveryStore class; daemon initializes the store and registers startup workspace with the scheduler. |
| Scheduler recovery integration: aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java | Constructor now accepts LearningMaintenanceRecoveryStore; added RECOVERY trigger, registerWorkspace(...), recoveryScopes tracking, and recovery-aware trigger/flow changes. |
| Rebuild & coverage checks: aceclaw-daemon/src/main/java/dev/aceclaw/daemon/HistoricalIndexRebuilder.java, aceclaw-memory/src/main/java/dev/aceclaw/memory/HistoricalLogIndex.java | Historical index now exposes per-session coverage; rebuilder uses expected vs actual coverage (excluding legacy entries) and validates post-rebuild consistency. |
| History store robustness: aceclaw-daemon/src/main/java/dev/aceclaw/daemon/SessionHistoryStore.java | saveSession now uses atomic write helper; JSON parsing error handling expanded to catch RuntimeException and return Optional.empty. |
| Recovery & scheduler tests: aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStoreTest.java, aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.java | New unit tests for recovery store; scheduler tests updated/added to cover RECOVERY trigger and pass recoveryStore parameter. |
| Rebuilder & history tests: aceclaw-daemon/src/test/java/dev/aceclaw/daemon/HistoricalIndexRebuilderTest.java, aceclaw-daemon/src/test/java/dev/aceclaw/daemon/SessionHistoryStoreTest.java | Adds test for incomplete coverage triggering rebuild and tests for malformed snapshot handling and atomic write verification. |
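
The "expected vs actual coverage (excluding legacy entries)" check in the table could look roughly like the sketch below. Everything here is an assumption made for illustration: the class `CoverageCheckSketch`, its method names, and the representation of coverage as a map from session id to indexed entry count are hypothetical, since the real `HistoricalLogIndex` API is not shown in this thread.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; the PR's HistoricalIndexRebuilder may work differently.
final class CoverageCheckSketch {

    // Returns the expected sessions that the index does not cover.
    // Legacy sessions are skipped because they are assumed to have no
    // per-session coverage data to compare against.
    static Set<String> missingSessions(Set<String> expectedSessionIds,
                                       Set<String> legacySessionIds,
                                       Map<String, Integer> indexedEntriesPerSession) {
        var missing = new TreeSet<String>();
        for (String sessionId : expectedSessionIds) {
            if (legacySessionIds.contains(sessionId)) {
                continue;
            }
            if (indexedEntriesPerSession.getOrDefault(sessionId, 0) <= 0) {
                missing.add(sessionId);
            }
        }
        return missing;
    }

    // A rebuild is needed when any non-legacy expected session is uncovered.
    static boolean needsRebuild(Set<String> expectedSessionIds,
                                Set<String> legacySessionIds,
                                Map<String, Integer> indexedEntriesPerSession) {
        return !missingSessions(expectedSessionIds, legacySessionIds,
                indexedEntriesPerSession).isEmpty();
    }

    public static void main(String[] args) {
        var expected = Set.of("s1", "s2", "s3");
        var legacy = Set.of("s3");
        var index = Map.of("s1", 4);
        System.out.println(missingSessions(expected, legacy, index));
    }
}
```

The same comparison can be repeated after a rebuild to validate post-rebuild consistency: an empty result means every non-legacy session is covered.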

Sequence Diagram(s)

sequenceDiagram
    participant D as AceClawDaemon
    participant S as LearningMaintenanceScheduler
    participant R as LearningMaintenanceRecoveryStore
    participant M as MaintenancePipeline

    Note over D,R: Startup
    D->>R: new LearningMaintenanceRecoveryStore()
    D->>S: registerWorkspace(workspaceHash, workingDir)

    Note over S,R: Recovery check
    S->>R: needsRecovery(workspaceHash)?
    R-->>S: true/false

    alt Recovery Needed
        S->>R: markStarted(workspaceHash, trigger="recovery")
        R-->>S: RecoveryState(RUNNING)
        S->>M: execute(recoveryScopes)
        M-->>S: success / failure
        alt success
            S->>R: clear(workspaceHash)
        else failure
            S->>R: markFailed(workspaceHash, exception)
            R-->>S: RecoveryState(FAILED)
        end
    end
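
The recovery branch of the sequence above can be sketched in memory as follows. This is an illustrative approximation only: `RecoveryFlowSketch`, its `Pipeline` interface, and the `Outcome` enum are hypothetical names, and the real store persists its state to disk rather than keeping it in a map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for the persisted recovery store.
final class RecoveryFlowSketch {
    enum Status { RUNNING, FAILED }
    enum Outcome { NOT_NEEDED, RECOVERED, FAILED }

    interface Pipeline {
        void run(String workspaceHash) throws Exception;
    }

    private final Map<String, Status> states = new ConcurrentHashMap<>();

    // Any leftover state (RUNNING or FAILED) means a previous run did not finish.
    boolean needsRecovery(String workspaceHash) {
        return states.containsKey(workspaceHash);
    }

    void markStarted(String workspaceHash) { states.put(workspaceHash, Status.RUNNING); }
    void markFailed(String workspaceHash) { states.put(workspaceHash, Status.FAILED); }
    void clear(String workspaceHash) { states.remove(workspaceHash); }

    // Mirrors the diagram: check, mark started, run, then clear or mark failed.
    Outcome recoverIfNeeded(String workspaceHash, Pipeline pipeline) {
        if (!needsRecovery(workspaceHash)) {
            return Outcome.NOT_NEEDED;
        }
        markStarted(workspaceHash);
        try {
            pipeline.run(workspaceHash);
            clear(workspaceHash);
            return Outcome.RECOVERED;
        } catch (Exception e) {
            markFailed(workspaceHash);
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        var store = new RecoveryFlowSketch();
        store.markFailed("ws-a"); // simulate a run that failed before a restart
        System.out.println(store.recoverIfNeeded("ws-a", ws -> {}));
    }
}
```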

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • PR #214: Earlier wiring of LearningMaintenanceScheduler and AceClawDaemon; this change extends that integration with a persistent recovery store and workspace registration.
  • PR #216: Related updates to learning/maintenance components (scheduler, rebuilder, history store) that this PR builds upon.
  • PR #212: Prior work adding workspace-scoped awareness and workspaceHash usage across learning components; relevant to recovery/workspace registration changes.

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Block Major Correctness And Security Risks | ❌ Error | Session counter reset bug in LearningMaintenanceScheduler where RECOVERY triggers fail to decrement sessionsSinceLastRun, causing repeated maintenance attempts that violate scheduling invariants. | Remove conditional in sessionsAtStart assignment to ensure all trigger types properly consume the session counter and add test coverage for failed SESSION_COUNT followed by RECOVERY. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.25% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat(learning): harden rebuild and recovery semantics' accurately reflects the main changes: hardening historical rebuild with coverage verification and adding recovery state persistence. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #232 are met: stronger rebuild invariants via per-session coverage checks, recovery semantics via LearningMaintenanceRecoveryStore and recovery triggers, consistency verification post-rebuild, and safer partial-state handling. |
| Out of Scope Changes check | ✅ Passed | All changes align with scope: recovery state persistence, rebuild invariants, consistency checks, and resilient history persistence. No session history storage redesign or unrelated changes detected. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java`:
- Around line 68-82: The code currently swallows parse errors in load(Path)
causing an unreadable-but-present state file to be treated like "no state";
instead, change load(Path projectRoot) to distinguish a missing file from an
unreadable file by not returning Optional.empty() on parse failures: if
Files.isRegularFile(file) and reading/parsing fails, propagate an IOException
(or wrap parsing RuntimeExceptions in an IOException with the underlying cause)
so callers (e.g. needsRecovery()) can treat an unreadable existing file as
recovery-needed; apply the same change to the corresponding error-handling block
referenced around lines 85-98 so any parsing/IO failure for an existing state
file is surfaced rather than swallowed.

In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`:
- Around line 191-202: The batch failure handling currently requeues every scope
in scopesToRun and only performs per-scope cleanup after the whole loop, causing
successful scopes to be retried and recovery state to be resurrected; modify
LearningMaintenanceScheduler so you track completed scopes (e.g., a local Set
completedScopes) and perform the success cleanup (pendingScopes.remove(...) and
recoveryScopes.remove(...), and recoveryStore.clear(...)) immediately after each
successful pipeline.run(...) for that scope; in the catch block only requeue the
scope that threw and the scopes after it (i.e., scopesToRun minus
completedScopes), and do not requeue scopes present in completedScopes. Ensure
references: pipeline.run(trigger.id(), scope),
recoveryStore.clear(scope.workingDir()),
pendingScopes.remove(scope.workspaceHash(), scope),
recoveryScopes.remove(scope.workspaceHash(), scope), and
lastRunAt/sessionsSinceLastRun logic remain correct.

In
`@aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.java`:
- Around line 222-231: The test races because waitForTriggers(triggers, 1) only
ensures the pipeline lambda appended to triggers but not that pipeline.run
completed and recoveryStore.clear executed; after
scheduler.registerWorkspace("ws-a", workspace) add a call to
waitForMaintenanceToSettle(scheduler) (or equivalent helper) before asserting
recoveryStore.load(workspace). In short: after waitForTriggers(triggers, 1) call
waitForMaintenanceToSettle(scheduler) to wait for maintenance to finish, then
perform assertThat(recoveryStore.load(workspace)).isEmpty().
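
The test race described in the last finding is usually closed by waiting on a condition rather than on the first observable side effect. The thread does not show `waitForMaintenanceToSettle`, so the sketch below uses a hypothetical generic helper, `AwaitSketch.awaitCondition`, to show the polling idea under that assumption.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical polling helper; not the test helper used in the PR.
final class AwaitSketch {

    // Polls until the condition holds or the timeout elapses.
    // Returns true if the condition was observed, false on timeout.
    static boolean awaitCondition(BooleanSupplier condition, Duration timeout)
            throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            Thread.sleep(10);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        var settled = new AtomicBoolean(false);
        // Stand-in for a background maintenance run that finishes later.
        Thread.ofVirtual().start(() -> settled.set(true));
        System.out.println(awaitCondition(settled::get, Duration.ofSeconds(2)));
    }
}
```

In a test, the condition would be whatever signals that the run has fully finished (for example, the recovery state being cleared), and the assertion would follow only after the helper returns true.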

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eec73ed1-4cb5-4667-a68c-1fef876ebdfb

📥 Commits

Reviewing files that changed from the base of the PR and between 2c9e5f9 and 012f902.

📒 Files selected for processing (10)
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/AceClawDaemon.java
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/HistoricalIndexRebuilder.java
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/SessionHistoryStore.java
  • aceclaw-daemon/src/test/java/dev/aceclaw/daemon/HistoricalIndexRebuilderTest.java
  • aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStoreTest.java
  • aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.java
  • aceclaw-daemon/src/test/java/dev/aceclaw/daemon/SessionHistoryStoreTest.java
  • aceclaw-memory/src/main/java/dev/aceclaw/memory/HistoricalLogIndex.java

@xinhuagu

@coderabbitai review

coderabbitai Bot commented Mar 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java (1)

24-31: ⚠️ Potential issue | 🟠 Major

A corrupt recovery marker currently prevents recovery from ever starting.

needsRecovery() now treats an unreadable state file as recovery-needed, but markStarted() and markFailed() both call load(projectRoot) again before writing. In LearningMaintenanceScheduler.tryTrigger(...), that exception is thrown before pipeline.run(...), so malformed JSON becomes an endless recovery loop instead of a one-time cleanup pass. Existing blank files are also still collapsed to Optional.empty(), which skips recovery entirely.

Please keep “missing” separate from “present but unusable”, and let the write path fall back to a fresh state when the previous file cannot be parsed.

Also applies to: 47-50, 74-84
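
One way to keep "missing" separate from "present but unusable", while letting the write path fall back to a fresh state, is sketched below. This is an assumption-laden illustration: the class `RecoveryStateFileSketch`, the `State` record, the `attempts` field, and the pipe-delimited file format are all hypothetical (the real store uses JSON and its fields are not shown here).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch; file format and field names are illustrative only.
final class RecoveryStateFileSketch {
    record State(String status, String trigger, int attempts) {}

    // Missing file -> Optional.empty(). Present but unparseable -> IOException.
    static Optional<State> load(Path file) throws IOException {
        if (!Files.isRegularFile(file)) {
            return Optional.empty();
        }
        String[] parts = Files.readString(file).strip().split("\\|");
        try {
            if (parts.length != 3 || parts[0].isEmpty()) {
                throw new IllegalArgumentException("expected status|trigger|attempts");
            }
            return Optional.of(new State(parts[0], parts[1], Integer.parseInt(parts[2])));
        } catch (RuntimeException e) {
            // Surface parse failures instead of collapsing them into "no state".
            throw new IOException("Unreadable recovery state: " + file, e);
        }
    }

    // An unreadable existing file counts as recovery-needed.
    static boolean needsRecovery(Path file) {
        try {
            return load(file).isPresent();
        } catch (IOException e) {
            return true;
        }
    }

    // Write path: a corrupt previous file is discarded, not rethrown,
    // so recovery can start instead of failing before the pipeline runs.
    static State markStarted(Path file, String trigger) throws IOException {
        int previousAttempts;
        try {
            previousAttempts = load(file).map(State::attempts).orElse(0);
        } catch (IOException corrupt) {
            previousAttempts = 0; // fall back to a fresh state
        }
        State next = new State("RUNNING", trigger, previousAttempts + 1);
        Files.writeString(file, next.status() + "|" + next.trigger() + "|" + next.attempts());
        return next;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("recovery-demo").resolve("state.txt");
        Files.writeString(file, "{not valid");
        System.out.println(needsRecovery(file));
        System.out.println(markStarted(file, "recovery"));
    }
}
```

Note that a blank file also fails the field-count check here, so it is reported as recovery-needed rather than silently treated as absent.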

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java`
around lines 24 - 31, markStarted and markFailed call load(projectRoot) and
treat unreadable/malformed state files the same as "missing", causing malformed
JSON to block recovery; change both methods (markStarted, markFailed) to
distinguish "present but unreadable" from "missing" by catching parse/IO
exceptions from load(projectRoot) and proceeding with a fresh RecoveryState
fallback for the write path while still allowing needsRecovery() to return true
for unreadable files; specifically, modify calls to load(...) used inside
markStarted and markFailed to handle Optional.empty() (missing) vs exceptions
(corrupt) separately—log the parse error, create a new default RecoveryState to
continue writing, and ensure LearningMaintenanceScheduler.tryTrigger can proceed
instead of failing on malformed state files.
aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java (1)

217-229: ⚠️ Potential issue | 🟠 Major

Persist the scopes skipped after the first batch failure.

Only activeScope is written to recoveryStore. The scopes after it are requeued in recoveryScopes only in memory, so a daemon restart before the next recovery pass drops them completely and breaks the new resume-after-interruption semantics.

Possible fix
                 for (var scope : scopesToRun) {
                     if (completedScopes.contains(scope)) {
                         continue;
                     }
-                    if (scope.equals(activeScope) && recoveryStore != null) {
+                    if (recoveryStore != null) {
                         try {
                             recoveryStore.markFailed(scope.workingDir(), scope.workspaceHash(), trigger.id(), e);
                         } catch (Exception ignored) {
                             // best-effort recovery metadata only
                         }
                     }
                     recoveryScopes.put(scope.workspaceHash(), scope);
                 }
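
The intent of the diff above, that every scope not yet completed is recorded for recovery when the batch aborts, can be shown with a simplified sketch. Scopes are plain strings here and the persisted store is replaced by a returned set; `BatchFailureSketch`, `runBatch`, and `Result` are hypothetical names, not the scheduler's API.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical simplification of the scheduler's batch loop.
final class BatchFailureSketch {
    record Result(Set<String> completed, Set<String> persistedForRecovery) {}

    static Result runBatch(List<String> scopesToRun, Consumer<String> pipeline) {
        var completed = new LinkedHashSet<String>();
        var persistedForRecovery = new LinkedHashSet<String>();
        try {
            for (String scope : scopesToRun) {
                pipeline.accept(scope);
                // Success cleanup happens per scope, immediately.
                completed.add(scope);
            }
        } catch (RuntimeException e) {
            // Record the failing scope and every scope skipped after it,
            // so a restart before the next pass does not lose them.
            for (String scope : scopesToRun) {
                if (!completed.contains(scope)) {
                    persistedForRecovery.add(scope);
                }
            }
        }
        return new Result(completed, persistedForRecovery);
    }

    public static void main(String[] args) {
        Result result = runBatch(List.of("a", "b", "c"), scope -> {
            if (scope.equals("b")) {
                throw new IllegalStateException("simulated failure in " + scope);
            }
        });
        System.out.println(result);
    }
}
```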
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`
around lines 217 - 229, When an activeScope failure occurs the code only
persists that single scope to recoveryStore, leaving subsequent scopes only
in-memory; change the block inside the for-loop handling
scope.equals(activeScope) so that after calling recoveryStore.markFailed(...)
for the activeScope you also persist every remaining scope in scopesToRun that
is not in completedScopes to recoveryStore (call the same
recoveryStore.markFailed API with scope.workingDir(), scope.workspaceHash(),
trigger.id() and a clear placeholder exception/message indicating the scope was
skipped due to the earlier failure), and still add them into recoveryScopes as
currently done; reference the symbols scopesToRun, completedScopes, activeScope,
recoveryStore.markFailed(...), recoveryScopes.put(...), scope.workingDir(),
scope.workspaceHash(), trigger.id(), and the caught exception variable e when
persisting the activeScope.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java`:
- Around line 188-208: The code sets sessionsAtStart to 0 for Trigger.RECOVERY
which prevents recovery runs from consuming the pre-run session counter; change
the logic so sessionsAtStart captures the current sessionsSinceLastRun value for
all triggers (remove the special-case zeroing for Trigger.RECOVERY) so that the
later update sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current -
sessionsAtStart)) correctly reduces the counter after successful recovery runs;
adjust the assignment of sessionsAtStart in LearningMaintenanceScheduler (the
variable named sessionsAtStart and the conditional using Trigger.RECOVERY)
accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 13a0d824-b28b-43fd-839c-4ecde9c4a7d6

📥 Commits

Reviewing files that changed from the base of the PR and between 012f902 and b1f171d.

📒 Files selected for processing (4)
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStore.java
  • aceclaw-daemon/src/main/java/dev/aceclaw/daemon/LearningMaintenanceScheduler.java
  • aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceRecoveryStoreTest.java
  • aceclaw-daemon/src/test/java/dev/aceclaw/daemon/LearningMaintenanceSchedulerTest.java

Comment on lines +188 to 208
int sessionsAtStart = trigger == Trigger.RECOVERY ? 0 : sessionsSinceLastRun.get();

Thread.ofVirtual().name("learning-maintenance-" + trigger.id()).start(() -> {
    var completedScopes = new HashSet<WorkspaceScope>();
    WorkspaceScope activeScope = null;
    try {
        for (var scope : scopesToRun) {
            activeScope = scope;
            if (recoveryStore != null) {
                recoveryStore.markStarted(scope.workingDir(), scope.workspaceHash(), trigger.id());
            }
            pipeline.run(trigger.id(), scope);
            if (recoveryStore != null) {
                recoveryStore.clear(scope.workingDir());
            }
            pendingScopes.remove(scope.workspaceHash(), scope);
            recoveryScopes.remove(scope.workspaceHash(), scope);
            completedScopes.add(scope);
        }
        lastRunAt = clock.instant();
        sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current - sessionsAtStart));

⚠️ Potential issue | 🟠 Major

Recovery runs should also consume the pre-run session counter.

RECOVERY snapshots sessionsAtStart as 0, so a failed SESSION_COUNT run that later succeeds through recovery leaves the old count in sessionsSinceLastRun. The next close can retrigger maintenance immediately even though a full maintenance pass just completed.

Possible fix
-        int sessionsAtStart = trigger == Trigger.RECOVERY ? 0 : sessionsSinceLastRun.get();
+        int sessionsAtStart = sessionsSinceLastRun.get();
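
Why the snapshot-then-subtract pattern matters can be seen in a stripped-down sketch. The class `SessionCounterSketch` and its methods are hypothetical; only the `sessionsSinceLastRun` / `sessionsAtStart` arithmetic mirrors the snippet quoted in this comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical reduction of the scheduler's session counter handling.
final class SessionCounterSketch {
    private final AtomicInteger sessionsSinceLastRun = new AtomicInteger();

    void onSessionClosed() {
        sessionsSinceLastRun.incrementAndGet();
    }

    int pending() {
        return sessionsSinceLastRun.get();
    }

    // Snapshot the counter for every trigger type, then subtract the snapshot
    // after a successful run. Sessions closed during the run stay counted.
    void runMaintenance(Runnable pipeline) {
        int sessionsAtStart = sessionsSinceLastRun.get();
        pipeline.run(); // if this throws, the counter is left untouched
        sessionsSinceLastRun.updateAndGet(current -> Math.max(0, current - sessionsAtStart));
    }

    public static void main(String[] args) {
        var counter = new SessionCounterSketch();
        counter.onSessionClosed();
        counter.onSessionClosed();
        counter.onSessionClosed();
        // One more session closes while maintenance is running.
        counter.runMaintenance(counter::onSessionClosed);
        System.out.println(counter.pending()); // prints 1
    }
}
```

With a snapshot of 0, as in the RECOVERY special case, the subtraction is a no-op and the three pre-run sessions would remain counted after a successful pass.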

@xinhuagu xinhuagu merged commit dbedb5d into main Mar 14, 2026
4 checks passed
@xinhuagu xinhuagu deleted the codex/issue-232-learning-rebuild-hardening branch March 14, 2026 08:31
Development

Successfully merging this pull request may close these issues.

feat(learning): historical rebuild and recovery hardening