fix(idalib): don't reap busy headless workers on health-probe timeout #461

Open
lich0821 wants to merge 1 commit into mrexodia:main from lich0821:fix/idalib-busy-worker-reap

Problem

IdalibSupervisor.resolve_session() runs a health probe (TCP connect +
JSON-RPC ping) before forwarding every tool call. Before this change, the first
probe failure immediately triggered _terminate_worker() and raised "… is not reachable".

But a headless idalib worker is single-threaded. While it is busy running
auto-analysis, building caches, or decompiling, it physically cannot service
the ping until it yields. The supervisor interpreted that silence as death and
killed a perfectly healthy worker, destroying the session.

This is easy to hit with the warmup flags enabled on a large database: open
with run_auto_analysis=true (the default), and the very analysis the open
requested keeps the worker busy long enough that the next probe reaps it.

Fix

Treat "busy" and "dead" as different states. The OS process liveness
(session.is_alive()), not a single ping, is the source of truth:

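| State  | Detection                            | Action                                        |
|--------|--------------------------------------|-----------------------------------------------|
| Dead   | process exited                       | unregister + reap, raise not reachable        |
| Busy   | alive, ping fails, within grace      | keep, retry w/ backoff, raise retryable error |
| Wedged | alive, ping keeps failing past grace | unregister + reap, raise wedged               |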

A grace window (IDA_MCP_WEDGED_GRACE_SEC, default 300s) ensures a genuinely
stuck worker (e.g. an analysis loop) still can't pin a worker slot forever —
so the limited worker pool can't be exhausted by zombies. Set it to 0 to
never reap a live worker (pure keep-forever, for single-user long-lived setups).
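The state table and grace window above can be sketched as a small pure function. This is only an illustration under stated assumptions: `WorkerState`, `classify_worker`, and its parameters are hypothetical names, not the supervisor's actual API, and the strict `>` comparison against the grace window is a guess at what "past grace" means.

```python
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    HEALTHY = "healthy"
    DEAD = "dead"
    BUSY = "busy"
    WEDGED = "wedged"


def classify_worker(
    process_alive: bool,
    ping_ok: bool,
    first_failure_at: Optional[float],
    now: float,
    grace_sec: float = 300.0,
) -> WorkerState:
    """Classify a worker from process liveness, one ping result, and failure age.

    first_failure_at is the timestamp of the first failed ping in the current
    run of failures (None if this is the first one). Hypothetical helper.
    """
    if not process_alive:
        # Process liveness is the source of truth: an exited process is dead.
        return WorkerState.DEAD
    if ping_ok:
        return WorkerState.HEALTHY
    if grace_sec <= 0:
        # 0 means "never reap a live worker"; treating negative values the
        # same way is an assumption of this sketch.
        return WorkerState.BUSY
    started = now if first_failure_at is None else first_failure_at
    if now - started > grace_sec:
        return WorkerState.WEDGED
    return WorkerState.BUSY
```

With the default 300s grace, a live worker whose pings have been failing for 50s is classified as busy, while one that has been failing for 301s is classified as wedged.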

_prune_dead_worker_sessions_locked() is given the same busy-vs-dead
distinction: it now prunes only processes that have actually exited.

New / changed knobs (all env-overridable, backward-compatible defaults)

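| Env                          | Default | Was                             |
|------------------------------|---------|---------------------------------|
| IDA_MCP_HEALTH_TCP_TIMEOUT   | 2.0     | hard-coded 0.5                  |
| IDA_MCP_HEALTH_RPC_TIMEOUT   | 10.0    | hard-coded 2.0                  |
| IDA_MCP_HEALTH_RETRIES       | 3       | (none; reaped on first failure) |
| IDA_MCP_HEALTH_RETRY_BACKOFF | 1.0     |                                 |
| IDA_MCP_WEDGED_GRACE_SEC     | 300     |                                 |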
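As a rough sketch of how such env-overridable knobs might be read, the snippet below loads the five variables with the defaults listed above. The `HealthConfig` class and its field names are hypothetical, and falling back to the default on an unparsable value is an assumption of this sketch, not documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, TypeVar

T = TypeVar("T", int, float)


def _read(env: Mapping[str, str], name: str, default: T, cast: Callable[[str], T]) -> T:
    """Return cast(env[name]), or the default if unset, empty, or unparsable."""
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        return cast(raw)
    except ValueError:
        return default  # assumption: bad values silently fall back


@dataclass(frozen=True)
class HealthConfig:
    tcp_timeout: float = 2.0
    rpc_timeout: float = 10.0
    retries: int = 3
    retry_backoff: float = 1.0
    wedged_grace_sec: float = 300.0

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "HealthConfig":
        return cls(
            tcp_timeout=_read(env, "IDA_MCP_HEALTH_TCP_TIMEOUT", 2.0, float),
            rpc_timeout=_read(env, "IDA_MCP_HEALTH_RPC_TIMEOUT", 10.0, float),
            retries=_read(env, "IDA_MCP_HEALTH_RETRIES", 3, int),
            retry_backoff=_read(env, "IDA_MCP_HEALTH_RETRY_BACKOFF", 1.0, float),
            wedged_grace_sec=_read(env, "IDA_MCP_WEDGED_GRACE_SEC", 300.0, float),
        )
```

Passing an explicit mapping, for example `HealthConfig.from_env({"IDA_MCP_WEDGED_GRACE_SEC": "0"})`, keeps the sketch easy to test without touching the real process environment.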

Plus: idb_open(idle_ttl_sec=0) (or any value <= 0) now means "never self-exit",
for workers that a user stays attached to across many sessions. Positive values keep
the existing MIN-clamp + load-time-padding behavior.
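The idle-TTL rule might look roughly like the sketch below. Everything here is illustrative: `effective_idle_ttl`, `should_self_exit`, and the `MIN_IDLE_TTL_SEC` value of 60 are hypothetical, and the exact clamp and padding formula is an assumption, since the description only names the behavior without giving the arithmetic.

```python
from typing import Optional

# Hypothetical floor for positive TTLs; the real minimum is not stated here.
MIN_IDLE_TTL_SEC = 60.0


def effective_idle_ttl(idle_ttl_sec: float, load_time_sec: float = 0.0) -> Optional[float]:
    """Return the TTL to enforce, or None when the worker must never self-exit."""
    if idle_ttl_sec <= 0:
        # Never-exit wins regardless of how long the database took to load.
        return None
    # Assumed shape of "MIN-clamp + load-time padding": clamp first, then pad.
    return max(idle_ttl_sec, MIN_IDLE_TTL_SEC) + max(load_time_sec, 0.0)


def should_self_exit(idle_for_sec: float, ttl: Optional[float]) -> bool:
    """An idle check that can never fire while the TTL is disabled."""
    return ttl is not None and idle_for_sec >= ttl
```

Representing "never" as `None` rather than a very large number keeps the disabled case explicit, so the idle check cannot fire by accident after a long enough uptime.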

Compatibility

No default behavior changes for healthy workers — they answer the probe on the
first try and take the unchanged fast path. The only behavioral change is for an
unreachable worker: previously reaped on the first failed ping, now retried
and only reaped if its process has exited or it stays wedged past the grace
window. The widened timeouts only ever add latency on the unreachable path.
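The fast path and the retry path can be illustrated with a minimal retry loop. This is a sketch under assumptions: `probe_with_retries` is a hypothetical helper, `retries` is read as the total number of attempts, and the linear backoff schedule is a guess (the description only says "retry w/ backoff").

```python
import time
from typing import Callable


def probe_with_retries(
    probe: Callable[[], bool],
    retries: int = 3,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True as soon as one probe succeeds.

    A healthy worker answers the first probe, so no sleep ever happens on
    the fast path; delays are only paid between failed attempts.
    """
    attempts = max(1, retries)
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            # Linear backoff (1x, 2x, ...) is an assumption of this sketch.
            sleep(backoff * (attempt + 1))
    return False
```

Injecting `sleep` makes the added latency observable in tests: a probe that succeeds immediately records no sleeps, while one that fails twice before succeeding records delays of 1.0 and 2.0 seconds with the defaults.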

Known limitation (pre-existing, out of scope here)

While a worker is opening a very large database, open_session() blocks on the
synchronous call_worker_tool(worker, "idb_open", …) until the open completes.
Because the HTTP transport services requests on a single thread, unrelated calls
(tools/list, idb_list, even health probes for other sessions) queue behind
that open and appear to hang until it finishes. This is independent of the change
here — the fix operates on the health-probe decision, but during a blocking
open requests never reach that decision point.

This PR does not address that serialization; it is called out only for honesty.
In practice the fix's value shows precisely here: a client that retries (and
times out) repeatedly against a worker busy opening a large DB no longer gets
that worker reaped out from under it — the session survives the whole open. A
follow-up could make long opens non-blocking, but that is a larger transport
change orthogonal to busy-worker reaping.

Tests

tests/test_idalib_supervisor.py and tests/test_worker_lifecycle.py:

  • renamed …removes_unreachable_worker to …removes_dead_worker (dead = exited)
  • new: keeps_busy_but_alive_worker, retries_then_recovers_busy_worker,
    reaps_wedged_worker_after_grace, grace_zero_never_reaps_live_worker,
    recovery_clears_wedged_tracking
  • lifecycle: zero_or_negative_means_never_exit,
    never_exit_overrides_load_time, clamps_small_positive_*,
    check_never_fires_when_ttl_disabled
  • updated prunes_unreachable_existing_mapping to use a dead (exited) process

pytest tests/: 198 passed, 114 subtests (Python 3.14, pytest 9.1.1).

resolve_session() probed each worker (TCP + JSON-RPC ping) before forwarding
a tool call and reaped it on the first failure. A worker mid auto-analysis or
mid decompile is single-threaded and cannot answer the ping until it yields,
so a busy worker was misclassified as dead and torn down — losing the live
session. With aggressive warmup (run_auto_analysis/build_caches/init_hexrays)
on a large database this is reliably reproducible: the worker is reaped during
its own analysis.

Distinguish three worker states instead of two:
  - process exited            -> dead, reap ("not reachable")
  - process alive, no ping    -> busy, keep and retry with backoff
  - alive + unresponsive long -> wedged, reap after a grace window

Also:
  - health-probe timeouts widened and made env-overridable
    (IDA_MCP_HEALTH_TCP_TIMEOUT=2.0, IDA_MCP_HEALTH_RPC_TIMEOUT=10.0)
  - retry with backoff before reaping
    (IDA_MCP_HEALTH_RETRIES=3, IDA_MCP_HEALTH_RETRY_BACKOFF=1.0)
  - wedged grace window IDA_MCP_WEDGED_GRACE_SEC=300 (0 = never reap a
    live worker)
  - _prune_dead_worker_sessions_locked prunes only exited processes
    (is_alive), not unreachable-but-live ones
  - idle_ttl_sec <= 0 now means "never self-exit" for long-lived workers

All defaults preserve existing behavior for healthy workers; only the
handling of an unreachable worker changes (now conservative instead of
trigger-happy).