fix(idalib): don't reap busy headless workers on health-probe timeout by lich0821 · Pull Request #461 · mrexodia/ida-pro-mcp

lich0821 · 2026-06-24T07:44:24Z

Problem

IdalibSupervisor.resolve_session() runs a health probe (TCP connect +
JSON-RPC ping) before forwarding every tool call. On the first probe failure
it immediately called _terminate_worker() and raised "… is not reachable".

But a headless idalib worker is single-threaded. While it is busy running
auto-analysis, building caches, or decompiling, it physically cannot service
the ping until it yields. The supervisor interpreted that silence as death and
killed a perfectly healthy worker, destroying the session.

This is easy to hit with the warmup flags enabled on a large database: open
with run_auto_analysis=true (the default), and the very analysis the open
requested keeps the worker busy long enough that the next probe reaps it.

Fix

Treat "busy" and "dead" as different states. The OS process liveness
(session.is_alive()), not a single ping, is the source of truth:

State	Detection	Action
Dead	process exited	unregister + reap, raise `not reachable`
Busy	alive, ping fails, within grace	keep, retry w/ backoff, raise retryable error
Wedged	alive, ping keeps failing past grace	unregister + reap, raise `wedged`

A grace window (IDA_MCP_WEDGED_GRACE_SEC, default 300s) ensures a genuinely
stuck worker (e.g. an analysis loop) still can't pin a worker slot forever —
so the limited worker pool can't be exhausted by zombies. Set it to 0 to
never reap a live worker (pure keep-forever, for single-user long-lived setups).

_prune_dead_worker_sessions_locked() is given the same busy-vs-dead
distinction: it now prunes only processes that have actually exited.

New / changed knobs (all env-overridable, backward-compatible defaults)

Env	Default	Was
`IDA_MCP_HEALTH_TCP_TIMEOUT`	`2.0`	hard-coded `0.5`
`IDA_MCP_HEALTH_RPC_TIMEOUT`	`10.0`	hard-coded `2.0`
`IDA_MCP_HEALTH_RETRIES`	`3`	(none — reaped on first failure)
`IDA_MCP_HEALTH_RETRY_BACKOFF`	`1.0`	—
`IDA_MCP_WEDGED_GRACE_SEC`	`300`	—

Plus: idb_open(idle_ttl_sec=0) (or any <= 0) now means "never self-exit",
for workers a user keeps attached to across many sessions. Positive values keep
the existing MIN-clamp + load-time-padding behavior.

Compatibility

No default behavior changes for healthy workers — they answer the probe on the
first try and take the unchanged fast path. The only behavioral change is for an
unreachable worker: previously reaped on the first failed ping, now retried
and only reaped if its process has exited or it stays wedged past the grace
window. The widened timeouts only ever add latency on the unreachable path.

Known limitation (pre-existing, out of scope here)

While a worker is opening a very large database, open_session() blocks on the
synchronous call_worker_tool(worker, "idb_open", …) until the open completes.
Because the HTTP transport services requests on a single thread, unrelated calls
(tools/list, idb_list, even health probes for other sessions) queue behind
that open and appear to hang until it finishes. This is independent of the change
here — the fix operates on the health-probe decision, but during a blocking
open requests never reach that decision point.

This PR does not address that serialization; it is called out only for honesty.
In practice the fix's value shows precisely here: a client that retries (and
times out) repeatedly against a worker busy opening a large DB no longer gets
that worker reaped out from under it — the session survives the whole open. A
follow-up could make long opens non-blocking, but that is a larger transport
change orthogonal to busy-worker reaping.

Tests

tests/test_idalib_supervisor.py and tests/test_worker_lifecycle.py:

renamed …removes_unreachable_worker → …removes_dead_worker (dead = exited)
new: keeps_busy_but_alive_worker, retries_then_recovers_busy_worker,
reaps_wedged_worker_after_grace, grace_zero_never_reaps_live_worker,
recovery_clears_wedged_tracking
lifecycle: zero_or_negative_means_never_exit,
never_exit_overrides_load_time, clamps_small_positive_*,
check_never_fires_when_ttl_disabled
updated prunes_unreachable_existing_mapping to use a dead (exited) process

pytest tests/ → 198 passed, 114 subtests (Python 3.14, pytest 9.1.1).

resolve_session() probed each worker (TCP + JSON-RPC ping) before forwarding a tool call and reaped it on the first failure. A worker mid auto-analysis or mid decompile is single-threaded and cannot answer the ping until it yields, so a busy worker was misclassified as dead and torn down — losing the live session. With aggressive warmup (run_auto_analysis/build_caches/init_hexrays) on a large database this is reliably reproducible: the worker is reaped during its own analysis. Distinguish three worker states instead of two: - process exited -> dead, reap ("not reachable") - process alive, no ping -> busy, keep and retry with backoff - alive + unresponsive long -> wedged, reap after a grace window Also: - health-probe timeouts widened and made env-overridable (IDA_MCP_HEALTH_TCP_TIMEOUT=2.0, IDA_MCP_HEALTH_RPC_TIMEOUT=10.0) - retry with backoff before reaping (IDA_MCP_HEALTH_RETRIES=3, IDA_MCP_HEALTH_RETRY_BACKOFF=1.0) - wedged grace window IDA_MCP_WEDGED_GRACE_SEC=300 (0 = never reap a live worker) - _prune_dead_worker_sessions_locked prunes only exited processes (is_alive), not unreachable-but-live ones - idle_ttl_sec <= 0 now means "never self-exit" for long-lived workers All defaults preserve existing behavior for healthy workers; only the handling of an unreachable worker changes (now conservative instead of trigger-happy).

lich0821 requested a deployment to external-pr June 24, 2026 07:44 — with GitHub Actions Waiting

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(idalib): don't reap busy headless workers on health-probe timeout#461

fix(idalib): don't reap busy headless workers on health-probe timeout#461
lich0821 wants to merge 1 commit into
mrexodia:mainfrom
lich0821:fix/idalib-busy-worker-reap

lich0821 commented Jun 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

lich0821 commented Jun 24, 2026

Problem

Fix

New / changed knobs (all env-overridable, backward-compatible defaults)

Compatibility

Known limitation (pre-existing, out of scope here)

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant