fix(idalib): don't reap busy headless workers on health-probe timeout#461
Open
lich0821 wants to merge 1 commit into
Open
fix(idalib): don't reap busy headless workers on health-probe timeout#461lich0821 wants to merge 1 commit into
lich0821 wants to merge 1 commit into
Conversation
resolve_session() probed each worker (TCP + JSON-RPC ping) before forwarding
a tool call and reaped it on the first failure. A worker mid auto-analysis or
mid decompile is single-threaded and cannot answer the ping until it yields,
so a busy worker was misclassified as dead and torn down — losing the live
session. With aggressive warmup (run_auto_analysis/build_caches/init_hexrays)
on a large database this is reliably reproducible: the worker is reaped during
its own analysis.
Distinguish three worker states instead of two:
- process exited -> dead, reap ("not reachable")
- process alive, no ping -> busy, keep and retry with backoff
- alive + unresponsive long -> wedged, reap after a grace window
Also:
- health-probe timeouts widened and made env-overridable
(IDA_MCP_HEALTH_TCP_TIMEOUT=2.0, IDA_MCP_HEALTH_RPC_TIMEOUT=10.0)
- retry with backoff before reaping
(IDA_MCP_HEALTH_RETRIES=3, IDA_MCP_HEALTH_RETRY_BACKOFF=1.0)
- wedged grace window IDA_MCP_WEDGED_GRACE_SEC=300 (0 = never reap a
live worker)
- _prune_dead_worker_sessions_locked prunes only exited processes
(is_alive), not unreachable-but-live ones
- idle_ttl_sec <= 0 now means "never self-exit" for long-lived workers
All defaults preserve existing behavior for healthy workers; only the
handling of an unreachable worker changes (now conservative instead of
trigger-happy).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
IdalibSupervisor.resolve_session()runs a health probe (TCP connect +JSON-RPC
ping) before forwarding every tool call. On the first probe failureit immediately called
_terminate_worker()and raised"… is not reachable".But a headless idalib worker is single-threaded. While it is busy running
auto-analysis, building caches, or decompiling, it physically cannot service
the ping until it yields. The supervisor interpreted that silence as death and
killed a perfectly healthy worker, destroying the session.
This is easy to hit with the warmup flags enabled on a large database: open
with
run_auto_analysis=true(the default), and the very analysis the openrequested keeps the worker busy long enough that the next probe reaps it.
Fix
Treat "busy" and "dead" as different states. The OS process liveness
(
session.is_alive()), not a single ping, is the source of truth:not reachablewedgedA grace window (
IDA_MCP_WEDGED_GRACE_SEC, default 300s) ensures a genuinelystuck worker (e.g. an analysis loop) still can't pin a worker slot forever —
so the limited worker pool can't be exhausted by zombies. Set it to
0tonever reap a live worker (pure keep-forever, for single-user long-lived setups).
_prune_dead_worker_sessions_locked()is given the same busy-vs-deaddistinction: it now prunes only processes that have actually exited.
New / changed knobs (all env-overridable, backward-compatible defaults)
IDA_MCP_HEALTH_TCP_TIMEOUT2.00.5IDA_MCP_HEALTH_RPC_TIMEOUT10.02.0IDA_MCP_HEALTH_RETRIES3IDA_MCP_HEALTH_RETRY_BACKOFF1.0IDA_MCP_WEDGED_GRACE_SEC300Plus:
idb_open(idle_ttl_sec=0)(or any<= 0) now means "never self-exit",for workers a user keeps attached to across many sessions. Positive values keep
the existing MIN-clamp + load-time-padding behavior.
Compatibility
No default behavior changes for healthy workers — they answer the probe on the
first try and take the unchanged fast path. The only behavioral change is for an
unreachable worker: previously reaped on the first failed ping, now retried
and only reaped if its process has exited or it stays wedged past the grace
window. The widened timeouts only ever add latency on the unreachable path.
Known limitation (pre-existing, out of scope here)
While a worker is opening a very large database,
open_session()blocks on thesynchronous
call_worker_tool(worker, "idb_open", …)until the open completes.Because the HTTP transport services requests on a single thread, unrelated calls
(
tools/list,idb_list, even health probes for other sessions) queue behindthat open and appear to hang until it finishes. This is independent of the change
here — the fix operates on the health-probe decision, but during a blocking
open requests never reach that decision point.
This PR does not address that serialization; it is called out only for honesty.
In practice the fix's value shows precisely here: a client that retries (and
times out) repeatedly against a worker busy opening a large DB no longer gets
that worker reaped out from under it — the session survives the whole open. A
follow-up could make long opens non-blocking, but that is a larger transport
change orthogonal to busy-worker reaping.
Tests
tests/test_idalib_supervisor.pyandtests/test_worker_lifecycle.py:…removes_unreachable_worker→…removes_dead_worker(dead = exited)keeps_busy_but_alive_worker,retries_then_recovers_busy_worker,reaps_wedged_worker_after_grace,grace_zero_never_reaps_live_worker,recovery_clears_wedged_trackingzero_or_negative_means_never_exit,never_exit_overrides_load_time,clamps_small_positive_*,check_never_fires_when_ttl_disabledprunes_unreachable_existing_mappingto use a dead (exited) processpytest tests/→ 198 passed, 114 subtests (Python 3.14, pytest 9.1.1).