feat(router): add ttft_timeout to detect hung providers on non-streaming calls#30337
feat(router): add ttft_timeout to detect hung providers on non-streaming calls#30337TheCodeWrangler wants to merge 13 commits into
Conversation
Greptile SummaryAdds
Confidence Score: 5/5Safe to merge. All behavior changes are opt-in (both parameters default to Every issue flagged in prior review threads — absolute TTFT deadline, tool-call detection, explicit No files require special attention.
|
| Filename | Overview |
|---|---|
| litellm/router.py | Adds _collect_stream_with_ttft_timeout, _get_ttft_timeout, _get_stream_idle_timeout, and wires them into _acompletion. All previously flagged issues (absolute deadline, tool-call detection, is None resolution chain, Phase 2 error propagation, stream_idle_timeout before first token, semaphore held through reconstruction) are correctly addressed in this revision. |
| litellm/types/utils.py | Adds ttft_timeout and stream_idle_timeout to all_litellm_params so they are stripped before the request reaches upstream providers. Correct and minimal change. |
| tests/test_litellm/test_router.py | Adds 12 new mock-only tests covering happy path, hung provider, preamble deadline, empty stream, _acompletion intercept, stalled mid-stream, slow-first-token (idle only), both timeouts, semaphore contract, caller cancellation, failed_deployment_id tagging, and resolution-chain precedence. No real network calls. |
| tests/test_litellm/test_filter_out_litellm_params.py | Adds a test asserting both new params are filtered out while genuine provider params are kept. Correct and targeted. |
Reviews (7): Last reviewed commit: "refactor(router): drop redundant ttft de..." | Re-trigger Greptile
|
Thanks for the detailed review @greptile-apps. All three issues were addressed in the follow-up commit (c0075e3): 1. Per-chunk timeout reset loop = asyncio.get_running_loop()
deadline = loop.time() + ttft_timeout
# ...
remaining = deadline - loop.time()
if remaining <= 0:
raise litellm.Timeout(...)
chunk = await asyncio.wait_for(aiter.__anext__(), timeout=remaining)Preamble chunks (role deltas, empty tool-call deltas) no longer reset the clock — the full 2. Tool-call responses never marked as first token delta = chunk.choices[0].delta if chunk.choices else None
if delta and (delta.content or delta.tool_calls):
first_token_received = True3. Silent mid-stream error swallowing in Phase 2 # Phase 2: let real errors propagate
async for chunk in aiter:
chunks.append(chunk)Also switched |
|
The three issues are all cleanly addressed. A quick walkthrough of what I see in c0075e3: Deadline-based clock — correct. The Tool-call first-token detection — correct. Phase 2 error propagation — correct. The bare One minor point worth noting for a follow-up: return (
kwargs.get("ttft_timeout", None)
or data.get("ttft_timeout", None)
or self.ttft_timeout
...
)A caller passing for source in (kwargs.get("ttft_timeout"), data.get("ttft_timeout"), self.ttft_timeout, self.default_litellm_params.get("ttft_timeout")):
if source is not None:
return source
return NoneThis is a minor polish item rather than a blocker. The core logic of the feature is sound — the deadline approach, tool-call handling, and error propagation are all solid. Tip: You can customize Greptile's behavior for this repo with |
PR overviewAll previously flagged issues have been addressed. No open security concerns remain on this pull request. Security reviewNo open security issues remain on this pull request. Fixed/addressed: 3 · PR risk: 0/10 |
c0075e3 to
d368fe9
Compare
|
Good catch on both points, addressed in 29e6de0. On the or-chain: _get_ttft_timeout now iterates sources with explicit is not None checks so ttft_timeout=0.0 is honored rather than silently skipped. On the content-policy bypass: the reconstructed ModelResponse from _collect_stream_with_ttft_timeout now goes through _should_raise_content_policy_error before being returned, identical to the existing check on the native non-streaming path. |
…ing calls Adds ttft_timeout parameter to Router. When set, non-streaming calls internally switch to stream=True so the router can detect a hung provider (one that accepts the connection but never sends tokens) within ttft_timeout seconds, rather than waiting for the full request timeout which can be very long for large generation requests. Raises litellm.Timeout to trigger existing cooldown and fallback machinery. Caller always receives a standard ModelResponse via stream_chunk_builder. Uses a single hard deadline rather than per-chunk wait_for, so preamble chunks (role deltas, empty tool-call deltas) do not reset the clock. Checks both delta.content and delta.tool_calls for first-token detection. Phase 2 lets real errors propagate rather than swallowing them. Uses asyncio.get_running_loop().
_get_ttft_timeout used or-chaining which would skip a caller-supplied ttft_timeout=0.0 as falsy. Replaced with explicit is not None iteration. The ttft_timeout streaming path bypassed the content-policy violation check that runs for native non-streaming responses. The reconstructed ModelResponse now goes through _should_raise_content_policy_error before being returned, matching the existing non-streaming behavior.
29e6de0 to
999d442
Compare
|
The README.md concern is pre-existing content our PR does not touch. The only files changed here are litellm/router.py and tests/test_litellm/test_router.py. Happy to file a separate issue for the README TLS guidance if that's useful, but it's out of scope for this PR. |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Add tests for the empty-stream path (StopAsyncIteration -> APIError) and the _acompletion intercept path that forces stream=True internally when ttft_timeout is set.
…stream Extends the ttft_timeout feature with stream_idle_timeout: a per-chunk inter-token deadline that fires litellm.Timeout when a provider accepts a connection, sends some tokens, then goes silent. Both parameters are independent; either or both can be set at router or per-deployment level.
…spatch Add _acompletion intercept test for stream_idle_timeout-only config (ttft_timeout=None) and assert both params are forwarded correctly to _collect_stream_with_ttft_timeout. Also tighten the existing ttft_timeout intercept test to assert the forwarded values explicitly.
…nd close stream on exit The ttft_timeout / stream_idle_timeout path promotes a stream=False call to streaming and drains it via _collect_stream_with_ttft_timeout. Two correctness issues are addressed: - max_parallel_requests semaphore: reconstruction previously ran after the 'async with rpm_semaphore' block had already exited, so a stream=False caller could exceed the configured concurrency for the entire drain. Reconstruction now happens inside the semaphore via _await_response, restoring the non-streaming concurrency guarantee. - Connection cleanup: the drain loop now runs under try/finally and calls response.aclose() on timeout, cancellation, or normal completion, so a caller disconnect mid-reconstruction releases the upstream connection instead of leaking it. The reconstructed ModelResponse now flows through the shared content-policy check and _track_deployment_metrics, removing the duplicated content-policy block. The two identical ttft raise sites are collapsed into one. Tests: semaphore-held-through-reconstruction (fails before the fix), stream-closed-on-caller-cancellation, and a ttft Timeout tagging failed_deployment_id so weighted failover / cooldown can exclude the hung deployment on retry.
Adds a regression test pinning the resolution precedence (per-request kwarg > per-deployment litellm_params > router-level > default_litellm_params) for both _get_ttft_timeout and _get_stream_idle_timeout, which the router_code_coverage gate flagged as untested.
|
Thanks for the contribution! A couple of things to address before this is ready for merge:
We're also triggering a Greptile code review in the meantime. |
…r API When set in a deployment's litellm_params, ttft_timeout and stream_idle_timeout were assembled into input_kwargs and forwarded to litellm.acompletion. Because neither key was in all_litellm_params, the provider param filtering treated them as model-specific extras and passed them to the upstream API, which 400s on unknown fields; this broke the exact per-deployment configuration the feature documents. Add both keys to all_litellm_params alongside stream_timeout so filter_out_litellm_params strips them before the provider call, while the router still resolves their values from litellm_params. Regression test asserts they are filtered out while genuine provider params (temperature) are kept.
|
@Sameerlite Thanks for the review. On the CI failures: they all come from the router_settings documentation gate, which reads On Greptile's blocker: it correctly caught that |
When stream_idle_timeout was set without ttft_timeout, Phase 1 (wait-for-first-token) was skipped and the idle loop wrapped the very first __anext__ with stream_idle_timeout. That measured time-to-first-token, not the inter-token gap, so any provider whose first token arrived slower than stream_idle_timeout was wrongly killed as 'stalled mid-stream', contradicting the documented 'between consecutive tokens' semantics. Run the first-token phase whenever either timeout is set, but bound it only by ttft_timeout's absolute deadline; when ttft_timeout is None the first-token wait is unbounded (still capped by the outer request timeout) and stream_idle_timeout governs only the gaps after content has started. Regression test: stream_idle_timeout-only with a first token slower than the idle window followed by prompt chunks must succeed; it fails on the pre-fix behavior.
|
Addressed the remaining finding from the last review: when |
… feat/router-ttft-timeout # Conflicts: # litellm/router.py
Use PEP 585/604 forms (list/dict, X | None) for the ttft_timeout/stream_idle_timeout annotations added by this PR so UP006/UP045 totals stay within the strict-rule budget ceiling the base enforces.
The promoted-stream path returns a reconstructed ModelResponse; assert it flows through the same content-policy check as the non-streaming path. A content_filter finish_reason with a content-policy fallback configured must raise ContentPolicyViolationError from _acompletion. Mutation-verified: fails if the _acompletion content-policy raise is removed.
|
Pushed three changes since the last review: merged the latest Greptile is happy now @Sameerlite |
|
Thanks for the PR! A couple of things to get this over the finish line:
Once those are addressed we'll take another look — appreciate the contribution! |
|
Thanks @Sameerlite. I have gone through the three open Greptile threads and resolved them. Two were already handled by the current revision: the Phase 2 collection no longer swallows mid-stream errors (it is a plain |
asyncio.wait_for already raises TimeoutError for a non-positive timeout, so the explicit remaining<=0 check duplicated wait_for's own behavior. Removing it keeps the same observable result (litellm.Timeout via the except clause) and lets wait_for return an already-ready chunk instead of spuriously timing out.
|
Quick note on the two red The actual blocker you flagged is cleared: all three Greptile threads are resolved (two were already addressed in code, the third is intentional and scoped to non-streaming, with reasoning in the thread). Greptile is at 5/5 and the rest of CI is green |
|
Thanks for addressing all the open review threads and for the detailed explanation on the CI failures! Triggering a fresh Greptile review on the latest commit: Once Greptile confirms 5/5 on the current SHA, we'll take another look! |
feat(router): add ttft_timeout and stream_idle_timeout to detect hung and stalling providers
Relevant issues
Pre-Submission checklist
tests/litellm/directory, Adding at least 1 test is a hard requirement - see detailsmake test-unitType
New Feature
Changes
Adds two new
Routerparameters for detecting providers that accept connections but then fail to deliver:ttft_timeout: float— fireslitellm.Timeoutif no first token arrives within N seconds of the connection being accepted (catches hung providers before any content is sent)stream_idle_timeout: float— fireslitellm.Timeoutif no chunk arrives within N seconds between consecutive tokens (catches providers that stall mid-stream after delivering some content)Both parameters are independent; either or both can be set. When set, non-streaming calls (
stream=False) are internally promoted tostream=Trueso the router has visibility into token timing. The caller always receives a standardModelResponsereconstructed viastream_chunk_builder.Why this matters
Deployments with large request timeouts (e.g. 120s for long generation tasks) have no way to distinguish a legitimately slow provider from one that hangs or stalls. Without this, a single degraded provider blocks users for the full timeout before cooldown/fallback kicks in. With
ttft_timeout=10andstream_idle_timeout=30, the worst-case user wait is bounded regardless of the configuredtimeout.How it works
stream=False,_acompletioninternally overridesstream=Truettft_timeout): a single hard deadline (not a per-chunk reset) so preamble chunks (role deltas, empty tool-call deltas) do not extend the budget. First-token detection covers bothdelta.contentanddelta.tool_callsstream_idle_timeout): per-chunkasyncio.wait_forwraps each__anext__call; if any inter-token gap exceeds the limit, raiseslitellm.Timeout_should_raise_content_policy_error, matching the existing non-streaming pathPer-deployment config (preferred for heterogeneous deployments)
Both parameters follow the same resolution chain as
stream_timeout: per-request kwarg -> per-deploymentlitellm_params-> router-level ->default_litellm_params.Performance considerations
Enabling either timeout changes the non-streaming path from O(1) to O(tokens): instead of one HTTP response body and one JSON parse, the router creates one Python object per streaming chunk and runs
stream_chunk_builderto reconstruct. For a 500-token response this is ~500 small short-lived objects vs. 1; at low-to-moderate throughput the difference is negligible, but at high throughput with large responses the GC pressure is measurable.The tradeoff is worthwhile when
ttft_timeout/stream_idle_timeoutare much smaller thantimeout. Iftimeoutis already short (e.g. 10s), the native timeout fires quickly enough that these parameters add overhead without meaningfully improving UX. A reasonable rule: only set these on deployments wheretimeoutis long enough that a hang or stall would be visibly bad for users.stream_idle_timeoutadds O(tokens)asyncio.wait_forcall overhead on top of the buffering cost — one coroutine wrap per chunk. In practice this is ~1-2 µs per chunk and dwarfed by network I/O, but it is worth knowing the ceiling.Concurrency, cleanup, and failover
Because a
stream=Falsecaller is promoted to streaming internally, the drain has to preserve the guarantees that caller already had.The
max_parallel_requestssemaphore is now held for the full reconstruction, not just for opening the stream. Previously theasync with rpm_semaphoreblock exited as soon as theCustomStreamWrapperwas returned, so the entire drain (the slow part) ran with the slot already released; astream=Falsecaller could therefore exceed its configured concurrency. Reconstruction now runs inside the semaphoreThe drain runs under
try/finallyand callsresponse.aclose()on timeout, on caller cancellation, and on normal completion, so a client disconnect mid-reconstruction releases the upstream connection instead of leaking itA
ttft_timeout/stream_idle_timeoutfailure raiseslitellm.Timeout, which already flows into the existing retry, cooldown, and fallback machinery and is tagged withfailed_deployment_id. Withenable_weighted_failover=Trueon asimple-shufflegroup, that tag lets the in-request re-pick exclude the hung deployment instead of re-selecting a high-weight bad host and burning the retry budgetSafe defaults
Both parameters default to off (
None); nothing changes for callers who do not set them. They are best set per deployment rather than globally, since a value tuned for a fast chat model will abandon legitimate reasoning calls that have a large time-to-first-token and longer inter-token gaps. Treatstream_idle_timeoutas a freeze detector and keep it well above the model's per-token p99, not as a slowness detectorFiles changed
litellm/router.py—ttft_timeoutandstream_idle_timeoutparams on__init__;_get_ttft_timeoutand_get_stream_idle_timeouthelpers;_collect_stream_with_ttft_timeout(both phases, drained undertry/finallywithaclose);_acompletiondrains and reconstructs inside the concurrency semaphore via_await_responseand routes the reconstructedModelResponsethrough the shared content-policy and metrics pathlitellm/types/utils.py— addsttft_timeoutandstream_idle_timeouttoall_litellm_params(alongsidestream_timeout) so they are stripped from the request before it reaches the provider and never forwarded as unknown fieldstests/test_litellm/test_router.py— happy path, hung provider, preamble-chunk deadline, empty stream,_acompletionintercept, stalled mid-stream, non-stalling idle timeout, both timeouts active together, semaphore held through reconstruction, stream closed on caller cancellation, a ttftlitellm.Timeouttaggingfailed_deployment_id, and the resolution-chain precedencetests/test_litellm/test_filter_out_litellm_params.py— asserts both params are filtered out of provider-bound kwargs while genuine provider params are keptA companion docs PR (BerriAI/litellm-docs#353) adds the two params to the
router_settingsreference table; thedocumentationgate reads that file from the litellm-docs repo, so it goes green here once that merges