feat(router): add ttft_timeout to detect hung providers on non-streaming calls #30337

Open
TheCodeWrangler wants to merge 13 commits into BerriAI:litellm_internal_staging from TheCodeWrangler:feat/router-ttft-timeout

Conversation

@TheCodeWrangler

@TheCodeWrangler TheCodeWrangler commented Jun 13, 2026

feat(router): add ttft_timeout and stream_idle_timeout to detect hung and stalling providers

Relevant issues

Pre-Submission checklist

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

New Feature

Changes

Adds two new Router parameters for detecting providers that accept connections but then fail to deliver:

  • ttft_timeout: float — fires litellm.Timeout if no first token arrives within N seconds of the connection being accepted (catches hung providers before any content is sent)
  • stream_idle_timeout: float — fires litellm.Timeout if no chunk arrives within N seconds between consecutive tokens (catches providers that stall mid-stream after delivering some content)

Both parameters are independent; either or both can be set. When set, non-streaming calls (stream=False) are internally promoted to stream=True so the router has visibility into token timing. The caller always receives a standard ModelResponse reconstructed via stream_chunk_builder.

Why this matters

Deployments with large request timeouts (e.g. 120s for long generation tasks) have no way to distinguish a legitimately slow provider from one that hangs or stalls. Without this, a single degraded provider blocks users for the full timeout before cooldown/fallback kicks in. With ttft_timeout=10 and stream_idle_timeout=30, the worst-case user wait is bounded regardless of the configured timeout.

How it works

  1. If either timeout is set and the caller uses stream=False, _acompletion internally overrides stream=True
  2. Phase 1 (ttft_timeout): a single hard deadline (not a per-chunk reset) so preamble chunks (role deltas, empty tool-call deltas) do not extend the budget. First-token detection covers both delta.content and delta.tool_calls
  3. Phase 2 (stream_idle_timeout): per-chunk asyncio.wait_for wraps each __anext__ call; if any inter-token gap exceeds the limit, raises litellm.Timeout
  4. The reconstructed response goes through _should_raise_content_policy_error, matching the existing non-streaming path
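The two phases above can be sketched roughly as follows. This is a minimal illustration, not the PR's `_collect_stream_with_ttft_timeout`: `StreamTimeout` is a hypothetical stand-in for `litellm.Timeout`, chunks are plain dicts rather than litellm chunk objects, and `collect`, `has_token`, and `fake_stream` are names invented for the example.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List, Optional


class StreamTimeout(Exception):
    """Hypothetical stand-in for litellm.Timeout."""


def has_token(chunk: Dict[str, Any]) -> bool:
    # Content or tool-call deltas count; role-only or empty deltas do not.
    delta = chunk.get("delta") or {}
    return bool(delta.get("content") or delta.get("tool_calls"))


async def collect(
    stream: AsyncIterator[Dict[str, Any]],
    ttft_timeout: Optional[float] = None,
    idle_timeout: Optional[float] = None,
) -> List[Dict[str, Any]]:
    loop = asyncio.get_running_loop()
    # Phase 1 budget: one absolute deadline, computed once.
    deadline = None if ttft_timeout is None else loop.time() + ttft_timeout
    chunks: List[Dict[str, Any]] = []
    seen_token = False
    try:
        while True:
            if seen_token:
                timeout = idle_timeout  # Phase 2: fresh window per chunk
            elif deadline is not None:
                timeout = deadline - loop.time()  # Phase 1: shrinking budget
            else:
                timeout = None  # no ttft_timeout: first-token wait unbounded
            try:
                chunk = await asyncio.wait_for(stream.__anext__(), timeout)
            except StopAsyncIteration:
                return chunks
            except asyncio.TimeoutError:
                raise StreamTimeout("idle" if seen_token else "ttft") from None
            chunks.append(chunk)
            seen_token = seen_token or has_token(chunk)
    finally:
        # Release the upstream stream on every exit path.
        aclose = getattr(stream, "aclose", None)
        if aclose is not None:
            await aclose()


async def fake_stream(plan):
    # plan: list of (delay_seconds, chunk) pairs emitted in order.
    for delay, chunk in plan:
        await asyncio.sleep(delay)
        yield chunk
```

In this sketch a run of role-only preamble chunks cannot extend the Phase 1 budget, because the remaining time is always measured against the same deadline, while the idle window only starts applying once a content-bearing chunk has been seen.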

Per-deployment config (preferred for heterogeneous deployments)

Both parameters follow the same resolution chain as stream_timeout: per-request kwarg -> per-deployment litellm_params -> router-level -> default_litellm_params.

router = Router(
    model_list=[
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4-6",
                "ttft_timeout": 8.0,
                "stream_idle_timeout": 30.0,
            },
        },
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "openai/gpt-4o",
                "ttft_timeout": 15.0,
                "stream_idle_timeout": 30.0,
            },
        },
    ],
    timeout=120,
    allowed_fails=1,
    fallbacks=[{"my-model": ["my-model"]}],
)
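For readers who want the precedence spelled out, here is a minimal sketch of the resolution chain described above. The function name `resolve_timeout` and its argument layout are assumptions made for illustration; the PR's actual helpers are `_get_ttft_timeout` and `_get_stream_idle_timeout`, whose signatures are not reproduced here.

```python
from typing import Any, Mapping, Optional


def resolve_timeout(
    name: str,
    request_kwargs: Mapping[str, Any],
    deployment_params: Mapping[str, Any],
    router_value: Optional[float],
    default_params: Mapping[str, Any],
) -> Optional[float]:
    """Return the first source that is not None, in precedence order."""
    sources = (
        request_kwargs.get(name),     # 1. per-request kwarg
        deployment_params.get(name),  # 2. per-deployment litellm_params
        router_value,                 # 3. router-level setting
        default_params.get(name),     # 4. default_litellm_params
    )
    for value in sources:
        if value is not None:
            return value
    return None  # feature stays off


if __name__ == "__main__":
    # The per-deployment value wins over the router-level one.
    print(resolve_timeout("ttft_timeout", {}, {"ttft_timeout": 8.0}, 20.0, {}))
```

With the config above, a request routed to the Anthropic deployment would resolve `ttft_timeout` to 8.0 under this sketch unless the caller passes its own value.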

Performance considerations

Enabling either timeout changes the non-streaming path from O(1) to O(tokens): instead of one HTTP response body and one JSON parse, the router creates one Python object per streaming chunk and runs stream_chunk_builder to reconstruct. For a 500-token response this is ~500 small short-lived objects vs. 1; at low-to-moderate throughput the difference is negligible, but at high throughput with large responses the GC pressure is measurable.

The tradeoff is worthwhile when ttft_timeout / stream_idle_timeout are much smaller than timeout. If timeout is already short (e.g. 10s), the native timeout fires quickly enough that these parameters add overhead without meaningfully improving UX. A reasonable rule: only set these on deployments where timeout is long enough that a hang or stall would be visibly bad for users.

stream_idle_timeout adds O(tokens) asyncio.wait_for call overhead on top of the buffering cost — one coroutine wrap per chunk. In practice this is ~1-2 µs per chunk and dwarfed by network I/O, but it is worth knowing the ceiling.

Concurrency, cleanup, and failover

Because a stream=False caller is promoted to streaming internally, the drain has to preserve the guarantees that caller already had.

The max_parallel_requests semaphore is now held for the full reconstruction, not just for opening the stream. Previously the async with rpm_semaphore block exited as soon as the CustomStreamWrapper was returned, so the entire drain (the slow part) ran with the slot already released; a stream=False caller could therefore exceed its configured concurrency. Reconstruction now runs inside the semaphore

The drain runs under try/finally and calls response.aclose() on timeout, on caller cancellation, and on normal completion, so a client disconnect mid-reconstruction releases the upstream connection instead of leaking it

A ttft_timeout / stream_idle_timeout failure raises litellm.Timeout, which already flows into the existing retry, cooldown, and fallback machinery and is tagged with failed_deployment_id. With enable_weighted_failover=True on a simple-shuffle group, that tag lets the in-request re-pick exclude the hung deployment instead of re-selecting a high-weight bad host and burning the retry budget
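The semaphore and cleanup guarantees described in this section can be illustrated with a small self-contained sketch. `FakeStream`, `drain`, `limited_call`, and the `stats` dict are all hypothetical names for the example; the real router uses its own semaphore and `CustomStreamWrapper`.

```python
import asyncio
from typing import List


class FakeStream:
    """Hypothetical upstream stream that records whether it was closed."""

    def __init__(self, n_chunks: int) -> None:
        self.remaining = n_chunks
        self.closed = False

    def __aiter__(self) -> "FakeStream":
        return self

    async def __anext__(self) -> str:
        if self.remaining == 0:
            raise StopAsyncIteration
        self.remaining -= 1
        await asyncio.sleep(0.01)  # simulated inter-token gap
        return "tok"

    async def aclose(self) -> None:
        self.closed = True


async def drain(stream: FakeStream) -> List[str]:
    chunks: List[str] = []
    try:
        async for chunk in stream:
            chunks.append(chunk)
    finally:
        # Runs on normal completion, on errors, and on caller cancellation.
        await stream.aclose()
    return chunks


async def limited_call(sem: asyncio.Semaphore, stream: FakeStream, stats: dict) -> List[str]:
    async with sem:  # the slot stays held for the whole drain
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            return await drain(stream)
        finally:
            stats["active"] -= 1


async def peak_concurrency(n_calls: int, limit: int) -> int:
    sem = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    streams = [FakeStream(3) for _ in range(n_calls)]
    await asyncio.gather(*(limited_call(sem, s, stats) for s in streams))
    assert all(s.closed for s in streams)
    return stats["peak"]


async def closed_after_cancel() -> bool:
    stream = FakeStream(100)
    task = asyncio.ensure_future(drain(stream))
    await asyncio.sleep(0.03)  # let the drain start, then "disconnect"
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return stream.closed
```

Because the drain happens inside the `async with` block, the number of concurrently draining calls never exceeds the semaphore limit in this sketch, and the `finally` clause closes the stream even when the caller is cancelled mid-drain.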

Safe defaults

Both parameters default to off (None); nothing changes for callers who do not set them. They are best set per deployment rather than globally, since a value tuned for a fast chat model will abandon legitimate reasoning calls that have a large time-to-first-token and longer inter-token gaps. Treat stream_idle_timeout as a freeze detector and keep it well above the model's per-token p99, not as a slowness detector

Files changed

  • litellm/router.py: ttft_timeout and stream_idle_timeout params on __init__; _get_ttft_timeout and _get_stream_idle_timeout helpers; _collect_stream_with_ttft_timeout (both phases, drained under try/finally with aclose); _acompletion drains and reconstructs inside the concurrency semaphore via _await_response and routes the reconstructed ModelResponse through the shared content-policy and metrics path
  • litellm/types/utils.py — adds ttft_timeout and stream_idle_timeout to all_litellm_params (alongside stream_timeout) so they are stripped from the request before it reaches the provider and never forwarded as unknown fields
  • tests/test_litellm/test_router.py — happy path, hung provider, preamble-chunk deadline, empty stream, _acompletion intercept, stalled mid-stream, non-stalling idle timeout, both timeouts active together, semaphore held through reconstruction, stream closed on caller cancellation, a ttft litellm.Timeout tagging failed_deployment_id, and the resolution-chain precedence
  • tests/test_litellm/test_filter_out_litellm_params.py — asserts both params are filtered out of provider-bound kwargs while genuine provider params are kept

A companion docs PR (BerriAI/litellm-docs#353) adds the two params to the router_settings reference table; the documentation gate reads that file from the litellm-docs repo, so it goes green here once that merges

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Contributor

Greptile Summary

Adds ttft_timeout and stream_idle_timeout to the Router for detecting hung and stalling providers on non-streaming calls. When either is set, _acompletion internally promotes stream=False to stream=True, drains and reconstructs the stream into a ModelResponse via stream_chunk_builder, holding the max_parallel_requests semaphore throughout.

  • Phase 1 (ttft_timeout): a single absolute deadline computed before the loop; preamble chunks (empty deltas, role-only chunks) do not reset the budget, and first-token detection covers both delta.content and delta.tool_calls.
  • Phase 2 (stream_idle_timeout): per-chunk asyncio.wait_for wraps each __anext__ call starting only after Phase 1 delivers a content-bearing chunk, so a slow first token with only stream_idle_timeout set does not spuriously time out.
  • Resolution chain (per-request kwarg → per-deployment litellm_params → router-level → default_litellm_params) uses explicit is None guards throughout; aclose() is called in a finally block on timeout, cancellation, and normal completion; and litellm.Timeout propagates through the existing retry/cooldown machinery with failed_deployment_id stamped by the existing except litellm.Timeout handler in _acompletion.

Confidence Score: 5/5

Safe to merge. All behavior changes are opt-in (both parameters default to None), the existing non-streaming path is unchanged for callers who do not set them, and every previously raised concern has been correctly addressed in this revision.

Every issue flagged in prior review threads — absolute TTFT deadline, tool-call detection, explicit is None resolution chain, Phase 2 error propagation, stream_idle_timeout not firing before first content, semaphore held through reconstruction, and aclose() on all exit paths — is verifiably fixed in the current code. The new test test_router_stream_idle_timeout_does_not_fire_before_first_token directly validates the trickiest interaction (slow first token with only stream_idle_timeout set). No real network calls are introduced in the test suite.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/router.py | Adds _collect_stream_with_ttft_timeout, _get_ttft_timeout, _get_stream_idle_timeout, and wires them into _acompletion. All previously flagged issues (absolute deadline, tool-call detection, is None resolution chain, Phase 2 error propagation, stream_idle_timeout before first token, semaphore held through reconstruction) are correctly addressed in this revision. |
| litellm/types/utils.py | Adds ttft_timeout and stream_idle_timeout to all_litellm_params so they are stripped before the request reaches upstream providers. Correct and minimal change. |
| tests/test_litellm/test_router.py | Adds 12 new mock-only tests covering happy path, hung provider, preamble deadline, empty stream, _acompletion intercept, stalled mid-stream, slow-first-token (idle only), both timeouts, semaphore contract, caller cancellation, failed_deployment_id tagging, and resolution-chain precedence. No real network calls. |
| tests/test_litellm/test_filter_out_litellm_params.py | Adds a test asserting both new params are filtered out while genuine provider params are kept. Correct and targeted. |

Reviews (7): Last reviewed commit: "refactor(router): drop redundant ttft de..." | Re-trigger Greptile

Comment thread litellm/router.py Outdated
Comment thread litellm/router.py Outdated
Comment thread litellm/router.py Outdated
Comment thread litellm/router.py Outdated
Comment thread litellm/router.py Outdated
@TheCodeWrangler

Author

Thanks for the detailed review @greptile-apps. All three issues were addressed in the follow-up commit (c0075e3):

1. Per-chunk timeout reset
Replaced the per-chunk asyncio.wait_for(aiter.__anext__(), ttft_timeout) pattern with a single hard deadline:

loop = asyncio.get_running_loop()
deadline = loop.time() + ttft_timeout
# ...
remaining = deadline - loop.time()
if remaining <= 0:
    raise litellm.Timeout(...)
chunk = await asyncio.wait_for(aiter.__anext__(), timeout=remaining)

Preamble chunks (role deltas, empty tool-call deltas) no longer reset the clock — the full ttft_timeout budget is shared across all chunks until the first real token arrives. A new test test_router_ttft_timeout_not_reset_by_preamble_chunks covers this case.

2. Tool-call responses never marked as first token
The first-token check now covers both content and tool-call deltas:

delta = chunk.choices[0].delta if chunk.choices else None
if delta and (delta.content or delta.tool_calls):
    first_token_received = True

3. Silent mid-stream error swallowing in Phase 2
Removed the bare except Exception: pass. Phase 2 now lets real errors propagate directly to the Router's fallback machinery:

# Phase 2: let real errors propagate
async for chunk in aiter:
    chunks.append(chunk)

Also switched asyncio.get_event_loop() to asyncio.get_running_loop() (correct for async context) and added verbose_router_logger.warning at both timeout paths for observability.

@greptile-apps

greptile-apps Bot commented Jun 13, 2026

Contributor

The three issues are all cleanly addressed. A quick walkthrough of what I see in c0075e3:

Deadline-based clock — correct. The loop.time() + ttft_timeout deadline is computed once and remaining is recomputed from it before every wait_for call, so preamble chunks do not extend the budget. The new test validates this path.

Tool-call first-token detection — correct. delta.tool_calls is now included in the guard, and the test helper explicitly sets chunk.choices[0].delta.tool_calls = None to avoid MagicMock's implicit truthiness masking the check.
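The MagicMock point is easy to trip over, so here is a small sketch of it. `is_first_token` mirrors the guard quoted earlier in the thread; `make_chunk` and `careless_chunk` are hypothetical test helpers, not the ones in the PR's test file.

```python
from unittest.mock import MagicMock


def is_first_token(chunk) -> bool:
    # Same shape as the guard in the PR comment: content or tool calls count.
    delta = chunk.choices[0].delta if chunk.choices else None
    return bool(delta and (delta.content or delta.tool_calls))


def make_chunk(content=None, tool_calls=None) -> MagicMock:
    # Set BOTH attributes explicitly so neither is an auto-created MagicMock.
    chunk = MagicMock()
    chunk.choices = [MagicMock()]
    chunk.choices[0].delta.content = content
    chunk.choices[0].delta.tool_calls = tool_calls
    return chunk


def careless_chunk() -> MagicMock:
    # Only content is pinned; delta.tool_calls is an implicit MagicMock,
    # which is truthy, so this "empty" chunk looks like a first token.
    chunk = MagicMock()
    chunk.choices = [MagicMock()]
    chunk.choices[0].delta.content = None
    return chunk


if __name__ == "__main__":
    print(is_first_token(make_chunk()))              # preamble chunk
    print(is_first_token(make_chunk(content="hi")))  # real token
    print(is_first_token(careless_chunk()))          # false positive
```

A mock preamble chunk built without pinning `tool_calls` would be treated as a first token, which would mask exactly the behavior a preamble-deadline test is meant to check.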

Phase 2 error propagation — correct. The bare except Exception: pass is gone; real errors surface to the Router's fallback machinery.

One minor point worth noting for a follow-up: _get_ttft_timeout uses or-chaining on numeric values:

return (
    kwargs.get("ttft_timeout", None)
    or data.get("ttft_timeout", None)
    or self.ttft_timeout
    ...
)

A caller passing ttft_timeout=0.0 would have it silently skipped (falsy) and the next source in the chain used instead. In _acompletion the guard is correctly is not None, so this only causes an inconsistency if someone genuinely passes 0.0 (which is pathological). Still, aligning it to use explicit is not None checks would be more robust:

for source in (kwargs.get("ttft_timeout"), data.get("ttft_timeout"), self.ttft_timeout, self.default_litellm_params.get("ttft_timeout")):
    if source is not None:
        return source
return None

This is a minor polish item rather than a blocker. The core logic of the feature is sound — the deadline approach, tool-call handling, and error propagation are all solid.
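To make the falsy-zero behavior concrete, the following sketch compares the two styles on plain values. `or_chain` and `first_not_none` are illustrative names only and take bare values rather than the router's kwargs and deployment dicts.

```python
from typing import Optional


def or_chain(a: Optional[float], b: Optional[float], c: Optional[float]) -> Optional[float]:
    # Truthiness-based: 0.0 is falsy, so it is skipped like None.
    return a or b or c


def first_not_none(*sources: Optional[float]) -> Optional[float]:
    # Identity-based: only None means "not set".
    for value in sources:
        if value is not None:
            return value
    return None


if __name__ == "__main__":
    print(or_chain(0.0, 30.0, None))        # 30.0: the explicit 0.0 is lost
    print(first_not_none(0.0, 30.0, None))  # 0.0: the explicit 0.0 is honored
```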

Comment thread litellm/router.py Outdated
@veria-ai

veria-ai Bot commented Jun 13, 2026

Contributor

PR overview

All previously flagged issues have been addressed. No open security concerns remain on this pull request.

Security review

No open security issues remain on this pull request.

Fixed/addressed: 3 · PR risk: 0/10

@TheCodeWrangler TheCodeWrangler force-pushed the feat/router-ttft-timeout branch from c0075e3 to d368fe9 Compare June 13, 2026 02:26
@CLAassistant

CLAassistant commented Jun 13, 2026

CLA assistant check
All committers have signed the CLA.

@TheCodeWrangler

Author

Good catch on both points, addressed in 29e6de0.

On the or-chain: _get_ttft_timeout now iterates sources with explicit is not None checks so ttft_timeout=0.0 is honored rather than silently skipped.

On the content-policy bypass: the reconstructed ModelResponse from _collect_stream_with_ttft_timeout now goes through _should_raise_content_policy_error before being returned, identical to the existing check on the native non-streaming path.

Comment thread README.md Outdated
…ing calls

Adds ttft_timeout parameter to Router. When set, non-streaming calls internally
switch to stream=True so the router can detect a hung provider (one that accepts
the connection but never sends tokens) within ttft_timeout seconds, rather than
waiting for the full request timeout which can be very long for large generation
requests. Raises litellm.Timeout to trigger existing cooldown and fallback machinery.
Caller always receives a standard ModelResponse via stream_chunk_builder.

Uses a single hard deadline rather than per-chunk wait_for, so preamble chunks
(role deltas, empty tool-call deltas) do not reset the clock. Checks both
delta.content and delta.tool_calls for first-token detection. Phase 2 lets real
errors propagate rather than swallowing them. Uses asyncio.get_running_loop().
_get_ttft_timeout used or-chaining which would skip a caller-supplied
ttft_timeout=0.0 as falsy. Replaced with explicit is not None iteration.

The ttft_timeout streaming path bypassed the content-policy violation
check that runs for native non-streaming responses. The reconstructed
ModelResponse now goes through _should_raise_content_policy_error
before being returned, matching the existing non-streaming behavior.
@TheCodeWrangler TheCodeWrangler force-pushed the feat/router-ttft-timeout branch from 29e6de0 to 999d442 Compare June 13, 2026 11:10
@TheCodeWrangler

Author

The README.md concern is pre-existing content our PR does not touch. The only files changed here are litellm/router.py and tests/test_litellm/test_router.py. Happy to file a separate issue for the README TLS guidance if that's useful, but it's out of scope for this PR.

@codecov

codecov Bot commented Jun 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Add tests for the empty-stream path (StopAsyncIteration -> APIError)
and the _acompletion intercept path that forces stream=True internally
when ttft_timeout is set.
…stream

Extends the ttft_timeout feature with stream_idle_timeout: a per-chunk
inter-token deadline that fires litellm.Timeout when a provider accepts
a connection, sends some tokens, then goes silent. Both parameters are
independent; either or both can be set at router or per-deployment level.
…spatch

Add _acompletion intercept test for stream_idle_timeout-only config
(ttft_timeout=None) and assert both params are forwarded correctly to
_collect_stream_with_ttft_timeout. Also tighten the existing ttft_timeout
intercept test to assert the forwarded values explicitly.
Comment thread litellm/router.py
@TheCodeWrangler

Author

@greptileai

…nd close stream on exit

The ttft_timeout / stream_idle_timeout path promotes a stream=False call to
streaming and drains it via _collect_stream_with_ttft_timeout. Two correctness
issues are addressed:

- max_parallel_requests semaphore: reconstruction previously ran after the
  'async with rpm_semaphore' block had already exited, so a stream=False caller
  could exceed the configured concurrency for the entire drain. Reconstruction
  now happens inside the semaphore via _await_response, restoring the
  non-streaming concurrency guarantee.

- Connection cleanup: the drain loop now runs under try/finally and calls
  response.aclose() on timeout, cancellation, or normal completion, so a caller
  disconnect mid-reconstruction releases the upstream connection instead of
  leaking it.

The reconstructed ModelResponse now flows through the shared content-policy
check and _track_deployment_metrics, removing the duplicated content-policy
block. The two identical ttft raise sites are collapsed into one.

Tests: semaphore-held-through-reconstruction (fails before the fix),
stream-closed-on-caller-cancellation, and a ttft Timeout tagging
failed_deployment_id so weighted failover / cooldown can exclude the hung
deployment on retry.
Adds a regression test pinning the resolution precedence (per-request kwarg >
per-deployment litellm_params > router-level > default_litellm_params) for both
_get_ttft_timeout and _get_stream_idle_timeout, which the router_code_coverage
gate flagged as untested.
@Sameerlite

Collaborator

Thanks for the contribution!

A couple of things to address before this is ready for merge:

  • It looks like some CI checks are failing — could you take a look and fix them, or let us know if you believe the failures are unrelated to this change?

We're also triggering a Greptile code review in the meantime.

@greptileai

…r API

When set in a deployment's litellm_params, ttft_timeout and stream_idle_timeout were
assembled into input_kwargs and forwarded to litellm.acompletion. Because neither key
was in all_litellm_params, the provider param filtering treated them as model-specific
extras and passed them to the upstream API, which 400s on unknown fields; this broke the
exact per-deployment configuration the feature documents.

Add both keys to all_litellm_params alongside stream_timeout so filter_out_litellm_params
strips them before the provider call, while the router still resolves their values from
litellm_params. Regression test asserts they are filtered out while genuine provider
params (temperature) are kept.
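The idea behind this fix can be sketched as a simple key filter. This is not litellm's `filter_out_litellm_params`; `ROUTER_ONLY_PARAMS` is a hypothetical three-key subset standing in for `all_litellm_params`, and `strip_router_params` is a name invented for the example.

```python
from typing import Any, Dict

# Hypothetical subset of keys that are meaningful to the router only.
ROUTER_ONLY_PARAMS = frozenset(
    {"stream_timeout", "ttft_timeout", "stream_idle_timeout"}
)


def strip_router_params(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs without router-only keys."""
    return {k: v for k, v in kwargs.items() if k not in ROUTER_ONLY_PARAMS}


if __name__ == "__main__":
    deployment_kwargs = {
        "model": "openai/gpt-4o",
        "temperature": 0.2,
        "ttft_timeout": 15.0,
        "stream_idle_timeout": 30.0,
    }
    # Only model and temperature would be sent upstream in this sketch.
    print(strip_router_params(deployment_kwargs))
```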
@TheCodeWrangler

TheCodeWrangler commented Jun 16, 2026

Author

@Sameerlite Thanks for the review.

On the CI failures: they all come from the router_settings documentation gate, which reads config_settings.md out of the separate BerriAI/litellm-docs repo rather than this one. The two new router params need reference-table rows there, so I opened a companion PR at BerriAI/litellm-docs#353 that adds them; I validated it locally by running tests/documentation_tests/test_router_settings.py against the edited file. Once #353 merges, both the documentation check and the documentation_test_router_settings step inside code-quality go green on the next run here. The other half of the code-quality failure was the router_code_coverage gate flagging the two new getters as untested, which is already fixed on this branch by the resolution-chain test in 15bd7fb

On Greptile's blocker: it correctly caught that ttft_timeout and stream_idle_timeout were being forwarded to the upstream provider when set in a deployment's litellm_params, which would 400. Fixed in f45f71b by adding both keys to all_litellm_params alongside stream_timeout, with a regression test in test_filter_out_litellm_params.py. The minor or-chaining note it raised was already handled; the getters use explicit is not None checks

@Sameerlite

Collaborator

@greptileai

Comment thread litellm/router.py
When stream_idle_timeout was set without ttft_timeout, Phase 1 (wait-for-first-token)
was skipped and the idle loop wrapped the very first __anext__ with stream_idle_timeout.
That measured time-to-first-token, not the inter-token gap, so any provider whose first
token arrived slower than stream_idle_timeout was wrongly killed as 'stalled mid-stream',
contradicting the documented 'between consecutive tokens' semantics.

Run the first-token phase whenever either timeout is set, but bound it only by
ttft_timeout's absolute deadline; when ttft_timeout is None the first-token wait is
unbounded (still capped by the outer request timeout) and stream_idle_timeout governs
only the gaps after content has started.

Regression test: stream_idle_timeout-only with a first token slower than the idle window
followed by prompt chunks must succeed; it fails on the pre-fix behavior.
@TheCodeWrangler

Author

Addressed the remaining finding from the last review: when stream_idle_timeout was set without ttft_timeout, the idle clock was wrapping the first token wait and could kill a slow-starting provider as a mid-stream stall. The first-token phase now runs whenever either timeout is set but is bounded only by ttft_timeout's deadline; with ttft_timeout unset the first-token wait is unbounded and stream_idle_timeout only governs gaps after content begins. Added a regression test for the stream_idle_timeout-only path that fails on the prior behavior (commit 22335b1)

@greptileai

… feat/router-ttft-timeout

# Conflicts:
#	litellm/router.py
Use PEP 585/604 forms (list/dict, X | None) for the ttft_timeout/stream_idle_timeout
annotations added by this PR so UP006/UP045 totals stay within the strict-rule budget
ceiling the base enforces.
The promoted-stream path returns a reconstructed ModelResponse; assert it flows through
the same content-policy check as the non-streaming path. A content_filter finish_reason
with a content-policy fallback configured must raise ContentPolicyViolationError from
_acompletion. Mutation-verified: fails if the _acompletion content-policy raise is removed.
@TheCodeWrangler

TheCodeWrangler commented Jun 17, 2026

Author

Pushed three changes since the last review: merged the latest litellm_internal_staging to clear a conflict, modernized the new annotations to PEP 585/604 so the strict-rule budget gate passes, and added a test asserting the reconstructed response from a promoted stream goes through the same content-policy check as the non-streaming path. Re-requesting a review on the current head

Greptile is happy now @Sameerlite
#30337 (comment)

@Sameerlite

Collaborator

Thanks for the PR! A couple of things to get this over the finish line:

  • Greptile's review scored 5/5 but there are still some open review threads — could you take a look and resolve them? Once they're all cleared we're good to go.

Once those are addressed we'll take another look — appreciate the contribution!

@TheCodeWrangler

Author

Thanks @Sameerlite. I have gone through the three open Greptile threads and resolved them. Two were already handled by the current revision: the Phase 2 collection no longer swallows mid-stream errors (it is a plain async for now, so disconnects and provider errors propagate to the cooldown/fallback path), and the timeout resolution uses explicit is None checks instead of or-chaining so an explicit 0.0 is honored. The third (timeout not enforced when the caller passes stream=True) is intentional and is what the title scopes this to; for streaming callers we hand back the raw stream to preserve streaming semantics, and they can enforce their own first-token deadline on the iterator they control. Details are in each thread

asyncio.wait_for already raises TimeoutError for a non-positive timeout, so
the explicit remaining<=0 check duplicated wait_for's own behavior. Removing
it keeps the same observable result (litellm.Timeout via the except clause)
and lets wait_for return an already-ready chunk instead of spuriously timing
out.
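The `asyncio.wait_for` behavior this commit relies on can be checked in isolation. The sketch below uses only the standard library; the two function names are made up for the demonstration.

```python
import asyncio


async def ready_chunk_with_expired_deadline() -> str:
    # An already-completed future is returned even when the timeout is
    # non-positive, so a chunk that is ready is not discarded.
    fut = asyncio.get_running_loop().create_future()
    fut.set_result("chunk")
    return await asyncio.wait_for(fut, timeout=-0.5)


async def pending_chunk_with_expired_deadline() -> bool:
    # A pending awaitable with a non-positive timeout is cancelled and
    # wait_for raises TimeoutError without an explicit pre-check.
    try:
        await asyncio.wait_for(asyncio.sleep(10), timeout=0)
    except asyncio.TimeoutError:
        return True
    return False


if __name__ == "__main__":
    print(asyncio.run(ready_chunk_with_expired_deadline()))
    print(asyncio.run(pending_chunk_with_expired_deadline()))
```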
@TheCodeWrangler

Author

Quick note on the two red documentation checks: they are unrelated to this change. The test_router_settings.py gate does open("docs/my-website/docs/proxy/config_settings.md"), but that file was removed from this repo in the docs migration to litellm-docs (commit c35f3a5). It is now absent from both main and litellm_internal_staging, while the test that reads it is still present, so the check fails on any PR that adds a router setting; nothing in this PR can satisfy it from within the repo. I have the reference-table rows for the two new params staged in BerriAI/litellm-docs#353 for whenever the docs side is reconciled.

The actual blocker you flagged is cleared: all three Greptile threads are resolved (two were already addressed in code, the third is intentional and scoped to non-streaming, with reasoning in the thread). Greptile is at 5/5 and the rest of CI is green

@Sameerlite

Collaborator

Thanks for addressing all the open review threads and for the detailed explanation on the CI failures! Triggering a fresh Greptile review on the latest commit:

@greptileai

Once Greptile confirms 5/5 on the current SHA, we'll take another look!

3 participants