[Core] Reduce RoPE cache size for shorter context length#28136

Open: labAxiaoming wants to merge 1 commit into sgl-project:main from labAxiaoming:main

@labAxiaoming labAxiaoming commented Jun 13, 2026

Motivation

Some models define very large max_position_embeddings, but users often serve
them with a smaller --context-length. In those cases, get_rope() still builds
cos/sin caches for the full model config length, which increases model
initialization memory usage without benefiting the configured runtime context.

Changes

  • Clamp selected RoPE cache construction to runtime model_config.context_len
    when it is smaller than max_position.
  • Apply the optimization only to RoPE variants whose cache can be safely
    truncated and later expanded:
    • default
    • llama3
    • proportional
  • Skip cases that need full/custom position cache behavior:
    • multimodal models where rope_scaling is None
    • mRoPE
    • FOPE
    • dual chunk attention
    • other RoPE scaling variants such as dynamic/yarn/deepseek/longrope
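The decision rule in the two lists above can be sketched as a small standalone function. This is only an illustration under stated assumptions, not the code in this PR: `clamped_max_position`, its boolean flags, and the lookup of the `rope_type` / `type` keys in `rope_scaling` are all hypothetical names chosen for the example.

```python
from typing import Optional

# RoPE variants the description lists as safe to truncate and later expand.
TRUNCATABLE_ROPE_TYPES = frozenset({"default", "llama3", "proportional"})


def clamped_max_position(
    max_position: int,
    context_len: Optional[int],
    rope_scaling: Optional[dict] = None,
    *,
    is_multimodal: bool = False,
    uses_mrope: bool = False,
    uses_fope: bool = False,
    uses_dual_chunk_attention: bool = False,
) -> int:
    """Return the cache length to build (hypothetical helper, not the PR's code)."""
    # Nothing to gain when the runtime context is not shorter than the config.
    if context_len is None or context_len >= max_position:
        return max_position
    # Variants that need full or custom position caches are left untouched.
    if uses_mrope or uses_fope or uses_dual_chunk_attention:
        return max_position
    if rope_scaling is None:
        # Multimodal models without rope_scaling are skipped per the list above.
        if is_multimodal:
            return max_position
        rope_type = "default"
    else:
        # Assumption: the variant name is stored under "rope_type" or "type".
        rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    # dynamic / yarn / deepseek / longrope and anything else unknown: no clamp.
    if rope_type not in TRUNCATABLE_ROPE_TYPES:
        return max_position
    return context_len
```

Defaulting to "no clamp" for anything not explicitly listed keeps the optimization conservative: an unrecognized variant behaves exactly as it did before the change.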

Validation

Ran a remote config smoke test locally:
test_sgalng_remote_rope_configs.py

python test_sgalng_remote_rope_configs.py --context-len 32768 --dtype bfloat16

RoPE cache memory summary

| Model | Case | RoPE | Cache len | Memory (MiB) | Saved (MiB) | Prefix |
|---|---|---|---|---|---|---|
| Qwen/Qwen3-4B-Instruct-2507 | text | default | 262,144 -> 32,768 | 64.00 -> 8.00 | 56.00 (87.5%) | PASS |
| LLM-Research/Llama-4-Scout-17B-16E-Instruct | text | llama3 | 10,485,760 -> 32,768 | 2560.00 -> 8.00 | 2552.00 (99.7%) | PASS |
| ZhipuAI/GLM-5.1 | text | default | 202,752 -> 32,768 | 24.75 -> 4.00 | 20.75 (83.8%) | PASS |
| XiaomiMiMo/MiMo-V2.5 | text | default | 1,048,576 -> 32,768 | 128.00 -> 4.00 | 124.00 (96.9%) | PASS |
| MiniMax/MiniMax-M2.7 | text | default | 204,800 -> 32,768 | 50.00 -> 8.00 | 42.00 (84.0%) | PASS |
| google/gemma-4-31B-it | text-sliding_attention | default | 262,144 -> 32,768 | 128.00 -> 16.00 | 112.00 (87.5%) | PASS |
| google/gemma-4-31B-it | text-full_attention | proportional | 262,144 -> 32,768 | 256.00 -> 32.00 | 224.00 (87.5%) | PASS |
| Total | - | - | - | 3210.75 -> 80.00 | 3130.75 (97.5%) | - |

Summary:

  • Total RoPE cache memory: 3210.75 MiB -> 80.00 MiB
  • Saved: 3130.75 MiB (97.5%)
  • Prefix comparison: all tested cases PASS
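As a rough cross-check of the figures above, the arithmetic for one row can be written out. The cache layout (one row of `rotary_dim` elements per position, cos and sin halves concatenated) and the value `rotary_dim=128` are assumptions not stated in this PR; they are used here only because they reproduce the Qwen/Qwen3-4B-Instruct-2507 row in bfloat16. `rope_cache_mib` is a hypothetical helper.

```python
def rope_cache_mib(num_positions: int, rotary_dim: int, bytes_per_element: int = 2) -> float:
    """Estimated cos/sin cache size in MiB (bytes_per_element=2 for bfloat16)."""
    return num_positions * rotary_dim * bytes_per_element / 2**20


# Assumed rotary_dim=128; positions taken from the Qwen row of the summary.
full = rope_cache_mib(262_144, rotary_dim=128)
clamped = rope_cache_mib(32_768, rotary_dim=128)
saved = full - clamped
saved_pct = 100 * saved / full
print(f"{full:.2f} -> {clamped:.2f} MiB, saved {saved:.2f} MiB ({saved_pct:.1f}%)")
```

Because the size is linear in the number of positions, the relative saving depends only on the ratio of context length to max_position, which is why rows with the same 262,144 -> 32,768 reduction all show 87.5%.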

Also verified that cache expansion via _ensure_cos_sin_cache_length() matches
direct full-cache construction for default, llama3, and proportional.
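The idea behind the prefix and expansion checks can be illustrated with a pure-Python sketch of the default RoPE cache. It does not call or reproduce `_ensure_cos_sin_cache_length()`; `build_cos_sin_cache` and `extend_cache` are hypothetical stand-ins, and the small sizes are chosen only to keep the example fast.

```python
import math


def build_cos_sin_cache(num_positions, rotary_dim, base=10000.0, start=0):
    """Rows of [cos..., sin...] for positions start .. start + num_positions - 1."""
    inv_freq = [base ** (-(2 * i) / rotary_dim) for i in range(rotary_dim // 2)]
    rows = []
    for pos in range(start, start + num_positions):
        angles = [pos * f for f in inv_freq]
        rows.append([math.cos(a) for a in angles] + [math.sin(a) for a in angles])
    return rows


def extend_cache(cache, new_length, rotary_dim, base=10000.0):
    """Append rows for the missing positions; existing rows are reused as-is."""
    if new_length <= len(cache):
        return cache
    extra = build_cos_sin_cache(new_length - len(cache), rotary_dim, base, start=len(cache))
    return cache + extra


full_cache = build_cos_sin_cache(64, rotary_dim=8)
short_cache = build_cos_sin_cache(16, rotary_dim=8)

# Prefix check: the truncated cache equals the first rows of the full cache.
assert full_cache[:16] == short_cache
# Expansion check: growing the truncated cache reproduces the full cache.
assert extend_cache(short_cache, 64, rotary_dim=8) == full_cache
```

Each row depends only on its own position and on frequencies that do not depend on the cache length, which is the property that makes truncation and later expansion safe for this variant. Variants whose frequencies depend on the total length would not pass the same check.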

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ⏳ Run #27461850431
Latest PR Test (Extra): ⏳ Run #27461850376

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a helper function _get_runtime_model_config and updates get_rope to clamp max_position to the runtime context length under specific conditions. The review feedback points out two important issues: first, the precedence of scaling_type resolution should prioritize dual_chunk_attention_config over rope_scaling to prevent incorrect resolution; second, clamping max_position directly can cause subsequent lookups for original_max_position_embeddings to default to the clamped value instead of the original value, which could degrade model accuracy for scaling types like llama3.
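The second point in the review can be illustrated with a minimal sketch. It assumes, as the review describes, that a missing `original_max_position_embeddings` entry falls back to `max_position`; `resolve_positions` is a hypothetical helper and not code from `factory.py`.

```python
def resolve_positions(rope_scaling: dict, max_position: int, context_len: int):
    """Return (original_max_position, cache_len) without letting the clamp leak."""
    # Resolve the original length BEFORE clamping, so the fallback uses the
    # model's configured max_position rather than the shorter runtime value.
    original_max_position = rope_scaling.get(
        "original_max_position_embeddings", max_position
    )
    # Only the cache length is clamped to the runtime context.
    cache_len = min(max_position, context_len)
    return original_max_position, cache_len


# Without an explicit entry, the fallback stays at the unclamped value.
print(resolve_positions({"rope_type": "llama3"}, 131072, 32768))
```

Keeping the two values in separate variables means the scaling math still sees the original length while only the cache allocation shrinks.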

Comment thread python/sglang/srt/layers/rotary_embedding/factory.py Outdated