[Core] Reduce RoPE cache size for shorter context length by labAxiaoming · Pull Request #28136 · sgl-project/sglang

labAxiaoming · 2026-06-13T08:14:29Z

Motivation

Some models define very large max_position_embeddings, but users often serve
them with a smaller --context-length. In those cases, get_rope() still builds
cos/sin caches for the full model config length, which increases model
initialization memory usage without benefiting the configured runtime context.

Changes

Clamp selected RoPE cache construction to runtime model_config.context_len
when it is smaller than max_position.
Apply the optimization only to RoPE variants whose cache can be safely
truncated and later expanded:
- default
- llama3
- proportional
Skip cases that need full/custom position cache behavior:
- multimodal models whose rope_scaling is None
- mRoPE
- FOPE
- dual chunk attention
- other RoPE scaling variants such as dynamic/yarn/deepseek/longrope
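
The selection rule above can be sketched as a small helper. This is only an illustration under stated assumptions, not the code in this PR: the function name `clamped_rope_cache_len`, its parameters, and the boolean flags are hypothetical, and the real `get_rope()` may order or express these checks differently. The lookup of `rope_type` with a fallback to `type` follows the common Hugging Face config convention.

```python
from typing import Optional

# RoPE variants the PR description lists as safe to truncate and later expand.
TRUNCATABLE_ROPE_TYPES = {"default", "llama3", "proportional"}


def clamped_rope_cache_len(
    max_position: int,
    context_len: Optional[int],
    rope_scaling: Optional[dict] = None,
    is_multimodal: bool = False,
    uses_mrope: bool = False,
    uses_fope: bool = False,
    uses_dual_chunk_attention: bool = False,
) -> int:
    """Return the cache length to build (hypothetical helper, for illustration)."""
    # Nothing to save when the runtime context is not shorter than the model limit.
    if context_len is None or context_len >= max_position:
        return max_position
    # Cases that need full/custom position cache behavior are left untouched.
    if uses_dual_chunk_attention or uses_mrope or uses_fope:
        return max_position
    if is_multimodal and rope_scaling is None:
        return max_position
    # Resolve the variant name; a missing rope_scaling means plain RoPE.
    rope_type = "default"
    if rope_scaling is not None:
        rope_type = rope_scaling.get("rope_type", rope_scaling.get("type", "default"))
    if rope_type not in TRUNCATABLE_ROPE_TYPES:
        return max_position
    return context_len


print(clamped_rope_cache_len(262_144, 32_768))                            # 32768
print(clamped_rope_cache_len(262_144, 32_768, {"rope_type": "yarn"}))     # 262144
print(clamped_rope_cache_len(10_485_760, 32_768, {"rope_type": "llama3"}))  # 32768
```

Checking the skip conditions before resolving the variant name keeps the "leave it alone" cases from being clamped by accident when a config also carries a rope_scaling entry.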

Validation

Ran a remote config smoke test locally:
test_sgalng_remote_rope_configs.py

python test_sgalng_remote_rope_configs.py --context-len 32768 --dtype bfloat16

RoPE cache memory summary

Model	Case	RoPE	Cache len (positions)	Memory (MiB)	Saved (MiB)	Prefix check
Qwen/Qwen3-4B-Instruct-2507	text	default	262,144 -> 32,768	64.00 -> 8.00	56.00 (87.5%)	PASS
LLM-Research/Llama-4-Scout-17B-16E-Instruct	text	llama3	10,485,760 -> 32,768	2560.00 -> 8.00	2552.00 (99.7%)	PASS
ZhipuAI/GLM-5.1	text	default	202,752 -> 32,768	24.75 -> 4.00	20.75 (83.8%)	PASS
XiaomiMiMo/MiMo-V2.5	text	default	1,048,576 -> 32,768	128.00 -> 4.00	124.00 (96.9%)	PASS
MiniMax/MiniMax-M2.7	text	default	204,800 -> 32,768	50.00 -> 8.00	42.00 (84.0%)	PASS
google/gemma-4-31B-it	text-sliding_attention	default	262,144 -> 32,768	128.00 -> 16.00	112.00 (87.5%)	PASS
google/gemma-4-31B-it	text-full_attention	proportional	262,144 -> 32,768	256.00 -> 32.00	224.00 (87.5%)	PASS
Total	-	-	-	3210.75 -> 80.00	3130.75 (97.5%)	-

Summary:

Total RoPE cache memory: 3210.75 MiB -> 80.00 MiB
Saved: 3130.75 MiB (97.5%)
Prefix comparison: all tested cases PASS
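
The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below copies the cache lengths and MiB values from the table, derives the bytes stored per position from the "before" column, and confirms that the "after" column and the totals follow from them. The per-position byte count is inferred from the reported numbers, not read from any model config.

```python
MIB = 1024 * 1024

# (label, cache_len_before, cache_len_after, mib_before, mib_after), copied from the table.
ROWS = [
    ("Qwen3-4B-Instruct-2507", 262_144, 32_768, 64.00, 8.00),
    ("Llama-4-Scout-17B-16E-Instruct", 10_485_760, 32_768, 2560.00, 8.00),
    ("GLM-5.1", 202_752, 32_768, 24.75, 4.00),
    ("MiMo-V2.5", 1_048_576, 32_768, 128.00, 4.00),
    ("MiniMax-M2.7", 204_800, 32_768, 50.00, 8.00),
    ("gemma-4-31B-it (sliding_attention)", 262_144, 32_768, 128.00, 16.00),
    ("gemma-4-31B-it (full_attention)", 262_144, 32_768, 256.00, 32.00),
]


def bytes_per_position(cache_len: int, mib: float) -> float:
    """Bytes of cos/sin cache stored for each position."""
    return mib * MIB / cache_len


def summarize(rows):
    """Return (total_before, total_after, saved, saved_percent) in MiB."""
    total_before = sum(r[3] for r in rows)
    total_after = sum(r[4] for r in rows)
    saved = total_before - total_after
    return total_before, total_after, saved, 100.0 * saved / total_before


for label, len_before, len_after, mib_before, mib_after in ROWS:
    per_pos = bytes_per_position(len_before, mib_before)
    predicted_after = per_pos * len_after / MIB
    print(f"{label}: {per_pos:.0f} B/position, predicted {predicted_after:.2f} MiB")

total_before, total_after, saved, pct = summarize(ROWS)
print(f"{total_before:.2f} -> {total_after:.2f} MiB, saved {saved:.2f} ({pct:.1f}%)")
# prints "3210.75 -> 80.00 MiB, saved 3130.75 (97.5%)"
```

Because the cache grows linearly with the number of positions, the saving depends only on the ratio of the two cache lengths, which is why Llama-4-Scout with its 10,485,760-position limit dominates the total.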

Also verified that cache expansion via _ensure_cos_sin_cache_length() matches
direct full-cache construction for default, llama3, and proportional.
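
The property being verified can be illustrated with a pure-Python sketch of the default RoPE cache. Each row depends only on its position, so a truncated cache is a prefix of the full one, and appending the missing rows reproduces the full cache. The helper names `rope_row`, `build_cache`, and `ensure_cache_length` are hypothetical; `ensure_cache_length` only stands in for `_ensure_cos_sin_cache_length()`, whose actual implementation is not shown here. The sketch covers the default variant only and says nothing about how llama3 or proportional are handled in SGLang.

```python
import math
from typing import List


def rope_row(position: int, rotary_dim: int, base: float = 10000.0) -> List[float]:
    """cos values followed by sin values for one position (default RoPE)."""
    half = rotary_dim // 2
    angles = [position * base ** (-2.0 * i / rotary_dim) for i in range(half)]
    return [math.cos(a) for a in angles] + [math.sin(a) for a in angles]


def build_cache(length: int, rotary_dim: int, base: float = 10000.0) -> List[List[float]]:
    """Build the cos/sin cache for positions 0..length-1."""
    return [rope_row(p, rotary_dim, base) for p in range(length)]


def ensure_cache_length(
    cache: List[List[float]], needed_len: int, rotary_dim: int, base: float = 10000.0
) -> List[List[float]]:
    """Hypothetical stand-in for cache expansion: append rows for missing positions."""
    if needed_len <= len(cache):
        return cache
    extra = [rope_row(p, rotary_dim, base) for p in range(len(cache), needed_len)]
    return cache + extra


full = build_cache(64, rotary_dim=8)
short = build_cache(16, rotary_dim=8)

# Prefix check: the truncated cache matches the first rows of the full cache.
assert full[:16] == short
# Expansion check: growing the short cache reproduces direct full construction.
assert ensure_cache_length(short, 64, rotary_dim=8) == full
```

A variant whose per-position values depend on the total cache length would fail the prefix check, which is consistent with the PR limiting the optimization to a short list of variants.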

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

CI States

Latest PR Test (Base): ⏳ Run #27461850431
Latest PR Test (Extra): ⏳ Run #27461850376

gemini-code-assist

Code Review

This pull request introduces a helper function _get_runtime_model_config and updates get_rope to clamp max_position to the runtime context length under specific conditions. The review feedback raises two issues. First, scaling_type resolution should give dual_chunk_attention_config precedence over rope_scaling to prevent incorrect resolution. Second, clamping max_position directly can cause later lookups of original_max_position_embeddings to default to the clamped value instead of the original one, which could degrade model accuracy for scaling types such as llama3.
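
The second concern can be made concrete with a small sketch. It assumes, as the review summary describes, that the code falls back to max_position when rope_scaling carries no explicit original_max_position_embeddings. The helper `resolve_original_max_position` and the example numbers are hypothetical and chosen only to show the ordering problem.

```python
from typing import Optional


def resolve_original_max_position(rope_scaling: Optional[dict], fallback: int) -> int:
    """Hypothetical lookup: use the explicit value if present, else the fallback."""
    if rope_scaling and "original_max_position_embeddings" in rope_scaling:
        return rope_scaling["original_max_position_embeddings"]
    return fallback


# Illustrative values, not taken from any real model config.
model_max_position = 131_072
context_len = 32_768
rope_scaling = {"rope_type": "llama3", "factor": 8.0}  # no explicit original length

# Pitfall: clamp first, then let the lookup fall back to the clamped value.
max_position = min(model_max_position, context_len)
wrong_original = resolve_original_max_position(rope_scaling, fallback=max_position)

# Safer ordering: resolve against the unclamped value, then clamp only the cache length.
right_original = resolve_original_max_position(rope_scaling, fallback=model_max_position)
cache_len = min(model_max_position, context_len)

print(wrong_original)  # 32768
print(right_original)  # 131072
print(cache_len)       # 32768
```

Keeping the cache length and the scaling parameters as separate values means that shrinking the cache cannot change the frequencies the scaling formula produces.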


labAxiaoming requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, ispobock and merrymercy as code owners June 13, 2026 08:14

gemini-code-assist Bot reviewed Jun 13, 2026


Comment thread python/sglang/srt/layers/rotary_embedding/factory.py Outdated

[Core] Reduce RoPE cache size for shorter context length

01202be

labAxiaoming force-pushed the main branch from 72642a9 to 01202be on June 13, 2026 08:37

labAxiaoming wants to merge 1 commit into sgl-project:main from labAxiaoming:main
