
ggml: optimize concat op by replacing per-element memcpy with row-level memcpy #24575

Open
sirohikartik wants to merge 1 commit into ggml-org:master from sirohikartik:optim/concat-forward


@sirohikartik

Contributor

Overview

Optimize ggml_compute_forward_concat_any by replacing per-element memcpy with row-level memcpy.

The original implementation called memcpy once per scalar element, with a branch inside the innermost loop to select between src0 and src1. For a typical KV-cache concat shape [4096 x 1 x 16 x 1] along dim=2, this results in 4096 x 16 = 65,536 separate memcpy calls of 4 bytes each.

This PR splits the loop into two separate regions (one per source tensor), which eliminates the per-element branch, and collapses the i0 loop entirely so that each memcpy call copies one full row instead of one element.
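The difference between the two strategies can be illustrated with a minimal sketch. This is not the code from the PR: `tensor4`, `t4_index`, `concat_per_element`, `copy_rows` and `concat_row_level` are hypothetical names, and the sketch assumes contiguous fp32 tensors, whereas real ggml tensors carry explicit byte strides. Both variants return the number of memcpy calls they issued so the two can be compared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a ggml tensor: contiguous fp32, ne[0] innermost. */
typedef struct {
    int64_t ne[4];
    float  *data;
} tensor4;

/* Flat element index for a contiguous 4D tensor. */
static size_t t4_index(const tensor4 *t, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return (size_t)(((i3 * t->ne[2] + i2) * t->ne[1] + i1) * t->ne[0] + i0);
}

/* Per-element variant: one 4-byte memcpy and one branch per element. */
static size_t concat_per_element(tensor4 *dst, const tensor4 *a, const tensor4 *b, int dim) {
    assert(dim >= 0 && dim < 4);
    int64_t o[4] = {0, 0, 0, 0};
    o[dim] = a->ne[dim];            /* offset of b inside dst along dim */
    size_t calls = 0;
    for (int64_t i3 = 0; i3 < dst->ne[3]; i3++)
    for (int64_t i2 = 0; i2 < dst->ne[2]; i2++)
    for (int64_t i1 = 0; i1 < dst->ne[1]; i1++)
    for (int64_t i0 = 0; i0 < dst->ne[0]; i0++) {
        const float *src;
        if (i0 < a->ne[0] && i1 < a->ne[1] && i2 < a->ne[2] && i3 < a->ne[3]) {
            src = a->data + t4_index(a, i0, i1, i2, i3);
        } else {
            src = b->data + t4_index(b, i0 - o[0], i1 - o[1], i2 - o[2], i3 - o[3]);
        }
        memcpy(dst->data + t4_index(dst, i0, i1, i2, i3), src, sizeof(float));
        calls++;
    }
    return calls;
}

/* Copy every row of src into dst at offset o: one memcpy per row, no branch. */
static size_t copy_rows(tensor4 *dst, const tensor4 *src, const int64_t o[4]) {
    size_t calls = 0;
    for (int64_t i3 = 0; i3 < src->ne[3]; i3++)
    for (int64_t i2 = 0; i2 < src->ne[2]; i2++)
    for (int64_t i1 = 0; i1 < src->ne[1]; i1++) {
        memcpy(dst->data + t4_index(dst, o[0], i1 + o[1], i2 + o[2], i3 + o[3]),
               src->data + t4_index(src, 0, i1, i2, i3),
               (size_t)src->ne[0] * sizeof(float));
        calls++;
    }
    return calls;
}

/* Row-level variant: two regions, one per source tensor. */
static size_t concat_row_level(tensor4 *dst, const tensor4 *a, const tensor4 *b, int dim) {
    assert(dim >= 0 && dim < 4);
    const int64_t zero[4] = {0, 0, 0, 0};
    int64_t o[4] = {0, 0, 0, 0};
    o[dim] = a->ne[dim];
    return copy_rows(dst, a, zero) + copy_rows(dst, b, o);
}
```

Under these assumptions, if [4096 x 1 x 16 x 1] is taken as the destination shape, the per-element variant issues 65,536 copies while the row-level variant issues 16, one per row of 4096 floats.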

Benchmark

Isolated microbenchmark using identical tensor layout and loop logic.
Shape: [4096 x 1 x 16 x 1], concat dim=2, fp32, 200 runs
Apple Silicon (M1 Air):

|            | old          | new           | speedup |
|------------|--------------|---------------|---------|
| warm cache | 1.61 ms/call | 0.012 ms/call | 133x    |
| cold cache | 1.59 ms/call | 0.019 ms/call | 83x     |

Cold cache measured by flushing 64MB through memory before every call.
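For readers who want to reproduce a cold-cache measurement of this kind, the following is a rough sketch of one possible harness, not the benchmark used for the numbers above. All names (`flush_cache`, `bench_ms_per_call`, `kernel_fn`, `touch_counter`) are hypothetical, the 64 MiB scratch size simply mirrors the figure quoted here, and `timespec_get` reports wall-clock time, so a monotonic clock may be preferable where one is available.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Scratch size used to evict cached data; mirrors the 64MB mentioned above. */
enum { FLUSH_BYTES = 64 * 1024 * 1024 };

/* Read and write every byte of a scratch buffer so earlier data is pushed
 * out of the caches. The checksum is returned so the loop has a visible
 * result. */
static unsigned flush_cache(volatile unsigned char *scratch, size_t n) {
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        scratch[i] = (unsigned char)(scratch[i] + 1);
        sum += scratch[i];
    }
    return sum;
}

/* Elapsed milliseconds between two timestamps. */
static double elapsed_ms(struct timespec t0, struct timespec t1) {
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* Signature of the code under test (hypothetical). */
typedef void (*kernel_fn)(void *ctx);

/* Placeholder kernel: counts how often it was called. */
static void touch_counter(void *ctx) {
    ++*(int *)ctx;
}

/* Mean ms per call over `runs` calls. If `scratch` is non-NULL the caches
 * are flushed before every call; the flush itself is not timed. */
static double bench_ms_per_call(kernel_fn fn, void *ctx, int runs,
                                unsigned char *scratch, size_t scratch_bytes) {
    assert(runs > 0);
    double total = 0.0;
    for (int r = 0; r < runs; r++) {
        if (scratch) {
            flush_cache(scratch, scratch_bytes);
        }
        struct timespec t0, t1;
        timespec_get(&t0, TIME_UTC);
        fn(ctx);
        timespec_get(&t1, TIME_UTC);
        total += elapsed_ms(t0, t1);
    }
    return total / runs;
}
```

A warm-cache run would pass `NULL` for `scratch`; a cold-cache run would pass a buffer of `FLUSH_BYTES` bytes allocated once with `malloc` and reused across calls.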

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES. I used Claude Sonnet 4.6 to help understand the existing
    code, identify the inefficiency, and verify the approach. All changes were reviewed and benchmarked by me.

@sirohikartik sirohikartik requested a review from ggerganov as a code owner June 13, 2026 14:07
@github-actions github-actions Bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) Jun 13, 2026
@sirohikartik

Contributor Author

Hi @ggerganov I think this is ready for review.
