Skip to content

fix(v3rpc): make lease keep-alive stream processing synchronous to prevent goroutine leak#21960

Open
HarshalPatel1972 wants to merge 1 commit into
etcd-io:mainfrom
HarshalPatel1972:fix/lease-stream-leak
Open

fix(v3rpc): make lease keep-alive stream processing synchronous to prevent goroutine leak#21960
HarshalPatel1972 wants to merge 1 commit into
etcd-io:mainfrom
HarshalPatel1972:fix/lease-stream-leak

Conversation

@HarshalPatel1972

Copy link
Copy Markdown

Description

Fixes #16942

This PR resolves a persistent gRPC lease keep-alive stream background goroutine leak that occurs during unexpected client disconnects, network partitions, or rapid context cancellations.

Technical Root Cause

Previously, LeaseKeepAlive decoupled the RPC lifecycle from the actual processing engine by spinning up ls.leaseKeepAlive(stream) inside an independent background goroutine (go func()), while the parent handler blocked on a separate select block listening to <-stream.Context().Done().

When a network drop or context cancellation unblocked the parent select, the parent handler exited immediately, signaling to gRPC that the stream was dead. However, the background processing thread remained permanently trapped and concurrently blocked on stream.Recv(). Because the stream context was already torn down, it could never return a clean io.EOF, resulting in silent goroutine aggregation and massive heap memory creep under volatile networking workloads.

Resolution Strategy

  • Refactored LeaseKeepAlive to synchronously execute and return the loop handler: return ls.leaseKeepAlive(stream).
  • Because stream.Recv() is natively context-aware, collapsing the async goroutine architecture binds the execution thread directly to the lifespans of the underlying network frames, allowing the runtime to drain out naturally when a cancel sequence fires.

Testing & Reproduction steps

  • Implemented TestLeaseKeepAlive_GoroutineLeakOnDisconnect inside server/etcdserver/api/v3rpc/lease_test.go.
  • The test mounts a mock stream handler, pushes load frames, triggers an unannounced context cancel step, and asserts via runtime.NumGoroutine() that the background reader exits instantly without leaving hanging execution contexts.

Copilot AI review requested due to automatic review settings June 12, 2026 21:15

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: HarshalPatel1972
Once this PR has been reviewed and has the lgtm label, please assign jmhbnz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Copy link
Copy Markdown

Hi @HarshalPatel1972. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

…event goroutine leak (etcd-io#16942)

Signed-off-by: Harshal Patel <hp842484@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants