Skip to content

etcdserver: add TestWriteTxnPanicWithoutApply_deadlock to detect goroutine leak when write txn panics#21941

Open
zhengshi1998 wants to merge 1 commit into
etcd-io:mainfrom
zhengshi1998:issue-21939
Open

etcdserver: add TestWriteTxnPanicWithoutApply_deadlock to detect goroutine leak when write txn panics#21941
zhengshi1998 wants to merge 1 commit into
etcd-io:mainfrom
zhengshi1998:issue-21939

Conversation

@zhengshi1998

@zhengshi1998 zhengshi1998 commented Jun 11, 2026

Copy link
Copy Markdown

This change adds a regression test to detect a goroutine deadlock/leak in the write transaction panic path. See issue #21939 for the context.

In TestWriteTxnPanicWithoutApply, a background goroutine can get stuck when a write txn panics before being properly finalized.

To make this issue reproducible and prevent future regressions, I added a new test TestWriteTxnPanicWithoutApply_deadlock, which is identical to TestWriteTxnPanicWithoutApply except that it uses goleak.VerifyNone to detect leaked goroutines.

This ensures that any failure to properly exit background goroutines will be caught by CI.

Output from goleak when running go test -run TestWriteTxnPanicWithoutApply_deadlock :

--- FAIL: TestWriteTxnPanicWithoutApply_deadlock (0.48s)
    logger.go:146: 2026-06-11T11:26:48.215+0800 INFO    bbolt   Opening db file (/tmp/TestWriteTxnPanicWithoutApply_deadlock3091311943/001/etcd_backend_test1594815258/database) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, MaxSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0x354193f32818, NoStatistics: false}
    logger.go:146: 2026-06-11T11:26:48.220+0800 INFO    bbolt   Opening bbolt db (/tmp/TestWriteTxnPanicWithoutApply_deadlock3091311943/001/etcd_backend_test1594815258/database) successfully
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Starting a new transaction [writable: true]
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Starting a new transaction [writable: true] successfully
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Starting a new transaction [writable: false]
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Starting a new transaction [writable: false] successfully
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Creating bucket if not exist "key"
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Creating bucket if not exist "key" successfully
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Creating bucket if not exist "meta"
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Creating bucket if not exist "meta" successfully
    logger.go:146: 2026-06-11T11:26:48.220+0800 DEBUG   bbolt   Committing transaction 2
    logger.go:146: 2026-06-11T11:26:48.224+0800 DEBUG   bbolt   Committing transaction 2 successfully
    logger.go:146: 2026-06-11T11:26:48.224+0800 DEBUG   bbolt   Starting a new transaction [writable: true]
    logger.go:146: 2026-06-11T11:26:48.224+0800 DEBUG   bbolt   Starting a new transaction [writable: true] successfully
    logger.go:146: 2026-06-11T11:26:48.224+0800 DEBUG   bbolt   Starting a new transaction [writable: false]
    logger.go:146: 2026-06-11T11:26:48.224+0800 DEBUG   bbolt   Starting a new transaction [writable: false] successfully
    logger.go:146: 2026-06-11T11:26:48.226+0800 INFO    kvstore restored        {"current-rev": 1}
    logger.go:146: 2026-06-11T11:26:48.238+0800 DEBUG   bbolt   Putting key "\x00\x00\x00\x00\x00\x00\x00\x02_\x00\x00\x00\x00\x00\x00\x00\x00"
    logger.go:146: 2026-06-11T11:26:48.238+0800 DEBUG   bbolt   Putting key "\x00\x00\x00\x00\x00\x00\x00\x02_\x00\x00\x00\x00\x00\x00\x00\x00" successfully
    logger.go:146: 2026-06-11T11:26:48.238+0800 PANIC   unexpected error during txn with writes {"error": "applyTxn: failed Range: rangeKeys: context cancelled: context canceled"}
    txn_test.go:488: found unexpected goroutines:
        [Goroutine 76 in state sync.Mutex.Lock, with internal/sync.runtime_SemacquireMutex on top of the stack:
        internal/sync.runtime_SemacquireMutex(0x354194280978?, 0x5c?, 0x354193fcf788?)
                /home/zhengshi357/go1.26.4/src/runtime/sema.go:95 +0x25
        internal/sync.(*Mutex).lockSlow(0x3541941c57c0)
                /home/zhengshi357/go1.26.4/src/internal/sync/mutex.go:149 +0x15d
        internal/sync.(*Mutex).Lock(...)
                /home/zhengshi357/go1.26.4/src/internal/sync/mutex.go:70
        sync.(*Mutex).Lock(...)
                /home/zhengshi357/go1.26.4/src/sync/mutex.go:46
        go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).safePending(0x354193fcf778?)
                /home/zhengshi357/CCT-projects/etcd/server/storage/backend/batch_tx.go:256 +0x4b
        go.etcd.io/etcd/server/v3/storage/backend.(*backend).run(0x354194041e00)
                /home/zhengshi357/CCT-projects/etcd/server/storage/backend/backend.go:452 +0x10d
        created by go.etcd.io/etcd/server/v3/storage/backend.newBackend in goroutine 75
                /home/zhengshi357/CCT-projects/etcd/server/storage/backend/backend.go:255 +0x699
        ]
FAIL
exit status 1
FAIL    go.etcd.io/etcd/server/v3/etcdserver/txn        0.488s

@k8s-ci-robot

Copy link
Copy Markdown

Hi @zhengshi1998. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: zhengshi1998
Once this PR has been reviewed and has the lgtm label, please assign spzala for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment thread server/etcdserver/txn/txn_test.go Outdated
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"go.uber.org/zap/zaptest"
"go.uber.org/goleak"

@serathius serathius Jun 11, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use existing goleak detection that is already used in etcd testing.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. This is the output:

--- FAIL: TestWriteTxnPanicWithoutApply_deadlock (1.05s)
    logger.go:146: 2026-06-11T22:05:42.467+0800 INFO    bbolt   Opening db file (/tmp/TestWriteTxnPanicWithoutApply_deadlock4289319619/001/etcd_backend_test2452315083/database) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, MaxSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0x1e5685686008, NoStatistics: false}
    logger.go:146: 2026-06-11T22:05:42.471+0800 INFO    bbolt   Opening bbolt db (/tmp/TestWriteTxnPanicWithoutApply_deadlock4289319619/001/etcd_backend_test2452315083/database) successfully
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Starting a new transaction [writable: true]
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Starting a new transaction [writable: true] successfully
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Starting a new transaction [writable: false]
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Starting a new transaction [writable: false] successfully
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Creating bucket if not exist "key"
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Creating bucket if not exist "key" successfully
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Creating bucket if not exist "meta"
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Creating bucket if not exist "meta" successfully
    logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG   bbolt   Committing transaction 2
    logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG   bbolt   Committing transaction 2 successfully
    logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG   bbolt   Starting a new transaction [writable: true]
    logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG   bbolt   Starting a new transaction [writable: true] successfully
    logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG   bbolt   Starting a new transaction [writable: false]
    logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG   bbolt   Starting a new transaction [writable: false] successfully
    logger.go:146: 2026-06-11T22:05:42.476+0800 INFO    kvstore restored        {"current-rev": 1}
    logger.go:146: 2026-06-11T22:05:42.487+0800 DEBUG   bbolt   Putting key "\x00\x00\x00\x00\x00\x00\x00\x02_\x00\x00\x00\x00\x00\x00\x00\x00"
    logger.go:146: 2026-06-11T22:05:42.487+0800 DEBUG   bbolt   Putting key "\x00\x00\x00\x00\x00\x00\x00\x02_\x00\x00\x00\x00\x00\x00\x00\x00" successfully
    logger.go:146: 2026-06-11T22:05:42.487+0800 PANIC   unexpected error during txn with writes {"error": "applyTxn: failed Range: rangeKeys: context cancelled: context canceled"}
    leak.go:129: Test appears to have leaked :
        internal/sync.runtime_SemacquireMutex(0x1e5685637bd8?, 0x5c?, 0x1e5685407f88?)
                /home/zhengshi357/go1.26.4/src/runtime/sema.go:95 +0x25
        internal/sync.(*Mutex).lockSlow(0x1e568562e340)
                /home/zhengshi357/go1.26.4/src/internal/sync/mutex.go:149 +0x15d
        internal/sync.(*Mutex).Lock(...)
                /home/zhengshi357/go1.26.4/src/internal/sync/mutex.go:70
        sync.(*Mutex).Lock(...)
                /home/zhengshi357/go1.26.4/src/sync/mutex.go:46
        go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).safePending(0x1e5685407f78?)
                /home/zhengshi357/CCT-projects/etcd/server/storage/backend/batch_tx.go:256 +0x4b
        go.etcd.io/etcd/server/v3/storage/backend.(*backend).run(0x1e56854e6000)
                /home/zhengshi357/CCT-projects/etcd/server/storage/backend/backend.go:452 +0x10d
        created by go.etcd.io/etcd/server/v3/storage/backend.newBackend in goroutine 70
                /home/zhengshi357/CCT-projects/etcd/server/storage/backend/backend.go:255 +0x699
FAIL
exit status 1
FAIL    go.etcd.io/etcd/server/v3/etcdserver/txn        1.056s

…ct goroutine leak on panic

This change adds a regression test to detect a goroutine deadlock/leak in the write transaction panic path.

In `TestWriteTxnPanicWithoutApply`, a background goroutine can get stuck when a write txn panics before being properly finalized.

To make this issue reproducible and prevent future regressions, I added a new test `TestWriteTxnPanicWithoutApply_deadlock`, which is identical to
`TestWriteTxnPanicWithoutApply` except that it uses `goleak.VerifyNone` to detect leaked goroutines.

This ensures that any failure to properly exit background goroutines will be caught by CI.

Signed-off-by: zhengshi1998 <zhengshi1998@foxmail.com>
@serathius

Copy link
Copy Markdown
Member

Ok, why do we panic at all in TXN code? Wouldn't panic mean that etcd crashes? At that point do we really care about goroutine leak?

@zhengshi1998

Copy link
Copy Markdown
Author

Thanks for the feedback. I think there are a few concerns here:

  1. Even if etcd is not primarily designed as an embedded database, some users do embed it or wrap it in higher-level systems. In those cases, they may rely on recover to handle certain recoverable failures like context timeouts. However, the current TXN implementation can leave a mutex held after panic, which effectively blocks all other goroutines waiting on TXN, turning a transient failure into a system-wide stall.

  2. I think regardless of whether panic is considered fatal, it should not violate lock invariants. Otherwise, the system may enter an unrecoverable deadlock state instead of failing cleanly.

  3. In concurrent test scenarios, these blocked goroutines accumulate and consume memory, which can lead to test hangs or resource exhaustion over time.

@serathius

Copy link
Copy Markdown
Member

Ad 1 while it possible, it would be better to start from removing a panic, no?

@zhengshi1998

Copy link
Copy Markdown
Author

The write TXN panics because if ctx gets canceled, then the err is not nil and later lg.Panic() (https://github.com/etcd-io/etcd/blob/main/server/etcdserver/txn/txn.go#L91) throws a panic.

func txn(ctx context.Context, lg *zap.Logger, txnWrite mvcc.TxnWrite, rt *pb.TxnRequest, isWrite bool, txnPath []bool, skipRangeExecution bool) (*pb.TxnResponse, error) {
	txnResp, _ := newTxnResp(rt, txnPath)
	_, err := executeTxn(ctx, lg, txnWrite, rt, txnPath, txnResp, skipRangeExecution)
	if err != nil {
		if isWrite {
			// CAUTION: When a txn performing write operations starts, we always expect it to be successful.
			// If a write failure is seen we SHOULD NOT try to recover the server, but crash with a panic to make the failure explicit.
			// Trying to silently recover (e.g by ignoring the failed txn or calling txn.End() early) poses serious risks:
			// - violation of transaction atomicity if some write operations have been partially executed
			// - data inconsistency across different etcd members if they applied the txn asymmetrically
			lg.Panic("unexpected error during txn with writes", zap.Error(err))
		} else {
			lg.Error("unexpected error during readonly txn", zap.Error(err))
		}
	}
        ...
}

From the comments and the commit history, my understanding is that the panic is intentional, so I'm not sure whether replacing lg.Panic() with lg.Error() would be consistent with the existing design.

Are you suggesting that context cancellation should be handled differently and not reach this panic path in the first place?

@serathius

Copy link
Copy Markdown
Member

I'm suggesting that we should confirm that this was the right decision.

Do we treat panic as part of embedded etcd and hunt all possible goroutine leaks on panics, which will take a lot of work to get right. Or we just drop this panic.

@serathius

Copy link
Copy Markdown
Member

/cc @fuweid @ahrtr

@k8s-ci-robot k8s-ci-robot requested review from ahrtr and fuweid June 12, 2026 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

3 participants