etcdserver: add TestWriteTxnPanicWithoutApply_deadlock to detect goroutine leak when write txn panics#21941
etcdserver: add TestWriteTxnPanicWithoutApply_deadlock to detect goroutine leak when write txn panics#21941zhengshi1998 wants to merge 1 commit into
Conversation
|
Hi @zhengshi1998. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: zhengshi1998 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| "github.com/stretchr/testify/assert" | ||
| "github.com/stretchr/testify/require" | ||
| "go.uber.org/zap/zaptest" | ||
| "go.uber.org/goleak" |
There was a problem hiding this comment.
Use existing goleak detection that is already used in etcd testing.
There was a problem hiding this comment.
Done. This is the output:
--- FAIL: TestWriteTxnPanicWithoutApply_deadlock (1.05s)
logger.go:146: 2026-06-11T22:05:42.467+0800 INFO bbolt Opening db file (/tmp/TestWriteTxnPanicWithoutApply_deadlock4289319619/001/etcd_backend_test2452315083/database) with mode -rw------- and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, MaxSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0x1e5685686008, NoStatistics: false}
logger.go:146: 2026-06-11T22:05:42.471+0800 INFO bbolt Opening bbolt db (/tmp/TestWriteTxnPanicWithoutApply_deadlock4289319619/001/etcd_backend_test2452315083/database) successfully
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Starting a new transaction [writable: true]
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Starting a new transaction [writable: true] successfully
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Starting a new transaction [writable: false]
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Starting a new transaction [writable: false] successfully
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Creating bucket if not exist "key"
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Creating bucket if not exist "key" successfully
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Creating bucket if not exist "meta"
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Creating bucket if not exist "meta" successfully
logger.go:146: 2026-06-11T22:05:42.471+0800 DEBUG bbolt Committing transaction 2
logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG bbolt Committing transaction 2 successfully
logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG bbolt Starting a new transaction [writable: true]
logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG bbolt Starting a new transaction [writable: true] successfully
logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG bbolt Starting a new transaction [writable: false]
logger.go:146: 2026-06-11T22:05:42.476+0800 DEBUG bbolt Starting a new transaction [writable: false] successfully
logger.go:146: 2026-06-11T22:05:42.476+0800 INFO kvstore restored {"current-rev": 1}
logger.go:146: 2026-06-11T22:05:42.487+0800 DEBUG bbolt Putting key "\x00\x00\x00\x00\x00\x00\x00\x02_\x00\x00\x00\x00\x00\x00\x00\x00"
logger.go:146: 2026-06-11T22:05:42.487+0800 DEBUG bbolt Putting key "\x00\x00\x00\x00\x00\x00\x00\x02_\x00\x00\x00\x00\x00\x00\x00\x00" successfully
logger.go:146: 2026-06-11T22:05:42.487+0800 PANIC unexpected error during txn with writes {"error": "applyTxn: failed Range: rangeKeys: context cancelled: context canceled"}
leak.go:129: Test appears to have leaked :
internal/sync.runtime_SemacquireMutex(0x1e5685637bd8?, 0x5c?, 0x1e5685407f88?)
/home/zhengshi357/go1.26.4/src/runtime/sema.go:95 +0x25
internal/sync.(*Mutex).lockSlow(0x1e568562e340)
/home/zhengshi357/go1.26.4/src/internal/sync/mutex.go:149 +0x15d
internal/sync.(*Mutex).Lock(...)
/home/zhengshi357/go1.26.4/src/internal/sync/mutex.go:70
sync.(*Mutex).Lock(...)
/home/zhengshi357/go1.26.4/src/sync/mutex.go:46
go.etcd.io/etcd/server/v3/storage/backend.(*batchTx).safePending(0x1e5685407f78?)
/home/zhengshi357/CCT-projects/etcd/server/storage/backend/batch_tx.go:256 +0x4b
go.etcd.io/etcd/server/v3/storage/backend.(*backend).run(0x1e56854e6000)
/home/zhengshi357/CCT-projects/etcd/server/storage/backend/backend.go:452 +0x10d
created by go.etcd.io/etcd/server/v3/storage/backend.newBackend in goroutine 70
/home/zhengshi357/CCT-projects/etcd/server/storage/backend/backend.go:255 +0x699
FAIL
exit status 1
FAIL go.etcd.io/etcd/server/v3/etcdserver/txn 1.056s
…ct goroutine leak on panic This change adds a regression test to detect a goroutine deadlock/leak in the write transaction panic path. In `TestWriteTxnPanicWithoutApply`, a background goroutine can get stuck when a write txn panics before being properly finalized. To make this issue reproducible and prevent future regressions, I added a new test `TestWriteTxnPanicWithoutApply_deadlock`, which is identical to `TestWriteTxnPanicWithoutApply` except that it uses `goleak.VerifyNone` to detect leaked goroutines. This ensures that any failure to properly exit background goroutines will be caught by CI. Signed-off-by: zhengshi1998 <zhengshi1998@foxmail.com>
91c1a3d to
e8d7cc8
Compare
|
Ok, why do we panic at all in TXN code? Wouldn't panic mean that etcd crashes? At that point do we really care about goroutine leak? |
|
Thanks for the feedback. I think there are a few concerns here:
|
|
Ad 1 while it possible, it would be better to start from removing a panic, no? |
|
The write TXN panics because if func txn(ctx context.Context, lg *zap.Logger, txnWrite mvcc.TxnWrite, rt *pb.TxnRequest, isWrite bool, txnPath []bool, skipRangeExecution bool) (*pb.TxnResponse, error) {
txnResp, _ := newTxnResp(rt, txnPath)
_, err := executeTxn(ctx, lg, txnWrite, rt, txnPath, txnResp, skipRangeExecution)
if err != nil {
if isWrite {
// CAUTION: When a txn performing write operations starts, we always expect it to be successful.
// If a write failure is seen we SHOULD NOT try to recover the server, but crash with a panic to make the failure explicit.
// Trying to silently recover (e.g by ignoring the failed txn or calling txn.End() early) poses serious risks:
// - violation of transaction atomicity if some write operations have been partially executed
// - data inconsistency across different etcd members if they applied the txn asymmetrically
lg.Panic("unexpected error during txn with writes", zap.Error(err))
} else {
lg.Error("unexpected error during readonly txn", zap.Error(err))
}
}
...
}From the comments and the commit history, my understanding is that the panic is intentional, so I'm not sure whether replacing lg.Panic() with lg.Error() would be consistent with the existing design. Are you suggesting that context cancellation should be handled differently and not reach this panic path in the first place? |
|
I'm suggesting that we should confirm that this was the right decision. Do we treat panic as part of embedded etcd and hunt all possible goroutine leaks on panics, which will take a lot of work to get right. Or we just drop this panic. |
This change adds a regression test to detect a goroutine deadlock/leak in the write transaction panic path. See issue #21939 for the context.
In
TestWriteTxnPanicWithoutApply, a background goroutine can get stuck when a write txn panics before being properly finalized.To make this issue reproducible and prevent future regressions, I added a new test
TestWriteTxnPanicWithoutApply_deadlock, which is identical toTestWriteTxnPanicWithoutApplyexcept that it usesgoleak.VerifyNoneto detect leaked goroutines.This ensures that any failure to properly exit background goroutines will be caught by CI.
Output from goleak when running
go test -run TestWriteTxnPanicWithoutApply_deadlock: