[source-mongodb-v2] CDC unviable for a low-share, high-write collection on a busy shared oplog — change-stream COLLSCAN + shutdown deadlock. What are people doing at scale? #79656

Henry Bogardus (usbogie) · 2026-06-10T19:12:41Z


Summary

We're trying to replicate a single large, high-write MongoDB collection to Snowflake via source-mongodb-v2 CDC, and we cannot make incremental CDC keep up. Catch-up runs at ~50 events/min and falls further behind in real time, and the connector also hits a shutdown deadlock that turns otherwise-fine syncs into failures. We believe we've root-caused it to the change stream's oplog COLLSCAN, but we'd love confirmation and to hear how others run CDC on collections like this at scale.

Environment

Airbyte: self-hosted (OSS) on AWS EC2
Connector: airbyte/source-mongodb-v2:2.0.7 (Debezium 2.6.2.Final, mongodb-driver-core 4.11.0)
Destination: Snowflake
Source: MongoDB Atlas, M60 dedicated replica set, single database
Collection: ~330M documents, very high write volume. Writes are bulk inserts plus a delete-and-reinsert pattern (re-posting deletes old docs and inserts new ones with new _ids).
The collection is a small fraction of total cluster write volume (busy shared oplog).
update_capture_mode = Lookup, initial_waiting_seconds = 1200, oplog retention ~1 week.
This collection is in its own dedicated connection (isolated to avoid the global resume-token coupling described in [source-mongodb-v2] Update resume token to latest oplog position even when no new records exist #48435).

What we observe

1. Catch-up throughput is ~50 events/min and never converges.
After the initial snapshot (which is fast), CDC resumes from the snapshot-start token and is hours behind. It then advances only ~1 second of cluster-time per ~90 minutes of wall-clock, i.e. it loses ground in real time. Representative steady-state log:

CDC events queue poll(): blocked for PT2M0.2S after 100 previous call(s) ...
CDC events queue poll(): returned a change event ... "collection":"<our_collection>","ord":1013 ...
CDC events queue poll(): returned a heartbeat event: progressing to Timestamp{seconds=..., inc=1013}
CDC events queue stats: size=0, cap=10000, puts=44, polls=0

The queue is always size=0 (the Snowflake side is not the bottleneck), and the source EC2 host is ~80% idle CPU with ~0 iowait — so the connector is blocked waiting on MongoDB, not on local compute or the destination.
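
To make the "loses ground" claim concrete, here is a minimal sketch of the arithmetic. It uses only the figures quoted above (about 1 second of cluster time per roughly 90 minutes of wall-clock) and assumes the cluster keeps writing in real time; the function name is illustrative.

```python
def lag_growth_per_hour(cluster_seconds_advanced: float,
                        wall_seconds_elapsed: float) -> float:
    """Seconds of additional lag accumulated per wall-clock hour.

    Assumes the cluster keeps writing in real time, so each wall-clock
    second adds one second of oplog the stream still has to cover.
    """
    progress_ratio = cluster_seconds_advanced / wall_seconds_elapsed
    return 3600.0 * (1.0 - progress_ratio)


# Figures from this post: ~1 s of cluster time per ~90 min of wall-clock.
growth = lag_growth_per_hour(1, 90 * 60)
print(f"lag added per wall-clock hour: {growth:.1f} s")
```

At that rate the stream covers about 1/5400 of real time, so for practical purposes the lag grows by a full hour every hour.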

2. The bottleneck appears to be the change-stream oplog scan. This matches discussion #42393, where the $changeStream aggregate does a COLLSCAN of the oplog (reported there: 1.3M oplog docs scanned to return 465 events, ~1m44s, under a global read lock). Our heartbeat inc tracks the returned event ord, consistent with scanning past large numbers of other-collection oplog entries to find ours. A bigger Atlas tier wouldn't help because the cost scales with total cluster write volume, not with our collection or hardware.
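
For a sense of scale, the numbers reported in #42393 can be turned into a selectivity and scan rate. This is a rough sketch using that discussion's figures (1.3M oplog docs scanned, 465 events returned, ~1m44s), not measurements from our cluster; the function name is made up for illustration.

```python
def scan_stats(docs_scanned: int, events_returned: int, seconds: float) -> dict:
    """Derive simple ratios from one change-stream batch."""
    return {
        # Fraction of scanned oplog entries that belonged to the watched collection.
        "selectivity": events_returned / docs_scanned,
        # Oplog entries read per event actually returned.
        "scanned_per_event": docs_scanned / events_returned,
        # Raw oplog scan speed.
        "scan_rate_docs_per_s": docs_scanned / seconds,
    }


# Figures reported in #42393: 1.3M docs scanned, 465 events, ~104 s.
stats = scan_stats(1_300_000, 465, 104)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

With these inputs, roughly 2,800 unrelated oplog entries are read for every event returned, which is the pattern we suspect on our cluster as well.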

3. Shutdown deadlock → exit code 2 (matches #38705). With default socket timeout, the change-stream fetcher blocks in a getMore that can't be interrupted on engine close, so the CDK force-exits non-zero after the orphaned-thread grace period:

The main thread is exiting while children non-daemon threads ... are still active.
Active non-daemon thread: debezium-mongodbconnector-...-replicator-fetcher-0 (RUNNABLE)
  ... sun.nio.ch.Net.poll ... MongoChangeStreamCursorImpl.tryNext ... getMore
Failed to interrupt children non-daemon threads, forcefully exiting NOW...
Source process exited with non-zero exit code 2

Each attempt commits ~101 records + state, then exits 2 → Airbyte logs a partial failure and retries; 20 partial failures hit the limit and the job fails.

4. socketTimeoutMS can't thread the needle. We added socketTimeoutMS=60000 to the connection string to make the blocked socket read unwind before the force-kill — that fixed the exit-2 hang, but legitimate change-stream getMores then exceed 60s and throw:

Caused by: com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
  ... ChangeStreamOperation.execute ... AggregateOperationImpl.execute ...
The maximum number of 0 retries has been attempted

Because the connector runs Debezium with 0 retries, a single slow scan fails the whole sync. There appears to be no socketTimeoutMS value that works: scans routinely exceed ~2 minutes, but the orphaned-thread force-kill fires at ~2 minutes, so "long enough for the scan" and "short enough to unwind on shutdown" don't overlap.
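
The constraint can be written down as a small feasibility check. This is a sketch under stated assumptions: the 120 s grace period is the approximate force-kill delay we observe, the 125 s slowest getMore is an illustrative stand-in for "routinely exceeds ~2 minutes", and both function names are hypothetical.

```python
from typing import Optional, Tuple


def socket_timeout_window(slowest_getmore_s: float,
                          force_kill_grace_s: float) -> Optional[Tuple[float, float]]:
    """Open interval (low, high), in seconds, of socket timeouts that are
    longer than the slowest getMore and shorter than the force-kill grace.
    Returns None when no such value exists."""
    if slowest_getmore_s < force_kill_grace_s:
        return (slowest_getmore_s, force_kill_grace_s)
    return None


def outcome(timeout_s: float, slowest_getmore_s: float,
            force_kill_grace_s: float) -> str:
    """Which failure mode a given socket timeout runs into, if any."""
    if timeout_s <= slowest_getmore_s:
        return "read timeout on slow scan"
    if timeout_s >= force_kill_grace_s:
        return "force-kill before socket unwinds"
    return "ok"


SLOWEST_GETMORE_S = 125.0   # illustrative assumption, see lead-in
FORCE_KILL_GRACE_S = 120.0  # approximate, as observed

print(socket_timeout_window(SLOWEST_GETMORE_S, FORCE_KILL_GRACE_S))
print(outcome(60.0, SLOWEST_GETMORE_S, FORCE_KILL_GRACE_S))
print(outcome(180.0, SLOWEST_GETMORE_S, FORCE_KILL_GRACE_S))
```

A window only opens if the slowest getMore drops below the grace period, or if the grace period itself becomes configurable.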

What we've ruled out

Bigger Atlas tier — already M60; cost is O(total cluster writes), not hardware-bound.
Bigger EC2 — host is ~80% idle; it's waiting on MongoDB, not compute.
initial_waiting_seconds — already at the max (1200).
Post Image / pre-and-post-images would remove the per-event updateLookup, but not the oplog COLLSCAN, and would only apply to events written after enabling it (doesn't help the backlog).

Questions for the community

Is anyone successfully running source-mongodb-v2 CDC on a low-share, high-write collection in a busy shared oplog at hundreds of millions of docs? What throughput do you actually get, and how is your cluster/connector configured?
Is the per-batch oplog COLLSCAN expected, and is there any way to reduce it (server-side filtering, pipeline pushdown, anything that avoids scanning unrelated oplog entries)?
Is there a supported way to avoid both the exit-code-2 shutdown deadlock ([source-mongodb-v2] MongoDb source connector stops working / not stable (non-zero exit code 2) #38705) and the read-timeout-on-slow-scan? Can Debezium retries / socket timeout be tuned in this connector?
Does switching to Post Image mode materially improve catch-up throughput in practice, or does the oplog scan stay the bottleneck?
For collections where CDC simply can't keep up, what's the recommended pattern? Scheduled Full Refresh? An external cursor-based incremental? Are there plans to parallelize CDC or to re-introduce cursor-based (updatedAt) incremental for mongodb-v2?
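
One caveat on the cursor-based option, given the delete-and-reinsert pattern described under Environment: a cursor query only sees documents that still exist. The sketch below emulates an updatedAt cursor over in-memory documents; the updatedAt field on our collection, the sample _ids, and the helper names are assumptions for illustration.

```python
from datetime import datetime, timezone


def at(hour: int) -> datetime:
    """Helper: a fixed UTC timestamp on an arbitrary day."""
    return datetime(2026, 6, 10, hour, tzinfo=timezone.utc)


def incremental_pull(docs, last_cursor):
    """Emulates find({"updatedAt": {"$gt": last_cursor}}) sorted by updatedAt."""
    batch = sorted((d for d in docs if d["updatedAt"] > last_cursor),
                   key=lambda d: d["updatedAt"])
    new_cursor = batch[-1]["updatedAt"] if batch else last_cursor
    return batch, new_cursor


source = [{"_id": "a", "updatedAt": at(1)}, {"_id": "b", "updatedAt": at(2)}]
batch1, cursor = incremental_pull(source, at(0))

# Re-post: delete "a" and insert a replacement with a new _id.
source = [d for d in source if d["_id"] != "a"] + [{"_id": "c", "updatedAt": at(3)}]
batch2, cursor = incremental_pull(source, cursor)

print([d["_id"] for d in batch1])  # ['a', 'b']
print([d["_id"] for d in batch2])  # ['c']
```

The second pull returns only "c"; the deletion of "a" never appears, so the destination would keep the stale row unless deletes are reconciled some other way.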

Happy to share full logs, query profiles, or explain output. Thanks!

airbyte-support-bot · 2026-06-10T19:15:15Z

Maintainer

Hi Henry Bogardus (@usbogie), thank you for this incredibly thorough write-up — the detail on the oplog COLLSCAN behavior, the shutdown deadlock, and the socketTimeoutMS dilemma is very helpful.

We've escalated this to our engineering team for investigation: airbytehq/oncall#12848.

A few clarifying questions that would help the team dig in:

MongoDB server version — Could you confirm the exact MongoDB version on your Atlas M60 cluster? (e.g., 6.x, 7.x)
Oplog size — Do you know your current oplog size (in GB) and approximate oplog window? You mentioned ~1 week retention — is that the configured target or the observed window?
explain() output — You mentioned willingness to share explain output from the $changeStream aggregate. That would be very valuable for confirming whether server-side $match pushdown is happening (or could be applied).
Connector logs — If you can share the full connector log from a representative CDC run (with any sensitive connection details masked), that would help the team reproduce the timing characteristics.

Please ensure you mask or remove any sensitive information (API keys, passwords, tokens, connection strings) before sharing logs.

Interim workaround notes

While the team investigates, a couple of things to consider:

Scheduled Full Refresh may be more practical for this collection in the short term if CDC throughput cannot keep up. This avoids both the oplog scan bottleneck and the shutdown deadlock entirely.
Post Image mode (changeStreamFullDocument: updateLookup → fullDocument: whenAvailable with pre/post images enabled) would eliminate the per-event updateLookup round-trip but, as you noted, would not address the underlying oplog COLLSCAN. It may still provide some throughput improvement for update-heavy workloads by removing the lookup latency per event.
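
For reference, pre- and post-images are enabled per collection with the collMod command (MongoDB 6.0 and later). The following is a minimal sketch that only builds the command document; the collection and database names are placeholders, and the helper function is hypothetical rather than part of the connector.

```python
def pre_post_images_command(collection: str, enabled: bool = True) -> dict:
    """Build the collMod command that toggles change-stream pre/post images.

    The command name must be the first key; Python dicts preserve
    insertion order, so the literal below is safe to send as-is.
    """
    return {
        "collMod": collection,
        "changeStreamPreAndPostImages": {"enabled": enabled},
    }


cmd = pre_post_images_command("my_collection")  # placeholder name
print(cmd)

# With a driver such as pymongo (not imported here), this would be sent as:
#   client["my_db"].command(cmd)   # "my_db" is a placeholder
```

As noted in the original post, images are only recorded for writes made after the setting is enabled, so this does not help with an existing backlog.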

We'll provide updates on the oncall issue as the investigation progresses. In the meantime, feel free to share any additional diagnostics there.

Need more help? Join Airbyte Community Slack for peer support, or if you're a Cloud customer, open a support ticket referencing this URL.

2 replies

Henry Bogardus (usbogie) Jun 10, 2026
Author

MongoDB version: 8.0.23
Oplog size: ~1.32 TB (usedMB ≈ 1,348,097), auto-grown by Atlas from a 990 MB floor. Retention window: ~136 h (~5.7 days). Oplog churn: ~9.7 GB/hour (~230 GB/day) cluster-wide.
explain is rejected on a change-stream aggregate. Here is the queryPlanner plan for the equivalent oplog read the stream performs:
{ "stage": "COLLSCAN", "filter": { "ns": { "$eq": "." } }, "direction": "forward" }
not enabled on the collection
attached
mongodb_ledgerlineitems_to_snowflake_logs_3553_txt (1).txt
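
As a cross-check, the oplog figures above are mutually consistent, and they give a rough idea of how much oplog a lagging stream has to read through. This is a sketch only: the size and churn come from the numbers in this reply, while the 6-hour lag is an illustrative assumption and the function names are made up.

```python
def oplog_window_hours(oplog_size_gb: float, churn_gb_per_hour: float) -> float:
    """Retention window implied by oplog size and write churn."""
    return oplog_size_gb / churn_gb_per_hour


def backlog_gb(hours_behind: float, churn_gb_per_hour: float) -> float:
    """Approximate oplog volume between the resume point and now."""
    return hours_behind * churn_gb_per_hour


OPLOG_SIZE_GB = 1320.0     # ~1.32 TB, from this reply
CHURN_GB_PER_HOUR = 9.7    # cluster-wide, from this reply

window = oplog_window_hours(OPLOG_SIZE_GB, CHURN_GB_PER_HOUR)
print(f"implied window: {window:.0f} h ({window / 24:.1f} days)")

# Illustrative assumption: the stream is 6 hours behind.
print(f"oplog to scan when 6 h behind: {backlog_gb(6, CHURN_GB_PER_HOUR):.1f} GB")
```

The implied window matches the observed ~136 h, and the backlog figure covers all collections on the cluster, not only ours.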

Henry Bogardus (usbogie) Jun 10, 2026
Author

airbyte-support-bot

Henry Bogardus (usbogie) · 2026-06-16T19:11:57Z

Author

airbyte-support-bot

0 replies
