Under load, pgdog v0.1.34 (tokio 1.45.0) crashes with SIGSEGV (exit 139) from corruption of tokio's timer-wheel intrusive linked list, triggered when the per-query timeout Sleep is dropped in query_engine::query::execute.
Observed in production: 5 crashes across a 27-replica deployment over ~18 days (all clustered in one ~10h window), each replica auto-restarting and recovering.
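For triage, the exit status can be decoded with the usual shell/container convention that a process killed by signal N is reported as 128 + N. The sketch below assumes that convention and Linux signal numbering; the function names are hypothetical helpers, not part of pgdog or tokio.

```rust
/// Conventional shell/container encoding: a process killed by signal N
/// is reported with exit status 128 + N. Returns the signal number, if any.
fn signal_from_exit_status(status: i32) -> Option<i32> {
    if status > 128 && status < 256 {
        Some(status - 128)
    } else {
        None
    }
}

/// Names for the two signals relevant to this report (Linux numbering).
fn signal_name(signo: i32) -> &'static str {
    match signo {
        6 => "SIGABRT",
        11 => "SIGSEGV",
        _ => "other",
    }
}

fn main() {
    // 139 is what the replicas reported; 134 is what a plain abort would give.
    for status in [134, 139] {
        match signal_from_exit_status(status) {
            Some(signo) => println!("exit {status}: signal {signo} ({})", signal_name(signo)),
            None => println!("exit {status}: not a signal exit"),
        }
    }
}
```

Under this convention, exit 139 corresponds to signal 11 (SIGSEGV) and exit 134 to signal 6 (SIGABRT).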
Backtrace
thread 'tokio-runtime-worker' panicked at tokio-1.45.0/src/util/linked_list.rs:123:9:
assertion `left != right` failed
left: Some(0x7fe679353a98)
right: Some(0x7fe679353a98)
thread 'tokio-runtime-worker' panicked at tokio-1.45.0/src/util/linked_list.rs:186:9:
assertion failed: self.tail.is_none()
stack backtrace:
9: tokio::runtime::time::wheel::Wheel::remove
10: <tokio::runtime::time::entry::TimerEntry as core::ops::drop::Drop>::drop
11: core::ptr::drop_in_place<tokio::time::sleep::Sleep>
12: pgdog::frontend::client::query_engine::query::QueryEngine::execute::{{closure}}
13: pgdog::frontend::client::query_engine::QueryEngine::handle::{{closure}}
14: pgdog::frontend::client::Client::spawn_internal::{{closure}}
15: pgdog::frontend::listener::Listener::handle_client::{{closure}}
16: pgdog::frontend::listener::Listener::listen::{{closure}}::{{closure}}
...
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
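The last two lines of the backtrace reflect a general Rust rule: a panic that escapes a destructor while the thread is already unwinding cannot itself unwind, so the process aborts. The following std-only sketch illustrates that rule with a hypothetical `TimerGuard` type whose `Drop` checks `std::thread::panicking()` and records the problem instead of panicking a second time. It is an illustration of the language behavior only, not a description of tokio's `TimerEntry` and not a proposed fix.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

/// Set when the destructor found a problem but chose not to panic again.
static SKIPPED_CHECK: AtomicBool = AtomicBool::new(false);

/// Hypothetical guard standing in for a value with an invariant check in Drop.
struct TimerGuard {
    consistent: bool,
}

impl Drop for TimerGuard {
    fn drop(&mut self) {
        if self.consistent {
            return;
        }
        if std::thread::panicking() {
            // Panicking here would be a second panic during unwinding,
            // which aborts the process. Record the problem instead.
            SKIPPED_CHECK.store(true, Ordering::SeqCst);
        } else {
            panic!("timer state is inconsistent");
        }
    }
}

/// Panics while the guard is alive, so the guard is dropped during unwinding.
fn run_query_with_guard() {
    let _guard = TimerGuard { consistent: false };
    panic!("first panic while the guard is alive");
}

fn main() {
    let result = panic::catch_unwind(run_query_with_guard);
    println!(
        "unwound without abort: {}, check skipped in Drop: {}",
        result.is_err(),
        SKIPPED_CHECK.load(Ordering::SeqCst)
    );
}
```

If the `std::thread::panicking()` branch is removed so that the destructor panics unconditionally, the same program aborts instead of returning from `catch_unwind`.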
The timer-wheel intrusive list ends up with a node linked to itself (left == right). Dropping the Sleep runs TimerEntry::drop → Wheel::remove, which trips the assertion. Because that panic happens inside a Drop while the thread is already unwinding, Rust reports "panic in a destructor during cleanup" and aborts without unwinding. On replicas where the corrupted list is dereferenced before the assert, the process segfaults directly, hence exit 139 (SIGSEGV) rather than 134 (SIGABRT).
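To make the "node linked to itself" state concrete, here is a minimal sketch of a doubly linked list that links nodes by index instead of by raw pointer. It is a simplified stand-in written for this report, not tokio's `linked_list.rs`; the type and method names are hypothetical. It shows that pushing a node that is already the head leaves its `next` and `prev` pointing at itself, and how a head-equality guard (the kind of check a `left != right` assertion expresses) rejects that case.

```rust
/// Per-node links, stored by index so no `unsafe` is needed in this sketch.
#[derive(Default, Clone, Copy, Debug)]
struct Links {
    prev: Option<usize>,
    next: Option<usize>,
}

/// Index-based stand-in for an intrusive doubly linked list.
#[derive(Debug)]
struct List {
    head: Option<usize>,
    tail: Option<usize>,
    nodes: Vec<Links>,
}

impl List {
    fn with_nodes(n: usize) -> Self {
        List { head: None, tail: None, nodes: vec![Links::default(); n] }
    }

    /// Push with no guard: models an insert of a node that is already linked.
    fn push_front_unchecked(&mut self, id: usize) {
        self.nodes[id].next = self.head;
        self.nodes[id].prev = None;
        if let Some(old_head) = self.head {
            self.nodes[old_head].prev = Some(id);
        }
        self.head = Some(id);
        if self.tail.is_none() {
            self.tail = Some(id);
        }
    }

    /// Guarded push: refuses a node that is already the head.
    fn push_front(&mut self, id: usize) -> Result<(), String> {
        if self.head == Some(id) {
            return Err(format!("node {id} is already the list head"));
        }
        self.push_front_unchecked(id);
        Ok(())
    }

    /// True if the node points at itself in either direction.
    fn is_self_linked(&self, id: usize) -> bool {
        self.nodes[id].next == Some(id) || self.nodes[id].prev == Some(id)
    }
}

fn main() {
    let mut guarded = List::with_nodes(2);
    guarded.push_front(0).unwrap();
    println!("guarded second push: {:?}", guarded.push_front(0));

    let mut unguarded = List::with_nodes(2);
    unguarded.push_front_unchecked(0);
    unguarded.push_front_unchecked(0);
    println!("self-linked after double push: {}", unguarded.is_self_linked(0));
}
```

Once a node is self-linked, any traversal or unlink that follows `next` from that node never reaches the end of the list. With raw pointers instead of indices, the same state can also lead to reads of freed memory.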
Environment
pgdog v0.1.34 (ghcr.io/pgdogdev/pgdog:v0.1.34), tokio 1.45.0
Multi-threaded tokio runtime, ~46 concurrent client connections per replica, simple steady query traffic to a single Postgres (RDS) backend
Per-query/statement timeout configured (the dropped Sleep in query_engine::execute)
Notes / questions
The crash signature points at the per-query timeout Sleep lifecycle in query_engine::execute. Both "Rewrite engine 3.0" (#676) and "apply query_timeout to entire client/server exchange" (#755) substantially reworked this path after v0.1.34. Does either knowingly fix a timer-handling soundness bug here?
Is there any path that drops/polls the query-timeout Sleep across runtime threads, or any unsafe near the timeout handling that could corrupt the timer entry?
We're upgrading to v0.1.43 (tokio 1.52.3) and will report back whether it recurs. Filing so the pre-rewrite behavior is documented.