SIGSEGV / tokio timer-wheel corruption when dropping per-query timeout Sleep (query_engine::execute, v0.1.34 / tokio 1.45.0) #1052

## Summary

Under load, pgdog v0.1.34 (tokio 1.45.0) crashes with **SIGSEGV (exit 139)** from corruption of tokio's timer-wheel intrusive linked list, triggered when the per-query timeout `Sleep` is dropped in `QueryEngine::execute` (module `query_engine::query`).

Observed in production: 5 crashes across a 27-replica deployment over ~18 days, all clustered in one ~10h window. Each affected replica restarted automatically and recovered.

## Backtrace

```
thread 'tokio-runtime-worker' panicked at tokio-1.45.0/src/util/linked_list.rs:123:9:
assertion `left != right` failed
  left:  Some(0x7fe679353a98)
 right:  Some(0x7fe679353a98)

thread 'tokio-runtime-worker' panicked at tokio-1.45.0/src/util/linked_list.rs:186:9:
assertion failed: self.tail.is_none()
stack backtrace:
   9: tokio::runtime::time::wheel::Wheel::remove
  10: <tokio::runtime::time::entry::TimerEntry as core::ops::drop::Drop>::drop
  11: core::ptr::drop_in_place<tokio::time::sleep::Sleep>
  12: pgdog::frontend::client::query_engine::query::QueryEngine::execute::{{closure}}
  13: pgdog::frontend::client::query_engine::QueryEngine::handle::{{closure}}
  14: pgdog::frontend::client::Client::spawn_internal::{{closure}}
  15: pgdog::frontend::listener::Listener::handle_client::{{closure}}
  16: pgdog::frontend::listener::Listener::listen::{{closure}}::{{closure}}
...
panic in a destructor during cleanup
thread caused non-unwinding panic. aborting.
```

In the first assertion both sides hold the same pointer (`0x7fe679353a98`), which suggests the timer-wheel intrusive list ended up with a node linked to itself. Dropping the `Sleep` runs `TimerEntry::drop` → `Wheel::remove`, which trips an assertion; because that panic fires inside a `Drop` while cleanup is already in progress ("panic in a destructor during cleanup"), it cannot unwind and the process aborts. On replicas where the corrupted list is dereferenced before an assertion catches it, the process segfaults directly, hence exit 139 (SIGSEGV) rather than 134 (SIGABRT).
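To illustrate what a self-linked node means, here is a toy model. It is not tokio's code: `ToyList`, `Links`, and `push_front_unchecked` are hypothetical names, and the list uses indices instead of raw pointers. It only sketches how inserting a node that is already the list head leaves that node pointing at itself, and how an `assert_ne!` guard would report this with the same `left != right` message seen in the log. Whether this is the actual failure path in v0.1.34 is an assumption.

```rust
/// Toy index-based doubly linked list. Illustrative only, NOT tokio's implementation.
#[derive(Default, Clone, Copy)]
struct Links {
    prev: Option<usize>,
    next: Option<usize>,
}

struct ToyList {
    head: Option<usize>,
    tail: Option<usize>,
    nodes: Vec<Links>,
}

impl ToyList {
    fn new(capacity: usize) -> Self {
        ToyList {
            head: None,
            tail: None,
            nodes: vec![Links::default(); capacity],
        }
    }

    /// Links `idx` at the front without checking whether it is already linked.
    fn push_front_unchecked(&mut self, idx: usize) {
        self.nodes[idx].next = self.head;
        self.nodes[idx].prev = None;
        if let Some(old_head) = self.head {
            self.nodes[old_head].prev = Some(idx);
        }
        self.head = Some(idx);
        if self.tail.is_none() {
            self.tail = Some(idx);
        }
    }

    /// Same operation, but refuses to re-insert the node that is already the head.
    fn push_front(&mut self, idx: usize) {
        assert_ne!(self.head, Some(idx), "node is already the list head");
        self.push_front_unchecked(idx);
    }

    /// True if the node's own index appears in its `next` or `prev` link.
    fn is_self_linked(&self, idx: usize) -> bool {
        self.nodes[idx].next == Some(idx) || self.nodes[idx].prev == Some(idx)
    }
}

fn main() {
    let mut list = ToyList::new(2);
    list.push_front(0);
    assert!(!list.is_self_linked(0));

    // Inserting the current head a second time, with the guard bypassed,
    // makes node 0 its own `next` and `prev`.
    list.push_front_unchecked(0);
    assert!(list.is_self_linked(0));
    println!("self-linked after double insert: {}", list.is_self_linked(0));
}
```

In this model, calling the guarded `push_front(0)` a second time would panic instead of corrupting the list, which is the kind of check the first assertion in the backtrace seems to be performing.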

## Environment

- pgdog `v0.1.34` (`ghcr.io/pgdogdev/pgdog:v0.1.34`), tokio `1.45.0`
- Multi-threaded tokio runtime, ~46 concurrent client connections per replica, simple steady query traffic to a single Postgres (RDS) backend
- Per-query/statement timeout configured (the dropped `Sleep` in `query_engine::execute`)

## Notes / questions

- The crash signature points at the per-query timeout `Sleep` lifecycle in `query_engine::execute`. Both "Rewrite engine 3.0" (#676) and #755 ("apply query_timeout to entire client/server exchange") substantially reworked this path after v0.1.34. Does either knowingly fix a timer-handling soundness bug here?
- Is there any path that drops/polls the query-timeout `Sleep` across runtime threads, or any `unsafe` near the timeout handling that could corrupt the timer entry?

We're upgrading to v0.1.43 (tokio 1.52.3) and will report back whether it recurs. Filing so the pre-rewrite behavior is documented.

