
Fix remote initiator host name#1550

Merged
zvonand merged 3 commits into antalya-25.8 from bugfix/antalya-25.8/fix_remote_initiator
Mar 20, 2026

Conversation

@ianton-ru

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Solved #1549
Fix remote initiator host name

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Mar 19, 2026

Workflow [PR], commit [4b998c4]

@ianton-ru
Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!


mkmkme previously approved these changes Mar 19, 2026
@ianton-ru
Author

Integration tests failed; the bug is in the branch, not in this PR. See #1554.

@alsugiliazova
Member

Audit Report: PR #1550 — Fix remote initiator host name

Confirmed defects

No confirmed defects in reviewed scope.

Not confirmed (low priority)

  • unescapeForFileName vs Poco::URI::decode: Cluster::Address::fromString uses unescapeForFileName, while this PR uses Poco::URI::decode. The two are functionally equivalent for valid escapeForFileName output, so this is a consistency preference, not a defect.

@alsugiliazova
Member

PR #1550 CI Verification Report

CI Results: 51 success, 2 failures, 47 skipped

Check-runs: 164 total

  1. Integration tests (amd_asan, flaky check) — error: 55 of 115 test runs failed — related to PR: yes (crash during shutdown, see analysis below)
  2. GrypeScanServer (-alpine) / Grype Scan — error: CVE scan — related to PR: no (infrastructure)

Flaky Check Crash Analysis — linked to #1554

The flaky check ran test_s3_cluster/test.py 5 times. Results: 60 passed, 55 failed. All 55 failures are cascade failures from a single crash, not independent test issues.

Root cause: test_graceful_shutdown[1-5]

test_graceful_shutdown is the test that triggers the crash. It spawns 10 threads running s3Cluster(cluster_simple, ...) queries in a loop, then explicitly calls stop_clickhouse() on a node while queries are in-flight.

During node shutdown, a heap-use-after-free is triggered:

ERROR: AddressSanitizer: heap-use-after-free on address 0x5040020a6df0
READ of size 8 at 0x5040020a6df0 thread T773 (QueryPipelineEx)
    #0 StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica(unsigned long)
       StorageObjectStorageStableTaskDistributor.cpp:312:28
    #1 RemoteQueryExecutor::processPacket(Packet) RemoteQueryExecutor.cpp:739:47
    #2 RemoteQueryExecutor::readAsync() RemoteQueryExecutor.cpp:595:28
    #3 RemoteSource::tryGenerate() RemoteSource.cpp:189:36

The StorageObjectStorageStableTaskDistributor object is freed during shutdown while RemoteQueryExecutor still holds a reference and tries to call rescheduleTasksFromReplica(). ASAN catches the use-after-free and kills the ClickHouse process. The test framework then detects "Sanitizer assert found for instance s0_0_0", and all subsequent tests fail instantly because the node is dead.

The actual code bug (StorageObjectStorageStableTaskDistributor.cpp:311-312)

replica_to_files_to_be_processed.erase(number_of_current_replica);  // line 311: invalidates iterator
for (const auto & file : processed_file_list_ptr->second)            // line 312: USE-AFTER-FREE

processed_file_list_ptr is obtained from replica_to_files_to_be_processed.find() on line 298. Line 311 erases that entry from the unordered_map, which invalidates the iterator. Line 312 then dereferences the dangling iterator — a deterministic use-after-free every time this function is called.

The fix is trivial — move the values before erasing:

auto files = std::move(processed_file_list_ptr->second);
replica_to_files_to_be_processed.erase(number_of_current_replica);
for (const auto & file : files) { ... }

Why test_graceful_shutdown triggers it

test_graceful_shutdown spawns 10 threads running s3Cluster('cluster_simple', ...) queries, then calls stop_clickhouse(kill=True) on node s0_1_0. When the node dies while queries are in-flight, RemoteQueryExecutor::processPacket() receives Protocol::Server::ConnectionLost, checks task_iterator->supportRerunTask() (returns true for StorageObjectStorageStableTaskDistributor), and calls rescheduleTasksFromReplica() — triggering the use-after-free.

The crash is timing-dependent: it only fires if a query is still waiting on the dying replica at the moment of shutdown. test_graceful_shutdown passed in runs [3-5] and [2-5] (47s each, queries finished before kill) but crashed on run [1-5] (18.31s, a query was mid-flight when the kill landed).

Why this crash appeared only in PR #1550

The bug is NOT caused by PR #1550's code changes. It is a pre-existing bug in StorageObjectStorageStableTaskDistributor, introduced on 2025-11-21 when the stable task distributor was added to the codebase.

PR #1550 exposed it because:

  1. PR #1550 (Fix remote initiator host name) modified test_s3_cluster/test.py (added test_object_storage_remote_initiator)
  2. The CI flaky check runs ALL tests in modified files 5 times under ASAN
  3. test_graceful_shutdown was included in this 5× stress run — but the last time test_s3_cluster/test.py was modified was 2025-11-26 (before the StableTaskDistributor existed in its current form)
  4. Running test_graceful_shutdown 5× under ASAN multiplied the probability of hitting the timing window where a ConnectionLost arrives during an in-flight query, triggering rescheduleTasksFromReplica() and the ASAN-detected use-after-free

In regular integration tests (single run, no ASAN), the bug is invisible: the freed memory often still contains valid data, and the timing window is hit less frequently.

Key observations

  • 55 failures = 1 real crash + 54 cascade: After the ASAN crash killed s0_0_0, every subsequent test failed instantly (~0.3s) with "Sanitizer assert found for instance"
  • PR's own test is not the trigger: test_object_storage_remote_initiator passed all 3 runs before the crash; its 2 remaining runs failed only as part of the cascade
  • Bug filed as #1554

Skipped Suites (47)

Excluded by PR author: all TSAN, MSAN, UBSAN, Coverage, and Regression suites.

Conclusion

PR #1550's URL-decoding fix in IStorageCluster.cpp is correct, and its own test test_object_storage_remote_initiator passed all runs before the crash. All 15 regular integration test shards, 14 stateless test suites, builds, unit tests, stress test, and BuzzHouse passed.

The flaky check failure is caused by a pre-existing use-after-free bug in StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica() (#1554). Lines 311-312 erase a map entry and then iterate over the invalidated iterator — a deterministic use-after-free that ASAN catches. It was triggered by test_graceful_shutdown killing a node while s3Cluster queries were in-flight, causing RemoteQueryExecutor to receive ConnectionLost and call rescheduleTasksFromReplica().

This bug was introduced on 2025-11-21 with the StorageObjectStorageStableTaskDistributor feature, but was never caught because no one modified test_s3_cluster/test.py since 2025-11-26 — meaning the flaky check (5× ASAN runs) was never triggered on this file until PR #1550. The 55 apparent test failures are 1 real crash + 54 cascade failures. PR #1550's code changes are not the cause.

@alsugiliazova alsugiliazova added the verified-with-issue label (Verified by QA and issue(s) found) Mar 19, 2026
@alsugiliazova
Member

Re-run Results (Run #23314369179)

Two new commits were added before the re-run:

  • 40d9e2b — "Temporary disable test_graceful_shutdown" — renamed test_graceful_shutdown to _test_graceful_shutdown so pytest skips it
  • 4b998c4 — Merge branch antalya-25.8 into the PR branch

Results: 50 success, 3 failures, 42 skipped.

Integration tests (amd_asan, flaky check) — PASSED

The flaky check passed because test_graceful_shutdown was disabled by the developer (renamed to _test_graceful_shutdown). This eliminates the trigger for the #1554 use-after-free crash. The PR's own test test_object_storage_remote_initiator passed all runs.

Failures (3 job failures + 1 test failure inside a passing job) — none related to PR

  1. Stateless tests (amd_asan, distributed plan, parallel, 2/2) — failed test: 02887_mutations_subcolumns, ALTER TABLE ... DELETE WHERE obj.k2 = 1 hung (UNFINISHED after 3.65s) — related to PR: no (unrelated mutation/subcolumn test)
  2. Stateless tests (amd_binary, ParallelReplicas, s3 storage, sequential) — failed test: 03377_object_storage_list_objects_cache, S3 error "No such key" — related to PR: no (unrelated S3 cache test)
  3. GrypeScanKeeper / Grype Scan — CVE scan — related to PR: no (infrastructure)
  4. GrypeScanServer (-alpine) / Grype Scan — CVE scan — related to PR: no (infrastructure)

Note: 02887_mutations_subcolumns failed within a job that GitHub Actions marked as success (the CI framework allowed the failure). The CI run report at S3 correctly lists it as a FAIL.

All other checks (Fuzzer, BuzzHouse, Docker images, compatibility checks, Finish Workflow, FinishCIReport) passed.

Conclusion

PR #1550 is verified.

@alsugiliazova alsugiliazova added the verified label (Verified by QA) Mar 20, 2026
@zvonand zvonand merged commit 145bc9d into antalya-25.8 Mar 20, 2026
208 of 216 checks passed
