
Fix remote initiator host name#1550

Merged
zvonand merged 3 commits into antalya-25.8 from bugfix/antalya-25.8/fix_remote_initiator
Mar 20, 2026

Conversation

@ianton-ru

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Solved #1549
Fix remote initiator host name

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Mar 19, 2026

Workflow [PR], commit [4b998c4]

@ianton-ru
Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Can't wait for the next one!


mkmkme previously approved these changes Mar 19, 2026
@ianton-ru
Author

Integration tests failed; the bug is in the branch, not in this PR. See #1554.

@alsugiliazova
Member

Audit Report: PR #1550 — Fix remote initiator host name

Confirmed defects

No confirmed defects in reviewed scope.

Not confirmed (low priority)

  • unescapeForFileName vs Poco::URI::decode: Cluster::Address::fromString uses unescapeForFileName, while this PR uses Poco::URI::decode. The two are functionally equivalent for valid escapeForFileName output, so this is a consistency preference, not a defect.

@alsugiliazova
Member

PR #1550 CI Verification Report

CI Results: 51 success, 2 failures, 47 skipped

Check-runs: 164 total

  1. Integration tests (amd_asan, flaky check) — error: 55 of 115 test runs failed — related to PR: yes (crash during shutdown, see analysis below)
  2. GrypeScanServer (-alpine) / Grype Scan — error: CVE scan — related to PR: no (infrastructure)

Flaky Check Crash Analysis — linked to #1554

The flaky check ran test_s3_cluster/test.py 5 times. Results: 60 passed, 55 failed. All 55 failures are cascade failures from a single crash, not independent test issues.

Root cause: test_graceful_shutdown[1-5]

test_graceful_shutdown is the test that triggers the crash. It spawns 10 threads running s3Cluster(cluster_simple, ...) queries in a loop, then explicitly calls stop_clickhouse() on a node while queries are in-flight.

During node shutdown, a heap-use-after-free is triggered:

ERROR: AddressSanitizer: heap-use-after-free on address 0x5040020a6df0
READ of size 8 at 0x5040020a6df0 thread T773 (QueryPipelineEx)
    #0 StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica(unsigned long)
       StorageObjectStorageStableTaskDistributor.cpp:312:28
    #1 RemoteQueryExecutor::processPacket(Packet) RemoteQueryExecutor.cpp:739:47
    #2 RemoteQueryExecutor::readAsync() RemoteQueryExecutor.cpp:595:28
    #3 RemoteSource::tryGenerate() RemoteSource.cpp:189:36

The StorageObjectStorageStableTaskDistributor object is freed during shutdown while RemoteQueryExecutor still holds a reference and tries to call rescheduleTasksFromReplica(). ASAN catches the use-after-free and kills the ClickHouse process. The test framework then detects "Sanitizer assert found for instance s0_0_0", and all subsequent tests fail instantly because the node is dead.

The actual code bug (StorageObjectStorageStableTaskDistributor.cpp:311-312)

replica_to_files_to_be_processed.erase(number_of_current_replica);  // line 311: invalidates iterator
for (const auto & file : processed_file_list_ptr->second)            // line 312: USE-AFTER-FREE

processed_file_list_ptr is obtained from replica_to_files_to_be_processed.find() on line 298. Line 311 erases that entry from the unordered_map, which invalidates the iterator. Line 312 then dereferences the dangling iterator — a deterministic use-after-free every time this function is called.

The fix is trivial — move the values before erasing:

auto files = std::move(processed_file_list_ptr->second);
replica_to_files_to_be_processed.erase(number_of_current_replica);
for (const auto & file : files) { ... }

Why test_graceful_shutdown triggers it

test_graceful_shutdown spawns 10 threads running s3Cluster('cluster_simple', ...) queries, then calls stop_clickhouse(kill=True) on node s0_1_0. When the node dies while queries are in-flight, RemoteQueryExecutor::processPacket() receives Protocol::Server::ConnectionLost, checks task_iterator->supportRerunTask() (returns true for StorageObjectStorageStableTaskDistributor), and calls rescheduleTasksFromReplica() — triggering the use-after-free.

The crash is timing-dependent: it only fires if a query is still waiting on the dying replica at the moment of shutdown. test_graceful_shutdown passed in runs [3-5] and [2-5] (47s each, queries finished before kill) but crashed on run [1-5] (18.31s, a query was mid-flight when the kill landed).

Why this crash appeared only in PR #1550

The bug is NOT caused by PR #1550's code changes. It is a pre-existing bug in StorageObjectStorageStableTaskDistributor, introduced on 2025-11-21 when the stable task distributor was added to the codebase.

PR #1550 exposed it because:

  1. PR #1550 (Fix remote initiator host name) modified test_s3_cluster/test.py (added test_object_storage_remote_initiator)
  2. The CI flaky check runs ALL tests in modified files 5 times under ASAN
  3. test_graceful_shutdown was included in this 5× stress run — but the last time test_s3_cluster/test.py was modified was 2025-11-26 (before the StableTaskDistributor existed in its current form)
  4. Running test_graceful_shutdown 5× under ASAN multiplied the probability of hitting the timing window where a ConnectionLost arrives during an in-flight query, triggering rescheduleTasksFromReplica() and the ASAN-detected use-after-free

In regular integration tests (single run, no ASAN), the bug is invisible: the freed memory often still contains valid data, and the timing window is hit less frequently.

Key observations

  • 55 failures = 1 real crash + 54 cascade: After the ASAN crash killed s0_0_0, every subsequent test failed instantly (~0.3s) with "Sanitizer assert found for instance"
  • PR's own test is not the trigger: test_object_storage_remote_initiator passed all 3 runs before the crash; its 2 remaining runs failed only as part of the cascade
  • Bug filed as #1554

Skipped Suites (47)

Excluded by PR author: all TSAN, MSAN, UBSAN, Coverage, and Regression suites.

Conclusion

PR #1550's URL-decoding fix in IStorageCluster.cpp is correct, and its own test test_object_storage_remote_initiator passed all runs before the crash. All 15 regular integration test shards, 14 stateless test suites, builds, unit tests, stress test, and BuzzHouse passed.

The flaky check failure is caused by a pre-existing use-after-free bug in StorageObjectStorageStableTaskDistributor::rescheduleTasksFromReplica() (#1554). Lines 311-312 erase a map entry and then iterate over the invalidated iterator — a deterministic use-after-free that ASAN catches. It was triggered by test_graceful_shutdown killing a node while s3Cluster queries were in-flight, causing RemoteQueryExecutor to receive ConnectionLost and call rescheduleTasksFromReplica().

This bug was introduced on 2025-11-21 with the StorageObjectStorageStableTaskDistributor feature, but was never caught because no one modified test_s3_cluster/test.py since 2025-11-26 — meaning the flaky check (5× ASAN runs) was never triggered on this file until PR #1550. The 55 apparent test failures are 1 real crash + 54 cascade failures. PR #1550's code changes are not the cause.

@alsugiliazova alsugiliazova added the verified-with-issue label (Verified by QA and issue(s) found) Mar 19, 2026
@alsugiliazova
Member

Re-run Results (Run #23314369179)

Two new commits were added before the re-run:

  • 40d9e2b — "Temporary disable test_graceful_shutdown" — renamed test_graceful_shutdown to _test_graceful_shutdown so pytest skips it
  • 4b998c4 — Merge branch antalya-25.8 into the PR branch

Results: 50 success, 3 failures, 42 skipped.

Integration tests (amd_asan, flaky check) — PASSED

The flaky check passed because test_graceful_shutdown was disabled by the developer (renamed to _test_graceful_shutdown). This eliminates the trigger for the #1554 use-after-free crash. The PR's own test test_object_storage_remote_initiator passed all runs.

Failures (3 job failures + 1 test failure inside a passing job) — none related to PR

  1. Stateless tests (amd_asan, distributed plan, parallel, 2/2) — failed test: 02887_mutations_subcolumns, ALTER TABLE ... DELETE WHERE obj.k2 = 1 hung (UNFINISHED after 3.65s) — related to PR: no (unrelated mutation/subcolumn test)
  2. Stateless tests (amd_binary, ParallelReplicas, s3 storage, sequential) — failed test: 03377_object_storage_list_objects_cache, S3 error "No such key" — related to PR: no (unrelated S3 cache test)
  3. GrypeScanKeeper / Grype Scan — CVE scan — related to PR: no (infrastructure)
  4. GrypeScanServer (-alpine) / Grype Scan — CVE scan — related to PR: no (infrastructure)

Note: 02887_mutations_subcolumns failed within a job that GitHub Actions marked as success (the CI framework allowed the failure). The CI run report at S3 correctly lists it as a FAIL.

All other checks (Fuzzer, BuzzHouse, Docker images, compatibility checks, Finish Workflow, FinishCIReport) passed.

Conclusion

PR #1550 is verified.

@alsugiliazova alsugiliazova added the verified label (Verified by QA) Mar 20, 2026
@zvonand zvonand merged commit 145bc9d into antalya-25.8 Mar 20, 2026
208 of 216 checks passed
