feat: clickhouse_cluster destination #3608

Open
jorritsandbrink wants to merge 63 commits into devel from feat/2200-add-clickhouse-cluster-support

@jorritsandbrink
Collaborator

Description

This PR adds a new destination clickhouse_cluster that:

  • builds on top of clickhouse destination
  • adapts generated SQL to work well in cluster setups
  • extends functionality to support creation of additional distributed tables
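
As a rough illustration of the last bullet: in ClickHouse, a distributed table is usually a table with the `Distributed` engine layered over a base table that exists on every shard. The helper below is a hypothetical sketch, not code from this PR. The function name, the `_dist` suffix, and the `rand()` sharding key are all assumptions made for the example.

```python
def distributed_table_ddl(
    cluster: str,
    database: str,
    base_table: str,
    dist_suffix: str = "_dist",   # assumed naming convention, not taken from the PR
    sharding_key: str = "rand()",  # assumed sharding key
) -> str:
    """Build a CREATE TABLE statement for a Distributed table over `base_table`."""
    dist_table = f"{base_table}{dist_suffix}"
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{dist_table} "
        f"ON CLUSTER {cluster} "
        # Copy the column definitions from the base table.
        f"AS {database}.{base_table} "
        f"ENGINE = Distributed('{cluster}', '{database}', '{base_table}', {sharding_key})"
    )


print(distributed_table_ddl("my_cluster", "analytics", "events"))
```

The `ON CLUSTER` clause asks ClickHouse to run the DDL on every node of the named cluster, which is the usual way to keep the base and distributed tables in sync across shards.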

Related Issues

Additional Context

Related PR: #2573

@jorritsandbrink jorritsandbrink self-assigned this Feb 5, 2026
@jorritsandbrink jorritsandbrink added the ci full Use to trigger CI on a PR for full load tests label Feb 5, 2026
cloudflare-workers-and-pages Bot commented Feb 5, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

Status | Name | Latest Commit | Preview URL | Updated (UTC)
✅ Deployment successful! (View logs) | docs | 1691eac | Commit Preview URL, Branch Preview URL | Feb 06 2026, 07:31 AM

    # we overwrite with the same row. merge falls back to replace when no keys specified
    assert len(db_rows) == 1
else:
    # NOTE: on second load, number of records in table "t" is zero in case of merge on
Collaborator Author

This bug got solved by adding an insert_deduplication_token to INSERT INTO statements.
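
For context, ClickHouse accepts `insert_deduplication_token` in the `SETTINGS` clause of an `INSERT`, and a distinct token per insert keeps an otherwise identical block from being treated as a duplicate. The sketch below shows one way such a token could be attached. The helper name and the random UUID token are assumptions for illustration, not the implementation in this PR.

```python
import uuid
from typing import Optional


def insert_with_dedup_token(table: str, values_sql: str, token: Optional[str] = None) -> str:
    """Build an INSERT ... VALUES statement carrying an insert_deduplication_token.

    `values_sql` is the already-rendered VALUES payload, e.g. "(1, 'a'), (2, 'b')".
    """
    # Assumption: a fresh random token per statement, so repeated loads of the
    # same rows are not dropped by insert deduplication.
    token = token or uuid.uuid4().hex
    return (
        f"INSERT INTO {table} "
        f"SETTINGS insert_deduplication_token = '{token}' "
        f"VALUES {values_sql}"
    )


print(insert_with_dedup_token("t", "(1, 'a')", token="load_2"))
```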

@jorritsandbrink
Collaborator Author

@rudolfix Could you review this PR?

I struggled a lot trying to get consistent results, especially with the delete-insert merge strategy, because it relies on subqueries which ClickHouse doesn't seem to like when using replicated/distributed tables. Spent too much time debugging flaky tests, but it seems they're all fixed now.

Notes:

  • when working with distributed tables, SELECT and INSERT operations should be done on the distributed table, while other operations (e.g. TRUNCATE, DELETE FROM) should be done on the base table
  • when querying a table through pipeline.dataset(), it will resolve to the distributed table (querying the base table would only return data from one shard)
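
The routing rule in these notes can be sketched as a small lookup. This is only an illustration of the rule as stated above; `target_table`, the `_dist` suffix, and the way the operation keyword is parsed are hypothetical and not taken from the PR.

```python
# Operations that should go through the distributed table, per the notes above.
DISTRIBUTED_OPS = {"SELECT", "INSERT"}


def target_table(operation: str, base_table: str, dist_suffix: str = "_dist") -> str:
    """Pick the table an operation should run against.

    SELECT and INSERT go to the distributed table; everything else
    (e.g. TRUNCATE, DELETE FROM) goes to the base table.
    """
    # Use only the leading keyword, so "DELETE FROM" is treated as "DELETE".
    keyword = operation.strip().split()[0].upper()
    if keyword in DISTRIBUTED_OPS:
        return base_table + dist_suffix
    return base_table


for op in ("SELECT", "INSERT", "TRUNCATE", "DELETE FROM"):
    print(op, "->", target_table(op, "events"))
```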

@jorritsandbrink jorritsandbrink marked this pull request as ready for review February 6, 2026 09:38

Development

Successfully merging this pull request may close these issues.

Cluster support for clickhouse
