Add query labels via sqlcommenter comment parsing#67
Add query labels via sqlcommenter comment parsing#67iskakaushik wants to merge 11 commits intomainfrom
Conversation
Extract user-defined key='value' labels from SQL comments in the sqlcommenter format (/* controller='users',action='show' */) and export them to ClickHouse as a JSON column and to OTel as db.query.label.* attributes. Parsing happens entirely in the background worker (cold path) with zero changes to the hot path — hooks, events, ring buffer, and DSA are untouched. The bgworker scans the query text backward for the last /* */ comment block, parses key='value' pairs with URL decoding, and passes the structured ParseResult to each exporter directly. - New GUC: pg_stat_ch.track_labels (bool, default true, PGC_SIGHUP) - New ClickHouse column: labels JSON(max_dynamic_paths=64) - Parser: ExtractLastComment + ParseSqlcommenter + SerializeLabelsJson - 45 unit tests covering parsing, URL decoding, truncation, and edge cases - TAP integration test for end-to-end ClickHouse label export
There was a problem hiding this comment.
Pull request overview
Adds support for extracting user-defined sqlcommenter key='value' labels from the last /* ... */ SQL comment and exporting them as structured labels to both ClickHouse (new labels JSON column) and OpenTelemetry (db.query.label.* attributes), gated by a new pg_stat_ch.track_labels GUC.
Changes:
- Implement sqlcommenter comment extraction/parsing + JSON serialization, and invoke it from the bgworker export loop.
- Extend exporters with a
StatsExporter::AppendLabels()hook to emit labels to ClickHouse and OTel. - Add ClickHouse schema support + unit and TAP integration tests, plus regression coverage for the new GUC.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| test/unit/sqlcommenter_parse_test.cc | Adds GTest unit coverage for comment extraction, parsing, decoding, truncation, and JSON serialization. |
| test/regression/sql/guc.sql | Exercises visibility of new pg_stat_ch.track_labels GUC. |
| test/regression/expected/guc.out | Updates expected regression output to include pg_stat_ch.track_labels. |
| t/030_query_labels.pl | Adds end-to-end TAP test asserting labels round-trip to ClickHouse and respect track_labels. |
| src/export/stats_exporter.cc | Parses last comment per event and forwards parsed labels to exporters in the bgworker export loop. |
| src/export/sqlcommenter_parse.h | Declares parser/serializer APIs and ParseResult representation. |
| src/export/sqlcommenter_parse.cc | Implements backward scan for last block comment, sqlcommenter parsing, URL-decoding, and JSON escaping/serialization. |
| src/export/otel_exporter.cc | Implements AppendLabels() to publish db.query.label.* attributes (and tag map entries). |
| src/export/exporter_interface.h | Extends exporter interface with AppendLabels(const ParseResult&). |
| src/export/clickhouse_exporter.cc | Adds a labels column to exported ClickHouse blocks and appends serialized label JSON. |
| src/config/guc.cc | Introduces pg_stat_ch.track_labels (PGC_SIGHUP, default on). |
| include/config/guc.h | Exposes psch_track_labels for use in exporter code. |
| docker/init/00-schema.sql | Adds labels JSON(max_dynamic_paths=64) column to the canonical ClickHouse schema. |
| CMakeLists.txt | Adds a new sqlcommenter_parse_test unit test target. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| $node->safe_psql('postgres', 'SELECT 42'); | ||
| $node->safe_psql('postgres', 'SELECT pg_stat_ch_flush()'); | ||
|
|
||
| my $labels = psch_wait_for_clickhouse_query( | ||
| "SELECT labels FROM pg_stat_ch.events_raw " | ||
| . "WHERE query LIKE '%42%' LIMIT 1", |
There was a problem hiding this comment.
In this new subtest the ClickHouse lookup filters on WHERE query LIKE '%42%', but queries are normalized in pg_stat_ch (numeric literals are replaced with $N placeholders; see existing normalization tests). This can make the row unfindable and the test flaky/failing. Prefer filtering on a stable identifier that survives normalization (e.g., an alias/table name) or by pid + ordering, instead of the literal value.
| $node->safe_psql('postgres', 'SELECT 42'); | |
| $node->safe_psql('postgres', 'SELECT pg_stat_ch_flush()'); | |
| my $labels = psch_wait_for_clickhouse_query( | |
| "SELECT labels FROM pg_stat_ch.events_raw " | |
| . "WHERE query LIKE '%42%' LIMIT 1", | |
| $node->safe_psql('postgres', | |
| 'SELECT 42 AS no_comment_query_labels'); | |
| $node->safe_psql('postgres', 'SELECT pg_stat_ch_flush()'); | |
| my $labels = psch_wait_for_clickhouse_query( | |
| "SELECT labels FROM pg_stat_ch.events_raw " | |
| . "WHERE query LIKE '%no_comment_query_labels%' LIMIT 1", |
| // Query text. CH: RecordString "query"; OTel semconv: "db.query.text". | ||
| virtual shared_ptr<Column<string_view>> DbQueryTextColumn() = 0; | ||
| // Query labels from sqlcommenter comments. Called inside the event loop. | ||
| // CH: serializes to JSON, appends to a String "labels" column; |
There was a problem hiding this comment.
The comment says ClickHouse appends labels to a String labels column, but the schema change in this PR defines labels as a ClickHouse JSON column. Updating this description to match the actual column type will avoid confusion for future maintainers/users.
| // CH: serializes to JSON, appends to a String "labels" column; | |
| // CH: serializes to JSON, appends to a JSON "labels" column; |
|
|
||
| query String COMMENT 'Full SQL query text (may be truncated). Used for debugging and query analysis.', | ||
|
|
||
| labels JSON(max_dynamic_paths=64) COMMENT 'Query labels from sqlcommenter comments (key=value pairs in /* */ blocks). Access subpaths directly: labels.controller, labels.action. Empty {} when no labels present. See: https://google.github.io/sqlcommenter/', |
There was a problem hiding this comment.
The column comment describes sqlcommenter labels as "key=value" pairs, but the format handled/advertised elsewhere in this PR is key='value' (single-quoted values). Consider updating the comment text to reflect the actual expected syntax to avoid user confusion.
| labels JSON(max_dynamic_paths=64) COMMENT 'Query labels from sqlcommenter comments (key=value pairs in /* */ blocks). Access subpaths directly: labels.controller, labels.action. Empty {} when no labels present. See: https://google.github.io/sqlcommenter/', | |
| labels JSON(max_dynamic_paths=64) COMMENT 'Query labels from sqlcommenter comments (key=''value'' pairs in /* */ blocks). Access subpaths directly: labels.controller, labels.action. Empty {} when no labels present. See: https://google.github.io/sqlcommenter/', |
- Parse \' as escaped single quote in values (spec meta char escaping) - Combined MetaUnescapeAndUrlDecode replaces UrlDecode for spec compliance - Add ClickHouse migration for existing installs (001_add_labels_column.sql) - Document labels column in events-schema reference and migration in ClickHouse guide - Fix CI to build all unit test targets (sqlcommenter_parse_test was missing) - Add 9 spec compliance tests (escaped quotes, spec exhibit round-trip)
- Real-world Django ORM sample (6 labels with URL-encoded framework version, route, traceparent, tracestate) - Multi-line comment extraction and parsing - Multiple comment blocks (last block wins) - Non-sqlcommenter last comment returns no labels - -- style comments don't interfere with /* */ extraction - Full end-to-end Django pipeline with JSON verification 61 tests total (up from 54).
Extract Scanner struct with SkipWhitespace/Consume/ScanKey/ScanQuotedValue methods. Add DecodeField helper to deduplicate key/value decode logic. Replace inline character checks with IsWhitespace/IsKeyTerminator. Use std::from_chars for hex decoding.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| attr_name.append(labels.labels[i].key.data(), labels.labels[i].key.size()); | ||
| string val(labels.labels[i].value); | ||
| current_log_record->SetAttribute(attr_name, val); | ||
| current_row_tags[attr_name] = std::move(val); | ||
| } |
There was a problem hiding this comment.
AppendLabels() currently inserts parsed sqlcommenter labels into current_row_tags (line 138), which means they become dimensions for all emitted metrics (HistogramColumn::Crunch records metrics with current_row_tags). Since these labels are user-controlled and high-cardinality by nature, this can explode metric cardinality and dramatically increase OTel backend cost/ingestion load. Consider keeping labels as log-record attributes only (do not add them to current_row_tags), or gate them behind a separate config specifically for metrics tagging.
Also, the label key is appended directly into the attribute name; after URL-decoding it may contain spaces/control characters (tests include "my key"), which can violate common OTel attribute-key expectations and may break downstream collectors/backends. Please sanitize/normalize keys (e.g., allowlist [A-Za-z0-9_.-] and replace others / skip invalid keys) before constructing the attribute name.
There was a problem hiding this comment.
@JoshDreamland you've been dealing with otel perf here, does this copilot perf advice seem applicable here?
| labels_col_->Append(SerializeLabelsJson(labels)); | ||
| } | ||
|
|
||
| void BeginBatch() final { | ||
| block = std::make_unique<clickhouse::Block>(); | ||
| columns.clear(); | ||
| exported_count = 0; | ||
| labels_col_ = Wrap<clickhouse::ColumnString, string_view>("labels"); |
There was a problem hiding this comment.
ClickHouseExporter::BeginBatch() unconditionally adds a "labels" column to every inserted block. On upgrades where the ClickHouse schema hasn’t yet been migrated to include labels, inserts will fail continuously — even when pg_stat_ch.track_labels = off — because the block still contains an unknown column. Consider only creating/appending the labels column when label tracking is enabled and the destination schema supports it (or otherwise detect the missing column and emit a clear actionable error that points to the migration script, while allowing operation with labels disabled).
| labels_col_->Append(SerializeLabelsJson(labels)); | |
| } | |
| void BeginBatch() final { | |
| block = std::make_unique<clickhouse::Block>(); | |
| columns.clear(); | |
| exported_count = 0; | |
| labels_col_ = Wrap<clickhouse::ColumnString, string_view>("labels"); | |
| if (labels_col_) { | |
| labels_col_->Append(SerializeLabelsJson(labels)); | |
| } | |
| } | |
| void BeginBatch() final { | |
| block = std::make_unique<clickhouse::Block>(); | |
| columns.clear(); | |
| exported_count = 0; | |
| labels_col_.reset(); | |
| if (pg_stat_ch_track_labels) { | |
| labels_col_ = Wrap<clickhouse::ColumnString, string_view>("labels"); | |
| } |
Replaces hand-rolled AppendJsonEscaped + string-concat JSON builder with nlohmann::ordered_json. Escaping is now handled by a well-tested library instead of custom code.
- Change labels column from JSON(max_dynamic_paths=64) to String
DEFAULT '{}' — clickhouse-cpp doesn't support the JSON column type
- Fix OTel export test: duration metric renamed to
pg_stat_ch_db_client_operation_duration_seconds, label db → db_name
- Fix OTel precedence warning: !system(...) == 0 → system(...) != 0
- Fix query labels test: use JSONExtractString instead of subpath
syntax, use DDL instead of normalized literals for no-comment test
- Rename 030_query_labels.pl → 032_query_labels.pl (avoid collisions
with 030_nested_query_normalization and 031_standby_export)
New CI jobs that run the integration TAP tests against PG 18: - clickhouse-tap: starts ClickHouse via docker-compose, runs 010_clickhouse_export and 011_clickhouse_reconnect - otel-tap: starts OTel collector via docker-compose, runs 024_otel_export and 025_otel_reconnect
- Revert redundant clickhouse-tap/otel-tap jobs from ci.yml (ci-tap.yml already runs all TAP tests) - Switch ci-tap.yml from raw docker run to docker-compose so port mappings match what the tests expect (18123/19000) - Start OTel collector alongside ClickHouse so OTel TAP tests run instead of being skipped
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| | Column | Type | Description | | ||
| |---|---|---| | ||
| | `labels` | `String DEFAULT '{}'` | Key-value labels extracted from [sqlcommenter](https://google.github.io/sqlcommenter/) comments appended to the query. For example, `/* controller='users',action='show' */` produces `{"controller":"users","action":"show"}`. Access subpaths directly in ClickHouse: `labels.controller`, `labels.action`. Empty `{}` when no labels are present. | |
There was a problem hiding this comment.
The docs claim you can access JSON subpaths as labels.controller / labels.action, but the schema defines labels as String (see docker/init/00-schema.sql and migration). With a String column, ClickHouse won't support dot-subpath access; you need functions like JSONExtractString(labels, 'controller') (as used in TAP tests), or change the column type to ClickHouse JSON to match the documentation.
| | `labels` | `String DEFAULT '{}'` | Key-value labels extracted from [sqlcommenter](https://google.github.io/sqlcommenter/) comments appended to the query. For example, `/* controller='users',action='show' */` produces `{"controller":"users","action":"show"}`. Access subpaths directly in ClickHouse: `labels.controller`, `labels.action`. Empty `{}` when no labels are present. | | |
| | `labels` | `String DEFAULT '{}'` | Key-value labels extracted from [sqlcommenter](https://google.github.io/sqlcommenter/) comments appended to the query. For example, `/* controller='users',action='show' */` produces `{"controller":"users","action":"show"}`. Since this column stores JSON as a string, extract values in ClickHouse with functions such as `JSONExtractString(labels, 'controller')` and `JSONExtractString(labels, 'action')`. Empty `{}` when no labels are present. | |
|
|
||
| ALTER TABLE pg_stat_ch.events_raw | ||
| ADD COLUMN IF NOT EXISTS labels String DEFAULT '{}' | ||
| COMMENT 'Query labels from sqlcommenter comments (key=value pairs in /* */ blocks). Access subpaths directly: labels.controller, labels.action. Empty {} when no labels present. See: https://google.github.io/sqlcommenter/' |
There was a problem hiding this comment.
This migration adds labels as a String, but the column COMMENT says you can access subpaths directly via labels.controller/labels.action. That dot-subpath syntax requires the ClickHouse JSON type; for a String column the documented access pattern should use JSONExtract* functions, or the column type should be switched to JSON to match the comment and docs.
| COMMENT 'Query labels from sqlcommenter comments (key=value pairs in /* */ blocks). Access subpaths directly: labels.controller, labels.action. Empty {} when no labels present. See: https://google.github.io/sqlcommenter/' | |
| COMMENT 'Query labels from sqlcommenter comments (key=value pairs in /* */ blocks). Extract fields with JSONExtractString(labels, ''controller'') and JSONExtractString(labels, ''action''). Empty {} when no labels present. See: https://google.github.io/sqlcommenter/' |
|
|
||
| | Migration | Version | Description | | ||
| |---|---|---| | ||
| | `001_add_labels_column.sql` | 0.2+ | Adds `labels JSON` column for [sqlcommenter](https://google.github.io/sqlcommenter/) query label support | |
There was a problem hiding this comment.
The migration table says 001_add_labels_column.sql adds a labels JSON column, but the actual schema/migration defines labels as String DEFAULT '{}'. Please align the docs with the actual ClickHouse type (or update the schema if JSON is the real intent).
| | `001_add_labels_column.sql` | 0.2+ | Adds `labels JSON` column for [sqlcommenter](https://google.github.io/sqlcommenter/) query label support | | |
| | `001_add_labels_column.sql` | 0.2+ | Adds `labels String DEFAULT '{}'` column for [sqlcommenter](https://google.github.io/sqlcommenter/) query label support | |
| // Meta-unescape (\' -> ') and URL-decode (%XX) in a single pass. | ||
| // Per sqlcommenter spec, meta unescaping happens before URL decoding, | ||
| // but the two transforms don't overlap so a combined pass is equivalent. | ||
| size_t MetaUnescapeAndUrlDecode(std::string_view src, char* dst, size_t max_len) { | ||
| size_t written = 0; | ||
| size_t i = 0; | ||
| while (i < src.size() && written < max_len) { | ||
| if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\'') { | ||
| dst[written++] = '\''; | ||
| i += 2; | ||
| } else if (src[i] == '%' && i + 2 < src.size()) { | ||
| int byte = DecodeHexByte(src[i + 1], src[i + 2]); | ||
| if (byte >= 0) { | ||
| dst[written++] = static_cast<char>(byte); | ||
| i += 3; | ||
| continue; | ||
| } | ||
| dst[written++] = src[i++]; | ||
| } else { | ||
| dst[written++] = src[i++]; | ||
| } | ||
| } | ||
| return written; |
There was a problem hiding this comment.
MetaUnescapeAndUrlDecode will percent-decode arbitrary bytes (including control characters and NUL via %00) into label keys/values. Those bytes can produce non-printable / non-UTF-8 strings in downstream outputs (OTel attribute names/values, ClickHouse JSON text), which can cause ingestion failures or hard-to-debug display issues. Consider validating/sanitizing decoded output (e.g., reject or replace control bytes and ensure UTF-8) before returning it from the parser.
The otel-collector-contrib image is distroless — no shell, wget, or curl available for Docker healthchecks. Remove the in-container healthcheck and poll from the host via curl in CI instead.
The OTel periodic metric reader defaults to 5s export interval. Tests that scrape the Prometheus endpoint need to wait for at least one export cycle. Set otel_metric_interval_ms=1000 in the test helper and add sleep(3) before Prometheus scrapes.
The test was using a SELECT with a comment (normalized away) and a racy wait predicate. Use DDL (CREATE/DROP TABLE) whose text survives normalization, wait for the row to arrive in ClickHouse, then assert.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | Column | Type | Description | | ||
| |---|---|---| | ||
| | `labels` | `String DEFAULT '{}'` | Key-value labels extracted from [sqlcommenter](https://google.github.io/sqlcommenter/) comments appended to the query. For example, `/* controller='users',action='show' */` produces `{"controller":"users","action":"show"}`. Access subpaths directly in ClickHouse: `labels.controller`, `labels.action`. Empty `{}` when no labels are present. | | ||
|
|
||
| <Note> | ||
| Labels are parsed from the **last** `/* */` comment block in the query text. The parser supports URL-encoded values and escaped single quotes per the sqlcommenter specification. Controlled by the [`pg_stat_ch.track_labels`](/reference/configuration) GUC (default: `true`). | ||
| </Note> |
There was a problem hiding this comment.
The labels column is documented as String, but the description claims you can access subpaths via labels.controller / labels.action. That dot-notation requires a ClickHouse JSON-typed column; with String you need functions like JSONExtractString(labels, 'controller'). Please either change the column type to JSON(...) (and keep docs as-is) or update the docs/examples to match a String column.
| | `labels` | `String DEFAULT '{}'` | Key-value labels extracted from [sqlcommenter](https://google.github.io/sqlcommenter/) comments appended to the query. For example, `/* controller='users',action='show' */` produces `{"controller":"users","action":"show"}`. Access subpaths directly in ClickHouse: `labels.controller`, `labels.action`. Empty `{}` when no labels are present. | | ||
|
|
||
| <Note> | ||
| Labels are parsed from the **last** `/* */` comment block in the query text. The parser supports URL-encoded values and escaped single quotes per the sqlcommenter specification. Controlled by the [`pg_stat_ch.track_labels`](/reference/configuration) GUC (default: `true`). |
There was a problem hiding this comment.
This section introduces pg_stat_ch.track_labels and links to /reference/configuration, but that page currently doesn’t mention track_labels (so readers can’t discover details like context/reload semantics). Please add pg_stat_ch.track_labels to the configuration reference and consider adding a short warning that label values are user-supplied and may contain sensitive data if applications put PII/tokens in sqlcommenter comments.
| Labels are parsed from the **last** `/* */` comment block in the query text. The parser supports URL-encoded values and escaped single quotes per the sqlcommenter specification. Controlled by the [`pg_stat_ch.track_labels`](/reference/configuration) GUC (default: `true`). | |
| Labels are parsed from the **last** `/* */` comment block in the query text. The parser supports URL-encoded values and escaped single quotes per the sqlcommenter specification. Collection is controlled by the [`pg_stat_ch.track_labels`](/reference/configuration) GUC, which defaults to `true` and can be changed with a configuration reload (no server restart required). | |
| Label values are user-supplied application metadata. If applications include PII, tokens, session identifiers, or other secrets in sqlcommenter comments, those values may be exported in `labels`. Only enable and use labels with trusted, non-sensitive metadata. |
| docker compose -f docker/docker-compose.test.yml up -d --wait | ||
| docker compose -f docker/docker-compose.otel.yml up -d | ||
| # Poll OTel health endpoint (distroless image has no shell for healthcheck) | ||
| for i in $(seq 1 30); do | ||
| if curl -sf http://localhost:13133/ >/dev/null 2>&1; then | ||
| echo "OTel collector ready" | ||
| break | ||
| fi | ||
| echo "Waiting for ClickHouse... ($i/30)" | ||
| echo "Waiting for OTel collector... ($i/30)" | ||
| sleep 1 | ||
| done |
There was a problem hiding this comment.
The readiness polling loop for the OTel collector never fails the step if the collector doesn’t become ready within 30s (it just falls through). Please add a post-loop check (or track a flag) and exit 1 if the health endpoint never responds, so CI fails fast with a clear error when the collector can’t start.
| } | ||
|
|
||
| bool IsWhitespace(char c) { | ||
| return c == ' ' || c == '\t' || c == '\n' || c == '\r'; |
| // Scan a single-quoted value, handling \' escapes per sqlcommenter spec. | ||
| // Assumes the opening quote was already consumed. | ||
| // Returns the raw content (before decoding). Sets *ok = false on unterminated quote. | ||
| std::string_view ScanQuotedValue(bool* ok) { |
There was a problem hiding this comment.
| std::string_view ScanQuotedValue(bool* ok) { | |
| std::string_view ScanQuotedValue() { |
caller can go by result.data() == nullptr
| } else { | ||
| exporter->AppendLabels(ParseResult{}); | ||
| } |
There was a problem hiding this comment.
| } else { | |
| exporter->AppendLabels(ParseResult{}); | |
| } | |
| } |
if GUC disabled why do we still push empty object?
|
There's a lot of logic that tries to be antifragile by ignoring bad inputs (skips misformed labels, tries to keep parsing after, ignores missing commas) which feels like it'll result in unnecessary parsing over invalid input (read: last comment of any normal query), potentially grabbing random labels (probably from commented SQL) the |
Summary
key='value'labels from SQL comments in the sqlcommenter format (/* controller='users',action='show' */) and export them to ClickHouse as alabels JSONcolumn and to OTel asdb.query.label.*attributespg_stat_ch.track_labels(bool, default true,PGC_SIGHUP)Details
The bgworker scans the query text backward for the last
/* */comment block, parseskey='value'pairs with URL decoding, and passes the structuredParseResultto each exporter directly.New files:
src/export/sqlcommenter_parse.cc/.h— parser:ExtractLastComment+ParseSqlcommenter+SerializeLabelsJsontest/unit/sqlcommenter_parse_test.cc— 45 unit tests covering parsing, URL decoding, truncation, and edge casest/030_query_labels.pl— TAP integration test for end-to-end ClickHouse label exportSchema change:
labels JSON(max_dynamic_paths=64)onevents_rawTest plan
sqlcommenter_parse_test— 45 cases)030_query_labels.pl) confirms labels round-trip to ClickHouse🤖 Generated with Claude Code