Skip to content

A118: TLS Telemetry#550

Open
gtcooke94 wants to merge 18 commits into
grpc:masterfrom
gtcooke94:authn_telemetry
Open

A118: TLS Telemetry#550
gtcooke94 wants to merge 18 commits into
grpc:masterfrom
gtcooke94:authn_telemetry

Conversation

@gtcooke94

@gtcooke94 gtcooke94 commented May 18, 2026

Copy link
Copy Markdown
Contributor

The gRFC to support authentication telemetry in the TLS stack for handshake and TLS offload operation monitoring.

@gtcooke94 gtcooke94 changed the title [A118] Authentication Telemetry [A118] TLS Telemetry May 20, 2026
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md
@gtcooke94 gtcooke94 changed the title [A118] TLS Telemetry A118: TLS Telemetry May 22, 2026

@ejona86 ejona86 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I'll delay giving my Approval until others have, and then I'll Approve+merge when it seems there's concensus.

Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md
from the labels.

* `grpc.client.tls.handshakes`

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing the instrument type (counter), the unit ({handshake}), and a description. The description should say it is experimental.

The histograms have the instrument type, but are still missing the unit (s) and description.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is still missing a description.

@markdroth markdroth left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall! My comments are all relatively minor.

Please let me know if you have any questions.

Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md Outdated

@dfawley dfawley left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we want to find someone from otel to review the propsed metrics before we implement this? I don't think any of us are experts in this area.

Comment thread A118-tls-telemetry.md Outdated
Comment on lines +238 to +239
+ duration := time.Since(startTime).Seconds()
+ <increment metric>

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which metric is this, specifically? The ones defined above all seem to be finer-grained than this. And grpc.client.tls.handshakes is just a counter, so doesn't need a duration?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, this was from when we still had the handshake at a latency metric. Updated the code snippet.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks. One other nit here is that you're mixing tabs and spaces on the different lines, so the alignment is off for me.

Comment thread A118-tls-telemetry.md Outdated

| Label Name | Required/Optional | Description |
| :--- | :--- | :--- |
| `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in [A66]). |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy/paste-o from above. "Result of the private key signing...."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread A118-tls-telemetry.md
Comment on lines +87 to +88
understanding authentication behavior. An enum describing the type of resumption
used (or none) will be created - it can be extended in the future to make

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says "or none" but there is no corresponding NONE enum value below?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FULL_HANDSHAKE means no resumption was done - should we rename to NO_RESUMPTION?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel too strongly about the names, but it might want a comment. But what exactly does "RESUMED_HANDSHAKE" actually mean, relative to the other future things you suggested could be added in the future like "ticket-based resumption vs. session-based resumption"? Is "RESUMED_HANDSHAKE" actually "SESSION_RESUMPTION"?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RESUMED_HANDSHAKE means that it was resumed, with no specificity on session or ticket

Comment thread A118-tls-telemetry.md Outdated
single place: `internal/transport/http2_client.go` inside `NewHTTP2Client` via
`transportCreds.ClientHandshake`. Server handshakes are also performed in a
single place: `internal/transport/http2_server.go` inside `NewServerTransport`
via `config.Credentials.ServerHandshake`. We can incremement the metrics here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*increment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, missed this in the review, fixed

Comment thread A118-tls-telemetry.md Outdated
| Label Name | Required/Optional | Description |
| :--- | :--- | :--- |
| `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum string indicating success or the reason for handshake failure. |
| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum string indicating if and how the handshake was resumed. |

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum string indicating if and how the handshake was resumed. |
| `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum string indicating if and how the handshake was resumed. |

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread A118-tls-telemetry.md Outdated
The following metrics are count metrics, with the primary information coming
from the labels.

* `grpc.client.tls.handshakes` (unit: handshakes type: counter. This is experimental.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit here should be "{handshake}" (the braces and everything are important for otel). And "int64 counter".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread A118-tls-telemetry.md Outdated
| `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in [A78], [A94]). |
| `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in [A89], [A94]). |

* `grpc.server.tls.handshakes` (unit: handshakes type: counter. This is experimental.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(As above)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread A118-tls-telemetry.md Outdated

The following metrics are non-per-call bucketed latency metrics that report the duration of offloaded cryptographic operations.

* `grpc.server.tls.offload_certificate_selection_duration` (unit: float64 seconds, type: histogram - latency buckets defined in [A66]. This is experimental.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"float64" isn't a unit. I think you want "unit: s" and "type: float64 histogram".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment thread A118-tls-telemetry.md
if transportCreds != nil {
+ isTLS := transportCreds.Info().SecurityProtocol == "tls"
conn, authInfo, err = transportCreds.ClientHandshake(connectCtx, addr.ServerName, conn)
+ if isTLS {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other thought: do we want to read the authInfo returned by the handshake to determine whether TLS was involved (I guess cast it to TLSInfo?) or trust the creds string directly? I don't know that it matters at all, but it's something to ponder, perhaps.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Both the tls creds and xds creds today set this field to "tls". Similarly, both of them set AuthInfo to TLSInfo. Maybe the idea was always to include xds creds as well, in which case we are good.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way - I don't see a reason using XDS creds should preclude folks from monitoring - if it's doing TLS and we have TLS monitoring, the user can get monitoring?

As for how we want to see if it's TLS - can we discuss on the implementation PR since it's more of an impl detail? This is meant more as a demonstrative snippet, not a code source of truth.

@dfawley dfawley requested a review from mbissa June 5, 2026 17:44
@ejona86 ejona86 requested a review from mbissa June 5, 2026 17:44
@ejona86

ejona86 commented Jun 5, 2026

Copy link
Copy Markdown
Member

@mbissa, can you look at this from the metrics definition perspective (did we define everything we needed to define)?

Comment thread A118-tls-telemetry.md
Comment thread A118-tls-telemetry.md

@markdroth markdroth left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, modulo updating the C++ implementation details.

Comment thread A118-tls-telemetry.md Outdated
Comment thread A118-tls-telemetry.md
@@ -185,7 +185,7 @@ In the few use-cases where TSI is not called via gRPC, we will ensure that
metric incrementation is not performed.

// TODO: Update this comment to be more accurate based on the impl PR

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this line.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants