A118: TLS Telemetry#550
Conversation
ejona86
left a comment
There was a problem hiding this comment.
This looks good. I'll delay giving my Approval until others have, and then I'll Approve+merge when it seems there's concensus.
| from the labels. | ||
|
|
||
| * `grpc.client.tls.handshakes` | ||
|
|
There was a problem hiding this comment.
Missing the instrument type (counter), the unit ({handshake}), and a description. The description should say it is experimental.
The histograms have the instrument type, but are still missing the unit (s) and description.
There was a problem hiding this comment.
It is still missing a description.
markdroth
left a comment
There was a problem hiding this comment.
This looks good overall! My comments are all relatively minor.
Please let me know if you have any questions.
dfawley
left a comment
There was a problem hiding this comment.
Did we want to find someone from otel to review the propsed metrics before we implement this? I don't think any of us are experts in this area.
| + duration := time.Since(startTime).Seconds() | ||
| + <increment metric> |
There was a problem hiding this comment.
Which metric is this, specifically? The ones defined above all seem to be finer-grained than this. And grpc.client.tls.handshakes is just a counter, so doesn't need a duration?
There was a problem hiding this comment.
My bad, this was from when we still had the handshake at a latency metric. Updated the code snippet.
There was a problem hiding this comment.
OK thanks. One other nit here is that you're mixing tabs and spaces on the different lines, so the alignment is off for me.
|
|
||
| | Label Name | Required/Optional | Description | | ||
| | :--- | :--- | :--- | | ||
| | `grpc.status` | Required | Result of the certificate selection offloading, in the format of a gRPC status code (as defined in [A66]). | |
There was a problem hiding this comment.
Copy/paste-o from above. "Result of the private key signing...."
| understanding authentication behavior. An enum describing the type of resumption | ||
| used (or none) will be created - it can be extended in the future to make |
There was a problem hiding this comment.
This says "or none" but there is no corresponding NONE enum value below?
There was a problem hiding this comment.
FULL_HANDSHAKE means no resumption was done - should we rename to NO_RESUMPTION?
There was a problem hiding this comment.
I don't feel too strongly about the names, but it might want a comment. But what exactly does "RESUMED_HANDSHAKE" actually mean, relative to the other future things you suggested could be added in the future like "ticket-based resumption vs. session-based resumption"? Is "RESUMED_HANDSHAKE" actually "SESSION_RESUMPTION"?
There was a problem hiding this comment.
RESUMED_HANDSHAKE means that it was resumed, with no specificity on session or ticket
| single place: `internal/transport/http2_client.go` inside `NewHTTP2Client` via | ||
| `transportCreds.ClientHandshake`. Server handshakes are also performed in a | ||
| single place: `internal/transport/http2_server.go` inside `NewServerTransport` | ||
| via `config.Credentials.ServerHandshake`. We can incremement the metrics here |
There was a problem hiding this comment.
Whoops, missed this in the review, fixed
| | Label Name | Required/Optional | Description | | ||
| | :--- | :--- | :--- | | ||
| | `grpc.tls.handshake.result` | Required | The `TlsTelemetryHandshakeResult` enum string indicating success or the reason for handshake failure. | | ||
| | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum string indicating if and how the handshake was resumed. | |
There was a problem hiding this comment.
Nit:
| | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum string indicating if and how the handshake was resumed. | | |
| | `grpc.tls.handshake.resumed` | Optional | The `TlsResumptionType` enum string indicating if and how the handshake was resumed. | |
| The following metrics are count metrics, with the primary information coming | ||
| from the labels. | ||
|
|
||
| * `grpc.client.tls.handshakes` (unit: handshakes type: counter. This is experimental.) |
There was a problem hiding this comment.
The unit here should be "{handshake}" (the braces and everything are important for otel). And "int64 counter".
| | `grpc.lb.locality` | Optional | The locality to which the traffic is being sent (as defined in [A78], [A94]). | | ||
| | `grpc.lb.backend_service` | Optional | The backend service to which the traffic is being sent (as defined in [A89], [A94]). | | ||
|
|
||
| * `grpc.server.tls.handshakes` (unit: handshakes type: counter. This is experimental.) |
|
|
||
| The following metrics are non-per-call bucketed latency metrics that report the duration of offloaded cryptographic operations. | ||
|
|
||
| * `grpc.server.tls.offload_certificate_selection_duration` (unit: float64 seconds, type: histogram - latency buckets defined in [A66]. This is experimental.) |
There was a problem hiding this comment.
"float64" isn't a unit. I think you want "unit: s" and "type: float64 histogram".
| if transportCreds != nil { | ||
| + isTLS := transportCreds.Info().SecurityProtocol == "tls" | ||
| conn, authInfo, err = transportCreds.ClientHandshake(connectCtx, addr.ServerName, conn) | ||
| + if isTLS { |
There was a problem hiding this comment.
Other thought: do we want to read the authInfo returned by the handshake to determine whether TLS was involved (I guess cast it to TLSInfo?) or trust the creds string directly? I don't know that it matters at all, but it's something to ponder, perhaps.
There was a problem hiding this comment.
FYI: Both the tls creds and xds creds today set this field to "tls". Similarly, both of them set AuthInfo to TLSInfo. Maybe the idea was always to include xds creds as well, in which case we are good.
There was a problem hiding this comment.
Either way - I don't see a reason using XDS creds should preclude folks from monitoring - if it's doing TLS and we have TLS monitoring, the user can get monitoring?
As for how we want to see if it's TLS - can we discuss on the implementation PR since it's more of an impl detail? This is meant more as a demonstrative snippet, not a code source of truth.
|
@mbissa, can you look at this from the metrics definition perspective (did we define everything we needed to define)? |
markdroth
left a comment
There was a problem hiding this comment.
This looks good, modulo updating the C++ implementation details.
| @@ -185,7 +185,7 @@ In the few use-cases where TSI is not called via gRPC, we will ensure that | |||
| metric incrementation is not performed. | |||
|
|
|||
| // TODO: Update this comment to be more accurate based on the impl PR | |||
The gRFC to support authentication telemetry in the TLS stack for handshake and TLS offload operation monitoring.