enhancement(datadog encoder): support for metrics v3 protocol #1175
tobz wants to merge 20 commits
Binary Size Analysis (Agent Data Plane)
Target: 5cc63ba (baseline) vs 2c79bb6 (comparison)

| Module | File Size | Symbols |
|---|---|---|
| figment | +122.55 KiB | 605 |
| saluki_components::encoders::datadog | +63.81 KiB | 356 |
| hyper_util | -46.99 KiB | 157 |
| core | +46.90 KiB | 14261 |
| hyper | +43.38 KiB | 505 |
| serde_with | +19.03 KiB | 50 |
| piecemeal | +17.11 KiB | 34 |
| tonic | +14.00 KiB | 461 |
| anyhow | +13.28 KiB | 1714 |
| saluki_components::common::datadog | +11.77 KiB | 418 |
| datadog_protos::trace_piecemeal_include::datadog | -10.62 KiB | 14 |
| tracing | -10.31 KiB | 152 |
| saluki_components::transforms::dogstatsd_mapper | -9.14 KiB | 16 |
| http | +8.57 KiB | 421 |
| otlp_protos::otlp_include::opentelemetry | -8.23 KiB | 236 |
| saluki_common::task::instrument | +8.19 KiB | 42 |
| [sections] | +8.08 KiB | 9 |
| saluki_core::topology::interconnect | -7.72 KiB | 71 |
| rmp | -7.67 KiB | 39 |
| saluki_components::sources::dogstatsd | +7.40 KiB | 266 |

Detailed Symbol Changes
FILE SIZE VM SIZE
-------------- --------------
+1.4% +292Ki +1.4% +218Ki [45737 Others]
[NEW] +156Ki [NEW] +156Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::h9596dc1a52bcbc31
[NEW] +68.0Ki [NEW] +67.8Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h80fab2488a7b6640
[NEW] +63.5Ki [NEW] +63.3Ki saluki_core::topology::built::BuiltTopology::spawn::_{{closure}}::h2749e231ef5b15fc
[NEW] +60.4Ki [NEW] +60.2Ki agent_data_plane::internal::env::workload::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::h776cee4a1f5245a9
[NEW] +57.5Ki [NEW] +57.3Ki agent_data_plane::cli::debug::handle_debug_command::_{{closure}}::h35e7970d78ae28a3
[NEW] +56.6Ki [NEW] +56.4Ki agent_data_plane::cli::dogstatsd::handle_dogstatsd_command::_{{closure}}::hd4caf9bf78fff791
[NEW] +54.3Ki [NEW] +54.1Ki saluki_core::topology::blueprint::TopologyBlueprint::build::_{{closure}}::haeb02f7f19b41d13
[NEW] +53.2Ki [NEW] +52.8Ki agent_data_plane::main::_{{closure}}::h59003e4fe55bbe14
[NEW] +42.3Ki [NEW] +42.1Ki _<saluki_components::forwarders::otlp::OtlpForwarder as saluki_core::components::forwarders::Forwarder>::run::_{{closure}}::h09054407a2adc7fa
[NEW] +42.1Ki [NEW] +42.1Ki core::ops::function::FnOnce::call_once::hdba756311593e635
[DEL] -41.9Ki [DEL] -41.8Ki saluki_components::common::datadog::io::run_endpoint_io_loop::_{{closure}}::ha1be06804654f797
[DEL] -42.1Ki [DEL] -42.1Ki core::ops::function::FnOnce::call_once::h0d616879cf1ff5e1
[DEL] -53.7Ki [DEL] -53.3Ki agent_data_plane::main::_{{closure}}::hbbff7540ad384147
[DEL] -54.3Ki [DEL] -54.1Ki saluki_core::topology::blueprint::TopologyBlueprint::build::_{{closure}}::h77e6ee9e8c600d51
[DEL] -56.7Ki [DEL] -56.6Ki agent_data_plane::cli::dogstatsd::handle_dogstatsd_command::_{{closure}}::ha96d45c191ed2dd1
[DEL] -57.6Ki [DEL] -57.4Ki agent_data_plane::cli::debug::handle_debug_command::_{{closure}}::hb91555f7fbd7a256
[DEL] -60.1Ki [DEL] -59.9Ki agent_data_plane::internal::env::workload::RemoteAgentWorkloadProvider::from_configuration::_{{closure}}::hea9789cc98b277de
[DEL] -63.5Ki [DEL] -63.3Ki saluki_core::topology::built::BuiltTopology::spawn::_{{closure}}::h937c94d72f452158
[DEL] -67.7Ki [DEL] -67.5Ki agent_data_plane::cli::run::create_topology::_{{closure}}::h46b7d405a57b2a7f
[DEL] -156Ki [DEL] -156Ki agent_data_plane::cli::run::handle_run_command::_{{closure}}::ha486aef1f744edbb
+0.8% +292Ki +0.7% +219Ki TOTAL
Regression Detector (Agent Data Plane)
Optimization Goals: ✅ No significant changes detected
Bounds Checks: ✅ Passed (5)
Explanation: A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression.
This is temporarily blocked until there is a version of the Datadog Agent with up-to-date v3 metrics support that we can test against in correctness tests. Currently, we're hitting an issue where rate intervals are delta encoded when they shouldn't be. That bug is fixed in DataDog/datadog-agent#45825, but the fix won't be released until 7.77, and it will be roughly 2 weeks before an RC is available to use. We can potentially do a hacky image build or something to keep going in the meantime, then switch back to a proper Agent version once one is available. We'll see.
79cdda1 to 59636cd
We've temporarily worked around the correctness test issue by using a "dev" container image. We can't merge this as-is: we need to wait for at least an RC build of Datadog Agent 7.77 so we can pin to a non-development image. In the meantime, I'm going to work on making sure we've integrated all of the same small fixes/changes that have steadily been landing upstream in the Datadog Agent repository for V3 support.
30ee642 to 898021d
be9a81c to a9f5109
a9f5109 to 31b5f82
    });
    let v3_flushed = if let Some(v3_metrics) = maybe_v3_metrics {
        if v2_flushed || v3_metrics.len() >= v3_endpoint_config.max_metrics_per_payload() {
            encode_and_flush_v3_metrics(endpoint, &v3_endpoint_config, v3_metrics, &telemetry, &mut payloads_tx, batch_id.as_ref(), v3_payload_info).await?;
This doesn't seem to observe any intake payload size limits, or am I missing anything?
That's correct.
Right now, we flush either when the V2 encoder determines it needs to flush (so that the two encoders generate equivalent payloads in terms of the metrics they contain) or when we exceed the configured maximum metrics per payload limit.
In V2/V3 mode, I suppose it's entirely possible to have the V3 payload exceed the payload limits, although it would be incredibly unlikely. In V3 only mode, it's obviously a much more likely risk.
My thought process was that we would improve this -- make V3 encoding aware of the payload limits -- at the same time we added incremental compression to match the behavior of the Agent... since back when this was originally written many weeks ago, it seemed like we'd have enough time between then and "V3 only in production / for customers" to do the follow-up work.
I guess the question I have is: do you feel like we still have that sort of time before we want to be running V3 only?
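For readers following along, the flush rule described above can be sketched roughly as below. This is only an illustration, not the code in this PR: `V3Limits` and `should_flush_v3` are hypothetical names, and the real logic lives in the encoder's flush path.

```rust
/// Hypothetical stand-in for the per-endpoint V3 settings (not the real config type).
struct V3Limits {
    max_metrics_per_payload: usize,
}

/// Returns true when the buffered V3 metrics should be flushed.
///
/// Mirrors the two triggers described in the comment above: the V2 encoder
/// flushed, or the buffered metric count reached the configured cap.
/// The encoded byte size is not consulted at all, which is the gap
/// raised in this review thread.
fn should_flush_v3(v2_flushed: bool, buffered_metrics: usize, limits: &V3Limits) -> bool {
    v2_flushed || buffered_metrics >= limits.max_metrics_per_payload
}
```

Making V3 encoding aware of payload limits would presumably add a third condition based on encoded size, but what that check looks like is left open here.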
    Ok(encoded) => {
        match create_v3_request("/api/intake/metrics/v3/series", encoded, ep_config.compression_scheme()).await {
            Ok(request) => {
                flush_payload(request, events, payloads_tx, batch_id, 0, 1, payload_info).await?;
Is it intentional that batch_seq and batch_len are hard-coded as 0 and 1?
It is intentional, but only in the context of us not currently splitting V3 payloads: there can literally only ever be a single V3 payload in each batch.
(Mostly related to the question you left about not obeying intake payload size limits.)
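To illustrate why `0` and `1` fall out naturally when a batch holds a single payload, here is a minimal sketch of tagging payloads with their position in a batch. `BatchedPayload` and `tag_batch` are hypothetical names, and the zero-based sequence number is an assumption inferred from the hard-coded values, not something confirmed by the PR.

```rust
/// Hypothetical payload wrapper carrying its position within a batch.
#[derive(Debug, PartialEq)]
struct BatchedPayload {
    batch_seq: usize, // assumed zero-based index of this payload in the batch
    batch_len: usize, // total number of payloads in the batch
    body: Vec<u8>,
}

/// Tags each payload body with its sequence number and the batch length.
fn tag_batch(bodies: Vec<Vec<u8>>) -> Vec<BatchedPayload> {
    let batch_len = bodies.len();
    bodies
        .into_iter()
        .enumerate()
        .map(|(batch_seq, body)| BatchedPayload { batch_seq, batch_len, body })
        .collect()
}
```

With exactly one body in the batch, the only possible tag is `batch_seq = 0` and `batch_len = 1`. If V3 payloads were split, these values would have to be computed rather than hard-coded.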
That doesn't sound right: multiple v2 payloads can be constructed from even a single event buffer, so if they match one to one, there should be multiple v3 payloads too. Or are we using the word "batch" differently here?
There's probably some missing context/mismatched terminology here, yeah.
Every time we receive an "event buffer", we process all of the metrics in the event buffer, incrementally encoding them via the request builder. As we're doing that, we might reach the point where a payload is "full" (aka adding another metric would cause the final payload to exceed the (un)compressed size limits) and we have to flush it before encoding the next metric in the event buffer. That scenario is what I'm referring to here.
Unlike the Core Agent, we don't immediately emit partial payloads for the remainder of an event buffer: if we finish processing an event buffer and there are still metrics remaining in the request builder, we wait for a period of time for additional event buffers to come in, and eventually time out and flush that partial payload.
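A rough sketch of that flow, with the caveat that it is a simplified, synchronous stand-in: `RequestBuilder` here is a hypothetical type that tracks only an uncompressed byte limit, and the timeout is modeled as an explicit `flush_partial` call rather than a real timer.

```rust
/// Hypothetical request builder that accumulates encoded metrics up to a size limit.
struct RequestBuilder {
    max_payload_bytes: usize,
    buf: Vec<u8>,
}

impl RequestBuilder {
    fn new(max_payload_bytes: usize) -> Self {
        Self { max_payload_bytes, buf: Vec::new() }
    }

    /// Adds one encoded metric. If it would push a non-empty payload over the
    /// limit, the current payload is returned (flushed) first and the metric
    /// starts the next payload.
    fn push(&mut self, encoded: &[u8]) -> Option<Vec<u8>> {
        let mut flushed = None;
        if !self.buf.is_empty() && self.buf.len() + encoded.len() > self.max_payload_bytes {
            flushed = Some(std::mem::take(&mut self.buf));
        }
        self.buf.extend_from_slice(encoded);
        flushed
    }

    /// Flushes whatever is buffered; meant to be called when the idle timeout fires.
    fn flush_partial(&mut self) -> Option<Vec<u8>> {
        if self.buf.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.buf))
        }
    }
}

/// Processes one event buffer, returning only the payloads that filled up.
/// Leftover metrics stay in the builder until more arrive or the timeout fires.
fn process_event_buffer(builder: &mut RequestBuilder, metrics: &[&[u8]]) -> Vec<Vec<u8>> {
    metrics.iter().filter_map(|m| builder.push(m)).collect()
}
```

In this model a single event buffer can yield zero, one, or several full payloads, and the partial remainder is emitted separately once the wait expires.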
ca5ebc6 to 16afa2e