feat(openfeature): emit server-side EVP flagevaluation by leoromanovsky · Pull Request #3984 · DataDog/dd-trace-php

leoromanovsky · 2026-06-14T14:02:30Z

Motivation

Customers need PHP services to report server-side feature-flag evaluation counts through the same backend contract as the other SDKs, including both the PHP 7 extension path and PHP 8 OpenFeature path. This contribution adds native PHP EVP flagevaluation delivery through the shared libdatadog sidecar path while preserving existing OTel metric and exposure behavior, giving APM a backend-verifiable rollout signal for approval.

Justification

Small applications are unlikely to reach the 5 MiB limit, but it is reachable at the scale this rollout is designed to support. Compact JSON estimates against a 5,242,880-byte body limit: a minimal degraded row is about 137 bytes, so about 37,991 rows fit; a small full row is about 277 bytes, so about 18,859 rows fit; a normal full row with 10 context attributes is about 588 bytes, so about 8,901 rows fit; a maximum bounded-context row with 256 fields of 256 characters each is about 68,341 bytes, so only about 76 rows fit.

The target scale is 2,500 flags x 50 full-fidelity buckets, or 125,000 rows. Even small full rows of about 277 bytes add up to roughly 33 MiB (34,625,000 bytes) at that scale, so byte splitting is a real compliance requirement rather than only defensive hardening. Existing backpressure bounds the queue and aggregate cardinality, but it does not bound the final encoded POST body. Async posting avoids blocking customer evaluations, but a single oversized request can still be rejected with 413.
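The row counts above can be reproduced with integer arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes each row is charged one extra byte for its JSON array separator while the envelope is ignored, which is the assumption that happens to reproduce the quoted figures.

```rust
/// EVP body limit quoted in the justification: 5 MiB.
const BODY_LIMIT_BYTES: u64 = 5 * 1024 * 1024; // 5,242,880

/// Rows of `row_bytes` compact-JSON bytes that fit in one body.
/// Assumption: one extra byte per row for the separating comma.
fn rows_that_fit(row_bytes: u64) -> u64 {
    BODY_LIMIT_BYTES / (row_bytes + 1)
}

/// Minimum number of POST bodies needed for `rows` rows of `row_bytes` each.
/// Returns None when a single row cannot fit in a body at all.
fn bodies_needed(rows: u64, row_bytes: u64) -> Option<u64> {
    let per_body = rows_that_fit(row_bytes);
    if per_body == 0 {
        return None;
    }
    Some((rows + per_body - 1) / per_body) // ceiling division
}
```

Under these assumptions, `bodies_needed(2_500 * 50, 277)` comes to `Some(7)`: 125,000 small full rows of 277 bytes are 34,625,000 bytes before separators, so they cannot travel in fewer than seven bodies.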

Changes

Adds native PHP EVP flagevaluation aggregation using backend-visible dimensions: flag key, variant key, allocation key, runtime-default state, error message, real targeting-rule key when present, targeting key, and bounded context.
Prunes evaluation context before it participates in aggregation keys or queued snapshots.
Keeps OpenFeature reason out of EVP payloads and aggregation keys.
Preserves visible variant/allocation/error dimensions when overflow folds into degraded events.
Flushes the aggregated flagevaluation batch before exposure and OTel metric sidecar actions at request shutdown.
Splits native-to-sidecar flagevaluation batches into bounded 512-event IPC chunks.
Bumps libdatadog to the companion flagevaluation sidecar delivery head.
Consumes the libdatadog sidecar follow-up that splits final EVP POST bodies by encoded uncompressed JSON bytes under the 5 MiB limit, degrades oversized full rows, and drops/logs only if a degraded row still cannot fit.
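The count-based IPC chunking in the list above can be pictured with the standard slice API. This is a minimal sketch, not the extension's code: `ipc_chunks` and `IPC_CHUNK_EVENTS` are hypothetical names, and only the 512-event bound comes from this PR.

```rust
/// Maximum events per native-to-sidecar IPC message, as stated in the change list.
const IPC_CHUNK_EVENTS: usize = 512;

/// Splits a drained batch into consecutive chunks of at most 512 events.
/// The last chunk may be shorter; an empty batch yields no chunks.
fn ipc_chunks<T>(events: &[T]) -> Vec<&[T]> {
    events.chunks(IPC_CHUNK_EVENTS).collect()
}
```

A batch of 1,300 events would go out as three chunks of 512, 512, and 276 events. This bounds the event count per IPC message only; it says nothing about encoded bytes, which is why the byte limit is enforced separately in the sidecar.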

Decisions

EVP cardinality is defined only by fields the worker receives; reason is not a hidden aggregate key.
Degraded events omit targeting key and context but keep the visible dimensions needed for backend counts.
PHP keeps the shared sidecar delivery architecture instead of adding a direct EVP HTTP writer to the extension.
System-test validation covers the production per-flag degradation threshold and exercises overflow past the full-fidelity cap.
Payload splitting happens in the libdatadog sidecar flagevaluation flusher where the final uncompressed EVP POST body exists; PHP keeps count-based IPC chunking as a secondary guard, not the EVP byte-limit mechanism.
This PR remains draft until the cross-SDK rollout and review pass are complete.
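The first two decisions can be illustrated with a pair of key types. This is a sketch under stated assumptions: the struct and field names are hypothetical, the context is assumed to be carried as an already-serialized string, and keeping the targeting-rule key in the degraded key is a literal reading of "omit targeting key and context" rather than confirmed behavior.

```rust
use std::collections::HashMap;

/// Backend-visible dimensions of a full-fidelity bucket (hypothetical names).
/// The OpenFeature reason is deliberately absent, so it cannot split buckets.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct FullKey {
    flag_key: String,
    variant_key: Option<String>,
    allocation_key: Option<String>,
    runtime_default: bool,
    error_message: Option<String>,
    targeting_rule_key: Option<String>,
    targeting_key: Option<String>,
    /// Pruned context, assumed to be serialized so it can be hashed.
    context: Option<String>,
}

/// Degraded bucket: the same dimensions minus targeting key and context.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct DegradedKey {
    flag_key: String,
    variant_key: Option<String>,
    allocation_key: Option<String>,
    runtime_default: bool,
    error_message: Option<String>,
    targeting_rule_key: Option<String>,
}

impl FullKey {
    /// Projects a full key onto the dimensions a degraded event keeps.
    fn degrade(&self) -> DegradedKey {
        DegradedKey {
            flag_key: self.flag_key.clone(),
            variant_key: self.variant_key.clone(),
            allocation_key: self.allocation_key.clone(),
            runtime_default: self.runtime_default,
            error_message: self.error_message.clone(),
            targeting_rule_key: self.targeting_rule_key.clone(),
        }
    }
}

/// Counts evaluations per full-fidelity bucket.
fn count_full(evals: &[FullKey]) -> HashMap<FullKey, u64> {
    let mut counts = HashMap::new();
    for key in evals {
        *counts.entry(key.clone()).or_insert(0) += 1;
    }
    counts
}

/// Folds full buckets into degraded buckets, summing their counts.
fn fold_degraded(full: &HashMap<FullKey, u64>) -> HashMap<DegradedKey, u64> {
    let mut out = HashMap::new();
    for (key, count) in full {
        *out.entry(key.degrade()).or_insert(0) += *count;
    }
    out
}
```

Two evaluations that differ only in targeting key land in separate full buckets but collapse into one degraded bucket whose count is their sum, so backend totals per flag, variant, and allocation are preserved.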

flowchart TD
  A[PHP extension flushes aggregated rows] --> B[send bounded IPC chunks to sidecar]
  B --> C[sidecar coalesces and serializes candidate JSON]
  C --> D{POST body <= 5 MiB?}
  D -- yes --> E[post asynchronously through Agent EVP proxy]
  D -- no --> F{single full row can degrade?}
  F -- yes --> G[omit targeting_key and context]
  G --> C
  F -- no --> H[drop, log, count]
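The split, degrade, and drop path in the flowchart can be sketched as a greedy packer. This is a simplification under stated assumptions, not the libdatadog flusher: `Row`, `SplitResult`, and `split_bodies` are hypothetical, each row is assumed to arrive with both encodings precomputed, and the body is modeled as a bare JSON array without the real EVP envelope.

```rust
/// One row with both encodings precomputed (hypothetical type).
struct Row {
    full: String,     // compact JSON including targeting_key and context
    degraded: String, // the same row without targeting_key and context
}

struct SplitResult {
    bodies: Vec<String>, // each body is a JSON array no longer than the limit
    degraded_rows: usize,
    dropped_rows: usize,
}

/// Packs rows into bodies of at most `limit` bytes.
fn split_bodies(rows: &[Row], limit: usize) -> SplitResult {
    let mut out = SplitResult { bodies: Vec::new(), degraded_rows: 0, dropped_rows: 0 };
    let mut current = String::from("[");
    for row in rows {
        // "[" + row + "]" is the smallest body that can carry this row.
        let encoded = if row.full.len() + 2 <= limit {
            row.full.as_str()
        } else if row.degraded.len() + 2 <= limit {
            out.degraded_rows += 1;
            row.degraded.as_str()
        } else {
            out.dropped_rows += 1; // drop, log, count
            continue;
        };
        // Bytes needed: optional comma, the row, and the closing bracket.
        let sep = if current.len() > 1 { 1 } else { 0 };
        if current.len() + sep + encoded.len() + 1 > limit {
            current.push(']');
            out.bodies.push(std::mem::replace(&mut current, String::from("[")));
        }
        if current.len() > 1 {
            current.push(',');
        }
        current.push_str(encoded);
    }
    if current.len() > 1 {
        current.push(']');
        out.bodies.push(current);
    }
    out
}
```

Because a row is admitted only if it fits in a body by itself, a flush always leaves room for the row that triggered it, and an empty body is never emitted.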

Validation Evidence

Payload Limit Follow-Up

PHP commit: 37176c9b4
libdatadog submodule pointer now references 46734bc1e, which contains the sidecar EVP byte-splitting/degrade/drop fix.
No PHP source changed in this follow-up; the behavior change is consumed through the libdatadog submodule pointer.
Companion libdatadog validation in DataDog/libdatadog#2117 (feat(datadog-ffe): server-side EVP flagevaluation payload + bincode-safe sidecar delivery). The following commands passed: cargo nextest run -p datadog-sidecar ffe_flagevaluation_flusher; cargo check -p datadog-sidecar; cargo +nightly-2026-02-08 fmt --all -- --check; cargo +stable clippy -p datadog-sidecar --all-targets --all-features -- -D warnings.

Dogfooding App

ffe-dogfooding app-php7 and app-php8-openfeature were run with local PHP artifacts and the companion libdatadog sidecar changes.
Evaluated ffe-dogfooding-string-flag through PHP 7 with public-safe targeting keys:
- php7-evp-agent-20260622T2252-alpha
- php7-evp-agent-20260622T2252-bravo
- php7-evp-agent-20260622T2252-charlie
Evaluated ffe-dogfooding-string-flag through PHP 8 OpenFeature with public-safe targeting keys:
- php8of-evp-agent-20260622T2255-alpha
- php8of-evp-agent-20260622T2255-bravo
- php8of-evp-agent-20260622T2255-charlie
App-side result: all six evaluations returned variant_2.

System Tests

Companion draft PR: DataDog/system-tests#7187 (Enable EVP flagevaluation system tests for PHP)

Staging End-To-End

Dogfooding ran without the local mock-intake EVP tee/proxy, so the Agent sent EVP traffic through the normal backend path.
Retriever staging query returned 6 flagevaluation rows for the exact PHP 7 and PHP 8 OpenFeature targeting keys above.
Each row had flag.key=ffe-dogfooding-string-flag, variant.key=variant_2, allocation.key=allocation-override-392dd7c149f8, and evaluation_count=1.

…ith PREP-01 libdatadog

- Enable 'flagevaluation-evp' feature on datadog-ffe dep (FfeFlagEvaluationBatch type now compiled)
- Fix components-rs/bytes.rs: update 4x VecMap::remove() -> remove_slow() for libdatadog compat post-commit 74284cac7 (VecMap API renamed); this unblocks compilation against the PREP-01 libdatadog ref

…patch

- Two-tier aggregation in components-rs/ffe.rs: full→degraded→drop-counted with caps GLOBAL_CAP=131072/PER_FLAG_CAP=10000/DEGRADED_CAP=32768
- Killswitch DD_FLAGGING_EVALUATION_COUNTS_ENABLED (default: on) via evp_enabled() in Rust and isEvpEnabled() in EvaluationMetricRecorder.php
- ddog_ffe_flush_flag_evaluation_batch() Rust C-export dispatches SidecarAction::FfeFlagEvaluationBatch via sidecar_blocking::enqueue_actions
- ddtrace_ffe_flush_flag_evaluation_batch() C wrapper in tracer/ffe.c mirrors existing exposure/metric flush pattern with sidecar globals
- RSHUTDOWN call added in tracer/ddtrace.c after existing flush calls
- 11 Rust unit tests covering both tiers, overflow, drain, killswitch
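The full, degraded, drop-counted progression can be sketched as follows. The cap semantics here are an assumption (a global and a per-flag limit on distinct full buckets, and a limit on distinct degraded buckets); the types are hypothetical, keys are reduced to plain strings, and caps are passed in rather than hard-coded because a later commit in this PR adjusts them.

```rust
use std::collections::HashMap;

/// Bucket limits (assumed semantics, see above).
struct Caps {
    global: usize,   // distinct full buckets across all flags
    per_flag: usize, // distinct full buckets for one flag
    degraded: usize, // distinct degraded buckets
}

#[derive(Default)]
struct Aggregator {
    full: HashMap<(String, String), u64>,     // (flag, full key) -> count
    full_per_flag: HashMap<String, usize>,    // flag -> number of full buckets
    degraded: HashMap<(String, String), u64>, // (flag, visible dims) -> count
    dropped_degraded_overflow: u64,
}

impl Aggregator {
    fn record(&mut self, caps: &Caps, flag: &str, full_key: &str, degraded_key: &str) {
        // Tier 1: an existing full bucket always absorbs the evaluation.
        let fk = (flag.to_string(), full_key.to_string());
        if let Some(count) = self.full.get_mut(&fk) {
            *count += 1;
            return;
        }
        // A new full bucket is created only while both caps have room.
        let flag_buckets = self.full_per_flag.get(flag).copied().unwrap_or(0);
        if self.full.len() < caps.global && flag_buckets < caps.per_flag {
            self.full.insert(fk, 1);
            *self.full_per_flag.entry(flag.to_string()).or_insert(0) += 1;
            return;
        }
        // Tier 2: fold into a degraded bucket keyed by visible dimensions.
        let dk = (flag.to_string(), degraded_key.to_string());
        if let Some(count) = self.degraded.get_mut(&dk) {
            *count += 1;
            return;
        }
        if self.degraded.len() < caps.degraded {
            self.degraded.insert(dk, 1);
        } else {
            // Tier 3: nothing left to fold into, so count the drop.
            self.dropped_degraded_overflow += 1;
        }
    }
}
```

With `per_flag` set to 2, a third distinct targeting key for the same flag is folded into the degraded tier, while repeat evaluations of the first two keys keep incrementing their full buckets.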

…EVP aggregator race

ddog_ffe_evaluate() records into the global EVP_AGGREGATOR; without EVP_TEST_LOCK the test ran concurrently with the degraded_tier_overflow tests, causing dropped_degraded_overflow to be 2 instead of 1.
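The race described here is the usual hazard of tests sharing process-global state, and a test-only mutex is one common remedy. The sketch below shows the pattern in isolation; `DROPPED`, `lock_evp_state`, and `overflow_scenario` are hypothetical stand-ins, and only the name EVP_TEST_LOCK comes from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, MutexGuard};

/// Stand-in for process-global aggregator state (hypothetical).
static DROPPED: AtomicU64 = AtomicU64::new(0);

/// Tests that touch the global state hold this lock for their whole body.
static EVP_TEST_LOCK: Mutex<()> = Mutex::new(());

fn lock_evp_state() -> MutexGuard<'static, ()> {
    // A panicking test poisons the mutex; recover the guard so later tests still run.
    EVP_TEST_LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Body of a test that expects exactly one drop.
fn overflow_scenario() -> u64 {
    let _guard = lock_evp_state();
    DROPPED.store(0, Ordering::SeqCst); // reset shared state under the lock
    DROPPED.fetch_add(1, Ordering::SeqCst); // the one expected drop
    DROPPED.load(Ordering::SeqCst)
}
```

Without the guard, two scenarios running in parallel could interleave their reset and increment steps and observe each other's drops, which matches the "2 instead of 1" symptom.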

… + regen Cargo.lock

Points dd-trace-php's libdatadog submodule at the local PREP-01 commit containing the flagevaluation EVP emitter (FfeFlagEvaluationBatch), so components-rs builds against it via the datadog-ffe path dep with the flagevaluation-evp feature. NOTE: 89a2ba7fc is local/unpushed; re-point to the merged upstream libdatadog SHA before any PR.

The Rust C-export ddog_ffe_flush_flag_evaluation_batch (components-rs/ffe.rs) was added without a matching prototype in the committed cbindgen header components-rs/datadog.h. tracer/ffe.c calls it, so PHP8's stricter toolchain fails with -Werror=implicit-function-declaration (ddtrace.so link Error 2). PHP7 only warned and linked, masking the bug. Prototype matches the Rust signature (SidecarTransport**/InstanceId*/QueueId*/CharSlice x3).

…ow drops

The full-tier EVP flagevaluation drain previously emitted context: None and drained the degraded-overflow drop count silently.

- Full tier now carries the pruned evaluation context (shared prune_context bounds: <=256 fields, string values >256 bytes skipped) plus context.dd.service, matching the degraded tier's cap enforcement. The pruned context is captured once per bucket at insertion and carried verbatim into the drained event.
- The degraded-tier overflow drop counter is read-and-reset at drain and logged via tracing::warn when non-zero, so an undersized degradedCap is observable instead of a silent loss of legitimate counts.
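The context bounds named in this commit message (at most 256 fields, string values over 256 bytes skipped) can be sketched as a filter. This is not the shared prune_context implementation: the `Value` enum is hypothetical, and it is an assumption that skipped fields do not count toward the field limit and that the first 256 surviving fields in iteration order are the ones kept.

```rust
const MAX_CONTEXT_FIELDS: usize = 256;
const MAX_STRING_BYTES: usize = 256;

/// Minimal attribute value type for the sketch (hypothetical).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
}

/// Drops oversized string values, then keeps at most 256 fields.
fn prune_context(context: &[(String, Value)]) -> Vec<(String, Value)> {
    context
        .iter()
        .filter(|(_, value)| match value {
            // String::len is a byte length, matching the byte-based bound.
            Value::Str(s) => s.len() <= MAX_STRING_BYTES,
            _ => true,
        })
        .take(MAX_CONTEXT_FIELDS)
        .cloned()
        .collect()
}
```

Pruning before the context participates in a bucket key keeps the size of both the key and the drained event bounded.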

…low surfacing

- ddog_ffe_evaluate_populates_evp_aggregator_for_flush / _respects_killswitch: drive the real FFI entry point ddog_ffe_evaluate (the function the PHP/C layer calls) and assert it feeds the aggregator that the sidecar flush drains, closing the 'unit-green but emits nothing' gap that earlier tests left uncovered.
- full_tier_event_carries_pruned_context / _prunes_oversized_string_values / _empty_context_emits_no_context_object: assert the full tier carries the pruned context and enforces the field/value bounds.
- drain_resets_degraded_overflow_drop_counter: assert drain reads-and-resets the observable overflow drop counter.
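A read-and-reset counter of the kind the last test exercises is commonly built on an atomic swap. The sketch below is an assumption about the shape, not the code in components-rs/ffe.rs: `DropCounter` is hypothetical and `eprintln!` stands in for the tracing::warn call.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counts rows dropped because the degraded tier was full (hypothetical type).
struct DropCounter(AtomicU64);

impl DropCounter {
    const fn new() -> Self {
        DropCounter(AtomicU64::new(0))
    }

    fn record_drop(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns the drops since the previous drain and resets the counter
    /// to zero in a single atomic step, so no increment is lost or counted twice.
    fn drain(&self) -> u64 {
        self.0.swap(0, Ordering::Relaxed)
    }

    /// Drains and reports only when something was dropped.
    fn drain_and_report(&self) -> u64 {
        let dropped = self.drain();
        if dropped > 0 {
            eprintln!("flagevaluation: dropped {} rows past the degraded cap", dropped);
        }
        dropped
    }
}
```

Using `swap` rather than a separate load and store means a drop recorded between the two steps cannot vanish.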

…ncode-safe wire + reliable enqueue)

Bump the libdatadog submodule to the bincode-safe flagevaluation fix (DataDog/libdatadog#2117): the worker->sidecar IPC is bincode, which the old serde_json::Value + skip_serializing_if wire types could not deserialize, so the sidecar silently dropped every batch.

- Stringify the pruned full-tier context (JSON object string) at drain so the bincode wire stays plain; the sidecar flusher re-expands it into a JSON object for the POST.
- Use sidecar_blocking::enqueue_actions_reliable for the one-shot RSHUTDOWN flush.

datadog-official · 2026-06-14T14:10:43Z

Tests

⚠️ Warnings

🚦 16 Pipeline jobs failed

DataDog/apm-reliability/dd-trace-php | ASAN test_c with multiple observers: [8.3]

DataDog/apm-reliability/dd-trace-php | check-big-regressions

DataDog/apm-reliability/dd-trace-php | test_extension_ci: [7.0]

⌛ 1 Test performance regression detected

tmp/build_extension/tests/ext/priority_sampling/manual_global_override.phpt (Global explicitly set priority sampling must be respected) from PHP.tmp.build_extension.tests.ext.priority_sampling — 4.42s (+3.78s, +582%)

ℹ️ Info

No other issues found

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
• Patch Coverage: 100.00%
• Overall Coverage: 54.08% (-0.04%)

🔗 Commit SHA: 37176c9


pr-commenter · 2026-06-16T23:39:36Z

Benchmarks [ tracer ]

Benchmark execution time: 2026-06-23 17:42:30

Comparing candidate commit 37176c9 in PR branch leo.romanovsky/ffl-2446-evp-flagevaluation-php with baseline commit 8f132ce in branch master.

Some scenarios are present only in baseline or only in candidate runs. If you didn't create or remove some scenarios in your branch, this may be a sign of crashed benchmarks 💥💥💥
Check the GitLab CI job log to see whether any benchmark crashed.

Scenarios present only in candidate:

FlagEvaluationBench/benchEvaluateDistinctContexts-opcache
FlagEvaluationBench/benchEvaluateTargetingMatch
FlagEvaluationBench/benchEvaluateDistinctContexts
FlagEvaluationBench/benchEvaluateWithoutCounting
FlagEvaluationBench/benchEvaluateTargetingMatch-opcache
FlagEvaluationBench/benchEvaluateWithoutCounting-opcache
FlagEvaluationBench/benchEvaluateSplit
FlagEvaluationBench/benchEvaluateSplit-opcache

Found 3 performance improvements and 1 performance regression! Performance is the same for 190 metrics, 0 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

🟩 = significantly better candidate vs. baseline
🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
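The significance rule described above reduces to an interval comparison. The sketch below is one reading of it, not the benchmarking platform's code: `classify` and `Verdict` are hypothetical, the threshold is assumed to be a symmetric band around 0%, and the 1% value is taken from the diagram rather than from any stated configuration.

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Unchanged,
    SignificantIncrease,
    SignificantDecrease,
}

/// Classifies a confidence interval over the relative difference of means.
/// All arguments are percentages; `threshold` is the assumed symmetric band.
fn classify(ci_lower: f64, ci_upper: f64, threshold: f64) -> Verdict {
    if ci_lower > threshold {
        // Whole interval lies above +threshold.
        Verdict::SignificantIncrease
    } else if ci_upper < -threshold {
        // Whole interval lies below -threshold.
        Verdict::SignificantDecrease
    } else {
        Verdict::Unchanged
    }
}
```

With a 1% threshold, the second diagram's interval `classify(1.3, 3.1, 1.0)` is a significant increase, which for an execution-time metric means worse performance, while the first diagram's interval `classify(-0.6, 1.2, 1.0)` straddles 0% and is not significant.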

scenario:MessagePackSerializationBench/benchMessagePackSerialization

🟩 execution_time [-5.975µs; -5.465µs] or [-5.290%; -4.838%]

scenario:MessagePackSerializationBench/benchMessagePackSerialization-opcache

🟩 execution_time [-4.236µs; -2.744µs] or [-3.794%; -2.458%]

scenario:SpanBench/benchOpenTelemetryAPI-opcache

🟩 mem_peak [-4.303MB; -1.088MB] or [-9.720%; -2.457%]

scenario:SpanBench/benchOpenTelemetryInteroperability

🟥 execution_time [+219.319µs; +223.766µs] or [+113.054%; +115.347%]

leoromanovsky added 10 commits June 12, 2026 15:47

chore(openfeature): remove internal planning annotations

862b74d

chore(openfeature): wire flagevaluation benchmark into benchmark suite

63fdebe

leoromanovsky added 6 commits June 15, 2026 10:45

build(openfeature): bump libdatadog submodule to clippy/rustfmt-clean FFE fix

ae6ee4c

fix(openfeature): align PHP EVP flagevaluation aggregation with worker schema

8b949b7

build(openfeature): consume latest libdatadog flagevaluation fix

fe9535a

test(ffe): skip fixture sweep when submodule is absent

bb04578

build(openfeature): consume merged libdatadog flagevaluation fix

422663c

Merge remote-tracking branch 'origin/master' into leo.romanovsky/ffl-2446-evp-flagevaluation-php

2cf4267

# Conflicts: # libdatadog

leoromanovsky added 4 commits June 16, 2026 20:39

update libdatadog flagevaluation coalescing

17d12c5

fix(openfeature): stabilize PHP flagevaluation delivery

696c397

Merge remote-tracking branch 'origin/master' into leo.romanovsky/ffl-2446-evp-flagevaluation-php

aad84d7

# Conflicts: # libdatadog

Fix PHP flagevaluation EVP aggregation caps

9093fb9

This was referenced Jun 20, 2026

Enable EVP flagevaluation system tests for PHP DataDog/system-tests#7187

Draft

feat(datadog-ffe): server-side EVP flagevaluation payload + bincode-safe sidecar delivery DataDog/libdatadog#2117

Open

leoromanovsky added 2 commits June 22, 2026 19:03

Align flagevaluation EVP timestamp

60a0ab3

chore(ffi): bump libdatadog flagevaluation evp limit

37176c9

leoromanovsky wants to merge 22 commits into master from leo.romanovsky/ffl-2446-evp-flagevaluation-php
