feat(correctness): fan out from one millstone process#1724

Open
webern wants to merge 6 commits into main from matt.briggs/millstone-fan-out

Conversation

@webern
Contributor

@webern webern commented May 22, 2026

Summary

Reduce payload timing divergence by running a single millstone process that fans out to both baseline and comparison in correctness tests.

Before this change, the millstone container ran sh -c '... & P1=$! ... & P2=$! ...' to fork two
independent millstone processes, one per target, which is one place where drift may be introduced.

In follow-up PRs I intend to extend the solution to include:

  • waiting for a ready signal from both consumers before starting
  • respecting a wall-clock-time-based boundary pause to avoid the bucket boundary

Key changes:

  • bin/correctness/millstone: Config accepts targets: as a named map (baseline: /
    comparison:) instead of a single target:; TargetSender fans out each generated payload to
    all configured sinks; driver and corpus collapsed to a single send loop; errors include the
    target name.
  • bin/correctness/panoramic: Docker (runner.rs) and k8s (k8s.rs) paths each spawn one
    millstone container/pod per test instead of two. New shared helpers in correctness/config.rs
    (resolve_group_placeholders, millstone_targets_all_sockets, millstone_first_network_port)
    substitute the $GROUP placeholder per-target on the host. The resolved YAML is written under
    the per-test log_dir (deliberately not mounts_dir, which is overlaid into the agent
    containers).
  • 19 test/correctness/*/millstone.yaml migrated mechanically to the targets: shape.
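
The fan-out behavior described for TargetSender can be illustrated with a simplified sketch. This is not the code in this PR: the `Sink` trait, `FanOut`, `SendError`, and `MemorySink` are hypothetical names, and the real sinks write to sockets rather than to memory. The sketch only aims to show the two properties named above, namely that every target receives the same payload bytes and that a failure reports the target's name.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical sink abstraction; real sinks would be socket or network writers.
pub trait Sink {
    fn send(&mut self, payload: &[u8]) -> Result<(), String>;
}

/// Error that carries the name of the target that failed.
#[derive(Debug, PartialEq)]
pub struct SendError {
    pub target: String,
    pub reason: String,
}

impl fmt::Display for SendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "target '{}': {}", self.target, self.reason)
    }
}

/// Holds every configured sink, keyed by target name
/// (for example `baseline` and `comparison`).
pub struct FanOut<S: Sink> {
    sinks: BTreeMap<String, S>,
}

impl<S: Sink> FanOut<S> {
    pub fn new(sinks: BTreeMap<String, S>) -> Self {
        Self { sinks }
    }

    /// Writes the same payload bytes to every sink, stopping at the first failure.
    pub fn send_all(&mut self, payload: &[u8]) -> Result<(), SendError> {
        for (name, sink) in self.sinks.iter_mut() {
            sink.send(payload).map_err(|reason| SendError {
                target: name.clone(),
                reason,
            })?;
        }
        Ok(())
    }

    /// Read-only access to one sink by target name.
    pub fn sink(&self, name: &str) -> Option<&S> {
        self.sinks.get(name)
    }
}

/// In-memory sink used only to demonstrate the fan-out.
#[derive(Default)]
pub struct MemorySink {
    pub received: Vec<Vec<u8>>,
    pub fail: bool,
}

impl Sink for MemorySink {
    fn send(&mut self, payload: &[u8]) -> Result<(), String> {
        if self.fail {
            return Err("connection refused".to_string());
        }
        self.received.push(payload.to_vec());
        Ok(())
    }
}
```

Because a single loop generates each payload once and hands the same slice to every sink, the targets cannot receive different bytes for a given payload. Delivery timing can still differ slightly between sinks, since the writes happen one after another.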

Change Type

  • Enhancement

How did you test this PR

Locally on macOS against rebuilt correctness-tools and datadog-agent images at this commit:

  • Full suite (-d test/integration/cases -d test/correctness), parallel (-p 4, the default): 51/52
    pass. Single failure: dsd-origin-detection-matrix/unified-high-cardinality, a window-edge
    divergence. This is the residual window-boundary failure class that subsequent PRs intend to target.
  • Full suite, sequential (-p 1): 52/52 pass. This is slow! Parallelization is a good idea if we can make it reliable.

Compare to the pre-fanout parallel baseline of 50/52 with all four dsd-origin-detection-matrix/*
variants plus dsd-service-checks failing under value-divergence.

No leaked airlock resources after either run. I did see leaked airlock resources on an aborted run, and that is something I also intend to work on downstream.

References

Not directly, but documenting my ongoing sensitivity to local flakes:

Millstone now holds multiple sinks per process and writes the same
payload bytes to all configured targets. Eliminates per-payload
divergence between Agent (baseline) and Agent+ADP (comparison) by
construction.

- millstone: `targets:` named map replaces single `target:`; in-process
  fan-out via TargetSender; fail-fast with named-target errors.
- panoramic: single millstone process per test for both Docker and k8s
  paths; shared helpers for $GROUP placeholder resolution and socket
  enumeration.
- 19 `test/correctness/*/millstone.yaml` migrated to `targets:` shape.
- 4 new millstone unit tests + 6 new panoramic unit tests.
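
The per-target `$GROUP` substitution mentioned above can be pictured with a short sketch. The signature of the PR's actual `resolve_group_placeholders` helper is not shown in this description, so the function below, `substitute_group`, is a hypothetical stand-in. It assumes each target has an address template and a group string known on the host.

```rust
use std::collections::BTreeMap;

/// Replaces every `$GROUP` occurrence in each target's address template with
/// that target's group value. Hypothetical sketch; the PR's real helper may
/// take different inputs and resolve the group differently.
pub fn substitute_group(
    templates: &BTreeMap<String, String>,
    groups: &BTreeMap<String, String>,
) -> Result<BTreeMap<String, String>, String> {
    let mut resolved = BTreeMap::new();
    for (target, template) in templates {
        // Fail with the target name so a missing group is easy to trace.
        let group = groups
            .get(target)
            .ok_or_else(|| format!("no group known for target '{}'", target))?;
        resolved.insert(target.clone(), template.replace("$GROUP", group));
    }
    Ok(resolved)
}
```

Resolving on the host means the single millstone process receives a config in which each target already has its own concrete address, so the process itself needs no knowledge of the placeholder.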
@webern webern requested a review from a team as a code owner May 22, 2026 15:47
@dd-octo-sts dd-octo-sts Bot added the area/test All things testing: unit/integration, correctness, SMP regression, etc. label May 22, 2026
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4c353efc6f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread bin/correctness/millstone/src/corpus.rs
Comment thread bin/correctness/panoramic/src/correctness/runner.rs Outdated
@pr-commenter

pr-commenter Bot commented May 22, 2026

Binary Size Analysis (Agent Data Plane)

Baseline: f907c91 · Comparison: 5e9ddc9 · diff
Analysis Configuration: stripped binaries · Pass/Fail Threshold: +5%
Sizes: 37.68 MiB (baseline) vs 37.68 MiB (comparison)
Size Change: +0 B (+0.00%)

✅ Binary size difference within threshold

Changes by Module
| Module | File Size | Symbols |
| --- | --- | --- |
| anon.e23c78aa09c99bb915937a91f6b5f237.1.llvm.5513365544103422328 | +129 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.1.llvm.6491998991054823396 | -129 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.4.llvm.5513365544103422328 | +114 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.4.llvm.6491998991054823396 | -114 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.3.llvm.5513365544103422328 | +108 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.3.llvm.6491998991054823396 | -108 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.0.llvm.5513365544103422328 | +96 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.0.llvm.6491998991054823396 | -96 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.2.llvm.5513365544103422328 | +94 B | 1 |
| anon.e23c78aa09c99bb915937a91f6b5f237.2.llvm.6491998991054823396 | -94 B | 1 |

Detailed Symbol Changes
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  [NEW]    +129  [NEW]     +40    anon.e23c78aa09c99bb915937a91f6b5f237.1.llvm.5513365544103422328
  [NEW]    +114  [NEW]     +25    anon.e23c78aa09c99bb915937a91f6b5f237.4.llvm.5513365544103422328
  [NEW]    +108  [NEW]     +19    anon.e23c78aa09c99bb915937a91f6b5f237.3.llvm.5513365544103422328
  [NEW]     +96  [NEW]      +7    anon.e23c78aa09c99bb915937a91f6b5f237.0.llvm.5513365544103422328
  [NEW]     +94  [NEW]      +5    anon.e23c78aa09c99bb915937a91f6b5f237.2.llvm.5513365544103422328
  [DEL]     -94  [DEL]      -5    anon.e23c78aa09c99bb915937a91f6b5f237.2.llvm.6491998991054823396
  [DEL]     -96  [DEL]      -7    anon.e23c78aa09c99bb915937a91f6b5f237.0.llvm.6491998991054823396
  [DEL]    -108  [DEL]     -19    anon.e23c78aa09c99bb915937a91f6b5f237.3.llvm.6491998991054823396
  [DEL]    -114  [DEL]     -25    anon.e23c78aa09c99bb915937a91f6b5f237.4.llvm.6491998991054823396
  [DEL]    -129  [DEL]     -40    anon.e23c78aa09c99bb915937a91f6b5f237.1.llvm.6491998991054823396
  [ = ]       0  [ = ]       0    TOTAL

@pr-commenter

pr-commenter Bot commented May 22, 2026

Regression Detector (Agent Data Plane)

Run ID: a565584f-e51b-42ea-835d-e64c78706513
Baseline: f907c912 · Comparison: 5e9ddc93 · diff

Optimization Goals: ✅ No significant changes detected

Fine details of change detection per experiment (35)

Experiments configured erratic: true are tagged (ignored) and skipped when determining which experiments regressed or improved. Experiments which are detected as erratic at runtime are tagged (erratic) to flag that the run's sample dispersion was high, but their regression / improvement signal still counts.

| experiment | goal | Δ mean % | links |
| --- | --- | --- | --- |
| otlp_ingest_metrics_5mb_memory | memory | ⚪ +1.38 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_cpu (erratic) | cpu | ⚪ +0.96 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_cpu (erratic) | cpu | ⚪ +0.86 | metrics profiles logs |
| otlp_ingest_traces_5mb_memory | memory | ⚪ +0.29 | metrics profiles logs |
| otlp_ingest_traces_5mb_cpu (erratic) | cpu | ⚪ +0.26 | metrics profiles logs |
| otlp_ingest_logs_5mb_memory (ignored) | memory | ⚪ +0.25 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_memory | memory | ⚪ +0.25 | metrics profiles logs |
| otlp_ingest_logs_5mb_cpu (ignored) | cpu | ⚪ +0.18 | metrics profiles logs |
| quality_gates_rss_dsd_heavy | memory | ⚪ +0.13 | metrics profiles logs |
| quality_gates_rss_dsd_medium | memory | ⚪ +0.11 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_throughput | throughput | ⚪ -0.08 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_memory | memory | ⚪ +0.06 | metrics profiles logs |
| otlp_ingest_traces_5mb_throughput | throughput | ⚪ -0.04 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_memory | memory | ⚪ +0.03 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_memory | memory | ⚪ +0.03 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_throughput | throughput | ⚪ -0.01 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_throughput | throughput | ⚪ -0.01 | metrics profiles logs |
| otlp_ingest_logs_5mb_throughput (ignored) | throughput | ⚪ -0.01 | metrics profiles logs |
| otlp_ingest_metrics_5mb_throughput | throughput | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_throughput | throughput | ⚪ -0.00 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_throughput | throughput | ⚪ +0.00 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_throughput | throughput | ⚪ +0.00 | metrics profiles logs |
| quality_gates_rss_dsd_ultraheavy | memory | ⚪ -0.01 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_throughput | throughput | ⚪ +0.01 | metrics profiles logs |
| dsd_uds_1mb_3k_contexts_memory | memory | ⚪ -0.02 | metrics profiles logs |
| otlp_ingest_traces_ottl_filtering_5mb_memory | memory | ⚪ -0.06 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_memory | memory | ⚪ -0.11 | metrics profiles logs |
| quality_gates_rss_dsd_low | memory | ⚪ -0.14 | metrics profiles logs |
| quality_gates_rss_idle | memory | ⚪ -0.15 | metrics profiles logs |
| dsd_uds_500mb_3k_contexts_cpu (erratic) | cpu | ⚪ -0.40 | metrics profiles logs |
| otlp_ingest_traces_ottl_transform_5mb_cpu (erratic) | cpu | ⚪ -0.61 | metrics profiles logs |
| dsd_uds_100mb_3k_contexts_cpu (erratic) | cpu | ⚪ -1.43 | metrics profiles logs |
| dsd_uds_512kb_3k_contexts_cpu (erratic) | cpu | ⚪ -2.34 | metrics profiles logs |
| otlp_ingest_metrics_5mb_cpu (erratic) | cpu | ⚪ -2.59 | metrics profiles logs |
| dsd_uds_10mb_3k_contexts_cpu (erratic) | cpu | ⚪ -4.10 | metrics profiles logs |

Bounds Checks: ✅ Passed (5)
| experiment | check | replicates | observed | links |
| --- | --- | --- | --- | --- |
| quality_gates_rss_dsd_heavy | memory_usage | 10/10 | ✅ 119 MiB ≤ 140 MiB | metrics profiles logs |
| quality_gates_rss_dsd_low | memory_usage | 10/10 | ✅ 40.1 MiB ≤ 50 MiB | metrics profiles logs |
| quality_gates_rss_dsd_medium | memory_usage | 10/10 | ✅ 60.2 MiB ≤ 75 MiB | metrics profiles logs |
| quality_gates_rss_dsd_ultraheavy | memory_usage | 10/10 | ✅ 179 MiB ≤ 200 MiB | metrics profiles logs |
| quality_gates_rss_idle | memory_usage | 10/10 | ✅ 26.8 MiB ≤ 40 MiB | metrics profiles logs |

Explanation

A change is flagged as a regression when |Δ mean %| > 5.00% in the regressing direction for its optimization goal AND SMP marks the experiment as a regression (is_regression: true). Improvements use the matching criteria for the improving direction. Experiments configured erratic: true (tagged (ignored)) are skipped outright; experiments detected as erratic at runtime (tagged (erratic)) still count, since that flag describes sample dispersion rather than directional certainty. The Δ mean % cell is colored accordingly: 🟢 = improvement, 🔴 = regression, ⚪ = neutral. Reduction in CPU or memory is an improvement; reduction in ingress throughput is a regression.
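
The decision rule in the explanation above can be restated as a small function. This is a sketch of the rule as worded here, not the detector's actual code: the type and function names are made up, and the `smp_is_improvement` flag is an assumed counterpart to `is_regression`, which is the only flag the explanation names.

```rust
/// Optimization goal of an experiment. More CPU or memory is worse;
/// less ingress throughput is worse.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Goal {
    Cpu,
    Memory,
    Throughput,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Regression,
    Improvement,
    Neutral,
    Ignored,
}

/// Threshold on |Δ mean %| taken from the explanation text.
pub const THRESHOLD_PCT: f64 = 5.0;

/// Classifies one experiment. `configured_erratic` corresponds to the
/// `(ignored)` tag; runtime-detected `(erratic)` experiments are classified
/// like any other, so they need no parameter here.
pub fn classify(
    goal: Goal,
    delta_mean_pct: f64,
    smp_is_regression: bool,
    smp_is_improvement: bool, // assumed flag name, not confirmed by the source
    configured_erratic: bool,
) -> Verdict {
    if configured_erratic {
        return Verdict::Ignored;
    }
    if delta_mean_pct.abs() <= THRESHOLD_PCT {
        return Verdict::Neutral;
    }
    // Is the change in the direction that makes this goal worse?
    let worse = match goal {
        Goal::Cpu | Goal::Memory => delta_mean_pct > 0.0,
        Goal::Throughput => delta_mean_pct < 0.0,
    };
    if worse && smp_is_regression {
        Verdict::Regression
    } else if !worse && smp_is_improvement {
        Verdict::Improvement
    } else {
        Verdict::Neutral
    }
}
```

Under this rule, the largest movement in the table, dsd_uds_10mb_3k_contexts_cpu at -4.10, stays neutral because its magnitude is below the 5.00% threshold.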

Comment thread bin/correctness/millstone/src/config.rs Outdated
Comment thread bin/correctness/panoramic/src/correctness/k8s.rs Outdated
Comment thread bin/correctness/panoramic/src/correctness/k8s.rs Outdated
Comment thread bin/correctness/panoramic/src/correctness/k8s.rs Outdated
Comment thread bin/correctness/panoramic/src/correctness/k8s.rs
Comment thread bin/correctness/panoramic/src/correctness/k8s.rs
Comment thread bin/correctness/panoramic/src/correctness/config.rs
Comment thread bin/correctness/millstone/src/driver.rs
Comment thread bin/correctness/millstone/src/corpus.rs
@webern webern marked this pull request as draft May 22, 2026 17:40
@webern webern changed the title from "feat(correctness): fan out millstone to multiple sinks" to "feat(correctness): fan out from one millstone process" May 22, 2026
@webern webern marked this pull request as ready for review May 22, 2026 18:28

Labels

area/test All things testing: unit/integration, correctness, SMP regression, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant