spec-sheet: add new envd-object-scalability and cluster-object-limits scenarios #36540

Draft
aljoscha wants to merge 7 commits into MaterializeInc:main from aljoscha:envd-specsheet

Conversation

@aljoscha
Contributor

Fixes SQL-222

aljoscha and others added 7 commits May 13, 2026 11:30
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.

`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.
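The sharding scheme described above amounts to a simple integer division; a minimal sketch (the helper and cluster-naming scheme are illustrative, not the test's actual code):

```python
# Hypothetical sketch of the MV-sharding rule: 10_000 MVs per
# single-replica pad cluster, so 100k MVs span 10 clusters.
MVS_PER_CLUSTER = 10_000

def pad_cluster_for(mv_index: int) -> str:
    """Return the (illustrative) pad-cluster name hosting the given MV."""
    return f"pad_cluster_{mv_index // MVS_PER_CLUSTER}"
```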

For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.
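The incremental ramp-up can be sketched as follows (`build_increments` is an illustrative name, not the test's actual helper):

```python
# Size points from the commit message (a later commit in this PR drops
# 100_000 from the default list; it stays overridable via the CLI flag).
SIZES = [1, 10, 100, 1_000, 3_000, 5_000, 10_000, 20_000, 30_000, 50_000, 100_000]

def build_increments(sizes):
    """Yield (target_n, objects_to_add): the catalog is grown by the delta
    between consecutive size points rather than rebuilt from scratch,
    avoiding the O(sizes * N) cost of recreating every object per point."""
    current = 0
    for n in sizes:
        yield n, n - current
        current = n
```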

Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).

This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two new scenarios -- cluster_object_limits_indexes and
cluster_object_limits_mvs -- that find, per cluster size, the maximum
number of idle materializations one cluster can keep fresh.

The materializations are derived from a one-row, never-updated base
table so the only work the cluster has to do is keep advancing each
materialization's write_frontier in step with the upstream table. Once
the cluster can't keep up, freshness collapses; the driver records the
largest N at which `max(local_lag) < 2s` was still achievable, with the
unhealthy data point recorded too so the cliff is visible.
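The search criterion can be sketched as a pure function over measured lags (names are illustrative, not the driver's actual API):

```python
from datetime import timedelta

HEALTHY_LAG = timedelta(seconds=2)  # freshness threshold from the scenario

def is_fresh(max_local_lag):
    """True if the worst materialization lag is still under the threshold."""
    return max_local_lag is not None and max_local_lag < HEALTHY_LAG

def largest_healthy_n(measurements):
    """Given [(n, max_local_lag), ...] in increasing n, return the largest n
    that was still healthy; the first unhealthy point marks the cliff."""
    best = None
    for n, lag in measurements:
        if is_fresh(lag):
            best = n
        else:
            break  # the unhealthy point is still recorded, so the cliff is visible
    return best
```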

Staging-only (rejects --target=cloud-production), to avoid burning
production resources on long object-limit searches.

…lability default at 50k

When a materialization stalls completely (write_frontier never advances
past the minimum timestamp), `mz_internal.mz_materialization_lag` reports
`now() - 0` = current unix time in ms (~1.78e12). Recorded as-is this
crushes every healthy data point to ~0 on the plot. Cap the recorded
value at 10x the healthy threshold (= 20 s), preserve the underlying
truth via the `healthy` column, and label the plot to make the cap and
healthy threshold explicit.
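A minimal sketch of the capping rule, using the thresholds from the message (`record_lag` is an illustrative name):

```python
HEALTHY_THRESHOLD_MS = 2_000          # max(local_lag) < 2 s counts as healthy
LAG_CAP_MS = 10 * HEALTHY_THRESHOLD_MS  # recorded values capped at 20 s

def record_lag(raw_lag_ms):
    """Cap a stalled materialization's astronomical lag (~1.78e12 ms when the
    write_frontier never leaves the minimum timestamp) so healthy data points
    stay visible on the plot; the `healthy` flag preserves the real outcome."""
    healthy = raw_lag_ms < HEALTHY_THRESHOLD_MS
    return min(raw_lag_ms, LAG_CAP_MS), healthy
```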

Also drop 100_000 from the envd_scalability default size list: 50_000 is
a more sensible default ceiling for staging. The full size list is still
overridable via --envd-scalability-sizes for ad-hoc runs.

…tion

The release-qualification pipeline already runs three cluster-spec-sheet
groups (cluster_compute on production, source_ingestion on production,
environmentd on staging). Add two more groups -- envd_scalability and
cluster_object_limits -- both running against staging, since both push
the catalog / cluster to limits we don't want to exercise on production.

The three "envd / cluster" groups in the cluster-spec-sheet were named
inconsistently. Settle on the three concept names the cluster-spec-sheet
effort uses verbally:

  environmentd          -> envd_qps_scalability     (QPS vs envd CPU)
  envd_scalability      -> envd_objects_scalability (latency vs catalog N)
  cluster_object_limits -> cluster_object_limits    (unchanged)

Renames apply to: scenario constants, scenario-name string values, group
keys in SCENARIO_GROUPS, class names, the run/analyze function names,
the --envd-scalability-sizes CLI flag, the result CSV suffix, and the
`mode` field written into CSV rows. The pre-existing QPS scenarios keep
their individual `*_envd_strong_scaling` names since only the group is
renamed.

Also updates the release-qualification pipeline step ids/args and the
README to match.

…w start

When debugging cluster-spec-sheet runs on staging it's hard to tell which
environment we're actually talking to and whether the system parameter
defaults we expect (lifted via LaunchDarkly or similar) are actually
applied. Add a one-shot diagnostic right after target.initialize() that
prints mz_environment_id() and SHOWs the limits the test depends on
(max_tables, max_materialized_views, max_objects_per_schema, max_clusters,
max_credit_consumption_rate, memory_limiter_interval).

Best-effort: any probe error is logged and swallowed so a transient
failure does not abort the workflow.
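The probe's shape can be sketched as follows (assuming a DB-API style cursor; the function and list names are illustrative):

```python
# Limits named in the commit message that the test depends on.
LIMIT_PARAMS = [
    "max_tables", "max_materialized_views", "max_objects_per_schema",
    "max_clusters", "max_credit_consumption_rate", "memory_limiter_interval",
]

def print_environment_diagnostics(cur, log=print):
    """Best-effort diagnostic probe: log mz_environment_id() and the system
    parameter limits the test depends on. Any error is logged and swallowed
    so a transient probe failure cannot abort the workflow."""
    try:
        cur.execute("SELECT mz_environment_id()")
        log(f"environment id: {cur.fetchone()[0]}")
        for param in LIMIT_PARAMS:
            cur.execute(f"SHOW {param}")
            log(f"{param}: {cur.fetchone()[0]}")
    except Exception as exc:
        log(f"diagnostics probe failed (ignored): {exc}")
```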
