spec-sheet: add new envd-object-scalability and cluster-object-limits scenarios #36540

Draft
aljoscha wants to merge 7 commits into MaterializeInc:main from aljoscha:envd-specsheet

Conversation

@aljoscha
Contributor

Fixes SQL-222

aljoscha and others added 7 commits May 13, 2026 11:30
The existing scenarios scale cluster size or envd CPU cores -- nothing
measures how adapter/envd latency moves as the catalog itself grows. Add
two scenarios under a new `envd_scalability` group that fix the
measurement cluster and vary the number of catalog objects.

`envd_scalability_tables` puts N empty tables in the catalog -- pure
catalog/adapter pressure, no controller load. `envd_scalability_mvs`
does N materialized views over a single 1-row base table -- same
catalog footprint, plus controller load proportional to N. The MV
scenario shards across single-replica pad clusters at 10000 MVs per
cluster (so 100k MVs spans 10 clusters), since one cluster can't
reasonably host that many dataflows.
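The sharding scheme described above amounts to a simple integer division; a minimal sketch (the helper and cluster-naming scheme are illustrative, not the test's actual code):

```python
# Hypothetical sketch of the MV-sharding rule: 10_000 MVs per
# single-replica pad cluster, so 100k MVs span 10 clusters.
MVS_PER_CLUSTER = 10_000

def pad_cluster_for(mv_index: int) -> str:
    """Return the (illustrative) pad-cluster name hosting the given MV."""
    return f"pad_cluster_{mv_index // MVS_PER_CLUSTER}"
```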

For each N in {1, 10, 100, 1k, 3k, 5k, 10k, 20k, 30k, 50k, 100k} we run
10 reps each of `CREATE TABLE` (DDL through the coordinator) and
`SELECT * FROM <1-row table>` (a simple peek on a fixed 100cc cluster).
The catalog is built incrementally across size points, so going from
N=k to the next size point only adds (next - k) objects -- otherwise
we'd pay an O(sizes * N) build cost. The size list is overridable via
`--envd-scalability-sizes` for scaffolding runs.
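The incremental ramp-up can be sketched as follows (`build_increments` is an illustrative name, not the test's actual helper):

```python
# Size points from the commit message (a later commit in this PR drops
# 100_000 from the default list; it stays overridable via the CLI flag).
SIZES = [1, 10, 100, 1_000, 3_000, 5_000, 10_000, 20_000, 30_000, 50_000, 100_000]

def build_increments(sizes):
    """Yield (target_n, objects_to_add): the catalog is grown by the delta
    between consecutive size points rather than rebuilt from scratch,
    avoiding the O(sizes * N) cost of recreating every object per point."""
    current = 0
    for n in sizes:
        yield n, n - current
        current = n
```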

Results land in a third CSV (`*.envd_scalability.csv`) reusing the
cluster CSV schema; `mode='envd_scalability'` distinguishes the rows.
Test analytics rides on the existing `cluster_spec_sheet_result` table
-- no schema change needed. The analyzer plots `time_ms` vs N per
(scenario, category, test_name).

This is going to be long-running, especially the MV scenario where each
create exercises the controller -- expect hours for the full size
range.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add two new scenarios -- cluster_object_limits_indexes and
cluster_object_limits_mvs -- that find, per cluster size, the maximum
number of idle materializations one cluster can keep fresh.

The materializations are derived from a one-row, never-updated base
table so the only work the cluster has to do is keep advancing each
materialization's write_frontier in step with the upstream table. Once
the cluster can't keep up, freshness collapses; the driver records the
largest N at which `max(local_lag) < 2s` was still achievable, with the
unhealthy data point recorded too so the cliff is visible.
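The search criterion can be sketched as a pure function over measured lags (names are illustrative, not the driver's actual API):

```python
from datetime import timedelta

HEALTHY_LAG = timedelta(seconds=2)  # freshness threshold from the scenario

def is_fresh(max_local_lag):
    """True if the worst materialization lag is still under the threshold."""
    return max_local_lag is not None and max_local_lag < HEALTHY_LAG

def largest_healthy_n(measurements):
    """Given [(n, max_local_lag), ...] in increasing n, return the largest n
    that was still healthy; the first unhealthy point marks the cliff."""
    best = None
    for n, lag in measurements:
        if is_fresh(lag):
            best = n
        else:
            break  # the unhealthy point is still recorded, so the cliff is visible
    return best
```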

Staging-only (rejects --target=cloud-production), to avoid burning
production resources on long object-limit searches.

…lability default at 50k

When a materialization stalls completely (write_frontier never advances
past the minimum timestamp), `mz_internal.mz_materialization_lag` reports
`now() - 0` = current unix time in ms (~1.78e12). Recorded as-is this
crushes every healthy data point to ~0 on the plot. Cap the recorded
value at 10x the healthy threshold (= 20 s), preserve the underlying
truth via the `healthy` column, and label the plot to make the cap and
healthy threshold explicit.
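A minimal sketch of the capping rule, using the thresholds from the message (`record_lag` is an illustrative name):

```python
HEALTHY_THRESHOLD_MS = 2_000          # max(local_lag) < 2 s counts as healthy
LAG_CAP_MS = 10 * HEALTHY_THRESHOLD_MS  # recorded values capped at 20 s

def record_lag(raw_lag_ms):
    """Cap a stalled materialization's astronomical lag (~1.78e12 ms when the
    write_frontier never leaves the minimum timestamp) so healthy data points
    stay visible on the plot; the `healthy` flag preserves the real outcome."""
    healthy = raw_lag_ms < HEALTHY_THRESHOLD_MS
    return min(raw_lag_ms, LAG_CAP_MS), healthy
```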

Also drop 100_000 from the envd_scalability default size list: 50_000 is
a more sensible default ceiling for staging. The full size list is still
overridable via --envd-scalability-sizes for ad-hoc runs.

…tion

The release-qualification pipeline already runs three cluster-spec-sheet
groups (cluster_compute on production, source_ingestion on production,
environmentd on staging). Add two more groups -- envd_scalability and
cluster_object_limits -- both running against staging, since both push
the catalog / cluster to limits we don't want to exercise on production.

The three "envd / cluster" groups in the cluster-spec-sheet were named
inconsistently. Settle on the three concept names the cluster-spec-sheet
effort uses verbally:

  environmentd          -> envd_qps_scalability     (QPS vs envd CPU)
  envd_scalability      -> envd_objects_scalability (latency vs catalog N)
  cluster_object_limits -> cluster_object_limits    (unchanged)

Renames apply to: scenario constants, scenario-name string values, group
keys in SCENARIO_GROUPS, class names, the run/analyze function names,
the --envd-scalability-sizes CLI flag, the result CSV suffix, and the
`mode` field written into CSV rows. The pre-existing QPS scenarios keep
their individual `*_envd_strong_scaling` names since only the group is
renamed.

Also updates the release-qualification pipeline step ids/args and the
README to match.

…w start

When debugging cluster-spec-sheet runs on staging it's hard to tell which
environment we're actually talking to and whether the system parameter
defaults we expect (lifted via LaunchDarkly or similar) are actually
applied. Add a one-shot diagnostic right after target.initialize() that
prints mz_environment_id() and SHOWs the limits the test depends on
(max_tables, max_materialized_views, max_objects_per_schema, max_clusters,
max_credit_consumption_rate, memory_limiter_interval).

Best-effort: any probe error is logged and swallowed so a transient
failure does not abort the workflow.
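The probe's shape can be sketched as follows (assuming a DB-API style cursor; the function and list names are illustrative):

```python
# Limits named in the commit message that the test depends on.
LIMIT_PARAMS = [
    "max_tables", "max_materialized_views", "max_objects_per_schema",
    "max_clusters", "max_credit_consumption_rate", "memory_limiter_interval",
]

def print_environment_diagnostics(cur, log=print):
    """Best-effort diagnostic probe: log mz_environment_id() and the system
    parameter limits the test depends on. Any error is logged and swallowed
    so a transient probe failure cannot abort the workflow."""
    try:
        cur.execute("SELECT mz_environment_id()")
        log(f"environment id: {cur.fetchone()[0]}")
        for param in LIMIT_PARAMS:
            cur.execute(f"SHOW {param}")
            log(f"{param}: {cur.fetchone()[0]}")
    except Exception as exc:
        log(f"diagnostics probe failed (ignored): {exc}")
```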
