catalog v82->v83: normalize live Role rows still in pre-v81 byte form by DAlperin · Pull Request #36534 · MaterializeInc/materialize

DAlperin · 2026-05-13T02:28:05Z

From @def- LLM review:

The new migration only emits repairs for keys that already have a dangling negative diff. compute_repairs first records +1 Role rows, then only examines snapshot entries whose diff is not +1; a Role is repaired only if there is an existing -1 plus a parsed-equal +1 sibling (src/catalog/src/durable/upgrade/v82_to_v83.rs, especially lines 162-165 and 180-240). A lone v80-form +1 Role row is therefore treated as healthy and left untouched; the new tests also only cover already-corrupted shapes (stale +1 plus dangling -1) and never a snapshot that contains just the stale pre-v81 byte form (src/catalog/src/durable/upgrade/v82_to_v83.rs).

This fix recanonicalizes Roles in v83 format if they are different. I added point tests but dennis will also rebase on this.

def-

lgtm from a testing perspective: https://buildkite.com/materialize/nightly/builds/16425 with #36525 rebased on top of this

def- · 2026-05-13T04:05:32Z

@claude /review

claude · 2026-05-13T04:06:20Z

Claude finished @def-'s task in 2m 19s —— View job

Code Review

Gathering context
Reading changed files
Analyzing changes
Review complete

Overview

This PR adds Pass 2 to the v82→v83 migration to address a gap identified in the LLM review: Role rows still in pre-v81 byte form that have never had DDL applied since the non-cloud no-op. Without this fix, such rows survive the migration and the next ALTER ROLE/DROP ROLE on them would manufacture a fresh dangling -1 with no migration left to repair it.

Correctness

The two-pass structure is sound:

Pass 1 correctness (unchanged logic): Correctly tracks which stale +1 rows it retracts in retracted_in_pass_1 and which keys it declined to repair in unrepaired_keys. Both sets are correctly populated in all three skip-paths:

src/catalog/src/durable/upgrade/v82_to_v83.rs:238 — unexpected diff magnitude
src/catalog/src/durable/upgrade/v82_to_v83.rs:274 — no parsed-equal sibling
src/catalog/src/durable/upgrade/v82_to_v83.rs:284 — ambiguous live rows

Pass 2 correctness: The guard chain is correct and in the right order:

Skip non-+1 rows
Skip non-Role rows
Skip rows already retracted by Pass 1 (prevents (stale, -2))
Skip rows for unrepaired_keys (prevents normalizing a subset of a key that still has a dangling -1)
Skip rows already in canonical form (no-op check)

The inserted_canonical deduplication at src/catalog/src/durable/upgrade/v82_to_v83.rs:343 correctly handles the pathological case where two distinct stale byte forms of the same role both exist as +1 rows — only one canonical +1 is inserted. The stats.normalized counter incrementing per input row (not per canonical insert) is fine; it counts work done, not rows written.

Tricky edge: stale row whose bytes equal the dangling row's bytes. In a fully consolidated snapshot, (X, +1) and (X, -1) would cancel to 0 and never appear, so s.bytes == kind_json in the Pass 1 loop is dead code in practice. The skip guard is still good defensive programming.

Tricky edge: canonical form of a Pass-1-emitted row. The dangling row (kind_json, diff -1) was originally produced by re-serializing through the current proto, so canonical(parse(dangling_bytes)) == dangling_bytes holds. Pass 2 only iterates the original snapshot, not the repairs vector, so there's no risk of Pass 2 attempting to normalize a +1 emitted by Pass 1.

Test Coverage

The new tests are comprehensive and cover exactly the cases identified in the LLM review:

Test	Scenario
`untouched_pre_v81_role_is_normalized`	Core new case: lone stale `+1` → retracted + canonical reinserted
`canonical_form_role_is_not_normalized`	Already-canonical `+1` is a no-op
`multiple_stale_byte_forms_no_dangling_collapse_to_one_canonical`	Two stale forms → two `-1`s but only one canonical `+1`
`unrepaired_dangling_skips_normalize_for_same_key`	Pass 1 skip → Pass 2 also skips same key
`normalize_does_not_double_retract_pass_1_stale_rows`	Pass 2 doesn't double-retract rows already handled by Pass 1

Minor Notes

Naming: stale_role_kind_with_extra_whitespace (src/catalog/src/durable/upgrade/v82_to_v83.rs:854) doesn't actually use extra whitespace — it adds a spurious "password": null field. The name is slightly misleading for a test helper that may be read later to understand the failure mode being exercised. Low priority, test-only.
Logging scope: The tracing::info! at src/catalog/src/durable/upgrade/v82_to_v83.rs:338 logs role_name for each normalized row. For environments with many stale roles this could be chatty, but for a one-time migration this is appropriate — it produces a useful audit trail.
inserted_canonical is a BTreeSet<StateUpdateKindJson> (owned) while retracted_in_pass_1 is a BTreeSet<&StateUpdateKindJson> (borrowed). The asymmetry is intentional — retracted_in_pass_1 borrows from the snapshot while inserted_canonical needs to store newly computed canonical forms that don't exist in the snapshot. This is correct.

Verdict

The implementation is correct and the test suite covers the important edge cases. The LLM review's concern is properly addressed. LGTM.

bosconi

Looks good!

bosconi · 2026-05-13T04:52:37Z

+
+    #[mz_ore::test]
+    #[cfg_attr(miri, ignore)] // can't call foreign function `decContextDefault` on OS `linux`
+    fn multiple_stale_byte_forms_no_dangling_collapse_to_one_canonical() {


A function name James Joyce would be proud of!

def- · 2026-05-13T06:24:17Z

One thing popped up based on the QA LLM review:

auto_provision_source: null is now set for everything instead of Some(Frontegg) for the original v80->v81 migration. I'm not sure how bad that is, and how much we care. CC @SangJunBak

aljoscha · 2026-05-13T07:56:56Z

I think @def- ia right. To preserve the spirit of the original migration, we need to apply the same logic that decided for frontegg or None. Otherwise this looks very good.

Basically, in the normalization function apply the logic because the normalized form is dynamic, if that makes sense.

ggevay · 2026-05-15T13:40:49Z

Now we need to put this in a new migration before merging, right? Discussed here

Because all cloud environments have already ran this migration, we need to copy changes from the old migration to a new one. - 1:1 Copies over old migration into a new file - Reverts changes in the old migration

This is the case the original heuristic missed because it only accepted Managed while some cloud customers have mz_system as an unmanaged cluster

catalog v82->v83: normalize live Role rows still in pre-v81 byte form

0da6e5c

DAlperin requested a review from a team as a code owner May 13, 2026 02:28

def- mentioned this pull request May 13, 2026

platform-checks: Add check and scenario for incident #1007 & extend existing incident-1007 test #36525

Open

def- approved these changes May 13, 2026

View reviewed changes

bosconi approved these changes May 13, 2026

View reviewed changes

SangJunBak added 3 commits May 15, 2026 11:10

rerun v82_to_v83 catalog migration as new migration

bc46cd9

Because all cloud environments have already ran this migration, we need to copy changes from the old migration to a new one. - 1:1 Copies over old migration into a new file - Reverts changes in the old migration

Add v80_to_v81 migration to v83_to_v84

1a523a1

Extend is_cloud heuristic to account for an unmanaged mz_system

284997e

This is the case the original heuristic missed because it only accepted Managed while some cloud customers have mz_system as an unmanaged cluster

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

catalog v82->v83: normalize live Role rows still in pre-v81 byte form#36534

catalog v82->v83: normalize live Role rows still in pre-v81 byte form#36534
DAlperin wants to merge 4 commits into
MaterializeInc:mainfrom
DAlperin:dov/catalog-v83-role-normalize

DAlperin commented May 13, 2026

Uh oh!

def- left a comment

Uh oh!

def- commented May 13, 2026

Uh oh!

claude Bot commented May 13, 2026 •

edited

Loading

Uh oh!

bosconi left a comment

Uh oh!

bosconi May 13, 2026

Uh oh!

def- commented May 13, 2026

Uh oh!

aljoscha commented May 13, 2026

Uh oh!

ggevay commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

DAlperin commented May 13, 2026

Uh oh!

def- left a comment

Choose a reason for hiding this comment

Uh oh!

def- commented May 13, 2026

Uh oh!

claude Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Overview

Correctness

Test Coverage

Minor Notes

Verdict

Uh oh!

bosconi left a comment

Choose a reason for hiding this comment

Uh oh!

bosconi May 13, 2026

Choose a reason for hiding this comment

Uh oh!

def- commented May 13, 2026

Uh oh!

aljoscha commented May 13, 2026

Uh oh!

ggevay commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

claude Bot commented May 13, 2026 •

edited

Loading