Skip to content

catalog v82->v83: normalize live Role rows still in pre-v81 byte form#36534

Open
DAlperin wants to merge 4 commits into
MaterializeInc:mainfrom
DAlperin:dov/catalog-v83-role-normalize
Open

catalog v82->v83: normalize live Role rows still in pre-v81 byte form#36534
DAlperin wants to merge 4 commits into
MaterializeInc:mainfrom
DAlperin:dov/catalog-v83-role-normalize

Conversation

@DAlperin
Copy link
Copy Markdown
Member

From @def- LLM review:

The new migration only emits repairs for keys that already have a dangling negative diff. compute_repairs first records +1 Role rows, then only examines snapshot entries whose diff is not +1; a Role is repaired only if there is an existing -1 plus a parsed-equal +1 sibling (src/catalog/src/durable/upgrade/v82_to_v83.rs, especially lines 162-165 and 180-240). A lone v80-form +1 Role row is therefore treated as healthy and left untouched; the new tests also only cover already-corrupted shapes (stale +1 plus dangling -1) and never a snapshot that contains just the stale pre-v81 byte form (src/catalog/src/durable/upgrade/v82_to_v83.rs).

This fix recanonicalizes Roles in v83 format if they are different. I added point tests but dennis will also rebase on this.

Copy link
Copy Markdown
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm from a testing perspective: https://buildkite.com/materialize/nightly/builds/16425 with #36525 rebased on top of this

@def-
Copy link
Copy Markdown
Contributor

def- commented May 13, 2026

@claude /review

@claude
Copy link
Copy Markdown

claude Bot commented May 13, 2026

Claude finished @def-'s task in 2m 19s —— View job


Code Review

  • Gathering context
  • Reading changed files
  • Analyzing changes
  • Review complete

Overview

This PR adds Pass 2 to the v82→v83 migration to address a gap identified in the LLM review: Role rows still in pre-v81 byte form that have never had DDL applied since the non-cloud no-op. Without this fix, such rows survive the migration and the next ALTER ROLE/DROP ROLE on them would manufacture a fresh dangling -1 with no migration left to repair it.

Correctness

The two-pass structure is sound:

Pass 1 correctness (unchanged logic): Correctly tracks which stale +1 rows it retracts in retracted_in_pass_1 and which keys it declined to repair in unrepaired_keys. Both sets are correctly populated in all three skip-paths:

  • src/catalog/src/durable/upgrade/v82_to_v83.rs:238 — unexpected diff magnitude
  • src/catalog/src/durable/upgrade/v82_to_v83.rs:274 — no parsed-equal sibling
  • src/catalog/src/durable/upgrade/v82_to_v83.rs:284 — ambiguous live rows

Pass 2 correctness: The guard chain is correct and in the right order:

  1. Skip non-+1 rows
  2. Skip non-Role rows
  3. Skip rows already retracted by Pass 1 (prevents (stale, -2))
  4. Skip rows for unrepaired_keys (prevents normalizing a subset of a key that still has a dangling -1)
  5. Skip rows already in canonical form (no-op check)

The inserted_canonical deduplication at src/catalog/src/durable/upgrade/v82_to_v83.rs:343 correctly handles the pathological case where two distinct stale byte forms of the same role both exist as +1 rows — only one canonical +1 is inserted. The stats.normalized counter incrementing per input row (not per canonical insert) is fine; it counts work done, not rows written.

Tricky edge: stale row whose bytes equal the dangling row's bytes. In a fully consolidated snapshot, (X, +1) and (X, -1) would cancel to 0 and never appear, so s.bytes == kind_json in the Pass 1 loop is dead code in practice. The skip guard is still good defensive programming.

Tricky edge: canonical form of a Pass-1-emitted row. The dangling row (kind_json, diff -1) was originally produced by re-serializing through the current proto, so canonical(parse(dangling_bytes)) == dangling_bytes holds. Pass 2 only iterates the original snapshot, not the repairs vector, so there's no risk of Pass 2 attempting to normalize a +1 emitted by Pass 1.

Test Coverage

The new tests are comprehensive and cover exactly the cases identified in the LLM review:

Test Scenario
untouched_pre_v81_role_is_normalized Core new case: lone stale +1 → retracted + canonical reinserted
canonical_form_role_is_not_normalized Already-canonical +1 is a no-op
multiple_stale_byte_forms_no_dangling_collapse_to_one_canonical Two stale forms → two -1s but only one canonical +1
unrepaired_dangling_skips_normalize_for_same_key Pass 1 skip → Pass 2 also skips same key
normalize_does_not_double_retract_pass_1_stale_rows Pass 2 doesn't double-retract rows already handled by Pass 1

Minor Notes

  1. Naming: stale_role_kind_with_extra_whitespace (src/catalog/src/durable/upgrade/v82_to_v83.rs:854) doesn't actually use extra whitespace — it adds a spurious "password": null field. The name is slightly misleading for a test helper that may be read later to understand the failure mode being exercised. Low priority, test-only.

  2. Logging scope: The tracing::info! at src/catalog/src/durable/upgrade/v82_to_v83.rs:338 logs role_name for each normalized row. For environments with many stale roles this could be chatty, but for a one-time migration this is appropriate — it produces a useful audit trail.

  3. inserted_canonical is a BTreeSet<StateUpdateKindJson> (owned) while retracted_in_pass_1 is a BTreeSet<&StateUpdateKindJson> (borrowed). The asymmetry is intentional — retracted_in_pass_1 borrows from the snapshot while inserted_canonical needs to store newly computed canonical forms that don't exist in the snapshot. This is correct.

Verdict

The implementation is correct and the test suite covers the important edge cases. The LLM review's concern is properly addressed. LGTM.

Copy link
Copy Markdown
Member

@bosconi bosconi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!


#[mz_ore::test]
#[cfg_attr(miri, ignore)] // can't call foreign function `decContextDefault` on OS `linux`
fn multiple_stale_byte_forms_no_dangling_collapse_to_one_canonical() {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A function name James Joyce would be proud of!

@def-
Copy link
Copy Markdown
Contributor

def- commented May 13, 2026

One thing popped up based on the QA LLM review:

auto_provision_source: null is now set for everything instead of Some(Frontegg) for the original v80->v81 migration. I'm not sure how bad that is, and how much we care. CC @SangJunBak

@aljoscha
Copy link
Copy Markdown
Contributor

I think @def- ia right. To preserve the spirit of the original migration, we need to apply the same logic that decided for frontegg or None. Otherwise this looks very good.

Basically, in the normalization function apply the logic because the normalized form is dynamic, if that makes sense.

@ggevay
Copy link
Copy Markdown
Contributor

ggevay commented May 15, 2026

Now we need to put this in a new migration before merging, right? Discussed here

Because all cloud environments have already ran this migration, we need to copy changes from the old migration to a new one. 

- 1:1 Copies over old migration into a new file
- Reverts changes in the old migration
 This is the case the original heuristic missed because it only accepted Managed while some cloud customers have mz_system as an unmanaged cluster
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants