Skip to content

[SPARK-55347][SDP][FOLLOW-UP] Cleanup AutoCDC Flow code#56255

Open
szehon-ho wants to merge 3 commits into
apache:masterfrom
szehon-ho:spark-57113-follow-up
Open

[SPARK-55347][SDP][FOLLOW-UP] Cleanup AutoCDC Flow code#56255
szehon-ho wants to merge 3 commits into
apache:masterfrom
szehon-ho:spark-57113-follow-up

Conversation

@szehon-ho
Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

Follow-up to #56160 (SPARK-57113) addressing post-merge review comments and reducing duplication in the AutoCDC flow code and its test suites. No behavior change.

sql/pipelines/.../graph/FlowExecution.scala:

  • Hoist the org.json4s imports out of serializeKeyColumnNames / parseKeyColumnNames to the top of the file, per Spark's import conventions.
  • Factor the duplicated AutoCDC key-field resolution shared by auxiliaryKeyColumnNames and validateNoAutoCdcKeyDrift into a single expectedAuxiliaryKeyFields helper.
  • Import scala.collection.mutable instead of using a fully-qualified inline reference.

Tests:

  • Add a shared singleAutoCdcFlowPipeline helper to AutoCdcGraphExecutionTestMixin and use it across the AutoCDC SCD1 E2E suites (AutoCdcScd1KeyDriftSuite, AutoCdcScd1MultiPipelineSuite, AutoCdcScd1AuxiliaryTableDurabilitySuite, AutoCdcScd1SchemaEvolutionSuite), removing the repeated single-table/single-flow TestGraphRegistrationContext registration boilerplate.

Why are the changes needed?

Addresses non-blocking review feedback left on #56160 and reduces duplication in the AutoCDC flow code and its tests, improving readability and maintainability. The net diff removes ~300 lines.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pure refactor with no behavior change, covered by the existing AutoCDC suites:

  • AutoCdcAuxiliaryTableSuite
  • AutoCdcScd1KeyDriftSuite
  • AutoCdcScd1MultiPipelineSuite
  • AutoCdcScd1AuxiliaryTableDurabilitySuite
  • AutoCdcScd1SchemaEvolutionSuite

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor (Claude Opus 4.8)

szehon-ho added 3 commits June 1, 2026 15:11
Follow-up to apache#56160 (SPARK-57113) addressing post-merge review comments
and reducing duplication in the AutoCDC flow code and its test suites.
No behavior change.

- FlowExecution.scala: hoist org.json4s imports to the file top; factor
  the duplicated AutoCDC key-field resolution into a shared
  expectedAuxiliaryKeyFields helper; import scala.collection.mutable
  instead of a fully-qualified inline reference.
- Tests: add a shared singleAutoCdcFlowPipeline helper to
  AutoCdcGraphExecutionTestMixin and use it across the AutoCDC E2E suites,
  removing repeated single-table/single-flow registration boilerplate.
…cd1KeyDriftSuite

Replace the repeated per-test `val session = spark; import session.implicits._`
with a single class-level `import testImplicits._`, and import
`classic.DataFrame` and `types.MetadataBuilder` instead of using
fully-qualified names.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant