Spark 4.1: Add validation for required fields in add_files procedure#16977
ebyhr wants to merge 1 commit into
b7ddc3a to 17c3f97
private static Set<Integer> requiredFieldIds(Schema schema) {
  return TypeUtil.indexById(schema.asStruct()).entrySet().stream()
TypeUtil.indexById is recursive, so this set includes nested required ids (struct fields, list elements which are required by default, map keys). nullValueCounts is keyed by leaf primitive ids, and for a required leaf under an optional parent struct Parquet counts the leaf as null whenever the parent is absent, which could reject a valid file. Is the intent to validate only top-level required columns? If nested handling is intended, could we add a nested-schema test to check the behavior?
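To make the question concrete, here is a minimal, self-contained sketch of three ways the required-id set could be built. It does not use the Iceberg API: `Field`, `sampleSchema`, and all three method names are hypothetical stand-ins, and only struct nesting is modeled (no lists or maps). It assumes that a required leaf is only guaranteed non-null when every ancestor is also required.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class RequiredIdsSketch {
  // Hypothetical stand-in for a schema field; NOT Iceberg's NestedField.
  static final class Field {
    final int id;
    final boolean required;
    final List<Field> children;

    Field(int id, boolean required, Field... children) {
      this.id = id;
      this.required = required;
      this.children = Arrays.asList(children);
    }
  }

  // Recursive collection: every required id at any depth, similar in spirit
  // to filtering a recursive id index.
  static Set<Integer> allRequiredIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field f : fields) {
      if (f.required) ids.add(f.id);
      ids.addAll(allRequiredIds(f.children));
    }
    return ids;
  }

  // Top-level only: nested fields are ignored entirely.
  static Set<Integer> topLevelRequiredIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field f : fields) {
      if (f.required) ids.add(f.id);
    }
    return ids;
  }

  // Required leaves whose ancestors are all required, so a null count on
  // them cannot be explained by an absent optional parent.
  static Set<Integer> alwaysPresentRequiredLeafIds(List<Field> fields) {
    Set<Integer> ids = new TreeSet<>();
    for (Field f : fields) {
      if (!f.required) continue; // optional parent: skip the whole subtree
      if (f.children.isEmpty()) {
        ids.add(f.id);
      } else {
        ids.addAll(alwaysPresentRequiredLeafIds(f.children));
      }
    }
    return ids;
  }

  // 1: required leaf; 2: optional struct { 3: required leaf };
  // 4: optional leaf; 5: required struct { 6: required leaf, 7: optional leaf }
  static List<Field> sampleSchema() {
    return Arrays.asList(
        new Field(1, true),
        new Field(2, false, new Field(3, true)),
        new Field(4, false),
        new Field(5, true, new Field(6, true), new Field(7, false)));
  }

  public static void main(String[] args) {
    List<Field> schema = sampleSchema();
    System.out.println(allRequiredIds(schema));               // [1, 3, 5, 6]
    System.out.println(topLevelRequiredIds(schema));          // [1, 5]
    System.out.println(alwaysPresentRequiredLeafIds(schema)); // [1, 6]
  }
}
```

In this toy schema the recursive variant includes id 3, the required leaf under an optional struct, which is the case that could reject a valid file. The third variant is one possible nested handling, not necessarily what the PR intends.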
@TestTemplate
public void violateNotNullConstraintFromFileTable() {
Both new tests cover the unpartitioned paths (catalog -> driver, file-table -> task). Could we add a partitioned case (a null in a required non-partition column) so the buildManifest validation is pinned for partitioned imports, and a positive test confirming a required column with valid data still imports successfully?
add_files bypasses NOT NULL column constraints: importing a Parquet/ORC/Avro file with null values in a required column silently succeeds, violating the table's schema integrity.
Validation is skipped when null-value metrics are not collected (e.g., metrics mode set to none), since there is no data to validate against.
add_files procedure allows importing NULL on NOT NULL columns #10742
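
The check described above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the PR's code: the class and method names are hypothetical, a `null` map stands in for "null-value metrics not collected", a required id with no entry in the map is treated as having nothing to validate, and the `IllegalArgumentException` type is a placeholder for whatever the procedure actually throws.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class NullCountValidationSketch {
  // Returns the required field ids whose recorded null count is positive.
  // A null map models "metrics not collected": the result is empty because
  // there is no data to validate against.
  static Set<Integer> violations(Set<Integer> requiredIds, Map<Integer, Long> nullValueCounts) {
    Set<Integer> bad = new TreeSet<>();
    if (nullValueCounts == null) {
      return bad;
    }
    for (Integer id : requiredIds) {
      Long count = nullValueCounts.get(id);
      // Assumption: a missing entry means no metric for that column, so skip it.
      if (count != null && count > 0) {
        bad.add(id);
      }
    }
    return bad;
  }

  // Fails the import when any required field has nulls recorded.
  // The exception type is a placeholder chosen for this sketch.
  static void validate(Set<Integer> requiredIds, Map<Integer, Long> nullValueCounts) {
    Set<Integer> bad = violations(requiredIds, nullValueCounts);
    if (!bad.isEmpty()) {
      throw new IllegalArgumentException("Found null values in required field ids: " + bad);
    }
  }
}
```

Separating `violations` from `validate` keeps the decision testable without catching exceptions, and makes the skip-on-missing-metrics behavior explicit.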