Skip to content

Spark 4.1: Add validation for required fields in add_files procedure#16977

Open
ebyhr wants to merge 1 commit into
apache:mainfrom
ebyhr:ebi/spark-add-files-not-null
Open

Spark 4.1: Add validation for required fields in add_files procedure#16977
ebyhr wants to merge 1 commit into
apache:mainfrom
ebyhr:ebi/spark-add-files-not-null

Conversation

@ebyhr

@ebyhr ebyhr commented Jun 26, 2026

Copy link
Copy Markdown
Member

add_files bypasses NOT NULL column constraints - importing a Parquet/ORC/Avro
file with null values in a required column silently succeeds, violating the
table's schema integrity.

Validation is skipped when null-value metrics are not collected (e.g., metrics
mode set to none), since there is no data to validate against.

@github-actions github-actions Bot added the spark label Jun 26, 2026
Comment thread spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java Outdated
Comment thread spark/v4.1/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java Outdated
@ebyhr ebyhr force-pushed the ebi/spark-add-files-not-null branch from b7ddc3a to 17c3f97 Compare June 29, 2026 05:37
}

private static Set<Integer> requiredFieldIds(Schema schema) {
return TypeUtil.indexById(schema.asStruct()).entrySet().stream()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeUtil.indexById is recursive, so this set includes nested required ids (struct fields, list elements which are required by default, map keys). nullValueCounts is keyed by leaf primitive ids, and for a required leaf under an optional parent struct Parquet counts the leaf as null whenever the parent is absent, which could reject a valid file. Is the intent to validate only top-level required columns? If nested handling is intended, could we add a nested-schema test to check the behavior?

}

@TestTemplate
public void violateNotNullConstraintFromFileTable() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both new tests cover the unpartitioned paths (catalog -> driver, file-table -> task). Could we add a partitioned case (a null in a required non-partition column) so the buildManifest validation is pinned for partitioned imports, and a positive test confirming a required column with valid data still imports successfully?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants