Skip to content

[core] Support arbitrary time granularity for chain table delta computation#8185

Open
juntaozhang wants to merge 6 commits into
apache:masterfrom
juntaozhang:pr-add-chain-min-part
Open

[core] Support arbitrary time granularity for chain table delta computation#8185
juntaozhang wants to merge 6 commits into
apache:masterfrom
juntaozhang:pr-add-chain-min-part

Conversation

@juntaozhang

Copy link
Copy Markdown
Contributor

Purpose

#8184

Tests

  • ChainPartitionStepExtractorTest
  • ChainTableUtilsTest
    • testGetDeltaPartitionsWithHourMinuteGranularity
    • testGetDeltaPartitionsWithSeparateHourAndMinute
  • SparkChainTableITCase
    • testChainTableWithMinuteLevelPartitions

String fingerprint = DateTimeFormatter.ofPattern(formatter).format(FINGERPRINT);

checkArgument(
fingerprint.length() == formatter.length(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This rejects valid partition.timestamp-formatter values that use DateTimeFormatter quoting, for example yyyyMMdd'T'HHmm. PartitionTimeExtractor and calPartValues can handle a pattern such as $dtT$hm with that formatter, but this check compares the formatted output length (20260609T1150) with the raw formatter pattern length including quote characters, so chain reads fail before delta partitions are enumerated. Please parse formatter fields without relying on raw pattern positions, or validate/reject these formatter patterns consistently before they can be used.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. Fixed and refactored by following steps:

  1. Parse pattern and formatter into tokens
  2. Match pattern tokens to format tokens recursively
  3. Compute each variable's start/end position in the formatted output
  4. Extract partition values by slicing the formatted datetime string

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ChainTableUtils.calPartValues relied on regex extraction which couldn't handle patterns like
$dtT$hm with quoted literals in the formatter (e.g., yyyyMMdd'T'HHmm).

The new ChainPartitionPatternResolver.calPartValues establishes mapping between pattern variables and formatter tokens, so $dt maps to yyyyMMdd and $hm maps to HHmm directly.


private static final LocalDateTime FINGERPRINT = LocalDateTime.of(2026, 6, 9, 11, 50, 58);
private static final String TIME_UNIT_CHARS = "yMdHhmsS";
private static final Pattern VARIABLE = Pattern.compile("\\$[a-zA-Z_]+");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable regex stops before digits, so a valid partition column like dt1 is parsed as variable $dt plus a constant 1. For partition.timestamp-pattern = "$dt1" and formatter yyyyMMdd, the extractor then treats the day digit as a fixed constant and returns a monthly step, causing getDeltaPartitions to generate only month boundaries and miss daily delta partitions. Please include the same identifier characters that partition fields can use, e.g. digits after the first character, when splitting timestamp-pattern variables.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. This is already fixed — parsePattern() matches the full partition column name from partitionColumns, so $dt1 is correctly parsed as a single variable.

public class ChainPartitionStepExtractor {

private static final LocalDateTime FINGERPRINT = LocalDateTime.of(2026, 6, 9, 11, 50, 58);
private static final String TIME_UNIT_CHARS = "yMdHhmsS";

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

S is advertised as a time unit here, but resolveField cannot map fractional seconds: the fingerprint has zero nanos, so a formatter containing SSS produces 000, and Long.parseLong("000") does not match any year/month/day/hour/minute/second value. A valid Java formatter such as yyyyMMddHHmmssSSS will therefore throw Unknown time unit value: 000 during chain delta planning. Please either implement a fractional-second step or remove/reject S consistently instead of accepting it in TIME_UNIT_CHARS.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. S has been removed from supported time unit chars.

.collectAsList().stream()
.map(Row::toString)
.collect(Collectors.toList()))
.containsExactlyInAnyOrder(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new minute-level scenario currently fails locally with mvn -pl paimon-spark/paimon-spark-ut -am -Pfast-build -DfailIfNoTests=false -Dtest=SparkChainTableITCase#testChainTableWithMinuteLevelPartitions test: the query for dt=20250810 and hr_min=03:40 also returns [7,1,7,20250810,03:40] from the 03:45 delta partition. That means the chain planner is still reading delta partitions after the requested chain partition, so this feature can expose future delta rows when a partition filter is present. Please tighten the delta partition selection to (anchor, requested] for the actual requested partition.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. After refactored, I rebased to master and the UT passed on my local:

[INFO] --- maven-compiler-plugin:3.8.0:testCompile (default-testCompile) @ paimon-spark-ut_2.12 ---
[INFO] Nothing to compile - all classes are up to date
[INFO]
[INFO] --- maven-jar-plugin:3.4.2:test-jar (prepare-test-jar) @ paimon-spark-ut_2.12 ---
[INFO] Building jar: /Users/juntao/src/github.com/apache/paimon/paimon-spark/paimon-spark-ut/target/paimon-spark-ut_2.12-1.5-SNAPSHOT-tests.jar
[INFO]
[INFO] --- maven-surefire-plugin:3.0.0-M5:test (default-test) @ paimon-spark-ut_2.12 ---
[INFO]
[INFO] -------------------------------------------------------
[INFO]  T E S T S
[INFO] -------------------------------------------------------
[INFO] Running org.apache.paimon.spark.SparkChainTableITCase

[INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 22.931 s - in org.apache.paimon.spark.SparkChainTableITCase

@juntaozhang juntaozhang force-pushed the pr-add-chain-min-part branch from b11e9c2 to 156d471 Compare June 28, 2026 13:50
@juntaozhang juntaozhang requested a review from JingsongLi June 28, 2026 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants