Core: Retry Hadoop version hint before metadata scan #17013

Open
majian1998 wants to merge 1 commit into apache:main from majian1998:fix/hadoop-version-hint-retry-config

@majian1998

Contributor

Summary

Retry reading Hadoop version-hint.text briefly before falling back to scanning the metadata directory.

Problem

HadoopTableOperations updates version-hint.text as a best-effort pointer after committing a new metadata file. On object stores such as OSS/S3/GCS, the delete/rename sequence can make the hint file briefly unavailable or not yet visible to readers.

Today, a transient read failure immediately falls back to listing the metadata directory. For tables with many metadata files, especially on object-store-backed metadata directories, that listing can be significantly more expensive than retrying the small hint file read.

Fix

Add configurable retries for reading version-hint.text before metadata directory listing.

Defaults:

iceberg.version-hint.retry.num-retries = 2
iceberg.version-hint.retry.initial-wait-ms = 100
iceberg.version-hint.retry.max-wait-ms = 800

The retry uses exponential backoff capped by the configured max wait. If the metadata directory does not exist, findVersion() keeps the existing fast path and returns 0 without retrying. If retry sleep is interrupted, the interrupt is surfaced instead of falling through to metadata listing.
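
The wait schedule described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the class and method names are hypothetical, and the doubling factor of 2 is an assumption, since the description only says "exponential backoff capped by the configured max wait". The sleep helper shows one common way to surface an interrupt rather than swallow it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch class; names are illustrative and not taken from the PR.
final class VersionHintBackoffSketch {
  // Defaults quoted in the PR description.
  static final int NUM_RETRIES = 2;
  static final long INITIAL_WAIT_MS = 100;
  static final long MAX_WAIT_MS = 800;

  // Wait before the given retry (0-based), assuming the wait doubles each time
  // and never exceeds maxWaitMs. The factor of 2 is an assumption.
  static long waitMs(int retry, long initialWaitMs, long maxWaitMs) {
    long wait = Math.min(initialWaitMs, maxWaitMs);
    for (int i = 0; i < retry && wait < maxWaitMs; i++) {
      wait = Math.min(maxWaitMs, wait * 2);
    }
    return wait;
  }

  // Full list of waits for a given number of retries.
  static List<Long> schedule(int numRetries, long initialWaitMs, long maxWaitMs) {
    List<Long> waits = new ArrayList<>();
    for (int retry = 0; retry < numRetries; retry++) {
      waits.add(waitMs(retry, initialWaitMs, maxWaitMs));
    }
    return waits;
  }

  // Sleep between attempts; on interrupt, restore the flag and rethrow so the
  // caller does not silently fall through to the metadata listing.
  static void sleepBeforeRetry(long waitMs) {
    try {
      Thread.sleep(waitMs);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new RuntimeException("Interrupted while waiting to retry version hint read", e);
    }
  }

  public static void main(String[] args) {
    // With the defaults there are two waits, so the 800 ms cap is not reached.
    System.out.println(schedule(NUM_RETRIES, INITIAL_WAIT_MS, MAX_WAIT_MS));
    // With more retries the cap takes effect.
    System.out.println(schedule(6, INITIAL_WAIT_MS, MAX_WAIT_MS));
  }
}
```

Under these assumptions, the default settings add at most 300 ms of waiting (100 ms, then 200 ms) before the fallback to listing; the 800 ms cap only matters if num-retries is raised to 4 or more.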

The configuration keys intentionally avoid the iceberg.hadoop.* prefix to prevent ambiguity with integrations that use iceberg.hadoop.* as a pass-through prefix for Hadoop configuration.

Test plan

git diff --check

Attempted:

./gradlew :iceberg-core:spotlessCheck :iceberg-core:test --tests org.apache.iceberg.hadoop.TestHadoopTableOperations --tests org.apache.iceberg.hadoop.TestHadoopCatalog.testVersionHintFileErrorWithFile --tests org.apache.iceberg.hadoop.TestHadoopCatalog.testVersionHintFileMissingMetadata --tests org.apache.iceberg.hadoop.TestHadoopCatalog.testMetadataFileMissing

Gradle did not reach test execution because plugin dependency downloads failed with TLS handshake errors from the Gradle plugin repository.

@majian1998 force-pushed the fix/hadoop-version-hint-retry-config branch from 6df81ce to 527ddb8 on June 30, 2026 04:09
Retry reading version-hint.text when the metadata directory exists before falling back to listing metadata files. Keep missing metadata directories on the fast path, use non-conflicting Hadoop configuration keys, cap exponential retry backoff, and surface retry interruptions.
@majian1998 force-pushed the fix/hadoop-version-hint-retry-config branch from 527ddb8 to ab95836 on June 30, 2026 05:47
}
}

private Integer retryReadVersionHint(FileSystem fs, Path versionHintFile) {

Contributor

retryReadVersionHint and sleepBeforeVersionHintRetry reimplement three things that org.apache.iceberg.util.Tasks already provides: retry, capped exponential backoff via exponentialBackoff(min, max, totalTimeout, scale), and the sleep/interrupt handling in runSingleThreaded, which does the same Thread.currentThread().interrupt(); throw new RuntimeException(...). AGENTS.md lists Tasks.foreach as the standard retry utility, and this package already uses it in HadoopFileIO.

Consider Tasks.foreach(versionHintFile).retry(numRetries).exponentialBackoff(initialWaitMs, maxWaitMs, totalTimeoutMs, 2.0).onlyRetryOn(IOException.class).run(...), falling back to the metadata listing in the catch when retries are exhausted. onlyRetryOn(IOException.class) also avoids retrying a deterministic parse failure (a NumberFormatException from a corrupt hint), which the current catch (Exception) retries with full backoff.
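
The distinction between transient and deterministic failures can be shown with a small standalone sketch. It deliberately does not use Iceberg's Tasks utility or the PR's code: HintReader, readWithRetry, and the listing fallback are hypothetical stand-ins, and the backoff sleep is omitted to keep the example short. It only shows the policy being suggested, where an IOException is retried and a NumberFormatException goes straight to the fallback.

```java
import java.io.IOException;
import java.util.function.IntSupplier;

// Hypothetical sketch class; it stands in for the hint-reading logic and is
// not the Iceberg implementation.
final class HintRetryPolicySketch {
  // Stand-in for "open version-hint.text and parse the integer inside".
  @FunctionalInterface
  interface HintReader {
    int read() throws IOException;
  }

  // Try the hint up to 1 + numRetries times. Only IOException is treated as
  // transient; a parse failure skips the remaining retries.
  static int readWithRetry(HintReader reader, int numRetries, IntSupplier listingFallback) {
    for (int attempt = 0; attempt <= numRetries; attempt++) {
      try {
        return reader.read();
      } catch (IOException e) {
        // Possibly transient (hint not yet visible): loop and try again.
        // A real implementation would sleep with backoff here.
      } catch (NumberFormatException e) {
        // Deterministic: re-reading a corrupt hint cannot help.
        break;
      }
    }
    // Retries exhausted or hint unparseable: fall back to the directory listing.
    return listingFallback.getAsInt();
  }

  public static void main(String[] args) {
    int[] calls = {0};
    int version = readWithRetry(
        () -> {
          if (calls[0]++ < 2) {
            throw new IOException("hint not visible yet");
          }
          return 7;
        },
        2,
        () -> -1);
    System.out.println(version + " after " + calls[0] + " reads");
  }
}
```

With a catch-all retry, the corrupt-hint case would pay the full backoff before reaching the same fallback, which is the cost the review comment points out.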
