Skip to content

[AURON #2118] make native environment initialization retriable after failure#2151

Open
XorSum wants to merge 1 commit into
apache:masterfrom
XorSum:fix_initialization_error
Open

[AURON #2118] make native environment initialization retriable after failure#2151
XorSum wants to merge 1 commit into
apache:masterfrom
XorSum:fix_initialization_error

Conversation

@XorSum

@XorSum XorSum commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #2118

Rationale for this change

avoid initialize in static block, make native environment initialization retriable after failure

What changes are included in this PR?

Are there any user-facing changes?

How was this patch tested?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses executor-wide NoClassDefFoundError caused by failures in JNI/native initialization during class static initialization, by moving native environment initialization into a retriable, lazily-invoked code path.

Changes:

  • Remove static initializer-based native environment setup from AuronCallNativeWrapper.
  • Add a thread-safe, lazy init() method guarded by double-checked locking and invoke it from the wrapper constructor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +81 to +90
private static void init() {
if (!initialized) {
synchronized (AuronCallNativeWrapper.class) {
if (!initialized) {
// initialize native environment
LOG.info("Initializing native environment (batchSize="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
+ ", "
+ "memoryFraction="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init() can run while a Spark task thread is already interrupted/cancelled (the production issue shows ClosedByInterruptException). Even though initialization is now retriable, attempting Files.copy/System.load from an interrupted task is still likely to fail and generate noisy errors. Consider short-circuiting initialization when the current task is not running (e.g., !AuronAdaptor.getInstance().isTaskRunning() or Thread.currentThread().isInterrupted()), and fail fast so a later non-cancelled task can perform the one-time init successfully.

Copilot uses AI. Check for mistakes.
Comment on lines +85 to +106
// initialize native environment
LOG.info("Initializing native environment (batchSize="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
+ ", "
+ "memoryFraction="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)
+ ")");

// arrow configuration
System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND");

// preload JNI bridge classes
try {
Class.forName("org.apache.auron.jni.JniBridge");
} catch (ClassNotFoundException e) {
throw new RuntimeException("Cannot load JniBridge class", e);
}

AuronAdaptor.getInstance().loadAuronLib();
Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit));

initialized = true;

Copilot AI Apr 2, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because initialization is now retried after failures, AuronAdaptor.getInstance().loadAuronLib() may be invoked multiple times in the same JVM. In the Spark adaptor, each call extracts to a fresh createTempFile(...).deleteOnExit(), so repeated failures can accumulate temp files until executor exit. Consider making loadAuronLib() idempotent / cached (or cleaning up temp files on failure) to avoid disk bloat on flaky/interrupt-driven init attempts.

Suggested change
// initialize native environment
LOG.info("Initializing native environment (batchSize="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
+ ", "
+ "memoryFraction="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)
+ ")");
// arrow configuration
System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND");
// preload JNI bridge classes
try {
Class.forName("org.apache.auron.jni.JniBridge");
} catch (ClassNotFoundException e) {
throw new RuntimeException("Cannot load JniBridge class", e);
}
AuronAdaptor.getInstance().loadAuronLib();
Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit));
initialized = true;
try {
// initialize native environment
LOG.info("Initializing native environment (batchSize="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE)
+ ", "
+ "memoryFraction="
+ AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION)
+ ")");
// arrow configuration
System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND");
// preload JNI bridge classes
try {
Class.forName("org.apache.auron.jni.JniBridge");
} catch (ClassNotFoundException e) {
throw new RuntimeException("Cannot load JniBridge class", e);
}
AuronAdaptor.getInstance().loadAuronLib();
Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit));
} finally {
// Mark initialization as attempted to avoid repeated native library loading
initialized = true;
}

Copilot uses AI. Check for mistakes.

@weiqingy weiqingy left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down — swapping the static {} initializer for a retriable, double-checked init() is the right cure for the <clinit> poisoning: a failed class initializer leaves the class permanently unusable on that executor, whereas leaving initialized false on a thrown loadAuronLib() lets the next task retry. The DCL reads correct to me — volatile flag, re-checked under the lock, and written last after every side effect. A couple of questions inline.

throw new RuntimeException("Cannot load JniBridge class", e);
}

AuronAdaptor.getInstance().loadAuronLib();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The retry depends on a quiet invariant here: every side effect that runs before initialized = trueSystem.setProperty, Class.forName, loadAuronLib, addShutdownHook — has to be either idempotent or guaranteed not to have run on the path that threw. Today that holds, since loadAuronLib() throws out of Files.copy before System.load, so on the failure path the native lib never loads and the shutdown hook never registers — a retry re-runs cleanly. That coupling lives in a different file, so an edit to loadAuronLib() that added work after System.load wouldn't have any signal at the retry site that it just broke the retry. Is it worth making that invariant visible here — a one-line note, or an idempotency check — so the next edit to the load path can't quietly break it?

this);
}

private static void init() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "How was this patch tested?" box is empty — and I think a faithful regression test is genuinely hard here, since it would need the real native lib plus a thread interrupted mid-Files.copy to reproduce the ClosedByInterruptException, which is racy and a process-global load. So this is a real question rather than a request: is there any lightweight check worth adding (even just asserting a second construction succeeds after init has run once), or is this effectively integration-only?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

JNI lazy initialization fails under stage cancellation, causing executor-wide NoClassDefFoundError (AuronCallNativeWrapper)

5 participants