[AURON #2118] make native environment initialization retriable after failure#2151
[AURON #2118] make native environment initialization retriable after failure#2151XorSum wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR addresses executor-wide NoClassDefFoundError caused by failures in JNI/native initialization during class static initialization, by moving native environment initialization into a retriable, lazily-invoked code path.
Changes:
- Remove static initializer-based native environment setup from
AuronCallNativeWrapper. - Add a thread-safe, lazy
init()method guarded by double-checked locking and invoke it from the wrapper constructor.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| private static void init() { | ||
| if (!initialized) { | ||
| synchronized (AuronCallNativeWrapper.class) { | ||
| if (!initialized) { | ||
| // initialize native environment | ||
| LOG.info("Initializing native environment (batchSize=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | ||
| + ", " | ||
| + "memoryFraction=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) |
There was a problem hiding this comment.
init() can run while a Spark task thread is already interrupted/cancelled (the production issue shows ClosedByInterruptException). Even though initialization is now retriable, attempting Files.copy/System.load from an interrupted task is still likely to fail and generate noisy errors. Consider short-circuiting initialization when the current task is not running (e.g., !AuronAdaptor.getInstance().isTaskRunning() or Thread.currentThread().isInterrupted()), and fail fast so a later non-cancelled task can perform the one-time init successfully.
| // initialize native environment | ||
| LOG.info("Initializing native environment (batchSize=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | ||
| + ", " | ||
| + "memoryFraction=" | ||
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) | ||
| + ")"); | ||
|
|
||
| // arrow configuration | ||
| System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND"); | ||
|
|
||
| // preload JNI bridge classes | ||
| try { | ||
| Class.forName("org.apache.auron.jni.JniBridge"); | ||
| } catch (ClassNotFoundException e) { | ||
| throw new RuntimeException("Cannot load JniBridge class", e); | ||
| } | ||
|
|
||
| AuronAdaptor.getInstance().loadAuronLib(); | ||
| Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit)); | ||
|
|
||
| initialized = true; |
There was a problem hiding this comment.
Because initialization is now retried after failures, AuronAdaptor.getInstance().loadAuronLib() may be invoked multiple times in the same JVM. In the Spark adaptor, each call extracts to a fresh createTempFile(...).deleteOnExit(), so repeated failures can accumulate temp files until executor exit. Consider making loadAuronLib() idempotent / cached (or cleaning up temp files on failure) to avoid disk bloat on flaky/interrupt-driven init attempts.
| // initialize native environment | |
| LOG.info("Initializing native environment (batchSize=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | |
| + ", " | |
| + "memoryFraction=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) | |
| + ")"); | |
| // arrow configuration | |
| System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND"); | |
| // preload JNI bridge classes | |
| try { | |
| Class.forName("org.apache.auron.jni.JniBridge"); | |
| } catch (ClassNotFoundException e) { | |
| throw new RuntimeException("Cannot load JniBridge class", e); | |
| } | |
| AuronAdaptor.getInstance().loadAuronLib(); | |
| Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit)); | |
| initialized = true; | |
| try { | |
| // initialize native environment | |
| LOG.info("Initializing native environment (batchSize=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.BATCH_SIZE) | |
| + ", " | |
| + "memoryFraction=" | |
| + AuronAdaptor.getInstance().getAuronConfiguration().get(AuronConfiguration.MEMORY_FRACTION) | |
| + ")"); | |
| // arrow configuration | |
| System.setProperty("arrow.struct.conflict.policy", "CONFLICT_APPEND"); | |
| // preload JNI bridge classes | |
| try { | |
| Class.forName("org.apache.auron.jni.JniBridge"); | |
| } catch (ClassNotFoundException e) { | |
| throw new RuntimeException("Cannot load JniBridge class", e); | |
| } | |
| AuronAdaptor.getInstance().loadAuronLib(); | |
| Runtime.getRuntime().addShutdownHook(new Thread(JniBridge::onExit)); | |
| } finally { | |
| // Mark initialization as attempted to avoid repeated native library loading | |
| initialized = true; | |
| } |
weiqingy
left a comment
There was a problem hiding this comment.
Thanks for tracking this down — swapping the static {} initializer for a retriable, double-checked init() is the right cure for the <clinit> poisoning: a failed class initializer leaves the class permanently unusable on that executor, whereas leaving initialized false on a thrown loadAuronLib() lets the next task retry. The DCL reads correct to me — volatile flag, re-checked under the lock, and written last after every side effect. A couple of questions inline.
| throw new RuntimeException("Cannot load JniBridge class", e); | ||
| } | ||
|
|
||
| AuronAdaptor.getInstance().loadAuronLib(); |
There was a problem hiding this comment.
The retry depends on a quiet invariant here: every side effect that runs before initialized = true — System.setProperty, Class.forName, loadAuronLib, addShutdownHook — has to be either idempotent or guaranteed not to have run on the path that threw. Today that holds, since loadAuronLib() throws out of Files.copy before System.load, so on the failure path the native lib never loads and the shutdown hook never registers — a retry re-runs cleanly. That coupling lives in a different file, so an edit to loadAuronLib() that added work after System.load wouldn't have any signal at the retry site that it just broke the retry. Is it worth making that invariant visible here — a one-line note, or an idempotency check — so the next edit to the load path can't quietly break it?
| this); | ||
| } | ||
|
|
||
| private static void init() { |
There was a problem hiding this comment.
The "How was this patch tested?" box is empty — and I think a faithful regression test is genuinely hard here, since it would need the real native lib plus a thread interrupted mid-Files.copy to reproduce the ClosedByInterruptException, which is racy and a process-global load. So this is a real question rather than a request: is there any lightweight check worth adding (even just asserting a second construction succeeds after init has run once), or is this effectively integration-only?
Which issue does this PR close?
Closes #2118
Rationale for this change
avoid initialize in static block, make native environment initialization retriable after failure
What changes are included in this PR?
Are there any user-facing changes?
How was this patch tested?