Fix flaky Android NDK download in QNN CI setup #20678
Summary:

QNN CI jobs intermittently fail during environment setup while downloading the Android NDK, with the signature `curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)`. Because this happens in the shared `setup_android_ndk` step (used by `build-qnn-sdk.sh`, `build-qnn-direct-sdk.sh`, and `setup-qnn-deps.sh`), the failure surfaces on a different test each run but always with the same signature.

The failure is an intermittent HTTP/2 stream reset from `dl.google.com` mid-transfer. Two gaps made it fatal rather than self-healing: the existing `curl --retry 3` never retried it, because curl's default retry set does not include transport error 92 (and `--retry-connrefused` does not cover it either); and `set -ex` then aborted the whole script on the first occurrence.

This mirrors the download-robustness pattern already used by `install_qnn` and `install_hexagon_sdk` in the same file, and applies it to `setup_android_ndk`:

- `--http1.1` sidesteps the HTTP/2 stream-reset behavior entirely (the standard workaround for this Google CDN error).
- `--retry-all-errors` makes the retry count apply to transport failures such as error 92.
- `--fail` treats HTTP errors as failures instead of writing an error body into the zip.
- The download is wrapped in a 5-attempt loop that removes any partial file and validates the archive with `unzip -tq` before extracting, so a truncated or corrupt download cannot slip through to a confusing `unzip` error. The risky `--continue-at -` resume is dropped.

Differential Revision: D110373581
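The pattern described above can be sketched as a small reusable helper. This is a minimal illustration, not the code in the PR: the function names `fetch_with_curl`, `verify_zip`, and `download_and_verify`, and the `FETCH_CMD` / `VERIFY_CMD` override variables, are hypothetical. Only the curl and unzip flags are taken from the summary. It assumes a curl new enough to support `--retry-all-errors`.

```shell
# Hypothetical helpers; only the curl and unzip flags come from the PR summary.

# Download URL ($1) to DEST ($2) over HTTP/1.1, retrying transport errors too.
fetch_with_curl() {
  curl --fail --http1.1 --retry 3 --retry-delay 5 \
    --retry-connrefused --retry-all-errors \
    -Lo "$2" "$1"
}

# Succeed only if the archive at $1 passes unzip's integrity test.
verify_zip() {
  unzip -tq "$1" >/dev/null 2>&1
}

# Usage: download_and_verify URL DEST [MAX_ATTEMPTS] [DELAY_SECONDS]
# FETCH_CMD and VERIFY_CMD (assumed names) let a caller swap in other commands.
download_and_verify() {
  local url="$1" dest="$2" max_attempts="${3:-5}" delay="${4:-5}"
  local attempt=1
  while [ "${attempt}" -le "${max_attempts}" ]; do
    rm -f "${dest}"  # start from scratch; never resume a partial file
    if "${FETCH_CMD:-fetch_with_curl}" "${url}" "${dest}" \
      && "${VERIFY_CMD:-verify_zip}" "${dest}"; then
      return 0
    fi
    echo "download/verify failed (attempt ${attempt}/${max_attempts})" >&2
    if [ "${attempt}" -lt "${max_attempts}" ]; then
      sleep "${delay}"
    fi
    attempt=$((attempt + 1))
  done
  return 1  # all attempts exhausted: report failure to the caller
}
```

A caller could then write `download_and_verify "https://dl.google.com/android/repository/${NDK_ZIP}" "/tmp/${NDK_ZIP}"` and extract only on success. Returning non-zero after the last attempt is a choice made for this sketch, so that exhaustion is reported by the helper itself.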
✅ No failures as of commit fb63cef with merge base ee990d7. (Automatically generated by Dr. CI.)
@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110373581.
Pull request overview
Improves robustness of the Android NDK download step used by Qualcomm (QNN) CI setup by avoiding flaky HTTP/2 behavior and adding retry + archive validation before extraction.
Changes:
- Forces NDK download via HTTP/1.1 and enables retries for transport-level failures (`--retry-all-errors`).
- Wraps the NDK download in a multi-attempt loop that re-downloads from scratch and validates the zip with `unzip -tq` before extracting.
    for attempt in 1 2 3 4 5; do
      rm -f "/tmp/${NDK_ZIP}"
      curl --fail --http1.1 --retry 3 --retry-delay 5 --retry-connrefused --retry-all-errors \
        -Lo "/tmp/${NDK_ZIP}" "https://dl.google.com/android/repository/${NDK_ZIP}" || true
      if unzip -tq "/tmp/${NDK_ZIP}" >/dev/null 2>&1; then
        break
      fi
      echo "NDK download/verify failed (attempt ${attempt}), retrying..."
      sleep 5
    done
    unzip -q "/tmp/${NDK_ZIP}" -d "${NDK_INSTALL_DIR}"
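The `|| true` after the `curl` call matters because the script runs under `set -ex`: without it, the first failed download would end the script before the loop could retry. The following standalone sketch illustrates that interaction using `false` as a stand-in for a failing download. It is not part of the PR, and the variable names are illustrative.

```shell
# Under `set -e`, an unguarded failing command ends the shell immediately,
# so the following `echo` never runs. `false` stands in for a failed curl.
unguarded="$(sh -c 'set -e; false; echo reached' 2>/dev/null || true)"

# With `|| true`, the failure is absorbed and execution continues, which is
# what lets a retry loop reach its own validation step.
guarded="$(sh -c 'set -e; false || true; echo reached')"

printf 'unguarded: "%s"\nguarded: "%s"\n' "${unguarded}" "${guarded}"
```

Because the failure is swallowed, something else has to decide whether the attempt succeeded. In the loop above that role is played by the `unzip -tq` check.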
shewu-quic left a comment:
Thanks for your fix.