Fix flaky Android NDK download in QNN CI setup#20678

Merged
psiddh merged 1 commit into
pytorch:mainfrom
psiddh:export-D110373581
Jul 2, 2026

Conversation

@psiddh

@psiddh psiddh commented Jul 1, 2026

Contributor

Summary:
QNN CI jobs intermittently fail during environment setup while downloading the Android NDK, with the signature `curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)`. Because this happens in the shared `setup_android_ndk` step (used by `build-qnn-sdk.sh`, `build-qnn-direct-sdk.sh`, and `setup-qnn-deps.sh`), the failure surfaces on a different test each run but always with the same signature.

The failure is an intermittent HTTP/2 stream reset from `dl.google.com` mid-transfer. Two gaps made it fatal rather than self-healing: the existing `curl --retry 3` never retried it, because curl's default retry set does not include transport error 92 (and `--retry-connrefused` does not cover it either); and `set -ex` then aborted the whole script on the first occurrence.
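The second gap can be reproduced without any network access. The following is a minimal sketch, assuming a hypothetical `fake_download` function that stands in for a `curl` call exiting with status 92; it contrasts an unguarded call under `set -e` with one guarded by `|| true`, the guard the retry loop in this change relies on.

```shell
#!/usr/bin/env bash
# fake_download is a hypothetical stand-in for a curl call that exits with 92.
# Each scenario runs in its own bash process so that its set -e takes effect.

unguarded_status=0
unguarded_output=$(bash -c '
  set -e
  fake_download() { return 92; }
  fake_download              # set -e aborts the script at this line
  echo "reached extraction"
') || unguarded_status=$?

guarded_status=0
guarded_output=$(bash -c '
  set -e
  fake_download() { return 92; }
  fake_download || true      # failure is swallowed so later logic can react
  echo "reached extraction"
') || guarded_status=$?

echo "unguarded: status=${unguarded_status} output='${unguarded_output}'"
echo "guarded:   status=${guarded_status} output='${guarded_output}'"
```

In the unguarded case the child script stops at the failing command and never reaches the `echo`, which is why a single stream reset was enough to fail the whole setup step.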

This change takes the download-robustness pattern already used by `install_qnn` and `install_hexagon_sdk` in the same file and applies it to `setup_android_ndk`:
- `--http1.1` sidesteps the HTTP/2 stream-reset behavior entirely (the standard workaround for this Google CDN error).
- `--retry-all-errors` makes the retry count apply to transport failures such as error 92.
- `--fail` treats HTTP errors as failures instead of writing an error body into the zip.
- The download is wrapped in a 5-attempt loop that removes any partial file and validates the archive with `unzip -tq` before extracting, so a truncated or corrupt download cannot slip through to a confusing `unzip` error. The risky `--continue-at -` resume is dropped.
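The steps in this list can be sketched as a reusable helper. This is an illustrative sketch rather than the code in the patch: `download_with_validation`, `fetch_ndk`, `verify_zip`, and the `NDK_URL` variable are hypothetical names, and unlike a plain loop the helper returns a non-zero status once every attempt has been used up. `--retry-all-errors` is assumed to be available, which requires curl 7.71.0 or newer.

```shell
#!/usr/bin/env bash
# download_with_validation FETCH_CMD VERIFY_CMD DEST [MAX_ATTEMPTS] [DELAY_SECONDS]
# Calls FETCH_CMD to write DEST, then VERIFY_CMD to check it. Returns 0 on the
# first verified download, or 1 when all attempts fail.
download_with_validation() {
  local fetch_cmd="$1" verify_cmd="$2" dest="$3"
  local max_attempts="${4:-5}" delay="${5:-5}"
  local attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    rm -f "${dest}"                      # never reuse a partial file
    "${fetch_cmd}" "${dest}" || true     # keep going even under set -e
    if "${verify_cmd}" "${dest}" >/dev/null 2>&1; then
      return 0
    fi
    echo "download/verify failed (attempt ${attempt}/${max_attempts})" >&2
    if ((attempt < max_attempts)); then
      sleep "${delay}"
    fi
  done
  return 1
}

# Hypothetical fetcher using the curl flags described above.
# NDK_URL is an assumed variable holding the full download URL.
fetch_ndk() {
  curl --fail --http1.1 --retry 3 --retry-delay 5 --retry-all-errors \
    -Lo "$1" "${NDK_URL}"
}

# Hypothetical verifier: unzip -t tests archive integrity without extracting.
verify_zip() {
  unzip -tq "$1"
}
```

A caller would then write something like `download_with_validation fetch_ndk verify_zip "/tmp/${NDK_ZIP}" || exit 1` before extracting. Passing the fetcher and verifier as function names keeps the loop testable with stubs and without network access.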

Differential Revision: D110373581
@psiddh psiddh requested a review from abhinaykukkadapu as a code owner July 1, 2026 21:06
Copilot AI review requested due to automatic review settings July 1, 2026 21:06
@pytorch-bot

pytorch-bot Bot commented Jul 1, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20678

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit fb63cef with merge base ee990d7 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jul 1, 2026
@meta-codesync

meta-codesync Bot commented Jul 1, 2026

Contributor

@psiddh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110373581.

@github-actions

github-actions Bot commented Jul 1, 2026


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@psiddh psiddh requested a review from rascani July 1, 2026 21:07

Copilot AI left a comment

Contributor


Pull request overview

Improves robustness of the Android NDK download step used by Qualcomm (QNN) CI setup by avoiding flaky HTTP/2 behavior and adding retry + archive validation before extraction.

Changes:

  • Forces NDK download via HTTP/1.1 and enables retries for transport-level failures (--retry-all-errors).
  • Wraps the NDK download in a multi-attempt loop that re-downloads from scratch and validates the zip with unzip -tq before extracting.


Comment on lines +35 to +41

    for attempt in 1 2 3 4 5; do
        rm -f "/tmp/${NDK_ZIP}"
        curl --fail --http1.1 --retry 3 --retry-delay 5 --retry-connrefused --retry-all-errors \
            -Lo "/tmp/${NDK_ZIP}" "https://dl.google.com/android/repository/${NDK_ZIP}" || true
        if unzip -tq "/tmp/${NDK_ZIP}" >/dev/null 2>&1; then
            break
        fi
        echo "NDK download/verify failed (attempt ${attempt}), retrying..."
        sleep 5
    done
    unzip -q "/tmp/${NDK_ZIP}" -d "${NDK_INSTALL_DIR}"

@shewu-quic shewu-quic left a comment

Collaborator


Thanks for your fix.

@psiddh psiddh merged commit 01c25f8 into pytorch:main Jul 2, 2026
192 of 195 checks passed

Labels

CLA Signed, meta-exported


4 participants