Nvidia-runners S3 cache implementation#2053

Merged: Steboss merged 34 commits into main from sbosisio/custom-s3-cache-with-local-aws on May 20, 2026

@Steboss Steboss commented Apr 23, 2026

No description provided.

@Steboss Steboss requested a review from olupton April 24, 2026 15:26

Steboss commented Apr 24, 2026

@olupton
Here is an initial setup for using nvidia-enterprise-runners + S3 bucket.
At the moment I've used a test bucket within our JAX AWS space.

Here is a comparison between this implementation and the OCI one (I took the most recent CI run on main for the comparison):

| Process (AMD64) | OCI (s) | S3 cache (s) |
|---|---|---|
| Jaxlib wheel | 1017 | 721 |
| CUDA plugin wheel | 47 | 50 |
| CUDA PJRT wheel | 488 | 461 |
| Extra XLA tools | 26 | 25 |

| Process (ARM64) | OCI (s) | S3 cache (s) |
|---|---|---|
| Jaxlib wheel | 931 | 824 |
| CUDA plugin wheel | 104 | 70 |
| CUDA PJRT wheel | 771 | 543 |
| Extra XLA tools | 90 | 36 |
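
To make the comparison easier to read at a glance, the per-architecture totals and the relative saving can be computed from the numbers above. This is only a small sketch: the `TIMINGS` dictionary and the `total_saving` helper are illustrative names that do not come from the PR, and the figures are single CI runs, so the percentages should be read as indicative rather than as a benchmark.

```python
# Build times in seconds, copied from the comparison above as (OCI, S3 cache).
# The dictionary layout and function name are illustrative, not part of the PR.
TIMINGS = {
    "amd64": {
        "Jaxlib wheel": (1017, 721),
        "CUDA plugin wheel": (47, 50),
        "CUDA PJRT wheel": (488, 461),
        "Extra XLA tools": (26, 25),
    },
    "arm64": {
        "Jaxlib wheel": (931, 824),
        "CUDA plugin wheel": (104, 70),
        "CUDA PJRT wheel": (771, 543),
        "Extra XLA tools": (90, 36),
    },
}


def total_saving(steps):
    """Return (OCI total, S3 total, percentage of time saved by the S3 cache)."""
    oci_total = sum(oci for oci, _ in steps.values())
    s3_total = sum(s3 for _, s3 in steps.values())
    return oci_total, s3_total, 100.0 * (oci_total - s3_total) / oci_total


if __name__ == "__main__":
    for arch, steps in TIMINGS.items():
        oci_total, s3_total, pct = total_saving(steps)
        print(f"{arch}: OCI {oci_total} s, S3 cache {s3_total} s, {pct:.1f}% less time")
```

Summed over the four steps, this gives 1578 s against 1257 s on AMD64 and 1896 s against 1473 s on ARM64, roughly a fifth less build time in both cases.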

I also moved our tests to EKS, but we need to discuss which tests we need to run, whether the tests I've implemented make sense and preserve what we had before, and what to do with the nsys-jax tests. Could you have a look at this, please?
Thank you :)

@olupton olupton left a comment

https://github.com/NVIDIA/JAX-Toolbox/actions/runs/24884720555/job/72862175533?pr=2053#step:7:5245 looks like we still run through the whole build again for final, having already done it for mealkit. Maybe the runners' local cache directories are too small?

The logic to spin up the remote cache server proxying to S3 looks good.

Comment thread .github/actions/build-container/action.yml Outdated
Comment thread .github/actions/build-container/action.yml Outdated
Comment thread .github/eks-workflow-files/maxtext/test.yml Outdated
Comment thread .github/workflows/_ci.yaml
Comment thread .github/workflows/_ci.yaml Outdated
Comment thread .github/workflows/_ci.yaml Outdated
Comment thread .github/workflows/_ci.yaml
Comment thread .github/actions/build-container/action.yml Outdated

Steboss commented May 8, 2026

@olupton
A little update on the caching of mealkit for building the final JAX image.
After testing some unsuccessful solutions proposed by the nv-gha-runners team, I came up with this strategy. Here we are using a local registry to which mealkit is pushed (here to ghcr and localhost:5000, but we could push it to localhost:5000 only).
This saves time by avoiding the whole rebuild of final. Here is the example with amd64.
The approach is minimal and doesn't introduce more tech debt or breaking changes.
If you're OK with it, I can clean up this repo, introduce this localhost registry, and we will have a first solution for replacing OCI in the build-jax steps.
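
As a rough illustration of the double push described above, the sketch below derives the local-registry reference from a remote image reference, so the same mealkit image can be tagged for both destinations. The function names and the example image name are hypothetical and are not taken from the workflow files in this PR; only `localhost:5000` comes from the comment.

```python
def split_registry(ref):
    """Split an image reference into (registry, remainder).

    The first path component is treated as a registry host only if it looks
    like one (contains a dot or a port, or is "localhost").
    """
    first, sep, rest = ref.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return "", ref  # no explicit registry in the reference


def push_targets(remote_ref, local_registry="localhost:5000"):
    """Return the references to push: the remote one and its local mirror."""
    _, path = split_registry(remote_ref)
    return [remote_ref, f"{local_registry}/{path}"]


if __name__ == "__main__":
    # Hypothetical image name, used only to show the mapping.
    for target in push_targets("ghcr.io/example/jax:mealkit-amd64"):
        print(target)
```

The final build can then pull the mealkit from the local reference, while the remote reference remains available to other consumers.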


olupton commented May 11, 2026

> A little update on the caching of mealkit for building the final JAX image. After testing some unsuccessful solutions proposed by the nv-gha-runners team, I came up with this strategy.

Great!

> Here we are using a local registry to which mealkit is pushed (here to ghcr and localhost:5000, but we could push it to localhost:5000 only).

We have to keep pushing it to somewhere non-local because the JAX mealkit is the input to, for example, the MaxText mealkit.

> This saves time by avoiding the whole rebuild of final. Here is the example with amd64. The approach is minimal and doesn't introduce more tech debt or breaking changes. If you're OK with it, I can clean up this repo, introduce this localhost registry, and we will have a first solution for replacing OCI in the build-jax steps.

Let's try it!

We don't need to do this now, but we can speed this up more by enabling build caching for TE (already implemented internally).

@Steboss Steboss requested a review from olupton May 13, 2026 09:58

Steboss commented May 13, 2026

@olupton
I cleaned up the repo, fixed the process, and created the secrets and the Terraform side on AWS.
There's one part of this that I'm not especially excited about, but I think it may be the only practical option: adding a CI dispatch step.
According to the NVIDIA runners documentation, we need this to trigger CI for pull requests.
The cleanest alternative would be to adopt a strict branch naming convention, for example pull-request/sbosisio-something, so that CI can continue to trigger through push. But if we want to keep normal feature branch names, a dispatch-based approach seems like the most reliable solution.
Let me know what you think. Thank you :)
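
The trade-off between the two options can be sketched as a simple branch-name check. This is only an illustration of the naming-convention alternative, assuming that push-triggered CI runs on main and on branches under the `pull-request/` prefix; the function name and its defaults are hypothetical and do not reflect the actual workflow configuration.

```python
def triggers_on_push(branch, prefixes=("pull-request/",), always=("main",)):
    """Return True if a push to `branch` would start CI under the naming convention.

    Assumption: CI is push-triggered only for branches listed in `always`
    or starting with one of `prefixes`.
    """
    return branch in always or branch.startswith(prefixes)


if __name__ == "__main__":
    for name in (
        "main",
        "pull-request/sbosisio-something",
        "sbosisio/custom-s3-cache-with-local-aws",
    ):
        action = "runs on push" if triggers_on_push(name) else "needs a dispatch step"
        print(f"{name}: {action}")
```

Under this assumption, a normally named feature branch such as the one used for this PR would not start CI on push, which is the case the dispatch step is meant to cover.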

Comment thread .github/actions/build-container/action.yml Outdated
Comment thread .github/workflows/_ci.yaml Outdated
Comment thread .github/workflows/pr-dispatch-ci.yaml Outdated
@Steboss Steboss requested a review from olupton May 14, 2026 08:29
@Steboss Steboss mentioned this pull request May 14, 2026
Steboss added a commit that referenced this pull request May 18, 2026
Per #2053 we'll need `copy-pr-bot` app to run CI from pull requests. 
- @olupton could you check if we have already installed `copy-pr-bot`
app in JTB, please?
- added a list of trustees for this repo
- Any commits from now on must be signed

---------

Co-authored-by: Olli Lupton <olupton@nvidia.com>

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


Steboss commented May 18, 2026

> This pull request requires additional validation before any workflows can run on NVIDIA's runners.
>
> Pull request vetters can view their responsibilities here.
>
> Contributors can view more details about this message here.

@olupton it works

Comment thread .github/workflows/pr-dispatch-ci.yaml Outdated

olupton commented May 18, 2026

/ok to test 14f532b


Steboss commented May 19, 2026

/ok to test fde4e51

@Steboss Steboss requested a review from olupton May 19, 2026 10:15
olupton previously approved these changes May 20, 2026
Comment thread .github/workflows/_ci.yaml Outdated
@Steboss Steboss requested a review from olupton May 20, 2026 14:15
@Steboss Steboss merged commit 82f7d10 into main May 20, 2026
12 checks passed
@Steboss Steboss deleted the sbosisio/custom-s3-cache-with-local-aws branch May 20, 2026 15:15

2 participants