Merged
2 changes: 1 addition & 1 deletion src/transformers/generation/utils.py
@@ -2006,7 +2006,7 @@ def _valid_auto_compile_criteria(
         cache = model_kwargs.get("past_key_values", model_kwargs.get("cache_params"))
 
         # Base logic
-        valid_hardware = self.device.type in ["cuda", "xpu"] or bool(
+        valid_hardware = self.device.type in ["cuda", "xpu", "neuron"] or bool(
Member


isn't it dependent on adding a full static-shape generation loop first?

Contributor


Hmm, not sure what you mean? I think this is just the general list of devices that support compile out of the box - you can also hack in CPU etc. with some private flag, iirc

The static shapes etc come later in the input preparation

Member


Will it not auto-compile, and then error out down the line due to dynamic inputs? From what I understood, this device cannot support full compile without completely static shapes.

Contributor

@vasqu Apr 14, 2026


See the line below: the condition requires valid hardware + a compileable cache --> if you don't set a static cache (and hence all the static prep), you are out of luck either way

There is no real dynamic thing going on
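The gate being described can be sketched roughly as follows. This is a simplified illustration of the logic, not the actual `_valid_auto_compile_criteria` implementation, and the cache stubs are made up for the example:

```python
# Illustrative sketch only (not the actual transformers code): auto-compile
# requires BOTH supported hardware AND a compileable (static) cache, so a
# dynamic cache never triggers compilation on its own.
def valid_auto_compile_criteria(device_type, cache, compile_config=None):
    valid_hardware = device_type in ["cuda", "xpu", "neuron"] or bool(
        compile_config is not None
        and getattr(compile_config, "_compile_all_devices", False)
    )
    using_compilable_cache = getattr(cache, "is_compileable", False)
    return valid_hardware and using_compilable_cache


class StaticCacheStub:
    is_compileable = True  # stands in for a static, compile-friendly cache

class DynamicCacheStub:
    is_compileable = False  # stands in for a dynamic cache
```

With a dynamic cache the gate stays closed even on supported hardware, which is the "out of luck either way" point above.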

Contributor


Ok, discussed internally, now I understand it: with this we enable compile for neuron when a static cache is set, but there are still dynamic traces within the whole generate loop, so it potentially doesn't make sense to add this yet - we should rather wait for feature completeness first. That's at least my current understanding.

Contributor


For testing purposes, we can enable via the private flags within the compile config
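The private escape hatch mentioned here is the `_compile_all_devices` flag visible in the diff. A toy sketch of how such a flag forces the hardware gate open for testing (`SimpleCompileConfig` is hypothetical, only the flag from the diff is modeled):

```python
from dataclasses import dataclass

# Hypothetical stand-in for the compile config; only the private override
# flag seen in the diff is modeled here.
@dataclass
class SimpleCompileConfig:
    _compile_all_devices: bool = False

def hardware_gate(device_type, compile_config=None):
    # Mirrors the shape of the `valid_hardware` expression in the diff.
    return device_type in ["cuda", "xpu", "neuron"] or bool(
        compile_config is not None and compile_config._compile_all_devices
    )
```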

Contributor Author


Sorry, I think there might be a misunderstanding. The compilation is applied to the forward call, not the generation loop, right? So having a compile-friendly generation loop is not a prerequisite (quite the opposite, actually - the loop is far from compile-friendly, even for CUDA and XPU).

Contributor Author


And the forward IS compile-friendly, even for neuron.

Member


> And the forward IS compile-friendly, even for neuron.

Ahh right, we compile the forward of the decoding stage with 1 new token, sorry

Contributor


Oops, thanks for clarifying. We messed up 😬 it's true that we only compile the forward on decode nowadays (this was different before, when we also compiled more parts).

             generation_config.compile_config is not None and generation_config.compile_config._compile_all_devices
         )
         # Note: for some models that only use linear attention (e.g. Mamba), even a DynamicCache is compileable since all