Low-overhead stateless functional quantize/GEMM API (plain tensors in → out) #3087

@cael-ling


Problem

A local large-scale LLM pre-training team finds TE's host-side overhead significant on small-shape, latency-sensitive paths, and has already bypassed TE to call the MXFP8/NVFP4 kernels directly from their own C++. The cost is structural: (1) autocast and FP8GlobalStateManager state bookkeeping, (2) tensor-subclass __torch_dispatch__, (3) Quantizer objects passed per op, plus the attribute reads on them, (4) Python↔C++ round-trips to construct subclass tensors. tex.quantize doesn't avoid this: it still requires a quantizer and returns a subclass tensor.

Request

A documented, supported stateless functional API: plain torch.Tensor in → plain torch.Tensor(s) out (data + scale_inv + amax), bypassing subclass/Quantizer/autocast. Essentially a thin blessed wrapper over the existing nvte_* C API.
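
To make the requested contract concrete, here is a minimal pure-Python model of "plain values in → (data, scale_inv, amax) out". It is only a sketch under assumptions: the names `quantize_fp8_reference`, `round_to_e4m3`, and `Quantized` are hypothetical and are not part of TE or the nvte_* C API. It uses simple per-tensor E4M3 scaling, not MXFP8 block scaling or NVFP4, and Python lists stand in for torch.Tensor so the example runs without a GPU.

```python
import math
from typing import List, NamedTuple

E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


class Quantized(NamedTuple):
    # Hypothetical return type: three plain values, no subclass, no hidden state.
    data: List[float]  # values on the E4M3 grid (kept as floats for readability)
    scale_inv: float   # multiply data by this to dequantize
    amax: float        # max |x| over the input


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (ties to even), saturating at +-448."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    _, e = math.frexp(mag)       # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)         # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)      # 3 mantissa bits give 8 steps per binade
    q = min(round(mag / step) * step, E4M3_MAX)
    return math.copysign(q, x)


def quantize_fp8_reference(x: List[float]) -> Quantized:
    """Stateless per-tensor quantize: no Quantizer object, no autocast context."""
    amax = max((abs(v) for v in x), default=0.0)
    scale = E4M3_MAX / amax if amax > 0.0 else 1.0
    data = [round_to_e4m3(v * scale) for v in x]
    return Quantized(data=data, scale_inv=1.0 / scale, amax=amax)


out = quantize_fp8_reference([0.5, -2.0, 1.0, 0.0])
dequantized = [d * out.scale_inv for d in out.data]
```

The point of the sketch is the shape of the call, not the arithmetic: everything the caller needs comes back as ordinary values, so nothing has to be read from global state or unpacked from a tensor subclass afterwards.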

Is a lightweight functional surface like this something the team would consider in principle, or is it intentionally out of scope for TE? Mainly trying to gauge whether it's worth exploring further before we discuss possible ways to help move it along.
