Low-overhead stateless functional quantize/GEMM API (plain tensors in → out)

## Problem

A local large-scale LLM pre-training team finds TE's host-side overhead significant on small-shape and latency-sensitive paths, and has already bypassed TE to call the MXFP8/NVFP4 kernels directly from their own C++. The cost is structural:

1. autocast plus `FP8GlobalStateManager` state bookkeeping;
2. tensor-subclass `__torch_dispatch__`;
3. `Quantizer` objects passed per op, plus attribute reads;
4. Python↔C++ round-trips to construct subclass tensors.

`tex.quantize` doesn't avoid this: it still needs a quantizer and returns a subclass tensor.

## Request

A documented, supported, stateless functional API: plain `torch.Tensor` in → plain `torch.Tensor`(s) out (data + scale_inv + amax), bypassing the subclass/Quantizer/autocast machinery. Essentially a thin, blessed wrapper over the existing `nvte_*` C API.

Is a lightweight functional surface like this something the team would consider in principle, or is it intentionally out of scope for TE? We're mainly trying to gauge whether it's worth exploring further before we discuss possible ways to help move it along.

Low-overhead stateless functional quantize/GEMM API (plain tensors in → out) #3087
