[MIOpen] Enable grouped CK xdlops solvers for large-stride tensors (ROCM-23997) by JonathanLichtnerAMD · Pull Request #7663 · ROCm/rocm-libraries

JonathanLichtnerAMD · 2026-05-20T23:47:30Z

Motivation

PyTorch started dispatching large convolutions to MIOpen as of pytorch 2.11/2.12, but the MIOpen support is limited and requires additional work on the MIOpen side to fully support the large convolutions. In particular, the six grouped CK xdlops solvers (ConvHipImplicitGemmGroup{Fwd,Bwd,Wrw}Xdlops and their 3D counterparts) are marked inapplicable for convolutions whose tensor element-strides exceed the int32 range, such as (1,96,4736,4736) or (1,96,1024,1024,1024). Without these solvers, the remaining applicable options are limited to naive kernels (slow) and explicit GEMM (which can require very large workspaces).

As per ROCM-23997, this PR removes the int32 element-stride applicability restriction on the grouped CK xdlops solvers so that MIOpen can serve these shapes. Requires the matching CK changes in #7734 to be merged first.

Technical Details

Drop the AllTensorsDimsFitIntoInt() applicability guard from all six solvers.
Widen CKArgs dimension members from int to int64_t and length/stride arrays from std::array<ck::index_t,N> to std::array<ck::long_index_t,N> across the shared CRTP base (CKArgsSplitK), 2D Fwd CKArgs, and all three 3D CKArgs templates. The NCHW stride builder is the prime overflow site — operands are now int64 so the contiguous stride no longer wraps before assignment. CKConvDims in the common helpers is widened in lock step, and argument construction now binds to CK's long_index_t MakeArgumentPointer overload.
Widen the GetCKSplitkMaxWorkspaceSize accumulator from auto (deduced as int) to std::size_t so the max-reduction over per-instance workspace sizes cannot itself overflow when large-tensor instances report multi-GB workspaces.
Add a CK large-tensor instance filter (RequiresLargeTensorCKInstance / IsLargeTensorCKInstance) applied at every CK-instance enumeration site (FillValidKernelsIDs, IsCKApplicable, IsCKArgsSupported, IsCKSplitKSupported, GetCKSplitkMaxWorkspaceSize). Without the filter, most non-large-tensor CK impls silently narrow the int64 overload back to int32 and would corrupt kernel arguments for out-of-range strides. When no large-tensor instance is registered for a given shape, the solver is correctly marked inapplicable.
For CK Bwd/Wrw, the int32 narrowed length/stride arrays handed to MakeArgumentPointer are stored as a mutable member of CKArgs (2D CKArgsSplitK base and 3D CKArgs templates) rather than as locals in MakeArgPtr, because CK captures references into those arrays and the locals went out of scope before IsSupportedArgument ran. The narrowed bundle is populated lazily from MakeArgPtr rather than in the CKArgs constructor, so the debug-only overflow assert in ToCKIndexArray only runs for kernels that survived the large-tensor filter (CKArgs is constructed unconditionally in FillValidKernelsIDs before the filter).
projects/miopen/CHANGELOG.md: add Resolved-issues bullet.

Risk Assessment

Medium-low. Changes are scoped to the six grouped CK xdlops solvers and their shared CK helpers; widening from int → int64_t / std::size_t is additive at every call site verified. The new large-tensor instance filter guarantees the solver is correctly marked inapplicable when no large-tensor CK instance is registered, so existing shapes continue to dispatch to the same kernels as before.

Test Plan

Automated Tests:

API-level solution-count gtests (conv_api_solution_count_{2d,3d}_large_stride) invoke each of the six solvers directly via miopenConvolution{Forward,BackwardData,BackwardWeights}CompileSolution to confirm CK is selected past the 2^31 element-stride boundary. Gated to gfx90A | gfx94X | gfx950 via IsTestSupportedByDevice.
Kernel-launching numerical tests for all six solvers on shapes whose element-strides exceed INT_MAX (2D: 1x96x4736x4736; 3D: 1x96x512x512x88), so the CPU reference compare catches int32 wrap-around inside the CK kernel itself. One unit_conv_solver_ConvHipImplicitGemm{,3D}Group{Fwd,Bwd,Wrw}Xdlops_LargeStride.cpp per solver. Excluded from standard/pre-checkin via test_categories.yaml (~4–5+ GB each) and run in comprehensive/full; gated to gfx90A | gfx94X | gfx950.
Four single-position TestCheckDimsFitIntoInt boundary cases (INT_MAX and INT_MAX+1 in one length and one stride slot) so the strict-> semantics is greppable and survives future refactors.

Manual tests:

Run ASAN against the new tests
Run the UnitTestConvSolverGroupXDlops large-stride tests manually on gfx90a
Run the UnitTestConvSolverGroupXDlops large-stride tests manually on gfx942
Run the UnitTestConvSolverGroupXDlops large-stride tests manually on gfx950

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
[ci] Per-category timeouts for test matrix; migrate MIOpen TheRock#5507 has been merged and rocm-libraries bumped (or the slow tests disabled)
also need CK PR [CK] Grouped Convolution Global Load/Store instances #7734

…OCM-23997) Allow the six grouped CK xdlops solvers to handle convolutions whose tensor element-strides exceed the int32 range: - ConvHipImplicitGemmGroup{Fwd,Bwd,Wrw}Xdlops (2D) - ConvHipImplicitGemm3DGroup{Fwd,Bwd,Wrw}Xdlops (3D) Changes: * Remove the AllTensorsDimsFitIntoInt() applicability guard from all six solvers. * Widen CKArgs dimension members from int to int64_t and length/stride arrays from std::array<ck::index_t,N> to std::array<ck::long_index_t,N> across the shared CRTP base (CKArgsSplitK, used by 2D Bwd/Wrw), the 2D Fwd CKArgs, and all three 3D CKArgs templates (Fwd/Bwd/Wrw). The NCHW stride builder is the prime overflow site for shapes like (1,96,4736,4736) or (1,96,1024,1024,1024) — operands are now int64 so the contiguous stride no longer wraps before assignment. CKConvDims in the common helpers is widened in lock step. Argument construction now binds to CK's long_index_t MakeArgumentPointer overload. * Widen the GetCKSplitkMaxWorkspaceSize accumulator from \`auto\` (deduced as int) to std::size_t so the max-reduction over per-instance workspace sizes cannot itself overflow when large-tensor instances report multi-GB workspaces. * Add a CK large-tensor instance filter (RequiresLargeTensorCKInstance / IsLargeTensorCKInstance) applied at every CK-instance enumeration site: - FillValidKernelsIDs (GetSolution path) - IsCKApplicable (IsApplicable path) - IsCKArgsSupported (Check*ArgSupported* / HeuristicInit path) - IsCKSplitKSupported (per-dtype; Generic dispatch inherits) - GetCKSplitkMaxWorkspaceSize (Bwd/Wrw workspace sizing) Without the filter, most non-large-tensor CK impls silently narrow the int64 overload back to int32 and would corrupt kernel arguments for out-of-range strides; in debug builds the ToCKIndexArray assert in ck_grouped_conv_impl_helpers.hpp catches the narrow. When the predicate fires and no large-tensor instance is registered, the solver is correctly marked inapplicable. Tests: * Add gtest unit tests (conv_api_solution_count_{2d,3d}_large_stride) that invoke each of the six solvers directly via miopenConvolution{Forward,BackwardData,BackwardWeights}CompileSolution to confirm CK is selected past the 2^31 element-stride boundary. Shapes with no matching CK kernel instance are listed in IsKnownFailing helpers and skipped. The IsKnownFailing* lists were hand-tuned for gfx942 CK tile coverage, so these sweeps are gated to the CI-covered arches (gfx90A | gfx94X | gfx950 | gfx115X) via IsTestSupportedByDevice in the shared RunCompile{Fwd,BwdData,Wrw} helpers; non-allowlisted arches SKIP rather than emit stale FAILED lines. * Add kernel-launching numerical Fwd tests on shapes whose element-strides exceed INT_MAX (2D: 1x96x4736x4736; 3D: 1x96x512x512x88) via UnitTestConvSolverGroupXDlops, so the CPU reference compare catches int32 wrap-around inside the CK kernel itself. These are excluded from standard/pre-checkin via test_categories.yaml (~4-5+ GB each) and run in comprehensive/full; skipped on memory-constrained systems. Gated to gfx90A | gfx94X (CI covered and manually qualified for the large-stride kernel-launch path) via UnitTestConvSolverParams::supported_devs. * Add four single-position TestCheckDimsFitIntoInt boundary cases (INT_MAX and INT_MAX+1 in one length and one stride slot) so the strict-\`>\` semantics is greppable and survives future refactors. Add a Resolved-issues bullet to projects/miopen/CHANGELOG.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…arge-stride-tensors

…eserve filter contract The int32 narrowed length/stride arrays handed to CK BWD/WRW MakeArgumentPointer must outlive the returned Argument (CK captures references). Store them as a member of CKArgs instead of as locals in MakeArgPtr, which previously caused stack-use-after-scope under ASAN. Populate the bundle lazily via a const method on the (now `mutable`) member, not in the constructor. CKArgs is constructed unconditionally in FillValidKernelsIDs before the RequiresLargeTensorCKInstance filter runs; narrowing in the constructor would trip ToCKIndexArray's debug-only overflow assert on >INT_MAX shapes even when no kernel is ultimately selected. Lazy population means narrowing only runs from MakeArgPtr, which is only reached for kernels that survived the filter -- restoring the original safety contract. Applies to 2D BWD/WRW (CKArgsSplitK base) and 3D BWD/WRW CKArgs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The IsKnownFailing* lists in conv_api_solution_count_{2d,3d}_large_stride are hand-tuned for gfx942 CK tile coverage. A gfx1151 CI run produced 26 failures across both sub- and >INT_MAX shapes (including the smallest 3D FP32 shape), because the CK *Xdlops solvers target CDNA MFMA and have not been characterized on RDNA 3.5. Remove gfx115X from IsArchInCiAllowlist so these tests SKIP on gfx115X instead of FAILing, matching the header's documented design intent. Re-add once gfx115X CK coverage is characterized and the IsKnownFailing* lists grow per-arch entries. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…arge-stride-tensors

…s gfx90a/gfx942/gfx950 CK upstream has registered large-tensor xdlops kernels for the >INT_MAX element-stride paths on gfx90a/gfx942/gfx950, so the per-shape IsKnownFailing* predicates in the 2D/3D API sweeps are no longer needed and the unit tests can be widened to include gfx950 alongside gfx90A/gfx94X. - Drop IsKnownFailing{Fwd,BwdData,Wrw}{2D,3D} predicates and MatchesDhw helper; simplify Run* callsites accordingly. - Drop the KnownFailingFn template parameter and per-shape GTEST_SKIP block from RunCompile{Fwd,BwdData,Wrw} in the shared header. - Add Gpu::gfx950 to supportedDevices in the two ConvHipImplicitGemm{,3D}GroupFwdXdlops_LargeStride unit tests. - Refresh stale comments that referenced the removed predicates. Verified on gfx942: 606/606 tests pass across all four binaries with zero SKIPPED and zero FAILED. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Adds end-to-end kernel-launch tests for the grouped CK xdlops BackwardData and BackwardWeights solvers on shapes whose element-stride exceeds INT_MAX, mirroring the existing Forward LargeStride coverage. These complement the API-level applicability sweeps by catching int32 wraparound that can occur inside the CK kernel even after MIOpen's host-side widening. Excluded from the per-PR test category via test_categories.yaml. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

JonathanLichtnerAMD · 2026-05-29T21:29:59Z

This PR can be be merged once the associated CK changes in PR #7734 are merged. (Note that the added tests will prevent this from being accidentally merged.)

JonathanLichtnerAMD · 2026-05-29T22:01:07Z

This might also need ROCm/TheRock#5507 or there might be a risk of test timeouts on nightlies and rock bumps.

JonathanLichtnerAMD requested review from a team as code owners May 20, 2026 23:47

github-actions Bot added documentation project: miopen labels May 20, 2026

Reformat files

cbc4f96

assistant-librarian Bot added the organization: ROCm label May 21, 2026

JonathanLichtnerAMD and others added 6 commits May 21, 2026 08:33

Merge branch 'develop' into users/jlichtne/ROCM-23997-enable-ck-for-l…

58b36fb

…arge-stride-tensors

Merge branch 'develop' into users/jlichtne/ROCM-23997-enable-ck-for-l…

8f503fb

…arge-stride-tensors

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MIOpen] Enable grouped CK xdlops solvers for large-stride tensors (ROCM-23997)#7663

[MIOpen] Enable grouped CK xdlops solvers for large-stride tensors (ROCM-23997)#7663
JonathanLichtnerAMD wants to merge 8 commits into
developfrom
users/jlichtne/ROCM-23997-enable-ck-for-large-stride-tensors

JonathanLichtnerAMD commented May 20, 2026 •

edited

Loading

Uh oh!

JonathanLichtnerAMD commented May 29, 2026

Uh oh!

JonathanLichtnerAMD commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

JonathanLichtnerAMD commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Technical Details

Risk Assessment

Test Plan

Submission Checklist

Uh oh!

JonathanLichtnerAMD commented May 29, 2026

Uh oh!

JonathanLichtnerAMD commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

JonathanLichtnerAMD commented May 20, 2026 •

edited

Loading