[MIOpen] Enable grouped CK xdlops solvers for large-stride tensors (ROCM-23997)#7663
Open
JonathanLichtnerAMD wants to merge 8 commits into
Open
Conversation
…OCM-23997)
Allow the six grouped CK xdlops solvers to handle convolutions whose
tensor element-strides exceed the int32 range:
- ConvHipImplicitGemmGroup{Fwd,Bwd,Wrw}Xdlops (2D)
- ConvHipImplicitGemm3DGroup{Fwd,Bwd,Wrw}Xdlops (3D)
Changes:
* Remove the AllTensorsDimsFitIntoInt() applicability guard from all six
solvers.
* Widen CKArgs dimension members from int to int64_t and length/stride
arrays from std::array<ck::index_t,N> to std::array<ck::long_index_t,N>
across the shared CRTP base (CKArgsSplitK, used by 2D Bwd/Wrw), the 2D
Fwd CKArgs, and all three 3D CKArgs templates (Fwd/Bwd/Wrw). The NCHW
stride builder is the prime overflow site for shapes like
(1,96,4736,4736) or (1,96,1024,1024,1024) — operands are now int64 so
the contiguous stride no longer wraps before assignment. CKConvDims in
the common helpers is widened in lock step. Argument construction now
binds to CK's long_index_t MakeArgumentPointer overload.
* Widen the GetCKSplitkMaxWorkspaceSize accumulator from \`auto\` (deduced
as int) to std::size_t so the max-reduction over per-instance workspace
sizes cannot itself overflow when large-tensor instances report
multi-GB workspaces.
* Add a CK large-tensor instance filter
(RequiresLargeTensorCKInstance / IsLargeTensorCKInstance) applied at
every CK-instance enumeration site:
- FillValidKernelsIDs (GetSolution path)
- IsCKApplicable (IsApplicable path)
- IsCKArgsSupported (Check*ArgSupported* / HeuristicInit path)
- IsCKSplitKSupported (per-dtype; Generic dispatch inherits)
- GetCKSplitkMaxWorkspaceSize (Bwd/Wrw workspace sizing)
Without the filter, most non-large-tensor CK impls silently narrow the
int64 overload back to int32 and would corrupt kernel arguments for
out-of-range strides; in debug builds the ToCKIndexArray assert in
ck_grouped_conv_impl_helpers.hpp catches the narrow. When the
predicate fires and no large-tensor instance is registered, the solver
is correctly marked inapplicable.
Tests:
* Add gtest unit tests (conv_api_solution_count_{2d,3d}_large_stride)
that invoke each of the six solvers directly via
miopenConvolution{Forward,BackwardData,BackwardWeights}CompileSolution
to confirm CK is selected past the 2^31 element-stride boundary.
Shapes with no matching CK kernel instance are listed in IsKnownFailing
helpers and skipped. The IsKnownFailing* lists were hand-tuned for
gfx942 CK tile coverage, so these sweeps are gated to the CI-covered
arches (gfx90A | gfx94X | gfx950 | gfx115X) via IsTestSupportedByDevice
in the shared RunCompile{Fwd,BwdData,Wrw} helpers; non-allowlisted
arches SKIP rather than emit stale FAILED lines.
* Add kernel-launching numerical Fwd tests on shapes whose
element-strides exceed INT_MAX (2D: 1x96x4736x4736; 3D:
1x96x512x512x88) via UnitTestConvSolverGroupXDlops, so the CPU
reference compare catches int32 wrap-around inside the CK kernel
itself. These are excluded from standard/pre-checkin via
test_categories.yaml (~4-5+ GB each) and run in comprehensive/full;
skipped on memory-constrained systems. Gated to gfx90A | gfx94X (CI
covered and manually qualified for the large-stride kernel-launch
path) via UnitTestConvSolverParams::supported_devs.
* Add four single-position TestCheckDimsFitIntoInt boundary cases
(INT_MAX and INT_MAX+1 in one length and one stride slot) so the
strict-\`>\` semantics is greppable and survives future refactors.
Add a Resolved-issues bullet to projects/miopen/CHANGELOG.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arge-stride-tensors
…eserve filter contract The int32 narrowed length/stride arrays handed to CK BWD/WRW MakeArgumentPointer must outlive the returned Argument (CK captures references). Store them as a member of CKArgs instead of as locals in MakeArgPtr, which previously caused stack-use-after-scope under ASAN. Populate the bundle lazily via a const method on the (now `mutable`) member, not in the constructor. CKArgs is constructed unconditionally in FillValidKernelsIDs before the RequiresLargeTensorCKInstance filter runs; narrowing in the constructor would trip ToCKIndexArray's debug-only overflow assert on >INT_MAX shapes even when no kernel is ultimately selected. Lazy population means narrowing only runs from MakeArgPtr, which is only reached for kernels that survived the filter -- restoring the original safety contract. Applies to 2D BWD/WRW (CKArgsSplitK base) and 3D BWD/WRW CKArgs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The IsKnownFailing* lists in conv_api_solution_count_{2d,3d}_large_stride
are hand-tuned for gfx942 CK tile coverage. A gfx1151 CI run produced 26
failures across both sub- and >INT_MAX shapes (including the smallest 3D
FP32 shape), because the CK *Xdlops solvers target CDNA MFMA and have not
been characterized on RDNA 3.5. Remove gfx115X from IsArchInCiAllowlist
so these tests SKIP on gfx115X instead of FAILing, matching the header's
documented design intent. Re-add once gfx115X CK coverage is characterized
and the IsKnownFailing* lists grow per-arch entries.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arge-stride-tensors
…s gfx90a/gfx942/gfx950
CK upstream has registered large-tensor xdlops kernels for the >INT_MAX
element-stride paths on gfx90a/gfx942/gfx950, so the per-shape
IsKnownFailing* predicates in the 2D/3D API sweeps are no longer needed
and the unit tests can be widened to include gfx950 alongside gfx90A/gfx94X.
- Drop IsKnownFailing{Fwd,BwdData,Wrw}{2D,3D} predicates and MatchesDhw
helper; simplify Run* callsites accordingly.
- Drop the KnownFailingFn template parameter and per-shape GTEST_SKIP
block from RunCompile{Fwd,BwdData,Wrw} in the shared header.
- Add Gpu::gfx950 to supportedDevices in the two
ConvHipImplicitGemm{,3D}GroupFwdXdlops_LargeStride unit tests.
- Refresh stale comments that referenced the removed predicates.
Verified on gfx942: 606/606 tests pass across all four binaries with
zero SKIPPED and zero FAILED.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds end-to-end kernel-launch tests for the grouped CK xdlops BackwardData and BackwardWeights solvers on shapes whose element-stride exceeds INT_MAX, mirroring the existing Forward LargeStride coverage. These complement the API-level applicability sweeps by catching int32 wraparound that can occur inside the CK kernel even after MIOpen's host-side widening. Excluded from the per-PR test category via test_categories.yaml. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Author
|
This PR can be be merged once the associated CK changes in PR #7734 are merged. (Note that the added tests will prevent this from being accidentally merged.) |
Contributor
Author
|
This might also need ROCm/TheRock#5507 or there might be a risk of test timeouts on nightlies and rock bumps. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
PyTorch started dispatching large convolutions to MIOpen as of pytorch 2.11/2.12, but the MIOpen support is limited and requires additional work on the MIOpen side to fully support the large convolutions. In particular, the six grouped CK xdlops solvers (
ConvHipImplicitGemmGroup{Fwd,Bwd,Wrw}Xdlopsand their 3D counterparts) are marked inapplicable for convolutions whose tensor element-strides exceed the int32 range, such as(1,96,4736,4736)or(1,96,1024,1024,1024). Without these solvers, the remaining applicable options are limited to naive kernels (slow) and explicit GEMM (which can require very large workspaces).As per ROCM-23997, this PR removes the int32 element-stride applicability restriction on the grouped CK xdlops solvers so that MIOpen can serve these shapes. Requires the matching CK changes in #7734 to be merged first.
Technical Details
AllTensorsDimsFitIntoInt()applicability guard from all six solvers.CKArgsdimension members frominttoint64_tand length/stride arrays fromstd::array<ck::index_t,N>tostd::array<ck::long_index_t,N>across the shared CRTP base (CKArgsSplitK), 2D FwdCKArgs, and all three 3DCKArgstemplates. The NCHW stride builder is the prime overflow site — operands are now int64 so the contiguous stride no longer wraps before assignment.CKConvDimsin the common helpers is widened in lock step, and argument construction now binds to CK'slong_index_tMakeArgumentPointeroverload.GetCKSplitkMaxWorkspaceSizeaccumulator fromauto(deduced asint) tostd::size_tso the max-reduction over per-instance workspace sizes cannot itself overflow when large-tensor instances report multi-GB workspaces.RequiresLargeTensorCKInstance/IsLargeTensorCKInstance) applied at every CK-instance enumeration site (FillValidKernelsIDs,IsCKApplicable,IsCKArgsSupported,IsCKSplitKSupported,GetCKSplitkMaxWorkspaceSize). Without the filter, most non-large-tensor CK impls silently narrow the int64 overload back to int32 and would corrupt kernel arguments for out-of-range strides. When no large-tensor instance is registered for a given shape, the solver is correctly marked inapplicable.MakeArgumentPointerare stored as amutablemember ofCKArgs(2DCKArgsSplitKbase and 3DCKArgstemplates) rather than as locals inMakeArgPtr, because CK captures references into those arrays and the locals went out of scope beforeIsSupportedArgumentran. The narrowed bundle is populated lazily fromMakeArgPtrrather than in theCKArgsconstructor, so the debug-only overflow assert inToCKIndexArrayonly runs for kernels that survived the large-tensor filter (CKArgsis constructed unconditionally inFillValidKernelsIDsbefore the filter).projects/miopen/CHANGELOG.md: add Resolved-issues bullet.Risk Assessment
Medium-low. Changes are scoped to the six grouped CK xdlops solvers and their shared CK helpers; widening from
int→int64_t/std::size_tis additive at every call site verified. The new large-tensor instance filter guarantees the solver is correctly marked inapplicable when no large-tensor CK instance is registered, so existing shapes continue to dispatch to the same kernels as before.Test Plan
Automated Tests:
conv_api_solution_count_{2d,3d}_large_stride) invoke each of the six solvers directly viamiopenConvolution{Forward,BackwardData,BackwardWeights}CompileSolutionto confirm CK is selected past the 2^31 element-stride boundary. Gated to gfx90A | gfx94X | gfx950 viaIsTestSupportedByDevice.INT_MAX(2D:1x96x4736x4736; 3D:1x96x512x512x88), so the CPU reference compare catches int32 wrap-around inside the CK kernel itself. Oneunit_conv_solver_ConvHipImplicitGemm{,3D}Group{Fwd,Bwd,Wrw}Xdlops_LargeStride.cppper solver. Excluded from standard/pre-checkin viatest_categories.yaml(~4–5+ GB each) and run in comprehensive/full; gated to gfx90A | gfx94X | gfx950.TestCheckDimsFitIntoIntboundary cases (INT_MAXandINT_MAX+1in one length and one stride slot) so the strict->semantics is greppable and survives future refactors.Manual tests:
UnitTestConvSolverGroupXDlopslarge-stride tests manually on gfx90aUnitTestConvSolverGroupXDlopslarge-stride tests manually on gfx942UnitTestConvSolverGroupXDlopslarge-stride tests manually on gfx950Submission Checklist