Skip to content

[MIOpen] Enable grouped CK xdlops solvers for large-stride tensors (ROCM-23997)#7663

Open
JonathanLichtnerAMD wants to merge 8 commits into
developfrom
users/jlichtne/ROCM-23997-enable-ck-for-large-stride-tensors
Open

[MIOpen] Enable grouped CK xdlops solvers for large-stride tensors (ROCM-23997)#7663
JonathanLichtnerAMD wants to merge 8 commits into
developfrom
users/jlichtne/ROCM-23997-enable-ck-for-large-stride-tensors

Conversation

@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor

@JonathanLichtnerAMD JonathanLichtnerAMD commented May 20, 2026

Motivation

PyTorch started dispatching large convolutions to MIOpen as of pytorch 2.11/2.12, but the MIOpen support is limited and requires additional work on the MIOpen side to fully support the large convolutions. In particular, the six grouped CK xdlops solvers (ConvHipImplicitGemmGroup{Fwd,Bwd,Wrw}Xdlops and their 3D counterparts) are marked inapplicable for convolutions whose tensor element-strides exceed the int32 range, such as (1,96,4736,4736) or (1,96,1024,1024,1024). Without these solvers, the remaining applicable options are limited to naive kernels (slow) and explicit GEMM (which can require very large workspaces).

As per ROCM-23997, this PR removes the int32 element-stride applicability restriction on the grouped CK xdlops solvers so that MIOpen can serve these shapes. Requires the matching CK changes in #7734 to be merged first.

Technical Details

  • Drop the AllTensorsDimsFitIntoInt() applicability guard from all six solvers.
  • Widen CKArgs dimension members from int to int64_t and length/stride arrays from std::array<ck::index_t,N> to std::array<ck::long_index_t,N> across the shared CRTP base (CKArgsSplitK), 2D Fwd CKArgs, and all three 3D CKArgs templates. The NCHW stride builder is the prime overflow site — operands are now int64 so the contiguous stride no longer wraps before assignment. CKConvDims in the common helpers is widened in lock step, and argument construction now binds to CK's long_index_t MakeArgumentPointer overload.
  • Widen the GetCKSplitkMaxWorkspaceSize accumulator from auto (deduced as int) to std::size_t so the max-reduction over per-instance workspace sizes cannot itself overflow when large-tensor instances report multi-GB workspaces.
  • Add a CK large-tensor instance filter (RequiresLargeTensorCKInstance / IsLargeTensorCKInstance) applied at every CK-instance enumeration site (FillValidKernelsIDs, IsCKApplicable, IsCKArgsSupported, IsCKSplitKSupported, GetCKSplitkMaxWorkspaceSize). Without the filter, most non-large-tensor CK impls silently narrow the int64 overload back to int32 and would corrupt kernel arguments for out-of-range strides. When no large-tensor instance is registered for a given shape, the solver is correctly marked inapplicable.
  • For CK Bwd/Wrw, the int32 narrowed length/stride arrays handed to MakeArgumentPointer are stored as a mutable member of CKArgs (2D CKArgsSplitK base and 3D CKArgs templates) rather than as locals in MakeArgPtr, because CK captures references into those arrays and the locals went out of scope before IsSupportedArgument ran. The narrowed bundle is populated lazily from MakeArgPtr rather than in the CKArgs constructor, so the debug-only overflow assert in ToCKIndexArray only runs for kernels that survived the large-tensor filter (CKArgs is constructed unconditionally in FillValidKernelsIDs before the filter).
  • projects/miopen/CHANGELOG.md: add Resolved-issues bullet.

Risk Assessment

Medium-low. Changes are scoped to the six grouped CK xdlops solvers and their shared CK helpers; widening from intint64_t / std::size_t is additive at every call site verified. The new large-tensor instance filter guarantees the solver is correctly marked inapplicable when no large-tensor CK instance is registered, so existing shapes continue to dispatch to the same kernels as before.

Test Plan

Automated Tests:

  • API-level solution-count gtests (conv_api_solution_count_{2d,3d}_large_stride) invoke each of the six solvers directly via miopenConvolution{Forward,BackwardData,BackwardWeights}CompileSolution to confirm CK is selected past the 2^31 element-stride boundary. Gated to gfx90A | gfx94X | gfx950 via IsTestSupportedByDevice.
  • Kernel-launching numerical tests for all six solvers on shapes whose element-strides exceed INT_MAX (2D: 1x96x4736x4736; 3D: 1x96x512x512x88), so the CPU reference compare catches int32 wrap-around inside the CK kernel itself. One unit_conv_solver_ConvHipImplicitGemm{,3D}Group{Fwd,Bwd,Wrw}Xdlops_LargeStride.cpp per solver. Excluded from standard/pre-checkin via test_categories.yaml (~4–5+ GB each) and run in comprehensive/full; gated to gfx90A | gfx94X | gfx950.
  • Four single-position TestCheckDimsFitIntoInt boundary cases (INT_MAX and INT_MAX+1 in one length and one stride slot) so the strict-> semantics is greppable and survives future refactors.

Manual tests:

  • Run ASAN against the new tests
  • Run the UnitTestConvSolverGroupXDlops large-stride tests manually on gfx90a
  • Run the UnitTestConvSolverGroupXDlops large-stride tests manually on gfx942
  • Run the UnitTestConvSolverGroupXDlops large-stride tests manually on gfx950

Submission Checklist

…OCM-23997)

Allow the six grouped CK xdlops solvers to handle convolutions whose
tensor element-strides exceed the int32 range:

  - ConvHipImplicitGemmGroup{Fwd,Bwd,Wrw}Xdlops       (2D)
  - ConvHipImplicitGemm3DGroup{Fwd,Bwd,Wrw}Xdlops     (3D)

Changes:

* Remove the AllTensorsDimsFitIntoInt() applicability guard from all six
  solvers.

* Widen CKArgs dimension members from int to int64_t and length/stride
  arrays from std::array<ck::index_t,N> to std::array<ck::long_index_t,N>
  across the shared CRTP base (CKArgsSplitK, used by 2D Bwd/Wrw), the 2D
  Fwd CKArgs, and all three 3D CKArgs templates (Fwd/Bwd/Wrw). The NCHW
  stride builder is the prime overflow site for shapes like
  (1,96,4736,4736) or (1,96,1024,1024,1024) — operands are now int64 so
  the contiguous stride no longer wraps before assignment. CKConvDims in
  the common helpers is widened in lock step. Argument construction now
  binds to CK's long_index_t MakeArgumentPointer overload.

* Widen the GetCKSplitkMaxWorkspaceSize accumulator from \`auto\` (deduced
  as int) to std::size_t so the max-reduction over per-instance workspace
  sizes cannot itself overflow when large-tensor instances report
  multi-GB workspaces.

* Add a CK large-tensor instance filter
  (RequiresLargeTensorCKInstance / IsLargeTensorCKInstance) applied at
  every CK-instance enumeration site:
    - FillValidKernelsIDs         (GetSolution path)
    - IsCKApplicable              (IsApplicable path)
    - IsCKArgsSupported           (Check*ArgSupported* / HeuristicInit path)
    - IsCKSplitKSupported         (per-dtype; Generic dispatch inherits)
    - GetCKSplitkMaxWorkspaceSize (Bwd/Wrw workspace sizing)
  Without the filter, most non-large-tensor CK impls silently narrow the
  int64 overload back to int32 and would corrupt kernel arguments for
  out-of-range strides; in debug builds the ToCKIndexArray assert in
  ck_grouped_conv_impl_helpers.hpp catches the narrow. When the
  predicate fires and no large-tensor instance is registered, the solver
  is correctly marked inapplicable.

Tests:

* Add gtest unit tests (conv_api_solution_count_{2d,3d}_large_stride)
  that invoke each of the six solvers directly via
  miopenConvolution{Forward,BackwardData,BackwardWeights}CompileSolution
  to confirm CK is selected past the 2^31 element-stride boundary.
  Shapes with no matching CK kernel instance are listed in IsKnownFailing
  helpers and skipped. The IsKnownFailing* lists were hand-tuned for
  gfx942 CK tile coverage, so these sweeps are gated to the CI-covered
  arches (gfx90A | gfx94X | gfx950 | gfx115X) via IsTestSupportedByDevice
  in the shared RunCompile{Fwd,BwdData,Wrw} helpers; non-allowlisted
  arches SKIP rather than emit stale FAILED lines.

* Add kernel-launching numerical Fwd tests on shapes whose
  element-strides exceed INT_MAX (2D: 1x96x4736x4736; 3D:
  1x96x512x512x88) via UnitTestConvSolverGroupXDlops, so the CPU
  reference compare catches int32 wrap-around inside the CK kernel
  itself. These are excluded from standard/pre-checkin via
  test_categories.yaml (~4-5+ GB each) and run in comprehensive/full;
  skipped on memory-constrained systems. Gated to gfx90A | gfx94X (CI
  covered and manually qualified for the large-stride kernel-launch
  path) via UnitTestConvSolverParams::supported_devs.

* Add four single-position TestCheckDimsFitIntoInt boundary cases
  (INT_MAX and INT_MAX+1 in one length and one stride slot) so the
  strict-\`>\` semantics is greppable and survives future refactors.

Add a Resolved-issues bullet to projects/miopen/CHANGELOG.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
JonathanLichtnerAMD and others added 6 commits May 21, 2026 08:33
…eserve filter contract

The int32 narrowed length/stride arrays handed to CK BWD/WRW
MakeArgumentPointer must outlive the returned Argument (CK captures
references). Store them as a member of CKArgs instead of as locals in
MakeArgPtr, which previously caused stack-use-after-scope under ASAN.

Populate the bundle lazily via a const method on the (now `mutable`)
member, not in the constructor. CKArgs is constructed unconditionally
in FillValidKernelsIDs before the RequiresLargeTensorCKInstance filter
runs; narrowing in the constructor would trip ToCKIndexArray's
debug-only overflow assert on >INT_MAX shapes even when no kernel is
ultimately selected. Lazy population means narrowing only runs from
MakeArgPtr, which is only reached for kernels that survived the filter
-- restoring the original safety contract.

Applies to 2D BWD/WRW (CKArgsSplitK base) and 3D BWD/WRW CKArgs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The IsKnownFailing* lists in conv_api_solution_count_{2d,3d}_large_stride
are hand-tuned for gfx942 CK tile coverage. A gfx1151 CI run produced 26
failures across both sub- and >INT_MAX shapes (including the smallest 3D
FP32 shape), because the CK *Xdlops solvers target CDNA MFMA and have not
been characterized on RDNA 3.5. Remove gfx115X from IsArchInCiAllowlist
so these tests SKIP on gfx115X instead of FAILing, matching the header's
documented design intent. Re-add once gfx115X CK coverage is characterized
and the IsKnownFailing* lists grow per-arch entries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s gfx90a/gfx942/gfx950

CK upstream has registered large-tensor xdlops kernels for the >INT_MAX
element-stride paths on gfx90a/gfx942/gfx950, so the per-shape
IsKnownFailing* predicates in the 2D/3D API sweeps are no longer needed
and the unit tests can be widened to include gfx950 alongside gfx90A/gfx94X.

- Drop IsKnownFailing{Fwd,BwdData,Wrw}{2D,3D} predicates and MatchesDhw
  helper; simplify Run* callsites accordingly.
- Drop the KnownFailingFn template parameter and per-shape GTEST_SKIP
  block from RunCompile{Fwd,BwdData,Wrw} in the shared header.
- Add Gpu::gfx950 to supportedDevices in the two
  ConvHipImplicitGemm{,3D}GroupFwdXdlops_LargeStride unit tests.
- Refresh stale comments that referenced the removed predicates.

Verified on gfx942: 606/606 tests pass across all four binaries with
zero SKIPPED and zero FAILED.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds end-to-end kernel-launch tests for the grouped CK xdlops BackwardData
and BackwardWeights solvers on shapes whose element-stride exceeds INT_MAX,
mirroring the existing Forward LargeStride coverage. These complement the
API-level applicability sweeps by catching int32 wraparound that can occur
inside the CK kernel even after MIOpen's host-side widening. Excluded from
the per-PR test category via test_categories.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

This PR can be be merged once the associated CK changes in PR #7734 are merged. (Note that the added tests will prevent this from being accidentally merged.)

@JonathanLichtnerAMD
Copy link
Copy Markdown
Contributor Author

This might also need ROCm/TheRock#5507 or there might be a risk of test timeouts on nightlies and rock bumps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant