Skip to content

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448

Open
SunsetStand wants to merge 8 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu
Open

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448
SunsetStand wants to merge 8 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu

Conversation

@SunsetStand

@SunsetStand SunsetStand commented Jun 7, 2026

Copy link
Copy Markdown

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

This is a new feature — no existing issue. A CPU-vs-GPU correctness and performance benchmark is provided in the PR description below. An issue can be opened for discussion if preferred.

Unit Tests and/or Case Tests for my changes

A standalone benchmark (ofdft_cuda/) was used to verify correctness (GPU vs FFTW3 CPU reference, error < 1e-7 for WT KEDF) and measure performance (14.2× speedup at 96³ grid on RTX 4060). Integration into ABACUS's existing GPU CI pipeline (e.g., tests/integrate/ GPU OFDFT cases) is planned as a follow-up once CI GPU runners are confirmed available for this module.

What's changed?

This PR adds GPU acceleration for the WT KEDF multi_kernel() function, which is the most expensive single operation in OFDFT Wang-Teter calculations (up to 40% of total SCF time). The implementation:

  1. Uses ABACUS's existing GPU infrastructurepw_rho->real2recip_gpu() / recip2real_gpu() for FFT and memory_op for device memory management. No new external dependencies.
  2. Adds a single CUDA kernel (kedf_wt_recip_multiply) for element-wise G-space kernel multiplication, following the same pattern as existing GPU kernels in source_base/kernels/cuda/.
  3. Persistent GPU buffers are lazily allocated on first call and reused across SCF iterations. The WT kernel array (kernel_) is copied to device once since it is constant throughout the SCF cycle.
  4. Zero overhead when CUDA is disabled — all GPU code is guarded by #ifdef __CUDA and the CPU path is completely untouched.
    The GPU dispatch is a simple 5-line addition at the top of multi_kernel():
#ifdef __CUDA
    if (pw_rho->device == "gpu") {
        this->multi_kernel_gpu(prho, rkernel_rho, exponent, pw_rho);
        return;
    }
#endif

Performance: on an RTX 4060 Laptop GPU, the GPU path achieves 14.2× speedup for WT KEDF at typical OFDFT grid sizes (96³) compared to FFTW3 CPU, with correctness verified to < 1e-7 relative error. A full benchmark report is available in the standalone prototype (examples/ or as supplementary material upon request).

Any changes of core modules? (ignore if not applicable)

N/A — only modifies the OFDFT KEDF module (source_pw/module_ofdft/), which is not a core ESolver/Hamilt/Operator module.

@SunsetStand SunsetStand force-pushed the feature/ofdft-kedf-gpu branch 2 times, most recently from 25c2618 to d62a3f6 Compare June 7, 2026 03:19
Add GPU backend for KEDF_WT::multi_kernel() using cuFFT via
PW_Basis _gpu interface. Key changes:

- kedf_wt_gpu.cu: single CUDA kernel (kedf_wt_recip_multiply) for
  G-space element-wise kernel multiplication, plus multi_kernel_gpu()
  method that pipelines real2recip → kernel multiply → recip2real
  entirely on GPU. Persistent buffers allocated via memory_op.

- kedf_wt.h: GPU method declarations and buffer members under
  #ifdef __CUDA guard (zero overhead when CUDA disabled).

- kedf_wt.cpp: GPU dispatch at top of multi_kernel() — when
  pw_rho->device == "gpu", delegates to multi_kernel_gpu().

- source/CMakeLists.txt: add kedf_wt_gpu.cu to USE_CUDA block.

Design follows existing ABACUS GPU patterns (memory_op for device
memory, thrust::complex in kernels, CHECK_CUDA_SYNC for safety).
@SunsetStand SunsetStand force-pushed the feature/ofdft-kedf-gpu branch from d62a3f6 to b93c9cd Compare June 7, 2026 03:23
- kedf_wt.h: #include <cufft.h> was erroneously inside the class body
  (both in destructor and private section). This caused the cuFFT header
  extern "C" block to appear inside a C++ class definition, triggering
  "linkage specification is not allowed" and all cuFFT types undeclared.
  Moved the include to file scope, guarded by #ifdef __CUDA.

- kedf_wt_gpu.cu: d_result_ is double* but resmem_zd_op/delmem_zd_op are
  typed std::complex<double>*. Changed to resmem_dd_op/delmem_dd_op
  (nrxx*2 doubles = nrxx complex doubles).
@mohanchen

Copy link
Copy Markdown
Collaborator

Nice try, could you provide some tests/examples in the PR? you can check out /tests/07_OFDFT

@mohanchen mohanchen added Refactor Refactor ABACUS codes Features Needed The features are indeed needed, and developers should have sophisticated knowledge GPU & DCU & HPC GPU and DCU and HPC related any issues and removed Refactor Refactor ABACUS codes labels Jun 9, 2026
- Add test directory with INPUT (device=gpu), STRU, KPT, result.ref
- Test identical to 09_OF_KE_WT but exercises GPU code path
- Add CASES_GPU.txt for GPU test discovery
- GPU results should match CPU reference within tolerance
@SunsetStand

Copy link
Copy Markdown
Author

Thanks for the review! I've added a GPU WT KEDF test case:

Test location: tests/07_OFDFT/31_OF_KE_WT_GPU/

It mirrors 09_OF_KE_WT (Al FCC, WT KEDF, symmetry=on) with device gpu added to INPUT. The GPU path performs identical math—cuFFT replaces CPU FFT + GPU kernels for element-wise ops—so results should match the CPU reference within tolerance.

Comment thread source/CMakeLists.txt Outdated
sunliang98 and others added 4 commits June 9, 2026 21:26
Per reviewer request (sunliang98): keep GPU kernel files organized
under kernels/cuda/ subdirectory, consistent with other ABACUS modules.
After moving kedf_wt_gpu.cu to kernels/cuda/, the bare include
#include "kedf_wt.h" no longer resolves since the header is now in the
parent directory. Use full module path consistent with other CUDA
kernel files (e.g., module_pwdft/kernels/cuda/*.cu).
…exponent

Replace thrust::complex<double> with native double2 (cufftDoubleComplex)
to eliminate AoS memory layout overhead (50% bandwidth waste from
unused imag component). Add grid-stride loops for flexible occupancy.
Move rho^exponent (std::pow) from CPU to GPU, eliminating one H→D
transfer per SCF iteration.

Kernel changes:
- kedf_wt_rho_power (new): GPU-side pow() replaces CPU loop
- kedf_wt_recip_multiply: double2 replaces thrust::complex, grid-stride
- kedf_wt_real_to_complex: double2 + grid-stride
- kedf_wt_complex_to_real_norm: double2 + grid-stride

Benchmark (RTX 4060 Laptop, 96^3 grid): ~3.3x end-to-end speedup vs
thrust::complex baseline. Kernel-only section: ~76% faster.
See wt_kernel_opt/ standalone benchmark for full comparison.

Thread coarsening (4x) was tested but showed regression on Ada Lovelace
(SM 8.9) — fewer active warps reduced latency hiding for memory-bound
kernels. Left for future architecture-specific tuning.
@SunsetStand

SunsetStand commented Jun 10, 2026

Copy link
Copy Markdown
Author

Optimizations applied (commit 4129f9f)
Three targeted optimizations for the WT KEDF GPU kernels:
double2 replaces thrust::complex (AoS → native type)
thrust::complex uses AoS layout {real, imag} but real_to_complex only writes imag=0 and complex_to_real_norm only reads real part — wasting 50% memory bandwidth. Using native double2 (layout-compatible with cufftDoubleComplex) eliminates the abstraction overhead and enables better compiler vectorization. No API breakage, double2 is trivially castable to cufftDoubleComplex*.
Grid-stride loops
Replaced if (idx >= n) return with for (i = idx; i < n; i += stride) for flexible occupancy across arbitrary grid sizes.
GPU-side rho^exponent
Moved std::pow(rho[i], exponent) from CPU to a GPU kernel, eliminating one H→D transfer per SCF iteration. In the ABACUS WT KEDF call chain, prho needs one H→D copy regardless — the GPU pow kernel runs on the copied data directly, avoiding the CPU compute + second transfer.
Benchmark (RTX 4060 Laptop, standalone benchmark)

Grid Original Optimized Speedup
32³ 0.57 ms 0.32 ms 1.78×
64³ 3.34 ms 1.13 ms 2.96×
96³ 11.1 ms 3.42 ms 3.25×

Kernel-only (data already on GPU): 1.76× for 96³.
What was tested but NOT included
Thread coarsening (4 elements/thread) showed a ~4% regression on Ada Lovelace (SM 8.9) — fewer active warps reduced the GPU's ability to hide memory latency for these memory-bound kernels. Left for future architecture-specific tuning.
Correctness
Max relative error vs. original output: ~1e-16 for 32³/64³ grids. No existing test changes needed.

@SunsetStand SunsetStand requested a review from sunliang98 June 10, 2026 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Features Needed The features are indeed needed, and developers should have sophisticated knowledge GPU & DCU & HPC GPU and DCU and HPC related any issues project_learning

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants