feat: GPU-accelerated WT KEDF multi_kernel convolution by SunsetStand · Pull Request #7448 · deepmodeling/abacus-develop

SunsetStand · 2026-06-07T03:06:43Z

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

This is a new feature — no existing issue. A CPU-vs-GPU correctness and performance benchmark is provided in the PR description below. An issue can be opened for discussion if preferred.

Unit Tests and/or Case Tests for my changes

A standalone benchmark (ofdft_cuda/) was used to verify correctness (GPU vs FFTW3 CPU reference, error < 1e-7 for WT KEDF) and measure performance (14.2× speedup at 96³ grid on RTX 4060). Integration into ABACUS's existing GPU CI pipeline (e.g., tests/integrate/ GPU OFDFT cases) is planned as a follow-up once CI GPU runners are confirmed available for this module.

What's changed?

This PR adds GPU acceleration for the WT KEDF multi_kernel() function, which is the most expensive single operation in OFDFT Wang-Teter calculations (up to 40% of total SCF time). The implementation:

Uses ABACUS's existing GPU infrastructure — pw_rho->real2recip_gpu() / recip2real_gpu() for FFT and memory_op for device memory management. No new external dependencies.
Adds a single CUDA kernel (kedf_wt_recip_multiply) for element-wise G-space kernel multiplication, following the same pattern as existing GPU kernels in source_base/kernels/cuda/.
Persistent GPU buffers are lazily allocated on first call and reused across SCF iterations. The WT kernel array (kernel_) is copied to device once since it is constant throughout the SCF cycle.
Zero overhead when CUDA is disabled — all GPU code is guarded by #ifdef __CUDA and the CPU path is completely untouched.
The GPU dispatch is a simple 5-line addition at the top of multi_kernel():

#ifdef __CUDA
    if (pw_rho->device == "gpu") {
        this->multi_kernel_gpu(prho, rkernel_rho, exponent, pw_rho);
        return;
    }
#endif

Performance: on an RTX 4060 Laptop GPU, the GPU path achieves 14.2× speedup for WT KEDF at typical OFDFT grid sizes (96³) compared to FFTW3 CPU, with correctness verified to < 1e-7 relative error. A full benchmark report is available in the standalone prototype (examples/ or as supplementary material upon request).

Any changes of core modules? (ignore if not applicable)

N/A — only modifies the OFDFT KEDF module (source_pw/module_ofdft/), which is not a core ESolver/Hamilt/Operator module.

Add GPU backend for KEDF_WT::multi_kernel() using cuFFT via PW_Basis _gpu interface. Key changes: - kedf_wt_gpu.cu: single CUDA kernel (kedf_wt_recip_multiply) for G-space element-wise kernel multiplication, plus multi_kernel_gpu() method that pipelines real2recip → kernel multiply → recip2real entirely on GPU. Persistent buffers allocated via memory_op. - kedf_wt.h: GPU method declarations and buffer members under #ifdef __CUDA guard (zero overhead when CUDA disabled). - kedf_wt.cpp: GPU dispatch at top of multi_kernel() — when pw_rho->device == "gpu", delegates to multi_kernel_gpu(). - source/CMakeLists.txt: add kedf_wt_gpu.cu to USE_CUDA block. Design follows existing ABACUS GPU patterns (memory_op for device memory, thrust::complex in kernels, CHECK_CUDA_SYNC for safety).

- kedf_wt.h: #include <cufft.h> was erroneously inside the class body (both in destructor and private section). This caused the cuFFT header extern "C" block to appear inside a C++ class definition, triggering "linkage specification is not allowed" and all cuFFT types undeclared. Moved the include to file scope, guarded by #ifdef __CUDA. - kedf_wt_gpu.cu: d_result_ is double* but resmem_zd_op/delmem_zd_op are typed std::complex<double>*. Changed to resmem_dd_op/delmem_dd_op (nrxx*2 doubles = nrxx complex doubles).

mohanchen · 2026-06-09T03:17:23Z

Nice try, could you provide some tests/examples in the PR? you can check out /tests/07_OFDFT

- Add test directory with INPUT (device=gpu), STRU, KPT, result.ref - Test identical to 09_OF_KE_WT but exercises GPU code path - Add CASES_GPU.txt for GPU test discovery - GPU results should match CPU reference within tolerance

SunsetStand · 2026-06-09T07:04:02Z

Thanks for the review! I've added a GPU WT KEDF test case:

Test location: tests/07_OFDFT/31_OF_KE_WT_GPU/

It mirrors 09_OF_KE_WT (Al FCC, WT KEDF, symmetry=on) with device gpu added to INPUT. The GPU path performs identical math—cuFFT replaces CPU FFT + GPU kernels for element-wise ops—so results should match the CPU reference within tolerance.

Per reviewer request (sunliang98): keep GPU kernel files organized under kernels/cuda/ subdirectory, consistent with other ABACUS modules.

After moving kedf_wt_gpu.cu to kernels/cuda/, the bare include #include "kedf_wt.h" no longer resolves since the header is now in the parent directory. Use full module path consistent with other CUDA kernel files (e.g., module_pwdft/kernels/cuda/*.cu).

…exponent Replace thrust::complex<double> with native double2 (cufftDoubleComplex) to eliminate AoS memory layout overhead (50% bandwidth waste from unused imag component). Add grid-stride loops for flexible occupancy. Move rho^exponent (std::pow) from CPU to GPU, eliminating one H→D transfer per SCF iteration. Kernel changes: - kedf_wt_rho_power (new): GPU-side pow() replaces CPU loop - kedf_wt_recip_multiply: double2 replaces thrust::complex, grid-stride - kedf_wt_real_to_complex: double2 + grid-stride - kedf_wt_complex_to_real_norm: double2 + grid-stride Benchmark (RTX 4060 Laptop, 96^3 grid): ~3.3x end-to-end speedup vs thrust::complex baseline. Kernel-only section: ~76% faster. See wt_kernel_opt/ standalone benchmark for full comparison. Thread coarsening (4x) was tested but showed regression on Ada Lovelace (SM 8.9) — fewer active warps reduced latency hiding for memory-bound kernels. Left for future architecture-specific tuning.

SunsetStand · 2026-06-10T11:53:03Z

Optimizations applied (commit 4129f9f)
Three targeted optimizations for the WT KEDF GPU kernels:
double2 replaces thrust::complex (AoS → native type)
thrust::complex uses AoS layout {real, imag} but real_to_complex only writes imag=0 and complex_to_real_norm only reads real part — wasting 50% memory bandwidth. Using native double2 (layout-compatible with cufftDoubleComplex) eliminates the abstraction overhead and enables better compiler vectorization. No API breakage, double2 is trivially castable to cufftDoubleComplex*.
Grid-stride loops
Replaced if (idx >= n) return with for (i = idx; i < n; i += stride) for flexible occupancy across arbitrary grid sizes.
GPU-side rho^exponent
Moved std::pow(rho[i], exponent) from CPU to a GPU kernel, eliminating one H→D transfer per SCF iteration. In the ABACUS WT KEDF call chain, prho needs one H→D copy regardless — the GPU pow kernel runs on the copied data directly, avoiding the CPU compute + second transfer.
Benchmark (RTX 4060 Laptop, standalone benchmark)

Grid	Original	Optimized	Speedup
32³	0.57 ms	0.32 ms	1.78×
64³	3.34 ms	1.13 ms	2.96×
96³	11.1 ms	3.42 ms	3.25×

Kernel-only (data already on GPU): 1.76× for 96³.
What was tested but NOT included
Thread coarsening (4 elements/thread) showed a ~4% regression on Ada Lovelace (SM 8.9) — fewer active warps reduced the GPU's ability to hide memory latency for these memory-bound kernels. Left for future architecture-specific tuning.
Correctness
Max relative error vs. original output: ~1e-16 for 32³/64³ grids. No existing test changes needed.

checkpoint: skeleton gpu file before full implementation

7395e6d

SunsetStand force-pushed the feature/ofdft-kedf-gpu branch 2 times, most recently from 25c2618 to d62a3f6 Compare June 7, 2026 03:19

SunsetStand force-pushed the feature/ofdft-kedf-gpu branch from d62a3f6 to b93c9cd Compare June 7, 2026 03:23

mohanchen added the project_learning label Jun 8, 2026

mohanchen requested a review from sunliang98 June 9, 2026 03:15

mohanchen assigned sunliang98 Jun 9, 2026

mohanchen added Refactor Refactor ABACUS codes Features Needed The features are indeed needed, and developers should have sophisticated knowledge GPU & DCU & HPC GPU and DCU and HPC related any issues and removed Refactor Refactor ABACUS codes labels Jun 9, 2026

test: add GPU WT KEDF test case (31_OF_KE_WT_GPU)

d60158f

- Add test directory with INPUT (device=gpu), STRU, KPT, result.ref - Test identical to 09_OF_KE_WT but exercises GPU code path - Add CASES_GPU.txt for GPU test discovery - GPU results should match CPU reference within tolerance

sunliang98 approved these changes Jun 9, 2026

View reviewed changes

Comment thread source/CMakeLists.txt Outdated

sunliang98 and others added 4 commits June 9, 2026 21:26

Merge branch 'develop' into feature/ofdft-kedf-gpu

73a1537

refactor: move kedf_wt_gpu.cu to kernels/cuda/ for module consistency

7149d2e

Per reviewer request (sunliang98): keep GPU kernel files organized under kernels/cuda/ subdirectory, consistent with other ABACUS modules.

SunsetStand requested a review from sunliang98 June 10, 2026 12:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448

feat: GPU-accelerated WT KEDF multi_kernel convolution#7448
SunsetStand wants to merge 8 commits into
deepmodeling:developfrom
SunsetStand:feature/ofdft-kedf-gpu

SunsetStand commented Jun 7, 2026 •

edited

Loading

Uh oh!

mohanchen commented Jun 9, 2026

Uh oh!

SunsetStand commented Jun 9, 2026

Uh oh!

Uh oh!

SunsetStand commented Jun 10, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

SunsetStand commented Jun 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reminder

Linked Issue

Unit Tests and/or Case Tests for my changes

What's changed?

Any changes of core modules? (ignore if not applicable)

Uh oh!

mohanchen commented Jun 9, 2026

Uh oh!

SunsetStand commented Jun 9, 2026

Uh oh!

Uh oh!

SunsetStand commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

SunsetStand commented Jun 7, 2026 •

edited

Loading

SunsetStand commented Jun 10, 2026 •

edited

Loading