feat: GPU-accelerated WT KEDF multi_kernel convolution#7448
feat: GPU-accelerated WT KEDF multi_kernel convolution#7448SunsetStand wants to merge 8 commits into
Conversation
25c2618 to
d62a3f6
Compare
Add GPU backend for KEDF_WT::multi_kernel() using cuFFT via PW_Basis _gpu interface. Key changes: - kedf_wt_gpu.cu: single CUDA kernel (kedf_wt_recip_multiply) for G-space element-wise kernel multiplication, plus multi_kernel_gpu() method that pipelines real2recip → kernel multiply → recip2real entirely on GPU. Persistent buffers allocated via memory_op. - kedf_wt.h: GPU method declarations and buffer members under #ifdef __CUDA guard (zero overhead when CUDA disabled). - kedf_wt.cpp: GPU dispatch at top of multi_kernel() — when pw_rho->device == "gpu", delegates to multi_kernel_gpu(). - source/CMakeLists.txt: add kedf_wt_gpu.cu to USE_CUDA block. Design follows existing ABACUS GPU patterns (memory_op for device memory, thrust::complex in kernels, CHECK_CUDA_SYNC for safety).
d62a3f6 to
b93c9cd
Compare
- kedf_wt.h: #include <cufft.h> was erroneously inside the class body (both in destructor and private section). This caused the cuFFT header extern "C" block to appear inside a C++ class definition, triggering "linkage specification is not allowed" and all cuFFT types undeclared. Moved the include to file scope, guarded by #ifdef __CUDA. - kedf_wt_gpu.cu: d_result_ is double* but resmem_zd_op/delmem_zd_op are typed std::complex<double>*. Changed to resmem_dd_op/delmem_dd_op (nrxx*2 doubles = nrxx complex doubles).
|
Nice try, could you provide some tests/examples in the PR? you can check out /tests/07_OFDFT |
- Add test directory with INPUT (device=gpu), STRU, KPT, result.ref - Test identical to 09_OF_KE_WT but exercises GPU code path - Add CASES_GPU.txt for GPU test discovery - GPU results should match CPU reference within tolerance
|
Thanks for the review! I've added a GPU WT KEDF test case: Test location: tests/07_OFDFT/31_OF_KE_WT_GPU/ It mirrors 09_OF_KE_WT (Al FCC, WT KEDF, symmetry=on) with device gpu added to INPUT. The GPU path performs identical math—cuFFT replaces CPU FFT + GPU kernels for element-wise ops—so results should match the CPU reference within tolerance. |
Per reviewer request (sunliang98): keep GPU kernel files organized under kernels/cuda/ subdirectory, consistent with other ABACUS modules.
After moving kedf_wt_gpu.cu to kernels/cuda/, the bare include #include "kedf_wt.h" no longer resolves since the header is now in the parent directory. Use full module path consistent with other CUDA kernel files (e.g., module_pwdft/kernels/cuda/*.cu).
…exponent Replace thrust::complex<double> with native double2 (cufftDoubleComplex) to eliminate AoS memory layout overhead (50% bandwidth waste from unused imag component). Add grid-stride loops for flexible occupancy. Move rho^exponent (std::pow) from CPU to GPU, eliminating one H→D transfer per SCF iteration. Kernel changes: - kedf_wt_rho_power (new): GPU-side pow() replaces CPU loop - kedf_wt_recip_multiply: double2 replaces thrust::complex, grid-stride - kedf_wt_real_to_complex: double2 + grid-stride - kedf_wt_complex_to_real_norm: double2 + grid-stride Benchmark (RTX 4060 Laptop, 96^3 grid): ~3.3x end-to-end speedup vs thrust::complex baseline. Kernel-only section: ~76% faster. See wt_kernel_opt/ standalone benchmark for full comparison. Thread coarsening (4x) was tested but showed regression on Ada Lovelace (SM 8.9) — fewer active warps reduced latency hiding for memory-bound kernels. Left for future architecture-specific tuning.
|
Optimizations applied (commit 4129f9f)
Kernel-only (data already on GPU): 1.76× for 96³. |
Reminder
Linked Issue
This is a new feature — no existing issue. A CPU-vs-GPU correctness and performance benchmark is provided in the PR description below. An issue can be opened for discussion if preferred.
Unit Tests and/or Case Tests for my changes
A standalone benchmark (
ofdft_cuda/) was used to verify correctness (GPU vs FFTW3 CPU reference, error < 1e-7 for WT KEDF) and measure performance (14.2× speedup at 96³ grid on RTX 4060). Integration into ABACUS's existing GPU CI pipeline (e.g.,tests/integrate/GPU OFDFT cases) is planned as a follow-up once CI GPU runners are confirmed available for this module.What's changed?
This PR adds GPU acceleration for the WT KEDF
multi_kernel()function, which is the most expensive single operation in OFDFT Wang-Teter calculations (up to 40% of total SCF time). The implementation:pw_rho->real2recip_gpu()/recip2real_gpu()for FFT andmemory_opfor device memory management. No new external dependencies.kedf_wt_recip_multiply) for element-wise G-space kernel multiplication, following the same pattern as existing GPU kernels insource_base/kernels/cuda/.kernel_) is copied to device once since it is constant throughout the SCF cycle.#ifdef __CUDAand the CPU path is completely untouched.The GPU dispatch is a simple 5-line addition at the top of
multi_kernel():Performance: on an RTX 4060 Laptop GPU, the GPU path achieves 14.2× speedup for WT KEDF at typical OFDFT grid sizes (96³) compared to FFTW3 CPU, with correctness verified to < 1e-7 relative error. A full benchmark report is available in the standalone prototype (examples/ or as supplementary material upon request).
Any changes of core modules? (ignore if not applicable)
N/A — only modifies the OFDFT KEDF module (source_pw/module_ofdft/), which is not a core ESolver/Hamilt/Operator module.