Remove pinned host memory from barrier solver#1321
Conversation
Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines. Vectors removed (pinned -> plain or deleted entirely): - 10 direction vectors (dw_aff, dx_aff, dy_aff, dv_aff, dz_aff and their corrector counterparts) - 5 RHS vectors (primal_rhs, bound_rhs, dual_rhs, complementarity_xz_rhs, complementarity_wv_rhs) - 5 residual vectors (primal_residual, bound_residual, dual_residual, complementarity_xz_residual, complementarity_wv_residual) - diag, inv_diag, inv_sqrt_diag (CPU-only, converted to dense_vector_t) - c, b (constants, converted; permanent d_b_ added to avoid per-iteration device_copy in compute_primal_dual_objective) - restrict_u_ (converted; permanent d_restrict_u_ added, copied once) - w, x, y, v, z, upper_bounds (state vectors, converted) Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters. Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Would this be with release/26.06 or postponed to the next release? |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughIteration state, RHS/residual assembly, objective dot-products, and search-direction work are moved to device-resident buffers; gpu_compute_search_direction now allocates device work internally, and host copies occur only at explicit synchronized snapshot/solution-export points. ChangesGPU-Resident Barrier Solver
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
cpp/src/barrier/barrier.cu (2)
1386-1388: 💤 Low valueVerify dense-columns path synchronization is correct.
The D2H copy of
inv_diagfollowed bysynchronize()before host-sidesolve_adatis necessary for correctness whenn_dense_columns > 0. The host solve uses the current device-computedinv_diagvalues.However, consider using
RAFT_CUDA_TRYwrapper for consistency with other sync points in this file.Suggested change for consistency
raft::copy(inv_diag.data(), d_inv_diag.data(), d_inv_diag.size(), stream_view_); - stream_view_.synchronize(); + RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_));🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/barrier/barrier.cu` around lines 1386 - 1388, The D2H copy of inv_diag currently uses raft::copy(...) followed by stream_view_.synchronize(); for correctness when n_dense_columns > 0—wrap the CUDA synchronization in the RAFT_CUDA_TRY macro for consistency with other sync points (i.e., ensure the raft::copy and the subsequent stream_view_.synchronize() call are protected by RAFT_CUDA_TRY) so device errors are checked before proceeding to host-side work such as host_copy(...) and the host solve (solve_adat).
2471-2472: 💤 Low valueMinor performance: prefer D2D copy from
d_b_instead of H2D fromlp.rhs.Since
d_b_is already a permanent device copy oflp.rhs(copied once at construction, line 351), using it as the source avoids an H2D transfer each iteration.Suggested optimization
data.d_primal_residual_.resize(lp.num_rows, stream_view_); - raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_); + raft::copy(data.d_primal_residual_.data(), data.d_b_.data(), data.d_b_.size(), stream_view_);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/src/barrier/barrier.cu` around lines 2471 - 2472, Replace the host-to-device copy from lp.rhs with a device-to-device copy from the existing device buffer d_b_; specifically, change the raft::copy call that writes into data.d_primal_residual_ (currently using lp.rhs.data()) to use d_b_.data() (or the appropriate device pointer named d_b_) and keep the size and stream_view_ unchanged so the transfer is D2D instead of H2D.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cpp/src/barrier/barrier.cu`:
- Around line 1386-1388: The D2H copy of inv_diag currently uses raft::copy(...)
followed by stream_view_.synchronize(); for correctness when n_dense_columns >
0—wrap the CUDA synchronization in the RAFT_CUDA_TRY macro for consistency with
other sync points (i.e., ensure the raft::copy and the subsequent
stream_view_.synchronize() call are protected by RAFT_CUDA_TRY) so device errors
are checked before proceeding to host-side work such as host_copy(...) and the
host solve (solve_adat).
- Around line 2471-2472: Replace the host-to-device copy from lp.rhs with a
device-to-device copy from the existing device buffer d_b_; specifically, change
the raft::copy call that writes into data.d_primal_residual_ (currently using
lp.rhs.data()) to use d_b_.data() (or the appropriate device pointer named d_b_)
and keep the size and stream_view_ unchanged so the transfer is D2D instead of
H2D.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: de9a0363-49f9-4235-9a1f-39c929d3f62a
📒 Files selected for processing (2)
cpp/src/barrier/barrier.cucpp/src/barrier/barrier.hpp
💤 Files with no reviewable changes (1)
- cpp/src/barrier/barrier.hpp
| // Verify A*x = b | ||
| data.primal_residual = lp.rhs; | ||
| data.cusparse_view_.spmv(1.0, data.x, -1.0, data.primal_residual); | ||
| dense_vector_t<i_t, f_t> primal_residual(lp.num_rows); |
There was a problem hiding this comment.
Nit: rename it to init_primal_residual
| dense_vector_t<i_t, f_t> primal_residual(lp.num_rows); | |
| dense_vector_t<i_t, f_t> init_primal_residual(lp.num_rows); |
yuwenchen95
left a comment
There was a problem hiding this comment.
Some resize calls still appears in the main while loop of barrier method. Since dimensions of these vectors are uncganed once set up, it's better to regroup all resize operations at the beginning of a barrier methods.
| #endif | ||
|
|
||
| if (data.n_upper_bounds > 0) { | ||
| dense_vector_t<i_t, f_t> bound_residual(data.n_upper_bounds); |
There was a problem hiding this comment.
Add init_ prefix like above.
| data.z.pairwise_subtract(data.c, data.dual_residual); | ||
| if (data.Q.n > 0) { matrix_vector_multiply(data.Q, -1.0, data.x, 1.0, data.dual_residual); } | ||
| data.cusparse_view_.transpose_spmv(1.0, data.y, 1.0, data.dual_residual); | ||
| dense_vector_t<i_t, f_t> dual_residual(lp.num_cols); |
There was a problem hiding this comment.
| dense_vector_t<i_t, f_t> dual_residual(lp.num_cols); | |
| dense_vector_t<i_t, f_t> init_dual_residual(lp.num_cols); |
|
|
||
| data.d_primal_residual_.resize(data.primal_residual.size(), stream_view_); | ||
| raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_); | ||
| data.d_primal_residual_.resize(lp.num_rows, stream_view_); |
There was a problem hiding this comment.
Nit: it would look clearer if we only resize d_primal_residual_ and d_dual_residual_ at the first time we call it.
| stream_view_.value()); | ||
| RAFT_CHECK_CUDA(stream_view_); | ||
| if (data.Q.n > 0) { | ||
| auto descr_dual_residual = data.cusparse_view_.create_vector(data.d_dual_residual_); |
There was a problem hiding this comment.
descr_dual_residual should be added into the initialization of data, instead of creating it every time.
| data.d_dy_.resize(dy.size(), stream_view_); | ||
| data.d_dz_.resize(dz.size(), stream_view_); | ||
| data.d_dv_.resize(dv.size(), stream_view_); | ||
| { |
There was a problem hiding this comment.
Logically better to move the code block Barrier: GPU allocation and copies before the while loop of a IPM
| // D2D: RHS = residuals (all on device) | ||
| data.cone_combined_step_ = false; | ||
| data.cone_sigma_mu_ = f_t(0); | ||
| raft::copy( |
There was a problem hiding this comment.
Better to resize d_bound_rhs_ and d_dw_ only once at the beginning of IPM.
| vector_norm2<i_t, f_t>(data.primal_residual), | ||
| vector_norm2<i_t, f_t>(data.dual_residual), | ||
| primal_residual_norm, | ||
| dual_residual_norm, |
There was a problem hiding this comment.
dual_residual_norm is not used since to_solution recomputes the dual z and then the residual. We'd better not pass it into to_solution.
Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines.
Vectors removed (pinned -> plain or deleted entirely):
Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters.
Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true.
Description
Issue
Checklist