Remove pinned host memory from barrier solver #1321

Open
rg20 wants to merge 5 commits into NVIDIA:release/26.06 from rg20:remove_pinned_memory

@rg20 rg20 commented May 28, 2026

Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines.

Vectors removed (pinned -> plain or deleted entirely):

  • 10 direction vectors (dw_aff, dx_aff, dy_aff, dv_aff, dz_aff and their corrector counterparts)
  • 5 RHS vectors (primal_rhs, bound_rhs, dual_rhs, complementarity_xz_rhs, complementarity_wv_rhs)
  • 5 residual vectors (primal_residual, bound_residual, dual_residual, complementarity_xz_residual, complementarity_wv_residual)
  • diag, inv_diag, inv_sqrt_diag (CPU-only, converted to dense_vector_t)
  • c, b (constants, converted; permanent d_b_ added to avoid per-iteration device_copy in compute_primal_dual_objective)
  • restrict_u_ (converted; permanent d_restrict_u_ added, copied once)
  • w, x, y, v, z, upper_bounds (state vectors, converted)

Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters.

Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true.

Description

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@rg20 rg20 requested a review from a team as a code owner May 28, 2026 19:52
@rg20 rg20 requested review from akifcorduk and hlinsen May 28, 2026 19:52
copy-pr-bot Bot commented May 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rg20 rg20 marked this pull request as draft May 28, 2026 19:52
@yuwenchen95
Contributor

Would this be with release/26.06 or postponed to the next release?

@rg20 rg20 changed the base branch from main to release/26.06 May 29, 2026 15:17
@rg20 rg20 added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels May 29, 2026
@rg20 rg20 added this to the 26.06 milestone May 29, 2026
@rg20 rg20 marked this pull request as ready for review June 3, 2026 15:04
coderabbitai Bot commented Jun 3, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 10c417be-7f20-4bf0-b3f2-7aa715d7c22e

📥 Commits

Reviewing files that changed from the base of the PR and between ba245ae and aa3ead9.

📒 Files selected for processing (1)
  • cpp/src/barrier/barrier.cu
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/src/barrier/barrier.cu

📝 Walkthrough


Iteration state, RHS/residual assembly, objective dot-products, and search-direction work are moved to device-resident buffers; gpu_compute_search_direction now allocates device work internally, and host copies occur only at explicit synchronized snapshot/solution-export points.

Changes

GPU-Resident Barrier Solver

| Layer / File(s) | Summary |
|---|---|
| **Data Structure and Initialization**<br>cpp/src/barrier/barrier.cu, cpp/src/barrier/barrier.hpp | iteration_data_t member types converted from pinned-host to standard/dense/device-oriented storage; new device buffers such as d_b_ and d_restrict_u_ added and initialized; constructor and cuSPARSE view setup reordered and device copies for c/b performed. |
| **Header and private API updates**<br>cpp/src/barrier/barrier.hpp | Removed barrier/pinned_host_allocator.hpp include, deleted private templated compute_residuals declaration, and updated gpu_compute_search_direction to remove pinned_dense_vector_t reference parameters for directions. |
| **Initial-point checks & residuals on-device**<br>cpp/src/barrier/barrier.cu (initial_point, gpu_compute_residuals) | Initial primal/dual feasibility checks are computed from device spmv results; gpu_compute_residuals seeds device residual buffers (e.g., copies d_b_ into d_primal_residual_) and removes prior host staging/synchronization. |
| **Search Direction Computation on Device**<br>cpp/src/barrier/barrier.cu (gpu_compute_search_direction, gpu_solve_adat) | gpu_compute_search_direction allocates/resizes internal device vectors for upper bounds, dy/dx/dz/dv/dw and residual/work buffers instead of receiving pinned-host outputs; ADAT path explicitly copies/synchronizes inv_diag when host-side ADAT solve is used. |
| **Affine / Corrector RHS Assembly on Device**<br>cpp/src/barrier/barrier.cu (compute_affine_rhs, compute_cc_rhs) | Affine and corrector RHS vectors are assembled and negated on-device via device-to-device copies/resizes (d_h_, d_dual_rhs_, d_bound_rhs_, d_dw_) and on-device zeroing (thrust::fill); host-side RHS aliases and intermediate buffers removed. |
| **Objective, iterate updates & solve loop integration**<br>cpp/src/barrier/barrier.cu (compute_primal_dual_objective, compute_next_iterate, solve, snapshot/save, check_for_suboptimal_solution) | Objective dot-products use device buffers (d_b_, d_restrict_u_); initial iterate uploaded to GPU and initial residual/mu computed on-device; iterates remain device-resident during stepping; affine/corrector steps use the new device-only search-direction flow; solution/snapshot paths explicitly synchronize and copy w/x/y/v/z from device back to host before exporting. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • akifcorduk
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Remove pinned host memory from barrier solver' directly and clearly describes the primary change, eliminating pinned memory usage from the barrier solver implementation. |
| Description check | ✅ Passed | The description provides detailed context about the changeset, explaining what vectors were removed/converted, the performance rationale, validation testing results, and how the signature was simplified. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
cpp/src/barrier/barrier.cu (2)

1386-1388: 💤 Low value

Verify dense-columns path synchronization is correct.

The D2H copy of inv_diag followed by synchronize() before host-side solve_adat is necessary for correctness when n_dense_columns > 0. The host solve uses the current device-computed inv_diag values.

However, consider using RAFT_CUDA_TRY wrapper for consistency with other sync points in this file.

Suggested change for consistency
       raft::copy(inv_diag.data(), d_inv_diag.data(), d_inv_diag.size(), stream_view_);
-      stream_view_.synchronize();
+      RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_));
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 1386 - 1388, The D2H copy of
inv_diag currently uses raft::copy(...) followed by stream_view_.synchronize();
for correctness when n_dense_columns > 0—wrap the CUDA synchronization in the
RAFT_CUDA_TRY macro for consistency with other sync points (i.e., ensure the
raft::copy and the subsequent stream_view_.synchronize() call are protected by
RAFT_CUDA_TRY) so device errors are checked before proceeding to host-side work
such as host_copy(...) and the host solve (solve_adat).

2471-2472: 💤 Low value

Minor performance: prefer D2D copy from d_b_ instead of H2D from lp.rhs.

Since d_b_ is already a permanent device copy of lp.rhs (copied once at construction, line 351), using it as the source avoids an H2D transfer each iteration.

Suggested optimization
   data.d_primal_residual_.resize(lp.num_rows, stream_view_);
-  raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
+  raft::copy(data.d_primal_residual_.data(), data.d_b_.data(), data.d_b_.size(), stream_view_);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 2471 - 2472, Replace the
host-to-device copy from lp.rhs with a device-to-device copy from the existing
device buffer d_b_; specifically, change the raft::copy call that writes into
data.d_primal_residual_ (currently using lp.rhs.data()) to use d_b_.data() (or
the appropriate device pointer named d_b_) and keep the size and stream_view_
unchanged so the transfer is D2D instead of H2D.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: de9a0363-49f9-4235-9a1f-39c929d3f62a

📥 Commits

Reviewing files that changed from the base of the PR and between 3fba293 and d0957f6.

📒 Files selected for processing (2)
  • cpp/src/barrier/barrier.cu
  • cpp/src/barrier/barrier.hpp
💤 Files with no reviewable changes (1)
  • cpp/src/barrier/barrier.hpp

@chris-maes chris-maes modified the milestones: 26.06, 26.08 Jun 3, 2026
// Verify A*x = b
data.primal_residual = lp.rhs;
data.cusparse_view_.spmv(1.0, data.x, -1.0, data.primal_residual);
dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);
@yuwenchen95 yuwenchen95 Jun 4, 2026

Nit: rename it to init_primal_residual

Suggested change
- dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);
+ dense_vector_t<i_t, f_t> init_primal_residual(lp.num_rows);
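
For readers following the diff, the initial-point check around this line seeds a vector with b and then applies an spmv to obtain A*x - b. The sketch below reproduces that pattern on the host in plain C++. It assumes the spmv call follows the usual `y = alpha * A * x + beta * y` convention; `csr_matrix_t`, `initial_primal_residual`, and `norm2` are hypothetical names for illustration and are not taken from this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal CSR matrix. Hypothetical stand-in, not the solver's own type.
struct csr_matrix_t {
  std::size_t num_rows;
  std::vector<std::size_t> row_offsets;  // size num_rows + 1
  std::vector<std::size_t> col_indices;  // column index of each nonzero
  std::vector<double> values;            // value of each nonzero
};

// y = alpha * A * x + beta * y (ASSUMED semantics of the spmv call in the diff).
inline void spmv(const csr_matrix_t& A,
                 double alpha,
                 const std::vector<double>& x,
                 double beta,
                 std::vector<double>& y)
{
  assert(y.size() == A.num_rows);
  for (std::size_t i = 0; i < A.num_rows; ++i) {
    double acc = 0.0;
    for (std::size_t p = A.row_offsets[i]; p < A.row_offsets[i + 1]; ++p) {
      acc += A.values[p] * x[A.col_indices[p]];
    }
    y[i] = alpha * acc + beta * y[i];
  }
}

// Seed the residual with b, then overwrite it in place with A*x - b.
inline std::vector<double> initial_primal_residual(const csr_matrix_t& A,
                                                   const std::vector<double>& x,
                                                   const std::vector<double>& b)
{
  std::vector<double> init_primal_residual = b;
  spmv(A, 1.0, x, -1.0, init_primal_residual);
  return init_primal_residual;
}

// Euclidean norm, used to judge whether A*x = b holds to tolerance.
inline double norm2(const std::vector<double>& v)
{
  double sum = 0.0;
  for (double vi : v) { sum += vi * vi; }
  return std::sqrt(sum);
}
```

Because the result is a local used only for this one check, a name such as init_primal_residual keeps it distinct from the per-iteration residual buffers.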

@yuwenchen95 yuwenchen95 left a comment

Some resize calls still appear in the main while loop of the barrier method. Since the dimensions of these vectors are unchanged once set up, it would be better to group all resize operations at the beginning of the barrier method.
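
A minimal sketch of what hoisting the resizes could look like, in plain C++ with std::vector standing in for the device buffers. `workspace_t`, `allocate`, and `run_iterations` are hypothetical names for illustration only; the real buffers, their sizes, and the stream argument to resize are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the solver's per-iteration work buffers.
// std::vector plays the role of a device buffer in this sketch.
struct workspace_t {
  std::vector<double> bound_rhs;
  std::vector<double> dw;
  std::size_t resize_calls = 0;  // counts resizes so the hoisting is observable

  // Size every buffer once, before the iteration loop starts.
  void allocate(std::size_t n_upper_bounds)
  {
    bound_rhs.resize(n_upper_bounds);
    dw.resize(n_upper_bounds);
    resize_calls += 2;
  }
};

// Toy loop: the body only writes into buffers that already have the right size.
inline std::size_t run_iterations(workspace_t& ws, std::size_t n_upper_bounds, int iterations)
{
  ws.allocate(n_upper_bounds);  // hoisted: dimensions never change afterwards
  for (int k = 0; k < iterations; ++k) {
    assert(ws.bound_rhs.size() == n_upper_bounds);
    for (std::size_t i = 0; i < n_upper_bounds; ++i) {
      ws.bound_rhs[i] = static_cast<double>(k);
      ws.dw[i]        = -ws.bound_rhs[i];
    }
  }
  return ws.resize_calls;
}
```

With this layout the number of resize calls is independent of the iteration count, and the loop body can assert the expected sizes instead of re-establishing them.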

#endif

if (data.n_upper_bounds > 0) {
dense_vector_t<i_t, f_t> bound_residual(data.n_upper_bounds);

Add init_ prefix like above.

data.z.pairwise_subtract(data.c, data.dual_residual);
if (data.Q.n > 0) { matrix_vector_multiply(data.Q, -1.0, data.x, 1.0, data.dual_residual); }
data.cusparse_view_.transpose_spmv(1.0, data.y, 1.0, data.dual_residual);
dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);

Suggested change
- dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);
+ dense_vector_t<i_t, f_t> init_dual_residual(lp.num_cols);


data.d_primal_residual_.resize(data.primal_residual.size(), stream_view_);
raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
data.d_primal_residual_.resize(lp.num_rows, stream_view_);

Nit: it would be clearer if d_primal_residual_ and d_dual_residual_ were resized only the first time this function is called.

stream_view_.value());
RAFT_CHECK_CUDA(stream_view_);
if (data.Q.n > 0) {
auto descr_dual_residual = data.cusparse_view_.create_vector(data.d_dual_residual_);

descr_dual_residual should be created once during the initialization of data, instead of being created on every call.
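
One way to read this suggestion, sketched in plain C++: build the descriptor together with the buffer it describes and hand out the cached object afterwards. `vector_descr_t` and `dual_residual_cache_t` are hypothetical stand-ins; the actual cuSPARSE descriptor type and the create_vector helper are not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical descriptor: a non-owning view of a buffer.
// The real cuSPARSE dense-vector descriptor is not modeled here.
struct vector_descr_t {
  const double* data;
  std::size_t size;
};

// Hypothetical slice of the iteration data: a buffer plus its cached descriptor.
struct dual_residual_cache_t {
  std::vector<double> dual_residual;   // stand-in for d_dual_residual_
  vector_descr_t descr_dual_residual;  // built once, reused on every call

  explicit dual_residual_cache_t(std::size_t num_cols)
    : dual_residual(num_cols, 0.0),
      descr_dual_residual{dual_residual.data(), dual_residual.size()}
  {
  }

  // The descriptor points into the buffer, so copying would leave it dangling.
  dual_residual_cache_t(const dual_residual_cache_t&)            = delete;
  dual_residual_cache_t& operator=(const dual_residual_cache_t&) = delete;

  // Per-iteration code asks for the cached descriptor instead of building one.
  const vector_descr_t& descr() const
  {
    assert(descr_dual_residual.data == dual_residual.data());  // buffer was not reallocated
    return descr_dual_residual;
  }
};
```

A cached descriptor is only valid while the underlying buffer keeps its address, so this approach relies on the buffer not being resized after setup.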

data.d_dy_.resize(dy.size(), stream_view_);
data.d_dz_.resize(dz.size(), stream_view_);
data.d_dv_.resize(dv.size(), stream_view_);
{

It would be more logical to move the code block "Barrier: GPU allocation and copies" before the while loop of the IPM.

// D2D: RHS = residuals (all on device)
data.cone_combined_step_ = false;
data.cone_sigma_mu_ = f_t(0);
raft::copy(

It would be better to resize d_bound_rhs_ and d_dw_ only once, at the beginning of the IPM.

vector_norm2<i_t, f_t>(data.primal_residual),
vector_norm2<i_t, f_t>(data.dual_residual),
primal_residual_norm,
dual_residual_norm,

dual_residual_norm is not used, since to_solution recomputes the dual z and then the residual. It would be better not to pass it into to_solution.
