Remove pinned host memory from barrier solver by rg20 · Pull Request #1321 · NVIDIA/cuopt

rg20 · 2026-05-28T19:52:08Z

Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines.

Vectors removed (pinned -> plain or deleted entirely):

10 direction vectors (dw_aff, dx_aff, dy_aff, dv_aff, dz_aff and their corrector counterparts)
5 RHS vectors (primal_rhs, bound_rhs, dual_rhs, complementarity_xz_rhs, complementarity_wv_rhs)
5 residual vectors (primal_residual, bound_residual, dual_residual, complementarity_xz_residual, complementarity_wv_residual)
diag, inv_diag, inv_sqrt_diag (CPU-only, converted to dense_vector_t)
c, b (constants, converted; permanent d_b_ added to avoid per-iteration device_copy in compute_primal_dual_objective)
restrict_u_ (converted; permanent d_restrict_u_ added, copied once)
w, x, y, v, z, upper_bounds (state vectors, converted)

Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters.

Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true.

Description

Issue

Checklist

I am familiar with the Contributing Guidelines.
Testing
- New or existing tests cover these changes
- Added tests
- Created an issue to follow-up
- NA
Documentation
- The documentation is up to date with these changes
- Added new documentation
- NA

Replace all pinned_dense_vector_t members in iteration_data_t with plain dense_vector_t, eliminating CPU<->GPU synchronization overhead from page-locked memory allocation. Removes 169 net lines. Vectors removed (pinned -> plain or deleted entirely): - 10 direction vectors (dw_aff, dx_aff, dy_aff, dv_aff, dz_aff and their corrector counterparts) - 5 RHS vectors (primal_rhs, bound_rhs, dual_rhs, complementarity_xz_rhs, complementarity_wv_rhs) - 5 residual vectors (primal_residual, bound_residual, dual_residual, complementarity_xz_residual, complementarity_wv_residual) - diag, inv_diag, inv_sqrt_diag (CPU-only, converted to dense_vector_t) - c, b (constants, converted; permanent d_b_ added to avoid per-iteration device_copy in compute_primal_dual_objective) - restrict_u_ (converted; permanent d_restrict_u_ added, copied once) - w, x, y, v, z, upper_bounds (state vectors, converted) Also removes the CPU compute_residuals function entirely (replaced by gpu_compute_residuals path) and simplifies gpu_compute_search_direction signature by removing unused pinned vector parameters. Validated on 179 benchmark problems (portfolio/maros/qplib): identical results vs baseline under --cudss-deterministic true. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

copy-pr-bot · 2026-05-28T19:52:11Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuwenchen95 · 2026-05-29T07:43:07Z

Would this be with release/26.06 or postponed to the next release?

coderabbitai · 2026-06-03T15:10:39Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 10c417be-7f20-4bf0-b3f2-7aa715d7c22e

📥 Commits

Reviewing files that changed from the base of the PR and between ba245ae and aa3ead9.

📒 Files selected for processing (1)

cpp/src/barrier/barrier.cu

🚧 Files skipped from review as they are similar to previous changes (1)

cpp/src/barrier/barrier.cu

📝 Walkthrough

Walkthrough

Iteration state, RHS/residual assembly, objective dot-products, and search-direction work are moved to device-resident buffers; gpu_compute_search_direction now allocates device work internally, and host copies occur only at explicit synchronized snapshot/solution-export points.

Changes

GPU-Resident Barrier Solver

Layer / File(s)	Summary
Data Structure and Initialization `cpp/src/barrier/barrier.cu`, `cpp/src/barrier/barrier.hpp`	`iteration_data_t` member types converted from pinned-host to standard/dense/device-oriented storage; new device buffers such as `d_b_` and `d_restrict_u_` added and initialized; constructor and cuSPARSE view setup reordered and device copies for `c`/`b` performed.
Header and private API updates `cpp/src/barrier/barrier.hpp`	Removed `barrier/pinned_host_allocator.hpp` include, deleted private templated `compute_residuals` declaration, and updated `gpu_compute_search_direction` to remove `pinned_dense_vector_t` reference parameters for directions.
Initial-point checks & residuals on-device `cpp/src/barrier/barrier.cu` (initial_point, gpu_compute_residuals)	Initial primal/dual feasibility checks are computed from device spmv results; `gpu_compute_residuals` seeds device residual buffers (e.g., copies `d_b_` into `d_primal_residual_`) and removes prior host staging/synchronization.
Search Direction Computation on Device `cpp/src/barrier/barrier.cu` (`gpu_compute_search_direction`, `gpu_solve_adat`)	`gpu_compute_search_direction` allocates/resizes internal device vectors for upper bounds, dy/dx/dz/dv/dw and residual/work buffers instead of receiving pinned-host outputs; ADAT path explicitly copies/synchronizes `inv_diag` when host-side ADAT solve is used.
Affine / Corrector RHS Assembly on Device `cpp/src/barrier/barrier.cu` (`compute_affine_rhs`, `compute_cc_rhs`)	Affine and corrector RHS vectors are assembled and negated on-device via device-to-device copies/resizes (`d_h_`, `d_dual_rhs_`, `d_bound_rhs_`, `d_dw_`) and on-device zeroing (thrust::fill); host-side RHS aliases and intermediate buffers removed.
Objective, iterate updates & solve loop integration `cpp/src/barrier/barrier.cu` (`compute_primal_dual_objective`, `compute_next_iterate`, `solve`, snapshot/save, `check_for_suboptimal_solution`)	Objective dot-products use device buffers (`d_b_`, `d_restrict_u_`); initial iterate uploaded to GPU and initial residual/mu computed on-device; iterates remain device-resident during stepping; affine/corrector steps use the new device-only search-direction flow; solution/snapshot paths explicitly synchronize and copy `w/x/y/v/z` from device back to host before exporting.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

akifcorduk

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Remove pinned host memory from barrier solver' directly and clearly describes the primary change—eliminating pinned memory usage from the barrier solver implementation.
Description check	✅ Passed	The description provides detailed context about the changeset, explaining what vectors were removed/converted, the performance rationale, validation testing results, and how the signature was simplified.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

cpp/src/barrier/barrier.cu (2)

1386-1388: 💤 Low value

Verify dense-columns path synchronization is correct.

The D2H copy of inv_diag followed by synchronize() before host-side solve_adat is necessary for correctness when n_dense_columns > 0. The host solve uses the current device-computed inv_diag values.

However, consider using RAFT_CUDA_TRY wrapper for consistency with other sync points in this file.

Suggested change for consistency

       raft::copy(inv_diag.data(), d_inv_diag.data(), d_inv_diag.size(), stream_view_);
-      stream_view_.synchronize();
+      RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_));

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 1386 - 1388, The D2H copy of
inv_diag currently uses raft::copy(...) followed by stream_view_.synchronize();
for correctness when n_dense_columns > 0—wrap the CUDA synchronization in the
RAFT_CUDA_TRY macro for consistency with other sync points (i.e., ensure the
raft::copy and the subsequent stream_view_.synchronize() call are protected by
RAFT_CUDA_TRY) so device errors are checked before proceeding to host-side work
such as host_copy(...) and the host solve (solve_adat).

2471-2472: 💤 Low value

Minor performance: prefer D2D copy from d_b_ instead of H2D from lp.rhs.

Since d_b_ is already a permanent device copy of lp.rhs (copied once at construction, line 351), using it as the source avoids an H2D transfer each iteration.

Suggested optimization

   data.d_primal_residual_.resize(lp.num_rows, stream_view_);
-  raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
+  raft::copy(data.d_primal_residual_.data(), data.d_b_.data(), data.d_b_.size(), stream_view_);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/barrier/barrier.cu` around lines 2471 - 2472, Replace the
host-to-device copy from lp.rhs with a device-to-device copy from the existing
device buffer d_b_; specifically, change the raft::copy call that writes into
data.d_primal_residual_ (currently using lp.rhs.data()) to use d_b_.data() (or
the appropriate device pointer named d_b_) and keep the size and stream_view_
unchanged so the transfer is D2D instead of H2D.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cpp/src/barrier/barrier.cu`:
- Around line 1386-1388: The D2H copy of inv_diag currently uses raft::copy(...)
followed by stream_view_.synchronize(); for correctness when n_dense_columns >
0—wrap the CUDA synchronization in the RAFT_CUDA_TRY macro for consistency with
other sync points (i.e., ensure the raft::copy and the subsequent
stream_view_.synchronize() call are protected by RAFT_CUDA_TRY) so device errors
are checked before proceeding to host-side work such as host_copy(...) and the
host solve (solve_adat).
- Around line 2471-2472: Replace the host-to-device copy from lp.rhs with a
device-to-device copy from the existing device buffer d_b_; specifically, change
the raft::copy call that writes into data.d_primal_residual_ (currently using
lp.rhs.data()) to use d_b_.data() (or the appropriate device pointer named d_b_)
and keep the size and stream_view_ unchanged so the transfer is D2D instead of
H2D.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: de9a0363-49f9-4235-9a1f-39c929d3f62a

📥 Commits

Reviewing files that changed from the base of the PR and between 3fba293 and d0957f6.

📒 Files selected for processing (2)

cpp/src/barrier/barrier.cu
cpp/src/barrier/barrier.hpp

💤 Files with no reviewable changes (1)

cpp/src/barrier/barrier.hpp

…inned_memory

yuwenchen95 · 2026-06-04T10:12:55Z

  // Verify A*x = b
-  data.primal_residual = lp.rhs;
-  data.cusparse_view_.spmv(1.0, data.x, -1.0, data.primal_residual);
+  dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);


Nit: rename it to init_primal_residual

Suggested change

dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);

dense_vector_t<i_t, f_t> init_primal_residual(lp.num_rows);

yuwenchen95

Some resize calls still appears in the main while loop of barrier method. Since dimensions of these vectors are uncganed once set up, it's better to regroup all resize operations at the beginning of a barrier methods.

yuwenchen95 · 2026-06-04T10:14:18Z

 #endif

  if (data.n_upper_bounds > 0) {
+    dense_vector_t<i_t, f_t> bound_residual(data.n_upper_bounds);


Add init_ prefix like above.

yuwenchen95 · 2026-06-04T10:16:45Z

-  data.z.pairwise_subtract(data.c, data.dual_residual);
-  if (data.Q.n > 0) { matrix_vector_multiply(data.Q, -1.0, data.x, 1.0, data.dual_residual); }
-  data.cusparse_view_.transpose_spmv(1.0, data.y, 1.0, data.dual_residual);
+  dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);


Suggested change

dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);

dense_vector_t<i_t, f_t> init_dual_residual(lp.num_cols);

yuwenchen95 · 2026-06-04T13:27:18Z


-  data.d_primal_residual_.resize(data.primal_residual.size(), stream_view_);
-  raft::copy(data.d_primal_residual_.data(), lp.rhs.data(), lp.rhs.size(), stream_view_);
+  data.d_primal_residual_.resize(lp.num_rows, stream_view_);


Nit: it would look clearer if we only resize d_primal_residual_ and d_dual_residual_ at the first time we call it.

yuwenchen95 · 2026-06-04T13:28:38Z

                                  stream_view_.value());
  RAFT_CHECK_CUDA(stream_view_);
+  if (data.Q.n > 0) {
+    auto descr_dual_residual = data.cusparse_view_.create_vector(data.d_dual_residual_);


descr_dual_residual should be added into the initialization of data, instead of creating it every time.

yuwenchen95 · 2026-06-04T13:35:15Z

-  data.d_dy_.resize(dy.size(), stream_view_);
-  data.d_dz_.resize(dz.size(), stream_view_);
-  data.d_dv_.resize(dv.size(), stream_view_);
+  {


Logically better to move the code block Barrier: GPU allocation and copies before the while loop of a IPM

yuwenchen95 · 2026-06-04T13:47:18Z

+  // D2D: RHS = residuals (all on device)
  data.cone_combined_step_ = false;
  data.cone_sigma_mu_      = f_t(0);
+  raft::copy(


Better to resize d_bound_rhs_ and d_dw_ only once at the beginning of IPM.

yuwenchen95 · 2026-06-04T14:22:08Z

-                     vector_norm2<i_t, f_t>(data.primal_residual),
-                     vector_norm2<i_t, f_t>(data.dual_residual),
+                     primal_residual_norm,
+                     dual_residual_norm,


dual_residual_norm is not used since to_solution recomputes the dual z and then the residual. We'd better not pass it into to_solution.

rg20 requested a review from a team as a code owner May 28, 2026 19:52

rg20 requested review from akifcorduk and hlinsen May 28, 2026 19:52

rg20 marked this pull request as draft May 28, 2026 19:52

rg20 changed the base branch from main to release/26.06 May 29, 2026 15:17

rg20 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels May 29, 2026

rg20 added this to the 26.06 milestone May 29, 2026

hlinsen added 2 commits June 2, 2026 07:26

Fix merge conflicts

d58270e

Use pinned mem only for inv diag

d0957f6

rg20 marked this pull request as ready for review June 3, 2026 15:04

coderabbitai Bot reviewed Jun 3, 2026

View reviewed changes

chris-maes modified the milestones: 26.06, 26.08 Jun 3, 2026

hlinsen added 2 commits June 3, 2026 15:30

Merge branch 'release/26.06' of github.com:NVIDIA/cuopt into remove_p…

ba245ae

…inned_memory

Avoid more copies

aa3ead9

yuwenchen95 reviewed Jun 4, 2026

View reviewed changes

	dense_vector_t<i_t, f_t> primal_residual(lp.num_rows);
	dense_vector_t<i_t, f_t> init_primal_residual(lp.num_rows);

	dense_vector_t<i_t, f_t> dual_residual(lp.num_cols);
	dense_vector_t<i_t, f_t> init_dual_residual(lp.num_cols);

Conversation

rg20 commented May 28, 2026

Description

Issue

Checklist

Uh oh!

copy-pr-bot Bot commented May 28, 2026

Uh oh!

yuwenchen95 commented May 29, 2026

Uh oh!

coderabbitai Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 left a comment

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

yuwenchen95 Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coderabbitai Bot commented Jun 3, 2026 •

edited

Loading

yuwenchen95 Jun 4, 2026 •

edited

Loading