Skip to content

Refactor(barrier): Optimize affine direction handling with GPU snapshot#1397

Open
solos wants to merge 3 commits into
NVIDIA:mainfrom
solos:refactor-barrier
Open

Refactor(barrier): Optimize affine direction handling with GPU snapshot#1397
solos wants to merge 3 commits into
NVIDIA:mainfrom
solos:refactor-barrier

Conversation

@solos
Copy link
Copy Markdown

@solos solos commented Jun 5, 2026

Description

Refactor(barrier): Optimize affine direction handling with GPU snapshot

  • Eliminated redundant CPU-GPU data transfers in Mehrotra step.
  • Introduced GPU-side snapshot to retain affine directions (dx, dy, etc.).
  • Removed costly synchronous copies and stream synchronizations.
  • Simplified code by deprecating manual host-side vector management.

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@solos solos requested a review from a team as a code owner June 5, 2026 10:39
@solos solos requested review from chris-maes and kaatish June 5, 2026 10:39
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Jun 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 5, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 025da6ff-b0b7-4fd9-9fd7-6088e16bf072

📥 Commits

Reviewing files that changed from the base of the PR and between 6b28fde and d99a356.

📒 Files selected for processing (2)
  • cpp/src/barrier/barrier.cu
  • cpp/src/barrier/barrier.hpp
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/src/barrier/barrier.hpp
  • cpp/src/barrier/barrier.cu

📝 Walkthrough

Walkthrough

Optionally snapshots affine search directions into dedicated device buffers before corrector overwrites; keeps intermediate direction data device-resident (removing redundant host↔device transfers); updates FINITE_CHECK to validate device-stored affine vectors and updates iteration call sites.

Changes

GPU Search Direction Affine Snapshot Optimization

Layer / File(s) Summary
Function signature extension and snapshot helper
cpp/src/barrier/barrier.hpp, cpp/src/barrier/barrier.cu
Extends gpu_compute_search_direction with snapshot_affine_direction boolean parameter (default false) and adds copy_affine_direction_to_device_buffers() to copy finalized affine directions (dw/dx/dy/dv/dz) into dedicated device buffers.
Device-resident direction buffer flow
cpp/src/barrier/barrier.cu
Removes host←device and device→host transfers of dx/dy/dz/dv across augmented and non-augmented solve paths; keeps intermediate directions GPU-resident for downstream computations.
Affine snapshot invocation
cpp/src/barrier/barrier.cu
Conditionally snapshots current affine directions into d_*_aff_ when snapshot_affine_direction is true by calling the new helper before the corrector overwrites working buffers.
compute_target_mu device use
cpp/src/barrier/barrier.cu
Removes the host→device copy that populated d_*_aff_; compute_target_mu now uses device-resident affine buffers.
FINITE_CHECK validation update
cpp/src/barrier/barrier.cu
Changes finiteness validation to copy d_dy_aff_ from device to host and check finiteness there instead of relying on host-side affine buffers.
Solve loop and sync removals
cpp/src/barrier/barrier.cu
Solve loop passes snapshot_affine_direction=true for the affine predictor call and removes explicit stream synchronizations after affine and corrector gpu_compute_search_direction calls.

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels: non-breaking, improvement

Suggested reviewers:

  • mlubin
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main refactoring: optimizing affine direction handling by introducing GPU snapshots, which aligns with the primary objective of eliminating CPU-GPU transfers.
Description check ✅ Passed The description is directly related to the changeset, detailing the elimination of CPU-GPU data transfers, GPU-side snapshot introduction, and removal of synchronous operations, which matches the code changes documented in the raw summary.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/src/barrier/barrier.cu`:
- Around line 3488-3490: Remove the unconditional host-side sync (the
RAFT_CUDA_TRY(cudaStreamSynchronize(stream_view_)) call) in the hot path;
instead rely on stream ordering for the producers of data.d_*_aff_ and the
consumers (compute_target_mu, compute_cc_rhs, gpu_compute_search_direction,
compute_final_direction) which already enqueue on the same stream_view_. If any
producer/consumer can run on a different stream, replace the sync with a CUDA
event-based stream synchronization (record on producer stream and wait on
consumer stream) rather than a host-side cudaStreamSynchronize; keep
RAFT_CUDA_TRY and stream_view_ usage intact for any event waits you add.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ba0b05f6-f2fb-4b54-88bc-69a1f0acf522

📥 Commits

Reviewing files that changed from the base of the PR and between cf90894 and d05dfe5.

📒 Files selected for processing (2)
  • cpp/src/barrier/barrier.cu
  • cpp/src/barrier/barrier.hpp

Comment thread cpp/src/barrier/barrier.cu Outdated
@solos solos force-pushed the refactor-barrier branch 3 times, most recently from cacdd39 to 3043b45 Compare June 5, 2026 11:10
solos added 2 commits June 5, 2026 19:11
- Eliminated redundant CPU-GPU data transfers in Mehrotra step.
- Introduced GPU-side snapshot to retain affine directions (dx, dy, etc.).
- Removed costly synchronous copies and stream synchronizations.
- Simplified code by deprecating manual host-side vector management.
Add docs
Fix bug
@solos solos force-pushed the refactor-barrier branch from 29461d5 to 6857d1a Compare June 5, 2026 11:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant