New complex acos function. #9096

Merged: s-oboyle merged 4 commits into NVIDIA:main from s-oboyle:complex_acos_accuracy_refinement on May 21, 2026

@s-oboyle
Contributor

Unlike asin and atan, acos needs more than a single call to the corresponding inverse hyperbolic function.
Building it on acosh with explicit sign handling fixes all the underflow/overflow issues.
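
As a rough illustration of why acos needs more than a bare acosh call, the sketch below builds acos from acosh on the first-quadrant mirror of the input and then restores the quadrant. It is only a host-side sketch using std::complex<double>. The helper name acos_via_acosh is hypothetical, and this is not the code merged in this PR. It also uses a single rounded pi for brevity.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper for illustration; not the libcudacxx implementation.
inline std::complex<double> acos_via_acosh(const std::complex<double>& z)
{
  // Remember the original signs, then work in the first quadrant.
  const bool neg_re = std::signbit(z.real());
  const bool neg_im = std::signbit(z.imag());
  const std::complex<double> w =
    std::acosh(std::complex<double>(std::fabs(z.real()), std::fabs(z.imag())));

  // For |x| + i|y|: acosh = a + i*b with a >= 0 and b in [0, pi/2],
  // and acos(|x| + i|y|) = b - i*a.
  const double a = w.real();
  const double b = w.imag();

  constexpr double pi = 3.141592653589793; // single rounded pi, for brevity

  // acos(-w) = pi - acos(w) handles a negative real part;
  // acos(conj(w)) = conj(acos(w)) handles a negative imaginary part.
  const double re = neg_re ? pi - b : b;
  const double im = neg_im ? a : -a;
  return {re, im};
}
```

The two sign fix-ups are the part that asin and atan do not need: their hyperbolic counterparts differ only by a multiplication by i, whereas acos also needs the reflection through pi when the real part is negative.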

Perf

One result here is slightly suspicious. Although this is basically a wrapper around the (fairly large) acosh, with an extra fma and some sign flips, it is much slower than acosh, as seen here; I would expect the two to perform similarly.
We may have hit a register-usage boundary, or acosh may not have been inlined.
It is also possible that the values math_bench tests (which try to approximate real-life usage) now hit a slow path in acosh more often.
To be investigated, as this is nearly 1.5x slower than anticipated.

Operations/SM/cycle, cacos():

H100    old      new      new/old
fp64    0.1518   0.1430   0.94
fp32    0.4461   0.4072   0.91

Correctness

GPU fp64:
Max ulp real error (4.772,0.3392) @ (4174.277773,1.009244847)	(0x40b04e471c22e769,0x3ff025ddec967ac6)
	Ours = (0.0002417771195,-9.029843829)    Ref = (0.0002417771195,-9.029843829)
	Ours = (0x3f2fb0b1a45a9274,0xc0220f47b0bd75df)               Ref = (0x3f2fb0b1a45a926f,0xc0220f47b0bd75df)

Max ulp imag error (0.7242,4.233) @ (-4.243991582e-314,0.0009757882705)	(0x8000000200000000,0x3f4ff9815ad08602)
	Ours = (1.570796327,-0.0009757881156)    Ref = (1.570796327,-0.0009757881156)
	Ours = (0x3ff921fb54442d19,0xbf4ff98105af1dab)               Ref = (0x3ff921fb54442d18,0xbf4ff98105af1daf)
GPU fp32:
Max ulp real error (4.768,0.2087) @ (1.67785461e+35,4.062130506e+31)	(0x7a0141d9,0x74002da1)
	Ours = (0.0002421027166,-81.80113983)    Ref = (0.0002421026438,-81.80113983)
	Ours = (0x397ddcf4,0xc2a39a2f)               Ref = (0x397ddcef,0xc2a39a2f)

Max ulp imag error (0.6333,4.976) @ (-2.802596929e-45,0.007756583858)	(0x80000002,0x3bfe2af1)
	Ours = (1.570796251,-0.007756503765)    Ref = (1.570796371,-0.007756506093)
	Ours = (0x3fc90fda,0xbbfe2a45)               Ref = (0x3fc90fdb,0xbbfe2a4a)
CPU fp64:
Max ulp real error (4.855,0.004554) @ (8476.368757,1.467965354e-247)    (0x40c08e2f336b8800,0xcb06c1784f32580)
        Ours = (1.731832824e-251,-9.738184602)    Ref = (1.731832824e-251,-9.738184602)
        Ours = (0xbdfbe1a4050c7e0,0xc02379f35503c149)               Ref = (0xbdfbe1a4050c7db,0xc02379f35503c149)

Max ulp imag error (0.7242,4.184) @ (-0,0.003891582592) (0x8000000000000000,0x3f6fe13d7ebcb800)
        Ours = (1.570796327,-0.003891572769)    Ref = (1.570796327,-0.003891572769)
        Ours = (0x3ff921fb54442d19,0xbf6fe13838bc3a57)               Ref = (0x3ff921fb54442d18,0xbf6fe13838bc3a53)
CPU fp32:
Max ulp real error (5.348,0.09948) @ (66712.41406,-8356.245117) (0x47824c35,0xc60290fb)
        Ours = (0.1246087849,11.80907726)    Ref = (0.1246087477,11.80907726)
        Ours = (0x3dff32e4,0x413cf1fb)               Ref = (0x3dff32df,0x413cf1fb)

Max ulp imag error (0.1838,3.693) @ (-0.4162324369,-0.1135988012)       (0xbed51c6b,0xbde8a67d)
        Ours = (1.996576786,0.1244143695)    Ref = (1.996576786,0.1244143993)
        Ours = (0x3fff8fd4,0x3dfeccf6)               Ref = (0x3fff8fd4,0x3dfeccfa)
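
The hex patterns in the logs above make it possible to check how far apart the "Ours" and "Ref" doubles are in representable steps. The sketch below is a hypothetical host-side helper (from_bits, ordered_bits and ulp_distance are names invented here, not part of the test harness). It only counts the integer number of doubles between two rounded values. The fractional ulp figures in the logs are presumably measured against a higher-precision reference, which this sketch does not attempt to reproduce. NaN inputs are not handled.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Reinterpret a 64-bit pattern as a double.
inline double from_bits(std::uint64_t bits)
{
  double x;
  std::memcpy(&x, &bits, sizeof x);
  return x;
}

// Map a double to an integer that increases monotonically with its value,
// so that adjacent doubles map to adjacent integers (+0 and -0 both map to 0).
inline std::int64_t ordered_bits(double x)
{
  std::int64_t i;
  std::memcpy(&i, &x, sizeof i);
  return i < 0 ? std::numeric_limits<std::int64_t>::min() - i : i;
}

// Number of representable doubles between a and b (NaN not handled).
inline std::uint64_t ulp_distance(double a, double b)
{
  const std::int64_t oa = ordered_bits(a);
  const std::int64_t ob = ordered_bits(b);
  // Subtract in unsigned arithmetic so widely separated values cannot overflow.
  const auto ua = static_cast<std::uint64_t>(oa);
  const auto ub = static_cast<std::uint64_t>(ob);
  return oa > ob ? ua - ub : ub - ua;
}
```

For the GPU fp64 real-part case, ulp_distance(from_bits(0x3f2fb0b1a45a9274), from_bits(0x3f2fb0b1a45a926f)) is 5, which is consistent with the reported 4.772 ulp error once the reference is rounded to double.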

@s-oboyle s-oboyle requested a review from a team as a code owner May 21, 2026 13:57
@s-oboyle s-oboyle requested a review from griwes May 21, 2026 13:57
@copy-pr-bot
Contributor

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-project-automation github-project-automation Bot moved this to Todo in CCCL May 21, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL May 21, 2026
@s-oboyle s-oboyle requested review from davebayer, fbusato and miscco and removed request for griwes May 21, 2026 14:01
@coderabbitai

coderabbitai Bot commented May 21, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 868d7bcb-fb59-43e1-ba9b-e9325a35558b

📥 Commits

Reviewing files that changed from the base of the PR and between 3f2087c and 01ee399.

📒 Files selected for processing (1)
  • libcudacxx/include/cuda/std/__complex/inverse_trigonometric_functions.h

📝 Walkthrough

Summary by CodeRabbit

  • Refactor
    • Streamlined the internal implementation of the inverse cosine function for complex numbers in the CUDA standard library, improving computational efficiency and reducing code dependencies while maintaining equivalent behavior.

Walkthrough

The cuda::std::acos implementation for complex numbers is refactored to replace explicit special-case branches with sign-normalized arithmetic: extract signbits, compute acosh on absolute components, apply conditional pi correction via fma for negative real inputs, and restore original signs.

Changes

Complex acos arithmetic optimization

Layer / File(s): Header includes updated (libcudacxx/include/cuda/std/__complex/inverse_trigonometric_functions.h)
Summary: Added __cmath/fma.h, retained predicate headers (isinf, isnan, signbit), removed headers from the previous log/sqrt code path.

Layer / File(s): acos(complex<_Tp>) sign-normalized acosh computation (libcudacxx/include/cuda/std/__complex/inverse_trigonometric_functions.h)
Summary: Rewrote acos(const complex<_Tp>&) to use signbit/fabs normalization, compute acosh on magnitudes, reconstruct real/imag per quadrant, conditionally apply a pi high/low fma correction when original real<0, and flip imag sign for original imag<0. Removed prior explicit isinf/isnan/zero checks and log + sqrt path.
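
The "pi high/low" correction mentioned in the summary can be pictured with a small sketch. The names pi_hi, pi_lo and pi_minus are hypothetical, and the exact expression used in the header may differ. The idea assumed here is that pi is stored as a double plus the remainder that rounding dropped, so that pi - b loses less accuracy than with a single constant. The fma mirrors the description above; with a multiplier of -1 it evaluates the same value as pi_hi - b.

```cpp
#include <cmath>

// Hypothetical constants: pi rounded to double, plus the part the rounding dropped.
constexpr double pi_hi = 3.141592653589793;
constexpr double pi_lo = 1.2246467991473532e-16;

// pi - b evaluated against the two-part constant (illustrative only).
inline double pi_minus(double b)
{
  // High part first, then add back the low-order remainder of pi.
  return std::fma(-1.0, b, pi_hi) + pi_lo;
}
```

The effect is easiest to see at an extreme input: pi_minus(pi_hi) returns pi_lo, whereas the plain subtraction pi_hi - pi_hi returns 0. In the acos setting b lies in [0, pi/2], where the low part only affects the final rounding.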

Suggested reviewers

  • davebayer
  • fbusato

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fa3d28f3-03af-4c48-9c2d-954c73bc2ca3

📥 Commits

Reviewing files that changed from the base of the PR and between 52f2794 and 3f2087c.

📒 Files selected for processing (1)
  • libcudacxx/include/cuda/std/__complex/inverse_trigonometric_functions.h

Comment thread libcudacxx/include/cuda/std/__complex/inverse_trigonometric_functions.h Outdated
@s-oboyle
Contributor Author

/ok to test 01ee399

@github-actions
Contributor

🥳 CI Workflow Results

🟩 Finished in 1h 53m: Pass: 100%/116 | Total: 2d 18h | Max: 1h 15m | Hits: 70%/516144

See results here.

@s-oboyle s-oboyle merged commit c12def0 into NVIDIA:main May 21, 2026
134 of 137 checks passed