PSMS Model Precision & Performance #43

brandynlucca · 2026-04-02T19:55:04Z

I recently pushed a PSMS implementation I've been working on for acousticTS that I think may be helpful for echoSMs in two areas: 1) precision drift at higher $ka/kb$, and 2) computation time for the fluid-filled boundary condition. This is food for thought more than anything, since micro-optimizing code is often completely unnecessary. The purpose of this thread is twofold:

1. Is there a plan to enable quad-precision in echoSMs for PSMSModel?
2. There may be heuristics and proxy methods that can help speed up model runs without losing significant accuracy relative to benchmarks. And if quad-precision is enabled, then perhaps it would be worthwhile migrating code from Python to a C/C++ layer (or Cython) down the road.

The motivation behind this thread is my recent attempt to implement a T-matrix method in the acousticTS R package and an interest in migrating it to Python for some future multiple-scattering modeling.

Precision

The prolate_swf.f90 file used in the spheroidalwavefunctions import defaults to the lower end of double precision (integer, parameter :: knd = selected_real_kind(8)). acousticTS uses 15 decimal digits for its double precision, which doesn't seem to make all that much of a difference when it comes to the outputs (at least when comparing the results from echoSMs and acousticTS) for the fixed rigid and pressure release boundaries (largest deviations being associated with nulls).

Case	Frequency set (kHz)	Max abs. $\Delta$ TS vs `echoSMs` (dB)	Mean abs. $\Delta$ TS vs `echoSMs` (dB)
`fixed_rigid`	`12, 18, 38, 70, 100`	`0.49692`	`0.10091`
`pressure_release`	`12, 18, 38, 70, 100`	`0.08619`	`0.01757`

The double-precision results track the benchmarked values for these boundaries pretty well. This is not the case for the liquid-filled benchmark (the tables below use the acousticTS outputs). Note that simplify_Amn here refers to the simplification defined by Furusawa (1988), which matches the outputs from PSMSModel in echoSMs.

Precision	`simplify_Amn`	`adaptive`	Max abs. $\Delta$ TS (dB)	Mean abs. $\Delta$ TS (dB)
`double`	`FALSE`	`FALSE`	8.67114	1.53575
`double`	`FALSE`	`TRUE`	17.34265	4.59850
`double`	`TRUE`	`FALSE`	7.18195	2.22574
`double`	`TRUE`	`TRUE`	7.18195	2.22574

These differences are really driven by a spectral decorrelation and loss of coherence in the higher $ka$ regimes (roughly starting at around 90 kHz for the benchmark values). Comparatively, running things at quad precision (33 decimal digits) does a remarkably better job with the benchmark.

Precision	`simplify_Amn`	`adaptive`	Max abs. $\Delta$ TS (dB)	Mean abs. $\Delta$ TS (dB)
`quad`	`FALSE`	`FALSE`	0.08263	0.02805
`quad`	`FALSE`	`TRUE`	0.08348	0.02806
`quad`	`TRUE`	`FALSE`	3.65223	1.46801
`quad`	`TRUE`	`TRUE`	3.65223	1.46801

When all is said and done, this quad-precision implementation for liquid-filled prolate spheroids shows some differences from echoSMs but aligns well with the Prol_Spheroid repository.

Software	Frequency set (kHz)	Max abs. $\Delta$ TS vs acousticTS (dB)	Mean abs. $\Delta$ TS vs acousticTS (dB)
echoSMs	`12, 18, 38, 70, 100`	`1.01676`	`0.20537`
Prol_Spheroid	`12, 18, 38, 70, 100`	`0.00128`	`0.00055`

I am not very familiar with how the F2PY generator works, but I think it would be helpful to include either separate quad-precision binaries or some precision-specific interface. For comparison, below is the implementation chain I used for linking R to Fortran.

acousticTS precision handling

The user makes a high-level, human-readable choice in the R function.

[ R Environment ]
|
|-- precision = "double"  ---> Maps to conceptually requested precision Tier 1
|
`-- precision = "quad"    ---> Maps to conceptually requested precision Tier 2

R passes this configuration to the C++ interface. Rcpp converts the string into a conditional logic flow, determining which Type Alias of psms_fbs to use:

[ C++ Interface ]
|
| (IF "double" is requested)
|---------------------------> Calls psms_fbs<double>
|
| (IF "quad" is requested)
`---------------------------> Calls psms_fbs<acousticts_quad_t>

This then maps to specific compiled binary files (via instructions from Makevars files) that interface with the same shared Fortran source code:

[ Fortran binaries ]
|
| (IF "double" is requested)
|---------------------------> psms_fbs<double> -> prolate_swf.o
|
| (IF "quad" is requested)
`---------------------------> psms_fbs<acousticts_quad_t> -> prolate_swf_quad.o

These binaries then send the appropriate instructions to the shared Fortran source code:

[ Fortran interface ]
|
| (IF "double" is requested)
|---------------------------> prolate_swf.o -> profcn_cpp_interface OR profcn_cpp_interface_batch
|
| (IF "quad" is requested)
`---------------------------> prolate_swf_quad.o -> profcn_cpp_interface_quad OR profcn_cpp_interface_batch_quad
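
For the Python side, the same dispatch idea could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_kernel_double`, `toy_kernel_quad`, and `run_toy_solver` are hypothetical names (not acousticTS or echoSMs functions), and Python's `decimal` module at 34 significant digits stands in for a compiled quad-precision backend. The point is just that a human-readable `precision` argument can select the internal arithmetic while the public return type stays a plain double.

```python
from decimal import Decimal, localcontext


def toy_kernel_double(delta: float) -> float:
    """Numerically sensitive step carried out entirely in double precision."""
    return (1.0 + delta) - 1.0


def toy_kernel_quad(delta: float) -> float:
    """Same step in a 34-digit decimal context (a stand-in for quad)."""
    with localcontext() as ctx:
        ctx.prec = 34
        d = Decimal(delta)  # exact conversion of the incoming double
        result = (Decimal(1) + d) - Decimal(1)
    # Cast back to double only at the interface boundary.
    return float(result)


# Hypothetical registry mapping the user-facing choice to a backend.
BACKENDS = {"double": toy_kernel_double, "quad": toy_kernel_quad}


def run_toy_solver(delta: float, precision: str = "double") -> float:
    try:
        backend = BACKENDS[precision]
    except KeyError:
        raise ValueError(f"precision must be one of {sorted(BACKENDS)}")
    return backend(delta)


print(run_toy_solver(1e-20, precision="double"))
print(run_toy_solver(1e-20, precision="quad"))
```

With `delta = 1e-20`, the double path returns `0.0` because the increment is absorbed by the addition, whereas the quad stand-in recovers a value of approximately `1e-20`, even though both calls hand back an ordinary `float`.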

Runtime

Obviously the backend for the PSMS model is pretty costly. For runtimes on my local machine:

Case	Frequency set (kHz)	acousticTS elapsed (s)	echoSMs elapsed (s)	Prol_Spheroid original elapsed (s)	Prol_Spheroid vectorized elapsed (s)
`fixed_rigid`	`12, 18, 38, 70, 100`	`0.86`	`0.34`	`N/A`	`N/A`
`pressure_release`	`12, 18, 38, 70, 100`	`0.93`	`0.33`	`N/A`	`N/A`
`liquid_filled`	`12, 18, 38, 70, 100`	`2.72`	`48.02`	`48.65`	`11.06`

The acousticTS runtimes were measured with the argument adaptive = TRUE. The adaptive algorithm I implemented makes the following changes:

- Quadrature: the number of integration points is adjusted on a frequency-by-frequency basis according to the relative modal difficulty (a combination of max $m$ and $n$) and the frequency-specific $\chi$ quantity. For reference, acousticTS uses Gauss-Legendre quadrature for integration.
- Tail stopping: a defined gradient threshold dictates whether the tail end of the modal terms can be considered negligible.
  - precision="double": relative tolerance = 1e-8, absolute tolerance = 1e-12
  - precision="quad": relative tolerance = 1e-12, absolute tolerance = 1e-18
- Kernel proxies/surrogates: these help reduce the size of the assembled kernel matrices (i.e., $K_{n \ell}^{m(1)}$ and $K_{n \ell}^{m(3)}$).
- Caching: lightweight caches for repeated Legendre- and quadrature-related functions were added to the Fortran source code.
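
To make the tail-stopping idea concrete, here is a minimal Python sketch. It is not the acousticTS criterion: the rule below (stop once a few consecutive terms fall under `max(atol, rtol * |partial sum|)`) and the `patience` parameter are assumptions chosen for illustration, and the sum itself runs in ordinary Python floats. Only the tolerance pairs are taken from the list above.

```python
# Tolerances quoted above: precision -> (relative, absolute)
TOLERANCES = {"double": (1e-8, 1e-12), "quad": (1e-12, 1e-18)}


def truncated_sum(terms, precision="double", patience=2):
    """Sum a modal-style series, stopping once the tail looks negligible.

    `patience` (hypothetical) is how many consecutive small terms are
    required before stopping, to avoid quitting on an isolated null.
    """
    rtol, atol = TOLERANCES[precision]
    total = 0.0
    used = 0
    quiet = 0
    for term in terms:
        total += term
        used += 1
        if abs(term) <= max(atol, rtol * abs(total)):
            quiet += 1
            if quiet >= patience:
                break
        else:
            quiet = 0
    return total, used


# Toy series: geometric terms 0.5**k, whose full sum is 2.
total_d, used_d = truncated_sum((0.5 ** k for k in range(200)), "double")
total_q, used_q = truncated_sum((0.5 ** k for k in range(200)), "quad")
print(used_d, used_q)
```

The tighter `quad` tolerances keep more terms of the toy series than the `double` tolerances, which is the trade-off the thresholds are meant to control.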

These changes as well as moving practically all of the logic-handling and processing from R into C++ conferred pretty sizable improvements to model runtimes for the fluid-filled boundary. The fixed rigid and pressure-release boundaries do benefit from this, but really only when you compute over wide frequency bands with fine-scale intervals. Not included in the below benchmarks/comparisons, but it also helped reduce memory-related errors and maxing out computer hardware (no one likes a BSOD...).

This adaptive implementation does a fairly good job retaining model agreement with the benchmark values, e.g. the liquid-filled boundary:

Frequency (kHz)	Benchmark TS (dB)	Literal TS (dB)	Adaptive TS (dB)	Adaptive `n_integration`	Literal $\Delta$ TS (dB)	Adaptive $\Delta$ TS (dB)
12	-87.05	-87.05331	-87.05331	32	-0.00331	-0.00331
18	-81.19	-81.19965	-81.19965	32	-0.00965	-0.00965
38	-77.17	-77.20046	-77.20046	32	-0.03046	-0.03046
70	-76.92	-76.95042	-76.95042	32	-0.03042	-0.03042
120	-80.58	-80.55970	-80.55951	32	0.02030	0.02049
200	-89.31	-89.39263	-89.39348	48	-0.08263	-0.08348
250	-79.39	-79.42879	-79.42819	56	-0.03879	-0.03819
300	-77.52	-77.51659	-77.51653	64	0.00341	0.00347
333	`NA`	-76.90684	-76.90677	72	`NA`	`NA`
400	-78.41	-78.44349	-78.44309	88	-0.03349	-0.03309
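
The summary statistics for this comparison can be recomputed directly from the rows above. The short Python sketch below does that; `ROWS` and `abs_delta_summary` are names introduced here for illustration, and the 333 kHz row is skipped because it has no benchmark value.

```python
# (frequency kHz, benchmark TS, literal TS, adaptive TS), copied from the table
ROWS = [
    (12, -87.05, -87.05331, -87.05331),
    (18, -81.19, -81.19965, -81.19965),
    (38, -77.17, -77.20046, -77.20046),
    (70, -76.92, -76.95042, -76.95042),
    (120, -80.58, -80.55970, -80.55951),
    (200, -89.31, -89.39263, -89.39348),
    (250, -79.39, -79.42879, -79.42819),
    (300, -77.52, -77.51659, -77.51653),
    (333, None, -76.90684, -76.90677),  # no benchmark available
    (400, -78.41, -78.44349, -78.44309),
]


def abs_delta_summary(column):
    """Return (max, mean) of |model TS - benchmark TS| for one model column."""
    deltas = [abs(row[column] - row[1]) for row in ROWS if row[1] is not None]
    return max(deltas), sum(deltas) / len(deltas)


for label, column in (("literal", 2), ("adaptive", 3)):
    max_abs, mean_abs = abs_delta_summary(column)
    print(f"{label}: max = {max_abs:.5f} dB, mean = {mean_abs:.5f} dB")
```

Both maxima occur at 200 kHz, and the literal and adaptive summaries differ only in the fourth or fifth decimal place.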

This consistency comes with a reduction in computation times (12, 18, 38, 70, 120, 200, 250, 300, 333, 400 kHz):

Rigid and pressure release

Boundary	Precision	`adaptive`	`n_integration`	Elapsed time (s)
`fixed_rigid`	`double`	`FALSE`	`96`	0.20
`fixed_rigid`	`double`	`TRUE`		0.14
`fixed_rigid`	`quad`	`FALSE`	`96`	13.61
`fixed_rigid`	`quad`	`TRUE`		13.53
`pressure_release`	`double`	`FALSE`	`96`	0.14
`pressure_release`	`double`	`TRUE`		0.14
`pressure_release`	`quad`	`FALSE`	`96`	13.38
`pressure_release`	`quad`	`TRUE`		13.36

Liquid filled

Precision	`simplify_Amn`	`adaptive`	`n_integration`	Elapsed time (s)
`double`	`FALSE`	`FALSE`	`96`	1.81
`double`	`FALSE`	`TRUE`		1.31
`double`	`TRUE`	`FALSE`	`96`	0.18
`double`	`TRUE`	`TRUE`		0.15
`quad`	`FALSE`	`FALSE`	`96`	59.66
`quad`	`FALSE`	`TRUE`		48.33
`quad`	`TRUE`	`FALSE`	`96`	14.75
`quad`	`TRUE`	`TRUE`		14.56
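
As a quick worked example, the quad-to-double slowdown and the saving from the adaptive mode can be derived from the liquid-filled timings above. The Python below only does arithmetic on the tabulated values; the dictionary layout and function names are my own.

```python
# (precision, simplify_Amn, adaptive) -> elapsed seconds, from the table above
ELAPSED = {
    ("double", False, False): 1.81,
    ("double", False, True): 1.31,
    ("double", True, False): 0.18,
    ("double", True, True): 0.15,
    ("quad", False, False): 59.66,
    ("quad", False, True): 48.33,
    ("quad", True, False): 14.75,
    ("quad", True, True): 14.56,
}


def quad_slowdown(simplify, adaptive):
    """How many times slower quad is than double for the same settings."""
    return ELAPSED[("quad", simplify, adaptive)] / ELAPSED[("double", simplify, adaptive)]


def adaptive_saving(precision, simplify):
    """Fraction of runtime removed by switching adaptive from FALSE to TRUE."""
    fixed = ELAPSED[(precision, simplify, False)]
    adaptive = ELAPSED[(precision, simplify, True)]
    return 1.0 - adaptive / fixed


for simplify in (False, True):
    for adaptive in (False, True):
        ratio = quad_slowdown(simplify, adaptive)
        print(f"simplify_Amn={simplify}, adaptive={adaptive}: quad is {ratio:.1f}x slower")

for precision in ("double", "quad"):
    saving = adaptive_saving(precision, False)
    print(f"{precision}, full Amn: adaptive saves {100 * saving:.1f}%")
```

For the full (non-simplified) solve, quad comes out roughly 33 to 37 times slower than double, and the adaptive mode trims about 28% (double) and 19% (quad) of the runtime.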

gavinmacaulay · 2026-04-06T09:17:15Z

Maintainer

Thanks for this! Improving the echoSMs PSMS model performance has been on the list for a while, so it's nice to have some concrete input on how to improve things :)

Using quad floats from Fortran in Python is fiddly at best so I defaulted to the easier double precision to get things working. Does your model use quad precision in the PSMS model or just in the calculation of the spheroidal values? If it's enough just to do the spheroidal calculations in quad but return the results in double precision, that would make it much easier to support.

Your notes have also helped me discover that the argument to selected_real_kind is digits of precision, not bytes used to represent the float. Interestingly, the documentation for the Fortran spheroidal wave functions clearly considers the argument to be bytes (and my memory from using Fortran some 35 years ago was that one specified the precision using bytes). At worst this may mean that the spheroidalwavefunctions package is using single precision, and at best double precision (selected_real_kind uses the smallest available float to meet the requested number of digits).

Another complication is that most (all?) Intel and AMD CPU's don't natively support quad precision so they get emulated with software, leading to calculations with quad precision being 10-60 times slower than for double precision (according to some searches). But accuracy of calculations wins over speed, so it's still likely to be worthwhile to do...

brandynlucca · 2026-04-06T17:25:08Z

Author

> Thanks for this! Improving the echoSMs PSMS model performance has been on the list for a while, so it's nice to have some concrete input on how to improve things :)

I'm always happy to help spare other folks the same frustrations and time-sinks I ran into while trying to improve precision without taking eons to run!

> Using quad floats from Fortran in Python is fiddly at best so I defaulted to the easier double precision to get things working.

The distinction I would make is that Python does not necessarily need to traffic in quad values directly for the numerically sensitive parts of the PSMS solver to remain in quad internally. I know that float128/longdouble is highly platform-dependent (e.g., the standard Windows compiler limits np.longdouble to being identical to np.float64), so it totally makes sense to avoid dealing with that altogether on the Python side of things. That is basically the route I took on the R side: keep the difficult compiled stages in the requested arithmetic, then cast back to double only at the interface boundary. So I let C++ and Fortran handle all of the heavy precision lifting and let R hand off the results to end users.

> Does your model use quad precision in the PSMS model or just in the calculation of the spheroidal values? If it's enough just to do the spheroidal calculations in quad but return the results in double precision, that would make it much easier to support.

Quad precision is not only used inside the spheroidal-wave-function calls. For the full penetrable PSMS branch, I keep the overlap integrals, kernel assembly, and dense per-m solves in the requested arithmetic as well, and only cast back to double at the R interface at the very end. So if the question is whether the API can still return doubles, then yes. But if the question is whether it is enough to do only the spheroidal evaluations in quad and then do the PSMS kernel algebra in double, I would be cautious about assuming that for the harder penetrable cases.

Returning doubles at the API boundary is fine; demoting before the dense solve is different. My implementation uses a somewhat different linear-solver path from echoSMs, but the general issue is the same: once the matrix is demoted before the SVD/pseudoinverse or related solve step, the conditioning, singular-value thresholding, and final coefficients are all governed by double precision. At that point it is no longer an end-to-end quad-precision PSMS solve. When developing this implementation, I found that the main source of precision drift was not just in evaluating $S_\text{mn}$ and $R_\text{mn}$ themselves, but in how those values propagate through the overlap integrals and the ill-conditioned kernel systems for the full liquid-filled solve. For rigid and pressure-release cases, double precision seems to be fine. For the full penetrable prolate case, though, double and quad can separate by multiple dB once the retained modal limits get large enough.

My sense is that a lot of the numerical trouble starts in the spheroidal-function layer, but for the full penetrable PSMS branch the overlap integrals and dense solve can amplify that error rather than wash it out. There is an observable difference in TS at high $ka$ when keeping only the spheroidal-function layer in quad precision and bookkeeping everything else as doubles. The mixed precision case gets you closer to the benchmarked values, but that just somewhat kicks the proverbial precision-drift-can down to a slightly greater $ka$. So for me this is partly a numerical question and partly a user-facing one: if precision = "quad" is exposed as a public argument, I want that to describe the numerically sensitive parts of the PSMS solve, not just the special-function subroutine.
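
A toy illustration of why the point of demotion matters: the Python sketch below solves the same ill-conditioned linear system twice, once after casting the matrix to double before the solve and once keeping extended precision through the solve and casting only the final answer. Everything here is an assumption made for illustration. A 10 x 10 Hilbert matrix stands in for an ill-conditioned kernel system, plain Gaussian elimination stands in for the SVD/pseudoinverse step, and `decimal` at 34 digits stands in for quad. None of it is the acousticTS or echoSMs solver.

```python
from decimal import Decimal, localcontext
from fractions import Fraction


def solve(a, b):
    """Gaussian elimination with partial pivoting (float or Decimal entries)."""
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    x = [None] * n
    for r in range(n - 1, -1, -1):
        acc = b[r]
        for c in range(r + 1, n):
            acc -= a[r][c] * x[c]
        x[r] = acc / a[r][r]
    return x


def compare_demotion(n=10):
    """Return (error when demoting before the solve, error when demoting after)."""
    # Build H x = b exactly, with true solution x = (1, ..., 1).
    h = [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]
    rhs = [sum(row) for row in h]
    with localcontext() as ctx:
        ctx.prec = 34  # decimal stand-in for quad
        a_q = [[Decimal(v.numerator) / Decimal(v.denominator) for v in row] for row in h]
        b_q = [Decimal(v.numerator) / Decimal(v.denominator) for v in rhs]
        # Path 1: high-precision entries, but cast to double before solving.
        x_early = solve([[float(v) for v in row] for row in a_q], [float(v) for v in b_q])
        # Path 2: solve in high precision, cast to double at the very end.
        x_late = [float(v) for v in solve(a_q, b_q)]

    def max_err(x):
        return max(abs(v - 1.0) for v in x)

    return max_err(x_early), max_err(x_late)


early, late = compare_demotion()
print(f"demote before solve: {early:.3e}")
print(f"demote after solve:  {late:.3e}")
```

Both paths return doubles, but only the second keeps the conditioning of the solve under the higher-precision arithmetic, which mirrors the distinction between returning doubles at the API boundary and demoting before the dense solve.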

> Your notes have also helped me discover that the argument to selected_real_kind is digits of precision, not bytes used to represent the float. Interestingly, the documentation for the Fortran spheroidal wave functions clearly considers the argument to be bytes (and my memory from using Fortran some 35 years ago was that one specified the precision using bytes). At worst this may mean that the spheroidalwavefunctions package is using single precision, and at best double precision (selected_real_kind uses the smallest available float to meet the requested number of digits).

Yeah, the source Fortran code erroneously implies that the selected_real_kind(8) and selected_real_kind(16) represent double and quad precision, respectively. If I had to hazard a guess, this probably was more a coincidence based on whatever compiler was used for the original code. And now that I think of it, I should probably add this sort of information to the documentation since not everyone runs the same compiler...

For example, my compiler (gfortran 6.3.0) spits out:

p=6  -> kind 4  -> precision 6
p=7  -> kind 8  -> precision 15
p=8  -> kind 8  -> precision 15
p=15 -> kind 8  -> precision 15
p=16 -> kind 10 -> precision 18
p=18 -> kind 10 -> precision 18
p=33 -> kind 16 -> precision 33

So the selected_real_kind(...) values used in acousticTS (15, 33) are based on the arithmetic precision supported by Fortran (precision()), not the decimal widths (17, 36) defined when using the IEEE 754 convention. In that sense, selected_real_kind(8) and selected_real_kind(16) are bad ways to refer to double and quad, because they are only minimum precision requests, not fixed format names. On some compilers, selected_real_kind(8) just so happens to land on double, while selected_real_kind(16) lands on an 18-digit kind rather than quad. So if I had used selected_real_kind(16) with such a compiler, my results would be orders of magnitude coarser than true quad precision.
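
The mapping above can be reproduced with a few lines of Python. This sketch assumes the four real kinds gfortran typically exposes on x86 (kinds 4, 8, 10, 16 with 24, 53, 64, and 113 significand bits); other compilers and platforms may offer a different set. The function names mirror the Fortran intrinsics but are plain Python stand-ins.

```python
import math

# Assumed (kind, significand bits) pairs for gfortran on x86.
GFORTRAN_X86_KINDS = [(4, 24), (8, 53), (10, 64), (16, 113)]


def decimal_precision(bits):
    """Fortran precision() for a binary format: floor((bits - 1) * log10(2))."""
    return math.floor((bits - 1) * math.log10(2))


def round_trip_digits(bits):
    """Decimal digits needed to print any value of the format unambiguously."""
    return math.ceil(1 + bits * math.log10(2))


def selected_real_kind(p):
    """Smallest kind offering at least p decimal digits, or -1 if none does."""
    for kind, bits in GFORTRAN_X86_KINDS:
        if decimal_precision(bits) >= p:
            return kind
    return -1


for p in (6, 7, 8, 15, 16, 18, 33):
    kind = selected_real_kind(p)
    bits = dict(GFORTRAN_X86_KINDS)[kind]
    print(f"p={p:<2} -> kind {kind:<2} -> precision {decimal_precision(bits)}")

print(round_trip_digits(53), round_trip_digits(113))
```

The loop reproduces the table above, and the last line gives the 17 and 36 digit widths for double and quad.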

> Another complication is that most (all?) Intel and AMD CPU's don't natively support quad precision so they get emulated with software, leading to calculations with quad precision being 10-60 times slower than for double precision (according to some searches). But accuracy of calculations wins over speed, so it's still likely to be worthwhile to do...

That is also definitely a good point and matches my experience. In my benchmarks the quad runs are slower by tens of times rather than by a small, consistent factor, so I agree that the cost is very real. One alternative would be double-double arithmetic, which gets close to quad precision (something like 31 decimal digits?). Double-double only retains the exponent range of a double, but the Fortran code already does a lot of mantissa and exponent scaling, so I don't think that would have a substantial impact. That said, it is still not true binary128, and you still run into the issue of double-precision function error being injected into a higher-precision downstream solve.
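
For anyone unfamiliar with double-double, here is a minimal Python sketch of the core idea: a value is stored as an unevaluated sum of two doubles `(hi, lo)`, and error-free transformations such as Knuth's TwoSum capture the rounding error of each addition. The helper names are mine, the addition shown is the simple (less rigorous) variant, and a real implementation would also need multiplication, division, and elementary functions.

```python
from fractions import Fraction


def two_sum(a, b):
    """Knuth's TwoSum: returns (s, err) with s + err == a + b exactly."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def dd_add(x, y):
    """Add two double-double values given as (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    hi = s + e
    lo = e - (hi - s)  # fast TwoSum renormalisation (assumes |s| >= |e|)
    return hi, lo


def dd_sum(values):
    """Accumulate plain doubles into a double-double running total."""
    acc = (0.0, 0.0)
    for v in values:
        acc = dd_add(acc, (v, 0.0))
    return acc


values = [0.1] * 10
exact = sum(Fraction(v) for v in values)  # exact sum of the stored doubles

plain = 0.0
for v in values:  # explicit loop: naive double-precision accumulation
    plain += v

hi, lo = dd_sum(values)
plain_err = abs(Fraction(plain) - exact)
dd_err = abs(Fraction(hi) + Fraction(lo) - exact)
print(float(plain_err), float(dd_err))
```

The `lo` word carries the bits that a single double drops, which is where the extra precision comes from. The exponent range is unchanged, since both words are ordinary doubles.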
