feat: Windows SHA-stamped surrogate filename and configurable surrogate pool size #1339
simongdavies wants to merge 1 commit into hyperlight-dev:main
- Use SHA-256 content hash in surrogate binary filename
(hyperlight_surrogate_{sha8}.exe) to eliminate cross-version
ACCESS_DENIED race when multiple hyperlight versions coexist.
- Add HYPERLIGHT_INITIAL_SURROGATES env var (1-512, default 512)
to control how many surrogate processes are pre-created at startup.
- Add HYPERLIGHT_MAX_SURROGATES env var (>=initial, <=512, default 512)
as hard cap with on-demand CAS growth when pool is exhausted.
- Rollback created_count on process creation failure to prevent
permanent capacity loss from transient errors.
- Increment created_count per-process (not store-after-loop) to
prevent count drift on partial init failure.
- Warn when env var values are clamped to valid range.
- Add tests for env var parsing (with #[serial] for thread safety)
and locked-file extraction resilience.
- Update surrogate development notes documentation.
Signed-off-by: Simon Davies <simongdavies@users.noreply.github.com>
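The on-demand growth and rollback bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `SurrogatePool`, `try_reserve`, and `grow` are hypothetical names, and only `created_count` comes from the commit message. A slot is reserved with a compare-and-swap loop so the cap is never exceeded, and the reservation is released if process creation fails.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical pool counter; names other than `created_count` are illustrative.
struct SurrogatePool {
    created_count: AtomicUsize,
    max: usize,
}

impl SurrogatePool {
    fn new(max: usize) -> Self {
        Self { created_count: AtomicUsize::new(0), max }
    }

    /// Reserve one slot below the cap using a compare-and-swap loop.
    /// Returns false when the pool is already at `max`.
    fn try_reserve(&self) -> bool {
        let mut current = self.created_count.load(Ordering::Acquire);
        loop {
            if current >= self.max {
                return false;
            }
            match self.created_count.compare_exchange_weak(
                current,
                current + 1,
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return true,
                // Another thread changed the count; retry with the observed value.
                Err(observed) => current = observed,
            }
        }
    }

    /// Grow the pool by one. `create` stands in for spawning a surrogate process.
    /// Returns None when the cap is reached; on a creation error the reserved
    /// slot is rolled back so a transient failure does not cost capacity.
    fn grow<T, E>(&self, create: impl FnOnce() -> Result<T, E>) -> Option<Result<T, E>> {
        if !self.try_reserve() {
            return None;
        }
        let result = create();
        if result.is_err() {
            self.created_count.fetch_sub(1, Ordering::AcqRel);
        }
        Some(result)
    }
}
```

Counting each process as it is created, rather than storing a total after the loop, keeps the counter accurate even if initialization stops partway through.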
ludfjig left a comment
Did you consider making these knobs on SandboxConfiguration instead?
fn surrogate_binary_name() -> Result<String> {
    let exe = Asset::get(EMBEDDED_SURROGATE_NAME)
        .ok_or_else(|| new_error!("could not find embedded surrogate binary"))?;
    let sha = sha256::digest(exe.data.as_ref());
Have you considered making this random instead? That would additionally cover the case where you run two instances of hyperlight of the same version.
This is already covered, since we no longer fail if the copy cannot proceed. The challenge with random names is twofold: first, they would potentially fill up the disk over time, or at least leave a lot of redundant files around; second, they are potentially more likely to trigger AV. There are a couple more issues here that need to be solved, which I will open issues for, so for now I will leave it as is.
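To illustrate the behavior described in the reply above, here is a minimal sketch under stated assumptions rather than the code in this PR. `surrogate_file_name` and `ensure_extracted` are hypothetical helpers, the digest is passed in as a hex string instead of being computed, and `{sha8}` is assumed to mean the first 8 hex characters of the SHA-256 digest. If the file cannot be created but a file with the content-derived name already exists, extraction is treated as done.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Build the content-stamped file name from a hex digest.
/// Assumption: `sha8` is the first 8 hex characters of the SHA-256 digest.
fn surrogate_file_name(sha256_hex: &str) -> String {
    let sha8: String = sha256_hex.chars().take(8).collect();
    format!("hyperlight_surrogate_{sha8}.exe")
}

/// Write the embedded binary to `dir` unless a file with the same
/// content-derived name is already present (for example, extracted by another
/// process manager and possibly in use). Hypothetical helper for illustration.
fn ensure_extracted(dir: &Path, sha256_hex: &str, bytes: &[u8]) -> io::Result<PathBuf> {
    let path = dir.join(surrogate_file_name(sha256_hex));
    match OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(mut file) => {
            file.write_all(bytes)?;
            Ok(path)
        }
        // Creation failed but the file is there: same name implies same
        // content, so reuse it instead of failing.
        Err(_) if path.exists() => Ok(path),
        Err(e) => Err(e),
    }
}
```

Because the name is derived from the content, repeated runs reuse one file per surrogate version instead of accumulating randomly named copies. The sketch does not handle a file that another process has only partially written.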
Yes, but since I wanted the values to apply equally to any surrogate process managers in the process, there isn't a place to put them. The values are also not SandboxConfiguration. I think when we revise the API we can look at this again.
Fixes hyperlight surrogate process manager on Windows:
If two different implementations of hyperlight are used by a single host (e.g. hyperlight and hyperlight-js), they may, although this is highly unlikely, have different versions of the hyperlight surrogate binary (if the hyperlight versions are different). More likely they use the same version, in which case the second implementation will try to overwrite the surrogate process exe that is already in use.
If there are multiple implementations of hyperlight, each with its own surrogate process manager, then under current behavior each one will spin up 512 surrogate processes. Not only does this waste resources and take time, it also means that there will be 1024 surrogate processes, which is more than can be supported in a single process.
The issue with file copying prevents hyperagent from running on Windows (as it uses both hyperlight and hyperlight-js). hyperagent also does not need the overhead of 512 surrogate processes.
There are other scenarios in which hyperlight may be used where this upfront creation of surrogate processes is both unnecessary and wasteful.
This PR introduces the following changes to deal with this:
- Use SHA-256 content hash in surrogate binary filename (hyperlight_surrogate_{sha8}.exe) to eliminate cross-version ACCESS_DENIED race when multiple hyperlight versions coexist.
- Add HYPERLIGHT_INITIAL_SURROGATES env var (1-512, default 512) to control how many surrogate processes are pre-created at startup.
- Add HYPERLIGHT_MAX_SURROGATES env var (>=initial, <=512, default 512) as hard cap with on-demand CAS growth when pool is exhausted.
- Add tests for env var parsing (with #[serial] for thread safety) and locked-file extraction resilience.
- Update surrogate development notes documentation.
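As a rough illustration of the ranges listed above, the two environment variables could be parsed and clamped along these lines. This is a sketch under stated assumptions, not the code in this PR: `parse_clamped`, `pool_sizes`, and `pool_sizes_from_env` are hypothetical names, and treating an unset or unparseable value as the default is an assumption.

```rust
/// Upper bound on the pool size, per the ranges in the PR description.
const MAX_POOL: usize = 512;

/// Parse an optional raw value and clamp it into [min, max].
/// Returns the value and whether clamping changed it.
/// Assumption: unset or unparseable input falls back to `default`.
fn parse_clamped(raw: Option<&str>, min: usize, max: usize, default: usize) -> (usize, bool) {
    match raw.and_then(|s| s.trim().parse::<usize>().ok()) {
        Some(value) => {
            let clamped = value.clamp(min, max);
            (clamped, clamped != value)
        }
        None => (default, false),
    }
}

/// Resolve (initial, max): initial in 1..=512, max in initial..=512.
fn pool_sizes(initial_raw: Option<&str>, max_raw: Option<&str>) -> (usize, usize) {
    let (initial, initial_clamped) = parse_clamped(initial_raw, 1, MAX_POOL, MAX_POOL);
    if initial_clamped {
        eprintln!("warning: HYPERLIGHT_INITIAL_SURROGATES clamped to {initial}");
    }
    let (max, max_clamped) = parse_clamped(max_raw, initial, MAX_POOL, MAX_POOL);
    if max_clamped {
        eprintln!("warning: HYPERLIGHT_MAX_SURROGATES clamped to {max}");
    }
    (initial, max)
}

/// Read both variables from the process environment.
fn pool_sizes_from_env() -> (usize, usize) {
    let initial = std::env::var("HYPERLIGHT_INITIAL_SURROGATES").ok();
    let max = std::env::var("HYPERLIGHT_MAX_SURROGATES").ok();
    pool_sizes(initial.as_deref(), max.as_deref())
}
```

Keeping the parsing in a function that takes plain strings means most cases can be tested without mutating the process environment, which is the shared state that would otherwise require serialized tests.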