Block multiple sled reservations with the same gen by jmpesp · Pull Request #10479 · oxidecomputer/omicron

jmpesp · 2026-05-21T16:56:39Z

If multiple instance-start sagas concurrently attempt to allocate for the same instance, this temporarily results in multiple rows in `sled_resource_vmm` with different propolis ids for the same instance id. One of the instance-start sagas will succeed, while the others will unwind (due to an "instance changed state before it could be started" error from `sis_move_to_starting`) and remove the `sled_resource_vmm` record they added by matching on that saga's propolis id.

There has never been a uniqueness constraint on instance id in the `sled_resource_vmm` table, and there cannot be one: it would make it impossible to migrate an instance, since migration creates a new record on a different sled for the same instance.

For an instance start that performs any new local storage allocation, this is a problem: the latent assumption in inserting and updating local storage related records is that this type of duplication cannot occur, and that if the insert succeeds then the allocation is only performed once. Because this is not true, the CTE will happily stomp all over the local storage allocation related records, which leads to the orphaning seen in the linked issue.

The fix is to add a uniqueness constraint to `sled_resource_vmm` that ensures only one record exists for a given instance id plus instance state generation number. This will not affect migration because the instance state generation is bumped in that case.
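To illustrate the intent of the constraint, here is a minimal in-memory sketch in Rust. It is not the actual schema or datastore code from this PR: the `Reservation`, `ReservationTable`, and `ReserveError` types, the integer ids, and the method names are all hypothetical stand-ins, and the real constraint lives in the database rather than in application code. The sketch only models the rule that one row may exist per (instance id, instance state generation) pair.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a `sled_resource_vmm` row.
/// Real rows use UUIDs and carry resource fields; integers keep this short.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq, Eq)]
struct Reservation {
    propolis_id: u32,
    instance_id: u32,
    instance_state_generation: u64,
    sled_id: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum ReserveError {
    /// Another row already holds this (instance id, generation) pair.
    DuplicateGeneration { existing_propolis_id: u32 },
}

/// Hypothetical in-memory model of the table plus the new unique index.
#[derive(Default)]
struct ReservationTable {
    /// Rows keyed by propolis id (assumed unique per saga in this sketch).
    rows: HashMap<u32, Reservation>,
    /// Models the uniqueness constraint: (instance id, generation) -> propolis id.
    by_instance_gen: HashMap<(u32, u64), u32>,
}

impl ReservationTable {
    /// Insert a reservation, rejecting a second row for the same
    /// instance id and instance state generation.
    fn insert(&mut self, row: Reservation) -> Result<(), ReserveError> {
        let key = (row.instance_id, row.instance_state_generation);
        if let Some(&existing) = self.by_instance_gen.get(&key) {
            return Err(ReserveError::DuplicateGeneration {
                existing_propolis_id: existing,
            });
        }
        self.by_instance_gen.insert(key, row.propolis_id);
        self.rows.insert(row.propolis_id, row);
        Ok(())
    }

    /// What an unwinding saga does: remove only the row it added,
    /// matched by its own propolis id.
    fn remove_by_propolis_id(&mut self, propolis_id: u32) -> Option<Reservation> {
        let row = self.rows.remove(&propolis_id)?;
        self.by_instance_gen
            .remove(&(row.instance_id, row.instance_state_generation));
        Some(row)
    }

    /// Count the rows currently held for one instance.
    fn rows_for_instance(&self, instance_id: u32) -> usize {
        self.rows
            .values()
            .filter(|r| r.instance_id == instance_id)
            .count()
    }
}
```

In this model, two start attempts that both observe the same generation cannot both insert a row, while a migration that observes a bumped generation can still add a second row for the same instance on a different sled.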

This commit also changes the local storage related unit tests to explicitly specify the ncpus and memory for the fake instances, as inspecting the `sled_resource_vmm` records produced by the tests showed the resources didn't match the instance specification.

Fixes oxidecomputer/customer-support#1184.

jmpesp requested a review from hawkw May 21, 2026 16:56

jmpesp wants to merge 1 commit into oxidecomputer:main from jmpesp:instance_state_generation_in_sled_reservation
