
rego: Introduce metadata rollback and enhance handling of mount failures in C-LCOW #2768

Merged
anmaxvl merged 12 commits into microsoft:main from micromaomao:fix-metadata-desync on Jun 12, 2026

@micromaomao micromaomao commented Jun 9, 2026

Member

Introduce a metadata rollback mechanism WithTransaction(). This fixes the Rego "metadata desync" bug.

This is extracted from #2559 as a request from @anmaxvl


For mounts, a small piece of cleanup code is added to ensure we close the
dm-crypt device if we fail to mount it. Aside from this, it is relatively
straightforward: if we get a failure, the cleanup functions will remove the
directory, remove any dm-devices, and we will revert the Rego metadata.

For unmounts, careful consideration is needed: if the directory has been
unmounted successfully (or even partially successfully?) and we then get an
error, we cannot just revert the policy state, as this may allow the host to use
a broken / empty mount as one of the image layers. See 615c9a0bdf's commit
message for more detailed thoughts.

The solution I opted for is that, for non-trivial unmount failure cases (i.e.
not policy denial, not invalid mountpoint), we block all further mount, unmount,
container creation and deletion attempts. I think this is OK since we really do
not expect unmounts to fail. This is especially the case for us, since the only
writable disk mount we have is the shared scratch disk, which we do not unmount
at all unless we're about to kill the UVM.

Testing

The "Rollback policy state on mount errors" commit message has some instructions
for making a deliberately corrupted VHD (one that still passes the verifyinfo
extraction) that will cause a mount error. The other way we could make mount /
unmount fail, and thus test this change, is by mounting a tmpfs or ro bind in
relevant places:

To make unmount fail:

    mkdir /run/gcs/c/.../rootfs/a && mount -t tmpfs none /run/gcs/c/.../rootfs/a

or

    mkdir /run/gcs/mounts/scsi/m1/a && mount -t tmpfs none /run/gcs/mounts/scsi/m1/a

To make mount fail:

    mount -o ro --bind /run/mounts/scsi /run/mounts/scsi

or

    mount --bind -o ro /run/gcs/c /run/gcs/c

Once a failure is triggered, one can make the operations work again by
`umount`ing the tmpfs or ro bind.

What about other operations?

This PR covers mount and unmount of disks, overlays and 9p. Aside from setting
metadata.matches as part of the narrowing scheme, and metadata.started to
prevent re-using a container ID, Rego does not use persistent state for any
other operations. It is not clear whether reverting the state would be
semantically correct for those operations (we would need to carefully consider
exactly what the side effects are of, say, failing to execute a process, start a
container, or send a signal), and adding the revert there would not currently
change much behaviour, so I've opted not to apply the metadata revert to them
for now.

Allow unrecoverable_error.go to build on Windows and fix IsSNP() invocation

IsSNP() can now return an error, although this is not expected on LCOW.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

[cherry-picked from 421b12249544a334e36df33dc4846673b2a88279]

This set of changes fixes the [Metadata Desync with UVM
State](https://msazure.visualstudio.com/One/_workitems/edit/33232631/) bug, by
reverting the Rego policy state on mount and some types of unmount failures.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Addresses the following comments:
microsoft#2762 (comment)
microsoft#2762 (comment)
microsoft#2762 (comment)
microsoft#2762 (comment)
microsoft#2762 (comment)
microsoft#2762 (comment)

Assisted-by: GitHub Copilot:claude-opus-4.7 copilot-review
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Assisted-by: GitHub Copilot:auto copilot-review
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Assisted-by: GitHub Copilot:auto copilot-review
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Assisted-by: GitHub Copilot:auto copilot-review
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
…ggestion from Maksim

Rebase done with Copilot without much manual review.

Assisted-by: GitHub Copilot:claude-opus-4.8 copilot-review
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
@micromaomao micromaomao requested a review from a team as a code owner June 9, 2026 16:40
@micromaomao micromaomao requested a review from anmaxvl June 9, 2026 16:40
@anmaxvl anmaxvl self-assigned this Jun 9, 2026

@helsaawy helsaawy left a comment

Contributor

Minor feedback; LGTM overall.

Comment thread internal/guest/runtime/hcsv2/uvm.go Outdated
micromaomao and others added 2 commits June 10, 2026 01:39
Suggested-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com>
Assisted-by: GitHub Copilot:claude-opus-4.8
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
- Rename WithTransaction to WithMetadataRollback for clarity
- Move unrecoverable error handling from policy layer to GCS layer
  to fix inverted dependency (pkg/securitypolicy -> internal/gcs)
- Replace UnrecoverableError with two-level UVM error state:
  - Inconsistent (unmount failure): blocks mount/unmount/container
    create/delete and policy injection, allows shutdown/signal/exec
    and diagnostics for cleanup and troubleshooting
  - Unrecoverable (policy metadata rollback failure): blocks all
    operations except exec and diagnostics, GCS self-terminates
    via panic after 1 hour grace period
- Remove unused internal/gcs/unrecoverable_error.go
- Add withPolicyRollback helper on Host to centralize sentinel
  detection and UVM state management
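
The two-level error state listed in this commit message can be read as a permission matrix. The following is a sketch of that reading only; the `uvmState` and `operation` types and the `allowed` function are hypothetical names, not identifiers from the PR.

```go
package main

import "fmt"

// uvmState models the two-level error state from the commit message.
type uvmState int

const (
	stateOK uvmState = iota
	stateInconsistent  // set after a non-trivial unmount failure
	stateUnrecoverable // set when the metadata rollback itself fails
)

// operation enumerates the request kinds named in the commit message.
type operation int

const (
	opMount operation = iota
	opUnmount
	opCreateContainer
	opDeleteContainer
	opInjectPolicy
	opShutdown
	opSignal
	opExec
	opDiagnostics
)

// allowed reports whether op may proceed in state s.
func allowed(s uvmState, op operation) bool {
	switch s {
	case stateInconsistent:
		switch op {
		case opMount, opUnmount, opCreateContainer, opDeleteContainer, opInjectPolicy:
			return false
		}
		return true // shutdown, signal, exec and diagnostics stay available
	case stateUnrecoverable:
		return op == opExec || op == opDiagnostics
	default:
		return true
	}
}

func main() {
	fmt.Println("inconsistent, mount:", allowed(stateInconsistent, opMount))
	fmt.Println("inconsistent, signal:", allowed(stateInconsistent, opSignal))
	fmt.Println("unrecoverable, shutdown:", allowed(stateUnrecoverable, opShutdown))
	fmt.Println("unrecoverable, exec:", allowed(stateUnrecoverable, opExec))
}
```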

Assisted-by: GitHub Copilot claude-opus-4.8
Signed-off-by: Maksim An <maksiman@microsoft.com>

@KenGordon KenGordon left a comment

Collaborator

"Unrecoverable (policy metadata rollback failure): blocks all
operations except exec and diagnostics, GCS self-terminates
via panic after 1 hour grace period"

In the real world you could never exec into the container due to policy. Thus this 1 hour of indeterminate state is just an exploit waiting to happen. In the event of a Fatal condition all we can do is while true; and wait for the host side to time out and kill the UVM.

We will have been in some indeterminate "should never happen" state, so there is no reasoning we can do about what is safe or not.

@anmaxvl

anmaxvl commented Jun 10, 2026

Contributor

> "Unrecoverable (policy metadata rollback failure): blocks all operations except exec and diagnostics, GCS self-terminates via panic after 1 hour grace period"
>
> In the real world you could never exec into the container due to policy. Thus this 1 hour of indeterminate state is just an exploit waiting to happen. In the event of a Fatal condition all we can do is while true; and wait for the host side to time out and kill the UVM.
>
> We will have been in some indeterminate "should never happen" state, so there is no reasoning we can do about what is safe or not.

@KenGordon In that case we should just panic or exit gcs/gcs-sidecar process upon hitting an unrecoverable error. For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.

Comment thread internal/guest/runtime/hcsv2/uvm.go Outdated
Comment on lines +126 to +129
// Unrecoverable: set when the policy metadata rollback itself fails
// (ErrFatalPolicyDesync). The policy state is corrupted beyond repair.
// Blocks all operations except exec and diagnostics. The GCS process
// will self-terminate after a grace period.

@micromaomao micromaomao Jun 10, 2026

Member Author

I agree with Ken that a 1 hour grace period is not useful. When there is an "unrecoverable" error, we can't expect the GCS or the security policy enforcer to continue to function correctly. Originally I used panic when e.g. policy rollback fails, and I would like any alternative solution to be as "impactful". This turns a panic into a soft state that users of the enforcer have to remember to check.

The cleanest solution would be to ensure that GCS can safely panic, but a for {} loop as @KenGordon suggested is also acceptable and practical. This also has the advantage that, if there is already an existing exec session when GCS hangs like this, that existing session (which does not depend on GCS responding anymore) can be used to debug things, if that is useful.

Contributor

A goroutine hitting UnrecoverableError would go to sleep for 1 hour, which is way longer than the bridge timeout. I believe the shim will hit that timeout and tear down the bridge connection, and any subsequent request would result in "bridge timeout" errors. I half "fixed" that by making sure that gcs actually dies after some time.

Member Author

> go routine hitting UnrecoverableError would go into sleep for 1 hour, which is way longer than a bridge timeout, which the shim will hit and tear down the bridge connection

It was a 1h sleep in an infinite loop, but I guess it's true that the bridge will time out, in which case disregard the comment about the ability to debug a hung GCS, as it won't last very long.

> I half "fixed" that by making sure that gcs actually dies after some time.

Actually I just realized that this means that we still eventually panic, so any concerns we have regarding GCS exiting still apply. If we aren't willing to panic right away, we probably also don't want to panic in an hour. So I would either go for a direct panic or stay with for { sleep }.

Collaborator

Should I start getting all insistent? The problem here is we KNOW the UVM is in a BAD state. What else can we SAFELY do other than sit in a loop? We can't fall out the bottom or we might leave the UVM in a strange state (maybe a restarted GCS). (In Linux I am sure that would be OK as the init process will die, but in Windows we are a service.) If we wait for an hour: 1) can we call the time code? 2) are we responding on the bridge? How can we trust our bridge code? All bets are off.

@micromaomao

Member Author

> For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.

To be precise, does it result in a CRITICAL_PROCESS_DIED bugcheck?

@KenGordon

Collaborator

> For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.
>
> To be precise, does it result in a CRITICAL_PROCESS_DIED bugcheck?

If we are certain of that we might panic() instead of while true;

micromaomao and others added 4 commits June 10, 2026 22:51
After discussion with Maksim, we decided to keep the `for { sleep }` loop
approach for now until the gcs-sidecar can be made a critical process and we're
confident that taking GCS down will take down the UVM, rather than e.g. restart
the GCS.

This reverts commit eefab95.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Co-authored-by: Maksim An <maksiman@microsoft.com>
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
ContainerPlatform pushed back against using an infinite loop due to concerns
with IcMs.  The plan is to simply panic now, then in a future PR make the
gcs-sidecar a critical process.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>
@micromaomao micromaomao requested a review from Copilot June 11, 2026 14:01
@micromaomao

Member Author

> For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.
>
> To be precise, does it result in a CRITICAL_PROCESS_DIED bugcheck?
>
> If we are certain of that we might panic() instead of while true;

The result of an experiment done yesterday was that we're not certain of this. As discussed in standup I've made this PR panic. Maksim already has a commit that makes the gcs-sidecar a critical process, but he prefers to merge it separately.

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a metadata-transaction/rollback mechanism for the Rego security policy interpreter and integrates it into UVM mount/unmount flows to keep policy state consistent during failures.

Changes:

  • Add WithMetadataRollback to the security policy enforcer interface and implement it for the Rego enforcer.
  • Add SaveMetadata/RestoreMetadata support to the Rego policy interpreter plus tests for deep-copying and save/restore behavior.
  • Update guest mount/unmount paths to roll back policy metadata on failures, add dm-crypt cleanup on SCSI mount failure, and introduce a “UVM inconsistent” state gate after certain unmount failures.
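
To illustrate why the save/restore support needs a deep copy, here is a minimal sketch. It assumes the metadata is a nested `map[string]interface{}` tree of maps, slices and scalar values; the `interpreter` type and its `save`/`restore` methods are hypothetical stand-ins, not the `SaveMetadata`/`RestoreMetadata` implementations from this PR.

```go
package main

import "fmt"

// deepCopy recursively copies maps and slices. Any other value is assumed
// to be an immutable scalar (string, number, bool) and is returned as-is.
func deepCopy(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, e := range t {
			out[k] = deepCopy(e)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, e := range t {
			out[i] = deepCopy(e)
		}
		return out
	default:
		return t
	}
}

// interpreter is a hypothetical holder of policy metadata.
type interpreter struct {
	metadata map[string]interface{}
}

// save returns a snapshot that shares no maps or slices with the live state.
func (i *interpreter) save() map[string]interface{} {
	return deepCopy(i.metadata).(map[string]interface{})
}

// restore replaces the live state with a copy of the snapshot, so the
// snapshot itself stays reusable.
func (i *interpreter) restore(saved map[string]interface{}) {
	i.metadata = deepCopy(saved).(map[string]interface{})
}

func main() {
	in := &interpreter{metadata: map[string]interface{}{
		"devices": map[string]interface{}{"/hypothetical/dev0": "hash0"},
	}}
	saved := in.save()

	// Mutating a nested map after the snapshot must not leak into it.
	in.metadata["devices"].(map[string]interface{})["/hypothetical/dev1"] = "hash1"

	in.restore(saved)
	fmt.Println("devices after restore:", len(in.metadata["devices"].(map[string]interface{})))
}
```

A shallow copy of the top-level map would share the nested `devices` map with the live state, so the later mutation would silently alter the snapshot too.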

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pkg/securitypolicy/securitypolicyenforcer_rego.go | Adds mutex-guarded metadata rollback transaction wrapper for the Rego enforcer |
| pkg/securitypolicy/securitypolicyenforcer.go | Extends enforcer interface with WithMetadataRollback and implements it for open/closed door enforcers |
| pkg/securitypolicy/regopolicy_linux_test.go | Adds property tests covering rollback behavior across mount/create flows |
| pkg/securitypolicy/rego_utils_test.go | Refactors test helpers to support mounting with deterministic container IDs |
| internal/regopolicyinterpreter/regopolicyinterpreter.go | Adds metadata deep copy + save/restore APIs; replaces hard-coded "metadata" with constants |
| internal/regopolicyinterpreter/regopolicyinterpreter_test.go | Adds tests for metadata deep copy and save/restore correctness |
| internal/guest/storage/scsi/scsi.go | Ensures dm-crypt device is cleaned up on encrypted mount failures |
| internal/guest/storage/scsi/scsi_test.go | Extends encryption mount failure test to validate dm-crypt cleanup |
| internal/guest/storage/overlay/overlay.go | Updates comment to match current responsibility of MountLayer |
| internal/guest/storage/mount.go | Logs when UnmountPath is called on a non-existent path |
| internal/guest/runtime/hcsv2/uvm.go | Adds “inconsistent UVM” state gating + wraps device/mount operations in metadata rollback transactions |
Comments suppressed due to low confidence (1)

internal/guest/storage/mount.go:134

  • Logging a warning on every UnmountPath call for a non-existent path can lead to noisy logs (and potential log amplification if the caller retries/removes idempotently). Consider lowering to Debug level, rate-limiting, or only warning when removeTarget is true and the missing path is unexpected for the call site.
	if _, err := osStat(target); err != nil {
		if os.IsNotExist(err) {
			log.G(ctx).WithField("target", target).Warnf("UnmountPath called for non-existent path")
			return nil
		}
		return errors.Wrapf(err, "failed to determine if path '%s' exists", target)
	}


Comment on lines +1201 to +1222
func (policy *regoEnforcer) WithMetadataRollback(fn func() error) error {
	if !policy.transactionLock.TryLock() {
		return errors.New("nested or concurrent policy transactions are not supported")
	}
	defer policy.transactionLock.Unlock()

	saved, err := policy.rego.SaveMetadata()
	if err != nil {
		return errors.Wrap(err, "failed to snapshot policy metadata")
	}

	err = fn()
	if err != nil {
		if restoreErr := policy.rego.RestoreMetadata(saved); restoreErr != nil {
			panic(fmt.Sprintf("failed to rollback policy metadata: %v (caused by error: %v)", restoreErr, err))
		}
		log.G(context.Background()).WithError(err).Warn("rolled back policy metadata due to error")
		return err
	}

	return nil
}

Member Author

We can't restore metadata on panic because we don't know whether it's fully correct to do so: side effects might have already happened.

Comment on lines +1214 to +1216
		if restoreErr := policy.rego.RestoreMetadata(saved); restoreErr != nil {
			panic(fmt.Sprintf("failed to rollback policy metadata: %v (caused by error: %v)", restoreErr, err))
		}

Member Author

this is what we want

Comment on lines 128 to 132
	GetUserInfo(spec *oci.Process, rootPath string) (IDName, []IDName, string, error)
	EnforceVerifiedCIMsPolicy(ctx context.Context, containerID string, layerHashes []string, mountedCim []string) (err error)
	EnforceRegistryChangesPolicy(ctx context.Context, containerID string, registryValues interface{}) error
	WithMetadataRollback(fn func() error) error
}

Member Author

not applicable

Comment thread internal/regopolicyinterpreter/regopolicyinterpreter_test.go
Comment thread internal/regopolicyinterpreter/regopolicyinterpreter_test.go
Comment thread internal/guest/runtime/hcsv2/uvm.go
@anmaxvl anmaxvl merged commit 3f1758f into microsoft:main Jun 12, 2026
31 of 33 checks passed
@micromaomao micromaomao deleted the fix-metadata-desync branch June 12, 2026 13:19

6 participants