rego: Introduce metadata rollback and enhance handling of mount failures in C-LCOW by micromaomao · Pull Request #2768 · microsoft/hcsshim

micromaomao · 2026-06-09T16:40:05Z

Introduce a metadata rollback mechanism WithTransaction(). This fixes the Rego "metadata desync" bug.

This is extracted from #2559 at the request of @anmaxvl.

For mounts, a small piece of cleanup code is added to ensure we close down the
dm-crypt device if we fail to mount it. Aside from this, it is relatively
straightforward - if we get a failure, the cleanup functions will remove the
directory, remove any dm-devices, and we will revert the Rego metadata.
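
The cleanup-on-failure flow described above can be sketched as a list of steps, each paired with an undo function that runs in reverse order when a later step fails. This is a minimal illustration only: the `step` type, `runWithCleanup`, and the step names are assumptions made for the example, not the hcsshim implementation, and the Rego metadata revert (which the caller performs) is not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// step is one stage of a hypothetical mount pipeline: do performs the stage,
// undo reverses it. The names are illustrative, not the hcsshim API.
type step struct {
	name string
	do   func() error
	undo func()
}

// runWithCleanup runs steps in order. If one fails, the undo functions of the
// steps that already succeeded run in reverse order and the error is returned.
func runWithCleanup(steps []step) error {
	var done []step
	for _, s := range steps {
		if err := s.do(); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				done[i].undo()
			}
			return fmt.Errorf("%s: %w", s.name, err)
		}
		done = append(done, s)
	}
	return nil
}

// simulateFailedMount models an encrypted mount whose final mount step fails
// and returns the order in which cleanup ran.
func simulateFailedMount() ([]string, error) {
	var undone []string
	record := func(name string) func() {
		return func() { undone = append(undone, name) }
	}
	ok := func() error { return nil }
	steps := []step{
		{"create mount directory", ok, record("remove mount directory")},
		{"open dm-crypt device", ok, record("close dm-crypt device")},
		{"mount filesystem", func() error { return errors.New("bad superblock") }, record("unmount")},
	}
	err := runWithCleanup(steps)
	return undone, err
}

func main() {
	undone, err := simulateFailedMount()
	fmt.Println("error:", err)
	fmt.Println("cleanup order:", undone)
}
```

In this sketch the failed step's own undo never runs, so the dm-crypt device is closed before the directory is removed, and nothing tries to unmount a filesystem that was never mounted.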

For unmounts, more care is needed: if the directory has been unmounted
successfully (or even partially), and we then get an error, we cannot just
revert the policy state, as this may allow the host to use a broken / empty
mount as one of the image layers. See 615c9a0bdf's commit message for more
detailed thoughts.

The solution I opted for is, for non-trivial unmount failure cases (i.e. not
policy denial, not invalid mountpoint), if it fails, then we will block all
further mount, unmount, container creation and deletion attempts. I think this
is OK since we really do not expect unmounts to fail - this is especially the
case for us since the only writable disk mount we have is the shared scratch
disk, which we do not unmount at all unless we're about to kill the UVM.
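
One way to picture the blocking behaviour described above is a small latch that records the first non-trivial unmount failure and rejects every later operation. The sketch below is an assumption-laden illustration: `gate`, `errPolicyDenied`, and `errDeviceBusy` are hypothetical names, and the real change applies the same check to mount, container creation, and deletion as well.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPolicyDenied stands in for the "trivial" failures (policy denial,
// invalid mountpoint) that leave the guest untouched. Hypothetical name.
var errPolicyDenied = errors.New("policy denied")

// errDeviceBusy stands in for a non-trivial unmount failure. Hypothetical name.
var errDeviceBusy = errors.New("device busy")

// gate remembers the first non-trivial unmount failure and rejects later
// operations. The type and method names are illustrative.
type gate struct {
	mu     sync.Mutex
	broken error
}

// check returns an error if an earlier unmount left the guest inconsistent.
func (g *gate) check() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.broken != nil {
		return fmt.Errorf("guest mounts are in an inconsistent state: %w", g.broken)
	}
	return nil
}

// unmount runs do behind the gate and latches any non-trivial failure.
func (g *gate) unmount(do func() error) error {
	if err := g.check(); err != nil {
		return err
	}
	err := do()
	if err != nil && !errors.Is(err, errPolicyDenied) {
		g.mu.Lock()
		g.broken = err
		g.mu.Unlock()
	}
	return err
}

func main() {
	g := &gate{}
	fmt.Println(g.unmount(func() error { return errPolicyDenied }))
	fmt.Println("after denial:", g.check())
	fmt.Println(g.unmount(func() error { return errDeviceBusy }))
	fmt.Println("after failure:", g.check())
}
```

A policy denial passes through without latching because nothing on the guest changed, whereas any other failure leaves the gate closed for the lifetime of the process.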

Testing

The "Rollback policy state on mount errors" commit message has some instructions
for making a deliberately corrupted VHD (one that still passes the verifyinfo
extraction) that will cause a mount error. The other way we could make mount /
unmount fail, and thus test this change, is by mounting a tmpfs or ro bind in
relevant places:

To make unmount fail:

mkdir /run/gcs/c/.../rootfs/a && mount -t tmpfs none /run/gcs/c/.../rootfs/a

or

mkdir /run/gcs/mounts/scsi/m1/a && mount -t tmpfs none /run/gcs/mounts/scsi/m1/a

To make mount fail:

mount -o ro --bind /run/mounts/scsi /run/mounts/scsi

or

mount --bind -o ro /run/gcs/c /run/gcs/c

Once failure is triggered, one can make them work again by umounting the tmpfs
or ro bind.

What about other operations?

This PR covers mount and unmount of disks, overlays and 9p. Aside from setting
metadata.matches as part of the narrowing scheme, and setting metadata.started
to prevent re-using a container ID, Rego does not use persistent state for any
other operations. It's not clear whether reverting the state would be
semantically correct there (we would need to carefully consider exactly what
the side effects are of, say, failing to execute a process, start a container,
or send a signal), and adding the revert to those operations would not
currently affect much behaviour, so I've opted not to apply the metadata revert
to those for now.

Allow unrecoverable_error.go to build on Windows and fix IsSNP() invocation

IsSNP() can now return an error, although this is not expected on LCOW.

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

[cherry-picked from 421b12249544a334e36df33dc4846673b2a88279] This set of changes fixes the [Metadata Desync with UVM State](https://msazure.visualstudio.com/One/_workitems/edit/33232631/) bug, by reverting the Rego policy state on mount and some types of unmount failures. Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Addresses the following comments: microsoft#2762 (comment) microsoft#2762 (comment) microsoft#2762 (comment) microsoft#2762 (comment) microsoft#2762 (comment) microsoft#2762 (comment) Assisted-by: GitHub Copilot:claude-opus-4.7 copilot-review Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Assisted-by: GitHub Copilot:auto copilot-review Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

helsaawy

minor feed back, lgtm overall

Suggested-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com> Assisted-by: GitHub Copilot:claude-opus-4.8 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

- Rename WithTransaction to WithMetadataRollback for clarity
- Move unrecoverable error handling from policy layer to GCS layer to fix inverted dependency (pkg/securitypolicy -> internal/gcs)
- Replace UnrecoverableError with two-level UVM error state:
  - Inconsistent (unmount failure): blocks mount/unmount/container create/delete and policy injection, allows shutdown/signal/exec and diagnostics for cleanup and troubleshooting
  - Unrecoverable (policy metadata rollback failure): blocks all operations except exec and diagnostics, GCS self-terminates via panic after 1 hour grace period
- Remove unused internal/gcs/unrecoverable_error.go
- Add withPolicyRollback helper on Host to centralize sentinel detection and UVM state management

Assisted-by: GitHub Copilot claude-opus-4.8
Signed-off-by: Maksim An <maksiman@microsoft.com>

KenGordon

"Unrecoverable (policy metadata rollback failure): blocks all
operations except exec and diagnostics, GCS self-terminates
via panic after 1 hour grace period"

In the real world you could never exec into the container due to policy. Thus this 1 hour of indeterminate state is just an exploit waiting to happen. In the event of a Fatal condition all we can do is while true; and wait for the host side to time out and kill the UVM.

We will have been in some indeterminate "should never happen" state, so there is no reasoning we can do about what is safe or not.

anmaxvl · 2026-06-10T16:40:16Z

"Unrecoverable (policy metadata rollback failure): blocks all operations except exec and diagnostics, GCS self-terminates via panic after 1 hour grace period"

In the real world you could never exec into the container due to policy. Thus this 1 hour of indeterminate state is just an exploit waiting to happen. In the event of a Fatal condition all we can do is while true; and wait for the host side to time out and kill the UVM.

We will have been in some indeterminate "should never happen" state, so there is no reasoning we can do about what is safe or not.

@KenGordon In that case we should just panic or exit gcs/gcs-sidecar process upon hitting an unrecoverable error. For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.

micromaomao · 2026-06-10T16:40:15Z

+	// Unrecoverable: set when the policy metadata rollback itself fails
+	// (ErrFatalPolicyDesync). The policy state is corrupted beyond repair.
+	// Blocks all operations except exec and diagnostics. The GCS process
+	// will self-terminate after a grace period.


I agree with Ken that a 1 hour grace period is not useful. When there is an "unrecoverable" error, we can't expect the GCS or the security policy enforcer to continue to function correctly. Originally I used panic when e.g. policy rollback fails, and I would like any alternative solution to be as "impactful". This turns a panic into a soft state that users of the enforcer have to remember to check.

The cleanest solution would be to ensure that GCS can safely panic, but a for {} loop as @KenGordon suggested is also acceptable and practical. This also has the advantage that, if there is already an existing exec session when GCS hangs like this, that existing session (which does not depend on GCS responding anymore) can be used to debug things, if that is useful.

A goroutine hitting UnrecoverableError would go to sleep for 1 hour, which is way longer than the bridge timeout, which the shim will hit and tear down the bridge connection, I believe. Any subsequent request would then result in "bridge timeout" errors. I half "fixed" that by making sure that GCS actually dies after some time.

A goroutine hitting UnrecoverableError would go to sleep for 1 hour, which is way longer than the bridge timeout, which the shim will hit and tear down the bridge connection

It was a 1h sleep in an infinite loop, but I guess it's true that the bridge will time out, in which case disregard the comment about the ability to debug a hung GCS, as it won't last very long.

I half "fixed" that by making sure that gcs actually dies after some time.

Actually I just realized that this means that we still eventually panic, so any concerns we have regarding GCS exiting still apply. If we aren't willing to panic right away, we probably also don't want to panic in an hour. So I would either go for a direct panic or stay with for { sleep }.

Should I start getting all insistent? The problem here is we KNOW the UVM is in a BAD state. What else can we SAFELY do other than sit in a loop? We can't fall out the bottom or we might leave the UVM in a strange state (maybe a restarted GCS). (In Linux I am sure that would be OK, as the init process will die, but in Windows we are a service.) If we wait for an hour: 1) can we call the time code? 2) are we responding on the bridge? How can we trust our bridge code? All bets are off.

micromaomao · 2026-06-10T16:44:37Z

For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.

To be precise, does it result in a CRITICAL_PROCESS_DIED bugcheck?

KenGordon · 2026-06-10T16:58:14Z

For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.

To be precise, does it result in a CRITICAL_PROCESS_DIED bugcheck?

If we are certain of that we might panic() instead of while true;

After discussion with Maksim, we decided to keep the `for { sleep } ` loop approach for now until the gcs-sidecar can be made a critical process and we're confident that taking GCS down will take down the UVM, rather than e.g. restart the GCS. This reverts commit eefab95. Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Co-authored-by: Maksim An <maksiman@microsoft.com> Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

ContainerPlatform pushed back against using an infinite loop due to concerns with IcMs. The plan is to simply panic now, then in a future PR make the gcs-sidecar a critical process Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

micromaomao · 2026-06-11T14:02:48Z

For Windows, gcs-sidecar is marked as a critical process and it will bring down the UVM as well.

To be precise, does it result in a CRITICAL_PROCESS_DIED bugcheck?

If we are certain of that we might panic() instead of while true;

The result of some experiments done yesterday was that we're not certain of this. As discussed in standup, I've made this PR panic. Maksim already has a commit that makes the gcs-sidecar a critical process, but he prefers to merge it separately.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a metadata-transaction/rollback mechanism for the Rego security policy interpreter and integrates it into UVM mount/unmount flows to keep policy state consistent during failures.

Changes:

- Add WithMetadataRollback to the security policy enforcer interface and implement it for the Rego enforcer.
- Add SaveMetadata/RestoreMetadata support to the Rego policy interpreter plus tests for deep-copying and save/restore behavior.
- Update guest mount/unmount paths to roll back policy metadata on failures, add dm-crypt cleanup on SCSI mount failure, and introduce a “UVM inconsistent” state gate after certain unmount failures.
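
To illustrate why the save/restore support needs a deep copy, here is a minimal sketch assuming the metadata is a nested `map[string]interface{}`. The `store` type, its `save`/`restore` methods, and the `devices` key are hypothetical stand-ins for this example; the actual `SaveMetadata`/`RestoreMetadata` signatures in the interpreter may differ.

```go
package main

import "fmt"

// deepCopy clones nested maps and slices so later writes to the original do
// not leak into the copy. Other values are returned as-is.
func deepCopy(v interface{}) interface{} {
	switch t := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(t))
		for k, e := range t {
			out[k] = deepCopy(e)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(t))
		for i, e := range t {
			out[i] = deepCopy(e)
		}
		return out
	default:
		return t
	}
}

// store is a toy stand-in for the interpreter's metadata; save and restore
// are hypothetical counterparts of SaveMetadata and RestoreMetadata.
type store struct {
	metadata map[string]interface{}
}

// save returns a snapshot that shares no maps or slices with live metadata.
func (s *store) save() map[string]interface{} {
	return deepCopy(s.metadata).(map[string]interface{})
}

// restore replaces live metadata with a fresh copy of the snapshot, so the
// snapshot itself stays reusable.
func (s *store) restore(saved map[string]interface{}) {
	s.metadata = deepCopy(saved).(map[string]interface{})
}

func main() {
	s := &store{metadata: map[string]interface{}{
		"devices": map[string]interface{}{"/dev/sda": "hash-a"},
	}}
	saved := s.save()

	// Mutate the live metadata after taking the snapshot.
	s.metadata["devices"].(map[string]interface{})["/dev/sdb"] = "hash-b"
	fmt.Println("snapshot entries:", len(saved["devices"].(map[string]interface{})))

	s.restore(saved)
	fmt.Println("restored entries:", len(s.metadata["devices"].(map[string]interface{})))
}
```

A shallow copy of the top-level map would still share the inner `devices` map with the live state, so the mutation would silently appear in the snapshot and the restore would revert nothing.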

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| pkg/securitypolicy/securitypolicyenforcer_rego.go | Adds mutex-guarded metadata rollback transaction wrapper for the Rego enforcer |
| pkg/securitypolicy/securitypolicyenforcer.go | Extends enforcer interface with `WithMetadataRollback` and implements it for open/closed door enforcers |
| pkg/securitypolicy/regopolicy_linux_test.go | Adds property tests covering rollback behavior across mount/create flows |
| pkg/securitypolicy/rego_utils_test.go | Refactors test helpers to support mounting with deterministic container IDs |
| internal/regopolicyinterpreter/regopolicyinterpreter.go | Adds metadata deep copy + save/restore APIs; replaces hard-coded `"metadata"` with constants |
| internal/regopolicyinterpreter/regopolicyinterpreter_test.go | Adds tests for metadata deep copy and save/restore correctness |
| internal/guest/storage/scsi/scsi.go | Ensures dm-crypt device is cleaned up on encrypted mount failures |
| internal/guest/storage/scsi/scsi_test.go | Extends encryption mount failure test to validate dm-crypt cleanup |
| internal/guest/storage/overlay/overlay.go | Updates comment to match current responsibility of `MountLayer` |
| internal/guest/storage/mount.go | Logs when `UnmountPath` is called on a non-existent path |
| internal/guest/runtime/hcsv2/uvm.go | Adds “inconsistent UVM” state gating + wraps device/mount operations in metadata rollback transactions |

Comments suppressed due to low confidence (1)

internal/guest/storage/mount.go:134

Logging a warning on every UnmountPath call for a non-existent path can lead to noisy logs (and potential log amplification if the caller retries/removes idempotently). Consider lowering to Debug level, rate-limiting, or only warning when removeTarget is true and the missing path is unexpected for the call site.

	if _, err := osStat(target); err != nil {
		if os.IsNotExist(err) {
			log.G(ctx).WithField("target", target).Warnf("UnmountPath called for non-existent path")
			return nil
		}
		return errors.Wrapf(err, "failed to determine if path '%s' exists", target)
	}

micromaomao · 2026-06-11T14:08:39Z

+func (policy *regoEnforcer) WithMetadataRollback(fn func() error) error {
+	if !policy.transactionLock.TryLock() {
+		return errors.New("nested or concurrent policy transactions are not supported")
+	}
+	defer policy.transactionLock.Unlock()
+
+	saved, err := policy.rego.SaveMetadata()
+	if err != nil {
+		return errors.Wrap(err, "failed to snapshot policy metadata")
+	}
+
+	err = fn()
+	if err != nil {
+		if restoreErr := policy.rego.RestoreMetadata(saved); restoreErr != nil {
+			panic(fmt.Sprintf("failed to rollback policy metadata: %v (caused by error: %v)", restoreErr, err))
+		}
+		log.G(context.Background()).WithError(err).Warn("rolled back policy metadata due to error")
+		return err
+	}
+
+	return nil
+}


We can't restore metadata on panic because we don't know whether it's fully correct to do so: side effects might have already happened.
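
The behaviour of the wrapper quoted above can be tried out in isolation with a toy enforcer. This is only a sketch under stated assumptions: `toyEnforcer`, `withRollback`, `errNested`, and `errMountFailed` are hypothetical names, the state is a single counter rather than Rego metadata, and the restore step cannot fail here, so the panic path of the real code is not modelled.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// toyEnforcer is a hypothetical stand-in for the Rego enforcer: its whole
// state is one counter, so a snapshot is just a copy of the integer.
type toyEnforcer struct {
	txLock sync.Mutex
	mounts int
}

var (
	errNested      = errors.New("nested or concurrent policy transactions are not supported")
	errMountFailed = errors.New("mount failed")
)

// withRollback mirrors the shape of WithMetadataRollback: snapshot, run fn,
// restore the snapshot if fn fails. TryLock (Go 1.18+) makes a nested call
// fail fast instead of deadlocking.
func (e *toyEnforcer) withRollback(fn func() error) error {
	if !e.txLock.TryLock() {
		return errNested
	}
	defer e.txLock.Unlock()

	saved := e.mounts
	if err := fn(); err != nil {
		e.mounts = saved
		return err
	}
	return nil
}

func main() {
	e := &toyEnforcer{}

	// The policy check records the mount, then the real mount fails.
	err := e.withRollback(func() error {
		e.mounts++
		return errMountFailed
	})
	fmt.Println(err, e.mounts)

	// A nested transaction is rejected; returning that error from the outer
	// closure rolls the outer transaction back as well.
	err = e.withRollback(func() error {
		e.mounts++
		return e.withRollback(func() error { return nil })
	})
	fmt.Println(err, e.mounts)
}
```

Because the snapshot is only restored when `fn` returns an error, a panic inside `fn` leaves the state as it was at the moment of the panic, which matches the reasoning in the comment above.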

micromaomao · 2026-06-11T14:05:39Z

+		if restoreErr := policy.rego.RestoreMetadata(saved); restoreErr != nil {
+			panic(fmt.Sprintf("failed to rollback policy metadata: %v (caused by error: %v)", restoreErr, err))
+		}


this is what we want

micromaomao · 2026-06-11T14:06:02Z

 	GetUserInfo(spec *oci.Process, rootPath string) (IDName, []IDName, string, error)
 	EnforceVerifiedCIMsPolicy(ctx context.Context, containerID string, layerHashes []string, mountedCim []string) (err error)
 	EnforceRegistryChangesPolicy(ctx context.Context, containerID string, registryValues interface{}) error
+	WithMetadataRollback(fn func() error) error
 }


not applicable

micromaomao added 6 commits June 9, 2026 15:21

Address some comments from Maksim

36c9cd7

Assisted-by: GitHub Copilot:auto copilot-review Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Address mountsBroken refactor suggestion from Maksim

0bcb77b

Assisted-by: GitHub Copilot:auto copilot-review Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Remove unused function inRevertableSection

09dca88

Assisted-by: GitHub Copilot:auto copilot-review Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Implement revertable section to closure-based transaction refactor suggestion from Maksim

67fabcd

Rebase done with Copilot without much manual review. Assisted-by: GitHub Copilot:claude-opus-4.8 copilot-review Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

micromaomao requested a review from a team as a code owner June 9, 2026 16:40

micromaomao requested a review from anmaxvl June 9, 2026 16:40

anmaxvl self-assigned this Jun 9, 2026

msscotb assigned helsaawy Jun 9, 2026

helsaawy reviewed Jun 9, 2026

Comment thread internal/guest/runtime/hcsv2/uvm.go Outdated

micromaomao and others added 2 commits June 10, 2026 01:39

uvmConsistencyError: change cause from string to error

1981697

Suggested-by: Hamza El-Saawy <hamzaelsaawy@microsoft.com> Assisted-by: GitHub Copilot:claude-opus-4.8 Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

anmaxvl approved these changes Jun 10, 2026

KenGordon reviewed Jun 10, 2026

micromaomao commented Jun 10, 2026

micromaomao and others added 4 commits June 10, 2026 22:51

securitypolicyenforcer: Rename WithTransaction to WithMetadataRollback

b962ce3

Co-authored-by: Maksim An <maksiman@microsoft.com> Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

Fix lint

5d8920e

Signed-off-by: Tingmao Wang <tingmaowang@microsoft.com>

micromaomao requested a review from Copilot June 11, 2026 14:01

Copilot AI reviewed Jun 11, 2026

micromaomao mentioned this pull request Jun 11, 2026

Merge various fixes for C-LCOW since conf-aci/0.2.5 (part 3) #2559

Open

anmaxvl approved these changes Jun 12, 2026

anmaxvl merged commit 3f1758f into microsoft:main Jun 12, 2026
31 of 33 checks passed

micromaomao deleted the fix-metadata-desync branch June 12, 2026 13:19

6 participants
