feat(dstack-util): mix gcp vTPM AK cert into instance_id #726

Merged
kvinwang merged 1 commit into master from feat/gcp-instance-id-binding on Jun 11, 2026

kvinwang commented Jun 11, 2026

Problem

instance_id is derived from instance_id_seed, which is persisted on the data disk:

instance_id = sha256(instance_id_seed || app_id)[..20]

On GCP a VM can be cloned from a disk image or snapshot (managed instance groups, image-based cloning, etc.). Every clone inherits the same instance_id_seed and therefore computes the same instance_id, so multiple running VMs can share one identity.

Fix

On GCP, mix the public key of the pre-provisioned vTPM Attestation Key into the instance_id:

instance_id = sha256(instance_id_seed || app_id || sha256(ak_pub_area))[..20]   # GCP only
instance_id = sha256(instance_id_seed || app_id)[..20]                          # other platforms (unchanged)
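
The two formulas can be sketched in Rust as follows. This is an illustration, not the code in dstack-util: the function name `derive_instance_id`, its signature, and the `digest` parameter are assumptions. The digest is passed in as a closure so the sketch needs no external crate; in real use it would be SHA-256 (for example from the `sha2` crate).

```rust
/// Length of the truncated identifier, matching the `[..20]` in the formulas.
const INSTANCE_ID_LEN: usize = 20;

/// Derive an instance id from the seed, the app id, and an optional
/// platform binding (assumed to be the already-hashed AK public area).
///
/// `digest` stands in for SHA-256; it is a parameter only to keep this
/// sketch dependency-free.
fn derive_instance_id<D>(
    digest: D,
    instance_id_seed: &[u8],
    app_id: &[u8],
    platform_binding: Option<&[u8; 32]>,
) -> [u8; INSTANCE_ID_LEN]
where
    D: Fn(&[u8]) -> [u8; 32],
{
    // Concatenate seed || app_id, then append the binding if one is present.
    let mut input = Vec::with_capacity(instance_id_seed.len() + app_id.len() + 32);
    input.extend_from_slice(instance_id_seed);
    input.extend_from_slice(app_id);
    if let Some(binding) = platform_binding {
        input.extend_from_slice(binding);
    }

    // Hash the concatenation and keep the first 20 bytes.
    let full = digest(&input);
    let mut id = [0u8; INSTANCE_ID_LEN];
    id.copy_from_slice(&full[..INSTANCE_ID_LEN]);
    id
}
```

Passing `None` as the binding reproduces the unchanged derivation used on other platforms; passing `Some(..)` with the hash of the AK public area gives the GCP variant.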

The AK is derived deterministically from the per-instance Endorsement seed held in the vTPM — not on the data disk — so it is stable across reboot/stop-start but fresh on a disk clone, which is exactly the property needed to keep instance_id unique per running VM. Reuses the existing tpm-attest GCP AK load path (prefers ECC, falls back to RSA).

Why hash the AK public area, not the AK certificate: a certificate carries serial / validity / signature bytes that can change on re-issuance for the same key, which would shift instance_id without a clone. The public area depends only on the key. (Observed AK cert validity is ~30 years from instance creation, so re-issuance is unlikely in practice — hashing the pubkey removes the dependency entirely.)
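
To make the re-issuance argument concrete, here is a small sketch comparing the two choices. The `AkCert` struct is hypothetical and heavily simplified (a real certificate is X.509, not a plain struct), and `digest` again stands in for SHA-256.

```rust
/// Hypothetical, simplified certificate holding only the fields the text names.
struct AkCert {
    serial: u64,
    not_after: u64,
    signature: Vec<u8>,
    ak_pub_area: Vec<u8>,
}

/// Binding over the whole certificate: shifts whenever any field changes,
/// including a re-issuance for the same key.
fn binding_from_cert(digest: impl Fn(&[u8]) -> [u8; 32], cert: &AkCert) -> [u8; 32] {
    let mut bytes = Vec::new();
    bytes.extend_from_slice(&cert.serial.to_be_bytes());
    bytes.extend_from_slice(&cert.not_after.to_be_bytes());
    bytes.extend_from_slice(&cert.signature);
    bytes.extend_from_slice(&cert.ak_pub_area);
    digest(&bytes)
}

/// Binding over the public area only: depends on the key alone.
fn binding_from_pub_area(digest: impl Fn(&[u8]) -> [u8; 32], cert: &AkCert) -> [u8; 32] {
    digest(&cert.ak_pub_area)
}
```

With two `AkCert` values that share `ak_pub_area` but differ in serial, validity, and signature, `binding_from_pub_area` returns the same value for both, while `binding_from_cert` generally does not.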

  • Fails closed: if GCP is detected but the AK can't be loaded, it errors rather than silently falling back to the duplication-prone seed-only id.
  • Other platforms unaffected (platform_instance_binding() returns None).
  • tpm-attest: exposes the AK public area on LoadedAk (previously discarded as _public).
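
The fail-closed rule and the ECC-then-RSA preference can be sketched as below. Every name here (`Platform`, `VtpmAks`, `load_ak_pub_area`, `instance_binding`) is hypothetical and does not reflect the actual dstack-util or tpm-attest signatures; the real `platform_instance_binding()` may be shaped differently. `digest` stands in for SHA-256.

```rust
#[derive(Clone, Copy)]
enum Platform {
    Gcp,
    Other,
}

/// Hypothetical view of the AK public areas a vTPM can supply.
struct VtpmAks {
    ecc_pub_area: Option<Vec<u8>>,
    rsa_pub_area: Option<Vec<u8>>,
}

/// Prefer the ECC AK, fall back to RSA, and report an error if neither loads.
fn load_ak_pub_area(aks: &VtpmAks) -> Result<&[u8], String> {
    aks.ecc_pub_area
        .as_deref()
        .or(aks.rsa_pub_area.as_deref())
        .ok_or_else(|| "neither ECC nor RSA AK could be loaded".to_string())
}

/// Returns `Ok(None)` on platforms without a binding, `Ok(Some(..))` on GCP,
/// and an error on GCP when the AK is unavailable (no silent fallback).
fn instance_binding(
    platform: Platform,
    aks: &VtpmAks,
    digest: impl Fn(&[u8]) -> [u8; 32],
) -> Result<Option<[u8; 32]>, String> {
    match platform {
        Platform::Other => Ok(None),
        Platform::Gcp => {
            // `?` propagates the load failure instead of degrading to `None`.
            let pub_area = load_ak_pub_area(aks)?;
            Ok(Some(digest(pub_area)))
        }
    }
}
```

Keeping "no binding on this platform" (`Ok(None)`) distinct from "binding expected but unavailable" (`Err`) is what prevents a GCP instance from quietly computing the seed-only id.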

Validation

Tested on real c3-standard-4 --confidential-compute-type=TDX VMs (confirmed /dev/tdx_guest):

| Scenario | AK | instance_id |
|---|---|---|
| reboot | unchanged | stable |
| stop → start | unchanged | stable |
| clone from disk image | differs | diverges (dedup works) |

Notes / follow-ups

  • This is dedup-grade, not anti-malicious-host: the AK is not signature-verified or bound into the TDX quote here. A malicious host could still equivocate. Verifying the Google signature chain + binding to the quote would be a separate follow-up (intentionally out of scope).
  • Existing GCP instances will see their instance_id (and RTMR3 measurement) change once after upgrade.

Copilot AI review requested due to automatic review settings June 11, 2026 08:39

Copilot AI left a comment

Pull request overview

This PR updates how dstack-util derives instance_id on GCP to avoid identity collisions when VMs are cloned from disk images/snapshots (which duplicate the persisted instance_id_seed). It does so by mixing a per-instance value read from the GCP vTPM (the AK certificate) into the instance_id derivation, while leaving other platforms unchanged.

Changes:

  • Add platform_instance_binding() that, on GCP, reads the vTPM AK certificate from NV (ECC first, then RSA) and contributes sha256(cert) as a per-instance binding value.
  • Extend instance_id derivation to include the platform binding when available (GCP), preserving the previous seed-only derivation on other platforms.

Comment thread dstack-util/src/system_setup.rs Outdated
Comment on lines +717 to +721
/// On GCP we use the pre-provisioned vTPM Attestation Key certificate: it lives in
/// the vTPM NV store (not on the data disk), so a VM cloned from a disk image gets
/// a fresh vTPM with a different AK cert, while a reboot of the same VM keeps it
/// stable — exactly the property we need. The cert is also signed by Google, so the
/// host cannot trivially forge a duplicate.
instance_id is derived from instance_id_seed, which is persisted on the
data disk. On GCP a VM can be cloned from a disk image / snapshot, so
every clone inherits the same seed and thus the same instance_id,
letting multiple running VMs share one identity.

On GCP, mix the public key of the pre-provisioned vTPM Attestation Key
into the instance_id. The AK is derived deterministically from the
per-instance Endorsement seed held in the vTPM (not on the data disk),
so it is stable across reboot/stop-start but fresh on a disk clone.

We hash the AK public area rather than its certificate so the binding is
immune to certificate re-issuance: a re-signed cert carries new serial/
validity/signature bytes for the same key, which would otherwise change
instance_id without a clone. (Observed cert validity is ~30 years from
instance creation, so re-issuance is unlikely, but the pubkey removes the
dependency entirely.)

tpm-attest: expose the AK public area on LoadedAk (previously discarded).

Verified on real c3-standard-4 TDX confidential VMs:
- reboot: AK unchanged
- stop/start: AK unchanged
- clone from disk image: AK differs

Fails closed: if GCP is detected but the AK cannot be loaded, error
instead of silently falling back to the seed-only id. Other platforms
are unaffected.
kvinwang force-pushed the feat/gcp-instance-id-binding branch from ff28c37 to f1ba0a2 on June 11, 2026 08:51
kvinwang merged commit 56151e3 into master on Jun 11, 2026
15 checks passed