Post-deploy: CycleCloud is already configured

What the cloud-init bootstrap does, how to verify it finished, and how to log into the web UI.

The cloud-init bootstrap in scripts/cloud-config.yaml.tftpl runs the full CycleCloud install end-to-end — there is no web wizard to click through. This is Phase 1 of the project (see known-gaps.md for Phase 2 — cluster automation).

terraform apply does not return until the bootstrap is complete. The null_resource.cyclecloud_ready in terraform/cyclecloud.tf polls the VM every 20 s via az vm run-command invoke; it succeeds once the in-VM sentinel /var/lib/cc-bootstrap.done appears and fails fast if /var/lib/cc-bootstrap.failed shows up instead. Expect apply to take roughly 10–15 minutes from start to finish.
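
The polling logic described above can be sketched as a loop that runs on the VM itself. This is only an illustration: the real null_resource drives its check from outside through az vm run-command invoke, and its script is not reproduced here. The function name wait_for_bootstrap, the attempt cap, and the exit codes are assumptions.

```shell
# Hypothetical helper (not part of the repo): poll for the bootstrap sentinels.
# Usage: wait_for_bootstrap [done_file] [failed_file] [interval_s] [max_attempts]
# Returns 0 on success, 1 if the failure sentinel appears, 2 on timeout.
wait_for_bootstrap() {
  local done_file="${1:-/var/lib/cc-bootstrap.done}"
  local failed_file="${2:-/var/lib/cc-bootstrap.failed}"
  local interval="${3:-20}"       # seconds between checks, as in the text
  local max_attempts="${4:-60}"   # assumed cap: 60 x 20 s = 20 minutes
  local attempt=1

  while [ "$attempt" -le "$max_attempts" ]; do
    # Check the failure sentinel first so a failed run is reported immediately.
    if [ -e "$failed_file" ]; then
      echo "bootstrap failed (found $failed_file)" >&2
      return 1
    fi
    if [ -e "$done_file" ]; then
      echo "bootstrap complete (found $done_file)"
      return 0
    fi
    sleep "$interval"
    attempt=$((attempt + 1))
  done

  echo "timed out after $max_attempts attempts" >&2
  return 2
}
```

Called with no arguments on the VM, it waits on the two sentinel paths named above.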

The bootstrap stages (stages 1 to 3 run before cc-bootstrap.sh; stages 4 to 9 are in scripts/cc-bootstrap.sh.tftpl):

  1. cyclecloud8 package install (cloud-config runcmd)
  2. await_startup (CycleCloud web app comes online)
  3. CycleCloud CLI install from /opt/cycle_server/tools/cyclecloud-cli.zip
  4. cc-bootstrap.sh starts here — managed-identity login + Key Vault fetch of admin password and SSH public key
  5. Drop account_data.json into /opt/cycle_server/config/data/ (skips the site name / EULA / admin-account wizard; CycleCloud renames the file to *.imported once processed)
  6. Poll /ui/metadata for HTTP 200 — await_startup returns before the REST layer is ready
  7. cyclecloud initialize --batch --url=http://localhost:8080/ ... (HTTP loopback — CC8's package install doesn't open 8443 until a TLS keystore is configured; see known-gaps.md)
  8. cyclecloud account create -f azure_data.json (registers the subscription using AzureRMUseManagedIdentity: true, with the locker storage account and cyclecloud container already provisioned by Terraform)
  9. Sentinel /var/lib/cc-bootstrap.done is written; on-disk secrets are shredded
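
Stage 6 can be sketched as a small curl loop. The function name wait_for_http_200, the retry limits, and the full URL (http://localhost:8080/ui/metadata, which combines the loopback address from stage 7 with the path from stage 6) are assumptions for illustration, not taken from cc-bootstrap.sh.tftpl.

```shell
# Hypothetical sketch of stage 6: poll a URL until it answers HTTP 200.
# Usage: wait_for_http_200 URL [max_attempts] [interval_s]
wait_for_http_200() {
  local url="$1"
  local max_attempts="${2:-60}"
  local interval="${3:-5}"
  local attempt=1
  local code

  while [ "$attempt" -le "$max_attempts" ]; do
    # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
    # curl prints 000 when it cannot connect; "|| true" keeps the loop going.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url" || true)
    if [ "$code" = "200" ]; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Example (assumed URL):
#   wait_for_http_200 http://localhost:8080/ui/metadata || echo "REST layer not ready" >&2
```

Checking the status code rather than curl's exit status matters here, because the web app may accept connections and still answer with an error until the REST layer is up.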

Verifying the bootstrap finished

In most cases the terraform apply exit code is all you need. To inspect the VM directly, SSH in (see ssh-key.md) and run:

ls /var/lib/cc-bootstrap.done                                 # sentinel exists on success
sudo tail -n 80 /var/log/cc-bootstrap.log                     # timestamped per-stage log
sudo cloud-init status --wait                                 # blocks until cloud-init is done
ls /opt/cycle_server/config/data/account_data.json.imported   # should exist
sudo -u cyclecloudadmin /usr/local/bin/cyclecloud locker list # should list the configured account

The script is idempotent. If the bootstrap failed, fix the cause, delete both sentinel files, and re-invoke it:

sudo rm -f /var/lib/cc-bootstrap.done /var/lib/cc-bootstrap.failed
sudo /usr/local/sbin/cc-bootstrap.sh
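
The sentinel convention that makes this re-run safe can be sketched as a wrapper that records the outcome of a command in a done or failed file. This is a hedged illustration of the pattern, not the contents of cc-bootstrap.sh; the function name run_with_sentinels and the skip-if-done behavior are assumptions.

```shell
# Hypothetical sketch of the done/failed sentinel pattern.
# Usage: run_with_sentinels DONE_FILE FAILED_FILE COMMAND [ARGS...]
run_with_sentinels() {
  local done_file="$1"
  local failed_file="$2"
  local rc
  shift 2

  # Assumed behavior: a previous successful run makes this a no-op.
  if [ -e "$done_file" ]; then
    echo "already complete, skipping"
    return 0
  fi

  # Clear a stale failure marker before trying again.
  rm -f "$failed_file"

  if "$@"; then
    : > "$done_file"
  else
    rc=$?               # exit status of the command that just failed
    : > "$failed_file"
    return "$rc"
  fi
}
```

With this shape, deleting the done sentinel is what forces the work to run again, which matches the re-run instructions above.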

Boot diagnostics are also enabled on the VM (Azure-managed storage); the serial console and screenshot are available in the portal under Help → Boot diagnostics on the VM blade.

Logging into the web UI

The admin username is var.vm_admin_username (default cyclecloudadmin). The password was generated by Terraform and stored write-only in Key Vault:

cd terraform
az keyvault secret show \
  --vault-name "$(terraform output -raw key_vault_name)" \
  --name      "$(terraform output -raw cyclecloud_admin_password_secret_name)" \
  --query value -o tsv

Open the web UI per the mode you deployed in (see access-modes.md) and sign in with those credentials. The subscription should already be listed under Settings → Subscriptions — if it is, the bootstrap finished cleanly and you can go straight to creating a cluster.

If Settings → Subscriptions is empty, the cyclecloud account create step failed (usually a transient RBAC propagation race). Inspect /var/log/cc-bootstrap.log on the VM, then re-run the whole bootstrap: sudo rm -f /var/lib/cc-bootstrap.{done,failed} && sudo /usr/local/sbin/cc-bootstrap.sh.
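
Because the usual cause is a transient RBAC propagation race, one way to harden a step like this is to retry it with a growing delay. The helper below is a generic sketch under that assumption; the name retry_with_backoff and the attempt and delay values are hypothetical, and nothing here implies that cc-bootstrap.sh already retries this way.

```shell
# Hypothetical helper: retry a command, doubling the delay after each failure.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_S COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts="$1"
  local delay="$2"
  local attempt=1
  shift 2

  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Example (inside the bootstrap, while azure_data.json still exists):
#   retry_with_backoff 5 15 cyclecloud account create -f azure_data.json
```

The example values of 5 attempts starting at 15 s would wait up to 15 + 30 + 60 + 120 s between tries; tune them to how long role assignments take to propagate in your tenant.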

Mounting the NFS shares (sched + shared)

terraform/files.tf provisions a Premium FileStorage account with two NFSv4.1 shares: sched (downstream Slurm scheduler state) and shared (cluster-wide shared data). Both are reachable over port 2049 from any NIC in the server or cluster subnets via the file private endpoint. Each share is provisioned at 100 GiB, the minimum quota for Premium shares (the dev intent was ~10 GiB each, but Azure rejects anything below the minimum).

NFSv4.1 on Azure Files does not use account keys or SAS — access is gated by network reachability (PE + VNet) and POSIX permissions inside the share once mounted. There is nothing to fetch from Key Vault for this.
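
Since access depends on network reachability, a quick TCP check on port 2049 can tell a routing or DNS problem apart from a mount-option problem before you try to mount. A minimal sketch, assuming bash and coreutils timeout are available on the node; the function name nfs_port_open is hypothetical.

```shell
# Hypothetical helper: succeed if a TCP connection to HOST:PORT opens in time.
# Usage: nfs_port_open HOST [PORT] [TIMEOUT_S]
# Uses bash's /dev/tcp pseudo-device, so it invokes bash explicitly.
nfs_port_open() {
  local host="$1"
  local port="${2:-2049}"
  local timeout_s="${3:-3}"
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Example, with SA set to the storage account name:
#   nfs_port_open "${SA}.file.core.windows.net" || echo "port 2049 unreachable" >&2
```

A successful connection only shows that the private endpoint answers on 2049; POSIX permissions inside the share are still checked after mounting.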

From the CycleCloud VM (or any future cluster node in the cluster subnet):

sudo apt-get install -y nfs-common

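# terraform output needs the Terraform working directory; if that is not on this
# machine, run this line on your workstation and set SA here by hand.
SA=$(cd terraform && terraform output -raw nfs_storage_account)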
sudo mkdir -p /sched /shared
sudo mount -t nfs -o vers=4,minorversion=1,sec=sys "${SA}.file.core.windows.net:/${SA}/sched"  /sched
sudo mount -t nfs -o vers=4,minorversion=1,sec=sys "${SA}.file.core.windows.net:/${SA}/shared" /shared

terraform output nfs_shares prints the exact mount_fqdn / mount_point for each share if you'd rather copy-paste than interpolate.

For persistent mounts on cluster nodes, add the equivalent two entries to /etc/fstab with _netdev,nofail so that boot does not hang if the private endpoint is briefly unreachable:

<sa>.file.core.windows.net:/<sa>/sched   /sched   nfs   vers=4,minorversion=1,sec=sys,_netdev,nofail   0 0
<sa>.file.core.windows.net:/<sa>/shared  /shared  nfs   vers=4,minorversion=1,sec=sys,_netdev,nofail   0 0
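
If you render these entries from a script, it helps to make the append idempotent so repeated runs do not duplicate lines. The sketch below assumes the storage account name is already known; the function names nfs_fstab_line and ensure_fstab_entry are hypothetical, and the generated line uses single spaces instead of the aligned columns shown above.

```shell
# Hypothetical helpers for rendering the fstab entries shown above.

# Print one fstab entry for share SHARE on storage account SA.
nfs_fstab_line() {
  local sa="$1"
  local share="$2"
  printf '%s.file.core.windows.net:/%s/%s /%s nfs vers=4,minorversion=1,sec=sys,_netdev,nofail 0 0\n' \
    "$sa" "$sa" "$share" "$share"
}

# Append LINE to FSTAB only if an identical line is not already present.
ensure_fstab_entry() {
  local fstab="$1"
  local line="$2"
  # -x: whole-line match, -F: fixed string, -q: quiet.
  grep -qxF -- "$line" "$fstab" 2>/dev/null || printf '%s\n' "$line" >> "$fstab"
}

# Example (run as root, with SA set to the storage account name):
#   for share in sched shared; do
#     ensure_fstab_entry /etc/fstab "$(nfs_fstab_line "$SA" "$share")"
#   done
```

Try it against a scratch file first, then point it at /etc/fstab once the output looks right.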