a3mega/deepseek recipe: deployment failure for both vLLM and SGLang #3

@salander0411

I am following this recipe: https://github.com/AI-Hypercomputer/gpu-recipes/blob/main/inference/a3mega/deepseek-r1-671b/sglang-serving-gke/README.md

I have already enabled the Cloud Storage FUSE CSI driver. I created the node pool with:

gcloud container node-pools create a3-mega \
  --location=${ZONE} \
  --num-nodes=2 \
  --machine-type=a3-megagpu-8g \
  --accelerator=type=nvidia-h100-mega-80gb,count=8,gpu-driver-version=LATEST \
  --placement-type=COMPACT \
  --cluster=${CLUSTER_NAME} \
  --spot

The deployment fails: the containers are stopped shortly after starting, and the readiness probe on port 30000 is refused. Pod events from `kubectl describe pod`:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m45s                  default-scheduler  Successfully assigned default/tiangel-serving-deepseek-r1-model-0 to gke-tiangel-cluster-1-a3-mega-91214fa0-kz3w
  Normal   Pulling    6m45s                  kubelet            Pulling image "us-central1-artifactregistry.gcr.io/gke-release/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761"
  Normal   Pulled     6m44s                  kubelet            Successfully pulled image "us-central1-artifactregistry.gcr.io/gke-release/gke-release/gcs-fuse-csi-driver-sidecar-mounter:v1.8.3-gke.2@sha256:07a5a7b18b083c47031c540e1664eb0c777a50e523dde030d8b0effdc9bb8761" in 604ms (604ms including waiting). Image size: 31687282 bytes.
  Normal   Created    6m44s                  kubelet            Created container: gke-gcsfuse-sidecar
  Normal   Started    6m44s                  kubelet            Started container gke-gcsfuse-sidecar
  Normal   Pulling    6m44s                  kubelet            Pulling image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1"
  Normal   Pulled     5m29s                  kubelet            Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/nccl-plugin-gpudirecttcpx-dev:v1.0.8-1" in 1m14.607s (1m14.607s including waiting). Image size: 7599167601 bytes.
  Normal   Created    5m29s                  kubelet            Created container: nccl-plugin-installer
  Normal   Started    5m29s                  kubelet            Started container nccl-plugin-installer
  Normal   Pulled     4m39s                  kubelet            Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14" in 46.581s (46.581s including waiting). Image size: 5243380426 bytes.
  Normal   Pulling    4m39s                  kubelet            Pulling image "us-central1-docker.pkg.dev/gpu-launchpad-playground/wwoo-docker/sglang:v0.4.3.post2-cu125-srt"
  Normal   Pulled     2m57s                  kubelet            Successfully pulled image "us-central1-docker.pkg.dev/gpu-launchpad-playground/wwoo-docker/sglang:v0.4.3.post2-cu125-srt" in 1m41.712s (1m41.712s including waiting). Image size: 11753203350 bytes.
  Normal   Created    2m57s                  kubelet            Created container: sglang-leader
  Normal   Started    2m57s                  kubelet            Started container sglang-leader
  Normal   Pulling    2m56s (x2 over 5m26s)  kubelet            Pulling image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14"
  Normal   Created    2m56s (x2 over 4m39s)  kubelet            Created container: tcpxo-daemon
  Normal   Started    2m56s (x2 over 4m39s)  kubelet            Started container tcpxo-daemon
  Normal   Pulled     2m56s                  kubelet            Successfully pulled image "us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.14" in 682ms (682ms including waiting). Image size: 5243380426 bytes.
  Normal   Killing    2m53s                  kubelet            Stopping container gke-gcsfuse-sidecar
  Normal   Killing    2m53s                  kubelet            Stopping container tcpxo-daemon
  Normal   Killing    2m53s                  kubelet            Stopping container sglang-leader
  Warning  Unhealthy  2m35s                  kubelet            Readiness probe failed: dial tcp 10.84.3.5:30000: connect: connection refused
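
The events above show the three containers being stopped at the same moment, but not why. A minimal sketch of commands that might collect more detail is below. The pod, node, namespace, and container names are copied from the events; the function name `collect_pod_diagnostics` is hypothetical, and `--previous` only returns logs if the kubelet still retains the stopped container instances.

```shell
#!/usr/bin/env bash
# Names copied from the Events output above; adjust for your cluster.
NAMESPACE="default"
POD="tiangel-serving-deepseek-r1-model-0"
NODE="gke-tiangel-cluster-1-a3-mega-91214fa0-kz3w"
CONTAINERS="sglang-leader tcpxo-daemon gke-gcsfuse-sidecar"

# Hypothetical helper: gathers logs and termination state for the failed pod.
collect_pod_diagnostics() {
  # Logs from the container instances that were stopped, if still retained.
  for c in $CONTAINERS; do
    echo "=== previous logs: $c ==="
    kubectl logs "$POD" -n "$NAMESPACE" -c "$c" --previous --tail=200 || true
  done

  # Last termination reason and exit code for each container.
  kubectl get pod "$POD" -n "$NAMESPACE" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}' || true

  # Node conditions and events. The node pool uses --spot, so this is one
  # place to check whether the node was preempted or replaced (an assumption
  # worth ruling out, not a confirmed cause).
  kubectl describe node "$NODE" || true
}
```

Running `collect_pod_diagnostics > diag.txt 2>&1` would save everything to one file that can be attached to the issue.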
