| 1 | +--- |
| 2 | +title: DeepSeek V4 |
| 3 | +description: Deploying DeepSeek-V4-Pro using SGLang on NVIDIA B200:8 |
| 4 | +--- |
| 5 | + |
| 6 | +# DeepSeek V4 |
| 7 | + |
| 8 | +This example shows how to deploy `deepseek-ai/DeepSeek-V4-Pro` as a |
| 9 | +[service](https://dstack.ai/docs/services) using |
| 10 | +[SGLang](https://github.com/sgl-project/sglang) and `dstack`. |
| 11 | + |
| 12 | +## Apply a configuration |
| 13 | + |
| 14 | +Save the following configuration as `deepseek-v4.dstack.yml`. |
| 15 | + |
| 16 | +<div editor-title="deepseek-v4.dstack.yml"> |
| 17 | + |
| 18 | +```yaml |
| 19 | +type: service |
| 20 | +name: deepseek-v4 |
| 21 | + |
| 22 | +image: lmsysorg/sglang:deepseek-v4-blackwell |
| 23 | + |
| 24 | +env: |
| 25 | +  - HF_TOKEN |
| 26 | +  - SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256 |
| 27 | +  - SGLANG_JIT_DEEPGEMM_PRECOMPILE=0 |
| 28 | + |
| 29 | +commands: |
| 30 | +  - | |
| 31 | +    sglang serve \ |
| 32 | +      --trust-remote-code \ |
| 33 | +      --model-path deepseek-ai/DeepSeek-V4-Pro \ |
| 34 | +      --tp 8 \ |
| 35 | +      --dp 8 \ |
| 36 | +      --enable-dp-attention \ |
| 37 | +      --moe-a2a-backend deepep \ |
| 38 | +      --mem-fraction-static 0.82 \ |
| 39 | +      --cuda-graph-max-bs 64 \ |
| 40 | +      --max-running-requests 256 \ |
| 41 | +      --deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' \ |
| 42 | +      --tool-call-parser deepseekv4 \ |
| 43 | +      --reasoning-parser deepseek-v4 \ |
| 44 | +      --host 0.0.0.0 \ |
| 45 | +      --port 30000 |
| 46 | + |
| 47 | +port: 30000 |
| 48 | +model: deepseek-ai/DeepSeek-V4-Pro |
| 49 | + |
| 50 | +resources: |
| 51 | +  gpu: B200:8 |
| 52 | +  shm_size: 32GB |
| 53 | +  disk: 2TB.. |
| 54 | +``` |
| 55 | + |
| 56 | +</div> |
| 57 | + |
| 58 | +This configuration follows the single-node Blackwell recipe for |
| 59 | +`DeepSeek-V4-Pro` on `8 x NVIDIA B200`. |
| 60 | + |
| 61 | +Export your Hugging Face token and apply the configuration with |
| 62 | +[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md). |
| 63 | + |
| 64 | +<div class="termy"> |
| 65 | + |
| 66 | +```shell |
| 67 | +$ export HF_TOKEN=<your-hf-token> |
| 68 | +$ dstack apply -f deepseek-v4.dstack.yml |
| 69 | +``` |
| 70 | + |
| 71 | +</div> |
| 72 | + |
| 73 | +If no gateway is created, the service endpoint will be available at |
| 74 | +`<dstack server URL>/proxy/services/<project name>/<run name>/`. |
| 75 | + |
| 76 | +<div class="termy"> |
| 77 | + |
| 78 | +```shell |
| 79 | +curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \ |
| 80 | + -X POST \ |
| 81 | + -H 'Authorization: Bearer <dstack token>' \ |
| 82 | + -H 'Content-Type: application/json' \ |
| 83 | + -d '{ |
| 84 | + "model": "deepseek-ai/DeepSeek-V4-Pro", |
| 85 | + "messages": [ |
| 86 | + { |
| 87 | + "role": "user", |
| 88 | + "content": "What is 15% of 240? Reply with just the number." |
| 89 | + } |
| 90 | + ], |
| 91 | + "temperature": 0, |
| 92 | + "max_tokens": 32 |
| 93 | + }' |
| 94 | +``` |
| 95 | + |
| 96 | +</div> |
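
The same request can be built from Python. The following is a minimal sketch using only the standard library; the helper names `service_url` and `build_chat_request` are hypothetical and not part of `dstack` or SGLang, and the server URL, project, and run name simply mirror the `curl` example above.

```python
import json
import urllib.request


def service_url(server_url: str, project: str, run_name: str) -> str:
    """Build the proxy endpoint used when the service runs without a gateway."""
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/"


def build_chat_request(base_url: str, token: str, prompt: str) -> urllib.request.Request:
    """Prepare (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 32,
    }
    return urllib.request.Request(
        base_url + "v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    base = service_url("http://127.0.0.1:3000", "main", "deepseek-v4")
    request = build_chat_request(
        base, "<dstack token>", "What is 15% of 240? Reply with just the number."
    )
    # Sending requires a running service, for example:
    #   with urllib.request.urlopen(request) as resp:
    #       print(json.load(resp))
    print(request.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before the service is up.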
| 97 | + |
| 98 | +## Reasoning mode |
| 99 | + |
| 100 | +To separate the model's reasoning into `reasoning_content`, keep |
| 101 | +`--reasoning-parser deepseek-v4` in the server command and send |
| 102 | +`chat_template_kwargs` in the request body. |
| 103 | + |
| 104 | +For raw HTTP requests, `chat_template_kwargs` and `separate_reasoning` must be |
| 105 | +top-level JSON fields. |
| 106 | + |
| 107 | +<div class="termy"> |
| 108 | + |
| 109 | +```shell |
| 110 | +curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \ |
| 111 | + -X POST \ |
| 112 | + -H 'Authorization: Bearer <dstack token>' \ |
| 113 | + -H 'Content-Type: application/json' \ |
| 114 | + -d '{ |
| 115 | + "model": "deepseek-ai/DeepSeek-V4-Pro", |
| 116 | + "messages": [ |
| 117 | + { |
| 118 | + "role": "user", |
| 119 | + "content": "Solve step by step: If 3x + 5 = 20, what is x?" |
| 120 | + } |
| 121 | + ], |
| 122 | + "temperature": 0, |
| 123 | + "max_tokens": 256, |
| 124 | + "chat_template_kwargs": { |
| 125 | + "thinking": true |
| 126 | + }, |
| 127 | + "separate_reasoning": true |
| 128 | + }' |
| 129 | +``` |
| 130 | + |
| 131 | +</div> |
| 132 | + |
| 133 | +This returns both: |
| 134 | + |
| 135 | +- `reasoning_content`: a separate reasoning trace |
| 136 | +- `content`: the final user-visible answer |
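
As a rough illustration of working with the two fields, the sketch below builds the reasoning-mode request body and pulls both values out of a parsed response. It assumes the usual OpenAI-compatible layout in which both fields sit under `choices[0].message`; the helper names and the sample response are hypothetical, not captured output from the model.

```python
def reasoning_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Request body with reasoning enabled; both extra keys are top-level fields."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"thinking": True},
        "separate_reasoning": True,
    }


def split_reasoning(response: dict) -> tuple:
    """Return (reasoning_content, content) from a parsed chat completion.

    Assumes both fields live under choices[0].message; either may be absent.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


if __name__ == "__main__":
    # Hypothetical response, shaped like an OpenAI-compatible chat completion.
    sample = {
        "choices": [
            {
                "message": {
                    "role": "assistant",
                    "reasoning_content": "3x = 15, so x = 5.",
                    "content": "x = 5",
                }
            }
        ]
    }
    reasoning, answer = split_reasoning(sample)
    print("reasoning:", reasoning)
    print("answer:", answer)
```

Using `.get` means that a response without a reasoning trace yields `None` for that field instead of raising.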
| 137 | + |
| 138 | +## Deployment notes |
| 139 | + |
| 140 | +- Use `lmsysorg/sglang:deepseek-v4-blackwell` for `B200:8`. |
| 141 | +- The first startup can take several minutes while the model loads and SGLang |
| 142 | + finishes CUDA graph capture. |
| 143 | +- On container backends such as Vast.ai, avoid `instance_path` cache volumes in |
| 144 | + this service config. |
| 145 | +- The endpoint is OpenAI-compatible and served on port `30000`. |
| 146 | + |
| 147 | +## What's next? |
| 148 | + |
| 149 | +1. Read the [DeepSeek-V4-Pro model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro) |
| 150 | +2. Read the [DeepSeek-V4 SGLang cookbook](https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4) |
| 151 | +3. Browse the dedicated [SGLang](https://dstack.ai/examples/inference/sglang/) and [vLLM](https://dstack.ai/examples/inference/vllm/) examples |