Commit 970c864

Author: Andrey Cheptsov
Add DeepSeek V4 model docs
1 parent: f4d9513

5 files changed

Lines changed: 168 additions & 2 deletions

docs/examples.md

Lines changed: 11 additions & 0 deletions
@@ -188,6 +188,17 @@ hide:
188188
## Models
189189

190190
<div class="tx-landing__highlights_grid">
191+
<a href="/examples/models/deepseek-v4"
192+
class="feature-cell">
193+
<h3>
194+
DeepSeek V4
195+
</h3>
196+
197+
<p>
198+
Deploy DeepSeek V4 with SGLang on B200:8
199+
</p>
200+
</a>
201+
191202
<a href="/examples/models/qwen36"
192203
class="feature-cell">
193204
<h3>

docs/examples/models/deepseek-v4/index.md

Whitespace-only changes.

examples/inference/sglang/README.md

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,9 @@ description: Deploying Qwen3.6-27B using SGLang on NVIDIA and AMD GPUs
88
This example shows how to deploy `Qwen/Qwen3.6-27B` using
99
[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
1010

11+
> For a `DeepSeek-V4-Pro` deployment on `B200:8`, see the
12+
[DeepSeek V4](../../models/deepseek-v4/index.md) model page.
13+
1114
## Apply a configuration
1215

1316
Here's an example of a service that deploys
Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
1+
---
2+
title: DeepSeek V4
3+
description: Deploying DeepSeek-V4-Pro using SGLang on NVIDIA B200:8
4+
---
5+
6+
# DeepSeek V4
7+
8+
This example shows how to deploy `deepseek-ai/DeepSeek-V4-Pro` as a
9+
[service](https://dstack.ai/docs/services) using
10+
[SGLang](https://github.com/sgl-project/sglang) and `dstack`.
11+
12+
## Apply a configuration
13+
14+
Save the following configuration as `deepseek-v4.dstack.yml`.
15+
16+
<div editor-title="deepseek-v4.dstack.yml">
17+
18+
```yaml
19+
type: service
20+
name: deepseek-v4
21+
22+
image: lmsysorg/sglang:deepseek-v4-blackwell
23+
24+
env:
25+
- HF_TOKEN
26+
- SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=256
27+
- SGLANG_JIT_DEEPGEMM_PRECOMPILE=0
28+
29+
commands:
30+
- |
31+
sglang serve \
32+
--trust-remote-code \
33+
--model-path deepseek-ai/DeepSeek-V4-Pro \
34+
--tp 8 \
35+
--dp 8 \
36+
--enable-dp-attention \
37+
--moe-a2a-backend deepep \
38+
--mem-fraction-static 0.82 \
39+
--cuda-graph-max-bs 64 \
40+
--max-running-requests 256 \
41+
--deepep-config '{"normal_dispatch":{"num_sms":96},"normal_combine":{"num_sms":96}}' \
42+
--tool-call-parser deepseekv4 \
43+
--reasoning-parser deepseek-v4 \
44+
--host 0.0.0.0 \
45+
--port 30000
46+
47+
port: 30000
48+
model: deepseek-ai/DeepSeek-V4-Pro
49+
50+
resources:
51+
gpu: B200:8
52+
shm_size: 32GB
53+
disk: 2TB..
54+
```
55+
56+
</div>
57+
58+
This configuration follows the single-node Blackwell recipe for
59+
`DeepSeek-V4-Pro` on `8 x NVIDIA B200`.
60+
61+
Export your Hugging Face token and apply the configuration with
62+
[`dstack apply`](https://dstack.ai/docs/reference/cli/dstack/apply.md).
63+
64+
<div class="termy">
65+
66+
```shell
67+
$ export HF_TOKEN=<your-hf-token>
68+
$ dstack apply -f deepseek-v4.dstack.yml
69+
```
70+
71+
</div>
72+
73+
If no gateway is created, the service endpoint will be available at
74+
`<dstack server URL>/proxy/services/<project name>/<run name>/`.
75+
76+
<div class="termy">
77+
78+
```shell
79+
curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \
80+
-X POST \
81+
-H 'Authorization: Bearer &lt;dstack token&gt;' \
82+
-H 'Content-Type: application/json' \
83+
-d '{
84+
"model": "deepseek-ai/DeepSeek-V4-Pro",
85+
"messages": [
86+
{
87+
"role": "user",
88+
"content": "What is 15% of 240? Reply with just the number."
89+
}
90+
],
91+
"temperature": 0,
92+
"max_tokens": 32
93+
}'
94+
```
95+
96+
</div>
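
The same request can also be sent from Python. The following is a minimal sketch that uses only the standard library; the helper names `service_url` and `chat` are hypothetical and not part of `dstack` or SGLang. It assumes the endpoint layout described above and an OpenAI-style response in which the reply sits at `choices[0].message.content`.

```python
import json
import urllib.request


def service_url(server_url: str, project: str, run_name: str) -> str:
    """Build <dstack server URL>/proxy/services/<project name>/<run name>/."""
    return f"{server_url.rstrip('/')}/proxy/services/{project}/{run_name}/"


def chat(base_url: str, token: str, prompt: str, max_tokens: int = 32) -> str:
    """Send one chat completion request and return the assistant's reply."""
    body = {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    request = urllib.request.Request(
        base_url + "v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    # Assumption: the response follows the OpenAI chat completions shape.
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]
```

With the values from the curl example, `service_url("http://127.0.0.1:3000", "main", "deepseek-v4")` gives the base URL to pass to `chat` together with your `dstack` token.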
97+
98+
## Reasoning mode
99+
100+
To separate the model's reasoning into `reasoning_content`, keep
101+
`--reasoning-parser deepseek-v4` in the server command and send
102+
`chat_template_kwargs` in the request body.
103+
104+
For raw HTTP requests, `chat_template_kwargs` and `separate_reasoning` must be
105+
top-level JSON fields.
106+
107+
<div class="termy">
108+
109+
```shell
110+
curl http://127.0.0.1:3000/proxy/services/main/deepseek-v4/v1/chat/completions \
111+
-X POST \
112+
-H 'Authorization: Bearer &lt;dstack token&gt;' \
113+
-H 'Content-Type: application/json' \
114+
-d '{
115+
"model": "deepseek-ai/DeepSeek-V4-Pro",
116+
"messages": [
117+
{
118+
"role": "user",
119+
"content": "Solve step by step: If 3x + 5 = 20, what is x?"
120+
}
121+
],
122+
"temperature": 0,
123+
"max_tokens": 256,
124+
"chat_template_kwargs": {
125+
"thinking": true
126+
},
127+
"separate_reasoning": true
128+
}'
129+
```
130+
131+
</div>
132+
133+
The response contains both:
134+
135+
- `reasoning_content`: a separate reasoning trace
136+
- `content`: the final user-visible answer
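
As an illustration, the request body and the two returned fields can be handled in Python as sketched below. The function names `reasoning_request` and `split_reply` are hypothetical, and the sketch assumes that both fields appear under `choices[0].message`, as in OpenAI-style responses.

```python
from typing import Any, Dict, Optional, Tuple


def reasoning_request(prompt: str, max_tokens: int = 256) -> Dict[str, Any]:
    """Build a request body with thinking enabled and reasoning separated."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
        # Both fields are top-level keys of the JSON body.
        "chat_template_kwargs": {"thinking": True},
        "separate_reasoning": True,
    }


def split_reply(response: Dict[str, Any]) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning_content, content) from a parsed response.

    Assumption: both fields live under choices[0].message; a missing
    field is returned as None.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")
```

Serialize the dictionary from `reasoning_request` with `json.dumps` and send it as the POST body, then pass the parsed JSON response to `split_reply`.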
137+
138+
## Deployment notes
139+
140+
- Use `lmsysorg/sglang:deepseek-v4-blackwell` for `B200:8`.
141+
- The first startup can take several minutes while the model loads and SGLang
142+
finishes CUDA graph capture.
143+
- On container backends such as Vast.ai, avoid `instance_path` cache volumes in
144+
this service config.
145+
- The endpoint is OpenAI-compatible and served on port `30000`.
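
Because the first startup can take several minutes, a client may want to wait before sending real traffic. The sketch below is one possible polling helper, not something `dstack` or SGLang provides; `wait_until_ready`, its parameters, and the default timings are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 1800.0,
    interval_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` until it returns True or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection errors (including urllib's URLError) are expected
            # while the server is still loading the model.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

A probe could, for example, send a small chat completion request and return `True` once it succeeds.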
146+
147+
## What's next?
148+
149+
1. Read the [DeepSeek-V4-Pro model card](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro)
150+
2. Read the [DeepSeek-V4 SGLang cookbook](https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4)
151+
3. Browse the dedicated [SGLang](https://dstack.ai/examples/inference/sglang/) and [vLLM](https://dstack.ai/examples/inference/vllm/) examples

mkdocs.yml

Lines changed: 3 additions & 2 deletions
@@ -306,12 +306,13 @@ nav:
306306
- vLLM: examples/inference/vllm/index.md
307307
- NIM: examples/inference/nim/index.md
308308
- TensorRT-LLM: examples/inference/trtllm/index.md
309+
- Models:
310+
- DeepSeek V4: examples/models/deepseek-v4/index.md
311+
- Qwen 3.6: examples/models/qwen36/index.md
309312
- Accelerators:
310313
- AMD: examples/accelerators/amd/index.md
311314
- TPU: examples/accelerators/tpu/index.md
312315
- Tenstorrent: examples/accelerators/tenstorrent/index.md
313-
- Models:
314-
- Qwen 3.6: examples/models/qwen36/index.md
315316
- Blog:
316317
- blog/index.md
317318
- Case studies: blog/case-studies.md
