[DeepSeek-V4] Implement model integration, decoders, and configuration stack#4153
parambole wants to merge 4 commits into
2a19018 to
23adce0
Compare
23adce0 to
6deaacc
Compare
entrpn left a comment:
Just one comment; everything else looks good.
Are you able to do a real run and check the profile to see whether the scan blocks are ordered as expected? A compile test won't be able to catch a runtime error.
This commit introduces full support for DeepSeek V4 by integrating its compressed attention mechanisms, MoE routing, and architectural layers.
Key changes:
- Add `deepseek4.yml` configuration and `DeepSeek4DecoderLayer` implementation.
- Implement hybrid Hash Routing and Token Routing for MoE layers.
- Add prefix/suffix layer unrolling for non-uniform compression blocks.
- Fix Pydantic validation for base MLP dimensions.
- Bypass MLA instantiation in favor of native CompressedAttention (CSA/HCA).
6deaacc to
5953a73
Compare
| base_mlp_dim: 2048 | ||
| base_moe_mlp_dim: 2048 | ||
| vocab_size: 129280 | ||
|
|
Add `head_dim: 512` here.
The rotary factor then follows as `partial_rotary_factor=self.config.qk_rope_head_dim / self.config.head_dim`.
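To make the suggested ratio concrete, here is a minimal sketch of the arithmetic. The helper name is hypothetical, `head_dim=512` is the value suggested above, and `qk_rope_head_dim=64` is only an assumed value for illustration, not one taken from the PR.

```python
def partial_rotary_factor(qk_rope_head_dim: int, head_dim: int) -> float:
    """Fraction of each head's dimensions that receive rotary embedding.

    Hypothetical helper; it mirrors the expression
    `self.config.qk_rope_head_dim / self.config.head_dim` from the review.
    """
    if head_dim <= 0:
        raise ValueError("head_dim must be positive")
    if not 0 < qk_rope_head_dim <= head_dim:
        raise ValueError("qk_rope_head_dim must be in (0, head_dim]")
    return qk_rope_head_dim / head_dim


# head_dim=512 comes from the review; qk_rope_head_dim=64 is an assumption.
factor = partial_rotary_factor(qk_rope_head_dim=64, head_dim=512)
rotary_dims = int(512 * factor)  # number of dimensions per head that are rotated
```

Deriving the factor from two config fields avoids keeping a third, redundant value in sync by hand.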
Thank you for the integration! Main suggestions:
- Update the scan logic.
  - Note `first_num_hash_layers=3` for the prefix layers, followed by [HCA-128, CSA-4] cycles. There is no suffix. This is true for both Flash and Pro.
- Check the two RoPE theta values for sliding_window and hca_or_csa.
  - hca_or_csa should use theta2, rather than a mix of theta1 and theta2.
- Add a unit test for train compile (example).
See the inline comments for details and other minor comments.
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # model config for DeepSeek V4 |
Which version is this? It looks like Flash: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash/blob/main/config.json
For the file name `deepseek4.yml`, perhaps use `deepseek4-284b.yml` instead?
- Other config files include the model size in the name: https://github.com/AI-Hypercomputer/maxtext/tree/main/src/maxtext/configs/models
- The table on the model card lists Flash as 284B and Pro as 1.6T: https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
| routed_score_func: "sqrtsoftplus" | ||
|
|
||
| # --- Attention configuration --- | ||
| attention: 'dot_product' |
We typically do not set `attention` in the model config. I suggest removing it here and adding a check in types.py instead. E.g.,
maxtext/src/maxtext/configs/types.py
Lines 2925 to 2926 in 61c225f
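As a rough illustration of what such a check could look like, here is a standalone sketch. The function name and error message are hypothetical, the exact rule (DeepSeek V4 requires `dot_product`) is an assumption inferred from the value set in the config under review, and the real check would live in the existing validation code in types.py rather than in a free function.

```python
def validate_deepseek4_attention(model_name: str, attention: str) -> None:
    """Reject unsupported attention kernels for DeepSeek V4 models.

    Hypothetical sketch: assumes DeepSeek V4 only supports 'dot_product',
    which is the value the config under review sets explicitly.
    """
    if model_name.startswith("deepseek4") and attention != "dot_product":
        raise ValueError(
            f"{model_name} requires attention='dot_product', got {attention!r}"
        )


# Passes silently for the supported combination.
validate_deepseek4_attention("deepseek4-284b", "dot_product")
```

With a check like this, the model config no longer needs to pin `attention`; an unsupported setting fails fast at config validation time instead.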
| shard_mode=self.config.shard_mode, | ||
| rngs=self.rngs, | ||
| ) | ||
| elif rope_type == "deepseek4": |
Change `rope_type == "deepseek4"` to
if self.config.model_name.startswith("deepseek4"):
- Currently, we only have three `rope_type` values:
  maxtext/src/maxtext/configs/base.yml
  Line 924 in 61c225f
  Recall that you set `rope_type: default` in deepseek4.yml. I notice you attempt to override the function argument with `rope_type=deepseek4`. This is confusing and unnecessary.
- Most models use the model name to differentiate model-specific RoPE, e.g.,
  maxtext/src/maxtext/layers/attentions.py
  Line 874 in 5953a73
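The dispatch being suggested can be sketched as follows. The function and the returned labels are hypothetical stand-ins for the actual embedding classes; the point is only that the model name selects the DeepSeek V4 RoPE while the configured `rope_type` stays at `default`.

```python
def select_rope_variant(model_name: str, rope_type: str) -> str:
    """Pick a RoPE variant label from the model name first, then rope_type.

    Hypothetical sketch: the returned strings stand in for embedding classes.
    """
    if model_name.startswith("deepseek4"):
        # Model-specific RoPE, regardless of the configured rope_type.
        return "deepseek4"
    return rope_type


variant = select_rope_variant("deepseek4-284b", rope_type="default")
```

This keeps the set of valid `rope_type` values unchanged and removes the need to override a function argument.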
| use_bias_in_projections=use_bias_in_projections, | ||
| name=name, | ||
| rngs=rngs, | ||
| rope_type="deepseek4", |
nit: this can be removed; see the other comment.
|
|
||
| # Note: Layers (0,1) are not compressed. | ||
| # The 44th layer (MTP module with compress_ratio=0) has been explicitly dropped for now. | ||
| # This leaves exactly 43 layers: 2 prefix [0,0] + 40 scanned + 1 suffix [4]. |
I think it should be 3 prefix [0,0,4] + 40 scanned. Note that `first_num_hash_layers=3` corresponds to layers 0, 1, and 2.
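The proposed grouping can be written out as a short sketch. The helper name is hypothetical; the 43-layer total comes from the code comment under review, and the `(128, 4)` cycle order is taken from the "[HCA-128, CSA-4] cycles" note in the review summary.

```python
def build_compress_ratios(num_layers=43, prefix=(0, 0, 4), cycle=(128, 4)):
    """Return (per-layer compress ratios, number of scanned cycles).

    Hypothetical helper: prefix layers are unrolled, the rest is a repeated cycle.
    """
    scanned = num_layers - len(prefix)
    if scanned < 0 or scanned % len(cycle) != 0:
        raise ValueError("scanned layers must be a whole number of cycles")
    repeats = scanned // len(cycle)
    return list(prefix) + list(cycle) * repeats, repeats


ratios, repeats = build_compress_ratios()

# Grouping described in the PR comment: 2 prefix + 40 scanned + 1 suffix.
pr_grouping = [0, 0] + [4, 128] * 20 + [4]
```

Both groupings yield the same flat per-layer sequence; the difference is only where the scanned block starts, which is what determines whether a suffix is needed.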
| q_lora_rank: The rank for the LoRA projection in the compressed query. | ||
| compress_ratio: The compression ratio for the compressor. | ||
| """ | ||
| """Initializes the Compressed Attention module.""" |
The docstring says "an underlying HCA or CSA compressor based on the provided layer_type", but the code currently seems to branch on compress_ratio (0, =4, >4). I would prefer using layer_type instead.
`"""Initializes the Compressed Attention module."""` is a duplicate line and can be removed.
It might also be worth adding more to the docstring:
- Highlight shared KV, MQA, the three attention types (sliding window, HCA, CSA), and the different RoPE theta values.
- Follow the HF docstring as a model.
| elif rope_type == "deepseek4": | ||
| rotary_embedding = DeepSeekV4RotaryEmbedding( | ||
| head_dim=rope_embedding_dims, | ||
| partial_rotary_factor=self.partial_rotary_factor if self.partial_rotary_factor is not None else 1.0, |
Could you clarify when we use partial_rotary_factor vs. 1.0?

The usage of DeepSeekV4RotaryEmbedding with two theta values seems different from HF.
- HF:
  - sliding window: main_rope
  - CSA / HCA: compressed_rope
  - https://github.com/huggingface/transformers/blob/7ff490aca20f597f6c6e42a449c7c8dd28807c6b/src/transformers/models/deepseek_v4/modeling_deepseek_v4.py#L775-L777
- In this PR, there are two rotary embeddings:
  - `self.compress_rotary_embedding`, initialized here with `config.compressed_rope_max_timescale`
  - `self.rotary_embedding = init_rotary_embedding()` in `attention.py`, initialized via the base class that uses `config.rope_max_timescale`
    maxtext/src/maxtext/layers/attentions.py
    Lines 854 to 860 in 5953a73
- This PR:
  - sliding window: main_rope
  - CSA / HCA: compressed_rope + main_rope [inconsistent]
- The unit test for attention passes because you override `self.rotary_embedding` for CSA / HCA.
  - The test logic differs from how attention is actually applied in the model.
    maxtext/tests/unit/deepseek_v4_vs_reference_test.py
    Lines 551 to 554 in 5953a73
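The HF-style behavior described in this comment amounts to choosing exactly one theta per layer. Below is a minimal sketch under stated assumptions: the function names are hypothetical, the mapping from `compress_ratio` to layer kind (0, =4, >4) follows the earlier review comment, and the two theta values are placeholders, not the values from the actual config.

```python
def layer_kind(compress_ratio: int) -> str:
    """Map a compress ratio to a layer kind: 0 -> sliding window, 4 -> CSA, >4 -> HCA."""
    if compress_ratio == 0:
        return "sliding_window"
    if compress_ratio == 4:
        return "csa"
    if compress_ratio > 4:
        return "hca"
    raise ValueError(f"unsupported compress_ratio: {compress_ratio}")


def rope_theta_for_layer(compress_ratio, rope_max_timescale, compressed_rope_max_timescale):
    """Select a single theta per layer, never a mix of the two."""
    if layer_kind(compress_ratio) == "sliding_window":
        return rope_max_timescale  # main_rope
    return compressed_rope_max_timescale  # compressed_rope for CSA / HCA


MAIN_THETA = 10_000.0         # placeholder value, assumed for illustration
COMPRESSED_THETA = 160_000.0  # placeholder value, assumed for illustration
thetas = [rope_theta_for_layer(r, MAIN_THETA, COMPRESSED_THETA) for r in (0, 4, 128)]
```

Selecting the theta in one place also lets the unit test exercise the same code path as the model, instead of overriding `self.rotary_embedding` in the test.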
For RoPE, it is safer to test with a longer sequence length, e.g., seq_len=4096
maxtext/tests/unit/gpt_vs_reference_test.py
Line 705 in 61c225f
maxtext/tests/unit/yarn_vs_reference_test.py
Line 142 in 61c225f
| head_dim=rope_embedding_dims, | ||
| partial_rotary_factor=self.partial_rotary_factor if self.partial_rotary_factor is not None else 1.0, | ||
| rope_theta=self.rope_max_timescale, | ||
| dtype=self.dtype, |
Could you make DeepSeekV4RotaryEmbedding inherit from RotaryEmbedding, to be consistent with the other RoPE classes?
`dtype=self.dtype` should perhaps be `fprop_dtype  # The dtype of the output`?
maxtext/src/maxtext/layers/embeddings.py
Line 1806 in 61c225f
maxtext/src/maxtext/layers/embeddings.py
Lines 272 to 287 in 61c225f
…en deepseek-v4 architecture
5eb3336 to
0b1a9a5
Compare
Description
This PR introduces native architectural and routing support for the DeepSeek V4 model in MaxText.
Why & What: DeepSeek V4 introduces non-uniform architectural features that require explicit configuration unrolling. This PR solves the integration by implementing:
`[0, 0]` prefix compression ratios, the perfectly alternating `[4, 128]` scanned middle layers, and the `[4, 0]` suffix layers.
Tests
tests/unit/deepseek_v4_vs_reference_test.py.
`v5p-512` mesh to guarantee memory constraints and HLO generation.
Compile Command to Reproduce:
Proof of Compilation:
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.