DeepSeek-V3-Lite naming conventions?

#76
by AlphaGaO - opened

Hello, I am currently working on a pruned version of DeepSeek-V3.

The methodology involves layer-wise routed-expert pruning and distillation, followed by post-training on the full model (see the sketch below).
I already tested the pipeline on DeepSeek-V2-Lite, bringing it from 64@6 experts down to 16@4, and it seems to give decent results.
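Roughly, the per-layer pruning step looks like this. This is only a minimal sketch: scoring experts by the accumulated routing probability they receive on calibration tokens is one common selection criterion (not necessarily the exact one in my pipeline), and the function and variable names are illustrative.

```python
# Sketch of layer-wise routed-expert pruning: score each expert in one MoE
# layer by how much routing mass it receives on calibration tokens, then
# keep the top `keep` experts. Names here are illustrative, not DeepSeek's API.
import torch

def score_experts(router_logits: torch.Tensor, active: int) -> torch.Tensor:
    """Accumulate the routing probability each expert receives.

    router_logits: (num_tokens, num_experts) raw gate outputs for one layer.
    active: number of experts routed per token (the Y in X@Y).
    """
    probs = torch.softmax(router_logits, dim=-1)   # (tokens, experts)
    topk = probs.topk(active, dim=-1)              # routed experts per token
    scores = torch.zeros(router_logits.size(-1))
    scores.scatter_add_(0, topk.indices.flatten(), topk.values.flatten())
    return scores

def prune_layer(router_logits: torch.Tensor, keep: int, active: int) -> torch.Tensor:
    """Indices of the `keep` most-used experts, e.g. keep=16, active=4 for 16@4."""
    scores = score_experts(router_logits, active)
    return scores.topk(keep).indices.sort().values

# Stand-in calibration data: 256 experts, 8 active, pruned to the 16@4 target.
logits = torch.randn(10_000, 256)
kept = prune_layer(logits, keep=16, active=4)
print(kept)  # expert indices to retain in this layer before distillation
```

The kept experts' weights are then copied into the smaller layer, and distillation recovers the routing behaviour lost by dropping the rest.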

I just started running the same method on DeepSeek-V3 with the following pruning targets (X@Y = X routed experts with Y active per token; model sizes are total@activated parameters):
Base Model: 256@8 => DeepSeek-V3-671B@37B-full
22@6 => DeepSeek-V3-Lite-72B@31B-large
16@4 => DeepSeek-V3-Lite-57B@26B-medium
8@2 => DeepSeek-V3-Lite-36B@21B-small
4@1 => DeepSeek-V3-Lite-26B@19B-nano
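For reference, these targets map directly onto two fields in the released config.json (n_routed_experts and num_experts_per_tok). A minimal sketch of updating them for the 16@4 "medium" target, with a placeholder checkpoint path:

```python
# Point the config of a pruned checkpoint at the new expert counts.
# The field names match DeepSeek's released config.json; the path is a placeholder.
import json

path = "DeepSeek-V3-Lite-57B/config.json"

with open(path) as f:
    cfg = json.load(f)

cfg["n_routed_experts"] = 16    # was 256 in the base model
cfg["num_experts_per_tok"] = 4  # was 8 active per token

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```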

I'll upload them to Hugging Face when the pipeline finishes running (it should take about 3 days on my 2x3090 rig).

Do you authorize me to adopt the naming convention above for the uploads?

If the methodology gives good results, I'll apply it to R1 and R1-Zero as well.
