calcuis committed
Commit d2fc767 · verified · 1 parent: 52abff1

Update README.md

Files changed (1)
  1. README.md +3 -5
README.md CHANGED
@@ -23,10 +23,10 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 - base model: deepseek-ai/[DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)
 - tool used for quantization: [cutter](https://pypi.org/project/gguf-cutter)
 
-### appendices (by deepseek-ai)
-### DeepSeek-R1-Evaluation
-For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
+### appendices: model evaluation (written by deepseek-ai)
+
+#### DeepSeek-R1-Evaluation
+For all our (here referring to deepseek-ai) models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 64 responses per query to estimate pass@1.
 
 | Category | Benchmark (Metric) | Claude-3.5-Sonnet-1022 | GPT-4o 0513 | DeepSeek V3 | OpenAI o1-mini | OpenAI o1-1217 | DeepSeek R1 |
 |----------|-------------------|----------------------|------------|--------------|----------------|------------|--------------|
@@ -55,7 +55,6 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 | | C-Eval (EM) | 76.7 | 76.0 | 86.5 | 68.9 | - | **91.8** |
 | | C-SimpleQA (Correct) | 55.4 | 58.7 | **68.0** | 40.3 | - | 63.7 |
 
-
 ### Distilled Model Evaluation
 
 | Model | AIME 2024 pass@1 | AIME 2024 cons@64 | MATH-500 pass@1 | GPQA Diamond pass@1 | LiveCodeBench pass@1 | CodeForces rating |
@@ -70,4 +69,3 @@ use any gguf connector to interact with gguf file(s), i.e., [connector](https://
 | DeepSeek-R1-Distill-Qwen-32B | **72.6** | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
 | DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
 | DeepSeek-R1-Distill-Llama-70B | 70.0 | **86.7** | **94.5** | **65.2** | **57.5** | 1633 |
-
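
The evaluation note quoted above (64 sampled responses per query to estimate pass@1, with cons@64 reported for AIME) can be sketched as follows. This is a minimal illustration under the stated sampling setup, not DeepSeek's actual evaluation code; the function names and data layout are hypothetical.

```python
# Sketch: pass@1 as the mean per-query fraction of correct sampled responses,
# and cons@k as majority-vote accuracy over the k sampled answers per query.
# All names here are illustrative, not from DeepSeek's evaluation pipeline.
from collections import Counter

def pass_at_1(correct_flags_per_query):
    """correct_flags_per_query: one list of booleans per query (k samples each)."""
    per_query = [sum(flags) / len(flags) for flags in correct_flags_per_query]
    return sum(per_query) / len(per_query)

def cons_at_k(answers_per_query, reference_answers):
    """Majority-vote (consistency) accuracy: most common sampled answer per query."""
    hits = 0
    for answers, ref in zip(answers_per_query, reference_answers):
        majority, _count = Counter(answers).most_common(1)[0]
        hits += (majority == ref)
    return hits / len(reference_answers)

# Example: 2 queries, 4 samples each (k = 4 for brevity; the README uses k = 64)
flags = [[True, True, False, True], [False, False, True, False]]
print(pass_at_1(flags))  # (0.75 + 0.25) / 2 = 0.5
```

Averaging over many samples per query reduces the variance of the estimate, which is why a fixed temperature (0.6) and top-p (0.95) are used rather than greedy decoding.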