As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench.

| Model | Chat | Chat Hard | Safety | Reasoning | Score |
| --- | --- | --- | --- | --- | --- |
| Anthropic/claude-3-5-sonnet-20240620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
| meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
| NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |
# Demo Code

Below are two examples of how to use the Skywork Critic model: as a preference data selector, and as a judge that generates scores and rationales for instruction-response pairs.

## Skywork Critic Model as a Preference Data Selector

Here is an example showing how to use the Skywork Critic model as a preference data selector: it distinguishes between chosen and rejected training data for Direct Preference Optimization (DPO) training.

```python
import torch
# ...
```
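The body of the selector example is elided above. As a rough sketch only — assuming a hypothetical pairwise judge whose verdict contains a `[[A]]` or `[[B]]` marker (an illustrative convention, not necessarily the repository's actual output format) — the selection step that turns judge verdicts into DPO training pairs could look like:

```python
# Sketch of the selection step: label a response pair as chosen/rejected
# based on a pairwise judge's verdict. The "[[A]]"/"[[B]]" markers are a
# hypothetical convention used here for illustration.

def select_preference(prompt: str, response_a: str, response_b: str, judge) -> dict:
    """Return a DPO-style record with "chosen" and "rejected" fields."""
    verdict = judge(prompt, response_a, response_b)
    if "[[A]]" in verdict:
        chosen, rejected = response_a, response_b
    elif "[[B]]" in verdict:
        chosen, rejected = response_b, response_a
    else:
        return {}  # unparseable verdict: skip this pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Stub judge standing in for a real model.generate + tokenizer.decode call.
stub_judge = lambda p, a, b: "Response A is correct. Verdict: [[A]]"
pair = select_preference("Q", "good answer", "bad answer", stub_judge)
```

In practice the `judge` callable would wrap the model-loading and generation code shown in the judge example below.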
## Skywork Critic Model as a Judge

Here is an example showing how to use the Skywork Critic model as a judge. For an instruction-response pair, the Skywork-Critic model generates a score and a rationale based on specific evaluation criteria. Our preliminary research indicates that 8B-parameter models struggle to produce reliable judgments for responses, so we exclusively use the Skywork-Critic-Llama3.1-70B model as the judge.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# An example case
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"

# Chosen response
responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."

# Rejected response
responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

# Feed a natural-language rating prompt to the generative model.
single_rating_prompt_template = """Please act as a professional expert in evaluating conversation quality, able to diagnose and analyze an AI assistant's answer to a user's question across multiple dimensions and to give an overall score (on a scale of 1-5). Your evaluation should consider many dimensions, such as the helpfulness, relevance, accuracy, depth, creativity, and safety of the answer. Note that instructions for different task types call for different emphasis across these dimensions, so the analysis should be tailored to the specific question.

[User Question]
{input}

[Start of Assistant's Answer]
{response_a}
[End of Assistant's Answer]

Your detailed evaluation and overall score: """

# Critique responseA and responseB separately.
user_message = single_rating_prompt_template.format(input=prompt, response_a=responseA)
# user_message = single_rating_prompt_template.format(input=prompt, response_a=responseB)

conversation = [{"role": "user", "content": user_message}]
print(f"conversation => {conversation}")

model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_ids = tokenizer.apply_chat_template(
    conversation,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) for deterministic judgments.
generation = model.generate(
    input_ids=input_ids,
    max_new_tokens=2048,
    do_sample=False,
    pad_token_id=128009)

completion = tokenizer.decode(
    generation[0][len(input_ids[0]):],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True)

print(completion)
```
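The arithmetic that separates the two responses can be checked directly, without any model:

```python
# Verify the example's arithmetic: 12 apples, give away 4, buy 1,
# then split among Jane and her 2 siblings (3 people in total).
apples = 12 - 4 + 1          # 9 apples after giving and buying
per_person = apples / 3      # split among 3 people, as in response A
assert per_person == 3.0

# Response B's mistake: dividing among only the 2 siblings.
assert apples / 2 == 4.5     # wrong head count, and not a whole number
```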

Model-Generated Judgment for Chosen Response A

```
**Evaluation Analysis:**

1. **Helpfulness**: The answer directly solves the user's problem with clear steps and calculations; very helpful.
2. **Relevance**: The answer is fully relevant to the user's question and stays on topic.
3. **Accuracy**: The calculation process is error-free and the result is correct.
4. **Depth**: The answer gives detailed steps and explanations that help the user follow the solution process.
5. **Creativity**: Although the answer is a straightforward calculation, each step is explained in simple language with explicit arithmetic, making it easy to understand.
6. **Safety**: The answer contains nothing misleading or dangerous.

**Overall Score: 5**

**Summary**: The answer is clear, accurate, and helpful, and fully meets the user's needs. The step-by-step explanation and calculation help the user understand how the problem is solved. Nothing needs improvement, so it receives full marks.
```

Model-Generated Judgment for Rejected Response B

```
**Evaluation Analysis:**

1. **Helpfulness**: The answer directly addresses the user's question and reports how many apples each person ends up with.
2. **Relevance**: The answer is closely relevant to the user's question and stays on topic.
3. **Accuracy**: The calculation process is correct until the final step, where an error occurs: Jane should split the apples among herself and her 2 siblings, 3 people in total, not 2.
4. **Depth**: The answer gives detailed calculation steps that help the user follow the solution process.
5. **Creativity**: The answer is straightforward, with no particular creativity, but for this simple math problem a direct answer is appropriate.
6. **Safety**: The answer contains nothing misleading or dangerous.

**Overall Score:** 4

**Suggested Improvements:**
- In the final step, compute each person's share correctly. Jane should split the 9 apples among herself and her 2 siblings, 3 people in total, so each person gets 9 ÷ 3 = 3 apples.
```
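To use such judgments programmatically, the overall score can be pulled out of the completion. A minimal sketch, assuming the judgment contains an overall-score line like those in the examples above (the regex and helper name are illustrative, not part of the repository):

```python
import re
from typing import Optional

def extract_overall_score(judgment: str) -> Optional[int]:
    """Pull the 1-5 overall score from a judgment, matching either the
    English "Overall Score" or the Chinese 总体打分 phrasing, with or
    without bold markers and with ASCII or full-width colons."""
    match = re.search(r"(?:Overall Score|总体打分)[::\*\s]*([1-5])", judgment)
    return int(match.group(1)) if match else None

print(extract_overall_score("**Overall Score: 5**"))  # -> 5
print(extract_overall_score("**总体打分:** 4"))        # -> 4
```

A `None` return signals an unparseable judgment, which a data pipeline would typically log and skip rather than treat as a score.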

# Declaration and License Agreement

## Declaration