zhao1iang committed · Commit a9823e5 (verified) · 1 parent: 0302a27

Update README.md

Files changed (1): README.md (+106, -4)

README.md CHANGED
@@ -50,11 +50,11 @@ As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench
  | Anthropic/claude-3-5-sonnet-20240620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
  | meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
  | NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |
-
-
-
  # Demo Code
- Below is an example of obtaining the critic of two conversations.

  ```python
  import torch
@@ -119,6 +119,108 @@ print(completion)

  ```
  # Declaration and License Agreement

  ## Declaration
  | Anthropic/claude-3-5-sonnet-20240620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
  | meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
  | NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |

  # Demo Code
+ Below are two examples of how to use the Skywork Critic model: as a preference data selector, and as a judge that generates scores and rationales for instruction-response pairs.
+
+ ## Skywork Critic Model as a Preference Data Selector
+ Here is an example of using the Skywork Critic model as a preference data selector: it distinguishes chosen from rejected training data for Direct Preference Optimization (DPO) training.

  ```python
  import torch

  ```

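For DPO data selection, the critic's completion must be mapped back onto (chosen, rejected) pairs. Below is a minimal, hypothetical sketch under the assumption that the pairwise judging prompt ends with a verdict marker such as "[[A]]" or "[[B]]"; the marker and the helper name are illustrative only, so adapt them to the output format of the template you actually use.

```python
# Hypothetical helper: map a pairwise critic verdict onto a DPO training record.
# The verdict markers "[[A]]" / "[[B]]" are an assumption about the judging
# prompt's output format; adapt them to your actual template.
def to_dpo_record(prompt, response_a, response_b, verdict):
    if "[[A]]" in verdict:
        chosen, rejected = response_a, response_b
    elif "[[B]]" in verdict:
        chosen, rejected = response_b, response_a
    else:
        return None  # ambiguous judgment: skip this pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = to_dpo_record("2+2?", "4", "5", "... the better response is [[A]]")
print(record)
# -> {'prompt': '2+2?', 'chosen': '4', 'rejected': '5'}
```

Pairs on which the judge returns no clear verdict are dropped rather than guessed, which keeps the resulting DPO set clean.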
+ ## Skywork Critic Model as a Judge
+
+ Here is an example of using the Skywork Critic model as a judge: for an instruction-response pair, it generates a score and a rationale based on specific evaluation criteria. Our preliminary research indicates that 8B-parameter models struggle to produce reliable judgments, so we use the Skywork-Critic-Llama3.1-70B model exclusively as our judge.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # An example case
+ prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
+
+ # Chosen response
+ responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
+
+ # Rejected response
+ responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."
+
+ # Feed a natural-language judging prompt to the generative model
+ single_rating_prompt_template = """Please act as a professional dialogue-quality evaluation expert who can diagnose and analyze an AI assistant's answer to a user's question along multiple dimensions and give an overall score (on a scale of 1-5). Your evaluation should consider many dimensions, such as helpfulness, relevance, accuracy, depth, creativity, and safety. Note that instructions for different task types call for different emphasis across these dimensions, so analyze according to the specific question.
+
+ [User Question]
+ {input}
+
+ [Start of the Assistant's Answer]
+ {response_a}
+ [End of the Assistant's Answer]
+
+ Your detailed evaluation and overall score: """
+
+ # Judge responseA and responseB separately.
+ user_message = single_rating_prompt_template.format(input=prompt, response_a=responseA)
+ # user_message = single_rating_prompt_template.format(input=prompt, response_a=responseB)
+
+ conversation = [{"role": "user", "content": user_message}]
+ print(f"conversation => {conversation}")
+
+ model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ input_ids = tokenizer.apply_chat_template(
+     conversation,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt").to(model.device)
+
+ generation = model.generate(
+     input_ids=input_ids,
+     max_new_tokens=2048,
+     do_sample=False,
+     pad_token_id=128009,
+     temperature=0)
+
+ completion = tokenizer.decode(
+     generation[0][len(input_ids[0]):],
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=True)
+
+ print(completion)
+ ```
+
+ Model-generated judgment for chosen Response A:
+ ```
+ **Evaluation:**
+
+ 1. **Helpfulness**: The answer directly addresses the user's question with clear steps and calculations; very helpful.
+ 2. **Relevance**: The answer is fully relevant to the user's question and does not stray off topic.
+ 3. **Accuracy**: The calculations are error-free and the result is correct.
+ 4. **Depth**: The answer provides detailed steps and explanations that help the user understand the solution process.
+ 5. **Creativity**: Although the answer is a straightforward calculation, each step is explained in simple language with explicit arithmetic, making it easy to follow.
+ 6. **Safety**: The answer contains nothing misleading or unsafe.
+
+ **Overall score: 5**
+
+ **Summary**: The answer is clear, accurate, and helpful, fully meeting the user's needs. Its step-by-step explanations and calculations help the user understand the solution. Nothing needs improvement, so it receives full marks.
+ ```
+
+ Model-generated judgment for rejected Response B:
+ ```
+ **Evaluation:**
+
+ 1. **Helpfulness**: The answer directly addresses the user's question and gives the number of apples each person ends up with.
+ 2. **Relevance**: The answer is closely relevant to the user's question and does not stray off topic.
+ 3. **Accuracy**: The calculation process is correct until the last step, where an error appears: Jane should split the apples among herself and her 2 siblings, 3 people in total, not 2.
+ 4. **Depth**: The answer provides detailed calculation steps that help the user understand the solution process.
+ 5. **Creativity**: The answer is straightforward, without particular creativity, which is appropriate for this simple math problem.
+ 6. **Safety**: The answer contains nothing misleading or unsafe.
+
+ **Overall score:** 4
+
+ **Suggested improvement:**
+ - In the last step, compute each person's share correctly: Jane should split the 9 apples among herself and her 2 siblings, 3 people in total, so each person gets 9 ÷ 3 = 3 apples.
+ ```
+
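Once judgments for both responses are available, the numeric scores can be compared to select a preferred response. Below is a minimal sketch, assuming the rationale contains an overall-score line like the examples above; the regex and the helper names are hypothetical, so adapt the pattern to your prompt's actual output format.

```python
import re

# Hypothetical helper: pull the 1-5 overall score out of a judge completion.
# Assumes a line like "**Overall score: 5**" (or the Chinese "总体打分:5").
def extract_score(judgment):
    m = re.search(r"(?:Overall score|总体打分)\D{0,10}?([1-5])", judgment)
    return int(m.group(1)) if m else None

# Pick the higher-scored response as the DPO "chosen" side; skip ties.
def select_by_score(response_a, response_b, judgment_a, judgment_b):
    score_a, score_b = extract_score(judgment_a), extract_score(judgment_b)
    if score_a is None or score_b is None or score_a == score_b:
        return None  # unreliable or tied judgments: drop the pair
    return (response_a, response_b) if score_a > score_b else (response_b, response_a)

print(select_by_score("A", "B", "**Overall score: 5**", "**Overall score:** 4"))
# -> ('A', 'B')
```

Dropping tied or unparseable judgments trades dataset size for label quality, which is usually the right trade for preference data.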
  # Declaration and License Agreement

  ## Declaration