zhao1iang committed · Commit a9823e5 (verified) · 1 parent: 0302a27

Update README.md

Files changed (1): README.md (+106, -4)

README.md CHANGED
@@ -50,11 +50,11 @@ As of September 2024, Skywork-Critic-Llama3.1-70B **ranks first** on RewardBench
  | Anthropic/claude-3-5-sonnet-20240620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
  | meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
  | NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |
-
-
-
  # Demo Code
- Below is an example of obtaining the critic of two conversations.

  ```python
  import torch
@@ -119,6 +119,108 @@ print(completion)

  ```
  # Declaration and License Agreement

  ## Declaration
  | Anthropic/claude-3-5-sonnet-20240620 | 96.4 | 74.0 | 81.6 | 84.7 | 84.2 |
  | meta-llama/Meta-Llama-3.1-70B-Instruct * | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
  | NCSOFT/Llama-3-OffsetBias-8B * | 92.5 | 80.3 | 86.8 | 76.4 | 84.0 |

  # Demo Code
+ Below are two examples of how to use the Skywork Critic model: as a preference data selector, and as a judge that generates scores and rationales for instruction-response pairs.
+
+ ## Skywork Critic Model as a Preference Data Selector
+ Here is an example of using the Skywork Critic model as a preference data selector: it distinguishes chosen from rejected training data for Direct Preference Optimization (DPO) training.

  ```python
  import torch

  ```

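For DPO data selection, the critic's completion must be mapped back onto (chosen, rejected) pairs. Below is a minimal, hypothetical sketch under the assumption that the pairwise judging prompt ends with a verdict marker such as "[[A]]" or "[[B]]"; the marker and the helper name are illustrative only, so adapt them to the output format of the template you actually use.

```python
# Hypothetical helper: map a pairwise critic verdict onto a DPO training record.
# The verdict markers "[[A]]" / "[[B]]" are an assumption about the judging
# prompt's output format; adapt them to your actual template.
def to_dpo_record(prompt, response_a, response_b, verdict):
    if "[[A]]" in verdict:
        chosen, rejected = response_a, response_b
    elif "[[B]]" in verdict:
        chosen, rejected = response_b, response_a
    else:
        return None  # ambiguous judgment: skip this pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = to_dpo_record("2+2?", "4", "5", "... the better response is [[A]]")
print(record)
# -> {'prompt': '2+2?', 'chosen': '4', 'rejected': '5'}
```

Pairs on which the judge returns no clear verdict are dropped rather than guessed, which keeps the resulting DPO set clean.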
+ ## Skywork Critic Model as a Judge
+
+ Here is an example of using the Skywork Critic model as a judge: for an instruction-response pair, it generates a score and a rationale based on specific evaluation criteria. Our preliminary research indicates that 8B-parameter models struggle to produce reliable judgments, so we use the Skywork-Critic-Llama3.1-70B model exclusively as our judge.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # An example case
+ prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
+
+ # Chosen response
+ responseA = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
+
+ # Rejected response
+ responseB = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."
+
+ # Feed a natural-language judging prompt to the generative model
+ single_rating_prompt_template = """Please act as a professional dialogue-quality evaluation expert who can diagnose and analyze an AI assistant's answer to a user's question along multiple dimensions and give an overall score (on a scale of 1-5). Your evaluation should consider many dimensions, such as helpfulness, relevance, accuracy, depth, creativity, and safety. Note that instructions for different task types call for different emphasis across these dimensions, so analyze according to the specific question.
+
+ [User Question]
+ {input}
+
+ [Start of the Assistant's Answer]
+ {response_a}
+ [End of the Assistant's Answer]
+
+ Your detailed evaluation and overall score: """
+
+ # Judge responseA and responseB separately.
+ user_message = single_rating_prompt_template.format(input=prompt, response_a=responseA)
+ # user_message = single_rating_prompt_template.format(input=prompt, response_a=responseB)
+
+ conversation = [{"role": "user", "content": user_message}]
+ print(f"conversation => {conversation}")
+
+ model_name = "Skywork/Skywork-Critic-Llama3.1-70B"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype="auto",
+     device_map="auto"
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+ input_ids = tokenizer.apply_chat_template(
+     conversation,
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt").to(model.device)
+
+ generation = model.generate(
+     input_ids=input_ids,
+     max_new_tokens=2048,
+     do_sample=False,
+     pad_token_id=128009,
+     temperature=0)
+
+ completion = tokenizer.decode(
+     generation[0][len(input_ids[0]):],
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=True)
+
+ print(completion)
+ ```
+
+ Model-generated judgment for chosen Response A:
+ ```
+ **Evaluation:**
+
+ 1. **Helpfulness**: The answer directly addresses the user's question with clear steps and calculations; very helpful.
+ 2. **Relevance**: The answer is fully relevant to the user's question and does not stray off topic.
+ 3. **Accuracy**: The calculations are error-free and the result is correct.
+ 4. **Depth**: The answer provides detailed steps and explanations that help the user understand the solution process.
+ 5. **Creativity**: Although the answer is a straightforward calculation, each step is explained in simple language with explicit arithmetic, making it easy to follow.
+ 6. **Safety**: The answer contains nothing misleading or unsafe.
+
+ **Overall score: 5**
+
+ **Summary**: The answer is clear, accurate, and helpful, fully meeting the user's needs. Its step-by-step explanations and calculations help the user understand the solution. Nothing needs improvement, so it receives full marks.
+ ```
+
+ Model-generated judgment for rejected Response B:
+ ```
+ **Evaluation:**
+
+ 1. **Helpfulness**: The answer directly addresses the user's question and gives the number of apples each person ends up with.
+ 2. **Relevance**: The answer is closely relevant to the user's question and does not stray off topic.
+ 3. **Accuracy**: The calculation process is correct until the last step, where an error appears: Jane should split the apples among herself and her 2 siblings, 3 people in total, not 2.
+ 4. **Depth**: The answer provides detailed calculation steps that help the user understand the solution process.
+ 5. **Creativity**: The answer is straightforward, without particular creativity, which is appropriate for this simple math problem.
+ 6. **Safety**: The answer contains nothing misleading or unsafe.
+
+ **Overall score:** 4
+
+ **Suggested improvement:**
+ - In the last step, compute each person's share correctly: Jane should split the 9 apples among herself and her 2 siblings, 3 people in total, so each person gets 9 ÷ 3 = 3 apples.
+ ```
+
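Once judgments for both responses are available, the numeric scores can be compared to select a preferred response. Below is a minimal sketch, assuming the rationale contains an overall-score line like the examples above; the regex and the helper names are hypothetical, so adapt the pattern to your prompt's actual output format.

```python
import re

# Hypothetical helper: pull the 1-5 overall score out of a judge completion.
# Assumes a line like "**Overall score: 5**" (or the Chinese "总体打分:5").
def extract_score(judgment):
    m = re.search(r"(?:Overall score|总体打分)\D{0,10}?([1-5])", judgment)
    return int(m.group(1)) if m else None

# Pick the higher-scored response as the DPO "chosen" side; skip ties.
def select_by_score(response_a, response_b, judgment_a, judgment_b):
    score_a, score_b = extract_score(judgment_a), extract_score(judgment_b)
    if score_a is None or score_b is None or score_a == score_b:
        return None  # unreliable or tied judgments: drop the pair
    return (response_a, response_b) if score_a > score_b else (response_b, response_a)

print(select_by_score("A", "B", "**Overall score: 5**", "**Overall score:** 4"))
# -> ('A', 'B')
```

Dropping tied or unparseable judgments trades dataset size for label quality, which is usually the right trade for preference data.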
  # Declaration and License Agreement

  ## Declaration