# CTRL: Critic Training via Reinforcement Learning
CTRL-32B is a critic LLM fine-tuned from Qwen2.5-Coder-32B-Instruct.
- Project Page: https://critic-rl.github.io/
- Paper: https://arxiv.org/abs/2502.03492
- Code: https://github.com/HKUNLP/critic-rl
## Quickstart
We recommend using vLLM for inference:
````python
from vllm import LLM, SamplingParams

def format_prompt_for_ctrl(problem, answer):
    """Given a question-answer pair, we ask the model to generate a critique."""
    return f"""You are tasked with analyzing an answer to a problem and providing constructive feedback. Do NOT provide direct solutions.

Problem description:
<problem>
{problem}
</problem>

Answer:
<answer>
{answer}
</answer>

Structure your response using the following format (without <format> tags):
<format>
Analysis:
{{Analysis}}

Improvement suggestions:
{{Suggestions}}

Overall judgment: {{Correct/Incorrect}}
</format>"""

# Sample prompts. Note that the answer below solves a different problem
# than the one stated, so the critic should judge it incorrect.
problem = """Write a python function to check whether every odd index contains odd numbers of a given list."""
answer = """```python
def odd_length_sum(arr):
    n = len(arr)
    res = 0
    # Iterate through each element in the array
    for i in range(n):
        # Calculate the number of subarrays in which arr[i] is present
        count = ((i + 1) * (n - i) + 1) // 2
        # If the count is odd, add the element to the result
        if count % 2 == 1:
            res += arr[i]
    return res
```"""

prompts = [
    format_prompt_for_ctrl(problem, answer),
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=1024)

# Create an LLM.
llm = LLM(model="Zhihui/CTRL-32B", tensor_parallel_size=2)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
````
## Citation
```bibtex
@article{xie2025teaching,
  title={Teaching Language Models to Critique via Reinforcement Learning},
  author={Xie, Zhihui and Chen, Liyu and Mao, Weichao and Xu, Jingjing and Kong, Lingpeng and others},
  journal={arXiv preprint arXiv:2502.03492},
  year={2025}
}
```