# CTRL: Critic Training via Reinforcement Learning

CTRL-32B is a critic LLM fine-tuned from Qwen2.5-Coder-32B-Instruct. Given a problem and a candidate answer, it generates a structured critique consisting of an analysis, improvement suggestions, and an overall correctness judgment.

## Quickstart

We recommend using vLLM for inference:

````python
from vllm import LLM, SamplingParams

def format_prompt_for_ctrl(problem, answer):
    """Given a question-answer pair, we ask the model to generate a critique."""
    return f"""You are tasked with analyzing an answer to a problem and providing constructive feedback. Do NOT provide direct solutions.

Problem description:
<problem>
{problem}
</problem>

Answer:
<answer>
{answer}
</answer>

Structure your response using the following format (without <format> tags):
<format>
Analysis:
{{Analysis}}

Improvement suggestions:
{{Suggestions}}

Overall judgment: {{Correct/Incorrect}}
</format>"""

# Sample prompt. Note that the answer below does not solve the stated problem
# (it sums the elements of all odd-length subarrays), so the critic has a
# genuine flaw to identify.
problem = """Write a python function to check whether every odd index contains odd numbers of a given list."""
answer = """```python
def odd_length_sum(arr):
    n = len(arr)
    res = 0

    # Iterate through each element in the array
    for i in range(n):
        # Calculate the number of subarrays in which arr[i] is present
        count = ((i + 1) * (n - i) + 1) // 2

        # If the count is odd, add the element to the result
        if count % 2 == 1:
            res += arr[i]

    return res
```"""
prompts = [
    format_prompt_for_ctrl(problem, answer),
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=1024)

# Create an LLM (tensor_parallel_size=2 shards the 32B model across two GPUs).
llm = LLM(model="Zhihui/CTRL-32B", tensor_parallel_size=2)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

## Citation

```bibtex
@article{xie2025teaching,
  title={Teaching Language Models to Critique via Reinforcement Learning},
  author={Xie, Zhihui and Chen, Liyu and Mao, Weichao and Xu, Jingjing and Kong, Lingpeng and others},
  journal={arXiv preprint arXiv:2502.03492},
  year={2025}
}
```