Teaching Language Models to Critique via Reinforcement Learning
Abstract
Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet this capability is fundamentally limited by the ability to provide accurate judgments and actionable suggestions. In this work, we study LLM critics for code generation and propose CTRL, a framework for Critic Training via Reinforcement Learning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with CTRL significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving relative improvements of up to 106.1% across challenging code generation benchmarks.
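The iterative critique-revision loop mentioned in the abstract can be pictured with a minimal Python sketch. Everything named here is an illustrative assumption rather than the paper's implementation: the `Critic` and `Generator` call signatures, the `max_rounds` budget, the stopping rule, and the binary pass/fail `correction_reward` are hypothetical stand-ins, and the toy critic and generator are hard-coded functions in place of real models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API).
# A critic maps (problem, solution) to (judged_correct, feedback).
Critic = Callable[[str, str], Tuple[bool, str]]
# A generator maps (problem, previous_solution, feedback) to a revised solution.
Generator = Callable[[str, str, str], str]
# A test receives the namespace produced by executing a solution.
Test = Callable[[Dict[str, object]], bool]


def critique_revise(problem: str, solution: str, critic: Critic,
                    generator: Generator, max_rounds: int = 3) -> Tuple[str, List[str]]:
    """Alternate critique and revision until the critic accepts or the budget ends."""
    history = [solution]
    for _ in range(max_rounds):
        judged_correct, feedback = critic(problem, solution)
        if judged_correct:
            break  # the critic's own judgment serves as the stopping signal
        solution = generator(problem, solution, feedback)
        history.append(solution)
    return solution, history


def correction_reward(revised_solution: str, tests: List[Test]) -> float:
    """Assumed reward: 1.0 if the revised solution passes every test, else 0.0."""
    namespace: Dict[str, object] = {}
    try:
        exec(revised_solution, namespace)  # only safe for trusted toy code
        return 1.0 if all(test(namespace) for test in tests) else 0.0
    except Exception:
        return 0.0


# Toy stand-ins for the models, for illustration only.
BUGGY = "def add(a, b):\n    return a - b"
TESTS: List[Test] = [lambda ns: ns["add"](2, 3) == 5]


def toy_critic(problem: str, solution: str) -> Tuple[bool, str]:
    if "a - b" in solution:
        return False, "The function subtracts; it should add the arguments."
    return True, ""


def toy_generator(problem: str, previous: str, feedback: str) -> str:
    return previous.replace("a - b", "a + b")
```

Calling `critique_revise("Implement add(a, b).", BUGGY, toy_critic, toy_generator)` runs one revision round and then stops once the toy critic accepts the result. In a training setup along the lines the abstract describes, a reward like `correction_reward` would be computed on the generator's revised output and credited to the critic that produced the feedback; the exact reward used by CTRL is not stated in the abstract, so the binary form here is only a placeholder.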
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reusing Embeddings: Reproducible Reward Model Research in Large Language Model Alignment without GPUs (2025)
- Enabling Scalable Oversight via Self-Evolving Critic (2025)
- Improving Multi-Step Reasoning Abilities of Large Language Models with Direct Advantage Policy Optimization (2024)
- RAG-Reward: Optimizing RAG with Reward Modeling and RLHF (2025)
- Towards Cost-Effective Reward Guided Text Generation (2025)
- Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks (2025)
- Understanding Impact of Human Feedback via Influence Functions (2025)