Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
Abstract
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-the-box verification capabilities, and we introduce a benchmark to measure progress on these deficiencies.
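The minimalist implementation described above can be sketched as a simple loop: sample k candidate responses, score each with a self-verification call, and return the highest-scoring one. In this sketch, `generate` and `verify` are hypothetical stand-ins for model API calls (the paper uses black-box queries to Gemini); the stubs below are deterministic placeholders so the control flow is runnable.

```python
import random


def generate(prompt, rng):
    """Hypothetical stand-in for sampling one candidate response
    from a model at non-zero temperature."""
    return f"response-{rng.randint(0, 9)}"


def verify(prompt, response, rng):
    """Hypothetical stand-in for a direct self-verification call
    that scores a candidate's correctness in [0, 1]."""
    return rng.random()


def sampling_based_search(prompt, k, seed=0):
    """Generate k candidates, verify each, and return the candidate
    with the highest verification score along with that score."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(k)]
    scores = [verify(prompt, c, rng) for c in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

Because each candidate is generated and verified independently, this loop is embarrassingly parallel: scaling k only requires more concurrent model calls, not longer ones.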
Community
Much attention has recently (and rightfully) been given to scaling test-time compute by extending the length of reasoning chains (e.g., o1, r1). This paper focuses on two other old-school axes of test-time compute scaling, sampling and verification, which provide fundamentally important knobs for massive and embarrassingly parallel scaling and are complementary with other test-time compute scaling strategies (e.g., the use of long-CoT reasoning models).
TLDR: This paper studies scaling trends along the sampling and verification axes of test-time compute, showing that:
- Even just by naively sampling many responses and asking models to self-verify, black-box (public API-only) queries to our 2024-era (non-reasoning) Gemini models are enough to exceed o1-Preview reasoning capabilities.
- The self-verification accuracy of frontier models counterintuitively improves on its own as you sample more, due to an implicit scaling phenomenon.
- Frontier models are poor at self-verification out of the box, but you can fix this by expending more compute and following two basic principles: (1) optimal writing styles for generation (i.e., CoT) are not optimal for verification, so always have models expend compute to rewrite their outputs; (2) models are bad at finding errors and hallucinations unless you tell them where to look---but you can fix this by sampling many outputs and having models compare different samples, since their diffs narrow down where errors and hallucinations may lie.
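The comparison principle in the last bullet can be sketched as follows: group sampled responses by their final answer, and when the two largest answer groups conflict, ask the model to compare one representative of each, using the disagreement to localize potential errors. Here `final_answer` and `compare` are hypothetical helpers standing in for an answer extractor and a comparison-style verification call; they are not from the paper's implementation.

```python
from collections import Counter


def final_answer(response):
    """Hypothetical extractor: takes the text after the last '='
    as the response's final answer."""
    return response.rsplit("=", 1)[-1].strip()


def compare(prompt, a, b):
    """Hypothetical stand-in for a model call that inspects where two
    conflicting responses diverge and returns the one judged correct.
    This placeholder simply prefers the first (majority) response."""
    return a


def select_by_comparison(prompt, responses):
    """Group responses by final answer; if the top two answer groups
    conflict, compare one representative of each group so the model
    can use their diff to narrow down where an error may lie."""
    groups = Counter(final_answer(r) for r in responses)
    ranked = groups.most_common(2)
    if len(ranked) < 2:
        # unanimous answers: return any representative
        return responses[0]
    (a_ans, _), (b_ans, _) = ranked
    a_rep = next(r for r in responses if final_answer(r) == a_ans)
    b_rep = next(r for r in responses if final_answer(r) == b_ans)
    return compare(prompt, a_rep, b_rep)
```

This also illustrates implicit scaling: with more samples, conflicting answer groups surface more reliably, giving the verifier concrete diffs to scrutinize rather than a single response to judge in isolation.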
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling (2025)
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning (2025)
- Test-time Computing: from System-1 Thinking to System-2 Thinking (2025)
- AirRAG: Activating Intrinsic Reasoning for Retrieval Augmented Generation via Tree-based Search (2025)
- Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning (2025)
- CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis (2025)
- s1: Simple test-time scaling (2025)