Papers
arxiv:2502.01839

Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

Published on Feb 3
· Submitted by ericzhao28 on Feb 5

Abstract

Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.

Community

Paper author · Paper submitter
edited 1 day ago

Much attention has recently (and rightfully) been given to scaling test-time compute by extending the length of reasoning chains (e.g., o1, r1). This paper focuses on two other old-school axes of test-time compute scaling, sampling and verification, which provide fundamentally important knobs for massive and embarrassingly parallel scaling and are complementary with other test-time compute scaling strategies (e.g., the use of long-CoT reasoning models).

TLDR: This paper studies scaling trends along the sampling and verification axes of test-time compute, showing that:

  1. Even just by naively sampling many responses and asking models to self-verify, black-box (public-API-only) queries to our 2024-era (non-reasoning) Gemini models are enough to exceed o1-Preview's reasoning capabilities.
  2. Counterintuitively, the self-verification accuracy of frontier models improves on its own as you sample more, due to an implicit scaling phenomenon.
  3. Frontier models are poor at self-verification out of the box, but you can fix this by expending more compute and following two basic principles. 1) Optimal writing styles for generation (i.e., chains of thought) are not optimal for verification, so always have models expend compute to rewrite their outputs before verifying. 2) Models are bad at finding errors and hallucinations unless you point to their locations---but you can fix this by sampling many responses and having models compare between different output samples, since their diffs narrow down where errors/hallucinations may lie.
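The sample-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` and `verify` are hypothetical stand-ins for model API calls (here stubbed with random values so the sketch runs standalone), and the repeated-verification voting is one simple way to spend extra compute on verification.

```python
import random

def generate(prompt, temperature=1.0):
    # Hypothetical stand-in for an LLM sampling call at high temperature.
    return f"candidate answer (seed={random.random():.3f})"

def verify(prompt, response, trials=3):
    # Hypothetical self-verification: ask the model `trials` times whether
    # `response` correctly answers `prompt`; return the fraction of "yes" votes.
    # Stubbed here with coin flips so the sketch is self-contained.
    votes = [random.random() < 0.5 for _ in range(trials)]
    return sum(votes) / trials

def sampling_based_search(prompt, n_samples=8, verify_trials=3):
    """Minimal sampling-based search: draw n_samples candidate responses,
    score each by self-verification, and return the top-scoring candidate
    with its verification score."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    scored = [(verify(prompt, c, verify_trials), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]
```

Both `n_samples` and `verify_trials` are independent compute knobs, and each candidate can be generated and verified in parallel, which is what makes this axis embarrassingly parallel.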

