Satori-7B-Round2 is a 7B LLM built on an open-source base model (Qwen-2.5-Math-7B) and trained on open-source data (OpenMathInstruct-2 and NuminaMath). Satori-7B-Round2 is capable of autoregressive search, i.e., self-reflection and self-exploration without external guidance. This is achieved through our proposed Chain-of-Action-Thought (COAT) reasoning and a two-stage post-training paradigm.
Our Approach
We formulate LLM reasoning as a sequential decision-making problem, where reasoning is a process of constructing and refining an answer step by step. Specifically, the LLM (agent's policy) starts with an input context (initial state), generates a reasoning step (action), and updates the context (next state). The LLM repeats this process until it reaches a final answer, and then receives a reward that evaluates whether the final answer matches the ground truth. With this formulation, we can train the LLM to reason using RL, aiming to generate a sequence of reasoning steps that maximizes the expected reward.
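In generic RL notation (a sketch using standard symbols, not notation taken from the paper), this objective can be written as

\[
\max_{\theta}\; \mathbb{E}_{a_{1:T}\sim \pi_{\theta}}\bigl[ r(s_T) \bigr],
\qquad s_t = (s_{t-1}, a_t),\quad s_0 = \text{input problem},
\]

where \(\pi_{\theta}\) is the LLM policy, each action \(a_t\) is a reasoning step appended to the context, and \(r(s_T)\) is the terminal reward, e.g., 1 if the final answer matches the ground truth and 0 otherwise.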
Chain-of-Action-Thought reasoning (COAT)
The key challenge of achieving autoregressive search is enabling the LLM to determine when to reflect, continue, or explore alternative solutions without external intervention. To enable this, we introduce several special meta-action tokens that guide the LLM's reasoning process:
- Continue Reasoning (<|continue|>): encourages the LLM to build upon its current reasoning trajectory by generating the next intermediate step.
- Reflect (<|reflect|>): prompts the model to pause and verify the correctness of prior reasoning steps.
- Explore Alternative Solution (<|explore|>): signals the model to identify critical flaws in its reasoning and explore a new solution.
We refer to this formulation as Chain-of-Action-Thought (COAT) reasoning. Each COAT reasoning step is a sequence of tokens, starting with one of the meta-action tokens.
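As a minimal illustration of this structure (not code from the Satori codebase), the sketch below segments a generated trace into COAT steps by scanning for the three meta-action tokens; the example trace is hypothetical.

    import re

    # The three COAT meta-action tokens described above.
    META_ACTIONS = ["<|continue|>", "<|reflect|>", "<|explore|>"]

    def split_coat_steps(trace: str) -> list[tuple[str, str]]:
        """Split a generated trace into (meta_action, step_text) pairs.

        Each COAT step starts with one meta-action token and runs until
        the next meta-action token (or the end of the trace).
        """
        pattern = "(" + "|".join(re.escape(tok) for tok in META_ACTIONS) + ")"
        parts = re.split(pattern, trace)
        # parts alternates: [prefix, token, text, token, text, ...]
        return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]

    # Hypothetical trace for illustration only.
    trace = (
        "<|continue|>Compare 9.11 and 9.9 digit by digit."
        "<|reflect|>Check: 9.9 = 9.90 and 0.90 > 0.11, so 9.9 is larger."
    )
    for action, text in split_coat_steps(trace):
        print(action, "->", text)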
Overview of Training Framework
- A small-scale format tuning (FT) stage that helps the base LLM to internalize the COAT reasoning format.
- A large-scale self-improvement stage that utilizes reinforcement learning with "Restart and Explore" (RAE) techniques.
Format Tuning Through Imitation Learning
This stage aims to fine-tune a pre-trained base LLM to imitate a few demonstrated reasoning trajectories with COAT reasoning format. To synthesize such COAT trajectories that incorporate trials and errors, we propose a multi-agent data synthesis framework that leverages three LLMs:
- Generator: Given an input problem, the generator produces multiple reasoning paths using classical CoT techniques.
- Critic: A critic evaluates the correctness of the reasoning paths generated by the generator, providing feedback to refine the reasoning and address suboptimal steps.
- Reward Model: A reward model assigns scores to the refined reasoning paths and selects the most effective path as the final demonstration trajectory.
These three models collaborate to construct high-quality demonstration trajectories. We observe that a small number (10K) of demonstration trajectories is sufficient for the base LLM to follow the COAT reasoning format.
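The sketch below shows how such a generator/critic/reward-model loop could be wired together. It is a hypothetical outline: the callables generate_cot_paths, critique_path, and score_path stand in for the three LLM calls and are not APIs from the Satori codebase.

    def synthesize_coat_demo(problem, generate_cot_paths, critique_path, score_path, n_paths=4):
        """Build one COAT demonstration trajectory for a given problem.

        generate_cot_paths(problem, n) -> list[str] : generator LLM, classical CoT sampling
        critique_path(problem, path)   -> str       : critic LLM, refines/flags suboptimal steps
        score_path(problem, path)      -> float     : reward model, scores a refined path
        """
        # 1. Generator: sample several classical CoT reasoning paths.
        candidate_paths = generate_cot_paths(problem, n_paths)

        # 2. Critic: refine each path, adding reflection/correction where steps are flawed.
        refined_paths = [critique_path(problem, path) for path in candidate_paths]

        # 3. Reward model: keep the highest-scoring refined path as the demonstration.
        return max(refined_paths, key=lambda path: score_path(problem, path))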
Self-improvement via Reinforcement Learning
Through format tuning, the LLM has adopted the COAT reasoning style but struggles to generalize to unseen problems. The RL stage aims to instill the actual capability of leveraging self-reflection to improve reasoning. We start with the format-tuned LLM and further optimize it using the classical PPO algorithm with two additional key strategies:
- Restart and Explore (RAE): Inspired by Go-Explore, we train the LLM policy to reason not only from the problem statement but also from intermediate steps sampled from past trajectories, both correct and incorrect (a minimal sketch of this restart scheme follows this subsection). We also add exploration bonuses to encourage deeper reflection, further increasing opportunities for the policy to arrive at correct answers.
- Iterative Self-improvement: The policy might converge to a local sub-optimum and cannot further improve. Inspired by Kickstarting, after each round of RL training, we distill the knowledge of the current teacher policy into the student model (base LLM) through supervised fine-tuning. Starting from the newly fine-tuned LLM, we then perform another round of RL training.
Satori-7B-Round2 is obtained through a second round of iterative self-improvement.
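The sketch below illustrates only the restart part of RAE, under assumptions of our own: the trajectory-buffer format, the restart probability, and the prefix-concatenation are illustrative choices rather than details taken from the paper, and exploration bonuses and PPO training are omitted.

    import random

    def sample_restart_state(problem, past_trajectories, p_restart=0.5):
        """Pick the initial state for one RL rollout.

        problem           : the problem statement (string)
        past_trajectories : list of (steps, was_correct) pairs from earlier attempts,
                            where steps is the list of COAT steps of that attempt
        p_restart         : probability of restarting from an intermediate step
                            rather than from scratch (an assumed value)
        """
        if not past_trajectories or random.random() > p_restart:
            return problem  # reason from the problem statement, as in standard RL fine-tuning

        # Restart from an intermediate step of a past trajectory, correct or incorrect,
        # so the policy practices continuing, reflecting on, and correcting partial solutions.
        steps, _was_correct = random.choice(past_trajectories)
        if not steps:
            return problem
        cut = random.randint(1, len(steps))
        return problem + "".join(steps[:cut])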
Usage
import os
from tqdm import tqdm
import torch
from vllm import LLM, SamplingParams

def generate(question_list, model_path):
    # Load the model with vLLM and generate one completion per prompt.
    llm = LLM(
        model=model_path,
        trust_remote_code=True,
        tensor_parallel_size=1,
    )
    sampling_params = SamplingParams(
        max_tokens=4096,
        temperature=0.0,
        n=1,
        skip_special_tokens=True,  # hide special tokens such as "<|continue|>", "<|reflect|>", and "<|explore|>"
    )
    outputs = llm.generate(question_list, sampling_params, use_tqdm=True)
    completions = [[output.text for output in output_item.outputs] for output_item in outputs]
    return completions

def prepare_prompt(question):
    # Qwen-style chat template with the instruction used for evaluation.
    prompt = f"<|im_start|>user\nSolve the following math problem efficiently and clearly.\nPlease reason step by step, and put your final answer within \\boxed{{}}.\nProblem: {question}<|im_end|>\n<|im_start|>assistant\n"
    return prompt

def run():
    model_path = "Satori-reasoning/Satori-7B-Round2"
    all_problems = [
        "which number is larger? 9.11 or 9.9?",
    ]
    completions = generate(
        [prepare_prompt(problem_data) for problem_data in all_problems],
        model_path,
    )
    for completion in completions:
        print(completion[0])

if __name__ == "__main__":
    run()
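Note that with skip_special_tokens=True the printed solutions omit the COAT meta-action tokens. To inspect where the model chooses to continue, reflect, or explore, you can set skip_special_tokens=False in SamplingParams, in which case vLLM keeps the special tokens in the decoded output.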
Benchmarking Performance
Satori-7B-Round2 is evaluated on both in-domain reasoning benchmarks (math reasoning) and out-of-domain benchmarks (general reasoning tasks). All results are reported as the zero-shot pass@1 accuracy with greedy sampling.
Evaluation Tasks
- Mathematics Reasoning Benchmarks: GSM8K, MATH500, AMC2023, AIME2024, and OlympiadBench. Except for GSM8K, all other datasets feature competition-level problems.
- General Domain Reasoning Benchmarks:
- Logical Reasoning: FOLIO, BoardgameQA (BGQA).
- Code Reasoning: CRUXEval.
- Commonsense Reasoning: StrategyQA (STGQA).
- Tabular Reasoning: TableBench.
- Domain-specific Reasoning: MMLU-Pro STEM subsets (STEM), including physics, chemistry, computer science, engineering, biology, and economics.
Math Reasoning Benchmarks
Satori-7B-Round2 achieves SOTA performance among small-scale models and outperforms Qwen-2.5-Math-7B-Instruct, which uses the same base model (Qwen-2.5-Math-7B).
| Scale | Model | GSM8K | MATH500 | OlymBench | AMC2023 | AIME2024 | Avg. |
|---|---|---|---|---|---|---|---|
| Large | Llama-3.1-70B-Instruct | 94.1 | 68.0 | 29.4 | 42.5 | 13.3 | 49.5 |
| Large | OpenMath2-Llama3.1-70B | 94.1 | 71.8 | 30.1 | 45.0 | 13.3 | 50.9 |
| Large | QwQ-32B-Preview | 95.5 | 90.6 | 61.2 | 77.5 | 50.0 | 75.0 |
| Small | Llama-3.1-8B-Instruct | 84.4 | 51.9 | 15.1 | 22.5 | 3.3 | 35.4 |
| Small | OpenMath2-Llama3.1-8B | 90.5 | 67.8 | 28.9 | 37.5 | 6.7 | 46.3 |
| Small | NuminaMath-7B-CoT | 78.9 | 54.6 | 15.9 | 20.0 | 10.0 | 35.9 |
| Small | Qwen-2.5-7B-Instruct | 91.6 | 75.5 | 35.5 | 52.5 | 6.7 | 52.4 |
| Small | Qwen-2.5-Math-7B-Instruct | 95.2 | 83.6 | 41.6 | 62.5 | 16.7 | 59.9 |
| Small | Satori-7B-Round2 | 93.9 | 83.6 | 48.5 | 72.5 | 23.3 | 64.4 |
General Domain Reasoning Benchmarks
Trained only on math datasets, Satori-7B-Round2 exhibits strong transferability across diverse out-of-domain reasoning benchmarks and outperforms Qwen-2.5-Math-7B-Instruct by a large margin. Moreover, despite not being trained on other domains, Satori-7B-Round2 achieves performance comparable to or exceeding that of other small-scale general instruct models.
| Scale | Model | FOLIO | BGQA | CRUXEval | StrategyQA | TableBench | STEM | Avg. |
|---|---|---|---|---|---|---|---|---|
| Large | Llama-3.1-70B-Instruct | 65.0 | 58.3 | 59.6 | 88.8 | 34.2 | 61.7 | 61.3 |
| Large | OpenMath2-Llama3.1-70B | 68.5 | 68.7 | 35.1 | 95.6 | 46.8 | 15.1 | 55.0 |
| Large | QwQ-32B-Preview | 84.2 | 71.1 | 65.2 | 88.2 | 51.5 | 71.3 | 71.9 |
| Small | Llama-3.1-8B-Instruct | 63.5 | 50.3 | 38.5 | 92.2 | 32.4 | 43.4 | 53.4 |
| Small | OpenMath2-Llama3.1-8B | 57.1 | 49.0 | 11.1 | 84.4 | 34.2 | 10.9 | 41.1 |
| Small | NuminaMath-7B-CoT | 53.2 | 44.6 | 28.0 | 77.8 | 29.1 | 11.3 | 40.7 |
| Small | Qwen-2.5-7B-Instruct | 72.4 | 53.0 | 58.1 | 91.3 | 43.2 | 57.1 | 62.5 |
| Small | Qwen-2.5-Math-7B-Instruct | 68.9 | 51.3 | 28.0 | 85.3 | 36.2 | 45.2 | 52.5 |
| Small | Satori-7B-Round2 | 72.9 | 58.5 | 41.1 | 90.4 | 44.6 | 57.4 | 60.8 |
Resources
We provide our training datasets:
- Full format tuning dataset with 300K unique questions.
- RL dataset with 550K unique questions.
Please refer to our blog and research paper for more technical details of Satori.
Citation
If you find our model and data helpful, please cite our paper:
@misc{shen2025satorireinforcementlearningchainofactionthought,
      title={Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search},
      author={Maohao Shen and Guangtao Zeng and Zhenting Qi and Zhang-Wei Hong and Zhenfang Chen and Wei Lu and Gregory Wornell and Subhro Das and David Cox and Chuang Gan},
      year={2025},
      eprint={2502.02508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02508},
}