This model is trained with reinforcement learning on the GSM8K dataset, learning to generate reasoning chains and formatted outputs even though the training data provides no intermediate reasoning supervision. A reward function guides the model, prioritizing answer correctness and adherence to an XML output format.
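As a sketch of what such a reward function might look like, the snippet below scores completions on answer correctness and on following a `<reasoning>`/`<answer>` XML layout. The tag names, reward weights, and TRL-style GRPO reward-function signatures are assumptions for illustration, not the exact functions used to train this model.

```python
import re

# Hedged sketch of GRPO-style reward functions (assumed tag names and weights;
# the exact functions used for this model are not reproduced here).
def correctness_reward(prompts, completions, answer, **kwargs):
    """Large reward when the <answer> tag matches the GSM8K final answer."""
    rewards = []
    for completion, gold in zip(completions, answer):
        gold_final = gold.split("####")[-1].strip()  # GSM8K puts the final answer after '####'
        match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(2.0 if predicted == gold_final else 0.0)
    return rewards

def format_reward(prompts, completions, **kwargs):
    """Smaller reward for following the <reasoning>...</reasoning><answer>...</answer> layout."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
```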
Training Details:
- Dataset: GSM8K
- Algorithm: GRPO
- Hardware: Single NVIDIA GeForce RTX 3090 Ti
- Training Duration: 250 epochs, ~48 minutes
The output length limit (200 tokens) restricts the model's ability to generate complex reasoning chains, making it difficult to observe output length growing during training.
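For context, a minimal GRPO training setup with TRL might look like the sketch below. Apart from the GSM8K dataset, the GRPO algorithm, and the 200-token completion limit, every hyperparameter and name is an assumed placeholder rather than the configuration actually used.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Minimal sketch, assuming TRL's GRPOTrainer; values other than the
# 200-token completion limit are illustrative.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})  # GRPOTrainer expects a "prompt" column

config = GRPOConfig(
    output_dir="qwen2.5-7b-gsm8k-grpo",
    max_completion_length=200,      # the output length limit discussed above
    num_generations=8,              # completions sampled per prompt (assumption)
    per_device_train_batch_size=1,
    learning_rate=1e-5,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[correctness_reward, format_reward],  # reward sketches from above
    args=config,
    train_dataset=dataset,
)
trainer.train()
```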
Example:
Which one is bigger? 9.11 or 9.8?
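A hedged usage sketch for trying this prompt, assuming the adapter loads cleanly onto Qwen/Qwen2.5-7B-Instruct via PEFT; generation settings are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base instruct model and attach the reasoning adapter.
base_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, "Nagi-ovo/Qwen2.5-7B-Reasoning-Adapter")

# Ask the example question and decode only the newly generated tokens.
messages = [{"role": "user", "content": "Which one is bigger? 9.11 or 9.8?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```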
This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
Model tree for Nagi-ovo/Qwen2.5-7B-Reasoning-Adapter:
- Base model: Qwen/Qwen2.5-7B
- Finetuned: Qwen/Qwen2.5-7B-Instruct
- Quantized: unsloth/Qwen2.5-7B-Instruct-bnb-4bit