This model is trained with reinforcement learning on the GSM8K dataset, learning to generate reasoning chains and structured outputs despite the dataset lacking such intermediate steps. A reward function guides training, prioritizing answer correctness and adherence to an XML output format.
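
As a rough illustration, a reward of this kind can be written as a small scoring function. The XML tag names (`<reasoning>`, `<answer>`) and the weights below are assumptions for the sketch; the card does not specify the exact schema or scoring.

```python
import re

# Assumed XML schema for illustration only; the card states the reward checks
# answer correctness and XML format adherence, but not the exact tags or weights.
FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL)

def score_completion(completion: str, gold_answer: str) -> float:
    """Toy reward: partial credit for well-formed XML, full credit for a correct answer."""
    score = 0.0
    match = FORMAT_RE.search(completion)
    if match:
        score += 0.5  # format adherence
        if match.group(1).strip() == gold_answer.strip():
            score += 1.0  # answer correctness
    return score
```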

Training Details:

  • Dataset: GSM8K
  • Algorithm: GRPO
  • Hardware: Single NVIDIA GeForce RTX 3090 Ti
  • Training Duration: 250 epochs, ~48 minutes

Limitations:

The output length limit (200 tokens) restricts the model's ability to generate longer, more complex reasoning chains, and makes it hard to observe the growth in output length typically seen during training.

Example:

Which one is bigger? 9.11 or 9.8?
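
To try the example yourself, the adapter can be loaded on top of the base model Qwen/Qwen2.5-7B with transformers and PEFT. This is a minimal sketch; the exact prompt template used during training (e.g. any system prompt requesting the XML format) is not documented here, so the raw question is passed as-is.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")
model = PeftModel.from_pretrained(base, "Nagi-ovo/Qwen2.5-7B-Reasoning-Adapter")

prompt = "Which one is bigger? 9.11 or 9.8?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)  # matches the 200-token training cap
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```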

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
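
A minimal sketch of what such a GRPO run could look like with TRL's GRPOTrainer is shown below. It is illustrative only: the Unsloth integration, LoRA configuration, and the actual reward weighting used for this adapter are not shown in the card, and the batch size and number of generations per prompt are assumptions.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K provides "question" and "answer" columns; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def correctness_reward(completions, answer, **kwargs):
    # Rough correctness check: does the gold final answer (after "####") appear
    # in the completion? A format check like the one sketched earlier could be added.
    rewards = []
    for completion, gold in zip(completions, answer):
        final = gold.split("####")[-1].strip()
        rewards.append(1.0 if final in completion else 0.0)
    return rewards

args = GRPOConfig(
    output_dir="qwen2.5-7b-gsm8k-grpo",
    max_completion_length=200,   # the 200-token cap noted under Limitations
    num_generations=8,           # assumption; not stated in the card
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=correctness_reward,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```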
