Model Card for Qwen2.5-0.5B-Instruct (Fine-Tuned on OpenR1-Math-220k; 2% complete, 50% run underway as of Feb 13)

Model Details

Model Name: Qwen2.5-0.5B-Instruct (GRPO Fine-Tuned)
Model ID: Qwen2.5-0.5B-R1subset
License: [Apache 2.0 / or whichever applies]
Finetuned From: Qwen/Qwen2.5-0.5B-Instruct
Language(s): English (mathematical text)

Developed By: Christian H. Cooper
Funding: Self-sponsored
Shared By: Christian H. Cooper

Model Description

This model is Qwen2.5-0.5B-Instruct fine-tuned on a 2% subset of the OpenR1-Math-220k dataset using Group Relative Policy Optimization (GRPO) from the trl library, guiding the model toward producing well-formatted chain-of-thought answers in:

<reasoning>
  ...
</reasoning>
<answer>
  ...
</answer>

The model focuses on math reasoning tasks, learning to generate a step-by-step solution inside <reasoning> and a final numeric or textual answer inside <answer>. Training incorporates reward functions that encourage correct chain-of-thought structure, numeric answers, and correctness.
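
For illustration, the final answer can be recovered from a completion in this format with a simple regex. The extract_answer helper below is a hypothetical sketch, not part of the released training code.

import re

def extract_answer(completion):
    # Return the text between <answer> and </answer>, or None if the tags are missing.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1) if match else None

example = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
print(extract_answer(example))  # -> 4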

Model Sources

  • GitHub or Repo: [Pending]
  • Paper/Demo: [Pending]

Uses

Direct Use

  • Math Problem Solving: The model tries to reason through math word problems, providing step-by-step reasoning and a final answer.

Downstream Use

  • Educational Tools: Potentially used in tutoring or step-by-step solution generation.
  • Math Chatbots: A math helper that can respond in a structured <reasoning>/<answer> format.

Out-of-Scope Use

  • High-Stakes Decisions: Model is not guaranteed to be correct for advanced or critical math scenarios (finance, medical, engineering safety).
  • Non-English: Primary training data is English math text, so reliability in other languages is minimal.

Bias, Risks, and Limitations

  • Bias: Although this is a math-focused dataset, any language model can exhibit unintended biases.
  • Risks: The model may produce mathematically incorrect or incomplete solutions. The partial coverage (2% of the dataset) further limits accuracy.
  • Limitations:
    • Only partially fine-tuned on 2% of the data, so correctness is not guaranteed.
    • The chain-of-thought is for interpretability but may still contain flawed reasoning or leaps.

How to Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HarleyCooper/Qwen.5B-OpenR1Math"  # The repo name stays the same across all training-percentage iterations.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# The model was trained to emit <reasoning>/<answer> blocks, so the prompt states the
# expected format and the question, then lets the model generate the structured completion.
# (Illustrative prompt; the exact system instructions used in training may differ.)
prompt = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point.
Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

  • Dataset: A 2% subsample (~4.4k problems) of OpenR1-Math-220k.
  • Data Format: Each sample has problem, solution, and answer fields, which are transformed into the following (see the sketch after this list):
    • "prompt": A single string containing system instructions + the problem text.
    • "answer": A string with <reasoning> + <answer> blocks.

Training Procedure

  • Framework: TRL (v0.14+, which introduced GRPOTrainer) with Group Relative Policy Optimization (GRPO); see the configuration sketch after the hyperparameter list below.
  • Objective: Reinforcement learning on chain-of-thought format, numeric correctness, and final-answer consistency.
  • Reward Functions (two of these are sketched in code after this list):
    1. xmlcount_reward_func: Encourages <reasoning>/<answer> structure.
    2. soft_format_reward_func: Checks for <reasoning>.*</reasoning><answer>.*</answer> in any multiline arrangement.
    3. strict_format_reward_func: Strict multiline regex for exact formatting.
    4. int_reward_func: Partial reward if the final <answer> is purely numeric.
    5. correctness_reward_func: Binary reward if the final extracted answer exactly matches the known correct answer.
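
Simplified sketches of two of these reward functions are given below. They assume trl's convention of reward functions that receive plain-string completions plus dataset columns as keyword arguments and return one float per completion; the reward magnitudes are illustrative, and the reference answer is assumed to be stored as the formatted block described under Training Data.

import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def _extract(text):
    # Pull the text inside <answer>...</answer>, or "" if the tags are absent.
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else ""

def int_reward_func(completions, **kwargs):
    # Partial reward if the extracted answer is purely numeric.
    return [0.5 if _extract(c).lstrip("-").isdigit() else 0.0 for c in completions]

def correctness_reward_func(completions, answer, **kwargs):
    # Binary reward if the extracted answer exactly matches the reference answer.
    # "answer" is the dataset column (formatted <reasoning>/<answer> block),
    # so the reference is extracted the same way as the completion.
    return [2.0 if _extract(c) == _extract(ref) else 0.0 for c, ref in zip(completions, answer)]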

Training Hyperparameters

  • Base Model: Qwen2.5-0.5B-Instruct
  • Learning Rate: ~5e-6
  • Batch Size: 1–2 (due to GPU constraints)
  • Optimizer: AdamW (β1=0.9, β2=0.99)
  • Scheduler: Cosine with warmup_ratio=0.1
  • Num Generations: 16 (GRPO config)
  • Number of Training Epochs: 1 epoch on 2% data
  • Hardware: Single A100 40GB on Colab
  • Max Prompt Length: 256 tokens
  • Max Completion Length: 200 tokens
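
Assuming training used trl's GRPOTrainer, the hyperparameters above roughly correspond to a configuration like the following sketch. It reuses the preprocessing and reward-function sketches from the sections above; argument names follow recent trl releases, and the dataset config/split choice is an assumption (the card reports a ~4.4k-problem, 2% subsample).

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train[:2%]")
train_dataset = dataset.map(to_training_example)  # helper sketched under Training Data

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-openr1-grpo",
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    per_device_train_batch_size=16,  # recent trl requires this to be divisible by num_generations
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    # Only two of the five reward functions are sketched in this card.
    reward_funcs=[int_reward_func, correctness_reward_func],
    train_dataset=train_dataset,
)
trainer.train()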

Speeds, Sizes, Times

  • Approx. Steps: ~200–300 steps for the 2% subset
  • Run Time: ~1–2 hours on a Colab A100

Evaluation

Testing Data

  • Currently trained and tested on the same 2% subset; the next step is to evaluate on a withheld portion or the full set to measure true correctness.

Metrics

  • Format Rewards: xmlcount, soft_format, strict_format
  • Correctness: Exact match of the final numeric/string answer (see the scoring sketch after this list)
  • Partial Numeric: int_reward_func
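
A minimal sketch of how format compliance and exact-match correctness could be scored on a held-out split; the names here are illustrative and are not the evaluation code behind the numbers below.

import re

FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def extract(text):
    # Pull the text inside <answer>...</answer>, or "" if the tags are absent.
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else ""

def score(completions, references):
    # Fraction of completions with a well-formed <reasoning>/<answer> block,
    # and fraction whose extracted answer exactly matches the reference answer.
    fmt = sum(bool(FORMAT_RE.search(c)) for c in completions) / len(completions)
    exact = sum(extract(c) == r.strip() for c, r in zip(completions, references)) / len(completions)
    return {"format_compliance": fmt, "exact_match": exact}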

Results

  • The model shows a strong improvement in output format (70–80% format compliance) but relatively low exact numeric correctness. Additional epochs or a larger training fraction are needed for better correctness.

Environmental Impact

  • Hardware: Single A100 40GB GPU in a Colab environment
  • Train Time: ~1–2 hours on 2% data
  • Carbon Footprint: Not measured exactly, but minimal compared to large-scale runs

Model Architecture & Objective

  • Architecture: Transformer-based causal language model (Qwen2.5-0.5B, ~494M parameters, F32 Safetensors weights)
  • Objective: RL-based chain-of-thought generation for math reasoning

Citation

@misc{cooperQwen2.5-0.5B,
  title={Qwen2.5-0.5B Fine-Tuned on OpenR1 (2% subset)},
  author={Christian H. Cooper},
  howpublished={\url{https://huggingface.co/Christian-cooper-us/Qwen2.5-0.5B-R1subset}},
  year={2025},
}

Contact


Disclaimer: This model is experimental, trained on only 2% of the dataset. It may produce inaccurate math solutions and is not suitable for high-stakes or time-sensitive deployments.
