Model Card for Qwen2.5-0.5B-Instruct (Fine-Tuned on OpenR1-Math-220k; 2% complete, 50% run underway as of Feb 13)

Model Details

Model Name: Qwen2.5-0.5B-Instruct (GRPO Fine-Tuned)
Model ID: Qwen2.5-0.5B-R1subset
License: [Apache 2.0 / or whichever applies]
Finetuned From: Qwen/Qwen2.5-0.5B-Instruct
Language(s): English (mathematical text)

Developed By: Christian H. Cooper
Funding: Self-sponsored
Shared By: Christian H. Cooper

Model Description

This model is Qwen2.5-0.5B-Instruct fine-tuned on a 2% subset of the OpenR1-Math-220k dataset using Group Relative Policy Optimization (GRPO) from the trl library, guiding the model toward producing well-formatted chain-of-thought answers in:

<reasoning>
  ...
</reasoning>
<answer>
  ...
</answer>

The model focuses on math reasoning tasks, learning to generate a step-by-step solution inside <reasoning> and a final numeric or textual answer inside <answer>. Training incorporates reward functions that encourage correct chain-of-thought structure, numeric answers, and correctness.
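
For illustration, the final answer can be recovered from a completion in this format with a simple regex. The extract_answer helper below is a hypothetical sketch, not part of the released training code.

import re

def extract_answer(completion):
    # Return the text between <answer> and </answer>, or None if the tags are missing.
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    return match.group(1) if match else None

example = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>"
print(extract_answer(example))  # -> 4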

Model Sources

  • GitHub or Repo: [Pending]
  • Paper/Demo: [Pending]

Uses

Direct Use

  • Math Problem Solving: The model tries to reason through math word problems, providing step-by-step reasoning and a final answer.

Downstream Use

  • Educational Tools: Potentially used in tutoring or step-by-step solution generation.
  • Math Chatbots: A math helper that can respond in a structured <reasoning>/<answer> format.

Out-of-Scope Use

  • High-Stakes Decisions: Model is not guaranteed to be correct for advanced or critical math scenarios (finance, medical, engineering safety).
  • Non-English: Primary training data is English math text, so reliability in other languages is minimal.

Bias, Risks, and Limitations

  • Bias: Although this is a math-focused dataset, any language model can exhibit unintended biases.
  • Risks: The model may produce mathematically incorrect or incomplete solutions. The partial coverage (2% of the dataset) further limits accuracy.
  • Limitations:
    • Only partially fine-tuned on 2% of the data, so correctness is not guaranteed.
    • The chain-of-thought is for interpretability but may still contain flawed reasoning or leaps.

How to Get Started

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "HarleyCooper/Qwen.5B-OpenR1Math"  # The repo name stays the same across all training-percentage iterations.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

# The model was trained to emit <reasoning>/<answer> blocks, so the prompt states the
# expected format and the question, then lets the model generate the structured completion.
# (Illustrative prompt; the exact system instructions used in training may differ.)
prompt = """Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

Question: It is known that in a convex $n$-gon ($n>3$) no three diagonals pass through the same point.
Find the number of points (distinct from the vertices) of intersection of pairs of diagonals.
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

  • Dataset: A 2% subsample (~4.4k problems) of OpenR1-Math-220k.
  • Data Format: Each sample has problem, solution, and answer fields, which are transformed into the following (see the sketch after this list):
    • "prompt": A single string containing system instructions + the problem text.
    • "answer": A string with <reasoning> + <answer> blocks.

Training Procedure

  • Framework: TRL (v0.14+, which introduced GRPOTrainer) with Group Relative Policy Optimization (GRPO); see the configuration sketch after the hyperparameter list below.
  • Objective: Reinforcement learning on chain-of-thought format, numeric correctness, and final-answer consistency.
  • Reward Functions (two of these are sketched in code after this list):
    1. xmlcount_reward_func: Encourages <reasoning>/<answer> structure.
    2. soft_format_reward_func: Checks for <reasoning>.*</reasoning><answer>.*</answer> in any multiline arrangement.
    3. strict_format_reward_func: Strict multiline regex for exact formatting.
    4. int_reward_func: Partial reward if the final <answer> is purely numeric.
    5. correctness_reward_func: Binary reward if the final extracted answer exactly matches the known correct answer.
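
Simplified sketches of two of these reward functions are given below. They assume trl's convention of reward functions that receive plain-string completions plus dataset columns as keyword arguments and return one float per completion; the reward magnitudes are illustrative, and the reference answer is assumed to be stored as the formatted block described under Training Data.

import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def _extract(text):
    # Pull the text inside <answer>...</answer>, or "" if the tags are absent.
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else ""

def int_reward_func(completions, **kwargs):
    # Partial reward if the extracted answer is purely numeric.
    return [0.5 if _extract(c).lstrip("-").isdigit() else 0.0 for c in completions]

def correctness_reward_func(completions, answer, **kwargs):
    # Binary reward if the extracted answer exactly matches the reference answer.
    # "answer" is the dataset column (formatted <reasoning>/<answer> block),
    # so the reference is extracted the same way as the completion.
    return [2.0 if _extract(c) == _extract(ref) else 0.0 for c, ref in zip(completions, answer)]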

Training Hyperparameters

  • Base Model: Qwen2.5-0.5B-Instruct
  • Learning Rate: ~5e-6
  • Batch Size: 1–2 (due to GPU constraints)
  • Optimizer: AdamW (β1=0.9, β2=0.99)
  • Scheduler: Cosine with warmup_ratio=0.1
  • Num Generations: 16 (GRPO config)
  • Number of Training Epochs: 1 epoch on 2% data
  • Hardware: Single A100 40GB on Colab
  • Max Prompt Length: 256 tokens
  • Max Completion Length: 200 tokens
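
Assuming training used trl's GRPOTrainer, the hyperparameters above roughly correspond to a configuration like the following sketch. It reuses the preprocessing and reward-function sketches from the sections above; argument names follow recent trl releases, and the dataset config/split choice is an assumption (the card reports a ~4.4k-problem, 2% subsample).

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train[:2%]")
train_dataset = dataset.map(to_training_example)  # helper sketched under Training Data

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-openr1-grpo",
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    per_device_train_batch_size=16,  # recent trl requires this to be divisible by num_generations
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=training_args,
    # Only two of the five reward functions are sketched in this card.
    reward_funcs=[int_reward_func, correctness_reward_func],
    train_dataset=train_dataset,
)
trainer.train()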

Speeds, Sizes, Times

  • Approx. Steps: ~200–300 steps for the 2% subset
  • Run Time: ~1–2 hours on a Colab A100

Evaluation

Testing Data

  • Currently trained and tested on the same 2% subset; the next step is to evaluate on a withheld portion or the full set to measure true correctness.

Metrics

  • Format Rewards: xmlcount, soft_format, strict_format
  • Correctness: Exact match of the final numeric/string answer (see the scoring sketch after this list)
  • Partial Numeric: int_reward_func
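
A minimal sketch of how format compliance and exact-match correctness could be scored on a held-out split; the names here are illustrative and are not the evaluation code behind the numbers below.

import re

FORMAT_RE = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def extract(text):
    # Pull the text inside <answer>...</answer>, or "" if the tags are absent.
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else ""

def score(completions, references):
    # Fraction of completions with a well-formed <reasoning>/<answer> block,
    # and fraction whose extracted answer exactly matches the reference answer.
    fmt = sum(bool(FORMAT_RE.search(c)) for c in completions) / len(completions)
    exact = sum(extract(c) == r.strip() for c, r in zip(completions, references)) / len(completions)
    return {"format_compliance": fmt, "exact_match": exact}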

Results

  • The model shows a strong improvement in output format (70–80% format compliance) but relatively low exact numeric correctness. Additional epochs or a larger training fraction are needed for better correctness.

Environmental Impact

  • Hardware: Single A100 40GB GPU in a Colab environment
  • Train Time: ~1–2 hours on 2% data
  • Carbon Footprint: Not measured exactly, but minimal compared to large-scale runs

Model Architecture & Objective

  • Architecture: Transformer-based causal language model (Qwen2.5-0.5B, ~494M parameters, F32 Safetensors weights)
  • Objective: RL-based chain-of-thought generation for math reasoning

Citation

@misc{cooperQwen2.5-0.5B,
  title={Qwen2.5-0.5B Fine-Tuned on OpenR1 (2% subset)},
  author={Christian H. Cooper},
  howpublished={\url{https://huggingface.co/Christian-cooper-us/Qwen2.5-0.5B-R1subset}},
  year={2025},
}

Contact


Disclaimer: This model is experimental, trained on only 2% of the dataset. It may produce inaccurate math solutions and is not suitable for high-stakes or time-sensitive deployments.
