---
library_name: transformers
tags:
- trl
- grpo
- qwen
- gsm8k
---

# Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner

This model is a fine-tuned version of the Qwen 0.5B model (based on [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)) trained with GRPO (Group Relative Policy Optimization). It was fine-tuned on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems, following a structured output format with explicit reasoning and answer sections.

## Model Details

### Model Description

Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both an intermediate reasoning trace and a final answer. Key details:

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Fine-Tuning Method:** GRPO (reinforcement learning with custom reward functions)
- **Dataset:** GSM8K – a collection of challenging grade-school math problems
- **Generation Engine:** vLLM, for faster inference on a single-GPU setup
- **Precision:** BF16 training for efficiency on Colab GPUs
- **Developed by:** Davut Emre Taşar
- **License:** Please refer to the license of the base model on its Hugging Face Hub page

### Model Sources

- **Repository (this model):** [https://huggingface.co/emre/Qwen-0.5B-GRPO](https://huggingface.co/emre/Qwen-0.5B-GRPO)
- **Base Model Repository:** [https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Dataset:** [https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

## Uses

### Intended Use

This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well suited for:

- Generating structured explanations for math problems.
- Serving as a lightweight assistant in educational applications focused on math reasoning.

### Out-of-Scope Use

- **High-Stakes Decision Making:** This model is not designed for critical decision making.
- **Non-Math Domains:** Performance is tailored to math problems and may be limited on other domains.
- **Over-Reliance on Automated Reasoning:** The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.

## Bias, Risks, and Limitations

- **Model Size:** With only 0.5B parameters, it may not perform as robustly as larger models.
- **Training Duration:** Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
- **Reward Function Limitations:** The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning; an illustrative sketch follows the Recommendations below.
- **Generalization:** The structured reasoning/answer format is enforced during training and may require adaptation for other use cases.

### Recommendations

Users should:

- Validate model outputs on a case-by-case basis.
- Consider further fine-tuning for domain-specific applications.
- Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.
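To make the heuristics mentioned above concrete, here is a minimal sketch of what format- and correctness-based GRPO reward functions could look like. This is an illustrative reconstruction, not the exact training code: the function names, the `<reasoning>`/`<answer>` tag choice, and the reward values are assumptions.

```python
import re

# Illustrative sketch only: assumed reward shapes, not the exact functions used
# to train this model. Each function scores a batch of completions and returns
# one reward per completion, the general shape used by reward functions in
# TRL's GRPOTrainer.

def format_reward(completions, **kwargs):
    """Reward completions that follow the structured reasoning/answer format."""
    # The <reasoning>/<answer> tag names are an assumption for illustration.
    pattern = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
    return [0.5 if pattern.search(c) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    """Reward completions whose extracted answer exactly matches the reference."""
    rewards = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(2.0 if predicted == str(reference).strip() else 0.0)
    return rewards

# Quick check of the two rewards on a toy completion.
completions = ["<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>"]
print(format_reward(completions), correctness_reward(completions, answer=["4"]))  # [0.5] [2.0]
```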
## How to Get Started with the Model

Below is an example code snippet to load and use the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "emre/Qwen-0.5B-GRPO"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# Example prompt asking for structured, step-by-step reasoning.
# (The question below is illustrative; replace it with your own math problem.)
prompt = """A box holds 12 pencils. How many pencils are in 7 boxes?

Step-by-step reasoning:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
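
Because the base model is an instruction-tuned chat model, it may also help to format the question with the tokenizer's chat template. A minimal sketch, continuing from the snippet above and assuming the fine-tuned checkpoint retains the Qwen2.5 chat template (the system prompt and question are illustrative):

```python
# Alternative: wrap the question with the chat template inherited from the instruct base.
# (Assumes the fine-tuned checkpoint still ships Qwen2.5's chat template.)
messages = [
    {"role": "system", "content": "Solve the math problem with step-by-step reasoning, then state the final answer."},
    {"role": "user", "content": "A box holds 12 pencils. How many pencils are in 7 boxes?"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
chat_outputs = model.generate(chat_inputs, max_new_tokens=300)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))
```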