---
library_name: transformers
tags:
- trl
- grpo
- qwen
- gsm8k
---

# Qwen-0.5B-GRPO: A Fine-Tuned Math Reasoner

This model is a fine-tuned version of the Qwen 0.5B model (based on [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)) trained with GRPO (Group Relative Policy Optimization). It was fine-tuned on the GSM8K math dataset to improve its ability to generate step-by-step reasoning for math problems, following a structured output format with explicit reasoning and answer sections.

## Model Details

### Model Description

Qwen-0.5B-GRPO is designed to serve as a lightweight math reasoning assistant. By fine-tuning with reinforcement learning using GRPO, the model learns to produce responses that include both an intermediate reasoning trace and a final answer. Key details:

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct
- **Fine-Tuning Method:** GRPO (reinforcement learning with custom reward functions)
- **Dataset:** GSM8K – a collection of challenging grade-school math problems
- **Generation Engine:** vLLM, for faster inference on a single-GPU setup
- **Precision:** BF16 training for efficiency on Colab GPUs
- **Developed by:** Davut Emre Taşar
- **License:** Please refer to the license of the base model on its Hugging Face Hub page

### Model Sources

- **Repository (this model):** [https://huggingface.co/emre/Qwen-0.5B-GRPO](https://huggingface.co/emre/Qwen-0.5B-GRPO)
- **Base Model Repository:** [https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Dataset:** [https://huggingface.co/datasets/openai/gsm8k](https://huggingface.co/datasets/openai/gsm8k)

## Uses

### Intended Use

This model is intended for educational and research purposes, particularly to demonstrate and support math problem solving with clear, step-by-step reasoning. It is well suited for:

- Generating structured explanations for math problems.
- Serving as a lightweight assistant in educational applications focused on math reasoning.

### Out-of-Scope Use

- **High-Stakes Decision Making:** This model is not designed for critical decision making.
- **Non-Math Domains:** Performance is tailored to math problems and may be limited on other domains.
- **Over-Reliance on Automated Reasoning:** The reward functions used during fine-tuning (e.g., exact string matching) may not capture all nuances, so human oversight is recommended.

## Bias, Risks, and Limitations

- **Model Size:** With only 0.5B parameters, it may not perform as robustly as larger models.
- **Training Duration:** Fine-tuning was performed for a single epoch; further training might be needed for more challenging tasks.
- **Reward Function Limitations:** The custom reward functions (checking for correct formatting and numerical correctness) are heuristic and may occasionally miss subtleties in reasoning; an illustrative sketch follows the Recommendations below.
- **Generalization:** The structured reasoning/answer format is enforced during training and may require adaptation for other use cases.

### Recommendations

Users should:

- Validate model outputs on a case-by-case basis.
- Consider further fine-tuning for domain-specific applications.
- Use the model as a supplementary tool rather than the sole resource for critical math reasoning tasks.
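To make the heuristics mentioned above concrete, here is a minimal sketch of what format- and correctness-based GRPO reward functions could look like. This is an illustrative reconstruction, not the exact training code: the function names, the `<reasoning>`/`<answer>` tag choice, and the reward values are assumptions.

```python
import re

# Illustrative sketch only: assumed reward shapes, not the exact functions used
# to train this model. Each function scores a batch of completions and returns
# one reward per completion, the general shape used by reward functions in
# TRL's GRPOTrainer.

def format_reward(completions, **kwargs):
    """Reward completions that follow the structured reasoning/answer format."""
    # The <reasoning>/<answer> tag names are an assumption for illustration.
    pattern = re.compile(r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL)
    return [0.5 if pattern.search(c) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    """Reward completions whose extracted answer exactly matches the reference."""
    rewards = []
    for completion, reference in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
        predicted = match.group(1).strip() if match else ""
        rewards.append(2.0 if predicted == str(reference).strip() else 0.0)
    return rewards

# Quick check of the two rewards on a toy completion.
completions = ["<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>"]
print(format_reward(completions), correctness_reward(completions, answer=["4"]))  # [0.5] [2.0]
```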
## How to Get Started with the Model

Below is an example code snippet to load and use the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "emre/Qwen-0.5B-GRPO"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

# Example prompt asking for structured, step-by-step reasoning.
# (The question below is illustrative; replace it with your own math problem.)
prompt = """A box holds 12 pencils. How many pencils are in 7 boxes?

Step-by-step reasoning:
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
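
Because the base model is an instruction-tuned chat model, it may also help to format the question with the tokenizer's chat template. A minimal sketch, continuing from the snippet above and assuming the fine-tuned checkpoint retains the Qwen2.5 chat template (the system prompt and question are illustrative):

```python
# Alternative: wrap the question with the chat template inherited from the instruct base.
# (Assumes the fine-tuned checkpoint still ships Qwen2.5's chat template.)
messages = [
    {"role": "system", "content": "Solve the math problem with step-by-step reasoning, then state the final answer."},
    {"role": "user", "content": "A box holds 12 pencils. How many pencils are in 7 boxes?"},
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
chat_outputs = model.generate(chat_inputs, max_new_tokens=300)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))
```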