|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- llm |
|
- llama |
|
- llama-3 |
|
- grpo |
|
- reinforcement-learning |
|
- mathematical-reasoning |
|
- logic |
|
- deepseek |
|
license: apache-2.0 |
|
datasets: |
|
- openai/gsm8k |
|
model_name: LLaMA-3.2-3B-GRPO-GSM325 |
|
base_model: meta-llama/Llama-3.2-3B
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# **LLaMA-3.2-3B-GRPO-GSM325** |
|
|
|
**LLaMA-3.2-3B-GRPO-GSM325** is a fine-tuned version of **LLaMA 3.2 3B**, trained with **GRPO (Group Relative Policy Optimization)** following **DeepSeek R1's open-source recipe**. The fine-tuning noticeably improves the base **LLaMA-3.2-3B** at **mathematical problem-solving, logical reasoning, and structured response generation**, pushing it toward **OpenAI o1-style advanced reasoning**.
|
|
|
Trained **entirely on a free Google Colab Tesla T4 GPU**: [Training Notebook](https://colab.research.google.com/drive/1o95CT5DV2zZXjScDHxKfRJBaNGv3ULpj?usp=sharing)
|
|
|
**With more resources and extended training, this model could be pushed even further!**
|
|
|
## **Model Details** |
|
- **Base Model**: LLaMA 3.2 3B
|
- **Fine-tuning Method**: GRPO with structured reward signals (a training sketch follows this list)
|
- **Dataset**: 325 curated questions from GSM8K (math reasoning) |
|
- **Format Adherence**: XML-based structured reasoning |
|
- **Notable Improvements**:

  - Higher **mathematical accuracy**

  - Stronger **logical consistency**

  - Consistent **structured XML-format responses**

  - **OpenAI o1-like step-by-step reasoning**
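
The linked notebook contains the actual training run; as a rough illustration only, the sketch below shows how a GRPO run of this kind can be set up with TRL's `GRPOTrainer` on a small GSM8K subset. The prompt template, reward functions, subset selection, and hyperparameters here are assumptions, not the exact configuration used for this model.

```python
# Illustrative sketch only -- reward functions and hyperparameters are assumptions,
# not the exact recipe behind LLaMA-3.2-3B-GRPO-GSM325.
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

SYSTEM_PROMPT = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)

def build_example(row):
    # GSM8K solutions end with "#### <number>"; keep that number as the reference answer.
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row["question"]},
        ],
        "answer": row["answer"].split("####")[-1].strip(),
    }

# A small curated subset (325 examples; the selection criteria are assumed here).
dataset = load_dataset("openai/gsm8k", "main", split="train").map(build_example).select(range(325))

def format_reward(completions, **kwargs):
    # Reward completions that follow the <reasoning>/<answer> structure.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c[0]["content"], re.DOTALL) else 0.0 for c in completions]

def correctness_reward(completions, answer, **kwargs):
    # Reward completions whose <answer> block contains the reference number.
    rewards = []
    for completion, ref in zip(completions, answer):
        match = re.search(r"<answer>(.*?)</answer>", completion[0]["content"], re.DOTALL)
        rewards.append(2.0 if match and ref in match.group(1) else 0.0)
    return rewards

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-3B",
    reward_funcs=[format_reward, correctness_reward],
    args=GRPOConfig(output_dir="llama-3.2-3B-GRPO-GSM325", num_generations=8, max_completion_length=256),
    train_dataset=dataset,
)
trainer.train()
```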
|
|
|
--- |
|
|
|
## **Usage** |
|
|
|
### **Example Input & Output** |
|
|
|
#### **Input (User Query)** |
|
```text

If 2x + 5 = 10, solve for x.
|
``` |
|
|
|
#### **Output (Model Response)** |
|
```xml |
|
<reasoning> |
|
To solve for x, we need to isolate x on one side of the equation. This can be done by subtracting 5 from both sides of the equation. |
|
</reasoning> |
|
<answer> |
|
2x + 5 - 5 = 10 - 5, |
|
2x = 5, |
|
2x / 2 = 5 / 2, |
|
x = 2.5 |
|
</answer> |
|
``` |
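
Since responses follow this fixed `<reasoning>`/`<answer>` structure, the final answer can be pulled out with a simple regular expression. The helper below is a minimal sketch, not part of the model's own tooling:

```python
import re

def extract_answer(response: str) -> str | None:
    """Return the text inside the <answer> block, or None if the tag is missing."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else None

print(extract_answer("<reasoning>\n...\n</reasoning>\n<answer>\nx = 2.5\n</answer>"))  # -> "x = 2.5"
```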
|
|
|
--- |
|
|
|
## **Installation & Inference** |
|
|
|
### **Hugging Face Transformers** |
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
model_id = "Rauhan/llama-3.2-3B-GRPO-GSM325" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
model = AutoModelForCausalLM.from_pretrained(model_id) |
|
``` |
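
A minimal generation sketch (the prompt wording and sampling settings below are illustrative; depending on how the tokenizer was saved, `tokenizer.apply_chat_template` may also be available):

```python
import torch

prompt = "If 2x + 5 = 10, solve for x."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample a structured <reasoning>/<answer> style response.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```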
|
|
|
### **Using vLLM for Fast Inference** |
|
```python |
|
from vllm import LLM, SamplingParams |
|
|
|
llm = LLM(model="Rauhan/llama-3.2-3B-GRPO-GSM325") |
|
sampling_params = SamplingParams(temperature=0.7, max_tokens=256) |
|
|
|
output = llm.generate(["<reasoning>\nA store sells apples...\n</reasoning>"], sampling_params)

print(output[0].outputs[0].text)  # text of the first completion for the first prompt
|
``` |
|
|
|
## **Limitations & Future Work** |
|
|
|
**Limitations**:
|
- Limited by **small dataset size (325 questions)** |
|
- Training done on **a single Free Google Colab Tesla T4 GPU** |
|
- Some **long-form reasoning** may need further fine-tuning
|
|
|
**Future Improvements**:
|
- Training on **a larger dataset** (more GSM8K questions + other logical reasoning datasets) |
|
- Extending fine-tuning using **DeepSeek R1's full training pipeline**
|
- Further **quantization for faster and memory-efficient inference** (see the sketch below)
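
As a rough illustration of the quantization direction, the sketch below loads the model in 4-bit with bitsandbytes through transformers; the specific settings are assumptions, and no pre-quantized checkpoint is published as part of this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Rauhan/llama-3.2-3B-GRPO-GSM325"

# 4-bit NF4 quantization keeps the 3B model comfortably within a T4's memory budget.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```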
|
|
|
--- |
|
|
|
## **License & Citation** |
|
|
|
This model is released under the **Apache 2.0 license**. If you use it in your research, please cite:
|
|
|
``` |
|
@misc{llama-3.2-3B-GRPO-GSM325, |
|
author = {Rauhan}, |
|
title = {LLaMA-3.2-3B-GRPO-GSM325}, |
|
year = {2025}, |
|
publisher = {Hugging Face}, |
|
url = {https://huggingface.co/Rauhan/llama-3.2-3B-GRPO-GSM325} |
|
} |
|
``` |
|
|
|
--- |
|
|
|
**This model demonstrates how even small models can achieve great results with the right fine-tuning techniques!**
|
|
|
--- |
|
|
|
## **About the Author** |
|
|
|
**Portfolio & Contact Information**:

- Website: [rauhanahmed.org](https://rauhanahmed.org)

- GitHub: [github.com/rauhanAhmed](https://github.com/rauhanAhmed)

- LinkedIn: [linkedin.com/in/rauhan-ahmed](https://www.linkedin.com/in/rauhan-ahmed)

- Twitter (X): [x.com/ahmed_rauh46040](https://x.com/ahmed_rauh46040)

- Email: [[email protected]](mailto:[email protected])
|
|
|
Feel free to reach out for collaborations, AI research, or any inquiries!
|
|
|
|