LLaMA-3.2-3B-GRPO-GSM325

🚀 LLaMA-3.2-3B-GRPO-GSM325 is a fine-tuned version of LLaMA-3.2-3B, trained with GRPO (Group Relative Policy Optimization) following DeepSeek R1's open-source recipe. It improves on the base LLaMA-3.2-3B in mathematical problem-solving, logical reasoning, and structured response generation, pushing it towards o1-style step-by-step reasoning.

🔥 Trained entirely on a free Google Colab Tesla T4 GPU: Training Notebook

🚀 With more resources and extended training, this model could be pushed even further!

Model Details

  • Base Model: LLaMA-3.2-3B
  • Fine-tuning Method: GRPO with structured, reward-based reinforcement (a reward-function sketch follows below)
  • Dataset: 325 curated questions from GSM8K (math reasoning)
  • Format Adherence: XML-based structured reasoning (<reasoning> / <answer> tags)
  • Notable Improvements:
    • Mathematical accuracy ✔
    • Logical consistency ✔
    • Structured XML-format responses ✔
    • o1-like step-by-step reasoning ✔
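The exact reward functions live in the training notebook; as a rough illustration of what "structured reinforcement" means here, a typical GRPO setup for this kind of XML format combines a format reward with an answer-correctness reward. The sketch below is an assumption based on common GSM8K GRPO recipes, not code copied from this model's notebook:

import re

# Hypothetical reward functions, sketched after common GSM8K GRPO recipes;
# NOT taken verbatim from this model's training notebook.
XML_PATTERN = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Reward completions that follow the <reasoning>/<answer> XML layout."""
    return 1.0 if XML_PATTERN.match(completion.strip()) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """Reward completions whose <answer> block contains the reference answer."""
    match = re.search(r"<answer>\n(.*?)\n</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 2.0 if gold_answer.strip() in match.group(1) else 0.0

def total_reward(completion: str, gold_answer: str) -> float:
    # GRPO compares these scores across a group of sampled completions
    # to estimate advantages; higher-scoring samples are reinforced.
    return format_reward(completion) + correctness_reward(completion, gold_answer)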

Usage

Example Input & Output

Input (User Query)

If 2x + 5 = 10, solve for x.

Output (Model Response)

<reasoning>
To solve for x, we need to isolate x on one side of the equation. This can be done by subtracting 5 from both sides of the equation.
</reasoning>
<answer>
2x + 5 - 5 = 10 - 5,
2x = 5,
2x / 2 = 5 / 2,
x = 2.5
</answer>

Installation & Inference

Hugging Face Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model_id = "Rauhan/llama-3.2-3B-GRPO-GSM325"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
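
Continuing from the load above, generation works like any other chat-tuned Llama model. The snippet below is a minimal sketch; the system prompt is an assumption written to match the XML format shown earlier, not necessarily the exact prompt used during training:

import torch

# Hypothetical system prompt matching the <reasoning>/<answer> format above
system_prompt = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "If 2x + 5 = 10, solve for x."},
]

# Build the chat prompt with the model's chat template and generate
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)

print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))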

Using vLLM for Fast Inference

from vllm import LLM, SamplingParams

# Load the model with vLLM and configure sampling
llm = LLM(model="Rauhan/llama-3.2-3B-GRPO-GSM325")
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# generate() returns a list of RequestOutput objects; print the generated text
outputs = llm.generate(["<reasoning>\nA store sells apples...\n</reasoning>"], sampling_params)
print(outputs[0].outputs[0].text)

Limitations & Future Work

🚧 Limitations:

  • Limited by the small dataset size (325 questions)
  • Trained on a single free Google Colab Tesla T4 GPU
  • Some long-form reasoning may need further fine-tuning

🚀 Future Improvements:

  • Training on a larger dataset (more GSM8K questions + other logical reasoning datasets)
  • Extending fine-tuning using DeepSeek R1's full training pipeline
  • Further quantization for faster, more memory-efficient inference (a loading sketch follows below)
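
On the last point, the current FP16 checkpoint can already be loaded in 4-bit with bitsandbytes today. The snippet below is only an illustration using the standard Transformers quantization config, not an official quantized release of this model:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit loading via bitsandbytes (requires a CUDA GPU and the
# bitsandbytes package); not an official quantized release of this model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model_id = "Rauhan/llama-3.2-3B-GRPO-GSM325"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)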

License & Citation

This model is released under the Apache 2.0 License. If you use this model in your research, please cite:

@misc{llama-3.2-3B-GRPO-GSM325,
  author = {Rauhan},
  title = {LLaMA-3.2-3B-GRPO-GSM325},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Rauhan/llama-3.2-3B-GRPO-GSM325}
}

🚀 This model demonstrates how even small models can achieve great results with the right fine-tuning techniques! 🚀


About the Author

🔗 Portfolio & Contact Information:

Feel free to reach out for collaborations, AI research, or any inquiries! 🚀
