mkurman/Llama-3.2-MedIT-SUN-2.5B-BT-GRPO
Important Notice:
This model is provided strictly for research purposes and is not intended for production use. It should not be considered a validated source of medical or professional advice. Use only in controlled experimental settings.
Model Overview
mkurman/Llama-3.2-MedIT-SUN-2.5B-BT-GRPO is a fine-tuned variant of meditsolutions/Llama-3.2-SUN-2.5B-chat, adapted specifically for exploring natural language understanding and reasoning. The model was trained with a multi-stage approach combining Blurred Thoughts Supervised Fine-Tuning (BT-SFT) and Group Relative Policy Optimization (GRPO) to enhance its performance on specialized tasks.
Training Procedure
The model was developed through the following sequential steps:
Initial Blurred Thoughts Supervised Fine-Tuning (BT-SFT):
- Base Model: meditsolutions/Llama-3.2-SUN-2.5B-chat
- Parameters: 2600 steps, batch size 2, gradient accumulation 16, learning rate 1e-6
- Dataset: open-thoughts/OpenThoughts-114k
- Details: For further information on BT-SFT, see the detailed post.
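The core idea of Blurred Thoughts SFT can be illustrated with a small sketch. This is an assumption-laden illustration, not the actual training code: it assumes BT-SFT randomly hides ("blurs") a fraction of the reasoning-trace tokens from the loss by setting their labels to the standard PyTorch ignore index, so the model is not forced to reproduce every thought token verbatim. The function name, span convention, and blur probability are all hypothetical.

```python
import random

# Standard "ignore this position" label id used by PyTorch cross-entropy.
IGNORE_INDEX = -100

def blur_thought_labels(labels, thought_span, blur_prob=0.3, seed=None):
    """Return a copy of `labels` where each token inside `thought_span`
    (a half-open (start, end) range covering the reasoning trace) is
    masked from the loss with probability `blur_prob`.

    Hypothetical sketch of the BT-SFT masking step; tokens outside the
    thought span are always supervised as in plain SFT."""
    rng = random.Random(seed)
    start, end = thought_span
    blurred = list(labels)
    for i in range(start, min(end, len(blurred))):
        if rng.random() < blur_prob:
            blurred[i] = IGNORE_INDEX
    return blurred

# Toy example: positions 2..5 play the role of the model's "thoughts".
labels = [11, 12, 13, 14, 15, 16, 17, 18]
blurred = blur_thought_labels(labels, thought_span=(2, 6), blur_prob=0.5, seed=0)
```

Only labels inside the thought span can change; everything else still contributes to the loss, so the final answer remains fully supervised.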
Group Relative Policy Optimization (GRPO) Stage 1:
- Dataset: Jiayi-Pan/Countdown-Tasks-3to4
- Training: 500 steps
Group Relative Policy Optimization (GRPO) Stage 2:
- Dataset: FreedomIntelligence/medical-o1-verifiable-problem
- Training: 50 steps
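GRPO's distinguishing step, comparing each sampled completion to the mean reward of its own group rather than to a learned value baseline, can be sketched in a few lines. This is a minimal illustration of the advantage computation only, not the training loop used for this model; the group size and the binary reward are placeholders.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of per-completion rewards to zero mean and
    unit standard deviation. In GRPO these group-relative advantages
    replace a learned value-function baseline in the policy update."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy group of 4 completions scored by a verifiable reward, e.g. 1.0
# if a Countdown arithmetic answer or a medical-o1 answer checks out.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Completions scoring above the group mean receive positive advantages and are reinforced; those below the mean are pushed down, which is why verifiable-reward datasets such as the two used here pair naturally with GRPO.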
Final BT-SFT Stage:
- Parameters: Same settings as the initial BT-SFT, applied for an additional 400 steps
Datasets Utilized
open-thoughts/OpenThoughts-114k:
A dataset consisting of open-ended reasoning traces that supports diverse conversational contexts during the initial supervised fine-tuning.
Jiayi-Pan/Countdown-Tasks-3to4:
A dataset designed for task-specific learning, aiding the model's adaptation to structured problem-solving.
FreedomIntelligence/medical-o1-verifiable-problem:
A dataset curated to enhance the model's capabilities in addressing verifiable medical problems.
Intended Use
Research and Experimental Applications:
This model is optimized for academic research and exploratory projects. It is well suited to investigating advanced fine-tuning methods and evaluating performance on task-oriented conversational scenarios.
Controlled Environments:
Users should deploy this model only within controlled experimental frameworks where rigorous evaluation and proper safety guardrails are in place.
Limitations and Ethical Considerations
Not for Clinical or Production Use:
The model's outputs have not been validated for clinical accuracy or professional decision-making. It must not be used as a primary source for medical, legal, or safety-critical information.
Safety and Guardrails:
All users must implement appropriate safety measures and validation protocols. The model may produce biased or inaccurate results and should be used with caution.
Experimental Nature:
Given its research-oriented design, the model’s performance can vary widely based on input and context. It is essential to perform thorough testing and validation before drawing any conclusions from its outputs.
License
This model is released under the Llama 3.2 license. Users must adhere to the terms specified in the license when utilizing this model.
Final Notice
All outputs from mkurman/Llama-3.2-MedIT-SUN-2.5B-BT-GRPO are intended solely for research purposes. This model is not a comprehensive knowledge source and should not be used as a substitute for professional advice or decision-making. Ensure that all necessary guardrails and safety protocols are in place when conducting any experiments with this model.