mkurman posted an update 7 days ago

Blurred-Thoughts Supervised Fine-Tuning (BT-SFT) 🤖

Can we teach a model to think completely on its own without reinforcement learning? Actually, yes.

We can do straightforward supervised fine-tuning with a relatively simple trick: blurring part of the chain-of-thought (CoT). But why is this effective?

We observed that models differ in their thinking processes, and fine-tuning one model on another model's CoT can be inefficient: it often results in the model simply memorizing the reasoning rather than learning how to actually think.

I discovered that this process can still be efficient if we clearly mark where the model should start and stop thinking, reveal only part of the CoT along with the expected answer, and blur (mask) the remaining CoT tokens. This approach lets the model learn from only a portion of the thought process while still arriving at the expected answer.
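
As a rough sketch, the blurring step might look like this during preprocessing (assuming the thinking span is delimited by known start/stop marker positions; `blur_thoughts`, `think_start`, `think_end`, and `blur_ratio` are illustrative names, not the actual training code):

```python
import random

IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def blur_thoughts(labels, think_start, think_end, blur_ratio=0.5):
    """Blur a random fraction of the CoT tokens: their labels become
    IGNORE_INDEX, so the model is supervised only on the visible part
    of the thought process and on the expected answer."""
    cot_positions = list(range(think_start + 1, think_end))  # tokens inside the thinking span
    n_blur = int(len(cot_positions) * blur_ratio)
    for pos in random.sample(cot_positions, n_blur):
        labels[pos] = IGNORE_INDEX
    return labels
```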

To see this in action, check out my experimental BT-SFT of the meditsolutions/Llama-3.2-SUN-2.5B-chat model, fine-tuned on 151 million tokens from the Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B dataset.

Enjoy! 🚀

P.S. If you were curious enough to read this, leave me a comment. It's always nice to chat with open-minded and intelligent people.

Great idea, brother! Now even I want to implement this; however, I am not sure how to calculate the loss for the blurred (masked) tokens. Did you use a reward model, or KL divergence between the predicted token (for which the ground truth has been masked) and the neighboring tokens?

mkurman replied:

For blurred thoughts, you just set the labels to ignore_index (-100); that should be enough! For BT-SFT, I used standard cross-entropy loss, but a critique/reward model can also be a good idea! Check this paper: https://arxiv.org/abs/2501.17703
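
To illustrate the masking, here is a minimal PyTorch sketch (toy shapes and values, not the actual training code) showing how labels set to -100 drop out of the cross-entropy loss:

```python
import torch
import torch.nn.functional as F

# Toy example: a 6-token sequence in which the blurred CoT positions are -100.
logits = torch.randn(1, 6, 32000)                    # (batch, seq_len, vocab_size)
labels = torch.tensor([[15, -100, -100, 42, 7, 2]])  # blurred tokens -> -100

# ignore_index=-100 is the default in PyTorch (and in Hugging Face trainers),
# so the blurred positions contribute nothing to the loss or gradients.
loss = F.cross_entropy(
    logits.view(-1, logits.size(-1)),
    labels.view(-1),
    ignore_index=-100,
)
print(loss.item())
```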