non-reasoning data

#132

by cmgzy - opened 3 days ago

Discussion


In the second RL stage of R1 (Sec. 2.3.4), "we collected a total of approximately 200k training samples that are unrelated to reasoning". Part of this data was generated by prompting the model to produce a "potential chain-of-thought before answering the question". Does this mean that none of the 200k samples contain a <think>...</think> block?

We observe that, without a system message or proper prompting, the model sometimes responds without any reasoning (it outputs <think>\n\n</think>). Is this related to those 200k non-reasoning samples?
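To put a number on how often this happens, here is a minimal sketch. It assumes responses are collected as raw strings that begin with the <think> tag. The helper names are hypothetical, and the check only flags a leading think block that contains nothing but whitespace.

```python
import re

# Matches a <think> block that contains only whitespace, e.g. "<think>\n\n</think>".
EMPTY_THINK = re.compile(r"<think>\s*</think>")


def has_empty_reasoning(response: str) -> bool:
    """Return True if the response starts with an empty <think></think> block (hypothetical helper)."""
    return EMPTY_THINK.match(response.lstrip()) is not None


def empty_reasoning_rate(responses: list[str]) -> float:
    """Fraction of responses that skip reasoning; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(has_empty_reasoning(r) for r in responses) / len(responses)


if __name__ == "__main__":
    samples = [
        "<think>\n\n</think>\n\nHello! How can I help you today?",
        "<think>\nThe user is greeting me.\n</think>\n\nHi there!",
    ]
    print(empty_reasoning_rate(samples))
```

Running this rate over the same prompts, with and without a system message, would show whether the empty-reasoning behavior depends on the prompt format.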
