non-reasoning data
#132, opened by cmgzy
In the 2nd stage of RL for R1 (Sec. 2.3.4), "we collected a total of approximately 200k training samples that are unrelated to reasoning". Part of this data was generated with a "potential chain-of-thought before answering the question by prompting". Does this mean that none of the 200k samples contain "< think >... < /think >"?
We observe that, without a system message and proper prompting, the model sometimes responds without reasoning (it outputs < think >\n\n< /think >). Is this related to those 200k non-reasoning samples?
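For anyone who wants to reproduce the observation, here is a minimal sketch of how the empty-reasoning case could be counted over a batch of generations. It assumes the raw output contains the literal tags written without the spaces used above; the helper names `has_empty_reasoning` and `empty_reasoning_rate` are hypothetical and not part of any released tooling.

```python
import re

# Assumption: the think tags appear literally in the decoded output.
# Matches a think block that contains only whitespace (e.g. "\n\n").
EMPTY_THINK = re.compile(r"<think>\s*</think>")


def has_empty_reasoning(response: str) -> bool:
    """Return True if the response contains a whitespace-only think block."""
    return EMPTY_THINK.search(response) is not None


def empty_reasoning_rate(responses: list[str]) -> float:
    """Fraction of responses whose think block is empty (0.0 for no input)."""
    if not responses:
        return 0.0
    return sum(has_empty_reasoning(r) for r in responses) / len(responses)
```

Running `empty_reasoning_rate` on outputs sampled with and without a system message would give a rough measure of how often the model skips reasoning in each setting.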