non-reasoning data

#132
by cmgzy - opened

In the second RL stage of R1 (Sec. 2.3.4), the paper says "we collected a total of approximately 200k training samples that are unrelated to reasoning". Part of this data is generated with a "potential chain-of-thought before answering the question by prompting". Does this mean that none of the 200k samples contain "< think >... < /think >"?

We observe that, without a system message and proper prompting, the model sometimes responds without reasoning (it outputs < think >\n\n< /think >). Is this related to those 200k non-reasoning samples?
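
For anyone who wants to measure how often this happens, here is a minimal sketch of one way to flag such responses. It assumes the raw model output contains the tags literally as `<think>` and `</think>` (without the spaces used above), and the helper names `has_empty_reasoning` and `empty_reasoning_rate` are hypothetical, not part of any released tooling.

```python
import re

# Assumption: tags appear literally as <think> and </think> in the raw output.
# Matches a think block whose body is empty or contains only whitespace.
EMPTY_THINK = re.compile(r"<think>\s*</think>")


def has_empty_reasoning(response: str) -> bool:
    """Return True if the response contains an empty <think></think> block."""
    return EMPTY_THINK.search(response) is not None


def empty_reasoning_rate(responses: list[str]) -> float:
    """Fraction of responses that contain an empty think block."""
    if not responses:
        return 0.0
    flagged = sum(has_empty_reasoning(r) for r in responses)
    return flagged / len(responses)


if __name__ == "__main__":
    samples = [
        "<think>\n\n</think>\nThe answer is 4.",
        "<think>2 + 2 equals 4.</think>\nThe answer is 4.",
    ]
    print(empty_reasoning_rate(samples))
```

Running this over outputs sampled with and without a system message would show whether the empty-reasoning behavior depends on the prompt.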
