The input starts with the token "<|begin▁of▁sentence|>" repeated twice
I have two questions:
1. In the configuration file "tokenizer_config.json" for the Qwen series model, "tokenizer_class" is set to "LlamaTokenizerFast". I'm not sure why this is the case, but in my tests the results are identical to those of QwenTokenizer.
2. In "tokenizer_config.json", "add_bos_token" is set to true, so the tokenizer automatically prepends the bos_token, which is "<|begin▁of▁sentence|>". However, tokenizer.apply_chat_template also inserts "<|begin▁of▁sentence|>" at the start of the rendered text. Tokenizing that text therefore produces an input that starts with two consecutive "<|begin▁of▁sentence|>" tokens.
Here is the reproduction code:
# Assumes `tokenizer` and `model` have already been loaded from the checkpoint.
prompt = "计算1+1"  # "Calculate 1+1"
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
# Render the chat template to a string; the template itself writes a BOS token.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Tokenize the rendered string; with add_bos_token=true this prepends a second BOS.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
input_ids = model_inputs["input_ids"]
tokenizer.decode(input_ids[0])
# output:
# <|begin▁of▁sentence|><|begin▁of▁sentence|>You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|User|>计算1+1<|Assistant|>
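For anyone who wants to check their own inputs, here is a minimal sketch for detecting and collapsing the duplicate. It assumes the second BOS comes only from the tokenizer call adding special tokens on top of the one the chat template already wrote. The helper names `count_leading_bos` and `dedupe_leading_bos` are hypothetical and not part of transformers. It does not answer which form the model was trained with.

```python
from typing import List


def count_leading_bos(input_ids: List[int], bos_token_id: int) -> int:
    """Return how many consecutive BOS ids appear at the start of the sequence."""
    count = 0
    for token_id in input_ids:
        if token_id != bos_token_id:
            break
        count += 1
    return count


def dedupe_leading_bos(input_ids: List[int], bos_token_id: int) -> List[int]:
    """Collapse a run of leading BOS ids into a single BOS (hypothetical helper)."""
    n = count_leading_bos(input_ids, bos_token_id)
    if n <= 1:
        # Zero or one BOS: nothing to collapse, return a copy unchanged.
        return list(input_ids)
    return [bos_token_id] + list(input_ids[n:])


# Alternative that avoids the duplicate at tokenization time (not run here):
# `add_special_tokens` is an existing argument of the tokenizer call.
# model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt")
```

With the reproduction above, `count_leading_bos(input_ids[0].tolist(), tokenizer.bos_token_id)` would report how many BOS tokens the final input starts with.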
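Is this behavior expected, i.e. were the training inputs also prefixed with two "<|begin▁of▁sentence|>" tokens? If training used the same format, then no change should be needed.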