Question about the PR [update additional_special_tokens (#8)]
#10
by Qingyun · opened
This PR added additional_special_tokens, which seems to result in a mismatch between the tokenizer length and the vocabulary size in my transformers==4.31.0 version.
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|action_start|>",
"<|action_end|>",
"<|interpreter|>",
"<|plugin|>"
],
ipdb> tokenizer
InternLM2Tokenizer(name_or_path='internlm/internlm2-chat-7b', vocab_size=92544, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '</s>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|action_start|>', '<|action_end|>', '<|interpreter|>', '<|plugin|>']}, clean_up_tokenization_spaces=False)
ipdb> len(tokenizer)
92550
It seems that the additional special tokens are assigned new IDs past the original vocabulary, which no longer match the model's input embeddings. But this PR seems to resolve the bug in 4.33.2 as described in this issue.
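To illustrate the mismatch, here is a minimal pure-Python sketch (the `ToyTokenizer` class is hypothetical, not the real `InternLM2Tokenizer`) of how appending special tokens after the base vocabulary makes `len(tokenizer)` exceed `vocab_size`, so an embedding table sized to `vocab_size` no longer covers every token ID:

```python
class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer with a fixed base vocabulary."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size  # size of the base model vocabulary
        self.added_tokens = {}        # token -> id for added special tokens

    def add_special_tokens(self, tokens):
        # New specials get fresh ids appended after the base vocabulary,
        # mirroring the behavior observed in transformers==4.31.0.
        for tok in tokens:
            if tok not in self.added_tokens:
                self.added_tokens[tok] = self.vocab_size + len(self.added_tokens)

    def __len__(self):
        return self.vocab_size + len(self.added_tokens)


tok = ToyTokenizer(vocab_size=92544)
tok.add_special_tokens([
    "<|im_start|>", "<|im_end|>",
    "<|action_start|>", "<|action_end|>",
    "<|interpreter|>", "<|plugin|>",
])

print(len(tok))                        # 92550: six ids beyond the base vocabulary
print(tok.added_tokens["<|plugin|>"])  # 92549: out of range for a 92544-row embedding table
```

In practice, either upgrading transformers (the referenced issue reports this fixed by 4.33.2) or calling `model.resize_token_embeddings(len(tokenizer))` avoids out-of-range embedding lookups.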
Qingyun changed discussion status to closed