---
license: apache-2.0
language:
  - en
---

Original model from https://huggingface.co/openlm-research/open_llama_3b_600bt_preview.

This repo includes:

1. Ported `LlamaTokenizer` to `LlamaTokenizerFast` via a few lines of code, shown below. Loading via `AutoTokenizer` used to take 3 to 4 minutes; now it takes a few seconds!

   ```python
   from transformers import LlamaTokenizerFast
   from tokenizers import AddedToken

   # Convert the slow tokenizer to the fast (Rust-backed) implementation,
   # explicitly registering the special tokens.
   tokenizer = LlamaTokenizerFast.from_pretrained(
       "openlm-research/open_llama_3b_600bt_preview",
       add_bos_token = True, add_eos_token = True,
       bos_token = AddedToken("<s>",   single_word = True),
       eos_token = AddedToken("</s>",  single_word = True),
       unk_token = AddedToken("<unk>", single_word = True),
       pad_token = AddedToken("<unk>", single_word = True),
   )

   # Upload the converted tokenizer (including tokenizer.json) to the Hub.
   tokenizer.push_to_hub("open_llama_3b_600bt_preview")
   ```

2. `AutoTokenizer` previously did not recognize the BOS, EOS and UNK tokens: every tokenization weirdly prepended and appended token id 0 (`<unk>`), when it should prepend 1 (`<s>`) and append 2 (`</s>`). See the sanity check after this list.
3. Manually added the BOS `<s>`, EOS `</s>` and UNK `<unk>` tokens, with PAD (padding) also set to the `<unk>` token.
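
A minimal sanity check of the ported tokenizer, assuming it has been pushed to this repo (the repo id below is an assumption; substitute this repo's actual id):

```python
from transformers import AutoTokenizer

# Load the ported fast tokenizer. With tokenizer.json in the repo,
# AutoTokenizer resolves to LlamaTokenizerFast and loads in seconds.
# NOTE: the repo id below is an assumption -- use this repo's actual id.
tokenizer = AutoTokenizer.from_pretrained("danielhanchen/open_llama_3b_600bt_preview")

ids = tokenizer("Hello world").input_ids

# With the special tokens registered correctly, every encoding starts
# with BOS (id 1) and ends with EOS (id 2) instead of <unk> (id 0).
assert ids[0] == tokenizer.bos_token_id == 1
assert ids[-1] == tokenizer.eos_token_id == 2

# PAD maps to <unk>, so batch padding works without further setup.
batch = tokenizer(["short", "a somewhat longer sentence"], padding=True)
print(batch.input_ids)
```

Reusing `<unk>` as the PAD token avoids adding a new token (and resizing the model's embeddings), since the original model ships without a dedicated padding token.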