---
license: apache-2.0
language:
- en
---

Original model from https://huggingface.co/openlm-research/open_llama_3b_600bt_preview.

This repo includes:

1) Ported `LlamaTokenizer` to `LlamaTokenizerFast` with a few lines of code. Loading the original slow tokenizer via `AutoTokenizer` takes 3 to 4 minutes; the fast tokenizer loads in a few seconds:

```python
from transformers import LlamaTokenizerFast
from tokenizers import AddedToken

# Convert the slow SentencePiece tokenizer to the fast (Rust-backed) version,
# registering the special tokens explicitly.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "openlm-research/open_llama_3b_600bt_preview",
    add_bos_token=True,
    add_eos_token=True,
    bos_token=AddedToken("<s>", single_word=True),
    eos_token=AddedToken("</s>", single_word=True),
    unk_token=AddedToken("<unk>", single_word=True),
    pad_token=AddedToken("<unk>", single_word=True),
)
tokenizer.push_to_hub("open_llama_3b_600bt_preview")
```
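
Once pushed, the fast tokenizer can be loaded straight through `AutoTokenizer`. A minimal sketch, with `your-username` as a hypothetical placeholder for wherever the tokenizer was pushed:

```python
from transformers import AutoTokenizer

# Loads in seconds: AutoTokenizer picks up the serialized fast tokenizer
# (tokenizer.json) instead of re-converting the slow SentencePiece model.
# "your-username" is a placeholder, not a real repo id.
tokenizer = AutoTokenizer.from_pretrained("your-username/open_llama_3b_600bt_preview")
print(type(tokenizer).__name__)  # LlamaTokenizerFast
```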

2) The stock `AutoTokenizer` does not recognize the BOS, EOS, and UNK tokens: every tokenization wrongly prepends and appends token id 0 (`<unk>`), when it should prepend 1 (`<s>`) and append 2 (`</s>`). See the sanity check after this list.

3) Manually added the BOS `<s>`, EOS `</s>`, and UNK `<unk>` tokens, with the PAD (padding) token also set to `<unk>`.
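
A minimal sanity check for points 2) and 3), assuming the tokenizer loaded above and the standard OpenLLaMA vocabulary ids (`<unk>` = 0, `<s>` = 1, `</s>` = 2):

```python
ids = tokenizer("hello world").input_ids

# With add_bos_token/add_eos_token enabled, encodings should start with the
# BOS id (1) and end with the EOS id (2), not with the UNK id (0).
assert ids[0] == tokenizer.bos_token_id == 1
assert ids[-1] == tokenizer.eos_token_id == 2

# PAD deliberately reuses the <unk> token (id 0).
assert tokenizer.pad_token_id == tokenizer.unk_token_id == 0
```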