Commit db77853 · 1 Parent(s): 7877031
Update README.md
README.md CHANGED
````diff
@@ -9,7 +9,7 @@ Original model from https://huggingface.co/openlm-research/open_llama_3b_600bt_p
 This repo includes:
 1) Ported `LlamaTokenizer` to `LlamaTokenizerFast` via a few lines of code.
 Loading via `AutoTokenizer` takes 3 to 4 minutes. Now, a few seconds!
-
+```
 from transformers import LlamaTokenizerFast
 from tokenizers import AddedToken
 tokenizer = LlamaTokenizerFast.from_pretrained(
@@ -20,6 +20,7 @@ tokenizer = LlamaTokenizerFast.from_pretrained(
 unk_token = AddedToken("<unk>", single_word = True),
 pad_token = AddedToken("<unk>", single_word = True)
 )
-
+tokenizer.push_to_hub("open_llama_3b_600bt_preview")
+```
 2) `AutoTokenizer` does not recognize the BOS, EOS and UNK tokens. All tokenizations weirdly prepend 0 and append 0 to the end, when actually, you're supposed to prepend 1 and append 2.
 3) Manually added BOS `<s>`, EOS `</s>`, UNK `<unk>` tokens, with PAD (padding) being also the `<unk>` token.
````
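The two hunks elide original lines 16-19, so the checkpoint passed to `from_pretrained(` and its BOS/EOS arguments never appear in this commit. Pieced together, the newly fenced snippet plausibly reads like the sketch below; the repo id and the `bos_token`/`eos_token` lines are assumptions inferred from points 2) and 3), not text from the diff:

```python
from transformers import LlamaTokenizerFast
from tokenizers import AddedToken

# Load the slow sentencepiece checkpoint into the fast (Rust-backed) class;
# this conversion is the slow 3-to-4-minute step the README mentions.
# NOTE: the repo id and the bos/eos kwargs are assumptions: they sit in the
# elided lines 16-19 of the diff and are inferred from points 2) and 3).
tokenizer = LlamaTokenizerFast.from_pretrained(
    "openlm-research/open_llama_3b_600bt_preview",
    bos_token = AddedToken("<s>", single_word = True),
    eos_token = AddedToken("</s>", single_word = True),
    unk_token = AddedToken("<unk>", single_word = True),
    pad_token = AddedToken("<unk>", single_word = True)
)

# Push the converted tokenizer back to the Hub (the line this commit adds).
tokenizer.push_to_hub("open_llama_3b_600bt_preview")
```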
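And a quick sanity check for points 2) and 3): a minimal sketch, assuming the converted tokenizer has already been pushed. The repo id here is a placeholder, not the actual path of this repo:

```python
from transformers import AutoTokenizer

# Placeholder repo id: point this at wherever the converted tokenizer lives.
tok = AutoTokenizer.from_pretrained("your-username/open_llama_3b_600bt_preview")

# A correctly configured Llama tokenizer prepends BOS (id 1), never 0.
ids = tok("hello").input_ids
assert ids[0] == tok.bos_token_id == 1
assert tok.eos_token_id == 2

# Per point 3), PAD reuses the <unk> token.
assert tok.pad_token == tok.unk_token == "<unk>"
```

Note that stock `transformers` Llama tokenizers prepend BOS by default but only append EOS (id 2) when constructed with `add_eos_token = True`, so the appended 2 from point 2) will not show up unless that flag is set.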