# Background
|
|
|
GPT-2 uses byte-level BPE, whereas BERT uses char-level subword tokenization (WordPiece, a close cousin of BPE). The difference is what sequence the merge algorithm runs on:
|
|
|
|
|
- char-level: BPE on the Unicode code-point sequence

- byte-level: BPE on the UTF-8 byte sequence
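To make the distinction concrete, here is a plain-Python look at the two sequences the merge algorithm would see (the string is just an illustrative example):

```python
# Unicode code points vs. UTF-8 bytes for the same string:
s = "héllo"
print([ord(c) for c in s])       # code points: [104, 233, 108, 108, 111]
print(list(s.encode("utf-8")))   # UTF-8 bytes: [104, 195, 169, 108, 108, 111] -- 'é' is two bytes
```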
|
|
|
Tokenizer files: https://huggingface.co/gpt2/tree/main
|
|
|
### Problems with BPE
|
|
|
|
|
- Running BPE directly on raw text merges punctuation into words: `dog.` and `dog!` each end up as a single token.
|
|
|
Problems specific to byte-level BPE:
|
|
|
- BPE glues the leading space onto the following word, e.g. `bpe.decode(bpes[1:2]) = ' world'` (see the sketch after this list). For an NER task, does that mean the space gets annotated as part of the entity?
|
- BPE treats `'world'` and `' world'` as two completely different tokens, which seems undesirable.
|
- Casing: likewise, `world` and `World` are distinct tokens.
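A minimal sketch reproducing the space behavior with Hugging Face `transformers` (an assumption: the `bpe.decode` above corresponds to the HF tokenizer's `decode`):

```python
# pip install transformers; downloads the "gpt2" tokenizer files from the Hub.
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")

ids = tok.encode("hello world")
print(tok.convert_ids_to_tokens(ids))  # ['hello', 'Ġworld'] -- 'Ġ' encodes the leading space
print(repr(tok.decode(ids[1:2])))      # ' world' -- the space is glued to the next word

# 'world' and ' world' really are two different token sequences:
print(tok.encode("world") == tok.encode(" world"))  # False
```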
|
|
|
|
|
### How GPT-2 addresses this
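GPT-2 never lets BPE merge across word/punctuation boundaries: the text is first split by a hand-written regex (from `encoder.py` in the openai/gpt-2 repo), and BPE runs within each chunk. A sketch:

```python
# GPT-2's pre-tokenization regex, copied from openai/gpt-2 encoder.py.
# Needs the third-party `regex` module (pip install regex) for \p{L}/\p{N}.
import regex as re

pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(pat.findall("dog. dog! Hello world"))
# ['dog', '.', ' dog', '!', ' Hello', ' world']
# Punctuation is split off before BPE runs, so 'dog.' can never merge into
# one token; the leading space, however, stays attached to the word.
```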
|
|
|
|
|
|
|
### GPT-2's tokenizer
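What makes GPT-2's BPE "byte-level" is the byte-to-unicode table (`bytes_to_unicode` in `encoder.py`): all 256 byte values get a printable unicode stand-in, so the merge table and vocab can be stored as plain text. A sketch mirroring the original function:

```python
# Byte -> printable-unicode table, logic as in openai/gpt-2 encoder.py.
def bytes_to_unicode():
    # Bytes that are already printable keep their own code point ...
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # ... every remaining byte (space, control chars, ...) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
print(byte_encoder[ord(" ")])  # 'Ġ' (U+0120) -- hence vocab entries like 'Ġworld'
```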
|
|
|
|
|
|
|
# Downloads
|
|
|
### Official

The original files ship with OpenAI's gpt-2 repo (https://github.com/openai/gpt-2); its `download_model.py` fetches `encoder.json` and `vocab.bpe` along with the model weights.
|
|
|
### Hugging Face = official
|
|
|
- [vocab.json](https://huggingface.co/gpt2-large/resolve/main/vocab.json): 50,257 key-value pairs; same file as https://huggingface.co/gpt2/resolve/main/vocab.json
|
- [merges.txt](https://huggingface.co/gpt2-large/resolve/main/merges.txt): 50,001 lines (one `#version` header plus 50,000 merge rules); same file as https://huggingface.co/gpt2/resolve/main/merges.txt

  - The counts line up: 50,257 vocab entries = 256 byte tokens + 50,000 merges + 1 `<|endoftext|>`.

  - Does merges.txt contain every possible combination? https://github.com/huggingface/transformers/issues/4777
|
- [tokenizer.json](https://huggingface.co/openai-community/gpt2-large/blob/main/tokenizer.json) |
|
- This one is for the fast (Rust-backed) tokenizer `GPT2TokenizerFast`: it bundles the vocab and merges in a single file.
|
|
|
Vocabulary loading: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt2/tokenization_gpt2.py
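A condensed sketch of how that file reads the two vocab files (the local paths are assumptions; point them at downloaded copies):

```python
import json

with open("vocab.json", encoding="utf-8") as f:
    encoder = json.load(f)               # token string -> id (50,257 entries)

with open("merges.txt", encoding="utf-8") as f:
    merges = f.read().split("\n")[1:-1]  # drop the '#version' header and trailing blank

# Merge priority: an earlier line means the pair is merged first during BPE.
bpe_ranks = {tuple(line.split()): rank for rank, line in enumerate(merges)}
print(len(encoder), len(bpe_ranks))      # 50257 50000
```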
|
|
|
### fairseq = official
|
|
|
- [vocab.bpe](https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe): 50,001 lines

  - identical to HF's `merges.txt`

- [encoder.json](https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json): 50,257 key-value pairs

  - identical to HF's `vocab.json`

- [dict.txt](https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt): 50,260 lines. These are token frequencies, generated by `fairseq-preprocess`; see https://github.com/pytorch/fairseq/issues/1186
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|