PeptideCLM
Peptide-specific Chemical Language Model
A peptide-specific chemical language model pretrained with masked language modeling (MLM) on 10.8M peptides and 12.6M small molecules.
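The model weights themselves load with standard transformers classes; only the tokenizer is custom (see below). A minimal loading sketch, assuming the checkpoint exposes a standard transformers config; the checkpoint name here is illustrative only, so substitute an actual model from this collection:

# Minimal loading sketch. The checkpoint name is an assumption for
# illustration; replace it with a model from this collection.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all')
model.eval()  # inference mode for masked-token prediction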
Loading the tokenizer is not possible with transformers; a custom tokenizer must be loaded from the 'tokenizer' directory found at https://github.com/AaronFeller/PeptideCLM
An example script is included in the repository. A short example is below (note that the tokenizer directory must be downloaded first):
from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer

def get_tokenizer():
    # Vocabulary and SMILES pair-encoding split files shipped in the
    # repository's tokenizer/ directory.
    vocab_file = 'tokenizer/new_vocab.txt'
    splits_file = 'tokenizer/new_splits.txt'
    tokenizer = SMILES_SPE_Tokenizer(vocab_file, splits_file)
    return tokenizer
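With both pieces in place, the tokenizer output feeds straight into the model. A hedged usage sketch, assuming SMILES_SPE_Tokenizer follows the standard transformers tokenizer __call__ interface and reusing the model loaded above; the dipeptide SMILES is an arbitrary illustration:

import torch

tokenizer = get_tokenizer()
# Arbitrary example input: a Leu-Ala dipeptide written as SMILES.
inputs = tokenizer('CC(C)C[C@H](N)C(=O)N[C@@H](C)C(=O)O', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, vocab size)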