PeptideCLM
Peptide-specific Chemical Language Model
A peptide-specific chemical language model pretrained with masked language modeling (MLM) on 10.8M peptides and 12.6M small molecules.
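The model weights themselves load with standard transformers classes; only the tokenizer is custom (see below). A minimal loading sketch, assuming the checkpoint exposes a standard transformers config; the checkpoint name here is illustrative only, so substitute an actual model from this collection:

# Minimal loading sketch. The checkpoint name is an assumption for
# illustration; replace it with a model from this collection.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('aaronfeller/PeptideCLM-23M-all')
model.eval()  # inference mode for masked-token prediction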
Loading the tokenizer is not possible with transformers; a custom tokenizer must be loaded from the 'tokenizer' directory found at https://github.com/AaronFeller/PeptideCLM
An example script is included in the repository. A short example is below (note that the tokenizer directory must be downloaded first):
from tokenizer.my_tokenizers import SMILES_SPE_Tokenizer

def get_tokenizer():
    # Vocabulary and SMILES pair-encoding split files shipped in the
    # repository's tokenizer/ directory.
    vocab_file = 'tokenizer/new_vocab.txt'
    splits_file = 'tokenizer/new_splits.txt'
    tokenizer = SMILES_SPE_Tokenizer(vocab_file, splits_file)
    return tokenizer
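With both pieces in place, the tokenizer output feeds straight into the model. A hedged usage sketch, assuming SMILES_SPE_Tokenizer follows the standard transformers tokenizer __call__ interface and reusing the model loaded above; the dipeptide SMILES is an arbitrary illustration:

import torch

tokenizer = get_tokenizer()
# Arbitrary example input: a Leu-Ala dipeptide written as SMILES.
inputs = tokenizer('CC(C)C[C@H](N)C(=O)N[C@@H](C)C(=O)O', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence length, vocab size)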