ctheodoris/Geneformer · get_embs() method expects token names to be in token_gene_dict.values(), but are stored as keys

8 days ago

Hi everyone,

I am trying to use the get_embs() method from the emb_extractor module. If I load the token_dict from geneformer/token_dictionary_gc95M (or gc30M), it has the token names as keys (makes sense). But the get_embs() method expects them to be in the values:
```python
# Check if CLS and EOS token is present in the token dictionary
cls_present = any("<cls>" in value for value in token_gene_dict.values())
eos_present = any("<eos>" in value for value in token_gene_dict.values())
```
(geneformer/emb_extractor.py, lines 70-72)

Am I loading the wrong file or is there another step I am missing?

Thanks!

Jonatan

ctheodoris

Owner 7 days ago

Thanks for your question. The file you mentioned is loaded as gene_token_dict and is then inverted to form token_gene_dict, so the token names end up in its values and the keys and values should be appropriate for the check.
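For anyone following along, here is a minimal sketch of the inversion described above. It uses a small toy dictionary instead of the real file. The entries, the token ids, and the special-token strings `<cls>` and `<eos>` are assumptions for illustration, not the contents of the actual token dictionary.

```python
# Toy stand-in for the loaded token dictionary (hypothetical entries only).
# As loaded from the file, the mapping goes token name -> token id.
gene_token_dict = {
    "<pad>": 0,
    "<mask>": 1,
    "<cls>": 2,
    "<eos>": 3,
    "ENSG00000000003": 4,
}

# Invert it so the mapping goes token id -> token name.
# After this step the names live in .values(), not in the keys.
token_gene_dict = {token_id: name for name, token_id in gene_token_dict.items()}

# The same style of check quoted in the question now finds the special tokens.
cls_present = any("<cls>" in value for value in token_gene_dict.values())
eos_present = any("<eos>" in value for value in token_gene_dict.values())

print(cls_present, eos_present)
```

So if the check is run against the dictionary as loaded (names as keys), it will look wrong, but against the inverted dictionary the names are in the values as expected.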

ctheodoris changed discussion status to closed 7 days ago
