Clarification on `gene_class_dict` Format
The documentation in `Geneformer/geneformer/classifier.py` specifies the expected format for the `gene_class_dict` parameter as:
gene_class_dict : None, dict
| Gene classes to fine-tune model to distinguish.
| Dictionary in format: {Gene_label_A: list(geneA1, geneA2, ...),
| Gene_label_B: list(geneB1, geneB2, ...)}
| Gene values should be Ensembl IDs.
However, consider the following function from `Geneformer/geneformer/classifier_utils.py`:
def label_gene_classes(example, class_id_dict, gene_class_dict):
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]
It seems that `gene_class_dict` is expected to have `token_id` values as keys (likely Ensembl IDs or similar identifiers) and `Gene_label` as values. This contradicts the documentation quoted earlier, which describes a structure where the keys are `Gene_label` and the values are lists of genes (Ensembl IDs).
Questions
- Should `gene_class_dict` be structured as `{Gene_label_A: [geneA1, geneA2, ...], Gene_label_B: [geneB1, geneB2, ...]}`, as per the documentation? Or should it be structured as `{geneA1: Gene_label_A, geneA2: Gene_label_A, ...}`, as implied by the function `label_gene_classes`?
- If the intended structure is the first one (as per the documentation), could you clarify how `label_gene_classes` processes such a structure, or provide a corrected example?
Additional Context
- The `example["input_ids"]` in the function seems to interact directly with the `gene_class_dict` keys. If `gene_class_dict` were structured as `{Gene_label: [gene1, gene2, ...]}`, the function logic would not align, as `get()` operates on individual keys, not lists.
Looking forward to your clarification. Thank you!
Thank you for your questions! We have examples in the datasets repository; see, for example: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/bivalent_promoters
You are correct that label_gene_classes is expecting a different format - we pushed a change to make it consistent with the input format. Please pull the new version.
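For readers curious what reconciling the two formats could look like, the following is only a sketch and not the change that was pushed. It assumes a hypothetical `token_dict` mapping gene IDs to the token IDs found in `input_ids`; the helper name `invert_gene_class_dict` and all values are invented for illustration.

```python
def invert_gene_class_dict(gene_class_dict, token_dict):
    """Flatten {label: [gene, ...]} into {token_id: label}.

    token_dict is an assumed mapping from gene ID to token ID;
    genes missing from it are skipped.
    """
    token_to_label = {}
    for label, genes in gene_class_dict.items():
        for gene in genes:
            token_id = token_dict.get(gene)
            if token_id is None:
                continue  # gene is not in the assumed vocabulary
            previous = token_to_label.get(token_id)
            if previous is not None and previous != label:
                raise ValueError(
                    f"{gene} is assigned to both {previous!r} and {label!r}"
                )
            token_to_label[token_id] = label
    return token_to_label


# Hypothetical inputs in the documented format.
gene_class_dict = {
    "Gene_label_A": ["geneA1", "geneA2"],
    "Gene_label_B": ["geneB1", "gene_not_in_vocab"],
}
token_dict = {"geneA1": 10, "geneA2": 11, "geneB1": 12}

token_to_label = invert_gene_class_dict(gene_class_dict, token_dict)
print(token_to_label)
```

A flattened dictionary of this shape matches the per-token lookup that `label_gene_classes` performs, while users keep supplying the label-to-gene-list format described in the documentation.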