Clarification on `gene_class_dict` Format

#469
by lishensuo - opened

The documentation in Geneformer/geneformer/classifier.py specifies the expected format for the gene_class_dict parameter as:

gene_class_dict : None, dict
| Gene classes to fine-tune model to distinguish.
| Dictionary in format: {Gene_label_A: list(geneA1, geneA2, ...),
|                        Gene_label_B: list(geneB1, geneB2, ...)}
| Gene values should be Ensembl IDs.

However, consider the following function from Geneformer/geneformer/classifier_utils.py:

def label_gene_classes(example, class_id_dict, gene_class_dict):
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]

It seems that gene_class_dict is expected to have keys as token_id (likely Ensembl IDs or similar identifiers) and values as Gene_label. This contradicts the earlier documentation, which suggests a structure where keys are Gene_label and values are lists of genes (Ensembl IDs).
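To make the mismatch concrete, here is a minimal sketch that runs the function quoted above against both dictionary shapes. The integer token IDs, the label names, and the contents of class_id_dict are made up for illustration and are not taken from Geneformer.

```python
# Function copied from the snippet quoted above.
def label_gene_classes(example, class_id_dict, gene_class_dict):
    return [
        class_id_dict.get(gene_class_dict.get(token_id, -100), -100)
        for token_id in example["input_ids"]
    ]

# Hypothetical values: token IDs 11, 12, 21 belong to a class; 99 does not.
class_id_dict = {"label_A": 0, "label_B": 1}
example = {"input_ids": [11, 21, 99]}

# Shape described in the documentation: label -> list of genes.
documented_format = {"label_A": [11, 12], "label_B": [21]}

# Shape implied by the function: token ID -> label.
token_keyed_format = {11: "label_A", 12: "label_A", 21: "label_B"}

# The keys of documented_format are labels, so every token ID lookup misses
# and falls back to -100.
labels_documented = label_gene_classes(example, class_id_dict, documented_format)

# With token IDs as keys, each lookup resolves to a label and then a class ID.
labels_token_keyed = label_gene_classes(example, class_id_dict, token_keyed_format)

print(labels_documented)   # [-100, -100, -100]
print(labels_token_keyed)  # [0, 1, -100]
```

With the documented shape, every position gets -100, so no gene would ever receive a class label.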

Questions

  1. Should gene_class_dict be structured as:

    {Gene_label_A: [geneA1, geneA2, ...], Gene_label_B: [geneB1, geneB2, ...]}
    

    as per the documentation?

  2. Or, should it be structured as:

    {geneA1: Gene_label_A, geneA2: Gene_label_A, ...}
    

    as implied by the function label_gene_classes?

  3. If the intended structure is the first one (as per the documentation), could you clarify how label_gene_classes processes such a structure, or provide a corrected example?

Additional Context

  • The example["input_ids"] in the function seems to interact directly with the gene_class_dict keys. If gene_class_dict were structured as {Gene_label: [gene1, gene2, ...]}, the function logic would not align, as get() operates on individual keys, not lists.
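One way the two formats could be reconciled is to invert the documented dictionary before it reaches label_gene_classes. The sketch below is only an illustration under assumptions: invert_gene_class_dict is a hypothetical helper, token_dict stands in for whatever mapping from Ensembl IDs to token IDs the tokenizer uses, and the gene IDs are placeholders rather than real Ensembl IDs. It is not the actual Geneformer implementation.

```python
def invert_gene_class_dict(gene_class_dict, token_dict):
    """Turn {label: [gene, ...]} into {token_id: label}.

    Hypothetical helper for illustration only.
    """
    token_to_label = {}
    for label, genes in gene_class_dict.items():
        for gene in genes:
            token_id = token_dict.get(gene)
            if token_id is None:
                # Skip genes that have no token in the vocabulary.
                continue
            # If a gene appears under two labels, the later label wins.
            token_to_label[token_id] = label
    return token_to_label


# Placeholder gene IDs and token IDs (assumptions, not real data).
gene_class_dict = {
    "label_A": ["ENSG_A1", "ENSG_A2"],
    "label_B": ["ENSG_B1", "ENSG_NOT_IN_VOCAB"],
}
token_dict = {"ENSG_A1": 11, "ENSG_A2": 12, "ENSG_B1": 21}

token_to_label = invert_gene_class_dict(gene_class_dict, token_dict)
print(token_to_label)  # {11: 'label_A', 12: 'label_A', 21: 'label_B'}
```

The inverted dictionary has the shape that get() can be called on with an individual token ID, which is what the function logic requires.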

Looking forward to your clarification. Thank you!


Thank you for your questions! We have examples on the datasets repository. See here for example: https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/bivalent_promoters

You are correct that label_gene_classes is expecting a different format - we pushed a change to make it consistent with the input format. Please pull the new version.

ctheodoris changed discussion status to closed
