Issue getting perturbation stats via get_stats(): too many values to unpack
Hello, thank you for the well-documented code and examples! I've been able to fine-tune a handful of cell classification models, but I'm having some trouble modeling gene perturbations. My fine-tuned model is trained to classify cell type, and I would like to model the knockouts of two genes individually in three different cell types. Below is the setup for my InSilicoPerturber. My goal is to see how removing each gene from cells of specific cell types affects the placement of that cell type in the cell space; in other words, which cell types does it most closely resemble once the gene is removed? Following the cell_states_to_model argument, I set the cell types I want to perturb as the start states, the related cell types I expect the cell state to potentially shift toward as the goal end states, and all other cell types as the alternate possible end states. target_cell_types, related_cell_types, and other_cell_types are non-overlapping lists of cell types. Perhaps this is not a proper use case or setup for the perturbation tool?
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb=[gene_1, gene_2],
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",  # using model fine-tuned for cell classification
                        num_classes=len(cell_types),
                        emb_mode="cell_and_gene",  # want to look at impact on both gene and cell embeddings
                        cell_emb_style="mean_pool",
                        filter_data=None,
                        cell_states_to_model={"cell_type": (target_cell_types, related_cell_types, other_cell_types)},
                        emb_layer=0,
                        forward_batch_size=100,
                        nproc=16,
                        save_raw_data=True)
The model runs, but I get an error when I try to retrieve the in silico perturbation statistics. Below is the call that produced the error, followed by the traceback. I reran the preceding steps locally and found that the error comes from the dict_list that is passed to isp_stats_to_goal_state(). I'm not sure exactly what information is stored in dict_list, but mine contains a lot of nan, inf, and -inf values. Perhaps this could be the source of the ValueError?
ispstats.get_stats("../../code/models/perturbation_model",
                   None,
                   "../../data/project/perturbation_model",
                   "v1_perturbation_model")
ValueError Traceback (most recent call last)
/tmp/ipykernel_119167/121183301.py in
----> 1 cos_sims_df = isp_stats_to_goal_state(cos_sims_df_initial, dict_list, ispstats.cell_states_to_model)
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/geneformer/in_silico_perturber_stats.py in isp_stats_to_goal_state(cos_sims_df, dict_list, cell_states_to_model)
124 goal_end_random_megalist = [goal_end for start_state,goal_end in random_tuples]
125 elif alt_end_state_exists == True:
--> 126 goal_end_random_megalist = [goal_end for start_state,goal_end,alt_end in random_tuples]
127 alt_end_random_megalist = [alt_end for start_state,goal_end,alt_end in random_tuples]
128
/projects/b1038/Pulmonary/sfenske/projects/geneformer_experiments/code/venv_geneformer/lib/python3.7/site-packages/geneformer/in_silico_perturber_stats.py in (.0)
124 goal_end_random_megalist = [goal_end for start_state,goal_end in random_tuples]
125 elif alt_end_state_exists == True:
--> 126 goal_end_random_megalist = [goal_end for start_state,goal_end,alt_end in random_tuples]
127 alt_end_random_megalist = [alt_end for start_state,goal_end,alt_end in random_tuples]
128
ValueError: too many values to unpack (expected 3)
I would appreciate any help you may be able to provide!
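For anyone reading the traceback above: the ValueError comes from ordinary Python tuple unpacking, so it indicates that at least one element of random_tuples holds more than three items. The sketch below reproduces the message and shows one way to locate such elements. The tuple contents and both helper names are made up for illustration; the real contents of random_tuples are not shown in this thread.

```python
def split_goal_and_alt(random_tuples):
    """Mirror the three-name unpacking seen at lines 126-127 of the traceback."""
    goal_end = [goal for _start, goal, _alt in random_tuples]
    alt_end = [alt for _start, _goal, alt in random_tuples]
    return goal_end, alt_end


def find_bad_lengths(random_tuples, expected=3):
    """Return (index, length) for each element that cannot unpack into `expected` names."""
    return [(i, len(t)) for i, t in enumerate(random_tuples) if len(t) != expected]


# Hypothetical data: two well-formed triples plus one element with a fourth value.
good = [(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]
bad = good + [(0.7, 0.8, 0.9, 1.0)]

try:
    split_goal_and_alt(bad)
    error_message = None
except ValueError as err:
    error_message = str(err)

print(error_message)
print(find_bad_lengths(bad))
```

Running a length check like find_bad_lengths on the real random_tuples before the list comprehension would show which entries carry the extra values.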
Thank you for your interest in Geneformer! Please pull the updated version and try running your analysis with emb_mode set to "cell" rather than "cell_and_gene". You can then run the analysis again separately with "cell_and_gene" and cell_states_to_model set to None. The gene embedding shifts are independent of the cell states to model, because they describe how an in silico perturbation affects particular downstream genes, as opposed to the effect of the perturbation on shifting the cell state. It would be great if you could add an update on whether that resolves your issue.
Please note that the updated version will run the analysis with both of the genes you supplied being perturbed in combination. If you'd like to perturb them separately, you can run it separately for each gene.
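The suggested workflow (one gene at a time, a "cell" pass with cell states and a separate "cell_and_gene" pass without them) can be laid out as a list of keyword-argument sets. This is only a sketch: build_isp_runs is a hypothetical helper, the gene and cell type names are placeholders, and the keyword names are copied from the setup in the original post, so whether the installed version accepts them unchanged is an assumption.

```python
def build_isp_runs(genes, cell_states_to_model, base_kwargs):
    """Return one kwargs dict per planned run: two passes for each gene."""
    runs = []
    for gene in genes:
        # Pass 1: cell embedding shift relative to the modelled cell states.
        runs.append({**base_kwargs,
                     "genes_to_perturb": [gene],
                     "emb_mode": "cell",
                     "cell_states_to_model": cell_states_to_model})
        # Pass 2: gene embedding shifts, which do not use cell states.
        runs.append({**base_kwargs,
                     "genes_to_perturb": [gene],
                     "emb_mode": "cell_and_gene",
                     "cell_states_to_model": None})
    return runs


# Placeholder values standing in for the poster's real genes and cell types.
states = {"cell_type": (["target_a"], ["related_b"], ["other_c"])}
base = {"perturb_type": "delete", "model_type": "CellClassifier"}
runs = build_isp_runs(["GENE_1", "GENE_2"], states, base)

for kwargs in runs:
    print(kwargs["genes_to_perturb"], kwargs["emb_mode"])
```

Each dict would then be passed as InSilicoPerturber(**kwargs) in its own run, with a separate output directory per run so the results for different genes and modes are not mixed.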