[Rank value encoding] What kind of normalization was done before calculating the non-zero median for genes?

#480
by myl200 - opened

Hello!
I have a question about your method, the rank value encoding. In your Nature 2023 paper, the description starts with "... we first calculated the non-zero median value of expression of each detected gene across all cells passing quality filtering from the entire Genecorpus-30M." I'm curious about what kind of normalization was done before this step. Based on "gene_median_dictionary_gc30M.pkl", all values are in the np.float64 format.

Thanks for your question. The input is raw counts without normalization (see here). The raw counts are then transformed as described in the manuscript. You are also welcome to check out the code in the tokenizer script in this repository to see exactly how it is done.
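
For readers who want a concrete picture, here is a minimal sketch of the steps as I understand them from the manuscript: raw counts are scaled by each cell's total count, the non-zero median of each gene is taken across cells, and each cell's detected genes are then ranked by their median-scaled expression. The function names, the toy matrix, and the `target_sum` scale factor of 10,000 are assumptions for illustration, not the repository's actual code. The published medians were computed once across Genecorpus-30M and stored in the dictionary file rather than recomputed per dataset, and the real tokenizer maps genes to token IDs, whereas this sketch returns plain column indices.

```python
import numpy as np


def nonzero_median_per_gene(norm):
    """Median of each gene (column) over the cells where it is detected (> 0)."""
    medians = np.full(norm.shape[1], np.nan)
    for j in range(norm.shape[1]):
        nonzero = norm[:, j][norm[:, j] > 0]
        if nonzero.size:
            medians[j] = np.median(nonzero)
    return medians


def rank_value_encode(counts, target_sum=10_000):
    """Sketch of rank value encoding from a raw cells x genes count matrix.

    Assumes every cell has a non-zero total count (i.e. passed quality filtering).
    target_sum is an assumed scale factor; it does not affect the ranking.
    """
    counts = np.asarray(counts, dtype=np.float64)
    totals = counts.sum(axis=1, keepdims=True)
    # Depth normalization: divide by the cell's total count, then rescale.
    norm = counts * target_sum / totals
    medians = nonzero_median_per_gene(norm)

    encodings = []
    for row in norm:
        detected = np.nonzero(row > 0)[0]
        # Scale each detected gene by its non-zero median, then rank descending.
        scaled = row[detected] / medians[detected]
        order = detected[np.argsort(-scaled, kind="stable")]
        encodings.append(order.tolist())
    return medians, encodings


# Toy example: 3 cells x 3 genes of raw integer counts.
raw_counts = [[4, 0, 6],
              [1, 3, 6],
              [0, 5, 5]]
medians, encodings = rank_value_encode(raw_counts)
```

In this toy matrix, gene 2 has the highest raw count in the first cell, but gene 0 ranks first there because its count is high relative to its own non-zero median. That is the effect the median scaling is meant to have: genes that are highly expressed everywhere get pushed down the ranking.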

ctheodoris changed discussion status to closed
