---
language:
- code
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:3150
- loss:MultipleNegativesRankingLoss
- loss:ContrastiveLoss
widget:
- source_sentence: '{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_acae7679-d077-461c-b857-ee6ccfeb267f_357"}'
sentences:
- Memory B cell derived from a 5-year-old human individual, with IGH + IGL, IgA1
isotype, IGHJ3*01, IGHV3-30*18, IGLC3, IGLV1-44, IGLJ3, obstructive sleep apnea,
and recurrent tonsillitis.
- Neuron cell type from the hippocampal formation, specifically from the Head of
hippocampus (HiH) - Uncal CA1 dissection, in a 50-year-old male individual, with
a supercluster term of Miscellaneous.
- Prostate gland microvascular endothelial cell derived from a 74-year-old male
of European ethnicity, specifically located in the transition zone of the prostate.
- source_sentence: '{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_27d91086-cfe3-4e33-9282-bd1246e5ce8e_128"}'
sentences:
- Sample is an oligodendrocyte cell from a 29-year-old male human, with European
self-reported ethnicity, specifically located in the thalamic complex.
- Neuron cell type from a 50-year-old male cerebral cortex, specifically from the
Long insular gyri (LIG) and Dysgranular insular cortex - Idg region, classified
as Deep-layer corticothalamic and 6b.
- Fibroblast cells from the thalamic complex, specifically from the medial nuclear
complex of thalamus (MNC), mediodorsal nucleus of thalamus + reuniens nucleus
(medioventral nucleus) of thalamus (MD + Re) in a 50-year-old male.
- source_sentence: '{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_5af90777-6760-4003-9dba-8f945fec6fdf_3563"}'
sentences:
- Memory B cell from a 3-year-old male human with recurrent tonsillitis, expressing
IgG3 isotype, IGLC2, and IGLV2-23-IGLJ2 antibody.
- Macrophage cells from the kidney tissue of a female individual in her sixties,
specifically SPP1+ tumor-associated macrophages (TAMs), originating from a tumor
sample.
- Fibroblast cells from the hypothalamus tissue, specifically from the mammillary
region of HTH (HTHma) and mammillary nucleus (MN), of a 29-year-old male.
- source_sentence: '{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_218acb0f-9f2f-4f76-b90b-15a4b7c7f629_30680"}'
sentences:
- Ependymal cell derived from the spinal cord tissue of a 50-year-old male human
donor.
- Progenitor cells derived from blood tissue of a 58-year old female with managed
systemic lupus erythematosus (SLE). The cells are of European ethnicity and were
obtained from peripheral blood mononuclear cell suspension.
- Cell sample from the cortex of kidney, taken from a 46-year-old male with kidney
cancer, identified as an alternatively activated macrophage.
- source_sentence: '{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_e84f2780-51e8-4cfa-8aa0-13bbfef677c7_184"}'
sentences:
- A 46-year old female's liver sample, specifically conventional dendritic cell
type 1 (cDC1s) enriched in CD45+ cell suspension, with no reported liver-related
diseases.
- Sample is an ON-bipolar cell derived from the peripheral region of the retina
of a 60-year-old male with European self-reported ethnicity, mapped to GENCODE
24 reference annotation.
- Alpha-beta T cell from the thoracic lymph node of a female individual in her seventies,
identified as T_CD4/CD8 subtype, with predicted labels as Double-negative thymocytes,
and majority voting as Regulatory T cells.
datasets:
- jo-mengr/geo_7k_cellxgene_3_5k_multiplets
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
- cosine_accuracy_threshold
- cosine_f1
- cosine_f1_threshold
- cosine_precision
- cosine_recall
- cosine_ap
- cosine_mcc
model-index:
- name: SentenceTransformer
results:
- task:
type: triplet
name: Triplet
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy
value: 0.4942857027053833
name: Cosine Accuracy
- task:
type: binary-classification
name: Binary Classification
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy
value: 0.900952380952381
name: Cosine Accuracy
- type: cosine_accuracy_threshold
value: 0.8415515422821045
name: Cosine Accuracy Threshold
- type: cosine_f1
value: 0.8594377510040161
name: Cosine F1
- type: cosine_f1_threshold
value: 0.7717130184173584
name: Cosine F1 Threshold
- type: cosine_precision
value: 0.8085642317380353
name: Cosine Precision
- type: cosine_recall
value: 0.9171428571428571
name: Cosine Recall
- type: cosine_ap
value: 0.8751864028703273
name: Cosine Ap
- type: cosine_mcc
value: 0.7860489453464287
name: Cosine Mcc
---
# SentenceTransformer
This is a [sentence-transformers](https://www.SBERT.net) model trained on the [geo_7k_cellxgene_3_5k_multiplets](https://huggingface.co/datasets/jo-mengr/geo_7k_cellxgene_3_5k_multiplets) dataset. It maps sentences and paragraphs to a dense vector space (2048 dimensions, judging by the adapter output layers listed under Full Model Architecture) and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** not reported (the BERT text encoder has 512 position embeddings)
- **Output Dimensionality:** not reported (both adapters end in a 2048-feature layer)
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
- [geo_7k_cellxgene_3_5k_multiplets](https://huggingface.co/datasets/jo-mengr/geo_7k_cellxgene_3_5k_multiplets)
- **Language:** code
### Model Sources
- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
### Full Model Architecture
```
SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (text_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=768, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (omics_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=64, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
```
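The printout shows separate adapters for text (768 input features) and omics (64 input features) that both end in a 2048-feature layer, presumably so that captions and omics embeddings land in one shared space. As a rough sanity check, the sketch below counts the trainable parameters of each adapter from the layer sizes printed above. The helper names are illustrative and not part of any library, and BatchNorm running statistics are treated as buffers rather than parameters.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def batchnorm_params(num_features: int) -> int:
    """Affine scale and shift of BatchNorm1d (running stats are not counted)."""
    return 2 * num_features


def adapter_params(in_features: int, hidden: int = 512, out_features: int = 2048) -> int:
    """Linear -> ReLU -> Linear -> BatchNorm1d, as in the printed AdapterModule."""
    return (
        linear_params(in_features, hidden)
        + linear_params(hidden, out_features)
        + batchnorm_params(out_features)
    )


text_adapter_params = adapter_params(768)   # input: BERT hidden size
omics_adapter_params = adapter_params(64)   # input: omics embedding width
print(text_adapter_params, omics_adapter_params)
```

Both adapters are small compared with the 12-layer BERT encoder they sit beside.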
## Usage
### Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-geo7k-cellxgene3.5k-pairs-cell_type")
# Run inference
sentences = [
'{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}}, "sample_id": "census_e84f2780-51e8-4cfa-8aa0-13bbfef677c7_184"}',
"A 46-year old female's liver sample, specifically conventional dendritic cell type 1 (cDC1s) enriched in CD45+ cell suspension, with no reported liver-related diseases.",
'Sample is an ON-bipolar cell derived from the peripheral region of the retina of a 60-year-old male with European self-reported ethnicity, mapped to GENCODE 24 reference annotation.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, D), where D is the output width; the adapters in the architecture above end in 2048 features
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
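With cosine similarity as the similarity function, each entry of the similarity matrix is the dot product of two vectors divided by the product of their norms. The sketch below spells this out in plain Python and ranks candidate captions against a query embedding. The vectors are made-up toy values, not outputs of this model, and `rank_by_cosine` is an illustrative helper rather than a library function.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, candidates):
    """Return (candidate_index, score) pairs, best match first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings" standing in for one query and three captions.
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned
]

ranking = rank_by_cosine(query, candidates)
print([index for index, _ in ranking])
```

Because cosine similarity ignores vector length, the second candidate ranks first even though its norm differs from the query's.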
## Evaluation
### Metrics
#### Triplet
* Evaluated with [`TripletEvaluator`](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
| Metric | Value |
|:--------------------|:-----------|
| **cosine_accuracy** | **0.4943** |
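
Triplet cosine accuracy is read here as the fraction of (anchor, positive, negative) triplets in which the anchor is more similar to the positive than to the negative, so a value near 0.49 means the correct item wins in roughly half of the triplets. A minimal sketch of that computation on toy vectors (not model outputs), assuming no additional margin:

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their L2 norms."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_cosine_accuracy(triplets):
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)


# Four hand-made triplets: two ranked correctly, two ranked incorrectly.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive is closer
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),   # negative is closer
    ([1.0, 1.0], [1.0, 0.9], [-1.0, 0.0]),  # positive is closer
    ([1.0, 0.0], [0.0, 1.0], [0.8, 0.2]),   # negative is closer
]

accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```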
#### Binary Classification
* Evaluated with [`BinaryClassificationEvaluator`](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)
| Metric | Value |
|:--------------------------|:-----------|
| cosine_accuracy | 0.901 |
| cosine_accuracy_threshold | 0.8416 |
| cosine_f1 | 0.8594 |
| cosine_f1_threshold | 0.7717 |
| cosine_precision | 0.8086 |
| cosine_recall | 0.9171 |
| **cosine_ap** | **0.8752** |
| cosine_mcc | 0.786 |
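
These metrics come from thresholding the cosine similarity of each pair: a pair is predicted as matching when its score reaches the threshold. The sketch below applies the reported accuracy threshold to a few made-up scores and checks that the reported F1 is the harmonic mean of the reported precision and recall. The function names and toy scores are illustrative, and treating a score equal to the threshold as a match is an assumption.

```python
# Values copied from the model-index metadata of this card.
ACCURACY_THRESHOLD = 0.8415515422821045
PRECISION = 0.8085642317380353
RECALL = 0.9171428571428571


def predict_match(score: float, threshold: float = ACCURACY_THRESHOLD) -> bool:
    """Label a pair as matching when its cosine similarity reaches the threshold."""
    return score >= threshold


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


toy_scores = [0.95, 0.84, 0.60]  # made-up cosine similarities
predictions = [predict_match(score) for score in toy_scores]
f1 = f1_score(PRECISION, RECALL)

print(predictions)
print(round(f1, 4))
```

Note that the accuracy and F1 rows use different thresholds (0.8416 and 0.7717), so a pair scoring 0.80 would be rejected under the first and accepted under the second.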
## Training Details
### Training Dataset
#### geo_7k_cellxgene_3_5k_multiplets
* Dataset: [geo_7k_cellxgene_3_5k_multiplets](https://huggingface.co/datasets/jo-mengr/geo_7k_cellxgene_3_5k_multiplets) at [d5af8a2](https://huggingface.co/datasets/jo-mengr/geo_7k_cellxgene_3_5k_multiplets/tree/d5af8a2ef144a95afa06f7294be2686f7a610e50)
* Size: 3,150 training samples
* Columns: `anndata_ref`, `caption`, and `label`
* Approximate statistics based on the first 1000 samples:
| | anndata_ref | caption | label |
|:--------|:------------|:--------|:------|
| type | string | string | float |

* Samples:

| anndata_ref | caption | label |
|:------------|:--------|:------|
| `{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/DCW3zXGDx6DWY7i/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/EbjeimYBdjefbpg/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/mggGyqZE6892DWz/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/Rt4wXwEPifBT2nX/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/dmkHbFpkJLLqHPx/download"}}, "sample_id": "census_a37f857c-779f-464e-9310-3db43a1811e7_2741"}` | `Sample is a macrophage cell type derived from the ileal epithelium tissue of a female human in her fourth decade.` | `1.0` |
| `{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/DCW3zXGDx6DWY7i/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/EbjeimYBdjefbpg/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/mggGyqZE6892DWz/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/Rt4wXwEPifBT2nX/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/dmkHbFpkJLLqHPx/download"}}, "sample_id": "census_a37f857c-779f-464e-9310-3db43a1811e7_2741"}` | `Erythrocyte cells at the mid erythroid stage, derived from bone marrow of a male human fetus at 15 weeks post-fertilization.` | `0.0` |
| `{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/DCW3zXGDx6DWY7i/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/EbjeimYBdjefbpg/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/mggGyqZE6892DWz/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/Rt4wXwEPifBT2nX/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/dmkHbFpkJLLqHPx/download"}}, "sample_id": "census_a37f857c-779f-464e-9310-3db43a1811e7_2741"}` | `Native cell from the spleen of a 15th week post-fertilization human female, identified as DOUBLET_IMMUNE_FIBROBLAST.` | `0.0` |
* Loss: [`ContrastiveLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#contrastiveloss) with these parameters:
```json
{
"distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
"margin": 0.5,
"size_average": true
}
```
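For orientation, a contrastive loss with cosine distance and these parameters can be sketched as follows. With d = 1 - cos(u, v), matching pairs (label 1) are penalised by d², while non-matching pairs (label 0) are penalised only while they sit closer than the margin. This is a plain-Python illustration of the usual formula on toy vectors, not the library code, and the 0.5 scaling factor is an assumption about the convention used.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norms


def contrastive_loss(pairs, labels, margin=0.5, size_average=True):
    """Per-pair loss: 0.5 * (label * d^2 + (1 - label) * max(0, margin - d)^2)."""
    losses = []
    for (u, v), label in zip(pairs, labels):
        d = cosine_distance(u, v)
        pull_together = label * d ** 2                          # matching pairs
        push_apart = (1 - label) * max(0.0, margin - d) ** 2    # non-matching pairs
        losses.append(0.5 * (pull_together + push_apart))
    total = sum(losses)
    return total / len(losses) if size_average else total


toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical vectors, labelled as a match
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal vectors, labelled as a non-match
    ([1.0, 0.0], [1.0, 0.0]),  # identical vectors, labelled as a non-match
]
toy_labels = [1.0, 0.0, 0.0]

loss = contrastive_loss(toy_pairs, toy_labels)
print(loss)
```

Only the third pair contributes: it is labelled as a non-match but has distance 0, well inside the 0.5 margin.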
### Evaluation Dataset
#### geo_7k_cellxgene_3_5k_multiplets
* Dataset: [geo_7k_cellxgene_3_5k_multiplets](https://huggingface.co/datasets/jo-mengr/geo_7k_cellxgene_3_5k_multiplets) at [d5af8a2](https://huggingface.co/datasets/jo-mengr/geo_7k_cellxgene_3_5k_multiplets/tree/d5af8a2ef144a95afa06f7294be2686f7a610e50)
* Size: 350 evaluation samples
* Columns: `anndata_ref`, `caption`, and `label`
* Approximate statistics based on the first 350 samples:
| | anndata_ref | caption | label |
|:--------|:------------|:--------|:------|
| type | string | string | float |

* Samples:

| anndata_ref | caption | label |
|:------------|:--------|:------|
| `{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/EbaoL4ydTqmYwP9/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/X8EFSis4S5ecdse/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/DGxs2PkPeDF2RGm/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/bm3N8RCWePiyJKz/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/8FGZG6EzMeBYxjX/download"}}, "sample_id": "census_b46237d1-19c6-4af2-9335-9854634bad16_7973"}` | `Sample contains stem cells (LGR5 stem) derived from the duodeno-jejunal junction of a human fetus at Carnegie stage 23.` | `1.0` |
| `{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/EbaoL4ydTqmYwP9/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/X8EFSis4S5ecdse/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/DGxs2PkPeDF2RGm/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/bm3N8RCWePiyJKz/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/8FGZG6EzMeBYxjX/download"}}, "sample_id": "census_b46237d1-19c6-4af2-9335-9854634bad16_7973"}` | `A 46-year old female's liver sample, specifically conventional dendritic cell type 1 (cDC1s) enriched in CD45+ cell suspension, with no reported liver-related diseases.` | `0.0` |
| `{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/EbaoL4ydTqmYwP9/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/X8EFSis4S5ecdse/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/DGxs2PkPeDF2RGm/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/bm3N8RCWePiyJKz/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/8FGZG6EzMeBYxjX/download"}}, "sample_id": "census_b46237d1-19c6-4af2-9335-9854634bad16_7973"}` | `A CD16-negative, CD56-bright natural killer cell sample taken from the spleen of a male in his sixth decade.` | `0.0` |
* Loss: [`ContrastiveLoss`](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#contrastiveloss) with these parameters:
```json
{
"distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE",
"margin": 0.5,
"size_average": true
}
```
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 16
- `per_device_eval_batch_size`: 16
- `learning_rate`: 2e-05
- `num_train_epochs`: 4
- `warmup_ratio`: 0.1
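
To give a feel for what these settings imply, the arithmetic below estimates the length of the training schedule from the 3,150 training samples. It assumes a single device, no gradient accumulation, a final partial batch that is kept, and warmup steps rounded up; none of these are stated in the card, so the numbers are an estimate only.

```python
import math

# Values taken from this card.
train_samples = 3150
per_device_train_batch_size = 16
num_train_epochs = 4
warmup_ratio = 0.1

# Assumption: one device, no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Assumption: the warmup length is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the learning rate would ramp up to 2e-05 over roughly the first tenth of training and follow the scheduler for the remaining steps.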
#### All Hyperparameters