metadata
language:
- code
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:3150
- loss:MultipleNegativesRankingLoss
- loss:ContrastiveLoss
widget:
- source_sentence: >-
{"file_record": {"dataset_path":
"https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg":
"https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca":
"https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi":
"https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer":
"https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_acae7679-d077-461c-b857-ee6ccfeb267f_357"}
sentences:
- >-
Memory B cell derived from a 5-year-old human individual, with IGH +
IGL, IgA1 isotype, IGHJ3*01, IGHV3-30*18, IGLC3, IGLV1-44, IGLJ3,
obstructive sleep apnea, and recurrent tonsillitis.
- >-
Neuron cell type from the hippocampal formation, specifically from the
Head of hippocampus (HiH) - Uncal CA1 dissection, in a 50-year-old male
individual, with a supercluster term of Miscellaneous.
- >-
Prostate gland microvascular endothelial cell derived from a 74-year-old
male of European ethnicity, specifically located in the transition zone
of the prostate.
- source_sentence: >-
{"file_record": {"dataset_path":
"https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg":
"https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca":
"https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi":
"https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer":
"https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_27d91086-cfe3-4e33-9282-bd1246e5ce8e_128"}
sentences:
- >-
Sample is an oligodendrocyte cell from a 29-year-old male human, with
European self-reported ethnicity, specifically located in the thalamic
complex.
- >-
Neuron cell type from a 50-year-old male cerebral cortex, specifically
from the Long insular gyri (LIG) and Dysgranular insular cortex - Idg
region, classified as Deep-layer corticothalamic and 6b.
- >-
Fibroblast cells from the thalamic complex, specifically from the medial
nuclear complex of thalamus (MNC), mediodorsal nucleus of thalamus +
reuniens nucleus (medioventral nucleus) of thalamus (MD + Re) in a
50-year-old male.
- source_sentence: >-
{"file_record": {"dataset_path":
"https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg":
"https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca":
"https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi":
"https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer":
"https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_5af90777-6760-4003-9dba-8f945fec6fdf_3563"}
sentences:
- >-
Memory B cell from a 3-year-old male human with recurrent tonsillitis,
expressing IgG3 isotype, IGLC2, and IGLV2-23-IGLJ2 antibody.
- >-
Macrophage cells from the kidney tissue of a female individual in her
sixties, specifically SPP1+ tumor-associated macrophages (TAMs),
originating from a tumor sample.
- >-
Fibroblast cells from the hypothalamus tissue, specifically from the
mammillary region of HTH (HTHma) and mammillary nucleus (MN), of a
29-year-old male.
- source_sentence: >-
{"file_record": {"dataset_path":
"https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg":
"https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca":
"https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi":
"https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer":
"https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_218acb0f-9f2f-4f76-b90b-15a4b7c7f629_30680"}
sentences:
- >-
Ependymal cell derived from the spinal cord tissue of a 50-year-old male
human donor.
- >-
Progenitor cells derived from blood tissue of a 58-year old female with
managed systemic lupus erythematosus (SLE). The cells are of European
ethnicity and were obtained from peripheral blood mononuclear cell
suspension.
- >-
Cell sample from the cortex of kidney, taken from a 46-year-old male
with kidney cancer, identified as an alternatively activated macrophage.
- source_sentence: >-
{"file_record": {"dataset_path":
"https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download",
"embeddings": {"X_hvg":
"https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download",
"X_pca":
"https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download",
"X_scvi":
"https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download",
"X_geneformer":
"https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}},
"sample_id": "census_e84f2780-51e8-4cfa-8aa0-13bbfef677c7_184"}
sentences:
- >-
A 46-year old female's liver sample, specifically conventional dendritic
cell type 1 (cDC1s) enriched in CD45+ cell suspension, with no reported
liver-related diseases.
- >-
Sample is an ON-bipolar cell derived from the peripheral region of the
retina of a 60-year-old male with European self-reported ethnicity,
mapped to GENCODE 24 reference annotation.
- >-
Alpha-beta T cell from the thoracic lymph node of a female individual in
her seventies, identified as T_CD4/CD8 subtype, with predicted labels as
Double-negative thymocytes, and majority voting as Regulatory T cells.
datasets:
- jo-mengr/geo_7k_cellxgene_3_5k_multiplets
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
- cosine_accuracy_threshold
- cosine_f1
- cosine_f1_threshold
- cosine_precision
- cosine_recall
- cosine_ap
- cosine_mcc
model-index:
- name: SentenceTransformer
results:
- task:
type: triplet
name: Triplet
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy
value: 0.4942857027053833
name: Cosine Accuracy
- task:
type: binary-classification
name: Binary Classification
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy
value: 0.900952380952381
name: Cosine Accuracy
- type: cosine_accuracy_threshold
value: 0.8415515422821045
name: Cosine Accuracy Threshold
- type: cosine_f1
value: 0.8594377510040161
name: Cosine F1
- type: cosine_f1_threshold
value: 0.7717130184173584
name: Cosine F1 Threshold
- type: cosine_precision
value: 0.8085642317380353
name: Cosine Precision
- type: cosine_recall
value: 0.9171428571428571
name: Cosine Recall
- type: cosine_ap
value: 0.8751864028703273
name: Cosine Ap
- type: cosine_mcc
value: 0.7860489453464287
name: Cosine Mcc
SentenceTransformer
This is a sentence-transformers model trained on the geo_7k_cellxgene_3_5k_multiplets dataset. It maps sentences and paragraphs to a dense vector space (the adapter layers listed under Full Model Architecture end in 2048 output features) and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
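- Maximum Sequence Length: not reported (the BERT text encoder has 512 position embeddings)
- Output Dimensionality: not reported (both adapters end in a layer with 2048 output features)
- Similarity Function: Cosine Similarity
- Training Dataset: geo_7k_cellxgene_3_5k_multiplets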
- Language: code
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
  (0): MMContextEncoder(
    (text_encoder): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(28996, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=768, bias=True)
                (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
            )
            (intermediate): BertIntermediate(
              (dense): Linear(in_features=768, out_features=3072, bias=True)
              (intermediate_act_fn): GELUActivation()
            )
            (output): BertOutput(
              (dense): Linear(in_features=3072, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
      )
      (pooler): BertPooler(
        (dense): Linear(in_features=768, out_features=768, bias=True)
        (activation): Tanh()
      )
    )
    (text_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=768, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (omics_adapter): AdapterModule(
      (net): Sequential(
        (0): Linear(in_features=64, out_features=512, bias=True)
        (1): ReLU(inplace=True)
        (2): Linear(in_features=512, out_features=2048, bias=True)
        (3): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
  )
)
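To make the adapter sizes in the printout concrete, the following sketch derives learnable parameter counts from the printed layer shapes alone. The helper names are illustrative and not part of the model's code; the counts assume every Linear layer has a bias and that BatchNorm1d contributes one scale and one shift per feature, as the printout indicates.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def batchnorm_params(num_features: int) -> int:
    """Learnable scale and shift of an affine BatchNorm1d layer."""
    return 2 * num_features


def adapter_params(in_features: int, hidden: int = 512, out_features: int = 2048) -> int:
    """Linear -> ReLU -> Linear -> BatchNorm1d, following the printed AdapterModule."""
    return (
        linear_params(in_features, hidden)
        + linear_params(hidden, out_features)
        + batchnorm_params(out_features)
    )


text_adapter = adapter_params(768)  # 768-dim BERT output into the shared space
omics_adapter = adapter_params(64)  # 64-dim omics input into the shared space
print(text_adapter, omics_adapter)
```

Both adapters end in the same 2048-feature layer, which is what allows text and omics inputs to be compared with cosine similarity.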
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("jo-mengr/mmcontext-geo7k-cellxgene3.5k-pairs-cell_type")
# Run inference
sentences = [
'{"file_record": {"dataset_path": "https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download", "embeddings": {"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download", "X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download", "X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download", "X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}}, "sample_id": "census_e84f2780-51e8-4cfa-8aa0-13bbfef677c7_184"}',
"A 46-year old female's liver sample, specifically conventional dendritic cell type 1 (cDC1s) enriched in CD45+ cell suspension, with no reported liver-related diseases.",
'Sample is an ON-bipolar cell derived from the peripheral region of the retina of a 60-year-old male with European self-reported ethnicity, mapped to GENCODE 24 reference annotation.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# Expected: (3, 2048), assuming the output width matches the 2048-feature adapter layers shown above
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
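The first input in the example above is not free text but a JSON record that points to an omics sample. As a minimal sketch using only the standard library, the record can be inspected like this; how the model itself resolves the URLs is not described in this card and is not shown here.

```python
import json

# The record string used as the first input in the usage example.
record_text = (
    '{"file_record": {"dataset_path": '
    '"https://nxc-fredato.imbi.uni-freiburg.de/s/A2Kgip3knb4xmFj/download", '
    '"embeddings": {'
    '"X_hvg": "https://nxc-fredato.imbi.uni-freiburg.de/s/HHeBR7Q9QnLM85E/download", '
    '"X_pca": "https://nxc-fredato.imbi.uni-freiburg.de/s/rkHBdRGpy7qAspj/download", '
    '"X_scvi": "https://nxc-fredato.imbi.uni-freiburg.de/s/KXJjqrsrjnPKD3b/download", '
    '"X_geneformer": "https://nxc-fredato.imbi.uni-freiburg.de/s/sLBtSQxQ3HxiMyE/download"}}, '
    '"sample_id": "census_e84f2780-51e8-4cfa-8aa0-13bbfef677c7_184"}'
)

record = json.loads(record_text)
sample_id = record["sample_id"]
embedding_keys = sorted(record["file_record"]["embeddings"])

print(sample_id)
print(embedding_keys)
```

Each record names one dataset file, a set of precomputed embedding files keyed by method, and the sample identifier within that dataset.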
Evaluation
Metrics
Triplet
- Evaluated with TripletEvaluator
Metric | Value |
---|---|
cosine_accuracy | 0.4943 |
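As a rough illustration of what the triplet metric measures, the sketch below counts a triplet as correct when the anchor is more cosine-similar to the positive than to the negative. It is a simplified stand-in written in plain Python with toy vectors, not the evaluator's implementation. Under this criterion, a score near 0.5 is what a random ordering would produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def triplet_cosine_accuracy(triplets):
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    hits = sum(1 for a, p, n in triplets if cosine(a, p) > cosine(a, n))
    return hits / len(triplets)


# Toy vectors, purely illustrative.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: counted as a hit
    ([1.0, 0.0], [0.0, 1.0], [0.8, 0.2]),  # negative is closer: counted as a miss
]
accuracy = triplet_cosine_accuracy(toy_triplets)
print(accuracy)
```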
Binary Classification
- Evaluated with BinaryClassificationEvaluator
Metric | Value |
---|---|
cosine_accuracy | 0.901 |
cosine_accuracy_threshold | 0.8416 |
cosine_f1 | 0.8594 |
cosine_f1_threshold | 0.7717 |
cosine_precision | 0.8086 |
cosine_recall | 0.9171 |
cosine_ap | 0.8752 |
cosine_mcc | 0.786 |
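The reported F1 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The snippet uses the unrounded values from the card metadata. Note that the accuracy and F1 rows each come with their own threshold (0.8416 and 0.7717), so the accuracy figure is not expected to follow from the same precision and recall.

```python
# Unrounded values as listed in the model-index metadata.
precision = 0.8085642317380353
recall = 0.9171428571428571

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.8594
```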
Training Details
Training Dataset
geo_7k_cellxgene_3_5k_multiplets
- Dataset: geo_7k_cellxgene_3_5k_multiplets at d5af8a2
- Size: 3,150 training samples
- Columns: anndata_ref, caption, and label
- Approximate statistics based on the first 1000 samples:
Column | Type | Min | Mean | Max |
---|---|---|---|---|
anndata_ref | string | 510 characters | 512.71 characters | 514 characters |
caption | string | 43 characters | 162.51 characters | 1070 characters |
label | float | 0.0 | 0.33 | 1.0 |
- Samples:
- Loss: ContrastiveLoss with these parameters: {"distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE", "margin": 0.5, "size_average": true}
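To show what these parameters control, here is a plain-Python sketch of a contrastive loss with cosine distance, margin 0.5, and averaging over the batch. It follows the usual formulation from Hadsell et al. (cited below), in which matching pairs are pulled together and non-matching pairs are pushed apart until their distance reaches the margin. The function names and toy vectors are illustrative, and the library's implementation operates on batched tensors rather than Python lists.

```python
import math


def cosine_distance(u, v):
    """1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def contrastive_loss(pairs, margin=0.5, size_average=True):
    """pairs: iterable of (embedding_a, embedding_b, label), label 1.0 = match."""
    losses = []
    for u, v, label in pairs:
        d = cosine_distance(u, v)
        positive_term = label * d ** 2                              # pull matches together
        negative_term = (1.0 - label) * max(0.0, margin - d) ** 2   # push non-matches apart
        losses.append(0.5 * (positive_term + negative_term))
    total = sum(losses)
    return total / len(losses) if size_average else total


# Toy pairs, purely illustrative.
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # match at distance 0: no loss
    ([1.0, 0.0], [1.0, 0.0], 0.0),  # non-match at distance 0: 0.5 * 0.5**2
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # non-match beyond the margin: no loss
]
loss = contrastive_loss(toy_pairs)
print(loss)
```

With a margin of 0.5, non-matching pairs stop contributing once their cosine similarity drops to 0.5 or below.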
Evaluation Dataset
geo_7k_cellxgene_3_5k_multiplets
- Dataset: geo_7k_cellxgene_3_5k_multiplets at d5af8a2
- Size: 350 evaluation samples
- Columns: anndata_ref, caption, and label
- Approximate statistics based on the first 350 samples:
Column | Type | Min | Mean | Max |
---|---|---|---|---|
anndata_ref | string | 510 characters | 512.77 characters | 514 characters |
caption | string | 50 characters | 159.74 characters | 924 characters |
label | float | 0.0 | 0.33 | 1.0 |
- Samples:
- Loss: ContrastiveLoss with these parameters: {"distance_metric": "SiameseDistanceMetric.COSINE_DISTANCE", "margin": 0.5, "size_average": true}
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- learning_rate: 2e-05
- num_train_epochs: 4
- warmup_ratio: 0.1
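As a worked example of how these settings interact, the sketch below derives step counts and a linear warmup/decay learning rate from the listed values. It assumes a single device, no gradient accumulation, the 3,150-sample training set stated above, and warmup steps rounded up from the ratio. The function names are illustrative and not taken from the training code.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def linear_schedule_lr(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear ramp from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


per_epoch = steps_per_epoch(3150, 16)  # 197 steps per epoch
total = per_epoch * 4                   # 788 steps over 4 epochs
warmup = math.ceil(0.1 * total)         # 79 warmup steps

print(per_epoch, total, warmup)
print(linear_schedule_lr(warmup, total, warmup, 2e-05))
```

With 197 steps per epoch, step 100 falls at epoch 100 / 197 ≈ 0.5076, which matches the epoch value logged at step 100 in the first rows of the training log.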
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 16
- per_device_eval_batch_size: 16
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 2e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 4
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: False
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- eval_use_gather_object: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: proportional
Training Logs
Epoch | Step | Training Loss | Validation Loss | cosine_accuracy | cosine_ap |
---|---|---|---|---|---|
-1 | -1 | - | - | 0.5029 | - |
0.5076 | 100 | 6.008 | 9.9084 | 0.5057 | - |
1.0152 | 200 | 4.7386 | 10.5698 | 0.4943 | - |
0.5076 | 100 | 4.3879 | 11.5229 | 0.4943 | - |
1.0152 | 200 | 4.1962 | 11.7110 | 0.5 | - |
1.5228 | 300 | 4.2736 | 12.5341 | 0.4971 | - |
2.0305 | 400 | 4.1793 | 13.1011 | 0.4943 | - |
-1 | -1 | - | - | - | 0.3408 |
0.1692 | 100 | 0.1614 | 0.3496 | - | 0.3389 |
0.3384 | 200 | 0.1641 | 0.3579 | - | 0.3390 |
0.5076 | 300 | 0.1652 | 0.3592 | - | 0.3396 |
0.6768 | 400 | 0.1672 | 0.3696 | - | 0.3413 |
0.8460 | 500 | 0.1579 | 0.3591 | - | 0.3417 |
1.0152 | 600 | 0.1722 | 0.2388 | - | 0.3457 |
1.1844 | 700 | 0.1553 | 0.3597 | - | 0.3866 |
1.3536 | 800 | 0.1029 | 0.0675 | - | 0.6485 |
1.5228 | 900 | 0.059 | 0.0464 | - | 0.7094 |
1.6920 | 1000 | 0.0446 | 0.0357 | - | 0.7133 |
1.8613 | 1100 | 0.035 | 0.0286 | - | 0.7571 |
2.0305 | 1200 | 0.0304 | 0.0226 | - | 0.8048 |
2.1997 | 1300 | 0.0258 | 0.0293 | - | 0.7571 |
2.3689 | 1400 | 0.0226 | 0.0179 | - | 0.8204 |
2.5381 | 1500 | 0.0207 | 0.0160 | - | 0.8292 |
2.7073 | 1600 | 0.0198 | 0.0166 | - | 0.8152 |
2.8765 | 1700 | 0.0215 | 0.0157 | - | 0.8430 |
3.0457 | 1800 | 0.0183 | 0.0161 | - | 0.8544 |
3.2149 | 1900 | 0.0163 | 0.0138 | - | 0.8651 |
3.3841 | 2000 | 0.0163 | 0.0142 | - | 0.8696 |
3.5533 | 2100 | 0.0159 | 0.0129 | - | 0.8719 |
3.7225 | 2200 | 0.015 | 0.0129 | - | 0.8773 |
3.8917 | 2300 | 0.0157 | 0.0127 | - | 0.8752 |
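When reading the log, it can help to locate the checkpoint with the highest validation cosine_ap rather than assuming the last step is best. The short sketch below does this for the final four rows of the table, copied as (step, cosine_ap) pairs; the variable names are illustrative.

```python
# Excerpt of the last four logged rows: (step, cosine_ap).
logged_ap = [
    (2000, 0.8696),
    (2100, 0.8719),
    (2200, 0.8773),
    (2300, 0.8752),
]

# Pick the row with the highest cosine_ap.
best_step, best_ap = max(logged_ap, key=lambda row: row[1])
print(best_step, best_ap)
```

In this excerpt the highest value is logged at step 2200, slightly above the final step. Since load_best_model_at_end is False in the hyperparameters above, the saved model is presumably the one from the final step.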
Framework Versions
- Python: 3.11.6
- Sentence Transformers: 3.5.0.dev0
- Transformers: 4.43.4
- PyTorch: 2.6.0
- Accelerate: 0.33.0
- Datasets: 2.14.4
- Tokenizers: 0.19.1
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
ContrastiveLoss
@inproceedings{hadsell2006dimensionality,
author={Hadsell, R. and Chopra, S. and LeCun, Y.},
booktitle={2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06)},
title={Dimensionality Reduction by Learning an Invariant Mapping},
year={2006},
volume={2},
number={},
pages={1735-1742},
doi={10.1109/CVPR.2006.100}
}