metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:69500
  - loss:Infonce
base_model: Snowflake/snowflake-arctic-embed-l-v2.0
widget:
  - source_sentence: What aspect of human relationship to nature is omitted from the text
    sentences:
      - >-
        There are a few good ones, though. Here are the best WWE apps and WWE
        games for Android! The first five are the best games...

        Go Android Apps (blog)

        The Best Themes for Android Free Download: Hi friend we are again back
        with our new top ten best free themes for android list. This article is
        especially dedicated for those persons who want to make their
        smartphone...

        Paragon Software has created an app for Android that allows your device
        to natively read partitions in file systems that Android normally can't
        handle, such as Microsoft's NTFS, allowing immediate and easy use of...
        While the Sentio Desktop app can be used on its own, it was primarily
        meant to complement Sentio's Superbook, a crowdfunded laptop shell for
        Android smartphones and tablets that's just entering production after...

        ... phone then GBWhatsapp is the app for you. GBWhatsapp is basically
        similar to Whatsapp+ in terms of features. The newest available version
        right now is GBWhatsapp 6.40 APK for Android devices.
      - >-
        A true entertainer. date city state venue 11/23/2012 West Palm Beach FL
        Kravis Center 11/24/2012 Sarasota FL Van Wezel Performing Arts Hall
        11/25/2012 Clearwater FL Capitol Theatre 11/29/2012 Durham NC Durham
        Performing Arts Center 12/1/2012 Atlantic City NJ Trump Taj Mahal
        12/2/2012 Staten Island NY St. George Theatre 12/4/2012 Bethlehem PA
        Musikfest Cafe 12/5/2012 Verona NY Turning Stone Casino 12/6/2012
        Stamford CT Palace Theatre Stamford 12/8/2012 Shippensburg PA Luhrs
        Center 12/9/2012 Boston MA Wilbur Theatre 12/11/2012 Greensburg PA The
        Palace Theatre 12/12/2012 Easton MD Avalon Theatre 12/15/2012 Saint
        Charles IL Arcada Theater 12/16/2012 Milwaukee WI Potawatomi Bingo
        Casino 12/18/2012 Beaver Creek CO Vilar Performing Arts Center
        12/20/2012 Chandler AZ Ovations Live!
      - >-
        The reader will gain a better understanding of the direction nature and
        culture is heading today by learning how connections were made in the
        past. It omits that which Raymond Williams called "a working landscape"
        -- the most intimate human relationship to nature which is people who
        live and work on it.
  - source_sentence: >-
      Why is it recommended to contact a wedding agency or consultant before
      making a decision
    sentences:
      - >-
        Perhaps owing to this humiliation I resigned as Chief Winery Warlord,
        and took a position elsewhere. Following my resignation, we rebooked our
        date with axe throwing destiny, and converted the night from a team
        building exercise to a majestic send off in honour of my 10ish glorious
        years at Coffin Ridge. We arrived in our most impeccable vestments.
      - >-
        Therefore, those private companies increased their own rate of cash burn
        since the financial markets were willing to fund money-losing
        enterprises without hesitation. Out of the 100 largest North
        American-based technology companies, 16 have lost money over the past
        year.
      - >-
        Yet , it is best to contact a wedding agency or consultant before you
        make your concluding decision. This will make certain you are dealing
        with a respectable company.
  - source_sentence: >-
      What is the Electronic Music Education and Preservation Project (EMEAPP)
      and what are its functions
    sentences:
      - >-
        The Electronic Music Education and Preservation Project (EMEAPP) is the
        steward of a privately held world-class curated collection of rare
        vintage electronic instruments and stage-used gear. This includes
        effects units, amps, organs, synthesizers, electro-mechanical
        instruments, guitars, prototypes, vintage audio/video media and analog
        studio gear. In addition, EMEAPP itself is cultivating its own humble
        collection. It is our charge to cultivate and reap excellent knowledge
        from these unique resources and return it to our members and the world.
        We do this as a learning center, through research projects, creative
        endeavors, media programming and tours, enlightening many people along
        the way. There is so much to be harvested from history; EMEAPP has a key
        to the vault. EMEAPP is a private museum, a critical learning center and
        a multi-media production studio nicely packed into a brick-and-mortar
        facility outside of Philadelphia, Pennsylvania. EMEAPP is a 501(c)(3)
        non-profit organization.
      - You got a problem? Yo, she'll splode it.
      - >-
        I love sex; I think sex is completely absurdly demonized in our culture.
        But in the end, however much sex you want to have, with however many
        people in how many ways, to be loved and to love is what human beings
        really want.
  - source_sentence: What year did the Duchess die and where did it happen
    sentences:
      - >-
        League One


        League table


        Results summary


        Results by matchday


        Matches

        On 21 June 2018, the League One fixtures for the forthcoming season were
        announced. FA Cup


        The first round draw was made live on BBC by Dennis Wise and Dion Dublin
        on 22 October.
      - >-
        The Duchess was widowed in 2007 and died in London in 2011. Issue 


        The Duke and Duchess of Buccleuch and Queensberry had four children:

        Richard Scott, 10th Duke of Buccleuch (b. 1954), married Lady Elizabeth
        Kerr, daughter of the Marquess of Lothian, and has issue two sons and
        two daughters. Lord John (born 9 August 1957), married Berrin Torolsan,
        and lives in Istanbul, Turkey. Lady Charlotte-Anne (born 9 January
        1966), married Count Bernard de Castellane in 1991, and has issue two
        sons and a daughter. Lord Damian (born 8 October 1969), married
        Elizabeth Powis, and has issue. External links

        Jane in her wedding dress  

        Movie clip of Jane's wedding


        References 


        1929 births

        2011 deaths

        British duchesses by marriage

        Jane

        Scottish female models

        British cookbook writers

        Women cookbook writers
      - >-
        Is this common, do other people with epilepsy have dangerously low
        appetites? So we left there and stopped and got her a bite to eat.
  - source_sentence: Why is it important to keep moving over the summer
    sentences:
      - It's important to keep moving over the summer!
      - >-
        2008. CHENG HF, LEE YM, Chu CH, Leung WK & Mok TMY. - Journal Editor
        (Hong Kong Medical Journal) 2008

        - Editor-in-Chief (Hong Kong Dental Journal) 2007

        - Editor-in-Chief (Hong Kong Dental Journal) 2006

        - Deputy Editor (Hong Kong Dental Journal) 2004
      - >-
        Both demand collective action and shared resources. While one is
        distinctly egalitarian and the other hierarchical in nature, both speak
        of sublimating private goals for the achievement of larger, shared ones.
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on Snowflake/snowflake-arctic-embed-l-v2.0

This is a sentence-transformers model finetuned from Snowflake/snowflake-arctic-embed-l-v2.0. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l-v2.0
  • Maximum Sequence Length: 1024 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 1024, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
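
The modules above amount to: run the XLM-R encoder (up to 1024 tokens), take the embedding of the CLS token, and L2-normalize it. Below is a minimal sketch of that pipeline using transformers directly; it assumes the underlying encoder weights can be loaded from the same repo, as sentence-transformers checkpoints normally allow. The SentenceTransformer usage shown in the next section is the supported path.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

repo_id = "Jrinky/snowflake"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
encoder = AutoModel.from_pretrained(repo_id)  # XLMRobertaModel, module (0)

texts = ["Why is it important to keep moving over the summer"]
batch = tokenizer(texts, padding=True, truncation=True, max_length=1024, return_tensors="pt")

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state     # (batch, seq_len, 1024)

cls_embeddings = token_embeddings[:, 0]                        # module (1): CLS-token pooling
sentence_embeddings = F.normalize(cls_embeddings, p=2, dim=1)  # module (2): Normalize()
print(sentence_embeddings.shape)
# torch.Size([1, 1024])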

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Jrinky/snowflake")
# Run inference
sentences = [
    'Why is it important to keep moving over the summer',
    "It's important to keep moving over the summer!",
    '2008. CHENG HF, LEE YM, Chu CH, Leung WK & Mok TMY. - Journal Editor (Hong Kong Medical Journal) 2008\n- Editor-in-Chief (Hong Kong Dental Journal) 2007\n- Editor-in-Chief (Hong Kong Dental Journal) 2006\n- Deputy Editor (Hong Kong Dental Journal) 2004',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
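
Beyond pairwise similarity, the same embeddings can be used for semantic search. The sketch below ranks a small corpus against a query by cosine similarity; the query and corpus strings are made up for illustration.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Jrinky/snowflake")

query = "What is the Electronic Music Education and Preservation Project?"
corpus = [
    "EMEAPP is a private museum, learning center and media production studio outside Philadelphia.",
    "It's important to keep moving over the summer!",
    "The Duchess was widowed in 2007 and died in London in 2011.",
]

query_embedding = model.encode([query])
corpus_embeddings = model.encode(corpus)

# Cosine similarity between the query and every corpus entry, shape (1, 3)
scores = model.similarity(query_embedding, corpus_embeddings)
best_idx = scores.argmax().item()
print(corpus[best_idx], float(scores[0, best_idx]))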

Training Details

Training Dataset

Unnamed Dataset

  • Size: 69,500 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    anchor (string): min 6 tokens, mean 17.47 tokens, max 44 tokens
    positive (string): min 3 tokens, mean 113.33 tokens, max 1024 tokens
  • Samples:
    anchor: What might have been unnecessary if better emergency plans had been implemented
    positive: If better emergency plans had been in place, maybe chemical dipersants wouldn't be needed. And on and on.
    anchor: What was the year of publication for the 3rd Edition of 'Regular Polytopes' by H.S.M. Coxeter
    positive: Coxeter, Regular Polytopes, 3rd Edition, Dover New York, 1973
      Kaleidoscopes: Selected Writings of H.S.M. Coxeter, edited by F. Arthur Sherk, Peter McMullen, Anthony C. Thompson, Asia Ivic Weiss, Wiley-Interscience Publication, 1995,
      (Paper 22) H.S.M.
    anchor: Who is the author of the GURPS Shapeshifters supplement
    positive: GURPS Shapeshifters () is a supplement by Robert M. Schroeck for the GURPS role-playing game system, third edition.
  • Loss: selfloss.Infonce with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
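
The selfloss.Infonce implementation itself is not published here. As a rough guide, the parameters above describe an InfoNCE-style objective over in-batch negatives with cosine similarity and a scale (inverse temperature) of 20; the following is a minimal sketch of such a loss, not the exact training code.

import torch
import torch.nn.functional as F

def infonce_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # Cosine-similarity matrix between every anchor and every in-batch positive:
    # entry (i, i) is the true pair, all other entries in row i act as negatives.
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    scores = anchor_emb @ positive_emb.T * scale
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)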
    

Evaluation Dataset

Unnamed Dataset

  • Size: 17,376 evaluation samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    anchor (string): min 6 tokens, mean 16.87 tokens, max 45 tokens
    positive (string): min 6 tokens, mean 115.36 tokens, max 1024 tokens
  • Samples:
    anchor: What impressive achievements did the Warriors accomplish during their last season in Division III
    positive: The Warriors were among the most lethal offensive teams in Division III this past year, posting a team batting average of .344 and averaging nearly seven runs per game, smacking 29 home runs, and collecting nearly 600 total bases. They shared the Little East Conference regular-season championship and later knocked off the top seed in the NCAA regional tournament (Montclair State) en route to their winningest season in 14 years.
    anchor: How many bars had nectar and capped honey on them
    positive: Eight of the bars had nectar and capped honey on them. There are eighteen bars with brood in some form on them and a mix of workers and drones.
    anchor: What idea is being requested regarding the 'triangle'
    positive: Next up...the "triangle". Please, seriously, if anyone could float me an idea, I would really appreciate it.
  • Loss: selfloss.Infonce with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 3
  • per_device_eval_batch_size: 3
  • learning_rate: 5e-06
  • num_train_epochs: 5
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
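
For reference, a training setup matching these non-default values could look like the sketch below. It uses placeholder data and stands in the built-in MultipleNegativesRankingLoss (an InfoNCE-style in-batch-negatives loss with scale=20 and cosine similarity) for the custom selfloss.Infonce; it is not the original training script.

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")

# Placeholder (anchor, positive) pairs; the real run used 69,500 training and 17,376 evaluation samples.
train_dataset = Dataset.from_dict({"anchor": ["q1", "q2", "q3"], "positive": ["a1", "a2", "a3"]})
eval_dataset = Dataset.from_dict({"anchor": ["q4", "q5", "q6"], "positive": ["a4", "a5", "a6"]})

args = SentenceTransformerTrainingArguments(
    output_dir="snowflake-finetuned",
    eval_strategy="steps",
    per_device_train_batch_size=3,
    per_device_eval_batch_size=3,
    learning_rate=5e-6,
    num_train_epochs=5,
    warmup_ratio=0.1,
    fp16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()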

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 3
  • per_device_eval_batch_size: 3
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-06
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: True
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss
0.0777 150 0.0257 0.0134
0.1554 300 0.0136 0.0082
0.2332 450 0.0079 0.0062
0.3109 600 0.0065 0.0051
0.3886 750 0.0059 0.0045
0.4663 900 0.0057 0.0040
0.5440 1050 0.0064 0.0037
0.6218 1200 0.005 0.0034
0.6995 1350 0.0052 0.0034
0.7772 1500 0.0041 0.0032

Framework Versions

  • Python: 3.12.3
  • Sentence Transformers: 3.2.0
  • Transformers: 4.44.2
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.3.0
  • Datasets: 2.19.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Infonce

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}