SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
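
The three modules above amount to: run the transformer, keep only the first ([CLS]) token vector (`pooling_mode_cls_token: True`), then scale that vector to unit length. The following is a minimal NumPy sketch of the last two steps, not the library's implementation; the function name `cls_pool_and_normalize` is hypothetical and random numbers stand in for the BertModel output.

```python
import numpy as np

def cls_pool_and_normalize(token_embeddings: np.ndarray) -> np.ndarray:
    """Take the first ([CLS]) token vector per sequence, then L2-normalize it.

    token_embeddings: array of shape (batch, seq_len, hidden).
    Returns an array of shape (batch, hidden) with unit-length rows.
    """
    cls_vectors = token_embeddings[:, 0, :]            # CLS pooling: first token only
    norms = np.linalg.norm(cls_vectors, axis=1, keepdims=True)
    return cls_vectors / np.clip(norms, 1e-12, None)   # guard against division by zero

# Toy stand-in for the transformer output: 2 sequences, 4 tokens, 1024 dims.
rng = np.random.default_rng(0)
fake_token_embeddings = rng.normal(size=(2, 4, 1024))
sentence_embeddings = cls_pool_and_normalize(fake_token_embeddings)
print(sentence_embeddings.shape)  # (2, 1024)
```

Because the output rows have unit length, the dot product of two embeddings equals their cosine similarity.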

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("KatGaw/eu-legal-ft-2")
# Run inference
sentences = [
    'What is the role of the Commission in assessing a harmonised standard proposed by a European standardisation organisation?',
    'to in paragraph 1, or parts of those specifications, shall be presumed to be in conformity with the requirements set out in \nSection 2 of this Chapter or, as applicable, to comply with the obligations referred to in Sections 2 and 3 of Chapter V, to \nthe extent those common specifications cover those requirements or those obligations.\n4.\nWhere a harmonised standard is adopted by a European standardisation organisation and proposed to the \nCommission for the publication of its reference in the Official Journal of the European Union, the Commission shall assess the \nharmonised standard in accordance with Regulation (EU) No 1025/2012. When reference to a harmonised standard is',
    'Member States relating to the making available on the market of measuring instruments (OJ L 96, 29.3.2014, p. 149).',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
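
For semantic search, the similarity scores are typically used to rank passages against a query. The sketch below shows that ranking step with plain NumPy; `rank_passages` is a hypothetical helper, and toy 3-dimensional vectors stand in for real 1024-dimensional embeddings so that it runs without downloading the model.

```python
import numpy as np

def rank_passages(query_emb, passage_embs, top_k=3):
    """Return (index, score) pairs for the top_k passages by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q                          # cosine similarity of each passage to the query
    order = np.argsort(-scores)[:top_k]     # highest score first
    return [(int(i), float(scores[i])) for i in order]

# Toy vectors (assumption: illustrative values, not model output).
query = np.array([1.0, 0.0, 0.0])
passages = np.array([
    [0.0, 1.0, 0.0],    # orthogonal to the query
    [0.9, 0.1, 0.0],    # close to the query
    [-1.0, 0.0, 0.0],   # opposite direction
])
ranking = rank_passages(query, passages, top_k=2)
print(ranking)
```

With the example above, `embeddings[0]` (the question) would serve as the query and `embeddings[1:]` as the candidate passages.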

Evaluation

Metrics

Information Retrieval

Metric               Value
cosine_accuracy@1    0.81
cosine_accuracy@3    0.93
cosine_accuracy@5    0.95
cosine_accuracy@10   1.0
cosine_precision@1   0.81
cosine_precision@3   0.31
cosine_precision@5   0.19
cosine_precision@10  0.1
cosine_recall@1      0.81
cosine_recall@3      0.93
cosine_recall@5      0.95
cosine_recall@10     1.0
cosine_ndcg@10       0.9069
cosine_mrr@10        0.877
cosine_map@100       0.877
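
The reported values are consistent with each query having exactly one relevant passage: accuracy@k equals recall@k, precision@k is roughly recall@k divided by k, and MRR equals MAP. That is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched as follows; `ir_metrics` and the example ranks are hypothetical.

```python
def ir_metrics(ranks, k):
    """Compute top-k retrieval metrics.

    ranks: 1-based rank of the single relevant passage for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return {
        "accuracy": hits / n,          # fraction of queries with a hit in the top k
        "recall": hits / n,            # one relevant passage per query, so same as accuracy
        "precision": hits / (n * k),   # at most one of the k retrieved passages is relevant
        "mrr": sum(1.0 / r for r in ranks if r is not None and r <= k) / n,
    }

# Hypothetical ranks for four queries (illustrative only).
example_ranks = [1, 1, 3, 2]
metrics_at_3 = ir_metrics(example_ranks, k=3)
metrics_at_1 = ir_metrics(example_ranks, k=1)
print(metrics_at_3)
print(metrics_at_1)
```

Applied to the reported figures, the same relationship gives 0.93 / 3 = 0.31 for precision@3, 0.95 / 5 = 0.19 for precision@5, and 1.0 / 10 = 0.1 for precision@10.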

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,658 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    sentence_0 (type: string)
    • min: 2 tokens
    • mean: 21.21 tokens
    • max: 44 tokens
    sentence_1 (type: string)
    • min: 5 tokens
    • mean: 126.72 tokens
    • max: 217 tokens
  • Samples:
    Sample 1
    sentence_0: What documentation must the provider prepare according to Article 11 and Annex IV?
    sentence_1:
    (b) the provider has drawn up the technical documentation in accordance with Article 11 and Annex IV;
    (c) the system bears the required CE marking and is accompanied by the EU declaration of conformity referred to in
    Article 47 and instructions for use;
    (d) the provider has appointed an authorised representative in accordance with Article 22(1).
    OJ L, 12.7.2024
    EN
    ELI: http://data.europa.eu/eli/reg/2024/1689/oj
    65/144
    Sample 2
    sentence_0: What must accompany the system alongside the CE marking as per the context provided?
    sentence_1:
    (b) the provider has drawn up the technical documentation in accordance with Article 11 and Annex IV;
    (c) the system bears the required CE marking and is accompanied by the EU declaration of conformity referred to in
    Article 47 and instructions for use;
    (d) the provider has appointed an authorised representative in accordance with Article 22(1).
    OJ L, 12.7.2024
    EN
    ELI: http://data.europa.eu/eli/reg/2024/1689/oj
    65/144
    Sample 3
    sentence_0: What actions will the Commission take if there are doubts about a notified body's competence?
    sentence_1:
    1.
    The Commission shall, where necessary, investigate all cases where there are reasons to doubt the competence of
    a notified body or the continued fulfilment by a notified body of the requirements laid down in Article 31 and of its
    applicable responsibilities.
    2.
    The notifying authority shall provide the Commission, on request, with all relevant information relating to the
    notification or the maintenance of the competence of the notified body concerned.
    3.
    The Commission shall ensure that all sensitive information obtained in the course of its investigations pursuant to this
    Article is treated confidentially in accordance with Article 78.
    4.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
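
To illustrate what these parameters mean, the sketch below computes an in-batch ranking loss on truncated prefixes of the embeddings (768, 512, 256, 128 and 64 dimensions) and sums the results with equal weights. It is a NumPy approximation for intuition, not the Sentence Transformers implementation; the function names are hypothetical, the similarity scale of 20 is an assumption, and random vectors stand in for real embeddings.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch ranking loss: each anchor should score its own positive highest."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                            # (batch, batch) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # cross-entropy with the diagonal as targets

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated prefixes of the embeddings."""
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(10, 1024))                      # toy batch of 10 embeddings
positives = anchors + 0.1 * rng.normal(size=(10, 1024))    # positives close to their anchors
dims = [768, 512, 256, 128, 64]                            # matryoshka_dims
weights = [1, 1, 1, 1, 1]                                  # matryoshka_weights

matched_loss = matryoshka_loss(anchors, positives, dims, weights)
mismatched_loss = matryoshka_loss(anchors, positives[::-1], dims, weights)
print(matched_loss, mismatched_loss)
```

The loss is small when every anchor is paired with its own positive and large when the pairing is scrambled, at every truncation size. This is what encourages the leading dimensions of the embedding to remain useful on their own.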
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 30
  • multi_dataset_batch_sampler: round_robin
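
As a quick sanity check, the number of optimizer steps follows from the dataset size, batch size and epoch count. The arithmetic below assumes a single device, no gradient accumulation, and that the final partial batch of each epoch is kept.

```python
import math

train_samples = 1658   # "Size: 1,658 training samples"
batch_size = 10        # per_device_train_batch_size
epochs = 30            # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)  # last partial batch is kept
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 166 4980
```

These values agree with the training log, which reaches epoch 1.0 at step 166 and epoch 30.0 at step 4980.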

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 30
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

Epoch Step Training Loss cosine_ndcg@10
0.3012 50 - 0.8523
0.6024 100 - 0.8744
0.9036 150 - 0.8993
1.0 166 - 0.9049
1.2048 200 - 0.8871
1.5060 250 - 0.8737
1.8072 300 - 0.8864
2.0 332 - 0.8850
2.1084 350 - 0.8884
2.4096 400 - 0.8776
2.7108 450 - 0.8779
3.0 498 - 0.8864
3.0120 500 1.1103 0.8866
3.3133 550 - 0.8956
3.6145 600 - 0.9069
3.9157 650 - 0.9079
4.0 664 - 0.9055
4.2169 700 - 0.9000
4.5181 750 - 0.8907
4.8193 800 - 0.9033
5.0 830 - 0.9016
5.1205 850 - 0.8950
5.4217 900 - 0.9047
5.7229 950 - 0.9134
6.0 996 - 0.9048
6.0241 1000 0.1809 0.9092
6.3253 1050 - 0.8953
6.6265 1100 - 0.8866
6.9277 1150 - 0.9021
7.0 1162 - 0.9021
7.2289 1200 - 0.9003
7.5301 1250 - 0.8908
7.8313 1300 - 0.8979
8.0 1328 - 0.9024
8.1325 1350 - 0.9008
8.4337 1400 - 0.9061
8.7349 1450 - 0.9125
9.0 1494 - 0.9152
9.0361 1500 0.0889 0.9152
9.3373 1550 - 0.9097
9.6386 1600 - 0.8966
9.9398 1650 - 0.8991
10.0 1660 - 0.9014
10.2410 1700 - 0.9027
10.5422 1750 - 0.9052
10.8434 1800 - 0.8917
11.0 1826 - 0.8936
11.1446 1850 - 0.8941
11.4458 1900 - 0.9058
11.7470 1950 - 0.8983
12.0 1992 - 0.9083
12.0482 2000 0.0658 0.9044
12.3494 2050 - 0.9063
12.6506 2100 - 0.9047
12.9518 2150 - 0.9115
13.0 2158 - 0.9152
13.2530 2200 - 0.9111
13.5542 2250 - 0.9000
13.8554 2300 - 0.9049
14.0 2324 - 0.8991
14.1566 2350 - 0.8891
14.4578 2400 - 0.9017
14.7590 2450 - 0.9050
15.0 2490 - 0.9012
15.0602 2500 0.0517 0.9014
15.3614 2550 - 0.8998
15.6627 2600 - 0.8947
15.9639 2650 - 0.9002
16.0 2656 - 0.8965
16.2651 2700 - 0.9085
16.5663 2750 - 0.8940
16.8675 2800 - 0.8932
17.0 2822 - 0.9066
17.1687 2850 - 0.8960
17.4699 2900 - 0.8908
17.7711 2950 - 0.8991
18.0 2988 - 0.8983
18.0723 3000 0.0569 0.9005
18.3735 3050 - 0.8945
18.6747 3100 - 0.9003
18.9759 3150 - 0.8994
19.0 3154 - 0.9024
19.2771 3200 - 0.9032
19.5783 3250 - 0.8980
19.8795 3300 - 0.8989
20.0 3320 - 0.9020
20.1807 3350 - 0.9023
20.4819 3400 - 0.9033
20.7831 3450 - 0.8907
21.0 3486 - 0.9063
21.0843 3500 0.0318 0.9026
21.3855 3550 - 0.8989
21.6867 3600 - 0.8965
21.9880 3650 - 0.8976
22.0 3652 - 0.8976
22.2892 3700 - 0.8972
22.5904 3750 - 0.9030
22.8916 3800 - 0.8955
23.0 3818 - 0.9011
23.1928 3850 - 0.8968
23.4940 3900 - 0.8970
23.7952 3950 - 0.8978
24.0 3984 - 0.8964
24.0964 4000 0.047 0.8976
24.3976 4050 - 0.9005
24.6988 4100 - 0.9021
25.0 4150 - 0.8991
25.3012 4200 - 0.9021
25.6024 4250 - 0.8944
25.9036 4300 - 0.8984
26.0 4316 - 0.8995
26.2048 4350 - 0.8963
26.5060 4400 - 0.8973
26.8072 4450 - 0.9037
27.0 4482 - 0.9040
27.1084 4500 0.0325 0.8974
27.4096 4550 - 0.8966
27.7108 4600 - 0.8995
28.0 4648 - 0.9012
28.0120 4650 - 0.9012
28.3133 4700 - 0.9068
28.6145 4750 - 0.9069
28.9157 4800 - 0.9072
29.0 4814 - 0.9072
29.2169 4850 - 0.9069
29.5181 4900 - 0.9069
29.8193 4950 - 0.9069
30.0 4980 - 0.9069

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.1
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}