tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:400
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
  - source_sentence: >-
      What actions did Mr and Mrs Harris take that led to the revelation of the
      facts in the case?
    sentences:
      - >-
        Perplexity’s marketing activities include promoting on its Instagram
        account a massive billboard 

        in Times Square from September 2024 which read “Congratulations
        Perplexity on 250 million 

        questions answered last month.”5 
         
        4 Discover New York with Perplexity, Perplexity AI (last visited Oct.
        17, 2024), 

        https://www.perplexity.ai/encyclopedia/discovernewyork. 

        5 @perplexity.ai, Instagram (Sept. 4, 2024),  

        https://www.instagram.com/perplexity.ai/p/C_g2TonSHC5.  

        Case 1:24-cv-07984     Document 1     Filed 10/21/24     Page 8 of 42
      - >-
        31 
         
        status.    It was not until Mr. and Mrs. Harris retained counsel, served
        a demand letter on May 22, 

        2024, met with the then Assistant Superintendent and a lengthy “bulling
        investigation” that these 

        facts came to light.    

        The Defendant’s actions and conduct, by definition, was arbitrary and
        capricious as was 

        the  imposition of discipline that was a gross abuse of discretion when
        it served as a catalyst for 

        this action.  Similarly, the Defendants exceeded their authority by
        repeatedly doubling down on 

        their acts and conduct when given the opportunity to reverse course. 
        The adverse action taken was 

        not based on sound, objective, adopted and approved policies and
        procedures regarding the use of
      - >-
        website users, and licensing is transacted with individuals and entities
        residing in this State and 

        District. As such, the injuries alleged herein from Perplexity’s
        infringement and other unlawful 

        conduct foreseeably occurred in this State and District. In addition,
        Perplexity or its agents reside 

        in this District and may be found in this State and District. 

        23. 

        Defendant Perplexity is subject to the jurisdiction of this Court
        pursuant to N.Y. 

        C.P.L.R. § 302(a)(1) and (3) as it has purposefully directed its
        activities at New York and has 

        Case 1:24-cv-07984     Document 1     Filed 10/21/24     Page 7 of 42
  - source_sentence: >-
      How did the Plaintiffs demonstrate that the discipline and sanctions
      imposed by Hingham were arbitrary and capricious?
    sentences:
      - >-
        27 
         
        in the adoption and execution of policies and practices that in their
        judgment are needed to preserve 

        internal order and discipline and to maintain institutional security"),
        such deference is not without 

        limitation.  The propriety of and deference afforded to the decision
        making is a rebuttable 

        presumption that may only be undone by a showing that the action taken
        was arbitrary and 

        capricious.  See Doe v. Supt. Of Schools of Stoughton, 437 Mass. 1, 5
        (2002).  The Plaintiffs are 

        likely to succeed on the merits because they have shown through
        Hingham’s own investigation 

        materials that the discipline and sanctions imposed were arbitrary,
        capricious and an abuse of 

        discretion under the circumstances.
      - >-
        7 
         
        highly competitive curriculum with by and large top grades, a 36 ACT
        (highest score possible) 

        and a varied solid resume.  In order for RNH to apply to Stanford by
        November 1, which means 

        submitting no later than October 25th, his transcript issue must be
        resolved by early October so 

        that when RNH requests his transcripts, they reflect grades commensurate
        with his achievement 

        and not marred by the incident that gave rise to this case. 

        Letter grades of “C” in this type of admissions environment typically
        lead to the applicant 

        being excluded from consideration.  Additionally, transcripts and
        information regarding any 

        disciplinary infraction, especially one regarding an academic integrity
        infraction, are a substantial
      - >-
        as one’s own.  Id. at ¶106-107.  During the project, RNH and his
        classmate did not take someone 

        else’s work or ideas and pass them off as their own.  Id. at ¶108.  RNH
        and his classmate used AI, 

        which generates and synthesizes new information, and did not pass off
        another’s work as their 

        own.  Id. at ¶109.  Despite having this information, the Defendants
        exceeded the authority granted 

        to them in an abuse of authority, discretion, and unfettered state
        action by unfairly and unjustly 

        acting as investigator, judge, jury, and executioner in determining the
        extreme and outrageous 

        sanctions imposed upon these Students.  Id. at ¶110.   

        After being unfairly and unjustly accused of cheating, plagiarism, and
        academic
  - source_sentence: >-
      How many students with academic infractions were inducted into the NHS,
      and what was one of the reasons for their infractions?
    sentences:
      - >-
        companies that want to utilize popular, high-quality, human-created
        journalism for use by the 

        companies’ AI applications. Revenue received from legitimately-run AI
        companies supports the 

        costs of news gathering. This revenue also establishes that there is a
        market for the licensing of 

        human-generated content for lawful use in AI technologies. 

        42. 

        Plaintiffs’ content is highly valued in this market.  
         
         
        Case 1:24-cv-07984     Document 1     Filed 10/21/24     Page 12 of 42
      - >-
        18 
         
        upon in affirming the decision through an appeal to exclude RNH and his
        classmate from the NHS.  

        Id. at ¶145.  At that time, Defendant Swanson and other Defendants knew
        or should have known 

        that the District inducted at least seven students into NHS, who had
        academic infractions on their 

        record, one of which was because of the prior use of AI.  Id. at
        ¶146.   

        The “committee” that adjudicated selection for NHS this year did not
        include teachers who 

        know and are familiar with RNH and his classmate.  Id. at ¶147.  This is
        due to the then escalating 

        contract conflict with the Hingham Educators Association (“HEA”) where
        HEA engaged in an
      - >-
        is plainly and solely for a commercial purpose. Moreover, upon
        information and belief, it copies 

        into its index every single word of Plaintiffs’ copyrighted works that
        it can get its hands on. 

        Additionally, the use to which it puts these copies is to create a
        commercial substitute for Plaintiffs’ 

        protected works  in Perplexity’s own words, to allow and encourage
        users to “Skip the Links” to 

        Plaintiffs’ original works. Such substitution causes substantial harm to
        Plaintiffs’ traditional 

        advertising and subscription revenues. Perplexity’s conduct also harms
        Plaintiffs’ additional, 

        established revenue stream from licensing to more scrupulous AI
        companies. Nor is Perplexity’s
  - source_sentence: How many pages does the document filed in case 1:24-cv-12437-WGY contain?
    sentences:
      - Case 1:24-cv-12437-WGY   Document 8   Filed 10/08/24   Page 26 of 42
      - >-
        more specialized in conducting a specific task, responding to prompts
        specific to a subject area, or 

        recognizing nuances in particular questions. 

        48. 

        Fine-tuning might also ensure that the LLM responds to certain prompts
        by 

        mimicking a certain linguistic style. For example, outputting a cooking
        recipe requires a distinct 

        output style from recounting the statistics of the Allies’ landing at
        Normandy’s beaches on D-Day, 

        or from writing a poem about the summer wind. A medical treatise uses a
        distinct linguistic style 

        from a sports recap. 

        Case 1:24-cv-07984     Document 1     Filed 10/21/24     Page 13 of 42
      - >-
        D. The Balancing Of The Irreparable Harm Heavily Favors The Plaintiffs 

        The balance of harms in this case clearly favors granting an injunction.
        If the injunction is 

        not granted, RNH will suffer irreparable harm that cannot be adequately
        remedied by any future 

        court decision or monetary compensation. RNH’s academic and professional
        future is at stake, as 

        a delayed resolution of the investigation into academic sanctions could
        result in missed deadlines 

        for college applications, exclusion from consideration at elite
        universities, and a permanent stain 

        on his academic record. The reputational damage and uncertainty caused
        will undermine RNH’s 

        ability to compete fairly with other applicants, affecting not only his
        immediate educational
  - source_sentence: >-
      What challenges do professional journalists and publishers face that may
      impact their ability to enforce their intellectual property rights?
    sentences:
      - >-
        ban or prohibition on the use of AI by students. The Defendants were not
        trained on any policies 

        or procedures for use of AI alone, never mind what they were “able to
        do” to students who used 

        it.    The entire purpose behind having such policies and procedures in
        place is to ensure notice, 

        equity, fairness and to be sure:  a level playing field for all.  
        Making matters worse, there exists 

        no adequate procedures and policies for the induction of an applicant
        into NHS when compared to 

        other members who are inducted despite the same or similar infractions. 
        This is a denial of student 

        rights of the highest order. 
         
        In the case here, RNH was disciplined on an ad hoc and on-going basis
        over more than six
      - >-
        19 

        respect. They feel very good about it. And in our user interface, even
        though we give the answer, 

        we do show the user exactly where the answer is coming from.”16  

        68. 

        As Srinivas surely knows or should know, academic standards for
        avoiding 

        plagiarism are wholly independent from copyright law.17 Dow Jones and
        NYP Holdings editors 

        and journalists are not graduate students working out of a library or
        lab, eager to have someone 

        acknowledge and utilize their research. They are professional
        journalists and publishers – working 

        under high-pressure deadlines, sometimes in dangerous places – whose
        livelihoods depend on the 

        enforcement and monetization of their intellectual property rights.  

        69.
      - >-
        example the school committee under Mass. G.L. c. 71, § 37, may punish a
        student offender without 

        a prior rule specifically forbidding the offending conduct; however,
        surely such authority cannot 

        be limitless. Moreover, this court believes that the imposition of a
        severe penalty without a 

        specific promulgated rule might be constitutionally deficient under
        certain circumstances.   

        Id. (emphasis supplied). “What those circumstances are can only be left
        to the development of the 

        case law in the area.”    Id.   There has been no case law developed in
        the area of school discipline 

        Case 1:24-cv-12437-WGY   Document 8   Filed 10/08/24   Page 28 of 42
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: Unknown
          type: unknown
        metrics:
          - type: cosine_accuracy@1
            value: 0.7291666666666666
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8541666666666666
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9375
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.7291666666666666
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.28472222222222215
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1875
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09999999999999999
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.7291666666666666
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8541666666666666
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.9375
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.8575788154610162
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.8125248015873017
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.8125248015873016
            name: Cosine Map@100

SentenceTransformer based on Snowflake/snowflake-arctic-embed-l

This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Snowflake/snowflake-arctic-embed-l
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
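
As a minimal sketch of what the cosine similarity function computes, the snippet below ranks a few passages against a query in plain Python. The 4-dimensional vectors and the function names are made up for illustration; real embeddings from this model have 1024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings (illustrative values only).
query = [1.0, 0.0, 1.0, 0.0]
passages = [
    [0.9, 0.1, 0.8, 0.0],    # nearly the same direction as the query
    [0.0, 1.0, 0.0, 1.0],    # orthogonal to the query
    [-1.0, 0.0, -1.0, 0.0],  # opposite direction
]
ranking = rank_passages(query, passages)
print(ranking)
```

Cosine similarity ignores vector length, so only the direction of an embedding affects the ranking.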

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
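
The last two modules above can be read as "take the first token's vector, then scale it to unit length". The following is a rough plain-Python sketch of that idea, not the library's implementation; the token values and helper names are invented for illustration.

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vec):
    """Scale a vector to unit length, as a Normalize() layer would."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

# Hypothetical Transformer output for one input: 3 tokens x 4 dimensions.
token_embeddings = [
    [3.0, 0.0, 4.0, 0.0],  # first ([CLS]) token
    [0.5, 0.5, 0.5, 0.5],
    [1.0, 2.0, 3.0, 4.0],
]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the output is unit length, the dot product of two such embeddings equals their cosine similarity.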

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("llm-wizard/legal-ft")
# Run inference
sentences = [
    'What challenges do professional journalists and publishers face that may impact their ability to enforce their intellectual property rights?',
    '19 \nrespect. They feel very good about it. And in our user interface, even though we give the answer, \nwe do show the user exactly where the answer is coming from.”16  \n68. \nAs Srinivas surely knows or should know, academic standards for avoiding \nplagiarism are wholly independent from copyright law.17 Dow Jones and NYP Holdings editors \nand journalists are not graduate students working out of a library or lab, eager to have someone \nacknowledge and utilize their research. They are professional journalists and publishers – working \nunder high-pressure deadlines, sometimes in dangerous places – whose livelihoods depend on the \nenforcement and monetization of their intellectual property rights.  \n69.',
    'ban or prohibition on the use of AI by students. The Defendants were not trained on any policies \nor procedures for use of AI alone, never mind what they were “able to do” to students who used \nit.    The entire purpose behind having such policies and procedures in place is to ensure notice, \nequity, fairness and to be sure:  a level playing field for all.   Making matters worse, there exists \nno adequate procedures and policies for the induction of an applicant into NHS when compared to \nother members who are inducted despite the same or similar infractions.  This is a denial of student \nrights of the highest order. \n \nIn the case here, RNH was disciplined on an ad hoc and on-going basis over more than six',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
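
Since the model was trained with MatryoshkaLoss at dimensions 768, 512, 256, 128 and 64 (see Training Details), shorter prefixes of an embedding are intended to remain usable. Here is a hedged sketch of truncating and renormalizing an embedding by hand. The input vector is a synthetic stand-in, and `truncate_embedding` is a hypothetical helper, not part of the library.

```python
import math

# Dimensions listed in this card's MatryoshkaLoss configuration.
MATRYOSHKA_DIMS = [768, 512, 256, 128, 64]

def truncate_embedding(vec, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    if dim > len(vec):
        raise ValueError("dim exceeds embedding length")
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Synthetic stand-in for a 1024-dimensional model output (not real model data).
full = [math.sin(i + 1) for i in range(1024)]
small = truncate_embedding(full, 256)
print(len(small))
```

Sentence Transformers also exposes a `truncate_dim` argument on the `SentenceTransformer` constructor that truncates at encode time; check it against the library version you use. Renormalizing matters mainly if you compare truncated vectors with a dot product rather than cosine similarity.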

Evaluation

Metrics

Information Retrieval

| Metric | Value |
|:--------------------|:-------|
| cosine_accuracy@1 | 0.7292 |
| cosine_accuracy@3 | 0.8542 |
| cosine_accuracy@5 | 0.9375 |
| cosine_accuracy@10 | 1.0 |
| cosine_precision@1 | 0.7292 |
| cosine_precision@3 | 0.2847 |
| cosine_precision@5 | 0.1875 |
| cosine_precision@10 | 0.1 |
| cosine_recall@1 | 0.7292 |
| cosine_recall@3 | 0.8542 |
| cosine_recall@5 | 0.9375 |
| cosine_recall@10 | 1.0 |
| cosine_ndcg@10 | 0.8576 |
| cosine_mrr@10 | 0.8125 |
| cosine_map@100 | 0.8125 |
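
In the figures above, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example 0.8542 / 3 ≈ 0.2847). That pattern is what you would expect if each query has exactly one relevant passage, though the card does not say so explicitly. Under that assumption, the metrics can be sketched as follows; the `ranks` list is invented for illustration and is not the evaluation data.

```python
import math

def ir_metrics(ranks, k):
    """Retrieval metrics at cutoff k, assuming ONE relevant passage per query.

    ranks: 1-based rank of the relevant passage for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    # Reciprocal rank of the relevant passage, 0 when it falls outside the top k.
    mrr = sum(1.0 / r for r, h in zip(ranks, hits) if h) / n
    # With a single relevant passage the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
    ndcg = sum(1.0 / math.log2(r + 1) for r, h in zip(ranks, hits) if h) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most one of the k results is relevant
        "recall": accuracy,         # the single relevant passage is found or not
        "mrr": mrr,
        "ndcg": ndcg,
    }

ranks = [1, 1, 2, 4, None]  # hypothetical ranks for five queries
metrics_at_3 = ir_metrics(ranks, 3)
print(metrics_at_3)
```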

Training Details

Training Dataset

Unnamed Dataset

  • Size: 400 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 400 samples:
    |         | sentence_0 | sentence_1 |
    |:--------|:-----------|:-----------|
    | type    | string     | string     |
    | details | min: 10 tokens; mean: 20.93 tokens; max: 35 tokens | min: 25 tokens; mean: 140.37 tokens; max: 260 tokens |
  • Samples:
    Sample 1
      sentence_0: What provisions of the 2023-2024 Handbook were referenced regarding the use of AI and academic integrity?
      sentence_1:
        13

        procedure, expectation, conduct, discipline, sanction or consequence for the use of AI. Id. at ¶102.
        Under these circumstances, the use of AI was not a violation of the then existing “Academic
        Integrity: Cheating and Plagiarism” provisions of the 2023-2024 Handbook. Id. at ¶104. As such,
        accusations of cheating, plagiarism, and academic misconduct or dishonesty were not supported
        by the record evidence which, at all times relevant, the Defendants have had in their care, custody
        and control. Id. at ¶105.
        While there is much dispute as to whether the use of generative AI constitutes plagiarism,
        plagiarism is defined as the practice of taking someone else’s work or ideas and passing them off
    Sample 2
      sentence_0: How is plagiarism defined in the context provided?
      sentence_1:
        13

        procedure, expectation, conduct, discipline, sanction or consequence for the use of AI. Id. at ¶102.
        Under these circumstances, the use of AI was not a violation of the then existing “Academic
        Integrity: Cheating and Plagiarism” provisions of the 2023-2024 Handbook. Id. at ¶104. As such,
        accusations of cheating, plagiarism, and academic misconduct or dishonesty were not supported
        by the record evidence which, at all times relevant, the Defendants have had in their care, custody
        and control. Id. at ¶105.
        While there is much dispute as to whether the use of generative AI constitutes plagiarism,
        plagiarism is defined as the practice of taking someone else’s work or ideas and passing them off
    Sample 3
      sentence_0: What is the case number associated with the document filed on 10/21/24?
      sentence_1:
        program-ad-revenue-sharing-ai-time-fortune-der-spiegel.
        Case 1:24-cv-07984 Document 1 Filed 10/21/24 Page 21 of 42
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
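
To make the configuration above more concrete, here is a rough plain-Python sketch of the idea: an in-batch-negatives ranking loss is computed on truncated prefixes of the embeddings and the results are summed with the given weights. This is an illustration of the technique, not the library's code. The scale factor of 20, the toy 4-dimensional vectors, and the toy dims `[4, 2]` are assumptions chosen for the example.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch negatives: each anchor should score highest against its own positive."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cos_sim(a, p) for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the inner loss over truncated prefixes of the embeddings."""
    loss = 0.0
    for dim, w in zip(dims, weights):
        a_trunc = [a[:dim] for a in anchors]
        p_trunc = [p[:dim] for p in positives]
        loss += w * mnr_loss(a_trunc, p_trunc)
    return loss

# Hypothetical toy batch: two (question, passage) embedding pairs.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Note that the configured dimensions stop at 768, so under this reading the full 1024-dimensional embedding is not one of the explicitly optimized sizes.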
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • num_train_epochs: 10
  • multi_dataset_batch_sampler: round_robin
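
These settings, combined with the 400 training samples, imply the step counts seen in the training logs. The sketch below works through that arithmetic and a linear learning-rate decay, using the `learning_rate`, `lr_scheduler_type`, and `warmup_steps` values from the full hyperparameter list. It assumes a single device and no gradient accumulation, and the decay formula is a simplified reading of a linear schedule rather than the trainer's own code.

```python
import math

train_samples = 400   # dataset size from this card
batch_size = 10       # per_device_train_batch_size
epochs = 10           # num_train_epochs
base_lr = 5e-05       # learning_rate
warmup_steps = 0      # warmup_steps

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

print(steps_per_epoch, total_steps)  # 40 400
print(linear_lr(0), linear_lr(200), linear_lr(400))
```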

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 10
  • per_device_eval_batch_size: 10
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 10
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin

Training Logs

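| Epoch | Step | cosine_ndcg@10 |
|:------|:-----|:---------------|
| 1.0 | 40 | 0.8182 |
| 1.25 | 50 | 0.8172 |
| 2.0 | 80 | 0.8112 |
| 2.5 | 100 | 0.8414 |
| 3.0 | 120 | 0.8236 |
| 3.75 | 150 | 0.7962 |
| 4.0 | 160 | 0.7930 |
| 5.0 | 200 | 0.8536 |
| 6.0 | 240 | 0.8263 |
| 6.25 | 250 | 0.8257 |
| 7.0 | 280 | 0.8475 |
| 7.5 | 300 | 0.8505 |
| 8.0 | 320 | 0.8499 |
| 8.75 | 350 | 0.8582 |
| 9.0 | 360 | 0.8576 |
| 10.0 | 400 | 0.8576 |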
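
The logged scores fluctuate between evaluations, so it can help to compare the best step with the final one. The short sketch below does that with the rows copied from the log above. Because `load_best_model_at_end` is False in the hyperparameters, the published weights presumably come from the final step, though the card does not state this directly.

```python
# (epoch, step, cosine_ndcg@10) rows copied from the training log.
training_log = [
    (1.0, 40, 0.8182), (1.25, 50, 0.8172), (2.0, 80, 0.8112), (2.5, 100, 0.8414),
    (3.0, 120, 0.8236), (3.75, 150, 0.7962), (4.0, 160, 0.7930), (5.0, 200, 0.8536),
    (6.0, 240, 0.8263), (6.25, 250, 0.8257), (7.0, 280, 0.8475), (7.5, 300, 0.8505),
    (8.0, 320, 0.8499), (8.75, 350, 0.8582), (9.0, 360, 0.8576), (10.0, 400, 0.8576),
]

best = max(training_log, key=lambda row: row[2])  # row with the highest NDCG@10
final = training_log[-1]

print(best)   # (8.75, 350, 0.8582)
print(final)  # (10.0, 400, 0.8576)
```

The gap between the best and final evaluations is small (0.8582 versus 0.8576), and the final value matches the cosine_ndcg@10 reported in the Evaluation section.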

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.2
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.2.0
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}