metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- dataset_size:400
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
base_model: Snowflake/snowflake-arctic-embed-l
widget:
- source_sentence: >-
What actions did Mr and Mrs Harris take that led to the revelation of the
facts in the case?
sentences:
- >-
Perplexity’s marketing activities include promoting on its Instagram
account a massive billboard
in Times Square from September 2024 which read “Congratulations
Perplexity on 250 million
questions answered last month.”5
4 Discover New York with Perplexity, Perplexity AI (last visited Oct.
17, 2024),
https://www.perplexity.ai/encyclopedia/discovernewyork.
5 @perplexity.ai, Instagram (Sept. 4, 2024),
https://www.instagram.com/perplexity.ai/p/C_g2TonSHC5.
Case 1:24-cv-07984 Document 1 Filed 10/21/24 Page 8 of 42
- >-
31
status. It was not until Mr. and Mrs. Harris retained counsel, served
a demand letter on May 22,
2024, met with the then Assistant Superintendent and a lengthy “bulling
investigation” that these
facts came to light.
The Defendant’s actions and conduct, by definition, was arbitrary and
capricious as was
the imposition of discipline that was a gross abuse of discretion when
it served as a catalyst for
this action. Similarly, the Defendants exceeded their authority by
repeatedly doubling down on
their acts and conduct when given the opportunity to reverse course.
The adverse action taken was
not based on sound, objective, adopted and approved policies and
procedures regarding the use of
- >-
website users, and licensing is transacted with individuals and entities
residing in this State and
District. As such, the injuries alleged herein from Perplexity’s
infringement and other unlawful
conduct foreseeably occurred in this State and District. In addition,
Perplexity or its agents reside
in this District and may be found in this State and District.
23.
Defendant Perplexity is subject to the jurisdiction of this Court
pursuant to N.Y.
C.P.L.R. § 302(a)(1) and (3) as it has purposefully directed its
activities at New York and has
Case 1:24-cv-07984 Document 1 Filed 10/21/24 Page 7 of 42
- source_sentence: >-
How did the Plaintiffs demonstrate that the discipline and sanctions
imposed by Hingham were arbitrary and capricious?
sentences:
- >-
27
in the adoption and execution of policies and practices that in their
judgment are needed to preserve
internal order and discipline and to maintain institutional security"),
such deference is not without
limitation. The propriety of and deference afforded to the decision
making is a rebuttable
presumption that may only be undone by a showing that the action taken
was arbitrary and
capricious. See Doe v. Supt. Of Schools of Stoughton, 437 Mass. 1, 5
(2002). The Plaintiffs are
likely to succeed on the merits because they have shown through
Hingham’s own investigation
materials that the discipline and sanctions imposed were arbitrary,
capricious and an abuse of
discretion under the circumstances.
- >-
7
highly competitive curriculum with by and large top grades, a 36 ACT
(highest score possible)
and a varied solid resume. In order for RNH to apply to Stanford by
November 1, which means
submitting no later than October 25th, his transcript issue must be
resolved by early October so
that when RNH requests his transcripts, they reflect grades commensurate
with his achievement
and not marred by the incident that gave rise to this case.
Letter grades of “C” in this type of admissions environment typically
lead to the applicant
being excluded from consideration. Additionally, transcripts and
information regarding any
disciplinary infraction, especially one regarding an academic integrity
infraction, are a substantial
- >-
as one’s own. Id. at ¶106-107. During the project, RNH and his
classmate did not take someone
else’s work or ideas and pass them off as their own. Id. at ¶108. RNH
and his classmate used AI,
which generates and synthesizes new information, and did not pass off
another’s work as their
own. Id. at ¶109. Despite having this information, the Defendants
exceeded the authority granted
to them in an abuse of authority, discretion, and unfettered state
action by unfairly and unjustly
acting as investigator, judge, jury, and executioner in determining the
extreme and outrageous
sanctions imposed upon these Students. Id. at ¶110.
After being unfairly and unjustly accused of cheating, plagiarism, and
academic
- source_sentence: >-
How many students with academic infractions were inducted into the NHS,
and what was one of the reasons for their infractions?
sentences:
- >-
companies that want to utilize popular, high-quality, human-created
journalism for use by the
companies’ AI applications. Revenue received from legitimately-run AI
companies supports the
costs of news gathering. This revenue also establishes that there is a
market for the licensing of
human-generated content for lawful use in AI technologies.
42.
Plaintiffs’ content is highly valued in this market.
Case 1:24-cv-07984 Document 1 Filed 10/21/24 Page 12 of 42
- >-
18
upon in affirming the decision through an appeal to exclude RNH and his
classmate from the NHS.
Id. at ¶145. At that time, Defendant Swanson and other Defendants knew
or should have known
that the District inducted at least seven students into NHS, who had
academic infractions on their
record, one of which was because of the prior use of AI. Id. at
¶146.
The “committee” that adjudicated selection for NHS this year did not
include teachers who
know and are familiar with RNH and his classmate. Id. at ¶147. This is
due to the then escalating
contract conflict with the Hingham Educators Association (“HEA”) where
HEA engaged in an
- >-
is plainly and solely for a commercial purpose. Moreover, upon
information and belief, it copies
into its index every single word of Plaintiffs’ copyrighted works that
it can get its hands on.
Additionally, the use to which it puts these copies is to create a
commercial substitute for Plaintiffs’
protected works – in Perplexity’s own words, to allow and encourage
users to “Skip the Links” to
Plaintiffs’ original works. Such substitution causes substantial harm to
Plaintiffs’ traditional
advertising and subscription revenues. Perplexity’s conduct also harms
Plaintiffs’ additional,
established revenue stream from licensing to more scrupulous AI
companies. Nor is Perplexity’s
- source_sentence: How many pages does the document filed in case 1:24-cv-12437-WGY contain?
sentences:
- Case 1:24-cv-12437-WGY Document 8 Filed 10/08/24 Page 26 of 42
- >-
more specialized in conducting a specific task, responding to prompts
specific to a subject area, or
recognizing nuances in particular questions.
48.
Fine-tuning might also ensure that the LLM responds to certain prompts
by
mimicking a certain linguistic style. For example, outputting a cooking
recipe requires a distinct
output style from recounting the statistics of the Allies’ landing at
Normandy’s beaches on D-Day,
or from writing a poem about the summer wind. A medical treatise uses a
distinct linguistic style
from a sports recap.
Case 1:24-cv-07984 Document 1 Filed 10/21/24 Page 13 of 42
- >-
D. The Balancing Of The Irreparable Harm Heavily Favors The Plaintiffs
The balance of harms in this case clearly favors granting an injunction.
If the injunction is
not granted, RNH will suffer irreparable harm that cannot be adequately
remedied by any future
court decision or monetary compensation. RNH’s academic and professional
future is at stake, as
a delayed resolution of the investigation into academic sanctions could
result in missed deadlines
for college applications, exclusion from consideration at elite
universities, and a permanent stain
on his academic record. The reputational damage and uncertainty caused
will undermine RNH’s
ability to compete fairly with other applicants, affecting not only his
immediate educational
- source_sentence: >-
What challenges do professional journalists and publishers face that may
impact their ability to enforce their intellectual property rights?
sentences:
- >-
ban or prohibition on the use of AI by students. The Defendants were not
trained on any policies
or procedures for use of AI alone, never mind what they were “able to
do” to students who used
it. The entire purpose behind having such policies and procedures in
place is to ensure notice,
equity, fairness and to be sure: a level playing field for all.
Making matters worse, there exists
no adequate procedures and policies for the induction of an applicant
into NHS when compared to
other members who are inducted despite the same or similar infractions.
This is a denial of student
rights of the highest order.
In the case here, RNH was disciplined on an ad hoc and on-going basis
over more than six
- >-
19
respect. They feel very good about it. And in our user interface, even
though we give the answer,
we do show the user exactly where the answer is coming from.”16
68.
As Srinivas surely knows or should know, academic standards for
avoiding
plagiarism are wholly independent from copyright law.17 Dow Jones and
NYP Holdings editors
and journalists are not graduate students working out of a library or
lab, eager to have someone
acknowledge and utilize their research. They are professional
journalists and publishers – working
under high-pressure deadlines, sometimes in dangerous places – whose
livelihoods depend on the
enforcement and monetization of their intellectual property rights.
69.
- >-
example the school committee under Mass. G.L. c. 71, § 37, may punish a
student offender without
a prior rule specifically forbidding the offending conduct; however,
surely such authority cannot
be limitless. Moreover, this court believes that the imposition of a
severe penalty without a
specific promulgated rule might be constitutionally deficient under
certain circumstances.
Id. (emphasis supplied). “What those circumstances are can only be left
to the development of the
case law in the area.” Id. There has been no case law developed in
the area of school discipline
Case 1:24-cv-12437-WGY Document 8 Filed 10/08/24 Page 28 of 42
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy@1
- cosine_accuracy@3
- cosine_accuracy@5
- cosine_accuracy@10
- cosine_precision@1
- cosine_precision@3
- cosine_precision@5
- cosine_precision@10
- cosine_recall@1
- cosine_recall@3
- cosine_recall@5
- cosine_recall@10
- cosine_ndcg@10
- cosine_mrr@10
- cosine_map@100
model-index:
- name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
results:
- task:
type: information-retrieval
name: Information Retrieval
dataset:
name: Unknown
type: unknown
metrics:
- type: cosine_accuracy@1
value: 0.7291666666666666
name: Cosine Accuracy@1
- type: cosine_accuracy@3
value: 0.8541666666666666
name: Cosine Accuracy@3
- type: cosine_accuracy@5
value: 0.9375
name: Cosine Accuracy@5
- type: cosine_accuracy@10
value: 1
name: Cosine Accuracy@10
- type: cosine_precision@1
value: 0.7291666666666666
name: Cosine Precision@1
- type: cosine_precision@3
value: 0.28472222222222215
name: Cosine Precision@3
- type: cosine_precision@5
value: 0.1875
name: Cosine Precision@5
- type: cosine_precision@10
value: 0.09999999999999999
name: Cosine Precision@10
- type: cosine_recall@1
value: 0.7291666666666666
name: Cosine Recall@1
- type: cosine_recall@3
value: 0.8541666666666666
name: Cosine Recall@3
- type: cosine_recall@5
value: 0.9375
name: Cosine Recall@5
- type: cosine_recall@10
value: 1
name: Cosine Recall@10
- type: cosine_ndcg@10
value: 0.8575788154610162
name: Cosine Ndcg@10
- type: cosine_mrr@10
value: 0.8125248015873017
name: Cosine Mrr@10
- type: cosine_map@100
value: 0.8125248015873016
name: Cosine Map@100
SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
This is a sentence-transformers model fine-tuned from Snowflake/snowflake-arctic-embed-l. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
Model Details
Model Description
- Model Type: Sentence Transformer
- Base model: Snowflake/snowflake-arctic-embed-l
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 1024 dimensions
- Similarity Function: Cosine Similarity
Model Sources
- Documentation: Sentence Transformers Documentation
- Repository: Sentence Transformers on GitHub
- Hugging Face: Sentence Transformers on Hugging Face
Full Model Architecture
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
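The three modules above can be read as: run the transformer, take the [CLS] token vector (`pooling_mode_cls_token` is True), then L2-normalize it. The following is a minimal sketch of the last two steps on toy data. The helper names `cls_pool` and `l2_normalize` and the 4-dimensional vectors are illustrative assumptions, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return the first token's vector ([CLS]) as the sentence embedding."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

# Toy token embeddings: 3 tokens, 4 dimensions (the real model uses 1024).
tokens = [
    [3.0, 0.0, 4.0, 0.0],  # [CLS] token
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 2.0, 0.0, 1.0],
]
sentence_embedding = l2_normalize(cls_pool(tokens))
print(sentence_embedding)
```

Because the output has unit length, the dot product of two such embeddings equals their cosine similarity.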
Usage
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("llm-wizard/legal-ft")
# Run inference
sentences = [
'What challenges do professional journalists and publishers face that may impact their ability to enforce their intellectual property rights?',
'19 \nrespect. They feel very good about it. And in our user interface, even though we give the answer, \nwe do show the user exactly where the answer is coming from.”16 \n68. \nAs Srinivas surely knows or should know, academic standards for avoiding \nplagiarism are wholly independent from copyright law.17 Dow Jones and NYP Holdings editors \nand journalists are not graduate students working out of a library or lab, eager to have someone \nacknowledge and utilize their research. They are professional journalists and publishers – working \nunder high-pressure deadlines, sometimes in dangerous places – whose livelihoods depend on the \nenforcement and monetization of their intellectual property rights. \n69.',
'ban or prohibition on the use of AI by students. The Defendants were not trained on any policies \nor procedures for use of AI alone, never mind what they were “able to do” to students who used \nit. The entire purpose behind having such policies and procedures in place is to ensure notice, \nequity, fairness and to be sure: a level playing field for all. Making matters worse, there exists \nno adequate procedures and policies for the induction of an applicant into NHS when compared to \nother members who are inducted despite the same or similar infractions. This is a denial of student \nrights of the highest order. \n \nIn the case here, RNH was disciplined on an ad hoc and on-going basis over more than six',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
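For semantic search, the similarity scores are typically used to rank candidate passages against a query. The sketch below shows that ranking step in plain Python. The 3-dimensional vectors are hypothetical stand-ins for the output of `model.encode(...)`, and `rank_passages` is an illustrative helper, not a library function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted by descending cosine similarity."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical vectors standing in for real 1024-dimensional embeddings.
query = [1.0, 0.0, 0.0]
passages = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_passages(query, passages)
print([index for index, _ in ranking])
```

With real embeddings, the first element of `sentences` above would play the role of `query` and the remaining elements the role of `passages`.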
Evaluation
Metrics
Information Retrieval
- Evaluated with InformationRetrievalEvaluator
Metric | Value |
---|---|
cosine_accuracy@1 | 0.7292 |
cosine_accuracy@3 | 0.8542 |
cosine_accuracy@5 | 0.9375 |
cosine_accuracy@10 | 1.0 |
cosine_precision@1 | 0.7292 |
cosine_precision@3 | 0.2847 |
cosine_precision@5 | 0.1875 |
cosine_precision@10 | 0.1 |
cosine_recall@1 | 0.7292 |
cosine_recall@3 | 0.8542 |
cosine_recall@5 | 0.9375 |
cosine_recall@10 | 1.0 |
cosine_ndcg@10 | 0.8576 |
cosine_mrr@10 | 0.8125 |
cosine_map@100 | 0.8125 |
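The values in the table are consistent with each query having exactly one relevant passage: recall@k equals accuracy@k, and precision@k is accuracy@k divided by k (for example 0.8542 / 3 ≈ 0.2847). That reading is an inference from the numbers, not something the card states. Under that assumption, the metrics can be sketched from the rank of the relevant passage per query. The function `ir_metrics` and the toy ranks are illustrative, not the evaluator's actual code.

```python
import math

def ir_metrics(ranks, k):
    """Compute retrieval metrics at cutoff k.

    ranks: 1-based rank of the single relevant passage for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    accuracy = len(hit_ranks) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most one relevant item in the top k
        "recall": accuracy,         # one relevant item per query
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
        # With one relevant item, the ideal DCG is 1, so NDCG equals DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hit_ranks) / n,
    }

# Toy example: four queries whose relevant passage was ranked 1st, 1st, 2nd, 4th.
toy_ranks = [1, 1, 2, 4]
metrics_at_3 = ir_metrics(toy_ranks, k=3)
print(metrics_at_3)
```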
Training Details
Training Dataset
Unnamed Dataset
- Size: 400 training samples
- Columns: sentence_0 and sentence_1
- Approximate statistics based on the first 400 samples:
Field | sentence_0 | sentence_1 |
---|---|---|
type | string | string |
details | min: 10 tokens, mean: 20.93 tokens, max: 35 tokens | min: 25 tokens, mean: 140.37 tokens, max: 260 tokens |
- Samples:
sentence_0 | sentence_1 |
---|---|
What provisions of the 2023-2024 Handbook were referenced regarding the use of AI and academic integrity? | 13 procedure, expectation, conduct, discipline, sanction or consequence for the use of AI. Id. at ¶102. Under these circumstances, the use of AI was not a violation of the then existing “Academic Integrity: Cheating and Plagiarism” provisions of the 2023-2024 Handbook. Id. at ¶104. As such, accusations of cheating, plagiarism, and academic misconduct or dishonesty were not supported by the record evidence which, at all times relevant, the Defendants have had in their care, custody and control. Id. at ¶105. While there is much dispute as to whether the use of generative AI constitutes plagiarism, plagiarism is defined as the practice of taking someone else’s work or ideas and passing them off |
How is plagiarism defined in the context provided? | 13 procedure, expectation, conduct, discipline, sanction or consequence for the use of AI. Id. at ¶102. Under these circumstances, the use of AI was not a violation of the then existing “Academic Integrity: Cheating and Plagiarism” provisions of the 2023-2024 Handbook. Id. at ¶104. As such, accusations of cheating, plagiarism, and academic misconduct or dishonesty were not supported by the record evidence which, at all times relevant, the Defendants have had in their care, custody and control. Id. at ¶105. While there is much dispute as to whether the use of generative AI constitutes plagiarism, plagiarism is defined as the practice of taking someone else’s work or ideas and passing them off |
What is the case number associated with the document filed on 10/21/24? | program-ad-revenue-sharing-ai-time-fortune-der-spiegel. Case 1:24-cv-07984 Document 1 Filed 10/21/24 Page 21 of 42 |
- Loss: MatryoshkaLoss with these parameters:
{
  "loss": "MultipleNegativesRankingLoss",
  "matryoshka_dims": [768, 512, 256, 128, 64],
  "matryoshka_weights": [1, 1, 1, 1, 1],
  "n_dims_per_step": -1
}
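The parameters above describe a weighted sum of the inner loss evaluated on embeddings truncated to each listed dimension. The following is a rough sketch of that idea in plain Python, not the library's implementation. It assumes cosine similarity and a scale of 20 for the in-batch-negatives loss, and it uses toy 4-dimensional vectors with dims `[4, 2]` in place of `[768, 512, 256, 128, 64]`. The function names `mnr_loss` and `matryoshka_loss` are illustrative.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mnr_loss(queries, passages, scale=20.0):
    """In-batch negatives: passage i is the positive for query i,
    every other passage in the batch is a negative (cross-entropy over rows)."""
    q = [normalize(v) for v in queries]
    p = [normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, pj)) for pj in p]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(q)

def matryoshka_loss(queries, passages, dims, weights):
    """Weighted sum of the inner loss on embeddings truncated to each dim."""
    return sum(
        w * mnr_loss([v[:d] for v in queries], [v[:d] for v in passages])
        for d, w in zip(dims, weights)
    )

# Toy batch of two (question, passage) pairs.
queries = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
passages = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]
loss = matryoshka_loss(queries, passages, dims=[4, 2], weights=[1, 1])
print(loss)
```

Training against every truncation at once is what lets the leading dimensions of an embedding remain useful on their own when a shorter vector is needed.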
Training Hyperparameters
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- num_train_epochs: 10
- multi_dataset_batch_sampler: round_robin
All Hyperparameters
- overwrite_output_dir: False
- do_predict: False
- eval_strategy: steps
- prediction_loss_only: True
- per_device_train_batch_size: 10
- per_device_eval_batch_size: 10
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 1
- eval_accumulation_steps: None
- torch_empty_cache_steps: None
- learning_rate: 5e-05
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1
- num_train_epochs: 10
- max_steps: -1
- lr_scheduler_type: linear
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.0
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_nan_inf_filter: True
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 42
- data_seed: None
- jit_mode_eval: False
- use_ipex: False
- bf16: False
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: adamw_torch
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: every_save
- hub_private_repo: None
- hub_always_push: False
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- dispatch_batches: None
- split_batches: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: False
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- eval_use_gather_object: False
- average_tokens_across_devices: False
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
Training Logs
Epoch | Step | cosine_ndcg@10 |
---|---|---|
1.0 | 40 | 0.8182 |
1.25 | 50 | 0.8172 |
2.0 | 80 | 0.8112 |
2.5 | 100 | 0.8414 |
3.0 | 120 | 0.8236 |
3.75 | 150 | 0.7962 |
4.0 | 160 | 0.7930 |
5.0 | 200 | 0.8536 |
6.0 | 240 | 0.8263 |
6.25 | 250 | 0.8257 |
7.0 | 280 | 0.8475 |
7.5 | 300 | 0.8505 |
8.0 | 320 | 0.8499 |
8.75 | 350 | 0.8582 |
9.0 | 360 | 0.8576 |
10.0 | 400 | 0.8576 |
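The Step column follows from the hyperparameters: 400 training samples at a batch size of 10 give 40 steps per epoch, or 400 steps over 10 epochs. The logged rows match an evaluation at the end of each epoch plus one every 50 steps. The 50-step interval (`eval_steps` below) is an assumption inferred from the table, as is single-device training; neither is stated in the card. A short sketch of that arithmetic:

```python
import math

train_samples = 400  # from "Size: 400 training samples"
batch_size = 10      # per_device_train_batch_size
epochs = 10          # num_train_epochs
eval_steps = 50      # assumption: inferred from the logged steps, not listed above

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Evaluations at every epoch boundary plus every `eval_steps` steps.
epoch_ends = set(range(steps_per_epoch, total_steps + 1, steps_per_epoch))
periodic = set(range(eval_steps, total_steps + 1, eval_steps))
logged_steps = sorted(epoch_ends | periodic)

for step in logged_steps:
    print(step / steps_per_epoch, step)
```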
Framework Versions
- Python: 3.11.11
- Sentence Transformers: 3.4.1
- Transformers: 4.48.2
- PyTorch: 2.5.1+cu124
- Accelerate: 1.3.0
- Datasets: 3.2.0
- Tokenizers: 0.21.0
Citation
BibTeX
Sentence Transformers
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
MatryoshkaLoss
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
MultipleNegativesRankingLoss
@misc{henderson2017efficient,
title={Efficient Natural Language Response Suggestion for Smart Reply},
author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
year={2017},
eprint={1705.00652},
archivePrefix={arXiv},
primaryClass={cs.CL}
}