Yuichi Tateno
hotchpotch's activity
How to track download statistics for static embeddings in sentence-transformers?
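One way to check this (a minimal sketch, not the answer from the thread itself): the Hub exposes per-repository download counters that can be read with `huggingface_hub`. The repo id below is only an example, and the `expand=["downloadsAllTime"]` option assumes a recent `huggingface_hub` version.

```python
# Sketch: read the Hub's download counters for a model repository.
# The repo id is only an example; swap in the static embedding model you track.
from huggingface_hub import HfApi

api = HfApi()

# `downloads` is the rolling ~30-day counter shown on the model page.
info = api.model_info("sentence-transformers/static-retrieval-mrl-en-v1")
print("downloads (last ~30 days):", info.downloads)

# All-time downloads require explicitly expanding that field
# (assumes a recent huggingface_hub version).
info_all = api.model_info(
    "sentence-transformers/static-retrieval-mrl-en-v1",
    expand=["downloadsAllTime"],
)
print("downloads (all time):", info_all.downloads_all_time)
```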
Sorry, there was a mistake in the measurement script.
When I measured again, the result was 99.1% at 512 dims.
I'll revise the article later. Thank you for pointing this out.
I also find this very strange.
At 512 dims, clustering tasks and the like score well, so the [256:512] dimensions may carry a data-dependent bias toward certain kinds of information.
It is also possible that, because the batch size is large at 6144, a bias arose by chance toward the end of training.
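One way to probe this (just a sketch, not something from the discussion): encode a handful of sentences and check how well the cosine-similarity structure of the `[:256]` and `[256:512]` slices correlates with the full-dimension similarities. The model name and sentences here are placeholders.

```python
# Sketch: compare how much of the full-dimension similarity structure each
# slice of a Matryoshka-style static embedding carries on its own.
# Model name and sentences are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1")

sentences = [
    "The cat sits on the mat.",
    "A feline is resting on a rug.",
    "Stock prices fell sharply on Monday.",
    "Markets dropped at the start of the week.",
]
emb = model.encode(sentences)  # shape: (n, full_dim)

def cos_matrix(x: np.ndarray) -> np.ndarray:
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

full = cos_matrix(emb)
iu = np.triu_indices_from(full, k=1)  # upper-triangle sentence pairs only

for name, sl in [("[:256]", slice(0, 256)), ("[256:512]", slice(256, 512))]:
    part = cos_matrix(emb[:, sl])
    corr = np.corrcoef(full[iu], part[iu])[0, 1]
    print(f"corr(full, {name}) = {corr:.3f}")
```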
This is a fantastic approach!
I trained a static embedding Japanese model (static-embedding-japanese) on a large collection of Japanese datasets, and when I compared it on the Japanese Massive Text Embedding Benchmark (JMTEB), it achieved scores only slightly below mE5-small.
JMTEB results
| Model | Avg(micro) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| multilingual-e5-small | 67.71 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| static-embedding-japanese | 67.17 | 67.92 | 80.16 | 67.96 | 91.87 | 40.39 | 62.37 |
Thank you for publishing such an excellent article.
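For anyone who wants to try the same approach, here is a minimal sketch of how a static embedding model is set up with sentence-transformers. The tokenizer repo id and `embedding_dim` are illustrative assumptions, not the exact settings used for static-embedding-japanese.

```python
# Sketch: define a static embedding model for contrastive training.
# Tokenizer repo id and embedding_dim are illustrative, not the exact
# settings behind static-embedding-japanese.
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from tokenizers import Tokenizer

# Any Hub tokenizer that ships a fast tokenizer.json works here.
static = StaticEmbedding(
    Tokenizer.from_pretrained("google-bert/bert-base-multilingual-cased"),
    embedding_dim=1024,
)
model = SentenceTransformer(modules=[static])

# After training (e.g. MultipleNegativesRankingLoss on Japanese sentence
# pairs), encoding is just a token-embedding lookup plus mean pooling,
# so inference is fast even on CPU.
embeddings = model.encode(["静的埋め込みのテスト文です。", "今日は良い天気です。"])
print(embeddings.shape)  # (2, 1024)
```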