Yuichi Tateno
hotchpotch's activity
How to track download statistics for static embeddings in sentence-transformers?
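One way to check this (a minimal sketch, not the answer from the thread itself): the Hub exposes per-repository download counters that can be read with `huggingface_hub`. The repo id below is only an example, and the `expand=["downloadsAllTime"]` option assumes a recent `huggingface_hub` version.

```python
# Sketch: read the Hub's download counters for a model repository.
# The repo id is only an example; swap in the static embedding model you track.
from huggingface_hub import HfApi

api = HfApi()

# `downloads` is the rolling ~30-day counter shown on the model page.
info = api.model_info("sentence-transformers/static-retrieval-mrl-en-v1")
print("downloads (last ~30 days):", info.downloads)

# All-time downloads require explicitly expanding that field
# (assumes a recent huggingface_hub version).
info_all = api.model_info(
    "sentence-transformers/static-retrieval-mrl-en-v1",
    expand=["downloadsAllTime"],
)
print("downloads (all time):", info_all.downloads_all_time)
```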
Sorry, there was a mistake in the measurement script.
When I measured again, the result was 99.1% at 512 dims.
I'll revise the article later. Thank you for pointing this out.
I also find this very strange.
At 512 dims, clustering tasks and the like score well, so the [256:512] dimensions may carry a data-dependent bias toward certain kinds of information.
It is also possible that, because the batch size is large at 6144, a bias arose by chance toward the end of training.
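One way to probe this (just a sketch, not something from the discussion): encode a handful of sentences and check how well the cosine-similarity structure of the `[:256]` and `[256:512]` slices correlates with the full-dimension similarities. The model name and sentences here are placeholders.

```python
# Sketch: compare how much of the full-dimension similarity structure each
# slice of a Matryoshka-style static embedding carries on its own.
# Model name and sentences are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/static-retrieval-mrl-en-v1")

sentences = [
    "The cat sits on the mat.",
    "A feline is resting on a rug.",
    "Stock prices fell sharply on Monday.",
    "Markets dropped at the start of the week.",
]
emb = model.encode(sentences)  # shape: (n, full_dim)

def cos_matrix(x: np.ndarray) -> np.ndarray:
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

full = cos_matrix(emb)
iu = np.triu_indices_from(full, k=1)  # upper-triangle sentence pairs only

for name, sl in [("[:256]", slice(0, 256)), ("[256:512]", slice(256, 512))]:
    part = cos_matrix(emb[:, sl])
    corr = np.corrcoef(full[iu], part[iu])[0, 1]
    print(f"corr(full, {name}) = {corr:.3f}")
```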
This is a fantastic approach!
I trained a static embedding Japanese model (static-embedding-japanese) on a large collection of Japanese datasets, and when I compared it on the Japanese Massive Text Embedding Benchmark (JMTEB), it achieved scores only slightly below mE5-small.
JMTEB results
| Model | Avg(micro) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
|---|---|---|---|---|---|---|---|
| text-embedding-3-small | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
| multilingual-e5-small | 67.71 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
| static-embedding-japanese | 67.17 | 67.92 | 80.16 | 67.96 | 91.87 | 40.39 | 62.37 |
Thank you for publishing such an excellent article.
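For anyone who wants to try the same approach, here is a minimal sketch of how a static embedding model is set up with sentence-transformers. The tokenizer repo id and `embedding_dim` are illustrative assumptions, not the exact settings used for static-embedding-japanese.

```python
# Sketch: define a static embedding model for contrastive training.
# Tokenizer repo id and embedding_dim are illustrative, not the exact
# settings behind static-embedding-japanese.
from sentence_transformers import SentenceTransformer
from sentence_transformers.models import StaticEmbedding
from tokenizers import Tokenizer

# Any Hub tokenizer that ships a fast tokenizer.json works here.
static = StaticEmbedding(
    Tokenizer.from_pretrained("google-bert/bert-base-multilingual-cased"),
    embedding_dim=1024,
)
model = SentenceTransformer(modules=[static])

# After training (e.g. MultipleNegativesRankingLoss on Japanese sentence
# pairs), encoding is just a token-embedding lookup plus mean pooling,
# so inference is fast even on CPU.
embeddings = model.encode(["静的埋め込みのテスト文です。", "今日は良い天気です。"])
print(embeddings.shape)  # (2, 1024)
```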