---
license: apache-2.0
datasets:
- Tevatron/msmarco-passage
language:
- en
base_model:
- ielabgroup/bert-base-uncased-fineweb100bt-smae
---

Model used in [Starbucks: Improved Training for 2D Matryoshka Embeddings](https://arxiv.org/pdf/2410.13230)

This model is a bert-base-uncased-sized model initialized from [ielabgroup/bert-base-uncased-fineweb100bt-smae](https://huggingface.co/ielabgroup/bert-base-uncased-fineweb100bt-smae) and fine-tuned on the MS MARCO dataset with the Starbucks Representation Learning (SRL) method. SRL enables elastic layer-dimension embedding generation for search.

The following layer-dimension pairs are involved during fine-tuning:
[(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

To run inference with all the layer-dimension pairs:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# trained layer-dimension sizes
sizes = [(2, 32), (4, 64), (6, 128), (8, 256), (10, 512), (12, 768)]

tokenizer = AutoTokenizer.from_pretrained("ielabgroup/Starbucks-msmarco")
model = AutoModel.from_pretrained("ielabgroup/Starbucks-msmarco").eval()

query = ["What is the capital of France?"]
passages = ["The capital of France is Paris.", "China's capital is Beijing."]

inputs = tokenizer(query, return_tensors="pt")
outputs = model(**inputs, return_dict=True, output_hidden_states=True)

# CLS token embeddings for each layer-dimension size
query_embeddings = [outputs.hidden_states[layer][:, 0, :dim] for layer, dim in sizes]

passage_inputs = tokenizer(passages, return_tensors="pt", padding=True, truncation=True)
passage_outputs = model(**passage_inputs, return_dict=True, output_hidden_states=True)
passage_embeddings = [passage_outputs.hidden_states[layer][:, 0, :dim] for layer, dim in sizes]

for (layer, dim), query_embedding, passage_embedding in zip(sizes, query_embeddings, passage_embeddings):
    scores = torch.matmul(query_embedding, passage_embedding.T)
    print(f"Layer {layer}, Dimension {dim}, Scores: {scores.tolist()}")
```

output:

```
Layer 2, Dimension 32, Scores: [[28.416101455688477, 18.783443450927734]]
Layer 4, Dimension 64, Scores: [[29.415122985839844, 20.81881332397461]]
Layer 6, Dimension 128, Scores: [[29.515695571899414, 20.07825469970703]]
Layer 8, Dimension 256, Scores: [[33.34524154663086, 23.34392738342285]]
Layer 10, Dimension 512, Scores: [[80.5205307006836, 63.31733322143555]]
Layer 12, Dimension 768, Scores: [[181.57217407226562, 171.1049346923828]]
```

To run inference with a target layer and dimension size (i.e., extract a smaller model out of it):

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoConfig

# extracted layer-dimension sizes
num_layer = 2
dim = 32

tokenizer = AutoTokenizer.from_pretrained("ielabgroup/Starbucks-msmarco")
config = AutoConfig.from_pretrained("ielabgroup/Starbucks-msmarco", num_hidden_layers=num_layer)
model = AutoModel.from_pretrained("ielabgroup/Starbucks-msmarco", config=config).eval()

print(len(model.encoder.layer))  # only has 2 layers

query = ["What is the capital of France?"]
passages = ["The capital of France is Paris.", "China's capital is Beijing."]

inputs = tokenizer(query, return_tensors="pt")
query_embeddings = model(**inputs, return_dict=True)[0][:, 0, :dim]
print(query_embeddings.shape)  # torch.Size([1, 32])

passage_inputs = tokenizer(passages, return_tensors="pt", padding=True, truncation=True)
passage_embeddings = model(**passage_inputs, return_dict=True)[0][:, 0, :dim]

scores = torch.matmul(query_embeddings, passage_embeddings.T)
print(scores.tolist())
```

output:

```
[[28.416101455688477, 18.783443450927734]]
```
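
If you want to reuse the extracted sub-model without reloading the full checkpoint each time, the standard `save_pretrained` / `from_pretrained` round trip works on the truncated model. Below is a minimal sketch, assuming the 2-layer / 32-dimension setting from the example above; the output directory name is only illustrative. Note that the embedding-dimension cut is not stored in the checkpoint, so the CLS embedding is still sliced to `dim` at inference time.

```python
from transformers import AutoTokenizer, AutoModel, AutoConfig

# Assumptions: the 2-layer / 32-dim extraction from the previous example;
# the output directory name is only an illustration.
num_layer, dim = 2, 32
save_dir = "starbucks-msmarco-2layer-32dim"

tokenizer = AutoTokenizer.from_pretrained("ielabgroup/Starbucks-msmarco")
config = AutoConfig.from_pretrained("ielabgroup/Starbucks-msmarco", num_hidden_layers=num_layer)
model = AutoModel.from_pretrained("ielabgroup/Starbucks-msmarco", config=config).eval()

# Persist only the truncated encoder (2 layers) together with the tokenizer
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload later as a standalone small encoder
small_model = AutoModel.from_pretrained(save_dir).eval()
small_tokenizer = AutoTokenizer.from_pretrained(save_dir)

# The saved checkpoint still emits 768-dim hidden states,
# so the dimension cut to `dim` is applied at inference time.
inputs = small_tokenizer(["What is the capital of France?"], return_tensors="pt")
embeddings = small_model(**inputs, return_dict=True)[0][:, 0, :dim]
print(embeddings.shape)  # torch.Size([1, 32])
```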