intfloat/multilingual-e5-large

Lue-C

27 days ago

Hi,
I want to use multilingual-e5-large with FAISS and have some code working, but the results are much worse than the ones I get with ChromaDB. I suspect the cause is the index type or the norm (similarity metric) used for the vector space. This is my code:

import numpy as np
import faiss
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
import pickle

text_chunks = ["Dies ist Chunk 1", "Dies ist Chunk 2", "Dies ist Chunk 3"]
metadatas = [
    {"source": "Dokument A", "page": 1},
    {"source": "Dokument B", "page": 5},
    {"source": "Dokument C", "page": 3}
]

# initializing embedding model
model = HuggingFaceEmbeddings(cache_folder="../models/embeddings/multilingual-e5-large_new", model_name='intfloat/multilingual-e5-large')

# converting text chunks to vectors
embeddings = model.embed_documents(text_chunks)

# convert to float32
embeddings = np.array(embeddings).astype('float32')

# get dimension of vector space
dimension = embeddings.shape[1]

# create faiss index
index = faiss.IndexFlatIP(dimension)

# add to index
index.add(embeddings)

# save index
faiss.write_index(index, "text_chunks_index.faiss")

# saving metadata
with open("metadatas.pkl", "wb") as f:
    pickle.dump(metadatas, f)

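# defining and embedding query (embed_query takes a single string and returns one vector)
query = "Just an example query"
query_vector = np.array([model.embed_query(query)]).astype('float32')

# reload the saved index and search for similar vectors
loaded_index = faiss.read_index("text_chunks_index.faiss")
k = 2
distances, indices = loaded_index.search(query_vector, k)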

print("\n search results:")
for i, idx in enumerate(indices[0]):
    print(f"text: {text_chunks[idx]}")
    print(f"metadata: {metadatas[idx]}")
    print(f"Inner product: {distances[0][i]}")
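A possible factor in the FAISS snippet above: `IndexFlatIP` ranks by raw inner product, which equals cosine similarity only when both the stored vectors and the query are L2-normalized. The sketch below uses NumPy only. `l2_normalize` and `inner_product_search` are hypothetical helpers standing in for normalization and a flat inner-product index, and the toy vectors are made up for illustration. It makes no claim about what the embedding wrapper returns by default.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return vectors / norms


def inner_product_search(corpus: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force top-k by inner product, highest score first."""
    scores = queries @ corpus.T
    order = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, order, axis=1), order


# Toy data (assumption): row 0 is long but poorly aligned with the query,
# row 1 is short but points almost exactly in the query's direction.
corpus = np.array([[10.0, 10.0], [1.0, 0.0], [0.0, 1.0]], dtype="float32")
query = np.array([[1.0, 0.1]], dtype="float32")

# Raw inner product: vector length influences the ranking.
_, raw_ids = inner_product_search(corpus, query, k=2)

# Inner product on unit vectors: this is cosine similarity.
_, cos_ids = inner_product_search(l2_normalize(corpus), l2_normalize(query), k=2)

print("raw inner product ranking:", raw_ids[0])
print("cosine ranking:           ", cos_ids[0])
```

With real embeddings, `faiss.normalize_L2(x)` performs the same row normalization in place on a float32 array; it would be applied to the document matrix before `index.add(x)` and to the query matrix before `search`.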

When I use a Chroma database instead, like this:

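import chromadb
from langchain_community.vectorstores import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# docs (the chunked documents) and db_directory (the persist path) are defined elsewhere
embeddings = HuggingFaceEmbeddings(cache_folder='../models/embeddings/multilingual-e5-large_new', model_name='intfloat/multilingual-e5-large')
db = Chroma.from_documents(docs, embeddings, persist_directory=db_directory)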

query = "Just an example query"
matches = db.similarity_search_with_relevance_scores(query, k=2)

I get far better results. How can I adjust the FAISS code to get results similar or equal to those obtained with Chroma?

Regards
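
On the comparison between the two stores: as far as I know, Chroma collections use squared L2 distance unless configured otherwise, while the FAISS code uses inner product. For unit-length vectors the two produce the same ranking, because ||a - b||^2 = 2 - 2 (a · b). Below is a minimal NumPy check of that identity. It uses random stand-in vectors rather than real E5 embeddings, and `unit_rows` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)


def unit_rows(x: np.ndarray) -> np.ndarray:
    """Return a copy of x with every row scaled to unit L2 norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


# Random stand-ins for document and query embeddings (assumption: 16 dims).
docs = unit_rows(rng.normal(size=(50, 16)))
query = unit_rows(rng.normal(size=(1, 16)))

inner = (query @ docs.T)[0]                 # similarity: higher is better
sq_l2 = ((docs - query) ** 2).sum(axis=1)   # distance: lower is better

# For unit vectors, squared L2 distance is an affine function of the inner product.
identity_holds = np.allclose(sq_l2, 2.0 - 2.0 * inner)

# Hence sorting by descending inner product equals sorting by ascending distance.
ip_ranking = np.argsort(-inner)
l2_ranking = np.argsort(sq_l2)

print("identity holds:", identity_holds)
print("same ranking:  ", np.array_equal(ip_ranking, l2_ranking))
```

If that assumption about the default metric is right, normalizing the vectors on the FAISS side should make the two rankings comparable. The raw scores would still differ, since one is a similarity and the other a distance.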

Usage with Faiss