Usage with Faiss
#32 by hakanbicer9608 - opened
Hi,
I want to use multilingual-e5-large with FAISS and have some code working, but the results are much worse than the ones I get with chromadb. I suspect this is due to the index type or the norm used in the vector space. This is my code:
import numpy as np
import faiss
from langchain_huggingface.embeddings.huggingface import HuggingFaceEmbeddings
import pickle
text_chunks = ["Dies ist Chunk 1", "Dies ist Chunk 2", "Dies ist Chunk 3"]
metadatas = [
    {"source": "Dokument A", "page": 1},
    {"source": "Dokument B", "page": 5},
    {"source": "Dokument C", "page": 3},
]
# initializing embedding model
model = HuggingFaceEmbeddings(cache_folder="../models/embeddings/multilingual-e5-large_new", model_name='intfloat/multilingual-e5-large')
# converting text chunks to vectors
embeddings = model.embed_documents(text_chunks)
# convert to float32
embeddings = np.array(embeddings).astype('float32')
# get dimension of vector space
dimension = embeddings.shape[1]
# create faiss index
index = faiss.IndexFlatIP(dimension)
# add to index
index.add(embeddings)
# save index
faiss.write_index(index, "text_chunks_index.faiss")
# saving metadata
with open("metadatas.pkl", "wb") as f:
    pickle.dump(metadatas, f)
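# load the saved index
loaded_index = faiss.read_index("text_chunks_index.faiss")
# defining and embedding query
query = "Just an example query"
# embed_query takes a single string (embed_documents expects a list of strings);
# FAISS needs a float32 array of shape (1, dimension)
query_vector = np.array([model.embed_query(query)], dtype='float32')
# search for similar vectors
k = 2
distances, indices = loaded_index.search(query_vector, k)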
print("\n search results:")
for i, idx in enumerate(indices[0]):
    print(f"text: {text_chunks[idx]}")
    print(f"metadata: {metadatas[idx]}")
    print(f"Inner product: {distances[0][i]}")
When I use a Chroma database like this:
import chromadb
from langchain.vectorstores import Chroma
embeddings = HuggingFaceEmbeddings(cache_folder='../models/embeddings/multilingual-e5-large_new', model_name='intfloat/multilingual-e5-large')
# docs (a list of LangChain Document objects) and db_directory are defined elsewhere
db = Chroma.from_documents(docs, embeddings, persist_directory=db_directory)
query = "Just an example query"
matches = db.similarity_search_with_relevance_scores(query, k=2)
I get far better results. How can I adjust the faiss code to get results similar or equal to those obtained with chroma?
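One thing that may be worth checking, offered as a minimal sketch rather than a confirmed fix: `IndexFlatIP` returns raw inner products, and those only equal cosine similarity when all vectors have unit length. The example below uses made-up 2-D vectors (not real E5 embeddings) and a hypothetical helper `l2_normalize` to show how normalization can change which item ranks first. Whether this actually explains the gap to Chroma is an assumption, not something verified here.

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Return a copy of `vectors` with every row scaled to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return vectors / np.clip(norms, 1e-12, None)


# Toy data (illustrative only, not real embeddings):
# row 0 points in the same direction as the query but is short,
# row 1 points in a different direction but is long.
docs = np.array([[1.0, 1.0], [10.0, 0.0]], dtype="float32")
query = np.array([[1.0, 1.0]], dtype="float32")

# Inner product on the raw vectors (what an inner-product index scores).
raw_scores = docs @ query.T
# Inner product on unit-length vectors, i.e. cosine similarity.
cos_scores = l2_normalize(docs) @ l2_normalize(query).T

raw_best = int(np.argmax(raw_scores))
cos_best = int(np.argmax(cos_scores))
print("best match by raw inner product:", raw_best)
print("best match by cosine similarity:", cos_best)
```

In the FAISS code above, the equivalent step would be to normalize both `embeddings` (before `index.add`) and `query_vector` (before `search`), for example with `faiss.normalize_L2`, which normalizes a float32 array in place. Another option that might work is passing `encode_kwargs={"normalize_embeddings": True}` to `HuggingFaceEmbeddings` so the model returns unit-length vectors directly.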
Regards