Every time I add my JSONL to a new ChromaDB collection, it says that my vector dimension is 384 when it should be 1024.
Something seems to be going wrong with the ChromaDB insertion, but I can't figure it out.
I start with a JSONL file in which each line has an embedding (size 1024), an id, metadata, and a document.
I have checked that the embedding on every line has length 1024.
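A check along these lines can confirm that. It is only a minimal sketch, assuming each line is a JSON object with an `embedding` list; the helper name `embedding_dims` is made up for illustration.

```python
import json
from collections import Counter


def embedding_dims(path, key="embedding"):
    """Count how many records have each embedding length."""
    dims = Counter()
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            dims[len(record[key])] += 1
    return dims


# Example call (file name taken from the insertion code):
# print(embedding_dims("810_and_embeddings.jsonl"))
```

If every embedding really has 1024 entries, the returned counter has a single key, 1024.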
Then I check which ChromaDB collections exist (making sure there are none), create a collection, and insert my JSONL into it. But when I attempt to query, I get: InvalidDimensionException: Embedding dimension 384 does not match collection dimensionality 1024.
I've deleted everything and started over from scratch. What else should I try?
This is the insertion code:
import chromadb
import json

# Initialize the client (modify the path if needed)
client = chromadb.PersistentClient(path="./chromadb_store")

# Lists that will hold the parsed JSONL fields
documents = []
embeddings = []
metadatas = []
ids = []

# Read JSONL and extract data
with open('810_and_embeddings.jsonl', 'r') as f:
    for line in f:
        data = json.loads(line)
        documents.append(data['document'])
        embeddings.append(data['embedding'])
        metadatas.append(data['metadata'])
        ids.append(str(data['id']))  # Ensure the ID is a string

# Create or load a collection
collection_name = "df_810"
try:
    collection = client.get_collection(name=collection_name)  # Try to load existing collection
    print(f"Collection '{collection_name}' loaded.")
except Exception:
    collection = client.create_collection(name=collection_name)  # If it doesn't exist, create it
    print(f"Collection '{collection_name}' created.")

# Set batch size for processing
batch_size = 100  # Adjust this based on memory constraints

# Function to insert data in batches
def insert_in_batches(documents, embeddings, metadatas, ids, batch_size):
    # Ceiling division, so an exact multiple of batch_size is not over-counted
    total_batches = (len(documents) + batch_size - 1) // batch_size
    for i in range(0, len(documents), batch_size):
        # Get the batch slice
        batch_docs = documents[i:i + batch_size]
        batch_embeddings = embeddings[i:i + batch_size]
        batch_metadatas = metadatas[i:i + batch_size]
        batch_ids = ids[i:i + batch_size]
        # Add the batch to the collection
        collection.add(
            documents=batch_docs,
            embeddings=batch_embeddings,
            metadatas=batch_metadatas,
            ids=batch_ids
        )
        print(f"Inserted batch {i // batch_size + 1} of {total_batches}")

# Insert data in batches
insert_in_batches(documents, embeddings, metadatas, ids, batch_size)
print("All documents and embeddings inserted successfully.")
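The query code is not shown, so the following rests on an assumption worth testing: if the query passes `query_texts`, Chroma embeds that text with the collection's embedding function, and the default one (all-MiniLM-L6-v2) produces 384-dimensional vectors, which would match the number in the error. A minimal sketch of querying with a precomputed vector instead is below. The helper name `query_by_embedding` and its `expected_dim` parameter are hypothetical, and the query vector is assumed to come from the same model that produced the 1024-dimensional embeddings in the JSONL.

```python
def query_by_embedding(collection, query_embedding, expected_dim=1024, n_results=5):
    """Query a Chroma collection with a precomputed embedding.

    Passing query_embeddings (rather than query_texts) means Chroma does not
    re-embed the query with the collection's embedding function.
    """
    if len(query_embedding) != expected_dim:
        raise ValueError(
            f"Query embedding has {len(query_embedding)} dimensions, "
            f"expected {expected_dim}"
        )
    return collection.query(
        query_embeddings=[list(query_embedding)],
        n_results=n_results,
    )
```

Here `collection` would be the object returned by `client.get_collection(name="df_810")`. The length check fails early with a plain `ValueError` if the query vector does not have the collection's dimensionality.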