This is my output for file upload:
2025-01-19T21:50:38.661133+00:00 app[web.1]: Global session ID set to: 09a86o2ou8p5
2025-01-19T21:50:38.723501+00:00 app[web.1]: fucking initializing session vector store via file embedder initialization with session id: 09a86o2ou8p5
2025-01-19T21:50:39.146951+00:00 app[web.1]: WARNING:langchain_community.vectorstores.pgvector:Collection not found
2025-01-19T21:50:39.234423+00:00 app[web.1]: fucking initializing persistent vector store TAR
2025-01-19T21:50:39.559845+00:00 app[web.1]: uploaded file_extension: pdf
2025-01-19T21:50:39.559902+00:00 app[web.1]: loader_cls: <class 'langchain_community.document_loaders.pdf.UnstructuredPDFLoader'>
2025-01-19T21:50:39.559927+00:00 app[web.1]: directory path str: ./uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559928+00:00 app[web.1]: DirectoryLoader absolute path: /uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559937+00:00 app[web.1]: Contents of the directory:
2025-01-19T21:50:39.559994+00:00 app[web.1]: Root: ./uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559995+00:00 app[web.1]: Directories: []
2025-01-19T21:50:39.560007+00:00 app[web.1]: Files: ['CAI2025_paper_6124.pdf']
2025-01-19T21:50:40.210297+00:00 app[web.1]:
2025-01-19T21:50:40.210334+00:00 app[web.1]: docs: []
2025-01-19T21:50:40.210353+00:00 app[web.1]: docs after chunking: []
for this code:
async def process_documents(self, directory_path: Path, file_path, glob_pattern, loader_cls, text_splitter, embeddings, session_vector_store=None, unique_id=None, session_id=None):
    """Load the files in directory_path matching glob_pattern, split them into
    chunks, and embed them into the persistent (and optionally the session)
    pgvector store."""
    print(f"loader_cls: {loader_cls}")
    directory_path_str = str(directory_path)
    absolute_path = os.path.abspath(directory_path_str)
    directory_path_str = f"./{directory_path_str}"
    print(f"directory path str: {directory_path_str}")
    print(f"DirectoryLoader absolute path: {absolute_path}")
    # List contents of the directory
    try:
        print("Contents of the directory:")
        for root, dirs, files in os.walk(directory_path_str):
            print(f"Root: {root}")
            print(f"Directories: {dirs}")
            print(f"Files: {files}")
    except Exception as e:
        print(f"Error listing directory contents: {e}")
        raise
    extracted_extension = glob_pattern.split('.')[-1]
    # Read files matching the glob pattern with the given loader class
    loader = DirectoryLoader(
        directory_path_str,
        glob=glob_pattern,  # Use the specified glob pattern
        use_multithreading=True,
        show_progress=True,
        max_concurrency=50,
        loader_cls=loader_cls,
    )
    docs = loader.load()
    print(f"docs: {docs}")
    chunks = docs
    # Split documents into meaningful chunks (CSV and JSON rows are kept as-is)
    if extracted_extension not in ('csv', 'json'):
        chunks = text_splitter.split_documents(docs)
    print(f"docs after chunking: {chunks}")
    store, current_embeddings = await self.persistent_vector_store.from_documents(
        documents=chunks,
        embedding=embeddings,
        chunk_size=10,
        collection_name=PG_COLLECTION_NAME,
        connection_string=os.getenv("POSTGRES_URL"),
        pre_delete_collection=False,  # Controls whether to clear the collection before adding new docs
    )
    if session_vector_store:
        store, current_embeddings = await session_vector_store.from_documents(
            documents=chunks,
            embedding=embeddings,
            chunk_size=10,
            collection_name=session_id,
            connection_string=os.getenv("POSTGRES_URL"),
            pre_delete_collection=False,  # Controls whether to clear the collection before adding new docs
        )
    return current_embeddings
As you can see, CAI2025_paper_6124.pdf exists in ./uploaded_files/09a86o2ou8p5.
Nevertheless, DirectoryLoader fails to load any docs from that directory, but only in the deployed Heroku environment.
I have tried listing the files in the directory with:
try:
    print("Contents of the directory:")
    for root, dirs, files in os.walk(directory_path_str):
        print(f"Root: {root}")
        print(f"Directories: {dirs}")
        print(f"Files: {files}")
except Exception as e:
    print(f"Error listing directory contents: {e}")
    raise
I have also tried making DirectoryLoader read from absolute_path instead of directory_path_str.
I tested locally, and it always works locally.
I am desperate to find a fix for DirectoryLoader in the Heroku deployed environment.
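One sanity check worth running first is whether the glob pattern itself could be at fault. The sketch below (stdlib only; the temporary directory is a hypothetical stand-in for ./uploaded_files/09a86o2ou8p5) exercises the same pathlib-style match that DirectoryLoader performs against the directory:

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for the session upload directory; the real code
# points DirectoryLoader at ./uploaded_files/<session id> instead.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CAI2025_paper_6124.pdf").write_bytes(b"%PDF-1.4")
    # DirectoryLoader matches files with pathlib-style globbing, so the
    # same pattern can be exercised directly; "**" also matches files at
    # the top level of the directory.
    matches = [p.name for p in Path(tmp).glob("**/*.pdf")]

print(matches)
```

Since the pattern matches both locally and in this sketch, an empty `docs` list points at the per-file loader failing rather than at file discovery.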
Asked Jan 19 at 22:14 by Wonjae Oh
1 Answer
The problem had to do with missing OpenGL modules in the Debian environment: unstructured's PDF pipeline transitively imports OpenCV, which needs libGL.so.1 (and glib) at runtime, and without those system packages every file failed to load, leaving DirectoryLoader's result empty.
Adding libgl1 and libglib2.0-0 to the apt-get install step in the Dockerfile solved the problem:
RUN apt-get update && apt-get install -y \
libgl1 \
libglib2.0-0 \
libcairo2 \
libpango1.0-0 \
libgdk-pixbuf2.0-0 \
shared-mime-info \
libgirepository1.0-dev \
gir1.2-pango-1.0 \
gir1.2-gdkpixbuf-2.0 \
gir1.2-cairo-1.0 \
python3-gi \
python3-cairo \
git \
build-essential \
curl \
&& rm -rf /var/lib/apt/lists/*
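A quick way to confirm this diagnosis from inside the dyno (for example via `heroku run bash`) is to try importing OpenCV directly. This is a diagnostic sketch, not part of the original code, and it assumes an OpenCV Python package is installed as an application dependency:

```python
# Diagnostic: if libGL.so.1 is missing from the image, importing cv2 fails
# with "ImportError: libGL.so.1: cannot open shared object file".
try:
    import cv2  # pulled in transitively by unstructured's PDF partitioning
    cv2_ok, cv2_detail = True, cv2.__version__
except ImportError as exc:
    cv2_ok, cv2_detail = False, str(exc)

print("OpenCV import OK:" if cv2_ok else "OpenCV import failed:", cv2_detail)
```

If the import fails here, no LangChain-level change to paths or glob patterns will help; the fix belongs at the image level, as in the Dockerfile above.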
Tags: deployment - How to make langchain DirectoryLoader work in deployed environment in Heroku - Stack Overflow