How to make langchain DirectoryLoader work in deployed environment in Heroku

This is my output for file upload:

2025-01-19T21:50:38.661133+00:00 app[web.1]: Global session ID set to: 09a86o2ou8p5
2025-01-19T21:50:38.723501+00:00 app[web.1]: fucking initializing session vector store via file embedder initialization with session id: 09a86o2ou8p5
2025-01-19T21:50:39.146951+00:00 app[web.1]: WARNING:langchain_community.vectorstores.pgvector:Collection not found
2025-01-19T21:50:39.234423+00:00 app[web.1]: fucking initializing persistent vector store TAR
2025-01-19T21:50:39.559845+00:00 app[web.1]: uploaded file_extension: pdf
2025-01-19T21:50:39.559902+00:00 app[web.1]: loader_cls: <class 'langchain_community.document_loaders.pdf.UnstructuredPDFLoader'>
2025-01-19T21:50:39.559927+00:00 app[web.1]: directory path str: ./uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559928+00:00 app[web.1]: DirectoryLoader absolute path: /uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559937+00:00 app[web.1]: Contents of the directory:
2025-01-19T21:50:39.559994+00:00 app[web.1]: Root: ./uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559995+00:00 app[web.1]: Directories: []
2025-01-19T21:50:39.560007+00:00 app[web.1]: Files: ['CAI2025_paper_6124.pdf']
2025-01-19T21:50:40.210297+00:00 app[web.1]: 
2025-01-19T21:50:40.210334+00:00 app[web.1]: docs: []
2025-01-19T21:50:40.210353+00:00 app[web.1]: docs after chunking: []

for this code:

async def process_documents(self, directory_path: Path, file_path, glob_pattern, loader_cls, text_splitter, embeddings, session_vector_store=None, unique_id=None, session_id=None):
    """Load documents from directory_path with the given loader class and glob
    pattern, split them into chunks, and embed them into the persistent (and
    optionally the session) vector store."""
    print(f"loader_cls: {loader_cls}")
    directory_path_str = f"./{directory_path}"
    absolute_path = os.path.abspath(directory_path_str)
    print(f"directory path str: {directory_path_str}")
    print(f"DirectoryLoader absolute path: {absolute_path}")

    # List the contents of the directory for debugging
    try:
        print("Contents of the directory:")
        for root, dirs, files in os.walk(directory_path_str):
            print(f"Root: {root}")
            print(f"Directories: {dirs}")
            print(f"Files: {files}")
    except Exception as e:
        print(f"Error listing directory contents: {e}")
        raise

    extracted_extension = glob_pattern.split('.')[-1]

    # Load the files matching the glob pattern with the supplied loader class
    loader = DirectoryLoader(
        directory_path_str,
        glob=glob_pattern,
        use_multithreading=True,
        show_progress=True,
        max_concurrency=50,
        loader_cls=loader_cls,
    )
    docs = loader.load()
    print(f"docs: {docs}")

    # Split documents into meaningful chunks (CSV and JSON are kept whole)
    chunks = docs
    if extracted_extension not in ('csv', 'json'):
        chunks = text_splitter.split_documents(docs)

    print(f"docs after chunking: {chunks}")

    store, current_embeddings = await self.persistent_vector_store.from_documents(
        documents=chunks,
        embedding=embeddings,
        chunk_size=10,
        collection_name=PG_COLLECTION_NAME,
        connection_string=os.getenv("POSTGRES_URL"),
        pre_delete_collection=False,  # Keep existing docs in the collection
    )

    if session_vector_store:
        store, current_embeddings = await session_vector_store.from_documents(
            documents=chunks,
            embedding=embeddings,
            chunk_size=10,
            collection_name=session_id,
            connection_string=os.getenv("POSTGRES_URL"),
            pre_delete_collection=False,  # Keep existing docs in the collection
        )

    return current_embeddings

As you can see, CAI2025_paper_6124.pdf exists in ./uploaded_files/09a86o2ou8p5

Nevertheless, DirectoryLoader fails to load any docs given the directory, only in the deployed environment in Heroku.
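One way to narrow this down (a minimal sketch using only the standard library; the path is the session directory from the logs above, and `**/*.pdf` is an assumed glob pattern — substitute the real one): as far as I know, DirectoryLoader resolves its glob via `Path.glob` (or `Path.rglob` when `recursive=True`), so checking the pattern directly against the directory tells you whether the glob matching or the loader class is the part that fails.

```python
from pathlib import Path

# Hypothetical sanity check: if this prints no matches, the glob pattern
# (not the loader class) is the problem. Path.glob on a missing directory
# simply yields nothing, so this is safe to run anywhere.
directory = Path("./uploaded_files/09a86o2ou8p5")  # session dir from the logs
pattern = "**/*.pdf"  # assumed glob_pattern; substitute the real one
matches = sorted(directory.glob(pattern))
print(matches)
```

In the failing run above, matching files clearly do exist, which points the suspicion at the loader class rather than the glob.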

I have tried listing the files in the directory with:

try:
    print("Contents of the directory:")
    for root, dirs, files in os.walk(directory_path_str):
        print(f"Root: {root}")
        print(f"Directories: {dirs}")
        print(f"Files: {files}")
except Exception as e:
    print(f"Error listing directory contents: {e}")
    raise

I have also tried making DirectoryLoader read from absolute_path instead of directory_path_str.

I tested locally, and it always works locally.

I am desperate to find a fix for DirectoryLoader in the Heroku deployed environment.

asked Jan 19 at 22:14 by Wonjae Oh

1 Answer

The problem turned out to be missing OpenGL system libraries in the Debian-based container. Adding these two packages:

libgl1 \
libglib2.0-0 \

to the apt-get install step in the Dockerfile solved the problem:

RUN apt-get update && apt-get install -y \
    libgl1 \
    libglib2.0-0 \
    libcairo2 \
    libpango1.0-0 \
    libgdk-pixbuf2.0-0 \
    shared-mime-info \
    libgirepository1.0-dev \
    gir1.2-pango-1.0 \
    gir1.2-gdkpixbuf-2.0 \
    gir1.2-cairo-1.0 \
    python3-gi \
    python3-cairo \
    git \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*
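To confirm inside the running container that the libraries are now resolvable, a quick standard-library check is possible (a sketch, assuming `GL` and `glib-2.0` are the linker names behind the libgl1 and libglib2.0-0 packages):

```python
import ctypes.util

# Ask the dynamic linker whether each shared library can be located;
# prints the resolved filename, or NOT FOUND if it is still missing.
for name in ("GL", "glib-2.0"):
    resolved = ctypes.util.find_library(name)
    print(f"lib{name}: {resolved or 'NOT FOUND'}")
```

If libGL still prints NOT FOUND after rebuilding the image, DirectoryLoader's PDF pipeline is likely to keep failing silently with empty results.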
