How to make langchain DirectoryLoader work in deployed environment in Heroku

This is my output for file upload:

2025-01-19T21:50:38.661133+00:00 app[web.1]: Global session ID set to: 09a86o2ou8p5
2025-01-19T21:50:38.723501+00:00 app[web.1]: fucking initializing session vector store via file embedder initialization with session id: 09a86o2ou8p5
2025-01-19T21:50:39.146951+00:00 app[web.1]: WARNING:langchain_community.vectorstores.pgvector:Collection not found
2025-01-19T21:50:39.234423+00:00 app[web.1]: fucking initializing persistent vector store TAR
2025-01-19T21:50:39.559845+00:00 app[web.1]: uploaded file_extension: pdf
2025-01-19T21:50:39.559902+00:00 app[web.1]: loader_cls: <class 'langchain_community.document_loaders.pdf.UnstructuredPDFLoader'>
2025-01-19T21:50:39.559927+00:00 app[web.1]: directory path str: ./uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559928+00:00 app[web.1]: DirectoryLoader absolute path: /uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559937+00:00 app[web.1]: Contents of the directory:
2025-01-19T21:50:39.559994+00:00 app[web.1]: Root: ./uploaded_files/09a86o2ou8p5
2025-01-19T21:50:39.559995+00:00 app[web.1]: Directories: []
2025-01-19T21:50:39.560007+00:00 app[web.1]: Files: ['CAI2025_paper_6124.pdf']
2025-01-19T21:50:40.210297+00:00 app[web.1]: 
2025-01-19T21:50:40.210334+00:00 app[web.1]: docs: []
2025-01-19T21:50:40.210353+00:00 app[web.1]: docs after chunking: []

for this code:

async def process_documents(self, directory_path: Path, file_path, glob_pattern, loader_cls, text_splitter, embeddings, session_vector_store=None, unique_id=None, session_id=None):
    """Load documents from directory_path with the given loader class and glob
    pattern, split them into chunks, and embed them into the persistent (and
    optionally the session) vector store."""
    print(f"loader_cls: {loader_cls}")
    directory_path_str = f"./{directory_path}"
    absolute_path = os.path.abspath(directory_path_str)
    print(f"directory path str: {directory_path_str}")
    print(f"DirectoryLoader absolute path: {absolute_path}")

    # List the contents of the directory for debugging
    try:
        print("Contents of the directory:")
        for root, dirs, files in os.walk(directory_path_str):
            print(f"Root: {root}")
            print(f"Directories: {dirs}")
            print(f"Files: {files}")
    except Exception as e:
        print(f"Error listing directory contents: {e}")
        raise

    extracted_extension = glob_pattern.split('.')[-1]

    # Load the files matching the glob pattern with the supplied loader class
    loader = DirectoryLoader(
        directory_path_str,
        glob=glob_pattern,
        use_multithreading=True,
        show_progress=True,
        max_concurrency=50,
        loader_cls=loader_cls,
    )
    docs = loader.load()
    print(f"docs: {docs}")

    # Split documents into meaningful chunks (CSV and JSON are kept whole)
    chunks = docs
    if extracted_extension not in ('csv', 'json'):
        chunks = text_splitter.split_documents(docs)

    print(f"docs after chunking: {chunks}")

    store, current_embeddings = await self.persistent_vector_store.from_documents(
        documents=chunks,
        embedding=embeddings,
        chunk_size=10,
        collection_name=PG_COLLECTION_NAME,
        connection_string=os.getenv("POSTGRES_URL"),
        pre_delete_collection=False,  # Keep existing docs in the collection
    )

    if session_vector_store:
        store, current_embeddings = await session_vector_store.from_documents(
            documents=chunks,
            embedding=embeddings,
            chunk_size=10,
            collection_name=session_id,
            connection_string=os.getenv("POSTGRES_URL"),
            pre_delete_collection=False,  # Keep existing docs in the collection
        )

    return current_embeddings

As you can see, CAI2025_paper_6124.pdf exists in ./uploaded_files/09a86o2ou8p5

Nevertheless, DirectoryLoader fails to load any docs given the directory, only in the deployed environment in Heroku.
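One way to narrow this down (a minimal sketch using only the standard library; the path is the session directory from the logs above, and `**/*.pdf` is an assumed glob pattern — substitute the real one): as far as I know, DirectoryLoader resolves its glob via `Path.glob` (or `Path.rglob` when `recursive=True`), so checking the pattern directly against the directory tells you whether the glob matching or the loader class is the part that fails.

```python
from pathlib import Path

# Hypothetical sanity check: if this prints no matches, the glob pattern
# (not the loader class) is the problem. Path.glob on a missing directory
# simply yields nothing, so this is safe to run anywhere.
directory = Path("./uploaded_files/09a86o2ou8p5")  # session dir from the logs
pattern = "**/*.pdf"  # assumed glob_pattern; substitute the real one
matches = sorted(directory.glob(pattern))
print(matches)
```

In the failing run above, matching files clearly do exist, which points the suspicion at the loader class rather than the glob.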

I have tried listing the files in the directory with:

try:
    print("Contents of the directory:")
    for root, dirs, files in os.walk(directory_path_str):
        print(f"Root: {root}")
        print(f"Directories: {dirs}")
        print(f"Files: {files}")
except Exception as e:
    print(f"Error listing directory contents: {e}")
    raise

I have also tried making DirectoryLoader read from absolute_path instead of directory_path_str.

I tested locally, and it always works locally.

I am desperate to find a fix for DirectoryLoader in the Heroku deployed environment.

asked Jan 19 at 22:14 by Wonjae Oh

1 Answer

The problem turned out to be missing OpenGL system libraries in the Debian-based container. Adding these two packages:

libgl1 \
libglib2.0-0 \

to the apt-get install step in the Dockerfile solved the problem:

RUN apt-get update && apt-get install -y \
    libgl1 \
    libglib2.0-0 \
    libcairo2 \
    libpango1.0-0 \
    libgdk-pixbuf2.0-0 \
    shared-mime-info \
    libgirepository1.0-dev \
    gir1.2-pango-1.0 \
    gir1.2-gdkpixbuf-2.0 \
    gir1.2-cairo-1.0 \
    python3-gi \
    python3-cairo \
    git \
    build-essential \
    curl \
    && rm -rf /var/lib/apt/lists/*
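To confirm inside the running container that the libraries are now resolvable, a quick standard-library check is possible (a sketch, assuming `GL` and `glib-2.0` are the linker names behind the libgl1 and libglib2.0-0 packages):

```python
import ctypes.util

# Ask the dynamic linker whether each shared library can be located;
# prints the resolved filename, or NOT FOUND if it is still missing.
for name in ("GL", "glib-2.0"):
    resolved = ctypes.util.find_library(name)
    print(f"lib{name}: {resolved or 'NOT FOUND'}")
```

If libGL still prints NOT FOUND after rebuilding the image, DirectoryLoader's PDF pipeline is likely to keep failing silently with empty results.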
