I am building a small application with Spring AI, using MongoDB Atlas (running in a local Docker container) as the vector store for the RAG data.

I want to "seed" MongoDB with some content when the service starts. The content is a list of documents with metadata. The problem is that this content is inserted again every time the application starts, and I have not found a way to prevent duplicates. I can't simply wipe the database, because I want to add data later that must survive restarts, even when the service comes up with different or newer presets.

Right now I'm trying something like this:

    @Autowired
    public void init(VectorStore vectorStore) {
        List<Document> documents = List.of(
                new Document("Once there was a little Girl",
                        Map.of("type", "init", "pos", "1", "plot", "1")),
                new Document("The girls name was Mary",
                        Map.of("type", "init", "pos", "2", "plot", "1")),
                new Document("Once there was a little Boy",
                        Map.of("type", "init", "pos", "1", "plot", "2")),
                new Document("The boys name was Peter",
                        Map.of("type", "init", "pos", "2", "plot", "2")),
                new Document("Peter was a wild kid",
                        Map.of("type", "init", "pos", "3", "plot", "2"))
        );

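        // Intended to find the previously seeded documents (type == 'init')
        // and collect their ids.
        List<String> collect = vectorStore.similaritySearch("type == 'init'")
                .stream().map(Document::getId).collect(Collectors.toList());
        // Delete the old seed documents, then insert the current set.
        vectorStore.delete(
                collect
        );

        vectorStore.add(documents);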
    }

This doesn't work: one of the documents is not removed in the delete step, so that row is duplicated on every start. In MongoDB I can see that its metadata map is stored slightly differently (the fields are in a different order), which I suspect is related. The behaviour is stable across restarts, but when I change the value of type from init to story, a different row escapes deletion. This drives me mad...
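On the differing field order: the metadata maps are built with `Map.of(...)`, whose iteration order is unspecified and can differ between maps and between JVM runs. A different field order in MongoDB is therefore expected and is probably not what lets a row escape deletion. A minimal sketch of how a sorted copy gives a stable order (the names `MetadataOrder` and `sortedKeys` are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MetadataOrder {

    // Hypothetical helper: returns the keys in sorted order,
    // independent of how the input map happens to iterate.
    static List<String> sortedKeys(Map<String, Object> metadata) {
        return new ArrayList<>(new TreeMap<>(metadata).keySet());
    }

    public static void main(String[] args) {
        Map<String, Object> metadata = Map.of("type", "init", "pos", "1", "plot", "1");
        // Iteration order of Map.of(...) is unspecified and may vary between runs.
        System.out.println("Map.of order:  " + metadata.keySet());
        // A sorted copy iterates in natural key order.
        System.out.println("TreeMap order: " + sortedKeys(metadata));
    }
}
```

A likelier cause, stated with some caution: in the Spring AI versions this note assumes, the one-argument `similaritySearch(String)` appears to treat its argument as query text rather than as a metadata filter, and to return only a default number of top matches (4, if the API is read correctly). With five seed documents, one would then never be returned and never deleted, and which one it is would depend on the query text.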

I would like a way to provide initial data to the DB that may change as the service evolves, without filling the DB with duplicate rows that will presumably cause problems later. (I assume that will be the case, but I'm not yet at a stage where I can verify it. Either way, it is annoying.)
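One possible direction, sketched under assumptions rather than taken from the Spring AI documentation: give every seed document a deterministic id, so that a restart addresses the same rows again instead of having to search for them. The sketch below assumes that the metadata (type, plot, pos) uniquely identifies a seed document. The class `SeedIds` and the helper `stableId` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

class SeedIds {

    // Hypothetical helper: derives a stable, name-based UUID from the metadata.
    // Assumption: the metadata alone identifies a seed document, so an edited
    // text keeps its id and would replace the old row instead of adding one.
    static String stableId(Map<String, Object> metadata) {
        // Sorting the keys makes the id independent of map iteration order.
        String canonical = "seed:" + new TreeMap<>(metadata);
        return UUID.nameUUIDFromBytes(canonical.getBytes(StandardCharsets.UTF_8)).toString();
    }

    public static void main(String[] args) {
        List<Map<String, Object>> seeds = new ArrayList<>();
        seeds.add(Map.of("type", "init", "pos", "1", "plot", "1"));
        seeds.add(Map.of("type", "init", "pos", "2", "plot", "1"));
        seeds.add(Map.of("type", "init", "pos", "1", "plot", "2"));
        seeds.add(Map.of("type", "init", "pos", "2", "plot", "2"));
        seeds.add(Map.of("type", "init", "pos", "3", "plot", "2"));
        // The same metadata yields the same id on every start.
        for (Map<String, Object> metadata : seeds) {
            System.out.println(stableId(metadata) + "  " + new TreeMap<>(metadata));
        }
    }
}
```

With Spring AI this would presumably be wired in as `new Document(stableId(metadata), text, metadata)`, followed by `vectorStore.delete(ids)` with the known seed ids and then `vectorStore.add(documents)`. Whether `add` alone overwrites an existing id depends on the store implementation, so that is worth checking for the MongoDB Atlas store. Presets that are removed from the list later would still need to be deleted explicitly.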

Has anyone solved something similar?

Tags: java, Initializing RAG using vectorstore without duplicates, Stack Overflow