I'm trying to use SelfQueryRetriever to automatically generate a query by document metadata on a vector database.

When called directly, the configured LLM seems to be able to answer with the expected output JSON:

llm = ChatOllama(model="gemma3:4b", temperature=0)

metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="director", description="The name of the movie director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]

prompt_template_str = """
You are a text-to-structured-query converter. Convert natural language questions into structured queries.

Use the provided metadata schema and examples to guide your conversion. Ensure that if a genre is mentioned or implied, it is included in the structured query.

Respond ONLY with a valid JSON object. Do not include any explanations.

Schema:
{schema}

Examples:
{examples}

User Query:
{query}

Structured Query:
"""

queries = [
    "Find science fiction movies with a rating greater than 8.5",
]

for i, query in enumerate(queries, 1):
    # Format the prompt with schema and examples (as done in the retriever)
    formatted_prompt = prompt_template_str.format(
        schema=str(metadata_field_info),  # Ensure the schema is included
        examples=str(examples),  # Ensure examples are passed correctly
        query=query  # Pass the user query
    )

    # Log the formatted prompt
    print("\nFormatted Prompt:")
    print(formatted_prompt)

    # Ask the model to generate the structured query
    model_response = llm.invoke(formatted_prompt)

    # Log the model's output
    print("\nModel Response:")
    print(model_response)

As you can see, the model is able to determine that we are querying by both genre and rating:

Model Response:
content='```json\n{"filter": "and(gt(\\"rating\\", 8.5), eq(\\"genre\\", \\"science fiction\\"))"}\n```' additional_kwargs={} response_metadata={'model': 'gemma3:4b', 'created_at': '2025-04-01T20:00:34.613301Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1093037375, 'load_duration': 23306375, 'prompt_eval_count': 456, 'prompt_eval_duration': 553318417, 'eval_count': 33, 'eval_duration': 515989333, 'message': Message(role='assistant', content='', images=None, tool_calls=None)} id='run-ecb1a649-891e-41fd-8dfd-8387e1ed054b-0' usage_metadata={'input_tokens': 456, 'output_tokens': 33, 'total_tokens': 489}

Now trying the same query through the SelfQueryRetriever:

retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    prompt=custom_prompt,
    enable_limit=False,
    examples = examples,
    search_kwargs={"k":4}
)

for i, query in enumerate(queries, 1):
    structured_query = retriever.query_constructor.invoke(query)
    print(f"\nQuery {i}: {query}")
    print(f"\nStructured Query {i} Output: {structured_query}")

The output shows that it's only querying by rating, and not genre:

Structured Query 1 Output: query='science fiction' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None

Can you spot the reason why the LLM works perfectly when invoked directly, but fails to generate the proper structured query through the SelfQueryRetriever?
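
For reference, the prompt that the retriever's query constructor builds internally can be printed and compared against the hand-formatted one above. This is only a minimal inspection sketch; it assumes get_query_constructor_prompt is importable from langchain.chains.query_constructor.base in the installed LangChain version, and it reuses document_content_description and metadata_field_info from the full code below:

from langchain.chains.query_constructor.base import get_query_constructor_prompt

# Build the default query-constructor prompt from the same schema description
constructor_prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)

# Render it for the test query and print it for side-by-side comparison
print(constructor_prompt.format(query="Find science fiction movies with a rating greater than 8.5"))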

Here's the full source code:

from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_postgres.vectorstores import PGVector
from langchain_core.documents import Document
from langchain.prompts import PromptTemplate

# PGVector DB connection
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"

# Sample documents
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 8.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6, "genre": "science fiction"},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3, "genre": "comedy"},
    ),
    Document(
        page_content="Some weird aliens fight for the freedom of the resistance against an evil galactic empire",
        metadata={"year": 1979, "director": "Gee Lucas", "rating": 9.2, "genre": "science fiction"},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 7.5},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller","rating": 9.9},
    ),
]

# Embeddings
embeddings_model = OllamaEmbeddings(model="nomic-embed-text")

# Vectorstore
vectorstore = PGVector.from_documents(docs, embeddings_model, connection=connection)

# Metadata schema
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="director", description="The name of the movie director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]

# Description of contents
document_content_description = "Brief summary of a movie"

# Define Ollama LLM for querying metadata
llm = ChatOllama(model="gemma3:4b", temperature=0)

prompt_template_str = """
You are a text-to-structured-query converter. Convert natural language questions into structured queries.

Use the provided metadata schema and examples to guide your conversion. Ensure that if a genre is mentioned or implied, it is included in the structured query.

Respond ONLY with a valid JSON object. Do not include any explanations.

Schema:
{schema}

Examples:
{examples}

User Query:
{query}

Structured Query:
"""

custom_prompt = PromptTemplate(
    template=prompt_template_str,
    input_variables=["query", "schema", "examples"]
)

examples = [
    {
        "query": "I want to watch science fiction movies with a rating at least 5.5",
        "filter": 'and(gt("rating", 5.5), eq("genre", "science fiction"))'
    },
    {
        "query": "Find movies of science fiction genre with a rating greater than 8.5",
        "filter": 'and(gt("rating", 8.5), eq("genre", "science fiction"))'
    },
    {
        "query": "Find comedy movies",
        "filter": 'eq("genre", "comedy")'
    },
    {
        "query": "Show animated movies with a rating above 7",
        "filter": 'and(gt("rating", 7), eq("genre", "animated"))'
    },
    {
        "query": "Find science fiction movies with a rating greater than 8.5",
        "filter": 'and(gt("rating", 8.5), eq("genre", "science fiction"))'
    },
    # Ensure queries with 'science fiction' and 'rating' include the genre clause
    {
        "query": "Find science fiction movies above 8.0 rating",
        "filter": 'and(gt("rating", 8.0), eq("genre", "science fiction"))'
    },
    # Adjust for cases where the genre isn't clear or needs to be inferred
    {
        "query": "Find movies with a rating greater than 8.0",
        "filter": 'gt("rating", 8.0)'
    }
]

# Create the retriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    prompt=custom_prompt,
    enable_limit=False,
    examples = examples,
    search_kwargs={"k":4}
)

queries = [
    "Find science fiction movies with a rating greater than 8.5",
]

# Log the structured queries before processing
for i, query in enumerate(queries, 1):
    structured_query = retriever.query_constructor.invoke(query)
    print(f"\nQuery {i}: {query}")
    print(f"\nStructured Query {i} Output: {structured_query}")

    # Format the prompt with schema and examples (as done in the retriever)
    formatted_prompt = prompt_template_str.format(
        schema=str(metadata_field_info),  # Ensure the schema is included
        examples=str(examples),  # Ensure examples are passed correctly
        query=query  # Pass the user query
    )

    # Log the formatted prompt
    print("\nFormatted Prompt:")
    print(formatted_prompt)

    # Ask the model to generate the structured query
    model_response = llm.invoke(formatted_prompt)

    # Log the model's output
    print("\nModel Response:")
    print(model_response)

# Query and results
for i, query in enumerate(queries, 1):
    print(f"\nResults {i}:")
    results = retriever.invoke(query)
    for r in results:
        print(r.page_content, "=>", r.metadata)


1 Answer

The key to fixing it was to use a more descriptive metadata_field_info, enumerating the allowed genre values:

metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
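
With this schema, rebuilding the retriever and re-running the query constructor should produce a filter with both the rating and the genre comparison. A quick sanity-check sketch, reusing the variables from the question's full source code and otherwise leaving the setup unchanged:

# Rebuild the retriever with the more descriptive metadata_field_info above
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    prompt=custom_prompt,
    enable_limit=False,
    examples=examples,
    search_kwargs={"k": 4},
)

# The filter should now also carry the genre comparison that was missing before
structured_query = retriever.query_constructor.invoke(
    "Find science fiction movies with a rating greater than 8.5"
)
print(structured_query)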
