I'm trying to use SelfQueryRetriever to automatically generate a query over document metadata in a vector database.
When called directly, the configured LLM is able to answer with the expected JSON output:
llm = ChatOllama(model="gemma3:4b", temperature=0)
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="director", description="The name of the movie director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]
prompt_template_str = """
You are a text-to-structured-query converter. Convert natural language questions into structured queries.
Use the provided metadata schema and examples to guide your conversion. Ensure that if a genre is mentioned or implied, it is included in the structured query.
Respond ONLY with a valid JSON object. Do not include any explanations.
Schema:
{schema}
Examples:
{examples}
User Query:
{query}
Structured Query:
"""
queries = [
    "Find science fiction movies with a rating greater than 8.5",
]

for i, query in enumerate(queries, 1):
    # Format the prompt with schema and examples (as done in the retriever)
    formatted_prompt = prompt_template_str.format(
        schema=str(metadata_field_info),  # Ensure the schema is included
        examples=str(examples),           # Ensure examples are passed correctly
        query=query                       # Pass the user query
    )
    # Log the formatted prompt
    print("\nFormatted Prompt:")
    print(formatted_prompt)
    # Ask the model to generate the structured query
    model_response = llm.invoke(formatted_prompt)
    # Log the model's output
    print("\nModel Response:")
    print(model_response)
As you can see, when invoked directly the model is able to determine that we are filtering by both genre and rating:
Model Response:
content='```json\n{"filter": "and(gt(\\"rating\\", 8.5), eq(\\"genre\\", \\"science fiction\\"))"}\n```' additional_kwargs={} response_metadata={'model': 'gemma3:4b', 'created_at': '2025-04-01T20:00:34.613301Z', 'done': True, 'done_reason': 'stop', 'total_duration': 1093037375, 'load_duration': 23306375, 'prompt_eval_count': 456, 'prompt_eval_duration': 553318417, 'eval_count': 33, 'eval_duration': 515989333, 'message': Message(role='assistant', content='', images=None, tool_calls=None)} id='run-ecb1a649-891e-41fd-8dfd-8387e1ed054b-0' usage_metadata={'input_tokens': 456, 'output_tokens': 33, 'total_tokens': 489}
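For reference, the self-query machinery does not consume this raw text directly; it runs it through a structured-query output parser. Below is a minimal sketch of that parsing step, assuming StructuredQueryOutputParser is importable from your LangChain version; the JSON blob is hypothetical, written in the shape the default query-constructor prompt asks for:
from langchain.chains.query_constructor.base import StructuredQueryOutputParser

# The retriever's query constructor pipes the raw LLM text into this parser,
# which (in the LangChain versions I've seen) expects a JSON object with both
# "query" and "filter" keys and turns the filter string into
# Comparison/Operation objects.
parser = StructuredQueryOutputParser.from_components()

# Hypothetical raw output in the shape the default prompt asks for:
raw_output = (
    '```json\n'
    '{"query": "dinosaurs", '
    '"filter": "and(gt(\\"rating\\", 8.5), eq(\\"genre\\", \\"science fiction\\"))"}\n'
    '```'
)
print(parser.parse(raw_output))
# Roughly: query='dinosaurs' filter=Operation(and, [gt(rating, 8.5), eq(genre, 'science fiction')]) limit=None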
Now trying the same query through the SelfQueryRetriever:
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    prompt=custom_prompt,
    enable_limit=False,
    examples=examples,
    search_kwargs={"k": 4},
)

for i, query in enumerate(queries, 1):
    structured_query = retriever.query_constructor.invoke(query)
    print(f"\nQuery {i}: {query}")
    print(f"\nStructured Query {i} Output: {structured_query}")
The output shows that it is only filtering by rating, not by genre:
Structured Query 1 Output: query='science fiction' filter=Comparison(comparator=<Comparator.GT: 'gt'>, attribute='rating', value=8.5) limit=None
Can you spot why the LLM works perfectly when invoked directly, but fails to generate the proper structured query through the SelfQueryRetriever?
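One way to narrow this down is to print the prompt the query constructor actually builds and compare it with the custom prompt above. The following is only a debugging sketch: it assumes get_query_constructor_prompt is available in your LangChain version, reuses document_content_description and metadata_field_info from the full source below, and omits the custom examples:
from langchain.chains.query_constructor.base import get_query_constructor_prompt

# Build the default few-shot query-constructor prompt from the document
# description and metadata schema, then render it for one query to see
# what the LLM receives inside the retriever.
qc_prompt = get_query_constructor_prompt(
    document_content_description,
    metadata_field_info,
)
print(qc_prompt.format(query="Find science fiction movies with a rating greater than 8.5"))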
Here's the full source code:
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_postgres.vectorstores import PGVector
from langchain_core.documents import Document
from langchain.prompts import PromptTemplate
# PGVector DB connection
connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
# Sample documents
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 8.7, "genre": "science fiction"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "director": "Christopher Nolan", "rating": 8.2, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "director": "Satoshi Kon", "rating": 8.6, "genre": "science fiction"},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "director": "Greta Gerwig", "rating": 8.3, "genre": "comedy"},
    ),
    Document(
        page_content="Some weird aliens fight for the freedom of the resistance against an evil galactic empire",
        metadata={"year": 1979, "director": "Gee Lucas", "rating": 9.2, "genre": "science fiction"},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 7.5},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "director": "Andrei Tarkovsky", "genre": "thriller", "rating": 9.9},
    ),
]
# Embeddings
embeddings_model = OllamaEmbeddings(model="nomic-embed-text")
# Vectorstore
vectorstore = PGVector.from_documents(docs, embeddings_model, connection=connection)
# Metadata schema
metadata_field_info = [
    AttributeInfo(name="genre", description="The genre of the movie", type="string or list[string]"),
    AttributeInfo(name="year", description="The year the movie was released", type="integer"),
    AttributeInfo(name="director", description="The name of the movie director", type="string"),
    AttributeInfo(name="rating", description="A 1-10 rating for the movie", type="float"),
]
# Description of contents
document_content_description = "Brief summary of a movie"
# Define Ollama LLM for querying metadata
llm = ChatOllama(model="gemma3:4b", temperature=0)
prompt_template_str = """
You are a text-to-structured-query converter. Convert natural language questions into structured queries.
Use the provided metadata schema and examples to guide your conversion. Ensure that if a genre is mentioned or implied, it is included in the structured query.
Respond ONLY with a valid JSON object. Do not include any explanations.
Schema:
{schema}
Examples:
{examples}
User Query:
{query}
Structured Query:
"""
custom_prompt = PromptTemplate(
    template=prompt_template_str,
    input_variables=["query", "schema", "examples"],
)
examples = [
    {
        "query": "I want to watch science fiction movies with a rating at least 5.5",
        "filter": 'and(gt("rating", 5.5), eq("genre", "science fiction"))',
    },
    {
        "query": "Find movies of science fiction genre with a rating greater than 8.5",
        "filter": 'and(gt("rating", 8.5), eq("genre", "science fiction"))',
    },
    {
        "query": "Find comedy movies",
        "filter": 'eq("genre", "comedy")',
    },
    {
        "query": "Show animated movies with a rating above 7",
        "filter": 'and(gt("rating", 7), eq("genre", "animated"))',
    },
    {
        "query": "Find science fiction movies with a rating greater than 8.5",
        "filter": 'and(gt("rating", 8.5), eq("genre", "science fiction"))',
    },
    # Ensure queries with 'science fiction' and 'rating' include the genre clause
    {
        "query": "Find science fiction movies above 8.0 rating",
        "filter": 'and(gt("rating", 8.0), eq("genre", "science fiction"))',
    },
    # Adjust for cases where the genre isn't clear or needs to be inferred
    {
        "query": "Find movies with a rating greater than 8.0",
        "filter": 'gt("rating", 8.0)',
    },
]
# Create the retriever
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    prompt=custom_prompt,
    enable_limit=False,
    examples=examples,
    search_kwargs={"k": 4},
)

queries = [
    "Find science fiction movies with a rating greater than 8.5",
]

# Log the structured queries before processing
for i, query in enumerate(queries, 1):
    structured_query = retriever.query_constructor.invoke(query)
    print(f"\nQuery {i}: {query}")
    print(f"\nStructured Query {i} Output: {structured_query}")

    # Format the prompt with schema and examples (as done in the retriever)
    formatted_prompt = prompt_template_str.format(
        schema=str(metadata_field_info),  # Ensure the schema is included
        examples=str(examples),           # Ensure examples are passed correctly
        query=query                       # Pass the user query
    )
    # Log the formatted prompt
    print("\nFormatted Prompt:")
    print(formatted_prompt)
    # Ask the model to generate the structured query
    model_response = llm.invoke(formatted_prompt)
    # Log the model's output
    print("\nModel Response:")
    print(model_response)

# Query and results
for i, query in enumerate(queries, 1):
    print(f"\nResults {i}:")
    results = retriever.invoke(query)
    for r in results:
        print(r.page_content, "=>", r.metadata)
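For completeness, the query constructor that the retriever wraps can also be composed explicitly as prompt | LLM | output parser, following the pattern in the LangChain self-query docs. This sketch assumes the imports below exist in your LangChain version and reuses llm, document_content_description and metadata_field_info from above; it makes the prompt that is actually sent to the model fully visible and overridable:
from langchain.chains.query_constructor.base import (
    StructuredQueryOutputParser,
    get_query_constructor_prompt,
)

# Compose the same prompt -> LLM -> parser pipeline the retriever builds internally.
qc_prompt = get_query_constructor_prompt(document_content_description, metadata_field_info)
output_parser = StructuredQueryOutputParser.from_components()
query_constructor = qc_prompt | llm | output_parser

# Invoking it directly returns a StructuredQuery, just like retriever.query_constructor.
print(query_constructor.invoke({"query": "Find science fiction movies with a rating greater than 8.5"}))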
1 Answer
The key to fixing it was to use a more descriptive metadata_field_info, explicitly listing the allowed genre values:
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie. One of ['science fiction', 'comedy', 'drama', 'thriller', 'romance', 'action', 'animated']",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="director",
        description="The name of the movie director",
        type="string",
    ),
    AttributeInfo(
        name="rating", description="A 1-10 rating for the movie", type="float"
    ),
]
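With that schema in place, a quick way to verify the fix is to rebuild the retriever and re-run the query constructor; the filter should now contain both the rating and the genre clauses. This is just a sketch that assumes the rest of the setup from the question (llm, vectorstore, document_content_description) is unchanged:
# Rebuild the retriever with the more descriptive schema and re-check the structured query.
retriever = SelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vectorstore,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    search_kwargs={"k": 4},
)

structured_query = retriever.query_constructor.invoke(
    "Find science fiction movies with a rating greater than 8.5"
)
print(structured_query)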