I want to load large responses of the Solr streaming API into polars (python), efficiently. The Solr streaming API returns JSON of the following form:
{
  "result-set": {
    "docs": [
      {"col1": "value", "col2": "value"},
      {"col1": "value", "col2": "value"},
      ...
      {"EOF": true, "RESPONSE_TIME": 12345}
    ]
  }
}
That is: I need every element of result-set.docs, except for the last one, which marks the end of the response.
For now, my fastest solution is to first convert this to NDJSON using jstream and GNU head, and then use pl.read_ndjson:

cat result.json | jstream -d 3 | head -n -1 > result.ndjson
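For reference, the polars side of this approach is then just a plain NDJSON read:

import polars as pl

# the EOF sentinel was already stripped by head -n -1
df = pl.read_ndjson("result.ndjson")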
This clocks in at around 8s for a 770MiB file, which is perfectly fine for me. If I manually change the JSON to just have a top-level list, I can load it even faster using pl.read_json(result_manipulated).head(-1), clocking in at around 3s, at least if I specify the schema manually so the last element does not produce any schema errors.
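A minimal sketch of that variant, with a hypothetical schema for the col1/col2 columns (adjust the types to the actual response):

import polars as pl

# hypothetical schema; with an explicit schema the trailing EOF element
# no longer trips up schema inference
schema = {"col1": pl.Utf8, "col2": pl.Utf8}
df = pl.read_json("result_manipulated.json", schema=schema).head(-1)  # drop the EOF row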
So, I wonder whether there is any fast way to import this file without leaving python?
1 Answer
This is a classic stream/buffer reading issue. Instead of bulk-processing the entire streaming response from Solr, the intention is that the client reads it chunk by chunk and makes sense of it as it goes.
I have not tried this myself, but there are streaming JSON parsers for the Python ecosystem (for example json-stream, https://pypi.org/project/json-stream/) which seem to fit the bill at a glance. I believe you will be able to configure it so that your code consumes one doc at a time while still reading the stream from your streaming request.
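For illustration, here is a minimal sketch of that idea, assuming json-stream's load and to_standard_types API and the col1/col2 document shape from the question:

import json_stream
import polars as pl

rows = []
with open("result.json") as f:
    # walk result-set.docs lazily, without materializing the whole file
    for doc in json_stream.load(f)["result-set"]["docs"]:
        row = json_stream.to_standard_types(doc)  # copy the transient node into a plain dict
        if "EOF" in row:
            break  # the sentinel element marks the end of the response
        rows.append(row)

df = pl.DataFrame(rows)

Note that pure-Python iteration like this may well be slower than the jstream pipeline on a 770MiB file, but it stays entirely in Python.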
Good luck
-- Jan Høydahl - Apache Solr committer
Comment: subprocess.run and similar? stackoverflow.com/questions/89228/… – Dean MacGregor, Feb 3 at 17:47
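Following up on that comment: if shelling out is acceptable, the existing jstream pipeline can be driven from Python along these lines (a sketch; it assumes jstream and GNU head are on PATH):

import io
import subprocess
import polars as pl

# run the jstream | head pipeline and feed its NDJSON output straight to polars
proc = subprocess.run(
    ["bash", "-c", "jstream -d 3 < result.json | head -n -1"],
    capture_output=True,
    check=True,
)
df = pl.read_ndjson(io.BytesIO(proc.stdout))

This avoids the intermediate result.ndjson file while keeping the fast external tools.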