Fast loading of a Solr streaming response (JSON) into Polars

I want to load large responses from the Solr streaming API into Polars (Python), efficiently. The Solr streaming API returns JSON of the following form:

{
  "result-set":{
    "docs":[{
       "col1":"value",
       "col2":"value"}
    ,{
       "col1":"value",
       "col2":"value"}
    ...
    ,{
       "EOF":true,
       "RESPONSE_TIME":12345}]}}

That is, I need every element of result-set.docs except for the last one, which marks the end of the response.

For now, my fastest solution is to first convert this to NDJSON using jstream and GNU head, and then use pl.read_ndjson:

cat result.json | jstream -d 3 | head -n -1 > result.ndjson
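
followed by a plain NDJSON read in Python:

import polars as pl

df = pl.read_ndjson("result.ndjson")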

This clocks in at around 8s for a 770MiB file, which is perfectly fine for me. If I manually change the JSON to just have a top-level list, I can load this even faster using pl.read_json(result_manipulated).head(-1), clocking in at around 3s, at least if I specify the schema manually so that the last entry does not produce any schema errors.
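
For illustration, the schema-pinned variant of that faster path looks roughly like this (the column names and dtypes here are hypothetical, taken from the sample above; the real schema has to match your documents):

import polars as pl

# Hypothetical schema based on the sample docs; pinning it keeps the
# trailing EOF/RESPONSE_TIME object from breaking schema inference.
schema = {"col1": pl.Utf8, "col2": pl.Utf8}

# head(-1) drops the sentinel row at the end.
df = pl.read_json("result_manipulated.json", schema=schema).head(-1)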

So I wonder: is there any fast way to import this file without leaving Python?

  • When you say "without leaving python", are you ruling out the use of subprocess.run and similar (stackoverflow/questions/89228/…)? – Dean MacGregor, Feb 3 at 17:47

1 Answer


This is a classic stream / buffer reading issue. Instead of bulk-processing the entire streaming response from Solr, the intention is that the client reads it chunk by chunk and makes sense of it as it goes.

I have not tried this myself, but there are streaming JSON parsers for the Python ecosystem, such as json-stream (https://pypi.org/project/json-stream/), which seems to fit the bill at a glance. I believe you will be able to configure it so that your code consumes one doc at a time while still reading the stream from your streaming request.
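
For example, a rough, untested sketch with json-stream (assuming its documented load() and to_standard_types() helpers) could look like this:

import json_stream
import polars as pl

rows = []
with open("result.json") as f:
    data = json_stream.load(f)  # lazy, single-pass parse of the response
    for doc in data["result-set"]["docs"]:
        row = json_stream.to_standard_types(doc)  # materialize one doc as a plain dict
        if "EOF" in row:  # the sentinel object marks the end of the stream
            break
        rows.append(row)

df = pl.DataFrame(rows)

Accumulating plain dicts gives up some of the memory benefit of streaming, but it stays entirely in Python and never reads the sentinel into the frame.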

Good luck

-- Jan Høydahl - Apache Solr committer
