admin管理员组

文章数量:1125064

In Polars / pandas / PyArrow, I can instantiate an object from a dict, e.g.

In [12]: pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
Out[12]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Is there a way to do that in DuckDB, without going via pandas / pyarrow / etc.?

In Polars / pandas / PyArrow, I can instantiate an object from a dict, e.g.

In [12]: pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
Out[12]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Is there a way to do that in DuckDB, without going via pandas / pyarrow / etc.?

Share Improve this question asked 2 days ago ignoring_gravityignoring_gravity 10.4k7 gold badges41 silver badges86 bronze badges 1
  • From what I see, replacement scans are only available with certain types - pandas.DataFrame, pyarrow Table, Dataset, RecordBatchReader, Scanner, or NumPy ndarrays etc. You can create relation via some methods, but not from dict directly - relational api. I guess converting dict to json is not something you'd want though.. – roman Commented 2 days ago
Add a comment  | 

1 Answer 1

Reset to default 2

duckdb features a function duckdb.read_json which should do this by simply streaming the dict as a json string, but no combination of its various parameters will make it read that dict the same way polars does unfortunately. You can rearrange the dict to match the structure expected for a "unstructured" json format in their documentation pretty easily and then use the duckdb.read_json function to load it the same way as polars, which I think the closest you can get with the library as it currently stands.

Here is a demonstration which shows the polars interpretation we expected, the naive duckdb interpretation we probably didn't and the transformation necessary to load it like polars did:

import polars
import duckdb
import io
import json
# Silent requirement for fsspec

data = {'a': [1,2,3], 'b': [4,5,6]}
polars_data = polars.DataFrame(data)
print('polars\' interpretation:')
print(polars_data)

duckdb_data = duckdb.read_json(io.StringIO(json.dumps(data)))
print('duckdb \'s naive interpretation:')
print(duckdb_data)

def transform_to_duckjson(dictdata: dict):
    return [{fieldname: fieldrecords[recordnum] for fieldname, fieldrecords in dictdata.items()} for recordnum in range(len(dictdata[next(iter(dictdata.keys()))]))]

duckdb_data2 = duckdb.read_json(io.StringIO(json.dumps(transform_to_duckjson(data))))
print('duckdb \'s interpretation after adjustment:')
print(duckdb_data2)

Which gives me the following output:

polars' interpretation:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘
duckdb 's naive interpretation:
┌───────────┬───────────┐
│     a     │     b     │
│  int64[]  │  int64[]  │
├───────────┼───────────┤
│ [1, 2, 3] │ [4, 5, 6] │
└───────────┴───────────┘

duckdb 's interpretation after adjustment:
┌───────┬───────┐
│   a   │   b   │
│ int64 │ int64 │
├───────┼───────┤
│     1 │     4 │
│     2 │     5 │
│     3 │     6 │
└───────┴───────┘

As a note: Here is the python api reference (scroll down to "read_csv(..." for the relevant information) which doesn't provide any help at all for using the function unfortunately.

Let me know if you have any questions.

本文标签: duckdbDuckDBPyRelation from Python dictStack Overflow