TypeError: argument 'schema': 'Object' is not a Polars data type

Why?

I am querying data from a MongoDB collection and loading the result into a Polars DataFrame. Depending on the limit filter of the Mongo query, the operation either works or raises the error in the title. I haven't been able to fix it because I can't tell whether the issue is with Mongo or with Polars. By the way, I'm quite new to Polars.

Context

So this is essentially the query I'm running in Python using pymongo==4.5.0:

import datetime as dt
res = mongo_clt.col.db.find(
    filter={
        'createdAt': {
            '$gte': dt.datetime.fromisoformat("2024-09-01")
        },
    },
    projection=[
        "type",
        "checked",
        "status",
        "createdAt",
    ],
    limit=0,
)

Note that setting limit=0 is the same as not setting a limit at all, so the query should return all entries.

Now, for reference, between 2024-09-01 and today (2025-01-08) I should collect about 4700 rows, which I validated both by running the query in MongoDB Compass and by loading the response directly into a Pandas DataFrame instead of a Polars one.
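
The same count can also be cross-checked from Python with the same filter; a minimal sketch (count_documents is the standard PyMongo call for this):

n_docs = mongo_clt.col.db.count_documents(
    {'createdAt': {'$gte': dt.datetime.fromisoformat("2024-09-01")}}
)
print(n_docs)  # should print roughly 4700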

The schema I'm using for the projected variables is:

import polars as pl
cols_types = {
    'type':pl.Categorical,
    'checked':pl.Boolean,
    'status':pl.Categorical,
    'createdAt':pl.Datetime('ms')
}

Then the response is unpacked like this:

df = pl.DataFrame(
    data=res,
    schema_overrides=cols_types,
)
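
Worth noting: find() returns a lazy cursor rather than a list, so Polars consumes it in chunks instead of all at once. Just as a sketch, materializing the cursor first makes it easy to inspect what each document actually contains (this exhausts the cursor, so the query has to be re-run afterwards):

docs = list(res)  # exhausts the cursor
print(len(docs))  # how many documents actually came back
print(docs[0])    # a raw document; note that _id comes back as a bson ObjectId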

Issue

If I set limit = 100 or even limit = 1000 the operation works and I get a Polars DataFrame with 100 (or 1000) rows with the correct types. But if I raise the limit to, say, 4000, or simply remove it, I get the following error:

{
    "name": "TypeError",
    "message": "argument 'schema': 'Object' is not a Polars data type",
    "stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 df_raq = pl.DataFrame(
      2     data=res,
      3     schema_overrides=cols_types,
      4 )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py:419, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    414     self._df = pandas_to_pydf(
    415         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    416     )
    418 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):
--> 419     self._df = iterable_to_pydf(
    420         data,
    421         schema=schema,
    422         schema_overrides=schema_overrides,
    423         strict=strict,
    424         orient=orient,
    425         infer_schema_length=infer_schema_length,
    426     )
    428 elif isinstance(data, pl.DataFrame):
    429     self._df = dataframe_to_pydf(
    430         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    431     )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:990, in iterable_to_pydf(data, schema, schema_overrides, strict, orient, chunk_size, infer_schema_length)
    988 if not values:
    989     break
--> 990 frame_chunk = to_frame_chunk(values, original_schema)
    991 if df is None:
    992     df = frame_chunk

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:963, in iterable_to_pydf.<locals>.to_frame_chunk(values, schema)
    962 def to_frame_chunk(values: list[Any], schema: SchemaDefinition | None) -> DataFrame:
--> 963     return pl.DataFrame(
    964         data=values,
    965         schema=schema,
    966         strict=strict,
    967         orient=\"row\",
    968         infer_schema_length=infer_schema_length,
    969     )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py:384, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    375     self._df = dict_to_pydf(
    376         data,
    377         schema=schema,
   (...)
    380         nan_to_null=nan_to_null,
    381     )
    383 elif isinstance(data, (list, tuple, Sequence)):
--> 384     self._df = sequence_to_pydf(
    385         data,
    386         schema=schema,
    387         schema_overrides=schema_overrides,
    388         strict=strict,
    389         orient=orient,
    390         infer_schema_length=infer_schema_length,
    391     )
    393 elif isinstance(data, pl.Series):
    394     self._df = series_to_pydf(
    395         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    396     )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:435, in sequence_to_pydf(data, schema, schema_overrides, strict, orient, infer_schema_length)
    432 if not data:
    433     return dict_to_pydf({}, schema=schema, schema_overrides=schema_overrides)
--> 435 return _sequence_to_pydf_dispatcher(
    436     data[0],
    437     data=data,
    438     schema=schema,
    439     schema_overrides=schema_overrides,
    440     strict=strict,
    441     orient=orient,
    442     infer_schema_length=infer_schema_length,
    443 )

File ~/.pyenv/versions/3.10.12/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:676, in _sequence_of_dict_to_pydf(first_element, data, schema, schema_overrides, strict, infer_schema_length, **kwargs)
    668 column_names, schema_overrides = _unpack_schema(
    669     schema, schema_overrides=schema_overrides
    670 )
    671 dicts_schema = (
    672     _include_unknowns(schema_overrides, column_names or list(schema_overrides))
    673     if column_names
    674     else None
    675 )
--> 676 pydf = PyDataFrame.from_dicts(
    677     data,
    678     dicts_schema,
    679     schema_overrides,
    680     strict=strict,
    681     infer_schema_length=infer_schema_length,
    682 )
    684 # TODO: we can remove this `schema_overrides` block completely
    685 #  once  is fixed
    686 if schema_overrides:

TypeError: argument 'schema': 'Object' is not a Polars data type"
}

My guess is that Polars' schema inference has some sort of issue, so I tried setting pl.DataFrame(strict=False), but that didn't have any effect.
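
For clarity, that attempt looked roughly like this (the same construction, just with strict=False), and it failed with the exact same TypeError:

df = pl.DataFrame(
    data=res,
    schema_overrides=cols_types,
    strict=False,
)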

Update

Of the projected columns, the only one that is not cast explicitly is _id, which is always returned. In Mongo it is of type ObjectId, so it could be the one referenced in the error raised above. So I forced a cast of it to pl.String and the result was a new error, this time a ComputeError:

{
    "name": "ComputeError",
    "message": "could not append value: 677fe3e18f80eb81115eb375 of type: object to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity",
    "stack": "---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
Cell In[3], line 23
      1 res= mongo_clt.col.db.find(
      2     filter={
      3         'createdAt': {
   (...)
     20     limit=0
     21 )
---> 23 df = pl.DataFrame(
     24     data=res,
     25     schema={
     26         '_id':pl.String,
     27         'type':pl.Categorical,
     28         'checked':pl.Boolean,
     29         # 'assetId':pl.String,
     30         'status':pl.Categorical,
     31         'createdAt':pl.Datetime('ms'),
     32         'feedback':pl.Struct
     33     }
     34 )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py:419, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    414     self._df = pandas_to_pydf(
    415         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    416     )
    418 elif not isinstance(data, Sized) and isinstance(data, (Generator, Iterable)):
--> 419     self._df = iterable_to_pydf(
    420         data,
    421         schema=schema,
    422         schema_overrides=schema_overrides,
    423         strict=strict,
    424         orient=orient,
    425         infer_schema_length=infer_schema_length,
    426     )
    428 elif isinstance(data, pl.DataFrame):
    429     self._df = dataframe_to_pydf(
    430         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    431     )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:990, in iterable_to_pydf(data, schema, schema_overrides, strict, orient, chunk_size, infer_schema_length)
    988 if not values:
    989     break
--> 990 frame_chunk = to_frame_chunk(values, original_schema)
    991 if df is None:
    992     df = frame_chunk

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:963, in iterable_to_pydf.<locals>.to_frame_chunk(values, schema)
    962 def to_frame_chunk(values: list[Any], schema: SchemaDefinition | None) -> DataFrame:
--> 963     return pl.DataFrame(
    964         data=values,
    965         schema=schema,
    966         strict=strict,
    967         orient=\"row\",
    968         infer_schema_length=infer_schema_length,
    969     )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/dataframe/frame.py:384, in DataFrame.__init__(self, data, schema, schema_overrides, strict, orient, infer_schema_length, nan_to_null)
    375     self._df = dict_to_pydf(
    376         data,
    377         schema=schema,
   (...)
    380         nan_to_null=nan_to_null,
    381     )
    383 elif isinstance(data, (list, tuple, Sequence)):
--> 384     self._df = sequence_to_pydf(
    385         data,
    386         schema=schema,
    387         schema_overrides=schema_overrides,
    388         strict=strict,
    389         orient=orient,
    390         infer_schema_length=infer_schema_length,
    391     )
    393 elif isinstance(data, pl.Series):
    394     self._df = series_to_pydf(
    395         data, schema=schema, schema_overrides=schema_overrides, strict=strict
    396     )

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:435, in sequence_to_pydf(data, schema, schema_overrides, strict, orient, infer_schema_length)
    432 if not data:
    433     return dict_to_pydf({}, schema=schema, schema_overrides=schema_overrides)
--> 435 return _sequence_to_pydf_dispatcher(
    436     data[0],
    437     data=data,
    438     schema=schema,
    439     schema_overrides=schema_overrides,
    440     strict=strict,
    441     orient=orient,
    442     infer_schema_length=infer_schema_length,
    443 )

File ~/.pyenv/versions/3.10.12/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/Desktop/dev/.venv/lib/python3.10/site-packages/polars/_utils/construction/dataframe.py:676, in _sequence_of_dict_to_pydf(first_element, data, schema, schema_overrides, strict, infer_schema_length, **kwargs)
    668 column_names, schema_overrides = _unpack_schema(
    669     schema, schema_overrides=schema_overrides
    670 )
    671 dicts_schema = (
    672     _include_unknowns(schema_overrides, column_names or list(schema_overrides))
    673     if column_names
    674     else None
    675 )
--> 676 pydf = PyDataFrame.from_dicts(
    677     data,
    678     dicts_schema,
    679     schema_overrides,
    680     strict=strict,
    681     infer_schema_length=infer_schema_length,
    682 )
    684 # TODO: we can remove this `schema_overrides` block completely
    685 #  once  is fixed
    686 if schema_overrides:

ComputeError: could not append value: 677fe3e18f80eb81115eb375 of type: object to the builder; make sure that all rows have the same schema or consider increasing `infer_schema_length`

it might also be that a value overflows the data-type's capacity"
}
  • I can't increase infer_schema_length because I'm already using the full length.
  • The value 677fe3e18f80eb81115eb375 corresponds to an _id, but I could only see it in MongoDB Compass; when I load the response with Pandas I don't find that row (see the workaround sketch after this list).
  • Could it then be this: a value overflows the data-type's capacity?
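
For completeness, this is a sketch of the kind of pre-processing I could do as a workaround, converting _id to a plain str in Python before handing the documents to Polars (although I'd still like to understand the underlying issue):

docs = [{**doc, '_id': str(doc['_id'])} for doc in res]

df = pl.DataFrame(
    data=docs,
    schema_overrides=cols_types,
)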
