I am new to pandera and am still learning how it works. What is the easiest way to check that the datetime units of an index are in nanoseconds and not milliseconds?

Ideally, I am looking for a compact way to declare this check inside the class-based API definitions. If Solution Attempt 2 is the best way of doing this, I will happily accept that as the answer; I am mainly looking for a more experienced perspective.

Solution Attempt 1

First I tried the approach that looked most intuitive after studying the docs, but it did not produce the desired result: validating a DataFrame with a millisecond index does not raise a schema error.

import pandas as pd
import pandera as pa
from pandera import DataFrameModel, Field
from pandera.typing import Index
from pandera.engines import pandas_engine

class DateIndexSchema(DataFrameModel):
    date: Index[pandas_engine.DateTime] = Field(nullable=False, dtype_kwargs={'unit': 'ns'})

df_wrong_index_type = pd.DataFrame(
    {'value': [100, 200]},
    index=pd.to_datetime(['2023-01-01', '2023-01-02']).astype('datetime64[ms]'),
)

# Expected a SchemaError for the millisecond index, but validation passes.
DateIndexSchema.validate(df_wrong_index_type)
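To confirm that the test frame really carries a millisecond index (so that a passing validation is a false negative and not a silent conversion when the frame is built), the unit can be inspected with plain pandas and NumPy. This is only a minimal sketch and assumes pandas 2.0 or later, where non-nanosecond datetime units are preserved.

```python
import numpy as np
import pandas as pd

df_wrong_index_type = pd.DataFrame(
    {"value": [100, 200]},
    index=pd.to_datetime(["2023-01-01", "2023-01-02"]).astype("datetime64[ms]"),
)

# np.datetime_data returns a (unit, count) tuple for a datetime64 dtype.
unit, count = np.datetime_data(df_wrong_index_type.index.dtype)
print(df_wrong_index_type.index.dtype, unit, count)
```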

Solution Attempt 2

This solution works as expected, but it feels a bit verbose and makes me suspect that I am missing something obvious.

class DateIndexSchemaThrow(DataFrameModel):
    date: Index[pandas_engine.DateTime] = Field(nullable=False)

    @pa.dataframe_check
    def index_should_be_in_ns(cls, dataframe: pd.DataFrame) -> bool:
        if dataframe.index.dtype != "datetime64[ns]":
            return False
        return True
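Part of the verbosity comes from the if/return pattern: the comparison already yields a boolean. The sketch below shows the same condition as a single expression, outside of pandera, using a hypothetical helper name (`index_is_ns`) that is not part of the original code. It assumes pandas 2.0 or later.

```python
import pandas as pd


def index_is_ns(dataframe: pd.DataFrame) -> bool:
    """Hypothetical helper: True only if the index dtype is tz-naive datetime64[ns]."""
    return bool(dataframe.index.dtype == "datetime64[ns]")


dates = pd.to_datetime(["2023-01-01", "2023-01-02"])
ns_frame = pd.DataFrame({"value": [100, 200]}, index=dates.astype("datetime64[ns]"))
ms_frame = pd.DataFrame({"value": [100, 200]}, index=dates.astype("datetime64[ms]"))

print(index_is_ns(ns_frame), index_is_ns(ms_frame))
```

The body of the check method in Attempt 2 could presumably be reduced to the same one-line return.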

Asked Mar 13 at 13:30 and edited Mar 13 at 19:09 by J.K.

1 Answer

I found a solution that suits my application. Custom checks can be registered with pandera.extensions, which allows compact declarations in the class-based API, both inside the Field constructor and in the Config overrides.

import pandas as pd
from pandas.api.types import is_datetime64_ns_dtype

from pandera import DataFrameModel, Field
from pandera.typing import Index
import pandera.extensions as extensions

extensions.register_check_method(is_datetime64_ns_dtype)

class DateIndexSchema(DataFrameModel):
    date: Index[pd.Timestamp] = Field(nullable=False, is_datetime64_ns_dtype=())

df_wrong_index_type = pd.DataFrame(
    {'value': [100, 200]},
    index=pd.to_datetime(['2023-01-01', '2023-01-02']).astype('datetime64[ms]'),
)

# Raises a SchemaError because the index is datetime64[ms], not datetime64[ns].
DateIndexSchema.validate(df_wrong_index_type)

Now a SchemaError is reliably raised whenever the datetime index is not in nanoseconds.
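For reference, the behaviour of the registered predicate can be checked on its own, without pandera. The sketch below assumes pandas 2.0 or later; the variable names are illustrative only. Note that `is_datetime64_ns_dtype` also accepts timezone-aware nanosecond indexes, which may or may not be what a given schema wants.

```python
import pandas as pd
from pandas.api.types import is_datetime64_ns_dtype

dates = pd.to_datetime(["2023-01-01", "2023-01-02"])
ns_index = dates.astype("datetime64[ns]")  # tz-naive, nanosecond unit
ms_index = dates.astype("datetime64[ms]")  # tz-naive, millisecond unit
utc_index = ns_index.tz_localize("UTC")    # tz-aware, nanosecond unit

# Evaluate the predicate for each index.
results = {
    "ns": is_datetime64_ns_dtype(ns_index),
    "ms": is_datetime64_ns_dtype(ms_index),
    "ns_utc": is_datetime64_ns_dtype(utc_index),
}
print(results)
```

If timezone-aware indexes should be rejected as well, a stricter comparison against the exact dtype would be needed instead of this predicate.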
