admin管理员组

文章数量:1406950

I am in a situation where I have some time series data, potentially looking like this:

{
        "t": [1, 2, 5, 6, 7],
        "y": [1, 1, 1, 1, 1],
}

As you can see, the time stamp jumps from 2 to 5. For my analysis, I would like to fill in zeros for the time stamps 3, and 4.

In reality, I might have multiple gaps with varying lengths. I'd like to fill this gap for all other columns.

I'd also really like to keep my data in a LazyFrame since this is only one step in my pipeline. I don't think that .interpolate is really addressing my issue, nor is fill_null helpful here.

I managed to achieve what I want, but it looks too complex:


# Dummy, lazy data.
lf = pl.LazyFrame(
    {
        "t": [1, 2, 5, 6, 7],
        "y": [1, 1, 1, 1, 1],
    }
)


lf_filled = lf.join(
    pl.Series(
        name="t",
        values=pl.int_range(
            start=lf.select("t").first().collect().item(0, 0),
            end=lf.select("t").last().collect().item(0, 0) + 1,
            eager=True,
        ),
    )
    .to_frame()
    .lazy(),
    on="t",
    how="right",
).fill_null(0)

The output is correct and I am never collecting any more data than the two values needed for start and end.

This looks like there should be a better way to do this. Happy to hear other suggestions :)

I am in a situation where I have some time series data, potentially looking like this:

{
        "t": [1, 2, 5, 6, 7],
        "y": [1, 1, 1, 1, 1],
}

As you can see, the time stamp jumps from 2 to 5. For my analysis, I would like to fill in zeros for the time stamps 3, and 4.

In reality, I might have multiple gaps with varying lengths. I'd like to fill this gap for all other columns.

I'd also really like to keep my data in a LazyFrame since this is only one step in my pipeline. I don't think that .interpolate is really addressing my issue, nor is fill_null helpful here.

I managed to achieve what I want, but it looks too complex:


# Dummy, lazy data.
lf = pl.LazyFrame(
    {
        "t": [1, 2, 5, 6, 7],
        "y": [1, 1, 1, 1, 1],
    }
)


lf_filled = lf.join(
    pl.Series(
        name="t",
        values=pl.int_range(
            start=lf.select("t").first().collect().item(0, 0),
            end=lf.select("t").last().collect().item(0, 0) + 1,
            eager=True,
        ),
    )
    .to_frame()
    .lazy(),
    on="t",
    how="right",
).fill_null(0)

The output is correct and I am never collecting any more data than the two values needed for start and end.

This looks like there should be a better way to do this. Happy to hear other suggestions :)

Share asked Mar 7 at 12:25 ThomasThomas 1,2551 gold badge18 silver badges35 bronze badges
Add a comment  | 

1 Answer 1

Reset to default 3

I think your approach is sensible, there's just no need for an intermediate collect:

lf.join(
    lf.select(pl.int_range(pl.col.t.first(), pl.col.t.last()+1)),
    on="t",
    how="right"
)
.fill_null(0)

An alternate approach that might be a bit more efficient is to use an asof-join with no tolerance:

lf.select(pl.int_range(pl.col.t.first(), pl.col.t.last()+1))
  .join_asof(lf, on="t", tolerance=0)
  .fill_null(0)

本文标签: pythonFill gaps in time series data in a Polars LazyDataframeStack Overflow