I have a Pandas dataframe:

import pandas as pd
import numpy as np

np.random.seed(150)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])

I want to add a new column "C" in which each value is a list of the current value of column "B" and the two values before it, i.e. a rolling window of three rows. The following method does what I need, but it is slow when the data is large.

>>> df['C'] = [df['B'].iloc[i-2:i+1].tolist() if i >= 2 else None for i in range(len(df))]
>>> df
   A  B          C
0  4  9       None
1  0  2       None
2  4  5  [9, 2, 5]
3  7  9  [2, 5, 9]
4  8  3  [5, 9, 3]
5  8  1  [9, 3, 1]
6  1  4  [3, 1, 4]
7  4  1  [1, 4, 1]
8  1  9  [4, 1, 9]
9  3  7  [1, 9, 7]

When I try to use rolling(...).apply instead, I get an error message:

df['C'] = df['B'].rolling(window=3).apply(lambda x: list(x), raw=False)

TypeError: must be real number, not list

As far as I can tell, apply on a rolling window cannot return a list, so how do I do this? I searched the forum but couldn't find a suitable topic about returning a list from apply.
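For context, here is a minimal sketch of why the call above fails, assuming a reasonably recent pandas version. Rolling.apply expects the function to reduce each window to a single number, because the results are written into a float array. The short series `s` and the variable names are made up for illustration.

```python
import pandas as pd

# Illustrative data only: the first five values of column B from the question.
s = pd.Series([9, 2, 5, 9, 3])

# Works: the function reduces each window to one number.
window_sums = s.rolling(window=3).apply(lambda x: x.sum(), raw=True)
print(window_sums.tolist())

# Fails: the function returns a list, which cannot be stored as a float.
error_message = None
try:
    s.rolling(window=3).apply(lambda x: list(x), raw=False)
except Exception as exc:  # a TypeError in the question's traceback
    error_message = f"{type(exc).__name__}: {exc}"
print(error_message)
```

So the list-valued column has to be built outside of Rolling.apply, which is what the answers below do.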

asked Feb 4 at 9:28 by Sun Jar, edited Feb 28 at 20:26 by TylerH

4 Answers

You can use numpy's sliding_window_view:

from numpy.lib.stride_tricks import sliding_window_view as swv

N = 3
df['C'] = pd.Series(swv(df['B'], N).tolist(), index=df.index[N-1:])

Output:

   A  B          C
0  4  9        NaN
1  0  2        NaN
2  4  5  [9, 2, 5]
3  7  9  [2, 5, 9]
4  8  3  [5, 9, 3]
5  8  1  [9, 3, 1]
6  1  4  [3, 1, 4]
7  4  1  [1, 4, 1]
8  1  9  [4, 1, 9]
9  3  7  [1, 9, 7]
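As a small illustration of why the first N-1 rows come out as NaN here: the windows are wrapped in a Series whose index starts at position N-1, and pandas aligns on the index when the column is assigned. The sketch below mimics that alignment with reindex on a short made-up series; the names `b`, `aligned` and `full` are only for illustration. It also converts to a NumPy array first, which is an optional step.

```python
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Illustrative data only: the first five values of column B.
b = pd.Series([9, 2, 5, 9, 3], name="B")
N = 3

# One row per window: shape is (len(b) - N + 1, N).
windows = sliding_window_view(b.to_numpy(), N)
print(windows.shape)

# Label each window with the index of its last row.
aligned = pd.Series(windows.tolist(), index=b.index[N - 1:])

# Aligning to the full index leaves the first N-1 rows missing (NaN).
full = aligned.reindex(b.index)
print(full)
```

If None is preferred over NaN for the leading rows, the list can be padded by hand instead, as the other answers do.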

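You could also approach this column-wise instead of row-wise: take N shifted slices of column B and stack them side by side, so that each row of the stacked array is one window. This loops only N times rather than once per row, so it should be faster than the original list comprehension unless the window size N is large.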

For example, you can try

N = 3
nr = len(df)
df['C'] = [None]*(N-1) + np.column_stack([df['B'].iloc[k:nr-N+1+k] for k in range(N)]).tolist()

and you will obtain

   A  B          C
0  4  9       None
1  0  2       None
2  4  5  [9, 2, 5]
3  7  9  [2, 5, 9]
4  8  3  [5, 9, 3]
5  8  1  [9, 3, 1]
6  1  4  [3, 1, 4]
7  4  1  [1, 4, 1]
8  1  9  [4, 1, 9]
9  3  7  [1, 9, 7]
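To make the column-wise idea concrete, here is a minimal sketch on a plain NumPy array with made-up values. The names `shifted` and `stacked` are only for illustration.

```python
import numpy as np

# Illustrative data only: the first six values of column B.
b = np.array([9, 2, 5, 9, 3, 1])
N = 3
n_windows = len(b) - N + 1

# N slices, each shifted by one position relative to the previous one.
shifted = [b[k:k + n_windows] for k in range(N)]

# Placing the slices side by side as columns turns each row into a window.
stacked = np.column_stack(shifted)
print(stacked)

# Pad the front so the result has one entry per original row.
result = [None] * (N - 1) + stacked.tolist()
```

Each of the N slices is a cheap view, and the only Python-level loop runs N times.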

This code takes column 'B' of the DataFrame as a NumPy array and forms sliding windows of size three over it. Each window is stored as a list in a new column 'C'. The first two rows of 'C' are None because those rows do not have enough preceding elements to form a full window. The function sliding_window_view makes this straightforward: it returns a read-only view of the original array instead of copying the data, so the only copy happens when tolist() builds the Python lists.

import pandas as pd
import numpy as np

# Use sliding_window_view for fast rolling window extraction
from numpy.lib.stride_tricks import sliding_window_view

# Sample Data 
np.random.seed(150)

df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])
print(df)

'''
   A  B
0  4  9
1  0  2
2  4  5
3  7  9
4  8  3
5  8  1
6  1  4
7  4  1
8  1  9
9  3  7
'''

# Convert column to NumPy array
B_values = df['B'].to_numpy()

# Apply a sliding window: imagine a window of size 3 moving across the
# array one position at a time; at each position it exposes the elements
# inside the window. The result has shape (len(B_values) - 2, 3).
windows = sliding_window_view(B_values, window_shape=3)

# Create a new column, filling the first two rows with None
df['C'] = [None, None] + windows.tolist()

print(df)

'''
   A  B          C
0  4  9       None
1  0  2       None
2  4  5  [9, 2, 5]
3  7  9  [2, 5, 9]
4  8  3  [5, 9, 3]
5  8  1  [9, 3, 1]
6  1  4  [3, 1, 4]
7  4  1  [1, 4, 1]
8  1  9  [4, 1, 9]
9  3  7  [1, 9, 7]
'''
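A possible generalization, sketched under the assumption that you want an arbitrary window size rather than the hard-coded two leading None values. The helper name `rolling_lists` is hypothetical, not part of pandas or NumPy. The last lines also check the claim that sliding_window_view returns a view rather than a copy.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def rolling_lists(series, n):
    """Hypothetical helper: one list of the last n values per row, None while incomplete."""
    values = series.to_numpy()
    # sliding_window_view raises ValueError if the window exceeds the array length.
    if n > len(values):
        return [None] * len(values)
    windows = sliding_window_view(values, window_shape=n)
    return [None] * (n - 1) + windows.tolist()


# Illustrative data only: the first five values of column B.
b = pd.Series([9, 2, 5, 9, 3])
print(rolling_lists(b, 3))

# The windows share memory with the source array and are read-only by default.
values = b.to_numpy()
windows = sliding_window_view(values, window_shape=3)
shares = np.shares_memory(values, windows)
writeable = windows.flags.writeable
print(shares, writeable)
```

With the question's DataFrame this would be used as `df['C'] = rolling_lists(df['B'], 3)`.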

Here is another way, iterating over the rolling object directly so that each iteration yields one window as a Series:

df.assign(C = [s.tolist() if len(s) == 3 else None for s in df['B'].rolling(3)])

Output:

   A  B          C
0  4  9       None
1  0  2       None
2  4  5  [9, 2, 5]
3  7  9  [2, 5, 9]
4  8  3  [5, 9, 3]
5  8  1  [9, 3, 1]
6  1  4  [3, 1, 4]
7  4  1  [1, 4, 1]
8  1  9  [4, 1, 9]
9  3  7  [1, 9, 7]
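A short sketch of what the iteration yields, assuming a pandas version in which Rolling objects are iterable (1.1 or later, as far as I know). The small DataFrame is made up for illustration. Note that assign returns a new DataFrame, so the result has to be bound to a name to keep column C.

```python
import pandas as pd

# Illustrative data only: the first four rows of the question's DataFrame.
df = pd.DataFrame({"A": [4, 0, 4, 7], "B": [9, 2, 5, 9]})

# Iterating over a Rolling object yields one Series per row; the first
# windows are shorter because there are not enough preceding rows yet.
window_lengths = [len(w) for w in df["B"].rolling(3)]
print(window_lengths)

# Keep only full windows, and bind the DataFrame returned by assign.
df = df.assign(C=[w.tolist() if len(w) == 3 else None for w in df["B"].rolling(3)])
print(df)
```

This stays close to the original list comprehension and is easy to read, but it still creates one Series per row, so it is not expected to be as fast as the NumPy-based answers on large data.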

Tags: python | How to use the apply function to return a list to new column in Pandas | Stack Overflow