I want to create a constant volume chart in python. Here is an example with a constant volume of 50 and some sample data:

import pandas as pd
import numpy as np
date_rng = pd.date_range(start='2024-01-01', end='2024-12-31 23:00:00', freq='h')

# Create a dataframe with the date range
df = pd.DataFrame(date_rng, columns=['timestamp'])

# Add the 'price' column with random floating numbers between 70 and 100
df['price'] = np.round(np.random.uniform(70, 100, size=(len(date_rng))), 2)

# Add the 'volume' column with random integers between 1 and 10
df['volume'] = np.random.randint(1, 11, size=(len(date_rng)))

constantvolume = 50
df['cumsum'] = np.cumsum(df['volume'])
df['mod'] = df['cumsum']/  constantvolume
df['whole'] = np.ceil(df['mod'])
df['next_num'] = df['whole'].shift(-1) - df['whole']
df['mod2'] = df[df['next_num'] > 0]['cumsum'] % constantvolume
df['mod2'] = df['mod2'].fillna(0)
dfa = df.groupby(df['whole']).agg({'price': ['min', 'max', 'last', 'first'], 'timestamp': 'first', 'volume': 'sum'})
dfa.columns = ['low', 'high', 'close', 'open', 'timestamp', 'volume']
dfa['timestamp'] = pd.to_datetime(dfa['timestamp'])
dfa.set_index('timestamp', inplace=True)
dfa

This is very close to what I want. The only issue is that the volume in each row is not exactly the defined quantity of 50, because the cumsum doesn't always land on a multiple of 50.

So wherever next_num > 0, I would have to check whether the group's volume equals the defined constant volume. If it does, fine. If not, the next row has to be split into two rows with the same timestamp and the same price, with the volume divided between them so that the mod is zero, and then move on.

The desired result is a final dataframe where volume = constantvolume in every row exactly, with the exception of the last row, where it may be smaller.

The only way I can think of is a loop, which I don't think is the best way and which will be very slow, as the actual dataframe has 1 million rows...

asked Feb 16 at 16:02 by nik, edited Feb 16 at 17:28

1 Answer

Here's one approach:

Minimal reproducible example

np.random.seed(42) # for reproducibility
date_rng = pd.date_range(start='2024-01-01', periods=5, freq='h')

# ... as above

df['cumsum'] = df['volume'].cumsum()

Data

With cv = 10 (the constantvolume), the row at index 3 has to contribute to two groups: the one ending at cumsum 20 and the last one ending at 27:

            timestamp  price  volume  cumsum
0 2024-01-01 00:00:00  81.24       3       3    # 10
1 2024-01-01 01:00:00  98.52       7      10    # 10
2 2024-01-01 02:00:00  91.96       8      18    # 20
3 2024-01-01 03:00:00  87.96       5      23    # split: 20 | 27
4 2024-01-01 04:00:00  74.68       4      27    # 27
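
To make the split visible, here is a small sketch that rebuilds the five rows above from literal values (rather than from the seeded random generator) and reports which multiples of cv fall inside a row instead of on a row's cumsum. The names boundaries, rows, needs_split and report are only illustrative, and np.searchsorted is used here purely for inspection; it is not part of the approach below.

```python
import numpy as np
import pandas as pd

# Literal copy of the five sample rows shown in the table above.
df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=5, freq="h"),
    "price": [81.24, 98.52, 91.96, 87.96, 74.68],
    "volume": [3, 7, 8, 5, 4],
})
df["cumsum"] = df["volume"].cumsum()

cv = 10
cumsum = df["cumsum"].to_numpy()

# Multiples of cv strictly below the total volume: the bar boundaries.
boundaries = np.arange(cv, cumsum[-1], cv)

# Row in which each boundary falls (first row whose cumsum >= boundary).
rows = np.searchsorted(cumsum, boundaries, side="left")

# A split is needed only when the boundary is not hit exactly by a cumsum.
needs_split = cumsum[rows] != boundaries

report = pd.DataFrame(
    {"boundary": boundaries, "row": rows, "needs_split": needs_split}
)
print(report)
```

Boundary 10 coincides with the cumsum of row 1, so nothing has to be split there, while boundary 20 lies between 18 and 23, i.e. inside row 3.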

Code

cv = 10

cv_rng = range(cv, df['cumsum'].max(), cv)

s_rng = pd.Series(list(set(cv_rng).difference(df['cumsum'])), 
              dtype=df['cumsum'].dtype, 
              name='cumsum')

df = (pd.concat([df, s_rng])
      .sort_values('cumsum')
      .bfill()
      .assign(
          group=lambda x: np.minimum(np.ceil(x['cumsum'] / cv) * cv, 
                                 x['cumsum'].max())
          )
      )

out = (df.groupby('group').agg(
    **{k:('price', v) for k, v in zip(['low', 'high', 'close', 'open'], 
                                      ['min', 'max', 'last', 'first'])},
    volume=('group', 'max'),
    timestamp=('timestamp', 'first')
    )
    .assign(volume=lambda x: np.where(x['volume'] % cv == 0, 
                                      cv, 
                                      x['volume'] % cv)
            )
    .set_index('timestamp')
    )

Output:

                       low   high  close   open  volume
timestamp                                              
2024-01-01 00:00:00  81.24  98.52  98.52  81.24    10.0
2024-01-01 02:00:00  87.96  91.96  87.96  91.96    10.0
2024-01-01 03:00:00  74.68  87.96  74.68  87.96     7.0 # ts/high/open = index `3`

Explanation / intermediates

  • First, create a range with start=cv, stop=df['cumsum'].max() and step=cv.
list(cv_rng)
[10, 20]
  • Rows must be added to df for every value in cv_rng that does not already occur in df['cumsum']: take the set difference (set + set.difference) against df['cumsum'] and wrap the result in a pd.Series (s_rng).
  • Append s_rng to df with pd.concat, order the rows by 'cumsum' with df.sort_values, and apply df.bfill so that each inserted row takes the timestamp, price, and volume of the row that follows it (the row being split).
  • Add a 'group' column with df.assign: np.ceil rounds each 'cumsum' up to the next multiple of cv, and np.minimum caps the last group at the maximum of 'cumsum'.
(pd.concat([df, s_rng], keys=['df', 's_rng'])
 .sort_values('cumsum')
 .bfill()
 .assign(...))
# including `keys` for demonstration purposes only; not needed

                  timestamp  price  volume  cumsum  group
df    0 2024-01-01 00:00:00  81.24     3.0       3   10.0
      1 2024-01-01 01:00:00  98.52     7.0      10   10.0
      2 2024-01-01 02:00:00  91.96     8.0      18   20.0
s_rng 0 2024-01-01 03:00:00  87.96     5.0      20   20.0 # vals from next row
df    3 2024-01-01 03:00:00  87.96     5.0      23   27.0 # note 27 last group
      4 2024-01-01 04:00:00  74.68     4.0      27   27.0
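  • Finally, group by 'group' with df.groupby and use named aggregation.
  • For 'volume', the aggregation first takes the max of 'group' (the upper boundary of each bar). A chained df.assign then replaces it via np.where with cv if the remainder modulo cv is 0, and with the remainder otherwise (which can only happen for the last bar).
  • Lastly, use df.set_index to make 'timestamp' the index.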
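
For convenience, the steps above can be collected into one function and checked against the sample data. This is only a sketch: the name constant_volume_bars is hypothetical, the sample rows are typed in literally, and it assumes integer volumes of at least 1 (so that the cumsum values are unique). The final cast of 'volume' to int is an addition of this sketch, not part of the code above.

```python
import numpy as np
import pandas as pd


def constant_volume_bars(df: pd.DataFrame, cv: int) -> pd.DataFrame:
    """Hypothetical wrapper around the steps above; assumes integer volumes >= 1."""
    df = df.assign(cumsum=df["volume"].cumsum())
    total = df["cumsum"].max()

    # Multiples of cv that no row's cumsum hits exactly need an extra row.
    missing = sorted(set(range(cv, int(total), cv)).difference(df["cumsum"]))
    s_rng = pd.Series(missing, dtype=df["cumsum"].dtype, name="cumsum")

    df = (
        pd.concat([df, s_rng.to_frame()], ignore_index=True)
        .sort_values("cumsum")
        .bfill()  # inserted rows take timestamp/price from the row being split
    )
    # Upper boundary of the bar each row belongs to; last bar capped at total.
    df["group"] = np.minimum(np.ceil(df["cumsum"] / cv) * cv, total)

    return (
        df.groupby("group")
        .agg(
            low=("price", "min"),
            high=("price", "max"),
            close=("price", "last"),
            open=("price", "first"),
            volume=("group", "max"),
            timestamp=("timestamp", "first"),
        )
        .assign(
            volume=lambda x: np.where(
                x["volume"] % cv == 0, cv, x["volume"] % cv
            ).astype(int)
        )
        .set_index("timestamp")
    )


# Literal copy of the five sample rows used in this answer.
sample = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=5, freq="h"),
    "price": [81.24, 98.52, 91.96, 87.96, 74.68],
    "volume": [3, 7, 8, 5, 4],
})

bars = constant_volume_bars(sample, cv=10)
print(bars)

# Every bar except possibly the last should carry exactly cv.
assert (bars["volume"].iloc[:-1] == 10).all()
```

On the real data the same call would be constant_volume_bars(df[['timestamp', 'price', 'volume']], 50). Since every step is a vectorized pandas or NumPy operation, no Python-level loop over the rows is involved.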
