Dataframe can't multiprocess and reference in functions

I have a Pandas dataframe with columns that need to be calculated sequentially: Column A helps calculate Column B, which helps calculate Column F, and so on. Processing can be slow because Python uses only one thread due to the GIL.

I'm trying to have:

  1. First block of code run and finish normally.
  2. Functions 1-3 run after the first block of code, use multiprocessing, can reference the code above, and can be used in the code below.
  3. The second block of code runs after Function 3 and references all lines above.

I tried putting the resource-heavy blocks of code in functions and using multiprocessing on only those, but ran into two big issues:

  1. New columns created inside the functions couldn't reference the code above and couldn't be referenced in the code below.
  2. When I created local variables in each function to reference the global dataframe, multiprocessing took much longer than normal processing.

Here is part of my code so far. I'm newish to Python, but trying.

import pandas as pd
import multiprocessing as mp
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## I pull some data

df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))
df['volume'] = df['volume_og']
...

def functionone():
    df = pd.DataFrame()
    df['market_9'] = df.apply(lambda x : "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None, axis=1)
    ...

def functiontwo():
    df = pd.DataFrame()
    ...

def functionthree():
    df = pd.DataFrame()
    df['nine_score'] = df.apply(lambda x : x['strength'] if x['market_9'] == "9" else None, axis=1)
    ...

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace( 
    go.Bar( 
        x=df['Date'],
        y=df['volume'],
        ...
    ), secondary_y=True,
)


if __name__ == '__main__':

    p1 = mp.Process(target=functionone)
    p2 = mp.Process(target=functiontwo)
    p3 = mp.Process(target=functionthree)

    p1.start()
    p2.start()
    p3.start()
    ...

  • If column F depends on column B, which depends on column A, then using multiprocessing seems useless - they have to be calculated one after another. Or maybe you should apply a function which creates both values at the same time and assigns them to two columns. – furas Commented Mar 30 at 20:09
  • multiprocessing doesn't share memory, so it has to send all the data from the main process to the other processes - and that can take time. – furas Commented Mar 30 at 20:12
  • df[ ['market_9', 'nine_score'] ] = df.apply(lambda x: ["9", x["strength"] ] if ... else [None, None], axis=1, result_type='expand') – furas Commented Mar 30 at 20:23 (a runnable sketch of this pattern follows the comment thread)
  • you may also try to use polars instead of pandas - maybe it will work faster. – furas Commented Mar 30 at 20:24
  • df = pd.DataFrame() within each function creates a new, empty dataframe each time, which therefore does not have any values that can be used. Please show a minimal reproducible example. – user19077881 Commented Mar 30 at 22:41
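
To make the combined-apply suggestion above concrete, here is a minimal sketch (an illustration under assumptions, not the asker's code) that computes both dependent columns in one pass, using the column names and market-window condition from the question:

def add_market_columns(df):
    # assumes df is the question's dataframe with Date, market_9_start,
    # market_9_end and strength columns already present
    def label(row):
        # both dependent values are produced together, so there is no need to
        # ship the dataframe off to worker processes and collect columns back
        if row['market_9_start'] <= row['Date'] < row['market_9_end']:
            return ["9", row['strength']]
        return [None, None]

    # result_type='expand' spreads each returned list across the two target columns
    df[['market_9', 'nine_score']] = df.apply(label, axis=1, result_type='expand')
    return df

df = add_market_columns(df)

The vectorized rewrites in the answer below avoid the row-wise apply entirely and should be faster still.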

1 Answer

The underlying functions of Pandas/NumPy are mostly implemented in C, but when you use df.apply you throw that out the window and fall back to a Python-level loop over the rows. Generally there is a better, vectorized way if you look into the documentation.

For example, what you have here:

df['market_9'] = df.apply(
    lambda x: "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None,
    axis=1,
)

Could be re-written as:

df.loc[df.Date.ge(df.market_9_start) & df.Date.lt(df.market_9_end), "market_9"] = "9"

This gives a far greater speed increase than trying to use multiprocessing.
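
The same pattern covers the nine_score column from functionthree; a minimal sketch, assuming strength and market_9 already exist on df:

df['nine_score'] = df['strength'].where(df['market_9'].eq("9"))

Series.where keeps the strength value where the condition holds and fills NaN elsewhere, which matches the apply version that returned None.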

Another example - Change this:

df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))

Into this:

df["Date"] = pd.to_datetime(df["timestamp"], unit="s")
