Dataframe can't multiprocess and reference in functions

I have a Pandas dataframe with columns that need to be calculated sequentially: Column A helps calculate Column B, which helps calculate Column F, and so on. Processing can be slow because Python uses only one thread due to the GIL.

I'm trying to have:

  1. First block of code run and finish normally.
  2. Functions 1-3 run after the first block of code, use multiprocessing, can reference the code above, and can be used in the code below.
  3. The second block of code runs after Function 3 and references all lines above.

I tried putting the resource-heavy blocks of code in functions and using multiprocessing on only those, but ran into two big issues:

  1. New columns created inside the functions couldn't reference the code above and couldn't be referenced in the code below.
  2. When I created local variables in each function to reference the global dataframe, multiprocessing took much longer than normal processing.

Here is part of my code so far. I'm newish to Python, but trying.

import pandas as pd
import multiprocessing as mp
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## I pull some data

df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))
df['volume'] = df['volume_og']
...

def functionone():
    df = pd.DataFrame()
    df['market_9'] = df.apply(lambda x : "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None, axis=1)
    ...

def functiontwo():
    df = pd.DataFrame()
    ...

def functionthree():
    df = pd.DataFrame()
    df['nine_score'] = df.apply(lambda x : x['strength'] if x['market_9'] == "9" else None, axis=1)
    ...

fig = make_subplots(specs=[[{"secondary_y": True}]])

fig.add_trace( 
    go.Bar( 
        x=df['Date'],
        y=df['volume'],
        ...
    ), secondary_y=True,
)


if __name__ == '__main__':

    p1 = mp.Process(target=functionone)
    p2 = mp.Process(target=functiontwo)
    p3 = mp.Process(target=functionthree)

    p1.start()
    p2.start()
    p3.start()
    ...

  • If column F depends on column B, which depends on column A, then using multiprocessing seems useless - they have to be calculated one after another. Or maybe you should apply a function which creates both values at the same time and assigns them to two columns. – furas Commented Mar 30 at 20:09
  • multiprocessing doesn't share memory, so it has to send all the data from the main process to the other processes - and that can take time. – furas Commented Mar 30 at 20:12
  • df[ ['market_9', 'nine_score'] ] = df.apply(lambda x: ["9", x["strength"] ] if ... else [None, None], axis=1, result_type='expand') – furas Commented Mar 30 at 20:23 (a runnable sketch of this pattern follows the comment thread)
  • you may also try to use polars instead of pandas - maybe it will work faster. – furas Commented Mar 30 at 20:24
  • df = pd.DataFrame() within each function creates a new, empty dataframe each time, which therefore does not have any values that can be used. Please show a minimal reproducible example. – user19077881 Commented Mar 30 at 22:41
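
To make the combined-apply suggestion above concrete, here is a minimal sketch (an illustration under assumptions, not the asker's code) that computes both dependent columns in one pass, using the column names and market-window condition from the question:

def add_market_columns(df):
    # assumes df is the question's dataframe with Date, market_9_start,
    # market_9_end and strength columns already present
    def label(row):
        # both dependent values are produced together, so there is no need to
        # ship the dataframe off to worker processes and collect columns back
        if row['market_9_start'] <= row['Date'] < row['market_9_end']:
            return ["9", row['strength']]
        return [None, None]

    # result_type='expand' spreads each returned list across the two target columns
    df[['market_9', 'nine_score']] = df.apply(label, axis=1, result_type='expand')
    return df

df = add_market_columns(df)

The vectorized rewrites in the answer below avoid the row-wise apply entirely and should be faster still.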

1 Answer

The underlying functions of Pandas/NumPy are mostly implemented in C, but when you use df.apply you throw that out the window and fall back to a Python-level loop over the rows. Generally there is a better, vectorized way if you look into the documentation.

For example, what you have here:

df['market_9'] = df.apply(
    lambda x: "9" if x['Date'] >= x['market_9_start'] and x['Date'] < x['market_9_end'] else None,
    axis=1,
)

Could be re-written as:

df.loc[df.Date.ge(df.market_9_start) & df.Date.lt(df.market_9_end), "market_9"] = "9"

This gives a far greater speed increase than trying to use multiprocessing.
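
The same pattern covers the nine_score column from functionthree; a minimal sketch, assuming strength and market_9 already exist on df:

df['nine_score'] = df['strength'].where(df['market_9'].eq("9"))

Series.where keeps the strength value where the condition holds and fills NaN elsewhere, which matches the apply version that returned None.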

Another example - Change this:

df['Date'] = df['timestamp'].apply(lambda x: pd.to_datetime(x*1000000))

Into this:

df["Date"] = pd.to_datetime(df["timestamp"], unit="s")
