We're working on a concurrency problem where a single "Sample" object has multiple dependent tasks that must be executed in stages. For instance, stage 1 has tasks (1a, 1b), stage 2 has tasks (2a, 2b), and so on. Each stage can only begin after all tasks in the previous stage complete.

When running in a single thread, we rely on the mutability of the Sample and its child objects to keep track of which tasks have finished—i.e., if 1a and 1b are marked as complete, then we trigger stage 2. However, in a multi-processing context, these references get pickled and passed to each worker. That means each task operates on a copy of the Sample rather than a shared mutable reference. Once the tasks complete, we're left with multiple copies whose state we have to reconcile manually.

I'd like to know:

  • Best practices for orchestrating dependent tasks so that when all tasks in stage 1 are finished, I can start stage 2 without losing track of what’s done.
  • How to avoid the “lost mutability” problem, where each process modifies a copy and I need to merge them back. Are there recommended patterns or data structures (like multiprocessing.Manager or some form of shared memory) that make this simpler?
  • How to handle the scenario where each task modifies the same sample object but we only want final, aggregated results in one place.

Below is a simplified code example. In real code, each task modifies the Sample's internal data, but as soon as we use ProcessPoolExecutor, the Sample object’s references become disconnected copies.

import concurrent.futures

class Sample:
    def __init__(self, sample_id):
        self.sample_id = sample_id
        # For illustration, let's track stages like {'1': [False, False], '2': [False, False], ...}
        self.stage_completion = {
            '1': [False, False],
            '2': [False, False],
            '3': [False, False]
        }
    
    def do_task(self, stage, sub_idx):
        # Do some work here
        print(f"Doing {stage}{sub_idx} for sample {self.sample_id}")
        self.stage_completion[stage][sub_idx] = True
        return self  # Return self for convenience

def run_task(sample_obj, stage, sub_idx):
    return sample_obj.do_task(stage, sub_idx)

def main():
    sample = Sample(sample_id=123)

    with concurrent.futures.ProcessPoolExecutor() as executor:
        # Submit tasks 1a and 1b (equivalent to stage '1' indexes [0, 1])
        future1 = executor.submit(run_task, sample, '1', 0)
        future2 = executor.submit(run_task, sample, '1', 1)
        
        # Wait for them to finish
        result1 = future1.result()
        result2 = future2.result()
        
        # Now I'd like to check if stage 1 is fully done before scheduling stage 2
        # But result1 and result2 are separate copies with their own state
        # This is where merging states or having a centralized tracking is tricky
        print("Stage 1 results from result1:", result1.stage_completion)
        print("Stage 1 results from result2:", result2.stage_completion)

if __name__ == "__main__":
    main()
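
For concreteness, running this prints something like the following (the order of the worker prints may vary), since each worker mutated its own pickled copy of the Sample:

Doing 10 for sample 123
Doing 11 for sample 123
Stage 1 results from result1: {'1': [True, False], '2': [False, False], '3': [False, False]}
Stage 1 results from result2: {'1': [False, True], '2': [False, False], '3': [False, False]}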

As you can see, each returned Sample object might have a partial view of the overall state. I'd prefer a solution that keeps them in sync or merges them easily, without resorting to writing manual “merge functions” for every internal data structure.

What are the recommended design patterns or approaches in Python for managing (and ultimately reconciling) mutable state across parallel tasks so that I can coordinate dependent tasks without losing the shared object’s unified state? Tips, examples using multiprocessing, concurrent.futures, or a more appropriate library would be much appreciated.

We'd guess the easiest way is to store the objects in a separate database - but then all the calls to that database may make it slow...

asked Jan 29 at 11:28 by NanoNerd
  • Without knowing what other types of attributes are in your actual Sample instance, it's difficult to give a definitive answer as to what approach you should take. You need to update your question. – Booboo Commented Jan 30 at 13:26
  • In the code you posted you are submitting two tasks for the first stage and then waiting for them to complete. At this point by definition stage 1 has completed and you can submit tasks for the next stage (a sketch of this idea follows these comments). So it isn't clear to me why you even have a stage_completion attribute. – Booboo Commented Jan 30 at 13:35
  • You might want to look into the multiprocessing.shared_memory package in the standard library. – Paul Cornelius Commented Feb 3 at 4:32
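
Expanding on the second comment: a minimal sketch (task bodies are placeholders) in which the main process keeps the only copy of the bookkeeping and the futures themselves act as the barrier between stages:

import concurrent.futures

def run_task(sample_id, stage, sub_idx):
    # workers are stateless: do the work, then report back what was done
    print(f"Doing {stage}{sub_idx} for sample {sample_id}")
    return stage, sub_idx

def main():
    sample_id = 123
    stage_completion = {s: [False, False] for s in ('1', '2', '3')}

    with concurrent.futures.ProcessPoolExecutor() as executor:
        for stage in ('1', '2', '3'):
            futures = [executor.submit(run_task, sample_id, stage, i) for i in (0, 1)]
            # waiting on the futures is itself the "stage finished" barrier
            for fut in concurrent.futures.as_completed(futures):
                done_stage, done_idx = fut.result()
                stage_completion[done_stage][done_idx] = True
            # every task of this stage is now done; the next iteration may start

if __name__ == "__main__":
    main()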

1 Answer


You can share a single Python object between processes by hosting it in a manager process and handing each worker a proxy to it (multiprocessing.managers.BaseManager below).

The limitation is that, because the workers only ever hold a proxy, you need to add getters and setters for any nested attributes you want to read or modify.

import concurrent.futures
import multiprocessing.managers
import copy

class SharedSample:
    def __init__(self, sample_id):
        self.sample_id = sample_id
        # For illustration, let's track stages like {'1': [False, False], '2': [False, False], ...}
        self.stage_completion = {
            '1': [False, False],
            '2': [False, False],
            '3': [False, False]
        }

    def get_sample_id(self):
        return self.sample_id
    def set_stage_completion(self, stage, subidx, value):
        self.stage_completion[stage][subidx] = value

# register the type with the manager
multiprocessing.managers.BaseManager.register("SharedSample", SharedSample)

# this cannot be a method of SharedSample, otherwise it would execute
# inside the manager process instead of the worker
def do_task(sample, stage, sub_idx):
    # Do some work here
    print(f"Doing {stage}{sub_idx} for sample {sample.get_sample_id()}")
    sample.set_stage_completion(stage, sub_idx, True)
    return sample  # Return self for convenience


def run_task(sample_obj, stage, sub_idx):
    return do_task(sample_obj, stage, sub_idx)


def main():
    with multiprocessing.managers.BaseManager() as sampleManager:
        sample = sampleManager.SharedSample(sample_id=123)

        with concurrent.futures.ProcessPoolExecutor() as executor:
            # Submit tasks 1a and 1b (equivalent to stage '1' indexes [0, 1])
            future1 = executor.submit(run_task, sample, '1', 0)
            future2 = executor.submit(run_task, sample, '1', 1)

            # Wait for them to finish
            result1 = future1.result()
            result2 = future2.result()

            # copy.deepcopy on a proxy fetches a plain local copy of the object from the manager
            print("Stage 1 results from result1:", copy.deepcopy(result1).stage_completion)
            print("Stage 1 results from result2:", copy.deepcopy(result2).stage_completion)


if __name__ == "__main__":
    main()

Output:

Doing 10 for sample 123
Doing 11 for sample 123
Stage 1 results from result1: {'1': [True, True], '2': [False, False], '3': [False, False]}
Stage 1 results from result2: {'1': [True, True], '2': [False, False], '3': [False, False]}

Getters return a copy, not a live reference. Don't add a getter for stage_completion; if you do, call it get_stage_completion_deepcopy() so a future maintainer won't be tempted to assign to the result.
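
A minimal sketch of that getter pattern (the name follows the suggestion above and is purely illustrative); whatever a proxied method returns is pickled back to the caller, so the caller always gets a snapshot:

import multiprocessing.managers

class SharedSample:
    def __init__(self, sample_id):
        self.sample_id = sample_id
        self.stage_completion = {'1': [False, False]}

    def set_stage_completion(self, stage, sub_idx, value):
        self.stage_completion[stage][sub_idx] = value

    def get_stage_completion_deepcopy(self):
        # the return value is pickled on its way back to the caller,
        # so this is a snapshot, not a live view of the managed state
        return self.stage_completion

multiprocessing.managers.BaseManager.register("SharedSample", SharedSample)

if __name__ == "__main__":
    with multiprocessing.managers.BaseManager() as manager:
        sample = manager.SharedSample(sample_id=123)
        sample.set_stage_completion('1', 0, True)

        snapshot = sample.get_stage_completion_deepcopy()
        print(snapshot)            # {'1': [True, False]}

        snapshot['1'][1] = True    # only mutates the local copy...
        print(sample.get_stage_completion_deepcopy())  # ...still {'1': [True, False]}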


This could also be done with Manager().dict() and Manager().list(), but they are tricky to use: any nested object you read back is a copy, so mutating it in place does not update the shared state; people almost always end up with bugs by treating them like ordinary dicts and lists when they are not; and every access is a round trip to the manager, which adds a lot of latency, especially when the proxies are nested. A minimal sketch of that approach follows.
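
The sketch below (identifiers are illustrative) keeps only the completion flags in a flat Manager().dict keyed by (stage, sub_idx), which sidesteps the nested-proxy problem entirely:

import concurrent.futures
import multiprocessing

def run_task(completion, stage, sub_idx):
    # ... real work for this sub-task would go here ...
    completion[(stage, sub_idx)] = True   # the write goes through the manager proxy

def main():
    with multiprocessing.Manager() as manager:
        # flat keys, so we never mutate a nested list through the proxy
        completion = manager.dict({(s, i): False for s in '123' for i in (0, 1)})

        with concurrent.futures.ProcessPoolExecutor() as executor:
            futures = [executor.submit(run_task, completion, '1', i) for i in (0, 1)]
            concurrent.futures.wait(futures)

        if all(completion[('1', i)] for i in (0, 1)):
            print("stage 1 complete, stage 2 can be submitted")

if __name__ == "__main__":
    main()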
