I have a problem running my Slurm code (I am quite a newbie).

I am trying to run a Slurm job with 28 tasks running in parallel within the same job; let's call them task1, task2, ..., task28.

Each srun ... & call invokes the same dosomething.sh script, passing it different parameters.

Inside the dosomething.sh script, four Python scripts are called sequentially (let's call them py1.py, py2.py, py3.py, py4.py). I need each one to complete before the next starts (py1.py must finish before py2.py starts, and so on). Meanwhile, I do not want the other tasks in the same job to wait; they must keep working independently in parallel.

I have checked the CPU usage on the node where the tasks I expected to run in parallel are executed, and it is around 100%, unlike what I expected (with 28 srun tasks running in parallel I would expect about 2800%).
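For reference, this is roughly how such a check can be done; the node name and job ID below are placeholders, not values from my setup:

# One way to look at the node's CPU load (node001 is a placeholder node name)
ssh node001 top -b -n 1 | head -n 15

# Or via Slurm's own accounting for the running job (JOBID is a placeholder)
sstat --format=JobID,AveCPU,NTasks -j JOBID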

Here is what I tried.

The Slurm batch script:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=28
#SBATCH --ntasks-per-node=28
#SBATCH --mem=4000 # Memory per node (in MB).

while read data        # one parameter set per line
do
    srun -n 1 --nodes=1 --exclusive dosomething.sh $data &   # launch each task in the background
done
wait    # block until all background srun steps have finished
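I left out where the loop's input comes from: the done line redirects from a parameter file, one parameter set per line. A sketch with the redirection shown (params.txt is a placeholder name for my actual file):

while read data
do
    srun -n 1 --nodes=1 --exclusive dosomething.sh $data &
done < params.txt
wait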

Here is dosomething.sh:

#!/bin/bash
# the four scripts must run one after another;
# each foreground command should block until it exits
python3 py1.py
python3 py2.py
python3 py3.py
python3 py4.py
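My understanding is that sequential foreground commands in bash already block until each one exits. For instance, this toy script (not part of my job, just a check of my expectation) prints the two lines in order:

#!/bin/bash
python3 -c "import time; time.sleep(2); print('first done')"
python3 -c "print('second starts only after the first exits')"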

For some reason that I do not understand, the .py scripts do not wait for the previous one to complete (I get error messages in the log).

Then I tried:

python3 py1.py
pid1=$!
wait $pid1
python3 py2.py
pid2=$!
wait $pid2
python3 py3.py
pid3=$!
wait $pid3
python3 py4.py

I expected each task to wait only for its own specific process to complete, but this blocks the parallel execution of the 28 tasks on the node.
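For reference, the pattern I think I was trying to adapt backgrounds each command first, since $! expands to the PID of the most recently backgrounded command. A sketch of that pattern (not what I actually ran):

python3 py1.py &   # run in the background so that $! is set
pid1=$!            # PID of py1.py
wait $pid1         # block until py1.py exits
python3 py2.py &
pid2=$!
wait $pid2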

What am I doing wrong?
