
I am running some jobs on a Slurm GPU cluster using the submitit Python package and I get a strange error non-deterministically. I have multiple calls that save my current agent (a JAX model), and most of them work fine.

I know that this error usually points to having too many arguments or an environment that is too big. I've printed os.environ, and it doesn't change from call to call. The paths also don't get substantially longer (one character at most, and other saves with paths of the same length succeed). An example path is logs/GCBC_explore_32c_disc-acttraj/run_logs/configuration_26/phase_0/seed_3/params_50000.pkl
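
For reference, errno 7 is E2BIG ("Argument list too long"), which the kernel normally returns when an exec call's combined arguments and environment exceed its limit. To double-check that the environment really does stay the same size from call to call, a rough byte count can be logged before each save. This is only a diagnostic sketch of my own; env_size_bytes is not part of the project:

    import os

    def env_size_bytes() -> int:
        # rough size of the environment block: key, value, '=' and NUL per entry
        return sum(len(k) + len(v) + 2 for k, v in os.environ.items())

    # log this right before each save and compare across calls
    print(f"env entries: {len(os.environ)}, approx size: {env_size_bytes()} bytes", flush=True)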

Unfortunately I don't have a reproducible example, as I haven't been able to narrow it down so far.

    Traceback (most recent call last):
      File "<frozen runpy>", line 198, in _run_module_as_main
      File "<frozen runpy>", line 88, in _run_code
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/_submit.py", line 11, in <module>
        submitit_main()
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/submission.py", line 76, in submitit_main
        process_job(args.folder)
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/submission.py", line 69, in process_job
        raise error
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/submission.py", line 55, in process_job
        result = delayed.result()
                 ^^^^^^^^^^^^^^^^
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/utils.py", line 137, in result
        self._result = self.function(*self.args, **self.kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/gcrl_landscapes/main.py", line 250, in run_config_slurm_tasks_wrapper
        return run_config(
               ^^^^^^^^^^^
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/gcrl_landscapes/main.py", line 160, in run_config
        eval_trajectory = train(
                          ^^^^^^
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/gcrl_landscapes/training.py", line 270, in train
        save_agent(agent, str(save_dir), i)
      File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/ogbench/impls/utils/flax_utils.py", line 175, in save_agent
        with open(save_path, 'wb') as f:
             ^^^^^^^^^^^^^^^^^^^^^
    OSError: [Errno 7] Argument list too long

Submitting to the cluster is done in the following way.

    executor = submitit.AutoExecutor(folder=str(args.logdir / "submitit" / "%j"))
    executor.update_parameters(
        cpus_per_task=4,
        slurm_time=int(60 * args.tasks_per_node * ((200000 - args.phase) / 200000)),  # this overestimates, keep safety margin
        slurm_gpus_per_node=1,
        tasks_per_node=args.tasks_per_node,
        slurm_mem_per_cpu="1G",
        slurm_array_parallelism=50,
        slurm_partition=args.partition,
        slurm_job_name=args.jobname,
        slurm_mail_user=...,
        slurm_mail_type="BEGIN,FAIL,END",
    )
    executor.map_array(run_config_slurm_tasks_wrapper, *chunked_arguments)
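
If I understand submitit correctly, map_array submits a Slurm job array with one task per element of the chunked argument lists, and slurm_array_parallelism=50 caps how many of those array tasks run at the same time.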

The saving part of the code looks like this (from ogbench):

    import os
    import pickle

    import flax

    def save_agent(agent, save_dir, epoch):
        """Save the agent to a file.

        Args:
            agent: Agent.
            save_dir: Directory to save the agent.
            epoch: Epoch number.
        """
        save_dict = dict(
            agent=flax.serialization.to_state_dict(agent),
        )
        save_path = os.path.join(save_dir, f'params_{epoch}.pkl')
        with open(save_path, 'wb') as f:
            pickle.dump(save_dict, f)

        print(f'Saved to {save_path}')
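
Note that the print in save_agent only runs after the open succeeds, so when the error hits, the exact failing path never makes it into the log (a commenter below points this out as well). A small debugging variant of my own, not part of ogbench, that records the path before opening and attaches it to the exception could look like this:

    import os
    import pickle

    import flax

    def save_agent_debug(agent, save_dir, epoch):
        """Hypothetical debugging variant of save_agent: logs the exact path
        before opening so a failing value is visible in the job log."""
        save_path = os.path.join(save_dir, f'params_{epoch}.pkl')
        print(f'About to save to {save_path!r} (len={len(save_path)})', flush=True)
        try:
            with open(save_path, 'wb') as f:
                pickle.dump(dict(agent=flax.serialization.to_state_dict(agent)), f)
        except OSError as e:
            # re-raise with the offending path attached for easier post-mortem
            raise OSError(e.errno, e.strerror, save_path) from e
        print(f'Saved to {save_path}')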

I know this isn't much to go on, but I am out of ideas. If anyone has any clue as to why this is happening, I would appreciate any help.


  • What's the value of save_path when the error occurs? Does it contain special characters, e.g. * or ? – jabaa, Feb 12 at 9:50
  • Ah, I knew I forgot something: logs/GCBC_explore_32c_disc-acttraj/run_logs/configuration_26/phase_0/seed_3/params_50000.pkl is an example. Will add it to the main question. – CrunchyFlakes, Feb 12 at 9:52
  • How have you checked it? With your debugger? – jabaa, Feb 12 at 9:53
  • Using a print statement (looking into the log afterwards). – CrunchyFlakes, Feb 12 at 9:54
  • But the print call is after with open(save_path, 'wb') as f:, which means you don't see the value that causes the problem; it's never printed because the exception is thrown first. – jabaa, Feb 12 at 9:55

1 Answer


As far as I can tell, this has been a SLURM setup problem.

Submitting only to the two partitions I usually use seems to resolve the issue. My guess for now is that the maximum argument list length gets set to 0 for some obscure reason when jobs on a specific combination of partitions access the same files/folders.
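
For anyone hitting the same thing: the partition restriction can be applied through the same update_parameters call shown in the question. A minimal sketch, assuming Slurm accepts a comma-separated partition list (the names below are placeholders for your cluster's actual partitions):

    # restrict submission to the partitions that are known to work;
    # "gpu_a" and "gpu_b" are placeholder partition names
    executor.update_parameters(slurm_partition="gpu_a,gpu_b")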
