I am running some jobs on a Slurm GPU cluster using the submitit Python package, and I get a strange error non-deterministically. I have multiple calls that save my current agent (a JAX model), and most of them work fine.
I know the error points to having too many arguments or an environment that is too big. I have printed os.environ; it doesn't change from call to call. The paths also don't get substantially longer (one character at most, and other saves with paths of the same length succeed). An example path is logs/GCBC_explore_32c_disc-acttraj/run_logs/configuration_26/phase_0/seed_3/params_50000.pkl
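A minimal sketch of the kind of check I mean; the comparison against ARG_MAX is my own addition for illustration, since that is the kernel limit Errno 7 ("Argument list too long") normally refers to when a new process is exec'd:

import os
import sys

# Rough size of the environment block (key, '=', value, trailing NUL per entry).
env_size = sum(len(k) + len(v) + 2 for k, v in os.environ.items())
# Rough size of the argument vector (each string NUL-terminated).
argv_size = sum(len(a) + 1 for a in sys.argv)
# ARG_MAX is the limit that E2BIG ("Argument list too long") is defined against.
arg_max = os.sysconf("SC_ARG_MAX")
print(f"env: {env_size} B, argv: {argv_size} B, ARG_MAX: {arg_max} B")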
Unfortunately I don't have a reproducible example, as I haven't been able to narrow it down so far.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/submission.py", line 76, in submitit_main
    process_job(args.folder)
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/submission.py", line 69, in process_job
    raise error
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
             ^^^^^^^^^^^^^^^^
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/submitit/core/utils.py", line 137, in result
    self._result = self.function(*self.args, **self.kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/gcrl_landscapes/main.py", line 250, in run_config_slurm_tasks_wrapper
    return run_config(
           ^^^^^^^^^^^
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/gcrl_landscapes/main.py", line 160, in run_config
    eval_trajectory = train(
                      ^^^^^^
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/gcrl_landscapes/training.py", line 270, in train
    save_agent(agent, str(save_dir), i)
  File "/bigwork/username/.conda/envs/gcrl/lib/python3.12/site-packages/ogbench/impls/utils/flax_utils.py", line 175, in save_agent
    with open(save_path, 'wb') as f:
         ^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 7] Argument list too long
Submitting to the cluster is done in the following way.
executor = submitit.AutoExecutor(folder=str(args.logdir / "submitit" / "%j"))
executor.update_parameters(
    cpus_per_task=4,
    slurm_time=int(60 * args.tasks_per_node * ((200000 - args.phase) / 200000)),  # this overestimates, keep safety margin
    slurm_gpus_per_node=1,
    tasks_per_node=args.tasks_per_node,
    slurm_mem_per_cpu="1G",
    slurm_array_parallelism=50,
    slurm_partition=args.partition,
    slurm_job_name=args.jobname,
    slurm_mail_user=...,
    slurm_mail_type="BEGIN,FAIL,END",
)
executor.map_array(run_config_slurm_tasks_wrapper, *chunked_arguments)
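As far as I can tell from the traceback (so treat this as an assumption), submitit pickles the function and its arguments into the folder given to the AutoExecutor and only hands that folder to python -m submitit.core._submit, so the per-task payload should never go through the command line. A minimal sketch to check how big those pickles actually end up, assuming the folder layout above ("logs" stands in for args.logdir):

import os
from pathlib import Path

# The submitit folder configured above; "logs" is a placeholder for args.logdir.
submitit_root = Path("logs") / "submitit"

# List every pickle submitit wrote (submissions and results) with its size,
# to rule out an unreasonably large payload being handed to the workers.
for p in sorted(submitit_root.rglob("*.pkl")):
    print(p, os.path.getsize(p), "bytes")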
The saving part of the code looks like this (out of ogbench):
def save_agent(agent, save_dir, epoch):
    """Save the agent to a file.

    Args:
        agent: Agent.
        save_dir: Directory to save the agent.
        epoch: Epoch number.
    """
    save_dict = dict(
        agent=flax.serialization.to_state_dict(agent),
    )
    save_path = os.path.join(save_dir, f'params_{epoch}.pkl')
    with open(save_path, 'wb') as f:
        pickle.dump(save_dict, f)
    print(f'Saved to {save_path}')
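Since the print comes after the open, the failing path is never shown when the exception is raised. A debugging variant I am considering (my own sketch, not part of ogbench) that logs the exact path, its length, the errno and the environment size before re-raising:

import errno
import os
import pickle
import sys

import flax  # assumed available, as in the original ogbench helper


def save_agent_debug(agent, save_dir, epoch):
    """Like ogbench's save_agent, but logs diagnostics if the save fails."""
    save_dict = dict(agent=flax.serialization.to_state_dict(agent))
    save_path = os.path.join(save_dir, f'params_{epoch}.pkl')
    try:
        with open(save_path, 'wb') as f:
            pickle.dump(save_dict, f)
    except OSError as e:
        env_size = sum(len(k) + len(v) + 2 for k, v in os.environ.items())
        print(
            f'save failed: path={save_path!r} len={len(save_path)} '
            f'errno={e.errno} ({errno.errorcode.get(e.errno, "?")}) '
            f'env_size={env_size}',
            file=sys.stderr,
        )
        raise
    print(f'Saved to {save_path}')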
I know this isn't much to go on, but I am out of ideas. If anyone has any clue as to why this is happening, I'd be grateful for any help.
Answer:
As far as I can tell, this has been a Slurm setup problem.
Only submitting to the two partitions I usually use seems to resolve the issue. My guess for now is that the maximum argument list length gets set to 0 for some obscure reason when jobs on a specific combination of partitions access the same files/folders.
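In practice the workaround is just to pin the submission to the partitions that behave, e.g. (partition names below are placeholders; to my knowledge sbatch accepts a comma-separated list for --partition):

# Restrict submission to the known-good partitions (placeholder names).
executor.update_parameters(slurm_partition="gpu_short,gpu_long")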
Comments on the question:
jabaa (Feb 12 at 9:50): What is save_path when the error occurs? Does it contain special characters, e.g. * or ??
CrunchyFlakes (Feb 12 at 9:52): logs/GCBC_explore_32c_disc-acttraj/run_logs/configuration_26/phase_0/seed_3/params_50000.pkl is an example. Will add it to the main question.
jabaa (Feb 12 at 9:55): The print call is after with open(save_path, 'wb') as f:. That means you don't see the value that causes the problem; it is not printed, because the exception is thrown first.