I have a folder with 180 files that are paired as *R1.fastq.gz and *R2.fastq.gz in their names. So it could be "abc_R1.fastq.gz" and "abc_R2.fastq.gz" as one pair and there are 90 of these pairs.
I have a program called spades that I run on an R1 and R2 pair as follows:
bin/spades.py --meta -1 abc_R1.fastq.gz -2 abc_R2.fastq.gz -o output_folder
I have a folder called "fastq" with 90 pairs that read like so:
abc_R1.fastq.gz
abc_R2.fastq.gz
zxy_R1.fastq.gz
zxy_R2.fastq.gz
dfg_R1.fastq.gz
...
I would like to use SLURM_ARRAY_TASK_ID on each pair in a for loop and have the output be the SLURM_ARRAY_TASK_ID name with the contents inside. So something like this:
file1=$(ls /filepath/*R1.fastq.gz | sed -n ${SLURM_ARRAY_TASK_ID}p)
file2=$(ls /filepath/*R2.fastq.gz | sed -n ${SLURM_ARRAY_TASK_ID}p)
spades.py --meta -1 "${file1}" -2 "${file2}" -o outputpath/"$(basename "${file1%_R1.fastq.gz}")"
Here SLURM_ARRAY_TASK_ID acts as the line number in the ls output for the R1 and R2 lists, so the first pair of files corresponds to SLURM_ARRAY_TASK_ID 1. This allows parallelization on a cluster.
How could I do this in a loop? The problem I'm having is that the above works, but only with an array of 25 files; the server cannot run an array of 90, which is why a for loop would be better in my case. The output name is not that important, but I would like it to be associated with the file name pair.
Any help would be appreciated!!!
Asked Nov 22, 2024 at 19:46 by Sam Degregori; edited Nov 23, 2024 at 21:43.
1 Answer
Assuming you want to keep one instance of spades.py per job, you can organise the work like this:
Submission script submit.sh:
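#!/bin/bash
#SBATCH ...
#SBATCH --array=1-18
# Batch number passed on the command line (0 to 4); defaults to 0
OFFSET=${1:-0}
# sed line numbers start at 1: batch 0 covers pairs 1-18, batch 4 covers pairs 73-90
IDX=$((OFFSET*18+SLURM_ARRAY_TASK_ID))
file1=$(ls /filepath/*R1.fastq.gz | sed -n "${IDX}p")
file2=$(ls /filepath/*R2.fastq.gz | sed -n "${IDX}p")
# Name the output folder after the sample, i.e. the R1 file name without its suffix
spades.py --meta -1 "${file1}" -2 "${file2}" -o outputpath/"$(basename "${file1%_R1.fastq.gz}")"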
And then, submit it in the following loop:
for i in {0..4}; do sbatch submit.sh $i ; done
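The index arithmetic behind this loop can be checked without Slurm or spades.py. The sketch below is only an illustration: it assumes task IDs run from 1 to 18 (sed counts lines from 1) and batch offsets from 0 to 4. The names TASKS_PER_ARRAY, N_BATCHES and seen are made up for this example and do not appear in the script.

```shell
#!/bin/bash
# Dry run of the index arithmetic only: no Slurm, no spades.py, no real files.
# Assumed values: 18 tasks per array (IDs 1..18) and 5 batches (offsets 0..4).
TASKS_PER_ARRAY=18
N_BATCHES=5

seen=()
for ((offset = 0; offset < N_BATCHES; offset++)); do
    for ((task = 1; task <= TASKS_PER_ARRAY; task++)); do
        # Same formula as IDX in the submission script,
        # with "task" standing in for the array task ID.
        idx=$((offset * TASKS_PER_ARRAY + task))
        seen+=("$idx")
    done
done

first=${seen[0]}
last=${seen[${#seen[@]}-1]}
# Count distinct indices to detect any pair that would be processed twice.
unique=$(printf '%s\n' "${seen[@]}" | sort -n | uniq | wc -l | tr -d ' ')
echo "indices ${first}..${last}, total ${#seen[@]}, unique ${unique}"
```

If the first index is 1, the last is 90, and the total and unique counts both equal 90, every pair is visited exactly once.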
The script and loop above are untested, so you will have to check all indices and bounds. The aim is to submit five 18-task job arrays to process the 90 pairs of files. You can test by adding an echo in front of the spades.py line, as suggested by @markp-fuso, and replacing sbatch with bash in the for loop. Outside Slurm the array task ID variable is not set, so set it by hand for such a test.
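A dry run in that spirit might look like the sketch below. It is an assumption-laden illustration rather than part of the answer: the temporary directory and the three dummy samples stand in for /filepath, IDX is set by hand, and the R2 name is derived from the R1 name instead of being read from a second ls, so a missing mate is noticed rather than silently mispaired. The names fastq_dir, r1_files, sample and cmd are hypothetical.

```shell
#!/bin/bash
# Hypothetical dry run: dummy files in a temporary directory stand in for /filepath.
fastq_dir=$(mktemp -d)
for s in abc dfg zxy; do
    touch "${fastq_dir}/${s}_R1.fastq.gz" "${fastq_dir}/${s}_R2.fastq.gz"
done

# Stand-in for OFFSET*18 + array task ID; 2 selects the second pair.
IDX=2

# Glob expansion is sorted, so every task sees the files in the same order.
r1_files=("${fastq_dir}"/*_R1.fastq.gz)
file1=${r1_files[IDX-1]}                    # bash arrays are 0-based, IDX is 1-based
file2=${file1%_R1.fastq.gz}_R2.fastq.gz     # derive the mate from the R1 name
sample=$(basename "${file1%_R1.fastq.gz}")  # used to name the output folder

if [[ -e "${file2}" ]]; then
    # echo instead of running spades.py, so nothing is actually assembled
    cmd="spades.py --meta -1 ${file1} -2 ${file2} -o outputpath/${sample}"
    echo "${cmd}"
else
    cmd=""
    echo "missing mate for ${file1}" >&2
fi

rm -rf "${fastq_dir}"
```

With the dummy samples above, IDX=2 selects the dfg pair, and the printed command ends in the output folder outputpath/dfg.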
Comments:
[...] SLURM_ARRAY_TASK_ID is an integer then "${SLURM_ARRAY_TASK_ID}p" is going to create a file named <integer>p which I assume is not what you're intending soooo, what do you really want the output file to be named? – markp-fuso Commented Nov 22, 2024 at 20:02
[...] bioinformatics and/or slurm tags, but for people coming here based on the bash and for-loop tags (like me) the SLURM_ARRAY_TASK_ID reference is greek – markp-fuso Commented Nov 22, 2024 at 20:25
[...] "kind of works but only for a given array size"; what 'size' is ok and what 'size' is not ok? what happens when 'size' is too big? – markp-fuso Commented Nov 22, 2024 at 20:48
[...] slurm configuration when looping over a large number of files such that slurm appears to repeatedly process the first 25 files; that's a completely different issue than (effectively) 'how to name the output file' – markp-fuso Commented Nov 22, 2024 at 21:12