
I have a folder with 180 files that are paired as *R1.fastq.gz and *R2.fastq.gz in their names. So it could be "abc_R1.fastq.gz" and "abc_R2.fastq.gz" as one pair and there are 90 of these pairs.

I have a program called spades that runs the following on a R1 and R2 pair.

bin/spades.py --meta -1 abc_R1.fastq.gz -2 abc_R2.fastq.gz -o output_folder

I have a folder called "fastq" with 90 pairs that read like so:

abc_R1.fastq.gz
abc_R2.fastq.gz
zxy_R1.fastq.gz
zxy_R2.fastq.gz
dfg_R1.fastq.gz
...

I would like to use SLURM_ARRAY_TASK_ID on each pair in a for loop and have the output be the SLURM_ARRAY_TASK_ID name with the contents inside. So something like this:

file1=$(ls /filepath/*R1.fastq.gz | sed -n ${SLURM_ARRAY_TASK_ID}p)
file2=$(ls /filepath/*R2.fastq.gz | sed -n ${SLURM_ARRAY_TASK_ID}p)

spades.py --meta -1 ${file1} -2 ${file2} -o outputpath/"${file1##R1.fast.gz}p"

As noted: SLURM_ARRAY_TASK_ID is basically the line number of the ls output for file names 1 and 2 where the first two file names would be "1" for SLURM_ARRAY_TASK_ID. This allows parallelization on a cluster.
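To illustrate, the same line-picking can be tried outside of Slurm on a plain list (here `r1_list.txt` is just a stand-in for the `ls` output, and the matching R2 name is derived from the R1 name with a bash substitution instead of a second `ls`):

```shell
# Simulate the sorted ls output with the three sample prefixes above
printf '%s\n' abc_R1.fastq.gz dfg_R1.fastq.gz zxy_R1.fastq.gz > r1_list.txt

# sed -n "${N}p" prints only line N of its input (N plays the role of
# SLURM_ARRAY_TASK_ID)
N=2
file1=$(sed -n "${N}p" r1_list.txt)

# Derive the paired R2 name by replacing _R1. with _R2.
file2=${file1/_R1./_R2.}
echo "$file1 / $file2"

rm r1_list.txt
```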

How could I do this in a loop? The problem I'm having is that the above works, but only on an array of 25 files; the server cannot run an array of 90, which is why a for loop would be better for my case. The output name is not that important, but I would like it to be associated with the file name pair.

Any help would be appreciated!!!


Asked Nov 22, 2024 at 19:46 by Sam Degregori; edited Nov 23, 2024 at 21:43
  • assuming SLURM_ARRAY_TASK_ID is an integer then "${SLURM_ARRAY_TASK_ID}p" is going to create a file named <integer>p which I assume is not what you're intending soooo, what do you really want the output file to be named? – markp-fuso Commented Nov 22, 2024 at 20:02
  • Sorry, SLURM_ARRAY_TASK_ID is commonly used in HPC clustering so I didn't think I needed to specify. The output can be anything; it could actually just be ${file1}, just as long as I can tie it back to the original pair. – Sam Degregori Commented Nov 22, 2024 at 20:24
  • 1 probably an ok assumption for people coming to this question based on the bioinformatics and/or slurm tags, but for people coming here based on the bash and for-loop tags (like me) the SLURM_ARRAY_TASK_ID reference is greek – markp-fuso Commented Nov 22, 2024 at 20:25
  • 1 please update the question with more details on what you mean by kind of works but only for a given array size; what 'size' is ok and what 'size' is not ok? what happens when 'size' is too big? – markp-fuso Commented Nov 22, 2024 at 20:48
  • 1 that appears to change the whole purpose of your question or perhaps that should be a completely new/different question: an issue with slurm configuration when looping over a large number of files such that slurm appears to repeatedly process the first 25 files; that's a completely different issue than (effectively) 'how to name the output file' – markp-fuso Commented Nov 22, 2024 at 21:12

1 Answer 1


Assuming you want to keep one instance of spades.py per job, you can organise the work like this:

Submission script submit.sh:

#!/bin/bash
#SBATCH ...
#SBATCH --array=0-17

OFFSET=${1:-0}
# +1 because sed numbers lines from 1 while the array task IDs start at 0
IDX=$((OFFSET*18 + SLURM_ARRAY_TASK_ID + 1))

file1=$(ls /filepath/*R1.fastq.gz | sed -n "${IDX}p")
file2=$(ls /filepath/*R2.fastq.gz | sed -n "${IDX}p")

# ${file1%_R1.fastq.gz} strips the suffix so the output is named after the pair
spades.py --meta -1 "${file1}" -2 "${file2}" -o outputpath/"$(basename "${file1%_R1.fastq.gz}")"

And then, submit it in the following loop:

for i in {0..4}; do sbatch submit.sh $i ; done

This is untested, so you will have to check all indices and bounds. The aim is to submit five 18-task job arrays to process the 90 pairs of files. You can test by adding an echo in front of the spades.py line, as suggested by @markp-fuso, and replacing sbatch with bash in the for loop.
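One way to sanity-check the index arithmetic without touching Slurm is to enumerate every (offset, task) pair by hand; note that sed -n "${IDX}p" counts lines from 1, so a 0-based task ID needs a +1:

```shell
# Enumerate every (offset, task) pair: five job arrays of 18 tasks each
# should cover pair indices 1..90 exactly once
count=0
for OFFSET in 0 1 2 3 4; do
  for TASK in $(seq 0 17); do
    IDX=$((OFFSET*18 + TASK + 1))   # +1: sed line numbers start at 1
    count=$((count + 1))
    last=$IDX
  done
done
echo "pairs covered: $count, highest index: $last"
# prints "pairs covered: 90, highest index: 90"
```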

Tags: bash – How to use SLURM_ARRAY_TASK_ID on paired files with a for loop – Stack Overflow