Is there a way to group jobs locally on Snakemake?
I am trying to create a Snakemake pipeline that uses prefetch to download a large set of fastq files from SRA and generate bam files. This is straightforward and works for a few samples, but when I try to scale up locally I run into a storage problem: when I run the pipeline for 100 samples, it downloads all the fastq files with rule 1 before processing them with the following rules.
My pipeline has all the steps before the final bam file marked as temporary, so storage won't be a problem after everything is analyzed and fastq files are deleted.
It wouldn't be a problem if I could run 10 samples at a time, deleting the temporary outputs before downloading new fastq files.
I tried grouping the jobs and giving snakemake 10 cores, expecting it to download 10 samples and run them through the entire pipeline before starting the 11th. But groups are apparently not applied in local execution, so that didn't work.
I also tried pipe(). I changed the output of the prefetch rule from "temporary()" to "pipe()". Piping prefetch into fasterq-dump works outside of Snakemake, but inside Snakemake it fails with an "Error in group..." message, apparently because fasterq-dump tries to read the file before it has finished downloading.
This is the start of the snakemake code that I am using to test this pipe approach:
rule prefetch:
    input:
    output:
        pipe(os.path.join(config['all_path']['sra_down'], "{srasample}/{srasample}.sra"))
    wildcard_constraints:
        srasample="|".join(list(config['fastq_sra_HE'].values()))
    params:
        sraID="{srasample}",
        srapath=expand(config["all_path"]["sra_down"])
    shell:
        "prefetch {params.sraID} -O {params.srapath}"

rule fasterqdump:
    input:
        os.path.join(config['all_path']['sra_down'], "{srasample}/{srasample}.sra")
    output:
        r1=temporary(os.path.join(config['all_path']['sra_down'], "{srasample}_1.fastq")),
        r2=temporary(os.path.join(config['all_path']['sra_down'], "{srasample}_2.fastq"))
    wildcard_constraints:
        srasample="|".join(list(config['fastq_sra_HE'].values())),
        r=["1","2"]
    params:
        sraID=os.path.join(config['all_path']['sra_down'], "{srasample}"),
        srapath=expand(config["all_path"]["sra_down"])
    shell:
        "fasterq-dump {params.sraID} -O {params.srapath}"
This is the error:
2024-11-21T16:14:51 fasterq-dump.2.11.3 err: invalid accession '/disco2/tiago/worldpops/SRR204016'
fasterq-dump quit with error code 3
[Thu Nov 21 13:14:52 2024]
Error in group ffb6a5f4-6dee-447e-9a2e-2dea42295399:
jobs:
rule fasterqdump:
jobid: 3
output: /disco2/tiago/worldpops/SRR204016_1.fastq, /disco2/tiago/worldpops/SRR204016_2.fastq
rule prefetch:
jobid: 4
output: /disco2/tiago/worldpops/SRR204016/SRR204016.sra (pipe)
Shutting down, this might take some time.
Any help on how to tackle this would be appreciated.
Best, Tiago
asked Nov 22, 2024 at 10:38 by TiagoRibeiro

2 Answers
If you give your final rule a higher priority, Snakemake will try to execute it as soon as possible. So if you have 10 cores and each rule only requires one core, that should give you what you are looking for.
Disclaimer: I have not tested it, but from my understanding, this should do the trick.
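For instance, a minimal untested sketch of the idea; the rule name, reference, and paths below are placeholders, not taken from your Snakefile:

# Hypothetical final rule producing the bam file.
rule align:
    input:
        r1="fastq/{srasample}_1.fastq",
        r2="fastq/{srasample}_2.fastq"
    output:
        "bam/{srasample}.bam"
    # Default rule priority is 0; a higher value makes Snakemake schedule ready
    # jobs of this rule ahead of lower-priority jobs such as new downloads.
    priority: 50
    shell:
        "bwa mem ref.fa {input.r1} {input.r2} | samtools sort -o {output}"

Note that priority only affects the order in which ready jobs are scheduled; it does not by itself cap how many downloads run at once, which is where the resource idea below comes in.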
In addition to priority, you can set a resource to limit the number of prefetch jobs that run at once. It's not ideal; if fasterqdump is significantly slower you can still run out of space, depending on the scheduler. The idea is:
rule prefetch:
    input:
    output: ...
    resources:
        prefetch_jobs=1
Then you execute snakemake with --resources prefetch_jobs=5, which limits prefetch to only 5 simultaneous jobs. Combined with 10 cores and your priorities, a fasterqdump job will start as soon as a prefetch finishes. You can tune the resource if it feels like too much of a bottleneck or still too many downloads. I frequently use something like this to throttle downloads or jobs that generate giant files. A combined sketch follows.
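Put together, an untested sketch of both ideas, with placeholder paths (the exact fasterq-dump invocation may need adjusting for your setup):

# Each prefetch job claims 1 unit of the 'prefetch_jobs' resource, so running
# with --resources prefetch_jobs=5 allows at most 5 downloads at the same time.
rule prefetch:
    output:
        temporary("sra/{srasample}/{srasample}.sra")
    resources:
        prefetch_jobs=1
    shell:
        "prefetch {wildcards.srasample} -O sra"

# Downstream rules get a higher priority, so finished downloads are converted
# (and their temporary files cleaned up) before new downloads are started.
rule fasterqdump:
    input:
        "sra/{srasample}/{srasample}.sra"
    output:
        r1=temporary("sra/{srasample}_1.fastq"),
        r2=temporary("sra/{srasample}_2.fastq")
    priority: 10
    shell:
        "fasterq-dump {input} -O sra"

Run it with something like:

snakemake --cores 10 --resources prefetch_jobs=5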