Is there a way to group jobs locally on Snakemake?
I am trying to create a Snakemake pipeline that uses prefetch to download a large set of fastq files from SRA and generate bam files. This is straightforward and works for a few samples, but when I try to scale up locally I run into a storage problem: when I run the pipeline for 100 samples, it downloads all the fastq files with rule 1 before processing them with the following rules.
My pipeline has all the steps before the final bam file marked as temporary, so storage won't be a problem after everything is analyzed and fastq files are deleted.
It wouldn't be a problem if I could run 10 samples at a time, deleting the temporary outputs before downloading new fastq files.
I tried grouping the jobs and giving snakemake 10 cores, expecting it to download 10 samples and run them through the entire pipeline before starting the 11th. But groups are apparently not applied in local execution, so that didn't work.
I also tried pipe(). I changed the output of the prefetch rule from "temporary()" to "pipe()". Piping prefetch into fasterq-dump works outside of Snakemake, but inside Snakemake it fails with an "Error in group..." message, apparently because fasterq-dump tries to read the file before it has finished downloading.
This is the start of the snakemake code that I am using to test this pipe approach:
rule prefetch:
    input:
    output:
        pipe(os.path.join(config['all_path']['sra_down'], "{srasample}/{srasample}.sra"))
    wildcard_constraints:
        srasample="|".join(list(config['fastq_sra_HE'].values()))
    params:
        sraID="{srasample}",
        srapath=expand(config["all_path"]["sra_down"])
    shell:
        "prefetch {params.sraID} -O {params.srapath}"

rule fasterqdump:
    input:
        os.path.join(config['all_path']['sra_down'], "{srasample}/{srasample}.sra")
    output:
        r1=temporary(os.path.join(config['all_path']['sra_down'], "{srasample}_1.fastq")),
        r2=temporary(os.path.join(config['all_path']['sra_down'], "{srasample}_2.fastq"))
    wildcard_constraints:
        srasample="|".join(list(config['fastq_sra_HE'].values())),
        r=["1","2"]
    params:
        sraID=os.path.join(config['all_path']['sra_down'], "{srasample}"),
        srapath=expand(config["all_path"]["sra_down"])
    shell:
        "fasterq-dump {params.sraID} -O {params.srapath}"
This is the error:
2024-11-21T16:14:51 fasterq-dump.2.11.3 err: invalid accession '/disco2/tiago/worldpops/SRR204016'
fasterq-dump quit with error code 3
[Thu Nov 21 13:14:52 2024]
Error in group ffb6a5f4-6dee-447e-9a2e-2dea42295399:
jobs:
rule fasterqdump:
jobid: 3
output: /disco2/tiago/worldpops/SRR204016_1.fastq, /disco2/tiago/worldpops/SRR204016_2.fastq
rule prefetch:
jobid: 4
output: /disco2/tiago/worldpops/SRR204016/SRR204016.sra (pipe)
Shutting down, this might take some time.
Any help on how to tackle this would be appreciated.
Best, Tiago
asked Nov 22, 2024 at 10:38 by TiagoRibeiro

2 Answers
If you give your final rule a higher priority, Snakemake will try to execute it as soon as possible. So if you have 10 cores and each rule only requires one core, that should give you what you are looking for.
Disclaimer: I have not tested it, but from my understanding, this should do the trick.
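For instance, a minimal untested sketch of the idea; the rule name, reference, and paths below are placeholders, not taken from your Snakefile:

# Hypothetical final rule producing the bam file.
rule align:
    input:
        r1="fastq/{srasample}_1.fastq",
        r2="fastq/{srasample}_2.fastq"
    output:
        "bam/{srasample}.bam"
    # Default rule priority is 0; a higher value makes Snakemake schedule ready
    # jobs of this rule ahead of lower-priority jobs such as new downloads.
    priority: 50
    shell:
        "bwa mem ref.fa {input.r1} {input.r2} | samtools sort -o {output}"

Note that priority only affects the order in which ready jobs are scheduled; it does not by itself cap how many downloads run at once, which is where the resource idea below comes in.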
In addition to priority, you can set a resource to limit the number of prefetch jobs that run at once. It's not ideal; if fasterqdump is significantly slower you can still run out of space, depending on the scheduler. The idea is:
rule prefetch:
    input:
    output: ...
    resources:
        prefetch_jobs=1
Then you execute snakemake with --resources prefetch_jobs=5, which limits prefetch to only 5 simultaneous jobs. Combined with 10 cores and your priorities, a fasterqdump job will start as soon as a prefetch finishes. You can tune the resource if it feels like too much of a bottleneck or still too many downloads. I frequently use something like this to throttle downloads or jobs that generate giant files. A combined sketch follows.
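Put together, an untested sketch of both ideas, with placeholder paths (the exact fasterq-dump invocation may need adjusting for your setup):

# Each prefetch job claims 1 unit of the 'prefetch_jobs' resource, so running
# with --resources prefetch_jobs=5 allows at most 5 downloads at the same time.
rule prefetch:
    output:
        temporary("sra/{srasample}/{srasample}.sra")
    resources:
        prefetch_jobs=1
    shell:
        "prefetch {wildcards.srasample} -O sra"

# Downstream rules get a higher priority, so finished downloads are converted
# (and their temporary files cleaned up) before new downloads are started.
rule fasterqdump:
    input:
        "sra/{srasample}/{srasample}.sra"
    output:
        r1=temporary("sra/{srasample}_1.fastq"),
        r2=temporary("sra/{srasample}_2.fastq")
    priority: 10
    shell:
        "fasterq-dump {input} -O sra"

Run it with something like:

snakemake --cores 10 --resources prefetch_jobs=5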