I was trying to create a Snakemake script to automate 3 tasks. The first rule edits a .seg file so that it is the correct input for the next rule. The second rule computes a transcription-factor analysis for each BAM and outputs a folder named after the sample, containing a subfolder named 'TranscriptionFactors'. The hierarchy would be:

output_folder
└── sampleID
    └── TranscriptionFactors
        └── GTRD_ChIP_Only_1000sites
            ├── file 1
            ├── file 2
            └── file ...

Finally, the last rule receives as input output_folder + sampleID plus the sampleID itself (./scoring_pipeline.sh <output_folder_from_prev_step> <sample name>), and outputs a folder named AccessibilityOutput, a subfolder with images, and many txt files:

output_folder
└── sampleID
    ├── TranscriptionFactors
    │   └── GTRD_ChIP_Only_1000sites
    │       ├── file 1
    │       ├── file 2
    │       └── file ...
    ├── AccessibilityOutput
    │   ├── {}Accessibility1KSitesAdjusted.txt
    │   ├── {}Accessibility1KSites.txt
    │   └── file ...
    ├── WaveletOut
    │   ├── image 1
    │   ├── image 2
    │   └── image ...
    └── log.txt
import os

configfile: "config.yaml"

SAMPLES = [d for d in os.listdir(config["input_folder"]) if os.path.isdir(os.path.join(config["input_folder"], d))]
#print(SAMPLES)

rule all:
    input:
        expand(config["output_folder"] + "/{sampleID}", sampleID=SAMPLES)

rule format_to_cn_seg_files:
    input:
        seg=config["seg_folder"] + "/{sampleID}.seg"
    output:
        filtered_seg="filtered_segs/{sampleID}_filtered.seg"
    shell:
        """
        mkdir -p filtered_segs
        head -n 10 {input.seg}  # Preview first 10 lines
        awk '{{print $2 "\\t" $3 "\\t" $4 "\\t" $NF}}' {input.seg} > {output.filtered_seg}
        awk '!($NF ~ /^NA$/)' {output.filtered_seg} > {output.filtered_seg}.tmp && mv {output.filtered_seg}.tmp {output.filtered_seg}
        sed -i '1d' {output.filtered_seg}
        sed -i '1i chr\\tstart\\tend\\tLog2-ratio' {output.filtered_seg}
        """

rule run_tf_analyses_from_bam:
    input:
        bam=config["input_folder"] + "/{sampleID}/{sampleID}.srtd.markdup.bam",
        bai=config["input_folder"] + "/{sampleID}/{sampleID}.srtd.markdup.bam.bai",
        seg="filtered_segs/{sampleID}_filtered.seg"
    output:
        directory(config["output_folder"] + "/{sampleID}")
    shell:
        """
        mkdir -p {output}
        python {config[path_to_scripts]}/run_tf_analyses_from_bam_ori.py \\
            -b {input.bam} \\
            -o {output} \\
            -norm-file {input.seg} -hg38 -a tf_gtrd_1000sites
        """

rule calculate_accessibility:
    input:
        tf_results=config["output_folder"] + "/{sampleID}"  # Directory with results from the previous step
    output:
        directory(config["output_folder"] + "/{sampleID}")
    message: "Executing calculate_accessibility with {input}."
    shell:
        """
        ./scoring_pipeline.sh {input.tf_results} {wildcards.sampleID}
        """
The issue is that when I do a dry run with Snakemake, it only recognizes the rules format_to_cn_seg_files and run_tf_analyses_from_bam; calculate_accessibility is never scheduled.
This is the terminal output of snakemake -s snakemake_all_process.smk -n --debug-dag:
Building DAG of jobs...
candidate job all
wildcards:
candidate job calculate_accessibility
wildcards: sampleID=04B003764_R
candidate job run_tf_analyses_from_bam
wildcards: sampleID=04B003764_R
candidate job format_to_cn_seg_files
wildcards: sampleID=04B003764_R
selected job format_to_cn_seg_files
wildcards: sampleID=04B003764_R
selected job run_tf_analyses_from_bam
wildcards: sampleID=04B003764_R
candidate job calculate_accessibility
wildcards: sampleID=VH_14B_019722_R
candidate job run_tf_analyses_from_bam
wildcards: sampleID=VH_14B_019722_R
candidate job format_to_cn_seg_files
wildcards: sampleID=VH_14B_019722_R
selected job format_to_cn_seg_files
wildcards: sampleID=VH_14B_019722_R
selected job run_tf_analyses_from_bam
wildcards: sampleID=VH_14B_019722_R
selected job all
wildcards:
Job stats:
job count
------------------------ -------
all 1
format_to_cn_seg_files 2
run_tf_analyses_from_bam 2
total 5
Reasons:
(check individual jobs above for details)
input files updated by another job:
all, run_tf_analyses_from_bam
output files have to be generated:
run_tf_analyses_from_bam
set of input files has changed since last execution:
format_to_cn_seg_files
Some jobs were triggered by provenance information, see 'reason' section in the rule displays above.
If you prefer that only modification time is used to determine whether a job shall be executed, use the command line option '--rerun-triggers mtime' (also see --help).
If you are sure that a change for a certain output file (say, <outfile>) won't change the result (e.g. because you just changed the formatting of a script or environment definition), you can also wipe its metadata to skip such a trigger via 'snakemake --cleanup-metadata <outfile>'.
Rules with provenance triggered jobs: format_to_cn_seg_files
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
I tried editing rule all, but that also gives errors about the folder dependency (I don't have that output yet).
Asked Feb 11 at 11:12 by Alessandra Bonilla Salon; edited Feb 11 at 22:24 by Ed Morton.

1 Answer
The first issue here is that you have two rules that produce the exact same output:
directory(config["output_folder"] + "/{sampleID}")
Normally I'd expect to see an "ambiguous rule" error, but for some reason in this particular case Snakemake is choosing the first rule that can produce the requested directory, which is run_tf_analyses_from_bam.
Also, your calculate_accessibility rule cannot work, as it has its input and its output set to the exact same pattern. Each job that runs from a rule needs to create every item in its output, regardless of whether that item is a file or a directory, so you can never have a rule whose output is the same as its input.
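One common way around the input-equals-output problem (a minimal sketch, assuming scoring_pipeline.sh writes its results into the already-existing sample directory) is to give calculate_accessibility a sentinel file as its output via Snakemake's touch() flag, so the input and output patterns differ:

```python
rule calculate_accessibility:
    input:
        tf_results=config["output_folder"] + "/{sampleID}"
    output:
        # Sentinel file: created (empty) by Snakemake once the shell
        # command succeeds. Its path differs from the input directory,
        # so the DAG gets a real dependency edge.
        touch(config["output_folder"] + "/{sampleID}_accessibility.done")
    shell:
        "./scoring_pipeline.sh {input.tf_results} {wildcards.sampleID}"
```

rule all would then request the .done files via expand() instead of the bare sample directories.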
I think you are under the impression that a directory() output means the rule writes some files into that existing directory, but that's not the case. Any output directory will be removed (if it exists) by Snakemake prior to running the job, and then the job itself must create the output directory. So in terms of building the DAG, a directory() output is no different from a regular file output.
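Putting both points together, one possible restructuring (a sketch, assuming run_tf_analyses_from_bam_ori.py creates the TranscriptionFactors subfolder when pointed at the sample directory, and that scoring_pipeline.sh creates AccessibilityOutput there) gives every rule its own distinct output path:

```python
rule run_tf_analyses_from_bam:
    input:
        bam=config["input_folder"] + "/{sampleID}/{sampleID}.srtd.markdup.bam",
        bai=config["input_folder"] + "/{sampleID}/{sampleID}.srtd.markdup.bam.bai",
        seg="filtered_segs/{sampleID}_filtered.seg"
    output:
        # Declare only the subfolder this rule actually creates,
        # not the whole sample directory.
        directory(config["output_folder"] + "/{sampleID}/TranscriptionFactors")
    shell:
        """
        python {config[path_to_scripts]}/run_tf_analyses_from_bam_ori.py \\
            -b {input.bam} \\
            -o {config[output_folder]}/{wildcards.sampleID} \\
            -norm-file {input.seg} -hg38 -a tf_gtrd_1000sites
        """

rule calculate_accessibility:
    input:
        tf_results=config["output_folder"] + "/{sampleID}/TranscriptionFactors"
    output:
        directory(config["output_folder"] + "/{sampleID}/AccessibilityOutput")
    shell:
        "./scoring_pipeline.sh {config[output_folder]}/{wildcards.sampleID} {wildcards.sampleID}"

rule all:
    input:
        expand(config["output_folder"] + "/{sampleID}/AccessibilityOutput",
               sampleID=SAMPLES)
```

With distinct outputs there is no ambiguity, and rule all pulls in the final AccessibilityOutput directories, which forces calculate_accessibility (and, through its input, the earlier rules) into the DAG.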