From charlesreid1

(Created page with "=Start= This tutorial is assuming you start out on a machine with a conda-based python distribution (Anaconda or Miniconda or other). If you don't have a version of conda, i...")
 
m (Replacing charlesreid1.com:3000 with git.charlesreid1.com)
 
(17 intermediate revisions by the same user not shown)
Line 1: Line 1:
Main repository:
* https://git.charlesreid1.com/dotfiles/dahak-flot
* https://github.com/charlesreid1/dahak-flot
=Start=
=Start=


This tutorial is assuming you start out on a machine with a conda-based python distribution (Anaconda or Miniconda or other).
This tutorial is assuming you start out on a machine with a conda-based python distribution (Anaconda or Miniconda or other).
==Installing Conda with Pyenv==


If you don't have a version of conda, it is recommended you use [[Pyenv]] to manage versions of python.
If you don't have a version of conda, it is recommended you use [[Pyenv]] to manage versions of python.
Line 26: Line 32:
         # We don't need to add ~/.pyenv/bin to $PATH,
         # We don't need to add ~/.pyenv/bin to $PATH,
         # it is already done.
         # it is already done.
if __name__=="__main__":
    install_pyenv()
</pre>
As noted in the script, you will need to add <code>~/.pyenv/bin</code> to the path, as that is where the [[Pyenv]] versions of [[Python]] live.
Next, we have a python script to install snakemake:
<pre>
#!/usr/bin/python3
import getpass
import tempfile
import subprocess
def install_pyenv():
    user = getpass.getuser()
    if(user=="root"):
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install snakemake
        conda_version = "miniconda3-4.3.30"
        installcmd = ["pyenv","install",conda_version]
        subprocess.call(installcmd)
       
        globalcmd = ["pyenv","global",conda_version]
        subprocess.call(globalcmd)
        # ---------------------------
        # Install snakemake
        pyenvbin = os.environ['HOME']
        condabin = pyenvbin+"/.pyenv/shims/conda"
        subprocess.call([condabin,"update"])
        subprocess.call([condabin,"config","--all","channels","r"])
        subprocess.call([condabin,"config","--all","channels","default"])
        subprocess.call([condabin,"config","--all","channels","conda-forge"])
        subprocess.call([condabin,"config","--all","channels","bioconda"])
        subprocess.call([condabin,"install","--yes","-c","bioconda","snakemake"])
        # ---------------------------
        # Install osf cli client
       
        pyenvbin = os.environ['HOME']
        pipbin = pyenvbin+"/.pyenv/shims/pip"
        subprocess.call([pipbin,"install","--upgrade","pip"])
        subprocess.call([pipbin,"install","--user","osfclient"])




Line 32: Line 91:
</pre>
</pre>


==Get Tutorial Files==
Start by getting the files needed for the tutorial:
<pre>
wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1
</pre>
Create the conda environment:
<pre>
conda env create --name snakemake-tutorial --file environment.yaml
</pre>
Now activate the conda environment:
<pre>
source activate snakemake-tutorial
</pre>
==First Snakefile Rule==
Create a Snakefile, and add the first rule:
<pre>
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
</pre>
This creates a folder with data in it, and an environment.yaml file for conda.
===Running the First Rule===
Unlike with Makefiles, Snakefile rules are executed based on their output files. So, if we want to execute the rule bwa_map, we look at the output file:
<pre>
rule bwa_map:
    ...
    output:
        "mapped_reads/{sample}.bam"
</pre>
That means we have to run snakefile and ask for <code>mapped_reads/{sample}.bam</code>, and this requires the input file <code>data/samples/{sample}.fastq</code> to be in place.
The shell portion is what stitches the input and output files together. So basically, you say "I want this output file" and Snakemake back-calculates the task graph needed to obtain that output file.
To do a dry run of the workflow:
<pre>
$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam
rule bwa_map:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped_reads/B.bam
    jobid: 0
    wildcards: sample=B
bwa mem data/genome.fa data/samples/B.fastq | samtools view -Sb - > mapped_reads/B.bam
rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 1
    wildcards: sample=A
bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
Job counts:
count jobs
2 bwa_map
2
</pre>
This explains the tasks that are going to be executed: two input files leads to two separate bwa_map tasks. We can see the commands that Snakemake is going to execute just below information about each rule.
Running a single rule:
<pre>
snakemake -np mapped_reads/A.bam
</pre>
Rnning two rules explicitly:
<pre>
snakemake -np mapped_reads/A.bam mapped_reads/B.bam
</pre>
Running two rules with bash convenience:
<pre>
snakemake -np mapped_reads/{A,B}.bam
</pre>
==More Snakefile Rules==
Now we add a second rule that depends on the first. Here are both rules:
<pre>
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
</pre>
Note that the rules are chained together because the output of bwa_map is the same as the input of samtools_sort. Also note that the two strings in the shell command are not separated by a comma so they are concatenated.
===Running the Second Rule===
Again, Snakemake is focused on the output files you want, so in this case we will let sample be A and B, which requires us to have data/samples/A.fastq and data/samples/B.fastq in place. Then ask for the output of samtools_sort:
Dry run first:
<pre>
snakemake -np sorted_reads/A.bam sorted_reads/B.bam
</pre>
Real deal:
<pre>
snakemake sorted_reads/A.bam sorted_reads/B.bam
</pre>
==Adding a Third Rule==
<pre>
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"
rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
</pre>
Now we have three rules that will run sequentially (bwa_map -> samtools_sort -> samtools_index) so long as we ask for the output file of the last rule, samtools_index.
===Running the Third Rule===
To run the third rule:
<pre>
snakemake sorted_reads/A.bam.bai
</pre>
This requests the output file of samtools_index, which then looks for the input file sorted_reads/A.bam. That does not exist, so it looks for the rule with sorted_reads/A.bam as an output file, and that's the samtools_sort rule, which requires the input file mapped_reads/A.bam. That file does not exist, so it looks for the (first) rule with mapped_reads/A.bam as an output file, that's bwa_map, and so it requires the input files data/genome.fa and data/samples/A.fastq.
<pre>
$ snakemake -np sorted_reads/A.bam.bai
rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 2
    wildcards: sample=A
bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
rule samtools_sort:
    input: mapped_reads/A.bam
    output: sorted_reads/A.bam
    jobid: 1
    wildcards: sample=A
samtools sort -T sorted_reads/A -O bam mapped_reads/A.bam > sorted_reads/A.bam
rule samtools_index:
    input: sorted_reads/A.bam
    output: sorted_reads/A.bam.bai
    jobid: 0
    wildcards: sample=A
samtools index sorted_reads/A.bam
Job counts:
count jobs
1 bwa_map
1 samtools_index
1 samtools_sort
3
</pre>
=More Complicated Operations=
==Expanding Patterns to Lists of Files==
Suppose we want to write a rule that, for its input and output files, should match a wildcard expression and turn it into a list of files.
As an example:
Start by defining a list of variables at the top of the file:
<pre>
SAMPLES = ["A", "B"]
</pre>
Now, use the expand function to feed in a string with a special formatting code, and a keyword argument:
<pre>
expand("sorted_reads/{sample}.bam", sample=SAMPLES)
</pre>
===Example Expand Rule===
Here is an example expand rule:
<pre>
rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"
</pre>


=Flags=


[[Category:Python]]
[[Category:Python]]
[[Category:Snakemake]]
[[Category:Snakemake]]
[[Category:Dahak]]

Latest revision as of 03:53, 9 October 2019

Main repository:

Start

This tutorial is assuming you start out on a machine with a conda-based python distribution (Anaconda or Miniconda or other).

Installing Conda with Pyenv

If you don't have a version of conda, it is recommended you use Pyenv to manage versions of python.

A very simple pyenv installation script:

install_pyenv.py

#!/usr/bin/python3
import getpass
import subprocess


def install_pyenv():
    user = getpass.getuser()
    if(user=="root"):
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install pyenv 
        pyenvcmd = ["curl","-L","https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer","|","/bin/bash"]
        subprocess.call(pyenvcmd, shell=True)

        # We don't need to add ~/.pyenv/bin to $PATH,
        # it is already done.

if __name__=="__main__":
    install_pyenv()

As noted in the script, you will need to add ~/.pyenv/bin to the path, as that is where the Pyenv versions of Python live.

Next, we have a python script to install snakemake:

#!/usr/bin/python3
import getpass
import tempfile
import subprocess


def install_pyenv():
    user = getpass.getuser()
    if(user=="root"):
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install snakemake
        conda_version = "miniconda3-4.3.30"

        installcmd = ["pyenv","install",conda_version]
        subprocess.call(installcmd)
        
        globalcmd = ["pyenv","global",conda_version]
        subprocess.call(globalcmd)

        # ---------------------------
        # Install snakemake

        pyenvbin = os.environ['HOME']
        condabin = pyenvbin+"/.pyenv/shims/conda"

        subprocess.call([condabin,"update"])

        subprocess.call([condabin,"config","--all","channels","r"])
        subprocess.call([condabin,"config","--all","channels","default"])
        subprocess.call([condabin,"config","--all","channels","conda-forge"])
        subprocess.call([condabin,"config","--all","channels","bioconda"])

        subprocess.call([condabin,"install","--yes","-c","bioconda","snakemake"])

        # ---------------------------
        # Install osf cli client
        
        pyenvbin = os.environ['HOME']
        pipbin = pyenvbin+"/.pyenv/shims/pip"

        subprocess.call([pipbin,"install","--upgrade","pip"])
        subprocess.call([pipbin,"install","--user","osfclient"])


if __name__=="__main__":
    install_pyenv()

Get Tutorial Files

Start by getting the files needed for the tutorial:

wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1

Create the conda environment:

conda env create --name snakemake-tutorial --file environment.yaml

Now activate the conda environment:

source activate snakemake-tutorial

First Snakefile Rule

Create a Snakefile, and add the first rule:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

This creates a folder with data in it, and an environment.yaml file for conda.

Running the First Rule

Unlike with Makefiles, Snakefile rules are executed based on their output files. So, if we want to execute the rule bwa_map, we look at the output file:

rule bwa_map:
    ...
    output:
        "mapped_reads/{sample}.bam"

That means we have to run snakefile and ask for mapped_reads/{sample}.bam, and this requires the input file data/samples/{sample}.fastq to be in place.

The shell portion is what stitches the input and output files together. So basically, you say "I want this output file" and Snakemake back-calculates the task graph needed to obtain that output file.

To do a dry run of the workflow:

$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam

rule bwa_map:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped_reads/B.bam
    jobid: 0
    wildcards: sample=B

bwa mem data/genome.fa data/samples/B.fastq | samtools view -Sb - > mapped_reads/B.bam

rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 1
    wildcards: sample=A

bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
Job counts:
	count	jobs
	2	bwa_map
	2

This explains the tasks that are going to be executed: two input files leads to two separate bwa_map tasks. We can see the commands that Snakemake is going to execute just below information about each rule.

Running a single rule:

snakemake -np mapped_reads/A.bam

Rnning two rules explicitly:

snakemake -np mapped_reads/A.bam mapped_reads/B.bam

Running two rules with bash convenience:

snakemake -np mapped_reads/{A,B}.bam 

More Snakefile Rules

Now we add a second rule that depends on the first. Here are both rules:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

Note that the rules are chained together because the output of bwa_map is the same as the input of samtools_sort. Also note that the two strings in the shell command are not separated by a comma so they are concatenated.

Running the Second Rule

Again, Snakemake is focused on the output files you want, so in this case we will let sample be A and B, which requires us to have data/samples/A.fastq and data/samples/B.fastq in place. Then ask for the output of samtools_sort:

Dry run first:

snakemake -np sorted_reads/A.bam sorted_reads/B.bam

Real deal:

snakemake sorted_reads/A.bam sorted_reads/B.bam

Adding a Third Rule

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

Now we have three rules that will run sequentially (bwa_map -> samtools_sort -> samtools_index) so long as we ask for the output file of the last rule, samtools_index.

Running the Third Rule

To run the third rule:

snakemake sorted_reads/A.bam.bai

This requests the output file of samtools_index, which then looks for the input file sorted_reads/A.bam. That does not exist, so it looks for the rule with sorted_reads/A.bam as an output file, and that's the samtools_sort rule, which requires the input file mapped_reads/A.bam. That file does not exist, so it looks for the (first) rule with mapped_reads/A.bam as an output file, that's bwa_map, and so it requires the input files data/genome.fa and data/samples/A.fastq.

$ snakemake -np sorted_reads/A.bam.bai

rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 2
    wildcards: sample=A

bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam

rule samtools_sort:
    input: mapped_reads/A.bam
    output: sorted_reads/A.bam
    jobid: 1
    wildcards: sample=A

samtools sort -T sorted_reads/A -O bam mapped_reads/A.bam > sorted_reads/A.bam

rule samtools_index:
    input: sorted_reads/A.bam
    output: sorted_reads/A.bam.bai
    jobid: 0
    wildcards: sample=A

samtools index sorted_reads/A.bam
Job counts:
	count	jobs
	1	bwa_map
	1	samtools_index
	1	samtools_sort
	3

More Complicated Operations

Expanding Patterns to Lists of Files

Suppose we want to write a rule that, for its input and output files, should match a wildcard expression and turn it into a list of files.

As an example:

Start by defining a list of variables at the top of the file:

SAMPLES = ["A", "B"]

Now, use the expand function to feed in a string with a special formatting code, and a keyword argument:

expand("sorted_reads/{sample}.bam", sample=SAMPLES)

Example Expand Rule

Here is an example expand rule:

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

Flags