From charlesreid1

Latest revision as of 03:53, 9 October 2019

Main repository:

* https://git.charlesreid1.com/dotfiles/dahak-flot
* https://github.com/charlesreid1/dahak-flot
Start

This tutorial assumes you are starting on a machine with a conda-based Python distribution (Anaconda, Miniconda, or similar).

Installing Conda with Pyenv

If you don't have a version of conda, it is recommended that you use Pyenv to manage versions of Python.

A very simple pyenv installation script:

install_pyenv.py

#!/usr/bin/python3
import getpass
import subprocess


def install_pyenv():
    user = getpass.getuser()
    if user == "root":
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install pyenv. The command is a pipeline, so it must be
        # passed to the shell as a single string (with shell=True,
        # a list would only run its first element).
        pyenvcmd = "curl -L https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer | /bin/bash"
        subprocess.call(pyenvcmd, shell=True)

        # The installer does not modify your shell profile for you;
        # remember to add ~/.pyenv/bin to $PATH afterwards.

if __name__ == "__main__":
    install_pyenv()

As noted in the script, you will need to add ~/.pyenv/bin to your $PATH, as that is where the pyenv executable lives.
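On a bash system, the relevant profile lines look like the following (a sketch; adjust the profile file for your shell, e.g. ~/.zshrc for zsh):

```shell
# Make the pyenv executable and its shims visible to the shell
echo 'export PATH="$HOME/.pyenv/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
```

Open a new shell (or source the profile) for the changes to take effect.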

Next, we have a Python script to install Snakemake:

#!/usr/bin/python3
import getpass
import os
import subprocess


def install_snakemake():
    user = getpass.getuser()
    if user == "root":
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install a conda-based Python via pyenv
        conda_version = "miniconda3-4.3.30"

        installcmd = ["pyenv", "install", conda_version]
        subprocess.call(installcmd)

        globalcmd = ["pyenv", "global", conda_version]
        subprocess.call(globalcmd)

        # ---------------------------
        # Install snakemake

        home = os.environ['HOME']
        condabin = home + "/.pyenv/shims/conda"

        subprocess.call([condabin, "update", "--yes", "conda"])

        subprocess.call([condabin, "config", "--add", "channels", "r"])
        subprocess.call([condabin, "config", "--add", "channels", "defaults"])
        subprocess.call([condabin, "config", "--add", "channels", "conda-forge"])
        subprocess.call([condabin, "config", "--add", "channels", "bioconda"])

        subprocess.call([condabin, "install", "--yes", "-c", "bioconda", "snakemake"])

        # ---------------------------
        # Install osf cli client

        pipbin = home + "/.pyenv/shims/pip"

        subprocess.call([pipbin, "install", "--upgrade", "pip"])
        subprocess.call([pipbin, "install", "--user", "osfclient"])


if __name__ == "__main__":
    install_snakemake()

Get Tutorial Files

Start by getting the files needed for the tutorial. Downloading and extracting the tarball creates a data/ folder and an environment.yaml file for conda:

wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1

Create the conda environment:

conda env create --name snakemake-tutorial --file environment.yaml

Now activate the conda environment:

source activate snakemake-tutorial

First Snakefile Rule

Create a Snakefile, and add the first rule:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

This rule maps a sample's reads against the reference genome with bwa mem and pipes the output through samtools view to produce a BAM file.

Running the First Rule

Unlike Makefiles, where you typically name a target, Snakefile rules are requested via the output files they produce. So, if we want to execute the rule bwa_map, we look at its output file:

rule bwa_map:
    ...
    output:
        "mapped_reads/{sample}.bam"

That means we run snakemake and ask for a concrete file matching mapped_reads/{sample}.bam (for example, mapped_reads/A.bam), which requires the corresponding input file data/samples/A.fastq to be in place.

The shell portion is what stitches the input and output files together. So basically, you say "I want this output file" and Snakemake back-calculates the task graph needed to obtain that output file.
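The wildcard matching can be sketched in plain Python (a toy illustration, not Snakemake's actual implementation): the output pattern becomes a regular expression, and a requested file binds the wildcards.

```python
import re

def match_output(pattern, target):
    """Turn a Snakemake-style output pattern into a regex and
    extract wildcard values from a requested target file.
    (Toy version: '.' in the pattern is left unescaped.)"""
    regex = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", pattern)
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

print(match_output("mapped_reads/{sample}.bam", "mapped_reads/A.bam"))
# {'sample': 'A'}
```

With the wildcard bound to A, Snakemake can substitute it into the rule's input files and shell command.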

To do a dry run of the workflow:

$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam

rule bwa_map:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped_reads/B.bam
    jobid: 0
    wildcards: sample=B

bwa mem data/genome.fa data/samples/B.fastq | samtools view -Sb - > mapped_reads/B.bam

rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 1
    wildcards: sample=A

bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
Job counts:
	count	jobs
	2	bwa_map
	2

This lists the tasks that are going to be executed: two input files lead to two separate bwa_map tasks. Just below the information about each rule, we can see the command Snakemake is going to execute.

Running a single rule:

snakemake -np mapped_reads/A.bam

Running two rules explicitly:

snakemake -np mapped_reads/A.bam mapped_reads/B.bam

Running two rules with bash convenience:

snakemake -np mapped_reads/{A,B}.bam 
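The {A,B} form is ordinary bash brace expansion, not a Snakemake feature; the shell expands it into two file names before snakemake ever sees them:

```shell
# bash expands the braces before running the command
echo mapped_reads/{A,B}.bam
```

Under bash this prints mapped_reads/A.bam mapped_reads/B.bam, which is exactly the explicit two-file invocation above.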

More Snakefile Rules

Now we add a second rule that depends on the first. Here are both rules:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

Note that the rules are chained together because the output of bwa_map matches the input of samtools_sort. Also note that the two strings in the shell command are not separated by a comma, so Python concatenates them into a single command string.
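The concatenation is ordinary Python behavior: adjacent string literals merge at parse time, which is a common way to split a long shell command across lines. A quick check:

```python
# Adjacent string literals (no comma between them) merge into one string
cmd = (
    "samtools sort -T sorted_reads/A "
    "-O bam mapped_reads/A.bam > sorted_reads/A.bam"
)
print(cmd)
```

Note the trailing space at the end of the first literal; without it, the two halves of the command would run together.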

Running the Second Rule

Again, Snakemake is focused on the output files you want, so in this case we will let sample be A and B, which requires us to have data/samples/A.fastq and data/samples/B.fastq in place. Then ask for the output of samtools_sort:

Dry run first:

snakemake -np sorted_reads/A.bam sorted_reads/B.bam

Real deal:

snakemake sorted_reads/A.bam sorted_reads/B.bam

Adding a Third Rule

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

rule samtools_index:
    input:
        "sorted_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"

Now we have three rules that will run sequentially (bwa_map -> samtools_sort -> samtools_index) so long as we ask for the output file of the last rule, samtools_index.

Running the Third Rule

To run the third rule:

snakemake sorted_reads/A.bam.bai

This requests the output file of samtools_index, which needs the input file sorted_reads/A.bam. That does not exist, so Snakemake looks for a rule with sorted_reads/A.bam as an output file: that's the samtools_sort rule, which requires the input file mapped_reads/A.bam. That file does not exist either, so Snakemake looks for the rule with mapped_reads/A.bam as an output file: that's bwa_map, which finally requires the input files data/genome.fa and data/samples/A.fastq.
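This back-chaining can be sketched as a small recursion (a toy model with a hypothetical rules table, not Snakemake's real API): given a requested file, find the rule that produces it, recurse on that rule's inputs, then schedule the rule.

```python
# Map each producible file to (rule name, list of input files) -
# hypothetical table mirroring the three rules in this tutorial.
rules = {
    "mapped_reads/A.bam": ("bwa_map", ["data/genome.fa", "data/samples/A.fastq"]),
    "sorted_reads/A.bam": ("samtools_sort", ["mapped_reads/A.bam"]),
    "sorted_reads/A.bam.bai": ("samtools_index", ["sorted_reads/A.bam"]),
}

def plan(target, schedule=None):
    if schedule is None:
        schedule = []
    if target in rules:                  # produced by a rule: resolve inputs first
        name, inputs = rules[target]
        for inp in inputs:
            plan(inp, schedule)
        if name not in schedule:
            schedule.append(name)
    return schedule                      # files not in `rules` must already exist on disk

print(plan("sorted_reads/A.bam.bai"))
# ['bwa_map', 'samtools_sort', 'samtools_index']
```

The resulting order matches the job list in the dry run below: upstream rules run first, and raw data files terminate the recursion.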

$ snakemake -np sorted_reads/A.bam.bai

rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 2
    wildcards: sample=A

bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam

rule samtools_sort:
    input: mapped_reads/A.bam
    output: sorted_reads/A.bam
    jobid: 1
    wildcards: sample=A

samtools sort -T sorted_reads/A -O bam mapped_reads/A.bam > sorted_reads/A.bam

rule samtools_index:
    input: sorted_reads/A.bam
    output: sorted_reads/A.bam.bai
    jobid: 0
    wildcards: sample=A

samtools index sorted_reads/A.bam
Job counts:
	count	jobs
	1	bwa_map
	1	samtools_index
	1	samtools_sort
	3

More Complicated Operations

Expanding Patterns to Lists of Files

Suppose we want to write a rule whose input and output files are generated by expanding a wildcard expression into a concrete list of files.

As an example:

Start by defining a list of variables at the top of the file:

SAMPLES = ["A", "B"]

Now, use the expand function, passing a string with a wildcard placeholder and a keyword argument supplying the values to substitute:

expand("sorted_reads/{sample}.bam", sample=SAMPLES)
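The expand call simply produces a Python list of file names. A minimal stand-in (a sketch that handles only the single-keyword case; Snakemake's real expand is more general) shows the effect:

```python
# Toy version of snakemake's expand(): substitute each value of the
# single keyword argument into the pattern, yielding a list of paths.
def expand(pattern, **wildcards):
    (name, values), = wildcards.items()
    return [pattern.format(**{name: v}) for v in values]

SAMPLES = ["A", "B"]
print(expand("sorted_reads/{sample}.bam", sample=SAMPLES))
# ['sorted_reads/A.bam', 'sorted_reads/B.bam']
```

That list can then be used directly as a rule's input, as in the rule below.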

Example Expand Rule

Here is an example expand rule:

rule bcftools_call:
    input:
        fa="data/genome.fa",
        bam=expand("sorted_reads/{sample}.bam", sample=SAMPLES),
        bai=expand("sorted_reads/{sample}.bam.bai", sample=SAMPLES)
    output:
        "calls/all.vcf"
    shell:
        "samtools mpileup -g -f {input.fa} {input.bam} | "
        "bcftools call -mv - > {output}"

Flags