Revision as of 05:22, 22 February 2018

Main repository:

Start

This tutorial is assuming you start out on a machine with a conda-based python distribution (Anaconda or Miniconda or other).

Installing Conda with Pyenv

If you don't have a version of conda, it is recommended you use Pyenv to manage versions of python.

A very simple pyenv installation script:

install_pyenv.py

#!/usr/bin/python3
import getpass
import subprocess


def install_pyenv():
    user = getpass.getuser()
    if(user=="root"):
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install pyenv 
        pyenvcmd = ["curl","-L","https://raw.githubusercontent.com/pyenv/pyenv-installer/master/bin/pyenv-installer","|","/bin/bash"]
        subprocess.call(pyenvcmd, shell=True)

        # We don't need to add ~/.pyenv/bin to $PATH,
        # it is already done.

if __name__=="__main__":
    install_pyenv()

As noted in the script, you will need to add ~/.pyenv/bin to the path, as that is where the Pyenv versions of Python live.

Next, we have a python script to install snakemake:

#!/usr/bin/python3
import getpass
import tempfile
import subprocess


def install_pyenv():
    user = getpass.getuser()
    if(user=="root"):
        raise Exception("You are root - you should run this script as a normal user.")
    else:
        # Install snakemake
        conda_version = "miniconda3-4.3.30"

        installcmd = ["pyenv","install",conda_version]
        subprocess.call(installcmd)
        
        globalcmd = ["pyenv","global",conda_version]
        subprocess.call(globalcmd)

        # ---------------------------
        # Install snakemake

        pyenvbin = os.environ['HOME']
        condabin = pyenvbin+"/.pyenv/shims/conda"

        subprocess.call([condabin,"update"])

        subprocess.call([condabin,"config","--all","channels","r"])
        subprocess.call([condabin,"config","--all","channels","default"])
        subprocess.call([condabin,"config","--all","channels","conda-forge"])
        subprocess.call([condabin,"config","--all","channels","bioconda"])

        subprocess.call([condabin,"install","--yes","-c","bioconda","snakemake"])

        # ---------------------------
        # Install osf cli client
        
        pyenvbin = os.environ['HOME']
        pipbin = pyenvbin+"/.pyenv/shims/pip"

        subprocess.call([pipbin,"install","--upgrade","pip"])
        subprocess.call([pipbin,"install","--user","osfclient"])


if __name__=="__main__":
    install_pyenv()

Get Tutorial Files

Start by getting the files needed for the tutorial:

wget https://bitbucket.org/snakemake/snakemake-tutorial/get/v3.11.0.tar.bz2
tar -xf v3.11.0.tar.bz2 --strip 1

Create the conda environment:

conda env create --name snakemake-tutorial --file environment.yaml

Now activate the conda environment:

source activate snakemake-tutorial

First Snakefile Rule

Create a Snakefile, and add the first rule:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

This creates a folder with data in it, and an environment.yaml file for conda.

Executing First Rule

Unlike with Makefiles, Snakefile rules are executed based on their output files. So, if we want to execute the rule bwa_map, we look at the output file:

rule bwa_map:
    ...
    output:
        "mapped_reads/{sample}.bam"

That means we have to run snakefile and ask for mapped_reads/{sample}.bam, and this requires the input file data/samples/{sample}.fastq to be in place.

The shell portion is what stitches the input and output files together. So basically, you say "I want this output file" and Snakemake back-calculates the task graph needed to obtain that output file.

To do a dry run of the workflow:

$ snakemake -np mapped_reads/A.bam mapped_reads/B.bam

rule bwa_map:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped_reads/B.bam
    jobid: 0
    wildcards: sample=B

bwa mem data/genome.fa data/samples/B.fastq | samtools view -Sb - > mapped_reads/B.bam

rule bwa_map:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped_reads/A.bam
    jobid: 1
    wildcards: sample=A

bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
Job counts:
	count	jobs
	2	bwa_map
	2

This explains the tasks that are going to be executed: two input files leads to two separate bwa_map tasks. We can see the commands that Snakemake is going to execute just below information about each rule.

Running a single rule:

snakemake -np mapped_reads/A.bam

Rnning two rules explicitly:

snakemake -np mapped_reads/A.bam mapped_reads/B.bam

Running two rules with bash convenience:

snakemake -np mapped_reads/{A,B}.bam

More Snakefile Rules

Now we add a second rule that depends on the first. Here are both rules:

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"

rule samtools_sort:
    input:
        "mapped_reads/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "samtools sort -T sorted_reads/{wildcards.sample} "
        "-O bam {input} > {output}"

Note that the rules are chained together because the output of bwa_map is the same as the input of samtools_sort. Also note that the two strings in the shell command are not separated by a comma so they are concatenated.

Again, Snakemake is focused on the output files you want, so in this case we will let sample be A and B, which requires us to have data/samples/A.fastq and data/samples/B.fastq in place. Then ask for the output of samtools_sort:

Dry run first:

snakemake -np sorted_reads/A.bam sorted_reads/B.bam

Real deal:

snakemake sorted_reads/A.bam sorted_reads/B.bam

Flags

Snakemake/DahakFlot: Difference between revisions

From charlesreid1