Snakemake/Patterns: Difference between revisions

Revision as of 08:20, 2 March 2018

Snakemake patterns for Snakefiles and complex workflows.

Also, repo of awesome examples here: https://percyfal.github.io/snakemake-rules/docs/configuration.html

Sandwich-making example: https://github.com/leipzig/SandwichesWithSnakemake

Creating a master rule

If Snakemake is not provided with a rule to execute, it will execute the first rule in the file.

To create a master rule that will call other rules defined in the Snakemake file, the make equivalent of "all" or "default", you should define it first (you can name it whatever you would like):

mynames   = ['A','B','C']
myinputs  = [j+".txt" for j in mynames]
myoutputs = [j+".dodat" for j in mynames]

rule default:
    input:
        myoutputs
    shell:
        'echo "i just run subrules!"'

This rule requires, as inputs, the variable myoutputs, which is going to contain all of the final output files that the Snakemake process should generate. This means we need to have a rule somewhere below with an output of A.dodat, B.dodat, and C.dodat:

rule cascade:
    output:
        '{name}.dodat'
    input:
        '{name}.file_from_previous_step'
    shell:
        'cp {input} {output}'

and so on...

Now, to run this default rule, just run snakemake with no arguments:

$ snakemake

this will run the first rule in the Snakefile, which is default. The default rule does nothing, but requires as inputs the final output of the entire process. Snakemake will assemble the rules that those final output files depend on.

Creating rules without target files

Normally you run a rule by specifying the output file of a rule. If you create a rule that has no output files associated with it, like the one below (clean):

rule clean:
    shell:
        'rm -f *.dodat *.int*'

you can run it by passing the rule name to snakemake:

$ snakemake clean

Dealing with complicated input/output lists

If we have a complicated list of input and output files, or need to pass in a few parameters and use those to build a list of input or output files, we can write a function to return the list of files, and call it in the inputs and outputs list:

data_dir = "data"

def MKOUT(data_dir, fastq_files):
    """Returns a string that uses Snakemake's {} mini language"""
    result = []
    for fastq in fastq_files:
        p = os.path.join(data_dir,fastq)
        result.append(p)
    return result


rule link_raw_data:
    """imaginary scenario: complicated list of output files"""
    output:
        MKOUT(data_dir, ["{lib}_{rep}.fastq.gz", "{lib}_{rep}_small.fastq.gz"] ),
    params:
        directory = data_dir,
    message:
        "Making link to raw data {output}."
    shell:
        """
        do_stuff {params.directory}
        """

Executing commands in a temporary directory

To stay organized with which directory you are running various commands in, use the following pattern for make rules:

rule link_raw_data:
    output:
        output_file
    params:
        directory = data_dir,
        shell_command = lib2data,
    message:
        "Making link to raw data {output}."
    shell:
        """
        (
        cd {params.directory}
        {params.shell_command}
        )
        """

Mixing Python and Shell Commands

To mix Python and shell code, embed a shell command inside a run directive:

rule all:
    input: 'Salmonella-enterica.vcf'

rule download_sra:
    output: '{specie}.sra'
    run:
        readsID = species[wildcards.specie]['readsID']
        shell('''
            bionode-ncbi download sra {readsID};
            cp {readsID}/*.sra {output} && rm -rf {readsID};
        ''')

rule call:
    input: '{specie}.sra'
    output: '{specie}.vcf'
    shell: 'magic {input} > {output}'

Running Docker

One of the issues is that Snakemake likes to think relative, but Docker needs absolute paths (particularly when specifying a directory to mount in the container as a volume).

Solution (not tested): store PWD using os.get_cwd(), then use {PWD} in the Docker command

Philosophy of Snakemake

So far as we have developed it, here is our philosophy of Snakemake:

One Subtask, One Subdirectory: subtasks are organized into their own directories.

In this application, the subtasks all look the same and use the same program.
In real applications, each subtask would correspond to a different program or workflow step.

Atomic Tasks: Each subtask consists of a few atomic tasks (simple algebra operations).

The atomic tasks are carried out.
The subtask aggregates these into a final result.
This makes workflows more flexible and modular.

Aggregation:

Each subtask aggregates the results of the atomic tasks (e.g., a sum or product).
Likewise, the final master task aggregates the results of each subtask.

Philosophy in Practice

In terms of how this translates for Snakemake files, here's how the organization looks:

Subtask Snakemake Rules:

Each subtask directory contains a Snakemake rule file that defines a master rule for that subtask and specifies what that subtask does.

Snakemake Configuration Files:

Each subtask directory contains a Snakemake config file that defines defaults (which the user can change).
If there are multiple rules, rules specific to a rule file can also be defined in that file.

The System - How to Grok Snakemake

To grok Snakemake, think about the Snakemake file as follows:

Each time you see a variable in brackets {} appearing in some shell command, think of those brackets as "opening" a Python scope.
Any variables inside the brackets must be python variables.
For example, {wildcard.name} means there is a Python variable named wildcard, that has an attribute called name. {output} means there is a Python variable named output. Etc.
The shell and run directives have access to the same scope, and therefore if you can call the variable "{output}" in a shell block, you can call the variable "output" in the corresponding run block.

Flags

@@ Line 169: / Line 169: @@
 * Each subtask directory contains a Snakemake config file that defines defaults (which the user can change).
 * If there are multiple rules, rules specific to a rule file can also be defined in that file.
+==The System - How to Grok Snakemake==
+To grok Snakemake, think about the Snakemake file as follows:
+* Each time you see a variable in brackets {} appearing in some shell command, think of those brackets as "opening" a Python scope.
+* Any variables inside the brackets must be python variables.
+* For example, {wildcard.name} means there is a Python variable named wildcard, that has an attribute called name. {output} means there is a Python variable named output. Etc.
+* The shell and run directives have access to the same scope, and therefore if you can call the variable "{output}" in a shell block, you can call the variable "output" in the corresponding run block.
 ==Flags==