
Snakemake patterns for Snakefiles and complex workflows.

Also, there is a repo of awesome examples here: https://percyfal.github.io/snakemake-rules/docs/configuration.html

Sandwich-making example: https://github.com/leipzig/SandwichesWithSnakemake

Creating a master rule

If Snakemake is not provided with a rule to execute, it will execute the first rule in the file.

To create a master rule that will call other rules defined in the Snakefile (the equivalent of make's "all" or "default" target), define it first. You can name it whatever you would like:

mynames   = ['A','B','C']
myinputs  = [j+".txt" for j in mynames]
myoutputs = [j+".dodat" for j in mynames]

rule default:
    input:
        myoutputs
    shell:
        'echo "i just run subrules!"'

This rule requires, as inputs, the variable myoutputs, which contains all of the final output files that the Snakemake process should generate. This means we need a rule somewhere below whose outputs are A.dodat, B.dodat, and C.dodat:

rule cascade:
    output:
        '{name}.dodat'
    input:
        '{name}.file_from_previous_step'
    shell:
        'cp {input} {output}'

and so on...
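
For example, a hypothetical upstream rule could produce the .file_from_previous_step intermediates from the original .txt inputs and complete the chain (the rule name and the cp command are just placeholders):

rule previous_step:
    output:
        '{name}.file_from_previous_step'
    input:
        '{name}.txt'
    shell:
        'cp {input} {output}'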

Now, to run this default rule, just run snakemake with no arguments:

$ snakemake

This will run the first rule in the Snakefile, which is default. The default rule itself does nothing, but it requires as inputs the final outputs of the entire process, so Snakemake assembles and runs the chain of rules needed to produce those files.

Creating rules without target files

Normally you run a rule by asking Snakemake for one of its output files. If you create a rule that has no output files associated with it, like the one below (clean):

rule clean:
    shell:
        'rm -f *.dodat *.int*'

you can run it by passing the rule name to snakemake:

$ snakemake clean

Dealing with complicated input/output lists

If we have a complicated list of input or output files, or need to pass in a few parameters and use them to build such a list, we can write a function that returns the list of files and call it in the input and output sections:

import os

data_dir = "data"

def MKOUT(data_dir, fastq_files):
    """Return a list of paths, preserving Snakemake's {} wildcard patterns"""
    result = []
    for fastq in fastq_files:
        p = os.path.join(data_dir, fastq)
        result.append(p)
    return result


rule link_raw_data:
    """imaginary scenario: complicated list of output files"""
    output:
        MKOUT(data_dir, ["{lib}_{rep}.fastq.gz", "{lib}_{rep}_small.fastq.gz"] ),
    params:
        directory = data_dir,
    message:
        "Making link to raw data {output}."
    shell:
        """
        do_stuff {params.directory}
        """

Executing commands in a temporary directory

To keep track of which directory a rule's commands run in, use the following pattern in your Snakemake rules:

rule link_raw_data:
    output:
        output_file
    params:
        directory = data_dir,
        shell_command = lib2data,
    message:
        "Making link to raw data {output}."
    shell:
        """
        (
        cd {params.directory}
        {params.shell_command}
        )
        """

Mixing Python and Shell Commands

To mix Python and shell code, embed a shell command inside a run directive:

rule all:
    input: 'Salmonella-enterica.vcf'

rule download_sra:
    output: '{specie}.sra'
    run:
        readsID = species[wildcards.specie]['readsID']
        shell('''
            bionode-ncbi download sra {readsID};
            cp {readsID}/*.sra {output} && rm -rf {readsID};
        ''')

rule call:
    input: '{specie}.sra'
    output: '{specie}.vcf'
    shell: 'magic {input} > {output}'
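
Note that the example above assumes a species dictionary defined elsewhere in the Snakefile. A smaller, self-contained sketch of the same run-plus-shell() pattern (the file names are made up): local Python variables are available for formatting inside the shell() call, alongside input and output:

rule count_lines:
    input: 'reads.txt'
    output: 'reads.count'
    run:
        # ordinary Python runs here
        label = input[0].replace('.txt', '').upper()
        # shell() formats {label}, {input}, {output} from this scope
        shell('echo "{label}" > {output}')
        shell('wc -l < {input} >> {output}')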

Dealing with Paths

One of the issues I ran into with Snakemake was that the input, output, and shell/run blocks all refer to the same filename, but each one expects it to be formatted differently. Furthermore, you may have the same filename under different paths.

This problem is a big pain in the ass, and Snakemake does not help you solve it. It clutters the Snakefile badly, and almost makes it worth looking for a different build tool. But we're stuck with Snakemake.

Here is an example rule that takes a single output file name and turns it into the many variations that Snakemake wants:

# Need PWD for Docker
PWD = os.getcwd()

# Settings common to all 
# taxonomic classification workflows
include: 'taxclass.settings'


...clipped...


# temporary directories
kaiju_dirname = 'kaijudb'
kaiju_dir = os.path.join(data_dir,kaiju_dirname)

# kaiju database files
kaiju_dmp = 'nodes.dmp'
kaiju_fmi = 'kaiju_db_nr_euk.fmi'
kaiju_tar = 'kaiju_index_nr_euk.tgz'
kaiju_url = 'http://kaiju.binf.ku.dk/database'

# inputs
kaiju_input_names = [kaiju_dmp, kaiju_fmi]
run_kaiju_input = [os.path.join(kaiju_dir,f) for f in kaiju_input_names]
run_kaiju_input += [os.path.join(data_dir,f) for f in fq_names]

# outputs
kaiju_output_name = '{base}.kaiju_output.trim{ntrim}.out'
kaiju_output_name_wc = '{wildcards.base}.kaiju_output.trim{wildcards.ntrim}.out'
run_kaiju_output = os.path.join(data_dir,kaiju_output_name)

quayurl = config['kaiju']['quayurl'] + ":" + config['kaiju']['version']

rule run_kaiju:
    """
    Run kaiju
    """
    input:
        run_kaiju_input
    output:
        run_kaiju_output
    shell:
        '''
        docker run \
                -v {PWD}/{data_dir}:/data \
                {quayurl} \
                kaiju \
                -x \
                -v \
                -t /data/{kaiju_dir}/{kaiju_dmp} \
                -f /data/{kaiju_dir}/{kaiju_fmi} \
                -i /data/{fq_fwd_wc} \
                -j /data/{fq_rev_wc} \
                -o /data/{kaiju_output_name_wc} \
                -z 4
        '''

The strategy here is:

  • Assemble all of the variables we need BEFORE THE RULE IS REACHED, which keeps the rules themselves simple (an alternative using params functions is sketched below)
  • Use the PWD variable (set from os.getcwd()) to pass Docker an absolute path when mounting directories inside the container
  • Move anything the user might change into a .settings file, which keeps the main Snakefile as uncluttered as possible
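
An alternative that cuts down on the duplicated *_wc name variables is to compute the wildcard-expanded pieces inside params, since a params entry can be a function of the wildcards. This is only a sketch (the kaiju command is trimmed down to show the idea):

rule run_kaiju_alt:
    input:
        run_kaiju_input
    output:
        run_kaiju_output
    params:
        # a params entry may be a function of the wildcards,
        # replacing the separate kaiju_output_name_wc variable
        outname = lambda wildcards: kaiju_output_name.format(
            base=wildcards.base, ntrim=wildcards.ntrim)
    shell:
        '''
        docker run \
                -v {PWD}/{data_dir}:/data \
                {quayurl} \
                kaiju \
                -o /data/{params.outname} \
                -z 4
        '''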

Testing Snakefiles

If you are testing a long and complicated workflow and already have the intermediate files, you can test pieces of the Snakemake workflow without re-running everything from scratch.

However, Snakemake keeps track of which jobs are incomplete. If you cancel a job, Snakemake will mark its output as incomplete, and you may need the following syntax to clear that flag:

snakemake --cleanup-metadata <target>
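
A few other flags are handy when testing against existing intermediate files (dry-running the DAG, printing the shell commands, or marking existing outputs as up to date):

$ snakemake -n <target>                    # dry run: show what would be executed
$ snakemake -n -p <target>                 # dry run, also printing the shell commands
$ snakemake --touch <target>               # mark existing outputs as up to date
$ snakemake --rerun-incomplete <target>    # re-run jobs whose outputs are marked incomplete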

Philosophy of Snakemake

So far as we have developed it, here is our philosophy of Snakemake:

One Subtask, One Subdirectory: subtasks are organized into their own directories.

  • In this application, the subtasks all look the same and use the same program.
  • In real applications, each subtask would correspond to a different program or workflow step.

Atomic Tasks: Each subtask consists of a few atomic tasks (simple algebra operations).

  • The atomic tasks are carried out.
  • The subtask aggregates these into a final result.
  • This makes workflows more flexible and modular.

Aggregation:

  • Each subtask aggregates the results of the atomic tasks (e.g., a sum or product).
  • Likewise, the final master task aggregates the results of each subtask.

In practice, we do the following:

Subtask Snakemake Rules:

  • Each subtask directory contains a Snakemake rule file that defines a master rule for that subtask and specifies what that subtask does.

Snakemake Configuration Files:

  • Each subtask directory contains a Snakemake config file that defines defaults (which the user can change).
  • If there are multiple rule files, settings specific to a particular rule file can also be defined in that rule file (see the sketch below).
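
A hedged sketch of how these pieces fit together at the top of a main Snakefile (the file names are illustrative):

# main Snakefile
configfile: 'config.yaml'            # user-editable defaults
include: 'rules/subtask_a.rules'     # one rule file per subtask
include: 'rules/subtask_b.rules'

# values from the config file are available via the config dict,
# e.g. as used for the kaiju container above
quayurl = config['kaiju']['quayurl'] + ":" + config['kaiju']['version']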

The System - How to Grok Snakemake

To grok Snakemake, think about the Snakemake file as follows:

  • Each time you see a variable in brackets {} appearing in some shell command, think of those brackets as "opening" a Python scope.
  • Any variables inside the brackets must be Python variables.
  • For example, {wildcards.name} means there is a Python variable named wildcards that has an attribute called name. {output} means there is a Python variable named output. Etc.
  • If you need to access inputs or outputs by index, treat the input variable like any Python variable: {input[0]} or {input[1]}
  • The shell and run directives have access to the same scope, so if you can use "{output}" in a shell block, you can use the variable output in the corresponding run block (see the example below).
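
For example, these two rules do the same thing; the shell version formats {output} into a command string, while the run version uses output directly as a Python variable:

rule touch_shell:
    output: 'done_shell.flag'
    shell: 'touch {output}'

rule touch_run:
    output: 'done_run.flag'
    run:
        # output is an ordinary Python variable here
        with open(output[0], 'w') as f:
            f.write('done\n')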

Flags