Configuration¶

Most configuration for P3 is performed by editing config.yaml, a text file using YAML syntax.

See the Example config.yaml below for a complete file that shows how the parts relate. First, this page describes each section of the config file and what it does.

Config sections¶

prefix¶

Example:

prefix: example_data

This entry sets the directory prefix available to any output filename (output filenames are described under features below). For example, in the Example config.yaml, the output filenames for the features start with the {prefix} placeholder.

It is convenient to set this to a directory containing a subset of the data (but with the same filenames as real data) while developing the pipeline, and then set it to the full data set once everything is working.
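
The placeholder substitution can be pictured with a minimal Python sketch. It assumes the config has already been parsed into a dictionary; the `fill_prefix` helper and the `/data/full` directory are hypothetical and only illustrate how switching the prefix changes every output path at once. It is not P3's actual implementation.

```python
# Stand-in for the parsed config.yaml (assumption: loaded into a plain dict).
config = {
    "prefix": "example_data",
    "features": {
        "exome_variants": {
            "snakefile": "features/variants.snakefile",
            "output": "{prefix}/filtered/exome_variants/exome_variants_by_gene.tab",
        }
    },
}


def fill_prefix(template, prefix):
    """Replace the {prefix} placeholder in an output filename template."""
    return template.format(prefix=prefix)


template = config["features"]["exome_variants"]["output"]

# Path while developing against the subset of the data.
subset_path = fill_prefix(template, config["prefix"])

# Path after pointing the prefix at the full data set (hypothetical directory).
full_path = fill_prefix(template, "/data/full")

print(subset_path)
print(full_path)
```

Because only the prefix differs between the two paths, the rest of the pipeline does not need to change when moving from the subset to the full data.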

features_to_use¶

Example:

features_to_use: [cnv, exome_variants]

This entry selects the feature sets to use for a particular run. The items in the list must correspond to feature sets defined in the features section. Note that if one feature set depends on another, they must both be included in this list.
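
A quick consistency check along these lines can catch a misspelled or undefined feature set before a run. The sketch below is an illustration only: `undefined_features` is a hypothetical helper, and the config is assumed to be already parsed into a dictionary.

```python
def undefined_features(config):
    """Return names listed in features_to_use that have no entry in features."""
    defined = config.get("features", {})
    return [name for name in config.get("features_to_use", []) if name not in defined]


# Stand-in for a parsed config.yaml in which "go" was never defined.
config = {
    "features_to_use": ["cnv", "exome_variants", "go"],
    "features": {
        "cnv": {"snakefile": "features/cnv.snakefile"},
        "exome_variants": {"snakefile": "features/variants.snakefile"},
    },
}

missing = undefined_features(config)
if missing:
    print("Not defined in the features section:", ", ".join(missing))
```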

samples¶

Example:

samples: celllines.txt

This entry specifies a filename containing sample IDs to include in the analysis. It is a text file with one sample name per line.

The final set of features is subset to include only data from the sample IDs listed in this file.
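
As a rough illustration of that subsetting step, the sketch below keeps only the requested sample columns of a small table. It assumes a tab-delimited layout with features as rows, samples as columns, and the feature ID in the first column; the `subset_columns` helper and the sample names are hypothetical, and the real pipeline may do this differently.

```python
import csv
import io


def subset_columns(table_text, samples, delimiter="\t"):
    """Keep the first (feature ID) column plus the columns named in `samples`."""
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    header = next(reader)
    wanted = set(samples)
    # Column 0 holds the feature IDs and is always kept.
    keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
    rows = [[header[i] for i in keep]]
    for row in reader:
        rows.append([row[i] for i in keep])
    return rows


# Toy feature table: rows are features, columns are samples.
table = "feature\tcellA\tcellB\tcellC\ngene1\t1\t2\t3\ngene2\t4\t5\t6\n"
subset = subset_columns(table, ["cellA", "cellC"])
for row in subset:
    print("\t".join(row))
```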

The runall.snakefile workflow parses this file into a list of sample IDs so that each child workflow can access the list. This is especially convenient when sample IDs are encoded in filenames and you want to collect all files for all samples.
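
The idea can be sketched in plain Python as follows. This is not the code in runall.snakefile; `read_samples` and `files_for_samples` are hypothetical helpers, and the `{prefix}/raw/{sample}.txt` layout is an assumed filename scheme chosen only for the example.

```python
import os
import tempfile


def read_samples(path):
    """Return the non-empty lines of a one-ID-per-line text file."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def files_for_samples(pattern, samples, prefix):
    """Expand a filename pattern once per sample ID."""
    return [pattern.format(prefix=prefix, sample=s) for s in samples]


# Demonstration with a throwaway file standing in for celllines.txt.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "celllines.txt")
    with open(path, "w") as handle:
        handle.write("cellA\ncellB\n\ncellC\n")
    samples = read_samples(path)

# Hypothetical layout in which each sample has its own input file.
inputs = files_for_samples("{prefix}/raw/{sample}.txt", samples, "example_data")
print(inputs)
```

Blank lines in the sample file are skipped here so that a trailing newline does not produce an empty sample ID.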

Rscript¶

Example:

Rscript: /usr/bin/Rscript

This entry sets the path to Rscript. Within Snakemake rules, an R script is then called using the {Rscript} placeholder, which is filled in with the path defined here:

rule example_r:
    input: '{prefix}/input_data.txt'
    output: '{prefix}/output_data.txt'
    shell:
        "{Rscript} myscript.R {input} {output}"

features¶

Example:

features:
    cnv:
        snakefile: features/cnv.snakefile
        output:
            clusters: "{prefix}/filtered/cnv/cluster_scores.tab"
            max_gene: "{prefix}/filtered/cnv/cnv_gene_max_scores.tab"
            longest_gene: "{prefix}/filtered/cnv/cnv_gene_longest_overlap_scores.tab"

    exome_variants:
        snakefile: features/variants.snakefile
        output: "{prefix}/filtered/exome_variants/exome_variants_by_gene.tab"

This is where new feature sets are defined. In this example, there are two feature sets with the labels cnv and exome_variants. Under each label are two fields: snakefile and output.

snakefile:	This field specifies the path, relative to the `config.yaml` file, to the Snakemake workflow that creates these features. The snakefile can do whatever is needed to create the output file(s).
output:	This field is either a single string (as in `exome_variants`) or a dictionary of multiple output files, each with a unique name (as in `cnv`). These output files can use the `{prefix}` placeholder, which is filled in with the value of the prefix field. The snakefile specified for this feature set is expected to create these files.
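
Because output can take either form, code that consumes the config has to handle both. The following is a minimal sketch under stated assumptions: the config is already parsed into a dictionary, `selected_outputs` is a hypothetical helper, and using `None` as the label for the single-string form is a choice made for this example rather than anything P3 prescribes.

```python
def selected_outputs(config):
    """Return {(feature, label): path} for each feature set in features_to_use."""
    prefix = config["prefix"]
    result = {}
    for name in config["features_to_use"]:
        output = config["features"][name]["output"]
        # Normalise the single-string form to a one-item dictionary.
        if isinstance(output, str):
            output = {None: output}
        for label, template in output.items():
            result[(name, label)] = template.format(prefix=prefix)
    return result


# Stand-in for a parsed config.yaml with one dictionary and one string output.
config = {
    "prefix": "example_data",
    "features_to_use": ["cnv", "exome_variants"],
    "features": {
        "cnv": {
            "snakefile": "features/cnv.snakefile",
            "output": {
                "clusters": "{prefix}/filtered/cnv/cluster_scores.tab",
                "max_gene": "{prefix}/filtered/cnv/cnv_gene_max_scores.tab",
            },
        },
        "exome_variants": {
            "snakefile": "features/variants.snakefile",
            "output": "{prefix}/filtered/exome_variants/exome_variants_by_gene.tab",
        },
    },
}

outputs = selected_outputs(config)
for key, path in outputs.items():
    print(key, path)
```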

Note

Most of the effort in adding a new feature set is in writing the actual snakefile that does the work. See Pipeline design for more on this.

Example config.yaml¶

This is the config file used to run the example data.

# Top-level data dir
prefix: example_data

# List of feature sets to use while training. Anything selected here must be
# configured below.
features_to_use: [go, normed_counts, zscores, msigdb, cpdb, exome_variants, cnv]

samples: celllines.txt

# Explicitly defines path to Rscript to avoid conflict with other existing
# installations
Rscript: Rscript

# Each set of features:
#   - has a unique name that can be used in the "features_to_use" list
#
#   - has one or more output files, which must be of the general format:
#       - rows = features
#       - columns = samples
#
#   - has a snakefile responsible for creating the output file.
#       - paths in these snakefiles should be relative to the runall.snakefile
#         since they are included verbatim.
#       - interdependencies between feature snakefiles can be handled by using
#         the snakemake directive "include:"
features:
    cnv:
        snakefile: features/cnv.snakefile
        output:
            clusters: "{prefix}/filtered/cnv/cluster_scores.tab"
            max_gene: "{prefix}/filtered/cnv/cnv_gene_max_scores.tab"
            longest_gene: "{prefix}/filtered/cnv/cnv_gene_longest_overlap_scores.tab"

    exome_variants:
        snakefile: features/variants.snakefile
        output: "{prefix}/filtered/exome_variants/exome_variants_by_gene.tab"

    msigdb:
        snakefile: features/msigdb.snakefile
        output:
            zscores: "{prefix}/filtered/msigdb/msigdb_zscores.csv"
            variants: "{prefix}/filtered/msigdb/msigdb_variants.csv"
    go:
        snakefile: features/gene_ontology.snakefile
        output:
            zscores: "{prefix}/filtered/go/go_zscores.csv"
            variants: "{prefix}/filtered/go/go_variants.csv"

    cpdb:
        snakefile: features/cpdb.snakefile
        output:
            zscores: "{prefix}/filtered/consensus_pathway/cpdb_zscores.csv"
            variants: "{prefix}/filtered/consensus_pathway/cpdb_variants.csv"

    normed_counts:
        snakefile: features/normed_counts.snakefile
        output: "{prefix}/filtered/rnaseq_expression/counts_matrix_normalized.csv"

    zscores:
        snakefile: features/zscores.snakefile
        output:
            zscores: "{prefix}/filtered/rnaseq_expression/zscores.csv"
            zscore_estimates: "{prefix}/filtered/rnaseq_expression/zscore_estimates.csv"
