phyloP Tutorial

A command-line program within PHAST that computes conservation or acceleration p-values based on an alignment and a model of neutral evolution using independent hypothesis tests (-log p-values) at individual nucleotides.


Download and compile PHAST


To run phyloP, users must either download the PHAST binaries or compile PHAST from source.


        PHAST binaries can be downloaded by clicking the appropriate Windows, MacOSX or Linux icon on the PHAST website.

        PHAST source can be downloaded by clicking the Source icon on the PHAST website or from the PHAST GitHub repository.

        For complete instructions on how to compile PHAST from source, please visit Quick Start - Installing PHAST.


Generate neutral model file (phyloFit)


A phylogenetic model in .mod format is required by phyloP. This can be generated using phyloFit from the PHAST package.

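For example, phyloFit can fit a phylogenetic model (neutralmodel.mod) given an alignment file and a tree topology. The --out-root value sets the base name of the output file, so "neutralmodel" produces neutralmodel.mod:

     phyloFit --tree "((galGal2,((rn3,mm5),fr1)),hg17)" --subst-mod REV --out-root neutralmodel alignment.maf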

Detailed instructions for using phyloFit and information on its various options can be found in the phyloFit Tutorial.


Usage and options for phyloP


Required input files

● An alignment file in one of the following formats - MAF, FASTA, PHYLIP, MPM, SS

● A phylogenetic model produced by phyloFit in .mod format.

The program can be run with a command of the form:

    phyloP [OPTIONS] neutralmodel.mod [alignment] > out
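The command form above can be wrapped in a small shell function that checks both input files before calling phyloP. This is only a sketch: run_phylop and the PHYLOP variable are hypothetical names introduced here for illustration and are not part of PHAST.

```shell
# Hypothetical helper (not part of PHAST): check the inputs, then call phyloP.
# Set PHYLOP=echo for a dry run that only prints the arguments.
# Usage: run_phylop MODEL ALIGNMENT [phyloP options...]
run_phylop() {
    model=$1
    alignment=$2
    shift 2
    for f in "$model" "$alignment"; do
        if [ ! -r "$f" ]; then
            echo "run_phylop: cannot read input file: $f" >&2
            return 1
        fi
    done
    # Options first, then the model, then the alignment, as in the usage line.
    "${PHYLOP:-phyloP}" "$@" "$model" "$alignment"
}
```

For example, "run_phylop neutralmodel.mod alignment.maf --wig-scores > out" would run the same command as the form above, but stop early with an error message if either file is missing.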


Here we present commonly used options for running phyloP.


Option Description
--method

The method used to compute p-values or conservation/acceleration scores.

Methods available - SPH, LRT, SCORE, GERP

The default method is SPH. LRT (likelihood ratio test) and SCORE (score test) compare an alternative model having a free scale parameter with the given neutral model.

The GERP-like method (GERP) estimates the number of "rejected substitutions" per base by comparing the (per-site) maximum likelihood expected number of substitutions with the expected number under the neutral model.

LRT, SCORE and GERP can be used only with --base-by-base, --wig-scores or --features.

--mode

The mode used to compute p-values.

Can be used with --base-by-base, --wig-scores or --features.

Modes available - CON, ACC, NNEUT, CONACC

CON (the default) computes one-sided p-values so that small p (large -log p) indicates unexpected conservation; ACC does the same for unexpected acceleration. NNEUT computes two-sided p-values such that small p indicates an unexpected departure from neutrality. CONACC uses positive values (p-values or scores) to indicate conservation and negative values to indicate acceleration.

--wig-scores

Compute separate p-values per site, and then compute site-specific conservation (acceleration) scores as -log(p).

Output base-by-base scores in fixed-step wig format, using the coordinate system of the reference sequence.

--features

Read features from a file (GFF or BED format) and output a table of p-values and related statistics with one row per feature.

The features are assumed to use the coordinate frame of the first sequence in the alignment.

--subtree

Partition the tree into the subtree beneath the node whose name is given and the complementary supertree, and consider conservation/acceleration in the subtree given the supertree.

The branch above the specified node is included with the subtree.

--branch

Like subtree, but partitions the tree into the set of named branches (each named by its child node), and all the remaining branches.

Then tests for conservation/acceleration in the set of named branches relative to the others.
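The restriction on --method described in the options above can be checked before launching a long run. The function below is a minimal sketch of such a pre-flight check; check_method is a hypothetical name, and only the long option names listed in this tutorial are recognized (short forms such as -w are not handled).

```shell
# Hypothetical pre-flight check (not part of PHAST).
# Usage: check_method METHOD [other phyloP options...]
# Returns 0 if the combination satisfies the rule stated above, 1 otherwise.
check_method() {
    method=$1
    shift
    case $method in
        SPH) return 0 ;;        # default method; no restriction is stated for it
        LRT|SCORE|GERP) ;;      # these need one of the options checked below
        *) echo "unknown method: $method" >&2; return 1 ;;
    esac
    for opt in "$@"; do
        case $opt in
            --base-by-base|--wig-scores|--features) return 0 ;;
        esac
    done
    echo "--method $method requires --base-by-base, --wig-scores or --features" >&2
    return 1
}
```

A call such as "check_method LRT --mode CONACC --wig-scores" succeeds, while "check_method GERP --mode CONACC" prints a message and returns a non-zero status.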

Examples


Example 1 - Run phyloP with an alignment and mod file using the CONACC mode and LRT method.


Using the likelihood ratio test (LRT) method one can compute conservation scores for each site in the alignment and output them in the fixed-step wig format. With the CONACC mode, these scores summarize both conservation and acceleration.

    phyloP --mode CONACC --method LRT --wig-scores neutralmodel.mod alignment.maf > phyloPscores.wig


Input files:

    Alignment file (MAF)

    Model file (.mod)

Output files:

    Output wig file (.wig)

In the output wig file, the absolute values of the scores represent -log p-values under a null hypothesis of neutral evolution. The sites predicted to be conserved are assigned positive scores while the sites predicted to be accelerated are assigned negative scores.
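To work with the scores outside a genome browser, it can help to expand the wig file into one row per site. The sketch below does this with awk and also converts each score back to a p-value. It makes two assumptions that are not stated in this tutorial: that the logarithm is base 10, and that the file follows the standard fixed-step layout (a "fixedStep chrom=... start=... step=..." declaration followed by one value per line). The function name wig_to_table is hypothetical.

```shell
# Hypothetical helper: expand a fixed-step wig file of CONACC scores into
# tab-separated columns: chrom, position, score, p-value, label.
# Usage: wig_to_table [file.wig ...]   (reads standard input if no file is given)
wig_to_table() {
    awk '
    $1 == "fixedStep" {
        # Parse the key=value fields of the declaration line; step defaults to 1.
        step = 1
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "chrom") chrom = kv[2]
            else if (kv[1] == "start") pos = kv[2]
            else if (kv[1] == "step") step = kv[2]
        }
        next
    }
    # Skip track/browser lines, comments and blank lines.
    /^(track|browser|#)/ || NF == 0 { next }
    {
        score = $1 + 0
        mag = (score < 0) ? -score : score
        # Assumption: the score magnitude is -log10(p).
        p = exp(-mag * log(10))
        label = "none"
        if (score > 0) label = "conserved"
        if (score < 0) label = "accelerated"
        printf "%s\t%d\t%s\t%.3g\t%s\n", chrom, pos, $1, p, label
        pos += step
    }' "$@"
}
```

For example, "wig_to_table phyloPscores.wig | head" shows the first few sites with their coordinates in the reference sequence.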

Example 2 - Run phyloP with the features option.


To obtain p-values for a set of regions in the alignment, such as ancestral repeats or intergenic regions, rather than for individual sites, a features file (GFF or BED) can be provided using the --features option.

    phyloP --method LRT --mode CONACC --features features.bed model.mod alignment.fa > features.out


Input files:

    Alignment file (FASTA)

    Features file (BED)

    Model file (.mod)

Output files:

    Features Output Table file

    Features Output GFF file

The output file is a table of p-values and related statistics with one row per feature. The features are assumed to use the coordinate frame of the first sequence of the alignment.

The option -g or --gff-scores can be used to output a GFF, instead of a table, assigning each feature a score equal to its -log p-value.

    phyloP --method LRT --mode CONACC --features features.bed -g model.mod alignment.fa > features.gff
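Because the GFF output carries the score in its sixth column (the standard GFF score field), features can be filtered by score with a short awk command. The following is a sketch only: gff_min_score is a hypothetical helper, and the threshold is whatever the user chooses. It compares the absolute value of the score, on the assumption that CONACC scores in the GFF may be negative for accelerated features.

```shell
# Hypothetical helper: keep GFF features whose score magnitude is at least MIN.
# Usage: gff_min_score MIN [file.gff ...]   (reads standard input if no file is given)
gff_min_score() {
    min=$1
    shift
    awk -F '\t' -v min="$min" '
    # Skip comment lines and lines with fewer than six columns.
    /^#/ || NF < 6 { next }
    {
        s = $6 + 0
        if (s < 0) s = -s      # assumption: negative scores mark acceleration
        if (s >= min) print
    }' "$@"
}
```

For example, "gff_min_score 2 features.gff > strong_features.gff" would keep only features with a score magnitude of 2 or more; the output file name here is illustrative.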

Example 3 - Run phyloP with the subtree and branch options.


The --subtree option can be used to partition the tree into the subtree beneath the node provided and the complementary supertree, and consider conservation/acceleration in the subtree given the supertree. The branch above the specified node is included with the subtree.

First, we need to make sure that names are assigned to all ancestral nodes. tree_doctor, a PHAST utility, can be used to do that. If a node is unnamed, a name is created by concatenating the names of a leaf from its left subtree and a leaf from its right subtree.

    tree_doctor --name-ancestors neutralmodel.mod > named_model.mod
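Before passing a node name to --subtree or --branch, it is worth confirming that the name actually appears in the named model. The sketch below lists every node name in a .mod file. It assumes the tree is stored in Newick form on a line beginning with "TREE:"; mod_node_names is a hypothetical helper, not a PHAST program.

```shell
# Hypothetical helper: list every node name found on the TREE line of a .mod file.
# Assumes a Newick tree on a line starting with "TREE:".
# Usage: mod_node_names named_model.mod
mod_node_names() {
    # 1. keep only the Newick string from the TREE line
    # 2. split it at the structural characters ( ) , ;
    # 3. drop branch lengths (everything from the first colon)
    # 4. drop the empty pieces left over from the split
    sed -n 's/^TREE:[[:space:]]*//p' "$1" |
        tr '(),;' '\n\n\n\n' |
        sed 's/:.*$//' |
        sed '/^[[:space:]]*$/d'
}
```

For example, "mod_node_names named_model.mod | grep -x 'mm9-rn4'" prints the name only if a node called mm9-rn4 exists in the tree.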


Scores describing lineage-specific conservation can then be computed using the --subtree option. Here we use the --base-by-base option, which outputs multiple values per site in a method-dependent way.

    phyloP --method LRT --subtree mm9-rn4 --mode CONACC --base-by-base named_model.mod alignment.maf > subtree_basebybase


Similarly, a features file can be provided along with the --subtree option to get a table of p-values and related statistics with one row per feature.

    phyloP --method LRT --subtree mm9-rn4 --mode CONACC --features features.bed named_model.mod alignment.maf > subtree_features


The --branch option is similar to --subtree, but it partitions the tree into the set of named branches and all the remaining branches, and then tests for conservation/acceleration in the set of named branches relative to the others. A comma-delimited list of child nodes can be provided as an argument.

    phyloP --method LRT --branch mm9-rn4 --mode CONACC -w named_model.mod alignment.maf > branch.wig


Input files:

    Alignment file (MAF)

    Features file (BED)

    Model file (.mod)

Output files:

    Named_model.mod

    Subtree basebybase Output file

    Subtree features Output file

    Branch Output wig file

The subtree features output file is a table of p-values and related statistics with one row per feature. The features are assumed to use the coordinate frame of the first sequence of the alignment.

The option -g or --gff-scores can be used to output a GFF, instead of a table, assigning each feature a score equal to its -log p-value.