phyloFit Tutorial

phyloFit is a command-line program within PHAST that fits one or more tree models to a multiple alignment of DNA sequences by maximum likelihood, using a specified tree topology and substitution model.

Download and compile PHAST


To run phyloFit users must download the PHAST binaries or compile PHAST from source.


        PHAST binaries can be downloaded by clicking the appropriate Windows, MacOSX or Linux icon on the PHAST website.

        PHAST source can be downloaded by clicking the Source icon on the PHAST website or from the PHAST GitHub repository.

        For complete instructions on how to compile PHAST from source, please visit Quick Start - Installing PHAST.
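
Once PHAST is installed, it can help to confirm that the programs used in this tutorial are visible on your PATH. The following is a minimal sketch; the helper name check_tool is hypothetical and is not part of PHAST.

```shell
# Report whether a program can be found on the PATH.
# check_tool is an illustrative helper, not a PHAST command.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# phyloFit and msa_view are the two PHAST programs used in this tutorial.
for tool in phyloFit msa_view; do
    check_tool "$tool" || true
done
```

If either program is reported as not found, add the directory containing the PHAST binaries to your PATH before continuing.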


Usage and options for phyloFit


phyloFit requires a sequence alignment file in MAF, FASTA, PHYLIP, MPM or SS format and can be run with a command of the form:


    phyloFit [OPTIONS] alignment > neutralmodel.mod


Here we present commonly used options for running phyloFit.


Option Description
--tree

Input file or string defining tree topology

This option is required if the alignment contains more than three species, or more than two species when a non-reversible substitution model is used (e.g., UNREST, U2, U3).

The tree must be in Newick format, with the label at each leaf equal to the index or name of the corresponding sequence in the alignment.

--subst-mod

The nucleotide substitution model

Models available - JC69, F81, HKY85, HKY85+Gap, REV, SSREV, UNREST, R2, R2S, U2, U2S, R3, R3S, U3, U3S

REV is the default substitution model

JC69, F81, HKY85, REV, and UNREST have the usual meanings (see, e.g., Yang, Goldman, and Friday, 1994). SSREV is a strand-symmetric version of REV. HKY85+Gap is an adaptation of HKY that treats gaps as a fifth character (courtesy of James Taylor). The others, all considered "context-dependent", are as defined in Siepel and Haussler, 2004.

--out-root

Use specified string as root filename for all files created.

--msa-format

Specify input file format

Input file formats accepted - FASTA, PHYLIP, MPM, MAF, SS.

--EM

Fit model(s) using EM rather than the BFGS quasi-Newton algorithm

--precision

HIGH, MED, LOW. Default is HIGH

Level of precision to use in estimating model parameters. Affects convergence criteria for iterative algorithms: higher precision means more iterations and longer execution time.

--non-overlapping

For use with context-dependent substitution models. Not compatible with --features, --markov or --msa-format SS.

Avoid using overlapping tuples of sites in parameter estimation. If a dinucleotide model is selected, every other tuple will be considered, and if a nucleotide triplet model is selected, every third tuple will be considered. This option cannot be used with an alignment represented only by unordered sufficient statistics.

--log

Write log to file describing details of the optimization procedure.

--features

Annotations file (GFF or BED format) describing features on one or more sequences in the alignment.

Together with a category map (see --catmap), the features will be taken to define site categories, and a separate model will be estimated for each category. If no category map is specified, a category will be assumed for each type of feature, and they will be numbered in the order of appearance of the features. Features are assumed to use the coordinate frame of the first sequence in the alignment and should be non-overlapping.

--catmap

Mapping of feature types to category numbers. Can either give a filename or an "inline" description of a simple category map

--do-cats

Estimate models for only the specified categories (comma-delimited list of categories, by name or number). Default is to fit a model for every category.

--nrates

Number of rate categories to use. Default is 1.

Specifying a value of greater than one causes the discrete gamma model for rate variation to be used.

Examples


Basic cases


Example 1 - Run phyloFit to compute the distance between two aligned sequences


Using the default REV model, the distance between two aligned sequences can be computed.

    phyloFit pair.fa > phyloFit.mod


Input files:

    Alignment file (FASTA)

Output files:

    Model file (.mod)

The output file is a model file (.mod). The distance, in substitutions per site, appears in the TREE line of the output file.
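
To look at the TREE line without opening the whole model file, a small helper like the one below may be enough. This is a sketch only: show_tree is a hypothetical name, and it assumes the line begins with the literal prefix "TREE:" followed by the Newick tree.

```shell
# Print the Newick tree stored on the TREE line of a phyloFit model file.
# show_tree is an illustrative helper, not a PHAST command; it assumes the
# line starts with "TREE:" followed by the tree.
show_tree() {
    sed -n 's/^TREE:[[:space:]]*//p' "$1"
}

# Runs only if the model file from the command above exists.
if [ -f phyloFit.mod ]; then
    show_tree phyloFit.mod
fi
```

The branch lengths in the printed tree are the quantities described above.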

Example 2 - Run phyloFit to fit a phylogenetic model to an alignment using the HKY85 substitution model.


For a given tree and alignment, phyloFit fits a model using the specified substitution model. In this example we use the HKY85 substitution model. With the --out-root option, output is written to a file with the specified prefix (pri_rod).

     phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85 --out-root pri_rod primate-rodent.fa


Input files:

    Alignment file (FASTA)

Output files:

    Model file(.mod)
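
Because the leaf labels in the --tree string must match the sequences in the alignment, it can be worth listing the names in the FASTA file before running phyloFit. The sketch below assumes each sequence name is the first word after ">" on a header line; fasta_names is a hypothetical helper, not a PHAST command.

```shell
# List the sequence names in a FASTA file, one per line, taking the first
# whitespace-delimited word after ">" on each header line.
# fasta_names is an illustrative helper, not a PHAST command.
fasta_names() {
    awk '/^>/ { sub(/^>/, "", $1); print $1 }' "$1"
}

# Runs only if the alignment used in this example is present.
if [ -f primate-rodent.fa ]; then
    fasta_names primate-rodent.fa
fi
```

For the command above, the names printed should include human, chimp, mouse and rat, matching the leaves of the tree.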


Special cases


Example 3 - Run phyloFit using the discrete-gamma model for rate variation with 4 rate categories.



The --nrates option can be used to specify the number of rate categories. Specifying a value greater than one causes the discrete gamma model for rate variation to be used (Yang, 1994).

     phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85 --out-root myfile --nrates 4 primate-rodent.fa


Input files:

    Alignment file (FASTA)


Output files:

    Model file(.mod)
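
After fitting, individual entries of the model file can be read back with a small helper. The sketch below is hedged: mod_field is a hypothetical name, it assumes the model file stores entries as "KEY: value" lines, and it assumes the log-likelihood of the fit is stored under the key TRAINING_LNL.

```shell
# Print the value that follows "KEY:" in a phyloFit model file.
# mod_field is an illustrative helper, not a PHAST command.
mod_field() {
    awk -v key="$2:" '$1 == key { $1 = ""; sub(/^ +/, ""); print }' "$1"
}

# Assumption: the log-likelihood is stored under the key TRAINING_LNL.
# With --out-root myfile, the model is expected in myfile.mod.
if [ -f myfile.mod ]; then
    mod_field myfile.mod TRAINING_LNL
fi
```

The same helper can be pointed at the TREE key to print the fitted tree.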


Example 4 - Run phyloFit using an SS (sufficient-statistics) file as input.



The PHAST utility msa_view can be used to generate an SS file from an alignment. SS is a simple format describing sufficient statistics for phylogenetic inference.

     msa_view hmrc.fa --out-format SS > hmrc.ss


Using this compact sufficient statistics format, we run phyloFit with the REV substitution model.

     phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod REV --out-root hmrc --nrates 4 --msa-format SS hmrc.ss

Input files:

    Alignment file (FASTA)


Output files:

    Alignment file (SS)

    Model file(.mod)



Example 5 - Run phyloFit to fit a context-dependent model to an alignment.



The U2S (strand-symmetric unrestricted matrix) model is one of the context-dependent models that can be selected in the --subst-mod option.

Here we use the --EM option for parameter optimization and relax the convergence criteria slightly by using the medium precision level (this is recommended with context-dependent models).

We consider only non-overlapping pairs of sites, and a log file describing the optimization procedure is written.

     phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod U2S --EM --precision MED --non-overlapping --log u2s.log --out-root hmrc-u2s hmrc.fa


Input files:

    Alignment file (FASTA)


Output files:

    Model file(.mod)

    Log file
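
The difference between overlapping and non-overlapping pairs of sites can be illustrated on a toy sequence. The sketch below is only an illustration of the idea behind --non-overlapping for a dinucleotide model; pairs is a hypothetical helper, and which pair phyloFit starts from is not specified in this tutorial.

```shell
# Print the adjacent pairs of a sequence, advancing by "step" positions.
# A step of 1 gives overlapping pairs; a step of 2 gives every other pair.
# pairs is an illustrative helper, not a PHAST command.
pairs() {
    awk -v seq="$1" -v step="$2" 'BEGIN {
        out = ""
        for (i = 1; i + 1 <= length(seq); i += step)
            out = out (out == "" ? "" : " ") substr(seq, i, 2)
        print out
    }'
}

pairs ACGTAC 1   # prints: AC CG GT TA AC
pairs ACGTAC 2   # prints: AC GT AC
```

With overlapping pairs, each interior site contributes to two tuples; taking every other pair means each site is used once.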


Example 6 - Run phyloFit to estimate the nonconserved model using a subset of sites in a features/annotation file.



The --features option can be used to indicate the sites of interest in your alignment. You can specify these sites in a "features" file in GFF or BED format.

Here, a phylogenetic model will be estimated using the ancestral repeats (ARs) defined in the features file AR.gff.

The --do-cats option is used to specify the categories to be used from the GFF to estimate the models.

     phyloFit --tree "(human,(mouse,rat))" --features AR.gff --do-cats AR --out-root nonconserved hmrc.fa


Input files:

    Alignment file (FASTA)

    Features file (GFF)


Output files:

    Model file(.mod)
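
Before choosing a value for --do-cats, it may help to see which feature types a GFF file actually contains. The sketch below counts the entries in the third, tab-delimited column, which holds the feature type in GFF; feature_types is a hypothetical helper, not a PHAST command.

```shell
# Count the features of each type (column 3) in a tab-delimited GFF file,
# skipping comment lines that begin with "#".
# feature_types is an illustrative helper, not a PHAST command.
feature_types() {
    awk -F'\t' '!/^#/ && NF >= 3 { n[$3]++ }
        END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Runs only if the features file used in this example is present.
if [ -f AR.gff ]; then
    feature_types AR.gff
fi
```

Each output line gives a feature type and the number of features of that type; the type names are the ones that can be passed to --do-cats when no category map is given.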


Example 7 - Run phyloFit to estimate the nonconserved model using 4d sites.



If a data set contains very distant species, which align mostly in conserved regions (e.g., coding exons), then the nonconserved branch lengths to these species will tend to be underestimated, because any "nonconserved" bases that do align are probably at least partially conserved. In such a case it may make sense to estimate a nonconserved model from 4d sites in coding regions.


The PHAST utility msa_view can be used to extract 4d sites from an alignment.

     msa_view alignment.maf --4d --features genes.gff > 4d-codons.ss


This creates a representation, in the "sufficient statistics" (SS) format, of whole codons containing 4d sites. The 4d sites themselves (in the 3rd codon positions) can then be extracted with a second msa_view command.

     msa_view 4d-codons.ss --in-format SS --out-format SS --tuple-size 1 > 4d-sites.ss


A nonconserved phylogenetic model can now be estimated using phyloFit.

     phyloFit --tree "((CHIMP,BABOON),HUMAN)" --msa-format SS --out-root nonconserved-4d 4d-sites.ss


Input files:

    Alignment file (MAF)

    Features file (GFF)


Output files:

    Sufficient statistics file with whole codons containing 4d sites (.ss)

    Sufficient statistics file with extracted 4d sites (.ss)

    Model file(.mod)