To run phyloFit, users must either download the PHAST binaries or compile PHAST from source.
PHAST binaries can be downloaded by clicking the appropriate Windows, MacOSX or Linux icon on the PHAST website.
The PHAST source can be downloaded by clicking the Source icon on the PHAST website or from the PHAST GitHub repository.
For complete instructions on how to compile PHAST from source, please visit Quick Start - Installing PHAST.
phyloFit requires a sequence alignment file in MAF, FASTA, PHYLIP, MPM or SS format and can be run with a command of the form:
phyloFit [OPTIONS] alignment > neutralmodel.mod
Here we present commonly used options for running phyloFit.
--tree: Input file or string defining the tree topology.
This option is required if the alignment contains more than three species, or more than two species when a non-reversible substitution model is used (e.g., UNREST, U2, U3).
The tree must be in Newick format, with the label at each leaf equal to the index or name of the corresponding sequence in the alignment.
--subst-mod: The nucleotide substitution model.
Models available - JC69, F81, HKY85, HKY85+Gap, REV, SSREV, UNREST, R2, R2S, U2, U2S, R3, R3S, U3, U3S
REV is the default substitution model
JC69, F81, HKY85, REV, and UNREST have the usual meanings (see, e.g., Yang, Goldman, and Friday, 1994). SSREV is a strand-symmetric version of REV. HKY85+Gap is an adaptation of HKY85 that treats gaps as a fifth character (courtesy of James Taylor). The others, all considered "context-dependent", are as defined in Siepel and Haussler, 2004.
--out-root: Use the specified string as the root filename for all files created.
--msa-format: Specify the input file format.
Input file formats accepted - FASTA, PHYLIP, MPM, MAF, SS.
--EM: Fit model(s) using EM rather than the BFGS quasi-Newton algorithm.
--precision: HIGH, MED, or LOW. Default is HIGH.
Level of precision to use in estimating model parameters. Affects convergence criteria for iterative algorithms: higher precision means more iterations and longer execution time.
--non-overlapping: For use with context-dependent substitution models. Not compatible with --features, --markov or --msa-format SS.
Avoid using overlapping tuples of sites in parameter estimation. If a dinucleotide model is selected, every other tuple will be considered, and if a nucleotide triplet model is selected, every third tuple will be considered. This option cannot be used with an alignment represented only by unordered sufficient statistics.
--log: Write a log to the given file describing details of the optimization procedure.
--features: Annotations file (GFF or BED format) describing features on one or more sequences in the alignment.
Together with a category map (see --catmap), the features will be taken to define site categories, and a separate model will be estimated for each category. If no category map is specified, a category will be assumed for each type of feature, and the categories will be numbered in the order of appearance of the features. Features are assumed to use the coordinate frame of the first sequence in the alignment and should be non-overlapping.
--catmap: Mapping of feature types to category numbers. Can be given either as a filename or as an "inline" description of a simple category map.
--do-cats: Estimate models for only the specified categories (comma-delimited list of categories, by name or number). Default is to fit a model for every category.
--nrates: Number of rate categories to use. Default is 1.
Specifying a value greater than one causes the discrete gamma model for rate variation to be used.
Using the default REV model, the distance between two aligned sequences can be computed.
phyloFit pair.fa > phyloFit.mod
The output file is a model file (.mod). The distance, in substitutions per site, appears in the TREE line of the output file.
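As a rough sketch of how that distance might be read back out, the script below prints the TREE line and adds up the branch lengths on it. It assumes the line has the form `TREE: (name:length,name:length);`. The model file it reads is a placeholder written by the script itself, with invented sequence names and numbers, so that the sketch runs without PHAST installed; point it at your own phyloFit.mod instead.

```shell
#!/bin/sh
# Placeholder .mod fragment with INVENTED values (not real phyloFit output).
workdir=$(mktemp -d)
cat > "$workdir/phyloFit.mod" <<'EOF'
SUBST_MOD: REV
TREE: (seq1:0.125,seq2:0.25);
EOF

# Print the TREE line.
tree_line=$(grep '^TREE:' "$workdir/phyloFit.mod")
echo "$tree_line"

# Split the tree on Newick punctuation and sum the number after each colon.
distance=$(printf '%s\n' "$tree_line" |
  sed 's/^TREE: *//' |
  tr '(),;' '\n\n\n\n' |
  awk -F: 'NF == 2 { total += $2 } END { print total }')
echo "distance: $distance substitutions per site"

rm -rf "$workdir"
```

For a two-sequence tree the sum of the two branch lengths is the path length between the sequences; for larger trees this sum would be the total tree length rather than a pairwise distance.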
For a given tree and alignment, phyloFit fits a model using the specified substitution model. In this example we use the HKY85 substitution model. Using the --out-root option, we write the output to a file with the specified prefix (pri_rod).
phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85 --out-root pri_rod primate-rodent.fa
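Because the leaf labels in the tree have to correspond to the sequences in the alignment, it can help to compare the two before running phyloFit. The following is a minimal sketch of such a check for the case where leaves are labeled by name (not by index). The file toy.fa and its sequences are invented for illustration, and the check is a plain text comparison, not something phyloFit provides.

```shell
#!/bin/sh
# Toy alignment with INVENTED sequences, only to make the sketch runnable.
workdir=$(mktemp -d)
cat > "$workdir/toy.fa" <<'EOF'
>human
ACGT
>chimp
ACGT
>mouse
ACGA
>rat
ACGA
EOF
tree="((human,chimp),(mouse,rat))"

# Leaf labels: split on Newick punctuation, drop branch lengths if present.
tree_names=$(printf '%s\n' "$tree" | tr '(),;' '\n\n\n\n' |
  sed 's/:.*//' | grep -v '^ *$' | sort)

# Sequence names: FASTA header lines without the leading ">".
fasta_names=$(grep '^>' "$workdir/toy.fa" |
  sed 's/^> *//; s/[[:space:]].*//' | sort)

if [ "$tree_names" = "$fasta_names" ]; then
  names_match=yes
else
  names_match=no
fi
echo "leaf labels match sequence names: $names_match"

rm -rf "$workdir"
```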
The --nrates option can be used to specify the number of rate categories. Specifying a value greater than one causes the discrete gamma model for rate variation to be used (Yang, 1994).
phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85 --out-root myfile --nrates 4 primate-rodent.fa
The PHAST utility msa_view can be used to generate an SS file from an alignment. SS is a simple format describing sufficient statistics for phylogenetic inference.
msa_view hmrc.fa --out-format SS > hmrc.ss
Using this compact sufficient statistics format, we run phyloFit with the REV substitution model.
phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod REV --out-root hmrc --nrates 4 --msa-format SS hmrc.ss
The U2S (strand-symmetric unrestricted matrix) model is one of the context-dependent models that can be selected in the --subst-mod option.
Here we use the --EM option for parameter optimization and relax the convergence criteria slightly by using the medium precision level (this is recommended with context-dependent models).
We consider only non-overlapping pairs of sites and a log file is written for the optimization procedure.
phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod U2S --EM --precision MED --non-overlapping --log u2s.log --out-root hmrc-u2s hmrc.fa
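To make the difference between overlapping and non-overlapping pairs concrete, the sketch below lists both kinds of pairs for a single toy sequence. It is only an illustration of the idea: phyloFit works on alignment columns, not on one string, and the choice to start at the first position is an assumption. The helper `tuples` is hypothetical and not part of PHAST.

```shell
#!/bin/sh
# Toy sequence, INVENTED for illustration.
seq="ACGTAC"

# Hypothetical helper: print tuples of width $2 from string $1,
# advancing $3 positions between consecutive tuples.
tuples() {
  printf '%s\n' "$1" | awk -v w="$2" -v step="$3" '{
    out = ""
    for (i = 1; i + w - 1 <= length($0); i += step)
      out = out (out == "" ? "" : " ") substr($0, i, w)
    print out
  }'
}

overlapping=$(tuples "$seq" 2 1)      # every pair of adjacent sites
non_overlapping=$(tuples "$seq" 2 2)  # every other pair
echo "overlapping pairs:     $overlapping"
echo "non-overlapping pairs: $non_overlapping"
```

With a step of 3 and a width of 3, the same helper shows the every-third-tuple case that applies to nucleotide triplet models.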
The --features option can be used to indicate the sites of interest in your alignment. You specify these sites in a "features" file in GFF or BED format.
Here, a phylogenetic model will be estimated using the ancestral repeats (ARs) defined in the features file AR.gff.
The --do-cats option is used to specify the categories to be used from the GFF to estimate the models.
phyloFit --tree "(human,(mouse,rat))" --features AR.gff --do-cats AR --out-root nonconserved hmrc.fa
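Before passing a features file to phyloFit, it may be useful to see which feature types it contains, in order of first appearance, and whether any features overlap. The sketch below does this with awk on a toy GFF. The file toy.gff, its coordinates, and the CDS feature type are invented, the overlap check assumes all features lie on a single sequence, and none of this is performed by phyloFit itself.

```shell
#!/bin/sh
# Toy GFF with INVENTED coordinates (tab-separated, nine columns).
workdir=$(mktemp -d)
tab=$(printf '\t')
printf 'human\ttoy\tAR\t10\t50\t.\t+\t.\tid1\n'   >  "$workdir/toy.gff"
printf 'human\ttoy\tCDS\t60\t90\t.\t+\t.\tid2\n'  >> "$workdir/toy.gff"
printf 'human\ttoy\tAR\t100\t140\t.\t+\t.\tid3\n' >> "$workdir/toy.gff"

# Feature types (column 3) in order of first appearance.
types=$(awk -F'\t' '!/^#/ && !seen[$3]++ { printf "%s%s", sep, $3; sep = " " }
                    END { print "" }' "$workdir/toy.gff")
echo "feature types: $types"

# Count features that start at or before the furthest end seen so far.
overlaps=$(sort -t "$tab" -k4,4n "$workdir/toy.gff" |
  awk -F'\t' 'NR > 1 && $4 + 0 <= prev_end + 0 { n++ }
              { if ($5 + 0 > prev_end + 0) prev_end = $5 + 0 }
              END { print n + 0 }')
echo "overlapping features: $overlaps"

rm -rf "$workdir"
```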
If a data set contains very distant species that align mostly in conserved regions (e.g., coding exons), the nonconserved branch lengths to these species will tend to be underestimated, because any "nonconserved" bases that do align are probably at least partially conserved. In such a case it may make sense to estimate a nonconserved model from fourfold degenerate (4d) sites in coding regions.
The PHAST utility msa_view can be used to extract 4d sites from an alignment.
msa_view alignment.maf --4d --features genes.gff > 4d-codons.ss
This will create a representation in the "sufficient statistics" (SS) format of whole codons containing 4d sites. 4d sites (in the 3rd codon positions) can be extracted using msa_view.
msa_view 4d-codons.ss --in-format SS --out-format SS --tuple-size 1 > 4d-sites.ss
A nonconserved phylogenetic model can now be estimated using phyloFit.
phyloFit --tree "((CHIMP,BABOON),HUMAN)" --msa-format SS --out-root nonconserved-4d 4d-sites.ss
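The three commands above can be chained into one script. The following is a minimal sketch that reuses the same file names and options and adds only a guard: it runs nothing unless msa_view, phyloFit, and both input files are found. The variable `pipeline_state` is introduced here purely for reporting.

```shell
#!/bin/sh
# Input names follow the commands above; both files are assumed to be
# in the current directory.
alignment="alignment.maf"
features="genes.gff"

if command -v msa_view >/dev/null 2>&1 &&
   command -v phyloFit >/dev/null 2>&1 &&
   [ -f "$alignment" ] && [ -f "$features" ]; then
  # Step 1: whole codons containing 4d sites, in SS format.
  msa_view "$alignment" --4d --features "$features" > 4d-codons.ss &&
  # Step 2: reduce the codons to the 4d sites themselves.
  msa_view 4d-codons.ss --in-format SS --out-format SS --tuple-size 1 > 4d-sites.ss &&
  # Step 3: fit the nonconserved model to the 4d sites.
  phyloFit --tree "((CHIMP,BABOON),HUMAN)" --msa-format SS \
    --out-root nonconserved-4d 4d-sites.ss &&
  pipeline_state=done || pipeline_state=failed
else
  pipeline_state=skipped
  echo "PHAST tools or input files not found; nothing was run." >&2
fi
echo "pipeline: $pipeline_state"
```

Chaining the steps with `&&` means a failure in an earlier step stops the later ones, so phyloFit is never run on a missing or partial SS file.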