PROGRAM: phyloFit

DESCRIPTION: 

    Fits one or more tree models to a multiple alignment of DNA
    sequences by maximum likelihood, using the specified tree topology
    and substitution model.  If categories of sites are defined via
    --features and --catmap (see below), then a separate model will be
    estimated for each category.  A description of each model will
    be written to a separate file, with the suffix ".mod".  These
    .mod files minimally include a substitution rate matrix, a tree with
    branch lengths, and estimates of nucleotide equilibrium
    frequencies.  They may also include information about parameters
    for modeling rate variation.

USAGE: phyloFit [OPTIONS] <msa_fname>

    <msa_fname> should be a multiple alignment in FASTA format or
    one of several alternative formats (see --msa-format).  For
    backward compatibility, this argument may be preceded by '-m' or
    '--msa'.  Note that --tree is required in most cases.  By default,
    all output files will have the prefix "phyloFit" (see
    --out-root).

EXAMPLES:

    (If you're like me, you want some basic examples first, and a list
    of all options later.)

    1. Compute the distance between two aligned sequences (in FASTA file
    pair.fa) under the REV model.

        phyloFit pair.fa

    (output is to phyloFit.mod; distance in substitutions per site
    appears in the TREE line in the output file)

    2. Fit a phylogenetic model to an alignment of human, chimp, mouse,
    and rat sequences.  Use the HKY85 substitution model.  Write output
    to files with prefix "myfile".

        phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85
            --out-root myfile primate-rodent.fa

    3. As above, but use the discrete-gamma model for rate variation,
    with 4 rate categories.

        phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85
            --out-root myfile --nrates 4 primate-rodent.fa

    4. As above, but use genome-wide data, stored in the compact
    "sufficient-statistics" format (can be produced with "msa_view
    -o SS").

        phyloFit --tree "((human,chimp),(mouse,rat))" --subst-mod HKY85
            --out-root myfile --nrates 4 --msa-format SS 
            primate-rodent.ss

    5. Fit a context-dependent phylogenetic model (U2S) to an
    alignment of human, mouse, and rat sequences.  Use
    an EM algorithm for parameter optimization and relax the
    convergence criteria a bit (recommended with context-dependent
    models).  Write a log file for the optimization procedure.
    Consider only non-overlapping pairs of sites.

        phyloFit --tree "(human,(mouse,rat))" --subst-mod U2S --EM
            --precision MED --non-overlapping --log u2s.log --out-root
            hmr-u2s hmr.fa

    6. As above, but allow overlapping pairs of sites, and compute
    likelihoods by assuming Markov-dependence of columns (see Siepel &
    Haussler, 2004).  The EM algorithm can no longer be used
    (optimization will be much slower).

        phyloFit --tree "(human,(mouse,rat))" --subst-mod U2S
            --precision MED --log u2s-markov.log --markov hmr.fa

    7. Compute a likelihood using parameter estimates obtained in (5)
    and an assumption of Markov dependence.  This provides a lower
    bound on the likelihood of the Markov-dependent model.

        phyloFit --init-model hmr-u2s.mod --lnl --markov hmr.fa

    8. Given an alignment of several mammalian sequences (mammals.fa), a
    tree topology (tree.nh), and a set of gene annotations in GFF
    (genes.gff), fit separate models to sites in 1st, 2nd, and 3rd
    codon positions.  Use the REV substitution model.  Assume coding
    regions have feature type 'CDS'.

        phyloFit --tree tree.nh --features genes.gff --out-root mammals-rev
            --catmap "NCATS = 3; CDS 1-3" --do-cats 1,2,3 mammals.fa

    (output will be to mammals-rev.cds-1.mod, mammals-rev.cds-2.mod, and 
    mammals-rev.cds-3.mod)


OPTIONS:

    --tree, -t <tree_fname>|<tree_string>
        (Required if more than three species, or more than two species
        and a non-reversible substitution model, e.g., UNREST, U2, U3)
        Name of file or literal string defining tree topology.  Tree
        must be in Newick format, with the label at each leaf equal to
        the index or name of the corresponding sequence in the alignment
        (indexing begins with 1).  Examples: --tree "(1,(2,3))", 
        --tree "(human,(mouse,rat))".  Currently, the topology must be
        rooted.  When a reversible substitution model is used, the root
        is ignored during the optimization procedure.

    --subst-mod, -s JC69|F81|HKY85|HKY85+Gap|REV|SSREV|UNREST|R2|R2S|U2|U2S|R3|R3S|U3|U3S
        (default REV).  Nucleotide substitution model.  JC69, F81, HKY85
        REV, and UNREST have the usual meanings (see, e.g., Yang, 
        Goldman, and Friday, 1994).  SSREV is a strand-symmetric version
        of REV.  HKY85+Gap is an adaptation of HKY that treats gaps as a 
        fifth character (courtesy of James Taylor).  The others, all
	considered "context-dependent", are as defined in Siepel and 
        Haussler, 2004.  The options --EM and --precision MED are 
        recommended with context-dependent models (see below).

    --msa-format, -i FASTA|PHYLIP|MPM|MAF|SS
        (default is to guess format from file contents) Alignment format.  
        FASTA is as usual.  PHYLIP is compatible with the formats used in 
        the PHYLIP and PAML packages.  MPM is the format used by the 
        MultiPipMaker aligner and some other of Webb Miller's older tools.  
        MAF ("Multiple Alignment Format") is used by MULTIZ/TBA and the 
        UCSC Genome Browser.  SS is a simple format describing the 
	sufficient statistics for phylogenetic inference (distinct columns
        or tuple of columns and their counts).  Note that the program
        "msa_view" can be used for file conversion.

    --out-root, -o <output_fname_root>
        (default "phyloFit").  Use specified string as root filename
        for all files created.

    --min-informative, -I <ninf_sites>
        Require at least <ninf_sites> "informative" sites -- i.e., 
        sites at which at least two non-gap and non-missing-data ('N'
        or '*') characters are present.  Default is 50.

    --gaps-as-bases, -G
        Treat alignment gap characters ('-') like ordinary bases.  By
        default, they are treated as missing data.

    --ignore-branches, -b <branches>
        Ignore specified branches in likelihood computations and parameter
        estimation, and treat the induced subtrees as independent.  Can be
        useful for likelihood ratio tests.  The argument <branches> should
        be a comma-separated list of nodes in the tree, indicating the
        branches above these nodes, e.g., human-chimp,cow-dog.  (See
        tree_doctor --name-ancestors regarding names for ancestral nodes.)
        This option does not currently work with --EM.

    --quiet, -q
        Proceed quietly.

    --help, -h
        Print this help message.


 (Options for controlling and monitoring the optimization procedure)

    --lnl, -L
        (for use with --init-model) Simply evaluate the log likelihood of
        the specified tree model, without performing any further
        optimization.  Can be used with --post-probs, --expected-subs, and
        --expected-total-subs.

    --EM, -E 
        Fit model(s) using EM rather than the BFGS quasi-Newton
        algorithm (the default).

    --precision, -p HIGH|MED|LOW
        (default HIGH) Level of precision to use in estimating model
        parameters.  Affects convergence criteria for iterative
        algorithms: higher precision means more iterations and longer
        execution time.

    --log, -l <log_fname>
        Write log to <log_fname> describing details of the optimization
        procedure.

    --init-model, -M <mod_fname>
        Initialize with specified tree model.  By choosing good
        starting values for parameters, it is possible to reduce
        execution time dramatically.  If this option is chosen, --tree
        is not allowed.  The substitution model used in the given
        model will be used unless --subst-mod is also specified.  
        Note: currently only one mod_fname may be specified; it will be 
        used for all categories.

    --init-random, -r
        Initialize parameters randomly.  Can be used multiple times to test
        whether the m.l.e. is real.

    --seed, -D <seed>
        Provide a random number seed for choosing initial parameter values
	(usually with --init-random, though random values are used in some
        other cases as well).  Should be an integer >=1.  If not provided,
	seed is chosen based on current time.

    --init-parsimony, -y
        Initialize branch lengths using parsimony counts for given data.
        Only currently implemented for models with single character state
	(ie, not di- or tri-nucleotides).  Other --init options such
	as --init-random or --init-model can be used in conjunction to 
	initialize substitution matrix parameters.

    --print-parsimony, -Y <filename>
        Print parsimony score to given file, and quit.  (Does not optimize
        or report likelihoods).

    --clock, -z
        Assume a molecular clock in estimation.  Causes the distances to all
        descendant leaves to be equal for each ancestral node and cuts the
        number of free branch-length parameters roughly in half.  

    --scale-only, -B
        (for use with --init-model) Estimate only the scale of the tree,
        rather than individual branch lengths (branch proportions fixed).
        Equilibrium frequencies and rate-matrix parameters will still be
        estimated unless --no-freqs and --no-rates are used.

    --scale-subtree, -S <node_name>
        (for use with --scale-only) Estimate separate scale factors for
        subtree beneath identified node and rest of tree.  The branch
        leading to the subtree is included with the subtree.  If ":loss" or
        ":gain" is appended to <node_name>, subtree scale is constrained to
        be greater than or less than (respectively) scale for rest of tree.

    --estimate-freqs, -F
        Estimate equilibrium frequencies by maximum likelihood, rather
        than approximating them by the relative frequencies in the data.
	If using the SSREV model, this option implies --sym-freqs.

    --sym-freqs, -W
        Estimate equilibrium frequencies, assuming freq(A)=freq(T) and
	freq(C)=freq(G).  This only works for an alphabet ACGT (and possibly
	gap).  This option implies --estimate-freqs.

    --no-freqs, -f
        (for use with --init-model) Do not estimate equilibrium
        frequencies; just use the ones from the given tree model.

    --no-rates, -n
        (for use with --init-model) Do not estimate rate-matrix
        parameters; just use the ones from the given tree model.

    --ancestor, -A <seqname>
        Treat specified sequence as the root of the tree.  The tree
        topology must define this sequence to be a child of the root
        (in practice, the branch from the root to the specified
        sequence will be retained, but will be constrained to have
        length zero).

    --error, -e <filename>
	For each parameter, report estimate, variance, and 95%% confidence
        interval, printed to given filename, one parameter per line.

    --no-opt, -O <param_list>
        Hold parameters listed in comma-separated param_list constant at
	initial values.  This applies only to the "main" model, and not to 
	any models defined with the --alt-mod option.  Param list can 
	contain values such as "branches" to hold branch lengths constant,
	"ratematrix", "backgd", or "ratevar" to hold entire rate matrix, 
	equilibrium frequencies, or rate variation parameters constant 
	(respectively).  There are also substitution model-specific 
	parameters such as "kappa" (transition/transversion rate ratio).

        Note: to hold certain branches constant, but optimize others,
        put an exclamation point in the newick-formatted tree after the
        branch lengths that should be held constant.  This can be useful
        for enforcing a star-phylogeny.  However, note that the two branches
        coming from root of tree are treated as one.  So they should both
        be held constant, or not held constant.  This option does *not* work
        with --scale-only or --clock.

    --bound <param_name[lower_bound,upper_bound]>
        Set boundaries for parameter.  lower_bound or upper_bound may be
	empty string to keep default.  For example --bound gc_param[1,] will
	set the lower bound for gc_param to 1 (keeping upper bound at infinity),
	for a GC model.  Only applies to parameters for model in the "main" 
	tree, but similar syntax can be used within the --alt-mod arguments.
    	Can be used multiple times to set boundaries for different parameters.

    --selection <selection_param>
        Use selection in the model (is also implied if --init-model is used
        and contains selection parameter).  Selection scales rate matrix
	entries by selection_param/(1-exp(-selection-param)); this is done
        after rate matrix is scaled to set expected number of substitutions
	per unit time to 1.  If using codon models selection acts only on
	nonysynonymous mutations.


 (Options for modeling rate variation)

    --nrates, -k <nratecats>
        (default 1).  Number of rate categories to use.  Specifying a
        value of greater than one causes the discrete gamma model for
        rate variation to be used (Yang, 1994).

    --alpha, -a <alpha>
        (for use with --nrates).  Initial value for alpha, the shape
        parameter of the gamma distribution.  Default is 1.

    --rate-constants, -K <rate_consts>
        Use a non-parameteric mixture model for rates, instead of
        assuming a gamma distribution.  The argument <rate_consts>
        must be a comma-delimited list explicitly defining the rate
        constants to be used.  The "weight" (mixing proportion)
        associated with each rate constant will be estimated by EM
        (this option implies --EM).  If --alpha is used with
        this option, then the mixing proportions will be initialized
        to reflect a gamma distribution with the specified shape
        parameter.


 (Options for separate handling of sites in different annotation categories)

    --features, -g <fname>
        Annotations file (GFF or BED format) describing features on
        one or more sequences in the alignment.  Together with a
        category map (see --catmap), will be taken to define site
        categories, and a separate model will be estimated for each
        category.  If no category map is specified, a category will be
        assumed for each type of feature, and they will be numbered in
        the order of appearance of the features.  Features are assumed
        to use the coordinate frame of the first sequence in the
        alignment and should be non-overlapping (see 'refeature
        --unique').

    --catmap, -c <fname>|<string>
        (optionally use with --features) Mapping of feature types to
        category numbers.  Can either give a filename or an "inline"
        description of a simple category map, e.g., --catmap "NCATS =
        3 ; CDS 1-3" or --catmap "NCATS = 1 ; UTR 1".  Note that
        category 0 is reserved for "background" (everything that is
        not described by a defined feature type).

    --do-cats, -C <cat_list>
        (optionally use with --features) Estimate models for only the
        specified categories (comma-delimited list categories, by name
        or numbera).  Default is to fit a model for every category.

    --reverse-groups, -R <tag>
        (optionally use with --features) Group features by <tag> (e.g.,
        "transcript_id" or "exon_id") and reverse complement
        segments of the alignment corresponding to groups on the
        reverse strand.  Groups must be non-overlapping (see refeature
        --unique).  Useful with categories corresponding to
        strand-specific phenomena (e.g., codon positions).


 (Options for context-dependent substitution models)

    --markov, -N
        (for use with context-dependent substitutions models and not
        available with --EM.)  Assume Markov dependence of alignment
        columns, and compute the conditional probability of each
        column given its N-1 predecessors using the two-pass algorithm
        described by Siepel and Haussler (2004).  (Here, N is the
        "order" of the model, as defined by --subst-mod; e.g., N=1
        for REV, N=2 for U2S, N=3 for U3S.) The alternative (the
        default) is simply to work with joint probabilities of tuples
        of columns.  (You can ensure that these tuples are
        non-overlapping with the --non-overlapping option.)  The use
        of joint probabilities during parameter estimation allows the
        use of the --EM option and can be much faster; in addition, it
        appears to produce nearly equivalent estimates.  If desired,
        parameters can be estimated without --markov, and
        then the likelihood can be evaluated using --lnl and
        --markov together.  This gives a lower bound on the
        likelihood of the Markov-dependent model.

    --non-overlapping, -V
        (for use with context-dependent substitution models; not
        compatible with --markov, --features, or
        --msa-format SS) Avoid using overlapping tuples of sites
        in parameter estimation.  If a dinucleotide model is selected,
        every other tuple will be considered, and if a nucleotide
        triplet model is selected, every third tuple will be
        considered.  This option cannot be used with an alignment
        represented only by unordered sufficient statistics.

 (Option for lineage-specific models)

   --label-branches branch1,branch2,branch3...:label
        Create a group of branches by giving a set of branches a 
        single label.  The label should be a word which does not 
        contain special characters (in particular, no spaces, brackets,
        parentheses, pound signs, commas, or colons).

        The label is for use with --alt-model option below, so that an 
        alternate model can be defined for a set of branches.  A branch
        is specified by the name of the node which is a descendant of
        that branch.

        For example, 
        --label-branches hg18,chimp,hg18-chimp:HC
        will apply the label "HC" to the hg18,chimp,and hg18-chimp 
        branches in the following tree:
        (((hg18,chimp)hg18-chimp, (mouse,rat)mouse-rat)

        The same label could be defined without using --label-branches
        by specifing the tree (either on the command-line or within
        init-model) as follows:
        (((hg18 # HC, chimp #HC)#HC, (mouse,rat))

   --label-subtree node[+]:label
        Similar to label-branches, except labels the entire subtree
        of the named node.  If the node name is followed by a "+" sign,
        then includes the branch leading up to the node in the subtree.

   --alt-model, -d <label:(model|param_list)>
        Create a lineage-specific substitution model on a group of branches.
        The group is defined by a label, which can be specified within
        the tree string (using the # sign notation), or by using the
        --label-branches or --label-subtree arguments.  If the alt-model
        applies to only a single branch, labelling is not necessary and
        the name of the node descending from the branch can be used instead.  
        See --label-branches above for more details on labelling groups of
        branches.

	If a name of a substitution model (HKY85, REV, UNREST, etc)
	is given after the colon, then this model will be used for the
        group of branches, and parameters relevant to the model will be 
        estimated separately in this group.  This model may be different 
        (or the same) as the model used in the rest of the tree, but it
        must have the same number of states and be of the same order as 
	the model used for the rest of the tree.

	Alternately, a list of parameter names can be given after the colon.
	In this case, the same substitution model will be used for the 
        entire tree, but the parameters listed will be estimated separately 
        in the specified group of branches.

	The parameter names are model-specific, and include "kappa" for
	HKY models, "alpha" for GC models, "ratematrix" to specify
	all rate-matrix parameters in general models, and "backgd" for
	the equilibrium background frequencies.  The parameter names
	may optionally be followed by boundaries in square brackets to
	specify parameter bounds, as described in --bound option.

	The --alt-model argument may be used multiple times, if one
        wishes to (for example) optimize a parameter independently 
        on several different groups of branches.

	Example:
	phyloFit align.fa --subst-mod HKY85 \
	--tree "(human, (mouse#MR, rat#MR)#MR, cow)"\
	--alt-model "MR:kappa[0, 1]"
 
        will estimate the HKY85 parameter kappa separately on the
        mouse/rat subtree, and constrain kappa between 0 and 1.  The
        quotes are often necessary to prevent the square brakcets from
        being parsed by the shell.  The same model could be achieved with:
 
        phyloFit align.fa --subst-mod HKY85 \
        --tree "(human, (mouse,rat)mouse-rat, cow)"\
        --label-branches mouse,rat,mouse-rat:MR \
        --alt-model "MR:kappa[0,1]"

        or

        phyloFit align.fa --subst-mod HKY85 \
        --tree "(human, (mouse,rat)mouse-rat, cow)" \
        --label-subtree "mouse-rat+:MR" \
        --alt-model "MR:kappa[0,1]"

 (Options for posterior probabilities)

    --post-probs, -P
        Output posterior probabilities of all bases at all ancestral 
        nodes.  Output will be to auxiliary file(s) with suffix 
        ".postprobs".

    --expected-subs, -X
        Output posterior expected number of substitutions on each branch at
        each site, summed across all types of substitutions. 
        Output will be to auxiliary file(s) with suffix ".expsub".

    --expected-subs-col, -J
        Output posterior expected number of substitutions of each type
        on each branch, for each site.  Output will be to auxiliary 
        file(s) with suffix .expcolsub

    --expected-total-subs, -Z
        Output posterior expected number of substitutions of each type 
        on each branch, summed across all sites.  Output will be to 
        auxiliary file(s) with suffix ".exptotsub".

    --column-probs, -U
        (for use with -init-model; implies --lnl)  Output a separate log
        probability for each type of column in the input.  Output will
        be to a file with suffix ".colprobs".  Values are log base 2.


 (Options for estimation in sliding window)

    --windows, -w <size,shift>
        Apply a sliding window to the alignment, and fit a separate
        tree to each window.  Arguments specify size of window and
        amount by which to shift it on each iteration, both in bases
        of the first sequence in the alignment (assumed to be the
        reference sequence).  Separate versions of all output files
        will be created for each window.

    --windows-explicit, -v <window_coord_list>
        Like --windows, except that all start and end coordinates must
        be explicitly specified.  Each successive pair of numbers is
        interpreted as defining the start and end of a window.  Can be
        used with a two-column file and the '*' operator, e.g.,
        --windows-explicit '*mycoords'.


REFERENCES:

    A. Siepel and D. Haussler.  2004.  Phylogenetic estimation of
      context-dependent substitution rates by maximum likelihood.
      Mol. Biol. Evol., 21:468-488.

    Z. Yang, N. Goldman, and A. Friday.  1994. Comparison of models for
      nucleotide substution used in maximum likelihood phylogenetic
      estimation. Mol. Biol. Evol., 11:316-324.

    Z. Yang. 1994. Maximum likelihood phylogenetic estimation from
      DNA sequences with variable rates over sites: approximate
      methods. J. Mol. Evol., 39:306-314.