PROGRAM: phastCons
USAGE: phastCons [OPTIONS] alignment m1.mod,m2.mod,... > scores.wig
The alignment file can be in any of several file formats (see
--msa-format). The phylogenetic models must be in the .mod format
produced by the phyloFit program.
DESCRIPTION:
Identify conserved elements or produce conservation scores, given
a multiple alignment and a phylo-HMM. By default, a phylo-HMM
consisting of two states is assumed: a "conserved" state and a
"non-conserved" state. Separate phylogenetic models can be
specified for these two states, e.g.,
phastCons myfile.ss cons.mod,noncons.mod > scores.wig
or a single model can be given for the non-conserved state, e.g.,
phastCons myfile.ss --rho 0.5 noncons.mod > scores.wig
in which case the model for the conserved state will be obtained
by multiplying all branch lengths by the scaling parameter rho (0
< rho < 1). If the --rho option is not used, rho will be set to
its default value of 0.3.
By default, the phylogenetic models will be left unaltered, but if
the --estimate-trees option is used, e.g.,
phastCons myfile.ss init.mod --estimate-trees newtree > scores.wig
then the phylogenetic models for the two states will be estimated
from the data, and the given tree model (there must be only one in
this case) will be used for initialization only. It is also
possible to estimate only the scale factor --rho, using the
--estimate-rho option.
The transition probabilities for the HMM can either be specified
at the command line or estimated from the data using an EM
algorithm. To specify them at the command line, use either the
--transitions option or the --target-coverage and
--expected-length options. The recommended method is to use
--target-coverage and --expected-length, e.g.,
phastCons --target-coverage 0.25 --expected-length 12 \
myfile.ss cons.mod,noncons.mod > scores.wig
The program produces two main types of output. The primary
output, sent to stdout in fixed-step WIG format
(http://genome.ucsc.edu/goldenPath/help/wiggle.html), is a set of
base-by-base conservation scores. The score at each base is equal
to the posterior probability that that base was "generated" by the
conserved state of the phylo-HMM. The scores are reported in the
coordinate frame of a designated reference sequence (see
--refidx), which is by default the first sequence in the
alignment. They can be suppressed with the --no-post-probs
option. The secondary type of output, activated with the
--most-conserved (aka --viterbi) option, is a set of discrete
conserved elements. These elements are output in either BED or GFF
format, also in the coordinate system of the reference sequence
(see --most-conserved). They can be assigned log-odds scores
using the --score option.
Other uses are also supported, but will not be described in detail
here. For example, it is possible to produce conservation scores
and conserved elements using a k-state phylo-HMM of the kind
described by Felsenstein and Churchill (1996) (see --FC), and it
is possible to produce a "coding potential" score instead of a
conservation score (see --coding-potential). It is also possible
to give the program a custom HMM and to specify any subset of its
states to use for prediction (see --hmm and --states).
See the phastCons HOWTO for additional details.
EXAMPLES:
1. Given phylogenetic models for conserved and nonconserved regions
and HMM transition parameters, compute a set of conservation scores.
phastCons --transitions 0.01,0.01 mydata.ss cons.mod,noncons.mod \
> scores.wig
2. Similar to (1), but define the conserved model as a scaled
version of the nonconserved model, with rho=0.4 as the scaling
parameter. Also predict conserved elements as well as
conservation scores, and assign log-odds scores to predictions.
phastCons --transitions 0.01,0.01 --most-conserved mostcons.bed \
--score --rho 0.4 mydata.ss noncons.mod > scores.wig
(if output file were "mostcons.gff," then output would be in
GFF instead of BED format)
3. This time, estimate the parameter rho from the data. Suppress
both the scores and the conserved elements. Specify the
transition probabilities using --target-coverage and
--expected-length instead of --transitions.
phastCons --target-coverage 0.25 --expected-length 12 \
--estimate-rho newtree --no-post-probs mydata.ss noncons.mod
4. This time estimate all free parameters of the tree models.
phastCons --target-coverage 0.25 --expected-length 12 \
--estimate-trees newtree --no-post-probs mydata.ss noncons.mod
5. Estimate the state-transition parameters but not the tree
models. Output the conservation scores but not the conserved
elements.
phastCons mydata.ss cons.mod,noncons.mod > scores.wig
6. Estimate just the expected-length parameter and also estimate rho.
phastCons --target-coverage 0.25 --estimate-rho newtree \
mydata.ss noncons.mod > scores.wig
OPTIONS:
(Tree models)
--rho, -R
Set the *scale* (overall evolutionary rate) of the model for
the conserved state to be times that of the model for
the non-conserved state (0 < < 1; default 0.3). If used
with --estimate-trees or --estimate-rho, the specified value
will be used for initialization only (rho will be
estimated). This option is ignored if two tree models are
given.
--estimate-trees, -T
Estimate free parameters of tree models and write new models
to .cons.mod and .noncons.mod.
--estimate-rho, -O
Like --estimate-trees, but estimate only the parameter rho.
--gc, -G
(Optionally use with --estimate-trees or --estimate-rho)
Assume a background nucleotide distribution consistent with
the given average G+C content (0 < < 1) when estimating
tree models. (The frequencies of G and C will be set to
/2 and the frequencies of A and T will be set to
(1-)/2.) This option overrides the default behavior of
estimating the background distribution from the data (if
--estimate-trees) or obtaining them from the input model (if
--estimate-rho).
--nrates, -k |
(Optionally use with a discrete-gamma model and --estimate-trees)
Assume the specified number of rate categories, instead of the
number given in the *.mod file. The shape parameter 'alpha' will
be as given in the *.mod file. In the case of the default
two-state HMM, two values can be specified, for the numbers of
rates for the conserved and the nonconserved states, resp.
(State-transition parameters)
--transitions, -t [~],
Fix the transition probabilities of the two-state HMM as
specified, rather than estimating them by maximum likelihood.
Alternatively, if first character of argument is '~', estimate
parameters, but initialize to specified values. The argument
is the probability of transitioning from the conserved to
the non-conserved state, and is the probability of the
reverse transition. The probabilities of self transitions are
thus 1- and 1- and the expected lengths of conserved
and nonconserved elements are 1/ and 1/, respectively.
--target-coverage, -C
(Alternative to --transitions) Constrain transition parameters
such that the expected fraction of sites in conserved elements
is (0 < < 1). This is a *prior* rather than
*posterior* expectation and assumes stationarity of the
state-transition process. Adding this constraint causes the
ratio mu/nu to be fixed at (1-)/. If used with
--expected-length, the transition probabilities will be
completely fixed; otherwise the expected-length parameter
will be estimated by maximum likelihood.
--expected-length, -E [~] {--expected-lengths also allowed,
for backward compatibility}
(For use with --target-coverage, alternative to --transitions)
Set transition probabilities such that the expected length of
a conserved element is . Specifically, the parameter
mu is set to 1/. If preceded by '~', will be
estimated, but will be initialized to the specified value.
(Input/output)
--msa-format, -i PHYLIP|FASTA|MPM|SS|MAF
Alignment file format. Default is to guess format based on
file contents. Note that the msa_view program can be used to
convert between formats.
--viterbi [alternatively --most-conserved], -V
Predict discrete elements using the Viterbi algorithm and
write to specified file. Output is in BED format, unless
has suffix ".gff", in which case output is in GFF.
--score, -s
(Optionally use with --viterbi) Assign a log-odds score to
each prediction.
--lnl, -L
Compute total log likelihood using the forward algorithm and
write to specified file.
--no-post-probs, -n
Suppress output of posterior probabilities. Useful if only
discrete elements or likelihood is of interest.
--log, -g
(Optionally use when estimating free parameters) Write log of
optimization procedure to specified file.
--refidx, -r
Use coordinate frame of specified sequence in output. Default
value is 1, first sequence in alignment; 0 indicates
coordinate frame of entire multiple alignment.
--seqname, -N
(Optionally use with --viterbi) Use specified string
for 'seqname' (GFF) or 'chrom' field in output file. Default
is obtained from input file name (double filename root, e.g.,
"chr22" if input file is "chr22.35.ss").
--idpref, -P
(Optionally use with --viterbi) Use specified string as
prefix of generated ids in output file. Can be used to ensure
ids are unique. Default is obtained from input file name
(single filename root, e.g., "chr22.35" if input file is
"chr22.35.ss").
--quiet, -q
Proceed quietly (without updates to stderr).
--help, -h
Print this help message.
(Indels) [experimental]
--indels, -I
Expand HMM state space to model indels as described in Siepel
& Haussler (2004).
--max-micro-indel, -Y
(Optionally use with --indels) Maximum length of an alignment
gap to be considered a "micro-indel" and therefore
addressed by the indel model. Gaps longer than this threshold
will be treated as missing data. Default value is 20.
--indel-params, -D [~]
(Optionally use with --indels and default two-state HMM) Fix
the indel parameters at (alpha_0, beta_0, tau_0) for the
conserved state and at (alpha_1, beta_1, tau_1) for the
non-conserved state, rather than estimating them by maximum
likelihood. Alternatively, if first character of argument is
'~', estimate parameters, but initialize with specified
values. Alpha_j is the rate of insertion events per
substitution per site in state j (typically ~0.05), beta_j is
the rate of deletion events per substitution per site in state
j (typically ~0.05), and tau_j is approximately the inverse
of the expected indel length in state j (typically 0.2-0.5).
--indels-only, -J
Like --indels but force the use of a single-state HMM. This
option allows the effect of the indel model in isolation to be
observed. Implies --no-post-probs. Use with --lnl.
(Felsenstein/Churchill model) [rarely used]
--FC, -X
(Alternative to --hmm; specify only one *.mod file with this
option) Use an HMM with a state for every rate
category in the given phylogenetic model, and transition
probabilities defined by an autocorrelation parameter lambda
(as described by Felsenstein and Churchill, 1996). A rate
constant for each state (rate category) will be multiplied by
the branch lengths of the phylogenetic model, to create a
"scaled" version of the model for that state. If the
phylogenetic model was estimated using Yang's discrete gamma
method (-k option to phyloFit), then the rate constants will
be defined according to the estimated shape parameter 'alpha',
as described by Yang (1994). Otherwise, a nonparameteric
model of rate variation must have been used (-K option to
phyloFit), and the rate constants will be as defined
(explicitly) in the *.mod file. By default, the parameter
lambda will be estimated by maximum likelihood (see --lambda).
--lambda, -l [~]
(Optionally use with --FC) Fix lambda at the
specified value rather than estimating it by maximum
likelihood. Alternatively, if first character is '~',
estimate but initialize at specified value. Allowable range
is 0-1. With k rate categories, the transition probability
between state i and state j will be lambda * I(i == j) +
(1-lambda)/k, where I is the indicator function. Thus, lambda
= 0 implies no autocorrelation and lambda = 1 implies perfect
autocorrelation.
(Coding potential) [experimental]
--coding-potential, -p
Use parameter settings that cause output to be interpretable
as a coding potential score. By default, a simplified version
of exoniphy's phylo-HMM is used, with a noncoding (background)
state, a conserved non-coding (CNS) state, and states for the
three codon positions. This option implies --catmap "NCATS=4;
CNS 1; CDS 2-4" --hmm --states CDS
--reflect-strand background,CNS and a set of default *.mod
files (all of which can be overridden). This option can be
used with or without --indels.
--extrapolate, -e | default
Extrapolate to a larger set of species based on the given
phylogeny (Newick-format). The trees in the given tree models
(*.mod files) must be subtrees of the larger phylogeny. For
each tree model M, a copy will be created of the larger
phylogeny, then scaled such that the total branch length of
the subtree corresponding to M's tree equals the total branch
length of M's tree; this new version will then be used in
place of M's tree. (Any species name present in this tree but
not in the data will be ignored.) If the string "default"
is given instead of a filename, then a phylogeny for 25
vertebrate species, estimated from sequence data for Target 1
(CFTR) of the NISC Comparative Sequencing Program (Thomas et
al., 2003), will be assumed.
--alias, -A
Alias names in input alignment according to given definition,
e.g., "hg17=human; mm5=mouse; rn3=rat". Useful with default
*.mod files, e.g., with --coding-potential. (Default models
use generic common names such as "human", "mouse", and
"rat". This option allows a mapping to be established
between the leaves of trees in these files and the sequences
of an alignment that uses an alternative naming convention.)
(Custom HMMs) [rarely used]
--hmm, -H
Name of HMM file explicitly defining the probabilities of all
state transitions. States in the file must correspond in
number and order to phylogenetic models in .
Expected file format is as produced by 'hmm_train.'
--catmap, -c |
(Optionally use with --hmm) Mapping of feature types to category
numbers. Can give either a filename or an "inline" description
of a simple category map, e.g., --catmap "NCATS = 3 ; CDS 1-3".
--states, -S
States of interest in the phylo-HMM, specified by number
(indexing starts with 0), or if --catmap, by category name.
Default value is 1. Choosing --states "0,1,2" will cause
output of the sum of the posterior probabilities for states 0,
1, and 2, and/or of regions in which the Viterbi path
coincides with (any of) states 0, 1, or 2 (see --viterbi).
--reflect-strand, -U
(Optionally use with --hmm) Given an HMM describing the
forward strand, create a larger HMM that allows for features
on both strands by "reflecting" the original HMM about the
specified "pivot" states. The new HMM will be used for
prediction on both strands. States can be specified by number
(indexing starts with 0), or if --catmap, by category name.
(Missing data) [rarely used]
--require-informative, -M
Require "informative" columns (i.e., columns with more than
two non-missing-data characters, excluding sequences specified
by --not-informative) in specified HMM states, to help
eliminate false positive predictions. States can be specified
by number (indexing starts with 0) or, if --catmap is used, by
category name. Non-informative columns will be given emission
probabilities of zero. By default, this option is active,
with equal to the set of states of interest for
prediction (as specified by --states). Use "none" to disable
completely.
--not-informative, -F
Do not consider the specified sequences (listed by name) when
deciding whether a column is informative. This option may be
useful when sequences are present that are very close to the
reference sequence and thus do not contribute much in the way
of phylogenetic information. E.g., one might use
"--not-informative chimp" with a human-referenced multiple
alignment including chimp sequence, to avoid false-positive
predictions based only on human/chimp alignments (can be a
problem, e.g., with --coding-potential).
--ignore-missing, -z
(For use when estimating transition probabilities) Ignore
regions of missing data in all sequences but the reference
sequence (excluding sequences specified by --not-informative)
when estimating transition probabilities. Can help avoid
too-low estimates of and or too-high estimates of
. Warning: this option should not be used with
--viterbi because coordinates in output will be
unrecognizable.
REFERENCES:
J. Felsenstein and G. Churchill. 1996. A hidden Markov model
approach to variation among sites in rate of evolution.
Mol. Biol. Evol., 13:93-104.
A. Siepel, G. Bejerano, J. S. Pedersen, et al. 2005.
Evolutionarily conserved elements in vertebrate, insect, worm,
and yeast genomes. Genome Res. (in press)
A. Siepel and D. Haussler. 2004. Computational identification of
evolutionarily conserved exons. Proc. 8th Annual Int'l Conf.
on Research in Computational Biology (RECOMB '04), pp. 177-186.
J. Thomas et al. 2003. Comparative analyses of multi-species
sequences from targeted genomic regions. Nature 424:788-793.
Z. Yang. 1994. Maximum likelihood phylogenetic estimation from
DNA sequences with variable rates over sites: approximate
methods. J. Mol. Evol., 39:306-314.