PhastCons is part of the PHAST (PHylogenetic Analysis with Space/Time models) package, which can be downloaded from the PHAST Website. The latest version of the phastCons help page can be obtained by running the program with the --help option.
PhastCons and the rest of the PHAST package are written in C.
The PHAST package has been compiled and installed under various UNIX
and Linux platforms and under Mac OS X. It should be relatively easy
to port the software to other platforms.
2. GETTING STARTED
In this section, I'll discuss a few basic ways of using the program
and some of its most commonly used options.
As input, PhastCons takes a multiple alignment, a phylogenetic model for conserved regions, and a phylogenetic model for nonconserved regions. The model for conserved regions is optional. If it is not given, then this model is defined indirectly by "scaling" the model for nonconserved regions by a factor called rho (see the --rho option to phastCons). The basic idea of the program is to scan along the alignment for regions that are better "explained" by the conserved model than by the nonconserved model; such regions will be output as conserved elements, and the probability that each base is in such a region will be output as the conservation score for that base.
The program also needs a specification of the state-transition probabilities for the HMM. These probabilities determine how freely the model can switch between states, and as a result, they define important features such as the expected length of conserved elements, the expected coverage by conserved elements, and the "smoothness" of the conservation plot (see the phastCons paper for details). The transition probabilities can be set directly with the --transitions option or with the --expected-length and --target-coverage options (see Section 3). Alternatively, they may be left undefined, and will automatically be estimated by maximum likelihood. Probably the trickiest part of getting good results with phastCons is figuring out how to set these parameters. Let's gloss over this potentially complicated issue for now and focus on the basics of using the program. For the moment, assume the transition probabilities are set by the tuning parameters, say, by using --target-coverage 0.25 and --expected-length 12. We'll return to this issue in Section 3.
The program can be run with a command of the form:
    phastCons [OPTIONS] alignment [cons.mod,]noncons.mod > scores.wig

where alignment is a multiple alignment in an appropriate format (see --msa-format), cons.mod and noncons.mod are phylogenetic models for conserved and nonconserved regions, respectively, in PHAST's .mod format, and scores.wig is a file of conservation scores in the fixed-step WIG format used by the UCSC Genome Browser (http://genome.ucsc.edu/encode/submission.html#WIG).
For example,
    phastCons --target-coverage 0.25 --expected-length 12 \
        --msa-format MAF alignment.maf cons.mod,noncons.mod \
        > scores.wig

or
    phastCons --target-coverage 0.25 --expected-length 12 \
        --rho 0.4 --msa-format MAF alignment.maf noncons.mod \
        > scores.wig

where in the second case the conserved model is defined as a scaled version of the nonconserved model, with scaling factor rho = 0.4. Here an alignment in the Multiple Alignment Format (MAF) used in the UCSC Genome Browser is assumed.
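The transition probabilities can also be fixed directly, as mentioned above. Here is a hedged sketch (not one of the original examples; the mu and nu values are purely illustrative):

    phastCons --transitions 0.05,0.007 --msa-format MAF \
        alignment.maf cons.mod,noncons.mod > scores.wig

This bypasses --target-coverage and --expected-length entirely; see Section 3 for how mu and nu relate to those tuning parameters.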
A set of discrete conserved elements can be predicted by adding the --most-conserved option, e.g.,
    phastCons --target-coverage 0.25 --expected-length 12 --rho 0.4 \
        --most-conserved most-cons.bed --msa-format MAF alignment.maf \
        noncons.mod > scores.wig

which will cause the coordinates of predicted elements to be written to the file "most-cons.bed" in BED format (see http://genome.ucsc.edu/goldenPath/help/customTrack.html#BED). Alternatively, these elements can be written in GFF (see http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml), by giving the output filename a suffix of ".gff".
If you don't care about the conservation scores, you can avoid computing them with the --no-post-probs option. (This can save some time.) If phastCons is run with --no-post-probs and without --most-conserved, then it will neither predict a set of discrete elements nor produce the conservation scores. However, it will still estimate free parameters, if directed to do so (see below).
In summary, the default behavior of the program is to estimate any unspecified free parameters, then to compute conservation scores and output them to stdout. However, it can be made to perform any combination of its three main tasks -- predicting elements, computing scores, and estimating parameters -- by running it with or without --most-conserved, with or without --no-post-probs, and with or without a specification of the values of its free parameters.
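For example, here is a sketch (adapted from the commands above rather than a separate original example) of a run that predicts conserved elements but skips both the scores and parameter estimation:

    phastCons --target-coverage 0.25 --expected-length 12 --rho 0.4 \
        --most-conserved most-cons.bed --no-post-probs \
        --msa-format MAF alignment.maf noncons.mod

In this case nothing is written to stdout; the predicted elements appear in most-cons.bed.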
The phylogenetic models themselves can be obtained in three main ways: (1) let phastCons estimate both the conserved and nonconserved models from the data by maximum likelihood (using --estimate-trees); (2) estimate the nonconserved model from putatively neutral sites, such as fourfold degenerate (4d) sites in codons or ancestral repeats (ARs), and let phastCons estimate only the scale factor rho (using --estimate-rho); or (3) estimate both the conserved and nonconserved models from appropriate annotations using phyloFit, and pass them to phastCons as is. It is worth noting that in experiments we've done with vertebrate data (including human, mouse, rat, chicken, and Fugu sequences), estimating the nonconserved model from 4d sites rather than letting phastCons estimate it did not make too much of a difference in the predicted conserved elements, even though it did result in quite different branch-length estimates (see Supplementary Material to the phastCons paper). The EM algorithm (option 1) seems to be able to learn a separation between conserved and nonconserved regions even if the model it learns for nonconserved regions is not an entirely accurate representation of the neutral substitution process. On the other hand, when the number of species is large (say, greater than a dozen), option 1 can be quite computationally intensive, and options 2 or 3 may be preferable purely for practical reasons (see the ENCODE example, below). As a rule of thumb, I recommend option 1 unless you have (a) more than 12 species or (b) many distant species with low alignment coverage and you are concerned about distortions in branch lengths; in those cases I recommend option 2.
For options 2 or 3, you can estimate your models using the program phyloFit, also in the PHAST package. Start with an alignment and an annotation file indicating the sites of interest in your alignment. For example, suppose you have an alignment of human, mouse, and rat sequences in a file called data.fa (FASTA format), and suppose you have a file called AR.gff (GFF format) defining the locations of ancestral repeats (ARs) in the coordinate system of the first sequence (human). For example,
    human  RepeatMasker  AR  14451829  14451925  .  .  .  type "L1MC4a.LINE.L1"
    human  RepeatMasker  AR  14452228  14452443  .  .  .  type "L1MC4a.LINE.L1.2"
    human  RepeatMasker  AR  14458153  14458278  .  .  .  type "L2.LINE.L2"
    human  RepeatMasker  AR  14458258  14458386  .  .  .  type "L2.LINE.L2.2"
    human  RepeatMasker  AR  14458781  14459349  .  .  .  type "L2.LINE.L2.3"
    human  RepeatMasker  AR  14460129  14460311  .  .  .  type "L1ME.LINE.L1"
    human  RepeatMasker  AR  14460948  14461016  .  .  .  type "L2.LINE.L2.4"
    human  RepeatMasker  AR  14463023  14463356  .  .  .  type "MER89-int.LTR.ERV1"
    ...

You can estimate a phylogenetic model for these sites with a command such as:
phyloFit --tree "(human,(mouse,rat))" --features AR.gff \ --do-cats AR --out-root nonconserved data.faHere the model will be written to a file called nonconserved.AR.mod. See the phyloFit help page (run 'phyloFit --help') for additional details.
With 4d sites, the program msa_view may be helpful. Suppose instead of annotations of ARs, you have a file called genes.gff describing coding regions in the human sequence, e.g.,
    human  UCSC  CDS  4562   4692   .  -  0  transcript_id "NM_198943" ; gene_id "NM_198943"
    human  UCSC  CDS  7469   7605   .  -  0  transcript_id "NM_198943" ; gene_id "NM_198943"
    human  UCSC  CDS  7778   7924   .  -  0  transcript_id "NM_198943" ; gene_id "NM_198943"
    human  UCSC  CDS  8131   8242   .  -  1  transcript_id "NM_198943" ; gene_id "NM_198943"
    human  UCSC  CDS  14601  14749  .  -  0  transcript_id "NM_198943" ; gene_id "NM_198943"
    ...

You can extract 4d sites from this alignment with a command like
    msa_view data.fa --4d --features genes.gff > 4d-codons.ss

This will create a representation, in the "sufficient statistics" (SS) format used by several PHAST tools, of the whole codons containing 4d sites. You can extract just the 4d sites (in the 3rd codon positions) with a command like
    msa_view 4d-codons.ss --in-format SS --out-format SS \
        --tuple-size 1 > 4d-sites.ss

You can now estimate a phylogenetic model with a command like
phyloFit --tree "(human,(mouse,rat))" --msa-format SS \ --out-root nonconserved-4d 4d-sites.ssIf you have several alignments and several (corresponding) GFF files, e.g., for separate human chromosomes, you can extract 4d sites from each one using msa_view, then combine all the 4d sites using msa_view --aggregate, then estimate a phylogenetic model from the pooled data:
    for set in set1 set2 set3 ; do
        msa_view $set.maf --in-format MAF --4d --features $set.gff \
            > $set.codons.ss
        msa_view $set.codons.ss --in-format SS --out-format SS \
            --tuple-size 1 > $set.sites.ss
    done
    msa_view --aggregate human,mouse,rat *.sites.ss > all-4d.sites.ss
    phyloFit --tree "(human,(mouse,rat))" --msa-format SS \
        --out-root nonconserved-all-4d all-4d.sites.ss

(Here we have assumed the input alignments are in MAF format.)
Similar methods could be used to estimate a model for conserved regions, say, from 1st codon positions. This gets slightly more complicated, e.g.:
    msa_view data.maf --in-format MAF --features genes.gff \
        --catmap "NCATS = 3; CDS 1-3" --out-format SS --unordered-ss \
        --reverse-groups transcript_id > codons.ss
    phyloFit --tree "(human,(mouse,rat))" --msa-format SS --do-cats 1 \
        --out-root conserved-codon1 codons.ss

See the msa_view and phyloFit help pages for details.
It should now be clear how option 3 would be accomplished: estimate conserved and nonconserved models using appropriate annotation files and phyloFit, then run phastCons as shown in Section 2.1. Option 2 is similar, but requires the use of the --estimate-rho option, e.g.,
    phastCons --target-coverage 0.25 --expected-length 12 \
        --rho 0.4 --estimate-rho mytrees --msa-format MAF \
        alignment.maf nonconserved-4d.mod > scores.wig

where nonconserved-4d.mod is a nonconserved model estimated from 4d sites. The parameter rho will be initialized to 0.4, then iteratively improved using an EM algorithm. (Note that a careful choice of initialization can improve running time substantially.) After rho has been estimated, the new conserved and nonconserved models will be written to mytrees.cons.mod and mytrees.noncons.mod, respectively, and the program will carry on as before. (Here the new nonconserved model will be identical to nonconserved-4d.mod, but in some cases, e.g., if --gc is used, it will not be.) As given, this command produces conservation scores but not discrete conserved elements; it could be made to produce both types of output, or neither (if only the estimate of rho were of interest).
Option 1 requires the use of --estimate-trees, which is similar to --estimate-rho:
    phastCons --target-coverage 0.25 --expected-length 12 \
        --estimate-trees mytrees --msa-format MAF alignment.maf \
        init.mod > scores.wig

In this case, phastCons will estimate all free parameters of the phylogenetic models by maximum likelihood, and will output the new models to mytrees.cons.mod and mytrees.noncons.mod. The given tree model, init.mod, will simply be used for initialization of the nonconserved model. It can be obtained with phyloFit, either using 4d sites, ARs, or similar (as above), or simply using the entire data set, e.g.,
phyloFit --tree "(human,(mouse,rat))" --msa-format MAF \ --out-root init alignment.mafWith option 2 or option 1, it may make sense to run phastCons twice, once for "training" (parameter estimation) and once for prediction. For example,
    phastCons --target-coverage 0.25 --expected-length 12 \
        --estimate-trees mytrees --msa-format MAF alignment.maf \
        init.mod --no-post-probs

    phastCons --target-coverage 0.25 --expected-length 12 \
        --msa-format MAF alignment.maf --most-conserved most-cons.bed \
        mytrees.cons.mod,mytrees.noncons.mod > scores.wig

The training step is typically much slower, and in some cases, it may be done with a subset of the data, for example, a random sample of 100-200 fragments from genome-wide alignments. Also, with large data sets, it may make sense to train separately on small fragments (possibly in parallel, on a compute cluster), then to combine the parameter estimates, then to do prediction with the combined estimates (see Section 4).
Note that options 1, 2, and 3 can all be used with or without
--target-coverage and --expected-length (see Section 3).
3. STATE-TRANSITION PARAMETERS
Here is what is important to understand in order to use the program: phastCons is based on a two-state hidden Markov model, with one state for conserved regions and one for nonconserved regions. The probability of transitioning from the conserved state to the nonconserved state is called mu, and the probability of transitioning from the nonconserved state to the conserved state is called nu. Rather than setting mu and nu directly (as with --transitions), it is usually more convenient to work with two derived "tuning parameters": the target coverage gamma = nu / (mu + nu), the (prior) expected fraction of sites in conserved elements, set with --target-coverage; and the expected length omega = 1 / mu of a conserved element, set with --expected-length.
The question then becomes what criteria should be used for adjusting the tuning parameters. There are two natural features to focus on. The first, called "coverage," is the fraction of bases in a data set that fall within predicted conserved elements. You might have a reason to aim for a particular level of coverage; for example, with mammalian data, a target of 5% might be selected, since this is considered a reasonable ballpark estimate of the fraction of sites under purifying selection in mammalian genomes. If you don't have a target in mind for the overall level of coverage, you can aim for a particular level of coverage of known functional elements such as protein-coding exons. (This was the method we used in the phastCons paper to calibrate the model in similar ways for vertebrate, insect, worm, and yeast data sets; we chose a target of 65% coverage of coding exons by conserved elements.) The target-coverage parameter (gamma) is a "knob" you can turn to reach a particular level of actual coverage. As you increase this parameter, more bases will be predicted to be in conserved elements, and as you decrease it fewer bases will be predicted to be in conserved elements. (Gamma can be thought of in a Bayesian sense as a prior, which has the effect of imposing a bias on the actual coverage.)
The second feature is called "smoothing." Smoothing describes the degree to which the conservation scores tend to be similar from one base to the next. A high degree of smoothing occurs when the probabilities mu and nu are small -- i.e., when changing states is unlikely -- and a lower degree of smoothing occurs when mu and nu are larger. Higher smoothing means the predicted conserved elements tend to be longer on average, and are more likely to include short sequences of nonconserved bases. Lower smoothing implies shorter, more fragmented conserved elements, containing fewer nonconserved sites. Increasing the expected length parameter (omega) tends to increase the level of smoothing and the average length of predicted conserved elements.
What is the right level of smoothing? This is perhaps a more difficult question than what is the right level of coverage. In many cases, it may be best just to set the smoothing so that the conservation plot "looks right." By plotting the scores next to a multiple alignment, as in the UCSC browser, it is usually not too hard to see if the plot is dramatically "over-smoothed" (in the extreme, it resembles a square wave function, with alternating regions of scores near 0 and scores near 1) or "under-smoothed" (in the extreme, the scores at adjacent sites appear essentially uncorrelated). It may also be useful to look at the shortest predicted conserved elements and see whether they are believable -- are they long enough and conserved enough to reasonably be called "conserved elements"? If not, perhaps the degree of smoothing should be increased. Similarly, one might look at sequences of nonconserved sites within conserved elements. If they are quite long and poorly conserved, yet the elements are not being broken, then perhaps the smoothing should be reduced. One might also look at the breaks between conserved elements (if they are too short, the smoothing should be increased) and sequences of conserved sites that have not been predicted as conserved elements (if they are too long, the smoothing should be decreased).
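For example, one quick way to look at the shortest predicted elements is with standard command-line tools; a minimal sketch, assuming the elements have been written to most-cons.bed as above:

    # print the ten shortest predicted elements, preceded by their lengths
    awk '{print $3 - $2, $0}' most-cons.bed | sort -n | head

The listed regions can then be inspected in a browser to judge whether they deserve to be called conserved elements.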
An alternative, perhaps more principled, way of setting the level of smoothing is to use the "phylogenetic information threshold" (PIT) method described in the phastCons paper. The paper should be consulted for details, but the PIT is essentially a threshold for the amount of "information" (in an information theoretic sense) that is required to predict a conserved element, on average. It is an alternative measure of the degree of smoothing (a higher PIT means more smoothing), which can be meaningfully compared across different data sets and phylogenies. For example, to achieve a given PIT -- say, 10 bits -- a larger expected length might be needed for a two-species data set, for which there is less information about conservation at each site, than for a ten-species data set. (The PIT actually considers the shape of the phylogeny and its branch lengths, not just the number of species.) A program in PHAST called consEntropy can be used to compute the PIT for a given set of parameters. The inputs to the program are the target-coverage and expected-length parameters and the conserved and nonconserved phylogenetic models (see the consEntropy help page and examples below). One can simply run consEntropy after adjusting the tuning parameters and re-estimating any free parameters, and compare the reported PIT to the target value; if necessary, the parameters can be adjusted and the procedure can be repeated (see below). There are cases in which the PIT method will not be appropriate. For example, when an alignment contains a lot of missing data, the PIT will be misleading, because there is actually less information per site than the phylogenetic models would imply. Also, consEntropy will not work with large trees, because, as currently implemented, its running time is exponential in the number of species.
Coverage and smoothing are to some degree separate, with coverage being primarily influenced by the --target-coverage option and smoothing being primarily influenced by the --expected-length option, but they are also interrelated. The final set of predictions depends on all parameters and on the data in a complicated way, and will be affected, sometimes unpredictably, by a change to any parameter. For example, changing the --expected-length option while holding --target-coverage fixed generally will alter coverage somewhat as well as smoothing. For this reason, the --target-coverage and --expected-length parameters should be tuned simultaneously, until constraints on both coverage and smoothing are satisfied.
As a running example, suppose we have a genome-wide human/mouse/rat alignment and, as in the phastCons paper, we aim for about 65% coverage of coding exons by conserved elements and a PIT of roughly 10 bits. We begin with a guess at the values of the tuning parameters. In this case, a starting target coverage of 0.05 might seem appropriate, since our prior belief is that about 5% of bases are conserved, but we have to consider the fact that only about 40% of the human bases align to mouse or rat, and phastCons will not predict conserved elements in regions of no alignment. Therefore, we start instead with a target coverage of 0.05/0.4 = 0.125, the fraction of aligned bases we expect to be conserved. We'll guess at a value of 20 bp for the expected length. The average length of conserved elements is likely to be considerably longer -- probably closer to 100 bp -- but we want to influence the model toward the detection of shorter conserved elements, e.g., possible transcription factor binding sites. One way of thinking about it is that we want the conservation score at each base to reflect an average "window" of about 20 bp centered at that site. Remember, these are just rough starting values; we're going to adjust them as we go.
A starting estimate of the tree models can be obtained with a command such as:
    phastCons --target-coverage 0.125 --expected-length 20 \
        --estimate-trees v1 --most-conserved most-cons.bed \
        --log optim.log --no-post-probs --msa-format MAF \
        alignment.maf init.mod

(The progress of the program can be charted by monitoring the log file, optim.log.) Now we need to see how close we are to our coverage and smoothing targets. We'll need a program to compute the intersection of a set of coding regions and the predicted conserved elements. Jim Kent's featureBits is ideal for this, but other tools could be used as well. Suppose we have a bed file CDS.bed, containing the coordinates of coding exons in the region of interest. A featureBits command such as
    featureBits -enrichment hg17 CDS.bed most-cons.bed

will measure the coverage of the regions in CDS.bed by the regions in most-cons.bed. In this case, featureBits reports
    CDS.bed 0.159%, most-cons.bed 0.233%, both 0.102%, cover 64.25%, enrich 275.73x

indicating coverage of 64.25%, only slightly below our target. (Here "cover" is the fraction of CDS.bed bases that fall within most-cons.bed, i.e., roughly both/CDS.bed = 0.102/0.159, and "enrich" is that fraction divided by the overall fraction of bases in most-cons.bed, 64.25%/0.233%.) To test smoothness, we run consEntropy:
    consEntropy 0.125 20 v1.cons.mod v1.noncons.mod

and obtain the following output
    Transition parameters: gamma=0.125000, omega=20.000000, mu=0.050000, nu=0.007143
    Relative entropy: H=0.295299 bits/site
    Expected min. length: L_min=49.071119 sites
    Expected max. length: L_max=24.350034 sites
    Phylogenetic information threshold: PIT=L_min*H=14.490669 bits

Thus, the PIT is about 14.5 bits, somewhat higher than desired.
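As an aside, the reported mu and nu follow directly from the tuning parameters, assuming the parameterization gamma = nu/(mu + nu) and omega = 1/mu (consistent with the output shown):

    mu = 1/omega = 1/20 = 0.05
    nu = gamma * mu / (1 - gamma) = 0.125 * 0.05 / 0.875 = 0.007143

so adjusting --target-coverage and --expected-length is equivalent to adjusting mu and nu.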
Now we rerun the program with new values for the tuning parameters. We might try reducing the expected length to 12 and leaving the target coverage the same, because we don't know how the new expected length will affect the coverage.
    phastCons --target-coverage 0.125 --expected-length 12 \
        --estimate-trees v2 --most-conserved most-cons.bed \
        --log optim.log --no-post-probs --msa-format MAF \
        alignment.maf v1.noncons.mod

An important point here is that the free parameters (in this case the tree models) are being re-estimated with the new values of the tuning parameters. One might be tempted just to redo the prediction step, with the previously estimated tree models, but I don't recommend this shortcut: you'd be mixing tree models trained under one set of assumptions with a prediction method based on another set of assumptions. Note also that the previous estimate of the nonconserved model has been used for initialization; this can sometimes lead to faster convergence.
This time we measure a coverage of 58.7% and a PIT of 14.3 bits. Apparently, the reduced smoothing (i.e., lower expected length) has resulted in a greater separation between the conserved and nonconserved models: there are fewer nonconserved sites mixed in with the conserved ones, making the conserved model more conserved, and there are fewer conserved sites mixed in with the nonconserved ones, making the nonconserved model less conserved. As a result, the phylogenetic information per base has increased, and despite the reduced smoothing, the PIT has stayed roughly the same. In addition, the coverage has decreased a bit, possibly because fewer nonconserved sites flanked by conserved sites are being pulled into conserved elements. It seems as though we've lost ground, but we just need to be persistent. Let's try reducing the expected length a bit more and cranking up the target coverage to compensate for it:
    phastCons --target-coverage 0.2 --expected-length 8 \
        --estimate-trees v3 --most-conserved most-cons.bed \
        --log optim.log --no-post-probs --msa-format MAF \
        alignment.maf v2.noncons.mod

This time we obtain a coverage of 59.2% and a PIT of 12.6 bits -- closer, but still not quite there. At any rate, you see how it works: we simply iterate this procedure until we're close enough to our target values. Usually this doesn't take more than six or seven iterations.
Tuning the program is a bit of an art: the parameters are all
interrelated, and it takes a little time to get a feeling for what
effect each change will have. As you get better at it, you'll need
fewer iterations to reach convergence.
4. WORKING WITH LARGE DATA SETS
The examples above have all assumed that phastCons is to be run on a
small to moderate-sized data set, consisting of at most, say, a dozen
sequences and a few million sites (alignment columns). Additional
care needs to be taken with larger data sets. The main problem is
parameter estimation. If all parameters are fixed, the running time
of the program is proportional to the product of the number of
species, n, and the number of sites in an alignment, L, with a
reasonably small proportionality constant. This O(nL) algorithm is
practical for use with quite large data sets, given sufficient
RAM. The EM algorithm for parameter estimation, however, must perform
a similar O(nL)-time computation many times and between iterations
must also execute a potentially expensive numerical maximization
routine. Straightforward parameter estimation (as discussed above)
will not be practical with extremely large data sets.
The techniques I recommend are, basically, (1) to "divide and conquer" and make as much use as possible of parallel processing, and (2) to make the estimation problem as easy as possible. In this section, I will discuss two different examples of large data sets, one consisting of a moderate number of species and a large number of sites, and the other consisting of a large number of species and a moderate number of sites. A "divide and conquer" strategy is the key to addressing the first problem and making the estimation problem easier is the key to addressing the second problem. Note that the two strategies I will describe are essentially orthogonal, and could be used in combination with a data set having many sequences *and* many sites.
Here's the recommended "divide and conquer" approach for this case:

1. Split the genome-wide alignments into small fragments (roughly 1 Mb each).
2. Estimate the free parameters separately for each fragment, in parallel if possible.
3. Combine the local parameter estimates by averaging them with phyloBoot.
4. Predict conserved elements and conservation scores for each fragment using the combined (average) models, and merge the results.

Then adjust the tuning parameters as needed and repeat steps 2-4 until the coverage and smoothing targets are met.
Let's discuss the steps one by one.
Step 1: splitting the alignments. This can be accomplished by running a program called msa_split on each MAF file. Use a command like the following:
    mkdir -p CHUNKS                  # put fragments here
    for file in chr*.maf ; do
        root=`basename $file .maf`
        msa_split $file --in-format MAF --refseq $root.fa \
            --windows 1000000,0 --out-root CHUNKS/$root --out-format SS \
            --min-informative 1000 --between-blocks 5000
    done

This will cause each MAF file ($root.maf) and the corresponding reference sequence ($root.fa) to be read in, assembled in memory into a kind of pseudo-global alignment (with local alignments tiled beneath the human sequence), and then split into non-overlapping chunks of about 1,000,000 sites each; these chunks will be written in SS format to files with a prefix of $root in the CHUNKS directory. The --min-informative option causes fragments with fewer than 1000 "informative" sites (i.e., with at least two aligned bases) to be discarded, and the --between-blocks option tells the program to try to split between alignment blocks if possible, e.g., in recent transposon insertions in the human genome. (It will adjust the split point by at most 5000 sites.) The file $root.fa must be in FASTA format and must use the same coordinate system as the multiple alignment, as will be true if the FASTA files and MAFs from the same version (assembly) of the UCSC Genome Browser are used. See the msa_split help page for additional details.
If desired, this step can be done in parallel, with one process per MAF file. Beware of the fact that each process will be quite I/O intensive.
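For example, a minimal sketch (not from the original document) for a single multi-core machine is to background each msa_split job; keep the number of simultaneous jobs modest, since each is I/O intensive:

    for file in chr*.maf ; do
        root=`basename $file .maf`
        msa_split $file --in-format MAF --refseq $root.fa \
            --windows 1000000,0 --out-root CHUNKS/$root --out-format SS \
            --min-informative 1000 --between-blocks 5000 &    # run in background
    done
    wait    # wait for all background jobs to finish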
Step 2: parameter estimation. In this step, we will estimate free parameters using phastCons for each alignment fragment. Because the number of species is relatively small, we can get away with full estimation of the conserved and nonconserved models (option 1 from Section 2.2). Suppose we already have a starting model, init.mod. We might use a command like:
    mkdir -p TREES                   # put estimated tree models here
    rm -f TREES/*                    # in case old versions left over
    for file in chr*.*.ss ; do
        root=`basename $file .ss`
        phastCons --target-coverage 0.125 --expected-length 20 \
            --gc 0.4 --estimate-trees TREES/$root \
            $file init.mod --no-post-probs
    done

Here, the --gc option is used to ensure that the same G+C content is assumed for all fragments, since the same G+C content will be assumed during prediction. Note that the --log and --quiet options may be useful here as well.
If you have a compute cluster at your disposal, you will probably want to execute these individual processes in parallel.
Step 3: combining the estimates. The local estimates can now be averaged with phyloBoot:

    ls TREES/*.cons.mod > cons.txt
    phyloBoot --read-mods '*cons.txt' --output-average ave.cons.mod
    ls TREES/*.noncons.mod > noncons.txt
    phyloBoot --read-mods '*noncons.txt' --output-average ave.noncons.mod

phyloBoot will output a summary of the means, standard deviations, medians, etc. of each parameter. You may want to scan this summary to convince yourself that the standard deviations are not too large and that no outliers have severely skewed the means. The means and medians should be roughly equal and the standard deviations should be fairly small compared to the means. (Some larger standard deviations for branches near the root of the tree are normal.) There are sometimes a few zeroes in the 'min' column, especially for short branches; this is okay, but beware of very large values in the 'max' column, which could result from a failure to converge properly, and might skew the mean. If you see any bad outliers, you should re-estimate with different starting parameters or track down and discard the responsible .mod files. I've never had to do this; the estimates generally are well behaved.
Step 4: prediction. This is simply a matter of running phastCons on each fragment, using the averaged tree models.
    mkdir -p ELEMENTS SCORES
    rm -f ELEMENTS/* SCORES/*
    for file in chr*.*.ss ; do
        root=`basename $file .ss`
        phastCons --target-coverage 0.125 --expected-length 20 \
            --most-conserved ELEMENTS/$root.bed --score \
            $file ave.cons.mod,ave.noncons.mod > SCORES/$root.wig
    done

Note that the same --target-coverage and --expected-length arguments are used as in step 2, that this time we have asked the program to predict both discrete elements and base-by-base conservation scores, and that the tree models from step 3 have been used. The --score option tells phastCons to compute log-odds scores for predicted elements. The --gc option is no longer necessary (ave.cons.mod and ave.noncons.mod contain the background distribution), but the --seqname and --idpref options could be useful in this step. (The default names and ids are obtained from the filenames, and in this case should be okay.)
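For example, here is a hedged sketch of using those two options (it assumes, as the chr*.*.ss pattern suggests, fragment files named like chr1.1000000-2000000.ss, so that ${root%%.*} recovers the chromosome name):

    phastCons --target-coverage 0.125 --expected-length 20 \
        --most-conserved ELEMENTS/$root.bed --score \
        --seqname ${root%%.*} --idpref $root \
        $file ave.cons.mod,ave.noncons.mod > SCORES/$root.wig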
The *.bed and *.wig files can now be combined across the whole genome or across each chromosome, if desired, e.g.,
    cat ELEMENTS/*.bed | sort -k1,1 -k2,2n > most-conserved.bed
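The wiggle files can be handled similarly; here is a sketch, assuming msa_split's <root>.<start>-<end> fragment naming so that a numeric sort on the start coordinate restores genomic order:

    for chr in chr1 chr2 ; do        # list your chromosomes here
        ls SCORES/$chr.*.wig | sort -t. -k2,2n | xargs cat > $chr.wig
    done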
Suppose our targets are 65% coverage of coding exons and a PIT of 10 bits. At UCSC, we would test the first constraint using the featureBits program, e.g.,
    featureBits -enrichment hg17 knownGene:cds most-conserved.bed

The second constraint could be tested with consEntropy, e.g.,
    consEntropy .125 20 ave.cons.mod ave.noncons.mod

As in Section 3, we adjust the tuning parameters as needed, only now we repeat steps 2, 3, and 4 after each adjustment, until convergence.
To speed up training, the parameter estimation of step 2 can be done with a random sample of the fragments (see Section 2), e.g.,

    ls chr*.*.ss | chooseLines -k 100 - > random-sample.txt

where chooseLines is a utility in PHAST that will randomly select k lines from a given input file or stream. Now each time you estimate the parameters, use a command such as
    mkdir -p TREES                   # put estimated tree models here
    rm -f TREES/*
    for file in `cat random-sample.txt` ; do
        root=`basename $file .ss`
        phastCons --target-coverage 0.125 --expected-length 20 \
            --gc 0.4 --estimate-trees TREES/$root \
            $file init.mod --no-post-probs
    done

Another thing to note is that one could make predictions based on locally estimated parameters rather than global averages, if desired, by skipping the averaging step (step 3), and substituting the following for the call to phastCons in step 4.
    phastCons --target-coverage 0.125 --expected-length 20 \
        --most-conserved ELEMENTS/$root.bed --score \
        $file TREES/$root.cons.mod,TREES/$root.noncons.mod \
        > SCORES/$root.wig

(Actually, in this case, steps 3 and 4 could be combined if desired.) In some ways this might be preferable to the approach described above -- e.g., predicted elements and conservation scores would reflect local deviations from the average G+C content or neutral substitution rate. However, it also potentially makes the program's output for different regions of the genome more difficult to compare. For this reason, I prefer the use of global averages. This issue is discussed in the Supplementary Material to the phastCons paper.
Now consider the second case: a data set with many species but a moderate number of sites, such as a multi-species alignment of the ENCODE regions. In this case, I recommend making the estimation problem easier. The best way to do that here is probably to estimate the nonconserved model from fourfold degenerate sites, as described in Section 2.2, and then to let phastCons estimate only the parameter rho (option 2 from Section 2.2). We'll work with an aggregate alignment constructed from the original alignments for the purposes of parameter estimation. Suppose we start with a set of FASTA-formatted files named for ENCODE regions, ENm001.fa, ..., ENr334.fa. We can concatenate these into a single file using msa_view with a command like:
    msa_view --aggregate {species-list} EN*.fa > all-encode.fa

where {species-list} is a comma-separated list of all sequence names in the alignments, e.g., "human,chimp,baboon,macaque,...". It will also save some time during re-estimation to convert this file to an ordered SS file:
    msa_view all-encode.fa --out-format SS > all-encode.ss

Note that this aggregate alignment contains some adjacent pairs of columns that are not really adjacent, i.e., at the 43 places where two ENCODE regions have arbitrarily been joined together. The number of such columns is very small compared to the overall size of the data set, however, and their impact on the parameter estimates should be negligible.
For parameter estimation, we can use a command like:
    phastCons --target-coverage 0.05 --expected-length 20 \
        --gc 0.4 --most-conserved all-encode.bed --no-post-probs \
        --estimate-rho newtree all-encode.ss nonconserved-4d.mod

where nonconserved-4d.mod is the previously estimated model for 4d sites and the --gc option has been used to force the background distribution to be closer to the genome-wide average than the G+C-rich distribution estimated from 4d sites. Let's suppose in this case we are aiming for a total coverage by conserved elements of 5% of all bases in the reference (human) sequence. (Our initial guess for --target-coverage is 0.05 because in this case, essentially the entire human sequence is covered by alignments with other species.) We can easily measure the coverage of all-encode.bed with a command such as:
    awk '{total += $3 - $2} END {print total/29909540}' all-encode.bed

where 29,909,540 is the total number of bases in the data set in the coordinate system of the human sequence. (Note that phastCons will automatically report the conserved elements in this coordinate system.)
The consEntropy program cannot be used in this case to test the smoothness constraint, because of the large number of sequences. Instead, I recommend plotting the conservation scores and judging the smoothness by eye, as discussed in Section 3.2.
When the coverage and smoothness constraints have been met, predictions for the individual targets can be obtained with commands such as:
    for reg in EN*.fa ; do
        root=`basename $reg .fa`
        phastCons --target-coverage <gamma> --expected-length <omega> \
            --most-conserved $root.bed $reg \
            newtree.cons.mod,newtree.noncons.mod > $root.wig
    done

where <gamma> and <omega> are the final (tuned) values of the target-coverage and expected-length parameters.
5. OTHER FEATURES

This section will cover the use of the Felsenstein/Churchill model, the indel models, the discrete gamma model for rate variation, the coding-potential feature, and other obscure and experimental features.
For example, the Felsenstein/Churchill (FC) model can be used with a discrete gamma distribution of rates; here --nrates 10 requests ten rate categories and --states 0,1 requests posterior probabilities for the two lowest (most conserved) rate categories:

    phastCons mydata.ss rev-dg.mod --FC --nrates 10 \
        --states 0,1 > cons.wig
The autocorrelation parameter lambda of the FC model can also be fixed at a given value rather than estimated, e.g.,

    phastCons mydata.ss rev-dg.mod --FC --nrates 10 \
        --states 0,1 --lambda 0.9 > cons.wig
A custom HMM can also be supplied. For example, the following uses a simple four-state HMM (a noncoding state plus one state per codon position) and corresponding phylogenetic models to produce a coding-potential score; --reflect-strand 0 makes the model symmetric with respect to strand around the noncoding state, and --states 1,2,3 reports the total posterior probability of the three coding states:

    phastCons mydata.ss noncoding.mod,codon1.mod,codon2.mod,codon3.mod \
        --hmm simple-4state.hmm --reflect-strand 0 --states 1,2,3 \
        > coding-potential.wig
There is also a built-in --coding-potential option, which uses default models and HMM settings (designed for alignments such as the human/mouse/rat alignment used here):

    phastCons --coding-potential human-mouse-rat.ss > cp.wig
6. REFERENCES

E. H. Margulies, M. Blanchette, NISC Comparative Sequencing Program, D. Haussler, and E. D. Green, 2003. Identification and characterization of multi-species conserved sequences. Genome Res, 13:2507-2518.
C. Mayor, M. Brudno, J. R. Schwartz, et al., 2000. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16:1046-1047.
A. Siepel, G. Bejerano, J. S. Pedersen, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res (in press).
Z. Yang, 1995. A space-time process model for the evolution of DNA sequences. Genetics, 139:993-1005.
7. FEEDBACK
If this document doesn't include information you were looking for, or
if you found parts of it (or all of it!) confusing, please send me
feedback (phasthelp@cshl.edu) so I
can improve it. You can even let me know if you found it helpful :)
8. CREDITS
Thanks to several patient users for their feedback on early versions
of the software and documentation, especially Kate Rosenbloom, Hiram
Clawson, Angie Hinrichs, Elliott Margulies, Daryl Thomas, David King,
and James Taylor.