PROGRAM:      phastMotif

DESCRIPTION:  Predicts motifs from a set of multiple alignments.  Uses
              an EM algorithm similar to that of MEME, but a motif is
              defined by phylogenetic models rather than multinomial
              distributions.  The specified multiple alignments may
              actually be single sequences (see -m).  Various parameters
              control the strategy for initialization (see below).
              Currently, the F81 substitution model is assumed.

USAGE:        phastMotif [-t <treefile>] [OPTIONS] <msa_list>

OPTIONS:
    -t <file> (Required unless -m or -p is given) Use the specified tree
              topology for all phylogenetic models (Newick format).

    -i <fmt>  Input format for alignment.  May be FASTA, PHYLIP, MPM, SS,
              or MAF (default FASTA).

    -b <file> Read background model from specified file (.mod format).
              By default, the background model is estimated
              in a preprocessing step, by pooling all data.

    -s        Estimate a separate background model for each multiple alignment.
              (Not yet implemented.)

    -k <size> Learn motifs of the specified size (default is 10).

    -B <n>    Report best <n> motifs (default 3).

    -m        MEME mode.  Use multinomial rather than phylogenetic
              models.  Causes multiple alignments to be ignored -- any
              gaps are discarded and all sequences are assumed
              independent.

    -d <+lst> Use the discriminative training method of Segal et
              al. (RECOMB'02), rather than EM.  The specified list
              should contain the filenames from msa_list that are to
              be considered *positive* examples (containing the
              desired motif); all others will be considered negative
              examples.  Can be used with or without -m.

    -p        Use "profile" models rather than phylogenetic models
              (characters in each alignment column assumed
              independent).  The resulting model is a hybrid of the
              full model and MEME's model.  Essentially, it uses the
              multiple alignments but not the phylogeny.  NOT YET IMPLEMENTED.

    -n <n>    Perform <n> random restarts and report the motif with highest
              likelihood.  Default number is 10.  Ignored with -I, -P, and
              -R unless -S is specified (see below).

    -I <mlst> Run the algorithm after a "soft" initialization with
              each of the consensus sequences in the specified list.
              At each position, <pc> pseudocounts (see -c) are given
              to the consensus base and 1 pseudocount to all other
              bases.  Each string must have length at most equal to
              the size of the motif.  If shorter, it is used as a
              "seed" for a motif, with flanking positions treated as
              wildcards.
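
              The pseudocount scheme described for -I can be sketched in
              Python as below.  This is an illustration only, not
              phastMotif's own code.  The function name, the `offset`
              argument (where a short seed sits inside the motif), and the
              treatment of wildcard positions as 1 pseudocount per base
              are all assumptions made for the example.

```python
BASES = "ACGT"

def soft_init(consensus, motif_size, pc=5.0, offset=0):
    """Return one base-probability row per motif position.

    Assumptions (not taken from phastMotif's source): the seed starts at
    `offset`, and wildcard positions get 1 pseudocount for every base.
    """
    if offset < 0 or len(consensus) + offset > motif_size:
        raise ValueError("consensus does not fit inside the motif")
    rows = []
    for pos in range(motif_size):
        i = pos - offset
        # Positions outside the seed have no consensus base (wildcard).
        base = consensus[i] if 0 <= i < len(consensus) else None
        # <pc> pseudocounts for the consensus base, 1 for every other base.
        counts = [pc if b == base else 1.0 for b in BASES]
        total = sum(counts)
        rows.append([c / total for c in counts])
    return rows

# A 3-base seed inside a motif of size 5, with the default of 5 pseudocounts.
rows = soft_init("ACG", motif_size=5, pc=5, offset=1)
```

              With 5 pseudocounts, the consensus base at each seeded
              position receives probability 5/8 and every other base 1/8;
              wildcard positions are uniform.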

    -P <x,y>  Initialize with the x most prevalent y-tuples.  A soft
              initialization is performed, as above.  If y is less
              than the motif size, y-tuples are used as a "seed" for
              a motif, as above.
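
              As a rough illustration of what "the x most prevalent
              y-tuples" means, the Python sketch below counts overlapping
              y-tuples across a set of sequences.  It is not phastMotif's
              implementation: the function name is hypothetical, and how
              the program itself handles gaps, ambiguous characters, or
              multiple species is not stated here, so skipping any tuple
              containing a non-ACGT character is an assumption.

```python
from collections import Counter

def most_prevalent_tuples(seqs, x, y):
    """Return the x most frequent y-tuples found in the given sequences."""
    counts = Counter()
    for seq in seqs:
        s = seq.upper()
        for i in range(len(s) - y + 1):
            word = s[i:i + y]
            # Assumption: skip tuples containing gaps or ambiguous bases.
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return [word for word, _ in counts.most_common(x)]

# Toy input: the single most prevalent 3-tuple in two short sequences.
top = most_prevalent_tuples(["ACGTACGT", "TTACGA"], x=1, y=3)
```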

    -R <x,y>  Initialize with a random sample of x y-tuples.  A soft
              initialization is performed, as above.  If y is less
              than the motif size, y-tuples are used as a "seed" for
              a motif, as above.

    -w <n>    (for use with -I, -P, -R) Winnow initialization sequences
              to the top <n> based on the unmaximized likelihood.

    -c <pc>   (for use with -I, -P, -R) Number of pseudocounts for
              consensus bases (default 5).

    -S        (for use with -I, -P, -R) Instead of doing a deterministic
              initialization based on a consensus sequence, sample
              parameters from a Dirichlet distribution defined by the
              pseudocounts (see -c).  In this case, random restarts
              are performed, as specified by -n.
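
              The idea behind -S, drawing starting parameters from a
              Dirichlet distribution whose parameters are the pseudocounts,
              can be sketched in Python as follows.  This is a minimal
              sketch under stated assumptions, not the program's code: the
              function name is hypothetical, and the draw uses the standard
              construction of normalizing independent gamma variates.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Pseudocounts for one position whose consensus base is A (order A, C, G, T),
# using the default of 5 for the consensus base and 1 for the others.
pseudocounts = [5.0, 1.0, 1.0, 1.0]

rng = random.Random(0)  # fixed seed only so the example is repeatable
samples = [sample_dirichlet(pseudocounts, rng) for _ in range(2000)]
mean_a = sum(s[0] for s in samples) / len(samples)
```

              Each draw is a valid base distribution.  Averaged over many
              draws, the consensus base's probability approaches 5/8, the
              value the deterministic initialization would use, but each
              restart begins from a different point.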

    -o <pref> Use the specified prefix for all output files (default
              "phastm").

    -H        Produce HTML-formatted output, in addition to ordinary output.
              One file is produced per predicted motif, as well as a
              single HTML-formatted summary file.

    -D        Produce a BED file with predicted motifs, for use in the
              UCSC browser.  Currently, sequence names must be
              formatted like "chr10:102553847-102554897+", with
              the final '+' or '-' indicating strand.
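
              To check sequence names against the format required by -D
              before running the program, a small Python helper along
              these lines may be useful.  The function name and the regular
              expression are illustrative assumptions; the expression simply
              encodes the pattern shown in the example name, and the
              coordinates are returned as written, with no assumption
              about how the program interprets them.

```python
import re

# chrom ":" start "-" end, followed by a single strand character.
NAME_RE = re.compile(
    r"^(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)(?P<strand>[+-])$"
)

def parse_seq_name(name):
    """Split a name like 'chr10:102553847-102554897+' into its parts."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError("name does not match chrom:start-end[+-]: " + name)
    return (m.group("chrom"), int(m.group("start")),
            int(m.group("end")), m.group("strand"))

parts = parse_seq_name("chr10:102553847-102554897+")
```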

    -x        (For use with -H or -D) Suppress ordinary output to stdout.

    -h        Print this help message.