PROGRAM: dmsample (Detection of Lineage Specific Motifs by Gibbs Sampling)

USAGE: dmsample [OPTIONS] alignments.lst neutral.mod motif.pfm priors.pseudo \
    > out.gff

alignments.lst describes the multiple alignment files and provides a list of
file names to be read and an optional column of corresponding indel histories.
See example.lst for a formatting example.

priors.pseudo is a tab-delimited text file containing pseudocounts to be added
to transition counts when estimating each transition parameter. These must be
integers. See example.pseudo for a formatting example.

DESCRIPTION: 

Predict regulatory motif turnover using a Gibbs sampling framework.

EXAMPLES:

OPTIONS:

    --rho, -R <rho>
        Scaling factor applied to branch lengths in motif states (see
        --mot-mod-type). (default 0.3)

    --fix-params, -f <mu,nu,phi,zeta[,xi]>
        Fix parameters at the user-supplied levels (or defaults, if none are
        given). Parameter values are supplied as a comma-delimited string in
        the order specified. If four values are supplied, xi will not be set
        (for example, when xi is disabled with --xi-off). To use default
        values for all params, supply one (arbitrary) value.
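
        The argument format described above can be sketched in Python as
        follows. This is only an illustration, not dmsample's own parser: the
        function name parse_fix_params is hypothetical, and rejecting any
        count other than 1, 4, or 5 values is an assumption.

```python
def parse_fix_params(arg):
    """Parse a --fix-params style string (illustrative sketch only).

    Returns a dict of fixed values, or None to signal "use defaults".
    """
    names = ["mu", "nu", "phi", "zeta", "xi"]
    values = arg.split(",")
    if len(values) == 1:
        # A single (arbitrary) value requests defaults for all params.
        return None
    if len(values) not in (4, 5):
        # Assumption: any other count is treated as an error.
        raise ValueError("expected 1, 4, or 5 comma-delimited values")
    # With four values, zip stops early and xi is left unset.
    return dict(zip(names, map(float, values)))


print(parse_fix_params("0.1,0.2,0.3,0.4"))
```

        With four values the returned dict has no "xi" entry, mirroring the
        case where xi is not set.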

    --burn-in-samples, -b <N>
        Number of samples for model burn-in. Defaults to 5,000.

    --samples, -s <N>
        Number of sampling iterations. Defaults to 100,000.

    --refidx, -r <refseq_idx>
        Use coordinate frame of specified sequence in output.  Default
        value is 1, first sequence in alignment; 0 indicates
        coordinate frame of entire multiple alignment.
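
        To illustrate the difference between the two coordinate frames, the
        Python sketch below maps an alignment column to a position in one
        sequence. The function name column_to_seq_coord, the "-" gap
        character, and 1-based numbering are assumptions for illustration and
        do not describe dmsample internals.

```python
def column_to_seq_coord(aligned_seq, column):
    """Map a 1-based alignment column to a 1-based sequence coordinate.

    Returns None if the sequence has a gap at that column.
    """
    if aligned_seq[column - 1] == "-":
        return None
    # Count the non-gap characters up to and including this column.
    return sum(1 for ch in aligned_seq[:column] if ch != "-")


ref = "AC--GT"
print(column_to_seq_coord(ref, 5))  # alignment column 5 -> 3rd base of ref
```

        In the frame of the entire alignment (refidx 0), the column number
        itself would be reported instead.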

    --revcomp, -C
        Produce reverse-complement sequences for all input MSAs, which, in
        effect, causes both strands to be searched for motifs. This is
        equivalent to including reverse-complement states in the HMM, but
        potentially less computationally and memory intensive.

	WARNING: Currently not compatible with indel histories!
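
        As a minimal sketch of what reverse-complementing an alignment means,
        assuming rows are plain strings over A/C/G/T with "-" for gaps (the
        function name revcomp_alignment is hypothetical and is not part of
        dmsample):

```python
# Translation table for the four bases; any other character (gaps, N)
# is left unchanged by str.translate.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp_alignment(rows):
    """Reverse-complement every row so that columns stay aligned."""
    return [row.translate(COMPLEMENT)[::-1] for row in rows]


print(revcomp_alignment(["AC-GT", "ACNGA"]))
```

        Because every row is reversed in the same way, alignment columns
        remain intact; only their order and strand change.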

    --threads, -t <N>
	Number of concurrent threads to launch during emissions and sampling
	computations. Default value is 0 (multithreading not used). This is
	best set to the number of cores per processor on your machine.

    --seqname, -N <name>
        Use specified string for 'seqname' (GFF) or 'chrom' field in
        output file.  Default is obtained from input file name (double
        filename root, e.g., "chr22" if input file is "chr22.35.ss").
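
        The "double filename root" used here, and the "single filename root"
        used as the --idpref default, can be sketched in Python as follows.
        The helper name filename_roots is hypothetical, and stripping the
        directory part first is an assumption.

```python
import os


def filename_roots(path):
    """Return (double_root, single_root) for an input file name."""
    base = os.path.basename(path)
    single = base.rsplit(".", 1)[0]    # strip one extension
    double = single.rsplit(".", 1)[0]  # strip a second extension
    return double, single


print(filename_roots("chr22.35.ss"))
```

        For "chr22.35.ss" this yields "chr22" as the double root and
        "chr22.35" as the single root.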

    --idpref, -P <name>
        Use specified string as prefix of generated ids in output
        file.  Can be used to ensure ids are unique.  Default is
        obtained from input file name (single filename root, e.g.,
        "chr22.35" if input file is "chr22.35.ss").

    # The following two options are for conditioning on either site presence
    # in a given species or presence of substitutions for gain/loss
    # predictions. Both require a file specified in the alignments list that
    # describes states and positions to zero out emissions where certain types
    # of predictions are incompatible with the conditioning. See dmcondition
    # for details.

    --cond-on-species, -x <species-condition filename>
        Condition gain and loss calls in the HMM on presence of a site in a
	given species. This is useful if ChIP-based regions are used as inputs
	for motif finding. The effect is to prohibit motif loss predictions on
	branches leading to a species believed to contain a binding site based
	on ChIP (or other) evidence and prohibit gain predictions on branches
	leading away from that species.

    --cond-on-subs, -X
	Condition gain and loss predictions on presence of at least one
	substitution within a window believed to represent a binding site. This
	prevents gain/loss predictions on branches that contain no observable
	substitutions in the multiple alignment. The underlying algorithm uses
	Fitch parsimony to partition the tree into branches that do and do not
	contain substitutions for each motif window in the dataset and zeroes
	out the chain of gain and loss states in the emissions matrix 
	corresponding to combinations of branch and motif position that do not
	contain any substitutions to support such a prediction.

	If used in conjunction with --cond-on-species, states that are already
	zeroed out because they are incompatible with site presence in a
	given species will be skipped in this step.
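
        The partitioning idea can be sketched in Python with a small Fitch
        parsimony pass. This is not dmsample's implementation: the nested
        tuple tree, the per-column dicts, the function names, and the
        arbitrary tie-breaking with min() are all assumptions made for
        illustration. Leaf names are assumed to be unique.

```python
def fitch_sets(tree, column):
    """Bottom-up Fitch pass: compute the state set for every node."""
    sets = {}

    def visit(node):
        if isinstance(node, str):          # leaf: the observed base
            sets[node] = {column[node]}
        else:
            left, right = visit(node[0]), visit(node[1])
            common = left & right
            sets[node] = common if common else left | right
        return sets[node]

    visit(tree)
    return sets


def branches_with_subs(tree, window):
    """Return nodes whose incoming branch carries a substitution in at
    least one column of the window."""
    changed = set()
    for column in window:
        sets = fitch_sets(tree, column)

        def assign(node, parent_state):
            # Keep the parent's state when the Fitch set allows it,
            # otherwise pick one state (arbitrary tie-break).
            if parent_state in sets[node]:
                state = parent_state
            else:
                state = min(sets[node])
            if parent_state is not None and state != parent_state:
                changed.add(node)
            if not isinstance(node, str):
                assign(node[0], state)
                assign(node[1], state)

        assign(tree, None)
    return changed


tree = (("human", "chimp"), "mouse")
window = [
    {"human": "A", "chimp": "A", "mouse": "A"},
    {"human": "A", "chimp": "G", "mouse": "A"},
]
print(branches_with_subs(tree, window))
```

        Branches absent from the returned set have no inferred substitution
        in the window, so gain and loss states on those branches are the
        ones that would be zeroed out under this kind of conditioning.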

    --mot-mod-type, -S F81|HB
	Set the substitution model type used for motif states. Default is F81.
	With HB, the Halpern-Bruno model (1998. MBE. 15(7):910-917) will be
	used on motif branches, with site-specific selective pressure modeled
	through the rate matrix and equilibrium frequencies; branch lengths are
	not scaled by rho on motif branches. With F81, motif branches are
	explicitly scaled by rho and only equilibrium frequencies will be used
	in the substitution model for these branches. All branches are scaled 
	equally, by rho, regardless of constraint implied by the motif weight
	matrix, which implies equal constraint at all motif positions.

    --scale-by-branch, -B
	Scale transition probabilities for lineage-specific states by the
	branch length leading to the node predicted to contain the gain or
	loss event. Default is to treat all branches as if they were of equal
	length when assigning transition probabilities.

    --xi-off, -M
	Toggle between default and alternate parameterization for entry into
	motif states. Default parameterization uses zeta for transitions into
	conserved and lineage-specific motifs and xi for entry into 
	nonconserved motif states. Alternate parameterization uses a single
	zeta parameter for both types of transitions.

    --indel-model, -I alpha,beta,tau,epsilon[,alpha2,beta2,tau2,epsilon2]
        Use a simple model of insertions and deletions that assumes a known
        indel history and at most one indel per branch of the tree at any
        given position.  The parameters alpha and beta are rates of
        insertion and deletion, respectively, per expected substitution per
        site, and the parameter tau is approximately the inverse of the
        expected indel length (see indelFit). Epsilon is the rate parameter for
        indels within motifs, and is expected to be very low. If two sets of
        parameters are given, the first will be used for nonconserved regions
        and the second for conserved regions.

    --nc-mot-indel-mode, -j
	Specify the set of indel params to use in nonconserved motif states.
	Default is motif params. Use of this option toggles to background
	params in these states.

    --log, -l <logfile>
	Keep a log of sampling output (transition counts, param estimates and
	total log likelihood of sequences).

    --reference-gff, -g <reference gff file>
	Load a reference gff file of known positive motif occurrences in the
	input sequences. This option does nothing on its own, but can be used
	in conjunction with --log and/or --ref-as-prior in order to change
	the behavior of dmsample. When used in conjunction with --log, counts
	of known motifs found (and not found) will be output to the log file 
	along with the usual log contents. Motifs that are not in the reference
	set will be reported as false positives in the log, but this is only 
	meaningful when all motif locations are known (e.g., with data from 
        dmsimulate). See below for behavior with --ref-as-prior. Note that
        sequence names in the reference GFF must match the sequence names of
        the input alignments, or features will not be matched!

    --ref-as-prior, -u
	WARNING: NOT CURRENTLY FUNCTIONAL!!!

	Use motifs in the reference set as prior knowledge. At every iteration,
	the lists of known and predicted features will be compared and the
	transition counts will be adjusted to ensure all known motifs are used
	in parameter estimation and included in the sampling output. Because
	two motifs cannot overlap in a single path, if a motif is predicted in 
	an identical or overlapping position in the current sample, the sampled
	feature will be given priority to ensure complete exploration of the
	posterior distribution. Requires --reference-gff!
	
    --force_priors, -p
	WARNING: NOT CURRENTLY FUNCTIONAL!!!

	When comparing predicted and reference paths, give priority to the
	reference path when choosing the position and mode of selection for
	features in identical or overlapping positions. Requires
	--reference-gff and implies --ref-as-prior. WARNING: This option may
	cause incomplete exploration of the posterior space. As a result, the
	reported posterior probabilities may be inaccurate.

    --dump-hash, -D <filename>
        Run sampling for the period specified and write the hash of raw counts
        to a file before terminating the run. Results will not be postprocessed
        and no GFF output will be produced. This is mainly useful for debugging
        and for running parallel chains on large datasets, which results in
        several separate hashes that can be flattened using dmsProcessParallel
        to produce the final output.

    --precomputed-hash, -d <filename>
	Read sampling data directly from a file (most likely produced by a
	previous dmsample run with the --dump-hash option) and produce the
	output data from the stored data. No additional sampling will be done.

    --recover, -T <list file name>
	Recover an aborted run from cached temp files. Can also be used to
	assemble hashes from multiple parallel sampling runs into a set of
	results. Takes a list file as an argument with a single header line:
	# NFILES = N, where N is the number of files to read in. Each file
        name goes on its own line, and blank lines are ignored.
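
        The list file layout described above can be sketched with a small
        Python reader. This is an illustration only: the function name
        read_recover_list and the file names cache1.tmp and cache2.tmp are
        hypothetical, and checking that the count matches the header is an
        assumption about how such a file might be validated.

```python
def read_recover_list(text):
    """Parse a list file with a '# NFILES = N' header line."""
    # Blank lines are ignored everywhere in the file.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("#") or "NFILES" not in lines[0]:
        raise ValueError("missing '# NFILES = N' header")
    n_files = int(lines[0].split("=", 1)[1])
    names = lines[1:]
    if len(names) != n_files:
        # Assumption: a mismatch with the header is treated as an error.
        raise ValueError("header count does not match number of file names")
    return names


example = "# NFILES = 2\n\ncache1.tmp\n\ncache2.tmp\n"
print(read_recover_list(example))
```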

    --cache-int, -i <N>
        Number of samples to run before caching the global motifs hash to disk.
	Setting this lower will decrease memory usage, but at the expense of
	increased I/O and reallocation overhead. Default is 200 samples.

    --cache-fname, -c <filename>
	Temp-file name for hash caching. Default name is based on machine time
	on the local system.

    --help, -h
        Show this help message and exit.