PROGRAM: dmsample (Detection of Lineage Specific Motifs by Gibbs Sampling)

USAGE: dmsample [OPTIONS] alignments.lst neutral.mod motif.pfm priors.pseudo \
           > out.gff

    alignments.lst describes the multiple alignment files. It provides a
    list of file names to be read and an optional column of corresponding
    indel histories. See example.lst for a formatting example.

    priors.pseudo is a tab-delimited text file containing pseudocounts to
    be added to transition counts when estimating each transition
    parameter. These must be integers. See example.pseudo for a formatting
    example.

DESCRIPTION:

    Predict regulatory motif turnover using a Gibbs sampling framework.

EXAMPLES:

OPTIONS:

    --rho, -R (default 0.3)

    --fix-params, -f
        Fix parameters at the user-supplied levels (or defaults, if none
        are given). Parameter values are supplied as a comma-delimited
        string in the order specified. If four values are supplied, xi will
        not be set (presumably because xi mode is being turned off; see
        --xi-off). To use default values for all params, supply one
        (arbitrary) value.

    --burn-in-samples, -b
        Number of samples for model burn-in. Defaults to 5,000.

    --samples, -s
        Number of sampling iterations. Defaults to 100,000.

    --refidx, -r
        Use coordinate frame of specified sequence in output. Default
        value is 1, the first sequence in the alignment; 0 indicates the
        coordinate frame of the entire multiple alignment.

    --revcomp, -C
        Produce reverse complement sequences for all input MSAs, which, in
        effect, causes both strands to be searched for motifs. This is
        equivalent to including reverse-complement states in the HMM, but
        potentially less computation and memory intensive.
        WARNING: Currently not compatible with indel histories!

    --threads, -t
        Number of concurrent threads to launch during emissions and
        sampling computations. Default value is 0 (multithreading not
        used). This is best set to the number of cores per processor on
        your machine.

    --seqname, -N
        Use specified string for 'seqname' (GFF) or 'chrom' field in
        output file.
        Default is obtained from input file name (double filename root,
        e.g., "chr22" if input file is "chr22.35.ss").

    --idpref, -P
        Use specified string as prefix of generated ids in output file.
        Can be used to ensure ids are unique. Default is obtained from
        input file name (single filename root, e.g., "chr22.35" if input
        file is "chr22.35.ss").

    # The following two options are for conditioning on either site presence
    # in a given species or presence of substitutions for gain/loss
    # predictions. Both require a file specified in the alignments list that
    # describes states and positions to zero out emissions where certain types
    # of predictions are incompatible with the conditioning. See dmcondition
    # for details.

    --cond-on-species, -x
        Condition gain and loss calls in the HMM on presence of a site in
        a given species. This is useful if ChIP-based regions are used as
        inputs for motif finding. The effect is to prohibit motif loss
        predictions on branches leading to a species believed to contain a
        binding site based on ChIP (or other) evidence and to prohibit
        gain predictions on branches leading away from that species.

    --cond-on-subs, -X
        Condition gain and loss predictions on presence of at least one
        substitution within a window believed to represent a binding site.
        This prevents gain/loss predictions on branches that contain no
        observable substitutions in the multiple alignment. The underlying
        algorithm uses Fitch parsimony to partition the tree into branches
        that do and do not contain substitutions for each motif window in
        the dataset. It then zeroes out the chain of gain and loss states
        in the emissions matrix corresponding to combinations of branch
        and motif position that do not contain any substitutions to
        support such a prediction. If used in conjunction with
        --cond-on-species, states that are already zeroed out because they
        are incompatible with site presence in a given species will be
        skipped in this step.
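
The Fitch parsimony step mentioned under --cond-on-subs can be illustrated with a short Python sketch. This is not dmsample's implementation: the tree encoding (a dict mapping each internal node to its two children), the function names, the species names and the toy alignment columns are all assumptions made for illustration. Ties between equally parsimonious states are broken arbitrarily (alphabetically), so the branch that receives a substitution can differ in ambiguous cases.

```python
def fitch_sets(tree, leaves, root="root"):
    """Bottom-up Fitch pass: candidate base set for every node."""
    sets = {}

    def visit(node):
        if node in leaves:
            sets[node] = {leaves[node]}
            return
        left, right = tree[node]
        visit(left)
        visit(right)
        common = sets[left] & sets[right]
        # Intersection if non-empty, otherwise union (one substitution).
        sets[node] = common if common else sets[left] | sets[right]

    visit(root)
    return sets


def branches_with_subs(tree, leaves, root="root"):
    """Top-down pass: branches (named by child node) carrying a change."""
    sets = fitch_sets(tree, leaves, root)
    state = {root: min(sets[root])}  # arbitrary tie-break (assumption)
    changed = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in tree.get(node, ()):
            if state[node] in sets[child]:
                state[child] = state[node]
            else:
                state[child] = min(sets[child])
                changed.add(child)
            stack.append(child)
    return changed


def window_branches(tree, columns, root="root"):
    """Union of substituted branches over all columns of a motif window."""
    result = set()
    for column in columns:
        result |= branches_with_subs(tree, column, root)
    return result


# Hypothetical three-species tree and a two-column motif window.
tree = {"root": ("anc", "mouse"), "anc": ("human", "chimp")}
window = [
    {"human": "A", "chimp": "A", "mouse": "G"},
    {"human": "C", "chimp": "T", "mouse": "C"},
]
supported = window_branches(tree, window)
unsupported = {"anc", "human", "chimp", "mouse"} - supported
print(sorted(supported), sorted(unsupported))
```

In this toy case the branches named by `unsupported` are the ones on which a gain or loss prediction would have no substitution to support it.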

    --mot-mod-type, -S F81|HB
        Set the substitution model type used for motif states. Default is
        F81. With HB, the Halpern-Bruno model (1998. MBE. 15(7):910-917)
        will be used on motif branches, with site-specific selective
        pressure modeled through the rate matrix and equilibrium
        frequencies; branch lengths are not scaled by rho on motif
        branches. With F81, motif branches are explicitly scaled by rho
        and only equilibrium frequencies will be used in the substitution
        model for these branches. All branches are scaled equally, by rho,
        regardless of the constraint implied by the motif weight matrix,
        which implies equal constraint at all motif positions.

    --scale-by-branch, -B
        Scale transition probabilities for lineage-specific states by the
        branch length leading to the node predicted to contain the gain or
        loss event. Default is to treat all branches as if they were of
        equal length when assigning transition probabilities.

    --xi-off, -M
        Toggle between default and alternate parameterization for entry
        into motif states. Default parameterization uses zeta for
        transitions into conserved and lineage-specific motifs and xi for
        entry into nonconserved motif states. Alternate parameterization
        uses a single zeta parameter for both types of transitions.

    --indel-model, -I alpha,beta,tau,epsilon[,alpha2,beta2,tau2,epsilon2]
        Use a simple model of insertions and deletions that assumes a
        known indel history and at most one indel per branch of the tree
        at any given position. The parameters alpha and beta are rates of
        insertion and deletion, respectively, per expected substitution
        per site, and the parameter tau is approximately the inverse of
        the expected indel length (see indelFit). Epsilon is the rate
        parameter for indels within motifs, and is expected to be very
        low. If two sets of parameters are given, the first will be used
        for nonconserved regions and the second for conserved regions.

    --nc-mot-indel-mode, -j
        Specify the set of indel params to use in nonconserved motif
        states.
        Default is motif params. Use of this option toggles to background
        params in these states.

    --log, -l
        Keep a log of sampling output (transition counts, param estimates
        and total log likelihood of sequences).

    --reference-gff, -g
        Load a reference GFF file of known positive motif occurrences in
        the input sequences. This option does nothing on its own, but can
        be used in conjunction with --log and/or --ref-as-prior in order
        to change the behavior of dmsample. When used in conjunction with
        --log, counts of known motifs found (and not found) will be output
        to the log file along with the usual log contents. Motifs that are
        not in the reference set will be reported as false positives in
        the log, but this is only meaningful when all motif locations are
        known (e.g., with data from dmsimulate). See below for behavior
        with --ref-as-prior. Note that sequence names in the reference GFF
        must match sequence names of input alignments or features will not
        be matched!

    --ref-as-prior, -u
        WARNING: NOT CURRENTLY FUNCTIONAL!!!
        Use motifs in the reference set as prior knowledge. At every
        iteration, the lists of known and predicted features will be
        compared and the transition counts will be adjusted to ensure all
        known motifs are used in parameter estimation and included in the
        sampling output. Because two motifs cannot overlap in a single
        path, if a motif is predicted in an identical or overlapping
        position in the current sample, the sampled feature will be given
        priority to ensure complete exploration of the posterior
        distribution. Requires --reference-gff!

    --force_priors, -p
        WARNING: NOT CURRENTLY FUNCTIONAL!!!
        When comparing predicted and reference paths, give priority to the
        reference path when choosing the position and mode of selection
        for features in identical or overlapping positions. Requires
        --reference-gff and implies --ref-as-prior.
        WARNING: This option may cause incomplete exploration of the
        posterior space. As a result, the reported posterior probabilities
        may be inaccurate.
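
The bookkeeping described under --reference-gff (known motifs found, known motifs not found, and predictions outside the reference set counted as false positives) can be sketched in Python as follows. This is only an illustration, not dmsample's code: features are reduced to hypothetical (seqname, start, end) tuples, and treating any overlap on the same sequence name as a match is an assumption, since the matching rule is not stated here.

```python
def overlaps(a, b):
    """True if two (seqname, start, end) features share a sequence and overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def score_predictions(reference, predicted):
    """Count reference motifs found/not found and unmatched predictions."""
    found = sum(
        1 for r in reference if any(overlaps(r, p) for p in predicted)
    )
    false_positives = sum(
        1 for p in predicted if not any(overlaps(r, p) for r in reference)
    )
    return {
        "found": found,
        "not_found": len(reference) - found,
        "false_positives": false_positives,
    }


# Hypothetical features; the "chr1" prediction shows why sequence names
# in the reference must match those of the input alignments.
reference = [("chr22", 100, 110), ("chr22", 500, 510)]
predicted = [("chr22", 105, 115), ("chr22", 900, 910), ("chr1", 500, 510)]
counts = score_predictions(reference, predicted)
print(counts)
```

The third prediction has the same coordinates as a reference motif but a different sequence name, so it is counted as a false positive and the reference motif as not found.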

    --dump-hash, -D
        Run sampling for the period specified and write the hash of raw
        counts to a file before terminating the run. Results will not be
        postprocessed and no GFF output will be produced. This is mainly
        useful for debugging and for running parallel chains on large
        datasets, resulting in several separate hashes that can be
        flattened using dmsProcessParallel to produce the final output.

    --precomputed-hash, -d
        Read sampling data directly from a file (most likely produced by a
        previous dmsample run with the --dump-hash option) and produce the
        output data from the stored data. No additional sampling will be
        done.

    --recover, -T
        Recover an aborted run from cached temp files. Can also be used to
        assemble hashes from multiple parallel sampling runs into a set of
        results. Takes a list file as an argument with a single header
        line: # NFILES = N, where N is the number of files to read in.
        Each file name goes on its own line and blank lines are ignored.

    --cache-int, -i
        Number of samples to run before caching the global motifs hash to
        disk. Setting this lower will decrease memory usage, but at the
        expense of increased I/O and reallocation overhead. Default is 200
        samples.

    --cache-fname, -c
        Temp-file name for hash caching. Default name is based on machine
        time on the local system.

    --help, -h
        Show this help message and exit.
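
The list file format described under --recover (a "# NFILES = N" header followed by one file name per line, with blank lines ignored) can be checked before a run with a small Python sketch like the one below. It is not part of dmsample: the function name, the example file names, the tolerance for extra whitespace in the header, and the check that the number of names equals N are all assumptions.

```python
import re


def parse_recover_list(text):
    """Return the file names listed after a '# NFILES = N' header line."""
    # Blank lines are ignored, as the --recover description states.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("list file is empty")
    header = re.fullmatch(r"#\s*NFILES\s*=\s*(\d+)", lines[0])
    if header is None:
        raise ValueError("first line must be '# NFILES = N'")
    expected = int(header.group(1))
    names = lines[1:]
    # Assumption: a mismatch between N and the listed names is an error.
    if len(names) != expected:
        raise ValueError(f"header says {expected} files, found {len(names)}")
    return names


# Hypothetical list file contents with a blank line between entries.
example = "# NFILES = 2\nchain1.hash\n\nchain2.hash\n"
files = parse_recover_list(example)
print(files)
```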