The ARGweaver software package contains programs and libraries for sampling and manipulating ancestral recombination graphs (ARGs). An ARG is a rich data structure for representing the ancestry of DNA sequences undergoing coalescence and recombination. From the ARG, several different statistics of interest can be estimated such local trees, time of most recent common ancestor (TMRCA), local effective population size, recombination rates, etc.
ARGweaver's particular strength is the ability to sample ARGs for large genomic regions (many megabases). This is achieved by using a discretized version of the sequentially Markov coalescent (DSMC) to model the evolutionary history of DNA sequences. This model lends itself to efficient computational methods. In particular, ARGweaver implements a novel technique of efficiently sampling the ancestry of one sequence given the ancestry of n-1 sequences, an operation we call "threading" an ARG. This is a basic building block of a Markov chain Monte Carlo (MCMC) sampling method which we have implemented in ARGweaver (see arg-sample). Using the posterior samples of the ARG, we have implemented several methods for extracting statistics of interest arg-extract-tmrca, arg-extract-popsize, etc.
ARGweaver contains several programs for simulating, inferring, and manipulating ARGs, which are located in the bin/ directory. The main usage and description of each program can be found by using the --help option.
arg-sample is the main program for inferring ancestral recombination graphs (ARGs). It takes DNA sequences specified in a file format called *.sites and samples ARGs the posterior distribution, which are stored in a file format called *.smc.
Input DNA sequence alignments can be specified in either the *.sites or FASTA file formats. The files can also be gzip compressed if the *.gz file extension is present.
This specifies a prefix for all output filenames (default 'arg-sample'). It is usually most convenient to specify a new directory since many output files can be generated (i.e. each ARG sample). If any of the directories in the output path do not exist, arg-sample will create them.
This optional argument specifies an initial ARG (stored in *.smc format) for sampling. This is useful for resuming a previous sampling. If the ARG does not contain all of the sequences present in the input sequence alignment file (--sites, --fasta), the missing sequences will first be added using the "threading" procedure. Once a complete ARG is obtained, normal sampling will continue. This feature allows one to easily build a "core ARG" of a small number of sequences and then add in additional sequences in a separate run.
This optional argument specifies that an ARG should only be sampled for a subregion (start-end) of the sites. Region is 1-based, end-inclusive.
This specifies that certain regions should be masked/ignored from the sequence alignment. Regions should be specified in a three column BED file.
arg-sample assumes a prior model for the evolution of the input DNA sequences. The parameters of the prior model can be specified with the following arguments.
This specifies the effective population size (default 10,000). arg-sample assumes the population is diploid.
This specifies a constant mutation rate as mutations per site per generation (default 2.5e-8). Currently only the Jukes-Cantor model is implemented. To specify a varying mutation rate see --mutmap.
This specifies a constant recombination rate in recombinations per site per generation (default 1.5e-8). To specify a varying recombination rate see --recombmap.
ARGweaver approximates the evolutionary process as occurring in discretized time steps. These options provide several ways of specifying the time steps. The most common setting is to allow time steps to be spaced logarithmically (finer resolution in recent times and coarser resolution in older times). This can be done by specifying the number of time steps --ntimes and a maximum time --maxtime. Time is always expressed in number of generations ago, where 0 is the present day. Time steps can also be specified to be spaced linearly with a step size --time-step, or they can be specified explicitly in a file --times-file one per line.
These options allow varying mutation rates and recombination rates to be specified. The files should be in bedGraph format.
arg-sample produces posterior samples of the ARG using Markov chain Monte Carlo (MCMC). These options customize the sampling process.
This specifies the total number of sample iterations (default 1000).
This optional argument specifies a subregion to resample of input ARG given by --arg.
This option allows a previous sampling to be resumed if it was interrupted or if more iterations are desired. Sampling options can be changed from previous run.
By default, arg-sample does not overwrite existing files in order to protect against accidental data loss. Use this option, if overwriting is intended.
To speed up the algorithm, the sequence alignment can be compressed during sampling into blocks of sizes such as or 10bp or 20bp. Recombinations are only allowed between blocks but not within. By default the compression is 1bp per block.
This specifies how often to write ARG samples to file. By default, the ARG is written every 10 iterations.
Do not use gzip to compress output.
arg-sample uses a random number generator in the sampling process. This option specifies the seed (default is based on current time). Using the same seed as a previous run will reproduce the run.
Set the level of logging verbosity (0=quiet, 1=low, 2=medium, 3=high)
Completely suppress logging to standard error (stderr).
These options display version and help information. --help-advanced displays additional options that are experimental or are useful for debugging purposes.
ARGweaver uses several different file formats for sequence alignments and ancestral recombination graphs (ARGs), which are described here. Since some of these files can be quite large, they can be compressed using gzip. All programs in ARGweaver can automatically detect if a file is compressed if the filename ends with *.gz.
ARGweaver can use the FASTA file format for sequences alignments. The file extension is not important and many different extensions are in common use (*.fa, *.mfa, *.fasta, *.align).
Each line starting with ">" indicates a sequence (haplotype) name. This assumes sequences are phased and that each individual contributes two such sequences. Note, the entire line after the ">" is used as the sequence name. The sequence is given on the following lines and it may be wrapped to any number of columns (or not wrapped at all). Unknown characters in the alignment are represented with the "N" character.
>individual1_haplotype1 ATACGCGCTCGCCCATAGCTTCCTCAAACCATTCTACTACGACCCTGTATGTTTACCGAGTG >individual1_haplotype2 ATAGGCGCTCGCCCATAGCTTCCTCAAACCATTCAACTACGACCCTGCATGTTTACCGAGTG >individual2_haplotype1 ATAGGCGCTCGCCCATAGCTTCCACAAACCATTCTACTACGACCCTGCATGTTTACCGAGTG >individual2_haplotype2 ATACGCGCTCGCCCATAGCTTCCACAAACCATTCAACTACGACCCTGTATGTTTACCGAGTG
Since in most populations the mutation rate can be quite low, most sites in an alignment will be invariant and FASTA is a very inefficient file format for such sequences. Instead it is often more convenient to use the *.sites file format which only specifies the variant sites in an alignment. Sites not specified are assumed to be invariant. See sim1.sites for an example.
The sites format is tab-delimited. It currently has two header lines which are distinguished by the first column.
The remaining lines specify the variant positions in the sequence alignment. The first column gives the 1-based coordinate of the variant position and the second column gives the bases for each sequence at that position. Bases are given in the same order as the sequence names in the NAMES line. The "N" character can be used to specify an unknown base. The sites are required to be sorted in the file by their coordinate.
NAMES n0 n1 n2 n3 n4 n5 n6 n7 REGION chr 1 100000 404 AAAAAAGA 992 TTTTTTTA 1146 CCCCGGCC 3727 CCCCCACC 3778 GCCCCCCC 4791 GGGGGGTG 5368 GGGGGGCG 6066 AAATAAAA 6312 GGGGGGTG 7062 AAAAATAA 7151 ACCAAAAA 7398 TAATATTT 9388 ATTATTTT 10076 CGGCCCCC
ARGweaver uses the sequentially Markov coalescent (SMC) model as an approximation of the evolutionary process. This approximation also imposes several restrictions on the structure of the ARG. Namely the ARG can be represented as a series of local (also known as marginal) trees interspersed with subtree pruning and regrafting (SPR) operations that describe how two neighboring local trees differ from one another. Each SPR corresponds to a recombination event in the evolutionary past of the sequences. However, not all recombination events are expressible as SPR operations (e.g. recombinations that occur in non-ancestral sequences). The SMC file format explicitly stores each of these local trees and the SPR operations between them. Although this is more verbose than storing the ARG as a directed acyclic graph (DAG, see *.arg) it is often more convenient since extracting local trees and their statistics is one of the most common operations to perform on ARGs.
The SMC format is tab-delimited. It currently has two header lines which are distinguished by the first column.
The remaining lines specify the local trees and SPR operations. The type of line is indicated by the first column and have the following formats:
NAMES n0 n1 n6 n3 n4 n7 n5 n2 REGION chr 1 100000 TREE 1 740 ((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:17041.819563[&&NHX:age=1002.805899],(2:18044.625462[&&NHX:age=0.000000],((5:5363.846221[&&NHX:age=0.000000],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:5241.259274[&&NHX:age=122.586947])10:6697.962294[&&NHX:age=5363.846221],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:10516.494006[&&NHX:age=1545.314509])9:5982.816948[&&NHX:age=12061.808515])13:0.000000[&&NHX:age=18044.625462])8[&&NHX:age=18044.625462]; SPR 740 13 18044.625462 8 26970.598323 TREE 741 840 ((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:25967.792423[&&NHX:age=1002.805899],(2:18044.625462[&&NHX:age=0.000000],((5:5363.846221[&&NHX:age=0.000000],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:5241.259274[&&NHX:age=122.586947])10:6697.962294[&&NHX:age=5363.846221],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:10516.494006[&&NHX:age=1545.314509])9:5982.816948[&&NHX:age=12061.808515])13:8925.972860[&&NHX:age=18044.625462])8[&&NHX:age=26970.598323]; SPR 840 14 5363.846221 12 18044.625462 TREE 841 1460 (((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:17041.819563[&&NHX:age=1002.805899],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:17922.038515[&&NHX:age=122.586947])10:8925.972860[&&NHX:age=18044.625462],(2:18044.625462[&&NHX:age=0.000000],(5:12061.808515[&&NHX:age=0.000000],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:10516.494006[&&NHX:age=1545.314509])9:5982.816948[&&NHX:age=12061.808515])13:8925.972860[&&NHX:age=18044.625462])8[&&NHX:age=26970.598323]; SPR 1460 5 8051.702368 11 8051.702368 TREE 1461 2880 (((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:17041.819563[&&NHX:age=1002.805899],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:17922.038515[&&NHX:age=122.586947])10:8925.972860[&&NHX:age=18044.625462],(2:18044.625462[&&NHX:age=0.000000],(5:8051.702368[&&NHX:age=0.000000],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:6506.387859[&&NHX:age=1545.314509])9:9992.923095[&&NHX:age=8051.702368])13:8925.972860[&&NHX:age=18044.625462])8[&&NHX:age=26970.598323]; ...
Sometimes one wants to work with more general ARGs that do not follow all of the restrictions of the sequentially Markov coalescent (SMC). For such ARGs, the SMC file format is not sufficient and the more general *.arg file format is necessary. The *.arg is tab-delimited and represents the ARG directly as a graph. The first two lines contain header information:
The remaining lines describe one node each and can be appear in any order. The fields of each node should follow the same order as the headers.
start=0 end=10000 name event age pos parents children 1 coal 395 0 12 n0,n2 2 coal 5363 0 9 17,19 3 coal 26970 0 8 5,11 4 coal 60155 0 7,8 5 recomb 18044 246 3,10 13 6 coal 18044 0 7 9,10 7 recomb 40287 673 4,16 6 8 coal 60155 0 4 3,26 9 recomb 18044 734 6,10 2 10 coal 18044 0 6 5,9 11 recomb 2354 781 3,21 n1 12 coal 3562 0 20 1,21 13 recomb 8051 4794 5,23 22 14 coal 40287 0 26 16,15 15 recomb 18044 5603 14,25 18 16 coal 40287 0 14 7,24 17 recomb 5363 6091 2,18 n3 18 coal 12061 0 15 23,17 19 recomb 1545 7055 2,20 n4 20 coal 5363 0 22 12,19 21 recomb 3562 8949 12,22 11 22 coal 5363 0 13 20,21 23 recomb 12061 9876 18,24 13 24 coal 18044 0 16 25,23 25 recomb 18044 9916 24,26 15 26 coal 40287 0 8 14,25 n0 gene 0 0 1 n1 gene 0 0 11 n2 gene 0 0 1 n3 gene 0 0 17 n4 gene 0 0 19