ARGweaver: Sampling and manipulating genome-wide ancestral recombination graphs

Documentation for the ARGweaver software package

Updated: September, 2013

Author: Matthew D. Rasmussen (mattrasmus.com, rasmus@alum.mit.edu, matt.rasmus@gmail.com)
Software website: http://github.com/mdrasmus/argweaver/

citation: Matthew D. Rasmussen, Adam Siepel. Genome-wide inference of ancestral recombination graphs. 2013. arXiv:1306.5110 [q-bio.PE]

Introduction

The ARGweaver software package contains programs and libraries for sampling and manipulating ancestral recombination graphs (ARGs). An ARG is a rich data structure for representing the ancestry of DNA sequences undergoing coalescence and recombination. From the ARG, several different statistics of interest can be estimated such local trees, time of most recent common ancestor (TMRCA), local effective population size, recombination rates, etc.

ARGweaver's particular strength is the ability to sample ARGs for large genomic regions (many megabases). This is achieved by using a discretized version of the sequentially Markov coalescent (DSMC) to model the evolutionary history of DNA sequences. This model lends itself to efficient computational methods. In particular, ARGweaver implements a novel technique of efficiently sampling the ancestry of one sequence given the ancestry of n-1 sequences, an operation we call "threading" an ARG. This is a basic building block of a Markov chain Monte Carlo (MCMC) sampling method which we have implemented in ARGweaver (see arg-sample). Using the posterior samples of the ARG, we have implemented several methods for extracting statistics of interest arg-extract-tmrca, arg-extract-popsize, etc.

Programs

ARGweaver contains several programs for simulating, inferring, and manipulating ARGs, which are located in the bin/ directory. The main usage and description of each program can be found by using the --help option.

arg-sample

arg-sample is the main program for inferring ancestral recombination graphs (ARGs). It takes DNA sequences specified in a file format called *.sites and samples ARGs the posterior distribution, which are stored in a file format called *.smc.

Main arguments

-s,--sites <sites alignment>
-f,--fasta <fasta alignment>

Input DNA sequence alignments can be specified in either the *.sites or FASTA file formats. The files can also be gzip compressed if the *.gz file extension is present.

-o,--output <output prefix>

This specifies a prefix for all output filenames (default 'arg-sample'). It is usually most convenient to specify a new directory since many output files can be generated (i.e. each ARG sample). If any of the directories in the output path do not exist, arg-sample will create them.

-a,--arg <SMC file>

This optional argument specifies an initial ARG (stored in *.smc format) for sampling. This is useful for resuming a previous sampling. If the ARG does not contain all of the sequences present in the input sequence alignment file (--sites, --fasta), the missing sequences will first be added using the "threading" procedure. Once a complete ARG is obtained, normal sampling will continue. This feature allows one to easily build a "core ARG" of a small number of sequences and then add in additional sequences in a separate run.

--region <start>-<end>

This optional argument specifies that an ARG should only be sampled for a subregion (start-end) of the sites. Region is 1-based, end-inclusive.

--maskmap <sites mask>

This specifies that certain regions should be masked/ignored from the sequence alignment. Regions should be specified in a three column BED file.

Model parameters

arg-sample assumes a prior model for the evolution of the input DNA sequences. The parameters of the prior model can be specified with the following arguments.

-N,--popsize <population size>

This specifies the effective population size (default 10,000). arg-sample assumes the population is diploid.

-m,--mutrate <mutation rate>

This specifies a constant mutation rate as mutations per site per generation (default 2.5e-8). Currently only the Jukes-Cantor model is implemented. To specify a varying mutation rate see --mutmap.

-r,--recombrate <recombination rate>

This specifies a constant recombination rate in recombinations per site per generation (default 1.5e-8). To specify a varying recombination rate see --recombmap.

-t,--ntimes <ntimes>
--maxtime <maxtime>
--time-step <time>
--times-file <times filename>

ARGweaver approximates the evolutionary process as occurring in discretized time steps. These options provide several ways of specifying the time steps. The most common setting is to allow time steps to be spaced logarithmically (finer resolution in recent times and coarser resolution in older times). This can be done by specifying the number of time steps --ntimes and a maximum time --maxtime. Time is always expressed in number of generations ago, where 0 is the present day. Time steps can also be specified to be spaced linearly with a step size --time-step, or they can be specified explicitly in a file --times-file one per line.

-M,--mutmap <mutation rate map file>
-R,--recombmap <recombination rate map file>

These options allow varying mutation rates and recombination rates to be specified. The files should be in bedGraph format.

Sampling

arg-sample produces posterior samples of the ARG using Markov chain Monte Carlo (MCMC). These options customize the sampling process.

-n,--iters <# of iterations>

This specifies the total number of sample iterations (default 1000).

--resample-region <start>-<end>

This optional argument specifies a subregion to resample of input ARG given by --arg.

--resume

This option allows a previous sampling to be resumed if it was interrupted or if more iterations are desired. Sampling options can be changed from previous run.

--overwrite

By default, arg-sample does not overwrite existing files in order to protect against accidental data loss. Use this option, if overwriting is intended.

Miscellaneous

-c,--compress-seq <compression factor>

To speed up the algorithm, the sequence alignment can be compressed during sampling into blocks of sizes such as or 10bp or 20bp. Recombinations are only allowed between blocks but not within. By default the compression is 1bp per block.

--sample-step <sample step size>

This specifies how often to write ARG samples to file. By default, the ARG is written every 10 iterations.

--no-compress-output

Do not use gzip to compress output.

-x,--randseed <random seed>

arg-sample uses a random number generator in the sampling process. This option specifies the seed (default is based on current time). Using the same seed as a previous run will reproduce the run.

Information

-V,--verbose <verbosity level>

Set the level of logging verbosity (0=quiet, 1=low, 2=medium, 3=high)

-q,--quiet

Completely suppress logging to standard error (stderr).

-v,--version
-h,--help
--help-advanced

These options display version and help information. --help-advanced displays additional options that are experimental or are useful for debugging purposes.

File formats

ARGweaver uses several different file formats for sequence alignments and ancestral recombination graphs (ARGs), which are described here. Since some of these files can be quite large, they can be compressed using gzip. All programs in ARGweaver can automatically detect if a file is compressed if the filename ends with *.gz.

FASTA file format (*.fa)

ARGweaver can use the FASTA file format for sequences alignments. The file extension is not important and many different extensions are in common use (*.fa, *.mfa, *.fasta, *.align).

Each line starting with ">" indicates a sequence (haplotype) name. This assumes sequences are phased and that each individual contributes two such sequences. Note, the entire line after the ">" is used as the sequence name. The sequence is given on the following lines and it may be wrapped to any number of columns (or not wrapped at all). Unknown characters in the alignment are represented with the "N" character.

>individual1_haplotype1
ATACGCGCTCGCCCATAGCTTCCTCAAACCATTCTACTACGACCCTGTATGTTTACCGAGTG
>individual1_haplotype2
ATAGGCGCTCGCCCATAGCTTCCTCAAACCATTCAACTACGACCCTGCATGTTTACCGAGTG
>individual2_haplotype1
ATAGGCGCTCGCCCATAGCTTCCACAAACCATTCTACTACGACCCTGCATGTTTACCGAGTG
>individual2_haplotype2
ATACGCGCTCGCCCATAGCTTCCACAAACCATTCAACTACGACCCTGTATGTTTACCGAGTG

Figure: Example FASTA *.fa file. Each haplotype sequence is fully specified. Sequence names can be given in any format. Sequences are all assumed to be randomly sampled from the population and no assumptions are made about how they relate to individuals. In practice, each individual contribute two sequences, but the ARGweaver software does not make use of such information.

Sites file format (.sites, .sites.gz)

Since in most populations the mutation rate can be quite low, most sites in an alignment will be invariant and FASTA is a very inefficient file format for such sequences. Instead it is often more convenient to use the *.sites file format which only specifies the variant sites in an alignment. Sites not specified are assumed to be invariant. See sim1.sites for an example.

The sites format is tab-delimited. It currently has two header lines which are distinguished by the first column.

NAMES: this line indicates the names of the sequences in the following columns. Names can have any format provided they do not contain tab or newline characters.
REGION: this line indicates what genomic region is represented in the file. The following columns are CHROMOSOME, START, END. CHROMSOME is a chomosome name, while START and END indicate the region using 1-based, end-inclusive coordinates.

The remaining lines specify the variant positions in the sequence alignment. The first column gives the 1-based coordinate of the variant position and the second column gives the bases for each sequence at that position. Bases are given in the same order as the sequence names in the NAMES line. The "N" character can be used to specify an unknown base. The sites are required to be sorted in the file by their coordinate.

NAMES	n0	n1	n2	n3	n4	n5	n6	n7
REGION	chr	1	100000
404	AAAAAAGA
992	TTTTTTTA
1146	CCCCGGCC
3727	CCCCCACC
3778	GCCCCCCC
4791	GGGGGGTG
5368	GGGGGGCG
6066	AAATAAAA
6312	GGGGGGTG
7062	AAAAATAA
7151	ACCAAAAA
7398	TAATATTT
9388	ATTATTTT
10076	CGGCCCCC

Figure: Example *.sites file. The sites file format only specifies the variant sites in an alignment and is often more convenient than the FASTA format.

Sequentially Markov coalescent SMC file format (.smc, .smc.gz)

ARGweaver uses the sequentially Markov coalescent (SMC) model as an approximation of the evolutionary process. This approximation also imposes several restrictions on the structure of the ARG. Namely the ARG can be represented as a series of local (also known as marginal) trees interspersed with subtree pruning and regrafting (SPR) operations that describe how two neighboring local trees differ from one another. Each SPR corresponds to a recombination event in the evolutionary past of the sequences. However, not all recombination events are expressible as SPR operations (e.g. recombinations that occur in non-ancestral sequences). The SMC file format explicitly stores each of these local trees and the SPR operations between them. Although this is more verbose than storing the ARG as a directed acyclic graph (DAG, see *.arg) it is often more convenient since extracting local trees and their statistics is one of the most common operations to perform on ARGs.

The SMC format is tab-delimited. It currently has two header lines which are distinguished by the first column.

NAMES: this line indicates the names of the sequences in the following columns. Names can have any format provided they do not contain tab or newline characters.
REGION: this line indicates what genomic region is represented in the file. The following columns are CHROMOSOME, START, END. CHROMSOME is a chomosome name, while START and END indicate the region using 1-based, end-inclusive coordinates.

The remaining lines specify the local trees and SPR operations. The type of line is indicated by the first column and have the following formats:

TREE start end tree: this line indicates a local tree that exists from coordinates start to end (1-based, end-inclusive). tree is the local tree represented in newick format. Every node in the tree is numbered. The leaves are numbered between 0 and n-1, where n is the total number of sequences. The numbering is given in the same ordering as the sequence names in the NAMES line. Branch lengths are given in units of generations. NHX style comments are used to also indicate the age of every node. Although this is redundant given branch lengths, these comments ensure ages are preserved perfectly since adding branch lengths to determine age is susceptible to rounding errors.
SPR position recomb_node recomb_time coal_node coal_time: this line specifies the subtree pruning and regrafting (SPR) operation responsible for changing the previous local tree into the next local tree. position is the last base in the previous local block before the recombination event. recomb_node gives the branch number in the previous local tree that bares the recombination event and recomb_time gives the time in generations of the recombination. coal_node gives the branch name in previous local tree of where the broken branch re-coalesces and coal_time gives the re-coalescence time in generations.

NAMES	n0	n1	n6	n3	n4	n7	n5	n2
REGION	chr	1	100000
TREE	1	740	((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:17041.819563[&&NHX:age=1002.805899],(2:18044.625462[&&NHX:age=0.000000],((5:5363.846221[&&NHX:age=0.000000],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:5241.259274[&&NHX:age=122.586947])10:6697.962294[&&NHX:age=5363.846221],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:10516.494006[&&NHX:age=1545.314509])9:5982.816948[&&NHX:age=12061.808515])13:0.000000[&&NHX:age=18044.625462])8[&&NHX:age=18044.625462];
SPR	740	13	18044.625462	8	26970.598323
TREE	741	840	((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:25967.792423[&&NHX:age=1002.805899],(2:18044.625462[&&NHX:age=0.000000],((5:5363.846221[&&NHX:age=0.000000],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:5241.259274[&&NHX:age=122.586947])10:6697.962294[&&NHX:age=5363.846221],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:10516.494006[&&NHX:age=1545.314509])9:5982.816948[&&NHX:age=12061.808515])13:8925.972860[&&NHX:age=18044.625462])8[&&NHX:age=26970.598323];
SPR	840	14	5363.846221	12	18044.625462
TREE	841	1460	(((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:17041.819563[&&NHX:age=1002.805899],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:17922.038515[&&NHX:age=122.586947])10:8925.972860[&&NHX:age=18044.625462],(2:18044.625462[&&NHX:age=0.000000],(5:12061.808515[&&NHX:age=0.000000],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:10516.494006[&&NHX:age=1545.314509])9:5982.816948[&&NHX:age=12061.808515])13:8925.972860[&&NHX:age=18044.625462])8[&&NHX:age=26970.598323];
SPR	1460	5	8051.702368	11	8051.702368
TREE	1461	2880	(((6:1002.805899[&&NHX:age=0.000000],4:1002.805899[&&NHX:age=0.000000])12:17041.819563[&&NHX:age=1002.805899],(7:122.586947[&&NHX:age=0.000000],1:122.586947[&&NHX:age=0.000000])14:17922.038515[&&NHX:age=122.586947])10:8925.972860[&&NHX:age=18044.625462],(2:18044.625462[&&NHX:age=0.000000],(5:8051.702368[&&NHX:age=0.000000],(3:1545.314509[&&NHX:age=0.000000],0:1545.314509[&&NHX:age=0.000000])11:6506.387859[&&NHX:age=1545.314509])9:9992.923095[&&NHX:age=8051.702368])13:8925.972860[&&NHX:age=18044.625462])8[&&NHX:age=26970.598323];
...

Figure: Example *.smc file. The SMC file format describes an ARG as a series of local trees and subtree pruning and regrafting (SPR) operations.

Ancestral recombination graph file format (*.arg)

Sometimes one wants to work with more general ARGs that do not follow all of the restrictions of the sequentially Markov coalescent (SMC). For such ARGs, the SMC file format is not sufficient and the more general *.arg file format is necessary. The *.arg is tab-delimited and represents the ARG directly as a graph. The first two lines contain header information:

The first line contains meta data about the ARG stored as key=value pairs. The keys start and end are required and indicate the genomic region represented by the ARG (0-based, end-exclusive).
The second line contains column headers. Six columns are required:
1. name: every node in the ARG has a name.
2. event: every node must represent one of three events: gene represents an extant leaf node, coal represents the coalescing of two lineages into one going backwards in time, and recomb represents a recombination event where one lineage splits into two going backwards in time.
3. age: the age of the node (often represented in generations ago)
4. pos: if the node is a recomb this represents the position in the sequence of the recombination. This value is ignored otherwise.
5. parents: this is a comma separated list of parental node names. If this is a gene or coal node, there should at most be one parental node, and if this is a recomb node there should be two parental nodes. The root node will have no parents.
6. children: a comma separated list of children node names. If this is a gene node there should be no names, if this is a coal node there should be two children, and if this is a recomb node there should be one child.

The remaining lines describe one node each and can be appear in any order. The fields of each node should follow the same order as the headers.

start=0	end=10000
name	event	age	pos	parents	children
1	coal	395	0	12	n0,n2
2	coal	5363	0	9	17,19
3	coal	26970	0	8	5,11
4	coal	60155	0		7,8
5	recomb	18044	246	3,10	13
6	coal	18044	0	7	9,10
7	recomb	40287	673	4,16	6
8	coal	60155	0	4	3,26
9	recomb	18044	734	6,10	2
10	coal	18044	0	6	5,9
11	recomb	2354	781	3,21	n1
12	coal	3562	0	20	1,21
13	recomb	8051	4794	5,23	22
14	coal	40287	0	26	16,15
15	recomb	18044	5603	14,25	18
16	coal	40287	0	14	7,24
17	recomb	5363	6091	2,18	n3
18	coal	12061	0	15	23,17
19	recomb	1545	7055	2,20	n4
20	coal	5363	0	22	12,19
21	recomb	3562	8949	12,22	11
22	coal	5363	0	13	20,21
23	recomb	12061	9876	18,24	13
24	coal	18044	0	16	25,23
25	recomb	18044	9916	24,26	15
26	coal	40287	0	8	14,25
n0	gene	0	0	1
n1	gene	0	0	11
n2	gene	0	0	1
n3	gene	0	0	17
n4	gene	0	0	19

Figure: Example *.arg file. The *.arg file format describes an ARG as a series of nodes in a graph. It is the most general representation of an ARG and does not impose the structural restrictions implied by the *.smc file format. However, extracting local trees from ARGs stored in *.arg format is more involved.