ARGweaver-D: Sampling ancestral recombination graphs conditional on a demographic model

Documentation for ARGweaver-D

Updated: June, 2019

Author: Melissa J. Hubisz (mjhubisz@cornell.edu
Software website: http://github.com/CshlSiepelLab/argweaver/

citation: Melissa J. Hubisz, Amy L. Williams, Adam Siepel. Mapping gene flow between ancient hominins through demography-aware inference of the ancestral recombination graph.




Introduction

ARGweaver-D is an extension of ARGweaver that allows ARGs to be sampled conditional on a user-defined demographic model that includes population splits and migration events. The ARGweaver-D code is built into the ARGweaver software base, and only requires extra options to the command-line programs to invoke. It is recommended that you first familiarize yourself with basic ARGweaver usage before reading this manual. However, note that ARGweaver-D has only been added to the C++ component of the ARGweaver codebase, and has not been incorporated into ARGweaver's python library.

Sampling an ARG under ARGweaver-D using arg-sample

Demography-aware ARGs can be sampled with the command arg-sample. Here, we document the options specific to ARGweaver-D:

Overview

-h,--help
-d,--help-popmodel
-P,--pop-tree-file <poptree file>
--pop-file <population assignment file>
--max-migs <max_migs> (default: 1)
--start-mig <iter> (default: 0)

Getting help

Most ARGweaver options can be described by invoking arg-sample --help.
Options specific to ARGweaver-D can be described by invoking arg-sample --help-popmodel.

Specifying a population tree

The population tree is specified using --pop-tree-file <file>. An example pop-tree file and explanation of the format is given below:
npop 5

## Neanderthal(1)/Denisova(2) div at 415kya
div 14310 2 1

## Archaic(1)/African(0) div at 575kya
div 19827 1 0

## Super-archaic(3)/Hominin(0) div at 1.0mya
div 34483 3 0

## Hominin(0)/Chimp(4) div at 11Mya
div 379310 4 0

## Afr-Nea migration @ 250kya
mig 8621 1 0 0.01
mig 8621 2 0 0.01

## Super-archaic migrations @ 250kya
mig 8621 0 3 0.01
mig 8621 1 3 0.01
mig 8621 2 3 0.01
Figure: Example population tree file. The first line starts with the tag "npop" and is followed by the number of populations at present time. The populations are numbered 0,...,(npop-1).

Each "div" line specifies a population divergence time. It takes three arguments: a time (in generations), and two population numbers. Looking backwards in time, the second population merges with the first population at this time. If there are n populations, there must be n-1 "div" lines, specified in such a way so that only one population remains by the final discrete time in the ARGweaver model (see arg-sample --maxtime or arg-sample --times-file).

Migrations are optionally specified with "mig" lines. They have the same format as the "div" lines, but with an additional probability parameter at the end, specifying the probability that a lineage sampled from the receiving population just after the migration event came from the donor population. Looking backwards in time, individuals from the first population move to the second population with the specified probability at the specified time.

Comment lines (starting with #) are allowed.

One important note is that all times in this file will be rounded to the nearest half-time in the discretized ARGweaver model.

Specifying population assignments

It is also necessary to specify which individuals belong to which populations. This is done with the --pop-file <file> option. An example population file is below.
chimp	4
Mandenka_2F	0
Khomani_San_1F	0
Vindija	1
Altai	1
Denisova	2
Figure: Example population file. The population file is simply a tab-delimited file with two columns: an individual name and a population assignment. The population assignment numbers correspond to the numbers in the population tree file (described above). The first column can give the names of haploid lineages (as used in SITES file format). Or, they can be diploid names, if VCF file input was used, or if the haploid lineages are named with the convention ind_1 ind_2, where "ind" is the name of the diploid individual.

Other arg-sample options

The other relevant options are:
--max-migs <num>
This specifies the maximum number of migrations allowed per local tree, and has a default of 1. This means only one lineage in a local tree can take a migration path, and is recommended to help avoid mixing problems, especially if the population model has many potential migration events. However in some circumstances with reasonably simple demographic models, it might make sense to change this default and allow multiple migration events.
--start-mig <num>
This specifies the starting MCMC iterations when migration events are allowed. The default is zero (always allow migrations), but we have found that convergence may improve when ARGweaver is first allowed to build a basic ARG without migrations for some initial iterations, especially if working with unphased samples. We often use the value of 100.

Parsing demographic ARGs with arg-summarize

Once ARGs are created with arg-sample, the program arg-summarize can be used to analyze migration regions of the args. The relevant options are:
--mig-file <mig_file>
--hap-mig-file <hap_mig_file>
The first option takes a "migration file" like this:
hToN 1 0 8621
hToD 2 0 8621
sToH 0 3 8621
sToN 1 3 8621
sToD 2 3 8621
Figure: Example migration file for --mig-file option to arg-summarize Each row describes a migration event that we want to summarize. The first column names the event; this name will be used in the arg-summarize output file. The next three columns give the receiving population, the source population, and the time (in generations) of the event.
Using the --mig-file option, arg-summarize will generate a bed file that includes a column for each migration event in the file, with 1/0 indicating whether any lineage that type of migrant in the local tree. By combining this option with either the --mean or --quantile options, you can get averages/quantiles across MCMC iterations across the genomic locations spanned by the ARG.

A slight variant of this option is --hap-mig-file, which takes a slightly different intput file, described below:

hToN.alt_1 1 0 8621 Altai_1 
hToN.alt_2 1 0 8621 Altai_2
hToN.vin_1 1 0 8621 Vindija_1
hToN.vin_2 1 0 8621 Vindija_2
hToD.den_1 2 0 8621 Denisova_1
hToD.den_2 2 0 8621 Denisova_2
sToA.afr1_1 0 3 8621 Khomani_San_1F_1
sToA.afr1_2 0 3 8621 Khomani_San_1F_2
sToA.afr2_1 0 3 8621 Mandenka_2F_1
sToA.afr2_2 0 3 8621 Mandenka_2F_2
sToN.alt_1 1 3 8621 Altai_1
sToN.alt_2 1 3 8621 Altai_2
sToN.vin_1 1 3 8621 Vindija_1
sToN.vin_2 1 3 8621 Vindija_2
sToD.den_1 2 3 8621 Denisova_1
sToD.den_2 2 3 8621 Denisova_2
Figure: Example migration file for --hap-mig-file option to arg-summarize The format here is the same as for --mig-file, except for a final column which gives a haploid lineage name. With this option, the output bed file will include a column for each row in this file, indicating whether a lineage ancestral to the named haploid sample undergoes the particular migration event.
Once bed files have been generated with arg-summarize --hap-mig-file, it is straightforward to use command-line tools such as awk to generate probabilities that a diploid individual is heterozygously or homozygously introgressed, by combining the results from their two haploid lineages.