Inference of Natural Selection from Interspersed Genomically coHerent elemenTs

Drosophila Polymorphism and Divergence Data

This website contains information on the Drosophila melanogaster polymorphism and divergence data used by INSIGHT. More details are available in the Supplementary Materials of (Gronau et al., Mol Biol Evol, 2013). More information on INSIGHT can be found here. An application of INSIGHT to D. melanogaster miRNAs is described in an additional article (Mohammed et al., in review for RNA 2014).

Contents

   1. Polymorphism data
   2. Outgroup sequence data and ancestral priors
   3. Filters
   4. Putative neutral sites
   5. Genomic blocks

1. Polymorphism data

Drosophila melanogaster polymorphism sequence data were obtained from the Drosophila Genetic Reference Panel (DGRP). This resource contains the complete genome sequences of 205 unrelated fly lines from a homogeneous Raleigh, North Carolina fruit-fly population. Genotype calls for these lines were extracted from the Variant Call Format (VCF) files downloaded from the DGRP freeze 2 website in May 2013. We considered only single nucleotide polymorphisms (SNPs) and discarded sites marked as structural variants in these VCF files. All positions not reported in the VCF files were assumed to be monomorphic for the reference allele (according to UCSC dm3 / BDGP Release 5).

The polymorphism data were summarized by recording, for each position in dm3, the allele count for each of the four bases (A, C, G, and T) across the 205 × 2 = 410 chromosomes. Sites with more than two observed alleles (i.e. tri- or quad-allelic sites) were masked. Additionally, sites with missing data for 5 or more of the 205 fly lines were masked (see filters).
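The per-site summary described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the pipeline used to prepare the INSIGHT data: the function name `summarize_site` and the input format (one base call per fly line, with "N" or None for missing data) are hypothetical, and each inbred line is assumed to contribute two identical chromosomes.

```python
from collections import Counter

BASES = frozenset("ACGT")
MAX_MISSING_LINES = 4  # sites with missing data in 5 or more lines are masked


def summarize_site(genotypes):
    """Summarize one site from per-line base calls (hypothetical input format).

    `genotypes` holds one entry per fly line: a base in ACGT, or "N"/None
    for missing data. Returns a dict of allele counts over chromosomes,
    or None if the site is masked.
    """
    called = [g for g in genotypes if g in BASES]
    missing = len(genotypes) - len(called)
    if missing > MAX_MISSING_LINES:
        return None  # too many lines without a usable call
    counts = Counter(called)
    if len(counts) > 2:
        return None  # tri- or quad-allelic site
    # Assumption: each line is counted as two chromosomes.
    return {base: 2 * counts.get(base, 0) for base in "ACGT"}
```

For example, a site where 200 lines carry A and 5 lines carry G yields counts of 400 and 10, which together account for all 410 chromosomes.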

205 Drosophila melanogaster lines from a single Raleigh, North Carolina population, whose genomes have been sequenced to high coverage by the Drosophila Genetic Reference Panel. Taken from the DGRP online resource.

2. Outgroup sequence data and ancestral priors

Divergence was inferred using four melanogaster-subgroup outgroup genomes: D. simulans (droSim2), D. sechellia (droSec2), D. yakuba (droYak3) and D. erecta (droEre2). We created a custom 5-way alignment of these outgroup species and D. melanogaster (dm3) using the LASTZ and chain/net procedure prescribed by the UCSC genome browser. All genome assemblies except D. simulans were downloaded from the UCSC genome browser. For D. simulans, we used a higher-quality, recently released genome assembly, named droSim2, in our custom alignment.

For each position in dm3, we recorded the aligned base from each of the 4 non-melanogaster fruit-fly species, or an indication that no syntenic alignment was available at that position (see filters).

A prior distribution for the ancestral state (Z) was computed for all non-filtered sites in dm3, by assuming a phylogeny estimated from four-fold degenerate sites (see Figure), and applying the postprob.msa function in RPHAST. The D. melanogaster (dm3) sequence was masked in this computation, so that the computed distribution corresponds to the distribution over the bases in the ancestral sequence (Z) given the other four genomes, which is used as a prior distribution (P(Z|O)) by the INSIGHT model.

The phylogeny assumed when estimating divergence rates (λ) and prior probabilities for the ancestral states (Z_i). Branch lengths are given as the expected number of differences between haploid chromosomes per base. The phylogeny was inferred from four-fold degenerate sites within D. melanogaster protein-coding genes (FlyBase r5.46) using the phyloFit utility from the PHAST package.

3. Filters

DGRP freeze 2 datasets were provided only for the euchromatic arms of the autosomes and the X chromosome (i.e. 2L, 2R, 3L, 3R, 4, X). Polymorphic sites within the heterochromatic sequences (i.e. 2LHet, 2RHet, 3LHet, 3RHet, XHet, and YHet) and the mitochondrial DNA were not provided. Thus, our analysis was restricted to the euchromatic chromosomal regions.

We applied various filters to reduce the impact of technical errors from alignment, sequencing, genotype inference, and genome assembly. Our filters included repetitive sequences (simple repeats), recent transposable elements, recent segmental duplications, and CpG site pairs. CpG site pairs (prone to hypermutability) were identified as position pairs having a “CG” dinucleotide in the D. melanogaster reference genome. Non-syntenic regions, identified as syntenic blocks residing on low-quality nets, and gaps in the outgroup alignment were hard-masked (by “N”s) individually in each outgroup genome. This uncertainty was incorporated when estimating the prior distribution over the ancestral sequence (Z, see above).
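The CpG filter is simple enough to illustrate directly. The sketch below marks both positions of every “CG” dinucleotide in a reference sequence; the function name `cpg_positions` and the use of 0-based coordinates are assumptions made for this example, and the code is not taken from the original pipeline.

```python
def cpg_positions(seq):
    """Return sorted 0-based positions covered by "CG" dinucleotides.

    Both the C and the G of each pair are reported, since the filter
    masks CpG site pairs. Soft-masked (lowercase) sequence is handled
    by upper-casing first.
    """
    seq = seq.upper()
    masked = set()
    start = seq.find("CG")
    while start != -1:
        masked.update((start, start + 1))
        start = seq.find("CG", start + 1)
    return sorted(masked)
```

Applied to the toy sequence "ACGTCGCG", this reports positions 1, 2, 4, 5, 6 and 7, leaving only the leading A and the T unmasked.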

Sites with missing data in 5 or more of the 205 fly lines were masked out completely (see above). Additionally, sites with more than two observed alleles in the fly population data (i.e. tri- or quad-allelic sites) were masked. We treated lines without a reported genotype and lines with "N" base assignments as lines with missing data. At polymorphic sites where fewer than 5 lines had missing data, we subsampled 200 fly lines without replacement to arrive at a uniform 400 alleles per site. Subsampling is a common way to handle genotype data with missing information, and it gives every site the same sample size, so that polymorphisms are treated consistently within INSIGHT.
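The subsampling step can be sketched as follows. This is an illustrative reading of the description above, not the original code: the function name `subsample_alleles`, the input format (one base call per line, "N" or None for missing data), and the optional `rng` argument for reproducibility are all assumptions.

```python
import random

VALID_BASES = frozenset("ACGT")
TARGET_LINES = 200  # 200 lines x 2 chromosomes = 400 alleles per site


def subsample_alleles(genotypes, rng=None):
    """Subsample 200 called lines without replacement and count alleles.

    Returns a dict mapping each observed base to its count among the
    400 sampled alleles, or None if fewer than 200 lines have a call.
    """
    if rng is None:
        rng = random.Random()
    called = [g for g in genotypes if g in VALID_BASES]
    if len(called) < TARGET_LINES:
        return None
    sample = rng.sample(called, TARGET_LINES)  # sampling without replacement
    counts = {}
    for base in sample:
        # Assumption: each inbred line contributes two identical chromosomes.
        counts[base] = counts.get(base, 0) + 2
    return counts
```

Because sites with 5 or more missing lines are already masked, every remaining site has at least 201 called lines, so the `None` branch should not be reached for unmasked sites.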

Filter	Coverage
Genomic filters
Transposable elements	22.7 Mb
Seg. duplications	2.1 Mb
Simple repeats	6.8 Mb
CpG dinucleotides	14.4 Mb
union:	42 Mb
Missing data filters
Missing data in DGRP freeze 2	0.69 Mb
union:	0.69 Mb
Total
total positions filtered:	42.7 Mb
total unfiltered:	119.6 Mb

Filters used for INSIGHT analysis of D. melanogaster data. Links to BED files are provided, and genomic coverage is given in megabases (Mb).
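The four genomic filters in the table sum to 46.0 Mb, yet their union covers 42 Mb, which indicates that the filters overlap. A union coverage of this kind can be computed by merging intervals before counting bases. The sketch below assumes half-open, 0-based (chrom, start, end) tuples as in BED files; the function name `union_coverage` is hypothetical and the code is not the tool used to produce the table.

```python
def union_coverage(intervals):
    """Count bases covered by (chrom, start, end) intervals, overlaps once.

    Intervals are half-open and 0-based, following the BED convention.
    """
    total = 0
    current_chrom, current_start, current_end = None, 0, 0
    for chrom, start, end in sorted(intervals):
        if chrom != current_chrom or start > current_end:
            # Close the previous merged interval and start a new one.
            total += current_end - current_start
            current_chrom, current_start, current_end = chrom, start, end
        else:
            # Overlapping or adjacent interval: extend the current one.
            current_end = max(current_end, end)
    return total + (current_end - current_start)
```

For instance, the intervals 2L:0-10 and 2L:5-15 cover 15 bases together rather than 20, because positions 5 through 9 are counted only once.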

4. Putative neutral sites

Estimates of neutral model parameters were computed by considering a collection of putative neutral sites that pass our filters. The collection of putative neutral sites was determined by eliminating sites likely to be under selection: (1) exons of annotated protein-coding genes and the 50 bp flanking them, and (2) conserved noncoding elements (identified by phastCons) and 25 bp flanking them. While a fraction of the remaining sites is likely to be functional, this set should be dominated by sequence evolving under neutral drift.
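The selection of putative neutral sites amounts to padding the annotated elements by their flanks and removing them, together with filtered positions, from the region under consideration. The sketch below illustrates this on a single small region; the function names, the half-open 0-based interval convention, and the use of explicit position sets (practical only for toy examples) are assumptions made for this illustration.

```python
EXON_FLANK = 50       # bp removed on each side of protein-coding exons
CONSERVED_FLANK = 25  # bp removed on each side of conserved noncoding elements


def expand(intervals, flank):
    """Pad half-open (start, end) intervals by `flank` bases on both sides."""
    return [(max(0, start - flank), end + flank) for start, end in intervals]


def putative_neutral(region_length, exons, conserved, filtered):
    """Return positions in [0, region_length) that remain putatively neutral.

    `exons` and `conserved` are lists of half-open (start, end) intervals;
    `filtered` is an iterable of positions already removed by the filters.
    """
    excluded = set(filtered)
    padded = expand(exons, EXON_FLANK) + expand(conserved, CONSERVED_FLANK)
    for start, end in padded:
        excluded.update(range(start, min(end, region_length)))
    return [pos for pos in range(region_length) if pos not in excluded]
```

In a 300 bp region with one exon at 100-120, one conserved element at 250-260, and two filtered positions, the padded exon removes 120 positions and the padded element removes 60, leaving 118 putative neutral sites.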

5. Genomic blocks

For estimation of genome-wide neutral polymorphism and divergence rates, we used a fixed collection of non-overlapping 1 kb windows. We used the putative neutral sites in each 1 kb window to estimate a neutral polymorphism rate (θ_b) and a neutral divergence rate (λ_b), and those estimates were then associated with the corresponding genomic block.
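Assigning putative neutral sites to blocks can be sketched as below. Only the grouping step is shown; the estimation of θ_b and λ_b is performed by INSIGHT and is not reproduced here. The function names, 0-based coordinates, and windows starting at position 0 of each chromosome are assumptions made for this example.

```python
from collections import defaultdict

BLOCK_SIZE = 1000  # 1 kb non-overlapping windows


def block_index(position):
    """Return the index of the 1 kb window containing a 0-based position."""
    return position // BLOCK_SIZE


def neutral_sites_by_block(neutral_positions):
    """Group putative neutral positions by the window they fall in."""
    blocks = defaultdict(list)
    for pos in neutral_positions:
        blocks[block_index(pos)].append(pos)
    return dict(blocks)
```

Under these assumptions, positions 0 through 999 fall in block 0, positions 1000 through 1999 in block 1, and so on, so each site belongs to exactly one block.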