Variant Detection in 12-flies Genomes

Summary

In the table below, I've summarized the list of variants found genome-wide and those that overlap with D. melanogaster orthologs. Note that there are 342 total known and candidate miRNAs in D. melanogaster.

Species	Total Variants detected (genome-wide)	Genome-wide: in Both	Genome-wide: in sRNAseq only	Genome-wide: in Trace only	Variants overlapping dm3 miRNA orthologs	Ortholog overlap: in Both	Ortholog overlap: in sRNAseq only	Ortholog overlap: in Trace only	Valid Variants	Number of orthologs	Number of orthologs with overlaps	Notes
droSim1	678685 [link]	52	286	678347	133 [link]	2	4	127	133	321	78	Simulans is a mosaic genome; therefore, most (127/133) variants were detected only from trace data.
droSec1	44259 [link]	19	10264	33976	4 [link]	0	4	0	4	339	3
droYak2	27708 [link]	13	1298	26397	1 [link]	0	1	0	1	328	1
droEre2	41225 [link]	10	6348	34867	16 [link]	1	5	10	14	332	8
droAna3	91346 [link]	40	11869	79437	3 [link]	0	3	0	3	292	2
dp4	106803 [link]	83	8904	97816	8 [link]	0	6	2	7	276	7
droPer1	40702 [link]	2	4470	36230	4 [link]	0	4	0	3	266	3
droWil1	76942 [link]	262	7620	69060	4 [link]	0	4	0	0	249	4	All variants found appear to be RNA editing.
droVir3	104930 [link]	19	9525	95386	5 [link]	0	4	1	5	247	4
droMoj3	80342 [link]	43	13153	67146	5 [link]	0	4	1	5	245	5
droGri2	286111 [link]	57	3090	282964	28 [link]	0	2	26	28	240	20	Similar to droSim1, most (26/28) variants were detected only from the trace data.

Orange columns- Variants found genome-wide. This count is further broken down into variants found in both sRNAseq and trace reads and variants found in only one of the two.

Blue columns- Variants that overlap miRNA ortholog coordinates. This count is broken down the same way: found in both sRNAseq and trace reads, or in only one of the two.

Pink column- Variants from the blue columns that were confirmed as valid by visual inspection.

Purple column- Number of miRNA orthologs containing at least one variant.

Definitions:

Variant- A disagreement between the reference base and the most likely base supported by the sRNAseq data and/or the assemblies' trace data. A variant is called if less than 50% of the sRNAseq reads or less than 50% of the trace reads support the reference base. Indels are not yet accounted for in this definition. However, indels are sparse throughout the small RNAseq reads, and I don't believe they amount to a significant number.
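
To make the 50% rule concrete, here is a minimal Python sketch of the call at a single position. The function names (call_variant, marker_for), the per-base count dictionaries, and the choice to treat "both" as "called in each data set independently" are my assumptions for illustration; this is not the pipeline's actual code.

```python
def call_variant(ref_base, counts, min_ref_fraction=0.5):
    """Return the most supported non-reference base, or None if no variant.

    counts maps each base ('A', 'C', 'G', 'T') to the number of reads
    supporting it at this position (hypothetical input layout).
    """
    total = sum(counts.values())
    if total == 0:
        return None  # no coverage, so no call
    ref_fraction = counts.get(ref_base, 0) / total
    if ref_fraction >= min_ref_fraction:
        return None  # at least 50% of reads support the reference
    # Fewer than 50% of reads support the reference: report the best alternative.
    return max((b for b in counts if b != ref_base), key=lambda b: counts[b])


def marker_for(rna_call, trace_call):
    """Marker used in the linked files: 1 = both, 2 = sRNAseq only, 3 = trace only."""
    if rna_call and trace_call:
        return 1
    if rna_call:
        return 2
    if trace_call:
        return 3
    return None  # no variant in either data set


rna = call_variant("A", {"A": 2, "C": 8, "G": 0, "T": 0})
trace = call_variant("A", {"A": 3, "C": 0, "G": 0, "T": 0})
print(rna, trace, marker_for(rna, trace))
```

In this made-up example the sRNAseq reads support a C at a reference A while the trace reads agree with the reference, so the position would get marker 2.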

Legend for linked files (by column)

Chromosome (Scaffold) name
Genomic Position
Reference base
Marker - 1 = "Variant exists in both RNAseq and trace data", 2 = "Variant only exists in RNAseq data", 3 = "Variant exists in only trace data".
Fraction of RNAseq reads that support an 'A' base
Fraction of RNAseq reads that support a 'C' base
Fraction of RNAseq reads that support a 'G' base
Fraction of RNAseq reads that support a 'T' base
Total RNAseq reads at this position
Fraction of Trace reads that support an 'A' base
Fraction of Trace reads that support a 'C' base
Fraction of Trace reads that support a 'G' base
Fraction of Trace reads that support a 'T' base
Total Trace reads at this position
miRNA ortholog containing the variant
Visual confidence assigned by Jaaved. V = "Valid", A = "Ambiguous" + explanation of why it is ambiguous, I = "Invalid"
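
For anyone who wants to read the linked files programmatically, here is a small sketch that parses one row into named fields in the column order listed above. It assumes the files are tab-delimited and that the last two columns (ortholog and visual confidence) may be missing in the genome-wide files; both are assumptions, and the example row is made up.

```python
from typing import NamedTuple, Optional

BASES = "ACGT"
MARKERS = {1: "both", 2: "sRNAseq only", 3: "trace only"}


class VariantRow(NamedTuple):
    chrom: str
    pos: int
    ref: str
    marker: int
    rna_fractions: dict
    rna_total: int
    trace_fractions: dict
    trace_total: int
    ortholog: Optional[str]
    confidence: Optional[str]


def parse_row(line):
    """Parse one tab-delimited row (assumed layout) into a VariantRow."""
    f = line.rstrip("\n").split("\t")
    return VariantRow(
        chrom=f[0],
        pos=int(f[1]),
        ref=f[2],
        marker=int(f[3]),
        rna_fractions=dict(zip(BASES, map(float, f[4:8]))),
        rna_total=int(f[8]),
        trace_fractions=dict(zip(BASES, map(float, f[9:13]))),
        trace_total=int(f[13]),
        ortholog=f[14] if len(f) > 14 else None,
        confidence=f[15] if len(f) > 15 else None,
    )


# Hypothetical row, for illustration only.
example = "scaffold_0\t12345\tA\t2\t0.1\t0.9\t0.0\t0.0\t20\t1.0\t0.0\t0.0\t0.0\t3\tdme-mir-example\tV"
row = parse_row(example)
print(row.chrom, row.pos, MARKERS[row.marker], row.rna_fractions["C"])
```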

If you'd like to look at the pileup of reads at each genomic position, I've uploaded the trace and sRNAseq reads to my personal genome browser (Jaaved's UCSC Genome Browser). Tracks for the D. melanogaster miRNA orthologs are also visible for each genome.

Mitigating Errors

Without mitigating errors in the pileup and variant detection analysis, hundreds of variants per species would be found in the miRNA orthologs. Many of these appear to be false positives, for three main reasons.

1. Cross mapping of reads

Many of the short reads cross-map to multiple regions of the genome, especially regions containing miRNA families. Rather than using all mapped reads, I kept only the more confidently mapped ones. BWA reports a confidence for each read mapping (the MAPQ score), which is Phred-scaled. I required a MAPQ of at least 1. The score encodes the probability that a mapping is wrong: P(wrong) = 10^(-MAPQ / 10). For reference, a MAPQ of 0 indicates P(wrong) = 1, a MAPQ of 1 indicates P(wrong) = 0.79, and a MAPQ of 37 indicates P(wrong) = 0.0002. As you can see, reads with a score of 0 should not be trusted.
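
The reference values above follow from the standard Phred conversion, P(wrong) = 10^(-MAPQ / 10). Here is a short sketch of that conversion and of the MAPQ >= 1 cutoff. The (read_name, mapq) tuple layout and the function names are assumptions for illustration, not the format the pipeline actually uses.

```python
def mapq_to_p_wrong(mapq):
    """Convert a Phred-scaled MAPQ into the probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


def confidently_mapped(reads, min_mapq=1):
    """Keep reads with MAPQ >= min_mapq.

    reads is a list of (read_name, mapq) tuples (hypothetical layout).
    """
    return [read for read in reads if read[1] >= min_mapq]


for q in (0, 1, 37):
    print(q, round(mapq_to_p_wrong(q), 4))

reads = [("read_a", 0), ("read_b", 1), ("read_c", 37)]
print(confidently_mapped(reads))
```

On a BAM file, one common way to apply the same cutoff is samtools view with -q 1, which skips alignments whose MAPQ is below 1.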

There is also a fine balance between using all reads and using only uniquely mapped reads. The risk of inferring spurious variants from all mapped reads has already been described. Conversely, using only uniquely mapped reads can also distort the variant calls. For example, for miRNAs with identical mature or star sequences, such as 6-1, 6-2 and 6-3, removing the reads that map to more than one locus may reduce the total read coverage at a particular position, thus inflating the fraction of mismatched bases contributed by suboptimal reads. In 6-3 below, the browser screenshot suggests an A->C SNP, but the pileup of all reads does not support this variant. Such cases are few and can be dealt with on a case-by-case basis, so we'll use only uniquely mapped reads.

2. Overlapping Orthologs

In addition to cross-mapped reads, the orthologs for some miRNAs (especially those belonging to the same family) appear to overlap. This confounds the count of miRNAs with variants. Here is an example, where the D. sechellia ortholog of dme-mir-981-1 overlaps with several other orthologs:

An ortholog reported as the best hit by LASTZ may overlap with the ortholog found for another miRNA in the same family. As a tie breaker, I looked at each set of overlapping orthologs in depth and either marked one of them as missing (the more common outcome) or examined the lower-confidence LASTZ hits and chose the one that fits the expected genomic location relative to the neighboring orthologs.
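
Finding the overlapping orthologs that need this manual tie-breaking can be automated. Here is a minimal sketch that sorts ortholog intervals by scaffold and start and reports every overlapping pair. The (name, scaffold, start, end) tuple layout, the 0-based half-open coordinates, and the example names and positions are all assumptions for illustration.

```python
def find_overlapping_orthologs(orthologs):
    """Return all pairs of ortholog names whose intervals overlap.

    orthologs is a list of (name, scaffold, start, end) tuples with
    0-based, half-open coordinates (assumed convention).
    """
    ordered = sorted(orthologs, key=lambda o: (o[1], o[2]))
    pairs = []
    for i, (name_a, scaf_a, start_a, end_a) in enumerate(ordered):
        for name_b, scaf_b, start_b, end_b in ordered[i + 1:]:
            # Sorted by scaffold then start, so once an interval starts at or
            # after end_a (or sits on another scaffold) no later one can overlap.
            if scaf_b != scaf_a or start_b >= end_a:
                break
            pairs.append((name_a, name_b))
    return pairs


# Hypothetical coordinates, for illustration only.
orthologs = [
    ("ortholog_A", "scaffold_1", 100, 190),
    ("ortholog_B", "scaffold_1", 150, 240),
    ("ortholog_C", "scaffold_1", 240, 330),
    ("ortholog_D", "scaffold_2", 120, 200),
]
print(find_overlapping_orthologs(orthologs))
```

The flagged pairs still need the manual decision described above; the sketch only finds them.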

3. 3' untemplated additions

Finally, many of the false-positive hits of the variant detection pipeline turned out to be 3' untemplated additions, or RNA editing events. One example is the D. persimilis ortholog of dme-mir-1010, a mirtron. Notice that the C at position 8195080 is supported by the 3 assembly reads but by none of the RNAseq reads, mainly because of 3' uridylation.

We would like to distinguish these RNA editing events from bona fide genomic variants in this analysis. Therefore, any variant detected at the end of a read, or one base pair away from the end, was removed.
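
Here is a minimal sketch of this end filter for a single ungapped read alignment (indels are ignored, as elsewhere in this analysis). The function name, the inputs, and the choice to apply the margin to both ends of the read are assumptions for illustration; the example sequences are made up.

```python
def internal_mismatches(read_seq, ref_seq, read_start, end_margin=2):
    """Return (genomic_pos, ref_base, read_base) for mismatches away from read ends.

    read_seq and ref_seq are the aligned read and the reference bases under it
    (same length, ungapped). Mismatches in the first or last end_margin bases,
    i.e. at the end or one base away from it, are dropped.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference segments must have equal length")
    n = len(read_seq)
    kept = []
    for i, (read_base, ref_base) in enumerate(zip(read_seq, ref_seq)):
        if read_base == ref_base:
            continue
        if i < end_margin or i >= n - end_margin:
            continue  # likely untemplated addition or editing near a read end
        kept.append((read_start + i, ref_base, read_base))
    return kept


ref = "ACGTACGTACGC"
uridylated = "ACGTACGTACGT"  # differs only at the last base (3' end)
internal = "ACGTAAGTACGC"    # differs at an internal base
print(internal_mismatches(uridylated, ref, read_start=1000))
print(internal_mismatches(internal, ref, read_start=1000))
```

The first read contributes no mismatch because its only difference sits on the 3' end, while the internal mismatch in the second read is kept for variant calling.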

Updated: 8/15/2011