Intuitively, it is easy to grasp that the greater the sequencing depth (i.e., the more reads covering any given position in the genome), the more accurate the calling of SNPs and indels (insertions/deletions). But how much difference does this actually make in the real world? Is 20X coverage dramatically worse than 30X (considered the standard depth for whole genomes)?
To find out, I conducted an experiment with bcbio, a bioinformatics pipeline built in Python that allows for automated and reproducible analyses on high-performance computing clusters. One feature of bcbio is that it can perform validation surveys using high-confidence consensus calls from reference samples like NA12878, a Coriell sample characterized by the Genome in a Bottle project.
For NA12878, researchers collated consensus SNP and indel calls from a wide variety of sequencing technologies and calling methods to produce a very high-confidence callset for training other methods or validating a sequencing workflow. bcbio includes these variant calls and can easily be set up to validate them against a sequenced NA12878 genome.
The sequencing depth experiment
I started with an NA12878 genome sequenced to 30X depth. To compare shallower depths, I subsampled the data to generate 20X, 10X, and other datasets (note: the data were not subsetted randomly; rather, "slices" were taken from the 30X dataset). To obtain a 60X datapoint, I combined data from two sequencing runs on both flow cells of a HiSeq 4000 instrument.
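The experiment above used contiguous slices of the 30X data; a common alternative is random read subsampling. The following is a minimal Python sketch of that alternative, with the read set, seed, and depths being illustrative values, not the actual experimental data:

```python
import random

def subsample_fraction(target_depth, source_depth=30):
    """Fraction of reads to keep to go from source_depth to target_depth."""
    return target_depth / source_depth

def subsample(reads, target_depth, source_depth=30, seed=42):
    """Randomly keep reads to approximate target_depth coverage.

    Illustrative only -- the experiment described above took contiguous
    slices of the 30X dataset rather than sampling reads at random.
    """
    frac = subsample_fraction(target_depth, source_depth)
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [r for r in reads if rng.random() < frac]

# Example: thin a (synthetic) 30X read set down to roughly 20X.
reads = [f"read_{i}" for i in range(30_000)]
kept = subsample(reads, target_depth=20)
```

In practice this thinning would be done on the BAM file itself (e.g., with samtools) rather than on read names in memory, but the kept fraction is computed the same way.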
The results after validation are shown in Figure 1 (depth of coverage is along the x-axis):
The figure shows that, as expected, as sequencing depth decreases the error rate increases and SNP discovery declines. It also supports the commonly held view that 30X is sufficient coverage for whole genomes, since going to 60X yields an almost unnoticeable improvement in the percentage of variants found along with a slight increase in error. Performance really degrades at 12X and below, with poor discovery rates and unacceptably high error rates.
I will be submitting a short manuscript to biorxiv.org soon describing this work in more detail.
A recent collaboration with a clinician here at UI Hospitals and Clinics introduced me to the idea of mutational signatures in cancer. Characterizing mutational signatures has been made possible by the falling cost and increasing accuracy of whole-genome sequencing. Tumors are sequenced across the entire genome, and the catalog of somatic mutations (i.e., SNPs) is used to compute the mutational signatures of the tumor's genome.
The idea is that the collection of somatic mutations found in a tumor is the result of a variety of defective DNA-repair or DNA-replication machinery, combined with the action of known or unknown mutagens and environmental exposures. These processes operate over time and leave a "footprint" in the tumor DNA that can be examined. The sum of all of the mutational processes operating within a tumor cell is a distinct mutational "signature" that differs by tumor type.
For example, in lung cancer, the bulk of somatic mutations are C>A transversions resulting from chronic exposure to tobacco smoke. In melanoma, the predominant mutation type is C>T and CC>TT at dipyrimidines, a mutation type associated with UV-light exposure. And in colorectal cancer, defective DNA mismatch repair contributes the majority of the mutations.
A recent paper in Nature formalized this notion of mutational signatures in tumors and provided a mathematical framework (written in MATLAB) for assessing how many and which signatures are operational within an uncharacterized tumor type (generally between 2 and 6 processes).
In the paper, the authors analyzed almost 5 million somatic cancer SNPs and identified 21 unique signatures of mutational processes through a mathematical process of deconvolution, followed by experimental validation. A curated catalog of the most current signatures based on available sequence data can be found at the COSMIC database.
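The core of that deconvolution is non-negative matrix factorization: the observed mutation catalog is approximated as a product of a signatures matrix and an exposures matrix. The paper's framework is in MATLAB; below is a simplified NumPy stand-in on synthetic data, with the sizes (96 trinucleotide contexts, 3 signatures, 10 tumors) chosen purely for illustration:

```python
import numpy as np

def nmf(V, k, n_iter=200, seed=0):
    """Approximate non-negative V (m x n) as W @ H via multiplicative updates.

    Columns of W play the role of candidate signatures; H gives the
    per-sample exposures. A simplified sketch, not the paper's algorithm.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(n_iter):
        # Standard Lee-Seung updates for the Frobenius objective.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy catalog: 96 mutation contexts x 10 tumors, generated from
# 3 hidden "signatures" (all sizes and values are synthetic).
rng = np.random.default_rng(1)
true_sigs = rng.random((96, 3))
true_expo = rng.random((3, 10)) * 100
V = true_sigs @ true_expo

W, H = nmf(V, k=3)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

On real data the hard part is choosing k (how many processes are operating) and matching the recovered columns of W to known signatures, which is what the paper's framework automates.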
In part 2 of this post, I'll go into more detail on the mutational signatures and link to some Python code I've written to help get flat-file lists of SNPs into the correct form for easy input into the MATLAB framework.
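As a preview of what that preparation involves, one key normalization step is collapsing each substitution to its pyrimidine-centered equivalent (the convention used by the signature catalogs, which report six substitution classes such as C>A and C>T). A minimal sketch, with the input format being a hypothetical list of (ref, alt) pairs:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def pyrimidine_class(ref, alt):
    """Collapse a substitution to its pyrimidine-centered form.

    E.g. a G>T call is reported as C>A, the equivalent change on the
    complementary strand.
    """
    if ref in ("C", "T"):  # already pyrimidine-centered
        return f"{ref}>{alt}"
    return f"{COMPLEMENT[ref]}>{COMPLEMENT[alt]}"

def count_classes(snps):
    """Tally the six substitution classes from (ref, alt) pairs.

    A full 96-class catalog would also incorporate the two flanking
    bases, which requires a reference-genome lookup not shown here.
    """
    counts = {}
    for ref, alt in snps:
        key = pyrimidine_class(ref, alt)
        counts[key] = counts.get(key, 0) + 1
    return counts

# G>T and C>A land in the same class once strand is normalized.
tally = count_classes([("G", "T"), ("C", "A")])
```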
People often assume that the world tomorrow will be pretty much like the world today. We all have an in-built bias towards linear thinking when we ponder the future. Although a linear bias served us well for thousands of years of our evolution, technology today changes at an exponential pace, and to better anticipate future market opportunities and technology's impact on society, it is crucial to think in terms of exponential trends. This is a point that renowned futurist Ray Kurzweil has made in his many books and speeches over the last several decades.
One example of an exponential trend in biology (among many) is the cost per genome sequence (graph below). As recently as 2001, the cost to sequence a genome was an astronomical $100M. Between 2001 and 2007, the cost decreased exponentially (a straight line on a log plot), to the point where a genome in 2007 cost only $10M to sequence. Around 2007, a paradigm shift in technology massively accelerated this exponential process, and the cost decreased even faster than before, hitting just $10K in 2012.
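The prices quoted above imply a striking change in the cost halving time around 2007. A quick back-of-the-envelope check, using only the figures already mentioned:

```python
import math

def halving_time(cost_start, cost_end, years):
    """Years for the cost to halve, assuming a constant exponential decline."""
    return years * math.log(2) / math.log(cost_start / cost_end)

# 2001-2007: $100M -> $10M over 6 years.
pre_2007 = halving_time(100e6, 10e6, 6)    # ~1.8 years per halving
# 2007-2012: $10M -> $10K over 5 years.
post_2007 = halving_time(10e6, 10e3, 5)    # ~0.5 years per halving
```

So after the 2007 technology shift, sequencing cost was halving roughly every six months, more than three times faster than in the earlier period (and well ahead of Moore's-law-style doubling rates).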
As economists are fond of saying, when the price falls, more is demanded. As a result of this massively reduced sequencing price, many more partial and complete genomes are being sequenced than ever before. The dramatic, exponential gains in price/performance of sequencing technology have unleashed a tidal wave of sequence data.