Category Archives: variant discovery

Genomic landscape of metastatic cancer

Integrative genomics sheds new light on metastatic cancer

A new study from the University of Michigan Comprehensive Cancer Center has just been released that represents an in-depth look at the genomics of metastatic cancer, as opposed to primary tumors.   This work involved DNA- and RNA-Seq of solid metastatic tumors of 500 adult patients, as well as matched normal tissue sequencing for detection of somatic vs. germline variants.


A good overview of the study at the level of scientific layperson can be found in this press release.  It summarizes the key findings (many of which are striking and novel):

  • A significant increase in mutational burden of metastatic tumors vs. primary tumors.
  • A long-tailed distribution of mutational frequencies (i.e., few genes were mutated at a high rate, yet many genes were mutated).
  • About twelve percent of patients harbored germline variants that are suspected to predispose to cancer and metastasis, and 75% of those variants were in DNA repair pathways.
  • Across the cohort, 37% of patient tumors harbored gene fusions that either drove metastasis or suppressed the cells anti-tumor functions.
  • RNA-Seq showed that metastatic tumors are significantly de-differentiated, and fall into two classes:  proliferative and EMT-like (endothelial-to-mesenchymal transition).

 A brief look at the data

This study provides a high-level view onto the mutational burden of metastatic cancer vis-a-vis primary tumors.  Figure 1C from the paper shows the comparison of mutation rates in different tumor types in the TCGA (The Cancer Genome Atlas) primary tumors and the MET500 (metastatic cohort).

Mutational burden in metastatic cancer compared to primary tumors.


Here we can see that in most cases (colored bars), metastatic cancers had statistically significant increases in mutational rates.   The figure shows that tumors with low mutational rates “sped up” a lot as compared with those primary tumor types that already had high rates.

Supplemental Figure 1d (below) shows how often key tumor suppressor and oncogenes are altered in metastatic cancer vs. primary tumors.  TP53 is found to be altered more frequently in metastatic thyroid, colon, lung, prostate, breast, and bladder cancers.   PTEN is mutated more in prostate tumors.  GNAS and PIK3CA are mutated more in thymoma, although this finding doesn’t reach significance in this case.  KRAS is altered more in colon and esophagus cancers, but again, these findings don’t reach significance after multiple correction.

Comparison of genetic alteration frequencies in metastatic and primary tumors.


One other figure I’d like to highlight briefly is Figure 3C from the paper, shown below:

Molecular structure of novel, potentially activating gene fusions in the metastatic tumors.

I wanted to mention this figure to illustrate the terrifying complexity of cancer.   Knowing which oncogenes are mutated, in which positions, and the effects of those mutations on gene expression networks is not enough to understand tumor evolution and metastasis.  There are also new genes being created that do totally new things, and these are unique on a per tumor basis.   None of the above structures have ever been observed before, and yet they were all seen from a survey of just 500 cancers.   In fact, ~40% of the tumors in the study cohort harbored at least one fusion suspected to be pathogenic.

There is much more to this work, but I will leave it to interested readers to go read the entire study.   I think this work is obviously tremendously important and novel, and represents the future of personalized medicine.  That is, a patient undergoing treatment for cancer will have their tumor or tumors biopsied and sequenced cumulatively over time to understand how the disease has evolved and is evolving, and to ascertain what weaknesses can be exploited for successful treatment.

Unix one-liner to convert VCF to Oncotator format

Here is a handy unix one-liner to process mutect2 output VCF files into the 5 column, tab-separated format required by Oncotator for input (Oncotator is a web-based application that annotates human genomic point mutations and indels with transcripts and consequences). The output of Oncotator is a MAF-formatted file that is compatible with MutSigCV.

for file in $FILES
zcat $file | grep -v "GL000*" | grep -v "FILTER" | grep "PASS" | cut -d$'\t' -f 1-5 | awk '$3=$2' | awk '$1="chr"$1' > $file.tsv

Breaking this down we have:

“zcat $file” :  read to stdout each line of a gzipped file

“grep -v “GL000*” :  exclude any variant that doesn’t map to a  named chromosome

“grep -v “FILTER” : exclude filter header lines

“grep “PASS””:  include all lines that pass mutect2 filters

“cut -d$’\t’ -f 1-5”  : cut on tabs and keep fields one through five

“awk ‘$3=$2’ :  set column 3 equal to column 2, i.e., start and end position are equal

“awk $1=’chr’$1″” : set column one equal to ‘chr’ plus column one (make 1 = chr1)


Variant annotation and transcript choice


Transcript choice between methods

Variant annotation methods do not all behave the same way when choosing transcripts to annotate against.  This leads to differing outcomes in annotations which may arise from different logic structures in the algorithms or different user criteria for annotation.

Unfortunately, incorrect annotations or disagreement in annotation outcomes can lead investigators to waste resources tracking down variants of little interest or to miss severe variants of potential clinical significance.

In this first post in a series, I’ll talk briefly about differing outcomes owing to transcript choices when three popular methods (ANNOVAR, VEP, and SNPEff) are applied to a dataset of 81 million variants from the 1000Genomes project.

In this figure you can see the lack of concordance owing to transcript choice affects a surprisingly large number of variants.

variant annotation

This disagreement is largely owing to the way that intergenic variants are handled, assigning them to nearest genes or arbitrary categories like “unknown.”

To learn more about this problem and other issues with annotators and concordance between methods, check out our recent paper at biorxiv.   In part two, I’ll talk more about concordance between methods when annotators agree on transcript choice.

Sequencing depth for accurate SNP calling: bcbio case study

Intuitively, it is easy to grasp that the more sequencing depth (i.e., the greater the number of reads covering any given position in the genome) the more accurate the calling of SNPs and indels (insertions/deletions).   But how much difference does this actually make in the real world?  Is 20X coverage dramatically worse than 30X (considered a standard coverage depth on genomes)?

To find out, I conducted an experiment with the bcbio pipeline, a bioinformatics pipeline solution built in python that allows for automated and reproducible analyses on high-performance computing clusters.  One feature of bcbio is that it can perform validation surveys using high-confidence consensus calls from reference genomes like the NA12878 Coriell sample (from the Genome in a Bottle project).
For NA12878, researchers collated consensus SNP and indel calls from a large variety of sequencing technologies and calling methods to produce a very high-confidence callset for training other methods or validating a sequencing workflow.  bcbio includes these variant calls and can easily be setup to validate these calls against a sequenced NA12878 genome.

The sequencing depth experiment

I started with a NA12878 genome sequenced to 30X sequencing depth.  To compare shallower depths, I subsampled the data to generate 20X, 10X, etc…  [Please note: data was not subsetted randomly, rather “slices” were taken from the 30X dataset] To look at a 60X coverage datapoint, I combined data from two sequencing runs on both flow cells of a HiSeq4000 instrument.

The results after validation are shown in Figure 1 (depth of coverage is along the x-axis):

sequencing depth
Fig 1. SNP discovery as a function of increasing coverage of the GIAB validation sample.


The figure shows that, as expected, when sequencing depth decreases  the error rate increases, and SNP discovery declines.   It also makes the case for the commonly held view that 30X is enough coverage for genomes, since going to 60X leads to almost unnoticeable improvement in the % found and a slight increase in error.  Performance really degrades at 12X and below, with poor discovery rates and unacceptably high error rates.

I will be submitting a short manuscript to soon describing this work in more detail.