Mutational signatures in cancer with DNA-Seq

A recent collaboration with a clinician here at UI hospital and clinics introduced me to the idea of mutational signatures in cancer.  Characterizing mutational signatures is made possible by the falling cost and increasing accuracy of whole-genome sequencing methods.  Tumors are sequenced across the entire genome and the catalog of somatic mutations (i.e, SNPs) is used to compute the mutational signatures of a tumor’s genome.

The idea is that the collection of somatic mutations found in a tumor  are the result of a variety of defective DNA-repair or DNA-replication machinery combined with the action of known or unknown mutagens and environmental exposures.  The processes operate over time and leave a “footprint” in the tumor DNA that can be examined.  These sum of all of the mutational processes operating within a tumor cell is a distinct mutational “signature” that differs by tumor types.

For example, in lung cancer, the bulk of somatic mutations are C>A transversions resulting from chronic exposure to tobacco smoke.  In melanoma, the predominant mutation type is C>T and CC>TT at dipyrimidines, a mutation type associated with UV-light exposure.  And in colorectal cancer, defective DNA mismatch repair contributes the majority of the mutations.

Mutational signatures of a cancer by the operation of several mutational processes over time.
Mutational signature of a cancer by the operation of several mutational processes over time. From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3990474/figure/fig0005/. Used under CC License BY 3.0.

A recent paper in Nature has formalized this notion of mutational signatures in tumors and provided a mathematical framework (written in MatLab) for assessing how many and which signatures are operational within an uncharacterized tumor type (generally there between 2 and 6 processes).

In the paper, the authors analyzed almost 5 million somatic cancer SNPs and identified 21 unique signatures of mutational processes through a mathematical process of deconvolution, followed by experimental validation.  A curated catalog of the most current signatures based on available sequence data can be found at the COSMIC database.

In part 2 of this post, I’ll go into more detail on the mutational signatures and link to some python code I’ve written to help get flat-file lists of SNPs into the correct form for easy input into the MatLab framework.