Mutational signatures in cancer with DNA-Seq

A recent collaboration with a clinician here at UI hospital and clinics introduced me to the idea of mutational signatures in cancer.  Characterizing mutational signatures is made possible by the falling cost and increasing accuracy of whole-genome sequencing methods.  Tumors are sequenced across the entire genome and the catalog of somatic mutations (i.e, SNPs) is used to compute the mutational signatures of a tumor’s genome.

The idea is that the collection of somatic mutations found in a tumor  are the result of a variety of defective DNA-repair or DNA-replication machinery combined with the action of known or unknown mutagens and environmental exposures.  The processes operate over time and leave a “footprint” in the tumor DNA that can be examined.  These sum of all of the mutational processes operating within a tumor cell is a distinct mutational “signature” that differs by tumor types.

For example, in lung cancer, the bulk of somatic mutations are C>A transversions resulting from chronic exposure to tobacco smoke.  In melanoma, the predominant mutation type is C>T and CC>TT at dipyrimidines, a mutation type associated with UV-light exposure.  And in colorectal cancer, defective DNA mismatch repair contributes the majority of the mutations.

Mutational signatures of a cancer by the operation of several mutational processes over time.
Mutational signature of a cancer by the operation of several mutational processes over time. From Used under CC License BY 3.0.

A recent paper in Nature has formalized this notion of mutational signatures in tumors and provided a mathematical framework (written in MatLab) for assessing how many and which signatures are operational within an uncharacterized tumor type (generally there between 2 and 6 processes).

In the paper, the authors analyzed almost 5 million somatic cancer SNPs and identified 21 unique signatures of mutational processes through a mathematical process of deconvolution, followed by experimental validation.  A curated catalog of the most current signatures based on available sequence data can be found at the COSMIC database.

In part 2 of this post, I’ll go into more detail on the mutational signatures and link to some python code I’ve written to help get flat-file lists of SNPs into the correct form for easy input into the MatLab framework.




Concatenate several lanes of Illumina HiSeq reads quickly

If you have raw Illumina HiSeq reads or MiSeq run across several lanes of data, you may need to concatenate them together at the command line before trimming and merging them together. Raw data from Illumina sequencers generally follows a standard naming convention. The files might look like this:


Where ’16’ is the sample ID, followed by the barcode sequence, ‘L00X’ is the lane number, ‘R1’ means forward reads.

An easy way to script this quickly is as follows:

cat `find . -maxdepth 1 | egrep '16.*R1'` > 16_TAGGCGA_R1.fastq &

There’s a lot going on here, so let’s unpack this briefly.

The backquotes (` ) mean that the output of the command is presented as command-line parameters to the enclosing command.

So in this case the ‘find’ command is listing all file names in the current directory at a depth of 1 (no recursion into lower directories).

Next, this is pipe’d (with |) to egrep which searches the list of filenames for a regular expression that indicates a string starting with 16, followed by any number of any characters, then an ‘R1’. Since the expression matches whole strings, this will find the above files.

These filenames are then passed to ‘cat’ as command line arguments. The concatenated files are then redirected with the greater than (‘>’) to the new fastq file. The ‘&’ indicates that the shell run this in the background.

I hope this is useful. I spent about an hour tracking this all down, so I wouldn’t have to process dozens of samples by hand.

Create a volcano plot on EBSeq output

The differential expression analysis program EBSeq produces a number of data objects as part of the workflow, but there aren’t many options for visualization of the data.

The authors suggest the use of heatmap.2 in R:

heatmap.2(NormalizedMatrix[GenesOfInterest,], scale=”row”, trace=”none”, Colv=F)

However, this depends on knowing ahead of time your genes of interest.  It is not practical to generate a heatmap with hundreds or thousands of DE genes.

I wanted to produce something approximating a volcano plot for EBSeq results.  What I came up with initially was the following:

A pseudo-volcano plot for EBSeq results.  The y-axis is posterior probability of differential expression (PPDE).
A pseudo-volcano plot for EBSeq results. The y-axis is posterior probability of differential expression (PPDE).

To make this plot, I had to grab some data arrays from the large object “EBOut” that is produced by calling the “EBTest” method and from the “GeneFC” object as below:

plot(GeneFC$PostFC, EBOut$PPDE, 
xlim =c(0,5), ylim=c(0,1), 
main="Control/Experimental FC vs. PPDE",
sub=GeneFC$Direction, xlab="EBSeq Posterior Fold Change", ylab="EBSeq posterior prob of DE")


The “abline” command places a horizontal line where the PPDE is equal to or greater than 95%.  This would be equivalent to an FDR of 0.05.

If you want to inspect the plot interactively in R to identify gene names above the threshold and/or with large posterior fold changes you would use:

identify(GeneFC$PostFC, EBOut$PPDE, labels=names(GeneFC$PostFC))

To make it look more like a canonical volcano plot, I then tried:

plot(log2(GeneFC$RealFC), EBOut$PPDE, 
xlim =c(-5,5), 
main="Log2FoldChange vs. PPDE", 
xlab="EBSeq Log2 Fold Change", ylab="EBSeq PPDE")

Creating the following plot:

Fig 2.  Log(2) Fold Change on the X-axis.
Fig 2. Log(2) Fold Change on the X-axis.

This is good, except I want to subset the data and add colors. To do this I need to create a new dataframe from the EBOut$PPDE and GeneFC$PostFC objects:

volc_df = data.frame(EBOut$PPDE, GeneFC$PostFC)

With everything in one dataframe, plotting and subsetting the data is easier. Inspired by this post at Stephen Turner’s “Getting Genetics Done” blog, I prepared my final colored volcano plot as follows:

with(volc_df, plot(log2(PostFC), PPDE, pch=20, main="Volcano Plot EBSeq", xlim=c(-5,5)))


with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="orange")) 

with(subset(volc_df, PPDE > 0.95 & abs(log2(PostFC)) < 2), points(log2(PostFC), PPDE, pch=20, col="red"))

The final plot looks like this:

Fig 3.  The final plot, with colored points for the DE genes, and higher fold change indicated with red.
Fig 3. The final plot, with colored points for the DE genes, and higher fold change indicated with red. article on crispr/cas9 gene-editing technology

Wired is out with a great article covering the development and innovation of crispr/cas9 technology and its intertwined promise and peril.

Sample graf:

The stakes, however, have changed. Everyone at the Napa meeting had access to a gene-editing technique called Crispr-Cas9. The first term is an acronym for “clustered regularly interspaced short palindromic repeats,” a description of the genetic basis of the method; Cas9 is the name of a protein that makes it work. Technical details aside, Crispr-Cas9 makes it easy, cheap, and fast to move genes around—any genes, in any living thing, from bacteria to people. “These are monumental moments in the history of biomedical research,” Baltimore says. “They don’t happen every day.”

Using the three-year-old technique, researchers have already reversed mutations that cause blindness, stopped cancer cells from multiplying, and made cells impervious to the virus that causes AIDS. Agronomists have rendered wheat invulnerable to killer fungi like powdery mildew, hinting at engineered staple crops that can feed a population of 9 billion on an ever-warmer planet. Bioengineers have used Crispr to alter the DNA of yeast so that it consumes plant matter and excretes ethanol, promising an end to reliance on petrochemicals. Startups devoted to Crispr have launched. International pharmaceutical and agricultural companies have spun up Crispr R&D. Two of the most powerful universities in the US are engaged in a vicious war over the basic patent. Depending on what kind of person you are, Crispr makes you see a gleaming world of the future, a Nobel medallion, or dollar signs.

The technique is revolutionary, and like all revolutions, it’s perilous. Crispr goes well beyond anything the Asilomar conference discussed. It could at last allow genetics researchers to conjure everything anyone has ever worried they would—designer babies, invasive mutants, species-specific bioweapons, and a dozen other apocalyptic sci-fi tropes. It brings with it all-new rules for the practice of research in the life sciences. But no one knows what the rules are—or who will be the first to break them.

Why you should think exponentially to grasp the future of medicine

People often assume that the world tomorrow will be pretty much like the world today.  We all have an in-built bias towards linear thinking when we ponder the future.  Although a linear bias was helpful for thousands of years of our evolution, today technology is changing at an exponential pace and in order to better anticipate future market opportunities and technology’s impact on society, it is crucial to think in terms of exponential trends.  This is a point that renowned futurist Ray Kurzweil has made in his many books and speeches for the last several decades. 

We all have an in-built bias towards linear thinking when we ponder the future.

One example of an exponential trend in biology (among many) is the cost per genome sequence (graph below).  As recently as 2001, the cost to sequence a genome was an astronomical $100M.  Between 2001 and 2007, the cost decreased exponentially (a straight line on a log plot), to the point where a genome in 2007 cost only $10M to sequence.  Around 2007, a paradigm shift in technology massively accelerated this exponential process, and the cost decreased even faster than before, hitting just $10K in 2012.


The dramatic, exponential gains in price/performance of sequencing technology have unleashed a tidal wave of sequence data.

As economists are fond of saying, when the price falls, more is demanded.  As a result of this massively reduced sequencing price, many more partial and complete genomes are being sequenced than ever before.  The dramatic, exponential gains in price/performance of sequencing technology have unleashed a tidal wave of sequence data.