Concatenate several lanes of Illumina HiSeq reads quickly

If you have raw Illumina HiSeq reads or MiSeq run across several lanes of data, you may need to concatenate them together at the command line before trimming and merging them together. Raw data from Illumina sequencers generally follows a standard naming convention. The files might look like this:

16_TAAGGCGA-TATCCTCT_L004_R1_001.fastq
16_TAAGGCGA-TATCCTCT_L005_R1_001.fastq
16_TAAGGCGA-TATCCTCT_L006_R1_001.fastq
16_TAAGGCGA-TATCCTCT_L007_R1_001.fastq
16_TAAGGCGA-TATCCTCT_L008_R1_001.fastq

Where ’16’ is the sample ID, followed by the barcode sequence, ‘L00X’ is the lane number, ‘R1’ means forward reads.

An easy way to script this quickly is as follows:

cat find . -maxdepth 1 | egrep '16.*R1' > 16_TAGGCGA_R1.fastq &

There’s a lot going on here, so let’s unpack this briefly.

The backquotes (` ) mean that the output of the command is presented as command-line parameters to the enclosing command.

So in this case the ‘find’ command is listing all file names in the current directory at a depth of 1 (no recursion into lower directories).

Next, this is pipe’d (with |) to egrep which searches the list of filenames for a regular expression that indicates a string starting with 16, followed by any number of any characters, then an ‘R1’. Since the expression matches whole strings, this will find the above files.

These filenames are then passed to ‘cat’ as command line arguments. The concatenated files are then redirected with the greater than (‘>’) to the new fastq file. The ‘&’ indicates that the shell run this in the background.

I hope this is useful. I spent about an hour tracking this all down, so I wouldn’t have to process dozens of samples by hand.

Create a volcano plot on EBSeq output

The differential expression analysis program EBSeq produces a number of data objects as part of the workflow, but there aren’t many options for visualization of the data.

The authors suggest the use of heatmap.2 in R:

heatmap.2(NormalizedMatrix[GenesOfInterest,], scale=”row”, trace=”none”, Colv=F)

However, this depends on knowing ahead of time your genes of interest.  It is not practical to generate a heatmap with hundreds or thousands of DE genes.

I wanted to produce something approximating a volcano plot for EBSeq results.  What I came up with initially was the following:

A pseudo-volcano plot for EBSeq results.  The y-axis is posterior probability of differential expression (PPDE).
A pseudo-volcano plot for EBSeq results. The y-axis is posterior probability of differential expression (PPDE).

To make this plot, I had to grab some data arrays from the large object “EBOut” that is produced by calling the “EBTest” method and from the “GeneFC” object as below:

The “abline” command places a horizontal line where the PPDE is equal to or greater than 95%.  This would be equivalent to an FDR of 0.05.

If you want to inspect the plot interactively in R to identify gene names above the threshold and/or with large posterior fold changes you would use:

To make it look more like a canonical volcano plot, I then tried:

Creating the following plot:

Fig 2.  Log(2) Fold Change on the X-axis.
Fig 2. Log(2) Fold Change on the X-axis.

This is good, except I want to subset the data and add colors. To do this I need to create a new dataframe from the EBOut$PPDE and GeneFC$PostFC objects:

With everything in one dataframe, plotting and subsetting the data is easier. Inspired by this post at Stephen Turner’s “Getting Genetics Done” blog, I prepared my final colored volcano plot as follows:

The final plot looks like this:

Fig 3.  The final plot, with colored points for the DE genes, and higher fold change indicated with red.
Fig 3. The final plot, with colored points for the DE genes, and higher fold change indicated with red.