Importing a merged Seurat dataset into Monocle

I recently ran across a situation that I think will become increasingly common as I do more single-cell analyses. Specifically, I had a project where the investigator had several experiments in related conditions that they wanted to merge and evaluate with a pseudotime analysis. I could not find any tools within Monocle itself for merging data (please correct me in the comments if I’m missing something). It looks as if you have to import a pre-merged Seurat dataset.

Here is the workaround that I found [please note these commands are for Seurat v2; they will likely *not* work in v3]:

Here, I am reading in 10X data using Seurat (v2) with the Read10X function and then creating the Seurat object with CreateSeuratObject.

Once this is done, I use MergeSeurat to merge the first two experiments, and then AddSamples to add in the final experiment. Then we can take advantage of the Monocle function importCDS to import the combined object into Monocle.

Now there is one final problem: after the import, the “orig.ident” field is blank.

 

 

 

To recover the original identity of each cell, we can use the updated cell names from the merged Seurat dataset (e.g., “naive_AAACTGAGAAACCGA”). We just need to split these names to recover which experiment each cell came from.

We do a strsplit on the cell names, splitting on the underscore. The first value from the split in each case is assigned back into the ‘orig.ident’ field of the cell dataset object.
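The workflow in this post is in R, but the split-and-keep-the-prefix logic is language-independent. Here is a minimal sketch of it in Python, purely as an illustration. The cell names other than the “naive_” example are made up, and the helper name `experiment_of` is hypothetical. It assumes the experiment label itself contains no underscore.

```python
# Cell names in the "<experiment>_<barcode>" form produced by the merge.
# Only the first name comes from the post; the others are invented examples.
cell_names = [
    "naive_AAACTGAGAAACCGA",
    "treated_AAACTGAGAAACCGA",
    "naive_TTTGCATGAGAGGCT",
]


def experiment_of(cell_name):
    """Return the text before the first underscore (the experiment prefix)."""
    return cell_name.split("_", 1)[0]


# One experiment label per cell, in the same order as cell_names.
orig_ident = [experiment_of(name) for name in cell_names]
print(orig_ident)
```

Note that the same barcode can appear in more than one experiment, which is why the merge prefixes the names in the first place.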

Now you’re ready to continue with the normal downstream analysis in Monocle. With dimensionality reduction and clustering done (not shown), we can plot the calculated clusters side-by-side with the experiment of origin (from the merged Seurat dataset).

And we get:

The PCA clusters on the tSNE plot (left) and orig.ident values on the tSNE plot (right). I have edited out the identities of the clusters on the right. This is unpublished data, and I am using it here for educational purposes only. Please do not reproduce or copy this image.

 

Artificial neural network classification of copy number variation, part 2

Recap:

Welcome to the second part of this post series on building artificial neural network models for copy number classification.  In the first part, I described the problem with interpreting copy-ratio plots to find clinically-relevant CNV events.  The data from targeted capture deep sequencing are noisy and biased, and finding clinically-relevant genotypes in genes that have CNVs requires the analyst to visualize the CNV event and assign a classification on the basis of experience and expert knowledge.

The LASSO model

Once my training data were in place (see part 1), I used a multiple linear regression LASSO model as a machine-learning benchmark. I did this to determine whether a more powerful neural network model would be warranted. The LASSO model uses an “L1” penalty to perform feature selection, setting some coefficients to zero as warranted by the data. There is ample precedent for applying this type of model in bioinformatics settings where the goal is to maximize predictive power without overfitting.

I fit the LASSO to the data, with 33% held out for validation. The best fit was obtained with the alpha parameter set to 0.001. Ten-fold cross-validation (with alpha=0.001) yielded an accuracy of 76%. These results are surprisingly good, given the complexity of the CNV signals in the noisy data. Unfortunately, 76% accuracy is simply not good enough for an automated method that will be used to predict genotypes in clinical data.
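To see how an L1 penalty can set coefficients exactly to zero, here is a small sketch of the soft-thresholding operator, the update used by coordinate-descent LASSO solvers. In the simplified case of orthonormal features, the LASSO solution is just the unpenalized coefficients passed through this function. The coefficient values and the penalty below are made up for illustration and are not from my data.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; anything within the band becomes 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


# Hypothetical unpenalized coefficients and penalty strength.
unpenalized = [2.5, -0.3, 0.05, -1.2]
penalty = 0.5

shrunk = [soft_threshold(c, penalty) for c in unpenalized]
selected = [i for i, c in enumerate(shrunk) if c != 0.0]
print(shrunk, selected)
```

Large coefficients survive (shrunk by the penalty), while small ones are zeroed out entirely, which is the feature-selection behavior described above.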

The ANN model

Next, I decided to construct an artificial neural network model. My goal was to keep the model as simple as possible while reaching the very high classification accuracy needed for clinical work. To that end, I constructed a one-hidden-layer model with 19 input nodes corresponding to the 19 copy-ratio probes in the CNV data. The output layer contained five nodes, corresponding to the five classes of defined CNV event or other event (for example, a very distinct sequencing artifact that kept appearing in the data).

 

In between the input and output layers I constructed a 10-node hidden layer. A one-hidden-layer neural network is one of the simplest forms of the ANN model, and I tried to keep the number of hidden-layer nodes to a minimum as well. Specific details about the model, hyper-parameter tuning, and the code will be available in the near future when I put a pre-print of this work on bioRxiv.
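To make the 19-10-5 shape concrete, here is a minimal forward-pass sketch in plain Python. Only the layer sizes come from the description above. The ReLU hidden activation, the softmax output, the random initialization, and the input values are all assumptions made for illustration; this is not the actual model code.

```python
import math
import random

N_INPUTS, N_HIDDEN, N_CLASSES = 19, 10, 5  # layer sizes described in the post


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization scheme is an assumption)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw scores to probabilities; subtract the max for numerical stability."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(copy_ratios, hidden, output):
    """19 probe values -> 10 hidden activations -> 5 class probabilities."""
    hidden_out = relu(dense(copy_ratios, *hidden))
    return softmax(dense(hidden_out, *output))


rng = random.Random(0)
hidden_layer = init_layer(N_INPUTS, N_HIDDEN, rng)
output_layer = init_layer(N_HIDDEN, N_CLASSES, rng)

# Made-up copy-ratio values for a single sample.
probe_values = [rng.gauss(1.0, 0.2) for _ in range(N_INPUTS)]
class_probs = forward(probe_values, hidden_layer, output_layer)

# Trainable parameters: (19*10 + 10) + (10*5 + 5)
n_params = (N_INPUTS * N_HIDDEN + N_HIDDEN) + (N_HIDDEN * N_CLASSES + N_CLASSES)
print(class_probs, n_params)
```

A network of this shape has only 255 trainable parameters, which helps explain why it is plausible to train on a small dataset.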

Model training and cross-validation

I trained the model on the 175-sample dataset and on a 350-sample “synthetic” dataset created by adding Gaussian “noise” to the real data. The results are shown below, across 250 training epochs.
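For readers who want a feel for this kind of augmentation, here is a hedged sketch of generating a synthetic set by jittering each real sample with Gaussian noise. The stand-in data, the noise standard deviation, and the choice of two noisy copies per sample (one way to get from 175 to 350) are assumptions for illustration only.

```python
import random


def augment_with_noise(samples, copies_per_sample, sigma, rng):
    """Return noisy copies of each (features, label) pair; labels are unchanged."""
    noisy = []
    for features, label in samples:
        for _ in range(copies_per_sample):
            jittered = [x + rng.gauss(0.0, sigma) for x in features]
            noisy.append((jittered, label))
    return noisy


rng = random.Random(42)

# Stand-in for the real data: 175 samples, 19 copy-ratio probes, 5 classes.
real = [([rng.uniform(0.5, 1.5) for _ in range(19)], rng.randrange(5))
        for _ in range(175)]

# Two noisy copies per sample and sigma=0.05 are hypothetical choices.
synthetic = augment_with_noise(real, copies_per_sample=2, sigma=0.05, rng=rng)
print(len(real), len(synthetic))
```

The noise level matters: too little and the synthetic samples add nothing, too much and the class signal is washed out.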

When the ANN model was tested with 10-fold cross-validation, the accuracy reached 96.5% (+/- 5.4%). This is obviously a big improvement on the LASSO model, and reaches a level of accuracy that is good enough for clinical pipelines (with the caveat that low-confidence predictions will still be checked “by hand”).
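As a reminder of what a 10-fold cross-validation figure summarizes, here is a minimal sketch of the procedure in plain Python. The labels are random stand-ins and the “model” is a trivial majority-class baseline, both hypothetical; it also assumes the +/- value is the standard deviation across folds, which is one common convention.

```python
import random
import statistics
from collections import Counter


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, fit_and_score, rng):
    """Hold out each fold once; return the folds plus mean and SD of the scores."""
    folds = k_fold_indices(n_samples, k, rng)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        scores.append(fit_and_score(train, held_out))
    return folds, statistics.mean(scores), statistics.stdev(scores)


rng = random.Random(1)
labels = [rng.randrange(5) for _ in range(175)]  # made-up class labels


def majority_baseline(train_idx, test_idx):
    """Placeholder model: always predict the most common training label."""
    guess = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == guess for i in test_idx) / len(test_idx)


folds, mean_acc, sd_acc = cross_validate(len(labels), 10, majority_baseline, rng)
print(f"accuracy: {mean_acc:.3f} (+/- {sd_acc:.3f})")
```

With 175 samples each fold holds only 17 or 18 cases, so a single misclassification moves a fold's accuracy by more than five percentage points.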

Below, I’m showing a sample of the model output (left) and ground truth (right) from the test data. The numbers (and colors) of the boxes correspond to the model’s probability for that classification. You can see that most CNV events are called with high probability, but several (yellow boxes) are called correctly but with lower probability. One event (red box) is called incorrectly with high probability.
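The “check low-confidence calls by hand” rule can be sketched as a simple threshold on the top class probability. The 0.90 cutoff and the example probability vectors below are hypothetical, not values from the pipeline.

```python
REVIEW_THRESHOLD = 0.90  # hypothetical cutoff for sending a call to manual review


def triage(class_probs, threshold=REVIEW_THRESHOLD):
    """Return (predicted class index, its probability, needs_manual_review)."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    confidence = class_probs[best]
    return best, confidence, confidence < threshold


# Made-up model outputs over the five classes.
predictions = [
    [0.97, 0.01, 0.01, 0.005, 0.005],  # confident call
    [0.10, 0.62, 0.20, 0.05, 0.03],    # low-confidence call
]

results = [triage(p) for p in predictions]
for predicted, confidence, review in results:
    print(predicted, confidence, "REVIEW" if review else "auto")
```

A filter like this only catches the yellow-box cases. A wrong call made with high probability, like the red box, passes straight through.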

Conclusions and caveats

Going into this project, I had no idea if the ANN model would be able to make predictions on the basis of so few examples in the training set.  The classic examples you see about ANN/CNN models rely on handwriting training sets with 10,000 or more images.   So I was surprised when the model did very well with extremely limited training data.   Since this method was developed for a clinical pipeline, it can be improved as the pipeline generates new training data with each new patient sample.  We would need many thousands of samples through our “legacy” pipeline to see enough examples of the rare star allele events in CYP2D6 that we could then classify them.  That is why I limited my CNV calling to three star alleles.

The low-confidence true positive predictions concern me less than the high-confidence false negative. Missing a real CNV that affects CYP2D6 function, and is therefore clinically relevant, is very dangerous. This can lead to incorrect prescribing and adverse drug reactions for the patient. I really want to understand why the method makes predictions like this, and how to fix it. Unfortunately, I believe it will require a lot more training data to solve this problem, and that is something I lack.

My goals for this project now are 1) to publish a preprint on bioRxiv describing this work and 2) to obtain some additional training/test datasets. Because our pharmacogenomics test is not generating the kind of volume we expected, I may have to look around for another gene with clinically-relevant CNV events to test this method further. For example, we do have an NGS-based test of hearing and deafness genes with thousands of validated patient samples. One gene, STRC, has relevant CNVs that are complex and require analyst visualization to detect. This may be a good system for follow-up refinement of this type of model.