With the growing popularity of single-cell RNA-Seq analysis, the t-SNE projection of multi-dimensional data is appearing more often in publications and online. If you’ve ever wanted to develop a better intuitive feel for what exactly t-SNE does and where it can go wrong, this interactive tutorial (by Martin Wattenberg and Fernanda Viégas) is extremely compelling and useful.
In addition to providing a wonderful, interactive plotting function, the authors include an informative tutorial that explains the pitfalls and challenges of optimizing and tuning the hyper-parameters of t-SNE projections, and how to get the most from the plots. Here is an example:
In the example above, the t-SNE projection reconstructs the structure of the data correctly when the “perplexity” is set between 30 and 50, but the same method fails when the parameter falls outside that range (i.e., too small or too large).
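To get a feel for what the perplexity knob actually controls, here is a minimal sketch in plain Python. It relies on the standard t-SNE definition: for each point, a Gaussian bandwidth (sigma) is chosen so that the distribution over its neighbors has a perplexity of 2 raised to its entropy in bits, which behaves roughly like an "effective number of neighbors." The function names and the toy one-dimensional data are my own assumptions for illustration, not code from the tutorial.

```python
import math

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbor probabilities for one point, given squared distances."""
    d_min = min(sq_dists)  # shift by the minimum to avoid underflow to all zeros
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, iters=100):
    """Bisection (on a log scale) for the bandwidth that hits the target perplexity."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if perplexity(conditional_probs(sq_dists, mid)) > target:
            hi = mid  # neighborhood too wide, shrink sigma
        else:
            lo = mid  # neighborhood too narrow, grow sigma
    return math.sqrt(lo * hi)

if __name__ == "__main__":
    # Hypothetical toy data: one point with 99 neighbors at distances 1, 2, ..., 99.
    sq_dists = [k ** 2 for k in range(1, 100)]
    for target in (5, 30, 50):
        sigma = sigma_for_perplexity(sq_dists, target)
        print(f"perplexity {target:>2} -> sigma {sigma:.2f}")
```

A larger perplexity forces a wider bandwidth, so each point "listens" to more of its neighbors. That is one way to see why a value that is too small fragments the data into local clumps, while a value that is too large blurs the structure together.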
Go check out this distill.pub site. It’s worth your time.
I recently worked on a project looking at tissue-specific nuclease expression. I made this interactive heatmap from the enormous GTEx dataset; it shows just nuclease gene expression (in TPM) across more than 50 tissues in the human body. It’s fun to play around with the interactive plot. This is the way data should be presented in 2017. I used the Plotly Python API for the chart.
Unfortunately, Plotly is now nearly $400/year if you want to use it for anything more than a few charts and there is no free option to keep sensitive research data private. There should be an exception for academic research, but there isn’t as far as I know.
Recent blog posts by Andrej Karpathy at Medium.com and Pete Warden at PeteWarden.com have caused a paradigm shift in the way I think about neural nets. Instead of thinking of them as powerful machine learning tools, the authors suggest that we should think of neural nets, and in particular deep convolutional nets, as ‘self-writing programs.’ Hence the term, “Software 2.0.”
It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data than to explicitly write the program. A large portion of programmers of tomorrow do not maintain complex software repositories, write intricate programs, or analyze their running times. They collect, clean, manipulate, label, analyze and visualize data that feeds neural networks. — Andrej Karpathy, Medium.com
I found this to be a dramatic reversal in my thinking about these techniques, but it opens up a deeper understanding and is much more intuitive. The fact is that combinations of artificial neurons can be used to model any logical operation. Therefore, you can conceptualize training a neural net as searching program space for an optimal program that behaves in the way you specify. You provide the inputs and desired outputs, and the model searches for the optimal program.
This stands in contrast to the “Software 1.0” paradigm where the programmer uses her skill and experience to conceptualize the right combination of specific instructions to produce the desired behavior. While it seems certain that Software 1.0 and 2.0 will co-exist for a long time, this new way of understanding deep learning is crucial and exciting, in my opinion.
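To make the "neurons as logic, training as program search" idea above a little more concrete, here is a minimal sketch in plain Python. The first half hand-sets weights so that simple threshold neurons act as AND, OR, and NAND gates, then composes them into XOR. The second half is a toy stand-in for training: it brute-forces a small grid of weights until a single neuron reproduces a given truth table. Real networks use gradient descent rather than exhaustive search, and all of the names here (neuron, search_program, the weight grid) are hypothetical choices of mine for illustration.

```python
from itertools import product

def neuron(weights, bias, inputs):
    """Threshold unit: outputs 1 when the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# "Software 1.0" style: a human picks the weights by hand.
def AND(a, b):
    return neuron((1, 1), -1.5, (a, b))

def OR(a, b):
    return neuron((1, 1), -0.5, (a, b))

def NAND(a, b):
    return neuron((-1, -1), 1.5, (a, b))

def XOR(a, b):
    # XOR needs two layers: no single threshold neuron can compute it.
    return AND(OR(a, b), NAND(a, b))

# "Software 2.0" style: specify inputs and desired outputs, then search.
def search_program(truth_table, grid=(-1.5, -1, -0.5, 0.5, 1, 1.5)):
    """Return the first (weights, bias) on the grid matching the table, else None."""
    for w1, w2, bias in product(grid, repeat=3):
        if all(neuron((w1, w2), bias, x) == y for x, y in truth_table.items()):
            return (w1, w2), bias
    return None

if __name__ == "__main__":
    and_table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
    xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
    print("AND found:", search_program(and_table))
    print("XOR found:", search_program(xor_table))  # a single neuron is not enough
```

The search never sees the word "AND"; it only sees examples of the desired behavior. The failed XOR search also shows that the architecture you give the search (here, a single neuron) limits which programs it can ever find.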