Monday, October 8, 2018


Deep Learning and Genomics

Deep learning at work can be seen all around us. Facebook finds and tags friends in your photos. Google DeepMind's AlphaGo beat the world's top Go players last year. Skype translates spoken conversations in real time. Behind all of these are deep learning algorithms. But to understand the role deep learning can play in the many fascinating branches of Biology, one first has to ask: what is "deep" about this learning? I will skip the definition of learning itself for the sake of brevity. The "smart" in "smart grid", "smart home" and the like was equally intriguing at first and eventually turned out to be a damp squib. Don't be surprised if "deep" eventually ends up as "smart"'s ally.

There is nothing 'deep' in deep learning in the colloquial sense of the word (many will want to jump on me for saying this and prove just why deep learning is deep, but hold on). Deep learning is simply a term for a machine learning in a way similar to how humans learn. Now here is the irony: we are still struggling to fully understand how the brain functions, yet we claim to know how deep learning should model itself on the way the brain operates! This reminds me of my PhD days in the late 90s in computer vision, the branch that deals with making machines see things as humans do. Back then, David Marr of MIT had written a seminal book, Vision, that spent a great deal of effort explaining the neuroscience behind vision and how computer models should mimic that behavior. Computer vision seemed a saturated field in the 90s, though: just how much mathematics and how many algorithms can be invented by staring at a 2D array of numbers (the pixels of an image)? But recent developments in machine learning and deep learning have brought the focus right back to computer vision. And these days, folks don't write the crazy low-level image processing algorithms I used to write back then! They just show the algorithm 10,000 images of dogs and cats; after this 'learning', the computer is handed a new, unseen image and tells you which animal it contains with incredible accuracy. Doing these tasks of learning and prediction with a model loosely based on how the brain is assumed to function, namely the neural network, led to the field of artificial neural networks (ANNs). So, generally speaking, any ANN that 'thinks' the way we believe the brain does and produces results acceptable to us is called deep learning.
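
To make the dogs-and-cats example concrete, here is a minimal sketch (my illustration, not anyone's reference implementation) of how such a classifier is typically set up today with the tf.keras API; the folder layout under data/train is hypothetical, and the class labels are inferred from the sub-folder names.

import tensorflow as tf

# Hypothetical layout: data/train/cats/*.jpg and data/train/dogs/*.jpg
train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    "data/train", image_size=(128, 128), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),               # scale pixel values to [0, 1]
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # learn low-level visual features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),   # learn higher-level features
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),     # probability of one class vs the other
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)                           # the 'learning' from labelled examples

No hand-written edge detectors or feature extractors anywhere; the network discovers the relevant features from the labelled examples on its own.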

There are two thoughts that I came across at different points in time that have shaped my professional career. One was by Jim Blinn. In his column in IEEE Computer Graphics and Applications in the 80s, he once wrote, in the context of the maturity of computer graphics at the time, that practical solutions should not necessarily be driven by theory. One should experiment first, and then use theory to explain why the best result one got should work anyway. This is the essence of machine learning and deep learning. There is data and more data. If there isn't enough, we carry out data augmentation to add more, try multiple splits of the data into training and validation sets, train multiple models, check each model's accuracy and whether it over-fits, and then choose the best one. As a practicing data scientist, I can say there is no single approach at the outset that sets the path to the required results; there is exploration and experimenting. Unfortunately, Blinn's thesis applies to deep learning only halfway, for even after one gets the desired results, there is no direct way of applying theory to explain why it should work. In fact, many researchers have dedicated their careers to figuring out why deep learning works at all, and there is no consensus. Geoffrey Hinton and a few others doggedly kept machine learning and neural networks alive during the years when the field seemed saturated; meanwhile scale became possible, and now, with multi-core CPUs and more importantly powerful GPUs (and lately TPUs), artificial neural networks yield surprisingly fast and acceptable results, without anyone quite able to explain why they work. Prof Naftali Tishby and his team have perhaps the most credible explanation to their credit. Called the "information bottleneck", it uses concepts from information theory to explain why deep learning models should work. It is a fascinating line of work, still under development, and many, including Hinton, have agreed that the information bottleneck is a genuine mathematical tool that attempts to explain neural networks in a holistic way. But at the level of a practicing deep learner today, one tries tens of models, chooses the one that gives the best results (or an ensemble), uses accuracy or some other metric to crown it the best among equals, and leaves it at that, for theory plays no further role.
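
The try-many-models workflow described above can be sketched in a few lines of scikit-learn; the data below is synthetic and the three candidate models are just illustrative stand-ins.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for whatever labelled data the problem provides.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "svm_rbf": SVC(),
}

# 5-fold cross-validation: repeated train/validation splits guard against
# over-fitting to any single split of the data.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

# Crown the model with the best mean score and leave it at that; as noted
# above, theory plays little further role.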

The second thought is from Prof Eric Lander of MIT. I took his online class "7.00x: The Secret of Life" in 2014. He has a PhD in Mathematics (information theory), got interested in Biology, and became the principal face of the Human Genome Project in 2000. In one of the classes, he said that as a student one should build the skills to learn all the tools available and later choose among them for the problem at hand, as you never know which one will be helpful when. He used his maths training to solve many tasks in the Human Genome Project. He is singularly responsible for reviving my interest in Biology. His course was a fascinating journey through biochemistry, molecular biology and genetics, followed by an overall view of genomics. Interestingly for me, the timing was right: 2014 onwards was also when machine learning and deep learning were sweeping the technology landscape, and with my fresh perspective on Biology, I decided to work on applying deep learning to genomics.

In this article, I don't intend either to use too much technical jargon or to make it look like a review article, so I will skip many details. But I will describe how I got involved in applying deep learning to genomics. Genomics is a challenging application area for deep learning, with difficulties unlike those in vision, speech or text processing: we ourselves have a limited ability to interpret genomic information, yet we expect from deep learning a superhuman intelligence that explores beyond our knowledge. Much is still in the works, and a watershed moment for the "deep genome" is not yet around the corner. In one of his classes, Prof Lander explained Huntington's disease. Huntington's is a rare neurological disease (about five in 100,000) and an unusual genetic one. Most genetic diseases are caused by recessive alleles, so people fall ill only if they inherit two copies of the disease allele, one from each parent. Huntington's is different: the allele that causes it is dominant, so receiving a single copy from either parent is enough to contract it. Most genetic diseases cause illness early in life, whereas Huntington's sets in around midlife. Prof Lander went on to explain the work of David Botstein and James Gusella, who identified a genetic marker linked to Huntington's disease on chromosome 4 through a series of laborious experiments. The idea was to use positional cloning and genetic markers (polymorphisms) to locate a gene when you don't know where to look for it. This work was carried out in 1983, long before the human genome had been sequenced.

This introduction was enough to get me initiated in genomics. After all, we are looking for the unknown most of the time, and for a change we now have a reference human genome. So the thought is: can we use markers to identify and locate a specific genetic condition? Deep learning is good at doing tedious tasks with incredible accuracy and at surfacing insight that may be humanly impossible to reach. With the computational speed now at hand, searching even the blind alleys with deep learning is powerful and may lead to insights that were never sought at the outset.
Genomic research targets the study of the genomes of different species. It examines the roles assumed by multiple genetic factors and the way they interact with the surrounding environment under different conditions. A study of Homo sapiens involves searching through approximately 3 billion base pairs of DNA, containing protein-coding genes, RNA genes, cis-regulatory elements, long-range regulatory elements and transposable elements. Where this field intersects with deep learning, the impact can reach far into medicine, pharmacy, agriculture and beyond. Deep learning can be very useful in exploring and predicting gene expression, in regulatory genomics (finding promoters and enhancers), and in studying splicing, transcription factors and RNA-binding proteins, mutations, polymorphisms and genetic variants, among others. The field is nascent, though: predictive performance on most problems has not yet reached the level required for real-world applications, nor do the interpretations of these abstract models yet elucidate much insight.
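
To give a flavour of what such a model looks like, here is a hedged sketch of the kind of network often used in regulatory genomics: a small 1D convolutional network that classifies fixed-length DNA windows (say, promoter versus non-promoter). The sequences and labels below are random placeholders, purely for illustration.

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array over the bases A, C, G, T."""
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        arr[i, lookup[base]] = 1.0
    return arr

# Placeholder training set: 1,000 random windows of 200 bp with random 0/1 labels.
seqs = ["".join(rng.choice(list("ACGT"), 200)) for _ in range(1000)]
X = np.stack([one_hot(s) for s in seqs])
y = rng.integers(0, 2, size=1000)

model = tf.keras.Sequential([
    tf.keras.layers.Conv1D(32, 12, activation="relu", input_shape=(200, 4)),  # motif-like filters
    tf.keras.layers.MaxPooling1D(4),
    tf.keras.layers.Conv1D(64, 8, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # probability of being a regulatory element
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, validation_split=0.2)

With real data, the filters of the first convolutional layer often end up resembling known sequence motifs, which is one of the few places where such models offer a little interpretability.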

As the "neural" part of Artificial Neural Network (ANN) suggests, ANNs are brain-inspired systems intended to replicate the way we humans learn. Neural networks consist of input and output layers, as well as (in most cases) one or more hidden layers of units that transform the input into something the output layer can use. Deep learning tools, inspired by real neural networks, are those algorithms that use a cascade of multiple layers of neurons, each layer serving a specific task. Each successive layer uses the output of the previous layer as its input. While I said at the outset that there is nothing 'deep' about deep learning, technically just how deep a network is depends on the number of hidden layers deployed: the more the layers, the deeper the network. These networks are excellent tools for finding patterns that are far too complex or numerous for a human programmer to extract and teach the machine to recognize. While neural networks have existed since the 1940s in the form of simple artificial neurons and, later, perceptrons, they became a serious practical tool only after the 80s, thanks to a technique called backpropagation, which allows a network to adjust its hidden layers of neurons when the outcome does not match the expected one. There are many types of neural networks. The most basic is the feedforward type; recurrent networks are more popular, and then there are convolutional neural networks, Boltzmann machines and Hopfield networks, among others. Picking the right network depends on the data one has to train it with and the specific application in mind.
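
For the curious, here is a bare-bones sketch (illustrative only, not a library-grade implementation) of a feedforward network with one hidden layer, trained by backpropagation to learn the XOR function, something a single-layer perceptron cannot do.

import numpy as np

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR is not linearly separable

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)          # input -> hidden layer of 4 units
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)          # hidden -> output layer

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(10000):
    # Forward pass: each layer transforms the previous layer's output.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error back to adjust every layer's weights.
    d_out = (out - y) * out * (1 - out)                # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)                 # gradient pushed back to the hidden layer
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

print(out.round(2))                                    # should approach [0, 1, 1, 0]

The backward pass is all backpropagation is: the chain rule applied layer by layer, so that the error seen at the output adjusts weights the output never touches directly.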

Hopefully, some day, we will be able to put all the pieces of this jigsaw puzzle together. We would then not only get good results, but also have the information bottleneck, or some other tool, explain why they should work. And with that in hand, deep learning could pave the way to deeper insights (no pun intended) into just how the brain works.
