Deep Learning and Genomics
Deep learning at work can be seen all around us. Facebook finds and tags friends in your photos. Google DeepMind's AlphaGo beat champions at the ancient game of Go last year. Skype translates spoken conversations in real time. Behind all of these are deep learning algorithms. But to understand the role deep learning can play across the fascinating branches of Biology, one first has to ask: what is "deep" in deep learning? (I will skip the definition of "learning" here for the sake of brevity.) The "smart" in "smart grid", "smart home" and the like was equally intriguing at first and eventually turned out to be a damp squib. Do not be surprised if "deep" eventually ends up as "smart's" ally.
There is nothing 'deep' in deep learning in the colloquial sense of the word (well, there will be many who may want to jump on me for saying this and try to prove just why deep learning is deep, but hold on). Deep learning is simply a term used to describe learning by a machine in a way similar to how humans learn. Now here is the paradox: we are still struggling to fully understand how the brain functions, yet we claim to know how deep learning should model itself after the way the brain operates! This reminds me of my problem in my PhD days in the late 90s in computer vision, the branch that deals with making machines see things as humans do. Back then, David Marr of MIT had written a seminal book on vision, popularly known simply as "Vision by Marr", that devoted a great deal of space to the neuroscience behind vision and to how computer models should mimic that behaviour. Computer vision seemed a saturated field in the 90s, though: just how much maths and how many algorithms can be invented by staring at a 2D array of numbers (the pixels in an image)? But recent developments in machine learning and deep learning have brought the focus right back to computer vision. These days, folks don't write the crazy low-level image processing algorithms I used to write back then. They just show the algorithm 10,000 images of dogs and cats, and after this 'learning' the computer is given another, unknown image of a dog or a cat and tells which is which with incredible accuracy. Doing these tasks of learning and prediction with a model of how the brain is assumed to function, namely the neural network, led to the development of the field of artificial neural networks (ANNs). So any ANN that thinks like the brain (or at least as we believe the brain thinks) and produces results that are acceptable to all of us is, generally speaking, called deep learning.
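To make that learn-then-predict loop concrete, here is a minimal, hypothetical sketch in Python (scikit-learn, with random numbers standing in for image pixels; the labels and features are made up for illustration only):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
# Pretend each "image" is 64 pixel intensities; label 0 = cat, 1 = dog.
cats = rng.normal(loc=0.3, scale=0.1, size=(100, 64))
dogs = rng.normal(loc=0.7, scale=0.1, size=(100, 64))
X = np.vstack([cats, dogs])
y = np.array([0] * 100 + [1] * 100)

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
clf.fit(X, y)                                    # the 'learning' step
unknown = rng.normal(loc=0.7, scale=0.1, size=(1, 64))
print("dog" if clf.predict(unknown)[0] == 1 else "cat")   # the prediction step
```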
There are two thoughts that I came across at different points in time that have shaped my professional career. One was from Jim Blinn. In his column in IEEE Computer Graphics and Applications in the 80s, he once wrote, in the context of the maturity of computer graphics at the time, that practical solutions need not be driven by theory. One should experiment first, and then use theory to explain why the best result one got should work anyway. This is the essence of machine learning and deep learning. There is data, and more data. If there isn't enough, we carry out data augmentation to add more, try multiple splits of the data into training and validation sets, fit multiple models, measure each model's accuracy and check whether it over-fits, and then choose the best model. As a practicing data scientist, I can say there is no single approach at the outset that sets the path to the required results. There is exploration and experimenting.
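A minimal sketch of this explore-and-experiment loop might look as follows (the library calls are standard scikit-learn; the candidate models, the synthetic dataset and the parameters are illustrative assumptions, not a recipe from any particular project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# A synthetic stand-in for whatever labelled data one actually has.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Try multiple train/validation splits (5-fold cross-validation) and
# keep whichever model generalises best on held-out data.
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```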
can’t be applied to deep learning here after, for even after one gets the
desired results, there is no direct way of applying theory to figure out why it
should work anyways. In fact, many researchers have dedicated their lives
figuring out why deep learning should work anyway and there is no consensus.
Geoff Hindon and a few others perilously kept the branch of machine and deep
learning alive during the years when it seemed saturated and while at the same
time, scale became possible and now with multi-core CPUs and more importantly
powerful GPUs (and now TPUs), artificial neural networks yield surprisingly
fast and acceptable results, without anyone quite able to explain why it works
anyways. Prof Naftali Tishby and his team have the most credible work to their
credit. Called “information bottleneck”, they use concepts from information
theory to explain why deep learning models should work. It is a fascinating
field and still under development and many including Hindon have agreed that
information bottleneck is a real mathematical tool that attempts to explain
neural networks in a holistic way. But at the level of a practicing deep
learner today, one tries tens of models and chooses the one that gives best results
(or chooses an ensemble) and use accuracy or any other metric to crown it as
the best among the equals and leave at that, for theory plays no further role.
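For the mathematically curious, the information bottleneck objective can be stated compactly. A representation T of the input X is chosen to be as compressed as possible while staying informative about the label Y:

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y)
```

where I(·;·) denotes mutual information and β sets the trade-off between compressing the input and preserving what matters for prediction.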
The second thought is from Prof. Eric Lander of MIT. I took his online class "The Secret of Life" (7.00x) in 2014. He has a PhD in mathematics (coding theory); he got interested in Biology and became the principal face of the Human Genome Project in 2000. In one of the classes he said that, as a student, one should build the skills to learn all the tools available and later choose among them for the problem at hand, as you never know which one will be helpful when. He used his maths training to solve many tasks in the Human Genome Project. He is singularly responsible for reviving my interest in Biology. His course was a fascinating time travel through biochemistry, molecular biology and genetics, followed by an overall view of genomics. Interestingly for me, the timing was right: 2014 onwards was also the time when machine learning and deep learning were sweeping the technology landscape, and with my fresh perspective on Biology, I decided to work on applying deep learning to genomics.
In this article I don't intend to use too much technical jargon, nor to make it look like a review article, so I will skip many details. But I will say how I got involved in using deep learning with genomics. Genomics is a challenging application area for deep learning, with difficulties unlike those in vision, speech and text processing: we ourselves have a limited ability to interpret genomic information, yet we expect from deep learning a superhuman intelligence that explores beyond our knowledge. Much is still in the works, and a watershed revolution in the "deep genome" is not yet around the corner. In one of the classes, Prof. Lander explained Huntington's disease, a rare neurological disease (about five in 100,000). It is an unusual genetic disease. Most genetic diseases are caused by recessive alleles, and people fall ill only if they get two copies of the disease allele, one from each parent. Huntington's disease is different: the allele that causes it is dominant, and people need to receive only one copy, from either parent, to contract it. Most genetic diseases cause illness early in life, whereas Huntington's sets in around midlife. Prof. Lander went on to explain the work of David Botstein and James Gusella, who identified the genetic marker linked to Huntington's disease on chromosome 4 through a series of laborious experiments. The idea was to use positional cloning and genetic markers (polymorphisms) to locate a gene when you don't know where to look for it. This work was carried out in 1983, when no human genome had been sequenced.
This introduction was good enough to get me initiated into genomics. After all, we are looking for the unknown most of the time, and for a change we now have a human genome. So the thought is: can we use markers to identify and locate a specific genetic condition? Deep learning is good at doing boring tasks with incredible accuracy and at bringing insights that may be humanly impossible. With the computational speed now at hand, searching blind alleys with deep learning is incredibly powerful and may lead to insights never intended at the beginning.
Genomic research targets the study of the genomes of different species. It studies the roles assumed by multiple genetic factors and the way they interact with the surrounding environment under different conditions. A study of Homo sapiens involves searching through approximately 3 billion base pairs of DNA, containing protein-coding genes, RNA genes, cis-regulatory elements, long-range regulatory elements and transposable elements. Where this field intersects deep learning, it has far-reaching impact on medicine, pharmacy, agriculture and more. Deep learning can be very useful in exploring gene expression, including its prediction; in regulatory genomics (i.e. finding promoters and enhancers); in splicing; in modelling transcription factors and RNA-binding proteins; and in studying mutations/polymorphisms and genetic variants, among others. The field is nascent, though: the predictive performance on most problems has not yet reached the level expected for real-world applications, nor do the interpretations of these abstract models yet elucidate much insightful knowledge.
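As a flavour of how sequence data enters such models, here is a minimal sketch, assuming the standard one-hot representation commonly used by convolutional models in regulatory genomics (the sequence itself is a made-up fragment):

```python
import numpy as np

def one_hot_encode(seq):
    """One-hot encode a DNA string into a (length, 4) array.

    Columns correspond to A, C, G, T; any other symbol (e.g. N)
    becomes an all-zero row.
    """
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in mapping:
            encoded[i, mapping[base]] = 1.0
    return encoded

# A toy, hypothetical promoter-like fragment:
x = one_hot_encode("ATGCGTANCA")
print(x.shape)  # (10, 4): ready to feed a 1-D convolutional network
```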
As the "neural" in Artificial Neural Network (ANN) suggests, ANNs are brain-inspired systems intended to replicate the way we humans learn. Neural networks consist of input and output layers, as well as (in most cases) one or more hidden layers of units that transform the input into something the output layer can use. Deep learning tools, inspired by real neural networks, are algorithms that use a cascade of multiple layers of neurons, each layer serving a specific task and each successive layer using the output of the previous one as its input. While I did say at the outset that there is nothing 'deep' about deep learning, technically one can say that just how deep a network is depends on the number of hidden layers deployed: the more the layers, the deeper the network. These networks are excellent tools for finding patterns far too complex or numerous for a human programmer to extract and teach a machine to recognise. While neural networks have existed since the 1940s, starting with McCulloch-Pitts neurons and later perceptrons, they became a serious tool only after the 80s, thanks to a technique called backpropagation, which allows a network to adjust its hidden layers of neurons when the outcome does not match the expected one. There are many types of neural networks. The most basic is the feedforward type; more popular is the recurrent type; and then there are convolutional neural networks, Boltzmann machines and Hopfield networks, among others. Picking the right network depends on the data one has to train it with and the specific application in mind.
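Here is a minimal sketch of such a network: one hidden layer, trained by backpropagation on the classic XOR problem (the layer sizes, learning rate and iteration count are illustrative choices, not from any particular source):

```python
import numpy as np

# XOR: the textbook task a single-layer network cannot solve.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8))   # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))   # hidden -> output weights
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                       # learning rate

for step in range(5000):
    # Forward pass: each layer feeds the next.
    h = sigmoid(X @ W1 + b1)            # hidden layer activations
    out = sigmoid(h @ W2 + b2)          # network prediction

    # Backward pass: propagate the error back through the layers.
    d_out = (out - y) * out * (1 - out)     # gradient at the output
    d_h = (d_out @ W2.T) * h * (1 - h)      # gradient at the hidden layer

    W2 -= lr * (h.T @ d_out)
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * (X.T @ d_h)
    b1 -= lr * d_h.sum(axis=0)

print(out.round(3).ravel())  # should approach [0, 1, 1, 0]
```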
Hopefully, some day, we will be able to place all the pieces of the jigsaw puzzle together. We would then not only get good results, but also have the information bottleneck, or some other tool, explain why they should work anyway. And hopefully, with that much substantiated, deep learning could pave the way to deeper insights (no pun intended) into just how the brain works.