Deep learning has proved its supremacy in the world of supervised learning, where we clearly define the tasks that need to be accomplished. But, when it comes to unsupervised learning, research using deep learning has either stalled or not even gotten off the ground!
There are a few areas of intelligence which our brain executes flawlessly, but we still do not understand how it does so. Because we don’t have an answer to the “how”, we have not made a lot of progress in these areas.
If you liked my previous article on how the functioning of the human brain can inspire machine learning algorithms that solve complex real-world problems, you will enjoy this introductory article on Hierarchical Temporal Memory (HTM). I believe this is the closest we have come to replicating the underlying principles of the human brain.
In this article, we will first look at the areas where deep learning is yet to penetrate. Then we will look at the difference between deep learning and HTM before diving deep into the concept of HTM, its workings, and its applications.
Let’s get into it.
Following is a list of the few areas where Deep Learning has a long way to go yet:
Deep learning has made some progress in each of the above six directions, but we are far from the end state. Following is what we have achieved through deep learning in each of the six fields:
You might have already realized that an Artificial Neural Network does not provide a single solution for all the above skills. And yet our brain uses a common learning algorithm that can take care of all the above attributes. Which brings me to a fundamental question.
The human brain has always been the ultimate motivation in the field of deep learning. But we have created a very mathematical formulation to replicate brain functions in the form of ANNs, and it has not changed in decades. Are we saying that our brain evaluates such complex mathematical functions without even realizing it? We are fairly sure that our brain does not learn through backpropagation of gradients (which is the fundamental principle of deep learning/ANNs).
Recurrent Neural Network (RNN) architectures are the closest we have reached to our brain in the deep learning space. You can read one of my previous articles on RNNs where I compare them to the human brain. But RNNs are supervised learning models, unlike the brain. So if deep learning is far from replicating the human brain structure, is there anyone trying to replicate brain structure and have they found any success yet?
The answer is yes! Numenta, a company founded in 2005, is solely dedicated to replicating the functioning of the human brain and using it in the space of artificial intelligence. Numenta was founded by Jeff Hawkins, the man behind the Palm Pilot. Numenta has created a framework called Hierarchical Temporal Memory (HTM) that replicates the functioning of the Neocortex, the component of our brain responsible for the real intelligence in humans. I will talk about HTM and its practical applications in this article, but first let’s do a crash course on the Neocortex.
Our brain primarily has three parts – Neocortex, Limbic system and Reptilian complex.
Simplistic view of a brain
The limbic system supports most of the emotion-linked functions, including behavior, motivation and emotional state. The Reptilian complex handles survival instincts like eating and sleeping. The Neocortex is the part of the brain that gives us the power to reason and supports other higher-order brain functions like perception, cognition, spatial reasoning, language and the generation of motor commands.
Neocortex is a mammalian development and is almost like a dinner napkin squeezed in our skull. In general, whenever we talk about “brain” or “intelligence” in colloquial terms, we are almost always referring to the Neocortex. If we look at the detailed structure of the Neocortex (below diagram; reference – HTM school videos on Youtube), you will see many sections responsible for different tasks.
An interesting fact about Neocortex is that the cellular structure throughout all these regions is almost the same, whether it be from the visual processing region or the audio processing region. This finding is extremely important as this means that the brain is trying to solve similar problems to process any kind of sensory data – visual, audio etc. These regions are logically related to each other in a hierarchical structure. We will refer to this hierarchical structure later when we cover HTM.
The sensory data is represented as simple ideas at the lower levels, and the ideas get more abstract at the higher levels. There is a parallel to this process in the deep learning space: the initial layers in neural networks detect simple ideas like edges, intermediate layers detect shapes, and final layers identify objects.
Enough of biology, let’s now get down to business and talk about HTM models. The best way to initialize your brain with what you are about to learn is by contrasting against a known concept – Deep Learning.
As you can see in the image above, the differences between these two approaches are significant. If you have used deep/machine learning before, you will know how hard it is to imagine a model that works without computing gradients. Hebbian learning is one of the oldest learning algorithms and works on an extremely simple principle: the synapse between two neurons is strengthened when the neurons on either side of the synapse (input and output) have highly correlated outputs.
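To make the Hebbian principle concrete, here is a minimal sketch in plain Python. The function name `hebbian_update`, the learning rate, and the toy activity traces are my own illustrative assumptions; this is not code from HTM or from any library.

```python
def hebbian_update(weight, pre, post, learning_rate=0.1):
    """Return the new synapse weight after one Hebbian step.

    The weight grows in proportion to the product of the input (pre)
    and output (post) activity. No gradient or error signal is used.
    """
    return weight + learning_rate * pre * post


# Toy activity traces (assumed): input A fires together with the output,
# input B fires only when the output is silent.
pre_a = [1, 0, 1, 1, 0, 1]
pre_b = [0, 1, 0, 0, 1, 0]
post = [1, 0, 1, 1, 0, 1]

w_a = w_b = 0.0
for a, b, y in zip(pre_a, pre_b, post):
    w_a = hebbian_update(w_a, a, y)
    w_b = hebbian_update(w_b, b, y)

print(w_a, w_b)  # the correlated synapse has grown, the uncorrelated one has not
```

The synapse from the correlated input ends up stronger simply because the two neurons were active together, which is all the rule asks for.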
Before diving into how HTM works, I will give you a flavor of where we can use HTM to solve real world problems. This will give you the motivation to learn more about this novel technique.
First, let’s try to nail down a few pointers on “when can we expect HTM to outperform other learning techniques?”:
If the answer to all the above questions is “yes”, HTM is the way to go. Anomaly detection is one such task, as it needs action in real time and is an unsupervised problem. Here is the general framework for anomaly detection:
Below are a few of the use cases that have already been commercially tested:
HTM works as follows (don’t get scared):
Input temporal data generated from various data sources is semantically encoded as a sparse array called a sparse distributed representation (SDR). This encoded array goes through a process called spatial pooling, which normalizes/standardizes the input data from various sources into a sparse output vector of mini-columns (columns of pyramidal neurons) of definite size and fixed sparsity. Spatial pooling is learned through Hebbian learning, with boosting of cells that have been inactive for a long time. The context of the input data is retained by an algorithm called temporal memory, which operates on the spatially pooled output.
For people who did not understand the above language at all, don’t worry! I will break it down. The key words have been highlighted in bold and need to be understood first to completely grasp HTM.
An SDR is simply an array of 0’s and 1’s. If you take a snapshot of the neurons in the brain, it is highly likely that you will see fewer than 2% of them in an active state. An SDR is a mathematical representation of these sparse signals and will typically have fewer than 2% ones. We represent an SDR as follows:
SDRs have a few important properties:
We use an encoding engine to take input from an input source and create an SDR. We need to make sure that the encoding algorithm gives us similar SDR for similar objects. This concept is very similar to embedding in the deep learning space. A lot of pre-built encoders are already available online that include numeric encoding, datetime encoding, English word encoding, etc.
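The requirement that similar inputs get similar SDRs can be sketched with a toy numeric encoder. This is an illustrative assumption of how such an encoder might look, not the code of any pre-built encoder; the names `encode_scalar` and `overlap` and all parameter values are hypothetical.

```python
def encode_scalar(value, min_val=0, max_val=100, n_bits=120, n_active=21):
    """Encode a number as a list of 0/1 bits with one contiguous run of ones.

    Nearby values get overlapping runs, so their encodings share active bits.
    """
    value = max(min_val, min(max_val, value))  # clip to the supported range
    n_buckets = n_bits - n_active + 1          # number of possible start positions
    start = int((value - min_val) / (max_val - min_val) * (n_buckets - 1))
    return [1 if start <= i < start + n_active else 0 for i in range(n_bits)]


def overlap(sdr_a, sdr_b):
    """Count the positions where both SDRs are active."""
    return sum(a & b for a, b in zip(sdr_a, sdr_b))


sdr_50 = encode_scalar(50)
sdr_52 = encode_scalar(52)
sdr_90 = encode_scalar(90)

print(overlap(sdr_50, sdr_52))  # close values share most of their active bits
print(overlap(sdr_50, sdr_90))  # distant values share none
```

The overlap count acts as the similarity measure here: the more active bits two encodings share, the more alike the encoder considers the inputs.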
Let’s say we have a simple sequence – 1,2,1,2,1,1. The sixth element breaks the sequence, i.e., it should be 2 but the actual value is 1. We will try to understand how HTM pinpoints this anomaly. The first step is semantic encoding. For the purpose of this article, I will use a dense vector as encoded SDR. In real world scenarios, these encoded vectors are extremely sparse.
Even though we have a lot of built-in encoders, you might need to create your own encoder for specific problems. I will try to give you a brief introduction of how word encoders are developed.
Spatial pooling is the process of converting the encoded SDR into a sparse array complying with two basic principles:
So the overlap of both the input and the output SDRs of two similar objects needs to be high. Let’s try to understand this with our example.
The input vector had a sparsity varying from 33% to 67%, but the spatial pooling made sure the sparsity of the output array is 33%. Also the semantics of the two possible inputs in the series are completely different from each other, and the same was maintained in the output vector. How do we use this framework to pinpoint anomalies? We will come back to this question once we cover temporal memory.
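The fixed output sparsity can be illustrated with a small sketch. I assume here that each output column listens to a random subset of input bits and that only the columns with the highest overlap stay active; the function names, sizes, and seed are hypothetical, and a real spatial pooler has more moving parts.

```python
import random


def make_columns(n_columns, n_inputs, n_synapses, seed=0):
    """Give every output column a random subset of input bits to listen to."""
    rng = random.Random(seed)
    return [rng.sample(range(n_inputs), n_synapses) for _ in range(n_columns)]


def spatial_pool(input_sdr, columns, n_active):
    """Activate the n_active columns whose connected inputs overlap the input most."""
    scores = [sum(input_sdr[i] for i in synapses) for synapses in columns]
    # Rank columns by score (ties go to the lower index) and keep the top n_active.
    winners = sorted(range(len(columns)), key=lambda c: (-scores[c], c))[:n_active]
    output = [0] * len(columns)
    for c in winners:
        output[c] = 1
    return output


columns = make_columns(n_columns=30, n_inputs=12, n_synapses=6)
sparse_input = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]  # 33% of bits active
dense_input = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # 67% of bits active

for sdr in (sparse_input, dense_input):
    pooled = spatial_pool(sdr, columns, n_active=10)
    print(sum(pooled) / len(pooled))  # same output sparsity for both inputs
```

Because exactly `n_active` columns win every time, the output sparsity does not depend on how dense the input was.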
Learning in HTM is based on a very simple principle. The synapses between an active column in the spatially pooled output array and the active cells in the encoded input are strengthened. The synapses between an active column in the spatially pooled output array and the inactive cells in the encoded input are weakened. This process is repeated again and again to learn patterns.
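The strengthen/weaken rule above might look roughly like this for a single active column. This is a sketch under my own assumptions: synapse strengths are stored as values between 0 and 1 (HTM write-ups often call them permanences), and the increment and decrement sizes are made up.

```python
def learn(permanences, input_sdr, increment=0.05, decrement=0.02):
    """Update one active column's synapse strengths for one input.

    Synapses to active input bits are strengthened, synapses to inactive
    input bits are weakened, and every value is clipped to [0, 1].
    """
    updated = []
    for perm, bit in zip(permanences, input_sdr):
        perm = perm + increment if bit else perm - decrement
        updated.append(min(1.0, max(0.0, perm)))
    return updated


permanences = [0.5, 0.5, 0.5, 0.5]
input_sdr = [1, 0, 1, 0]
for _ in range(10):  # show the same input repeatedly to reinforce the pattern
    permanences = learn(permanences, input_sdr)

print(permanences)  # synapses to active bits moved up, the others moved down
```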
Most spatial pooling processes will create exceptionally strong columns in the output array, which suppress many other columns from contributing at all. In such cases, we can multiply the connection strength between these weak columns and the encoded sequence by a boosting factor. Boosting makes sure that we use a large share of the capacity of the spatially pooled output.
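One possible way to compute a boosting factor is sketched below. The exponential form, the `strength` parameter, and the numbers are assumptions chosen for illustration; the only point is that rarely active columns receive a multiplier above 1 and very busy columns receive one below 1.

```python
import math


def boost_factors(duty_cycles, target_density, strength=2.0):
    """Return one multiplier per column based on how often it has been active."""
    return [math.exp(strength * (target_density - duty)) for duty in duty_cycles]


# Fraction of recent steps in which each column was active (made-up numbers).
duty_cycles = [0.90, 0.30, 0.05, 0.00]
raw_overlaps = [5, 4, 4, 4]

boosts = boost_factors(duty_cycles, target_density=0.30)
boosted = [o * b for o, b in zip(raw_overlaps, boosts)]

print(boosts)
print(boosted)  # the column that was never active now has the highest score
```

With these toy numbers, the dominant first column loses its lead once the scores are boosted, so previously silent columns get a chance to win and learn.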
Spatial pooling maintains the context of the input sequence by a method called temporal memory. The concept of temporal memory is based on the fact that each neuron not only gets information from lower-level neurons, but also gets contextual information from neurons at the same level. In the spatial pooling section, we represented each column in the output vector by a single number. However, each column in the output vector comprises multiple cells, each of which can individually be in an active, inactive, or predictive state.
This mechanism might be a bit complex, so let’s go back to our example. Instead of a single number per column in the spatial pooling step, I will now show all cells in the columns of the output vector.
Now let me break down the above figure for you.
At step 1, our HTM model gets an input “1” for the first time, which activates the first column of the output sequence. Because none of the cells in the first column were in predictive mode, we say column 1 “bursts” and we assign an active value to each of the cells in column 1. We will come back to how a cell is placed into a predictive state.
At step 2, our HTM model gets an input “2” again for the first time in the context of “1”, and hence, none of its cells are in predictive state so column 2 goes burst.
Same thing happens at step 3, as the model is seeing “1” in context of “2” for the first time. Note that our model has seen “1” before, but it has never seen “1” in context of “2”.
At step 4, something interesting happens. Our HTM model has seen “2” in context of “1” before, so it tries to make a prediction. (Here I have ignored the cascading context complexity to keep this article simple. Cascading context means “2” in context of “1” in context of “2” and so on. For now, just assume our model’s memory reaches back only one step.)
The method it uses to make this prediction is as follows: it asks all the cells that are currently active, i.e., those in column 1, which of the 9 cells they predict will turn active in the next time step. Say that, among (2,1), (2,2) and (2,3), the cell (2,2) has the strongest synapses with column 1, so column 1 unanimously replies (2,2). Now (2,2) is put into a predictive state before we consume the next element of the sequence. Once the next element arrives, which is actually a “2”, the prediction turns out to be right and none of the columns burst this time.
At step 5, again none of the columns burst and only (1,1) is put in active state as (1,1) had a strong synapse with (2,2).
At step 6, the HTM model is expecting a value of “2” but it gets “1”. Hence, our first column goes burst and our anomaly is detected in this sequence.
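The six steps above can be imitated with a deliberately simplified sketch. It assumes a one-step memory, collapses each column and its cells into the value it represents, and treats "predicted" as "seen before right after the previous value". The function name and data structure are hypothetical and far simpler than real temporal memory.

```python
def run_sequence(sequence):
    """Return (step, value, burst) tuples for a toy one-step sequence memory.

    `transitions` maps a value to the set of values seen right after it.
    A value that was not predicted from the previous step makes its column
    burst, and the new transition is then learned.
    """
    transitions = {}
    previous = None
    results = []
    for step, value in enumerate(sequence, start=1):
        predicted = transitions.get(previous, set())
        burst = value not in predicted
        if burst and previous is not None:
            transitions.setdefault(previous, set()).add(value)
        results.append((step, value, burst))
        previous = value
    return results


results = run_sequence([1, 2, 1, 2, 1, 1])
for step, value, burst in results:
    print(step, value, "burst" if burst else "predicted")
```

Steps 1 to 3 burst because nothing has been learned yet, steps 4 and 5 are predicted, and step 6 bursts again. A burst after the model has settled into predicting correctly is what flags the anomaly.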
The entire algorithm can be overwhelming without visual simulations. So I strongly recommend that you check out the free online videos published by Numenta that have some very cool simulations of the process I mentioned above.
Numenta Platform for Intelligent Computing (NuPIC) is a machine intelligence platform that implements the HTM learning algorithms. NuPIC is available as an importable Python library. The library is not supported by Anaconda yet. A simple implementation of HTM can be found on this link. This code by Numenta is very well documented. It starts with encoding, where you can see how numbers/dates/categories can be encoded in HTM. It then gives examples of spatial pooling and temporal memory with a working example of a predictive model. The code is self-explanatory, so I will skip this part to avoid replication of content.
One cool way to experience what HTM is capable of is to use an API provided by cortical.io. To use this API, go to this link. Here, I will show you a simple example of how you can use the API. When you go to the link, you will see the following screen:
You can try any of the tabs, as each implementation gives very clear instructions about what kind of input it expects. I will show you one tab to help you get going – “Term”. Once you click on the “Term” tab, you will see the instructions for using this tab and the output format:
All we need to enter is a term, and the API will return terms that are synonyms or associated (you can choose either) to the term. You can also choose to get the fingerprints of all these words. Here are my inputs to get synonyms of “cricket”:
Here is a sample of the response output I get:
The words are sorted by the similarity score. The top 5 words that were found similar to “cricket” were cricket, wickets, cricketers, bowling, wicket. We can also choose to get fingerprints of each of these words. Let’s pull the fingerprints of “wicket” and “wickets” and see if they are more similar to each other or to the word “cricket”.
In the above table, columns 2 and 3 are the fingerprints (indices) of the words “wicket” and “wickets”. The last column marks where an active index of “wickets” is also found in “wicket”. The overlap score comes out to be 96, which is far better than the best match of any word with the word “cricket” (obviously except the word itself). Hence, this API does a good job of mapping these words semantically as SDRs.
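The overlap computation itself is easy to sketch once the fingerprints are available as lists of active indices. The index values below are invented purely for illustration (they are not real API output), and `fingerprint_overlap` is a hypothetical helper name.

```python
def fingerprint_overlap(positions_a, positions_b):
    """Count the active positions shared by two fingerprints."""
    return len(set(positions_a) & set(positions_b))


# Made-up active indices standing in for real fingerprints.
wicket = [3, 17, 42, 108, 256, 300]
wickets = [3, 17, 42, 110, 256, 301]
cricket = [3, 20, 45, 108, 400, 512]

print(fingerprint_overlap(wicket, wickets))  # shared indices between the two forms
print(fingerprint_overlap(wicket, cricket))  # fewer shared indices
```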
Here is a snapshot of the slide from Jeff Hawkins, showing the pipeline of research:
The layers are showing the hierarchy of the cortical tissue. Most of the current research efforts have been focused on the high-order inference layer. Everything covered in this article was related to the high-order inference layer. The second layer in the diagram (labeled as 4) mainly works on sensory-motor inference. This is an important function of the brain where it collaborates between the signals from sensory organs and motor cells to create concepts.
For instance, if you move your eyes, the image they capture changes rapidly. If the brain doesn’t know what was the cause of this drastic change (which only motor cells can tell), it will fail to simplify the environment around us. However, if we combine the signals from sensory organs and motor cells, the brain can map a stable understanding of the surroundings. If we can master this skill of the brain, we can apply this skill on complex problems like image classification where we have to move our eyes across the picture to understand it in its entirety. This task is similar to what we do in Convolutional Neural Networks.
The third layer in the diagram is the capability of the brain that makes it goal-oriented, which is something similar to reinforcement learning. With this new skill you can work on complex robotics problems. The last layer is the most complex part, where we are talking about putting the entire hierarchy of concept understanding in a place that can be used for multi-sensory modalities that combine, say, visual data with audio data.
If I want to put the above paragraph in simple deep learning terms,
So who wins between ANN/deep learning and HTM? As of now, they are solving very different problems. Deep learning is very specialized for classification problems, and HTM is specialized for real-time anomaly detection problems. HTM still needs a lot of research to solve problems like image classification that deep learning can solve pretty easily. However, the underlying theory behind HTM looks promising, and you should keep this field of research on your radar.
If you have any ideas, suggestions or feedback regarding the article, do let me know in the comments below!
Hierarchical temporal memory (HTM) is a biologically constrained theory of machine intelligence originally described in the 2004 book On Intelligence[1] by Jeff Hawkins with Sandra Blakeslee. HTM is based on neuroscience and the physiology and interaction of pyramidal neurons in the neocortex of the human brain.